[go: up one dir, main page]

WO2022125686A1 - Precision engineering of a cellular sense-and-response function - Google Patents

Precision engineering of a cellular sense-and-response function Download PDF

Info

Publication number
WO2022125686A1
WO2022125686A1 PCT/US2021/062442 US2021062442W WO2022125686A1 WO 2022125686 A1 WO2022125686 A1 WO 2022125686A1 US 2021062442 W US2021062442 W US 2021062442W WO 2022125686 A1 WO2022125686 A1 WO 2022125686A1
Authority
WO
WIPO (PCT)
Prior art keywords
variant
library
variants
dna
barcoded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2021/062442
Other languages
French (fr)
Inventor
David Judson Ross
Drew Scott TACK
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
United States Department of Commerce
Original Assignee
United States Department of Commerce
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by United States Department of Commerce filed Critical United States Department of Commerce
Publication of WO2022125686A1 publication Critical patent/WO2022125686A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12NMICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09Recombinant DNA-technology
    • C12N15/10Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034Isolating an individual clone by screening libraries
    • C12N15/1065Preparation or screening of tagged libraries, e.g. tagged microorganisms by STM-mutagenesis, tagged polynucleotides, gene tags

Definitions

  • a process for determining a coding DNA sequence for a specific dose-response curve for a genetic sensor comprising: producing a coding DNA sequence library comprising a plurality of mutated variants of the coding DNA sequence (CDS) for the genetic sensor, such that each mutated variant comprises a mutated coding DNA sequence of the coding DNA sequence; attaching a separate DNA barcode to each of the mutated variants in the coding DNA sequence library to produce a barcoded variant library of barcoded mutant variants, wherein each barcoded mutant variant in the barcoded variant library corresponds to one of the mutated variants having an attached DNA barcode; determining the DNA sequence of each barcoded mutant variant in the barcoded variant library; transforming the barcoded mutant variants in the barcoded variant library into cells; growing the cells in various cell cultures comprising different values of an input signal of the genetic sensor; extracting DNA from the cells in the cell cultures; determining the number of each DNA barcode
  • FIG. 1 shows details for a library-scale allosteric genotype-phenotype landscape measurement.
  • A library of lac repressor (Lacl ) variants was generated by random mutagenesis of the lacl coding DNA sequence (CDS). The CDS for each variant was attached to a DNA barcode and inserted into a plasmid where the Lacl variant regulated expression of a tetracycline resistance gene. The CDS and corresponding barcode on each plasmid were determined with long-read sequencing.
  • B The library was transformed into Escherichia coli.
  • FIG. 2 shows data for an accuracy of the library-scale dose-response curve measurement.
  • A-D The plots compare the results from the library-scale measurement (y-axis) with the flow cytometry verification results (x-axis) for each Hill equation parameter. Data are shown for all of the verified Lacl variants with sigmoidal dose-response curves (i.e. , band-stop and band-pass variants are not included). Data for different variants are plotted with different combinations of color and shape. Variants that occurred more than once in the library (with different DNA barcodes) are plotted multiple times. For example, the wild type (dark gray “X” symbols) is plotted 53 times.
  • the accuracy for each Hill equation parameter is 4-fold for G 0 (A), 1.5-fold for G ⁇ (B), 1 .8-fold for EC 50 (C), and ⁇ 0.28 for n (D).
  • the accuracy is calculated as: exp(RMSE(ln(x)))expRMSEInx, where RMSE(ln(x))RMSEInx is the root-mean-square difference between the logarithm of each parameter from the library-scale and cytometry measurements.
  • n the accuracy is given simply as the root-mean-square difference between the library-scale and cytometry results.
  • the inverse-variance-weighted coefficient of determination (R 2 ) for each Hill equation parameter is: 0.83 for G 0 (A), 0.55 for G ⁇ (B), 0.86 for EC 50 (C), and -0.04 for n (D).
  • the variance of the posterior distribution from the Bayesian inference was used for weighting.
  • the contribution from the wild-type observations were weighted by a factor of 1/53 to avoid bias from multiple observations.
  • points indicate the median and error bars indicate ⁇ one standard deviation from the Bayesian posterior.
  • Data are from a single library-scale measurement, and a single flow cytometry measurement for each Lacl variant at each IPTG concentration.
  • A-C Protein structures showing the locations of amino acid substitutions that affect each Hill equation parameter G 0 (A), G ⁇ (B), EC 50 (C).
  • the operator-binding structure is shown on the left (operator DNA in light orange, PDB ID: 1 LBG (Lewis et al, 1996)) and the ligand-binding structure is shown on the right (IPTG in cyan, PDB ID: 1 LBH (Lewis etal, 1996)). Both structures are shown with the view oriented along the protein dimer interface, with one monomer in light gray and the other monomer in dark gray.
  • Colored spheres highlight residues where substitutions cause a greater than 5-fold change in the Hill equation parameter relative to wild-type Lacl. Red spheres indicate residues where substitutions increase the parameter, and blue spheres indicate residues where substitutions decrease the parameter. At three residues (A82, 183, and F161), some substitutions decrease EC 50 , while other substitutions increase EC 50 (violet spheres in C).
  • D-F Scatter plots showing the effect of each substitutions as a function of position. Substitutions that change the parameter by less than 5-fold are shown as gray points. Substitutions that change the parameter by more than 5-fold are shown as red or blue points with error bars. Histograms to the right of each scatter plot show the overall distribution of single- substitution effects.
  • FIG. 4 shows effects of amino acid substitutions on EC50 of Lacl are log-additive.
  • the log-additive EC 50 for double-substitution Lacl variants i.e., two amino acid substitutions
  • log( EC 50, AB / EC 50, wt ) log(EC 50, A l EC 50, wt ) + log(EC 50,B / EC 50, wt ), where “wt” indicates the wild type, “A” and “B” indicate the single-substitution variants, and “AB” indicates the double-substitution variant.
  • the measured EC 50 of double-substitution variants is from the library-scale measurement. Orange points mark double-substitution variants in which one of the single substitutions causes a greater than 2.5-fold change in EC 50 . Dark red points mark double-substitution variants in which both single substitutions cause a greater than 2.5-fold change in EC 50 .
  • the EC 50 of wild-type Lacl is marked with a black “X”. For this analysis, only experimental data were used (no results from the DNN model). Also, only data from Lacl variants with low EC 50 uncertainty were used (SD(log 10 (EC 50 )) ⁇ 0.35). Points show the best consensus estimate for the parameter values as described in the Materials and Methods.
  • Error bars for the measured result indicate ⁇ one standard deviation estimated from the Bayesian posteriors; error bars for the log-additive result indicate ⁇ one standard deviation propagated from the Bayesian posterior uncertainties of the single-substitution results.
  • Data are from a single library-scale measurement.
  • FIG. 5 shows structural and genetic diversity of inverted and band- stop genotypes.
  • A, B Protein structures showing the locations of amino acid substitutions associated with strongly inverted (A) and strong band-stop (B) phenotypes.
  • C For strong band-stop variants (B), helix 9 is shaded blue, and p-strand J is shaded violet.
  • C, D Network diagrams showing relatedness among genotypes for strongly inverted (C) and strong band-stop (D) variants.
  • larger polygonal nodes represent Lacl variants, with a colormap indicating the G 0 /G ⁇ or G 0 /G min ratio (see FIG. 1 E).
  • the number of sides of the polygon indicates the number of amino acid substitutions relative to the wild type, and bold outlines indicate variants that were verified with flow cytometry.
  • Smaller circular nodes represent specific amino acid substitutions, with connecting lines showing the substitutions for each variant.
  • FIG. 6 shows that a band-stop phenotype emerges from combinations of nearly silent amino acid substitutions.
  • the blue circles plotted with error bars show the effects of substitutions associated with the strongly inverted phenotype and the orange triangles plotted with error bars show the effects of substitutions associated with the strong band-stop phenotype.
  • Most substitutions associated with the inverted phenotype cause a large shift in either EC 50 , G», or both, consistent with the biophysical requirements for inverting the dose-response curve.
  • most of the amino acid substitutions associated with the band-stop phenotype are nearly silent.
  • Light blue circles and light orange triangles show the effects for all amino acid substitutions found in the sets of strongly inverted and strong band-stop variants, respectively. Dashed gray lines mark the wild-type parameter values.
  • Plotted data includes a combination of direct experimental measurements and DNN model predictions.
  • FIG. 7 shows plasmid maps used for the library-scale and verification measurements. Both plasmids contain the p15A origin of replication (p15A), kanamycin resistance gene (KanR), and lacl coding DNA sequence (lacl). In both plasmids, the encoded Lacl protein transcriptionally regulates an output that was driven by the Ptacl promoter, lacO operator, and RiboJ transcriptional insulator.
  • the library plasmid (pTY 1 ) was used for measurement of the genotype and phenotype of the entire Lacl library.
  • the Lacl library including the coding DNA sequences of lacl variants and their corresponding barcodes, was cloned into pTY1.
  • pTY1 the Lacl variants regulated the expression of a tetracycline resistance gene, tetA.
  • Plasmid pTY1 also encoded yellow fluorescence protein (YFP), which was used for cloning purposes.
  • pVER The verification plasmid was used to verify the dose-response curves of over 100 variants from the library.
  • the coding DNA sequence for that variant was chemically synthesized and cloned into pVER, where the Lacl variant regulated the expression of YFP. The dose-response curve was then measured using flow cytometry.
  • FIG. 8 shows a comparison of fluorescence distribution of the Lacl library before and after fluorescence activated cell sorting (FACS) sorting.
  • FACS fluorescence activated cell sorting
  • FIG. 9 shows a diversity of phenotypes in the Lacl library.
  • A Relative abundance of the various Lacl phenotypes found in the library.
  • Variants with a “normal response” phenotype have dose-response curves qualitatively similar to the wild-type, with GO ⁇ G ⁇ Normal response variants include variants with EC50 lower than the wild-type value (between approximately 1 ⁇ mol/L and 100 ⁇ mol/L) and variants with EC50 higher than the wild-type value (between approximately 100 ⁇ mol/L and 2000 pmoi/L).
  • Variants with a “flat response” phenotype have flat dose-response curves with Fiat response variants include always-off variants (G(0) ⁇ 0.25 x G ⁇ ,wt; i.e., the IS phenotype from Markiewicz et al., Journal of Molecular Biology, 240, 421-433, 1994) and always-on variants (G(0) > 0.25 x G ⁇ ,wt; i.e. the I- phenotype from Markiewicz et al.).
  • Variants with a “negative response” phenotype have dose-response curves with a negative slope, for some portion of the measured concentration range.
  • Negative response variants include band-stop, inverted, and band-pass variants.
  • FIG. 10 shows amino acid substitution count heat map.
  • the heat map indicates the number of times each possible amino acid substitution was observed across the entire Lacl library, with the residue number as the x-axis and the possible amino acid substitutions as the y-axis.
  • the wild-type amino acid at each residue is marked with an ‘X’.
  • SNP-accessibie substitutions (possible via a single DNA base change per codon) are outlined with black squares. Substitutions that were not observed for any variant in the library are shown as white. Substitutions at residues 187 and 188 are under-represented because those codons contained a restriction site used during library and plasmid assembly. Most of the other SNP-accessible substitutions that were not observed were probably excluded by FACS selection during library preparation (see FIG. 8).
  • FIG. 11 shows example data for fitness and dose-response curves for normal phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant.
  • the fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points.
  • the black lines show the results from the library-scale measurement using the Bayesian Hill equation model.
  • the purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals.
  • GP Bayesian Gaussian process
  • the plotted purple points show the dose-response curves from the flow cytometry verification measurements.
  • Error bars indicate ⁇ one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers.
  • FIG. 12 shows example data for fitness and dose-response curves for inverted phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant.
  • the fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points.
  • the black lines show the results from the library-scale measurement using the Bayesian Hill equation model.
  • the purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals.
  • GP Bayesian Gaussian process
  • the plotted purple points show the dose-response curves from the flow cytometry verification measurements.
  • Error bars indicate ⁇ one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers.
  • FIG. 13 shows example data for fitness and dose-response curves for band-stop phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant.
  • the fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points.
  • the black lines show the results from the library-scale measurement using the Bayesian Hill equation model.
  • the purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals.
  • GP Bayesian Gaussian process
  • FIG. 14 shows example data for fitness and dose-response curves for band-pass phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant.
  • the fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points.
  • the black lines show the results from the library-scale measurement using the Bayesian Hill equation model.
  • the purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals.
  • the plotted purple points show the dose-response curves from the flow cytometry verification measurements. Error bars indicate ⁇ one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers.
  • FIG. 15 shows distributions of Hill equation parameters measured for the full library and the wild-type variants.
  • the distribution from the full Lacl library is shown in blue
  • the distribution from variants with the wild-type Lacl DNA coding sequence is shown in green
  • the distribution from variants with synonymous mutations is shown in orange.
  • the green and orange data are only shown for variants with zero unintended mutations in the plasmid (i.e. no mutations in regions of the plasmid other than the lacl coding DNA sequence).
  • the Kolmogorov-Smirnov test was used to compare the distributions of Hill equation parameters between variants with the wild-type DNA sequence (green, 39 variants) and synonymous mutations (orange, 310 variants). The resulting p-values (0.71 , 0.40, 0.28, and 0.17 for GO, G M , EC50, and n respectively, Kolmogorov-Smirnov test) indicate that there were no significant differences between them.
  • FIG. 16 shows calibration data for the determination of dose- response curves using barcode sequencing.
  • A Dose-response curves measured with flow cytometry for nine Lacl variants used to calibrate library-scale measurement.
  • B-C The fitness impact of tetracycline (from the library-scale measurement) plotted vs. gene expression output (from flow cytometry of individual variants). The fitness impact of tetracycline is defined as the decrease in fitness (E. coll growth rate) measured with tetracycline vs. without tetracycline normalized by the fitness measured without tetracycline, i.e. ( ⁇ tet / ⁇ 0 - 1 ) from Eq. (9).
  • (B) Data from a small-scale test library containing only the calibration variants.
  • (C) Data from measurement of the full library. Results for different calibration variants are shown with different colored symbols. The solid black lines in show the results of a fit using Eq. (9). Error bars indicate ⁇ one standard deviation estimated from the least-squares fit. The results for the full library measurement are noisier than the small-scale test because of the lower read count per variant (approximately 100-fold difference), but the general trend and the fits for both results are similar.
  • FIG. 17 shows gene expression measurement uncertainty for library- scale measurement of Lacl dose-response curves.
  • the bottom plot shows the uncertainty vs. the median for the gene expression values, G(L), estimated using the Gaussian process (GP) inference model (Materials and Methods).
  • the uncertainty is the posterior standard deviation of log10(G(L)) from the Bayesian inference.
  • the color map shows the relative density of data for all of the gene expression levels from the library-scale measurements (i.e. for every variant at every IPTG concentration).
  • the uncertainty is relatively high for G(L) ⁇ 104 because the fitness impact of tetracycline used for the measurement (top plot, taken from Appendix Figure S10c) is approximately constant over that range.
  • the dark red dashed line in the bottom plot shows the inverse slope of the fitness impact curve.
  • G(L) between about 7,000 MEF and 25,000 MEF, the uncertainty is approximately proportional to the inverse slope, as expected. Outside that range, the uncertainly is no longer proportional to the inverse slope of the fitness impact curve because the posterior G(L) estimate is constrained by the prior used in the Bayesian inference to the range 100 MEF ⁇ G(L) ⁇ 50,000 MEF.
  • FIG. 18 shows a deep neural network (DNN) model.
  • DNN deep neural network
  • the recurrent DNN model (green) outperforms both the linear-additive (orange) and feed-forward (blue) models, giving the highest R 2 values for the Hill equation parameters GO and G M .
  • the recurrent DNN model and the linear-additive model give similar R 2 values.
  • FIG. 19 shows a prediction of Lacl variant phenotypes with recurrent deep neural network (DNN) model. Measured vs. predicted values for each Hill equation parameter. The plotted results only show holdout data not used in model training. Predictions are taken from the variational posterior mean for each Hill equation parameter. Measured values are the posterior medians from the Bayesian fits to the experimental data. (A-B) GO and are given in molecules of equivalent fluorophore (MEF) based on the calibration with flow cytometry data.
  • DNN deep neural network
  • FIG. 20 shows a DNN model prediction uncertainty and root-mean- square error (RMSE).
  • A Model prediction uncertainty for the base-ten logarithm of each Hill equation parameter as a function of the number of amino acid substitutions relative to the wild-type Lacl sequence. The boxplot shows the distribution of posterior standard deviation values for the set of variants with the indicated number of substitutions.
  • B Model RMSE for each Hill equation parameter as a function of the number of substitutions relative to the wild-type sequence. RMSE is the root-mean- square difference between the model prediction and experimental measurement for the base-ten logarithm of each Hill equation parameter. The boxplot shows the distribution of RMSE values from the ten cross-validation tests for the recurrent DNN model.
  • Solid lines with points show the RMSE forthe holdout data not used fortraining. Both the prediction uncertainty (A) and the RMSE (B) increase with increasing number of substitutions relative to the wild-type sequence.
  • the RMSE values are from the holdout data, i.e. the solid lines with points from (B).
  • the median prediction accuracy values are from the distributions plotted in (A).
  • the dashed line indicates the multiplicative factor (approximately 3.8) used to correct the uncertainty values.
  • FIG. 21 shows a comparison between the DNN model root- mean- square error (RMSE) and the measurement uncertainty for single mutants.
  • the histograms show distributions of the measurement uncertainty for Lacl variants with single amino acid substitutions (relative to the wild type) for each Hill equation parameter.
  • the dashed line in each plot shows the RMSE for the DNN model predictions for single-substitution hold-out data.
  • the model RMSE as a function of the number of substitutions is shown in FIG. 20B.
  • FIG. 22 shows measured vs. predicted values from DNN model for single amino acid substitutions.
  • the plots compare the DNN model predictions (x- axis) with the measured values (y-axis) for the Hill equation parameters for Lacl variants with single amino acid substitutions.
  • Blue symbols show data for ail of the single-substitution variants in the library.
  • Orange symbols show data for variants with a high uncertainty for the measured EC50 (std(log10(EC50)) > 0.35).
  • the EC50 uncertainty forthose variants was relatively high either because G ⁇ was similar to GO and/or because EC50 was near or above the maximum ligand concentration used (2048 ⁇ mol/L).
  • FIG. 23 shows a heat map of single substitution effects on GO.
  • the heat map indicates the change in GO for all amino acid substitutions measured as single substitutions in the Lacl library.
  • the amino acid residue number is plotted as the x-axis and the possible amino acid substitutions as the y-axis.
  • the wild-type amino acid at each residue is marked with an ‘X’.
  • the 83 substitutions that are completely missing from the landscape dataset and that have been shown by previous work to result in constitutively high G(L)1 ,2 are each marked with a
  • FIG. 24 shows a heat map of single substitution effects on G ⁇ .
  • the heat map indicates the change in G ⁇ for all amino acid substitutions measured as single substitutions in the Lacl library and/or predicted from the DNN model.
  • the amino acid residue number is plotted as the x-axis and the possible amino acid substitutions as the y-axis.
  • the wild-type amino acid at each residue is marked with an ‘X’.
  • Substitutions with effects that are predicted from the DNN model are outlined with solid lines.
  • Substitutions with high measurement uncertainty, where the DNN model result was used in place of the experimental result are outlined with dashed lines (see FIG. 22).
  • FIG. 25 shows a heat map of single substitution effects on EC 50 .
  • the heat map indicates the change in EC 50 for all amino acid substitutions measured as single substitutions in the Lacl library and/or predicted from the DNN model.
  • the amino acid residue number is plotted as the x-axis and the possible amino acid substitutions as the y-axis.
  • the wild-type amino acid at each residue is marked with an ‘X’.
  • Substitutions with effects that are predicted from the DNN model are outlined with solid lines.
  • Substitutions with high measurement uncertainty, where the DNN model result was used in place of the experimental result are outlined with dashed lines (see FIG. 22).
  • FIG. 26 shows a multiparametric impact of amino acid substitutions on allosteric function of Lacl.
  • Each plot shows the joint effect of single amino acid substitutions on two Hill equation parameters.
  • substitutions that change both Hill equation parameters by less than 5-fold are shown as light gray points
  • substitutions that change one or both Hill equation parameters by more than 5-fold are shown as red or blue points with error bars.
  • red indicates a decrease in Hill equation parameter and blue indicates an increase.
  • the left half of each symbol and the y-error bar are colored based on the y-axis parameter; the right half of each symbol and the x-error bars are colored based on the x-axis parameter.
  • the wild-type phenotype is indicated with a black ‘X’ in each plot.
  • GO and are given in molecules of equivalent fluorophore (MEF). Error bars indicate ⁇ one standard deviation estimated from the Bayesian posterior.
  • FIG. 27 shows a comparison between log-additive prediction and measurement for double-substitution variants for GO and G ⁇ .
  • the log-additive GO and G® 5 for double-substitution Lacl variants i.e. two amino acid substitutions
  • FIG. 28 shows nearest neighbor distance histograms for inverted and band-stop Lacl variants.
  • the orange bars show the distribution of nearest neighbor Hamming distance for the amino acid sequences for strongly inverted (A) and strong band-stop (B) variants
  • the blue bars show the distribution of nearest neighbor Hamming distance for a similar number of randomly selected sequences from the full library.
  • the full-library histograms (blue bars) are averaged over 1000 iterations of randomly selected sequences.
  • FIG. 29 shows that the band-stop phenotype emerges from the combination of two amino acid substitutions.
  • A Dose-response curves measured with flow cytometry for wild-type Lacl and a strong band-stop variant identified from the library with only three amino acid substitutions (R195H, G265D, A337D).
  • B-D Dose-response curves for Lacl variants containing single- and double-substitution permutations of the three substitutions in the band-stop variant. Each plot shows two Lacl variants containing single substitutions and one Lacl variant containing both substitutions.
  • FIG. 30 shows data for a measurement of OD 600 vs. time for E. coll growth during initial, 12-hour incubation in automated culture protocol. Measurements are shown for all 96 wells in the growth plate during the library-scale measurement of Lacl dose-response curves. The OD600 measurement is not corrected for the zero- OD offset caused by the membrane (approximately 1.7).
  • FIG. 31 shows data for a measurement of OD 600 vs. time for E. coll in Growth Plates 1-4. Measurements are shown for all 96 wells in each growth plate during the library-scale measurement of Lacl dose-response curves.
  • A Growth Plate 1.
  • B Growth Plate 2.
  • C Growth Plate 3.
  • D Growth Plate 4.
  • the OD 600 measurements are not corrected for the zero-OD offset caused by the membrane (approximately 1.7). Note that the flattening of the growth curves at time ⁇ 2.4 hours in (A) and (B), is not the onset of stationary phase. Instead, it is an indication of a diauxic shift which was used as a marker for the maximum cell density at which the cells remained in exponential growth phase.
  • An example set of full growth curves is shown in FIG. 32.
  • FIG. 32 shows an example of OD 600 vs. time measurement for a full E. coll growth curve.
  • the inset shows a zoomed-in view of the short flat portion of the growth curves indicating the onset of a diauxic shift. This diauxic shift was used as a marker for the maximum ceil density at which cells remained in exponential growth.
  • FIG. 33 shows a flow cytometry gating example.
  • A Side scatter vs. forward scatter plot before automated cell gating, showing both cell and non-cell detection events.
  • B Side scatter vs. forward scatter plot after automated cell gating, showing only events most likely to be cell events.
  • C Side scatter vs. side scatter area/height plot before automated singlet gating, showing both singlet and multiplet cell detection events.
  • D Side scatter vs. side scatter area/height plot after automated singlet gating, showing only singlet cell detection events.
  • FIG. 34 lists a number of measured genotypes in library for different mutational distances from wild-type lack A characterization of the diversity found within the library in terms of in terms of single nucleotide polymorphisms of the wildtype lacl sequence.
  • FIG. 35 lists Lacl protein structural domains and features. Residues within 7 A of the ligand were determined using the crystal structure of Lacl bound to IPTG (PDB ID: 1 LBH). The dimer interface was taken as described by Lewis etal. plus residues 117 and 247-249, which are also along the dimer interface based on the published crystal structures (PDB IDs: 1 LBG and 1 LBH). The core pivot region was taken as described by Swint-Kruse et al.
  • FIG. 36 lists amino acid substitutions and structural domains or features associated with the inverted Lacl phenotype.
  • the top portion of the table lists the amino acid substitutions that occur more frequently in the set of strongly inverted Lacl variants than in the full library, with a p-value cutoff of 5 x 10 -3 .
  • the bottom portion of the table lists the structural domains or features where substitutions were found at higher frequency in the set of strongly inverted Lacl variants than in the full library, with a p-value cutoff of 2.5 x 10 -2 .
  • the criteria used to select the set of strongly inverted variants and the hypergeometric test used to calculate p-values are described in the Example.
  • FIG. 37 lists amino acid substitutions and structural domains or features associated with the band-stop Lacl phenotype.
  • the top portion of the table lists the amino acid substitutions that occur more frequently in the set of strong band- stop Lad variants than in the full iibrary, with a p-value cutoff of 5 x 10 -2 .
  • the bottom portion of the table lists the structural domains or features where substitutions were found at higher frequency in the set of strong band-stop Lad variants than in the full library, with a p-vaiue cutoff of 2.5 x 10 -2 .
  • the criteria used to select the set of strong band-stop variants and the hypergeometric test used to calculate p-values are described in the Example.
  • FIG. 38 lists cell growth in the 24 chemical environments used for library-scale measurement of Lacl dose-response curves. Cultures were grown at 37 °C with 0.5 mL culture volume per well and double-orbital shaking at 807 cycles per minute. The final OD 600 values were calculated by subtracting the offset from the gas- permeable membrane used during growth and are given as the mean ⁇ standard deviation across the four wells used for each chemical environment. The time intervals between the starts of the incubation cycles for sequential growth plates was 10043 s, 10045 s, and 10038 s. *Tetracycline was only added to Growth Plates 2-4.
  • FIG. 39 lists regions extracted from long-read sequencing data for library-scale analysis.
  • the lacl region includes the lacl coding DNA sequence (CDS) and stop codon.
  • the regulatory region includes the Placl and Ptacl promoters, the lacO operator, the riboJ insulator, and the RBS sites for both lacl and tetA.
  • the tetA region includes the tetA CDS.
  • the YFP region includes the YFP CDS and its RBS.
  • the KAN region includes the transcriptional promoter and CDS for the kanamycin resistance gene as well as the transcriptional terminator for the genes regulated by Lacl (T L3S3P21 ).
  • the origin of replication was split by the start of the PacBio circular consensus HiFi reads, so the last 32 bases of the origin of replication were grouped with the intergenic region between the origin and the barcodes.
  • the start and end columns in the table give the nucleotide position of the start and end of each region based on the nominal full plasmid (GenBank ID: MT702633).
  • the mutation rate column lists the estimated mean number of single nucleotide changes per 1000 bases for each region, using the nominal/wild-type sequence as the reference, and considers only single nucleotide polymorphisms (i.e. no indels). Regions adjacent to the barcodes and the lacl region have elevated mutation rates, due to the methods used for library generation and assembly.
  • FIG. 40 lists PCR primers used during Lad library construction Sequences of primers used for PCRs during library construction.
  • FIG. 41 lists forward index primers for barcode amplification.
  • Forward primer sequences used to amplify DNA barcodes from the plasmids.
  • Underlined section anneals to plasmid.
  • Bold section is the multiplexing tag sequence.
  • N’s are randomized positions used as unique molecular identifiers to correct for PCR jackpotting.
  • FIG. 42 lists reverse index primers for barcode amplification.
  • Reverse primer sequences used to amplify DNA barcodes from the plasmids.
  • Underlined section anneals to plasmid.
  • Bold section is the multiplexing tag sequence.
  • N’s are randomized positions used as unique molecular identifiers to correct for PCR jackpotting.
  • FIG. 43 lists primers sequences for the second PCR. Primers used in the second PCR (15-cycle) to attach the standard Illumina paired-end adapter sequences and to amplify the resulting amplicons for sequencing. The underlined section anneals to the barcode amplification primers.
  • Biotechniques are unparalleled in their ability to synthesize polypeptides of enormous sequence diversity from 20 natural amino acid building blocks.
  • a polypeptide of 100 amino acids has 20 100 possible combinations of mutations.
  • Theory predicts 10 60 possible small molecules in chemical space. These numbers are too large to explore by conventional drug discovery approaches.
  • Protein engineering is highly complex, making the incorporation of new chemical moieties into polypeptides difficult.
  • directed evolution is promising but limited in scope.
  • Amino acids and non-canonical or non-natural amino acids can be used for protein synthesis, which raises the complexity of protein engineering, and many conventional methods have difficulty selectively incorporating different non- natural amino acids.
  • Some conventional methods used in protein engineering have technical problems that include instability of linkers, post-translational labeling of ribosome-displayed libraries produced in transcription-translation lysates require complex and unique analytical QC of each library scaffold produced, and short transcription/translation times that are incompatible with complete labeling of translated polypeptides, leading to loss of fidelity due to mixtures of fully-reacted and unreacted amino acids for a single sequence.
  • conventional systems for protein engineering are limited in their ability to generate both sufficient quantities of product and chemically complex libraries of intermediates for efficient encoded translation.
  • a method for engineering sense-and- response functions into living cells provides precisely defined dose-response curves.
  • the method includes creation of a large library of genetic variants, e.g., about 100,000 or more. Measurement of the dose-response curve for every variant in the library can take approximately one day, which is faster and less expensive than conventional directed evolution methods. By examining the full set of dose-response curves and the matching DNA sequences, the method can allow selection of the DNA sequence that provides the precise dose-response that is required. Moreover, the method can include producing genetic sensors with precisely defined dose-response curves.
  • a genetic sensor is a biological function, encoded in DNA, that enables cells to sense and respond to environmental signals by changing the level of gene expression in the cells.
  • Exemplary signals include temperature, pH, or concentration of a specific chemical.
  • the method identifies DNA sequences that encode a genetic sensor with a specific, specified, dose-response curve and can be used to identify DNA sequences for genetic sensors that respond specifically to a desired input signal.
  • a process for determining a coding DNA sequence for a specific dose-response curve for a genetic sensor includes producing a coding DNA sequence library including a plurality of mutated variants of the coding DNA sequence (CDS) for the genetic sensor, such that each mutated variant includes a mutated coding DNA sequence of the coding DNA sequence: attaching a separate DNA barcode to each of the mutated variants in the coding DNA sequence library to produce a barcoded variant library of barcoded mutant variants, wherein each barcoded mutant variant in the barcoded variant library corresponds to one of the mutated variants having an attached DNA barcode; determining the DNA sequence of each barcoded mutant variant in the barcoded variant library; transforming the barcoded mutant variants in the barcoded variant library into cells; growing the cells in various cell cultures including different values of an input signal of the genetic sensor; extracting DNA from the cells in the cell cultures; determining the number of each DNA barcode in the DNA extracted from the cells
  • the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes synthesizing the mutated coding DNA sequence. In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes making a mutated genetic sensor including the mutated coding DNA sequence. [0054] In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes producing a selection system for the genetic sensor, wherein the genetic sensor output modulates the fitness of the cells including the genetic sensor. It should be appreciated that the genetic sensor output is the expression level of the genes regulated by the genetic sensor.
  • the selection system can be tuned so that the cell fitness varies in a measurable way across the range of genetic sensor output for the protein engineering goal.
  • the genetic sensor is the /ac repressor from E. coli, and the selection system can be set up by using the genetic sensor to regulate the expression level of a tetracycline resistance gene (tetA) while growing the cells in the presence of tetracycline (an antibiotic).
  • tetA tetracycline resistance gene
  • ribosome binding site used for the tetA gene (to adjust the amount of tetA protein made) and the concentration of tetracycline used during cell growth (to adjust the fitness impact).
  • RBS ribosome binding site
  • concentration of tetracycline used during cell growth to adjust the fitness impact.
  • adjust those two factors so that the cell fitness varies measurably over a range of genetic sensor output levels spanning a range from approximately 0.1 times to approximately 1 times the maximum output level. Accordingly, this allows measurement and engineering genetic sensors with output levels spanning that range.
  • the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes determining the DNA sequence of each barcoded mutant variant that includes subjecting the barcoded mutant variants of the barcoded variant library to high- throughput long-read DNA sequencing.
  • the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes, in growing the cells in various cell cultures including different values of the input signal of the genetic sensor, the input signal includes a concentration of a molecule sensed by the genetic sensor.
  • the DNA in extracting the DNA from the cells in the cell cultures, is extracted during at least two different time points during growth of the cells in the cell culture. Determining the number of each DNA barcode in the DNA extracted from the cells can be performed at each of the different time points with high-throughput short-read DNA sequencing. Determining the fitness of cells containing each barcoded mutant variant can include calculating the fitness from the change in the number of each barcoded mutant variant over time.
  • determining the mutant variant dose-response curve for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant incudes calculating each mutant variant dose-response curve from the fitness as a function of the input signal.
  • the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes calibrating the mutant variant dose-response curves with a spike-in control variant. In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes normalizing the amount of DNA extracted from the cells in the cell cultures with a spike-in control variant.
  • Producing the coding DNA sequence library can include site- saturation mutagenesis, chemical mutagens, error-prone PCR, propagation in error- prone microbes, homologous recombination, gene shuffling, and others.
  • the coding DNA sequence library is produced by mutagenic PCR.
  • the DNA barcodes In attaching the separate DNA barcode to each of the mutated variants, the DNA barcodes have a distinct DNA sequence for each variant.
  • DNA barcodes can be added to the barcoded variant library (i.e. , the library of mutated DNA variants) with PCR using oligonucleotides containing sequence positions with randomized nucleotides.
  • the length of the DNA barcode i.e., the number of randomized DNA bases, is sufficient to create a barcode library much larger than the mutated CDS library.
  • the high-throughput long-read DNA sequencing can read lengths at least great enough to span the CDS, the attached DNA barcode, and any DNA sequence regions between them. Such sequencing method has high-fidelity to confidently assign variant sequences to the DNA barcode.
  • the barcoded variant library can be transformed and maintained in a host organism via extrachromosomal methods (e.g., a DNA plasmid) or artificial chromosome (e.g., bacterial artificial chromosome or yeast artificial chromosome, and the like) or integrated into a genomic chromosome of the host organism (via CRISPR/cas9, homologous recombination, and the like).
  • DNA plasmids are used to transform the barcoded mutant variant into a host organism. Transformation of appropriate cell hosts with a DNA construct of the present invention is accomplished by well-known methods that typically depend on the type of vector used.
  • transformation of prokaryotic host cells see, e.g., Cohen, S. N. et al., Proc. Natl. Acad. Sci. U.S.A 69 (1972): 2110-2114. Transformation of can also follow that described by Beggs, J. D., Nature 275 (1978): 104-109.
  • reagents useful in transfecting such cells for example calcium phosphate and DEAE-dextran or liposome formulations. Electroporation is also useful for transforming or transfecting cells and is well known in the art for transforming yeast cell, bacterial cells, insect cells and vertebrate cells.
  • Successfully transformed cells i.e., cells that contain a DNA construct such as the barcoded mutant variant, can be identified by well-known techniques such as PCR. Alternatively, the presence of the protein in the supernatant can be detected using antibodies.
  • DNA barcode can refer to a polynucleotide sequence that can include various labels.
  • a DNA barcode can be a polynucleotide sequence that can be used for DNA barcoding.
  • DNA barcode can be used to quantify targets within a sample.
  • DNA barcode can be used to control for errors which may occur after a label is associated with a target. For example, a DNA barcode can be used to assess amplification or sequencing errors.
  • a DNA barcode associated with a target e.g., a mutated variant
  • DNA barcoding can refer to the random labeling (e.g., barcoding) of nucleic acids.
  • DNA barcoding can utilize a recursive Poisson strategy to associate and quantify labels associated with targets.
  • DNA barcoding can be used interchangeably with DNA labeling.
  • the DNA barcode can include one or more labels.
  • Exemplary labels include a universal label, a cellular label, a molecular label, a sample label, a plate label, a spatial label, or a pre-spatial label.
  • a DNA barcode can comprise a 5'amine that may link the DNA barcode to a solid support.
  • the DNA barcode can include a universal label, a cellular label, or a molecular label.
  • the universal label can be a 5’- most label.
  • the molecular label may be the 3’-most label. In some instances, the universal label, the cellular label, and the molecular label are in any order.
  • the DNA barcode can include a target-binding region.
  • the target-binding region can interact with a target (e.g., mutated variant) in a sample.
  • a target e.g., mutated variant
  • the labels of the DNA barcode e.g., universal label, dimension label, spatial label, cellular label, and molecular label
  • One or more nucleic acid amplification reactions can be performed to make copies of the nucleic acid sequences such as CDS.
  • Amplification can be performed in a multiplexed manner, wherein multiple target nucleic acid sequences are amplified simultaneously.
  • the amplification reaction can be used to add sequencing adaptors to the nucleic acid molecules.
  • the amplification reactions can include amplifying at least a portion of a barcode, CDS, or combination thereof.
  • amplification can be performed using a polymerase chain reaction (PCR).
  • PCR can be a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA.
  • PCR can include derivative forms of the reaction, including but not limited to, RT-PCR, real-time PCR nested PCR, quantitative PCR, multiplexed PCR, digital PCR and assembly PCR.
  • Amplification of the labeled nucleic acids can comprise non-PCR based methods.
  • non-PCR based methods include, but are not limited to, multiple displacement amplification (MDA), transcription- mediated amplification (TMA), whole transcriptome amplification (WTA), whole genome amplification (WGA), nucleic acid sequence-based amplification (NASBA), strand displacement amplification (SDA), real-time SDA, rolling circle amplification, or circle-to-circle amplification.
  • Non-PCR-based amplification methods include multiple cycles of DNA-dependent RNA polymerase-driven RNA transcription amplification or RNA-directed DNA synthesis and transcription to amplify DNA or RNA targets, a ligase chain reaction (LCR), and a Q ⁇ replicase ( Q ⁇ ) method, use of palindromic probes, strand displacement amplification, oligonucleotide-driven amplification using a restriction endonuclease, an amplification method in which a primer is hybridized to a nucleic acid sequence and the resulting duplex is cleaved prior to the extension reaction and amplification, strand displacement amplification using a nucleic acid polymerase lacking 5’ exonuclease activity, rolling circle amplification, and ramification extension amplification (RAM).
  • the amplification may not produce circularized transcripts.
  • Conducting the amplification reactions can include use of primers that can include, e.g., a select number nucleotides.
  • the first round PCR can amplify molecules (e.g., attached to the bead) using a gene specific primer and a primer against the universal Illumina sequencing primer 1 sequence.
  • the second round of PCR can amplify the first PCR products using a nested gene specific primer flanked by Illumina sequencing primer 2 sequence, and a primer against the universal Illumina sequencing primer 1 sequence.
  • the third round of PCR adds P5 and P7 and sample index to turn PCR products into an Illumina sequencing library. Sequencing using 150 bp*2 sequencing can reveal the cell label and molecular index on read 1 , the gene on read 2, and the sample index on index 1 read.
  • Amplification can be performed in one or more rounds. In some instances there are multiple rounds of amplification. Amplification can include two or more rounds of amplification. The first amplification can be an extension to generate the gene specific region. The second amplification can occur when a sample nucleic hybridizes to the newly generated strand. [0075] In some embodiments hybridization does not need to occur at the end of a nucleic acid molecule. In some embodiments a target nucleic acid within an intact strand of a longer nucleic acid is hybridized and amplified. A target can be more than 50 nt, more than 100 nt, or more that 1000 nt from an end of a polynucleotide.
  • the methods disclosed herein can include counting the number of molecular labels (e.g., DNA barcodes) associated with such moieties such the mutant variant. Counting the molecular labels can be conducted in a variety of ways, e.g., by sequencing the molecular labels. It should be appreciated that to count the number of molecular labels sequencing part of the nucleic acid molecules generated from reverse transcription, second-strand synthesis, or amplification can be sufficient.
  • molecular labels e.g., DNA barcodes
  • Any suitable sequencing method known in the art can be used, preferably high-throughput approaches.
  • cyclic array sequencing using platforms such as Roche 454, Illumina Solexa, ABI-SOLiD, ION Torrent, Complete Genomics, Pacific Bioscience, Helicos, or the Polonator platform can be used.
  • Sequencing can include MiSeq sequencing.
  • Sequencing may include HiSeq sequencing.
  • counting the molecular labels can be conducted by hybridization.
  • the molecular labels can be counted by- hybridization to a microarray.
  • the microarray comprise a plurality of probes that specifically binds to the molecular labels associated with the one or more reference genes in a sample or spike-in.
  • the number of values used for the input signal can be at least two or large enough to distinguish different dose-response curves for protein engineering goals.
  • 12 different input signal levels are concentrations of a selected molecule (e.g., isopropyl p-d-1 -thiogalactopyranoside, IPTG).
  • the cells can be grown in environments with and without selective pressure (e.g., tetracycline). This can provide improved accuracy for the measurement since some fitness effects can arise that are not due to changes in the dose-response curves.
  • a mutated genetic sensor CDS could encode for proteins that are detrimental to the host cells. It is contemplated that cells can be grown in environments with and without pressure so that the difference between the fitness with and without the pressure is used to determine the sensor dose-response curves. To simplify data analysis, the cells can be maintained in exponential growth for the full measurement.
  • Extracting DNA from the cells in the cell cultures can include a variety of DNA extraction methods.
  • solid phase extraction is used to remove plasmid DNA from the cells.
  • various pre-counting steps can be inolved that can include sequencing sample amplification following loading samples into sequencing instrument. It is contemplated that high-throughput DNA sequencing can be short-read sequencing of the bar code that includes, e.g., from 10 bases to 200 bases, but is not so limited in some embodiments. The high-throughput DNA sequencing reads the barcode DNA sequences. It is contemplated that the sequencing method provides as many reads as possible to reduce the uncertainty associated with counting each barcode.
  • the DNA barcodes can be isolated and amplified using PCR, e.g., according to the Example. Automation of data processing can be achieved by software packages that analyze raw sequencing data to produce a count for each barcode in each sample.
  • An exemplary software package is Bartender, which is a C++ tool that is designed to process random barcode data, available github.com/LaoZZZZZ/bartender-1 .1 .
  • the fitness of cells containing each barcode can be calculated using a linear fit to the logarithm of each barcode count as a function of time. Alternately, a model can be used that accounts for the time lag of fitness effects, as necessary, such as according to the Example.
  • a variant with known fitness can be used to normalize the barcode count data, e.g., as described in the Example. For accuracy sake, the variant with known fitness can be spiked into the cel! culture during cell growth. It is contemplated such variant can be added to the culture with a higher cell number than other variants.
  • the fitness and dose-response of several variants can be independently measured.
  • Those calibration variants can be randomly chosen from the library (e.g., plate cultures and pick individual colonies) or synthesized. For accuracy- sake, such can have dose-response curves qualitatively similar to the dose response for the engineering goals.
  • a set of variants with constitutive output i.e., a constant dose response
  • a set of variants responsive to a different input signal can be used for calibration.
  • calibration variants can be used that are randomly selected from a library (e.g., a lac repressor library), wherein the fitness and dose-response curves are used with a select pressure (e.g., IPTG) as the input signal for calibration.
  • mutant variant dose-response curve as the selected mutant variant dose-response curve, in the mutant variant dose-response curve library matches the target dose-response curve
  • data resulting from prior steps can be stored in a table that matches each CDS variant with its attached DNA barcode and the dose-response curve for each DNA barcode.
  • the collated data can be used to identify the CDS for any variants with the desired dose-response curve. If none of the variants in the library have the desired dose-response curve, the large-scale dataset generated from this process can be used to characterize systematic patterns relating CDS mutations to changes in the dose-response.
  • variant dose-response curve matches the target dose-response curve for the mutant variant dose-response curve library, when the mutant variant dose-response curve most closely matches the target dose-response curve, even if not identical.
  • exemplary methods for synthesizing DNA sequences are solid phase phosphoramidite-techniques.
  • Certain solid phase phosphoramidite-techniques involve sequential de-protection and synthesis of sequences built from phosphoramidite reagents corresponding to natural (or non-natural) nucleic acid bases.
  • Phosphoramidite nucleic acid synthesis can limit a length in the number of base pairs in a nucleic acid synthesizer. There are many methods that can be used. Many are commercially available.
  • the methods described herein engineer genetic sensors with specific dose-response curves.
  • Conventional methods for engineering sensors lack quantitatively specified sensitivity, and the methods herein are deemed to be faster, less expensive, and less likely to fail than conventional methods.
  • conventional methods for engineering genetic sensors require both positive and negative selection. Positive selection ensures that the sensor output switches ‘on’ when desired. Negative selection ensures that the sensor output switches off’ when desired. Implementing both positive and negative selection for the same sensor is difficult, time consuming, and prone to failure.
  • the methods herein overcome these limitations and involve only one of positive or negative selection.
  • conventional methods for engineering genetic sensors may require multiple repeated rounds of selection.
  • the methods herein can be implemented with a single round of selection, e.g., genetic sensors can be engineered with the instant methods and have a sensitivity (EC 50 ) within 1 .25-fold of a targeted value using a single round of selection.
  • Methods herein can be used to make genetic sensors that have applications including: living therapeutics that live in a patient’s gut and sense a disease state and provide a treatment specifically in response; microbes or other cells engineered as environmental or process-monitoring sensors; or microbes or other cells engineered to grow nano- or micro-structured materials.
  • the methods herein provide routine and scalable engineering of genetic sensors with quantitatively specified dose-response curves. Consequently, the dose response of the genetic sensors can be chosen to match the needs of the larger engineering project, resulting in a higher success rate and a more rapid time to realization. Additionally, the methods here can engineer genetic sensors that respond specifically to new input signals (e.g. , different molecules).
  • Allostery is a fundamental biophysical mechanism that underlies cellular sensing, signaling, and metabolism, but a quantitative understanding of allosteric genotype-phenotype relationships remains elusive.
  • This Example describes the large-scale measurement of the genotype-phenotype landscape for an allosteric protein, the lac repressor from Escherichia coli, Lacl. Quantitative measure of the dose-response curves for nearly 10 5 variants of the Lacl genetic sensor was made by a method that combines long-read and short-read DNA sequencing. The resulting data provide a quantitative map of the effect of amino acid substitutions on Lacl allostery and reveal systematic sequence-structure-function relationships. Allosteric phenotypes can be quantitatively predicted with additive or neural-network models, but unpredictable changes also occur, such as a new band-stop phenotype that challenges conventional models of allostery and that emerges from combinations of nearly silent amino acid substitutions.
  • Allostery is an inherent property of biomolecules that underlies cellular regulatory processes including sensing, signaling, and metabolism (Fenton, 2008; Motlagh et al, 2014; Razo-Mejia et at, 2018).
  • ligand binding at one site on a biomolecule changes the activity of another, often distal, site.
  • Switching between active and inactive states provides a sense-and- response function that defines the allosteric phenotype.
  • Quantitative descriptions relating that phenotype to its causal genotype would improve our understanding of cellular function and evolution, and advance protein design and engineering (Raman et al, 2014; He & Liu, 2016; Huang et al, 2016).
  • the intramolecular interactions that mediate allosteric regulation are complex and distributed widely across the biomolecular structure, making the development of general quantitative descriptions challenging.
  • genotype-phenotype landscape approaches have enabled the phenotypic characterization of 10 4 -10 5 genotypes simultaneously (Li et al , 2016; Puchta et al, 2016; Sarkisyan et at, 2016; Domingo et al, 2018; Li & Zhang, 2018; Pressman et at, . 2019). Measurements at this scale facilitate the exploration of genotypes with widely distributed mutations, making them ideal for probing complex biological mechanisms like allostery. However, to quantitatively characterize the sense-and-response phenotypes inherent to allostery, a measurement must encompass the full dose-response curve that describes biomolecular activity as a function of ligand concentration.
  • Genetic sensors have served as a model of allosteric regulation for decades, and today are central to engineering biology. Genetic sensors are allosteric proteins that regulate gene expression in response to stimuli, giving cells the ability to regulate their metabolism and respond to environmental changes. Like other allosteric biomolecules, the lac repressor, Lacl, switches between an active state and an inactive state. In the active state, Lacl binds to a DNA operator upstream of regulated genes, preventing transcription. Ligand binding to Lacl stabilizes the inactive (non- operator-binding) state that allows transcription to proceed.
  • the dose-response curve depends on several biophysical parameters, including ligand-binding affinity, operator-binding affinity, and the allosteric constant, which is the equilibrium ratio between the inactive and active states in the absence of ligand and DNA operator (Monod etal, 1965; Daber et al, 2011 ; Razo-Mejia etal, 2018; Chure et al, 2019).
  • the amino acid sequence sets these biophysical parameters, and thus, amino acid substitutions can change these parameters (Daber et al, 2011 ; Chure et al, 2019).
  • the effect of any particular substitution on the biophysical parameters is unpredictable.
  • the Lacl dose-response modulates cellular fitness (i.e., growth rate) based on the concentration of the input ligand isopropyl-p-d-thiogalactoside (IPTG).
  • IPTG isopropyl-p-d-thiogalactoside
  • FACS fluorescence-activated cell sorting
  • the library contained 62,472 different Lacl genotypes, with an average of 7.0 single nucleotide polymorphisms (SNPs) per genotype. Many SNPs were synonymous, i.e., coded for the same amino acid, so the library encoded 60,398 different amino acid sequences with an average of 4.4 amino acid substitutions per variant (FIG. 9B, the number of variants in the library at each mutational distance from the wild type are listed FIG. 34, and the number of observations of each amino acid substitution in the library is shown in FIG. 10).
  • SNPs single nucleotide polymorphisms
  • n we calculated the accuracy simply as the root-mean- square difference between the library-scale and cytometry results.
  • the accuracy for the gene expression levels (G 0 and G ⁇ ) was better at higher gene expression levels (typical for G ⁇ ) than at low gene expression levels (typical for G 0 ), which is expected based on the non-linearity of the fitness impact of tetracycline (FIG. 16 and FIG. 17).
  • Measurements of the Hill coefficient, n had high relative uncertainties for both barcode sequencing and flow cytometry, so the parameter n was not used in any quantitative analysis.
  • the flow cytometry results demonstrated that our experimental method measures dose-response curves with both high qualitative and quantitative accuracy (FIG. 2A-D, FIG. 11 , FIG. 12, FIG. 13, and FIG. 14).
  • the model RMSE is comparable to the experimental measurement uncertainty (FIG. 21). So, we could confidently integrate the experimental and DNN results to provide a nearly complete map of the effects of SNP-accessible amino acid substitutions. Furthermore, by integrating information about the causal substitutions from multiple genetic backgrounds, the model provided improved estimates of EC 50 and G ⁇ for variants with EC 50 near or above the maximum ligand concentration measured (FIG. 22).
  • the resulting map of single-substitution effects includes quantitative point estimates and uncertainties of the Hill equation parameters for 94% of the possible SNP-accessible amino acid substitutions (1 ,991 of 2,110: 964 directly from measured data, and 1 ,027 from DNN predictions; FIG. 23, FIG. 24, FIG. 25). Most of the 119 substitutions missing from the dataset were probably excluded by FACS during library preparation because they cause a substantial increase in G 0 . These include 83 substitutions that have been shown to result in constitutively high G(L) (Markiewicz et al, 1994; Pace et al, 1997). Of the 1 ,991 substitutions included in the dataset, 38% measurably affect the dose-response curve (beyond a 95% confidence bound).
  • the Lacl protein has 360 amino acids arranged into three structural domains (Lewis etal, 1996; Flynn etal, 2003; Swint-Kruse etal, 2003).
  • the first 62 N- terminal amino acids form the DNA-binding domain, comprising a helix-turn-helix DNA-binding motif and a hinge that connects the DNA-binding motif and the core domain.
  • the core domain comprising amino acid positions 63-324, is divided into two structural subdomains: the N-terminal core and the C-terminal core.
  • the full core domain forms the ligand-binding pocket, core-pivot region, and dimer interface.
  • the tetramerization domain comprises the final 30 amino acids and includes a flexible linker and an 18 amino acid a-helix (FIG. 3, FIG. 35).
  • Lacl functions as a dimer of dimers: Two Lacl monomers form a symmetric dimer that further assembles into a tetramer (a dimer of dimers).
  • G 0 and G ⁇ are reported in molecules of equivalent fluorophore (MEF) based on the calibration with flow cytometry measurements. Error bars indicate ⁇ one standard deviation estimated from the Bayesian posteriors. Data are from a single library-scale measurement.
  • substitutions that affect G 0 must alter either the operator-binding affinity, the allosteric constant, or the copy number of Lacl proteins per cell (Daber et al, 2011 ; Razo-Mejia et al, 2018; Chure et al, 2019).
  • Substitutions at the first and second codons probably reduce the Lacl copy number (Bivona et at, 2010; Hecht et at, 2017).
  • Helix 4 forms part of the hinge connecting the DNA-binding motif to the core domain. It changes from a disordered coil to an order helix only upon binding of Lacl to its cognate DNA operator, and interactions between the helix 4 residues of each Lacl monomer have been shown to stabilize helix formation (Spronk et al, 1996) and therefore the active state of Lacl.
  • substitutions at position L71 might disrupt these interactions, shifting the allosteric constant to favor the inactive state.
  • the substitution L71Q which replaces the hydrophobic leucine with a hydrophilic glutamine, causes the largest change (20-fold increase in G 0 and 14-fold decrease in EC 50 ), likely due to perturbation of the local hydrophobic environment at the dimer interface.
  • Our results for hydrophobic substitutions at this position support this picture, with just a 3-fold to 4-fold reduction in EC 50 (and little change to G 0 ), consistent with a smaller shift in the allosteric constant.
  • Amino acid substitutions that change the effective concentration, EC 50 are the most numerous and are spread throughout the protein structure, with approximately 9 and 20% of all substitutions causing a greater than 5- fold or 2.5-fold shift in EC 50 , respectively (FIG. 3C and F). The strongest effects are from substitutions in the DNA-binding domain, ligand-binding pocket, core-pivot region, or dimer interface.
  • substitutions that cause the largest decrease in EC 50 are at the dimer interface and probably disrupt cross-dimer interactions.
  • substitutions T68N (27-fold decrease) and L71Q (14-fold decrease) each probably disrupt the L71- Q78' interaction (discussed above).
  • substitutions V99E (25-fold decrease), E100G (17-fold decrease), and V95M (16-fold decrease) are each in p-strand B and each probably disrupts the K84-K84' interaction (discussed below). All of these substitutions likely shift the allosteric constant to favor the inactive state.
  • Residue F161 sits in the core-pivot region and is sequestered in a hydrophobic cluster (Swint-Kruse et at, 2001 ; Flynn et al 2003), where the phenylalanine ring makes van der Waals contacts with Q291 .
  • Q291 is involved in hydrogen bonding networks that span the ligand-binding pocket and dimer interface (Flynn et at, 2003).
  • the contacts between F161 and Q291 change, contributing to rearrangements throughout the Lacl structure.
  • Positions A82 and I83 are in helix 5 of the N-terminal core domain, and both are proximal to and pointed toward helix 13.
  • the A82E substitution which replaces the diminutive alanine with the larger glutamate, decreases EC 50 approximately 30-fold.
  • a smaller amino acid at this position increases the EC 50 approximately 5-fold.
  • H74P may shift the allosteric constant to favor the active state by stabilizing secondary structure.
  • K84 forms transient interactions with the side chain of S97 (also in p- strand B) and the backbone of M98, before eventually forming a stable charge-charge interaction with D88.
  • Substitutions that disrupt this process have significant effects on the structure and function of Lacl.
  • the substitution K84L causes significant structural changes to the N-terminal core domain and dimer interface (Bell et al, 2001), and substitutions at position S97 and M98 can greatly alter the biophysical properties of Lacl (Zhan et at, 2010). Given the extent of structural and functional changes that can occur with substitutions involved in this process, precise mechanisms of the observed substitutions are difficult to predict, and observed changes are not easily described by the biophysical models.
  • substitutions that disrupt dimer formation dissociated monomers are in the inactive state
  • substitutions that lock the dimer rigidly into either the active or inactive state or substitutions that more subtly affect the balance between the active and inactive states.
  • substitutions that disrupt dimer formation dissociated monomers are in the inactive state
  • substitutions that lock the dimer rigidly into either the active or inactive state or substitutions that more subtly affect the balance between the active and inactive states.
  • the Lacl genotype-phenotype landscape measurement revealed a surprising number of variants with phenotypes that differ qualitatively from the wild type. For example, approximately 230 of the Lacl variants have an inverted phenotype (G 0 > G ⁇ , FIG. 1 E), accounting for approximately 0.35% of the measured library (FIG. 9A).
  • FIG. 9A The measured library
  • the set of strongly inverted variants are more genetically distant from each other than randomly selected variants from the library (FIG. 5C, FIG. 8).
  • the genetic diversity of the inverted variants found in our measurement is striking when compared with previous reports of inverted Lacl variants resulting from site-saturated mutagenesis (Daber et al . 2011 ) or directed evolution with random mutagenesis (Poelwijk et al, 2011 ; Meyer et at, 2013).
  • the inverted Lacl variants can provide specific insight into allosteric biophysics and structure-function relationships, since inversion of the dose-response curve requires inversion of both the allosteric constant and the relative ligand-binding affinity between the active and inactive states (Razo-Mejia et al, 2018; Chure et at, 2019).
  • Razo-Mejia et al, 2018; Chure et at, 2019 Although the set of strongly inverted Lacl variants are genetically diverse, many of them have substitutions in similar regions of the protein that may account for the requisite biophysical changes (FIG. 36).
  • strain MG1655A/ac was constructed by replacing the lactose operon of E. coll strain MG1655 (ATCC #47076) with the bleomycin resistance gene from Streptoalloteichus hindustanus (Shble).
  • plasmids Two plasmids were used for this work: a library plasmid (pTY1, FIG. 7A) used for the measurement of the genotype and phenotype of the entire Lacl library, and a verification plasmid (pVER, FIG. 7B) used to verify the function of over 100 Lacl variants from the library chosen to test the accuracy of the library-scale dose-response curve measurement method.
  • a library plasmid pTY1, FIG. 7A
  • pVER verification plasmid
  • Plasmid pTY1 contained the lacl CDS and the lactose operator (/acO) regulating the transcription of a tetracycline resistance gene, tetA, which, in the presence of tetracycline, confers a measurable change in fitness connected with the expression level of the regulated genes. Plasmid pTY1 also encoded Enhanced Yellow Fluorescent Protein (YFP), which was used during library construction to select a library in which most of the Lacl variants could function as allosteric repressors.
  • YFP Enhanced Yellow Fluorescent Protein
  • Plasmid pVER contained a similar system in which Lacl and lacO regulate the transcription of only YFP. Plasmid pVER was used to measure dose-response curves of clonal Lacl variants using flow cytometry. Each variant chosen from the library for verification was chemically synthesized, inserted into pVER, and transformed into E. coll strain MG1655 ⁇ /acfor flow cytometry measurements to confirm the dose-response curve inferred from the library-scale measurement.
  • the Lacl library was generated by error-prone PCR of the wild- type lacl CDS encoded on plasmid pTY1 .
  • the library was constructed by splitting the lacl CDS into an N-terminal half and a C-terminal half using an Apal restriction enzyme cut site that was near the center of the lacl CDS covering codons for A186, G187, and P188. Error-prone PCR of each half introduced genetic diversity, and then each unique sequence was attached to a DNA barcode. Plasmid-encoded versions of both sub- libraries were established (an N-terminal library, pNTL, and a C-terminal library, pCTL). These two sub-libraries were then assembled to generate the full /ac/ CDSs and to combine the two halves of the DNA barcode. The library was inserted into pTY1 along with randomly synthesized DNA barcodes (FIG. 7A).
  • the plasmid containing the N-terminal library, pNTL contained the N-terminal half of the /ac/ CDS (lacl-N) and a randomized nucleotide sequence that forms half of the DNA barcode.
  • the protocol for constructing the N-terminal library included:
  • PCR amplify the N-terminal half of the /ac/ CDS (coding for amino acids M1- A186) from pTY1 using the GeneMorph II Random Mutagenesis Kit (Agilent, cat. #200550) with primers DT.01 and DT.02 (Appendix Table S7). Agarose gel purify PCR product.
  • PCR amplify the T7 terminator (TT7) with primers DT.03 and DT.04 with Phusion Flash polymerase (used for PCR unless otherwise specified) and then re-amplified with primers DT.03 and DT.05. Agarose gel purify PCR product.
  • PCR amplify the pMB1 origin of replication and ampicillin resistance gene from plasmid pUC19 (NEB, cat #N3041 S) with primers DT.06 and DT.07.
  • primer DT.06 incorporates half of the final DNA barcode, consisting of 27 randomized nucleotides interspersed with constant A/T bases to limit restriction site formation. Primer DT.06 was ordered with hand-mixed bases, with “N” representing equal ratios of A, T, G, and C.
  • TNNTNNNANNTNNNANNTNNNANNTNNNANNTNNNANNTNNNANNA-3 Ligate the assembly PCR product and the pUC19 amplicon using T4 DNA ligase (Thermo Scientific, EL0011) to form the final sub-library plasmid pNTL, and desalt using QIAquick PCR Purification Kit (Qiagen, cat. #28106).
  • the plasmid containing the C-terminal library, pCTL contained the C-terminal half of the /ac/ CDS (lacl-C) and a randomized nucleotide sequence that forms the second half of the DNA barcode.
  • the protocol for constructing the C-terminal library is:
  • PCR amplify the C-terminal half of /ac/ CDS (coding for amino acids L189- Q360) from pTY1 using the GeneMorph II Random Mutagenesis Kit with primers DT.10 and DT.11. Agarose gel purify PCR product.
  • primer DT.12 incorporates half of the final DNA barcode, consisting of 27 randomized nucleotides interspersed with constant A/T bases to limit restriction site formation. Primer DT.12 was ordered with hand-mixed bases, with “N” representing equal ratios of A, T, G, and C. Half DNA Barcode: 5’-
  • Plasmids pNTL, pCTL, and pTY1 are starting points to assemble full length /ac/ CDS and combine the two halves of the DNA barcodes.
  • the protocol is:
  • both PCRs use primers DT.14 and DT.15.
  • Agarose gel purify both digested amplicons, then ligate the two together using T4 DNA ligase to assemble the sensor library with full length /ac/ CDS. Agarose gel purify the ligation product.
  • FACS fluorescence-activated cell sorting
  • a spike-in control strain was used to normalize the DNA barcode read counts for the sequencing-based fitness measurement.
  • the spike-in control strain contained the library plasmid (pTY1) with a Lacl variant that had a constant, high tetA expression level.
  • the fitness of the spike-in control was determined from OD 600 data acquired during growth of clonal cultures with the same automated growth protocol as used for the genotype-phenotype landscape measurement. The fitness of the spike-in control was measured in all 24 chemical environments and was independent of IPTG concentration but was slightly lower with tetracycline (0.75 h -1 ) than without tetracycline (0.81 h -1 ).
  • E. coli cultures were grown in a rich M9 media (3 g/l KH 2 PO 4 , 6.78 g/l Na 2 HPO 4 , 0.5 g/l NaCI, 1 g/l NH 4 CI, 0.1 mmol/l CaCl 2 , 2 mmol/l MgSO 4 , 4% glycerol, and 20 g/l casamino acids) supplemented with 50 pg/ml kanamycin.
  • M9 media 3 g/l KH 2 PO 4 , 6.78 g/l Na 2 HPO 4 , 0.5 g/l NaCI, 1 g/l NH 4 CI, 0.1 mmol/l CaCl 2 , 2 mmol/l MgSO 4 , 4% glycerol, and 20 g/l casamino acids
  • Escherichia coli cultures were grown in a laboratory automation system that controlled preparation of 96-well culture plates with media and additives (i.e., IPTG and tetracycline). Cultures were grown in clear-bottom 96-well plates with 1.1 ml square wells. The culture volume per well was 0.5 ml. Before incubation, an automated plate sealer was used to seal each 96-well plate with a gas-permeable membrane. Cultures were incubated in a multi-mode plate reader at 37°C with a 1 °C gradient applied from the botom to the top of the incubation chamber to minimize condensation on the inside of the membrane. During incubation, the plate reader was set for double-orbital shaking at 807 cycles per minute. Optical density at 600 nm (OD 600 ) was measured every 5 min during incubation, with continuous shaking applied between measurements. After incubation, an automated desealer was used to remove the gas-permeable membrane from each 96-well plate.
  • media and additives
  • a culture of E. coli containing the Lacl library was mixed at a 99:1 ratio with a culture of the E. coli spike-in control.
  • the culture was loaded into the automated microbial growth and measurement system where it was distributed across a 96-well plate and then grown to stationary phase (12 h, FiG. 30). Cultures were then diluted 50-fold into a new 96-well plate, Growth Plate 1 , containing 11 rows with a 2- fold serial dilution gradient of IPTG with concentrations ranging from 2 to 2,048 ⁇ mol/l and one column without IPTG.
  • IPTG IPTG
  • Growth in IPTG allowed each variant to reach a steady- state tetA expression level in each IPTG concentration.
  • Growth Plate 1 was grown for 160 min, corresponding to approximately 3.3 generations, and then diluted 10-fold into Growth Plate 2.
  • Growth Plate 2 contained the same IPTG gradient as Growth Plate 1 with the addition of tetracycline (20 pg/ml) to alternating rows in the plate, resulting in 24 chemical environments, with each environment spread across 4 wells.
  • Growth Plate 2 was grown for 160 min and then diluted 10-fold into Growth Plate 3, which contained the same 24 chemical environments as Growth Plate 2. This process was repeated for Growth Plate 4, which also contained the same 24 chemical environments.
  • Each growth plate was pre-heated to 37°C before transferring the cells from the previous growth plate to avoid any disruption of cell growth due to large variations in temperature.
  • the cultures without tetracycline increased approximately 10-fold in optical density, to a final OD 600 of approximately 0.5 (corresponding to an estimated cell density of 4 x 10 8 cells/ml.
  • the protocol was:
  • the plate reader was set for continuous double- orbital shaking at 807 cycles per minute. OD 600 was measured every 5 min.
  • Growth Plate 1 contains only the IPTG gradient (no tetracycline). This allows cells to reach exponential growth and to reach steady-state expression of tetA before adding tetracycline.
  • Binding Buffer Add 900 ⁇ l Binding Buffer to each well and use the filter press to push buffer through the binding plate at 40 psi for 60 s.
  • the filter press jammed just before elution, so those samples were eluted by centrifugation at 1 ,000 g for 3 min.
  • the concentration of DNA in each sample ranged from undetectable up to approximately 1 .5 ng/ ⁇ l. This corresponds to an estimated maximum of 10 10 plasmids per sample.
  • each set of 24 plasmid DNA samples was prepared for barcode sequencing. Briefly, the plasmid DNA was linearized with Apal restriction enzyme. Then, a three-cycle PCR was performed to amplify the barcodes from the plasmids and attach sample multiplexing tags so samples from different chemical environments could be distinguished when pooled and run on the same sequencing flow cell. Eight forward index primers (FIG. 41) and 12 reverse index primers (FIG. 42) were used to label the amplicons from each sample across the 24 chemical environments and the four time points.
  • Vortex Sera-mag SpeedBeads to resuspend and transfer 1 ml to each of two 2 ml microcentrifuge tubes.
  • the Forward and Reverse Index Primers attach sample multiplexing tags to the resulting amplicons so that the different samples can be distinguished when they are pooled and run on the same Illumina platform sequencing flow cell. Eight different Forward Index Primers and 12 different Reverse Index Primers are used to uniquely label the amplicons from each sample across the 24 different chemical conditions and the four Growth Plates used for barcode sequencing.
  • the ratio of Bead Stock to PCR volume is 0.6x. Consequently, because of the relatively low PEG concentration in the mixture, the 5,500 bp plasmid template binds to the beads.
  • the ratio of Bead Stock to PCR volume is 1 .1
  • the 306 bp PCR amplicon binds to the beads while the 70 bp primers remain in the supernatant.
  • DNA was diluted to 5 nmol/l and combined with 20% phiX control DNA.
  • DNA from each of the 4 time points was sequenced in a separate lane on an Illumina HiSeqX using paired-end mode with 150 bp in each direction.
  • the Z's at the beginning of each read are random nucleotides used as unique molecular identifiers (UMIs) to correct for PCR jackpotting (Kivioja et al, 2012), the X’s are the sample multiplexing tag sequences, and the N's are the random nucleotides of the DNA barcodes.
  • UMIs unique molecular identifiers
  • the X’s are the sample multiplexing tag sequences
  • the N's are the random nucleotides of the DNA barcodes.
  • the four bases after the multiplexing tag must match the nominal sequence with one allowed mismatch, and the multiplexing tag sequence (X's in the sequences above) must match the nominal sequence for one of the multiplexing tags used with up to three allowed mismatches.
  • the five flanking bases before and after the barcodes must match the nominal sequence with one allowed mismatch per set of five bases, and the number of bases in the barcode must be between 35 and 41 (inclusive).
  • the mean Illumina quality score for the barcode and the five flanking bases before and after the barcode must be greater than 30.
  • barcode clusters of different length were considered for merging.
  • the full plasmid sequence was used to detect unintended mutations in the plasmid, i.e., mutations to plasmid regions other than the /ac/ CDS.
  • the full plasmid sequence was divided into 11 non- overlapping regions that roughly correspond to different functional elements of the plasmid (FIG. 39), and the sequences for each region were extracted from the HiFi reads using a custom bioinformatic pipeline (see, e.g., code available at https://github.com/djross22/nist__lacl__landscape__analysis).
  • 43 of the 535 variants with the wild-type Lacl amino acid sequence had mutations in the regulatory region (containing the P.aci and P tacl promoters, the /acO operator, the riboJ insulator, and the RBS sites for both /ac/ and tetA). Of those 43 variants, three had EC 50 values that differed by approximately 2-fold or more from the geometric mean value for the wild-type EC 50 .
  • the Kolmogorov-Smirnov test was used to compare the distributions of EC 50 values between the wild-type variants with and without mutations in the regulatory region; the results indicated a significant difference (P-value: 0.024).
  • variants were excluded from those analyses if they had one or more mutations in the non-/ac/ regions that show significant mutational effects: tetA, YFP, KAN, the origin of replication, and the regulatory region. After applying this data quality filter in addition to those described above, there were 54,162 variants that we used for further quantitative analysis.
  • the ratio of the barcode read count to the spike-in read count was fit to a function assuming exponential growth and a delay in the onset of the fitness impact of tetracycline.
  • the fitness associated with each variant in each of the 24 chemical environments was determined as a parameter in the corresponding least-squares fit as detailed below.
  • the analysis accounts for the variation in ceil fitness (growth rate) as a function of time after the ceils were exposed to tetracycline.
  • the fitness as a function of time is taken to approach the new value with an exponential decay: is the steady-state fitness with tetracycline, and ⁇ is a transition rate.
  • the barcode read count for variant i in Growth Plate j was assumed to be proportional to the cell number: where a/ is a proportionality constant associated with variant /, and bj is a proportionality constant associated with Growth Plate j.
  • the proportionality constant a can be different for each variant / due to differences in PCR amplification efficiency resulting from variations in the barcode sequences on each amplicon.
  • the proportionality constant bj can be different for each Growth Plate because of sample-to-sample variations in the DNA extraction efficiency or differences in PCR efficiency associated with different sample multiplexing tag sequences.
  • Plasmids pTY1 and pVER were engineered to provide two independent measurements of the dose-response curve for Lacl variants.
  • pTY1 Lacl regulates the expression of a tetracycline resistance gene (tetA) that enables determination of the dose-response from barcode sequencing data by comparing the fitness measured with tetracycline to the fitness measured without tetracycline.
  • tetA tetracycline resistance gene
  • YFP fluorescent protein
  • a set of nine randomly selected Lacl variants were used to calibrate the estimation of regulated gene expression output from the barcode sequencing fitness measurements (FIG. 16).
  • the calibration data consisted of the fitness data for each calibration variant from the library barcode sequencing measurement (using the library plasmid, pTY1 ) and flow cytometry data for each calibration variant prepared as a clonal culture (using the verification plasmid, pVER).
  • Source code for both models is included in the software archive at https://github.com/djross22/nist__lacl__landscape__analysis.
  • Both inference models used equation (9) to represent the relationship between fitness and regulated gene expression.
  • the first Bayesian inference model assumed that the dose-response curve for each Lacl variant was described by the Hill equation.
  • the Hill equation parameters for each variant, G «, Go, EC 50 , and n and their associated uncertainties were determined using Bayesian parameter estimation by Markov Chain Monte Carlo (MCMC) sampling with PyStan (Carpenter et a/, 2017). Broad, flat priors were used for log 10 (Go), log 10 (G ⁇ ), and log 10 (EC 50 ), with error function boundaries to constrain those parameter estimates to within the measurable range (100 MEF ⁇ Go, G- ⁇ 50,000 MEF; 0.1 ⁇ mol/l ⁇ EC 50 .i ⁇ 40,000 ⁇ mol/l).
  • n i was a gamma distribution with shape parameter of 4.0 and inverse scale parameter of 3.33.
  • the inference model was run individually for each Lacl variant, with four independent chains, 1 ,000 iterations per chain (500 warmup iterations), and the adapt todelta parameter set to 0.9. Testing with data from a set of randomly selected variants indicated that these settings for the Stan sampling algorithm typically produced a Gelman-Rubin diagnostic ⁇ 1.05 and number of effective iterations > 100.
  • the second Bayesian inference model was a non-parametric GP model (Rasmussen & Williams, 2005) that assumed only that the dose-response curve for each Lacl variant was a smooth function of IPTG concentration.
  • the GP model was used to determine which variants had band-pass or band-stop phenotypes.
  • the GP model was also implemented using MCMC sampling with PyStan (Carpenter et al, 2017).
  • the GP inference model was run individually for each variant, with four independent chains, 1 ,000 iterations per chain (500 warmup iterations), and the adapt_delta parameter set to 0.9. Testing with data from a set of randomly selected variants indicated that these settings for the Stan sampling algorithm of the GP model typically produced a Gelman-Rubin diagnostic less than 1.02 and number of effective iterations greater than 200.
  • Lacl variants from the library were chosen for flow cytometry verification of the dose-response curves.
  • the CDSs of these variants were chemically synthesized. Each synthesized sequence was digested with restriction enzymes Xhol and Sgsl, and ligated into the verification plasmid, pVER, and then transformed into MG1655A/ac. Transformants were plated on LB agar supplemented with kanamycin and 0.2% glucose. Lacl variant sequences were verified with Sanger sequencing (Psomagen USA). For flow cytometry measurements of dose-response curves, a culture of E.
  • coli containing pVER with a chosen variant sequence was distributed across 12 wells of a 96-well plate and grown to stationary phase using the automated microbial growth system. After growth to stationary phase, cultures were diluted 50-fold into a plate containing the same 12 IPTG concentrations used during the fitness landscape measurement (0-2,048 ⁇ mol/l). In some cases, higher IPTG concentrations were used to capture the full dose-response curves of selected variants (e.g., FIG. 11 , FIG. 12, FIG. 13, FIG. 14). Cultures were then grown for 160 min (- 3.3 generations) before being diluted 10-fold into the same IPTG gradient and grown for another 160 min.
  • each culture was diluted into 195 ⁇ l of PBS supplemented with 170 ⁇ g/ml chloramphenicol and incubated at room temperature for 30-60 min to halt the translation of YFP and allow extant YFP to mature in the cells.
  • Samples were measured on an Attune NxT flow cytometry with autosampler using a 488 nm excitation laser and a 530 nm ⁇ 15 nm band-pass emission filter. Blank samples were measured with each batch of cell measurements, and an automated gating algorithm was used to discriminate cell events from non-cell events (FIG. 33A and B). With the Attune cytometer, the area and height parameters for each detection channel are calibrated to give the same value for singlet events. So, to identify singlet cell events and exclude multiplet cell events, a second automated gating algorithm was applied to select only cells with side scatter area side scatter height (FIG. 33C and D). All subsequent analysis was performed using the singlet cell event data.
  • Fluorescence data were calibrated to molecules of equivalent fluorophore (MEF) using fluorescent calibration beads.
  • the cytometer was programmed to measure a 25 ⁇ l portion of each cell sample, and the 40-fold dilution used in the cytometry sample preparation resulted in approximately 20,000 singlet cell measurements per sample.
  • the geometric mean of the YFP fluorescence was used as a summary statistic to represent the regulated gene expression level as a function of the input ligand concentration, [IPTG] for each Lacl variant.
  • An autofluorescence control strain MG1655A/ac with a plasmid similar to pVER but lacking the YFP gene
  • the regulated gene expression output level for each variant is reported as the geometric mean of the measured fluorescence forthat variant minus the geometric mean of the measured fluorescence for the zero-fluorescence control (92 MEF).
  • Variants were labeled as having a negative response if the slope, was negative at one or more IPTG concentrations with 0.95 or higher posterior probability (from the GP model inference). To avoid false positives from end effects, this negative slope criteria was only applied for IPTG concentrations between 2 ⁇ mol/l IPTG and 1 ,024 ⁇ mol/L Variants were labeled as “always on” (the phenotype from reference (Markiewicz et at, 1994)) if they were flat- response and if G(0) was greater than 0.25 times the wild-type G ⁇ value with 0.95 or higher posterior probability (from the GP model inference).
  • Variants were labeled as “always off” (the / s phenotype from reference (Markiewicz et al, 1994)) if they were flat-response but not always on. Variants were labeled as band-stop or band-pass if the slope, was negative at some IPTG concentrations and positive at other IPTG concentrations, both with 0.95 or higher posterior probability (from the GP model inference). Band-stop and band-pass variants were distinguished by the ordering of the negative-slope and positive-slope portions of the dose-response curves. Variants that had a negative response but that were not band-pass or band-stop, were labeled as inverted.
  • False-positive rates were estimated for each phenotypic category by manually examining the fitness vs IPTG data for Lacl variants with less than three substitutions. Typical causes of false-positive phenotypic labeling included unusually high noise in the fitness measurement and biased fit results due to outlier fitness data points. Estimated false-positive rates ranged between 0.001 and 0.005. The relative abundance values shown in FIG. 9A were corrected for false positives using the estimated rates.
  • the library contained a set of 39 variants with the wild-type /ac/ CDS (each with a different DNA barcode), and a set of 310 variants with only synonymous nucleotide changes (i.e., no amino acid substitutions). Both sets had long-read sequencing coverage for the entire plasmid and were screened to retain only variants with zero unintended mutations in the piasmid (i.e., no mutations in regions of the piasmid other than the /ac/ CDS) The Hiii equation fit resuits for those two sets were compared to determine whether synonymous nucieotide changes significantly affected the phenotype.
  • the Koimogorov-Smirnov test was used to compare the distributions of Hill equation parameters between these two sets.
  • the resulting P- values (0.71 , 0.40, 0.28, and 0.17 for G 0 , G ⁇ , EC 50 , and n, respectively) indicate that there were no significant differences between them.
  • the library contained 40 sets of variants, each with four or more synonymous CDSs (including the set of synonymous wild-type sequences and 39 non-wild-type sequences).
  • a hierarchical model was used to compare the Hill equation parameters within each set of synonymous CDSs. Within each set, the uncertainty associated with individual variants was typically larger than the variant-to-variant variability estimated by the hierarchical model. Overall, these results indicate that synonymous SNPs did not measurably impact the Lacl phenotype, so only the amino acid sequences were considered for any subsequent quantitative genotype-to-phenotype analysis.
  • the single amino acid substitution results presented in FIG. 3, FIG. 6B and C, FIG. 23, FIG. 24, FIG. 25, and FIG. 26 are a combination of direct experimental observations, DNN model results, and estimates of G 0 for missing substitutions.
  • the DNN model can provide a better parameter estimate for these flat-response variants because it uses data and relationships from the full library (e.g., the log-additivity of EC 50 ) to predict parameter values for each single substitution. So, the DNN model results were used for these 74 substitutions. Finally, the DNN model results were used for an additional 953 substitutions that are found in the library, but only in combination with other substitutions (i.e. , not as single substitutions).
  • the set of 43 strongly inverted Laci variants discussed in the main text and used for the plots in FIG. 5A and C were identified by the following criteria: and EG 50 between 3 ⁇ mol/l and 1 ,000 ⁇ mol/l.
  • the set of 31 strong band-stop variants discussed in the main text and used for the plots in FIG. 5B and D were identified by the following criteria: and the slope, of less than -0.07 at low IPTG concentrations and greater than zero at higher IPTG concentrations, both with 0.95 or higher posterior probability (from the GP model inference). To avoid false positives due to noise in the DNA barcode counting, only Laci variants with a total DNA barcode read count greater than 3,000 were included. Also, the sets of strongly inverted and strong band-stop variants were manually screened for additional likely false positives due to outlier fitness data points.
  • a hypergeometric test was used to determine the amino acid substitutions that occur more frequently in the set of strongly inverted or strong band- stop variants than in the full library (the set of 52,321 variants with more than 3,000 DNA barcode reads and with the /ac/ sequences determined by long-read sequencing). For each possible substitution, the cumulative hypergeometric distribution was used to calculate the probability of the observed number of occurrences of that substitution in the set of inverted or band-stop variants under a null model of no association. This probability was used as a P-value for the null hypothesis that the observed number of inverted or band-stop variants with that substitution resuited from an unbiased random selection of variants from the full library.
  • substitutions were considered to occur at significantly higher frequency if they had a P-value ⁇ 0.005 and if they occurred more than once in the set of inverted or band-stop variants.
  • 10 associated (higher frequency) amino acid substitutions were identified: S70I, K84N, D88Y, V96E, A135T, V192A, G200S, Q248H, Y273H, and A343G.
  • eight associated substitutions were identified: V4A, A92V, H179Q, R195H, G178D, G265D, D292G, and R351 G.
  • Deep neural network modeling [00215] The dataset was pruned to a set of high-quality sequences for DNN modeling. Specifically, data for a Lacl variant was only used for modeling if it satisfied the following criteria:
  • the total number of barcode read counts for a Lacl variant was > 3,000.
  • the number of amino acid substitutions was ⁇ 14.
  • the logarithm of the Hill equation parameter values were normalized to a standard deviation of 1 , and then shifted by the corresponding value of the wild-type sequence in order to correctly represent the prediction goal of the change in each parameter relative to wild-type Lacl.
  • a long-term, short-term recurrent neural network was selected for the underlying model (Hochreiter & Schmidhuber, 1997), with 16 hidden units, a single hidden layer, and hyperbolic tangent (tanh) non-linearities.
  • pTY1 plasmid sequence NCBI GenBank MT702633 (availabe at https://www.ncbi.nlm.nih.gov/nuccore/MT702633).
  • pVER plasmid sequence NCBI GenBank MT702634 (availabe at https://www.ncbi.nlm.nih.gov/nuccore/MT702634).
  • the processed data table containing comprehensive data and information for each Lacl variant in the library is publicly available via the NIST Science Data Portal, with the identifier ark:/88434/mds2-2259 (avaiable at https://data.nist.gov/od/id/mds2-2259 or https://doi.org/10.18434/M32259).
  • the data table includes the DNA barcode sequences, the barcode read counts for each sample and time point used for the library-scale measurement, fitness estimates for each barcoded variant across the 24 chemical environments, the results of both Bayesian inference models (including posterior medians, covariances, and 0.05, 0.25, 0.75, and 0.95 posterior quantiles), the Lacl CDS and amino acid sequence for each barcoded variant (as determined by long-read sequencing), the number of Lacl CDS reads in the long-read sequencing dataset for each barcoded variant, and the number of unintended mutations in other regions of the plasmid (from the long-read sequencing data).
  • Domingo J, Diss G, Lehner B (2018) Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature 558: 117-121 .
  • the processes described herein may be embodied in, and fully automated via, software code modules executed by a computing system that includes one or more general purpose computers or processors.
  • the code modules may be stored in any type of non-transitory computer-readable medium or other computer storage device. Some or ail the methods may alternatively be embodied in specialized computer hardware.
  • the components referred to herein may be implemented in hardware, software, firmware, or a combination thereof.
  • Any logical blocks, modules, and algorithm elements described or used in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both.
  • various illustrative components, blocks, modules, and elements have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.
  • a processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like.
  • a processor can include electrical circuitry configured to process computer-executable instructions.
  • a processor in another embodiment, includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions.
  • a processor can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
  • a processor may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry.
  • a computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
  • a software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of non- transitory computer-readable storage medium, media, or physical computer storage known in the art.
  • An exampie storage medium can be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor.
  • the storage medium can be volatile or nonvolatile.
  • a combination thereof refers to a combination comprising at least one of the named constituents, components, compounds, or elements, optionally together with one or more of the same class of constituents, components, compounds, or elements.
  • first current could be termed a second current
  • second current could be termed a first current
  • the first current and the second current are both currents, but they are not the same condition unless explicitly stated as such.
  • the modifier about used in connection with a quantity is inclusive of the stated value and has the meaning dictated by the context (e.g., it includes the degree of error associated with measurement of the particular quantity).

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Biomedical Technology (AREA)
  • Wood Science & Technology (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Plant Pathology (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Determining a coding DNA sequence for a genetic sensor includes producing a CDS library of mutated variants; attaching a DNA barcode to the mutated variants to produce a barcoded variant library; determining the DNA sequence of each barcoded mutant variant in the barcoded variant library; transforming the barcoded mutant variants into cells; growing the cells in various cell cultures for different values of an input signal of the genetic sensor; extracting DNA from the cells; determining the number of each DNA barcode in the DNA extracted from the cells; determining a fitness of the cells for each barcoded mutant variant; determining a mutant variant dose-response curve for each barcoded mutant variant to produce a mutant variant dose-response curve library; providing a target dose-response curve; determining which mutant variant dose-response curve matches the target dose-response curve; and determining the mutated coding DNA sequence of the selected mutant variant dose-response curve.

Description

PRECISION ENGINEERING OF A CELLULAR
SENSE-AND-RESPONSE FUNCTION
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH
[0001] This invention was made with United States Government support from the National Institute of Standards and Technology (NIST), an agency of the United States Department of Commerce. The Government has certain rights in this invention.
CROSS REFERENCE TO RELATED APPLICATIONS
[0002] This application claims the benefit of U.S. Provisional Patent Application Serial No. 63/122,501 (filed December 8, 2020), which is herein incorporated by reference in its entirety.
BRIEF DESCRIPTION
[0003] Disclosed is a process for determining a coding DNA sequence for a specific dose-response curve for a genetic sensor, the process comprising: producing a coding DNA sequence library comprising a plurality of mutated variants of the coding DNA sequence (CDS) for the genetic sensor, such that each mutated variant comprises a mutated coding DNA sequence of the coding DNA sequence; attaching a separate DNA barcode to each of the mutated variants in the coding DNA sequence library to produce a barcoded variant library of barcoded mutant variants, wherein each barcoded mutant variant in the barcoded variant library corresponds to one of the mutated variants having an attached DNA barcode; determining the DNA sequence of each barcoded mutant variant in the barcoded variant library; transforming the barcoded mutant variants in the barcoded variant library into cells; growing the cells in various cell cultures comprising different values of an input signal of the genetic sensor; extracting DNA from the cells in the cell cultures; determining the number of each DNA barcode in the DNA extracted from the cells; determining a fitness of the cells for each barcoded mutant variant; determining a mutant variant dose-response curve for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant, to produce a mutant variant dose-response curve library comprising a plurality of the mutant variant dose- response curves; providing a target dose-response curve; determining which mutant variant dose-response curve, as a selected mutant variant dose-response curve, in the mutant variant dose-response curve library matches the target dose-response curve; and determining the mutated coding DNA sequence of the selected mutant variant dose-response curve.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The following description cannot be considered limiting in any way. Various objectives, features, and advantages of the disclosed subject matter can be more fully appreciated with reference to the following detailed description of the disclosed subject matter when considered in connection with the following drawings, in which like reference numerals identify like elements.
[0005] FIG. 1 shows details for a library-scale allosteric genotype-phenotype landscape measurement. (A) library of lac repressor (Lacl ) variants was generated by random mutagenesis of the lacl coding DNA sequence (CDS). The CDS for each variant was attached to a DNA barcode and inserted into a plasmid where the Lacl variant regulated expression of a tetracycline resistance gene. The CDS and corresponding barcode on each plasmid were determined with long-read sequencing. (B) The library was transformed into Escherichia coli. (C) Cells containing the library were grown in 24 chemical environments, including 12 concentrations of the ligand IPTG, each with (orange) and without (blue) tetracycline. Cultures were maintained in exponential growth. Changes in the relative abundance of each variant were measured with short-read sequencing of DNA barcodes at four timepoints and were used to determine the fitness associated with each variant in each environment. (D) The fitness without tetracycline (blue) is independent of IPTG concentration. The fitness with tetracycline (orange) depends on the IPTG concentration via the dose- response of each variant. Error bars indicate ± one standard deviation estimated from least-squares fits of the barcode abundance vs time (Materials and Methods) and are often within markers. Data are from a single library-scale measurement. (E) Dose- response curves for 62,472 Lacl variants were determined from the fitness measurements with Bayesian inference using a Hill equation model (black lines for variants with normal and inverted dose-response curves) and a Gaussian process (GP) model (purple lines, shaded regions indicate 50% and 90% credible intervals). Flow cytometry verification measurements (purple points) generally agreed with Bayesian inference results and verified the existence of the band-stop and other phenotypes, dose-response output was calibrated from fitness to fluorescent protein expression (FIG. 16) and reported in molecules of equivalent fluorophore (MEF). Purple points represent the geometric mean of the YFP fluorescence minus the geometric mean of a zero-fluorescence control (92 MEF), as determined from a single flow cytometry measurement at each IPTG concentration.
[0006] FIG. 2 shows data for an accuracy of the library-scale dose-response curve measurement. (A-D) The plots compare the results from the library-scale measurement (y-axis) with the flow cytometry verification results (x-axis) for each Hill equation parameter. Data are shown for all of the verified Lacl variants with sigmoidal dose-response curves (i.e. , band-stop and band-pass variants are not included). Data for different variants are plotted with different combinations of color and shape. Variants that occurred more than once in the library (with different DNA barcodes) are plotted multiple times. For example, the wild type (dark gray “X” symbols) is plotted 53 times. The accuracy for each Hill equation parameter is 4-fold for G0 (A), 1.5-fold for G∞ (B), 1 .8-fold for EC50 (C), and ± 0.28 for n (D). For G0, G∞, and EC50 (A-C), the accuracy is calculated as: exp(RMSE(ln(x)))expRMSEInx, where RMSE(ln(x))RMSEInx is the root-mean-square difference between the logarithm of each parameter from the library-scale and cytometry measurements. For n, the accuracy is given simply as the root-mean-square difference between the library-scale and cytometry results. The inverse-variance-weighted coefficient of determination (R2) for each Hill equation parameter is: 0.83 for G0 (A), 0.55 for G∞ (B), 0.86 for EC50 (C), and -0.04 for n (D). The variance of the posterior distribution from the Bayesian inference was used for weighting. In addition, the contribution from the wild-type observations were weighted by a factor of 1/53 to avoid bias from multiple observations. In all plots, points indicate the median and error bars indicate ± one standard deviation from the Bayesian posterior. Data are from a single library-scale measurement, and a single flow cytometry measurement for each Lacl variant at each IPTG concentration. [0007] FIG. 3 shows an effect of single amino acid substitutions on allosteric function of Lacl. (A-C) Protein structures showing the locations of amino acid substitutions that affect each Hill equation parameter G0 (A), G∞ (B), EC50 (C). For each, the operator-binding structure is shown on the left (operator DNA in light orange, PDB ID: 1 LBG (Lewis et al, 1996)) and the ligand-binding structure is shown on the right (IPTG in cyan, PDB ID: 1 LBH (Lewis etal, 1996)). Both structures are shown with the view oriented along the protein dimer interface, with one monomer in light gray and the other monomer in dark gray. Colored spheres highlight residues where substitutions cause a greater than 5-fold change in the Hill equation parameter relative to wild-type Lacl. Red spheres indicate residues where substitutions increase the parameter, and blue spheres indicate residues where substitutions decrease the parameter. At three residues (A82, 183, and F161), some substitutions decrease EC50, while other substitutions increase EC50 (violet spheres in C). (D-F) Scatter plots showing the effect of each substitutions as a function of position. Substitutions that change the parameter by less than 5-fold are shown as gray points. Substitutions that change the parameter by more than 5-fold are shown as red or blue points with error bars. Histograms to the right of each scatter plot show the overall distribution of single- substitution effects.
[0008] FIG. 4 shows effects of amino acid substitutions on EC50 of Lacl are log-additive. The log-additive EC50 for double-substitution Lacl variants (i.e., two amino acid substitutions) was calculated assuming log-additivity of the effect of each single substitution on the EC50 relative to wild-type Lacl: log( EC50, AB/ EC50, wt) = log(EC50, Al EC50, wt) + log(EC50,B/ EC50, wt), where “wt” indicates the wild type, “A” and “B” indicate the single-substitution variants, and “AB” indicates the double-substitution variant. The measured EC50 of double-substitution variants is from the library-scale measurement. Orange points mark double-substitution variants in which one of the single substitutions causes a greater than 2.5-fold change in EC50. Dark red points mark double-substitution variants in which both single substitutions cause a greater than 2.5-fold change in EC50. The EC50 of wild-type Lacl is marked with a black “X". For this analysis, only experimental data were used (no results from the DNN model). Also, only data from Lacl variants with low EC50 uncertainty were used (SD(log10(EC50)) ≤ 0.35). Points show the best consensus estimate for the parameter values as described in the Materials and Methods. Error bars for the measured result indicate ± one standard deviation estimated from the Bayesian posteriors; error bars for the log-additive result indicate ± one standard deviation propagated from the Bayesian posterior uncertainties of the single-substitution results. Data are from a single library-scale measurement.
[0009] FIG. 5 shows structural and genetic diversity of inverted and band- stop genotypes. (A, B) Protein structures showing the locations of amino acid substitutions associated with strongly inverted (A) and strong band-stop (B) phenotypes. For each, the operator-binding structure of Lacl is shown on the left (PDB ID: 1 LBG, available at http://www.rcsb. org/pdb/search/structidSearch.do?structureld=1 LBG), with the operator DNA at the bottom in light orange; the ligand-binding structure is shown on the right (PDB ID: 1 LBH, available at http://www.rcsb. org/pdb/search/structidSearch.do?structureld=1 LBH)i with IPTG in cyan. Both structures are shown with the view oriented along the protein dimer interface, with one monomer in light gray and the other monomer in dark gray. The locations of associated (i.e., high-frequency) amino acid substitutions are highlighted as red spheres, and secondary structures where inverted or band-stop variants have amino acid substitutions at a significantly higher frequency than the full library are shaded with different colors. For strongly inverted variants (A), helix 5 is shaded blue, helix 11 is shaded violet, and the residues near the ligand-binding pocket are shaded orange. For strong band-stop variants (B), helix 9 is shaded blue, and p-strand J is shaded violet. (C, D) Network diagrams showing relatedness among genotypes for strongly inverted (C) and strong band-stop (D) variants. Within each network diagram, larger polygonal nodes represent Lacl variants, with a colormap indicating the G0/G∞ or G0/Gmin ratio (see FIG. 1 E). The number of sides of the polygon indicates the number of amino acid substitutions relative to the wild type, and bold outlines indicate variants that were verified with flow cytometry. Smaller circular nodes represent specific amino acid substitutions, with connecting lines showing the substitutions for each variant. Bold red outlines on the substitution nodes indicate the associated substitutions shown as spheres in (A and B), and the shading of substitution nodes matches the shading used to highlight secondary structures in (A and B). [0010] FIG. 6 shows that a band-stop phenotype emerges from combinations of nearly silent amino acid substitutions. (A) Dose-response curves measured with flow cytometry for selected Lacl variants: wild-type Lacl (gray” X”s), a strong band-stop variant identified from the library with only three amino acid substitutions (R195H/G265D/A337D; blue diamonds), Lacl variants containing the single-substitution R195H (orange circles) and G265D (green circles), Lacl variant with the double-substitution R195H/G265D (red diamonds). The single-substitution R195H (orange) or G265D (green) results in sigmoidal dose-response curves similar to wild-type Lacl, but the combination of the two, R195H/G265D (red), results in a band-stop phenotype. The complete set of permutations of R195H, G265D, and A337D are shown in FIG. 29. (B) Location of the three amino acid substitutions found in a strong band-stop variant. The operator-binding structure of Lacl is shown on the left (PDB ID: 1 LBG, available at http://www.rcsb. org/pdb/search/structidSearch.do?structureld=1 LBG), with the operator DNA at the bottom in light orange; the ligand-binding structure is shown on the right (PDB ID: 1 LBH, available at http://www.rcsb. org/pdb/search/structidSearch.do?structureld=1 LBH), with IPTG in cyan. Amino acid positions R195 (orange), G265 (green), and A337 (purple) are highlighted as spheres. (C, D) Effects of individual amino acid substitutions associated with inverted and band-stop phenotypes. Each plot shows the joint effect of individual amino acid substitutions on two Hill equation parameters. The blue circles plotted with error bars show the effects of substitutions associated with the strongly inverted phenotype and the orange triangles plotted with error bars show the effects of substitutions associated with the strong band-stop phenotype. Most substitutions associated with the inverted phenotype cause a large shift in either EC50, G», or both, consistent with the biophysical requirements for inverting the dose-response curve. In contrast, most of the amino acid substitutions associated with the band-stop phenotype are nearly silent. Light blue circles and light orange triangles show the effects for all amino acid substitutions found in the sets of strongly inverted and strong band-stop variants, respectively. Dashed gray lines mark the wild-type parameter values. Plotted data includes a combination of direct experimental measurements and DNN model predictions. Error bars indicate ± one standard deviation estimated from the Bayesian posterior. Data are from a single library-scale measurement. [0011] FIG. 7 shows plasmid maps used for the library-scale and verification measurements. Both plasmids contain the p15A origin of replication (p15A), kanamycin resistance gene (KanR), and lacl coding DNA sequence (lacl). In both plasmids, the encoded Lacl protein transcriptionally regulates an output that was driven by the Ptacl promoter, lacO operator, and RiboJ transcriptional insulator. (A) The library plasmid (pTY 1 ) was used for measurement of the genotype and phenotype of the entire Lacl library. The Lacl library, including the coding DNA sequences of lacl variants and their corresponding barcodes, was cloned into pTY1. In pTY1 , the Lacl variants regulated the expression of a tetracycline resistance gene, tetA. Plasmid pTY1 also encoded yellow fluorescence protein (YFP), which was used for cloning purposes. (B) The verification plasmid (pVER) was used to verify the dose-response curves of over 100 variants from the library. To verify the dose-response curve of a Lacl variant, the coding DNA sequence for that variant was chemically synthesized and cloned into pVER, where the Lacl variant regulated the expression of YFP. The dose-response curve was then measured using flow cytometry.
[0012] FIG. 8 shows a comparison of fluorescence distribution of the Lacl library before and after fluorescence activated cell sorting (FACS) sorting. To generate a library in which most of the Lacl variants could function as allosteric repressors, FACS was used to select a portion of the library with low fluorescence in the absence of ligand. To allow comprehensive long-read sequencing of the library, the library size was further reduced by dilution of the FACS-selected library to create a population bottleneck of approximately 105 Lacl variants. The orange histogram bars show the fluorescence distribution for the library before the FACS sorting. The blue histogram bars show the fluorescence distribution for the final library used for the library-scale measurements. The bottom plot shows the same histogram as the top plot with the y-axis on a logarithmic scale.
[0013] FIG. 9 shows a diversity of phenotypes in the Lacl library. (A) Relative abundance of the various Lacl phenotypes found in the library. Variants with a “normal response” phenotype have dose-response curves qualitatively similar to the wild-type, with GO ≤ G∞ Normal response variants include variants with EC50 lower than the wild-type value (between approximately 1 μmol/L and 100 μmol/L) and variants with EC50 higher than the wild-type value (between approximately 100 μmol/L and 2000 pmoi/L). Variants with a “flat response” phenotype have flat dose-response curves with
Figure imgf000010_0001
Fiat response variants include always-off variants (G(0) ≤ 0.25 x G∞ ,wt; i.e., the IS phenotype from Markiewicz et al., Journal of Molecular Biology, 240, 421-433, 1994) and always-on variants (G(0) > 0.25 x G∞ ,wt; i.e. the I- phenotype from Markiewicz et al.). Variants with a “negative response” phenotype have dose-response curves with a negative slope,
Figure imgf000010_0002
for some portion of the measured concentration range. Negative response variants include band-stop, inverted, and band-pass variants. (B) Histogram of the number of variants with each Lacl phenotype found in the library as a function of the number of amino acid substitutions. The plots in (A) and (B) share the same legend. The outer ring of (A) and the left panel of (B) show the proportion of variants with dose-response curves that are normal, flat-response, or negative response. The inner ring of (A) and the right panel of (B) show more detailed descriptors of variant phenotypes.
[0014] FIG. 10 shows amino acid substitution count heat map. The heat map indicates the number of times each possible amino acid substitution was observed across the entire Lacl library, with the residue number as the x-axis and the possible amino acid substitutions as the y-axis. The wild-type amino acid at each residue is marked with an ‘X’. SNP-accessibie substitutions (possible via a single DNA base change per codon) are outlined with black squares. Substitutions that were not observed for any variant in the library are shown as white. Substitutions at residues 187 and 188 are under-represented because those codons contained a restriction site used during library and plasmid assembly. Most of the other SNP-accessible substitutions that were not observed were probably excluded by FACS selection during library preparation (see FIG. 8).
[0015] FIG. 11 shows example data for fitness and dose-response curves for normal phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant. The fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points. In the dose- response plots, the black lines show the results from the library-scale measurement using the Bayesian Hill equation model. The purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals. The plotted purple points show the dose-response curves from the flow cytometry verification measurements. Error bars indicate ± one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers.
[0016] FIG. 12 shows example data for fitness and dose-response curves for inverted phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant. The fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points. In the dose- response plots, the black lines show the results from the library-scale measurement using the Bayesian Hill equation model. The purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals. The plotted purple points show the dose-response curves from the flow cytometry verification measurements. Error bars indicate ± one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers.
[0017] FIG. 13 shows example data for fitness and dose-response curves for band-stop phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant. The fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points. In the dose- response plots, the black lines show the results from the library-scale measurement using the Bayesian Hill equation model. The purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals. The plotted purple points show the dose-response curves from the flow cytometry verification measurements. Error bars indicate ± one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers. [0018] FIG. 14 shows example data for fitness and dose-response curves for band-pass phenotype Lacl variants. Each pair of plots shows the fitness (top) and dose-response curve (bottom) for a different Lacl variant. The fitness data is from the library-scale landscape measurement. The fitness without tetracycline is shown as blue points and the fitness with tetracycline is shown as orange points. In the dose- response plots, the black lines show the results from the library-scale measurement using the Bayesian Hill equation model. The purple lines and shaded regions show the results from the library-scale measurement using the Bayesian Gaussian process (GP) model, where the line is the median GP results and the shaded regions indicate 50% and 90% credible intervals. The plotted purple points show the dose-response curves from the flow cytometry verification measurements. Error bars indicate ± one standard deviation estimated from the least-squares fit (fitness, top plots) or the Bayesian posterior (output, bottom plots), and are often within the markers.
[0019] FIG. 15 shows distributions of Hill equation parameters measured for the full library and the wild-type variants. In each plot, the distribution from the full Lacl library is shown in blue, the distribution from variants with the wild-type Lacl DNA coding sequence is shown in green, and the distribution from variants with synonymous mutations (i.e. wild-type amino acid sequence but non-wild-type DNA sequence) is shown in orange. The green and orange data are only shown for variants with zero unintended mutations in the plasmid (i.e. no mutations in regions of the plasmid other than the lacl coding DNA sequence). The Kolmogorov-Smirnov test was used to compare the distributions of Hill equation parameters between variants with the wild-type DNA sequence (green, 39 variants) and synonymous mutations (orange, 310 variants). The resulting p-values (0.71 , 0.40, 0.28, and 0.17 for GO, GM, EC50, and n respectively, Kolmogorov-Smirnov test) indicate that there were no significant differences between them.
[0020] FIG. 16 shows calibration data for the determination of dose- response curves using barcode sequencing. (A) Dose-response curves measured with flow cytometry for nine Lacl variants used to calibrate library-scale measurement. (B-C) The fitness impact of tetracycline (from the library-scale measurement) plotted vs. gene expression output (from flow cytometry of individual variants). The fitness impact of tetracycline is defined as the decrease in fitness (E. coll growth rate) measured with tetracycline vs. without tetracycline normalized by the fitness measured without tetracycline, i.e. (μtet0 - 1 ) from Eq. (9). (B) Data from a small-scale test library containing only the calibration variants. (C) Data from measurement of the full library. Results for different calibration variants are shown with different colored symbols. The solid black lines in show the results of a fit using Eq. (9). Error bars indicate ± one standard deviation estimated from the least-squares fit. The results for the full library measurement are noisier than the small-scale test because of the lower read count per variant (approximately 100-fold difference), but the general trend and the fits for both results are similar.
[0021] FIG. 17 shows gene expression measurement uncertainty for library- scale measurement of Lacl dose-response curves. The bottom plot shows the uncertainty vs. the median for the gene expression values, G(L), estimated using the Gaussian process (GP) inference model (Materials and Methods). Here, the uncertainty is the posterior standard deviation of log10(G(L)) from the Bayesian inference. The color map shows the relative density of data for all of the gene expression levels from the library-scale measurements (i.e. for every variant at every IPTG concentration). The uncertainty is relatively high for G(L) ≤ 104 because the fitness impact of tetracycline used for the measurement (top plot, taken from Appendix Figure S10c) is approximately constant over that range. To illustrate this further, the dark red dashed line in the bottom plot shows the inverse slope of the fitness impact curve. For G(L) between about 7,000 MEF and 25,000 MEF, the uncertainty is approximately proportional to the inverse slope, as expected. Outside that range, the uncertainly is no longer proportional to the inverse slope of the fitness impact curve because the posterior G(L) estimate is constrained by the prior used in the Bayesian inference to the range 100 MEF ≤ G(L) ≤ 50,000 MEF.
[0022] FIG. 18 shows a deep neural network (DNN) model. (A) Schematic illustration of recurrent deep neural network model architecture. Using the wild-type Lacl amino acid sequence as a starting point, the model predicts the Hill equation parameters for non-wild-type sequences by composing the individual parameter changes due to sequential amino acid substitutions along a mutational path. These changes are then added together to predict the final parameter values. All paths to a non-wild-type sequence converge to the same value, and this fact is leveraged to build a recurrent neural network model that learns to predict the effects of each substitution, given all previous substitutions. Potential non-additive effects are captured by the hidden state of the model, which predicts the change in parameter value for the most recent substitution and serves as a set of latent variables for predicting subsequent substitutions. Note that variants with intermediate sequences may be present in the library, but this is not necessary to train the model. The model will still learn to predict intermediate steps in the path, even if that data is not present. (B) Performance of recurrent DNN model compared with linear-additive model and feed-forward DNN model. The boxplot summarizes the R2 values for ten cross-validation tests for each model. The recurrent DNN model (green) outperforms both the linear-additive (orange) and feed-forward (blue) models, giving the highest R2 values for the Hill equation parameters GO and GM. For EC50, the recurrent DNN model and the linear-additive model give similar R2 values.
[0023] FIG. 19 shows a prediction of Lacl variant phenotypes with recurrent deep neural network (DNN) model. Measured vs. predicted values for each Hill equation parameter. The plotted results only show holdout data not used in model training. Predictions are taken from the variational posterior mean for each Hill equation parameter. Measured values are the posterior medians from the Bayesian fits to the experimental data. (A-B) GO and
Figure imgf000014_0001
are given in molecules of equivalent fluorophore (MEF) based on the calibration with flow cytometry data. (C) The cluster of points with measured EC50 near 100 μmol/L and predicted EC50 near 1000 μmol/L correspond to variants for which the DNN model provided better EC50 and G∞ estimates than the measurement, because EC50 was near or above the maximum ligand concentration used. For these points, the nearly constant dose-response across the measured concentration range resulted in a large uncertainty for EC50 and a posterior median near the median of the EC50 prior (100 μmol/L).
[0024] FIG. 20 shows a DNN model prediction uncertainty and root-mean- square error (RMSE). (A) Model prediction uncertainty for the base-ten logarithm of each Hill equation parameter as a function of the number of amino acid substitutions relative to the wild-type Lacl sequence. The boxplot shows the distribution of posterior standard deviation values for the set of variants with the indicated number of substitutions. (B) Model RMSE for each Hill equation parameter as a function of the number of substitutions relative to the wild-type sequence. RMSE is the root-mean- square difference between the model prediction and experimental measurement for the base-ten logarithm of each Hill equation parameter. The boxplot shows the distribution of RMSE values from the ten cross-validation tests for the recurrent DNN model. Solid lines with points show the RMSE forthe holdout data not used fortraining. Both the prediction uncertainty (A) and the RMSE (B) increase with increasing number of substitutions relative to the wild-type sequence. (C) Model RMSE as a function of the median prediction uncertainty for each Hill equation parameter. Each plotted point is for a different number of substitutions. The RMSE values are from the holdout data, i.e. the solid lines with points from (B). The median prediction accuracy values are from the distributions plotted in (A). The dashed line indicates the multiplicative factor (approximately 3.8) used to correct the uncertainty values.
[0025] FIG. 21 shows a comparison between the DNN model root- mean- square error (RMSE) and the measurement uncertainty for single mutants. The histograms show distributions of the measurement uncertainty for Lacl variants with single amino acid substitutions (relative to the wild type) for each Hill equation parameter. For comparison, the dashed line in each plot shows the RMSE for the DNN model predictions for single-substitution hold-out data. The model RMSE as a function of the number of substitutions is shown in FIG. 20B.
[0026] FIG. 22 shows measured vs. predicted values from DNN model for single amino acid substitutions. The plots compare the DNN model predictions (x- axis) with the measured values (y-axis) for the Hill equation parameters for Lacl variants with single amino acid substitutions. Blue symbols show data for ail of the single-substitution variants in the library. Orange symbols show data for variants with a high uncertainty for the measured EC50 (std(log10(EC50)) > 0.35). The EC50 uncertainty forthose variants was relatively high either because G∞ was similar to GO and/or because EC50 was near or above the maximum ligand concentration used (2048 μmol/L). For those variants, in the analysis of single-mutant effects, the DNN model result for G∞ and EC50 was used in place of the experimental result. Points marked with an 'x' were in the holdout data not used for model training. In all plots, GO and GM are given in molecules of equivalent fluorophore (MEF). [0027] FIG. 23 shows a heat map of single substitution effects on GO. The heat map indicates the change in GO for all amino acid substitutions measured as single substitutions in the Lacl library. The amino acid residue number is plotted as the x-axis and the possible amino acid substitutions as the y-axis. The wild-type amino acid at each residue is marked with an ‘X’. The 83 substitutions that are completely missing from the landscape dataset and that have been shown by previous work to result in constitutively high G(L)1 ,2 are each marked with a
Figure imgf000016_0001
[0028] FIG. 24 shows a heat map of single substitution effects on G∞ . The heat map indicates the change in G∞ for all amino acid substitutions measured as single substitutions in the Lacl library and/or predicted from the DNN model. The amino acid residue number is plotted as the x-axis and the possible amino acid substitutions as the y-axis. The wild-type amino acid at each residue is marked with an ‘X’. Substitutions with effects that are predicted from the DNN model are outlined with solid lines. Substitutions with high measurement uncertainty, where the DNN model result was used in place of the experimental result are outlined with dashed lines (see FIG. 22).
[0029] FIG. 25 shows a heat map of single substitution effects on EC50. The heat map indicates the change in EC50 for all amino acid substitutions measured as single substitutions in the Lacl library and/or predicted from the DNN model. The amino acid residue number is plotted as the x-axis and the possible amino acid substitutions as the y-axis. The wild-type amino acid at each residue is marked with an ‘X’. Substitutions with effects that are predicted from the DNN model are outlined with solid lines. Substitutions with high measurement uncertainty, where the DNN model result was used in place of the experimental result are outlined with dashed lines (see FIG. 22).
[0030] FIG. 26 shows a multiparametric impact of amino acid substitutions on allosteric function of Lacl. Each plot shows the joint effect of single amino acid substitutions on two Hill equation parameters. In each plot, substitutions that change both Hill equation parameters by less than 5-fold are shown as light gray points, and substitutions that change one or both Hill equation parameters by more than 5-fold are shown as red or blue points with error bars. As in FIG. 3, red indicates a decrease in Hill equation parameter and blue indicates an increase. The left half of each symbol and the y-error bar are colored based on the y-axis parameter; the right half of each symbol and the x-error bars are colored based on the x-axis parameter. The wild-type phenotype is indicated with a black ‘X’ in each plot. In all plots, GO and
Figure imgf000017_0001
are given in molecules of equivalent fluorophore (MEF). Error bars indicate ± one standard deviation estimated from the Bayesian posterior.
[0031] FIG. 27 shows a comparison between log-additive prediction and measurement for double-substitution variants for GO and G∞ . The log-additive GO and G®5 for double-substitution Lacl variants (i.e. two amino acid substitutions) was calculated assuming log-additivity of the effect of each single substitution relative to the wild-type, e.g. log( G∞ ,AB/ G∞ ,wt) = log(G∞ ,A/G∞ ,wt) + log(G∞ ,B/G∞ ,wt), where ‘wt’ indicates the wild-type, ‘A’ and ‘B’ indicate the single-substitution variants, and !AB’ indicates the double-substitution variant. Similar to the analysis of genotype- phenotype landscape data for green fluorescent protein, the log-additive predictions were constrained to within the minimum and maximum measured GO and G∞ values: 410 MEF ≤ GO, AB, G∞ ,AB ≤ 39,000 MEF. (A) Orange points mark double-substitution variants in which one of the single substitutions causes a greater than 10-fold change in GO. (B) Orange points mark double-substitution variants in which one of the single substitutions causes a greater than 2.5-fold change in G∞ , and dark red points mark double-substitution variants in which both single substitutions cause a greater than 2.5-fold change in G∞ . For GM, data were only used for variants with EC50 ≤ 500 μmol/L to ensure that the G(L) level was near saturation at the highest IPTG concentration used (2048 μmol/L). In each plot, the wild-type parameter value is marked with a black ‘X’. For this analysis, only experimental data was used (no results from the DNN model). Error bars for the measured result indicate ± one standard deviation estimated from the Bayesian posterior; error bars for the lag-additive result indicate ± one standard deviation propagated from the Bayesian posterior uncertainties of the single-substitution results.
[0032] FIG. 28 shows nearest neighbor distance histograms for inverted and band-stop Lacl variants. In each plot, the orange bars show the distribution of nearest neighbor Hamming distance for the amino acid sequences for strongly inverted (A) and strong band-stop (B) variants, and the blue bars show the distribution of nearest neighbor Hamming distance for a similar number of randomly selected sequences from the full library. The full-library histograms (blue bars) are averaged over 1000 iterations of randomly selected sequences.
[0033] FIG. 29 shows that the band-stop phenotype emerges from the combination of two amino acid substitutions. (A) Dose-response curves measured with flow cytometry for wild-type Lacl and a strong band-stop variant identified from the library with only three amino acid substitutions (R195H, G265D, A337D). (B-D) Dose-response curves for Lacl variants containing single- and double-substitution permutations of the three substitutions in the band-stop variant. Each plot shows two Lacl variants containing single substitutions and one Lacl variant containing both substitutions. The single substitutions R195H (orange) or G265D (green) result in sigmoidal dose-response curves similar to the wild-type, but the combination of the two, R195H/G265D (red), results in a band-stop phenotype (B). All other combinations of single- and double-substitutions result in sigmoidal dose-response curves similar to the wild-type (C-D).
[0034] FIG. 30 shows data for a measurement of OD600 vs. time for E. coll growth during initial, 12-hour incubation in automated culture protocol. Measurements are shown for all 96 wells in the growth plate during the library-scale measurement of Lacl dose-response curves. The OD600 measurement is not corrected for the zero- OD offset caused by the membrane (approximately 1.7).
[0035] FIG. 31 shows data for a measurement of OD600 vs. time for E. coll in Growth Plates 1-4. Measurements are shown for all 96 wells in each growth plate during the library-scale measurement of Lacl dose-response curves. (A) Growth Plate 1. (B) Growth Plate 2. (C) Growth Plate 3. (D) Growth Plate 4. The OD600 measurements are not corrected for the zero-OD offset caused by the membrane (approximately 1.7). Note that the flattening of the growth curves at time ~ 2.4 hours in (A) and (B), is not the onset of stationary phase. Instead, it is an indication of a diauxic shift which was used as a marker for the maximum cell density at which the cells remained in exponential growth phase. An example set of full growth curves is shown in FIG. 32.
[0036] FIG. 32 shows an example of OD600 vs. time measurement for a full E. coll growth curve. The inset shows a zoomed-in view of the short flat portion of the growth curves indicating the onset of a diauxic shift. This diauxic shift was used as a marker for the maximum ceil density at which cells remained in exponential growth.
[0037] FIG. 33 shows a flow cytometry gating example. (A) Side scatter vs. forward scatter plot before automated cell gating, showing both cell and non-cell detection events. (B) Side scatter vs. forward scatter plot after automated cell gating, showing only events most likely to be cell events. (C) Side scatter vs. side scatter area/height plot before automated singlet gating, showing both singlet and multiplet cell detection events. (D) Side scatter vs. side scatter area/height plot after automated singlet gating, showing only singlet cell detection events.
[0038] FIG. 34 lists a number of measured genotypes in library for different mutational distances from wild-type lack A characterization of the diversity found within the library in terms of in terms of single nucleotide polymorphisms of the wildtype lacl sequence.
[0039] FIG. 35 lists Lacl protein structural domains and features. Residues within 7 A of the ligand were determined using the crystal structure of Lacl bound to IPTG (PDB ID: 1 LBH). The dimer interface was taken as described by Lewis etal. plus residues 117 and 247-249, which are also along the dimer interface based on the published crystal structures (PDB IDs: 1 LBG and 1 LBH). The core pivot region was taken as described by Swint-Kruse et al.
[0040] FIG. 36 lists amino acid substitutions and structural domains or features associated with the inverted Lacl phenotype. The top portion of the table lists the amino acid substitutions that occur more frequently in the set of strongly inverted Lacl variants than in the full library, with a p-value cutoff of 5 x 10-3. The bottom portion of the table lists the structural domains or features where substitutions were found at higher frequency in the set of strongly inverted Lacl variants than in the full library, with a p-value cutoff of 2.5 x 10-2. The criteria used to select the set of strongly inverted variants and the hypergeometric test used to calculate p-values are described in the Example.
[0041] FIG. 37 lists amino acid substitutions and structural domains or features associated with the band-stop Lacl phenotype. The top portion of the table lists the amino acid substitutions that occur more frequently in the set of strong band- stop Lad variants than in the full iibrary, with a p-value cutoff of 5 x 10-2. The bottom portion of the table lists the structural domains or features where substitutions were found at higher frequency in the set of strong band-stop Lad variants than in the full library, with a p-vaiue cutoff of 2.5 x 10-2. The criteria used to select the set of strong band-stop variants and the hypergeometric test used to calculate p-values are described in the Example.
[0042] FIG. 38 lists cell growth in the 24 chemical environments used for library-scale measurement of Lacl dose-response curves. Cultures were grown at 37 °C with 0.5 mL culture volume per well and double-orbital shaking at 807 cycles per minute. The final OD600 values were calculated by subtracting the offset from the gas- permeable membrane used during growth and are given as the mean ± standard deviation across the four wells used for each chemical environment. The time intervals between the starts of the incubation cycles for sequential growth plates was 10043 s, 10045 s, and 10038 s. *Tetracycline was only added to Growth Plates 2-4.
[0043] FIG. 39 lists regions extracted from long-read sequencing data for library-scale analysis. The lacl region includes the lacl coding DNA sequence (CDS) and stop codon. The regulatory region includes the Placl and Ptacl promoters, the lacO operator, the riboJ insulator, and the RBS sites for both lacl and tetA. The tetA region includes the tetA CDS. The YFP region includes the YFP CDS and its RBS. The KAN region includes the transcriptional promoter and CDS for the kanamycin resistance gene as well as the transcriptional terminator for the genes regulated by Lacl (TL3S3P21 ). The origin of replication was split by the start of the PacBio circular consensus HiFi reads, so the last 32 bases of the origin of replication were grouped with the intergenic region between the origin and the barcodes. The start and end columns in the table give the nucleotide position of the start and end of each region based on the nominal full plasmid (GenBank ID: MT702633). The mutation rate column lists the estimated mean number of single nucleotide changes per 1000 bases for each region, using the nominal/wild-type sequence as the reference, and considers only single nucleotide polymorphisms (i.e. no indels). Regions adjacent to the barcodes and the lacl region have elevated mutation rates, due to the methods used for library generation and assembly. [0044] FIG. 40 lists PCR primers used during Lad library construction Sequences of primers used for PCRs during library construction.
[0045] FIG. 41 lists forward index primers for barcode amplification. Forward primer sequences used to amplify DNA barcodes from the plasmids. Underlined section anneals to plasmid. Bold section is the multiplexing tag sequence. N’s are randomized positions used as unique molecular identifiers to correct for PCR jackpotting.
[0046] FIG. 42 lists reverse index primers for barcode amplification. Reverse primer sequences used to amplify DNA barcodes from the plasmids. Underlined section anneals to plasmid. Bold section is the multiplexing tag sequence. N’s are randomized positions used as unique molecular identifiers to correct for PCR jackpotting.
[0047] FIG. 43 lists primers sequences for the second PCR. Primers used in the second PCR (15-cycle) to attach the standard Illumina paired-end adapter sequences and to amplify the resulting amplicons for sequencing. The underlined section anneals to the barcode amplification primers.
DETAILED DESCRIPTION
[0048] A detailed description of one or more embodiments is presented herein by way of exemplification and not limitation.
[0049] Biological systems are unparalleled in their ability to synthesize polypeptides of enormous sequence diversity from 20 natural amino acid building blocks. A polypeptide of 100 amino acids has 20100 possible combinations of mutations. Theory predicts 1060 possible small molecules in chemical space. These numbers are too large to explore by conventional drug discovery approaches. It would be advantageous to engineer specific mutations guided by an easily accessible biochemical metric such as a dose-response curve. Such would allow the generation, selection, and screening of combinatorial biopolymer libraries. Protein engineering is highly complex, making the incorporation of new chemical moieties into polypeptides difficult. Moreover, directed evolution is promising but limited in scope. Amino acids and non-canonical or non-natural amino acids (nnAAs; e g., modified amino acids) can be used for protein synthesis, which raises the complexity of protein engineering, and many conventional methods have difficulty selectively incorporating different non- natural amino acids.
[0050] Some conventional methods used in protein engineering have technical problems that include instability of linkers, post-translational labeling of ribosome-displayed libraries produced in transcription-translation lysates require complex and unique analytical QC of each library scaffold produced, and short transcription/translation times that are incompatible with complete labeling of translated polypeptides, leading to loss of fidelity due to mixtures of fully-reacted and unreacted amino acids for a single sequence. Furthermore, conventional systems for protein engineering are limited in their ability to generate both sufficient quantities of product and chemically complex libraries of intermediates for efficient encoded translation. These and other deficiencies of conventional methods of protein engineering are solved by the present invention. Advantageously, the present invention involves large libraries of mutant variants to drive protein engineering, wherein the chemical space is explored to a great depth around any given starting chemical structure and sequence.
[0051] It has been discovered that a method for engineering sense-and- response functions into living cells provides precisely defined dose-response curves. The method includes creation of a large library of genetic variants, e.g., about 100,000 or more. Measurement of the dose-response curve for every variant in the library can take approximately one day, which is faster and less expensive than conventional directed evolution methods. By examining the full set of dose-response curves and the matching DNA sequences, the method can allow selection of the DNA sequence that provides the precise dose-response that is required. Moreover, the method can include producing genetic sensors with precisely defined dose-response curves. It should be appreciated that a genetic sensor is a biological function, encoded in DNA, that enables cells to sense and respond to environmental signals by changing the level of gene expression in the cells. Exemplary signals include temperature, pH, or concentration of a specific chemical. Beneficially, the method identifies DNA sequences that encode a genetic sensor with a specific, specified, dose-response curve and can be used to identify DNA sequences for genetic sensors that respond specifically to a desired input signal.
[0052] In an embodiment, a process for determining a coding DNA sequence for a specific dose-response curve for a genetic sensor includes producing a coding DNA sequence library including a plurality of mutated variants of the coding DNA sequence (CDS) for the genetic sensor, such that each mutated variant includes a mutated coding DNA sequence of the coding DNA sequence: attaching a separate DNA barcode to each of the mutated variants in the coding DNA sequence library to produce a barcoded variant library of barcoded mutant variants, wherein each barcoded mutant variant in the barcoded variant library corresponds to one of the mutated variants having an attached DNA barcode; determining the DNA sequence of each barcoded mutant variant in the barcoded variant library; transforming the barcoded mutant variants in the barcoded variant library into cells; growing the cells in various cell cultures including different values of an input signal of the genetic sensor; extracting DNA from the cells in the cell cultures; determining the number of each DNA barcode in the DNA extracted from the cells; determining a fitness of the cells for each barcoded mutant variant; determining a mutant variant dose-response curve 213 for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant, to produce a mutant variant dose-response curve library comprising a plurality of the mutant variant dose-response curves; providing a target dose-response curve; determining which mutant variant dose- response curve, as a selected mutant variant dose-response curve, in the mutant variant dose-response curve library matches the target dose-response curve; and determining the mutated coding DNA sequence of the selected mutant variant dose- response curve.
[0053] In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes synthesizing the mutated coding DNA sequence. In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes making a mutated genetic sensor including the mutated coding DNA sequence. [0054] In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes producing a selection system for the genetic sensor, wherein the genetic sensor output modulates the fitness of the cells including the genetic sensor. It should be appreciated that the genetic sensor output is the expression level of the genes regulated by the genetic sensor. The selection system can be tuned so that the cell fitness varies in a measurable way across the range of genetic sensor output for the protein engineering goal. In an embodiment, the genetic sensor is the /ac repressor from E. coli, and the selection system can be set up by using the genetic sensor to regulate the expression level of a tetracycline resistance gene (tetA) while growing the cells in the presence of tetracycline (an antibiotic). When the genetic sensor output is low, cells have low fitness when grown with tetracycline. When the genetic sensor output is high, cells have high fitness when grown with tetracycline. To tune the selection system, adjust the ribosome binding site (RBS) used for the tetA gene (to adjust the amount of tetA protein made) and the concentration of tetracycline used during cell growth (to adjust the fitness impact). Here, adjust those two factors so that the cell fitness varies measurably over a range of genetic sensor output levels spanning a range from approximately 0.1 times to approximately 1 times the maximum output level. Accordingly, this allows measurement and engineering genetic sensors with output levels spanning that range.
[0055] In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes determining the DNA sequence of each barcoded mutant variant that includes subjecting the barcoded mutant variants of the barcoded variant library to high- throughput long-read DNA sequencing.
[0056] In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes, in growing the cells in various cell cultures including different values of the input signal of the genetic sensor, the input signal includes a concentration of a molecule sensed by the genetic sensor.
[0057] In an embodiment, in extracting the DNA from the cells in the cell cultures, the DNA is extracted during at least two different time points during growth of the cells in the cell culture. Determining the number of each DNA barcode in the DNA extracted from the cells can be performed at each of the different time points with high-throughput short-read DNA sequencing. Determining the fitness of cells containing each barcoded mutant variant can include calculating the fitness from the change in the number of each barcoded mutant variant over time.
[0058] In an embodiment, determining the mutant variant dose-response curve for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant incudes calculating each mutant variant dose-response curve from the fitness as a function of the input signal.
[0059] In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes calibrating the mutant variant dose-response curves with a spike-in control variant. In an embodiment, the process for determining the coding DNA sequence for the specific dose-response curve for a genetic sensor includes normalizing the amount of DNA extracted from the cells in the cell cultures with a spike-in control variant.
[0060] Producing the coding DNA sequence library can include site- saturation mutagenesis, chemical mutagens, error-prone PCR, propagation in error- prone microbes, homologous recombination, gene shuffling, and others. In an embodiment, the coding DNA sequence library is produced by mutagenic PCR.
[0061] In attaching the separate DNA barcode to each of the mutated variants, the DNA barcodes have a distinct DNA sequence for each variant. DNA barcodes can be added to the barcoded variant library (i.e. , the library of mutated DNA variants) with PCR using oligonucleotides containing sequence positions with randomized nucleotides. The length of the DNA barcode, i.e., the number of randomized DNA bases, is sufficient to create a barcode library much larger than the mutated CDS library.
[0062] In subjecting the barcoded mutant variants of the barcoded variant library to high-throuhgput long-read DNA sequencing, the high-throughput long-read DNA sequencing can read lengths at least great enough to span the CDS, the attached DNA barcode, and any DNA sequence regions between them. Such sequencing method has high-fidelity to confidently assign variant sequences to the DNA barcode. [0063] In transforming the barcoded mutant variants in the barcoded variant library into the cells, the barcoded variant library can be transformed and maintained in a host organism via extrachromosomal methods (e.g., a DNA plasmid) or artificial chromosome (e.g., bacterial artificial chromosome or yeast artificial chromosome, and the like) or integrated into a genomic chromosome of the host organism (via CRISPR/cas9, homologous recombination, and the like). In an embodiment, DNA plasmids are used to transform the barcoded mutant variant into a host organism. Transformation of appropriate cell hosts with a DNA construct of the present invention is accomplished by well-known methods that typically depend on the type of vector used. With regard to transformation of prokaryotic host cells, see, e.g., Cohen, S. N. et al., Proc. Natl. Acad. Sci. U.S.A 69 (1972): 2110-2114. Transformation of can also follow that described by Beggs, J. D., Nature 275 (1978): 104-109. With regard to vertebrate cells, reagents useful in transfecting such cells, for example calcium phosphate and DEAE-dextran or liposome formulations. Electroporation is also useful for transforming or transfecting cells and is well known in the art for transforming yeast cell, bacterial cells, insect cells and vertebrate cells.
[0064] Successfully transformed cells, i.e., cells that contain a DNA construct such as the barcoded mutant variant, can be identified by well-known techniques such as PCR. Alternatively, the presence of the protein in the supernatant can be detected using antibodies.
[0065] Various biotechnical methods that can be used in the methods described here and background information is described in U.S. Patent No. 10,526,386, issued Jan. 7, 2020, the content of which is herein incorporated by reference in its entirety.
[0066] DNA barcode can refer to a polynucleotide sequence that can include various labels. A DNA barcode can be a polynucleotide sequence that can be used for DNA barcoding. DNA barcode can be used to quantify targets within a sample. DNA barcode can be used to control for errors which may occur after a label is associated with a target. For example, a DNA barcode can be used to assess amplification or sequencing errors. A DNA barcode associated with a target (e.g., a mutated variant) can be called a barcoded mutant variant. [0067] DNA barcoding can refer to the random labeling (e.g., barcoding) of nucleic acids. DNA barcoding can utilize a recursive Poisson strategy to associate and quantify labels associated with targets. DNA barcoding can be used interchangeably with DNA labeling.
[0068] The DNA barcode can include one or more labels. Exemplary labels include a universal label, a cellular label, a molecular label, a sample label, a plate label, a spatial label, or a pre-spatial label. A DNA barcode can comprise a 5'amine that may link the DNA barcode to a solid support. The DNA barcode can include a universal label, a cellular label, or a molecular label. The universal label can be a 5’- most label. The molecular label may be the 3’-most label. In some instances, the universal label, the cellular label, and the molecular label are in any order. The DNA barcode can include a target-binding region. The target-binding region can interact with a target (e.g., mutated variant) in a sample. In some instances, the labels of the DNA barcode (e.g., universal label, dimension label, spatial label, cellular label, and molecular label) may be separated by 1 , 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 , 12, 13, 14, 15, 16, 17, 18, 19, or 20 or more nucleotides.
[0069] One or more nucleic acid amplification reactions can be performed to make copies of the nucleic acid sequences such as CDS. Amplification can be performed in a multiplexed manner, wherein multiple target nucleic acid sequences are amplified simultaneously. The amplification reaction can be used to add sequencing adaptors to the nucleic acid molecules. The amplification reactions can include amplifying at least a portion of a barcode, CDS, or combination thereof.
[0070] In some embodiments, amplification can be performed using a polymerase chain reaction (PCR). As used herein, PCR can be a reaction for the in vitro amplification of specific DNA sequences by the simultaneous primer extension of complementary strands of DNA. PCR can include derivative forms of the reaction, including but not limited to, RT-PCR, real-time PCR nested PCR, quantitative PCR, multiplexed PCR, digital PCR and assembly PCR.
[0071] Amplification of the labeled nucleic acids can comprise non-PCR based methods. Examples of non-PCR based methods include, but are not limited to, multiple displacement amplification (MDA), transcription- mediated amplification (TMA), whole transcriptome amplification (WTA), whole genome amplification (WGA), nucleic acid sequence-based amplification (NASBA), strand displacement amplification (SDA), real-time SDA, rolling circle amplification, or circle-to-circle amplification. Other non-PCR-based amplification methods include multiple cycles of DNA-dependent RNA polymerase-driven RNA transcription amplification or RNA-directed DNA synthesis and transcription to amplify DNA or RNA targets, a ligase chain reaction (LCR), and a Qβ replicase ( Qβ) method, use of palindromic probes, strand displacement amplification, oligonucleotide-driven amplification using a restriction endonuclease, an amplification method in which a primer is hybridized to a nucleic acid sequence and the resulting duplex is cleaved prior to the extension reaction and amplification, strand displacement amplification using a nucleic acid polymerase lacking 5’ exonuclease activity, rolling circle amplification, and ramification extension amplification (RAM). In some instances, the amplification may not produce circularized transcripts.
[0072] Conducting the amplification reactions can include use of primers that can include, e.g., a select number nucleotides.
[0073] Any amplification scheme can be used. For example, in one scheme, the first round PCR can amplify molecules (e.g., attached to the bead) using a gene specific primer and a primer against the universal Illumina sequencing primer 1 sequence. The second round of PCR can amplify the first PCR products using a nested gene specific primer flanked by Illumina sequencing primer 2 sequence, and a primer against the universal Illumina sequencing primer 1 sequence. The third round of PCR adds P5 and P7 and sample index to turn PCR products into an Illumina sequencing library. Sequencing using 150 bp*2 sequencing can reveal the cell label and molecular index on read 1 , the gene on read 2, and the sample index on index 1 read.
[0074] Amplification can be performed in one or more rounds. In some instances there are multiple rounds of amplification. Amplification can include two or more rounds of amplification. The first amplification can be an extension to generate the gene specific region. The second amplification can occur when a sample nucleic hybridizes to the newly generated strand. [0075] In some embodiments hybridization does not need to occur at the end of a nucleic acid molecule. In some embodiments a target nucleic acid within an intact strand of a longer nucleic acid is hybridized and amplified. A target can be more than 50 nt, more than 100 nt, or more that 1000 nt from an end of a polynucleotide.
[0076] The methods disclosed herein can include counting the number of molecular labels (e.g., DNA barcodes) associated with such moieties such the mutant variant. Counting the molecular labels can be conducted in a variety of ways, e.g., by sequencing the molecular labels. It should be appreciated that to count the number of molecular labels sequencing part of the nucleic acid molecules generated from reverse transcription, second-strand synthesis, or amplification can be sufficient.
[0077] Any suitable sequencing method known in the art can be used, preferably high-throughput approaches. For example, cyclic array sequencing using platforms such as Roche 454, Illumina Solexa, ABI-SOLiD, ION Torrent, Complete Genomics, Pacific Bioscience, Helicos, or the Polonator platform, can be used. Sequencing can include MiSeq sequencing. Sequencing may include HiSeq sequencing.
[0078] In some embodiments, counting the molecular labels can be conducted by hybridization. For example, the molecular labels can be counted by- hybridization to a microarray. In some embodiments, the microarray comprise a plurality of probes that specifically binds to the molecular labels associated with the one or more reference genes in a sample or spike-in.
[0079] In growing the cells in various cell cultures for different values of the input signal of the genetic sensor, the number of values used for the input signal can be at least two or large enough to distinguish different dose-response curves for protein engineering goals. In an embodiment, 12 different input signal levels are concentrations of a selected molecule (e.g., isopropyl p-d-1 -thiogalactopyranoside, IPTG). The cells can be grown in environments with and without selective pressure (e.g., tetracycline). This can provide improved accuracy for the measurement since some fitness effects can arise that are not due to changes in the dose-response curves. A mutated genetic sensor CDS could encode for proteins that are detrimental to the host cells. It is contemplated that cells can be grown in environments with and without pressure so that the difference between the fitness with and without the pressure is used to determine the sensor dose-response curves. To simplify data analysis, the cells can be maintained in exponential growth for the full measurement.
[0080] Extracting DNA from the cells in the cell cultures can include a variety of DNA extraction methods. In an embodiment, solid phase extraction is used to remove plasmid DNA from the cells.
[0081] In determining the number of each DNA barcode in the DNA extracted from the cells that is performed at each of the different time points with high- throughput short-read DNA sequencing, various pre-counting steps can be inolved that can include sequencing sample amplification following loading samples into sequencing instrument. It is contemplated that high-throughput DNA sequencing can be short-read sequencing of the bar code that includes, e.g., from 10 bases to 200 bases, but is not so limited in some embodiments. The high-throughput DNA sequencing reads the barcode DNA sequences. It is contemplated that the sequencing method provides as many reads as possible to reduce the uncertainty associated with counting each barcode. As part of the high-throughput DNA sequencing method, the DNA barcodes can be isolated and amplified using PCR, e.g., according to the Example. Automation of data processing can be achieved by software packages that analyze raw sequencing data to produce a count for each barcode in each sample. An exemplary software package is Bartender, which is a C++ tool that is designed to process random barcode data, available github.com/LaoZZZZZ/bartender-1 .1 .
[0082] In calculating the fitness from the change in the number of each barcoded mutant variant over time, e.g., over the at least two different time points, if the cells are maintained in exponential growth while over the time periods of DNA extraction, the fitness of cells containing each barcode can be calculated using a linear fit to the logarithm of each barcode count as a function of time. Alternately, a model can be used that accounts for the time lag of fitness effects, as necessary, such as according to the Example. To improve accuracy of the fitness calculation, a variant with known fitness can be used to normalize the barcode count data, e.g., as described in the Example. For accuracy sake, the variant with known fitness can be spiked into the cel! culture during cell growth. It is contemplated such variant can be added to the culture with a higher cell number than other variants.
[0083] In determining the mutant variant dose-response curve for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant and calculating each mutant variant dose-response curve from the fitness as a function of the input signa, to calibrate the calculation of dose-response from fitness, the fitness and dose-response of several variants can be independently measured. Those calibration variants can be randomly chosen from the library (e.g., plate cultures and pick individual colonies) or synthesized. For accuracy- sake, such can have dose-response curves qualitatively similar to the dose response for the engineering goals. If qualitatively similar dose-response curves are not available, either a set of variants with constitutive output (i.e., a constant dose response) at different levels or a set of variants responsive to a different input signal can be used for calibration. For example, calibration variants can be used that are randomly selected from a library (e.g., a lac repressor library), wherein the fitness and dose-response curves are used with a select pressure (e.g., IPTG) as the input signal for calibration.
[0084] In determining which mutant variant dose-response curve, as the selected mutant variant dose-response curve, in the mutant variant dose-response curve library matches the target dose-response curve, data resulting from prior steps can be stored in a table that matches each CDS variant with its attached DNA barcode and the dose-response curve for each DNA barcode. The collated data can be used to identify the CDS for any variants with the desired dose-response curve. If none of the variants in the library have the desired dose-response curve, the large-scale dataset generated from this process can be used to characterize systematic patterns relating CDS mutations to changes in the dose-response. Methods including simple linear models of deep learning models can be used for this. Variants in the library with dose-response curves close to the desired curve can then be combined with additional mutations to give the desired dose-response. It should be appreciated that a mutant variant dose-response curve matches the target dose-response curve for the mutant variant dose-response curve library, when the mutant variant dose-response curve most closely matches the target dose-response curve, even if not identical. Moreover, it is contemplated that among all mutant variant dose-response curves 213, several closest matching curves can be selected even though not identical, based on a selection criteria such as selecting the top 1 % of closest matching mutant variant dose- response curves. Matching can be based on a statistical metric such as least squares minimization or smallest average residual from the target dose-response curve.
[0085] For synthesizing the mutated coding DNA sequence, various well- known methods or instruments can be used. Exemplary methods for synthesizing DNA sequences, including mutated coding DNA sequence, are solid phase phosphoramidite-techniques. Certain solid phase phosphoramidite-techniques involve sequential de-protection and synthesis of sequences built from phosphoramidite reagents corresponding to natural (or non-natural) nucleic acid bases. Phosphoramidite nucleic acid synthesis can limit a length in the number of base pairs in a nucleic acid synthesizer. There are many methods that can be used. Many are commercially available.
[0086] Advantageously the methods described herein engineer genetic sensors with specific dose-response curves. Conventional methods for engineering sensors lack quantitatively specified sensitivity, and the methods herein are deemed to be faster, less expensive, and less likely to fail than conventional methods. More specifically, conventional methods for engineering genetic sensors require both positive and negative selection. Positive selection ensures that the sensor output switches ‘on’ when desired. Negative selection ensures that the sensor output switches off’ when desired. Implementing both positive and negative selection for the same sensor is difficult, time consuming, and prone to failure. The methods herein overcome these limitations and involve only one of positive or negative selection. In addition, conventional methods for engineering genetic sensors may require multiple repeated rounds of selection. Beneficially, the methods herein can be implemented with a single round of selection, e.g., genetic sensors can be engineered with the instant methods and have a sensitivity (EC50) within 1 .25-fold of a targeted value using a single round of selection.
[0087] Methods herein can be used to make genetic sensors that have applications including: living therapeutics that live in a patient’s gut and sense a disease state and provide a treatment specifically in response; microbes or other cells engineered as environmental or process-monitoring sensors; or microbes or other cells engineered to grow nano- or micro-structured materials. The methods herein provide routine and scalable engineering of genetic sensors with quantitatively specified dose-response curves. Consequently, the dose response of the genetic sensors can be chosen to match the needs of the larger engineering project, resulting in a higher success rate and a more rapid time to realization. Additionally, the methods here can engineer genetic sensors that respond specifically to new input signals (e.g. , different molecules).
[0088] The articles and processes herein are illustrated further by the following Example, which is non-limiting.
EXAMPLE
[0089] Allostery is a fundamental biophysical mechanism that underlies cellular sensing, signaling, and metabolism, but a quantitative understanding of allosteric genotype-phenotype relationships remains elusive. This Example describes the large-scale measurement of the genotype-phenotype landscape for an allosteric protein, the lac repressor from Escherichia coli, Lacl. Quantitative measure of the dose-response curves for nearly 105 variants of the Lacl genetic sensor was made by a method that combines long-read and short-read DNA sequencing. The resulting data provide a quantitative map of the effect of amino acid substitutions on Lacl allostery and reveal systematic sequence-structure-function relationships. Allosteric phenotypes can be quantitatively predicted with additive or neural-network models, but unpredictable changes also occur, such as a new band-stop phenotype that challenges conventional models of allostery and that emerges from combinations of nearly silent amino acid substitutions.
[0090] Here, a large-scale approach is used to measure the dose-response curves of more than 60,000 variants of the lac repressor. Results reveal systematic sequence-structure-function relationships underlying allostery, as well as a surprising diversity of allosteric phenotypes. Long-read and short-read DNA sequencing is used to measure the genotype and dose-response phenotype of 62,000 variants of the lac repressor. The effects of single amino acid substitutions reveal systematic sequence-structure-function relationships underlying Lacl allostery that may apply to allosteric proteins more generally. Novel phenotypes emerge with few substitutions, including variants with inverted dose-response curves and variants with biphasic dose- response curves.
[0091] Allostery is an inherent property of biomolecules that underlies cellular regulatory processes including sensing, signaling, and metabolism (Fenton, 2008; Motlagh et al, 2014; Razo-Mejia et at, 2018). With allosteric regulation, ligand binding at one site on a biomolecule changes the activity of another, often distal, site. Switching between active and inactive states provides a sense-and- response function that defines the allosteric phenotype. Quantitative descriptions relating that phenotype to its causal genotype would improve our understanding of cellular function and evolution, and advance protein design and engineering (Raman et al, 2014; He & Liu, 2016; Huang et al, 2016). However, the intramolecular interactions that mediate allosteric regulation are complex and distributed widely across the biomolecular structure, making the development of general quantitative descriptions challenging.
[0092] Some genotype-phenotype landscape approaches have enabled the phenotypic characterization of 104-105 genotypes simultaneously (Li et al , 2016; Puchta et al, 2016; Sarkisyan et at, 2016; Domingo et al, 2018; Li & Zhang, 2018; Pressman et at, . 2019). Measurements at this scale facilitate the exploration of genotypes with widely distributed mutations, making them ideal for probing complex biological mechanisms like allostery. However, to quantitatively characterize the sense-and-response phenotypes inherent to allostery, a measurement must encompass the full dose-response curve that describes biomolecular activity as a function of ligand concentration.
[0093] Genetic sensors have served as a model of allosteric regulation for decades, and today are central to engineering biology. Genetic sensors are allosteric proteins that regulate gene expression in response to stimuli, giving cells the ability to regulate their metabolism and respond to environmental changes. Like other allosteric biomolecules, the lac repressor, Lacl, switches between an active state and an inactive state. In the active state, Lacl binds to a DNA operator upstream of regulated genes, preventing transcription. Ligand binding to Lacl stabilizes the inactive (non- operator-binding) state that allows transcription to proceed. This switching results in the allosteric phenotype that is quantitatively defined by a dose-response curve relating the concentration of input iigand (L) to the output response (the expression level of regulated genes, G). Genetic sensors typically have sigmoidal dose-response curves following the Hill equation:
Figure imgf000035_0001
where G0 is basal gene expression in the absence of ligand, G∞ is gene expression at saturating ligand concentrations, EC50 is the effective concentration of ligand that results in gene expression midway between G0 and G∞, and the Hill coefficient, n, quantifies the steepness of the dose-response curve (FIG. 1 E).
[0094] As a framework to relate changes in the dose-response curve to the underlying biophysics of the Lacl protein, there is recently described biophysical models that extend the general Monod-Wyman-Changeux (MWC) model of allostery (Monod et al, 1965) to the case of allosteric transcription factors (Daber et al, 2011 ; Razo-Mejia et al, 2018; Chure et al, 2019). Within those models, the dose-response curve depends on several biophysical parameters, including ligand-binding affinity, operator-binding affinity, and the allosteric constant, which is the equilibrium ratio between the inactive and active states in the absence of ligand and DNA operator (Monod etal, 1965; Daber et al, 2011 ; Razo-Mejia etal, 2018; Chure et al, 2019). The amino acid sequence (and corresponding structure) sets these biophysical parameters, and thus, amino acid substitutions can change these parameters (Daber et al, 2011 ; Chure et al, 2019). However, in the absence of data, the effect of any particular substitution on the biophysical parameters is unpredictable. Furthermore, substitutions distal to the active sites of a biomolecule can strongly affect allosteric function (Taylor et al, 2016; Leander et al, 2020). Consequently, to develop a more predictive understanding of allostery involves large-scale, quantitative measurements of changes to an allosteric dose-response curve resulting from wide- spread substitutions.
[0095] Measuring the genotype-phenotype landscape [0096] To measure the genotype-phenotype landscape for the allosteric Lacl sensor, we first created a library of Lacl variants using error-prone PCR and attached a DNA barcode to the coding DNA sequence (CDS) of each variant (FIG. 1A). We used error-prone PCR across the full lacl CDS to investigate the effects of higher-order substitutions spread across the entire Lacl sequence and structure. We then inserted the barcoded library into a plasmid where Lacl regulates the expression of a tetracycline resistance gene (FIG. 34A). Consequently, in the presence of tetracycline, the Lacl dose-response modulates cellular fitness (i.e., growth rate) based on the concentration of the input ligand isopropyl-p-d-thiogalactoside (IPTG). We then transformed the library into Escherichia coli for the landscape measurement (FIG. 1 B). To ensure that most variants in the library could regulate gene expression, we used fluorescence-activated cell sorting (FACS) to enrich the library for variants with low G0 (FIG. 8). Then, using high-accuracy, long-read sequencing (Wenger et al, 2019), we determined the genotype for every variant in the library and indexed each variant to its attached DNA barcode (FIG. 1 A).
[0097] The library contained 62,472 different Lacl genotypes, with an average of 7.0 single nucleotide polymorphisms (SNPs) per genotype. Many SNPs were synonymous, i.e., coded for the same amino acid, so the library encoded 60,398 different amino acid sequences with an average of 4.4 amino acid substitutions per variant (FIG. 9B, the number of variants in the library at each mutational distance from the wild type are listed FIG. 34, and the number of observations of each amino acid substitution in the library is shown in FIG. 10).
[0098] To quantitatively determine the allosteric phenotype for every Lacl variant in the library, we developed a new method to characterize the dose-response curves for large genetic sensor libraries. Briefly, we grew E. coll containing the library in 24 chemical environments (12 ligand concentrations, each with and without tetracycline). We used short-read sequencing of the DNA barcodes to measure the relative abundance of each variant at four timepoints during growth (FIG. 1 C). We then used the changes in relative abundance to determine the fitness associated with each variant in each environment (FIG. 1 D). Finally, for each variant in the library, we used the fitness difference (with vs without tetracycline) from all 12 ligand concentrations to quantitatively determine the dose-response curve using Bayesian inference (FIG. 1 E). Most variants had sigmoidal dose-response curves (e.g., FIG. 38 and FIG. 39), which we analyzed using a Hill equation-based inference model to quantitatively determine the Hill equation parameters and their associated uncertainties. Some variants had non-sigmoidal dose-response curves (e.g., FIG. 40 and FIG. 41), so we also analyzed all of the variants using a non-parametric Gaussian process (GP) inference model.
[0099] We compared the distributions of the resulting Hill equation parameters between two sets of variants: 39 variants with exactly the wild-type CDS for Lacl (but with different DNA barcodes) and 310 variants with synonymous nucleotide changes (i.e. , the wild-type amino acid sequence, but a non-wild-type DNA coding sequence). Using the Kolmogorov-Smirnov test, we found no significant differences between the two sets (P-values of 0.71 , 0.40, 0.28, and 0.17 for G0, G«, EC50, and n, respectively, FIG. 42). So, for all subsequent analyses we considered only amino acid substitutions.
[00100] To evaluate the accuracy of the new method for library-scale dose- response curve measurements, we independently verified the results for over 100 Lacl variants from the library. For each verification measurement, we chemically synthesized the CDS for a single variant and inserted it into a plasmid where Lacl regulates the expression of a fluorescent protein (FIG. 34B). We transformed the plasmid into E. coli and measured the resulting dose-response curve with flow cytometry (e.g., FIG. 1 E). We compared the Hill equation parameters from the library- scale measurement with those same parameters determined from flow cytometry measurements for each of the chemically synthesized Lacl variants (FIG. 2A-D). This served as a check of the new library-scale method's overall ability to measure dose- response curves with quantitative accuracy. The accuracy for each Hill equation parameter in the library-scale measurement was 4-fold for G0, 1 .5-fold for G∞, 1 .8-fold for EC50, and ± 0.28 for n. For G0, G∞, and EC50, we calculated the accuracy as: exp[RMSE(ln(x))]expRMSEInx, where RMSE(ln(x))RMSEInx is the root-mean- square difference between the logarithm of each parameter from the library-scale and cytometry measurements. For n, we calculated the accuracy simply as the root-mean- square difference between the library-scale and cytometry results. The accuracy for the gene expression levels (G0 and G∞) was better at higher gene expression levels (typical for G∞) than at low gene expression levels (typical for G0), which is expected based on the non-linearity of the fitness impact of tetracycline (FIG. 16 and FIG. 17). Measurements of the Hill coefficient, n, had high relative uncertainties for both barcode sequencing and flow cytometry, so the parameter n was not used in any quantitative analysis. Overall, the flow cytometry results demonstrated that our experimental method measures dose-response curves with both high qualitative and quantitative accuracy (FIG. 2A-D, FIG. 11 , FIG. 12, FIG. 13, and FIG. 14).
[00101] Effects of amino acid substitutions on Lacl phenotype
[00102] During library construction, we chose the mutation rate to simultaneously achieve two objectives: exploration of a broad genotype-phenotype space, and acquisition of the single amino acid substitution data most useful for building quantitative biophysical models of allosteric function (Monod et al, 1965; Razo-Mejia etal, 2018; Chure etal, 2019). Starting from the wild-type DNA sequence for Lacl, there were 2,110 possible SNP-accessible amino acid substitutions. Most of those substitutions were present in one or more variants within the library; however, nearly half were found only in combination with other substitutions. So, to comprehensively determine the impact of single amino acid substitutions, we constructed a deep neural network model (DNN) capable of accurately predicting the Hill equation parameters for Lacl variants that were not directly measured. We tested two different neural network architectures: a recurrent DNN and a more conventional feed-forward DNN, as well as a linear-additive model. Of the three models, the recurrent DNN model provides the best predictive performance for each of the Hill equation parameters, though for EC50, the recurrent DNN and linear-additive models have similar performance (FIG. 18). So, for subsequent analysis, we used the recurrent DNN model, which captures the context dependence of amino acid substitution effects (FIG. 18). In addition, to estimate uncertainties for the model predictions, we used approximate Bayesian inference methods (Hochreiter & Schmidhuber, 1997).
[00103] We trained the DNN model to predict the Hill equation parameters G0, G∞, and EC50 (FIG. 19), the three Hill equation parameters that were determined with relatively low uncertainty by the library-scale measurement. To evaluate the accuracy of the model predictions, we used the root-mean-square error (RMSE) for the model predictions compared with the measurement results. We calculated RMSE using only held-out data not used in the model training, and the split between held-out data and training data was chosen so that all variants with a specific amino add sequence appear in only one of the two sets. For all three parameters, the RMSE for the model predictions increases with the number of amino acid substitutions relative to the wild type (FIG. 20). For single-substitution variants, the model RMSE is comparable to the experimental measurement uncertainty (FIG. 21). So, we could confidently integrate the experimental and DNN results to provide a nearly complete map of the effects of SNP-accessible amino acid substitutions. Furthermore, by integrating information about the causal substitutions from multiple genetic backgrounds, the model provided improved estimates of EC50 and G∞ for variants with EC50 near or above the maximum ligand concentration measured (FIG. 22).
[00104] The resulting map of single-substitution effects includes quantitative point estimates and uncertainties of the Hill equation parameters for 94% of the possible SNP-accessible amino acid substitutions (1 ,991 of 2,110: 964 directly from measured data, and 1 ,027 from DNN predictions; FIG. 23, FIG. 24, FIG. 25). Most of the 119 substitutions missing from the dataset were probably excluded by FACS during library preparation because they cause a substantial increase in G0. These include 83 substitutions that have been shown to result in constitutively high G(L) (Markiewicz et al, 1994; Pace et al, 1997). Of the 1 ,991 substitutions included in the dataset, 38% measurably affect the dose-response curve (beyond a 95% confidence bound).
[00105] The Lacl protein has 360 amino acids arranged into three structural domains (Lewis etal, 1996; Flynn etal, 2003; Swint-Kruse etal, 2003). The first 62 N- terminal amino acids form the DNA-binding domain, comprising a helix-turn-helix DNA-binding motif and a hinge that connects the DNA-binding motif and the core domain. The core domain, comprising amino acid positions 63-324, is divided into two structural subdomains: the N-terminal core and the C-terminal core. The full core domain forms the ligand-binding pocket, core-pivot region, and dimer interface. The tetramerization domain comprises the final 30 amino acids and includes a flexible linker and an 18 amino acid a-helix (FIG. 3, FIG. 35). Naturally, Lacl functions as a dimer of dimers: Two Lacl monomers form a symmetric dimer that further assembles into a tetramer (a dimer of dimers). [00106] Regarding data in FiG. 3, in (A) and (D), gray-pink spheres and points indicate positions for substitutions that are completely missing from the library-scale dataset reported here and that have been shown by previous work to result in constitutively high G(L) (Markiewicz et al, 1994; Pace et al, 1997). In (D-F), points show the best consensus estimate for the parameter values. G0 and G∞ are reported in molecules of equivalent fluorophore (MEF) based on the calibration with flow cytometry measurements. Error bars indicate ± one standard deviation estimated from the Bayesian posteriors. Data are from a single library-scale measurement.
[00107] The effect of any amino acid substitution depends strongly on its location within the protein structure, indicating systematic sequence-structure-function relationships underlying Lacl allostery (FIG. 3). For example, substitutions that increase the basal expression, G0, by more than 5-fold that were not excluded by FACS are located either in helix 4 of the DNA-binding domain, along the dimer interface, in the tetramerization helix, or at the protein start codon (FIG. 3A and D). G0 quantifies gene expression in the absence of ligand. So, within the biophysical models, substitutions that affect G0 must alter either the operator-binding affinity, the allosteric constant, or the copy number of Lacl proteins per cell (Daber et al, 2011 ; Razo-Mejia et al, 2018; Chure et al, 2019). Substitutions at the first and second codons (M1 I, M1T, and, K2E) probably reduce the Lacl copy number (Bivona et at, 2010; Hecht et at, 2017). But the other substitutions that affect G0 (R51C, Q54K, L56M, T68N, S70C, L71 Q, A92S, F226V, S322P, and Q352L) almost certainly change the operator-binding affinity, the allosteric constant, or both.
[00108] Interestingly, substitutions in helix 4 (R51C, Q54K, and L56M) that increase G0 also decrease EC50 approximately 10-fold, consistent with a change in the allosteric constant favoring the inactive state (Chure etal, 2019) (FIG. 26A). Helix 4 forms part of the hinge connecting the DNA-binding motif to the core domain. It changes from a disordered coil to an order helix only upon binding of Lacl to its cognate DNA operator, and interactions between the helix 4 residues of each Lacl monomer have been shown to stabilize helix formation (Spronk et al, 1996) and therefore the active state of Lacl. So, although helix 4 is more closely associated with the DNA-binding domain of Lacl, the observed substitutions in helix 4 probably disrupt those interactions, changing the allosteric constant in a way that favors the inactive (non-operator-binding) state.
[00109] The remaining substitutions that increase G0 (T68N, S70C, L71Q, A92S, F226V, S322P, and Q352L) are far from the DNA-binding domain. So, they most likely affect the allosteric constant. Substitutions T68N, S70C, and L71Q, which are near the dimer interface, aiso decrease EC50 between 4-fold and 30-foid (FIG. 26A), similar to the substitutions in helix 4. Targeted molecular dynamic simulations have suggested that interactions between the L71 backbone and Q78' (on the opposite monomer) stabilize the active state (Flynn et al, 2003). Substitutions at position L71 might disrupt these interactions, shifting the allosteric constant to favor the inactive state. The substitution L71Q, which replaces the hydrophobic leucine with a hydrophilic glutamine, causes the largest change (20-fold increase in G0 and 14-fold decrease in EC50), likely due to perturbation of the local hydrophobic environment at the dimer interface. Our results for hydrophobic substitutions at this position (L71V and L71 M) support this picture, with just a 3-fold to 4-fold reduction in EC50 (and little change to G0), consistent with a smaller shift in the allosteric constant.
[00110] Approximately 3.5 and 5% of all amino acid substitutions decrease ligand-saturated expression, G∞, more than 5-fold or 2.5-fold, respectively. Substitutions that decrease G- by more than 5-fold are all located near the ligand- binding pocket or along the dimer interface (FIG. 3B and E). Six of these substitutions also increase EC50 more than 5-fold (A75T, D88N, S193L, Q248R, D275Y, and F293Y; (FIG. 26B). Except for D88N, which is at the dimer interface in helix 5, these substitutions are near the ligand-binding pocket. Substitutions near the ligand-binding pocket probably decrease ligand-binding affinity by changing the ligand-binding pocket environment directly. This would explain the observed increase in EC50 for each of these substitutions, though studies with targeted substitutions have shown that substitutions near the ligand-binding pocket can also change the allosteric constant (Chure et a/, 2019).
[00111] Amino acid substitutions that change the effective concentration, EC50, are the most numerous and are spread throughout the protein structure, with approximately 9 and 20% of all substitutions causing a greater than 5- fold or 2.5-fold shift in EC50, respectively (FIG. 3C and F). The strongest effects are from substitutions in the DNA-binding domain, ligand-binding pocket, core-pivot region, or dimer interface.
[00112] Substitutions that cause the largest decrease in EC50 are at the dimer interface and probably disrupt cross-dimer interactions. In particular, substitutions T68N (27-fold decrease) and L71Q (14-fold decrease) each probably disrupt the L71- Q78' interaction (discussed above). Substitutions V99E (25-fold decrease), E100G (17-fold decrease), and V95M (16-fold decrease) are each in p-strand B and each probably disrupts the K84-K84' interaction (discussed below). All of these substitutions likely shift the allosteric constant to favor the inactive state.
[00113] Substitutions that cause the largest increase in EC50 are often near the ligand-binding pocket or core-pivot domain. Often, substitutions at these positions also affect G°° (discussed above). However, we also identified nine positions near the ligand-binding pocket or core-pivot domain (N125, P127, D149, V192, A194, A245, N246, T276, Q291), where different substitutions either reduce
Figure imgf000042_0001
by more than 5-fold or increase EC50 by more than 5-fold, but not both. Given their positions, each of these substitutions probably disrupt the ligand-binding pocket thereby reducing ligand- binding affinity, though they may also change the allosteric constant to favor the active state.
[00114] At three positions (A82, I83, and F161), different substitutions can either increase or decrease EC50 more than 5-fold, depending on the substitution.
[00115] Residue F161 sits in the core-pivot region and is sequestered in a hydrophobic cluster (Swint-Kruse et at, 2001 ; Flynn et al 2003), where the phenylalanine ring makes van der Waals contacts with Q291 . In turn, Q291 is involved in hydrogen bonding networks that span the ligand-binding pocket and dimer interface (Flynn et at, 2003). During the transition between active and inactive states, the contacts between F161 and Q291 change, contributing to rearrangements throughout the Lacl structure. At position F161 , large hydrophobic amino acids (F161 I, F161 L) increase the EC50 approximately 10-fold, while a slightly smaller hydrophobic amino acid (F161V) increases the EC50 approximately 3-fold. In contrast, a small, hydrophilic amino acid (F161S) reduces EC50 approximately 10-fold. The hydrophobic substitutions likely have little effect on the hydrophobic environment surrounding the position, but with different geometries, these amino acids may not make the required contacts with Q291. This could cause a shift in the allosteric constant to favor the active state, consistent with the observed increase in EC50 for F161 I, F161 L, and F161V. On the other hand, the hydrophilic substitution at this position, F161S, likely disrupts the local hydrophobic environment, destabilizing the active state and shifting the allosteric constant to favor the inactive state, in agreement with the observed decrease in EC50.
[00116] Positions A82 and I83 are in helix 5 of the N-terminal core domain, and both are proximal to and pointed toward helix 13. The A82E substitution, which replaces the diminutive alanine with the larger glutamate, decreases EC50 approximately 30-fold. However, a smaller amino acid at this position (A82G) increases the EC50 approximately 5-fold. These results suggest a steric clash between the side chain of residue 82 and helix 13 that is disrupts the active state and that effectively shifts the allosteric constant to favor the inactive state. At position I83, the I83F substitution decreases EC50 approximately 5-fold while I83M increases EC50 approximately 5-fold. Interestingly, both of these substitutions, as well as the wild-type isoleucine, are similar in volume (Zamyatnin, 1972) and hydropathy (Kyte & Doolittle, 1982). So, simple physiochemical differences do not satisfactorily account for the observed effects. The effects could perhaps be steric, as with position A82, but driven by changes in side-chain flexibility instead of size. Phenylalanine is the most rigid of the three side chains, followed by isoleucine, and the even more flexible methionine (Miao & Cao, 2016). As with position A82, our results suggest that such steric effects destabilize the active state, effectively shifting the allosteric constant to favor the inactive state.
[00117] We also identified five positions (H74, V80, K84, S97, M98) where different substitutions reduce either G∞ or EC50 by more than 5-fold, but not both. These positions are all located at the dimer interface, specifically in or near helix 5 or p-strand B.
[00118] Substitutions at position H74 either decrease EC50 approximately 8- fold (H74Q) or decrease G∞ approximately 10-fold while increasing EC50 approximately 3-fold (H74P and H74Y). In the active state, residues H74 from both monomers form stable π -stacking interactions with each other. These interactions are disrupted in the inactive state, and instead, H74 forms a charge- charge interaction with D278’ (on the opposite monomer) (Lewis et al. 1996). Substitutions at this position that aboiish the ir-stacking interactions would presumably destabilize the active state and shift the allosteric constant toward the inactive state. This is consistent with our result for H74Q. Our results for substitutions H74P and H74Y (decrease G∞ and increase EC50) are consistent with either a shift in the allosteric constant to favor the active state or a decrease in the ligand affinity of the inactive state. H74Y can form the same TT-stacking interactions seen in the active state of the wild type but cannot form the charge-charge interaction with D278' to stabilize the inactive state. H74Y, therefore, would be expected to shift the allosteric constant toward the active state, agreeing with our observations. H74P cannot form either the TT-stacking interactions or the charge-charge interaction with D278', yet that substitution still increases EC50 similarly to H74P. Since proline is a helix initiator (Richardson & Richardson, 1988; Kim & Kang, 1999) and H74 is positioned at the beginning of helix 5, H74P may shift the allosteric constant to favor the active state by stabilizing secondary structure.
[00119] Substitutions at positions K84, S97, and M98 decrease G∞ (S97P, M98R), or decrease EC50 (K84N, S97W, M98L), or both (K84E, K84I, K84T, M98K). These residues are all involved in a coordinated process during the transition from the active state to the inactive state (Flynn etal, 2003). In the active state, the side chains of the K84 residues from both monomers sit in-plane with p-strand B and p-strand B', interacting with the backbone of V94 and V96' (both in p-strands B). In this process, K84 residues act as a bridge between the two p-strands. During the transition to the inactive state, K84 forms transient interactions with the side chain of S97 (also in p- strand B) and the backbone of M98, before eventually forming a stable charge-charge interaction with D88. Substitutions that disrupt this process have significant effects on the structure and function of Lacl. For example, the substitution K84L causes significant structural changes to the N-terminal core domain and dimer interface (Bell et al, 2001), and substitutions at position S97 and M98 can greatly alter the biophysical properties of Lacl (Zhan et at, 2010). Given the extent of structural and functional changes that can occur with substitutions involved in this process, precise mechanisms of the observed substitutions are difficult to predict, and observed changes are not easily described by the biophysical models. For example, within the biophysical models, to simultaneously decrease both G∞ and EC50 (as observed for K84E, K84I, K84T, and M98K) requires a change to the ligand-binding affinity. Yet positions K84 and M98 are approximately 14 and 12 A, respectively, from the ligand pocket (based on the wild-type Lacl crystal structure).
[00120] None of the single amino substitutions measured in the library simultaneously decrease
Figure imgf000045_0001
and increase G0 (FIG. 26C). This is not surprising, since substitutions that shift the biophysics to favor the active state tend to decrease G∞ while those that favor the inactive state tend to increase G0, and the biophysical models (Daber et al , 2011 ; Razo-Mejia et al, 2018; Chure et al, 2019) indicate that only a combination of parameter changes can cause both modifications to the dose-response. The library did, however, contain several multi-substitution variants with simultaneously decreased G∞ and increased G0. These inverted variants, and their associated substitutions are discussed below.
[00121] Combining multiple substitutions in a single protein almost always has a log-additive effect on EC50. That is, the proportional effects of two individual amino acid substitutions on the EC50 can be multiplied together. For example, if substitution >4 results in a 3-fold change, and substitution B results in a 2-fold change, the double substitution, AB, behaving log-additively, results in a 6-fold change. Only 0.57% (12 of 2,101 ) of double amino acid substitutions in the measured data have EC50 values that differ from the log-additive effects of the single substitutions by more than 2.5-fold (FIG. 4). This result, combined with the wide distribution of residues that affect EC50, reinforces the view that allostery is a distributed biophysical phenomenon controlled by a free energy balance with additive contributions from many residues and interactions, a mechanism proposed previously (Matzen et al, 2013; Motlagh et al, 2014) and supported by other recent studies (Leander et al, 2020), rather than a process driven by the propagation of local, contiguous structural rearrangements along a defined pathway.
[00122] A similar analysis of log-additivity for G0 and G∞ is complicated by the more limited range of measured values for those parameters, the smaller number of substitutions that cause large shifts in G0 or G∞, and the higher relative measurement uncertainty at low G(L). However, the effects of multiple substitutions on G0 and G∞ are also consistent with tog-additivity for almost every measured double-substitution variant (FIG. 27).
[00123] Most of the non-silent substitutions discussed above are more likely to affect the allosteric constant than either the ligand or operator affinities. Within the biophysical models, those affinities are specific to either the active or inactive state of Lacl i.e. , they are defined conditionally, assuming that the protein is in the appropriate state. So, almost by definition, substitutions that affect the ligand-binding or operator- binding affinities (as defined in the models) must be at positions that are close to the ligand-binding site or within the DNA-binding domain. Substitutions that modify the ability of the Lacl protein to access either the active state or inactive state, by definition, affect the allosteric constant. This includes, for example, substitutions that disrupt dimer formation (dissociated monomers are in the inactive state), substitutions that lock the dimer rigidly into either the active or inactive state, or substitutions that more subtly affect the balance between the active and inactive states. Thus, because there are many more positions far from the ligand- and DNA-binding regions than close to those regions, there are many more opportunities for substitutions to affect the allosteric constant than the other biophysical parameters. Note that this analysis assumes that substitutions do not perturb the Lacl structure too much, so that the active and inactive states remain somewhat similar to the wild-type states. Our results suggest that this is not always the case: consider, for example, the substitutions at positions K84 and M98 discussed above and the substitutions resulting in the inverted and band-stop phenotypes discussed below.
[00124] Phenotypic innovation in an allosteric landscape
[00125] Beyond the comprehensive mapping of single-substitution effects, the Lacl genotype-phenotype landscape measurement revealed a surprising number of variants with phenotypes that differ qualitatively from the wild type. For example, approximately 230 of the Lacl variants have an inverted phenotype (G0 > G∞, FIG. 1 E), accounting for approximately 0.35% of the measured library (FIG. 9A). We verified the dose-response curves for 10 inverted variants with flow cytometry (e.g., FIG. 12). To understand the mutational basis for the inverted phenotype, we examined a set of 43 strongly inverted variants (with G0/G∞ > 2, G0 > G∞ wt/2, and EC50 between 3 and 1 ,000 μmol/l). The results indicate that diverse substitutions can lead to the inverted phenotype. For example, we identified 10 amino acid substitutions associated with the inverted phenotype (S70I, K84N, D88Y, V96E, A135T, V192A, G200S, Q248H, Y273H, A343G, P-value ≤ 0.005; FIG. 5A and C; FIG. 9). However, none of these substitutions are present in more than 12% of the strongly inverted variants, and 51 % of the strongly inverted variants have none of these substitutions. Furthermore, the set of strongly inverted variants are more genetically distant from each other than randomly selected variants from the library (FIG. 5C, FIG. 8). The genetic diversity of the inverted variants found in our measurement is striking when compared with previous reports of inverted Lacl variants resulting from site-saturated mutagenesis (Daber et al . 2011 ) or directed evolution with random mutagenesis (Poelwijk et al, 2011 ; Meyer et at, 2013). Those previous reports yielded only a small number of inverted variants with closely related genotypes and substitutions at specific positions that were key for inversion (I79, S97, and L296). Even more striking, most of the positions previously identified as important for the inverted phenotype are not significantly enriched in the set of strongly inverted sensors reported here.
[00126] The inverted Lacl variants can provide specific insight into allosteric biophysics and structure-function relationships, since inversion of the dose-response curve requires inversion of both the allosteric constant and the relative ligand-binding affinity between the active and inactive states (Razo-Mejia et al, 2018; Chure et at, 2019). Although the set of strongly inverted Lacl variants are genetically diverse, many of them have substitutions in similar regions of the protein that may account for the requisite biophysical changes (FIG. 36). First, 67% of the strongly inverted variants have substitutions within 7 A of the ligand-binding pocket (compared with 31 % of the full library, P-value = 1.15 x 10-6), which likely contribute to the change in ligand- binding affinity. Surprisingly, 21 % of the strongly inverted variants have no substitutions within 10 A of the binding pocket, so binding affinity must be indirectly affected by distal substitutions in those variants. Second, nearly all strongly inverted variants have substitutions at the dimer interface (91 %, compared with 54% for the full library, P-value = 2.05 x 10-7 ), with most (70%) having substitutions in helix 5 (47%), helix 11 (28%), or both (5%, FIG. 5A and C). This suggests that residues in those structural features are important for modulating the allosteric constant.
[00127] Discovery of novel allosteric phenotypes [00128] In addition to the inverted phenotypes, we were surprised to discover Lad variants with dose-response curves that did not match the sigmoidal form of the Hill equation. Specifically, we found variants with biphasic dose-response curves that repress or activate gene expression only over a narrow range of ligand concentrations. These indude examples of Lac! variants with band-stop dose-response curves (i.e., variants with high-low-high gene expression: e.g., FIG. 1 E, FIG. 13), and Lacl variants with band-pass dose-response curves (i.e., variants with low-high-low gene expression; e.g., FIG. 14). Approximately 200 of the Lad variants have band-stop or band-pass phenotypes, accounting for approximately 0.3% of the measured library (FIG. 9A). We verified the dose-response curves of 13 band-stop variants and two band-pass variants using flow cytometry (e.g., FIG. 13 and FIG. 14). To our knowledge, this is the first identification of single-protein genetic sensors with band- stop dose-response curves.
[00129] Phenotypic similarities between band-stop and inverted Lacl variants (i.e., high G0, and initially decreasing gene expression as ligand concentration increases) suggest similar biophysical requirements (i.e., inversion of both the allosteric constant and the relative ligand-binding affinity between the two states). However, amino acid substitutions associated with the band-stop phenotype are remarkably different from those associated with inverted phenotype (V4A, A92V, G178D, H179Q, R195H, G265D, D292G, R351G, P-value ≤ 0.005; FIG. 5B and D; FIG. 37). While inverted variants often have substitutions near the ligand-binding pocket and dimer interface, a set of 31 strong band-stop variants are twice as likely as the full library to have substitutions in helix 9 (32% compared with 16%, P- value " 2.14 x 10-2 ) and nearly four times as likely to have substitutions in p-strand J (13% compared with 3.4%, P-value = 2.08 x 10-2 ). Helix 9 is on the periphery of the protein, and p-strand J is in the center of the C-terminal core domain. Furthermore, 100% of the strong band-stop variants have substitutions in the C-terminal core of the protein, compared with 78% of the full library (P-value = 4.67 x 10-4 ).
[00130] To further investigate the band-stop phenotype, we chose a strong band-stop Lacl variant with only three amino acid substitutions (R195H/G265D/A337D). These three positions are distributed distally on the periphery of the C-terminal core domain, and the role that each of these substitutions plays in the emergence of the band-stop phenotype is unclear. To investigate the impact of these substitutions, we synthesized Laci variants with all possible combinations of those substitutions and measured their dose-response curves with flow cytometry. Although each single substitution resulted in a sigmoidal dose-response similar to wild- type Laci, the combination of two substitutions (R195H/G265D) gave rise to the band- stop phenotype (FIG. 6A and B; FIG. 29). To test whether this result applies to the band-stop phenotype generally, we used the single-substitution effects presented above to examine each of the substitutions associated with the strong band-stop phenotype. Individually, the substitutions associated with the band-stop phenotype are nearly silent, i.e., they have little or no effect on the dose-response curve; yet in combination with other substitutions, they result in the band-stop phenotype. In contrast, most of the individual substitutions associated with the inverted phenotype cause a large shift in either EC50, G∞, or both (FIG. 6C and D).
[00131] For the goal of an improved understanding of allostery, our results reveal the dual nature of the problem: First, the DNN model and the mapping of single- substitution effects demonstrate that large-scale measurements and analysis can overcome the challenges inherent to the structural complexity of allosteric function. They can provide accurate predictions for specific allosteric proteins and can also reveal systematic sequence-structure-function relationships that may be more generalizable (i.e., the importance of the dimer interface and the log-additivity of EC50). However, the band-stop phenotype highlights the limits of that predictability, as well as the constraints of conventional models of allostery.
[00132] While the allosteric function of many Laci variants is well described by extensions of the MWC model of allostery (Monod et al , 1965; Razo-Mejia et al, 2018; Chure et al, 2019), the band-stop phenotype is inconsistent with that model. In particular, the biphasic dose-response of the band-stop variants suggests negative cooperativity: that is, successive ligand-binding steps have reduced ligand-binding affinity. Negative cooperativity has been shown to be required for biphasic dose- response curves (Onufriev & UIImann, 2004; Bouhaddou & Birtwistle, 2014). The biphasic dose-response and apparent negative cooperativity are also reminiscent of systems where protein disorder and dynamics have been shown to play an important role in allosteric function (Motlagh et al, 2014), including catabolite activator protein (CAP) (Popovych et al, 2006; Tzeng & Kalodimos, 2012) and the Doc/Phd toxin- antitoxin system (Garcia-Pino et at, 2010). This suggests that entropic changes may also be important for the band-stop phenotype. A potential mechanism is that band- stop Lacl variants have two distinct inactive states: an inactive monomeric state and an inactive dimeric state. In the absence of ligand, inactive monomers may dominate the population. Then, at intermediate ligand concentrations, ligand binding stabilizes dimerization of Lacl into an active state which can bind to the DNA operator and repress transcription. When a second ligand binds to the dimer, it returns to an inactive dimeric state, similar to wild-type Lacl. Similar dimerization-based regulation has been described before and supports the observed negative cooperativity and biphasic dose- response (Bouhaddou & Birtwistle, 2014). This mechanism and other possible mechanisms do not match the MWC model of allostery or its extensions (Monod et at, 1965; Daber et at, 2011 ; Razo-Mejia et at, 2018; Chure et at, 2019) and require a more comprehensive study and understanding of the ensemble of states in which these band-stop Lacl variants exist.
[00133] Our most surprising and unpredictable result is the emergence of the band-stop phenotype from combinations of nearly silent amino acid substitutions. However, with over one hundred genetically diverse band-stop variants, our dataset provides a basis for more systematic understanding even in this case. Furthermore, the relatively high abundance of inverted and band-stop variants (approximately 0.35 and 0.2% of the library, respectively, FIG. 9A) with genotypes near the wild type suggests that allosteric genotype-phenotype landscapes allow' for rapid evolutionary innovation, a conclusion that is supported by the existence of natural transcription factors related to Lacl with inverted phenotypes (Myers & Sadler, 1971 ; Rolfes & Zalkin, 1990).
[00134] A surprising diversity of useful and potentially novel allosteric phenotypes exist with genotypes that are readily discoverable via large-scale landscape measurements. Novel phenotypes emerged at mutational distances greater than one amino acid substitution, highlighting the value in sampling a broader genotype space with higher-order mutations. Furthermore, the untargeted, random mutagenesis approach used here was critical for finding these novel phenotypes, as the genotypes required for these novel phenotypes were unpredictable. [00135] Strain, plasmid, and library construction
[00136] Ail reported measurements were completed using E. coll strain MG1655A/ac (Sarkar et al, 2020). Briefly, strain MG1655A/ac was constructed by replacing the lactose operon of E. coll strain MG1655 (ATCC #47076) with the bleomycin resistance gene from Streptoalloteichus hindustanus (Shble).
[00137] Two plasmids were used for this work: a library plasmid (pTY1, FIG. 7A) used for the measurement of the genotype and phenotype of the entire Lacl library, and a verification plasmid (pVER, FIG. 7B) used to verify the function of over 100 Lacl variants from the library chosen to test the accuracy of the library-scale dose-response curve measurement method. A version of this protocol is maintained at protocols.io (availage at https://doi.org/10.17504/protocols.io.bjjxkkpn).
[00138] Plasmid pTY1 contained the lacl CDS and the lactose operator (/acO) regulating the transcription of a tetracycline resistance gene, tetA, which, in the presence of tetracycline, confers a measurable change in fitness connected with the expression level of the regulated genes. Plasmid pTY1 also encoded Enhanced Yellow Fluorescent Protein (YFP), which was used during library construction to select a library in which most of the Lacl variants could function as allosteric repressors.
[00139] Plasmid pVER contained a similar system in which Lacl and lacO regulate the transcription of only YFP. Plasmid pVER was used to measure dose-response curves of clonal Lacl variants using flow cytometry. Each variant chosen from the library for verification was chemically synthesized, inserted into pVER, and transformed into E. coll strain MG1655Δ/acfor flow cytometry measurements to confirm the dose-response curve inferred from the library-scale measurement.
[00140] The Lacl library was generated by error-prone PCR of the wild- type lacl CDS encoded on plasmid pTY1 . The library was constructed by splitting the lacl CDS into an N-terminal half and a C-terminal half using an Apal restriction enzyme cut site that was near the center of the lacl CDS covering codons for A186, G187, and P188. Error-prone PCR of each half introduced genetic diversity, and then each unique sequence was attached to a DNA barcode. Plasmid-encoded versions of both sub- libraries were established (an N-terminal library, pNTL, and a C-terminal library, pCTL). These two sub-libraries were then assembled to generate the full /ac/ CDSs and to combine the two halves of the DNA barcode. The library was inserted into pTY1 along with randomly synthesized DNA barcodes (FIG. 7A).
[00141] The plasmid containing the N-terminal library, pNTL contained the N-terminal half of the /ac/ CDS (lacl-N) and a randomized nucleotide sequence that forms half of the DNA barcode. The protocol for constructing the N-terminal library included:
PCR amplify the N-terminal half of the /ac/ CDS (coding for amino acids M1- A186) from pTY1 using the GeneMorph II Random Mutagenesis Kit (Agilent, cat. #200550) with primers DT.01 and DT.02 (Appendix Table S7). Agarose gel purify PCR product.
PCR amplify the T7 terminator (TT7) with primers DT.03 and DT.04 with Phusion Flash polymerase (used for PCR unless otherwise specified) and then re-amplified with primers DT.03 and DT.05. Agarose gel purify PCR product.
Assemble the two amplicons using assembly PCR with primers DT.02 and DT.03.
Digest the assembled amplicon with restriction enzymes FastDigest Apal and Xmal. Agarose gel purify.
PCR amplify the pMB1 origin of replication and ampicillin resistance gene from plasmid pUC19 (NEB, cat #N3041 S) with primers DT.06 and DT.07. Digest amplicon with restriction enzymes Apal and Xmal and dephosphorylate the ends with Calf Intestinal Phosphatase. Agarose gel purify.
During this step, primer DT.06 incorporates half of the final DNA barcode, consisting of 27 randomized nucleotides interspersed with constant A/T bases to limit restriction site formation. Primer DT.06 was ordered with hand-mixed bases, with “N” representing equal ratios of A, T, G, and C.
Half DNA Barcode: 5-
TNNTNNNANNTNNNANNTNNNANNTNNNANNTNNNANNA-3’ Ligate the assembly PCR product and the pUC19 amplicon using T4 DNA ligase (Thermo Scientific, EL0011) to form the final sub-library plasmid pNTL, and desalt using QIAquick PCR Purification Kit (Qiagen, cat. #28106).
Transform into One Shot TOP10 Electrocomp E coli (I nvitrogen, cat. #0404050).
Recover with SOC media at 37°C while shaking for 1 h.
[00142] The plasmid containing the C-terminal library, pCTL, contained the C-terminal half of the /ac/ CDS (lacl-C) and a randomized nucleotide sequence that forms the second half of the DNA barcode. The protocol for constructing the C-terminal library is:
PCR amplify araC terminator from pTY1 using primers DT.08 and DT.09. Agarose gel purify PCR product.
PCR amplify the C-terminal half of /ac/ CDS (coding for amino acids L189- Q360) from pTY1 using the GeneMorph II Random Mutagenesis Kit with primers DT.10 and DT.11. Agarose gel purify PCR product.
Assemble the two amplicons using assembly PCR with primers DT.08 and DT.11.
Digest the assembled amplicon with restriction enzymes FastDigest Apal and Xmal. Agarose gel purify PCR product.
PCR amplify the pMB1 origin of replication (pMB1) and ampicillin resistance gene from plasmid pUC19 with primers DT.12 and DT.13. Digest amplicon with restriction enzymes Apal and Xmal dephosphoryiate the ends with Calf Intestinal Phosphatase. Agarose gel purify.
During this step, primer DT.12 incorporates half of the final DNA barcode, consisting of 27 randomized nucleotides interspersed with constant A/T bases to limit restriction site formation. Primer DT.12 was ordered with hand-mixed bases, with “N” representing equal ratios of A, T, G, and C. Half DNA Barcode: 5’-
GNNTNNNANNTNNNANNTNNNTNNTNNNANNTNNNANNA-3'
Ligate the assembly PCR product and the pUC19 amplicon using T4 DNA ligase to form the final sub-library plasmid pCTL, and desalt using the ligation product using QIAquick PCR Purification Kit.
Transform into OneShot TOP10 Electrocomp E. coli.
Immediately recover with SOC media at 37°C while shaking for 1 h.
[00143] Plasmids pNTL, pCTL, and pTY1 are starting points to assemble full length /ac/ CDS and combine the two halves of the DNA barcodes. The protocol is:
PCR amplify the lacl-C and lacl-N libraries from plasmids pCTL and pNTL, respectively, both PCRs use primers DT.14 and DT.15.
Digest the lacl-C library amplicon with restriction enzymes Apal and Dpnl and treat with CIP.
Digest the lacl-N library amplicon with restriction enzymes Apal and Dpnl.
Agarose gel purify both digested amplicons, then ligate the two together using T4 DNA ligase to assemble the sensor library with full length /ac/ CDS. Agarose gel purify the ligation product.
Digest with Fsel (NEB, cat. #R0588S). Agarose gel purify digested product.
Circularize the linear product by ligating with T4 DNA ligase, then relinearize by digestion with restriction enzymes Sgsl and Nhel (Thermo Scientific, cat. #FD1894 and #FD0974) and agarose gel purify.
This is the assembled /ac/ CDS library with attached DNA barcodes, ready for ligation into pTY1 .
Prepare plasmid backbone by digesting pTY1 plasmid DNA with restriction enzymes Sgsl and Nhel, then treat with CIP. Agarose gel purify. Insert the assembled /ac/ CDS library with attached DNA barcodes into the pTY1 backbone using T4 DNA ligase with a 3-fold molar excess of pTY1 backbone.
Desalt the ligation product and electroporate into MG1655A/ac.
[00144] To prepare electrocom petent E. coli MG1655A/ac:
Dilute overnight culture of E. coli MG1655A/ac 1 ,000-fold into 500 ml of LB media.
Incubated the culture at 37°C for3.5 h in a 2-I baffled Erlenmeyer flask to a final optical density at 600 nm (OD600) of approximately 0.8.
Chill the culture in ice slurry for 20 min.
Centrifuge the culture at 3,500 g for 10 min in refrigerated centrifuge at 4°C.
Decant supernatant media, and then resuspended the cell pellet in 500 ml of 10% glycerol.
Centrifuge the solution at 3,500 g for 10 min.
Decant the supernatant glycerol solution.
Repeated the glycerol wash one additional time (two washes total)
Resuspended the cell pellet with residual 10% glycerol.
Transform the plasmid-encoded sensor library (see above) into the freshly prepared electrocompetent MG1655Alac.
Immediately recover with SOC media at 37°C while shaking for 1 h.
Dilute the library in LB media supplemented with glucose (2 g/l) and kanamycin (50 μg/ml) to a final volume of 500 ml and incubate for 12 h at 37°C while shaking.
Divide the library into 1 ml aliquots and store them in 20% glycerol at -80°C (1 :1 dilution with 40% glycerol). [00145] Most of the variants in the initial library had high G(0), i.e., the /- phenotype (Mackiewicz et al, 1994). The initial library had a bimodal distribution of G0, as indicated by flow cytometry results, with a mode at low fluorescence (near G0 of wild-type Lacl) and a mode at higher gene expression. To generate a library in which most of the Laci variants could function as allosteric repressors, we used fluorescence-activated cell sorting (FACS) to select the portion of the library with low fluorescence in the absence of ligand, gating at the trough between the two modes (Sony SH800S Cell Sorter, FIG. 8). To allow comprehensive long-read sequencing of the library (PacBio sequel II, see Long-read sequencing section, below), we further reduced the library size by dilution of the FACS-selected library to create a population bottleneck of the desired size. For the work reported here, we used a library of approximately 105 Lacl variants (determined by serial plating and colony counting).
[00146] A spike-in control strain was used to normalize the DNA barcode read counts for the sequencing-based fitness measurement. The spike-in control strain contained the library plasmid (pTY1) with a Lacl variant that had a constant, high tetA expression level. The fitness of the spike-in control was determined from OD600 data acquired during growth of clonal cultures with the same automated growth protocol as used for the genotype-phenotype landscape measurement. The fitness of the spike-in control was measured in all 24 chemical environments and was independent of IPTG concentration but was slightly lower with tetracycline (0.75 h-1) than without tetracycline (0.81 h-1).
[00147] Culture conditions
[00148] Unless otherwise noted, E. coli cultures were grown in a rich M9 media (3 g/l KH2PO4, 6.78 g/l Na2HPO4, 0.5 g/l NaCI, 1 g/l NH4CI, 0.1 mmol/l CaCl2, 2 mmol/l MgSO4, 4% glycerol, and 20 g/l casamino acids) supplemented with 50 pg/ml kanamycin.
[00149] Escherichia coli cultures were grown in a laboratory automation system that controlled preparation of 96-well culture plates with media and additives (i.e., IPTG and tetracycline). Cultures were grown in clear-bottom 96-well plates with 1.1 ml square wells. The culture volume per well was 0.5 ml. Before incubation, an automated plate sealer was used to seal each 96-well plate with a gas-permeable membrane. Cultures were incubated in a multi-mode plate reader at 37°C with a 1 °C gradient applied from the botom to the top of the incubation chamber to minimize condensation on the inside of the membrane. During incubation, the plate reader was set for double-orbital shaking at 807 cycles per minute. Optical density at 600 nm (OD600) was measured every 5 min during incubation, with continuous shaking applied between measurements. After incubation, an automated desealer was used to remove the gas-permeable membrane from each 96-well plate.
[00150] Growth protocol for landscape measurement
[00151] To measure the fitness and dose-response curve of every Lacl variant in the library, a culture of E. coli containing the Lacl library was mixed at a 99:1 ratio with a culture of the E. coli spike-in control. The culture was loaded into the automated microbial growth and measurement system where it was distributed across a 96-well plate and then grown to stationary phase (12 h, FiG. 30). Cultures were then diluted 50-fold into a new 96-well plate, Growth Plate 1 , containing 11 rows with a 2- fold serial dilution gradient of IPTG with concentrations ranging from 2 to 2,048 μmol/l and one column without IPTG. Growth in IPTG allowed each variant to reach a steady- state tetA expression level in each IPTG concentration. Growth Plate 1 was grown for 160 min, corresponding to approximately 3.3 generations, and then diluted 10-fold into Growth Plate 2. Growth Plate 2 contained the same IPTG gradient as Growth Plate 1 with the addition of tetracycline (20 pg/ml) to alternating rows in the plate, resulting in 24 chemical environments, with each environment spread across 4 wells. Growth Plate 2 was grown for 160 min and then diluted 10-fold into Growth Plate 3, which contained the same 24 chemical environments as Growth Plate 2. This process was repeated for Growth Plate 4, which also contained the same 24 chemical environments. Each growth plate was pre-heated to 37°C before transferring the cells from the previous growth plate to avoid any disruption of cell growth due to large variations in temperature. The total growth time for the fitness measurements in the 24 chemical environments, 480 min across Growth Plates 2-4, corresponded to approximately 10 generations for the fastest-growing cultures. The 50-fold dilution factor from stationary phase into Growth Plate 1 and the 160-min growth time per plate were chosen to maintain the cultures in exponential growth for the entire 480 min. During each 160-min incubation, the cultures without tetracycline increased approximately 10-fold in optical density, to a final OD600 of approximately 0.5 (corresponding to an estimated cell density of 4 x 108 cells/ml. FIG. 31 , FIG. 32, FIG. 38). The protocol was:
Inoculate 100 ml media in a 250-ml baffled Erlenmeyer flask with 2 ml frozen glycerol stock of library.
Inoculate 50 ml media in a 250-ml baffled Erlenmeyer flask with scrapping of spike-in control glycerol stock.
Incubate both at 37°C shaking at 300 rpm for 18 h.
Into a 250-ml baffled Erlenmeyer flask, combine 49 ml of library culture, 0.5 ml of spike-in control culture, and 50 ml of media, incubate 6 h at 37°C shaking at 300 rpm.
[00152] This mixture was used to begin the automated growth and measurement process:
Distribute 450 μl media to each well of a 96-well growth plate.
Distribute 50 μl of culture mixture of library and spike-in control into each well of plate.
Seal plate with a gas-permeable membrane and incubate in plate reader at 37°C for 12 hr.
During incubations, the plate reader was set for continuous double- orbital shaking at 807 cycles per minute. OD600 was measured every 5 min.
Prepare Growth Plate 1 :
Distribute 490 μl media across a 96-well growth plate
Use a 2-fold serial dilution of IPTG to add a gradient of IPTG across columns so that the final concentrations ranges from 2 to 2,048 μmol/l, and one column without IPTG. Ten minutes before the end of the 12-h incubation, preheat Growth Piate 1 to 37°C.
Preheat on temperature-controiied position set to 47°C for 10 min. Measurements of media temperature vs time indicated that this resuited in a media temperature of approximately 37°C.
Remove 12 h growth piate from piate reader, remove gas-permeabie membrane, and transfer 10 pi of culture from each well into the corresponding well of Growth Piate 1 .
Seal Growth Plate 1 with a gas-permeable membrane and incubate in plate reader at 37°C for 160 min.
Growth Plate 1 contains only the IPTG gradient (no tetracycline). This allows cells to reach exponential growth and to reach steady-state expression of tetA before adding tetracycline.
Prepare Growth Plate 2
Distribute media across a 96-well growth plate, 450 pl total volume.
In alternating rows, supplement media with tetracycline to a final concentration of 20 pg/ml (rows B, D, F, H).
Use a 2-fold serial dilution of IPTG to add a gradient of IPTG across columns so that the final concentrations ranges from 2 to 2,048 μmol/l, and one column without IPTG.
Ten minutes before the end of the 160-min incubation, preheat Growth Plate 2 to 37°C.
Preheat on temperature-controlled position set to 47°C for 10 min. Measurements of media temperature vs time indicated that this resulted in a media temperature of approximately 37°C. Remove Growth Plate 1 from plate reader, remove gas-permeable membrane, and transfer 50 pl of culture from each well of Growth Plate 1 into the corresponding well of Growth Plate 2.
Seal Growth Plate 2 with a gas-permeable membrane and incubate in plate reader at 37°C for 160 min.
Immediately proceed with plasmid DNA extraction for Growth Plate 1 (below).
Repeat plate preparation and dilution protocol for Growth Plate 3 and Growth Plate 4 (with the same IPTG gradient and tetracycline in rows B, D, F, H).
At the conclusion of Growth Plate 4 proceed with plasmid DNA extraction, there is no dilution into another plate.
[00153] After each growth plate was used to seed the subsequent plate (or at the end of 160 min for Growth Plate 4), the remaining culture volumes for each chemical environment (approximately 450 μl/well, four wells per plate) were combined and pelleted by centrifugation (3,878 g for 10 min at 23°C). Plasmid DNA was then extracted from the 24 combined samples with a custom method using reagents from the QIAprep Miniprep Kit (commercial available at Qiagen cat. #27104) on an automated liquid handler equipped with a positive-pressure filter press (a version of the protocol is maintained at protocols.io https://doi.org/10.17504/protocols.io.bjjvkkn6). The protocol was:
Resuspend each cell pellet in in 200 pl of resuspension buffer (50 mmol/l of Tris-CL pH 8.0, 10 mmol/l EDTA, 100 pg/ml RNase A).
Transfer resuspended cell samples to a new 96-well plate located on an automated microplate shaker.
Add 250 μl lysis buffer (200 mmol/l NaOH, 10 g/l SDS) to each sample and mix by shaking the plate at 90 rpm for 2 min.
Add 350 μl cold (4°C) neutralization Buffer to each sample and mix by shaking at 90 rpm for 2 min. Using wide bore tips (3.2 mm tip diameter), gentiy mixed the sampies by three repeated cycles of aspiration and dispensing, then transfer samples to a 96- well filter plate and allowed to settle in the filter plate for 2 min.
Use the filter press to push the lysate solutions through the filter plate into a new 96-well deep well plate at 20 psi for 180 s followed by 65 psi for 30 s.
Transfer the cleared lysate solutions to a 96-well glass fiber binding plate and use the filter press to push the solutions through the binding plate at 40 psi for 60 s.
Add 900 μl Binding Buffer to each well and use the filter press to push buffer through the binding plate at 40 psi for 60 s.
Add 900 μl Wash Buffer (8 mmol/l Tris-CI, pH7.5, 80% ethanol) to each well and use the filter press to push buffer through the binding plate at 40 psi for 60 s.
Use the filter press to dry the binding plate by applying 65 psi for 7 min.
Add 100 μl nuclease-free water warmed to 60°C to each well; wait for 5 min.
Elute DNA from binding plate into a 96-well low-binding elution plate using the filter press at 65 psi for 7 min.
[00154] Supernatant removal, cell resuspension, and cell sample transfer were performed using the automated liquid handler's 8-channel head with which each channel is capable of independent movement and liquid-level sensing to allow for variations in cell culture volume and density recovered from each sample. Most of the subsequent pipetting was performed using the automated liquid handler’s 96-channel head with an offset pickup of 24 pipette tips so that the timing for each sample was identical. Pipetting the water for elution was performed using the automated liquid handler’s 8-channel head to allow for individual channel movement.
[00155] For the DNA extracted from Growth Plate 4, the filter press jammed just before elution, so those samples were eluted by centrifugation at 1 ,000 g for 3 min. [00156] After elution, the concentration of DNA in each sample ranged from undetectable up to approximately 1 .5 ng/μl. This corresponds to an estimated maximum of 1010 plasmids per sample.
[00157] Barcode sequencing
[00158] After plasmid extraction, each set of 24 plasmid DNA samples was prepared for barcode sequencing. Briefly, the plasmid DNA was linearized with Apal restriction enzyme. Then, a three-cycle PCR was performed to amplify the barcodes from the plasmids and attach sample multiplexing tags so samples from different chemical environments could be distinguished when pooled and run on the same sequencing flow cell. Eight forward index primers (FIG. 41) and 12 reverse index primers (FIG. 42) were used to label the amplicons from each sample across the 24 chemical environments and the four time points. After a magnetic bead-based cleanup step, a second, 15-cycle PCR was run to attach the standard Illumina paired-end adapter sequences and to amplify the resulting amplicons for sequencing (FIG. 43). After a second magnetic bead-based cleanup, the 24 samples from each time point were pooled and stored at 4°C until sequencing. This entire process was completed using an automated liquid handler programmed with a custom sequencing sample preparation method with the following steps (a version of the protocol is maintained protocols.io https://doi.org/10.17504/protocols.io.bjjzkkp6). The protocol was:
Prepare Sera-Mag SpeedBeads Carboxyl Magnetic Beads:
To a 50 ml conical tube, add 9 g PEG-8000, 10 ml of 5 mol/l NaCI, 500 pl of 1 mol/l Tris-HCI, pH 8, 500 pi of 0.5 mol/l EDTA.
Add water to a final volume of 45 ml and mix until PEG dissolves.
Add 27.5 μl Tween-20 and mix.
Vortex Sera-mag SpeedBeads to resuspend and transfer 1 ml to each of two 2 ml microcentrifuge tubes.
Place SpeedBeads solutions on a magnetic separation rack until beads are drawn to the magnet and solutions are clear. Remove supernatant.
Add 1 ml 1 x TE Buffer to each microcentrifuge tube containing beads, remove from magnetic stand, and vortex.
Repeat the TE wash step two additional times, then resuspend in 1 ml 1 x TE buffer,
After mixing, add both 1 ml SpeedBeads solutions to the 50 mi conical tube containing the PEG solution.
Add water to a final volume of 50 ml and mix gently.
Store in the dark at 4°C until use.
Transfer 35 μl of each plasmid DNA sample to a PCR plate.
Add 4 μl FastDigest Buffer and 1 μl Apal Restriction Enzyme Solution for plasmid DNA linearization; mix 3x by repeated aspiration and dispense.
Prepare a mixture of 4:1 of FastDigest Buffer: Apal beforehand to expedite process.
Incubate samples in PCR plate in automated thermocycler at 37°C for 15 min.
Add 2.5 μl Forward Index Primer and 2.5 μl Reverse Index Primer to each sample (20 μmol/l stock primer solutions, primer sequences are listed in FIG. 40 and FIG. 41 ).
The Forward and Reverse Index Primers attach sample multiplexing tags to the resulting amplicons so that the different samples can be distinguished when they are pooled and run on the same Illumina platform sequencing flow cell. Eight different Forward Index Primers and 12 different Reverse Index Primers are used to uniquely label the amplicons from each sample across the 24 different chemical conditions and the four Growth Plates used for barcode sequencing.
Add 45 pl Phusion Flash 2x Master Mix to each sample; mix 3* by repeated aspiration and dispense. Run the first PCR in automated thermocyder with the foliowing conditions:
Initial denaturation: 98°C for 60 s.
Three cycles:
Denaturation: 98°C for 10 s.
Annealing: 58°C for 20 s.
Elongation: 72°C for 20 s.
Final extension: 72°C for 60 s.
Cooling: 20°C for 15 s.
During the first PCR, pipette 48 μl Magnetic Bead/PEG-NaCI Stock into each of 24 wells in a 96-weli midi plate; mix bead stock thoroughly by repeated aspiration and dispense before each transfer.
When the first PCR is finished, transfer 80 μl from each PCR reaction to a well in the midi plate; mix PCR solution and bead stock 10x by repeated aspiration and dispense; wait for 5 min.
During this step the ratio of Bead Stock to PCR volume is 0.6x. Consequently, because of the relatively low PEG concentration in the mixture, the 5,500 bp plasmid template binds to the beads.
Move the midi plate to 96-well magnet base to pull the magnetic beads out of suspension; wait for 2 min.
Transfer 128 μl of supernatant from each sample to a clean well in the midi plate (still on the magnet base); wait an additional 3 min for final bead separation.
Transfer 118 μl of supernatant (containing the 218 bp PCR amplicon and primers) from each sample to another clean well in the midi plate; move the midi plate from the magnet base to automated shaker. Add 70.8 μl Magnetic Bead Stock (0.6x of 118 μl supernatant volume) to each sample; mix supernatant and bead stock 10x by repeated aspiration and dispense; wait for 5 min. During this step, only the 206 bp PCR amplicon binds to the beads; the 70 bp primers remain in the supernatant.
Move the midi plate back to the magnet base; wait for 5 min.
Remove and discard the supernatant from each well.
Add 200 μl 80% ethanol to each magnetic bead pellet; wait 30 s.
Remove and discard the ethanol supernatant; then, using 50 μl tips, remove residual supernatant from the bottom of each well.
Move the midi plate from magnet base to the automated shaker; allow magnetic bead pellets to dry for 5 min at room temperature.
Add 35 μl nuclease-free water to each sample; resuspend beads by 5* repeated aspiration and dispense.
Mix samples by shaking at 1 ,800 rpm for 10 s; wait for 5 min.
Move the midi plate back to the magnet base; wait for 5 min.
During 5-min wait, pipette 33 μl Phusion Flash 2x Master Mix and 1 .5 pl of each secondary PCR primer (20 μmol/l) into each of 24 wells in a PCR plate.
After 5-min wait, transfer 30 μl of each supernatant from the midi plate (on magnet base) to the PCR plate.
Run the second PCR in the automated thermocycler with the following conditions:
Initial denaturation: 98°C for 60 s.
15 cycles:
-. Denaturation: 98°C for 10 s.
.Annealing and elongation: 72°C for 30 s. Final extension: 72°C for 120 s.
Cooling: 15°C for 20 s.
During the second PCR, pipette 66 μl (1.1 x of the 60 μl PCR volume) Magnetic Bead Stock into each of 24 clean wells in the midi plate (on the automated shaker); mix bead stock well by repeated aspiration and dispense before each transfer.
When the second PCR is finished, transfer 60 μl from each PCR to a well in the midi plate; mix the PCR solution and bead stock 10x by repeated aspiration and dispensing; wait for 5 min.
During this step the ratio of Bead Stock to PCR volume is 1 .1 , the 306 bp PCR amplicon binds to the beads while the 70 bp primers remain in the supernatant.
Move the midi plate back to the magnet base: wait for 4 min.
Remove and discard the supernatant from each well.
Add 200 μl 80% ethanol to each magnetic bead pellet; wait 30 s.
Remove and discard the ethanol supernatant.
Move the midi plate from the magnet base to the automated shaker; allow magnetic bead pellets to dry for 5 min at room temperature.
Add 50 μl Elution Buffer (10 mmol/l Tris-CI, pH 8.0) to each sample; resuspend beads by 5x repeated aspiration and dispense.
Mix samples by shaking at 1 ,800 rpm for 10 s; wait for 5 min.
Move the midi plate back to the magnet base; wait for 5 min.
Transfer the supernatant from each of the 24 samples to a single pooled sequencing sample, one pooled sample per input growth plate. [00159] After each 24-sample sequencing sample preparation, we transferred the resulting pooled samples to a 2-ml microcentrifuge tube and placed the tube on a magnetic separation rack until the remaining beads were drawn to the magnet (approximately 5 min). We then transferred the fully clarified samples to a new 2-ml microcentrifuge tube and stored them at 4°C. We stored the pooled sequencing samples for each growth plate separately until needed for sequencing. The concentration of DNA in each pooled sequencing sample ranged from 10 to 16 ng/pl. DNA concentrations throughout were determined by fluorimetry.
[00160] For sequencing, DNA was diluted to 5 nmol/l and combined with 20% phiX control DNA. DNA from each of the 4 time points was sequenced in a separate lane on an Illumina HiSeqX using paired-end mode with 150 bp in each direction.
[00161] To count DNA barcodes and estimate the fitness associated with each Lacl variant, the sequencing data were analyzed using custom software written in C# and Python, and the Bartender1.1 barcode clustering algorithm (Zhao et al, 2018) (availabe at https://github.com/djross22/nist_lacLlandscape__analysis).
[00162] The sequence of the nominal Illumina compatible amplicon was (with Illumina adapters and flow cell binding sequences in italics):
Figure imgf000067_0001
[00163] The nominal forward and reverse reads from paired-end barcode sequencing were:
Figure imgf000067_0002
and
Figure imgf000068_0001
[00164] The Z's at the beginning of each read are random nucleotides used as unique molecular identifiers (UMIs) to correct for PCR jackpotting (Kivioja et al, 2012), the X’s are the sample multiplexing tag sequences, and the N's are the random nucleotides of the DNA barcodes. To minimize the chances of barcode crosstalk, we used dual barcodes, with independent random barcode sequences an the forward and reverse reads and 27 random nucleotides in each of the forward and reverse barcodes.
[00165] The raw sequences were parsed, and sequences were kept for further analysis only if they passed the following quality criteria for both the forward and reverse reads:
The four bases after the multiplexing tag (underlined in the sequences above) must match the nominal sequence with one allowed mismatch, and the multiplexing tag sequence (X's in the sequences above) must match the nominal sequence for one of the multiplexing tags used with up to three allowed mismatches.
The five flanking bases before and after the barcodes (bold in the sequences above) must match the nominal sequence with one allowed mismatch per set of five bases, and the number of bases in the barcode must be between 35 and 41 (inclusive).
The mean Illumina quality score for the barcode and the five flanking bases before and after the barcode must be greater than 30.
[00166] For the four lanes of HiSeq data, there were 2,024,537,456 raw reads, of which 1 ,576,168,836 reads passed the quality criteria (78%). Note that 20% of the DNA sample loaded onto the HiSeq instrument was phiX DNA.
[00167] True barcode sequences were identified using the Bartender1.1 clustering algorithm (Zhao et at, 2018) with the following parameter settings: maximum duster distance = 4, duster merging threshold = 8, duster seed length = 5, duster seed step = 1 , and frequency cutoff = 500. Barcodes from the forward and reverse reads were clustered independently. The Bartender1.1 clustering algorithm identified 43,259 distinguishable forward barcode clusters and 31 ,055 distinguishable reverse barcode clusters.
[00168] To correct for insertion-deletion read errors, barcode clusters of different length were considered for merging. First, barcode clusters with sequences that were sub-strings of one another were automatically merged. Second, pairs of barcode clusters with a DNA sequence Levenshtein distance of 1 or 2 were merged if the ratio of the smaller cluster read count to the total read count of both clusters was ≤ 0.001 and 0.0001 , respectively. Third, all barcode clusters with a Levenshtein distance ≤ 7 from the barcode for the spike-in control were merged.
[00169] After merging barcode clusters of different lengths, there were 43,169 distinguishable forward barcode clusters and 30,931 distinguishable reverse barcode clusters. The random positions within the forward and reverse barcodes had approximately equal probabilities for each nucleotide, with a mean entropy per position of 1.9799 bits ± 0.0066 bits.
[00170] .After barcode clustering and merging, the barcode sequencing reads were sorted based on the sample multiplexing tags and the barcode read counts were corrected for PCR jackpotting effects. Sets of multiple barcode reads were treated as PCR jackpot duplicates if they had the same UMI sequence, the same multiplexing tag, and the same barcode sequence for both forward and reverse barcode reads. In the corrected barcode count, each set of PCR jackpot duplicates was counted as a single read. Approximately 15% of the total barcode sequencing reads were found to be PCR jackpot duplicates.
[00171] The forward and reverse barcodes were then combined to give the DNA barcodes used to measure the relative abundance of each Lacl variant in the library. An additional barcode count threshold was applied, keeping only DNA barcodes with a total read count (across all 24 environments and 4 time points) greater than 2,000. A small number (139) of DNA barcodes were identified as likely chimeras with forward and reverse barcodes combined from different plasmid templates (Smyth et al , 2010; Schiecht et al , 2017; Omelina et al , 2019). The likely chimera barcodes were not used in further analysis.
[00172] Finally, 14 pairs of DNA barcodes were found with DNA sequence Hamming distance of one (across both forward and reverse barcodes). Only one DNA barcode from each pair was also found in the long-read sequencing data (see Long- read sequencing section, below). In addition, the fitness curves (vs IPTG concentration) were very similar for both barcodes in each pair. Based on this, the read counts associated with each of those 14 pairs of dual barcodes were merged, and each pair was treated as a single DNA barcode.
[00173] The final set of 67,730 DNA barcodes was used for all subsequent analysis to extract estimates of the fitness and dose-response curve associated with each barcode.
[00174] Long-read sequencing
[00175] The full sequence of the library plasmid (pTY 1 ) for every Lacl variant in the library was measured using PacBio circular consensus HiFi sequencing. The stock of E. coli containing the library was grown in media and plasmid was purified by miniprep. Purified plasmid DNA was linearized with BspOl restriction enzyme digest and sequenced. The HiFi sequencing data were used to determine the consensus /ac/ sequence for each variant and the corresponding DNA barcode. Of the 67,731 distinct DNA barcodes, the HiFi sequencing data were used to determine the /ac/ sequences for 63,064 (93%), 3,878 with a single HiFi lacl read, and 59,186 with multiple HiFi lacl reads.
[00176] In addition, the full plasmid sequence was used to detect unintended mutations in the plasmid, i.e., mutations to plasmid regions other than the /ac/ CDS. For analysis of the HiFi read data, the full plasmid sequence was divided into 11 non- overlapping regions that roughly correspond to different functional elements of the plasmid (FIG. 39), and the sequences for each region were extracted from the HiFi reads using a custom bioinformatic pipeline (see, e.g., code available at https://github.com/djross22/nist__lacl__landscape__analysis). The number of unintended mutations to plasmid regions other than the /ac/ CDS was relatively low (FIG. 39), so it was not possible to examine mutational effects with base pair- or residue-level resolution. However, by pooling the mutational information for each region, significant region-specific effects could be detected. To determine if mutations in a region of the plasmid had a significant effect, the estimated Hill equation parameters were compared for all variants with one or more mutations in a given plasmid region vs all variants with zero mutations in that region. Significant differences in the geometric mean of one or more Hill equation parameters were found for variants with mutations in the following regions: tetA (P-value for log10(G∞): 2 x 10-56), KAN (P-value for log10(G∞): 4 x 10-11), origin of replication (P-value for log10(G∞): 6 x 10-14), and YFP (P-value for log10(G0): 4 x 10-109; P-value for log10(G∞): 5 x 10-10; P-value for log10(EC50): 2 x 10-74), where the P-values given are for Welch's unequal-variances t- test.
[00177] In addition, 43 of the 535 variants with the wild-type Lacl amino acid sequence had mutations in the regulatory region (containing the P.aci and Ptacl promoters, the /acO operator, the riboJ insulator, and the RBS sites for both /ac/ and tetA). Of those 43 variants, three had EC50 values that differed by approximately 2-fold or more from the geometric mean value for the wild-type EC50. The Kolmogorov-Smirnov test was used to compare the distributions of EC50 values between the wild-type variants with and without mutations in the regulatory region; the results indicated a significant difference (P-value: 0.024).
[00178] To avoid biasing the results of the machine learning and other quantitative phenotypic analyses, variants were excluded from those analyses if they had one or more mutations in the non-/ac/ regions that show significant mutational effects: tetA, YFP, KAN, the origin of replication, and the regulatory region. After applying this data quality filter in addition to those described above, there were 54,162 variants that we used for further quantitative analysis.
[00179] Library-scale fitness measurement
[00180] The experimental approach for this work was designed to maintain bacterial cultures in exponential growth phase for the full duration of the measurements. So, in all analyses, the Malthusian definition of fitness was used, i.e., fitness is the exponential growth rate (Wu et al, 2013). [00181] The fitness of cells containing each Laci variant was calculated from the change in the relative abundance of DNA barcodes overtime. The spike-in control was used to normalize the DNA barcode count data to enable the determination of the absolute fitness for each Laci variant in the library.
[00182] Briefly, for each Laci variant in each of the 24 chemical environments, the ratio of the barcode read count to the spike-in read count was fit to a function assuming exponential growth and a delay in the onset of the fitness impact of tetracycline. The fitness associated with each variant in each of the 24 chemical environments was determined as a parameter in the corresponding least-squares fit as detailed below.
[00183] The barcode sequencing data were analyzed with a model based on the assumption that the number of cells containing each Laci variant grows with an exponential expansion rate that is independent of all other variants. So, for each sample, at the end of the incubation cycle for Growth Plate j, the number of cells with Laci variant / is:
Figure imgf000072_0001
where, d (= 10) is the dilution factor used in transferring the cell culture from Growth Plate j - 1 to Growth Plate j, Δt (≈ 165 min) is the total incubation time for each growth plate (including time required for automated cell passaging), and μij is the fitness (i.e. mean exponential growth rate) of cells with Laci variant / in Growth Plate j. Note that each growth plate was pre-heated to 37°C before transferring cells from the previous growth plate, so the cell growth rate was assumed to be unaffected by temperature changes during passaging.
[00184] For samples without tetracycline, the chemical composition of the media was the same for all growth plates, so the fitness is assumed to be constant,
Figure imgf000072_0002
where μi 0 is the fitness associated with Laci variant / in the absence of tetracycline. Consequently, the number of cells in each Growth Plate for samples grown without tetracycline is:
Figure imgf000073_0001
where is the number of ceils with Lad variant / at the end of Growth Plate j for
Figure imgf000073_0006
samples grown without tetracycline.
[00185] For samples grown with tetracycline, the tetracycline was only added to the culture media for Growth Plates 2-4. Because of the mode of action of tetracycline (inhibition of translation), there was a delay in its effect on cell fitness: immediately after diluting cells into Growth Plate 2 (the first plate with tetracycline), the ceils still had a normal level of proteins needed for growth and proliferation and they continued to grow at nearly the same rate as without tetracycline. Over time, as the level of proteins required for ceil growth decreased due to tetracycline, the growth rate of the ceils decreased. Accordingly, the analysis accounts for the variation in ceil fitness (growth rate) as a function of time after the ceils were exposed to tetracycline. With the assumption that the fitness is approximately proportional to the number of proteins needed for growth, the fitness as a function of time is taken to approach the new value with an exponential decay:
Figure imgf000073_0002
is the steady-state fitness with tetracycline, and α is a transition rate. The
Figure imgf000073_0003
transition rate was kept fixed at a = log(5), determined from a small-scale calibration measurement. Note that at the tetracycline concentration used during the library-scale measurement (20 μg/ml),
Figure imgf000073_0005
was greater than zero even at the lowest G(L) levels (FIG. 16). From equation (3), the number of ceils in each Growth Plate for samples grown with tetracycline is:
Figure imgf000073_0004
[00186] The barcode read count for variant i in Growth Plate j was assumed to be proportional to the cell number:
Figure imgf000074_0001
where a/ is a proportionality constant associated with variant /, and bj is a proportionality constant associated with Growth Plate j. The proportionality constant a, can be different for each variant / due to differences in PCR amplification efficiency resulting from variations in the barcode sequences on each amplicon. Similarly, the proportionality constant bj can be different for each Growth Plate because of sample-to-sample variations in the DNA extraction efficiency or differences in PCR efficiency associated with different sample multiplexing tag sequences.
[00187] The logarithm of the read count normalized by the spike-in read count was used to estimate the fitness of each variant from its associated barcode read count:
Figure imgf000074_0002
[00188] For samples without tetracycline,
Figure imgf000074_0006
was estimated for each variant using a weighted linear least-squares fit to the log-count ratio vs j:
Figure imgf000074_0003
[00189] For samples grown with tetracycline
Figure imgf000074_0004
was estimated for each variant with a weighted least-squares fit to the nonlinear form for the log-count ratio:
Figure imgf000074_0005
Figure imgf000075_0001
[00191] For the fitness landscape measurement, there were a large number of outliers for the read count measurements from three of the samples: Growth Plate 3, without tetracycline, [IPTG] = 8 μmol/l; Growth Plate 4, without tetracycline, [IPTG] = 64 μmol/l and [IPTG] = 2,048 μmol/l. These three samples were excluded from the analysis.
[00192] Dose-response curve measurements
[00193] Plasmids pTY1 and pVER were engineered to provide two independent measurements of the dose-response curve for Lacl variants. First, in pTY1 , Lacl regulates the expression of a tetracycline resistance gene (tetA) that enables determination of the dose-response from barcode sequencing data by comparing the fitness measured with tetracycline to the fitness measured without tetracycline. Second, in pVER, Lacl regulates the expression of a fluorescent protein (YFP) that enables direct measurement of the dose-response curve with flow cytometry.
[00194] A set of nine randomly selected Lacl variants were used to calibrate the estimation of regulated gene expression output from the barcode sequencing fitness measurements (FIG. 16). The calibration data consisted of the fitness data for each calibration variant from the library barcode sequencing measurement (using the library plasmid, pTY1 ) and flow cytometry data for each calibration variant prepared as a clonal culture (using the verification plasmid, pVER). These data were fit to a Hill equation model for the fitness impact of tetracycline as a function of the regulated gene expression level, G:
Figure imgf000076_0001
where is the fitness with tetracycline,
Figure imgf000076_0002
is the fitness without tetracycline, Δf is the maximal fitness impact of tetracycline (when G = 0), G50 is the gene expression level that produces a 50% recovery in fitness, and nf characterizes the steepness of the fitness calibration curve. Because the fitness calibration curve, equation (9), is nonlinear, it cannot be directly inverted to give the regulated gene expression level for all possible fitness measurements. So, two Bayesian inference models were used to estimate the dose-response curves for every Lacl variant in the library using the barcode sequencing fitness measurements. Source code for both models is included in the software archive at https://github.com/djross22/nist__lacl__landscape__analysis. Both inference models used equation (9) to represent the relationship between fitness and regulated gene expression. The parameters lf, G50, and were included in both inference models as parameters with informative priors. Priors for G50 and nr- were based on the results of the fit to the fitness calibration data (FIG. 16: G50 ~ normal(mean = 13,330, SD = 500), nf ~ normalfmean = 3.24, SD = 0.29). We chose the prior for Δf based on an examination of μtet/μ0-1 μtet/μ0-1 measured with zero IPTG: Δf ~ exponentially-modified-normal(mean = 0.720, SD = 0.015, rate ™ 14). The use of a prior for Δf with a broad right-side tail was important to accommodate variants in the library for which
Figure imgf000076_0003
was systematically less than -0.722
[00195] The first Bayesian inference model assumed that the dose-response curve for each Lacl variant was described by the Hill equation. The Hill equation parameters for each variant, G«, Go, EC50, and n and their associated uncertainties were determined using Bayesian parameter estimation by Markov Chain Monte Carlo (MCMC) sampling with PyStan (Carpenter et a/, 2017). Broad, flat priors were used for log10(Go), log10(G~), and log10(EC50), with error function boundaries to constrain those parameter estimates to within the measurable range (100 MEF ≤ Go, G- < 50,000 MEF; 0.1 μmol/l ≤ EC50.i ≤ 40,000 μmol/l). The prior for ni was a gamma distribution with shape parameter of 4.0 and inverse scale parameter of 3.33. The inference model was run individually for each Lacl variant, with four independent chains, 1 ,000 iterations per chain (500 warmup iterations), and the adapt„delta parameter set to 0.9. Testing with data from a set of randomly selected variants indicated that these settings for the Stan sampling algorithm typically produced a Gelman-Rubin
Figure imgf000077_0001
diagnostic ≤ 1.05 and number of effective iterations > 100.
[00196] The second Bayesian inference model was a non-parametric GP model (Rasmussen & Williams, 2005) that assumed only that the dose-response curve for each Lacl variant was a smooth function of IPTG concentration. The GP model was used to determine which variants had band-pass or band-stop phenotypes. The GP model was also implemented using MCMC sampling with PyStan (Carpenter et al, 2017). The GP inference model was run individually for each variant, with four independent chains, 1 ,000 iterations per chain (500 warmup iterations), and the adapt_delta parameter set to 0.9. Testing with data from a set of randomly selected variants indicated that these settings for the Stan sampling algorithm of the GP model typically produced a Gelman-Rubin
Figure imgf000077_0002
diagnostic less than 1.02 and number of effective iterations greater than 200.
[00197] Flow cytometry measurements
[00198] Over 100 Lacl variants from the library were chosen for flow cytometry verification of the dose-response curves. The CDSs of these variants were chemically synthesized. Each synthesized sequence was digested with restriction enzymes Xhol and Sgsl, and ligated into the verification plasmid, pVER, and then transformed into MG1655A/ac. Transformants were plated on LB agar supplemented with kanamycin and 0.2% glucose. Lacl variant sequences were verified with Sanger sequencing (Psomagen USA). For flow cytometry measurements of dose-response curves, a culture of E. coli containing pVER with a chosen variant sequence was distributed across 12 wells of a 96-well plate and grown to stationary phase using the automated microbial growth system. After growth to stationary phase, cultures were diluted 50-fold into a plate containing the same 12 IPTG concentrations used during the fitness landscape measurement (0-2,048 μmol/l). In some cases, higher IPTG concentrations were used to capture the full dose-response curves of selected variants (e.g., FIG. 11 , FIG. 12, FIG. 13, FIG. 14). Cultures were then grown for 160 min (- 3.3 generations) before being diluted 10-fold into the same IPTG gradient and grown for another 160 min. Then, 5 μl of each culture was diluted into 195 μl of PBS supplemented with 170 μg/ml chloramphenicol and incubated at room temperature for 30-60 min to halt the translation of YFP and allow extant YFP to mature in the cells.
[00199] Samples were measured on an Attune NxT flow cytometry with autosampler using a 488 nm excitation laser and a 530 nm ± 15 nm band-pass emission filter. Blank samples were measured with each batch of cell measurements, and an automated gating algorithm was used to discriminate cell events from non-cell events (FIG. 33A and B). With the Attune cytometer, the area and height parameters for each detection channel are calibrated to give the same value for singlet events. So, to identify singlet cell events and exclude multiplet cell events, a second automated gating algorithm was applied to select only cells with side scatter area side scatter height (FIG. 33C and D). All subsequent analysis was performed using the singlet cell event data. Fluorescence data were calibrated to molecules of equivalent fluorophore (MEF) using fluorescent calibration beads. The cytometer was programmed to measure a 25 μl portion of each cell sample, and the 40-fold dilution used in the cytometry sample preparation resulted in approximately 20,000 singlet cell measurements per sample. The geometric mean of the YFP fluorescence was used as a summary statistic to represent the regulated gene expression level as a function of the input ligand concentration, [IPTG] for each Lacl variant. An autofluorescence control (strain MG1655A/ac with a plasmid similar to pVER but lacking the YFP gene) was also measured with flow cytometry and analyzed in the same way as other variants. The regulated gene expression output level for each variant is reported as the geometric mean of the measured fluorescence forthat variant minus the geometric mean of the measured fluorescence for the zero-fluorescence control (92 MEF).
[00200] Hill equation parameters were estimated from the flow cytometry data using Bayesian parameter estimation by Markov Chain Monte Carlo (MCMC) sampling with PyStan (Carpenter et al, 2017). The Bayesian inference model used for flow cytometry data analysis assumed that the flow cytometry data resulted from a Hill equation response plus normally distributed measurement errors. The model used the same priors for the Hill equation parameters as described above.
[00201] Calculation of abundance for Lacl phenotypes [00202] The relative abundance of the various Lad phenotypes (FIG. 9) was estimated using the results of both Bayesian inference models (Hill equation and GP). Variants were labeled as “flat response” if the Hill equation model and the GP model agreed (i.e., if the median estimate for the Hill equation dose-response curve was within the central 90% credible interval from the GP model at all 12 IPTG concentrations) and if the posterior probability for G0 > G∞ was between 0.05 and 0.95 (from the Hill equation model inference). Variants were labeled as having a negative response if the slope,
Figure imgf000079_0001
was negative at one or more IPTG concentrations with 0.95 or higher posterior probability (from the GP model inference). To avoid false positives from end effects, this negative slope criteria was only applied for IPTG concentrations between 2 μmol/l IPTG and 1 ,024 μmol/L Variants were labeled as “always on” (the phenotype from reference (Markiewicz et at, 1994)) if they were flat- response and if G(0) was greater than 0.25 times the wild-type G∞ value with 0.95 or higher posterior probability (from the GP model inference). Variants were labeled as “always off” (the /s phenotype from reference (Markiewicz et al, 1994)) if they were flat-response but not always on. Variants were labeled as band-stop or band-pass if the slope,
Figure imgf000079_0002
was negative at some IPTG concentrations and positive at other IPTG concentrations, both with 0.95 or higher posterior probability (from the GP model inference). Band-stop and band-pass variants were distinguished by the ordering of the negative-slope and positive-slope portions of the dose-response curves. Variants that had a negative response but that were not band-pass or band-stop, were labeled as inverted. False-positive rates were estimated for each phenotypic category by manually examining the fitness vs IPTG data for Lacl variants with less than three substitutions. Typical causes of false-positive phenotypic labeling included unusually high noise in the fitness measurement and biased fit results due to outlier fitness data points. Estimated false-positive rates ranged between 0.001 and 0.005. The relative abundance values shown in FIG. 9A were corrected for false positives using the estimated rates.
[00203] Comparison of synonymous mutations
[00204] The library contained a set of 39 variants with the wild-type /ac/ CDS (each with a different DNA barcode), and a set of 310 variants with only synonymous nucleotide changes (i.e., no amino acid substitutions). Both sets had long-read sequencing coverage for the entire plasmid and were screened to retain only variants with zero unintended mutations in the piasmid (i.e., no mutations in regions of the piasmid other than the /ac/ CDS) The Hiii equation fit resuits for those two sets were compared to determine whether synonymous nucieotide changes significantly affected the phenotype. The Koimogorov-Smirnov test was used to compare the distributions of Hill equation parameters between these two sets. The resulting P- values (0.71 , 0.40, 0.28, and 0.17 for G0, G∞, EC50, and n, respectively) indicate that there were no significant differences between them. Additionally, the library contained 40 sets of variants, each with four or more synonymous CDSs (including the set of synonymous wild-type sequences and 39 non-wild-type sequences). A hierarchical model was used to compare the Hill equation parameters within each set of synonymous CDSs. Within each set, the uncertainty associated with individual variants was typically larger than the variant-to-variant variability estimated by the hierarchical model. Overall, these results indicate that synonymous SNPs did not measurably impact the Lacl phenotype, so only the amino acid sequences were considered for any subsequent quantitative genotype-to-phenotype analysis.
[00205] Analysis of single-substitution data
[00206] The single amino acid substitution results presented in FIG. 3, FIG. 6B and C, FIG. 23, FIG. 24, FIG. 25, and FIG. 26 are a combination of direct experimental observations, DNN model results, and estimates of G0 for missing substitutions.
[00207] For direct experimental observations, multiple Lacl variants were often present in the library with the same single substitution. To ensure that the highest quality data was used for the single-substitution analysis, only data for variants with more than 5,000 total barcode reads were used. For each single substitution, if there was only one Lacl variant with more than 5,000 barcode reads, the median and standard deviation for each parameter were used directly from the Bayesian inference using the Hill equation model. If there was more than one Lacl variant with a given single substitution and more than 5,000 barcode reads, the consensus Hill equation parameter values and standard deviations for that substitution were calculated using a hierarchical model based on the eight schools model (Rubin, 1981 ; see also information at https://mc-stan.org/users/documentation/case- studies/divergences_and_bias.html). The hierarchical model was applied separately for each Hill equation parameter. The logarithm of the parameter values was used as input to the hierarchical model, and the input data were centered and normalized by 1.15 x the minimum measurement uncertainty. The standard normal distribution was used as a loosely informative prior for the consensus mean effect, and a half-normal prior (mean = 0.5, SD = 1) was used forth© normalized consensus standard deviation (i.e. , hierarchical standard deviation). These priors and normalization were chosen so that the model gave intuitively reasonable results for the consensus of two Lacl variants (i.e., close to the results for the Lacl variant with the lowest measurement uncertainty). Results for the hierarchical model were determined using Bayesian parameter estimation by Markov Chain Monte Carlo (MCMC) sampling with PyStan (Carpenter et al, 2017). MCMC sampling was run with four independent chains, 10,000 iterations per chain (5,000 warmup iterations), and the adapt_delta parameter set to 0.975.
[00208] For G0, the direct experimental results were used for the 1 ,047 substitutions plotted as gray points or red points and error bars in FIG. 3D and FIG. 26. In addition, estimated values were used for the 83 missing substitutions that have been previously shown to result in an “always on” Lacl phenotype (i.e., the /” phenotype (Markiewicz et at, 1994; Pace et at, 1997)). For these substitutions, plotted as pink-gray points and error bars in FIG. 3D, the median value was estimated to be equal to the wild-type value for G∞ (24,000 MEF), and the geometric standard deviation was estimated to be 4-fold, both based on information from previous publications (Markiewicz et al, 1994; Pace et al, 1997). Note that these 83 substitutions are completely missing from the experimental landscape dataset reported here, i.e., they are not found in any Lacl variant, as single substitutions or in combination with other substitutions.
[00209] For G∞ and EC50, the direct experimental results were used for the 964 substitutions that are found as single substitutions in the library and that have a consensus standard deviation for log10(EC50) less than 0.35. An additional 74 substitutions are found as single substitutions in the library, but with higher EC50 uncertainty. For these substitutions, either EC50 is comparable to or higher than the maximum ligand concentration used for the measurement (2,048 μmol/l IPTG), or G∞ is comparable to G0 (or both). Consequently, the dose- response curve is flat or nearly flat across the range of concentrations used, and the Bayesian inference used to estimate the Hill equation parameters results in EC50 and G∞ estimates with large uncertainties. The DNN model can provide a better parameter estimate for these flat-response variants because it uses data and relationships from the full library (e.g., the log-additivity of EC50) to predict parameter values for each single substitution. So, the DNN model results were used for these 74 substitutions. Finally, the DNN model results were used for an additional 953 substitutions that are found in the library, but only in combination with other substitutions (i.e. , not as single substitutions).
[00210] Identification of high-frequency substitutions and structural features associated with inverted and band-stop phenotypes
[00211] The set of 43 strongly inverted Laci variants discussed in the main text and used for the plots in FIG. 5A and C were identified by the following criteria: and EG50 between 3 μmol/l and
Figure imgf000082_0001
1 ,000 μmol/l. The set of 31 strong band-stop variants discussed in the main text and used for the plots in FIG. 5B and D were identified by the following criteria:
Figure imgf000082_0002
and the slope,
Figure imgf000082_0003
of less than -0.07 at low IPTG concentrations and greater than zero at higher IPTG concentrations, both with 0.95 or higher posterior probability (from the GP model inference). To avoid false positives due to noise in the DNA barcode counting, only Laci variants with a total DNA barcode read count greater than 3,000 were included. Also, the sets of strongly inverted and strong band-stop variants were manually screened for additional likely false positives due to outlier fitness data points.
[00212] A hypergeometric test was used to determine the amino acid substitutions that occur more frequently in the set of strongly inverted or strong band- stop variants than in the full library (the set of 52,321 variants with more than 3,000 DNA barcode reads and with the /ac/ sequences determined by long-read sequencing). For each possible substitution, the cumulative hypergeometric distribution was used to calculate the probability of the observed number of occurrences of that substitution in the set of inverted or band-stop variants under a null model of no association. This probability was used as a P-value for the null hypothesis that the observed number of inverted or band-stop variants with that substitution resuited from an unbiased random selection of variants from the full library. Substitutions were considered to occur at significantly higher frequency if they had a P-value ≤ 0.005 and if they occurred more than once in the set of inverted or band-stop variants. In the set of strongly inverted variants, 10 associated (higher frequency) amino acid substitutions were identified: S70I, K84N, D88Y, V96E, A135T, V192A, G200S, Q248H, Y273H, and A343G. In the set of strong band-stop variants, eight associated substitutions were identified: V4A, A92V, H179Q, R195H, G178D, G265D, D292G, and R351 G. To estimate the number of false positives, random sets of Lacl variants were chosen with the same sample size as the strongly inverted (43) or the strong band-stop (31) variants and the same significance criteria was applied. From 300 independent iterations of the random selection, the estimated mean number of false-positive substitutions was 2.1 and 2.3 for the inverted and band-stop phenotypes, respectively. Statistics for the test results are given in FIG. 36 and FIG. 37.
[00213] A similar procedure was used to determine which structural features within the protein are mutated with higher frequency in the inverted or band-stop Lacl variants. The structural features considered were the secondary structures from the complete crystal structure of Lacl (Lewis et al, 1996), as well as larger structural features (N-terminal core domain, C-terminal core domain, DNA-binding domain, dimer interface) and functional domains (ligand-binding, core-pivot). All of the domains and features included in the analysis are listed in FIG. 35. The P-value threshold used for significance was 0.025. For the strongly inverted variants, five features were identified with a higher frequency of amino acid substitutions: the dimer interface, residues within 7 A of the ligand-binding pocket, helix 5, helix 11 , and p-strand I. For the strong band-stop variants, three features were identified: the C-terminal core, p- strand J, and helix 9. From 300 independent random selections of variants from the full library, the estimated mean number of false-positive features was 0.38 and 0.51 for the inverted and band-stop phenotypes, respectively. Statistics for the test results are given in FIG. 36 and FIG. 37.
[00214] Deep neural network modeling [00215] The dataset was pruned to a set of high-quality sequences for DNN modeling. Specifically, data for a Lacl variant was only used for modeling if it satisfied the following criteria:
No mutations were found in the long-read sequencing results for the regions of the plasmid encoding kanamycin resistance, the origin of replication, the tetA and YFP genes, and the regulatory region containing the promoters and ribosomal binding sites for /ac/ and tetA (FIG. 39).
The total number of barcode read counts for a Lacl variant was > 3,000.
The number of amino acid substitutions was ≤ 14.
The measurement uncertainty for log10(G∞) was ≤ 0.7.
The results of the Hill equation model and the GP model agreed at all 12 IPTG concentrations. More specifically, data were only used if the median estimate for the dose-response curve from the Hill equation model was within the central 90% credible interval from the GP model at all 12 IPTG concentrations.
[00216] After applying the quality criteria listed above, 47,462 Lacl variants remained for DNN modeling. The data were used to train the DNN model to predict the Hill equation parameters G0, G∞, and EC50 as detailed below.
[00217] Amino acid sequences were represented as one-hot encoded vectors of length L = 2,536, and with mutational paths represented as K x L tensors for a sequence with K substitutions. The logarithm of the Hill equation parameter values were normalized to a standard deviation of 1 , and then shifted by the corresponding value of the wild-type sequence in order to correctly represent the prediction goal of the change in each parameter relative to wild-type Lacl. A long-term, short-term recurrent neural network was selected for the underlying model (Hochreiter & Schmidhuber, 1997), with 16 hidden units, a single hidden layer, and hyperbolic tangent (tanh) non-linearities. Inference was performed in pytorch (preprint: Paszke et al, 2019) using the Adam optimizer (preprint: Kingma & Ba, 2017). For EC50 and G0, the contribution of individual data points to the regression loss were weighted inversely proportional to their experimental uncertainty. Model selection was performed with 10- fold cross-validation on the training set (80% of all available data). Approximate Bayesian inference was performed with the Bayes-by-backprop approach (preprint: Blundell et al , 2015). Briefly, this substitutes the point-estimate parameters of the neural network with variational approximations to a Bayesian model, represented as a mean and variance of a normal random variable. Effectively, this only doubles the number of parameters in the model. A mixture of two normal distributions was used as a prior for each parameter weight, with the two mixture components having high and low variance, respectively. This prior emulates a sparsifying spike-slab prior while remaining tractable for inference based on back propagation. Posterior means of each weight were used to calculate posterior predictive means, while Monte Carlo draws from the variational posterior were used to calculate the model prediction uncertainty (FIG. 20).
[00218] Variational approximations typically underestimate uncertainty. So, to correct the uncertainty estimates, the model prediction uncertainty obtained from the variational approximation was compared with the model root-mean-square error (RMSE; i.e., the root-mean-square difference between the model prediction and the experimental measurement). For all three Hill equation parameters (G0, G∞, and EC50), both the prediction uncertainty and the RMSE increase with the number of amino acid substitutions relative to wild-type sequence (FIG. 20A and B), and the RMSE at each substitutional distance is an approximately linear function of the median model uncertainty (Appendix Fig S13C). So, for the single-substitution analysis (FIG. 3, FIG. 6B and C, FIG. 26), the uncertainties from the variational approximation were multiplied by a factor of 3.8. This rescaled the uncertainties so that the median uncertainty was approximately equal to the RMSE for each substitutional distance.
[00219] We compared the performance of the recurrent DNN model against two alternative models (FIG. 18B). First, we trained a linear-additive model, which assumes that each parameter is log-additive and so only learns the average effect of each amino substitution across the entire dataset. Second, we trained a more conventional feed-forward neural network. The feed-forward neural network had four hidden layers, with 32, 64, 64, and 32 hidden units, respectively. Each hidden layer had a rectified linear unit (ReLU) non-linearity and a batch-normalization step between each layer. Both the linear-additive and feed-forward models were constructed and trained with pytorch (preprint: Paszke etal, 2019), using the Adam optimizer (preprint: Kingma & Ba, 2017) and a learning rate of 10-3.
[00220] Data availability
[00221] Long-read and short-read DNA sequencing: NCBI BioProject PRJNA643436 (available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA643436).
[00222] pTY1 plasmid sequence: NCBI GenBank MT702633 (availabe at https://www.ncbi.nlm.nih.gov/nuccore/MT702633).
[00223] pVER plasmid sequence: NCBI GenBank MT702634 (availabe at https://www.ncbi.nlm.nih.gov/nuccore/MT702634).
[00224] The processed data table containing comprehensive data and information for each Lacl variant in the library is publicly available via the NIST Science Data Portal, with the identifier ark:/88434/mds2-2259 (avaiable at https://data.nist.gov/od/id/mds2-2259 or https://doi.org/10.18434/M32259). The data table includes the DNA barcode sequences, the barcode read counts for each sample and time point used for the library-scale measurement, fitness estimates for each barcoded variant across the 24 chemical environments, the results of both Bayesian inference models (including posterior medians, covariances, and 0.05, 0.25, 0.75, and 0.95 posterior quantiles), the Lacl CDS and amino acid sequence for each barcoded variant (as determined by long-read sequencing), the number of Lacl CDS reads in the long-read sequencing dataset for each barcoded variant, and the number of unintended mutations in other regions of the plasmid (from the long-read sequencing data).
[00225] All data analysis code is available at https://github.com/djross22/nist__lacLlandscape__analysis.
[00226] REFERENCES
[00227] Citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention. All references cited herein are hereby incorporated by reference in their entirety. Below is a listing of various references cited herein: [00228] Alperovich N, Romantseva J, Vasilyeva O, Ross D (2020a) Automation protocol for plasmid DNA extraction from E. coli. protocols.io https://doi.org/! 0.17504/protocols.io.bjjvkkn6.
[00229] Alperovich N, Tack DS, Vasilyeva O, Levy SF, Ross D
(2020b) Automation protocol for DNA barcode sequencing library preparation. protocols.io https://doi.org/10.17504/protocols.io.bjjzkkp6.
[00230] Bell CE, Barry J, Matthews KS, Lewis M (2001 ) Structure of a variant of lac repressor with increased thermostability and decreased affinity for operator. J Mol 8/0/ 313: 99-109.
[00231] Bivona L, Zou Z, Stutzman N, Sun PD (2010) Influence of the second amino acid on recombinant protein expression. Protein Expr Purif 74: 248-256.
[00232] Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural networks. arXiv https://arxiv.org/pdfZ1505.05424.pdf.
[00233] Bouhaddou M, Birtwistle MR (2014) Dimerization-based control of cooperativity. Mol Biosyst 10: 1824-1832.
[00234] Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A (2017) Stan: a probabilistic programming language. J Stat Softw 76: 1-32.
[00235] Chure G, Razo-Mejia M, Belliveau NM, Einav T, Kaczmarek ZA, Barnes SL, Lewis M, Phillips R (2019) Predictive shifts in free energy couple mutations to their phenotypic consequences. Proc Natl Acad Sci USA 116: 18275-18284.
[00236] Daber R, Sochor MA, Lewis M (2011 ) Thermodynamic analysis of mutant lac repressors. J Mol Biol 409: 76-87.
[00237] Domingo J, Diss G, Lehner B (2018) Pairwise and higher-order genetic interactions during the evolution of a tRNA. Nature 558: 117-121 .
[00238] Fenton AW (2008) Allostery: an illustrated definition for the ‘second secret of life’. Trends B/ochem Sci 33: 420-425. [00239] Flynn TC, Swint-Kruse L, Kong Y, Booth C, Matthews KS, Ma J (2003) Allosteric transition pathways in the lactose repressor protein core domains: asymmetric motions in a homodimer. Protein Sci 12: 2523-2541.
[00240] Garcia-Pino A, Balasubramanian S, Wyns L, Gazit E, De Greve H, Magnuson RD, Charlier D, van Nuland NAJ, Loris R (2010) Allostery and intrinsic disorder mediate transcription regulation by conditional cooperativity. Cell 142: 1 QI- 111.
[00241] He X, Liu L (2016) Toward a prospective molecular evolution. Science 352: 769-770.
[00242] Hecht A, Glasgow J, Jaschke PR, Bawazer LA, Munson MS, Cochran JR, Endy D, Salit M (2017) Measurements of translation initiation from all 64 codons in E. coll. Nucleic Acids Res 45: 3615-3626.
[00243] Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput Q: 1735-1780.
[00244] Huang P-S, Boyken SE, Baker D (2016) The coming of age of de novo protein design. Nature 537: 320-327.
[00245] Kim MK, Kang YK (1999) Positional preference of proline in alpha- helices. Protein Sc/ 8: 1492-1499.
[00246] Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. arXiv https://arxiv.org/pdf/1412.6980v9.
[00247] Kivioja T, Vaharautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J (2012) Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods 9: 72-74.
[00248] Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157: 105-132.
[00249] Leander M, Yuan Y, Meger A, Cui Q, Raman S (2020) Functional plasticity and evolutionary adaptation of allosteric regulation. Proc Natl Acad Sci USA 117: 25445-25454. [00250] Lewis M, Chang G, Horton NC, Kercher MA, Pace HC, Schumacher MA, Brennan RG, Lu P (1996) Crystal structure of the lactose operon repressor and its complexes with DNA and inducer. Science 271 : 1247-1254.
[00251] Li C, Qian W, Maclean CJ, Zhang J (2016) The fitness landscape of a tRNA gene. Science 352: 837-840.
[00252] Li C, Zhang J (2018) Multi-environment fitness landscapes of a tRNA gene. Nat Ecol Evol 2: 1025-1032.
[00253] Markiewicz P, Kleina LG, Cruz C, Ehret S, Miller JH (1994) Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as ‘Spacers' which do not require a specific sequence. J Mol Biol 240: 421-433.
[00254] Marzen S, Garcia HG, Phillips R (2013) Statistical mechanics of Monod-Wyman-Changeux (MWC) Models. J Mol Biol 425: 1433-1460.
[00255] Meyer S, Ramot R, Kishore Inampudi K, Luo B, Lin C, Amere S, Wilson CJ (2013) Engineering alternate cooperative-communications in the lactose repressor protein scaffold. Protein Eng Des Se/ 26: 433-443.
[00256] Miao Z, Cao Y (2016) Quantifying side-chain conformational variations in protein structure. Sc/ Rep 6: 37024.
[00257] Monod J, Wyman J, Changeux JP (1965) On the nature of allosteric transitions: a plausible model. J Mol Biol 12: 88-118.
[00258] Motlagh HN, Wrabl JO, Li J, Hilser VJ (2014) The ensemble nature of allostery. Nature 508: 331-339.
[00259] Myers GL, Sadler JR (1971) Mutational inversion of control of the lactose operon of Escherichia coli. J Mol Biol 58: 1-28.
[00260] Omelina ES, Ivankin AV, Letiagina AE, Pindyurin AV (2019) Optimized PCR conditions minimizing the formation of chimeric DNA molecules from MPRA plasmid libraries. BMC Genom 20: 536. [00261] Onufriev A, Ullmann GM (2004) Decomposing complex cooperative ligand binding into simple components: connections between microscopic and macroscopic models. J Phys Chem B 108: 11157—11169.
[00262] Pace HC, Kercher MA, Lu P, Mackiewicz P, Miller JH, Chang G, Lewis M (1997) Lac repressor genetic map in real space. Trends Biochem Sci22: 334-339.
[00263] Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimeishein N, Antiga L et al (2019) PyTorch: an imperative style. High- performance deep learning library. arXiv https://arxiv.org/pdf/1912.01703v1.
[00264] Poelwijk FJ, de Vos MGJ, Tans SJ (2011) Tradeoffs and optimality in the evolution of gene regulation. Ce// 146: 462-470.
[00265] Popovych N, Sun S, Ebright RH, Kalodimos CG (2006) Dynamically driven protein allostery. Nat Struct Mot Biot 13: 831-838.
[00266] Pressman AD, Liu Z, Janzen E, Blanco C, Muller UF, Joyce GF, Pascal R, Chen IA (2019) Mapping a systematic ribozyme fitness landscape reveals a frustrated evolutionary network for self-aminoacylating RNA. J Am Chem Soo 141 : 6213-6223.
[00267] Puchta O, Cseke B, Czaja H, Tollervey D, Sanguinetti G, Kudla G (2016) Network of epistatic interactions within a yeast snoRNA. Science 352: 840- 844.
[00268] Raman S, Taylor N, Genuth N, Fields S, Church GM (2014) Engineering allostery. Trends Genet 30: 521-528.
[00269] Rasmussen CE, Williams CKI (2005) Gaussian processes for machine learning (Adaptive computation and machine learning). Cambridge, MA: The MIT Press.
[00270] Razo-Mejia M, Barnes SL, Belliveau NM, Chure G, Einav T, Lewis M, Phillips R (2018) Tuning transcriptional regulation through signaling: a predictive theory of allosteric induction. Cell Systems 6: 456-469. e10. [00271] Richardson JS, Richardson DC (1988) Amino add preferences for specific locations at the ends of alpha helices. Science 240: 1648-1652.
[00272] Rolfes RJ, Zalkin H (1990) Purification of the Escherichia coli purine regulon repressor and identification of corepressors. J Bacteriol 172: 5637-5642.
[00273] Rubin DB (1981) Estimation in parallel randomized experiments. Journal of Educational Statistics 6: 377-401.
[00274] Sarkar S, Tack D, Ross D (2020) Sparse estimation of mutual information landscapes quantifies information transmission through cellular biochemical reaction networks. Commun Biol 3: 1-8.
[00275] Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, Ivankov DN, Bozhanova NG, Baranov MS, Soylemez O et al (2016) Local fitness landscape of the green fluorescent protein. Nature 533: 397- 401.
[00276] Schlecht U, Liu Z, Blundell JR, St.Onge RP, Levy SF (2017) A scalable double-barcode sequencing platform for characterization of dynamic protein- protein interactions. Nat Commun 8: 15586.
[00277] Smyth RP, Schlub TE, Grimm A, Venturi V, Chopra A, Mallal S, Davenport MP, Mak J (2010) Reducing chimera formation during PCR amplification to ensure accurate genotyping. Gene 469: 45-51 .
[00278] Spronk CAEM, Slijper M, van Boom JH, Kaptein R, Boelens R (1996) Formation of the hinge helix in the lac represser is induced upon binding to the lac operator. Nat Struct Biol 3: 916-919.
[00279] Swint-Kruse L, Elam CR, Lin JW, Wycuff DR, Matthews KS (2001) Plasticity of quaternary structure: twenty-two ways to form a Lacl dimer. Protein Sc/ 10: 262-276.
[00280] Swint-Kruse L, Zhan H, Fairbanks BM, Maheshwari A, Matthews KS (2003) Perturbation from a distance: mutations that alter Lacl function through long- range effects. Biochemistry 42: 14004-14016. [00281] Tack DS, Alperovich N, Vasilyeva O, Ross D (2020) Assembly of plasmids for Lacl genotype-phenotype landscape measurement, protocois.io.
[00282] Taylor ND, Garruss AS, Moretti R, Chan S, Arbing MA, Cascio D, Rogers JK, Isaacs FJ, Kosuri S, Baker D et al (2016) Engineering an allosteric transcription factor to respond to new ligands. Nat Meth 13: 177—183.
[00283] Tzeng S-R, Kalodimos CG (2012) Protein activity regulation by conformational entropy. Nature 488: 236-240.
[00284] Wenger AM, Peluso P, Rowell WJ, Chang P-C, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND etal (2019) Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol 37: 1 155-1162.
[00285] Wu B, Gokhale CS, van Veelen M, Wang L, Traulsen A (2013) Interpretations arising from Wrightian and Malthusian fitness under strong frequency dependent selection. Eco/ Evo/ 3: 1276-1280.
[00286] Zamyatnin AA (1972) Protein volume in solution. Prog Biophys Mot Biot 24: 107-123.
[00287] Zhan H, Camargo M, Matthews KS (2010) Positions 94-98 of the lactose repressor N-subdomain monomer-monomer interface are critical for allosteric communication. Biochemistry 49: 8636-8645.
[00288] Zhao L, Liu Z, Levy SF, Wu S (2018) Bartender: a fast and accurate clustering algorithm to count barcode reads. Bioinformatics 34: 739-747
[00289] Markiewicz, P., Kleina, L. G., Cruz, C., Ehret, S. & Miller, J. H. Genetic Studies of the lac Repressor. XIV. Analysis of 4000 Altered Escherichia coli lac Repressors Reveals Essential and Non-essential Residues, as well as ‘Spacers’ which do not Require a Specific Sequence. Journal of Molecular Biology 240, 421- 433 (1994).
[00290] Pace, H. C. et al. Lac repressor genetic map in real space. Trends in Biochemical Sciences 22, 334-339 (1997). [00291] Lewis, M. et al. Crystal structure of the iactose operon repressor and its complexes with DNA and inducer. Science 271 , 1247-1254 (1996).
[00292] Swint-Kruse, L., Zhan, H., Fairbanks, B. M., Maheshwari, A. & Matthews, K. S. Perturbation from a distance: mutations that aiter Lad function through long-range effects. Biochemistry 42, 14004-14016 (2003).
[00293] The processes described herein may be embodied in, and fully automated via, software code modules executed by a computing system that includes one or more general purpose computers or processors. The code modules may be stored in any type of non-transitory computer-readable medium or other computer storage device. Some or ail the methods may alternatively be embodied in specialized computer hardware. In addition, the components referred to herein may be implemented in hardware, software, firmware, or a combination thereof.
[00294] Many other variations than those described herein will be apparent from this disclosure. For example, depending on the embodiment, certain acts, events, or functions of any of the algorithms described herein can be performed in a different sequence, can be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the algorithms). Moreover, in certain embodiments, acts or events can be performed concurrently, e.g., through multi- threaded processing, interrupt processing, or multiple processors or processor cores or on other parallel architectures, rather than sequentially. In addition, different tasks or processes can be performed by different machines and/or computing systems that can function together.
[00295] Any logical blocks, modules, and algorithm elements described or used in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, and elements have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. The described functionality can be implemented in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.
[00296] The various illustrative logical blocks and modules described or used in connection with the embodiments disclosed herein can be implemented or performed by a machine, such as a processing unit or processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor can be a microprocessor, but in the alternative, the processor can be a controller, microcontroller, or state machine, combinations of the same, or the like. A processor can include electrical circuitry configured to process computer-executable instructions. In another embodiment, a processor includes an FPGA or other programmable device that performs logic operations without processing computer-executable instructions. A processor can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Although described herein primarily with respect to digital technology, a processor may also include primarily analog components. For example, some or all of the signal processing algorithms described herein may be implemented in analog circuitry or mixed analog and digital circuitry. A computing environment can include any type of computer system, including, but not limited to, a computer system based on a microprocessor, a mainframe computer, a digital signal processor, a portable computing device, a device controller, or a computational engine within an appliance, to name a few.
[00297] The elements of a method, process, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module stored in one or more memory devices and executed by one or more processors, or in a combination of the two. A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of non- transitory computer-readable storage medium, media, or physical computer storage known in the art. An exampie storage medium can be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor. The storage medium can be volatile or nonvolatile.
[00298] While one or more embodiments have been shown and described, modifications and substitutions may be made thereto without departing from the spirit and scope of the invention. Accordingly, it is to be understood that the present invention has been described by way of illustrations and not limitation. Embodiments herein can be used independently or can be combined.
[00299] All ranges disclosed herein are inclusive of the endpoints, and the endpoints are independently combinable with each other. The ranges are continuous and thus contain every value and subset thereof in the range. Unless otherwise stated or contextually inapplicable, all percentages, when expressing a quantity, are weight percentages. The suffix (s) as used herein is intended to include both the singular and the plural of the term that it modifies, thereby including at least one of that term (e.g., the colorant(s) includes at least one colorants). Option, optional, or optionally means that the subsequently described event or circumstance can or cannot occur, and that the description includes instances where the event occurs and instances where it does not. As used herein, combination is inclusive of blends, mixtures, alloys, reaction products, collection of elements, and the like.
[00300] As used herein, a combination thereof refers to a combination comprising at least one of the named constituents, components, compounds, or elements, optionally together with one or more of the same class of constituents, components, compounds, or elements.
[00301] All references are incorporated herein by reference.
[00302] The use of the terms “a,” “an,” and “the” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. It can further be noted that the terms first, second, primary, secondary, and the like herein do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. It will also be understood that, although the terms first, second, etc. are, in some instances, used herein to describe various elements, these elements should not be limited by these terms. For example, a first current could be termed a second current, and, similarly, a second current could be termed a first current, without departing from the scope of the various described embodiments. The first current and the second current are both currents, but they are not the same condition unless explicitly stated as such.
[00303] The modifier about used in connection with a quantity is inclusive of the stated value and has the meaning dictated by the context (e.g., it includes the degree of error associated with measurement of the particular quantity). The conjunction or is used to link objects of a list or alternatives and is not disjunctive; rather the elements can be used separately or can be combined together under appropriate circumstances.

Claims

What is claimed is:
1. A process for determining a coding DNA sequence for a specific dose- response curve for a genetic sensor, the process comprising: producing a coding DNA sequence library comprising a plurality of mutated variants of the coding DNA sequence (CDS) for the genetic sensor, such that each mutated variant comprises a mutated coding DNA sequence of the coding DNA sequence; attaching a separate DNA barcode to each of the mutated variants in the coding DNA sequence library to produce a barcoded variant library of barcoded mutant variants, wherein each barcoded mutant variant in the barcoded variant library correponds to one of the mutated variants having an attached DNA barcode; determining the DNA sequence of each barcoded mutant variant in the barcoded variant library; transforming the barcoded mutant variants in the barcoded variant library into cells; growing the cells in various cell cultures comprising different values of an input signal of the genetic sensor; extracting DNA from the cells in the cell cultures; determining the number of each DNA barcode in the DNA extracted from the cells; determining a fitness of the cells for each barcoded mutant variant; determining a mutant variant dose-response curve for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant to produce a mutant variant dose-response curve library comprising a plurality of the mutant variant dose-response curves; providing a target dose-response curve; determining which mutant variant dose-response curve, as a selected mutant variant dose-response curve, in the mutant variant dose-response curve library- matches the target dose-response curve; and determining the mutated coding DNA sequence of the selected mutant variant dose-response curve.
2. The process of claim 1 , further comprising synthesizing the mutated coding DNA sequence.
3. The process of claim 2, further comprising making a mutated genetic sensor comprising the mutated coding DNA sequence.
4. The process of claim 1 , further comprising producing a selection system for the genetic sensor, wherein the genetic sensor output modulates the fitness of the cells comprising the genetic sensor.
5. The process of claim 1 , wherein determining the DNA sequence of each barcoded mutant variant comprises subjecting the barcoded mutant variants of the barcoded variant library to high-throuhgput long-read DNA sequencing.
6. The process of claim 1 , wherein, in growing the cells in various cell cultures comprising different values of the input signal of the genetic sensor, the input signal comprises a concentration of a molecule sensed by the genetic sensor.
7. The process of claim 1 , further comprising, in extracting the DNA from the cells in the cell cultures, the DNA is extracted during at least two different time points during growth of the cells in the cell culture.
8. The process of claim 7, wherein determining the number of each DNA barcode in the DNA extracted from the cells is performed at each of the different time points with high-throughput short-read DNA sequencing.
9. The process of claim 8, wherein determining the fitness of cells containing each barcoded mutant variant comprises: calculating the fitness from the change in the number of each barcoded mutant variant over time.
10. The process of claim 1 , wherein determining the mutant variant dose- response curve for each barcoded mutant variant in the barcoded variant library from the fitness of the cells for each barcoded mutant variant comprises: calculating each mutant variant dose-response curve from the fitness as a function of the input signal.
11. The process of claim 1 , further comprising calibrating the mutant variant dose-response curves with a spike-in control variant.
12. The process of claim 11 , further comprising normalizing the amount of DNA extracted from the cells in the cell cultures with a spike-in control variant.
13. The process of claim 1 , wherein producing the coding DNA sequence library comprises site-saturation mutagenesis, chemical mutagens, error-prone PCR, mutagenic PCR, propagation in error-prone microbes, homologous recombination, or gene shuffling.
14. The process of claim 1 , wherein attaching the separate DNA barcode to each of the mutated variants in the coding DNA sequence library comprises attaching by PCR with oligonucleotides comprising sequence positions with randomized nucleotides.
15. The process of claim 5, in subjecting the barcoded mutant variants of the barcoded variant library to high-throuhgput long-read DNA sequencing, read lengths span the coding DNA sequence, the attached DNA barcode, and any DNA sequence regions between the coding DNA sequence and the attached DNA barcode.
16. The process of claim 1 , in transforming the barcoded mutant variants in the barcoded variant library into the cells, the barcoded mutant variants in the barcoded variant library are transformed into the cells with a DNA plasmid, artificial chromosome, or a genetic chromosome of the cells.
17. The process of claim 1 , wherein extracting DNA from the cells in the cell cultures comprises solid phase extraction.
18. The process of claim 8, wherein determining the number of each DNA barcode in the DNA extracted from the cells is performed at each of the different time points with high-throughput short-read DNA sequencing.
PCT/US2021/062442 2020-12-08 2021-12-08 Precision engineering of a cellular sense-and-response function Ceased WO2022125686A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063122501P 2020-12-08 2020-12-08
US63/122,501 2020-12-08

Publications (1)

Publication Number Publication Date
WO2022125686A1 true WO2022125686A1 (en) 2022-06-16

Family

ID=80445568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/062442 Ceased WO2022125686A1 (en) 2020-12-08 2021-12-08 Precision engineering of a cellular sense-and-response function

Country Status (1)

Country Link
WO (1) WO2022125686A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025133101A1 (en) 2023-12-20 2025-06-26 Danmarks Tekniske Universitet Method for enzyme fitness data collection

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180187184A1 (en) * 2016-12-30 2018-07-05 Systasy Bioscience GmbH Novel constructs and screening methods
US10526386B2 (en) 2015-05-06 2020-01-07 Immatics Biotechnologies Gmbh Peptides and combination of peptides and scaffolds thereof for use in immunotherapy against colorectal carcinoma (CRC) and other cancers

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10526386B2 (en) 2015-05-06 2020-01-07 Immatics Biotechnologies Gmbh Peptides and combination of peptides and scaffolds thereof for use in immunotherapy against colorectal carcinoma (CRC) and other cancers
US20180187184A1 (en) * 2016-12-30 2018-07-05 Systasy Bioscience GmbH Novel constructs and screening methods

Non-Patent Citations (68)

* Cited by examiner, † Cited by third party
Title
BEGGS, J. D., NATURE, vol. 275, 1978, pages 104 - 109
BELL CEBARRY JMATTHEWS KSLEWIS M: "Structure of a variant of lac repressor with increased thermostability and decreased affinity for operator", J MOL BIOL, vol. 313, 2001, pages 99 - 109, XP004466191, DOI: 10.1006/jmbi.2001.5041
BIVONA LZOU ZSTUTZMAN NSUN PD: "influence of the second amino acid on recombinant protein expression", PROTEIN EXPR PUF, vol. 74, 2010, pages 248 - 256, XP027394798, DOI: 10.1016/j.pep.2010.06.005
BLUNDELL CCORNEBISE JKAVUKCUOGLU KWIERSTRA D: "Weight uncertainty in neural networks", ARXIV, 2015, Retrieved from the Internet <URL:https://arxiv.org/pdf/1505.05424.pdf>
BOUHADDOU MBIRTWISTLE MR: "Dimerization-based control of cooperativity", MOL BIOSYST, vol. 10, 2014, pages 1824 - 1832
CARPENTER BGELMAN AHOFFMAN MDLEE DGOODRICH BBETANCOURT MBRUBAKER MGUO JLI PRIDDELL A: "Stan: a probabilistic programming language", J STAT SOFTW, vol. 76, 2017, pages 1 - 32
CHURE GRAZO-MEJIA MBELLIVEAU NMEINAV TKACZMAREK ZABARNES SLLEWIS MPHILLIPS R: "Predictive shifts in free energy couple mutations to their phenotypic consequences", PROC NATL ACAD SCI USA, vol. 116, 2019, pages 18275 - 18284
COHEN, S. N., PROC. NATL. ACAD. SCI. U.S.A, vol. 69, 1972, pages 2110 - 2114
CROOK NATHAN ET AL: "Transcript Barcoding Illuminates the Expression Level of Synthetic Constructs in E. coli Nissle Residing in the Mammalian Gut", ACS SYNTHETIC BIOLOGY, vol. 9, no. 5, 15 May 2020 (2020-05-15), Washington DC ,USA, pages 1010 - 1021, XP055915061, ISSN: 2161-5063, Retrieved from the Internet <URL:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7293544/pdf/nihms-1587451.pdf> DOI: 10.1021/acssynbio.0c00040 *
DABER RSOCHOR MALEWIS M: "Thermodynamic analysis of mutant lac repressors", J MOL BIOL, vol. 409, 2011, pages 76 - 87, XP028198530, DOI: 10.1016/j.jmb.2011.03.057
DOMINGO JDISS GLEHNER B: "Pairwise and higher-order genetic interactions during the evolution of a tRNA", NATURE, vol. 558, 2018, pages 117 - 121, XP036519543, DOI: 10.1038/s41586-018-0170-7
FENTON AW: "Allostery: an illustrated definition for the 'second secret of life", TRENDS BIOCHEM SCI, vol. 33, 2008, pages 420 - 425, XP024527228, DOI: 10.1016/j.tibs.2008.05.009
FLYNN TCSWINT-KRUSE LKONG YBOOTH CMATTHEWS KSMA J: "Allosteric transition pathways in the lactose repressor protein core domains: asymmetric motions in a homodimer", PROTEIN SCI, vol. 12, 2003, pages 2523 - 2541
GARCIA-PINO ABALASUBRAMANIAN SWYNS LGAZIT EDE GREVE HMAGNUSON RDCHARLIER DVAN NULAND NAJLORIS R: "Allostery and intrinsic disorder mediate transcription regulation by conditional cooperativity", CELL, vol. 142, 2010, pages 101 - 111, XP028931086, DOI: 10.1016/j.cell.2010.05.039
GOSSELIN PAULINE ET AL: "RESOURCE/METHODOLOGY Unbiased identification of signal-activated transcription factors by barcoded synthetic tandem repeat promoter screening (BC-STAR-PROM)", 1 August 2016 (2016-08-01), XP055915058, Retrieved from the Internet <URL:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5024686/pdf/1895.pdf> [retrieved on 20220425], DOI: 10.1101/gad.284828 *
HE XLIU L: "Toward a prospective molecular evolution", SCIENCE, vol. 352, 2016, pages 769 - 770
HECHT AGLASGOW JJASCHKE PRBAWAZER LAMUNSON MSCOCHRAN JRENDY DSALIT M: "Measurements of translation initiation from all 64 codons in E. coli", NUCLEIC ACIDS RES, vol. 45, 2017, pages 3615 - 3626, XP055873274, DOI: 10.1093/nar/gkx070
HOCHREITER SSCHMIDHUBER J: "Long short-term memory", NEURAL COMPUT, vol. 9, 1997, pages 1735 - 1780, XP055232921, DOI: 10.1162/neco.1997.9.8.1735
HUANG P-SBOYKEN SEBAKER D: "The coming of age of de novo protein design", NATURE, vol. 537, 2016, pages 320 - 327, XP055635430, DOI: 10.1038/nature19946
KIM MKKANG YK: "Positional preference of proline in alpha-helices", PROTEIN SCI, vol. 8, 1999, pages 1492 - 1499
KINGMA DPBA J: "Adam: a method for stochastic optimization", ARXIV, 2017
KIVIOJA TVAHARAUTIO AKARLSSON KBONKE MENGE MLINNARSSON STAIPALE J: "Counting absolute numbers of molecules using unique molecular identifiers", NAT METHODS, vol. 9, 2012, pages 72 - 74, XP055696105, DOI: 10.1038/nmeth.1778
KYTE JDOOLITTLE RF: "A simple method for displaying the hydropathic character of a protein", J MOL BIOI, vol. 157, 1982, pages 105 - 132, XP024014365, DOI: 10.1016/0022-2836(82)90515-0
LEANDER MYUAN YMEGER ACUI QRAMAN S: "Functional plasticity and evolutionary adaptation of allosteric regulation", PROC NAT! ACAD SCI USA, vol. 117, 2020, pages 25445 - 25454
LEWIS MCHANG GHORTON NCKERCHER MAPACE HCSCHUMACHER MABRENNAN RGLU P: "Crystal structure of the lactose operon repressor and its complexes with DNA and inducer", SCIENCE, vol. 271, 1996, pages 1247 - 1254, XP001205867, DOI: 10.1126/science.271.5253.1247
LEWIS, M. ET AL.: "Crystal structure of the lactose operon repressor and its complexes with DNA and inducer.", SCIENCE, vol. 271, 1996, pages 1247 - 1254, XP001205867, DOI: 10.1126/science.271.5253.1247
LI CQIAN WMACLEAN CJZHANG J: "The fitness landscape of a tRNA gene", SCIENCE, vol. 352, 2016, pages 837 - 840
LI CZHANG J: "Multi-environment fitness landscapes of a tRNA gene", NAT ECOL EVOI, vol. 2, 2018, pages 1025 - 1032
LIU XIANGYANG ET AL: "De novo design of programmable inducible promoters", NUCLEIC ACIDS RESEARCH, vol. 47, no. 19, 4 November 2019 (2019-11-04), GB, pages 10452 - 10463, XP055915065, ISSN: 0305-1048, Retrieved from the Internet <URL:https://watermark.silverchair.com/gkz772.pdf> DOI: 10.1093/nar/gkz772 *
MARKIEWICZ PKLEINA LGCRUZ CEHRET SMILLER JH: "Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as 'Spacers' which do not require a specific sequence", J MOL BIOL, vol. 240, 1994, pages 421 - 433, XP024008307, DOI: 10.1006/jmbi.1994.1458
MARKIEWICZ, P.KLEINA, L. G.CRUZ, C.EHRET, S.MILLER, J. H.: "Genetic Studies of the lac Repressor. XIV. Analysis of 4000 Altered Escherichia coli lac Repressors Reveals Essential and Non-essential Residues, as well as 'Spacers' which do not Require a Specific Sequence", JOURNAL OF MOLECULAR BIOLOGY, vol. 240, 1994, pages 421 - 433, XP024008307, DOI: 10.1006/jmbi.1994.1458
MARZEN SGARCIA HGPHILLIPS R: "Statistical mechanics of Monod-Wyman-Changeux (MWC) Models", J MOL BIOL, vol. 425, 2013, pages 1433 - 1460, XP028582640, DOI: 10.1016/j.jmb.2013.03.013
MEYER SRAMOT RKISHORE INAMPUDI KLUO BLIN CAMERE SWILSON CJ: "Engineering alternate cooperative-communications in the lactose repressor protein scaffold", PROTEIN ENG DES, vol. 126, 2013, pages 433 - 443
MIAO ZCAO Y: "Quantifying side-chain conformational variations in protein structure", SCI REP, vol. 6, 2016, pages 37024
MONOD JWYMAN JCHANGEUX JP: "On the nature of allosteric transitions: a plausible model", J MOL BIOL, vol. 12, 1965, pages 88 - 118, XP026105814, DOI: 10.1016/S0022-2836(65)80285-6
MOTLAGH HNWRABL JOLI JHILSER VJ: "The ensemble nature of allostery", NATURE, vol. 508, 2014, pages 331 - 339, XP037693463, DOI: 10.1038/nature13001
MYERS GLSADLER JR: "Mutational inversion of control of the lactose operon of Escherichia coli", J MOL BIOL, vol. 58, 1971, pages 1 - 28, XP024014943, DOI: 10.1016/0022-2836(71)90229-4
OMELINA ESIVANKIN AVLETIAGINA AEPINDYURIN AV: "Optimized PCR conditions minimizing the formation of chimeric DNA molecules from MPRA plasmid libraries", BMC GENOM, vol. 20, 2019, pages 536
ONUFRIEV AULLMANN GM: "Decomposing complex cooperative ligand binding into simple components: connections between microscopic and macroscopic models", J PHYS CHEM B, vol. 108, 2004, pages 11157 - 11169
PACE HCKERCHER MALU PMARKIEWICZ PMILLER JHCHANG GLEWIS M: "Lac repressor genetic map in real space", TRENDS BIOCHERN SCI, vol. 22, 1997, pages 334 - 339, XP004088008, DOI: 10.1016/S0968-0004(97)01104-3
PACE, H. C. ET AL.: "Lac repressor genetic map in real space", TRENDS IN BIOCHEMICAL SCIENCES, vol. 22, 1997, pages 334 - 339, XP004088008, DOI: 10.1016/S0968-0004(97)01104-3
POELWIJK FJDE VOS MGJTANS SJ: "Tradeoffs and optimality in the evolution of gene regulation", CELL, vol. 146, 2011, pages 462 - 470, XP028382581, DOI: 10.1016/j.cell.2011.06.035
POPOVYCH NSUN SEBRIGHT RHKALODIMOS CG: "Dynamically driven protein allostery", NAT STRUCT MOL BIOL, vol. 13, 2006, pages 831 - 838
PRESSMAN ADLIU ZJANZEN EBLANCO CMULIER UFJOYCE GFPASCAL RCHEN IA: "Mapping a systematic ribozyme fitness landscape reveals a frustrated evolutionary network for self-aminoacylating RNA", J AM CHEM SOC, vol. 141, 2019, pages 6213 - 6223
PUCHTA 0CSEKE BCZAJA HTOLLERVEY DSANGUINETTI GKUDLA G: "Network of epistatic interactions within a yeast snoRNA", SCIENCE, vol. 352, 2016, pages 840 - 844
RAMAN STAYLOR NGENUTH NFIELDS SCHURCH GM: "Engineering allostery", TRENDS GENET, vol. 30, 2014, pages 521 - 528
RASMUSSEN CEWILLIAMS CKI: "Adaptive computation and machine learning", 2005, MIT PRESS, article "Gaussian processes for machine learning"
RAZO-MEJIA MBARNES SLBELLIVEAU NMCHURE GEINAV TLEWIS MPHILLIPS R: "Tuning transcriptional regulation through signaling: a predictive theory of allosteric induction", CEIL SYSTEMS, vol. 6, 2018, pages 456 - 469
RICHARDSON JSRICHARDSON DC: "Amino acid preferences for specific locations at the ends of alpha helices", SCIENCE, vol. 240, 1988, pages 1648 - 1652, XP055311208, DOI: 10.1126/science.3381086
ROLFES RJZALKIN H: "Purification of the Escherichia coli purine regulon repressor and identification of corepressors", J BACTERIAL, vol. 172, 1990, pages 5637 - 5642
RUBIN DB: "Estimation in parallel randomized experiments", JOURNAL OF EDUCATIONAL STATISTICS, vol. 6, 1981, pages 377 - 401
SARKAR STACK DROSS D: "Sparse estimation of mutual information landscapes quantifies information transmission through cellular biochemical reaction networks", COMMUN BIOL, vol. 3, 2020, pages 1 - 8
SARKISYAN KSBOLOTIN DAMEER MVUSMANOVA DRMISHIN ASSHARONOV GVIVANKOV DNBOZHANOVA NGBARANOV MSSOYLEMEZ 0: "Local fitness landscape of the green fluorescent protein", NATURE, vol. 533, 2016, pages 397 - 401
SCHLECHT ULIU ZBLUNDELL JRST.ONGE RPLEVY SF: "A scalable double-barcode sequencing platform for characterization of dynamic protein-protein interactions", NAT COMMUN, vol. 8, 2017, pages 15586
SMYTH RPSCHLUB TEGRIMM AVENTURI VCHOPRA AMALLAL SDAVENPORT MPMAK J: "Reducing chimera formation during PCR amplification to ensure accurate genotyping", GENE, vol. 469, 2010, pages 45 - 51, XP027417925
SPRONK CAEMSLIJPER MVAN BOOM JHKAPTEIN RBOELENS R: "Formation of the hinge helix in the lac represser is induced upon binding to the lac operator", NAT STRUCT BIOL, vol. 3, 1996, pages 916 - 919
SWINT-KRUSE LELAM CRLIN JWWYCUFF DRMATTHEWS KS: "Plasticity of quaternary structure: twenty-two ways to form a Lac! dimer", PROTEIN SCI, vol. 10, 2001, pages 262 - 276
SWINT-KRUSE, L.ZHAN, H.FAIRBANKS, B. M.MAHESHWARI, AMATTHEWS, K. S.: "Perturbation from a distance: mutations that alter Lacl function through long-range effects", BIOCHEMISTRY, vol. 42, 2003, pages 14004 - 14016
TACK DREW S. ET AL: "The genotype-phenotype landscape of an allosteric protein", BIORXIV, 11 July 2020 (2020-07-11), XP055915087, Retrieved from the Internet <URL:https://www.biorxiv.org/content/10.1101/2020.07.10.197574v1.full.pdf> [retrieved on 20220425], DOI: 10.1101/2020.07.10.197574 *
TACK DREW: "The genotype-phenotype landscape of an allosteric protein | Molecular Systems Biology", 1 February 2021 (2021-02-01), XP055915104, Retrieved from the Internet <URL:https://www.embopress.org/doi/full/10.15252/msb.202010179> [retrieved on 20220425] *
TAYLOR NDGARRUSS ASMORETTI RCHAN SARBING MACASCIO DROGERS JKISAACS FJKOSURI SBAKER D ET AL.: "Engineering an allosteric transcription factor to respond to new ligands", NAT METH, vol. 13, 2016, pages 177 - 183, XP055469782, DOI: 10.1038/nmeth.3696
TZENG S-RKALODIMOS CG: "Protein activity regulation by conformational entropy", NATURE, vol. 488, 2012, pages 236 - 240
WENGER AMPELUSO PROWELL WJCHANG P-CHALL RJCONCEPCION GTEBLER JFUNGTAMMASAN AKOLESNIKOV AOLSON ND: "Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome", NAT BIOTECHNOL, vol. 37, 2019, pages 1155 - 1162, XP036897227, DOI: 10.1038/s41587-019-0217-9
WU BGOKHALE CSVAN VEELEN MWANG LTRAULSEN A: "Interpretations arising from Wrightian and Malthusian fitness under strong frequency dependent selection", ECOI EVOI, vol. 3, 2013, pages 1276 - 1280
ZAMYATNIN AA: "Protein volume in solution", PROG BIOPHYS MOL BIOL, vol. 24, 1972, pages 107 - 123, XP023444928, DOI: 10.1016/0079-6107(72)90005-3
ZHAN HCAMARGO MMATTHEWS KS: "Positions 94-98 of the lactose repressor N-subdomain monomer-monomer interface are critical for allosteric communication", BIOCHEMISTRY, vol. 49, 2010, pages 8636 - 8645
ZHAO LLIU ZLEVY SFWU S: "Bartender: a fast and accurate clustering algorithm to count barcode reads", BIOINFORMATICS, vol. 34, 2018, pages 739 - 747
ZHOU YIKANG ET AL: "Encoding Genetic Circuits with DNA Barcodes Paves the Way for Machine Learning-Assisted Metabolite Biosensor Response Curve Profiling in Yeast", ACS SYNTHETIC BIOLOGY, vol. 11, no. 2, 28 January 2022 (2022-01-28), Washington DC ,USA, pages 977 - 989, XP055915052, ISSN: 2161-5063, Retrieved from the Internet <URL:https://pubs.acs.org/doi/pdf/10.1021/acssynbio.1c00595> DOI: 10.1021/acssynbio.1c00595 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2025133101A1 (en) 2023-12-20 2025-06-26 Danmarks Tekniske Universitet Method for enzyme fitness data collection

Similar Documents

Publication Publication Date Title
Tack et al. The genotype‐phenotype landscape of an allosteric protein
McClune et al. Engineering orthogonal signalling pathways reveals the sparse occupancy of sequence space
Behrens et al. High-resolution quantitative profiling of tRNA abundance and modification status in eukaryotes by mim-tRNAseq
Armenteros et al. Detecting sequence signals in targeting peptides using deep learning
Urtecho et al. Systematic dissection of sequence elements controlling σ70 promoters using a genomically encoded multiplexed reporter assay in Escherichia coli
Duttke et al. Position-dependent function of human sequence-specific transcription factors
Baumgart et al. Persistence and plasticity in bacterial gene regulation
McLure et al. High-throughput directed evolution: a golden era for protein science
Kelsic et al. RNA structural determinants of optimal codons revealed by MAGE-Seq
Shalem et al. Systematic dissection of the sequence determinants of gene 3’end mediated expression control
Currin et al. Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries
US20190330661A1 (en) Efficient genetic screening method
Bratulic et al. Mistranslation can enhance fitness through purging of deleterious mutations
Lyu et al. Adaptation of codon usage to tRNA I34 modification controls translation kinetics and proteome landscape
Charpentier et al. 3’RNA sequencing for robust and low-cost gene expression profiling
Rudan et al. RNA chaperones buffer deleterious mutations in E. coli
Culviner et al. Global analysis of the specificities and targets of endoribonucleases from Escherichia coli toxin-antitoxin systems
Sandberg et al. Adaptive evolution of a minimal organism with a synthetic genome
Scheele et al. Droplet-based screening of phosphate transfer catalysis reveals how epistasis shapes MAP kinase interactions with substrates
Westmann et al. The highly rugged yet navigable regulatory landscape of the bacterial transcription factor TetR
WO2022125686A1 (en) Precision engineering of a cellular sense-and-response function
Weeks et al. Fitness and functional landscapes of the E. coli RNase III gene rnc
Zhang et al. Identifying chromatin features that regulate gene expression distribution
Minshull et al. Engineered protein function by selective amino acid diversification
Shave et al. PuLSE: Quality control and quantification of peptide sequences explored by phage display libraries

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21854896

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21854896

Country of ref document: EP

Kind code of ref document: A1