WO2023225358A1 - Generation and tracking of cells with precise edits - Google Patents
- Publication number
- WO2023225358A1 (PCT/US2023/022989)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target locus
- sequence
- target
- retron
- locus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/63—Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
- C12N15/64—General methods for preparing the vector, for introducing it into the cell or for selecting the vector-containing host
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1065—Preparation or screening of tagged libraries, e.g. tagged microorganisms by STM-mutagenesis, tagged polynucleotides, gene tags
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/10—Processes for the isolation, preparation or purification of DNA or RNA
- C12N15/1034—Isolating an individual clone by screening libraries
- C12N15/1079—Screening libraries by altering the phenotype or phenotypic trait of the host
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/63—Introduction of foreign genetic material using vectors; Vectors; Use of hosts therefor; Regulation of expression
- C12N15/79—Vectors or expression systems specially adapted for eukaryotic hosts
- C12N15/80—Vectors or expression systems specially adapted for eukaryotic hosts for fungi
- C12N15/81—Vectors or expression systems specially adapted for eukaryotic hosts for fungi for yeasts
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N15/00—Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
- C12N15/09—Recombinant DNA-technology
- C12N15/87—Introduction of foreign genetic material using processes not otherwise provided for, e.g. co-transformation
- C12N15/90—Stable introduction of foreign DNA into chromosome
- C12N15/902—Stable introduction of foreign DNA into chromosome using homologous recombination
- C12N15/905—Stable introduction of foreign DNA into chromosome using homologous recombination in yeast
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/10—Transferases (2.)
- C12N9/12—Transferases (2.) transferring phosphorus containing groups, e.g. kinases (2.7)
- C12N9/1241—Nucleotidyltransferases (2.7.7)
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12N—MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
- C12N9/00—Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
- C12N9/14—Hydrolases (3)
- C12N9/16—Hydrolases (3) acting on ester bonds (3.1)
- C12N9/22—Ribonucleases [RNase]; Deoxyribonucleases [DNase]
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Y—ENZYMES
- C12Y207/00—Transferases transferring phosphorus-containing groups (2.7)
- C12Y207/07—Nucleotidyltransferases (2.7.7)
Definitions
- the present disclosure provides a nucleic acid composition that comprises two or more editing modules that are present on an expression vector.
- the compositions and methods allow for producing combinations of targeted genetic modifications in the genome of a host cell.
- the disclosure provides a retron-guide RNA cassette comprising: (a) a first retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a first donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a first target locus; and (v) a second inverted repeat sequence coding region; and (b) a first guide RNA (gRNA) coding region; (c) a second retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv
- the first target locus is located in trans to the second target locus. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit.
- the first donor DNA sequence comprises a genetic variant compared to the sequences within the first target locus. In some embodiments, the genetic variant comprises a trans-expression quantitative trait locus (eQTL) variant at the first target locus.
- the first target locus is located in cis to the second target locus.
- the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in the 3’ untranslated region (UTR) of the transcription unit.
- the first donor DNA sequence comprises a genetic variant relative to the sequence at the first target locus.
- the genetic variant comprises a cis-eQTL variant at the first target locus.
- the second target locus is (i) located in an intron or (ii) not located in genomic sequences that regulate transcription or translation of a gene.
- the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
- the first or second gRNA coding region is upstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 5’ of the RNA transcribed from the retron. In some embodiments, the first or second gRNA coding region is downstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 3’ of the RNA transcribed from the retron. In some embodiments, the retron-guide RNA cassette further comprises one or more ribozyme sequences. In some embodiments, the first and second retrons are connected by a self-cleaving ribozyme sequence.
- the ribozyme sequence encodes a ribozyme selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead, and combinations thereof.
- the one or more ribozyme sequences are different from each other.
- the retron-guide RNA cassette further comprises a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region.
- the disclosure provides a vector comprising a retron-guide RNA cassette described herein.
- the disclosure provides a method for identifying a genetic modification at a target locus in a host cell, the method comprising: (a) transforming the host cell with a vector or retron-guide RNA cassette described herein; (b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region, wherein the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target
- the method identifies a genetic modification at a target locus within a genome of a host cell, where the genome comprises the endogenous genomic chromosomal DNA of the host cell. In some embodiments, the method identifies a genetic modification at a target locus anywhere within a genome of a host cell. In some embodiments, the target locus is located in an exogenous genome that is present in a host cell, such as a viral genome, a bacterial genome, a transposable element or an endovirus genome that are not part of the endogenous host cell genome.
- the target locus is located in heterologous or exogenous DNA, such as the DNA of transgenes, viruses or transposons, that are present in the host cell or host cell nucleus. In some embodiments, the target locus is located in heterologous or exogenous DNA that is integrated into the host cell genomic DNA. In some embodiments, the target locus is located in heterologous or exogenous DNA that is not integrated into the host cell genomic DNA, such as transiently expressed transgenes, episomes or plasmids. In some embodiments, the first target locus is located in trans to the second target locus.
- the first target locus is located in a trans-regulatory element, and the second target locus is located in a 5’ untranslated region, protein coding region, or the 3’ untranslated region (UTR) of a transcription unit.
- the genetic variant comprises a trans-eQTL variant at the first target locus.
- the first target locus is located in cis to the second target locus.
- the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, protein coding region, or the 3’ untranslated region (UTR) of the transcription unit.
- the genetic variant comprises a cis-eQTL variant at the first target locus.
- the first and/or second target locus is located in an intergenic, non-coding region of the host cell genomic DNA.
- the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus.
- the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
- detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence.
- the vector is no longer present in the host cell when detecting the presence of the unique barcode sequence.
- greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the first target locus.
- the method further comprises: (d) transforming the host cell with a second vector comprising a second retron-guide RNA cassette comprising: a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region; a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target locus and
- the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus.
- the third target locus is located in trans to the fourth target locus.
- the third target locus is located in a trans-regulatory element, and the fourth target locus is located in the 3’ untranslated region (UTR) of a transcription unit.
- the genetic variant comprises a trans-eQTL variant at the third target locus.
- the third target locus is located in cis to the fourth target locus.
- the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the transcription unit.
- the genetic variant comprises a cis-eQTL variant at the first target locus.
- the method further comprises detecting the relative expression of transcription from the transcription units comprising genetic variants at the first and third target loci.
- (i) the first and third gRNAs are the same; (ii) the first and third target loci are the same; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different.
- (i) the first and third gRNAs are different; (ii) the first and third target loci are different; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different.
- the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site.
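The similarity range stated for the homology arms can be illustrated with a small calculation. The following is a minimal sketch, assuming an ungapped, position-by-position comparison (percent identity) as a stand-in for "similarity"; the disclosure does not specify an alignment method, and the function names and example sequences are hypothetical.

```python
def percent_identity(arm: str, target: str) -> float:
    """Percent of matching positions between two equal-length sequences.

    Assumption: ungapped comparison; a real workflow would likely use an aligner.
    """
    if len(arm) != len(target):
        raise ValueError("sequences must have the same length for this ungapped comparison")
    if not arm:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(arm.upper(), target.upper()))
    return 100.0 * matches / len(arm)


def arm_in_range(arm: str, target: str, low: float = 70.0, high: float = 99.0) -> bool:
    """True if the arm's identity to the target falls within [low, high] percent."""
    return low <= percent_identity(arm, target) <= high


# Hypothetical 20-nt sequences differing at a single position.
target_left = "ACGTACGTACGTACGTACGT"
donor_left = "ACGTACGTACCTACGTACGT"
identity = percent_identity(donor_left, target_left)
within_range = arm_in_range(donor_left, target_left)
```

With one mismatch in 20 positions, the hypothetical arm is 95% identical to its target and falls inside the stated range.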
- greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the third target locus.
- the method further comprises detecting the presence of the unique barcode at the third target locus, thereby identifying the genetic modification at both the first and third target loci.
- the method further comprises repeating steps (d)-(f) with a third vector comprising a third retron-guide RNA cassette that inserts a genetic modification at a fifth target locus and a unique barcode sequence at a sixth target locus, thereby identifying the genetic modification at the fifth target locus.
- the host cell is a prokaryotic cell.
- the host cell is a eukaryotic cell.
- the eukaryotic cell is a yeast cell.
- the eukaryotic cell is a mammalian cell or cell line.
- the mammalian cell is a human cell or cell line.
- the host cell comprises a clonal population of host cells.
- the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells.
- the method further comprises transforming a mixture of cells with one or more vectors comprising the first, second or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell.
- the method further comprises detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette.
- the disclosure provides a method for identifying two or more genetic modifications at two different target loci in a host cell, the method comprising: transforming the host cell with a vector or retron-guide RNA cassette described herein; wherein the vector or retron-guide RNA cassette comprises two or more variant editing cassettes that are expressed in the same transcript, and a donor DNA sequence comprising homology to one or more sequences within a third, different target locus and a unique barcode sequence.
- Fig.1a-k Design and validation of CRISPEY-BAR for generating and tracking thousands of precise genome edits simultaneously.
- Fig.1a CRISPEY-BAR dual edit strategy. Top, CRISPEY-BAR expression cassette consisting of pGAL7 galactose-inducible promoter and terminator (brown); self-cleaving HDV-like-ribozymes RzCIV, RZHDV and RZSpur3 (magenta); barcode insertion retron-guide cassette (blue) containing programmed barcode (orange) and UMI (yellow); variant editing cassette (green). Middle, the variant editing cassette converts a wildtype (WT) allele into an alternative allele. Bottom, the barcode insertion retron-guide cassette.
- Fig.1b Schematic for conventional CRISPEY.
- Fig.1c Schematic for CRISPEY-BAR. Variants tracked across three growth replicates by genomically-integrated barcodes with attached UMIs.
- Fig.1d Workflow for CRISPEY-BAR library pool construction.
- Fig. 1e Validation of genomic variant editing rate from CRISPEY-BAR. Blue, randomly picked colonies that contain both genomic-integrated barcode and the designed edit. Orange, randomly picked colonies that contain only the genomic-integrated barcode but not the designed edit.
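The editing rate validated in Fig. 1e can be read as the fraction of barcode-containing colonies that also carry the designed edit. The sketch below illustrates that ratio under stated assumptions: the colony records, field names, and counts are invented for illustration and are not data from the disclosure.

```python
def editing_rate(colonies):
    """Fraction of barcoded colonies that also contain the designed edit."""
    barcoded = [c for c in colonies if c["barcode"]]
    if not barcoded:
        raise ValueError("no barcoded colonies to evaluate")
    edited = sum(1 for c in barcoded if c["edit"])
    return edited / len(barcoded)


# Hypothetical genotyping results for randomly picked colonies.
colonies = [
    {"barcode": True, "edit": True},
    {"barcode": True, "edit": True},
    {"barcode": True, "edit": False},
    {"barcode": True, "edit": True},
    {"barcode": False, "edit": False},  # ignored: no genomic barcode
]
rate = editing_rate(colonies)
```

In this made-up example, three of four barcoded colonies carry the edit, so the rate is 0.75.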
- Fig. 1f Schematic for CRISPEY-BAR pooled competition in yeast.
- Fig.1g Example of CRISPEY-BAR data over time. Each line indicates normalized counts for a single UMI for a given barcode from 1 of 3 replicates in a competition experiment. Counts in later time points are normalized to the first time point. Light blue and blue: two barcodes representing different guides targeting the same variant chr7: 848783 AC>A. Red and dark red: two barcodes representing different guides targeting the same variant chr7: 847050 C>A. Gray scale: Non-targeting of variants, barcode integration only (no-edit control regarding variants). Data shown are from Terbinafine competition across approximately 26 generations.
- Fig.1h Example of outlier removal.
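The normalization described for Fig.1g, in which counts at later time points are expressed relative to the first time point, can be sketched as follows. This is a simplified illustration only: the function name and count values are hypothetical, and any additional steps used in practice (for example, correcting for sequencing depth) are not shown because the legend does not describe them.

```python
def normalize_to_first(counts):
    """Divide every time point's count by the count at the first time point."""
    if not counts:
        raise ValueError("need at least one time point")
    first = counts[0]
    if first <= 0:
        raise ValueError("first time point must have a positive count")
    return [c / first for c in counts]


# Hypothetical read counts for one UMI of one barcode across three time points.
umi_counts = [200, 100, 50]
normalized = normalize_to_first(umi_counts)
```

A trajectory that falls below 1.0 over time, as in this invented example, would correspond to a lineage that is being outcompeted in the pool.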
- Fig.1k Validation of pooled fitness in fluconazole by pairwise competition.
- X-axis fitness effect measured by CRISPEY-BAR pooled competition.
- Y-axis fitness effect measured through pairwise competition against GFP strain using flow cytometry. Data shown for 13 variants in fluconazole. Data presented as mean ± SEM.
- Fig.2a-g Detection of natural variants affecting fitness within QTLs mapped in complex traits.
- Fig.2a Diagram of library design process using natural variants and QTL regions, as well as library statistics.
- Fig.2b Schematic for experiment workflow for QTL fine-mapping with CRISPEY-BAR.
- Fig. 2c Number of variants with fitness effect (FDR < 0.01) within SC and appropriate stress condition.
- Fig. 2d Annotation enrichment of variants with fitness effect (FDR < 0.01). Blue, variant enrichment for hits in fluconazole condition. Orange, variant enrichment for hits in caffeine condition. Green, variant enrichment for hits in cobalt chloride condition.
- Fig.2e Fitness effects of example QTL regions. Dark blue, fitness effects in stress condition (FDR < 0.01). Dark orange, fitness effects in SC (FDR < 0.01). Light blue, no fitness effects in stress condition. Gold, no fitness effects in SC. Most variants are represented twice (effect in QTL condition and complete media).
- Fig. 2f PDR5 fitness effects in CAFF and FLC. Magenta, PDR5 variant fitness measured in caffeine condition. Orange, PDR5 variant fitness measured in fluconazole condition. Dark gray, noncoding regions flanking PDR5. Light gray, coding region of PDR5. Vertical lines connect the same variant fitness values measured in both caffeine and fluconazole.
- Fig.2g Fitness effects of example QTL regions. Dark blue, fitness effects in stress condition (FDR < 0.01). Dark orange, fitness effects in SC (FDR < 0.01). Light blue, no fitness effects in stress condition. Gold, no fitness effects in SC. Most variants are represented twice (effect in QTL condition and complete media).
- Fig.3a-h CRISPEY-BAR enabled robust mapping of variant-level GxE interactions within the ergosterol biosynthesis pathway.
- Fig.3a Ergosterol pathway diagram showing 24 genes from the ergosterol synthesis pathway surveyed in this study. Lovastatin and terbinafine target genes in the ergosterol pathway.
- Fig.3b The same pool of yeast edited at natural ergosterol pathway variants was grown in six different conditions and tracked by barcode sequencing.
- Fig.3c Gene level fitness effects of surveyed natural variants in six conditions.
- X-axis labels indicate the genes containing the variants. Red, causal variants (p < 0.01). Gray, non-significant variants. Target genes are outlined by dashed black lines where applicable.
- Fig.3d GxE interactions were calculated between each pair of conditions (15 pairwise comparisons).
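The count of 15 pairwise comparisons follows from choosing 2 conditions out of 6. A short check, using placeholder condition labels (the labels are hypothetical and do not name the actual conditions):

```python
from itertools import combinations

# Placeholder labels standing in for the six growth conditions.
conditions = [f"condition_{i}" for i in range(1, 7)]

# Every unordered pair of distinct conditions is one GxE comparison.
condition_pairs = list(combinations(conditions, 2))
n_comparisons = len(condition_pairs)
```

Here `n_comparisons` equals 6 × 5 / 2 = 15, matching the figure legend.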
- Fig.3e Diagram showing definition of GxE variants in this study: A positive effect variant (black circle) in condition 1 can either have the same effect in another condition (white circle at same height in red region), a stronger positive effect (top white circle in red region), no effect, white circle at zero, or a negative effect (bottom white circle in blue region).
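The outcomes sketched in this diagram can be expressed as a simple classification of a variant's effects in two conditions. The following is a deliberately simplified illustration, assuming each effect comes with a precomputed significance flag; the category labels, function name, and inputs are hypothetical, and the actual statistical test for GxE is not described in this legend.

```python
def classify_pair(effect_1: float, significant_1: bool,
                  effect_2: float, significant_2: bool) -> str:
    """Label how a variant's fitness effect compares between two conditions."""
    if not significant_1 and not significant_2:
        return "no detected effect"
    if significant_1 != significant_2:
        return "effect in one condition only"
    # Both effects are significant: compare their signs.
    if (effect_1 > 0) == (effect_2 > 0):
        return "same direction"
    return "opposite direction"


# Hypothetical effect sizes (arbitrary units).
label_same = classify_pair(0.02, True, 0.05, True)
label_opposite = classify_pair(0.02, True, -0.03, True)
label_single = classify_pair(0.02, True, 0.00, False)
```

A stronger positive effect in the second condition still falls under "same direction" in this sketch; distinguishing it from an equal effect would require comparing magnitudes with an appropriate test.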
- Fig. 3f The number of significant GxE interactions for each pairwise comparison.
- Fig.3g GxE annotation enrichments for variants with GxE. Enrichment of variants with GxE in each category was normalized to all variants tested. Red dashed line indicates an enrichment factor of 1.0, corresponding to no enrichment over the library.
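One way to read the enrichment factor in Fig.3g is as the share of GxE variants in an annotation category divided by that category's share of all tested variants. The sketch below assumes that definition; the function name and all counts are hypothetical.

```python
def enrichment_factor(gxe_in_category: int, gxe_total: int,
                      tested_in_category: int, tested_total: int) -> float:
    """(GxE share in category) / (library share in category).

    Written as a single ratio of products to limit floating-point rounding.
    """
    if min(gxe_total, tested_in_category, tested_total) <= 0:
        raise ValueError("totals and the library category count must be positive")
    return (gxe_in_category * tested_total) / (gxe_total * tested_in_category)


# Hypothetical counts: 10 of 50 GxE variants are in the category,
# while the category holds 100 of 1000 tested variants.
enriched = enrichment_factor(10, 50, 100, 1000)
# Hypothetical counts where the category is represented equally in both sets.
not_enriched = enrichment_factor(5, 50, 100, 1000)
```

Under this assumed definition, a value of 1.0 means the category is no more common among GxE variants than in the library as a whole.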
- Fig.3h Variants with GxE effects within the HMG1 promoter. Clusters of variants with significant GxE effects within 8 bp of each other are in gray highlighted areas.
- Fig.4a-f Quantifying GxE interactions among ergosterol pathway variants
- Fig.4a Schematic of rare GxE between conditions (correlated effects).
- Fig.4b Schematic of common GxE between conditions (uncorrelated effects).
- Fig.4c Fitness effects of variants within PDR5 in caffeine and fluconazole.
- Fig.4d Fitness effects of variants within ergosterol pool in lovastatin and CoCl2.
- Fig.4e Fitness effects of variants within ergosterol pool in lovastatin and CoCl2.
- Fig.4f Heatmaps showing fitness effects of all variants with a significant effect in any condition. Significant positive effects (red), significant negative effects (blue), non-significant positive effects (pink), and non-significant negative effects (light blue).
- Fig. 5a-e Types of GxE variants and effect of natural variation on ERG4 expression.
- Fig. 5a Example of fitness effect detected in only one condition.
- Fig.5b Example of fitness effect detected in only one condition.
- Fig.5c Example of fitness effects with same direction detected in two conditions.
- Fig.5c Example of fitness effects with opposite directions between conditions, showing sign GxE.
- Fig.5d Sign GxE variants have larger maximum fitness effects. Whiskers represent Q3 + 1.5xIQR and Q1 - 1.5xIQR, or the maximum and minimum values of the dataset if these are respectively lower or higher than the IQR based intervals.
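The whisker rule stated for Fig.5d can be sketched directly. This example assumes quartiles computed with Python's `statistics.quantiles` using the "inclusive" method; the quartile convention actually used for the figure is not stated, and the data values are invented.

```python
import statistics


def whisker_limits(values):
    """Return (lower, upper) whisker positions.

    Each whisker sits at the 1.5 x IQR fence unless the data minimum or
    maximum lies inside that fence, in which case it sits at the data extreme.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = max(min(values), q1 - 1.5 * iqr)
    upper = min(max(values), q3 + 1.5 * iqr)
    return lower, upper


# Hypothetical data with one large outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower_whisker, upper_whisker = whisker_limits(data)
```

In this invented dataset the lower whisker stops at the data minimum, while the upper whisker stops at the fence because the outlier lies beyond it.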
- Fig.5e Effect of natural variants on ERG4 expression. Top left: Consensus Rpn4p binding motif. Top right: Genomic location of Rpn4p binding site affected by chr7: 472522 C>A variant within ERG4/PDR1 divergent promoter.
- FIG.6 Schematic for library cloning in CRISPEY-BAR.
- Fig. 7 Schematic for pooled editing and growth competition in CRISPEY-BAR.
- Fig.8 Schematic for CRISPEY-BAR sequencing library preparation.
- Fig.9 Fitness and ERG4 expression for variants in Fig.5e.
- X-axis Paired fitness from flow cytometry measurements similar to Fig.1i, see also Methods.
- the present disclosure provides compositions and methods for tracking one or more targeted genetic modifications (also referred to as genetic “edits” or “variants”) made in the genome of a cell or organism.
- the present disclosure provides a nucleic acid composition that comprises two or more editing modules that are present on an expression vector.
- the compositions and methods allow for producing combinations of targeted genetic modifications in the genome of a host cell, where the combinations of modifications are predetermined.
- the first module comprises nucleic acid sequences that can modify a genetic locus in a host cell (e.g., a first target locus) and the second module comprises nucleic acid sequences that modify a second genetic locus in a host cell (e.g., a second target locus).
- the first target locus is at a different location in the genome than the second target locus.
- the genetic modification at the first target locus is different than the genetic modification at the second target locus.
- the genetic modification at the first target locus comprises a mutation, edit, variant or deletion in the nucleic acid sequence of the first target locus.
- the genetic modification at the second target locus comprises a mutation, edit or variant of the nucleic acid sequence of the second target locus. In some embodiments, the genetic modification at the second target locus comprises or further comprises introducing a unique barcode sequence at the second target locus. In some embodiments, the genetic modification at the second target locus comprises introducing both a mutation, edit, or variant and a unique barcode sequence at the second target locus.
- compositions and methods can be used to introduce a second genetic modification at a target locus in the same host cell or its progeny by transfecting the cell with a second vector comprising nucleic acid sequences that can modify a third target locus and a second module comprising nucleic acid sequences that can introduce a barcode sequence at a fourth target locus.
- the first and third target loci are the same, but the genetic modification is different.
- the second and fourth target loci are the same, but the barcode sequence is different. The above can be repeated to introduce additional genetic modifications along with different unique barcode sequences at the same or different target loci.
- the vector can be removed or lost in the host cell and its daughter cells.
- the intended combination of precise edits made in each cell can be determined by detecting the unique barcode sequence assigned to each edit combination.
- barcode sequence can be detected by Sanger sequencing, next generation sequencing (NGS) or other detection methods that distinguish the unique barcode sequence assigned to each edit combination. This can be performed in a mixture of host cells, a single host cell, or a clonal cell lineage.
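Determining the edit combination from a detected barcode amounts to a table lookup. The following minimal sketch counts sequencing reads per edit combination, assuming exact substring matching of barcodes; the barcode sequences, edit names, and reads are all hypothetical, and a production pipeline would typically anchor on flanking sequence and tolerate sequencing errors.

```python
from collections import Counter

# Hypothetical table: barcode sequence -> the edit combination assigned to it.
BARCODE_TO_EDITS = {
    "ACGTTGCA": ("edit_A",),
    "TTGACCGA": ("edit_A", "edit_B"),
}


def count_edit_combinations(reads, barcode_table=BARCODE_TO_EDITS):
    """Count reads per edit combination by exact barcode match."""
    counts = Counter()
    for read in reads:
        for barcode, edits in barcode_table.items():
            if barcode in read:
                counts[edits] += 1
                break  # assign each read to at most one barcode
    return counts


# Hypothetical reads; the last one contains no known barcode.
reads = ["GGACGTTGCATT", "CCTTGACCGAAA", "GGACGTTGCAGG", "NNNNNNNNNNNN"]
combo_counts = count_edit_combinations(reads)
```

Because the lookup depends only on the genomic barcode, the same approach applies to a mixture of cells, a single cell, or a clonal lineage.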
- the compositions and methods described herein provide the following advantages. 1) Detecting genetic modifications in the host cell does not require the presence of the expression vector in the host cell or its progeny.
- the two or more editing modules are present on a bicistronic retron-donor-guide editing vector.
- the bicistronic retron-donor-guide editing vector allows simultaneous editing of two different genetic target loci.
- the first and second modules comprise a retron-guide RNA cassette.
- Retron-guide RNA cassettes are described in US 2019/0330619 A1 (corresponding to WO 2018/049168) and US Provisional Patent App. No. 63/232,080 (filed 8 August 2021), which are hereby incorporated by reference herein in their entirety.
- the combination of edits that will be made across all modules is predetermined.
- the three editing modules are present on a retron-donor-guide editing vector.
- the bicistronic retron-donor-guide editing vector allows simultaneous editing of three different genetic target loci.
- the first, second and third modules comprise a retron-guide RNA cassette.
- the first and second modules introduce two genetic edits (a pair) in two different target sequences, and the third module introduces a unique barcode sequence that is associated with the pair of genetic variants introduced by the first and second modules (a “variant-pair” specific barcode).
- the first and second editing modules are connected by self-cleaving HDV-like ribozymes so that either module can separate and detach from the RNA pol2 transcript, which allows Cas9/retron binding and nuclear export.
- ribozymes are selected from drz-CIV-1, HDV ribozyme, and drz-Spur-3, though other combinations of ribozymes are expressly included herein.
- Genome editing methods commonly include the provision of both an engineered nuclease or nickase and a donor DNA repair template that contains the DNA sequence to be inserted at a desired location.
- the CRISPR/Cas9 system utilizes a guide RNA (gRNA) that directs the Cas9 nuclease to introduce a double-strand cut at a specific location.
- a donor DNA repair template can then be provided, enabling the precise insertion of a new sequence mediated by homology-directed repair of the double-strand cut.
- the gRNA and donor DNA template have been supplied as separate molecules, meaning that each editing experiment must be performed in a separate tube or vessel.
- the reverse transcription of the DNA coding unit (msd region) of the retron transcript results in a multicopy single-stranded DNA (msDNA) molecule that contains a donor DNA repair template and is physically tethered to the gRNA, increasing editing efficiency.
- the practice of the present disclosure employs, unless otherwise indicated, conventional techniques of immunology, biochemistry, chemistry, molecular biology, microbiology, cell biology, genomics and recombinant DNA, which are within the skill of the art. See Sambrook, Fritsch and Maniatis, Molecular Cloning: A Laboratory Manual, 2nd edition (1989), Current Protocols in Molecular Biology (F. M.
- Oligonucleotides that are not commercially available can be chemically synthesized, e.g., according to the solid phase phosphoramidite triester method first described by Beaucage and Caruthers, Tetrahedron Lett. 22:1859-1862 (1981), using an automated synthesizer, as described in Van Devanter et.
- any method or material similar or equivalent to a method or material described herein can be used in the practice of the present disclosure.
- the following terms are defined.
- the terms “a,” “an,” or “the” as used herein not only include aspects with one member, but also include aspects with more than one member.
- the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
- reference to “a cell” includes a plurality of such cells and reference to “the agent” includes reference to one or more agents known to those skilled in the art, and so forth.
- the term “about” in relation to a reference numerical value can include a range of values plus or minus 10% from that value.
- the amount “about 10” includes amounts from 9 to 11, including the reference numbers of 9, 10, and 11.
- the term “about” in relation to a reference numerical value can also include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value.
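The definition of “about” above can be read as a symmetric tolerance band around the reference value. The following is a minimal sketch of that reading; the function names `about_range` and `is_about` are hypothetical and do not appear in the disclosure, and the inclusive treatment of the endpoints follows the “about 10” example (9 to 11, including 9 and 11).

```python
def about_range(reference, tolerance_pct=10.0):
    """Return the inclusive (low, high) interval meant by "about reference".

    tolerance_pct is the plus-or-minus percentage; the disclosure lists
    10% down to 1% as possible values. (Hypothetical helper.)
    """
    delta = abs(reference) * tolerance_pct / 100.0
    return reference - delta, reference + delta


def is_about(value, reference, tolerance_pct=10.0):
    """True if value lies within the tolerance band, endpoints included."""
    low, high = about_range(reference, tolerance_pct)
    return low <= value <= high


if __name__ == "__main__":
    print(about_range(10))        # band for "about 10" at 10%
    print(is_about(9, 10))        # lower endpoint is included
    print(is_about(10.6, 10, 5))  # outside a 5% band
```

With the default 10% tolerance, `about_range(10)` spans 9 to 11, matching the example in the text.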
- the terms “5’ ” and “3’ ” denote the positions of elements or features relative to the overall arrangement of the retron-guide RNA cassettes, vectors, or retron donor DNA-guide molecules of the present disclosure in which they are included. Positions are not, unless otherwise specified, referred to in the context of the orientation of a particular element or features.
- the msr and msd loci in FIG. 4 are shown in opposite orientations.
- the msr locus is said to be 5’ of the msd locus.
- the 3’ end of the msr locus is said to be overlapping with the 5’ end of the msd locus.
- the term “upstream” refers to a position that is 5’ of a point of reference.
- the term “downstream” refers to a position that is 3’ of a point of reference.
- the msr locus is said to be located upstream of the reverse transcriptase sequence, and the reverse transcriptase sequence is said to be located downstream of the msr locus.
- the term “genome editing” refers to a type of genetic engineering in which DNA is inserted, replaced, or removed from a target DNA (e.g., the genome of a cell) using one or more nucleases and/or nickases.
- the nucleases create specific double-strand breaks (DSBs) at desired locations in the genome, and harness the cell’s endogenous mechanisms to repair the induced break by homology-directed repair (HDR) (e.g., homologous recombination) or by nonhomologous end joining (NHEJ).
- two nickases can be used to create two single-strand breaks on opposite strands of a target DNA, thereby generating a blunt or a sticky end.
- Any suitable DNA nuclease can be introduced into a cell to induce genome editing of a target DNA sequence.
- the terms “genetic modification,” “genetic edit,” and “genome edit” can be used interchangeably and refer to a change in the nucleic acid sequence of a target polynucleotide (e.g., the genomic DNA of a cell), such that the nucleic acid sequence of the modified DNA is different from the native, endogenous, previously modified, or wild-type sequence of the target DNA.
- DNA nuclease refers to an enzyme capable of cleaving the phosphodiester bonds between the nucleotide subunits of DNA, and may be an endonuclease or an exonuclease. According to the present disclosure, the DNA nuclease may be an engineered (e.g., programmable or targetable) DNA nuclease which can be used to induce genome editing of a target DNA sequence.
- DNA nuclease can be used including, but not limited to, CRISPR-associated protein (Cas) nucleases, other endo- or exo- nucleases, variants thereof, fragments thereof, and combinations thereof.
- double-strand break or “double-strand cut” refers to the severing or cleavage of both strands of the DNA double helix.
- the DSB may result in cleavage of both strands at the same position, leading to “blunt ends,” or in staggered cleavage that leaves a region of single-stranded DNA at the end of each DNA fragment, or “sticky ends.”
- a DSB may arise from the action of one or more DNA nucleases.
- nonhomologous end joining or “NHEJ” refers to a pathway that repairs double-strand DNA breaks in which the break ends are directly ligated without the need for a homologous template.
- the most common form of HDR is homologous recombination (HR), a type of genetic recombination in which nucleotide sequences are exchanged between two similar or identical molecules of DNA.
- nucleic acid refers to deoxyribonucleic acids (DNA), ribonucleic acids (RNA) and polymers thereof in either single-, double- or multi-stranded form.
- the term includes, but is not limited to, single-, double- or multi-stranded DNA or RNA, genomic DNA, cDNA, DNA-RNA hybrids, or a polymer comprising purine and/or pyrimidine bases or other natural, chemically modified, biochemically modified, non-natural, synthetic or derivatized nucleotide bases.
- a nucleic acid can comprise a mixture of DNA, RNA and analogs thereof.
- the term also encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated.
- degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)).
- SNPs are biallelic markers although tri- and tetra-allelic markers can also exist.
- a nucleic acid molecule comprising SNP A/C may include a C or A at the polymorphic position.
- the term “gene” means the segment of DNA involved in producing a polypeptide chain. The DNA segment may include regions preceding and following the coding region (leader and trailer) involved in the transcription/translation of the gene product and the regulation of the transcription/translation, as well as intervening sequences (introns) between individual coding segments (exons).
- cassette refers to a combination of genetic sequence elements that may be introduced as a single element and may function together to achieve a desired result.
- a cassette typically comprises polynucleotides in combinations that are not found in nature.
- a cassette can be inserted into a vector, such as an expression vector.
- operably linked refers to two or more genetic elements, such as a polynucleotide coding sequence and a promoter, placed in relative positions that permit the proper biological functioning of the elements, such as the promoter directing transcription of the coding sequence.
- inducible promoter refers to a promoter that responds to environmental factors and/or external stimuli that can be artificially controlled in order to modify the expression of, or the level of expression of, a polynucleotide sequence or refers to a combination of elements, for example an exogenous promoter and an additional element such as a trans-activator operably linked to a separate promoter.
- An inducible promoter may respond to abiotic factors such as oxygen levels or to chemical or biological molecules. In some embodiments, the chemical or biological molecules may be molecules not naturally present in humans.
- vector and “expression vector” refer to a nucleic acid construct, generated recombinantly or synthetically, with a series of specified nucleic acid elements that permit transcription of a particular polynucleotide sequence in a host cell.
- An expression vector may be part of a plasmid, viral genome, or nucleic acid fragment.
- an expression vector includes a polynucleotide to be transcribed, operably linked to a promoter.
- promoter is used herein to refer to an array of nucleic acid control sequences that direct transcription of a nucleic acid.
- a promoter includes necessary nucleic acid sequences near the start site of transcription, such as, in the case of a polymerase II type promoter, a TATA element.
- a promoter also optionally includes distal enhancer or repressor elements, which can be located as much as several thousand base pairs from the start site of transcription.
- Other elements that may be present in an expression vector include those that enhance transcription (e.g., enhancers) and terminate transcription (e.g., terminators).
- “Recombinant” refers to a genetically modified polynucleotide, polypeptide, cell, tissue, or organism.
- a recombinant polynucleotide (or a copy or complement of a recombinant polynucleotide) is one that has been manipulated using well known methods.
- a recombinant expression cassette comprising a promoter operably linked to a second polynucleotide can include a promoter that is heterologous to the second polynucleotide as the result of human manipulation (e.g., by methods described in Sambrook et al., Molecular Cloning - A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, (1989) or Current Protocols in Molecular Biology Volumes 1-3, John Wiley & Sons, Inc. (1994-1998)).
- a recombinant expression cassette typically comprises polynucleotides in combinations that are not found in nature. For instance, human manipulated restriction sites or plasmid vector sequences can flank or separate the promoter from other sequences.
- a recombinant protein is one that is expressed from a recombinant polynucleotide, and recombinant cells, tissues, and organisms are those that comprise recombinant sequences (polynucleotide and/or polypeptide).
- heterologous refers to biological material that is introduced, inserted, or incorporated into a recipient (e.g., host) organism that originates from another organism.
- heterologous material that is introduced into the recipient organism is not normally found in that organism.
- Heterologous material can include, but is not limited to, nucleic acids, amino acids, peptides, proteins, and structural elements such as genes, promoters, and cassettes.
- a host cell can be, but is not limited to, a bacterium, a yeast cell, a mammalian cell, or a plant cell. The introduction of heterologous material into a host cell or organism can result, in some instances, in the expression of additional heterologous material in or by the host cell or organism.
- the transformation of a yeast host cell with an expression vector that contains DNA sequences encoding a bacterial protein may result in the expression of the bacterial protein by the yeast cell.
- the incorporation of heterologous material may be permanent or transient.
- the expression of heterologous material may be permanent or transient.
- reporter and “selectable marker” can be used interchangeably and refer to a gene product that permits a cell expressing that gene product to be identified and/or isolated from a mixed population of cells. Such isolation might be achieved through the selective killing of cells not expressing the selectable marker, which may be, as a non-limiting example, an antibiotic resistance gene.
- the selectable marker may permit identification and/or subsequent isolation of cells expressing the marker as a result of the expression of a fluorescent protein such as GFP or the expression of a cell surface marker which permits isolation of cells by fluorescence-activated cell sorting (FACS), magnetic-activated cell sorting (MACS), or analogous methods.
- examples of cell surface markers include CD8, CD19, and truncated CD19.
- cell surface markers used for isolating desired cells are non-signaling molecules, such as subunit or truncated forms of CD8, CD19, or CD20. Suitable markers and techniques are known in the art.
- culture when referring to cell culture itself or the process of culturing, can be used interchangeably to mean that a cell (e.g., yeast cell) is maintained outside its normal environment under controlled conditions, e.g., under conditions suitable for survival.
- Cultured cells are allowed to survive, and culturing can result in cell growth, stasis, differentiation or division. The term does not imply that all cells in the culture survive, grow, or divide, as some may naturally die or senesce.
- Cells are typically cultured in media, which can be changed during the course of the culture.
- the terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. Tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed.
- administering includes oral administration, topical contact, administration as a suppository, intravenous, intraperitoneal, intramuscular, intralesional, intrathecal, intranasal, or subcutaneous administration to a subject.
- Administration is by any route, including parenteral and transmucosal (e.g., buccal, sublingual, palatal, gingival, nasal, vaginal, rectal, or transdermal).
- Parenteral administration includes, e.g., intravenous, intramuscular, intra-arteriole, intradermal, subcutaneous, intraperitoneal, intraventricular, and intracranial.
- Other modes of delivery include, but are not limited to, the use of liposomal formulations, intravenous infusion, transdermal patches, etc.
- the term “treating” refers to an approach for obtaining beneficial or desired results including, but not limited to, a therapeutic benefit and/or a prophylactic benefit.
- compositions may be administered to a subject at risk of developing a particular disease, condition, or symptom, or to a subject reporting one or more of the physiological symptoms of a disease, even though the disease, condition, or symptom may not have yet been manifested.
- effective amount or “sufficient amount” refers to the amount of an agent that is sufficient to effect beneficial or desired results.
- the therapeutically effective amount may vary depending upon one or more of: the subject and disease condition being treated, the weight and age of the subject, the severity of the disease condition, the manner of administration and the like, which can readily be determined by one of ordinary skill in the art.
- the specific amount may vary depending on one or more of: the particular agent chosen, the host cell type, the location of the host cell in the subject, the dosing regimen to be followed, whether it is administered in combination with other compounds, timing of administration, and the physical delivery system in which it is carried.
- pharmaceutically acceptable carrier refers to a substance that aids the administration of an active agent to a cell, an organism, or a subject.
- “Pharmaceutically acceptable carrier” refers to a carrier or excipient that can be included in the compositions of the disclosure and that causes no significant adverse toxicological effect on the patient.
- Non-limiting examples of pharmaceutically acceptable carriers include water, NaCl, normal saline solutions, lactated Ringer’s, normal sucrose, normal glucose, cell culture media, and the like.
- degrons can be located anywhere in a protein, and can include short amino acid sequences, structural motifs, or exposed amino acids (e.g., lysine, arginine). Degrons exist in both prokaryotic and eukaryotic organisms. Degrons can be classified as being either ubiquitin-dependent or ubiquitin-independent.
- cellular localization tag refers to an amino acid sequence, also known as a “protein localization signal,” that targets a protein for localization to a specific cellular or subcellular region, compartment, or organelle (e.g., nuclear localization sequence, Golgi retention signal).
- Cellular localization tags are typically located at either the N-terminal or C-terminal end of a protein.
- the term “synthetic response element” refers to a recombinant DNA sequence that is recognized by a transcription factor and facilitates gene regulation by various regulatory agents. A synthetic response element can be located within a gene promoter and/or enhancer region.
- the term “ribozyme” refers to an RNA molecule that is capable of catalyzing a biochemical reaction.
- ribozymes function in protein synthesis, catalyzing the linking of amino acids in the ribosome.
- ribozymes participate in various other RNA processing functions, such as splicing, viral replication, and tRNA biosynthesis.
- ribozymes can be self-cleaving.
- Non-limiting examples of ribozymes include the HDV ribozyme, the Lariat capping ribozyme (formerly called GIR1 branching ribozyme), the glmS ribozyme, group I and group II self-splicing introns, the hairpin ribozyme, the hammerhead ribozyme, various rRNA molecules, RNase P, the twister ribozyme, the VS ribozyme, the pistol ribozyme, and the hatchet ribozyme.
- Percent similarity in the context of polynucleotide or peptide sequences, is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the sequence (e.g., an msr locus sequence) in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence which does not comprise additions or deletions, for optimal alignment of the two sequences.
- the percentage is calculated by determining the number of positions at which the identical nucleotide or amino acid occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison and multiplying the result by 100 to yield the percentage of similarity (e.g., sequence similarity).
- a polynucleotide or peptide has at least about 70% similarity (e.g., sequence similarity), preferably at least about 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% similarity, to a reference sequence, when compared and aligned for maximum correspondence over a comparison window, or designated region as measured using one of the following sequence comparison algorithms or by manual alignment and visual inspection, such sequences are then said to be “substantially similar.”
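The percent-similarity calculation described above (matched positions divided by total positions in the comparison window, multiplied by 100) can be sketched as follows. This is an illustrative reading, not the disclosure's own implementation: it assumes the two sequences have already been optimally aligned, that gaps are written as `-`, and that gapped columns count toward the window length but never as matches. The names `percent_similarity` and `substantially_similar` are hypothetical, and the 70% default reflects the "at least about 70%" lower bound in the text.

```python
def percent_similarity(aligned_a, aligned_b, gap="-"):
    """Percent of window positions holding the identical residue.

    Both inputs must already be aligned (equal length, gaps as `gap`).
    Assumption: a gap column counts in the denominator, not as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


def substantially_similar(aligned_a, aligned_b, threshold=70.0):
    """Apply the 'at least about 70%' cutoff (threshold is adjustable)."""
    return percent_similarity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    test_seq = "ACGT-CGTAC"   # one gap relative to the reference
    reference = "ACGTACGTTC"
    print(percent_similarity(test_seq, reference))
    print(substantially_similar(test_seq, reference))
```

In practice the alignment itself would come from a tool such as BLAST; the sketch covers only the counting step.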
- this definition also refers to the complement of a test sequence.
- for sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared.
- test and reference sequences are entered into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. Default program parameters can be used, or alternative parameters can be designated.
- the sequence comparison algorithm then calculates the percent sequence similarities for the test sequences relative to the reference sequence, based on the program parameters.
- for sequence comparison of nucleic acids and proteins, the BLAST and BLAST 2.0 algorithms and the default parameters discussed below are used.
- Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat’l. Acad. Sci.
- HSPs high scoring sequence pairs
- Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0).
- the BLASTP program uses as defaults a word size (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see, e.g., Henikoff and Henikoff, Proc. Natl. Acad. Sci. USA 89:10915 (1992)).
- the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin and Altschul, Proc. Nat’l. Acad. Sci. USA, 90:5873-5877 (1993)).
- One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance.
- a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.2, more preferably less than about 0.01, and most preferably less than about 0.001.
- the present disclosure provides compositions and methods for simultaneously introducing genetic modifications at two different target loci in the genome of a host cell.
- the disclosure provides methods comprising the use of retron-guide RNA cassettes, vectors comprising said cassettes, and retron donor DNA-guide molecules of the present disclosure to modify nucleic acids of interest at target loci of interest, and to screen genetic loci of interest, in the genomes of host cells.
- the present disclosure also provides compositions and methods for preventing or treating genetic diseases by enhancing precise genome editing to correct a mutation in target genes associated with the diseases. Kits for genome editing and screening are also provided.
- the present disclosure can be used with any cell type and at any gene locus that is amenable to nuclease-mediated genome editing technology.
- the present disclosure provides a retron-guide RNA (gRNA) cassette.
- the cassette comprises: (a) a first retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a first donor DNA sequence located within the msd locus, wherein the first donor DNA sequence comprises homology to one or more sequences within a first target locus; and (v) a second inverted repeat sequence coding region; and (b) a first guide RNA (gRNA) coding region; (c) a second retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a second donor DNA sequence located within the second msd locus, wherein the second donor DNA sequence comprises homology to one
- the first donor DNA sequence can introduce a genetic modification or edit at the first target locus.
- the first and second donor DNA sequences can introduce genetic modifications or edits at the first and second target loci.
- the first donor DNA sequence comprises a genetic variant compared to the sequences within the first target locus.
- the first and second donor DNA sequences comprise genetic variants compared to the sequences within the first and second target loci, respectively.
- the first and second donor DNA sequences can introduce genetic modifications at the first and second target loci by HDR.
- the second donor DNA sequence comprises a sequence having a mutation (or edit) relative to the nucleic acid sequence of the second target locus.
- the second donor DNA sequence comprises or further comprises a unique barcode sequence.
- the second donor DNA sequence comprises both a mutation (or edit) relative to the nucleic acid sequence of the second target locus and a unique barcode sequence.
- the retron-guide RNA (gRNA) cassette can be used to introduce two mutations/edits at the first and second target loci, or to introduce two mutations/edits at the first and second target loci and a unique barcode sequence at the second target locus.
- the mutations introduced by the first and second donor DNA sequences are different.
- the barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus.
- Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with a defined Hamming distance between any pair of barcodes.
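One way to picture the programmed barcodes mentioned above (12-bp sequences that avoid common restriction sites and keep a defined Hamming distance between any pair) is the greedy sketch below. It is an assumption-laden illustration rather than the disclosed design procedure: the three recognition sites listed (EcoRI, BamHI, HindIII), the minimum distance of 3, and the function names `hamming` and `pick_barcodes` are all chosen here for the example.

```python
import random

# Assumed exclusion list: EcoRI, BamHI, HindIII recognition sites.
# All three are palindromic, so checking one strand covers both.
RESTRICTION_SITES = ("GAATTC", "GGATCC", "AAGCTT")


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pick_barcodes(n, length=12, min_distance=3, seed=0, max_tries=100_000):
    """Greedily collect n barcodes that avoid the listed sites and keep
    at least min_distance mismatches between every pair."""
    rng = random.Random(seed)  # seeded so the design is reproducible
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if any(site in candidate for site in RESTRICTION_SITES):
            continue
        if all(hamming(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
    if len(chosen) < n:
        raise RuntimeError("could not find enough barcodes; relax constraints")
    return chosen


if __name__ == "__main__":
    for barcode in pick_barcodes(5):
        print(barcode)
```

A minimum pairwise distance of 3 means a single sequencing substitution still leaves a read closer to its true barcode than to any other, which is the usual motivation for such a constraint.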
- the barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker.
- compositions and methods described herein provide the ability to introduce two or more edits into the genome of a host cell, where a first edit at the first target locus causes a biological effect that can be monitored by measuring the second edit at the second target locus.
- the first edit comprises an eQTL variant edit that affects expression/transcription of a gene, which can be tracked by the RNA/DNA ratio of the second edit (e.g., by inserting a barcode sequence into the 3’UTR of the gene).
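The RNA/DNA ratio readout mentioned above can be illustrated with a small counting sketch. It assumes that barcode read counts have already been tallied separately from an RNA (cDNA) library and a genomic DNA library, normalizes each to its library total, and adds a pseudocount to avoid division by zero. The function name `rna_dna_ratios`, the input layout, and the pseudocount are hypothetical choices for this example, not part of the disclosure.

```python
def rna_dna_ratios(rna_counts, dna_counts, pseudocount=1.0):
    """Per-barcode expression proxy: RNA read fraction / DNA read fraction.

    rna_counts, dna_counts: dicts mapping barcode -> read count.
    pseudocount (must be > 0) is added to every count before normalizing.
    """
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    barcodes = set(rna_counts) | set(dna_counts)
    if not barcodes:
        raise ValueError("no barcodes supplied")
    rna_adj = {b: rna_counts.get(b, 0) + pseudocount for b in barcodes}
    dna_adj = {b: dna_counts.get(b, 0) + pseudocount for b in barcodes}
    rna_total = sum(rna_adj.values())
    dna_total = sum(dna_adj.values())
    return {
        b: (rna_adj[b] / rna_total) / (dna_adj[b] / dna_total)
        for b in barcodes
    }


if __name__ == "__main__":
    rna = {"variantA_bc": 199, "variantB_bc": 99}
    dna = {"variantA_bc": 99, "variantB_bc": 99}
    for barcode, ratio in sorted(rna_dna_ratios(rna, dna).items()):
        print(barcode, round(ratio, 3))
```

Under these assumptions, a barcode with a higher ratio than its neighbors suggests that the paired variant increases transcript abundance relative to its representation in the cell population.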
- the first edit at the first target locus affects the phenotype of a cell, such as cell physiology or growth, cultured in a media comprising a test compound or drug, where the phenotype can be monitored by determining the number of copies of a DNA barcode inserted at the second target locus measured at different timepoints during growth in the media comprising the test compound or drug.
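Tracking barcode copy number across timepoints, as described above, is commonly summarized as a change in each barcode's relative frequency. The sketch below is one possible summary under stated assumptions: counts are given as one dictionary per timepoint in chronological order, only the first and last timepoints are compared, and a pseudocount guards against zero counts. The function name `log2_fold_changes` and the data layout are hypothetical.

```python
import math


def log2_fold_changes(counts_by_time, pseudocount=1.0):
    """log2 change in relative barcode frequency, last vs. first timepoint.

    counts_by_time: list of dicts (barcode -> read count), ordered by time.
    pseudocount (must be > 0) is added to every count before normalizing.
    """
    if len(counts_by_time) < 2:
        raise ValueError("need at least two timepoints")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    first, last = counts_by_time[0], counts_by_time[-1]
    barcodes = set(first) | set(last)

    def frequencies(counts):
        adjusted = {b: counts.get(b, 0) + pseudocount for b in barcodes}
        total = sum(adjusted.values())
        return {b: adjusted[b] / total for b in barcodes}

    start, end = frequencies(first), frequencies(last)
    return {b: math.log2(end[b] / start[b]) for b in barcodes}


if __name__ == "__main__":
    timepoints = [
        {"bc1": 99, "bc2": 99},   # before exposure to the test compound
        {"bc1": 49, "bc2": 149},  # after growth in the compound
    ]
    for barcode, lfc in sorted(log2_fold_changes(timepoints).items()):
        print(barcode, round(lfc, 3))
```

A negative value indicates that cells carrying the paired edit were depleted during growth in the test condition; a positive value indicates enrichment.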
- the first edit at the first target locus introduces an amino acid variant in an enzyme, and the second edit inserts a barcode into a gene encoding a substrate of the enzyme.
- the first edit at the first target locus introduces an amino acid variant into a ubiquitin ligase that affects target protein translation
- the first edit can be tracked by sorting cells comprising a barcode and sequences encoding a detectable marker (such as green fluorescent protein (GFP)) integrated at the second target locus, e.g., in sequences encoding the C-terminus of a target protein.
- the first or second gRNA coding region is upstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 5’ of the RNA transcribed from the retron.
- transcription products of the retron and the gRNA coding region are physically coupled.
- the resulting gRNA and donor DNA sequences are also physically coupled (e.g., during genome editing and/or screening).
- the transcription products are coupled during a single transcription event.
- the transcription products of the retron and the gRNA coding region are initially coupled, and then subsequently become uncoupled (e.g., after transcription of the retron, or after reverse transcription of the retron transcript), in which case the guide RNA and the donor DNA sequence will also be physically uncoupled during genome editing and/or screening.
- uncoupling can be induced by a ribozyme.
- a suitable ribozyme is the hepatitis delta virus (HDV) ribozyme.
- the cassette further comprises a ribozyme sequence (e.g., HDV ribozyme sequence).
- the ribozyme sequence encodes a ribozyme selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead and combinations thereof.
- transcription products of the retron and the gRNA coding region are not initially physically coupled (i.e., the transcription products are created in separate transcription events).
- the retron and the gRNA coding region can be included in two different retron-gRNA cassettes, which can be included in the same vector or in different vectors.
- expression from the vector(s) occurs inside a host cell.
- transcription of the retron and/or the gRNA coding region occurs outside of the host cell, and then the transcription product(s) are introduced into the host cell.
- the transcription products are created in separate transcription events and are subsequently joined together for genome editing and/or screening, in which case the resulting gRNA and donor DNA sequence will also be physically coupled for genome editing and/or screening. Such joining can occur before or after reverse transcription of the retron transcript (i.e., before or after creation of msDNA from the retron transcript).
- the transcription products of the retron and the gRNA coding region result in a donor DNA sequence and a gRNA that are never physically coupled.
- the retron and the gRNA coding region are located in different cassettes and the resulting donor DNA sequence and gRNA act in trans.
- the gRNA coding region of the cassette is located 3’ of the retron. In other embodiments, the gRNA coding region is located 5’ of the retron. The relative positions of the gRNA coding region and retron may be selected, for example, based upon the particular nuclease being used.
- the retron-gRNA cassette is at least about 5,000 nucleotides in length.
- the retron-gRNA cassette is between about 1,000 and 5,000 (i.e., about 1,000, 1,100, 1,200, 1,300, 1,400, 1,500, 1,600, 1,700, 1,800, 1,900, 2,000, 2,100, 2,200, 2,300, 2,400, 2,500, 2,600, 2,700, 2,800, 2,900, 3,000, 3,100, 3,200, 3,300, 3,400, 3,500, 3,600, 3,700, 3,800, 3,900, 4,000, 4,100, 4,200, 4,300, 4,400, 4,500, 4,600, 4,700, 4,800, 4,900, or 5,000) nucleotides in length.
- the cassette is between about 300 and 1,000 (i.e., about 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 1,000) nucleotides in length.
- the cassette is between about 200 and 300 (i.e., about 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300) nucleotides in length.
- the cassette is between about 30 and 200 (i.e., about 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200) nucleotides in length.
- the cassette further comprises one or more sequences having homology to a vector cloning site.
- These vector homology sequences can be about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more nucleotides in length.
- the vector homology sequences are about 20 nucleotides in length.
- the vector homology sequences are about 15 nucleotides in length.
- the vector homology sequences are about 25 nucleotides in length.
- the present disclosure provides a vector comprising a retron-guide RNA cassette of the present disclosure.
- the vector further comprises a promoter.
- the promoter is operably linked to the cassette.
- the promoter is inducible.
- the promoter is an RNA polymerase II promoter.
- the promoter is an RNA polymerase III promoter.
- a combination of promoters is used.
- the vector further comprises a terminator sequence.
- Vectors of the present disclosure can include commercially available recombinant expression vectors and fragments and variants thereof.
- Vectors of the present disclosure may further comprise a reverse transcriptase (RT) coding sequence and, optionally, may further comprise a nuclear localization sequence (NLS). In some instances, the NLS will be located 5’ of the RT coding sequence.
- Vectors of the present disclosure can further comprise a nuclease coding sequence. The sequence can encode Cas9, Cpf1, or any other suitable nuclease. Examples of suitable nucleases are provided herein and will also be known to one of skill in the art.
- expression of the retron-gRNA cassette and the RT coding sequence and/or the nuclease coding sequence can all be under the control of a single promoter.
- expression of the retron-gRNA cassette and the RT coding sequence and/or the nuclease coding sequence can each be under the control of a different promoter.
- Other combinations are also possible.
- expression of the retron-gRNA cassette can be under the control of one promoter, while expression of the RT coding sequence and/or the nuclease coding sequence are under the control of another promoter.
- expression of the retron-gRNA cassette and expression of the RT coding sequence can be under the control of one promoter, while expression of the nuclease coding sequence can be under the control of another promoter.
- expression of the retron-gRNA cassette and expression of the nuclease coding sequence can be under the control of one promoter, while the RT coding sequence is under the control of another promoter.
- one or more of the promoters are inducible.
- the vector can comprise a retron-gRNA cassette under the control of a Gal7 promoter, an RT coding sequence under the control of a Gal10 promoter, and a nuclease (e.g., Cas9) coding sequence under the control of a Gal1 promoter.
- a reporter unit that includes a nucleotide sequence encoding a reporter polypeptide (e.g., a detectable polypeptide, fluorescent polypeptide, or a selectable marker (e.g., URA3)).
- the size of the vector will depend on the size of the individual components within the vector, e.g., retron-gRNA cassette, RT coding sequence, nuclease coding sequence, NLS, and so on.
- the vector is between about 1,000 and about 20,000 (i.e., about 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, 10,000, 10,500, 11,000, 11,500, 12,000, 12,500, 13,000, 13,500, 14,000, 14,500, 15,000, 15,500, 16,000, 16,500, 17,000, 17,500, 18,000, 18,500, 19,000, 19,500, or 20,000) nucleotides in length.
- the vector is more than about 20,000 nucleotides in length.
- msDNA multicopy single-stranded DNA
- the donor DNA sequence is physically coupled to the gRNA, by virtue of the msDNA being physically coupled to the gRNA.
- at least some of the RNA content of the msDNA is degraded (e.g., by an RNase such as RNase H).
- Retrons have been known for some time as a class of retroelement, first discovered in gram-negative bacteria such as Myxococcus xanthus (e.g., retrons Mx65 and Mx162), Stigmatella aurantiaca (e.g., retron Sa163), and Escherichia coli (e.g., retrons Ec48, Ec67, Ec73, Ec78, Ec83, Ec86, and Ec107).
- Retrons are also found in Salmonella typhimurium (e.g., retron St85), Salmonella enteritidis, Vibrio cholerae (e.g., retron Vc95), Vibrio parahaemolyticus (e.g., retron Vp96), Klebsiella pneumoniae, Proteus mirabilis, Xanthomonas campestris, Rhizobium sp., Bradyrhizobium sp., Ralstonia metallidurans, Nannocystis exedens (e.g., retron Ne144), Geobacter sulfurreducens, Trichodesmium erythraeum, Nostoc punctiforme, Nostoc sp., Staphylococcus aureus, Fusobacterium nucleatum, and Flexibacter elegans.
- the present disclosure provides for retron-guide RNA cassettes that comprise a retron.
- the retron is derived from the E. coli retron Ec86, which is shown in FIG.2.
- Retrons mediate the synthesis in host cells of multicopy single-stranded DNA (msDNA) molecules, which result from the reverse transcription of a retron transcript and typically include a DNA component and an RNA component.
- the native msDNA molecules reportedly exist as single-stranded DNA-RNA hybrids, characterized by a structure which comprises a single-stranded DNA branching out of an internal guanosine residue of a single-stranded RNA molecule at a 2’,5’-phosphodiester linkage.
- RNA content of the msDNA molecule is degraded. In some instances, the RNA content is degraded by RNase H.
- Native retrons have been found to consist of the gene for reverse transcriptase (RT) and msr and msd loci under the control of a single promoter.
- a vector comprising a retron-guide RNA cassette further comprises a sequence encoding an RT.
- methods are provided wherein the RT is encoded on a separate plasmid from the retron-guide RNA cassette.
- the RT is encoded in a sequence that has been integrated into the host cell genome.
- the msd region of a retron transcript typically codes for the DNA component of msDNA
- the msr region of a retron transcript typically codes for the RNA component of msDNA.
- the msr and msd loci have overlapping ends, and may be oriented opposite one another with a promoter located upstream of the msr locus which transcribes through the msr and msd loci.
- sequence of the msd locus will vary, depending on the particular donor DNA sequence that is located within the msd locus.
- the msd and msr regions of retron transcripts generally contain first and second inverted repeat sequences, which together make up a stable stem structure.
- the combined msr-msd region of the retron transcript serves not only as a template for reverse transcription but, by virtue of its secondary structure, also serves as a primer (i.e., self-priming) for msDNA synthesis by a reverse transcriptase.
- the first inverted repeat sequence coding region is located within the 5’ end of the msr locus.
- the second inverted repeat sequence coding region is located 3’ of the msd locus.
- the first inverted repeat sequence is located within the 5’ end of the msr region.
- the second inverted repeat sequence is located 3’ of the msd region.
- a non-limiting example is shown in FIG.4, wherein the msr and msd loci are arranged in opposite orientations.
- the first inverted sequence repeat coding region is shown at the 5’ end of the cassette, while the second inverted sequence repeat coding region is shown near the 3’ end of the cassette.
- sequence of an inverted repeat sequence coding region can be varied, so long as the sequence of the counterpart inverted repeat sequence coding region within the same retron is also varied such that the two resulting inverted repeat sequences (i.e., present within a retron transcript) are complementary and allow for the formation of a stable stem structure.
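The constraint on the two inverted repeats can be illustrated with a minimal Python sketch. It assumes strict Watson-Crick pairing only (no G-U wobble pairs) and uses hypothetical function names and example sequences that do not come from the disclosure; it simply checks whether two repeat sequences in a transcript are reverse complements and derives a matching counterpart when one repeat is varied.

```python
# Illustrative sketch only: Watson-Crick pairing, no wobble pairs considered.
_RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return seq.upper().translate(_RNA_COMPLEMENT)[::-1]


def can_form_stem(first_repeat: str, second_repeat: str) -> bool:
    """True if the two repeats are fully complementary when antiparallel."""
    return reverse_complement_rna(first_repeat) == second_repeat.upper()


def matching_second_repeat(first_repeat: str) -> str:
    """Given a varied first repeat, return the counterpart that restores pairing."""
    return reverse_complement_rna(first_repeat)


if __name__ == "__main__":
    first = "GCGCACCC"  # hypothetical repeat sequence
    second = matching_second_repeat(first)
    print(second, can_form_stem(first, second))
```

A real design would also weigh stem stability (length, GC content), which this sketch does not model.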
- Any number of RTs may be used in alternative embodiments of the present disclosure, including prokaryotic and eukaryotic RTs. If desired, the nucleotide sequence of a native RT may be modified, for example using known codon optimization techniques, so that expression within the desired host is optimized.
- the RT may be targeted to the nucleus so that efficient utilization of the RNA template may take place.
- An example of such a RT includes any known RT, either prokaryotic or eukaryotic, fused to a nuclear localization sequence or signal (NLS).
- the vector further comprises an NLS.
- the NLS is located 5’ of the RT coding sequence.
- any suitable NLS may also be used, providing that the NLS assists in localizing the RT within the nucleus.
- an RT lacking an NLS may also be used if the RT is present within the nuclear compartment at a level that synthesizes a product from the RNA template.
- the retron-guide RNA cassettes and retron donor DNA-guide molecules of the present disclosure comprise guide RNA (gRNA) coding regions and gRNA molecules, respectively.
- the gRNAs for use in the CRISPR-retron system of the present disclosure typically include a crRNA sequence that is complementary to a target nucleic acid sequence and may include a scaffold sequence (e.g., tracrRNA) that interacts with a Cas nuclease (e.g., Cas9) or a variant or fragment thereof, depending on the particular nuclease being used.
- the gRNA can comprise any nucleic acid sequence having sufficient complementarity with a target polynucleotide sequence (e.g., target DNA sequence) to hybridize with the target sequence and direct sequence-specific binding of a nuclease to the target sequence.
- the gRNA may recognize a protospacer adjacent motif (PAM) sequence that may be near or adjacent to the target DNA sequence.
- the target DNA site may lie immediately 5’ of a PAM sequence, which is specific to the bacterial species of the Cas9 used.
- the PAM sequence of Streptococcus pyogenes-derived Cas9 is NGG; the PAM sequence of Neisseria meningitidis-derived Cas9 is NNNNGATT; the PAM sequence of Streptococcus thermophilus-derived Cas9 is NNAGAA; and the PAM sequence of Treponema denticola-derived Cas9 is NAAAAC.
- the PAM sequence can be 5’-NGG, wherein N is any nucleotide; 5’-NRG, wherein N is any nucleotide and R is a purine; or 5’-NNGRR, wherein N is any nucleotide and R is a purine.
- the selected target DNA sequence should immediately precede (i.e., be located 5’ of) a 5’-NGG PAM, wherein N is any nucleotide, such that the guide sequence of the DNA-targeting RNA (e.g., gRNA) base pairs with the opposite strand to mediate cleavage at about 3 base pairs upstream of the PAM sequence.
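As a rough illustration of the 5’-NGG rule described above, the following Python sketch scans one strand of a DNA string for candidate target sequences that sit immediately 5’ of an NGG PAM and reports the position about 3 base pairs upstream of the PAM. The function name, the 20-nucleotide default length, and the example sequence are assumptions made for illustration; the sketch ignores the reverse strand and off-target considerations.

```python
import re


def find_ngg_targets(dna: str, guide_len: int = 20):
    """List candidate targets located immediately 5' of an NGG PAM (one strand only)."""
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    hits = []
    for match in pattern.finditer(dna):
        start = match.start()
        pam_start = start + guide_len
        hits.append(
            {
                "target": match.group(1),
                "pam": match.group(2),
                "start": start,
                # Index of the first base 3' of the expected cut,
                # i.e. 3 bp upstream of the PAM.
                "cut_index": pam_start - 3,
            }
        )
    return hits


if __name__ == "__main__":
    example = "TT" + "ACGTACGTACGTACGTACGT" + "AGG" + "TT"  # hypothetical sequence
    for hit in find_ngg_targets(example):
        print(hit)
```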
- the target DNA site may lie immediately 3’ of a PAM sequence, e.g., when the Cpf1 endonuclease is used.
- the PAM sequence is 5’-TTTN, where N is any nucleotide.
- the target DNA sequence (i.e., the genomic DNA sequence having complementarity for the gRNA) will typically follow (i.e., be located 3’ of) the PAM sequence.
- Two Cpf1-family nucleases, AsCpf1 (from Acidaminococcus) and LbCpf1 (from Lachnospiraceae), are known to function in human cells. Both AsCpf1 and LbCpf1 cut 19 bp after the PAM sequence on the targeted strand and 23 bp after the PAM sequence on the opposite strand of the DNA molecule.
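The Cpf1 geometry described above (target 3’ of a 5’-TTTN PAM, staggered cuts 19 bp and 23 bp after the PAM) can be sketched as follows. This is a simplified, single-strand illustration; the function name, the 23-nucleotide target length, and the example sequence are assumptions and not values taken from the disclosure.

```python
def find_cpf1_sites(dna: str, target_len: int = 23):
    """List candidate sites located immediately 3' of a TTTN PAM (one strand only)."""
    dna = dna.upper()
    sites = []
    for i in range(len(dna) - 3 - target_len + 1):
        if dna[i:i + 3] != "TTT":
            continue
        pam_end = i + 4  # index of the first base after the 4-nt TTTN PAM
        sites.append(
            {
                "pam": dna[i:pam_end],
                "target": dna[pam_end:pam_end + target_len],
                # Cut offsets follow the 19 bp / 23 bp figures stated above.
                "cut_targeted_strand": pam_end + 19,
                "cut_opposite_strand": pam_end + 23,
            }
        )
    return sites


if __name__ == "__main__":
    example = "TTTA" + "ACGT" * 6  # hypothetical sequence
    for site in find_cpf1_sites(example):
        print(site)
```

The 4-nucleotide difference between the two cut positions corresponds to the staggered ends that Cpf1 is described as producing.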
- the degree of complementarity between a guide sequence of the gRNA (i.e., crRNA sequence) and its corresponding target sequence, when optimally aligned using a suitable alignment algorithm is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more.
- Optimal alignment may be determined with the use of any suitable algorithm for aligning sequences, non-limiting examples of which include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, algorithms based on the Burrows-Wheeler Transform (e.g., the Burrows-Wheeler Aligner), ClustalW, Clustal X, BLAT, Novoalign (Novocraft Technologies), ELAND (Illumina, San Diego, Calif.), SOAP (available at soap.genomics.org.cn), and Maq (available at maq.sourceforge.net).
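To make the percent-complementarity figures concrete, here is a minimal Python sketch. It is not one of the alignment algorithms named above: it assumes the guide and the candidate site are already aligned, have equal length, and contain no gaps, and it compares the guide (with U read as T) against the DNA strand that carries the same sequence as the guide, which is equivalent to scoring base pairing with the opposite strand. The function name and sequences are hypothetical.

```python
def percent_complementarity(guide_rna: str, protospacer_dna: str) -> float:
    """Percent of guide positions that match an equal-length, pre-aligned site."""
    guide = guide_rna.upper().replace("U", "T")
    site = protospacer_dna.upper()
    if len(guide) != len(site):
        raise ValueError("sequences must be pre-aligned and of equal length")
    if not guide:
        raise ValueError("sequences must not be empty")
    matches = sum(g == s for g, s in zip(guide, site))
    return 100.0 * matches / len(guide)


if __name__ == "__main__":
    guide = "ACGUACGUACGUACGUACGU"    # hypothetical 20-nt guide
    site = "ACGTACGTACGTACGTACGA"     # one mismatch at the last position
    print(percent_complementarity(guide, site))
```

For gapped or local alignments, a full Smith-Waterman or Needleman-Wunsch implementation would be needed instead of this position-by-position comparison.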
- a crRNA sequence is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 75, or more nucleotides in length. In some instances, a crRNA sequence is about 20 nucleotides in length. In other instances, a crRNA sequence is about 15 nucleotides in length. In other instances, a crRNA sequence is about 25 nucleotides in length.
- The nucleotide sequence of a modified gRNA can be selected using any of the web-based software described above.
- Considerations for selecting a DNA-targeting RNA include the PAM sequence for the nuclease (e.g., Cas9 or Cpf1) to be used, and strategies for minimizing off-target modifications.
- Tools such as the CRISPR Design Tool can provide sequences for preparing the gRNA, for assessing target modification efficiency, and/or for assessing cleavage at off-target sites.
- the length of the gRNA molecule is about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, or more nucleotides in length.
- the length of the gRNA is about 100 nucleotides in length.
- the gRNA is about 90 nucleotides in length.
- the gRNA is about 110 nucleotides in length.
- the present disclosure provides retron-guide RNA cassettes comprising a retron that comprises a donor DNA sequence.
- the present disclosure provides retron donor DNA-guide molecules comprising retron transcripts that comprise donor DNA sequence coding regions, the retron transcripts subsequently being reverse transcribed to yield msDNA that comprises a donor DNA sequence.
- the donor DNA sequence or sequences participate in homology-directed repair (HDR) of genetic loci of interest following cleavage of genomic DNA at the genetic locus or loci of interest (i.e., after a nuclease has been directed to cut at a specific genetic locus of interest, targeted by binding of gRNA to a target sequence).
- the recombinant donor repair template (i.e., donor DNA sequence) comprises two homology arms that are homologous to portions of the sequence of the genetic locus of interest at either side of a Cas nuclease (e.g., Cas9 or Cpf1 nuclease) cleavage site.
- the homology arms may be the same length or may have different lengths.
- each homology arm has at least about 70 to about 99 percent similarity (i.e., at least about 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99 percent similarity) to a portion of the sequence of the genetic locus of interest at either side of a nuclease (e.g., Cas nuclease) cleavage site.
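The homology-arm arrangement described above can be sketched in Python as follows. The sketch assumes a simple model in which the donor is the sequence immediately left of the cleavage site, an optional payload (for example an edit or barcode), and the sequence immediately right of the cleavage site; the function name, arm length, and sequences are hypothetical and are not taken from the disclosure.

```python
def build_donor(locus: str, cut_index: int, arm_len: int, payload: str = "") -> str:
    """Assemble left arm + payload + right arm around a cleavage position.

    cut_index is the index of the first base on the 3' side of the cut.
    """
    if not 0 <= cut_index <= len(locus):
        raise ValueError("cut_index must lie within the locus")
    left_arm = locus[max(0, cut_index - arm_len):cut_index]
    right_arm = locus[cut_index:cut_index + arm_len]
    return left_arm + payload + right_arm


if __name__ == "__main__":
    locus = "AAAACCCCGGGGTTTT"  # hypothetical target locus
    print(build_donor(locus, cut_index=8, arm_len=4, payload="TAG"))
```

Arms of unequal length, as permitted above, could be supported by passing separate left and right lengths.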
- the recombinant donor repair template comprises or further comprises a reporter unit that includes a nucleotide sequence encoding a reporter polypeptide (e.g., a detectable polypeptide, fluorescent polypeptide, or a selectable marker). If present, the two homology arms can flank the reporter cassette and are homologous to portions of the genetic locus of interest at either side of the Cas nuclease cleavage site.
- the reporter unit can further comprise a sequence encoding a self-cleavage peptide, one or more nuclear localization signals, and/or a fluorescent polypeptide (e.g., superfolder GFP (sfGFP)). Other suitable reporters are described herein.
- the donor DNA sequence is at least about 500 to 10,000 (i.e., at least about 500, 600, 700, 800, 900, 1,000, 1,100, 1,200, 1,300, 1,400, 1,500, 1,600, 1,700, 1,800, 1,900, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, or 10,000) nucleotides in length.
- the donor DNA sequence is between about 600 and 1,000 (i.e., about 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, or 1,000) nucleotides in length.
- the donor DNA sequence is between about 100 and 500 (i.e., about 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, or 500) nucleotides in length.
- the donor DNA sequence is less than about 100 (i.e., less than about 100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, or 5) nucleotides in length.
- the donor DNA sequence in the second retron comprises a barcode sequence.
- the barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus.
- Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with a defined Hamming distance between any pair of barcodes.
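One way to picture programmed barcodes of this kind is the Python sketch below. It is an illustrative rejection-sampling approach, not the procedure used in the disclosure: the restriction sites listed (EcoRI, BamHI, HindIII), the minimum distance of 3, the random seed, and all function names are assumptions chosen for the example.

```python
import random

# Assumed examples of "common restriction sites": EcoRI, BamHI, HindIII.
RESTRICTION_SITES = ("GAATTC", "GGATCC", "AAGCTT")


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def make_barcodes(n, length=12, min_dist=3, seed=0, max_tries=100_000):
    """Draw random barcodes, keeping those that avoid the listed sites and
    differ from every previously kept barcode by at least min_dist."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if any(site in cand for site in RESTRICTION_SITES):
            continue
        if all(hamming(cand, kept) >= min_dist for kept in chosen):
            chosen.append(cand)
    if len(chosen) < n:
        raise RuntimeError("could not find enough barcodes; relax the constraints")
    return chosen


if __name__ == "__main__":
    for barcode in make_barcodes(5):
        print(barcode)
```

A larger minimum distance makes barcodes more robust to sequencing errors at the cost of a smaller usable barcode space.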
- the barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker.
- the compositions and methods of the disclosure can be used to introduce genetic modifications anywhere in the genomic or chromosomal DNA of a cell, or in exogenous (non-host cell) DNA, such as the DNA of transgenes, viruses or transposons.
- the exogenous DNA is present in the nucleus of a host cell.
- the exogenous DNA is integrated into the host cell genomic DNA, for example as a transgene.
- the compositions and methods of the disclosure can be used to modify a heterologous or exogenous genome, such as a viral genome, a bacterial genome, a transposable element or an endovirus genome that is not part of the endogenous host cell genome.
- the compositions and methods of the disclosure can be used to modify a heterologous or exogenous genome of a pathogen, such as a virus or bacterium, that is present in the host cell.
- the target locus is located in heterologous or exogenous DNA that is not integrated into the host cell genomic DNA, such as transiently expressed transgenes, episomes or plasmids.
- the method identifies a genetic modification at a target locus within a genome of a host cell, where the genome comprises the endogenous genomic chromosomal DNA of the host cell. In some embodiments, the method identifies a genetic modification at a target locus anywhere within a genome of a host cell.
- the target locus is located in an exogenous genome that is present in a host cell, such as a viral genome, a bacterial genome, a transposable element or an endovirus genome that are not part of the endogenous host cell genome.
- the target locus is located in heterologous or exogenous DNA, such as the DNA of transgenes, viruses or transposons, that are present in the host cell or host cell nucleus.
- the target locus is located in heterologous or exogenous DNA that is integrated into the host cell genomic DNA.
- the target locus is located in heterologous or exogenous DNA that is not integrated into the host cell genomic DNA, such as transiently expressed transgenes, episomes or plasmids.
- the retron-guide RNA cassette comprises a first donor DNA sequence having homology to one or more sequences within a first target locus, and a second donor DNA sequence located within the second msd locus, wherein the second donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence, where the first and second target loci are located within the genomic DNA of a host cell.
- the retron-guide RNA cassette comprises a first donor DNA sequence having homology to one or more sequences within a first target locus, and a second donor DNA sequence located within the second msd locus, wherein the second donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence, where the first and second target loci are located within exogenous or heterologous DNA that is present in a host cell or organism.
- the first and second target loci are located within exogenous or heterologous DNA that is integrated in the host cell genomic DNA.
- the first and second target loci are located within exogenous or heterologous DNA that is not integrated into the host cell genomic DNA.
- the first target locus is located in cis to the second target locus.
- the first and second target loci are located on the same chromosome, in the same gene, or adjacent to or within the same transcription unit.
- the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located at a different position in the transcription unit.
- the first target locus is located upstream or 5’ of a gene or transcription unit, and the second target locus is located downstream or 3’ of a gene or transcription unit.
- the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in the 3’ untranslated region (UTR) of the same transcription unit.
- the first and/or second target locus is located in an intron or non-coding RNA expressed by a gene.
- the first donor DNA sequence in the retron cassette comprises a genetic variant, such as a single nucleotide polymorphism, missense mutation, synonymous mutation, nonsense mutation, insertion, or a deletion, relative to the sequence at the first target locus.
- the genetic variant comprises a cis-expression quantitative trait locus (cis-eQTL) variant at the first target locus.
- the first target locus is located in trans to the second target locus.
- the first and second target loci are located on different chromosomes or in different genes.
- the first target locus is located in a trans-regulatory element, and the second target locus is located in a gene, or in a transcription unit that is in trans to the first target locus.
- the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit in trans to the first target locus.
- the first donor DNA sequence in the retron cassette comprises a genetic variant compared to the sequences within the first target locus.
- the genetic variant comprises an amino acid change in a transcription factor that regulates the expression (e.g., transcription) of another gene or transcript.
- the genetic variant comprises a mutation in a transcription factor binding site that modifies the expression of a gene or transcript located in cis or trans to the second target locus.
- the genetic variant comprises a trans-expression quantitative trait locus (trans-eQTL) variant at the first target locus.
- multiple rounds of genetic targeting are performed on the same pool of cells, or a single cell that has a genetic modification at a target locus.
- the first round of genetic editing can introduce a genetic modification at a first target locus and a barcode sequence at a second target locus.
- a second genetic modification can be introduced at the same first target locus or a different (third) target locus and a new genetic modification in the barcode sequence, or a new unique barcode sequence, is introduced at the second target locus.
- the consecutive barcodes can be identified by NGS or Sanger sequencing.
- the barcodes could encode different fluorescent markers and the combinations of markers can be determined by flow cytometry or fluorescence microscopy.
- the barcodes could encode different peptides and the combinations of peptides can be determined by mass spectrometry.
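The idea of reading consecutive barcodes to infer which edits a cell lineage carries can be sketched as a simple lookup and tally. The Python example below assumes that a barcode-to-edit table is already available (for example from paired sequencing of the plasmid library) and that each sequenced lineage yields its barcodes in order; the barcode strings, edit labels, and function names are all hypothetical.

```python
from collections import Counter


def decode_history(barcodes, barcode_to_edit):
    """Translate an ordered series of barcodes into the edits they mark."""
    return tuple(barcode_to_edit.get(bc, "unassigned") for bc in barcodes)


def tally_histories(reads, barcode_to_edit):
    """Count how many reads support each decoded edit history."""
    return Counter(decode_history(read, barcode_to_edit) for read in reads)


if __name__ == "__main__":
    # Hypothetical lookup table and reads, for illustration only.
    table = {"AAAC": "edit_A", "GGGT": "edit_B"}
    reads = [("AAAC",), ("AAAC", "GGGT"), ("AAAC",), ("TTTT",)]
    for history, count in tally_histories(reads, table).items():
        print(history, count)
```

Barcodes missing from the table are reported as "unassigned" rather than discarded, so that sequencing errors or unexpected barcodes remain visible in the tally.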
- the second target locus corresponds to a region of the genome that is transcriptionally competent but is not likely to cause adverse effects on cells resulting from mutated or inserted DNA, often referred to as “safe-harbors.”
- the second target locus is i) located in an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene.
- the second target locus comprises the yeast S. cerevisiae YBR209W locus described in Levy SF, et al., Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature. 2015 Mar 12;519(7542):181-6. doi: 10.1038/nature14279.
- the second target locus comprises the human AAVS1 locus (also known as the PPP1R12C locus) on chromosome 19.
- AAVS1 is a well-validated “safe harbor” for inserting DNA transgenes with expected function. It has an open chromatin structure and is transcription-competent. Most importantly, there are no known adverse effects on cells resulting from the inserted DNA fragment of interest. See the internet at www.genecopoeia.com/product/aavs1-safe-harbor/.
- the CRISPR/Cas system of genome modification includes a Cas nuclease (e.g., Cas9 or Cpf1 nuclease) or a variant or fragment or combination thereof and a DNA-targeting RNA (e.g., guide RNA (gRNA)).
- the gRNA may contain a guide sequence that targets the Cas nuclease to the target genomic DNA and a scaffold sequence that interacts with the Cas nuclease (e.g., tracrRNA).
- the system may optionally include a donor repair template.
- a fragment of a Cas nuclease or a variant thereof with desired properties can be used.
- the donor repair template can include a nucleotide sequence encoding a reporter polypeptide such as a fluorescent protein or an antibiotic resistance marker, and homology arms that are homologous to the target DNA and flank the site of gene modification.
- the CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas (CRISPR-associated protein) nuclease system is an engineered nuclease system based on a bacterial system that can be used for genome engineering.
- the crRNA then associates, through a region of partial complementarity, with another type of RNA called tracrRNA to guide the Cas (e.g., Cas9) nuclease to a region homologous to the crRNA in the target DNA called a “protospacer.”
- the Cas (e.g., Cas9) nuclease cleaves the DNA to generate blunt ends at the double-strand break at sites specified by a 20-nucleotide guide sequence contained within the crRNA transcript.
- the Cas (e.g., Cas9) nuclease may require both the crRNA and the tracrRNA for site-specific DNA recognition and cleavage.
- This system has now been engineered such that the crRNA and tracrRNA, if needed, can be combined into one molecule (the “single guide RNA” or “sgRNA”), and the crRNA equivalent portion of the guide RNA can be engineered to guide the Cas (e.g., Cas9) nuclease to target any desired sequence (see, e.g., Jinek et al. (2012) Science, 337:816-821; Jinek et al. (2013) eLife, 2:e00471; Segal (2013) eLife, 2:e00563).
- the CRISPR/Cas system can be engineered to create a double-strand break at a desired target in a genome of a cell, and harness the cell’s endogenous mechanisms to repair the induced break by homology-directed repair (HDR) or nonhomologous end-joining (NHEJ).
- the Cas nuclease can direct cleavage of one or both strands at a location in a target DNA sequence.
- the Cas nuclease can be a nickase having one or more inactivated catalytic domains that cleaves a single strand of a target DNA sequence.
- Non-limiting examples of Cas nucleases include Cas1, Cas1B, Cas2, Cas3, Cas4, Cas5, Cas6, Cas7, Cas8, Cas9 (also known as Csn1 and Csx12), Cas10, Csy1, Csy2, Csy3, Cse1, Cse2, Csc1, Csc2, Csa5, Csn2, Csm2, Csm3, Csm4, Csm5, Csm6, Cmr1, Cmr3, Cmr4, Cmr5, Cmr6, Csb1, Csb2, Csb3, Csx17, Csx14, Csx10, Csx16, CsaX, Csx3, Csx1, Csx15, Csf1, Csf2, Csf3, Csf4, Cpf1, homologs thereof, variants thereof, fragments thereof, mutants thereof, derivatives thereof, and combinations thereof.
- There are three main types of Cas nucleases (type I, type II, and type III), and 10 subtypes including 5 type I, 3 type II, and 2 type III proteins (see, e.g., Hochstrasser and Doudna, Trends Biochem Sci, 2015:40(1):58-66).
- Type II Cas nucleases include Cas1, Cas2, Csn2, Cas9, and Cpf1. These Cas nucleases are known to those skilled in the art.
- the amino acid sequence of the Streptococcus pyogenes wild-type Cas9 polypeptide is set forth, e.g., in NCBI Ref. Seq. No. NP_269215 and the amino acid sequence of Streptococcus thermophilus wild-type Cas9 polypeptide is set forth, e.g., in NCBI Ref. Seq. No. WP_011681470. Furthermore, the amino acid sequence of Acidaminococcus sp. BV3L6 is set forth, e.g., in NCBI Ref. Seq. No. WP_021736722.1.
- Cas nucleases can be derived from a variety of bacterial species including, but not limited to, Veillonella atypica, Fusobacterium nucleatum, Filifactor alocis, Solobacterium moorei, Coprococcus catus, Treponema denticola, Peptoniphilus duerdenii, Catenibacterium mitsuokai, Streptococcus mutans, Listeria innocua, Staphylococcus pseudintermedius, Acidaminococcus intestini, Olsenella uli, Oenococcus kitaharae, Bifidobacterium bifidum, Lactobacillus rhamnosus, Lactobacillus gasseri, Finegoldia magna, Mycoplasma mobile, Mycoplasma gallisepticum, Mycoplasma ovipneumoniae, Mycoplasma canis, Myco
- Torquens Ilyobacter polytropus, Ruminococcus albus, Akkermansia muciniphila, Acidothermus cellulolyticus, Bifidobacterium longum, Bifidobacterium dentium, Corynebacterium diphtheriae, Elusimicrobium minutum, Nitratifractor salsuginis, Sphaerochaeta globus, Fibrobacter succinogenes subsp.
- Cpf1 refers to an RNA-guided double-stranded DNA-binding nuclease protein that is a type II Cas nuclease.
- Wild-type Cpf1 contains a RuvC-like endonuclease domain similar to the RuvC domain of Cas9, but does not have an HNH endonuclease domain and the N-terminal region of Cpf1 does not have the alpha-helix recognition lobe possessed by Cas9.
- the wild-type protein requires a single RNA molecule, as no tracrRNA is necessary.
- Wild-type Cpf1 creates staggered-end cuts and utilizes a T-rich protospacer-adjacent motif (PAM) that is 5’ of the guide RNA targeting sequence.
- Cas9 refers to an RNA-guided double-stranded DNA-binding nuclease protein or nickase protein that is a type II Cas nuclease. Wild-type Cas9 nuclease has two functional domains, e.g., RuvC and HNH, that cut different DNA strands. The wild-type enzyme requires two RNA molecules (e.g., a crRNA and a tracrRNA), or alternatively, a single fusion molecule (e.g., a gRNA comprising a crRNA and a tracrRNA).
- Wild-type Cas9 utilizes a G-rich protospacer-adjacent motif (PAM) that is 3’ of the guide RNA targeting sequence and creates double-strand cuts having blunt ends. Cas9 can induce double-strand breaks in genomic DNA (target DNA) when both functional domains are active.
- the Cas9 enzyme can comprise one or more catalytic domains of a Cas9 protein derived from bacteria belonging to the group consisting of Corynebacter, Sutterella, Legionella, Treponema, Filifactor, Eubacterium, Streptococcus, Lactobacillus, Mycoplasma, Bacteroides, Flaviivola, Flavobacterium, Sphaerochaeta, Azospirillum, Gluconacetobacter, Neisseria, Roseburia, Parvibaculum, Staphylococcus, Nitratifractor, and Campylobacter.
- the two catalytic domains are derived from different bacteria species.
- Useful variants of the Cas9 nuclease can include a single inactive catalytic domain, such as a RuvC- or HNH- enzyme or a nickase.
- a Cas9 nickase has only one active functional domain and can cut only one strand of the target DNA, thereby creating a single-strand break or nick.
- a double-strand break can be introduced using a Cas9 nickase if at least two DNA-targeting RNAs that target opposite DNA strands are used.
- a double-nicked induced double-strand break can be repaired by NHEJ or HDR (Ran et al., 2013, Cell, 154:1380-1389).
- This gene editing strategy favors HDR and decreases the frequency of insertion/deletion (“indel”) mutations at off-target DNA sites.
- Cas9 nucleases or nickases are described in, for example, U.S. Patent Nos. 8,895,308; 8,889,418; and 8,865,406 and U.S. Application Publication Nos. 2014/0356959, 2014/0273226 and 2014/0186919.
- the Cas9 nuclease or nickase can be codon-optimized for the host cell or host organism.
- the Cas nuclease can be a Cas9 fusion protein such as a polypeptide comprising the catalytic domain of a restriction enzyme (e.g., FokI) linked to dCas9.
- the FokI-dCas9 fusion protein is referred to as fCas9.
- fCas9 can use two guide RNAs to bind to a single strand of target DNA to generate a double-strand break.
- a nucleotide sequence encoding the Cas nuclease is present in a recombinant expression vector.
- the recombinant expression vector is a viral construct, e.g., a recombinant adeno-associated virus construct, a recombinant adenoviral construct, a recombinant lentiviral construct, etc.
- viral vectors can be based on vaccinia virus, poliovirus, adenovirus, adeno-associated virus, SV40, herpes simplex virus, human immunodeficiency virus, and the like.
- a retroviral vector can be based on Murine Leukemia Virus, spleen necrosis virus, and vectors derived from retroviruses such as Rous Sarcoma Virus, Harvey Sarcoma Virus, avian leukosis virus, a lentivirus, human immunodeficiency virus, myeloproliferative sarcoma virus, mammary tumor virus, and the like.
- Useful expression vectors are known to those of skill in the art, and many are commercially available. The following vectors are provided by way of example for eukaryotic host cells: pXT1, pSG5, pSVK3, pBPV, pMSG, and pSVLSV40.
- any other vector may be used if it is compatible with the host cell.
- useful expression vectors containing a nucleotide sequence encoding a Cas9 enzyme are commercially available from, e.g., Addgene, Life Technologies, Sigma-Aldrich, and Origene.
- any of a number of transcription and translation control elements including promoter, transcription enhancers, transcription terminators, and the like, may be used in the expression vector.
- Useful promoters can be derived from viruses, or any organism, e.g., prokaryotic or eukaryotic organisms.
- Promoters may also be inducible (i.e., capable of responding to environmental factors and/or external stimuli that can be artificially controlled).
- Suitable promoters include, but are not limited to: RNA polymerase II promoters (e.g., pGAL7 and pTEF1), RNA polymerase III promoters (e.g., RPR-tetO, SNR52, and tRNA-tyr), the SV40 early promoter, a mouse mammary tumor virus long terminal repeat (LTR) promoter, an adenovirus major late promoter (Ad MLP), a herpes simplex virus (HSV) promoter, a cytomegalovirus (CMV) promoter such as the CMV immediate early promoter region (CMVIE), a Rous sarcoma virus (RSV) promoter, a human U6 small nuclear promoter (U6), an enhanced U6 promoter, a human H1 promoter (H1), etc.
- Suitable terminators include, but are not limited to, SNR52 and RPR terminator sequences, which can be used with transcripts created under the control of an RNA polymerase III promoter. Additionally, various primer binding sites may be incorporated into a vector to facilitate vector cloning, sequencing, genotyping, and the like. As a non-limiting example, the Pci1-Up sequence can be incorporated. Other suitable promoter, enhancer, terminator, and primer binding sequences will readily be known to one of skill in the art.
- D. Methods for identifying genetic modifications at a target locus
- The disclosure also provides methods for identifying a genetic modification at a target locus within the genome of a host cell, or within a heterologous or exogenous genome or DNA present in a host cell.
- the method comprises transforming the host cell with a vector comprising a retron guide cassette described herein.
- the method is an in vitro method.
- the method is an in vivo method.
- the host cell or transformed progeny of the host cell express a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region.
- the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell.
- the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target locus and comprise sequence modifications compared to the sequences within the first target locus.
- the first target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the first gRNA.
- the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the first target locus within the genome.
- at least a portion of the second retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the second target locus.
- the second target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the second gRNA.
- the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert a unique barcode sequence at the second target locus.
- the method comprises detecting the presence of the unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the first target locus, thereby identifying the genetic modification at the first target locus.
- the first target locus is located in cis to the second target locus.
- the first and second target loci are located on the same chromosome, in the same gene, or adjacent to or within the same transcription unit.
- the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located at a different position in the transcription unit.
- the first target locus is located upstream or 5’ of a gene or transcription unit, and the second target locus is located downstream or 3’ of a gene or transcription unit.
- the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in the 3’ untranslated region (UTR) of the same transcription unit.
- the first and/or second target locus is located in an intron or non-coding RNA expressed by a gene.
- the first donor DNA sequence in the retron cassette comprises a genetic variant, such as a single nucleotide polymorphism, insertion, or a deletion, relative to the sequence at the first target locus.
- the genetic variant comprises a cis-expression quantitative trait locus (cis-eQTL) variant at the first target locus.
- the first target locus is located in trans to the second target locus.
- the first and second target loci are located on different chromosomes or in different genes.
- the first target locus is located in a trans-regulatory element, and the second target locus is located in a gene, or in a transcription unit that is in trans to the first target locus. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit in trans to the first target locus.
- the first donor DNA sequence in the retron cassette comprises a genetic variant compared to the sequences within the first target locus. In some embodiments, the genetic variant comprises an amino acid change in a transcription factor that regulates the expression (e.g., transcription) of another gene or transcript.
- the genetic variant comprises a mutation in a transcription factor binding site that modifies the expression of a gene or transcript located in cis or trans to the second target locus.
- the genetic variant comprises a trans-expression quantitative trait locus (trans-eQTL) variant at the first target locus.
- the barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus.
- Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with a defined Hamming distance between any pair of barcodes.
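The programmed-barcode option above (12-bp sequences that avoid common restriction sites and keep a defined Hamming distance between any pair) can be illustrated with a minimal Python sketch. The restriction sites listed, the minimum distance of 3, and the function names are assumptions chosen for illustration; the disclosure does not specify them.

```python
import random

# Hypothetical exclusion list: EcoRI, BamHI and HindIII recognition sites.
# The source only says "common restriction sites"; these are illustrative.
# All three sites are palindromic, so no reverse-complement check is needed here.
RESTRICTION_SITES = ("GAATTC", "GGATCC", "AAGCTT")


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def has_site(seq, sites=RESTRICTION_SITES):
    """Return True if any excluded restriction site occurs in seq."""
    return any(site in seq for site in sites)


def design_barcodes(n, length=12, min_distance=3, seed=0, max_tries=100_000):
    """Greedily pick n random barcodes that avoid the excluded sites and
    differ from every previously chosen barcode at >= min_distance positions."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if has_site(candidate):
            continue
        if all(hamming(candidate, b) >= min_distance for b in chosen):
            chosen.append(candidate)
    if len(chosen) < n:
        raise RuntimeError("could not find enough barcodes; relax the constraints")
    return chosen


barcodes = design_barcodes(20)
```

A larger minimum distance makes barcodes more tolerant of sequencing errors but reduces how many barcodes of a given length can coexist.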
- the barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker.
- the second target locus corresponds to a region of the genome that is transcriptionally competent but is not likely to cause adverse effects on cells resulting from mutated or inserted DNA, often referred to as “safe-harbors.”
- the second target locus is i) located in an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene.
- the second target locus comprises the yeast S. cerevisiae YBR209W locus described in Levy SF, et al., Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature. 2015 Mar 12;519(7542):181-6. doi: 10.1038/nature14279.
- the second target locus comprises the human AAVS1 locus (also known as the PPP1R12C locus) on chromosome 19.
- detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence.
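As a rough illustration of barcode detection by sequencing, the following Python sketch extracts the barcode found between two flanking sequences at the second target locus and looks up the paired edit. The flank sequences, the 12-bp barcode length, and the `barcode_to_edit` table (which would come from paired sequencing of the cloned plasmid library) are hypothetical inputs, not values given in the disclosure.

```python
from collections import Counter


def count_barcodes(reads, left_flank, right_flank, barcode_to_edit, barcode_length=12):
    """Tally reads per edit by locating the barcode between two known flanks.

    Reads lacking the left flank, the right flank at the expected offset,
    or a barcode present in the lookup table are ignored.
    """
    counts = Counter()
    for read in reads:
        start = read.find(left_flank)
        if start == -1:
            continue
        bc_start = start + len(left_flank)
        bc_end = bc_start + barcode_length
        if read[bc_end:bc_end + len(right_flank)] != right_flank:
            continue
        edit = barcode_to_edit.get(read[bc_start:bc_end])
        if edit is not None:
            counts[edit] += 1
    return counts


# Hypothetical example data.
table = {"ACGTACGTACGT": "edit_A", "TTGCATTGCATT": "edit_B"}
reads = [
    "CC" + "AAAC" + "ACGTACGTACGT" + "GGGT" + "CC",
    "AAAC" + "TTGCATTGCATT" + "GGGT",
    "AAAC" + "ACGTACGTACGT" + "GGGT",
    "AAAC" + "G" * 12 + "GGGT",  # barcode not in the table, ignored
]
edit_counts = count_barcodes(reads, "AAAC", "GGGT", table)
```

This sketch uses exact matching only; a real pipeline would typically also tolerate sequencing errors, for example by assigning a read to the nearest barcode within a Hamming-distance threshold.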
- the vector is no longer present in the host cell when detecting the presence of the unique barcode sequence. In some embodiments, the vector is not integrated in the genome of the host cell.
- the vector can be lost from the host cell or its progeny by dilution during cell division.
- the vector can be actively removed from the cell.
- the vector contains a gene that is toxic to the host cell.
- the vector contains the URA3 marker gene and the cells are treated with 5-Fluoroorotic acid (5-FOA) to selectively cause toxicity to cells that retain the vector.
- the vector can include a gene that can be used for counter-selection to kill host cells that retain the vector. See Mezzadra R, et al., A Traceless Selection: Counter-selection System That Allows Efficient Generation of Transposon and CRISPR-modified T-cell Products.
- the vector can encode surface markers that are expressed in vector-containing cells following the genetic edits, so that these cells can be immobilized by antibodies and discarded. The remaining post-edit cells that lost the transient vector can then be retained for later use.
- the vector contains sequences that can be targeted by gRNA introduced to the cell post-editing to cut the DNA vector and expose it to exonuclease degradation.
- greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the first target locus.
- the method steps are repeated by transforming the host cell or progeny thereof with a second vector comprising a second retron-guide RNA cassette to introduce a second pair or combination of edits into the genome of the host cell. This allows multiple edits to be tracked in the same cell or clonal population of transformed cells by detecting the presence and/or expression of the different barcodes inserted into the genome of the host cell.
- the method further comprises transforming the host cell or progeny thereof with a second vector comprising a second retron-guide RNA cassette comprising: a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region; a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target locus and
- the host cell expresses a third retron donor DNA-guide molecule comprising a third retron transcript and the third gRNA coding region and a fourth retron donor DNA-guide molecule comprising a fourth retron transcript and the fourth gRNA coding region.
- the third and fourth retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell.
- the third retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the third target locus and comprise sequence modifications compared to the sequences within the third target locus.
- the third target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the third gRNA.
- the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the third target locus within the genome.
- at least a portion of the fourth retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the fourth target locus.
- the fourth target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the fourth gRNA.
- the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert the second unique barcode sequence at the fourth target locus.
- the method comprises detecting the presence of the second unique barcode sequence, wherein the presence of the second unique barcode sequence indicates the presence of the genetic modification at the third target locus, thereby identifying the genetic modification at the third target locus.
- the third target locus is located in cis to the fourth target locus.
- the third and fourth target loci are located on the same chromosome, in the same gene, or adjacent to or within the same transcription unit.
- the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located at a different position in the transcription unit.
- the third target locus is located upstream or 5’ of a gene or transcription unit, and the fourth target locus is located downstream or 3’ of a gene or transcription unit.
- the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the same transcription unit.
- the third and/or fourth target locus is located in an intron or non-coding RNA expressed by a gene.
- the third donor DNA sequence in the second retron-guide RNA cassette comprises a genetic variant, such as a single nucleotide polymorphism, insertion, or a deletion, relative to the sequence at the third target locus.
- the genetic variant comprises a cis-expression quantitative trait locus (cis-eQTL) variant at the third target locus.
- the third target locus is located in trans to the fourth target locus.
- the third and fourth target loci are located on different chromosomes or in different genes.
- the third target locus is located in a trans-regulatory element, and the fourth target locus is located in a gene, or in a transcription unit that is in trans to the third target locus. In some embodiments, the third target locus is located in a trans-regulatory element, and the fourth target locus is located in the 3’ untranslated region (UTR) of a transcription unit in trans to the third target locus.
- the third donor DNA sequence in the retron cassette comprises a genetic variant compared to the sequences within the third target locus. In some embodiments, the genetic variant comprises an amino acid change in a transcription factor that regulates the expression (e.g., transcription) of another gene or transcript.
- the genetic variant comprises a mutation in a transcription factor binding site that modifies the expression of a gene or transcript located in cis or trans to the fourth target locus.
- the genetic variant comprises a trans-expression quantitative trait locus (trans-eQTL) variant at the third target locus.
- the second unique barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus.
- Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with a defined Hamming distance between any pair of barcodes.
- the second unique barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker.
- the second unique barcode sequence is different than the unique barcode sequence (i.e., the first unique barcode sequence) inserted at the second target locus.
- the fourth target locus corresponds to a region of the genome that is transcriptionally competent but is not likely to cause adverse effects on cells resulting from mutated or inserted DNA, often referred to as “safe-harbors.”
- the fourth target locus is i) located in an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene.
- the fourth target locus comprises the yeast S.
- the fourth target locus comprises the human AAVS1 locus (also known as the PPP1R12C locus) on chromosome 19.
- detecting the presence of the second unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence.
- the second vector is no longer present in the host cell when detecting the presence of the unique barcode sequence. In some embodiments, the second vector is not integrated in the genome of the host cell. In some embodiments, the second vector can be lost from the host cell or its progeny by dilution during cell division.
- In some embodiments, greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the second barcode sequence and the sequence modifications compared to the sequences within the third target locus.
- In some embodiments, the methods further comprise detecting or determining the relative expression of transcription from the transcription units comprising genetic variants at the first and third target loci.
- the relative expression can be determined by quantifying the amount of the barcode sequence and determining the relative ratio of transcript sequences to barcode sequences.
- the amount of the barcode sequence is measured by performing RT-qPCR assays using primers that amplify the barcode sequence.
- the amount of the barcode sequence is determined by next generation sequencing (NGS).
- transcript abundance is determined by measuring or quantifying the amount of a detectable marker encoded by the barcode.
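One way to read the ratio-based measurement described above is sketched below in Python: barcode read counts from transcripts (for example, NGS of cDNA) are normalized to barcode read counts from genomic DNA, so that each barcode's expression is corrected for how many cells carry it. This normalization scheme, the function name, and the input dictionaries are assumptions for illustration; the disclosure states only that a relative ratio of transcript sequences to barcode sequences is determined.

```python
def relative_expression(rna_counts, dna_counts):
    """Return, per barcode, (fraction of RNA reads) / (fraction of DNA reads).

    A value above 1 suggests the barcoded transcription unit is expressed
    more than its share of the cell population would predict; below 1, less.
    Barcodes with no DNA reads are skipped because the ratio is undefined.
    """
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    if rna_total == 0 or dna_total == 0:
        raise ValueError("both count tables must contain at least one read")
    ratios = {}
    for barcode, dna in dna_counts.items():
        if dna == 0:
            continue
        rna_fraction = rna_counts.get(barcode, 0) / rna_total
        dna_fraction = dna / dna_total
        ratios[barcode] = rna_fraction / dna_fraction
    return ratios


# Hypothetical counts for two barcodes marking two different variants.
ratios = relative_expression(
    rna_counts={"ACGTACGTACGT": 30, "TTGCATTGCATT": 10},
    dna_counts={"ACGTACGTACGT": 50, "TTGCATTGCATT": 50},
)
```

In this example both barcodes are equally abundant in genomic DNA, so the threefold difference in RNA reads translates directly into a threefold difference in the ratios.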
- the TRACE-Seq method (“The TRACE-Seq method tracks recombination alleles and identifies clonal reconstitution dynamics of gene targeted human hematopoietic stem cells.” Nat Commun 12, 472 (2021). https://doi.org/10.1038/s41467-020-20792-y) incorporates the genetic variant and barcode with one guide in a single editing event, which limits it to using amino acid codon replacements as barcodes.
- the codon swap barcoding strategy also is not applicable for non-coding sequences where it is important to preserve all nucleotides.
- the current methods allow insertion of the barcode sequence elsewhere in the genome and do not interfere with the locus comprising the genetic variant edit.
- the TRACE-seq method is less useful because all loci must be genotyped, which limits throughput.
- the first and third gRNAs are the same.
- the first and third target loci are the same.
- the genetic modifications or edits at the first and third loci are different.
- the second and fourth gRNAs are the same.
- the first and third gRNAs are the same, and the second and fourth gRNAs are the same.
- the second and fourth target loci are the same.
- the barcode sequences inserted at the same target loci are different. In some embodiments, the barcode sequences inserted at the second and fourth target loci are different.
- In some embodiments of the methods described herein, different guide RNAs are used to introduce different genetic modifications at different target loci, but the same guide RNA is used to introduce different barcodes at the same target locus. This allows the same validated gRNA to be used to insert the barcode sequence at the target locus with high efficiency. Thus, in some embodiments, the first and third gRNAs are different. In some embodiments, the first and third target loci are different. In some embodiments, the genetic modifications at the first and third loci are different.
- the second and fourth gRNAs are the same. In some embodiments, the first and third gRNAs are different, and the second and fourth gRNAs are the same. In some embodiments, the second and fourth target loci are the same. In some embodiments, the barcode sequences inserted at the second and fourth target loci are different.
- In some embodiments of the methods described herein, different guide RNAs are used to introduce different genetic modifications at different target loci, and different guide RNAs are used to introduce different barcode sequences at different target loci. Thus, in some embodiments, the first and third gRNAs are different, and the second and fourth gRNAs are different.
- the first and third target loci are different, and the second and fourth target loci are different.
- the genetic modifications at the first and third loci are different, and the barcode sequences inserted at the second and fourth target loci are different.
- the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site.
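The similarity requirement for homology arms can be checked with a simple calculation. The Python sketch below computes ungapped percent identity between a homology arm and the equally long stretch of target sequence beside the cleavage site, then tests it against the 70% lower bound mentioned above. Treating "similarity" as ungapped percent identity is an assumption, and the function names and example sequences are hypothetical.

```python
def percent_identity(arm, target):
    """Ungapped percent identity between two equal-length DNA sequences."""
    if not arm or len(arm) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(arm.upper(), target.upper()))
    return 100.0 * matches / len(arm)


def meets_minimum_similarity(arm, target, minimum=70.0):
    """Return True if the arm is at least `minimum` percent identical to the target."""
    return percent_identity(arm, target) >= minimum


# Hypothetical 10-nt arm carrying a single substitution relative to the target.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")  # 9 of 10 positions match
acceptable = meets_minimum_similarity("ACGTACGTAC", "ACGTACGTAA")
```

Real homology arms are typically much longer than this example, and an alignment-based measure would be needed if the donor introduces insertions or deletions within an arm.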
- the methods comprise detecting the presence of the unique barcode at the second target locus, thereby identifying the genetic modification at both the first and third target loci.
- the methods are repeated with a third vector comprising a third retron-guide RNA cassette that inserts a genetic modification at a fifth target locus and a unique barcode sequence at a sixth target locus, thereby identifying the genetic modification at the fifth target locus.
- the methods can be repeated multiple times with vectors comprising different retron-guide RNA cassettes to insert additional genetic modifications at the same or different target loci and to introduce additional unique barcodes at specific loci in the host cell genome that can be used to track the corresponding genetic modifications.
- the host cell is a prokaryotic cell.
- the host cell is a eukaryotic cell, such as a yeast cell or mammalian cell.
- the host cell comprises a clonal population of host cells.
- the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells.
- the methods comprise transforming a mixture of cells with one or more vectors comprising the first, second and/or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell.
- the methods comprise detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette.
- the genetic modifications can be detected by sequencing the genomic DNA comprising the modification, or by detecting a change in one or more phenotypes expressed by the host cell or organism comprising the host cell.
- the presence of the unique barcode sequence can be detected by sequencing the genomic DNA comprising the barcode sequence, or by detecting a protein or detectable marker encoded by the barcode sequence.
- Methods for introducing nucleic acids into host cells are known in the art, and any known method can be used to introduce a nuclease or a nucleic acid (e.g., a nucleotide sequence encoding the nuclease or reverse transcriptase, a DNA-targeting RNA (e.g., a guide RNA), a donor repair template for homology-directed repair (HDR), etc.) into a cell.
- Non-limiting examples of suitable methods include electroporation, viral or bacteriophage infection, transfection, conjugation, protoplast fusion, lipofection, calcium phosphate precipitation, polyethyleneimine (PEI)-mediated transfection, DEAE-dextran mediated transfection, liposome-mediated transfection, particle gun technology, direct microinjection, nanoparticle-mediated nucleic acid delivery, and the like.
- the components of the CRISPR-retron system can be introduced into a cell using a delivery system.
- the delivery system comprises a nanoparticle, a microparticle (e.g., a polymer micropolymer), a liposome, a micelle, a virosome, a viral particle, a nucleic acid complex, a transfection agent, an electroporation agent (e.g., using a NEON transfection system), a nucleofection agent, a lipofection agent, and/or a buffer system that includes a nuclease component (as a polypeptide or encoded by an expression construct), a reverse transcriptase component, and one or more nucleic acid components such as a DNA-targeting RNA (e.g., a guide RNA) and/or a donor repair template.
- the components can be mixed with a lipofection agent such that they are encapsulated or packaged into cationic submicron oil-in-water emulsions.
- the components can be delivered without a delivery system, e.g., as an aqueous solution.
- Methods of preparing liposomes and encapsulating polypeptides and nucleic acids in liposomes are described in, e.g., Methods and Protocols, Volume 1: Pharmaceutical Nanocarriers: Methods and Protocols. (ed. Weissig). Humana Press, 2009 and Heyes et al. (2005) J Controlled Release 107:276-87.
- Methods of preparing microparticles and encapsulating polypeptides and nucleic acids are described in, e.g., Functional Polymer Colloids and Microparticles volume 4 (Microspheres, microcapsules & liposomes). (eds. Arshady & Guyot). Citus Books, 2002 and Microparticulate Systems for the Delivery of Proteins and Vaccines. (eds. Cohen & Bernstein). CRC Press, 1996.
- F. Host cells
- the present disclosure provides host cells that have been transformed by vectors of the present disclosure.
- the compositions and methods of the present disclosure can be used for genome editing of any host cell of interest.
- the host cell can be a cell from any organism, e.g., a bacterial cell, an archaeal cell, a cell of a single-cell eukaryotic organism, a plant cell (e.g., a rice cell, a wheat cell, a tomato cell, an Arabidopsis thaliana cell, a Zea mays cell and the like), an algal cell (e.g., Botryococcus braunii, Chlamydomonas reinhardtii, Nannochloropsis gaditana, Chlorella pyrenoidosa, Sargassum patens C.), a fungal cell (e.g., yeast cell, etc.), an animal cell (e.g., fruit fly, cnidarian, echinoderm, nematode, etc.), a cell from a vertebrate animal (e.g., fish, amphibian, reptile, bird, mammal, etc.), a cell from a mammal, a cell from a human, a cell from a healthy human, a cell from a human patient, a cell from a cancer patient, etc.
- the host cell treated by the method disclosed herein can be transplanted to a subject (e.g., patient).
- the host cell can be derived from the subject to be treated (e.g., patient).
- Any type of cell may be of interest, such as a stem cell, e.g., embryonic stem cell, induced pluripotent stem cell, adult stem cell, e.g., mesenchymal stem cell, neural stem cell, hematopoietic stem cell, organ stem cell, a progenitor cell, a somatic cell, e.g., fibroblast, hepatocyte, heart cell, liver cell, pancreatic cell, muscle cell, skin cell, blood cell, neural cell, immune cell, and any other cell of the body, e.g., human body.
- the cells can be primary cells or primary cell cultures derived from a subject, e.g., an animal subject or a human subject, and allowed to grow in vitro for a limited number of passages.
- the cells are disease cells or derived from a subject with a disease.
- the cells can be cancer or tumor cells.
- the cells can also be immortalized cells (e.g., cell lines), for instance, from a cancer cell line.
- Cells can be harvested from a subject by any standard method. For instance, cells from tissues, such as skin, muscle, bone marrow, spleen, liver, kidney, pancreas, lung, intestine, stomach, etc., can be harvested by a tissue biopsy or a fine needle aspirate.
- Blood cells and/or immune cells can be isolated from whole blood, plasma or serum.
- suitable primary cells include peripheral blood mononuclear cells (PBMC), peripheral blood lymphocytes (PBL), and other blood cell subsets such as, but not limited to, a T cell, a natural killer cell, a monocyte, a natural killer T cell, a monocyte-precursor cell, a hematopoietic stem cell or a non-pluripotent stem cell.
- the cell can be any immune cell, including any T-cell such as tumor infiltrating cells (TILs), CD3+ T-cells, CD4+ T-cells, CD8+ T-cells, or any other type of T-cell.
- the T cell can also include memory T cells, memory stem T cells, or effector T cells.
- the T cells can also be skewed towards particular populations and phenotypes.
- the T cells can be skewed to phenotypically comprise CD45RO(-), CCR7(+), CD45RA(+), CD62L(+), CD27(+), CD28(+) and/or IL-7Rα(+).
- Suitable cells can be selected that comprise one or more markers selected from a list comprising: CD45RO(-), CCR7(+), CD45RA(+), CD62L(+), CD27(+), CD28(+) and/or IL-7Rα(+).
- Induced pluripotent stem cells can be generated from differentiated cells according to standard protocols described in, for example, U.S. Patent Nos. 7,682,828, 8,058,065, 8,530,238, 8,871,504, 8,900,871 and 8,791,248, the disclosures of which are herein incorporated by reference in their entirety for all purposes.
- the host cell is in vitro. In other embodiments, the host cell is ex vivo. In yet other embodiments, the host cell is in vivo.
- G.
- the present disclosure provides a method for modifying one or more target nucleic acids of interest at one or more target loci within a genome of a host cell, or within a heterologous or exogenous genome or DNA present in a host cell.
- the method comprises: (a) transforming the host cell with a vector of the present disclosure; and (b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a retron donor DNA-guide molecule comprising a retron transcript and a guide RNA (gRNA) molecule, wherein the retron transcript self-primes reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the one or more target loci and comprise sequence modifications compared to the one or more target nucleic acids, wherein the one or more target loci are cut by a nuclease expressed by the host cell or the transformed progeny of the host cell, wherein the site of nuclease cutting
- the host cell is capable of expressing the RT prior to transforming the host cell with the vector.
- the RT is encoded in a sequence that is integrated into the genome of the host cell.
- the RT is encoded in a sequence on a separate plasmid.
- the host cell is capable of expressing the RT at the same time as, or after, transforming the host cell with the vector.
- the RT is expressed from the vector.
- the RT is encoded in a sequence on a separate plasmid.
- the host cell is capable of expressing the nuclease (e.g., Cas9) prior to transforming the host cell with the vector.
- the nuclease is encoded in a sequence that is integrated into the genome of the host cell. In other instances, the nuclease is encoded in a sequence on a separate plasmid. In other embodiments, the host cell is capable of expressing the nuclease at the same time as, or after, transforming the host cell with the vector. In some instances, the nuclease is expressed from the vector. In other instances, the nuclease is encoded in a sequence on a separate plasmid.
- [0235] In some embodiments, the vector comprises a retron-gRNA cassette that, when transcribed, yields a retron transcript and gRNA that are physically coupled.
- the resulting donor DNA sequence within the msDNA and the gRNA can also be physically coupled.
- the retron transcript and gRNA subsequently become physically uncoupled (e.g., before or after reverse transcription of the retron transcript occurs).
- Physical uncoupling of the retron transcript and the gRNA can result from, for example, ribozyme cleavage (e.g., the retron-gRNA cassette also contains a ribozyme sequence).
- the resulting donor DNA sequence within the msDNA and the gRNA will be physically uncoupled (e.g., during genome editing and/or screening).
- the retron transcript and the gRNA are not initially physically coupled.
- the retron transcript and the gRNA are subsequently joined together.
- Transcription event(s) that result in the production of the retron transcript and/or gRNA can occur inside a host cell, outside of a host cell (e.g., followed by introduction of the retron transcript and/or gRNA into the host cell), or a combination thereof.
- the one or more target nucleic acids of interest are modified by a donor DNA sequence (e.g., within a msDNA) and a gRNA that are never physically coupled.
- the donor DNA sequence and the gRNA can be expressed from different cassettes (e.g., which are contained in the same vector or different vectors) and the donor DNA sequence and the gRNA can act in trans.
- the present disclosure provides a method for screening one or more genetic loci of interest in a genome of a host cell, the method comprising: (a) modifying one or more target nucleic acids of interest at one or more target loci within the genome of the host cell according to a method of the present disclosure; (b) incubating the modified host cell under conditions sufficient to elicit a phenotype that is controlled by the one or more genetic loci of interest; (c) identifying the resulting phenotype of the modified host cell; and (d) determining that the identified phenotype was the result of the modifications made to the one or more target nucleic acids of interest at the one or more target loci of interest.
- the target DNA can be analyzed by standard methods known to those in the art.
- indel mutations can be identified by sequencing using the SURVEYOR® mutation detection kit (Integrated DNA Technologies, Coralville, IA) or the Guide-it™ Indel Identification Kit (Clontech, Mountain View, CA).
- Homology-directed repair (HDR) can be detected by PCR-based methods, alone or in combination with sequencing or RFLP analysis.
- Non-limiting examples of PCR-based kits include the Guide-it Mutation Detection Kit (Clontech) and the GeneArt® Genomic Cleavage Detection Kit (Life Technologies, Carlsbad, CA). Deep sequencing can also be used, particularly for a large number of samples or potential target/off-target sites.
- editing efficiency can be assessed by employing a reporter or selectable marker to examine the phenotype of an organism or a population of organisms. In some instances, the marker produces a visible phenotype, such as the color of an organism or population of organisms.
- edits can be made that either restore or disrupt the function of metabolic pathways that confer a visible phenotype (e.g., a color) to the organism.
- a successful genome edit results in a color change in the target organism (e.g., because the edit disrupts a metabolic pathway that results in a color change or because the edit restores function in a pathway that results in a color change).
- the absolute number or the proportion of organisms or their progeny that exhibit a color change (e.g., an estimated or direct count of the number of organisms exhibiting a color change divided by the total number of organisms for which the genomes were potentially edited)
- the phenotype is examined by growing the target organisms and/or their progeny under conditions that result in a phenotype, wherein the phenotype may not be visible under ordinary growth conditions.
- growing yeast in a culture medium that is adenine deficient can lead to a particular phenotype (e.g., a color change) in yeast cells that possess a genetic defect in adenine synthesis.
- growing yeast cells in adenine-deficient media can allow one to discern the effect of genome edits that putatively target adenine biosynthesis loci.
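The proportion-based measure of editing efficiency described above (organisms showing the color change divided by all organisms whose genomes were potentially edited) can be sketched as follows. This is a minimal illustration only; the function name and the colony counts are hypothetical and do not come from the disclosure.

```python
def editing_efficiency(n_color_changed: int, n_total: int) -> float:
    """Fraction of potentially edited organisms that show the color change."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_color_changed <= n_total:
        raise ValueError("n_color_changed must lie between 0 and n_total")
    return n_color_changed / n_total


# Hypothetical plate count: 45 color-changed colonies out of 60 scored.
efficiency = editing_efficiency(45, 60)
print(f"editing efficiency: {efficiency:.2%}")
```

The same function applies whether the counts are direct or estimated, as long as numerator and denominator are scored on the same population.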
- the reporter or selectable marker is a fluorescent tagged protein, an antibody, a labeled antibody, a chemical stain, a chemical indicator, or a combination thereof.
- the reporter or selectable marker responds to a stimulus, a biochemical, or a change in environmental conditions.
- the reporter or selectable marker responds to the concentration of a metabolic product, a protein product, a synthesized drug of interest, a cellular phenotype of interest, a cellular product of interest, or a combination thereof.
- a cellular product of interest can be, as a non-limiting example, an RNA molecule (e.g., messenger RNA (mRNA), long non-coding RNA (lncRNA), microRNA (miRNA)).
- Editing efficiency can also be examined or expressed as a function of time. For example, an editing experiment can be allowed to run for a fixed period of time (e.g., 24 or 48 hours) and the number of successful editing events in that fixed time period can be determined. Alternatively, the proportion of successful editing events can be determined for a fixed period of time. Typically, longer editing periods will result in a larger number of successful editing events. Editing experiments or procedures can run for any length of time.
- a genome editing experiment or procedure runs for several hours (e.g., about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 hours). In other embodiments, a genome editing experiment or procedure runs for several days (e.g., about 1, 2, 3, 4, 5, 6, or 7 days).
- [0242] In addition to the length of time of the editing period, editing efficiency can be affected by the choice of gRNA, donor DNA sequence, the choice of promoter used, or a combination thereof.
- [0243] In other embodiments, editing efficiency is compared to a control efficiency.
- the control efficiency is determined by running a genome editing experiment in which the retron transcript and gRNA molecule are never physically coupled, or are initially coupled but subsequently become uncoupled. In some instances, the retron transcript and gRNA molecule are initially coupled and then become uncoupled (e.g., by ribozyme cleavage). In other instances, the retron-guide RNA (gRNA) cassette is configured such that the transcript products of the retron and gRNA coding region are never physically coupled. In yet other instances, the retron transcript and gRNA are introduced into the host cell separately.
- the methods and compositions of the present disclosure result in at least about a 1.3- to 3-fold (i.e., at least about a 1.3-, 1.4-, 1.5-, 1.6-, 1.7-, 1.8-, 1.9-, 2-, 2.1-, 2.2-, 2.3-, 2.4-, 2.5-, 2.6-, 2.7-, 2.8-, 2.9-, or 3-fold) increase in efficiency, compared to when the retron transcript and gRNA are not physically coupled during editing.
- at least about a 3- to 10-fold increase (i.e., at least about a 3-, 4-, 5-, 6-, 7-, 8-, 9-, or 10-fold increase)
- at least about a 10- to 100-fold increase (i.e., at least about a 10-, 20-, 30-, 40-, 50-, 60-, 70-, 80-, 90-, or 100-fold increase)
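The comparison to a control efficiency amounts to a simple ratio: efficiency with the coupled retron transcript and gRNA divided by efficiency in the uncoupled control. The sketch below is illustrative only. The function names and example efficiencies are hypothetical, and the bracket labels merely mirror the fold ranges recited above.

```python
def fold_increase(coupled_efficiency: float, control_efficiency: float) -> float:
    """Ratio of editing efficiency (coupled) to the control efficiency (uncoupled)."""
    if control_efficiency <= 0:
        raise ValueError("control efficiency must be positive")
    return coupled_efficiency / control_efficiency


def fold_bracket(fold: float) -> str:
    """Place a fold change into one of the ranges recited in the text."""
    if fold < 1.3:
        return "below 1.3-fold"
    if fold < 3:
        return "1.3- to 3-fold"
    if fold < 10:
        return "3- to 10-fold"
    if fold <= 100:
        return "10- to 100-fold"
    return "above 100-fold"


# Hypothetical efficiencies: 50% edited when coupled, 25% in the control.
fold = fold_increase(0.5, 0.25)
print(fold, fold_bracket(fold))
```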
- Editing efficiency can also be improved by performing editing experiments or procedures in a multiplex format.
- multiplexing comprises cloning two or more editing retron-gRNA cassettes in tandem into a single vector. In some instances, at least about 10 retron-gRNA cassettes (i.e., at least about 2, 3, 4, 5, 6, 7, 8, 9, or 10 retron-gRNA cassettes) are cloned into a single vector.
- [0245] In other embodiments, multiplexing comprises transforming a host cell with two or more vectors. Each vector can comprise one or multiple retron-gRNA cassettes. In some instances, at least about 10 vectors (i.e., at least about 2, 3, 4, 5, 6, 7, 8, 9, or 10 vectors) are used to transform an individual host cell.
- multiplexing comprises transforming two or more individual host cells, each with a different vector or combination of vectors.
- at least about 2 host cells (i.e., at least about 2, 3, 4, 5, 6, 7, 8, 9, or 10 host cells)
- between about 10 and 100 host cells (i.e., about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 host cells)
- between about 100 and 1,000 host cells (i.e., about 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 host cells)
- between about 1,000 and 10,000 host cells (i.e., about 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, or 10,000 host cells) are transformed.
- between about 10,000 and 100,000 host cells (i.e., about 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 45,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, or 100,000 host cells) are transformed.
- host cells (i.e., at least about 100,000, 150,000, 200,000, 250,000, 300,000, 350,000, 400,000, 450,000, 500,000, 550,000, 600,000, 650,000, 700,000, 750,000, 800,000, 850,000, 900,000, 950,000 or 1,000,000 host cells)
- more than about 1,000,000 host cells are transformed.
- multiple embodiments of multiplexing can be combined.
- [0247] By using one or a combination of the various multiplexing embodiments, it is possible to modify and/or screen any number of loci within a genome. In some instances, at least about 10 (i.e., about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10) genetic loci are modified or screened.
- loci are modified or screened.
- between about 100 and 1,000 genetic loci (i.e., about 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 1,000 genetic loci) are modified or screened.
- between about 1,000 and 100,000 genetic loci are modified or screened.
- the host cell comprises a population of host cells.
- one or more sequence modifications are induced in at least about 20 percent (i.e., at least about 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 percent) of the population of cells. In other instances, one or more sequence modifications are induced in at least about 50 percent (i.e., at least about 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 65, 70, 75, 80, 85, 90, 95, or 100 percent) of the population of cells.
- one or more sequence modifications are induced in at least about 75 percent (i.e., at least about 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 95, or 100 percent) of the population of cells.
- one or more sequence modifications are induced in at least about 90 percent (i.e., at least about 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 percent) of the population of cells.
- one or more sequence modifications are induced in at least about 95 percent (i.e., at least about 95, 96, 97, 98, 99, or 100 percent) of the population of cells.
- the precision of genome editing can correspond to the number or percentage of on-target genome editing events relative to the number or percentage of all genome editing events, including on-target and off-target events. Testing for on-target genome editing events can be accomplished by direct sequencing of the target region or other methods described herein.
- editing precision is at least about 80 percent (i.e., at least about 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 95, or 100 percent), meaning that at least about 80 percent of all genome editing events are on-target editing events.
- editing precision is at least about 90 percent (i.e., at least about 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 percent), meaning that at least about 90 percent of all genome editing events are on-target editing events.
- editing precision is at least about 95 percent (i.e., at least about 95, 96, 97, 98, 99, or 100 percent), meaning that at least about 95 percent of all genome editing events are on-target editing events.
- editing precision is at least about 99 percent (i.e., at least about 99 or 100 percent), meaning that at least 99 percent of all genome editing events are on-target editing events.
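Editing precision as defined above is the share of on-target events among all editing events (on-target plus off-target). A minimal sketch, with hypothetical function names and event counts, that also reports which of the recited thresholds (80, 90, 95, 99 percent) a given result satisfies:

```python
def editing_precision(on_target: int, off_target: int) -> float:
    """On-target editing events divided by all editing events."""
    total = on_target + off_target
    if on_target < 0 or off_target < 0 or total == 0:
        raise ValueError("counts must be non-negative and not both zero")
    return on_target / total


def thresholds_met(on_target: int, off_target: int,
                   thresholds=(80, 90, 95, 99)) -> list:
    """Percent thresholds satisfied; integer arithmetic avoids rounding issues."""
    total = on_target + off_target
    if on_target < 0 or off_target < 0 or total == 0:
        raise ValueError("counts must be non-negative and not both zero")
    return [t for t in thresholds if on_target * 100 >= t * total]


# Hypothetical sequencing result: 95 on-target and 5 off-target events.
print(editing_precision(95, 5), thresholds_met(95, 5))
```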
- compositions and methods of the present disclosure are suitable for any disease that has a genetic basis and is amenable to prevention or amelioration of disease-associated sequelae or symptoms by editing or correcting one or more genetic loci that are linked to the disease.
- Non-limiting examples of diseases include X-linked severe combined immune deficiency, sickle cell anemia, thalassemia, hemophilia, neoplasia, cancer, age-related macular degeneration, schizophrenia, trinucleotide repeat disorders, fragile X syndrome, prion-related disorders, amyotrophic lateral sclerosis, drug addiction, autism, Alzheimer’s disease, Parkinson’s disease, cystic fibrosis, blood and coagulation diseases and disorders, inflammation, immune-related diseases and disorders, metabolic diseases and disorders, liver diseases and disorders, kidney diseases and disorders, muscular/skeletal diseases and disorders, neurological and neuronal diseases and disorders, cardiovascular diseases and disorders, pulmonary diseases and disorders, and ocular diseases.
- compositions and methods of the present disclosure can also be used to prevent or treat any combination of suitable genetic diseases.
- the subject is treated before any symptoms or sequelae of the genetic disease develop.
- the subject has symptoms or sequelae of the genetic disease.
- treatment results in a reduction or elimination of the symptoms or sequelae of the genetic disease.
- treatment includes administering compositions of the present disclosure directly to a subject.
- pharmaceutical compositions of the present disclosure can be delivered directly to a subject (e.g., by local injection or systemic administration).
- compositions of the present disclosure are delivered to a host cell or population of host cells, and then the host cell or population of host cells is administered or transplanted to the subject.
- the host cell or population of host cells can be administered or transplanted with a pharmaceutically acceptable carrier.
- editing of the host cell genome has not yet been completed prior to administration or transplantation to the subject.
- editing of the host cell genome has been completed when administration or transplantation occurs.
- progeny of the host cell or population of host cells are transplanted into the subject.
- correct editing of the host cell or population of host cells, or the progeny thereof is verified before administering or transplanting edited cells or the progeny thereof into a subject.
- compositions of the present disclosure including cells and/or progeny thereof that have had their genomes edited by the methods and/or compositions of the present disclosure, may be administered as a single dose or as multiple doses, for example two doses administered at an interval of about one month, about two months, about three months, about six months or about 12 months. Other suitable dosage schedules can be determined by a medical practitioner.
- Prevention or treatment can further comprise administering agents and/or performing procedures to prevent or treat concomitant or related conditions. As non-limiting examples, it may be necessary to administer drugs to suppress immune rejection of transplanted cells, or prevent or reduce inflammation or infection.
- kits for modifying one or more target nucleic acids of interest at one or more target loci within a genome of a host cell, or within a heterologous or exogenous genome or DNA present in a host cell, the kit comprising one or a plurality of vectors or retron-guide RNA (gRNA) cassettes of the present disclosure.
- the kit may further comprise a host cell or a plurality of host cells that are recombinantly modified by the vectors or retron-guide RNA (gRNA) cassettes of the present disclosure.
- the kit contains one or more reagents.
- the reagents are useful for transforming a host cell with a vector or a plurality of vectors, and/or inducing expression from the vector or plurality of vectors.
- the kit may further comprise a reverse transcriptase, a plasmid for expressing a reverse transcriptase, one or more nucleases, one or more plasmids for expressing one or more nucleases, or a combination thereof.
- the kit may further comprise one or more reagents useful for delivering nucleases or reverse transcriptases into the host cell and/or inducing expression of the reverse transcriptase and/or the one or more nucleases.
- the kit further comprises instructions for transforming the host cell with the vector, introducing nucleases and/or reverse transcriptases into the host cell, inducing expression of the vector, reverse transcriptase, and/or nucleases, or a combination thereof.
- the present disclosure provides a kit for modifying one or more target nucleic acids of interest at one or more target loci in a host cell, the kit comprising one or a plurality of retron donor DNA-guide molecules of the present disclosure.
- the kit may further comprise a host cell or a plurality of host cells comprising genetic modifications introduced by the retron donor DNA-guide molecules of the present disclosure.
- the kit contains one or more reagents.
- the reagents are useful for introducing the retron donor DNA-guide molecule or plurality thereof into the host cell.
- the kit may further comprise a reverse transcriptase, a plasmid for expressing a reverse transcriptase, one or more nucleases, one or more plasmids for expressing one or more nucleases, or a combination thereof.
- the kit may further comprise one or more reagents useful for delivering into the host cell reverse transcriptases and/or nucleases and/or inducing expression of the reverse transcriptase and/or the one or more nucleases.
- the kit further comprises instructions for introducing the retron donor DNA-guide molecule or plurality thereof into the host cell, introducing nucleases and/or reverse transcriptases into the host cell, inducing expression of the reverse transcriptase and/or nucleases, or a combination thereof.
J. Applications
- The compositions and methods provided by the present disclosure are useful for any number of applications. As non-limiting examples, genome editing or screening according to the compositions and methods of the present disclosure can be used for cell lineage tracking or the measurement of RNA abundance, or to track the relative abundance of cells targeted by a mixture of edits in parallel. For example, the insertion of barcodes described herein can be used for cell lineage tracking or the measurement of RNA abundance.
- genome editing or screening according to the compositions and methods of the present disclosure can be used in high-throughput precision editing genetic screens to 1) improve industrial microbial growth; 2) select strains for improving crop yield; 3) track edited cell populations used for medical treatments in vitro or in vivo; and 4) track edited cell populations used in cell therapy.
- genome editing according to the compositions and methods of the present disclosure can be performed to correct detrimental lesions in order to prevent or treat a disease, or to identify one or more specific genetic loci that contribute to a phenotype, disease, biological function, and the like.
- genome editing or screening according to the compositions and methods of the present disclosure can be used to improve or optimize a biological function, pathway, or biochemical entity (e.g., protein optimization).
- optimization applications are especially suited to the compositions and methods of the present disclosure, as they can require the modification of a large number of genetic loci and subsequently assessing the effects.
- Other non-limiting examples of applications suitable for the compositions and methods of the present disclosure include the production of recombinant proteins for pharmaceutical and industrial use, the production of various pharmaceutical and industrial chemicals, the production of vaccines and viral particles, and the production of fuels and nutraceuticals. All of these applications typically involve high-throughput or high-content screening, making them especially suited to the compositions and methods of the present disclosure.
- inducing one or more sequence modifications at one or more genetic loci of interest comprises substituting, inserting, and/or deleting one or more nucleotides at the one or more genetic loci of interest. In some instances, inducing the one or more sequence modifications results in the insertion of one or more sequences encoding cellular localization tags, one or more synthetic response elements, and/or one or more sequences encoding degrons into the genome.
- [0265] In other embodiments, inducing the one or more sequence modifications at the one or more genetic loci of interest results in the insertion of one or more sequences from a heterologous genome. Introducing heterologous DNA sequences into a genome is useful for any number of applications, some of which are described herein.
- Non-limiting examples are directed protein evolution, biological pathway optimization, and production of recombinant pharmaceuticals.
EXAMPLES
- [0266] The following example provides representative methods for performing an exemplary embodiment of the disclosure. The example demonstrates that the methods of the disclosure can be used for high-throughput genome editing.
- [0267] Introduction
- [0268] An important issue in understanding complex traits is the phenomenon of gene-by-environment (GxE) interactions, wherein a genetic variant’s effect is dependent on the environment an organism is exposed to 1.
- QTL mapping uses genetic crosses between strains to create diverse progeny through recombination to calculate statistical signals that associate with environmental response 8–11.
- QTL quantitative trait locus
- reverse genetic approaches such as constructing knockout libraries and measuring their effects on growth have single-gene resolution, and have been invaluable sources of information about the functions of genes in various organisms and their genetic interactions.
- CRISPEY Cas9 Retron precISe Parallel Editing via homologY
- RT bacterial retron reverse transcriptase
- msDNA multi-copy, single-stranded DNA
- this design has improved statistical power to detect fitness effects by incorporating unique molecular identifiers (UMIs), as well as the ability to maintain strain barcodes in non-selective media, which allows both assaying and detecting GxE effects of thousands of individual genetic variants in any growth condition.
- This approach allows natural variants throughout the genome to be surveyed in any condition, providing the ability to decipher the precise genetic basis and molecular mechanisms giving rise to complex traits.
- CRISPEY-BAR was used to measure the effects of 4184 natural variants segregating in yeast (Saccharomyces cerevisiae) across a variety of conditions. 548 variants underlying variation in growth in these environments were identified. Importantly, resolution of the measurements can differentiate the effects of variants even when they are tightly clustered in the genome, as well as different alleles at the same genomic position. This single-nucleotide resolution of GxE interactions not only allows exploration of the natural landscape of complex traits, but also provides direct mechanistic insights into phenotypic evolution 14,19. More generally, the methods provide a paradigm for studying genetic variants and their environmental interactions at unprecedented resolution and throughput via multiplexed precision genome editing.
- CRISPEY-BAR enables high-resolution mapping of genotype to phenotype relationships
- CRISPEY-BAR is a scalable system for measuring the effects of precise genome edits by tracking an associated genomic barcode (Fig.1a). As described in a previous report, CRISPEY uses a single guide/donor pair to make one precise edit per cell, and in a pooled assay, measures the change in abundance of each guide/donor pair post-editing through high-throughput sequencing of plasmids (Fig.1b) 18.
- a new vector design was developed incorporating two consecutive retron-guide cassettes flanked by three self-cleaving ribozymes, allowing simultaneous generation of two guide/donor pairs for making two precise edits in the same cell 20 (Fig.1a, Fig.6).
- the different ribozymes prevent unwanted recombination events during pooled cloning and co-transcriptionally separate the two retron-guide RNAs for processing by retron reverse transcriptase (RT).
- CRISPEY-BAR implements a dual-edit design to simultaneously 1) integrate a unique genomic barcode and 2) make a precise variant edit of interest.
- Each variant editing guide/donor pair is associated with a unique barcode, which can be used to track change in the abundance of cells edited by a specific guide/donor pair (Fig.1c).
- UMIs were linked to each barcode to serve as biological replicates for pooled-editing and growth competition (Fig.1c).
- CRISPEY-BAR was designed to measure the fitness effect of each variant with at least two guide/donor pairs, six UMIs and three pooled competition replicates (Fig.1c, Fig.7).
- the barcode is genomically-integrated, no maintenance of an ectopic vector is needed post-editing, and 1:1 stoichiometric measurement of edited strains can be achieved through multiplexed sequencing of barcode amplicons (Fig.1d).
- the barcode was designed to be covered by 76-base short-read sequencing to minimize sequencing costs and run-time, instead of resequencing the plasmid with 300-base paired-end reads to re-identify guide-donor pairs (Fig.8).
- This sequencing design uses primers that are specific to the barcode-integrated genomic locus, therefore sequencing only the barcoded strains (Fig. 8).
- Selective detection of the integrated barcode edit guarantees the edited cell expresses functional Cas9 and retron components, as well as endogenous cellular factors that facilitate HDR. This strategy allowed for enrichment of strains likely containing variant edits, which is crucial for high-throughput screens.
- Fig.1i see also Methods
- CRISPEY-BAR is highly efficient in precision editing and allows massively parallel tracking of variant fitness effects using the dual-edit design.
- Detection of natural variants affecting fitness within QTLs reveals hidden genetic complexity
- variants were first characterized within regions likely to be enriched for effects on growth in response to stress conditions, in which the yeast pool has slower growth overall.
- a total of 36 genomic regions overlapping QTLs for growth of segregants derived from 16 diverse parental strains were measured in three stress conditions: fluconazole (FLC), cobalt chloride (CoCl2) and caffeine (CAFF) (Fig.2a) 8.
- the library could be enriched for variants impacting fitness in these stress conditions (Fig.2a) 7. Three oligonucleotide pools (corresponding to variants to be assayed in fluconazole, cobalt chloride, and caffeine) were designed for pooled cloning into three separate CRISPEY-BAR libraries, which were then used for pooled editing (see Methods). After plasmid removal, the edited yeast were subjected to pooled growth competitions in synthetic complete media as well as in each corresponding stress condition, and changes in barcode abundance across roughly 25 generations were tracked (Fig.2b, Fig.7).
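Tracking changes in barcode abundance across a competition can be illustrated with a generic log-ratio calculation. The text here does not state the fitness formula actually used, so the sketch below is an assumption: it normalizes read counts to frequencies and reports the log2 change in frequency per generation. All function names, barcode labels, and counts are hypothetical.

```python
import math


def barcode_frequencies(counts: dict) -> dict:
    """Convert raw barcode read counts to relative frequencies."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("no reads in sample")
    return {barcode: n / total for barcode, n in counts.items()}


def log2_change_per_generation(start_counts: dict, end_counts: dict,
                               generations: float) -> dict:
    """Per-generation log2 change in frequency for each barcode.

    Barcodes with zero reads at either time point are skipped rather than
    assigned a pseudocount (a simplification made for this sketch).
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    start = barcode_frequencies(start_counts)
    end = barcode_frequencies(end_counts)
    scores = {}
    for barcode, f0 in start.items():
        f1 = end.get(barcode, 0.0)
        if f0 > 0 and f1 > 0:
            scores[barcode] = math.log2(f1 / f0) / generations
    return scores


# Hypothetical read counts before and after roughly 25 generations.
start = {"BC_A": 100, "BC_B": 100}
end = {"BC_A": 150, "BC_B": 50}
scores = log2_change_per_generation(start, end, generations=25)
for barcode, score in scores.items():
    print(barcode, round(score, 4))
```

A positive score indicates a barcode that rose in frequency during the competition and a negative score one that fell. In practice, replicate UMIs and competitions would be combined statistically, which this sketch does not attempt.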
- all pairwise comparisons between the relative fitness measurements for each variant were performed in each condition to see if the effects on growth were significantly different (Fig.3d,e).
- two identical competitions in SC media were performed and variants tested for GxE interactions between them.
- CRISPEY-BAR allows measuring more than one variant at the same genomic locus for multiallelic loci within the ergosterol pathway, which highlights the resolution and specificity of the measurements.
- the other variant was a synonymous variant with no effects on fitness.
- variants with significant effects in more than one condition can be grouped into two categories: 1) those with significant fitness effects in only one direction (Fig.5b) and 2) those with significant fitness effects in opposite directions, which is referred to as “sign GxE” (Fig. 5c).
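The two categories above can be expressed as a small classification rule: among the conditions where a variant has a significant effect, either all effects share one sign, or both signs occur ("sign GxE"). The sketch below is illustrative; the function name, input layout, and effect values are hypothetical.

```python
def classify_pleiotropy(effects: dict) -> str:
    """Classify a variant from {condition: (fitness_effect, is_significant)}."""
    significant = [effect for effect, is_sig in effects.values()
                   if is_sig and effect != 0]
    if len(significant) < 2:
        return "fewer than two significant conditions"
    has_positive = any(effect > 0 for effect in significant)
    has_negative = any(effect < 0 for effect in significant)
    if has_positive and has_negative:
        return "sign GxE"
    return "same direction"


# Hypothetical effect sizes for one variant in three conditions.
variant_effects = {
    "lovastatin": (-0.04, True),
    "caffeine": (0.02, True),
    "SC": (0.001, False),
}
print(classify_pleiotropy(variant_effects))
```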
- the pleiotropic variant exhibiting sign GxE at chr7: 472522 C>A was located in a canonical Rpn4p binding site 33 (Fig.5e top, bottom left).
- This variant's strongest effect was a significant fitness decrease in lovastatin.
- Because Rpn4p is a transcriptional activator, it was hypothesized that the disruption of the Rpn4p binding site might decrease ERG4 expression.
- RT-qPCR was used to measure expression of ERG4 in a genotyped strain carrying chr7: 472522 C>A, and its expression was found to be decreased relative to the wildtype strain (Fig.5e bottom right).
- CRISPEY-BAR was able to survey thousands of natural variants and identify the variants affecting fitness at the nucleotide-level, directly leading to discovery of molecular mechanisms of GxE interactions.
- The CRISPEY-BAR strategy and its applications provide a solution to rapidly discover natural genetic variants impacting a complex trait. As a proof of principle, 548 variants with significant effects on growth were identified within QTLs, as well as across a core metabolic pathway.
- CRISPEY-BAR is highly efficient in precise editing.
- the RT was shown in CRISPEY to be effective in production of msDNA as DNA donors for precision editing 18.
- the inventors have since tested additional retron RTs in CRISPEY, showing higher efficiency in yeast, as well as editing activity in human cells 34. While this study applied only SpCas9, whose ‘NGG’ PAM requirement limits the variants that can be targeted, nucleases with alternative PAMs can be substituted for SpCas9 to target additional variants 35–37.
- the CRISPEY-BAR approach has an efficient guide for barcoding, while the variant editing guide can have a range of efficiency.
- This caveat can be overcome by applying CRISPEY-BAR to additional strains of budding yeast to capture not only the effects of variants within one lab strain, but also the effect of genetic background.
- the CRISPEY-BAR design also allows for additional ribozymes and CRISPEY cassettes to be incorporated.
- a single barcode-insertion cassette plus two or more variant editing cassettes can be expressed in the same transcript, allowing simultaneous editing of two genetic variants of choice and integration of a variant-pair specific barcode.
- gene-by-gene (epistatic) interactions can be observed, as well as gene-by-gene-by-environment (GxGxE) interactions that govern the crosstalk between gene networks and the environment 38–40 .
- the traits include growth in: 'Cobalt_Chloride;2mM;2’, 'Caffeine;15mM;2' and 'Fluconazole;100uM;2', and we refer to these traits as ‘stress conditions’ 11 .
- For the ergosterol pool, all non-reference alleles from yeast natural variants that were within ±500 bp of the coding region of the selected ergosterol pathway genes were included 4 .
- the guides and donors selected for CRISPEY editing were designed as described, with the following parameters or modifications 18 :
- 1. the alternative allele is within -6 to -1 and +1 to +2 positions of the guide target and PAM sequences;
- 2. the donor template is 108 bp in length with asymmetric homology arms, 40 bp for the 5’ arm and 68 bp for the 3’ arm;
- Variants were included if two or more guides were found for a given variant.
- the resulting msDNA donor will have a shorter 3’ homology arm and a longer 5’ arm flanking the variant, an arrangement reported to have higher HDR efficiency when ssDNA is used as the repair donor 41 .
- the donors were further filtered to exclude SphI, AscI and NotI restriction sites used in the cloning process, as well as keeping a minimum of 30 bp homology arm 5’ of variant and 55 bp 3’ of homology arm in the donor template.
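The donor filter described above can be sketched as follows. This is an illustrative reading of the text, not the pipeline actually used: the function name and the `variant_index` argument are hypothetical, and the sketch assumes a single-nucleotide variant at a 0-based position. The recognition sequences are the standard ones for SphI, AscI and NotI; all three are palindromic, so scanning one strand covers both.

```python
# Recognition sites of the enzymes used in cloning (all palindromic).
RESTRICTION_SITES = {"SphI": "GCATGC", "AscI": "GGCGCGCC", "NotI": "GCGGCCGC"}


def donor_passes_filters(donor, variant_index, min_5p=30, min_3p=55):
    """Return True if the donor has no cloning site and long enough arms.

    donor: donor template sequence (5' to 3').
    variant_index: 0-based position of the edited base (assumed single-base variant).
    """
    seq = donor.upper()
    if any(site in seq for site in RESTRICTION_SITES.values()):
        return False
    arm_5p = variant_index                 # bases 5' of the variant
    arm_3p = len(seq) - variant_index - 1  # bases 3' of the variant
    return arm_5p >= min_5p and arm_3p >= min_3p


# A 108 bp toy donor with a 40 bp 5' arm and a 67 bp 3' arm around the variant.
toy_donor = "A" * 40 + "T" + "A" * 67
print(donor_passes_filters(toy_donor, 40))
```

Variants with insertions or deletions would need the arm lengths computed from the variant's start and end positions instead of a single index.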
- the resulting output is 250 bp per oligo, consisting of 5’ homology to the pSAC200 CRISPEY-BAR vector, a 12 bp programmed barcode, a restriction site region for cloning, a 108 bp donor template sequence, a 34 bp constant region, a 20 bp guide sequence and 3’ homology to the pSAC200 CRISPEY-BAR vector (Fig.6).
- the general sequence is: 5’- GTTGCAGTTAGCTAACAGGCCATGCNNNNNNNNNNGCATGCAGCGGCCGCAG GCGCGCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNN NNNN NNNN NNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
- Barcodes were designed using a custom script implementing a quaternary Hamming(12,8) code based on the encoding scheme described in a previous study 42 . This encoding scheme generates DNA barcodes with a minimum Hamming distance of 3, allowing for error correction of 1 bp mutations or DNA sequencing errors.
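Why a minimum Hamming distance of 3 permits single-base correction can be shown with a short sketch. It is not the custom script mentioned above and does not implement the quaternary Hamming(12,8) encoder; it swaps in a brute-force nearest-barcode lookup, and the three 12 nt barcodes are made-up examples.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(barcodes):
    """Smallest pairwise Hamming distance in a barcode set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def correct(read, barcodes):
    """Return the designed barcode within one mismatch of the read, else None.

    With a minimum pairwise distance of 3, at most one designed barcode
    can lie within distance 1 of any read, so the match is unambiguous.
    """
    hits = [b for b in barcodes if hamming(read, b) <= 1]
    return hits[0] if len(hits) == 1 else None


designed = ["AAAAAAAAAAAA", "AAAAAAAAACCC", "CCCAAAAAAAAA"]  # hypothetical barcodes
print(min_distance(designed))
print(correct("AAAAAAAAAACC", designed))
```

A read carrying two or more errors is returned as `None` here rather than being assigned to a barcode.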
- Oligonucleotides were first amplified with Q5 polymerase (NEB) with 1 uM primer #615 in 50 uL reaction following manufacturer instructions and initial denaturation of 98°C for 2 min, and then 5 cycles of 98°C for 10 s and 65°C for 30 s, followed by 25 cycles of 98°C for 10 s and 69°C for 40 s, then final extension of 72°C for 2 min.
- PCR products were then purified with 45 uL nucleoMAG NGS beads (hereafter, “beads”) (Takara) and eluted with 20 uL water. 2 uL of the first round PCR product was further amplified with Q5 polymerase (NEB) with 1 uM primer #615 and #576 in a 50 uL reaction following manufacturer instructions, with initial denaturation of 98°C for 2 min, then 15 cycles of 98°C for 10 s and 69°C for 30 s, then final extension of 72°C for 2 min. Second round PCR products were then purified with 45 uL beads and eluted with 20 uL Tris pH 8.0.
- the pooled oligos were amplified with Q5 polymerase (NEB) with 1 uM primer #617 and #337-343 in 50 uL reaction following manufacturer instructions and initial denaturation of 98°C for 2 min, and then 15 cycles of 98°C for 10 s and 69°C for 40 s, then final extension of 72°C for 2 min, followed by purification using 45 uL beads and indexing PCR using Illumina dual-indexing primers.
- the indexed amplicons corresponding to each pool were then sequenced by MiSeq using reagent kit v2 Nano to obtain paired-end 150bp reads that are mapped to the designed oligonucleotides.
- the assembled products were purified by beads and eluted in 10 uL water. 3 uL of the assembled products were used for electroporation with 27 uL Endura Electrocompetent cells for CRISPR DUO (Lucigen). Two electroporation reactions were performed for each pool following manufacturer instructions and recovered in SOC media (Lucigen) for 25 min at 37°C and plated to a single 15 cm LB agar plate with Carbenicillin (GoldBio). A serial dilution of the recovered bacteria was plated to estimate colony forming units (cfu), and all pools contained more than 500,000 cfu.
- the transformants were incubated for 22 hr at 32°C and the resulting bacterial lawn was collected for storage in LB with 10% glycerol at -80°C. Half of the collected transformant stock was used for plasmid extraction using Nucleobond Xtra Midi Plus (Macherey-Nagel) and eluted as “post-Gibson” plasmid pools, yielding 105-120 ug of plasmid DNA.
- PCR was performed with 1 uM of each primer as manufacturer instructions, and initial denaturation of 98°C for 3 min, and then 35 cycles of 98°C for 10 s, 66°C for 30 s, 72°C for 40 s; then final extension of 72°C for 2 min.
- the ligation product was purified by beads and eluted in 30 uL water. 3 uL of the purified ligation products were used for electroporation with 27 uL Endura Electrocompetent cells for CRISPR DUO (Lucigen). Two electroporation reactions were performed for each pool, one reaction with ligation insert and the other without insert as negative control. Electroporation was performed following manufacturer instructions and recovered in SOC media (Lucigen) for 30 min at 37°C and the with-insert ligations were plated to two 15 cm LB agar plates with Carbenicillin (GoldBio) at 32°C for 22 hr.
- a serial dilution of the recovered bacteria from both with- and without-insert ligations was plated to estimate cfu, and all pools contained more than 1,000,000 cfu, corresponding to at least 2,500x coverage for each oligonucleotide on average within each pool.
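The relation between colony counts and library coverage is simple arithmetic, sketched below. The function names are hypothetical, and the pool size of 400 oligonucleotides is only what the two stated figures (1,000,000 cfu and 2,500x) imply as an upper bound; the actual pool sizes are not given in this passage.

```python
def mean_coverage(cfu, n_oligos):
    """Average number of independent transformants per designed oligonucleotide."""
    return cfu / n_oligos


def max_pool_size(cfu, coverage):
    """Largest pool that still reaches the requested average coverage."""
    return cfu // coverage


print(max_pool_size(1_000_000, 2_500))  # upper bound implied by the stated numbers
print(mean_coverage(1_000_000, 400))
```

Average coverage says nothing about evenness: individual oligonucleotides can still be under-represented, which is why the sequencing check of the pools matters.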
- Ligation plates were incubated at 32°C for 22 hr, and transformants were stored in LB with 10% glycerol.
- Ligated plasmids were extracted from one fourth of the collected bacteria from each pool using Nucleobond Xtra Midi Plus (Macherey-Nagel) and eluted as “post-ligation” plasmid pools, yielding 160-240 ug of plasmid DNA per reaction.
- yeast transformant pools were selected on YNB -histidine -uracil 2% glucose (1.7g/L yeast nitrogen base (RPI); 5 g/L Ammonium Sulfate (ACROS organics); 1.9 g Dropout synthetic mix minus histidine, uracil w/o nitrogen base (US Biological) and 20 g/L glucose (Sigma)) 2% agar plates and stored in YNB -histidine -uracil 2% glucose media with 15% glycerol at -80°C.
- Cells were harvested from the last galactose media growth and stored in YNB -histidine -uracil 2% glucose media with 15% glycerol at -80°C.
- the plasmid-cured cells were collected and stored in YNB 2% glucose media with 15% glycerol at -80°C. [0319] Pooled competition [0320] Pooled competitions were carried out in 1 L baffled flasks in YNB 2% glucose (SC, hereafter) media with or without specified conditions (Fig.7). The concentration of each drug/salt was titrated to approximately 5 generations of growth of the ZRS111 strain every 12 hr, indicating overall decreased fitness in each condition to apply consistent growth stress to cells. In contrast, for SC media only, there are approximately 5 generations of growth of the ZRS111 strain in 8 hr.
- Genomic DNA was eluted in 200 uL per sample, further digested with 1 uL RNaseA and quantified by Qubit dsDNA HS assay (Invitrogen). 10 ug of genomic DNA was amplified in 400 uL Q5 polymerase (NEB) PCR reaction with 1 uM forward primer #261 and 1 uM reverse primer equimolar mix of primers #327- #334 (Fig.8).
- PCR was performed following manufacturer’s instructions, with 1M Betaine and initial denaturation of 98°C for 2 min, then 19 cycles of 98°C for 10 s, 65°C for 20 s; then extension at 72°C for 5 min. 100 uL of first round of PCR products were purified using 100 uL beads and 15 uL of the purified amplicons were further indexed by 50 uL Q5 polymerase (NEB) PCR reaction following manufacturer’s instructions with 1 uM equimolar mix of indexing primers for Illumina sequencing, and initial denaturation of 98°C for 2 min, then 8 cycles of 98°C for 10 s, 70°C for 20 s; then extension at 72°C for 2 min.
- the indexed amplicons were purified with 50 uL beads, eluted in 100 uL water and quantified by Qubit dsDNA HS assay (Invitrogen).
- the purified, indexed amplicons from six time point samples for the three replicates per competition were mixed equimolar and purified by SizeSelect II gel (Invitrogen) for ~300 bp product.
- the size selected libraries were then purified by beads and submitted for paired-end sequencing on NextSeq 550 using custom read1 primer #354, with custom cycles of 12 cycles for read1, 8 + 8 cycles for dual indices and 64 cycles for read2 using a 1 x 75 bp High-Output Kit (Fig.8). Data available at PRJNA827354.
- Fluconazole Ecological Enrichment Test To test whether strains from particular ecological origins were enriched for variants with significant effects in a particular direction in fluconazole, we first split the variants with significant fitness effects in fluconazole into positive and negative effect variants. We then checked for each strain in the 1,011 yeast genomes if they were homozygous or heterozygous for the alternate allele we edited in at each significant variant.
- for positive effect alleles, strains with the alternate allele had 1 added to their score, and for negative effect alleles, strains with the alternate allele had 1 subtracted from their score.
- the total number of negative effect variants was added to this score for all strains, as any strain with the reference allele for those sites in effect had the positive effect allele.
- the 1,011 yeast strains were then sorted by this score, and the top 50 were chosen to look at their ecological origins, as they were presumably the strains with the most evidence for being under selection for increased growth in fluconazole.
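The scoring scheme described above can be sketched as follows. This is a reconstruction from the text, not the authors' script: the function names, strain labels and variant IDs are hypothetical, and a strain is treated as carrying the alternate allele whether it is homozygous or heterozygous for it.

```python
def strain_scores(genotypes, effects):
    """Score each strain by how many growth-favoring alleles it carries.

    genotypes: dict strain -> set of variant IDs where the strain carries
               the alternate allele (homozygous or heterozygous).
    effects:   dict variant ID -> +1 (positive effect) or -1 (negative effect).
    """
    # Every strain starts at the number of negative-effect variants, because
    # carrying the reference allele at those sites is the favorable state.
    n_negative = sum(1 for sign in effects.values() if sign < 0)
    scores = {}
    for strain, alt_variants in genotypes.items():
        score = n_negative
        for variant in alt_variants:
            if variant in effects:
                score += 1 if effects[variant] > 0 else -1
        scores[strain] = score
    return scores


def top_strains(scores, n=50):
    """Strains with the highest scores (ties broken by name)."""
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]


effects = {"v1": 1, "v2": -1, "v3": -1}
genotypes = {"A": {"v1"}, "B": {"v2", "v3"}, "C": set()}
scores = strain_scores(genotypes, effects)
print(scores, top_strains(scores, n=2))
```

The score therefore equals the number of sites at which a strain carries the allele associated with increased growth in fluconazole.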
- the resulting PCR products were bead purified and cloned into pSAC200, ligated with UMI-containing insert and transformed into yeast as described for library cloning above.
- the yeast transformants were induced for editing by culturing in 5 mL YNB -HIS -URA 2% raffinose media for 24 hr, passaged twice in 5 mL YNB -HIS -URA 2% galactose media for 24 hr each, then streaked out on YNB -URA 2% glucose (1.7g/L yeast nitrogen base (RPI); 5 g/L Ammonium Sulfate (ACROS organics); 1.9 g Dropout synthetic mix minus uracil, w/o nitrogen base (US Biological) and 20 g/L glucose (Sigma)) 2% agar plates to obtain single edited clones.
- plasmids were cured from edited clones by restreaking on YNB 2% glucose 2% agar plates with 1 g/L 5-Fluoroorotic acid monohydrate (GoldBio). The single plasmid-cured colonies were amplified by growing in YNB 2% glucose media overnight and stored in YNB 2% glucose media with 15% glycerol at -80°C. [0339] Colonies were streaked out from the frozen stock and lysed with Zymolyase 20T (US Biological) solution in 50 mM potassium phosphate buffer, pH 7.5.
- PCR cycles had an initial denaturation of 95°C for 2 min; then 35 cycles of 95°C for 10 s, 60°C for 15 s, 72°C for 20 s; then a final extension of 72°C for 5 min.
- PCR products were purified, Sanger sequenced and aligned to the reference genome using SGD BLAST to confirm the intended genotype 50,51 .
- Genomic amplicons of loci containing the associated variant edit were Sanger sequenced from barcoded colonies to calculate the editing rates shown in Fig.1d.
- qRT-PCR [0341] Strains containing the Sanger sequencing-verified genotypes were thawed from frozen stock and grown overnight in 5 mL YNB 2% glucose media. 0.5 mL of the overnight culture was passaged to 50 mL YNB 2% glucose media with or without 30 mg/mL lovastatin. Cells were harvested after 5 generations of growth in media, approximately 12 hr after passaging.
- T1-T6 Harvested cells were spun down and resuspended in 1x DPBS (Gibco) and stored at 4°C and assayed by flow cytometry within 12 hr post-harvest. Generation time was estimated by measuring OD 600 of the culture containing ZRS111 and GFP control strain at every time point. Competition for each edited strain against GFP control strain was replicated four times in four different wells, to control for spontaneous mutation during competition.
- Ratios between each edited strain against GFP control strain were determined by flow cytometry assay, using an Attune NxT Flow Cytometer and Autosampler (ThermoFisher Scientific). GFP was detected using a 530 nm band-pass filter (BL1) with a 488 nm laser. The channel voltages were adapted from a previous study and set as follows: FSC: 200; SSC: 320; and BL1: 480 41 . A threshold for FSC of 2.5 x 10^3 A.U. was applied to exclude non-yeast events. Data analysis was performed using Attune NxT Software v2.7.
- Doublets were removed by FSC gating and cell counts for GFP control strain were determined by BL1 gating and the remaining cells were counted as the non-fluorescent, corresponding to edited strains. Samples with fewer than 500 total cells gated, as well as samples with cell counts of less than 3 for either GFP or edited strains, were excluded. Log2 ratios between edited strain count and GFP control strain count were calculated for each sample and fitted to a slope for the estimated generations within each replicate. The slopes were normalized by subtracting the slope calculated by the competition of a non-variant edit, barcode-only control to the GFP control strain in the same replicate. Finally, the mean and standard error for slopes across four replicates were calculated for each edited strain, representing pairwise fitness values.
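The pairwise fitness calculation described above can be sketched with the standard library alone. This is an illustrative reading of the text, not the analysis code used in the study: the function names and the example counts are hypothetical, the slope is an ordinary least-squares fit, and the exclusion thresholds (500 total cells, 3 cells per strain) are taken from the description.

```python
import math
from statistics import mean, stdev


def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    if len(xs) < 2:
        raise ValueError("need at least two time points to fit a slope")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def replicate_fitness(generations, edited_counts, gfp_counts, control_slope,
                      min_total=500, min_each=3):
    """Normalized slope of log2(edited / GFP) per generation for one replicate."""
    xs, ys = [], []
    for g, e, f in zip(generations, edited_counts, gfp_counts):
        # Drop samples with too few gated cells overall or for either strain.
        if e + f < min_total or e < min_each or f < min_each:
            continue
        xs.append(g)
        ys.append(math.log2(e / f))
    # Subtract the slope of the barcode-only control from the same replicate.
    return ols_slope(xs, ys) - control_slope


def summarize(slopes):
    """Mean and standard error of the normalized slopes across replicates."""
    return mean(slopes), stdev(slopes) / math.sqrt(len(slopes))


fitness = replicate_fitness([0, 5, 10], [1000, 2000, 4000], [1000, 1000, 1000], 0.0)
print(round(fitness, 3))
```

In this toy replicate the edited strain doubles relative to the GFP control every 5 generations, so the log2 ratio rises by 0.2 per generation before normalization.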
- a retron-guide RNA cassette comprising: (a) a first retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a first donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a first target locus; and (v) a second inverted repeat sequence coding region; and (b) a first guide RNA (gRNA) coding region; (c) a second retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a second donor DNA sequence located within the second msd locus,
- the first donor DNA sequence comprises a genetic variant relative to the sequence at the first target locus.
- the genetic variant comprises a cis-eQTL variant at the first target locus.
- ribozyme sequence selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-CIV-1, drz-Spur-3, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead, and combinations thereof.
- a method for identifying a genetic modification at a target locus in a host cell comprising: (a) transforming the host cell with a vector of embodiment 19; (b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region, wherein the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target locus and comprise sequence modifications compared to the sequences within the first target locus, where
- 26. The method of embodiment 25, wherein the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of the transcription unit.
- 27. The method of any one of embodiments 20 to 26, wherein the first and/or second target locus is located in a non-coding intergenic region in the host cell genomic DNA.
- 28. The method of any one of embodiments 25 or 26, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus.
- the genetic variant comprises a cis-eQTL variant at the first target locus.
- the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
- detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence.
- any one of embodiments 20 to 33 further comprising: (d) transforming the host cell with a second vector comprising a second retron-guide RNA cassette comprising: a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region; a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target loc
- the method of embodiment 39 wherein the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the transcription unit.
- the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus.
- the genetic variant comprises a cis-eQTL variant at the first target locus.
- any one of embodiments 34 to 43 wherein (i) the first and third gRNAs are different; (ii) the first and third target loci are different; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different.
- the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site.
- 52. The method of embodiment 51, wherein the eukaryotic cell is a yeast cell.
- 53. The method of embodiment 51, wherein the eukaryotic cell is a mammalian cell.
- the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells.
- 56. The method of any one of embodiments 20 to 49, comprising transforming a mixture of cells with one or more vectors comprising the first, second or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell.
- the method of embodiment 56 further comprising detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette.
- Primer #341: AATGATACGGCGACCACCGAGATCTACACACTGCATAACACTCTTTCCCTACACGACGCTCTTCCGATCT
- 9. Primer #342: AATGATACGGCGACCACCGAGATCTACACAAGGAGTAACACTCTTTCCCTACACGACGCTCTTCCGATCT
- Inverted repeat sequence: TGCGCACCCTTA
- 32. Second inverted repeat: TAAGGGTGCGCA
- 33. msr locus: ATGCGCACCCTTAGCGAGAGGTTTATCATTAAGGTCAACCTCTGGATGTTGTTTCGGCATCCTGCATTGAATCTGAGTTACT
- 34. msd locus first retron: TCTGAGTTACTGTCTGTTTgaacTGTTGGAACGGAGAGCATCGCCTGATGCTCTCCGAGCCAACtttAAACCCGTTTcTTCTGAC
Abstract
The disclosure provides compositions and methods for introducing two or more genetic modifications into the genome of a host cell or organism. The compositions comprise retron guide RNA cassettes that can be used to introduce a first genetic modification, such as a genetic variant, at a first target locus and a second genetic modification, such as a unique barcode sequence, at a second target locus. The methods allow tracking of the first genetic modification by detecting the presence of the barcode sequence or a protein encoded by the barcode sequence. The methods can be used to track multiple genetic variants introduced into a host cell by detecting the presence of multiple unique barcode sequences, without having to detect the vector sequences used to transform the host cell.
Description
GENERATION AND TRACKING OF CELLS WITH PRECISE EDITS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application No. 63/344,470, filed May 20, 2022, the disclosure of which is herein incorporated by reference in its entirety for all purposes.
STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER FEDERALLY SPONSORED RESEARCH AND DEVELOPMENT
[0002] This invention was made with Government support under contracts ES030282, GM097171, and GM134228 awarded by the National Institutes of Health. The Government has certain rights in the invention.
BACKGROUND
[0003] Current methods of generating cells with precise genetic edits require delivering genetic editing tools to the host cell and then applying the genetic edit. After the editing procedure, the genetic material of the host cell is inherited by its daughter cells. The edited cells would be used for further applications based on the edit they have received. Common methods for detecting the precise edit made upon the cell require genotyping of the edited genetic loci. However, it is difficult to track exactly what edit was made in each cell in a mixture of cells that 1) did or did not receive the delivered editing tools and/or 2) received a mixture of editing tools containing different intended edits. Whole genome genotyping would be required for each cell in the mixture, which is currently expensive and time-consuming. Thus, tracking of genetic edits made during generation of cells with precise genetic edits in a cell mixture requires a method for retrieval of information on what edit was intended by the editing tool in each cell. Current methods directly genotype the edit tool vector, such as a plasmid or virus, that is either still retained in the cell ectopically or integrated into the host cell genetic material. Both scenarios may affect host cell physiology or cellular response, such as cancerous growth in human cells. Therefore, a method that does not retain an edit tool vector, such as plasmid or virus, in the host cell while still allowing faster and cheaper tracking of genetic edits made to the cell is of interest.
BRIEF SUMMARY
[0004] The present disclosure provides compositions and methods for tracking one or more targeted genetic modifications in the genome of a cell or organism. In some embodiments, the present disclosure provides a nucleic acid composition that comprises two or more editing modules that are present on an expression vector. The compositions and methods allow for producing combinations of targeted genetic modifications in the genome of a host cell.
[0005] In one aspect, the disclosure provides a retron-guide RNA cassette comprising:
(a) a first retron comprising:
(i) an msr locus;
(ii) a first inverted repeat sequence coding region;
(iii) an msd locus;
(iv) a first donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a first target locus; and
(v) a second inverted repeat sequence coding region; and
(b) a first guide RNA (gRNA) coding region;
(c) a second retron comprising:
(i) an msr locus;
(ii) a first inverted repeat sequence coding region;
(iii) an msd locus;
(iv) a second donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence; and
(v) a second inverted repeat sequence coding region; and
(d) a second guide RNA (gRNA) coding region.
[0006] In some embodiments, the first target locus is located in trans to the second target locus. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit. In some embodiments, the first donor DNA sequence comprises a genetic variant compared to the sequences within the first target locus. In some embodiments, the genetic variant comprises a trans-expression quantitative trait locus (eQTL) variant at the first target locus.
[0007] In some embodiments, the first target locus is located in cis to the second target locus. In some embodiments, the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in the 3’ untranslated region (UTR) of the transcription unit. In some embodiments, the first donor DNA sequence comprises a genetic variant relative to the sequence at the first target locus. In some embodiments, the genetic variant comprises a cis-eQTL variant at the first target locus.
[0008] In some embodiments, the second target locus i) is an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene.
[0009] In some embodiments, the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
[0010] In some embodiments, the first or second gRNA coding region is upstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 5’ of the RNA transcribed from the retron. In some embodiments, the first or second gRNA coding region is downstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 3’ of the RNA transcribed from the retron.
[0011] In some embodiments, the retron-guide RNA cassette further comprises one or more ribozyme sequences. In some embodiments, the first and second retrons are connected by a self-cleaving ribozyme sequence. In some embodiments, the ribozyme sequence encodes a ribozyme selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead, and combinations thereof. In some embodiments, the one or more ribozyme sequences are different from each other.
[0012] In some embodiments, the retron-guide RNA cassette further comprises a third retron comprising:
(i) an msr locus;
(ii) a first inverted repeat sequence coding region;
(iii) an msd locus;
(iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and
(v) a second inverted repeat sequence coding region; and
a third guide RNA (gRNA) coding region.
[0013] In another aspect, the disclosure provides a vector comprising a retron-guide RNA cassette described herein.
[0014] In another aspect, the disclosure provides a method for identifying a genetic modification at a target locus in a host cell, the method comprising:
(a) transforming the host cell with a vector or retron-guide RNA cassette described herein;
(b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region, wherein the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target
locus and comprise sequence modifications compared to the sequences within the first target locus, wherein the first target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the first gRNA, wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the first target locus, wherein at least a portion of the second retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the second target locus, wherein the second target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the second gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert a unique barcode sequence at the second target locus; and
(c) detecting the presence of the unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the first target locus, thereby identifying the genetic modification at the first target locus.
[0015] In some embodiments, the method identifies a genetic modification at a target locus within a genome of a host cell, where the genome comprises the endogenous genomic chromosomal DNA of the host cell. In some embodiments, the method identifies a genetic modification at a target locus anywhere within a genome of a host cell. In some embodiments, the target locus is located in an exogenous genome that is present in a host cell, such as a viral genome, a bacterial genome, a transposable element or an endovirus genome that are not part of the endogenous host cell genome. In some embodiments, the target locus is located in heterologous or exogenous DNA, such as the DNA of transgenes, viruses or transposons, that are present in the host cell or host cell nucleus. In some embodiments, the target locus is located in heterologous or exogenous DNA that is integrated into the host cell genomic DNA. In some embodiments, the target locus is located in heterologous or exogenous DNA that is not integrated into the host cell genomic DNA, such as transiently expressed transgenes, episomes or plasmids.
[0016] In some embodiments, the first target locus is located in trans to the second target locus. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in a 5’ untranslated region, protein coding region, or the 3’ untranslated region (UTR) of a transcription unit. In some embodiments, the genetic variant comprises a trans-eQTL variant at the first target locus.
[0017] In some embodiments, the first target locus is located in cis to the second target locus. In some embodiments, the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, protein coding region, or the 3’ untranslated region (UTR) of the transcription unit. In some embodiments, the genetic variant comprises a cis-eQTL variant at the first target locus.
[0018] In some embodiments, the first and/or second target locus is located in an intergenic, non-coding region of the host cell genomic DNA.
[0019] In some embodiments, the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus.
[0020] In some embodiments, the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
[0021] In some embodiments, detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence. In some embodiments, the vector is no longer present in the host cell when detecting the presence of the unique barcode sequence.
[0022] In some embodiments, greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the first target locus.
[0023] In some embodiments, the method further comprises: (d) transforming the host cell with a second vector comprising a second retron-guide RNA cassette comprising: a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region; a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target locus and a second unique barcode sequence; and (v) a second inverted repeat sequence coding region; and a fourth guide RNA (gRNA) coding region; (e) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a third retron donor DNA-guide molecule comprising a third retron transcript and the third gRNA coding region and a fourth retron donor DNA-guide molecule comprising a fourth retron transcript and the fourth gRNA coding region,
wherein the third and fourth retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the third retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the third target locus and comprise sequence modifications compared to the sequences within the third target locus, wherein the third target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the third gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the third target locus, wherein at least a portion of the fourth retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the fourth target locus, wherein the fourth target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the fourth gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert the second unique barcode sequence at the fourth target locus; and (f) detecting the presence of the second unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the third target locus, thereby identifying the genetic modification at the third target locus. 
[0024] In some embodiments, the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus.
[0025] In some embodiments, the third target locus is located in trans to the fourth target locus. In some embodiments, the third target locus is located in a trans-regulatory element, and the fourth target locus is located in the 3’ untranslated region (UTR) of a transcription unit. In some embodiments, the genetic variant comprises a trans-eQTL variant at the third target locus. [0026] In some embodiments, the third target locus is located in cis to the fourth target locus. In some embodiments, the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the transcription unit. In some embodiments, the genetic variant comprises a cis-eQTL variant at the first target locus. [0027] In some embodiments, the method further comprises detecting the relative expression of transcription from the transcription units comprising genetic variants at the first and third target loci. [0028] In some embodiments: (i) the first and third gRNAs are the same; (ii) the first and third target loci are the same; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different. [0029] In some embodiments: (i) the first and third gRNAs are different; (ii) the first and third target loci are different; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different. 
[0030] In some embodiments, the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site.
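As an illustration of the homology-arm criterion in paragraph [0030], the following is a minimal sketch of how the similarity between each arm and its flanking target sequence might be checked. It assumes, for simplicity, that "similarity" is measured as ungapped percent identity over sequences of equal length; the function names, the 70% threshold default, and the example sequences are hypothetical and are not taken from the disclosure.

```python
def percent_identity(arm: str, flank: str) -> float:
    """Ungapped percent identity between two equal-length DNA strings."""
    if not arm or len(arm) != len(flank):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(arm.upper(), flank.upper()))
    return 100.0 * matches / len(arm)


def arms_meet_threshold(left_arm, left_flank, right_arm, right_flank,
                        minimum=70.0):
    """Return True if both homology arms reach the minimum percent identity
    to the target sequence on their side of the nuclease cleavage site."""
    return (percent_identity(left_arm, left_flank) >= minimum
            and percent_identity(right_arm, right_flank) >= minimum)


# Hypothetical 10-nt arms; the left arm carries one mismatch (90% identity).
left_ok = percent_identity("ACGTACGTAC", "ACGTACGTAA")
both_ok = arms_meet_threshold("ACGTACGTAC", "ACGTACGTAA",
                              "TTGACCGGTA", "TTGACCGGTA")
```

A gapped alignment would be needed for arms that differ from the target by insertions or deletions; the ungapped comparison above is only the simplest case.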
[0031] In some embodiments, greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the third target locus. [0032] In some embodiments, the method further comprises detecting the presence of the unique barcode at the third target locus, thereby identifying the genetic modification at both the first and third target loci. [0033] In some embodiments, the method further comprises repeating steps (d)-(f) with a third vector comprising a third retron-guide RNA cassette that inserts a genetic modification at a fifth target locus and a unique barcode sequence at a sixth target locus, thereby identifying the genetic modification at the fifth target locus. [0034] In some embodiments, the host cell is a prokaryotic cell. In some embodiments, the host cell is a eukaryotic cell. In some embodiments, the eukaryotic cell is a yeast cell. In some embodiments, the eukaryotic cell is a mammalian cell or cell line. In some embodiments, the mammalian cell is a human cell or cell line. In some embodiments, the host cell comprises a clonal population of host cells. In some embodiments, the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells. [0035] In some embodiments, the method further comprises transforming a mixture of cells with one or more vectors comprising the first, second or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell. [0036] In some embodiments, the method further comprises detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette. 
[0037] In another aspect, the disclosure provides a method for identifying two or more genetic modifications at two different target loci in a host cell, the method comprising: transforming the host cell with a vector or retron-guide RNA cassette described herein; wherein the vector or retron-guide RNA cassette comprises two or more variant editing cassettes that are expressed in the same transcript, and a donor DNA sequence comprising
homology to one or more sequences within a third, different target locus and a unique barcode sequence. BRIEF DESCRIPTION OF THE DRAWINGS [0038] Fig.1a-k: Design and validation of CRISPEY-BAR for generating and tracking thousands of precise genome edits simultaneously. [0039] Fig.1a. Schematic of CRISPEY-BAR dual edit strategy. Top, CRISPEY-BAR expression cassette consisting of pGAL7 galactose-inducible promoter and terminator (brown); self-cleaving HDV-like-ribozymes RzCIV, RZHDV and RZSpur3 (magenta); barcode insertion retron-guide cassette (blue) containing programmed barcode (orange) and UMI (yellow); variant editing cassette (green). Middle, the variant editing cassette converts a wildtype (WT) allele into an alternative allele. Bottom, the barcode insertion retron-guide cassette. [0040] Fig.1b. Schematic for conventional CRISPEY. Variants tracked across three growth replicates by plasmids containing guide-donor oligo. [0041] Fig.1c. Schematic for CRISPEY-BAR. Variants tracked across three growth replicates by genomically-integrated barcodes with attached UMIs. [0042] Fig.1d. Workflow for CRISPEY-BAR library pool construction. [0043] Fig. 1e. Validation of genomic variant editing rate from CRISPEY-BAR. Blue, randomly picked colonies that contain both genomic-integrated barcode and the designed edit. Orange, randomly picked colonies that contain only the genomic-integrated barcode but not the designed edit. [0044] Fig. 1f. Schematic for CRISPEY-BAR pooled competition in yeast. [0045] Fig.1g. Example of CRISPEY-BAR data over time. Each line indicates normalized counts for a single UMI for a given barcode from 1 of 3 replicates in a competition experiment. Counts in later time points are normalized to the first time point. Light blue and blue: two barcodes representing different guides targeting the same variant chr7: 848783 AC>A. Red and dark red: two barcodes representing different guides targeting the same
variant chr7: 847050 C>A. Gray scale: Non-targeting of variants, barcode integration only (no-edit control regarding variants). Data shown are from Terbinafine competition across approximately 26 generations. [0046] Fig.1h. Example of outlier removal. Green solid line, normalized reads from an outlier UMI. Green dotted line, normalized sum of reads from all UMIs of the barcode. Gray solid line, normalized reads from non-outlier UMIs. Black dotted line, normalized sum of reads from all UMIs of the barcode excluding outlier UMI. [0047] Fig. 1i. Replication of fitness effects between two competition triplicates in synthetic complete media (SC). Orange, variants with FDR < 0.25 in both triplicates. Blue, variants with FDR >= 0.25 in one or more triplicates. [0048] Fig.1j. Replication of fitness effects. X-axis and Y-axis indicate fitness effects measured by two independent CRISPEY-BAR experiments, pool1 and pool4, in cobalt chloride. Orange, variants with FDR < 0.25 in pool1. Blue, variants with FDR >= 0.25 in pool1. [0049] Fig.1k. Validation of pooled fitness in fluconazole by pairwise competition. X-axis, fitness effect measured by CRISPEY-BAR pooled competition. Y-axis, fitness effect measured through pairwise competition against GFP strain using flow cytometry. Data shown for 13 variants in fluconazole. Data presented as mean ± SEM. [0050] Fig.2a-g: Detection of natural variants affecting fitness within QTLs mapped in complex traits. [0051] Fig.2a. Diagram of library design process using natural variants and QTL regions, as well as library statistics. [0052] Fig.2b. Schematic for experiment workflow for QTL fine-mapping with CRISPEY-BAR. [0053] Fig. 2c. Number of variants with fitness effect (FDR<0.01) within SC and appropriate stress condition.
[0054] Fig. 2d. Annotation enrichment of variants with fitness effect (FDR<0.01). Blue, variant enrichment for hits in fluconazole condition. Orange, variant enrichment for hits in caffeine condition. Green, variant enrichment for hits in cobalt chloride condition. [0055] Fig.2e. Fitness effects of example QTL regions. Dark blue, fitness effects in stress condition (FDR < 0.01). Dark orange, fitness effects in SC (FDR < 0.01). Light blue, no fitness effects stress condition. Gold, no fitness effects in SC. Most variants are represented twice (effect in QTL condition and complete media). [0056] Fig. 2f. PDR5 fitness effects in CAFF and FLC. Magenta, PDR5 variant fitness measured in caffeine condition. Orange, PDR5 variant fitness measured in fluconazole condition. Dark gray, noncoding regions flanking PDR5. Light gray, coding region of PDR5. Vertical lines connect the same variant fitness values measured in both caffeine and fluconazole. [0057] Fig.2g. Diagram depicting primary ecological origins of two adjacent variants mutating K940 in PDR5 with significant fitness effects. [0058] Fig.3a-h: CRISPEY-BAR enabled robust mapping of variant-level GxE interactions within the ergosterol biosynthesis pathway. [0059] Fig.3a. Ergosterol pathway diagram showing 24 genes from the ergosterol synthesis pathway surveyed in this study. Lovastatin and terbinafine target genes in the ergosterol pathway. [0060] Fig.3b. The same pool of yeast edited at natural ergosterol pathway variants was grown in six different conditions and tracked by barcode sequencing. [0061] Fig.3c. Gene level fitness effects of surveyed natural variants in six conditions. X-axis labels indicate the genes containing the variants. Red, causal variants (p < 0.01). Gray, non-significant variants. Target genes are outlined by dashed black lines where applicable. [0062] Fig.3d. GxE interactions were calculated between each pair of conditions (15 pairwise comparisons). [0063] Fig.3e. 
Diagram showing definition of GxE variants in this study: A positive effect variant (black circle) in condition 1 can either have the same effect in another condition
(white circle at same height in red region), a stronger positive effect (top white circle in red region), no effect (white circle at zero), or a negative effect (bottom white circle in blue region). If the variant has a negative effect or no effect at all in condition 2 (blue and light blue regions), it is labeled as GxE. [0064] Fig. 3f. The number of significant GxE interactions for each pairwise comparison. [0065] Fig.3g. GxE annotation enrichments for variants with GxE. Enrichment of variants with GxE in each category were normalized to all variants tested. Red dashed line indicates an enrichment factor of 1.0, corresponding to no enrichment over the library. [0066] Fig.3h. Variants with GxE effects within the HMG1 promoter. Clusters of variants with significant GxE effects within 8 bp of each other are in gray highlighted areas. Beginning of the HMG1 gene body is shown as a blue rectangle. Green, variants fitness effect in caffeine (CAFF) condition. Purple, variant fitness effect in lovastatin (LOV) condition. [0067] Fig.4a-f: Quantifying GxE interactions among ergosterol pathway variants [0068] Fig.4a. Schematic of rare GxE between conditions (correlated effects). [0069] Fig.4b. Schematic of common GxE between conditions (uncorrelated effects). [0070] Fig.4c. Fitness effects of variants within PDR5 in caffeine and fluconazole. [0071] Fig.4d. Fitness effects of variants within ergosterol pool in lovastatin and CoCl2. [0072] Fig.4e. Histogram showing the fraction of variants with significant fitness effects within a pair of conditions which show non-magnitude GxE for the ergosterol pool. PDR5 variants measured in caffeine and fluconazole are shown as a dotted gray line. [0073] Fig.4f. Heatmaps showing fitness effects of all variants with a significant effect in any condition. Significant positive effects (red), significant negative effects (blue), non-significant positive effects (pink), and non-significant negative effects (light blue). [0074] Fig. 
5a-e: Types of GxE variants and effect of natural variation on ERG4 expression. [0075] Fig. 5a. Example of fitness effect detected in only one condition.
[0076] Fig.5b. Example of fitness effects with same direction detected in two conditions. [0077] Fig.5c. Example of fitness effects with opposite directions between conditions, showing sign GxE. [0078] Fig.5d. Sign GxE variants have larger maximum fitness effects. Whiskers represent Q3 + 1.5xIQR and Q1 - 1.5xIQR, or the maximum and minimum values of the dataset if these are respectively lower or higher than the IQR based intervals. [0079] Fig.5e. Effect of natural variants on ERG4 expression. Top left: Consensus Rpn4p binding motif. Top right: Genomic location of Rpn4p binding site affected by chr7: 472522 C>A variant within ERG4/PDR1 divergent promoter. Bottom left: Variants of Rpn4p binding site within ERG4/PDR1 promoter tested. Bottom right: qRT-PCR measured expression of ERG4 scaled by wildtype expression in an unedited strain, data presented as mean ± SEM. [0080] Fig.6: Schematic for library cloning in CRISPEY-BAR. [0081] Fig. 7: Schematic for pooled editing and growth competition in CRISPEY-BAR. [0082] Fig.8: Schematic for CRISPEY-BAR sequencing library preparation. [0083] Fig.9: Fitness and ERG4 expression for variants in Fig.5e. X-axis: Paired fitness from flow cytometry measurements similar to Fig.1i, see also Methods. Y-axis: ERG4 expression change same as shown in Fig.5e. Data presented as mean ± SEM. DETAILED DESCRIPTION I. Introduction [0084] The present disclosure provides compositions and methods for tracking one or more targeted genetic modifications (also referred to as genetic “edits” or “variants”) made in the genome of a cell or organism. In some embodiments, the present disclosure provides a nucleic acid composition that comprises two or more editing modules that are present on an expression vector. The compositions and methods allow for producing combinations of targeted genetic modifications in the genome of a host cell, where the combinations of modifications are predetermined.
[0085] In some embodiments, the first module comprises nucleic acid sequences that can modify a genetic locus in a host cell (e.g., a first target locus) and the second module comprises nucleic acid sequences that modify a second genetic locus in a host cell (e.g., a second target locus). In some embodiments, the first target locus is at a different location in the genome than the second target locus. In some embodiments, the genetic modification at the first target locus is different than the genetic modification at the second target locus. In some embodiments, the genetic modification at the first target locus comprises a mutation, edit, variant or deletion in the nucleic acid sequence of the first target locus. In some embodiments, the genetic modification at the second target locus comprises a mutation, edit or variant of the nucleic acid sequence of the second target locus. In some embodiments, the genetic modification at the second target locus comprises or further comprises introducing a unique barcode sequence at the second target locus. In some embodiments, the genetic modification at the second target locus comprises introducing both a mutation, edit, or variant and a unique barcode sequence at the second target locus. [0086] In some embodiments, the compositions and methods can be used to introduce a second genetic modification at a target locus in the same host cell or its progeny by transfecting the cell with a second vector comprising nucleic acid sequences that can modify a third target locus and a second module comprising nucleic acid sequences that can introduce a barcode sequence at a fourth target locus. In some embodiments, the first and third target loci are the same, but the genetic modification is different. In some embodiments, the second and fourth target loci are the same, but the barcode sequence is different. 
The above can be repeated to introduce additional genetic modifications along with different unique barcode sequences at the same or different target loci. [0087] After the genetic modifications or edits are made, the vector can be removed or lost in the host cell and its daughter cells. The intended combination of precise edits made in each cell can be determined by detecting the unique barcode sequence assigned to each edit combination. In some embodiments, barcode sequence can be detected by Sanger sequencing, next generation sequencing (NGS) or other detection methods that distinguish the unique barcode sequence assigned to each edit combination. This can be performed in a mixture of host cells, a single host cell, or a clonal cell lineage.
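The barcode-based tracking described in paragraph [0087] can be sketched as a simple lookup: each sequencing read of the barcode locus is scanned for the barcode between two known flanking sequences, and the barcode is mapped back to the edit combination it was assigned to. This is a minimal sketch under stated assumptions, not the method of the disclosure; the flanking sequences, barcodes, edit labels, and function names below are all hypothetical, and real reads would also require handling of sequencing errors and reverse-complement orientation.

```python
from collections import Counter

# Hypothetical sequences flanking the barcode insertion site.
LEFT_FLANK = "GGATCC"
RIGHT_FLANK = "GAATTC"

# Hypothetical design table: barcode -> edit combination assigned to it.
BARCODE_TO_EDIT = {
    "ACGTACGT": "locus1:C>A",
    "TTGGCCAA": "locus1:G>T",
}


def extract_barcode(read, left=LEFT_FLANK, right=RIGHT_FLANK):
    """Return the sequence between the flanks, or None if a flank is missing."""
    start = read.find(left)
    if start == -1:
        return None
    start += len(left)
    end = read.find(right, start)
    if end == -1:
        return None
    return read[start:end]


def count_edits(reads, barcode_to_edit=BARCODE_TO_EDIT):
    """Count reads per edit combination, ignoring unrecognized barcodes."""
    counts = Counter()
    for read in reads:
        edit = barcode_to_edit.get(extract_barcode(read))
        if edit is not None:
            counts[edit] += 1
    return counts


# Example: three reads carry known barcodes, one read has no flanks.
reads = [
    "AAGGATCCACGTACGTGAATTCAA",
    "GGATCCTTGGCCAAGAATTC",
    "GGATCCACGTACGTGAATTC",
    "NNNN",
]
edit_counts = count_edits(reads)
```

Because the lookup depends only on the genomically integrated barcode, it applies equally to reads from a mixture of cells or from a single clonal lineage, and it does not require the editing vector to remain in the cell.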
[0088] The compositions and methods described herein provide the following advantages. 1) Detecting genetic modifications in the host cell does not require the presence of the expression vector in the host cell or its progeny. Current methods for tracking genetic modifications in cell mixtures directly genotype the editing vector, such as plasmid or virus, that is either retained in the cell ectopically or integrates into the host cell genetic material, which can alter the host cell physiology or result in deleterious mutations in the host cell genome. 2) The unique barcode at the second target locus allows straightforward tracking of genetic modifications at the first target locus simply by sequencing the second target locus or by detecting expression of a reporter gene encoded by the barcode sequence. Because both genetic modifications are achieved by transfecting the host cell with one vector comprising two editing modules, the presence of the barcode sequence at the second target locus demonstrates that the cellular editing machinery was functional and thus there is a high probability that the genetic modification at the first target locus also occurred. 3) Multiple or iterative genetic modifications to the same host cell or its progeny can be tracked by detecting previous, different, unique barcodes introduced into the host cell genome associated with a previous genetic modification. 4) The barcode sequence can also be linked to sequences that encode antibiotic markers or markers that allow cell purification. [0089] In some embodiments, the two or more editing modules are present on a bicistronic retron-donor-guide editing vector. In some embodiments, the bicistronic retron-donor-guide editing vector allows simultaneous editing of two different genetic target loci. In some embodiments, the first and second modules comprise a retron-guide RNA cassette. 
Retron-guide RNA cassettes are described in US 2019/0330619 A1 (corresponding to WO 2018/049168) and US Provisional Patent App. No. 63/232,080 (filed 8 August 2021), which are hereby incorporated by reference herein in their entirety. In some embodiments, the combination of edits that will be made across all modules are predetermined. [0090] In some embodiments, the three editing modules are present on a retron-donor-guide editing vector. In some embodiments, the bicistronic retron-donor-guide editing vector allows simultaneous editing of three different genetic target loci. In some embodiments, the first, second and third modules comprise a retron-guide RNA cassette. In some embodiments, the first and second modules introduce two (a pair) of genetic edits in two different target sequences, and the third module introduces a unique barcode sequence that is
associated with the pair of genetic variants introduced by the first and second modules (a “variant-pair” specific barcode). [0091] In some embodiments, the first and second editing modules are connected by self-cleaving HDV-like ribozymes to allow separation of either module to detach from the RNA pol2 transcript, which allows Cas9/retron binding and nuclear export. In some embodiments, ribozymes are selected from drz-CIV-1, HDV ribozyme, and drz-Spur-3, though other combinations of ribozymes are expressly included herein. Useful ribozyme sequences are described in Riccitelli NJ, et al., Identification of minimal HDV-like ribozymes with unique divalent metal ion dependence in the human microbiome. Biochemistry. 2014 Mar 18;53(10):1616-26. doi: 10.1021/bi401717w. Epub 2014 Mar 5. PMID: 24555915. [0092] Genome editing methods commonly include the provision of both an engineered nuclease or nickase and a donor DNA repair template that contains the DNA sequence to be inserted at a desired location. For example, the CRISPR/Cas9 system utilizes a guide RNA (gRNA) that directs the Cas9 nuclease to introduce a double-strand cut at a specific location. A donor DNA repair template can then be provided, enabling the precise insertion of a new sequence mediated by homology-directed repair of the double-strand cut. In the past, the gRNA and donor DNA template have been supplied as separate molecules, meaning that each editing experiment must be performed in a separate tube or vessel. [0093] However, it has recently been described that physically coupling a gRNA molecule to the transcript product of an obscure bacterial genetic element termed a retron dramatically increases the efficiency of DNA editing and screening. 
In particular, the reverse transcription of the DNA coding unit (msd region) of the retron transcript results in a multicopy single-stranded DNA (msDNA) molecule that contains a donor DNA repair template and is physically tethered to the gRNA, increasing editing efficiency. See, e.g., US 2019/0330619. General [0094] The practice of the present disclosure employs, unless otherwise indicated, conventional techniques of immunology, biochemistry, chemistry, molecular biology, microbiology, cell biology, genomics and recombinant DNA, which are within the skill of the art. See Sambrook, Fritsch and Maniatis, Molecular Cloning: A Laboratory Manual, 2nd
edition (1989), Current Protocols in Molecular Biology (F. M. Ausubel, et al. eds., (1987)), the series Methods in Enzymology (Academic Press, Inc.): PCR 2: A Practical Approach (M. J. MacPherson, B. D. Hames and G. R. Taylor eds. (1995)), Harlow and Lane, eds. (1988) Antibodies, A Laboratory Manual, and Animal Cell Culture (R. I. Freshney, ed. (1987)). [0095] For nucleic acids, sizes are given in either kilobases (kb), base pairs (bp), or nucleotides (nt). Sizes of single-stranded DNA and/or RNA can be given in nucleotides. These are estimates derived from agarose or acrylamide gel electrophoresis, from sequenced nucleic acids, or from published DNA sequences. For proteins, sizes are given in kilodaltons (kDa) or amino acid residue numbers. Protein sizes are estimated from gel electrophoresis, from sequenced proteins, from derived amino acid sequences, or from published protein sequences. [0096] Oligonucleotides that are not commercially available can be chemically synthesized, e.g., according to the solid phase phosphoramidite triester method first described by Beaucage and Caruthers, Tetrahedron Lett. 22:1859-1862 (1981), using an automated synthesizer, as described in Van Devanter et al., Nucleic Acids Res. 12:6159-6168 (1984). Purification of oligonucleotides is performed using any art-recognized strategy, e.g., native acrylamide gel electrophoresis or anion-exchange high performance liquid chromatography (HPLC) as described in Pearson and Reanier, J. Chrom. 255: 137-149 (1983). [0097] The disclosure encompasses all combinations of the particular embodiments described herein, as if each combination had been individually recited. Definitions [0098] Unless specifically indicated otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the art to which this disclosure belongs. 
In addition, any method or material similar or equivalent to a method or material described herein can be used in the practice of the present disclosure. For purposes of the present disclosure, the following terms are defined. [0099] The terms “a,” “an,” or “the” as used herein not only include aspects with one member, but also include aspects with more than one member. For instance, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates
otherwise. Thus, for example, reference to “a cell” includes a plurality of such cells and reference to “the agent” includes reference to one or more agents known to those skilled in the art, and so forth. [0100] The term “about” in relation to a reference numerical value can include a range of values plus or minus 10% from that value. For example, the amount “about 10” includes amounts from 9 to 11, including the reference numbers of 9, 10, and 11. The term “about” in relation to a reference numerical value can also include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value. [0101] As used herein, unless otherwise specified, the terms “5’ ” and “3’ ” denote the positions of elements or features relative to the overall arrangement of the retron-guide RNA cassettes, vectors, or retron donor DNA-guide molecules of the present disclosure in which they are included. Positions are not, unless otherwise specified, referred to in the context of the orientation of a particular element or features. For example, the msr and msd loci in FIG. 4 are shown in opposite orientations. However, the msr locus is said to be 5’ of the msd locus. Furthermore, the 3’ end of the msr locus is said to be overlapping with the 5’ end of the msd locus. Unless otherwise specified, the term “upstream” refers to a position that is 5’ of a point of reference. Conversely, the term “downstream” refers to a position that is 3’ of a point of reference. Thus, in FIG.2 the msr locus is said to be located upstream of the reverse transcriptase sequence, and the reverse transcriptase sequence is said to be located downstream of the msr locus. [0102] The term “genome editing” refers to a type of genetic engineering in which DNA is inserted, replaced, or removed from a target DNA (e.g., the genome of a cell) using one or more nucleases and/or nickases. 
The nucleases create specific double-strand breaks (DSBs) at desired locations in the genome, and harness the cell’s endogenous mechanisms to repair the induced break by homology-directed repair (HDR) (e.g., homologous recombination) or by nonhomologous end joining (NHEJ). The nickases create specific single-strand breaks at desired locations in the genome. In one non-limiting example, two nickases can be used to create two single-strand breaks on opposite strands of a target DNA, thereby generating a blunt or a sticky end. Any suitable DNA nuclease can be introduced into a cell to induce genome editing of a target DNA sequence.
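As an illustration of the paired-nickase example in paragraph [0102], the following Python sketch classifies the end produced by two single-strand breaks on opposite strands. The coordinate convention (each nick is given as the number of base pairs to its left, reading the top strand 5' to 3') and the function name are assumptions made for this sketch and do not appear in the disclosure. The sketch also ignores the practical requirement that the two nicks lie close enough together for the duplex to separate.

```python
def classify_paired_nicks(top_nick: int, bottom_nick: int) -> tuple[str, int]:
    """Classify the DNA end produced by one nick on each strand.

    Both positions are hypothetical coordinates: the number of base pairs
    to the left of the nick, reading the top strand 5'->3'.
    Returns (end_type, overhang_length_in_nucleotides).
    """
    if top_nick < 0 or bottom_nick < 0:
        raise ValueError("nick positions must be non-negative")
    offset = bottom_nick - top_nick
    if offset == 0:
        # Both strands are cut at the same position: blunt ends.
        return ("blunt", 0)
    if offset > 0:
        # The top strand is cut to the left of the bottom strand, so the
        # protruding single strands terminate in 5' ends.
        return ("5' overhang", offset)
    # The top strand is cut to the right of the bottom strand.
    return ("3' overhang", -offset)


if __name__ == "__main__":
    print(classify_paired_nicks(10, 10))
    print(classify_paired_nicks(1, 5))
    print(classify_paired_nicks(5, 1))
```

A zero offset corresponds to the blunt end, and any non-zero offset to the sticky end, described in the paragraph above.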
[0103] The terms “genetic modification,” “genetic edit,” and “genome edit” can be used interchangeably and refer to a change in the nucleic acid sequence of a target polynucleotide (e.g., the genomic DNA of a cell), such that the nucleic acid sequence of the modified DNA is different from the native, endogenous, previously modified, or wild-type sequence of the target DNA. The term encompasses mutations and variants of the target DNA sequence, and includes insertions, replacements, or deletions of the target polynucleotide sequence, including insertion of a barcode sequence at a target genomic DNA locus. [0104] The term “DNA nuclease” refers to an enzyme capable of cleaving the phosphodiester bonds between the nucleotide subunits of DNA, and may be an endonuclease or an exonuclease. According to the present disclosure, the DNA nuclease may be an engineered (e.g., programmable or targetable) DNA nuclease which can be used to induce genome editing of a target DNA sequence. Any suitable DNA nuclease can be used including, but not limited to, CRISPR-associated protein (Cas) nucleases, other endo- or exonucleases, variants thereof, fragments thereof, and combinations thereof. [0105] The term “double-strand break” or “double-strand cut” refers to the severing or cleavage of both strands of the DNA double helix. The DSB may result in cleavage of both strands at the same position leading to “blunt ends” or staggered cleavage resulting in a region of single-stranded DNA at the end of each DNA fragment, or “sticky ends”. A DSB may arise from the action of one or more DNA nucleases. [0106] The term “nonhomologous end joining” or “NHEJ” refers to a pathway that repairs double-strand DNA breaks in which the break ends are directly ligated without the need for a homologous template. [0107] The term “homology-directed repair” or “HDR” refers to a mechanism in cells to accurately and precisely repair double-strand DNA breaks using a homologous template to guide repair. 
The most common form of HDR is homologous recombination (HR), a type of genetic recombination in which nucleotide sequences are exchanged between two similar or identical molecules of DNA. [0108] The term “nucleic acid,” “nucleotide,” or “polynucleotide” refers to deoxyribonucleic acids (DNA), ribonucleic acids (RNA) and polymers thereof in either
single-, double- or multi-stranded form. The term includes, but is not limited to, single-, double- or multi-stranded DNA or RNA, genomic DNA, cDNA, DNA-RNA hybrids, or a polymer comprising purine and/or pyrimidine bases or other natural, chemically modified, biochemically modified, non-natural, synthetic or derivatized nucleotide bases. In some embodiments, a nucleic acid can comprise a mixture of DNA, RNA and analogs thereof. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem.260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). [0109] The term “single nucleotide polymorphism” or “SNP” refers to a change of a single nucleotide within a polynucleotide, including within an allele. This can include the replacement of one nucleotide by another, as well as the deletion or insertion of a single nucleotide. Most typically, SNPs are biallelic markers although tri- and tetra-allelic markers can also exist. By way of non-limiting example, a nucleic acid molecule comprising SNP A\C may include a C or A at the polymorphic position. [0110] The term “gene” means the segment of DNA involved in producing a polypeptide chain. 
The DNA segment may include regions preceding and following the coding region (leader and trailer) involved in the transcription/translation of the gene product and the regulation of the transcription/translation, as well as intervening sequences (introns) between individual coding segments (exons). [0111] The term “cassette” refers to a combination of genetic sequence elements that may be introduced as a single element and may function together to achieve a desired result. A
cassette typically comprises polynucleotides in combinations that are not found in nature. A cassette can be inserted into a vector, such as an expression vector. [0112] The term “operably linked” refers to two or more genetic elements, such as a polynucleotide coding sequence and a promoter, placed in relative positions that permit the proper biological functioning of the elements, such as the promoter directing transcription of the coding sequence. [0113] The term “inducible promoter” refers to a promoter that responds to environmental factors and/or external stimuli that can be artificially controlled in order to modify the expression of, or the level of expression of, a polynucleotide sequence or refers to a combination of elements, for example an exogenous promoter and an additional element such as a trans-activator operably linked to a separate promoter. An inducible promoter may respond to abiotic factors such as oxygen levels or to chemical or biological molecules. In some embodiments, the chemical or biological molecules may be molecules not naturally present in humans. [0114] The terms “vector” and “expression vector” refer to a nucleic acid construct, generated recombinantly or synthetically, with a series of specified nucleic acid elements that permit transcription of a particular polynucleotide sequence in a host cell. An expression vector may be part of a plasmid, viral genome, or nucleic acid fragment. Typically, an expression vector includes a polynucleotide to be transcribed, operably linked to a promoter. The term “promoter” is used herein to refer to an array of nucleic acid control sequences that direct transcription of a nucleic acid. As used herein, a promoter includes necessary nucleic acid sequences near the start site of transcription, such as, in the case of a polymerase II type promoter, a TATA element. 
A promoter also optionally includes distal enhancer or repressor elements, which can be located as much as several thousand base pairs from the start site of transcription. Other elements that may be present in an expression vector include those that enhance transcription (e.g., enhancers) and terminate transcription (e.g., terminators). [0115] “Recombinant” refers to a genetically modified polynucleotide, polypeptide, cell, tissue, or organism. For example, a recombinant polynucleotide (or a copy or complement of a recombinant polynucleotide) is one that has been manipulated using well known methods. A recombinant expression cassette comprising a promoter operably linked to a second
polynucleotide (e.g., a coding sequence) can include a promoter that is heterologous to the second polynucleotide as the result of human manipulation (e.g., by methods described in Sambrook et al., Molecular Cloning - A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, (1989) or Current Protocols in Molecular Biology Volumes 1-3, John Wiley & Sons, Inc. (1994-1998)). A recombinant expression cassette (or expression vector) typically comprises polynucleotides in combinations that are not found in nature. For instance, human manipulated restriction sites or plasmid vector sequences can flank or separate the promoter from other sequences. A recombinant protein is one that is expressed from a recombinant polynucleotide, and recombinant cells, tissues, and organisms are those that comprise recombinant sequences (polynucleotide and/or polypeptide). [0116] As used herein, the term “heterologous” refers to biological material that is introduced, inserted, or incorporated into a recipient (e.g., host) organism that originates from another organism. Typically, the heterologous material that is introduced into the recipient organism (e.g., a host cell) is not normally found in that organism. Heterologous material can include, but is not limited to, nucleic acids, amino acids, peptides, proteins, and structural elements such as genes, promoters, and cassettes. A host cell can be, but is not limited to, a bacterium, a yeast cell, a mammalian cell, or a plant cell. The introduction of heterologous material into a host cell or organism can result, in some instances, in the expression of additional heterologous material in or by the host cell or organism. As a non-limiting example, the transformation of a yeast host cell with an expression vector that contains DNA sequences encoding a bacterial protein may result in the expression of the bacterial protein by the yeast cell. The incorporation of heterologous material may be permanent or transient. 
Also, the expression of heterologous material may be permanent or transient. [0117] The terms “reporter” and “selectable marker” can be used interchangeably and refer to a gene product that permits a cell expressing that gene product to be identified and/or isolated from a mixed population of cells. Such isolation might be achieved through the selective killing of cells not expressing the selectable marker, which may be, as a non- limiting example, an antibiotic resistance gene. Alternatively, the selectable marker may permit identification and/or subsequent isolation of cells expressing the marker as a result of the expression of a fluorescent protein such as GFP or the expression of a cell surface marker which permits isolation of cells by fluorescence-activated cell sorting (FACS), magnetic-
activated cell sorting (MACS), or analogous methods. Suitable cell surface markers include CD8, CD19, and truncated CD19. Preferably, cell surface markers used for isolating desired cells are non-signaling molecules, such as subunit or truncated forms of CD8, CD19, or CD20. Suitable markers and techniques are known in the art. [0118] The terms “culture,” “culturing,” “grow,” “growing,” “maintain,” “maintaining,” “expand,” “expanding,” etc., when referring to cell culture itself or the process of culturing, can be used interchangeably to mean that a cell (e.g., yeast cell) is maintained outside its normal environment under controlled conditions, e.g., under conditions suitable for survival. Cultured cells are allowed to survive, and culturing can result in cell growth, stasis, differentiation or division. The term does not imply that all cells in the culture survive, grow, or divide, as some may naturally die or senesce. Cells are typically cultured in media, which can be changed during the course of the culture. [0119] The terms “subject,” “individual,” and “patient” are used interchangeably herein to refer to a vertebrate, preferably a mammal, more preferably a human. Mammals include, but are not limited to, murines, simians, humans, farm animals, sport animals, and pets. Tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro are also encompassed. [0120] As used herein, the term “administering” includes oral administration, topical contact, administration as a suppository, intravenous, intraperitoneal, intramuscular, intralesional, intrathecal, intranasal, or subcutaneous administration to a subject. Administration is by any route, including parenteral and transmucosal (e.g., buccal, sublingual, palatal, gingival, nasal, vaginal, rectal, or transdermal). Parenteral administration includes, e.g., intravenous, intramuscular, intra-arteriole, intradermal, subcutaneous, intraperitoneal, intraventricular, and intracranial. 
Other modes of delivery include, but are not limited to, the use of liposomal formulations, intravenous infusion, transdermal patches, etc. [0121] The term “treating” refers to an approach for obtaining beneficial or desired results including, but not limited to, a therapeutic benefit and/or a prophylactic benefit. By therapeutic benefit is meant any therapeutically relevant improvement in or effect on one or more diseases, conditions, or symptoms under treatment. For prophylactic benefit, the
compositions may be administered to a subject at risk of developing a particular disease, condition, or symptom, or to a subject reporting one or more of the physiological symptoms of a disease, even though the disease, condition, or symptom may not have yet been manifested. [0122] The term “effective amount” or “sufficient amount” refers to the amount of an agent that is sufficient to effect beneficial or desired results. The therapeutically effective amount may vary depending upon one or more of: the subject and disease condition being treated, the weight and age of the subject, the severity of the disease condition, the manner of administration and the like, which can readily be determined by one of ordinary skill in the art. The specific amount may vary depending on one or more of: the particular agent chosen, the host cell type, the location of the host cell in the subject, the dosing regimen to be followed, whether it is administered in combination with other compounds, timing of administration, and the physical delivery system in which it is carried. [0123] The term “pharmaceutically acceptable carrier” refers to a substance that aids the administration of an active agent to a cell, an organism, or a subject. “Pharmaceutically acceptable carrier” refers to a carrier or excipient that can be included in the compositions of the disclosure and that causes no significant adverse toxicological effect on the patient. Non-limiting examples of pharmaceutically acceptable carriers include water, NaCl, normal saline solutions, lactated Ringer’s, normal sucrose, normal glucose, cell culture media, and the like. One of skill in the art will recognize that other pharmaceutical carriers are useful in the present disclosure. [0124] The term “degron” refers to a region or portion of a protein that regulates the rate of protein degradation. 
Degrons can be located anywhere in a protein, and can include short amino acid sequences, structural motifs, or exposed amino acids (e.g., lysine, arginine). Degrons exist in both prokaryotic and eukaryotic organisms. Degrons can be classified as being either ubiquitin-dependent or ubiquitin-independent. For additional information regarding degrons, see, e.g., Ravid, et al. Nat. Rev. Mol. Cell Biol. 9:679-690 (2008); incorporated herein by reference in its entirety for all purposes. [0125] The term “cellular localization tag” refers to an amino acid sequence, also known as a “protein localization signal,” that targets a protein for localization to a specific cellular or
subcellular region, compartment, or organelle (e.g., nuclear localization sequence, Golgi retention signal). Cellular localization tags are typically located at either the N-terminal or C-terminal end of a protein. A database of protein localization signals (LocSigDB) is maintained online by the University of Nebraska Medical Center (genome.unmc.edu/LocSigDB). For more information regarding cellular localization tags, see, e.g., Negi, et al. Database (Oxford). 2015: bav003 (2015); incorporated herein by reference in its entirety for all purposes. [0126] The term “synthetic response element” refers to a recombinant DNA sequence that is recognized by a transcription factor and facilitates gene regulation by various regulatory agents. A synthetic response element can be located within a gene promoter and/or enhancer region. [0127] The term “ribozyme” refers to an RNA molecule that is capable of catalyzing a biochemical reaction. In some instances, ribozymes function in protein synthesis, catalyzing the linking of amino acids in the ribosome. In other instances, ribozymes participate in various other RNA processing functions, such as splicing, viral replication, and tRNA biosynthesis. In some instances, ribozymes can be self-cleaving. Non-limiting examples of ribozymes include the HDV ribozyme, the Lariat capping ribozyme (formerly called GIR1 branching ribozyme), the glmS ribozyme, group I and group II self-splicing introns, the hairpin ribozyme, the hammerhead ribozyme, various rRNA molecules, RNase P, the twister ribozyme, the VS ribozyme, the pistol ribozyme, and the hatchet ribozyme. For more information regarding ribozymes, see, e.g., Doherty, et al. Ann. Rev. Biophys. Biomol. Struct. 30: 457-475 (2001); incorporated herein by reference in its entirety for all purposes. 
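Paragraph [0128] below defines percent similarity as the number of matched positions in the comparison window, divided by the total number of positions in that window, multiplied by 100. A minimal Python sketch of that arithmetic is given here. It assumes the two sequences have already been optimally aligned by some other tool and padded with "-" at gap positions; the function name and the gap symbol are illustrative assumptions, not part of the disclosure.

```python
def percent_similarity(aligned_test: str, aligned_ref: str, gap: str = "-") -> float:
    """Percent similarity over a comparison window of two pre-aligned sequences.

    Both strings must have the same length (the comparison window).
    A position counts as matched only when both sequences carry the same
    residue there; a gap never counts as a match.
    """
    if len(aligned_test) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_test:
        raise ValueError("comparison window is empty")
    matched = sum(
        1
        for a, b in zip(aligned_test.upper(), aligned_ref.upper())
        if a == b and a != gap
    )
    # matched positions / total positions in the window * 100
    return 100.0 * matched / len(aligned_test)


if __name__ == "__main__":
    # Five of six window positions match; one position is a gap.
    print(round(percent_similarity("ACGT-A", "ACGTTA"), 2))
```

Because gap positions stay in the denominator, insertions and deletions lower the reported percentage, which is consistent with dividing by the total number of positions in the window.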
[0128] “Percent similarity,” in the context of polynucleotide or peptide sequences, is determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the sequence (e.g., an msr locus sequence) in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence which does not comprise additions or deletions, for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleotide or amino acid occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of
comparison and multiplying the result by 100 to yield the percentage of similarity (e.g., sequence similarity). [0129] When a polynucleotide or peptide has at least about 70% similarity (e.g., sequence similarity), preferably at least about 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% similarity, to a reference sequence, when compared and aligned for maximum correspondence over a comparison window, or designated region as measured using one of the following sequence comparison algorithms or by manual alignment and visual inspection, such sequences are then said to be “substantially similar.” With regard to polynucleotide sequences, this definition also refers to the complement of a test sequence. [0130] For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are entered into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. Default program parameters can be used, or alternative parameters can be designated. The sequence comparison algorithm then calculates the percent sequence similarities for the test sequences relative to the reference sequence, based on the program parameters. For sequence comparison of nucleic acids and proteins, the BLAST and BLAST 2.0 algorithms and the default parameters discussed below are used. [0131] Methods of alignment of sequences for comparison are well-known in the art. Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math.2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol.48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat’l. Acad. Sci. 
USA 85:2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, WI), or by manual alignment and visual inspection (see, e.g., Current Protocols in Molecular Biology (Ausubel et al., eds.1995 supplement)). [0132] Additional examples of algorithms that are suitable for determining percent sequence similarity are the BLAST and BLAST 2.0 algorithms, which are described in
Altschul et al. (1990) J. Mol. Biol. 215: 403-410 and Altschul et al. (1997) Nucleic Acids Res. 25: 3389-3402, respectively. Software for performing BLAST analyses is publicly available at the National Center for Biotechnology Information website, ncbi.nlm.nih.gov. The algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0). The BLASTN program (for nucleotide sequences) uses as defaults a word size (W) of 28, an expectation (E) of 10, M=1, N=-2, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a word size (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see, e.g., Henikoff and Henikoff, Proc. Natl. Acad. Sci. USA 89:10915 (1992)). [0133] The BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin and Altschul, Proc. Nat’l. Acad. Sci. USA, 90:5873-5877 (1993)). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. 
For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.2, more preferably less than about 0.01, and most preferably less than about 0.001.
Detailed Description of the Embodiments
[0134] The present disclosure provides compositions and methods for simultaneously introducing genetic modifications at two different target loci in the genome of a host cell. The disclosure provides methods comprising the use of retron-guide RNA cassettes, vectors comprising said cassettes, and retron donor DNA-guide molecules of the present disclosure to
modify nucleic acids of interest at target loci of interest, and to screen genetic loci of interest, in the genomes of host cells. The present disclosure also provides compositions and methods for preventing or treating genetic diseases by enhancing precise genome editing to correct a mutation in target genes associated with the diseases. Kits for genome editing and screening are also provided. The present disclosure can be used with any cell type and at any gene locus that is amenable to nuclease-mediated genome editing technology.
A. The CRISPR-retron system
[0135] In one aspect, the present disclosure provides a retron-guide RNA (gRNA) cassette. In some embodiments, the cassette comprises:
(a) a first retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a first donor DNA sequence located within the msd locus, wherein the first donor DNA sequence comprises homology to one or more sequences within a first target locus; and (v) a second inverted repeat sequence coding region;
(b) a first guide RNA (gRNA) coding region;
(c) a second retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a second donor DNA sequence located within the second msd locus, wherein the second donor DNA sequence comprises homology to one or more sequences within a second target locus; and (v) a second inverted repeat sequence coding region; and
(d) a second guide RNA (gRNA) coding region.
[0136] In some embodiments, the first donor DNA sequence can introduce a genetic modification or edit at the first target locus. In some embodiments, the first and second donor DNA sequences can introduce genetic modifications or edits at the first and second target
loci. Thus, in some embodiments, the first donor DNA sequence comprises a genetic variant compared to the sequences within the first target locus. In some embodiments, the first and second donor DNA sequences comprise genetic variants compared to the sequences within the first and second target loci, respectively. The first and second donor DNA sequences can introduce genetic modifications at the first and second target loci by HDR. [0137] In some embodiments, the second donor DNA sequence comprises a sequence having a mutation (or edit) relative to the nucleic acid sequence of the second target locus. In some embodiments, the second donor DNA sequence comprises or further comprises a unique barcode sequence. In some embodiments, the second donor DNA sequence comprises both a mutation (or edit) relative to the nucleic acid sequence of the second target locus and a unique barcode sequence. Thus, the retron-guide RNA (gRNA) cassette can be used to introduce two mutations/edits at the first and second target loci, or to introduce two mutations/edits at the first and second target loci and a unique barcode sequence at the second target locus. In some embodiments, the mutations introduced by the first and second donor DNA sequences are different. [0138] In some embodiments, the barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus. Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with a defined Hamming distance between any pair of barcodes. 
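The barcode properties listed above (a fixed 12-bp length, exclusion of restriction sites, and a defined Hamming distance between any pair of barcodes) can be combined in a single design step. The following Python sketch does so with a simple greedy rejection sampler. It is offered only as one possible approach: the minimum distance of 3, the two example restriction sites (EcoRI and BamHI recognition sequences), the random seed, and all function names are assumptions chosen for illustration, not values taken from the disclosure.

```python
import random
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def design_barcodes(
    n: int,
    length: int = 12,
    min_distance: int = 3,                       # assumed threshold
    forbidden_sites=("GAATTC", "GGATCC"),        # assumed: EcoRI, BamHI
    seed: int = 0,
    max_tries: int = 100_000,
) -> list[str]:
    """Greedily pick n random barcodes that avoid the forbidden sites and
    differ from every previously accepted barcode at >= min_distance positions."""
    rng = random.Random(seed)
    chosen: list[str] = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if any(site in candidate for site in forbidden_sites):
            continue  # contains an excluded restriction site
        if all(hamming(candidate, kept) >= min_distance for kept in chosen):
            chosen.append(candidate)
    if len(chosen) < n:
        raise RuntimeError("could not find enough barcodes; relax the constraints")
    return chosen


if __name__ == "__main__":
    barcodes = design_barcodes(20)
    closest = min(hamming(a, b) for a, b in combinations(barcodes, 2))
    print(len(barcodes), "barcodes; smallest pairwise distance:", closest)
```

A minimum pairwise distance of d allows up to d - 1 sequencing substitutions in a barcode to be detected, which is one common reason for specifying the distance in advance.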
In some embodiments, the barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker. [0139] The compositions and methods described herein provide the ability to introduce two or more edits into the genome of a host cell, where a first edit at the first target locus causes a biological effect that can be monitored by measuring the second edit at the second target locus. For example, in some embodiments, the first edit comprises an eQTL variant edit that affects expression/transcription of a gene, which can be tracked by the RNA/DNA ratio of the second edit (e.g., by inserting a barcode sequence into the 3’UTR of the gene). In some
embodiments, the first edit at the first target locus affects the phenotype of a cell, such as cell physiology or growth, cultured in a media comprising a test compound or drug, where the phenotype can be monitored by determining the number of copies of a DNA barcode inserted at the second target locus measured at different timepoints during growth in the media comprising the test compound or drug. In some embodiments, the first edit at the first target locus introduces an amino acid variant in an enzyme, and the second edit inserts a barcode into a gene encoding a substrate of the enzyme. For example, in some embodiments, the first edit at the first target locus introduces an amino acid variant into a ubiquitin ligase that affects target protein translation, and the first edit can be tracked by sorting cells comprising a barcode and sequences encoding a detectable marker (such as green fluorescent protein (GFP)) integrated at the second target locus, e.g., in sequences encoding the C-terminus of a target protein. In this example, populations of cells expressing high GFP signal relative to the rest of the library indicate the first edit disrupts ubiquitin ligase activity. [0140] In some embodiments, the first or second gRNA coding region is upstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 5’ of the RNA transcribed from the retron. [0141] In some embodiments, transcription products of the retron and the gRNA coding region are physically coupled. In particular embodiments, the resulting gRNA and donor DNA sequences are also physically coupled (e.g., during genome editing and/or screening). In some embodiments, the transcription products are coupled during a single transcription event. 
In particular embodiments, the transcription products of the retron and the gRNA coding region are initially coupled, and then subsequently become uncoupled (e.g., after transcription of the retron, or after reverse transcription of the retron transcript), in which case the guide RNA and the donor DNA sequence will also be physically uncoupled during genome editing and/or screening. In some instances, uncoupling can be induced by a ribozyme. A non-limiting example of a suitable ribozyme is the hepatitis delta virus (HDV) ribozyme. In some embodiments, the cassette further comprises a ribozyme sequence (e.g., HDV ribozyme sequence). In some embodiments, the ribozyme sequence encodes a ribozyme selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead and combinations thereof.
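The barcode-based readouts described in paragraph [0139] (the RNA/DNA ratio of a barcode inserted in a 3’UTR, and the number of copies of a DNA barcode measured at different timepoints) reduce to simple arithmetic on per-barcode sequencing counts. The Python sketch below shows one way to compute them. The normalization to library frequencies, the pseudocount of 0.5, the log2 scale, and all names and example counts are assumptions made for illustration and are not specified in the disclosure.

```python
import math


def frequencies(counts: dict[str, int], pseudocount: float = 0.5) -> dict[str, float]:
    """Convert raw barcode counts to library-normalized frequencies."""
    total = sum(counts.values()) + pseudocount * len(counts)
    return {bc: (c + pseudocount) / total for bc, c in counts.items()}


def _paired(a: dict[str, int], b: dict[str, int], pseudocount: float):
    """Frequencies of both samples over the union of their barcodes."""
    barcodes = sorted(set(a) | set(b))
    fa = frequencies({bc: a.get(bc, 0) for bc in barcodes}, pseudocount)
    fb = frequencies({bc: b.get(bc, 0) for bc in barcodes}, pseudocount)
    return barcodes, fa, fb


def rna_dna_ratio(rna: dict[str, int], dna: dict[str, int],
                  pseudocount: float = 0.5) -> dict[str, float]:
    """Per-barcode RNA/DNA frequency ratio, a proxy for expression of the
    barcoded transcript relative to the barcode's abundance in the pool."""
    barcodes, f_rna, f_dna = _paired(rna, dna, pseudocount)
    return {bc: f_rna[bc] / f_dna[bc] for bc in barcodes}


def log2_fold_change(early: dict[str, int], late: dict[str, int],
                     pseudocount: float = 0.5) -> dict[str, float]:
    """Per-barcode log2 change in DNA barcode frequency between two timepoints."""
    barcodes, f_early, f_late = _paired(early, late, pseudocount)
    return {bc: math.log2(f_late[bc] / f_early[bc]) for bc in barcodes}


if __name__ == "__main__":
    # Hypothetical counts for two barcodes.
    print(rna_dna_ratio({"BC1": 200, "BC2": 100}, {"BC1": 100, "BC2": 100}))
    print(log2_fold_change({"BC1": 100, "BC2": 100}, {"BC1": 300, "BC2": 100}))
```

The pseudocount keeps the ratios defined when a barcode is absent from one sample; setting it to zero gives the plain frequency ratio but fails for barcodes with zero counts in the denominator.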
[0142] In some embodiments, transcription products of the retron and the gRNA coding region are not initially physically coupled (i.e., the transcription products are created in separate transcription events). As a non-limiting example, the retron and the gRNA coding region can be included in two different retron-gRNA cassettes, which can be included in the same vector or in different vectors. In some embodiments, expression from the vector(s) occurs inside a host cell. In other embodiments, transcription of the retron and/or the gRNA coding region occurs outside of the host cell, and then the transcription product(s) are introduced into the host cell. In some embodiments, the transcription products are created in separate transcription events and are subsequently joined together for genome editing and/or screening, in which case the resulting gRNA and donor DNA sequence will also be physically coupled for genome editing and/or screening. Such joining can occur before or after reverse transcription of the retron transcript (i.e., before or after creation of msDNA from the retron transcript). In some embodiments, the transcription products of the retron and the gRNA coding region result in a donor DNA sequence and a gRNA that are never physically coupled. In some instances, the retron and the gRNA coding region are located in different cassettes and the resulting donor DNA sequence and gRNA act in trans. [0143] In some embodiments, the gRNA coding region of the cassette is located 3’ of the retron. In other embodiments, the gRNA coding region is located 5’ of the retron. The relative positions of the gRNA coding region and retron may be selected, for example, based upon the particular nuclease being used. [0144] In some embodiments, the retron-gRNA cassette is at least about 5,000 nucleotides in length. 
In other embodiments, the retron-gRNA cassette is between about 1,000 and 5,000 (i.e., about 1,000, 1,100, 1,200, 1,300, 1,400, 1,500, 1,600, 1,700, 1,800, 1,900, 2,000, 2,100, 2,200, 2,300, 2,400, 2,500, 2,600, 2,700, 2,800, 2,900, 3,000, 3,100, 3,200, 3,300, 3,400, 3,500, 3,600, 3,700, 3,800, 3,900, 4,000, 4,100, 4,200, 4,300, 4,400, 4,500, 4,600, 4,700, 4,800, 4,900, or 5,000) nucleotides in length. In some other embodiments, the cassette is between about 300 and 1,000 (i.e., about 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 1,000) nucleotides in length. In particular embodiments, the cassette is between about 200 and 300 (i.e., about 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300) nucleotides in length. In other embodiments, the cassette is between about 30 and 200
(i.e., about 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, or 200) nucleotides in length. [0145] In other embodiments, the cassette further comprises one or more sequences having homology to a vector cloning site. These vector homology sequences can be about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more nucleotides in length. In some instances, the vector homology sequences are about 20 nucleotides in length. In other instances, the vector homology sequences are about 15 nucleotides in length. In yet other instances, the vector homology sequences are about 25 nucleotides in length. [0146] In a second aspect, the present disclosure provides a vector comprising a retron-guide RNA cassette of the present disclosure. In some embodiments, the vector further comprises a promoter. Preferably, the promoter is operably linked to the cassette. In particular embodiments, the promoter is inducible. In some instances, the promoter is an RNA polymerase II promoter. In other instances, the promoter is an RNA polymerase III promoter. In particular instances, a combination of promoters is used. In some other embodiments, the vector further comprises a terminator sequence. Vectors of the present disclosure can include commercially available recombinant expression vectors and fragments and variants thereof. Examples of suitable promoters and recombinant expression vectors are described herein and will also be known to one of skill in the art. [0147] Vectors of the present disclosure may further comprise a reverse transcriptase (RT) coding sequence and, optionally, may further comprise a nuclear localization sequence (NLS). In some instances, the NLS will be located 5’ of the RT coding sequence. [0148] Vectors of the present disclosure can further comprise a nuclease coding sequence. 
The sequence can encode Cas9, Cpf1, or any other suitable nuclease. Examples of suitable nucleases are provided herein and will also be known to one of skill in the art. [0149] When the vector includes an RT coding sequence and/or a nuclease coding sequence, expression of the retron-gRNA cassette and the RT coding sequence and/or the nuclease coding sequence can all be under the control of a single promoter. Alternatively, expression of the retron-gRNA cassette and the RT coding sequence and/or the nuclease
coding sequence can each be under the control of a different promoter. Other combinations are also possible. As a non-limiting example, expression of the retron-gRNA cassette can be under the control of one promoter, while expression of the RT coding sequence and/or the nuclease coding sequence are under the control of another promoter. As another non-limiting example, expression of the retron-gRNA cassette and expression of the RT coding sequence can be under the control of one promoter, while expression of the nuclease coding sequence can be under the control of another promoter. As yet another non-limiting example, expression of the retron-gRNA cassette and expression of the nuclease coding sequence can be under the control of one promoter, while the RT coding sequence is under the control of another promoter. In particular embodiments, one or more of the promoters are inducible. As a non-limiting example, the vector can comprise a retron-gRNA cassette under the control of a Gal7 promoter, an RT coding sequence under the control of a Gal10 promoter, and a nuclease (e.g., Cas9) coding sequence under the control of a Gal1 promoter. Non-limiting examples of other suitable promoters are described herein. In other embodiments, the vector contains a reporter unit that includes a nucleotide sequence encoding a reporter polypeptide (e.g., a detectable polypeptide, fluorescent polypeptide, or a selectable marker (e.g., URA3)). [0150] The size of the vector will depend on the size of the individual components within the vector, e.g., retron-gRNA cassette, RT coding sequence, nuclease coding sequence, NLS, and so on. 
In other embodiments, the vector is between about 1,000 and about 20,000 (i.e., about 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, 10,000, 10,500, 11,000, 11,500, 12,000, 12,500, 13,000, 13,500, 14,000, 14,500, 15,000, 15,500, 16,000, 16,500, 17,000, 17,500, 18,000, 18,500, 19,000, 19,500, or 20,000) nucleotides in length. In particular embodiments, the vector is more than about 20,000 nucleotides in length. [0151] Also provided in the present disclosure are molecules further comprising a multicopy single-stranded DNA (msDNA) molecule comprising RNA and DNA (e.g., following reverse transcription of the retron transcript, resulting in a branched hybrid RNA-DNA molecule). In some embodiments, the donor DNA sequence is physically coupled to the gRNA, by virtue of the msDNA being physically coupled to the gRNA. In some instances, at least some of the RNA content of the msDNA is degraded (e.g., by an RNase such as RNase H). In some embodiments, the donor DNA sequence and the gRNA are
initially coupled, and then are subsequently uncoupled (e.g., by cleavage of the msDNA from the gRNA). In some embodiments, the donor DNA sequence and the gRNA are never physically coupled. 1. Retrons [0152] Retrons have been known for some time as a class of retroelement, first discovered in gram-negative bacteria such as Myxococcus xanthus (e.g., retrons Mx65 and Mx162), Stigmatella aurantiaca (e.g., retron Sa163), and Escherichia coli (e.g., retrons Ec48, Ec67, Ec73, Ec78, Ec83, Ec86, and Ec107). Retrons are also found in Salmonella typhimurium (e.g., retron St85), Salmonella enteritidis, Vibrio cholerae (e.g., retron Vc95), Vibrio parahaemolyticus (e.g., retron Vp96), Klebsiella pneumoniae, Proteus mirabilis, Xanthomonas campestris, Rhizobium sp., Bradyrhizobium sp., Ralstonia metallidurans, Nannocystis exedens (e.g., retron Ne144), Geobacter sulfurreducens, Trichodesmium erythraeum, Nostoc punctiforme, Nostoc sp., Staphylococcus aureus, Fusobacterium nucleatum, and Flexibacter elegans. In one aspect, the present disclosure provides for retron-guide RNA cassettes that comprise a retron. In some embodiments, the retron is derived from the E. coli retron Ec86, which is shown in FIG. 2. [0153] Retrons mediate the synthesis in host cells of multicopy single-stranded DNA (msDNA) molecules, which result from the reverse transcription of a retron transcript and typically include a DNA component and an RNA component. The native msDNA molecules reportedly exist as single-stranded DNA-RNA hybrids, characterized by a structure which comprises a single-stranded DNA branching out of an internal guanosine residue of a single-stranded RNA molecule at a 2’,5’-phosphodiester linkage. In some embodiments of the present disclosure, at least some of the RNA content of the msDNA molecule is degraded. In some instances, the RNA content is degraded by RNase H. 
[0154] Native retrons have been found to consist of the gene for reverse transcriptase (RT) and msr and msd loci under the control of a single promoter. In some embodiments of the present disclosure, a vector comprising a retron-guide RNA cassette further comprises a sequence encoding an RT. In other embodiments, methods are provided wherein the RT is encoded on a separate plasmid from the retron-guide RNA cassette. In still other
embodiments, the RT is encoded in a sequence that has been integrated into the host cell genome. [0155] The msd region of a retron transcript typically codes for the DNA component of msDNA, and the msr region of a retron transcript typically codes for the RNA component of msDNA. In some retrons, the msr and msd loci have overlapping ends, and may be oriented opposite one another with a promoter located upstream of the msr locus which transcribes through the msr and msd loci. However, one of skill in the art will appreciate that the sequence of the msd locus will vary, depending on the particular donor DNA sequence that is located within the msd locus. [0156] The msd and msr regions of retron transcripts generally contain first and second inverted repeat sequences, which together make up a stable stem structure. The combined msr-msd region of the retron transcript serves not only as a template for reverse transcription but, by virtue of its secondary structure, also serves as a primer (i.e., self-priming) for msDNA synthesis by a reverse transcriptase. In some embodiments of retron-guide RNA cassettes of the present disclosure, the first inverted repeat sequence coding region is located within the 5’ end of the msr locus. In other embodiments, the second inverted repeat sequence coding region is located 3’ of the msd locus. In some embodiments of retron donor DNA-guide molecules of the present disclosure, the first inverted repeat sequence is located within the 5’ end of the msr region. In other embodiments, the second inverted repeat sequence is located 3’ of the msd region. A non-limiting example is shown in FIG.4, wherein the msr and msd loci are arranged in opposite orientations. The first inverted sequence repeat coding region is shown at the 5’ end of the cassette, while the second inverted sequence repeat coding region is shown near the 3’ end of the cassette. 
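The pairing requirement between the first and second inverted repeat sequences described above can be illustrated with a short Python sketch. It is a minimal illustration only, assuming strict Watson-Crick pairing (no G-U wobble pairs) and full-length complementarity; the function names and the example repeat sequence are hypothetical and are not taken from any retron of the present disclosure.

```python
# Minimal sketch: do two inverted repeats in a retron transcript pair
# into a stem? Assumes strict Watson-Crick pairing only (A-U, G-C).
# Function names and the example sequence are hypothetical.

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (5'->3')."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def can_form_stem(repeat_1: str, repeat_2: str) -> bool:
    """True if repeat_2 is the exact reverse complement of repeat_1."""
    return reverse_complement(repeat_1) == repeat_2.upper()


# Invented first inverted repeat and its matching partner.
first_repeat = "GCGCACCCUUA"
second_repeat = reverse_complement(first_repeat)
paired = can_form_stem(first_repeat, second_repeat)

# Changing only one repeat breaks the pairing ...
mutated_first = "A" + first_repeat[1:]
broken = can_form_stem(mutated_first, second_repeat)

# ... while a compensatory change in the counterpart restores it.
restored = can_form_stem(mutated_first, reverse_complement(mutated_first))
```

In this sketch a change to one repeat is tolerated only when the counterpart repeat is changed to match, which is the constraint on varying the inverted repeat sequence coding regions.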
[0157] One of ordinary skill in the art will understand that the sequence of an inverted repeat sequence coding region can be varied, so long as the sequence of the counterpart inverted repeat sequence coding region within the same retron is also varied such that the two resulting inverted repeat sequences (i.e., present within a retron transcript) are complementary and allow for the formation of a stable stem structure. [0158] Any number of RTs may be used in alternative embodiments of the present disclosure, including prokaryotic and eukaryotic RTs. If desired, the nucleotide sequence of
a native RT may be modified, for example using known codon optimization techniques, so that expression within the desired host is optimized. By codon optimization it is meant the selection of appropriate DNA nucleotides for the synthesis of oligonucleotide building blocks, and their subsequent enzymatic assembly, of a structural gene or fragment thereof in order to approach codon usage within the host. [0159] The RT may be targeted to the nucleus so that efficient utilization of the RNA template may take place. An example of such an RT includes any known RT, either prokaryotic or eukaryotic, fused to a nuclear localization sequence or signal (NLS). In some embodiments of vectors of the present disclosure, the vector further comprises an NLS. In particular embodiments of vectors of the present disclosure, the NLS is located 5’ of the RT coding sequence. Any suitable NLS may also be used, provided that the NLS assists in localizing the RT within the nucleus. An RT without an NLS may also be used if the RT is present within the nuclear compartment at a level that synthesizes a product from the RNA template. [0160] For more information regarding retrons, see, e.g., U.S. Pat. No. 8,932,860 and Lampson, et al. Cytogenet. Genome Res. 110:491-499 (2005); both incorporated herein by reference in their entirety for all purposes. 2. Guide RNA (gRNA) molecules [0161] The retron-guide RNA cassettes and retron donor DNA-guide molecules of the present disclosure comprise guide RNA (gRNA) coding regions and gRNA molecules, respectively. The gRNAs for use in the CRISPR-retron system of the present disclosure typically include a crRNA sequence that is complementary to a target nucleic acid sequence and may include a scaffold sequence (e.g., tracrRNA) that interacts with a Cas nuclease (e.g., Cas9) or a variant or fragment thereof, depending on the particular nuclease being used. 
[0162] The gRNA can comprise any nucleic acid sequence having sufficient complementarity with a target polynucleotide sequence (e.g., target DNA sequence) to hybridize with the target sequence and direct sequence-specific binding of a nuclease to the target sequence. The gRNA may recognize a protospacer adjacent motif (PAM) sequence that may be near or adjacent to the target DNA sequence. The target DNA site may lie
immediately 5’ of a PAM sequence, which is specific to the bacterial species of the Cas9 used. For instance, the PAM sequence of Streptococcus pyogenes-derived Cas9 is NGG; the PAM sequence of Neisseria meningitidis-derived Cas9 is NNNNGATT; the PAM sequence of Streptococcus thermophilus-derived Cas9 is NNAGAA; and the PAM sequence of Treponema denticola-derived Cas9 is NAAAAC. In some embodiments, the PAM sequence can be 5’-NGG, wherein N is any nucleotide; 5’-NRG, wherein N is any nucleotide and R is a purine; or 5’-NNGRR, wherein N is any nucleotide and R is a purine. For the S. pyogenes system, the selected target DNA sequence should immediately precede (i.e., be located 5’ of) a 5’-NGG PAM, wherein N is any nucleotide, such that the guide sequence of the DNA-targeting RNA (e.g., gRNA) base pairs with the opposite strand to mediate cleavage at about 3 base pairs upstream of the PAM sequence. [0163] In other instances, the target DNA site may lie immediately 3’ of a PAM sequence, e.g., when the Cpf1 endonuclease is used. In some embodiments, the PAM sequence is 5’-TTTN, where N is any nucleotide. When using the Cpf1 endonuclease, the target DNA sequence (i.e., the genomic DNA sequence having complementarity for the gRNA) will typically follow (i.e., be located 3’ of) the PAM sequence. Two Cpf1-family nucleases, AsCpf1 (from Acidaminococcus) and LbCpf1 (from Lachnospiraceae) are known to function in human cells. Both AsCpf1 and LbCpf1 cut 19 bp after the PAM sequence on the targeted strand and 23 bp after the PAM sequence on the opposite strand of the DNA molecule. [0164] In some embodiments, the degree of complementarity between a guide sequence of the gRNA (i.e., crRNA sequence) and its corresponding target sequence, when optimally aligned using a suitable alignment algorithm, is about or more than about 50%, 60%, 75%, 80%, 85%, 90%, 95%, 97.5%, 99%, or more. 
Optimal alignment may be determined with the use of any suitable algorithm for aligning sequences, non-limiting examples of which include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, algorithms based on the Burrows-Wheeler Transform (e.g., the Burrows-Wheeler Aligner), ClustalW, Clustal X, BLAT, Novoalign (Novocraft Technologies), ELAND (Illumina, San Diego, Calif.), SOAP (available at soap.genomics.org.cn), and Maq (available at maq.sourceforge.net). In some embodiments, a crRNA sequence is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 75, or more nucleotides in length. In some instances, a crRNA sequence is about 20 nucleotides in length. In other
instances, a crRNA sequence is about 15 nucleotides in length. In other instances, a crRNA sequence is about 25 nucleotides in length. [0165] The nucleotide sequence of a modified gRNA can be selected using any of the web-based software described above. Considerations for selecting a DNA-targeting RNA include the PAM sequence for the nuclease (e.g., Cas9 or Cpf1) to be used, and strategies for minimizing off-target modifications. Tools, such as the CRISPR Design Tool, can provide sequences for preparing the gRNA, for assessing target modification efficiency, and/or assessing cleavage at off-target sites. [0166] In some embodiments, the length of the gRNA molecule is about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, or more nucleotides. In some instances, the gRNA is about 100 nucleotides in length. In other instances, the gRNA is about 90 nucleotides in length. In other instances, the gRNA is about 110 nucleotides in length. 3. Donor DNA sequences [0167] In one aspect, the present disclosure provides retron-guide RNA cassettes comprising a retron that comprises a donor DNA sequence. In another aspect, the present disclosure provides retron donor DNA-guide molecules comprising retron transcripts that comprise donor DNA sequence coding regions, the retron transcripts subsequently being reverse transcribed to yield msDNA that comprises a donor DNA sequence. The donor DNA sequence or sequences participate in homology-directed repair (HDR) of genetic loci of interest following cleavage of genomic DNA at the genetic locus or loci of interest (i.e., after a nuclease has been directed to cut at a specific genetic locus of interest, targeted by binding of gRNA to a target sequence). 
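The relationship between the PAM, the nuclease cleavage site, and a donor DNA sequence used for HDR can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a design tool: it considers only S. pyogenes Cas9 with a 5’-NGG PAM, scans only the strand supplied, places the cut 3 base pairs upstream of the PAM as described above, and uses a made-up locus, arm length, and insert. The function names are hypothetical.

```python
import re

# Minimal sketch, assuming SpCas9 (5'-NGG PAM) and a 20-nt guide.
# Only the strand given is scanned; locus, arm length and insert are
# invented for illustration.


def find_spcas9_sites(seq, guide_len=20):
    """Return (protospacer_start, pam_start, cut_index) for each NGG PAM.

    cut_index is the position at which the strand is split, i.e. the
    cut falls between seq[cut_index - 1] and seq[cut_index], 3 bp
    upstream (5') of the PAM.
    """
    seq = seq.upper()
    sites = []
    # A lookahead is used so that overlapping PAMs are all reported.
    for match in re.finditer(r"(?=[ACGT]GG)", seq):
        pam_start = match.start()
        protospacer_start = pam_start - guide_len
        if protospacer_start < 0:
            continue  # not enough sequence 5' of the PAM for a guide
        sites.append((protospacer_start, pam_start, pam_start - 3))
    return sites


def build_donor(seq, cut_index, arm_len, insert):
    """Flank `insert` with homology arms taken from either side of the cut."""
    if cut_index - arm_len < 0 or cut_index + arm_len > len(seq):
        raise ValueError("homology arms extend past the supplied sequence")
    left_arm = seq[cut_index - arm_len:cut_index]
    right_arm = seq[cut_index:cut_index + arm_len]
    return left_arm + insert + right_arm


# Invented 48-nt locus containing a single TGG PAM.
locus = "ATGCTAGCTAACGTCATCGATCATC" + "TGG" + "CATTACGATCAGTCAGCTAA"
sites = find_spcas9_sites(locus)
protospacer_start, pam_start, cut_index = sites[0]
guide = locus[protospacer_start:pam_start]
donor = build_donor(locus, cut_index, arm_len=15, insert="GATTACA")
```

The sketch leaves the PAM intact in the donor and uses arms of equal length; both are simplifications chosen for brevity rather than recommendations.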
[0168] In some embodiments, the recombinant donor repair template (i.e., donor DNA sequence) comprises two homology arms that are homologous to portions of the sequence of the genetic locus of interest at either side of a Cas nuclease (e.g., Cas9 or Cpf1 nuclease) cleavage site. The homology arms may be the same length or may have different lengths. In some instances, each homology arm has at least about 70 to about 99 percent similarity (i.e.,
at least about 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99 percent similarity) to a portion of the sequence of the genetic locus of interest at either side of a nuclease (e.g., Cas nuclease) cleavage site. In other embodiments, the recombinant donor repair template comprises or further comprises a reporter unit that includes a nucleotide sequence encoding a reporter polypeptide (e.g., a detectable polypeptide, fluorescent polypeptide, or a selectable marker). If present, the two homology arms can flank the reporter cassette and are homologous to portions of the genetic locus of interest at either side of the Cas nuclease cleavage site. The reporter unit can further comprise a sequence encoding a self-cleavage peptide, one or more nuclear localization signals, and/or a fluorescent polypeptide (e.g., superfolder GFP (sfGFP)). Other suitable reporters are described herein. [0169] In some embodiments, the donor DNA sequence is at least about 500 to 10,000 (i.e., at least about 500, 600, 700, 800, 900, 1,000, 1,100, 1,200, 1,300, 1,400, 1,500, 1,600, 1,700, 1,800, 1,900, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, or 10,000) nucleotides in length. In some embodiments, the donor DNA sequence is between about 600 and 1,000 (i.e., about 600, 610, 620, 630, 640, 650, 660, 670, 680, 690, 700, 710, 720, 730, 740, 750, 760, 770, 780, 790, 800, 810, 820, 830, 840, 850, 860, 870, 880, 890, 900, 910, 920, 930, 940, 950, 960, 970, 980, 990, or 1,000) nucleotides in length. In some embodiments, the donor DNA sequence is between about 100 and 500 (i.e., about 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, or 500) nucleotides in length. 
In some embodiments, the donor DNA sequence is less than about 100 (i.e., less than about 100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, or 5) nucleotides in length. [0170] In some embodiments, the donor DNA sequence in the second retron comprises a barcode sequence. In some embodiments, the barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus. Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation
sequencing (NGS) related sequences with a defined Hamming distance between any pair of barcodes. In some embodiments, the barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker. B. Target Loci [0171] The compositions and methods of the disclosure can be used to introduce genetic modifications anywhere in the genomic or chromosomal DNA of a cell, or in exogenous (non-host cell) DNA, such as the DNA of transgenes, viruses or transposons. In some embodiments, the exogenous DNA is present in the nucleus of a host cell. In some embodiments the exogenous DNA is integrated into the host cell genomic DNA, for example as a transgene. In some embodiments, the compositions and methods of the disclosure can be used to modify a heterologous or exogenous genome, such as a viral genome, a bacterial genome, a transposable element or an endovirus genome that is not part of the endogenous host cell genome. In some embodiments, the compositions and methods of the disclosure can be used to modify a heterologous or exogenous genome of a pathogen, such as a virus or bacterium, that is present in the host cell. In some embodiments, the target locus is located in heterologous or exogenous DNA that is not integrated into the host cell genomic DNA, such as transiently expressed transgenes, episomes or plasmids. [0172] In some embodiments, the method identifies a genetic modification at a target locus within a genome of a host cell, where the genome comprises the endogenous genomic chromosomal DNA of the host cell. In some embodiments, the method identifies a genetic modification at a target locus anywhere within a genome of a host cell. In some embodiments, the target locus is located in an exogenous genome that is present in a host cell, such as a viral genome, a bacterial genome, a transposable element or an endovirus genome that is not part of the endogenous host cell genome. 
In some embodiments, the target locus is located in heterologous or exogenous DNA, such as the DNA of transgenes, viruses or transposons, that are present in the host cell or host cell nucleus. In some embodiments, the target locus is located in heterologous or exogenous DNA that is integrated into the host cell genomic DNA. In some embodiments, the target locus is located in heterologous or exogenous DNA that is not integrated into the host cell genomic DNA, such as transiently expressed transgenes, episomes or plasmids.
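The programmed barcodes described earlier (12-bp sequences that exclude common restriction sites and keep a defined Hamming distance between any pair) can be sketched in Python as below. This is a minimal illustration only: the greedy random search, the minimum distance of 3, the fixed random seed, and the three restriction sites listed (EcoRI, BamHI, HindIII) are assumptions made for the example and are not taken from the present disclosure. The function names are hypothetical.

```python
import random

# Minimal sketch of programmed barcode selection. The site list,
# minimum distance and greedy random search are illustrative
# assumptions. All three sites are palindromic, so checking one
# strand also covers the reverse complement.
RESTRICTION_SITES = ("GAATTC", "GGATCC", "AAGCTT")  # EcoRI, BamHI, HindIII


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def make_barcodes(n, length=12, min_dist=3, seed=0, max_tries=100_000):
    """Greedily collect n barcodes that avoid the listed sites and differ
    from every previously accepted barcode at >= min_dist positions."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if any(site in candidate for site in RESTRICTION_SITES):
            continue
        if all(hamming(candidate, b) >= min_dist for b in chosen):
            chosen.append(candidate)
    if len(chosen) < n:
        raise RuntimeError("could not find enough barcodes; relax constraints")
    return chosen


barcodes = make_barcodes(10)
```

With a minimum pairwise distance of 3, a single sequencing substitution in a barcode read still leaves it closer to its true barcode than to any other in the set, which is the usual motivation for fixing the Hamming distance.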
[0173] Thus, in some embodiments, the retron-guide RNA cassette comprises a first donor DNA sequence having homology to one or more sequences within a first target locus, and a second donor DNA sequence located within the second msd locus, wherein the second donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence, where the first and second target loci are located within the genomic DNA of a host cell. In some embodiments, the retron-guide RNA cassette comprises a first donor DNA sequence having homology to one or more sequences within a first target locus, and a second donor DNA sequence located within the second msd locus, wherein the second donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence, where the first and second target loci are located within exogenous or heterologous DNA that is present in a host cell or organism. In some embodiments, the first and second target loci are located within exogenous or heterologous DNA that is integrated in the host cell genomic DNA. In some embodiments, the first and second target loci are located within exogenous or heterologous DNA that is not integrated in the host cell genomic DNA. [0174] In some embodiments, the first target locus is located in cis to the second target locus. Thus, in some embodiments, the first and second target loci are located on the same chromosome, in the same gene, or adjacent to or within the same transcription unit. In some embodiments, the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located at a different position in the transcription unit. In some embodiments, the first target locus is located upstream or 5’ of a gene or transcription unit, and the second target locus is located downstream or 3’ of a gene or transcription unit. 
In some embodiments, the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in the 3’ untranslated region (UTR) of the same transcription unit. In some embodiments, the first and/or second target locus is located in an intron or non-coding RNA expressed by a gene. [0175] In some embodiments, the first donor DNA sequence in the retron cassette comprises a genetic variant, such as a single nucleotide polymorphism, missense mutation, synonymous mutation, nonsense mutation, insertion, or a deletion, relative to the sequence at the first target locus. In some embodiments, the genetic variant comprises a cis-expression quantitative trait locus (cis-eQTL) variant at the first target locus.
[0176] In some embodiments, the first target locus is located in trans to the second target locus. Thus, in some embodiments, the first and second target loci are located on different chromosomes or in different genes. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in a gene, or in a transcription unit that is in trans to the first target locus. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit in trans to the first target locus. [0177] In some embodiments, the first donor DNA sequence in the retron cassette comprises a genetic variant compared to the sequences within the first target locus. In some embodiments, the genetic variant comprises an amino acid change in a transcription factor that regulates the expression (e.g., transcription) of another gene or transcript. In some embodiments, the genetic variant comprises a mutation in a transcription factor binding site that modifies the expression of a gene or transcript located in cis or trans to the second target locus. In some embodiments, the genetic variant comprises a trans-expression quantitative trait locus (trans-eQTL) variant at the first target locus. [0178] In some embodiments, multiple rounds of genetic targeting are performed on the same pool of cells, or a single cell that has a genetic modification at a target locus. For example, the first round of genetic editing can introduce a genetic modification at a first target locus and a barcode sequence at a second target locus. In the second round, a second genetic modification can be introduced at the same first target locus or a different (third) target locus and a new genetic modification in the barcode sequence, or a new unique barcode sequence, is introduced at the second target locus. 
This process can be repeated to introduce consecutive unique barcode sequences that are associated with genetic modifications at each round. The consecutive barcodes can be identified by NGS or Sanger sequencing. Alternatively, the barcodes could encode different fluorescent markers and the combinations of markers can be determined by flow cytometry or fluorescence microscopy. Alternatively, the barcodes could encode different peptides and the combinations of peptides can be determined by mass spectrometry. [0179] In some embodiments, the second target locus corresponds to a region of the genome that is transcriptionally competent but is not likely to cause adverse effects on cells
resulting from mutated or inserted DNA, often referred to as “safe-harbors.” For example, in some embodiments, the second target locus is i) located in an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene. In some embodiments, the second target locus comprises the yeast S. cerevisiae YBR209W locus described in Levy SF, et al., Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature. 2015 Mar 12;519(7542):181-6. doi: 10.1038/nature14279. Epub 2015 Feb 25. PMID: 25731169; PMCID: PMC4426284, which described lineage tracing barcodes integrated into this locus. In some embodiments, the second target locus comprises the human AAVS1 locus (also known as the PPP1R12C locus) on chromosome 19. AAVS1 is a well-validated “safe harbor” for inserting DNA transgenes with expected function. It has an open chromatin structure and is transcription-competent. Most importantly, there are no known adverse effects on cells resulting from the inserted DNA fragment of interest. See the internet at www.genecopoeia.com/product/aavs1-safe-harbor/. C. CRISPR/Cas system [0180] The CRISPR/Cas system of genome modification includes a Cas nuclease (e.g., Cas9 or Cpf1 nuclease) or a variant or fragment or combination thereof and a DNA-targeting RNA (e.g., guide RNA (gRNA)). The gRNA may contain a guide sequence that targets the Cas nuclease to the target genomic DNA and a scaffold sequence that interacts with the Cas nuclease (e.g., tracrRNA). The system may optionally include a donor repair template. In other instances, a fragment of a Cas nuclease or a variant thereof with desired properties (e.g., capable of generating single- or double-strand breaks and/or modulating gene expression) can be used. 
The donor repair template can include a nucleotide sequence encoding a reporter polypeptide such as a fluorescent protein or an antibiotic resistance marker, and homology arms that are homologous to the target DNA and flank the site of gene modification. [0181] The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas (CRISPR-associated protein) nuclease system is an engineered nuclease system based on a bacterial system that can be used for genome engineering. It is based on part of the adaptive immune response of many bacteria and archaea. When a virus or plasmid invades a bacterium, segments of the invader’s DNA are converted into CRISPR RNAs (crRNA) by the “immune” response. The crRNA then associates, through a region of partial
complementarity, with another type of RNA called tracrRNA to guide the Cas (e.g., Cas9) nuclease to a region homologous to the crRNA in the target DNA called a “protospacer.” The Cas (e.g., Cas9) nuclease cleaves the DNA to generate blunt ends at the double-strand break at sites specified by a 20-nucleotide guide sequence contained within the crRNA transcript. The Cas (e.g., Cas9) nuclease may require both the crRNA and the tracrRNA for site-specific DNA recognition and cleavage. This system has now been engineered such that the crRNA and tracrRNA, if needed, can be combined into one molecule (the “single guide RNA” or “sgRNA”), and the crRNA equivalent portion of the guide RNA can be engineered to guide the Cas (e.g., Cas9) nuclease to target any desired sequence (see, e.g., Jinek et al. (2012) Science, 337:816-821; Jinek et al. (2013) eLife, 2:e00471; Segal (2013) eLife, 2:e00563). Thus, the CRISPR/Cas system can be engineered to create a double-strand break at a desired target in a genome of a cell, and harness the cell’s endogenous mechanisms to repair the induced break by homology-directed repair (HDR) or nonhomologous end-joining (NHEJ). [0182] The Cas nuclease can direct cleavage of one or both strands at a location in a target DNA sequence. For example, the Cas nuclease can be a nickase having one or more inactivated catalytic domains that cleaves a single strand of a target DNA sequence. [0183] Non-limiting examples of Cas nucleases include Cas1, Cas1B, Cas2, Cas3, Cas4, Cas5, Cas6, Cas7, Cas8, Cas9 (also known as Csn1 and Csx12), Cas10, Csy1, Csy2, Csy3, Cse1, Cse2, Csc1, Csc2, Csa5, Csn2, Csm2, Csm3, Csm4, Csm5, Csm6, Cmr1, Cmr3, Cmr4, Cmr5, Cmr6, Csb1, Csb2, Csb3, Csx17, Csx14, Csx10, Csx16, CsaX, Csx3, Csx1, Csx15, Csf1, Csf2, Csf3, Csf4, Cpf1, homologs thereof, variants thereof, fragments thereof, mutants thereof, derivatives thereof, and combinations thereof. 
There are three main types of Cas nucleases (type I, type II, and type III), and 10 subtypes including 5 type I, 3 type II, and 2 type III proteins (see, e.g., Hochstrasser and Doudna, Trends Biochem Sci, 2015:40(1):58-66). Type II Cas nucleases include Cas1, Cas2, Csn2, Cas9, and Cpf1. These Cas nucleases are known to those skilled in the art. For example, the amino acid sequence of the Streptococcus pyogenes wild-type Cas9 polypeptide is set forth, e.g., in NCBI Ref. Seq. No. NP_269215, and the amino acid sequence of Streptococcus thermophilus wild-type Cas9 polypeptide is set forth, e.g., in NCBI Ref. Seq. No. WP_011681470. Furthermore, the amino acid sequence of Acidaminococcus sp. BV3L6 is set forth, e.g., in NCBI Ref. Seq. No.
WP_021736722.1. Some CRISPR-related endonucleases that are useful in the present disclosure are disclosed, e.g., in U.S. Application Publication Nos. 2014/0068797, 2014/0302563, and 2014/0356959. [0184] Cas nucleases, e.g., Cas9 polypeptides, can be derived from a variety of bacterial species including, but not limited to, Veillonella atypica, Fusobacterium nucleatum, Filifactor alocis, Solobacterium moorei, Coprococcus catus, Treponema denticola, Peptoniphilus duerdenii, Catenibacterium mitsuokai, Streptococcus mutans, Listeria innocua, Staphylococcus pseudintermedius, Acidaminococcus intestini, Olsenella uli, Oenococcus kitaharae, Bifidobacterium bifidum, Lactobacillus rhamnosus, Lactobacillus gasseri, Finegoldia magna, Mycoplasma mobile, Mycoplasma gallisepticum, Mycoplasma ovipneumoniae, Mycoplasma canis, Mycoplasma synoviae, Eubacterium rectale, Streptococcus thermophilus, Eubacterium dolichum, Lactobacillus coryniformis subsp. torquens, Ilyobacter polytropus, Ruminococcus albus, Akkermansia muciniphila, Acidothermus cellulolyticus, Bifidobacterium longum, Bifidobacterium dentium, Corynebacterium diphtheriae, Elusimicrobium minutum, Nitratifractor salsuginis, Sphaerochaeta globosa, Fibrobacter succinogenes subsp. succinogenes, Bacteroides fragilis, Capnocytophaga ochracea, Rhodopseudomonas palustris, Prevotella micans, Prevotella ruminicola, Flavobacterium columnare, Aminomonas paucivorans, Rhodospirillum rubrum, Candidatus Puniceispirillum marinum, Verminephrobacter eiseniae, Ralstonia syzygii, Dinoroseobacter shibae, Azospirillum, Nitrobacter hamburgensis, Bradyrhizobium, Wolinella succinogenes, Campylobacter jejuni subsp. jejuni, Helicobacter mustelae, Bacillus cereus, Acidovorax ebreus, Clostridium perfringens, Parvibaculum lavamentivorans, Roseburia intestinalis, Neisseria meningitidis, Pasteurella multocida subsp. 
multocida, Sutterella wadsworthensis, proteobacterium, Legionella pneumophila, Parasutterella excrementihominis, Wolinella succinogenes, and Francisella novicida. [0185] “Cpf1” refers to an RNA-guided double-stranded DNA-binding nuclease protein that is a type II Cas nuclease. Wild-type Cpf1 contains a RuvC-like endonuclease domain similar to the RuvC domain of Cas9, but does not have an HNH endonuclease domain and the N-terminal region of Cpf1 does not have the alpha-helix recognition lobe possessed by Cas9. The wild-type protein requires a single RNA molecule, as no tracrRNA is necessary. Wild-type Cpf1 creates staggered-end cuts and utilizes a T-rich protospacer-adjacent motif
(PAM) that is 5’ of the guide RNA targeting sequence. Cpf1 enzymes have been isolated, for example, from Acidaminococcus and Lachnospiraceae. [0186] “Cas9” refers to an RNA-guided double-stranded DNA-binding nuclease protein or nickase protein that is a type II Cas nuclease. Wild-type Cas9 nuclease has two functional domains, e.g., RuvC and HNH, that cut different DNA strands. The wild-type enzyme requires two RNA molecules (e.g., a crRNA and a tracrRNA), or alternatively, a single fusion molecule (e.g., a gRNA comprising a crRNA and a tracrRNA). Wild-type Cas9 utilizes a G-rich protospacer-adjacent motif (PAM) that is 3’ of the guide RNA targeting sequence and creates double-strand cuts having blunt ends. Cas9 can induce double-strand breaks in genomic DNA (target DNA) when both functional domains are active. The Cas9 enzyme can comprise one or more catalytic domains of a Cas9 protein derived from bacteria belonging to the group consisting of Corynebacter, Sutterella, Legionella, Treponema, Filifactor, Eubacterium, Streptococcus, Lactobacillus, Mycoplasma, Bacteroides, Flaviivola, Flavobacterium, Sphaerochaeta, Azospirillum, Gluconacetobacter, Neisseria, Roseburia, Parvibaculum, Staphylococcus, Nitratifractor, and Campylobacter. In some embodiments, the two catalytic domains are derived from different bacteria species. [0187] Useful variants of the Cas9 nuclease can include a single inactive catalytic domain, such as a RuvC- or HNH- enzyme or a nickase. A Cas9 nickase has only one active functional domain and can cut only one strand of the target DNA, thereby creating a single-strand break or nick. A double-strand break can be introduced using a Cas9 nickase if at least two DNA-targeting RNAs that target opposite DNA strands are used. A double-nick-induced double-strand break can be repaired by NHEJ or HDR (Ran et al., 2013, Cell, 154:1380-1389). 
This gene editing strategy favors HDR and decreases the frequency of insertion/deletion (“indel”) mutations at off-target DNA sites. Non-limiting examples of Cas9 nucleases or nickases are described in, for example, U.S. Patent Nos. 8,895,308; 8,889,418; and 8,865,406 and U.S. Application Publication Nos. 2014/0356959, 2014/0273226 and 2014/0186919. The Cas9 nuclease or nickase can be codon-optimized for the host cell or host organism. [0188] For genome editing methods, the Cas nuclease can be a Cas9 fusion protein such as a polypeptide comprising the catalytic domain of a restriction enzyme (e.g., FokI) linked to
dCas9. The FokI-dCas9 fusion protein (fCas9) can use two guide RNAs to bind to a single strand of target DNA to generate a double-strand break. [0189] In some embodiments, a nucleotide sequence encoding the Cas nuclease is present in a recombinant expression vector. In certain instances, the recombinant expression vector is a viral construct, e.g., a recombinant adeno-associated virus construct, a recombinant adenoviral construct, a recombinant lentiviral construct, etc. For example, viral vectors can be based on vaccinia virus, poliovirus, adenovirus, adeno-associated virus, SV40, herpes simplex virus, human immunodeficiency virus, and the like. A retroviral vector can be based on Murine Leukemia Virus, spleen necrosis virus, and vectors derived from retroviruses such as Rous Sarcoma Virus, Harvey Sarcoma Virus, avian leukosis virus, a lentivirus, human immunodeficiency virus, myeloproliferative sarcoma virus, mammary tumor virus, and the like. Useful expression vectors are known to those of skill in the art, and many are commercially available. The following vectors are provided by way of example for eukaryotic host cells: pXT1, pSG5, pSVK3, pBPV, pMSG, and pSVLSV40. However, any other vector may be used if it is compatible with the host cell. For example, useful expression vectors containing a nucleotide sequence encoding a Cas9 enzyme are commercially available from, e.g., Addgene, Life Technologies, Sigma-Aldrich, and Origene. [0190] Depending on the host cell and expression system used, any of a number of transcription and translation control elements, including promoter, transcription enhancers, transcription terminators, and the like, may be used in the expression vector. Useful promoters can be derived from viruses, or any organism, e.g., prokaryotic or eukaryotic organisms. Promoters may also be inducible (i.e., capable of responding to environmental factors and/or external stimuli that can be artificially controlled). 
Suitable promoters include, but are not limited to: RNA polymerase II promoters (e.g., pGAL7 and pTEF1), RNA polymerase III promoters (e.g., RPR-tetO, SNR52, and tRNA-tyr), the SV40 early promoter, mouse mammary tumor virus long terminal repeat (LTR) promoter; adenovirus major late promoter (Ad MLP); a herpes simplex virus (HSV) promoter, a cytomegalovirus (CMV) promoter such as the CMV immediate early promoter region (CMVIE), a Rous sarcoma virus (RSV) promoter, a human U6 small nuclear promoter (U6), an enhanced U6 promoter, a human H1 promoter (H1), etc. Suitable terminators include, but are not limited to, SNR52 and RPR terminator sequences, which can be used with transcripts created under the control
of a RNA polymerase III promoter. Additionally, various primer binding sites may be incorporated into a vector to facilitate vector cloning, sequencing, genotyping, and the like. As a non-limiting example, the Pci1-Up sequence can be incorporated. Other suitable promoter, enhancer, terminator, and primer binding sequences will readily be known to one of skill in the art. D. Methods for identifying genetic modifications at a target locus [0191] The disclosure also provides methods for identifying a genetic modification at a target locus within the genome of a host cell, or within a heterologous or exogenous genome or DNA present in a host cell. In some embodiments, the method comprises transforming the host cell with a vector comprising a retron guide cassette described herein. In some embodiments, the method is an in vitro method. In some embodiments, the method is an in vivo method. [0192] In some embodiments, the host cell or transformed progeny of the host cell express a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region. In some embodiments, the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell. In some embodiments, at least a portion of the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target locus and comprise sequence modifications compared to the sequences within the first target locus. In some embodiments, the first target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the first gRNA. 
In some embodiments, the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the first target locus within the genome. In some embodiments, at least a portion of the second retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences,
wherein the one or more donor DNA sequences are homologous to the second target locus. In some embodiments, the second target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the second gRNA. In some embodiments, the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert a unique barcode sequence at the second target locus. In some embodiments, the method comprises detecting the presence of the unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the first target locus, thereby identifying the genetic modification at the first target locus. [0193] In some embodiments, the first target locus is located in cis to the second target locus. Thus, in some embodiments, the first and second target loci are located on the same chromosome, in the same gene, or adjacent to or within the same transcription unit. In some embodiments, the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located at a different position in the transcription unit. In some embodiments, the first target locus is located upstream or 5’ of a gene or transcription unit, and the second target locus is located downstream or 3’ of a gene or transcription unit. In some embodiments, the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in the 3’ untranslated region (UTR) of the same transcription unit. In some embodiments, the first and/or second target locus is located in an intron or non-coding RNA expressed by a gene. [0194] In some embodiments, the first donor DNA sequence in the retron cassette comprises a genetic variant, such as a single nucleotide polymorphism, insertion, or a deletion, relative to the sequence at the first target locus. 
In some embodiments, the genetic variant comprises a cis-expression quantitative trait locus (cis-eQTL) variant at the first target locus. [0195] In some embodiments, the first target locus is located in trans to the second target locus. Thus, in some embodiments, the first and second target loci are located on different chromosomes or in different genes. In some embodiments, the first target locus is located in a trans-regulatory element, and the second target locus is located in a gene, or in a transcription unit that is in trans to the first target locus. In some embodiments, the first
target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit in trans to the first target locus. [0196] In some embodiments, the first donor DNA sequence in the retron cassette comprises a genetic variant compared to the sequences within the first target locus. In some embodiments, the genetic variant comprises an amino acid change in a transcription factor that regulates the expression (e.g., transcription) of another gene or transcript. In some embodiments, the genetic variant comprises a mutation in a transcription factor binding site that modifies the expression of a gene or transcript located in cis or trans to the second target locus. In some embodiments, the genetic variant comprises a trans-expression quantitative trait locus (trans-eQTL) variant at the first target locus. [0197] In some embodiments, the barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus. Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with defined Hamming distance between any pair of barcodes. In some embodiments, the barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker. 
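As an illustration of the programmed barcodes described in paragraph [0197], the following minimal Python sketch generates 12-bp barcodes that avoid a short list of restriction sites and keep a minimum pairwise Hamming distance. The site list (EcoRI, BamHI, HindIII), the minimum distance of 3, the random seed, and all function names are assumptions chosen for illustration; the disclosure does not specify them.

```python
import random

# Assumed exclusion list (EcoRI, BamHI, HindIII); not specified in the disclosure.
# All three sites are palindromic, so one strand check covers both strands.
RESTRICTION_SITES = ("GAATTC", "GGATCC", "AAGCTT")


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def has_site(seq, sites=RESTRICTION_SITES):
    """Return True if any excluded restriction site occurs in seq."""
    return any(site in seq for site in sites)


def design_barcodes(n, length=12, min_distance=3, seed=0, max_tries=100000):
    """Greedily pick n random barcodes that pass both filters."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == n:
            break
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if has_site(candidate):
            continue
        # Keep the candidate only if it is far enough from every accepted barcode.
        if all(hamming(candidate, b) >= min_distance for b in chosen):
            chosen.append(candidate)
    if len(chosen) < n:
        raise RuntimeError("could not find enough barcodes; relax the constraints")
    return chosen
```

A larger minimum distance makes barcodes more tolerant of sequencing errors but reduces the number of barcodes available at a given length.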
[0198] In some embodiments, the second target locus corresponds to a region of the genome that is transcriptionally competent but is not likely to cause adverse effects on cells resulting from mutated or inserted DNA, often referred to as “safe-harbors.” For example, in some embodiments, the second target locus is i) located in an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene. In some embodiments, the second target locus comprises the yeast S. cerevisiae YBR209W locus described in Levy SF, et al., Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature. 2015 Mar 12;519(7542):181-6. doi: 10.1038/nature14279. Epub 2015 Feb 25. PMID: 25731169; PMCID: PMC4426284, which described lineage tracing barcodes integrated into this locus. In some embodiments, the second target locus comprises the human AAVS1 (also known as the PPP1R12C locus) locus on chromosome 19.
[0199] In some embodiments, detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence. [0200] In some embodiments, the vector is no longer present in the host cell when detecting the presence of the unique barcode sequence. In some embodiments, the vector is not integrated in the genome of the host cell. In some embodiments, the vector can be lost from the host cell or its progeny by dilution during cell division. [0201] In some embodiments, the vector can be actively removed from the cell. For example, in some embodiments, the vector contains a gene that is toxic to the host cell. In some embodiments, the vector contains the URA3 marker gene and the cells are treated with 5-Fluoroorotic acid (5-FOA) to selectively cause toxicity to cells that retain the vector. In some embodiments, the vector can include a gene that can be used for counter-selection to kill host cells that retain the vector. See Mezzadra R, et al., A Traceless Selection: Counter-selection System That Allows Efficient Generation of Transposon and CRISPR-modified T-cell Products. Mol Ther Nucleic Acids. 2016;5(3):e298. Published 2016 Mar 22. doi:10.1038/mtna.2016.13. In some embodiments, the vector can encode surface markers that are expressed in vector-containing cells following the genetic edits, which can be immobilized by antibodies and discarded. The remaining post-edit cells that lost the transient vector can then be retained for later use. In some embodiments, the vector contains sequences that can be targeted by gRNA introduced to the cell post-editing to cut the DNA vector and expose it to exonuclease degradation. [0202] In some embodiments, greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the first target locus. 
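Detection of the barcode by sequencing, as described in paragraph [0199], can be sketched in Python as follows. The sketch assumes that reads span the barcode insertion site, that the barcode is bounded by two known flanking sequences, and that a table associating each barcode with its paired edit is available (for example, from paired sequencing of the plasmid library). The flank sequences, the 12-bp barcode length, and the function names are hypothetical.

```python
from collections import Counter


def extract_barcode(read, left_flank, right_flank, length=12):
    """Return the barcode found between the two flanks, or None if absent."""
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = start + length
    # Require the expected right flank immediately after the barcode.
    if read[end:end + len(right_flank)] != right_flank:
        return None
    return read[start:end]


def call_edits(reads, barcode_to_edit, left_flank, right_flank, length=12):
    """Count reads per inferred edit, using the barcode as a proxy for the edit."""
    counts = Counter()
    for read in reads:
        barcode = extract_barcode(read, left_flank, right_flank, length)
        if barcode in barcode_to_edit:
            counts[barcode_to_edit[barcode]] += 1
    return counts
```

Exact matching is used here for simplicity; a practical pipeline would probably tolerate mismatches up to a threshold below the barcode set's minimum Hamming distance.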
[0203] In some embodiments, the method steps are repeated by transforming the host cell or progeny thereof with a second vector comprising a second retron-guide RNA cassette to introduce a second pair or combination of edits into the genome of the host cell. This allows multiple edits to be tracked in the same cell or clonal population of transformed cells by detecting the presence and/or expression of the different barcodes inserted into the genome of the host cell. Thus, in some embodiments, the method further comprises transforming the
host cell or progeny thereof with a second vector comprising a second retron-guide RNA cassette comprising: a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region; a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target locus and a second unique barcode sequence; and (v) a second inverted repeat sequence coding region; and a fourth guide RNA (gRNA) coding region. [0204] In some embodiments, the host cell expresses a third retron donor DNA-guide molecule comprising a third retron transcript and the third gRNA coding region and a fourth retron donor DNA-guide molecule comprising a fourth retron transcript and the fourth gRNA coding region. In some embodiments, the third and fourth retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell. In some embodiments, at least a portion of the third retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the third target locus and comprise sequence modifications compared to the sequences within the third target locus. In some embodiments, the third target locus is cut by a nuclease expressed by the host cell or
transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the third gRNA. In some embodiments, the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the third target locus within the genome. In some embodiments, at least a portion of the fourth retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the fourth target locus. In some embodiments, the fourth target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the fourth gRNA. In some embodiments, the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert the second unique barcode sequence at the fourth target locus. In some embodiments, the method comprises detecting the presence of the second unique barcode sequence, wherein the presence of the second unique barcode sequence indicates the presence of the genetic modification at the third target locus, thereby identifying the genetic modification at the third target locus. [0205] As above, in some embodiments, the third target locus is located in cis to the fourth target locus. Thus, in some embodiments, the third and fourth target loci are located on the same chromosome, in the same gene, or adjacent to or within the same transcription unit. In some embodiments, the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located at a different position in the transcription unit. 
In some embodiments, the third target locus is located upstream or 5’ of a gene or transcription unit, and the fourth target locus is located downstream or 3’ of a gene or transcription unit. In some embodiments, the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the same transcription unit. In some embodiments, the third and/or fourth target locus is located in an intron or non-coding RNA expressed by a gene. [0206] In some embodiments, the third donor DNA sequence in the second retron-guide RNA cassette comprises a genetic variant, such as a single nucleotide polymorphism, insertion, or a deletion, relative to the sequence at the third target locus. In some
embodiments, the genetic variant comprises a cis-expression quantitative trait locus (cis-eQTL) variant at the third target locus. [0207] In some embodiments, the third target locus is located in trans to the fourth target locus. Thus, in some embodiments, the third and fourth target loci are located on different chromosomes or in different genes. In some embodiments, the third target locus is located in a trans-regulatory element, and the fourth target locus is located in a gene, or in a transcription unit that is in trans to the third target locus. In some embodiments, the third target locus is located in a trans-regulatory element, and the fourth target locus is located in the 3’ untranslated region (UTR) of a transcription unit in trans to the third target locus. [0208] In some embodiments, the third donor DNA sequence in the retron cassette comprises a genetic variant compared to the sequences within the third target locus. In some embodiments, the genetic variant comprises an amino acid change in a transcription factor that regulates the expression (e.g., transcription) of another gene or transcript. In some embodiments, the genetic variant comprises a mutation in a transcription factor binding site that modifies the expression of a gene or transcript located in cis or trans to the second target locus. In some embodiments, the genetic variant comprises a trans-expression quantitative trait locus (trans-eQTL) variant at the first target locus. [0209] In some embodiments, the second unique barcode sequence comprises a defined sequence that can be distinguished from endogenous sequences by sequencing the target locus. 
Exemplary barcode sequences include random barcodes synthesized with poly-(N) tracts, which are added to the retron-sgRNA cassettes by PCR and associated with the first edit by paired sequencing of cloned plasmid libraries; programmed barcodes of 12-bp sequences that exclude common restriction sites; and retron, sgRNA or next-generation sequencing (NGS) related sequences with defined Hamming distance between any pair of barcodes. In some embodiments, the second unique barcode sequence encodes a detectable molecule, such as a fluorescent protein, a selectable marker, or a cell surface marker. In some embodiments, the second unique barcode sequence is different than the unique barcode sequence (i.e., the first unique barcode sequence) inserted at the second target locus. [0210] In some embodiments, the fourth target locus corresponds to a region of the genome that is transcriptionally competent but is not likely to cause adverse effects on cells resulting
from mutated or inserted DNA, often referred to as “safe-harbors.” For example, in some embodiments, the fourth target locus is i) located in an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene. In some embodiments, the fourth target locus comprises the yeast S. cerevisiae YBR209W locus described in Levy SF, et al., Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature. 2015 Mar 12;519(7542):181-6. doi: 10.1038/nature14279. Epub 2015 Feb 25. PMID: 25731169; PMCID: PMC4426284, which described lineage tracing barcodes integrated into this locus. In some embodiments, the second target locus comprises the human AAVS1 (also known as the PPP1R12C locus) locus on chromosome 19. [0211] In some embodiments, detecting the presence of the second unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence. [0212] In some embodiments, the second vector is no longer present in the host cell when detecting the presence of the unique barcode sequence. In some embodiments, the second vector is not integrated in the genome of the host cell. In some embodiments, the second vector can be lost from the host cell or its progeny by dilution during cell division. [0213] In some embodiments, greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the second barcode sequence and the sequence modifications compared to the sequences within the third target locus. [0214] In some embodiments, the methods further comprise detecting or determining the relative expression of transcription from the transcription units comprising genetic variants at the first and third target loci. The relative expression can be determined by quantifying the amount of the barcode sequence and determining the relative ratio of transcript sequences to barcode sequences. 
In some embodiments, the amount of the barcode sequence is measured by performing RT-qPCR assays using primers that amplify the barcode sequence. In some embodiments, the amount of the barcode sequence is determined by next generation sequencing (NGS). In some embodiments, transcript abundance is determined by measuring or quantifying the amount of a detectable marker encoded by the barcode.
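The ratio described in paragraph [0214] can be illustrated with a short Python sketch. It assumes that two count tables are available per barcode, one for transcript-derived sequences and one for barcode sequences (for example, from NGS read counts), and it optionally normalizes to a reference barcode. The input format, the optional normalization step, and the function name are assumptions made for illustration only.

```python
def relative_expression(transcript_counts, barcode_counts, reference=None):
    """Return transcript-to-barcode ratios, optionally scaled to a reference barcode."""
    ratios = {}
    for barcode, n_barcode in barcode_counts.items():
        if n_barcode <= 0:
            # A barcode with no counts cannot be used as a denominator.
            continue
        ratios[barcode] = transcript_counts.get(barcode, 0) / n_barcode
    if reference is not None:
        ref_ratio = ratios.get(reference)
        if not ref_ratio:
            raise ValueError("reference barcode has no usable ratio")
        ratios = {bc: r / ref_ratio for bc, r in ratios.items()}
    return ratios
```

Normalizing to a reference barcode, such as one paired with an unmodified allele, expresses each variant's effect as a fold change rather than a raw ratio.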
[0215] The current methods provide the advantage of using the same validated guide RNA, which provides predictability in making genetic modifications at a given target locus with high efficiency. This is in contrast with methods such as those described in Sharma, R., et al. “The TRACE-Seq method tracks recombination alleles and identifies clonal reconstitution dynamics of gene targeted human hematopoietic stem cells.” Nat Commun 12, 472 (2021). https://doi.org/10.1038/s41467-020-20792-y, which incorporate the genetic variant and barcode with one guide in a single editing event and are limited to using amino acid codon replacements as barcodes. The codon swap barcoding strategy also is not applicable to non-coding sequences where it is important to preserve all nucleotides. In contrast, the current methods allow insertion of the barcode sequence elsewhere in the genome, and do not interfere with the locus comprising the genetic variant edit. In addition, in applications where the disease variants are far apart, the TRACE-Seq method is less useful because all loci must be genotyped, which limits throughput. [0216] Thus, in some embodiments of the methods described herein, the first and third gRNAs are the same. In some embodiments, the first and third target loci are the same. In some embodiments, the genetic modifications or edits at the first and third loci are different. In some embodiments, the second and fourth gRNAs (that target the second and fourth target loci) are the same. In some embodiments, the first and third gRNAs are the same, and the second and fourth gRNAs are the same. In some embodiments, the second and fourth target loci are the same. In some embodiments, the barcode sequences inserted at the same target loci are different. In some embodiments, the barcode sequences inserted at the second and fourth target loci are different. 
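The pairing of edits and barcodes in paragraph [0216] can be represented as a small data structure. The following Python sketch is one hypothetical way to record each retron-guide RNA cassette and to check that cassettes reusing the same barcode gRNA carry different barcodes. The class name, field names, and lookup function are illustrative assumptions and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cassette:
    """Hypothetical record of one retron-guide RNA cassette."""
    edit_grna: str      # gRNA targeting the locus to be modified
    edit_donor: str     # donor sequence carrying the genetic variant
    barcode_grna: str   # gRNA targeting the barcode insertion locus
    barcode: str        # unique barcode inserted at that locus


def build_lookup(cassettes):
    """Map (barcode gRNA, barcode) to its cassette, rejecting duplicate barcodes."""
    lookup = {}
    for cassette in cassettes:
        key = (cassette.barcode_grna, cassette.barcode)
        if key in lookup:
            raise ValueError(f"barcode {cassette.barcode} reused at the same locus")
        lookup[key] = cassette
    return lookup
```

With such a table, a barcode observed at a given insertion locus identifies a single cassette and therefore a single intended edit.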
[0217] In some embodiments of the methods described herein, different guide RNAs are used to introduce different genetic modifications at different target loci, but the same guide RNA is used to introduce different barcodes at the same target locus. This allows the same validated gRNA to be used to insert the barcode sequence at the target locus with high efficiency. Thus, in some embodiments, the first and third gRNAs are different. In some embodiments, the first and third target loci are different. In some embodiments, the genetic modifications at the first and third loci are different. In some embodiments, the second and fourth gRNAs are the same. In some embodiments, the first and third gRNAs are different, and the second and fourth gRNAs are the same. In some embodiments, the second and fourth
target loci are the same. In some embodiments, the barcode sequences inserted at the second and fourth target loci are different. [0218] In some embodiments of the methods described herein, different guide RNAs are used to introduce different genetic modifications at different target loci, and different guide RNAs are used to introduce different barcode sequences at different target loci. Thus, in some embodiments, the first and third gRNAs are different, and the second and fourth gRNAs are different. In some embodiments, the first and third target loci are different, and the second and fourth target loci are different. In some embodiments, the genetic modifications at the first and third loci are different, and the barcode sequences inserted at the second and fourth target loci are different. [0219] In some embodiments, the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site. [0220] In some embodiments, the methods comprise detecting the presence of the unique barcode at the second target locus, thereby identifying the genetic modification at both the first and third target loci. [0221] In some embodiments, the methods are repeated with a third vector comprising a third retron-guide RNA cassette that inserts a genetic modification at a fifth target locus and a unique barcode sequence at a sixth target locus, thereby identifying the genetic modification at the fifth target locus. The methods can be repeated multiple times with vectors comprising different retron-guide RNA cassettes to insert additional genetic modifications at the same or different target loci and to introduce additional unique barcodes at specific loci in the host cell genome that can be used to track the corresponding genetic modifications. [0222] In some embodiments, the host cell is a prokaryotic cell. 
In some embodiments, the host cell is a eukaryotic cell, such as a yeast cell or mammalian cell. In some embodiments, the host cell comprises a clonal population of host cells. In some embodiments, the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells.
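As a minimal sketch of how detecting a unique barcode can identify the associated genetic modifications, the following Python example maps barcode sequences read at the barcode locus back to the edits delivered by the same retron-guide RNA cassette. The barcode sequences, edit labels, and function name are hypothetical placeholders chosen for illustration and are not part of the disclosure.

```python
from collections import Counter

# Hypothetical lookup table: barcode sequence -> edits delivered by the same cassette.
BARCODE_TO_EDITS = {
    "ACGTACGTAC": ["locus_1:A123T"],
    "TTGACCGTAA": ["locus_3:del5"],
}


def identify_edits(observed_barcodes, table=BARCODE_TO_EDITS):
    """Count how many sequenced cells carry each edit, inferred from barcodes."""
    counts = Counter()
    for barcode in observed_barcodes:
        # Barcodes absent from the table are ignored (e.g., sequencing errors).
        for edit in table.get(barcode, []):
            counts[edit] += 1
    return dict(counts)
```

In this sketch, a barcode read from one cell stands in for sequencing the edited locus itself, which is the tracking role the barcode plays in the methods described above.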
[0223] In some embodiments, the methods comprise transforming a mixture of cells with one or more vectors comprising the first, second and/or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell. [0224] In some embodiments, the methods comprise detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette. The genetic modifications can be detected by sequencing the genomic DNA comprising the modification, or by detecting a change in one or more phenotypes expressed by the host cell or organism comprising the host cell. The presence of the unique barcode sequence can be detected by sequencing the genomic DNA comprising the barcode sequence, or by detecting a protein or detectable marker encoded by the barcode sequence. E. Methods for introducing nucleic acids into host cells [0225] Methods for introducing polypeptides and nucleic acids into a host cell are known in the art, and any known method can be used to introduce a nuclease or a nucleic acid (e.g., a nucleotide sequence encoding the nuclease or reverse transcriptase, a DNA-targeting RNA (e.g., a guide RNA), a donor repair template for homology-directed repair (HDR), etc.) into a cell. Non-limiting examples of suitable methods include electroporation, viral or bacteriophage infection, transfection, conjugation, protoplast fusion, lipofection, calcium phosphate precipitation, polyethyleneimine (PEI)-mediated transfection, DEAE-dextran mediated transfection, liposome-mediated transfection, particle gun technology, direct microinjection, nanoparticle-mediated nucleic acid delivery, and the like. [0226] In some embodiments, the components of the CRISPR-retron system can be introduced into a cell using a delivery system. 
In certain instances, the delivery system comprises a nanoparticle, a microparticle (e.g., a polymer micropolymer), a liposome, a micelle, a virosome, a viral particle, a nucleic acid complex, a transfection agent, an electroporation agent (e.g., using a NEON transfection system), a nucleofection agent, a lipofection agent, and/or a buffer system that includes a nuclease component (as a polypeptide or encoded by an expression construct), a reverse transcriptase component, and
one or more nucleic acid components such as a DNA-targeting RNA (e.g., a guide RNA) and/or a donor repair template. For instance, the components can be mixed with a lipofection agent such that they are encapsulated or packaged into cationic submicron oil-in-water emulsions. Alternatively, the components can be delivered without a delivery system, e.g., as an aqueous solution. [0227] Methods of preparing liposomes and encapsulating polypeptides and nucleic acids in liposomes are described in, e.g., Methods and Protocols, Volume 1: Pharmaceutical Nanocarriers: Methods and Protocols. (ed. Weissig). Humana Press, 2009 and Heyes et al. (2005) J Controlled Release 107:276-87. Methods of preparing microparticles and encapsulating polypeptides and nucleic acids are described in, e.g., Functional Polymer Colloids and Microparticles volume 4 (Microspheres, microcapsules & liposomes). (eds. Arshady & Guyot). Citus Books, 2002 and Microparticulate Systems for the Delivery of Proteins and Vaccines. (eds. Cohen & Bernstein). CRC Press, 1996. F. Host cells [0228] In a particular aspect, the present disclosure provides host cells that have been transformed by vectors of the present disclosure. The compositions and methods of the present disclosure can be used for genome editing of any host cell of interest. The host cell can be a cell from any organism, e.g., a bacterial cell, an archaeal cell, a cell of a single-cell eukaryotic organism, a plant cell (e.g., a rice cell, a wheat cell, a tomato cell, an Arabidopsis thaliana cell, a Zea mays cell and the like), an algal cell (e.g., Botryococcus braunii, Chlamydomonas reinhardtii, Nannochloropsis gaditana, Chlorella pyrenoidosa, Sargassum patens C. 
Agardh, and the like), a fungal cell (e.g., yeast cell, etc.), an animal cell, a cell from an invertebrate animal (e.g., fruit fly, cnidarian, echinoderm, nematode, etc.), a cell from a vertebrate animal (e.g., fish, amphibian, reptile, bird, mammal, etc.), a cell from a mammal, a cell from a human, a cell from a healthy human, a cell from a human patient, a cell from a cancer patient, etc. In some cases, the host cell treated by the method disclosed herein can be transplanted to a subject (e.g., patient). For instance, the host cell can be derived from the subject to be treated (e.g., patient). [0229] Any type of cell may be of interest, such as a stem cell, e.g., embryonic stem cell, induced pluripotent stem cell, adult stem cell, e.g., mesenchymal stem cell, neural stem cell,
hematopoietic stem cell, organ stem cell, a progenitor cell, a somatic cell, e.g., fibroblast, hepatocyte, heart cell, liver cell, pancreatic cell, muscle cell, skin cell, blood cell, neural cell, immune cell, and any other cell of the body, e.g., human body. The cells can be primary cells or primary cell cultures derived from a subject, e.g., an animal subject or a human subject, and allowed to grow in vitro for a limited number of passages. In some embodiments, the cells are disease cells or derived from a subject with a disease. For instance, the cells can be cancer or tumor cells. The cells can also be immortalized cells (e.g., cell lines), for instance, from a cancer cell line. [0230] Cells can be harvested from a subject by any standard method. For instance, cells from tissues, such as skin, muscle, bone marrow, spleen, liver, kidney, pancreas, lung, intestine, stomach, etc., can be harvested by a tissue biopsy or a fine needle aspirate. Blood cells and/or immune cells can be isolated from whole blood, plasma or serum. In some cases, suitable primary cells include peripheral blood mononuclear cells (PBMC), peripheral blood lymphocytes (PBL), and other blood cell subsets such as, but not limited to, a T cell, a natural killer cell, a monocyte, a natural killer T cell, a monocyte-precursor cell, a hematopoietic stem cell or a non-pluripotent stem cell. In some cases, the cell can be any immune cell, including any T-cell such as tumor infiltrating cells (TILs), CD3+ T-cells, CD4+ T-cells, CD8+ T-cells, or any other type of T-cell. The T cell can also include memory T cells, memory stem T cells, or effector T cells. The T cells can also be skewed towards particular populations and phenotypes. For example, the T cells can be skewed to phenotypically comprise CD45RO(-), CCR7(+), CD45RA(+), CD62L(+), CD27(+), CD28(+) and/or IL-7Rα(+). 
Suitable cells can be selected that comprise one or more markers selected from a list comprising: CD45RO(-), CCR7(+), CD45RA(+), CD62L(+), CD27(+), CD28(+) and/or IL-7Rα(+). Induced pluripotent stem cells can be generated from differentiated cells according to standard protocols described in, for example, U.S. Patent Nos. 7,682,828, 8,058,065, 8,530,238, 8,871,504, 8,900,871 and 8,791,248, the disclosures of which are herein incorporated by reference in their entirety for all purposes. [0231] In some embodiments, the host cell is in vitro. In other embodiments, the host cell is ex vivo. In yet other embodiments, the host cell is in vivo.
G. Methods for genome editing and screening, and assessing the efficiency and precision thereof [0232] In another aspect, the present disclosure provides a method for modifying one or more target nucleic acids of interest at one or more target loci within a genome of a host cell, or within a heterologous or exogenous genome or DNA present in a host cell. In some embodiments, the method comprises: (a) transforming the host cell with a vector of the present disclosure; and (b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a retron donor DNA-guide molecule comprising a retron transcript and a guide RNA (gRNA) molecule, wherein the retron transcript self-primes reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the one or more target loci and comprise sequence modifications compared to the one or more target nucleic acids, wherein the one or more target loci are cut by a nuclease expressed by the host cell or the transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the gRNA, and wherein the one or more donor DNA sequences recombine with the one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the one or more target loci within the genome. [0233] In some embodiments, the host cell is capable of expressing the RT prior to transforming the host cell with the vector. In some instances, the RT is encoded in a sequence that is integrated into the genome of the host cell. 
In other instances, the RT is encoded in a sequence on a separate plasmid. In other embodiments, the host cell is capable of expressing the RT at the same time as, or after, transforming the host cell with the vector.
In some instances, the RT is expressed from the vector. In other instances, the RT is encoded in a sequence on a separate plasmid. [0234] In other embodiments, the host cell is capable of expressing the nuclease (e.g., Cas9) prior to transforming the host cell with the vector. In some instances, the nuclease is encoded in a sequence that is integrated into the genome of the host cell. In other instances, the nuclease is encoded in a sequence on a separate plasmid. In other embodiments, the host cell is capable of expressing the nuclease at the same time as, or after, transforming the host cell with the vector. In some instances, the nuclease is expressed from the vector. In other instances, the nuclease is encoded in a sequence on a separate plasmid. [0235] In some embodiments, the vector comprises a retron-gRNA cassette that, when transcribed, yields a retron transcript and gRNA that are physically coupled. In such embodiments, the resulting donor DNA sequence within the msDNA and the gRNA can also be physically coupled. In particular embodiments, the retron transcript and gRNA subsequently become physically uncoupled (e.g., before or after reverse transcription of the retron transcript occurs). Physical uncoupling of the retron transcript and the gRNA can result from, for example, ribozyme cleavage (e.g., the retron-gRNA cassette also contains a ribozyme sequence). In such embodiments, the resulting donor DNA sequence within the msDNA and the gRNA will be physically uncoupled (e.g., during genome editing and/or screening). [0236] In some embodiments, the retron transcript and the gRNA are not initially physically coupled. In particular embodiments, the retron transcript and the gRNA are subsequently joined together. 
Transcription event(s) that result in the production of the retron transcript and/or gRNA can occur inside a host cell, outside of a host cell (e.g., followed by introduction of the retron transcript and/or gRNA into the host cell), or a combination thereof. In some embodiments, the one or more target nucleic acids of interest are modified by a donor DNA sequence (e.g., within a msDNA) and a gRNA that are never physically coupled. For example, the donor DNA sequence and the gRNA can be expressed from different cassettes (e.g., which are contained in the same vector or different vectors) and the donor DNA sequence and the gRNA can act in trans.
[0237] In yet another aspect, the present disclosure provides a method for screening one or more genetic loci of interest in a genome of a host cell, the method comprising: (a) modifying one or more target nucleic acids of interest at one or more target loci within the genome of the host cell according to a method of the present disclosure; (b) incubating the modified host cell under conditions sufficient to elicit a phenotype that is controlled by the one or more genetic loci of interest; (c) identifying the resulting phenotype of the modified host cell; and (d) determining that the identified phenotype was the result of the modifications made to the one or more target nucleic acids of interest at the one or more target loci of interest. [0238] To assess the efficiency and/or precision of genome editing (e.g., testing for whether an edit has been made and/or the accuracy of the edit), the target DNA can be analyzed by standard methods known to those in the art. For example, indel mutations can be identified by sequencing using the SURVEYOR® mutation detection kit (Integrated DNA Technologies, Coralville, IA) or the Guide-it™ Indel Identification Kit (Clontech, Mountain View, CA). Homology-directed repair (HDR) can be detected by PCR-based methods, and in combination with sequencing or RFLP analysis. Non-limiting examples of PCR-based kits include the Guide-it Mutation Detection Kit (Clontech) and the GeneArt® Genomic Cleavage Detection Kit (Life Technologies, Carlsbad, CA). Deep sequencing can also be used, particularly for a large number of samples or potential target/off-target sites. [0239] In some other embodiments, editing efficiency can be assessed by employing a reporter or selectable marker to examine the phenotype of an organism or a population of organisms. In some instances, the marker produces a visible phenotype, such as the color of an organism or population of organisms. 
As a non-limiting example, edits can be made that either restore or disrupt the function of metabolic pathways that confer a visible phenotype (e.g., a color) to the organism. In the scenario where a successful genome edit results in a color change in the target organism (e.g., because the edit disrupts a metabolic pathway that results in a color change or because the edit restores function in a pathway that results in a color change), the absolute number or the proportion of organisms or their progeny that exhibit a color change (e.g., an estimated or direct count of the number of organisms
exhibiting a color change divided by the total number of organisms for which the genomes were potentially edited) can serve as a measure of editing efficiency. In some instances, the phenotype is examined by growing the target organisms and/or their progeny under conditions that result in a phenotype, wherein the phenotype may not be visible under ordinary growth conditions. As a non-limiting example, growing yeast in a culture medium that is adenine deficient can lead to a particular phenotype (e.g., a color change) in yeast cells that possess a genetic defect in adenine synthesis. As such, growing yeast cells in adenine-deficient media can allow one to discern the effect of genome edits that putatively target adenine biosynthesis loci. [0240] In some embodiments, the reporter or selectable marker is a fluorescent tagged protein, an antibody, a labeled antibody, a chemical stain, a chemical indicator, or a combination thereof. In other embodiments, the reporter or selectable marker responds to a stimulus, a biochemical, or a change in environmental conditions. In some instances, the reporter or selectable marker responds to the concentration of a metabolic product, a protein product, a synthesized drug of interest, a cellular phenotype of interest, a cellular product of interest, or a combination thereof. A cellular product of interest can be, as a non-limiting example, an RNA molecule (e.g., messenger RNA (mRNA), long non-coding RNA (lncRNA), microRNA (miRNA)). [0241] Editing efficiency can also be examined or expressed as a function of time. For example, an editing experiment can be allowed to run for a fixed period of time (e.g., 24 or 48 hours) and the number of successful editing events in that fixed time period can be determined. Alternatively, the proportion of successful editing events can be determined for a fixed period of time. Typically, longer editing periods will result in a larger number of successful editing events. 
Editing experiments or procedures can run for any length of time. In some embodiments, a genome editing experiment or procedure runs for several hours (e.g., about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 hours). In other embodiments, a genome editing experiment or procedure runs for several days (e.g., about 1, 2, 3, 4, 5, 6, or 7 days).
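The proportion-based measure of editing efficiency described above (organisms exhibiting the phenotype divided by the total number of organisms whose genomes were potentially edited), evaluated at fixed editing periods, can be sketched as follows. This is an illustrative Python example only; the function name and the colony counts are hypothetical and are not data from the disclosure.

```python
def editing_efficiency(n_edited: int, n_total: int) -> float:
    """Fraction of potentially edited organisms that show the edit phenotype."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_edited <= n_total:
        raise ValueError("n_edited must be between 0 and n_total")
    return n_edited / n_total


# Hypothetical counts at two fixed editing periods: hours -> (edited, total).
time_course = {24: (45, 60), 48: (54, 60)}
efficiency_by_time = {
    hours: editing_efficiency(edited, total)
    for hours, (edited, total) in time_course.items()
}
```

With these placeholder counts the efficiency is higher at 48 hours than at 24 hours, matching the general expectation that longer editing periods yield more successful editing events.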
[0242] In addition to the length of time of the editing period, editing efficiency can be affected by the choice of gRNA, donor DNA sequence, the choice of promoter used, or a combination thereof. [0243] In other embodiments, editing efficiency is compared to a control efficiency. In some embodiments, the control efficiency is determined by running a genome editing experiment in which the retron transcript and gRNA molecule are never physically coupled, or are initially coupled but subsequently become uncoupled. In some instances, the retron transcript and gRNA molecule are initially coupled and then become uncoupled (e.g., by ribozyme cleavage). In other instances, the retron-guide RNA (gRNA) cassette is configured such that the transcript products of the retron and gRNA coding region are never physically coupled. In yet other instances, the retron transcript and gRNA are introduced into the host cell separately. In some instances, the methods and compositions of the present disclosure result in at least about a 1.3- to 3-fold (i.e., at least about a 1.3-, 1.4-, 1.5-, 1.6-, 1.7-, 1.8-, 1.9-, 2-, 2.1-, 2.2-, 2.3-, 2.4-, 2.5-, 2.6-, 2.7-, 2.8-, 2.9-, or 3-fold) increase in efficiency, compared to when the retron transcript and gRNA are not physically coupled during editing. In other instances, at least about a 3- to 10-fold (i.e., at least about a 3-, 4-, 5-, 6-, 7-, 8-, 9-, or 10-fold) increase in efficiency is produced, compared to when the retron transcript and gRNA are not physically coupled during editing. In particular instances, at least about a 10- to 100-fold (i.e., at least about 10-, 20-, 30-, 40-, 50-, 60-, 70-, 80-, 90-, or 100-fold) increase in efficiency is produced, compared to when the retron transcript and gRNA are not physically coupled during editing. [0244] Editing efficiency can also be improved by performing editing experiments or procedures in a multiplex format. 
In some embodiments, multiplexing comprises cloning two or more editing retron-gRNA cassettes in tandem into a single vector. In some instances, at least about 10 retron-gRNA cassettes (i.e., at least about 2, 3, 4, 5, 6, 7, 8, 9, or 10 retron-gRNA cassettes) are cloned into a single vector. [0245] In other embodiments, multiplexing comprises transforming a host cell with two or more vectors. Each vector can comprise one or multiple retron-gRNA cassettes. In some instances, at least about 10 vectors (i.e., at least about 2, 3, 4, 5, 6, 7, 8, 9, or 10 vectors) are used to transform an individual host cell.
[0246] In still other embodiments, multiplexing comprises transforming two or more individual host cells, each with a different vector or combination of vectors. In some instances, at least about 2 host cells (i.e., at least about 2, 3, 4, 5, 6, 7, 8, 9, or 10 host cells) are transformed. In other instances, between about 10 and 100 host cells (i.e., about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 host cells) are transformed. In still other instances, between about 100 and 1,000 host cells (i.e., about 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1,000 host cells) are transformed. In particular instances, between about 1,000 and 10,000 host cells (i.e., about 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, or 10,000 host cells) are transformed. In some other instances, between about 10,000 and 100,000 host cells (i.e., about 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 45,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, or 100,000 host cells) are transformed. In other instances, between about 100,000 and 1,000,000 host cells (i.e., about 100,000, 150,000, 200,000, 250,000, 300,000, 350,000, 400,000, 450,000, 500,000, 550,000, 600,000, 650,000, 700,000, 750,000, 800,000, 850,000, 900,000, 950,000 or 1,000,000 host cells) are transformed. In some instances, more than about 1,000,000 host cells are transformed. Also, multiple embodiments of multiplexing can be combined. [0247] By using one or a combination of the various multiplexing embodiments, it is possible to modify and/or screen any number of loci within a genome. In some instances, at least about 10 (i.e., about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10) genetic loci are modified or screened. 
In other instances, between about 10 and 100 (i.e., about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100) loci are modified or screened. In still other instances, between about 100 and 1,000 genetic loci (i.e., about 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 1,000 genetic loci) are modified or screened. In some other instances, between about 1,000 and 100,000 genetic loci (i.e., about 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 9,500, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 45,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, or 100,000 genetic loci) are modified or screened. In particular instances, between about 100,000 and 1,000,000 genetic loci (i.e., about 100,000, 150,000, 200,000, 250,000, 300,000,
350,000, 400,000, 450,000, 500,000, 550,000, 600,000, 650,000, 700,000, 750,000, 800,000, 850,000, 900,000, 950,000, or 1,000,000 genetic loci) are modified or screened. In certain instances, more than about 1,000,000 loci are screened. [0248] In some embodiments, the host cell comprises a population of host cells. In some instances, one or more sequence modifications are induced in at least about 20 percent (i.e., at least about 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 percent) of the population of cells. In other instances, one or more sequence modifications are induced in at least about 50 percent (i.e., at least about 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 65, 70, 75, 80, 85, 90, 95, or 100 percent) of the population of cells. In still other instances, one or more sequence modifications are induced in at least about 75 percent (i.e., at least about 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 95, or 100 percent) of the population of cells. In other instances, one or more sequence modifications are induced in at least about 90 percent (i.e., at least about 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 percent) of the population of cells. In particular instances, one or more sequence modifications are induced in at least about 95 percent (i.e., at least about 95, 96, 97, 98, 99, or 100 percent) of the population of cells. [0249] The precision of genome editing can correspond to the number or percentage of on-target genome editing events relative to the number or percentage of all genome editing events, including on-target and off-target events. Testing for on-target genome editing events can be accomplished by direct sequencing of the target region or other methods described herein. 
When employing the compositions and methods of the present disclosure, in some instances, editing precision is at least about 80 percent (i.e., at least about 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 95, or 100 percent), meaning that at least about 80 percent of all genome editing events are on-target editing events. In other instances, editing precision is at least about 90 percent (i.e., at least about 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 percent), meaning that at least about 90 percent of all genome editing events are on-target editing events. In some other instances, editing precision is at least about 95 percent (i.e., at least about 95, 96, 97, 98, 99, or 100 percent), meaning that at least about 95 percent of all genome editing events are on-target editing events. In particular instances, editing precision is at least about 99 percent (i.e., at least about 99 or 100 percent), meaning that at least 99 percent of all genome editing events are on-target editing events.
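Editing precision as defined above, i.e., on-target editing events relative to all editing events (on-target plus off-target), can be illustrated with a short Python sketch. The function name and the event counts are hypothetical and serve only to make the ratio concrete.

```python
def editing_precision(on_target: int, off_target: int) -> float:
    """Fraction of all observed editing events that are on-target."""
    if on_target < 0 or off_target < 0:
        raise ValueError("event counts must be non-negative")
    total = on_target + off_target
    if total == 0:
        raise ValueError("no editing events observed")
    return on_target / total


# Hypothetical deep-sequencing result: 99 on-target and 1 off-target event.
precision = editing_precision(99, 1)
meets_95_percent = precision >= 0.95
```

Expressed this way, a threshold such as "at least about 95 percent" becomes a simple comparison against the computed ratio.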
H. Methods for preventing or treating genetic diseases [0250] In another aspect, the present disclosure provides a pharmaceutical composition comprising: (a) a retron-guide RNA cassette of the present disclosure, a vector of the present disclosure, a retron donor-DNA guide molecule of the present disclosure, or a combination thereof; and (b) a pharmaceutically acceptable carrier. [0251] In yet another aspect of the present disclosure, provided herein is a method for preventing or treating a genetic disease in a subject, the method comprising administering to the subject an effective amount of a pharmaceutical composition of the present disclosure to correct a mutation in a target gene associated with the genetic disease. [0252] The compositions and methods of the present disclosure are suitable for any disease that has a genetic basis and is amenable to prevention or amelioration of disease-associated sequelae or symptoms by editing or correcting one or more genetic loci that are linked to the disease. Non-limiting examples of diseases include X-linked severe combined immune deficiency, sickle cell anemia, thalassemia, hemophilia, neoplasia, cancer, age-related macular degeneration, schizophrenia, trinucleotide repeat disorders, fragile X syndrome, prion-related disorders, amyotrophic lateral sclerosis, drug addiction, autism, Alzheimer’s disease, Parkinson’s disease, cystic fibrosis, blood and coagulation diseases and disorders, inflammation, immune-related diseases and disorders, metabolic diseases and disorders, liver diseases and disorders, kidney diseases and disorders, muscular/skeletal diseases and disorders, neurological and neuronal diseases and disorders, cardiovascular diseases and disorders, pulmonary diseases and disorders, and ocular diseases. The compositions and methods of the present disclosure can also be used to prevent or treat any combination of suitable genetic diseases. 
[0253] In some embodiments, the subject is treated before any symptoms or sequelae of the genetic disease develop. In other embodiments, the subject has symptoms or sequelae of the genetic disease. In some instances, treatment results in a reduction or elimination of the symptoms or sequelae of the genetic disease.
[0254] In some embodiments, treatment includes administering compositions of the present disclosure directly to a subject. As a non-limiting example, pharmaceutical compositions of the present disclosure can be delivered directly to a subject (e.g., by local injection or systemic administration). In other embodiments, the compositions of the present disclosure are delivered to a host cell or population of host cells, and then the host cell or population of host cells is administered or transplanted to the subject. The host cell or population of host cells can be administered or transplanted with a pharmaceutically acceptable carrier. In some instances, editing of the host cell genome has not yet been completed prior to administration or transplantation to the subject. In other instances, editing of the host cell genome has been completed when administration or transplantation occurs. In certain instances, progeny of the host cell or population of host cells are transplanted into the subject. In some embodiments, correct editing of the host cell or population of host cells, or the progeny thereof, is verified before administering or transplanting edited cells or the progeny thereof into a subject. Procedures for transplantation, administration, and verification of correct genome editing are discussed herein and will be known to one of skill in the art. [0255] Compositions of the present disclosure, including cells and/or progeny thereof that have had their genomes edited by the methods and/or compositions of the present disclosure, may be administered as a single dose or as multiple doses, for example two doses administered at an interval of about one month, about two months, about three months, about six months or about 12 months. Other suitable dosage schedules can be determined by a medical practitioner. [0256] Prevention or treatment can further comprise administering agents and/or performing procedures to prevent or treat concomitant or related conditions. 
As non-limiting examples, it may be necessary to administer drugs to suppress immune rejection of transplanted cells, or prevent or reduce inflammation or infection. A medical professional will readily be able to determine the appropriate concomitant therapies. I. Kits [0257] In another aspect, the present disclosure provides a kit for modifying one or more target nucleic acids of interest at one or more target loci within a genome of a host cell, or within a heterologous or exogenous genome or DNA present in a host cell, the kit comprising
one or a plurality of vectors or retron-guide RNA (gRNA) cassettes of the present disclosure. The kit may further comprise a host cell or a plurality of host cells that are recombinantly modified by the vectors or retron-guide RNA (gRNA) cassettes of the present disclosure. [0258] In some embodiments, the kit contains one or more reagents. In some instances, the reagents are useful for transforming a host cell with a vector or a plurality of vectors, and/or inducing expression from the vector or plurality of vectors. In other embodiments, the kit may further comprise a reverse transcriptase, a plasmid for expressing a reverse transcriptase, one or more nucleases, one or more plasmids for expressing one or more nucleases, or a combination thereof. The kit may further comprise one or more reagents useful for delivering nucleases or reverse transcriptases into the host cell and/or inducing expression of the reverse transcriptase and/or the one or more nucleases. In yet other embodiments, the kit further comprises instructions for transforming the host cell with the vector, introducing nucleases and/or reverse transcriptases into the host cell, inducing expression of the vector, reverse transcriptase, and/or nucleases, or a combination thereof. [0259] In yet another aspect, the present disclosure provides a kit for modifying one or more target nucleic acids of interest at one or more target loci in a host cell, the kit comprising one or a plurality of retron donor DNA-guide molecules of the present disclosure. The kit may further comprise a host cell or a plurality of host cells comprising genetic modifications introduced by the retron donor DNA-guide molecules of the present disclosure. [0260] In some embodiments, the kit contains one or more reagents. In some instances, the reagents are useful for introducing the retron donor DNA-guide molecule or plurality thereof into the host cell. 
In other embodiments, the kit may further comprise a reverse transcriptase, a plasmid for expressing a reverse transcriptase, one or more nucleases, one or more plasmids for expressing one or more nucleases, or a combination thereof. The kit may further comprise one or more reagents useful for delivering into the host cell reverse transcriptases and/or nucleases and/or inducing expression of the reverse transcriptase and/or the one or more nucleases. In yet other embodiments, the kit further comprises instructions for introducing the retron donor DNA-guide molecule or plurality thereof into the host cell, introducing nucleases and/or reverse transcriptases into the host cell, inducing expression of the reverse transcriptase and/or nucleases, or a combination thereof.
J. Applications [0261] The compositions and methods provided by the present disclosure are useful for any number of applications. As non-limiting examples, genome editing or screening according to the compositions and methods of the present disclosure can be used for cell lineage tracking or the measurement of RNA abundance, or to track the relative abundance of cells targeted by a mixture of edits in parallel. For example, the insertion of barcodes described herein can be used for cell lineage tracking or the measurement of RNA abundance. As another non-limiting example, genome editing or screening according to the compositions and methods of the present disclosure can be used in high-throughput precision editing genetic screens to 1) improve industrial microbial growth; 2) select strains for improving crop yield; 3) track edited cell populations used for medical treatments in vitro or in vivo; and 4) track edited cell populations used in cell therapy. [0262] As another non-limiting example, genome editing according to the compositions and methods of the present disclosure can be performed to correct detrimental lesions in order to prevent or treat a disease, or to identify one or more specific genetic loci that contribute to a phenotype, disease, biological function, and the like. As another non-limiting example, genome editing or screening according to the compositions and methods of the present disclosure can be used to improve or optimize a biological function, pathway, or biochemical entity (e.g., protein optimization). Such optimization applications are especially suited to the compositions and methods of the present disclosure, as they can require the modification of a large number of genetic loci and subsequently assessing the effects.
[0263] Other non-limiting examples of applications suitable for the compositions and methods of the present disclosure include the production of recombinant proteins for pharmaceutical and industrial use, the production of various pharmaceutical and industrial chemicals, the production of vaccines and viral particles, and the production of fuels and nutraceuticals. All of these applications typically involve high-throughput or high-content screening, making them especially suited to the compositions and methods of the present disclosure. [0264] In some embodiments, inducing one or more sequence modifications at one or more genetic loci of interest comprises substituting, inserting, and/or deleting one or more
nucleotides at the one or more genetic loci of interest. In some instances, inducing the one or more sequence modifications results in the insertion of one or more sequences encoding cellular localization tags, one or more synthetic response elements, and/or one or more sequences encoding degrons into the genome. [0265] In other embodiments, inducing the one or more sequence modifications at the one or more genetic loci of interest results in the insertion of one or more sequences from a heterologous genome. Introducing heterologous DNA sequences into a genome is useful for any number of applications, some of which are described herein. Others will be readily apparent to one of skill in the art. Non-limiting examples are directed protein evolution, biological pathway optimization, and production of recombinant pharmaceuticals. EXAMPLES [0266] The following example provides representative methods for performing an exemplary embodiment of the disclosure. The example demonstrates that the methods of the disclosure can be used for high-throughput genome editing. [0267] Introduction [0268] An important issue in understanding complex traits is the phenomenon of gene-by- environment (GxE) interactions, wherein a genetic variant’s effect is dependent on the environment an organism is exposed to1. For example, humans heterozygous for the sickle cell allele of beta-globin have a fitness advantage in environments that include malaria, and those with a lactase persistence allele have a fitness advantage when consuming dairy products2,3. Identifying the genetic basis of such interactions is a key challenge in biology and is essential to the fields of medicine, genetics, synthetic biology, and evolutionary biology4–6. [0269] Studies for identifying GxE generally come in two main varieties: forward and reverse genetic approaches. 
Forward genetic approaches leverage the association of natural variation to observed traits, which can be as simple as measuring the environmental response of different strains or species. With enough samples across multiple environments, genome-wide association studies (GWAS) can detect signals of GxE7. Alternatively, quantitative trait locus (QTL) mapping uses genetic crosses between strains to create diverse progeny through recombination to calculate statistical signals that associate with environmental response8–11.
However, it is generally impossible to identify the specific variants underlying a GWAS or QTL peak without laborious follow-up experiments, due to insufficient mapping resolution, though crosses with tens of thousands of recombinant genotypes can resolve some QTLs to single nucleotides12–14. [0270] On the other hand, reverse genetic approaches such as constructing knockout libraries and measuring their effects on growth have single-gene resolution, and have been invaluable sources of information about the functions of genes in various organisms and their genetic interactions. However, most reverse genetics approaches to identify GxE interactions assay artificial alleles, such as gene knockouts or over-expression cassettes15,16. These generally do not reflect naturally occurring variants that contribute to phenotypic variation, so it is unknown whether GxE interactions of these alleles are relevant for understanding evolution. In some cases, reciprocal hemizygosity assays have been able to replace whole genes for dissecting QTL traits, but have not been able to separate the many variants within each gene8,17. By using either forward or reverse genetic approaches alone, it is still a challenge to find the precise variants that underlie GxE9. [0271] The methods described herein combine the merits of forward and reverse genetics, integrating natural variation with massively parallel reverse genetic screens, to uncover variants harboring GxE interactions at the single nucleotide level. Previously, the inventors showed that Cas9 Retron precISe Parallel Editing via homologY (CRISPEY) can achieve high efficiency precise editing, by utilizing a bacterial retron reverse transcriptase (RT) to generate multi-copy, single-stranded DNA (msDNA) from RNA templates in nucleo to facilitate homology-directed repair after Cas9-mediated genomic DNA cleavage18.
To this end, the inventors created CRISPEY-BAR, a platform for creating and monitoring thousands of genetic variants in a single experiment. This is achieved through multiplexed, programmed installation of a predefined variant and an associated non-random barcode using a dual-CRISPEY design. Importantly, this design has improved statistical power to detect fitness effects by incorporating unique molecular identifiers (UMIs), as well as the ability to maintain strain barcodes in non-selective media, which allows both assaying and detecting GxE effects of thousands of individual genetic variants in any growth condition. This approach allows natural variants throughout the genome to be surveyed in any condition,
providing the ability to decipher the precise genetic basis and molecular mechanisms giving rise to complex traits. [0272] CRISPEY-BAR was used to measure the effects of 4184 natural variants segregating in yeast (Saccharomyces cerevisiae) across a variety of conditions. A total of 548 variants underlying variation in growth in these environments were identified. Importantly, resolution of the measurements can differentiate the effects of variants even when they are tightly clustered in the genome, as well as different alleles at the same genomic position. This single-nucleotide resolution of GxE interactions not only allows exploration of the natural landscape of complex traits, but also provides direct mechanistic insights into phenotypic evolution14,19. More generally, the methods provide a paradigm for studying genetic variants and their environmental interactions at unprecedented resolution and throughput via multiplexed precision genome editing. [0273] Results [0274] CRISPEY-BAR enables high-resolution mapping of genotype to phenotype relationships [0275] CRISPEY-BAR is a scalable system for measuring the effects of precise genome edits by tracking an associated genomic barcode (Fig.1a). As described in a previous report, CRISPEY uses a single guide/donor pair to make one precise edit per cell, and in a pooled assay, measures the change in abundance of each guide/donor pair post-editing through high-throughput sequencing of plasmids (Fig.1b)18. A new vector design was developed incorporating two consecutive retron-guide cassettes flanked by three self-cleaving ribozymes, allowing simultaneous generation of two guide/donor pairs for making two precise edits in the same cell20 (Fig.1a, Fig.6). The different ribozymes prevent unwanted recombination events during pooled cloning and co-transcriptionally separate the two retron-guide RNAs for processing by retron reverse transcriptase (RT).
CRISPEY-BAR implements a dual-edit design to simultaneously 1) integrate a unique genomic barcode and 2) make a precise variant edit of interest. Each variant editing guide/donor pair is associated with a unique barcode, which can be used to track change in the abundance of cells edited by a specific guide/donor pair (Fig.1c). UMIs were linked to each barcode to serve as biological replicates for pooled-editing and growth competition (Fig.1c). CRISPEY-BAR was designed
to measure the fitness effect of each variant with at least two guide/donor pairs, six UMIs and three pooled competition replicates (Fig.1c, Fig.7). [0276] Since the barcode is genomically-integrated, no maintenance of an ectopic vector is needed post-editing, and 1:1 stoichiometric measurement of edited strains can be achieved through multiplexed sequencing of barcode amplicons (Fig.1d). In particular, the barcode was designed to be covered by 76-base short-read sequencing to minimize sequencing costs and run-time, instead of resequencing the plasmid with 300-base paired-end reads to re-identify guide-donor pairs (Fig.8). This sequencing design uses primers that are specific to the barcode-integrated genomic locus, therefore sequencing only the barcoded strains (Fig.8). Selective detection of the integrated barcode edit guarantees the edited cell expresses functional Cas9 and retron components, as well as endogenous cellular factors that facilitate HDR. This strategy allowed for enrichment of strains likely containing variant edits, which is crucial for high-throughput screens. A similar co-CRISPR strategy has been shown to improve edited mutant selection by co-injection of multiple editing vectors for both non-selectable and selectable markers21. An aggregate 92% pooled editing rate was observed from randomly picked barcoded strains (Fig.1e). [0277] The genome-integrated barcodes from a multiplexed CRISPEY-BAR library provided the ability to track the abundance of thousands of programmed mutants in non-selective media (Fig.1f,g). No-edit controls were included that do not install any variants apart from barcode integration to establish neutral fitness levels that arise from experimental noise and genetic drift (Fig.1f,g).
Six pre-defined unique molecular identifiers (UMIs) were incorporated for every barcode-variant edit combination to increase biological replication. This replication allowed noise from variable editing rates due to guide efficiency to be determined, and allowed outlier detection of random mutants arising during transformation, editing, and competition, which improves estimates of variant fitness effects (Fig.1h). Spontaneous mutations with strong positive fitness effects in particular would be expected to dominate the reads for a given UMI, so by removing these potential outlier UMIs, the number of false positives could be reduced (see Methods). It is unlikely that random mutations would arise for all UMI replicates for a given variant edit independent of CRISPEY-BAR.
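The UMI outlier-removal idea described above can be illustrated with a short Python sketch. The outlier procedure itself is given only in the Methods, so the rule used here (a modified z-score based on the median absolute deviation), the function name `flag_outlier_umis`, the threshold of 3.5, and the example fitness values are all assumptions chosen for illustration, not the procedure of the disclosure.

```python
from statistics import median

def flag_outlier_umis(umi_fitness, threshold=3.5):
    """Return UMI ids whose fitness estimate deviates strongly from the
    median of the UMIs sharing one barcode-variant edit.

    Hypothetical rule: modified z-score = 0.6745 * |x - median| / MAD.
    """
    values = list(umi_fitness.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All UMIs agree (or too few distinct values); nothing to flag.
        return []
    return [umi for umi, v in umi_fitness.items()
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical per-UMI fitness estimates for one variant edit; UMI6 mimics
# a lineage that picked up a strongly beneficial spontaneous mutation.
umis = {"UMI1": 0.01, "UMI2": -0.02, "UMI3": 0.00,
        "UMI4": 0.02, "UMI5": -0.01, "UMI6": 0.45}
outliers = flag_outlier_umis(umis)
print(outliers)  # ['UMI6']
```

A median-based rule is used in this sketch because, with only six UMIs, a single dominant lineage would inflate a mean and standard deviation enough to mask itself.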
[0278] CRISPEY-BAR measured fitness effects are highly reproducible between growth competition replicates. Variant fitness is approximated by fitting a linear model for estimating log2 fold-change abundance of each barcode-UMI over generation time during growth competition, as described previously (Pearson r = 0.9996, p = 1.38x10^-16 for variants with FDR<0.025 in both replicates) (Fig.1i, see also Methods)18. Importantly, across four independently generated and measured variant pools, measured in 13 competitions, only four putatively non-targeting barcodes had significant fitness effects in any competition at FDR<0.01, out of 43 in each pool, showing that CRISPEY-BAR has a low false positive rate. In addition, CRISPEY-BAR measured fitness effects are highly reproducible between experiments. Overlapping variants from two separately cloned, transformed, edit-induced, growth-competed, library-prepared, and sequenced CRISPEY-BAR experiments showed high replication (Pearson r = 0.90, p = 4.68x10^-13 for all overlapping variants) in fitness effects for growth in cobalt chloride, despite being competed against an otherwise separate library other than technical controls (Fig.1j). This result shows that, with overlapping variants between CRISPEY-BAR libraries, pooled screening strategies with minimal batch effects can potentially scale. Finally, 13 genotyped strains edited by CRISPEY-BAR were validated, and pairwise competitions in fluconazole versus a fluorescently labeled un-edited strain were performed. The variant fitness measured by these pairwise competitions showed a high correlation with fitness measured in pooled competitions (Pearson r = 0.926, p = 1.53x10^-5) (Fig.1k). In sum, CRISPEY-BAR is highly efficient in precision editing and allows massively parallel tracking of variant fitness effects using the dual-edit design.
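As an illustration of the fitness estimate described above, the following Python sketch fits an ordinary least-squares line to the log2 relative abundance of one barcode-UMI against generation number and reports the slope. It is a minimal sketch under stated assumptions: the function name `fitness_slope`, the pseudocount, the use of per-sample total read counts for normalization, and the example counts are hypothetical, and the actual model (including normalization to no-edit controls) is described in the Methods.

```python
import math

def fitness_slope(generations, counts, total_counts, pseudocount=0.5):
    """Least-squares slope of log2 relative abundance versus generations.

    generations  : generation number at each sampled time point
    counts       : reads for one barcode-UMI at each time point
    total_counts : total barcode reads in the pool at each time point
    pseudocount  : small value added to counts to avoid log2(0) (assumption)
    """
    y = [math.log2((c + pseudocount) / t)
         for c, t in zip(counts, total_counts)]
    n = len(generations)
    mean_x = sum(generations) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in generations)
    sxy = sum((x - mean_x) * (yi - mean_y)
              for x, yi in zip(generations, y))
    return sxy / sxx

# Hypothetical barcode whose frequency doubles every 5 generations.
slope = fitness_slope([0, 5, 10], [100, 200, 400], [1000, 1000, 1000],
                      pseudocount=0)
print(round(slope, 3))  # 0.2 (log2 fold-change per generation)
```

In a pooled experiment, the slopes of the no-edit control barcodes would define the neutral distribution against which a variant's slope is compared.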
[0279] Detection of natural variants affecting fitness within QTLs reveals hidden genetic complexity [0280] To evaluate CRISPEY-BAR as a high-throughput, scalable platform to measure variants' effects on phenotypes, variants were first characterized within regions likely to be enriched for effects on growth in response to stress conditions, in which the yeast pool has slower growth overall. A total of 36 genomic regions overlapping QTLs for growth of segregants derived from 16 diverse parental strains were measured in three stress conditions: fluconazole (FLC), cobalt chloride (CoCl2) and caffeine (CAFF) (Fig.2a)8. For each stress condition, a CRISPEY-BAR library pool was constructed that targets natural variants that fall within previously identified genomic regions identified by QTL mapping to affect growth in
the corresponding stress condition7,11. QTLs with 1.5-LOD confidence intervals containing only a single gene were selected to not only increase the probability of finding variants affecting fitness, but also maximize the number of QTLs surveyed given a set library size11. By installing diverse natural variants, including many not present in the 16 parental strains, the library could be enriched for variants impacting fitness in these stress conditions (Fig.2a)7. Three oligonucleotide pools (corresponding to variants to be assayed in fluconazole, cobalt chloride, and caffeine) were designed for pooled cloning into three separate CRISPEY-BAR libraries, which were then used for pooled editing (see Methods). After plasmid removal, the edited yeast were subjected to pooled growth competitions in synthetic complete media as well as each corresponding stress condition, and changes in barcode abundance across roughly 25 generations were tracked (Fig.2b, Fig.7). To ensure the stress conditions were applied during yeast growth, the dose of stress agents (fluconazole, cobalt chloride, and caffeine) was calibrated so that the average growth rate was lower by 50% (see Methods). Barcodes with a non-targeting guide (designed to target a sequence which is not present in these strains) were included as no-edit controls to define the neutral fitness distribution within each pooled competition experiment (Fig.1f,g; see Methods). [0281] Within the regions screened for each stress condition, 152 variants with significant fitness effects were identified in fluconazole, 84 in cobalt chloride, and 102 in caffeine (FDR<0.01). Substantially fewer variants with significant fitness effects for growth in synthetic complete media (SC) were observed than in the stress condition for each of these pools (Fig.2c).
To identify what types of variants were most likely to have significant effects on fitness within these pools, the single-nucleotide resolution of the measurements was leveraged. A substantial enrichment for missense variants was observed among causal variants in the drug conditions for all three libraries (hypergeometric p = 3.06x10^-4 for cobalt chloride, 6.84x10^-8 for fluconazole, 1.10x10^-31 for caffeine, Fig.2d). Within many of the QTLs, dozens of causal variants were identified (Fig.2e). For instance, 65 out of the 66 causal variants in MAM3 for growth in cobalt chloride increased fitness, indicating that they may impair function of this gene, as MAM3 knockout increases resistance to cobalt chloride20. For other QTL genes such as TOR2, the knockout of which has been shown to decrease fitness in the presence of caffeine, many variants both increasing and
decreasing fitness were identified21. The base-pair level resolution of CRISPEY-BAR enabled the substantial genetic complexity hidden within these QTLs to be identified. [0282] One QTL gene, PDR5, was shared between the caffeine and fluconazole pools. This well-studied multi-drug transporter had multiple variants affecting fitness in both conditions, many of which had effects in the same direction between the conditions (Fig.2f)22,23. For example, two variants with fitness effects located one base pair apart in the genome were identified, both of which cause missense changes to the same lysine residue in PDR5 (Fig.2g). These two variants both had substantial positive fitness effects for growth in fluconazole and caffeine, and do not co-occur in strains within the 1,011 yeast genomes collection, indicating that they arose independently7. One of these variants is found almost exclusively in strains with the origin “Human, clinical,” while the other is more broadly distributed across ecological origins. Beyond PDR5, there were several other cases where two missense variants changed the same amino acid, both having strong fitness effects (V136, G1967, and A2403 in TOR1, K768 in TOR2, G398 in COT1, and N738 in SWH1). [0283] To look more generally at whether the variants with strong fitness effects within these QTLs were enriched in yeast strains from any ecological origin, the number of positive effect alleles was counted in each strain in the 1,011 yeast genomes collection, including reference alleles which were beneficial relative to the negative effect engineered variants (see Methods). Interestingly, 27 of the top 50 scoring strains are from the ecological origins “Human” and “Human, Clinical,” which is a substantial enrichment (hypergeometric p = 6.56x10^-13). Notably, these 27 strains came from two different clades, so this enrichment may not be driven purely by population structure.
This could potentially reflect selection for increased fluconazole resistance among human clinical isolates. [0284] The GxE landscape of ergosterol synthesis pathway [0285] Having established the ability to identify multiple natural variants affecting fitness across various environments with CRISPEY-BAR, GxE interactions within the ergosterol biosynthesis pathway were examined 24. This essential metabolic pathway is of great biomedical importance, being the target of multiple classes of antifungal drugs, as well as statins and has also been shown to be affected by various other stress conditions, owing to its complex transcriptional and post-transcriptional regulation25,26(Fig.3a). Natural variants
within genes in this pathway as well as 1000 bp upstream and 500 bp downstream of each ORF were tested to capture promoters and downstream regulatory regions in five stress conditions as well as SC media (Fig.3b). Across these six environments, a total of 1432 variants were identified that passed minimum read filters and outlier detection for at least one of the six conditions (see Methods). [0286] Mapping of variants affecting fitness in two drug conditions, lovastatin and terbinafine, revealed that the target genes for these drugs (HMG1/2 and ERG1 respectively) were enriched for variants showing strong fitness effects in these conditions (Fig.3c, lovastatin p = 6.62x10^-12 and terbinafine p = 1.55x10^-16, hypergeometric test)27–29. This illustrates the specificity of the variant effect measurements obtained from these screens. Notably, though these conditions were enriched for variants with strong effects in the target genes, variants in other ergosterol pathway genes also affected fitness, revealing extensive genetic complexity. In addition, the variants with the strongest fitness effects in the different conditions had different variant annotation enrichments. For instance, the 100 variants with the strongest fitness effects in SC were enriched for upstream/promoter variants (hypergeometric p = 3.61x10^-3), while the 100 strongest effect variants in sodium chloride were not (hypergeometric p = 0.137). [0287] To identify GxE variants, all pairwise comparisons between the relative fitness measurements for each variant were performed in each condition to see if the effects on growth were significantly different (Fig.3d,e). To determine a reasonable threshold to define GxE variants and see if the approach to identifying GxE was robust, two identical competitions in SC media were performed and variants tested for GxE interactions between them.
With the significance threshold used (FDR<0.01 and a change in sign of fitness effect) there were zero variants showing significant GxE between the replicates, compared to a mean of 44.3 for comparisons between conditions, suggesting that the approach to identifying GxE has a low false positive rate. Since there were no significant differences between SC replicates, these replicate competitions were combined in the final analysis to increase statistical power. Overall, at this threshold 256 distinct variants were identified with at least one significant GxE interaction (GxE variants), harboring 665 pairwise GxE interactions. Annotation enrichments for GxE variants were examined and missense variants were strongly enriched (two-sided Bonferroni-corrected hypergeometric p = 2.09x10^-6), while
synonymous variants were depleted (two-sided Bonferroni-corrected hypergeometric p = 1.70x10^-3) (Fig.3f). [0288] The stringent definition of GxE above explicitly excludes variants with significantly different fitness effects which are in the same direction, referred to as “magnitude GxE” (red and pink in Fig.3e). While magnitude GxE is interesting, it also may be highly dependent on many experimental variables such as drug concentration. Thus, a stricter definition of a GxE variant was used: any variant with a significant GxE interaction which has measured fitness effects in opposite directions in the two conditions (blue in Fig.3e). [0289] CRISPEY-BAR allows measuring more than one variant at the same genomic locus for multiallelic loci within the ergosterol pathway, which highlights the resolution and specificity of the measurements. There were five multiallelic sites which had a missense variant with a significant fitness effect in one or more of the growth conditions. For three of these sites, the other variant was a synonymous variant with no effects on fitness. Of the other two sites, one site within HMG2 had two missense variants (C788F and C788Y), which had similar effects on fitness. Strikingly, the other site, in HMG1, had two missense variants making P1033A and P1033T changes which had significant effects in opposite directions on growth in lovastatin, perhaps due to the different chemical properties of threonine (polar) and alanine (nonpolar). [0290] Next, the genomic locations of the GxE variants relative to one another were examined to see if there was any spatial structure to GxE. GxE variants were defined as being within a cluster if they were located within 8 bp of another GxE variant.
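The 8 bp cluster definition, together with a permutation test of the kind referred to in this passage, can be sketched in Python as follows. This is a hedged illustration only: the function names, the choice to collapse variants sharing a position, the sampling scheme (drawing the same number of positions from all tested promoter positions on one chromosome), and the example coordinates are assumptions, and the permutation procedure actually used is described in the Methods.

```python
import random

def clustered_variants(positions, max_dist=8):
    """Return the sorted positions lying within max_dist bp of another
    distinct position. Positions are assumed to be on one chromosome;
    variants sharing a position are collapsed (an assumption)."""
    sorted_pos = sorted(set(positions))
    clustered = set()
    for a, b in zip(sorted_pos, sorted_pos[1:]):
        if b - a <= max_dist:
            clustered.update((a, b))
    return sorted(clustered)

def permutation_p(tested_positions, n_hits, observed_clustered,
                  n_perm=1000, seed=0, max_dist=8):
    """One-sided permutation p-value: how often do n_hits positions drawn
    at random from all tested positions show at least as many clustered
    positions as observed?"""
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_perm):
        sample = rng.sample(tested_positions, n_hits)
        if len(clustered_variants(sample, max_dist)) >= observed_clustered:
            at_least += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_perm + 1)

# Hypothetical GxE hit positions within one promoter region.
hits = [100, 105, 200, 300, 307, 316]
in_cluster = clustered_variants(hits)
print(in_cluster)  # [100, 105, 300, 307]

# Hypothetical background: every 10th position in a 1 kb promoter window.
background = list(range(0, 1000, 10))
p_value = permutation_p(background, n_hits=6, observed_clustered=4)
```

Only neighbouring positions need to be compared after sorting, because if a position has any partner within 8 bp, its nearest neighbour is also within 8 bp.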
By this metric, 36 out of 69 (52.2%) promoter GxE variants (targeting 34 unique genomic positions) were within a cluster, which is significantly more clustering of promoter GxE hit variants than expected by chance (permutation p = 0.002, see Methods). The 36 clustered promoter GxE variants are located in 14 clusters, with all clusters sharing at least one significant pairwise GxE interaction at a relaxed threshold of FDR<0.1, though not necessarily in the same direction. Notably, the HMG1 promoter had three of these clusters, all of which had significant GxE interactions between caffeine and lovastatin, with strong effects on growth in lovastatin in both directions (Fig.3h). In addition, five of the 14 clusters overlapped predicted transcription factor binding sites (TFBS)30–32. The other clusters may
disrupt TFBS which have not been previously identified in the datasets examined, perhaps due to context-specific binding, or may affect fitness through another mechanism. [0291] Gene-by-Environment interactions are Pervasive Among Natural Variants [0292] Next, using measurements of variant fitness effects in each condition, the prevalence of GxE interactions was examined among natural variants with significant fitness effects. If GxE interactions at the variant level were rare, it was expected that variants with strong fitness effects in one condition would mostly have similar effects in other conditions, and so would be correlated between conditions (Fig.4a). Conversely, if GxE interactions were common, it was expected that little correlation between fitness effects of the same variant in different conditions would be observed (Fig.4b). These patterns would only be visible for variants with measurable fitness effects, as those that are neutral or close enough to be not detected would be expected to show no correlation in either case. Examples of both of these patterns were observed in the data, with fitness effects in caffeine and fluconazole within PDR5 being generally well-correlated despite differences in magnitude (Fig.4c), while ergosterol pathway variants measured in lovastatin and CoCl2 showed little to no agreement (Fig.4d). [0293] Checking the fraction of GxE among hits for two conditions at a time, the fraction of variants with significant fitness effects in either condition with GxE interactions between the conditions ranges from 24.4 to 71.4% for the ergosterol pool conditions. PDR5 variants in fluconazole and caffeine by this same method had 29.2% of significant variants showing GxE (Fig.4e). Extending this analysis to examine effects in all conditions for the ergosterol pathway variants, it is clear that almost all variants with significant fitness effects showed GxE interactions (Fig.4f). 
Among all variants measured in all six conditions which have at least one significant fitness effect in any condition, 93.8% have significant GxE interactions. It’s important to note that having a strong fitness effect in one condition would make it more likely for a variant to have a detectable significant GxE interaction due to statistical power to detect a difference. However, if there existed a class of variants that showed consistent fitness effects across the conditions tested in magnitude and direction, they would have significant fitness effects while not showing GxE. In contrast, if all conditions had a similar fraction of GxE variants as the PDR5 caffeine and fluconazole variants and were independent, only
82.2% (1 - (1 - 0.292)^5) of variants would be expected to show GxE across six conditions. Strikingly, these analyses show that the vast majority of the non-neutral variants in the ergosterol biosynthesis pathway showed GxE, indicating that GxE interactions among natural variants are pervasive in this pathway. [0294] Regulatory GxE interactions in the ergosterol pathway [0295] The finding that GxE interactions are pervasive among natural variants with detectable fitness effects in the ergosterol synthesis pathway led to further investigations regarding the pattern of their effects. In principle, the finding of pervasive GxE could be consistent with a scenario in which most variants have fitness effects in only one condition and are neutral in others (Fig.5a). In this case, it was expected that variants with a significant fitness effect in one condition would be no more likely than any other variant to show a significant fitness effect in another, and so fitness effects across conditions should be distributed independently across the variants. Conversely, if variants with significant fitness effects in any condition were more likely to show strong fitness effects in other conditions (i.e., be pleiotropic), it was expected that the fitness distributions for the different conditions would not be independent, and significant fitness effects from these conditions would be more “clustered” in certain variants than expected by chance. Of the variants measured in all six conditions, 15.0% showed a strong fitness effect in at least one condition, but 30.7% of variants that had a significant fitness effect in one condition had a significant fitness effect in at least one other condition, a two-fold enrichment (hypergeometric p = 5.27x10^-10). These variants were further enriched for missense variants relative to all GxE variants (hypergeometric p = 5.40x10^-6), and further depleted of synonymous variants (hypergeometric p = 1.82x10^-3).
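The 82.2% expectation quoted in paragraph [0293] and the two-fold enrichment quoted above follow from simple arithmetic, which the short Python check below reproduces. The function name `expected_any_gxe` is hypothetical; the inputs (29.2%, an exponent of five, 15.0%, and 30.7%) are taken from the text, and the independence of the comparisons is the assumption stated in the text itself.

```python
def expected_any_gxe(pairwise_fraction, n_comparisons):
    """Probability of at least one GxE call across n independent
    comparisons, each with the same per-comparison GxE fraction."""
    return 1 - (1 - pairwise_fraction) ** n_comparisons

# 29.2% of PDR5 hits showed GxE between caffeine and fluconazole;
# the text applies an exponent of five for six conditions.
expected = expected_any_gxe(0.292, 5)
print(round(100 * expected, 1))  # 82.2

# Enrichment of multi-condition effects: 30.7% observed versus 15.0% baseline.
fold = 0.307 / 0.150
print(round(fold, 2))  # 2.05
```

The observed 93.8% exceeds this 82.2% expectation, which is the comparison the text relies on when describing GxE as pervasive in this pathway.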
[0296] Variants with significant effects in more than one condition can be grouped into two categories: 1) those with significant fitness effects in only one direction (Fig.5b) and 2) those with significant fitness effects in opposite directions, which is referred to as “sign GxE” (Fig. 5c). Examining the strongest fitness effect for each of the variants with significant effects, it was observed that variants showing sign GxE had significantly higher maximum effects than variants with significant effects in only one condition or multiple conditions in the same direction (Mann-Whitney U-test p=0.000193 and p =0.0367 respectively) (Fig.5d). This
indicates that variants with more drastic effects on fitness in any given condition may be more likely to have fitness effects in the opposite direction in another condition. [0297] In many cases, the single-nucleotide resolution suggests plausible molecular mechanisms underlying GxE. For example, the pleiotropic variant exhibiting sign GxE at chr7: 472522 C>A was located in a canonical Rpn4p binding site33 (Fig.5e top, bottom left). This variant's strongest effect was a significant fitness decrease in lovastatin. Since Rpn4p is a transcriptional activator, it was hypothesized that the disruption of the Rpn4p binding site might decrease ERG4 expression. RT-qPCR was used to measure expression of ERG4 in a genotyped strain carrying chr7: 472522 C>A and found that its expression decreased relative to the wildtype strain (Fig.5e bottom right). This decrease in expression agreed with ERG4 expression in a strain carrying a fully ablated Rpn4p TFBS, while a strain carrying an Rpn4p consensus site had higher ERG4 expression. Interestingly, a strain with another natural variant that mutated a lower information base within the Rpn4p binding motif (chr7: 472525 T>A) showed a slight fitness decrease in lovastatin and did not show a significant decrease in ERG4 expression. This suggested that the chr7: 472522 C>A variant disrupted ERG4 expression through mutation of the Rpn4p TFBS. The fitness in the Rpn4p consensus and Rpn4p mutated TFBS was tested, which showed that ERG4 expression correlated with fitness in lovastatin (Fig. S4). In sum, CRISPEY-BAR was able to survey thousands of natural variants and identify the variants affecting fitness at the nucleotide-level, directly leading to discovery of molecular mechanisms of GxE interactions. [0298] Discussion [0299] The CRISPEY-BAR strategy and its applications provide a solution to rapidly discover natural genetic variants impacting a complex trait. 
As a proof of principle, 548 variants with significant effects on growth within QTLs were identified, as well as across a core metabolic pathway. With CRISPEY-BAR, thousands of variants can be screened in each experiment. Scaling up to the level to cover variants across entire genomes should allow even deeper probing of the relationship between genotype and any trait amenable to pooled phenotyping (including traits that can be tied to growth or fluorescence-based reporters). [0300] Deciphering the non-coding genome has been a major challenge even in just one experimental condition and is further complicated by GxE interactions. In this study, the
inventors showed that a class of variants with GxE cluster tightly within promoter regions, and further found that some of them overlap with known TFBS (Fig. 3g). Although GxE variants are most highly enriched in missense variants, no genomic clustering of these protein-altering variants was observed.
[0301] CRISPEY-BAR is highly efficient in precise editing. The RT was shown in CRISPEY to be effective in production of msDNA as DNA donors for precision editing18. The inventors have since tested additional retron RTs in CRISPEY, showing higher efficiency in yeast, as well as editing activity in human cells34. While this study only applied SpCas9 with an 'NGG' PAM site, limiting the variants that can be targeted, alternative nucleases with alternative PAMs can be interchanged with SpCas9 to target additional variants35–37.
[0302] The CRISPEY-BAR approach has an efficient guide for barcoding, while the variant editing guide can have a range of efficiency. Because two or more untested guides were used to target each variant, it is likely that guides showing the same significant fitness effect were both efficient in making precise edits. Moreover, the six UMIs allow outlier detection where spontaneous mutations or off-target effects may have taken place. Combining guide reproducibility and UMI editing-competition replication, every CRISPEY-BAR experiment provides additional data points for supervised learning of effective guide design in CRISPEY-based editing strategies.
[0303] In this study, CRISPEY-BAR was applied to a lab strain of budding yeast to evaluate the effect of natural variants. This may limit the portability of the fitness effects measured for individual variants, since they are only measured in this lab strain genetic background. This caveat can be overcome by applying CRISPEY-BAR to additional strains of budding yeast to not only capture the effects of variants within one lab strain, but also the effect of genetic background. 
The CRISPEY-BAR design also allows for additional ribozymes and CRISPEY cassettes to be incorporated. A single barcode-insertion cassette plus two or more variant editing cassettes can be expressed in the same transcript, allowing simultaneous editing of two genetic variants of choice and integration of a variant-pair specific barcode. With this design, gene-by-gene (epistatic) interactions can be observed, as
well as gene-by-gene-by-environment (GxGxE) interactions that govern the crosstalk between gene networks and the environment38–40.
[0304] It was surprising that GxE interactions were pervasive among variants with fitness effects, given that just six conditions were tested. Most of the variants with GxE have a significant effect in only one condition, which by definition shows GxE with respect to the rest of the conditions. More excitingly, we found a fraction of the variants to harbor sign GxE, which implies fitness tradeoffs in fluctuating environments where selection acts in opposite directions on the variant (Fig. 4g). Moreover, we found a trend in which large-effect variants tend to also have larger effects in another condition than expected by chance, forming a class of pleiotropic variants with two or more conditional effects. While we expect additional variants with fitness effects to be identified as more conditions or drug conditions are tested on the same set of variants, it is intriguing to think that the pleiotropic variants may harbor disproportionate amounts of environment-specific effects. If such is the case, by performing a limited set of CRISPEY-BAR experiments with a diverse set of conditions, we will be able to prioritize a set of pleiotropic variants that are likely to have effects in the remaining, untested conditions spanning the phenotypic space.
[0305] Methods
[0306] Variant selection and pooled oligonucleotide design
[0307] Natural variants were sourced from the 1,011 genomes project according to the following criteria7. For QTL fine mapping, QTLs (Bloom et al, 2019 supplementary file elife-49212-Fig.3-data1-v2.xls, sheet_name='within-cross model') were filtered for those that contain only one gene and have a q-value > 0.05, then ranked by Beta_abs for maximum effect size11. We excluded the following genes, either to avoid interference with CRISPEY editing or because they are unavailable in the base strain genotype: HO, HIS3, URA3, LEU2, LYS2, GAL1, GAL3, GAL4, GAL7, GAL10, GAL80, HAP1 and POLR2. The QTL borders were defined by the coordinates within '1.5 LOD drop CI, left' and '1.5 LOD drop CI, right' as annotated in Bloom, 2019, and gene regions were defined as ±500 bp from the coding region11. Natural variants within the union of the QTL borders and the gene region were included in the library corresponding to the traits, excluding singletons and doubletons7. The traits include growth in 'Cobalt_Chloride;2mM;2', 'Caffeine;15mM;2' and 'Fluconazole;100uM;2', and we refer to these traits as 'stress conditions'11. For the ergosterol pool, all non-reference alleles from yeast natural variants that were within ±500 bp of the coding region of the selected ergosterol pathway genes were included4. We targeted more than 1000 variants per QTL condition pool, and all possible variants for the ergosterol pool (Fig. 2a, 3a). We designed CRISPEY oligos to edit these variants in the ZRS111 strain, which contains the S288c reference alleles. The guides and donors selected for CRISPEY editing were designed as described, with the following parameters or modifications18:
1. The alternative allele is within -6 to -1 and +1 to +2 positions of the guide target and PAM sequences;
2. The donor template is 108 bp in length with asymmetric homology arms, 40 bp for the 5’ arm and 68 bp for the 3’ arm;
3. Variants were included if two or more guides were found for a given variant.
The resulting msDNA donor will have a shorter 3’ homology arm and a longer 5’ arm flanking the variant, which was reported to have higher HDR efficiency using ssDNA as repair donor41. The donors were further filtered to exclude SphI, AscI and NotI restriction sites used in the cloning process, and to keep a minimum of 30 bp of homology arm 5’ of the variant and 55 bp of homology arm 3’ of the variant in the donor template. Each resulting oligo is 250 bp and consists of 5’ homology to the pSAC200 CRISPEY-BAR vector, 12 bp programmed barcode, restriction site region for cloning, 108 bp donor
template sequence, 34 bp constant region, 20 bp guide sequence and 3’ homology to the pSAC200 CRISPEY-BAR vector (Fig. 6). Specifically, the general sequence is: 5’-GTTGCAGTTAGCTAACAGGCCATGC-(N)12-GCATGCAGCGGCCGCAGGCGCGCC-(N)108-AGGAAACCCGTTTCTTCTGACGTAAGGGTGCGCA-(N)20-GTTTCAGAGCTATGCTGGAAACAGCAT-3’, where the first 12 Ns represent programmed barcodes, the following 108 Ns represent donor template sequence, and the last 20 Ns represent guide sequence.
[0308] Programmed Barcode Design
[0309] Barcodes were designed using a custom script implementing a quaternary Hamming(12,8) code based on the encoding scheme described in a previous study42. This encoding scheme generates DNA barcodes with a minimum Hamming distance of 3, allowing for error correction of 1 bp mutations or DNA sequencing errors. The list of all Hamming(12,8) DNA barcodes was then filtered to remove barcodes containing restriction sites used in the cloning process, Illumina i5 and i7 Nextera handles, homonucleotide stretches greater than length 3, dinucleotide repeats greater than length 5, any 12 bp section of pSAC200, and any 12 bp section of the custom sequencing primers. In addition, Primer3 was used to predict any hairpin structures, and if a structure was found, that barcode was removed43. The final list of barcodes was then assigned to 392 possible wells, ensuring that barcodes within each well had a minimum Hamming distance of 5, theoretically enabling error correction of sequencing errors in up to 2 bp for barcodes within the same well.
[0310] Library cloning
[0311] Oligonucleotide (Twist Biosciences) libraries were ordered in the format of 192 wells, with each well containing 121 oligonucleotides. This format allows pooling of oligonucleotides in combinations relevant to each competition experiment. 
Each well included 119 variant editing oligonucleotides, 1 control oligonucleotide with a non-editing guide (sgGFP) and 1 control oligonucleotide editing an 8-bp frameshift deletion as a positive control with gene knockout effects, adapted from a previous study44.
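The oligo layout of paragraph [0307] and two of the barcode filters of paragraph [0309] can be sketched as follows. This is a minimal illustration, not the custom script referred to in the text: the function names are hypothetical, only the restriction-site and homonucleotide filters are shown, and the Hamming(12,8) encoder, the Primer3 hairpin check and the other filters are omitted. The constant segments are copied from the general sequence given above.

```python
import re

# Constant segments copied from the general oligo sequence in [0307].
FIVE_PRIME_HOMOLOGY = "GTTGCAGTTAGCTAACAGGCCATGC"
CLONING_REGION = "GCATGCAGCGGCCGCAGGCGCGCC"  # SphI, NotI and AscI sites
CONSTANT_REGION = "AGGAAACCCGTTTCTTCTGACGTAAGGGTGCGCA"
THREE_PRIME_HOMOLOGY = "GTTTCAGAGCTATGCTGGAAACAGCAT"

# Recognition sites of the cloning enzymes named in the text (all palindromic).
RESTRICTION_SITES = ("GCATGC", "GGCGCGCC", "GCGGCCGC")  # SphI, AscI, NotI


def assemble_oligo(barcode: str, donor: str, guide: str) -> str:
    """Concatenate the variable and constant parts into one 250 bp oligo."""
    if (len(barcode), len(donor), len(guide)) != (12, 108, 20):
        raise ValueError("expected a 12 bp barcode, 108 bp donor and 20 bp guide")
    return (FIVE_PRIME_HOMOLOGY + barcode + CLONING_REGION + donor
            + CONSTANT_REGION + guide + THREE_PRIME_HOMOLOGY)


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def passes_sequence_filters(barcode: str) -> bool:
    """Reject cloning restriction sites and homonucleotide runs longer than 3."""
    if any(site in barcode for site in RESTRICTION_SITES):
        return False
    return re.search(r"(.)\1{3,}", barcode) is None


def min_pairwise_distance(barcodes) -> int:
    """Smallest Hamming distance among barcodes assigned to one well."""
    return min(hamming(a, b)
               for i, a in enumerate(barcodes)
               for b in barcodes[i + 1:])
```

A well assignment in the spirit of [0309] would accept a set of barcodes only if `min_pairwise_distance` is at least 5.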
[0312] Oligonucleotides were first amplified with Q5 polymerase (NEB) with 1 uM primer #615 in a 50 uL reaction following manufacturer instructions, with an initial denaturation of 98°C for 2 min, then 5 cycles of 98°C for 10 s and 65°C for 30 s, followed by 25 cycles of 98°C for 10 s and 69°C for 40 s, then a final extension of 72°C for 2 min. PCR products were then purified with 45 uL nucleoMAG NGS beads (hereafter, “beads”) (Takara) and eluted with 20 uL water. 2 uL of the first round PCR product was further amplified with Q5 polymerase (NEB) with 1 uM primers #615 and #576 in a 50 uL reaction following manufacturer instructions, with an initial denaturation of 98°C for 2 min, then 15 cycles of 98°C for 10 s and 69°C for 30 s, then a final extension of 72°C for 2 min. Second round PCR products were then purified with 45 uL beads and eluted with 20 uL Tris pH 8.0. We quantified and pooled PCR products from each well by equal volume to the assigned pools.
[0313] The pooled oligonucleotide PCR products were purified using SizeSelect II 2% gel (Invitrogen) followed by bead purification, and NGS libraries were prepared to quantify the counts from each well. Briefly, the pooled oligos were amplified with Q5 polymerase (NEB) with 1 uM primers #617 and #337-343 in a 50 uL reaction following manufacturer instructions, with an initial denaturation of 98°C for 2 min, then 15 cycles of 98°C for 10 s and 69°C for 40 s, then a final extension of 72°C for 2 min, followed by purification using 45 uL beads and indexing PCR using Illumina dual-indexing primers. The indexed amplicons corresponding to each pool were then sequenced by MiSeq using reagent kit v2 Nano to obtain paired-end 150 bp reads that were mapped to the designed oligonucleotides. We counted the relative proportions of oligonucleotides from each well in the assigned pool, then repooled the PCR products with normalized volumes to target equal molarity between wells in each pool. 
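One way the volume normalization at the end of paragraph [0313] could be computed is sketched below. It assumes that the read share of a well after equal-volume pooling is proportional to its molar concentration; the function name and the `base_volume_ul` parameter are hypothetical and are not taken from the source.

```python
def normalized_volumes(read_counts, base_volume_ul=10.0):
    """Scale each well's pooling volume inversely to its observed read share.

    read_counts maps a well name to the reads mapped to that well's oligos
    after the first, equal-volume pooling. base_volume_ul (an assumed value)
    is the volume a well would get if it already had the target share.
    """
    total = sum(read_counts.values())
    target_share = 1.0 / len(read_counts)
    volumes = {}
    for well, count in read_counts.items():
        if count <= 0:
            raise ValueError(f"well {well} has no reads; cannot normalize")
        observed_share = count / total
        # Under-represented wells receive proportionally more volume.
        volumes[well] = base_volume_ul * target_share / observed_share
    return volumes
```

With this scaling, the product of volume and observed share is the same for every well, which is the equal-molarity target described in the text.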
[0314] The pSAC200 empty vector was digested twice with NotI-HF (NEB) and Quick CIP (NEB), and the linearized vector was purified using beads. 290 ng of linearized pSAC200 vector and 140 ng of well-normalized, pooled oligonucleotide PCR products from above were assembled in a 20 uL NEBuilder HiFi mastermix (NEB) reaction according to manufacturer instructions, with a 1:10 molar ratio of vector:insert. The assembled products were purified by beads and eluted in 10 uL water. 3 uL of the assembled products were used for electroporation with 27 uL Endura Electrocompetent cells for CRISPR DUO (Lucigen). Two electroporation reactions were performed for each pool following manufacturer instructions, recovered in SOC media (Lucigen) for 25 min at 37°C and plated to a single 15 cm LB agar plate with Carbenicillin (GoldBio). A serial dilution of the recovered bacteria was plated to estimate colony forming units (cfu), and all pools contained more than 500,000 cfu. The transformants were incubated for 22 hr at 32°C and the resulting bacterial lawn was collected for storage in LB with 10% glycerol at -80°C. Half of the collected transformant stock was used for plasmid extraction using Nucleobond Xtra Midi Plus (Macherey-Nagel) and eluted as “post-Gibson” plasmid pools, yielding 105-120 ug of plasmid DNA.
[0315] 20 ug of post-Gibson plasmid pools were digested twice with SphI-HF (NEB), AscI (NEB), Quick-CIP (NEB) and NotI-HF (NEB), purified by beads and eluted in 12 uL 10 mM Tris pH 8.0 as ligation vectors. A mixture of six UMI-associated ligation inserts was generated by six 100 uL Q5 (NEB) PCR reactions, each with one of six forward primers (#591, #592, #594, #506, #603 and #604) and reverse primer #590, with plasmid pSAC212 as template. PCR was performed with 1 uM of each primer following manufacturer instructions, with an initial denaturation of 98°C for 3 min, then 35 cycles of 98°C for 10 s, 66°C for 30 s and 72°C for 40 s, then a final extension of 72°C for 2 min. The ligation insert PCR products were digested with SphI-HF (NEB) and AscI (NEB) and bead purified, then pooled in equimolar amounts into a mixture of six UMI ligation inserts. 1 ug of the linearized pool vectors was ligated to 1.5 ug of the six-UMI ligation mix (vector:insert=1:30) with 10 uL T4 ligase (NEB) in 100 uL 1x T4 ligase buffer at 16°C overnight.
[0316] The ligation product was purified by beads and eluted in 30 uL water. 3 uL of the purified ligation products were used for electroporation with 27 uL Endura Electrocompetent cells for CRISPR DUO (Lucigen). 
Two electroporation reactions were performed for each pool, one reaction with ligation insert and the other without insert as negative control. Electroporation was performed following manufacturer instructions and recovered in SOC media (Lucigen) for 30 min at 37°C and the with-insert ligations were plated to two 15 cm LB agar plates with Carbenicillin (GoldBio) at 32°C for 22 hr. A serial dilution of the recovered bacteria from both with- and without-insert ligations was plated to estimate cfu, and all pools contained more than 1,000,000 cfu, corresponding to at least 2,500x coverage for each oligonucleotide on average within each pool. Ligation plates were incubated at 32°C
for 22 hr, and transformants were stored in LB with 10% glycerol. Ligated plasmids were extracted from one fourth of the collected bacteria from each pool using Nucleobond Xtra Midi Plus (Macherey-Nagel) and eluted as “post-ligation” plasmid pools, yielding 160-240 ug of plasmid DNA per reaction.
[0317] Yeast transformation, editing induction and plasmid curing
The base strain ZRS111 was described previously18. 4 ug of the post-ligation plasmid pools were digested with NotI-HF (NEB) and Quick-CIP (NEB) and directly transformed into the yeast strain ZRS111 by LiOAc heat shock transformation45. The yeast transformant pools were selected on YNB -histidine -uracil 2% glucose (1.7 g/L yeast nitrogen base (RPI); 5 g/L Ammonium Sulfate (ACROS organics); 1.9 g Dropout synthetic mix minus histidine, uracil w/o nitrogen base (US Biological) and 20 g/L glucose (Sigma)) 2% agar plates and stored in YNB -histidine -uracil 2% glucose media with 15% glycerol at -80°C. Yeast transformants containing post-ligation pools were inoculated to 200 mL YNB -histidine -uracil 2% raffinose (1.7 g/L yeast nitrogen base (RPI); 5 g/L Ammonium Sulfate (ACROS organics); 1.9 g Dropout synthetic mix minus histidine, uracil w/o nitrogen base (US Biological) and 20 g/L raffinose (Sigma)) media starting at OD600=0.4, shaking at 30°C for 16 hr (Fig. 7). The raffinose cultures were further re-inoculated in 200 mL YNB -histidine -uracil 2% galactose media starting at OD600=0.4 and shaking at 30°C for 24 hr three times, for a total of 72 hr in galactose media, in order to induce CRISPEY-BAR editing.
[0318] Cells were harvested from the last galactose media growth and stored in YNB -histidine -uracil 2% glucose media with 15% glycerol at -80°C. Edited cells were then plasmid-cured by growing in 200 mL YNB 2% glucose (1.7 g/L yeast nitrogen base (RPI); 5 g/L Ammonium Sulfate (ACROS organics); 1.9 g Dropout synthetic mix complete, w/o nitrogen base (US Biological) and 20 g/L glucose (Sigma)) media starting at OD600=0.4, shaking at 30°C for 16 hr, then re-inoculated to YNB 2% glucose media with 1 g/L 5-Fluoroorotic acid monohydrate (GoldBio) starting at OD600=0.4, and shaking at 30°C for 24 hr. The plasmid-cured cells were collected and stored in YNB 2% glucose media with 15% glycerol at -80°C.
[0319] Pooled competition
[0320] Pooled competitions were carried out in 1 L baffled flasks in YNB 2% glucose (SC, hereafter) media with or without specified conditions (Fig. 7). The concentration of each drug/salt was titrated to approximately 5 generations of growth of the ZRS111 strain every 12 hr, indicating overall decreased fitness in each condition, to apply consistent growth stress to cells. In contrast, for SC media only, there are approximately 5 generations of growth of the ZRS111 strain in 8 hr. Cells were thawed in 200 mL SC media from glycerol stock starting at OD600=0.4 and grown at 30°C shaking at 250 RPM. Cells were passaged every 12 hr and diluted to fresh 1 mL SC media with specified conditions, and every 8 hr for SC media only. Cells were harvested at six timepoints (T1 to T6), spanning five intervals, each once passage was complete. Harvested cells were spun down, washed with water and stored at -20°C.
[0321] Sequencing library preparation
[0322] Yeast genomic DNA was extracted from 60 - 80 OD of each sample using the MasterPure Yeast DNA Purification Kit (Lucigen) with four reactions per sample. Genomic DNA was eluted in 200 uL per sample, further digested with 1 uL RNaseA and quantified by Qubit dsDNA HS assay (Invitrogen). 10 ug of genomic DNA was amplified in a 400 uL Q5 polymerase (NEB) PCR reaction with 1 uM forward primer #261 and 1 uM reverse primer equimolar mix of primers #327- #334 (Fig. 8). PCR was performed following manufacturer’s instructions, with 1 M Betaine and an initial denaturation of 98°C for 2 min, then 19 cycles of 98°C for 10 s and 65°C for 20 s, then extension at 72°C for 5 min. 100 uL of first round PCR products were purified using 100 uL beads, and 15 uL of the purified amplicons were further indexed by a 50 uL Q5 polymerase (NEB) PCR reaction following manufacturer’s instructions with 1 uM equimolar mix of indexing primers for Illumina sequencing, with an initial denaturation of 98°C for 2 min, then 8 cycles of 98°C for 10 s and 70°C for 20 s, then extension at 72°C for 2 min. The indexed amplicons were purified with 50 uL beads, eluted in 100 uL water and quantified by Qubit dsDNA HS assay (Invitrogen). The purified, indexed amplicons from six time point samples for the three replicates per competition were mixed equimolar and purified by SizeSelect II gel (Invitrogen) for the ~300 bp product. The size selected libraries were then purified by beads and submitted for paired-end sequencing on NextSeq 550 using custom read1 primer #354, with custom cycles of 12 cycles for read1, 8 +
8 cycles for dual indices and 64 cycles for read2 using a 1 x 75 bp High-Output Kit (Fig. 8). Data available at PRJNA827354.
[0323] Read Processing
[0324] Reads in fastq format from competition libraries sequenced using NextSeq were processed using a custom script. Briefly, fastq files from the same samples were combined and adaptors were trimmed using cutadapt46. Parameters for read 2 trimming were: 5’ adaptor sequence 'GGCCAGTTTAAACTT', 3’ adaptor sequence 'GCATGGC', maximum error rate of 0.2 and a length of 27 base pairs for trimmed read2. Trimmed paired reads were merged using FLASh with a minimum overlap of 12 base pairs and a maximum mismatch rate of 0.25 (ref. 47). The resulting barcode is 27 base pairs, including the 12 bp barcode, 6 bp SphI restriction site and 9 bp UMI. The barcode-UMI combinations with a perfect match to any of the possible barcode-UMI combinations from the designed libraries were counted for the analysis described below.
[0325] Fitness calculation
[0326] Processed counts of barcode-UMI combinations from each competition experiment were combined with generation time estimated from optical density at each timepoint during fitness competition to calculate fold-change values using DESeq2 (ref. 48). A minimum filter of 500 reads across 18 samples, including six time points in three replicates, was set for each barcode-UMI combination. The editing effect of each barcode-UMI combination was modeled as described previously by estimating the effect of generation time on the log fraction of barcode-UMI counts, with the DESeq2 design formula as follows18:
Count ~ Generation + Flask
[0327] Where “Count” represents each barcode-UMI combination; “Generation” represents the number of generations from the start of the growth competition, estimated by optical density as described above; and “Flask” indicates the flask replicate from which the sample originated. Log2 fold-change was estimated for counts per UMI across generation time for each barcode-UMI combination by DESeq2 (ref. 48). 
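The read parsing, read-count filter and per-generation fold change described in paragraphs [0324] to [0327] can be illustrated with the sketch below. It is not the custom script or DESeq2: the slope is an ordinary least-squares fit of log2 count fraction on generations with one intercept per flask, used here as a simplified stand-in for the DESeq2 negative binomial model. The function names, the 0.5 pseudocount, and the assumption that the 27 bp read is laid out as barcode, SphI site, UMI in that order are all illustrative assumptions.

```python
import math
from collections import defaultdict

SPHI_SITE = "GCATGC"


def split_read(seq):
    """Split a trimmed 27 bp read into (barcode, UMI), or None if malformed.

    Assumes the order 12 bp barcode, 6 bp SphI site, 9 bp UMI.
    """
    if len(seq) != 27 or seq[12:18] != SPHI_SITE:
        return None
    return seq[:12], seq[18:]


def passes_min_reads(counts, minimum=500):
    """Keep a barcode-UMI combination with >= 500 reads over all samples."""
    return sum(counts) >= minimum


def log2fc_per_generation(counts, totals, generations, flasks, pseudocount=0.5):
    """Slope of log2(count / library total) on generations, flask-adjusted.

    Centering generations and log fractions within each flask is equivalent
    to an OLS fit of 'log fraction ~ Generation + Flask'.
    """
    y = [math.log2((c + pseudocount) / t) for c, t in zip(counts, totals)]
    by_flask = defaultdict(list)
    for g, v, f in zip(generations, y, flasks):
        by_flask[f].append((g, v))
    sxy = sxx = 0.0
    for points in by_flask.values():
        g_mean = sum(g for g, _ in points) / len(points)
        y_mean = sum(v for _, v in points) / len(points)
        sxy += sum((g - g_mean) * (v - y_mean) for g, v in points)
        sxx += sum((g - g_mean) ** 2 for g, _ in points)
    return sxy / sxx
```

A positive slope indicates a barcode-UMI lineage that gains frequency per generation, and a negative slope one that is depleted.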
[0328] Outlier removal and GxE fitness modeling [0329] Individual UMI log2 fold changes (logFC) for the same variants were combined to estimate the variant fitness effect through a weighted least squares model using a custom
Python script (modified from Ang et al, in submission). First, for each oligomer, we applied a robust outlier detection on its associated barcodes to remove UMIs with large median absolute deviations (MAD) from the median logFC for that variant (logFC > 3.5 x MAD from the median logFC for that variant), since we expect that the unique molecular tags ligated to the oligomer barcode during library cloning should have relatively comparable effects in the library. Next, for the ergosterol library we calculated the standard deviation of the logFC of the UMIs for each programmed barcode and removed programmed barcodes with a logFC standard deviation greater than or equal to 0.05, to remove highly variable barcodes not accounted for in the previous outlier detection step. We omitted this step for the QTL pools due to the higher variance that was expected among very high effect variants in those pools. To account for heteroscedasticity in the fitness effects of barcodes with different counts, we used the neutral distribution to calculate inverse-variance weights for each oligomer based on its median barcode’s average count in the competition (baseMean). Finally, barcode fitness effects and weights from each growth competition were jointly fitted into a weighted least squares model to calculate the variant fitness effect in each condition. The model takes the variant, condition, and their interaction terms as independent variables, summarized in the form:
logFC ~ C(variant) + C(condition) + C(variant):C(condition)
[0330] Where “C(variant)” and “C(condition)” are dummy variables representing the variant and growth condition, and logFC is the logFC for each UMI. Therefore, the variant fitness effect in a condition is the difference between the weighted mean of fitness effects for UMIs associated with the variant and the weighted mean of fitness effects for neutral UMIs in that condition. 
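The outlier rule and the interpretation of the fitted effect given above can be sketched as follows. This is an illustration under stated assumptions, not the custom Python script: the function names are hypothetical, the weights are taken as given rather than derived from the neutral distribution, and the effect is computed directly as the difference of weighted means that the text describes instead of by fitting the full model.

```python
from statistics import median


def remove_mad_outliers(logfcs, threshold=3.5):
    """Drop UMI logFC values more than threshold x MAD from the median."""
    values = list(logfcs)
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        # All deviations are zero at the median; nothing can be flagged.
        return values
    return [x for x in values if abs(x - med) <= threshold * mad]


def weighted_mean(values, weights):
    """Mean of values with the supplied (e.g. inverse-variance) weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def variant_effect(variant_logfc, variant_w, neutral_logfc, neutral_w):
    """Variant fitness effect in one condition, relative to neutral UMIs."""
    return (weighted_mean(variant_logfc, variant_w)
            - weighted_mean(neutral_logfc, neutral_w))
```

For example, `remove_mad_outliers([0.10, 0.11, 0.09, 0.10, 0.12, 0.90])` keeps the five values near 0.1 and drops 0.90.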
The model also determines if the variant fitness effect is significantly different from neutral through a weighted t-test. The p-values were adjusted for multiple testing using the Benjamini-Hochberg procedure and significant fitness effects were controlled at FDR = 0.01 (ref. 49).
[0331] Fluconazole Ecological Enrichment Test
[0332] To test whether strains from particular ecological origins were enriched for variants with significant effects in a particular direction in fluconazole, we first split the variants with significant fitness effects in fluconazole into positive and negative effect variants. We then checked for each strain in the 1,011 yeast genomes if they were homozygous or heterozygous for the alternate allele we edited in at each significant variant. For positive effect variants, strains with the alternate allele had 1 added to their score, and for negative effect alleles, strains with the alternate allele had 1 subtracted from their score. The total number of negative effect variants was added to this score for all strains, as any strain with the reference allele for those sites in effect had the positive effect allele. The 1,011 yeast strains were then sorted by this score, and the top 50 were chosen to look at their ecological origins, as they were presumably the strains with the most evidence for being under selection for increased growth in fluconazole. A hypergeometric test was performed to determine enrichment of the top categories, "Human" and "Human, clinical."
[0333] Detecting significant GxE interactions
[0334] The weighted least squares model allows tests for significant differences in a variant’s fitness effects between conditions through a weighted t-test; all pairwise differences in fitness effects between conditions (e.g., 15 differences for variants measured in six conditions) were calculated for each variant. The p-values were adjusted for multiple testing using the Benjamini-Hochberg procedure and significant differences were controlled at FDR = 0.01 (ref. 49). At this threshold, none of the neutral/non-cutting variants exhibited GxE interactions.
[0335] Permutation test for nonrandom clustering of GxE promoter variants
[0336] In order to test whether the clustering we saw for promoter hits was more than would be expected by chance, we permuted the promoter hits 5000 times by choosing random promoter variants to be hits, choosing the same number of promoter hits as exist in the real dataset, and then performing the same cluster analysis for these permuted sets. We performed these permutations twice with two seeds. 
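The permutation scheme in this paragraph, together with the counting step that the text goes on to describe, can be sketched as follows. The cluster statistic used here (the number of hits lying within a fixed window of another hit on a single coordinate axis) is a hypothetical stand-in, since the cluster analysis itself is defined elsewhere in the document; the function names, the `window` parameter and the seeds are likewise assumptions.

```python
import random


def clustered_count(positions, window):
    """Count hits that lie within `window` bp of at least one other hit."""
    pos = sorted(positions)
    n = 0
    for i, p in enumerate(pos):
        near_left = i > 0 and p - pos[i - 1] <= window
        near_right = i + 1 < len(pos) and pos[i + 1] - p <= window
        if near_left or near_right:
            n += 1
    return n


def permutation_p_value(all_positions, hit_positions, window,
                        n_perm=5000, seed=0):
    """Fraction of random hit sets that cluster at least as much as observed."""
    rng = random.Random(seed)
    observed = clustered_count(hit_positions, window)
    k = len(hit_positions)
    at_least = 0
    for _ in range(n_perm):
        # Draw the same number of hits from all tested promoter variants.
        permuted = rng.sample(all_positions, k)
        if clustered_count(permuted, window) >= observed:
            at_least += 1
    return at_least / n_perm


def averaged_p_value(all_positions, hit_positions, window, seeds=(1, 2)):
    """Average the permutation p-values obtained with two seeds."""
    ps = [permutation_p_value(all_positions, hit_positions, window, seed=s)
          for s in seeds]
    return sum(ps) / len(ps)
```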
We then counted how many of these permuted data sets had greater than or equal to the number of genomic loci in clusters as the real data set to determine two permutation p-values, and took the average of these two. [0337] Clonal genotyping [0338] Single CRISPEY-BAR oligonucleotides containing partial sequence containing 5’ homology to pSAC200, 12 bp programmed barcode, restriction site region for cloning, 108 bp donor template sequence and 34 bp constant region were ordered from IDT as dsDNA
eblocks for individual validation of genotype-correct strains. The eblocks were amplified using primer #576 as forward primer and donor specific primers that append the 20 bp guide sequence and 3’ homology to pSAC200 to the eblocks. The resulting PCR products were bead purified and cloned into pSAC200, ligated with UMI-containing insert and transformed into yeast as described for library cloning above. The yeast transformants were induced for editing by culturing in 5 mL YNB -HIS -URA 2% raffinose media for 24 hr, passaged twice in 5 mL YNB -HIS -URA 2% galactose media for 24 hr each, then streaked out on YNB -URA 2% glucose (1.7 g/L yeast nitrogen base (RPI); 5 g/L Ammonium Sulfate (ACROS organics); 1.9 g Dropout synthetic mix minus uracil, w/o nitrogen base (US Biological) and 20 g/L glucose (Sigma)) 2% agar plates to obtain single edited clones. Plasmids were cured from edited clones by restreaking on YNB 2% glucose 2% agar plates with 1 g/L 5-Fluoroorotic acid monohydrate (GoldBio). The single plasmid-cured colonies were amplified by growing in YNB 2% glucose media overnight and stored in YNB 2% glucose media with 15% glycerol at -80°C.
[0339] Colonies were streaked out from the frozen stock and lysed with Zymolyase 20T (US Biological) solution in 50 mM potassium phosphate buffer, pH 7.5. Cell lysates were used for genotyping using EmeraldAmp MAX PCR Mastermix (Takara), with primers #261 and #262 for determining the barcode-UMI sequence, and with locus-specific primers. PCR cycles had an initial denaturation of 95°C for 2 min; then 35 cycles of 95°C for 10 s, 60°C for 15 s, 72°C for 20 s; then a final extension of 72°C for 5 min. PCR products were purified, Sanger sequenced and aligned to the reference genome using SGD BLAST to confirm the intended genotype50,51. For the QTL pools, colonies were randomly picked from edited cells plated on non-selective media after plasmid removal. 
Genomic amplicons of loci containing the associated variant edit were Sanger sequenced from barcoded colonies to calculate the editing rates shown in Fig. 1d.
[0340] qRT-PCR
[0341] Strains containing the Sanger sequencing-verified genotypes were thawed from frozen stock and grown overnight in 5 mL YNB 2% glucose media. 0.5 mL of the overnight culture was passaged to 50 mL YNB 2% glucose media with or without 30 mg/mL lovastatin. Cells were harvested after 5 generations of growth in media, approximately 12 hr after passaging. RNA was extracted by vortexing with 500 uL glass beads and 1 mL Trizol (Invitrogen) following the manufacturer’s instructions. 8 ug total RNA of each sample was digested with RQ1 DNase for 1 hr at 37°C following the manufacturer’s instructions and purified by overnight ethanol precipitation. 400 ng of the purified RNA from each sample was converted to cDNA using the Superscript IV First-Strand Synthesis system following the manufacturer’s instructions. qPCR was performed as described previously18.
[0342] Fitness validation
[0343] Strains containing the Sanger sequencing-verified genotypes were thawed from frozen stock, grown overnight in SC media and mixed with the GFP control strain in 1 mL SC media with specified conditions in a 96-well plate. Cells were passaged every 12 hr and diluted to fresh 1 mL SC media with specified conditions. Six timepoints (T1-T6) were harvested once passage was complete. Harvested cells were spun down, resuspended in 1x DPBS (Gibco), stored at 4°C and assayed by flow cytometry within 12 hr post-harvest. Generation time was estimated by measuring OD600 of the culture containing ZRS111 and GFP control strain at every time point. Competition for each edited strain against the GFP control strain was replicated four times in four different wells, to control for spontaneous mutation during competition.
[0344] Ratios between each edited strain and the GFP control strain were determined by flow cytometry assay, using an Attune NxT Flow Cytometer and Autosampler (ThermoFisher Scientific). GFP was detected using a 530 nm band-pass filter (BL1) with a 488 nm laser. The channel voltages were adapted from a previous study and set as follows: FSC: 200; SSC: 320; and BL1: 480 (ref. 41). A threshold for FSC of 2.5 x 10^3 A.U. was applied to exclude non-yeast events. Data analysis was performed using Attune NxT Software v2.7. 
Doublets were removed by FSC gating and cell counts for GFP control strain were determined by BL1 gating and the remaining cells were counted as the non-fluorescent, corresponding to edited strains. Samples with fewer than 500 total cells gated, as well as samples with cell counts of less than 3 for either GFP or edited strains, were excluded. Log2 ratios between edited strain count and GFP control strain count were calculated for each sample and fitted to a slope for the estimated generations within each replicate. The slopes were normalized by subtracting the slope calculated by the competition of a non-variant edit, barcode-only control to the GFP
control strain in the same replicate. Finally, the mean and standard error for slopes across four replicates were calculated for each edited strain, representing pairwise fitness values.
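The slope-based fitness calculation of paragraph [0344] can be sketched as follows, under the assumption that an ordinary least-squares line is fitted to the log2 ratios against generations within each replicate. The function names are hypothetical and the gating itself (performed in the cytometer software) is not reproduced; only the count filter, the slope, the control normalization and the summary across replicates are shown.

```python
import math
from statistics import mean, stdev


def passes_cell_filter(gfp_count, edited_count, min_total=500, min_each=3):
    """Exclude samples with < 500 gated cells or < 3 cells of either type."""
    return (gfp_count + edited_count >= min_total
            and gfp_count >= min_each
            and edited_count >= min_each)


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def replicate_fitness(generations, edited_counts, gfp_counts, control_slope):
    """Log2(edited / GFP) slope per generation, minus the control slope.

    control_slope is the slope of the barcode-only control against the GFP
    control strain measured in the same replicate.
    """
    log_ratios = [math.log2(e / g) for e, g in zip(edited_counts, gfp_counts)]
    return slope(generations, log_ratios) - control_slope


def summarize(replicate_values):
    """Mean and standard error of the mean across replicates."""
    sem = stdev(replicate_values) / math.sqrt(len(replicate_values))
    return mean(replicate_values), sem
```

Calling `summarize` on the four normalized replicate slopes of one edited strain gives the pairwise fitness value and its standard error.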
References: 1. Grishkevich, V. & Yanai, I. The genomic determinants of genotype × environment interactions in gene expression. Trends Genet. TIG 29, 479–487 (2013). 2. Tishkoff, S. A. et al. Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet.39, 31–40 (2007). 3. Luzzatto, L. Sickle cell anaemia and malaria. Mediterr. J. Hematol. Infect. Dis. 4, e2012065 (2012). 4. Cardinale, S. & Arkin, A. P. Contextualizing context for synthetic biology - identifying causes of failure of synthetic biological systems. Biotechnol. J.7, 856–866 (2012). 5. Via, S. & Lande, R. GENOTYPE-ENVIRONMENT INTERACTION AND THE EVOLUTION OF PHENOTYPIC PLASTICITY. Evolution 39, 505–522 (1985). 6. Li, J., Li, X., Zhang, S. & Snyder, M. Gene-Environment Interaction in the Era of Precision Medicine. Cell 177, 38–44 (2019). 7. Peter, J. et al. Genome evolution across 1,011 Saccharomyces cerevisiae isolates. Nature 556, 339–344 (2018). 8. Smith, E. N. & Kruglyak, L. Gene-environment interaction in yeast gene expression. PLoS Biol.6, e83 (2008). 9. Ehrenreich, I. M. et al. Genetic architecture of highly complex chemical resistance traits across four yeast strains. PLoS Genet.8, e1002570 (2012). 10. Bloom, J. S., Ehrenreich, I. M., Loo, W. T., Lite, T.-L. V. & Kruglyak, L. Finding the sources of missing heritability in a yeast cross. Nature 494, 234–237 (2013). 11. Bloom, J. S. et al. Rare variants contribute disproportionately to quantitative trait variation in yeast. eLife 8, e49212 (2019). 12. She, R. & Jarosz, D. F. Mapping Causal Variants with Single-Nucleotide Resolution Reveals Biochemical Drivers of Phenotypic Change. Cell 172, 478-490.e15 (2018). 13. Nguyen Ba, A. N. et al. Barcoded bulk QTL mapping reveals highly polygenic and epistatic architecture of complex traits in yeast. eLife 11, e73983 (2022). 14. Rockman, M. V. The QTN program and the alleles that matter for evolution: all that’s gold does not glitter. Evol. Int. J. Org. Evol.66, 1–17 (2012). 15. 
Jones, G. M. et al. A systematic library for comprehensive overexpression screens in Saccharomyces cerevisiae. Nat. Methods 5, 239–241 (2008). 16. Hillenmeyer, M. E. et al. The Chemical Genomic Portrait of Yeast: Uncovering a Phenotype for All Genes. Science 320, 362–365 (2008). 17. Steinmetz, L. M. et al. Dissecting the architecture of a quantitative trait locus in yeast. Nature 416, 326–330 (2002). 18. Sharon, E. et al. Functional Genetic Variants Revealed by Massively Parallel Precise Genome Editing. Cell 175, 544-557.e16 (2018). 19. Lee, Y. W., Gould, B. A. & Stinchcombe, J. R. Identifying the genes underlying quantitative traits: a rationale for the QTN programme. AoB PLANTS 6, plu004 (2014). 20. Riccitelli, N. J., Delwart, E. & Lupták, A. Identification of minimal HDV-like ribozymes with unique divalent metal ion dependence in the human microbiome. Biochemistry 53, 1616–1626 (2014). 21. Kim, H. et al. A co-CRISPR strategy for efficient genome editing in Caenorhabditis elegans. Genetics 197, 1069–1080 (2014).
22. Balzi, E., Wang, M., Leterme, S., Van Dyck, L. & Goffeau, A. PDR5, a novel yeast multidrug resistance conferring transporter controlled by the transcription regulator PDR1. J. Biol. Chem.269, 2206–2214 (1994). 23. Harris, A. et al. Structure and efflux mechanism of the yeast pleiotropic drug resistance transporter Pdr5. Nat. Commun.12, 5254 (2021). 24. Rodrigues, M. L. The Multifunctional Fungal Ergosterol. mBio 9, e01755-18 (2018). 25. Bhattacharya, S., Esquivel, B. D. & White, T. C. Overexpression or Deletion of Ergosterol Biosynthesis Genes Alters Doubling Time, Response to Stress Agents, and Drug Susceptibility in Saccharomyces cerevisiae. mBio 9, e01291-18 (2018). 26. Kern, A. F. et al. Divergent patterns of selection on metabolite levels and gene expression. BMC Ecol. Evol.21, 185 (2021). 27. Lum, P. Y. et al. Discovering modes of action for therapeutic compounds using a genome-wide screen of yeast heterozygotes. Cell 116, 121–137 (2004). 28. Rine, J., Hansen, W., Hardeman, E. & Davis, R. W. Targeted selection of recombinant clones through gene dosage effects. Proc. Natl. Acad. Sci. U. S. A.80, 6750– 6754 (1983). 29. Jandrositz, A., Turnowsky, F. & Högenauer, G. The gene encoding squalene epoxidase from Saccharomyces cerevisiae: cloning and characterization. Gene 107, 155–160 (1991). 30. Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99–104 (2004). 31. Griffith, O. L. et al. ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res.36, D107-113 (2008). 32. Pachkov, M., Balwierz, P. J., Arnold, P., Ozonov, E. & van Nimwegen, E. SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic Acids Res.41, D214-220 (2013). 33. Mannhaupt, G., Schnall, R., Karpov, V., Vetter, I. & Feldmann, H. Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast. 
FEBS Lett.450, 27–34 (1999). 34. Zhao, B., Chen, S.-A. A., Lee, J. & Fraser, H. B. Bacterial Retrons Enable Precise Gene Editing in Human Cells. CRISPR J.5, 31–39 (2022). 35. Nishimasu, H. et al. Engineered CRISPR-Cas9 nuclease with expanded targeting space. Science 361, 1259–1262 (2018). 36. Legut, M. et al. High-Throughput Screens of PAM-Flexible Cas9 Variants for Gene Knockout and Transcriptional Modulation. Cell Rep.30, 2859-2868.e5 (2020). 37. Hu, J. H. et al. Evolved Cas9 variants with broad PAM compatibility and high DNA specificity. Nature 556, 57–63 (2018). 38. Jaffe, M. et al. Improved discovery of genetic interactions using CRISPRiSeq across multiple environments. Genome Res.29, 668–681 (2019). 39. Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, aaf1420 (2016). 40. Costanzo, M. et al. Environmental robustness of the global yeast genetic interaction network. Science 372, eabf8424 (2021). 41. Richardson, C. D., Ray, G. J., DeWitt, M. A., Curie, G. L. & Corn, J. E. Enhancing homology-directed genome editing by catalytically active and inactive CRISPR-Cas9 using asymmetric donor DNA. Nat. Biotechnol.34, 339–344 (2016).
42. Bystrykh, L. V. Generalized DNA Barcode Design Based on Hamming Codes. PLOS ONE 7, e36852 (2012). 43. Koressaar, T. & Remm, M. Enhancements and modifications of primer design program Primer3. Bioinforma. Oxf. Engl.23, 1289–1291 (2007). 44. Bao, Z. et al. Genome-scale engineering of Saccharomyces cerevisiae with single-nucleotide precision. Nat. Biotechnol.36, 505–508 (2018). 45. Gietz, R. D. & Schiestl, R. H. High-efficiency yeast transformation using the LiAc/SS carrier DNA/PEG method. Nat. Protoc.2, 31–34 (2007). 46. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011). 47. Magoč, T. & Salzberg, S. L. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinforma. Oxf. Engl.27, 2957–2963 (2011). 48. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol.15, 550 (2014). 49. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol.57, 289–300 (1995). 50. Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res.40, D700-705 (2012). 51. Engel, S. R. et al. The reference genome sequence of Saccharomyces cerevisiae: then and now. G3 Bethesda Md 4, 389–398 (2014).
[0345] It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims. The disclosure encompasses all combinations of the particular embodiments recited herein, as if each combination had been individually and laboriously recited. 
All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.
EXEMPLARY EMBODIMENTS
[0346] Exemplary embodiments provided in accordance with the presently disclosed subject matter include, but are not limited to, the claims and the following embodiments: 1. A retron-guide RNA cassette comprising: (a) a first retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a first donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a first target locus; and (v) a second inverted repeat sequence coding region; and (b) a first guide RNA (gRNA) coding region; (c) a second retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a second donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence; and (v) a second inverted repeat sequence coding region; and (d) a second guide RNA (gRNA) coding region. 2. The cassette of embodiment 1, wherein the first target locus is located in trans to the second target locus. 3. The cassette of embodiment 1 or 2, wherein the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit. 4. The cassette of any one of embodiments 1 to 3, wherein the first donor DNA sequence comprises a genetic variant compared to the sequences within the first target locus.
5. The cassette of embodiment 4, wherein the genetic variant comprises a trans-expression quantitative trait locus (eQTL) variant at the first target locus. 6. The cassette of embodiment 1, wherein the first target locus is located in cis to the second target locus. 7. The cassette of embodiment 6, wherein the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of the transcription unit. 8. The cassette of embodiments 6 or 7, wherein the first donor DNA sequence comprises a genetic variant relative to the sequence at the first target locus. 9. The cassette of embodiment 8, wherein the genetic variant comprises a cis-eQTL variant at the first target locus. 10. The cassette of any one of embodiments 4 to 9, wherein the genetic variant comprises a variant that increases or decreases gene expression in cis or trans. 11. The cassette of embodiment 1, wherein the second target locus is i) an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene. 12. The cassette of any one of embodiments 1 to 11, wherein the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker. 13. The cassette of any one of embodiments 1 to 12, wherein the first or second gRNA coding region is upstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 5’ of the RNA transcribed from the retron. 14. The cassette of any one of embodiments 1 to 12, wherein the first or second gRNA coding region is downstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 3’ of the RNA transcribed from the retron.
15. The cassette of any one of embodiments 1 to 13, further comprising one or more ribozyme sequences. 16. The cassette of embodiment 15, wherein the first and second retrons are connected by a self-cleaving ribozyme sequence. 17. The cassette of embodiments 15 or 16, wherein the ribozyme sequence encodes a ribozyme selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-CIV-1, drz-Spur-3, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead, and combinations thereof. 18. The cassette of any one of embodiments 15 to 17, wherein the one or more ribozyme sequences are different from each other. 19. A vector comprising the cassette of any one of embodiments 1 to 18. 20. A method for identifying a genetic modification at a target locus in a host cell, the method comprising: (a) transforming the host cell with a vector of embodiment 19; (b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region, wherein the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target locus and comprise sequence modifications compared to the sequences within the first target locus,
wherein the first target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the first gRNA, wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the first target locus, wherein at least a portion of the second retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the second target locus, wherein the second target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the second gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert a unique barcode sequence at the second target locus; and (c) detecting the presence of the unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the first target locus, thereby identifying the genetic modification at the first target locus. 21. The method of embodiment 20, wherein the first target locus is located in trans to the second target locus. 22. The method of embodiments 20 or 21, wherein the first target locus is located in a trans-regulatory element, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of a transcription unit. 23. The method of any one of embodiments 20 to 22, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus. 24. 
The method of embodiment 23, wherein the genetic variant comprises a trans-eQTL variant at the first target locus.
25. The method of embodiment 20, wherein the first target locus is located in cis to the second target locus. 26. The method of embodiment 25, wherein the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of the transcription unit. 27. The method of any one of embodiments 20 to 26, wherein the first and/or second target locus is located in a non-coding intergenic region in the host cell genomic DNA. 28. The method of any one of embodiments 25 or 26, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus. 29. The method of embodiment 28, wherein the genetic variant comprises a cis-eQTL variant at the first target locus. 30. The method of any one of embodiments 20 to 29, wherein the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker. 31. The method of any one of embodiments 20 to 30, wherein detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence. 32. The method of any one of embodiments 20 to 31, wherein the vector is no longer present in the host cell when detecting the presence of the unique barcode sequence. 33. The method of any one of embodiments 20 to 32, wherein greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the first target locus. 34. The method of any one of embodiments 20 to 33, further comprising: (d) transforming the host cell with a second vector comprising a second retron-guide RNA cassette comprising:
a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region; a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target locus and a second unique barcode sequence; and (v) a second inverted repeat sequence coding region; and a fourth guide RNA (gRNA) coding region; (e) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a third retron donor DNA-guide molecule comprising a third retron transcript and the third gRNA coding region and a fourth retron donor DNA-guide molecule comprising a fourth retron transcript and the fourth gRNA coding region, wherein the third and fourth retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the third retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the third target locus and comprise sequence modifications compared to the sequences within the third target locus,
wherein the third target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the third gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the third target locus, wherein at least a portion of the fourth retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the fourth target locus, wherein the fourth target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the fourth gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert the second unique barcode sequence at the fourth target locus; and (f) detecting the presence of the second unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the third target locus, thereby identifying the genetic modification at the third target locus. 35. The method of embodiment 34, wherein the third target locus is located in trans to the fourth target locus. 36. The method of embodiments 34 or 35, wherein the third target locus is located in a trans-regulatory element, and the fourth target locus is located in the 3’ untranslated region (UTR) of a transcription unit. 37. The method of any one of embodiments 34 to 36, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus. 38. 
The method of embodiment 37, wherein the genetic variant comprises a trans-eQTL variant at the third target locus.
39. The method of embodiment 34, wherein the third target locus is located in cis to the fourth target locus. 40. The method of embodiment 39, wherein the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the transcription unit. 41. The method of any one of embodiments 39 or 40, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus. 42. The method of embodiment 41, wherein the genetic variant comprises a cis-eQTL variant at the first target locus. 43. The method of any one of embodiments 40 to 42, further comprising detecting the relative expression of transcription from the transcription units comprising genetic variants at the first and third target loci. 44. The method of any one of embodiments 34 to 43, wherein (i) the first and third gRNAs are the same; (ii) the first and third target loci are the same; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different. 45. The method of any one of embodiments 34 to 43, wherein (i) the first and third gRNAs are different; (ii) the first and third target loci are different; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different.
46. The method of any one of embodiments 20 to 45, wherein the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site. 47. The method of any one of embodiments 34 to 46, wherein greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the third target locus. 48. The method of any one of embodiments 34 to 47, further comprising detecting the presence of the unique barcode at the third target locus, thereby identifying the genetic modification at both the first and third target loci. 49. The method of any one of embodiments 34 to 48, further comprising repeating steps (d)-(f) with a third vector comprising a third retron-guide RNA cassette that inserts a genetic modification at a fifth target locus and a unique barcode sequence at a sixth target locus, thereby identifying the genetic modification at the fifth target locus. 50. The method of any one of embodiments 20 to 49, wherein the host cell is a prokaryotic cell. 51. The method of any one of embodiments 20 to 49, wherein the host cell is a eukaryotic cell. 52. The method of embodiment 51, wherein the eukaryotic cell is a yeast cell. 53. The method of embodiment 51, wherein the eukaryotic cell is a mammalian cell. 54. The method of any one of embodiments 50 to 53, wherein the host cell comprises a clonal population of host cells. 55. The method of embodiment 54, wherein the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells.
56. The method of any one of embodiments 20 to 49, comprising transforming a mixture of cells with one or more vectors comprising the first, second or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell. 57. The method of embodiment 56, further comprising detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette.
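The detecting step recited in the embodiments above, in which the presence of a unique barcode sequence indicates the presence of the paired genetic modification, can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the barcode sequences, the flanking sequences, the barcode length, and the names `BARCODE_TO_EDIT`, `extract_barcode`, and `count_edits` are hypothetical and do not come from the disclosure.

```python
from collections import Counter

# Hypothetical design table: each unique barcode was delivered in the same
# cassette as exactly one designed edit at the first target locus.
BARCODE_TO_EDIT = {
    "ACGTACGTACGT": "edit_1",
    "TTGACCAGTTGA": "edit_2",
}

# Hypothetical constant sequences flanking the barcode at the second target locus.
LEFT_FLANK = "GGCCA"
RIGHT_FLANK = "CTGCA"
BARCODE_LENGTH = 12


def extract_barcode(read):
    """Return the barcode found between the two flanks, or None if absent."""
    start = read.find(LEFT_FLANK)
    if start == -1:
        return None
    begin = start + len(LEFT_FLANK)
    end = begin + BARCODE_LENGTH
    # Require the right flank immediately after the barcode.
    if read[end:end + len(RIGHT_FLANK)] != RIGHT_FLANK:
        return None
    return read[begin:end]


def count_edits(reads):
    """Tally designed edits inferred from the barcodes observed in reads."""
    counts = Counter()
    for read in reads:
        edit = BARCODE_TO_EDIT.get(extract_barcode(read))
        if edit is not None:
            counts[edit] += 1
    return counts
```

A call such as `count_edits(list_of_read_strings)` returns a `Counter` keyed by edit label. Reads whose barcode is not in the design table are ignored in this sketch; a practical pipeline would likely also tolerate sequencing errors, which is outside the scope of this illustration.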
Informal Sequence Listing: All nucleotide sequences from 5’ to 3’.
SEQ ID NO. Sequence Description
1. TGCAGCCAAAGATGCGTGCCGTTGCAGTTAGCTAACAGGCCATG Primer #615
2. CTATGCTGTTTCCAGCATAGCTCTGAAAC Primer #576
3. CCTACACGACGCTCTTCCGATCTTGCAGCCAAAGATGCGTG Primer #617
4. CAAGCAGAAGACGGCATACGAGATTTCTGCCTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT Primer #337
5. CAAGCAGAAGACGGCATACGAGATGCTCAGGAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC Primer #338
6. CAAGCAGAAGACGGCATACGAGATAGGAGTCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT Primer #339
7. CAAGCAGAAGACGGCATACGAGATCATGCCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT Primer #340
8. AATGATACGGCGACCACCGAGATCTACACACTGCATAACACTCTTTCCCTACACGACGCTCTTCCGATCT Primer #341
9. AATGATACGGCGACCACCGAGATCTACACAAGGAGTAACACTCTTTCCCTACACGACGCTCTTCCGATCT Primer #342
10. AATGATACGGCGACCACCGAGATCTACACCTAAGCCTACACTCTTTCCCTACACGACGCTCTTCCGATCT Primer #343
11. ccATGCTAGCATCGATgcatgcATGTGGCTCAAgtttaaacTggccACCTGGCGTTCG Primer #591
12. ccATGCTAGCATCGATgcatgcCTGTGGCAAAAgtttaaacTggccACCTGGCGTTCG Primer #592
13. ccATGCTAGCATCGATgcatgcCAGAGGATCAAgtttaaacTggccACCTGGCGTTCG Primer #594
14. ccATGCTAGCATCGATgcatgcTAGAGGACTAAgtttaaacTggccACCTGGCGTTCG Primer #596
15. ccATGCTAGCATCGATgcatgcGTGTGATTCAAgtttaaacTggccACCTGGCGTTCG Primer #603
16. ccATGCTAGCATCGATgcatgcACGCGTGAAAAgtttaaacTggccACCTGGCGTTCG Primer #604
17. GTTAATAAGCAATTCCCCTGTGGCGCGCCAGGAAAACAGACAGTAACTCAGATTCAATGC Primer #590
18. CCTACACGACGCTCTTCCGATCTTATCTTATCTGATAAGGGGAAAAAGCC Primer #261
19. TTCAGACGTGTGCTCTTCCGATCTCTATGGCTTGCTGCAGATAAGG Primer #262 BAR-seq R4 long
20. TTCAGACGTGTGCTCTTCCGATCTCGCCAGGTGGCCAGTTTAAACTT Primer #327
21. TTCAGACGTGTGCTCTTCCGATCTCGGCCAGGTGGCCAGTTTAAACTT Primer #328
22. TTCAGACGTGTGCTCTTCCGATCTCGAGCCAGGTGGCCAGTTTAAACTT Primer #329
23. TTCAGACGTGTGCTCTTCCGATCTCGATGCCAGGTGGCCAGTTTAAACTT Primer #330
24. TTCAGACGTGTGCTCTTCCGATCTCGATCGCCAGGTGGCCAGTTTAAACTT Primer #331
25. TTCAGACGTGTGCTCTTCCGATCTCGATCGGCCAGGTGGCCAGTTTAAACTT Primer #332
26. TTCAGACGTGTGCTCTTCCGATCTCGATCGAGCCAGGTGGCCAGTTTAAACTT Primer #333
27. TTCAGACGTGTGCTCTTCCGATCTCGATCGATGCCAGGTGGCCAGTTTAAACTT Primer #334
28. GTGCCGTTGCAGTTAGCTAACAGGCCATGC Primer #354
29. (psac200-crispey-bar-vector: plasmid sequence, 5' -> 3') ATGATAATAATGGTTTCTTAGGACGGATCGCTTGCCTGTAACTTACACGCGCCT CGTATCTTTTAATGATGGAATAATT TGGGAATTTACTCTGTGTTTATTTATTTTTATGTTTTGTATTTGGATTTTAGAAAG TAAATAAAGAAGGTAGAAGAGTT ACGGAATGAAGAAAAAAAAATAAACAAAGGTTTAAAAAATTTCAACAAAAAG CGTACTTTACATATATATTTATTAGAC AAGAAAAGCAGATTAAATAGATATACATTCGATTAACGATAAGTAAAATGTAA AATCACAGGATTTTCGTGTGTGGTCT TCTACACAGACAAGATGAAACAATTCGGCATTAATACCTGAGAGCAGGAAGAG CAAGATAAAAGGTAGTATTTGTTGGC
GATCCCCCTAGAGTCTTTTACATCTTCGGAAAACAAAAACTATTTTTTCTTTAAT TTCTTTTTTTACTTTCTATTTTTA ATTTATATATTTATATTAAAAAATTTAAATTATAATTATTTTTATAGCACGTGAT GAAAAGGACCCAGGTGGCACTTTT CGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAAT ATGTATCCGCTCATGAGACAATAAC CCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATT TCCGTGTCGCCCTTATTCCCTTTTT TGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAA AGATGCTGAAGATCAGTTGGGTGCA CGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTT CGCCCCGAAGAACGTTTTCCAATGA TGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCG GGCAAGAGCAACTCGGTCGCCGCAT ACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCT TACGGATGGCATGACAGTAAGAGAA TTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTG ACAACGATCGGAGGACCGAAGGAGC TAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGG AACCGGAGCTGAATGAAGCCATACC AAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCA AACTATTAACTGGCGAACTACTTACT CTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGG ACCACTTCTGCGCTCGGCCCTTCCGG CTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTA TCATTGCAGCACTGGGGCCAGATGG TAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGA TGAACGAAATAGACAGATCGCTGAG ATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATAT ATACTTTAGATTGATTTAAAACTTC
ATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCA AAATCCCTTAACGTGAGTTTTCGTT CCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTT TTTTCTGCGCGTAATCTGCTGCTTG CAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCT ACCAACTCTTTTTCCGAAGGTAACTG GCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAG GCCACCACTTCAAGAACTCTGTAGC ACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGG CGATAAGTCGTGTCTTACCGGGTTG GACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGG TTCGTGCACACAGCCCAGCTTGGAGC GAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCC ACGCTTCCCGAAGGGAGAAAGGCGGA CAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTC CAGGGGGAAACGCCTGGTATCTTTAT AGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGT CAGGGGGGCGGAGCCTATGGAAAA ACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCA AGTAGAGGGGGTAATTTTTCCCCT TTATTTTGTTCATACATTCTTAAATTGCTTTGCCTCTCCTTTTGGAAAGCTATACT TCGGAGCACTGTTGAGCGAAGGC TCATTAGATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAGGGTC CAAAAAGCGCTCGGACAACTGTTGA CCGTGATCCGAAGGACTGGCTATACAGTGTTCACAAAATAGCCAAGCTGAAAA TAATGTGTAGCTATGTTCAGTTAGTT TGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATTATGC AGAGCATCAACATGATAAAAAAAAAC AGTTGAATATTCCCTCAAAAGAGGTGCTTGTAGATAACCTCCACGATGGTGCAC CTTGGGCAACACAAAAGTGGCAAAT
CATCTACAATGCGCACCCTTAGCGAGAGGTTTATCATTAAGGTCAACCTCTGGA TGTTGTTTCGGCATCCTGCATTGAA TCTGAGTTACTGTCTGTTTgaacTGCAGCCAAAGATGCGTGCCGTTGCAGTTAGCT AACAggccATGCggccgcgtttC agagctaTGCTGgaaaCAGCAtagcaagttGaaataaggctagtccgttatcaacttgaaaaagtggcaccgagtcggt gctttttGACACTGAGTGAGAAACGTCCCCGTCGTAGTGTCGGTAATGCGTTGTTTC AACGTAGCCAATTCTCACAAAG AAAGTGGAATATTCATTCATATCATATTTTTTCTATTAACTGCCTGGTTTCTTTT AAATTTTTTATTGGTTGTCGACTT GAACGGAGTGACAATATATATATATATATATTTAATAATGACATCATTATCTGT AAATCTGATTCTTAATGCTATTCTA GTTATGTAAGAGTGGTCCTTTCCATAAAAAAAAAAAAAAAGAAAAAAGAATTT TAGGAATACAATGCAGCTTGTAAGTA AAATCTGGAATATTCATATCGCCACAACTTCTTATGCTTATAAAAGCACTAATG CCTGAATTTATGTTGAAAATATGTG TCACAAATAAAGAAACTGTGACATCTGACACATTTCCACgtacccaattcgccctatagtgagt cgtattacgcgcgct cactggccgtcgttttacAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCG CCTTGCAGCACATCCCCC TTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACA GTTGCGCAGCCTGAATGGCGAATGG CGCGACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCG CAGCGTGACCGCTACACTTGCCAGCG CCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGC TTTCCCCGTCAAGCTCTAAATCG GGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAA ACTTGATTAGGGTGATGGTTCACGT AGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACG TTCTTTAATAGTGGACTCTTGTTCC
AAACTGGAACAACACTCAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGA TTTTGCCGATTTCGGCCTATTGGTT AAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAA CGTTTACAATTTCCTGATGCGGTATT TTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATAGGGTAATAACTGATAT AATTAAATTGAAGCTCTAATTTGTG AGTTTAGTATACATGCATTTACTTATAATACAGTTTTTTAGTTTTGCTGGCCGCA TCTTCTCAAATATGCTTCCCAGCC TGCTTTTCTGTAACGTTCACCCTCTACCTTAGCATCCCTTCCCTTTGCAAATAGT CCTCTTCCAACAATAATAATGTCA GATCCTGTAGAGACCACATCATCCACGGTTCTATACTGTTGACCCAATGCGTCT CCCTTGTCATCTAAACCCACACCGG GTGTCATAATCAACCAATCGTAACCTTCATCTCTTCCACCCATGTCTCTTTGAGC AATAAAGCCGATAACAAAATCTTT GTCGCTCTTCGCAATGTCAACAGTACCCTTAGTATATTCTCCAGTAGATAGGGA GCCCTTGCATGACAATTCTGCTAAC ATCAAAAGGCCTCTAGGTTCCTTTGTTACTTCTTCTGCCGCCTGCTTCAAACCGC TAACAATACCTGGGCCCACCACAC CGTGTGCATTCGTAATGTCTGCCCATTCTGCTATTCTGTATACACCCGCAGAGTA CTGCAATTTGACTGTATTACCAAT GTCAGCAAATTTTCTGTCTTCGAAGAGTAAAAAATTGTACTTGGCGGATAATGC CTTTAGCGGCTTAACTGTGCCCTCC ATGGAAAAATCAGTCAAGATATCCACATGTGTTTTTAGTAAACAAATTTTGGGA CCTAATGCTTCAACTAACTCCAGTA ATTCCTTGGTGGTACGAACATCCAATGAAGCACACAAGTTTGTTTGCTTTTCGT GCATGATATTAAATAGCTTGGCAGC AACAGGACTAGGATGAGTAGCAGCACGTTCCTTATATGTAGCTTTCGACATGAT TTATCTTCGTTTCCTGCAGGTTTTT GTTCTGTGCAGTTGGGTTAAGAATACTGGGCAATTTCATGTTTCTTCAACACTAC ATATGCGTATATATACCAATCTAA
GTCTGTGCTCCTTCCTTCGTTCTTCCTTCTGTTCGGAGATTACCGAATCAAAAAA ATTTCAAAGAAACCGAAATCAAAA AAAAGAATAAAAAAAAAATGATGAATTGAATTGAAAAGCTGTGGTATGGTGCA CTCTCAGTACAATCTGCTCTGATGCC GCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACG GGCTTGTCTGCTCCCGGCATCCGCTT ACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGT CATCACCGAAACGCGCGAGACGAAA GGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTC
30. (psac212-crispey-bar-ade2KO: plasmid sequence, 5' -> 3') ATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGA GCTGATTTAACAAAAATTTAACGC GAATTTTAACAAAATATTAACGTTTACAATTTCCTGATGCGGTATTTTCTCCTTA CGCATCTGTGCGGTATTTCACACC GCATAGGGTAATAACTGATATAATTAAATTGAAGCTCTAATTTGTGAGTTTAGT ATACATGCATTTACTTATAATACAG TTTTTTAGTTTTGCTGGCCGCATCTTCTCAAATATGCTTCCCAGCCTGCTTTTCTG TAACGTTCACCCTCTACCTTAGC ATCCCTTCCCTTTGCAAATAGTCCTCTTCCAACAATAATAATGTCAGATCCTGTA GAGACCACATCATCCACGGTTCTA TACTGTTGACCCAATGCGTCTCCCTTGTCATCTAAACCCACACCGGGTGTCATA ATCAACCAATCGTAACCTTCATCTC TTCCACCCATGTCTCTTTGAGCAATAAAGCCGATAACAAAATCTTTGTCGCTCTT CGCAATGTCAACAGTACCCTTAGT ATATTCTCCAGTAGATAGGGAGCCCTTGCATGACAATTCTGCTAACATCAAAAG GCCTCTAGGTTCCTTTGTTACTTCT TCTGCCGCCTGCTTCAAACCGCTAACAATACCTGGGCCCACCACACCGTGTGCA TTCGTAATGTCTGCCCATTCTGCTA TTCTGTATACACCCGCAGAGTACTGCAATTTGACTGTATTACCAATGTCAGCAA ATTTTCTGTCTTCGAAGAGTAAAAA
ATTGTACTTGGCGGATAATGCCTTTAGCGGCTTAACTGTGCCCTCCATGGAAAA ATCAGTCAAGATATCCACATGTGTT TTTAGTAAACAAATTTTGGGACCTAATGCTTCAACTAACTCCAGTAATTCCTTG GTGGTACGAACATCCAATGAAGCAC ACAAGTTTGTTTGCTTTTCGTGCATGATATTAAATAGCTTGGCAGCAACAGGAC TAGGATGAGTAGCAGCACGTTCCTT ATATGTAGCTTTCGACATGATTTATCTTCGTTTCCTGCAGGTTTTTGTTCTGTGC AGTTGGGTTAAGAATACTGGGCAA TTTCATGTTTCTTCAACACTACATATGCGTATATATACCAATCTAAGTCTGTGCT CCTTCCTTCGTTCTTCCTTCTGTT CGGAGATTACCGAATCAAAAAAATTTCAAAGAAACCGAAATCAAAAAAAAGA ATAAAAAAAAAATGATGAATTGAATTG AAAAGCTGTGGTATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTT AAGCCAGCCCCGACACCCGCCAACA CCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAA GCTGTGACCGTCTCCGGGAGCTGCA TGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGACGAAAGGGCCTC GTGATACGCCTATTTTTATAGGTTAA TGTCATGATAATAATGGTTTCTTAGGACGGATCGCTTGCCTGTAACTTACACGC GCCTCGTATCTTTTAATGATGGAAT AATTTGGGAATTTACTCTGTGTTTATTTATTTTTATGTTTTGTATTTGGATTTTAG AAAGTAAATAAAGAAGGTAGAAG AGTTACGGAATGAAGAAAAAAAAATAAACAAAGGTTTAAAAAATTTCAACAAA AAGCGTACTTTACATATATATTTATT AGACAAGAAAAGCAGATTAAATAGATATACATTCGATTAACGATAAGTAAAAT GTAAAATCACAGGATTTTCGTGTGTG GTCTTCTACACAGACAAGATGAAACAATTCGGCATTAATACCTGAGAGCAGGA AGAGCAAGATAAAAGGTAGTATTTGT TGGCGATCCCCCTAGAGTCTTTTACATCTTCGGAAAACAAAAACTATTTTTTCTT TAATTTCTTTTTTTACTTTCTATT
TTTAATTTATATATTTATATTAAAAAATTTAAATTATAATTATTTTTATAGCACG TGATGAAAAGGACCCAGGTGGCAC TTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCA AATATGTATCCGCTCATGAGACAA TAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCA ACATTTCCGTGTCGCCCTTATTCCCT TTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGT AAAAGATGCTGAAGATCAGTTGGG TGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGA GTTTTCGCCCCGAAGAACGTTTTCCA ATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGAC GCCGGGCAAGAGCAACTCGGTCGCC GCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGC ATCTTACGGATGGCATGACAGTAAG AGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACT TCTGACAACGATCGGAGGACCGAAG GAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGT TGGGAACCGGAGCTGAATGAAGCCA TACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTG CGCAAACTATTAACTGGCGAACTACT TACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGC AGGACCACTTCTGCGCTCGGCCCTT CCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGC GGTATCATTGCAGCACTGGGGCCAG ATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTA TGGATGAACGAAATAGACAGATCGC TGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTC ATATATACTTTAGATTGATTTAAAA CTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGA CCAAAATCCCTTAACGTGAGTTTT
CGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATC CTTTTTTTCTGCGCGTAATCTGCTG CTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAG AGCTACCAACTCTTTTTCCGAAGGTA ACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAG TTAGGCCACCACTTCAAGAACTCTG TAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCA GTGGCGATAAGTCGTGTCTTACCGG GTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGG GGGGTTCGTGCACACAGCCCAGCTTG GAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAG CGCCACGCTTCCCGAAGGGAGAAAGG CGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGA GCTTCCAGGGGGAAACGCCTGGTATCT TTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGC TCGTCAGGGGGGCGGAGCCTATGG AAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTG CTCAAGTAGAGGGGGTAATTTTTC CCCTTTATTTTGTTCATACATTCTTAAATTGCTTTGCCTCTCCTTTTGGAAAGCTA TACTTCGGAGCACTGTTGAGCGA AGGCTCATTAGATATATTTTCTGTCATTTTCCTTAACCCAAAAATAAGGGAAAG GGTCCAAAAAGCGCTCGGACAACTG TTGACCGTGATCCGAAGGACTGGCTATACAGTGTTCACAAAATAGCCAAGCTG AAAATAATGTGTAGCTATGTTCAGTT AGTTTGGCTAGCAAAGATATAAAAGCAGGTCGGAAATATTTATGGGCATTATTA TGCAGAGCATCAACATGATAAAAAA AAACAGTTGAATATTCCCTCAAAAGAGGTGCTTGTAGATAACCTCCACGATGGT GCACCTTGGGCAACACAAAAGTGGC AAATCATCTACAATGCGCACCCTTAGCGAGAGGTTTATCATTAAGGTCAACCTC TGGATGTTGTTTCGGCATCCTGCAT
TGAATCTGAGTTACTGTCTGTTTgaacTGCAGCCAAAGATGCGTGCCGTTGCAGTT AGCTAACAggccATGCTAGCATC GATgcatgcACCTGGCGTTCGGCGATCGCCATAAGAGATCTGCCAATTTTAAAtttAA ACCCGTTTCTTCTGACGTAAG GGTGCGCAGTTGCAGTTAGCTAACAACCgtttCagagctaTGCTGgaaaCAGCAtagcaagtt Gaaataaggctagtcc gttatcaacttgaaaaagtggcaccgagtcggtgctttttGATGGCCGGCATGGTCCCAGCCTCCTCGCT GGCGCCGGC TGGGCAACACCTTCGGGTGGCGAATGGGACTTATGCGCACCCTTAGCGAGAGG TTTATCATTAAGGTCAACCTCTGGAT GTTGTTTCGGCATCCTGCATTGAATCTGAGTTACTGTCTGTTTTCCTggcgcgccACA GGGGAATTGCTTATTAACGAA ATTGCCTGAAGGCCTCACAACTCTGGACATTATACCATTGATGCTTGCGTCACT TCAGGAAACCCGTTTCTTCTGACGT AAGGGTGCGCAATTAACGAAATTGCCCCAgtttCagagctaTGCTGgaaaCAGCAtagcaag ttGaaataaggctagtc cgttatcaacttgaaaaagtggcaccgagtcggtgctttttGACACTGAGTGAGAAACGTCCCCGTCGTA GTGTCGGTA ATGCGTTGTTTCAACGTAGCCAATTCTCACAAAGAAAGTGGAATATTCATTCAT ATCATATTTTTTCTATTAACTGCCT GGTTTCTTTTAAATTTTTTATTGGTTGTCGACTTGAACGGAGTGACAATATATAT ATATATATATTTAATAATGACATC ATTATCTGTAAATCTGATTCTTAATGCTATTCTAGTTATGTAAGAGTGGTCCTTT CCATAAAAAAAAAAAAAAAGAAAA AAGAATTTTAGGAATACAATGCAGCTTGTAAGTAAAATCTGGAATATTCATATC GCCACAACTTCTTATGCTTATAAAA GCACTAATGCCTGAATTTATGTTGAAAATATGTGTCACAAATAAAGAAACTGTG ACATCTGACACATTTCCACgtaccc aattcgccctatagtgagtcgtattacgcgcgctcactggccgtcgttttacAACGTCGTGACTGGGAAAACCCT GGCG
TTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATA GCGAAGAGGCCCGCACCGATCGCCC TTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCGACGCGCCCTGTAGCG GCGCATTAAGCGCGGCGGGTGTGGTG GTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTC GCTTTCTTCCCTTCCTTTCTCGCCA CGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCC GATTTAGTGCTTTACGGCACCTCGA CCCCAAAAAACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATA GACGGTTTTTCGCCCTTTGACGTTG GAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAAC CCTATCTCGGTCT.
31. Inverted repeat sequence: TGCGCACCCTTA
32. Second inverted repeat: TAAGGGTGCGCA
33. msr locus: ATGCGCACCCTTAGCGAGAGGTTTATCATTAAGGTCAACCTCTGGATGTTGTTTCGGCATCCTGCATTGAATCTGAGTTACT
34. msd locus, first retron: TCTGAGTTACTGTCTGTTTgaacTGTTGGAACGGAGAGCATCGCCTGATGCTCTCCGAGCCAACtttAAACCCGTTTcTTCTGAC
35. msd locus, second retron: TCTGAGTTACTGTCTGTTTTCCTTGTTGGAACGGAGAGCATCGCCTGATGCTCTCCGAGCCAACCAGGAAACCCGTTTcTTCTGAC
36. Msd sequence modified to accommodate RNA polymerase: TCTGAGTTACTGTCTGTTTcCCTTGTTGGAACGGAGAGCATCGCCTGATGCTCTCCGAGCCAACCAGGAAACCCGTTTcTTCTGAC
37. Replaceable region: TGTTGGAACGGAGAGCATCGCCTGATGCTCTCCGAGCCAAC
38. HDV ribozyme: GATGGCCGGCATGGTCCCAGCCTCCTCGCTGGCGCCGGCTGGGCAACACCTTCGGGTGGCGAATGGGACTTT
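Sequences 31 and 32 are listed as the two inverted repeats that flank the retron. A minimal Python sketch, using only the two sequences given in the listing, can check that they are reverse complements of each other, as would be expected for repeats that pair in the transcript. The helper name `reverse_complement` is illustrative and not part of the application.

```python
# Complement table for upper- and lower-case DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


inverted_repeat_1 = "TGCGCACCCTTA"  # sequence 31 in the listing
inverted_repeat_2 = "TAAGGGTGCGCA"  # sequence 32 in the listing

# The two repeats should be reverse complements of one another.
repeats_pair = reverse_complement(inverted_repeat_1) == inverted_repeat_2
print(repeats_pair)
```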
Claims
WHAT IS CLAIMED IS: 1. A retron-guide RNA cassette comprising: (a) a first retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a first donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a first target locus; and (v) a second inverted repeat sequence coding region; and (b) a first guide RNA (gRNA) coding region; (c) a second retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a second donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a second target locus and a unique barcode sequence; and (v) a second inverted repeat sequence coding region; and (d) a second guide RNA (gRNA) coding region.
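The element structure recited in claim 1 can be pictured as a small data model. The following Python sketch is illustrative only: the class names, field names, and all placeholder sequences are hypothetical, and the two validation checks simply mirror the claim language that each donor lies within its msd locus and that the second donor carries the barcode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retron:
    msr: str
    inverted_repeat_1: str
    msd: str
    donor: str  # donor DNA sequence, located within the msd locus
    inverted_repeat_2: str

    def __post_init__(self):
        if self.donor not in self.msd:
            raise ValueError("donor DNA sequence must lie within the msd locus")


@dataclass(frozen=True)
class RetronGuideCassette:
    first_retron: Retron   # donor homologous to the first target locus
    first_grna: str
    second_retron: Retron  # donor homologous to the second target locus
    second_grna: str
    barcode: str           # unique barcode carried by the second donor

    def __post_init__(self):
        if self.barcode not in self.second_retron.donor:
            raise ValueError("second donor must contain the barcode sequence")


# Placeholder sequences, invented for illustration only.
edit_donor = "ACGTACGTAC"
barcode = "GATTACAG"
barcode_donor = "AAAC" + barcode + "CCCT"

first = Retron("MSR1", "IR1", "GG" + edit_donor + "CC", edit_donor, "IR2")
second = Retron("MSR2", "IR1", "GG" + barcode_donor + "CC", barcode_donor, "IR2")
cassette = RetronGuideCassette(first, "GRNA1", second, "GRNA2", barcode)
```

Constructing a cassette whose second donor lacks the stated barcode raises `ValueError`, which makes the coupling between the barcode and the second retron explicit.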
2. The cassette of claim 1, wherein the first target locus is located in trans to the second target locus.
3. The cassette of claim 1, wherein the first target locus is located in a trans-regulatory element, and the second target locus is located in the 3’ untranslated region (UTR) of a transcription unit.
4. The cassette of claim 1, wherein the first donor DNA sequence comprises a genetic variant compared to the sequences within the first target locus.
5. The cassette of claim 4, wherein the genetic variant comprises a trans-expression quantitative trait locus (eQTL) variant at the first target locus.
6. The cassette of claim 1, wherein the first target locus is located in cis to the second target locus.
7. The cassette of claim 6, wherein the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of the transcription unit.
8. The cassette of claim 6, wherein the first donor DNA sequence comprises a genetic variant relative to the sequence at the first target locus.
9. The cassette of claim 8, wherein the genetic variant comprises a cis-eQTL variant at the first target locus.
10. The cassette of any one of claims 4 to 9, wherein the genetic variant comprises a variant that increases or decreases gene expression in cis or trans.
11. The cassette of claim 1, wherein the second target locus is i) an intron or ii) is not located in genomic sequences that regulate transcription or translation of a gene.
12. The cassette of claim 1, wherein the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
13. The cassette of claim 1, wherein the first or second gRNA coding region is upstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 5’ of the RNA transcribed from the retron.
14. The cassette of claim 1, wherein the first or second gRNA coding region is downstream of the first or second retron in the cassette such that transcription of the cassette results in a transcript in which the gRNA is 3’ of the RNA transcribed from the retron.
15. The cassette of claim 1, further comprising one or more ribozyme sequences.
16. The cassette of claim 15, wherein the first and second retrons are connected by a self-cleaving ribozyme sequence.
17. The cassette of claim 15, wherein the ribozyme sequence encodes a ribozyme selected from the group consisting of hepatitis delta virus (HDV) ribozyme, drz-CIV-1, drz-Spur-3, drz-Agam1-1, drzAgam1-2, drzPmar-1, Twister, Hammerhead, and combinations thereof.
18. The cassette of claim 15, wherein the one or more ribozyme sequences are different from each other.
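Claims 13, 14 and 16 describe the order of elements in the transcript: the gRNA may lie 5' or 3' of its retron RNA, and the two retron units may be joined by a self-cleaving ribozyme. A minimal Python sketch of that ordering follows. It assumes, as a simplification not required by the claims, that both units use the same orientation; the function name and element labels are hypothetical.

```python
def transcript_layout(grna_position: str, ribozyme: str = "HDV") -> list[str]:
    """List cassette elements 5' to 3' for a two-unit retron-gRNA transcript.

    grna_position: "upstream" puts each gRNA 5' of its retron (claim 13);
    "downstream" puts each gRNA 3' of its retron (claim 14).
    """
    if grna_position not in ("upstream", "downstream"):
        raise ValueError("grna_position must be 'upstream' or 'downstream'")

    def unit(n: int) -> list[str]:
        parts = [f"retron{n}", f"gRNA{n}"]
        return parts[::-1] if grna_position == "upstream" else parts

    # A self-cleaving ribozyme connects the two units (claim 16).
    return unit(1) + [f"{ribozyme} ribozyme"] + unit(2)


print(transcript_layout("downstream"))
```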
19. A vector comprising the cassette of claim 1.
20. A method for identifying a genetic modification at a target locus in a host cell, the method comprising: (a) transforming the host cell with a vector of claim 19; (b) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a first retron donor DNA-guide molecule comprising a first retron transcript and the first gRNA coding region and a second retron donor DNA-guide molecule comprising a second retron transcript and the second gRNA coding region, wherein the first and second retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the first retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the first target locus and comprise sequence modifications compared to the sequences within the first target locus, wherein the first target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the first gRNA, wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the first target locus,
wherein at least a portion of the second retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the second target locus, wherein the second target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the second gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert a unique barcode sequence at the second target locus; and (c) detecting the presence of the unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the first target locus, thereby identifying the genetic modification at the first target locus.
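Step (c) of claim 20 infers the edit at the first target locus from the barcode found at the second target locus. A minimal Python sketch of that lookup is given here, assuming sequencing reads are available as strings and that a barcode-to-edit table was recorded when the cassettes were built. The barcodes, edit names, and reads are invented for illustration, and matching is exact, with no tolerance for sequencing errors.

```python
from collections import Counter


def call_edits(reads, barcode_to_edit):
    """Count, per inferred edit, the reads that contain its barcode."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for barcode, edit in barcode_to_edit.items():
            if barcode in read:
                counts[edit] += 1
    return counts


# Hypothetical design table: barcode sequence -> edit it reports on.
barcode_to_edit = {"GATTACAG": "variant_A", "CCTTGGAA": "variant_B"}

# Hypothetical reads spanning the barcode insertion site.
reads = ["ttGATTACAGcc", "AACCTTGGAATT", "ACGTACGT", "GGGATTACAGAA"]

edit_counts = call_edits(reads, barcode_to_edit)
print(dict(edit_counts))
```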
21. The method of claim 20, wherein the first target locus is located in trans to the second target locus.
22. The method of claim 20, wherein the first target locus is located in a trans-regulatory element, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of a transcription unit.
23. The method of claim 20, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus.
24. The method of claim 23, wherein the genetic variant comprises a trans-eQTL variant at the first target locus.
25. The method of claim 20, wherein the first target locus is located in cis to the second target locus.
26. The method of claim 25, wherein the first target locus is located in a cis-regulatory element of a transcription unit, and the second target locus is located in a 5’ untranslated region, a protein coding region, or a 3’ untranslated region (UTR) of the transcription unit.
27. The method of claim 20, wherein the first and/or second target locus is located in a non-coding intergenic region in the host cell genomic DNA.
28. The method of claim 25, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the first target locus.
29. The method of claim 28, wherein the genetic variant comprises a cis-eQTL variant at the first target locus.
30. The method of claim 20, wherein the barcode sequence encodes a detectable molecule, a selectable marker, or a cell surface marker.
31. The method of claim 20, wherein detecting the presence of the unique barcode sequence comprises sequencing the genome of the host cell, or detecting a detectable molecule encoded by the barcode sequence.
32. The method of claim 20, wherein the vector is no longer present in the host cell when detecting the presence of the unique barcode sequence.
33. The method of claim 20, wherein greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the first target locus.
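The percentage in claim 33 is the share of host cells that carry both the barcode and the intended sequence modification. A minimal Python sketch of that arithmetic follows, assuming each cell has been genotyped at both loci. The function name and the example data are hypothetical.

```python
def co_edit_fraction(cells):
    """Fraction of cells carrying both the barcode and the intended edit.

    cells: iterable of (has_barcode, has_edit) boolean pairs, one per cell.
    """
    cells = list(cells)
    if not cells:
        raise ValueError("at least one genotyped cell is required")
    both = sum(1 for has_barcode, has_edit in cells if has_barcode and has_edit)
    return both / len(cells)


# Hypothetical genotypes: nine cells with both events, one with the barcode only.
genotypes = [(True, True)] * 9 + [(True, False)]
fraction = co_edit_fraction(genotypes)
print(fraction >= 0.50)  # compare against the lowest threshold recited
```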
34. The method of claim 20, further comprising: (d) transforming the host cell with a second vector comprising a second retron-guide RNA cassette comprising: a third retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) an msd locus; (iv) a third donor DNA sequence located within the msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a third target locus; and (v) a second inverted repeat sequence coding region; and a third guide RNA (gRNA) coding region;
a fourth retron comprising: (i) an msr locus; (ii) a first inverted repeat sequence coding region; (iii) a second msd locus; (iv) a fourth donor DNA sequence located within the second msd locus, wherein the donor DNA sequence comprises homology to one or more sequences within a fourth target locus and a second unique barcode sequence; and (v) a second inverted repeat sequence coding region; and a fourth guide RNA (gRNA) coding region; (e) culturing the host cell or transformed progeny of the host cell under conditions sufficient for expressing from the vector a third retron donor DNA-guide molecule comprising a third retron transcript and the third gRNA coding region and a fourth retron donor DNA-guide molecule comprising a fourth retron transcript and the fourth gRNA coding region, wherein the third and fourth retron transcripts self-prime reverse transcription by a reverse transcriptase (RT) expressed by the host cell or the transformed progeny of the host cell, wherein at least a portion of the third retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the third target locus and comprise sequence modifications compared to the sequences within the third target locus, wherein the third target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the third gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert, delete, and/or substitute one or more bases of the sequence of the one or more target nucleic acid sequences to induce one or more sequence modifications at the third target locus,
wherein at least a portion of the fourth retron transcript is reverse transcribed to produce a multicopy single-stranded DNA (msDNA) molecule having one or more donor DNA sequences, wherein the one or more donor DNA sequences are homologous to the fourth target locus,
wherein the fourth target locus is cut by a nuclease expressed by the host cell or transformed progeny of the host cell, wherein the site of nuclease cutting is specified by the fourth gRNA, and wherein the one or more donor DNA sequences recombine with one or more target nucleic acid sequences to insert the second unique barcode sequence at the fourth target locus; and (f) detecting the presence of the second unique barcode sequence, wherein the presence of the unique barcode sequence indicates the presence of the genetic modification at the third target locus, thereby identifying the genetic modification at the third target locus.
35. The method of claim 34, wherein the third target locus is located in trans to the fourth target locus.
36. The method of claim 34, wherein the third target locus is located in a trans-regulatory element, and the fourth target locus is located in the 3’ untranslated region (UTR) of a transcription unit.
37. The method of claim 34, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus.
38. The method of claim 37, wherein the genetic variant comprises a trans-eQTL variant at the third target locus.
39. The method of claim 34, wherein the third target locus is located in cis to the fourth target locus.
40. The method of claim 39, wherein the third target locus is located in a cis-regulatory element of a transcription unit, and the fourth target locus is located in the 3’ untranslated region (UTR) of the transcription unit.
41. The method of claim 39, wherein the one or more donor DNA sequences comprise a genetic variant compared to the sequences within the third target locus.
42. The method of claim 41, wherein the genetic variant comprises a cis-eQTL variant at the first target locus.
43. The method of claim 40, further comprising detecting the relative expression of transcription from the transcription units comprising genetic variants at the first and third target loci.
44. The method of claim 34, wherein (i) the first and third gRNAs are the same; (ii) the first and third target loci are the same; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different.
45. The method of claim 34, wherein (i) the first and third gRNAs are different; (ii) the first and third target loci are different; (iii) the genetic modification at the first and third loci is different; (iv) the second and fourth gRNAs are the same; (v) the second and fourth target loci are the same; and (vi) the barcode sequences inserted at the second and fourth target loci are different.
46. The method of claim 20, wherein the one or more donor DNA sequences comprise two homology arms, wherein each homology arm has at least about 70% to about 99% similarity to a portion of the sequence of the one or more target loci on either side of a nuclease cleavage site.
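Claim 46 bounds how closely each homology arm must match the target sequence flanking the cleavage site. The claim does not say how similarity is measured; the Python sketch below assumes the simplest reading, percent identity over an ungapped, equal-length comparison. The function name and both sequences are hypothetical.

```python
def percent_identity(arm: str, target_flank: str) -> float:
    """Percent of matching positions between two equal-length sequences.

    Assumption: an ungapped, position-by-position comparison. A real
    pipeline might use a gapped alignment instead.
    """
    if not arm or len(arm) != len(target_flank):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(arm.upper(), target_flank.upper()))
    return 100.0 * matches / len(arm)


# Hypothetical homology arm and target flank differing at one position.
arm = "ACGTACGTAC"
flank = "ACGTACGTAA"
identity = percent_identity(arm, flank)
print(identity >= 70.0)  # lower bound recited in the claim, read literally
```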
47. The method of claim 34, wherein greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the host cells comprise both the barcode sequence and the sequence modifications compared to the sequences within the third target locus.
48. The method of claim 34, further comprising detecting the presence of the unique barcode at the third target locus, thereby identifying the genetic modification at both the first and third target loci.
49. The method of claim 34, further comprising repeating steps (d)-(f) with a third vector comprising a third retron-guide RNA cassette that inserts a genetic modification at a fifth target locus and a unique barcode sequence at a sixth target locus, thereby identifying the genetic modification at the fifth target locus.
50. The method of claim 20, wherein the host cell is a prokaryotic cell.
51. The method of claim 20, wherein the host cell is a eukaryotic cell.
52. The method of claim 51, wherein the eukaryotic cell is a yeast cell.
53. The method of claim 51, wherein the eukaryotic cell is a mammalian cell.
54. The method of any one of claims 50 to 53, wherein the host cell comprises a clonal population of host cells.
55. The method of claim 54, wherein the genetic modifications are induced in greater than or equal to about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90% or 95% of the population of host cells.
56. The method of claim 20, comprising transforming a mixture of cells with one or more vectors comprising the first, second or third retron-guide RNA cassettes, and screening the transformed cells for a phenotypic change relative to an untransformed control cell.
57. The method of claim 56, further comprising detecting the presence of the genetic modification at the target locus or the presence of the unique barcode sequence present in each retron-guide RNA cassette.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263344470P | 2022-05-20 | 2022-05-20 | |
| US63/344,470 | 2022-05-20 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023225358A1 (en) | 2023-11-23 |
Family
ID=88836015
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/022989 Ceased WO2023225358A1 (en) | 2022-05-20 | 2023-05-19 | Generation and tracking of cells with precise edits |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2023225358A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025007020A1 (en) * | 2023-06-30 | 2025-01-02 | The J. David Gladstone Institutes, A Testamentary Trust Established Under The Will Of J. David Gladstone | Multiplexed retron genome editing in prokaryotic and eukaryotic genomes |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20050037487A1 (en) * | 2003-05-28 | 2005-02-17 | Yoshihiro Kawaoka | Recombinant influenza vectors with a PolII promoter and ribozymes for vaccines and gene therapy |
| US20130316339A1 (en) * | 2010-09-01 | 2013-11-28 | Orion Genomics Llc | Detection of nucleic acid sequences adjacent to repeated sequences |
| US20150184199A1 (en) * | 2013-12-19 | 2015-07-02 | Amyris, Inc. | Methods for genomic integration |
| US20190330619A1 (en) * | 2016-09-09 | 2019-10-31 | The Board Of Trustees Of The Leland Stanford Junior University | High-throughput precision genome editing |
| WO2020163779A1 (en) * | 2019-02-08 | 2020-08-13 | The Board Of Trustees Of The Leland Stanford Junior University | Production and tracking of engineered cells with combinatorial genetic modifications |
| US20200283780A1 (en) * | 2019-03-08 | 2020-09-10 | Zymergen Inc. | Iterative genome editing in microbes |
| US20210017530A1 (en) * | 2014-12-31 | 2021-01-21 | Synthetic Genomics, Inc. | RNA-Guided Endonuclease Expressing Algal Strain for High Efficiency In Vivo Genome Editing |
| WO2022007959A1 (en) * | 2020-07-10 | 2022-01-13 | 中国科学院动物研究所 | System and method for editing nucleic acid |
| US20220049226A1 (en) * | 2020-08-13 | 2022-02-17 | Sana Biotechnology, Inc. | Methods of treating sensitized patients with hypoimmunogenic cells, and associated methods and compositions |
Non-Patent Citations (1)
| Title |
|---|
| STRUNZ TOBIAS, GRASSMANN FELIX, GAYÁN JAVIER, NAHKURI SATU, SOUZA-COSTA DEBORA, MAUGEAIS CYRILLE, FAUSER SASCHA, NOGOCEKE EVERSON,: "A mega-analysis of expression quantitative trait loci (eQTL) provides insight into the regulatory architecture of gene expression variation in liver", SCIENTIFIC REPORTS, NATURE PUBLISHING GROUP, US, vol. 8, no. 1, US , XP093114130, ISSN: 2045-2322, DOI: 10.1038/s41598-018-24219-z * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23808411; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 23808411; Country of ref document: EP; Kind code of ref document: A1 |