EP4532704A2

EP4532704A2 - Novel nucleic acid-editing proteins

Info

Publication number: EP4532704A2
Application number: EP23729980.5A
Authority: EP
Inventors: Tyson David BOWEN; Lila Herk RIEBER
Original assignee: UCB Biopharma SRL
Current assignee: UCB Biopharma SRL
Priority date: 2022-05-26
Filing date: 2023-05-24
Publication date: 2025-04-09
Also published as: WO2023227669A2; WO2023227669A3; JP2025517515A

Abstract

The present invention provides novel base-editing systems comprising an RNA-guided nuclease domain and a novel cytidine deaminase domain. Novel uracil protecting peptides useful for base editing applications are also provided.

Description

NOVEL NUCLEIC ACID-EDITING PROTEINS

[001] The present invention relates to novel nucleic acid-editing proteins and base-editing systems comprising such.

BACKGROUND

[002] Targeted introduction of a specific modification into genomic DNA is a promising approach for the study of gene function and has the potential to provide new therapies for human genetic diseases. An ideal nucleic acid editing technology would introduce the desired modification with high efficiency, have minimal off-target activity, and be able to be guided to edit precisely any site within the genome.

[003] There are multiple genome-engineering tools available, including engineered zinc finger nucleases (ZFNs), transcription activator like effector nucleases (TALENs), and RNA-guided DNA endonucleases (RGNs). The programmable cleavage can result in mutation of the DNA at the cleavage site via non-homologous end joining (NHEJ) or replacement of the DNA surrounding the cleavage site via the homology-directed repair (HDR) process.

[004] One drawback to the current technologies is that both NHEJ and HDR typically result in modest gene editing efficiencies as well as unwanted gene alterations that can compete with the desired alteration. Since many genetic diseases in principle can be treated by effecting a specific nucleotide change at a specific location in the genome (for example, a C to T change in a specific codon of a gene associated with a disease), the development of a programmable way to achieve such precision gene editing would represent both a powerful new research tool, as well as a potential new approach to gene editing-based human therapeutics.

[005] The clustered regularly interspaced short palindromic repeat (CRISPR) system is a natural prokaryotic adaptive immune system that has been modified to enable robust and general genome engineering in a variety of organisms and cell lines. CRISPR-Cas (CRISPR associated) systems are protein-RNA complexes that use an RNA molecule (gRNA) as a guide to localize the complex to a target DNA sequence via base-pairing. In the natural systems, a Cas protein then acts as an endonuclease to cleave the targeted DNA sequence. The target DNA sequence must both be complementary to the sgRNA and contain a “protospacer-adjacent motif” (PAM) sequence at the end of the complementary region in order for the system to function. Among the known Cas proteins, S. pyogenes Cas9 has been most widely used as a tool for genome engineering. This Cas9 protein is a large, multi-domain protein containing two distinct nuclease domains. Point mutations can be introduced into Cas proteins to abolish nuclease activity, resulting in a dead Cas (dCas) or nickase (nCas) that still retains its ability to bind DNA guided by a gRNA. When fused to another protein or domain, dCas9 can target that protein to the DNA sequence of interest simply by co-expression with an appropriate gRNA.
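The targeting requirement described in this paragraph (a guide-matched region immediately followed by a PAM) can be illustrated with a minimal Python sketch. It assumes the NGG PAM and 20-nucleotide protospacer length commonly associated with S. pyogenes Cas9, neither of which is stated in this specification; the function name `find_target_sites` is hypothetical, and only the strand supplied is scanned.

```python
import re

def find_target_sites(dna, protospacer_len=20):
    """Return (start, protospacer, pam) tuples for the given strand.

    Assumption (not from the specification): the PAM is NGG and lies
    directly 3' of a protospacer of `protospacer_len` nucleotides.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]

# Illustrative input only: a 20-nt stretch followed by a TGG PAM.
sites = find_target_sites("A" * 20 + "TGG")
```

A real guide-design workflow would also scan the reverse complement and score candidate sites for off-target matches; both steps are omitted here.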

[006] Significantly, 80-90% of protein mutations responsible for human disease arise from the substitution, deletion, or insertion of only a single nucleotide. Current strategies for single-base gene correction include engineered nucleases, which rely on the creation of double-strand breaks (DSBs), followed by stochastic, inefficient homology-directed repair (HDR), and DNA-RNA chimeric oligonucleotides. The latter strategy involves the design of an RNA/DNA sequence to base pair with a specific sequence in genomic DNA at the position of the nucleotide to be edited. The resulting mismatch is recognized by the cell's endogenous repair system and fixed, leading to a change in the sequence. Both strategies suffer from low gene editing efficiencies and unwanted gene alterations, as they are subject to both the stochasticity of HDR and the competition between HDR and non-homologous end-joining, NHEJ. HDR efficiencies vary according to the location of the target gene within the genome, the state of the cell cycle, and the type of cell/tissue.

[007] US 10,167,457 discloses some example base editors and provides fusion peptides for targeted base editing.

[008] The development of a direct, programmable way to install a specific type of base modification at a precise location in genomic DNA with enzyme-like efficiency and no stochasticity represents a powerful new approach to gene editing-based research tools and human therapeutics.

SUMMARY OF THE INVENTION

[009] Some aspects of this disclosure provide methods, systems, reagents, and kits that are useful for the targeted editing of nucleic acids. In some embodiments, fusion proteins of an RNA-guided endonuclease domain and a cytidine deaminase domain are provided. In some embodiments, methods for targeted nucleic acid editing are provided. In some embodiments, reagents and kits for the generation of targeted nucleic acid editing proteins, e.g., fusion proteins of Cas and deaminase domains, are provided.

[010] Some aspects of this disclosure provide fusion proteins comprising (i) a nuclease-inactive RNA-guided endonuclease domain; and (ii) a cytidine deaminase domain. In some embodiments, the nucleic-acid-editing domain is fused to the N-terminus of the RNA-guided endonuclease domain. In some embodiments, the nucleic-acid-editing domain is fused to the C-terminus of the RNA-guided endonuclease domain. In some embodiments, the RNA-guided endonuclease domain and the nucleic-acid-editing domain are fused via a linker.

[011] Some aspects of this disclosure provide methods for DNA editing. In some embodiments, the methods comprise contacting a DNA molecule with (a) a fusion protein or protein complex comprising a nuclease-inactive RNA-guided endonuclease domain and a cytidine deaminase domain; and (b) a guide RNA targeting said fusion protein to a target nucleotide sequence; wherein the DNA molecule is contacted with the fusion protein or protein complex and the guide RNA in an amount effective and under conditions suitable for the deamination of a nucleotide base.

[012] In some embodiments, the target DNA sequence comprises a sequence associated with a disease or disorder, and wherein the deamination of the nucleotide base results in a sequence that is not associated with a disease or disorder. In some embodiments, the DNA sequence comprises a T>C point mutation. In some embodiments, the deamination corrects a point mutation in the sequence associated with the disease or disorder. In some embodiments, the sequence associated with the disease or disorder encodes a protein, and wherein the deamination introduces a stop codon or disrupts splicing of the sequence associated with the disease or disorder.
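The effect of a single deamination on a coding sequence, including the introduction of a stop codon mentioned in this paragraph, can be sketched as follows. This is a simplified model under stated assumptions: the edit is represented directly as a C-to-T change on the coding strand, the reading frame starts at index 0, and the open reading frame shown is an invented example rather than a sequence from this specification. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def deaminate(coding_strand, position):
    """Model a C-to-T edit at a 0-based position of the coding strand.

    Cytidine deamination yields uridine, which is read as thymidine
    after replication or repair; here that outcome is written directly.
    """
    if coding_strand[position] != "C":
        raise ValueError("target base is not a cytidine")
    return coding_strand[:position] + "T" + coding_strand[position + 1:]

def first_stop_codon(orf):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

# Hypothetical ORF: ATG CAG GGC. Editing the C of CAG gives TAG, a stop codon.
original = "ATGCAGGGC"
edited = deaminate(original, 3)
```

In this toy example the unedited sequence has no in-frame stop codon, while the edited sequence gains one at the second codon.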

[013] Some aspects of this disclosure provide kits comprising a nucleic acid construct that comprises a sequence encoding a nuclease-inactive RNA-guided endonuclease sequence, a sequence encoding a nucleic acid-editing enzyme or enzyme domain, such as cytidine deaminase, in-frame with the RNA-guided endonuclease-encoding sequence, and, optionally, a sequence encoding a linker positioned between the Cas encoding sequence and the cloning site. In addition, in some embodiments, the kit comprises suitable reagents, buffers, and/or instructions for use.

[014] Some aspects of this disclosure provide kits comprising a fusion protein comprising a nuclease-inactive RNA-guided endonuclease domain and a cytidine deaminase domain, and, optionally, a linker positioned between the RNA-guided endonuclease domain and the deaminase domain. In addition, in some embodiments, the kit comprises suitable reagents, buffers, and/or instructions for using the fusion protein, e.g., for in vitro or in vivo DNA or RNA editing. In some embodiments, the kit comprises instructions regarding the design and use of suitable gRNAs for targeted editing of a nucleic acid sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

[015] The present invention is described below by reference to the following figures.

[016] Figure 1 shows examples of the domain arrangement of the fusion proteins. NLS: nuclear localization signal sequence; NH₂: N-terminal end of the peptide sequence; COOH: C-terminal end of the peptide.

DETAILED DESCRIPTION OF THE INVENTION

Abbreviations

[017] Table 1. Abbreviations used throughout the specification

[018] Table 2. Amino acids abbreviations

Definitions

[019] The following definitions are used throughout the description.

[020] The term “Site-specific nuclease” as used herein refers to an enzyme capable of specifically recognizing and cleaving DNA sequences. The site-specific nuclease may be engineered. Examples of engineered site-specific nucleases include zinc finger nucleases (ZFNs), TAL effector nucleases (TALENs), and CRISPR/Cas9-based systems.

[021] The term "Transcription activator-like effector" or "TALE" as used herein refers to a protein that recognizes and binds to a particular DNA sequence. The "TALE DNA-binding domain" refers to a DNA-binding domain that includes an array of tandem 33-35 amino acid repeats, each of which specifically recognizes a single base pair of DNA. Such repeats may be arranged in any order to assemble an array that recognizes a specific sequence.

[022] The term "Transcription activator-like effector nucleases" or "TALENs" as used herein refers to fusion proteins of the catalytic domain of a nuclease, and a designed TALE DNA-binding domain that may be targeted to a custom DNA sequence.

[023] The term "Zinc finger" as used herein refers to a protein that contains a zinc finger domain and which recognizes and binds to DNA sequences. A single zinc finger contains approximately 30 amino acids and the domain typically functions by binding 3 consecutive base pairs of DNA via interactions of a single amino acid side chain per base pair.

[024] The term "Zinc finger nuclease" or "ZFN" as used herein refers to a chimeric protein molecule comprising at least one zinc finger DNA binding domain effectively linked to at least one nuclease or part of a nuclease capable of cleaving DNA when fully transcribed and assembled.

[025] The term “CRISPR” (Clustered Regularly Interspaced Short Palindromic Repeats) refers to a family of DNA sequences found in the genomes of prokaryotic organisms such as bacteria and archaea. These sequences are derived from DNA fragments of bacteriophages that had previously infected the prokaryote. They are used to detect and destroy DNA from similar bacteriophages during subsequent infections.

[026] The term "CRISPR system" refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated ("Cas") proteins, including sequences encoding a Cas protein, a tracr (trans -activating CRISPR) sequence (e.g. tracrRNA or an active partial tracrRNA), a tracr-mate sequence (containing a "direct repeat" and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred herein to as a "spacer" in the context of an endogenous CRISPR system), or other sequences and transcripts from a CRISPR locus.

[027] The term “Type II CRISPR system” refers to an effector system that carries out a targeted DNA double-strand break in four sequential steps, using a single effector enzyme, Cas9, to cleave dsDNA. Compared to the Type I and Type III effector systems, which require multiple distinct effectors acting as a complex, the Type II effector system may function in alternative contexts such as eukaryotic cells. The Type II effector system consists of a long pre-crRNA, which is transcribed from the spacer-containing CRISPR locus, the Cas9 protein, and a tracrRNA, which is involved in pre-crRNA processing.

[028] The term “nucleic acid guided DNA binding protein” refers to any protein that complexes with one or more nucleic acids that guide the binding of that protein to a specific region of a DNA. RNA-guided nucleases are an example of nucleic acid guided DNA binding proteins.

[029] The terms “RNA-guided endonuclease” and “RGN” are used interchangeably herein and refer to a nuclease that forms a complex with (e.g., binds or associates with) one or more RNA that is not a target for cleavage.

[030] The term “gRNA”, also used interchangeably herein as a chimeric single guide RNA (“sgRNA”), refers to a nucleic acid which is a fusion of two noncoding RNAs: a crRNA and a tracrRNA. “gRNA” is used interchangeably to refer to guide RNAs that exist as either single molecules or as a complex of two or more molecules. Typically, gRNAs that exist as single RNA species comprise two domains: (1) a domain that shares homology to a target nucleic acid (e.g., and directs binding of a Cas9 complex to the target); and (2) a domain that binds a Cas9 protein.

[031] The term “Cas9” refers to a type of RGN that cleaves nucleic acid, is encoded by the CRISPR loci, and is a part of the Type II CRISPR system. The Cas9 protein commonly used is from the bacterial species Streptococcus pyogenes. The Cas9 protein may be mutated so that the nuclease activity is partly or completely inactivated.

[032] The term “dCas9” refers to an inactivated Cas9 protein. Examples include dCas9 from Streptococcus pyogenes with no nuclease activity. As used herein, “dCas9” refers to a Cas9 protein that carries amino acid substitutions that inactivate its nuclease activity. For S. pyogenes Cas9, these mutations are D10A and H840A.

[033] The term “nCas9” refers to a Cas9 nickase domain or protein. The term “Cas9 nickase” refers to a modified version of Cas9 containing a single inactive catalytic domain, either RuvC- or HNH-. With only one active nuclease domain, the Cas9 nickase cuts only one strand of the target DNA, creating a single-strand break or “nick”. A Cas9 nickase is still able to bind DNA based on gRNA specificity, but nickases will only cut one of the DNA strands. As an example, nCas9 is derived from S. pyogenes, and the RuvC domain can be inactivated by an amino acid substitution at position D10 (e.g., D10A) and the HNH domain can be inactivated by an amino acid substitution at position H840 (e.g., H840A), or at positions corresponding to those amino acids in other proteins.

[034] The term “nicking”, as used herein, refers to a reaction that breaks the phosphodiester bond between two nucleotides in one strand of a double-stranded DNA molecule to produce a 3' hydroxyl group and a 5' phosphate group.

[035] The term “nucleic acid editing domain,” as used herein refers to a protein or enzyme capable of making one or more modifications (e.g., deamination of a cytidine residue) to a nucleic acid (e.g., DNA or RNA). Exemplary nucleic acid editing domains include, but are not limited to a deaminase, a nuclease, a nickase, a recombinase, a methyltransferase, a methylase, an acetylase, an acetyltransferase, a transcriptional activator, or a transcriptional repressor domain. In some embodiments the nucleic acid editing domain is a deaminase (e.g., a cytidine deaminase, such as an APOBEC or an AID deaminase).

[036] The term “deaminase” refers to an enzyme that catalyzes a deamination reaction. In some embodiments, the deaminase is a cytidine deaminase, catalyzing the hydrolytic deamination of cytidine or deoxycytidine to uridine or deoxyuridine, respectively.

[037] The term “linker,” as used herein, refers to a chemical group or a molecule linking two molecules or moieties, e.g., a binding domain and a cleavage domain of a nuclease. In some embodiments, a linker joins a gRNA binding domain of an RNA-programmable nuclease and the catalytic domain of a recombinase. In some embodiments, a linker joins a dCas9 and a recombinase. Typically, the linker is positioned between, or flanked by, two groups, molecules, or other moieties and connected to each one via a covalent bond, thus connecting the two. In some embodiments, the linker is an amino acid or a plurality of amino acids (e.g., a peptide or protein). In some embodiments, the linker is an organic molecule, group, polymer, or chemical moiety. In some embodiments, the linker is 5-100 amino acids in length, for example, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30-35, 35-40, 40-45, 45-50, 50-60, 60-70, 70-80, 80-90, 90-100, 100-150, or 150-200 amino acids in length. Longer or shorter linkers are also contemplated.

[038] As used herein, the terms "nucleic acid," "nucleic acid sequence," "nucleotide sequence," "oligonucleotide," and "polynucleotide" are interchangeable and refer to a polymeric form of nucleotides. The nucleotides may be deoxyribonucleotides (DNA), ribonucleotides (RNA), analogs thereof, or combinations thereof, and may be of any length. Polynucleotides may perform any function and may have any secondary and tertiary structures. The terms encompass known analogs of natural nucleotides and nucleotides that are modified in the base, sugar and/or phosphate moieties. Analogs of a particular nucleotide have the same base-pairing specificity (e.g., an analog of A base pairs with T). A polynucleotide may comprise one modified nucleotide or multiple modified nucleotides. Examples of modified nucleotides include fluorinated nucleotides, methylated nucleotides, and nucleotide analogs. Nucleotide structure may be modified before or after a polymer is assembled. Following polymerization, polynucleotides may be additionally modified via, for example, conjugation with a labeling component or target binding component. A nucleotide sequence may incorporate non-nucleotide components. The terms also encompass nucleic acids comprising modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, and have similar binding properties as a reference polynucleotide (e.g., DNA or RNA). Examples of such analogs include, but are not limited to, phosphorothioates, phosphoramidates, methyl phosphonates, chiral-methyl phosphonates, 2-O-methyl ribonucleotides, peptide-nucleic acids (PNAs), Locked Nucleic Acid (LNA™) (Exiqon, Inc., Woburn, MA) nucleosides, glycol nucleic acid, bridged nucleic acids, and morpholino structures. Polynucleotide sequences are displayed herein in the conventional 5' to 3' orientation unless otherwise indicated.

[039] As used herein, the terms "peptide," "polypeptide," and "protein" are interchangeable and refer to polymers of amino acids. A polypeptide may be of any length. It may be branched or linear, it may be interrupted by non-amino acids, and it may comprise modified amino acids. The terms may be used to refer to an amino acid polymer that has been modified through, for example, acetylation, disulfide bond formation, glycosylation, lipidation, phosphorylation, cross-linking, and/or conjugation (e.g., with a labeling component or ligand). Polypeptide sequences are displayed herein in the conventional N-terminal to C-terminal orientation. Polypeptides and polynucleotides can be made using routine techniques in the field of molecular biology (see, e.g., standard texts set forth above). Further, essentially any polypeptide or polynucleotide can be custom ordered from commercial sources.

[040] The terms “target region”, “target sequence” and “protospacer” as used interchangeably herein refer to the region of the target gene that the CRISPR-based system targets.

[041] The term “target site” refers to a sequence within a nucleic acid molecule that is deaminated by a deaminase or a fusion protein comprising a deaminase, (e.g., a RGN-cytidine deaminase fusion protein provided herein).

[042] The term “mutation,” as used herein, refers to a substitution of a residue within a sequence, e.g., a nucleic acid or amino acid sequence, with another residue, or a deletion or insertion of one or more residues within a sequence. Mutations are typically described herein by identifying the original residue followed by the position of the residue within the sequence and by the identity of the newly substituted residue. Various methods for making the amino acid substitutions (mutations) provided herein are well known in the art, and are provided by, for example, Green and Sambrook, Molecular Cloning: A Laboratory Manual (4th ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. (2012)).
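The substitution notation described in this paragraph (original residue, position, new residue, as in D10A) can be parsed and applied with a short Python sketch. The function name `apply_substitution` and the ten-residue peptide are illustrative assumptions and are not sequences disclosed in this specification; positions are taken to be 1-based.

```python
import re

def apply_substitution(protein, mutation):
    """Apply a substitution written as <original><position><new>, e.g. 'D10A'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError("unrecognized mutation format: %r" % mutation)
    original, position, new = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # convert the 1-based position to a 0-based index
    if not 0 <= index < len(protein):
        raise ValueError("position %d is outside the sequence" % position)
    if protein[index] != original:
        raise ValueError("expected %s at position %d, found %s"
                         % (original, position, protein[index]))
    return protein[:index] + new + protein[index + 1:]

# Hypothetical peptide with an aspartate (D) at position 10.
mutant = apply_substitution("MDKKYSIGLD", "D10A")
```

Checking the original residue before substituting guards against applying a mutation to the wrong numbering scheme or the wrong protein.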

[043] The term "complement" or "complementary" as used herein in reference to a nucleic acid can mean Watson-Crick or Hoogsteen base pairing between nucleotides or nucleotide analogs of nucleic acid molecules. The term "complementarity" refers to a property shared between two nucleic acid sequences, such that when they are aligned antiparallel to each other, the nucleotide bases at each position will be complementary.
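The antiparallel complementarity described in this paragraph can be checked with a minimal Python sketch. It covers only Watson-Crick pairing between unmodified DNA bases; Hoogsteen pairing and nucleotide analogs are outside its scope, and the function names are hypothetical.

```python
# Watson-Crick pairing for unmodified DNA bases only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def are_complementary(seq_a, seq_b):
    """True if the two 5'-to-3' sequences pair at every position when aligned antiparallel."""
    return reverse_complement(seq_a) == seq_b.upper()

pair_ok = are_complementary("ATGC", "GCAT")
```

Both inputs are written 5' to 3', matching the display convention used in this specification, so the second sequence is compared against the reverse complement of the first.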

[044] The term "promoter" as used herein means a synthetic or naturally-derived molecule which is capable of conferring, activating or enhancing expression of a nucleic acid in a cell. A promoter may comprise one or more specific transcriptional regulatory sequences to further enhance expression and/or to alter the spatial expression and/or temporal expression of same. A promoter may also comprise distal enhancer or repressor elements, which may be located as much as several thousand base pairs from the start site of transcription. A promoter may be derived from sources including viral, bacterial, fungal, plants, insects, and animals.

[045] The term "enhancer" as used herein refers to non-coding DNA sequences containing multiple activator and repressor binding sites. Enhancers range from 200 bp to 1 kb in length and may be either proximal, 5' upstream to the promoter or within the first intron of the regulated gene, or distal, in introns of neighboring genes or intergenic regions far away from the locus. Through DNA looping, active enhancers contact the promoter dependently of the core DNA binding motif promoter specificity. 4 to 5 enhancers may interact with a promoter.

[046] The term “operably linked” as used herein means that expression of a gene is under the control of a promoter with which it is spatially connected. A promoter may be positioned 5' (upstream) or 3' (downstream) of a gene under its control. The distance between the promoter and a gene may be approximately the same as the distance between that promoter and the gene it controls in the gene from which the promoter is derived. As is known in the art, variation in this distance may be accommodated without loss of promoter function.

[047] The term "vector" as used herein means a nucleic acid sequence containing an origin of replication. A vector may be a viral vector, bacteriophage, bacterial artificial chromosome or yeast artificial chromosome. A vector may be a DNA or RNA vector. A vector may be a self-replicating extrachromosomal vector, or a DNA plasmid.

[048] The term “effective amount,” as used herein, refers to an amount of a biologically active agent that is sufficient to elicit a desired biological response. For example, in some embodiments, an effective amount of a nuclease may refer to the amount of the nuclease that is sufficient to induce cleavage of a target site specifically bound and cleaved by the nuclease. In some embodiments, an effective amount of a recombinase may refer to the amount of the recombinase that is sufficient to induce recombination at a target site specifically bound and recombined by the recombinase. As will be appreciated by the skilled artisan, the effective amount of an agent, e.g., a nuclease, a recombinase, a hybrid protein, a fusion protein, a protein dimer, a complex of a protein (or protein dimer) and a polynucleotide, or a polynucleotide, may vary depending on various factors as, for example, on the desired biological response, the specific allele, genome, target site, cell, or tissue being targeted, and the agent being used.

[049] The term "adeno-associated virus" or "AAV" as used interchangeably herein refers to a small virus belonging to the genus Dependovirus of the Parvoviridae family that infects humans and some other primate species. AAV is not currently known to cause disease and consequently the virus causes a very mild immune response.

[050] The terms "subject" and "patient" as used herein interchangeably refer to any vertebrate, including, but not limited to, a mammal (e.g., cow, pig, camel, llama, horse, goat, rabbit, sheep, hamsters, guinea pig, cat, dog, rat, and mouse, a non-human primate (for example, a monkey, such as a cynomolgus or rhesus monkey, chimpanzee, etc.) and a human). In some embodiments, the subject may be a human or a non-human. The subject or patient may be undergoing other forms of treatment.

[051] As used herein, the terms “treatment,” “treat,” and “treating” refer to a clinical intervention aimed to reverse, alleviate, delay the onset of, or inhibit the progress of a disease or disorder, or one or more symptoms thereof, as described herein. In some embodiments, treatment may be administered after one or more symptoms have developed and/or after a disease has been diagnosed. In other embodiments, treatment may be administered in the absence of symptoms, e.g., to prevent or delay onset of a symptom or inhibit onset or progression of a disease. For example, treatment may be administered to a susceptible individual prior to the onset of symptoms (e.g., in light of a history of symptoms and/or in light of genetic or other susceptibility factors). Treatment may also be continued after symptoms have resolved, for example, to prevent or delay their recurrence.

[052] Unless otherwise defined herein, scientific and technical terms used in connection with the present disclosure shall have the meanings that are commonly understood by those of ordinary skill in the art.

Fusion proteins and protein complexes

[053] The present disclosure provides fusion proteins comprising (i) a site-specific nuclease domain; and (ii) a cytidine deaminase domain. Examples of site-specific nuclease domains are known to the skilled person and include zinc finger nucleases (ZFNs), TAL effector nucleases (TALENs), and CRISPR-based systems. Any nucleic acid guided DNA binding domain can be used as long as the domain is being guided to a specific point of interest within the target nucleic acid sequence. A CRISPR nuclease domain would be suitable for such purpose.

[054] The present disclosure more specifically provides fusion proteins comprising (i) a CRISPR nuclease domain and (ii) a cytidine deaminase domain. In some embodiments, the cytidine deaminase comprises one of the sequences described above. Suitable CRISPR nuclease domains are described herein, with Cas9 being commonly used. Hence, in a specific embodiment, the inactive CRISPR nuclease domain is a dCas9 domain. Alternative suitable site-specific nuclease domains will be apparent to the skilled artisan based on this disclosure. Preferably, the CRISPR nuclease domain is a CRISPR nickase domain or an inactive CRISPR nuclease domain.

[055] The disclosure provides CRISPR nuclease enzyme/domain fusion proteins with various configurations. In some embodiments, the cytidine deaminase enzyme or domain is fused to the N-terminus of the CRISPR nuclease domain. In some embodiments, the cytidine deaminase enzyme or domain is fused to the C-terminus of the CRISPR nuclease domain.

[056] In some embodiments, the general architecture of Cas fusion proteins provided herein comprises the structure:

• [NH₂]-[cytidine deaminase domain]-[inactive CRISPR nuclease domain]-[COOH],

• [NH₂]-[inactive CRISPR nuclease domain]-[cytidine deaminase domain]-[COOH],

• [NH₂]-[cytidine deaminase domain]-[CRISPR nickase domain]-[COOH], or

• [NH₂]-[CRISPR nickase domain]-[cytidine deaminase domain]-[COOH],

wherein NH₂ is the N-terminus of the fusion protein, COOH is the C-terminus of the fusion protein, and each “-” is an optional linker.

[057] In some embodiments, the general architecture of Cas fusion proteins comprises the structure:

• [NH₂]-[NLS]-[CRISPR nuclease domain]-[cytidine deaminase domain]-[COOH],

• [NH₂]-[NLS]-[cytidine deaminase domain]-[CRISPR nuclease domain]-[COOH],

• [NH₂]-[CRISPR nuclease domain]-[cytidine deaminase domain]-[COOH], or

• [NH₂]-[cytidine deaminase domain]-[CRISPR nuclease domain]-[COOH],

wherein NLS is a nuclear localization signal, NH₂ is the N-terminus of the fusion protein, COOH is the C-terminus of the fusion protein, and each “-” is an optional linker.
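The domain arrangements listed above can be modelled as an ordered list of named parts joined N-terminus to C-terminus, with an optional linker between neighbours. The following Python sketch is illustrative only: the function names are hypothetical, and the short strings standing in for each domain are placeholders, not sequences from this specification.

```python
def describe_architecture(order):
    """Render a domain order in the bracket notation used above."""
    return "-".join(["[NH2]"] + ["[%s]" % name for name in order] + ["[COOH]"])

def assemble_fusion(parts, order, linker=""):
    """Concatenate domain sequences N- to C-terminus.

    `linker` is inserted between each pair of adjacent domains;
    an empty string models a direct fusion with no linker.
    """
    return linker.join(parts[name] for name in order)

# Placeholder sequences (hypothetical, for illustration only).
parts = {"NLS": "AAA", "cytidine deaminase domain": "CCC", "CRISPR nickase domain": "DDD"}
order = ["NLS", "cytidine deaminase domain", "CRISPR nickase domain"]

label = describe_architecture(order)
sequence = assemble_fusion(parts, order, linker="GGGGS")
```

Keeping the order separate from the sequences makes it straightforward to generate both N-terminal and C-terminal deaminase configurations from the same set of parts.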

[058] In some embodiments, the NLS is located C-terminal of the cytidine deaminase and/or the CRISPR nuclease domain. In some embodiments, the NLS is located between the cytidine deaminase and the Cas domain. In some embodiments, multiple NLSs are present. Preferably, an NLS is present at both the C- and N-terminal ends of the cytidine deaminase and/or CRISPR nuclease domains.

[059] In some embodiments, the CRISPR nuclease domain and the cytidine deaminase domain are fused via a linker. In some embodiments the linker comprises the sequence SGGSSGGSSGSETPGTSESATPESSGGSSGGS (SEQ ID NO: 31). In some embodiments, the linker comprises a (GGGGS)_n (SEQ ID NO: 32), a (G)_n, an (EAAAK)_n (SEQ ID NO: 33), or an (XP)_n motif, or a combination of any of these, wherein n is independently an integer between 1 and 30. In some embodiments, n is independently 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30, or, if more than one linker or more than one linker motif is present, any combination thereof. Additional suitable linker motifs and linker configurations will be apparent to those of skill in the art. In some embodiments, suitable linker motifs and configurations include those described in Chen et al., Adv Drug Deliv Rev. 2013; 65(10): 1357-69. In some embodiments, fusion proteins as provided herein comprise the full-length amino acid sequence of a nucleic acid-editing enzyme, e.g., one of the sequences provided above. In other embodiments, however, fusion proteins as provided herein do not comprise a full-length sequence of a nucleic acid-editing enzyme, but only a fragment thereof. For example, in some embodiments, a fusion protein provided herein comprises a Cas9 domain and a fragment of a nucleic acid-editing enzyme, e.g., wherein the fragment comprises a nucleic acid-editing domain. Exemplary amino acid sequences of nucleic acid-editing domains are shown in the sequences above as italicized letters, and additional suitable sequences of such domains will be apparent to those of skill in the art.
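The repeat-motif linkers described in this paragraph, such as (GGGGS)_n and (EAAAK)_n with n between 1 and 30, can be generated with a short Python sketch. The function name `repeat_linker` is hypothetical, and the sketch treats "between 1 and 30" as inclusive, consistent with the values of n enumerated in the paragraph.

```python
def repeat_linker(motif, n):
    """Build a linker consisting of `motif` repeated n times, with 1 <= n <= 30."""
    if not 1 <= n <= 30:
        raise ValueError("n must be an integer between 1 and 30")
    return motif * n

def combine_linkers(*segments):
    """Concatenate several linker segments into one combined linker."""
    return "".join(segments)

flexible = repeat_linker("GGGGS", 3)   # three GGGGS repeats
rigid = repeat_linker("EAAAK", 2)      # two EAAAK repeats
mixed = combine_linkers(flexible, rigid)
```

The (XP)_n motif is not covered here because X stands for a variable residue that would have to be chosen before the sequence could be written out.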

[060] In some embodiments, additional features may be present. Such features could be one or more linker sequences between the NLS and the rest of the fusion protein and/or between the cytidine deaminase domain and the CRISPR nuclease domain. Other features such as, for example, nuclear localization sequences, cytoplasmic localization sequences, export sequences, such as nuclear export sequences, or other localization sequences, could be present. In some embodiments sequence tags could be present. Such tags are useful for solubilization, purification, or detection of the fusion proteins. Suitable localization signal sequences and protein tag sequences are provided herein, and include, but are not limited to, biotin carboxylase carrier protein (BCCP) tags, myc-tags, calmodulin-tags, FLAG-tags, hemagglutinin (HA)-tags, polyhistidine tags, also referred to as histidine tags or His-tags, maltose binding protein (MBP)-tags, nus-tags, glutathione-S-transferase (GST)-tags, green fluorescent protein (GFP)-tags, thioredoxin-tags, S-tags, Softags (e.g., Softag 1, Softag 3), strep-tags, biotin ligase tags, FlAsH tags, V5 tags, and SBP-tags. Additional suitable sequences will be apparent to those of skill in the art.

[061] Additional suitable nucleic-acid editing enzyme sequences, e.g., deaminase enzyme and domain sequences, that can be used according to aspects of this invention, e.g., that can be fused to a nuclease-inactive CRISPR associated domain, will be apparent to the skilled person based on this disclosure. In some embodiments, such additional enzyme sequences include cytidine deaminase domain sequences that are at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% similar to the sequences provided herein. Additional suitable CRISPR nuclease domains, variants, and sequences will also be apparent to those of skill in the art.

[062] In some embodiments, fusion proteins as provided herein comprise the full-length amino acid sequence of a cytidine deaminase, e.g., one of the sequences provided above. In other embodiments, however, fusion proteins as provided herein do not comprise a full-length sequence of a cytidine deaminase, but only a fragment thereof. For example, in some embodiments, a fusion protein provided herein comprises a CRISPR nuclease domain (such as, for example, a Cas9 domain) and a fragment of a cytidine deaminase, e.g., wherein the fragment comprises a cytidine deaminase domain. Additional suitable sequences of such domains will be apparent to those of skill in the art.

Deaminase domains

[063] Some aspects of this disclosure provide fusion proteins and protein complexes that comprise a cytidine deaminase domain. A cytidine deaminase domain is capable of catalyzing the hydrolytic deamination of cytidine or deoxycytidine to uridine or deoxyuridine, respectively. In some embodiments, the cytidine deaminase domain catalyzes the hydrolytic deamination of cytidine to uracil. In some embodiments, the cytidine deaminase or cytidine deaminase domain is a naturally occurring cytidine deaminase.

[064] The present disclosure provides novel cytidine deaminase domains that could be used in a fusion protein or a protein complex comprising the sequence of any one of SEQ ID NO: 1-13. The examples of the present disclosure demonstrate cytidine base editing of several cytidine deaminase sequences. CBE07, CBE08, CBE10, CBE11, CBE12, and CBE13 demonstrate active base editing ability, with CBE07, CBE11, CBE12, and CBE13 being most effective.

[065] Table 3. Cytidine deaminase domain preferred sequences

[066] In some embodiments, the cytidine deaminase or cytidine deaminase domain is a variant of a naturally occurring deaminase from an organism, which variant does not occur in nature. For example, in some embodiments, the cytidine deaminase or cytidine deaminase domain is at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to the deaminase domain of any one of SEQ ID NOs: 7, 8, 10, 11, 12, or 13.

[067] In some embodiments, the cytidine deaminase domain is at least 80%, at least 85%, at least 90%, at least 92%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%, or at least 99.5% identical to the deaminase domain of any one of SEQ ID NOs: 7, 8, 10, 11, 12, or 13. In some embodiments, the cytidine deaminase domain comprises the amino acid sequence of any one of SEQ ID NOs: 7, 8, 10, 11, 12, or 13.
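
As a non-limiting illustration of how a percent identity threshold of the kind recited above might be checked, the following Python sketch computes identity over a pairwise alignment that is assumed to have been produced already (equal-length strings, with "-" marking gaps). The function names and the choice of denominator (alignment columns containing at least one residue) are assumptions made for this example; the disclosure does not prescribe a particular alignment method or identity convention.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences.

    Assumes equal-length strings with '-' for gaps. Columns where both
    sequences have a gap are ignored; a gap opposite a residue counts as
    a mismatch. Other conventions for the denominator are also in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 80.0) -> bool:
    """True if the aligned pair reaches the given percent identity."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

For example, two aligned eight-residue toy sequences differing at one position would score 87.5% under this convention and so would satisfy an "at least 85%" threshold but not an "at least 90%" threshold.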

[068] Cytidine deaminases provided herein can be used for targeted editing of nucleic acid sequences. Such cytidine deaminases are useful for targeted editing of DNA in vitro, e.g., for the generation of mutant cells or animals; for the introduction of targeted mutations, e.g., for the correction of genetic defects in cells ex vivo, e.g., in cells obtained from a subject that are subsequently re-introduced into the same or another subject; and for the introduction of targeted mutations, e.g., the correction of genetic defects or the introduction of deactivating mutations in disease-associated genes in a subject.

[069] In some embodiments, the cytidine deaminase domain has mutations that reduce, but do not eliminate, the catalytic activity of the cytidine deaminase domain within a base editing fusion protein. Such mutations could make it less likely that the cytidine deaminase domain will catalyze the deamination of a residue adjacent to a target residue, thereby narrowing the deamination window. The ability to narrow the deamination window may help to prevent unwanted deamination of residues adjacent to specific target residues, which may help to decrease or prevent off-target effects.

[070] In some embodiments, any of the fusion proteins provided herein comprise a cytidine deaminase domain that has reduced catalytic deaminase activity. In some embodiments, any of the fusion proteins provided herein comprise a cytidine deaminase domain that has a reduced catalytic deaminase activity as compared to an appropriate control. For example, the appropriate control may be the deaminase activity of the cytidine deaminase prior to introducing one or more mutations into the cytidine deaminase. In other embodiments, the appropriate control may be a wild-type deaminase. In some embodiments, the appropriate control is a wild-type apolipoprotein B mRNA-editing complex (APOBEC) family deaminase. In some embodiments, the appropriate control is an APOBEC1 deaminase, an APOBEC2 deaminase, an APOBEC3A deaminase, an APOBEC3B deaminase, an APOBEC3C deaminase, an APOBEC3D deaminase, an APOBEC3F deaminase, an APOBEC3G deaminase, or an APOBEC3H deaminase. In some embodiments, the appropriate control is an activation induced deaminase (AID). In some embodiments, the deaminase domain may be a deaminase domain that has at least 1%, at least 5%, at least 15%, at least 20%, at least 25%, at least 30%, at least 40%, at least 50%, at least 60%, at least 70%, at least 80%, at least 90%, or at least 95% less catalytic deaminase activity as compared to an appropriate control.

Site specific nuclease domain

[071] Some aspects of this disclosure provide fusion proteins and protein complexes that comprise an RNA-guided endonuclease domain that binds to a guide RNA (gRNA or sgRNA), which, in turn, binds a target nucleic acid sequence via strand hybridization; and a cytidine deaminase domain that can deaminate a cytidine.

[072] Typically, the RNA-guided endonuclease domain of the fusion proteins described herein partially lacks nuclease activity or does not have any nuclease activity. Such a domain might be a fragment of a nuclease-inactive Cas9 protein or a dCas9 protein or domain. Typically, nCas9 (nickase Cas9) is used to cleave the RNA-bound (templating) strand of the DNA when the Cas9 is bound. This helps to direct the fusion protein (or the protein complex) to use the non-templating (displaced) strand, where the deaminase domain is acting, as the repair template, helping to fix the mutated cytosine in place with a base-paired adenosine before the uracil is removed.

[073] In some embodiments, the RNA-guided endonuclease fusion protein or protein complex comprises at least one nuclear localization signal, which permits entry of the endonuclease into the nuclei of eukaryotic cells. RNA-guided endonucleases also comprise at least one nuclease domain and at least one domain that interacts with a guide RNA. An RNA-guided endonuclease is directed to a specific nucleic acid sequence (or target site) by a guide RNA. The guide RNA interacts with the RNA-guided endonuclease as well as the target site such that it directs the RNA-guided endonuclease to the target site nucleic acid sequence to which the guide RNA is complementary. Since the guide RNA provides the specificity for the target site, the endonuclease of the RNA-guided endonuclease is universal and can be used with different guide RNAs for targeted binding to different nucleic acid sequences.

[074] The RNA-guided endonuclease can be derived from a clustered regularly interspersed short palindromic repeats (CRISPR)/CRISPR-associated (Cas) system.

[075] In general, CRISPR/Cas proteins comprise at least one RNA recognition and/or RNA binding domain. RNA recognition and/or RNA binding domains interact with guide RNAs. CRISPR/Cas proteins can also comprise nuclease domains (i.e., DNase or RNase domains), DNA binding domains, helicase domains, RNAse domains, protein-protein interaction domains, dimerization domains, as well as other domains.

[076] The CRISPR/Cas system can be a type I, type II, type III, type IV, type V, or type VI system. Non-limiting examples of suitable CRISPR/Cas proteins include Cas1, Cas2, Cas3, Cas4, Cas5, Cas7, Cas8, Cas9, Cas10, Cas12 (Cpf1), Cas13 (C2c2), Csm, and Cmr. For an overview of different examples of CRISPR/Cas proteins see Makarova, et al. Nat Rev Microbiol 18, 67-83 (2020).

[077] In one embodiment, the RNA-guided endonuclease is derived from a type II CRISPR/Cas system. In specific embodiments, the RNA-guided endonuclease is derived from a Cas9 or Cas12 protein.

[078] The CRISPR/Cas protein can be a wild type CRISPR/Cas protein, a modified CRISPR/Cas protein, or a fragment of a wild type or modified CRISPR/Cas protein. The CRISPR/Cas protein can be modified to increase nucleic acid binding affinity and/or specificity, alter an enzymatic activity, and/or change another property of the protein. For example, nuclease (i.e., DNase, RNase) domains of the CRISPR/Cas-like protein can be modified, deleted, or inactivated. Alternatively, the CRISPR/Cas protein can be truncated to remove domains that are not essential for the function of the fusion protein or protein complex. The CRISPR/Cas protein can also be truncated or modified to optimize the activity of the effector domain of the fusion protein.

[079] In some embodiments, the CRISPR/Cas protein can be derived from a wild type Cas9 protein or fragment thereof. In other embodiments, the CRISPR/Cas protein can be derived from modified Cas9 protein. For example, the amino acid sequence of the Cas9 protein can be modified to alter one or more properties (e.g., nuclease activity, affinity, stability, etc.) of the protein. Alternatively, domains of the Cas9 protein not involved in RNA-guided cleavage can be eliminated from the protein such that the modified Cas9 protein is smaller than the wild type Cas9 protein.

[080] Cas9 protein commonly comprises at least two nuclease domains. For example, a Cas9 protein can comprise a RuvC-like nuclease domain and a HNH-like nuclease domain. The RuvC and HNH domains work together to cut single strands to make a double-stranded break in DNA.

[081] In some embodiments, the Cas9 protein can be modified to contain only one functional nuclease domain (either a RuvC-like or an HNH-like nuclease domain). For example, the Cas9-derived protein can be modified such that one of the nuclease domains is deleted or mutated such that it is no longer functional (i.e., the nuclease activity is absent). In some embodiments in which one of the nuclease domains is inactive, the Cas9-derived protein is able to introduce a nick into a double-stranded nucleic acid (such a protein is termed a “nickase”), but not cleave the double-stranded DNA. For example, an aspartate to alanine (D10A) conversion in a RuvC-like domain converts the Cas9-derived protein into a nickase. In the same way, H840A or H839A mutations in an HNH domain convert the Cas9-derived protein into a nickase.
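
Purely to illustrate the substitution notation used above (e.g., D10A, meaning aspartate at position 10 replaced by alanine), the following Python sketch applies such a substitution to an amino acid string. The function name apply_substitution and the 12-residue toy sequence are hypothetical and were chosen for this example only; the toy sequence is not a Cas9 sequence.

```python
import re


def apply_substitution(protein: str, mutation: str) -> str:
    """Apply a substitution written like 'D10A' (1-based position).

    Raises ValueError if the notation is malformed, the position is out
    of range, or the residue at that position is not the stated wild type.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation notation: {mutation!r}")
    wild_type, replacement = match.group(1), match.group(3)
    index = int(match.group(2)) - 1  # convert to 0-based index
    if not 0 <= index < len(protein):
        raise ValueError(f"position out of range in {mutation!r}")
    if protein[index] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {index + 1}, "
            f"found {protein[index]}"
        )
    return protein[:index] + replacement + protein[index + 1:]


# Toy sequence for demonstration only (NOT a real Cas9 sequence);
# residue 10 is D so that the D10A notation can be applied.
toy_sequence = "MKKYSIGLADGT"
toy_nickase_like = apply_substitution(toy_sequence, "D10A")
```

The check on the wild-type residue guards against applying a mutation defined for one ortholog's numbering to a sequence with different numbering.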

[082] Each nuclease domain can be modified using well-known methods, such as site-directed mutagenesis, PCR-mediated mutagenesis, and total gene synthesis, as well as other methods known in the art.

[083] Non-limiting, exemplary nuclease inactive Cas9 domains are well known to the skilled person. One exemplary suitable nuclease-inactive S. pyogenes Cas9 domain is the D10A/H840A Cas9 domain mutant.

[084] In a preferred embodiment, the RGN domain is an nNme2Cas9 (having a D16A mutation) having the sequence of SEQ ID NO: 34.

[085] Additional suitable nuclease-inactive CRISPR associated domains will be apparent to those of skill in the art based on this disclosure. Such additional exemplary suitable nuclease-inactive spCas9 domains include, but are not limited to, D10A, D10A/D839A/H840A, and D10A/D839A/H840A/N863A mutant domains (e.g., Prashant et al. Nature Biotechnology. 2013; 31(9): 833-838).

[086] In some embodiments, Cas9 fusion proteins as provided herein comprise the full-length amino acid sequence of a Cas9 protein. In other embodiments, however, fusion proteins as provided herein do not comprise a full-length Cas9 sequence, but only a fragment thereof. For example, in some embodiments, a Cas9 fusion protein provided herein comprises a Cas9 fragment, wherein the fragment binds crRNA and tracrRNA or sgRNA, but does not comprise a functional nuclease domain, e.g., in that it comprises only a truncated version of a nuclease domain or no nuclease domain at all. Exemplary amino acid sequences of suitable modified Cas9 domains are described, for example, in Oakes et al., Cell. 2019 Jan 10; 176(1-2): 254-267. Additional suitable sequences of Cas9 domains and fragments will be apparent to those of skill in the art.

[087] In some embodiments, Cas9 refers to Cas9 from: Corynebacterium ulcerans (NCBI Refs: NC_015683.1, NC_017317.1); Corynebacterium diphtheriae (NCBI Refs: NC_016782.1, NC_016786.1); Spiroplasma syrphidicola (NCBI Ref: NC_021284.1); Prevotella intermedia (NCBI Ref: NC_017861.1); Spiroplasma taiwanense (NCBI Ref: NC_021846.1); Streptococcus iniae (NCBI Ref: NC_021314.1); Belliella baltica (NCBI Ref: NC_018010.1); Psychroflexus torquis (NCBI Ref: NC_018721.1); Streptococcus thermophilus (NCBI Ref: YP_820832.1); Listeria innocua (NCBI Ref: NP_472073.1); Campylobacter jejuni (NCBI Ref: YP_002344900.1); or Neisseria meningitidis (NCBI Ref: YP_002342100.1).

[088] Some aspects of the disclosure provide RNA-guided endonuclease domains that have different PAM specificities. Typically, RNA-guided endonuclease proteins, such as commonly used Cas9 from S. pyogenes (spCas9), require a canonical NGG PAM sequence to bind a particular nucleic acid region. Having a nuclease domain that requires a specific PAM sequence may limit the ability to edit desired bases within a genome. In some embodiments, the fusion proteins provided herein may need to be placed at a precise location. For example, in case of spCas9 the target base is placed within a 4 base region (e.g., a “deamination window”), which is approximately 15 bases upstream of its PAM. (Komor, A. C., et al., Nature 533, 420-424, 2016). Accordingly, in some embodiments, any of the fusion proteins or protein complexes provided herein may contain a RGN domain that is capable of binding a nucleotide sequence that does not contain a canonical (e.g., NGG) PAM sequence. RGN domains that bind to non-canonical PAM sequences have been described in the art and would be apparent to the skilled artisan. For example, Cas9 domains that bind non-canonical PAM sequences have been described in Kleinstiver, B. P., et al., Nature 523, 481-485 (2015).
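
The relationship between a PAM and a deamination window described above can be sketched, by way of example only, as a simple sequence scan. In the following Python sketch, the function name, the 20-nucleotide protospacer length, and the default window (positions 4 to 7 of the protospacer, counted from the PAM-distal end, chosen merely to match the "4 base region approximately 15 bases upstream" description) are all assumptions for illustration; actual windows depend on the particular editor. Only the strand supplied is scanned.

```python
import re


def find_editable_cytosines(seq, pam_regex="[ACGT]GG",
                            spacer_len=20, window=(4, 7)):
    """List candidate sites as (pam_start, protospacer, c_indices).

    pam_start  : 0-based index of the first PAM base in `seq`
    protospacer: the `spacer_len` bases immediately 5' of the PAM
    c_indices  : 0-based indices in `seq` of C bases that fall inside
                 the assumed window (1-based, inclusive, counted from
                 the PAM-distal end of the protospacer)
    """
    seq = seq.upper()
    first, last = window
    sites = []
    # A lookahead is used so that overlapping PAM matches are all found.
    for match in re.finditer(f"(?=({pam_regex}))", seq):
        pam_start = match.start()
        spacer_start = pam_start - spacer_len
        if spacer_start < 0:
            continue  # not enough sequence 5' of this PAM
        protospacer = seq[spacer_start:pam_start]
        c_indices = [
            spacer_start + i
            for i in range(first - 1, last)
            if protospacer[i] == "C"
        ]
        sites.append((pam_start, protospacer, c_indices))
    return sites


# Toy input: a 20-nt protospacer with a single C at position 5,
# followed by a TGG PAM.
toy_dna = "AAATCAT" + "A" * 13 + "TGG"
toy_sites = find_editable_cytosines(toy_dna)
```

Passing a different pam_regex (for example, one matching a non-canonical PAM) allows the same scan to be repeated for an RGN domain with a different PAM specificity.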

[089] A compact ortholog RNA-guided endonuclease from Neisseria meningitidis (NmeCas9) recognizes a simple dinucleotide PAM (nnnnCC) that provides for high target site density (Edraki et al. Mol Cell. 2019 Feb 21; 73(4): 714-726) and would be a preferred variant of a Cas9 domain for use in the fusion proteins and protein complexes described herein.

Uracil Protecting peptides

[090] Some aspects of the disclosure relate to fusion proteins and protein complexes that comprise a uracil protecting peptide (UPP). Examples of such peptides include a uracil glycosylase inhibitor (UGI) (US 10,167,457) and p56. Both UGI and p56 have been shown to inhibit the activity of uracil DNA glycosylase (UDG). Fusion proteins comprising a cytidine deaminase domain, a dCas (e.g., dCas9) domain and a uracil glycosylase inhibitor (UGI) have been demonstrated to improve the efficiency of deaminating target nucleotides (Komor, et al. Nature 533, 420-424 (2016)). Without wishing to be bound by any particular theory, cellular DNA-repair response to the presence of U:G heteroduplex DNA may be responsible for a decrease in nucleobase editing efficiency in cells.

[091] The present disclosure provides novel UPPs that are useful in the context of base editing. Preferably, such a UPP comprises the sequence of SEQ ID NO: 43 or 45.

[092] In some embodiments, any of the fusion proteins provided herein that comprise an RNA-guided endonuclease domain (e.g., a nuclease active Cas9 domain, a nuclease inactive dCas9 domain, or a Cas9 nickase) may be further fused to a UPP either directly or via a linker.

[093] Some aspects of this disclosure provide cytidine deaminase-dCas9 fusion proteins, cytidine deaminase-nuclease active Cas9 fusion proteins and cytidine deaminase-Cas9 nickase (nCas9) fusion proteins comprising a UPP. In one aspect, the present disclosure provides a fusion protein or protein complex that comprises (i) an RNA-guided endonuclease domain (such as, for example, nuclease active Cas9 domain, a nuclease inactive dCas9 domain, or a Cas9 nickase), (ii) a cytidine deaminase domain, and (iii) an uracil protecting peptide (UPP) comprising the sequence of SEQ ID NO: 43 or 45.

[094] Without wishing to be bound by any particular theory, cellular DNA-repair response to the presence of U:G heteroduplex DNA may be responsible for the decrease in nucleobase editing efficiency in cells. For example, uracil DNA glycosylase (UDG) catalyzes removal of U from DNA in cells, which may initiate base excision repair, with reversion of the U:G pair to a C:G pair as the most common outcome. Thus, this disclosure contemplates a fusion protein comprising a dCas9-nucleic acid editing domain further fused to a UPP. This disclosure also contemplates a fusion protein comprising a Cas9 nickase-nucleic acid editing domain further fused to a UPP. The use of a UPP may increase the editing efficiency of a cytidine deaminase domain that catalyzes a C to U change.

[095] In some embodiments, the fusion protein comprises the structure:

• [cytidine deaminase]-[dCas9]-[UPP];

• [cytidine deaminase]-[UPP]-[dCas9];

• [UPP]-[cytidine deaminase]-[dCas9];

• [UPP]-[dCas9]-[cytidine deaminase];

• [dCas9]-[cytidine deaminase]-[UPP]; or

• [dCas9]-[UPP]-[cytidine deaminase], wherein "-" is an optional linker sequence.

[096] In other embodiments, the fusion protein comprises the structure:

• [cytidine deaminase]-[nCas9]-[UPP];

• [cytidine deaminase]-[UPP]-[nCas9];

• [UPP]-[cytidine deaminase]-[nCas9];

• [UPP]-[nCas9]-[cytidine deaminase];

• [nCas9]-[cytidine deaminase]-[UPP]; or

• [nCas9]-[UPP]-[cytidine deaminase], wherein nCas9 is a Cas9 nickase, and "-" is an optional linker sequence.
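
The six arrangements listed for each Cas9 variant are simply the six possible orderings of three domains. As a small illustrative sketch (the function name domain_orders is hypothetical), they can be enumerated in Python with itertools.permutations:

```python
from itertools import permutations


def domain_orders(cas_label: str = "nCas9") -> list:
    """Return all N- to C-terminal orderings of the three domains.

    Each '-' in the returned strings stands for an optional linker.
    """
    domains = ["cytidine deaminase", cas_label, "UPP"]
    return [
        "-".join(f"[{domain}]" for domain in order)
        for order in permutations(domains)
    ]


ncas9_structures = domain_orders("nCas9")
dcas9_structures = domain_orders("dCas9")
```

Each call returns six strings, matching the six structures recited for the corresponding Cas9 variant.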

[097] Exemplary sequences of Uracil Protecting Peptides are provided in US 10,167,457 and Hao-Ching Wang et al., Nucleic Acids Research, 42, 2, pp. 1354-1364.

Fusion protein complexes with guide RNAs

[098] Some aspects of this disclosure provide complexes comprising any of the fusion proteins or protein complexes provided herein, and a guide RNA bound to a Cas domain (e.g., a dCas9, a nuclease active Cas9, or a Cas9 nickase) of the fusion protein.

[099] In some embodiments, the guide RNA is from 15-300 nucleotides long and comprises a sequence of at least 10 contiguous nucleotides that is complementary to a target sequence. In some embodiments, the guide RNA comprises a sequence of 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, or 40 contiguous nucleotides that is complementary to a target sequence. In some embodiments, the target sequence is a DNA sequence. In some embodiments, the target sequence is a sequence in the genome of a mammal, plant, or bacteria. In some embodiments, the target sequence is a sequence in the genome of a human. In some embodiments, the 3' end of the target sequence is immediately adjacent to a PAM sequence (e.g. canonical PAM sequence NGG of SpCas9). In some embodiments, the guide RNA is complementary to a sequence associated with a disease or disorder.
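
As a hedged illustration of the two guide RNA criteria recited above (an overall length of 15-300 nucleotides and at least 10 contiguous nucleotides complementary to the target), the following Python sketch checks both. The function names are hypothetical, complementarity is taken here to mean perfect antiparallel Watson-Crick pairing between the RNA and the supplied DNA strand, and only the four standard bases are handled.

```python
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_dna(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_DNA_COMPLEMENT)[::-1]


def longest_complementary_run(guide_rna: str, target_dna: str) -> int:
    """Length of the longest contiguous stretch of the guide that is
    perfectly complementary (antiparallel) to a stretch of target_dna."""
    # A guide segment pairs with a target segment exactly when the
    # reverse complement of the guide segment equals the target segment,
    # so the problem reduces to a longest-common-substring search.
    probe = reverse_complement_dna(guide_rna.upper().replace("U", "T"))
    target = target_dna.upper()
    best = 0
    previous = [0] * (len(target) + 1)
    for i in range(1, len(probe) + 1):
        current = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            if probe[i - 1] == target[j - 1]:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def guide_meets_criteria(guide_rna: str, target_dna: str,
                         min_len: int = 15, max_len: int = 300,
                         min_run: int = 10) -> bool:
    """Check the length range and the minimum complementary stretch."""
    if not min_len <= len(guide_rna) <= max_len:
        return False
    return longest_complementary_run(guide_rna, target_dna) >= min_run
```

A guide whose spacer is the RNA reverse complement of a 20-nucleotide target strand, for instance, yields a complementary run of 20 and satisfies both criteria.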

Uses of the fusion proteins and protein complexes

[100] Fusion proteins comprising cytidine deaminase domain can be used for the targeted editing of nucleic acid sequences. Such fusion proteins are useful for targeted editing of DNA in vitro, e.g., for the generation of mutant cells or animals; for the introduction of targeted mutations, e.g., for the correction of genetic defects in cells ex vivo, e.g., in cells obtained from a subject that are subsequently re-introduced into the same or another subject; and for the introduction of targeted mutations, e.g., the correction of genetic defects or the introduction of deactivating mutations in disease-associated genes in a subject.

[101] Some aspects of this disclosure provide methods of using the cytidine deaminase domains, fusion proteins, or complexes provided herein. In one aspect of this disclosure are provided methods comprising contacting a DNA molecule with any of the fusion proteins or protein complexes provided herein, and with at least one guide RNA, wherein the guide RNA is about 15-300 nucleotides long and comprises a sequence of at least 10 contiguous nucleotides that is complementary to a target sequence of interest.

[102] Alternatively, methods are provided comprising contacting a DNA molecule with a cytidine deaminase domain of a fusion protein provided herein and with at least one gRNA as provided herein. In some embodiments, the 3' end of the target sequence is not immediately adjacent to a PAM sequence. In some embodiments, the 3' end of the target sequence is immediately adjacent to a canonical PAM sequence (NGG), e.g. an AGC, GAG, TTT, GTG, or CAA sequence of SpCas9.

[103] In some embodiments, the target DNA sequence comprises a sequence associated with a disease or disorder. In some embodiments, the target DNA sequence comprises a point mutation associated with a disease or disorder. In some embodiments, the activity of the cytidine deaminase domain, the cytidine deaminase fusion protein, or the complex results in a correction of the point mutation. In some embodiments, the target DNA sequence comprises a T→C point mutation associated with a disease or disorder, wherein the deamination of the mutant C base results in a sequence that is not associated with a disease or disorder. In some embodiments, the target DNA sequence encodes a protein, and the point mutation is in a codon and results in a change in the amino acid encoded by the mutant codon as compared to the wild-type codon. In some embodiments, the deamination of the mutant C results in a change of the amino acid encoded by the mutant codon. In some embodiments, the deamination of the mutant C results in the codon encoding the wild-type amino acid. In some embodiments, the contacting is in vivo in a subject. In some embodiments, the subject has or has been diagnosed with a disease or disorder.

[104] Some embodiments provide methods for using the cytidine deaminase fusion proteins or protein complexes provided herein. In some embodiments, the fusion protein is used to introduce a point mutation into a nucleic acid by deaminating a target C residue. In some embodiments, the deamination of the target nucleobase results in the correction of a genetic defect, e.g., in the correction of a point mutation that leads to a loss of function in a gene product. In some embodiments, the genetic defect is associated with a disease or disorder. In some embodiments, the methods provided herein are used to introduce a deactivating point mutation into a gene or allele that encodes a gene product that is associated with a disease or disorder. For example, in some embodiments, methods are provided herein that employ a DNA editing fusion protein provided herein to introduce a deactivating point mutation into an oncogene (e.g., in the treatment of a proliferative disease). A deactivating mutation may, in some embodiments, generate a premature stop codon in a coding sequence, which results in the expression of a truncated gene product, e.g., a truncated protein lacking the function of the full-length protein.
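
To illustrate how a single C to T change can generate a premature stop codon as described above, the following Python sketch enumerates, under the standard genetic code, the sense codons that become a stop codon through one such change. The function names are hypothetical. A C to T change on the opposite strand is read on the coding strand as a G to A change, so both cases are listed.

```python
from itertools import product

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def single_base_variants(codon: str, ref: str, alt: str) -> list:
    """All codons obtained by changing exactly one `ref` base to `alt`."""
    return [
        codon[:i] + alt + codon[i + 1:]
        for i, base in enumerate(codon)
        if base == ref
    ]


def codons_convertible_to_stop(ref: str, alt: str) -> list:
    """Sense codons that a single ref->alt change turns into a stop codon."""
    hits = set()
    for bases in product("ACGT", repeat=3):
        codon = "".join(bases)
        if codon in STOP_CODONS:
            continue  # already a stop codon
        if any(v in STOP_CODONS for v in single_base_variants(codon, ref, alt)):
            hits.add(codon)
    return sorted(hits)


# C->T on the coding strand.
coding_strand_targets = codons_convertible_to_stop("C", "T")
# C->T on the opposite strand, read as G->A on the coding strand.
opposite_strand_targets = codons_convertible_to_stop("G", "A")
```

Under the standard code this yields CAA, CAG, and CGA for edits on the coding strand and TGG for edits on the opposite strand.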

[105] In some embodiments, the purpose of the methods provided herein is to restore the function of a dysfunctional gene via genome editing. The Cas-cytidine deaminase fusion proteins provided herein can be validated for gene editing-based human therapeutics in vitro, e.g., by correcting a disease-associated mutation in human cell culture. It will be understood by the skilled artisan that the fusion proteins provided herein, e.g., the fusion proteins comprising a Cas9 domain and a nucleic acid deaminase domain, can be used to correct any single point T→C or A→G mutation. In the first case, deamination of the mutant C back to U corrects the mutation, and in the latter case, deamination of the C that is base-paired with the mutant G, followed by a round of replication, corrects the mutation.

Base-editing system

[106] Some aspects of this disclosure provide a base-editing system comprising the fusion proteins and protein complexes of cytidine deaminase as disclosed herein. More specifically, the disclosure provides a base-editing system comprising: (1) a fusion protein or protein complex comprising an RGN domain and a cytidine deaminase domain as provided herein, and (2) a guide RNA that binds to the RNA-guided nuclease of the fusion protein or the protein complex.

[107] In some embodiments the fusion protein further comprises a UPP as disclosed herein.

Nucleic acids, genetic constructs, and vectors for expression of the base-editing system

[108] The present disclosure further provides a nucleic acid that encodes the components of the base-editing system as disclosed herein and genetic constructs comprising such nucleic acids. The genetic constructs, such as a plasmid or expression vector, may comprise a nucleic acid that encodes the fusion protein or the protein complex of RNA-guided nuclease with cytidine deaminase and/or at least one gRNA targeting the nucleic acid of interest.

[109] The present disclosure also contemplates a composition of a nucleic acid that encodes a modified AAV vector and one or more nucleic acid sequences that encode the components of the base-editing system as disclosed herein. The compositions may comprise a nucleic acid that encodes a modified lentiviral vector.

[110] The nucleic acid may be present in the cell as a functioning extrachromosomal molecule. The genetic construct may be a linear minichromosome including a centromere and telomeres, or a plasmid or cosmid. The nucleic acid may also be part of a genome of a recombinant viral vector, including recombinant lentivirus, recombinant adenovirus, and recombinant adeno-associated virus. The nucleic acid may be part of the genetic material in attenuated live microorganisms or recombinant microbial vectors.

[111] The nucleic acid may comprise regulatory elements for gene expression of the coding sequences of the nucleic acid. The regulatory elements may be a promoter, an enhancer, an initiation codon, a stop codon, or a polyadenylation signal.

[112] The nucleic acid sequences may be in the form of a vector. The vector may be capable of expressing the base-editing system as provided herein in the cell of a mammal. The vector may be recombinant. The vector may comprise heterologous nucleic acid encoding the fusion protein or protein complex provided herein. The vector may be a plasmid. The vector may be useful for transfecting cells with nucleic acid encoding the fusion protein or the protein complex.

[113] Coding sequences of the fusion proteins and protein complexes provided herein may be optimized for stability and high levels of expression. The coding sequences can also be codon optimized for expression in the target cells. In some embodiments, coding sequences are codon optimized for expression in particular cells, such as eukaryotic cells. The eukaryotic cells may be those of or derived from a particular organism, such as a mammal, including but not limited to human, mouse, rat, rabbit, dog, or non-human primate. In general, codon optimization refers to a process of modifying a nucleic acid sequence for enhanced expression in the host cells of interest by replacing at least one codon (e.g., about or more than about 1, 2, 3, 4, 5, 10, 15, 20, 25, 50, or more codons) of the native sequence with codons that are more frequently or most frequently used in the genes of that host cell while maintaining the native amino acid sequence. Various species exhibit particular bias for certain codons of a particular amino acid.
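
The codon replacement described above can be sketched as follows. This Python example is illustrative only: the function names are hypothetical, the codon table is deliberately truncated to three amino acids, and the usage frequencies are placeholder values rather than measured data. It replaces each codon with the most frequently used synonymous codon in the table and leaves the encoded amino acid sequence unchanged.

```python
# Truncated codon table for illustration (standard genetic code entries
# for Ala, Lys, and Met only). A real implementation would cover all 64 codons.
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "ATG": "M",
}

# Placeholder usage frequencies for a hypothetical host (NOT measured data).
TOY_USAGE = {
    "GCC": 0.40, "GCT": 0.26, "GCA": 0.23, "GCG": 0.11,
    "AAG": 0.57, "AAA": 0.43,
    "ATG": 1.00,
}


def translate(cds: str, codon_to_aa: dict) -> str:
    """Translate a coding sequence using the supplied codon table."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(codon_to_aa[cds[i:i + 3]] for i in range(0, len(cds), 3))


def optimize_codons(cds: str, codon_to_aa: dict, usage: dict) -> str:
    """Replace each codon with the most frequent synonymous codon."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    preferred = {}
    for codon, amino_acid in codon_to_aa.items():
        current = preferred.get(amino_acid)
        if current is None or usage[codon] > usage[current]:
            preferred[amino_acid] = codon
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(preferred[codon_to_aa[codon]] for codon in codons)


native = "ATGGCGAAA"  # encodes Met-Ala-Lys
optimized = optimize_codons(native, CODON_TO_AA, TOY_USAGE)
```

Always choosing the single most frequent codon is only one possible strategy; the text above equally covers replacing a subset of codons with more frequently used ones.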

[114] The vector may comprise heterologous nucleic acid encoding the base-editing system and may further comprise an initiation codon, which may be upstream of base-editing system coding sequences, and a stop codon, which may be downstream of the base-editing system coding sequences.

[115] In some embodiments, the vector for expression of the fusion proteins and protein complexes as described herein comprises one or more nuclear localization sequences (NLSs). In some embodiments, the vector comprises 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more NLSs at or near the amino-terminus, about or more than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more NLSs at or near the carboxy-terminus, or a combination of one or more NLS at the amino-terminus and one or more NLS at the carboxy terminus. Typically, an NLS consists of one or more short sequences of positively charged lysines or arginines exposed on the protein surface, but other types of NLS are known. Non-limiting examples of NLSs include an NLS sequence derived from: the NLS of the SV40 virus large T-antigen, having the amino acid sequence PKKKRKV (SEQ ID NO: 29) and the NLS from nucleoplasmin having the amino acid sequence KRPAATKKAGQAKKKK (SEQ ID NO: 30).

[116] The vector may also comprise a promoter that is operably linked to the base-editing system coding sequence. The promoter operably linked to the base-editing system coding sequence may be a promoter from simian virus 40 (SV40), a mouse mammary tumor virus (MMTV) promoter, a human immunodeficiency virus (HIV) promoter such as the bovine immunodeficiency virus (BIV) long terminal repeat (LTR) promoter, a cytomegalovirus (CMV) promoter such as the CMV immediate early promoter, Epstein Barr virus (EBV) promoter, or a Rous sarcoma virus (RSV) promoter. The promoter may also be a promoter from a human gene. The promoter may also be a tissue specific promoter. Examples of such promoters are described in US2004/017572.

[117] The vector may also comprise a polyadenylation signal, which may be downstream of the base-editing system. The polyadenylation signal may be an SV40 polyadenylation signal, LTR polyadenylation signal, bovine growth hormone (bGH) polyadenylation signal, human growth hormone (hGH) polyadenylation signal, or human β-globin polyadenylation signal.

[118] The vector may also comprise an enhancer upstream of the components of the base-editing system. Examples of enhancers are described in US5,593,972, US5,962,428, and WO94/016737. The vector may also comprise a mammalian origin of replication in order to maintain the vector extrachromosomally and produce multiple copies of the vector in a cell. The vector may also comprise a regulatory sequence, which may be well suited for gene expression in a mammalian or human cell into which the vector is administered. The vector may also comprise a reporter gene, such as green fluorescent protein ("GFP") and/or a selectable marker.

[119] In some aspects, the disclosure provides methods comprising delivering one or more polynucleotides, such as one or more vectors as described herein, one or more transcripts thereof, and/or one or more proteins transcribed therefrom, to a host cell. In some aspects, the invention further provides cells produced by such methods, and organisms (such as animals, plants, or fungi) comprising or produced from such cells.

Methods for editing nucleic acids

[120] Some aspects of the disclosure provide methods for editing a nucleic acid. In some embodiments, the method is a method for editing a base of a nucleic acid. In some embodiments, the method comprises the step of contacting a target region of a double-stranded nucleic acid (such as a DNA) with the fusion protein or protein complex provided herein and a guide RNA complementary to the target region, wherein the target region comprises a targeted nucleobase pair to be edited.

[121] In some embodiments, the method for editing a base of a double-stranded nucleic acid comprises the steps of: a) contacting a target region of a double-stranded nucleic acid with a complex comprising a fusion protein or protein complex provided herein and a guide RNA, wherein the target region comprises a targeted nucleobase pair to be edited; and b) converting a first nucleobase of said target nucleobase pair in a single strand of the target region to a second nucleobase, wherein a third nucleobase complementary to the first nucleobase is replaced by a fourth nucleobase complementary to the second nucleobase; and the method results in less than 20% indel formation in the nucleic acid.

[122] In some embodiments, the first nucleobase is a cytidine. In some embodiments, the second nucleobase is a deaminated cytidine, or a uracil. In some embodiments, the third nucleobase is a guanine. In some embodiments, the fourth nucleobase is an adenine. In some embodiments, the first nucleobase is a cytidine, the second nucleobase is a deaminated cytidine, or a uracil, the third nucleobase is a guanine, and the fourth nucleobase is an adenine. In some embodiments, the method results in less than 19%, 18%, 16%, 14%, 12%, 10%, 8%, 6%, 4%, 2%, 1%, 0.5%, 0.2%, or less than 0.1% indel formation.

[123] In some embodiments, the method further comprises replacing the second nucleobase with a fifth nucleobase that is complementary to the fourth nucleobase, thereby generating an intended edited base pair (e.g., C:G→T:A). In some embodiments, the fifth nucleobase is a thymine.
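The succession of nucleobases described in paragraphs [121] to [123] can be followed with a minimal sketch. The function name `edit_steps` and the tuple representation (edited strand, opposite strand) are illustrative assumptions and are not part of the disclosure; the sketch only models the C:G target pair.

```python
# Illustrative model of the base-pair states described in [121]-[123].
# Names and data layout are assumptions made for this sketch only.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def edit_steps(pair=("C", "G")):
    """Return successive (edited strand, opposite strand) states."""
    first, third = pair
    if (first, third) != ("C", "G"):
        raise ValueError("this sketch only models a C:G target pair")
    second = "U"                  # deamination converts cytidine to uracil
    fourth = COMPLEMENT[second]   # opposite strand: G is replaced by A
    fifth = COMPLEMENT[fourth]    # edited strand: U is replaced by T
    return [(first, third), (second, third), (second, fourth), (fifth, fourth)]


steps = edit_steps()
print(" -> ".join(f"{a}:{b}" for a, b in steps))  # C:G -> U:G -> U:A -> T:A
```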

[124] In some embodiments, at least 1% of the intended base pairs are edited. In some embodiments, at least 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, or 50% of the intended base pairs are edited.

[125] In some embodiments, the ratio of intended products to unintended products in the target nucleotide is at least 2:1, 5:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, or 200:1, or more. In some embodiments, the ratio of intended point mutation to indel formation is greater than 1:1, 10:1, 50:1, 100:1, 500:1, or 1000:1, or more. In some embodiments, the cut single strand (nicked strand) is hybridized to the guide nucleic acid. In some embodiments, the cut single strand is opposite to the strand comprising the first nucleobase. In some embodiments, the base editor comprises a Cas domain, e.g. Cas9 domain.
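As a worked illustration of the percentages and ratios in paragraphs [124] and [125], the following sketch derives them from sequencing read counts. The four read categories, the function name `editing_metrics`, and the counts used are hypothetical and are not data from this disclosure.

```python
def editing_metrics(intended, unintended, indel, unedited):
    """Compute editing percentages and product ratios from read counts.

    The four read categories are an assumption of this sketch.
    """
    total = intended + unintended + indel + unedited
    if total == 0:
        raise ValueError("no reads supplied")
    return {
        "efficiency_pct": 100.0 * intended / total,
        "indel_pct": 100.0 * indel / total,
        # A ratio is reported as infinity when the denominator is zero.
        "intended_to_unintended": intended / unintended if unintended else float("inf"),
        "intended_to_indel": intended / indel if indel else float("inf"),
    }


# Hypothetical counts: 1000 reads in total.
m = editing_metrics(intended=400, unintended=20, indel=10, unedited=570)
print(m)
```

With these hypothetical counts, the editing efficiency is 40%, indel formation is 1%, and the ratios of intended product to unintended product and to indels are 20:1 and 40:1, respectively.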

[126] In some embodiments, the first base is cytidine. In some embodiments, the second base is not a G, C, A, or T. In some embodiments, the second base is uracil. In some embodiments, the fusion protein or protein complex provided herein inhibits base excision repair of the edited strand. In some embodiments, the fusion protein or protein complex provided herein protects or binds the non-edited strand. In some embodiments, the fusion protein or protein complex provided herein protects or binds the edited strand. In some embodiments, the fusion protein or protein complex provided herein protects or binds the non-edited and edited strands.

[127] In some embodiments, the intended edited base pair is upstream of a PAM site. In some embodiments, the intended edited base pair is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides upstream of the PAM site. In some embodiments, the intended edited base pair is downstream of a PAM site. In some embodiments, the intended edited base pair is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides downstream of the PAM site. In some embodiments, the method does not require a canonical spCas9 (e.g., NGG) PAM site.

[128] In some embodiments, the target sequence comprises a target window, wherein the target window comprises the target nucleobase pair. In some embodiments, the target window comprises 1-10 nucleotides. In some embodiments, the target window is 1-9, 1-8, 1-7, 1-6, 1-5, 1-4, 1-3, 1-2, or 1 nucleotide in length. In some embodiments, the target window is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 nucleotides in length. In some embodiments, the intended edited base pair is within the target window. In some embodiments, the target window comprises the intended edited base pair. In some embodiments, the method is performed using any of the base editors provided herein. In some embodiments, a target window is a deamination window.
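The relationship between a target window and the distance from the PAM site can be sketched as follows. The sketch assumes a PAM lying immediately 3′ of the protospacer, so that the last protospacer nucleotide is 1 nucleotide upstream of the PAM; the function name, the example sequence, and the chosen window are hypothetical.

```python
def cytosines_in_window(protospacer, window_start, window_end):
    """List each C in the window as (position from 5' end, nt upstream of PAM).

    Positions are 1-based and inclusive. The PAM is assumed to lie
    immediately 3' of the protospacer (an assumption of this sketch).
    """
    seq = protospacer.upper()
    if not 1 <= window_start <= window_end <= len(seq):
        raise ValueError("window must lie inside the protospacer")
    hits = []
    for pos in range(window_start, window_end + 1):
        if seq[pos - 1] == "C":
            hits.append((pos, len(seq) - pos + 1))
    return hits


# Hypothetical 20-nt protospacer and a window spanning positions 4-8.
example = "ACGTCCATGCAAGTCTGACG"
print(cytosines_in_window(example, 4, 8))  # [(5, 16), (6, 15)]
```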

[129] In some embodiments, the disclosure provides methods for editing a nucleotide. In some embodiments, the disclosure provides a method for editing a nucleobase pair of a double-stranded DNA sequence. In some embodiments, the method comprises a) contacting a target region of the double-stranded DNA sequence with a complex comprising a fusion protein or protein complex provided herein and a guide nucleic acid (e.g., gRNA), where the target region comprises a target nucleobase pair; and b) converting a first nucleobase of said target nucleobase pair in a single strand of the target region to a second nucleobase, wherein a third nucleobase complementary to the first nucleobase base is replaced by a fourth nucleobase complementary to the second nucleobase, and the second nucleobase is replaced with a fifth nucleobase that is complementary to the fourth nucleobase, thereby generating an intended edited base pair, wherein the efficiency of generating the intended edited base pair is at least 1%.

[130] In some embodiments, at least 5% of the intended base pairs are edited. In some embodiments, at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, or 50% of the intended base pairs are edited. In some embodiments, the method causes less than 19%, 18%, 16%, 14%, 12%, 10%, 8%, 6%, 4%, 2%, 1%, 0.5%, 0.2%, or less than 0.1% indel formation. In some embodiments, the ratio of intended product to unintended products at the target nucleotide is at least 2:1, 5:1, 10:1, 20:1, 30:1, 40:1, 50:1, 60:1, 70:1, 80:1, 90:1, 100:1, or 200:1, or more. In some embodiments, the ratio of intended point mutation to indel formation is greater than 1:1, 10:1, 50:1, 100:1, 500:1, or 1000:1, or more. In some embodiments, the cut single strand is hybridized to the guide nucleic acid. In some embodiments, the cut single strand is opposite to the strand comprising the first nucleobase. In some embodiments, the first base is cytidine. In some embodiments, the second nucleobase is not G, C, A, or T. In some embodiments, the second base is uracil. In some embodiments, the base editor inhibits base excision repair of the edited strand. In some embodiments, the base editor protects or binds the non-edited strand. In some embodiments, the nucleobase editor comprises a UPP. In some embodiments, the nucleobase editor comprises nickase activity. In some embodiments, the intended edited base pair is upstream of a PAM site. In some embodiments, the intended edited base pair is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides upstream of the PAM site. In some embodiments, the intended edited base pair is downstream of a PAM site. In some embodiments, the intended edited base pair is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides downstream of the PAM site. In some embodiments, the method does not require a PAM site. In some embodiments, the nucleobase editor comprises a linker. In some embodiments, the linker is 1-25 amino acids in length. In some embodiments, the linker is 5-20 amino acids in length. In some embodiments, the linker is 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acids in length. In some embodiments, the target region comprises a target window, wherein the target window comprises the target nucleobase pair. In some embodiments, the target window comprises 1-10 nucleotides. In some embodiments, the target window is 1-9, 1-8, 1-7, 1-6, 1-5, 1-4, 1-3, 1-2, or 1 nucleotide in length. In some embodiments, the target window is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 nucleotides in length. In some embodiments, the intended edited base pair occurs within the target window. In some embodiments, the target window comprises the intended edited base pair. In some embodiments, the nucleobase editor is any one of the base editors provided herein.

Therapeutic uses

[131] The instant disclosure provides methods for the treatment of diseases or disorders, e.g., diseases or disorders that are associated with or caused by a point mutation that can be corrected by cytidine deaminase gene editing. Some such diseases are described herein, and additional suitable diseases that can be treated with the strategies and fusion proteins provided herein will be apparent to those of skill in the art based on the instant disclosure. Exemplary suitable diseases and disorders are listed below. It will be understood that the numbering of the specific positions or residues in the respective sequences depends on the particular protein and numbering scheme used. Numbering might be different, e.g., in precursors of a mature protein and the mature protein itself, and differences in sequences from species to species may affect numbering. One of skill in the art will be able to identify the respective residue in any homologous protein and in the respective encoding nucleic acid by methods well known in the art, e.g., by sequence alignment and determination of homologous residues. Exemplary suitable diseases and disorders include, without limitation, cystic fibrosis (see, e.g., Schwank et al., Functional repair of CFTR by CRISPR/Cas9 in intestinal stem cell organoids of cystic fibrosis patients. Cell stem cell. 2013; 13: 653-658; and Wu et al., Correction of a genetic disease in mouse via use of CRISPR-Cas9. Cell stem cell. 2013; 13: 659-662).

Pharmaceutical compositions

[132] The composition of the present invention may be in a pharmaceutical composition. The pharmaceutical composition may comprise about 1 ng to about 10 mg of DNA encoding the CRISPR/Cas9-based system or CRISPR/Cas9-based system protein component, i.e., the fusion protein. The pharmaceutical composition may comprise about 1 ng to about 10 mg of the DNA of the modified lentiviral vector. The pharmaceutical composition may comprise about 1 ng to about 10 mg of the DNA of the modified AAV vector and a nucleotide sequence encoding the site-specific nuclease. The pharmaceutical compositions according to the present invention can be formulated according to the mode of administration to be used. In cases where pharmaceutical compositions are injectable pharmaceutical compositions, they are sterile, pyrogen free and particulate free. An isotonic formulation is preferably used. Generally, additives for isotonicity may include sodium chloride, dextrose, mannitol, sorbitol and lactose. In some cases, isotonic solutions such as phosphate buffered saline are preferred. Stabilizers include gelatin and albumin. In some embodiments, a vasoconstriction agent is added to the formulation.

[133] The composition may further comprise a pharmaceutically acceptable excipient. The pharmaceutically acceptable excipient may be functional molecules such as vehicles, adjuvants, carriers, or diluents. The pharmaceutically acceptable excipient may be a transfection facilitating agent, which may include surface active agents, such as immune-stimulating complexes (ISCOMS), Freund's incomplete adjuvant, LPS analog including monophosphoryl lipid A, muramyl peptides, quinone analogs, vesicles such as squalene and squalene, hyaluronic acid, lipids, liposomes, calcium ions, viral proteins, polyanions, polycations, or nanoparticles, or other known transfection facilitating agents.

[134] The transfection facilitating agent can be a polyanion, polycation, including poly-L-glutamate (LGS), or lipid. The transfection facilitating agent is poly-L-glutamate, and more preferably, the poly-L-glutamate is present in the composition for genome editing in skeletal muscle or cardiac muscle at a concentration less than 6 mg/ml. The transfection facilitating agent may also include surface active agents such as immune-stimulating complexes (ISCOMS), Freund's incomplete adjuvant, LPS analog including monophosphoryl lipid A, muramyl peptides, quinone analogs and vesicles such as squalene and squalene, and hyaluronic acid may also be administered in conjunction with the genetic construct. In some embodiments, the DNA vector encoding the composition may also include a transfection facilitating agent such as lipids, liposomes, including lecithin liposomes or other liposomes known in the art, as a DNA-liposome mixture (see for example WO9324640), calcium ions, viral proteins, polyanions, polycations, or nanoparticles, or other known transfection facilitating agents. Preferably, the transfection facilitating agent is a polyanion, polycation, including poly-L-glutamate (LGS), or lipid.

[135] The sequences included in the present invention are shown in Table 4:

[136] Table 4. Sequences

EXAMPLES

Example 1. Identification of Novel Cytosine Base Editors

[137] Public bacterial genome collections were searched for sequences of less than 200 amino acids which contained the prototypical deaminase fold (Iyer et al., Nucleic Acids Research, 39(22), 9473-9497, 2011) and active site (D/H/C-[X]-E-[X15-45]-P-C-[X2]-C), that were not part of any larger ORF, and that had less than 40% homology to any functionally known cytosine deaminase. 13 candidate enzymes were selected for testing along with a known eukaryotic cytosine deaminase (hAPOBEC3A).
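The length and active-site criteria of paragraph [137] can be expressed as a regular-expression filter. This is a minimal sketch under stated assumptions: the regular expression is one reading of the motif D/H/C-[X]-E-[X15-45]-P-C-[X2]-C, the function name and test sequences are hypothetical, and the deaminase-fold, ORF, and homology criteria are not modelled.

```python
import re

# One reading of D/H/C-[X]-E-[X15-45]-P-C-[X2]-C as a regular expression:
# [DHC], any residue, E, 15 to 45 residues, P, C, any two residues, C.
ACTIVE_SITE = re.compile(r"[DHC].E.{15,45}PC.{2}C")


def is_candidate(protein_seq, max_len=200):
    """True if the sequence is shorter than max_len and contains the motif."""
    seq = protein_seq.upper()
    return len(seq) < max_len and ACTIVE_SITE.search(seq) is not None


# Synthetic sequences constructed only to exercise the filter.
with_motif = "M" + "A" * 10 + "HAE" + "G" * 20 + "PCAAC" + "K" * 5
short_spacer = "M" + "A" * 10 + "HAE" + "G" * 5 + "PCAAC" + "K" * 5

print(is_candidate(with_motif))    # True
print(is_candidate(short_spacer))  # False
```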

Example 2: Demonstration of base editing activity on endogenous targets in mammalian cells

[138] The coding sequence of the identified deaminase is codon-optimized for expression in mammalian cells and introduced into the expression cassette, which produces a fusion protein that includes a 3xFLAG tag at its N-terminal end; the 3xFLAG tag is operably linked to a NLS at its C-terminal end, and the NLS is operably linked to the putative deaminase sequences at its C-terminal end. The putative deaminases are operably linked to a flexible amino acid linker at their C-terminal end, and the amino acid linker is operably linked at its C-terminal end to a known active RNA-guided nuclease that has been mutated to have an inactive RuvC domain (nNme2Cas9_D16A) (that is, it has been mutated into an RGN that acts as a nickase). The RNA-guided DNA binding polypeptide is operably linked to the second NLS at its C-terminal end. Each of these expression cassettes is introduced into a vector capable of driving the expression of the fusion protein in mammalian cells. A vector capable of expressing guide RNA to target the deaminase-RGN fusion protein to the determined genomic location was also produced. These guide RNAs can guide the deaminase-RGN fusion protein to the target genome sequence for base editing.

[139] Using liposome transfection, vectors capable of expressing the deaminase-RGN fusion protein and guide RNAs (EGsG0033 and 0034) were transfected into HEK293T cells. For liposome transfection, the day before transfection, the cells were distributed in a 24-well plate of growth medium (DMEM + 10% fetal bovine serum + 1% penicillin/streptomycin) at 1.3 × 10⁵ cells/well. Lipofectamine® 3000 reagent (Thermo Fisher Scientific) was used according to the manufacturer's instructions to transfect 500 ng deaminase-RGN fusion expression vector and 500 ng guide RNA expression vector. 48-72 hours after liposome transfection, genomic DNA was harvested from the transfected cells, and the DNA was sequenced and analyzed for the presence of targeted cytosine base editing mutations using CRISPResso2 (Clement K, Rees H, Canver MC, Gehrke JM, Farouni R, Hsu JY, Cole MA, Liu DR, Joung JK, Bauer DE, Pinello L. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat Biotechnol. 2019 Mar; 37(3):224-226. doi: 10.1038/s41587-019-0032-3. PubMed PMID: 30809026).

[140] Table 5 summarizes base editing results for both target sequences, and Tables 6, 7, 8 and 9 show the editing rate of cytidine bases for CBE07, CBE08, CBE10, CBE11, CBE12, and CBE13 deaminase and the rate for targeted cytosine deamination for the wildtype Nme2Cas9 targeted to the same region. Active cytosine base editing was defined as greater than the median INDEL formation of the novel CBE under investigation and greater than 5x C>T SNP base editing at target cytosines compared to the average C>T rate of the regular Nme2Cas9 within the target region. Increased INDEL formation indicates an active cytosine base editor because the deaminase-RGN fusion protein consists of an RGN that has an inactive RuvC domain. With only a single active nuclease domain, the RGN will only function as a nickase and will not generate a detectable INDEL formation by itself. When the RGN is fused with an active deaminase that acts on the opposite strand, a cytosine will be turned into a uracil. The uracil is rapidly removed from the DNA, leaving an abasic site, and eventually a gap, on the strand opposite the strand nicked by the RGN. This results in a double stranded break which is repaired through non-homologous end joining (NHEJ) and detectable INDEL formation. Greater than 5x C>T SNP base editing at target cytosines indicates an active cytosine base editor because, if the uracil created by the deaminase is not removed before the nicked strand is repaired, it will be read as a thymine when the nicked strand is repaired and an adenosine will be inserted across from it. Then, when the uracil is removed, it will be replaced by thymine during the excision repair process, fixing the mutation at C>T.

Each CBE construct has a different editing window, or region of the target sequence that the cytosine deaminase acts upon. This is driven by two factors: 1) the steric properties of the direct fusion and how accessible the exposed single stranded DNA is, and 2) the cytosine recognition preferences of the deaminase itself. Successful cytosine editing will only occur when there is a cytosine with a preferred sequence motif located in the preferred editing window. TC is a common dinucleotide preference for existing cytosine base editors, and we see successful editing on EGsG033 (ctggtggccarCtCcactgctagta) (SEQ ID NO: 39) at the C12 and C14 positions with APOBEC3A, but hardly any editing at position C8 (GC) or C15 (CC), as they are not the preferred motif. However, with the novel cytosine deaminases identified, the preferred editing occurs at C8 (GC).
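The activity definition and the dinucleotide contexts discussed in paragraph [140] can be illustrated with a short sketch. It reflects one reading of the definition (INDEL formation above the median across the tested editors, and C>T editing above five times the mean C>T rate of the control nuclease). All names, sequences, and numbers below are hypothetical and are not results from this disclosure.

```python
from statistics import mean, median


def active_editors(indel_pct, ct_pct, control_ct_pct, fold=5.0):
    """Return editor names meeting both activity criteria.

    indel_pct / ct_pct: dicts mapping editor name to a percentage.
    control_ct_pct: C>T percentages of the control nuclease in the region.
    """
    indel_cutoff = median(indel_pct.values())
    ct_cutoff = fold * mean(control_ct_pct)
    return sorted(
        name for name in indel_pct
        if indel_pct[name] > indel_cutoff and ct_pct[name] > ct_cutoff
    )


def cytosine_contexts(seq):
    """Map the 1-based position of each C to its 5' dinucleotide context."""
    s = seq.upper()
    return {i + 1: s[i - 1] + "C" for i in range(1, len(s)) if s[i] == "C"}


# Hypothetical values for four hypothetical editors.
indel = {"editor_A": 2.0, "editor_B": 8.0, "editor_C": 12.0, "editor_D": 1.0}
c_to_t = {"editor_A": 30.0, "editor_B": 0.4, "editor_C": 25.0, "editor_D": 0.1}
print(active_editors(indel, c_to_t, control_ct_pct=[0.1, 0.2, 0.3]))
print(cytosine_contexts("AAGCATCACC"))
```

In this hypothetical data set only `editor_C` exceeds both cut-offs, and the example sequence contains cytosines in GC, TC, AC, and CC contexts at positions 4, 7, 9, and 10.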

[141] Table 5. INDEL editing results for CBE constructs. The CBE domains show greater than the median INDEL formation and targeted editing. The numbers in bold indicate active base editing. The numbers in italic indicate greater than the median INDEL formation.

[142] Table 6. EGsG033 CBE editing results for CBE11, CBE12, and CBE13.

Example 3. Identification of Novel Uracil Protecting Peptides (UPP)

[146] Public genome collections were searched for sequences of less than 200 amino acids with a total of at least 10 aspartic acid and/or glutamic acid residues on the predicted protein surface and a negative charge on at least 10% of the predicted surface residues (Wang et al. Nucleic Acids Res. 2014;42(2):1354-1364).
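The search criteria of paragraph [146] can be sketched as a simple filter. The sketch assumes that the surface-exposed positions have already been predicted by an external tool and are supplied as 0-based indices, and it reads "negative charge on at least 10% of the predicted surface residues" as the fraction of surface residues that are Asp or Glu. The function name and test sequences are hypothetical.

```python
def passes_surface_charge_filter(sequence, surface_positions,
                                 min_acidic=10, min_fraction=0.10,
                                 max_len=200):
    """Apply the length and surface-acidity criteria to one sequence.

    surface_positions: 0-based indices of predicted surface residues
    (the prediction step itself is outside the scope of this sketch).
    """
    seq = sequence.upper()
    positions = list(surface_positions)
    if len(seq) >= max_len or not positions:
        return False
    acidic = sum(1 for i in positions if seq[i] in "DE")
    return acidic >= min_acidic and acidic / len(positions) >= min_fraction


# Synthetic sequences constructed only to exercise the filter.
acidic_rich = "DE" * 10 + "A" * 60   # 80 residues, 20 acidic at the start
acidic_poor = "D" * 5 + "A" * 95     # 100 residues, 5 acidic

print(passes_surface_charge_filter(acidic_rich, range(40)))  # True
print(passes_surface_charge_filter(acidic_poor, range(50)))  # False
```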

21 candidate uracil protecting peptides (UPP) were selected for testing fused with a known Eukaryotic cytosine deaminase (hAPOBEC3A).

Example 4: Demonstration of base editing activity of fusion proteins comprising a UPP on endogenous targets in mammalian cells

[147] The coding sequence of the identified UPP is codon-optimized for expression in mammalian cells and introduced into the expression cassette, which produces a fusion protein that includes a 3xFLAG tag at its N-terminal end; the 3xFLAG tag is operably linked to a NLS at its C-terminal end, and the NLS is operably linked to the codon optimized deaminase sequences at its C-terminal end. The putative deaminases are operably linked to a flexible amino acid linker at their C-terminal end, and the amino acid linker is operably linked at its C-terminal end to a known active RNA-guided nuclease that has been mutated to have an inactive RuvC domain (nNme2Cas9_D16A) (that is, it has been mutated into an RGN that acts as a nickase). The RNA-guided DNA binding polypeptide is operably linked to a flexible amino acid linker at its C-terminal end, and the amino acid linker is operably linked to the putative uracil protecting peptide. The putative uracil protecting peptide is operably linked to a flexible amino acid linker at its C-terminal end, and the amino acid linker is operably linked to a second NLS at its C-terminal end. Each of these expression cassettes is introduced into a vector capable of driving the expression of the fusion protein in mammalian cells. A vector capable of expressing guide RNA to target the deaminase-RGN-UPP fusion protein to the determined genomic location was also produced. These guide RNAs can guide the deaminase-RGN-UPP fusion protein to the target genome sequence for base editing.
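The N-terminal to C-terminal order of the fusion protein components described in Examples 2 and 4 can be summarised with a small sketch. The function name `cassette_layout` and the string labels are illustrative assumptions; no sequences are implied.

```python
def cassette_layout(with_upp=False):
    """Return the N- to C-terminal domain order as a hyphen-joined string."""
    parts = ["3xFLAG", "NLS", "deaminase", "linker", "nNme2Cas9_D16A"]
    if with_upp:
        # Example 4 adds a linker, the UPP, and a further linker
        # between the nickase and the second NLS.
        parts += ["linker", "UPP", "linker"]
    parts.append("NLS")
    return "-".join(parts)


print(cassette_layout())
# 3xFLAG-NLS-deaminase-linker-nNme2Cas9_D16A-NLS
print(cassette_layout(with_upp=True))
# 3xFLAG-NLS-deaminase-linker-nNme2Cas9_D16A-linker-UPP-linker-NLS
```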

[148] Using liposome transfection, vectors capable of expressing the deaminase-RGN-UPP fusion protein and guide RNAs (EGsG0034 and 0041) were transfected into HEK293T cells. For liposome transfection, the day before transfection, the cells were distributed in a 24-well plate of growth medium (DMEM + 10% fetal bovine serum + 1% penicillin/streptomycin) at 1.3 × 10⁵ cells/well. Lipofectamine® 3000 reagent (Thermo Fisher Scientific) was used according to the manufacturer's instructions to transfect 500 ng deaminase-RGN fusion expression vector and 500 ng guide RNA expression vector. 48-72 hours after liposome transfection, genomic DNA was harvested from the transfected cells, and the DNA was sequenced and analyzed for the presence of targeted cytosine base editing mutations using CRISPResso2 (Clement K, et al. Nat Biotechnol. 2019; 37(3):224-226).

[149] Tables 11 and 12 show the editing rate of cytidine bases for UPP12 (SEQ ID NO: 43) and UPP14 (SEQ ID NO: 45) and the rate for targeted cytosine deamination for the control deaminase-RGN targeted to the same region. Active cytosine base editing was defined as a reduction in INDEL formation of the novel UPP under investigation, increase of C>D SNP base editing along the targeted window, and >85% C>T SNP base editing at highly mutated cytosines compared to the deaminase-RGN without a UPP within the target region. Decreased INDEL formation indicates an active UPP because the deaminase-RGN-UPP fusion protein consists of an RGN that has an inactive RuvC domain. With only a single active nuclease domain, the RGN will only function as a nickase and will not generate a detectable INDEL formation by itself. When the RGN is fused with an active deaminase that acts on the opposite strand, a cytosine will be turned into a uracil. The uracil is rapidly removed from the DNA, leaving an abasic site, and eventually a gap, on the strand opposite the strand nicked by the RGN. This results in a double stranded break which is repaired through non-homologous end joining (NHEJ) and detectable INDEL formation. With the presence of an active UPP, the converted uracil is protected from removal and the abasic site is never removed and NHEJ does not occur. This also leads to an increase of C>D SNP base editing and an increase in the total amount of C>T conversions because, if the uracil created by the deaminase is protected and not removed before the nicked strand is repaired, it will be read as a thymine when the nicked strand is repaired, and an adenosine will be inserted across from it. Then, when the uracil is removed, it will be replaced by thymine during the excision repair process, fixing the mutation at C>T.

[150] Table 10. INDEL and Max Base editing results for UPP constructs.
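The three activity criteria stated in paragraph [149] can be sketched as a comparison against the control without a UPP. This is one possible reading: the ">85% C>T" criterion is interpreted here as the share of C>T among all C>D conversions at a highly mutated cytosine. The function name, dictionary keys, and numbers are hypothetical and are not results from this disclosure.

```python
def upp_is_active(with_upp, control, min_ct_purity=0.85):
    """Compare a deaminase-RGN-UPP construct with its control.

    Each argument is a dict with the keys 'indel_pct', 'c_to_d_pct'
    and 'c_to_t_pct' (percentages at a highly mutated cytosine).
    """
    if with_upp["c_to_d_pct"] <= 0:
        return False
    ct_purity = with_upp["c_to_t_pct"] / with_upp["c_to_d_pct"]
    return (
        with_upp["indel_pct"] < control["indel_pct"]        # fewer INDELs
        and with_upp["c_to_d_pct"] > control["c_to_d_pct"]  # more C>D editing
        and ct_purity > min_ct_purity                       # mostly C>T
    )


# Hypothetical measurements.
control = {"indel_pct": 20.0, "c_to_d_pct": 30.0, "c_to_t_pct": 20.0}
candidate = {"indel_pct": 5.0, "c_to_d_pct": 50.0, "c_to_t_pct": 45.0}
print(upp_is_active(candidate, control))  # True
```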

[151] Table 11. EGsG034 UPP editing results

[152] Table 12. EGsG041 UPP editing results

Claims

WHAT IS CLAIMED IS:

1. A fusion protein or protein complex, comprising a) a site-specific nuclease domain, and b) a cytidine deaminase domain comprising the sequence of any one of SEQ ID NO: 7, 8, 10, 11, 12, or 13.

2. The fusion protein or the protein complex of claim 1 further comprising an uracil protecting peptide (UPP).

3. The fusion protein or the protein complex of claim 2, wherein said UPP comprises the sequence of SEQ ID NO: 43 or 45.

4. The fusion protein or the protein complex of any one of claims 1-3, wherein said site-specific nuclease domain is an RNA-guided endonuclease (RGN) domain.

5. The fusion protein or the protein complex of claim 4, wherein said RNA-guided endonuclease domain is a nickase domain or an inactivated endonuclease domain.

6. The fusion protein or the protein complex of claim 4, wherein said RNA-guided endonuclease domain is a Cas9 domain.

7. The fusion protein or the protein complex of claim 6, wherein said Cas9 domain is a Nme2Cas9 domain.

8. The fusion protein or the protein complex of claim 7, wherein said Nme2Cas9 domain comprises the sequence of SEQ ID NO: 34.

9. The fusion protein of claim 1 further comprising a linker sequence.

10. The fusion protein of claim 9, wherein said linker sequence comprises the sequence of SEQ ID NO: 31, 32 or 33.

11. The fusion protein or the protein complex of claim 1 further comprising one or more NLS sequences.

12. The fusion protein or the protein complex of claim 11 wherein said NLS sequence is a Nucleoplasmin NLS or SV40 NLS.

13. The fusion protein or the protein complex of claim 11, wherein said NLS sequence comprises the sequence of SEQ ID NO: 29 or 30.

14. A polynucleotide sequence encoding the fusion protein or protein complex according to any one of claims 1-13.

15. A vector comprising the nucleotide sequence of claim 14.

16. A cell comprising the fusion protein or protein complex according to any one of claims 1-13, the polynucleotide of claim 14, or the vector of claim 15.

17. A method for editing a base of a target nucleic acid sequence in a target cell, said method comprising introducing to the target cell the fusion protein or protein complex according to any one of claims 1-13 or the vector of claim 15.

18. A method for editing a base of a nucleic acid, the method comprising the steps of: a) contacting a target region of a double-stranded nucleic acid with a complex comprising the fusion protein or protein complex of any one of claims 1-13 and a guide RNA, wherein the target region comprises a targeted nucleobase pair to be edited; and b) converting a first nucleobase of said target nucleobase pair in a single strand of the target region to a second nucleobase, where a third nucleobase complementary to the first nucleobase is replaced by a fourth nucleobase complementary to the second nucleobase; and the method results in less than 20% indel formation in the nucleic acid.

19. A pharmaceutical composition comprising the fusion protein or the protein complex of claims 1-13, one or more polynucleotides of claim 14, or one or more vectors of claim 15.