WO2024089629A1

WO2024089629A1 - Cas12 protein, CRISPR-Cas system and uses thereof

Info

Publication number: WO2024089629A1
Application number: PCT/IB2023/060788
Authority: WO
Inventors: Bang Wang
Original assignee: Geneditbio Ltd
Current assignee: Geneditbio Ltd
Priority date: 2022-10-27
Filing date: 2023-10-26
Publication date: 2024-05-02
Anticipated expiration: 2025-04-27

Abstract

An engineered, non-naturally occurring Cas12 protein, CRISPR-Cas system and uses thereof are provided. The engineered, non-naturally occurring novel Cas12 proteins comprise an amino acid sequence selected from SEQ ID NOs: 1-36, a homologue thereof having at least 70% sequence identity to the amino acid sequence, or a variant thereof. These Cas12 proteins should enable wider application of CRISPR-Cas systems for gene editing or gene targeting.

Description

CAS12 PROTEIN, CRISPR-CAS SYSTEM AND USES THEREOF

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to PCT application No. PCT/CN2022/128074 filed on October 27, 2022, PCT application No. PCT/CN2023/090629 filed on April 25, 2023, PCT application No. PCT/CN2023/093961 filed on May 12, 2023, and PCT application No. PCT/CN2023/094271 filed on May 15, 2023. The entire contents of the aforementioned applications are hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates to a Cas12 protein, a CRISPR-Cas system and uses thereof. In particular, the Cas12 protein and CRISPR-Cas system are used for gene targeting or gene editing.

BACKGROUND

Targeted genome editing or modification is rapidly becoming an important tool for basic and applied research, with the clustered regularly interspaced short palindromic repeats and CRISPR-associated proteins (CRISPR-Cas) system showing the most promise due to the ease of altering target specificity by engineering the associated guide RNAs. Recent advances in genome sequencing techniques and analysis methods have significantly accelerated the ability to catalog and map genetic factors associated with a diverse range of biological functions and diseases. Precise genome targeting technologies are needed to enable systematic reverse engineering of causal genetic variations by allowing selective perturbation of individual genetic elements, as well as to advance synthetic biology, biotechnological, and medical applications.

SUMMARY

There exists a pressing need for alternative Cas12a systems and techniques for gene editing with a wide array of applications. This invention addresses this need and provides related advantages. Mining new Cas proteins helps to obtain CRISPR-Cas systems with higher gene editing efficiency and/or specificity. Collectively, 36 novel Cas12 proteins are presented and should enable wider application of CRISPR-Cas systems for gene editing and/or gene targeting. The study found that they exhibit some special characteristics. Although they are phylogenetically more closely related to Cas12a than to other subtypes, the phylogenetic tree shows that each occupies its own branch, suggesting that they are evolutionarily distinct.

In one aspect, the disclosure provides an engineered, non-naturally occurring Cas12 protein, wherein the Cas12 protein comprises an amino acid sequence selected from SEQ ID NOs: 1-36, a homologue thereof having at least 70% sequence identity to the amino acid sequence, or a variant thereof. In some embodiments, the Cas12 protein comprises an amino acid sequence having at least 75%, 80%, 85%, 90%, 92%, 95% or 98% sequence identity to any one of SEQ ID NOs: 1-36. In some embodiments, the Cas12 protein comprises an amino acid sequence having at least 90%, 95% or 98% sequence identity to any one of SEQ ID NOs: 1-36. In some embodiments, the variant comprises one or more mutations in the REC I domain and/or the WED II domain of any one of SEQ ID NOs: 1-36. In some embodiments, the variant comprises one or more mutations in the REC I domain and/or the WED II domain of SEQ ID NO: 34. In some embodiments, the variant comprises one or more mutations in the region of positions 180-200 and/or 560-620 with reference to the amino acid position numbering of SEQ ID NO: 34. In some embodiments, the variant comprises one or more mutations in the region of positions 190-200 and/or 570-610 with reference to the amino acid position numbering of SEQ ID NO: 34. In some embodiments, the variant comprises one or more mutations in the region of positions 195-200 and/or 580-595 with reference to the amino acid position numbering of SEQ ID NO: 34. In some embodiments, the variant comprises one or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. In some embodiments, the variant comprises two or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. In some embodiments, the variant comprises three or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. In some embodiments, the variant comprises mutations at the following positions: Q198, D584, K590, and Q593 of SEQ ID NO: 34. In some embodiments, the mutation is a single amino acid substitution.
In some embodiments, said amino acid is mutated to a positively charged amino acid. In some embodiments, said amino acid is mutated to R and/or K, preferably R. In some embodiments, the variant comprises the following mutations: Q198R, D584R, K590R, and Q593R of SEQ ID NO: 34. In some embodiments, the variant recognizes a PAM sequence which is not recognized by SEQ ID NO: 34. In some embodiments, the variant recognizes a PAM sequence which is not TTTN, wherein N is A, T, G or C. In some embodiments, the variant has nuclease activity. In some embodiments, the variant has double-strand DNA cleavage activity or nickase activity.
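The substitution notation used above (for example Q198R: wild-type residue, 1-based position, replacement residue) can be illustrated with a minimal Python sketch. The function name `apply_substitutions` and the toy 600-residue sequence are hypothetical and introduced only for illustration; the actual SEQ ID NO: 34 sequence is not reproduced here.

```python
import re

def apply_substitutions(protein: str, substitutions: list[str]) -> str:
    """Apply point substitutions such as 'Q198R' (wild type, 1-based position, new residue)."""
    residues = list(protein)
    for sub in substitutions:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", sub)
        if match is None:
            raise ValueError(f"unrecognized substitution: {sub}")
        wild_type, position, new = match.group(1), int(match.group(2)), match.group(3)
        if position < 1 or position > len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        # Refuse to mutate if the reference residue does not match the notation.
        if residues[position - 1] != wild_type:
            raise ValueError(
                f"expected {wild_type} at position {position}, found {residues[position - 1]}"
            )
        residues[position - 1] = new
    return "".join(residues)

# Toy stand-in for a reference protein (NOT SEQ ID NO: 34): alanines with the
# four named wild-type residues placed at the positions given in the text.
toy = ["A"] * 600
for pos, aa in ((198, "Q"), (584, "D"), (590, "K"), (593, "Q")):
    toy[pos - 1] = aa
toy_protein = "".join(toy)

variant = apply_substitutions(toy_protein, ["Q198R", "D584R", "K590R", "Q593R"])
print(variant[197], variant[583], variant[589], variant[592])
```

Checking the wild-type residue before substituting guards against off-by-one numbering errors when positions are quoted relative to a reference sequence.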

In some embodiments, the Cas12 protein further comprises one or more of a nuclear localization signal sequence, a cell penetrating peptide sequence, an affinity tag and/or a fusion base editor protein.

In another aspect, the disclosure provides an engineered, non-naturally occurring cell comprising the Cas12 protein of any one of above. In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell.

In another aspect, the disclosure provides a kit comprising the Cas12 protein of any one of above.

In another aspect, the disclosure provides an engineered, non-naturally occurring Cas12 polynucleotide encoding the Cas12 protein of any one of above.

In some embodiments, the polynucleotide is a ribonucleotide sequence or a deoxyribonucleotide sequence, or an analog thereof; preferably the polynucleotide is mRNA, and the polynucleotide further comprises a 5’ cap sequence and a poly-A tail sequence. In some embodiments, the polynucleotide is codon optimized for expression in a cell of interest. In some embodiments, the polynucleotide is codon optimized for expression in a eukaryotic cell; preferably the polynucleotide has at least 90%, 92%, 95% or 98% sequence identity to any one of SEQ ID NOs: 180-183 and 202-211. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the cell is a mammalian cell, preferably a human cell. In some embodiments, the polynucleotide has at least 70% sequence identity to any one of SEQ ID NOs: 37-72. In some embodiments, the polynucleotide has at least 75%, 80%, 85%, 88%, 90%, 92%, 94%, 95%, 96%, 98% or 99% sequence identity to any one of SEQ ID NOs: 37-72.

In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use as a nuclease, preferably for use as a double-strand DNA cleavage nuclease or nickase.

In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use in gene editing. In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use in a method of therapy, treatment, prevention, diagnosis or detection of disease. In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use as a medicament.

In another aspect, the disclosure provides an engineered vector comprising the Cas12 polynucleotide of any one of above. In some embodiments, the vector is an expression vector. In some embodiments, the vector is an inducible, conditional, or constitutive expression vector. In another aspect, the disclosure provides a vector system comprising one or more vectors of any one of above. In some embodiments, the one or more vectors comprise a polynucleotide according to any one of above and one or more polynucleotides, on the same or a different vector, encoding a guide RNA.

In another aspect, the disclosure provides an engineered cell comprising the Cas12 polynucleotide of any one of above, or comprising the vector of any one of above, or comprising the vector system of any one of above. In some embodiments, the cell expresses the Cas12 protein. In some embodiments, the cell transiently expresses or non-transiently expresses the Cas12 protein. In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell.

In another aspect, the disclosure provides a reagent kit comprising the Cas12 protein of any one of above, or comprising the Cas12 polynucleotide of any one of above, or comprising the vector of any one of above, or comprising the vector system of any one of above.

In another aspect, the disclosure provides a pharmaceutical composition comprising the Cas12 protein of any one of above or the polynucleotide of any one of above or the vector of any one of above or the vector system of any one of above formulated for delivery by AAV (adeno-associated viruses), Adenoviruses, retroviruses, HSV (herpes simplex virus), Gammaretrovirus, LV (lentivirus), eCIS (extracellular Contractile Injection System), eVLP (Engineered virus-like particles), VLP (virus-like particles), liposomes, plasmids, LNPs (lipid nanoparticles), exosomes, microvesicles, nucleic acid nanoassemblies, a gene gun, and/or an implantable device.

In another aspect, the disclosure provides an engineered, non-naturally occurring CRISPR-Cas system comprising: a) the Cas12 protein of any one of above or the polynucleotide encoding the Cas12 protein; and b) at least one engineered guide sequence or one or more engineered nucleic acids encoding the at least one engineered guide sequence, wherein the guide sequence comprises a direct repeat sequence capable of binding the Cas12 protein and a spacer sequence capable of hybridizing to a target sequence.

In some embodiments, the system comprises one or more guide sequences capable of hybridizing to one or more target sequences or to different regions of one target sequence. In some embodiments, the guide sequence hybridizes to one or more target sequences in a prokaryotic cell or in a eukaryotic cell. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the eukaryotic cell comprises a mammalian cell. In some embodiments, the mammalian cell comprises a human cell. In some embodiments, the eukaryotic cell comprises a plant cell. In some embodiments, the target sequence is DNA or RNA. In some embodiments, the target sequence is selected from: double stranded DNA, double stranded RNA, single stranded DNA, single stranded RNA, genomic DNA, or extrachromosomal DNA.

In some embodiments, the direct repeat sequence comprises a stem-loop structure comprising: a first stem nucleotide strand which comprises 4-7 nucleotides; a second stem nucleotide strand which comprises 4-7 nucleotides, wherein the first and second stem nucleotide strands can hybridize with each other; and a loop nucleotide strand arranged between the first and second stem nucleotide strands, wherein the loop nucleotide strand comprises 4 or 5 nucleotides.
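The stem-loop constraints above (two stem strands of 4-7 nucleotides that can hybridize, separated by a loop of 4 or 5 nucleotides) can be checked with a minimal Python sketch. It assumes equal-length stems with strict Watson-Crick pairing by default; the function name `find_stem_loop`, the optional G-U wobble flag and the toy RNA are illustrative assumptions and are not sequences from this disclosure.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def find_stem_loop(rna, stem_range=(4, 7), loop_range=(4, 5), allow_wobble=False):
    """Return (start, stem_length, loop_length) of the first matching hairpin, or None.

    Longer stems are tried first. Both stems are assumed to have the same length.
    """
    rna = rna.upper().replace("T", "U")
    pairs = WATSON_CRICK | (WOBBLE if allow_wobble else set())
    for stem_len in range(stem_range[1], stem_range[0] - 1, -1):
        for loop_len in range(loop_range[0], loop_range[1] + 1):
            total = 2 * stem_len + loop_len
            for start in range(0, len(rna) - total + 1):
                first = rna[start:start + stem_len]
                second = rna[start + stem_len + loop_len:start + total]
                # The first stem pairs antiparallel with the second stem.
                if all((a, b) in pairs for a, b in zip(first, reversed(second))):
                    return start, stem_len, loop_len
    return None

# Toy RNA: AA + UCUAC (stem) + UCUU (loop) + GUAGA (stem) + AA
print(find_stem_loop("AAUCUACUCUUGUAGAAA"))
```

Exact pairing is a simplification; a stem-loop may tolerate mismatches, which this sketch does not model.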

In some embodiments, the direct repeat sequence comprises a nucleotide sequence having at least 90% identity to any one of SEQ ID NOs: 73-104. In some embodiments, the direct repeat sequence comprises a nucleotide sequence selected from any one of SEQ ID NOs: 73-78. In some embodiments, the spacer sequence is between 10 and 40 nucleotides in length, preferably the spacer sequence is between 15 and 30 nucleotides in length, or between 18 and 25 nucleotides in length.
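The spacer length ranges above can be enforced when assembling a guide sequence, as in the following minimal Python sketch. It assumes a 5'-direct repeat-spacer-3' arrangement, as is typical for Cas12a crRNAs; this section does not state the order. The function name `build_guide` and the toy sequences are hypothetical.

```python
def build_guide(direct_repeat: str, spacer: str, min_len: int = 10, max_len: int = 40) -> str:
    """Concatenate a direct repeat and a spacer into an RNA guide after validating the spacer.

    Assumption: the direct repeat is placed 5' of the spacer (Cas12a-like layout).
    """
    direct_repeat = direct_repeat.upper().replace("T", "U")
    spacer = spacer.upper().replace("T", "U")
    for name, seq in (("direct repeat", direct_repeat), ("spacer", spacer)):
        unexpected = set(seq) - set("ACGU")
        if unexpected:
            raise ValueError(f"{name} contains unexpected characters: {sorted(unexpected)}")
    if not min_len <= len(spacer) <= max_len:
        raise ValueError(
            f"spacer length {len(spacer)} is outside the range {min_len}-{max_len}"
        )
    return direct_repeat + spacer

# Toy direct repeat and 20-nt spacer; neither is a SEQ ID NO from the disclosure.
guide = build_guide("UCUACUCUUGUAGA", "GCTAGCTAGCTAGCTAGCTA")
print(guide)
```

Passing `min_len=18, max_len=25` applies the narrowest of the ranges given above.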

In some embodiments, an mRNA or a DNA encodes the Cas12 protein.

In some embodiments, the polynucleotide encoding the Cas12 protein is operably linked to a promoter. In some embodiments, the promoter is a constitutive promoter, tissue-specific promoter or inducible promoter. In some embodiments, the polynucleotide encoding the Cas12 protein operably linked to a promoter is in a vector.

In some embodiments, the vector is selected from the group consisting of a retroviral vector, a lentiviral vector, a phage vector, an adenoviral vector, an adeno-associated vector, and a herpes simplex vector.

In some embodiments, the system further comprises a donor template nucleic acid, wherein the donor template nucleic acid is a DNA, an RNA, or a DNA-RNA hybrid.

In some embodiments, the targeting of the target sequence by the Cas12 protein and guide sequence results in a modification of the target sequence. In some embodiments, the modification of the target sequence is a cleavage event or a nicking event.

In another aspect, the disclosure provides a delivery system, wherein the system of any one of above is provided in a delivery vehicle selected from the group consisting of AAV (adeno-associated viruses), Adenoviruses, retroviruses, HSV (herpes simplex virus), Gammaretrovirus, LV (lentivirus), eCIS (extracellular Contractile Injection System), eVLP (Engineered virus-like particles), VLP (virus-like particles), liposomes, plasmids, LNPs (lipid nanoparticles), exosomes, microvesicles, nucleic acid nanoassemblies, a gene gun, and/or an implantable device.

In another aspect, the disclosure provides an engineered cell comprising the system of any one of above. In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell.

In another aspect, the disclosure provides the engineered, non-naturally occurring CRISPR-Cas system of any one of above, or the delivery system of above for use in a therapeutic or treatment or prevention or diagnosis or detection method of disease.

In another aspect, the disclosure provides the engineered, non-naturally occurring CRISPR-Cas system of any one of above, delivery system of above or cell of any one of above for use as a medicament.

In another aspect, the disclosure provides the engineered, non-naturally occurring CRISPR-Cas system of any one of above, delivery system of above or cell of any one of above for use in a method of therapeutic treatment of a patient.

In another aspect, the disclosure provides a method of modifying or targeting a target DNA locus, the method comprising delivering to said locus a CRISPR-Cas system of any one of above or a delivery system of above. In some embodiments, said modifying or targeting a target locus comprises inducing a DNA strand break. In some embodiments, said modifying or targeting a target locus comprises inducing a DNA double strand break. In some embodiments, said modifying or targeting a target locus comprises altering gene expression of one or more genes. In some embodiments, said modifying or targeting a target locus comprises epigenetic modification of said target DNA locus. In some embodiments, the method is a method of modifying a cell, a cell line, or an organism by manipulation of one or more target sequences at genomic loci of interest. In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell. In some embodiments, the method is in vitro or in vivo.

In another aspect, the disclosure provides a method of targeting and cleaving a double-stranded target DNA, the method comprising: contacting the double-stranded target DNA with a system of any one of above. In some embodiments, cleaving the target DNA or target sequence results in the formation of an indel or the insertion of a nucleotide sequence. In some embodiments, cleaving the target DNA or target sequence comprises cleaving the target DNA or target sequence at two sites, and results in the deletion or inversion of a sequence between the two sites.
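The two-site outcome described above, deletion or inversion of the sequence between the cut sites, can be illustrated on a single strand with a minimal Python sketch. It is an idealized model that assumes clean, blunt rejoining at the given 0-based cut indices; real repair outcomes vary and are not predicted by it. The function names are hypothetical.

```python
def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def _ordered_cuts(dna: str, cut1: int, cut2: int) -> tuple[int, int]:
    left, right = sorted((cut1, cut2))
    if left < 0 or right > len(dna):
        raise ValueError("cut positions must lie within the sequence")
    return left, right

def delete_between(dna: str, cut1: int, cut2: int) -> str:
    """Model a deletion: drop the segment between the two cut sites."""
    left, right = _ordered_cuts(dna, cut1, cut2)
    return dna[:left] + dna[right:]

def invert_between(dna: str, cut1: int, cut2: int) -> str:
    """Model an inversion: reinsert the segment as its reverse complement."""
    left, right = _ordered_cuts(dna, cut1, cut2)
    return dna[:left] + reverse_complement(dna[left:right]) + dna[right:]

locus = "AAAAGGCATTTT"  # toy locus; cuts after index 4 and index 8 bracket "GGCA"
print(delete_between(locus, 4, 8))  # AAAATTTT
print(invert_between(locus, 4, 8))  # AAAATGCCTTTT
```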

In another aspect, the disclosure provides an isolated eukaryotic cell comprising a modified target locus of interest, wherein the target locus of interest has been modified according to a method or via use of a composition or via use of a system of any one of the preceding contents.

In another aspect, the disclosure provides a system for detecting the presence of a nucleic acid target sequence in an in vitro sample, comprising: a Cas12 protein of any one of above; at least one guide polynucleotide comprising a guide sequence capable of binding the target sequence, and designed to form a complex with the Cas12 protein; and a nucleic acid-based masking construct comprising a non-target sequence; and wherein the Cas12 protein exhibits collateral cleavage activity of RNA and/or ssDNA and cleaves the non-target sequence of the nucleic acid-based masking construct activated by the target sequence. In another aspect, the disclosure provides a method for detecting target nucleic acids in samples comprising: contacting one or more samples with a Cas12 protein of any one of above; at least one guide polynucleotide comprising a guide sequence designed to have a degree of complementarity with the target sequence, and designed to form a complex with the Cas12 protein; and a nucleic acid-based masking construct comprising a non-target sequence, wherein the Cas12 protein exhibits collateral cleavage activity of RNA and/or ssDNA and cleaves the non-target sequence of the nucleic acid-based masking construct activated by the target sequences; and detecting a signal from cleavage of the non-target sequence, thereby detecting the one or more target sequences in the sample.

In another aspect, the disclosure provides an engineered guide sequence or a nucleic acid encoding the guide sequence, wherein the guide sequence comprises a direct repeat sequence and a spacer sequence; the direct repeat sequence binds to the Cas12 protein of any one described herein or the Cas12 nucleotide sequence of any one described herein encoding the Cas12 protein, and the spacer sequence binds to a target sequence.

In some embodiments, the direct repeat sequence comprises a nucleotide sequence having at least 95% identity to any one of SEQ ID NOs: 74-77 or SEQ ID NOs: 79-83 or SEQ ID NOs: 86-104. In some embodiments, the spacer sequence is between 15 and 35 nucleotides in length, e.g., between 20 and 30 nucleotides in length, or between 20 and 25 nucleotides in length. In some embodiments, the guide sequence hybridizes to one or more target sequences in a prokaryotic cell or in a eukaryotic cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell. In some embodiments, the target sequence is DNA or RNA. In some embodiments, the targeting of the target sequence by the Cas12 protein and guide sequence results in a modification of the target sequence. In some embodiments, the modification of the target sequence is a cleavage event or a nicking event.

These and other aspects, objects, features, and advantages of the example embodiments will become apparent to those having ordinary skill in the art upon consideration of the following detailed description of illustrated example embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

An understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure may be utilized, and the accompanying drawings of which:

Figure 1 (FIG.1) shows the phylogenetic tree of the Cas12 proteins in this disclosure constructed by IQTREE;

Figure 2 (FIG.2) is the enlarged partial view of the subtree in the dashed box of FIG.1;

Figures 3-7 (FIGs.3-7) show the domain arrangement of the Cas proteins in this disclosure;

Figure 8 (FIG.8) shows the secondary structure of the crRNA utilized by the Cas proteins in this disclosure;

Figure 9 (FIG.9) shows the sequence alignment between TnpB, Cas12f and GEBx0123/0142 in this disclosure; the region of the zinc finger domain and the conserved 4-Cys zinc finger in Cas12f and TnpB are marked with an arrow and a star respectively, indicating that GEBx0123/0142 lack the zinc finger structure in their C terminus;

Figure 10 (FIG.10) shows the schematic of the pCasX and pgRNA plasmids harboring the Cas nuclease CDS and guide RNA respectively;

Figure 11 (FIG.11) shows the editing efficiency (Indel) of human HEK293T cells following forward transfection of different pCasX plasmids with MYOD1-targeted crRNA plasmid at 400 ng and 100 ng respectively, wherein NC (Negative Control) represents the cell sample without adding the lipoplex mixture;

Figure 12 (FIG.12) shows the editing efficiency (Indel) of human HEK293T cells following forward transfection of different pCasX plasmids with FANCF-targeted crRNA plasmid at 400 ng and 100 ng respectively, wherein NC (Negative Control) represents the cell sample without adding the lipoplex mixture;

Figure 13 (FIG.13) shows the PAM preference of the wild type Cas12 in HEK293 cell line;

Figure 14 (FIG.14) shows the sites of the 4 mutated residues of the GEBx0142-variant, all of which are located around the putative PAM binding region;

Figure 15 (FIG.15) shows the domain arrangement of GEBx0142 in this disclosure;

Figure 16 (FIG.16) shows the PAM preference of the GEBx0142-variant in HEK293 cell line;

Figure 17 (FIG.17) shows the PAM preference of GEBx0123 in HEK293 cell line;

Figure 18 (FIG.18) shows the indel activity of GEBx0123 and GEBx0142 across 15 targets with TTTG-PAM in HEK293T cell line;

Figure 19 (FIG.19) shows the indel activity of GEBx0123 and GEBx0142 across 8 targets with TTTG-PAM in HEK293T cell line;

Figure 20 (FIG.20) shows the schematic of the pCasX-gRNA plasmid harboring the Cas nuclease CDS and guide RNA.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The following examples further illustrate the present disclosure, but the present disclosure is not limited thereto.

General Definitions

Unless defined otherwise, the technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Definitions of common terms and techniques in molecular biology may be found in Molecular Cloning: A Laboratory Manual, 2nd edition (1989) (Sambrook, Fritsch, and Maniatis); Molecular Cloning: A Laboratory Manual, 4th edition (2012) (Green and Sambrook); Current Protocols in Molecular Biology (1987) (F. M. Ausubel et al. eds.); the series Methods in Enzymology (Academic Press, Inc.); PCR 2: A Practical Approach (1995) (M. J. MacPherson, B. D. Hames, and G. R. Taylor eds.); Antibodies, A Laboratory Manual (1988) (Harlow and Lane, eds.); Antibodies: A Laboratory Manual, 2nd edition (2013) (E. A. Greenfield ed.); Animal Cell Culture (1987) (R.I. Freshney, ed.); Benjamin Lewin, Genes IX, published by Jones and Bartlett, 2008 (ISBN 0763752223); Kendrew et al. (eds.), the Encyclopedia of Molecular Biology, published by Blackwell Science Ltd., 1994 (ISBN 0632021829); Robert A. Meyers (ed.), Molecular Biology and Biotechnology: a Comprehensive Desk Reference, published by VCH Publishers, Inc., 1995 (ISBN 9780471185710); Singleton et al., Dictionary of Microbiology and Molecular Biology 2nd ed., J. Wiley & Sons (New York, N.Y. 1994); March, Advanced Organic Chemistry Reactions, Mechanisms and Structure 4th ed., John Wiley & Sons (New York, N.Y. 1992); and Marten H. Hofker and Jan van Deursen, Transgenic Mouse Methods and Protocols, 2nd edition (2011).

As used herein, the terms “a”, “an”, “the”, and “said” and similar terms used in the context of the present disclosure (especially in the context of the claims) are to be construed to cover both the singular and plural unless otherwise indicated herein or clearly contradicted by the context. In addition, it should be noted that a plural form does not necessarily denote more than one, and is to be understood according to the context in which it appears.

The term “identity” in the context of two or more nucleic acid or polypeptide sequences refers to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same, as measured using a sequence comparison algorithm such as BLAST, BLAST 2.0 or FASTA with the default parameters described below.
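As a rough illustration of how a percentage identity is scored, the following Python sketch counts identical columns in two sequences that have already been aligned (gaps written as "-"). It does not perform the alignment itself and is not a substitute for BLAST or FASTA; dividing by the number of alignment columns is an assumed convention, and the function names and toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns of two pre-aligned, gapped sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns where both sequences carry a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment has no comparable columns")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)

def meets_identity_threshold(aligned_a: str, aligned_b: str, threshold: float = 70.0) -> bool:
    """Check an alignment against a minimum identity such as the 70% figure used herein."""
    return percent_identity(aligned_a, aligned_b) >= threshold

print(round(percent_identity("MKTALL", "MKTALV"), 2))   # 83.33
print(meets_identity_threshold("MKT-LL", "MKTALV"))     # False (4 of 6 columns match)
```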

It is noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as “comprises”, “comprised”, “comprising” and the like can have the meaning attributed to them in U.S. Patent law; e.g., they can mean “includes”, “included”, “including”, and the like; and terms such as “consisting essentially of” and “consists essentially of” have the meaning ascribed to them in U.S. Patent law.

As used herein, the terms “mutant”, “variant”, “modification”, and similar terms used in the context of the present invention (especially in the context of the claims) are to be construed to have the same meaning unless otherwise indicated herein or clearly contradicted by the context.

As used herein, the terms “recognized”, “recognizing”, or “recognition” refer to the capability of the Cas12 protein to form a functional complex with a guide RNA at a DNA target site to which the guide RNA hybridizes (i.e. to which the guide sequence of the guide RNA hybridizes) and which is flanked by the PAM sequence, wherein the Cas12 protein is capable of performing its natural function, i.e. DNA cleavage. In this context it is to be noted that such DNA cleavage precludes the Cas12 protein from being a catalytically inactive Cas12 protein. In the case of, for instance, an inactivated Cas12 protein (e.g. a dead Cas12 protein), a complex between the Cas12 protein, guide RNA and cognate target may nevertheless be formed if the required PAM sequence is present, but this does not result in DNA cleavage.
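The notion of a target site flanked by a PAM can be sketched in Python as a scan for candidate sites. The sketch assumes that the PAM lies immediately 5' of the protospacer on the scanned strand, as is known for Cas12a; this passage does not state the orientation. The TTTN pattern is taken from the summary above, while the function name `find_pam_sites`, the 20-nt protospacer length and the toy sequence are illustrative assumptions. Only the given strand is scanned.

```python
import re

# Minimal nucleotide code table; only A, C, G, T and N are supported in this sketch.
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}

def find_pam_sites(dna: str, pam: str = "TTTN", spacer_len: int = 20):
    """Return (pam_start, pam, protospacer) for each PAM followed by a full-length protospacer."""
    dna = dna.upper()
    pam_pattern = "".join(CODES[base] for base in pam.upper())
    # A lookahead is used so that overlapping candidate sites are all reported.
    regex = re.compile(f"(?=({pam_pattern})([ACGT]{{{spacer_len}}}))")
    return [(m.start(), m.group(1), m.group(2)) for m in regex.finditer(dna)]

toy_target = "GG" + "TTTG" + "ACGTACGTACGTACGTACGT" + "CC"
for start, pam, protospacer in find_pam_sites(toy_target):
    print(start, pam, protospacer)
```

A hit in such a scan only marks a candidate; whether a given protein actually recognizes and cleaves the site is an experimental question.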

As used herein, a “sample” may contain whole cells and/or live cells and/or cell debris. The sample may contain (or be derived from) a “bodily fluid”. The present disclosure encompasses embodiments wherein the bodily fluid is selected from amniotic fluid, aqueous humour, vitreous humour, bile, blood serum, breast milk, cerebrospinal fluid, cerumen (earwax), chyle, chyme, endolymph, perilymph, exudates, feces, female ejaculate, gastric acid, gastric juice, lymph, mucus (including nasal drainage and phlegm), pericardial fluid, peritoneal fluid, pleural fluid, pus, rheum, saliva, sebum (skin oil), semen, sputum, synovial fluid, sweat, tears, urine, vaginal secretion, vomit and mixtures of one or more thereof. Samples include cell cultures, bodily fluids, and cell cultures from bodily fluids. Bodily fluids may be obtained from a mammalian organism, for example by puncture, or other collecting or sampling procedures.

Various embodiments are described hereinafter. It should be noted that the specific embodiments are not intended as an exhaustive description or as a limitation to the broader aspects discussed herein. One aspect described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced with any other embodiment(s). Reference throughout this specification to “in a specific embodiment”, “in some embodiments”, or “in certain embodiments” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrases “in a specific embodiment”, “in one embodiment” or “in certain embodiments” in various places throughout this specification are not necessarily all referring to the same embodiment, but may be. Furthermore, particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to a person skilled in the art from this disclosure, in one or more embodiments. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure. For example, in the appended claims, any one of the claimed embodiments can be used in any combination.

All publications, published patent documents, and patent applications cited herein are hereby incorporated by reference to the same extent as though each individual publication, published patent document, or patent application was specifically and individually indicated as being incorporated by reference.

The term “gene” refers to a nucleic acid sequence (used interchangeably with polynucleotide or nucleotide sequence) that encodes a chimeric molecule as described herein. This definition includes various sequence polymorphisms, mutations, and/or sequence variants wherein such alterations do not substantially affect the function of the encoded chimeric molecule. The term “gene” may include not only coding sequences but also regulatory regions such as promoters, enhancers, and termination regions. The term further can include all introns and other DNA sequences spliced from the mRNA transcript, along with variants resulting from alternative splice sites. Gene sequences encoding the molecule can be DNA or RNA that directs the expression of the chimeric molecule. These nucleic acid sequences may be a DNA strand sequence that is transcribed into RNA or an RNA sequence that is translated into protein. The nucleic acid sequences include both the full-length nucleic acid sequences as well as non-full-length sequences derived from the full-length protein. The sequences can also include degenerate codons of the native sequence or sequences that may be introduced to provide codon preference in a specific cell type. Portions of complete gene sequences are referenced throughout the disclosure as is understood by one of ordinary skill in the art.

A “homologue” of a protein as used herein is a protein of the same species which performs the same or a similar function as the protein it is a homologue of. Homologous proteins may but need not be structurally related, or are only partially structurally related. A “homologue” of a protein as used herein also includes sequences having one or more additions, deletions, stop positions, or substitutions, as compared to a sequence disclosed herein. The homologue protein as used herein performs the same or a similar function as the Cas12 protein disclosed herein.

The term “cleavage event” as used herein refers to a DNA break in a target sequence created by a nuclease of a CRISPR system described herein. In some embodiments, the cleavage event is a double-stranded DNA break. In some embodiments, the cleavage event is a single-stranded DNA break.

A “stem-loop structure” refers to a nucleic acid having a secondary structure that includes a region of nucleotides that are known or predicted to form a double strand (stem portion) that is linked on one side by a region of predominantly single-stranded nucleotides (loop portion). The terms “hairpin” and “fold-back” structures are also used herein to refer to stem-loop structures. Such structures are well known in the art and these terms are used consistently with their known meanings in the art. As is known in the art, a stem-loop structure does not require exact base-pairing. Thus, the stem may include one or more base mismatches. Alternatively, the base-pairing may be exact, i.e., not include any mismatches.

The term “donor template nucleic acid” as used herein refers to a nucleic acid molecule that can be used by one or more cellular proteins to alter the structure of a target sequence after a CRISPR enzyme described herein has altered a target nucleic acid. In some embodiments, the donor template nucleic acid is a double-stranded nucleic acid. In some embodiments, the donor template nucleic acid is a single-stranded nucleic acid. In some embodiments, the donor template nucleic acid is linear. In some embodiments, the donor template nucleic acid is circular (e.g., a plasmid). In some embodiments, the donor template nucleic acid is an exogenous nucleic acid molecule. In some embodiments, the donor template nucleic acid is an endogenous nucleic acid molecule (e.g., a chromosome).

As used herein, the term “targeting” refers to the ability of a complex including a CRISPR-associated protein and an RNA guide to preferentially or specifically bind to, e.g., hybridize to, a specific target sequence compared to other nucleic acids that do not have the same or similar sequence as the target nucleic acid.

As used herein, the term “target sequence” refers to a specific nucleic acid substrate that contains a nucleic acid sequence complementary to the entirety or a part of the spacer in an RNA guide. In some embodiments, the target sequence comprises a gene or a sequence within a gene. In certain embodiments, the target sequence comprises a noncoding region (e.g., a promoter). In a specific embodiment, the target sequence is single-stranded. In a specific embodiment, the target sequence is double-stranded.
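The complementarity between a spacer and a target sequence can be illustrated with a minimal sketch. It assumes plain Watson-Crick pairing, an RNA spacer written 5'→3', and a single DNA strand to search; the function names are hypothetical, and only the case of complementarity to the entire spacer is handled (the definition above also allows complementarity to a part of the spacer).

```python
# Watson-Crick partners; U in the RNA spacer pairs with A in a DNA target.
_COMPLEMENT = {"A": "T", "U": "A", "T": "A", "G": "C", "C": "G"}


def complementary_dna(spacer: str) -> str:
    """Return the DNA sequence (5'->3') fully complementary to the spacer."""
    return "".join(_COMPLEMENT[base] for base in reversed(spacer.upper()))


def find_target_sites(spacer: str, strand: str) -> list[int]:
    """Return 0-based start positions where the strand is complementary
    to the whole spacer (exact matches only, no mismatches tolerated)."""
    site = complementary_dna(spacer)
    strand = strand.upper()
    return [
        i
        for i in range(len(strand) - len(site) + 1)
        if strand.startswith(site, i)
    ]


if __name__ == "__main__":
    # Toy sequences chosen for illustration only.
    print(complementary_dna("GACUUAGC"))
    print(find_target_sites("GACUUAGC", "AAGCTAAGTCTT"))
```

A real search would typically scan both strands and tolerate a limited number of mismatches; those refinements are omitted here.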

It will be appreciated that the terms Cas12 enzyme, Cas12 protein, Cas12 effector protein and Cas12 are generally used interchangeably and at all points of reference herein refer by analogy to novel CRISPR effector proteins further described in this application, unless otherwise apparent.

Metagenomic sequencing samples were selected from public databases and downloaded, and the sequencing reads were assembled with assembly tools. To search for potential Cas protein sequences, reference Cas sequences were downloaded and candidate Cas sequences were then analyzed. In this way, 36 novel Cas12 proteins were identified. Information on the 36 novel Cas12 proteins is shown in Table 1.

Table 1 The detailed information of the Cas12 proteins

The phylogenetic tree was constructed by IQTREE (FIGs. 1-2) to visualize the relatedness of the orthologs at the primary amino-acid level using 162 Cas12a, Cas12b, Cas12c, Cas12d, Cas12e, Cas12f, Cas12g, Cas12i, Cas12j, Cas12k, Cas12L and TnpB sequences from the National Center for Biotechnology Information (NCBI), various publications, and patents. The branches of the tree corresponding to the Cas12 proteins disclosed in this invention are marked with a circle, while the reference nucleases (LbCpf1 and FnCpf1) are marked with a star. Although phylogenetically more closely related to Cas12a than to other subtypes, the disclosed proteins are located on different branches, suggesting that they are evolutionarily distinct. The tree shows that the engineered Cas12 proteins studied herein are representatives of unique Cas12 clusters. In addition, the Cas12 proteins share less than 70% identity with existing Cas proteins, and some share less than 60% or even less than 50% identity with existing Cas proteins. These features suggest that the Cas12 proteins are independent of the existing Cas12a family.

In one aspect, the disclosure provides an engineered, non-naturally occurring Cas12 protein, wherein the Cas12 protein comprises an amino acid sequence selected from SEQ ID NOs: 1-36, a homologue thereof having at least 70% sequence identity to the amino acid sequence, or a variant thereof.

For example, “at least 70%” can include 70%, 75%, 80%, 85%, 86%, 87%, 88%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 80%” can include 85%, 86%, 87%, 88%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 85%” can include 85%, 86%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 90%” can include 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 95%” can include 95%, 96%, 97%, 98%, 99% or 100%; “at least 97%” can include 97%, 98%, 99% or 100%; “at least 98%” can include 98%, 99% or 100%; and so on.
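As an illustration of how an identity threshold such as “at least 70%” might be checked, the following sketch computes percent identity over a pair of sequences that have already been aligned. The disclosure does not specify the alignment algorithm or the denominator; this sketch assumes identity is counted over alignment columns, with `-` as the gap character, and the function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns of a pairwise alignment.

    Both inputs must be the same length; '-' marks a gap. Columns where
    both sequences have a gap are ignored; a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_threshold(aligned_a: str, aligned_b: str, threshold: float = 70.0) -> bool:
    """True if the aligned pair has at least `threshold` percent identity."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Toy peptides for illustration; not sequences from the disclosure.
    print(percent_identity("MKTAYIAKQR", "MKTAYLAK-R"))
    print(meets_threshold("MKTAYIAKQR", "MKTAYLAK-R", 70.0))
```

Other conventions (for example, dividing by the length of the shorter sequence) give different values, so the convention should be stated whenever a threshold is applied.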

In certain embodiments, the amino acid sequence of the Cas12 protein has 100% sequence identity to any one of SEQ ID NOs: 1-36. The “100% sequence identity” means the amino acid sequence of the CRISPR-Cas12 protein is selected from any one of SEQ ID NOs: 1-36. The example amino acid sequences of the Cas12 protein are shown in Table 2.

Table 2 The example amino acid sequences of Cas12 protein

REC is the abbreviation of “recognition”. The REC1 domain is also called the Helical I domain and the REC2 domain is also called the Helical II domain. WED is the abbreviation of “wedge”; the WED domain, also called OBD, is the oligonucleotide-binding domain. The REC lobe, the WED lobe and the PI domain (PI being the abbreviation of PAM-interacting; also called LHD) can form a cleft, and the effect of a mutated site in these domains is unpredictable. Mutants of the CRISPR-Cas12 protein are explored to obtain variants which have an altered PAM, have a modified nuclease activity (e.g., cleavage activity) and/or have a modified ability to functionally associate with a target nucleic acid. In some embodiments, the variant can recognize a broader range of PAMs, and a PAM preference can be selected. In some embodiments, the variant may comprise one or more mutations that increase the ability of the nuclease to cleave a target nucleic acid. In some embodiments, the variant is a high-fidelity version with reduced off-target effects.

In some embodiments, the variant comprises one or more mutations in the REC1 domain and/or the WED II domain of SEQ ID NO: 34.

In some embodiments, the variant comprises one or more mutations in the region of 180-200 and/or 560-620 with reference to amino acid position numbering of SEQ ID NO: 34. In some embodiments, the variant comprises one or more mutations in the region of 190-200 and/or 570-610 with reference to amino acid position numbering of SEQ ID NO: 34. In some embodiments, the variant comprises one or more mutations in the region of 195-200 and/or 580-595 with reference to amino acid position numbering of SEQ ID NO: 34.

In some embodiments, the variant comprises one or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. In a certain embodiment, the variant comprises one mutation at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. For example, in an embodiment, the variant comprises one mutation at Q198 of SEQ ID NO: 34; in an embodiment, the variant comprises one mutation at D584 of SEQ ID NO: 34; in an embodiment, the variant comprises one mutation at K590 of SEQ ID NO: 34; in an embodiment, the variant comprises one mutation at Q593 of SEQ ID NO: 34.
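The relationship between the named positions and the position ranges recited for SEQ ID NO: 34 can be checked with a short sketch. It assumes the ranges are inclusive and 1-based; the tier labels ("broad", "intermediate", "narrow") and function names are hypothetical and used only for illustration.

```python
# Position ranges recited for SEQ ID NO: 34 (assumed inclusive, 1-based).
REGIONS = {
    "broad": [(180, 200), (560, 620)],
    "intermediate": [(190, 200), (570, 610)],
    "narrow": [(195, 200), (580, 595)],
}

# Named positions: wild-type residue letter followed by its position.
POSITIONS = {"Q198": 198, "D584": 584, "K590": 590, "Q593": 593}


def in_any_region(position: int, regions: list[tuple[int, int]]) -> bool:
    """True if the position lies inside at least one inclusive range."""
    return any(start <= position <= end for start, end in regions)


def positions_covered(tier: str) -> dict[str, bool]:
    """Map each named position to whether it falls inside the given tier."""
    return {name: in_any_region(pos, REGIONS[tier]) for name, pos in POSITIONS.items()}


if __name__ == "__main__":
    for tier in REGIONS:
        print(tier, positions_covered(tier))
```

Under these assumptions, all four named positions fall within even the narrowest pair of ranges.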

In some embodiments, the variant comprises two or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. In a certain embodiment, the variant comprises two mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. For example, in an embodiment, the variant comprises two mutations at Q198 and D584 of SEQ ID NO: 34; in an embodiment, the variant comprises two mutations at Q198 and K590 of SEQ ID NO: 34; in an embodiment, the variant comprises two mutations at Q198 and Q593 of SEQ ID NO: 34; in an embodiment, the variant comprises two mutations at D584 and K590 of SEQ ID NO: 34; in an embodiment, the variant comprises two mutations at D584 and Q593 of SEQ ID NO: 34; in an embodiment, the variant comprises two mutations at K590 and Q593 of SEQ ID NO: 34.

In some embodiments, the variant comprises three or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. In a certain embodiment, the variant comprises three mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34. For example, in an embodiment, the variant comprises three mutations at Q198, D584 and K590 of SEQ ID NO: 34; in an embodiment, the variant comprises three mutations at Q198, D584 and Q593 of SEQ ID NO: 34; in an embodiment, the variant comprises three mutations at Q198, K590 and Q593 of SEQ ID NO: 34; in an embodiment, the variant comprises three mutations at D584, K590 and Q593 of SEQ ID NO: 34.

In some embodiments, the variant comprises mutations at the following positions: Q198, D584, K590, and Q593 of SEQ ID NO: 34.
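The single, double, triple and quadruple position combinations enumerated above can be generated mechanically. The following is a minimal sketch using the standard-library `itertools.combinations`; the function name is hypothetical.

```python
from itertools import combinations

# The four positions of SEQ ID NO: 34 named in the embodiments above.
POSITIONS = ("Q198", "D584", "K590", "Q593")


def position_combinations(positions=POSITIONS):
    """Return a dict mapping combination size to the list of combinations
    of that many positions, in the order the positions are given."""
    return {
        size: list(combinations(positions, size))
        for size in range(1, len(positions) + 1)
    }


if __name__ == "__main__":
    for size, combos in position_combinations().items():
        print(size, ["+".join(combo) for combo in combos])
```

This yields 4 single, 6 double, 4 triple and 1 quadruple combination, 15 in total, matching the embodiments listed in the text.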

In some embodiments, the mutation is a single amino acid substitution.

In some embodiments, said amino acid is mutated to a positively charged amino acid. In some embodiments, said amino acid is mutated to R and/or K, preferably R.

In some embodiments, the variant comprises the following mutations: Q198R, D584R, K590R, and Q593R of SEQ ID NO: 34.
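Substitutions written in the form Q198R (wild-type residue, 1-based position, replacement residue) can be applied to a sequence as sketched below. The sequence of SEQ ID NO: 34 is not reproduced here, so the example uses a made-up five-residue peptide; the function name and the error handling are assumptions for illustration.

```python
import re

# Matches notation such as "Q198R": wild-type residue, position, replacement.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_substitutions(sequence: str, mutations: list[str]) -> str:
    """Apply single amino acid substitutions to a protein sequence.

    Positions are 1-based. A ValueError is raised if the notation is not
    recognized, the position is out of range, or the residue found at the
    position does not match the stated wild-type residue.
    """
    residues = list(sequence)
    for mutation in mutations:
        match = _MUTATION.match(mutation)
        if match is None:
            raise ValueError(f"unrecognized mutation: {mutation!r}")
        wild_type, position, replacement = match.group(1), int(match.group(2)), match.group(3)
        index = position - 1
        if not 0 <= index < len(residues):
            raise ValueError(f"{mutation}: position {position} is out of range")
        if residues[index] != wild_type:
            raise ValueError(
                f"{mutation}: expected {wild_type} at {position}, found {residues[index]}"
            )
        residues[index] = replacement
    return "".join(residues)


if __name__ == "__main__":
    # Toy peptide standing in for a real sequence.
    print(apply_substitutions("MQDKQ", ["Q2R", "D3R", "K4R", "Q5R"]))
```

Checking the wild-type residue before substituting guards against numbering the wrong reference sequence.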

In some embodiments, the variant recognizes a PAM sequence which is not recognized by SEQ ID NO: 34. In some embodiments, the variant recognizes a PAM sequence which is not TTTN, wherein N is A, T, G or C.
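Whether a candidate sequence falls within the pattern TTTN, with N standing for A, T, G or C, can be tested with a simple pattern match. This sketch covers only the string comparison; it makes no assumption about where the PAM sits relative to the target sequence, and the function name is hypothetical.

```python
# Bases allowed by each pattern symbol; N stands for any of A, C, G, T.
_ALLOWED = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def matches_pam(candidate: str, pattern: str = "TTTN") -> bool:
    """True if the candidate matches the PAM pattern base by base."""
    candidate = candidate.upper()
    if len(candidate) != len(pattern):
        return False
    return all(base in _ALLOWED[code] for base, code in zip(candidate, pattern))


if __name__ == "__main__":
    for pam in ("TTTA", "TTTG", "TTCA", "CTTA"):
        print(pam, matches_pam(pam))
```

A PAM recognized by a variant but "not TTTN" in the sense above would be one for which `matches_pam` returns `False`.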

In some embodiments, the variant has nuclease activity. In some embodiments, the variant has double-strand DNA cleavage activity or nickase activity.

The Cas12 protein comprises one or more nuclear localization signals (NLSs). The NLS(s) can be located at an end or at another portion of the peptide. The NLSs located at each end or other portion of the Cas12 amino acid sequence can be the same or different. In some embodiments, the NLS at the N-terminal end and the NLS at the C-terminal end are the same. In some embodiments, the NLS at the N-terminal end and the NLS at the C-terminal end are different. In some embodiments, the N-terminal end of the Cas12 amino acid sequence comprises one NLS and the C-terminal end of the Cas12 amino acid sequence comprises one NLS. The amino acid sequences of the NLSs are fused to the N-terminal end and the C-terminal end of the Cas12 amino acid sequence, respectively.

NLS is fused to a peptide or non-peptide moiety that allows proteins to enter or localize to a tissue, a cell, or a region of a cell. For instance, the NLS may be an SV40 (simian virus 40) NLS, a c-Myc NLS, or another suitable monopartite NLS. The NLS may be fused to the N-terminus and/or the C-terminus of the Cas12 protein.
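The arrangement of NLSs at the N-terminal and/or C-terminal end can be sketched as a simple string assembly. The SV40 large T antigen NLS is commonly given as PKKKRKV; the placeholder protein sequence, the optional linker and the function name are assumptions for illustration and are not taken from the disclosure.

```python
from typing import Optional

# Commonly cited monopartite NLS of the SV40 large T antigen.
SV40_NLS = "PKKKRKV"


def fuse_nls(
    protein: str,
    n_term_nls: Optional[str] = None,
    c_term_nls: Optional[str] = None,
    linker: str = "",
) -> str:
    """Return the protein sequence with optional NLSs fused at either end.

    The linker (empty by default) is placed between each NLS and the
    protein. Start-codon and tag considerations are not handled here.
    """
    parts = []
    if n_term_nls:
        parts.extend([n_term_nls, linker])
    parts.append(protein)
    if c_term_nls:
        parts.extend([linker, c_term_nls])
    return "".join(parts)


if __name__ == "__main__":
    # "MAAA" is a placeholder, not a Cas12 sequence.
    print(fuse_nls("MAAA", n_term_nls=SV40_NLS, c_term_nls=SV40_NLS))
```

Passing different values for `n_term_nls` and `c_term_nls` corresponds to the embodiments in which the two NLSs differ.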

Generally, an affinity tag is added for purification of the fusion polypeptide by affinity chromatography.

In another aspect, the disclosure provides an engineered, non-naturally occurring cell comprising the Cas12 protein of any one of above.

In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell.

The cell may be a eukaryotic cell or a prokaryotic cell. In one embodiment, the cell is a eukaryotic cell. In another embodiment, the cell is a vertebrate, mammalian, rodent, goat, pig, bird, chicken, turkey, cow, horse, sheep, fish, primate, or human cell. In one embodiment, the cell is a mammalian cell. In one embodiment, the cell is a human cell. In one embodiment, the cell is a somatic cell, a germ cell, or a prenatal cell. In one embodiment, the cell is a zygotic cell, a blastocyst cell, an embryonic cell, a stem cell, a mitotically competent cell, or a meiotically competent cell. In one embodiment, the cell is not part of a human embryo. In one embodiment, the cell is a somatic cell. In one embodiment, the cell is a T cell, a CD8⁺ T cell, a CD8⁺ naive T cell, a central memory T cell, an effector memory T cell, a CD4⁺ T cell, a stem cell memory T cell, a helper T cell, a regulatory T cell, a cytotoxic T cell, a natural killer T cell, a hematopoietic stem cell, a long term hematopoietic stem cell, a short term hematopoietic stem cell, a multipotent progenitor cell, a lineage restricted progenitor cell, a lymphoid progenitor cell, a myeloid progenitor cell, a common myeloid progenitor cell, an erythroid progenitor cell, a megakaryocyte erythroid progenitor cell, a retinal cell, a photoreceptor cell, a rod cell, a cone cell, a retinal pigmented epithelium cell, a trabecular meshwork cell, a cochlear hair cell, an outer hair cell, an inner hair cell, a pulmonary epithelial cell, a bronchial epithelial cell, an alveolar epithelial cell, a pulmonary epithelial progenitor cell, a striated muscle cell, a cardiac muscle cell, a muscle satellite cell, a neuron, a neuronal stem cell, a mesenchymal stem cell, an induced pluripotent stem (iPS) cell, an embryonic stem cell, a monocyte, a megakaryocyte, a neutrophil, an eosinophil, a basophil, a mast cell, a reticulocyte, a B cell, e.g., a progenitor B cell, a Pre B cell, a Pro B cell, a memory B cell, a plasma B cell, a gastrointestinal epithelial cell, a biliary epithelial cell, a pancreatic ductal epithelial cell, an intestinal stem cell, a hepatocyte, a liver stellate cell, a Kupffer cell, an osteoblast, an osteoclast, an adipocyte, a preadipocyte, a pancreatic islet cell (e.g., a beta cell, an alpha cell, a delta cell), a pancreatic exocrine cell, a Schwann cell, or an oligodendrocyte. In one embodiment, the cell is a T cell, a hematopoietic stem cell, a retinal cell, a cochlear hair cell, a pulmonary epithelial cell, a muscle cell, a neuron, a mesenchymal stem cell, an induced pluripotent stem (iPS) cell, or an embryonic stem cell. In another embodiment, the cell is a plant cell.

In another aspect, the disclosure provides a kit comprising the engineered, non-naturally occurring Cas12 protein of any one of above. In addition, the kit can comprise other components, for example, a solution or a buffer.

It would be appreciated that the kit may further comprise other suitable excipients such as buffers or reagents for facilitating the application of the kit. Preferably, the kit may be applied in various applications such as medical applications including therapies and diagnosis, research and the like. Accordingly, the Cas12 protein and the kit of the present invention may be used in the preparation of a medicament for treatment and/or in the preparation of an agent for research study.

The polynucleotides may be in the form of RNA or DNA, which includes cDNA, genomic DNA, and synthetic DNA. A polynucleotide may be double stranded or single stranded, and if single stranded, may be the coding strand or the non-coding (anti-sense) strand. A coding polynucleotide may have a coding sequence identical to a coding sequence known in the art or may have a different coding sequence, which, as the result of the redundancy or degeneracy of the genetic code, or by splicing, can encode the same polypeptide.

The polynucleotide may include not only coding sequences but also regulatory regions such as promoters, enhancers, and termination regions. The term further can include all introns and other DNA sequences spliced from the mRNA transcript, along with variants resulting from alternative splice sites. These nucleic acid sequences may be a DNA strand sequence that is transcribed into RNA or an RNA sequence that is translated into protein. The nucleic acid sequences include both the full-length nucleic acid sequences as well as non-full-length sequences derived from the full-length protein. The sequences can also include degenerate codons of the native sequence or sequences that may be introduced to provide codon preference in a specific cell type. The polynucleotide sequences are referenced throughout the disclosure as is understood by one of ordinary skill in the art.

In some embodiments, the polynucleotide is a ribonucleotide sequence or a deoxyribonucleotide sequence or analogs thereof; preferably the polynucleotide is mRNA, and the polynucleotide further comprises a 5’ cap sequence and a poly-A tail sequence. In some embodiments, the polynucleotide is codon optimized for expression in a cell of interest. In some embodiments, the polynucleotide is codon optimized for expression in a eukaryotic cell; preferably the polynucleotide has at least 90%, 92%, 95% or 98% sequence identity to any one of SEQ ID NOs: 180-183, 202-211. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the cell is a mammalian cell, preferably a human cell.

In some embodiments, the polynucleotide has at least 90%, 92%, 95% or 98% sequence identity to any one of SEQ ID NOs: 180-183, 202-211. In some embodiments, the polynucleotide has at least 90% sequence identity to any one of SEQ ID NOs: 180-183, 202-211. In some embodiments, the polynucleotide has at least 95% sequence identity to any one of SEQ ID NOs: 180-183, 202-211. In some embodiments, the polynucleotide has at least 98% sequence identity to any one of SEQ ID NOs: 180-183, 202-211. In some embodiments, the polynucleotide has a sequence identity to any one of SEQ ID NOs: 180-183, 202-211. For example, “at least 90%” can include 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 95%” can include 95%, 96%, 97%, 98%, 99% or 100%; “at least 97%” can include 97%, 98%, 99% or 100%; “at least 98%” can include 98%, 99% or 100%; and so on.

In some embodiments, the polynucleotide has at least 70% sequence identity to any one of the SEQ ID NOs: 37-72.

In some embodiments, the polynucleotide has at least 75%, 80%, 85%, 88%, 90%, 92%, 94%, 95%, 96%, 98% or 99% sequence identity to any one of the SEQ ID NOs: 37-72.

For example, “at least 70%” can include 70%, 72%, 75%, 80%, 85%, 86%, 87%, 88%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 80%” can include 85%, 86%, 87%, 88%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 85%” can include 85%, 86%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 90%” can include 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 100%; “at least 95%” can include 95%, 96%, 97%, 98%, 99% or 100%; “at least 97%” can include 97%, 98%, 99% or 100%; “at least 98%” can include 98%, 99% or 100%; and so on.

Table 3 The example nucleic acid sequences of Cas12 protein

The nucleic acids of SEQ ID NOs: 37-72 are non-human-codon-optimized sequences. The example nucleic acid sequences of the Cas12 protein are shown in Table 3.

In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use in gene editing.

In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use in a method of therapy, treatment, prevention, diagnosis or detection of disease.

In another aspect, the disclosure provides the engineered, non-naturally occurring Cas12 protein as described herein above, or the Cas12 polynucleotide as described herein above, for use as a medicament.

In another aspect, the disclosure provides an engineered vector comprising the Cas12 polynucleotide of any one of above.

In certain aspects, the invention involves vectors. As used herein, a “vector” is a tool that allows or facilitates the transfer of an entity from one environment to another. It is a replicon, such as a plasmid, phage, or cosmid, into which another DNA segment may be inserted so as to bring about the replication of the inserted segment. Generally, a vector is capable of replication when associated with the proper control elements. In general, the term “vector” refers to a nucleic acid molecule capable of transporting another nucleic acid to which it has been linked. Vectors include, but are not limited to, nucleic acid molecules that are single-stranded, double-stranded, or partially double-stranded; nucleic acid molecules that comprise one or more free ends, no free ends (e.g., circular); nucleic acid molecules that comprise DNA, RNA, or both; and other varieties of polynucleotides known in the art. One type of vector is a “plasmid,” which refers to a circular double stranded DNA loop into which additional DNA segments can be inserted, such as by standard molecular cloning techniques. Another type of vector is a viral vector, wherein virally-derived DNA or RNA sequences are present in the vector for packaging into a virus (e.g., retroviruses, replication defective retroviruses, adenoviruses, replication defective adenoviruses, and adeno-associated viruses (AAVs)). Viral vectors also include polynucleotides carried by a virus for transfection into a host cell. Certain vectors are capable of autonomous replication in a host cell into which they are introduced (e.g., bacterial vectors having a bacterial origin of replication and episomal mammalian vectors). Other vectors (e.g., non-episomal mammalian vectors) are integrated into the genome of a host cell upon introduction into the host cell, and thereby are replicated along with the host genome. Moreover, certain vectors are capable of directing the expression of genes to which they are operatively-linked. Such vectors are referred to herein as “expression vectors”. Common expression vectors of utility in recombinant DNA techniques are often in the form of plasmids. Recombinant expression vectors can comprise a nucleic acid of the invention in a form suitable for expression of the nucleic acid in a host cell, which means that the recombinant expression vectors include one or more regulatory elements, which may be selected on the basis of the host cells to be used for expression, that are operatively-linked to the nucleic acid sequence to be expressed. Within a recombinant expression vector, “operably linked” is intended to mean that the nucleotide sequence of interest is linked to the regulatory element(s) in a manner that allows for expression of the nucleotide sequence (e.g., in an in vitro transcription/translation system or in a host cell when the vector is introduced into the host cell).

In some embodiments, the vector is an expression vector. In some embodiments, the vector is an inducible, conditional, or constitutive expression vector.

In another aspect, the disclosure provides a vector system comprising one or more vectors of any one of above. In some embodiments, one or more vectors comprise a polynucleotide according to any one of above and one or more polynucleotides which are on the same or a different vector encoding a guide RNA.

In another aspect, the disclosure provides an engineered cell comprising the Cas12 polynucleotide of any one of above, or comprising the vector of any one of above, or comprising the vector system of any one of above.

In some embodiments, the cell expresses the Cas12 protein. In some embodiments, the cell transiently expresses or non-transiently expresses the Cas12 protein. In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell.

In another aspect, the disclosure provides a pharmaceutical composition comprising the Cas12 protein of any one of above or the polynucleotide of any one of above or the vector of any one of above or the vector system of any one of above formulated for delivery by AAV (adeno-associated viruses), adenoviruses, retroviruses, HSV (herpes simplex virus), Gammaretrovirus, LV (lentivirus), eCIS (extracellular contractile injection system), eVLP (engineered virus-like particles), VLP (virus-like particles), liposomes, plasmid, lipid nanoparticles (LNPs), exosomes, microvesicles, nucleic acid nanoassemblies, a gene gun, and/or an implantable device.

“Gammaretrovirus” refers to a genus of the retroviridae family. Exemplary gammaretroviruses include mouse stem cell virus, murine leukemia virus, feline leukemia virus, feline sarcoma virus, and avian reticuloendotheliosis viruses.

The CRISPR-Cas12 system described below or the pharmaceutical composition described above, or components thereof, nucleic acid molecules thereof, or nucleic acid molecules encoding or providing components thereof, can be delivered by various delivery systems such as vectors, e.g., plasmids, viral delivery vectors, such as adeno-associated viruses (AAV), lentiviruses, adenoviruses, and other viral vectors, or methods, such as nucleofection or electroporation of ribonucleoprotein complexes consisting of Type V-I effectors and their cognate RNA guide or guides. The proteins and one or more RNA guides can be packaged into one or more vectors, e.g., plasmids or viral vectors. For bacterial applications, the nucleic acids encoding any of the components of the CRISPR systems described herein can be delivered to the bacteria using a phage. Exemplary phages include, but are not limited to, T4 phage, Mu, λ phage, T5 phage, T7 phage, T3 phage, Φ29, M13, MS2, Qβ, and ΦX174.

In some embodiments, the vectors, e.g., plasmids or viral vectors, are delivered to the tissue of interest by, e.g., intramuscular injection, intravenous administration, transdermal administration, intranasal administration, oral administration, or mucosal administration. Such delivery may be either via a single dose or multiple doses. One skilled in the art understands that the actual dosage to be delivered herein may vary greatly depending upon a variety of factors, such as the vector choices, the target cells, organisms, tissues, the general conditions of the subject to be treated, the degrees of transformation/modification sought, the administration routes, the administration modes, the types of transformation/modification sought, etc.

In certain embodiments, the delivery is via adeno-associated viruses (AAV), e.g., AAV2, AAV8, or AAV9, which can be administered in a single dose containing at least 1×10⁵ particles (also referred to as particle units, pu) of adenoviruses or adeno-associated viruses. In some embodiments, the dose is at least about 1×10⁶ particles, at least about 1×10⁷ particles, at least about 1×10⁸ particles, or at least about 1×10⁹ particles of the adeno-associated viruses. Due to the limited genomic payload of recombinant AAV, the smaller size of the Cas12 proteins described herein enables greater versatility in packaging the effector and RNA guides with the appropriate control sequences (e.g., promoters) required for efficient and cell-type specific expression.

In some embodiments, the delivery is via a recombinant adeno-associated virus (rAAV) vector. For example, in some embodiments, a modified AAV vector may be used for delivery. Modified AAV vectors can be based on one or more of several capsid types, including AAV1, AAV2, AAV5, AAV6, AAV8, AAV8.2, AAV9, AAV rh10, modified AAV vectors (e.g., modified AAV2, modified AAV3, modified AAV6) and pseudotyped AAV (e.g., AAV2/8, AAV2/5 and AAV2/6). Exemplary AAV vectors and techniques that may be used to produce rAAV particles are known in the art (see, e.g., Aponte-Ubillus et al. (2018) Appl. Microbiol. Biotechnol. 102(3): 1045-54; Zhong et al. (2012) J. Genet. Syndr. Gene Ther. S1: 008; West et al. (1987) Virology 160: 38-47; Tratschin et al. (1985) Mol. Cell. Biol. 5: 3251-110, each of which is incorporated by reference).

In some embodiments, the delivery is via plasmids. The dosage can be a sufficient number of plasmids to elicit a response. In some cases, suitable quantities of plasmid DNA in plasmid compositions can be from about 0.1 to about 2 mg. Plasmids will generally include (i) a promoter; (ii) a sequence encoding a nucleic acid-targeting CRISPR enzyme, operably linked to the promoter; (iii) a selectable marker; (iv) an origin of replication; and (v) a transcription terminator downstream of and operably linked to (ii). The plasmids can also encode the RNA components of a CRISPR-Cas system, but one or more of these may instead be encoded on different vectors. The frequency of administration is within the ambit of the medical or veterinary practitioner (e.g., physician, veterinarian), or a person skilled in the art.

In another embodiment, lipid nanoparticles (LNPs) are contemplated. LNPs can be formed from different materials and can take different forms. For example, the LNP may comprise: a cationic lipid at a molar ratio between 35% and 45%, a polyethylene glycol (PEG) conjugated (PEGylated) lipid at a molar ratio between 0.25% and 2.75%, a cholesterol-based lipid at a molar ratio between 20% and 35%, and a helper lipid at a molar ratio of between 25% and 35%, wherein all the molar ratios are relative to the total lipid content of the LNP. LNPs can be made in different sizes, such as an average diameter of 30-200 nm or 80-150 nm.
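A candidate lipid composition can be checked against the molar-ratio ranges recited above with a short sketch. It assumes, for illustration only, that the four lipid classes make up the entire lipid content so that their molar percentages sum to 100%; the text says "comprise", so other lipids are not excluded. The component keys and function name are hypothetical.

```python
# Molar-percentage ranges recited above, relative to total lipid content.
RANGES = {
    "cationic_lipid": (35.0, 45.0),
    "peg_lipid": (0.25, 2.75),
    "cholesterol_lipid": (20.0, 35.0),
    "helper_lipid": (25.0, 35.0),
}


def check_lnp(composition: dict[str, float], tolerance: float = 1e-6) -> list[str]:
    """Return a list of problems found; an empty list means the composition
    fits every range and (by assumption) sums to 100 mol%."""
    problems = []
    for component, (low, high) in RANGES.items():
        value = composition.get(component)
        if value is None:
            problems.append(f"{component}: missing")
        elif not low <= value <= high:
            problems.append(f"{component}: {value}% outside {low}-{high}%")
    total = sum(composition.values())
    if abs(total - 100.0) > tolerance:
        problems.append(f"total is {total}%, expected 100%")
    return problems


if __name__ == "__main__":
    # Illustrative numbers only; not a formulation from the disclosure.
    example = {
        "cationic_lipid": 40.0,
        "peg_lipid": 1.5,
        "cholesterol_lipid": 28.5,
        "helper_lipid": 30.0,
    }
    print(check_lnp(example))
```

Note that the recited ranges constrain one another once the total is fixed: for instance, with the cationic lipid at 45% and the helper lipid at 35%, the cholesterol-based lipid cannot exceed 19.75% without the PEGylated lipid dropping below its minimum, which lies outside the stated 20-35% range.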

In another embodiment, the delivery is via liposomes or lipofection formulations and the like, and can be prepared by methods known to those skilled in the art. Such methods are described, for example, in WO 2016205764 and U.S. Pat. Nos. 5,593,972; 5,589,466; and 5,580,859; each of which is incorporated herein by reference in its entirety.

In some embodiments, the delivery is via nanoparticles or exosomes. For example, exosomes have been shown to be particularly useful in the delivery of RNA.

Further means of introducing one or more components of the new CRISPR systems into cells is by using cell penetrating peptides (CPP). In some embodiments, a cell penetrating peptide is linked to the CRISPR enzymes. In some embodiments, the CRISPR enzymes and/or RNA guides are coupled to one or more CPPs to transport them inside cells effectively (e.g., plant protoplasts). In some embodiments, the CRISPR enzymes and/or RNA guide(s) are encoded by one or more circular or non-circular DNA molecules that are coupled to one or more CPPs for cell delivery.

The engineered Cas12 protein complexes with the guide sequence to form a CRISPR complex, wherein in the CRISPR complex the nucleic acid molecule targets one or more polynucleotide loci.

In some embodiments, the direct repeat sequence and the spacer sequence are heterologous. “Heterologous”, as used herein, means a nucleotide or polypeptide sequence that is not found in the native nucleic acid or protein, respectively.

In some embodiments, the system comprises at least one guide sequence capable of hybridizing to at least one target sequence or to different regions of one target sequence. In some embodiments, the guide sequence hybridizes to one or more target sequences in a prokaryotic cell or in a eukaryotic cell. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the eukaryotic cell comprises a mammalian cell. In some embodiments, the mammalian cell comprises a human cell. In some embodiments, the eukaryotic cell comprises a plant cell.

In some embodiments, the target sequence is DNA or RNA. In some embodiments, the target sequence is selected from: double stranded DNA, double stranded RNA, single stranded DNA, single stranded RNA, genomic DNA, or extrachromosomal DNA.

In some embodiments, the direct repeat sequence comprises a stem-loop structure comprising a first stem nucleotide strand which comprises 4-7 nucleotides; a second stem nucleotide strand which comprises 4-7 nucleotides, wherein the first and second stem nucleotide strands can hybridize with each other; and a loop nucleotide strand arranged between the first and second stem nucleotide strands, wherein the loop nucleotide strand comprises 4 or 5 nucleotides.
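The stem and loop length limits above can be illustrated with a simple scan for candidate hairpins. This is a minimal sketch under stated assumptions: it requires exact Watson-Crick pairing and stems of equal length (the disclosure also permits mismatched stems), the function name is hypothetical, and the example repeat is an illustrative sequence rather than one of SEQ ID NOs: 73-104. It is not a substitute for a folding program.

```python
# Watson-Crick pairs only; mismatches and G-U wobble pairs are ignored here.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def find_stem_loops(seq, stem_range=(4, 7), loop_range=(4, 5)):
    """Return (start, stem_length, loop_length) for every perfect hairpin.

    start is the 0-based index of the first nucleotide of the first stem strand.
    """
    rna = seq.upper().replace("T", "U")
    hits = []
    for start in range(len(rna)):
        for stem in range(stem_range[0], stem_range[1] + 1):
            for loop in range(loop_range[0], loop_range[1] + 1):
                end = start + 2 * stem + loop
                if end > len(rna):
                    continue
                first = rna[start:start + stem]
                second = rna[start + stem + loop:end]
                # The first strand pairs antiparallel with the second strand.
                if all((first[k], second[stem - 1 - k]) in PAIRS
                       for k in range(stem)):
                    hits.append((start, stem, loop))
    return hits


# Illustrative repeat: stem UCUAC / GUAGA separated by the loop UAAGU.
print(find_stem_loops("AAUUUCUACUAAGUGUAGAU"))
```

Shorter sub-stems nested inside a longer hairpin are reported as separate hits, so the longest hit at a given loop is usually the one of interest.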

In some embodiments, the direct repeat sequence comprises a nucleotide sequence having at least 90% identity to any one of SEQ ID NOs: 73-104.

In some embodiments, the direct repeat sequence comprises a nucleotide sequence having at least 90% identity to any one of SEQ ID NOs: 73-78.

In some embodiments, the direct repeat sequence comprises a nucleotide sequence having at least 90% identity to any one of SEQ ID NOs: 79-104.

In some embodiments, the nucleotide sequence of the direct repeat sequence corresponding to different Cas12 proteins is shown in Table 4.

Table 4 The Cas12 protein and the direct repeat sequence

The engineered crRNA or the engineered guide sequence described herein comprises a spacer sequence and a direct repeat sequence. The predicted crRNA secondary structures corresponding to SEQ ID NOs: 73-78 are shown in FIG.8. In FIG.8, N represents the target specific sequence and the number of N is just an example illustration which does not represent its actual nucleotide quantity.

The guide RNA secondary structures of the Cas12 proteins suggest that the Cas12 proteins could process and utilize each other’s crRNAs for DNA targeting. A “stem-loop structure” refers to a nucleic acid having a secondary structure that includes a region of nucleotides that are known or predicted to form a double strand (stem portion) that is linked on one side by a region of predominantly single-stranded nucleotides (loop portion). The terms “hairpin” and “fold-back” structures are also used herein to refer to stem-loop structures. Such structures are well known in the art and these terms are used consistently with their known meanings in the art. As is known in the art, a stem-loop structure does not require exact base-pairing. Thus, the stem may include one or more base mismatches. Alternatively, the base-pairing may be exact, i.e., not include any mismatches. The predicted stem loop structure of the direct repeat is illustrated in FIG.8. In FIG.8, “N” is just an example illustration and does not represent its actual nucleotide quantity.

In certain embodiments, the Cas12 protein has nuclease activity. In certain embodiments, the Cas12 protein has single-strand RNA cleavage activity, double-strand RNA cleavage activity, single-strand DNA cleavage activity, double-strand DNA cleavage activity, nucleic acid binding activity, or collateral cleavage activity of RNA and/or DNA.

In some embodiments, the Cas12 protein has endonuclease activity, nickase activity, and/or exonuclease activity.

In certain embodiments, the Cas12 protein according to the disclosure as described herein may be a deactivated or inactivated Cas12 protein (e.g., a “dead” Cas12 protein), wherein catalytic activity is partially or (substantially) completely lost, as described herein elsewhere. Loss of catalytic activity in this context means that the Cas12 protein is not capable of cleaving DNA (e.g. not capable of inducing double strand breaks, or only capable of inducing single strand breaks, such as a nickase). The Cas12 protein may be used to reduce off-target effects, as defined herein elsewhere. The Cas12 protein may also be part of a fusion protein, as defined herein elsewhere. The Cas12 protein may also be described to include a destabilization domain, as defined herein elsewhere. The Cas12 protein may also be a split Cas12 protein, as defined herein elsewhere. The Cas12 protein may also be an inducible Cas12 protein, as defined herein elsewhere. The Cas12 protein may also be part of a self-inactivating system (SIN), as defined herein elsewhere. The Cas12 protein may also be part of a synergistic activator system (SAM) as defined herein elsewhere.

Accordingly, in certain embodiments, the Cas12 polypeptide according to the disclosure as described herein is comprised in a fusion protein with a functional domain. In certain embodiments, said functional domain comprises a (transcriptional) activator domain, a (transcriptional) repressor domain, a recombinase, a transposase, a histone remodeler, a DNA methyltransferase, a cryptochrome, a light inducible/controllable domain, or a chemically inducible/controllable domain.

In certain embodiments, the Cas12 polypeptide according to the disclosure as described herein is not capable of inducing a DNA double strand break. In certain embodiments, the Cas12 polypeptide according to the disclosure as described herein is a nickase. In certain embodiments, the Cas12 polypeptide according to the disclosure as described herein is a catalytically inactive Cas12 polypeptide. In certain embodiments, the Cas12 polypeptide according to the disclosure as described herein is not capable of inducing a DNA single strand break. In an exemplary embodiment, the Cas12 protein is a catalytically inactive (dead) Cas12 protein. In an exemplary embodiment, the Cas12 protein is a nickase in which catalytic activity is partially lost.

In some embodiments, a vector encodes a Cas12 protein that lacks the ability to cleave one or both strands of a target polynucleotide containing a target sequence. In some embodiments, the Cas12 protein is considered to lack all DNA cleavage activity when the DNA cleavage activity of the enzyme is no more than about 25%, 10%, 5%, 1%, 0.1%, 0.01%, or less of the DNA cleavage activity. Thus, the Cas12 protein may be used as a generic DNA binding protein with or without fusion to a functional domain. In one aspect of the disclosure, the Cas12 enzyme may be fused to a protein, e.g., a TAG, and/or an inducible/controllable domain such as a chemically inducible/controllable domain. The Cas12 in the disclosure may be a chimeric Cas12 protein, e.g., a Cas12 having enhanced function by being a chimera. Chimeric Cas12 proteins may be new Cas proteins containing fragments from more than one naturally occurring Cas protein. In some embodiments, the chimeric Cas12 protein has enhanced on-target activity without higher off-target cutting, or is used for making super-cutting nickases, or is combined with a mutation that renders the Cas protein dead to provide a super binder.

The Cas12 enzyme provided in this disclosure can recognize a short motif in the vicinity of a target DNA called a Protospacer Adjacent Motif (PAM). In some embodiments, the Cas12 enzyme can recognize the canonical PAM comprising or consisting of 5'-TTTN-3' as well as non-canonical sequences, wherein N denotes any nucleotide. For example, the canonical PAM may be TTTA, TTTT, TTTG, or TTTC.

In some embodiments, the spacer sequence is between 10 and 40 nucleotides in length, preferably the spacer sequence is between 15 and 30 nucleotides in length, or between 18 and 25 nucleotides in length.
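The canonical PAM and spacer-length limits above can be combined into a simple target-site scan. The following is a minimal sketch, assuming that the protospacer lies immediately 3' of the PAM on the scanned strand (consistent with the PAM library design in Example 4, where the potential PAM precedes the protospacer); the function name and the flanking bases in the usage example are hypothetical, and only the given strand is scanned.

```python
import re


def find_tttn_targets(dna, spacer_len=20):
    """Return (pam_start, pam, protospacer) for each 5'-TTTN-3' site.

    Only the strand passed in is scanned; call again on the reverse
    complement to cover the opposite strand.
    """
    if not 10 <= spacer_len <= 40:
        raise ValueError("spacer_len must be between 10 and 40 nucleotides")
    dna = dna.upper()
    hits = []
    # A lookahead reports overlapping PAMs (e.g. within a run of T's).
    for match in re.finditer(r"(?=(TTT[ACGT]))", dna):
        start = match.start()
        protospacer = dna[start + 4:start + 4 + spacer_len]
        if len(protospacer) == spacer_len:
            hits.append((start, match.group(1), protospacer))
    return hits


# The protospacer of SEQ ID NO: 179 placed after a TTTA PAM; the flanking
# GG bases are arbitrary padding added for illustration.
site = "GG" + "TTTA" + "GAGAAGTCATTCAATAAGGCCAC" + "GG"
print(find_tttn_targets(site, spacer_len=23))
```

Non-canonical PAMs are not covered by this pattern and would need their own expression.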

In some embodiments, an mRNA or a DNA encodes the Cas12 protein.

In some embodiments, the polynucleotide encoding the Cas12 protein is operably linked to a promoter. In some embodiments, the promoter is a constitutive promoter, tissue-specific promoter or inducible promoter. In some embodiments, the polynucleotide encoding the Cas12 protein operably linked to a promoter is in a vector. In some embodiments, the vector is selected from the group consisting of a retroviral vector, a lentiviral vector, a phage vector, an adenoviral vector, an adeno-associated vector, and a herpes simplex vector.

In some embodiments, the targeting of the target sequence by the Cas12 protein and guide sequence results in a modification of the target sequence. In some embodiments, the modification of the target sequence is a cleavage event or a nicking event.

In another aspect, the disclosure provides a delivery system, wherein the system of any one of above is present in a vehicle selected from the group consisting of AAV (adeno-associated viruses), adenoviruses, retroviruses, HSV (herpes simplex virus), gammaretrovirus, LV (lentivirus), eCIS (extracellular Contractile Injection System), eVLP (engineered virus-like particles), VLP (virus-like particles), liposomes, plasmid, lipid nanoparticles (LNPs), exosomes, microvesicles, nucleic acid nanoassemblies, a gene gun, and/or an implantable device.

In another aspect, the disclosure provides the engineered, non-naturally occurring CRISPR-Cas system of any one of above, delivery system of above or cell of any one of above for use as a medicament. In another aspect, the disclosure provides the engineered, non-naturally occurring CRISPR-Cas system of any one of above, delivery system of above or cell of any one of above for use in a method of therapeutic treatment of a patient.

In another aspect, the disclosure provides a method of modifying or targeting a target DNA locus, the method comprising delivering to said locus a CRISPR-Cas system of any one of above or a delivery system of above.

In some embodiments, said modifying or targeting a target locus comprises inducing a DNA strand break. In some embodiments, said modifying or targeting a target locus comprises inducing a DNA double strand break or a DNA single strand break. In some embodiments, said modifying or targeting a target locus comprises altering gene expression of one or more genes. In some embodiments, said modifying or targeting a target locus comprises epigenetic modification of said target DNA locus. In some embodiments, the method is a method of modifying a cell, a cell line, or an organism by manipulation of one or more target sequences at genomic loci of interest.

In some embodiments, the cell is a eukaryotic cell or a prokaryotic cell. In some embodiments, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell. In some embodiments, the method is in vitro or in vivo.

In another aspect, the disclosure provides a method of targeting and cleaving a double-stranded target DNA, the method comprising: contacting the double-stranded target DNA with a system of any one of above.

In some embodiments, cleaving the target DNA or target sequence results in the formation of an indel or the insertion of a nucleotide sequence. In some embodiments, cleaving the target DNA or target sequence comprises cleaving the target DNA or target sequence at two sites, and results in the deletion or inversion of a sequence between the two sites.

The cleavage efficiency of the Cas12 protein on double-stranded DNA (dsDNA) is verified. In some embodiments, the cleavage ratio is 2%-100%. In various embodiments of the in vitro cleavage efficiency assay, the cleavage ratio is less than 5%, less than 10%, less than 15%, or less than 20%. In other embodiments of the in vitro cleavage efficiency assay, the cleavage ratio is more than 30%, more than 40%, more than 50%, more than 60%, more than 70%, more than 80%, or more than 90%. In some embodiments, the cleavage ratio is 50%-100% or 60%-100%. In some specific embodiments, the cleavage ratio is 60%-90%, 70%-90%, 80%-90%, 80%-95%, 85%-95%, or 85%-98%. As another example, in a specific embodiment, the cleavage ratio can be 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 12%, 15%, 18%, 20%, 25%, 30%, 35%, 40%, 50%, 55%, 58%, 60%, 65%, 70%, 72%, 73%, 75%, 78%, 80%, 82%, 85%, 87%, 88%, 90%, 92%, 95%, 97%, 98%, 99%, 100%, and so on.

In some embodiments, the test of the genome cleavage activity in mammalian cells shows that the gene editing efficiency of the Cas12 protein is 50%-95%. For example, in a specific embodiment, the gene editing efficiency can be 50%, 55%, 58%, 60%, 65%, 67%, 70%, 72%, 73%, 75%, 78%, 80%, 82%, 85%, 87%, 88%, 90%, 92%, 95%, and so on.

In some embodiments, the Cas12 protein shows fewer off-target effects. In some embodiments, no off-target effects are detected for some Cas12 proteins.

The programmability, specificity, and collateral activity of the Cas12 protein also make it an ideal switchable nuclease for non-specific cleavage of nucleic acids. In one embodiment, a Cas12 protein system is engineered to provide and take advantage of collateral non-specific cleavage of nucleic acids, such as ssDNA. In another embodiment, a Cas12 protein system is engineered to provide and take advantage of collateral non-specific cleavage of ssDNA. Accordingly, engineered Cas12 protein systems provide platforms for nucleic acid detection, transcriptome manipulation, and inducing cell death. Cas12 protein is developed for use as a mammalian transcript knockdown and binding tool. Cas12 protein is capable of robust collateral cleavage of RNA and ssDNA when activated by sequence-specific targeted DNA binding.

In certain embodiments, Cas12 protein is provided or expressed in an in vitro system or in a cell, transiently or stably, and targeted or triggered to non-specifically cleave cellular nucleic acids. In one embodiment, Cas12 protein is engineered to knock down ssDNA, for example viral ssDNA. In another embodiment, Cas12 protein is engineered to knock down RNA. The system can be devised such that the knockdown is dependent on a target DNA present in the cell or in vitro system, or triggered by the addition of a target sequence to the system or cell.

In an embodiment, the Cas12 protein system is engineered to non-specifically cleave RNA in a subset of cells distinguishable by the presence of an aberrant DNA sequence, for instance where cleavage of the aberrant DNA might be incomplete or ineffectual.

Collateral activity was recently leveraged for a highly sensitive and specific nucleic acid detection platform termed SHERLOCK that is useful for many clinical diagnoses (Gootenberg, J. S. et al. Nucleic acid detection with CRISPR-Cas13a/C2c2. Science 356, 438-442 (2017)).

According to the invention, engineered Cas12 protein systems are optimized for DNA or RNA endonuclease activity and can be expressed in mammalian cells and targeted to effectively knock down reporter molecules or transcripts in cells.

The collateral effect of engineered Cas12 protein combined with isothermal amplification provides a CRISPR-based diagnostic offering rapid DNA or RNA detection with high sensitivity and single-base mismatch specificity. The Cas12 protein-based molecular detection platform is used to detect specific strains of virus, distinguish pathogenic bacteria, genotype human DNA, and identify cell-free tumor DNA mutations. Furthermore, reaction reagents can be lyophilized for cold-chain independence and long-term storage, and readily reconstituted on paper for field applications.

The ability to rapidly detect nucleic acids with high sensitivity and single-base specificity on a portable platform may aid in disease diagnosis and monitoring, epidemiology, and general laboratory tasks. Although methods exist for detecting nucleic acids, they have trade-offs among sensitivity, specificity, simplicity, cost, and speed.

In another aspect, the disclosure provides a system for detecting the presence of a nucleic acid target sequence in an in vitro sample, comprising: a Cas12 protein of any one of above; at least one guide polynucleotide comprising a guide sequence capable of binding the target sequence, and designed to form a complex with the Cas12 protein; and a nucleic acid-based masking construct comprising a non-target sequence; and wherein the Cas12 protein exhibits collateral cleavage activity of RNA and/or ssDNA and cleaves the non-target sequence of the nucleic acid-based masking construct when activated by the target sequence.

In some embodiments, the system further comprises nucleic acid amplification reagents to amplify the target sequence. In some embodiments, the amplification reagents are isothermal amplification reagents. In some embodiments, the amplification reagents are reagents for nucleic acid sequence-based amplification (NASBA), recombinase polymerase amplification (RPA), loop-mediated isothermal amplification (LAMP), strand displacement amplification (SDA), helicase-dependent amplification (HDA), or nicking enzyme amplification reaction (NEAR).

In some embodiments, the target sequence is a target RNA sequence and the system further comprises a DNA polymerase and a primer designed to bind the target RNA sequence and further comprises a DNA polymerase promoter.

In another aspect, the disclosure provides a method for detecting target nucleic acids in samples comprising: contacting one or more samples with a Cas12 protein of any one of above; at least one guide polynucleotide comprising a guide sequence designed to have a degree of complementarity with the target sequence, and designed to form a complex with the Cas12 protein; and a nucleic acid-based masking construct comprising a non-target sequence, wherein the Cas12 protein exhibits collateral cleavage activity of RNA and/or ssDNA and cleaves the non-target sequence of the nucleic acid-based masking construct when activated by the target sequences; and detecting a signal from cleavage of the non-target sequence, thereby detecting the one or more target sequences in the sample.

In some embodiments, the method further comprises contacting the one or more samples with reagents for amplifying one or more target sequences. In some embodiments, the amplification reagents are isothermal amplification reagents. In some embodiments, the amplification reagents are reagents for nucleic acid sequence-based amplification (NASBA), recombinase polymerase amplification (RPA), loop-mediated isothermal amplification (LAMP), strand displacement amplification (SDA), helicase-dependent amplification (HDA), or nicking enzyme amplification reaction (NEAR). In some embodiments, the target sequence is a target RNA sequence and the system further comprises a DNA polymerase and a primer designed to bind the target RNA sequence and further comprises a DNA polymerase promoter. In some embodiments, the masking construct suppresses generation of a detectable positive signal until cleaved or deactivated, or masks a detectable positive signal, or generates a detectable negative signal until the masking construct is deactivated or cleaved. In some embodiments, the masking construct comprises:
a. a silencing RNA that suppresses generation of a gene product encoded by a reporting construct, wherein the gene product generates the detectable positive signal when expressed;
b. a ribozyme that generates the negative detectable signal, and wherein the positive detectable signal is generated when the ribozyme is deactivated;
c. a ribozyme that converts a substrate to a first color and wherein the substrate converts to a second color when the ribozyme is deactivated;
d. an aptamer and/or a polynucleotide-tethered inhibitor;
e. a polynucleotide to which a detectable ligand and a masking component are attached;
f. a nanoparticle held in aggregate by bridge molecules, wherein at least a portion of the bridge molecules comprises a polynucleotide, and wherein the solution undergoes a color shift when the nanoparticle is dispersed in solution;
g. a quantum dot or fluorophore linked to one or more quencher molecules by a linking molecule, wherein at least a portion of the linking molecule comprises a polynucleotide;
h. a polynucleotide in complex with an intercalating agent, wherein the intercalating agent changes absorbance upon cleavage of the polynucleotide; or
i. two fluorophores tethered by a polynucleotide that undergo a shift in fluorescence when released from the polynucleotide.

In some embodiments, the direct repeat sequence comprises a nucleotide sequence having at least 95% identity to any one of SEQ ID NOs: 74-77 or SEQ ID NOs: 79-83 or SEQ ID NOs: 86-104.

The sequence of the direct repeat sequence is shown in Table 4.

The predicted crRNA secondary structures corresponding to SEQ ID NOs: 74-77 (crRNA2-5) are shown in FIG.8. In FIG.8, N represents the target specific sequence and the number of N is just an example illustration which does not represent its actual nucleotide quantity.

The guide RNA secondary structures of the Cas12 proteins suggest that the Cas12 proteins could process and utilize each other’s crRNAs for DNA targeting. A “stem-loop structure” refers to a nucleic acid having a secondary structure that includes a region of nucleotides that are known or predicted to form a double strand (stem portion) that is linked on one side by a region of predominantly single-stranded nucleotides (loop portion). The terms “hairpin” and “fold-back” structures are also used herein to refer to stem-loop structures. Such structures are well known in the art and these terms are used consistently with their known meanings in the art. As is known in the art, a stem-loop structure does not require exact base-pairing. Thus, the stem may include one or more base mismatches. Alternatively, the base-pairing may be exact, i.e., not include any mismatches.

In some embodiments, the spacer sequence is between 15 and 35 nucleotides in length, e.g., between 20 and 30 nucleotides in length, or between 20 and 25 nucleotides in length. In some embodiments, the guide sequence hybridizes to one or more target sequences in a prokaryotic cell or in a eukaryotic cell. In some embodiments, the cell is a mammalian cell or a human cell or a plant cell. In some embodiments, the target sequence is DNA or RNA. In some embodiments, the targeting of the target sequence by the Cas12 protein and guide sequence results in a modification of the target sequence. In some embodiments, the modification of the target sequence is a cleavage event or a nicking event.

The following non-limiting examples are provided to further illustrate embodiments of the disclosure disclosed herein. It should be appreciated by those of skill in the art that the techniques disclosed in the examples that follow represent approaches that have been found to function well in the practice of the disclosure, and thus can be considered to constitute examples of modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments that are disclosed and still obtain a like or similar result without departing from the spirit and scope of the disclosure.

EXAMPLE 1: A method of metagenomic analysis for the proteins

Metagenomic sequence data from public databases were searched using Hidden Markov Models generated from known Cas protein sequences, including class II type V Cas effector proteins. CRISPR-Cas proteins identified by the search were aligned to known proteins to identify potential active sites. From hundreds of potential sequences, this metagenomic workflow resulted in the delineation of the Cas12 proteins described above and shown in FIGs.1-2.

The phylogenetic tree was constructed by IQTREE (FIGs.1-2) to visualize the relatedness of the orthologs at the primary amino-acid level using 162 Cas12a, Cas12b, Cas12c, Cas12d, Cas12e, Cas12f, Cas12g, Cas12i, Cas12j, Cas12k, Cas12L and TnpB sequences from The National Center for Biotechnology Information (NCBI), various publications, and patents. The branches of the tree corresponding to the Cas12 proteins provided by this disclosure are marked with a circle, while the reference nucleases (LbCpf1 and FnCpf1) are marked with a star.

Although phylogenetically more closely related to Cas12a than to other subtypes, the tree shows that the engineered Cas12 proteins studied here are representatives of unique Cas12 clusters. For example, as shown in FIG.1 and FIG.2, GEBx0112-0122 are more similar to one another and form a representative cluster; GEBx0127-0130 are more similar to one another and form a representative cluster; GEBx0139 and GEBx0140 are more similar to one another and form a representative cluster; GEBx0108, GEBx0126 and GEBx0137 are more similar to one another and form a representative cluster; GEBx0109, GEBx0131, GEBx0132, GEBx0142, GEBx0106 and GEBx0138 are more similar to one another and form a representative cluster; GEBx0133 and GEBx0134 are more similar to one another and form a representative cluster; GEBx0107, GEBx0141, GEBx0110, GEBx0134, GEBx0133, GEBx0111, GEBx0105 and GEBx0123 are more similar to one another and form a representative cluster.

In addition, the Cas12 proteins share less than 70% identity with existing Cas proteins, and some share less than 60% or even less than 50% identity with existing Cas proteins. These features suggest that the Cas12 proteins are independent of the existing Cas12a family.
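A percent-identity figure of the kind quoted above depends on how the alignment columns are counted. The following is a minimal sketch of one common convention, assuming the two sequences have already been aligned (gaps written as "-") and that identity is the number of matching residues divided by the number of columns in which at least one sequence has a residue. The disclosure does not state which convention or alignment tool was used, so the function name, the convention, and the toy sequences are illustrative assumptions only.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences carry a gap are skipped; a residue
    opposite a gap counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("alignment contains no residues")
    return 100.0 * matches / columns


# Toy alignment (not taken from the sequence listing): 4 matches in 6 columns.
identity = percent_identity("MKT-LV", "MRTALV")
print(round(identity, 1), identity < 70.0)
```

Other conventions (for example, dividing by the length of the shorter sequence) give different numbers for the same alignment, so thresholds such as 70% should be compared using a single, stated convention.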

The structure modeling of the Cas12 effectors/proteins in this disclosure was performed with AlphaFold, and the domain arrangement is shown in FIGs.3-7. FIGs.3-7 show that the Cas12 effector proteins have a C-terminal RuvC domain similar to that of other Cas12 nucleases. Further sequence analysis of GEBx00123 and GEBx00142 (FIG.9) and the other GEBx Cas12 proteins provided in this disclosure found no zinc finger domain in any of the GEBx Cas12 effectors. That is to say, the Cas12 proteins provided by this disclosure all lack a zinc finger domain.

The amino acid sequences of the Cas12 proteins are shown in Table 2. The amino acid sequences of LbCpf1 and FnCpf1 are SEQ ID NO: 177 and SEQ ID NO: 178, respectively.

EXAMPLE 2: Protocol for predicted crRNA folding

Predicted RNA folding of the active single crRNA sequence located at the CRISPR array of the Cas12 proteins was computed using the RNAfold webserver developed by Lorenz et al. (2011). The folded crRNA corresponding to the direct repeat sequence comprising a nucleotide sequence set forth in SEQ ID NOs: 73-78 is shown in FIG.8.

In FIG.8, N represents the target specific sequence and the number of N is just an example illustration which does not represent its actual nucleotide quantity.

EXAMPLE 3: Protein Expression and Purification

The complete amino acid sequences of the Cas12 proteins with nuclear localization signals (NLSs) and FLAG-tagged sequence are shown in SEQ ID NOs: 105-140, and some examples are shown in Table 5.

Table 5 The example complete amino acid sequences of Cas12 proteins

GKTEKKNGRKDNDGENRELERLRNEYLPIEINEIRKNKAYLRSSSNFSNEALVKYIDYYKERVKEYFNEID

FKFKETCEYKQFNEFAEDVNLQAYQISFIEVSKKYIKSLIDDNKIYLFKIYNKDFSKYSKGTPNLHTLYFKM

LFDKENLENPIYKLSGNAEMFFRKGNLDLDKTTIHHANQPINNKNPNNRKKQSVFKYDIIKNRRYTVDKF

ALHMSITTNFQVYENKNVNETVNRALKYCDDIYAIGIDRGERNLLYACVVNSRGEIVKQVPLNFVCNTD

YHQLLAKREEERMNSRKNWKIIDNIKNLKEGYLSQAIHIITDLMVEYNAVLVLEDLNFRFKEKRMKFEKS

VYQKFEKMLIDKLNFLVDKKLDKNANGGLFNAYQLTEKFTSFKDMKNQNGIVFYIPAWMTSKIDPVTGF

TNLFYIKYESIEKSKEFFGKFKSIKFNKVDNYFEFEFDYNDFTDRAQGTRSKWTVCSFGPRIEGFRNPEKN

NKWDSREIDITEKIKKLLDDYNISLDEDIQAQIMDINTKDFFEKLIKYFKLVLQMRNSKTGTDIDYIISPVRN

KQNEFFDSRKKNEKLPMDADANGAYNIARKGLMFIDIIKETEDKDLKMPKLFIKNKDWLNYVQKSDLKR

PAATKKAGQAKKKKDYKDDDDK


The DNA fragments encoding the Cas12 proteins, together with 3’ and 5’ nuclear localization signals (NLSs) and FLAG-tagged sequences, were synthesized by GenScript and assembled by Gibson assembly into pEASY-Blunt E2 expression plasmid. The nucleotides encoding the Cas12 proteins with NLSs and FLAG-tag are shown in SEQ ID NOs: 141-176. Some examples of the nucleotides are shown in Table 6.

Table 6 The example nucleotides encoding the Cas12 proteins with NLSs and FLAG-tag

The nucleotide sequences of the Cas12 proteins were synthesized commercially (e.g., by Ruibiotech).

Cas12 proteins were expressed as FLAG-tagged fusion proteins from an inducible T7 promoter (pEASY-Blunt E2 expression plasmid) in a protease-deficient E. coli B strain. Cells expressing the FLAG-tagged proteins were lysed by sonication. The supernatant was loaded on the Ni²⁺-charged HisTrap HP column (GE Healthcare) and eluted with a linear gradient of increasing imidazole concentration (from 0 to 500 mM) in 20 mM Tris-HCl, pH 7.5 at 25°C, 0.5 M NaCl buffer on an AKTA Pure25 FPLC (Inscinstech). The eluate was resolved by SDS-PAGE on BeyoGel Plus PAGE (Beyotime) and stained with Feto SDS-PAGE staining buffer (H&Z lifescience). Purity was determined using densitometry of the protein band with ImageLab software (Bio-Rad). Purified endonucleases were dialyzed into a storage buffer composed of 20 mM CH₃COONa, 500 mM NaCl, 0.1 mM EDTA, 0.1 mM TCEP, 50% glycerol; pH 6.0 and stored at -80°C.

EXAMPLE 4: PAM Sequence identification/confirmation for the endonucleases described herein.

A commercially available cell-free TXTL system developed from an all-Escherichia coli (E.coli) lysate (my TXTL, Arbor Biosciences) was used to rapidly express putative endonucleases from a plasmid (pEASY-Blunt E2) and targeting or non-targeting guide RNAs. PAM sequences were determined by sequencing plasmids containing randomly generated potential PAM sequences that could be cleaved by the nucleases. In this system, an E.coli codon-optimized nucleotide sequence encoding the nuclease was transcribed and translated in vitro from a PCR fragment under the control of a T7 promoter. A synthetic crRNA encoding the repeat- spacer sequence was added to the system. Successful expression of the endonuclease in the TXTL system followed by complex with crRNA provides active in vitro CRISPR nuclease complexes.

A library of target DNA fragments containing a protospacer sequence preceded by 8N mixed bases (potential PAM sequence) was incubated with the output of the TXTL reaction. After 1 hour of incubation, the reaction was stopped, and the DNA was recovered with a DNA clean-up kit. Adaptor sequences were blunt-end ligated to DNA with active PAM sequences that had been cleaved by the endonuclease, whereas DNA that had not been cleaved was inaccessible for ligation. DNA segments comprising active PAM sequences were then amplified by PCR with primers specific to the library and the adaptor sequence. The PCR amplification products were resolved on a gel to identify amplicons that correspond to cleavage events. The amplified segments of the cleavage reaction were also used as a template for preparation of an NGS library or as a substrate for Sanger sequencing. Sequencing this resulting library, which was a subset of the starting 8N library, revealed sequences with PAM activity compatible with the CRISPR complex. The PAM sequences were collected into seqLogo (see e.g., Huber et al. Nat Methods. 2015 Feb; 12(2): 115-21) representations. The seqLogo showed the 8 bp upstream of the spacer, labelled as positions 0-7. For PAM testing with a processed RNA construct, the same procedure was repeated except that an in vitro transcribed RNA was added along with the plasmid library and the minimal CRISPR array template was omitted.
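As an illustration of how recovered 8N sequences can be summarized before drawing a sequence logo, the following minimal Python sketch tabulates per-position base frequencies for positions 0-7. The function name and the four example reads are hypothetical and serve only to show the calculation; the sketch is not the seqLogo implementation cited above.

```python
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(pams, length=8):
    """Return per-position base frequencies for equal-length PAM candidates.

    Sequences of the wrong length or containing non-ACGT characters are skipped.
    """
    counts = [Counter() for _ in range(length)]
    used = 0
    for pam in pams:
        pam = pam.upper()
        if len(pam) != length or any(base not in BASES for base in pam):
            continue
        used += 1
        for position, base in enumerate(pam):
            counts[position][base] += 1
    if used == 0:
        raise ValueError("no valid PAM candidates")
    return [{base: counts[i][base] / used for base in BASES} for i in range(length)]

# Hypothetical reads for illustration only (not data from this disclosure).
reads = ["ACGTTTTA", "GGCATTTC", "TTACTTTG", "CAGTTTTC"]
pfm = position_frequency_matrix(reads)
```

In this made-up input, positions 4 to 6 are T in every read, so `pfm[4]["T"]`, `pfm[5]["T"]` and `pfm[6]["T"]` each equal 1.0, while position 7 remains mixed.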

EXAMPLE 5: In vitro cleavage efficiency

Target DNAs containing protospacer sequences (5’-gagaagtcattcaataaggccac-3’, SEQ ID NO: 179) and PAM sequences were constructed by DNA synthesis. A single representative PAM was chosen for testing when the PAM has degenerate bases. The target DNAs were comprised of 515 bp of linear DNA derived from a plasmid via PCR amplification with a PAM and protospacer located 700 bp from one end. Successful cleavage results in fragments of ~200 and ~300 bp. The target DNA, in vitro transcribed single RNA, and purified recombinant protein were combined in a cleavage buffer (NEBuffer 2.1) with an excess of protein and RNA and were incubated for 5 minutes to 3 hours, usually 1 hour. The reaction was stopped via addition of RNase A and incubation at 60 minutes. The reaction was then resolved on a 2% TAE agarose gel and the fraction of cleaved target DNA was quantified in ImageLab software.

The cleavage efficiency is represented by the cutting ratio, which is calculated from gray value analysis using the following formula: Cutting ratio (%) = 100 × (1 − sqrt(1 − (b + c)/(a + b + c))), where “a” represents the gray value of the uncut band, “b” and “c” respectively represent the gray values of the two shorter cleavage products, and “sqrt” denotes the square root. In this application, the cutting ratio can also be called the cleavage ratio.
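For illustration, the formula above, read as 100 × (1 − sqrt(1 − (b + c)/(a + b + c))), can be written as a short Python function. The function name and the example gray values are hypothetical; only the formula itself comes from the text.

```python
import math

def cutting_ratio(a, b, c):
    """Cutting ratio (%) from band gray values.

    a: gray value of the uncut band
    b, c: gray values of the two cleavage products
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("the sum of gray values must be positive")
    cleaved_fraction = (b + c) / total
    return 100 * (1 - math.sqrt(1 - cleaved_fraction))

# Hypothetical gray values: 75% of the signal is in the two cleaved bands.
example = cutting_ratio(25, 40, 35)
```

With these made-up values the cleaved fraction is 0.75, so the cutting ratio evaluates to 50%. A lane with no cleaved bands gives 0% and a lane with no uncut band gives 100%.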

EXAMPLE 6: GFP reporter assay

GFP reporter plasmids (pmax-EGFP) containing a target DNA sequence with spacer sequences and potential PAM sequences (determined e.g., as in Example 4) were constructed by DNA synthesis and cloning. A single representative PAM was chosen for testing when the PAM has degenerate bases. The target site was located in the EF1a promoter region, which drives GFP expression. GFP fluorescence was measured with an Infinite M200 Plate Reader (Tecan) using excitation and emission wavelengths of 488 nm and 533 nm, respectively. The reactions were incubated for 16 hours at 29°C and the resulting fluorescence data were analyzed using endpoint and time-course analyses. The reported production of GFP was calculated using a linear standard calibration curve developed from recombinant GFP. For the plate reader used for our experiments, the raw fluorescence values were divided by the conversion factor 9212.61/pmol.
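The conversion described above amounts to a single division. A minimal Python sketch follows; the function and constant names are hypothetical, and the factor applies only to the plate reader and calibration mentioned in the text.

```python
# Conversion factor stated in the text: raw fluorescence units per pmol of GFP.
GFP_CONVERSION_FACTOR = 9212.61

def fluorescence_to_pmol(raw_fluorescence):
    """Convert a raw plate-reader fluorescence value to pmol of GFP."""
    return raw_fluorescence / GFP_CONVERSION_FACTOR
```

For example, a hypothetical raw reading of 46063.05 corresponds to about 5 pmol of GFP.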

EXAMPLE 7: Testing of Genome Cleavage Activity of the CRISPR-Cas12 Complexes in Mammalian Cells

To show targeting and cleavage activity in mammalian cells, Cas12 proteins combined with the corresponding crRNA were co-transfected into HEK293T cells by RNP transfection using Lipofectamine™ RNAiMAX (Invitrogen). 72 hours after co-transfection, the genomic DNA was extracted and used for the preparation of an NGS library. The percentage of indels mediated by non-homologous end joining (NHEJ) was measured by sequencing the target site to demonstrate the targeting efficiency of the nuclease in mammalian cells. At least 10 different target sites were chosen to test each protein’s activity.

EXAMPLE 8: PAM determination in mammalian cell line

In a set of experiments, HEK293T cells were cultured in DMEM media supplemented with 10% fetal bovine serum (Gibco™). For reverse transfection, a volume of 450 µL of cells at a density of 100,000 cells/well was mixed with a 50 µL mixture containing Lipofectamine™ 3000 (ThermoFisher Scientific, Cat. L3000008), Opti-Mem (to bring the volume to 50 µL), 1 µL dsODN (10 µM), 100 ng (~1 µL) pgRNA plasmid (SEQ ID NO: 184 + SEQ ID NO: 187) harboring the Humanspacer3 spacer (SEQ ID NO: 187), and 400 ng (~1 µL) pCasX plasmid harboring the Cas12 protein CDS (with NLS and FLAG), per the manufacturer’s protocol. The cell mixture was then seeded onto a 24-well plate and cultured at 37°C and 5% CO2. The 10 µM dsODN was annealed from the dsODN-Top and dsODN-BoT oligonucleotides before transfection.

72 hours post-transfection, the supernatant was removed and the cell layer was washed with PBS. The genomic DNA was then extracted from each well of the 24-well plate using DNA Extraction solution (Denogen (Beijing) Bio Sci & Tech Co. Ltd, Cat. DNS033-48) per the manufacturer’s protocol. All DNA samples (500 ng, 260/280 value: 1.8-2.0) were subjected to Guide-Seq NGS analyses.

The basic method of Guide-Seq library preparation is described by Nikolay et al. (Nat. Protoc. 2021). The extracted DNA samples were first sheared using the KAPA Frag Kit (Cat# KK8602, Roche). Fragmented DNA was purified and then phosphorylated using T4 Polynucleotide Kinase (Cat#M0201S, NEB). An SS5-adapter (generated by annealing 10 µM SS5TOP oligo with 10 µM SS5BTM oligo) was ligated to the fragmented DNA using the Quick Ligation™ Kit (Cat#M2200S, NEB), followed by two-step off-target PCR to add chemistry for sequencing.

Off-target PCR1 was performed using Platinum™ Taq DNA Polymerase (Cat#15966005, Invitrogen) with GSP1 (a mixture of GSP1-Top and GSP1-BoT) and Y_XX oligos. Off-target PCR2 was performed using Platinum™ Taq DNA Polymerase with GSP2 (a mixture of GSP2-TopA/B/C and GSP1-BoTA/B/C), Y_XX (the same as in PCR1) and i753_XX oligos. The DNA product of each step described above was purified using SPRI Select (Cat#B23318, Beckman Coulter). The final library was quantified by qPCR and sequenced on an Illumina NextSeq 1000. The reads were aligned to a reference genome after eliminating those having low quality scores. The Q30 rate was more than 0.9. The read length was between 130 bp and 140 bp. The resulting files containing the reads were mapped to the reference genome (BAM files), where reads that overlapped the target region of interest were selected. The sequences used in these examples are shown in Table 7.

Table 7 The nucleotide sequences referred above

Note: p: phosphorylation modification; *: phosphorothioate (PS) bond (phosphorothioate linkage); “N” may be any natural or non-natural nucleotide.

In this example, the PAM preferences of Cas12 proteins including GEBx0111, GEBx0118, GEBx0119, GEBx0120 and GEBx0142 were tested. The nucleic acid sequences (human codon-optimized sequences) with NLS and FLAG of GEBx0111, GEBx0118, GEBx0119 and GEBx0120 are shown in Table 8. The nucleic acid sequences with NLS and FLAG of GEBx0123 and GEBx0142 are shown in Table 9.

Table 8 The nucleic acid sequences of Cas12 proteins

The PAM preferences of GEBx0111, GEBx0118, GEBx0119, GEBx0120, GEBx0123 and GEBx0142 in the HEK293 cell line are shown in FIG.13 and FIG.17.

EXAMPLE 9: In vitro gene editing efficiency assay

1. pgRNA and pCasX plasmid Transfection

In a set of experiments, HEK293T cells were cultured in DMEM media supplemented with 10% fetal bovine serum (Gibco™).

For forward transfection, cells were counted and plated in a volume of 450 µL at a density of 150,000 cells/well in a 24-well plate 24 hours prior to transfection. Cells were co-transfected with a total volume of 50 µL of lipoplex mixture containing pCasX plasmid (~1 µL, 400 ng), pgRNA plasmid (~1 µL, 100 ng), Lipofectamine™ 2000 (1 µL, Thermo Fisher Scientific, Cat. 11668019) and Opti-Mem, and the cells were then cultured at 37°C and 5% CO2.

The pgRNA plasmid and pCasX plasmid are shown in FIG.10. In the pCasX plasmid, the nucleotide sequences encoding GEBx0123 and GEBx0142 are codon-optimized for expression in mammalian cells. The nucleotide sequences encoding GEBx0123 and GEBx0142 are set forth in SEQ ID NOs: 180-181. These nucleotide sequences further comprise 3’ and 5’ nuclear localization signals (NLSs) and FLAG-tag sequences, and the resulting nucleotide sequences are shown in Table 9 (SEQ ID NOs: 182-183). In the pgRNA plasmid, the crRNA portion includes the direct repeat sequence and the spacer sequence (MYOD1 and FANCF, SEQ ID NOs: 185-186, selected based on the PAM 5’-TTTC-3’), which are shown in Table 10. GEBx0123 and GEBx0142 share the same direct repeat (DR) sequence, which is set forth in SEQ ID NO: 184.

Table 9 The nucleotide sequences

Table 10 The direct repeat sequence and the spacer sequences

2. Genomic DNA isolation

72 hours post-transfection, the supernatant was removed and the cell layer was washed with PBS. The genomic DNA was then extracted from each well of the 24-well plate using DNA Extraction solution (Denogen (Beijing) Bio Sci & Tech Co. Ltd, Cat. DNS033-48) per the manufacturer’s protocol. All DNA samples (500 ng, 260/280 value: 1.8-2.0) were subjected to amplicon NGS analyses.

3. Next-generation sequencing (NGS) analysis

To quantitatively determine the efficiency of editing at the target location in the genome, NGS was used to identify the presence of insertions and deletions introduced by gene editing. NGS primers flanking the target region within the MYOD1/FANCF genes were designed. Additional PCR was performed per the manufacturer’s protocols (Illumina) to add chemistry for sequencing. The amplicons were sequenced on an Illumina iSeq 100 instrument. The reads were aligned to a reference genome after eliminating those having low quality scores. The Q30 rate was more than 0.9. The read length was between 130 bp and 140 bp. The resulting files containing the reads were mapped to the reference genome (BAM files), where reads that overlapped the target region of interest were selected and the number of wild-type reads versus the number of reads containing an insertion, substitution, or deletion was calculated. The number of reads mapped to the reference genome was more than 1000.

The editing efficiency (e.g., the “editing percentage” or “percent editing” or “indel frequency”) is defined as the total number of sequence reads with insertions/deletions (“indels”) or substitutions over the total number of sequence reads, including wild type.
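As a worked illustration of this definition, the following Python sketch computes the editing efficiency as a percentage from two read counts. The function name and the example counts are hypothetical and are not data from this disclosure.

```python
def editing_efficiency(edited_reads, wild_type_reads):
    """Editing efficiency (%) = edited reads / (edited + wild-type reads) x 100.

    edited_reads: reads containing an insertion, deletion, or substitution
    wild_type_reads: reads matching the unedited reference
    """
    total_reads = edited_reads + wild_type_reads
    if total_reads <= 0:
        raise ValueError("at least one read is required")
    return 100.0 * edited_reads / total_reads

# Hypothetical counts: 150 edited reads among 1200 reads in total.
example = editing_efficiency(150, 1050)
```

With these made-up counts, 150 of 1200 reads are edited, giving an editing efficiency of 12.5%.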

In one in vitro experiment, two new Cas12 proteins (GEBx0123 and GEBx0142) were tested on MYOD1 and FANCF targets in HEK293T cells with forward transfection. The results are shown in FIG.11 and FIG.12. FIG.11 demonstrates that GEBx0123 and GEBx0142 have a modest editing effect on MYOD1, with GEBx0123 showing a higher indel frequency than GEBx0142. FIG.12 shows that the indel frequency of GEBx0123 is far higher than that of GEBx0142.

EXAMPLE 10: Structure-guided engineering of the CRISPR-Cas12 for PAM expansion

In the context of genome editing, the requirement to recognize PAM reduces CRISPR targeting resolution and leaves some genome sites inaccessible to editing.

To expand the number of PAMs accessible to CRISPR enzymes, structure-guided engineering was performed to generate additional GEBx0142 variants. As shown in FIG.14 and FIG.15, 4 residues in the REC1 and WED II domains, located around the putative PAM binding site of GEBx0142, were mutated to obtain the GEBx0142 variant. The types of mutations are summarized in Table 11. The PAM determination assay for GEBx0142 variant 1 was performed as described in Example 9, and the related nucleic acid sequences (human codon-optimized sequences, with NLS and FLAG) of Cas12 GEBx0142 variant 1 are shown in Table 12. The resulting PAM is shown in FIG.16, demonstrating a substantial change at positions -1 to -5 compared with the wild-type GEBx0142. The amino acid sequences of the Cas12 GEBx0142 variant are shown in Table 13.

Table 11 Types of mutations in GEBx0142 variant

Table 12 The nucleic acid sequences of Cas12 GEBx0142 variant

Table 13 The amino acid sequences of Cas12 GEBx0142 variant
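Single amino acid substitutions of the kind recited in the claims are written in the form Q198R (wild-type residue, 1-based position, mutant residue). As a minimal sketch of how such notation can be applied to a protein sequence in silico, the Python function below checks the wild-type residue at each position before substituting it. The function name and the five-residue toy sequence are hypothetical; the toy sequence is not SEQ ID NO: 34.

```python
import re

# Pattern for substitutions such as "Q198R": wild type, 1-based position, mutant.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def apply_substitutions(sequence, substitutions):
    """Return a copy of `sequence` with the listed substitutions applied."""
    residues = list(sequence)
    for substitution in substitutions:
        match = SUBSTITUTION.match(substitution)
        if match is None:
            raise ValueError(f"unrecognized substitution: {substitution}")
        wild_type, mutant = match.group(1), match.group(3)
        index = int(match.group(2)) - 1  # convert to 0-based index
        if index < 0 or index >= len(residues):
            raise ValueError(f"position out of range: {substitution}")
        if residues[index] != wild_type:
            raise ValueError(f"wild-type residue mismatch: {substitution}")
        residues[index] = mutant
    return "".join(residues)

# Hypothetical toy sequence, used only to show the mechanics.
toy_sequence = "MQDKQ"
toy_variant = apply_substitutions(toy_sequence, ["Q2R", "D3R", "K4R", "Q5R"])
```

Here `toy_variant` is "MRRRR". Checking the wild-type residue guards against applying a substitution list to a sequence with different position numbering.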

EXAMPLE 11: Human cell genome editing efficiency in different targets with TTTG-PAM in HEK293T cell line

In a set of experiments, HEK293T cells were cultured in DMEM media supplemented with 10% fetal bovine serum (Gibco™). For lipoplex transfection, a volume of 200 µL of cells at a density of 50,000 cells/well was seeded 24 hours pre-transfection. Cells were transfected with a lipoplex containing Lipofectamine™ 3000 (0.4 µL/well), P3000 (2 µL/well), pgRNA and pCasX plasmids (125 ng/well and 375 ng/well, respectively) and Opti-Mem up to 25 µL/well per the manufacturer's protocol. Plated cells were allowed to settle and adhere for 72 hours in a tissue culture incubator at 37°C and 5% CO2 atmosphere. In the pCasX plasmid, the nucleotide sequences encoding GEBx0123 and GEBx0142 are codon-optimized for expression in mammalian cells. The nucleotide sequences encoding GEBx0123 and GEBx0142 are set forth in SEQ ID NOs: 180-181. These nucleotide sequences further comprise 3’ and 5’ nuclear localization signals (NLSs) and FLAG-tag sequences, and the resulting nucleotide sequences are shown in Table 9 (SEQ ID NOs: 182-183). For the pgRNA plasmid, the corresponding crRNA sequences are shown in Table 15. The direct repeat sequence and the spacer sequences of the crRNA are shown in Table 14.

72 hours post-transfection, the supernatant was removed and the cell layer was washed with PBS. The genomic DNA was then extracted from each well of a 48-well plate using DNA Extraction solution (Denogen (Beijing) Bio Sci & Tech Co. Ltd, Cat. DNS033-48) per the manufacturer’s protocol. All DNA samples (500 ng, 260/280 value: 1.8-2.0) were subjected to amplicon NGS analyses to quantitatively determine the efficiency of editing at the target location in the genome.

For NGS, 50 ng of total genomic DNA was used as input for two-step PCR using the KAPA HiFi HotStart ReadyMix Kit (Roche). First-step PCR (PCR 1) resulted in a ~200 bp product, followed by indexing PCR (PCR 2) yielding final fragments flanked by the Illumina sequencing barcodes for subsequent NextSeq or iSeq sequencing (Illumina, San Diego, CA, USA). PCR 1 reactions were carried out as follows: 98°C for 5 min, then 20 cycles of [98°C for 20 sec; 60°C for 20 sec; 72°C for 20 sec], followed by a final extension at 72°C for 3 min. The indexing PCR 2 reactions were carried out as follows: 98°C for 5 min, then 15 cycles of [98°C for 20 sec; 62°C for 20 sec; 72°C for 20 sec], followed by a final extension at 72°C for 3 min. PCR 2 products were purified with SPRI beads and quantified with the VAHTS Library Quantification Kit for Illumina (Vazyme, Cat. NQ101) on a StepOnePlus Real-time PCR system (Thermo Fisher Scientific). The amplicons were sequenced on an Illumina iSeq 100 or NextSeq instrument. The reads were aligned to a reference genome after eliminating those having low quality scores. The Q30 rate was more than 0.9. The read length was between 130 bp and 140 bp. The resulting files containing the reads were mapped to the reference genome (BAM files), where reads that overlapped the target region of interest were selected and the number of wild-type reads versus the number of reads containing an insertion, substitution, or deletion was calculated. The number of reads mapped to the reference genome was more than 1000.

For indel frequency determination, qualified reads were mapped to the reference amplicon sequence using CRISPResso2 with default parameters, and reads not spanning the corresponding spacer regions were filtered out. The remaining reads were then assessed for desired and undesired insertions and deletions occurring across the whole spacer region. Total editing frequency was calculated as: [count of reads with any insertions or deletions] divided by [count of total reads]. Out-of-frame frequency was calculated as: [count of reads with insertions or deletions of a length indivisible by 3] divided by [count of edited reads].
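The two frequencies can be illustrated with a short Python sketch. It treats each frequency as a fraction, that is, edited reads over total reads and out-of-frame reads over edited reads, consistent with the definition of editing efficiency given in Example 9. The function name and the input representation (one net indel length per qualified read) are assumptions made for illustration and do not reflect the output format of CRISPResso2.

```python
def indel_frequencies(net_indel_lengths):
    """Return (total editing frequency, out-of-frame frequency).

    net_indel_lengths: one integer per qualified read; 0 means unedited,
    a positive value is a net insertion and a negative value a net deletion.
    Simplification: a read whose insertions and deletions cancel to a net
    length of 0 is counted as unedited here.
    """
    total_reads = len(net_indel_lengths)
    if total_reads == 0:
        raise ValueError("at least one read is required")
    edited = [length for length in net_indel_lengths if length != 0]
    out_of_frame = [length for length in edited if length % 3 != 0]
    total_editing_frequency = len(edited) / total_reads
    out_of_frame_frequency = len(out_of_frame) / len(edited) if edited else 0.0
    return total_editing_frequency, out_of_frame_frequency

# Hypothetical reads: 6 unedited, indels of -1, +2, -3 and +6 bp.
total_freq, oof_freq = indel_frequencies([0, 0, 0, 0, -1, 2, -3, 6, 0, 0])
```

In this made-up example, 4 of 10 reads are edited (0.4), and 2 of those 4 edits (-1 and +2) have lengths indivisible by 3, giving an out-of-frame frequency of 0.5.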

FIG.18 and FIG.19 show the human cell genome editing efficiency of GEBx0123 and GEBx0142 at 23 additional loci. GEBx0123 shows the highest editing efficiency at the VEGFA-TTTG-T1 locus (14.92%, FIG.19); GEBx0142 shows the highest editing efficiency at the POLQ-TTTG-T3 locus (15.37%, FIG.19).

Table 14 The direct repeat sequence and the spacer sequences of the crRNA

Table 15 The crRNA sequences

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

EXAMPLE 12: Human cell genome editing efficiency using single plasmid

In a set of experiments, HEK293T cells were cultured in DMEM media supplemented with 10% fetal bovine serum (Gibco™). For lipoplex transfection, a volume of 200 µL of cells at a density of 50,000 cells/well was seeded 24 hours pre-transfection. Cells were transfected with a lipoplex containing Lipofectamine™ 3000 (0.4 µL/well), P3000 (2 µL/well), pCasX-gRNA plasmid (0.5 µg/well) and Opti-Mem up to 25 µL/well per the manufacturer's protocol. Plated cells were allowed to settle and adhere for 72 hours in a tissue culture incubator at 37°C and 5% CO2 atmosphere. In the pCasX-gRNA plasmid, the nucleotide sequences encoding GEBx0123 and GEBx0142 are codon-optimized for expression in mammalian cells. The nucleotide sequences encoding GEBx0123 and GEBx0142 are set forth in SEQ ID NOs: 180-181. These nucleotide sequences further comprise 3’ and 5’ nuclear localization signals (NLSs) and FLAG-tag sequences, and the resulting nucleotide sequences are shown in Table 9 (SEQ ID NOs: 182-183). The corresponding crRNA sequences are shown in Table 15. The direct repeat sequence and the spacer sequences of the crRNA are shown in Table 14. A schematic of the pCasX-gRNA plasmid harboring the Cas nuclease CDS and guide RNA is shown in FIG. 20.

Claims

What is claimed is:

1. An engineered, non-naturally occurring Cas12 protein, wherein the Cas12 protein comprises an amino acid sequence selected from SEQ ID NOs: 1-36, a homologue thereof having at least 70% sequence identity to the amino acid sequence selected from SEQ ID NOs: 1-36, or a variant thereof; preferably, the Cas12 protein comprises an amino acid sequence having at least 75%, 80%, 85%, 90%, 92%, 95%, 96%, 97%, 98%, 99% or 100% sequence identity to any one of SEQ ID NOs: 1-36, 260.

2. The Cas12 protein of claim 1, wherein the variant comprises one or more mutations in the REC1 domain and/or the WED II domain of any one of SEQ ID NOs: 1-36.

3. The Cas12 protein of claim 1 or 2, wherein the variant comprises one or more mutations in the REC1 domain and/or the WED II domain of SEQ ID NO: 34.

4. The Cas12 protein of claim 3, wherein the variant comprises one or more mutations in the region of 180-200 and/or 560-620 with reference to amino acid position numbering of SEQ ID NO: 34; preferably, the variant comprises one or more mutations in the region of 190-200 and/or 570-610 with reference to amino acid position numbering of SEQ ID NO: 34; more preferably, the variant comprises one or more mutations in the region of 195-200 and/or 580-595 with reference to amino acid position numbering of SEQ ID NO: 34.

5. The Cas12 protein of any one of claims 1-4, wherein the variant comprises one or more mutations at the following positions: Q198, D584, K590, and/or Q593 of SEQ ID NO: 34; preferably, the variant comprises mutations at the following positions: Q198, D584, K590, and Q593 of SEQ ID NO: 34.

6. The Cas12 protein of any one of claims 2-5, wherein the mutation is a single amino acid substitution; preferably, the amino acid is mutated to a positively charged amino acid.

7. The Cas12 protein of claim 6, wherein the amino acid is mutated to R and/or K, preferably R; preferably, the variant comprises the following mutations: Q198R, D584R, K590R, and Q593R of SEQ ID NO: 34.

8. The Cas12 protein of any one of claims 1-7, wherein the variant recognizes a PAM sequence which is not recognized by SEQ ID NO: 34; preferably, the variant recognizes a PAM sequence which is not TTTN, wherein N is A, T, G or C.

9. The Cas12 protein of any one of claims 1-8, wherein the variant has nuclease activity; preferably, the variant has double-strand DNA cleavage activity or nickase activity.

10. The Cas12 protein of any one of claims 1-9, wherein the Cas12 protein further comprises one or more of a nuclear localization signal sequence, a cell penetrating peptide sequence, an affinity tag and/or a fusion deaminase protein; preferably, the Cas12 protein comprises an amino acid sequence having at least 70%, 75%, 80%, 85%, 90%, 92%, 95%, 96%, 97%, 98%, 99% or 100% sequence identity to any one of SEQ ID NOs: 105-140, 261.

11. An engineered, non-naturally occurring cell comprising the Cas12 protein of any one of claims 1-10.

12. The cell of claim 11, wherein the cell is a eukaryotic cell or a prokaryotic cell; preferably, the eukaryotic cell is selected from the group consisting of: a plant cell, a fungal cell, a single cell eukaryotic organism, a mammalian cell, a reptile cell, an insect cell, an avian cell, a fish cell, a parasite cell, an arthropod cell, a cell of an invertebrate, a cell of a vertebrate, a rodent cell, a mouse cell, a rat cell, a primate cell, a non-human primate cell, and a human cell.

13. A kit comprising the Cas12 protein of any one of claims 1-10.

14. An engineered, non-naturally occurring Cas12 polynucleotide encoding the Cas12 protein of any one of claims 1-10.

15. The polynucleotide of claim 14, wherein the polynucleotide is a ribonucleotide sequence or a deoxyribonucleotide sequence, or analogs thereof; preferably the polynucleotide is mRNA, and the polynucleotide further comprises a 5’ cap sequence and a poly-A tail sequence.

16. The polynucleotide of claim 15, wherein the polynucleotide is codon optimized for expression in a cell of interest; preferably, the polynucleotide is codon optimized for expression in a eukaryotic cell.

17. The polynucleotide of claim 16, wherein the polynucleotide has at least 90%, 92%, 95% or 98% sequence identity to any one of SEQ ID NOs: 180-183, 202-211.

18. The polynucleotide of claim 14, wherein the polynucleotide has at least 70% sequence identity to any one of the SEQ ID NOs: 37-72; preferably, the polynucleotide has at least 75%, 80%, 85%, 88%, 90%, 92%, 94%, 95%, 96%, 98%, 99% or 100% sequence identity to any one of the SEQ ID NOs: 37-72.

19. An engineered vector comprising the Cas12 polynucleotide of any one of claims 14-18.

20. The vector of claim 19, wherein the vector is an expression vector; preferably, the vector is an inducible, conditional, or constitutive expression vector.

21. A vector system comprising one or more vectors of claim 19 or 20.

22. The vector system of claim 21, wherein the one or more vectors comprise a polynucleotide according to any one of claims 14-18 and one or more polynucleotides encoding a guide RNA.

23. The vector system of claim 21 or 22, wherein the polynucleotide according to any one of claims 14-18 and one or more polynucleotides encoding a guide RNA are on a same vector or on different vectors; preferably on a same vector.

24. An engineered cell comprising the Cas12 polynucleotide of any one of claims 14-18, or comprising the vector of any one of claims 19-20, or comprising the vector system of any one of claims 21-23.

25. A reagent kit comprising the Cas12 protein of any one of claims 1-10, or the Cas12 polynucleotide of any one of claims 14-18, or the vector of claim 19 or 20, or the vector system of any one of claims 21-23.

26. A pharmaceutical composition comprising the Cas12 protein of any one of claims 1-10, or the Cas12 polynucleotide of any one of claims 14-18, or the vector of claim 19 or 20, or the vector system of any one of claims 21-23; preferably, the pharmaceutical composition further comprises a delivery system selected from: AAV (adeno-associated viruses), Adenoviruses, retroviruses, HSV (herpes simplex virus), Gammaretrovirus, LV (lentivirus), eCIS (extracellular Contractile Injection System), eVLPs (Engineered virus-like particles), VLPs (virus-like particles), liposomes, plasmids, LNPs (lipid nanoparticles), exosomes, microvesicles, nucleic acid nanoassemblies, a gene gun, and/or an implantable device.

27. An engineered, non-naturally occurring CRISPR-Cas system comprising: a) the Cas12 protein of any one of claims 1-10 or the polynucleotide of any one of claims 14-18; b) at least one engineered guide sequence or one or more engineered nucleic acids encoding the at least one engineered guide sequence, and the guide sequence comprises a direct repeat sequence capable of binding the Cas12 protein and a spacer sequence capable of hybridizing to a target.

28. The system of claim 27, wherein the target is selected from: double stranded DNA, double stranded RNA, single stranded DNA, single stranded RNA, genomic DNA, or extrachromosomal DNA.

29. The system of claim 27, wherein the direct repeat sequence comprises a stem-loop structure comprising a first stem nucleotide strand which comprises 4-7 nucleotides; a second stem nucleotide strand which comprises 4-7 nucleotides, wherein the first and second stem nucleotide strands can hybridize with each other; and a loop nucleotide strand arranged between the first and second stem nucleotide strands, wherein the loop nucleotide strand comprises 4 or 5 nucleotides.

30. The system of claim 27, wherein the direct repeat sequence comprises a nucleotide sequence having at least 90% identity to any one of SEQ ID NOs: 73-104, 212; preferably, the direct repeat sequence comprises a nucleotide sequence having at least 95% identity to any one of SEQ ID NOs: 74-77 or SEQ ID NOs: 79-83 or SEQ ID NOs: 86-104 or SEQ ID NO: 212.

31. The system of any one of claims 27-30, wherein the spacer sequence is between 10 and 40 nucleotides in length, preferably the spacer sequence is between 15 and 30 nucleotides in length, or between 18 and 25 nucleotides in length; preferably, the spacer sequence comprises a nucleotide sequence having at least 95% identity to any one of SEQ ID NOs: 213-236.

32. The system of any one of claims 27-31, wherein the system further comprises a donor template nucleic acid, and the donor template nucleic acid is a DNA, an RNA, or a DNA-RNA hybrid.

33. The system of any one of claims 27-31, wherein the targeting of the target by the Cas12 protein and guide sequence results in a modification of the target sequence; preferably, the modification of the target is a cleavage event or a nicking event.

34. The system of any one of claims 27-33, wherein the target locus is selected from MYOD1, FANCF, CD34, CFTR, DNMT1, EMX1, HBB, LPA, POLQ, RNF2, TTR or VEGFA; preferably, the target locus is POLQ or VEGFA.

35. A method of modifying or targeting a target DNA locus, the method comprising delivering to the locus the Cas12 protein of any one of claims 1-10, the polynucleotide of any one of claims 14-18 or the CRISPR-Cas system of any one of claims 24-34.

36. The method of claim 35, wherein the target locus is selected from MYOD1, FANCF, CD34, CFTR, DNMT1, EMX1, HBB, LPA, POLQ, RNF2, TTR or VEGFA; preferably, the target locus is POLQ or VEGFA.

37. The method of claim 35 or 36, wherein the modifying or targeting the target locus comprises inducing a DNA strand break.

38. The method of claim 35 or 36, wherein the modifying or targeting the target locus comprises inducing a DNA double strand break or a DNA single strand break.

39. The method of claim 35 or 36, wherein the modifying or targeting the target locus comprises altering gene expression of one or more genes.

40. The method of claim 35 or 36, wherein the modifying or targeting the target locus comprises epigenetic modification of said target DNA locus.

41. An isolated eukaryotic cell comprising a modified target locus of interest, wherein the target locus of interest has been modified according to a method, or via use of the Cas12 protein, polynucleotide, vector, vector system, pharmaceutical composition or the CRISPR-Cas system of any one of the preceding claims.

42. The Cas12 protein of any one of claims 1-10, or the polynucleotide of any one of claims 14-18 for use in gene editing.

43. The Cas12 protein or polynucleotide of claim 42, wherein the gene editing results in an editing event at the target locus; preferably the target locus is selected from MYOD1, FANCF, CD34, CFTR, DNMT1, EMX1, HBB, LPA, POLQ, RNF2, TTR or VEGFA.

44. The Cas12 protein or polynucleotide of claim 42, wherein the target locus is POLQ or VEGFA.