WO2025050118A1 - Computational exploration of the global microbiome for antibiotic discovery - Google Patents

Publication number
WO2025050118A1
Authority
WO
WIPO (PCT)
Prior art keywords
amps
peptides
candidate
genomes
ampsphere
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/045022
Other languages
French (fr)
Inventor
César de la Fuente-Nunez
Marcelo DER TOROSSIAN TORRES
Luis PEDRO FRAGAO BENTO COELHO
Célio DIAS SANTOS JÚNIOR
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
University of Pennsylvania Penn
Original Assignee
Fudan University
University of Pennsylvania Penn
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University, University of Pennsylvania Penn
Publication of WO2025050118A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis

Definitions

  • Antibiotic-resistant infections are becoming increasingly difficult to treat with conventional therapies 1 . Indeed, such infections currently kill 1.27 million people per year 2 . Therefore, there is an urgent need for novel methods for antibiotic discovery.
  • Computational approaches have been recently developed to accelerate our ability to identify novel antibiotics, including antimicrobial peptides (AMPs) 3 9 .
  • Proteome mining approaches have even been developed to identify antimicrobial agents in extinct organisms in an attempt to further expand our repertoire of known antimicrobials 10.
  • AMPs, found in all domains of life 11 14, are short sequences (here operationally defined as 10-100 amino acid residues 15) capable of disturbing microbial growth 12,15. AMPs most commonly interfere with cell wall integrity and cause cell lysis 12 16. Natural AMPs can originate by proteolysis 4 17, by non-ribosomal synthesis 18, or they can be encoded within the genome 19.
  • AMPs play an important role in modulating such microbial interactions and can displace competitor strains, facilitating cooperation 20 .
  • pathogens such as Shigella spp. 21 , Staphylococcus spp. 22 , Vibrio cholerae 13 , and Listeria spp. 24 ‘ 2 produce AMPs that eliminate competitors (sometimes from the same species), allowing them to occupy their niche.
  • AMPs hold promise as potential therapeutics and have already been used clinically as antiviral drugs (e.g., enfuvirtide and telaprevir 26).
  • AMPs that exhibit immunomodulatory properties are currently undergoing clinical trials 27 , as are AMPs that may be used to address yeast and bacterial infections 28 (e.g., pexiganan, LL-37, PAC-113).
  • Such AMPs are more targeted agents than conventional broad-spectrum antibiotics 30,31.
  • The evolution of resistance to many AMPs occurs at low rates and is not related to cross-resistance to other classes of widely used antibiotics 4,32,33.
  • FIG. 1A demonstrates how 63,410 publicly available metagenomes were assembled from diverse habitats.
  • A modified version of Prodigal 34, which can also predict smORFs (30-300 bp), was used to predict genes on the resulting metagenomic contigs as well as on 87,920 microbial genomes from ProGenomes2 43.
  • Macrel 42 was applied to the 4,599,187,424 predicted smORFs to obtain 863,498 non-redundant c_AMPs (see also Fig. 8).
  • c_AMPs were then hierarchically clustered in a reduced amino acid alphabet using 100%, 85%, and 75% identity cutoffs.
  • FIG. 1B shows that only 9% of c_AMPs have detectable homologs in other small protein databases (SmProt2 54, STsORFs 92), bioactive peptide databases (DRAMP 46 version 3.0, starPepDB 45k 93), and general protein datasets (GMGCv1 55); see also Fig. 9B. Also shown is the number of homologs in the AMPSphere in each database as well as the total. The number of homologs passing all quality tests, regardless of their experimental evidence of translation/transcription, is also shown along with the percentage it represents of the homologs identified.
  • FIG. 1C shows rarefaction curves illustrating how AMP discovery is impacted by sampling, with most of the habitats presenting steep sampling curves.
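A rarefaction curve of the kind described for FIG. 1C can be approximated by repeatedly drawing random subsets of samples and counting the distinct c_AMPs they contain. The following is a minimal sketch in Python, not the procedure used to generate the figure; the function name, the number of permutations, and the toy data are assumptions made for illustration.

```python
import random


def rarefaction_curve(samples, steps, permutations=20, seed=0):
    """Mean number of distinct c_AMPs observed after drawing n samples.

    samples: list of sets, each holding the c_AMP identifiers found in one sample.
    steps:   iterable of sample counts n at which the curve is evaluated.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    curve = []
    for n in steps:
        totals = []
        for _ in range(permutations):
            drawn = rng.sample(samples, n)   # n samples without replacement
            seen = set().union(*drawn)       # distinct c_AMPs across the draw
            totals.append(len(seen))
        curve.append(sum(totals) / permutations)
    return curve


# Hypothetical toy data: four samples and the c_AMPs detected in each.
toy_samples = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"a", "d", "e"}]
toy_curve = rarefaction_curve(toy_samples, steps=[1, 2, 3, 4])
```

A curve that is still rising steeply at the largest sample count suggests that additional sampling of that habitat would continue to yield new candidates.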
  • FIG. 1D illustrates how sharing of c_AMPs between habitats is limited. The width of ribbons represents the proportion of the shared c_AMPs in the habitat on the left. See also Figs. 9C-D.
  • FIG. 2A illustrates the number of AMPSphere candidates passing each of the tests proposed for quality.
  • FIG. 2B shows the number of AMP candidates predicted as AMP by AMP prediction systems beyond Macrel (AMPScanner v2 49, ampir 40 with the model for mature peptides, amPEPpy 50, APIN 51 with their proposed model, AI4AMP 52, and AMPLify 53). Only a small portion of AMPSphere (<2%) cannot be co-predicted by any system other than Macrel 42.
  • FIG. 3A shows the distribution of positions (as a percentage of the length of the larger protein) from which the AMP homologs start their alignment. About 7% of c_AMPs are homologous to proteins from GMGCv1 55, with approximately one-fourth of the hits sharing start positions with the larger protein.
  • FIG. 3B provides an illustrative example of an AMP homologous to a full-length protein: AMP10.271_016, which was recovered from three samples of human saliva from the same donor 94.
  • AMP10.271_016 is predicted to be produced by Prevotella jejuni, sharing the start codon (bolded) of an NAD(P)-dependent dehydrogenase gene (WP_089365220.1), the transcription of which was stopped by a mutation (in red; TGG > TGA).
  • FIG. 3C shows the distribution of AMPs per OG class (left) and their enrichment in comparison to full-length proteins from GMGCv1 35 (right). OGs were classified into subgroups according to the number of c_AMPs they were affiliated with.
  • FIG. 4A shows how c_AMPs in conserved genomic architectures tend to be closer to ribosomal machinery-related genes than families of proteins with different sizes (all lengths, and small proteins with <50 amino acids).
  • FIG. 4B shows how the proportion of c_AMPs in a genome context involving antibiotic resistance genes is lower than in other gene families.
  • FIG. 4C illustrates how the proportion of c_AMPs in neighborhoods with antibiotic synthesis-related genes is very small (<0.25%).
  • FIG. 4D illustrates the conserved genomic context of the gene encoding AMP10.015_426 in different genomes (the tree on the left depicts the phylogenetic relationship of the genes homologous to it). This c_AMP is homologous to the ribosomal protein rpsH, and is found in the context of rpsH and other ribosomal protein genes.
  • FIG. 5A shows the fractions of AMPs (or AMP families) that are accessory (present in <50% of genomes from the same species), shell (50-95%), or core (>95%).
  • FIG. 5B illustrates the lowest taxonomic level at which c_AMPs were annotated. In detail (right), the top 10 genera with the highest numbers of c_AMPs included in AMPSphere. Animal-associated genera (e.g., Prevotella, Faecalibacterium, CAG-110) contribute the most c_AMPs, possibly reflecting data sampling.
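The accessory, shell, and core categories used for FIG. 5A amount to thresholds on the prevalence of an AMP (or AMP family) among the genomes of one species. A minimal sketch of that classification follows; the function name and its arguments are hypothetical, and the handling of the exact boundary values (50% counted as shell, 95% counted as shell) is an assumption, since only the ranges themselves are stated.

```python
def classify_prevalence(n_genomes_with_amp, n_genomes_in_species):
    """Label an AMP (or AMP family) by its prevalence within one species."""
    prevalence = n_genomes_with_amp / n_genomes_in_species
    if prevalence > 0.95:
        return "core"        # present in more than 95% of genomes
    if prevalence >= 0.50:
        return "shell"       # present in 50-95% of genomes
    return "accessory"       # present in fewer than 50% of genomes


# Example: an AMP found in 12 of 40 genomes of a species.
label = classify_prevalence(12, 40)
```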
  • FIG. 6A provides the amino acid frequency in c_AMPs from AMPSphere, AMPs from databases (DRAMP 46 version 3, APD3 71, and DBAASP 70), and encrypted peptides 4 (EPs) from the human proteome.
  • FIG. 6B is a heat map with the percentage of secondary structure found for each peptide in three different solvents: water, 60% trifluoroethanol (TFE) in water, and 50% methanol (MeOH) in water. Secondary structure was calculated using BeStSel server 95 .
  • FIG. 6C illustrates the activity of c_AMPs assessed against ESKAPEE pathogens and human gut commensal strains.
  • FIG. 6D illustrates fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN) that indicate outer membrane permeabilization of A. baumannii ATCC 19606 cells.
  • FIG. 6E provides fluorescence values relative to PMB (positive control) of 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)], a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of A. baumannii ATCC 19606 cells. Depolarization of the cytoplasmic membrane occurred with slow kinetics compared to the permeabilization of the outer membrane and took approximately 20 min to stabilize.
  • FIG. 7A is a schematic of the skin abscess mouse model used to assess the anti-infective activity of the peptides against A. baumannii cells.
  • FIG. 7C shows mouse weight, which was monitored throughout the experiment to rule out toxic effects of the peptides.
  • Features on the violin plots represent median and upper and lower quartiles. Data are the mean ± the standard deviation.
  • FIG. 8 provides density curves; arbitrary density units are not shown, as all curves are independently normalized so that the area under the curve is one. For each dataset and feature, the top 1% and bottom 1% of values were considered outliers and are not shown in the plot. Proportions of residues with small side chains [A, C, D, G, N, P, S, T, V] per c_AMP, along with the proportions of basic residues [H, R, K] per c_AMP, are also shown. The distributions of each feature were compared among the datasets using the Mann-Whitney test, with multiple hypothesis testing corrected using Holm-Sidak. Almost all differences are significant (adjusted p-value < 0.05).
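For readers unfamiliar with the Holm-Sidak correction named above, the following sketch shows the standard step-down adjustment applied to a list of raw p-values (such as those returned by pairwise Mann-Whitney tests). It is a generic textbook formulation written for illustration; the function name and the example p-values are hypothetical and are not taken from the analysis behind FIG. 8.

```python
def holm_sidak(p_values):
    """Holm-Sidak step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak adjustment over the (m - rank) hypotheses still in play.
        adj = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        running_max = max(running_max, adj)   # enforce monotonicity
        adjusted[idx] = min(running_max, 1.0)
    return adjusted


# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adj = holm_sidak(raw)
significant = [p < 0.05 for p in adj]
```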
  • FIG. 9A illustrates how quality assessment of AMPSphere revealed most of the peptides passed at least one of the tests.
  • The RNAcode test depends on gene diversity, which is very low for AMPSphere, and therefore yielded a low rate of positives among our candidates.
  • FIG. 9B shows how c_AMPs homologous to sequences in databases of validated bioactive peptides also showed a higher average quality.
  • FIG. 9C shows how the limited overlap of c_AMPs among habitats argues in favor of using habitat groups to gain resolution. Note that the group of habitats with the highest paired overlaps belongs to human body sites and samples from human guts and non-human mammalian guts. Only habitats with at least 100 samples are shown.
  • FIG. 9D shows the large proportion of rare genes in AMPSphere from different habitat groups, in which few genes are widely detected.
  • FIG. 10 illustrates how, to validate the clustering procedure using a reduced amino acid alphabet, samples of 1,000 peptides were randomly drawn from AMPSphere (excluding representative sequences) and aligned against their cluster representatives. Three different levels (I, II, and III) of clustering were tested. The E-values were computed per alignment and plotted against the corresponding alignment identity. The averaged proportion of significant alignments is shown in each graph.
  • FIG. 11A illustrates minimal inhibitory concentration values for polymyxin B, a peptide antibiotic, and levofloxacin against all the strains tested.
  • Polymyxin B and levofloxacin were used as positive controls in all antimicrobial assays.
  • The secondary structural tendency of the c_AMPs was analyzed using three different solvents.
  • FIG. 11B shows analysis in water.
  • FIG. 11C shows the analysis in a trifluoroethanol (TFE) and water mixture (3:2, V:V).
  • FIG. 11D shows the analysis in a methanol (MeOH) and water mixture (1:1, V:V).
  • FIG. 12A provides minimal inhibitory concentration values of the scrambled versions of five of the lead c_AMPs from AMPSphere tested against the same 11 pathogenic strains and eight gut commensal strains used to assess the activity of the c_AMPs.
  • The secondary structural tendency of the scrambled peptides was analyzed using three different solvents.
  • FIG. 12B shows analysis in water.
  • FIG. 12C shows the analysis in a TFE and water mixture (3:2, V:V).
  • FIG. 12D shows the analysis in a MeOH and water mixture (1:1, V:V).
  • The experiments were carried out in the same conditions as the ones used for the c_AMPs.
  • A Fourier transform filter was applied to minimize background effects.
  • FIG. 12E provides a heat map with the percentage of secondary structure found for each peptide in three different solvents: water, 60% TFE in water, and 50% MeOH in water. Secondary structure was calculated using BeStSel server 95 .
  • FIG. 13A shows fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN) that indicate outer membrane permeabilization of P. aeruginosa PAO1 cells.
  • FIG. 13B provides fluorescence values relative to PMB (positive control) of 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)], a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of P. aeruginosa PAO1 cells.
  • PMB polymyxin B
  • NPN 1-(N-phenylamino)naphthalene
  • FIG. 13C provides bacterial counts four days post infection; the c_AMPs were tested at their MIC in a single dose one hour after the establishment of the infection.
  • Statistical significance was determined using one-way ANOVA where all groups were compared to the untreated control group; P-values are shown for each of the groups.
  • FIG. 13D illustrates mouse weight throughout the experiment (mean ± the standard deviation).
  • Features on the violin plots represent median and upper and lower quartiles.
  • AMPs metagenomic-derived candidate antimicrobial peptides
  • methods for forming a database-assisted platform providing one or more functional and/or physicochemical features of respective metagenomic-derived candidate antimicrobial peptides (AMPs)
  • the methods comprising selecting one or more genomes or metagenomes for inclusion in the platform; using an NGS assembler to assemble reads in order to identify contigs from the genomes or metagenomes; from the identified contigs, predicting small open reading frames (smORFs); removing duplicate smORFs to yield non-redundant smORFs; and, predicting candidate AMPs from the non-redundant smORFs.
  • smORFs small open reading frames
  • Also disclosed are methods of treating a microbial infection in a subject comprising administering to the subject a therapeutically effective amount of an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform.
  • the present disclosure also provides methods comprising contacting a biofilm with an effective amount of an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform.
  • compositions comprising an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform and pharmaceutically acceptable carrier, diluent, or excipient.
  • the phrase “about 1-10” is understood to mean “about 1 to about 10”, as well as “about x”, wherein x refers to any value between 1 and 10.
  • all ranges are inclusive and combinable.
  • the recited range should be construed as including ranges “1 to 4”, “1 to 3”, “1-2”, “1-2 & 4-5”, “1-3 & 5”, and the like.
  • a list of alternatives is positively provided, such listing can be interpreted to mean that any of the alternatives may be excluded, e.g., by a negative limitation in the claims.
  • when a range of “1 to 5” is recited, the recited range may be construed as including situations whereby any of 1, 2, 3, 4, or 5 are negatively excluded; thus, a recitation of “1 to 5” may be construed as “1 and 3-5, but not 2”, or simply “wherein 2 is not included.”
  • AMPs antimicrobial peptides
  • AMPSphere provides insights into the evolutionary origins of peptides, including by duplication or gene truncation of longer sequences, and we observed that AMP production varies by habitat.
  • ML was used to predict and catalog AMPs from the global microbiome as currently represented in public databases.
  • AMPSphere: a collection of 863,498 non-redundant peptide sequences, encompassing candidate AMPs (c_AMPs) derived from (meta)genomic data.
  • c_AMPs candidate AMPs
  • the present analysis revealed that these c_AMPs were specific to particular habitats and were predominantly not core genes in the pangenome.
  • a database-assisted platform providing one or more functional and/or physicochemical features of respective metagenomic-derived candidate antimicrobial peptides (AMPs)
  • the methods comprising selecting one or more genomes or metagenomes for inclusion in the platform; using an NGS assembler to assemble reads in order to identify contigs from the genomes or metagenomes; from the identified contigs, predicting small open reading frames (smORFs); removing duplicate smORFs to yield non-redundant smORFs; and, predicting candidate AMPs from the non-redundant smORFs.
  • the selection of the one or more genomes or metagenomes is according to criteria (i) whereby the genome or metagenome is tagged with taxonomy ID 408169 (for metagenome) or is a descendent of it in a taxonomic tree, (ii) whereby experiments with the genome or metagenome are listed as “METAGENOMIC”, or both (i) and (ii).
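Selection criteria (i) and (ii) can be expressed as a simple predicate over a record's taxonomy ID and its declared library source. The sketch below is an illustration only: the record fields (`taxid`, `library_source`), the `parent_of` mapping, and the two child taxonomy IDs in the example are hypothetical placeholders rather than the schema of any particular sequence archive.

```python
METAGENOME_TAXID = 408169  # taxonomy ID named in criterion (i)


def is_descendant(taxid, parent_of, root=METAGENOME_TAXID):
    """True if taxid equals root or reaches it by following parent links."""
    seen = set()
    while taxid is not None and taxid not in seen:
        if taxid == root:
            return True
        seen.add(taxid)               # guard against cycles in malformed input
        taxid = parent_of.get(taxid)
    return False


def select_record(record, parent_of):
    """Apply criterion (i), criterion (ii), or both."""
    by_taxonomy = is_descendant(record["taxid"], parent_of)
    by_source = record.get("library_source") == "METAGENOMIC"
    return by_taxonomy or by_source


# Hypothetical parent links: 900002 -> 900001 -> 408169.
toy_parents = {900002: 900001, 900001: 408169}
keep = select_record({"taxid": 900002, "library_source": "GENOMIC"}, toy_parents)
```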
  • Metadata is curated from the selected one or more genomes or metagenomes to create groups based on similarity of habitat conditions.
  • Habitat conditions include one or more of air, anthropogenic, aquatic, host-associated, alkaline pH, sediment, or terrestrial.
  • the selection of the one or more genomes or metagenomes can alternatively or additionally include assessing sample origin or other information relating to host species using an NCBI taxonomic identification number.
  • the present methods may further comprise processing the assembled reads by trimming positions with a quality lower than a desired number, and discarding reads shorter than a specified number of base pairs, post trimming.
  • the quality at which a given position is trimmed may be selected to be, for example, lower than 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, or 10.
  • the number of base pairs, post-trimming, that represent a read to be discarded may be about 40 to about 100, such as about 40-90, 45-80, 50-75, 50-70, or 55-65, such as about 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100.
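One way to realize the trimming and length filter described above is to keep, for each read, the longest stretch in which every position meets the quality threshold, and then discard reads whose trimmed length falls below the minimum. This is a sketch under that assumed interpretation of "trimming"; the function names are hypothetical, and the defaults of 25 (quality) and 60 (base pairs) are simply two of the values listed above, chosen for illustration.

```python
def longest_good_window(qualities, min_quality):
    """Return (start, end) of the longest run with all qualities >= min_quality."""
    best_start, best_len = 0, 0
    start = 0
    for i, q in enumerate(qualities):
        if q < min_quality:
            start = i + 1                      # run is broken; restart after i
        elif i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    return best_start, best_start + best_len


def trim_and_filter(reads, min_quality=25, min_length=60):
    """reads: iterable of (sequence, per-base quality list) pairs."""
    kept = []
    for seq, quals in reads:
        lo, hi = longest_good_window(quals, min_quality)
        if hi - lo >= min_length:              # discard reads shorter than min_length
            kept.append((seq[lo:hi], quals[lo:hi]))
    return kept


# Toy reads with a short minimum length so the behavior is easy to follow.
toy_reads = [
    ("ACGTACGT", [30, 30, 10, 30, 30, 30, 30, 30]),
    ("ACGT", [30, 10, 30, 10]),
]
trimmed = trim_and_filter(toy_reads, min_quality=25, min_length=5)
```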
  • metagenomes obtained from a host-associated microbiome may be passed through a filtering of reads mapping to the host genome, when available.
  • the NGS assembler that is used to assemble reads in order to identify contigs from the genomes or metagenomes may be, for example, optimized for metagenomes.
  • the smORFs may be predicted from the identified contigs using prokaryotic gene recognition and translation initiation site identification.
  • the prokaryotic gene recognition and translation initiation site identification may be, for example, Prodigal (PROkaryotic DYnamic programming Gene-finding Algorithm - see Hyatt, D., et al., BMC Bioinformatics 11, 119 (2010)), or a modified version thereof.
  • the length of the smORFs that are predicted from the contigs may be, for example, from about 20 to about 500 bp, such as 20-450, 25-400, 30-350, or 30-300 bp, or about 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, or 500 bp.
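To make the smORF length window concrete, the sketch below scans both strands of a contig in all three frames for ATG-to-stop open reading frames of 30-300 bp and then removes duplicate peptides. It is a deliberately simplified stand-in and is not Prodigal: it ignores alternative start codons, ribosome binding sites, and Prodigal's scoring. All function names are hypothetical, and counting the stop codon toward the ORF length is an assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(orf):
    """Translate a coding sequence; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf), 3))


def find_smorfs(contig, min_bp=30, max_bp=300):
    """Peptides encoded by ATG..stop ORFs whose length (stop included) is in range."""
    contig = contig.upper()
    peptides = []
    for strand in (contig, reverse_complement(contig)):
        for frame in range(3):
            start = None
            for pos in range(frame, len(strand) - 2, 3):
                codon = strand[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos
                elif start is not None and codon in STOPS:
                    if min_bp <= pos + 3 - start <= max_bp:
                        peptides.append(translate(strand[start:pos]))
                    start = None
    return peptides


def nonredundant_peptides(contigs, min_bp=30, max_bp=300):
    """Remove duplicate smORF products while preserving first-seen order."""
    found = (p for c in contigs for p in find_smorfs(c, min_bp, max_bp))
    return list(dict.fromkeys(found))


toy_contig = "ATG" + "AAA" * 9 + "TAA"   # a 33 bp ORF encoding M followed by nine K
unique = nonredundant_peptides([toy_contig, toy_contig])
```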
  • the candidate AMPs are predicted from the non-redundant smORFs using metagenomic AMP classification and retrieval.
  • the candidate AMPs may be predicted from the non-redundant smORFs using Macrel.
  • singleton sequences may be removed from the predicted candidate AMPs.
  • singleton sequences are retained if they match a sequence from a data repository of antimicrobial peptides.
  • Matching a sequence from a data repository of antimicrobial peptides can mean having at least or about 65, 70, 75, 80, or 85% amino acid identity to a sequence from a data repository, and/or an E-value of at least about 10⁻⁵ relative to a sequence from a data repository.
  • the data repository may be, for example, the Data Repository of Antimicrobial Peptides (DRAMP) 46 .
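The singleton rule described above can be sketched as a filter over candidate occurrence counts and their best hit against a reference repository. The data structures and function name below are hypothetical. Two assumptions are made: the 75% identity threshold is one of the listed options, and the E-value criterion is read as an upper bound (hits no weaker than 10⁻⁵); both criteria are required here, although the text also allows either alone.

```python
def filter_singletons(occurrences, best_hits, min_identity=75.0, max_evalue=1e-5):
    """Keep non-singletons, plus singletons supported by a repository match.

    occurrences: dict mapping candidate ID -> number of times it was observed.
    best_hits:   dict mapping candidate ID -> {"identity": percent, "evalue": float}
                 for its best alignment to the repository (absent if no hit).
    """
    kept = []
    for amp_id, count in occurrences.items():
        if count > 1:
            kept.append(amp_id)               # not a singleton
            continue
        hit = best_hits.get(amp_id)
        if hit and hit["identity"] >= min_identity and hit["evalue"] <= max_evalue:
            kept.append(amp_id)               # singleton rescued by a match
    return kept


# Hypothetical candidates and alignment results.
toy_counts = {"amp1": 3, "amp2": 1, "amp3": 1, "amp4": 1}
toy_hits = {
    "amp2": {"identity": 82.0, "evalue": 1e-8},
    "amp3": {"identity": 60.0, "evalue": 1e-8},
}
retained = filter_singletons(toy_counts, toy_hits)
```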
  • DRAMP Data Repository of Antimicrobial Peptides
  • candidate AMPs originating from a genomic database may be assigned a taxonomy from the original genome.
  • candidate AMPs originating from a metagenome may be assigned a taxonomy predicted for the contig in which the candidate AMP was found.
  • the present methods may further comprise identifying potential structural configuration of a candidate AMP by using a secondary structure function from a calculation of a fraction of amino acids in the candidate AMP that tend to assume conformations of helix, turn, or sheet. This information may also be used to provide characterization of the candidate AMPs within the platform.
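The fraction-based secondary structure estimate described above can be sketched as counting residues that belong to helix-, turn-, or sheet-favoring groups. The residue groupings below are assumptions chosen for illustration and are not specified in the text; any preferred propensity scale could be substituted. Because a residue may appear in more than one group, the three fractions need not sum to one.

```python
# Illustrative residue groupings (assumptions, not taken from the source).
HELIX_RESIDUES = set("EMAL")
TURN_RESIDUES = set("NPGS")
SHEET_RESIDUES = set("VIYFWL")


def secondary_structure_fraction(peptide):
    """Return (helix, turn, sheet) fractions of residues in each group."""
    peptide = peptide.upper()
    n = len(peptide)
    if n == 0:
        return 0.0, 0.0, 0.0
    helix = sum(aa in HELIX_RESIDUES for aa in peptide) / n
    turn = sum(aa in TURN_RESIDUES for aa in peptide) / n
    sheet = sum(aa in SHEET_RESIDUES for aa in peptide) / n
    return helix, turn, sheet


# Hypothetical peptide: two helix-favoring and two turn-favoring residues.
fractions = secondary_structure_fraction("AAGG")
```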
  • the candidate AMPs may be hierarchically clustered using a reduced amino acid alphabet and an identity cutoff of a preselected percentage.
  • the reduced amino acid alphabet may include about 5, 6, 7, 8, 9, 10, 11, or 12 amino acids.
  • the identity cutoff may be about 70-100%, such as about 70, 75, 80, 85, 90, 95, or 100%.
  • a sequential cutoff may be employed, such as a sequential cutoff of 100, 95, and 90%, or 100, 90, and 85%, or 100, 90, and 80%, or 100, 85, and 75%.
  • Representative sequences of peptide clusters may be selected according to their length (selecting for the longest), with ties being broken by the alphabetical order.
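The two ideas above, projecting peptides onto a reduced amino acid alphabet and choosing the longest member of a cluster as its representative with alphabetical tie-breaking, can be sketched as follows. The eight-group alphabet shown is an assumption made for illustration (the text only requires about 5 to 12 letters), the function names are hypothetical, and the actual clustering at each identity cutoff would be performed by a dedicated clustering tool that is not reproduced here.

```python
# Illustrative 8-letter reduced alphabet (an assumption): each group is
# represented by its first residue.
GROUPS = ["LVMIC", "AG", "ST", "P", "FWY", "EDQN", "KR", "H"]
REDUCED = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(peptide):
    """Project a peptide onto the reduced alphabet; unknown residues become 'X'."""
    return "".join(REDUCED.get(aa, "X") for aa in peptide.upper())


def pick_representative(cluster):
    """Longest sequence wins; ties are broken by alphabetical order."""
    return min(cluster, key=lambda seq: (-len(seq), seq))


# Two peptides that differ in full alphabet but coincide once reduced.
same_after_reduction = reduce_sequence("KRLV") == reduce_sequence("RKIM")
representative = pick_representative(["KKLL", "KKLLA", "AKKLL"])
```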
  • the clustering procedure may be validated.
  • the present methods may further comprise synthesizing one or more of the identified candidate AMPs.
  • Selection of candidate AMPs for synthesis may be according to criteria for solubility, criteria for synthesis, or both.
  • criteria may be in accordance with those used in PepFun 145 .
  • the synthesized candidate AMPs may individually be tested for antimicrobial activity to determine minimal inhibitory concentration (MIC). For example, the broth microdilution method may be used to determine MIC.
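Reading an MIC from a broth microdilution series reduces to finding the lowest tested concentration at which no growth is observed. The sketch below assumes growth has already been scored as present or absent for each concentration; the function name, the concentration units, and the example series are hypothetical.

```python
def minimal_inhibitory_concentration(growth_by_concentration):
    """Lowest tested concentration with no observed growth, or None if all grew.

    growth_by_concentration: dict mapping concentration -> True if growth was seen.
    """
    inhibitory = [conc for conc, grew in growth_by_concentration.items() if not grew]
    return min(inhibitory) if inhibitory else None


# Hypothetical two-fold dilution series (arbitrary concentration units).
toy_series = {64: False, 32: False, 16: False, 8: True, 4: True, 2: True}
mic = minimal_inhibitory_concentration(toy_series)
```

A return value of `None` indicates that the peptide did not inhibit growth at any concentration tested, so the MIC lies above the highest concentration in the series.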
  • the synthesized candidate AMPs may be subjected to circular dichroism assays, as described more fully infra. Membrane permeability with respect to a particular candidate AMP may be analyzed, for example, using the 1-(N-phenylamino)naphthalene (NPN) uptake assay.
  • NPN 1-(N-phenylamino)naphthalene
  • The ability of a candidate peptide to depolarize the cytoplasmic membrane may be assessed, for example, by measuring the fluorescence of the membrane potential-sensitive dye 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)].
  • Also disclosed are methods of treating a microbial infection in a subject comprising administering to the subject a therapeutically effective amount of an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform.
  • the methods of treating a microbial infection comprise administering to the subject a therapeutically effective amount of an AMP according to any one or more of SEQ ID NOS: 1-100, provided below in Table 1.
  • compositions comprising an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform and pharmaceutically acceptable carrier, diluent, or excipient.
  • the compositions may comprise an AMP according to any one or more of SEQ ID NOS: 1-100.
  • two or more AMPs that have been identified according to any one of the disclosed methods for forming a database-assisted platform, such as two or more of SEQ ID NOS: 1-100, may be administered to the subject or included in the present compositions.
  • the phrase “therapeutically effective amount” refers to the amount of active agent (here, the antimicrobial peptide) that elicits the biological or medicinal response that is being sought in a tissue, system, animal, individual or human by a researcher, veterinarian, medical doctor or other clinician, which includes one or more of the following: (1) at least partially preventing the disease or condition or a symptom thereof; for example, preventing a disease, condition or disorder in an individual who may be predisposed to the disease, condition or disorder but does not yet experience or display the pathology or symptomatology of the disease;
  • inhibiting the disease or condition for example, inhibiting a disease, condition or disorder in an individual who is experiencing or displaying the pathology or symptomatology of the disease, condition or disorder (i.e., including arresting further development of the pathology and/or symptomatology); and
  • ameliorating the disease or condition for example, ameliorating a disease, condition or disorder in an individual who is experiencing or displaying the pathology or symptomatology of the disease, condition or disorder (i.e., including reversing the pathology and/or symptomatology).
  • compositions that are formulated for any type of administration.
  • the compositions may be formulated for administration orally, topically, parenterally, enterally, or by inhalation (e.g., intranasally).
  • the active agent may be formulated for neat administration, or in combination with conventional pharmaceutical carriers, diluents, or excipients, which may be liquid or solid.
  • the applicable solid carrier, diluent, or excipient may function as, among other things, a binder, disintegrant, filler, lubricant, glidant, compression aid, processing aid, color, sweetener, preservative, suspending/dispersing agent, tablet-disintegrating agent, encapsulating material, film former or coating, flavoring agent, or printing ink.
  • Any material used in preparing any dosage unit form is preferably pharmaceutically pure and substantially non-toxic in the amounts employed.
  • the active agent may be incorporated into sustained-release preparations and formulations.
  • Administration in this respect includes administration by, inter alia, the following routes: intravenous, intramuscular, subcutaneous, intraocular, intrasynovial, transepithelial including transdermal, ophthalmic, sublingual and buccal; topically including ophthalmic, dermal, ocular, rectal and nasal inhalation via insufflation, aerosol, and rectal systemic.
  • the carrier, diluent, or excipient may be a finely divided solid that is in admixture with the finely divided active ingredient.
  • the active ingredient is mixed with a carrier, diluent or excipient having the necessary compression properties in suitable proportions and compacted in the shape and size desired.
  • the active compound may be incorporated with the carrier, diluent, or excipient and used in the form of ingestible tablets, buccal tablets, troches, capsules, elixirs, suspensions, syrups, wafers, and the like.
  • the amount of active agent(s) in such therapeutically useful compositions is preferably such that a suitable dosage will be obtained.
  • Liquid carriers, diluents, or excipients may be used in preparing solutions, suspensions, emulsions, syrups, elixirs, and the like.
  • the active ingredient of this invention can be dissolved or suspended in a pharmaceutically acceptable liquid such as water, an organic solvent, a mixture of both, or pharmaceutically acceptable oils or fat.
  • the liquid carrier, excipient, or diluent can contain other suitable pharmaceutical additives such as solubilizers, emulsifiers, buffers, preservatives, sweeteners, flavoring agents, suspending agents, thickening agents, colors, viscosity regulators, stabilizers, or osmo-regulators.
  • Suitable solid carriers, diluents, and excipients may include, for example, calcium phosphate, silicon dioxide, magnesium stearate, talc, sugars, lactose, dextrin, starch, gelatin, cellulose, methyl cellulose, ethyl cellulose, sodium carboxymethyl cellulose, microcrystalline cellulose, polyvinylpyrrolidine, low melting waxes, ion exchange resins, croscarmellose carbon, acacia, pregelatinized starch, crospovidone, HPMC, povidone, titanium dioxide, polycrystalline cellulose, aluminum methahydroxide, agar-agar, tragacanth, or mixtures thereof.
  • liquid carriers for example, for oral, topical, or parenteral administration
  • Suitable examples of liquid carriers, diluents and excipients include water (particularly containing additives as above, e.g. cellulose derivatives, preferably sodium carboxymethyl cellulose solution), alcohols (including monohydric alcohols and polyhydric alcohols, e.g. glycols) and their derivatives, and oils (e.g. fractionated coconut oil and arachis oil), or mixtures thereof.
  • the carrier, diluent, or excipient can also be an oily ester such as ethyl oleate and isopropyl myristate.
  • sterile liquid carriers, diluents, or excipients which are used in sterile liquid form compositions for parenteral administration.
  • Solutions of the active agents can be prepared in water suitably mixed with a surfactant, such as hydroxypropylcellulose.
  • a dispersion can also be prepared in glycerol, liquid polyethylene glycols, and mixtures thereof and in oils. Under ordinary conditions of storage and use, these preparations may contain a preservative to prevent the growth of microorganisms.
  • the pharmaceutical forms suitable for injectable use include, for example, sterile aqueous solutions or dispersions and sterile powders for the extemporaneous preparation of sterile injectable solutions or dispersions.
  • the form is preferably sterile and fluid to provide easy syringability. It is preferably stable under the conditions of manufacture and storage and is preferably preserved against the contaminating action of microorganisms such as bacteria and fungi.
  • the carrier, diluent, or excipient may be a solvent or dispersion medium containing, for example, water, ethanol, polyol (for example, glycerol, propylene glycol, liquid polyethylene glycol and the like), suitable mixtures thereof, and vegetable oils.
  • the proper fluidity can be maintained, for example, by the use of a coating, such as lecithin, by the maintenance of the required particle size in the case of a dispersion, and by the use of surfactants.
  • surfactants, for example, sodium bicarbonate, sodium sulfate, sodium stearate, and gelatin.
  • Sterile injectable solutions may be prepared by incorporating the active agent in the pharmaceutically appropriate amounts, in the appropriate solvent, with various of the other ingredients enumerated above, as required, followed by filtered sterilization.
  • dispersions may be prepared by incorporating the sterilized active ingredient into a sterile vehicle which contains the basic dispersion medium and the required other ingredients from those enumerated above.
  • the preferred methods of preparation may include vacuum drying and freeze drying techniques that yield a powder of the active ingredient or ingredients, plus any additional desired ingredient from the previously sterile-filtered solution thereof.
  • an antimicrobial peptide may be administered in the present compositions and methods in an effective amount by any of the conventional techniques well-established in the medical field.
  • the administration may be in the amount of about 0.1 mg/day to about 500 mg per day.
  • the administration may be in the amount of about 250 mg/kg/day.
  • administration may be in the amount of about 0.1 mg/day, about 0.5 mg/day, about 1.0 mg/day, about 5 mg/day, about 10 mg/day, about 20 mg/day, about 50 mg/day, about 100 mg/day, about 200 mg/day, about 250 mg/day, about 300 mg/day, or about 500 mg/day.
  • Also disclosed are methods comprising contacting a biofilm with an effective amount of an AMP that has been identified according to a method for forming a database-assisted platform according to the present disclosure.
  • the AMP comprises one or more of SEQ ID NOS: 1-100.
  • Such methods may be effective to remove or reduce the presence of an unwanted biofilm, such as in hospitals or other medical settings, in sewer and filtration systems, in industrial settings, on equipment involved in food preparation or manufacture, in aquaculture or hydroponics, or in any other context that is prone to unwanted biofilm formation.
  • microbes against which the present antimicrobial peptides are effective may be, for example, any unicellular organism, such as gram-negative bacteria, gram-positive bacteria, protozoa, viruses, bacteriophages, and archaea.
  • the present peptides can have an antimicrobial effect with respect to any such microbe.
  • bacteria against which the present compounds are effective to cause reduction in numbers include gram positive bacteria and gram negative bacteria, for example, Salmonella enterica, Listeria monocytogenes, Escherichia coli, Clostridium botulinum, Clostridium difficile, Campylobacter, Bacillus cereus, Vibrio parahaemolyticus, Vibrio cholerae, Vibrio vulnificus, Staphylococcus aureus, Yersinia enterocolitica, Shigella, Moraxella spp., Helicobacter, Stenotrophomonas, Bdellovibrio, Legionella spp.
  • Salmonella enterica Listeria monocytogenes
  • Escherichia coli Clostridium botulinum
  • Clostridium difficile
  • Campylobacter Bacillus cereus
  • Vibrio parahaemolyticus Vibrio cholerae
  • Vibrio vulnificus
  • Neisseria gonorrhoeae Neisseria meningitidis
  • Haemophilus influenzae Acinetobacter baumannii
  • Klebsiella pneumoniae Pseudomonas aeruginosa
  • Proteus mirabilis Enterobacter cloacae
  • Enterococcus faecium Serratia marcescens
  • Helicobacter pylori Salmonella enteritidis
  • Salmonella typhi and combinations thereof.
  • Examples of Salmonella enterica serovars include Salmonella enteritidis, Salmonella typhimurium, Salmonella poona, Salmonella heidelberg, and Salmonella anatum.
  • Exemplary viruses against which the present peptides are effective to cause reduction in numbers include coronaviruses, rhinoviruses, and influenza viruses.
  • AMPSphere incorporates c_AMPs predicted with ML using Macrel 42 , a pipeline that uses random forests to predict AMPs from large peptide datasets with an emphasis on precision over recall. It was applied to 63,410 globally distributed publicly available metagenomes (Fig. 1A) and 87,920 high-quality bacterial and archaeal genomes 43 . Sequences present in a single sample were removed 42 , except when they had a significant match (defined as amino acid identity > 75% and E-value ≤ 10⁻⁵) to a sequence in the AMP-dedicated database Data Repository of Antimicrobial Peptides (DRAMP) version 3.0 46 .
  • DRAMP Data Repository of Antimicrobial Peptides
  • c_AMPs from AMPSphere present a positive charge (4.7 ± 2.6), high isoelectric point (10.9 ± 1.2), amphiphilicity (hydrophobic moment, 0.6 ± 0.1) and a potential to bind to membranes or other proteins (Boman index, 1.14 ± 1.1).
  • c_AMPs from AMPSphere are on average longer (37 ± 8 residues) than those in DRAMP 46 version 3.0 (28 ± 22 residues) and we observed differences in the distribution of other features (e.g., charge, aliphaticity, amphipathicity, and isoelectric point, Fig. 8).
  • AMPScanner v2 49 , the model for mature peptides in ampir 40 , amPEPpy 50 , APIN 51 , AI4AMP 52 , and AMPLify 53
  • 98.4% (849,703 peptides) of AMPSphere c_AMPs were also predicted as AMPs by at least one other AMP prediction system.
  • Approximately 15% (132,440 out of 863,498 peptides) of AMPSphere c_AMPs were co-predicted by all methods used.
  • peptides were hierarchically clustered using a reduced amino acid alphabet of 8 letters 56 .
  • the three sequence clustering levels adopted identity cutoffs of 100%, 85%, and 75% (Fig. 10).
  • At the 75% identity level, 521,760 protein clusters were obtained, of which 405,547 were singletons, corresponding to 47% of all c_AMPs from AMPSphere.
  • a total of 78,481 (19.3%) of these singletons were detected in metatranscriptomes or metaproteomes from various sources, indicating that they were not artifacts.
  • the AMPSphere spanned 72 different habitats, which were classified into eight high-level habitat groups, e.g., soil/plant (36.6% of c_AMPs in AMPSphere), aquatic (24.8%), and human gut (13%) (Fig. 1A). Most of the habitats, except for the human gut, appeared to be far from saturated in terms of newly discovered c_AMPs (Fig. 1C). In fact, most AMPs were rare (median number of detections is 99, or 0.17% of the dataset; when restricted to high-quality c_AMPs, the median number of detections is 81, or 0.14% of the dataset), with 83.97% being observed in < 1% of samples (Fig. 9).
  • AMPs are generated post-translationally by the fragmentation of larger proteins 17 .
  • encrypted peptides EPs
  • EPs are computationally detected fragments from protein sequences within the human proteome and other proteomes, which have been shown to be highly active 4,10 .
  • EPs present diverse secondary structures and act on the membrane of bacterial cells, similar to known natural AMPs, but have different physicochemical features compared to known AMPs 4,33 .
  • AMPSphere only considered peptides encoded by dedicated genes. Nonetheless, it was hypothesized that some of these have originated from larger proteins by fragmentation at the genomic level.
  • AMPSphere c_AMPs were aligned to the full-length proteins in GMGCv1 55 and it was observed that about 7% (61,020) of them are homologous to a canonical-length protein (Fig. 1B), with 27% of these hits sharing the start codon with the longer protein. This suggests early termination of full-length proteins as one mechanism for generating novel c_AMPs (Fig. 3A and 3B).
  • c_AMP genes may arise after gene duplication events. Next, the question was raised of whether c_AMPs would be predominantly present in specific genomic contexts. To investigate the functions of the neighboring genes of the c_AMPs, they were mapped against 169,484 genomes included in a recent study 58 . A total of 38.9% (21,465 out of 55,191) of c_AMPs with more than two homologs in different genomes in the database showed phylogenetically conserved genomic context with genes of known function (see infra, Methods - Genomic context conservation analysis).
  • c_AMPs were generally depleted from conserved genomic contexts involving known systems of antibiotic synthesis and resistance, even when compared to small protein families (Fig. 4). Instead, it was found that c_AMPs are encoded in conserved genomic contexts with ribosomal genes (23.6%) at a higher frequency than other gene families (4.75%, Fig. 4A).
  • Most of the c_AMPs (2,201 out of the 2,642) in a conserved context with ribosomal subunits were homologous to ribosomal proteins (Fig. 4D), congruent with the observation that, in some species, ribosomal proteins have antimicrobial properties 39 . Seventy-seven c_AMPs homologous to ribosomal proteins were also homologous to a ribosomal gene in their immediate vicinity (up to 1 gene up/downstream). This phenomenon is not exclusive to ribosomal proteins: 1,951 c_AMPs can be annotated to the same KEGG Orthologous Group (KO) as some of their immediate neighbors and may have originated from gene duplication events.
  • KO KEGG Orthologous Group
  • AMP10.018_194, the only c_AMP found in Mycoplasma pneumoniae genomes.
  • M. pneumoniae strains are traditionally classified into two groups based on their P1 adhesin gene 65 . Of the 76 M. pneumoniae genomes present in our study, 29 were classified as type-1, 29 as type-2, and the remaining 18 were undetermined in this classification system 66 (see Methods - Determination of accessory AMPs). Twenty-six of the 29 type-2 genomes contain AMP10.018_194, as did 2 undetermined type genomes, but none of the type-1 genomes contain this AMP.
  • More transmissible species have lower c_AMP density.
  • GTDB Genome Taxonomy Database
  • the genera contributing the most c_AMPs to AMPSphere were Prevotella (18,593 c_AMPs), Bradyrhizobium (11,846 c_AMPs), Pelagibacter (6,675 c_AMPs), Faecalibacterium (5,917 c_AMPs), and CAG-110 (5,254 c_AMPs) (see Fig. 5).
  • This distribution reflects the fact that these genera are among those that contribute the most assembled sequences in our dataset (all occupying percentiles above 99.75% among the assembled genera). Therefore, the c_AMP density (ρ_AMP) was calculated by determining the number of c_AMP genes per megabase pair of assembled sequence. To avoid bias due to the unequal sampling of habitats, all the sequences predicted by Macrel 42 were included in each sample, including singleton sequences that were subsequently removed and are not part of AMPSphere.
  • AMPSphere sequences displayed a slightly higher abundance of aliphatic amino acid residues, specifically alanine and valine. However, these AMPSphere sequences consistently differed (Fig.
  • AMPSphere was first filtered for peptides that were predicted as suitable for in vitro assays, namely solubility in aqueous solution and ease of chemical synthesis.
  • a set of high-quality AMPs with 50 peptide sequences was selected based on prevalence and taxonomic diversity (see Methods - Selection of peptides for synthesis and activity testing).
  • Samples were grouped by project and all projects with at least 20 samples were included for analysis. Additionally, metagenomes deposited by the Integrated Microbial Genomes System (IMG) missing from ENA were also included. Metadata was manually curated from each sample’s describing literature and Biosamples database 127 . For habitat classification groups were created based on the similarity of habitat conditions, such as air, anthropogenic, aquatic, host-associated, ph:alkaline, sediment, terrestrial, and others. The sample origins and information related to host species were obtained using the NCBI taxonomic identification number. High-quality microbial genomes were selected from ProGenomes2 database 43 .
  • IMG Integrated Microbial Genomes System
  • Reads trimming and assembly. Reads were processed using NGLess 96 , trimming positions with quality lower than 25 and discarding reads shorter than 60 bp post-trimming. Metagenomes obtained from a host-associated microbiome were filtered to remove reads mapping to the host genome when available. Reads totaling more than 14.7 trillion base pairs of sequenced DNA were assembled with MEGAHIT 1.2.9 112 and the taxonomy of the 16,969,685,977 contigs generated was inferred as previously described 131 , using MMSeqs2 99 to map the sequences against the GTDB release 95 67,68 . Mapped taxonomy lineages were then manually curated to conform to the International Code of Nomenclature of Prokaryotes 132,133 .
  • AMPSphere encompassed 863,498 non-redundant predicted c_AMPs encoded by 5,518,294 redundant genes.
  • AMP densities were estimated as the number of AMPs per assembled base pairs in a sample or a species.
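The density definition above can be illustrated with a minimal sketch. The function name and the example numbers are hypothetical and are not taken from the AMPSphere code; the per-megabase normalization follows the description of the c_AMP density given earlier in the text.

```python
def amp_density(n_amp_genes: int, assembled_bp: int) -> float:
    """Return the number of c_AMP genes per megabase pair (Mbp) of assembled sequence."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_amp_genes / (assembled_bp / 1_000_000)


# Hypothetical sample: 12 predicted c_AMP genes in 48 Mbp of assembled contigs.
print(amp_density(12, 48_000_000))  # 0.25
```

The same function applies to a species by summing genes and assembled base pairs over all contigs assigned to that species.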
  • AMP genes originating from ProGenomes2 43 had the taxonomy of the original genome assigned to them, whereas AMP genes from metagenomes were assigned the taxonomy predicted for the contig where they were found. Insights about potential structural conformations were obtained using the function secondary_structure_fraction from the ProtParam module implemented in SeqUtils in Biopython 107 . This function calculates the fraction of amino acids that tend to assume conformations of helix [VIYFWL], turn [NPGS], and sheet [EMAL].
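As a rough illustration of what this calculation does, the sketch below recomputes the three fractions in plain Python. It is not the Biopython implementation: the residue sets are copied from the text above, and the function and constant names are chosen here for illustration only.

```python
# Residue sets as stated in the text (leucine counts toward both helix and sheet).
HELIX = set("VIYFWL")
TURN = set("NPGS")
SHEET = set("EMAL")


def secondary_structure_fraction(seq: str) -> tuple[float, float, float]:
    """Return the fraction of residues in the helix, turn, and sheet sets."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    helix = sum(aa in HELIX for aa in seq) / n
    turn = sum(aa in TURN for aa in seq) / n
    sheet = sum(aa in SHEET for aa in seq) / n
    return helix, turn, sheet


# Hypothetical four-residue peptide: one helix (V), one turn (N), one sheet (E) residue.
print(secondary_structure_fraction("VNEK"))  # (0.25, 0.25, 0.25)
```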
  • Clustering of AMP families. Clustering peptides by sequence identity is only possible at high identities as short low-/medium-identity matches are possible by chance. Therefore, aiming to recover matches where basic features are preserved even if individual amino acids are not identical 134,133 , a reduced amino acids alphabet of 8 letters was used 56 : [LVIMC], [AG], [ST], [FYW], [EDNQ], [KR], [P], [H]. c_AMPs were hierarchically clustered after alphabet reduction using three sequential identity cutoffs (100%, 85%, and 75%) with CD-Hit 98 . A cluster was considered an AMP family when it consisted of at least 8 sequences 38 . Representative sequences of peptide clusters were selected according to their length (taking the longest) with ties being broken by their alphabetical order.
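The alphabet reduction, the family-size rule, and the choice of representative can be sketched as follows. This is an illustrative sketch only: using the first letter of each group as its symbol, mapping unknown residues to "X", and all function names are assumptions made here. The clustering itself (CD-Hit in the text) is not reproduced.

```python
# The eight residue groups listed in the text; each group is represented
# by its first letter (an arbitrary choice made for this sketch).
GROUPS = ["LVIMC", "AG", "ST", "FYW", "EDNQ", "KR", "P", "H"]
REDUCED = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(seq: str) -> str:
    """Rewrite a peptide in the 8-letter alphabet; unknown residues become 'X'."""
    return "".join(REDUCED.get(aa, "X") for aa in seq.upper())


def is_family(cluster: list[str], min_size: int = 8) -> bool:
    """A cluster counts as an AMP family when it has at least 8 sequences."""
    return len(cluster) >= min_size


def representative(cluster: list[str]) -> str:
    """Pick the longest sequence, breaking ties by alphabetical order."""
    return min(cluster, key=lambda s: (-len(s), s))


# Two hypothetical peptides that differ at every position but share basic features.
print(reduce_alphabet("KVLR"), reduce_alphabet("RIMK"))  # KLLK KLLK
print(representative(["KKLL", "AKLL", "KKL"]))  # AKLL
```

After reduction, peptides that differ only by substitutions within a group become identical, which is what lets the identity cutoffs recover related sequences.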
  • c_AMPs were subjected to five different quality tests to reduce the likelihood that the observed peptides were artifacts or fragments of larger proteins. Initially, the peptides were searched against AntiFam v.7.0 123 , which was designed to identify commonly recurring spuriously predicted ORFs, using HMMSearch 109 with the option “--cut_ga”. Fewer than 0.05% of c_AMPs had any significant hits.
  • The RNAcode 84 program predicts protein-coding regions based on evolutionary signatures typical for protein genes. This analysis depends on a set of homologous and non-identical genes. Therefore, AMP clusters containing at least three gene variants were aligned. Given that an extensive portion of the AMPSphere candidates (53%; 459,910 out of 863,498) is not part of such a cluster, they could not be tested. Of the tested c_AMPs, 53% (215,421 out of 403,588) were considered genes with evolutionary traits of protein-coding sequences.
  • the inventors looked for evidence of transcription and/or translation using 221 publicly available metatranscriptomes, comprising human gut (142), peat (48), plant (13), and symbionts (17); and 109 publicly available metaproteomes from the PRIDE 129 database, from 37 habitats.
  • Using bwa v.0.7.17 113 , reads from the metatranscriptomes were mapped against non-redundant AMP genes, and, using NGLess 96 , genes were selected with at least one read mapped across a minimum of two samples to increase our confidence. This approach is similar to that adopted when predicting AMPs 42 .
  • mapping of c_AMPs was performed without considering genomic context, which may have led to an overestimation of candidates being identified as potentially transcribed. For example, if they are homologous to longer proteins, the presence of the longer gene may lead to a false positive detection of the shorter c_AMP.
  • sample-based c_AMP accumulation curves were computed by randomly sampling metagenomes in steps of 10 metagenomes. This procedure was repeated 32 times, and the average was taken.
  • Multi-habitat and rare c_AMPs. c_AMPs present in >2 habitats (“multi-habitat AMPs”) were counted. To then test the significance of this value, we opted for a similar approach to that described in Coelho et al. 55 : habitat labels for each sample were shuffled 100 times and the number of resulting multi-habitat c_AMPs was counted.
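The selection rule for transcription evidence (at least one mapped read in a minimum of two samples) can be sketched as a simple filter over a per-gene, per-sample read-count table. The data structure, gene identifiers, and sample names below are hypothetical; the actual mapping and counting were done with bwa and NGLess according to the text.

```python
def transcribed_genes(
    read_counts: dict[str, dict[str, int]],
    min_reads: int = 1,
    min_samples: int = 2,
) -> set[str]:
    """Return genes with >= min_reads mapped reads in >= min_samples samples."""
    return {
        gene
        for gene, per_sample in read_counts.items()
        if sum(count >= min_reads for count in per_sample.values()) >= min_samples
    }


# Hypothetical read counts: gene -> {metatranscriptome sample -> mapped reads}.
counts = {
    "geneA": {"s1": 3, "s2": 1, "s3": 0},  # detected in two samples: kept
    "geneB": {"s1": 9, "s2": 0, "s3": 0},  # detected in one sample only: dropped
}
print(transcribed_genes(counts))  # {'geneA'}
```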
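A minimal sketch of such an accumulation curve is given below, assuming each metagenome is represented as a set of c_AMP identifiers. The function name, the seed, and the toy data are assumptions for illustration; only the step size (10) and the number of repetitions (32) come from the text.

```python
import random


def accumulation_curve(samples, step=10, repeats=32, seed=0):
    """Average number of distinct c_AMPs seen after 10, 20, ... random samples."""
    rng = random.Random(seed)
    n_points = len(samples) // step
    totals = [0.0] * n_points
    for _ in range(repeats):
        order = list(samples)
        rng.shuffle(order)  # new random sampling order for each repetition
        seen = set()
        for i, sample in enumerate(order[: n_points * step], start=1):
            seen |= sample
            if i % step == 0:
                totals[i // step - 1] += len(seen)
    return [total / repeats for total in totals]


# Hypothetical data: 30 metagenomes, each holding two c_AMP identifiers.
toy = [{f"amp{i}", f"amp{i // 2}"} for i in range(30)]
curve = accumulation_curve(toy)
print(len(curve), curve[-1])  # 3 30.0
```

A curve that keeps rising at its last point indicates a habitat that is far from saturated.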
  • Shuffling labels resulted in 676,489.7 ± 4,281.8 multi-habitat c_AMPs by chance for high-level habitat groups, and in 685,477.17 ± 4,369.6 multi-habitat c_AMPs by chance when looking at the habitats individually inside the high-level groups.
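The label-shuffling procedure can be sketched as follows. This is an assumption-laden illustration rather than the original analysis code: each sample is modeled as a set of c_AMP identifiers, the habitat threshold is left as an explicit parameter, and all names and toy data are hypothetical.

```python
import random
from statistics import mean, stdev


def count_multi_habitat(sample_amps, habitats, min_habitats):
    """Count c_AMPs detected in at least `min_habitats` distinct habitats."""
    amp_habitats = {}
    for amps, habitat in zip(sample_amps, habitats):
        for amp in amps:
            amp_habitats.setdefault(amp, set()).add(habitat)
    return sum(len(found) >= min_habitats for found in amp_habitats.values())


def shuffled_counts(sample_amps, habitats, min_habitats, n_shuffles=100, seed=0):
    """Recount multi-habitat c_AMPs after shuffling habitat labels across samples."""
    rng = random.Random(seed)
    labels = list(habitats)
    counts = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        counts.append(count_multi_habitat(sample_amps, labels, min_habitats))
    return counts


# Hypothetical toy data: four samples with their habitat labels.
sample_amps = [{"a", "b"}, {"a"}, {"c"}, {"b", "c"}]
habitats = ["soil", "gut", "soil", "water"]
observed = count_multi_habitat(sample_amps, habitats, min_habitats=2)
null = shuffled_counts(sample_amps, habitats, min_habitats=2)
print(observed, round(mean(null), 2), round(stdev(null), 2))
```

The observed count is then compared with the mean and standard deviation of the shuffled counts.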
  • the rarest genes will not be high-quality.
  • this effect was quantified by computing the mean and median number of detections in only the high-quality c_AMPs and only non-terminal c_AMPs (a test which does not require a minimum number of genes).
  • the mean number of detections is 682 for the full collection, 789 for high-quality c_AMPs, and 679 for non-terminal ones.
  • Chebyshev's inequality is p ≤ 1/Z², where Z stands for the Z-score computed from the average and standard deviation estimated by the shuffling procedure.
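As a worked illustration, Chebyshev's inequality bounds the probability of a deviation of at least Z standard deviations by 1/Z², without assuming any particular distribution. The sketch below computes that bound from an observed value and a list of shuffled values; the function name, the use of the sample standard deviation, and the numbers are assumptions made here.

```python
from statistics import mean, stdev


def chebyshev_p_bound(observed: float, shuffled: list[float]) -> float:
    """Upper bound on the p-value, 1/Z**2, capped at 1."""
    mu = mean(shuffled)
    sd = stdev(shuffled)  # sample standard deviation (an assumption of this sketch)
    if sd == 0 or observed == mu:
        return 1.0
    z = (observed - mu) / sd
    return min(1.0, 1.0 / z**2)


# Hypothetical shuffled values with mean 12 and standard deviation 2; the
# observed value 22 lies 5 standard deviations away, so the bound is 1/25.
print(chebyshev_p_bound(22, [10, 12, 14]))
```

Because the bound holds for any distribution, it is conservative compared with a p-value computed under a normality assumption.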
  • the p-values were adjusted using Holm-Sidak implemented in multipletests from the statsmodels package 114 , and those below 0.05 were considered significant.
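For readers unfamiliar with the correction, the sketch below reimplements the Holm-Sidak step-down adjustment in plain Python. It is meant to illustrate the arithmetic, not to replace the statsmodels function named in the text; the function name and the example p-values are hypothetical.

```python
def holm_sidak(pvals: list[float]) -> list[float]:
    """Return Holm-Sidak adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Sidak adjustment with the number of hypotheses still under test.
        adj = 1.0 - (1.0 - pvals[idx]) ** (m - rank)
        running_max = max(running_max, adj)  # enforce monotone adjusted values
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical raw p-values; those with an adjusted value below 0.05 are significant.
raw = [0.01, 0.04, 0.03]
adj = holm_sidak(raw)
print([round(p, 6) for p in adj])
print([p < 0.05 for p in adj])
```

In this example only the smallest p-value remains below 0.05 after adjustment.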
  • the AMP density and the coefficient of transmissibility were correlated using Spearman's method implemented in the scipy package 104 : following children's microbiome after 1, 3, and up to 18 years, as well as, cohabitation and intradatasets.
  • the p-values of correlations were corrected using Holm-Sidak implemented in the multipletests function from the statsmodels package 114 .
  • AMPSphere candidates were aligned against several databases: (i) the small protein sets in SmProt 2 54 , (ii) the bioactive peptides database starPepDB 45k 93 , (iii) the small proteins from the global data-driven census of Salmonella 92 , (iv) the global microbial gene catalog GMGCv1 55 , and (v) the AMP database DRAMP 46 version 3.0.
  • the hypergeometric test implemented in the scipy package 104 was used to model the association between c_AMPs and the background distribution of ortholog groups from GMGCv1 55 .
  • the number of genes that were redundant in GMGCv1 55 for each ortholog group was computed along with the counts for ortholog groups in the top hits to AMPSphere.
  • the enrichment was given as the proportion of hits present in a given ortholog group divided by the proportion of that ortholog group among the redundant sequences in GMGCv1 55 , and results were considered significant if p < 0.05 after correction with the Holm-Sidak method implemented in multipletests from the statsmodels package 114 .
  • Biopython 107 was used to codon-align the fragments from metagenomic contigs assembled from samples SAMN09837386, SAMN09837387, and SAMN09837388, and genomic fragments of different strains of Prevotella jejuni CD3:33 (CP023864.1:504836-504949), F0106 (CP072366.1:781389-781502), F0697 (CP072364.1:1466323-1466436), and from Prevotella melaninogenica strains FDAARGOS_760 (CP054010.1:157726-157839), FDAARGOS_306 (CP022041.2:943522-943635), FDAARGOS_1566
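The enrichment ratio and the hypergeometric tail probability can be sketched with the standard library alone. The text names the scipy implementation; the version below is a plain-Python stand-in for illustration, and the function names and counts are hypothetical.

```python
from math import comb


def enrichment(hits_in_og: int, total_hits: int, og_genes: int, total_genes: int) -> float:
    """Proportion of hits in an ortholog group over its background proportion."""
    return (hits_in_og / total_hits) / (og_genes / total_genes)


def hypergeom_sf(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(population, draws)


# Hypothetical counts: 10 background genes, 4 of them in the ortholog group;
# 3 top hits, 2 of which fall in that group.
print(enrichment(2, 3, 4, 10))   # about 1.67-fold enrichment
print(hypergeom_sf(2, 10, 4, 3))  # 40/120, i.e. one third
```

The resulting p-values would then be corrected for multiple testing before applying the 0.05 threshold.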
  • Genomic context conservation analysis. To gain insights into the gene synteny involving AMP genes, the 863,498 AMP sequences were mapped against a collection of 169,632 reference genomes, metagenome-assembled genomes (MAGs) and single amplified genomes (SAGs) curated elsewhere 58 with DIAMOND 119 in “blastp” mode, as previously reported 58 . Hits with identity > 50% (amino acid) and query and target coverage > 90% were considered significant. The target coverage threshold avoids hits to larger homologs whose function may be unrelated. This yielded 107,308 AMPs with homologs in at least one genome.
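The significance criteria for hits can be sketched as a small filter. The thresholds come from the text; the function names, the 1-based inclusive coordinates, and the example hits are assumptions made for illustration and do not reflect any particular DIAMOND output format.

```python
def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence covered by an alignment (1-based, inclusive)."""
    return (end - start + 1) / length


def is_significant(identity_pct: float, query_cov: float, target_cov: float,
                   min_identity: float = 50.0, min_cov: float = 0.90) -> bool:
    """Apply the identity > 50% and query/target coverage > 90% thresholds."""
    return identity_pct > min_identity and query_cov > min_cov and target_cov > min_cov


# Hypothetical hits for a 30-residue c_AMP aligned over its full length.
q_cov = coverage(1, 30, 30)
print(is_significant(62.0, q_cov, coverage(1, 30, 32)))   # True: target of similar length
print(is_significant(62.0, q_cov, coverage(1, 30, 120)))  # False: much larger homolog
```

The second example shows the role of the target coverage threshold: a full-length match of the peptide to a short region of a much longer protein is rejected.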
  • MAGs metagenome-assembled genomes
  • SAGs single amplified genomes
  • AMPSphere web resource. AMPSphere is found at the address […]. The implementation is based on Python 100 and Vue JavaScript. The database was built with SQLite, and SQLAlchemy was used to map the database to Python objects. Internal and external APIs were built using FastAPI, with Gunicorn to serve them. On the front end, Vue 3 was used as the backbone and Quasar built the layout. Plotly was used to generate interactive visualization plots, and Axios to render content seamlessly. LogoJS was used to generate sequence logos for AMP families, while the helical wheel app was used to generate AMP helical wheels.
  • One-hundred synthesized peptides were tested against 11 clinically relevant pathogenic strains, encompassing Acinetobacter baumannii, Escherichia coli (including one colistin-resistant strain), Klebsiella pneumoniae, Pseudomonas aeruginosa, Staphylococcus aureus (including one methicillin-resistant strain), vancomycin-resistant Enterococcus faecalis, and vancomycin-resistant Enterococcus faecium.
  • the initial screening revealed that 63 AMPs (out of 100 synthesized) completely eradicated the growth of at least one of the pathogens tested (Fig. 6C).
  • the AMPs were active at concentrations as low as 1 μmol L⁻¹, close to those of the peptide antibiotic polymyxin B and the antibiotic levofloxacin, used as positive controls in all experiments (Fig. 11A).
  • MRSA methicillin-resistant S. aureus
  • the growth of human gut commensals is impaired by c_AMPs.
  • the AMPs were screened against eight of the most relevant members of the human gut microbiota associated with human health 73-77 . Tested were commensal bacteria belonging to four phyla (Verrucomicrobiota, Bacteroidota, Actinomycetota, and Bacillota), i.e., Akkermansia muciniphila, Bacteroides fragilis, Bacteroides thetaiotaomicron, Bacteroides uniformis, Phocaeicola vulgatus (formerly Bacteroides vulgatus), Collinsella aerofaciens, Clostridium scindens, and Parabacteroides distasonis.
  • NPN N-phenylaminonaphthalene
  • the fluorescent dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)] was utilized. Among the peptides tested against A. baumannii, bogicin-1 (AMP10.364_543), ampspherin-2 (AMP10.615_023), and marinobacticin-1 (AMP10.321_460) exhibited greater cytoplasmic membrane depolarization than polymyxin B, and among the ones tested against P. aeruginosa, cagicin-2 (AMP10.014_861) exhibited greater cytoplasmic membrane depolarization than polymyxin B (Fig. 6E).
  • AMPs exhibit anti-infective efficacy in a mouse model.
  • Fig. 7A skin abscess murine infection model
  • Mice were subjected to infection with A. baumannii, a dangerous Gram-negative pathogen known for causing severe infections in various body sites including the bloodstream, lungs, urinary tract, and wounds 82 .
  • Ten lead AMPs from different sources displayed potent in vitro activity against A. baumannii: synechocucin-1 (AMP10.000_211, 8 μmol L⁻¹) from Synechococcus sp.
  • proteobacticin-1 (AMP10.048_551, 16 μmol L⁻¹) from Pseudomonadota (plant and soil microbiome), actynomycin-1 (AMP10.199_072, 64 μmol L⁻¹) from Actinomyces (human mouth and saliva microbiome), lachnospirin-1 (AMP10.015_742, 2 μmol L⁻¹) from Lachnospira sp.
  • RNAcode 84 requires multiple variants, which is independent of their activity and influenced by sampling biases.
  • c_AMPs from AMPSphere were habitat-specific and mostly accessory members of microbial pangenomes. Furthermore, four out of the five genera with the most c_AMPs present in AMPSphere share a host-associated lifestyle, and three of these (Prevotella, Faecalibacterium, and CAG-110) are common in animal hosts 89-91 (Fig. 5).
  • Valles-Colomer et al. 69 , who recently analyzed a large collection of human-associated metagenomes, provide a species-specific index of transmissibility for the several transmission scenarios they study (e.g., mother to infant). Hypothesizing that AMP production may be related to transmission, the species-specific ρ_AMP calculated in AMPSphere was correlated with transmission scores. In both the human gut and oral microbiomes, species with higher ρ_AMP are less transmissible, possibly because AMPs confer protection against strain replacement. Taken together, these results validate the applicability of AMPSphere in the study of microbial ecology as they suggest a role for AMPs in determining the transmissibility and colonization ability of microbes.
  • tested AMPs from AMPSphere tended to target clinically relevant Gram-negative pathogens and showed activity against vancomycin-resistant E. faecium. Although conventional AMPs do not target bacteria from the human gut microbiome 78 , tested AMPs from AMPSphere showed efficacy against commensal bacteria, suggesting potential ecological implications of peptides as protective agents for their producing organisms and as a means to reconfigure microbiome communities.
  • c_AMPs were considered for synthesis. They were further filtered according to six criteria for solubility 144 and three criteria for synthesis, as in PepFun 145 .
  • the solubility was estimated using the criteria implemented in PepFun 145 , observing that 67.4% (581,749 peptides) passed at least half of the solubility criteria evaluated.
  • the subset that is homologous to peptides in DRAMP 46 version 3.0 had a slightly lower rate: 44.3% passed half the tests.
  • a peptide approved for at least six of the above-mentioned criteria was then filtered by predicting AMP activity with six methods in addition to Macrel 42 : AMPScanner v2 49 , the mature peptides model in ampir 40 , amPEPpy 50 , APIN 51 (with their proposed model), AI4AMP 52 , and AMPLify 53 .
  • Peptides predicted to be AMPs by all methods were filtered by length, discarding sequences longer than 40 amino acid residues, for which conventional solid-phase peptide synthesis using Fmoc strategy has lower yields and many recoupling reactions 146-148 .
  • Circular dichroism assays. Circular dichroism experiments were conducted using a J1500 circular dichroism spectropolarimeter (Jasco). The experiments were carried out at a temperature of 25°C. Circular dichroism spectra were obtained by averaging three accumulations using a quartz cuvette with an optical path length of 1.0 mm. The spectra were recorded in the wavelength range from 260 to 190 nm at a scanning rate of 50 nm min⁻¹ with a bandwidth of 0.5 nm. The peptides were tested at a concentration of 50 μmol L⁻¹.
  • Measurements were performed in water, a mixture of water and trifluoroethanol (TFE) in a ratio of 3:2, and a mixture of water and methanol in a ratio of 1:1. Baseline measurements were recorded prior to each measurement. To minimize background effects, a Fourier transform filter was applied. The helical fraction values were calculated using the single spectra analysis tool available on the BeStSel server 95 .
  • TFE trifluoroethanol
  • NPN 1-(N-phenylamino)naphthalene
  • Cytoplasmic membrane depolarization assays. The ability of the peptides to depolarize the cytoplasmic membrane was assessed by measuring the fluorescence of the membrane potential-sensitive dye 3,3′-dipropylthiadicarbocyanine iodide [DiSC3-(5)]. This potentiometric fluorophore fluoresces upon release from the interior of the cytoplasmic membrane in response to an imbalance of its transmembrane potential.
  • the cells were then centrifuged and washed twice with washing buffer (20 mmol L⁻¹ glucose, 5 mmol L⁻¹ HEPES, pH 7.2) and re-suspended to an OD600 of 0.05 in 20 mmol L⁻¹ glucose, 5 mmol L⁻¹ HEPES, 0.1 mol L⁻¹ KCl, pH 7.2.
  • An aliquot of 100 μL of bacterial cells was added to a black flat bottom 96-well plate and incubated with 20 nmol L⁻¹ of DiSC3-(5) for 15 min until the fluorescence stabilized, indicating the incorporation of the dye into the cytoplasmic membrane.
  • the positive control antibiotic polymyxin B
  • the percentage of difference between the baseline (polymyxin B) and the sample was estimated using the same mathematical approach as in the “Outer membrane permeabilization assays”.
  • Listeriolysin S is a streptolysin s-like virulence factor that targets exclusively prokaryotic cells in vivo. mBio 8. 10.1128/mBio.00259-17.
  • proGenomes2 an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Research 48, D621-D625. 10.1093/nar/gkz1002.
  • DRAMP 3.0 an enhanced comprehensive data repository of antimicrobial peptides. Nucleic Acids Research. 10.1093/nar/gkab651.
  • AI4AMP an Antimicrobial Peptide Predictor Using Physicochemical Property-Based Encoding Method and Deep Learning. mSystems 6, e0029921.
  • eggNOG 5.0 a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research 47, D309-D314. 10.1093/nar/gky1085.
  • Ribosomes The New Role of Ribosomal Proteins as Natural Antimicrobials. Int J Mol Sci 23, 9123. 10.3390/ijms23169123.
  • Type 1 and type 2 strains of Mycoplasma pneumoniae form different biofilms. Microbiology (Reading) 159, 737-747. 10.1099/mic.0.064782-0.
  • RNAcode Robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578-594. 10.1261/rna.2536111.
  • NG-meta-profiler fast processing of metagenomes using NGLess, a domain-specific language. Microbiome 7, 84. 10.1186/s40168-019-0684-8.
  • SciPy 1.0 fundamental algorithms for scientific computing in Python. Nat Methods 17, 261-272. 10.1038/s41592-019-0686-2.
  • scikit-bio A Bioinformatics Library for Data Scientists, Students, and Developers. Version 0.5.5.
  • PRIDE a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res 34, D659-663. 10.1093/nar/gkj138.
  • SolyPep a fast generator of soluble peptides https://bioserv.rpbs.univ-paris-diderot.fr/services/SolyPep/.
  • AMPSphere the worldwide survey of prokaryotic antimicrobial peptides. (Zenodo). 10.5281/zenodo.4606582.


Abstract

Novel antibiotics are needed to combat the antibiotic-resistance crisis. Presented herein are machine learning-based approaches that predict antimicrobial peptides (AMPs) within the global microbiome and that leverage a vast dataset, which can include metagenomes and prokaryotic genomes from environmental and host-associated habitats, to create a comprehensive catalog comprising distinct, non-redundant peptides, the majority of which are novel. This platform provides insights into the evolutionary origins of peptides, including by duplication or gene truncation of longer sequences. To validate predictions, 100 AMPs were synthesized and tested against clinically relevant drug-resistant pathogens and human gut commensals, both in vitro and in vivo. Many of the synthesized peptides were active, with a large number of them targeting pathogens. The presently described computational approach can identify millions of prokaryotic AMP sequences, opening new avenues for antibiotic discovery.

Description

COMPUTATIONAL EXPLORATION OF THE GLOBAL MICROBIOME FOR
ANTIBIOTIC DISCOVERY
GOVERNMENT SUPPORT
[0001] This invention was made with government support under GM138201 awarded by the National Institutes of Health and HDTRA1-18-1-0041, HDTRA1-21-1-0014, and HDTRA1-21-1-0014 awarded by the Defense Threat Reduction Agency. The government has certain rights in the invention.
CROSS-REFERENCE TO RELATED APPLICATIONS
[0002] The present application claims the benefit of priority to U.S. Provisional App. No. 63/579,834, filed August 31, 2023, the entire contents of which are incorporated herein by reference.
BACKGROUND
[0003] Antibiotic-resistant infections are becoming increasingly difficult to treat with conventional therapies1. Indeed, such infections currently kill 1.27 million people per year2. Therefore, there is an urgent need for novel methods for antibiotic discovery. Computational approaches have recently been developed to accelerate our ability to identify novel antibiotics, including antimicrobial peptides (AMPs)3-9. Recently, proteome mining approaches have even been developed to identify antimicrobial agents in extinct organisms in an attempt to further expand our repertoire of known antimicrobials10.
[0004] AMPs, found in all domains of life11 14, are short sequences (here operationally defined as 10-100 amino acid residues15) capable of disturbing microbial growth12,15. AMPs most commonly interfere with cell wall integrity and cause cell lysis12 16. Natural AMPs can originate by proteolysis4 17, by non-ribosomal synthesis18, or they can be encoded within the genome 19.
[0005] Bacteria live in an intricate balance of antagonism and mutualism in natural habitats. AMPs play an important role in modulating such microbial interactions and can displace competitor strains, facilitating cooperation20. For instance, pathogens such as Shigella spp.21, Staphylococcus spp.22, Vibrio cholerae23, and Listeria spp.24,25 produce AMPs that eliminate competitors (sometimes from the same species), allowing them to occupy their niche.
[0006] AMPs hold promise as potential therapeutics and have already been used clinically as antiviral drugs (e.g., enfuvirtide and telaprevir26). AMPs that exhibit immunomodulatory properties are currently undergoing clinical trials27, as are AMPs that may be used to address yeast and bacterial infections28 (e.g., pexiganan, LL-37, PAC-113). Although most AMPs display broad-spectrum activity, some are only active against closely related members of the same species or genus29. Such AMPs are more targeted agents than conventional broad-spectrum antibiotics30,31. Furthermore, contrary to conventional antibiotics, the evolution of resistance to many AMPs occurs at low rates and is not related to cross-resistance to other classes of widely used antibiotics4,32,33.
[0007] The application of metagenomic analyses to the study of AMPs has been limited due to technical constraints, primarily stemming from the challenge of distinguishing genuine protein-coding sequences from false positives34.
BRIEF DESCRIPTION OF THE FIGURES
[0008] FIG. 1A illustrates how 63,410 publicly available metagenomes from diverse habitats were assembled. A modified version of Prodigal34, which can also predict smORFs (30-300 bp), was used to predict genes on the resulting metagenomic contigs as well as on 87,920 microbial genomes from ProGenomes243. Macrel42 was applied to the 4,599,187,424 predicted smORFs to obtain 863,498 non-redundant c_AMPs (see also FIG. 8). The c_AMPs were then hierarchically clustered in a reduced amino acid alphabet using 100%, 85%, and 75% identity cutoffs. At 75% identity, 118,051 non-singleton clusters were observed, and 8,788 of them were considered families (> 8 c_AMPs). FIG. 1B shows that only 9% of c_AMPs have detectable homologs in other small protein databases (SmProt 254, STsORFs92), bioactive peptide databases (DRAMP46 version 3.0, starPepDB 45k93), and general protein datasets (GMGCv155); see also FIG. 9B. Also shown is the number of homologs in the AMPSphere in each database as well as the total. The number of homologs passing all quality tests, regardless of their experimental evidence of translation/transcription, is also shown along with the percentage it represents of the homologs identified. Note that some peptides have homologs in multiple databases and, thus, the total count is not the sum of the individual databases. FIG. 1C shows rarefaction curves illustrating how AMP discovery is impacted by sampling, with most of the habitats presenting steep sampling curves. FIG. 1D illustrates how sharing of c_AMPs between habitats is limited. The width of ribbons represents the proportion of the shared c_AMPs in the habitat on the left. See also FIGS. 9C-D.
[0009] FIG. 2A illustrates the number of AMPSphere candidates passing each of the tests proposed for quality. FIG. 2B shows the number of AMP candidates predicted as AMP by AMP prediction systems beyond Macrel (AMPScanner v249, ampir40 - with the model for mature peptides, amPEPpy50, APIN51 - with their proposed model, AI4AMP52, and AMPLify53). Only a small portion of AMPSphere (<2%) cannot be co-predicted by any system other than Macrel42.
[0010] FIG. 3A shows the distribution of positions (as a percentage of the length of the larger protein) from which the AMP homologs start their alignment. About 7% of c_AMPs are homologous to proteins from GMGCv155, with approximately one-fourth of the hits sharing start positions with the larger protein. FIG. 3B provides an illustrative example of an AMP homologous to a full-length protein: AMP10.271_016 was recovered from three samples of human saliva from the same donor94. AMP10.271_016 is predicted to be produced by Prevotella jejuni, sharing the start codon (bolded) of an NAD(P)-dependent dehydrogenase gene (WP_089365220.1), the transcription of which was stopped by a mutation (in red; TGG > TGA). FIG. 3C shows the distribution of AMPs per OG class (left) and their enrichment in comparison to full-length proteins from GMGCv1 35 (right). OGs were classified into subgroups according to the number of c_AMPs they were affiliated with. The OGs of unknown function represent the largest (2,041 out of 3,792 OGs) and most enriched (PKruskal = 2.66 × 10⁻³⁹) class with homologs to c_AMPs in GMGCv153. Interestingly, when considered individually, the number of c_AMP hits to unknown OGs was the lowest (PKruskal = 6 × 10⁻³). These results do not change when underrepresented OGs are excluded by using different thresholds (e.g., at least 10, 20, or 100 homologs per OG).
[0011] FIG. 4A shows how, compared to other proteins, c_AMPs in conserved genomic architectures tend to be closer to ribosomal machinery-related genes than families of proteins with different sizes (all lengths and small proteins with < 50 amino acids). FIG. 4B shows how the proportion of c_AMPs in a genome context involving antibiotic resistance genes is lower than in other gene families. FIG. 4C illustrates how the proportion of c_AMPs in neighborhoods with antibiotic synthesis-related genes is very small (<0.25%). FIG. 4D illustrates the conserved genomic context of the gene encoding AMP10.015_426 in different genomes (the tree on the left depicts the phylogenetic relationship of the genes homologous to it). This c_AMP is homologous to the ribosomal protein rpsH, and is found in the context of rpsH and other ribosomal protein genes.
[0012] FIG. 5A shows the fractions of AMPs (or AMP families) that are accessory (present in <50% of genomes from the same species), shell (50-95%), or core (>95%). FIG. 5B illustrates the lowest taxonomic level at which c_AMPs were annotated. In detail (right), the top 10 genera with the highest numbers of c_AMPs included in AMPSphere are shown. Animal-associated genera (e.g., Prevotella, Faecalibacterium, CAG-110) contribute the most c_AMPs, possibly reflecting data sampling. FIG. 5C illustrates the distribution of ρAMP per genus (calculated with c_AMPs in AMPSphere) across phyla, with Bacillota A as the densest (the number of samples used to build the graph is shown above each box). FIG. 5D shows the detected taxa in AMPSphere, using the GTDB67,68 reference tree. The gray bars show the ρAMP distribution with respect to taxonomy, with black bars representing the 95% confidence interval. Bacillota A, Actinomycetota, and Pseudomonadota are the phyla densest in c_AMPs. As a reference, the median of ρAMP for the presented genera is indicated by a magenta dashed line.
[0013] FIG. 6A provides the amino acid frequency in c_AMPs from AMPSphere, AMPs from databases (DRAMP46 version 3, APD371, and DBAASP70), and encrypted peptides4 (EPs) from the human proteome. FIG. 6B is a heat map with the percentage of secondary structure found for each peptide in three different solvents: water, 60% trifluoroethanol (TFE) in water, and 50% methanol (MeOH) in water. Secondary structure was calculated using the BeStSel server95. FIG. 6C illustrates the activity of c_AMPs assessed against ESKAPEE pathogens and human gut commensal strains. Briefly, 10⁶ CFU mL⁻¹ was exposed to c_AMPs two-fold serially diluted from 64 to 1 μmol L⁻¹ in 96-well plates and incubated at 37 °C for one day. After the exposure period, the absorbance of each well was measured at 600 nm. Untreated solutions were used as controls, and minimal concentration values for complete inhibition are presented as a heat map of antimicrobial activities (μmol L⁻¹) against 11 pathogenic and eight human gut commensal bacterial strains. All the assays were performed in three independent replicates and the heat map shows the mode obtained within the two-fold dilution concentration range studied. Gram-positive (+) and Gram-negative (-) bacteria are indicated as such at the top of panel C. FIG. 6D illustrates fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN) that indicate outer membrane permeabilization of A. baumannii ATCC 19606 cells. FIG. 6E provides fluorescence values relative to PMB (positive control) of 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)], a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of A. baumannii ATCC 19606 cells. Depolarization of the cytoplasmic membrane occurred with slow kinetics compared to the permeabilization of the outer membrane and took approximately 20 min to stabilize.
[0014] FIG. 7A is a schematic of the skin abscess mouse model used to assess the anti-infective activity of the peptides against A. baumannii cells. FIG. 7B shows how peptides were tested at their MIC in a single dose one hour after the establishment of the infection. Each group consisted of three mice (n = 3) and the bacterial loads used to infect each mouse derived from a different inoculum. Statistical significance in FIG. 7B was determined using one-way ANOVA where all groups were compared to the untreated control group; P-values are shown for each of the groups. FIG. 7C shows mouse weight, which was monitored throughout the experiment to rule out toxic effects of the peptides. Features on the violin plots represent median and upper and lower quartiles. Data are the mean ± the standard deviation.
[0015] FIG. 8 provides density curves; the arbitrary density units are not shown because all curves are independently normalized so that the area under the curve is one. For each dataset and feature, the top 1% and bottom 1% of values were considered outliers and are not shown in the plot. Proportions of residues with small side chains [A, C, D, G, N, P, S, T, V] per c_AMP along with the proportions of basic residues [H, R, K] per c_AMP are also shown. The distributions of each feature were compared among the datasets using the Mann-Whitney test with multiple hypothesis testing corrected using Holm-Sidak. Almost all differences are significant (adjusted p-value < 0.05). The exceptions are: aliphatic index did not differ between the peptides from the DRAMP version 346 database and the ones present in the positive training set used in Macrel42 (PMann = 0.71); AMPSphere peptides did not differ from the positive training set used in Macrel42 in the fraction of aromatic (PMann = 0.58), non-polar (PMann = 0.97), polar (PMann = 0.97), and acidic (PMann = 0.69) residues; the instability index (PMann = 0.58) and the hydrophobicity (PMann = 0.31) of AMPSphere peptides also were not different from the positive training set used in Macrel42.
[0016] FIG. 9A illustrates how quality assessment of AMPSphere revealed that most of the peptides passed at least one of the tests. The RNAcode test depends on gene diversity, which is very low for AMPSphere, and therefore yielded a low rate of positives among our candidates. FIG. 9B shows that c_AMPs homologous to sequences in databases of validated bioactive peptides also showed a higher average quality. FIG. 9C shows how the limited overlap of c_AMPs among habitats argues in favor of using habitat groups to gain resolution. Note that the groups of habitats with the highest paired overlaps belong to human body sites and to samples from human guts and non-human mammalian guts. Only habitats with at least 100 samples are shown. FIG. 9D shows the large proportion of rare genes in AMPSphere from different habitat groups, in which few genes are widely detected.
[0017] FIG. 10 illustrates how, to validate the clustering procedure using a reduced amino acid alphabet, samples of 1,000 peptides were randomly drawn from AMPSphere (excluding representative sequences) and aligned against their cluster representatives. Three different levels (I, II, and III) of clustering were tested. The E-values were computed per alignment and plotted against the corresponding alignment identity. The averaged proportion of significant alignments is shown in each graph.
[0018] FIG. 11A illustrates minimal inhibitory concentration values for polymyxin B, a peptide antibiotic, and levofloxacin against all the strains tested. Polymyxin B and levofloxacin were used as positive controls in all antimicrobial assays. The secondary structural tendency of the c_AMPs was analyzed using three different solvents. FIG. 11B shows the analysis in water. FIG. 11C shows the analysis in a trifluoroethanol (TFE) and water mixture (3:2, V:V). FIG. 11D shows the analysis in a methanol (MeOH) and water mixture (1:1, V:V). The experiments were carried out at 25 °C, and the circular dichroism spectra shown are an average of three accumulations obtained using a quartz cuvette with an optical path length of 1.0 mm, ranging from 260 to 190 nm at a rate of 50 nm min⁻¹ and a bandwidth of 0.5 nm. All peptides were tested at a concentration of 50 μmol L⁻¹, with respective baselines recorded prior to measurement. A Fourier transform filter was applied to minimize background effects.
[0019] FIG. 12A provides minimal inhibitory concentration values of the scrambled versions of five of the lead c_AMPs from AMPSphere tested against the same 11 pathogenic strains and eight gut commensal strains used to assess the activity of the c_AMPs. The secondary structural tendency of the scrambled peptides was analyzed using three different solvents. FIG. 12B shows the analysis in water. FIG. 12C shows the analysis in a TFE and water mixture (3:2, V:V). FIG. 12D shows the analysis in a MeOH and water mixture (1:1, V:V). The experiments were carried out under the same conditions as the ones used for the c_AMPs. A Fourier transform filter was applied to minimize background effects. FIG. 12E provides a heat map with the percentage of secondary structure found for each peptide in three different solvents: water, 60% TFE in water, and 50% MeOH in water. Secondary structure was calculated using the BeStSel server95.
[0020] FIG. 13A shows fluorescence values relative to polymyxin B (PMB, positive control) of the fluorescent probe 1-(N-phenylamino)naphthalene (NPN) that indicate outer membrane permeabilization of P. aeruginosa PAO1 cells. FIG. 13B provides fluorescence values relative to PMB (positive control) of 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)], a hydrophobic fluorescent probe used to indicate cytoplasmic membrane depolarization of P. aeruginosa PAO1 cells. FIG. 13C provides bacterial counts four days post-infection; the c_AMPs were tested at their MIC in a single dose one hour after the establishment of the infection. Each group consisted of three mice (n = 3) and the bacterial loads used to infect each mouse were derived from a different inoculum. Statistical significance was determined using one-way ANOVA where all groups were compared to the untreated control group; P-values are shown for each of the groups. FIG. 13D illustrates mouse weight throughout the experiment (mean ± the standard deviation). Features on the violin plots represent median and upper and lower quartiles.
SUMMARY
[0021] Provided herein are methods for forming a database-assisted platform providing one or more functional and/or physicochemical features of respective metagenomic-derived candidate antimicrobial peptides (AMPs), the methods comprising selecting one or more genomes or metagenomes for inclusion in the platform; using an NGS assembler to assemble reads in order to identify contigs from the genomes or metagenomes; from the identified contigs, predicting small open reading frames (smORFs); removing duplicate smORFs to yield non-redundant smORFs; and, predicting candidate AMPs from the non-redundant smORFs.
[0022] Also disclosed are methods of treating a microbial infection in a subject comprising administering to the subject a therapeutically effective amount of an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform.
[0023] The present disclosure also provides methods comprising contacting a biofilm with an effective amount of an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform.
[0024] Also provided are compositions comprising an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform and a pharmaceutically acceptable carrier, diluent, or excipient.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
[0025] The present invention may be understood more readily by reference to the following detailed description taken in connection with the accompanying examples, which form a part of this disclosure. It is to be understood that this invention is not limited to the specific products, methods, conditions or parameters described and/or shown herein, and that the terminology used herein is for the purpose of describing particular embodiments by way of example only and is not intended to be limiting of the claimed invention.
[0026] The disclosures of each patent, patent application, and publication cited or described in this document are hereby incorporated herein by reference, in their entirety.
[0027] As employed above and throughout the disclosure, the following terms and abbreviations, unless otherwise indicated, shall be understood to have the following meanings.
[0028] In the present disclosure the singular forms “a”, “an”, and “the” include the plural reference, and reference to a particular numerical value includes at least that particular value, unless the context clearly indicates otherwise. Thus, for example, a reference to “a compound” is a reference to one or more of such compounds and equivalents thereof known to those skilled in the art, and so forth. Furthermore, when indicating that a certain chemical moiety “may be” X, Y, or Z, it is not necessarily intended by such usage to exclude other choices for the moiety.
[0029] When values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another embodiment. As used herein, “about X” (where X is a numerical value) preferably refers to ±10% of the recited value, inclusive. For example, the phrase “about 8” may refer to a value of 7.2 to 8.8, inclusive; as another example, the phrase “about 8%” may refer to a value of 7.2% to 8.8%, inclusive. Also, when the term “about” precedes a range, it is understood that the term modifies both recited endpoints and all points embraced within the range. For example, the phrase “about 1-10” is understood to mean “about 1 to about 10”, as well as “about x”, wherein x refers to any value between 1 and 10. Where present, all ranges are inclusive and combinable. For example, when a range of “1 to 5” is recited, the recited range should be construed as including ranges “1 to 4”, “1 to 3”, “1-2”, “1-2 & 4-5”, “1-3 & 5”, and the like. In addition, when a list of alternatives is positively provided, such listing can be interpreted to mean that any of the alternatives may be excluded, e.g., by a negative limitation in the claims. For example, when a range of “1 to 5” is recited, the recited range may be construed as including situations whereby any of 1, 2, 3, 4, or 5 are negatively excluded; thus, a recitation of “1 to 5” may be construed as “1 and 3-5, but not 2”, or simply “wherein 2 is not included.”
[0030] Novel antibiotics are urgently needed to combat the antibiotic-resistance crisis. In order to address this need, presented herein are machine learning-based approaches that predict antimicrobial peptides (AMPs) within the global microbiome and that leverage a vast dataset, which can include metagenomes and prokaryotic genomes from environmental and host-associated habitats, to create the AMPSphere, a comprehensive catalog comprising distinct, non-redundant peptides, the majority of which are novel. AMPSphere provides insights into the evolutionary origins of peptides, including by duplication or gene truncation of longer sequences, and we observed that AMP production varies by habitat. To validate predictions, 100 AMPs were synthesized and tested against clinically relevant drug-resistant pathogens and human gut commensals, both in vitro and in vivo. Many of the synthesized peptides were active, with a large number of them targeting pathogens. These active AMPs exhibited antibacterial activity by disrupting bacterial membranes. The presently described computational approach can identify millions of prokaryotic AMP sequences, opening new avenues for antibiotic discovery.
[0031] The significance of small open reading frames (smORFs) has been historically overlooked in (meta)genomic analyses35-37. In recent years, significant progress has been made in metagenomic analyses of human-associated smORFs6,38. These advancements have incorporated machine learning (ML) techniques to identify smORFs encoding proteins belonging to specific functional categories39-42. Notably, a recent study used predicted smORFs to uncover approximately 2,000 AMPs from metagenomic samples of human gut microbiomes6. Nevertheless, it is important to note that the human gut represents only a fraction of the overall microbial diversity, suggesting that there remains an immense potential for the discovery of AMPs from prokaryotes in the diverse range of habitats across the globe.
[0032] Pursuant to the present invention, ML was used to predict and catalog AMPs from the global microbiome as currently represented in public databases. By computationally exploring 63,410 publicly available metagenomes and 87,920 high-quality microbial genomes43, a vast array of AMP diversity was uncovered. This resulted in the creation of the AMPSphere, a collection of 863,498 non-redundant peptide sequences, encompassing candidate AMPs (c_AMPs) derived from (meta)genomic data. Remarkably, the majority of these c_AMP sequences had not been previously described. The present analysis revealed that these c_AMPs were specific to particular habitats and were predominantly not core genes in the pangenome.
[0033] Moreover, 100 c_AMPs from AMPSphere were synthesized and it was found that 79 were active, with 63 exhibiting antimicrobial activity in vitro against clinically significant ESKAPEE pathogens, which are recognized as public health concerns44,45. These peptides were further compared to encrypted peptides (EPs), which are peptide sequences hidden in protein sequences and mined computationally4,10; the c_AMPs demonstrated an ability to target bacterial membranes and were prone to adopting α-helical and β-structures. Notably, the leading candidates displayed promising anti-infective activity in a pre-clinical animal model. Together, the work disclosed herein provides an ML approach to identify new functional AMPs from the global microbiome.
[0034] Accordingly, provided herein are methods for forming a database-assisted platform providing one or more functional and/or physicochemical features of respective metagenomic-derived candidate antimicrobial peptides (AMPs), the methods comprising selecting one or more genomes or metagenomes for inclusion in the platform; using an NGS assembler to assemble reads in order to identify contigs from the genomes or metagenomes; from the identified contigs, predicting small open reading frames (smORFs); removing duplicate smORFs to yield non-redundant smORFs; and, predicting candidate AMPs from the non-redundant smORFs.
[0035] In some embodiments, the selection of the one or more genomes or metagenomes is according to criteria (i) whereby the genome or metagenome is tagged with taxonomy ID 408169 (for metagenome) or is a descendent of it in a taxonomic tree, (ii) whereby experiments with the genome or metagenome are listed as “METAGENOMIC”, or both (i) and (ii).
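By way of non-limiting illustration, the selection criteria described above can be sketched in Python as follows. The record layout (`SampleRecord`, with its `lineage` and `library_source` fields) is a hypothetical stand-in for whatever metadata a sequence archive provides; only the taxonomy ID 408169 and the “METAGENOMIC” label come from the present description.

```python
from dataclasses import dataclass

METAGENOME_TAXID = 408169  # taxonomy ID for "metagenome", as recited above


@dataclass
class SampleRecord:
    """Hypothetical metadata record for one archived sample."""
    accession: str
    taxid: int
    lineage: tuple        # ancestor taxonomy IDs (assumed to be available)
    library_source: str   # e.g. "METAGENOMIC" or "GENOMIC"


def is_selected(rec: SampleRecord) -> bool:
    """Apply criterion (i), criterion (ii), or both."""
    # (i) tagged with the metagenome taxonomy ID, or a descendant of it
    tagged = rec.taxid == METAGENOME_TAXID or METAGENOME_TAXID in rec.lineage
    # (ii) experiment listed as "METAGENOMIC"
    listed = rec.library_source.strip().upper() == "METAGENOMIC"
    return tagged or listed
```

In this sketch a sample is selected when either criterion holds; an embodiment requiring both criteria would replace `or` with `and` in the final line.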
[0036] In some embodiments, metadata is curated from the selected one or more genomes or metagenomes to create groups based on similarity of habitat conditions. Habitat conditions include one or more of air, anthropogenic, aquatic, host-associated, alkaline pH, sediment, or terrestrial.
[0037] The selection of the one or more genomes or metagenomes can alternatively or additionally include assessing sample origin or other information relating to host species using an NCBI taxonomic identification number.
[0038] The present methods may further comprise processing the assembled reads by trimming positions with a quality lower than a desired number, and discarding reads shorter than a specified number of base pairs, post-trimming. The quality at which a given position is trimmed may be selected to be, for example, lower than 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, or 10. The number of base pairs, post-trimming, that represent a read to be discarded may be about 40 to about 100, such as about 40-90, 45-80, 50-75, 50-70, or 55-65, such as about 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100.
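As a minimal sketch of the read processing described above, the following Python example trims low-quality positions and discards reads that are too short after trimming. Trimming from both ends of the read is an assumption made for illustration, as are the default thresholds (quality 25 and 60 bp), which are simply values chosen from the recited ranges.

```python
def trim_read(seq, quals, min_quality=25):
    """Trim positions below min_quality from both ends of a read.

    End trimming is an assumed strategy; the description only states that
    positions below a chosen quality are trimmed.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    return seq[start:end], quals[start:end]


def filter_reads(reads, min_quality=25, min_length=60):
    """Keep only reads that are at least min_length bp after trimming."""
    kept = []
    for seq, quals in reads:
        trimmed_seq, trimmed_quals = trim_read(seq, quals, min_quality)
        if len(trimmed_seq) >= min_length:
            kept.append((trimmed_seq, trimmed_quals))
    return kept
```

Here `reads` is an iterable of `(sequence, per-base quality list)` pairs; parsing of the underlying sequence files is outside the scope of the sketch.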
[0039] In some embodiments, metagenomes obtained from a host-associated microbiome may be passed through a filtering of reads mapping to the host genome, when available.
[0040] The NGS assembler that is used to assemble reads in order to identify contigs from the genomes or metagenomes may be, for example, optimized for metagenomes.
[0041] In some embodiments, the smORFs may be predicted from the identified contigs using prokaryotic gene recognition and translation initiation site identification. The prokaryotic gene recognition and translation initiation site identification may be, for example, Prodigal (PROkaryotic DYnamic programming Gene-finding Algorithm - see Hyatt, D., et al., BMC Bioinformatics 11, 119 (2010)), or a modified version thereof.
[0042] The length of the smORFs that are predicted from the contigs may be, for example, from about 20 to about 500 bp, such as 20-450, 25-400, 30-350, or 30-300 bp, or about 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, or 500 bp.
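For illustration only, the length restriction and the removal of duplicate smORFs can be sketched as below. The function name and the 30-300 bp defaults are assumptions drawn from one of the recited ranges; gene prediction itself (e.g., with Prodigal) is not reproduced here.

```python
def nonredundant_smorfs(orfs, min_bp=30, max_bp=300):
    """Return unique smORF sequences within [min_bp, max_bp], in input order."""
    seen = set()
    unique = []
    for orf in orfs:
        seq = orf.upper()  # treat sequences case-insensitively
        if min_bp <= len(seq) <= max_bp and seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique
```

Using a set for the membership check keeps deduplication linear in the number of predicted smORFs, which matters when billions of predictions are processed.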
[0043] In certain embodiments, the candidate AMPs are predicted from the non-redundant smORFs using metagenomic AMP classification and retrieval. For example, the candidate AMPs may be predicted from the non-redundant smORFs using Macrel.
[0044] Optionally, singleton sequences may be removed from the predicted candidate AMPs. In some embodiments, singleton sequences are retained if they match a sequence from a data repository of antimicrobial peptides. Matching a sequence from a data repository of antimicrobial peptides can mean having at least or about 65, 70, 75, 80, or 85% amino acid identity to a sequence from a data repository, and/or an E-value of at least about 10⁻⁵ relative to a sequence from a data repository. The data repository may be, for example, the Data Repository of Antimicrobial Peptides (DRAMP)46.
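The singleton handling described above may be sketched as follows, purely as an example. The sketch assumes that each candidate has an observation count and, optionally, a precomputed best hit against the repository given as `(percent identity, E-value)`; it also assumes the conventional reading that a match requires an E-value no larger than the threshold. The 75% identity default is one of the recited values, and requiring both conditions is only one of the recited alternatives.

```python
def is_repository_match(identity_pct, evalue, min_identity=75.0, max_evalue=1e-5):
    """Assumed match rule: identity at or above, and E-value at or below, the cutoffs."""
    return identity_pct >= min_identity and evalue <= max_evalue


def filter_singletons(counts, best_hits):
    """Drop candidates seen once unless they match a repository sequence.

    counts:    dict mapping peptide -> number of times it was observed
    best_hits: dict mapping peptide -> (identity_pct, evalue) of its best hit
    """
    kept = []
    for peptide, n_observed in counts.items():
        if n_observed > 1:
            kept.append(peptide)
            continue
        hit = best_hits.get(peptide)
        if hit is not None and is_repository_match(*hit):
            kept.append(peptide)
    return kept
```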
[0045] For purposes of further characterizing the candidate AMPs within the platform, candidate AMPs originating from a genomic database may be assigned a taxonomy from the original genome. Alternatively or additionally, candidate AMPs originating from a metagenome may be assigned a taxonomy predicted for the contig in which the candidate AMP was found.
[0046] The present methods may further comprise identifying potential structural configuration of a candidate AMP by using a secondary structure function from a calculation of a fraction of amino acids in the candidate AMP that tend to assume conformations of helix, turn, or sheet. This information may also be used to provide characterization of the candidate AMPs within the platform.
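As a non-authoritative sketch of such a secondary structure function, the fraction of residues with a tendency toward helix, turn, or sheet can be computed by simple counting. The residue groupings below are assumptions chosen for illustration (similar groupings are used by some sequence-analysis libraries); the present description does not specify them.

```python
# Assumed residue groupings; illustrative only.
HELIX_RESIDUES = set("VIYFWL")
TURN_RESIDUES = set("NPGS")
SHEET_RESIDUES = set("EMAL")


def secondary_structure_fraction(peptide):
    """Return (helix, turn, sheet) fractions of residues in the peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    n = len(peptide)
    helix = sum(aa in HELIX_RESIDUES for aa in peptide) / n
    turn = sum(aa in TURN_RESIDUES for aa in peptide) / n
    sheet = sum(aa in SHEET_RESIDUES for aa in peptide) / n
    return helix, turn, sheet
```

Because a residue may belong to more than one grouping (L in this example), the three fractions need not sum to one.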
[0047] The candidate AMPs may be hierarchically clustered using a reduced amino acid alphabet and an identity cutoff of a preselected percentage. For example, the reduced amino acid alphabet may include about 5, 6, 7, 8, 9, 10, 11, or 12 amino acids. The identity cutoff may be about 70-100%, such as about 70, 75, 80, 85, 90, 95, or 100%. A sequential cutoff may be employed, such as a sequential cutoff of 100, 95, and 90%, or 100, 90, and 85%, or 100, 90, and 80%, or 100, 85, and 75%. Representative sequences of peptide clusters may be selected according to their length (selecting for the longest), with ties being broken by the alphabetical order. Optionally, the clustering procedure may be validated.
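The following Python sketch illustrates two elements of the clustering described above: recoding peptides in a reduced amino acid alphabet, and choosing a cluster representative by length with alphabetical tie-breaking. The eight-group alphabet shown is an assumption for illustration, and grouping by identical reduced sequences corresponds only to the 100% identity level; lower cutoffs would require an alignment-based clustering tool, which is not shown.

```python
from collections import defaultdict

# Assumed 8-letter reduced alphabet; each group is recoded as its first letter.
GROUPS = ["LVMIC", "AG", "ST", "FYW", "EDNQ", "KR", "P", "H"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(peptide):
    """Recode a peptide of standard amino acids in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in peptide.upper())


def cluster_at_full_identity(peptides):
    """Group peptides whose reduced sequences are identical (100% level)."""
    clusters = defaultdict(list)
    for peptide in peptides:
        clusters[reduce_sequence(peptide)].append(peptide)
    return dict(clusters)


def pick_representative(members):
    """Longest member; ties broken by alphabetical order."""
    return min(members, key=lambda s: (-len(s), s))
```

For example, "KLAS" and "RIGT" fall into the same cluster under this alphabet, and "KLAS" is selected as the representative because both have the same length and it sorts first.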
[0048] The present methods may further comprise synthesizing one or more of the identified candidate AMPs. Selection of candidate AMPs for synthesis may be according to criteria for solubility, criteria for synthesis, or both. For example, the criteria may be in accordance with those used in PepFun145.
[0049] The synthesized candidate AMPs may individually be tested for antimicrobial activity to determine minimal inhibitory concentration (MIC). For example, the broth microdilution method may be used to determine MIC. The synthesized candidate AMPs may be subjected to circular dichroism assays, as described more fully infra. Membrane permeability with respect to a particular candidate AMP may be analyzed, for example, using the 1-(N-phenylamino)naphthalene (NPN) uptake assay. In certain embodiments, the ability of a candidate peptide to depolarize the cytoplasmic membrane may be assessed, for example, by measuring the fluorescence of the membrane potential-sensitive dye 3,3’-dipropylthiadicarbocyanine iodide [DiSC3-(5)].
[0050] Also disclosed are methods of treating a microbial infection in a subject comprising administering to the subject a therapeutically effective amount of an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform. In some embodiments, the methods of treating a microbial infection comprise administering to the subject a therapeutically effective amount of an AMP according to any one or more of SEQ ID NOS: 1-100, provided below in Table 1.
TABLE 1
[Table 1, listing SEQ ID NOS: 1-100, appears as images in the published application: Figures imgf000015_0001 through imgf000019_0001.]
[0051] The present disclosure also provides compositions comprising an AMP that has been identified according to any one of the disclosed methods for forming a database-assisted platform and a pharmaceutically acceptable carrier, diluent, or excipient. For example, the compositions may comprise an AMP according to any one or more of SEQ ID NOS: 1-100. In some embodiments of the present methods, two or more AMPs that have been identified according to any one of the disclosed methods for forming a database-assisted platform, such as two or more of SEQ ID NOS: 1-100, may be administered to the subject or included in the present compositions.
[0052] As used herein, the phrase “therapeutically effective amount” refers to the amount of active agent (here, the antimicrobial peptide) that elicits the biological or medicinal response that is being sought in a tissue, system, animal, individual or human by a researcher, veterinarian, medical doctor or other clinician, which includes one or more of the following: (1) at least partially preventing the disease or condition or a symptom thereof; for example, preventing a disease, condition or disorder in an individual who may be predisposed to the disease, condition or disorder but does not yet experience or display the pathology or symptomatology of the disease;
(2) inhibiting the disease or condition; for example, inhibiting a disease, condition or disorder in an individual who is experiencing or displaying the pathology or symptomatology of the disease, condition or disorder (i.e., including arresting further development of the pathology and/or symptomatology); and
(3) at least partially ameliorating the disease or condition; for example, ameliorating a disease, condition or disorder in an individual who is experiencing or displaying the pathology or symptomatology of the disease, condition or disorder (i.e., including reversing the pathology and/or symptomatology).
[0053] The antimicrobial peptides that are administered, contacted with a biofilm, or included in a composition according to the present disclosure may be provided in a composition that is formulated for any type of administration. For example, the compositions may be formulated for administration orally, topically, parenterally, enterally, or by inhalation (e.g., intranasally). The active agent may be formulated for neat administration, or in combination with conventional pharmaceutical carriers, diluents, or excipients, which may be liquid or solid. The applicable solid carrier, diluent, or excipient may function as, among other things, a binder, disintegrant, filler, lubricant, glidant, compression aid, processing aid, color, sweetener, preservative, suspending/dispersing agent, tablet-disintegrating agent, encapsulating material, film former or coating, flavoring agent, or printing ink. Any material used in preparing any dosage unit form is preferably pharmaceutically pure and substantially non-toxic in the amounts employed. In addition, the active agent may be incorporated into sustained-release preparations and formulations. Administration in this respect includes administration by, inter alia, the following routes: intravenous, intramuscular, subcutaneous, intraocular, intrasynovial, transepithelial including transdermal, ophthalmic, sublingual and buccal; topically including ophthalmic, dermal, ocular, rectal and nasal inhalation via insufflation, aerosol, and rectal systemic.
[0054] In powders, the carrier, diluent, or excipient may be a finely divided solid that is in admixture with the finely divided active ingredient. In tablets, the active ingredient is mixed with a carrier, diluent or excipient having the necessary compression properties in suitable proportions and compacted in the shape and size desired. For oral therapeutic administration, the active compound may be incorporated with the carrier, diluent, or excipient and used in the form of ingestible tablets, buccal tablets, troches, capsules, elixirs, suspensions, syrups, wafers, and the like. The amount of active agent(s) in such therapeutically useful compositions is preferably such that a suitable dosage will be obtained.
[0055] Liquid carriers, diluents, or excipients may be used in preparing solutions, suspensions, emulsions, syrups, elixirs, and the like. The active ingredient of this invention can be dissolved or suspended in a pharmaceutically acceptable liquid such as water, an organic solvent, a mixture of both, or pharmaceutically acceptable oils or fat. The liquid carrier, excipient, or diluent can contain other suitable pharmaceutical additives such as solubilizers, emulsifiers, buffers, preservatives, sweeteners, flavoring agents, suspending agents, thickening agents, colors, viscosity regulators, stabilizers, or osmo-regulators.
[0056] Suitable solid carriers, diluents, and excipients may include, for example, calcium phosphate, silicon dioxide, magnesium stearate, talc, sugars, lactose, dextrin, starch, gelatin, cellulose, methyl cellulose, ethyl cellulose, sodium carboxymethyl cellulose, microcrystalline cellulose, polyvinylpyrrolidine, low melting waxes, ion exchange resins, croscarmellose carbon, acacia, pregelatinized starch, crospovidone, HPMC, povidone, titanium dioxide, polycrystalline cellulose, aluminum methahydroxide, agar-agar, tragacanth, or mixtures thereof.
[0057] Suitable examples of liquid carriers, diluents and excipients, for example, for oral, topical, or parenteral administration, include water (particularly containing additives as above, e.g. cellulose derivatives, preferably sodium carboxymethyl cellulose solution), alcohols (including monohydric alcohols and polyhydric alcohols, e.g. glycols) and their derivatives, and oils (e.g. fractionated coconut oil and arachis oil), or mixtures thereof.
[0058] For parenteral administration, the carrier, diluent, or excipient can also be an oily ester such as ethyl oleate and isopropyl myristate. Also contemplated are sterile liquid carriers, diluents, or excipients, which are used in sterile liquid form compositions for parenteral administration. Solutions of the active agents can be prepared in water suitably mixed with a surfactant, such as hydroxypropylcellulose. A dispersion can also be prepared in glycerol, liquid polyethylene glycols, and mixtures thereof and in oils. Under ordinary conditions of storage and use, these preparations may contain a preservative to prevent the growth of microorganisms.
[0059] The pharmaceutical forms suitable for injectable use include, for example, sterile aqueous solutions or dispersions and sterile powders for the extemporaneous preparation of sterile injectable solutions or dispersions. In all cases, the form is preferably sterile and fluid to provide easy syringability. It is preferably stable under the conditions of manufacture and storage and is preferably preserved against the contaminating action of microorganisms such as bacteria and fungi. The carrier, diluent, or excipient may be a solvent or dispersion medium containing, for example, water, ethanol, polyol (for example, glycerol, propylene glycol, liquid polyethylene glycol and the like), suitable mixtures thereof, and vegetable oils. The proper fluidity can be maintained, for example, by the use of a coating, such as lecithin, by the maintenance of the required particle size in the case of a dispersion, and by the use of surfactants. The prevention of the action of microorganisms may be achieved by various antibacterial and antifungal agents, for example, parabens, chlorobutanol, phenol, sorbic acid, thimerosal and the like. In some instances, the antimicrobial peptides themselves may be sufficient to prevent contamination by microorganisms. In many cases, it will be preferable to include isotonic agents, for example, sugars or sodium chloride. Prolonged absorption of the injectable compositions may be achieved by the use of agents delaying absorption, for example, aluminum monostearate and gelatin.
[0060] Sterile injectable solutions may be prepared by incorporating the active agent in the pharmaceutically appropriate amounts, in the appropriate solvent, with various of the other ingredients enumerated above, as required, followed by filtered sterilization. Generally, dispersions may be prepared by incorporating the sterilized active ingredient into a sterile vehicle which contains the basic dispersion medium and the required other ingredients from those enumerated above. In the case of sterile powders for the preparation of sterile injectable solutions, the preferred methods of preparation may include vacuum drying and freeze drying techniques that yield a powder of the active ingredient or ingredients, plus any additional desired ingredient from the previously sterile-filtered solution thereof.
[0061] Thus, an antimicrobial peptide may be in the present compositions and methods in an effective amount by any of the conventional techniques well-established in the medical field. For example, the administration may be in the amount of about 0.1 mg/day to about 500 mg per day. In some embodiments, the administration may be in the amount of about 250 mg/kg/day. Thus, administration may be in the amount of about 0.1 mg/day, about 0.5 mg/day, about 1.0 mg/day, about 5 mg/day, about 10 mg/day, about 20 mg/day, about 50 mg/day, about 100 mg/day, about 200 mg/day, about 250 mg/day, about 300 mg/day, or about 500 mg/day.
[0062] Also disclosed are methods comprising contacting a biofilm with an effective amount of an AMP that has been identified according to a method for forming a database-assisted platform according to the present disclosure. In some embodiments, the AMP comprises one or more of SEQ ID NOS: 1-100. Such methods may be effective to remove or reduce the presence of an unwanted biofilm, such as in hospitals or other medical settings, in sewer and filtration systems, in industrial settings, on equipment involved in food preparation or manufacture, in aquaculture or hydroponics, or in any other context that is prone to unwanted biofilm formation.
[0063] In accordance with the methods of treating a microbial infection in a subject or the methods comprising contacting a biofilm according to the present disclosure, microbes against which the present antimicrobial peptides are effective may be, for example, any unicellular organism, such as gram-negative bacteria, gram-positive bacteria, protozoa, viruses, bacteriophages, and archaea. The present peptides can have an antimicrobial effect with respect to any such microbe. Examples of bacteria against which the present compounds are effective to cause reduction in numbers include gram positive bacteria and gram negative bacteria, for example, Salmonella enterica, Listeria monocytogenes, Escherichia coli, Clostridium botulinum, Clostridium difficile, Campylobacter, Bacillus cereus, Vibrio parahaemolyticus, Vibrio cholerae, Vibrio vulnificus, Staphylococcus aureus, Yersinia enterocolitica, Shigella, Moraxella spp., Helicobacter, Stenotrophomonas, Bdellovibrio, Legionella spp. (e.g., pneumophila), Neisseria gonorrhoeae, Neisseria meningitidis, Haemophilus influenzae, Acinetobacter baumannii, Klebsiella pneumoniae, Pseudomonas aeruginosa, Proteus mirabilis, Enterobacter cloacae, Enterococcus faecium, Serratia marcescens, Helicobacter pylori, Salmonella enteritidis, Salmonella typhi, and combinations thereof. Examples of Salmonella enterica serovars that can be reduced using the compounds of the disclosure include, for example, Salmonella enteritidis, Salmonella typhimurium, Salmonella poona, Salmonella heidelberg, and Salmonella anatum. Exemplary viruses against which the present peptides are effective to cause reduction in numbers include coronaviruses, rhinoviruses, and influenza viruses.
Examples
[0064] The present invention is further defined in the following Examples. It should be understood that the examples, while indicating preferred embodiments of the invention, are given by way of illustration only, and should not be construed as limiting the appended claims. From the above discussion and the examples, one skilled in the art can ascertain the essential characteristics of this invention, and without departing from the spirit and scope thereof, can make various changes and modifications of the invention to adapt it to various usages and conditions.
Example 1 - Development of AMPSphere Platform, Including Characterization of AMPs
[0065] The present inventors developed AMPSphere, which incorporates c_AMPs predicted with ML using Macrel42, a pipeline that uses random forests to predict AMPs from large peptide datasets with an emphasis on precision over recall. It was applied to 63,410 globally distributed publicly available metagenomes (Fig. 1A) and 87,920 high-quality bacterial and archaeal genomes43. Sequences present in a single sample were removed42, except when they had a significant match (defined as amino acid identity > 75% and E-value < 10⁻⁵) to a sequence in the AMP-dedicated database Data Repository of Antimicrobial Peptides (DRAMP) version 3.046. This resulted in 5,518,294 genes, 0.1% of the total predicted smORFs, coding for 863,498 non-redundant c_AMPs (on average 37±8 residues long; Fig. 1A and 8). Similar to validated sequences with antimicrobial activity42,47-48, c_AMPs from AMPSphere present a positive charge (4.7±2.6), high isoelectric point (10.9±1.2), amphiphilicity (hydrophobic moment, 0.6±0.1) and a potential to bind to membranes or other proteins (Boman index, 1.14±1.1). As expected, in general, the distributions of physicochemical properties of peptides from AMPSphere, DRAMP46 version 3.0, and the positive training dataset used in Macrel42 are more similar to each other than to the negative training set (assumed to not be AMPs). Nonetheless, c_AMPs from AMPSphere are on average longer (37±8 residues) than those in DRAMP46 version 3.0 (28±22 residues) and we observed differences in the distribution of other features (e.g., charge, aliphaticity, amphipathicity, and isoelectric point, Fig. 8).
[0066] The quality of the smORF predictions was subsequently estimated, and 20% (172,840) of the c_AMP sequences were detected in independent publicly available metaproteomes or metatranscriptomes (Fig. 2 and 9A and Methods - Quality control of c_AMPs) belonging to several habitats included in the AMPSphere, such as the human gut, plants, and others. All c_AMPs were then subjected to a bundle of in silico quality tests (see topic Quality control of c_AMPs). A subset of c_AMPs (9.2%, or 80,213 c_AMPs) passed all of them, and this subset is hereafter designated as high-quality. When tested with other AMP prediction systems (AMPScanner v249, the model for mature peptides in ampir40, amPEPpy50, APIN51, AI4AMP52, and AMPLify53), 98.4% (849,703 peptides) of AMPSphere c_AMPs were also predicted as AMPs by at least one other AMP prediction system. Approximately 15% (132,440 out of 863,498 peptides) of AMPSphere c_AMPs were co-predicted by all methods used.
[0067] Only 0.7% of the identified c_AMPs (6,339 peptides) are homologous (operationally defined as amino acid identity > 75% and E-value < 10⁻⁵) to experimentally validated AMP sequences in the DRAMP version 3.046. Moreover, most c_AMPs were also absent from protein databases not specific to AMPs (Fig. 1B), such as the Small Proteins database (SmProt2)54 or the Global Microbiome Gene Catalogue of canonical-length proteins (GMGCv1)53, suggesting that c_AMPs represent an entirely novel region of peptide sequence space. In total, it was possible to find only 73,774 (8.5%) c_AMPs with homologs in any of the databases we considered. High-quality c_AMPs were detected in public databases with higher frequency than general c_AMPs (2.5-fold, P_Hypergeom. = 4.2 × 10⁻²⁵⁰, Fig. 1B), with 23,012 out of the 80,213 high-quality c_AMPs having a match in another database. However, it is notable that 76.4% (4,843 peptides out of 6,339) of those c_AMPs which have a homolog in DRAMP46 version 3.0 (and, therefore, highly likely to be functional) are not high-quality c_AMPs. Thus, while the quality tests did enrich for validated sequences, a failure to pass the tests is not a sufficient reason to conclude that the sequence is not active.
[0068] To put c_AMPs in an evolutionary context, peptides were hierarchically clustered using a reduced amino acid alphabet of 8 letters56. The three sequence clustering levels adopted identity cutoffs of 100%, 85%, and 75% (Fig. 10). At the 75% identity level, 521,760 protein clusters were obtained, of which 405,547 were singletons, corresponding to 47% of all c_AMPs from AMPSphere. A total of 78,481 (19.3%) of these singletons were detected in metatranscriptomes or metaproteomes from various sources, indicating that they were not artifacts. The large number of singletons suggests that most c_AMPs originated from processes other than diversification within families, which is the opposite of the hypothesized origin of full-length proteins, in which singleton families are rare55. The 8,788 clusters with >8 peptides obtained at 75% of identity are hereafter named “families”, as in Sberro et al.38. Among them, 6,499 were considered as high-quality families because they contained evidence of translation or transcription, or >75% of their sequences pass all in silico quality tests, regardless of whether experimental evidence is available (see Methods - AMP families). These high-quality families span 15.4% of the AMPSphere (133,309 peptides).
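The family definitions in the preceding paragraph can be expressed as a minimal sketch. The function and parameter names are hypothetical, and the cutoffs follow the paragraph as written (more than 8 peptides per cluster, more than 75% of members passing all in silico tests). The inputs are assumed to be precomputed sets of peptide identifiers.

```python
def find_families(clusters, size_cutoff=8):
    """Return clusters (id -> member list) with more than `size_cutoff` peptides."""
    return {cid: members for cid, members in clusters.items()
            if len(members) > size_cutoff}


def is_high_quality(members, expressed, passed_tests, min_pass_fraction=0.75):
    """A family is high-quality if any member has transcription/translation
    evidence, or if more than `min_pass_fraction` of members pass all tests."""
    if not members:
        return False
    if any(m in expressed for m in members):
        return True
    n_pass = sum(m in passed_tests for m in members)
    return n_pass / len(members) > min_pass_fraction
```

Keeping the two criteria in one function mirrors the "or" in the definition: experimental evidence is sufficient but not required.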
[0069] All the predicted c_AMPs can be accessed online (Figure imgf000026_0001, Figure imgf000026_0002). The platform permits users to retrieve the peptide sequences, ORFs, and predicted biochemical properties of each c_AMP (e.g., molecular weight, isoelectric point, and net charge at pH 7.0). The platform also provides the distribution across geographical regions, habitats, and microbial species for each c_AMP.
[0070] The AMPSphere spanned 72 different habitats, which were classified into eight high-level habitat groups, e.g., soil/plant (36.6% of c_AMPs in AMPSphere), aquatic (24.8%), and human gut (13%) (Fig. 1A). Most of the habitats, except for the human gut, appeared to be far from saturated in terms of newly discovered c_AMPs (Fig. 1C). In fact, most AMPs were rare (median number of detections is 99, or 0.17% of the dataset; when restricted to high-quality c_AMPs, the median number of detections is 81, or 0.14% of the dataset), with 83.97% being observed in <1% of samples (Fig. 9). Only 10.8% (93,280) of c_AMPs were detected in more than one high-level habitat group (henceforth, “multi-habitat c_AMPs”); this fraction is 7.25-fold smaller than would be expected by a random assignment of habitats to samples (P_permutation < 10⁻³⁰⁰, see infra). Even within high-level habitat groups, c_AMPs overlap between habitats much less frequently than expected by chance (2.4 to 192-fold less, P_permutation < 5.4 × 10⁻⁵⁰, see infra & Fig. 1D).
[0071] Mutations in larger genes generate c_AMPs as independent genomic entities. Many AMPs are generated post-translationally by the fragmentation of larger proteins17. For example, encrypted peptides (EPs) are computationally detected fragments from protein sequences within the human proteome and other proteomes, which have been shown to be highly active4,10. EPs present diverse secondary structures and act on the membrane of bacterial cells, similar to known natural AMPs, but have different physicochemical features compared to known AMPs4,33. AMPSphere only considered peptides encoded by dedicated genes. Nonetheless, it was hypothesized that some of these have originated from larger proteins by fragmentation at the genomic level. To explore this, the AMPSphere c_AMPs were aligned to the full-length proteins in GMGCv155 and it was observed that about 7% (61,020) of them are homologous to a canonical-length protein (Fig. 1B), with 27% of these hits sharing the start codon with the longer protein. This suggests early termination of full-length proteins as one mechanism for generating novel c_AMPs (Fig. 3A and 3B).
[0072] To investigate the function of the full-length proteins homologous to AMPs, the matching proteins from GMGCv153 were mapped to orthologous groups (OGs) from eggNOG 5.057. A total of 3,792 (out of 43,789) OGs were significantly enriched (P_Hypergeom. < 0.05, after multiple hypothesis corrections with the Holm-Sidak method) among the hits from AMPSphere. Although OGs of unknown function comprise 53.8% of all identified OGs, when considered individually, these OGs are, on average, smaller than OGs in other categories. Thus, despite each OG having a relatively small number of c_AMP hits, when compared to the background distribution of the OGs in GMGCv155, OGs of unknown function were the most enriched among the c_AMP hits, with an average enrichment of 10,857-fold (P_Mann < 3.9 × 10⁻⁴; Fig. 3C).
[0073] c_AMP genes may arise after gene duplication events. Next, the question was raised of whether c_AMPs would be predominantly present in specific genomic contexts. To investigate the functions of the neighboring genes of the c_AMPs, they were mapped against 169,484 genomes included in a recent study58. A total of 38.9% (21,465 out of 55,191) of c_AMPs with more than two homologs in different genomes in the database showed phylogenetically conserved genomic context with genes of known function (see infra, Methods - Genomic context conservation analysis). This holds true for curated versions of the catalog: 35.32% of high-quality c_AMPs and 32.06% of high-quality c_AMPs with experimental evidence show conserved genomic neighbors. These conservation values were similar to that of 3,899,674 gene families with more than two homologs calculated de novo on the gene catalog (34.4%), indicating that the genomic location of c_AMPs is not random.
[0074] Despite being involved in similar processes, c_AMPs were generally depleted from conserved genomic contexts involving known systems of antibiotic synthesis and resistance, even when compared to small protein families (Fig. 4). Instead, it was found that c_AMPs are encoded in conserved genomic contexts with ribosomal genes (23.6%) at a higher frequency than other gene families (4.75%, Fig. 4A).
[0075] Most of the c_AMPs (2,201 out of the 2,642) in a conserved context with ribosomal subunits were homologous to ribosomal proteins (Fig. 4D), congruent with the observation that, in some species, ribosomal proteins have antimicrobial properties39. Seventy-seven c_AMPs homologous to ribosomal proteins were also homologous to a ribosomal gene in their immediate vicinity (up to 1 gene up/downstream). This phenomenon is not exclusive to ribosomal proteins: 1,951 c_AMPs can be annotated to the same KEGG Orthologous Group (KO) as some of their immediate neighbors and may have originated from gene duplication events. This shared annotation was interpreted, in this context, as evidence for a common evolutionary origin and not as a functional prediction for the c_AMPs. These duplications may have arisen by recombination of flanking homologous sequences, which can happen during cell division60-62. Interestingly, 1,635 (83.8%) of these c_AMPs were located upstream of the neighbor with the same KO annotation. Different permeases and transposases are the most common KOs assigned to c_AMPs and their neighbors (400 and 125 c_AMPs, respectively).
[0076] Most c_AMPs are members of the accessory pangenome. It was observed that only a small portion (5.9%, P_permutation = 4.8 × 10⁻³, N_species = 416) of c_AMP families present in ProGenomes243 are contained in >95% of genomes from the same species (Fig. 5), here referred to as “core”63. This is consistent with previous work, in which AMP production was observed to be strain-specific64. In contrast, a high proportion (circa 68.8%) of full-length protein families are core in ProGenomes243 species. There is a 1.9-fold greater chance (P_Fisher = 2.2 × 10⁻⁹²) that a pair of genomes from the same species share at least one c_AMP when they belong to the same strain (99.5% < ANI < 99.99%).
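The "core" criterion above (a family present in >95% of the genomes of a species) can be sketched as follows. The input layout, a mapping from genome identifier to the families found in that genome, and the function name are assumptions made for illustration.

```python
from collections import Counter


def core_families(genome_to_families, threshold=0.95):
    """Return families present in more than `threshold` of a species' genomes.

    `genome_to_families` maps each genome of ONE species to an iterable of
    family identifiers detected in that genome.
    """
    n_genomes = len(genome_to_families)
    if n_genomes == 0:
        return set()
    # Count each family at most once per genome.
    counts = Counter(f for fams in genome_to_families.values() for f in set(fams))
    return {f for f, c in counts.items() if c / n_genomes > threshold}
```

Families that fail the test would be treated as accessory for that species.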
[0077] One example of this strain-specific behavior is AMP10.018_194, the only c_AMP found in Mycoplasma pneumoniae genomes. M. pneumoniae strains are traditionally classified into two groups based on their P1 adhesin gene65. Of the 76 M. pneumoniae genomes present in our study, 29 were classified as type-1, 29 as type-2, and the remaining 18 were undetermined in this classification system66 (see Methods - Determination of accessory AMPs). Twenty-six of the 29 type-2 genomes contain AMP10.018_194, as did 2 undetermined type genomes, but none of the type-1 genomes contain this AMP.
[0078] More transmissible species have lower c_AMP density. The taxonomic composition of AMPSphere was investigated by annotating contigs with the Genome Taxonomy Database (GTDB) taxonomy67,68 (see Methods - c_AMP density in microbial species from different habitats), which resulted in 570,187 c_AMPs being annotated to a genus or species. The genera contributing the most c_AMPs to AMPSphere were Prevotella (18,593 c_AMPs), Bradyrhizobium (11,846 c_AMPs), Pelagibacter (6,675 c_AMPs), Faecalibacterium (5,917 c_AMPs), and CAG-110 (5,254 c_AMPs; see Fig. 5). This distribution reflects the fact that these genera are among those that contribute the most assembled sequences in our dataset (all occupying percentiles above 99.75% among the assembled genera). Therefore, the c_AMP density (ρ_AMP) was calculated by determining the number of c_AMP genes per megabase pair of assembled sequence. To avoid bias due to the unequal sampling of habitats, all the sequences predicted by Macrel42 were included in each sample, including singleton sequences that were subsequently removed and are not part of AMPSphere.
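The density defined above is a simple ratio, sketched here under stated assumptions. The function names and the (species, gene count, assembled base pairs) row layout are hypothetical, and pooling counts across samples before dividing is one plausible aggregation choice, not necessarily the one used in the Examples.

```python
def amp_density_per_mbp(n_amp_genes, assembled_bp):
    """c_AMP genes per megabase pair of assembled sequence."""
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    return n_amp_genes / (assembled_bp / 1_000_000)


def species_density(rows):
    """Pool (species, n_amp_genes, assembled_bp) rows per species, then divide."""
    totals = {}
    for species, n_genes, bp in rows:
        genes, bases = totals.get(species, (0, 0))
        totals[species] = (genes + n_genes, bases + bp)
    return {sp: amp_density_per_mbp(g, b) for sp, (g, b) in totals.items()}
```

Normalizing by assembled sequence is what removes the advantage of genera that simply contribute more contigs.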
[0079] To further explore the importance of AMP production in ecological processes, the role of AMPs in the mother-to-child transmissibility of bacterial species in a recently published paper69 was investigated by correlating the ρ_AMP for each bacterial species to the published measures of microbial transmission. Human gut bacteria showed increased transmissibility at lower AMP densities (R_Spearman = -0.42, P_Holm-Sidak = 3.4 × 10⁻², N_species = 43). Similarly, in human oral microbiome bacterial species, transmissibility from mother to offspring was consistently inversely correlated with their ρ_AMP for the first year (R_Spearman = -0.55, P_Holm-Sidak = 1.4 × 10⁻³, N_species = 41). This suggests that human gut bacteria and oral microbiome bacterial species show increased transmissibility at lower ρ_AMP. Moreover, it highlights the potential influence of ρ_AMP on the transmissibility of gut and oral microbiota, suggesting a link between AMPs and the transmission success rates of microbial species.
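For readers unfamiliar with the statistic, the Spearman correlation used above is the Pearson correlation of rank-transformed values. The following dependency-free sketch illustrates that definition only. It does not compute P values or the Holm-Sidak correction, and the function names are illustrative. In practice a statistics library routine would typically be used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A negative coefficient, as reported for both the gut and oral data, means that species with higher density tend to have lower transmissibility ranks.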
[0080] Physicochemical features and secondary structure of AMPs. To investigate the properties and structure of the synthesized peptides, a comparison was made of their amino acid composition to AMPs from available databases of experimentally-verified sequences (DRAMP46 version 3.0, Database of Antimicrobial Activity and Structure of Peptides - DBAASP70, and Antimicrobial Peptides Database - APD71 version 3). Overall, the composition was similar, as was expected, given that Macrel’s ML model was trained using known AMPs42. Notably, AMPSphere sequences displayed a slightly higher abundance of aliphatic amino acid residues, specifically alanine and valine. However, these AMPSphere sequences consistently differed (Fig. 6A) from EPs4,10,33. The resemblances in amino acid composition between the identified c_AMPs and known AMPs suggested similar physicochemical characteristics and secondary structures, both of which are recognized for their influence on antimicrobial activity16. The c_AMPs exhibited comparable hydrophobicity, net charge, and amphiphilicity to AMPs sourced from databases (Fig. 8). Furthermore, they displayed a slight propensity for disordered conformations (Fig. 6B) and had a lower net positive charge compared to other EPs (Fig. 6A).
[0081] To evaluate the structural and antimicrobial properties of c_AMPs from AMPSphere, AMPSphere was first filtered for peptides that were predicted as suitable for in vitro assays, namely those predicted to be soluble in aqueous solution and easy to synthesize chemically. A set of high-quality AMPs with 50 peptide sequences was selected based on prevalence and taxonomic diversity (see Methods - Selection of peptides for synthesis and activity testing). Additionally, to provide an unbiased evaluation of the novel peptides described here, any peptides with a homolog in one of the published databases were first excluded, and 50 additional peptides were then randomly selected from the AMPSphere, including 25 peptides with AMP probability of at least 0.6 (as reported by Macrel42) and 25 peptides with lower probabilities (0.5-0.6).
[0082] Subsequently, experimental assessments of the secondary structure of the active c_AMPs were conducted using circular dichroism (Fig. 6B and 11). Similar to AMPs documented in databases, peptides derived from AMPSphere exhibited varying propensities for adopting α-helical structures, and some of them were unstructured or adopted β-antiparallel conformations in all media analyzed. Notably, they also displayed an unusually high content of β-antiparallel structure in both water and methanol/water mixtures (Fig. 6B), despite their amino acid composition similarities to AMPs and EPs. These findings were attributed to the slightly elevated occurrence of alanine and valine residues, which are known to favor β-like structures with a preference for β-antiparallel conformation72.
[0083] Methods.
[0084] Selection of metagenomes and high-quality microbial genomes. Selection of metagenomes and genomes to compose the AMPSphere was similar to that adopted by Coelho et al.53,130. Public metagenomes available on 1 January 2020 produced with Illumina instruments (except for MiSeq, to ensure the consistency and reliability of the meta-analysis findings), with at least 2 million reads and, on average, 75 bp long, were downloaded from the European Nucleotide Archive (ENA). These samples met two criteria: (1) they were tagged with taxonomy ID 408169 (for metagenome) or were a descendent of it in the taxonomic tree; and/or (2) they came from experiments with the library source listed as “METAGENOMIC”. Samples were grouped by project and all projects with at least 20 samples were included for analysis. Additionally, metagenomes deposited by the Integrated Microbial Genomes System (IMG) missing from ENA were also included. Metadata was manually curated from each sample’s describing literature and Biosamples database127. For habitat classification groups were created based on the similarity of habitat conditions, such as air, anthropogenic, aquatic, host-associated, ph:alkaline, sediment, terrestrial, and others. The sample origins and information related to host species were obtained using the NCBI taxonomic identification number. High-quality microbial genomes were selected from ProGenomes2 database43.
[0085] Reads trimming and assembly. Reads were processed using NGLess96, trimming positions with quality lower than 25 and discarding reads shorter than 60 bp post-trimming. For metagenomes obtained from a host-associated microbiome, reads mapping to the host genome were filtered out when the host genome was available. Reads totaling more than 14.7 trillion base pairs of sequenced DNA were assembled with MEGAHIT 1.2.9112, and the taxonomy of the 16,969,685,977 contigs generated was inferred as previously described131, using MMSeqs299 to map the sequences against the GTDB release 9567,68. Mapped taxonomy lineages were then manually curated to conform to the International Code of Nomenclature of Prokaryotes132,133.
[0086] smORF and AMP prediction. Analogously to Sberro et al.38, a modified version of Prodigal34 was used to predict smORFs (33 to 303 bp) from contigs. The 4,599,187,424 redundant smORFs, most of which (99.25%) originated in metagenomes, were then de-duplicated to optimize the computational resource usage, yielding 2,724,621,233 non-redundant smORFs. Macrel42 was run on the de-duplicated smORFs to predict c_AMPs. Singleton sequences (those appearing in a single sample or genome) were eliminated, except when they had a significant match (amino acid identity > 75% and E-value < 10⁻⁵) to a sequence from the Data Repository of Antimicrobial Peptides (DRAMP)46 version 3.0 using the ‘easy-search’ method from MMSeqs299. In total, AMPSphere encompassed 863,498 non-redundant predicted c_AMPs encoded by 5,518,294 redundant genes. AMP densities were estimated as the number of AMPs per assembled base pairs in a sample or a species.
[0087] AMP genes originating from ProGenomes243 had the taxonomy of the original genome assigned to them, whereas AMP genes from metagenomes were assigned the taxonomy predicted for the contig where they were found. Insights about potential structural conformations were obtained using the function secondary_structure_fraction from the ProtParam module implemented in the SeqUtils package of Biopython107. This function calculates the fraction of amino acids that tend to assume conformations of helix [VIYFWL], turn [NPGS], and sheet [EMAL].
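The residue-class calculation described above can be illustrated with a minimal, dependency-free sketch. It is not the Biopython implementation; it only mirrors the three residue classes listed in the text, and the function name and the example peptide are hypothetical.

```python
# Residue classes exactly as listed in the text above.
HELIX = set("VIYFWL")
TURN = set("NPGS")
SHEET = set("EMAL")


def secondary_structure_fraction(peptide):
    """Return (helix, turn, sheet) fractions of residues in `peptide`.

    Illustrative re-implementation; a residue such as L belongs to two
    classes, so the three fractions need not sum to 1.
    """
    seq = peptide.upper()
    if not seq:
        raise ValueError("empty peptide")
    n = len(seq)
    helix = sum(aa in HELIX for aa in seq) / n
    turn = sum(aa in TURN for aa in seq) / n
    sheet = sum(aa in SHEET for aa in seq) / n
    return helix, turn, sheet


# Hypothetical peptide used only for illustration.
fractions = secondary_structure_fraction("VVAA")
```

Because leucine appears in both the helix and the sheet class, a leucine-rich peptide scores high on both fractions at once.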
[0088] Clustering of AMP families. Clustering peptides by sequence identity is only possible at high identities, as short low-/medium-identity matches are possible by chance. Therefore, aiming to recover matches where basic features are preserved even if individual amino acids are not identical134,133, a reduced amino acid alphabet of 8 letters was used56: [LVIMC], [AG], [ST], [FYW], [EDNQ], [KR], [P], [H]. c_AMPs were hierarchically clustered after alphabet reduction using three sequential identity cutoffs (100%, 85%, and 75%) with CD-Hit98. A cluster was considered an AMP family when it consisted of at least 8 sequences38. Representative sequences of peptide clusters were selected according to their length (taking the longest), with ties being broken by their alphabetical order.
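The alphabet reduction that precedes clustering can be sketched as follows. The eight groups are taken from the text; representing each group by its first letter is an assumption made here for illustration, and the clustering step itself (CD-Hit) is not reproduced.

```python
# The eight residue groups listed in the text.
GROUPS = ["LVIMC", "AG", "ST", "FYW", "EDNQ", "KR", "P", "H"]

# Assumption: each group is represented by its first letter.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_alphabet(peptide):
    """Rewrite a peptide in the reduced 8-letter alphabet.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(REDUCE[aa] for aa in peptide.upper())


# Two hypothetical peptides that differ at every position but share
# the same physicochemical pattern collapse to one reduced sequence.
same_pattern = reduce_alphabet("KLAS") == reduce_alphabet("RVGT")
```

After reduction, peptides that differ only by substitutions within a group become identical, which is what lets the 100% identity level merge them.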
[0089] To validate this clustering procedure, a sample of 3,000 sequences randomly sampled from AMPSphere was used, excluding cluster representatives. These sequences were aligned against the representative sequence of their cluster using the Smith-Waterman algorithm136 with the BLOSUM62 cost matrix, and gap open and extension penalties of -10 and -0.5, respectively. The alignment score was then converted to an E-value according to the model by Karlin and Altschul137, which uses the constants K (0.132539) and λ (0.313667) adjusted to search for a short input sequence as implemented in the BLAST algorithm120,138. Alignments were considered significant if their E-value was less than 10⁻⁵. We found that more than 95.3% of alignments produced in the first two levels (100% and >85% of identity) were significant, along with 77.1% of those from the third level (>75% of identity) - see Fig. 10.
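A minimal sketch of the score-to-E-value conversion follows, using the general Karlin-Altschul form E = K·m·n·exp(-λS) with the two constants quoted above. How the search space (m and n) was defined is not stated in the text, so the two length arguments here are assumptions, and BLAST's edge-effect length corrections are not applied.

```python
import math

K = 0.132539       # constant quoted in the text
LAMBDA = 0.313667  # constant quoted in the text


def karlin_altschul_evalue(score, query_len, target_len):
    """E-value for a raw alignment score under the Karlin-Altschul model.

    `query_len` and `target_len` stand in for the search-space terms m
    and n; their definition is an assumption of this sketch.
    """
    return K * query_len * target_len * math.exp(-LAMBDA * score)


def is_significant(score, query_len, target_len, threshold=1e-5):
    """Apply the significance cutoff used in the text (E-value < 1e-5)."""
    return karlin_altschul_evalue(score, query_len, target_len) < threshold
```

The E-value shrinks exponentially with the score and grows linearly with the search space, so short peptides need comparatively high scores to reach significance.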
[0090] Quality control of c_AMPs. The c_AMPs in AMPSphere were submitted to another six AMP prediction systems (AMPScanner v249, ampir40 - with the model for mature peptides, amPEPpy50, APIN51 - with their proposed model, AI4AMP52, and AMPLify53).
[0091] The genes of c_AMPs were subjected to five different quality tests to reduce the likelihood that the observed peptides were artifacts or fragments of larger proteins. Initially, the peptides were searched against AntiFam v.7.0123, which was designed to identify commonly recurring spuriously predicted ORFs, using HMMSearch109 with the option “--cut_ga”. Fewer than 0.05% of c_AMPs had any significant hits.
[0092] For each smORF, a search was performed for an in-frame stop codon upstream of its start codon. When no stop codon is found, we cannot rule out the possibility that the smORF is part of a larger gene which we cannot observe due to fragmented assembly. Most (68.4%) of the c_AMPs are encoded by at least one gene that is not terminally placed. However, the fact that a c_AMP is terminal does not imply that the given c_AMP is an artifact, since the AMP genes are short enough to be recovered even in short contigs. For example, 72.9% (4,622/6,339) of homologs to DRAMP46 version 3.0 were found as terminal c_AMPs in AMPSphere.
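The terminal placement test can be sketched as a scan that walks upstream from the start codon in steps of three nucleotides. This is a simplified illustration that assumes a forward-strand smORF, a 0-based start coordinate, and the standard stop codons; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons


def has_upstream_inframe_stop(contig, start):
    """Return True if an in-frame stop codon precedes the start codon.

    `contig` is the forward-strand nucleotide sequence and `start` is
    the 0-based index of the first base of the smORF start codon.
    Returning False means the contig edge was reached first, i.e. the
    smORF is terminally placed and may belong to a larger gene.
    """
    pos = start - 3
    while pos >= 0:
        if contig[pos:pos + 3].upper() in STOP_CODONS:
            return True
        pos -= 3
    return False
```

A stop codon that is present upstream but out of frame does not count, because it would not terminate translation in the reading frame of the smORF.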
[0093] The RNAcode84 program predicts protein-coding regions based on evolutionary signatures typical for protein genes. This analysis depends on a set of homologous and non-identical genes. Therefore, AMP clusters containing at least three gene variants were aligned. Given that an extensive portion of the AMPSphere candidates (53%; 459,910 out of 863,498) is not part of such a cluster, they could not be tested. Of the tested c_AMPs, 53% (215,421 out of 403,588) were considered genes with evolutionary traits of protein-coding sequences.
[0094] The inventors looked for evidence of transcription and/or translation using 221 publicly available metatranscriptomes, comprising human gut (142), peat (48), plant (13), and symbionts (17); and 109 publicly available metaproteomes from the PRIDE129 database comprising 37 habitats. Using bwa v.0.7.17113, reads from the metatranscriptomes were mapped against non-redundant AMP genes, and, using NGLess96, genes were selected with at least one read mapped across a minimum of two samples to increase our confidence. This approach is similar to that adopted when predicting AMPs42. Using regular expressions implemented in Python 3.8100, k-mers of all AMPSphere peptides (with length equal to at least half the length of the sequence) were compared to peptide sequences in metaproteomics data. A perfect match between a k-mer and a metaproteomic peptide was considered additional evidence that this c_AMP is likely to be translated, as described by Ma et al.6. Briefly, the number of c_AMP peptides mapped against the set of metaproteomic samples was counted and those c_AMP peptides with at least one match covering more than 50% of the peptide were marked as detected. c_AMPs with experimental evidence in metatranscriptomes and/or metaproteomes accounted for circa 20% of the AMPSphere.
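The metaproteome matching step can be approximated with the following sketch. It uses set membership instead of the regular expressions mentioned in the text, and it treats a c_AMP as detected when any of its k-mers covering more than 50% of its length exactly equals a metaproteomic peptide; both function names are hypothetical.

```python
def kmers_covering_half(peptide):
    """Yield every k-mer of `peptide` longer than half its length."""
    n = len(peptide)
    min_k = n // 2 + 1  # strictly more than 50% of the peptide
    for k in range(min_k, n + 1):
        for i in range(n - k + 1):
            yield peptide[i:i + k]


def is_detected(peptide, metaproteome_peptides):
    """True if any qualifying k-mer exactly matches an observed peptide."""
    observed = set(metaproteome_peptides)
    return any(kmer in observed for kmer in kmers_covering_half(peptide))
```

For a 10-residue c_AMP, only metaproteomic peptides of 6 or more residues can produce a detection under this rule.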
[0095] The mapping of c_AMPs was performed without considering genomic context, which may have led to an overestimation of candidates being identified as potentially transcribed. For example, if they are homologous to longer proteins, the presence of the longer gene may lead to a false positive detection of the shorter c_AMP. We investigated this using Fisher’s Exact Test to compare the percentage of AMP homologs to the GMGCv153 database with experimental evidence of translation (3.4% - 2,073 out of 61,020 peptides, Odds-Ratio = 4.3, P (Fisher’s exact) < 10⁻³⁰⁰) and/or transcription (22.8% - 13,901 out of 61,020 peptides, Odds-Ratio = 1.2, P (Fisher’s exact) = 6.7 × 10⁻¹⁰⁸). The results suggest that our approach tends to slightly overestimate the potential transcription and translation of candidates with canonical-length homologs.
[0096] Given that only a small number of transcriptomic or proteomic datasets were available and the aforementioned limitations in interpreting the mappings, we considered AMPs passing all quality-control tests to be high-quality, regardless of evidence of translation or transcription. We further separated those with experimental evidence of translation/transcription (17,115 c_AMPs, circa 2% of AMPSphere) and those without it (63,098 c_AMPs, circa 7%). For c_AMP families, we considered high-quality those where > 75% of their c_AMPs pass all quality control tests or those with at least one c_AMP possessing experimental evidence of translation/transcription.
[0097] Sample-based c_AMP accumulation curves. To determine the saturation of c_AMP discovery, for each habitat or group of habitats, sample-based accumulation curves were computed by randomly sampling metagenomes in steps of 10 metagenomes. This procedure was repeated 32 times, and the average was taken.
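One way to compute such a curve is sketched below. It assumes each metagenome is represented as a set of c_AMP identifiers and that "random sampling" means drawing a random order of samples without replacement; the function name, the seed argument, and the data layout are illustrative assumptions.

```python
import random


def accumulation_curve(samples, step=10, repeats=32, seed=0):
    """Average number of distinct c_AMPs seen after every `step` samples.

    `samples` is a list of sets of c_AMP identifiers, one per metagenome.
    Returns a list of (n_samples, mean_distinct_c_AMPs) pairs.
    """
    rng = random.Random(seed)
    checkpoints = list(range(step, len(samples) + 1, step))
    totals = [0.0] * len(checkpoints)
    for _ in range(repeats):
        order = samples[:]
        rng.shuffle(order)  # one random sampling order per repeat
        seen = set()
        idx = 0
        for i, sample in enumerate(order, start=1):
            seen |= sample
            if idx < len(checkpoints) and i == checkpoints[idx]:
                totals[idx] += len(seen)
                idx += 1
    return [(n, total / repeats) for n, total in zip(checkpoints, totals)]
```

A curve that flattens as more metagenomes are added suggests that c_AMP discovery in that habitat is approaching saturation.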
[0098] Multi-habitat and rare c_AMPs. First, c_AMPs present in >2 habitats (“multi-habitat AMPs”) were counted. To then test the significance of this value, we opted for a similar approach to that described in Coelho et al.55: habitat labels for each sample were shuffled 100 times and the number of resulting multi-habitat c_AMPs was counted. Shuffling labels resulted in 676,489.7 ± 4,281.8 multi-habitat c_AMPs by chance for high-level habitat groups, and in 685,477.17 ± 4,369.6 multi-habitat c_AMPs by chance when looking at the habitats individually inside the high-level groups. The Shapiro-Wilk test was used to check that the resulting data distribution is normal (P = 0.49, for specific habitats; P = 0.1 for high-level habitats). In the original (non-shuffled) data, high-level habitat groups presented 93,280 multi-habitat c_AMPs (136.21 standard deviations below the shuffled value), while specific habitats presented 173,955 multi-habitat c_AMPs (117.1 standard deviations below the shuffled value).
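The reported distances in standard deviations follow directly from the shuffled means and standard deviations quoted above. The short calculation below reproduces them; the function name is illustrative.

```python
def z_score(observed, shuffled_mean, shuffled_sd):
    """Number of standard deviations between observed and shuffled counts."""
    return (observed - shuffled_mean) / shuffled_sd


# Values quoted in the text for high-level groups and specific habitats.
high_level = z_score(93_280, 676_489.7, 4_281.8)
specific = z_score(173_955, 685_477.17, 4_369.6)
```

Both values are negative, meaning far fewer c_AMPs span multiple habitats than would be expected if habitat labels were assigned at random.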
[0099] To determine the rarity of c_AMPs, the protocol previously established by Coelho et al.55 was adapted: the non-redundant genes in AMPSphere were mapped against the reads of metagenome samples using NGLess96. Only uniquely mapped reads were considered. From the mapping, the c_AMPs detected per sample and the number of detections per c_AMP were computed, considering “rare” c_AMPs as those detected less often than the average of the entire AMPSphere (682 detections or 1% of all samples, as previously described for species139). This approach was adopted to overcome the high computational costs of a competitive mapping procedure. It was expected that this approach overestimates how prevalent c_AMPs are, and because of that, it is a robust way to estimate the rarity of c_AMPs.
[0100] As the high-quality designation requires at least 3 gene variants for the RNAcode test to be performed, the rarest genes will not be high-quality. However, for robustness, this effect was quantified by computing the mean and median number of detections in only the high-quality c_AMPs and only non-terminal c_AMPs (a test which does not require a minimum number of genes). The mean number of detections is 682 for the full collection, 789 for high-quality c_AMPs, and 679 for non-terminal ones.
[0101] Significance of the overlapping c_AMPs across different habitats. As was done when testing the significance of the number of multi-habitat c_AMPs observed, the number of overlapping c_AMPs was computed for each pair of habitats. The sample labels were shuffled 1,000 times, counting the number of randomly overlapping c_AMPs for each pair of habitats. Then, the probability of observing the overlap was estimated by Chebyshev's inequality, which does not rely on any assumption regarding the distribution of the data, as we observed, using the Shapiro-Wilk test, that the shuffled counts do not follow a normal distribution. Chebyshev's inequality is $p \leq 1/Z^2$, where Z stands for the Z-score computed from the average and standard deviation estimated by the shuffling procedure. The p-values were adjusted using Holm-Sidak implemented in multipletests from the statsmodels package114, and those below 0.05 were considered significant.
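The distribution-free bound can be sketched as follows. The sketch assumes the sample standard deviation of the shuffled counts is used and caps the bound at 1; the multiple-testing correction is not included, and the function name is hypothetical.

```python
import statistics


def chebyshev_p_upper_bound(observed, shuffled_counts):
    """Upper bound on the p-value from Chebyshev's inequality, 1 / Z**2.

    Z is computed from the mean and (sample) standard deviation of the
    overlap counts obtained by label shuffling.
    """
    mean = statistics.mean(shuffled_counts)
    sd = statistics.stdev(shuffled_counts)
    if sd == 0:
        raise ValueError("shuffled counts have zero variance")
    z = (observed - mean) / sd
    if z == 0:
        return 1.0
    return min(1.0, 1.0 / z ** 2)
```

Because the bound makes no distributional assumption it is conservative: reaching 0.05 before correction requires an observed overlap at least about 4.5 standard deviations from the shuffled mean.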
[0102] c_AMP density in microbial species from different habitats. The c_AMP density was defined as $\rho_{AMP} = n_{cAMPs} / L$, where $n_{cAMPs}$ is the number of c_AMP redundant genes and $L$ is the number of assembled base pairs. It was assumed, as an approximation, that in a large assembled segment, the start positions of AMP genes are independent and uniformly random. Then, the standard sample proportion error was calculated with the formula $STD_{err} = \sqrt{\rho_{AMP} (1 - \rho_{AMP}) / L}$. The standard sample proportion error was used to calculate the margin of error at a 95% confidence interval (Z = 1.96, α = 0.05).
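As a worked illustration, the density and its margin of error can be computed as below. The sketch assumes the usual standard error of a sample proportion, sqrt(ρ(1 − ρ)/L), with the density ρ taken as c_AMP genes per assembled base pair; the function name and the example counts are hypothetical.

```python
import math


def amp_density(n_camp_genes, assembled_bp, z=1.96):
    """Return (density, margin_of_error) for a sample or taxon.

    density = n_camp_genes / assembled_bp
    margin  = z * sqrt(density * (1 - density) / assembled_bp)
    """
    if assembled_bp <= 0:
        raise ValueError("assembled_bp must be positive")
    rho = n_camp_genes / assembled_bp
    std_err = math.sqrt(rho * (1 - rho) / assembled_bp)
    return rho, z * std_err


# Hypothetical taxon: 100 c_AMP genes in 1 Mbp of assembled sequence.
rho, margin = amp_density(100, 1_000_000)
relative_margin = margin / rho
```

With only 100 genes the margin of error is close to 20% of the density itself, which shows why taxa with few assembled base pairs yield imprecise estimates.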
[0103] To gain insights about the contributions of different phyla, species, and genera to the AMPSphere, the c_AMP density was calculated for these taxonomy levels using the c_AMPs included within AMPSphere, summing all assembled base pairs for contigs assigned to each taxonomy level in the samples used in AMPSphere. The $\rho_{AMP}$ values of genera, phyla and species with a margin of error greater than 10% of the calculated value were eliminated, along with outliers according to Tukey’s fences (k = 1.5). We estimated species’ presence and abundance in each sample using mOTUs2115. None of the genera with the highest $\rho_{AMP}$ (Algorimicrobium, TMED78, SFJ001, STGJ01, and CAG-462) were highly prevalent microbes.
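Outlier removal with Tukey's fences can be sketched as follows. The quartile interpolation method is not stated in the text, so the "inclusive" method of the Python standard library is an assumption here, as is the function name.

```python
import statistics


def tukey_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR].

    Quartiles use the 'inclusive' interpolation of the standard
    library, which is an assumption of this sketch.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical densities with one extreme value.
kept = tukey_filter([1, 2, 3, 4, 5, 100])
```

With k = 1.5, the value 100 lies above the upper fence of 8.5 and is dropped, while the remaining five values are kept.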
[0104] c_AMPs and bacterial species transmissibility. The species taxonomy and transmissibility indices calculated by Valles-Colomer et al.69 were used to demonstrate the effect of AMPs on the transmission of bacterial species from mother to children. Only those species overlapping AMPSphere and the datasets from Valles-Colomer et al.69 were used for this analysis, and their AMP densities were calculated as described in the previous section (c_AMP density in microbial species from different habitats), using all the predicted c_AMPs from metagenomes and genomes we obtained, also including those not in AMPSphere, to avoid the bias of sampling. The AMP density and the coefficient of transmissibility were correlated using Spearman's method implemented in the scipy package104: following children's microbiome after 1, 3, and up to 18 years, as well as cohabitation and intradatasets. The p-values of correlations were corrected using Holm-Sidak implemented in the multipletests function from the statsmodels package114.
[0105] Determination of accessory AMPs. To uncover the prevalence of c_AMPs through the microbial pangenomes, core, shell, and accessory c_AMP clusters were determined using the subset of c_AMPs obtained from ProGenomes243 because of their high-confidence assigned taxonomies and genomically-defined species (specI140). To increase confidence in our measures, only species containing at least 10 genomes were used in this analysis. c_AMPs and AMP families present in fewer than 50% of the genomes from a microbial species were classified as accessory. c_AMPs and families present in 50% - 95% of the genomes in the cluster were classified as shell141, and those present in >95% of the genomes were classified as core genes63.
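The prevalence thresholds above translate into a simple classification rule, sketched here. The function name and the handling of the exact 95% boundary (assigned to shell, since core requires more than 95%) are illustrative choices consistent with the thresholds as stated.

```python
def classify_prevalence(n_genomes_with_amp, n_genomes_in_species):
    """Classify a c_AMP or family as 'core', 'shell', or 'accessory'.

    Species with fewer than 10 genomes are rejected, mirroring the
    minimum used in the text.
    """
    if n_genomes_in_species < 10:
        raise ValueError("species must contain at least 10 genomes")
    prevalence = n_genomes_with_amp / n_genomes_in_species
    if prevalence > 0.95:
        return "core"
    if prevalence >= 0.50:
        return "shell"
    return "accessory"
```

For a species with exactly 10 genomes, a c_AMP must be present in all 10 to be core, since 9 out of 10 is only 90%.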
[0106] To determine the propensity of AMPs being shared between genomes belonging to the same strain, strains within species were defined first. For this, FastANI v.1.33111 was used to cluster genomes from the same species in ProGenomes243. Genome groups with ANI > 99.99% were considered clonal complexes and only a single representative of each clonal complex was kept for further analyses. Species that had fewer than 10 genomes after this step were not considered further in this analysis. Next, strains (99.5% < ANI < 99.99%) were inferred as in Rodriguez et al.142. Pairs of genomes from the same species sharing AMPs were counted, stratified by whether the pair originates from the same strain or not, and the results were tested with Fisher’s Exact Test implemented in the scipy package104.
[0107] To determine the proportions of accessory, shell and core full-length proteins in the microbial pangenomes, the predicted full-length proteins were also extracted from the ENA database for each genome and hierarchically clustered after alphabet reduction in a similar fashion to that described in the topic “AMP families”. Full-length protein clusters with >8 sequences for each species were kept. The prevalence of full-length protein families within a species was computed as above and the number of core families was compared to the number of c_AMP core families using the probability, calculated as the number of species with a proportion of core full-length protein families less than or equal to that observed for c_AMPs divided by the total number of assessed species.
[0108] To determine the genotype of Mycoplasma pneumoniae genomes in ProGenomes243, the gene coding for P1 adhesin65 was extracted by mapping the reference gene sequence NZ_LR214945.1:c568695-567307 against each genome with bwa v.0.7.17113, and the sequences were later extracted using SAMtools116 and BEDtools117. The extracted gene sequences were aligned using Clustal Omega118, and a phylogenetic tree was built using the aligned nucleotide sequences and FastTree 2110 with the restricted time-reversible substitution model and a bootstrapping procedure with 1,000 pseudo-replicates to determine node support. The tree was used to segregate and classify genomes taking the strain type of reference genomes from Diaz et al.66.
[0109] Annotation of AMPs using different datasets. To detect homologs to previously published proteins, AMPSphere candidates were aligned against several databases: (i) the small protein sets in SmProt 254, (ii) the bioactive peptides database starPepDB 45k93, (iii) the small proteins from the global data-driven census of Salmonella92, (iv) the global microbial gene catalog GMGCv155, and (v) the AMP database DRAMP46 version 3.0. To strictly avoid any artifacts of assembly for the analysis, only c_AMPs which passed the terminal placement test (i.e., for which there was strong evidence that the ORF is indeed complete) were searched against the GMGCv155. The AMPs were annotated using MMseqs299 with the ‘easy-search’ method, retaining hits with an E-value up to 10⁻⁵. As Macrel42 removes the starting methionine from the peptides it outputs, hits starting at the second amino acid were treated as if they matched the first one.
[0110] The hypergeometric test implemented in the scipy package104 was used to model the association between c_AMPs and the background distribution of ortholog groups from GMGCv155. The number of genes that were redundant in GMGCv155 for each ortholog group was computed along with the counts for ortholog groups in the top hits to AMPSphere. The enrichment was given as the proportion of hits present in a given ortholog group divided by the proportion of that ortholog group among the redundant sequences in GMGCv155, and results were considered significant if p < 0.05 after correction with the Holm-Sidak method implemented in multipletests from the statsmodels package114. When using a robust approach that filters the ortholog groups by the number of c_AMP hits and GMGCv155 hits associated with them, using a minimum of 10, 20, or even 100 proteins, the results remained similar to those obtained with all data, showing that the size of the ortholog groups in AMPSphere did not affect the enrichment analysis.
[0111] To check for genomic entities generated after gene truncation, a screen was performed for c_AMP homologs using the default settings for Blastn120 against the NCBI database124, keeping only significant hits with a maximum E-value of 10⁻⁵. As a case study, AMP10.271_016 was selected, which was predicted to be produced by Prevotella jejuni and which shares the start codon with the gene coding for a NAD(P)-dependent dehydrogenase (WP_089365220.1). To verify the gene disposition and putative mutations leading to the AMP creation, we used Biopython107 to codon-align the fragments from metagenomic contigs assembled from samples SAMN09837386, SAMN09837387, and SAMN09837388, and genomic fragments of different strains of Prevotella jejuni, CD3:33 (CP023864.1:504836-504949), F0106 (CP072366.1:781389-781502), F0697 (CP072364.1:1466323-1466436), and of Prevotella melaninogenica strains FDAARGOS_760 (CP054010.1:157726-157839), FDAARGOS_306 (CP022041.2:943522-943635), FDAARGOS_1566 (CP085943.1:1102942-1103055), and ATCC 25845 (CP002123.1:409656-409769), and compared the segments coding for the AMP and the original full-length protein.
[0112] Genomic context conservation analysis. To gain insights into the gene synteny involving AMP genes, the 863,498 AMP sequences were mapped against a collection of 169,632 reference genomes, metagenome-assembled genomes (MAGs) and single amplified genomes (SAGs) curated elsewhere58 with DIAMOND119 in “blastp” mode, as previously reported58. Hits with identity > 50% (amino acid) and query and target coverage > 90% were considered significant. The target coverage threshold avoids hits to larger homologs whose function may be unrelated. This yielded 107,308 AMPs with homologs in at least one genome. We built gene families from the hits of each AMP detected in the prokaryotic genomes and calculated a conservation score based on the functional annotation of the neighboring genes in a window of three genes up and downstream. The vertical conservation score at each position within the window of each c_AMP was calculated as the number of genes with a given functional annotation (ortholog group, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, KEGG orthology, KEGG module126, PFAM 33.1122143, and CARD125; details of annotation and annotated database described previously58), divided by the number of genes in the family. AMPs with more than two hits and a vertical conservation score > 0.9 with any functional term were considered to have conserved genomic contexts. Figure 4 shows genomic context conservation of different KEGG pathways.
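The vertical conservation score can be sketched as below. The data layout (one dictionary per family member, mapping a relative neighbor position to a set of functional terms) and both function names are assumptions made for illustration; the thresholds (more than two hits, score > 0.9, window of three genes on each side) are those given in the text.

```python
from collections import Counter


def vertical_conservation(neighborhoods, window=3):
    """Score each (position, term) pair across a gene family.

    `neighborhoods` holds one dict per family member, mapping a
    relative position (-window..-1, 1..window) to a set of terms.
    The score is the number of members carrying the term at that
    position divided by the number of genes in the family.
    """
    n = len(neighborhoods)
    scores = {}
    for pos in range(-window, window + 1):
        if pos == 0:
            continue  # position 0 is the c_AMP itself
        counts = Counter()
        for member in neighborhoods:
            counts.update(set(member.get(pos, ())))
        for term, count in counts.items():
            scores[(pos, term)] = count / n
    return scores


def has_conserved_context(neighborhoods, min_hits=3, threshold=0.9):
    """True if the family has > 2 hits and any score above 0.9."""
    if len(neighborhoods) < min_hits:
        return False
    scores = vertical_conservation(neighborhoods)
    return any(score > threshold for score in scores.values())
```

Requiring more than two hits prevents a family of one or two genes from trivially reaching a score of 1.0.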
[0113] For testing whether the fraction of AMPs with conserved genomic neighbors is similar to that of other gene families within the 169,632 genomes curated by del Rio et al.58, genomic context conservation on 3,899,674 gene families was calculated de novo with MMSeqs299 (using a minimal amino acid identity of 30%, coverage of the shorter sequence of at least 50%, and maximum E-value of 10⁻³). The c_AMPs were also annotated using EggNOG-mapper v2108. Their KO annotations were compared to those of the immediate neighbors (+/- 1 positions) to identify neighborhoods with the same function. It was possible to annotate 56.1% (60,173 out of 107,308) of c_AMPs with hits to the genomes tested using the EggNOG5 database57. Of these, 18.1% were assigned to translation-related functions (class J), 14.4% belong to proteins of unknown function (S), and 9% were assigned to replication, recombination, and repair (L).
[0114] AMPSphere web resource. AMPSphere is found at the address shown in Figure imgf000041_0001. The implementation is based on Python100 and Vue Javascript. The database was built with SQLite, and SQLAlchemy was used to map the database to Python objects. Internal and external APIs were built using FastAPI, with Gunicorn used to serve them. On the front end, Vue 3 was used as the backbone and Quasar built the layout. Plotly was used to generate interactive visualization plots, and Axios to render content seamlessly. LogoJS (Figure imgf000041_0002) was used to generate sequence logos for AMP families, while the helical wheel app (Figure imgf000041_0003) was used to generate AMP helical wheels.
Example 2 - Synthesis and Efficacy Testing of Candidate AMPs
[0115] One hundred synthesized peptides were tested against 11 clinically relevant pathogenic strains, encompassing Acinetobacter baumannii, Escherichia coli (including one colistin-resistant strain), Klebsiella pneumoniae, Pseudomonas aeruginosa, Staphylococcus aureus (including one methicillin-resistant strain), vancomycin-resistant Enterococcus faecalis, and vancomycin-resistant Enterococcus faecium. The initial screening revealed that 63 AMPs (out of 100 synthesized) completely eradicated the growth of at least one of the pathogens tested (Fig. 6C). Remarkably, in some cases, the AMPs were active at concentrations as low as 1 μmol L⁻¹, close to those of the peptide antibiotic polymyxin B and the antibiotic levofloxacin, used as positive controls in all experiments (Fig. 11A). The Gram-negative bacteria A. baumannii and E. coli, as well as the Gram-positive strains vancomycin-resistant E. faecalis and E. faecium, displayed higher susceptibility to the AMPs, with 39, 24, 21 and 26 peptide hits, respectively. However, none of the tested AMPs affected methicillin-resistant S. aureus (MRSA) (Fig. 6C). Scrambled versions of five of the most active peptides from the high-quality group (i.e., actinomycin-1, enterococcin-1, lachnospirin-1, proteobacticin-1, and synechocucin-1) were also synthesized and tested for antimicrobial activity. All scrambled versions were inactive except for lachnospirin-1_scrambled, which presented modest activity against A. baumannii at 32 μmol L⁻¹ (a 16-times higher concentration compared to its parent peptide lachnospirin-1) (Fig. 12A). These results underscore the importance of the specific sequence of these peptides to exert their antimicrobial activity. To further explore the influence of sequence on structure, the secondary structure tendency of the scrambled peptides was assessed using circular dichroism. A decrease in helical fraction was detected for sequences with higher helical content (enterococcin-1, lachnospirin-1, and synechocucin-1), while the predominantly random-coiled sequences actinomycin-1 and proteobacticin-1, as well as their scrambled counterparts, showed similar secondary structures in all media analyzed (Fig. 12B-E). These results suggest a lack of correlation between secondary structure and antimicrobial activity of the AMPs derived from AMPSphere.
[0116] The growth of human gut commensals is impaired by c_AMPs. The AMPs were screened against eight of the most relevant members of the human gut microbiota associated with human health73-77. The commensal bacteria tested belong to four phyla (Verrucomicrobiota, Bacteroidota, Actinomycetota, and Bacillota), i.e., Akkermansia muciniphila, Bacteroides fragilis, Bacteroides thetaiotaomicron, Bacteroides uniformis, Phocaeicola vulgatus (formerly Bacteroides vulgatus), Collinsella aerofaciens, Clostridium scindens, and Parabacteroides distasonis.
[0117] While it is commonly observed that known natural AMPs do not target microbiome strains78, the present study found that 58 of the synthesized AMPs (58%) demonstrated inhibitory effects on at least one commensal strain at low concentrations (8-16 μmol L⁻¹). Although this concentration range was higher than that required to inhibit pathogens (1-4 μmol L⁻¹), it still falls within the highly active range of AMPs based on previous studies79-81 (Fig. 6C). Interestingly, all the analyzed gut microbiome strains were susceptible to at least four c_AMPs, with strains of A. muciniphila, B. uniformis, P. vulgatus, C. aerofaciens, C. scindens, and P. distasonis exhibiting the highest susceptibility. In total, 79 AMPs (out of 100 synthesized peptides) demonstrated antimicrobial activity against pathogens and/or commensals. The scrambled sequences of five of the highly active peptides from the high-quality group were also screened against gut commensals. Similarly to the results obtained against pathogenic strains (Fig. 12), only lachnospirin-1_scrambled was modestly active against C. scindens at 64 μmol L⁻¹, as was lachnospirin-1 (Fig. 12A).
[0118] Permeabilization and depolarization of the bacterial membrane by c_AMPs from AMPSphere. To gain insights into the mechanism of action responsible for the antimicrobial activity observed in the peptides derived from AMPSphere (Fig. 6C), experiments were conducted to assess their ability to permeabilize and depolarize the outer and cytoplasmic membranes of bacteria at their Minimum Inhibitory Concentrations (MICs). Specifically, the effects of all 39 peptides that showed activity against A. baumannii (Figs. 6D-E and Fig. 13A, 13D) and 6 peptides with antimicrobial activity on P. aeruginosa (Figs. 13B-C,E) were investigated. For comparison and as a control, polymyxin B, a peptide antibiotic known for its membrane permeabilization and depolarization properties4, was used.
[0119] To investigate the potential permeabilization of the outer membranes of Gram-negative bacteria by the selected AMPs, 1-(N-phenylamino)naphthalene (NPN) uptake assays were performed. NPN is a lipophilic fluorophore that exhibits increased fluorescence in the presence of lipids found within bacterial outer membranes. The uptake of NPN indicates membrane permeabilization and damage. Among the 39 peptides evaluated for activity against A. baumannii, 10 peptides caused significant permeabilization of the outer membrane, resulting in fluorescence levels at least 50% higher than that of polymyxin B (Fig. 6D) after 45 min of exposure. In the case of P. aeruginosa cells, four out of the six peptides tested showed higher permeabilization than polymyxin B (Fig. 13A).
[0120] To evaluate the potential membrane depolarization effect of the selected AMPs from AMPSphere, the fluorescent dye 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)] was utilized. Among the peptides tested against A. baumannii, bogicin-1 (AMP10.364_543), ampspherin-2 (AMP10.615_023), and marinobacticin-1 (AMP10.321_460) exhibited greater cytoplasmic membrane depolarization than polymyxin B, and among the ones tested against P. aeruginosa, cagicin-2 (AMP10.014_861) exhibited greater cytoplasmic membrane depolarization than polymyxin B (Fig. 6E). Interestingly, all the tested AMPSphere peptides displayed a characteristic crescent-shaped depolarization pattern compared to polymyxin B, with lower levels of depolarization observed during the first 20 min of exposure, followed by an increase in depolarization over time (Fig. 6E; Fig. 13B). Taken together, these results indicate that the kinetics of cytoplasmic membrane depolarization are slower compared to the kinetics of outer membrane permeabilization, which occurs rapidly upon interaction with the bacterial cells.
[0121] These findings indicate that the tested AMPs from AMPSphere primarily exert their effects by permeabilizing the outer membrane rather than depolarizing the cytoplasmic membrane, revealing a similar mechanism of action to that observed for classical AMPs and EPs from the human proteome4.
[0122] AMPs exhibit anti-infective efficacy in a mouse model. Next, tested was the anti-infective efficacy of AMPSphere-derived peptides in a skin abscess murine infection model (Fig. 7A). Mice were subjected to infection with A. baumannii, a dangerous Gramnegative pathogen known for causing severe infections in various body sites including the bloodstream, lungs, urinary tract, and wounds82. Ten lead AMPs from different sources displayed potent in vitro activity against A. baumannii'. synechocucin-1 (AMP10.000_211, 8 prnol L-1) from Synechococcus sp. (coral associated, marine microbiome), proteobacticin-1 (AMP10.048_551, 16 prnol L'1) from Pseudomonadota (plant and soil microbiome), actynomycin-1 (AMP10.199_072, 64 prnol L'1) from Actinomyces (human mouth and saliva microbiome), lachnospirin-1 (AMP10.015_742, 2 prnol L'1) from Lachnospira sp. (human gut microbiome), enterococcin-1 (AMP10.051_911, 1 prnol L’1) from Enterococcus faecalis (human gut microbiome), alphaprotecin- 1 (AMP10.316 798, 1 pmol L'1) from Alphaproteobacteria (aquatic microbiome), oscillospirin (AMP10.771_988, 8 pmol L 1) from Oscillospiraceae (pig gut microbiome), ampspherin-4 (AMP10.466_287, 8 pmol L 1) from an unknown source, methylocellin-1 (AMP10.446_571, 2 pmol L'1) from Methylocella sp. (soil microbiome), and reyranin-1 (AMP10.337_875, 16 pmol L'1) from Reyranella (plant and soil microbiome). The skin abscess infection was established with a bacterial load of 20 pL of A. baumannii cells at 106 CFU mL'1 onto the wounded area of the dorsal epidermis (Fig. 7A). A single dose of each peptide, at their respective MIC value obtained in vitro (Fig. 6C; Fig. 11 A), was administered to the infected area. 
Two days post-infection, synechocucin-1, actynomycin-1, and oscillospirin-1 presented bacteriostatic activity, inhibiting the proliferation of A. baumannii cells, whereas lachnospirin-1, enterococcin-1, ampspherin-4, and reyranin-1 presented bactericidal activity close to that of the antibiotic polymyxin B (at 5 μmol L⁻¹), reducing the colony-forming unit (CFU) counts by 3-4 orders of magnitude (Fig. 7B). Four days post-infection, synechocucin-1, lachnospirin-1, enterococcin-1, and ampspherin-4 presented a bacteriostatic effect close to that of the antibiotic polymyxin B, reducing the CFU counts by 2-3 orders of magnitude compared to the untreated control (Fig. 13C). These results highlight the anti-infective potential of the tested peptides from AMPSphere, as they were administered only once, immediately after the establishment of the abscess. Mouse weight was monitored as a proxy for toxicity and no significant changes were observed (Fig. 7C and Fig. 13D), suggesting that the peptides tested were not toxic.
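A reduction expressed in orders of magnitude is the base-10 logarithm of the ratio between the untreated and treated CFU counts. The following minimal sketch shows that arithmetic only; the function name and the example counts are hypothetical and are not data from the experiments described here.

```python
import math


def log10_reduction(cfu_untreated: float, cfu_treated: float) -> float:
    """Return the reduction in bacterial load in orders of magnitude (log10 units).

    Both counts must be positive and expressed in the same unit (e.g. CFU mL^-1).
    """
    if cfu_untreated <= 0 or cfu_treated <= 0:
        raise ValueError("CFU counts must be positive")
    return math.log10(cfu_untreated / cfu_treated)


# Hypothetical example: an untreated load of 1e7 CFU/mL reduced to 1e4 CFU/mL
# corresponds to a reduction of three orders of magnitude.
example_reduction = log10_reduction(1e7, 1e4)
```

A value between 3 and 4 from this calculation would correspond to the "3-4 orders of magnitude" wording used for the bactericidal peptides.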
[0123] Conclusions. Here, ML was used to identify nearly a million novel candidate AMPs in the global microbiome. Building on previous studies that focused specifically on the human gut microbiome6,38,83, AMPs were catalogued from the global microbiome across 63,410 publicly available metagenomes, as well as 87,920 high-quality microbial genomes from the ProGenomes2 database42. This led to the creation of AMPSphere, an open-access and publicly available resource encompassing 863,498 non-redundant peptides and 6,499 high-quality AMP families from 72 different habitats, including marine and soil environments and the human gut. Most of the c_AMPs (91.5%) were previously unknown, lacking detectable homologs in other databases, and about one in five had evidence of translation and/or transcription, as they could be detected in independent publicly available sets of metatranscriptomes or metaproteomes.
[0124] A set of tests was designed to capture higher-quality predictions, but many peptides failed these tests despite evidence that they were active, including our own in vitro data and the existence of validated homologs in external databases. Low-prevalence peptides are less likely to pass the tests (RNAcode84 requires multiple variants), an outcome that is independent of their activity and influenced by sampling biases.
[0125] Focusing on candidate AMPs that are directly encoded in the genome enabled in vitro and in vivo testing using chemical synthesis without post-translational modifications, but there are other processes that generate active peptides, such as encrypted peptides (EPs)4, which were used as a comparison point. Notably, the amino acid composition and physicochemical characteristics of the validated AMPs from AMPSphere differed from those of recently identified EPs4. Two evolutionary mechanisms by which AMPs may be generated were explored. First, mutations in genes encoding longer proteins could generate gene fragments via truncation. Among the enriched ortholog groups of proteins from GMGCv1⁵⁵ homologous to c_AMPs, we observed that a majority of groups had unknown function (53.8%), similar to what was reported by Sberro et al.38 for small proteins from the human gut microbiome. The second mechanism is that a small protein gene undergoes a duplication followed by mutation, which we observed in the case of ribosomal proteins. Ribosomal proteins can harbor antimicrobial activity63, possibly due to their amyloidogenic properties85. Other possible origins of AMPs include horizontal gene transfer86 and ancestral non-coding sequences87.
[0126] Nonetheless, the majority of identified AMPs did not have a detectable homolog in other databases, highlighting their novelty. The lack of observed homology may be due to limitations in our ability to robustly detect these homology relationships in small sequences, but there is also the possibility that small proteins, such as AMPs, may be more likely to be generated de novo and may have repeatedly evolved in various taxa88. This may also be an explanation for the large fraction of c_AMPs in the AMPSphere that do not cluster with any other sequences.
[0127] It was observed that c_AMPs from AMPSphere were habitat-specific and mostly accessory members of microbial pangenomes. Furthermore, four out of the five genera with the most c_AMPs present in AMPSphere share a host-associated lifestyle, and three of these (Prevotella, Faecalibacterium, and CAG-110) are common in animal hosts89-91 (Fig. 5).
[0128] Valles-Colomer et al.69, who recently analyzed a large collection of human-associated metagenomes, provide a species-specific index of transmissibility for the several transmission scenarios they study (e.g., mother to infant). Hypothesizing that AMP production may be related to transmission, the species-specific pAMP calculated in AMPSphere was correlated with transmission scores. In both the human gut and oral microbiomes, species with higher pAMP are less transmissible, possibly because AMPs confer protection against strain replacement. Taken together, these results validate the applicability of AMPSphere in the study of microbial ecology, as they suggest a role for AMPs in determining the transmissibility and colonization ability of microbes.
[0129] Finally, the present inventors experimentally validated predictions made by the inventive ML model42 and found that 79 (out of 100) synthesized AMPs displayed antimicrobial activity against either pathogens or commensals. Notably, four peptides (cagicin-1, cagicin-4, and enterococcin-1 against A. baumannii, and cagicin-1 and lachnospirin-1 against vancomycin-resistant E. faecium) presented MIC values as low as 1 μmol L⁻¹, comparable to the MICs of some of the most potent peptides previously described in the literature80,81.
[0130] It was herein demonstrated that the tested AMPs from AMPSphere tended to target clinically relevant Gram-negative pathogens and showed activity against vancomycin-resistant E. faecium. Although conventional AMPs do not target bacteria from the human gut microbiome78, tested AMPs from AMPSphere showed efficacy against commensal bacteria, suggesting potential ecological implications of peptides as protective agents for their producing organisms and as a means to reconfigure microbiome communities.
[0131] When assessing their activity in vivo, three peptides exhibited anti-infective efficacy in a murine infection model, with lachnospirin-1 and enterococcin-1 being the most potent, resulting in a reduction of bacterial load by up to four orders of magnitude. The active peptides included those derived from both human-associated and environmental microbiota, validating the present approach of investigating the global microbiome. Overall, the present findings unveil a wide array of novel AMP sequences, highlighting the potential of machine learning in the discovery of much-needed antimicrobials.
[0132] Methods
[0133] Selection of peptides for synthesis and activity testing. Two groups of peptides were selected: (i) 50 peptides that were selected as being particularly likely to be active and that were otherwise interesting (as described below), (ii) 50 peptides selected randomly after applying technical exclusions.
[0134] For the first group, only high-quality (see topic “Quality control of c_AMPs”) c_AMPs were considered for synthesis. They were further filtered according to six criteria for solubility144 and three criteria for synthesis, as in PepFun145. The solubility was estimated using the criteria implemented in PepFun145: 67.4% (581,749 peptides) passed at least half of the solubility criteria evaluated. The subset that is homologous to peptides in DRAMP46 version 3.0 had a slightly lower rate, with 44.3% passing half the tests. We then assessed the peptides regarding their ease of synthesis; however, only 21.2% of the peptides from AMPSphere passed at least 2 out of the 3 criteria established for chemical synthesis.
[0135] Peptides that passed at least six of the above-mentioned criteria were then filtered by predicting AMP activity with six methods in addition to Macrel42: AMPScanner v2⁴⁹, the mature peptides model in ampir40, amPEPpy50, APIN51 (with its proposed model), AI4AMP52, and AMPlify53. Peptides predicted to be AMPs by all methods were filtered by length, discarding sequences longer than 40 amino acid residues, for which conventional solid-phase peptide synthesis using the Fmoc strategy has lower yields and requires many recoupling reactions146-148. Only one peptide was kept from each family or cluster, namely the one with the highest number of observed smORFs. After this process, we obtained 364 candidate AMPs, belonging to 166 families and 198 clusters with < 8 c_AMPs. Of these, 30 candidates were homologous to sequences from the databases used in annotation (e.g., SmProt 2⁵⁴). To compose the list of 50 high-likelihood candidates: (i) we selected 34 of the most prevalent peptides; (ii) we randomly selected 14 c_AMPs (30% of our set) with homologs to the GMGCv1⁵⁵ and one that matched SmProt 2⁵⁴; and (iii) we included one peptide that was found in the MAGs binned from stool samples used to investigate fecal transplantations149. We also included scrambled sequences made using five of the most active peptide sequences to verify the potency of randomly generated sequences.
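The consensus, length, and one-per-family steps of this selection can be sketched as follows. This is a minimal illustration and not the code used to assemble the candidate list: the `Candidate` record and its field names (`family`, `n_smorfs`, `predictions`) are hypothetical, and the predictor outputs are assumed to be available already as booleans.

```python
from dataclasses import dataclass, field

MAX_LENGTH = 40  # residues; longer sequences are discarded, as stated in the text


@dataclass
class Candidate:
    name: str
    seq: str
    family: str        # family or cluster identifier (hypothetical field)
    n_smorfs: int      # number of observed smORFs encoding this peptide
    predictions: dict = field(default_factory=dict)  # predictor name -> bool


def select_representatives(candidates):
    """Keep peptides called AMP by every predictor, drop those longer than
    MAX_LENGTH, then keep the peptide with the most smORFs per family/cluster."""
    best = {}
    for cand in candidates:
        # Consensus filter: every predictor must call the peptide an AMP.
        if not cand.predictions or not all(cand.predictions.values()):
            continue
        # Length filter: discard sequences longer than 40 residues.
        if len(cand.seq) > MAX_LENGTH:
            continue
        current = best.get(cand.family)
        if current is None or cand.n_smorfs > current.n_smorfs:
            best[cand.family] = cand
    # Sort by name only to make the output order deterministic.
    return sorted(best.values(), key=lambda cand: cand.name)
```

Ties in smORF count are resolved here in favor of the first peptide seen, which is an arbitrary choice of this sketch; the text does not state how ties were handled.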
[0136] To build the group of randomly selected peptides, c_AMPs that are not homologous to any other databases tested and that passed the above-mentioned synthesis criteria were first selected (total of 768,061 out of 863,498 peptides). This group was further divided into two subgroups: (i) those with Macrel-assigned probability >0.6 (271,555 c_AMPs) and (ii) those in the range 0.5-0.6 (496,506 c_AMPs; note that all c_AMPs in AMPSphere have a Macrel-assigned probability >0.5). Twenty-five peptides were randomly selected from each subgroup.
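The two-subgroup random draw can be illustrated with a short sketch. It assumes a hypothetical mapping from peptide identifier to Macrel-assigned probability, places a probability of exactly 0.6 in the lower subgroup (the text does not specify the boundary), and uses a fixed seed only so that the example is reproducible.

```python
import random


def stratified_sample(probabilities, n_per_group=25, seed=0):
    """Draw n_per_group peptides at random from each probability subgroup.

    probabilities: dict mapping peptide identifier -> Macrel probability (> 0.5).
    Returns (high_sample, low_sample) as two lists of identifiers.
    """
    # Sort identifiers so the draw does not depend on dictionary ordering.
    high = sorted(p for p, prob in probabilities.items() if prob > 0.6)
    low = sorted(p for p, prob in probabilities.items() if 0.5 < prob <= 0.6)
    rng = random.Random(seed)
    # random.Random.sample draws without replacement and raises ValueError
    # if a subgroup holds fewer than n_per_group peptides.
    return rng.sample(high, n_per_group), rng.sample(low, n_per_group)
```

Sampling each subgroup separately guarantees equal representation of the two probability ranges, whereas a single draw from the pooled set would favor the larger 0.5-0.6 subgroup.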
[0137] Minimal inhibitory concentration (MIC) determination. The 100 AMPs were tested for antimicrobial activity using the broth microdilution method150. MIC values were considered as the concentration of the peptides that killed 100% of cells after 24 h of incubation at 37°C. First, peptides diluted in water were added to untreated flat-bottom polystyrene microtiter 96-well plates in two-fold dilutions ranging from 64 to 1 μmol L⁻¹, and then peptides were exposed to an inoculum of 2 × 10⁶ cells in LB or BHI broth, for pathogens and gut commensals, respectively. After the incubation time, the absorbance of each well representing each of the conditions was analyzed using a spectrophotometer at 600 nm. The assays were conducted in three biological replicates to ensure statistical reliability.
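The dilution series and the MIC readout can be sketched as below. This is an illustrative reading of the procedure, not the analysis actually used: it assumes that the absence of growth is called when the absorbance at 600 nm does not exceed a blank well by more than a hypothetical threshold of 0.05, and the function names are invented for the example.

```python
def twofold_series(top=64.0, bottom=1.0):
    """Return two-fold dilution concentrations from top down to bottom (inclusive)."""
    concentrations = []
    conc = top
    while conc >= bottom:
        concentrations.append(conc)
        conc /= 2
    return concentrations


def mic_from_od(od_by_conc, blank_od, threshold=0.05):
    """Return the lowest concentration showing no growth, or None.

    od_by_conc: dict mapping concentration -> absorbance at 600 nm.
    A well counts as 'no growth' when its absorbance exceeds the blank by at
    most `threshold` (an assumed cutoff). Wells are scanned from the highest
    concentration downwards and the scan stops at the first well with growth.
    """
    mic = None
    for conc in sorted(od_by_conc, reverse=True):
        if od_by_conc[conc] - blank_od <= threshold:
            mic = conc
        else:
            break
    return mic
```

With three biological replicates, one option would be to apply `mic_from_od` to each replicate and report the agreed value; how replicates were combined is not stated in the text.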
[0138] Circular dichroism assays. Circular dichroism experiments were conducted using a J1500 circular dichroism spectropolarimeter (Jasco). The experiments were carried out at a temperature of 25°C. Circular dichroism spectra were obtained by averaging three accumulations using a quartz cuvette with an optical path length of 1.0 mm. The spectra were recorded in the wavelength range from 260 to 190 nm at a scanning rate of 50 nm min⁻¹ with a bandwidth of 0.5 nm. The peptides were tested at a concentration of 50 μmol L⁻¹. Measurements were performed in water, a mixture of water and trifluoroethanol (TFE) in a ratio of 3:2, and a mixture of water and methanol in a ratio of 1:1. Baseline measurements were recorded prior to each measurement. To minimize background effects, a Fourier transform filter was applied. The helical fraction values were calculated using the single spectra analysis tool available on the BeStSel server95.
[0139] Outer membrane permeabilization assays. Membrane permeability was analyzed using the 1-(N-phenylamino)naphthalene (NPN) uptake assay. NPN demonstrates weak fluorescence in an extracellular environment but displays strong fluorescence when in contact with lipids from the bacterial outer membrane. Thus, NPN will show increased fluorescence when the integrity of the outer membrane is compromised. A. baumannii ATCC 19606 and P. aeruginosa PAO1 were cultured until cell numbers reached an OD600 of 0.4, followed by centrifugation (10,000 rpm at 4°C for 3 min), washing, and resuspension in buffer (5 mmol L⁻¹ HEPES, 5 mmol L⁻¹ glucose, pH 7.4). Subsequently, 4 μL of NPN solution (working concentration of 0.5 mmol L⁻¹) was added to 100 μL of bacterial solution in a white flat-bottom 96-well plate. The fluorescence was monitored at λex = 350 nm and λem = 420 nm. The peptide solutions in water (100 μL solution at their MIC values) were introduced into each well, and fluorescence was monitored as a function of time until no further increase in fluorescence was observed (30 min). The relative fluorescence was calculated using a nonlinear fit. The positive control (antibiotic polymyxin B) was used as baseline. The following equation was applied to reflect % of difference between the baseline (polymyxin B) and the sample:
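\[
\text{Relative fluorescence} = 100 \times \frac{\text{fluorescence}_{\text{sample}} - \text{fluorescence}_{\text{polymyxin B}}}{\text{fluorescence}_{\text{polymyxin B}}}
\]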
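As a worked illustration, the percentage difference from the polymyxin B baseline can be computed as below. The sketch assumes the relative fluorescence takes the form 100 × (sample − polymyxin B) / polymyxin B, which is the reading suggested by the description of a percentage difference from the baseline; the function name and the example readings are hypothetical.

```python
def relative_fluorescence(sample: float, polymyxin_b: float) -> float:
    """Percentage difference between a sample reading and the polymyxin B baseline.

    Assumed form: 100 * (sample - polymyxin_b) / polymyxin_b.
    Positive values mean the sample fluoresces more than the baseline.
    """
    if polymyxin_b == 0:
        raise ValueError("baseline fluorescence must be non-zero")
    return 100.0 * (sample - polymyxin_b) / polymyxin_b


# Hypothetical readings: a sample at 150 units against a baseline of 100 units
# gives +50%, and a sample at 50 units gives -50%.
above_baseline = relative_fluorescence(150.0, 100.0)
below_baseline = relative_fluorescence(50.0, 100.0)
```

Under this convention, a peptide that permeabilizes or depolarizes more strongly than polymyxin B yields a positive value.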
[0140] Cytoplasmic membrane depolarization assays. The ability of the peptides to depolarize the cytoplasmic membrane was assessed by measuring the fluorescence of the membrane potential-sensitive dye 3,3'-dipropylthiadicarbocyanine iodide [DiSC3-(5)]. This potentiometric fluorophore fluoresces upon release from the interior of the cytoplasmic membrane in response to an imbalance of its transmembrane potential. A. baumannii ATCC 19606 and P. aeruginosa PAO1 cells were grown with agitation at 37°C until they reached mid-log phase (OD600 = 0.5). The cells were then centrifuged and washed twice with washing buffer (20 mmol L⁻¹ glucose, 5 mmol L⁻¹ HEPES, pH 7.2) and re-suspended to an OD600 of 0.05 in 20 mmol L⁻¹ glucose, 5 mmol L⁻¹ HEPES, 0.1 mol L⁻¹ KCl, pH 7.2. An aliquot of 100 μL of bacterial cells was added to a black flat-bottom 96-well plate and incubated with 20 nmol L⁻¹ of DiSC3-(5) for 15 min until the fluorescence stabilized, indicating the incorporation of the dye into the cytoplasmic membrane. The membrane depolarization was monitored by observing the change in the fluorescence emission intensity of the dye (λex = 622 nm, λem = 670 nm) after the addition of the peptides (100 μL solution at their MIC values). The relative fluorescence was calculated using a non-linear fit. The positive control (antibiotic polymyxin B) was used as baseline. The percentage of difference between the baseline (polymyxin B) and the sample was estimated using the same mathematical approach as in the “Outer membrane permeabilization assays” section.
[0141] References. Superscripted numerals in the present disclosure refer to the following numbered list of publications, which may be relevant to the inventive subject matter.
1. de la Fuente-Nunez, C., Torres, M.D., Mojica, F.J., and Lu, T.K. (2017). Next-generation precision antimicrobials: towards personalized treatment of infectious diseases. Current Opinion in Microbiology 37, 95-102. 10.1016/j.mib.2017.05.014.
2. Antimicrobial Resistance Collaborators (2022). Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399, 629-655. 10.1016/S0140-6736(21)02724-0.
3. Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., MacNair, C.R., French, S., Carfrae, L.A., Bloom-Ackermann, Z., et al. (2020). A Deep Learning Approach to Antibiotic Discovery. Cell 180, 688-702.e13. 10.1016/j.cell.2020.01.021.
4. Torres, M.D.T., Melo, M.C.R., Flowers, L., Crescenzi, O., Notomista, E., and de la Fuente-Nunez, C. (2022). Mining for encrypted peptide antibiotics in the human proteome. Nat Biomed Eng 6, 67-75. 10.1038/s41551-021-00801-1.
5. Porto, W.F., Irazazabal, L., Alves, E.S.F., Ribeiro, S.M., Matos, C.O., Pires, A.S., Fensterseifer, I.C.M., Miranda, V.J., Haney, E.F., Humblot, V., et al. (2018). In silico optimization of a guava antimicrobial peptide enables combinatorial exploration for peptide design. Nat Commun 9, 1490. 10.1038/s41467-018-03746-3.
6. Ma, Y., Guo, Z., Xia, B., Zhang, Y., Liu, X., Yu, Y., Tang, N., Tong, X., Wang, M., Ye, X., et al. (2022). Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat Biotechnol, 1-11. 10.1038/s41587-022-01226-0.
7. Wong, F., de la Fuente-Nunez, C., and Collins, J.J. (2023). Leveraging artificial intelligence in the fight against infectious diseases. Science 381, 164-170. 10.1126/science.adh1114.
8. Cesaro, A., Bagheri, M., Torres, M., Wan, F., and de la Fuente-Nunez, C. (2023). Deep learning tools to accelerate antibiotic discovery. Expert Opinion on Drug Discovery 18, 1245-1257. 10.1080/17460441.2023.2250721.
9. Torres, M.D.T., and De La Fuente-Nunez, C. (2019). Toward computer-made artificial antibiotics. Current Opinion in Microbiology 51, 30-38. 10.1016/j.mib.2019.03.004.
10. Maasch, J.R.M.A., Torres, M.D.T., Melo, M.C.R., and de la Fuente-Nunez, C. (2023). Molecular de-extinction of ancient antimicrobial peptides enabled by machine learning. Cell Host Microbe 31, 1260-1274.e6. 10.1016/j.chom.2023.07.001.
11. Besse, A. (2017). Halocin C8: an antimicrobial peptide distributed among four halophilic archaeal genera: Natrinema, Haloterrigena, Haloferax, and Halobacterium. Extremophiles 21. 10.1007/s00792-017-0931-5.
12. Cotter, P.D., Ross, R.P., and Hill, C. (2013). Bacteriocins — a viable alternative to antibiotics? Nat Rev Microbiol 11, 95-105. 10.1038/nrmicro2937.
13. Wang, S., Zheng, Z., Zou, H., Li, N., and Wu, M. (2019). Characterization of the secondary metabolite biosynthetic gene clusters in archaea. Comput Biol Chem 78, 165-169. 10.1016/j.compbiolchem.2018.11.019.
14. Zasloff, M. (2019). Antimicrobial Peptides of Multicellular Organisms: My Perspective. In Antimicrobial Peptides: Basics for Clinical Application, K. Matsuzaki, ed. (Springer Singapore), pp. 3-6. 10.1007/978-981-13-3588-4_1.
15. Huang, K.-Y., Chang, T.-H., Jhong, J.-H., Chi, Y.-H., Li, W.-C., Chan, C.-L., Robert Lai, K., and Lee, T.-Y. (2017). Identification of natural antimicrobial peptides from bacteria through metagenomic and metatranscriptomic analysis of high-throughput transcriptome data of Taiwanese oolong teas. BMC Syst Biol 11. 10.1186/s12918-017-0503-4.
16. Torres, M.D.T., Sothiselvam, S., Lu, T.K., and de la Fuente-Nunez, C. (2019). Peptide Design Principles for Antimicrobial Applications. J Mol Biol 431, 3547-3567. 10.1016/j.jmb.2018.12.015.
17. Pizzo, E., Cafaro, V., Di Donato, A., and Notomista, E. (2018). Cryptic Antimicrobial Peptides: Identification Methods and Current Knowledge of their Immunomodulatory Properties. Current Pharmaceutical Design 24, 1054-1066. 10.2174/1381612824666180327165012.
18. Nolan, E.M., and Walsh, C.T. (2009). How nature morphs peptide scaffolds into antibiotics. Chembiochem 10, 34-53. 10.1002/cbic.200800438.
19. Singh, N., and Abraham, J. (2014). Ribosomally synthesized peptides from natural sources. J Antibiot 67, 277-289. 10.1038/ja.2013.138.
20. Garcia-Bayona, L., and Comstock, L.E. (2018). Bacterial antagonism in host-associated microbial communities. Science 361. 10.1126/science.aat2456.
21. Anderson, M.C., Vonaesch, P., Saffarian, A., Marteyn, B.S., and Sansonetti, P.J. (2017). Shigella sonnei encodes a functional T6SS used for interbacterial competition and niche occupancy. Cell Host Microbe 21. 10.1016/j.chom.2017.05.004.
22. Krismer, B., Weidenmaier, C., Zipperer, A., and Peschel, A. (2017). The commensal lifestyle of Staphylococcus aureus and its interactions with the nasal microbiota. Nat. Rev. Microbiol 15. 10.1038/nrmicro.2017.104.
23. Zhao, W., Caro, F., Robins, W., and Mekalanos, J. J. (2018). Antagonism toward the intestinal microbiota and its effect on Vibrio cholerae virulence. Science 359. 10.1126/science.aap8775.
24. Quereda, J. J. (2017). Listeriolysin S is a streptolysin s-like virulence factor that targets exclusively prokaryotic cells in vivo. mBio 8. 10.1128/mBio.00259-17.
25. Quereda, J. J. (2016). Bacteriocin from epidemic Listeria strains alters the host intestinal microbiota to favor infection. Proc. Natl Acad. Sci. USA 113. 10.1073/pnas.1523899113.
26. Gomes, B., Augusto, M.T., Felicio, M.R., Hollmann, A., Franco, O.L., Goncalves, S., and Santos, N.C. (2018). Designing improved active peptides for therapeutic approaches against infectious diseases. Biotechnol Adv 36, 415-429. 10.1016/j.biotechadv.2018.01.004.
27. Lesiuk, M., Paduszyńska, M., and Greber, K.E. (2022). Synthetic Antimicrobial Immunomodulatory Peptides: Ongoing Studies and Clinical Trials. Antibiotics (Basel) 11, 1062. 10.3390/antibiotics11081062.
28. Mahlapuu, M., Hakansson, J., Ringstad, L., and Bjorn, C. (2016). Antimicrobial Peptides: An Emerging Category of Therapeutic Agents. Frontiers in Cellular and Infection Microbiology 6.
29. Baquero, F., Lanza, V.F., Baquero, M.R., Campo, R., and Bravo-Vazquez, D.A. (2019). Microcins in Enterobacteriaceae: peptide antimicrobials in the eco-active intestinal chemosphere. Front. Microbiol. 10. 10.3389/fmicb.2019.02261.
30. Kim, S.G. (2019). Microbiota-derived lantibiotic restores resistance against vancomycin-resistant Enterococcus. Nature 572. 10.1038/s41586-019-1501-z.
31. Nakatsuji, T. (2021). Development of a human skin commensal microbe for bacteriotherapy of atopic dermatitis and use in a phase 1 randomized clinical trial. Nat. Med. 27. 10.1038/s41591-021-01256-2.
32. Spohn, R., Daruka, L., Lazar, V., Martins, A., Vidovics, F., Grezal, G., Mehi, O., Kintses, B., Szamel, M., Jangir, P.K., et al. (2019). Integrated evolutionary analysis reveals antimicrobial peptides with limited resistance. Nat Commun 10, 4538. 10.1038/s41467-019-12364-6.
33. Cesaro, A., Torres, M.D.T., Gaglione, R., Dell’Olmo, E., Di Girolamo, R., Bosso, A., Pizzo, E., Haagsman, H.P., Veldhuizen, E.J.A., de la Fuente-Nunez, C., et al. (2022). Synthetic Antibiotic Derived from Sequences Encrypted in a Protein from Human Plasma. ACS Nano 16, 1880-1895. 10.1021/acsnano.1c04496.
34. Hyatt, D., Chen, G.-L., LoCascio, P.F., Land, M.L., Larimer, F.W., and Hauser, L.J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119. 10.1186/1471-2105-11-119.
35. Ahrens, C.H., Wade, J.T., Champion, M.M., and Langer, J.D. (2022). A Practical Guide to Small Protein Discovery and Characterization Using Mass Spectrometry. J Bacteriol 204, e0035321. 10.1128/JB.00353-21.
36. Storz, G., Wolf, Y.I., and Ramamurthi, K.S. (2014). Small Proteins Can No Longer Be Ignored. Annu. Rev. Biochem. 83, 753-777. 10.1146/annurev-biochem-070611- 102400.
37. Su, M., Ling, Y., Yu, J., Wu, J., and Xiao, J. (2013). Small proteins: untapped area of potential biological importance. Front Genet 4. 10.3389/fgene.2013.00286.
38. Sberro, H., Fremin, B.J., Zlitni, S., Edfors, F., Greenfield, N., Snyder, M.P., Pavlopoulos, G.A., Kyrpides, N.C., and Bhatt, A.S. (2019). Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell 178, 1245-1259.e14. 10.1016/j.cell.2019.07.016.
39. Donia, M.S. (2014). A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics. Cell 158. 10.1016/j.cell.2014.08.032.
40. Fingerhut, L.C.H.W., Miller, D.J., Strugnell, J.M., Daly, N.L., and Cooke, I.R. (2020). ampir: an R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics 36, 5262-5263. 10.1093/bioinformatics/btaa653.
41. Sugimoto, Y. (2019). A metagenomic strategy for harnessing the chemical repertoire of the human microbiome. Science 366. 10.1126/science.aax9176.
42. Santos-Junior, C.D., Pan, S., Zhao, X.-M., and Coelho, L.P. (2020). Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ 8, e10555. 10.7717/peerj.10555.
43. Mende, D.R., Letunic, I., Maistrenko, O.M., Schmidt, T.S.B., Milanese, A., Paoli, L., Hernandez-Plaza, A., Orakov, A.N., Forslund, S.K., Sunagawa, S., et al. (2020). proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Research 48, D621-D625. 10.1093/nar/gkz1002.
44. Navidinia, M. (2016). The clinical importance of emerging ESKAPE pathogens in nosocomial infections. Archives of Advances in Biosciences 7, 43-57. 10.22037/jps.v7i3.12584.
45. Mulani, M.S., Kamble, E.E., Kumkar, S.N., Tawre, M.S., and Pardesi, K.R. (2019). Emerging Strategies to Combat ESKAPE Pathogens in the Era of Antimicrobial Resistance: A Review. Front Microbiol 10, 539. 10.3389/fmicb.2019.00539.
46. Shi, G., Kang, X., Dong, F., Liu, Y., Zhu, N., Hu, Y., Xu, H., Lao, X., and Zheng, H. (2021). DRAMP 3.0: an enhanced comprehensive data repository of antimicrobial peptides. Nucleic Acids Research. 10.1093/nar/gkab651.
47. Zhang, L.-J., and Gallo, R.L. (2016). Antimicrobial peptides. Curr. Biol. 26, R14-19. 10.1016/j.cub.2015.11.017.
48. Bhadra, P., Yan, J., Li, J., Fong, S., and Siu, S.W.I. (2018). AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 8, 1-10. 10.1038/s41598-018-19752-w.
49. Veltri, D., Kamath, U., and Shehu, A. (2018). Deep learning improves antimicrobial peptide recognition. Bioinformatics 34, 2740-2747. 10.1093/bioinformatics/bty179.
50. Lawrence, T.J., Carper, D.L., Spangler, M.K., Carrell, A.A., Rush, T.A., Minter, S.J., Weston, D.J., and Labbe, J.L. (2021). amPEPpy 1.0: a portable and accurate antimicrobial peptide prediction tool. Bioinformatics 37, 2058-2060. 10.1093/bioinformatics/btaa917.
51. Su, X., Xu, J., Yin, Y., Quan, X., and Zhang, H. (2019). Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinformatics 20, 730. 10.1186/s12859-019-3327-y.
52. Lin, T.-T., Yang, L.-Y., Lu, I.-H., Cheng, W.-C., Hsu, Z.-R., Chen, S.-H., and Lin, C.-Y. (2021). AI4AMP: an Antimicrobial Peptide Predictor Using Physicochemical Property-Based Encoding Method and Deep Learning. mSystems 6, e0029921. 10.1128/mSystems.00299-21.
53. Li, C., Sutherland, D., Hammond, S.A., Yang, C., Taho, F., Bergman, L., Houston, S., Warren, R.L., Wong, T., Hoang, L.M.N., et al. (2022). AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genomics 23, 77. 10.1186/s12864-022-08310-4.
54. Hao, Y., Zhang, L., Niu, Y., Cai, T., Luo, J., He, S., Zhang, B., Zhang, D., Qin, Y., Yang, F., et al. (2018). SmProt: a database of small proteins encoded by annotated coding and non-coding RNA loci. Briefings in Bioinformatics 19, 636-643. 10.1093/bib/bbx005.
55. Coelho, L.P., Alves, R., del Rio, A.R., Myers, P.N., Cantalapiedra, C.P., Giner-Lamia, J., Schmidt, T.S., Mende, D.R., Orakov, A., Letunic, I., et al. (2022). Towards the biogeography of prokaryotic genes. Nature 601, 252-256. 10.1038/s41586-021-04233-4.
56. Murphy, L.R., Wallqvist, A., and Levy, R.M. (2000). Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Engineering, Design and Selection 13, 149-152. 10.1093/protein/13.3.149.
57. Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernandez-Plaza, A., Forslund, S.K., Cook, H., Mende, D.R., Letunic, I., Rattei, T., Jensen, L.J., et al. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research 47, D309-D314. 10.1093/nar/gky1085.
58. Rodriguez del Rio, A., Giner-Lamia, J., Cantalapiedra, C.P., Botas, J., Deng, Z., Hernandez-Plaza, A., Munar-Palmer, M., Santamaria-Hernando, S., Rodriguez-Herva, J.J., Ruscheweyh, H.-J., et al. (2023). Functional and evolutionary significance of unknown genes from uncultivated taxa. Nature, 1-3. 10.1038/s41586-023-06955-z.
59. Hurtado-Rios, J.J., Carrasco-Navarro, U., Almanza-Perez, J.C., and Ponce-Alquicira, E. (2022). Ribosomes: The New Role of Ribosomal Proteins as Natural Antimicrobials. Int J Mol Sci 23, 9123. 10.3390/ijms23169123.
60. Shoja, V., and Zhang, L. (2006). A Roadmap of Tandemly Arrayed Genes in the Genomes of Human, Mouse, and Rat. Molecular Biology and Evolution 23, 2134-2141. 10.1093/molbev/msl085.
61. Sukhodolets, V.V. (2006). Unequal crossing-over in Escherichia coli. Russ J Genet 42, 1285-1293. 10.1134/S102279540611010X.
62. Kim, M.K., Kang, T.H., Kim, J., Kim, H., and Yun, H.D. (2012). Evidence Showing Duplication and Recombination of cel Genes in Tandem from Hyperthermophilic Thermotoga sp. Appl Biochem Biotechnol 168, 1834-1848. 10.1007/s12010-012-9901-7.
63. Blaustein, R.A., McFarland, A.G., Ben Maamar, S., Lopez, A., Castro-Wallace, S., and Hartmann, E.M. (2019). Pangenomic Approach To Understanding Microbial Adaptations within a Model Built Environment, the International Space Station, Relative to Human Hosts and Soil. mSystems 4, e00281-18. 10.1128/mSystems.00281-18.
64. Collins, F.W.J., Mesa-Pereira, B., O’Connor, P.M., Rea, M.C., Hill, C., and Ross, R.P. (2018). Reincarnation of Bacteriocins From the Lactobacillus Pangenomic Graveyard. Front Microbiol 9, 1298. 10.3389/fmicb.2018.01298.
65. Simmons, W.L., Daubenspeck, J.M., Osborne, J.D., Balish, M.F., Waites, K.B., and Dybvig, K. (2013). Type 1 and type 2 strains of Mycoplasma pneumoniae form different biofilms. Microbiology (Reading) 159, 737-747. 10.1099/mic.0.064782-0.
66. Diaz, M.H., Desai, H.P., Morrison, S.S., Benitez, A.J., Wolff, B.J., Caravas, J., Read, T.D., Dean, D., and Winchell, J.M. (2017). Comprehensive bioinformatics analysis of Mycoplasma pneumoniae genomes to investigate underlying population structure and type-specific determinants. PLOS ONE 12, e0174701. 10.1371/journal.pone.0174701.
67. Parks, D.H., Rinke, C., Chuvochina, M., Chaumeil, P.-A., Woodcroft, B.J., Evans, P.N., Hugenholtz, P., and Tyson, G.W. (2017). Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nature Microbiology 2, 1533-1542. 10.1038/s41564-017-0012-7.
68. Parks, D.H., Chuvochina, M., Chaumeil, P.-A., Rinke, C., Mussig, A.J., and Hugenholtz, P. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, 1-8. 10.1038/s41587-020-0501-8.
69. Valles-Colomer, M., Blanco-Miguez, A., Manghi, P., Asnicar, F., Dubois, L., Golzato, D., Armanini, F., Cumbo, F., Huang, K.D., Manara, S., et al. (2023). The person-to-person transmission landscape of the gut and oral microbiomes. Nature 614, 125-135. 10.1038/s41586-022-05620-1.
70. Pirtskhalava, M., Amstrong, A.A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D.E., and Tartakovsky, M. (2021). DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Research 49, D288-D297. 10.1093/nar/gkaa991.
71. Wang, G., Li, X., and Wang, Z. (2016). APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 44, D1087-1093. 10.1093/nar/gkv1278.
72. Lifson, S., and Sander, C. (1979). Antiparallel and parallel β-strands differ in amino acid residue preferences. Nature 282, 109-111. 10.1038/282109a0.
73. Derrien, M., Collado, M.C., Ben-Amor, K., Salminen, S., and de Vos, W.M. (2008). The Mucin Degrader Akkermansia muciniphila Is an Abundant Resident of the Human Intestinal Tract. Applied and Environmental Microbiology 74, 1646-1648. 10.1128/AEM.01226-07.
74. Earley, H., Lennon, G., Balfe, A., Coffey, J.C., Winter, D.C., and O’Connell, P.R. (2019). The abundance of Akkermansia muciniphila and its relationship with sulphated colonic mucins in health and ulcerative colitis. Sci Rep 9, 15683. 10.1038/s41598-019-51878-3.
75. Daquigan, N., Seekatz, A.M., Greathouse, K.L., Young, V.B., and White, J.R. (2017). High-resolution profiling of the gut microbiome reveals the extent of Clostridium difficile burden. npj Biofilms Microbiomes 3, 1-8. 10.1038/s41522-017-0043-0.
76. Saenz, C., Fang, Q., Gnanasekaran, T., Trammell, S.A.J., Buijink, J.A., Pisano, P., Wierer, M., Moens, F., Lengger, B., Brejnrod, A., et al. (2023). Clostridium scindens secretome suppresses virulence gene expression of Clostridioides difficile in a bile acid-independent manner. Microbiology Spectrum 11, e03933-22. 10.1128/spectrum.03933-22.
77. Geerlings, S.Y., Kostopoulos, I., De Vos, W.M., and Belzer, C. (2018). Akkermansia muciniphila in the Human Gastrointestinal Tract: When, Where, and How? Microorganisms 6, 75. 10.3390/microorganisms6030075.
78. Cullen, T.W., Schofield, W.B., Barry, N.A., Putnam, E.E., Rundell, E.A., Trent, M.S., Degnan, P.H., Booth, C.J., Yu, H., and Goodman, A.L. (2015). Antimicrobial peptide resistance mediates resilience of prominent gut commensals during inflammation. Science 347, 170-175. 10.1126/science.1260580.
79. Torres, M.D.T., Pedron, C.N., Araujo, I., Silva Jr., P.I., Silva, F.D., and Oliveira, V.X. (2017). Decoralin Analogs with Increased Resistance to Degradation and Lower Hemolytic Activity. Chemistry Select 2, 18-23. 10.1002/slct.201601590.
80. Torres, M.D.T., Pedron, C.N., Higashikuni, Y., Kramer, R.M., Cardoso, M.H., Oshiro, K.G.N., Franco, O.L., Silva Junior, P.I., Silva, F.D., Oliveira Junior, V.X., et al. (2018). Structure-function-guided exploration of the antimicrobial peptide polybia-CP identifies activity determinants and generates synthetic therapeutic candidates. Commun Biol
I, 1-16. 10.1038/s42003-018-0224-2.
81. Silva, O.N., Torres, M.D.T., Cao, J., Alves, E.S.F., Rodrigues, L.V., Resende,
J.M., Liao, L.M., Porto, W.F., Fensterseifer, I. C M , Lu, T.K., et al. (2020). Repurposing a peptide toxin from wasp venom into antiinfectives with dual antimicrobial and immunomodulatory properties. Proc Natl Acad Sci U S A 117, 26936-26945.
10.1073/pnas.2012379117.
82. Morris, F.C., Dexter, C., Kostoulias, X., Uddin, M.I., and Peleg, A.Y. (2019). The Mechanisms of Disease Caused by Acinetobacter baumannii. Frontiers in Microbiology 10.
83. Petruschke, H., Schori, C., Canzler, S., Riesbeck, S., Poehlein, A., Daniel, R., Frei, D., Segessemann, T., Zimmerman, J., Marinos, G., et al. (2021). Discovery of novel community -relevant small proteins in a simplified human intestinal microbiome. Microbiome 9, 55. 10.1186/s40168-020-00981-z.
84. Washietl, S., FindeiB, S., Muller, S.A., Kalkhof, S., Bergen, M. von, Hofacker, I.L., Stadler, P.F., and Goldman, N. (2011). RNAcode: Robust discrimination of coding and noncoding regions in comparative sequence data. RNA 77, 578-594. 10.1261/ma.2536111.
85. Galzitskaya, O.V. (2021). Exploring Amyloidogenicity of Peptides From Ribosomal SI Protein to Develop Novel AMPs. Front Mol Biosci 8, 705069. 10.3389/fmolb.2021.705069.
86. Ochman, H., Lawrence, J.G., and Groisman, E.A. (2000). Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299-304. 10.1038/35012500.
87. Zheng, D., and Gerstein, M.B. (2007). The ambiguous boundary between genes and pseudogenes: the dead rise up, or do they? Trends in Genetics 23, 219-224.
10.1016/j .tig.2007.03.003.
88. Lazzaro, B.P., Zasloff, M., and Rolff, J. (2020). Antimicrobial peptides: Application informed by evolution. Science 368, eaau5480. 10. 1126/science.aau5480.
89. Sun, S., Wang, H., Howard, A.G., Zhang, J., Su, C., Wang, Z., Du, S., Fodor, A.A., Gordon-Larsen, P., and Zhang, B. (2022). Loss of Novel Diversity in Human Gut Microbiota Associated with Ongoing Urbanization in China. mSystems 7, e00200-22.
10.1128/msy stems.00200-22.
90. Piquer-Esteban, S., Ruiz-Ruiz, S., Arnau, V., Diaz, W., and Moya, A. (2022). Exploring the universal healthy human gut microbiota around the World. Computational and Structural Biotechnology Journal 20, 421 433 10.1016/j.csbj.2021.12.035.
91. Dhakan, D.B., Maji, A., Sharma, A.K., Saxena, R., Pulikkan, J., Grace, T., Gomez, A., Scaria, J., Amato, K.R., and Sharma, V.K. (2019). The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches. GigaScience 8, giz004. 10.1093/gigascience/giz004.
92. Venturini, E., Svensson, S.L., MaaB, S., Gelhausen, R., Eggenhofer, F., Li, L., Cain, A.K., Parkhill, J., Becher, D., Backofen, R., et al. (2020). A global data-driven census of Salmonella small proteins and their potential functions in bacterial virulence. microLife 1, uqaa002. 10.1093/femsml/uqaa002. 93. Aguilera-Mendoza, L., Marrero-Ponce, Y., Beltran, J. A., Tellez Ibarra, R., Guillen-Ramirez, H A., and Brizuela, C.A. (2019). Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 35, 4739-4747. 10.1093/bioinformatics/btz260.
94. Heintz-Buschart, A., May, P., Laczny, C.C., Lebrun, L.A., Bellora, C., Krishna, A., Wampach, L., Schneider, J.G., Hogan, A., de Beaufort, C., et al. (2016). Integrated multi-omics of the human gut microbiome in a case study of familial type 1 diabetes. Nat Microbiol 2, 16180. 10.1038/nmicrobiol.2016.180.
95. Micsonai, A., Moussong, E., Wien, F., Boros, E., Vadaszi, H., Murvai, N., Lee, Y.-H., Molnar, T., Refregiers, M., Goto, Y., et al. (2022). BeStSel: webserver for secondary structure and fold prediction for protein CD spectroscopy. Nucleic Acids Research 50, W90-W98. 10.1093/nar/gkac345.
96. Coelho, L.P., Alves, R., Monteiro, P., Huerta-Cepas, J., Freitas, A T., and Bork, P. (2019). NG-meta-profiler: fast processing of metagenomes using NGLess, a domainspecific language. Microbiome 7, 84. 10.1186/s40168-019-0684-8.
97. Coelho, L.P. (2017). Jug: Software for Parallel Reproducible Computation in Python. Journal of Open Research Software 5, 30. 10.5334/j ors.161.
98. Fu, L., Niu, B., Zhu, Z , Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152.
10.1093/bioinformatics/bts565.
99. Steinegger, M., and Sbding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026- 1028. 10.1038/nbt.3988.
100. Van Rossum, G. (2020). Python Release Python 3.8.2. Python.org. https://www.python.org/downloads/release/python-382/.
101. Hunter, J.D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science Engineering 9, 90-95. 10.1109/MCSE.2007.55. 102. Harris, C.R., Millman, KJ., van der Walt, S ., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., et al. (2020). Array programming with NumPy. Nature 585, 357-362. 10.1038/s41586-020-2649-2.
103. McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 56-61. 10.25080/Majora-92bfl922- 00a.
104. Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W ., Bright, J., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17, 261-272. 10.1038/s41592-019-0686-2.
105. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine Learning in Python. MACHINE LEARNING IN PYTHON 12, 2825-2830.
106. The scikit-bio development team (2020). scikit-bio: A Bioinformatics Library for Data Scientists, Students, and Developers. Version 0.5.5.
107. Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422-1423. 10.1093/bioinformatics/btpl63.
108. Cantalapiedra, C.P., Hernandez -Pl aza, A., Letunic, I., Bork, P., and Huerta- Cepas, J. (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol 38, 5825-5829.
10.1093/molbev/msab293.
109. Eddy, S.R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology 7, el002195. 10.1371/joumal.pcbi.1002195.
110. Price, M.N., Dehal, P.S., and Arkin, A.P. (2010). FastTree 2 - Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 5, e9490.
10.1371/joumal. pone.0009490. 111. Jain, C., Rodriguez-R, L.M., Phillippy, A.M., Konstantinidis, K.T., and Alum, S. (2018). High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114. 10.1038/s41467-018-07641-9.
112. Li, D., Luo, R., Liu, C.M., Leung, C.M., Ting, H.F., Sadakane, K., Yamashita, H., and Lam, T.W. (2016). MEGAHIT vl.O: A fast and scalable metagenome assembler driven by advanced methodologies and community practices. 102, 3-11. 10.1016/j.ymeth.2016.02.020.
113. Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760. 10.1093/bioinformatics/btp324.
114. Seabold, S., and Perktold, J. (2010). Statsmodels: Econometric and Statistical Modeling with Python. Proceedings of the 9th Python in Science Conference, 92-96.
10.25080/Maj ora-92bf 1922-011.
115. Milanese, A., Mende, D.R., Paoli, L., Salazar, G., Ruscheweyh, H.-J., Cuenca, M., Hingamp, P., Alves, R., Costea, P.I., Coelho, L.P., et al. (2019). Microbial abundance, activity and population genomic profiling with mOTUs2. Nat Commun 10, 1014.
10.1038/s41467-019-08844-4.
116. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.
10.1093/bioinformatics/btp352.
117. Quinlan, A.R., and Hall, I.M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842. 10.1093/bioinformatics/btq033.
118. Sievers, F., Wilm, A., Dineen, D., Gibson, T.J., Karplus, K., Li, W ., Lopez, R., McWilliam, H., Remmert, M., Sbding, J., et al. (2011). Fast, scalable generation of high- quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7, 539. 10.1038/msb.2011.75.
119. Buchfink, B., Xie, C., and Huson, D.H. (2014). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59-60. 10.1038/nmeth.3176. 120. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421. 10.1186/1471-2105-10-421.
121. The UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research 49, D480-D489. 10.1093/nar/gkaal 100.
122. Mistry, J., Chuguransky, S., Williams, L , Qureshi, M., Salazar, G.A., Sonnhammer, E.L.L., Tosatto, S.C.E., Paladin, L., Raj, S., Richardson, L.J., et al. (2021). Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412-D419. 10.1093/nar/gkaa913.
123. Eberhardt, R.Y., Haft, D.H., Punta, M., Martin, M., O’Donovan, C., and Bateman, A. (2012). AntiFam: a tool to help identify spurious ORFs in protein annotation. Database (Oxford) 2012, bas003. 10.1093/database/bas003.
124. NCBI Resource Coordinators (2015). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 43, D6-D17. 10.1093/nar/gkul l30.
125. Alcock, B.P., Raphenya, A.R., Lau, T.T.Y., Tsang, K.K., Bouchard, M., Edalatmand, A., Huynh, W ., Nguyen, A.-L.V., Cheng, A.A., Liu, S., et al. (2020). CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res 48, D517-D525. 10.1093/nar/gkz935.
126. Kanehisa, M., and Sato, Y. (2020). KEGG Mapper for inferring cellular functions from protein sequences. Protein Sci. 29, 28-35. 10.1002/pro.3711.
127. Courtot, M., Cherubin, L., Faulconbridge, A., Vaughan, D., Green, M., Richardson, D., Harrison, P., Whetzel, P.L., Parkinson, H., and Burdett, T. (2019). BioSamples database: an updated sample metadata hub. Nucleic Acids Research 47, DI 172- D1178. 10.1093/nar/gkyl061.
128. Harrison, P.W., Ahamed, A., Aslam, R., Alako, B.T.F., Burgin, J., Buso, N., Courtot, M., Fan, J., Gupta, D., Haseeb, M., et al. (2021). The European Nucleotide Archive in 2020. Nucleic Acids Research 49, D82-D85. 10.1093/nar/gkaal028.
129. Jones, P., Cote, R.G., Martens, L., Quinn, A.F., Taylor, C.F., Derache, W ., Hermjakob, H., and Apweiler, R. (2006). PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res 34, D659-663. 10.1093/nar/gkj l38.
130. Schmidt, T.S.B., Fullam, A., Ferretti, P., Orakov, A., Maistrenko, O.M., Ruscheweyh, H.-J., Letunic, I., Duan, Y., Van Rossum, T., Sunagawa, S., et al. (2024). SPIRE: a Searchable, Planetary-scale microbiome REsource. Nucleic Acids Research 52, D777-D783. 10.1093/nar/gkad943.
131. Mirdita, M., Steinegger, M., Breitwieser, F., Soding, J., and Levy Karin, E. (2021). Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics. 10.1093/bioinformatics/btabl84.
132. Oren, A., Arahal, D.R., Rossello-Mora, R., Sutcliffe, I.C., and Moore, E.R.B. (2021). Emendation of Rules 5b, 8, 15 and 22 of the International Code of Nomenclature of Prokaryotes to include the rank of phylum. Int J Syst Evol Microbiol 71.
10.1099/ijsem.0.004851.
133. Oren, A., and Garrity, G.M. (2021). Valid publication of the names of forty- two phyla of prokaryotes. Int J Syst Evol Microbiol 71. 10.1099/ijsem.0.005056.
134. Solis, A.D. (2015). Amino acid alphabet reduction preserves fold information contained in contact interactions in proteins. Proteins: Structure, Function, and Bioinformatics 83, 2198-2216. 10.1002/prot.24936.
135. Peterson, E.L., Kondev, J., Theriot, J. A., and Phillips, R. (2009). Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 25, 1356-1362. 10.1093/bioinformatics/btpl64.
136. Smith, T.F., and Waterman, M.S. (1981). Identification of Common Molecular Subsequences. J. Mol. Biol. 147, 195-197. 10.1016/0022-2836(81)90087-5.
137. Karlin, S., and Altschul, S.F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 87, 2264-2268. 10.1073/pnas.87.6.2264.
138. Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W ., and Lipman, D. (1997). Gapped BLAST and PSLBLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402. 139. Cena, J. A. de, Zhang, J., Deng, D., Dame-Teixeira, N., and Do, T. (2021). Low- Abundant Microorganisms: The Human Microbiome’s Dark Mater, a Scoping Review. Frontiers in Cellular and Infection Microbiology 11.
140. Mende, D.R., Sunagawa, S., Zeller, G., and Bork, P. (2013). Accurate and universal delineation of prokaryotic species. Nat Methods 10, 881-884. 10.1038/nmeth.2575.
141. Selem-Mojica, N., Aguilar, C., Gutierrez -Garcia, K., Martinez-Guerrero, C.E., and Barona-Gomez, F. (2019). EvoMining reveals the origin and fate of natural product biosynthetic enzymes. Microb Genom 5, e000260. 10.1099/mgen.0.000260.
142. Rodriguez-R, L.M., Conrad, R.E., Viver, T., Feistel, D.J., Lindner, B.G., Venter, S.N., Orellana, L.H., Amann, R., Rossello-Mora, R., and Konstantinidis, K.T. (2023). An ANI gap within bacterial species that advances the definitions of intra-species units. mBio 75, e02696-23. 10.1128/mbio.02696-23.
143. Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Mistry, J., Mitchell, A.L., Potter, S.C., Punta, M., Qureshi, M., Sangrador- Vegas, A., et al. (2016). The Pfam protein families database: towards a more sustainable future. Nucleic acids research 44, D279-85. 10.1093/nar/gkvl344.
144. SolyPep: a fast generator of soluble peptides https://bioserv.rpbs.univ -paris- diderot.fr/services/SolyPep/.
145. Ochoa, R., and Cossio, P. (2021). PepFun: Open Source Protocols for Peptide- Related Computational Analysis. Molecules 26, 1664. 10.3390/molecules26061664.
146. Kochendoerfer, G.G., and Kent, S B. (1999). Chemical protein synthesis. Curr Opin Chem Biol 3, 665-671. 10.1016/sl367-5931(99)00024-1.
147. Sheppard, R. (2003). The fluorenylmethoxycarbonyl group in solid phase synthesis. J Pept Sci 9, 545-552. 10.1002/psc.479.
148. Palomo, J.M. (2014). Solid-phase peptide synthesis: an overview focused on the preparation of biologically relevant peptides. RSC Adv. 4, 32658-32672.
10.1039/C 4RA02458C.
149. Schmidt, T.S.B., Li, S.S., Maistrenko, O.M., Akanni, W., Coelho, L.P., Dolai, S., Fullam, A., Glazek, A.M., Hercog, R., Herrema, H., et al. (2022). Drivers and determinants of strain dynamics following fecal microbiota transplantation. Nat Med 28, 1902-1912. 10.1038/s41591 -022-01913-0.
150. Wiegand, I., Hilpert, K., and Hancock, R.E.W. (2008). Agar and broth dilution methods to determine the minimal inhibitory concentration (MIC) of antimicrobial substances. Nat Protoc 3, 163-175. 10.1038/nprot.2007.521.
151. Santos-Junior, C.D., Schmidt, T.S.B., Fullam, A., Duan, Y , Bork, P., Zhao, X.-M., and Coelho, L.P. (2021). AMPSphere : the worldwide survey of prokaryotic antimicrobial peptides. (Zenodo). 10.5281/zenodo.4606582 10.5281/zenodo.4606582.

Claims

What is claimed:
1. A method for forming a database-assisted platform providing one or more functional and/or physicochemical features of respective metagenomic-derived candidate antimicrobial peptides (AMPs), comprising: selecting one or more genomes or metagenomes for inclusion in the platform; using an NGS assembler to assemble reads in order to identify contigs from the genomes or metagenomes; from the identified contigs, predicting small open reading frames (smORFs); removing duplicate smORFs to yield non-redundant smORFs; and, predicting candidate AMPs from the non-redundant smORFs.
2. The method according to claim 1, wherein selection of the one or more genomes or metagenomes is according to criteria (i) whereby the genome or metagenome is tagged with taxonomy ID 408169 (for metagenome) or is a descendent of it in a taxonomic tree, (ii) whereby experiments with the genome or metagenome are listed as “METAGENOMIC”, or both (i) and (ii).
3. The method according to claim 1, wherein metadata is curated from the one or more genomes or metagenomes to create groups based on similarity of habitat conditions.
4. The method according to claim 3, wherein the habitat conditions include one or more of air, anthropogenic, aquatic, host-associated, alkaline pH, sediment, or terrestrial.
5. The method according to any preceding claim, wherein the selection of the one or more genomes or metagenomes comprises assessing sample origin or other information relating to host species using an NCBI taxonomic identification number.
6. The method according to any preceding claim, further comprising processing the assembled reads by trimming positions with a quality lower than a desired number, such as about 25, and discarding reads shorter than a specified number of base pairs, such as about 50-100, preferably about 60 base pairs, post trimming.
7. The method according to any preceding claim, wherein the NGS assembler is optimized for metagenomes.
8. The method according to any preceding claim, wherein the smORFs are predicted from the identified contigs using prokaryotic gene recognition and translation initiation site identification.
9. The method according to any preceding claim, wherein the candidate AMPs are predicted from the non-redundant smORFs using metagenomic AMP classification and retrieval.
10. The method according to claim 9, wherein the metagenomic AMP classification and retrieval is Macrel.
11. The method according to any preceding claim, further comprising removing singleton sequences from the candidate AMPs.
12. The method according to claim 11, wherein the singleton sequences are not removed if they match a sequence from a data repository of antimicrobial peptides.
13. The method according to any preceding claim, wherein candidate AMPs originating from a genomic database are assigned a taxonomy from the original genome.
14. The method according to any preceding claim, wherein candidate AMPs originating from a metagenome are assigned a taxonomy predicted for the contig in which the candidate AMP was found.
15. The method according to any preceding claim, further comprising identifying potential structural configuration of a candidate AMP by using a secondary structure function from a calculation of a fraction of amino acids in the candidate AMP that tend to assume conformations of helix, turn, or sheet.
16. The method according to any preceding claim, further comprising hierarchically clustering the candidate AMPs using a reduced amino acid alphabet and an identity cutoff of a preselected percentage.
17. The method according to claim 16, wherein the reduced amino acid alphabet includes 6-10 amino acids, preferably 8 amino acids.
18. The method according to claim 16 or claim 17, wherein the identity cutoff is 75%, 85%, or 100%, preferably 75%.
19. A method of treating a microbial infection in a subject comprising administering to the subject a therapeutically effective amount of a candidate AMP that has been identified according to the method of any one of claims 1-18.
20. The method according to claim 19, wherein the candidate AMP is any one of SEQ ID NOS: 1-100.
21. A method comprising contacting a biofilm with an effective amount of a candidate AMP that has been identified according to the method of any one of claims 1-18.
22. The method according to claim 21, wherein the candidate AMP is any one of SEQ ID NOS: 1-100.
23. A composition comprising a candidate AMP that has been identified according to the method of any one of claims 1-18 and a pharmaceutically acceptable carrier, diluent, or excipient.
24. The method according to claim 23, wherein the candidate AMP is any one of SEQ ID NOS: 1-100.
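
By way of non-limiting illustration, several of the post-prediction steps recited above (removing duplicate smORFs in claim 1, singleton handling in claims 11-12, the secondary structure fractions of claim 15, and the reduced amino acid alphabet of claims 16-17) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the disclosed platform: all function names are hypothetical, a "singleton" is taken to mean a peptide observed only once, and the 8-group alphabet and the helix/turn/sheet residue classes are illustrative choices that the claims do not specify.

```python
from collections import Counter

# Illustrative 8-group reduced alphabet (an assumption; the claims only recite 6-10 letters).
REDUCED_GROUPS = ["LVMIC", "AG", "ST", "P", "FWY", "EDNQ", "KR", "H"]
# Map every residue to the first letter of its group.
REDUCE = {aa: group[0] for group in REDUCED_GROUPS for aa in group}

# Illustrative residue classes for the secondary structure fractions (an assumption).
HELIX, TURN, SHEET = set("VIYFWL"), set("NPGS"), set("EMAL")


def count_smorfs(smorfs):
    """Collapse duplicate smORF peptides, remembering how often each was seen."""
    return Counter(s.upper() for s in smorfs)


def drop_singletons(counts, known_amps):
    """Keep peptides seen more than once, or seen once but present in known_amps."""
    known = {k.upper() for k in known_amps}
    return sorted(p for p, n in counts.items() if n > 1 or p in known)


def structure_fractions(peptide):
    """Fraction of residues in each assumed class, returned as (helix, turn, sheet)."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    n = len(peptide)
    return tuple(sum(aa in cls for aa in peptide) / n for cls in (HELIX, TURN, SHEET))


def reduce_alphabet(peptide):
    """Recode a peptide into the reduced alphabet; unknown residues become 'X'."""
    return "".join(REDUCE.get(aa, "X") for aa in peptide)


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length peptides")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    smorfs = ["KKLLKR", "KKLLKR", "RRIVRK", "GGSP"]  # made-up toy peptides
    counts = count_smorfs(smorfs)
    print(drop_singletons(counts, known_amps=["GGSP"]))
    print(structure_fractions("GGSP"))
    print(identity("KKLLKR", "RRIVRK"),
          identity(reduce_alphabet("KKLLKR"), reduce_alphabet("RRIVRK")))
```

In this toy input the two peptides KKLLKR and RRIVRK share no identical positions in the full alphabet, yet they recode to the same reduced string, which is the kind of conservative substitution a reduced alphabet is meant to absorb before an identity cutoff is applied. Actual clustering at the recited cutoffs would rely on an alignment-based clustering tool rather than the position-wise comparison used here.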

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363579834P 2023-08-31 2023-08-31
US63/579,834 2023-08-31

Publications (1)

Publication Number Publication Date
WO2025050118A1 (en) 2025-03-06

Family

ID=94820524

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/045022 Pending WO2025050118A1 (en) 2023-08-31 2024-09-03 Computational exploration of the global microbiome for antibiotic discovery

Country Status (1)

Country Link
WO (1) WO2025050118A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357375A1 (en) * 2017-04-04 2018-12-13 Whole Biome Inc. Methods and compositions for determining metabolic maps

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SANTOS-JÚNIOR CÉLIO DIAS, PAN SHAOJUN, ZHAO XING-MING, COELHO LUIS PEDRO: "Macrel: antimicrobial peptide screening in genomes and metagenomes", PEERJ, vol. 8, pages e10555, XP055902788, DOI: 10.7717/peerj.10555 *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24861280

Country of ref document: EP

Kind code of ref document: A1