
WO2025096827A2 - Methods of predicting response to a therapy for a disorder via core microbiome guilds - Google Patents

Methods of predicting response to a therapy for a disorder via core microbiome guilds

Info

Publication number
WO2025096827A2
Authority
WO
WIPO (PCT)
Prior art keywords
gut
subject
therapy
microorganism
microorganisms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/053959
Other languages
English (en)
Other versions
WO2025096827A3 (fr
Inventor
Liping Zhao
Guojun WU
Chenhong ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Rutgers State University of New Jersey
Original Assignee
Shanghai Jiao Tong University
Rutgers State University of New Jersey
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University and Rutgers State University of New Jersey
Publication of WO2025096827A2
Publication of WO2025096827A3
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H20/10 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • C12Q1/689 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/106 Pharmacogenomics, i.e. genetic variability in individual responses to drugs and drug metabolism

Definitions

  • this core microbiome is akin to that of an essential organ, underscoring its criticality in overall health management.
  • the demarcation of the core microbiome has predominantly rested upon the evaluation of the presence or absence, supplemented by the quantification of the abundance or prevalence, of specific taxa or genes/pathways within a cohort of healthy individuals. While these methodologies have undoubtedly provided significant insights into the structural configuration and potential functional traits of the microbiome, they may inadequately represent the vital ecological interactions that underpin the stability and resilience of this intricate system. This oversight is particularly relevant when considering the critical role these interactions play in the inception, progression, and remission of various disease states.
  • [0006] As a complex adaptive system (CAS), the microbiome adheres to the modular design principle.
  • Integral components of a CAS are organized into modules, which interconnect to establish a network.
  • individual microbes are integrated into a modular structure referred to as guilds.
  • Each guild, despite comprising microorganisms of diverse taxonomic backgrounds, functions as a coherent functional unit or module within the microbiome's CAS.
  • Members of a guild display cooperative behavior through co-abundance, and different guilds may engage in cooperative or competitive interactions to shape an ecological network. Consequently, the characterization of the core microbiome in terms of guilds emerges as a promising and interesting approach.
  • SUMMARY
  • [0007] Throughout its co-evolution with the human host, the gut microbiota has established a vital role in sustaining human health.
  • HQMAGs high-quality metagenome-assembled genomes
  • This methodology involved detecting stable relationships among HQMAGs across varying conditions, with environmental perturbations to the gut ecosystem being introduced via dietary interventions or disease progression. These stable relationships can unveil the core members of the microbiome. This aligns with a foundational principle of systems biology, whereby relationship stability often signifies pivotal system components. In the context of the gut microbiome, these core components are likely to execute essential functions contributing to system resilience and host health, demanding their persistent presence and predictable interaction patterns. Therefore, uncovering these stable relationships could disclose these critical microbial components, potentially exposing the backbone of the ecological network conserved within the gut microbiome, across individuals, populations, or health states.
  • [0010] A robust seesaw-like network comprising two competing bacterial guilds was identified.
  • This network was discerned by searching for stable genome pairs across co-abundance networks among individuals pre- and post-high-fiber intervention (the QD trial, FIG. 1A), or between healthy and diseased cohorts.
  • PATENT APPLICATION Attorney Docket No.: 126146-5003-WO
  • This seesaw-like network embodies both cooperative and competitive interactions, potentially indicating a key feature of a stable microbiome structure.
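The search for stable genome pairs described above can be sketched as follows. This is a minimal illustration, not the applicants' pipeline: the use of Pearson correlation, the `min_abs_r=0.6` cutoff, the function names, and the toy abundance values are all assumptions made for the example. A pair is treated as stable when it is correlated with the same sign in every co-abundance network (for instance, before and after an intervention).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def edge_signs(abundance, min_abs_r=0.6):
    """Build a signed co-abundance network: {(genome_a, genome_b): +1 or -1}.

    min_abs_r is a hypothetical cutoff chosen for this sketch.
    """
    edges = {}
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if abs(r) >= min_abs_r:
            edges[(a, b)] = 1 if r > 0 else -1
    return edges


def stable_pairs(networks):
    """Keep pairs that appear with the same sign in every network."""
    first, *rest = networks
    return {pair: sign for pair, sign in first.items()
            if all(net.get(pair) == sign for net in rest)}


# Toy abundance profiles (genome -> values across subjects), invented for illustration.
before = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.1, 6.2, 7.9],
    "G3": [4.0, 3.1, 2.0, 1.2],
    "G4": [1.0, 3.0, 2.0, 2.5],
}
after = {
    "G1": [2.0, 1.0, 4.0, 3.0],
    "G2": [4.2, 2.1, 8.0, 6.1],
    "G3": [2.9, 4.0, 1.0, 2.2],
    "G4": [3.0, 1.0, 2.0, 2.5],
}
stable = stable_pairs([edge_signs(before), edge_signs(after)])
```

In this toy case G1 and G2 co-vary positively in both conditions while each is negatively related to G3, which is the kind of seesaw relationship the text describes; G4 has no edge that survives the cutoff and drops out.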
  • the HQMAGs identified within this novel core microbiome demonstrated correlations with various clinical parameters in patients with type 2 diabetes mellitus (T2DM) undergoing a high fiber intervention.
  • T2DM type 2 diabetes mellitus
  • a universal machine learning model premised on these HQMAGs in the seesaw-networked core microbiome successfully differentiated cases from controls in 26 independent datasets spanning 15 different diseases.
  • HQMAGs supported a machine learning model for predicting personalized treatment responses to immunotherapy in patients with cancer or autoimmune diseases.
  • the disclosure introduces a novel conceptual and analytical paradigm for studying the core gut microbiome. This paradigm provides enhanced health maintenance strategies and disease management, enabling personalized interventions that accommodate the intricate interplay of microbial relationships within the gut ecosystem.
  • [0011] Accordingly, one aspect of the present disclosure provides methods and systems for training a model for predicting a subject’s response to a therapy.
  • the method includes, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors, obtaining, in electronic form, for each respective training subject in a plurality of training subjects, wherein each respective training subject in the plurality of training subjects has received a therapy for a disorder: (i) a corresponding plurality of genomic abundance values for the respective training subject at a time prior to receiving the therapy, wherein the corresponding plurality of genomic abundance values comprise, for each respective gut microorganism in a plurality of gut microorganisms, a corresponding value for the abundance of the genome of the respective gut microorganism in a corresponding biological sample from the gut of the respective training subject, and (ii) an indication of the respective training subject’s response to the therapy of the respective training subject.
  • the method also includes inputting, for each respective training subject in the plurality of training subjects, information about the respective training subject into a model comprising a plurality of parameters, wherein the model applies the plurality of parameters to the information through at least 10,000 computations to obtain a corresponding output for the respective training subject from the model, wherein the corresponding output comprises a prediction of the respective training subject’s response to the therapy, the information about the respective training subject comprises the corresponding genomic abundance value for each respective gut microorganism in the plurality of gut microorganisms, and the plurality of gut microorganisms are selected from Table 1, Table 2, or Figure 13A-13XX.
  • the method also includes adjusting the plurality of parameters based on, for each respective training subject in the plurality of training subjects, one or more differences between (i) the corresponding output from the model and (ii) the corresponding indication of the respective training subject’s response to the therapy.
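The training loop described above (input abundance values, obtain a predicted response, adjust parameters from the difference between prediction and observed response) can be illustrated with a small logistic regression trained by gradient descent. This is a sketch under stated assumptions: the disclosure lists many possible model types and does not prescribe this one, and the learning rate, epoch count, two-genome feature vectors, and labels below are invented. It is far smaller than the parameter and computation counts recited in the text.

```python
from math import exp


def predict_proba(weights, bias, features):
    """Probability of a favorable response for one subject."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-z))


def train(samples, labels, epochs=500, lr=0.5):
    """Fit weights by batch gradient descent on the log loss.

    epochs and lr are hypothetical settings chosen for this toy example.
    """
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            # Difference between model output and the observed response.
            diff = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += diff * xj
            grad_b += diff
        # Adjust the parameters against the averaged gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy pre-therapy abundances for two genomes per training subject (invented values).
X = [[0.8, 0.1], [0.7, 0.2], [0.9, 0.1], [0.2, 0.7], [0.1, 0.9], [0.3, 0.8]]
y = [1, 1, 1, 0, 0, 0]  # 1 = responded, 0 = did not respond

weights, bias = train(X, y)
predictions = [predict_proba(weights, bias, x) for x in X]
```

The same `predict_proba` call covers both output styles mentioned later in the text: the raw value is a probability output, and comparing it with a cutoff gives a class output.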
  • the method includes, at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors, obtaining, in electronic form, a plurality of genomic abundance values comprising, for each respective gut microorganism in a plurality of gut microorganisms selected from Table 1, Table 2, or Figure 13A-13XX, a corresponding abundance value for the genome of the respective species of gut microorganisms, in the plurality of gut microorganisms, in a biological sample from the subject.
  • the method also includes inputting the plurality of genomic abundance values into a model comprising a plurality of parameters, wherein the model applies the plurality of parameters to the plurality of genomic abundance values through at least 10,000 computations to generate as output from the model a prediction of the subject’s response to the therapy.
  • one aspect of the invention provides a method of training a model for predicting subject response to a therapy at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.
  • the method includes obtaining, in electronic form, for each respective training subject in a plurality of training subjects, wherein each respective training subject in the plurality of training subjects has received a therapy for a disorder, (i) a corresponding plurality of genomic abundance values for the respective training subject at a time prior to receiving the therapy, wherein the corresponding plurality of genomic abundance values comprise, for each respective gut microorganism in a plurality of gut microorganisms, a corresponding value for the abundance of the genome of the respective gut microorganism in a corresponding biological sample from the gut of the respective training subject, (ii) an indication of the respective training subject’s response to the therapy of the respective training subject.
  • the method includes sequencing, for each respective subject in the plurality of training subjects, genomic DNA from the corresponding biological sample from the gut of the respective training subject, thereby obtaining a corresponding plurality of at least 100,000 nucleic acid sequences.
  • the method includes obtaining, for each respective training subject in the plurality of training subjects, in electronic form, a corresponding plurality of at least 100,000 nucleic acid sequences for genomic DNA from the corresponding biological sample from the gut of the respective training subject.
  • the method includes determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism from the corresponding plurality of at least 100,000 nucleic acid sequences.
  • the method includes, for each respective training subject in the plurality of training subjects, assembling, in electronic form, a corresponding plurality of gut microorganism genomes by metagenomic de novo sequence assembly from the corresponding plurality of at least 100,000 nucleic acid sequences, and calculating, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism based on the prevalence of respective nucleic acid sequences, in the plurality of at least 100,000 nucleic acid sequences, used to assemble a respective gut microorganism genome in the plurality of gut microorganism genomes corresponding to the respective gut microorganism.
  • the method includes, for each respective subject in the plurality of training subjects, assigning each respective nucleic acid sequence in the corresponding plurality of at least 100,000 sequences to a respective gut microorganism in the plurality of gut microorganisms, thereby generating, for each respective gut microorganism in the plurality of gut microorganisms, a corresponding count of respective nucleic acid sequences in the corresponding plurality of nucleic acid sequences assigned to the respective gut microorganism, and determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding genomic abundance value for the respective gut microorganism based on the corresponding count of respective nucleic acid sequences assigned to the respective gut microorganism.
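The count-based abundance determination above can be sketched in a few lines. The text only says the abundance value is "based on" the per-genome read count; normalizing counts to relative abundance, and ignoring reads assigned to nothing in the genome panel, are assumptions of this example, as are the function name and toy read assignments.

```python
from collections import Counter


def genomic_abundance(read_assignments, genomes):
    """Relative abundance per genome from per-read genome assignments.

    read_assignments: iterable with one genome label per sequenced read.
    genomes: the panel of genomes for which abundance values are wanted.
    Reads assigned outside the panel are ignored (an assumption of this sketch).
    """
    panel = set(genomes)
    counts = Counter(g for g in read_assignments if g in panel)
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in genomes}


# Toy example: 12 reads, 10 of which map to the three-genome panel.
reads = ["G1"] * 6 + ["G2"] * 3 + ["G3"] * 1 + ["unassigned"] * 2
abundance = genomic_abundance(reads, ["G1", "G2", "G3"])
```

A real workflow would typically also account for genome length and sequencing depth; those corrections are left out here to keep the counting step visible.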
  • the plurality of gut microorganisms comprises at least 20 gut microorganisms selected from Table 1, Table 2, or Figure 13A-13XX.
  • [0023] In some such embodiments, the plurality of gut microorganisms comprises at least 20 microorganisms selected from those microorganisms in Table 1, Table 2, or Figure 13A-13XX having a connectivity of at least 2.
  • [0024] In some such embodiments, the biological sample from the gut of the respective training subject is a fecal sample from the respective training subject.
  • the therapy is a biological therapy, an immunotherapy, a chemotherapy, a radiation therapy, a gene therapy, a hormone therapy, a photodynamic therapy, a targeted therapy, small molecules, antibodies, polynucleotides, natural compounds, immune modulators, bone marrow therapy, stem cell therapy, surgical therapy, induction therapy, maintenance therapy, or a combination thereof.
  • the disorder is selected from the group consisting of type 2 diabetes, hypertension, schizophrenia, atherosclerotic cardiovascular disease (ACVD), liver cirrhosis (LC), inflammatory bowel disease (IBD), colorectal cancer (CRC), ankylosing spondylitis (AS), Parkinson’s disease (PD), rheumatoid arthritis (RA), advanced melanoma, and B cell lymphoma.
  • the disorder is cancer.
  • the method includes inputting, for each respective training subject in the plurality of training subjects, information about the respective training subject into a model comprising a plurality of parameters, wherein the model applies the plurality of parameters to the information through at least 10,000 computations to obtain a corresponding output for the respective training subject from the model, wherein the corresponding output comprises a prediction of the respective training subject’s response to the therapy, and the information about the respective training subject comprises the corresponding genomic abundance value for each respective gut microorganism in the plurality of gut microorganisms, and the plurality of gut microorganisms are selected from Table 1, Table 2, or Figure 13A-13XX.
  • the prediction of the respective training subject’s response is a class output of a respective response, in a plurality of possible responses, of the respective training subject.
  • the prediction of the respective training subject’s response is a probability output for the respective training subject’s response.
  • the model is a neural network algorithm, a support vector machine algorithm, a Naive Bayes algorithm, a nearest neighbor algorithm, a boosted trees algorithm, a Random Forest algorithm, a convolutional neural network algorithm, a decision tree algorithm, a regression algorithm, or a clustering algorithm.
  • the plurality of parameters is at least 1000, at least 10,000, at least 15,000, at least 50,000, at least 100,000, at least 250,000, at least 500,000, or at least 1,000,000 parameters.
  • the model applies the plurality of parameters to the information through at least 25,000, at least 50,000, at least 100,000, at least 250,000, at least 500,000, or at least 1,000,000 computations to obtain a corresponding output for the respective training subject from the model.
  • the method includes adjusting the plurality of parameters based on, for each respective training subject in the plurality of training subjects, one or more differences between (i) the corresponding output from the model and (ii) the corresponding indication of the respective training subject’s response to the therapy.
  • Another aspect of the present disclosure provides a method of using a model for predicting a subject’s response to a therapy for a disorder at a computer system having one or more processors, and memory storing one or more programs for execution by the one or more processors.
  • the method includes obtaining, in electronic form, a plurality of genomic abundance values comprising, for each respective gut microorganism in a plurality of gut microorganisms selected from Table 1, Table 2, or Figure 13A-13XX, a corresponding abundance value for the genome of the respective species of gut bacteria, in the plurality of gut microorganisms, in a biological sample from the subject.
  • the method includes sequencing genomic DNA from the biological sample from the gut of the subject, thereby obtaining the plurality of at least 100,000 nucleic acid sequences.
  • the method includes obtaining, in electronic form, a plurality of at least 100,000 nucleic acid sequences for genomic DNA from the biological sample from the gut of the subject.
  • the method includes determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism from the plurality of at least 100,000 nucleic acid sequences.
  • the method includes assembling, in electronic form, a corresponding plurality of gut microorganism genomes by metagenomic de novo sequence assembly from the plurality of at least 100,000 nucleic acid sequences, and calculating, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism based on the prevalence of respective nucleic acid sequences, in the plurality of at least 100,000 nucleic acid sequences, used to assemble a respective gut microorganism genome in the plurality of gut microorganism genomes corresponding to the respective gut microorganism.
  • the method includes assigning each respective nucleic acid sequence in the plurality of at least 100,000 sequences to a respective gut microorganism in the plurality of gut microorganisms, thereby generating, for each respective gut microorganism in the plurality of gut microorganisms, a corresponding count of respective nucleic acid sequences in the plurality of nucleic acid sequences assigned to the respective gut microorganism, and determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding genomic abundance value for the respective gut microorganism based on the corresponding count of respective nucleic acid sequences assigned to the respective gut microorganism.
  • the plurality of gut microorganisms comprises at least 20 microorganisms selected from those microorganisms in Table 1, Table 2, or Figure 13A-13XX having a connectivity of at least 2.
  • the biological sample from the gut of the subject is a fecal sample.
  • the therapy is a biological therapy, an immunotherapy, a chemotherapy, a radiation therapy, a gene therapy, a hormone therapy, a photodynamic therapy, a targeted therapy, small molecules, antibodies, polynucleotides, natural compounds, immune modulators, bone marrow therapy, stem cell therapy, surgical therapy, induction therapy, maintenance therapy, or a combination thereof.
  • the disorder is selected from the group consisting of type 2 diabetes, hypertension, schizophrenia, atherosclerotic cardiovascular disease (ACVD), liver cirrhosis (LC), inflammatory bowel disease (IBD), colorectal cancer (CRC), ankylosing spondylitis (AS), Parkinson’s disease (PD), rheumatoid arthritis (RA), advanced melanoma, and B cell lymphoma.
  • the disorder is cancer.
  • the method includes inputting the plurality of genomic abundance values into a model comprising a plurality of parameters, wherein the model applies the plurality of parameters to the plurality of genomic abundance values through at least 10,000 computations to generate as output from the model a prediction of the subject’s response to the therapy.
  • the prediction of the subject’s response is a class output of a respective response, in a plurality of possible responses, of the respective subject.
  • the prediction of the subject’s response is a probability output for the subject’s response.
  • the model is a neural network algorithm, a support vector machine algorithm, a Naive Bayes algorithm, a nearest neighbor algorithm, a boosted trees algorithm, a Random Forest algorithm, a convolutional neural network algorithm, a decision tree algorithm, a regression algorithm, or a clustering algorithm.
  • the plurality of parameters is at least 1000, at least 10,000, at least 15,000, at least 50,000, at least 100,000, at least 250,000, at least 500,000, or at least 1,000,000 parameters.
  • the model applies the plurality of parameters to the information through at least 25,000, at least 50,000, at least 100,000, at least 250,000, at least 500,000, or at least 1,000,000 computations to obtain a corresponding output for the respective subject from the model.
  • the method includes treating the subject by: (i) when the prediction of the subject’s response to the therapy satisfies a threshold likelihood that the subject will respond favorably to the therapy, administering the therapy to the subject; or (ii) when the prediction of the subject’s response to the therapy does not satisfy the threshold likelihood that the subject will respond favorably to the therapy, administering one or more of the plurality of gut microorganisms to the subject.
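The two-branch treatment rule above reduces to a threshold comparison on the model's probability output. The sketch below assumes that "satisfies a threshold likelihood" means greater than or equal to the threshold, and uses a default of 0.5; both choices, and the function and label names, are hypothetical and not taken from the disclosure.

```python
def choose_action(response_probability, threshold=0.5):
    """Map a predicted response probability to one of the two described actions.

    threshold=0.5 is a placeholder; the text does not fix a value.
    """
    if not 0.0 <= response_probability <= 1.0:
        raise ValueError("response_probability must lie in [0, 1]")
    if response_probability >= threshold:
        return "administer_therapy"
    return "administer_gut_microorganisms"


likely_responder = choose_action(0.82)
unlikely_responder = choose_action(0.31)
stricter_cutoff = choose_action(0.82, threshold=0.9)
```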
  • the computer system comprises one or more processors and a non-transitory computer-readable medium including computer-executable instructions that, when executed by the one or more processors, cause the processors to perform the method described herein.
  • Another aspect of the present disclosure provides a non-transitory computer readable storage medium.
  • the non-transitory computer readable storage medium stores instructions, which when executed by a computer system, cause the computer system to perform any of the methods described herein.
  • Figure 1 illustrates a block diagram of an example computing device, in accordance with some embodiments of the present disclosure.
  • Figures 2A, 2B, 2C, and 2D collectively provide a flow chart of processes and features for training a model for predicting a subject’s response to a therapy for a disorder, in accordance with some embodiments of the present disclosure.
  • Figures 3A, 3B, and 3C collectively provide a flow chart of processes and features for predicting a subject’s response to a therapy for a disorder, in accordance with some embodiments of the present disclosure.
  • Figures 4A, 4B, 4C, 4D, 4E, 4F, 4G, and 4H collectively illustrate that reversible alterations in the gut microbiota induced by a high-fiber diet are associated with corresponding shifts in metabolic phenotypes in patients with Type 2 Diabetes Mellitus (T2DM).
  • (A) Study design of the QD trial. During the run-in period, written informed consent, a personal-information questionnaire, and HbA1c-based screening were conducted. After run-in, medical checkups and sample collection were conducted at baseline (M0), after three months (M3) on the high-fiber intervention (W) or usual diet (U), and one year after the high-fiber intervention stopped (M15).
  • (B) Changes in fiber intake.
  • Figures 5A and 5B collectively illustrate that despite substantial global changes in the gut microbiota induced by the high-fiber intervention, two competing bacterial guilds, which are associated with HbA1c levels, form a robust seesaw-like network within the ecosystem. (A) The distribution of different types of correlations of the genome pairs during the trial.
  • Figures 6A, 6B, 6C1, 6C2, 6D, 6E1, 6E2, 6E3, 6E4, 6E5, 6E6, 6E7, 6E8, 6E9, 6E10, and 6E11 collectively illustrate that genomes within the two competing guilds predict metabolic health outcomes in T2DM patients of the QD trial, and distinguish cases from controls across seven diseases in eleven independent case-control metagenomic datasets (Case-Control Dataset Collection I).
  • BMI body mass index
  • SBP systolic blood pressure
  • DBP diastolic blood pressure
  • WC waist circumference
  • HP hip circumference
  • TNF-α tumor necrosis factor-α
  • WBC white blood cell count
  • CRP C-reactive protein
  • LBP lipopolysaccharide-binding protein
  • TC total cholesterol
  • TG triglyceride
  • Lpa lipoprotein a
  • HDL high-density lipoprotein
  • APOA apolipoprotein A
  • LDL low-density lipoprotein
  • APOB apolipoprotein B
  • GFR (MDRR), glomerular filtration rate
  • CysC Cystatin C
  • ACR urinary microalbumin to creatinine ratio
  • IMT intima-media thickness
  • DAN diabetic autonomic neuropathy
  • MHR mean heart rate
  • SDNN standard deviation
  • (C) Differences in genetic capacity of carbohydrate substrate utilization (CAZy), short-chain fatty acid production (SCFA), antibiotic resistance genes (ARG) and virulence factor genes (VF).
  • the heatmaps show the proportion (CAZy) or gene copy numbers (SCFA, ARG and VF) of each category in each genome.
  • CAZy genes were predicted in each genome.
  • the proportion of CAZy genes for a particular substrate was calculated as the number of the CAZy genes involved in its utilization divided by the total number of the CAZy genes.
  • Arabinoxylan-related CAZy families CE1, CE2, CE4, CE6, CE7, GH10, GH11, GH115, GH43, GH51, GH67, GH3 and GH5; cellulose-related: GH1, GH44, GH48, GH8, GH9, GH3 and GH5; inulin-related: GH32 and GH91; mucin-related families: GH1, GH2, GH3, GH4, GH18, GH19, GH20, GH29, GH33, GH38, GH58, GH79, GH84, GH85, GH88, GH89, GH92, GH95, GH98, GH99, GH101, GH105, GH109, GH110, GH113, PL6, PL8, PL12, PL13 and PL21; pectin-related: CE12, CE8, GH28, PL1 and PL9; starch-related: GH13, GH
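The per-substrate CAZy proportion defined above (genes involved in a substrate's utilization divided by all CAZy genes in the genome) can be computed as in the sketch below. The family sets for inulin, pectin, and cellulose are copied from the list above; the function name and the example genome's gene list are invented. Because some families appear under more than one substrate in that list, a single gene may count toward several substrates, so the proportions need not sum to one.

```python
# Family-to-substrate sets taken from the list above (subset shown for brevity).
SUBSTRATE_FAMILIES = {
    "inulin": {"GH32", "GH91"},
    "pectin": {"CE12", "CE8", "GH28", "PL1", "PL9"},
    "cellulose": {"GH1", "GH44", "GH48", "GH8", "GH9", "GH3", "GH5"},
}


def cazy_proportions(genome_cazy_genes):
    """Proportion of a genome's CAZy genes that belong to each substrate's families.

    genome_cazy_genes: list with one CAZy family label per predicted gene.
    """
    total = len(genome_cazy_genes)
    if total == 0:
        return {substrate: 0.0 for substrate in SUBSTRATE_FAMILIES}
    return {
        substrate: sum(1 for fam in genome_cazy_genes if fam in families) / total
        for substrate, families in SUBSTRATE_FAMILIES.items()
    }


# Hypothetical genome with 10 predicted CAZy genes.
genes = ["GH32", "GH32", "GH91", "GH28", "PL1", "GH5", "GH13", "GH13", "GH2", "GH9"]
proportions = cazy_proportions(genes)
```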
  • FTHFS formate-tetrahydrofolate ligase for acetate production
  • ScpC propionyl-CoA succinate-CoA transferase
  • Pct propionate-CoA transferase for propionate production
  • Butyryl–coenzyme A butyryl-CoA:acetate CoA transferase
  • Buk butyrate kinase
  • 4Hbt butyryl-CoA:4-hydroxybutyrate CoA transferase
  • Ato butyryl-CoA:acetoacetate CoA transferase (AtoA: alpha subunit, AtoD: beta subunit) for butyrate production.
  • Figures 7A and 7B collectively illustrate that genomes forming the two competing guilds, as identified from a case-control dataset specific to one disease, demonstrate significant effectiveness in classifying cases from controls across independent datasets on different diseases within the Case-Control Dataset Collection I.
  • Case-Control Dataset Collection I has 11 published metagenomic case-control datasets on 7 diseases: type 2 diabetes (T2D), liver cirrhosis (LC), ankylosing spondylitis (AS), atherosclerotic cardiovascular disease (ACVD), schizophrenia (SCZ), colorectal cancer (CRC), and inflammatory bowel disease (IBD). Datasets from 3 studies were combined to analyze CRC. Datasets from 2 studies were combined to analyze IBD. In the 100% stacked bar, the percentage of correlations that followed the pattern of the seesaw-networked two competing guilds (i.e., positive edges within each guild, negative edges between the 2 guilds) is shown in yellow, and the percentage of correlations that were negative within a guild or positive between the guilds is shown in black.
  • T2D type 2 diabetes
  • LC liver cirrhosis
  • AS ankylosing spondylitis
  • ACVD atherosclerotic cardiovascular disease
  • CRC colorectal cancer
  • IBD inflammatory bowel disease
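  • The percentage of correlations that follow the seesaw pattern (positive edges within a guild, negative edges between the two guilds) can be computed as in the sketch below. The edge representation, the function name `seesaw_fraction`, and the toy network are assumptions made for illustration; the source does not specify how the correlation network is stored.

```python
def seesaw_fraction(edges, guild_of):
    """Fraction of correlation edges consistent with the seesaw pattern.

    edges: iterable of (genome_a, genome_b, sign), sign > 0 for a positive
           correlation and sign < 0 for a negative one.
    guild_of: dict mapping genome id -> guild label (e.g., 1 or 2).
    """
    edges = list(edges)
    if not edges:
        raise ValueError("at least one edge is required")
    consistent = 0
    for a, b, sign in edges:
        same_guild = guild_of[a] == guild_of[b]
        # Seesaw pattern: positive within a guild, negative between guilds.
        if (same_guild and sign > 0) or (not same_guild and sign < 0):
            consistent += 1
    return consistent / len(edges)


# Hypothetical toy network: four genomes, five significant correlations.
guild_of = {"g1": 1, "g2": 1, "g3": 2, "g4": 2}
edges = [
    ("g1", "g2", +1),  # within guild 1, positive: consistent
    ("g3", "g4", +1),  # within guild 2, positive: consistent
    ("g1", "g3", -1),  # between guilds, negative: consistent
    ("g2", "g4", -1),  # between guilds, negative: consistent
    ("g1", "g4", +1),  # between guilds, positive: inconsistent
]
fraction = seesaw_fraction(edges, guild_of)
print(f"{100 * fraction:.1f}% of edges follow the seesaw pattern")
```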
  • Figures 8A, 8B1, 8B2, 8B3, 8B4, 8B5, 8B6, 8B7, 8B8, 8B9, 8B10, 8B11, 8B12, 8B13, 8B14, 8B15, 8B16, 8C1, and 8C2 collectively illustrate that the combined core genomes, drawn from all identified competing guilds, effectively differentiate cases from controls across a broader range of diseases and predict treatment outcomes in independent datasets.
  • HQMAGs in each set of the two competing guilds were dereplicated based on the cutoff of 99% average nucleotide identity (ANI) between two genomes. 788 non-redundant HQMAGs were obtained as the combined genomes of all the 8 sets of the two competing guilds.
  • A random forest classification model with leave-one-out cross validation was constructed based on the 788 HQMAGs in each dataset. The HQMAGs were ranked based on their importance across all the models. Starting from the least important HQMAG (largest importance rank), one HQMAG was removed at a time and the random forest classification model was rebuilt in each dataset.
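  • The one-at-a-time removal of the least important genome can be sketched as a generic backward-elimination loop. This is only a skeleton under stated assumptions: `backward_elimination` and `toy_score` are hypothetical names, and the scoring callable here is a toy stand-in, whereas in the procedure described above it would be a random forest classifier evaluated with leave-one-out cross validation.

```python
def backward_elimination(ranked_features, evaluate):
    """Drop the least important feature one at a time and record the score.

    ranked_features: feature ids ordered from most to least important.
    evaluate: callable taking a list of feature ids and returning a score
              (hypothetical; stands in for model training and validation).
    Returns a list of (number_of_features, score) tuples.
    """
    remaining = list(ranked_features)
    history = []
    while remaining:
        history.append((len(remaining), evaluate(remaining)))
        remaining.pop()  # remove the current least important feature
    return history


# Toy stand-in score: counts how many "informative" genomes remain.
INFORMATIVE = {"mag_a", "mag_b"}


def toy_score(features):
    return sum(1 for f in features if f in INFORMATIVE)


history = backward_elimination(["mag_a", "mag_b", "mag_c", "mag_d"], toy_score)
print(history)
```

  • In this toy run the score stays flat while uninformative genomes are removed and drops only when an informative one is removed, which is the kind of curve used to choose a reduced genome set.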
  • MTX methotrexate
  • DAS28 Disease Activity Score in 28 joints
  • NR n 28.
  • progression-free survival was used to determine R and NR to immune checkpoint inhibitor (ICI) treatment.
  • Figures 9A, 9B, and 9C collectively illustrate the discriminative power of the combined core genomes from all the 8 sets of the two competing guilds in classifying healthy individuals vs. patients across colorectal cancer (CRC), inflammatory bowel diseases (IBD), and Pancreatic Cancer (PC) datasets in the Case-Control Dataset Collection I and II.
  • CRC colorectal cancer
  • IBD inflammatory bowel diseases
  • PC Pancreatic Cancer
  • a prediction matrix was shown for the classification of cases and controls based on the combined core genomes from all eight sets of the two competing guilds within each dataset (diagonal values), across pairs of datasets (one dataset used for model training and the other for testing), and in a leave-one-dataset-out setting (training the model on all but one dataset and testing on the left-out dataset). A random forest classification model with leave-one-out cross validation was applied. The area under the ROC curve (AUC) values were shown in the matrix.
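  • The three evaluation settings of such a prediction matrix (within-dataset, cross-dataset pairs, and leave-one-dataset-out) can be enumerated as in the following sketch. Only the train/test bookkeeping is shown; the dataset names and the function `evaluation_plan` are hypothetical, and no model is trained here.

```python
from itertools import permutations


def evaluation_plan(dataset_names):
    """Enumerate (training datasets, test dataset) pairs for a prediction matrix."""
    names = list(dataset_names)
    # Diagonal: train and test within the same dataset (via cross validation).
    within = [((name,), name) for name in names]
    # Off-diagonal: train on one dataset, test on another.
    cross = [((train,), test) for train, test in permutations(names, 2)]
    # Leave-one-dataset-out: train on all but one dataset, test on the one left out.
    lodo = [
        (tuple(n for n in names if n != held_out), held_out)
        for held_out in names
    ]
    return {"within": within, "cross": cross, "lodo": lodo}


plan = evaluation_plan(["CRC_1", "CRC_2", "CRC_3"])
for setting, pairs in plan.items():
    print(setting, pairs)
```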
  • FIG. 10A1, 10A2, 10B1, 10B2, 10C1, 10C2, 10D1, and 10D2 collectively illustrate that the combined core of the two competing guilds supports the prediction of therapeutic effects in the Treatment Dataset Collection for inflammatory bowel diseases, rheumatoid arthritis, advanced melanoma, and B cell lymphoma.
  • D Tumor response to CAR-T cell immunotherapy was classified as either complete remission or non-complete remission (partial remission, stable disease, progressive disease or death) at 180 days after CAR-T cell infusion by the treating physician.
  • Figures 11A, 11A2, 11B1, 11B2, 11C1, 11C2, 11D1, and 11D2 collectively illustrate that the Combined Core genomes of the two competing guilds provide a universal model for distinguishing between cases and controls across a variety of diseases (Case-Control Dataset Collection I and II).
  • FIGS 14A and 14B collectively illustrate that the CC-TCGs predict treatment outcomes in independent datasets.
  • CC-TCG was used as a predictor in the treatment dataset collection to predict responders (R) and non-responders (NR) to treatment.
  • R responders
  • NR non-responders
  • IBD inflammatory bowel disease
  • RA rheumatoid arthritis; response to methotrexate was determined by improvement in the Disease Activity Score in 28 joints (DAS28)
  • FIG. 15A, 15B, 15C, and 15D collectively illustrate that the combined core of the TCGs supports the prediction of therapeutic effects in the treatment dataset collection for inflammatory bowel disease, rheumatoid arthritis, advanced melanoma, and B cell lymphoma.
  • the abundance of the combined core genomes (284 HQMAGs) in the pre-treatment samples was used as predictors in random forest classification models to predict responder (R) and non-responder (NR) under treatment. Receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values were shown in the panels. (15A) 14-week remission was used to determine R and NR.
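  • For reference, the AUC reported for a responder versus non-responder classifier equals the probability that a randomly chosen responder receives a higher predicted score than a randomly chosen non-responder, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the labels and scores are invented for illustration, and `roc_auc` is a hypothetical helper, not a function from the source.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from the pairwise (Mann-Whitney) definition.

    labels: 1 for responder (R), 0 for non-responder (NR).
    scores: predicted score for the responder class, one per sample.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Invented example: three responders and three non-responders.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```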
  • Arabinoxylan-related CAZy families CE1, CE2, CE4, CE6, CE7, GH10, GH11, GH115, GH43, GH51, GH67, GH3, and GH5; cellulose-related: GH1, GH44, GH48, GH8, GH9, GH3, and GH5; inulin-related: GH32 and GH91; mucin-related: GH1, GH2, GH3, GH4, GH18, GH19, GH20, GH29, GH33, GH38, GH58, GH79, GH84, GH85, GH88, GH89, GH92, GH95, GH98, GH99, GH101, GH105, GH109, GH110, GH113, PL6, PL8, PL12, PL13, and PL21; pectin-related: CE12, CE8, GH28, PL1, and PL9; and starch-related: GH13,
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” may be construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.
  • first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first subject could be termed a second subject, and, similarly, a second subject could be termed a first subject, without departing from the scope of the present disclosure.
  • the term “measure of central tendency” refers to a central or representative value for a distribution of values.
  • measures of central tendency include an arithmetic mean, weighted mean, midrange, midhinge, trimean, geometric mean, geometric median, Winsorized mean, median, and mode of the distribution of values.
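  • Several of the listed measures of central tendency are available in Python's standard `statistics` module, as the short sketch below shows. The sample values are arbitrary, and the midrange is computed by hand as the mean of the minimum and maximum.

```python
import statistics

# Arbitrary example distribution of values.
values = [2, 4, 4, 8, 16]

summary = {
    "arithmetic_mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "geometric_mean": statistics.geometric_mean(values),
    # Midrange: mean of the smallest and largest values.
    "midrange": (min(values) + max(values)) / 2,
}

for name, value in summary.items():
    print(f"{name}: {value}")
```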
  • the term “subject” refers to any living or non-living organism including, but not limited to, a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human mammal, or a non-human animal.
  • Any human or non-human animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, fish, dolphin, whale and shark.
  • a subject is a male or female of any age (e.g., a man, a woman, or a child).
  • administering means a method for therapeutically or prophylactically preventing, treating or ameliorating a syndrome, disorder or disease as described herein. Such methods include administering an effective amount of said therapeutic agent at different times during the course of a therapy or concurrently in a combination form. The methods of the invention are to be understood as embracing all known therapeutic treatment regimens.
  • cancer refers to an abnormal mass of tissue in which the growth of the mass surpasses, and is not coordinated with, the growth of normal tissue, including both solid masses (e.g., as in a solid tumor) or fluid masses (e.g., as in a hematological cancer).
  • a cancer or tumor can be defined as “benign” or “malignant” depending on the following characteristics: degree of cellular differentiation including morphology and functionality, rate of growth, local invasion and metastasis.
  • a “benign” tumor can be well differentiated, have characteristically slower growth than a malignant tumor and remain localized to the site of origin.
  • a benign tumor does not have the capacity to infiltrate, invade or metastasize to distant sites.
  • a “malignant” tumor can be poorly differentiated (anaplasia) and have characteristically rapid growth accompanied by progressive infiltration, invasion, and destruction of the surrounding tissue.
  • a malignant tumor can have the capacity to metastasize to distant sites.
  • a cancer cell is a cell found within the abnormal mass of tissue whose growth is not coordinated with the growth of normal tissue.
  • a “tumor sample” refers to a biological sample obtained or derived from a tumor of a subject, as described herein.
  • Non-limiting examples of cancer types include ovarian cancer, cervical cancer, uveal melanoma, colorectal cancer, chromophobe renal cell carcinoma, liver cancer, endocrine tumor, oropharyngeal cancer, retinoblastoma, biliary cancer, adrenal cancer, neural cancer, neuroblastoma, basal cell carcinoma, brain cancer, breast cancer, non-clear cell renal cell carcinoma, glioblastoma, glioma, kidney cancer, gastrointestinal stromal tumor, medulloblastoma, bladder cancer, gastric cancer, bone cancer, non-small cell lung cancer, thymoma, prostate cancer, clear cell renal cell carcinoma, skin cancer, thyroid cancer, sarcoma, testicular cancer, head and neck cancer (e.g., head and neck squamous cell carcinoma), meningioma, peritoneal cancer, endometrial cancer, pancreatic cancer, mesothelioma, esophageal cancer
  • cancer state or “cancer condition” refer to a characteristic of a cancer patient's condition, e.g., a diagnostic status, a type of cancer, a location of cancer, a primary origin of a cancer, a cancer stage, a cancer prognosis, and/or one or more additional characteristics of a cancer (e.g., tumor characteristics such as morphology, heterogeneity, size, etc.).
  • one or more additional personal characteristics of the subject are used to further describe the cancer state or cancer condition of the subject, e.g., age, gender, weight, race, personal habits (e.g., smoking, drinking, diet), other pertinent medical conditions (e.g., high blood pressure, dry skin, other diseases), current medications, allergies, pertinent medical history, current side effects of cancer treatments and other medications, etc.
  • the term “treat”, “treating”, “treatment”, or “therapy” refers to both therapeutic treatment and prophylactic or preventative measures, wherein the object is to prevent or slow down (lessen) the targeted pathologic condition or disorder.
  • Those in need of treatment include those diagnosed with the disorder as well as those prone to have the disorder (e.g., a genetic predisposition) or those in whom the disorder is to be prevented.
  • the terms “prevent,” “preventing,” and “prevention” refer to reducing the likelihood of the onset (or recurrence) of a disease, disorder, condition, or associated symptom(s).
  • the term means obtaining beneficial or desired results, for example, clinical results.
  • Beneficial or desired results can include, but are not limited to, alleviation of one or more symptoms.
  • the “response” refers to the response to a biological drug, chemical drug, or physical therapy of the subject suffering from a pathology which is treatable with said biological drug, chemical drug, or physical therapy. Standard criteria may vary from disease to disease.
  • immunotherapies are all therapies that either directly or indirectly modify the immune response or the immune system of a patient. For immunotherapeutic strategies, detection of a strong immune response at the tumor site has been found to be a reliable marker for a plurality of cancers, such as colon cancers and rectal cancers, and a pre-existing immune response was therefore assumed to be associated with better therapeutic efficacy.
  • Immune response encompasses any form of immune response of said patient through direct or indirect, or both, action towards said cancer or tumor sites.
  • the immune response means the immune response of the host cancer patient in reaction to the tumor and encompasses the presence of, the number of, or alternatively the activity of, cells and related signaling molecules involved in the immune response of the host which includes: all cytokines, chemokines, growth factors, stem cell growth factors.
  • the immune response encompasses a multitude of different cellular subtypes, such as the T cell lineage, the B cell lineage, the natural killer cells, macrophages, dendritic cells, myeloid-derived suppressor cells, lytic dendritic cells, fibroblasts, endothelial cells, as well as an enormous number of signaling molecules (cytokines, chemokines, other signaling molecules).
  • immunotherapeutic agent refers to a compound, composition or treatment that indirectly or directly enhances, stimulates, or augments the body's immune response against cancer cells and/or that lessens the side effects of other anticancer therapies.
  • Immunotherapy is thus a therapy that directly or indirectly stimulates or enhances the immune system's responses to cancer cells and/or lessens the side effects that may have been caused by other anti-cancer agents.
  • Immunotherapy is also referred to in the art as immunologic therapy, biological therapy, biological response modifier therapy, and biotherapy.
  • examples of immunotherapeutic agents include, but are not limited to, cytokines, cancer vaccines, monoclonal antibodies, and non-cytokine adjuvants.
  • the immunotherapeutic treatment may consist of administering to the patient an amount of immune cells (T cells, NK cells, dendritic cells, B cells).
  • Immunotherapeutic agents can be non-specific, i.e. boost the immune system generally so that it becomes more effective in fighting the growth and/or spread of cancer cells, or they can be specific, i.e. targeted to the cancer cells themselves. Immunotherapy regimens may combine the use of non-specific and specific immunotherapeutic agents.
  • Non-specific immunotherapeutic agents are substances that stimulate or indirectly augment the immune system.
  • IFNs can act directly on cancer cells, for example, by slowing their growth, promoting their development into cells with more normal behavior and/or increasing their production of antigens thus making the cancer cells easier for the immune system to recognize and destroy.
  • IFNs can also act indirectly on cancer cells, for example, by slowing down angiogenesis, boosting the immune system and/or stimulating natural killer (NK) cells, T cells and macrophages.
  • Recombinant IFN-alpha is available commercially as Roferon (Roche Pharmaceuticals) and Intron A (Schering Corporation).
  • Non-cytokine adjuvants in combination with other immuno- and/or chemotherapeutics have demonstrated efficacy against various cancers including, for example, colon cancer and colorectal cancer (Levamisole); melanoma (BCG and QS-21); renal cancer and bladder cancer (BCG).
  • immunotherapeutic agents can be active, i.e. stimulate the body's own immune response, or they can be passive, i.e. comprise immune system components that were generated external to the body.
  • Passive specific immunotherapy typically involves the use of one or more monoclonal antibodies that are specific for a particular antigen found on the surface of a cancer cell or that are specific for a particular cell growth factor.
  • Monoclonal antibodies currently used as cancer immunotherapeutic agents that are suitable for inclusion in the combinations of the present invention include, but are not limited to, rituximab (Rituxan®), trastuzumab (Herceptin®), ibritumomab tiuxetan (Zevalin®), tositumomab (Bexxar®), cetuximab (C-225, Erbitux®), bevacizumab (Avastin®), gemtuzumab ozogamicin (Mylotarg®), alemtuzumab (Campath®), and BL22.
  • Monoclonal antibodies are used in the treatment of a wide range of cancers including breast cancer (including advanced metastatic breast cancer), colorectal cancer (including advanced and/or metastatic colorectal cancer), ovarian cancer, lung cancer, prostate cancer, cervical cancer, melanoma and brain tumours.
  • PATENT APPLICATION Attorney Docket No.: 126146-5003-WO [0098]
  • Other examples include antibodies specific for a co-stimulatory molecule.
  • Co-stimulatory molecules include, for example, B7-1/CD80, CD28, B7-2/CD86, CTLA-4, B7-H1/PD-L1, Gi24/Dies1/VISTA, B7-H2, ICOS, B7-H3, PD-1, B7-H4, PD-L2/B7-DC, B7-H6, PDCD6, BTLA, 4-1BB/TNFRSF9/CD137, CD40 Ligand/TNFSF5, 4-1BB Ligand/TNFSF9, GITR/TNFRSF18, HVEM/TNFRSF14, CD27/TNFRSF7, LIGHT/TNFSF14, CD27 Ligand/TNFSF7, OX40/TNFRSF4, CD30/TNFRSF8, OX40 Ligand/TNFSF4, CD30 Ligand/TNFSF8, TACI/TNFRSF13B, CD40/TNFRSF5, 2B4/CD244/SLAMF4, CD84/SLAMF5, BLAME/SLAMF8, CD2
  • the patient's circulating lymphocytes, or tumor infiltrated lymphocytes, are isolated in vitro, activated by lymphokines such as IL-2 or transduced with genes for tumor necrosis, and readministered (Rosenberg et al., 1988; 1989).
  • the activated lymphocytes are most preferably the patient's own cells that were earlier isolated from a blood or tumor sample and activated (or "expanded") in vitro.
  • This form of immunotherapy has produced several cases of regression of melanoma and renal carcinoma.
  • genomic abundance value refers to an absolute or relative amount of a microorganism’s genome in a biological sample from the gut of a subject.
  • a genomic abundance value can be expressed in different units, including copy number, molarity, mass (e.g., normalized against the size of the genome), unique sequence reads (e.g., normalized against the size of the genome), a percentage of any of the former metrics relative to the total amount of the metric across all genomes in the sample, a percentage of any of the former metrics relative to the total amount of the metric across a plurality of genomes in the sample, etc.
  • a genomic abundance value is normalized against a total genomic abundance in the sample.
  • a genomic abundance value is normalized against a genomic abundance value for a control genome in the sample.
  • the values for a plurality of genomic abundance values in a sample are standardized, normalized, and/or scaled. Examples of methods for normalizing genomic abundance values are described, for example, in Lin, H., Peddada, S.D., Analysis of microbial compositions: a review of normalization and differential abundance analysis, Biofilms Microbiomes, 6(60) (2020) and Lutz K.C., et al., A Survey of Statistical Methods for Microbiome Data Analysis, Frontiers in Applied Mathematics and Statistics, 8 (2022), the contents of which are incorporated herein by reference in their entireties.
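  • One of the units mentioned above, unique sequence reads normalized against genome size and then expressed relative to the total across all genomes in the sample, can be sketched as follows. The function name `normalized_abundance`, the read counts, and the genome sizes are assumptions for illustration; the cited references describe more elaborate normalization schemes that are not reproduced here.

```python
def normalized_abundance(read_counts, genome_sizes):
    """Genome-size-normalized relative abundance per genome.

    read_counts: dict genome id -> number of unique reads assigned to the genome.
    genome_sizes: dict genome id -> genome length in base pairs.
    Returns a dict genome id -> fraction of the sample total (sums to 1).
    """
    # Step 1: normalize reads against genome size (reads per base pair).
    per_base = {g: read_counts[g] / genome_sizes[g] for g in read_counts}
    total = sum(per_base.values())
    if total == 0:
        raise ValueError("no reads assigned to any genome")
    # Step 2: normalize against the total genomic abundance in the sample.
    return {g: value / total for g, value in per_base.items()}


# Hypothetical sample with two genomes of different sizes.
reads = {"genome_A": 3000, "genome_B": 1000}
sizes = {"genome_A": 3_000_000, "genome_B": 2_000_000}
abundance = normalized_abundance(reads, sizes)
print(abundance)
```

  • Without the size correction, genome_A would account for 75% of reads; after correction its share is about two thirds, because part of its read excess is explained by its larger genome.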
  • genomic abundance can be measured by methods known in the art. For example, metagenomic sequencing can be used to largely reconstruct microbial genomes from next generation sequencing of genomic DNA in biological samples, such as biological samples from the gut of a subject.
  • for a review of metagenomic sequencing, see, for example, Quince C, et al., Shotgun metagenomics, from sampling to analysis, Nat Biotechnol, 35(9):833-44 (2017), the content of which is incorporated herein by reference in its entirety.
  • Genomic abundance may also be determined by quantification of the copy number of a ribosomal gene, for example the 16S rRNA gene.
  • rRNA quantification examples are described in Manzari C., et al., Accurate quantification of bacterial abundance in metagenomic DNAs accounting for variable DNA integrity levels, Microb Genom., 6(10):mgen000417 (2020) and Barlow, J.T., et al., A quantitative sequencing framework for absolute abundance measurements of mucosal and lumenal microbial communities, Nat Commun., 11:2590 (2020), the contents of which are incorporated herein by reference in their entireties.
  • relative abundance refers to a ratio of a first amount of a compound measured in a sample, e.g., a genome for a first microorganism, to a second amount of a compound measured in a second sample.
  • relative abundance refers to a ratio of an amount of a compound, e.g., a genome for a first microorganism, to a total amount of compounds, e.g., the total amount of microorganism genomes or the total amount of a plurality of genomes, in the same sample.
  • relative abundance refers to a ratio of an amount of a compound, e.g., a genome for a first microorganism, in a first sample to an amount of the compound in a second sample.
  • relative abundance refers to a ratio of a normalized amount of a genome for a first microorganism in a first sample to a normalized amount of the genome for the first microorganism in a second and/or reference sample.
  • the term “sequencing” refers to any biochemical process that may be used to determine the order of biological macromolecules such as nucleic acids or proteins.
  • sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as an mRNA transcript or a genomic locus.
  • sequence reads or “reads” refers to nucleotide sequences produced by any nucleic acid sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”) or from both ends of nucleic acid fragments (e.g., paired-end reads, double-end reads). The length of the sequence read is often associated with the particular sequencing technology.
  • High-throughput methods provide sequence reads that can vary in size from tens to hundreds of base pairs (bp).
  • the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp).
  • the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more.
  • Nanopore® sequencing can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs.
  • Illumina® parallel sequencing, for example, can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp.
  • a sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides).
  • a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment.
  • a sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
  • the term “read segment” refers to any form of nucleotide sequence read including the raw sequence reads obtained directly from a nucleic acid sequencing technique or from a sequence derived therefrom, e.g., an aligned sequence read, a collapsed sequence read, or a stitched sequence read.
  • the term “read count” refers to the total number of nucleic acid reads generated, which may or may not be equivalent to the number of nucleic acid molecules generated, during a nucleic acid sequencing reaction.
  • the term “read-depth,” “sequencing depth,” or “depth” can refer to a total number of unique nucleic acid fragments encompassing a particular locus or region of the genome of a microorganism that are sequenced in a particular sequencing reaction. Sequencing depth can be expressed as “Yx”, e.g., 50x, 100x, etc., where “Y” refers to the number of unique nucleic acid fragments encompassing a particular locus that are sequenced in a sequencing reaction. In such a case, Y is necessarily an integer, because it represents the actual sequencing depth for a particular locus.
  • read-depth, sequencing depth, or depth can refer to a measure of central tendency (e.g., a mean or mode) of the number of unique nucleic acid fragments that encompass one of a plurality of loci or regions of the genome of a microorganism that are sequenced in a particular sequencing reaction.
  • sequencing depth refers to the average depth of every locus across a targeted sequencing panel, an exome, or an entire genome for the microorganism.
  • Y may be expressed as a fraction or a decimal, because it refers to an average coverage across a plurality of loci.
  • sequencing breadth refers to what fraction of a particular microorganism genome has been sequenced. Sequencing breadth can be expressed as a fraction, a decimal, or a percentage, and is generally calculated as (the number of loci analyzed / the total number of loci in the genome). The denominator of the fraction can be a repeat-masked genome, and thus 100% can correspond to all of the reference genome minus the masked parts.
  • a repeat-masked genome can refer to a genome in which sequence repeats are masked (e.g., sequence reads align to unmasked portions of the genome).
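  • The difference between average sequencing depth and sequencing breadth can be made concrete with a small sketch. It assumes a per-locus list of depths (the number of unique fragments covering each unmasked locus) is already available; the helper `depth_and_breadth` and the example values are hypothetical.

```python
def depth_and_breadth(per_locus_depth):
    """Return (mean depth, breadth) for a list of per-locus depths.

    per_locus_depth: one integer per locus in the (optionally repeat-masked)
    genome, giving the number of unique fragments covering that locus.
    """
    n_loci = len(per_locus_depth)
    if n_loci == 0:
        raise ValueError("at least one locus is required")
    # Average depth: may be fractional because it averages over loci.
    mean_depth = sum(per_locus_depth) / n_loci
    # Breadth: loci covered at least once / total loci in the genome.
    breadth = sum(1 for depth in per_locus_depth if depth > 0) / n_loci
    return mean_depth, breadth


# Hypothetical genome of 8 loci, two of which were never sequenced.
mean_depth, breadth = depth_and_breadth([0, 10, 20, 30, 0, 40, 50, 10])
print(f"mean depth: {mean_depth}x, breadth: {breadth:.0%}")
```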
  • sequence ratio and “coverage ratio” interchangeably refer to any measurement of a number of units of a genomic sequence in a first one or more biological samples (e.g., a test and/or tumor sample) compared to the number of units of the respective genomic sequence in a second one or more biological samples (e.g., a reference and/or control sample).
  • a sequence ratio is a copy ratio, a log2-transformed copy ratio (e.g., log2 copy ratio), a coverage ratio, a base fraction, an allele fraction (e.g., a variant allele fraction), and/or a tumor ploidy.
  • sequence ratio is a logN-transformed copy ratio, where N is any real number greater than 1.
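  • A logN-transformed copy ratio, with N any real number greater than 1, can be computed as in the following sketch. The function `log_copy_ratio` and the example unit counts are assumptions for illustration; how the units are counted in each sample is outside the scope of the sketch.

```python
import math


def log_copy_ratio(test_units, reference_units, base=2.0):
    """Log-transformed ratio of units in a test sample to units in a reference.

    base: the logarithm base N, which must be greater than 1.
    """
    if base <= 1:
        raise ValueError("base must be greater than 1")
    if test_units <= 0 or reference_units <= 0:
        raise ValueError("unit counts must be positive")
    return math.log(test_units / reference_units, base)


# Hypothetical counts: four-fold more units in the test sample.
ratio_log2 = log_copy_ratio(40, 10)            # base 2
ratio_log10 = log_copy_ratio(1000, 10, 10.0)   # base 10
print(ratio_log2, ratio_log10)
```

  • A value of 0 means equal amounts in the two samples, and positive or negative values indicate more or fewer units in the test sample, whatever base is chosen.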
  • the term “sequencing probe” refers to a molecule that binds to a nucleic acid with affinity that is based on the expected nucleotide sequence of the RNA or DNA present at that locus.
  • targeted panel or “targeted gene panel” refers to a combination of probes for sequencing (e.g., by next-generation sequencing) nucleic acids present in a biological sample from a subject (e.g., a tumor sample, liquid biopsy sample, germline tissue sample, white blood cell sample, or tumor or tissue organoid sample), selected to map to one or more loci of interest in a genome.
  • sensitivity or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives.
  • Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having a particular biological characteristic.
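  • specificity or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives.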
  • Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having a particular biological characteristic.
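  • Sensitivity, TP / (TP + FN), and specificity, in its standard form TN / (TN + FP), can be computed from paired true and predicted labels as in the sketch below. The label encoding (1 for having the condition, 0 for not having it), the function name, and the example data are assumptions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels encoded as 1/0."""
    tp = fn = tn = fp = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == 1 and prediction == 1:
            tp += 1   # true positive
        elif truth == 1 and prediction == 0:
            fn += 1   # false negative
        elif truth == 0 and prediction == 0:
            tn += 1   # true negative
        else:
            fp += 1   # false positive
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Invented example: 4 subjects with the condition, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity)
```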
  • a model refers to a machine learning model or algorithm.
  • a model includes an unsupervised learning algorithm.
  • an unsupervised learning algorithm is cluster analysis.
  • a model includes supervised machine learning.
  • Nonlimiting examples of supervised learning algorithms include, but are not limited to, logistic regression, neural networks, support vector machines, Naive Bayes algorithms, nearest neighbor algorithms, random forest algorithms, decision tree algorithms, boosted trees algorithms, multinomial logistic regression algorithms, linear models, linear regression, Gradient Boosting, mixture models, hidden Markov models, Gaussian NB algorithms, linear discriminant analysis, diffusion models, or any combinations thereof.
  • a model is a multinomial classifier algorithm.
  • a model is a 2-stage stochastic gradient descent (SGD) model.
  • a model is a deep neural network (e.g., a deep-and-wide sample-level model).
  • Neural networks. For example, the model is a neural network (e.g., a convolutional neural network and/or a residual neural network).
  • Neural network algorithms, also known as artificial neural networks (ANNs), include convolutional and/or residual neural network algorithms (deep learning algorithms).
  • a node receives input that comes either directly from the input data or the output of nodes in previous layers, and performs a specific operation, e.g., a summation operation.
  • a connection from an input to a node is associated with a parameter (e.g., a weight and/or weighting factor).
  • the node sums up the products of all pairs of inputs, x_i, and their associated parameters.
  • the weighted sum is offset with a bias, b.
  • the output of a node or neuron is gated using a threshold or activation function, f, which, in some instances, is a linear or non-linear function.
  • the activation function is, for example, a rectified linear unit (ReLU) activation function, a Leaky ReLU activation function, or other function such as a saturating hyperbolic tangent, identity, binary step, logistic, arcTan, softsign, parametric rectified linear unit, exponential linear unit, softPlus, bent identity, softExponential, Sinusoid, Sine, Gaussian, or sigmoid function, or any combination thereof.
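  • The node computation described above (weighted sum of the inputs, offset by a bias b, then gated by an activation function f) can be written as f(sum_i w_i * x_i + b). The sketch below implements a single node with ReLU and logistic (sigmoid) activations; the function names and the numeric values are illustrative assumptions, not parameters from the disclosure.

```python
import math


def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias, activation=relu):
    """Output of one node: activation(sum of input * weight products + bias)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# Illustrative values only.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.5]
out_relu = node_output(inputs, weights, bias=0.25)
out_sigmoid = node_output(inputs, weights, bias=0.0, activation=sigmoid)
print(out_relu, out_sigmoid)
```

  • During training, the weights and bias of each node are the parameters adjusted by gradient descent; the activation function itself is usually fixed in advance.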
  • the weighting factors, bias values, and threshold values, or other computational parameters of the neural network are “taught” or “learned” in a training phase using one or more sets of training data.
  • the parameters are trained using the input data from a training dataset and a gradient descent or backward propagation method so that the output value(s) that the ANN computes are consistent with the examples included in the training dataset.
  • the parameters are obtained from a back propagation neural network training process.
  • Any of a variety of neural networks are suitable for use in accordance with the present disclosure. Examples include, but are not limited to, feedforward neural networks, radial basis function networks, recurrent neural networks, residual neural networks, convolutional neural networks, residual convolutional neural networks, and the like, or any combination thereof.
  • a deep neural network model includes an input layer, a plurality of individually parameterized (e.g., weighted) convolutional layers, and an output scorer.
  • the parameters (e.g., weights) of each of the convolutional layers as well as the input layer contribute to the plurality of parameters (e.g., weights) associated with the deep neural network model.
  • at least 50 parameters, at least 100 parameters, at least 1000 parameters, at least 2000 parameters or at least 5000 parameters are associated with the deep neural network model.
  • Neural network algorithms including convolutional neural network algorithms, suitable for use as models are disclosed in, for example, Vincent et al., 2010, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J Mach Learn Res 11, pp.3371-3408; Larochelle et al., 2009, “Exploring strategies for training deep neural networks,” J Mach Learn Res 10, pp.1-40; and Hassoun, 1995, Fundamentals of Artificial Neural Networks, Massachusetts Institute of Technology, each of which is hereby incorporated by reference.
  • Additional example neural networks suitable for use as models are disclosed in Duda et al., 2001, Pattern Classification, Second Edition, John Wiley & Sons, Inc., New York; and Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York, each of which is hereby incorporated by reference in its entirety. Additional example neural networks suitable for use as models are also described in Draghici, 2003, Data Analysis Tools for DNA Microarrays, Chapman & Hall/CRC; and Mount, 2001, Bioinformatics: sequence and genome analysis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, each of which is hereby incorporated by reference in its entirety.
  • the model is a support vector machine (SVM).
  • SVM algorithms suitable for use as models are described in, for example, Cristianini and Shawe-Taylor, 2000, “An Introduction to Support Vector Machines,” Cambridge University Press, Cambridge; Boser et al., 1992, “A training algorithm for optimal margin classifiers,” in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, Pittsburgh, Pa., pp.142-152; Vapnik, 1998, Statistical Learning Theory, Wiley, New York; Mount, 2001, Bioinformatics: sequence and genome analysis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.; Duda, Pattern Classification, Second Edition, 2001, John Wiley & Sons, Inc., pp.259, 262-265; and Hastie, 2001, The Elements of Statistical Learning, Springer, New York; and Furey et al., 2000, Bioinformatics 16, 906-914, each of which is
  • When used for classification, SVMs separate a given set of binary labeled data with a hyper-plane that is maximally distant from the labeled data. For certain cases in which no linear separation is possible, SVMs work in combination with the technique of “kernels,” which automatically realizes a non-linear mapping to a feature space.
  • the hyper-plane found by the SVM in feature space corresponds, in some instances, to a non-linear decision boundary in the input space.
  • the plurality of parameters (e.g., weights) associated with the SVM define the hyper-plane.
  • the hyper-plane is defined by at least 10, at least 20, at least 50, or at least 100 parameters and the SVM model requires a computer to calculate because it cannot be mentally solved.
  • Naïve Bayes algorithms. In some embodiments, the model is a Naïve Bayes algorithm. Naïve Bayes models suitable for use as models are disclosed, for example, in Ng et al., 2002, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” Advances in Neural Information Processing Systems, 14, which is hereby incorporated by reference.
  • a Naïve Bayes model is any model in a family of “probabilistic models” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. In some embodiments, they are coupled with kernel density estimation. See, for example, Hastie et al., 2001, The elements of statistical learning: data mining, inference, and prediction, eds. Tibshirani and Friedman, Springer, New York, which is hereby incorporated by reference. [00124] Nearest neighbor algorithms. In some embodiments, a model is a nearest neighbor algorithm. In some implementations, nearest neighbor models are memory-based and include no model to be fit.
  • the number of distance calculations needed to solve the k-nearest neighbor model is such that a computer is used to solve the model for a given input because it cannot be mentally performed.
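As a rough illustration of why nearest neighbor models are memory-based, the sketch below classifies a query by majority vote among its k closest stored training samples. The feature vectors, labels, and the function name `knn_predict` are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples.

    No model is fit: every prediction computes one distance per stored
    training sample, which is why the cost grows with the training set.
    """
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature abundance vectors and response labels.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
train_y = ["non-responder", "non-responder", "responder", "responder", "responder"]

print(knn_predict(train_X, train_y, (0.8, 0.8)))
print(knn_predict(train_X, train_y, (0.15, 0.15)))
```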
  • Random forest, decision tree, and boosted tree algorithms.
  • the model is a decision tree. Decision trees suitable for use as models are described generally by Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp.395-396, which is hereby incorporated by reference.
  • Tree-based methods partition the feature space into a set of rectangles, and then fit a model (like a constant) in each one.
  • the decision tree is random forest regression.
  • one specific algorithm is a classification and regression tree (CART).
  • Other specific decision tree algorithms include, but are not limited to, ID3, C4.5, MART, and Random Forests. CART, ID3, and C4.5 are described in Duda, 2001, Pattern Classification, John Wiley & Sons, Inc., New York, pp.396-408 and pp.411-412, which is hereby incorporated by reference.
  • the decision tree model includes at least 10, at least 20, at least 50, or at least 100 parameters (e.g., weights and/or decisions) and requires a computer to calculate because it cannot be mentally solved. [00127] Regression. In some embodiments, the model uses a regression algorithm.
  • a regression algorithm is any type of regression.
  • the regression algorithm is logistic regression.
  • the regression algorithm is logistic regression with lasso, L2 or elastic net regularization.
  • those extracted features that have a corresponding regression coefficient that fails to satisfy a threshold value are pruned (removed from) consideration.
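The pruning step described above can be sketched as follows. The sketch assumes that a coefficient "fails to satisfy a threshold" when its absolute value falls below that threshold; this reading, the feature names, and the coefficient values are illustrative assumptions, not details taken from the disclosure.

```python
import math


def prune_features(coefficients, threshold):
    """Keep only features whose |coefficient| meets the threshold (assumed rule)."""
    return {name: c for name, c in coefficients.items() if abs(c) >= threshold}


def predict_probability(features, coefficients, intercept=0.0):
    """Logistic regression output using only the retained coefficients."""
    z = intercept + sum(c * features[name] for name, c in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted coefficients, e.g., after lasso regularization.
fitted = {"genome_A": 1.2, "genome_B": -0.01, "genome_C": -0.8}
kept = prune_features(fitted, threshold=0.05)
print(sorted(kept))  # genome_B is pruned from consideration

sample = {"genome_A": 0.3, "genome_B": 0.5, "genome_C": 0.2}
print(predict_probability(sample, kept))
```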
  • a generalization of the logistic regression model that handles multicategory responses is used as the model. Logistic regression algorithms are disclosed in Agresti, An Introduction to Categorical Data Analysis, 1996, Chapter 5, pp.103-144, John Wiley & Sons, New York, which is hereby incorporated by reference.
  • the model makes use of a regression model disclosed in Hastie et al., 2001, The Elements of Statistical Learning, Springer-Verlag, New York.
  • the logistic regression model includes at least 10, at least 20, at least 50, at least 100, or at least 1000 parameters (e.g., weights) and requires a computer to calculate because it cannot be mentally solved. [00128] Linear discriminant analysis algorithms.
  • linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
  • the resulting combination is used as the model (linear model) in some embodiments of the present disclosure.
  • Mixture model and Hidden Markov model.
  • the model is a mixture model, such as that described in McLachlan et al., Bioinformatics 18(3):413-422, 2002.
  • the model is a hidden Markov model such as described by Schliep et al., 2003, Bioinformatics 19(1):i255-i263. [00130] Clustering.
  • the model is an unsupervised clustering model.
  • the model is a supervised clustering model. Clustering algorithms suitable for use as models are described, for example, at pages 211-256 of Duda and Hart, Pattern Classification and Scene Analysis, 1973, John Wiley & Sons, Inc., New York, (hereinafter "Duda 1973") which is hereby incorporated by reference in its entirety.
  • the clustering problem is described as one of finding natural groupings in a dataset.
  • two issues are addressed.
  • a way to measure similarity (or dissimilarity) between two samples is determined; this metric is, e.g., a similarity measure.
  • a mechanism for partitioning the data into clusters using the similarity measure is determined.
  • One way to begin a clustering investigation is to define a distance function and to compute the matrix of distances between all pairs of samples in the training set.
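A minimal sketch of that first step, computing the matrix of distances between all pairs of samples, is shown below. Euclidean distance is an assumed choice; the disclosure does not fix a particular distance function, and the sample vectors are hypothetical.

```python
import math


def distance_matrix(samples):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distance is symmetric, so each pair is computed once.
            d[i][j] = d[j][i] = math.dist(samples[i], samples[j])
    return d


# Hypothetical two-dimensional samples.
samples = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
for row in distance_matrix(samples):
    print(row)
```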
  • clustering does not use a distance metric.
  • a nonmetric similarity function s(x, x') is used to compare two vectors x and x'.
  • s(x, x') is a symmetric function whose value is large when x and x' are somehow “similar.”
  • Partitions of the dataset that extremize the criterion function are used to cluster the data.
  • Particular exemplary clustering techniques contemplated for use in the present disclosure include, but are not limited to, hierarchical clustering (agglomerative clustering using a nearest-neighbor algorithm, farthest-neighbor algorithm, the average linkage algorithm, the centroid algorithm, or the sum-of-squares algorithm), k-means clustering, fuzzy k-means clustering algorithm, and Jarvis-Patrick clustering.
  • the clustering includes unsupervised clustering (e.g., with no preconceived number of clusters and/or no predetermination of cluster assignments).
  • Ensembles of models and boosting are used.
  • a boosting technique such as AdaBoost is used in conjunction with many other types of learning algorithms to improve the performance of the model.
  • the output of any of the models disclosed herein, or their equivalents is combined into a weighted sum that represents the final output of the boosted model.
  • the plurality of outputs from the models is combined using any measure of central tendency known in the art, including but not limited to a mean, median, mode, a weighted mean, weighted median, weighted mode, etc.
  • the plurality of outputs is combined using a voting method.
  • a respective model in the ensemble of models is weighted or unweighted.
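The combination strategies listed above (a weighted sum, a measure of central tendency, or a vote) can be sketched as below. The per-model outputs and weights are hypothetical, and the helper names are assumptions made for the sketch.

```python
from collections import Counter
from statistics import median


def weighted_sum(outputs, weights):
    """Weighted sum of per-model outputs, as in a boosted model."""
    return sum(o * w for o, w in zip(outputs, weights))


def weighted_mean(outputs, weights):
    """Weighted mean: the weighted sum normalized by the total weight."""
    return weighted_sum(outputs, weights) / sum(weights)


def majority_vote(labels):
    """Return the class label predicted by the most models."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical outputs from three models for one subject.
scores = [0.2, 0.9, 0.6]
print(weighted_mean(scores, [1.0, 2.0, 1.0]))
print(median(scores))
print(majority_vote(["responder", "non-responder", "responder"]))
```

An unweighted ensemble corresponds to giving every model the same weight.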
  • the term “parameter” refers to any coefficient or, similarly, any value of an internal or external element (e.g., a weight and/or a hyperparameter) in an algorithm, model, regressor, and/or classifier that can affect (e.g., modify, tailor, and/or adjust) one or more inputs, outputs, and/or functions in the algorithm, model, regressor and/or classifier.
  • a parameter refers to any coefficient, weight, and/or hyperparameter that can be used to control, modify, tailor, and/or adjust the behavior, learning, and/or performance of an algorithm, model, regressor, and/or classifier.
  • a parameter is used to increase or decrease the influence of an input (e.g., a feature) to an algorithm, model, regressor, and/or classifier.
  • a parameter is used to increase or decrease the influence of a node (e.g., of a neural network), where the node includes one or more activation functions.
  • Assignment of parameters to specific inputs, outputs, and/or functions is not limited to any one paradigm for a given algorithm, model, regressor, and/or classifier but can be used in any suitable algorithm, model, regressor, and/or classifier architecture for a desired performance.
  • a parameter has a fixed value.
  • a value of a parameter is manually and/or automatically adjustable.
  • a value of a parameter is modified by a validation and/or training process for an algorithm, model, regressor, and/or classifier (e.g., by error minimization and/or backpropagation methods).
  • an algorithm, model, regressor, and/or classifier of the present disclosure includes a plurality of parameters.
  • the plurality of parameters is n parameters, where: n ≥ 2; n ≥ 5; n ≥ 10; n ≥ 25; n ≥ 40; n ≥ 50; n ≥ 75; n ≥ 100; n ≥ 125; n ≥ 150; n ≥ 200; n ≥ 225; n ≥ 250; n ≥ 350; n ≥ 500; n ≥ 600; n ≥ 750; n ≥ 1,000; n ≥ 2,000; n ≥ 4,000; n ≥ 5,000; n ≥ 7,500; n ≥ 10,000; n ≥ 20,000; n ≥ 40,000; n ≥ 75,000; n ≥ 100,000; n ≥ 200,000; n ≥ 500,000, n ≥ 1 × 10^6, n ≥ 5 × 10^6
  • the term “untrained model” refers to a machine learning model or algorithm, such as a classifier or a neural network, that has not been trained on a target dataset.
  • “training a model” refers to the process of training an untrained or partially trained model (e.g., “an untrained or partially trained neural network”).
  • the term “untrained model” does not exclude the possibility that transfer learning techniques are used in such training of the untrained or partially trained model.
  • auxiliary training datasets that can be used to complement the primary training dataset in training the untrained model in the present disclosure.
  • two or more auxiliary training datasets, three or more auxiliary training datasets, four or more auxiliary training datasets or five or more auxiliary training datasets are used to complement the primary training dataset through transfer learning, where each such auxiliary dataset is different than the primary training dataset. Any manner of transfer learning is used, in some such embodiments.
  • the parameters learned from the first auxiliary training dataset are applied to the second auxiliary training dataset using transfer learning techniques (e.g., a second model that is the same or different from the first model), which in turn results in a trained intermediate model whose parameters are then applied to the primary training dataset and this, in conjunction with the primary training dataset itself, is applied to the untrained model.
  • the term “AUC” refers to the Area Under the Curve, for example, of a ROC Curve. That value can assess the merit of a test on a given sample population with a value of 1 representing a good test ranging down to 0.5 which means the test is providing a random response in classifying test subjects. Since the range of the AUC is only 0.5 to 1.0, a small change in AUC has greater significance than a similar change in a metric that ranges from 0 to 1 or 0 to 100%. When the % change in the AUC is given, it will be calculated based on the fact that the full range of the metric is 0.5 to 1.0. A variety of statistics packages can calculate AUC for an ROC curve.
  • AUC can be used to compare the accuracy of the classification algorithm across the complete data range. Classification algorithms with greater AUC have, by definition, a greater capacity to classify unknowns correctly between the two groups of interest (disease and no disease, responder and non-responder).
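For illustration, the AUC of a ROC curve can be computed without plotting the curve, as the fraction of (positive, negative) subject pairs in which the positive subject receives the higher score, with ties counted as one half. The sketch below uses that pairwise formulation. The second function shows one possible reading of the "% change" convention above, dividing the change by the 0.5-wide range; that reading, the function names, and the example data are assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    `labels` are 1 (e.g., responder) or 0 (e.g., non-responder);
    tied scores contribute 0.5 to the count.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def percent_change_in_auc(old_auc, new_auc):
    """Change in AUC as a percentage of the 0.5-to-1.0 range (assumed reading)."""
    return 100.0 * (new_auc - old_auc) / (1.0 - 0.5)


# Hypothetical labels and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
print(percent_change_in_auc(0.75, 0.875))
```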
  • the term “instruction” refers to an order given to a computer processor by a computer program. On a digital computer, each instruction is a sequence of 0s and 1s that describes a physical operation the computer is to perform. Such instructions can include data transfer instructions and data manipulation instructions.
  • each instruction is a type of instruction in an instruction set that is recognized by a particular processor type used to carry out the instructions.
  • instruction sets include, but are not limited to, Reduced Instruction Set Computer (RISC), Complex Instruction Set Computer (CISC), Minimal instruction set computers (MISC), Very long instruction word (VLIW), Explicitly parallel instruction computing (EPIC), and One instruction set computer (OISC).
  • FIG. 1 is a block diagram illustrating a system 100 in accordance with some implementations.
  • the system 100 in some implementations includes one or more processing units CPU(s) 102 (also referred to as processors), one or more network interfaces 104, a user interface 106 including (optionally) a display 108 and an input system 110, a non-persistent memory 111, a persistent memory 112, and one or more communication buses 114 for interconnecting these components.
  • the one or more communication buses 114 optionally include circuitry (sometimes called a chipset) that interconnects and controls communications between system components.
  • the non-persistent memory 111 typically includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, ROM, EEPROM, flash memory, whereas the persistent memory 112 typically includes CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
  • the persistent memory 112 optionally includes one or more storage devices remotely located from the CPU(s) 102.
  • the persistent memory 112, and the non-volatile memory device(s) within the non-persistent memory 111, comprise a non-transitory computer readable storage medium.
  • the non-persistent memory 111 or alternatively the non-transitory computer readable storage medium stores the following programs, modules and data structures, or a subset thereof, sometimes in conjunction with the persistent memory 112:
      • an optional operating system 116, which includes procedures for handling various basic system services and for performing hardware dependent tasks;
      • an optional network communication module (or instructions) 118 for connecting the system 100 with other devices and/or a communication network 104;
      • a microbiome evaluation module 140 for determining a disease state, in a plurality of disease states, of a subject based on the constitution of the subject’s microbiome; and
      • a datastore of subject information 140 based on microbiome sequencing results 150, including abundance values 152 for microbes in each of guilds 152-A and 152-B as described herein.
  • one or more of the above identified elements are stored in one or more of the previously mentioned memory devices, and correspond to a set of instructions for performing a function described above.
  • the above identified modules, data, or programs (e.g., sets of instructions) need not be implemented as separate software programs, procedures, datasets, or modules, and thus various subsets of these modules and data may be combined or otherwise re-arranged in various implementations.
  • the non-persistent memory 111 optionally stores a subset of the modules and data structures identified above. Furthermore, in some embodiments, the memory stores additional modules and data structures not described above.
  • one or more of the above identified elements is stored in a computer system, other than that of visualization system 100, that is addressable by visualization system 100 so that visualization system 100 may retrieve all or a portion of such data when needed.
  • Although Figure 1 depicts a "system 100," the figure is intended more as a functional description of the various features that may be present in computer systems than as a structural schematic of the implementations described herein. In practice, and as recognized by those of ordinary skill in the art, items shown separately could be combined and some items could be separated. Moreover, although Figure 1 depicts certain data and modules in non-persistent memory 111, some or all of these data and modules instead may be stored in persistent memory 112. [00143] 1.
  • Figure 2 is a schematic diagram of a method of training a model for predicting a subject’s response to a therapy for a disorder as discussed below.
  • the method may be implemented using a computer system (e.g., the computer system 100 shown and described above in reference to Figure 1).
  • the method includes obtaining, in electronic form, for each respective training subject in a plurality of training subjects, (i) a corresponding plurality of genomic abundance values for the respective training subject at a time prior to receiving the therapy, wherein the corresponding plurality of genomic abundance values comprises, for each respective gut microorganism in a plurality of gut microorganisms, a corresponding value for the abundance of the genome of the respective gut microorganism in a corresponding biological sample from the gut of the respective training subject, and (ii) an indication of the respective training subject’s response to the therapy.
  • Each respective training subject in the plurality of training subjects has received a therapy for a disorder.
  • the plurality of training subjects comprises at least 50, at least 100, at least 200, at least 500, at least 1000, at least 5000, at least 10,000, at least 20,000, at least 50,000, at least 100,000, at least 500,000, or at least 1,000,000 subjects. In some embodiments, the plurality of training subjects comprises no more than 1,000,000, no more than 500,000, no more than 100,000, no more than 50,000, no more than 20,000, no more than 10,000, no more than 1000 subjects, no more than 500 subjects, no more than 100 subjects, or no more than 50 subjects.
  • the plurality of training subjects consists of from 50 to 100, from 50 to 200, from 50 to 500, from 100 to 500, from 200 to 500, from 200 to 1000, from 500 to 1000, from 200 to 5,000, from 1000 to 10,000, from 5000 to 20,000, from 10,000 to 50,000, from 20,000 to 100,000, or from 500,000 to 1,000,000.
  • the plurality of training subjects falls within another range starting no lower than 50 subjects and ending no higher than 100,000,000 subjects.
  • the plurality of subjects shares similar health status (such as physical or mental conditions, medical history, gene carrier, or medication use).
  • a corresponding biological sample from the gut of the respective training subject was taken prior to a treatment or a therapy.
  • the biological sample is taken no more than 15 minutes, 30 minutes, 1 hour, 2 hours, 3 hours, 4 hours, 5 hours, 6 hours, 12 hours, or 24 hours prior to a treatment or a therapy.
  • the biological sample is taken 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 1 week, 2 weeks, 3 weeks, 4 weeks, or more prior to a treatment or a therapy.
  • the biological sample is taken about any of 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, or more prior to a treatment or a therapy.
  • sample data (including plasma and stool specimens) and clinical information (including gender, age, body fat count, underlying disease, histopathological characteristics, etc.) were collected for each training subject prior to receiving a therapy.
  • Individual biological samples were subjected to full microbiome analysis.
  • the sample is a tissue biopsy, an intestinal sample, or a mucosal sample. See, for example, Tang Q, et al., Current Sampling Methods for Gut Microbiota: A Call for More Precise Devices, Front Cell Infect Microbiol., 10:151 (2020), the content of which is incorporated herein by reference in its entirety.
  • the biological sample from the gut of the respective subject is a fecal sample from the respective training subject.
  • the corresponding value for the abundance of the genome is a value representative of the absolute abundance of a microorganism genome. In some of the embodiments, the corresponding value for the abundance of the genome is a value representative of a normalized abundance value, or a relative abundance value (e.g., an abundance of one microorganism normalized against the abundance of total microbiome of interest). In some of the embodiments, the corresponding value for the abundance of the genome is a value representative of an averaged abundance value (e.g., average of abundances obtained at different time points or from different biological samples from the patients, or average of abundances obtained using different probes, etc.), or a combination of any of above.
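The relative and averaged abundance values described above can be illustrated with a short sketch. The genome identifiers and counts are hypothetical, and normalizing against the sum over all genomes in the sample is one assumed interpretation of "the abundance of total microbiome of interest."

```python
def relative_abundance(absolute):
    """Normalize each genome's abundance against the sample total."""
    total = sum(absolute.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return {genome: value / total for genome, value in absolute.items()}


def averaged_abundance(samples):
    """Average each genome's abundance across several samples or time points.

    Genomes missing from a sample are treated as having abundance 0.
    """
    genomes = set().union(*samples)
    return {
        g: sum(s.get(g, 0.0) for s in samples) / len(samples) for g in genomes
    }


# Hypothetical read counts for two time points from one subject.
day_0 = {"genome_A": 30, "genome_B": 10, "genome_C": 60}
day_7 = {"genome_A": 50, "genome_B": 30, "genome_C": 20}

print(relative_abundance(day_0))
print(averaged_abundance([relative_abundance(day_0), relative_abundance(day_7)]))
```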
  • the corresponding value for the abundance of the genome is measured by any technique known in the art.
  • the genomic abundance value for the genome is measured by quantitative PCR (qPCR), such as bacterial 16S rRNA qPCR, RT-PCR, or qRT-PCR, for quantifying the abundance of regions of interest in the genome, e.g., as described in U.S. Patent No. 11,427,865, the disclosure of which is hereby incorporated by reference in its entirety.
  • the genomic abundance value is measured by targeted sequencing (e.g., 16S rRNA sequencing, or any other suitable biomarker), partial genome sequencing or whole genome sequencing, thereby quantifying the number of reads of the targeted regions in a microorganism genome to determine the abundance of the genome, e.g., as disclosed in U.S. Patent Application Publication No. 2021/0403986 or U.S. Patent No. 11,332,783, the disclosures of which are hereby incorporated by reference in their entireties.
  • deep sequencing is employed to determine the abundance of targeted sequences, e.g., as disclosed in U.S.
  • the sequencing depth is at least 2X, at least 3X, at least 4X, at least 5X, at least 6X, at least 7X, at least 8X, at least 9X, at least 10X, at least 11X, at least 12X, at least 13X, at least 14X, at least 15X, at least 16X, at least 17X, at least 18X, at least 19X, at least 20X, at least 21X, at least 22X, at least 23X, at least 24X, at least 25X, at least 26X, at least 27X, at least 28X, at least 29X, at least 30X, at least 31X, at least 32X, at least 33X, at least 34X, at least 35X, at least 36X, at least 37X, at least 38X, at least 39X, at least 40X, at least 41X, at least 42X, at least 43X, at least 44X, at least 45X
  • Tumor responses were evaluated using RECIST criteria.
  • complete response was defined as complete radiographic disappearance of measurable or evaluable disease or stable, minimal radiographic findings; partial response was defined as reduction of the longest dimension of measurable disease by at least 50%; stable disease was defined as reduction of the longest dimension by less than 25%; progressive disease was defined as growth of the tumor by more than 25% in the longest dimension or development of new lesions.
  • overall response rate was defined as the sum of the complete and partial response rates and the tumor control rate was defined as the sum of overall response rates with stable disease rates.
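The two rates defined above follow directly from per-subject response categories, as the sketch below shows. The category codes (CR, PR, SD, PD) follow common usage, and the list of responses is hypothetical.

```python
from collections import Counter


def response_rates(best_responses):
    """Return (overall response rate, tumor control rate).

    Overall response rate = complete + partial response rates;
    tumor control rate = overall response rate + stable disease rate.
    """
    if not best_responses:
        raise ValueError("at least one subject is required")
    counts = Counter(best_responses)
    n = len(best_responses)
    orr = (counts["CR"] + counts["PR"]) / n
    tumor_control = orr + counts["SD"] / n
    return orr, tumor_control


# Hypothetical responses for eight subjects.
responses = ["CR", "PR", "PR", "SD", "PD", "PD", "SD", "PD"]
orr, tcr = response_rates(responses)
print(f"ORR = {orr:.3f}, tumor control rate = {tcr:.3f}")
```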
  • the indication of the subject’s response is characterized by the actual treatment efficacy of a therapy, including progression-free survival (PFS), the duration of the progression-free survival under treatment, overall survival (OS), response to therapy (RT), overall response rate (ORR), sustained clinical effect (DCB), Disease Activity Score, or any combination thereof, or any other method for evaluating the progression or prognosis of a disease or disorder known in the art.
  • “progression free survival” (PFS) has its art-understood meaning relating to the length of time during and after the treatment of a disease, such as cancer, that a patient lives with the disease but it does not get worse.
  • measuring the progression-free survival is utilized as an assessment of how well a new treatment works.
  • PFS is determined in a randomized clinical trial; in some such embodiments, PFS refers to time from randomization until objective tumor progression and/or death.
  • ORR may be defined as the proportion of patients in whom partial (PR) or complete (CR) responses are identified as a best overall response (BOR) according to some metric, such as Response Evaluation Criteria in Solid Tumors (RECIST 1.1). Stable disease (SD) was categorized as non-response together with progressive disease (PD).
  • ORR has its art-understood meaning referring to the proportion of patients with tumor size reduction of a predefined amount and for a minimum time period. In some embodiments, response duration is usually measured from the time of initial response until documented tumor progression. In some embodiments, ORR involves the sum of partial responses plus complete responses. [00154] In some embodiments, "clinical effect" refers to a clinical benefit. In some embodiments, such a clinical benefit is or comprises reduction in tumor size, increase in progression free survival, increase in overall survival, decrease in overall tumor burden, decrease in the symptoms caused by tumor growth such as pain, organ failure, bleeding, damage to the skeletal system, and other related sequelae of metastatic cancer and combinations thereof.
  • the clinical effect is a “sustained clinical effect” (DCB) that is maintained for a relevant period of time.
  • the relevant period of time is at least 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 1 year, 2 years, 3 years, 4 years, 5 years, or longer.
  • the Disease Activity Score (DAS) system represents both the current state of disease activity and change.
  • the DAS scoring system uses a weighted mathematical formula, derived from clinical trials in rheumatoid arthritis (RA).
  • the DAS 28 is 0.56·√(T28) + 0.28·√(SW28) + 0.70·ln(ESR) + 0.014·GH, wherein T represents tender joint number, SW is swollen joint number, ESR is erythrocyte sedimentation rate, and GH is global health.
  • Various values of the DAS represent high or low disease activity as well as remission, and the change and endpoint score result in a categorization of the patient by degree of response (none, moderate, good).
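As a worked illustration of the DAS 28 formula, the sketch below computes the score from its four inputs. It assumes the standard DAS28-ESR form, in which square roots are applied to the tender and swollen joint counts and the natural logarithm to ESR; the function name and the example values are hypothetical.

```python
import math


def das28_esr(tender28, swollen28, esr, global_health):
    """DAS 28 from tender joints (T28), swollen joints (SW28), ESR, and GH.

    Assumes the standard form with square roots on the joint counts:
    0.56*sqrt(T28) + 0.28*sqrt(SW28) + 0.70*ln(ESR) + 0.014*GH.
    """
    if esr <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    return (
        0.56 * math.sqrt(tender28)
        + 0.28 * math.sqrt(swollen28)
        + 0.70 * math.log(esr)
        + 0.014 * global_health
    )


# Hypothetical patient: 4 tender joints, 9 swollen joints,
# ESR of 20, and a global health score of 50.
print(round(das28_esr(4, 9, 20.0, 50.0), 2))
```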
  • the indication of the subject’s response is measured by the level of the immune response or immune parameters of a cancer-bearing patient resulting from an immunotherapy.
  • the immune response or immune parameters are characterized by expression level of various biological markers of the host immune response in conjunction with the occurrence of a cancer at a given stage of cancer development (i.e., treatment efficacy).
  • the expression level of a biological marker is compared with a reference value for the same biological marker, and when required with reference values. The reference value for the same biological marker is thus predetermined and is already known to be indicative of a reference value that is pertinent for discriminating between a low level and a high level of the immune response of a patient with cancer, for said biological marker.
  • Said predetermined reference value for said biological marker is correlated with a responder to treatment in a cancer patient, or conversely is correlated with non-responder to treatment in a cancer patient.
  • a change of a combination of biological markers is quantified.
  • a combination of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or more distinct biological markers is quantified.
  • biological markers are quantified with immunohistochemical techniques.
  • Example biological markers include 18S, ACE, ACTB, AGTR1, AGTR2, APC, APOA1, ARF1, AXIN1, BAX, BCL2, BCL2L1, CXCR5, BMP2, BRCA1, BTLA, C3, CASP3, CASP9, CCL1, CCL11, CCL13, CCL16, CCL17, CCL18, CCL19, CCL2, CCL20, CCL21, CCL22, CCL23, CCL24, CCL25, CCL26, CCL27, CCL28, CCL3, CCL5, CCL7, CCL8, CCNB1, CCND1, CCNE1, CCR1, CCR10, CCR2, CCR3, CCR4, CCR5, CCR6, CCR7, CCR8, CCR9, CCRL2, CD154, CD19, CD1a, CD2, CD226, CD
  • a treatment regime or a therapy can be administered via any common route so long as the target tissue or cell is available via that route.
  • the methods include sequencing, for each respective subject in the plurality of training subjects, genomic DNA from the corresponding biological sample from the gut of the respective training subject, thereby obtaining the corresponding plurality of nucleic acid sequences.
  • the corresponding plurality of nucleic acid sequences comprises at least 10,000, at least 100,000, at least 250,000, at least 500,000, at least 1,000,000, at least 2,500,000, at least 5,000,000, at least 10,000,000 or at least 50,000,000 nucleic acid sequences.
  • the corresponding plurality of nucleic acid sequences comprises no more than 250,000,000, no more than 100,000,000, no more than 50,000,000, no more than 25,000,000, no more than 10,000,000, no more than 5,000,000, no more than 1,000,000, or no more than 100,000 nucleic acid sequences. In some embodiments, the corresponding plurality of nucleic acid sequences consists of from 100,000 to 1,000,000, from 200,000 to 5,000,000, from 500,000 to 10,000,000, from 1,000,000 to 20,000,000, from 5,000,000 to 50,000,000, from 10,000,000 to 100,000,000, or from 50,000,000 to 250,000,000 nucleic acid sequences. In some embodiments, the corresponding plurality of nucleic acid sequences falls within another range starting no lower than 1000 nucleic acid sequences and ending no higher than 250,000,000 nucleic acid sequences.
  • the corresponding plurality of nucleic acid sequences are obtained through metagenomic sequencing, e.g., as disclosed in U.S. Patent Application Publication No.2016/0239602 or U.S. Patent No.11,495,326, the contents of which are incorporated herein by reference in their entireties.
  • metagenomic sequencing further comprises generating the plurality of metagenomic fragment reads.
  • metagenomic sequencing further comprises fragmenting microbial genomes into random fragments of targeted sizes. The resulting fragments can vary in size. In one embodiment, fragments of approximately 500 nucleotides can be obtained.
  • fragments of from 100-2000 nucleotides, e.g., 200-800, 100-900, 100-1000, 300-800, or 400-900 nucleotides, can be obtained.
  • the method may further comprise extracting the metagenomic fragments from the corresponding biological sample.
  • metagenomic sequencing further comprises sequencing the fragments using high throughput sequencing methods to generate a plurality of sequencing reads. [00162]
  • the corresponding plurality of nucleic acid sequences are obtained through targeted panel sequencing. An example of targeted panel sequencing is described in U.S. Patent Application Publication No.2019/0316209.
  • the targeted panel sequencing comprises hybridizing genomic DNA isolated from a biological sample from the gut of a subject with a panel of probes that include one or more probes that hybridize to a unique sequence in the genome of each microorganism being quantified, prior to sequencing recovered nucleic acids.
  • the microorganisms include a plurality of the microorganisms listed in Table 1, Table 2, and/or Figures 13A-13XX.
  • a combination of semi-unique sequences e.g., sequences found in a small number of the microorganism genomes
  • an algorithm e.g., a system of equations.
  • the panel of probes includes at least 1 probe that hybridizes to a sequence unique to each microorganism genome being detected. In some embodiments, the panel of probes includes at least 2, at least 3, at least 4, at least 5, at least 10, at least 25, at least 50, or more probes that hybridize to a different sequence unique to each microorganism genome being detected.
  • the panel of probes includes at least 20, at least 30, at least 40, at least 50, at least 75, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 750, at least 1000, at least 1250, at least 1500, at least 2000, at least 2500, at least 3000, at least 4000, at least 5000, at least 7500, at least 10,000 or more unique probes.
  • the genomic DNA sequenced from the corresponding biological sample comprises a partial or complete sequencing platform adapter sequence at its termini, useful for sequencing on a sequencing platform of interest.
  • Sequencing platforms of interest include, but are not limited to, the HiSeq™, MiSeq™ and Genome Analyzer™ sequencing systems from Illumina®; the Ion PGM™ and Ion Proton™ sequencing systems from Ion Torrent™; the PACBIO RS II Sequel system from Pacific Biosciences; the SOLiD sequencing systems from Life Technologies™; the 454 GS FLX+ and GS Junior sequencing systems from Roche; the MinION™ system from Oxford Nanopore; or any other sequencing platform of interest.
  • the methods include obtaining, for each respective training subject in the plurality of training subjects, in electronic form, a corresponding plurality of nucleic acid sequences for genomic DNA from the corresponding biological sample from the gut of the respective training subject.
  • the methods include determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism from the corresponding plurality of nucleic acid sequences.
  • the genomic abundance values determined for each respective subject in the plurality of training subjects comprise at least 20, at least 25, at least 50, at least 75, at least 100, at least 125, at least 150, at least 175, at least 200, at least 250, at least 300, at least 350, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1500, at least 2000, at least 2500, at least 5,000 or at least 10,000 genome abundance values, where each genome abundance value corresponds to a different gut microorganism.
  • the genomic abundance values determined for each respective subject in the plurality of training subjects comprise no more than 250,000, no more than 100,000, no more than 50,000, no more than 25,000, no more than 10,000, no more than 5000, no more than 2500, no more than 1000, no more than 750, no more than 500, or fewer genome abundance values.
  • the genomic abundance values determined for each respective subject in the plurality of training subjects consist of from 10 to 40, from 20 to 50, from 30 to 80, from 40 to 100, from 50 to 150, from 60 to 200, from 80 to 300, from 90 to 500, from 100 to 1000, from 500 to 2,000, or from 1,000 to 5,000 genome abundance values.
  • the genomic abundance values determined for each respective subject in the plurality of training subjects fall within another range starting no lower than 10 genome abundance values and ending no higher than 250,000 genome abundance values.
  • the methods include assembling, in electronic form, a corresponding plurality of gut microorganism genomes by metagenomic de novo sequence assembly from the corresponding plurality of nucleic acid sequences, and calculating, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism based on the prevalence of respective nucleic acid sequences, in the plurality of nucleic acid sequences, used to assemble a respective gut microorganism genome in the plurality of gut microorganism genomes corresponding to the respective gut microorganism.
  • metagenomic de novo sequence assembly further comprises generating contigs based on the sequencing reads generated by a shotgun sequencing technique.
  • a shotgun sequencing technique is described, for example, in U.S. Patent No.10,529,443, the content of which is incorporated herein by reference in its entirety.
  • the first plurality of nucleic acid sequences is assembled into full genomes of the plurality of gut microorganisms.
  • the plurality of nucleic acid sequences is assembled into partial genomes of the plurality of gut microorganisms.
  • the methods include assigning each respective nucleic acid sequence in the corresponding plurality of nucleic acid sequences to a respective gut microorganism in the plurality of gut microorganisms, thereby generating, for each respective gut microorganism in the plurality of gut microorganisms, a corresponding count of respective nucleic acid sequences in the corresponding plurality of nucleic acid sequences assigned to the respective gut microorganism, and determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding genomic abundance value for the respective gut microorganism based on the corresponding count of respective nucleic acid sequences assigned to the respective gut microorganism.
  • assigning each respective nucleic acid sequence to a respective gut microorganism includes mapping the nucleic acid sequence to a reference nucleic acid (e.g., a contig listed in FIG.12).
  • assigning each respective nucleic acid sequence to a respective gut microorganism includes annotating genome information based on existing databases.
  • nucleic acid sequences are analyzed and annotated to define taxonomic assignments using sequence similarity methods, phylogenetic placement methods, or a combination of the two strategies.
  • Sequence similarity-based methods for assigning each nucleic acid sequence to a respective gut microorganism include those familiar to individuals skilled in the art including, but not limited to BLAST, BLASTx, tBLASTn, tBLASTx, RDP-classifier, DNAclust, and various implementations of these algorithms such as Qiime or Mothur. These methods rely on mapping a sequence read to a reference database and selecting the match with the best score and e-value. In some embodiments, phylogenetic methods are used in combination with sequence similarity methods to improve the calling accuracy of an annotation or taxonomic assignment.
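By way of non-limiting illustration, the count-based determination of genomic abundance values described above can be sketched as follows. The sketch assumes that each sequencing read has already been assigned to a genome identifier (or to `None` when unassigned) by an upstream mapping step, and that abundance is expressed as a fraction of assigned reads. The normalization choice and all names are assumptions for illustration; genome-length normalization, for example, is not shown.

```python
from collections import Counter


def relative_abundance(read_assignments, genome_ids):
    """Return {genome_id: fraction of assigned reads} for the genomes of interest.

    read_assignments: iterable with one genome ID per read, or None if unassigned.
    genome_ids: the gut microorganism genomes being quantified (hypothetical IDs).
    """
    wanted = set(genome_ids)
    # Count only reads assigned to a genome in the panel of interest.
    counts = Counter(a for a in read_assignments if a in wanted)
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in genome_ids}


# Hypothetical example: four reads, one of which could not be assigned.
example = relative_abundance(["g1", "g1", "g2", None], ["g1", "g2", "g3"])
```

Genomes with no assigned reads receive an abundance of zero, so every subject yields a vector of the same length for input to a model.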
  • the plurality of genomic abundance values is determined using a microarray comprising a probe sequence capable of detecting a unique genomic sequence of each respective genome for the plurality of gut microorganisms.
  • the panel of probes on a microarray includes at least 1 probe that hybridizes to a sequence unique to each microorganism genome being detected. In some embodiments, the panel of probes includes at least 2, at least 3, at least 4, at least 5, at least 10, at least 25, at least 50, or more probes that hybridize to a different sequence unique to each microorganism genome being detected.
  • the panel of probes includes at least 20, at least 30, at least 40, at least 50, at least 75, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 750, at least 1000, at least 1250, at least 1500, at least 2000, at least 2500, at least 3000, at least 4000, at least 5000, at least 7500, at least 10,000 or more unique probes.
  • the plurality of gut microorganisms comprises at least 20 gut microorganisms selected from Table 1, Table 2, or Figure 13A-13XX.
  • at least about 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more gut microorganisms are selected from Table 1, Table 2, or Figures 13A-13XX.
  • the plurality of gut microorganisms comprises at least 25 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX.
  • the plurality of gut microorganisms comprises at least 30 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX.
  • the plurality of gut microorganisms comprises at least 40 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX.
  • the plurality of gut microorganisms comprises at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, or all of the gut microorganisms selected from Table 1, Table 2 or Figures 13A-13XX.
  • the plurality of gut microorganisms are at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, or all of the gut microorganisms selected from Table 1, Table 2 or Figures 13A-13XX.
  • the plurality of gut microorganisms are all of the gut microorganisms listed in Table 1.
  • the plurality of gut microorganisms are all of the gut microorganisms listed in Table 2. In some embodiments, the plurality of gut microorganisms are all of the gut microorganisms listed in Figures 13A-13XX.
  • genomic DNA isolated from each fecal sample was sequenced by next generation sequencing, and contigs for microorganism genome sequences were constructed de novo. Generally, the contigs identified for each microorganism are predicted to represent greater than 95% of the entire genome for the microorganism. Genomic constructs having less than 1% sequence divergence from each other were combined and defined to be from the same microorganism. Genomic contigs for each microorganism listed in Table 1, Table 2, and Figures 13A-13XX are provided in the sequence listing filed with the application. The taxonomic assignment of each microorganism is given in Table 1, Table 2, or Figures 13A-13XX.
  • the contigs provided as SEQ ID NOS:1-68 correspond to the genomic sequence of microorganism 1U001.8 (as indicated in FIG.12A), which is a microorganism classified as domain Bacteria, phylum Proteobacteria, class Gammaproteobacteria, order Enterobacterales, family Enterobacteriaceae, genus Escherichia, and species Escherichia coli and is in Guild 2 of the 141 core microorganisms identified in Table 1.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 97% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG.12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 98% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG.12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 99% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG.12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 99.5% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG.12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 97%, at least 97.5%, at least 98%, at least 98.5%, at least 99%, at least 99.1%, at least 99.2%, at least 99.3%, at least 99.4%, at least 99.5%, at least 99.6%, at least 99.7%, at least 99.8%, at least 99.9%, or more sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG.12.
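By way of non-limiting illustration, the identity-threshold classification described above can be sketched as follows. The sketch assumes that a query sequence and a reference contig have already been aligned to equal length with `-` marking gaps, and that gap columns are excluded from the identity calculation. Both assumptions, and all names, are hypothetical; genome-scale comparisons would ordinarily rely on a dedicated alignment or average nucleotide identity tool rather than this simplified calculation.

```python
def percent_identity(aligned_query: str, aligned_ref: str) -> float:
    """Percent of aligned, non-gap columns in which both sequences agree."""
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap (an assumed convention).
    pairs = [(q, r) for q, r in zip(aligned_query, aligned_ref) if q != "-" and r != "-"]
    if not pairs:
        return 0.0
    matches = sum(q == r for q, r in pairs)
    return 100.0 * matches / len(pairs)


def matches_reference(identity_pct: float, threshold_pct: float = 97.0) -> bool:
    """True when the identity meets the chosen threshold (e.g., 97%, 98%, 99%, 99.5%)."""
    return identity_pct >= threshold_pct
```

The threshold is left as a parameter so that any of the identity levels recited above can be applied without changing the comparison logic.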
  • the plurality of gut microorganisms comprises at least 20 microorganisms selected from those microorganisms in Table 1, Table 2, or Figure 13A-13XX having a connectivity of at least 2.
  • the set of identified gut microorganisms are selected from those microorganisms having a connectivity of at least 2, at least 3, at least 4, at least 5, at least 10, at least 15, at least 20, at least 25, or more.
  • the biological sample from the gut of the respective subject is a fecal sample from the respective training subject.
  • said biological sample is a sample obtained from the small or large intestine, preferably colon or rectum, more preferably obtained in the form of a fecal sample or rectal swab or in the form of a biopsy specimen of gastrointestinal mucosa.
  • the therapy is a biological therapy, an immunotherapy, a chemotherapy, a radiation therapy, a gene therapy, a hormone therapy, a photodynamic therapy, a targeted therapy, a small molecule, an antibody, a polynucleotide, a natural compound, an immune modulator, a bone marrow therapy, a stem cell therapy, a surgery therapy, an induction therapy, a maintenance therapy, or a combination thereof.
  • the disorder is selected from the group consisting of type-2 diabetes, hypertension, schizophrenia, atherosclerotic cardiovascular disease (ACVD), liver cirrhosis (LC), inflammatory bowel disease (IBD), colorectal cancer (CRC), ankylosing spondylitis (AS), Parkinson’s disease (PD), rheumatoid arthritis (RA), advanced melanoma, and B cell lymphoma.
  • ACVD atherosclerotic cardiovascular disease
  • LC liver cirrhosis
  • IBD inflammatory bowel diseases
  • CRC colorectal cancer
  • AS ankylosing spondylitis
  • PD Parkinson’s disease
  • RA rheumatoid arthritis
  • the disorder is, e.g., hypertension (HT), schizophrenia (SCZ), Multiple Sclerosis (MS), Gaucher disease type II (GDII), COVID-19 (COV), Behcet's disease (BD), autism spectrum disorder (ASD), or pancreatic cancer (PC).
  • the disorder is cancer, Alzheimer's disease, a cardiovascular disease, an autoimmune disease, a mental health disease, an infectious disease, or a genetic disorder.
  • the disorder is categorized by any indicator of a biological state, function, structure, process, response, or condition in a patient. Such indicators include any of the numerous variables (parameters) that are commonly measured in medicine to evaluate a patient for purposes such as diagnosis, prognosis, and/or treatment.
  • indicators of interest herein are those whose values (which may be quantitative or qualitative) reflect, characterize, or are related to the function or structure of organs and organ systems and/or whose values reflect, characterize, or are related to the presence or severity of conditions.
  • the disease is categorized by the progression or prognosis of a disease or disorder, e.g., different stages of cancer, type, frequency, or degree of severity of the conditions that can be objectively measured or experienced by a subject.
  • the disorder may be acquired by a medical device, which may be used to analyze the status of a body part, wound, or lesion, or of a physical substrate such as a test strip, dipstick, filter, or other substrate that provides means for detecting or measuring the presence of a substance in a biological sample obtained from a subject.
  • pathogens e.g., viruses, bacteria, fungi
  • abnormal tissues e.g., tumor site
  • biomarkers e.g., cancer
  • the disorder is cancer.
  • the methods include inputting, for each respective training subject in the plurality of training subjects, information about the respective training subject into a model comprising a plurality of parameters.
  • the model applies the plurality of parameters to the information, e.g., through at least 10,000 computations, to obtain a corresponding output for the respective training subject from the model.
  • the corresponding output comprises a prediction of the respective training subject’s response to the therapy, and the information about the respective training subject comprises the corresponding genomic abundance value for each respective gut microorganism in the plurality of gut microorganisms, and the plurality of gut microorganisms are selected from Table 1, Table 2, or Figure 13A-13XX.
  • the resulting model was powered to predict responder or non-responder to anti-cytokine or anti-integrin therapy, methotrexate treatment in new-onset Rheumatoid Arthritis, immune checkpoint inhibitor (ICI) treatment on advanced melanoma, and CD19-CAR-T immunotherapy on B cell lymphoma.
  • the prediction of the respective training subject’s response is a class output of a respective response, in a plurality of possible responses, of the respective training subject.
  • the method allows the setting of a single "cut-off" value permitting discrimination between responder or non-responder to a treatment.
  • the prediction of the respective training subject’s response includes a prediction of an objective response rate of the human subject to the treatment or therapy, and wherein the prediction of the objective response rate includes an indication or classification of a complete response or an amount of a partial response to the treatment.
  • the prediction of the respective training subject’s response is a probability output for the respective training subject’s response.
  • the method allows the setting of a single "cut-off" value permitting discrimination between responder or non-responder to a treatment.
  • the methods comprise utilizing the model to calculate a probability value for a subject; comparing the probability value to a threshold value derived from a cohort of responders/non-responders to determine whether the probability value is above or below the threshold value; and classifying the subject as a responder or non-responder according to whether the probability value is above or below the threshold value.
  • the threshold value may be a probability value of at least about 50%, 55%, 60%, 65%, 70%, 75%, or 80% or more.
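By way of non-limiting illustration, the threshold comparison described above can be sketched as follows. The sketch assumes that the model outputs a probability between 0 and 1 and that a probability equal to the threshold is treated as a responder. The tie-handling rule and all names are assumptions not taken from the application.

```python
def classify_response(probability: float, threshold: float = 0.5) -> str:
    """Label a subject as responder or non-responder from a model probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Assumed convention: a probability at or above the cut-off counts as a responder.
    return "responder" if probability >= threshold else "non-responder"
```

For example, with a cut-off of 0.65, a subject with a predicted probability of 0.72 would be labeled a responder and a subject with 0.40 a non-responder.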
  • the probability value is a positive predictive value as measured by area under the curve (AUC) of receiver operating characteristic (ROC) curves.
  • the probability value is calculated using a multivariate logistic regression model, a neural network model, a random forest model or a decision tree model.
  • the model is a neural network algorithm, a support vector machine algorithm, a Naive Bayes algorithm, a nearest neighbor algorithm, a boosted trees algorithm, a Random Forest algorithm, a convolutional neural network algorithm, a decision tree algorithm, a regression algorithm, or a clustering algorithm.
  • the plurality of parameters is at least 1000, at least 10,000, at least 15,000, at least 50,000, at least 100,000, at least 250,000, at least 500,000, or at least 1,000,000 parameters, at least 2,500,000 parameters, at least 5,000,000 parameters, at least 10,000,000 parameters, or more.
  • the model applies the plurality of parameters to the information through at least 1000 computations, at least 5000 computations, at least 10,000 computations, at least 25,000 computations, at least 50,000 computations, at least 100,000 computations, at least 250,000 computations, at least 500,000 computations, at least 1,000,000 computations, at least 2,500,000 computations, at least 5,000,000 computations, at least 10,000,000 computations, or more to obtain a corresponding output for the respective training subject from the model.
  • the methods include adjusting the plurality of parameters based on, for each respective training subject in the plurality of training subjects, one or more differences between (i) the corresponding output from the model and (ii) the corresponding indication of the respective training subject’s response to the therapy.
  • the training of the neural network to improve the accuracy of its prediction involves modifying one or more parameters, including, but not limited to, weights in the filters in convolutional layers as well as biases in network layers.
  • the weights and biases are further constrained with various forms of regularization such as L1, L2, weight decay, and dropout.
  • where the training data is labeled (e.g., with an indication of the state of the biological characteristic), the neural network or any of the models disclosed herein optionally has its parameters (e.g., weights) tuned, that is, adjusted to potentially minimize the error between the system’s predicted indications and the training data’s measured indications.
  • Various methods used to minimize the error function include, but are not limited to, log-loss, sum-of-squares error, and hinge-loss methods. In some embodiments, these methods further include second-order methods or approximations such as momentum, Hessian-free estimation, Nesterov’s accelerated gradient, Adagrad, etc.
  • the methods also combine unlabeled generative pretraining and labeled discriminative training.
  • the training of the neural network comprises adjusting one or more parameters in the plurality of parameters by back-propagation through a loss function.
  • the loss function is a loss function for a regression task and/or a classification task.
  • loss functions suitable for the regression task include, but are not limited to, a mean squared error loss function, a mean absolute error loss function, a Huber loss function, a Log-Cosh loss function, or a quantile loss function.
  • Non-limiting examples of loss functions suitable for the classification task include a binary cross entropy loss function, a hinge loss function, or a squared hinge loss function.
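By way of non-limiting illustration, several of the regression and classification loss functions named above can be sketched in their textbook forms as follows. The function names, the Huber `delta` default, and the clipping constant `eps` are assumptions for illustration; the hinge loss assumes labels encoded as +1 and -1.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Regression loss: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def huber_loss(y_true, y_pred, delta=1.0):
    """Regression loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        total += 0.5 * err ** 2 if err <= delta else delta * (err - 0.5 * delta)
    return total / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Classification loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def hinge_loss(y_true_pm1, scores):
    """Classification loss for labels in {+1, -1} and raw decision scores."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true_pm1, scores)) / len(y_true_pm1)
```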
  • the loss function is any suitable regression task loss function or classification task loss function.
  • the parameters of the neural network are randomly initialized prior to training.
  • the neural network comprises a dropout regularization parameter.
  • a regularization is performed by adding a penalty to the loss function, where the penalty is proportional to the values of the parameters in the trained or untrained model.
  • regularization reduces the complexity of the model by adding a penalty to one or more parameters to decrease the importance of the respective hidden neurons associated with those parameters. Such practice can result in a more generalized model and reduce overfitting of the data.
  • the regularization includes an L1 or L2 penalty.
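By way of non-limiting illustration, adding an L1 or L2 penalty to a loss value can be sketched as follows. The penalty strengths `l1` and `l2` and the function names are hypothetical; the sketch assumes the common convention in which the L1 penalty scales with the sum of absolute parameter values and the L2 penalty with the sum of squared parameter values.

```python
def l1_penalty(weights, strength):
    """L1 penalty: strength times the sum of absolute parameter values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared parameter values."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Total loss = data loss + optional L1 and L2 penalties on the parameters."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)
```

Because larger parameter values increase the total loss, an optimizer minimizing this quantity is pushed toward smaller weights, which is the mechanism by which the penalty can reduce overfitting.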
  • training the neural network comprises using an optimizer.
  • the optimizer may employ the loss function to update the parameters of the neural network or other model via back-propagation.
  • training the neural network uses a learning rate.
  • the learning rate is at least 0.0001, at least 0.0005, at least 0.001, at least 0.005, at least 0.01, at least 0.05, at least 0.1, at least 0.2, at least 0.3, at least 0.4, at least 0.5, at least 0.6, at least 0.7, at least 0.8, at least 0.9, or at least 1.
  • the learning rate is no more than 1, no more than 0.9, no more than 0.8, no more than 0.7, no more than 0.6, no more than 0.5, no more than 0.4, no more than 0.3, no more than 0.2, no more than 0.1, no more than 0.05, no more than 0.01, or less. In some embodiments, the learning rate is from 0.0001 to 0.01, from 0.001 to 0.5, from 0.001 to 0.01, from 0.005 to 0.8, or from 0.005 to 1. In some embodiments, the learning rate falls within another range starting no lower than 0.0001 and ending no higher than 1.
  • the learning rate further comprises a learning rate decay (e.g., a reduction in the learning rate over one or more epochs).
  • a learning rate decay can be a reduction in the learning rate of 0.5 or 0.1.
  • the learning rate is a differential learning rate.
  • training the neural network further uses a scheduler that conditionally applies the learning rate decay based on an evaluation of a performance metric over a threshold number of training epochs (e.g., the learning rate decay is applied when the performance metric fails to satisfy a threshold performance value for at least a threshold number of training epochs).
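By way of non-limiting illustration, a scheduler that conditionally applies a learning rate decay can be sketched as follows. The sketch assumes the performance metric is a loss (lower is better), that "failing to satisfy" the threshold means the loss exceeds a target value, and that the decay multiplies the learning rate by a fixed factor. The class name, parameter names, and defaults are hypothetical.

```python
class ThresholdDecayScheduler:
    """Multiply the learning rate by `decay` after `patience` consecutive
    epochs in which the loss stays above `threshold_value`."""

    def __init__(self, learning_rate, decay=0.5, threshold_value=0.6, patience=2):
        self.learning_rate = learning_rate
        self.decay = decay
        self.threshold_value = threshold_value
        self.patience = patience
        self.failing_epochs = 0

    def step(self, loss):
        """Record one epoch's loss and return the (possibly decayed) learning rate."""
        if loss > self.threshold_value:
            self.failing_epochs += 1
        else:
            self.failing_epochs = 0
        if self.failing_epochs >= self.patience:
            self.learning_rate *= self.decay
            self.failing_epochs = 0  # start counting again after a decay
        return self.learning_rate


scheduler = ThresholdDecayScheduler(0.01, decay=0.5, threshold_value=0.6, patience=2)
rates = [scheduler.step(loss) for loss in [0.9, 0.8, 0.5, 0.7, 0.7]]
```

In this hypothetical run the learning rate is halved after the second and fifth epochs, each of which completes two consecutive epochs above the threshold.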
  • the performance of the neural network is measured at one or more time points using a performance metric, including, but not limited to, a training loss metric, a validation loss metric, and/or a mean absolute error.
  • the performance metric is an area under the receiver operating characteristic curve (AUROC) and/or an area under the precision-recall curve (AUPRC).
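By way of non-limiting illustration, the area under the receiver operating characteristic curve can be computed from labels and model scores as sketched below. The sketch uses the pairwise-ranking formulation (the fraction of responder/non-responder pairs in which the responder receives the higher score, with ties counted as one half), which is equivalent to the area under the ROC curve. The function name and the 1/0 label encoding are assumptions.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive (1) and negative (0) scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))
```

The quadratic pairwise loop is adequate for cohort-sized data; a rank-based implementation would be preferable for very large sample counts.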
  • the performance of the neural network is measured by validating the model using a validation (e.g., development) dataset.
  • training the neural network forms a trained neural network when the neural network satisfies a minimum performance requirement based on a validation.
  • any suitable method for validation can be used, including but not limited to K-fold cross-validation, advanced cross-validation, random cross-validation, grouped cross-validation (e.g., K-fold grouped cross-validation), bootstrap bias corrected cross-validation, random search, and/or Bayesian hyperparameter optimization.
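By way of non-limiting illustration, the partitioning step of K-fold cross-validation can be sketched as follows. The sketch assumes subjects are identified by integer indices and shuffled with a fixed seed for reproducibility. The function name and the seed default are hypothetical, and grouped or stratified variants are not shown.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, 5))
```

Each subject appears in exactly one validation fold, so performance averaged over the folds reflects predictions on data not used to fit the corresponding model.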
  • a method for training a model comprising a plurality of parameters by a procedure comprising (i) inputting a corresponding genomic abundance value for each respective gut microorganism in a plurality of gut microorganisms for each respective training subject in a plurality of training subjects, thereby obtaining as output from the model, for each respective training subject in the plurality of training subjects, a corresponding prediction of a training subject’s response to a therapy, and (ii) refining the plurality of model parameters based on a differential between the corresponding actual response to a therapy of the training subject and the corresponding predicted response to a therapy of the training subject.
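By way of non-limiting illustration, the two-step training procedure described above can be sketched with a logistic regression model fitted by gradient descent. Logistic regression is chosen here only because it is one of the model types named in this disclosure and is compact to show; the abundance vectors, response labels, learning rate, and epoch count are hypothetical, and nothing here is presented as the applicants' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, abundances):
    """Step (i): apply the parameters to one subject's abundance values."""
    return sigmoid(sum(w * x for w, x in zip(weights, abundances)) + bias)


def train_logistic(abundance_rows, responses, learning_rate=0.1, epochs=500):
    """Step (ii): refine parameters from the difference between predicted
    and actual response (1 = responder, 0 = non-responder)."""
    n_features = len(abundance_rows[0])
    n_subjects = len(abundance_rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, actual in zip(abundance_rows, responses):
            diff = predict_probability(weights, bias, row) - actual
            for j, x in enumerate(row):
                grad_w[j] += diff * x
            grad_b += diff
        # Move each parameter against the average gradient of the log-loss.
        weights = [w - learning_rate * g / n_subjects for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_subjects
    return weights, bias


# Hypothetical training data: two genome abundances per subject.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [1, 1, 0, 0]
weights, bias = train_logistic(X, y)
```

After training, `predict_probability(weights, bias, new_abundances)` returns a probability for a new subject, which can then be compared with a cut-off value.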
  • Figure 3 is a schematic diagram of a method for applying a model for predicting a subject’s response to a therapy for a disorder as discussed below.
  • the method 300 may be implemented using a computer system (e.g., the computer system 100 shown and described above in reference to Figure 1).
  • the methods include obtaining, in electronic form, a plurality of genomic abundance values comprising, for each respective gut microorganism in a plurality of gut microorganisms selected from Table 1, Table 2, or Figure 13A-13XX, a corresponding abundance value for the genome of the respective species of gut bacteria, in the plurality of gut microorganisms, in a biological sample from the subject.
  • a corresponding biological sample from the gut of the respective subject was taken prior to a treatment or a therapy.
  • the biological sample is taken no more than 15 minutes, 30 minutes, 1 hour, 2 hours, 3 hours, 4 hours, 5 hours, 6 hours, 12 hours, or 24 hours prior to a treatment or a therapy. In some embodiments, the biological sample is taken 1 day, 2 days, 3 days, 4 days, 5 days, 6 days, 1 week, 2 weeks, 3 weeks, 4 weeks, or more prior to a treatment or a therapy. In some embodiments, the biological sample is taken about any of 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, or more prior to a treatment or a therapy.
  • sample data including plasma, stool specimens
  • clinical information including gender/age/body fat count/underlying disease/histopathological characteristics, etc.
  • Individual biological samples were subjected to full microbiome analysis.
  • the sample is a tissue biopsy, an intestinal sample, or a mucosal sample. See, for example, Tang Q, et al., Current Sampling Methods for Gut Microbiota: A Call for More Precise Devices, Front Cell Infect Microbiol., 10:151 (2020), the content of which is incorporated herein by reference in its entirety.
  • the biological sample from the gut of the respective subject is a fecal sample from the respective subject.
  • the plurality of gut microorganisms comprises at least 25 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX. In some embodiments, the plurality of gut microorganisms comprises at least 30 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX. In some embodiments, the plurality of gut microorganisms comprises at least 40 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX.
  • the plurality of gut microorganisms comprises at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, or all of the gut microorganisms selected from Table 1, Table 2 or Figures 13A-13XX.
  • the plurality of gut microorganisms are at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, or all of the gut microorganisms selected from Table 1, Table 2 or Figures 13A-13XX.
  • the plurality of gut microorganisms are all of the gut microorganisms listed in Table 1.
  • the plurality of gut microorganisms are all of the gut microorganisms listed in Table 2. In some embodiments, the plurality of gut microorganisms are all of the gut microorganisms listed in Figures 13A-13XX. [00209] In some of the embodiments, the corresponding value for the abundance of the genome is a value representative of the absolute abundance of a microorganism genome. In some of the embodiments, the corresponding value for the abundance of the genome is a value representative of a normalized abundance value, or a relative abundance value (e.g., an abundance of one microorganism normalized against the abundance of total microbiome of interest).
  • corresponding value for the abundance of the genome is a value representative of an averaged abundance value (e.g., average of abundances obtained at different time points or from different biological samples from the patients, or average of abundances obtained using different probes, etc.), or a combination of any of above.
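The relative and averaged abundance values described above can be sketched as below. This is a minimal illustration: the genome names (`g1`, `g2`) and values are hypothetical, and treating a genome missing from a sample as zero abundance is an assumption.

```python
def relative_abundance(absolute):
    """Normalize each genome's abundance against the total in the sample."""
    total = sum(absolute.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return {name: value / total for name, value in absolute.items()}


def average_abundance(samples):
    """Average abundances over several samples (e.g., different time points).

    A genome absent from a sample is treated as having abundance 0.0.
    """
    names = sorted(set().union(*samples))
    return {
        name: sum(s.get(name, 0.0) for s in samples) / len(samples)
        for name in names
    }


# Hypothetical absolute abundances for one subject at two time points.
day0 = relative_abundance({"g1": 30, "g2": 10})
day7 = relative_abundance({"g1": 10, "g2": 30})
print(average_abundance([day0, day7]))
```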
  • the corresponding value for the abundance of the genome is measured by any technique known in the art.
  • the genomic abundance value for the genome is measured by quantitative PCR (qPCR), such as bacterial 16S rRNA qPCR, RT-PCR, or qRT-PCR, for quantifying the abundance of regions of interest in the genome, e.g., as described in U.S. Patent No. 11,427,865, the disclosure of which is hereby incorporated by reference in its entirety.
  • the genomic abundance value is measured by targeted sequencing (e.g., 16S rRNA sequencing, or any other suitable biomarker), partial genome sequencing or whole genome sequencing, thereby quantifying the number of reads of the targeted regions in a microorganism genome to determine the abundance of the genome, e.g., as disclosed in U.S. Patent Application Publication No. 2021/0403986 or U.S. Patent No. 11,332,783, the disclosures of which are hereby incorporated by reference in their entireties.
  • deep sequencing is employed to determine the abundance of targeted sequences, e.g., as disclosed in U.S. Patent Application Publication No. 2018/0237863, the disclosure of which is incorporated herein by reference in its entirety.
  • the sequencing depth is at least 2X, at least 3X, at least 4X, at least 5X, at least 6X, at least 7X, at least 8X, at least 9X, at least 10X, at least 11X, at least 12X, at least 13X, at least 14X, at least 15X, at least 16X, at least 17X, at least 18X, at least 19X, at least 20X, at least 21X, at least 22X, at least 23X, at least 24X, at least 25X, at least 26X, at least 27X, at least 28X, at least 29X, at least 30X, at least 31X, at least 32X, at least 33X, at least 34X, at least 35X, at least 36X, at least 37X, at least 38X, at least 39X, at least 40X, at least 41X, at least 42X, at least 43X, at least 44X, at least 45X, at least 46X, at least 47X, at least 48X, at least 49X, at least 50X, at least
  • shotgun metagenomic sequencing is employed to provide sequence reads for genomes in a sample, e.g., as described in U.S. Patent No. 11,028,449, the content of which is incorporated herein by reference in its entirety.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 97% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG. 12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 98% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG. 12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 99% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG. 12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 99.5% sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG. 12.
  • a genome identified in a metagenomic analysis is classified as corresponding to a microorganism listed in Table 1, Table 2, and/or Figures 13A-13XX if the identified genomic constructs have at least 97%, at least 97.5%, at least 98%, at least 98.5%, at least 99%, at least 99.1%, at least 99.2%, at least 99.3%, at least 99.4%, at least 99.5%, at least 99.6%, at least 99.7%, at least 99.8%, at least 99.9%, or more sequence identity when compared to the contigs for the microorganism provided in the sequence listing, as denoted in FIG. 12.
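The identity-threshold rule above can be illustrated with a simplified sketch. It assumes the query and reference sequences are already aligned to equal length and computes identity as the share of matching positions; production pipelines would normally rely on an alignment or average-nucleotide-identity tool, and the function names and sequences here are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of positions that match between two pre-aligned sequences."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b)
    return 100.0 * matches / len(aligned_a)


def matches_reference(aligned_query, aligned_reference, min_identity=97.0):
    """True if the query meets the chosen sequence-identity threshold."""
    return percent_identity(aligned_query, aligned_reference) >= min_identity


# Hypothetical 100-base reference and a query with two mismatches.
reference = "ACGT" * 25
query = "T" + reference[1:4] + "T" + reference[5:]
print(percent_identity(query, reference))
print(matches_reference(query, reference, min_identity=97.0))
print(matches_reference(query, reference, min_identity=99.0))
```

The `min_identity` parameter corresponds to the alternative thresholds (97%, 98%, 99%, 99.5%, and so on) recited above.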
  • the methods include sequencing genomic DNA from the biological sample from the gut of the subject, thereby obtaining the plurality of nucleic acid sequences.
  • the plurality of nucleic acid sequences comprises at least 10,000, at least 100,000, at least 250,000, at least 500,000, at least 1,000,000, at least 2,500,000, at least 5,000,000, at least 10,000,000 or at least 50,000,000 nucleic acid sequences.
  • the plurality of nucleic acid sequences comprises no more than 250,000,000, no more than 100,000,000, no more than 50,000,000, no more than 25,000,000, no more than 10,000,000, no more than 5,000,000, no more than 1,000,000, or no more than 100,000 nucleic acid sequences.
  • the plurality of nucleic acid sequences consists of from 100,000 to 1,000,000, from 200,000 to 5,000,000, from 500,000 to 10,000,000, from 1,000,000 to 20,000,000, from 5,000,000 to 50,000,000, from 10,000,000 to 100,000,000, or from 50,000,000 to 250,000,000 nucleic acid sequences.
  • the plurality of nucleic acid sequences falls within another range starting no lower than 1000 nucleic acid sequences and ending no higher than 250,000,000 nucleic acid sequences.
  • the corresponding plurality of nucleic acid sequences are obtained through metagenomic sequencing, e.g., as disclosed in U.S. Patent Application Publication No. 2016/0239602 or U.S. Patent No. 11,495,326, the contents of which are incorporated herein by reference in their entireties.
  • metagenomic sequencing further comprises generating the plurality of metagenomic fragment reads.
  • metagenomic sequencing further comprises fragmenting microbial genomes into random fragments of targeted sizes. The resulting fragments can vary in size.
  • fragments of approximately 500 nucleotides can be obtained.
  • fragments of from 100-2000 nucleotides, e.g., 200-800, 100-900, 100-1000, 300-800, 400-900 nucleotides can be obtained.
  • the method may further comprise extracting the metagenomic fragments from the corresponding biological sample.
  • metagenomic sequencing further comprises sequencing the fragments using high throughput sequencing methods to generate a plurality of sequencing reads.
  • the corresponding plurality of nucleic acid sequences are obtained through targeted panel sequencing. An example of targeted panel sequencing is described in U.S. Patent Application Publication No. 2019/0316209.
  • the targeted panel sequencing comprises hybridizing genomic DNA isolated from a biological sample from the gut of a subject with a panel of probes that include one or more probes that hybridize to a unique sequence in the genome of each microorganism being quantified, prior to sequencing recovered nucleic acids.
  • the microorganisms include a plurality of the microorganisms listed in Table 1, Table 2, and/or Figures 13A-13XX.
  • a combination of semi-unique sequences e.g., sequences found in a small number of the microorganism genomes
  • an algorithm e.g., a system of equations.
  • the panel of probes includes at least 1 probe that hybridizes to a sequence unique to each microorganism genome being detected. In some embodiments, the panel of probes includes at least 2, at least 3, at least 4, at least 5, at least 10, at least 25, at least 50, or more probes that hybridize to a different sequence unique to each microorganism genome being detected.
  • the panel of probes includes at least 20, at least 30, at least 40, at least 50, at least 75, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 750, at least 1000, at least 1250, at least 1500, at least 2000, at least 2500, at least 3000, at least 4000, at least 5000, at least 7500, at least 10,000 or more unique probes.
  • the nucleic acids sequenced from the corresponding biological sample comprise a partial or complete sequencing platform adapter sequence at their termini, useful for sequencing using a sequencing platform of interest.
  • Sequencing platforms of interest include, but are not limited to, the HiSeq™, MiSeq™ and Genome Analyzer™ sequencing systems from Illumina®; the Ion PGM™ and Ion Proton™ sequencing systems from Ion Torrent™; the PACBIO RS II Sequel system from Pacific Biosciences; the SOLiD sequencing systems from Life Technologies™; the 454 GS FLX+ and GS Junior sequencing systems from Roche; the MinION™ system from Oxford Nanopore; or any other sequencing platform of interest.
  • the methods include obtaining, in electronic form, a plurality of nucleic acid sequences for genomic DNA from the biological sample from the gut of the subject.
  • the methods include determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism from the plurality of nucleic acid sequences.
  • the genomic abundance values determined for the subject comprise at least 20, at least 25, at least 50, at least 75, at least 100, at least 125, at least 150, at least 175, at least 200, at least 250, at least 300, at least 350, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 1500, at least 2000, at least 2500, at least 5,000 or at least 10,000 genome abundance values, where each genome abundance value corresponds to a different gut microorganism.
  • the genomic abundance values comprise no more than 250,000, no more than 100,000, no more than 50,000, no more than 25,000, no more than 10,000, no more than 5000, no more than 2500, no more than 1000, no more than 750, no more than 500, or fewer genome abundance values.
  • the genomic abundance values consist of from 10 to 40, from 20 to 50, from 30 to 80, from 40 to 100, from 50 to 150, from 60 to 200, from 80 to 300, from 90 to 500, from 100 to 1000, from 500 to 2,000, or from 1,000 to 5,000 genome abundance values.
  • the methods include assembling, in electronic form, a corresponding plurality of gut microorganism genomes by metagenomic de novo sequence assembly from the plurality of nucleic acid sequences, and calculating, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding value for the abundance of the genome of the respective gut microorganism based on the prevalence of respective nucleic acid sequences, in the plurality of nucleic acid sequences, used to assemble a respective gut microorganism genome in the plurality of gut microorganism genomes corresponding to the respective gut microorganism.
  • metagenomic de novo sequence assembly further comprises generating contigs based on the sequencing reads generated by a shotgun sequencing technique.
  • a shotgun sequencing technique is described, for example, in U.S. Patent No. 10,529,443, the content of which is incorporated herein by reference in its entirety.
  • the plurality of nucleic acid sequences can be assembled into full genomes of the plurality of gut microorganisms.
  • the plurality of nucleic acid sequences can be assembled into partial genomes of the plurality of gut microorganisms.
  • the methods include assigning each respective nucleic acid sequence in the plurality of nucleic acid sequences to a respective gut microorganism in the plurality of gut microorganisms, thereby generating, for each respective gut microorganism in the plurality of gut microorganisms, a corresponding count of respective nucleic acid sequences in the plurality of nucleic acid sequences assigned to the respective gut microorganism, and determining, for each respective gut microorganism in the plurality of gut microorganisms, the corresponding genomic abundance value for the respective gut microorganism based on the corresponding count of respective nucleic acid sequences assigned to the respective gut microorganism.
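The count-based determination above can be sketched as follows. The sketch assumes each read has already been assigned a label (a genome name, or any other string for unassigned reads) and that the abundance value is the read count divided by the total assigned reads; genome names are hypothetical, and normalizing by genome length is deliberately left out.

```python
from collections import Counter


def abundance_from_assignments(read_assignments, organisms):
    """Count reads per organism, then convert counts to fractions.

    Reads assigned to anything outside `organisms` are ignored.
    """
    wanted = set(organisms)
    counts = Counter(a for a in read_assignments if a in wanted)
    total = sum(counts.values())
    abundance = {
        org: (counts[org] / total if total else 0.0) for org in organisms
    }
    return counts, abundance


# Hypothetical per-read assignments produced by an upstream mapping step.
reads = ["g1", "g1", "g2", "unassigned", "g1"]
counts, abundance = abundance_from_assignments(reads, ["g1", "g2", "g3"])
print(dict(counts))
print(abundance)
```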
  • the assigning of each respective nucleic acid sequence to a respective gut microorganism includes mapping the nucleic acid sequence to a reference nucleic acid. In some embodiments, the assigning of each respective nucleic acid sequence to a respective gut microorganism includes annotating genome information based on existing databases. In some embodiments, nucleic acid sequences are analyzed, and annotations are made to define taxonomic assignments using sequence similarity methods, phylogenetic placement methods, or a combination of the two strategies.
  • Sequence similarity based methods for assigning each respective nucleic acid sequence to a respective gut microorganism include those familiar to individuals skilled in the art including, but not limited to, BLAST, BLASTx, tBLASTn, tBLASTx, RDP-classifier, DNAclust, and various implementations of these algorithms such as Qiime or Mothur. These methods rely on mapping a sequence read to a reference database and selecting the match with the best score and e-value. In some embodiments, phylogenetic methods are used in combination with sequence similarity methods to improve the calling accuracy of an annotation or taxonomic assignment.
  • example databases and resources include GTDB-Tk, the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute-European Nucleotide Archive (EBI-ENA), and the U.S. Department of Energy Integrated Microbial Genomes and Microbiomes (IMG/M) system.
  • the plurality of gut microorganisms comprises at least 20 microorganisms selected from those microorganisms in Table 1, Table 2, or Figure 13A-13XX having a connectivity of at least 2, at least 3, at least 4, at least 5, at least 10, at least 15, at least 20, at least 25, or more.
  • the plurality of gut microorganisms comprises at least 20 microorganisms selected from those microorganisms listed in Table 1, Table 2, or Figure 13A-13XX as having a connectivity of at least 2.
  • the plurality of gut microorganisms comprises at least 20 microorganisms selected from those microorganisms listed in Table 1, Table 2, or Figure 13A-13XX as having a connectivity of at least 3, at least 4, at least 5, at least 10, at least 15, at least 20, at least 25, or more.
  • the biological sample from the gut of the subject is a fecal sample.
  • the sample is a tissue biopsy, an intestinal sample, or a mucosal sample.
  • said biological sample is a sample obtained from the small or large intestine, preferably colon or rectum, more preferably obtained in the form of a fecal sample or rectal swab or in the form of a biopsy specimen of gastrointestinal mucosa.
  • the therapy is a biological therapy, an immunotherapy, a chemotherapy, a radiation therapy, a gene therapy, a hormone therapy, a photodynamic therapy, a targeted therapy, small molecules, antibodies, polynucleotide, natural compound, immune modulator, bone marrow therapy, stem cell therapy, surgery therapy, induction therapy, maintenance therapy, or a combination thereof.
  • the disorder is selected from the group consisting of type-2 diabetes, hypertension, schizophrenia, atherosclerotic cardiovascular disease (ACVD), liver cirrhosis (LC), inflammatory bowel disease (IBD), colorectal cancer (CRC), ankylosing spondylitis (AS), Parkinson’s disease (PD), rheumatoid arthritis (RA), advanced melanoma, and B cell lymphoma.
  • the disorder is, e.g., hypertension (HT), schizophrenia (SCZ), Multiple Sclerosis (MS), Gaucher disease type II (GDII), COVID-19 (COV), Behcet's disease (BD), autism spectrum disorder (ASD), or pancreatic cancer (PC).
  • the disorder is cancer, Alzheimer's disease, a cardiovascular disease, an autoimmune disease, a mental health disease, an infectious disease, or a genetic disorder.
  • the disorder is categorized by any indicator of a biological state, function, structure, process, response, or condition in a patient. Such indicators include any of the numerous variables (parameters) that are commonly measured in medicine to evaluate a patient for purposes such as diagnosis, prognosis, and/or treatment.
  • indicators of interest herein are those whose values (which may be quantitative or qualitative) reflect, characterize, or are related to the function or structure of organs and organ systems and/or whose values reflect, characterize, or are related to the presence or severity of conditions.
  • the disease is categorized by the progression or prognosis of a disease or disorder, e.g., different stages of cancer, type, frequency, or degree of severity of the conditions that can be objectively measured or experienced by a subject.
  • the disorder may be acquired by a medical device, which may be used to analyze the status of a body part, wound, or lesion, or of a physical substrate such as a test strip, dipstick, filter, or other substrate that provides means for detecting or measuring the presence of a substance in a biological sample obtained from a subject.
  • pathogens e.g., viruses, bacteria, fungi
  • abnormal tissues e.g., tumor site
  • biomarkers e.g., biomarkers
  • the disorder is cancer.
  • the disorder is inflammatory bowel disease and the therapy comprises anti-cytokine or anti-integrin treatment.
  • a random forest classifier was built based on the abundance of the 284 HQMAGs in CC-TCG to predict therapeutic responses.
  • the models predicted therapeutic response with ROC AUC values of from 0.64 to 0.69.
  • the model predicted therapeutic response with an ROC AUC value of 0.69.
  • the methods include inputting the plurality of genomic abundance values into a model comprising a plurality of parameters.
  • the model applies the plurality of parameters to the plurality of genomic abundance values through, e.g., at least 10,000 computations, to generate as output from the model a prediction of the subject’s response to the therapy.
  • the model is trained against datasets collected across a plurality of therapies for disorders and the model is trained to distinguish between a responsive state and a non-responsive state for the therapy.
  • the model comprises a learning statistical classifier system.
  • the learning statistical classifier system is a random forest, a classification and regression tree, a boosted tree, or a neural network. For example, as described in Example 3, a random forest classifier was trained against datasets from 11 different studies collectively looking at microbiomes in 4 different disorders.
  • the resulting model was powered to predict responder or non-responder to anti-cytokine or anti-integrin therapy, methotrexate treatment in new-onset rheumatoid arthritis, immune checkpoint inhibitor (ICI) treatment of advanced melanoma, and CD19-CAR-T immunotherapy of B cell lymphoma.
  • the indication of the subject’s response is characterized by clinical outcome measures that include, but are not limited to, complete remission, partial remission, non-remission, survival, development of adverse events, or any combination thereof.
  • a responder has complete remission in response to the treatment, and a non-responder has non-remission or partial remission in response to the treatment.
  • patients were subjected to routine clinical examinations, laboratory analyses, and computed tomography. Tumor responses were evaluated using RECIST criteria.
  • complete response was defined as complete radiographic disappearance of measurable or evaluable disease or stable, minimal radiographic findings; partial response was defined as reduction of the longest dimension of measurable disease by at least 50%; stable disease was defined as reduction of the longest dimension by less than 25%; progressive disease was defined as growth of the tumor by more than 25% in the longest dimension or development of new lesions.
  • overall response rate was defined as the sum of the complete and partial response rates and the tumor control rate was defined as the sum of overall response rates with stable disease rates.
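The two rates defined above can be sketched directly from per-patient response categories. The labels CR, PR, SD, and PD (complete response, partial response, stable disease, progressive disease) and the function name are illustrative assumptions; the cohort shown is hypothetical.

```python
def response_rates(best_responses):
    """Return (overall response rate, tumor control rate) for a cohort.

    Overall response rate = (CR + PR) / N.
    Tumor control rate    = (CR + PR + SD) / N.
    """
    n = len(best_responses)
    if n == 0:
        raise ValueError("cohort must not be empty")
    cr = best_responses.count("CR")
    pr = best_responses.count("PR")
    sd = best_responses.count("SD")
    overall_response_rate = (cr + pr) / n
    tumor_control_rate = (cr + pr + sd) / n
    return overall_response_rate, tumor_control_rate


# Hypothetical cohort of four patients.
print(response_rates(["CR", "PR", "SD", "PD"]))
```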
  • the indication of the subject’s response is characterized by the actual treatment efficacy of a therapy, including progression-free survival (PFS), the duration of the progression-free survival under treatment, overall survival (OS), response to therapy (RT), overall response rate (ORR), sustained clinical effect (DCB), Disease Activity Score, or any combination thereof, or any other methods for evaluating the progression or prognosis of a disease or disorder known in the art.
  • “progression free survival” (PFS) has its art-understood meaning relating to the length of time during and after the treatment of a disease, such as cancer, that a patient lives with the disease but it does not get worse.
  • measuring the progression-free survival is utilized as an assessment of how well a new treatment works.
  • PFS is determined in a randomized clinical trial; in some such embodiments, PFS refers to time from randomization until objective tumor progression and/or death.
  • ORR may be defined as the proportion of patients in whom partial (PR) or complete (CR) responses are identified as a best overall response (BOR) according to some metric, such as Response Evaluation Criteria in Solid Tumors (RECIST 1.1). Stable disease (SD) was categorized as non-response together with progressive disease (PD).
  • ORR has its art-understood meaning referring to the proportion of patients with tumor size reduction of a predefined amount and for a minimum time period. In some embodiments, response duration is usually measured from the time of initial response until documented tumor progression. In some embodiments, ORR involves the sum of partial responses plus complete responses. [00236] In some embodiments, "clinical effect" refers to a clinical benefit.
  • such a clinical benefit is or comprises reduction in tumor size, increase in progression free survival, increase in overall survival, decrease in overall tumor burden, decrease in the symptoms caused by tumor growth such as pain, organ failure, bleeding, damage to the skeletal system, and other related sequelae of metastatic cancer and combinations thereof.
  • the clinical effect is a “sustained clinical effect” (DCB) that is maintained for a relevant period of time.
  • the relevant period of time is at least 1 month, 2 months, 3 months, 4 months, 5 months, 6 months, 7 months, 8 months, 9 months, 10 months, 11 months, 1 year, 2 years, 3 years, 4 years, 5 years, or longer.
  • the subject’s response is measured by Disease Activity Score (DAS) (see, e.g., Van der Heijde D. M. et al., J Rheumatol, 1993, 20(3): 579-81; Prevoo M. L. et al, Arthritis Rheum, 1995, 38: 44-8).
  • the DAS system represents both current state of disease activity and change.
  • the DAS scoring system uses a weighted mathematical formula, derived from clinical trials in RA.
  • the DAS28 is calculated as 0.56 × sqrt(T28) + 0.28 × sqrt(SW28) + 0.70 × ln(ESR) + 0.014 × GH, wherein T28 is the tender joint count and SW28 is the swollen joint count (each assessed over 28 joints), ESR is the erythrocyte sedimentation rate, and GH is global health.
  • Various values of the DAS represent high or low disease activity as well as remission, and the change and endpoint score result in a categorization of the patient by degree of response (none, moderate, good).
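As a worked illustration of the DAS28 score, the sketch below uses the commonly published DAS28-ESR form, in which the tender and swollen joint counts enter through their square roots and ESR through its natural logarithm. The function name, argument names, and input values are assumptions for illustration.

```python
import math


def das28_esr(tender28, swollen28, esr, global_health):
    """DAS28-ESR from 28-joint counts, ESR, and a global health score.

    tender28, swollen28: joint counts from 0 to 28.
    esr: erythrocyte sedimentation rate, must be positive.
    global_health: patient global health assessment.
    """
    if esr <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    return (
        0.56 * math.sqrt(tender28)
        + 0.28 * math.sqrt(swollen28)
        + 0.70 * math.log(esr)
        + 0.014 * global_health
    )


# Hypothetical patient: 4 tender joints, 4 swollen joints, ESR 20, GH 50.
print(round(das28_esr(4, 4, 20, 50), 2))
```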
  • the indication of the subject’s response is measured by the level of the immune response or immune parameters of a cancer-bearing patient resulting from an immunotherapy.
  • the immune response or immune parameters are characterized by expression level of various biological markers of the host immune response in conjunction with the occurrence of a cancer at a given stage of cancer development (i.e. treatment efficacy).
  • the expression level of a biological marker is compared with a reference value for the same biological marker, and when required with reference values.
  • the reference value for the same biological marker is thus predetermined and is already known to be indicative of a reference value that is pertinent for discriminating between a low level and a high level of the immune response of a patient with cancer, for said biological marker.
  • Said predetermined reference value for said biological marker is correlated with a responder to treatment in a cancer patient, or conversely is correlated with non-responder to treatment in a cancer patient.
  • a change of a combination of biological markers is quantified.
  • a combination of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or more distinct biological markers are quantified.
  • biological markers are quantified with immunohistochemical techniques.
  • Example biological markers include 18s, ACE, ACTB, AGTR1, AGTR2, APC, APOA1, ARF1, AXIN1, BAX, BCL2, BCL2L1, CXCR5, BMP2, BRCA1, BTLA, C3, CASP3, CASP9, CCL1, CCL11, CCL13, CCL16, CCL17, CCL18, CCL19, CCL2, CCL20, CCL21, CCL22, CCL23, CCL24, CCL25, CCL26, CCL27, CCL28, CCL3, CCL5, CCL7, CCL8, CCNB1, CCND1, CCNE1, CCR1, CCR10, CCR2, CCR3, CCR4, CCR5, CCR6, CCR7, CCR8, CCR9, CCRL2, CD154, CD19, CD1a, CD2, CD226, CD244, PDCD1LG1, CD28, CD34, CD36, CD38, CD3E, CD3G, CD3Z, CD4, CD40LG, CD5, CD
  • the prediction of the subject’s response is a class output of a respective response, in a plurality of possible responses, of the respective subject.
  • the method allows the setting of a single "cut-off" value permitting discrimination between responder or non-responder to a treatment.
  • the prediction of the respective subject’s response includes a prediction of an objective response rate of the human subject to the treatment or therapy, and wherein the prediction of the objective response rate includes an indication or classification of a complete response or an amount of a partial response to the treatment.
  • the prediction of the subject’s response is a probability output for the respective subject’s response.
  • the methods comprise utilizing the model to calculate a probability value for a subject; comparing the probability value to a threshold value derived from a cohort of responders/non-responders to determine whether the probability value is above or below the threshold value; and classifying the subject as a responder or non-responder according to whether the probability value is above or below the threshold.
  • the threshold value may be about a probability value of at least 50%, 55%, 60%, 65%, 70%, 75% or about 80% or more.
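The cut-off comparison described above can be sketched in a few lines. This is a minimal illustration rather than the claimed method: the function name `classify_response`, the subject identifiers, the probability values, and the convention that a probability at or above the cut-off denotes a responder are all assumptions made for the example.

```python
def classify_response(probability: float, threshold: float = 0.5) -> str:
    """Label a subject from a model probability and a cohort-derived cut-off."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Assumption: a probability at or above the cut-off counts as a responder.
    return "responder" if probability >= threshold else "non-responder"


# Hypothetical model outputs for three subjects.
predictions = {"S1": 0.82, "S2": 0.41, "S3": 0.65}
labels = {sid: classify_response(p, threshold=0.65) for sid, p in predictions.items()}
print(labels)
```

In practice the cut-off would be chosen on a cohort with known outcomes; the value 0.65 here is only one of the listed example thresholds.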
  • the probability value is a positive predictive value as measured by area under the curve (AUC) of receiver operating characteristic (ROC) curves.
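For readers unfamiliar with the metric, the area under the ROC curve can be computed as the fraction of responder/non-responder pairs in which the responder receives the higher score (ties count one half). The sketch below assumes binary labels (1 = responder, 0 = non-responder) and uses invented scores; the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative data only: 1 = responder, 0 = non-responder.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.7, 0.6, 0.3, 0.5, 0.2]
auc = roc_auc(y_true, y_score)
print(round(auc, 3))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for cohort-sized data and keeps the definition explicit.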
  • the probability value is calculated using a multivariate logistic regression model, a neural network model, a random forest model or a decision tree model.
  • the model is a neural network algorithm, a support vector machine algorithm, a Naive Bayes algorithm, a nearest neighbor algorithm, a boosted trees algorithm, a Random Forest algorithm, a convolutional neural network algorithm, a decision tree algorithm, a regression algorithm, or a clustering algorithm.
  • the plurality of parameters is at least 1000, at least 10,000, at least 15,000, at least 50,000, at least 100,000, at least 250,000, at least 500,000, at least 1,000,000 parameters, at least 2,500,000 parameters, at least 5,000,000 parameters, at least 10,000,000 parameters, or more.
  • the model applies the plurality of parameters to the information through at least 1000 computations, at least 5000 computations, at least 10,000 computations, at least 25,000 computations, at least 50,000 computations, at least 100,000 computations, at least 250,000 computations, at least 500,000 computations, at least 1,000,000 computations, at least 2,500,000 computations, at least 5,000,000 computations, at least 10,000,000 computations, or more to obtain a corresponding output for the respective subject from the model.
  • the method further comprises treating the subject by: i) when the prediction of the subject’s response to the therapy satisfies a threshold likelihood that the subject will respond favorably to the therapy, administering the therapy to the subject; ii) when the prediction of the subject’s response to the therapy does not satisfy the threshold likelihood that the subject will respond favorably to the therapy, administering one or more of the plurality of gut microorganisms to the subject.
  • the administering comprises identifying one or more of the plurality of gut microorganisms that is underrepresented in the subject, e.g., as determined based on the corresponding genomic abundance value for the microorganism, and administering the identified one or more gut microorganism to the subject.
  • the identifying includes determining whether the abundance of a gut microorganism, e.g., as determined based on the corresponding genomic abundance value for the microorganism, satisfies a corresponding threshold amount. When the abundance of the microorganism does not satisfy the corresponding threshold amount, that microorganism is identified for administration.
  • the corresponding threshold amount is a relative abundance.
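One way to picture the identification step is as a per-microorganism comparison of relative abundance against a minimum. The following sketch assumes abundance values are supplied as a mapping from genome name to abundance and that each threshold is a minimum relative abundance; the names `underrepresented`, `MAG_A`, and so on, and all numbers, are hypothetical.

```python
def underrepresented(abundance: dict, thresholds: dict) -> list:
    """Return microorganisms whose relative abundance falls below its threshold."""
    total = sum(abundance.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    flagged = []
    for name, minimum in thresholds.items():
        # A genome absent from the sample is treated as zero abundance.
        relative = abundance.get(name, 0.0) / total
        if relative < minimum:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical genomic abundance values for one subject.
abundance = {"MAG_A": 50.0, "MAG_B": 5.0, "MAG_C": 945.0}
# Hypothetical minimum relative abundances for the microorganisms of interest.
thresholds = {"MAG_A": 0.10, "MAG_B": 0.001, "MAG_D": 0.001}
candidates = underrepresented(abundance, thresholds)
print(candidates)
```

A threshold expressed relative to other specific microorganisms, rather than to the total, would only change the denominator used in the comparison.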
  • the corresponding threshold amount is an amount relative to the abundance of one or more different gut microorganisms in the subject. In some embodiments, the corresponding threshold amount is an amount relative to the total abundance of the plurality of gut microorganisms in the subject. [00248] In some embodiments, the administering comprises administering a predefined set of microorganisms. In some embodiments, the predefined set of microorganisms includes at least 5 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX.
  • the predefined set of microorganisms includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 175, 180, 190, 200, 225, 250, 275, 300, 350, 400, 450, 500, 600, 700, or more gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX.
  • the predefined set of microorganisms only includes gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 1. That is, the predefined set of microorganisms does not include microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 2. In some embodiments, the predefined set of microorganisms includes at least 5 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 1.
  • the predefined set of microorganisms includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 175, 180, 190, 200, 225, 250, 275, 300, 350, 400, or more gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 1.
  • the predefined set of microorganisms only includes gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 2.
  • the predefined set of microorganisms does not include microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 1.
  • the predefined set of microorganisms includes at least 5 gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 2.
  • the predefined set of microorganisms includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120, 130, 140, 150, 160, 170, 175, 180, 190, 200, 225, 250, 275, 300, 350, 400, or more gut microorganisms selected from Table 1, Table 2, or Figures 13A-13XX that are assigned to Guild 2.
  • the method further comprises administering the therapy to the subject.
  • the therapy is administered to the subject around the same time as the one or more of the plurality of gut microorganisms are administered.
  • the therapy is administered to the subject after the one or more of the plurality of gut microorganisms are administered.
  • the therapy is administered to the subject at least 1 day, at least 2 days, at least 3 days, at least 4 days, at least 5 days, at least 6 days, at least 7 days, at least 1 week, at least 2 weeks, at least 3 weeks, at least 4 weeks, at least 5 weeks, at least 6 weeks, at least 7 weeks, at least 8 weeks, or more after the one or more of the plurality of gut microorganisms are administered.
  • the therapy is administered to the subject no more than 3 months, no more than 2 months, no more than one month, no more than 4 weeks, no more than 3 weeks, no more than 2 weeks, no more than 1 week, no more than 6 days, no more than 5 days, no more than 4 days, no more than 3 days, or no more than 2 days after the one or more of the plurality of gut microorganisms are administered.
  • the therapy is administered to the subject from 1 day to 2 months, from 1 day to 1 month, from 1 day to 3 weeks, from 1 day to 2 weeks, from 1 day to 1 week, from 1 day to 3 days, from 2 days to 2 months, from 2 days to 1 month, from 2 days to 3 weeks, from 2 days to 2 weeks, from 2 days to 1 week, from 2 days to 3 days, from 3 days to 2 months, from 3 days to 1 month, from 3 days to 3 weeks, from 3 days to 2 weeks, from 3 days to 1 week, from 1 week to 2 months, from 1 week to 1 month, from 1 week to 3 weeks, or from 1 week to 2 weeks after the one or more of the plurality of gut microorganisms are administered.
  • a clinician may treat that subject differently to a subject classified as a predicted responder. Classifying the subject as a predicted non-responder or as a predicted responder may allow the adoption of a particular, or an alternative, treatment regime more suited to the patient.
  • a treatment regime or a therapy can be administered via any common route so long as the target tissue or cell is available via that route.
  • one or more of the pluralities of gut microorganisms is administered to a non-responder via a route including, but not limited to, oral administration or colonoscopy.
  • a gut microorganism therapeutic composition for use as described herein can be prepared and administered using methods known in the art.
  • compositions are formulated for oral, colonoscopic, or nasogastric delivery although any appropriate method can be used.
  • a non-responder receives fecal microbiota transplantation from a responder population through methods as disclosed in e.g., US 20230109343, US20200147151, or US 2021036172.
  • a non-responder receives an effective amount of a pre-selected isolated population of gut microorganisms from the fecal matter of a responder.
  • a non-responder receives an effective amount of pre-selected isolated population of gut microorganisms from Table 1, Table 2 or Figure 13A-13XX.
  • the one or more of the pluralities of gut microorganisms administered to a non-responder comprise a therapeutically effective or sufficient amount of at least 1, at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95, at least 100, at least 125, at least 150, at least 200, at least 250, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, or all of the isolated or purified populations of gut microorganisms selected from Table 1, Table 2 or Figures 13A-13XX.
  • the one or more of the pluralities of gut microorganisms administered to a non-responder comprise at least about 1 × 10^3 viable colony forming units (CFU) of bacteria or at least about 1 × 10^4, 1 × 10^5, 1 × 10^6, 1 × 10^7, 1 × 10^8, 1 × 10^9, 1 × 10^10, 1 × 10^11, 1 × 10^12, 1 × 10^13, 1 × 10^14, 1 × 10^15 viable CFU (or any derivable range therein).
  • a single dose will contain an amount of gut microorganisms (such as a specific bacteria or species, genus, or family described herein) of at least, at most, or exactly 1 × 10^4, 1 × 10^5, 1 × 10^6, 1 × 10^7, 1 × 10^8, 1 × 10^9, 1 × 10^10, 1 × 10^11, 1 × 10^12, 1 × 10^13, 1 × 10^14, 1 × 10^15 or greater than 1 × 10^15 viable CFU (or any derivable range therein) of a specified bacteria.
  • a single dose will contain at least, at most, or exactly 1 × 10^4, 1 × 10^5, 1 × 10^6, 1 × 10^7, 1 × 10^8, 1 × 10^9, 1 × 10^10, 1 × 10^11, 1 × 10^12, 1 × 10^13, 1 × 10^14, 1 × 10^15 or greater than 1 × 10^15 viable CFU (or any derivable range therein) of total gut microorganisms.
  • the pluralities of gut microorganisms are administered concomitantly or sequentially with one or more therapies to a disease or a disorder.
  • the pluralities of gut microorganisms are administered more than once. In certain aspects, the composition is administered daily, weekly, or monthly. In some embodiments, the pluralities of gut microorganisms are administered for two, three, or four months to induce and/or maintain an appropriate microbiome in the non-responder’s GI tract.
  • HbA1c was significantly increased from M3 but remained lower than that at M0 (FIG.4E).
  • the proportion of patients who achieved adequate glycemic control (HbA1c < 7%) was significantly higher in the W group (61.6% versus 33.3% in the U group) at M3, but showed no difference between the two groups at M15 (FIG.4F).
  • the levels of fasting blood glucose and postprandial glucose followed a similar trend to HbA1c (FIG.4G, H).
  • A co-abundance network is a data-driven way to investigate ecological interactions between microbes across habitats.
  • a total of 477 HQMAGs were selected for network construction because they were detectable in more than 75% of the samples at each time point in the W group. These 477 HQMAGs also accounted for ⁇ 60% of the total abundance of the 1,845 HQMAGs.
  • the three networks were of similar order S, i.e., the total number of nodes (HQMAGs): S_M0 (442), S_M3 (421), and S_M15 (429), but they varied considerably in their size L, i.e., the total number of edges (correlations): L_M0 (4,231), L_M3 (2,587), and L_M15 (4,592).
  • L in G_M3 decreased to 61.14% of that in G_M0 and rebounded in G_M15 to 108.53% of that in G_M0.
  • Connectance decreased from 0.043 in G_M0 to 0.029 in G_M3 and rebounded to 0.050 in G_M15.
  • Changes in L and connectance showed that the high fiber intervention dramatically reduced the correlations among the prevalent genomes in the network.
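The reported figures can be checked directly from the node and edge counts. The sketch below assumes connectance is defined as the number of realized edges divided by the number of possible undirected edges, S(S-1)/2; under that assumption the counts above give about 0.043 for M0, 0.029 for M3, and 0.050 for M15. The function name `connectance` and the dictionary layout are choices made for this example.

```python
def connectance(nodes: int, edges: int) -> float:
    """Realized edges divided by possible undirected edges, S(S-1)/2."""
    if nodes < 2:
        raise ValueError("need at least two nodes")
    return 2 * edges / (nodes * (nodes - 1))


# (order S, size L) for each network, taken from the counts reported above.
networks = {"M0": (442, 4231), "M3": (421, 2587), "M15": (429, 4592)}

baseline_edges = networks["M0"][1]
for label, (s, l) in networks.items():
    share = 100 * l / baseline_edges  # edges as a percentage of the M0 network
    print(f"{label}: connectance={connectance(s, l):.3f}, L vs M0={share:.2f}%")
```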
  • the distributions of degree, i.e., the number of edges a node has, fit well with a power-law model (R^2 values: G_M0, 0.79; G_M3, 0.82; G_M15, 0.79), indicating the presence of network hubs (ref. 21).
  • C1A and C1B can be considered guilds, as the HQMAGs in each cluster were highly interconnected with only positive correlations, whether robust or transient (FIG.5B).
  • the two guilds were connected by negative edges only, indicating a competitive relationship that structures a seesaw-like network.
  • Such a network feature was termed two competing guilds (TCG).
  • the members of the TCG had significantly higher degree, betweenness centrality, eigenvector centrality, closeness centrality and stress centrality than the rest of the genomes in the networks.
  • This finding indicates that the two guilds exerted a relatively large amount of control over the interaction of other nodes (reflected by betweenness centrality and eigenvector centrality) and the information flow in the network (reflected by closeness centrality and stress centrality). Removing the two guilds would lead to the collapse of the networks since on average 86.08% of the total edges would have been lost.
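The edge-loss statistic can be illustrated on a toy network: an edge is lost when at least one of its endpoints is removed. The sketch below uses an invented six-node graph; the function name `fraction_edges_lost` and the node labels are hypothetical, and the 86.08% figure in the text comes from the study's networks, not from this example.

```python
def fraction_edges_lost(edges, removed_nodes):
    """Share of edges that disappear when the given nodes are deleted."""
    if not edges:
        raise ValueError("network has no edges")
    removed = set(removed_nodes)
    # An edge is lost if either endpoint belongs to the removed set.
    lost = sum(1 for u, v in edges if u in removed or v in removed)
    return lost / len(edges)


# Toy undirected network; nodes a, b, c stand in for guild members.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
lost = fraction_edges_lost(toy_edges, {"a", "b", "c"})
print(f"{100 * lost:.2f}% of edges lost")
```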
  • CAZy carbohydrate-active enzyme
  • SCFA short-chain fatty acid
  • liver cirrhosis (LC), ankylosing spondylitis (AS), atherosclerotic cardiovascular disease (ACVD), schizophrenia (SCZ), colorectal cancer (CRC), and inflammatory bowel disease (IBD).
  • CCDC-I Case-Control Dataset Collection I
  • WTP diet high-fiber diet
  • U group the usual care
  • Total caloric and macronutrients prescriptions were based on age-specific Chinese Dietary Reference Intakes (Chinese Nutrition Society, 2013).
  • the WTP diet based on wholegrains, traditional Chinese medicinal foods and prebiotics, included three ready-to-consume pre-prepared foods.
  • the usual care included standard dietary and exercise advice that was made according to the Chinese Diabetes Society guidelines for T2DM.
  • Patients in W group were provided with the WTP diet to perform a self-administered intervention at home for three months, while patients in U group accepted the usual care.
  • W group stopped WTP diet intervention at the end of the third month (at M3). Then W and U continued a one-year follow-up (M15).
  • a meal-based food frequency questionnaire and 24-h dietary recall were used to calculate nutrient intake based on the China Food Composition 2009. Patients in both groups continued with their antidiabetic medications according to their physicians' prescriptions.
  • feces, urine, and serum samples were stored on dry ice immediately, then transported to the lab and frozen at -80°C. Subsequently, anthropometric markers and diabetic complication indexes were measured. The Ewing test and 24-h dynamic electrocardiogram were conducted to estimate diabetic autonomic neuropathy (DAN). B-mode carotid ultrasound was conducted to estimate atherosclerosis. The Michigan Neuropathy Screening Instrument was used to estimate diabetic peripheral neuropathy (DPN). In addition, a meal-based food frequency questionnaire and the 24-h dietary review were recorded for nutrient intake calculation.
  • the fasting venous blood was used to measure HbA1c, fasting blood glucose, fasting insulin, fasting C-Peptide, C-reactive protein (CRP), blood routine examination, blood biochemical examination and five analytes of thyroid.
  • the venous blood samples at 30, 60, 120, and 180 min of MTT were used to measure the postprandial blood glucose, insulin, and C-Peptide.
  • the fasting early morning urine was used to measure the routine urine examination and urinary microalbumin creatinine ratio. The measurements above were completed at Qidong People’s Hospital.
  • HOMA-IR insulin resistance
  • HOMA-β islet β-cell function
  • Gut microbiome analysis [00289] Metagenomic sequencing. DNA was extracted from fecal samples using the methods as previously described. Metagenomic sequencing was performed using Illumina Hiseq 3000 at GENEWIZ Co. (Beijing, China). Cluster generation, template hybridization, isothermal amplification, linearization, and blocking denaturing and hybridization of the sequencing primers were performed according to the workflow specified by the service provider.
  • the assembled high-quality draft genomes were further dereplicated by using dRep.
  • DiTASiC, which applies kallisto for pseudo-alignment and a generalized linear model for resolving shared reads among genomes, was used to calculate the abundance of the genomes in each sample. Estimated counts with P-value > 0.05 were removed, and all samples were downsized to 36 million reads (one sample with a read mapping ratio < 25%, which could not be well represented by the high-quality genomes, was removed from downstream analysis). Taxonomic assignment of the genomes was performed using GTDB-Tk with default parameters. [00292] Gut microbiome network construction and analysis.
  • the layout algorithm sets the position of the nodes to minimize the sum of forces in the network.
  • We defined robust stable edges as the unchanged positive/negative correlations between the same two genomes across all 3 networks at M0, M3, and M15.
  • Nodes in the stable network were clustered by Connected Components Clustering analysis in Cytoscape.
  • To identify whether sub-clusters existed in Cluster C1, a clustering tree was built from the negative (set as -1) and positive (set as 1) correlations using the average linkage method, followed by WGCNA analysis.
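The notion of a robust stable edge, and the grouping of nodes by connected components, can be sketched without any network library. This is an illustration under stated assumptions, not the study's pipeline: each network is represented as a mapping from an unordered genome pair to the sign of its correlation (+1 or -1), the clustering tool is replaced by a plain depth-first search, and all names (`stable_edges`, `connected_components`, `edge`, genomes A to F) are hypothetical.

```python
from collections import defaultdict


def edge(u, v):
    """Unordered genome pair used as a dictionary key."""
    return frozenset((u, v))


def stable_edges(networks):
    """Keep pairs present in every network with the same correlation sign."""
    first, *rest = networks
    return {pair: sign for pair, sign in first.items()
            if all(net.get(pair) == sign for net in rest)}


def connected_components(edges):
    """Group nodes linked by any stable edge (depth-first search)."""
    adjacency = defaultdict(set)
    for pair in edges:
        u, v = tuple(pair)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy co-abundance networks at three time points (sign of each correlation).
m0 = {edge("A", "B"): 1, edge("B", "C"): -1, edge("D", "E"): 1, edge("A", "C"): 1}
m3 = {edge("A", "B"): 1, edge("B", "C"): -1, edge("D", "E"): -1}
m15 = {edge("A", "B"): 1, edge("B", "C"): -1, edge("D", "E"): 1, edge("C", "F"): 1}

stable = stable_edges([m0, m3, m15])
clusters = connected_components(stable)
print(stable, clusters)
```

In this toy case the D-E pair changes sign at the second time point and the A-C pair is missing from two networks, so only A-B and B-C survive as stable edges and form a single cluster.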
  • a random forest classification model to classify case and control was constructed based on the estimated abundances of the HQMAGs in each dataset with leave-one-out cross-validation.
  • High-quality reads were de novo assembled for each sample using MEGAHIT (--min-contig-len 500, --presets meta-large). The assembled contigs were binned using MetaBAT 2 and MaxBin 2. The quality of the bins was assessed using CheckM. Bins with completeness > 95%, contamination < 5%, and strain heterogeneity < 5% were retained as HQMAGs, which were further dereplicated using dRep.
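The bin-retention rule amounts to a three-way filter on quality statistics. The sketch below assumes the statistics have already been parsed from a quality report into plain numbers expressed in percent, and that the contamination and strain-heterogeneity cut-offs are upper bounds; the class `BinQuality`, the function `is_high_quality`, and the bin names and values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinQuality:
    name: str
    completeness: float          # percent
    contamination: float         # percent
    strain_heterogeneity: float  # percent


def is_high_quality(b: BinQuality) -> bool:
    """Apply the retention thresholds: >95% complete, <5% contamination and heterogeneity."""
    return (b.completeness > 95.0
            and b.contamination < 5.0
            and b.strain_heterogeneity < 5.0)


# Hypothetical bins with made-up quality statistics.
bins = [
    BinQuality("bin.1", 98.2, 1.1, 0.0),
    BinQuality("bin.2", 96.0, 6.3, 0.0),   # too contaminated
    BinQuality("bin.3", 90.0, 0.5, 0.0),   # not complete enough
]
hqmags = [b.name for b in bins if is_high_quality(b)]
print(hqmags)
```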
  • TDC Treatment Dataset Collection
  • KEGG Orthologue (KO) IDs were assigned to the predicted protein sequences in each HQMAG by HMMSEARCH against KOfam using KofamKOALA with default parameters. KOs were further assigned to KEGG modules. Antibiotic resistance genes were predicted using ResFinder with default parameters. The identification of virulence factors was based on the core set of Virulence Factors of Pathogenic Bacteria Database (VFDB). The predicted protein sequences were aligned to the reference sequences in VFDB using BLASTP (best hit with E-value < 1e-5, identity > 80%, and query coverage > 70%).
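The best-hit rule for the alignment step can be sketched as a filter followed by a per-query selection. The example assumes alignment results have already been parsed into dictionaries, that the E-value cut-off is an upper bound, and that "best" means the highest bit score among hits passing all three cut-offs; the field names, the function `best_hits`, and the identifiers `p1`, `VF001`, and so on are hypothetical.

```python
def best_hits(hits, max_evalue=1e-5, min_identity=80.0, min_coverage=70.0):
    """Return, per query, the subject of the top-scoring hit that passes all cut-offs."""
    best = {}
    for hit in hits:
        passes = (hit["evalue"] < max_evalue
                  and hit["identity"] > min_identity
                  and hit["coverage"] > min_coverage)
        if not passes:
            continue
        current = best.get(hit["query"])
        # Assumption: the best hit is the one with the highest bit score.
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return {query: hit["subject"] for query, hit in best.items()}


# Made-up alignment records for two predicted proteins.
hits = [
    {"query": "p1", "subject": "VF001", "evalue": 1e-50, "identity": 92.0,
     "coverage": 88.0, "bitscore": 310.0},
    {"query": "p1", "subject": "VF002", "evalue": 1e-30, "identity": 85.0,
     "coverage": 75.0, "bitscore": 190.0},
    {"query": "p2", "subject": "VF003", "evalue": 1e-8, "identity": 60.0,
     "coverage": 95.0, "bitscore": 80.0},   # fails the identity cut-off
]
assignments = best_hits(hits)
print(assignments)
```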
  • CAZys carbohydrate-active enzymes
  • Mann-Whitney test (two-sided) was used for comparisons between W and U at the same time point. Pearson Chi-square test was performed to compare the differences of categorical data between groups or timepoints. PERMANOVA test (9,999 permutations) was used to compare gut microbiota structure between groups. [00299] Mann-Whitney test (two-sided) and Fisher’s exact test (two-sided) were used to compare the target functions between Guild 1 and Guild 2. Hierarchical clustering analysis based on Jaccard distance on the KO profiles was conducted to compare HQMAGs in CC-TCG.
  • The Random Forest with 10-fold cross-validation was used to perform classification analysis based on HQMAGs in CC-TCG for the universal model.
  • Example 2: The combined core genomes from all the two competing guilds (CC-TCG) show better performance in classifying case vs. control across diseases
  • Whereas the QD-TCG was identified by unveiling stable genome pairs despite dietary interventions, here we focused our research on pinpointing stable genome pairs across both cases and controls in cross-sectional datasets, with disease progression acting as a perturbation to the gut ecosystem.
  • Each Cluster C1 was subsequently segmented into two sub-clusters, C1A and C1B .
  • every pair of C1A and C1B followed the pattern of two competing guilds that we observed in the QD-TCG.
  • the majority of stable correlations between the HQMAGs belonging to C1A and C1B exhibited positive correlations within each cluster and negative correlations between the two clusters. These correlations accounted for 76.88%, 95.31%, 100%, 96.23%, 96.43%, 96.72%, and 97.10% of the stable correlations within or between the C1A and C1B in the studies on T2D, LC, AS, ACVD, SCZ, CRC and IBD respectively.
  • C-TCG combined genomes of the two competing guilds
  • 701 HQMAGs were unique to one of the 8 sets and 87 shared across multiple sets.
  • 301 belonged to C1A and 400 to C1B.
  • 10 and 40 consistently belonged to C1A and C1B respectively, while 37 exhibited inconsistent assignments across different TCGs (FIG.8A).
  • C-TCG accounted for 84.54% of total abundance on average.
  • a Random Forest classifier was trained on each CCDC-I dataset using the abundance of HQMAGs in C-TCG.
  • C-TCG demonstrated superior case-control classification capacity in CCDC-I compared to individual TCGs, with significantly higher AUC values than classifiers trained on TCGs from the T2D, AS, IBD, SCZ, and LC studies.
  • Random Forest classifier built on CC-TCG demonstrated superior performance in classifying cases and controls compared to both C-TCG and individual TCGs from the QD trial and CCDC-I, with significantly higher AUC values than classifiers trained on TCGs from the CRC, T2D, AS, IBD, SCZ, and LC studies.
  • HQMAGs belonging to CC-TCG. We first performed targeted functional analysis and compared HQMAGs assigned to C1A and C1B. Similar to the QD-TCG findings, C1A had a significantly higher gene copy number for butyrate biosynthesis and a lower one for propionate production.
  • HQMAGs in C1A were rich in CAZy genes for arabinoxylan and cellulose utilization. These findings suggest that, compared to C1B, HQMAGs from C1A have a higher genetic capacity for utilizing complex plant polysaccharides and producing butyrate. From the perspective of antibiotic resistance and pathogenicity, C1A had fewer ARGs and VFs than C1B. [00311] Furthermore, we conducted an untargeted functional analysis based on the assignment of KEGG Orthology (KO) to all predicted genes from the 284 core HQMAGs. In total, we found 3,553 and 5,495 KOs in C1A and C1B respectively.
  • CCDC-II Case-Control Dataset Collection II
  • the CC-TCG showed moderate to excellent diagnostic power in 10 of the 15 datasets, specifically those related to AS, ASD, COVID-19, CRC, GD, HT, MS, and PC, although it only achieved an AUC value of 0.58 for HT#2, and AUC values between 0.6-0.7 for BD, PD, CRC#4 and CRC#5 datasets (FIG.8B).
  • DiTASiC, which applies kallisto for pseudo-alignment and a generalized linear model for resolving shared reads among genomes, was used to calculate the abundance of the genomes in each sample. Estimated counts with P-value > 0.05 were removed, and all samples were downsized to 36 million reads (one sample with a read mapping ratio < 25%, which could not be well represented by the high-quality genomes, was removed from downstream analysis). Taxonomic assignment of the genomes was performed using GTDB-Tk with default parameters. [00320] Gut microbiome network construction and analysis. In the W group, prevalent genomes shared by more than 75% of the samples at every timepoint were used to construct the co-abundance network at each timepoint.
  • The group information, i.e. case or control, was collected from the corresponding papers or from curatedMetagenomicData. Quality control of raw reads was conducted by KneadData (https://bitbucket.org/biobakery/kneaddata). DiTASiC was used to recruit reads and estimate the abundance of the 141 HQMAGs in QD-TCG in each sample; estimated counts with P-value > 0.05 were removed, and the remaining counts were converted to relative abundance by dividing by the total number of reads. A random forest classification model to classify case and control was constructed based on the estimated abundances of the HQMAGs in each dataset with leave-one-out cross-validation.
  • High-quality reads were de novo assembled for each sample using MEGAHIT (--min-contig-len 500, --presets meta-large). The assembled contigs were binned using MetaBAT 2 and MaxBin 2. The quality of the bins was assessed using CheckM. Bins with completeness > 95%, contamination < 5%, and strain heterogeneity < 5% were retained as HQMAGs, which were further dereplicated using dRep (Table 3). In the case and control groups, Fastspar with 1,000 permutations was used to calculate the co-abundance correlations based on the HQMAGs shared by more than 75% of the samples in both groups.
  • KEGG Orthologue (KO) IDs were assigned to the predicted protein sequences in each HQMAG by HMMSEARCH against KOfam using KofamKOALA with default parameters. KOs were further assigned to KEGG modules. Antibiotic resistance genes were predicted using ResFinder with default parameters. The identification of virulence factors was based on the core set of Virulence Factors of Pathogenic Bacteria Database (VFDB). The predicted protein sequences were aligned to the reference sequences in VFDB using BLASTP (best hit with E-value < 1e-5, identity > 80%, and query coverage > 70%). Genes encoding carbohydrate-active enzymes (CAZys) were identified using dbCAN (release 6.0), and the best-hit alignment was retained.
  • Genes encoding formate-tetrahydrofolate ligase, propionyl-CoA:succinate-CoA transferase, propionate CoA-transferase, 4Hbt, AtoA, AtoD, Buk and But were identified as described previously. [00325] Statistical Analysis. [00326] Statistical analysis was performed in the R environment (R version 3.6.1). Friedman test followed by Nemenyi post-hoc test was used for intra-group comparisons for repeated measurements. Mann-Whitney test (two-sided) was used for comparisons between W and U at the same time point.
  • the robust clr-transformed abundance across samples was first range-scaled. The guild abundance was then obtained as the mean of the range-scaled abundances of the HQMAGs belonging to that guild.
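The two-step guild abundance calculation can be sketched as follows. The example assumes that range-scaling means min-max scaling of each genome's transformed abundance across samples to the interval [0, 1], and it takes already-transformed values as input rather than performing the robust clr transformation itself; the function names, genome names, and values are hypothetical.

```python
def range_scale(values):
    """Min-max scale one genome's values across samples to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant profile carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def guild_abundance(clr, members):
    """Per-sample mean of the range-scaled abundances of the guild's genomes."""
    scaled = [range_scale(clr[genome]) for genome in members]
    n_samples = len(scaled[0])
    return [sum(profile[i] for profile in scaled) / len(scaled)
            for i in range(n_samples)]


# Made-up transformed abundances: genome -> values in three samples.
clr = {
    "MAG_1": [0.0, 1.0, 2.0],
    "MAG_2": [4.0, 2.0, 0.0],
    "MAG_3": [1.0, 1.0, 3.0],
}
guild1 = guild_abundance(clr, ["MAG_1", "MAG_3"])
print(guild1)
```

Scaling before averaging keeps a single highly abundant genome from dominating the guild-level value.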
  • the M0 and M3 timepoints were used to train the linear mixed-effects model, and M15 was used for testing.
  • the Random Forest with leave-one-out cross-validation was used to perform classification analysis based on HQMAGs of TCGs in each dataset of CCDC-I, CCDC-II and TDC.
  • the Random Forest with 10-fold cross-validation was used to perform classification analysis based on HQMAGs in CC-TCG for the universal model.
  • Example 3: The combined core genomes in the two competing guilds (CC-TCG) predict immunotherapy outcomes across various independent datasets spanning a diverse range of diseases.
  • pre-treatment variations in CC-TCG might be predictive of clinical success in diseases.
  • CC-TCG predicted remission at week 14 with AUC of 0.68 for IBD_anti-cytokine, 0.64 for IBD_anti-integrin#1 and 0.69 for IBD_anti-integrin#2 (FIG.8C and FIG.10A).
  • ICI immune checkpoint inhibitor
  • the cohorts were from Barcelona (AM_ICI#1), Leeds (AM_ICI#2), Manchester (AM_ICI#3), PRIMM-UK (AM_ICI#4) and PRIMM-NL (AM_ICI#5).
  • Responders and non-responders were defined based on overall response rate (ORR) or progression-free survival at 12 months (PFS12). On average, CC-TCG predicted ORR and PFS12 with AUCs of 0.61 and 0.7, respectively, within each cohort (FIG.8C and FIG.10C).
  • the transportability of the prediction model from one single cohort to another was found to be insufficient.
  • CC-TCG predicted responses to CD19-CAR-T immunotherapy with an AUC of 0.66 in the German cohort, and the model was sufficiently transportable to predict for the US cohort with an AUC of 0.64.
  • confirm the associations between pre-treatment gut microbiome and therapeutic effect and highlight the potential of using CC-TCG to predict treatment effects across various diseases.
  • Gut microbiome analysis [00335] Metagenomic sequencing. DNA was extracted from fecal samples using the methods as previously described. Metagenomic sequencing was performed using Illumina Hiseq 3000 at GENEWIZ Co. (Beijing, China).
  • Cluster generation, template hybridization, isothermal amplification, linearization, and blocking denaturing and hybridization of the sequencing primers were performed according to the workflow specified by the service provider. Libraries were constructed with an insert size of approximately 500 bp followed by high-throughput sequencing to obtain paired-end reads with 150 bp in the forward and reverse directions. [00336] Data quality control. Prinseq was used to: 1) trim the reads from the 3′ end until reaching the first nucleotide with a quality threshold of 20; 2) remove read pairs when either read was < 60 bp or contained “N” bases; and 3) de-duplicate the reads. Reads that could be aligned to the human genome (H.
  • the assembled high-quality draft genomes were further dereplicated by using dRep.
  • DiTASiC, which applies kallisto for pseudo-alignment and a generalized linear model for resolving shared reads among genomes, was used to calculate the abundance of the genomes in each sample. Estimated counts with P-value > 0.05 were removed, and all samples were downsized to 36 million reads (one sample with a read mapping ratio < 25%, which could not be well represented by the high-quality genomes, was removed from downstream analysis). Taxonomic assignment of the genomes was performed using GTDB-Tk with default parameters.
  • [00338] Gut microbiome network construction and analysis.
  • the layout algorithm sets the position of the nodes to minimize the sum of forces in the network.
  • We defined robust stable edges as the unchanged positive/negative correlations between the same two genomes across all the 3 networks at M0, M3, and M15.
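The definition of robust stable edges can be sketched as a set intersection over signed edges. This is a minimal illustration, not the authors' implementation: the function name `stable_edges`, the dictionary layout, and the toy networks are all assumptions made for the example, and only the sign of each correlation is compared.

```python
def stable_edges(networks):
    """Return edges that appear with the same sign in every network.

    Each network maps an unordered genome pair (a frozenset) to a
    correlation coefficient. Only the sign of the coefficient is compared.
    """
    def signed(net):
        return {(pair, 1 if r > 0 else -1) for pair, r in net.items() if r != 0}

    common = signed(networks[0])
    for net in networks[1:]:
        common &= signed(net)
    return {pair: sign for pair, sign in common}


# Hypothetical toy networks standing in for the M0, M3 and M15 timepoints.
m0 = {frozenset({"g1", "g2"}): 0.8,
      frozenset({"g1", "g3"}): -0.6,
      frozenset({"g2", "g3"}): 0.5}
m3 = {frozenset({"g1", "g2"}): 0.7,
      frozenset({"g1", "g3"}): -0.4,
      frozenset({"g2", "g3"}): -0.3}   # sign flips, so this edge is not stable
m15 = {frozenset({"g1", "g2"}): 0.9,
       frozenset({"g1", "g3"}): -0.5}  # g2-g3 edge missing at this timepoint

stable = stable_edges([m0, m3, m15])
```

Here only the g1-g2 (positive) and g1-g3 (negative) edges survive, because the g2-g3 correlation changes sign at M3 and is absent at M15.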
  • Nodes in the stable network were clustered by Connected Components Clustering analysis in Cytoscape.
  • a clustering tree based on the negative (set as -1) and positive correlations (set as 1) with average linkage method followed by WGCNA analysis.
  • Case-Control Dataset Collection I (CCDC-I). Eleven independent metagenomic datasets on T2DM, LC, AS, ACVD, SCZ, CRC and IBD were downloaded from SRA or ENA database.
  • the group information i.e. case or control
  • Quality control of raw reads was conducted by KneadData.
  • DiTASiC was used to recruit reads and estimate the abundance of the 141 HQMAGs in QD-TCG in each sample. Estimated counts with P-value > 0.05 were removed, and the remaining counts were converted to relative abundance by dividing by the total number of reads.
  • a random forest classification model to classify case and control was constructed based on the estimated abundances of the HQMAGs in each dataset with leave-one-out cross-validation. High-quality reads were de novo assembled for each sample using MEGAHIT (--min-contig-len 500, --presets meta-large).
  • Case-Control Dataset Collection II (CCDC-II). Fifteen independent metagenomic datasets on AS, ASD, BD, COVID-19, CRC, GD, HT, MS, PC and PD were downloaded from the SRA or ENA database (Table 4). The group information, i.e. case or control, was collected from the corresponding papers or from curatedMetagenomicData. Quality control of raw reads was conducted by KneadData.
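The conversion from estimated counts to relative abundance can be illustrated as follows. This is a sketch under stated assumptions: the input layout (a count and a P-value per genome), the function name `relative_abundance`, and the choice to set removed estimates to zero are hypothetical and do not reflect DiTASiC's actual output format.

```python
def relative_abundance(estimates, total_reads, max_p=0.05):
    """Drop estimates with P-value > max_p, then divide by the total read count.

    `estimates` maps a genome name to (estimated_count, p_value).
    Removed estimates are reported as 0.0 (an assumption of this sketch).
    """
    return {
        genome: (count / total_reads if p <= max_p else 0.0)
        for genome, (count, p) in estimates.items()
    }


# Hypothetical estimates for one sample with 1,000 reads in total.
sample_estimates = {
    "mag_a": (300.0, 0.01),
    "mag_b": (50.0, 0.20),   # P-value > 0.05, so the estimate is removed
    "mag_c": (100.0, 0.05),  # P-value equal to 0.05 is kept
}
rel = relative_abundance(sample_estimates, total_reads=1000)
```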
  • DiTASiC was used to recruit reads and estimate the abundance of HQMAGs in C-TCG and CC-TCG.
  • a random forest classification model to classify case and control was constructed based on the estimated abundances of the HQMAGs in each dataset with leave-one-out cross-validation.
  • Treatment Dataset Collection (TDC). Eleven independent metagenomic datasets on pre-treatment samples related to four diseases, including IBD, rheumatoid arthritis (RA), advanced melanoma and B cell lymphoma, were downloaded from the SRA or ENA database. The responder and non-responder categories of each sample were collected from the corresponding paper. Quality control of raw reads was conducted by KneadData.
  • DiTASiC was used to recruit reads and estimate the abundance of HQMAGs in CC-TCG.
  • a random forest classification model to predict responder and non-responder was constructed based on the estimated abundances of the HQMAGs in each dataset with leave-one-out cross-validation.
  • Gut microbiome functional analysis. Prokka was used to annotate the HQMAGs.
  • KEGG Orthologue (KO) IDs were assigned to the predicted protein sequences in each HQMAG by HMMSEARCH against KOfam using KofamKOALA with default parameters. KOs were further assigned to KEGG modules. Antibiotic resistance genes were predicted using ResFinder with default parameters.
  • VFDB Virulence Factors of Pathogenic Bacteria Database
  • PERMANOVA test (9,999 permutations) was used to compare the groups of gut microbiota structure.
  • Mann-Whitney test two-sided
  • Fisher’s exact test two sided
  • Hierarchical clustering analysis based on Jaccard distance on the KO profiles was conducted to compare HQMAGs in CC-TCG.
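The Jaccard distance underlying this clustering compares the sets of KOs present in two genomes. The sketch below computes only the pairwise distances, not the clustering itself; the function name, the genome labels, and the KO identifiers are placeholders chosen for illustration.

```python
def jaccard_distance(kos_a, kos_b):
    """1 - |intersection| / |union| for two sets of KO identifiers."""
    union = kos_a | kos_b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(kos_a & kos_b) / len(union)


# Hypothetical KO presence profiles for three genomes.
ko_profiles = {
    "mag_a": {"K00001", "K00002", "K00003"},
    "mag_b": {"K00002", "K00003", "K00004"},
    "mag_c": {"K00009"},
}

names = sorted(ko_profiles)
distances = {
    (x, y): jaccard_distance(ko_profiles[x], ko_profiles[y])
    for x in names for y in names
}
```

The resulting pairwise distances could then be passed to any hierarchical clustering routine.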
  • Linear mixed effect model with subject ID as random effect was applied to explore the associations between the abundance of guilds in QD-TCG and clinical parameters. For each HQMAG belonging to a guild, the robust clr-transformed abundance across each sample was first range-scaled.
  • the guild abundance was then obtained as the mean of the range-scaled abundances of HQMAGs belonging to that guild.
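The guild abundance calculation can be sketched as below. Two assumptions are made loudly here: the input is taken to be already robust-clr-transformed, and "range-scaled" is interpreted as min-max scaling to [0, 1], which the text does not spell out. The function names and toy values are hypothetical.

```python
def range_scale(values):
    """Min-max scale a genome's values across samples to [0, 1] (assumed meaning of range scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant genome: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def guild_abundance(rclr_by_genome, guild_members):
    """Per-sample mean of the range-scaled abundances of a guild's genomes."""
    scaled = [range_scale(rclr_by_genome[g]) for g in guild_members]
    n_samples = len(scaled[0])
    return [sum(row[i] for row in scaled) / len(scaled) for i in range(n_samples)]


# Hypothetical robust-clr values for two genomes across three samples.
rclr = {"mag_a": [-1.0, 0.0, 1.0], "mag_b": [2.0, 4.0, 3.0]}
guild = guild_abundance(rclr, ["mag_a", "mag_b"])
```

Scaling each genome before averaging keeps a single highly abundant genome from dominating the guild-level value.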
  • the M0 and M3 timepoints were used to train the linear mixed effect model, and M15 was used for testing.
  • the Random Forest with leave-one-out cross-validation was used to perform classification analysis based on HQMAGs of TCGs in each dataset of CCDC-I, CCDC-II and TDC.
  • the Random Forest with 10-fold cross-validation was used to perform classification analysis based on HQMAGs in CC-TCG for universal model.
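The leave-one-out procedure and the AUC used to score it can be sketched with the standard library alone. Note the substitution: a nearest-centroid scorer stands in for the Random Forest purely to keep the example self-contained, so this shows the cross-validation loop and the rank-based AUC, not the classifier used in the study. All names and data are hypothetical.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroids(X, y):
    """Stand-in model: the mean feature vector of each class."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return (mean([x for x, lab in zip(X, y) if lab == 0]),
            mean([x for x, lab in zip(X, y) if lab == 1]))


def centroid_score(model, x):
    """Higher score means closer to the class-1 centroid than to the class-0 centroid."""
    c0, c1 = model
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return d0 - d1


def leave_one_out_scores(X, y, fit, score):
    """Train on all samples but one, score the held-out sample, repeat for every sample."""
    out = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        out.append(score(model, X[i]))
    return out


# Hypothetical one-feature abundance data: controls (0) low, cases (1) high.
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
loo_auc = auc(y, leave_one_out_scores(X, y, fit_centroids, centroid_score))
```

Because every held-out sample is scored by a model that never saw it, the pooled scores give a single AUC per dataset.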
  • Example 4 – A universal model based on the combined core genomes of the two competing guilds distinguishes cases from controls across diseases.
  • the immune checkpoint inhibitor (ICI) treatment datasets for advanced melanoma included multiple cohorts, including Barcelona (AM_ICI#1), Leeds (AM_ICI#2), Manchester (AM_ICI#3), PRIMM-UK (AM_ICI#4), and PRIMM-NL (AM_ICI#5).
  • CC-TCG predicted overall response rate (ORR) and progression-free survival at 12 months (FPS12) with AUC values of 0.61 and 0.7 within each cohort, respectively (Figures 14B and 15C).
  • ORR overall response rate
  • FPS12 progression-free survival at 12 months
  • the transportability of the prediction model from one cohort to another was insufficient. However, when pooling training datasets, as done in the LODO analysis, the prediction performance improved, reaching average AUC values of 0.66 and 0.67 for ORR and FPS12, respectively.
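The pooling used in a leave-one-dataset-out (LODO) analysis can be sketched as a split generator: each cohort is held out in turn while all others are pooled for training. The function name and the sample-to-cohort mapping below are assumptions for illustration; only the cohort labels are taken from the text.

```python
def leave_one_dataset_out(sample_to_cohort):
    """Yield (held_out_cohort, training_samples, test_samples) for each cohort."""
    for held_out in sorted(set(sample_to_cohort.values())):
        train = [s for s, c in sample_to_cohort.items() if c != held_out]
        test = [s for s, c in sample_to_cohort.items() if c == held_out]
        yield held_out, train, test


# Hypothetical samples assigned to three of the melanoma cohorts.
samples = {"s1": "AM_ICI#1", "s2": "AM_ICI#1", "s3": "AM_ICI#2", "s4": "AM_ICI#3"}
splits = list(leave_one_dataset_out(samples))
```

A model would be fitted on each training list and evaluated on the matching held-out cohort, and the per-cohort AUC values averaged.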
  • HQMAGs allows annotation and understanding of the health-relevant functionalities of the two guilds in the TCGs.
  • the HQMAGs from the eight different TCGs found in QD and CCDC-I displayed limited overlap, with 701 of the 788 genomes in C-TCG being specific to just one TCG.
  • KEGG Ortholog
  • C1A in QD-TCG had a significantly higher genetic capacity for utilizing complex plant polysaccharides and producing butyrate and a lower genetic capacity for producing propionate.
  • When CAZy and SCFA-related genes were explored in all eight different TCGs, this functional genetic distinction between C1A and C1B was further validated (Figures 17C and 17D).
  • CC-TCG also inherited this functional dichotomy (Figures 17E and 17F).
  • antibiotic resistance genes ARGs
  • VFs virulence factor genes
  • C1A had 8 VF genes involved in 1 VF class and 1 ARG related to 1 antibiotic class, while C1B had 753 VF genes from 11 classes and 25 ARGs related to 7 antibiotic classes.
  • C1As consistently carried fewer VF genes and ARGs than C1Bs.
  • CC-TCG inherited such a genetically functional difference between the two guilds, as evidenced by its C1A harboring 30 VF genes involved in 6 VF classes and 42 ARGs related to 5 antibiotic classes.
  • this distinctive microbiome signature underpins our machine learning models, successfully differentiating between cases and controls across various conditions and predicting immunotherapy outcomes, underscoring its broad applicability.
  • the articulation of the TCG model as a core microbiome signature represents a novel insight into our understanding of the microbiome-health nexus.
  • this study concentrated on stable interactions within the gut microbiome. Such a stable interaction-focused approach has enhanced the understanding of the fundamental structure of the gut microbiome and devised a method to streamline the complex data inherent to this field.
  • This approach allows maintenance of the integrity of crucial biological patterns while managing the high dimensionality and sparsity challenges of extensive microbiome datasets.
  • This research encompassed approximately 4,000 metagenomic samples from 38 studies, covering 15 diseases across three continents. 284 HQMAGs within the CC-TCG were identified from a pool of 8,000 by specifically targeting stably correlated genomes.
  • the refined predictive prowess of our TCG-informed models, when benchmarked against species-level, entire-genome-list, differential-feature, or high prevalent-feature models, substantiates the merit of the targeted approach.
  • the beneficial guild comprises genomes rich in CAZy genes essential for the digestion of dietary fiber and genes instrumental in the production of SCFAs, such as butyrate, which play vital roles in nutrition, metabolism, immune function, and overall physiology.
  • what might be considered the detrimental counterpart (C1Bs) is characterized by genomes abundant with genes that confer antibiotic resistance and express VFs that may predispose to a pathogenic interaction with the host. It is thought that these safe pathogens or pathobionts might play a pivotal role in the developmental shaping of the immune system in early life.
  • the dominance of this guild over the beneficial one could contribute to a pro-inflammatory state, potentially underpinning various chronic conditions and the aging process.
  • This guild shows greater resilience in the more acidic gut environment caused by SCFA accumulation, a byproduct of their fiber digestion, indicating that dietary fibers serve not just as an energy source but also promote the growth of these beneficial microbes. They may progressively outcompete pathobionts due to the acidified environment, SCFA’s antimicrobial properties, and competitive exclusion.
  • the beneficial guild thus may function similarly to foundation species in ecology, forming a supportive microbiome structure that underpins host health while making the gut environment inhibitive to pathobionts.
  • the beneficial guild (C1As) can be designated as the foundation guild, postulating that their role may be analogous to towering trees within a forest ecosystem.
  • dietary fiber may act as a constant energy infusion, nurturing an orderly TCG structure predominated by the foundation guild and bolstering host health within the intricate ecosystem of the gut microbiome.
  • a broad spectrum of epidemiological and interventional studies has validated that dietary fibers can potentially deter and alleviate various disease conditions.
  • the TCG model holds promise in unraveling the mechanisms underpinning this phenomenon. [00378]
  • Members of the TCG model can be viewed as housekeeping microbes in the human gut microbiome, analogous to housekeeping genes in genomes.
  • the core members identified in our TCG model represent foundational components of the gut microbiome that are critical for maintaining its overall stability and functionality.
  • These housekeeping microbes could be involved in fundamental processes such as nutrient cycling, metabolic regulation, and immune modulation, similar to how housekeeping genes are involved in basic cellular processes such as transcription, translation, and replication.
  • the TCG model aims to identify the most crucial microbial relationships that persist across different conditions and host environments.
  • Prinseq 60 was used to: 1) trim the reads from the 3′ end until reaching the first nucleotide with a quality threshold of 20; 2) remove read pairs when either read was < 60 bp or contained “N” bases; and 3) de-duplicate the reads.
  • DiTASiC 66, which applies kallisto for pseudo-alignment 67 and a generalized linear model for resolving shared reads among genomes, was used to calculate the abundance of the genomes in each sample. Estimated counts with P-value > 0.05 were removed, and all samples were downsized to 36 million reads. (One sample with a read mapping ratio < 25%, which could not be well represented by the high-quality genomes, was removed from downstream analysis.) Taxonomic assignment of the genomes was performed using GTDB-Tk 68 with default parameters. [00385] Gut microbiome network construction and analysis. In the W group, prevalent genomes shared by more than 75% of the samples at every timepoint were used to construct the co-abundance network at each timepoint.
  • Fastspar 74, a rapid and scalable correlation estimation tool for microbiome studies, was used to calculate the correlations between the genomes with 1,000 permutations at each timepoint based on the abundances of the genomes across the patients; correlations with P < 0.001 were retained for further analysis.
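The three quality-control steps can be illustrated with a small pure-Python sketch. It is not Prinseq and does not reproduce its options; the function names, the in-memory read representation, and the interpretation of the thresholds (trimming stops at the first base with quality of at least 20, and duplicates are exact matches of both mates) are assumptions of this example.

```python
def trim_3prime(seq, quals, threshold=20):
    """Trim bases from the 3' end until a base with quality >= threshold is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


def filter_pairs(pairs, min_len=60, threshold=20):
    """Trim both mates, drop short or N-containing pairs, then drop exact duplicate pairs."""
    seen = set()
    kept = []
    for (seq1, q1), (seq2, q2) in pairs:
        seq1, q1 = trim_3prime(seq1, q1, threshold)
        seq2, q2 = trim_3prime(seq2, q2, threshold)
        if len(seq1) < min_len or len(seq2) < min_len:
            continue  # either mate too short after trimming
        if "N" in seq1 or "N" in seq2:
            continue  # either mate contains an ambiguous base
        key = (seq1, seq2)
        if key in seen:
            continue  # exact duplicate of a pair already kept
        seen.add(key)
        kept.append(((seq1, q1), (seq2, q2)))
    return kept


# Hypothetical 6 bp reads; min_len is lowered so the toy data can pass.
good = ("ACGTAC", [30] * 6)
low_tail = ("ACGTAC", [30, 30, 30, 30, 10, 10])
with_n = ("ACNTAC", [30] * 6)
pairs = [(good, good), (good, good), (good, with_n), (good, low_tail)]
kept = filter_pairs(pairs, min_len=5)
```

With `min_len=5`, only the first pair survives: the second is a duplicate, the third contains an N, and the fourth has a mate trimmed to 4 bp.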
  • the networks were visualized with Cytoscape v3.8.1 75 .
  • the layout of the nodes and edges was determined by Edge-weighted Spring Embedded Layout using the correlation coefficient as weights.
  • the links between the nodes are treated as metal springs attached to the pair of nodes.
  • the correlation coefficient was used to determine the repulsion and attraction of the spring 75 .
  • the layout algorithm sets the position of the nodes to minimize the sum of forces in the network.
  • Quality control of raw reads was conducted by KneadData (https://bitbucket.org/biobakery/kneaddata).
  • High-quality reads were de novo assembled for each sample using MEGAHIT (--min-contig-len 500, --presets meta-large). The assembled contigs were binned using MetaBAT 2 and MaxBin 2. The quality of the bins was assessed using CheckM. Bins with completeness > 95%, contamination < 5% and strain heterogeneity < 5% were retained as HQMAGs, which were further dereplicated using dRep (Table 3). In the case and control groups, Fastspar with 1,000 permutations was used to calculate the co-abundance correlations based on the HQMAGs shared by more than 75% of the samples in both groups.
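The bin-quality criteria for retaining HQMAGs amount to a three-way threshold filter. The sketch below assumes the quality metrics have already been parsed into dictionaries; the field names, the function name `select_hqmags`, and the toy bins are hypothetical and do not mirror CheckM's output format.

```python
def select_hqmags(bins, min_completeness=95.0, max_contamination=5.0,
                  max_heterogeneity=5.0):
    """Keep bins with completeness > 95%, contamination < 5%, strain heterogeneity < 5%."""
    return [
        b["name"] for b in bins
        if b["completeness"] > min_completeness
        and b["contamination"] < max_contamination
        and b["strain_heterogeneity"] < max_heterogeneity
    ]


# Hypothetical quality metrics for four bins.
bins = [
    {"name": "bin1", "completeness": 98.2, "contamination": 1.1, "strain_heterogeneity": 0.0},
    {"name": "bin2", "completeness": 91.0, "contamination": 0.5, "strain_heterogeneity": 0.0},
    {"name": "bin3", "completeness": 99.0, "contamination": 6.3, "strain_heterogeneity": 2.0},
    {"name": "bin4", "completeness": 96.5, "contamination": 2.0, "strain_heterogeneity": 12.5},
]
hqmags = select_hqmags(bins)
```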
  • TDC Treatment Dataset Collection
  • Prokka 69 was used to annotate the HQMAGs.
  • KEGG Orthologue (KO) IDs were assigned to the predicted protein sequences in each HQMAG by HMMSEARCH against KOfam using KofamKOALA with default parameters 70 .
  • KOs were further assigned to KEGG modules.
  • Antibiotic resistance genes were predicted using ResFinder 71 with default parameters.
  • the identification of virulence factors was based on the core set of Virulence Factors of Pathogenic Bacteria Database (VFDB 72 ). The predicted protein sequences were aligned to the reference sequences in VFDB using BLASTP (best hit with E-value < 1e-5, identity > 80% and query coverage > 70%).
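The hit-filtering rule for virulence factor assignment can be sketched as follows. The alignment records are assumed to be already parsed into dictionaries; the field names, the function name `best_vf_hits`, and the use of bit score to pick the best hit are assumptions of this example rather than details given in the text.

```python
def best_vf_hits(hits, max_evalue=1e-5, min_identity=80.0, min_qcov=70.0):
    """Keep each query's best hit among those with E-value < 1e-5, identity > 80%, query coverage > 70%."""
    best = {}
    for h in hits:
        passes = (h["evalue"] < max_evalue
                  and h["identity"] > min_identity
                  and h["qcov"] > min_qcov)
        if not passes:
            continue
        current = best.get(h["query"])
        # "Best" is taken here as the highest bit score (an assumption).
        if current is None or h["bitscore"] > current["bitscore"]:
            best[h["query"]] = h
    return {query: h["subject"] for query, h in best.items()}


# Hypothetical parsed alignment records.
hits = [
    {"query": "prot1", "subject": "VF_A", "evalue": 1e-50, "identity": 92.0, "qcov": 95.0, "bitscore": 300.0},
    {"query": "prot1", "subject": "VF_B", "evalue": 1e-60, "identity": 95.0, "qcov": 98.0, "bitscore": 350.0},
    {"query": "prot2", "subject": "VF_C", "evalue": 1e-20, "identity": 75.0, "qcov": 90.0, "bitscore": 150.0},
    {"query": "prot3", "subject": "VF_D", "evalue": 1e-3, "identity": 99.0, "qcov": 99.0, "bitscore": 40.0},
]
assigned = best_vf_hits(hits)
```

In this toy input only `prot1` receives an assignment: `prot2` fails the identity cutoff and `prot3` fails the E-value cutoff.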
  • CAZys carbohydrate-active enzymes
  • Mann-Whitney test (two-sided) was used for comparisons between W and U at the same timepoint. Pearson Chi-square tests were performed to compare the differences of categorical data between groups or timepoints. PERMANOVA test (9,999 permutations) was used to compare the groups of gut microbiota structure. [00392] Mann-Whitney test (two-sided) and Fisher’s exact test (two-sided) were used to compare the target functions between Guild 1 and Guild 2. Hierarchical clustering analysis based on Jaccard distance on the KO profiles was conducted to compare HQMAGs in CC-TCG.
  • REFERENCES CITED AND ALTERNATIVE EMBODIMENTS [00395] All references cited herein are incorporated herein by reference in their entirety and for all purposes to the same extent as if each individual publication or patent or patent application was specifically and individually indicated to be incorporated by reference in its entirety for all purposes.
  • the present invention can be implemented as a computer program product that comprises a computer program mechanism embedded in a non-transitory computer readable storage medium.
  • the computer program product could contain the program modules shown in Figure 1, and/or as described in Figure 2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Medicinal Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Methods and systems are provided for predicting a subject's response to a therapy by obtaining a first plurality of nucleic acid sequences for genomic DNA from a sample from the gut of a subject. From the nucleic acid sequences, a plurality of genomic abundance values for a plurality of gut bacteria are determined. A model is applied to the plurality of genomic abundance values, thereby obtaining the prediction of the subject's response to a therapy as output of the model.
PCT/US2024/053959 2023-11-01 2024-10-31 Methods for predicting response to therapy for a disorder via core microbiome guilds Pending WO2025096827A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363595189P 2023-11-01 2023-11-01
US63/595,189 2023-11-01

Publications (2)

Publication Number Publication Date
WO2025096827A2 true WO2025096827A2 (fr) 2025-05-08
WO2025096827A3 WO2025096827A3 (fr) 2025-06-19

Family

ID=95583305

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/053959 Pending WO2025096827A2 (fr) 2023-11-01 2024-10-31 Methods for predicting response to therapy for a disorder via core microbiome guilds

Country Status (1)

Country Link
WO (1) WO2025096827A2 (fr)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190019575A1 (en) * 2014-10-21 2019-01-17 uBiome, Inc. Nasal-related characterization associated with the nose microbiome
EP3785269A4 (fr) * 2018-03-29 2021-12-29 Freenome Holdings, Inc. Methods and systems for analyzing microbiota

Also Published As

Publication number Publication date
WO2025096827A3 (fr) 2025-06-19

Similar Documents

Publication Publication Date Title
US11244763B2 (en) Predicting likelihood and site of metastasis from patient records
US20200335179A1 (en) Clinical decision model
Nowak et al. Characterisation of the circulating transcriptomic landscape in inflammatory bowel disease provides evidence for dysregulation of multiple transcription factors including NFE2, SPI1, CEBPB, and IRF2
US20240282449A1 (en) Methods and systems for machine learning analysis of inflammatory skin diseases
Howrylak et al. Discovery of the gene signature for acute lung injury in patients with sepsis
WO2019178546A1 (fr) Methods and systems for predicting response to anti-TNF therapies
US20230073731A1 (en) Gene expression analysis techniques using gene ranking and statistical models for identifying biological sample characteristics
Li et al. Identification of common blood gene signatures for the diagnosis of renal and cardiac acute allograft rejection
Borisov et al. New paradigm of machine learning (ML) in personalized oncology: data trimming for squeezing more biomarkers from clinical datasets
WO2022192457A1 (fr) Predicting response to treatments in patients with clear cell renal cell carcinoma
US20230282367A1 (en) Methods and systems for predicting response to anti-tnf therapies
WO2023194392A1 (fr) Analysis of tumor samples
Lyu et al. Deciphering a TB-related DNA methylation biomarker and constructing a TB diagnostic classifier
Mo et al. Stratification of risk of progression to colectomy in ulcerative colitis via measured and predicted gene expression
WO2024226805A2 (fr) Methods for predicting a response to a therapy for a disorder through core microbiome guilds
US20250285756A1 (en) Two competing guilds as core microbiome signature for human diseases
Tang et al. Derivation of stable microarray cancer-differentiating signatures using consensus scoring of multiple random sampling and gene-ranking consistency evaluation
US20250174366A1 (en) Methods and Compositions for Assessing and Treating Lupus
WO2025096827A2 (fr) Methods for predicting response to therapy for a disorder via core microbiome guilds
Larionova et al. Immune gene signatures as prognostic criteria for cancer patients
WO2025064586A1 (fr) Machine learning methods for predicting a disease phenotype
He et al. Integrated Machine Learning Algorithms for Stratification of Patients with Bladder Cancer
Sun et al. Risk prediction model construction for post myocardial infarction heart failure by blood immune B cells
Momen-Roknabadi et al. Detection of Early-Stage Colorectal Cancer Using Cell-Free oncRNA Biomarkers and Artificial Intelligence
Piyawajanusorn et al. Predicting atezolizumab response in metastatic urothelial carcinoma patients using machine learning on integrated tumour gene expression and clinical data