WO2025221865A1 - Methods and compositions for cell free rna modification analysis - Google Patents
Methods and compositions for cell free RNA modification analysis
- Publication number
- WO2025221865A1 (PCT/US2025/024928)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- rna
- disease
- patient
- cancer
- modifications
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/689—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for bacteria
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Definitions
- aspects of this invention relate to at least the fields of cell biology and epitranscriptomics, and medicine.
- circulating cell-free DNA (cfDNA) has emerged as an increasingly important analyte, providing valuable clinical information and revolutionizing non-invasive diagnoses across various domains 1-3.
- a key strategy for cancer diagnosis involves identifying the changes in tumor-derived cell-free DNA (ctDNA) 4,5.
- ctDNA concentrations in plasma can be exceedingly low in early-stage cancer patients, resulting in reduced sensitivity and specificity 6.
- although epigenetics-based cfDNA sequencing has shown promise 7,8, the low concentration of cfDNA at the early cancer stages remains a significant challenge 9.
- the epigenetic alterations found in cancer may also be present in noncancer tissues, adding another layer of complexity to the diagnostic process 10.
- Cell-free RNAs (cfRNAs) released from apoptotic cells provide an alternative analyte for non-invasive diagnosis and detection of various diseases 11-13.
- Changes in cellular RNA transcription are dynamic processes that can serve as indicators of disease pathobiology 12 ". While overexpression of tumor-specific transcripts may enhance tumor-derived RNA signals, the limited cell death in the early stages of cancer is again a main challenge.
- Cell-free mRNAs are susceptible to degradation if not protected by protein binding and are present in low quantities in clinically feasible samples (e.g., a few mL of plasma), complicating their amplification and detection 16. Additionally, contamination from unrelated cells can markedly alter the abundance of cell-free mRNAs, further complicating detection specificity.
- RNA modifications present in cell free RNA can be used to determine microbiome identity in biological samples, such as plasma, from a patient.
- the present disclosure provides various methods, compositions, systems, and kits for detecting an RNA modification signature.
- the RNA modification signature may be useful for determining the presence and/or severity of a disease in a patient.
- the method comprises 1, 2, 3, or more steps including any of the following: receiving sequencing data obtained from the patient; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the patient has the disease.
- the sequencing data comprises an RNA modification signature from cell-free RNA.
- the patient is an individual having, suspected of having or diagnosed with having a disease, including any disease included herein.
- the patient is a human.
- the patient is a healthy individual.
- the patient is an individual to be monitored for changes in microbiome status or composition.
- the patient is an individual to be screened for a disease.
- the patient is an individual to be screened for microbiome composition.
- the patient has not been diagnosed with a disease, including any disease disclosed herein.
- the individual is a human.
- the method comprises 1, 2, 3, or more steps including any of the following: receiving sequencing data obtained from the individual; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the individual has the disease.
- the sequencing data comprises an RNA modification signature from cell-free RNA.
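The three recited steps (receiving sequencing data, generating an input feature vector, and applying a trained model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the site names, the weights and bias, the zero-fill for missing sites, and the choice of a logistic model are hypothetical and are not taken from the disclosure or from Table A.

```python
import math

# Hypothetical trained parameters: one weight per modification site.
# Site names and numbers are illustrative only, not taken from Table A.
SITE_ORDER = ["siteA", "siteB", "siteC"]
WEIGHTS = {"siteA": 2.0, "siteB": -1.5, "siteC": 0.5}
BIAS = -0.25


def build_feature_vector(site_ratios, site_order=SITE_ORDER):
    """Order per-site modification (mutation) ratios into a fixed-length vector.

    Sites missing from the sample are filled with 0.0 (an assumption).
    """
    return [float(site_ratios.get(site, 0.0)) for site in site_order]


def predict_probability(features, site_order=SITE_ORDER):
    """Apply a logistic model to the feature vector and return P(disease)."""
    score = BIAS + sum(WEIGHTS[s] * x for s, x in zip(site_order, features))
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    sample = {"siteA": 0.40, "siteB": 0.10}  # toy RNA modification signature
    vec = build_feature_vector(sample)
    print(vec, round(predict_probability(vec), 3))
```

Keeping the site order fixed is the important design point: the model only makes sense if every patient's vector places the same site at the same index.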
- the diseases can be any disease, including any disease associated with changes in a patient’s microbiome.
- the disease is cancer.
- the cancer is colorectal cancer.
- the disease is an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
- the patient received or will receive a transplant.
- the disease is graft versus host disease (GvHD).
- the transplant results in GvHD.
- the transplant is a bone marrow transplant.
- the GvHD is associated with a bone marrow transplant or other transplant.
- the disease is associated with microbiome dysbiosis.
- the disease is Alzheimer’s disease or Parkinson’s disease.
- the method is a computer implemented method.
- the receiving is done by one or more processors.
- the generating is done by one or more processors.
- the applying is done by one or more processors.
- the sequencing data is obtained from a cell free RNA sample from the patient. In certain aspects, the sequencing data is obtained from plasma from the patient.
- the RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof. It is also contemplated that, in some aspects, one or more of m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl is excluded from the RNA modification signature.
- the presence or absence of the RNA modifications is determined by LIME-seq.
- the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
- the RNA modifications are RNA modifications from bacterial RNA.
- the methods detect RNA modification presence and/or levels from bacterial RNA.
- the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
- the RNA modifications comprise RNA modifications from patient-derived RNA, including any RNA produced by endogenous cells of the patient.
- the RNA modifications comprise RNA modifications from host RNA.
- the host RNA may be RNA produced by one or more endogenous cells of the host, which may be a patient including any patient disclosed herein.
- the host RNA does not comprise bacterial RNA.
- the RNA modifications comprise RNA modifications from human RNA.
- the RNA modifications comprise RNA modifications present in the patient, including any patient disclosed herein.
- the RNA modifications are from both bacterial RNA and host RNA.
- the biological sample is a human plasma sample.
- the biological sample is from a patient suspected of having cancer.
- the cancer is colorectal cancer.
- the biological sample is from a patient suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
- the biological sample is from a patient that has received or will receive a transplant.
- the disease is associated with microbiome dysbiosis. In certain aspects, the disease is Alzheimer’s disease or Parkinson’s disease.
- the biological sample comprises cell free RNA. In certain aspects, the RNA sample comprises approximately 0.1-50 ng of total RNA.
- the RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof. In certain aspects, the presence or absence of the RNA modifications, including m1A, m3C, m1G, m2,2G, is determined by LIME-seq.
- the RNA modification signature comprises RNA modifications from microbiome-derived RNA. In certain aspects, the RNA modification signature comprises RNA modifications from host-derived RNA. In certain aspects, the RNA modification signature comprises patient-derived RNA. In certain aspects, host-derived and/or patient derived RNA comprises RNA produced by one or more cells endogenous to the host or individual. In certain aspects, host-derived and/or patient derived RNA does not comprise bacterial RNA. In certain aspects, the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
- the RNA sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
- the method comprises administering to the patient an effective amount of a therapy, wherein the patient has been determined to have a disease after detection of an RNA modification signature in cell free RNA from a biological sample from the patient.
- the detection can occur by any of the methods disclosed herein.
- the RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof.
- one or more RNA modifications, including m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U, are excluded from the RNA modification signature.
- the biological sample comprises plasma from the patient.
- the biological sample comprises approximately 5-10 ng of total RNA.
- the patient is a human patient.
- the patient has or is suspected of having cancer.
- the cancer is colorectal cancer.
- the patient has or is suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
- the patient received or will receive a transplant.
- the disease is associated with microbiome dysbiosis.
- the patient has or is suspected of having Alzheimer’s disease or Parkinson’s disease.
- the therapy is determined based on the detected RNA modification signature.
- the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
- the RNA modification signature comprises RNA modifications from host-derived RNA.
- the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
- RNA modification signature may be any RNA modification signature disclosed herein.
- the patient may be any patient disclosed herein.
- the biological sample may be any biological sample disclosed herein.
- the disease may be any disease disclosed herein.
- the individual may be any individual disclosed herein.
- the disease that an individual may have may be any disease disclosed herein.
- the individual has not been diagnosed with a disease, such as any disease disclosed herein.
- the individual has not been diagnosed with an inflammatory disease.
- the individual has not been diagnosed with a disease associated with the microbiome.
- the individual has not been diagnosed with a disease associated with microbiome dysbiosis.
- the biological sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
- RNA molecules from a biological sample from a patient comprising 1, 2, 3, or more steps, which can include any of the following: incubating the sample with a reverse transcriptase under conditions to generate fully or partially complementary DNA (cDNA) molecules; incubating the cDNA molecules with a 3’ linker and ligase under conditions to ligate the 3’ linker to the cDNA molecules; and sequencing the population of ligated cDNA to identify modifications on the RNA molecules.
- the reverse transcriptase is a human immunodeficiency virus (HIV) reverse transcriptase.
- the reverse transcriptase is a MarathonRT reverse transcriptase.
- the reverse transcriptase is a reverse transcriptase capable of producing mutations in the reverse transcribed DNA when there is an RNA modification in the reverse transcribed RNA.
- the modifications comprise m1A, m3C, m1G, m2,2G, m5C modifications, or a combination thereof.
- the biological sample comprises cell free RNA.
- the biological sample is a human plasma sample.
- the biological sample comprises approximately 0.1-50 ng of total RNA.
- RNA in the biological sample is fragmented prior to the reverse transcribing step.
- the population of cDNA is amplified before sequencing.
- the biological sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
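Because the reverse transcriptase is described as producing mutations in the cDNA where the RNA template carries a modification, one natural per-position readout is the fraction of reads that disagree with the reference base. The following is a minimal sketch under stated assumptions: the input layout (a mapping of base to read count at one position) and the `min_coverage` cutoff are hypothetical and are not specified in the text.

```python
def mutation_ratio(base_counts, ref_base, min_coverage=20):
    """Return the fraction of reads that disagree with the reference base.

    base_counts: mapping such as {"A": 70, "G": 25, "T": 5} at one position.
    Returns None when coverage is below min_coverage (a hypothetical cutoff),
    since a ratio from a handful of reads is not informative.
    """
    coverage = sum(base_counts.values())
    if coverage < min_coverage:
        return None
    mismatches = coverage - base_counts.get(ref_base, 0)
    return mismatches / coverage


if __name__ == "__main__":
    # Toy pileup at a single position whose reference base is "A".
    print(mutation_ratio({"A": 70, "G": 25, "T": 5}, "A"))
```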
- RNA sample a ribonucleic acid (RNA) sample
- the method comprising 1, 2, 3, or more steps, which can include any of the following: ligating a 3’ adapter to a plurality of RNA molecules in the RNA sample; reverse transcribing the plurality of RNA molecules in the RNA sample using a reverse transcriptase to generate a population of complementary DNA (cDNA); ligating a 3’ linker to the population of cDNA; and sequencing the population of cDNA.
- the modification signature comprises m1A, m3C, m1G, m2,2G, m5C modifications, or a combination thereof, in the RNA sample.
- the RNA sample comprises cell free RNA. In certain aspects, the RNA sample is from human plasma. In certain aspects, the RNA sample comprises approximately 0.1-50 ng of total RNA. In certain aspects, the RNA sample is fragmented prior to the reverse transcribing step. In certain aspects, the population of cDNA is amplified before sequencing.
- the biological sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
- the RNA modification signature of the aspects described herein may include RNA modification levels.
- the RNA modifications detected for a particular RNA modification signature may be specific to the disease. For example, a certain subset of the RNA modifications in Table A may be detected in an RNA modification signature specific for cancer, whereas a different subset of Table A may be detected in an RNA modification signature for Parkinson’s disease.
- the method can include the step of identifying one or more RNA modifications, which can include one or more tRNA modifications, in bacterial populations obtained from a patient.
- the method comprises sequencing bacterial nucleic acid in a biological sample from a patient.
- the method comprises determining the presence or absence of one or more RNA modifications, which can include one or more tRNA modifications, in bacterial nucleic acid in a biological sample from a patient. The determining may be performed after sequencing the nucleic acid.
- the sequencing step comprises performing LIME-seq.
- one or more bacterial strains having one or more RNA modifications, which can include one or more tRNA modifications, comprise an active infection in the patient.
- a particular bacterial strain is identified as being more active (such as relative to other bacterial strains in a population of bacteria) and/or actively infecting a patient when RNA modifications, which can include one or more tRNA modifications, are identified in nucleic acid from the bacterial strain.
- the bacterial strain is determined to be more active and/or actively infecting the patient when more RNA modifications are identified in nucleic acid from the bacterial strain relative to a control.
- the control is a known level of RNA modifications in the bacterial strain when the bacterial strain is not actively infecting an individual and/or when the bacterial strain is not in a growth mode. In some aspects, the control is an average level of RNA modifications in a population of bacteria, including a population of bacteria in which the bacterial strain is present.
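The comparison against a control described above can be sketched as a simple fold-change check. This is an illustration only: the text says a strain is considered more active when more RNA modifications are identified relative to a control, but it does not give a numeric threshold, so `min_fold` and the toy strain names and levels below are hypothetical.

```python
def population_average(levels):
    """Average modification level across a population of strains."""
    values = list(levels)
    if not values:
        raise ValueError("at least one level is required")
    return sum(values) / len(values)


def is_more_active(strain_level, control_level, min_fold=1.0):
    """Flag a strain whose modification level exceeds the control.

    min_fold is a hypothetical threshold on strain_level / control_level.
    """
    if control_level <= 0:
        raise ValueError("control_level must be positive")
    return strain_level / control_level > min_fold


if __name__ == "__main__":
    levels = {"strainX": 0.30, "strainY": 0.10, "strainZ": 0.20}  # toy values
    control = population_average(levels.values())
    for name, level in levels.items():
        print(name, is_more_active(level, control))
```

Either form of control named in the text fits this shape: a known per-strain baseline can be passed directly as `control_level`, or the population average can be computed first.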
- kits comprising one or more reagents capable of detecting the RNA modification signature disclosed herein.
- the kit comprises a reverse transcriptase.
- the kit comprises an HIV polymerase.
- the kit comprises one or more primers, or other reagents, capable of detecting specific RNA modification markers, including any of the RNA modification markers disclosed herein.
- a method for predicting disease in a patient comprising: receiving sequencing data obtained from the patient; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the patient has the disease, wherein the sequencing data comprises an RNA modification signature from cell-free RNA.
- the disease is an inflammatory disease, irritable bowel syndrome, sepsis, graft versus host disease (GvHD), GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
- GvHD graft versus host disease
- RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
- RNA modification signature comprises RNA modifications from microbiome-derived RNA.
- RNA modification signature comprises RNA modifications from patient-derived RNA.
- RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
- a method of detecting a microbiome signature in a biological sample comprising detecting an RNA modification signature in cell-free RNA from the biological sample.
- RNA sample comprises approximately 0.1-50 ng of total RNA.
- RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
- RNA modification signature comprises RNA modifications from microbiome-derived RNA.
- a method of treating disease in a patient comprising administering to the patient an effective amount of a therapy, wherein the patient has been determined to have a disease after detection of an RNA modification signature in cell free RNA from a biological sample from the patient.
- RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
- RNA modification signature comprises RNA modifications from microbiome-derived RNA.
- RNA modification signature comprises RNA modifications from patient-derived RNA.
- RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
- a method of diagnosing a disease in a patient comprising detecting an RNA modification signature in cell free RNA in a biological sample from the patient.
- RNA modification signature comprises m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
- RNA modification signature comprises RNA modifications from microbiome-derived RNA.
- RNA modification signature comprises RNA modifications from patient-derived RNA.
- RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
- a method of detecting a modification signature on a ribonucleic acid (RNA) sample comprising: ligating a 3’ adapter to a plurality of RNA molecules in the RNA sample; reverse transcribing the plurality of RNA molecules in the RNA sample using a reverse transcriptase to generate a population of complementary DNA (cDNA); ligating a 3’ linker to the population of cDNA; and sequencing the population of cDNA.
- RNA ribonucleic acid
- RNA sample comprises cell free RNA.
- RNA sample is from human plasma.
- RNA sample comprises approximately 0.1-50 ng of total RNA.
- a method of determining bacterial activity and/or active bacterial infections in a patient comprising identifying one or more RNA modifications in nucleic acid from one or more bacteria in a biological sample obtained from the patient.
- RNA modifications comprise one or more tRNA modifications.
- RNA modifications comprise m1A, m3C, m1G, m2,2G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof.
- A, B, and/or C includes: A alone, B alone, C alone, a combination of A and B, a combination of A and C, a combination of B and C, or a combination of A, B, and C. In other words, “and/or” operates as an inclusive or.
- compositions and methods for their use can “comprise,” “consist essentially of,” or “consist of” any of the ingredients or steps disclosed throughout the specification. Compositions and methods “consisting essentially of” any of the ingredients or steps disclosed limits the scope of the claim to the specified materials or steps which do not materially affect the basic and novel characteristic of the claimed invention.
- any limitation discussed with respect to one aspect of the invention may apply to any other aspect of the invention.
- any composition of the invention may be used in any method of the invention, and any method of the invention may be used to produce or to utilize any composition of the invention.
- Any aspect discussed with respect to one aspect of the disclosure applies to other aspects of the disclosure as well and vice versa.
- any step in a method described herein can apply to any other method.
- any method described herein may have an exclusion of any step or combination of steps.
- FIGs. 1A-1J show LIME-seq reveals the presence of host non-coding RNAs and microbiome-derived RNAs in human plasma cell-free RNA.
- a, The high-resolution Bioanalyzer assay visualizes the input RNA sizes in LIME-seq libraries for human plasma cell-free RNA (cfRNA) and HepG2 small RNA (<200 nt), respectively.
- the plasma cfRNA samples were collected from the blood of two non-cancer individuals, and the small RNA was purified from HepG2 total RNA.
- b, Schematic overview of LIME-seq. Light green circles mark diverse cfRNA modifications that induce mutation signatures in the presence of HIV RT.
- RNA/cDNA dual ligations ensure the capture of mutation signals at read ends and internal positions. c, The relative expression levels of the top 50 abundant RNA transcripts in plasma cfRNA, depicted as a bar chart, indicate the dominant distribution of tRNA and rRNA. d, The proportional distribution of all mapped reads classified by RNA species, when making LIME-seq libraries with plasma cfRNA. e, The read coverage of representative tRNAs in plasma cfRNA, obtained by LIME-seq. Intact read coverage is well preserved in host tRNA regions. f, The read coverage of representative non-tRNA RNA species in plasma cfRNA, obtained by LIME-seq.
- the stacked bar chart shows the proportional distribution of LIME-seq reads mapped to the human genome versus the unmapped reads that can be further aligned to microbial genomes, when starting with plasma cfRNA.
- the box plot on the right side indicates the ratios of total microbiome-related reads in unmapped reads.
- i, The pie chart shows the percentage of microbiome-derived cfRNA from the top 13 abundant microbial species and the rest (other), when counting LIME-seq reads mapped to diverse microbial genomes. j,
- FIGs. 2A-2H show LIME-seq detects RNA modification signatures in microbiome-derived RNAs in plasma cfRNA, offering a new set of biomarkers for CRC early diagnosis. a, IGV plot of a representative microbiome-derived tRNA region in Stenotrophomonas maltophilia.
- b, Heatmap illustrates the differential patterns of mutation signatures in microbiome-derived RNAs of plasma cfRNA, when comparing CRC patients with non-cancer controls.
- c, Receiver Operating Characteristic (ROC) curve of the classifier describes how LIME-seq of plasma cfRNA distinguishes CRC patients from non-cancer controls, using microbiome-derived RNA abundance or mutation patterns in these RNAs. Leave-one-out cross-validation (LOOCV) was conducted to evaluate the performance. d, Scatter plot demonstrates the high sensitivity of the classifier in predicting the probability of different CRC stages, based on mutation patterns in microbiome-derived RNAs from plasma cfRNA.
- ROC Receiver Operating Characteristic curve
- the SHAP values obtained from the classifier model point out critical mutation sites in 10 representative microbes, which predominantly contribute to the CRC diagnostic model.
- f, Box plot compares the integral mutation patterns in microbiome-derived RNAs and indicates significant differences between CRC patients and non-cancer individuals, with 3 representative microbes observed in LIME-seq data of plasma cfRNA, as validation of the CRC diagnostic model. g, Box plot illustrates the dynamic profiles of stage-specific mutation patterns in microbiome-derived RNAs from plasma cfRNA, with Hydrogenophaga/A and Ramlibacter tataouinensis/G as representative microbes. The variations of these mutation patterns across different CRC stages show the potential for monitoring cancer-progression-related changes via the CRC diagnostic model. h, External validation of Validation 1 and Validation 2 compared to training.
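The evaluation named in the caption (leave-one-out cross-validation summarized by an ROC curve) can be sketched in a few lines. This is an illustration under stated assumptions: the caption does not say which classifier was used, so a nearest-centroid scorer stands in for it here, and the feature vectors used in the demo are toy numbers, not data from the figures.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def score(x, case_centroid, control_centroid):
    """Higher score means x is closer to the case centroid than the control one."""
    return math.dist(x, control_centroid) - math.dist(x, case_centroid)


def loocv_scores(X, y):
    """Leave-one-out: score each sample with a model fit on all the others."""
    scores = []
    for i in range(len(X)):
        train = [(x, label) for j, (x, label) in enumerate(zip(X, y)) if j != i]
        cases = [x for x, label in train if label == 1]
        controls = [x for x, label in train if label == 0]
        scores.append(score(X[i], centroid(cases), centroid(controls)))
    return scores


def roc_auc(scores, y):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy mutation-ratio vectors: label 0 = control, label 1 = case.
    X = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
         [0.60, 0.70], [0.70, 0.65], [0.65, 0.80]]
    y = [0, 0, 0, 1, 1, 1]
    print(roc_auc(loocv_scores(X, y), y))
```

The point of the leave-one-out loop is that each sample is scored by a model that never saw it, which matters when the cohort is small.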
- FIGs. 3A-3L show quantitative identification of RNA methylations in small RNAs from HepG2 cells by LIME-seq.
- a-f, Bar plots show the mutation ratios of RNA methylation sites m1A58 (a), m1A9 (b), m1G9 (c), m1G37 (d), m2,2G26 (e), and m3C (f) detected by LIME-seq in small RNAs from HepG2 cells.
- g, IGV plot of a representative tRNA-Met-i region in HepG2 small RNA, displaying the read coverage and observed mutation sites.
- h, Correlation of mutation ratios between two replicates of HepG2 small RNA (50 ng) detected by LIME-seq.
- PCC represents Pearson’s correlation coefficient. i, Correlation of mutation ratios detected in HepG2 small RNA across different input amounts (50 ng, 15 ng, 5 ng, and 1.5 ng). j, IGV plot of a representative tRNA-Met-i region in HepG2 small RNA with varying input amounts. k-l, Bar plots showing mutation ratios of RNA methylation sites at m1A58 (k) and m1G9 (l) in tRNA-Met-i across different RNA input amounts.
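For reference, the Pearson correlation coefficient (PCC) used to compare mutation ratios between replicates or input amounts can be computed as below. This is a generic textbook implementation offered as a sketch; the replicate values in the demo are made up and are not data from the figures.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Toy per-site mutation ratios from two hypothetical replicates.
    rep1 = [0.05, 0.20, 0.45, 0.80]
    rep2 = [0.06, 0.18, 0.50, 0.78]
    print(round(pearson(rep1, rep2), 4))
```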
- FIGs. 4A-4M show the read coverage and methylation profile of host RNAs in human plasma cell-free RNAs, revealed by LIME-seq.
- a-e, Line plots show the coverage of cfRNA LIME-seq reads in tRNA (a), ncRNA (b), snRNA (c), 18S rRNA (d), and 28S rRNA (e) from the human genome.
- the dark lines represent the average coverage across cfRNAs from 36 non-cancer individuals, while the light regions indicate the standard deviation of RNA coverage variability in these individuals. f, Stacked bars show the mutation ratios of methylation sites in rRNAs detected by LIME-seq in cfRNAs.
- FIGs. 5A-5G show the read coverage and methylation profile of microbiome-derived cfRNA in human plasma, revealed by LIME-seq.
- a, Hierarchical visualization of the composition and relative abundance of microbial species identified in cfRNA using LIME-seq for one typical healthy individual. b-c, Line plots show the coverage of microbiome-derived cfRNA LIME-seq reads in Pseudomonas sp. CIP-10 rRNA (b) and tRNA (c) regions for typical individuals.
- d-g, IGV plots of representative tRNA and rRNA regions including tRNA-Pro-TGG (d), tRNA-Pro-CGG (e), tRNA-Leu-CAG (f), and rRNA (g), comparing RNA from lab-cultured S. maltophilia with plasma cfRNA reads. AlkB treatment was applied to lab-cultured RNA to confirm modifications. Skin samples were used as a negative control, with distinct coverage patterns and low abundance indicating that the reads originate from other members of the microbiome rather than S. maltophilia.
- FIGs. 6A-6D show comparison of microbiome-derived cfRNA origins in CRC patients versus non-cancer controls. a, Boxplot compares the ratio of microbial reads from LIME-seq between non-cancer controls and CRC patients. b-c, Bar plots show the relative abundance of the top 50 most abundant microbial species detected by LIME-seq in plasma samples from 36 non-cancer individuals (b) and 27 CRC patients (c).
- Abundance is determined by Log2RPM values calculated using LIME-seq data and averaged from 35 non-cancer individuals and 27 cancer patients, respectively. d, Venn diagram shows the overlap of the top 50 most abundant microbial species in non-cancer individuals and patients with colorectal cancer (CRC).
- FIGs. 7A-7N show differential expression and methylation profiles for host RNAs in plasma cfRNA from CRC patients versus non-cancer controls.
- a, Bar plot shows the relative abundance of the top 50 highly expressed transcripts identified in cfRNA data from plasma of CRC patients. Abundance is based on Log2RPM values derived from LIME-seq data.
- b, Boxplot compares the abundance of various RNA transcripts mapping to the human genome in CRC patients and non-cancer individuals. The p value obtained by Student's t-test (two-tailed) is shown in the figure.
- c, Volcano plot shows the differential RNA methylation analysis of human rRNA between CRC patients and non-cancer individuals as revealed by LIME-seq.
- Up-regulated tRNA methylated sites (2) are marked in red, and down-regulated sites (5) are marked in blue. h-i, Box plots compare the mutational signatures detected by LIME-seq in human (h) m3C32 at tRNA-Lys(CTT) and (i) m3C32 at tRNA-Pro(CGG). The p value obtained by Student's t-test (two-tailed) is shown in the figure. j, Schematic illustrating the pipeline for the identification of differential methylation sites on tRNAs in CRC tumor tissues versus normal adjacent tissues (NATs) from four CRC patients using LIME-seq.
- k-l, Volcano plot (k) and scatter plot (l) showing differential RNA methylation analysis in human tRNA using LIME-seq, comparing CRC tumor tissue and NATs. p-values are calculated using a paired t-test.
- m, Receiver operating characteristic (ROC) curve of a classifier based on mutational signatures from human tRNA in cfRNA obtained from LIME-seq. The inputs for the classifier are the mutation ratios in cfRNA of the tRNA sites shown in panel l. Validation was performed using leave-one-out cross-validation (LOOCV).
- n, Scatter plot showing the low correlation between the fold change in mutation ratios between CRC tumor and NATs and those between CRC patients’ and non-cancer individuals’ cfRNA.
- FIGs. 8A-8C show that LIME-seq reveals the differential abundance of microbial species as a biomarker for CRC diagnosis. a, Workflow for CRC diagnosis based on the abundance of microbial species in human plasma as revealed by LIME-seq. b, ROC curve of the classifier based on microbial species abundance obtained from LIME-seq comparing CRC patients and non-cancer individuals, validated using LOOCV. c, The SHAP values from the classifier model reveal the important microbes in CRC (the p value obtained by t-test between non-cancer individuals and CRC patients is marked in blue). The SHAP values illustrate the significance of different inputs in determining the predictive capability of the statistical model.
- FIGs. 9A-9P show that RNA modification profiles in microbiome-derived cfRNA distinguish CRC patients from non-cancer controls.
- a, Volcano plot shows the differential RNA methylation analysis in microbiome-derived cfRNA, comparing mutation signatures between CRC patients and non-cancer individuals.
- b, The pie chart shows the composition of mutational signatures assessed by LIME-seq in microbiome-derived cfRNA. The percentages of SNPs and RNA modifications are shown in the figure.
- RNA modifications can be detected as mixed mutational signatures during the reverse transcription process by HIV-RT.
- Bar plot shows the ratio of mutational signatures with statistical difference for both SNP and modification sites.
- g, Workflow for developing a diagnostic model for colorectal cancer patients using epitranscriptional signatures in microbiome-derived cfRNA with LIME-seq.
- h, Confusion matrix of the classifier based on mutational signatures from LIME-seq between CRC patients and non-cancer controls, evaluated through LOOCV.
- i, Schematic overview of different validation methods.
- ROC of the classifier based on mutational signatures and abundance of microbial species from LIME-seq between CRC patients and non-cancer controls, evaluated through bootstrapping with 25% of the dataset as the validation set.
- The light-colored regions reflect the standard error of 20 repeats. m-o, Dynamic mutational sites in (m) Halomonas, (n) Hydrogenophaga, and (o) Clostridium tetani across different stages identified by the statistical model.
- FIGs. 10A-10D show an evaluation of the stability of the mutational signatures detected on microbiome-derived cfRNA.
- a Schematic of the experimental design for evaluating stability. Whole blood was collected from two individuals, with three tubes per individual, and stored at 4°C for ⁇ 2 hours, 8 hours, and 24 hours before plasma extraction.
- b Correlation of mutation ratios detected in cfRNA from plasma stored at 4°C for ⁇ 2 hours, 8 hours, and 24 hours
- c Relative proportion of microbial reads observed in cfRNA with different storage conditions
- d Low correlation of host transcript expression levels between cfRNA obtained at 2 hours and 24 hours, suggesting cell leakage over time.
- FIGs. 11A-11E show the validation of a subset of mutational signatures as biomarkers for early CRC diagnosis with two external cohorts.
- a ROC curve of the classifier based on the mutational signature of microbial species detected by LIME-seq in the training cohort, evaluated using model fitting and LOOCV.
- b Schematic of the experimental design and sample sizes for the external validation cohorts
- c Mutational signatures of 12 selected sites with normalized, centered mutational level across the training and validation cohorts
- d-e Relative CRC scores of patients in external validation cohort 1 (d) and external validation cohort 2 (e).
- FIGs. 12A-12B show that RNA modifications in plasma cfRNA could be clinically informative for pancreatic cancer diagnosis.
- a, Two-dimensional PCA illustrating differences in the mutational signatures of differentially methylated sites in microbiome-derived cfRNA between pancreatic cancer patients and non-cancer controls.
- b, Random permutation of patient and control labels significantly reduces the classification ability observed in the PCA.
- FIG. 13 is a block diagram illustrating an example process 100 for predicting disease according to non-limiting aspects of the present disclosure.
- FIGs. 14A-14D show dynamic regulation of bacterial tRNA modifications under different growth conditions
- a-b, Scatter plots showing LIME-seq-derived mutation ratios for tRNA modification sites in Pseudomonas aeruginosa (a) and Staphylococcus aureus (b) grown on solid medium (x-axis) versus liquid medium (y-axis). Each point represents a single site; red points indicate sites with significant differences between conditions. Dashed lines denote twofold deviation boundaries. c, Comparison of tRNA modification levels in Escherichia coli grown at high versus low cell densities.
- Aspects of the present disclosure relate to compositions, methods, and kits for detection and analysis of an RNA modification signature.
- the RNA modification signature comprises one or more modifications, including any of Table A.
- aspects of the methods include assaying nucleic acids to determine expression levels and/or modification levels of nucleic acids (e.g., DNA, RNA) including such expression and/or modification in cell free nucleic acid. Certain example methods for detection and analysis of nucleic acid methylation are described herein.
- methods provided herein reduce levels of background in assays comprising low and/or ultralow RNA inputs relative to canonical-BS treatments. In some aspects, methods provided herein reduce false positive rates by equal to about or greater than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
- methods provided herein increase the rate of true positive detection by equal to about or greater than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
- HPLC-UV (high performance liquid chromatography-ultraviolet), a technique developed by Kuo and colleagues in 1980 (described further in Kuo K.C. et al., Nucleic Acids Res. 1980;8:4763-4776, which is herein incorporated by reference), can be used to quantify the amount of RNA methylation present in a hydrolyzed DNA sample.
- the method includes hydrolyzing the RNA into its constituent nucleoside bases, which are separated chromatographically and, then, the fractions are measured.
- Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is a high-sensitivity alternative to HPLC-UV, which requires much smaller quantities of the hydrolyzed DNA sample.
- LC-MS/MS has been validated for detecting methylation levels ranging from 0.05% to 10%, and it can confidently detect differences between samples as small as approximately 0.25% of the total cytosine residues, which corresponds to approximately 5% differences in global DNA methylation.
- the procedure routinely requires 50-100 ng of DNA sample, although much smaller amounts (as low as 5 ng) have been successfully profiled.
- Detection of fragments that are differentially methylated could be achieved by traditional PCR-based amplification fragment length polymorphism (AFLP), restriction fragment length polymorphism (RFLP) or protocols that employ a combination of both.
- the LUMA (luminometric methylation assay) technique utilizes a combination of two DNA restriction digest reactions performed in parallel and subsequent pyrosequencing reactions to fill-in the protruding ends of the digested DNA strands.
- One digestion reaction is performed with the CpG methylation-sensitive enzyme HpaII, while the parallel reaction uses the methylation-insensitive enzyme MspI, which will cut at all CCGG sites.
- the enzyme EcoRI is included in both reactions as an internal control. Both MspI and HpaII generate 5'-CG overhangs after DNA cleavage, whereas EcoRI produces 5'-AATT overhangs, which are then filled in with the subsequent pyrosequencing-based extension assay.
- the measured light signal calculated as the HpaII/MspI ratio is proportional to the amount of unmethylated DNA present in the sample.
- the specificity of the method is very high and the variability is low, which is essential for the detection of small changes in global methylation.
- LUMA requires only a relatively small amount of DNA (250-500 ng), demonstrates little variability and has the benefit of an internal control to account for variability in the amount of DNA input.
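- As a minimal sketch only, the HpaII/MspI ratio described above could be computed from pyrosequencing peak heights as follows. The function name, the argument names, and the choice to divide each digest signal by the EcoRI signal from the same reaction are assumptions made for illustration and are not taken from this disclosure.

```python
def luma_ratio(hpaii_signal, mspi_signal, ecori_in_hpaii, ecori_in_mspi):
    """Return the HpaII/MspI ratio after normalizing each digest to EcoRI.

    All four arguments are hypothetical peak heights: the HpaII and MspI
    fill-in signals, and the EcoRI internal-control signal measured in
    each of the two parallel reactions.
    """
    if min(mspi_signal, ecori_in_hpaii, ecori_in_mspi) <= 0:
        raise ValueError("MspI and EcoRI signals must be positive")
    # Normalizing to EcoRI corrects for differences in DNA input.
    hpaii_norm = hpaii_signal / ecori_in_hpaii
    mspi_norm = mspi_signal / ecori_in_mspi
    return hpaii_norm / mspi_norm


# Example with made-up peak heights: equal EcoRI control in both reactions.
ratio = luma_ratio(30.0, 60.0, 100.0, 100.0)
print(ratio)  # 0.5
```

- Under these assumptions, a ratio near 1 would indicate that most CCGG sites are unmethylated, and a ratio near 0 would indicate that most are methylated.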
- RNA may be analyzed by sequencing.
- the RNA may be prepared for sequencing by any method known in the art, such as but not limited to, poly-A selection, cDNA synthesis, stranded or nonstranded library preparation, or a combination thereof.
- the RNA may be prepared for any type of RNA sequencing technique, including but not limited to, stranded specific RNA sequencing. In some aspects, sequencing may be performed to generate approximately 10M, 15M, 20M, 25M, 30M, 35M, 40M or more reads, including paired reads.
- the sequencing may be performed at a read length of approximately 50 bp, 55 bp, 60 bp, 65 bp, 70 bp, 75 bp, 80 bp, 85 bp, 90 bp, 95 bp, 100 bp, 105 bp, 110 bp, or longer (or any range derivable therein).
- raw sequencing data may be converted to estimated read counts (RSEM), fragments per kilobase of transcript per million mapped reads (FPKM), and/or reads per kilobase of transcript per million mapped reads (RPKM).
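- The RPKM and FPKM normalizations named above share one formula and differ only in whether reads or fragments are counted. The following is a minimal sketch of that formula; the function name and the example numbers are hypothetical, and estimating read counts with RSEM is a separate step that is not shown.

```python
def per_kilobase_per_million(count, transcript_length_bp, total_mapped):
    """Normalize a count by transcript length (kb) and library size (millions).

    Pass read counts to obtain RPKM, or fragment counts (one per read
    pair) to obtain FPKM.
    """
    if transcript_length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length and library size must be positive")
    # count / (length / 1e3) / (total / 1e6) simplifies to the line below.
    return count * 1e9 / (transcript_length_bp * total_mapped)


# 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
print(per_kilobase_per_million(500, 2000, 10_000_000))  # 25.0
```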
- RNA may be used for amplification of one or more regions of interest followed by sequencing. Accordingly, aspects of the disclosure may include sequencing nucleic acids to detect and/or quantify methylation of nucleic acid biomarkers. In some aspects, the methods of the disclosure include a sequencing method. Sequencing may be excluded from certain methods of the disclosure. Example sequencing methods include, but are not limited to, those described below.
- a. Massively parallel signature sequencing (MPSS).
- b. Polony sequencing.
- the Polony sequencing method, developed in the laboratory of George M. Church at Harvard, was among the first next-generation sequencing systems and was used to sequence a full genome in 2005. It combined an in vitro paired-tag library with emulsion PCR, an automated microscope, and ligation-based sequencing chemistry to sequence an E. coli genome at an accuracy of >99.9999% and a cost approximately 1/9 that of Sanger sequencing.
- c. 454 pyrosequencing™.
- a parallelized version of pyrosequencing was developed by 454 Life Sciences™, which has since been acquired by Roche Diagnostics™.
- the method amplifies DNA inside water droplets in an oil solution (emulsion PCR), with each droplet containing a single DNA template attached to a single primer-coated bead that then forms a clonal colony.
- the sequencing machine contains many picoliter-volume wells each containing a single bead and sequencing enzymes.
- Pyrosequencing uses luciferase to generate light for detection of the individual nucleotides added to the nascent DNA, and the combined data are used to generate sequence read-outs. This technology provides intermediate read length and price per base compared to Sanger sequencing on one end and Solexa and SOLiD™ on the other.
- d. Illumina™ (Solexa) sequencing.
- Solexa developed a sequencing method based on reversible dye-terminators technology, and engineered polymerases, that it developed internally.
- the terminator chemistry was developed internally at Solexa and the concept of the Solexa system was invented by Balasubramanian and Klenerman from Cambridge University's chemistry department.
- Solexa acquired the company Manteia Predictive Medicine in order to gain a massively parallel sequencing technology based on "DNA Clusters", which involves the clonal amplification of DNA on a surface.
- the cluster technology was co-acquired with Lynx Therapeutics of California. Solexa Ltd. later merged with Lynx to form Solexa Inc.
- DNA molecules and primers are first attached on a slide and amplified with polymerase so that local clonal DNA colonies, later coined "DNA clusters", are formed.
- four types of reversible terminator bases (RT-bases) are added and non-incorporated nucleotides are washed away.
- a camera takes images of the fluorescently labeled nucleotides, then the dye, along with the terminal 3' blocker, is chemically removed from the DNA, allowing for the next cycle to begin.
- the DNA chains are extended one nucleotide at a time and image acquisition can be performed at a delayed moment, allowing for very large arrays of DNA colonies to be captured by sequential images taken from a single camera.
- e. SOLiD™ sequencing.
- SOLiD™ technology employs sequencing by ligation.
- a pool of all possible oligonucleotides of a fixed length are labeled according to the sequenced position.
- Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for matching sequences results in a signal informative of the nucleotide at that position.
- the DNA is amplified by emulsion PCR.
- the resulting beads, each containing single copies of the same DNA molecule, are deposited on a glass slide. The result is sequences of quantities and lengths comparable to Illumina™ sequencing.
- f. Ion Torrent™ semiconductor sequencing.
- Ion Torrent™ Systems Inc. developed a system based on using standard sequencing chemistry, but with a novel, semiconductor-based detection system. This method of sequencing is based on the detection of hydrogen ions that are released during the polymerization of DNA, as opposed to the optical methods used in other sequencing systems.
- a microwell containing a template DNA strand to be sequenced is flooded with a single type of nucleotide. If the introduced nucleotide is complementary to the leading template nucleotide it is incorporated into the growing complementary strand. This causes the release of a hydrogen ion that triggers a hypersensitive ion sensor, which indicates that a reaction has occurred. If homopolymer repeats are present in the template sequence multiple nucleotides will be incorporated in a single cycle. This leads to a corresponding number of released hydrogens and a proportionally higher electronic signal.
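- The flow-by-flow behavior described above, including the proportionally larger signal from homopolymer repeats, can be illustrated with a small simulation. This is a sketch under stated assumptions: the function name, the flow order "TACG", and the convention that the template string is written in the order the polymerase reads it are all hypothetical choices, and real instrument signals are noisy rather than integer-valued.

```python
def simulate_flows(template, flow_order="TACG", max_flows=400):
    """Return (nucleotide, incorporations) for each nucleotide flow.

    `template` lists template bases in the order the polymerase reads
    them. Each flow floods the well with one nucleotide type; every
    consecutive matching position is extended in that single flow, so a
    homopolymer run of length n yields a count of n.
    """
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # Bases that must be added to the growing complementary strand.
    to_add = [complement[base] for base in template.upper()]
    signals = []
    position = 0
    flow = 0
    while position < len(to_add) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        count = 0
        while position < len(to_add) and to_add[position] == nucleotide:
            count += 1  # one hydrogen ion released per incorporation
            position += 1
        signals.append((nucleotide, count))
        flow += 1
    return signals


# Template AAAC requires T, T, T, G on the new strand.
print(simulate_flows("AAAC"))
# [('T', 3), ('A', 0), ('C', 0), ('G', 1)]
```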
- g. DNA Nanoballs™ sequencing.
- DNA Nanoballs™ sequencing is a type of high throughput sequencing technology used to determine the entire genomic sequence of an organism.
- the company Complete Genomics® uses this technology to sequence samples submitted by independent researchers.
- the method uses rolling circle replication to amplify small fragments of genomic DNA into DNA nanoballs. Unchained sequencing by ligation is then used to determine the nucleotide sequence.
- This method of DNA sequencing allows large numbers of DNA nanoballs to be sequenced per run and at low reagent costs compared to other next generation sequencing platforms. However, only short sequences of DNA are determined from each DNA nanoball, which can make mapping the short reads to a reference genome difficult. This technology has been used for multiple genome sequencing projects.
- h. Heliscope single molecule sequencing.
- Heliscope sequencing is a method of single-molecule sequencing developed by Helicos Biosciences. It uses DNA fragments with added poly-A tail adapters which are attached to the flow cell surface. The next steps involve extension-based sequencing with cyclic washes of the flow cell with fluorescently labeled nucleotides (one nucleotide type at a time, as with the Sanger method). The reads are performed by the Heliscope sequencer. The reads are short, up to 55 bases per run, but recent improvements allow for more accurate reads of stretches of one type of nucleotide. This sequencing method and equipment were used to sequence the genome of the M13 bacteriophage.
- i. Single molecule real time (SMRT) sequencing.
- SMRT sequencing is based on the sequencing by synthesis approach.
- the DNA is synthesized in zero-mode wave-guides (ZMWs) - small well-like containers with the capturing tools located at the bottom of the well.
- the sequencing is performed with use of unmodified polymerase (attached to the ZMW bottom) and fluorescently labelled nucleotides flowing freely in the solution.
- the wells are constructed in a way that only the fluorescence occurring by the bottom of the well is detected.
- the fluorescent label is detached from the nucleotide at its incorporation into the DNA strand, leaving an unmodified DNA strand.
- this methodology allows detection of nucleotide modifications (such as cytosine methylation). This happens through the observation of polymerase kinetics. This approach allows reads of 20,000 nucleotides or more, with average read lengths of 5 kilobases.
- methods involve amplifying and/or sequencing one or more target genomic regions using at least one pair of primers specific to the target genomic regions.
- the primers are heptamers.
- enzymes such as primases or a primase/polymerase combination enzyme are added to the amplification step to synthesize primers.
- arrays can be used to detect nucleic acids of the disclosure.
- An array comprises a solid support with nucleic acid probes attached to the support.
- Arrays typically comprise a plurality of different nucleic acid probes that are coupled to a surface of a substrate in different, known locations.
- These arrays, also described as “microarrays” or colloquially “chips”, have been generally described in the art (for example, U.S. Pat. Nos. 5,143,854, 5,445,934, 5,744,305, 5,677,195, 6,040,193, 5,424,186 and Fodor et al., 1991), each of which is incorporated by reference in its entirety for all purposes.
- arrays may be fabricated on a surface of virtually any shape or even a multiplicity of surfaces.
- Arrays may be nucleic acids on beads, gels, polymeric surfaces, fibers such as fiber optics, glass or any other appropriate substrate, see U.S. Pat. Nos. 5,770,358, 5,789,162, 5,708,153, 6,040,193 and 5,800,992, which are hereby incorporated in their entirety for all purposes.
- RNA-Seq, Tagged-Amplicon deep sequencing (TAm-Seq), pyrophosphorolysis-activation polymerization (PAP), next generation RNA sequencing, northern hybridization, hybridization protection assay (HPA) (GenProbe), branched DNA (bDNA) assay (Chiron), rolling circle amplification (RCA), single molecule hybridization detection (US Genomics), Invader assay (Thir
- Amplification primers or hybridization probes can be prepared to be complementary to a genomic region, biomarker, probe, or oligo described herein.
- the term "primer" as used herein is meant to encompass any nucleic acid that is capable of priming the synthesis of a nascent nucleic acid in a template-dependent process and/or pairing with a single strand of an oligo of the disclosure, or portion thereof.
- primers are oligonucleotides from ten to twenty and/or thirty nucleotides in length, but longer sequences can be employed.
- Primers may be provided in double-stranded and/or single-stranded form, although the single-stranded form is preferred.
- a primer of between 13 and 100 nucleotides, particularly between 17 and 100 nucleotides in length, or in some aspects up to 1-2 kilobases or more in length, allows the formation of a duplex molecule that is both stable and selective.
- Molecules having complementary sequences over contiguous stretches greater than 20 bases in length may be used to increase stability and/or selectivity of the hybrid molecules obtained.
- One may design nucleic acid molecules for hybridization having one or more complementary sequences of 20 to 30 nucleotides, or even longer where desired.
- Such fragments may be readily prepared, for example, by directly synthesizing the fragment by chemical means or by introducing selected sequences into recombinant vectors for recombinant production.
- each probe/primer comprises at least 15 nucleotides.
- each probe can comprise at least or at most 20, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 400 or more nucleotides (or any range derivable therein). They may have these lengths and have a sequence that is identical or complementary to a gene described herein.
- each probe/primer has relatively high sequence complexity and does not have any ambiguous residue (undetermined "n" residues).
- the probes/primers can hybridize to the target gene, including its RNA transcripts, under stringent or highly stringent conditions. It is contemplated that probes or primers may have inosine or other design implementations that accommodate recognition of more than one human sequence for a particular biomarker.
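- The length and ambiguity criteria above lend themselves to a simple automated check. The sketch below is illustrative only: the function names, the default minimum of 15 nucleotides (taken from the aspect described above), and the decision to treat any character outside A, C, G, T, and U as ambiguous are assumptions. Sequence complexity and stringency are not evaluated here.

```python
def check_probe(sequence, min_length=15):
    """Return a list of problems found in a candidate probe or primer."""
    sequence = sequence.upper()
    problems = []
    if len(sequence) < min_length:
        problems.append(f"length {len(sequence)} is below {min_length}")
    # Anything outside the standard bases (for example "N") is ambiguous.
    ambiguous = sorted(set(sequence) - set("ACGTU"))
    if ambiguous:
        problems.append("ambiguous residues: " + ", ".join(ambiguous))
    return problems


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


print(check_probe("ACGTACGTACGTACGTA"))  # []
print(check_probe("ACGTN"))
print(reverse_complement("AACG"))  # CGTT
```

- A probe designed to hybridize to a target strand would typically be the reverse complement of that strand, which is why the second helper is included.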
- For applications requiring high selectivity, one will typically desire to employ relatively high stringency conditions to form the hybrids, for example, relatively low salt and/or high temperature conditions, such as provided by about 0.02 M to about 0.10 M NaCl at temperatures of about 50°C to about 70°C.
- Such high stringency conditions tolerate little, if any, mismatch between the probe or primers and the template or target strand and would be particularly suitable for isolating specific genes or for detecting specific mRNA transcripts. It is generally appreciated that conditions can be rendered more stringent by the addition of increasing amounts of formamide.
- quantitative RT-PCR (such as but not limited to TaqMan™, ABI) is used for detecting and comparing the levels or abundance of nucleic acids in samples.
- The concentration of the target DNA in the linear portion of the PCR process is proportional to the starting concentration of the target before the PCR was begun.
- By determining the concentration of the PCR products of the target DNA in PCR reactions that have completed the same number of cycles and are in their linear ranges, it is possible to determine the relative concentrations of the specific target sequence in the original DNA mixture. This direct proportionality between the concentration of the PCR products and the relative abundances in the starting material is true in the linear range portion of the PCR reaction.
- the final concentration of the target DNA in the plateau portion of the curve is determined by the availability of reagents in the reaction mix and is independent of the original concentration of target DNA. Therefore, the sampling and quantifying of the amplified PCR products may be carried out when the PCR reactions are in the linear portion of their curves.
- relative concentrations of the amplifiable DNAs may be normalized to some independent standard/control, which may be based on either internally existing DNA species or externally introduced DNA species. The abundance of a particular DNA species may also be determined relative to the average abundance of all DNA species in the sample.
- the PCR amplification utilizes one or more internal PCR standards.
- the internal standard may be an abundant housekeeping gene in the cell or it can specifically be GAPDH, GUSB, or β-2 microglobulin. These standards may be used to normalize expression levels so that the expression levels of different gene products can be compared directly. A person of ordinary skill in the art would know how to use an internal standard to normalize expression levels.
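- Normalization to an internal standard, as described above, amounts to dividing each sample's target signal by the standard's signal from the same sample. The following is a minimal sketch assuming that both signals were measured in the linear range; the function name, the sample labels, and the numbers are hypothetical.

```python
def normalize_to_standard(target_signal, standard_signal):
    """Divide each sample's target signal by its internal-standard signal.

    Both arguments map a sample name to a product concentration measured
    in the linear portion of the amplification curve.
    """
    normalized = {}
    for sample, target in target_signal.items():
        standard = standard_signal[sample]
        if standard <= 0:
            raise ValueError(f"non-positive standard signal for {sample}")
        normalized[sample] = target / standard
    return normalized


# Made-up signals for a target and a housekeeping standard in two samples.
target = {"sample_A": 200.0, "sample_B": 100.0}
standard = {"sample_A": 1000.0, "sample_B": 250.0}
levels = normalize_to_standard(target, standard)
print(levels["sample_B"] / levels["sample_A"])  # relative abundance, B vs A
```

- In this made-up example the raw target signal is lower in sample B, but after normalization the target is relatively more abundant in sample B, which is the kind of correction the internal standard is meant to provide.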
- a problem inherent in some samples is that they are of variable quantity and/or quality. This problem can be overcome if the RT-PCR is performed as a relative quantitative RT-PCR with an internal standard in which the internal standard is an amplifiable DNA fragment that is similar or larger than the target DNA fragment and in which the abundance of the DNA representing the internal standard is roughly 5-100 fold higher than the DNA representing the target nucleic acid region.
- the relative quantitative RT-PCR uses an external standard protocol. Under this protocol, the PCR products are sampled in the linear portion of their amplification curves. The number of PCR cycles that are optimal for sampling can be empirically determined for each target DNA fragment. In addition, the nucleic acids isolated from the various samples can be normalized for equal concentrations of amplifiable DNAs.
- a nucleic acid array can comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250 or more different polynucleotide probes, which may hybridize to different and/or the same biomarkers. Multiple probes for the same gene can be used on a single nucleic acid array. Probes for other disease genes can also be included in the nucleic acid array.
- the probe density on the array can be in any range. In some aspects, the density may be or may be at least 50, 100, 200, 300, 400, 500 or more probes/cm² (or any range derivable therein).
- chip-based nucleic acid technologies such as those described by Hacia et al. (1996) and Shoemaker et al. (1996). Briefly, these techniques involve quantitative methods for analyzing large numbers of genes rapidly and accurately. By tagging genes with oligonucleotides or using fixed probe arrays, one can employ chip technology to segregate target molecules as high density arrays and screen these molecules on the basis of hybridization (see also, Pease et al., 1994; and Fodor et al., 1991). It is contemplated that this technology may be used in conjunction with evaluating the expression level of one or more cancer biomarkers with respect to diagnostic, prognostic, and treatment methods.
- Certain aspects may involve the use of arrays or data generated from an array. Data may be readily available. Moreover, an array may be prepared in order to generate data that may then be used in correlation studies.
- FIG. 13 is a block diagram illustrating an example process 100 for predicting disease according to non-limiting aspects of the present disclosure.
- process 100 includes a number of enumerated steps, but aspects of process 100 may include additional steps before, after, and in between the enumerated steps. In some aspects, one or more of the enumerated steps may be omitted or performed in a different order.
- Process 100, which may comprise a training phase 100A and an application phase 100B, may be performed by one or more computing devices.
- process 100 may be performed by one or more processors based on computer-executable or machine readable instructions stored in a memory of the one or more computing devices.
- the training phase 100A may be performed by a computing device separate or distinct from the computing device performing the application phase 100B, for example, to conserve computer resources and/or bandwidth.
- the training phase 100A may involve receiving reference data from reference patient populations, including patient populations known to have a certain disease, such as cancer or any other disease described herein.
- the reference data also includes data on the microbiome profile of a reference patient population.
- the reference data comprises data on one or more RNA modifications in RNA obtained from a reference patient, including cell-free RNA obtained from a reference patient.
- the reference data may be received from disparate sources, such as other computing systems, for example, electronic health record systems, clinical data management systems, sample analytical systems, or bioreactor system of network environment, or databases and/or repositories, for example, the patient health record database.
- received reference data may be linked together appropriately, for example, as corresponding to a reference patient.
- the computing device may vectorize the reference data to generate reference input feature vectors and reference output feature vectors.
- each reference input feature vector may be associated with a respective reference patient, and each reference output feature vector may be associated with one or more diseases or disease attributes.
- each reference input feature vector may be paired with a respective reference output feature vector.
- the vectorization may result in a reference input feature vector comprising composite data inputs for each of a plurality of input parameters.
- redundant or unnecessary parameters may be removed, for example, for dimensionality reduction of the reference input feature vector. The dimensionality reduction may enhance the speed of the machine learning model being trained or may be used to overcome issues of overfitting.
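- The vectorization and dimensionality reduction steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the method of the disclosure: the function names, the feature names, the choice to encode a missing measurement as 0.0, and the use of a constant-value filter as the reduction step are all hypothetical.

```python
def vectorize(records, feature_names):
    """Convert per-patient dictionaries into equal-length input vectors.

    A feature missing from a record is encoded as 0.0 (an assumption).
    """
    return [[float(record.get(name, 0.0)) for name in feature_names]
            for record in records]


def drop_constant_features(vectors, feature_names):
    """Remove features that take the same value for every patient.

    Such features cannot help separate patients, so dropping them is a
    simple form of dimensionality reduction.
    """
    keep = [j for j in range(len(feature_names))
            if len({vector[j] for vector in vectors}) > 1]
    reduced = [[vector[j] for j in keep] for vector in vectors]
    return reduced, [feature_names[j] for j in keep]


# Hypothetical reference data: mutation ratios at two sites and one
# microbial abundance value.
names = ["site_1_ratio", "site_2_ratio", "microbe_x_abundance"]
patients = [
    {"site_1_ratio": 0.10, "site_2_ratio": 0.5, "microbe_x_abundance": 3.0},
    {"site_1_ratio": 0.30, "site_2_ratio": 0.5},
]
vectors = vectorize(patients, names)
reduced, kept = drop_constant_features(vectors, names)
print(kept)  # ['site_1_ratio', 'microbe_x_abundance']
```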
- the computing device may associate the reference input feature vectors with reference output feature vectors on a machine learning model. For example, for each pair of reference input feature vector (representing input parameters for a respective CAR T-cell drug production process from a respective reference patient) and reference output feature vector (representing one or more attributes of the respective CAR T-cell drug that is produced), the input feature vector may be inputted within the machine learning model with randomized or initialized weights and/or biases for each input parameter represented by the reference input feature vector.
- the machine learning model may be structured to allow the weights to be iteratively adjusted through an error minimization process as the relation between the reference input feature vector and the respective reference output feature vector is determined.
- the input feature vector may be aligned along an input layer of the neural network, whereas the output feature vector may be aligned along an output layer separated from the input layer by one or more hidden layers.
- Each layer may comprise one or more nodes that may involve an activation function.
- the aforementioned weights may be assigned to the various nodes of the input layer.
- the computing device may train the machine learning model to iteratively minimize error within a predetermined threshold.
- the training module of a computing device may train the machine learning model by iteratively minimizing errors in determining a relation between parameters represented by the reference input feature vector and the reference output feature vector.
- the relation may be represented by the set of weights assigned to the parameters represented by the input feature vector.
- the initial set of weights for the parameters of the input feature vector may be tested for how correctly the set of weights attributes the significance of various parameters in their ability to predict the one or more diseases or disease attributes represented by the reference output feature vector.
- Each prediction may be quantitative and/or binary data that is compared to the known data for the one or more attributes.
- an iterative process occurs involving a new set of weights for the parameters.
- the training involves determining a correct set of weights for the input parameters of the input feature vector.
- Each weight may indicate a significance of a parameter associated with the weight in the parameter’s ability to predict the one or more diseases or disease attributes indicated by the output feature vector.
- the training process may occur over
- the computing device may output the trained machine learning model comprising the finalized set of weights indicating a relation between the input parameters and the one or more disease or disease attributes.
- the trained machine learning model may be stored in a memory or may otherwise be accessible to the computing device that performed the training or to another computing device.
- the trained machine learning model may be stored in a local or remote server that may be accessed by a computing device performing the application phase 100B.
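The training phase described above (initialized weights, iterative error minimization, stopping once the error falls within a predetermined threshold) can be sketched with a single-layer logistic model trained by gradient descent. This is a simplified stand-in for the neural network of the disclosure, not the disclosed model; the learning rate, threshold, and data are hypothetical.

```python
import math

def predict(weights, bias, features):
    """Probability of disease for one input feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(inputs, labels, lr=0.5, threshold=0.05, max_iter=10000):
    """Adjust weights iteratively until the mean log-loss drops below threshold."""
    weights = [0.0] * len(inputs[0])   # initialized weights
    bias = 0.0
    loss = float("inf")
    for _ in range(max_iter):
        preds = [predict(weights, bias, row) for row in inputs]
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1 - p)
            for y, p in zip(labels, preds)
        ) / len(labels)
        if loss < threshold:           # error within the predetermined threshold
            break
        for j in range(len(weights)):  # gradient step for each weight
            grad = sum((p - y) * row[j] for p, y, row in zip(preds, labels, inputs))
            weights[j] -= lr * grad / len(labels)
        bias -= lr * sum(p - y for p, y in zip(preds, labels)) / len(labels)
    return weights, bias, loss

# Hypothetical reference data: feature 0 separates disease (1) from control (0).
inputs = [[0.0, 1.0], [0.1, 0.8], [0.9, 0.9], [1.0, 1.1]]
labels = [0, 0, 1, 1]
weights, bias, final_loss = train(inputs, labels)
print(weights, bias, final_loss)
```

The returned weights play the role of the "finalized set of weights": the larger weight on feature 0 reflects that parameter's greater significance for the prediction.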
- the application phase 100B may involve a computing device having a processor receiving unstructured sequence data from a target patient having or suspected of having a disease.
- the target patient may be distinguishable from a reference patient as the target patient has unknown data for attributes that are otherwise predicted using the systems and methods presented herein.
- the reference patient may refer to a patient for whom data for diseases may already be known.
- reference patients as well as the attributes for disease may be applicable for the training phase 100A, whereas the target patient may be applicable for the application phase 100B.
- the computing device may receive sequencing data from a patient to be analyzed, including a patient having or suspected of having a disease.
- the sequencing data may include RNA sequencing data.
- the sequencing data may include data on one or more modifications to RNA obtained from the patient.
- the RNA obtained from the patient may be cell-free RNA.
- the computing device may apply the input feature vector to the trained machine learning model (e.g., from block 108) to generate an output feature vector predicting disease in the patient.
- the trained machine learning model may have a stored set of weights that indicate the capability for each of a plurality of parameters towards predicting the disease.
- the plurality of parameters may include, comprise, and/or correspond to the parameters represented by the input feature vector.
- the input feature vector may be associated with the set of weights in the trained machine learning model to generate the output feature vector predicting data for the one or more attributes.
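As a small illustration of the application phase, the sketch below applies a stored set of weights to a target patient's input feature vector to produce a disease probability. The weights, bias, and feature values are hypothetical placeholders, and the logistic form is an assumption made for brevity.

```python
import math

def predict_probability(weights, bias, features):
    """Combine the stored weights with the input feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

stored_weights = [4.0, -2.0]   # hypothetical weights from the training phase
stored_bias = -1.0             # hypothetical bias term
target_vector = [0.75, 0.25]   # hypothetical features for the target patient

probability = predict_probability(stored_weights, stored_bias, target_vector)
predicted_disease = probability > 0.5
print(round(probability, 4), predicted_disease)
```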
- the methods of the disclosure may be useful for evaluating RNA modifications for clinical and/or diagnostic purposes. Certain aspects relate to methods for evaluating RNA. Certain aspects relate to a method for evaluating a sample comprising RNA molecules. The evaluation may be the detection or determination of a particular RNA modification or the differential detection or determination of a particular modification.
- the sample may be from a biopsy such as from fine needle aspiration, core needle biopsy, vacuum assisted biopsy, incisional biopsy, excisional biopsy, punch biopsy, shave biopsy or skin biopsy.
- the sample is obtained from a biopsy from cancerous tissue by any of the biopsy methods previously mentioned.
- the sample may be obtained from any source including but not limited to blood, plasma, or serum.
- the cyst, tumor or neoplasm is colorectal.
- any medical professional such as a doctor, nurse or medical technician may obtain a biological sample for testing.
- the biological sample can be obtained without the assistance of a medical professional.
- a sample may include but is not limited to, tissue, cells, or biological material from cells or derived from cells of a subject.
- the sample comprises cell-free RNA.
- the sample comprises a fertilized egg, a zygote, a blastocyst, or a blastomere.
- the biological sample may be a heterogeneous or homogeneous population of cells or tissues.
- the biological sample may be obtained using any method known to the art that can provide a sample suitable for the analytical methods described herein.
- the sample may be obtained by non-invasive methods including but not limited to: scraping of the skin or cervix, swabbing of the cheek, saliva collection, urine collection, feces collection, collection of menses, tears, or semen.
- the methods of the disclosure can be used in the discovery of novel biomarkers for a disease or condition.
- the methods of the disclosure can be performed on a sample from a patient to provide a prognosis for a certain disease or condition in the patient.
- the methods of the disclosure can be performed on a sample from a patient to predict the patient’s response to a particular therapy.
- the disease comprises a cancer.
- the cancer may be pancreatic cancer, colon cancer, acute myeloid leukemia, adrenocortical carcinoma, AIDS-related cancers, AIDS-related lymphoma, anal cancer, appendix cancer, astrocytoma, childhood cerebellar or cerebral basal cell carcinoma, bile duct cancer, extrahepatic bladder cancer, bone cancer, osteosarcoma/malignant fibrous histiocytoma, brainstem glioma, brain tumor, cerebellar astrocytoma brain tumor, cerebral astrocytoma/malignant glioma brain tumor, ependymoma brain tumor, medulloblastoma brain tumor, supratentorial primitive neuroectodermal tumors brain tumor, visual pathway and hypothalamic glioma, breast cancer, lymphoid cancer, bronchial adenomas/carcinoids, tracheal cancer, Burkitt lymphoma, carcinoid tumor, childhood carcinoid tumor,
- the cancer comprises ovarian, prostate, colon, or lung cancer.
- the method is for determining novel biomarkers for ovarian, prostate, colon, or lung cancer by evaluating cell-free RNA using methods of the disclosure.
- RNA concentration is at about or below about 0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, or 15 nanograms, or any range derivable therein.
- a low input RNA concentration is at about 1 to 10 ng, 5 to 10 ng, 10 to 50 ng, or 10 to 100 ng total RNA. In some aspects, a low input concentration of RNA is obtained from about or less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 125, 150, 175, 200, 250, 300, 350, 400, 450, or 500 cells.
- methods involve obtaining a sample (also “biological sample”) from a subject.
- the biological sample is a blood sample.
- the biological sample is a plasma sample.
- the sample may be obtained by methods known in the art.
- the sample is obtained by swabbing, endoscopy, scraping, phlebotomy, or any other methods known in the art.
- the sample may be obtained, stored, or transported using components of a kit of the present methods.
- multiple samples may be obtained for diagnosis by the methods described herein.
- multiple samples such as one or more samples from one tissue and one or more samples from another specimen (for example serum) may be obtained for diagnosis by the methods.
- samples such as one or more samples from one tissue type and one or more samples from another specimen (e.g. serum) may be obtained at the same or different times. Samples obtained at different times may be stored and/or analyzed by different methods. For example, a sample may be obtained and analyzed by routine staining methods or any other cytological analysis methods.
- the biological sample may be obtained by a physician, nurse, or other medical professional such as a medical technician, endocrinologist, cytologist, phlebotomist, radiologist, or a pulmonologist. The medical professional may indicate the appropriate test or assay to perform on the sample.
- a molecular profiling business may consult on which assays or tests are most appropriately indicated.
- the patient or subject may obtain a biological sample for testing without the assistance of a medical professional, such as obtaining a whole blood sample, a urine sample, a fecal sample, a buccal sample, or a saliva sample.
- the sample is a fine needle aspirate of a tissue or a suspected tumor or neoplasm.
- the fine needle aspirate sampling procedure may be guided by the use of an ultrasound, X-ray, or other imaging device.
- the molecular profiling business may obtain the biological sample from a subject directly, from a medical professional, from a third party, or from a kit provided by a molecular profiling business or a third party.
- the biological sample may be obtained by the molecular profiling business after the subject, a medical professional, or a third party acquires and sends the biological sample to the molecular profiling business.
- the molecular profiling business may provide suitable containers, and excipients for storage and transport of the biological sample to the molecular profiling business.
- a medical professional need not be involved in the initial diagnosis or sample acquisition.
- An individual may alternatively obtain a sample through the use of an over the counter (OTC) kit.
- OTC kit may contain a means for obtaining said sample as described herein, a means for storing said sample for inspection, and instructions for proper use of the kit.
- molecular profiling services are included in the price for purchase of the kit. In other cases, the molecular profiling services are billed separately.
- a sample suitable for use by the molecular profiling business may be any material containing tissues, cells, nucleic acids, genes, gene fragments, expression products, gene expression products, or gene expression product fragments of an individual to be tested. Methods for determining sample suitability and/or adequacy are provided.
- the subject may be referred to a specialist such as an oncologist, surgeon, or endocrinologist.
- the specialist may likewise obtain a biological sample for testing or refer the individual to a testing center or laboratory for submission of the biological sample.
- the medical professional may refer the subject to a testing center or laboratory for submission of the biological sample.
- the subject may provide the sample.
- a molecular profiling business may obtain the sample.
- kits which may be useful for performing the methods of the disclosure.
- the contents of a kit can include one or more reagents described throughout the disclosure and/or one or more reagents known in the art for performing one or more steps described throughout the disclosure.
- kits may include one or more of the following: nuclease-free water, one or more primers, polyethylene glycol, magnetic beads, DNA polymerase, taq polymerase, DNA ligase, RNA ligase, a reverse transcriptase, dNTPs, DNA polymerase buffer, RNA polymerase, DTT, redox reagent, Mg 2+ , K + , adaptors, DNA adaptors, DNA comprising an RNA promoter, a protease, an alkaline solution, a sodium hydroxide solution, and NTPs. Any one or more of the preceding components may be excluded from a kit in certain aspects of the present disclosure.
- a kit of the disclosure does not comprise sodium bisulfite or added sodium bisulfite.
- a kit of the disclosure does not comprise ammonium sulfite or added ammonium sulfite.
- a kit of the disclosure comprises instructions for processing a nucleic acid sample, such as a DNA sample or an RNA sample. Instructions may comprise instructions for using one or more components of the kit in a method disclosed herein.
- One or more reagents can be supplied in a solid form or liquid buffer that is suitable for inventory storage, and later for addition into the reaction medium when the method of using the reagent is performed.
- Suitable packaging is provided.
- the kit may provide additional components that are useful in the procedure. These additional components may include buffers, capture reagents, developing reagents, labels, reacting surfaces, means for detection, control samples, instructions, and interpretive information.
- kit described herein may be used in a method disclosed herein. Further, components described in the context of a disclosed method may be provided in a kit of the present disclosure.
- Example 2 Epitranscriptomic markers in microbiome-derived cell-free RNA from plasma for colorectal cancer detection
- RNA modifications from tumor- or tumor microenvironment (TME)-derived transcripts could be distinct from those in non-tumor cells.
- TME tumor microenvironment
- the inventors therefore investigated cfRNA modifications in clinical samples.
- the inventors observed that a stable amount of total RNA (~5-10 ng) could be isolated from 1 mL of plasma.
- Short RNA fragments (~50 nt) are the predominant forms in the isolated cfRNAs, with the sizes similar to those of tRNAs (Fig. 1a).
- tRNAs are highly structured and more resistant to nuclease-mediated degradation than other RNA species. Based on the observed RNA size profiles, other structured non-coding RNAs (ncRNAs) or tRNA fragments (tRF) may also be present (Fig. 1a). These RNAs could be bound by tRNA synthetases or RNA-binding proteins (RBPs), providing additional protection against degradation in plasma. Given that tRNAs are known to be heavily modified, the inventors speculated that changes in tRNA modification stoichiometry might offer additional clues for human disease diagnosis (e.g. detection of early-stage cancer).
- LIME-seq Low-Input Multiple Methylation Sequencing
- LIME-seq employs cDNA 3'-ligation after RT and utilizes HIV-RT to ensure read-through of RNA methylations while inducing mutation signatures for quantification 25. These features allow the inventors to measure the abundance of individual small RNAs with greater precision, and to quantitatively reveal the landscape of major RNA methylations on cfRNAs.
- the RNA/cDNA ligation strategy in LIME-seq also ensures the capture of all short RNA species in plasma, which are often lost in typical RNA-seq libraries that use commercial kits.
- the inventors applied it to small RNAs (<200 nt) isolated from HepG2 cells.
- LIME-seq detected two tRNAs with m1A9, forty-four with m1A58, seven with m3C32, ten with m1G9, ten with m1G37, and twenty-one with m2,2G26, along with modification stoichiometry information (Extended Data Fig. 1a-f) 21,26,27. Intact read coverage profiles were observed at tRNA regions through LIME-seq, demonstrating its capability to measure tRNA abundance as well (Extended Data Fig. 1g).
- the inventors next applied LIME-seq to cfRNAs isolated from 36 non-cancer human plasma samples. Notably, 19 out of the top 50 highly expressed transcripts were cytoplasmic tRNAs, which accounted for ~13% of the total reads mapped to the human genome (hg38). This finding supports the notion that tRNAs are a main component of cfRNA in human plasma (Fig. 1c-d). LIME-seq also captured abundant RNA fragments from rRNAs, Alu elements, LINEs, snRNAs, mRNAs, and lncRNAs. The inventors observed intact read coverage profiles in most annotated cytoplasmic tRNA regions (including mitochondrial tRNA regions, Fig.
- the inventors also applied the AlkB-mediated demethylation of small RNAs isolated from cultured microbes and confirmed that the methylation sites detected in these RNAs were also present in cfRNA from plasma.
- the inventors collected multiple tubes of blood samples from the same patient without changing the needle between blood draws. The microbial profiles remained consistent across samples, suggesting minimal skin microbe contamination which would be expected to enrich in the first tube.
- the inventors suspect cell-free host tRNAs may be derived from a range of different cell types, which potentially displayed different tRNA methylation levels based on cell type and cell status and masked the signatures from tumor tissues.
- the well-developed Kraken2 methodology enabled the inventors to compare the relative abundance of various microbial species using the mapped reads from LIME-seq, and revealed distinct abundances of diverse microbial species among CRC patients and non-cancer controls, suggesting their potential as biomarkers for CRC diagnosis (Fig. 1j) 29.
- the analysis revealed differential methylation patterns, including both up- and down-alteration of methylation levels, across various methylated sites in cell-free tRNAs and other small RNA fragments when comparing colorectal cancer (CRC) patients to non-cancer individuals.
- CRC colorectal cancer
- the majority of methylation sites exhibited increased modification levels in the CRC samples (Extended Data Fig. 7a).
- Higher modification levels in microbiome RNA may correlate with increased metabolic activity and external stress, suggesting dysregulation of microbiomes in CRC patients 32 36 .
- a heatmap plot confirmed the presence of differential methylation sites in microbiome-derived cfRNAs between CRC patients and noncancer individuals (Fig. 2b).
- RNA modifications constitute a significant component and feature a notably higher ratio of sites with statistically meaningful differences (P < 0.05, Extended Data Fig. 7c).
- PCA Principal Component Analysis
- the statistical model, validated using leave-one-out cross-validation (LOOCV), demonstrated exceptional predictive ability for classifying CRC patients, achieving an AUC of 0.98 and an accuracy rate of 0.95 (Fig. 2c and Extended Data Fig. 7h). This represents a significant improvement compared to the model based solely on microbial abundance, which attained an AUC of only 0.77 (Fig. 2c). Furthermore, the statistical model enabled the application of these mutation signatures in distinguishing CRC stages (Fig. 2d), where remarkably high accuracy was observed even for early stages. This sensitivity may be attributed to the microbiota's responsiveness to abnormal cells, even at stage 0 39 41 .
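For readers less familiar with the AUC metric reported above, the following sketch computes it directly from its rank-based definition: the probability that a randomly chosen CRC sample receives a higher model score than a randomly chosen non-cancer sample, with ties counted as one half. The labels and scores are made-up illustration values, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores (1 = CRC, 0 = non-cancer).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the 0.98 versus 0.77 comparison.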
- the inventors employed more rigorous validation methods, including bootstrapping and k-fold cross-validation, both utilizing a 25% validation set (Extended Data Fig. 7i).
- the mutation-signature-based model maintained high accuracy with an AUC of 0.92 (95% confidence interval (CI): 0.82-1.00, Extended Data Fig. 7j-l).
- the model exhibited a sensitivity of 0.93 (95% CI: 0.89-0.97), a specificity of 0.92 (95% CI: 0.88-0.97), with an accuracy of 0.92 (95% CI: 0.89-0.95).
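The sensitivity, specificity, and accuracy figures above follow the standard confusion-matrix definitions. A minimal sketch, using hypothetical labels and predictions rather than study data:

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = CRC)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),   # fraction of CRC samples detected
        "specificity": tn / (tn + fp),   # fraction of controls correctly cleared
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical held-out labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```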
- the inventors further focused on microbes known to be enriched in CRC patients, including Peptostreptococcus anaerobius, Prevotella intermedia, and Parvimonas micra 52-55.
- the inventors observed high mutation levels among differentially methylated sites in cfRNA from CRC patients, suggesting a connection between tRNA methylation levels and elevated microbiome activity (Extended Data Fig. 7p).
- in validation cohort 1, which included 11 adenoma patients and 8 stage I CRC patients, the model effectively distinguished both adenoma and stage I CRC from non-cancer controls (Extended Data Fig. 9d-e).
- significant differences in specific microbial methylation sites were consistently observed across multiple cohorts (Extended Data Fig. 9c).
- the inventors further applied the strategy to pancreatic cancer (PANC), a cancer that may not directly associate with microbiota.
- PANC pancreatic cancer
- microbiome-derived cfRNA methylation profiles also showed differences between cancer patients and non-cancer controls (Extended Data Fig. 10). While this result hints at a broader applicability beyond CRC, further validation in larger cohorts and mechanistic understanding are needed in the future.
- the inventors demonstrate that methylation levels in microbiome-derived cfRNA are effective and promising biomarkers for colorectal cancer (CRC) diagnosis, offering a significant advantage over microbial abundance profiles of cfRNA/cfDNA. Unlike abundance profiles, changes in methylation pattern in microbiome-derived cfRNA may reflect the intrinsic status and activity of the microbiota, making them more sensitive to the cancerous microenvironment 36. The inventors achieved unprecedented accuracy in CRC detection, especially in the very early stages of the disease. Host RNA modification level changes in tissues or plasma may also be explored for disease diagnosis or prognosis. The findings highlight the potential of cfRNA and their modification patterns as reliable biomarkers for monitoring host microbiome dynamics, not only for cancer diagnosis but also for other health-related applications.
- Validation cohort 1 consisted of 8 non-cancer individuals and 29 patients with CRC. These samples were obtained from stored specimens, which were collected and processed by a separate research group at the same institute. To ensure comparability, CRC patients at different disease stages were age- and sex-matched with the non-cancer individuals.
- Validation cohort 2 included samples from additional individuals, with 8 non-cancer individuals and 7 patients with CRC, who were recruited in a different phase of this study. These samples were collected using the same procedures as the training cohort and processed by the same research group.
- Plasma separation was performed within 4 hours of whole blood collection. Blood samples were first centrifuged at 1,350 x g for 12 min at 4°C to collect the upper plasma layer. The plasma layer was then transferred to a clean 15 ml tube and centrifuged again at 1,350 x g for 12 min at 4°C. The upper plasma layer was then transferred to a clean 1.5-ml EP tube and further centrifuged at 13,500 x g for 5 min at 4°C to ensure complete removal of cell debris. The harvested plasma was split into 600 µl aliquots and stored at -80°C until cfRNA isolation.
- cfRNA extraction from plasma samples
- cfRNAs were isolated from 600 µl plasma using the QIAamp ccfDNA/RNA Kit (Qiagen, Cat. 55184) according to the manufacturer's protocol. In brief, 180 µl RPL buffer was added to 600 µl plasma, followed by vortexing for 5 s. After 3 min at room temperature, 100 µl RPP buffer was added. The EP tube cap was then closed, and the contents were immediately mixed vigorously by vortexing twice for 10 s each, which is important to disperse the solid. The EP tube was then incubated on ice for 3 min. After the standard washing and concentration steps following the manufacturer's protocol, the purified cfRNA was eluted in 14 µl of RNase-free water.
- HepG2 cell line was purchased from the American Type Culture Collection
- Stenotrophomonas maltophilia strains were obtained from the clinical microbiology laboratory after confirming the strain and subsequently cultured in Eugon Broth medium. Bacterial growth was carried out under aerobic conditions at 37°C with continuous agitation (180 rpm) for 24-48 hours.
- Skin specimens
- CRC tumor tissues and matched normal adjacent tissues were obtained from surgical resections. Tissue samples were immediately snap-frozen in liquid nitrogen and stored at -80°C until processing. For RNA extraction, frozen tissue samples were mechanically homogenized using a stainless-steel bead milling system. Then, TRIzol reagent (Invitrogen) was added to the homogenate and incubated at room temperature for 5 min to ensure complete dissociation of nucleoprotein complexes, followed by centrifugation at 12,000 x g for 10 min at 4°C to remove insoluble debris. The supernatant was transferred to fresh RNase-free tubes for subsequent RNA isolation.
- RNA of HepG2 was isolated with TRIzol reagent (Invitrogen) following the manufacturer’s standard protocol based on isopropanol precipitation.
- the small RNA fraction (<200 nt) was further extracted from the purified total RNA using the mirVana miRNA Isolation Kit (AM1560, Invitrogen), following the manufacturer’s standard protocol.
- Plasma cell-free RNA (cfRNA) or small RNA from HepG2 (<200 nt) was fragmented to lengths of ~35-55 nucleotides using the 10X RNA Fragmentation Reagent (AM8740, Invitrogen), which was used as 15X with heating at 70 °C for 14 mins. The fragmented RNA was purified using the Oligo Clean & Concentrator kit (Zymo Research) with a final elution volume of 10 µl.
- the subsequent 3'-end repair involved the preparation of a mixture of fragmented RNA, 2 µl of 10x T4 Polynucleotide Kinase Reaction Buffer (B0201S, NEB), 2 µl of T4 PNK, 1 µl SUPERase·In RNase Inhibitor and a proper volume of RNase-free water. The 20 µl reaction mixture was mixed thoroughly and incubated at 37°C for 1 hour. 3'-end-repaired RNA fragments were purified by Oligo Clean & Concentrator kit and eluted in 10 µl of RNase-free water.
- RNA fragments were denatured with 1 µl of 10 µM RNA 3'-linker (5'rApp-NNNNNAGATCGGAAGAGCGTCGTG-3SpC3; SEQ ID NO: 1) by heating at 70°C for 2 minutes and then immediately moved onto ice. Then a mixture of 2.5 µl of 10x T4 RNA Ligase Reaction Buffer (NEB), 7.5 µl of 50% PEG8000, 1 µl SUPERase·In, and 2 µl T4 RNA ligase 2 truncated KQ (NEB) was prepared, followed by mixing well and incubating at 25°C for 2 hrs and then at 16°C for 10 hrs.
- the excess adapters were removed by adding 2 µl of 5'-deadenylase (NEB) with an incubation at 30 °C for 45 mins, followed by adding 1.0 µl of RecJf (NEB) with an incubation at 37 °C for 45 mins.
- the 3'-ligated RNA was then purified by RNA Clean & Concentrator (Zymo Research) and eluted in 10 µl of RNase-free water.
- Reverse transcription was initiated by mixing the 3'-ligated RNA with 1 µl of 2.0 µM RT primer (5'-ACACGACGCTCTTCCGATCT-3'; SEQ ID NO:2) and heating at 65°C for 2 minutes. The samples were then moved onto ice immediately. A mixture of 2 µl of 10x AMV Reverse Transcriptase Reaction Buffer (NEB), 2 µl of 10 mM dNTPs, 1 µl RNaseOUT (Thermo Fisher Scientific), 2 µl HIV Recombinant Reverse Transcriptase (Worthington) and a proper volume of RNase-free water was prepared, to give a final volume of 20 µl for a one-hour incubation at 37°C. Upon the completion of RT, 1.0 µl of RNase H (NEB, M0297L) was added to the reaction mixture, followed by incubation at 37 °C for 20 min and 70 °C for 5 min.
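The relationship between the RNA 3'-linker (SEQ ID NO: 1) and the RT primer (SEQ ID NO: 2) can be checked with a short sketch: the primer should anneal to the constant part of the linker, so its 3' end should equal the reverse complement of that constant region. Treating the five leading N positions as a random barcode is an assumption made here for illustration; the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]

RNA_3_LINKER = "NNNNNAGATCGGAAGAGCGTCGTG"   # SEQ ID NO: 1, without end modifications
RT_PRIMER = "ACACGACGCTCTTCCGATCT"          # SEQ ID NO: 2

constant_region = RNA_3_LINKER.lstrip("N")  # strip the degenerate positions
n_random = len(RNA_3_LINKER) - len(constant_region)
priming_site = reverse_complement(constant_region)

print(n_random, priming_site, RT_PRIMER.endswith(priming_site))
```

Here all but the first base of the RT primer pairs with the linker's constant region, consistent with primer extension starting at the linker junction.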
- the produced cDNAs after RT were purified by Oligo Clean & Concentrator (Zymo Research) and eluted in 10 µl of RNase-free water.
- a mixture of eluted cDNA and 2.0 µl of 10 µM cDNA 3'-linker (5'Phos-NNNNNAGATCGGAAGAGCACACGTCTG-3SpC3; SEQ ID NO:3) was denatured at 75 °C for 2 min and then moved onto ice immediately.
- the LIME-seq method enables high mutation ratio detection across multiple RNA modifications, including m1A, m3C, m1G, m2,2G, m3U, etc., with an enhanced read-through at these methylation sites. Such performance has been facilitated by the robust HIV RT enzyme, and it effectively uncovered methylation fraction information at methylated sites.
- the inventors compared the cfRNA mapping results between bowtie2 and STAR; mis-splicing and short reads in the libraries prompted the inventors to use bowtie2. Based on the bowtie2 alignments, the inventors normalized the data by calculating reads per million (RPM) and performed a two-tailed Student's t-test to calculate the p value for differential expression. Most differentially expressed protein-coding transcripts identified by LIME-seq are miRNAs occurring within these transcripts. For analyzing cfRNA transcripts from the human genome, the inventors set a minimum RPM of 20 in at least one group (CRC or non-cancer), focused on non-protein-coding transcripts, and excluded redundant transcripts.
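The RPM normalization and the "minimum RPM of 20 in at least one group" filter can be sketched as follows. The sketch assumes the filter is applied to the group mean RPM, which the text does not state explicitly; the transcript names and counts are hypothetical.

```python
from statistics import mean

def reads_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {name: reads * 1_000_000 / total for name, reads in counts.items()}

def passes_min_rpm(transcript, groups, min_rpm=20.0):
    """True if the mean RPM reaches min_rpm in at least one group (assumed rule)."""
    return any(
        mean([sample.get(transcript, 0.0) for sample in samples]) >= min_rpm
        for samples in groups.values()
    )

# Hypothetical raw counts for one sample per group.
crc_counts = [{"tRNA-Gly": 40, "snoRNA-x": 1, "other": 99_959}]
control_counts = [{"tRNA-Gly": 30, "snoRNA-x": 1, "other": 99_969}]

groups = {
    "CRC": [reads_per_million(s) for s in crc_counts],
    "non-cancer": [reads_per_million(s) for s in control_counts],
}
kept = [t for t in ("tRNA-Gly", "snoRNA-x") if passes_min_rpm(t, groups)]
print(kept)
```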
- after aligning sequences to the reference genome, the inventors employed the "depth" function in samtools (version 1.16.1) 59 to extract regions corresponding to various RNA species. Following the extraction of read depths at these sites, the inventors used an in-house Python pipeline to plot the read coverage across these regions.
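The in-house pipeline is not disclosed, but the first step of such a pipeline might look like the sketch below, which turns `samtools depth` text output (tab-separated chromosome, 1-based position, depth for a single BAM file) into a per-position coverage list for one region. Positions missing from the output are filled with zero, since `samtools depth` omits zero-coverage positions by default. The function name and example values are hypothetical.

```python
def coverage_in_region(depth_text, chrom, start, end):
    """Per-position read depth for chrom:start-end (1-based, inclusive)."""
    coverage = {pos: 0 for pos in range(start, end + 1)}
    for line in depth_text.splitlines():
        if not line.strip():
            continue
        name, pos, depth = line.split("\t")[:3]
        pos = int(pos)
        if name == chrom and start <= pos <= end:
            coverage[pos] = int(depth)
    return [coverage[pos] for pos in range(start, end + 1)]

# Hypothetical output of `samtools depth` for one sample.
depth_text = "chr1\t100\t12\nchr1\t101\t15\nchr1\t103\t9\nchr2\t100\t4\n"
profile = coverage_in_region(depth_text, "chr1", 100, 103)
print(profile)
```

The resulting list can be passed to any plotting library to draw the read coverage profile across a tRNA or other RNA region.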
- the inventors identified microbial species in cfRNA samples using kraken2 (version 2.1.3) 60, setting a threshold whereby the median reads of the species must exceed 10 in at least one group (CRC or non-cancer). Following species identification, the inventors extracted their sequences from the kraken2 reference genome. To construct a new, streamlined reference, the inventors limited each microbial species to a maximum of ten reference genomes, minimizing redundancy.
- the inventors aligned the reads that failed to align to the human genome to this curated microbiome reference genome using bowtie2.
- the alignments generated by bowtie2 were applied in the calculation of the microbiome mapping ratio.
- Variant identification within these sequences was then achieved by analyzing the base composition at each genomic position using the bam-readcount tool (https://github.com/genome/bam-readcount).
- the generated bam-readcount output results were parsed and analyzed to calculate the misincorporation ratio.
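A minimal sketch of the misincorporation ratio calculation is shown below. It assumes a bam-readcount line is tab-separated as chromosome, position, reference base, depth, followed by per-base fields that begin with `base:count:`; the example line is shortened (real output carries more colon-separated statistics per base), and the function name is hypothetical.

```python
def misincorporation(line):
    """Return (total ratio, per-base ratios) of non-reference calls at one site."""
    fields = line.rstrip("\n").split("\t")
    ref = fields[2].upper()
    counts = {}
    for item in fields[4:]:
        base, count = item.split(":")[:2]
        if base in {"A", "C", "G", "T"}:    # ignore '=', 'N', and indel entries
            counts[base] = int(count)
    depth = sum(counts.values())
    if depth == 0:
        return 0.0, {}
    per_base = {b: c / depth for b, c in counts.items() if b != ref}
    return sum(per_base.values()), per_base

# Hypothetical, shortened bam-readcount line for a site with reference base A.
line = "chr1\t100\tA\t20\t=:0:0.00\tA:12:60.00\tC:1:60.00\tG:6:60.00\tT:1:60.00\tN:0:0.00"
ratio, per_base = misincorporation(line)
print(round(ratio, 2), per_base)
```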
- the inventors plotted differentially expressed microbes with RPM greater than 20 for comparison with host transcripts.
- a site's mutation signal is due to methylation or a single nucleotide polymorphism (SNP)
- SNPs single nucleotide polymorphisms
- the analysis indicates that signals attributed to methylation typically result from a combination of mutations, whereas SNPs are characterized by a single mutation type (e.g., A>G). Consequently, at any given site, if the mutation ratio for a specific base exceeds 95% of the total mutation ratio, the site is classified as an SNP. Otherwise, the site is annotated as a methylation site.
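The 95% rule above translates directly into code. In this sketch the input is a mapping from each non-reference base to its mutation ratio at one site; the function name, the label for sites without signal, and the example values are hypothetical.

```python
def classify_site(per_base_ratio, snp_share=0.95):
    """Label a site as 'SNP' or 'methylation' from its per-base mutation ratios."""
    total = sum(per_base_ratio.values())
    if total == 0:
        return "no_signal"                 # nothing to classify at this site
    dominant = max(per_base_ratio.values())
    # A single mutation type carrying more than 95% of the signal suggests an SNP.
    return "SNP" if dominant / total > snp_share else "methylation"

mixed_site = {"G": 0.30, "C": 0.05, "T": 0.05}      # several mutation types
single_site = {"G": 0.49, "C": 0.005, "T": 0.005}   # almost exclusively A>G

mixed_label = classify_site(mixed_site)
single_label = classify_site(single_site)
print(mixed_label, single_label)
```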
- the inventors organized raw bam-readcount data and applied filtering to minimize random variation. Specifically, mutational sites in each sample with a depth of less than 3 were deemed insufficient (considered as empty) and excluded from subsequent analyses to ensure robustness in calculating p values and the average mutation ratio.
- the analysis proceeded with the establishment of two critical thresholds: (1) a minimum median depth of more than 5 reads in at least one of the groups (CRC or non-cancer), and (2) an average mutation ratio exceeding 0.10 in at least one group (CRC or non-cancer). These criteria significantly refined the dataset, resulting in approximately 9,000 eligible sites for further analysis.
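The filtering criteria can be combined into one eligibility check per site, as sketched below. Each sample contributes a (depth, mutation ratio) pair; samples with depth below 3 are treated as missing. The sketch assumes the median depth and average ratio are computed on the remaining samples, which is one reading of the text; names and values are hypothetical.

```python
from statistics import mean, median

MIN_SAMPLE_DEPTH = 3    # below this, a sample's value is treated as empty
MIN_MEDIAN_DEPTH = 5    # group median depth must exceed this
MIN_MEAN_RATIO = 0.10   # group average mutation ratio must exceed this

def usable(observations):
    """Keep (depth, ratio) pairs with sufficient depth."""
    return [(d, r) for d, r in observations if d >= MIN_SAMPLE_DEPTH]

def site_is_eligible(crc, control):
    groups = [usable(crc), usable(control)]
    depth_ok = any(
        g and median([d for d, _ in g]) > MIN_MEDIAN_DEPTH for g in groups
    )
    ratio_ok = any(
        g and mean([r for _, r in g]) > MIN_MEAN_RATIO for g in groups
    )
    return bool(depth_ok and ratio_ok)

# Hypothetical sites: (depth, mutation ratio) per sample.
well_covered = site_is_eligible(
    [(12, 0.30), (8, 0.25), (2, 0.90)], [(10, 0.02), (9, 0.04), (7, 0.03)]
)
shallow = site_is_eligible([(4, 0.5), (3, 0.4)], [(4, 0.0), (2, 0.0)])
unmodified = site_is_eligible([(20, 0.01), (30, 0.02)], [(25, 0.0), (22, 0.01)])
print(well_covered, shallow, unmodified)
```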
- the inventors then assessed these sites for differential methylation using a two-tailed Student's t-test and visualized the results through a heatmap.
- The support vector classifier used in this work can be implemented with scikit-learn 61.
- the misincorporation ratios of a subgroup of 330 methylation sites were used as the input for the model.
- These data are then partitioned into combined training and testing sets. Within the model, any missing feature values are imputed with the feature's average value across samples, ensuring completeness.
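Mean imputation of missing feature values can be sketched in plain Python as below, with `None` marking a missing misincorporation ratio. scikit-learn's `SimpleImputer(strategy="mean")` provides the same behavior; whether the inventors used it is not stated. The matrix values are hypothetical.

```python
def impute_with_feature_mean(matrix):
    """Replace None in each column with that column's mean over observed values."""
    n_features = len(matrix[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]

# Hypothetical samples x methylation-site matrix with missing entries.
matrix = [
    [0.2, None],
    [0.4, 0.1],
    [None, 0.3],
]
imputed = impute_with_feature_mean(matrix)
print(imputed)
```

When combined with cross-validation, computing the means on the training portion only avoids leaking information from held-out samples.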
- the inventors apply several validation techniques, including leave-one-out cross-validation (LOOCV), 4-fold cross-validation, and bootstrapping, the latter using 25% of the data as a test set.
- SHAP values were calculated to estimate feature attribution for each endpoint and model individually 62 .
- SHAP values integrate Shapley values — rooted in cooperative game theory — to compute the average incremental effect of each feature on the model's prediction, adhering to the principle of local additivity.
- a key attribute of Shapley values is their ability to quantify the contribution of each player (or feature) in a game (or model) by comparing the outcome with all players (or features) involved versus an absence of players (or features). In the context of statistical models, this translates to SHAP values summing to the difference between the expected model output (baseline) and the actual output for a specific prediction. This ensures a comprehensive and fair attribution of each feature's impact on the prediction outcome.
- the inventors apply 4-fold cross-validation in calculating the SHAP values for every feature across all samples.
- Example 4 Dynamic regulation of bacterial tRNA modifications under different growth conditions.
- Hu, L. et al. m6A RNA modifications are measured at single-base resolution across the mammalian transcriptome. Nat Biotechnol 40, 1210-1219 (2022).
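The site-level rules described above (exclusion of sites with depth below 3, the group-level thresholds on median depth and average mutation ratio, and the 95% rule separating SNPs from methylation) can be illustrated with a minimal sketch. The function names, the dictionary layout of per-base read counts, and the choice to compute group statistics only over samples that pass the depth filter are assumptions made for illustration; they are not the bam-readcount output format or the inventors' actual code.

```python
from statistics import mean, median

MIN_SITE_DEPTH = 3     # sites with depth below 3 are treated as empty
MIN_MEDIAN_DEPTH = 5   # median depth must exceed 5 in at least one group
MIN_MEAN_RATIO = 0.10  # average mutation ratio must exceed 0.10 in at least one group
SNP_FRACTION = 0.95    # one substitution carrying more than 95% of the signal => SNP


def mutation_ratio(ref_base, counts):
    """Fraction of reads that differ from the reference base; None if depth < 3."""
    depth = sum(counts.values())
    if depth < MIN_SITE_DEPTH:
        return None
    return (depth - counts.get(ref_base, 0)) / depth


def classify_site(ref_base, counts):
    """Return 'SNP' if a single alternate base dominates, else 'methylation'."""
    depth = sum(counts.values())
    alt = {b: n for b, n in counts.items() if b != ref_base and n > 0}
    total_alt = sum(alt.values())
    if depth < MIN_SITE_DEPTH or total_alt == 0:
        return None  # not enough evidence to call the site
    return "SNP" if max(alt.values()) / total_alt > SNP_FRACTION else "methylation"


def passes_group_filters(groups):
    """groups maps a group name to a list of (depth, ratio) pairs, one per sample."""
    kept = {}
    for name, obs in groups.items():
        usable = [(d, r) for d, r in obs if d >= MIN_SITE_DEPTH]
        if usable:
            kept[name] = usable
    deep = any(median([d for d, _ in obs]) > MIN_MEDIAN_DEPTH for obs in kept.values())
    high = any(mean([r for _, r in obs]) > MIN_MEAN_RATIO for obs in kept.values())
    return deep and high
```

In this sketch a site whose alternate reads are spread over several bases is labeled as methylation, while a site dominated by one substitution (for example A>G) is labeled as an SNP.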
Abstract
Aspects of the present disclosure are directed to methods, compositions, and kits for detection and analysis of RNA modifications in cell-free RNA. The RNA modification can be used to generate an RNA modification signature in cell-free RNA from a patient.
Description
METHODS AND COMPOSITIONS FOR CELL FREE RNA MODIFICATION ANALYSIS
[0001] This application claims priority of U.S. Provisional Application No. 63/634,886 filed April 16, 2024, which is hereby incorporated by reference in its entirety.
SEQUENCE LISTING
[0002] The instant application contains a Sequence Listing which has been submitted in XML format and is hereby incorporated by reference in its entirety. Said XML copy, created on April 15, 2025, is named ARCD.P0837WO - SEQ LISTING.xml and is 3,753 bytes in size.
BACKGROUND
I. Field of the Invention
[0003] Aspects of this invention relate to at least the fields of cell biology, epitranscriptomics, and medicine.
II. Background
[0004] In precision medicine, circulating cell-free DNA (cfDNA) has emerged as an increasingly important analyte, providing valuable clinical information and revolutionizing non-invasive diagnoses across various domains1-3. A key strategy for cancer diagnosis involves identifying the changes in tumor-derived cell-free DNA (ctDNA)4,5. However, ctDNA concentrations in plasma can be exceedingly low in early-stage cancer patients, resulting in reduced sensitivity and specificity6. While epigenetics-based cfDNA sequencing has shown promise7,8, the low concentration of cfDNA at the early cancer stages remains a significant challenge9. In addition, the epigenetic alterations found in cancer may also be present in noncancer tissues, adding another layer of complexity to the diagnostic process10.
[0005] Cell-free RNAs (cfRNAs) released from apoptotic cells provide an alternative analyte for non-invasive diagnosis and detection of various diseases11-13. Changes in cellular RNA transcription are dynamic processes that can serve as indicators of disease pathobiology12 ". While overexpression of tumor-specific transcripts may enhance tumor-derived RNA signals, the limited cell death in the early stages of cancer is again a main challenge. Cell-free mRNAs are susceptible to degradation if not protected by protein binding and present in low quantities in clinically feasible samples (e.g., a few mL of plasma), complicating their amplification and detection16. Additionally, contamination from unrelated cells can markedly alter the abundance of cell-free mRNAs, further complicating detection specificity.
SUMMARY
[0006] Aspects of the disclosure relate, in part, to the discovery that RNA modifications present in cell free RNA can be used to determine microbiome identity in biological samples, such as plasma, from a patient. The present disclosure provides various methods, compositions, systems, and kits for detecting an RNA modification signature. The RNA modification signature may be useful for determining the presence and/or severity of a disease in a patient.
[0007] Disclosed are methods for predicting disease in a patient, including a human patient. In some aspects, the method comprises 1, 2, 3, or more steps including any of the following: receiving sequencing data obtained from the patient; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the patient has the disease. In some aspects, the sequencing data comprises an RNA modification signature from cell-free RNA.
[0008] In some aspects, the patient is an individual having, suspected of having or diagnosed with having a disease, including any disease included herein. In some aspects, the patient is a human. In certain aspects, the patient is a healthy individual. In certain aspects, the patient is an individual to be monitored for changes in microbiome status or composition. In certain aspects, the patient is an individual to be screened for a disease. In certain aspects, the patient is an individual to be screened for microbiome composition. In certain aspects, the patient has not been diagnosed with a disease, including any disease disclosed herein.
[0009] Disclosed are methods for predicting disease in a healthy individual. In some aspects, the individual is a human. In some aspects, the method comprises 1, 2, 3, or more steps including any of the following: receiving sequencing data obtained from the individual; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the individual has the disease. In some aspects, the sequencing data comprises an RNA modification signature from cell-free RNA.
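As a non-limiting illustration, the receiving, generating, and applying steps can be sketched in a few lines of Python. The sketch assumes that each feature is the misincorporation ratio at one modification site, that missing values are imputed with the feature's average across samples, and that performance is estimated by leave-one-out cross-validation. A nearest-centroid classifier is used here purely as a dependency-free stand-in for the trained machine learning model; it is not the support vector classifier of the Examples, and all function names and toy values are hypothetical.

```python
from statistics import mean


def impute_missing(rows):
    """Replace None in each feature column with that column's mean over observed samples."""
    n_features = len(rows[0])
    col_means = []
    for j in range(n_features):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_features)]
            for r in rows]


def fit_centroids(X, y):
    """Stand-in 'training' step: one mean feature vector per class label."""
    centroids = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def loocv_accuracy(X, y):
    """Leave-one-out cross-validation: hold out each sample once, train on the rest."""
    hits = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Toy input: rows are samples, columns are per-site misincorporation ratios.
rows = [[0.30, 0.25, None], [0.28, None, 0.40],
        [0.05, 0.02, 0.10], [None, 0.04, 0.08]]
labels = ["CRC", "CRC", "non-cancer", "non-cancer"]
X = impute_missing(rows)
accuracy = loocv_accuracy(X, labels)
```

Replacing `fit_centroids` and `predict` with a support vector classifier from a machine learning library would leave the imputation and cross-validation logic unchanged.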
[0010] The disease can be any disease, including any disease associated with changes in a patient’s microbiome. In some aspects, the disease is cancer. In certain aspects, the cancer is colorectal cancer. In certain aspects, the disease is an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease. In some aspects, the patient received or will receive a transplant. In some aspects, the disease is graft versus host disease (GvHD). In certain aspects, the transplant results in GvHD. In some aspects, the transplant is a bone marrow transplant. In certain aspects, the GvHD is associated with a bone marrow transplant or other transplant. In certain aspects, the disease is associated with microbiome dysbiosis. In certain aspects, the disease is Alzheimer’s disease or Parkinson’s disease.
[0011] In certain aspects, the method is a computer implemented method. In certain aspects, the receiving is done by one or more processors. In certain aspects, the generating is done by one or more processors. In certain aspects, the applying is done by one or more processors.
[0012] In certain aspects, the sequencing data is obtained from a cell free RNA sample from the patient. In certain aspects, the sequencing data is obtained from plasma from the patient. In certain aspects, the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof. It is also contemplated that, in some aspects, one or more of m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl is excluded from the RNA modification signature. In certain aspects, the presence or absence of the RNA modifications, including m1A, m3C, m1G, m22G, is determined by LIME-Seq. In certain aspects, the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
[0013] In certain aspects, the RNA modifications are RNA modifications from bacterial RNA. In certain aspects, the methods detect RNA modification presence and/or levels from bacterial RNA. In certain aspects, the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A. In certain aspects, the RNA modifications comprise RNA modifications from patient-derived RNA, including any RNA produced by endogenous cells of the patient. In certain aspects, the RNA modifications comprise RNA modifications from host RNA. The host RNA may be RNA produced by one or more endogenous cells of the host, which may be a patient including any patient disclosed herein. In some aspects, the host RNA does not comprise bacterial RNA. In certain aspects, the RNA modifications comprise RNA modifications from human RNA. In certain aspects, the RNA modifications comprise RNA modifications present in the patient, including any patient disclosed herein. In certain aspects, the RNA modifications are from both bacterial RNA and host RNA.
[0014] Disclosed are methods of detecting a microbiome signature in a biological sample, the method comprising detecting an RNA modification signature in cell-free RNA from the biological sample. In certain aspects, the biological sample is a human plasma sample. In certain aspects, the biological sample is from a patient suspected of having cancer. In certain aspects, the cancer is colorectal cancer. In certain aspects, the biological sample is from a patient suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease. In some aspects, the patient received or will receive a transplant. In some aspects, the biological sample is from a patient that has received or will receive a transplant. In certain aspects, the disease is associated with microbiome dysbiosis. In certain aspects, the disease is Alzheimer’s disease or Parkinson’s disease. In certain aspects, the biological sample comprises cell free RNA. In certain aspects, the RNA sample comprises approximately 0.1-50 ng of total RNA. In certain aspects, the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof. In certain aspects, the presence or absence of the RNA modifications, including m1A, m3C, m1G, m22G, is determined by LIME-Seq. In certain aspects, the RNA modification signature comprises RNA modifications from microbiome-derived RNA. In certain aspects, the RNA modification signature comprises RNA modifications from host-derived RNA. In certain aspects, the RNA modification signature comprises patient-derived RNA. In certain aspects, host-derived and/or patient derived RNA comprises RNA produced by one or more cells endogenous to the host or individual. In certain aspects, host-derived and/or patient derived RNA does not comprise bacterial RNA. In certain aspects, the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
[0015] In certain aspects, the RNA sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
[0016] Certain aspects relate to methods of treating disease in a patient, including a human patient. In some aspects, the method comprises administering to the patient an effective amount of a therapy, wherein the patient has been determined to have a disease after detection of an RNA modification signature in cell free RNA from a biological sample from the patient. The detection can occur by any of the methods disclosed herein. In some aspects, the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof. It is also contemplated that, in some aspects, one or more of m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U is excluded from the RNA modification signature. In some aspects, the RNA modifications, including m1A, m3C, m1G, m22G, are determined by LIME-Seq. In some aspects, the biological sample comprises plasma from the patient. In some aspects, the biological sample comprises approximately 5-10 ng of total RNA. In some aspects, the patient is a human patient. In some aspects, the patient has or is suspected of having cancer. In some aspects, the cancer is colorectal cancer. In some aspects, the patient has or is suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease. In some aspects, the patient received or will receive a transplant. In some aspects, the disease is associated with microbiome dysbiosis. In some aspects, the patient has or is suspected of having Alzheimer’s disease or Parkinson’s disease. In some aspects, the therapy is determined based on the detected RNA modification signature. In some aspects, the RNA modification signature comprises RNA modifications from microbiome-derived RNA. In some aspects, the RNA modification signature comprises RNA modifications from host-derived RNA. In some aspects, the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
[0017] Also disclosed are methods of diagnosing a disease in a patient, including a human patient, the method comprising detecting an RNA modification signature in cell free RNA in a biological sample from the patient. The RNA modification signature may be any RNA modification signature disclosed herein. The patient may be any patient disclosed herein. The biological sample may be any biological sample disclosed herein. The disease may be any disease disclosed herein.
[0018] Also disclosed are methods of screening for and/or diagnosing a disease from a healthy population of individuals, the method comprising detecting an RNA modification signature in cell free RNA in a biological sample from at least one individual of the population. The RNA modification signature may be any RNA modification signature disclosed herein. The individual may be any individual disclosed herein. The biological sample may be any biological sample disclosed herein. The disease that the individuals may have may be any disease disclosed herein.
[0019] Also disclosed are methods of screening for and/or diagnosing a disease in a healthy individual, the method comprising detecting an RNA modification signature in cell free RNA in a biological sample from the individual. The RNA modification signature may be any RNA modification signature disclosed herein. The individual may be any individual disclosed herein. In certain aspects, the individual has not been diagnosed with a disease, such as any disease disclosed herein. In certain aspects, the individual has not been diagnosed with an inflammatory disease. In certain aspects, the individual has not been diagnosed with a disease associated with the microbiome. In certain aspects, the individual has not been diagnosed with a disease associated with microbiome dysbiosis. The biological sample may be any biological sample disclosed herein. The disease that the individual may have may be any disease disclosed herein.
[0020] In certain aspects, the biological sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
[0021] Also disclosed are methods of detecting modifications on RNA molecules from a biological sample from a patient comprising 1, 2, 3, or more steps, which can include any of the following: incubating the sample with a reverse transcriptase under conditions to generate fully or partially complementary DNA (cDNA) molecules; incubating the cDNA molecules with a 3’ linker and ligase under conditions to ligate the 3’ linker to the cDNA molecules; and sequencing the population of ligated cDNA to identify modifications on the RNA molecules. In certain aspects, the reverse transcriptase is a human immunodeficiency virus (HIV) reverse transcriptase. In certain aspects, the reverse transcriptase is a MarathonRT reverse transcriptase. In certain aspects, the reverse transcriptase is a reverse transcriptase capable of producing mutations in the reverse transcribed DNA when there is an RNA modification in the reverse transcribed RNA. In some aspects, the modifications comprise m1A, m3C, m1G, m22G, m5C modifications, or a combination thereof. In some aspects, the biological sample comprises cell free RNA. In some aspects, the biological sample is a human plasma sample. In some aspects, the biological sample comprises approximately 0.1-50 ng of total RNA. In some aspects, RNA in the biological sample is fragmented prior to the reverse transcribing step. In some aspects, the population of cDNA is amplified before sequencing.
[0022] In certain aspects, the biological sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
[0023] Also disclosed are methods of detecting a modification signature on a ribonucleic acid (RNA) sample, the method comprising 1, 2, 3, or more steps, which can include any of the following: ligating a 3’ adapter to a plurality of RNA molecules in the RNA sample; reverse transcribing the plurality of RNA molecules in the RNA sample using a reverse transcriptase to generate a population of complementary DNA (cDNA); ligating a 3’ linker to the population of cDNA; and sequencing the population of cDNA. In certain aspects, the modification signature comprises m1A, m3C, m1G, m22G, m5C modifications, or a combination thereof, in the RNA sample. In certain aspects, the RNA sample comprises cell free RNA. In certain aspects, the RNA sample is from human plasma. In certain aspects, the RNA sample comprises approximately 0.1-50 ng of total RNA. In certain aspects, the RNA sample is fragmented prior to the reverse transcribing step. In certain aspects, the population of cDNA is amplified before sequencing.
[0024] In certain aspects, the biological sample comprises approximately 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, or any range derivable therein, ng of total RNA.
[0025] The RNA modification signature of the aspects described herein may include RNA modification levels. The RNA modifications detected for a particular RNA modification signature may be specific to the disease. For example, a certain subset of the RNA modifications in Table A may be detected in an RNA modification signature specific for cancer, whereas a different subset of Table A may be detected in an RNA modification signature for Parkinson’s disease.
[0026] Also disclosed are methods of determining bacterial activity and/or active bacterial infections in a patient. The method can include the step of identifying one or more RNA modifications, which can include one or more tRNA modifications, in bacterial populations obtained from a patient. In certain aspects, the method comprises sequencing bacterial nucleic acid in a biological sample from a patient. In certain aspects, the method comprises determining the presence or absence of one or more RNA modifications, which can include one or more tRNA modifications, in bacterial nucleic acid in a biological sample from a patient. The determining may be performed after sequencing the nucleic acid. In some aspects, the sequencing step comprises performing LIME-seq. In some aspects, one or more bacterial strains having one or more RNA modifications, which can include one or more tRNA modifications, comprise an active infection in the patient. In some aspects, a particular bacterial strain is identified as being more active (such as relative to other bacterial strains in a population of bacteria) and/or actively infecting a patient when RNA modifications, which can include one or more tRNA modifications, are identified in nucleic acid from the bacterial strain. In some aspects, the bacterial strain is determined to be more active and/or actively infecting the patient when more RNA modifications are identified in nucleic acid from the bacterial strain relative to a control. In some aspects, the control is a known level of RNA modifications in the bacterial strain when the bacterial strain is not actively infecting an individual and/or when the bacterial strain is not in a growth mode. In some aspects, the control is an average level of RNA modifications in a population of bacteria, including a population of bacteria in which the bacterial strain is present.
[0027] Also disclosed are kits comprising one or more reagents capable of detecting the RNA modification signature disclosed herein. In some aspects, the kit comprises a reverse transcriptase. In some aspects, the kit comprises an HIV polymerase. In certain aspects, the kit comprises one or more primers, or other reagents, capable of detecting specific RNA modification markers, including any of the RNA modification markers disclosed herein.
[0028] Also disclosed are the following aspects:
1. A method for predicting disease in a patient, the method comprising: receiving sequencing data obtained from the patient; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the patient has the disease, wherein the sequencing data comprises an RNA modification signature from cell-free RNA.
2. The method of Aspect 1, wherein the disease is cancer.
3. The method of Aspect 2, wherein the cancer is colorectal cancer.
4. The method of Aspect 1, wherein the disease is an inflammatory disease, irritable bowel syndrome, sepsis, graft versus host disease (GvHD), GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
5. The method of Aspect 1, wherein the disease is associated with microbiome dysbiosis.
6. The method of Aspect 1, wherein the disease is Alzheimer’s disease or Parkinson’s disease.
7. The method of any one of Aspects 1 to 6, wherein the method is a computer implemented method.
8. The method of any one of Aspects 1 to 7, wherein the receiving is done by one or more processors.
9. The method of any one of Aspects 1 to 8, wherein the generating is done by one or more processors.
10. The method of any one of Aspects 1 to 9, wherein the applying is done by one or more processors.
11. The method of any one of Aspects 1 to 10, wherein the sequencing data is obtained from a cell free RNA sample from the patient.
12. The method of any one of Aspects 1 to 11, wherein the sequencing data is obtained from plasma from the patient.
13. The method of any one of Aspects 1 to 12, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
14. The method of Aspect 13, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
15. The method of any one of Aspects 1 to 14, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
16. The method of any one of Aspects 1 to 14, wherein the RNA modification signature comprises RNA modifications from patient-derived RNA.
17. The method of any one of Aspects 1 to 16, wherein the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
18. A method of detecting a microbiome signature in a biological sample, the method comprising detecting an RNA modification signature in cell-free RNA from the biological sample.
19. The method of Aspect 18, wherein the biological sample is a human plasma sample.
20. The method of Aspect 18 or 19, wherein the biological sample is from a patient suspected of having cancer.
21. The method of Aspect 20, wherein the cancer is colorectal cancer.
22. The method of Aspect 18 or 19, wherein the biological sample is from a patient having or suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease or from a patient that has received or will receive a transplant.
23. The method of Aspect 18, wherein the disease is associated with microbiome dysbiosis.
24. The method of Aspect 18, wherein the disease is Alzheimer’s disease or Parkinson’s disease.
25. The method of any one of Aspects 18 to 24, wherein the biological sample comprises cell free RNA.
26. The method of any one of Aspects 18 to 25, wherein the RNA sample comprises approximately 0.1-50 ng of total RNA.
27. The method of any one of Aspects 18 to 26, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
28. The method of Aspect 27, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
29. The method of any one of Aspects 18 to 28, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
30. A method of treating disease in a patient, the method comprising administering to the patient an effective amount of a therapy, wherein the patient has been determined to have a disease after detection of an RNA modification signature in cell free RNA from a biological sample from the patient.
31. The method of Aspect 30, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
32. The method of Aspect 31, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
33. The method of any one of Aspects 30 to 32, wherein the biological sample comprises plasma from the patient.
34. The method of any one of Aspects 30 to 33, wherein the biological sample comprises approximately 0.1-50 ng of total RNA.
35. The method of any one of Aspects 30 to 34, wherein the patient is a human patient.
36. The method of any one of Aspects 30 to 35, wherein the patient has or is suspected of having cancer.
37. The method of Aspect 36, wherein the cancer is colorectal cancer.
38. The method of any one of Aspects 30 to 35, wherein the patient has or is suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease or has received or will receive a transplant.
39. The method of any one of Aspects 30 to 35, wherein the disease is associated with microbiome dysbiosis.
40. The method of any one of Aspects 30 to 35, wherein the patient has or is suspected of having Alzheimer’s disease or Parkinson’s disease.
41. The method of any one of Aspects 30 to 40, wherein the therapy is determined based on the detected RNA modification signature.
42. The method of any one of Aspects 30 to 41, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
43. The method of any one of Aspects 30 to 42, wherein the RNA modification signature comprises RNA modifications from patient-derived RNA.
44. The method of any one of Aspects 30 to 43, wherein the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
45. A method of diagnosing a disease in a patient, the method comprising detecting an RNA modification signature in cell free RNA in a biological sample from the patient.
46. The method of Aspect 45, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl modifications, or a combination thereof.
47. The method of Aspect 46, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
48. The method of any one of Aspects 45 to 47, wherein the biological sample comprises plasma from the patient.
49. The method of any one of Aspects 45 to 48, wherein the biological sample comprises approximately 5-10 ng of total RNA.
50. The method of any one of Aspects 45 to 49, wherein the patient is a human patient.
51. The method of any one of Aspects 45 to 50, wherein the disease is cancer.
52. The method of Aspect 51, wherein the cancer is colorectal cancer.
53. The method of any one of Aspects 45 to 50, wherein the disease is an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
54. The method of any one of Aspects 45 to 50, wherein the disease is associated with microbiome dysbiosis.
55. The method of any one of Aspects 45 to 50, wherein the disease is Alzheimer’s disease or Parkinson’s disease.
56. The method of any one of Aspects 45 to 55, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
57. The method of any one of Aspects 45 to 56, wherein the RNA modification signature comprises RNA modifications from patient-derived RNA.
58. The method of any one of Aspects 45 to 57, wherein the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
59. A method of detecting a modification signature on a ribonucleic acid (RNA) sample, the method comprising: ligating a 3’ adapter to a plurality of RNA molecules in the RNA sample; reverse transcribing the plurality of RNA molecules in the RNA sample using a reverse transcriptase to generate a population of complementary DNA (cDNA); ligating a 3’ linker to the population of cDNA; and sequencing the population of cDNA.
60. The method of Aspect 59, wherein the modification signature comprises m1A, m3C, m1G, m22G modifications, or a combination thereof, in the RNA sample.
61. The method of Aspect 59 or 60, wherein the RNA sample comprises cell free RNA.
62. The method of any one of Aspects 59 to 61, wherein the RNA sample is from human plasma.
63. The method of any one of Aspects 59 to 62, wherein the RNA sample comprises approximately 0.1-50 ng of total RNA.
64. The method of any one of Aspects 59 to 63, wherein the RNA sample is fragmented prior to the reverse transcribing step.
65. The method of any one of Aspects 59 to 64, wherein the population of cDNA is amplified before sequencing.
66. A method of determining bacterial activity and/or active bacterial infections in a patient, the method comprising identifying one or more RNA modifications in nucleic acid from one or more bacteria in a biological sample obtained from the patient.
67. The method of Aspect 66, wherein the identifying comprises sequencing bacterial nucleic acid in the biological sample.
68. The method of Aspect 67, wherein the sequencing comprises performing LIME-seq.
69. The method of any one of Aspects 66 to 68, wherein identification of one or more RNA modifications in a bacterial strain in the bacterial population indicates the bacterial strain is actively growing.
70. The method of any one of Aspects 66 to 69, wherein the RNA modifications comprise one or more tRNA modifications.
71. The method of any one of Aspects 66 to 70, wherein the RNA modifications comprise m1A, m3C, m1G, m22G, m5C, pseudouridine, 2’-O-methyl, m3U, acp3U modifications, or a combination thereof.
72. The method of any one of Aspects 66 to 71, wherein the RNA modifications are determined by LIME-seq.
[0029] Throughout this application, the term “about” is used to indicate that a value includes the inherent variation of error for the measurement or quantitation method.
[0030] The use of the word “a” or “an” when used in conjunction with the term “comprising” may mean “one,” but it is also consistent with the meaning of “one or more,” “at least one,” and “one or more than one.”
[0031] The phrase “and/or” means “and” or “or”. To illustrate, A, B, and/or C includes: A alone, B alone, C alone, a combination of A and B, a combination of A and C, a combination of B and C, or a combination of A, B, and C. In other words, “and/or” operates as an inclusive or.
[0032] The words “comprising” (and any form of comprising, such as “comprise” and “comprises”), “having” (and any form of having, such as “have” and “has”), “including” (and any form of including, such as “includes” and “include”) or “containing” (and any form of containing, such as “contains” and “contain”) are inclusive or open-ended and do not exclude additional, unrecited elements or method steps.
[0033] The compositions and methods for their use can “comprise,” “consist essentially of,” or “consist of” any of the ingredients or steps disclosed throughout the specification. Compositions and methods “consisting essentially of” any of the ingredients or steps disclosed limit the scope of the claim to the specified materials or steps which do not materially affect the basic and novel characteristic of the claimed invention.
[0034] A person of ordinary skill in the art would understand that a solution that does not contain a particular chemical (e.g., ammonium sulfite, sodium bisulfite, etc.) does not contain an added quantity of that chemical. The term added means that the chemical is exogenously supplied, i.e. supplied in amounts greater than what would be considered trace or minute amounts.
[0035] It is specifically contemplated that any limitation discussed with respect to one aspect of the invention may apply to any other aspect of the invention. Furthermore, any composition of the invention may be used in any method of the invention, and any method of the invention may be used to produce or to utilize any composition of the invention. Any aspect discussed with respect to one aspect of the disclosure applies to other aspects of the disclosure as well and vice versa. For example, any step in a method described herein can apply to any other method. Moreover, any method described herein may have an exclusion of any step or combination of steps. Aspects of an embodiment set forth in the Examples are also embodiments that may be implemented in the context of embodiments discussed elsewhere in a different Example or elsewhere in the application, such as in the Summary, Detailed Description, Claims, and Brief Description of the Drawings.
[0036] Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating specific embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
[0038] FIGs. 1A-1J show LIME-seq reveals the presence of host non-coding RNAs and microbiome-derived RNAs in human plasma cell-free RNA. a, The high-resolution bioanalyzer assay to visualize the input RNA sizes in LIME-seq libraries, for human plasma cell-free RNA (cfRNA) and HepG2 small RNA (<200 nt), respectively. The plasma cfRNA samples were collected from the blood of two non-cancer individuals, and the small RNA was purified from HepG2 total RNA. b, Schematic overview of LIME-seq. Light green circles mark diverse cfRNA modifications that induce mutation signatures in the presence of HIV RT. RNA/cDNA dual ligations ensure the capture of mutation signals at read ends and internal positions. c, The relative expression levels of the top 50 abundant RNA transcripts in plasma cfRNA, depicted by bar chart, indicate the dominant distribution of tRNA and rRNA. d, The proportion distribution of all mapped reads classified by various RNA species, when making LIME-seq libraries with plasma cfRNA. e, The reads coverage of representative tRNAs in plasma cfRNA, obtained by LIME-seq. Reads coverage remained intact across host tRNA regions. f, The reads coverage of representative non-tRNA RNA species in plasma cfRNA, obtained by LIME-seq. Incomplete reads coverage was observed in 7SK RNA and hY4 RNA regions, indicating partial degradation of RNA. For e and f, the Y axis shows the normalized reads coverage depth within each annotated RNA, with the X axis displaying the RNA length. g, The coverage of LIME-seq reads mapped to the mitochondrial genome, when starting with plasma cfRNA. The blue and red colors represent the heavy and light strands of mt-DNA. Accumulated reads coverage was observed in mt-rRNA and mt-tRNA regions. h, The stacked bar chart shows the proportional distribution of LIME-seq reads mapped to the human genome versus the unmapped reads that can be further aligned to microbial genomes, when starting with plasma cfRNA. The box plot on the right side indicates the ratios of total microbiome-related reads in unmapped reads. i, The pie chart shows the percentage of microbiome-derived cfRNA from the top 13 abundant microbial species and the rest (other), when counting LIME-seq reads mapped to diverse microbial genomes. j, The volcano plot for differential expression analysis of various cfRNA species from the host and the abundance of different microbial genera assessed by LIME-seq, when comparing CRC patients with non-cancer controls. The genes and microbial genera displaying significantly varied expression in CRC vs. the non-cancer controls are marked by blue and red colors for host-derived and microbiome-derived RNAs, respectively.
[0039] FIGs. 2A-2H show LIME-seq detects RNA modification signatures in microbiome-derived RNAs in plasma cfRNA, offering a new set of biomarkers for CRC early diagnosis. a, IGV plot of a representative microbiome-derived tRNA region in Stenotrophomonas maltophilia, when applying LIME-seq to plasma cfRNA and lab-cultured microbiome (and demethylated), displaying the reads coverage and observed m1G37 mutation sites. b, Heatmap illustrates the differential patterns of mutation signatures in microbiome-derived RNAs of plasma cfRNA, when comparing CRC patients with non-cancer controls. c, Receiver operating characteristic (ROC) curve of the classifier describes how LIME-seq of plasma cfRNA distinguishes CRC patients from non-cancer controls, using microbiome-derived RNA abundance or mutation patterns in these RNAs. Leave-one-out cross-validation (LOOCV) was conducted to evaluate the performance. d, Scatter plot demonstrates the high sensitivity of the classifier in predicting the probability of different CRC stages, based on mutation patterns in microbiome-derived RNAs from plasma cfRNA. e, The SHAP values obtained from the classifier model point out critical mutation sites in 10 representative microbes, which predominantly contribute to the CRC diagnostic model. f, Box plot compares the integral mutation patterns in microbiome-derived RNAs and indicates significant differences between CRC patients and non-cancer individuals, with 3 representative microbes observed in LIME-seq data of plasma cfRNA, as validation of the CRC diagnostic model. g, Box plot illustrates the dynamic profiles of stage-specific mutation patterns in microbiome-derived RNAs from plasma cfRNA, with Hydrogenophaga/A and Ramlibacter tataouinensis/G as representative microbes. The variations of these mutation patterns across different CRC stages show the potential for monitoring cancer-progression-related changes via the CRC diagnostic model. h, External validation of Validation 1 and Validation 2 compared to training.
[0040] FIGs. 3A-3L show quantitative identification of RNA methylations in small RNAs from HepG2 cells by LIME-seq. a-f, Bar plots show the mutation ratios of RNA methylation sites m1A58 (a), m1A9 (b), m1G9 (c), m1G37 (d), m22G26 (e), and m3C (f) detected by LIME-seq in small RNAs from HepG2 cells. g, IGV plot of a representative tRNA-Met-i region in HepG2 small RNA, displaying the reads coverage and observed mutation sites. h, Correlation of mutation ratios between two replicates of HepG2 small RNA (50 ng) detected by LIME-seq. PCC represents Pearson’s correlation coefficient. i, Correlation of mutation ratios detected in HepG2 small RNA across different input amounts (50 ng, 15 ng, 5 ng, and 1.5 ng). j, IGV plot of a representative tRNA-Met-i region in HepG2 small RNA with varying input amounts. k-l, Bar plots showing mutation ratios of RNA methylation sites at m1A58 (d) and m1G9 (e) in tRNA-Met-i across different RNA input amounts.
[0041] FIGs. 4A-4M show the reads coverage and methylation profile of host RNAs in human plasma cell-free RNAs, revealed by LIME-seq. a-e, Line plots show the coverage of cfRNA LIME-seq reads in tRNA (a), ncRNA (b), snRNA (c), 18S rRNA (d), and 28S rRNA (e) from the human genome. The dark lines represent the average coverage across cfRNAs from 36 non-cancer individuals, while the light regions indicate the standard deviation of RNA coverage variability in these individuals. f, Stacked bars show the mutation ratios of methylation sites in rRNAs detected by LIME-seq in cfRNAs. g-m, Stacked bars show the mutation ratios of tRNA methylation sites m1A9 (g), m1A58 (h), m1G9 (i), m1G9 (mitochondrial tRNA) (j), m1G37 (k), m3C32 (l) and ms2i6A (m) detected by LIME-seq in cfRNAs. For f-m, the ratios for different nucleobases are averaged from 35 non-cancer individuals.
[0042] FIGs. 5A-5G show the reads coverage and methylation profile of microbiome-derived cfRNA in human plasma, revealed by LIME-seq. a, Hierarchical visualization of the composition and relative abundance of microbial species identified in cfRNA using LIME-seq for one typical healthy individual. b-c, Line plots show the coverage of microbiome-derived cfRNA LIME-seq reads in Pseudomonas sp. CIP-10 rRNA (b) and tRNA (c) regions for typical individuals. d-g, IGV plots of representative tRNA and rRNA regions, including tRNA-Pro-TGG (d), tRNA-Pro-CGG (e), tRNA-Leu-CAG (f), and rRNA (g), comparing RNA from lab-cultured S. maltophilia with plasma cfRNA reads. AlkB treatment was applied to lab-cultured RNA to confirm modifications. Skin samples were used as a negative control, with distinct coverage patterns and low abundance indicating that the reads originate from other microbes rather than S. maltophilia.
[0043] FIGs. 6A-6D show comparison of microbiome-derived cfRNA origins in CRC patients versus non-cancer controls. a, Boxplot compares the ratio of microbial reads from LIME-seq between non-cancer controls and CRC patients. b-c, Bar plots show the relative abundance of the top 50 most abundant microbial species detected by LIME-seq in plasma samples from 36 non-cancer individuals (b) and 27 CRC patients (c). Abundance is determined by Log2RPM values calculated using LIME-seq data and averaged from 35 non-cancer individuals and 27 cancer patients, respectively. d, Venn diagram shows the overlap of the top 50 most abundant microbial species in non-cancer individuals and patients with colorectal cancer (CRC).
[0044] FIGs. 7A-7N show differential expression and methylation profiles for host RNAs in plasma cfRNA from CRC patients versus non-cancer controls. a, Bar plot shows the relative abundance of the top 50 highly expressed transcripts identified in cfRNA data from plasma of CRC patients. Abundance is based on Log2RPM values derived from LIME-seq data. b, Boxplot compares the abundance of various RNA transcripts mapping to the human genome in CRC patients and non-cancer individuals. The p value obtained by Student’s t-test (two-tailed) is shown in the figure. c, Volcano plot shows the differential RNA methylation analysis of human rRNA between CRC patients and non-cancer individuals as revealed by LIME-seq. d-f, Bar plots compare the mutational signatures detected by LIME-seq in human 18S rRNA m1acp3Ψ (d), 28S rRNA m1A (e), and 28S rRNA m3U (f). g, Volcano plot shows the differential RNA methylation analysis in human tRNA using LIME-seq between CRC patients and non-cancer individuals. Up-regulated tRNA methylated sites (2) are marked in red, and down-regulated sites (5) are marked in blue. h-i, Box plots compare the mutational signatures detected by LIME-seq in human m3C32 at tRNA-Lys(CTT) (h) and m3C32 at tRNA-Pro(CGG) (i). The p value obtained by Student’s t-test (two-tailed) is shown in the figure. j, Schematic illustrating the pipeline for the identification of differential methylation sites on tRNAs in CRC tumor tissues versus normal adjacent tissues (NATs) from four CRC patients using LIME-seq. k-l, Volcano plot (k) and scatter plot (l) showing differential RNA methylation analysis in human tRNA using LIME-seq, comparing CRC tumor tissue and NATs. p-values are calculated using a paired t-test. m, Receiver operating characteristic (ROC) curve of a classifier based on mutational signatures from human tRNA in cfRNA obtained from LIME-seq. The inputs for the classifier are the mutational ratios in cfRNA of the tRNA sites shown in panel l. Validation was performed using leave-one-out cross-validation (LOOCV). n, Scatter plot showing the low correlation between the fold change in mutation ratios between CRC tumor and NATs and those between CRC patients’ and non-cancer individuals’ cfRNA.
[0045] FIGs. 8A-8C show LIME-seq reveals the differential abundance of microbial species as a biomarker for CRC diagnosis. a, Workflow for CRC diagnosis based on the abundance of microbial species in human plasma as revealed by LIME-seq. b, ROC curve of the classifier based on microbial species abundance obtained from LIME-seq comparing CRC patients and non-cancer individuals, validated using LOOCV. c, The SHAP values from the classifier model reveal the important microbes in CRC (p value obtained by t-test between non-cancer individuals and CRC patients is marked in blue). The SHAP values illustrate the significance of different inputs in determining the predictive capability of the statistical model.
[0046] FIGs. 9A-9P show RNA modification profiles in microbiome-derived cfRNA distinguish CRC patients from non-cancer controls. a, Volcano plot shows the differential RNA methylation analysis in microbiome-derived cfRNA, comparing mutation signatures between CRC patients and non-cancer individuals. b, The pie chart shows the composition of mutational signatures assessed by LIME-seq in microbiome-derived cfRNA. The percentage of SNP and RNA modifications is shown in the figure. As with the mutation patterns from HIV-RT shown in Extended Data Fig. 3, RNA modifications can be detected as mixed mutational signatures during reverse transcription by HIV-RT. c, Bar plot shows the ratio of mutational signatures with statistical difference for both SNP and modification sites. A much higher ratio in modification sites highlights their dynamics between CRC patients and non-cancer individuals. d, Two-dimensional principal component analysis (PCA) on the mutation signature of significantly methylated sites in cfRNA derived from the microbiome in colorectal cancer patients and non-cancer controls. e, Three-dimensional PCA of the mutation signature at significantly methylated sites in cfRNA from the microbiome of CRC patients and non-cancer controls suggests that higher dimensionality enhances classification accuracy between CRC patients and non-cancer individuals compared to two-dimensional PCA. f, Schematic overview of the hyperplane in SVC. g, Workflow for developing a diagnostic model for colorectal cancer patients using epitranscriptional signatures in microbiome-derived cfRNA with LIME-seq. h, Confusion matrix of the classifier based on mutational signatures from LIME-seq between CRC patients and non-cancer controls, evaluated through LOOCV. i, Schematic overview of different validation methods. j, The prediction accuracy of the classifier for CRC diagnosis with 4-fold cross validation. The error bars come from five repeats (p = 5×10⁻⁹, obtained by Student’s t-test (two-tailed)). k, The prediction accuracy of the classifier for CRC diagnosis with bootstrapping, using 25% of the dataset as the validation set. The error bars come from 20 repeats (p = 1×10⁻⁹, obtained by Student’s t-test (two-tailed)). l, ROC of the classifier based on mutational signature and abundance of microbial species from LIME-seq between CRC patients and non-cancer controls, evaluated through bootstrapping with 25% of the dataset as the validation set. The light color regions reflect the standard error of 20 repeats. m-o, Dynamic mutational sites in Halomonas (m), Hydrogenophaga (n), and Clostridium tetani (o) across different stages identified by the statistical model. The p value obtained by Student’s t-test (two-tailed) is shown in the figure. p, Mutational profiles of differential mutation sites observed in Peptostreptococcus anaerobius, Prevotella intermedia, and Parvimonas micra.
[0047] FIGs. 10A-10D show evaluation of the stability of the mutational signatures detected on microbiome-derived cfRNA. a, Schematic of the experimental design for evaluating stability. Whole blood was collected from two individuals, with three tubes per individual, and stored at 4°C for <2 hours, 8 hours, and 24 hours before plasma extraction. Plasma was immediately frozen at -80°C after extraction. b, Correlation of mutation ratios detected in cfRNA from plasma stored at 4°C for <2 hours, 8 hours, and 24 hours. c, Relative proportion of microbial reads observed in cfRNA with different storage conditions. d, Low correlation of host transcript expression levels between cfRNA obtained at 2 hours and 24 hours, suggesting cell leakage over time.
[0048] FIGs. 11A-11E show the validation of a subset of mutational signatures as a biomarker for CRC early diagnosis with two external cohorts. a, ROC curve of the classifier based on the mutational signature of microbial species detected by LIME-seq in the training cohort, evaluated using model fitting and LOOCV. b, Schematic of the experimental design and sample sizes for the external validation cohorts. c, Mutational signatures of 12 selected sites with normalized, centered mutational level across the training and validation cohorts. d-e, Relative CRC scores of patients in external validation cohort 1 (d) and external validation cohort 2 (e).
[0049] FIGs. 12A-12B show RNA modification in plasma cfRNA could be clinically informative for pancreatic cancer diagnosis. a, Two-dimensional PCA illustrating differences in the mutational signatures of differentially methylated sites in microbiome-derived cfRNA between pancreatic cancer patients and non-cancer controls. b, Random permutation of patient and control labels significantly reduces the classification ability observed in the PCA.
[0050] FIG. 13 is a block diagram illustrating an example process 100 for predicting disease according to non-limiting aspects of the present disclosure.
[0051] FIGs. 14A-14D show dynamic regulation of bacterial tRNA modifications under different growth conditions. a-b, Scatter plots showing LIME-seq-derived mutation ratios for tRNA modification sites in Pseudomonas aeruginosa (a) and Staphylococcus aureus (b) grown on solid medium (x-axis) versus liquid medium (y-axis). Each point represents a single site; red points indicate sites with significant differences between conditions. Dashed lines denote twofold deviation boundaries. c, Comparison of tRNA modification levels in Escherichia coli grown at high versus low cell densities. Mutation ratios for individual sites are plotted; red points denote density-dependent changes. d, Heat map showing Z-score-normalized mutation ratios across E. coli cultures collected at five optical densities (OD600 = 0.12, 0.29, 0.70, 1.10, and 1.70). Increased modifications at high density are prominently observed at U47 (putative acp3U) and A38 across multiple tRNA species.
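As a rough illustration of the quantities named in the FIG. 14 description, the following Python sketch computes a per-site mutation ratio, Z-score-normalizes ratios across cultures, and flags a twofold deviation between two growth conditions. It is a minimal sketch under stated assumptions, not the analysis used to produce the figures: the mutation ratio is assumed to be the fraction of reads covering a site that carry a non-reference base, the Z-score is assumed to use the population standard deviation, and all function names and read counts are hypothetical.

```python
import statistics


def mutation_ratio(mismatch_reads, total_reads):
    """Fraction of reads at a site carrying a non-reference base (assumed definition)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mismatch_reads / total_reads


def z_score_normalize(ratios):
    """Center and scale ratios for one site across conditions (population SD assumed)."""
    mean = statistics.fmean(ratios)
    sd = statistics.pstdev(ratios)
    if sd == 0:
        # No variation across conditions: every Z-score is zero.
        return [0.0 for _ in ratios]
    return [(r - mean) / sd for r in ratios]


def exceeds_twofold(ratio_x, ratio_y):
    """True when ratio_y is more than 2x, or less than 0.5x, ratio_x."""
    if ratio_x <= 0 or ratio_y <= 0:
        # A fold change is undefined at zero; flag only if the two values differ.
        return ratio_x != ratio_y
    fold = ratio_y / ratio_x
    return fold > 2.0 or fold < 0.5


# Hypothetical (mismatch, total) read counts for one site at five optical densities.
counts = [(12, 400), (20, 400), (48, 400), (90, 400), (130, 400)]
ratios = [mutation_ratio(m, t) for m, t in counts]
print(z_score_normalize(ratios))
print(exceeds_twofold(ratios[0], ratios[-1]))
```

Statistical significance (the red points in the figure description) would need a separate test across replicates, which this sketch does not attempt.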
DETAILED DESCRIPTION
[0052] Aspects of the present disclosure relate to compositions, methods, and kits for detection and analysis of an RNA modification signature. In certain aspects, the RNA modification signature comprises one or more modifications, including any of those in Table A.
Table A: RNA Modifications for Certain Aspects Herein
I. General Assay Methods
A. Detection and analysis of methylated nucleic acid
[0053] Aspects of the methods include assaying nucleic acids to determine expression levels and/or modification levels of nucleic acids (e.g., DNA, RNA) including such expression and/or modification in cell free nucleic acid. Certain example methods for detection and analysis of nucleic acid methylation are described herein.
[0054] In some aspects, methods provided herein reduce levels of background in assays comprising low and/or ultralow RNA inputs relative to canonical-BS treatments. In some aspects, methods provided herein reduce false positive rates by equal to about or greater than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 fold, or any range derivable therein, when compared to canonical-BS treatments. In some aspects, methods provided herein increase the rate of true positive detection by equal to about or greater than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, or 100 percent, or any range derivable therein, when compared to canonical-BS treatments.
[0055]
1. HPLC-UV
[0056] The technique of HPLC-UV (high performance liquid chromatography-ultraviolet), developed by Kuo and colleagues in 1980 (described further in Kuo K.C. et al., Nucleic Acids Res. 1980;8:4763-4776, which is herein incorporated by reference), can be used to quantify the amount of RNA methylation present in a hydrolyzed DNA sample. The method includes hydrolyzing the RNA into its constituent nucleoside bases, which are separated chromatographically, and the fractions are then measured.
2. LC-MS/MS
[0057] Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is a high-sensitivity alternative to HPLC-UV that requires much smaller quantities of the hydrolyzed DNA sample. In the case of mammalian DNA, of which ~2%-5% of all cytosine residues are methylated, LC-MS/MS has been validated for detecting methylation levels ranging from 0.05%-10%, and it can confidently detect differences between samples as small as ~0.25% of the total cytosine residues, which corresponds to ~5% differences in global DNA methylation. The procedure routinely requires 50-100 ng of DNA sample, although much smaller amounts (as low as 5 ng) have been successfully profiled.
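The relationship between the two percentages in the preceding paragraph can be checked with a short calculation. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes the ~5% figure is the absolute difference (as a share of all cytosines) divided by the baseline share of cytosines that are methylated.

```python
def relative_methylation_change(abs_diff_pct_of_cytosines, baseline_methylated_pct):
    """Relative change in global methylation, in percent.

    abs_diff_pct_of_cytosines: difference between samples, as % of all cytosines.
    baseline_methylated_pct:   % of all cytosines that are methylated at baseline.
    """
    if baseline_methylated_pct <= 0:
        raise ValueError("baseline_methylated_pct must be positive")
    return 100.0 * abs_diff_pct_of_cytosines / baseline_methylated_pct


# A 0.25% shift against a 5% baseline is a 5% relative change.
print(relative_methylation_change(0.25, 5.0))  # 5.0
# Against a 2% baseline the same shift is a larger relative change.
print(relative_methylation_change(0.25, 2.0))  # 12.5
```

Under this reading, the ~5% figure corresponds to the upper end of the stated ~2%-5% baseline range.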
3. AFLP and RFLP
[0058] Differentially methylated fragments can be detected by traditional PCR-based amplified fragment length polymorphism (AFLP), restriction fragment length polymorphism (RFLP), or protocols that combine both.
4. LUMA
[0059] The LUMA (luminometric methylation assay) technique utilizes a combination of two DNA restriction digest reactions performed in parallel and subsequent pyrosequencing reactions to fill in the protruding ends of the digested DNA strands. One digestion reaction is performed with the CpG methylation-sensitive enzyme HpaII, while the parallel reaction uses the methylation-insensitive enzyme MspI, which will cut at all CCGG sites. The enzyme EcoRI is included in both reactions as an internal control. Both MspI and HpaII generate 5'-CG overhangs after DNA cleavage, whereas EcoRI produces 5'-AATT overhangs, which are then filled in with the subsequent pyrosequencing-based extension assay. Essentially, the measured light signal calculated as the HpaII/MspI ratio is proportional to the amount of unmethylated DNA present in the sample. As the sequence of nucleotides that are added in the pyrosequencing reaction is known, the specificity of the method is very high and the variability is low, which is essential for the detection of small changes in global methylation. LUMA requires only a relatively small amount of DNA (250-500 ng), demonstrates little variability and has the benefit of an internal control to account for variability in the amount of DNA input.
RNA Sequencing
[0060] In some aspects, RNA may be analyzed by sequencing. The RNA may be prepared for sequencing by any method known in the art, such as but not limited to, poly-A selection, cDNA synthesis, stranded or nonstranded library preparation, or a combination thereof. The RNA may be prepared for any type of RNA sequencing technique, including but not limited to, strand-specific RNA sequencing. In some aspects, sequencing may be performed to generate approximately 10M, 15M, 20M, 25M, 30M, 35M, 40M or more reads, including paired reads. In some aspects, the sequencing may be performed at a read length of approximately 50 bp, 55 bp, 60 bp, 65 bp, 70 bp, 75 bp, 80 bp, 85 bp, 90 bp, 95 bp, 100 bp, 105 bp, 110 bp, or longer (or any range derivable therein). In some aspects, raw sequencing data may be converted to estimated read counts (RSEM), fragments per kilobase of transcript per million mapped reads (FPKM), and/or reads per kilobase of transcript per million mapped reads (RPKM).
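For illustration, the RPKM and FPKM units named above follow directly from their definitions. The Python sketch below uses the standard formula (counts scaled per kilobase of transcript and per million mapped reads or fragments); the function names and example numbers are hypothetical, and real pipelines typically obtain counts from an aligner or quantifier rather than by hand.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Same scaling as RPKM, but counting fragments (read pairs) instead of reads."""
    return rpkm(fragment_count, transcript_length_bp, total_mapped_fragments)


# Hypothetical example: 500 reads on a 2,000 bp transcript in a 20M-read library.
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

For paired-end data, FPKM counts each read pair once, so the two units coincide only for single-end libraries.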
5. Example Sequencing Methods
[0061] RNA may be used for amplification of one or more regions of interest followed by sequencing. Accordingly, aspects of the disclosure may include sequencing nucleic acids to detect and/or quantify methylation of nucleic acid biomarkers. In some aspects, the methods of the disclosure include a sequencing method. Sequencing may be excluded from certain methods of the disclosure. Example sequencing methods include, but are not limited to, those described below.
a. Massively parallel signature sequencing (MPSS).
[0062] The first of the next-generation sequencing technologies, massively parallel signature sequencing (or MPSS), was developed in the 1990s at Lynx Therapeutics. MPSS was a bead-based method that used a complex approach of adapter ligation followed by adapter decoding, reading the sequence in increments of four nucleotides. This approach made the method susceptible to sequence-specific bias or loss of specific sequences.
b. Polony sequencing.
[0063] The Polony sequencing method, developed in the laboratory of George M. Church at Harvard, was among the first next-generation sequencing systems and was used to sequence a full genome in 2005. It combined an in vitro paired-tag library with emulsion PCR, an automated microscope, and ligation-based sequencing chemistry to sequence an E. coli genome at an accuracy of >99.9999% and a cost approximately 1/9 that of Sanger sequencing.
c. 454 pyrosequencing™.
[0064] A parallelized version of pyrosequencing was developed by 454 Life Sciences™, which has since been acquired by Roche Diagnostics™. The method amplifies DNA inside water droplets in an oil solution (emulsion PCR), with each droplet containing a single DNA template attached to a single primer-coated bead that then forms a clonal colony. The sequencing machine contains many picoliter-volume wells, each containing a single bead and sequencing enzymes. Pyrosequencing uses luciferase to generate light for detection of the individual nucleotides added to the nascent DNA, and the combined data are used to generate sequence read-outs. This technology provides intermediate read length and price per base compared to Sanger sequencing on one end and Solexa and SOLiD™ on the other.
d. Illumina™ (Solexa) sequencing.
[0065] Solexa developed a sequencing method based on reversible dye-terminator technology and engineered polymerases. The terminator chemistry was developed internally at Solexa, and the concept of the Solexa system was invented by Balasubramanian and Klenerman from Cambridge University's chemistry department. In 2004, Solexa acquired the company Manteia Predictive Medicine in order to gain a massively parallel sequencing technology based on "DNA Clusters", which involves the clonal amplification of DNA on a surface. The cluster technology was co-acquired with Lynx Therapeutics of California. Solexa Ltd. later merged with Lynx to form Solexa Inc.
[0066] In this method, DNA molecules and primers are first attached on a slide and amplified with polymerase so that local clonal DNA colonies, later coined "DNA clusters", are formed. To determine the sequence, four types of reversible terminator bases (RT-bases) are added and non-incorporated nucleotides are washed away. A camera takes images of the fluorescently labeled nucleotides, then the dye, along with the terminal 3' blocker, is chemically removed from the DNA, allowing for the next cycle to begin. Unlike pyrosequencing, the DNA chains are extended one nucleotide at a time and image acquisition can be performed at a delayed moment, allowing for very large arrays of DNA colonies to be captured by sequential images taken from a single camera.
[0067] Decoupling the enzymatic reaction and the image capture allows for optimal throughput and theoretically unlimited sequencing capacity. With an optimal configuration, the ultimately reachable instrument throughput is thus dictated solely by the analog-to-digital conversion rate of the camera, multiplied by the number of cameras and divided by the number of pixels per DNA colony required for visualizing them optimally (approximately 10 pixels/colony). In 2012, with cameras operating at more than 10 MHz A/D conversion rates and available optics, fluidics and enzymatics, throughput can be multiples of 1 million nucleotides/second, corresponding roughly to one human genome equivalent at 1x coverage per hour per instrument, and one human genome re-sequenced (at approx. 30x) per day per instrument (equipped with a single camera).
e. SOLiD™ sequencing.
[0068] SOLiD™ technology employs sequencing by ligation. Here, a pool of all possible oligonucleotides of a fixed length is labeled according to the sequenced position. Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for matching sequences results in a signal informative of the nucleotide at that position. Before sequencing, the DNA is amplified by emulsion PCR. The resulting beads, each containing single copies of the same DNA molecule, are deposited on a glass slide. The result is sequences of quantities and lengths comparable to Illumina™ sequencing.
f. Ion Torrent™ semiconductor sequencing.
[0069] Ion Torrent™ Systems Inc. developed a system based on using standard sequencing chemistry, but with a novel, semiconductor-based detection system. This method of sequencing is based on the detection of hydrogen ions that are released during the polymerization of DNA, as opposed to the optical methods used in other sequencing systems. A microwell containing a template DNA strand to be sequenced is flooded with a single type of nucleotide. If the introduced nucleotide is complementary to the leading template nucleotide, it is incorporated into the growing complementary strand. This causes the release of a hydrogen ion that triggers a hypersensitive ion sensor, which indicates that a reaction has occurred. If homopolymer repeats are present in the template sequence, multiple nucleotides will be incorporated in a single cycle. This leads to a corresponding number of released hydrogen ions and a proportionally higher electronic signal.
g. DNA Nanoballs™ sequencing.
[0070] DNA Nanoballs™ sequencing is a type of high throughput sequencing technology used to determine the entire genomic sequence of an organism. The company Complete Genomics® uses this technology to sequence samples submitted by independent researchers. The method uses rolling circle replication to amplify small fragments of genomic DNA into DNA nanoballs. Unchained sequencing by ligation is then used to determine the nucleotide sequence. This method of DNA sequencing allows large numbers of DNA nanoballs to be sequenced per run and at low reagent costs compared to other next generation sequencing platforms. However, only short sequences of DNA are determined from each DNA nanoball, which can make mapping the short reads to a reference genome difficult. This technology has been used for multiple genome sequencing projects.
h. Heliscope single molecule sequencing.
[0071] Heliscope sequencing is a method of single-molecule sequencing developed by Helicos Biosciences. It uses DNA fragments with added poly-A tail adapters which are attached to the flow cell surface. The next steps involve extension-based sequencing with cyclic washes of the flow cell with fluorescently labeled nucleotides (one nucleotide type at a time, as with the Sanger method). The reads are performed by the Heliscope sequencer. The reads are short, up to 55 bases per run, but recent improvements allow for more accurate reads of stretches of one type of nucleotides. This sequencing method and equipment were used to sequence the genome of the M13 bacteriophage.
i. Single molecule real time (SMRT) sequencing.
[0072] SMRT sequencing is based on the sequencing by synthesis approach. The DNA is synthesized in zero-mode waveguides (ZMWs) - small well-like containers with the capturing tools located at the bottom of the well. The sequencing is performed with use of unmodified polymerase (attached to the ZMW bottom) and fluorescently labelled nucleotides flowing freely in the solution. The wells are constructed in a way that only the fluorescence occurring by the bottom of the well is detected. The fluorescent label is detached from the nucleotide at its incorporation into the DNA strand, leaving an unmodified DNA strand. According to Pacific Biosciences, the SMRT technology developer, this methodology allows detection of nucleotide modifications (such as cytosine methylation). This happens through the observation of polymerase kinetics. This approach allows reads of 20,000 nucleotides or more, with average read lengths of 5 kilobases.
B. Additional Assay Methods
[0073] In some aspects, methods involve amplifying and/or sequencing one or more target genomic regions using at least one pair of primers specific to the target genomic regions. In certain aspects, the primers are heptamers. In certain aspects, enzymes such as primases or primase/polymerase combination enzymes are added to the amplification step to synthesize primers.
[0074] In some aspects, arrays can be used to detect nucleic acids of the disclosure. An array comprises a solid support with nucleic acid probes attached to the support. Arrays typically comprise a plurality of different nucleic acid probes that are coupled to a surface of a substrate in different, known locations. These arrays, also described as "microarrays" or colloquially "chips", have been generally described in the art (for example, U.S. Pat. Nos. 5,143,854, 5,445,934, 5,744,305, 5,677,195, 6,040,193, 5,424,186 and Fodor et al., 1991), each of which is incorporated by reference in its entirety for all purposes. Techniques for the synthesis of these arrays using mechanical synthesis methods are described in, e.g., U.S. Pat. No. 5,384,261, incorporated herein by reference in its entirety for all purposes. Although a planar array surface is used in certain aspects, the array may be fabricated on a surface of virtually any shape or even a multiplicity of surfaces. Arrays may be nucleic acids on beads, gels, polymeric surfaces, fibers such as fiber optics, glass or any other appropriate substrate; see U.S. Pat. Nos. 5,770,358, 5,789,162, 5,708,153, 6,040,193 and 5,800,992, which are hereby incorporated in their entirety for all purposes.
[0075] In addition to the use of arrays and microarrays, it is contemplated that a number of different assays could be employed to analyze nucleic acids. Such assays include, but are not limited to, nucleic acid amplification, polymerase chain reaction, quantitative PCR, RT-PCR, in situ hybridization, digital PCR, ddPCR (droplet digital PCR), nCounter® (nanoString®), BEAMing (Beads, Emulsions, Amplifications, and Magnetics) (Inostics), ARMS (Amplification Refractory Mutation Systems), RNA-Seq, TAm-Seq (Tagged-Amplicon deep sequencing), PAP (Pyrophosphorolysis-activation polymerization), next generation RNA sequencing, northern hybridization, hybridization protection assay (HPA) (GenProbe), branched DNA (bDNA) assay (Chiron), rolling circle amplification (RCA), single molecule hybridization detection (US Genomics), Invader assay (ThirdWave Technologies), and/or Bridge Litigation Assay (Genaco).
[0076] Amplification primers or hybridization probes can be prepared to be complementary to a genomic region, biomarker, probe, or oligo described herein. The term "primer" as used herein, is meant to encompass any nucleic acid that is capable of priming the synthesis of a nascent nucleic acid in a template-dependent process and/or pairing with a single strand of an oligo of the disclosure, or portion thereof. Typically, primers are oligonucleotides from ten to twenty and/or thirty nucleotides in length, but longer sequences can be employed. Primers may be provided in double-stranded and/or single-stranded form, although the single-stranded form is preferred.
[0077] The use of a primer of between 13 and 100 nucleotides, particularly between 17 and 100 nucleotides in length, or in some aspects up to 1-2 kilobases or more in length, allows the formation of a duplex molecule that is both stable and selective. Molecules having complementary sequences over contiguous stretches greater than 20 bases in length may be used to increase stability and/or selectivity of the hybrid molecules obtained. One may design nucleic acid molecules for hybridization having one or more complementary sequences of 20 to 30 nucleotides, or even longer where desired. Such fragments may be readily prepared, for example, by directly synthesizing the fragment by chemical means or by introducing selected sequences into recombinant vectors for recombinant production.
[0078] In some aspects, each probe/primer comprises at least 15 nucleotides. For instance, each probe can comprise at least or at most 20, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 400 or more nucleotides (or any range derivable therein). They may have these lengths and have a sequence that is identical or complementary to a gene described herein. Particularly, each probe/primer has relatively high sequence complexity and does not have any ambiguous residue (undetermined "n" residues). The probes/primers can hybridize to the target gene, including its RNA transcripts, under stringent or highly stringent conditions. It is contemplated that probes or primers may have inosine or other design implementations that accommodate recognition of more than one human sequence for a particular biomarker.
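As an illustration only, the length and ambiguity criteria of paragraph [0078] can be sketched as a simple sequence check. The function name check_probe, the accepted alphabet, and the example sequences are assumptions made for this sketch and are not part of the disclosure. Probes that deliberately contain inosine, which the paragraph also contemplates, would be reported here as containing an unexpected character.

```python
# Illustrative sketch only: names and the accepted alphabet are assumptions.
VALID_BASES = set("ACGTU")


def check_probe(seq, min_len=15):
    """Return a list of problems found in a candidate probe/primer sequence."""
    seq = seq.upper()
    problems = []
    # Each probe/primer comprises at least 15 nucleotides.
    if len(seq) < min_len:
        problems.append(f"too short: {len(seq)} nt (minimum {min_len})")
    # No ambiguous (undetermined "n") residues are allowed.
    if "N" in seq:
        problems.append(f"ambiguous residues: {seq.count('N')} N")
    # Anything outside A/C/G/T/U (for example inosine, "I") is reported.
    other = sorted(set(seq) - VALID_BASES - {"N"})
    if other:
        problems.append("unexpected characters: " + "".join(other))
    return problems


if __name__ == "__main__":
    print(check_probe("ACGTTGCAAGCTTGCATGCA"))  # hypothetical 20-nt probe
    print(check_probe("ACGTNNACGT"))            # short and ambiguous
```

An empty list means the candidate passed both checks. Sequence complexity and hybridization stringency are not assessed by this sketch.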
[0079] For applications requiring high selectivity, one will typically desire to employ relatively high stringency conditions to form the hybrids. For example, one may employ relatively low salt and/or high temperature conditions, such as provided by about 0.02 M to about 0.10 M NaCl at temperatures of about 50°C to about 70°C. Such high stringency conditions tolerate little, if any, mismatch between the probe or primers and the template or target strand and would be particularly suitable for isolating specific genes or for detecting specific mRNA transcripts. It is generally appreciated that conditions can be rendered more stringent by the addition of increasing amounts of formamide.
[0080] In some aspects, quantitative RT-PCR (such as but not limited to TaqMan™, ABI) is used for detecting and comparing the levels or abundance of nucleic acids in samples. The concentration of the target DNA in the linear portion of the PCR process is proportional to the starting concentration of the target before the PCR was begun. By determining the concentration of the PCR products of the target DNA in PCR reactions that have completed the same number of cycles and are in their linear ranges, it is possible to determine the relative concentrations of the specific target sequence in the original DNA mixture. This direct proportionality between the concentration of the PCR products and the relative abundances in the starting material is true in the linear range portion of the PCR reaction. The final concentration of the target DNA in the plateau portion of the curve is determined by the availability of reagents in the reaction mix and is independent of the original concentration of target DNA. Therefore, the sampling and quantifying of the amplified PCR products may be carried out when the PCR reactions are in the linear portion of their curves. In addition, relative concentrations of the amplifiable DNAs may be normalized to some independent standard/control, which may be based on either internally existing DNA species or externally introduced DNA species. The abundance of a particular DNA species may also be determined relative to the average abundance of all DNA species in the sample.
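The proportionality described in paragraph [0080] can be illustrated with a short numerical sketch. It assumes a simplified model in which, before the plateau, the amount of product after c cycles is the starting amount multiplied by a constant per-cycle amplification factor raised to the power c. The function names, the amplification factor of 2.0, and all quantities are hypothetical values chosen for illustration.

```python
# Illustrative sketch only: a simplified amplification model with assumed values.
import math


def product_after_cycles(start_amount, cycles, amplification_factor=2.0):
    """Product amount before the plateau, assuming a constant per-cycle factor."""
    return start_amount * amplification_factor ** cycles


def relative_abundance(target_a, target_b, standard_a=1.0, standard_b=1.0):
    """Ratio of target in sample A to sample B, each normalized to its standard."""
    return (target_a / standard_a) / (target_b / standard_b)


if __name__ == "__main__":
    cycles = 20  # both reactions are sampled after the same number of cycles

    # Sample A starts with four times as much target as sample B.
    target_a = product_after_cycles(400.0, cycles)
    target_b = product_after_cycles(100.0, cycles)
    print(relative_abundance(target_a, target_b))

    # Sample A also contains twice as much total input, as reported by an
    # internal standard, so the normalized ratio is lower than the raw ratio.
    standard_a = product_after_cycles(2000.0, cycles)
    standard_b = product_after_cycles(1000.0, cycles)
    normalized = relative_abundance(target_a, target_b, standard_a, standard_b)
    print(math.isclose(normalized, 2.0))
```

Because both reactions have completed the same number of cycles, the amplification term cancels and the product ratio equals the starting ratio. This no longer holds in the plateau, where product is limited by reagent availability.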
[0081] In some aspects, the PCR amplification utilizes one or more internal PCR standards. The internal standard may be an abundant housekeeping gene in the cell or it can specifically be GAPDH, GUSB or β-2 microglobulin. These standards may be used to normalize expression levels so that the expression levels of different gene products can be compared directly. A person of ordinary skill in the art would know how to use an internal standard to normalize expression levels.
[0082] A problem inherent in some samples is that they are of variable quantity and/or quality. This problem can be overcome if the RT-PCR is performed as a relative quantitative RT-PCR with an internal standard in which the internal standard is an amplifiable DNA fragment that is similar in size to or larger than the target DNA fragment and in which the abundance of the DNA representing the internal standard is roughly 5-100 fold higher than the DNA representing the target nucleic acid region.
[0083] In some aspects, the relative quantitative RT-PCR uses an external standard protocol. Under this protocol, the PCR products are sampled in the linear portion of their amplification curves. The number of PCR cycles that are optimal for sampling can be empirically determined for each target DNA fragment. In addition, the nucleic acids isolated from the various samples can be normalized for equal concentrations of amplifiable DNAs.
[0084] A nucleic acid array can comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 150, 200, 250 or more different polynucleotide probes, which may hybridize to different and/or the same biomarkers. Multiple probes for the same gene can be used on a single nucleic acid array. Probes for other disease genes can also be included in the nucleic acid array. The probe density on the array can be in any range. In some aspects, the density may be or may be at least 50, 100, 200, 300, 400, 500 or more probes/cm² (or any range derivable therein).
[0085] Specifically contemplated are chip-based nucleic acid technologies such as those described by Hacia et al. (1996) and Shoemaker et al. (1996). Briefly, these techniques involve quantitative methods for analyzing large numbers of genes rapidly and accurately. By tagging genes with oligonucleotides or using fixed probe arrays, one can employ chip technology to segregate target molecules as high density arrays and screen these molecules on the basis of hybridization (see also, Pease et al., 1994; and Fodor et al., 1991). It is contemplated that this technology may be used in conjunction with evaluating the expression level of one or more cancer biomarkers with respect to diagnostic, prognostic, and treatment methods.
[0086] Certain aspects may involve the use of arrays or data generated from an array. Data may be readily available. Moreover, an array may be prepared in order to generate data that may then be used in correlation studies.
II. Machine Learning
[0087] FIG. 13 is a block diagram illustrating an example process 100 for predicting disease according to non-limiting aspects of the present disclosure. As illustrated, process 100 includes a number of enumerated steps, but aspects of process 100 may include additional steps before, after, and in between the enumerated steps. In some aspects, one or more of the enumerated steps may be omitted or performed in a different order. Process 100, which may comprise a training phase 100A and an application phase 100B, may be performed by one or more computing devices. For example, process 100 may be performed by one or more processors based on computer-executable or machine readable instructions stored in a memory of the one or more computing devices. In some aspects, the training phase 100A may be performed by a computing device separate or distinct from the computing device performing the application phase 100B, for example, to conserve computer resources and/or bandwidth.
[0088] In various aspects, the training phase 100A may involve receiving reference data from reference patient populations, including patient populations known to have a certain disease, such as cancer or any other disease described herein. In various aspects, the reference data also includes data on the microbiome profile of a reference patient population. In some aspects, the reference data comprises data on one or more RNA modifications in RNA obtained from a reference patient, including cell-free RNA obtained from a reference patient. In some aspects, the reference data may be received from disparate sources, such as other computing systems, for example, electronic health record systems, clinical data management systems, sample analytical systems, or bioreactor system of network environment, or databases and/or repositories, for example, the patient health record database. In some aspects, received reference data may be linked together appropriately, for example, as corresponding to a reference patient.
[0089] At block 104, the computing device may vectorize the reference data to generate reference input feature vectors and reference output feature vectors. In some aspects, each reference input feature vector may be associated with a respective reference patient, and each reference output feature vector may be associated with one or more diseases or disease attributes. Thus, each reference input feature vector may be paired with a respective reference output feature vector. The vectorization may result in a reference input feature vector comprising composite data inputs for each of a plurality of input parameters. In some aspects, redundant or unnecessary parameters may be removed, for example, for dimensionality reduction of the reference input feature vector. The dimensionality reduction may enhance the speed of the machine learning model being trained or may be used to overcome issues of overfitting.
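A minimal sketch of the vectorization at block 104 is given below, under stated assumptions. The record layout, the field names "features" and "disease", the feature names such as "mod_site_1", and the rule of dropping parameters that never vary are all hypothetical choices made for illustration. The disclosure does not specify a particular data format or dimensionality reduction technique.

```python
# Illustrative sketch only: record layout and feature names are hypothetical.

def vectorize(records, feature_names, disease_labels):
    """Build paired input and output feature vectors, one pair per patient."""
    inputs, outputs = [], []
    for record in records:
        # Input vector: one value per input parameter (0.0 if not measured).
        inputs.append([float(record["features"].get(name, 0.0))
                       for name in feature_names])
        # Output vector: one-hot encoding of the known disease status.
        outputs.append([1.0 if record["disease"] == label else 0.0
                        for label in disease_labels])
    return inputs, outputs


def drop_constant_features(inputs, feature_names):
    """Crude dimensionality reduction: remove parameters that never vary."""
    keep = [j for j in range(len(feature_names))
            if len({row[j] for row in inputs}) > 1]
    reduced = [[row[j] for j in keep] for row in inputs]
    return reduced, [feature_names[j] for j in keep]


if __name__ == "__main__":
    records = [
        {"patient_id": "R1", "disease": "cancer",
         "features": {"mod_site_1": 0.8, "mod_site_2": 0.1, "mod_site_3": 0.5}},
        {"patient_id": "R2", "disease": "healthy",
         "features": {"mod_site_1": 0.2, "mod_site_2": 0.4, "mod_site_3": 0.5}},
    ]
    names = ["mod_site_1", "mod_site_2", "mod_site_3"]
    x, y = vectorize(records, names, ["cancer", "healthy"])
    x_reduced, kept = drop_constant_features(x, names)
    print(kept, x_reduced, y)
```

In this sketch the third parameter is identical in every reference patient and is removed, which shortens each input vector without losing discriminating information.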
[0090] At block 106, the computing device may associate the reference input feature vectors with reference output feature vectors on a machine learning model. For example, for each pair of reference input feature vector (representing input parameters for a respective CAR-T cell drug production process from a respective reference patient) and reference output feature vector (representing one or more attributes of the respective CAR-T cell drug that is produced), the input feature vector may be inputted within the machine learning model with randomized or initialized weights and/or biases for each input parameter represented by the reference input feature vector. The machine learning model may be structured to allow the weights to be iteratively adjusted through an error minimization process as the relation between the reference input feature vector and the respective reference output feature vector is determined. For example, for a neural network, the input feature vector may be aligned along an input layer of the neural network, whereas the output feature vector may be aligned along an output layer separated from the input layer by one or more hidden layers. Each layer may comprise one or more nodes that may involve an activation function. The aforementioned weights may be assigned to the various nodes of the input layer.
[0091] At block 106, the computing device may train the machine learning model to iteratively minimize error within a predetermined threshold. For example, the training module of a computing device may train the machine learning model by iteratively minimizing errors in determining a relation between parameters represented by the reference input feature vector and the reference output feature vector. The relation may be represented by the set of weights assigned to the parameters represented by the input feature vector. The initial set of weights for the parameters of the input feature vector may be tested for how correctly the set of weights attributes the significance of various parameters in their ability to predict the one or more diseases or disease attributes represented by the reference output feature vector. Each prediction may be quantitative and/or binary data that is compared to the known data for the one or more attributes. If the difference does not fall below a predetermined threshold or tolerance, an iterative process occurs involving a new set of weights for the parameters. The training involves determining a correct set of weights for the input parameters of the input feature vector. Each weight may indicate a significance of a parameter associated with the weight in the parameter’s ability to predict the one or more diseases or disease attributes indicated by the output feature vector. The training process may occur over
[0092] At block 108, the computing device may output the trained machine learning model comprising the finalized set of weights indicating a relation between the input parameters and the one or more diseases or disease attributes. For example, the trained machine learning model may be stored in a memory or may otherwise be accessible to the computing device that performed the training or to another computing device. Also or alternatively, the trained machine learning model may be stored in a local or remote server that may be accessed by a computing device performing the application phase 100B.
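The iterative weight adjustment of the training phase can be sketched as follows. This is a minimal illustration, assuming a single-layer logistic model in place of the neural network mentioned above, weights initialized to zero, a mean squared error compared against a tolerance as the stopping rule, and gradient descent as the update. The function names, learning rate, tolerance, and the four reference vectors are hypothetical.

```python
# Illustrative sketch only: a single-layer logistic model with assumed settings.
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(inputs, labels, learning_rate=0.5, tolerance=0.05, max_iter=10000):
    """Adjust weights iteratively until the error falls below the tolerance."""
    n_features = len(inputs[0])
    weights = [0.0] * n_features  # initialized weights, one per input parameter
    bias = 0.0
    error = float("inf")
    for _ in range(max_iter):
        preds = [sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
                 for row in inputs]
        # Compare predictions with the known data for each reference patient.
        error = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
        if error < tolerance:
            break
        # Otherwise compute a new set of weights (gradient of the logistic
        # cross-entropy loss; the squared error is only the stopping rule).
        for j in range(n_features):
            grad = sum((p - y) * row[j]
                       for p, y, row in zip(preds, labels, inputs)) / len(labels)
            weights[j] -= learning_rate * grad
        bias -= learning_rate * sum(p - y for p, y in zip(preds, labels)) / len(labels)
    return weights, bias, error


if __name__ == "__main__":
    # Hypothetical reference input vectors and known disease status (1 = disease).
    reference_inputs = [[0.8, 0.1], [0.7, 0.2], [0.2, 0.4], [0.1, 0.5]]
    reference_labels = [1.0, 1.0, 0.0, 0.0]
    weights, bias, error = train(reference_inputs, reference_labels)
    print(weights, bias, error)
```

The returned weights and bias play the role of the finalized set of weights output at block 108. A parameter with a larger absolute weight contributes more strongly to the prediction in this simplified model.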
[0093] In various aspects, the application phase 100B may involve a computing device having a processor receiving unstructured sequence data from a target patient having or suspected of having a disease. The target patient may be distinguishable from a reference patient as the target patient has unknown data for attributes that are otherwise predicted using the systems and methods presented herein. As used herein, the reference patient may refer to a patient for whom data for diseases may already be known. Thus, reference patients as well as the attributes for disease may be applicable for the training phase 100A, whereas the target patient may be applicable for the application phase 100B.
[0094] At block 110, the computing device may receive sequencing data from a patient to be analyzed, including a patient having or suspected of having a disease. The sequencing data may include RNA sequencing data. The sequencing data may include data on one or more modifications to RNA obtained from the patient. The RNA obtained from the patient may be cell-free RNA.
[0095] At block 114, the computing device may apply the input feature vector to the trained machine learning model (e.g., from block 108) to generate an output feature vector predicting disease in the patient. As previously discussed, the trained machine learning model may have a stored set of weights that indicate the capability for each of a plurality of parameters towards predicting the disease. The plurality of parameters may include, comprise, and/or correspond to the parameters represented by the input feature vector. Thus, the input feature vector may be associated with the set of weights in the trained machine learning model to generate the output feature vector predicting data for the one or more attributes.
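The application step at block 114 can be sketched in the same simplified setting. The stored weights, the bias, the target patient vectors, and the function name predict are hypothetical, and a single-layer logistic model is assumed rather than any particular trained model from the disclosure.

```python
# Illustrative sketch only: the stored model and input vectors are hypothetical.
import math


def predict(weights, bias, input_vector):
    """Combine the input feature vector with stored weights into a score."""
    if len(weights) != len(input_vector):
        raise ValueError("input vector does not match the model's parameters")
    z = sum(w * x for w, x in zip(weights, input_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # score between 0 and 1


if __name__ == "__main__":
    # Stands in for a trained model retrieved from memory or a server.
    stored_model = {"weights": [4.0, -3.0], "bias": -0.5}
    target_patient_vector = [0.75, 0.15]
    score = predict(stored_model["weights"], stored_model["bias"],
                    target_patient_vector)
    print("predicted disease" if score >= 0.5 else "predicted no disease", score)
```

The 0.5 decision cutoff is an assumption of this sketch. A quantitative output could instead be reported directly as the output feature vector.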
III. Methods of Use
A. Clinical and diagnostic applications
[0096] The methods of the disclosure may be useful for evaluating RNA modifications for clinical and/or diagnostic purposes. Certain aspects relate to methods for evaluating RNA. Certain aspects relate to a method for evaluating a sample comprising RNA molecules. The evaluation may be the detection or determination of a particular RNA modification or the differential detection or determination of a particular modification.
[0097] The sample may be from a biopsy such as from fine needle aspiration, core needle biopsy, vacuum assisted biopsy, incisional biopsy, excisional biopsy, punch biopsy, shave biopsy or skin biopsy. In certain aspects, the sample is obtained from a biopsy from cancerous tissue by any of the biopsy methods previously mentioned. The sample may be obtained from any source including but not limited to blood, plasma, or serum. In certain aspects, the cyst, tumor or neoplasm is colorectal. In certain aspects of the current methods, any medical professional such as a doctor, nurse or medical technician may obtain a biological sample for testing. Yet further, in certain aspects the biological sample can be obtained without the assistance of a medical professional.
[0098] A sample may include but is not limited to, tissue, cells, or biological material from cells or derived from cells of a subject. In some aspects, the sample comprises cell-free RNA. In some aspects, the sample comprises a fertilized egg, a zygote, a blastocyst, or a blastomere. The biological sample may be a heterogeneous or homogeneous population of cells or tissues. The biological sample may be obtained using any method known to the art that can provide a sample suitable for the analytical methods described herein. The sample may be obtained by non-invasive methods including but not limited to: scraping of the skin or cervix, swabbing of the cheek, saliva collection, urine collection, feces collection, collection of menses, tears, or semen.
[0099] In some aspects, the methods of the disclosure can be used in the discovery of novel biomarkers for a disease or condition. In some aspects, the methods of the disclosure can be performed on a sample from a patient to provide a prognosis for a certain disease or condition in the patient. In some aspects, the methods of the disclosure can be performed on a sample from a patient to predict the patient’s response to a particular therapy. In some aspects, the disease comprises a cancer. For example, the cancer may be pancreatic cancer, colon cancer, acute myeloid leukemia, adrenocortical carcinoma, AIDS-related cancers, AIDS-related lymphoma, anal cancer, appendix cancer, astrocytoma, childhood cerebellar or cerebral basal cell carcinoma, bile duct cancer, extrahepatic bladder cancer, bone cancer, osteosarcoma/malignant fibrous histiocytoma, brainstem glioma, brain tumor, cerebellar astrocytoma brain tumor, cerebral astrocytoma/malignant glioma brain tumor, ependymoma brain tumor, medulloblastoma brain tumor, supratentorial primitive neuroectodermal tumors brain tumor, visual pathway and hypothalamic glioma, breast cancer, lymphoid cancer, bronchial adenomas/carcinoids, tracheal cancer, Burkitt lymphoma, carcinoid tumor, childhood carcinoid tumor, gastrointestinal carcinoma of unknown primary, central nervous system lymphoma, primary cerebellar astrocytoma, childhood cerebral astrocytoma/malignant glioma, childhood cervical cancer, childhood cancers, chronic lymphocytic leukemia, chronic myelogenous leukemia, chronic myeloproliferative disorders, cutaneous T-cell lymphoma, desmoplastic small round cell tumor, endometrial cancer, ependymoma, esophageal cancer, Ewing's, childhood extragonadal germ cell tumor, extrahepatic bile duct cancer, eye cancer, intraocular melanoma eye cancer, retinoblastoma, gallbladder cancer, gastric (stomach) cancer, gastrointestinal carcinoid tumor, gastrointestinal stromal tumor (GIST), germ cell tumor: extracranial, extragonadal, or
ovarian, gestational trophoblastic tumor, glioma of the
brain stem, glioma, childhood cerebral astrocytoma, childhood visual pathway and hypothalamic glioma, gastric carcinoid, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular (liver) cancer, Hodgkin lymphoma, hypopharyngeal cancer, hypothalamic and visual pathway glioma, childhood intraocular melanoma, islet cell carcinoma (endocrine pancreas), kaposi sarcoma, kidney cancer (renal cell cancer), laryngeal cancer , leukemia, acute lymphoblastic (also called acute lymphocytic leukemia) leukemia, acute myeloid (also called acute myelogenous leukemia) leukemia, chronic lymphocytic (also called chronic lymphocytic leukemia) leukemia, chronic myelogenous (also called chronic myeloid leukemia) leukemia, hairy cell lip and oral cavity cancer, liposarcoma, liver cancer (primary), non-small cell lung cancer, small cell lung cancer, lymphomas, AIDS-related lymphoma, Burkitt lymphoma, cutaneous T-cell lymphoma, Hodgkin lymphoma, Non-Hodgkin (an old classification of all lymphomas except Hodgkin's) lymphoma, primary central nervous system lymphoma, Waldenstrom macroglobulinemia, malignant fibrous histiocytoma of bone/osteosarcoma, childhood medulloblastoma, melanoma, intraocular (eye) melanoma, merkel cell carcinoma, adult malignant mesothelioma, childhood mesothelioma, metastatic squamous neck cancer, mouth cancer, multiple endocrine neoplasia syndrome, multiple myeloma/plasma cell neoplasm, mycosis fungoides, myelodysplastic syndromes, myelodysplastic/myeloproliferative diseases, chronic myelogenous leukemia, adult acute myeloid leukemia, childhood acute myeloid leukemia, multiple myeloma, chronic myeloproliferative disorders, nasal cavity and paranasal sinus cancer, nasopharyngeal carcinoma, neuroblastoma, oral cancer, oropharyngeal cancer, osteosarcoma/malignant, fibrous histiocytoma of bone, ovarian cancer, ovarian epithelial cancer (surface epithelial- stromal tumor), ovarian germ cell tumor, ovarian low malignant potential tumor, pancreatic cancer, islet 
cell paranasal sinus and nasal cavity cancer, parathyroid cancer, penile cancer, pharyngeal cancer, pheochromocytoma, pineal astrocytoma, pineal germinoma, pineoblastoma and supratentorial primitive neuroectodermal tumors, childhood pituitary adenoma, plasma cell neoplasia/multiple myeloma, pleuropulmonary blastoma, primary central nervous system lymphoma, prostate cancer, rectal cancer, renal cell carcinoma (kidney cancer), renal pelvis and ureter transitional cell cancer, retinoblastoma, rhabdomyosarcoma, childhood Salivary gland cancer Sarcoma, Ewing family of tumors, Kaposi sarcoma, soft tissue sarcoma, uterine sezary syndrome sarcoma, skin cancer (nonmelanoma), skin cancer (melanoma), skin carcinoma, Merkel cell small cell lung cancer, small intestine cancer, soft tissue sarcoma, squamous cell carcinoma, squamous neck cancer with occult primary, metastatic stomach cancer, supratentorial primitive neuroectodermal tumor, childhood T-cell lymphoma, testicular
cancer, throat cancer, thymoma, childhood thymoma, thymic carcinoma, thyroid cancer, urethral cancer, uterine cancer, endometrial uterine sarcoma, vaginal cancer, visual pathway and hypothalamic glioma, childhood vulvar cancer, and wilms tumor (kidney cancer).
[0100] In some aspects, the cancer comprises ovarian, prostate, colon, or lung cancer. In some aspects, the method is for determining novel biomarkers for ovarian, prostate, colon, or lung cancer by evaluating cell-free RNA using methods of the disclosure.
[0101] In some aspects, methods disclosed herein are performed on RNA that is at a low input concentration. In some aspects, a low input RNA concentration is at about or below about 0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5, or 15 nanograms, or any range derivable therein. In some aspects, a low input RNA concentration is at about 1 to 10 ng, 5 to 10 ng, 10 to 50 ng, or 10 to 100 ng total RNA. In some aspects, a low input concentration of RNA is obtained from about or less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 125, 150, 175, 200, 250, 300, 350, 400, 450, or 500 cells.
IV. Sample Preparation
[0102] In certain aspects, methods involve obtaining a sample (also “biological sample”) from a subject. In some aspects, the biological sample is a blood sample. In some aspects, the biological sample is a plasma sample. The sample may be obtained by methods known in the art. In certain aspects the sample is obtained by swabbing, endoscopy, scraping, phlebotomy, or any other methods known in the art. In some cases, the sample may be obtained, stored, or transported using components of a kit of the present methods. In some cases, multiple samples may be obtained for diagnosis by the methods described herein. In other cases, multiple samples, such as one or more samples from one tissue and one or more samples from another specimen (for example serum) may be obtained for diagnosis by the methods. In some cases, multiple samples such as one or more samples from one tissue type and one or more samples from another specimen (e.g. serum) may be obtained at the same or different times. Samples obtained at different times may be stored and/or analyzed by different methods. For example, a sample may be obtained and analyzed by routine staining methods or any other cytological analysis methods.
[0103] In some aspects, the biological sample may be obtained by a physician, nurse, or other medical professional such as a medical technician, endocrinologist, cytologist, phlebotomist, radiologist, or a pulmonologist. The medical professional may indicate the appropriate test or assay to perform on the sample. In certain aspects a molecular profiling business may consult on which assays or tests are most appropriately indicated. In further aspects of the current methods, the patient or subject may obtain a biological sample for testing without the assistance of a medical professional, such as obtaining a whole blood sample, a urine sample, a fecal sample, a buccal sample, or a saliva sample.
[0104] General methods for obtaining biological samples are also known in the art. Publications such as Ramzy, Ibrahim, Clinical Cytopathology and Aspiration Biopsy, 2001, which is herein incorporated by reference in its entirety, describe general methods for biopsy and cytological methods. In some aspects, the sample is a fine needle aspirate of a tissue or a suspected tumor or neoplasm. In some cases, the fine needle aspirate sampling procedure may be guided by the use of an ultrasound, X-ray, or other imaging device.
[0105] In some aspects of the present methods, the molecular profiling business may obtain the biological sample from a subject directly, from a medical professional, from a third party, or from a kit provided by a molecular profiling business or a third party. In some cases, the biological sample may be obtained by the molecular profiling business after the subject, a medical professional, or a third party acquires and sends the biological sample to the molecular profiling business. In some cases, the molecular profiling business may provide suitable containers, and excipients for storage and transport of the biological sample to the molecular profiling business.
[0106] In some aspects of the methods described herein, a medical professional need not be involved in the initial diagnosis or sample acquisition. An individual may alternatively obtain a sample through the use of an over the counter (OTC) kit. An OTC kit may contain a means for obtaining said sample as described herein, a means for storing said sample for inspection, and instructions for proper use of the kit. In some cases, molecular profiling services are included in the price for purchase of the kit. In other cases, the molecular profiling services are billed separately. A sample suitable for use by the molecular profiling business may be any material containing tissues, cells, nucleic acids, genes, gene fragments, expression products, gene expression products, or gene expression product fragments of an individual to be tested. Methods for determining sample suitability and/or adequacy are provided.
[0107] In some aspects, the subject may be referred to a specialist such as an oncologist, surgeon, or endocrinologist. The specialist may likewise obtain a biological sample for testing or refer the individual to a testing center or laboratory for submission of the biological sample. In some cases the medical professional may refer the subject to a testing center or laboratory for submission of the biological sample. In other cases, the subject may provide the sample. In some cases, a molecular profiling business may obtain the sample.
V. Kits
[0108] Also disclosed herein are kits, which may be useful for performing the methods of the disclosure. The contents of a kit can include one or more reagents described throughout the disclosure and/or one or more reagents known in the art for performing one or more steps described throughout the disclosure. For example, the kits may include one or more of the following: nuclease-free water, one or more primers, polyethylene glycol, magnetic beads, DNA polymerase, taq polymerase, DNA ligase, RNA ligase, a reverse transcriptase, dNTPs, DNA polymerase buffer, RNA polymerase, DTT, redox reagent, Mg2+, K+, adaptors, DNA adaptors, DNA comprising an RNA promoter, a protease, an alkaline solution, a sodium hydroxide solution, and NTPs. Any one or more of the preceding components may be excluded from a kit in certain aspects of the present disclosure. In some aspects, a kit of the disclosure does not comprise sodium bisulfite or added sodium bisulfite. In some aspects, a kit of the disclosure does not comprise ammonium sulfite or added ammonium sulfite.
[0109] In certain aspects, a kit of the disclosure comprises instructions for processing a nucleic acid sample, such as a DNA sample or an RNA sample. Instructions may comprise instructions for using one or more components of the kit in a method disclosed herein.
[0110] One or more reagents can be supplied in a solid form or liquid buffer that is suitable for inventory storage, and later for addition into the reaction medium when the method of using the reagent is performed. Suitable packaging is provided. The kit may provide additional components that are useful in the procedure. These additional components may include buffers, capture reagents, developing reagents, labels, reacting surfaces, means for detection, control samples, instructions, and interpretive information.
[0111] Any components of a kit described herein may be used in a method disclosed herein. Further, components described in the context of a disclosed method may be provided in a kit of the present disclosure.
Examples
[0112] The following examples are included to demonstrate certain embodiments of the invention. It should be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function well in the practice of the invention, and thus can be considered to constitute certain modes for its practice. However, those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit and scope of the invention.
Example 1 - RNA Modification Signature for Cancer
[0113] Shown below are measured values of the RNA modifications of Table A in cancer.
*FC (cancer/control)
** Mutation in control
*** Mutation in cancer
Example 2 - Epitranscriptomic markers in microbiome-derived cell-free RNA from plasma for colorectal cancer detection
[0114] The presence of site-specific RNA modifications and changes in their stoichiometry add new layers of information beyond RNA sequences. With the development of new methods for sequencing RNA modifications17-22, the inventors hypothesized that the changes in RNA modifications from tumor- or tumor microenvironment (TME)-derived transcripts could be distinct from those in non-tumor cells. The inventors therefore investigated cfRNA modifications in clinical samples. The inventors observed that a stable amount of total RNA (~5-10 ng) could be isolated from 1 mL of plasma. Short RNA fragments (~50 nt) are the predominant forms in the isolated cfRNAs, with sizes similar to those of tRNAs (Fig. 1a). tRNAs are highly structured and more resistant to nuclease-mediated degradation than other RNA species. Based on the observed RNA size profiles, other structured non-coding RNAs (ncRNAs) or tRNA fragments (tRFs) may also be present (Fig. 1a). These RNAs could be bound by tRNA synthetases or RNA-binding proteins (RBPs), providing additional protection against degradation in plasma. Given that tRNAs are known to be heavily modified, the inventors speculated that changes in tRNA modification stoichiometry might offer additional clues for human disease diagnosis (e.g., detection of early-stage cancer).
[0115] The numerous internal methylations (frequent in tRNAs) may obstruct the read-through of reverse transcriptase; however, these methylation patterns in cellular tRNAs carry valuable information23,24. The inventors have previously reported a method to sequence these methylations that block Watson-Crick base pairing21. The inventors have subsequently developed Low-Input Multiple Methylation Sequencing (LIME-seq), which features simultaneous detection of site-specific RNA modifications (m1A, m3C, m1G, m22G, m3U, inosine, etc.) within multiple RNA species in ultralow-input cfRNAs and monitors the stoichiometry changes of these methylations (Fig. 1b)21. LIME-seq maps and quantifies these modifications via base mutations (base misincorporation) generated during reverse transcription (RT) using human immunodeficiency virus reverse transcriptase (HIV-RT). LIME-seq employs cDNA 3'-ligation after RT and utilizes HIV-RT to ensure read-through of RNA methylations while inducing mutation signatures for quantification25. These features allow the inventors to measure the abundance of individual small RNAs with greater precision, and to quantitatively reveal the landscape of major RNA methylations on cfRNAs. The RNA/cDNA ligation strategy in LIME-seq also ensures the capture of all short RNA species in plasma, which are often lost in typical RNA-seq libraries that use commercial kits. To validate LIME-seq, the inventors applied it to small RNAs (<200 nt) isolated from HepG2 cells. For cytoplasmic tRNAs, LIME-seq detected two tRNAs with m1A9, forty-four with m1A58, seven with m3C32, ten with m1G9, ten with m1G37, and twenty-one with m22G26, along with modification stoichiometry information (Extended Data Fig. 1a-f)21,26,27. Intact read coverage profiles were observed at tRNA regions through LIME-seq, demonstrating its capability to measure tRNA abundance as well (Extended Data Fig. 1g). The inventors also performed LIME-seq using 1.5-50 ng of HepG2 small RNA and confirmed high consistency between technical replicates (r = 0.990) and minimal effect from RNA input variation (r > 0.96 between 50 ng and 1.5 ng, Extended Data Fig. 1h-l).
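The quantification idea described above, in which a modification is read out as base misincorporation during RT, can be illustrated with a minimal sketch. It assumes that the per-site mutation ratio is simply the fraction of aligned reads whose base differs from the reference; the function name and the pileup values are hypothetical, and the source does not state how the observed ratio is calibrated to true stoichiometry.

```python
from collections import Counter

def mutation_ratio(ref_base, observed_bases):
    """Fraction of reads at one site whose base differs from the reference.

    Assumption: the misincorporation ratio is used directly as a proxy for
    modification level; no calibration curve is applied here.
    """
    counts = Counter(b.upper() for b in observed_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return None  # no usable coverage at this site
    mismatches = depth - counts[ref_base.upper()]
    return mismatches / depth

# Hypothetical pileup at a site whose reference base is A: 100 reads total,
# 60 matching the reference and 40 carrying a misincorporated base.
pileup = "A" * 60 + "T" * 25 + "G" * 10 + "C" * 5
ratio = mutation_ratio("A", pileup)
print(f"mutation ratio = {ratio:.2f}")
```

In practice a minimum read depth would normally be required before a site's ratio is trusted; that threshold is not specified in the text.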
[0116] The inventors next applied LIME-seq to cfRNAs isolated from 36 non-cancer human plasma samples. Notably, 19 out of the top 50 highly expressed transcripts were cytoplasmic tRNAs, which accounted for ~13% of the total reads mapped to the human genome (hg38). This finding supports the notion that tRNAs are a main component of cfRNA in human plasma (Fig. 1c-d). LIME-seq also captured abundant RNA fragments from rRNAs, Alu elements, LINEs, snRNAs, mRNAs, and lncRNAs. The inventors observed intact read coverage profiles in most annotated cytoplasmic tRNA regions (including mitochondrial tRNA regions, Fig. 1g), indicating the preservation of cell-free tRNAs in human plasma. In contrast, only fragments were detected for other abundant cfRNA species, such as snRNAs, Y RNAs, 7SK RNAs, and rRNAs (Fig. 1e-f and Extended Data Fig. 2a-e). The presence of intact tRNAs aligns with their highly structured nature or potential protection by tRNA synthetases. Furthermore, the inventors validated that LIME-seq comprehensively captured human tRNA methylation sites in cfRNAs, providing methylation stoichiometry information for both cytoplasmic and mitochondrial tRNAs as well as fragments from rRNAs (Extended Data Fig. 2f-m).
[0117] Interestingly, while the inventors observed that ~60-80% of mapped cfRNA reads aligned with the human genome, the inventors also found that a significant fraction (~20-40%) aligned with microbial genomes (Fig. 1h). These microbiome-derived sequences, which range from ~30-50 nt in length, allowed the inventors to study their bacterial origins. Although over 60% of these reads could be mapped to ten major bacterial species, the inventors also detected several hundred distinct gut microbes based on rRNA fragments and tRNA sequences (Fig. 1i and Extended Data Fig. 3a). Stenotrophomonas maltophilia, Staphylococcus aureus and Pseudomonas sp. CIP-10 emerged as the three dominant bacteria detected in human plasma (Fig. 1i and Extended Data Fig. 3b-c). To further validate that LIME-seq can accurately identify microbiome RNAs in plasma, the inventors isolated small RNAs from lab-cultured S. maltophilia and conducted LIME-seq. The inventors confirmed the consistency between plasma cfRNA and the RNA from lab-cultured microbes in terms of both read coverage and methylation patterns; skin samples served as negative controls, showing poor read coverage or distinct read patterns when aligned to the S. maltophilia genome (Extended Data Fig. 3d-g). The inventors also applied AlkB-mediated demethylation to small RNAs isolated from cultured microbes and confirmed that the methylation sites detected in these RNAs were also present in cfRNA from plasma. To assess potential skin microbe contamination, the inventors collected multiple tubes of blood samples from the same patient without changing the needle between blood draws. The microbial profiles remained consistent across samples, suggesting minimal skin microbe contamination, which would be expected to be enriched in the first tube.
[0118] The inventors are particularly intrigued by the potential to monitor commensal microbial species through microbiome-derived cfRNAs, since the abundance and methylation patterns revealed by LIME-seq may reflect the dynamic status of host microbiomes. Given the increasing evidence of the gut microbiome’s role in colorectal cancer (CRC) initiation and progression, the inventors asked whether cfRNA profiled by LIME-seq could serve as a non-invasive detection method for CRC patients28.
[0119] The inventors collected plasma samples from 27 CRC patients (5 at stage 0, 6 at stage I, 8 at stage II, 8 at stage III) and 36 age- and sex-matched non-cancer individuals as controls. LIME-seq was performed using cfRNAs isolated from ~600 µl of plasma from each participant (Supplementary Table 1). Overall, the inventors observed that ~30% of mapped reads aligned with microbial genomes for both CRC patients and non-cancer controls (Extended Data Fig. 4a), and the most abundant microbial species were common to both groups (Extended Data Fig. 4b-d). Although differential host cfRNA profiling revealed a difference in RNA levels of several host RNA species between CRC patients and controls (Fig. 1j and Extended Data Fig. 5a-b), the expression levels and mutation signatures of these host snRNAs, tRNAs, and rRNAs could not reliably distinguish CRC patients from controls (Extended Data Fig. 5c-i). The inventors next collected tumor and adjacent normal tissues from four patients and performed LIME-seq on isolated small RNAs. The inventors indeed observed tRNA modification differences in tumor versus normal tissue samples. However, these tRNA methylation signatures in plasma cfRNA were not able to distinguish CRC patients from non-cancer controls (Extended Data Fig. 5j-n). The inventors suspect that cell-free host tRNAs may be derived from a range of different cell types, which potentially display different tRNA methylation levels based on cell type and cell status and thereby mask the signatures from tumor tissues. In contrast, the well-developed Kraken2 methodology enabled the inventors to compare the relative abundance of various microbial species using the mapped reads from LIME-seq, and revealed distinct abundances of diverse microbial species among CRC patients and non-cancer controls, suggesting their potential as biomarkers for CRC diagnosis (Fig. 1j)29.
[0120] Previous studies have shown that cfDNA and cfRNA from microorganisms in plasma may reflect the abundance of various microbes, suggesting their potential as non-invasive biomarkers for cancer detection28,30,31. Therefore, the inventors established a statistical model based on these microbiome-related read sequences (Extended Data Fig. 6a). The detection performance of this model was evaluated using leave-one-out cross-validation (LOOCV), given the relatively small sample size. An area under the curve (AUC) of 0.77 was achieved (Extended Data Fig. 6b), showing 56% sensitivity and 97% specificity at the cutoff that maximized Youden’s index. This finding corroborates results from a previous report that also utilized plasma cell-free RNAs derived from human and microbes (Extended Data Fig. 6c)30, validating the approach despite the limited sample size. Larger cohorts of clinical samples are warranted in the future for further improvement.
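The cutoff selection mentioned above can be made concrete with a short sketch. Youden's index is J = sensitivity + specificity - 1, and the reported cutoff is the score threshold at which J is largest. The scores and labels below are invented for illustration and are not taken from the study; the convention that a score at or above the cutoff is called positive is likewise an assumption.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing J = sens + spec - 1.

    labels: 1 = cancer, 0 = non-cancer. A sample is called positive when its
    score is >= cutoff (assumed convention). Ties in J keep the lowest cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]

# Hypothetical model scores for eight samples (four cancer, four non-cancer).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
cutoff, sens, spec = youden_cutoff(scores, labels)
print(cutoff, sens, spec)
```

Under LOOCV, the scores passed to such a function would be the held-out predictions, one per sample, rather than scores from a model fitted on all samples.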
[0121] The inventors next investigated whether changes in methylation site distribution or methylation stoichiometry in fragmented cfRNA could enhance discrimination power beyond cfRNA abundance changes. Using Stenotrophomonas maltophilia as an example, LIME-seq clearly detected methylation sites such as m1G37 in tRNA-Pro-TGG, tRNA-Pro-CGG and tRNA-Leu-CAG in plasma cfRNA, which were validated by AlkB demethylation treatment of RNA from lab-cultured microbes (Fig. 2a and Extended Data Fig. 3d-g). The analysis revealed differential methylation patterns, including both increases and decreases in methylation levels, across various methylated sites in cell-free tRNAs and other small RNA fragments when comparing colorectal cancer (CRC) patients to non-cancer individuals. Notably, the majority of methylation sites exhibited increased modification levels in the CRC samples (Extended Data Fig. 7a). Higher modification levels in microbiome RNA may correlate with increased metabolic activity and external stress, suggesting dysregulation of microbiomes in CRC patients32-36. Further, a heatmap plot confirmed the presence of differential methylation sites in microbiome-derived cfRNAs between CRC patients and non-cancer individuals (Fig. 2b). The inventors were able to distinguish the mutational signatures of RNA modifications from single nucleotide polymorphisms (SNPs) by identifying mixed mutational signatures that were only present at RNA modification sites during RT (Extended Data Fig. 7b). RNA modifications constitute a significant component and feature a notably higher ratio of sites with statistically meaningful differences (P < 0.05, Extended Data Fig. 7c). These results strongly suggest the potential of microbiome-derived cfRNA methylation patterns as biomarkers for CRC detection. Principal Component Analysis (PCA) revealed that high-dimensional data are crucial for distinguishing mutation signatures between CRC patients and non-cancer individuals (Extended Data Fig. 7d-e). This insight led the inventors to adopt a support vector machine (SVM) for the diagnostic model, which effectively separates patient and non-cancer classes using a hyperplane (Extended Data Fig. 7f)37,38. By applying LIME-seq to cfRNA samples, the inventors were able to analyze mutation patterns of the microbiome-derived cfRNA, thereby generating predictive indicators for CRC (Extended Data Fig. 7g).
[0122] The statistical model, validated using LOOCV, demonstrated exceptional predictive ability for classifying CRC patients, achieving an AUC of 0.98 and an accuracy of 0.95 (Fig. 2c and Extended Data Fig. 7h). This represents a significant improvement compared to the model based solely on microbial abundance, which attained an AUC of only 0.77 (Fig. 2c). Furthermore, the statistical model enabled the application of these mutation signatures in distinguishing CRC stages (Fig. 2d), where remarkably high accuracy was observed even for early stages. This sensitivity may be attributed to the microbiota's responsiveness to abnormal cells, even at stage 0 (refs. 39-41). Recognizing that LOOCV may overestimate the model's performance in real-world scenarios, the inventors employed more rigorous validation methods, including bootstrapping and k-fold cross-validation, both utilizing a 25% validation set (Extended Data Fig. 7i). Under these stringent validation frameworks, the mutation-signature-based model maintained high accuracy with an AUC of 0.92 (95% confidence interval (CI): 0.82-1.00, Extended Data Fig. 7j-l). At the optimal cutoff determined by maximizing Youden’s index, the model exhibited a sensitivity of 0.93 (95% CI: 0.89-0.97), a specificity of 0.92 (95% CI: 0.88-0.97), and an accuracy of 0.92 (95% CI: 0.89-0.95). This performance significantly outperforms the model that relied solely on microbial species abundances as biomarkers, which attained an AUC of 0.68 (95% CI: 0.43-0.86, Extended Data Fig. 7j-l). In comparison to the FDA-approved blood-based CRC test (SEPT9 promoter methylation), which had ~70% sensitivity in CRC detection, the results suggest substantial potential for enhancing detection accuracy for early CRC42-44.
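As an illustration of how an AUC and a 95% confidence interval of the kind reported above can be obtained, the following sketch computes the AUC as the probability that a cancer sample scores higher than a non-cancer sample and derives a percentile bootstrap interval by resampling samples with replacement. This is a simplified stand-in: it resamples fixed prediction scores rather than refitting a model on each resample, and the scores, the number of resamples, and the seed are all hypothetical rather than the inventors' settings.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (labels must be 0 or 1)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the classes
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical held-out scores (1 = CRC, 0 = non-cancer).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]
point = auc(scores, labels)
lower, upper = bootstrap_auc_ci(scores, labels)
print(point, lower, upper)
```

With cohorts of a few dozen samples the interval is wide, which is consistent with the broad intervals reported for the abundance-based model.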
[0123] The inventors lastly assessed the important methylation sites identified in the model using SHapley Additive exPlanations (SHAP), a cooperative game theory-based approach for interpreting statistical models (Fig. 2e-g). Among the top 10 important microbes, Burkholderia emerged as a key target with an upregulated methylation level and is distinctly enriched in CRC patients45; Microbacterium, also a top target, is known to correlate with infections in immunocompromised individuals, such as cancer patients, and has been found to display increased methylation levels in CRC tumor tissues46-48. Conversely, Clostridium, which exhibited a reduced methylation level in CRC patients, has been studied as a target for anticancer treatment49-51. Notably, 18 out of the 26 most significant genera identified by the model corresponded with those found in previous research (Supplementary Tables 2-3). Additionally, the inventors successfully traced changes of the microbiome across different stages of CRC (Fig. 2g and Extended Data Fig. 7m-o). The inventors further focused on microbes known to be enriched in CRC patients, including Peptostreptococcus anaerobius, Prevotella intermedia, and Parvimonas micra52-55. The inventors observed high mutation levels among differentially methylated sites in cfRNA from CRC patients, suggesting a connection between tRNA methylation levels and elevated microbiome activity (Extended Data Fig. 7p).
[0124] To evaluate the robustness of these biomarkers under clinical conditions, the inventors collected blood from two healthy individuals and stored it at 4°C for 2, 8, and 24 hours before plasma extraction (Extended Data Fig. 8a). The inventors observed that a high correlation of mutational signatures in microbiome reads was maintained after 8 hours of storage (r > 0.9, Extended Data Fig. 8b). Although an acceptable correlation remained after 24 hours (r > 0.85), there was a significant increase in the human genome fraction, likely due to the leakage of blood cells over time (Extended Data Fig. 8c), which generated large variations in host cfRNA reads (Extended Data Fig. 8d).
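The storage-stability comparison above rests on a correlation coefficient between per-site mutation ratios measured at different time points. A minimal sketch of that calculation follows, assuming r denotes the Pearson correlation (the text does not name the coefficient); the per-site values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical mutation ratios at the same five sites after 2 h and 8 h at 4°C.
ratios_2h = [0.10, 0.35, 0.62, 0.05, 0.80]
ratios_8h = [0.12, 0.33, 0.60, 0.07, 0.78]
r = pearson_r(ratios_2h, ratios_8h)
print(f"r = {r:.3f}")
```

A value of r above the stated thresholds (0.9 at 8 hours, 0.85 at 24 hours) would indicate that the site-level signature is preserved across the storage interval.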
[0125] From an initial set of 330 features, the inventors refined the model to 12 highly expressed sites to enhance stability and minimize experimental variation. A similar accuracy between training and LOOCV in the training cohort (AUC = 0.94 vs. 0.91) suggested a minimal risk of overfitting (Extended Data Fig. 9a). The inventors then applied the model to the two independent validation cohorts that included 30 CRC patients and 8 controls in validation cohort 1 and 7 CRC patients and 8 controls in validation cohort 2 (Supplementary Tables 4-5, Extended Data Fig. 9b), and observed similar results (AUC = 0.89 and 0.93, respectively, Fig. 2h and Extended Data Fig. 9c). Notably, in validation cohort 1, which included 11 adenoma patients and 8 stage I CRC patients, the model effectively distinguished both adenoma and stage I CRC from non-cancer controls (Extended Data Fig. 9d-e). Despite inter-individual variability, significant differences in specific microbial methylation sites were consistently observed across multiple cohorts (Extended Data Fig. 9c). The inventors further applied the strategy to pancreatic cancer (PANC), a cancer that may not directly associate with microbiota. In a small cohort (7 PANC patients and 8 non-cancer controls), microbiome-derived cfRNA methylation profiles also showed differences between cancer patients and non-cancer controls (Extended Data Fig. 10). While this result hints at a broader applicability beyond CRC, further validation in larger cohorts and mechanistic understanding are needed in the future.
[0126] In summary, the inventors demonstrate that methylation levels in microbiome-derived cfRNA are effective and promising biomarkers for colorectal cancer (CRC) diagnosis, offering a significant advantage over microbial abundance profiles of cfRNA/cfDNA. Unlike abundance profiles, changes in methylation pattern in microbiome-derived cfRNA may reflect the intrinsic status and activity of the microbiota, making them more sensitive to a cancerous microenvironment36. The inventors achieved unprecedented accuracy in CRC detection, especially in the very early stages of the disease. Host RNA modification level changes in tissues or plasma may also be explored for disease diagnosis or prognosis. The findings highlight the potential of cfRNA and their modification patterns as reliable biomarkers for monitoring host microbiome dynamics, not only for cancer diagnosis but also for other health-related applications.
Example 3 - Materials and Methods for Aspects Herein
Patient recruitment
[0127] A total of 27 patients with colorectal cancer and a total of 36 sex- and age-matched non-cancer individuals were recruited from the Lutheran General Hospital and the University of Chicago Medical Center. Blood from patients with CRC was collected before surgical resection. Peripheral blood samples from all participants were collected using standardized venipuncture protocols in antecubital venipuncture regions, with skin surfaces disinfected with 70% ethanol prior to phlebotomy to ensure aseptic conditions. All specimens were visually inspected to confirm absence of hemolysis. Blood from non-cancer individuals was collected from those undergoing screening colonoscopies at the University of Chicago Medical Center, with strict adherence to the same pre-analytical quality control measures, and only subjects with no tumor upon post-colonoscopy pathological analyses were included.
[0128] The inventors included two validation cohorts in this study. Validation cohort 1 consisted of 8 non-cancer individuals and 29 patients with CRC. These samples were obtained from stored specimens, which were collected and processed by a separate research group at the same institute. To ensure comparability, CRC patients at different disease stages were age- and sex-matched with the non-cancer individuals. Validation cohort 2 included samples from additional individuals, with 8 non-cancer individuals and 7 patients with CRC, who were recruited in a different phase of this study. These samples were collected using the same procedures as the training cohort and processed by the same research group.
Whole blood processing
[0129] For all cohorts, ~3 ml of whole blood was collected in K2 EDTA or Streck vacutainers. Plasma separation was performed within 4 hours of whole blood collection. Blood samples were first centrifuged at 1,350 x g for 12 min at 4°C to collect the upper plasma layer. The plasma layer was then transferred to a clean 15 ml tube and centrifuged again at 1,350 x g for 12 min at 4°C. The upper plasma layer was then transferred to a clean 1.5-ml EP tube and further centrifuged at 13,500 x g for 5 min at 4°C to ensure complete removal of cell debris. The harvested plasma was split into 600 µl aliquots and stored at -80°C until cfRNA isolation.
cfRNA extraction from plasma samples
[0130] cfRNAs were isolated from 600 µl plasma using the QIAamp ccfDNA/RNA Kit (Qiagen, Cat. 55184) according to the manufacturer's protocol. In brief, 180 µl RPL buffer was added to 600 µl plasma, followed by vortexing for 5 s. After incubation at room temperature for 3 min, 100 µl RPP buffer was added. The EP tube cap was then closed, and the sample was immediately mixed vigorously by vortexing twice for 10 s each, which is important to disperse the solid. The EP tube was then incubated on ice for 3 min. After the standard washing and concentration steps following the manufacturer's protocol, the purified cfRNA was eluted in 14 µl of RNase-free water.
Cell culture
[0131] The HepG2 cell line was purchased from the American Type Culture Collection (ATCC) and maintained in DMEM (Gibco, 11995) with 10% FBS and 1% penicillin/streptomycin. Cells were cultured at 37°C with 5.0% CO2 in a Heracell VIOS 160i incubator (Thermo Scientific).
Culture of the Stenotrophomonas maltophilia
[0132] Stenotrophomonas maltophilia strains were obtained from the clinical microbiology laboratory after confirming the strain and subsequently cultured in Eugon Broth medium. Bacterial growth was carried out under aerobic conditions at 37°C with continuous agitation (180 rpm) for 24-48 hours.
Skin specimens
[0133] Skin samples were collected from 2 x 2 cm² antecubital venipuncture regions of two non-cancer adult donors using a sterile scraping technique (30-second scraping duration per site). Following collection, samples were immediately homogenized in 1 mL TRIzol reagent (Invitrogen). The resulting lysates were stored at -80°C until RNA extraction.
CRC tumor tissues and NATs.
[0134] CRC tumor tissues and matched normal adjacent tissues (NATs) were obtained from surgical resections. Tissue samples were immediately snap-frozen in liquid nitrogen and stored at -80°C until processing. For RNA extraction, frozen tissue samples were mechanically homogenized using a stainless-steel bead milling system. Then, TRIzol reagent (Invitrogen) was added to the homogenate and incubated at room temperature for 5 min to ensure complete dissociation of nucleoprotein complexes, followed by centrifugation at 12,000 x g for 10 min at 4°C to remove insoluble debris. The supernatant was transferred to fresh RNase-free tubes for subsequent RNA isolation.
Small RNA isolation
[0135] Total cellular RNA of HepG2 was isolated with TRIzol reagent (Invitrogen) following the manufacturer’s standard protocol based on isopropanol precipitation. The small RNA fraction (size < 200 nt) was further extracted from the purified total RNA using the mirVana miRNA Isolation Kit (AM1560, Invitrogen), following the manufacturer’s standard protocol.
LIME-seq library preparation
[0136] Plasma cell-free RNA (cfRNA) or small RNA from HepG2 (<200 nt) was fragmented to lengths of ~35-55 nucleotides using the 10X RNA Fragmentation Reagent (AM8740, Invitrogen), which was used as 15X with heating at 70°C for 14 min. The fragmented RNA was purified using the Oligo Clean & Concentrator kit (Zymo Research) with a final elution volume of 10 µl. The subsequent 3'-end repair involved the preparation of a mixture of fragmented RNA, 2 µl of 10x T4 Polynucleotide Kinase Reaction Buffer (B0201S, NEB), 2 µl of T4 PNK, 1 µl SUPERase·In RNase Inhibitor and a proper volume of RNase-free water. The 20 µl reaction mixture was mixed thoroughly and incubated at 37°C for 1 hour. 3'-end-repaired RNA fragments were purified with the Oligo Clean & Concentrator kit and eluted in 10 µl of RNase-free water.
[0137] For 3'-adapter ligation, RNA fragments were denatured with 1 µl of 10 µM RNA 3'-linker (5'rApp-NNNNNAGATCGGAAGAGCGTCGTG-3SpC3; SEQ ID NO: 1) by heating at 70°C for 2 minutes and then immediately moved onto ice. Then a mixture of 2.5 µl of 10x T4 RNA Ligase Reaction Buffer (NEB), 7.5 µl of 50% PEG8000, 1 µl SUPERase·In, and 2 µl T4 RNA ligase 2 truncated KQ (NEB) was prepared, followed by mixing well and incubating at 25°C for 2 hrs and then at 16°C for 10 hrs. Excess adapters were removed by adding 2 µl of 5'-deadenylase (NEB) with an incubation at 30°C for 45 min, followed by adding 1.0 µl of RecJf (NEB) with an incubation at 37°C for 45 min. The 3'-ligated RNA was then purified by RNA Clean & Concentrator (Zymo Research) and eluted in 10 µl of RNase-free water.
[0138] Reverse transcription was initiated by mixing the 3'-ligated RNA with 1 µl of 2.0 µM RT primer (5'-ACACGACGCTCTTCCGATCT-3'; SEQ ID NO:2) and heating at 65°C for 2 minutes. The samples were then moved onto ice immediately. A mixture of 2 µl of 10x AMV Reverse Transcriptase Reaction Buffer (NEB), 2 µl of 10 mM dNTPs, 1 µl RNaseOUT (Thermo Fisher Scientific), 2 µl HIV Recombinant Reverse Transcriptase (Worthington) and a proper volume of RNase-free water was prepared, to give a final volume of 20 µl for a one-hour incubation at 37°C. Upon completion of RT, 1.0 µl of RNase H (NEB, M0297L) was added to the reaction mixture, followed by incubation at 37°C for 20 min and 70°C for 5 min.
[0139] The cDNAs produced by RT were purified with Oligo Clean & Concentrator (Zymo Research) and eluted in 10 µl of RNase-free water. A mixture of the eluted cDNA and 2.0 µl of 10 µM cDNA 3'-linker (5'Phos-NNNNNAGATCGGAAGAGCACACGTCTG-3SpC3; SEQ ID NO: 3) was denatured at 75 °C for 2 min and then moved onto ice immediately. Then 3 µl of 10x T4 RNA Ligase Reaction Buffer (NEB), 15 µl of 50% PEG8000, 3 µl of 10 mM ATP, and 1.0 µl of T4 RNA ligase 1 (high concentration, NEB) were added to the cDNA-adapter mixture and mixed thoroughly. The ligation reaction mixture was incubated at 25 °C for 12 h. The libraries were amplified using universal and indexed primers from NEBNext Multiplex Oligos for Illumina (NEB). All libraries were sequenced on an Illumina NovaSeq 6000 or NovaSeq X in SE100 bp mode.
[0140] The LIME-seq method enables high mutation ratio detection across multiple RNA modifications, including m1A, m3C, m1G, m22G, m3U, etc., with enhanced read-through at these methylation sites. This performance is facilitated by the robust HIV RT enzyme, and the method effectively uncovers methylation fraction information at methylated sites.
Raw read processing and mapping to the human genome
[0141] All sequencing data were trimmed with cutadapt (version 4.6; ref. 56) to remove adapters and discard low-quality reads (shorter than 28 bp). PCR duplicates were then removed with BBMap (https://sourceforge.net/projects/bbmap/). Next, the random barcodes at the read ends were trimmed and low-quality reads were removed, again using cutadapt. The remaining reads were aligned to the human genome (hg38) using STAR (version 2.7.11a; ref. 57) and bowtie2 (version 2.5.1; ref. 58), allowing a maximum of three mismatches. The inventors compared the cfRNA mapping results from bowtie2 and STAR; mis-splicing and short reads in the libraries prompted the inventors to use bowtie2. Based on the bowtie2 alignments, the inventors normalized the data by calculating reads per million (RPM) and performed a two-tailed Student's t-test to calculate the p value for differential expression. Most differentially expressed protein-coding transcripts identified by LIME-seq are miRNAs occurring within these transcripts. For analyzing cfRNA transcripts from the human genome, the inventors set a minimum RPM of 20 in at least one group (CRC or non-cancer), focused on non-protein-coding transcripts, and excluded redundant transcripts.
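The RPM normalization and the minimum-RPM filter described above can be sketched as follows. This is a minimal illustration, not the inventors' pipeline. It assumes that RPM is computed against the total mapped reads of each sample and that the group-level RPM is summarized by the mean; the text specifies neither. All function names are hypothetical.

```python
from statistics import mean


def reads_per_million(counts: dict[str, int]) -> dict[str, float]:
    """Scale per-transcript read counts to reads per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {name: n * 1_000_000 / total for name, n in counts.items()}


def passes_min_rpm(rpm_by_group: dict[str, list[float]], min_rpm: float = 20.0) -> bool:
    """True if the mean RPM reaches min_rpm in at least one group (assumed summary)."""
    return any(mean(values) >= min_rpm for values in rpm_by_group.values() if values)


if __name__ == "__main__":
    sample = {"transcript_a": 1, "transcript_b": 3}  # toy counts, not real data
    print(reads_per_million(sample))
    print(passes_min_rpm({"CRC": [25.0, 30.0], "non-cancer": [5.0, 1.0]}))
```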
Assessing the stability of the biomarkers
[0142] To evaluate the stability of the mutation signature of microbiome-derived cfRNA biomarkers, three blood tubes were collected from each of two healthy donors under standardized conditions. Immediately after collection, all tubes were placed on ice and transferred to a 4 °C refrigerator for temporary storage prior to plasma isolation. The experimental design incorporated three processing timepoints: plasma isolation initiated within 2 hours, plasma isolation after 8-hour refrigeration, and plasma isolation after 24-hour refrigeration. Plasma was separated using a standard protocol and stored at -80 °C after extraction.
Visualization of read coverage
[0143] After aligning sequences to the reference genome, the inventors used the “depth” function of samtools (version 1.16.1; ref. 59) to extract read depths in regions corresponding to various RNA species. Following the extraction of read depths at these sites, the inventors used an in-house Python pipeline to plot the read coverage across these regions.
Quantification of methylation in HepG2 cells, tissues and the host-derived portion of cfRNA
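The in-house plotting pipeline is not disclosed. As a hedged sketch of the step that precedes plotting, the following Python code parses the default three-column output of `samtools depth` for a single BAM file (reference name, 1-based position, depth; tab-separated) and returns the coverage over a region. The function names and the toy input are hypothetical.

```python
def parse_depth(lines):
    """Parse `samtools depth` output into {reference: {position: depth}}."""
    depth = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        chrom, pos, d = line.split("\t")[:3]
        depth.setdefault(chrom, {})[int(pos)] = int(d)
    return depth


def region_coverage(depth, chrom, start, end):
    """Depth at each position in [start, end]; positions missing from the output count as 0."""
    per_position = depth.get(chrom, {})
    return [per_position.get(p, 0) for p in range(start, end + 1)]


if __name__ == "__main__":
    toy = ["chr1\t100\t5", "chr1\t101\t7", "chr1\t103\t2"]  # toy lines, not real data
    print(region_coverage(parse_depth(toy), "chr1", 99, 104))
```

Positions absent from the output are treated as zero depth because `samtools depth` omits them unless all positions are requested. The resulting list can be passed to any plotting library.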
[0144] After mapping to the human genome with bowtie2, the generated .bam files were split into positive and negative strands and sorted using samtools. Sequence variants were identified by measuring the base composition at each position using fine-tuned bam-readcount (https://github.com/genome/bam-readcount). The bam-readcount output was parsed and analysed to calculate the misincorporation ratio at each methylated site (m1A, m1G, m22G and m3C) in tRNA, followed by confirmation through direct visualization in IGV (https://software.broadinstitute.org/software/igv/). For quality control of cfRNA, the inventors established a criterion whereby the average mutation ratio at the m1A sites within human tRNA must exceed 0.6. All samples used in this study passed this cut-off.
Quantification of microbial abundance from cfRNA
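The misincorporation ratio and the m1A-based quality-control criterion can be illustrated with a short sketch. It assumes that the ratio is the fraction of A/C/G/T base calls that differ from the reference base, and that the per-base counts have already been extracted from the bam-readcount output; the exact definition and parsing used by the inventors are not given in the text. The function names are hypothetical.

```python
from statistics import mean


def misincorporation_ratio(base_counts: dict[str, int], ref_base: str) -> float:
    """Fraction of A/C/G/T calls at a site that differ from the reference base."""
    total = sum(base_counts.get(b, 0) for b in "ACGT")
    if total == 0:
        raise ValueError("no coverage at this site")
    mismatches = total - base_counts.get(ref_base, 0)
    return mismatches / total


def passes_m1a_qc(m1a_ratios: list[float], cutoff: float = 0.6) -> bool:
    """QC rule from the text: mean mutation ratio at human tRNA m1A sites must exceed 0.6."""
    return mean(m1a_ratios) > cutoff


if __name__ == "__main__":
    site = {"A": 30, "C": 10, "G": 40, "T": 20}  # toy counts, not real data
    print(misincorporation_ratio(site, "A"))
```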
[0145] Cleaned reads that failed to align to the human genome were classified using Kraken2 with its standard database, which contains bacterial, archaeal, viral, and human sequences. Based on the Kraken2 classification results, the inventors normalized the data by calculating reads per million (RPM) and then performed a two-tailed Student's t-test to calculate the p value for differential expression.
Quantification of the methylation in microbiome from cfRNA
[0146] The inventors identified microbial species in cfRNA samples using kraken2 (version 2.1.3; ref. 60), setting a threshold whereby the median read count of a species must exceed 10 in at least one group (CRC or non-cancer). Following species identification, the inventors extracted the corresponding sequences from the kraken2 reference genomes. To construct a new, streamlined reference, the inventors limited each microbial species to a maximum of ten reference genomes, minimizing redundancy.
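The species threshold and the cap on reference genomes can be sketched in a few lines of Python. This is illustrative only. The text does not say how the ten genomes per species are chosen, so the sketch simply keeps the first ten in the order given, which is an assumption. The function names and data layout are hypothetical.

```python
from statistics import median


def select_species(reads_by_species, min_median_reads=10):
    """Keep species whose median read count exceeds the threshold in at least one group.

    reads_by_species maps species -> {group: [read count per sample]}.
    """
    return sorted(
        species
        for species, groups in reads_by_species.items()
        if any(median(counts) > min_median_reads for counts in groups.values() if counts)
    )


def cap_genomes(genomes_by_species, max_genomes=10):
    """Limit each species to at most max_genomes reference genomes (first ones kept)."""
    return {sp: list(genomes)[:max_genomes] for sp, genomes in genomes_by_species.items()}
```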
[0147] Subsequently, the inventors aligned the reads that failed to align to the human genome to this curated microbiome reference using bowtie2. The bowtie2 alignments were used to calculate the microbiome mapping ratio. Variants within these sequences were then identified by analyzing the base composition at each genomic position using the bam-readcount tool (https://github.com/genome/bam-readcount). The bam-readcount output was parsed and analyzed to calculate the misincorporation ratio. The inventors plotted differentially expressed microbes with RPM greater than 20 for comparison with host transcripts.
Analysis of the methylation sites in microbiome
[0148] To ascertain whether a site's mutation signal is due to methylation or a single nucleotide polymorphism (SNP), the inventors examined the mutation signals induced by HIV-RT. The analysis indicates that signals attributed to methylation typically result from a combination of mutations, whereas SNPs are characterized by a single mutation type (e.g., A>G). Consequently, at any given site, if the mutation ratio for a specific base exceeds 95% of the total mutation ratio, the site is classified as an SNP. Otherwise, the site is annotated as methylation.
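The 95% rule described above can be expressed as a small Python function. This is a sketch under the assumption that the per-base read counts at a site are available. The label returned for sites with no mismatches, and the function name, are hypothetical additions.

```python
def classify_site(base_counts: dict[str, int], ref_base: str, snp_fraction: float = 0.95) -> str:
    """Label a site as 'SNP' or 'methylation' from its mismatch spectrum.

    A site is called an SNP when a single alternative base accounts for more than
    snp_fraction of all mismatches; otherwise it is annotated as methylation.
    """
    mismatches = {b: n for b, n in base_counts.items() if b != ref_base and n > 0}
    total = sum(mismatches.values())
    if total == 0:
        return "no_signal"  # hypothetical label: nothing to classify
    if max(mismatches.values()) / total > snp_fraction:
        return "SNP"
    return "methylation"


if __name__ == "__main__":
    print(classify_site({"A": 10, "G": 90, "C": 0, "T": 0}, "A"))   # one mutation type
    print(classify_site({"A": 40, "G": 30, "T": 20, "C": 10}, "A"))  # mixed mutations
```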
Differential methylation analysis in microbiome
[0149] The inventors organized the raw bam-readcount data and applied filtering to minimize random variation. Specifically, mutational sites in each sample with a depth of less than 3 were deemed insufficient (considered empty) and excluded from subsequent analyses to ensure robustness in calculating p values and the average mutation ratio. The analysis then applied two thresholds: (1) a median depth of more than 5 reads in at least one of the groups (CRC or non-cancer), and (2) an average mutation ratio exceeding 0.10 in at least one group (CRC or non-cancer). These criteria substantially refined the dataset, resulting in approximately 9,000 eligible sites for further analysis. The inventors then assessed these sites for differential methylation using a two-tailed Student's t-test and visualized the results in a heatmap.
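The site-level filters described above can be sketched as follows. The sketch assumes that the median depth and the mean mutation ratio are computed over the samples retained after the per-sample depth filter; the text does not state this explicitly. The constant and function names are hypothetical.

```python
from statistics import mean, median

MIN_SAMPLE_DEPTH = 3    # per-sample observations below this depth are treated as empty
MIN_MEDIAN_DEPTH = 5    # group median depth must be greater than this
MIN_MEAN_RATIO = 0.10   # group mean mutation ratio must be greater than this


def usable(observations):
    """Keep (depth, mutation_ratio) pairs with sufficient depth."""
    return [(d, r) for d, r in observations if d >= MIN_SAMPLE_DEPTH]


def site_is_eligible(groups):
    """Apply both thresholds; each must hold in at least one group.

    groups maps a group name (e.g. 'CRC') to a list of (depth, mutation_ratio) pairs.
    """
    kept = {g: usable(obs) for g, obs in groups.items()}
    kept = {g: obs for g, obs in kept.items() if obs}
    depth_ok = any(median([d for d, _ in obs]) > MIN_MEDIAN_DEPTH for obs in kept.values())
    ratio_ok = any(mean([r for _, r in obs]) > MIN_MEAN_RATIO for obs in kept.values())
    return depth_ok and ratio_ok
```

Sites that pass this filter would then be compared between groups with a two-tailed t-test, as the text describes.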
Statistics-based models
[0150] The support vector classifier used in this work was implemented with scikit-learn (ref. 61). The misincorporation ratios of a subgroup of 330 methylation sites were used as the input for the model. These data were then partitioned into training and testing sets. Within the model, any missing feature value was imputed with that feature's average value across samples, ensuring completeness. To evaluate the model's performance, the inventors applied several validation techniques, including leave-one-out cross-validation (LOOCV), 4-fold cross-validation, and bootstrapping, the latter using 25% of the data as a test set.
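A minimal scikit-learn sketch of such a workflow is shown below. It is not the inventors' implementation. The linear kernel, the fold-wise mean imputation inside a pipeline, the use of repeated stratified 75/25 splits to stand in for the bootstrapping step, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    LeaveOneOut,
    StratifiedKFold,
    StratifiedShuffleSplit,
    cross_val_score,
)
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_model():
    """Mean imputation followed by a support vector classifier (kernel is an assumption)."""
    return make_pipeline(SimpleImputer(strategy="mean"), SVC(kernel="linear"))


def evaluate(X, y, seed=0):
    """Mean accuracy under LOOCV, 4-fold CV and repeated 75/25 splits."""
    schemes = {
        "loocv": LeaveOneOut(),
        "cv4": StratifiedKFold(n_splits=4, shuffle=True, random_state=seed),
        "holdout25": StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=seed),
    }
    return {
        name: float(cross_val_score(build_model(), X, y, cv=cv).mean())
        for name, cv in schemes.items()
    }


def make_synthetic_data(n_per_class=20, n_sites=330, seed=0):
    """Synthetic misincorporation ratios with about 5% missing values (not real data)."""
    rng = np.random.default_rng(seed)
    X = rng.random((2 * n_per_class, n_sites))
    y = np.array([0] * n_per_class + [1] * n_per_class)
    X[y == 1, :10] += 0.5                      # shift a few sites in one class
    X[rng.random(X.shape) < 0.05] = np.nan     # introduce missing values
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_data()
    print(evaluate(X, y))
```

Placing the imputer inside the pipeline means that feature averages are computed from the training portion of each split only, which avoids leaking information from held-out samples.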
Explanation of the statistics-based models
[0151] SHAP values were calculated to estimate feature attribution for each endpoint and model individually (ref. 62). SHAP values build on Shapley values, which are rooted in cooperative game theory, to compute the average incremental effect of each feature on the model's prediction, adhering to the principle of local additivity. A key attribute of Shapley values is their ability to quantify the contribution of each player (or feature) in a game (or model) by comparing the outcome with all players (or features) involved versus an absence of players (or features). In the context of statistical models, this translates to SHAP values summing to the difference between the expected model output (baseline) and the actual output for a specific prediction. This ensures a comprehensive and fair attribution of each feature's impact on the prediction outcome. In this work, the inventors applied 4-fold cross-validation in calculating the SHAP values for every feature across all samples.
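The additivity property described above can be written out explicitly. The notation below follows the standard formulation of Shapley values and is introduced here for illustration: F is the set of M input features, f_S denotes the model output when only the features in S are known, and φ_0 is the baseline (expected) output.

```latex
% Local additivity: the attributions sum to the gap between prediction and baseline
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x), \qquad \phi_0 = \mathbb{E}[f(X)]

% Shapley value of feature i: weighted average of its marginal contribution
% over all subsets S of the remaining features
\phi_i(x) = \sum_{S \subseteq F \setminus \{i\}}
    \frac{|S|!\,(M - |S| - 1)!}{M!}
    \left[ f_{S \cup \{i\}}(x) - f_{S}(x) \right]
```

The first equation is the statement that SHAP values sum to the difference between the actual output and the baseline; the second shows that each attribution averages the feature's incremental effect over all feature subsets.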
Data availability
[0152] The raw and processed sequencing data have been deposited into the NCBI Gene Expression Omnibus (GEO) database with the accession number GSE264208.
Example 4 - Dynamic regulation of bacterial tRNA modifications under different growth conditions.
[0153] The inventors cultured Pseudomonas aeruginosa and Staphylococcus aureus in vitro. Using LIME-seq, the inventors profiled tRNA modifications under two different conditions: bacteria grown on plates and in liquid medium (Figure 14a-b). These two conditions represent surface-associated and planktonic growth, respectively. Surface-associated growth mimics biofilm-like states, which are often associated with increased antibiotic resistance and persistent infections, whereas planktonic growth reflects a more active, free-living bacterial state. The inventors observed numerous differentially modified sites between the two environments, suggesting that RNA modifications may play a role in bacterial adaptation to distinct physiological states.
[0154] To further investigate the relationship between tRNA modifications and bacterial growth dynamics, the inventors collected E. coli samples across a range of optical densities (OD600), from 0.12 to 1.70 (Figure 14c), representing different growth phases — from early exponential to late stationary phase. The inventors observed a clear increase in modification levels at higher cell densities, with prominent changes at U47 — likely corresponding to acp3U — and A38 (Figure 14d). These modifications may reflect adaptive responses to growth phase transitions, potentially modulating translation efficiency or stress response pathways.
[0155] Together, these findings underscore the dynamic nature of RNA modifications in bacteria and suggest their roles in regulating growth-state transitions. Importantly, the modification patterns observed under different in vitro growth conditions may help interpret the bacterial states represented by microbiome-derived cfRNA in host plasma, providing valuable context for understanding RNA modification changes during microbial dysbiosis or in vivo adaptation under disease conditions.
[0156] Bacterial infections are common in the clinic, and physicians need to culture the bacteria to determine which ones account for an infection so that they can provide proper treatment. Even in those cases, culture may yield multiple bacteria, leaving it unclear which ones account for the actual infection.
[0157] Based on these results, one can sequence a sample obtained from a patient to determine the tRNA methylation level in each microbe present in the sample. The microbe(s) with the highest tRNA methylation level are hypothesized to have the highest activity.
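The ranking step implied by this hypothesis can be sketched in Python. The sketch assumes that the tRNA methylation level of a microbe is summarized as the mean misincorporation ratio over its tRNA sites, which is an illustrative choice not specified in the text. The function name and the example labels are hypothetical.

```python
from statistics import mean


def rank_by_trna_methylation(ratios_by_microbe):
    """Return (microbe, mean ratio) pairs ordered from highest to lowest mean ratio.

    ratios_by_microbe maps a microbe name to the misincorporation ratios
    measured at its tRNA sites; microbes without measurements are skipped.
    """
    scored = [(name, mean(r)) for name, r in ratios_by_microbe.items() if r]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = {"microbe_A": [0.1, 0.3], "microbe_B": [0.6, 0.8]}  # toy values, not real data
    for name, level in rank_by_trna_methylation(toy):
        print(name, round(level, 3))
```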
References
[0158] The following references and the references cited throughout the specification, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
1. Volckmar, A.-L. et al. A field guide for cancer diagnostics using cell-free DNA: From principles to practice and clinical applications. Genes, Chromosomes and Cancer 57, 123-139 (2018).
2. Sun, K. et al. Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proceedings of the National Academy of Sciences 112, E5503-E5512 (2015).
3. Johnson, P., Zhou, Q., Dao, D. Y. & Lo, Y. M. D. Circulating biomarkers in the diagnosis and management of hepatocellular carcinoma. Nat Rev Gastroenterol Hepatol 19, 670-681 (2022).
4. Luo, H. et al. Circulating tumor DNA methylation profiles enable early diagnosis, prognosis prediction, and screening for colorectal cancer. Science Translational Medicine 12, eaax7533 (2020).
5. Cescon, D. W., Bratman, S. V., Chan, S. M. & Siu, L. L. Circulating tumor DNA and liquid biopsy in oncology. Nat Cancer 1, 276-290 (2020).
6. Song, P. et al. Limitations and opportunities of technologies for the analysis of cell-free DNA in cancer diagnostics. Nat Biomed Eng 6, 232-245 (2022).
7. Siejka-Zielinska, P. et al. Cell-free DNA TAPS provides multimodal information for early cancer detection. Science Advances 7, eabh0534 (2021).
8. Li, W. et al. 5-Hydroxymethylcytosine signatures in circulating cell-free DNA as diagnostic biomarkers for human cancers. Cell Res 27, 1243-1257 (2017).
9. Ignatiadis, M., Sledge, G. W. & Jeffrey, S. S. Liquid biopsy enters the clinic — implementation issues and future challenges. Nat Rev Clin Oncol 18, 297-312 (2021).
10. van der Pol, Y. & Mouliere, F. Toward the Early Detection of Cancer by Decoding the Epigenetic and Environmental Fingerprints of Cell-Free DNA. Cancer Cell 36, 350-368 (2019).
11. Zaporozhchenko, I. A., Ponomaryova, A. A., Rykova, E. Y. & Laktionov, P. P. The potential of circulating cell-free RNA as a cancer biomarker: challenges and opportunities. Expert Review of Molecular Diagnostics 18, 133-145 (2018).
12. Tsang, J. C. H. et al. Integrative single-cell and cell-free plasma RNA transcriptomics elucidates placental cellular dynamics. Proceedings of the National Academy of Sciences 114, E7786-E7795 (2017).
13. Han, T. W. et al. Cell-free Formation of RNA Granules: Bound RNAs Identify Features and Components of Cellular Assemblies. Cell 149, 768-779 (2012).
14. Moufarrej, M. N. et al. Early prediction of preeclampsia in pregnancy with cell-free RNA. Nature 602, 689-694 (2022).
15. Larson, M. H. et al. A comprehensive characterization of the cell-free transcriptome reveals tissue- and subtype-specific biomarkers for cancer detection. Nat Commun 12, 2357 (2021).
16. Liu, Z. et al. Polyadenylation ligation-mediated sequencing (PALM-Seq) characterizes cell-free coding and non-coding RNAs in human biofluids. Clinical and Translational Medicine 12, e987 (2022).
17. Dai, Q. et al. Quantitative sequencing using BID-seq uncovers abundant pseudouridines in mammalian mRNA at base resolution. Nat Biotechnol 41, 344-354 (2023).
18. Ge, R. et al. m6A-SAC-seq for quantitative whole transcriptome m6A profiling. Nat Protoc 18, 626-657 (2023).
19. Hu, L. et al. m6A RNA modifications are measured at single-base resolution across the mammalian transcriptome. Nat Biotechnol 40, 1210-1219 (2022).
20. Zhang, L.-S., Dai, Q. & He, C. Base-Resolution Sequencing Methods for Whole-Transcriptome Quantification of mRNA Modifications. Acc. Chem. Res. 57, 47-58 (2024).
21. Zhang, L.-S. et al. ALKBH7-mediated demethylation regulates mitochondrial polycistronic RNA processing. Nat Cell Biol 23, 684-691 (2021).
22. Zhang, L.-S. et al. BID-seq for transcriptome-wide quantitative sequencing of mRNA pseudouridine at base resolution. Nat Protoc 19, 517-538 (2024).
23. Wang, P. et al. Terminal modifications independent cell-free RNA sequencing enables sensitive early cancer detection and classification. Nat Commun 15, 156 (2024).
24. Hu, J. F. et al. Quantitative mapping of the cellular small RNA landscape with AQRNA-seq. Nat Biotechnol 39, 978-988 (2021).
25. Shi, J. et al. PANDORA-seq expands the repertoire of regulatory small RNAs by overcoming RNA modifications. Nat Cell Biol 23, 424-436 (2021).
26. Zheng, G. et al. Efficient and quantitative high-throughput tRNA sequencing. Nat Methods 12, 835-837 (2015).
27. Dominissini, D. et al. The dynamic N1-methyladenosine methylome in eukaryotic messenger RNA. Nature 530, 441-446 (2016).
28. Poore, G. D. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 579, 567-574 (2020).
29. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019).
30. Chen, S. et al. Cancer type classification using plasma cell-free RNAs derived from human and microbes. eLife 11, e75181 (2022).
31. You, L. et al. Novel directions of precision oncology: circulating microbial DNA emerging in cancer-microbiome areas. Precision Clinical Medicine 5, pbac005 (2022).
32. Chan, C. T. Y. et al. A Quantitative Systems Approach Reveals Dynamic Control of tRNA Modifications during Cellular Stress. PLOS Genetics 6, e1001247 (2010).
33. Valesyan, S., Jora, M., Addepalli, B. & Limbach, P. A. Stress-induced modification of Escherichia coli tRNA generates 5-methylcytidine in the variable loop. Proceedings of the National Academy of Sciences 121, e2317857121 (2024).
34. Schwartz, M. H., Waldbauer, J. R., Zhang, L. & Pan, T. Global tRNA misacylation induced by anaerobiosis and antibiotic exposure broadly increases stress resistance in Escherichia coli. Nucleic Acids Research 44, 10292-10303 (2016).
35. Heiss, M., Hagelskamp, F., Marchand, V., Motorin, Y. & Kellner, S. Cell culture NAIL-MS allows insight into human tRNA and rRNA modification dynamics in vivo. Nat Commun 12, 389 (2021).
36. Schwartz, M. H. et al. Microbiome characterization by high-throughput transfer RNA sequencing and modification analysis. Nat Commun 9, 5353 (2018).
37. Erfani, S. M., Rajasegarar, S., Karunasekera, S. & Leckie, C. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition 58, 121-134 (2016).
38. Lin, W.-J. & Chen, J. J. Class-imbalanced classifiers for high-dimensional data. Briefings in Bioinformatics 14, 13-26 (2013).
39. Konishi, Y. et al. Development and evaluation of a colorectal cancer screening method using machine learning-based gut microbiota analysis. Cancer Medicine 11, 3194-3206 (2022).
40. Yachida, S. et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat Med 25, 968-976 (2019).
41. Zwezerijnen-Jiwa, F. H., Sivov, H., Paizs, P., Zafeiropoulou, K. & Kinross, J. A systematic review of microbiome-derived biomarkers for early colorectal cancer detection. Neoplasia 36, 100868 (2023).
42. Church, T. R. et al. Prospective evaluation of methylated SEPT9 in plasma for detection of asymptomatic colorectal cancer. Gut 63, 317-325 (2014).
43. deVos, T. et al. Circulating Methylated SEPT9 DNA in Plasma Is a Biomarker for Colorectal Cancer. Clinical Chemistry 55, 1337-1346 (2009).
44. Song, L., Jia, J., Peng, X., Xiao, W. & Li, Y. The performance of the SEPT9 gene methylation assay and a comparison with other CRC screening tests: A meta-analysis. Sci Rep 7, 3032 (2017).
45. Yang, J. et al. Establishing high-accuracy biomarkers for colorectal cancer by comparing fecal microbiomes in patients with healthy families. Gut Microbes 11, 918-929 (2020).
46. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Molecular Systems Biology 10, 766 (2014).
47. Kharrat, N. et al. Data mining analysis of human gut microbiota links Fusobacterium spp. with colorectal cancer onset. Bioinformation 15, 372-379 (2019).
48. Geng, J., Fan, H., Tang, X., Zhai, H. & Zhang, Z. Diversified pattern of the human colorectal cancer microbiome. Gut Pathog 5, 2 (2013).
49. Yaghoubi, A. et al. The use of Clostridium in cancer therapy: a promising way. Reviews and Research in Medical Microbiology 33, 121 (2022).
50. Cong, J. et al. Bile acids modified by the intestinal microbiota promote colorectal cancer growth by suppressing CD8+ T cell effector functions. Immunity 0, (2024).
51. Xie, Y.-H. et al. Fecal Clostridium symbiosum for Noninvasive Detection of Early and Advanced Colorectal Cancer: Test and Validation Studies. eBioMedicine 25, 32-40 (2017).
52. Lo, C.-H. et al. Enrichment of Prevotella intermedia in human colorectal cancer and its additive effects with Fusobacterium nucleatum on the malignant transformation of colorectal adenomas. J Biomed Sci 29, 88 (2022).
53. Tsoi, H. et al. Peptostreptococcus anaerobius Induces Intracellular Cholesterol Biosynthesis in Colon Cells to Induce Proliferation and Causes Dysplasia in Mice. Gastroenterology 152, 1419-1433.e5 (2017).
54. Long, X. et al. Peptostreptococcus anaerobius promotes colorectal carcinogenesis and modulates tumour immunity. Nat Microbiol 4, 2319-2330 (2019).
55. Osman, M. A. et al. Parvimonas micra, Peptostreptococcus stomatis, Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker panel of colorectal cancer. Sci Rep 11, 2925 (2021).
56. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10-12 (2011).
57. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 (2013).
58. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-359 (2012).
59. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).
60. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biology 20, 257 (2019).
61. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. Preprint at https://doi.org/10.48550/arXiv.1201.0490 (2018).
62. Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems vol. 30 (Curran Associates, Inc., 2017).
* * *
[0159] All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of certain aspects, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.
Claims
1. A method for predicting disease in a patient, the method comprising: receiving sequencing data obtained from the patient; generating an input feature vector comprising the sequencing data; and applying, into a trained machine learning model, the input feature vector to generate an output feature vector predicting whether the patient has the disease, wherein the sequencing data comprises an RNA modification signature from cell-free RNA.
2. The method of claim 1, wherein the disease is cancer.
3. The method of claim 2, wherein the cancer is colorectal cancer.
4. The method of claim 1, wherein the disease is an inflammatory disease, irritable bowel syndrome, sepsis, graft versus host disease (GvHD), GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
5. The method of claim 1, wherein the disease is associated with microbiome dysbiosis.
6. The method of claim 1, wherein the disease is Alzheimer’s disease or Parkinson’s disease.
7. The method of any one of claims 1 to 6, wherein the method is a computer implemented method.
8. The method of any one of claims 1 to 7, wherein the receiving is done by one or more processors.
9. The method of any one of claims 1 to 8, wherein the generating is done by one or more processors.
10. The method of any one of claims 1 to 9, wherein the applying is done by one or more processors.
11. The method of any one of claims 1 to 10, wherein the sequencing data is obtained from a cell free RNA sample from the patient.
12. The method of any one of claims 1 to 11, wherein the sequencing data is obtained from plasma from the patient.
13. The method of any one of claims 1 to 12, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2'-O-methyl, m3U, acp3U modifications, or a combination thereof.
14. The method of claim 13, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
15. The method of any one of claims 1 to 14, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
16. The method of any one of claims 1 to 14, wherein the RNA modification signature comprises RNA modifications from patient-derived RNA.
17. The method of any one of claims 1 to 16, wherein the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
18. A method of detecting a microbiome signature in a biological sample, the method comprising detecting an RNA modification signature in cell-free RNA from the biological sample.
19. The method of claim 18, wherein the biological sample is a human plasma sample.
20. The method of claim 18 or 19, wherein the biological sample is from a patient suspected of having cancer.
21. The method of claim 20, wherein the cancer is colorectal cancer.
22. The method of claim 18 or 19, wherein the biological sample is from a patient having or suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease or from a patient that has received or will receive a transplant.
23. The method of claim 18, wherein the disease is associated with microbiome dysbiosis.
24. The method of claim 18, wherein the disease is Alzheimer’s disease or Parkinson’s disease.
25. The method of any one of claims 18 to 24, wherein the biological sample comprises cell free RNA.
26. The method of any one of claims 18 to 25, wherein the RNA sample comprises approximately 0.1-50 ng of total RNA.
27. The method of any one of claims 18 to 26, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2'-O-methyl, m3U, acp3U modifications, or a combination thereof.
28. The method of claim 27, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
29. The method of any one of claims 18 to 28, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
30. A method of treating disease in a patient, the method comprising administering to the patient an effective amount of a therapy, wherein the patient has been determined to have a disease after detection of an RNA modification signature in cell free RNA from a biological sample from the patient.
31. The method of claim 30, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2'-O-methyl, m3U, acp3U modifications, or a combination thereof.
32. The method of claim 31, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
33. The method of any one of claims 30 to 32, wherein the biological sample comprises plasma from the patient.
34. The method of any one of claims 30 to 33, wherein the biological sample comprises approximately 0.1-50 ng of total RNA.
35. The method of any one of claims 30 to 34, wherein the patient is a human patient.
36. The method of any one of claims 30 to 35, wherein the patient has or is suspected of having cancer.
37. The method of claim 36, wherein the cancer is colorectal cancer.
38. The method of any one of claims 30 to 35, wherein the patient has or is suspected of having an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease or has received or will receive a transplant.
39. The method of any one of claims 30 to 35, wherein the disease is associated with microbiome dysbiosis.
40. The method of any one of claims 30 to 35, wherein the patient has or is suspected of having Alzheimer’s disease or Parkinson’s disease.
41. The method of any one of claims 30 to 40, wherein the therapy is determined based on the detected RNA modification signature.
42. The method of any one of claims 30 to 41, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
43. The method of any one of claims 30 to 42, wherein the RNA modification signature comprises RNA modifications from patient-derived RNA.
44. The method of any one of claims 30 to 43, wherein the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
45. A method of diagnosing a disease in a patient, the method comprising detecting an RNA modification signature in cell free RNA in a biological sample from the patient.
46. The method of claim 45, wherein the RNA modification signature comprises m1A, m3C, m1G, m22G, m5C, pseudouridine, 2'-O-methyl, m3U, acp3U modifications, or a combination thereof.
47. The method of claim 46, wherein the m1A, m3C, m1G, m22G is determined by LIME-Seq.
48. The method of any one of claims 45 to 47, wherein the biological sample comprises plasma from the patient.
49. The method of any one of claims 45 to 48, wherein the biological sample comprises approximately 5-10 ng of total RNA.
50. The method of any one of claims 45 to 49, wherein the patient is a human patient.
51. The method of any one of claims 45 to 50, wherein the disease is cancer.
52. The method of claim 51, wherein the cancer is colorectal cancer.
53. The method of any one of claims 45 to 50, wherein the disease is an inflammatory disease, irritable bowel syndrome, sepsis, GvHD from a transplant operation, an infection, asthma, diabetes, or an autoimmune disease.
54. The method of any one of claims 45 to 50, wherein the disease is associated with microbiome dysbiosis.
55. The method of any one of claims 45 to 50, wherein the disease is Alzheimer’s disease or Parkinson’s disease.
56. The method of any one of claims 45 to 55, wherein the RNA modification signature comprises RNA modifications from microbiome-derived RNA.
57. The method of any one of claims 45 to 56, wherein the RNA modification signature comprises RNA modifications from patient-derived RNA.
58. The method of any one of claims 45 to 57, wherein the RNA modification signature comprises an RNA modification at one or more RNA locations of Table A.
59. A method of detecting a modification signature on a ribonucleic acid (RNA) sample, the method comprising: ligating a 3’ adapter to a plurality of RNA molecules in the RNA sample; reverse transcribing the plurality of RNA molecules in the RNA sample using a reverse transcriptase to generate a population of complementary DNA (cDNA); ligating a 3’ linker to the population of cDNA; and sequencing the population of cDNA.
60. The method of claim 59, wherein the modification signature comprises m1A, m3C, m1G, m22G modifications, or a combination thereof, in the RNA sample.
61. The method of claim 59 or 60, wherein the RNA sample comprises cell free RNA.
62. The method of any one of claims 59 to 61, wherein the RNA sample is from human plasma.
63. The method of any one of claims 59 to 62, wherein the RNA sample comprises approximately 0.1-50 ng of total RNA.
64. The method of any one of claims 59 to 63, wherein the RNA sample is fragmented prior to the reverse transcribing step.
65. The method of any one of claims 59 to 64, wherein the population of cDNA is amplified before sequencing.
66. A method of determining bacterial activity and/or active bacterial infections in a patient, the method comprising identifying one or more tRNA modifications in nucleic acid from one or more bacteria in a biological sample obtained from the patient.
67. The method of claim 66, wherein the identifying comprises sequencing bacterial nucleic acid in the biological sample.
68. The method of claim 67, wherein the sequencing comprises performing LIME-seq.
69. The method of any one of claims 66 to 68, wherein identification of one or more tRNA modifications in a bacterial strain in the bacterial population indicates the bacterial strain is actively growing.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463634886P | 2024-04-16 | 2024-04-16 | |
| US63/634,886 | 2024-04-16 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025221865A1 (en) | 2025-10-23 |
Family
ID=97404328
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/024928 (Pending), WO2025221865A1 (en) | Methods and compositions for cell free rna modification analysis | 2024-04-16 | 2025-04-16 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025221865A1 (en) |
2025
- 2025-04-16 WO PCT/US2025/024928 patent/WO2025221865A1/en active Pending
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP7542672B2 (en) | | Methods and compositions for analyzing nucleic acids |
| JP7504854B2 (en) | | Transfer to native chromatin for personalized epigenomics |
| Galardi et al. | | Cell-free DNA-methylation-based methods and applications in oncology |
| García-Giménez et al. | | Epigenetic biomarkers: Current strategies and future challenges for their use in the clinical laboratory |
| JP6680680B2 (en) | | Methods and processes for non-invasive assessment of chromosomal alterations |
| US10658070B2 (en) | | Resolving genome fractions using polymorphism counts |
| JP2022120007A (en) | | Non-invasive diagnostics by sequencing 5-hydroxymethylated cell-free DNA |
| JP6161607B2 (en) | | How to determine the presence or absence of different aneuploidies in a sample |
| US20170298427A1 (en) | | Nucleic acids and methods for detecting methylation status |
| JP2018524993A (en) | | Nucleic acids and methods for detecting chromosomal abnormalities |
| Wang et al. | | Terminal modifications independent cell-free RNA sequencing enables sensitive early cancer detection and classification |
| US20240274298A1 (en) | | Systems and methods for predicting pathogenic status of fusion candidates detected in next generation sequencing data |
| JP2020530261A (en) | | Methods for Accurate Computational Degradation of DNA Mixtures from Contributors of Unknown Genotypes |
| US10450612B2 (en) | | Chromosomal assessment to diagnose urogenital malignancy in dogs |
| Yang et al. | | Advancements in DNA methylation technologies and their application in cancer diagnosis |
| Ju et al. | | Modifications of microbiome-derived cell-free RNA in plasma discriminates colorectal cancer samples |
| WO2025221865A1 (en) | | Methods and compositions for cell free rna modification analysis |
| CN112970068A (en) | | Method and system for detecting contamination between samples |
| Jiang et al. | | Global characterization of extrachromosomal circular DNAs in four body fluids |
| WO2024159118A1 (en) | | Methods of hyper- and hypo-methylation analysis for disease detection |
| Christodoulou et al. | | G001. Development of TAF1 Genotyping Assay for X-Linked Dystonia-Parkinsonism-Associated Haplotype Detection |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25790918; Country of ref document: EP; Kind code of ref document: A1 |