
WO2021110987A1 - Methods and apparatuses for diagnosing cancer from cell-free nucleic acids - Google Patents

Methods and apparatuses for diagnosing cancer from cell-free nucleic acids

Info

Publication number
WO2021110987A1
WO2021110987A1 PCT/EP2020/084760 EP2020084760W WO2021110987A1 WO 2021110987 A1 WO2021110987 A1 WO 2021110987A1 EP 2020084760 W EP2020084760 W EP 2020084760W WO 2021110987 A1 WO2021110987 A1 WO 2021110987A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
subject
sequence reads
sequencing
nucleic acids
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/EP2020/084760
Other languages
French (fr)
Inventor
Ségolène DIRY
Emmanuel GILSON
Eric GINOUX
Virginie CHESNAIS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Life and Soft a Seqone Company SAS
Original Assignee
Life and Soft SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Life and Soft SAS
Publication of WO2021110987A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention relates to methods and apparatuses for estimating the probability of a subject to be affected with cancer, diagnosing cancer, determining the origin of a tumor in a subject and determining a personalized course of treatment in a subject affected or likely to be affected with cancer; based on the sequencing of cell-free nucleic acids and identification therein of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers.
  • Non-invasive detection methods using imaging approaches, such as mammography for the detection of breast cancers, or protein assays, such as the Prostate-Specific Antigen (PSA) assay for prostate cancer detection, are already used routinely.
  • these current methods are tumor site-specific, and are described as having poor sensitivity.
  • The Carcinoembryonic Antigen (CEA) assay used for the detection of colorectal cancer is reported to have a sensitivity of 41-52% and a specificity of 85-95%.
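To make the two figures quoted above concrete, the following minimal sketch computes sensitivity and specificity from confusion-matrix counts. The function name and the cohort counts are hypothetical and chosen only for illustration; they are not taken from the patent or from any study it cites.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp/fn: subjects with cancer who tested positive/negative.
    tn/fp: subjects without cancer who tested negative/positive.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one affected and one unaffected subject")
    sensitivity = tp / (tp + fn)  # fraction of cancer cases that are detected
    specificity = tn / (tn + fp)  # fraction of unaffected subjects correctly cleared
    return sensitivity, specificity


# Hypothetical cohort: 100 subjects with cancer, 100 without.
sens, spec = sensitivity_specificity(tp=45, fn=55, tn=90, fp=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A test with these counts would miss more than half of the cancer cases while rarely raising a false alarm, which is the kind of trade-off the passage describes.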
  • Cell-free circulating DNA (cfDNA) extracted from plasma helps diagnose patients at an initial cancer stage. Indeed, many tumors, even at an early stage, release cfDNA with the same genetic background as the primary tumor. Recently, a combination of markers has been used to detect and localize 8 major cancers in a cohort of more than 1817 samples with high accuracy. Somatic point mutations were identified in cfDNA in combination with protein assays in plasma to determine the presence of a cancer with a specificity greater than 99%, while sensitivity ranged between 30% and 99% depending on cancer type. Other studies have tried to reach the same goal using only genomic information from cfDNA sequencing but achieved lower accuracy.
  • Standard next-generation sequencing technologies such as, e.g., Illumina® involve clonal amplification of DNA and require a specific experimental protocol for each biomarker.
  • bisulfite treatment is required in addition to the sequencing, while chromatin accessibility evaluation relies on a PCR-free or single-stranded library.
  • Third-generation sequencing technologies such as, e.g., Nanopore® technologies are characterized by the sequencing of native DNA, which changes the ion current as it passes through the nanopore.
  • This long-read sequencing technology can be combined with a shotgun PCR-free library to allow the detection of genomic alterations, from point mutations to larger abnormalities like copy number variation (CNV) or rearrangement, the presence of virus-specific sequences, and the detection of methylated CpG, nucleosome position and chromatin remodeling.
  • the invention described hereafter overcomes the limitations of currently known non-invasive methods, by offering a fast and efficient diagnosis of cancer from cell-free nucleic acids.
  • the present invention relates to a method for estimating the probability of a subject to be affected with cancer, comprising the steps of:
  • the present invention also relates to a method for diagnosing cancer in a subject in need thereof, comprising the steps of:
  • the present invention also relates to a method for determining the origin of a tumor in a subject in need thereof, comprising the steps of:
  • the present invention also relates to a method for determining a personalized course of treatment in a subject affected or likely to be affected with cancer, comprising the steps of:
  • the sample is a bodily fluid.
  • the sample is selected from the group comprising blood, lymph, ascitic fluid, cystic fluid, urine, gastric juices, pancreatic juices, bile, nipple exudate, synovial fluid, bronchoalveolar lavage fluid, mucus, sputum, amniotic fluid, peritoneal fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, semen, milk, saliva, sweat, tears, feces, stools, and alveolar macrophages.
  • the sample is selected from the group comprising whole blood, plasma and serum.
  • the nucleic acids are cell-free nucleic acids (cfNAs). In one embodiment, the nucleic acids are cell-free circulating DNA (cfDNA). In one embodiment, the extracted nucleic acids are sequenced by single-molecule nucleic acid sequencing. In one embodiment, the extracted nucleic acids are sequenced by a sequencing method selected from the group comprising nanopore sequencing, single molecule real-time sequencing (SMRT), annular dark-field scanning transmission electron microscopy sequencing, HeliScope sequencing, and nano-knife-edge probe sequencing. In one embodiment, the extracted nucleic acids are sequenced by nanopore sequencing.
  • assigning the plurality of sequence reads at step (c) of the methods of the invention comprises: c1) aligning the plurality of sequence reads on the human genome, thereby obtaining human-mapped sequence reads; c2) discarding sequence reads that did not match with the human genome at step c1); c3) optionally, aligning sequence reads discarded at step c2) on at least one further reference genome or a portion thereof; preferably aligning sequence reads discarded at step c2) on at least one pathogen genome; preferably on a pathogen database; more preferably aligning sequence reads discarded at step c2) on at least one bacterial and/or viral genome; preferably on a bacterial and/or viral genome database; thereby obtaining exogenous-mapped sequence reads; c4) discarding sequence reads that did not match with the at least one further reference genome or a portion thereof at step c3).
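The control flow of steps c1) to c4) can be sketched as follows. This is only an illustration under strong assumptions: the function `assign_reads`, the toy sequences and the reference names are hypothetical, and exact substring search stands in for a real aligner, which would tolerate mismatches and gaps.

```python
def assign_reads(reads, human_ref, exogenous_refs=None):
    """Split reads into human-mapped, exogenous-mapped and discarded reads.

    `matches` is a toy stand-in for an aligner (exact substring search).
    `exogenous_refs` maps a reference name to its sequence; None skips c3).
    """
    def matches(read, reference):
        return read in reference

    # c1) align every read on the human genome
    human_mapped, unmapped = [], []
    for read in reads:
        (human_mapped if matches(read, human_ref) else unmapped).append(read)

    # c2) reads that missed the human genome are set aside
    exogenous_mapped, discarded = {}, []
    if exogenous_refs is None:  # step c3) is optional
        return human_mapped, exogenous_mapped, unmapped

    # c3) try the set-aside reads against each further reference genome
    for read in unmapped:
        hit = next(
            (name for name, seq in exogenous_refs.items() if matches(read, seq)),
            None,
        )
        if hit is None:
            discarded.append(read)  # c4) no match on any further reference
        else:
            exogenous_mapped.setdefault(hit, []).append(read)
    return human_mapped, exogenous_mapped, discarded


# Toy data, invented for this example.
human = "ACGTACGTTTGACCA"
pathogens = {"virus_X": "GGGCCCAAATTT", "bacterium_Y": "TTTTGGGGCCCC"}
reads = ["ACGTT", "CCCAAA", "GGGGCC", "NNNNN"]
human_mapped, exogenous_mapped, discarded = assign_reads(reads, human, pathogens)
print(human_mapped, exogenous_mapped, discarded)
```

Keeping the human pass first means that only reads with no human match reach the pathogen references, which mirrors the order of the steps in the text.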
  • genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer include genomic alterations, telomere length, retrotransposon sequence, DNA hypermethylation or hypomethylation, nucleosome footprint, nucleic acid fragment size, mitochondria quantity, cancer-inducing virus sequences and cancer-associated bacteria sequences.
  • genomic alterations include base pair mutations, differential trinucleotide frequencies, mutational signatures, copy number alterations, gene rearrangements, short tandem repeat polymorphism, and/or chromosomal abnormalities.
  • computer-processing the plurality of mapped sequence reads at step d) of the methods of the invention comprises correlating the mapped sequence reads with information available in databases and/or with information obtained from at least one reference subject, preferably from a reference population.
  • at least one reference subject is a substantially healthy subject; or the at least one reference subject is a cancer subject.
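The text does not specify how a subject's data are correlated with a reference population. As one simple possibility, assumed here purely for illustration, a single numeric biomarker value can be compared with the distribution observed in reference subjects through a z-score. The function name, the feature (a fragment-size summary) and all numbers below are hypothetical.

```python
from statistics import mean, stdev


def z_score(subject_value, reference_values):
    """Number of reference standard deviations separating the subject
    from the reference mean (sample standard deviation)."""
    if len(reference_values) < 2:
        raise ValueError("need at least two reference values")
    spread = stdev(reference_values)
    if spread == 0:
        raise ValueError("reference values have zero variance")
    return (subject_value - mean(reference_values)) / spread


# Hypothetical fragment-size summary (in base pairs) for five reference subjects.
reference = [160, 165, 170, 175, 180]
print(round(z_score(150, reference), 2))
```

A large absolute z-score only flags a deviation from the reference population; deciding what deviation is meaningful would require the reference data and models that the methods themselves rely on.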
  • the present invention also relates to a method for treating a subject affected with cancer, comprising the steps of:
  • step 2 treating said subject depending on the estimation, diagnosis, or determination of step 1).
  • treating said subject is carried out by any one of, or a combination of two or more of: surgery, radiation therapy, chemotherapy, activation immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
  • the present invention also relates to a computer system for: estimating the probability of a subject to be affected with cancer; or diagnosing cancer in a subject in need thereof; or determining the origin of a tumor in a subject in need thereof; or determining a personalized course of treatment in a subject affected with cancer; comprising: a) a processor and b) a storage medium that stores code readable by the processor; wherein the code stored on the storage medium, when executed by the processor, causes the computer system to: a.
  • At least one raw sequencing signal from a sequencing experiment of nucleic acids, preferably of cell-free nucleic acids (cfNAs), more preferably of cell-free circulating DNA (cfDNA), previously extracted from a sample from the subject; b. optionally, base-call and demultiplex said at least one raw sequencing signal, thereby obtaining at least one sequence read or a plurality of sequence reads; c. assign said at least one sequence read or the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining at least one mapped sequence read or a plurality of mapped sequence reads; d.
  • the term “about”, when set in front of a numerical value, means that said numerical value is approximate and small variations would not significantly affect the practice of the disclosed embodiments. Such small variations are, e.g., of ± 1 %, ± 2 %, ± 3 %, ± 4 %, ± 5 %, ± 6 %, ± 7 %, ± 8 %, ± 9 %, ± 10 % or more.
  • the term “subject” refers to a mammal, preferably a human.
  • a subject may be a “patient”, i.e., a warm-blooded animal, more preferably a human, who/which is awaiting the receipt of, or is receiving medical care or was/is/will be the object of a medical procedure, or is monitored for the development of a disease, such as cancer.
  • patient refers here to any mammal, including humans, domestic and farm animals, and zoo, sports, or pet animals, such as dogs, cats, cattle, horses, sheep, pigs, goats, rabbits, etc.
  • the mammal is a primate, more preferably a human.
  • the present invention relates to a method for estimating the probability of a subject to be affected with cancer.
  • It also relates to a method for diagnosing cancer in a subject in need thereof. It also relates to a method for evaluating the origin of a tumor in a subject in need thereof.
  • It also relates to a method for determining a personalized course of treatment in a subject affected or likely to be affected with cancer.
  • step 1) estimating the probability of said subject to be affected with cancer, or diagnosing cancer in said subject, or determining the origin of a tumor in said subject; and 2) treating said subject depending on the estimation, diagnosis, or determination of step 1).
  • the methods according to the present invention are not limited to a specific type of cancer, and therefore apply to “cancer” in its broadest sense. Alternatively, the methods according to the present invention may also be adapted to a given type or subtype of cancer.
  • the cancer is an early cancer. In one embodiment, the cancer is an advanced cancer. In one embodiment, the cancer is a metastatic cancer. In one embodiment, the cancer is a recurrent cancer. In one embodiment, the cancer is a stage 0, stage I, stage II, stage III, or stage IV cancer.
  • stage of a cancer describes the size of a tumour and how far it has spread from where it originated.
  • the cancer is a stage 0 cancer.
  • Stage 0 cancer describes cancer in situ. Stage 0 cancers are still located in the place they started and have not spread to nearby tissues. This stage of cancer is often highly curable, usually by removing the entire tumor with surgery.
  • the cancer is a stage I cancer.
  • Stage I cancer describes a small cancer or tumor that has not grown deeply into nearby tissues. It also has not spread to the lymph nodes or other parts of the body.
  • the cancer is a stage II cancer. “Stage II cancer” indicates that the cancer has grown, but hasn’t spread.
  • the cancer is a stage III cancer.
  • Stage III cancer indicates that the cancer is larger and may have spread to the surrounding tissues and/or the lymph nodes.
  • the cancer is a stage IV cancer.
  • Stage IV cancer describes a cancer that has spread to other organs or parts of the body.
  • the cancer is a grade I, grade II, or grade III cancer.
  • the “grade” of a cancer describes the appearance of the cancerous cells. In general, a lower grade indicates a slower-growing cancer and a higher grade indicates a faster-growing one.
  • the cancer is a grade I cancer. “Grade I cancer” indicates that the cancer comprises cancer cells that resemble normal cells, which aren’t growing rapidly.
  • the cancer is a grade II cancer. “Grade II cancer” indicates that the cancer comprises cancer cells that don’t look like normal cells, which are growing faster than normal cells.
  • the cancer is a grade III cancer.
  • “Grade III cancer” indicates that the cancer comprises cancer cells that look abnormal, which may grow or spread more aggressively.
  • cancers include those listed in the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), under chapter II, blocks C00 to D48.
  • Further examples of cancers include, but are not limited to, adenofibroma, adenoma, agnogenic myeloid metaplasia, AIDS-related malignancies, ameloblastoma, anal cancer, angiofollicular mediastinal lymph node hyperplasia, angiokeratoma, angiolymphoid hyperplasia with eosinophilia, angiomatosis, anhidrotic ectodermal dysplasia, anterofacial dysplasia, apocrine metaplasia, apudoma, asphyxiating thoracic dysplasia, astrocytoma (including, e.g, cerebellar astrocytoma and cerebral astrocytoma), atriodigital dysplasia, atypical mel
  • the cancer is a liquid cancer.
  • liquid cancer refers to cancer cells that are present in body fluids, such as blood, lymph and bone marrow. Lymphomas and leukemias are common types of such liquid cancers. In one embodiment, the cancer is a common cancer.
  • the term “common cancer” refers to one of the top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more cancer that is clinically diagnosed with the greatest frequency in a population.
  • the term “common cancer” refers to a cancer that is diagnosed with an annual incidence rate above about 1 in 50 000 people, such as about 1 in 40 000 people, about 1 in 30 000 people, about 1 in 20 000 people, about 1 in 10 000 people, about 1 in 9 500 people, about 1 in 9 000 people, about 1 in 8 750 people, about
  • 1 in 7 750 people, about 1 in 7 500 people, about 1 in 7 250 people, about 1 in 7 000 people, about 1 in 6 750 people, about 1 in 6 500 people, about
  • Examples of common cancers include, but are not limited to, breast cancer, lung and bronchus cancer, prostate cancer, colorectal cancer, melanoma, bladder cancer, non-Hodgkin’s lymphoma, kidney cancer, endometrial cancer, leukemia, pancreatic cancer, thyroid cancer, and liver cancer.
  • the cancer is selected from the group comprising or consisting of breast cancer, lung and bronchus cancer, prostate cancer, colorectal cancer, melanoma, bladder cancer, kidney cancer, and endometrial cancer.
  • the methods comprise a step of extracting nucleic acids from a sample.
  • the sample is a body tissue sample or a bodily fluid sample.
  • the sample is a body tissue sample.
  • body tissues include, but are not limited to, muscle, nerve, brain, heart, lung, liver, pancreas, spleen, thymus, esophagus, stomach, intestine, kidney, testis, prostate, ovary, hair, skin, bone, breast, uterus, bladder and spinal cord.
  • a body tissue sample may be recovered from the subject, e.g, by biopsy or during a surgical operation.
  • the sample is not a body tissue sample.
  • the sample is a bodily fluid.
  • bodily fluids include, but are not limited to, blood (including whole blood, plasma and serum), lymph, ascitic fluid, cystic fluid, urine, gastric juices, pancreatic juices, bile, nipple exudate, synovial fluid, bronchoalveolar lavage fluid, mucus, sputum, amniotic fluid, peritoneal fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, semen, milk, saliva, sweat, tears, feces, stools, and alveolar macrophages.
  • the sample is blood, such as whole blood, plasma or serum.
  • the sample is whole blood.
  • the term “whole blood” is as conventionally defined.
  • the sample is readily obtainable by minimally invasive methods or non-invasive methods, allowing the removal or isolation of the whole blood from the subject.
  • the sample is plasma.
  • plasma is as conventionally defined. Plasma is usually obtained from a sample of whole blood, provided or contacted with an anticoagulant (such as, e.g, heparin, citrate, oxalate or EDTA). Subsequently, cellular components of the whole blood sample are separated from the liquid component (i.e., the plasma) by an appropriate technique, typically by centrifugation.
  • the sample is serum.
  • serum is as conventionally defined. Serum is usually obtained from a sample of whole blood, by (1) allowing clotting to take place in the whole blood sample and (2) subsequently separating the so-formed clot and cellular components of the blood sample from the liquid component (i.e., the serum) by an appropriate technique, typically by centrifugation. Alternatively, serum can be obtained from plasma by removing the anticoagulant and fibrin. The term “serum” therefore refers to a composition which does not form part of a human or animal body.
  • the sample was previously taken from the subject, i.e., the method of the invention does not comprise a step of recovering a sample from the subject. Consequently, according to this embodiment, the methods of the invention are non-invasive methods.
  • nucleic acids refers to both DNA and RNA. Nucleic acids can be single-stranded or double-stranded. In one embodiment, the nucleic acid is DNA. In one embodiment, the nucleic acid is RNA.
  • the nucleic acids are cell-free nucleic acids (cfNAs).
  • cell-free nucleic acid or “cfNA”, sometimes referred to as “cell-free circulating nucleic acid” or “circulating nucleic acid”, are commonly used in the art to describe nucleic acid fragments that circulate in a subject’s bodily fluid and originate from one or more healthy cells and/or from one or more cancer cells from said subject.
  • the cfNA is a cell-free circulating DNA (cfDNA). In one embodiment, the cfNA is a cell-free circulating RNA (cfRNA).
  • Means and methods for extracting nucleic acids from a sample are well known to the one skilled in the art. Such means and methods include, e.g., the phenol-chloroform extraction method, or commercially available nucleic acid extraction reagents. Extraction can be carried out using commercially available kits.
  • depending on the type of cfNA (cfDNA, cfRNA or a combination of both), several means and methods can be carried out.
  • means and methods for extracting cfDNAs are well known in the art and commercial kits are readily available, e.g., the phenol-chloroform extraction method, the sodium iodide extraction method, the guanidine-resin extraction method, the “QIAamp® MinElute ccfDNA” kit from Qiagen, the “QIAamp® Circulating Nucleic Acids” kit from Qiagen, the “QIAamp® DNA Blood” kit from Qiagen, the “Gentra Puregene Blood” kit from Qiagen, the “MagMAX™ Cell-Free DNA Isolation” kit from Applied Biosystems, the “Quick-cfDNA Serum & Plasma” kit from Zymo Research, and the like.
  • means and methods for extracting cfRNAs can be adapted from the art and commercial kits are readily available, e.g., the TRIzol extraction method, the “RNeasy Mini” kit from Qiagen, the “QIAamp® Circulating Nucleic Acids” kit from Qiagen, the “MagMAX™-96 Blood RNA Isolation” kit from Thermo Fisher Scientific, and the like.
  • means and methods for extracting total cfNAs can be adapted from the art and commercial kits are readily available, e.g., the “AllPrep DNA/RNA Mini” kit from Qiagen, the “MagMAX™ Cell-Free Total Nucleic Acid Isolation” kit from Thermo Fisher Scientific, and the like.
  • the methods comprise a step of sequencing the extracted nucleic acids, preferably the extracted cfNAs.
  • sequence refers to any method by which the identity of at least about 5, about 10, about 15, about 20, about 25, about 30, about 40, about 50, about 60, about 70, about 80, about 90, about 100, about 125, about 150, about 175, about 200 or more nucleotides of a nucleic acid molecule is obtained.
  • sequence encompasses methods by which epigenetic information may also be obtained, such as, e.g, nucleotide modifications.
  • nucleotide modifications refers to any modification of a nucleotide which does not affect the nucleic acid sequence itself. Examples of such modifications include, but are not limited to, methylation (such as, e.g., cytosine methylation leading to 5-methylcytosine; adenosine methylation leading to N6-methyladenosine), oxidation (such as, e.g., 5-methylcytosine oxidation leading to 5-hydroxymethylcytosine; 5-hydroxymethylcytosine oxidation leading to 5-formylcytosine; 5-formylcytosine oxidation leading to 5-carboxylcytosine). These modifications are well known to the one skilled in the art.
  • the extracted nucleic acids preferably the extracted cfNAs, are sequenced by a sequencing method detecting nucleotide-specific physicochemical features, including size, optical, electrical, and/or magnetic properties.
  • the extracted nucleic acids are sequenced by a sequencing method detecting nucleotide size.
  • the extracted nucleic acids, preferably the extracted cfNAs are sequenced by a sequencing method detecting nucleotide optical properties (such as, e.g, fluorescence or absorption spectrum).
  • the extracted nucleic acids, preferably the extracted cfNAs are sequenced by a sequencing method detecting nucleotide electrical properties.
  • the extracted nucleic acids, preferably the extracted cfNAs are sequenced by a sequencing method detecting nucleotide magnetic properties.
  • the extracted nucleic acids are not sequenced by a first- or second-generation sequencing method.
  • first-generation sequencing refers to Sanger sequencing, i.e., a sequencing method based on the selective incorporation of chain-terminating di-deoxynucleotides by DNA polymerase during in vitro DNA replication.
  • second-generation sequencing also termed “massive parallel sequencing”, “massively parallel sequencing” or “next-generation sequencing”, refers to methods of “sequencing-by-synthesis”, wherein nucleic acid molecules to be sequenced are amplified, then sequenced in batch through nucleic acid neostrand synthesis.
  • second-generation sequencing methods include, but are not limited to, pyrosequencing (such as, e.g., using the 454 platform from Roche, or the GS FLX Titanium platform from 454 Life Sciences), sequencing by reversible terminator chemistry (such as, e.g, using the MiSeq platform, the HiSeq platform or the Genome Analyzer IIX platform from Illumina), and sequencing by ligation (such as, e.g, using the SOLiD4 platform from Life Technologies, now Thermo Fisher Scientific; or the Complete Genomics platform from Complete Genomics).
  • the extracted nucleic acids preferably the extracted cfNAs, are sequenced by a third-generation sequencing method.
  • third-generation sequencing also termed “single-molecule nucleic acid sequencing”, refers to sequencing methods, wherein the nucleotide sequence is read at the single nucleic acid molecule level.
  • third-generation sequencing methods include, but are not limited to, nanopore sequencing (such as, e.g, from Oxford Nanopore Technology, from Quantapore, or from Stratos Genomics Inc.); single molecule real-time sequencing (SMRT) (such as, e.g.
  • nanopore sequencing is sometimes referred to in the literature as fourth-generation sequencing.
  • Third-generation sequencing methods are well known in the art. For a review, see, e.g., Niedringhaus et al. (2011. Anal Chem. 83(12):4327-41) or Xu et al. (2009. Small. 5(23):2638-49).
  • the sequencing method provides, beside the identity of the nucleotides of a nucleic acid molecule, epigenetic information, such as, e.g., nucleotide modifications.
  • the extracted nucleic acids are sequenced by nanopore sequencing.
  • raw sequencing data are obtained upon sequencing the extracted nucleic acids, preferably the extracted cfNAs.
  • raw sequencing data refers to the output of a sequencing run.
  • Raw sequencing data are represented by the signal measured by the sequencer.
  • raw sequencing data may be pictures of fluorescent signal or recording of electric signal.
  • raw sequencing data are pre-processed to obtain sequence reads.
  • pre-process, also termed “base-call”, “base-called” or “base-calling”, refers to the transformation of the raw sequencing data (e.g., the fluorescent signal, electric signal, or the like) into corresponding nucleotides; in other words, to the assignment of nucleotides to a raw sequencing signal.
  • a plurality of sequence reads is obtained upon sequencing the extracted nucleic acids, preferably the extracted cfNAs. In one embodiment, a plurality of sequence reads is obtained upon base-calling of the raw sequencing data.
  • sequence read refers to the output of a sequencing run after pre-processing of the raw signal. Sequence reads are represented by a string of nucleotides. Sequence reads may be accompanied by metrics about the quality of the sequence. The quality is determined during the base-calling step and indicates the accuracy of each called base. For example, each nucleotide in a sequence read may be associated with the confidence of the base-call, i.e., of the determination of whether the nucleotide at that position is a G, A, T or C. In one embodiment, a plurality of “sequence reads” can include unique or substantially unique nucleic acid sequences.
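The text does not name a format for the per-base confidence. A common convention, assumed here, is the Phred quality score stored in FASTQ files with an ASCII offset of 33, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below decodes such a quality string; the function names and the example string are hypothetical.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (assumed Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(phred_score):
    """Probability that a base with this Phred score was called wrongly."""
    return 10 ** (-phred_score / 10)


# Illustrative quality string for a three-base read.
scores = decode_phred33("I5+")
print(scores)
print([error_probability(q) for q in scores])
```

Under this convention a score of 20 means roughly one wrong call in a hundred, so downstream steps can weight or filter bases by their reported accuracy.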
  • a plurality of “sequence reads” can include redundant sequences of the same parent molecule, generated, e.g, by an amplification step carried out before and/or during sequencing.
  • “consensus sequence reads” can be generated from sequence reads by comparing redundant sequence reads and selecting the most common nucleotide observed at a given position, by comparing to a reference genome or a portion thereof, or by other approaches.
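The first approach named above, selecting the most common nucleotide at each position, can be sketched as a per-column majority vote. The sketch assumes the redundant reads are already aligned to each other and have equal length; the function name and the example reads are hypothetical.

```python
from collections import Counter


def consensus_read(redundant_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    Ties are resolved in favour of the nucleotide seen first in the
    column, which is how Counter.most_common orders equal counts.
    """
    if not redundant_reads:
        raise ValueError("no reads given")
    length = len(redundant_reads[0])
    if any(len(read) != length for read in redundant_reads):
        raise ValueError("reads must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*redundant_reads)
    )


print(consensus_read(["ACGT", "ACGA", "ACGT"]))
```

Real redundant reads usually differ in length and contain insertions or deletions, so a practical implementation would first align them rather than compare columns directly.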
  • Unique or non-unique molecular tags (UMIs) can be added to the nucleic acids to be sequenced before an amplification step, to label each nucleic acid molecule.
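Because the tag is attached before amplification, reads carrying the same tag can be grouped as copies of one parent molecule. The following minimal sketch shows that grouping; it assumes the tag has already been extracted from each read, and the function name and the example tags are hypothetical.

```python
from collections import defaultdict


def group_by_tag(tagged_reads):
    """Group sequences by molecular tag.

    `tagged_reads` is an iterable of (tag, sequence) pairs. With
    non-unique tags, one group may mix several parent molecules.
    """
    families = defaultdict(list)
    for tag, sequence in tagged_reads:
        families[tag].append(sequence)
    return dict(families)


reads = [("AAT", "ACGT"), ("CCG", "TTGA"), ("AAT", "ACGA")]
print(group_by_tag(reads))
```

Each resulting group can then be collapsed into a single consensus sequence read, so that amplification copies are not counted as independent molecules.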
  • sequencing run refers to any step or portion of a sequencing experiment performed to determine some information related to at least one nucleic acid molecule.
  • the term “plurality”, when referring to sequence reads, means that more than one, such as, e.g., at least 2 sequence reads are obtained. In certain cases, a plurality of sequence reads may have at least about 10, at least about 100, at least about 1000, at least about 10 000, at least about 100 000, at least about 10⁶, at least about 10⁷, at least about 10⁸, at least about 10⁹ or more sequence reads.
  • the methods comprise a step of assigning the plurality of sequence reads to at least one reference genome or a portion thereof.
  • nucleic acid sequences such as, e.g, a sequence read and a reference genome sequence or a portion thereof
  • mapped e.g., a sequence read and a reference genome sequence or a portion thereof
  • sequence identity e.g., with at least about 50% sequence identity, such as at least about 60%, about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, about 96%, about 97%, about 98%, or about 99% sequence identity.
  • a sequence read is assigned to a reference genome sequence or a portion thereof when said sequence read is “aligned” on said reference genome sequence or a portion thereof.
  • an alignment may comprise a mismatch, i.e., a site at which a nucleotide in one sequence read and a nucleotide in the - or in a portion of the - reference genome with which it is aligned are not complementary.
  • an alignment may comprise 1, 2, 3, 4, 5 or more, contiguous or non-contiguous, mismatches.
  • an alignment may comprise 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10% or more, contiguous or non-contiguous, mismatches.
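The mismatch counts and percentages above, and the percentage of sequence identity, can be computed as follows for the simplest case. The sketch assumes a gap-free alignment in which the read and the reference segment have the same length; the function name and the example sequences are hypothetical.

```python
def mismatch_stats(read, reference_segment):
    """Return (mismatch count, percent mismatches, percent identity)
    for two gap-free aligned sequences of equal length."""
    if not read or len(read) != len(reference_segment):
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(read, reference_segment))
    percent_mismatch = 100.0 * mismatches / len(read)
    percent_identity = 100.0 * (len(read) - mismatches) / len(read)
    return mismatches, percent_mismatch, percent_identity


# One substitution in a ten-base read.
print(mismatch_stats("ACGTACGTAC", "ACGTTCGTAC"))
```

A threshold on either value then decides whether the read counts as assigned to that reference segment.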
  • mapped sequence read refers to a sequence read that has been assigned to (such as, e.g, “mapped” or “aligned”) a matching sequence in the at least one reference genome.
  • Assigning a sequence read to a reference genome or a portion thereof can be done manually or by a computer (e.g. , using a software, program, computer program component, or algorithm or machine learning algorithm or deep learning algorithm).
  • Various computational methods can be used to assign sequence reads to a reference genome.
  • Sequence reads can be mapped by a mapping component or by a machine or computer comprising a mapping component (e.g., a suitable mapping and/or alignment and/or classification program), which mapping component generally compares reads to a reference genome or segment thereof.
  • Sequence reads can be mapped to or aligned with a reference genome or a portion thereof by use of a suitable mapping and/or alignment program.
  • Suitable mapping and/or alignment programs include, but are not limited to, BWA (Li H. and Durbin R. (2009) Bioinformatics 25, 1754-60), Novoalign [Novocraft (2010)], Bowtie (Langmead B, et al., (2009) Genome Biol. 10:R25), SOAP2 (Li R, et al., (2009) Bioinformatics 25, 1966-67), BFAST (Homer N, et al., (2009) PLoS ONE, e7767), GASSST (Rizk, G. and Lavenier, D.
  • Sequence reads can be assigned to (such as, e.g., mapped or aligned) a reference genome or a portion thereof using a suitable short read alignment program.
  • Examples of such programs include, but are not limited to, BarraCUDA, BFAST, BLASTN, BLAST, BLAT, BLITZ, Bowtie (e.g., BOWTIE 1, BOWTIE 2), BWA, CASHX, CUDA-EC, CUSHAW, CUSHAW2, deSALT, drFAST, FASTA, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP, Geneious Assembler, GraphMap, iSAAC, LAST, MAQ, marginAlign, minimap, minimap2, minialign, mrFAST, mrsFAST, MOSAIK, MPscan, NanoBLASTer, Novoalign, NovoalignCS, Novocraft, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PROBEMATCH, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RTG, Segemehl, Se
  • Sequence reads can be assigned to (such as, e.g., mapped or aligned) a reference genome or a portion thereof by use of a suitable machine learning or deep learning algorithm.
  • Examples of such algorithms include, but are not limited to, fastText (Joulin, 2016. arXiv:1607.01759 [cs.CL]; Joulin et al., 2017. In 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017): Valencia, Spain, 3-7 April 2017. Stroudsburg, PA: Association for Computational Linguistics), fastDNA (Menegaux & Vert, 2019. J Comput Biol. 26(6):509-518), and large-scale linear models that learn continuous low-dimensional representations of the k-mers.
  • a mapping component can map sequence reads by a suitable method known in the art or described herein.
  • a mapping component or a machine or computer comprising a mapping component is required to provide mapped sequence reads.
  • a mapping component often comprises a suitable mapping and/or alignment program or algorithm.
  • a plurality of sequence reads and/or information associated with a plurality of sequence reads are stored on and/or accessed from a non-transitory computer-readable storage medium in a suitable computer-readable format.
  • Information stored on a non-transitory computer-readable storage medium is sometimes referred to as a “file” or “data file”.
  • a file or data file often comprises a format.
  • a sequence read or a plurality of sequence reads is sometimes stored in a format that includes information about one or more sequence reads, non-limiting examples of which include a complete or partial nucleic acid sequence, mappability, a mappability score, a mapped location, a relative location or distance from other mapped or unmapped reads (e.g., estimated distance between read mates), orientation relative to a reference genome or to other reads (e.g., relative to read mates), an estimated or precise location of a read mate, a G/C content, nucleotide modification (e.g., methylation), the like or combinations thereof.
  • a “computer-readable format” is sometimes referred to generally herein as a “format”.
  • sequence reads are stored and/or accessed in a suitable binary format, a text format, the like or a combination thereof.
  • a binary format is sometimes a BAM format.
  • a text format is sometimes a sequence alignment/map (SAM) format.
  • binary and/or text formats include, but are not limited to, BAM, sorted BAM, SAM, SRF, FASTA, FASTQ, Gzip, the like, or combinations thereof.
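As an illustration of reading one of the text formats listed above, the following is a minimal sketch of a FASTQ reader. It assumes the simple four-lines-per-record layout (no wrapped sequence lines); the function name `parse_fastq` and the example records are hypothetical.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Keep only the identifier, dropping any free-text description.
        yield header[1:].strip().split()[0], seq, qual


# Hypothetical two-record example held in memory instead of a file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGGCC\n+\nFFFF\n")
records = list(parse_fastq(example))
```

Binary formats such as BAM need a dedicated library and are outside the scope of this sketch.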
  • a program is configured to instruct a microprocessor to obtain or retrieve one or more files.
  • a program is configured to instruct a microprocessor to obtain or retrieve one or more FASTQ files (e.g., a FASTQ file for a first read and a second read) and/or one or more reference files (e.g., a FASTA or FASTQ file).
  • a program instructs a microprocessor to call a computer program component and/or transfers data and/or information (e.g., files) to or from one or more computer program components (e.g., an adapter trimmer component, BWA-MEM aligner, insert size distribution component, samtools, and the like).
  • a program instructs a processor to call a computer program component which creates new files and formats for input into another processing step.
  • the plurality of sequence reads is assigned to (such as, e.g, mapped or aligned) at least one reference genome or a portion thereof, such as on 1 reference genome, on 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000 or more reference genomes or a portion thereof.
  • a sequence read in the plurality of sequence reads may uniquely or non-uniquely map to a reference genome or a portion thereof.
  • a sequence read is considered as “uniquely mapped” if it is assigned (such as, e.g., mapped or aligned) - completely or partially - to a single sequence in the at least one reference genome or a portion thereof.
  • a sequence read is considered as “non-uniquely mapped” if it is assigned (such as, e.g., mapped or aligned) - completely or partially - to two or more sequences in the at least one reference genome or a portion thereof.
  • non-uniquely mapped sequence reads may be eliminated from further analysis.
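The distinction between uniquely and non-uniquely mapped reads can be sketched as a simple filter. The input layout (a dictionary from read identifier to a list of candidate locations) and the function name are assumptions made for illustration; real aligners report multi-mapping through their own output fields.

```python
def split_by_uniqueness(hits_by_read):
    """Separate reads with exactly one hit from reads with several hits.

    hits_by_read maps a read identifier to a list of (reference, position)
    tuples; reads with no hit appear in neither output.
    """
    unique, non_unique = {}, {}
    for read_id, hits in hits_by_read.items():
        if len(hits) == 1:
            unique[read_id] = hits[0]
        elif len(hits) > 1:
            non_unique[read_id] = hits
    return unique, non_unique


# Hypothetical example: r2 maps to two places and would be eliminated.
hits = {
    "r1": [("chr1", 1000)],
    "r2": [("chr1", 5000), ("chr7", 42)],
    "r3": [],
}
unique_reads, multi_reads = split_by_uniqueness(hits)
```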
  • a certain degree of mismatch between the reference genome or a portion thereof and the sequence reads may be allowed to account for, e.g, single nucleotide polymorphisms or sequencing errors.
  • no degree of mismatch between the reference genome or a portion thereof and the sequence reads may be allowed.
  • reference genome can refer to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus which may be used to reference identified sequences in the plurality of sequence reads.
  • a “reference genome” may refer to a portion of a genome (e.g., a chromosome or part thereof, e.g., one or more portions of a genome).
  • Human genomes, human genome assemblies and/or genomes from any other organisms or virus can be used as a reference genome.
  • One or more human genomes, human genome assemblies as well as genomes of other organisms or viruses can be found, e.g, at the National Center for Biotechnology Information (NCBI) at www.ncbi.nlm.nih.gov/genome.
  • a “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
  • a reference genome or a portion thereof often is an assembled or partially assembled genomic sequence from a subject or multiple subjects.
  • a reference genome or a portion thereof is an assembled or partially assembled genomic sequence from one or more human subjects.
  • a reference genome or a portion thereof comprises sequences assigned to chromosomes. In one embodiment, a reference genome or a portion thereof comprises sequences obtained from a reference subject or sample. In one embodiment, a reference genome or a portion thereof comprises sequences, an assembly of sequences, and/or a consensus sequence (e.g, a sequence contig). In one embodiment, a reference genome or a portion thereof is obtained from a reference subject or sample substantially free of a genetic variation. In one embodiment, a reference genome or a portion thereof is obtained from a reference subject or sample comprising a known genetic variation.
  • sequence reads can be assigned to (such as, e.g, mapped or aligned) sequences in nucleic acid databases known in the art.
  • databases include, but are not limited to, the International Nucleotide Sequence Database (at www.insdc.org), GenBank (at www.ncbi.nlm.nih.gov), the European Nucleotide Archive (at www.ebi.ac.uk/ena/browser/home), and the DNA Data Bank of Japan (at www.ddbj.nig.ac.jp).
  • Suitable examples include, without limitation, 23andMe, 1000 Genomes Project, ArrayExpress, Bioinformatic Harvester, ClinVar, COSMIC, dbSNP, ENCODE, Ensembl, Ensembl Genomes, Gene Disease Database, Gene Expression Omnibus (GEO), GTEx, HapMap, Human Microbiome Project (HMP), Human Protein Atlas (HPA), Online Mendelian Inheritance in Man (OMIM), Personal Genome Project, RefSeq, SNPedia, and TCGA.
  • BLAST or similar tools can be used to search sequence reads against a sequence database.
  • the mappability is assessed for a genomic region (e.g, one or more portions of a genome).
  • mappability refers to the ability to unambiguously assign a sequence read to a portion of a reference genome, typically up to a specified number of mismatches (such as, e.g, 1, 2, 3, 4, 5 or more mismatches).
  • mappability is provided as a score or value, where the score or value is generated by a suitable mapping algorithm or computer-mapping software.
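One simple way to picture a mappability score is the fraction of k-mers starting in a region that occur exactly once in the reference. The sketch below assumes exact matching (zero mismatches) on the forward strand only, which is stricter than the mismatch-tolerant definition given above; the function name and scoring scheme are hypothetical.

```python
from collections import Counter


def kmer_mappability(reference, start, end, k):
    """Fraction of k-mers starting in [start, end) that are unique in reference."""
    # Count every k-mer of the reference once.
    counts = Counter(
        reference[i:i + k] for i in range(len(reference) - k + 1)
    )
    # Only positions where a full k-mer fits are scored.
    positions = range(start, min(end, len(reference) - k + 1))
    if len(positions) == 0:
        return 0.0
    unique = sum(1 for i in positions if counts[reference[i:i + k]] == 1)
    return unique / len(positions)
```

For a genome-scale reference, an indexed data structure would replace the in-memory counter used here.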
  • the plurality of sequence reads is compared to one reference genome or a portion thereof.
  • the reference genome is the human (Homo sapiens sapiens) genome or a portion thereof.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g., the human genome are discarded from further analysis.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one further reference genome or a portion thereof or genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one pathogen genome or genome database, such as, e.g, an archaeal, bacterial, fungal, protist, protozoal, and/or viral reference genome or a portion thereof or genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one archaeal genome or a portion thereof or with an archaeal genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one bacterial genome or a portion thereof or with a bacterial genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one fungal genome or a portion thereof or with a fungal genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one protist genome or a portion thereof or with a protist genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one protozoal genome or a portion thereof or with a protozoal genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to at least one viral genome or a portion thereof or with a viral genome database.
  • sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome are compared to a bacterial and/or viral genome database.
  • sequence reads that matched with neither the first reference genome, e.g., the human genome, nor the further reference genome(s) or genome database(s), e.g., pathogen genome(s) or genome database(s), are discarded from further analysis.
  • sequence reads that matched with the at least one reference genome or a portion thereof are kept for further analysis.
  • mapped sequence reads may be classified into sequence reads that mapped with a first reference genome or a portion thereof, e.g., the human genome (i.e., “human-mapped sequence reads”), and sequence reads that mapped with the further reference genome(s) or a portion thereof, e.g., non-human genome, such as pathogen genome(s) or genome database(s) (i.e., “exogenous-mapped sequence reads”).
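The tiered comparison described above (human reference first, then further references, then discard) can be sketched as follows. Exact substring search stands in for a real aligner, and the function name, dictionary layout and toy sequences are all hypothetical.

```python
def classify_reads(reads, human_ref, exogenous_refs):
    """Classify reads as human-mapped, exogenous-mapped or discarded.

    reads: dict of read_id -> sequence
    human_ref: reference sequence tried first
    exogenous_refs: dict of name -> sequence tried on human-unmatched reads
    """
    result = {"human": [], "exogenous": [], "discarded": []}
    for read_id, seq in reads.items():
        if seq in human_ref:  # placeholder for a real alignment step
            result["human"].append(read_id)
        elif any(seq in ref for ref in exogenous_refs.values()):
            result["exogenous"].append(read_id)
        else:
            result["discarded"].append(read_id)
    return result


# Toy sequences, invented for illustration only.
classified = classify_reads(
    {"r1": "ACGTAC", "r2": "TTGGCC", "r3": "NNNNNN"},
    human_ref="GGACGTACGG",
    exogenous_refs={"virus_x": "AATTGGCCAA"},
)
```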
  • the methods comprise a step of computer-processing the mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic, metabolic and/or metagenomic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - genetic and epigenetic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and transcriptomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and metagenomic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - epigenetic and transcriptomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - transcriptomic and metabolic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - transcriptomic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic and transcriptomic biomarkers. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic, transcriptomic and metabolic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - epigenetic, transcriptomic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic and metagenomic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
  • computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
  • genetic biomarker is broader than a gene.
  • a “genetic biomarker” refers to a fragment of nucleic acid, such as a fragment of DNA, with an identifiable physical location on a chromosome whose inheritance can be followed.
  • a genetic biomarker can have a function and thus be, or be a fragment of, a gene.
  • a genetic biomarker can be a fragment of nucleic acid, such as a fragment of DNA, with no known function.
  • genetic biomarkers of cancer include, but are not limited to, genomic alterations (such as, e.g, somatic mutations, single-nucleotide polymorphism), telomere length evaluation, and retrotransposon sequence detection.
  • identifying - or assessing the presence of - genetic biomarkers of cancer in said mapped sequence reads may comprise analyzing the position, gene and impact of genomic alterations and/or nucleosome footprint in said mapped sequence reads.
  • genomic alteration refers to a change (or mutation) in the nucleotide sequence of the genome of a cancer cell, which change is not present in a non-cancer cell genome.
  • genomic alterations include, but are not limited to, base pair substitutions (such as, e.g, single-nucleotide polymorphism), base pair insertions, base pair deletions, copy number alteration, gene rearrangement (such as, e.g, gene fusion), short tandem repeat polymorphism (such as, e.g, STR), chromosomal abnormalities and any combination thereof.
  • base pair substitutions, insertions and deletions are commonly included under the general term “base pair mutation”.
  • genomic alteration may be defined relative to the locus or gene in which it is present in cancer cells relative to a non-cancer cell genome. In one embodiment, the genomic alteration may be defined relative to trinucleotide frequencies in the genome of cancer cells relative to a non-cancer cell genome.
  • cancer-specific base pair mutations include, but are not limited to, mutations in any one of the following genes: ACVR2A, AFF3, ALK, APC, AR, ARID1A, ARID2, ATM, ATRX, BAP1, BCOR, BRAF, CAMTA1, CDH1, CDKN2A, CREBBP, CTCF, CTNNB1, EBF1, EGFR, EP300, ERBB2, ERBB3, ERBB4, ERCC2, ESR1, FAT1, FAT4, FBXW7, FGFR2, FGFR3, FHIT, FOXP1, GATA3, GRIN2A, HRAS, KDM6A, KDM5C, KDR, KEAP1, KIT, KMT2A, KMT2C, KMT2D, KRAS, LPP, LRP1B, MAP3K1, MET, MTOR, MSH6, NF1, NF2, NRAS, PBRM1, PIK3CA, PIK3R1, POLE, PPP2R1A
  • Table 1 provides examples of base pair mutation occurrences (expressed in %) observed in certain types of cancers. Table 1. Extracted from GRCh38 COSMIC v90.
  • Differential trinucleotide frequencies in the genome of cancer cells relative to a non-cancer cell genome have been termed “mutational signature” in the art. Examples of such mutational signatures are described, e.g., in Alexandrov et al., 2013. Nature. 500(7463):415-21. For example, certain mutational signatures are known to be associated with certain types of cancers. Table 2 provides examples of mutational signatures and their correlation with certain types of cancers.
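To make the trinucleotide idea concrete, the sketch below labels a single-base substitution with its flanking bases and tallies the labels into a spectrum. It follows the convention commonly used for such signatures, in which substitutions are reported relative to the pyrimidine base; the function names, 0-based coordinates and toy reference are assumptions for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mutation_context(reference, pos, alt):
    """Return a label such as 'A[C>T]G' for a substitution at 0-based pos."""
    if pos < 1 or pos + 1 >= len(reference):
        raise ValueError("substitution needs one flanking base on each side")
    tri = reference[pos - 1:pos + 2].upper()
    alt = alt.upper()
    if tri[1] in "AG":
        # Fold purine references onto the opposite (pyrimidine) strand.
        tri, alt = revcomp(tri), revcomp(alt)
    return f"{tri[0]}[{tri[1]}>{alt}]{tri[2]}"


def mutation_spectrum(reference, substitutions):
    """Count context labels for an iterable of (pos, alt) substitutions."""
    return Counter(mutation_context(reference, p, a) for p, a in substitutions)
```

Comparing such a spectrum between tumor-derived and non-tumor reads is one way to look for the differential frequencies mentioned above; fitting it to known signatures requires a separate decomposition step not shown here.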
  • cancer-specific copy number alteration examples include, but are not limited to, increased copy number (i.e., gain of gene copy number in cancer cells relative to non-cancer cells) of any one of the following genes: AARD, APCDD1L, ATP11B, ATP5F1E, CSMD3, CTSZ, DCUN1D1, DPP6, EIF3H, EXT1, GNAS, HDAC9, LAMP3, MAL2, MCCC1, PRELID3B, RAB22A, RAD21, SAMD12, SLC30A8, SOX2, TG, TOX2,
  • cancer-specific copy number alteration examples include, but are not limited to, decreased copy number (i.e., loss of gene copy number in cancer cells relative to non-cancer cells) of any one of the following genes: CDKN2A, CDKN2B, CSMD1, DNAAF5, EML4/ALK, FHIT, MTAP, RBFOX1, and SCAPER.
  • cancer-specific gene rearrangement examples include, but are not limited to, ETV6/NTRK3 fusion, MYB/NFIB fusion, TMPRSS2/ERG fusion, and TRPS1 fusion.
  • cancer-specific short tandem repeat polymorphism examples include, but are not limited to, IGF-I and AR.
  • the presence of any one of the following genomic alterations is known to be associated with breast cancer: ETV6/NTRK3 fusion, MYB/NFIB fusion, TRPS1 fusion; increased copy number of AARD, CSMD3, EIF3H, EXT1, MAL2, RAD21, SAMD12, SLC30A8, and/or UTP23; and decreased copy number of CSMD1, DNAAF5, and/or SCAPER.
  • the presence of any one of the following genomic alterations is known to be associated with colorectal cancer: increased copy number of APCDD1L, ATP5F1E, CTSZ, GNAS, PRELID3B, RAB22A, TOX2, and/or VAPB; and decreased copy number of FHIT and/or RBFOX1.
  • the presence of any one of the following genomic alterations is known to be associated with kidney cancer: ETV6/NTRK3 fusion; and decreased copy number of SCAPER.
  • the presence of any one of the following genomic alterations is known to be associated with lung and bronchus cancer: increased copy number of ATP11B, DCUN1D1, LAMP3, MCCC1, and/or SOX2; and decreased copy number of CDKN2B, EML4/ALK, and/or SCAPER.
  • the presence of any one of the following genomic alterations is known to be associated with melanoma: increased copy number of DPP6, HDAC9, and/or TG; and decreased copy number of CDKN2A, CDKN2B, MTAP, and/or SCAPER.
  • the presence of the following genomic alteration is known to be associated with prostate cancer: TMPRSS2/ERG fusion.
  • the presence of the following genomic alteration is known to be associated with prostate cancer: CAG repeats length in AR.
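Measuring a short tandem repeat such as the CAG repeat mentioned above amounts to finding the longest uninterrupted run of the repeat unit. The sketch below assumes the read fully spans the repeat and contains no sequencing errors inside it; the function name and the toy sequence are hypothetical.

```python
import re


def longest_repeat(seq, unit="CAG"):
    """Return the number of units in the longest uninterrupted run of unit."""
    pattern = f"(?:{re.escape(unit)})+"
    runs = re.findall(pattern, seq.upper())
    # Each run is a whole number of units, so integer division is exact.
    return max((len(run) // len(unit) for run in runs), default=0)


# Toy sequence: a run of 3 CAG units followed by an isolated one.
repeat_count = longest_repeat("TTCAGCAGCAGTTCAGAA")
```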
  • Assessing the size - or size distribution - of telomeres in said mapped sequence reads may provide information on the presence of cancer cells.
  • small-size telomeres are indicative of cancer.
  • Identifying - or assessing the presence of - retrotransposon sequences in said mapped sequence reads may comprise analyzing the number, position and impact of retrotransposon sequences.
  • retrotransposon sequences include, but are not limited to, short interspersed nuclear elements (such as, e.g., Alu sequences and mammalian-wide interspersed repeats), long interspersed nuclear elements (such as, e.g, LINE1 and LINE2), and long terminal repeats (such as, e.g, HERV, MER4 and retroposons).
  • epigenetic biomarker refers to a modification in a nucleic acid, such as in a DNA molecule, by a process or processes that do not change the nucleic acid sequence itself.
  • epigenetic biomarkers of cancer include, but are not limited to, DNA hypermethylation or hypomethylation (when taken in comparison to a substantially healthy, i.e., non-cancerous, sample), nucleosome footprint and nucleic acid fragment size.
  • identifying - or assessing the presence of - epigenetic biomarkers of cancer in said mapped sequence reads may comprise analyzing the position, CpG count and methylation status of said mapped sequence reads.
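The position, CpG count and methylation status of a mapped read can be summarized as in the sketch below. It rests on an assumption not stated in the description: that reads come from bisulfite-converted DNA aligned ungapped to the forward strand, so that a C retained at a reference CpG is read as methylated and a T as unmethylated. The function name and example sequences are hypothetical.

```python
def cpg_methylation(ref_segment, read):
    """Summarize CpG sites of ref_segment as seen in an aligned read.

    Assumes a bisulfite-converted, forward-strand, ungapped alignment.
    """
    if len(ref_segment) != len(read):
        raise ValueError("read and reference segment must have equal length")
    ref_segment, read = ref_segment.upper(), read.upper()
    sites, methylated, unmethylated = [], 0, 0
    for i in range(len(ref_segment) - 1):
        if ref_segment[i:i + 2] == "CG":
            sites.append(i)
            if read[i] == "C":
                methylated += 1    # C protected from conversion
            elif read[i] == "T":
                unmethylated += 1  # C converted and read as T
    return {
        "cpg_positions": sites,
        "cpg_count": len(sites),
        "methylated": methylated,
        "unmethylated": unmethylated,
    }


summary = cpg_methylation("ACGTTCGACG", "ACGTTTGACG")
```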
  • assessing the presence (or absence) of DNA hypermethylation or hypomethylation in a sample may provide information on cancer-specific methylation status.
  • the methylation status may be defined relative to a locus or a gene.
  • cancer-specific DNA hypermethylation i.e., increased presence of methylated nucleotides in cancer cells relative to non-cancer cells
  • examples of cancer-specific DNA hypermethylation include, but are not limited to, hypermethylation of any one of the following loci: 1:147545131, 1:159010051, 1:184867071, 1:234772479, 1:234772634, 1:9626465, 2:111494677,
  • cancer-specific DNA hypomethylation i.e., decreased presence of methylated nucleotides in cancer cells relative to non-cancer cells
  • DNA hyper- or hypomethylation is known to be associated with bladder cancer: DNA hypermethylation at locus 1:147545131, 2:2318016, 4:113355678, 4:1494607, 5:179354562, 6:30163104,
  • DNA hypermethylation at locus 1:234772479, 1:234772634, 2:111494677, 4:6322902, 5:112329851, 5:40841488, 6:30769291, 6:70312472, 7:17234713, 7:82805693, 13:101706760, 15:67150555, and/or
  • DNA hypermethylation at locus 1:9626465, 4:19455540, 4:634860, 6:10528259, 3:1163104, 3:13224, 7:1177297, 7:158428678, 8:88957006, 10:130045534, 12:122898852, 13:24511163, 13:24511531, and/or 18:74499068; and DNA hypomethylation at locus 1:227561011, 1:227561018, 1:54781467, 2:208124524, 2:239309155, 5:141419191, 6:104940793, 6:104953110, 6:104953118, 6:27582968, 7:149692578, 10:45427926, 14:20435452, 17:2238547
  • DNA hyper- or hypomethylation is known to be associated with kidney cancer: DNA hypermethylation at locus 1:159010051, 1:234772634, 2:111699234, 2:237687894, 7:17234713, 10:11685287, 10:133259456, 12:76183708, 14:75124193, 16:80027393, 16:80027393, and/or 16:80027460; and DNA hypomethylation at locus 1:10673454, 4:79964827, 6:31728646, 10:11275824, 19:1907973, 21:45425245, and/or X:118499399.
  • DNA hyper- or hypomethylation is known to be associated with lung and bronchus cancer: DNA hypermethylation at locus 1:184867071, 5:3764427, 3:1163224, 3:1163224, 7:17234713, 7:75776649,
  • DNA hyper- or hypomethylation is known to be associated with prostate cancer: DNA hypermethylation at locus 2:239052858, 10:26642983, 17:79979217, and/or 17:79979289; and DNA hypomethylation at locus 2:208124524, 2:231396296, 6:104953110, 6:104953118, 7:16465977, 12:54259580, 19:38211354, and/or 19:46297345.
  • nucleosome footprint refers to the mapping of nucleosome occupancy, which correlates with nuclear architecture, gene structure and gene expression observed in a given type of cell. Hence, nucleosome footprinting makes it possible to identify the cell type of origin based on the fragmentation pattern of cfNA, expression of genes, presence of mitochondrial DNA, and the like.
  • Assessing the size - or size distribution - of said mapped sequence reads may provide information on the type of cell death responsible for the release of the cfNAs.
  • small-size mapped sequence reads are indicative of apoptosis.
  • large-size mapped sequence reads are indicative of necrosis.
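A size distribution of mapped fragments can be summarized as below. The 200 bp cut-off separating "small" from "large" fragments is an arbitrary placeholder chosen for this sketch; the description does not give a threshold, and the function name is hypothetical.

```python
from statistics import median


def summarize_fragment_sizes(lengths, short_max=200):
    """Return the median size and the fractions of short and long fragments.

    short_max is a hypothetical cut-off in base pairs, not a value from
    the description.
    """
    if not lengths:
        raise ValueError("at least one fragment length is required")
    short = sum(1 for n in lengths if n <= short_max)
    short_fraction = short / len(lengths)
    return {
        "median": median(lengths),
        "short_fraction": short_fraction,
        "long_fraction": 1 - short_fraction,
    }


# Toy lengths, invented for illustration only.
size_summary = summarize_fragment_sizes([150, 160, 170, 1000])
```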
  • transcriptomic biomarker refers to a nucleic acid fragment, such as RNA fragment, with an identified physical location.
  • a transcriptomic biomarker can be a count of nucleic acid fragments aligned at a position to represent gene expression level, the determination of alternative transcript expression, or the identification of small RNA of interest, such as miRNA implicated in gene expression regulation.
  • metabolic biomarker refers to the mitochondria quantity. Mitochondria quantity can be readily evaluated by quantifying sequencing reads aligned on the mitochondrial chromosome (chrM).
  • the mitochondrial chromosome is a closed circular molecule that contains 16,569 bp.
  • Each mitochondrial chromosome in a mitochondrion normally contains a full set of all the mitochondrial genes.
  • a human mitochondrion contains approximately 5 such mitochondrial chromosomes, with a quantity usually ranging from 1 to 15.
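Quantifying reads aligned on chrM can be sketched as a simple proportion of all mapped reads. The input (one chromosome name per mapped read) and the function name are assumptions for illustration; the name of the mitochondrial contig depends on the reference assembly used.

```python
from collections import Counter


def mitochondrial_fraction(read_chromosomes, mito_name="chrM"):
    """Fraction of mapped reads whose chromosome is the mitochondrial one."""
    counts = Counter(read_chromosomes)
    total = sum(counts.values())
    return counts[mito_name] / total if total else 0.0


# Toy input: two of four mapped reads fall on chrM.
chrm_fraction = mitochondrial_fraction(["chr1", "chrM", "chr2", "chrM"])
```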
  • a “metagenomic biomarker” refers to a microbial sequence, such as, e.g, a nucleic acid sequence matching with an archaeal, bacterial, fungal, protist, protozoal, or viral reference genome or a portion thereof.
  • a “metagenomic biomarker” refers to a nucleic acid sequence matching with a bacterial and/or viral reference genome or a portion thereof.
  • assessing the presence (or absence) of pathogenic biomarkers of cancer in a sample may provide information on viral sequences originating, e.g, from cancer-inducing viruses; or on bacterial sequences originating, e.g, from cancer-associated bacteria.
  • the step of computer-processing the mapped sequence reads, and identifying - or assessing the presence of - pathogenic biomarkers of cancer in said mapped sequence reads is preferably performed on exogenous-mapped sequence reads identified in previous steps of the methods.
  • cancer-inducing viruses include, but are not limited to, cytomegalovirus (CMV), Epstein-Barr virus (EBV), hepatitis B virus (HBV), hepatitis C virus (HCV), Kaposi’s sarcoma-associated herpesvirus (KSHV, formally known as HHV-8), human immunodeficiency virus (HIV), human papillomavirus (HPV), human T-lymphotropic virus (also known as human T-cell lymphotropic virus or human T-cell leukemia-lymphoma virus, HTLV).
  • EBV is known to be associated with Hodgkin’s and non-Hodgkin’s lymphoma, nasopharyngeal cancer, and Burkitt lymphoma.
  • HBV is known to be associated with hepatocellular carcinoma.
  • HCV is known to be associated with hepatocellular carcinoma.
  • HHV-8 is known to be associated with Kaposi sarcoma.
  • HIV is known to be associated with several cancers.
  • HPV is known to be associated with endometrial cancer.
  • HTLV is known to be associated with lymphoma and leukemia.
  • CMV is known to be associated with colorectal cancer.
  • cancer-associated bacteria examples include, but are not limited to, Bacteroides fragilis, Borrelia burgdorferi, Campylobacter jejuni, Chlamydia pneumoniae,
  • Opisthorchis viverrini, Salmonella enterica serovar Typhimurium, Salmonella enterica serovar Paratyphi, Salmonella Typhi, Schistosoma haematobium, Streptococcus bovis, and Treponema pallidum.
  • the presence of Helicobacter hepaticus, Salmonella enterica serovar Typhimurium, Salmonella enterica serovar Paratyphi and/or Opisthorchis viverrini in a sample is known to be associated with bile duct cancer.
  • the presence of Neisseria gonorrhoeae, Cutibacterium acnes and/or Treponema pallidum is known to be associated with prostate cancer.
  • the presence of Neisseria gonorrhoeae, Cutibacterium acnes and/or Treponema pallidum, Helicobacter bilis, Salmonella Typhi and/or Schistosoma haematobium is known to be associated with bladder cancer.
  • the presence of Bacteroides fragilis, Clostridium ssp, Mycoplasma fermentans, Mycoplasma hyorhinis, Mycoplasma penetrans and/or Streptococcus bovis is known to be associated with colorectal cancer.
  • the presence of Chlamydia trachomatis is known to be associated with endometrial cancer.
  • Chlamydophila psittaci is known to be associated with eye cancer.
  • the presence of Borrelia burgdorferi, Helicobacter bizzozeronii, Helicobacter felis, Helicobacter heilmannii, Helicobacter pylori, Helicobacter salomonis, Helicobacter suis, Mycoplasma fermentans, Mycoplasma hyorhinis and/ or Mycoplasma penetrans is known to be associated with gastric cancer.
  • the presence of Chlamydia pneumoniae, Chlamydia pneumonia, Mycoplasma fermentans, Mycoplasma hyorhinis and/or Mycoplasma penetrans is known to be associated with lung cancer.
  • The presence of Mycoplasma fermentans, Mycoplasma hyorhinis and/or Mycoplasma penetrans is known to be associated with ovarian cancer.
  • the presence of Campylobacter jejuni is known to be associated with small intestine cancer.
  • Biomarkers of cancer are identified in the methods of the invention based on results obtained after sequencing and comparatively analyzing multiple samples labeled as cancer samples and substantially healthy samples (i.e., without any evidence of cancer).
  • biomarkers of cancer are identified in the methods of the invention based on known information available in databases.
  • Databases include, but are not limited to, the International Nucleotide Sequence Database (at www.insdc.org), GenBank (at www.ncbi.nlm.nih.gov), the European Nucleotide Archive (at www.ebi.ac.uk/ena/browser/home), and the DNA Data Bank of Japan (at www.ddbj.nig.ac.jp).
  • Suitable examples include, without limitation, 23andMe, 1000 Genomes Project, ArrayExpress, Bioinformatic Harvester, ClinVar, COSMIC, dbSNP, ENCODE, Ensembl, Ensembl Genomes, Gene Disease Database, Gene Expression Omnibus (GEO), GTEx, HapMap, Human Microbiome Project (HMP), Human Protein Atlas (HPA), Online Mendelian Inheritance in Man (OMIM), Personal Genome Project, RefSeq, SNPedia, and TCGA.
  • biomarkers of cancer are identified in the methods of the invention using a learning algorithm.
  • The terms “learning algorithm” or “machine learning algorithm” refer to computer-executed algorithms that automate analytical model building, e.g., for clustering, classification or profile recognition. Learning algorithms perform analyses on training datasets provided to the algorithm. Learning algorithms output a “model”, also referred to as a “classifier”, “classification algorithm” or “diagnostic algorithm”. Models receive, as input, test data and produce, as output, an inference or a classification of the input data as belonging to one or another class, cluster group or position on a scale, such as diagnosis, stage, prognosis, disease progression, responsiveness to a drug, etc.
  • a variety of learning algorithms can be used to infer a condition or state of a subject. Machine learning algorithms may be supervised or unsupervised.
  • Examples of learning algorithms include, but are not limited to, artificial neural networks (e.g., back propagation networks), discriminant analyses (e.g., Bayesian classifier, Fisher analysis), support vector machines, decision trees (e.g., recursive partitioning processes, such as classification and regression trees [CART]), random forests, linear classifiers (e.g., multiple linear regression [MLR], partial least squares [PLS] regression, principal components regression [PCR]), hierarchical clustering and cluster analysis.
  • The learning algorithm generates a model or classifier that can be used to make an inference, e.g., an inference about a disease state of a subject.
  • the learning algorithm was previously trained with at least one training dataset.
  • the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from at least one reference subject.
  • the reference subject is an animal, preferably a mammal.
  • Mammals include, but are not limited to, humans, non-human primates (such as, e.g., chimpanzees, and other apes and monkey species), farm animals (such as, e.g., cattle, horses, sheep, goats, and swine), domestic animals (such as, e.g., rabbits, dogs, and cats), laboratory animals (such as, e.g., rats, mice and guinea pigs), and the like.
  • the reference subject is a primate, including human and non-human primates. In one embodiment, the reference subject is a human.
  • the reference subject is a substantially healthy subject.
  • a “substantially healthy subject” has not been previously or will not be diagnosed or identified as having or suffering from cancer.
  • the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from a healthy reference population.
  • the term “healthy reference population” refers to a group of substantially healthy subjects, either of similar or different origin, ethnical background, gender, age, etc., such as a group of at least 10, preferably at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500 or more substantially healthy subjects.
  • The reference subject is a cancer subject.
  • a “cancer subject” has been previously or will be diagnosed or identified as having or suffering from cancer.
  • the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from a cancer reference population.
  • The term “cancer reference population” refers to a group of cancer subjects, either of similar or different origin, ethnical background, gender, age, etc., such as a group of at least 10, preferably at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500 or more cancer subjects.
  • The cancer reference population may comprise cancer subjects who have been previously or will be diagnosed or identified as having or suffering from one type of cancer.
  • The cancer reference population may comprise cancer subjects who have been previously or will be diagnosed or identified as having or suffering from any type of cancer.
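  • As an illustration of how a supervised learning algorithm could be trained on a healthy reference population and a cancer reference population, the following minimal sketch uses a nearest-centroid classifier. The choice of classifier, the function names and the feature values are assumptions made for the example only; the text does not prescribe any particular algorithm.

```python
from statistics import mean


def train_centroids(samples, labels):
    """Return one mean feature vector (centroid) per class label."""
    by_label = {}
    for features, label in zip(samples, labels):
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }


def classify(model, features):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: distance(model[label]))


# Hypothetical biomarker scores per sample: (methylation, variant, nucleosome).
training_samples = [
    [0.1, 0.0, 0.2], [0.2, 0.1, 0.1],   # healthy reference population
    [2.5, 1.1, 1.8], [3.0, 0.9, 2.2],   # cancer reference population
]
training_labels = ["healthy", "healthy", "cancer", "cancer"]

model = train_centroids(training_samples, training_labels)
prediction = classify(model, [2.8, 1.0, 2.0])
```

  • In practice the reference populations would contain many more subjects, as stated above, and any of the listed learning algorithms could take the place of the centroid rule.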
  • the methods comprise a step of assigning a score to each biomarker of cancer identified in previous steps of the methods.
  • The term “score” refers to a value computed to summarize multiple results into a single one.
  • Examples of scoring methods include, but are not limited to, the mean, the median, the sum, or the average of the probabilities of the positive class (p_i) for all features (n) associated with a sample multiplied by the number of positive detected features, as illustrated, e.g., by the following formula:
  • The scoring method is applied to each group of biomarkers of cancer independently. In one embodiment, the scoring method is applied to several groups of biomarkers of cancer based on their functional impact. In one embodiment, the scoring method is applied to each detected biomarker of cancer.
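  • Read literally, the positive-class scoring method described above multiplies the average of the probabilities p_i over the n features by the number k of positive detected features, i.e., score = k × (1/n) × Σ p_i. The sketch below implements this reading. The function name and the 0.5 cut-off used to decide that a feature is “positive” are assumptions, not values given in the text.

```python
def positive_class_score(probabilities, threshold=0.5):
    """Average positive-class probability times the number of positive features.

    `threshold` is an assumed cut-off for calling a feature positive.
    """
    n = len(probabilities)
    if n == 0:
        return 0.0
    average = sum(probabilities) / n
    positives = sum(1 for p in probabilities if p > threshold)
    return average * positives


# Hypothetical probabilities for four features of one sample.
example_score = positive_class_score([0.75, 0.5, 0.25, 1.0])
```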
  • the methods comprise a step of classifying the subject’s sample based on the scores assigned to each identified biomarker of cancer in previous steps of the methods.
  • the methods comprise a step of concluding, based on classification of the subject’s sample: on the probability of the subject to be affected with cancer; or on the diagnosis of cancer in the subject; or on the determination of the origin of a tumor in the subject; or on the determination of a personalized course of treatment for the subject.
  • the methods may comprise a further step of treating the subject.
  • The terms “treating” or “treatment” or “alleviation” refer to therapeutic treatment, excluding prophylactic or preventative measures, wherein the object is to slow down (lessen) a given disease, such as, e.g., cancer.
  • Those in need of treatment include those already with the disease (such as, e.g., cancer) as well as those suspected to have the disease (such as, e.g., cancer).
  • a subject is successfully “treated” for a given disease (such as, e.g, cancer) if, after receiving a therapeutic amount of a therapeutic agent, said subject shows observable and/or measurable reduction in or absence of one or more of the following: one or more of the symptoms associated with the disease (such as, e.g, cancer); reduced morbidity and mortality; and/or improvement in quality of life issues.
  • the above parameters for assessing successful treatment and improvement in a given disease are readily measurable by routine procedures familiar to a physician.
  • treating the subject for cancer is carried out by any of - or a combination of two or more of - surgery, radiation therapy, chemotherapy, activation immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
  • The term “radiation therapy”, also termed “radiotherapy” and often abbreviated as “RT”, “RTx” or “XRT”, refers to a therapy using ionizing radiation to control or kill malignant cells.
  • Radiation therapies include, but are not limited to, external beam radiotherapy (such as, e.g., superficial X-ray therapy, orthovoltage X-ray therapy, megavoltage X-ray therapy, radiosurgery, stereotactic radiation therapy, cobalt therapy, electron therapy, fast neutron therapy, neutron-capture therapy, proton therapy, and the like); brachytherapy; unsealed source radiotherapy; tomotherapy; and the like.
  • The term “chemotherapy” refers to a therapy using a chemotherapeutic agent, i.e., any molecule that is effective in inhibiting tumor growth.
  • chemotherapeutic agents include those described under subgroup L01 of the Anatomical Therapeutic Chemical Classification System.
  • chemotherapeutic agents include, but are not limited to: alkylating agents, such as, e.g. :
  • nitrogen mustards including chlormethine, cyclophosphamide, ifosfamide, trofosfamide, chlorambucil, melphalan, prednimustine, bendamustine, uramustine, chlornaphazine, cholophosphamide, estramustine, mechlorethamine, mechlorethamine oxide hydrochloride, novembichin, phenesterine, uracil mustard and the like;
  • nitrosoureas including carmustine, lomustine, semustine, fotemustine, nimustine, ranimustine, streptozocin, chlorozotocin, and the like;
  • alkyl sulfonates including busulfan, mannosulfan, treosulfan, and the like;
  • aziridines including carboquone, thiotepa, triaziquone, triethylenemelamine, benzodopa, meturedopa, uredopa, and the like; hydrazines, including procarbazine, and the like;
  • triazenes including dacarbazine, temozolomide, and the like; ethylenimines and methylamelamines, including altretamine, triethylenemelamine, triethylenephosphoramide, triethylenethiophosphoramide, trimethylolomelamine and the like;
  • mitobronitol, pipobroman, actinomycin, bleomycin, mitomycins (including mitomycin C, and the like), plicamycin, and the like; acetogenins, such as, e.g., bullatacin, bullatacinone, and the like; benzodiazepines, such as, e.g., 2-oxoquazepam, 3-hydroxyphenazepam, bromazepam, camazepam, carburazepam, chlordiazepoxide, cinazepam, cinolazepam, clonazepam, cloniprazepam, clorazepate, cyprazepam, delorazepam, demoxepam, desmethylflunitrazepam, devazepide, diazepam, diclazepam, difludiazepam, doxefazepam,
  • antifolates including aminopterin, methotrexate, pemetrexed, pralatrexate, pteropterin, raltitrexed, denopterin, trimetrexate, and the like;
  • purine analogues including pentostatin, cladribine, clofarabine, fludarabine, nelarabine, tioguanine, mercaptopurine, and the like;
  • pyrimidine analogues including fluorouracil, capecitabine, doxifluridine, tegafur, tegafur/gimeracil/oteracil, carmofur, floxuridine, cytarabine, gemcitabine, azacytidine, decitabine, and the like; and
  • hydroxycarbamide
  • anti-adrenals such as, e.g., aminoglutethimide, mitotane, trilostane, and the like
  • folic acid replenishers such as, e.g., folinic acid, and the like
  • maytansinoids such as, e.g., maytansine, ansamitocins, and the like
  • platinum analogs such as, e.g., platinum, carboplatin, cisplatin, dicycloplatin, nedaplatin, oxaliplatin, satraplatin, and the like
  • trichothecenes such as, e.g., T-2 toxin, verracurin A, roridin A, anguidine and the like
  • taxoids such as, e.g., cabazitaxel, docetaxel, larotaxel, ortataxel, paclitaxel, tesetaxel, and the like
  • eleutherobin; pancratistatin; sarcodictyin; spongistatin; aclacinomysins; authramycin; azaserine; bleomycin; cactinomycin; carabicin; carminomycin; carzinophilin; chromomycins; dactinomycin; daunorubicin; detorubicin; 6-diazo-5-oxo-L-norleucine; doxorubicin (including morpholino-doxorubicin, cyanomorpholino-doxorubicin, 2-pyrrolino-doxorubicin, deoxydoxorubicin, and the like); epirubicin; esorubicin; idarubicin; marcellomycin; mycophenolic acid; nogalamycin; olivomycins; peplomycin; potfiromycin; puromycin; quelamycin; rodorubicin
  • The term “activation immunotherapy” refers to the artificial stimulation of the immune system to treat cancer, using activation immunotherapeutic agents (or immunostimulatory agents), such as, e.g., monoclonal antibodies, oncolytic viruses, CAR T-cells, dendritic cells, cancer vaccines, cytokines (including interferons and interleukins), and the like.
  • immune checkpoint inhibitors such as, e.g., inhibitors of CTLA4, PD-1, PD-L1, LAG-3, B7-H3, B7-H4, TIM3, A2AR, and/or IDO, including nivolumab, pembrolizumab, pidilizumab, AMP-224, MPDL32
  • CD1 ligands; growth hormone; immunocyanin; pegademase; prolactin; tasonermin; female sex steroids; histamine dihydrochloride; poly ICLC; vitamin D; lentinan; plerixafor; roquinimex; mifamurtide; glatiramer acetate; thymopentin; thymosin α1; thymulin; polyinosinic:polycytidylic acid; pidotimod; Bacillus Calmette-Guérin; melanoma vaccine; sipuleucel-T; and the like
  • The term “targeted therapy” refers to a therapy using a targeted therapy agent, i.e., any molecule which aims at one or more particular target molecules (such as, e.g., proteins) involved in tumor genesis, tumor progression, tumor metastasis, tumor cell proliferation, cell repair, and the like.
  • Targeted therapy agents include, but are not limited to, tyrosine-kinase inhibitors, serine/threonine kinase inhibitors, monoclonal antibodies and the like.
  • targeted therapy agents include, but are not limited to, HER1/EGFR inhibitors (such as, e.g, brigatinib, erlotinib, gefitinib, olmutinib, osimertinib, rociletinib, vandetanib, and the like); HER2/neu inhibitors (such as, e.g, afatinib, lapatinib, neratinib, and the like); C-kit and PDGFR inhibitors (such as, e.g, axitinib, masitinib, pazopanib, sunitinib, sorafenib, toceranib, and the like); FLT3 inhibitors (such as, e.g,
  • anti-CD33 monoclonal antibodies such as, e.g, gemtuzumab, and the like
  • anti-CD52 monoclonal antibodies such as, e.g, alemtuzumab, and the like.
  • The term “hormone therapy” refers to the artificial manipulation of the endocrine system through exogenous or external administration of specific hormones, in particular steroid hormones, or drugs which inhibit the production or activity of such hormones (i.e., inhibitors of hormone synthesis and hormone receptor antagonists).
  • hormones include, but are not limited to, androgens (such as, e.g, androstenediol dipropionate, boldenone undecylenate, clostebol, clostebol acetate, clostebol caproate, clostebol propionate, cloxotestosterone acetate, prasterone, prasterone enanthate, prasterone sulfate, quinbolone, testosterone, testosterone cypionate, testosterone enanthate, testosterone propionate, testosterone undecanoate, testosterone ester mixtures, deposterona, omnadren, sustanon, testoviron depot, androstanolone, androstanolone esters, bolazine capronate, drostanolone propionate, epitiostanol, mepitiostane, mesterolone, metenolone acetate, metenolone enanthat
  • Inhibitors of hormone synthesis and hormone receptor antagonists include, but are not limited to, anti-estrogens (such as, e.g., tamoxifen, raloxifene, aromatase inhibiting 4(5)-imidazoles, 4-hydroxytamoxifen, trioxifene, keoxifene, LY117018, onapristone, toremifene, and the like), and anti-androgens (such as, e.g., flutamide, nilutamide, bicalutamide, leuprolide, goserelin, and the like).
  • The term “stem cell transplant” refers to a transplantation of stem cells (either autologous or allogenic) aiming at replacing or reinforcing pre-existing bone marrow cells that may have been partially or totally destroyed by cancer or by therapy.
  • The present invention also relates to a computer system for estimating the probability of a subject to be affected with cancer.
  • It also relates to a computer system for diagnosing cancer in a subject in need thereof.
  • It also relates to a computer system for determining the origin of a tumor in a subject in need thereof.
  • It also relates to a computer system for determining a personalized course of treatment in a subject affected with cancer.
  • The term “computer system” refers to any and all devices capable of storing and processing information and/or capable of using the stored information to control the behavior or execution of the device itself, regardless of whether such devices are electronic, mechanical, logical, or virtual in nature.
  • The term “computer system” can refer to a single computer, but also to a plurality of computers working together to perform the function described as being performed on or by a computer system.
  • the computer system according to the present invention comprises:
  • The term “processor” is meant to include any integrated circuit or other electronic device capable of performing an operation on at least one instruction word, such as, e.g., executing instructions, codes, computer programs, and scripts which it accesses from a storage medium.
  • Processors include, but are not limited to, central processing units (CPUs), microprocessors, digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), and other equivalent integrated or discrete logic circuitry.
  • The code stored on the storage medium, when executed by the processor, causes the computer system to:
  • a. optionally, receive at least one raw sequencing signal from a sequencing experiment of nucleic acids, preferably cfNAs previously extracted from a sample from the subject, as described hereinabove;
  • b. optionally, base-call and demultiplex said at least one raw sequencing signal, thereby obtaining at least one sequence read or a plurality of sequence reads;
  • c. assign said at least one sequence read or the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining at least one mapped sequence read or a plurality of mapped sequence reads, as described hereinabove;
  • d.
  • the learning algorithm was previously trained with at least one training dataset, as described hereinabove.
  • the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from at least one reference subject, as described hereinabove.
  • Figure 1 is a flowchart illustrating the bioinformatic steps of the methods carried out after the sequencing step.
  • Figures 2A-B represent the “Silico mix1” description.
  • Figure 2A shows the ratio of THP1 DNA mixed into HeLa DNA for all samples, including controls (samples without THP1 DNA mixed into HeLa DNA);
  • Figure 2B shows the distribution of simulated depth for controls (annotated as 0) and THP1-positive samples (annotated as 1).
  • Figures 3A-B represent the “Silico mix2” description.
  • Figure 3A shows the ratio of THP1 or HeLa DNA mixed into normal plasma DNA from a healthy donor for all samples, including controls (samples without HeLa or THP1 DNA);
  • Figure 3B shows the simulated depth for controls (annotated as 0) and THP1- or HeLa-positive samples (annotated as 1).
  • Figures 4A-D represent the methylation biomarkers analysis.
  • Figure 4A shows the distribution of the methylation score according to the ratio of abnormal DNA in the samples from “Silico mix1”.
  • Figure 4B shows the ROC analysis of methylation scores, illustrating the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix1”.
  • Figure 4C shows the distribution of the methylation score according to the ratio of abnormal DNA in the samples from “Silico mix2”.
  • Figure 4D shows the ROC analysis of methylation scores, illustrating the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix2”.
  • Figures 5A-D represent the variants biomarkers analysis.
  • Figure 5A shows the distribution of the variant score according to the ratio of abnormal DNA in the samples from “Silico mix1”.
  • Figure 5B shows the ROC analysis of variant scores, illustrating the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix1”.
  • Figure 5C shows the distribution of the variant score according to the ratio of abnormal DNA in the samples from “Silico mix2”.
  • Figure 5D shows the ROC analysis of variant scores, illustrating the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix2”.
  • Figures 6A-D represent the nucleosome footprint biomarkers analysis.
  • Figure 6A shows the distribution of the nucleosome score according to the ratio of abnormal DNA in the samples from “Silico mix1”.
  • Figure 6B shows the ROC analysis of nucleosome scores, illustrating the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix1”.
  • Figure 6C shows the distribution of the nucleosome score according to the ratio of abnormal DNA in the samples from “Silico mix2”.
  • Figure 6D shows the ROC analysis of nucleosome scores, illustrating the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix2”.
  • Figure 7 represents the distribution of the transposon score according to the ratio of abnormal DNA in the samples from “Silico mix2”.
  • Figure 8 represents the distribution of the telomere length score according to the ratio of abnormal DNA in the samples from “Silico mix2”.
  • Figure 9 represents the performance of each biomarker in discriminating samples. Performance for all samples (left panel) and for samples with an “abnormal” DNA ratio below 5 % (right panel) is shown.
  • Figures 10A-B represent the detection performance for THP1 “abnormal” DNA in “Silico mix1” after classification of samples based on different combinations of 1 to 3 different biomarkers (M: methylation; V: variant; N: nucleosome footprint).
  • Figure 11 represents the performance of “abnormal” DNA detection for THP1 and HeLa samples in the two silico mixes for different biomarker combinations (M: methylation; V: variant; N: nucleosome footprint).
  • Figures 12A-C represent the detection performance for THP1 or HeLa “abnormal” DNA in “Silico mix2” after classification of samples based on different combinations of 1 to 6 different biomarkers (M: methylation; V: variant; N: nucleosome footprint; T: transposon; Mi: mitochondria; Tel: telomere length).
  • Figure 12A shows the balanced accuracy.
  • Figure 12B shows the precision rate.
  • Figure 12C shows the recall rate.
  • Figures 13A-B represent the quantification of “abnormal” DNA in “Silico mix1”.
  • Figure 13A shows the correlation factor (Pearson) for different biomarker combinations (M: methylation; V: variant; N: nucleosome footprint).
  • Figure 13B is the correlation plot for the best biomarker combination.
  • Figures 14A-B represent the quantification of “abnormal” DNA in “Silico mix1”.
  • Figure 14A shows the correlation factor (Pearson) for different biomarker combinations for both HeLa DNA (left panel) and THP1 DNA (right panel).
  • Figure 14B shows the correlation plot for the best biomarker combinations for both HeLa DNA and THP1 DNA.
  • Figure 15 represents the ROC curve of sample classification accuracy, based on the type of abnormal DNA.
  • Figure 16 represents the throughput obtained from each sequencing test of cfDNA. Test1 corresponds to cfDNA obtained from in vitro culture of cell lines. Test2 was performed using the same protocol as the in vitro test (default Nanopore protocol). Test3 was done by adapting the bead-to-DNA ratio to optimize the capture of small reads.
  • Figure 17 represents the percentage of reads with a length below 1,000 bp from each sequencing test of cfDNA.
  • Test1 corresponds to cfDNA obtained from in vitro culture of cell lines.
  • Test2 was performed using the same protocol as the in vitro test (default Nanopore protocol).
  • Test3 was done by adapting the bead-to-DNA ratio to optimize the capture of small reads.
  • Figures 18A-C represent the description of sequencing data obtained from DNA extracted with two different commercial kits.
  • Figure 18A shows the sequencing throughput in read count.
  • Figure 18B shows the percentage of small reads (size under 1,000 bp).
  • Figure 18C shows the quality of reads estimated by the mean BASEQ of sequenced nucleotides.
  • Figures 19A-D represent the read size distribution for all samples sequenced in Test3, done by adapting the bead-to-DNA ratio to optimize the capture of small reads.
  • Figure 19A: sample 1.
  • Figure 19B: sample 2.
  • Figure 19C: sample 3.
  • Figure 19D: sample 4.
  • Figure 20 represents the methylated fraction of CpGs for all samples sequenced in Test3, done by adapting the bead-to-DNA ratio to optimize the capture of small reads.
  • Figure 21 represents the correlation between methylation frequencies in all samples sequenced in Test3, done by adapting the bead-to-DNA ratio to optimize the capture of small reads.
  • Methylation frequencies indicate the methylation status of the position: a value above 0.5 indicates that the site is more frequently methylated.
  • First row and line: sample 1; second row and line: sample 2; third row and line: sample 3; fourth row and line: sample 4.
  • Figures 22A-C represent the nucleosome analysis for all samples sequenced in Test3, done by adapting the bead-to-DNA ratio to optimize the capture of small reads.
  • Figure 22A shows the number of regions that have a large coverage and could indicate nucleosomes.
  • Figure 22B shows the proportion of nucleosomes found in all other samples.
  • Figure 22C shows the proportion of nucleosome found in common between samples 2 and 3 that have the same count of total identified nucleosomes.
  • Figure 23 represents the expression of nucleosomes in samples sequenced in Test3, done by adapting the bead-to-DNA ratio to optimize the capture of small reads. Expression is described by the mean read depth at nucleosome positions. First row and line: sample 1; second row and line: sample 2; third row and line: sample 3; fourth row and line: sample 4.
  • EXAMPLES
  • Example 1 In vitro cultures
  • The goal of this study was to validate in vitro the ability of a Nanopore® device to sequence cfDNA extracted from cell cultures.
  • The reads were then mixed in silico to create artificial samples with increasing ratios of abnormal DNA (THP1) mixed into background DNA (HeLa).
  • Plasma samples stored at -80°C were thawed at room temperature. Cell-free DNA was extracted from 200 µL of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen®).
  • Shotgun sequencing libraries were prepared from 50 ng of DNA using library kit SGK-LSK009 (Nanopore®). The protocol was adapted to select short reads. Several samples were multiplexed using native barcoding kit EXP-NBD104 (Nanopore®) and sequenced on a MinION or GridION device (Nanopore®). Sequencing was performed for 48 hours, until all pores were inactivated. The raw signal was base-called and demultiplexed with high accuracy using the latest version of Guppy.
  • In silico samples generation
  • Reads generated by sequencing the two cell lines were mixed in silico to combine DNA from different origins.
  • A first group of samples, hereafter named “Silico mix1”, was generated by mixing THP1 reads into a HeLa read background at different ratios.
  • A second group of samples, hereafter named “Silico mix2”, was generated by mixing either THP1 reads or HeLa reads into “normal” cfDNA reads obtained from the sequencing of cfDNA extracted from a healthy donor.
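  • The in silico mixing of reads at a chosen ratio can be sketched as follows. This is a minimal illustration, assuming that reads are drawn at random without replacement from each source; the function name, the read identifiers and the fixed random seed are hypothetical and not taken from the study.

```python
import random


def mix_reads(abnormal_reads, background_reads, abnormal_ratio, total, seed=0):
    """Build one simulated sample of `total` reads.

    A fraction `abnormal_ratio` of the reads is drawn from `abnormal_reads`
    and the remainder from `background_reads`, both without replacement.
    """
    rng = random.Random(seed)
    n_abnormal = round(total * abnormal_ratio)
    n_background = total - n_abnormal
    mixed = rng.sample(abnormal_reads, n_abnormal)
    mixed += rng.sample(background_reads, n_background)
    rng.shuffle(mixed)
    return mixed


# Hypothetical read identifiers standing in for sequenced reads.
thp1_reads = [f"thp1_{i}" for i in range(100)]
hela_reads = [f"hela_{i}" for i in range(1000)]

sample = mix_reads(thp1_reads, hela_reads, abnormal_ratio=0.05, total=200)
```

  • Varying `total` would correspond to simulating different sequencing depths, and setting `abnormal_ratio` to 0 would produce a control sample.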
  • The bioinformatic steps are outlined in Figure 1. Reads were filtered based on their quality, estimated by the BASEQ score: reads with a global quality below 10 were filtered out with NanoFilt. High-quality reads were then aligned to the human genome (version hg38 - RefSeq assembly accession: GCF_000001405.26) with minimap2. From these alignment results, several biomarkers were analyzed:
  • methylation of reads aligned to the human genome was evaluated by integrating the raw sequencing signal with NanoPolish. Methylated cytosines were identified in CpG islands by NanoPolish, and methylation confidence was evaluated by a log ratio score;
  • the presence of nucleosomes was identified from the alignment files: over-covered regions, which corresponded to regions where a nucleosome was present (nucleosome footprint), were identified by analyzing the coverage at each genome position;
  • variants were analyzed by a genotyping approach: the base composition at each position was analyzed using samtools and target variants were searched directly from these results, without using variant caller algorithms;
  • mitochondria quantity was evaluated by computing the alignment depth on the mitochondrial chromosome;
  • telomere length was evaluated by searching for the telomere pattern in the reads using the TelSeq tool;
  • transposons were searched from the alignment file: long insertions and deletions were identified using sniffles, and transposon-like abnormalities were identified using the TLDR tool.
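  • The read quality filter that opens this pipeline can be illustrated with a short sketch. It is not NanoFilt's implementation: it assumes Phred+33 encoded quality strings and assumes that the global quality of a read is obtained by averaging per-base error probabilities before converting back to a Phred score. The function names are hypothetical.

```python
import math


def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over error probabilities."""
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def keep_read(quality_string, min_quality=10):
    """Keep reads whose mean quality is at least `min_quality`."""
    return mean_read_quality(quality_string) >= min_quality


# "5" encodes Q20 and "&" encodes Q5 in Phred+33.
high_quality_kept = keep_read("5555")
low_quality_kept = keep_read("&&&&")
```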
  • Biomarkers were identified by comparing all biomarkers from the different DNA sequencing results. Transposons, methylated CpG islands, genetic variants and nucleosome footprints were analyzed to find biomarkers specific to a cell type. Two sets of biomarkers were identified, corresponding to the two simulated groups of samples (“Silico mix1” and “Silico mix2”). A first comparison identified THP1-specific biomarkers by comparing THP1 and HeLa results: biomarkers present in the THP1 sample but absent from the HeLa samples were selected. A second set of biomarkers was identified by comparing both THP1 and HeLa results with normal human DNA. The count of biomarkers in each group is summarized in Table 3.
  • Table 3: Count of specific biomarkers used for each group of in silico simulated samples.
  • The methylation score was computed from the log-likelihood ratio of each read that displayed a CpG of interest. The mean of the positive log-likelihood ratios only was computed and weighted based on the depth at the position. The sum of all weighted means was then computed, and the score was finally normalized using the global depth of the sample.
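  • One possible reading of this methylation score is sketched below. The text does not give the exact weighting, so two assumptions are made and should be treated as such: each site's mean positive ratio is multiplied by the fraction of reads at that site with a positive ratio, and the final sum is divided by the global depth. The function name and the input layout are hypothetical.

```python
def methylation_score(sites, global_depth):
    """Depth-normalized sum of per-site mean positive log-likelihood ratios.

    `sites` maps a CpG position to the list of per-read log-likelihood
    ratios observed at that position.
    """
    if global_depth <= 0:
        raise ValueError("global_depth must be positive")
    total = 0.0
    for ratios in sites.values():
        positive = [r for r in ratios if r > 0]
        if not positive:
            continue
        mean_positive = sum(positive) / len(positive)
        # Assumed weighting: share of reads at this site with a positive ratio.
        total += mean_positive * (len(positive) / len(ratios))
    return total / global_depth


# Hypothetical per-read ratios at two CpG sites of interest.
example_sites = {
    "chr1:100": [2.0, 4.0, -1.0, -3.0],
    "chr1:200": [-2.0, -5.0],
}
example_methylation = methylation_score(example_sites, global_depth=2.0)
```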
  • The variant score was computed from the variant-specific detection on the alignment files. The variants found were counted, and the count was normalized by the global depth of the sample.
  • The nucleosome footprint was determined by the presence of high-coverage clusters on the reference genome. Coverage depth was evaluated for each cluster specific to a cell line and then weighted by the global depth of the sample. The normalized depths of all clusters were finally summed.
  • The transposon score was determined with the same approach as the variant score: the transposons identified in the alignment file were counted, and the count was normalized by the sequencing depth of the sample.
  • The mitochondria score is a quantification of reads aligned to chrM, normalized by the sequencing depth of the sample.
  • the telomere score derives from the estimation of the length of the telomere by the research of “TTAGGG” motifs (SEQ ID NO: 1) into aligned reads.
  • Two sets of data were generated in silico.
  • THP1 DNA was mixed into HeLa DNA at various ratios ranging from 0.66 % to 50 % (Fig. 2A).
  • Several sequencing depths were simulated to assess the detection threshold of the methods (Fig. 2B).
  • A second group of samples was generated by mixing either THP1 or HeLa DNA into normal human DNA obtained from a healthy donor (“Silico mix2”). Ratios ranged from 1 % to 20 % for each cell line (Fig. 3A). Various depths were also simulated (Fig. 3B).
  • The methylation score was computed for each sample of each silico mix group. The distribution of the score ranged from 0 to 6 and was correlated with the ratio of abnormal DNA in all samples. The ROC analysis of the score showed that it was a good tool for discriminating samples with or without abnormal DNA (Fig. 4). For “Silico mix1” (Fig. 4A-B), a threshold at 0.54 enabled high-accuracy discrimination of samples, with a false positive rate (FPR) of 0.00 and a true positive rate (TPR) of 0.83. For “Silico mix2” (Fig. 4C-D), the accuracy was lower because HeLa samples were more difficult to discriminate from negative samples. However, a threshold at 1.83 enabled high-accuracy discrimination of samples, with a FPR of 0.33 and a TPR of 0.68.
  • Variant analysis in silico samples. The variant score was computed for each sample of each silico mix group. The distribution of the score ranged from 0 to 1.6 and was correlated with the ratio of abnormal DNA in all samples. The ROC analysis of the score showed that it was a good tool for discriminating samples with or without abnormal DNA (Fig. 5). For “Silico mix1” (Fig. 5A-B), a threshold at 0.22 enabled high-accuracy discrimination of samples, with a FPR of 0.2 and a TPR of 0.6. For “Silico mix2” (Fig. 5C-D), the accuracy was lower because HeLa samples were more difficult to discriminate from negative samples.
  • A threshold at 0.22 enabled high-accuracy discrimination of samples, with a FPR of 0.5 and a TPR of 0.82.
  • The variant score showed lower performance than the methylation score (Fig. 4A and 4B).
  • One limitation was the recall rate, which was lower with this biomarker than with the methylation score.
  • The recall decreased from 0.91 to 0.38 in “Silico mix1” and from 0.73 to 0.62 in “Silico mix2”. This resulted in low sensitivity for detecting samples with a low proportion of abnormal DNA.
  • The nucleosome score was computed for each sample of each silico mix group. The distribution of the score ranged from 0 to 80 and was correlated with the ratio of abnormal DNA in all samples. The ROC analysis of the score showed that it was a good tool for discriminating samples with or without abnormal DNA (Fig. 6). For “Silico mix1” (Fig. 6A-B), a threshold at 7.11 enabled high-accuracy discrimination of samples, with a FPR of 0.13 and a TPR of 0.83. For “Silico mix2” (Fig. 6C-D), the accuracy was lower because HeLa samples were more difficult to discriminate from negative samples. However, a threshold at 60.5 enabled high-accuracy discrimination of samples, with a FPR of 0.25 and a TPR of 0.85.
  • Transposon analysis in silico samples. Transposons were searched for in all samples of the “Silico mix2” group. There were not enough specific transposon biomarkers for each cell line to perform correlation or ROC analysis, but it was observed that the presence of a transposon was highly specific to the presence of “abnormal” DNA (Fig. 7).
  • Telomere length analysis in silico samples. The telomere length was computed for each sample of the “Silico mix2” group, and the results were compared to negative controls. HeLa samples had shorter telomeres and THP1 samples had longer telomeres compared to negative controls (Fig. 8).
  • DNA extraction. cfDNA was extracted using two different commercial kits: QIAamp MinElute (Qiagen®) and alle MiniMax™ (Beckman®). Both methods are based on the capture of small DNA fragments using magnetic beads.
  • Library preparation was performed using the ligation protocol developed by Nanopore®. Briefly, this method required ligating barcodes to both ends of the DNA fragments after an initial step that prepared the ends of the DNA. After barcode ligation, sequencing adaptors were attached to enable the sequencing of the fragments. Between these steps, the DNA was washed to remove the reagents used at each step; the washing was performed by capturing the DNA with magnetic beads. The bead-to-DNA ratio had an impact on the size of the DNA fragments retained after washing, and we adjusted this step during our tests to preferentially capture the cfDNA (data not shown).
  • Throughput obtained from different runs. The throughput of each run was estimated by counting the total reads sequenced per sample. For each run, we used a quantity of DNA ranging from 10 ng to 30 ng. The read count was not correlated with the quantity of DNA used for the library. We observed an increase in the number of reads obtained after sequencing in our third test (Fig. 16).
  • test1, i.e., cfDNA obtained from in vitro culture of cell lines
  • test2, for which no protocol modification was made
  • Methylation analysis in plasma samples. We analyzed the methylation pattern in the 4 previously described samples. For each sample, we observed a majority of methylated CpG, as expected for blood cells (Fig. 20).
  • Nucleosome analysis in plasma samples. The nucleosome pattern was next analyzed.
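
The depth-normalized count scores (variant, transposon, mitochondria) and the threshold-based FPR/TPR figures reported in the in silico analyses above can be illustrated with a minimal Python sketch. The function names `depth_normalized_score` and `rates_at_threshold`, the toy values, and the convention that a sample scoring at or above the threshold is called positive are assumptions made for illustration; none of them is taken from the source.

```python
from typing import Sequence, Tuple


def depth_normalized_score(count: float, global_depth: float) -> float:
    """Normalize a raw biomarker count by the global sequencing depth.

    Hypothetical helper mirroring the variant, transposon and mitochondria
    scores, which are described as counts normalized by sample depth.
    """
    if global_depth <= 0:
        raise ValueError("global_depth must be positive")
    return count / global_depth


def rates_at_threshold(
    scores: Sequence[float],
    has_abnormal_dna: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (FPR, TPR) when samples scoring >= threshold are called positive.

    The ">=" decision rule is an assumption; the source does not state it.
    """
    tp = fp = fn = tn = 0
    for score, positive in zip(scores, has_abnormal_dna):
        called = score >= threshold
        if positive and called:
            tp += 1
        elif positive:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


# Toy data invented for illustration; these are not results from the source.
toy_scores = [0.1, 0.3, 0.6, 0.9, 0.2, 0.7]
toy_labels = [False, False, True, True, True, True]
fpr, tpr = rates_at_threshold(toy_scores, toy_labels, threshold=0.54)
print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Sweeping `threshold` over the observed score values and collecting the resulting (FPR, TPR) pairs would trace out a ROC curve of the kind referred to in Fig. 4 to 6.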

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to methods and apparatuses for estimating the probability of a subject to be affected with cancer, diagnosing cancer, determining the origin of a tumor in a subject and determining a personalized course of treatment in a subject affected or likely to be affected with cancer; based on the sequencing of cell-free nucleic acids and identification therein of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers.

Description

METHODS AND APPARATUSES FOR DIAGNOSING CANCER FROM
CELL-FREE NUCLEIC ACIDS
FIELD OF INVENTION
The present invention relates to methods and apparatuses for estimating the probability of a subject to be affected with cancer, diagnosing cancer, determining the origin of a tumor in a subject and determining a personalized course of treatment in a subject affected or likely to be affected with cancer; based on the sequencing of cell-free nucleic acids and identification therein of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers.
BACKGROUND OF INVENTION
Finding non-invasive methods to assess cancers has long been a goal of oncology diagnostics: tumor progression without the appearance of symptoms is one of the reasons why numerous cancers have a poor prognosis, i.e., early detection of cancers improves patients’ prognosis and survival. Non-invasive detection methods using imaging approaches, like mammography for the detection of breast cancers, or protein assays, like the Prostate-Specific Antigen (PSA) assay for prostate cancer detection, are already used routinely. However, these current methods are tumor site-specific, and are described as having poor sensitivity. For example, the Carcinoembryonic Antigen (CEA) assay, used for the detection of colorectal cancer, is reported to have a sensitivity of 41-52% and a specificity of 85-95%.
Tumor-derived material in the circulation became a hot topic during the last decade. Cell-free circulating DNA (cfDNA) extracted from plasma helps in diagnosing patients at an initial cancer stage. Indeed, many tumors, even at an early stage, release cfDNA with the same genetic background as the primary tumor. Recently, a combination of markers has been used to detect and localize 8 major cancers in a cohort of more than 1817 samples with high accuracy. Somatic point mutations were identified in cfDNA in combination with protein assays in plasma to determine the presence of a cancer with a specificity greater than 99%, and a sensitivity ranging between 30% and 99% according to cancer type. Other studies have tried to reach the same goal using only genomic information from cfDNA sequencing, but achieved lower accuracy. Detection of colorectal cancer based on the methylation of 7 genes’ promoters reported a specificity around 73%. Comparison of two approaches in pancreatic cancer detection showed that the methylation of two genes has a higher sensitivity than the detection of essential KRAS mutations, with 48% of patients efficiently identified. In other publications, authors have proposed using the nucleosome footprint to infer chromatin compaction in order to detect a potential cancer. Nonetheless, the accuracy of such approaches has not been validated at large scale. In most cases, the methods described are limited to a single sort of biomarker.
Standard next generation sequencing technologies, such as, e.g., Illumina®, involve clonal amplification of DNA and require a specific experimental protocol for each biomarker. For example, to study DNA methylation status, bisulfite treatment is required in addition to sequencing, while chromatin accessibility evaluation relies on a PCR-free or single-stranded library. Third-generation sequencing technologies, such as, e.g., Nanopore® technologies, are characterized by the sequencing of native DNA that passes through the nanopore and changes the ion current. This long-read sequencing technology can be combined with a shotgun PCR-free library to allow the detection of genomic alterations ranging from point mutations to larger abnormalities like copy number variation (CNV) or rearrangement, the presence of viral specific sequences, the detection of methylated CpG, or nucleosome position and chromatin remodeling.
The invention described hereafter overcomes the limitations of currently known non-invasive methods, by offering a fast and efficient diagnosis of cancer from cell-free nucleic acids.
SUMMARY
The present invention relates to a method for estimating the probability of a subject to be affected with cancer, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) estimating the probability of the subject to be affected with cancer based on the classification of the subject at step (f).
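Steps (e) to (g) can be pictured with a minimal Python sketch. The source does not specify how the per-biomarker scores are combined, so the logistic combination below, the weights, the bias, the 0.5 cutoff, the label strings and the names `cancer_probability` and `classify` are all hypothetical and chosen only for illustration. The sketch also computes the probability before deriving the class label, which is one possible ordering rather than the ordering recited in the steps.

```python
import math
from typing import Dict, Tuple

# Hypothetical weights and bias, invented for illustration only.
WEIGHTS: Dict[str, float] = {
    "methylation": 1.2,
    "variant": 0.8,
    "nucleosome": 0.05,
}
BIAS = -2.0


def cancer_probability(scores: Dict[str, float]) -> float:
    """Combine per-biomarker scores into a probability with a logistic model.

    Biomarkers missing from `scores` contribute 0 to the linear term.
    """
    z = BIAS + sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(scores: Dict[str, float], cutoff: float = 0.5) -> Tuple[str, float]:
    """Return a class label and the probability it was derived from."""
    p = cancer_probability(scores)
    label = "cancer-like" if p >= cutoff else "non-cancer-like"
    return label, p


# Toy scores, not measurements from the source.
label, probability = classify({"methylation": 3.0, "variant": 0.0})
print(label, round(probability, 3))
```

In practice the weights would be learned from reference subjects rather than fixed by hand; any trained classifier exposing a probability output could replace the logistic function here.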
The present invention also relates to a method for diagnosing cancer in a subject in need thereof, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) diagnosing cancer in the subject based on the classification of the subject at step (f).
The present invention also relates to a method for determining the origin of a tumor in a subject in need thereof, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) determining the origin of the tumor in the subject based on the classification of the subject at step (f).
The present invention also relates to a method for determining a personalized course of treatment in a subject affected or likely to be affected with cancer, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) determining a personalized course of treatment for the subject based on the classification of the subject at step (f).
In one embodiment, the sample is a bodily fluid. In one embodiment, the sample is selected from the group comprising blood, lymph, ascitic fluid, cystic fluid, urine, gastric juices, pancreatic juices, bile, nipple exudate, synovial fluid, bronchoalveolar lavage fluid, mucus, sputum, amniotic fluid, peritoneal fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, semen, milk, saliva, sweat, tears, feces, stools, and alveolar macrophages. In one embodiment, the sample is selected from the group comprising whole blood, plasma and serum.
In one embodiment, the nucleic acids are cell-free nucleic acids (cfNAs). In one embodiment, the nucleic acids are cell-free circulating DNA (cfDNA). In one embodiment, the extracted nucleic acids are sequenced by single-molecule nucleic acid sequencing. In one embodiment, the extracted nucleic acids are sequenced by a sequencing method selected from the group comprising nanopore sequencing, single molecule real-time sequencing (SMRT), annular dark-field scanning transmission electron microscopy sequencing, Heliscope sequencing, and nano-knife-edge probe sequencing. In one embodiment, the extracted nucleic acids are sequenced by nanopore sequencing.
In one embodiment, assigning the plurality of sequence reads at step (c) of the methods of the invention comprises:
c1) aligning the plurality of sequence reads on the human genome, thereby obtaining human-mapped sequence reads;
c2) discarding sequence reads that did not match with the human genome at step c1);
c3) optionally, aligning sequence reads discarded at step c2) on at least one further reference genome or a portion thereof; preferably aligning sequence reads discarded at step c2) on at least one pathogen genome; preferably on a pathogen database; more preferably aligning sequence reads discarded at step c2) on at least one bacterial and/or viral genome; preferably on a bacterial and/or viral genome database; thereby obtaining exogenous-mapped sequence reads;
c4) discarding sequence reads that did not match with the at least one further reference genome or a portion thereof at step c3).
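The two-stage read triage of steps c1) to c4) can be sketched in Python as follows. Real alignment would be done with a read aligner; here the two predicates `maps_to_human` and `maps_to_exogenous` are hypothetical stand-ins supplied by the caller, and the toy references and reads use plain substring matching purely to make the example runnable. The function name `triage_reads` is likewise an assumption.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def triage_reads(
    reads: Iterable[str],
    maps_to_human: Callable[[str], bool],
    maps_to_exogenous: Optional[Callable[[str], bool]] = None,
) -> Tuple[List[str], List[str]]:
    """Split reads into human-mapped and exogenous-mapped sets.

    Reads matching neither reference are dropped, as in steps c2) and c4).
    """
    human_mapped: List[str] = []
    unmatched: List[str] = []
    for read in reads:
        if maps_to_human(read):
            human_mapped.append(read)  # c1) human-mapped read
        else:
            unmatched.append(read)  # c2) set aside from the human set

    exogenous_mapped: List[str] = []
    if maps_to_exogenous is not None:  # c3) is optional
        exogenous_mapped = [r for r in unmatched if maps_to_exogenous(r)]
        # c4) reads matching no reference are not kept anywhere
    return human_mapped, exogenous_mapped


# Toy references and reads, invented for illustration only.
TOY_HUMAN = "ACGTTAGGGTTAGGGACCA"
TOY_VIRUS = "GGCCGGAATTCCGG"
toy_reads = ["TTAGGGTTAGGG", "CCGGAATT", "AAAAAAAA"]

human, exogenous = triage_reads(
    toy_reads,
    maps_to_human=lambda r: r in TOY_HUMAN,
    maps_to_exogenous=lambda r: r in TOY_VIRUS,
)
print(human, exogenous)
```

Passing the predicates in as callables keeps the triage logic independent of whichever aligner and reference databases are actually used.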
In one embodiment, genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer include genomic alterations, telomere length, retrotransposon sequence, DNA hypermethylation or hypomethylation, nucleosome footprint, nucleic acid fragment size, mitochondria quantity, cancer-inducing virus sequences and cancer-associated bacteria sequences. In one embodiment, genomic alterations include base pair mutations, differential trinucleotide frequencies, mutational signatures, copy number alterations, gene rearrangements, short tandem repeat polymorphism, and/or chromosomal abnormalities.
In one embodiment, computer-processing the plurality of mapped sequence reads at step d) of the methods of the invention comprises correlating the mapped sequence reads with information available in databases and/or with information obtained from at least one reference subject, preferably from a reference population. In one embodiment, the at least one reference subject is a substantially healthy subject; or the at least one reference subject is a cancer subject.
The present invention also relates to a method for treating a subject affected with cancer, comprising the steps of:
1) (a) estimating the probability of said subject to be affected with cancer with the method for estimating the probability of a subject to be affected with cancer according to the present invention, or
(b) diagnosing cancer in said subject with the method for diagnosing cancer in a subject in need thereof according to the present invention, or
(c) determining the origin of a tumor in said subject with the method for determining the origin of a tumor in a subject in need thereof according to the present invention; and
2) treating said subject depending on the estimation, diagnosis, or determination of step 1).
In one embodiment, treating said subject is carried out by any one of, or a combination of two or more of: surgery, radiation therapy, chemotherapy, activation immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
The present invention also relates to a computer system for:
estimating the probability of a subject to be affected with cancer; or
diagnosing cancer in a subject in need thereof; or
determining the origin of a tumor in a subject in need thereof; or
determining a personalized course of treatment in a subject affected with cancer;
comprising: a) a processor and b) a storage medium that stores code readable by the processor; wherein the code stored on the storage medium, when executed by the processor, causes the computer system to:
a. optionally, receive at least one raw sequencing signal from a sequencing experiment of nucleic acids, preferably of cell-free nucleic acids (cfNAs), more preferably of cell-free circulating DNA (cfDNA), previously extracted from a sample from the subject;
b. optionally, base-call and demultiplex said at least one raw sequencing signal, thereby obtaining at least one sequence read or a plurality of sequence reads;
c. assign said at least one sequence read or the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining at least one mapped sequence read or a plurality of mapped sequence reads;
d. identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and/or metagenomic biomarkers of cancer in said mapped sequence reads;
e. derive a probability score via at least one machine learning algorithm;
f. generate an output, wherein the output is the classification label or the probability score; and
g. estimate the probability of the subject to be affected with cancer based on the output; or diagnose cancer in the subject based on the output; or determine the origin of a tumor in the subject based on the output; or determine a personalized course of treatment for the subject based on the output.
DEFINITIONS
In the present invention, the following terms have the following meanings:
As used herein, the term “about”, when set in front of a numerical value, means that said numerical value is approximate and small variations would not significantly affect the practice of the disclosed embodiments. Such small variations are, e.g., of ± 1 %, ± 2 %, ± 3 %, ± 4 %, ± 5 %, ± 6 %, ± 7 %, ± 8 %, ± 9 %, ± 10 % or more.
As used herein, the term “subject” refers to a mammal, preferably a human. In one embodiment, a subject may be a “patient”, i.e., a warm-blooded animal, more preferably a human, who/which is awaiting the receipt of, or is receiving medical care or was/is/will be the object of a medical procedure, or is monitored for the development of a disease, such as cancer. The term “mammal” refers here to any mammal, including humans, domestic and farm animals, and zoo, sports, or pet animals, such as dogs, cats, cattle, horses, sheep, pigs, goats, rabbits, etc. Preferably, the mammal is a primate, more preferably a human.
DETAILED DESCRIPTION
The present invention relates to a method for estimating the probability of a subject to be affected with cancer.
It also relates to a method for diagnosing cancer in a subject in need thereof.
It also relates to a method for evaluating the origin of a tumor in a subject in need thereof.
It also relates to a method for determining a personalized course of treatment in a subject affected or likely to be affected with cancer.
It also relates to a method for treating a subject affected with cancer, comprising:
1) estimating the probability of said subject to be affected with cancer, or diagnosing cancer in said subject, or determining the origin of a tumor in said subject; and
2) treating said subject depending on the estimation, diagnosis, or determination of step 1).
It is to be understood that the methods according to the present invention are not limited to a specific type of cancer, and therefore apply to “cancer” in its broadest sense. Alternatively, the methods according to the present invention may also be adapted to a given type or subtype of cancer.
In one embodiment, the cancer is an early cancer. In one embodiment, the cancer is an advanced cancer. In one embodiment, the cancer is a metastatic cancer. In one embodiment, the cancer is a recurrent cancer. In one embodiment, the cancer is a stage 0, stage I, stage II, stage III, or stage IV cancer. The “stage” of a cancer describes the size of a tumor and how far it has spread from where it originated.
In one embodiment, the cancer is a stage 0 cancer. “Stage 0 cancer” describes cancer in situ. Stage 0 cancers are still located in the place they started and have not spread to nearby tissues. This stage of cancer is often highly curable, usually by removing the entire tumor with surgery.
In one embodiment, the cancer is a stage I cancer. “Stage I cancer” describes a small cancer or tumor that has not grown deeply into nearby tissues. It also has not spread to the lymph nodes or other parts of the body.
In one embodiment, the cancer is a stage II cancer. “Stage II cancer” indicates that the cancer has grown, but has not spread.
In one embodiment, the cancer is a stage III cancer. “Stage III cancer” indicates that the cancer is larger and may have spread to the surrounding tissues and/or the lymph nodes.
In one embodiment, the cancer is a stage IV cancer. “Stage IV cancer” describes a cancer that has spread to other organs or parts of the body.
In one embodiment, the cancer is a grade I, grade II, or grade III cancer. The “grade” of a cancer describes the appearance of the cancerous cells. In general, a lower grade indicates a slower-growing cancer and a higher grade indicates a faster-growing one.
In one embodiment, the cancer is a grade I cancer. “Grade I cancer” indicates that the cancer comprises cancer cells that resemble normal cells, which aren’t growing rapidly.
In one embodiment, the cancer is a grade II cancer. “Grade II cancer” indicates that the cancer comprises cancer cells that don’t look like normal cells, which are growing faster than normal cells.
In one embodiment, the cancer is a grade III cancer. “Grade III cancer” indicates that the cancer comprises cancer cells that look abnormal, which may grow or spread more aggressively.
Examples of cancers include those listed in the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), under chapter II, blocks C00 to D48. Further examples of cancers include, but are not limited to, adenofibroma, adenoma, agnogenic myeloid metaplasia, AIDS-related malignancies, ameloblastoma, anal cancer, angiofollicular mediastinal lymph node hyperplasia, angiokeratoma, angiolymphoid hyperplasia with eosinophilia, angiomatosis, anhidrotic ectodermal dysplasia, anterofacial dysplasia, apocrine metaplasia, apudoma, asphyxiating thoracic dysplasia, astrocytoma (including, e.g., cerebellar astrocytoma and cerebral astrocytoma), atriodigital dysplasia, atypical melanocytic hyperplasia, atypical metaplasia, autoparenchymatous metaplasia, basal cell hyperplasia, benign giant lymph node hyperplasia, bile duct cancer (including, e.g., extrahepatic bile duct cancer), bladder cancer, bone cancer, brain tumor (including, e.g., brain stem glioma, cerebellar astrocytoma glioma, malignant glioma, supratentorial primitive neuroectodermal tumors, visual pathway and hypothalamic glioma, ependymoma, medulloblastoma, gestational trophoblastic tumor glioma, and paraganglioma), branchioma, female breast cancer, male breast cancer, bronchial adenomas/carcinoids, bronchopulmonary dysplasia, cancer growths of epithelial cells, pre-cancerous growths of epithelial cells, metastatic growths of epithelial cells, carcinoid heart disease, carcinoid tumor (including, e.g., gastrointestinal carcinoid tumor), carcinoma (including, e.g., carcinoma of unknown primary origin, adrenocortical carcinoma, islet cells carcinoma, adenocarcinoma, adenocortical carcinoma, basal cell carcinoma, basosquamous carcinoma, bronchiolar carcinoma, Brown-Pearce carcinoma, cystadenocarcinoma, ductal carcinoma, hepatocarcinoma, Krebs carcinoma, papillary carcinoma, oat cell carcinoma, small cell lung carcinoma, non-small cell lung carcinoma, squamous cell carcinoma,
transitional cell carcinoma, Walker carcinoma, Merkel cell carcinoma, and skin carcinoma), cementoma, cementum hyperplasia, cerebral dysplasia, cervical cancer, cervical dysplasia, cholangioma, cholesteatoma, chondroblastoma, chondroectodermal dysplasia, chordoma, choristoma, chondroma, cleidocranial dysplasia, colon cancer, colorectal cancer, local metastasized colorectal cancer, congenital adrenal hyperplasia, congenital ectodermal dysplasia, congenital sebaceous hyperplasia, connective tissue metaplasia, craniocarpotarsal dysplasia, craniodiaphysial dysplasia, craniometaphysial dysplasia, craniopharyngioma, cylindroma, cystadenoma, cystic hyperplasia (including, e.g., cystic hyperplasia of the breast), cystosarcoma phyllodes, dentin dysplasia, denture hyperplasia, diaphysial dysplasia, ductal hyperplasia, dysgerminoma, dysplasia epiphysialis hemimelia, dysplasia epiphysialis multiplex, dysplasia epiphysialis punctate, ectodermal dysplasia, Ehrlich tumor, enamel dysplasia, encephaloophthalmic dysplasia, endometrial cancer (including, e.g., ependymoma and endometrial hyperplasia), ependymoma, epithelial cancer, epithelial dysplasia, epithelial metaplasia, esophageal cancer, Ewing’s family of tumors (including, e.g., Ewing’s sarcoma), extrahepatic bile duct cancer, eye cancer (including, e.g., intraocular melanoma and retinoblastoma), faciodigitogenital dysplasia, familial fibrous dysplasia of jaws, familial white folded dysplasia, fibroma, fibromuscular dysplasia, fibromuscular hyperplasia, fibrous dysplasia of bone, florid osseous dysplasia, focal epithelial hyperplasia, gall bladder cancer, ganglioneuroma, gastric cancer (including, e.g., stomach cancer), gastrointestinal carcinoid tumor, gastrointestinal tract cancer, gastrointestinal tumors, Gaucher’s disease, germ cell tumors (including, e.g., extracranial germ cell tumors, extragonadal germ cell tumors, and ovarian germ cell tumors), giant cell tumor, gingival hyperplasia, glioblastoma, glomangioma,
granulosa cell tumor, gynandroblastoma, hamartoma, head and neck cancer, hemangioendothelioma, hemangioma, hemangiopericytoma, hepatocellular cancer, hepatoma, hereditary renal-retinal dysplasia, hidrotic ectodermal dysplasia, histiocytoma, histiocytosis, hypergammaglobulinemia, hypohidrotic ectodermal dysplasia, hypopharyngeal cancer, inflammatory fibrous hyperplasia, inflammatory papillary hyperplasia, intestinal cancers, intestinal metaplasia, intestinal polyps, intraocular melanoma, intravascular papillary endothelial hyperplasia, kidney cancer, laryngeal cancer, leiomyoma, leukemia (including, e.g., acute lymphoblastic leukemia, acute lymphocytic leukemia, acute myeloid leukemia, acute myelogenous leukemia, acute hairy cell leukemia, acute B-cell leukemia, acute T-cell leukemia, acute HTLV leukemia, chronic lymphoblastic leukemia, chronic lymphocytic leukemia, chronic myeloid leukemia, chronic myelogenous leukemia, chronic hairy cell leukemia, chronic B-cell leukemia, chronic T-cell leukemia, and chronic HTLV leukemia), Leydig cell tumor, lip and oral cavity cancer, lipoma, liver cancer, lung cancer (including, e.g., small cell lung cancer and non-small cell lung cancer), lymphangiomyoma, lymphangioma, lymphoma (including, e.g., AIDS-related lymphoma, central nervous system lymphoma, primary central nervous system lymphoma, Hodgkin’s lymphoma, non-Hodgkin’s lymphoma, Hodgkin’s lymphoma during pregnancy, non-Hodgkin’s lymphoma during pregnancy, mast cell lymphoma, B-cell lymphoma, adenolymphoma, Burkitt’s lymphoma, cutaneous T-cell lymphoma, large cell lymphoma, and small cell lymphoma), lymphopenic thymic dysplasia, lymphoproliferative disorders, macroglobulinemia (including, e.g., Waldenstrom’s macroglobulinemia), malignant carcinoid syndrome, malignant mesothelioma, malignant thymoma, mammary dysplasia, mandibulofacial dysplasia, medulloblastoma, meningioma, mesenchymoma, mesonephroma, mesothelioma (including, e.g., malignant mesothelioma), metaphysial
dysplasia, metaplastic anemia, metaplastic ossification, metaplastic polyps, metastatic squamous neck cancer (including, e.g., metastatic squamous neck cancer with occult primary), Mondini dysplasia, monostotic fibrous dysplasia, mucoepithelial dysplasia, multiple endocrine neoplasia syndrome, multiple epiphysial dysplasia, multiple myeloma/plasma cell neoplasm, mycosis fungoides, myelodysplastic syndrome, myeloid metaplasia, myeloproliferative disorders, chronic myeloproliferative disorders, myoblastoma, myoma, myxoma, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, prostatic neoplasm, colon neoplasm, abdomen neoplasm, bone neoplasm, breast neoplasm, digestive system neoplasm, liver neoplasm, pancreas neoplasm, peritoneum neoplasm, endocrine glands neoplasm (including, e.g., adrenal neoplasm, parathyroid neoplasm, pituitary neoplasm, testicles neoplasm, ovary neoplasm, thymus neoplasm, and thyroid neoplasm), eye neoplasm, head and neck neoplasm, nervous system neoplasm (including, e.g., central nervous system neoplasm and peripheral nervous system neoplasm), lymphatic system neoplasm, pelvic neoplasm, skin neoplasm, soft tissue neoplasm, spleen neoplasm, thoracic neoplasm, urogenital tract neoplasm, neurilemmoma, neuroblastoma, neuroepithelioma, neurofibroma, neurofibromatosis, neuroma, nodular hyperplasia of prostate, nodular regenerative hyperplasia, oculoauriculovertebral dysplasia, oculodentodigital dysplasia, oculovertebral dysplasia, odontogenic dysplasia, odontoma, ophthalmomandibulomelic dysplasia, oropharyngeal cancer, osteoma, ovarian cancer (including, e.g., ovarian epithelial cancer and ovarian low malignant potential tumor), pancreatic cancer (including, e.g., islet cell pancreatic cancer and exocrine pancreatic cancer), papilloma, paraganglioma, nonchromaffin paraganglioma, paranasal sinus and nasal cavity cancer, paraproteinemias, parathyroid cancer, periapical cemental dysplasia, pheochromocytoma (including, e.g., penile cancer), pineal
and supratentorial primitive neuroectodermal tumors, pinealoma, pituitary tumor, plasma cell neoplasm/multiple myeloma, plasmacytoma, pleuropulmonary blastoma, polyostotic fibrous dysplasia, polyps, pregnancy cancer, pre-neoplastic disorders (including, e.g., benign dysproliferative disorders such as benign tumors, fibrocystic conditions, tissue hypertrophy, intestinal polyps, colon polyps, esophageal dysplasia, leukoplakia, keratoses, Bowen’s disease, Farmer’s skin, solar cheilitis, and solar keratosis), primary hepatocellular cancer, primary liver cancer, primary myeloid metaplasia, prostate cancer, pseudoachondroplastic spondyloepiphysial dysplasia, pseudoepitheliomatous hyperplasia, purpura, rectal cancer, renal cancer (including, e.g., kidney cancer, renal pelvis, ureter cancer, transitional cell cancer of the renal pelvis and ureter), reticuloendotheliosis, retinal dysplasia, retinoblastoma, salivary gland cancer, sarcomas (including, e.g., uterine sarcoma, soft tissue sarcoma, carcinosarcoma, chondrosarcoma, fibrosarcoma, hemangiosarcoma, Kaposi’s sarcoma, leiomyosarcoma, liposarcoma, lymphangiosarcoma, myosarcoma, myxosarcoma, rhabdosarcoma, sarcoidosis sarcoma, osteosarcoma, Ewing sarcoma, malignant fibrous histiocytoma of bone, and clear cell sarcoma of tendon sheaths), sclerosing angioma, secondary myeloid metaplasia, senile sebaceous hyperplasia, septooptic dysplasia, Sertoli cell tumor, Sezary syndrome, skin cancer (including, e.g., melanoma skin cancer and non-melanoma skin cancer), small intestine cancer, spondyloepiphysial dysplasia, squamous metaplasia (including, e.g., squamous metaplasia of amnion), stomach cancer, supratentorial primitive neuroectodermal and pineal tumors, supratentorial primitive neuroectodermal tumors, symptomatic myeloid metaplasia, teratoma, testicular cancer, theca cell tumor, thymoma (including, e.g., malignant thymoma), thyroid cancer, trophoblastic tumors (including, e.g., gestational trophoblastic tumors), ureter cancer, 
urethral cancer, uterine cancer, vaginal cancer, ventriculoradial dysplasia, verrucous hyperplasia, vulvar cancer, Waldenstrom’s macroglobulinemia, and Wilms’ tumor. In one embodiment, the cancer is a solid cancer. By “solid cancer”, it is meant a cancer affecting solid organs. Solid cancers may arise in nearly any tissue of the body.
In one embodiment, the cancer is a liquid cancer. The term “liquid cancer”, as used herein, refers to cancer cells that are present in body fluids, such as blood, lymph and bone marrow. Lymphomas and leukemias are common types of such liquid cancers. In one embodiment, the cancer is a common cancer.
As used herein, the term “common cancer” refers to one of the top 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more cancers that are clinically diagnosed with the greatest frequency in a population. Alternatively or additionally, the term “common cancer” refers to a cancer that is diagnosed with an annual incidence rate above about 1 in 50 000 people, such as about 1 in 40 000 people, about 1 in 30 000 people, about 1 in 20 000 people, about 1 in 10 000 people, about 1 in 9 500 people, about 1 in 9 000 people, about 1 in 8 750 people, about 1 in 8 500 people, about 1 in 8 250 people, about 1 in 8 000 people, about 1 in 7 750 people, about 1 in 7 500 people, about 1 in 7 250 people, about 1 in 7 000 people, about 1 in 6 750 people, about 1 in 6 500 people, about 1 in 6 250 people, about 1 in 6 000 people, about 1 in 5 750 people, about 1 in 5 500 people, about 1 in 5 250 people, about 1 in 5 000 people, or more. Statistics for such common cancers are reported yearly, in particular by the American Cancer Society (see, e.g., American Cancer Society, 2019. “Cancer Facts & Figures 2019”. Atlanta: American Cancer Society).
Examples of common cancers include, but are not limited to, breast cancer, lung and bronchus cancer, prostate cancer, colorectal cancer, melanoma, bladder cancer, non-Hodgkin’s lymphoma, kidney cancer, endometrial cancer, leukemia, pancreatic cancer, thyroid cancer, and liver cancer.
In one embodiment, the cancer is selected from the group comprising or consisting of breast cancer, lung and bronchus cancer, prostate cancer, colorectal cancer, melanoma, bladder cancer, kidney cancer, and endometrial cancer.
According to the present invention, the methods comprise a step of extracting nucleic acids from a sample.
In one embodiment, the sample is a body tissue sample or a bodily fluid sample.
In one embodiment, the sample is a body tissue sample. Examples of body tissues include, but are not limited to, muscle, nerve, brain, heart, lung, liver, pancreas, spleen, thymus, esophagus, stomach, intestine, kidney, testis, prostate, ovary, hair, skin, bone, breast, uterus, bladder and spinal cord. A body tissue sample may be recovered from the subject, e.g, by biopsy or during a surgical operation.
In one embodiment, the sample is not a body tissue sample. In one embodiment, the sample is a bodily fluid. Examples of bodily fluids include, but are not limited to, blood (including whole blood, plasma and serum), lymph, ascitic fluid, cystic fluid, urine, gastric juices, pancreatic juices, bile, nipple exudate, synovial fluid, bronchoalveolar lavage fluid, mucus, sputum, amniotic fluid, peritoneal fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, semen, milk, saliva, sweat, tears, feces, stools, and alveolar macrophages.
In one embodiment, the sample is blood, such as whole blood, plasma or serum. In one embodiment, the sample is whole blood. The term “whole blood” is as conventionally defined. Preferably, the sample is readily obtainable by minimally invasive methods or non-invasive methods, allowing the removal or isolation of the whole blood from the subject. In one embodiment, the sample is plasma. The term “plasma” is as conventionally defined. Plasma is usually obtained from a sample of whole blood, provided or contacted with an anticoagulant (such as, e.g, heparin, citrate, oxalate or EDTA). Subsequently, cellular components of the whole blood sample are separated from the liquid component (i.e., the plasma) by an appropriate technique, typically by centrifugation. The term “plasma” therefore refers to a composition which does not form part of a human or animal body.
In one embodiment, the sample is serum. The term “serum” is as conventionally defined. Serum can usually be obtained from a sample of whole blood, by (1) allowing clotting to take place in the whole blood sample and (2) subsequently separating the so-formed clot and cellular components of the blood sample from the liquid component (i.e., the serum) by an appropriate technique, typically by centrifugation. Alternatively, serum can be obtained from plasma by removing the anticoagulant and fibrin. The term “serum” therefore refers to a composition which does not form part of a human or animal body.
In one embodiment, the sample was previously taken from the subject, i.e., the method of the invention does not comprise a step of recovering a sample from the subject. Consequently, according to this embodiment, the methods of the invention are non-invasive methods.
The term “nucleic acids” refers to both DNA and RNA. Nucleic acids can be single-stranded or double-stranded. In one embodiment, the nucleic acid is DNA. In one embodiment, the nucleic acid is RNA.
In one embodiment, the nucleic acids are cell-free nucleic acids (cfNAs).
The terms “cell-free nucleic acid” or “cfNA”, sometimes referred to as “cell-free circulating nucleic acid” or “circulating nucleic acid”, are commonly used in the art to describe nucleic acid fragments that circulate in a subject’s bodily fluid and originate from one or more healthy cells and/or from one or more cancer cells from said subject.
In one embodiment, the cfNA is a cell-free circulating DNA (cfDNA). In one embodiment, the cfNA is a cell-free circulating RNA (cfRNA). Means and methods for extracting nucleic acids from a sample are well known to the one skilled in the art. Such means and methods include, e.g., the phenol-chloroform extraction method, or commercially available nucleic acid extraction reagents. Extraction can be carried out using commercially available kits.
Depending on the cfNA to be extracted (i.e., cfDNA, cfRNA or a combination of both), several means and methods can be carried out.
In particular, means and methods for extracting cfDNAs are well known in the art and commercial kits are readily available, e.g., the phenol-chloroform extraction method, the sodium iodide extraction method, the guanidine-resin extraction method, the “QIAamp® MinElute ccfDNA” kit from Qiagen, the “QIAamp® Circulating Nucleic Acids” kit from Qiagen, the “QIAamp® DNA Blood” kit from Qiagen, the “Gentra Puregene Blood” kit from Qiagen, the “MagMAX™ Cell-Free DNA Isolation” kit from Applied Biosystems, the “Quick-cfDNA Serum & Plasma” kit from Zymo Research, and the like.
In particular, means and methods for extracting cfRNAs can be adapted from the art and commercial kits are readily available, e.g., the TRIzol extraction method, the “RNeasy Mini” kit from Qiagen, the “QIAamp® Circulating Nucleic Acids” kit from Qiagen, the “MagMAX™-96 Blood RNA Isolation” kit from Thermo Fisher Scientific, and the like.
In particular, means and methods for extracting total cfNAs (i.e., cfDNA and cfRNA) can be adapted from the art and commercial kits are readily available, e.g., the “AllPrep DNA/RNA Mini” kit from Qiagen, the “MagMAX™ Cell-Free Total Nucleic Acid Isolation” kit from Thermo Fisher Scientific, and the like.
According to the present invention, the methods comprise a step of sequencing the extracted nucleic acids, preferably the extracted cfNAs. The term “sequencing” refers to any method by which the identity of at least about 5, about 10, about 15, about 20, about 25, about 30, about 40, about 50, about 60, about 70, about 80, about 90, about 100, about 125, about 150, about 175, about 200 or more nucleotides of a nucleic acid molecule is obtained. In one embodiment, the term “sequencing” encompasses methods by which epigenetic information may also be obtained, such as, e.g, nucleotide modifications.
The term “nucleotide modifications” refers to any modification of a nucleotide which does not affect the nucleic acid sequence itself. Examples of such modifications include, but are not limited to, methylation (such as, e.g., cytosine methylation leading to 5-methylcytosine; adenosine methylation leading to N6-methyladenosine), oxidation (such as, e.g., 5-methylcytosine oxidation leading to 5-hydroxymethylcytosine; 5-hydroxymethylcytosine oxidation leading to 5-formylcytosine; 5-formylcytosine oxidation leading to 5-carboxylcytosine). These modifications are well known to the one skilled in the art. In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by a sequencing method detecting nucleotide-specific physicochemical features, including size, optical, electrical, and/or magnetic properties.
In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by a sequencing method detecting nucleotide size. In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by a sequencing method detecting nucleotide optical properties (such as, e.g, fluorescence or absorption spectrum). In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by a sequencing method detecting nucleotide electrical properties. In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by a sequencing method detecting nucleotide magnetic properties.
In one embodiment, the extracted nucleic acids, preferably the extracted cfDNAs, are not sequenced by a first- or second-generation sequencing method. The term “first-generation sequencing” refers to Sanger sequencing, i.e., a sequencing method based on the selective incorporation of chain-terminating di-deoxynucleotides by DNA polymerase during in vitro DNA replication.
The term “second-generation sequencing”, also termed “massive parallel sequencing”, “massively parallel sequencing” or “next-generation sequencing”, refers to methods of “sequencing-by-synthesis”, wherein nucleic acid molecules to be sequenced are amplified, then sequenced in batch through nucleic acid neostrand synthesis.
Examples of second-generation sequencing methods include, but are not limited to, pyrosequencing (such as, e.g., using the 454 platform from Roche, or the GS FLX Titanium platform from 454 Life Sciences), sequencing by reversible terminator chemistry (such as, e.g., using the MiSeq platform, the HiSeq platform or the Genome Analyzer IIx platform from Illumina), and sequencing by ligation (such as, e.g., using the SOLiD4 platform from Life Technologies, now Thermo Fisher Scientific; or the Complete Genomics platform from Complete Genomics). In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by a third-generation sequencing method.
The term “third-generation sequencing”, also termed “single-molecule nucleic acid sequencing”, refers to sequencing methods wherein the nucleotide sequence is read at the single nucleic acid molecule level. Examples of third-generation sequencing methods include, but are not limited to, nanopore sequencing (such as, e.g., from Oxford Nanopore Technologies, from Quantapore, or from Stratos Genomics Inc.); single molecule real-time sequencing (SMRT) (such as, e.g., from Pacific Biosciences); annular dark-field scanning transmission electron microscopy sequencing (such as, e.g., from ZS Genetics Inc.); and Stratos (WA-USA); Heliscope sequencing (such as, e.g., from Helicos Biosciences); nano-knife-edge probe sequencing (such as, e.g., from Reveo Inc.). Of note, nanopore sequencing is sometimes referred to in the literature as fourth-generation sequencing. Third-generation sequencing methods are well known in the art. For a review, see, e.g., Niedringhaus et al. (2011. Anal Chem. 83(12):4327-41) or Xu et al. (2009. Small. 5(23):2638-49).
In one embodiment, the sequencing method provides, beside the identity of the nucleotides of a nucleic acid molecule, epigenetic information, such as, e.g., nucleotide modifications.
In one embodiment, the extracted nucleic acids, preferably the extracted cfNAs, are sequenced by nanopore sequencing.
In one embodiment, raw sequencing data are obtained upon sequencing the extracted nucleic acids, preferably the extracted cfNAs.
The term “raw sequencing data” refers to the output of a sequencing run. Raw sequencing data are represented by the signal measured by the sequencer. For example, raw sequencing data may be pictures of fluorescent signal or recording of electric signal.
In one embodiment, raw sequencing data are pre-processed to obtain sequence reads. The terms “pre-process”, “pre-processed”, “pre-processing”, also termed “base-call”, “base-called”, “base-calling”, refer to the transformation of the raw sequencing data (e.g, the fluorescent signal, electric signal, or the like) into corresponding nucleotides; in other words, to the assignment of nucleotides to a raw sequencing signal.
In one embodiment, a plurality of sequence reads is obtained upon sequencing the extracted nucleic acids, preferably the extracted cfNAs. In one embodiment, a plurality of sequence reads is obtained upon base-calling of the raw sequencing data.
The term “sequence read” refers to the output of a sequencing run after pre-processing of the raw signal. Sequence reads are represented by a string of nucleotides. Sequence reads may be accompanied by metrics about the quality of the sequence. The quality is determined during the base-calling step and indicates the accuracy of each base call. For example, each nucleotide in a sequence read may be associated with the confidence of the base-call, i.e., of the determination of whether the nucleotide at that position is a G, A, T or C. In one embodiment, a plurality of “sequence reads” can include unique or substantially unique nucleic acid sequences. Alternatively, a plurality of “sequence reads” can include redundant sequences of the same parent molecule, generated, e.g., by an amplification step carried out before and/or during sequencing. In one embodiment, “consensus sequence reads” can be generated from sequence reads by comparing redundant sequence reads and selecting the most common nucleotide observed at a given position, by comparing to a reference genome or a portion thereof, or by other approaches. Unique or non-unique molecular tags (UMI) can be added to the nucleic acids to be sequenced before an amplification step, to label each nucleic acid molecule.
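By way of non-limiting illustration, the generation of consensus sequence reads by selecting the most common nucleotide at each position can be sketched in Python as follows. The sketch assumes that redundant reads carrying the same molecular tag have equal length and are already in register; the function names and the toy tags and sequences are hypothetical and are not taken from the present description.

```python
from collections import Counter, defaultdict


def consensus(reads):
    """Majority-vote consensus of redundant reads of equal length."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("this sketch assumes reads of equal length")
    # For each position, keep the most frequent nucleotide
    # (on a tie, the nucleotide encountered first is kept).
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def consensus_by_tag(tagged_reads):
    """Group (tag, sequence) pairs by molecular tag, one consensus per tag."""
    groups = defaultdict(list)
    for tag, seq in tagged_reads:
        groups[tag].append(seq)
    return {tag: consensus(seqs) for tag, seqs in groups.items()}


# Hypothetical toy data: three redundant reads for tag "AAT", one for "GGC".
tagged = [("AAT", "ACGTA"), ("AAT", "ACGTA"), ("AAT", "ACCTA"), ("GGC", "TTGCA")]
result = consensus_by_tag(tagged)
```

In this toy input, the single discordant nucleotide in the third “AAT” read is outvoted by the two concordant reads.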
The term “sequencing run” refers to any step or portion of a sequencing experiment performed to determine some information related to at least one nucleic acid molecule.
The term “a plurality”, when referring to sequence reads, means that more than one, such as, e.g., at least 2 sequence reads are obtained. In certain cases, a plurality of sequence reads may have at least about 10, at least about 100, at least about 1000, at least about 10 000, at least about 100 000, at least about 10⁶, at least about 10⁷, at least about 10⁸, at least about 10⁹ or more sequence reads.
According to the present invention, the methods comprise a step of assigning the plurality of sequence reads to at least one reference genome or a portion thereof.
The terms “assigned”, “assignment”, or “assigning”, also termed “mapped” or “mapping”, refer to two or more nucleic acid sequences (such as, e.g., a sequence read and a reference genome sequence or a portion thereof) that can be identified as a complete match (e.g., with a 100% sequence identity) or as a partial match (e.g., with at least about 50% sequence identity, such as at least about 60%, about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, about 96%, about 97%, about 98%, or about 99% sequence identity).
In one embodiment, a sequence read is assigned to a reference genome sequence or a portion thereof when said sequence read is “aligned” on said reference genome sequence or a portion thereof. In one embodiment, an alignment may comprise a mismatch, i.e., a site at which a nucleotide in one sequence read and a nucleotide in the - or in a portion of the - reference genome with which it is aligned are not complementary. In one embodiment, an alignment may comprise 1, 2, 3, 4, 5 or more, contiguous or non-contiguous, mismatches. In one embodiment, an alignment may comprise 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10% or more, contiguous or non-contiguous, mismatches.
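By way of non-limiting illustration, the notions of sequence identity and of an allowed number of mismatches can be sketched in Python as follows. The sketch assumes a gapless comparison between a read and a reference segment of the same length, which is a simplification: actual alignment programs also handle insertions and deletions. The function names and thresholds are hypothetical.

```python
def count_mismatches(read, ref_segment):
    """Number of positions at which two equal-length sequences differ."""
    if len(read) != len(ref_segment):
        raise ValueError("this sketch assumes a gapless, equal-length comparison")
    return sum(1 for a, b in zip(read.upper(), ref_segment.upper()) if a != b)


def percent_identity(read, ref_segment):
    """Percentage of identical positions over the length of the read."""
    if not read:
        raise ValueError("empty read")
    return 100.0 * (len(read) - count_mismatches(read, ref_segment)) / len(read)


def is_match(read, ref_segment, max_mismatches=None, min_identity=None):
    """Accept the comparison if it satisfies whichever limits are provided."""
    if max_mismatches is not None and count_mismatches(read, ref_segment) > max_mismatches:
        return False
    if min_identity is not None and percent_identity(read, ref_segment) < min_identity:
        return False
    return True


# Hypothetical toy sequences differing at a single position out of ten.
read = "ACGTACGTAC"
segment = "ACGTTCGTAC"
identity = percent_identity(read, segment)
```

With these toy sequences, the read is a partial match at 90% identity: it is accepted when one mismatch is allowed and rejected when a complete match is required.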
In one embodiment, two or more nucleic acid sequences can be compared using either strand. In one embodiment, a nucleic acid sequence is compared with the reverse complement of another nucleic acid sequence. The term “mapped sequence read” refers to a sequence read that has been assigned to (such as, e.g, “mapped” or “aligned”) a matching sequence in the at least one reference genome.
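By way of non-limiting illustration, comparing a nucleic acid sequence with the reverse complement of another can be sketched in Python as follows. The sketch assumes DNA sequences written with the letters A, C, G, T and N, and exact matching only; the function names are hypothetical.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def matches_either_strand(read, ref_segment):
    """True if the read equals the segment on the forward or the reverse strand."""
    return read == ref_segment or reverse_complement(read) == ref_segment
```

For example, `reverse_complement("AACG")` gives `"CGTT"`, so a read “CGTT” is considered to match a reference segment “AACG” on the opposite strand.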
Assigning a sequence read to a reference genome or a portion thereof (such as, e.g., mapping or aligning) can be done manually or by a computer (e.g., using software, a program, a computer program component, an algorithm, a machine learning algorithm or a deep learning algorithm). Various computational methods can be used to assign sequence reads to a reference genome. Sequence reads can be mapped by a mapping component or by a machine or computer comprising a mapping component (e.g., a suitable mapping and/or alignment and/or classification program), which mapping component generally compares reads to a reference genome or segment thereof.
Sequence reads can be mapped to or aligned with a reference genome or a portion thereof by use of a suitable mapping and/or alignment program. Examples of such programs include, but are not limited to, BWA (Li H. and Durbin R. (2009) Bioinformatics 25, 1754-60), Novoalign [Novocraft (2010)], Bowtie (Langmead B, et al., (2009) Genome Biol. 10:R25), SOAP2 (Li R, et al., (2009) Bioinformatics 25, 1966-67), BFAST (Homer N, et al., (2009) PLoS ONE, e7767), GASSST (Rizk, G. and Lavenier, D. (2010) Bioinformatics 26, 2534-2540), and MPscan (Rivals E., et al. (2009) Lecture Notes in Computer Science 5724, 246-260), or the like. Sequence reads can be assigned to (such as, e.g., mapped or aligned) a reference genome or a portion thereof using a suitable short read alignment program. Examples of such programs include, but are not limited to, BarraCUDA, BFAST, BLASTN, BLAST, BLAT, BLITZ, Bowtie (e.g., BOWTIE 1, BOWTIE 2), BWA, CASHX, CUDA-EC, CUSHAW, CUSHAW2, desalt, drFAST, FASTA, ELAND, ERNE, GNUMAP, GEM, GensearchNGS, GMAP, Geneious Assembler, GraphMap, iSAAC, LAST, MAQ, marginAlign, minimap, minimap2, minialign, mrFAST, mrsFAST, MOSAIK, MPscan, NanoBLASTer, Novoalign, NovoalignCS, Novocraft, NextGENe, Omixon, PALMapper, Partek, PASS, PerM, PROBEMATCH, QPalma, RazerS, REAL, cREAL, RMAP, rNA, RTG, Segemehl, SeqMap, Shrec, SHRiMP, SLIDER, SP Aligner, SOAP, SOAP2, SOAP3, SOCS, SSAHA, SSAHA2, Stampy, SToRM, Subread, Subjunc, Taipan, UGENE, VelociMapper, TimeLogic, XpressAlign, ZOOM, the like, variations thereof or combinations thereof.
Sequence reads can be assigned to (such as, e.g., mapped or aligned) a reference genome or a portion thereof by use of suitable machine learning or deep learning algorithms. Examples of such algorithms include, but are not limited to, fastText (Joulin, 2016. arXiv:1607.01759 [cs.CL]; Joulin et al., 2017. In 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017): Valencia, Spain, 3-7 April 2017. Stroudsburg, PA: Association for Computational Linguistics), fastDNA (Menegaux & Vert, 2019. J Comput Biol. 26(6):509-518), and large-scale linear models that learn continuous low-dimensional representations of k-mers.
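By way of non-limiting illustration, the decomposition of a sequence read into k-mers, which is the kind of input on which k-mer-based learning approaches operate, can be sketched in Python as follows. The sketch only shows the k-mer extraction and counting; the learning of k-mer representations and the classification model itself are not shown, and the function names are hypothetical rather than part of any of the cited tools.

```python
from collections import Counter


def kmers(seq, k):
    """List the overlapping k-mers of a sequence, in order of occurrence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_profile(seq, k):
    """Count how many times each k-mer occurs in a sequence."""
    return Counter(kmers(seq, k))
```

For example, `kmers("ACGTAC", 3)` yields the four 3-mers “ACG”, “CGT”, “GTA” and “TAC”. A sequence shorter than k yields no k-mer.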
A mapping component can map sequence reads by a suitable method known in the art or described herein. In one embodiment, a mapping component or a machine or computer comprising a mapping component is required to provide mapped sequence reads. A mapping component often comprises a suitable mapping and/or alignment program or algorithm.
In one embodiment, a plurality of sequence reads and/or information associated with a plurality of sequence reads are stored on and/or accessed from a non-transitory computer-readable storage medium in a suitable computer-readable format. Information stored on a non-transitory computer-readable storage medium is sometimes referred to as a “file” or “data file”. A file or data file often comprises a format.
For example, a sequence read or a plurality of sequence reads is sometimes stored in a format that includes information about one or more sequence reads, non-limiting examples of which include a complete or partial nucleic acid sequence, mappability, a mappability score, a mapped location, a relative location or distance from other mapped or unmapped reads (e.g., estimated distance between read mates), orientation relative to a reference genome or to other reads (e.g., relative to read mates), an estimated or precise location of a read mate, a G/C content, nucleotide modification (e.g., methylation), the like or combinations thereof.
A “computer-readable format” is sometimes referred to generally herein as a “format”. In one embodiment, sequence reads are stored and/or accessed in a suitable binary format, a text format, the like or a combination thereof. A binary format is sometimes a BAM format. A text format is sometimes a sequence alignment/map (SAM) format.
Examples of binary and/or text formats include, but are not limited to, BAM, sorted BAM, SAM, SRF, FASTA, FASTQ, Gzip, the like, or combinations thereof.
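By way of non-limiting illustration, reading the information stored for one mapped sequence read in the SAM text format can be sketched in Python as follows. The sketch relies on the eleven mandatory tab-separated fields of an alignment line in the SAM format and on two of its flag bits (0x4, read unmapped; 0x10, read on the reverse strand); the class and function names, and the example line, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapped position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence
    qual: str    # per-base quality string

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse one alignment line of a SAM file (header lines start with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[9], fields[10])


# Hypothetical alignment line: a read mapped on the reverse strand of "chr1".
record = parse_sam_line("read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII")
```

Optional fields following the eleventh column, if any, are ignored in this sketch.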
In one embodiment, a program is configured to instruct a microprocessor to obtain or retrieve one or more files. In one embodiment, a program is configured to instruct a microprocessor to obtain or retrieve one or more FASTQ files (e.g., a FASTQ file for a first read and a second read) and/or one or more reference files (e.g., a FASTA or FASTQ file). In one embodiment, a program instructs a microprocessor to call a computer program component and/or transfers data and/or information (e.g., files) to or from one or more computer program components (e.g., an adapter trimmer component, BWA-MEM aligner, insert size distribution component, samtools, and the like). In one embodiment, a program instructs a processor to call a computer program component which creates new files and formats for input into another processing step.
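By way of non-limiting illustration, a program calling computer program components in sequence, each creating a file used as input by the next step, can be sketched in Python as follows. The sketch assumes that the `bwa` and `samtools` executables are installed and that the reference file has already been indexed (e.g., with `bwa index`); the function names and the file names are hypothetical, and only the basic `bwa mem` and `samtools sort -o` invocations are used.

```python
import subprocess
from pathlib import Path


def bwa_mem_command(reference, reads_1, reads_2=None):
    """Command line aligning one or two FASTQ files with BWA-MEM (SAM on stdout)."""
    cmd = ["bwa", "mem", str(reference), str(reads_1)]
    if reads_2 is not None:
        cmd.append(str(reads_2))
    return cmd


def samtools_sort_command(sam_path, bam_path):
    """Command line sorting a SAM file into a coordinate-sorted BAM file."""
    return ["samtools", "sort", "-o", str(bam_path), str(sam_path)]


def align_and_sort(reference, reads_1, reads_2, workdir):
    """Run both components; requires bwa and samtools to be installed."""
    workdir = Path(workdir)
    sam_path = workdir / "aligned.sam"          # hypothetical file name
    bam_path = workdir / "aligned.sorted.bam"   # hypothetical file name
    with open(sam_path, "w") as sam:
        # bwa mem writes the alignments to standard output.
        subprocess.run(bwa_mem_command(reference, reads_1, reads_2),
                       stdout=sam, check=True)
    subprocess.run(samtools_sort_command(sam_path, bam_path), check=True)
    return bam_path
```

Building the command lines separately from running them keeps each component call inspectable without executing the external tools.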
In one embodiment, the plurality of sequence reads is assigned to (such as, e.g, mapped or aligned) at least one reference genome or a portion thereof, such as on 1 reference genome, on 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000 or more reference genomes or a portion thereof.
In one embodiment, a sequence read in the plurality of sequence reads may uniquely or non-uniquely map to a reference genome or a portion thereof. A sequence read is considered as “uniquely mapped” if it is assigned (such as, e.g., mapped or aligned) - completely or partially - with a single sequence in the at least one reference genome or a portion thereof.
A sequence read is considered as “non-uniquely mapped” if it is assigned (such as, e.g, mapped or aligned) - completely or partially - with two or more sequences in the at least one reference genome or a portion thereof. In one embodiment, non-uniquely mapped sequence reads may be eliminated from further analysis.
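By way of non-limiting illustration, the distinction between uniquely and non-uniquely mapped sequence reads can be sketched in Python as follows. The sketch assumes exact matching on the forward strand only, which is a simplification with respect to actual mapping programs; the function names and the toy reference sequence are hypothetical.

```python
def find_all(reference, read):
    """Return every 0-based start at which the read occurs exactly in the reference."""
    starts = []
    i = reference.find(read)
    while i != -1:
        starts.append(i)
        i = reference.find(read, i + 1)  # allow overlapping occurrences
    return starts


def classify_read(read, references):
    """Label a read according to its number of exact hits across all references."""
    hits = [(name, pos)
            for name, seq in references.items()
            for pos in find_all(seq, read)]
    if not hits:
        return "unmapped", hits
    if len(hits) == 1:
        return "uniquely mapped", hits
    return "non-uniquely mapped", hits


# Hypothetical toy reference in which "ACGT" occurs twice and "TTGA" once.
references = {"refA": "ACGTACGTTTGA"}
labels = {read: classify_read(read, references)[0]
          for read in ("ACGT", "TTGA", "GGGG")}
# Keep only uniquely mapped reads for further analysis.
kept = [read for read, label in labels.items() if label == "uniquely mapped"]
```

In this toy example, only the read “TTGA” is kept, the read “ACGT” being eliminated as non-uniquely mapped.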
As explained hereinabove, a certain degree of mismatch between the reference genome or a portion thereof and the sequence reads (such as, e.g., 1, 2, 3, 4, 5 or more, contiguous or non-contiguous, mismatches; or 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10% or more, contiguous or non-contiguous, mismatches) may be allowed to account for, e.g., single nucleotide polymorphisms or sequencing errors. In one embodiment, no degree of mismatch between the reference genome or a portion thereof and the sequence reads may be allowed.
The term “reference genome” can refer to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus which may be used to reference identified sequences in the plurality of sequence reads. A “reference genome” may refer to a portion of a genome (e.g., a chromosome or part thereof, e.g., one or more portions of a genome).
Human genomes, human genome assemblies and/or genomes from any other organisms or viruses can be used as a reference genome. One or more human genomes, human genome assemblies as well as genomes of other organisms or viruses can be found, e.g., at the National Center for Biotechnology Information (NCBI) at www.ncbi.nlm.nih.gov/genome. A “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. As used herein, a reference genome or a portion thereof often is an assembled or partially assembled genomic sequence from a subject or multiple subjects. In one embodiment, a reference genome or a portion thereof is an assembled or partially assembled genomic sequence from one or more human subjects. In one embodiment, a reference genome or a portion thereof comprises sequences assigned to chromosomes. In one embodiment, a reference genome or a portion thereof comprises sequences obtained from a reference subject or sample. In one embodiment, a reference genome or a portion thereof comprises sequences, an assembly of sequences, and/or a consensus sequence (e.g., a sequence contig). In one embodiment, a reference genome or a portion thereof is obtained from a reference subject or sample substantially free of a genetic variation. In one embodiment, a reference genome or a portion thereof is obtained from a reference subject or sample comprising a known genetic variation.
In one embodiment, sequence reads can be assigned to (such as, e.g., mapped or aligned) sequences in nucleic acid databases known in the art. Examples of such databases include, but are not limited to, the International Nucleotide Sequence Database (at www.insdc.org), GenBank (at www.ncbi.nlm.nih.gov), the European Nucleotide Archive (at www.ebi.ac.uk/ena/browser/home), and the DNA Data Bank of Japan (at www.ddbj.nig.ac.jp). Other suitable examples include, without limitation, 23andMe, 1000 Genomes Project, ArrayExpress, Bioinformatic Harvester, ClinVar, COSMIC, dbSNP, ENCODE, Ensembl, Ensembl Genomes, Gene Disease Database, Gene Expression Omnibus (GEO), GTEx, HapMap, Human Microbiome Project (HMP), Human Protein Atlas (HPA), Online Mendelian Inheritance in Man (OMIM), Personal Genome Project, RefSeq, SNPedia, and TCGA. BLAST or similar tools can be used to search sequence reads against a sequence database. In one embodiment, the mappability is assessed for a genomic region (e.g., one or more portions of a genome).
The term “mappability” refers to the ability to unambiguously assign a sequence read to a portion of a reference genome, typically up to a specified number of mismatches (such as, e.g, 1, 2, 3, 4, 5 or more mismatches). In one embodiment, mappability is provided as a score or value, where the score or value is generated by a suitable mapping algorithm or computer-mapping software.
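By way of non-limiting illustration, a mappability score for a genomic region can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the present description: the score is defined as the fraction of k-mers starting in the region that occur exactly once in the reference, zero mismatches are allowed, and only the forward strand is considered. The function name and the toy reference sequence are hypothetical.

```python
from collections import Counter


def mappability(reference, start, end, k):
    """Fraction of k-mers starting in [start, end) that are unique in the reference."""
    # Count every k-mer of the reference once.
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    # Do not run past the last position at which a full k-mer can start.
    last_start = min(end, len(reference) - k + 1)
    window = [reference[i:i + k] for i in range(start, last_start)]
    if not window:
        raise ValueError("the region contains no complete k-mer")
    return sum(1 for kmer in window if counts[kmer] == 1) / len(window)


# Hypothetical toy reference in which the 4-mer "ACGT" occurs twice.
reference = "ACGTACGTTTGA"
score_repeat = mappability(reference, 0, 1, 4)   # region covering only "ACGT"
score_unique = mappability(reference, 5, 9, 4)   # region free of repeated 4-mers
score_all = mappability(reference, 0, 9, 4)      # whole toy reference
```

A score of 1 indicates that every read of length k originating from the region could be assigned unambiguously under these assumptions, and a score of 0 indicates that none could.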
In one embodiment, the plurality of sequence reads is compared to one reference genome or a portion thereof. In one embodiment, the reference genome is the human (Homo sapiens sapiens) genome or a portion thereof.
In one embodiment, sequence reads that did not match with the reference genome or a portion thereof, e.g., the human genome, are discarded from further analysis.
In one embodiment, sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one further reference genome or a portion thereof or genome database.
In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one pathogen genome or genome database, such as, e.g, an archaeal, bacterial, fungal, protist, protozoal, and/or viral reference genome or a portion thereof or genome database.
In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one archaeal genome or a portion thereof or with an archaeal genome database.
In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one bacterial genome or a portion thereof or with a bacterial genome database.
In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one fungal genome or a portion thereof or with a fungal genome database.
In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one protist genome or a portion thereof or with a protist genome database. In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one protozoal genome or a portion thereof or with a protozoal genome database.
In particular, the sequence reads that did not match with the reference genome or a portion thereof, e.g, the human genome, are compared to at least one viral genome or a portion thereof or with a viral genome database.
In one embodiment, the sequence reads that did not match with the reference genome or a portion thereof, e.g., the human genome, are compared to a bacterial and/or viral genome database. In one embodiment, sequence reads that matched neither the first reference genome, e.g., the human genome, nor the further reference genome(s) or genome database(s), e.g., pathogen genome(s) or genome database(s), are discarded from further analysis.
In one embodiment, sequence reads that matched with the at least one reference genome or a portion thereof (i.e., “mapped sequence reads”) are kept for further analysis. In particular, mapped sequence reads may be classified into sequence reads that mapped with a first reference genome or a portion thereof, e.g., the human genome (i.e., “human-mapped sequence reads”), and sequence reads that mapped with the further reference genome(s) or a portion thereof, e.g., a non-human genome, such as pathogen genome(s) or genome database(s) (i.e., “exogenous-mapped sequence reads”).
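The read triage described above can be sketched as follows. This is a minimal illustration only, assuming that the alignment results are already available as two sets of read identifiers; the function name `classify_reads` and its parameters are hypothetical and do not come from the present disclosure.

```python
from typing import Dict, Iterable, List, Set


def classify_reads(
    read_ids: Iterable[str],
    human_hits: Set[str],
    exogenous_hits: Set[str],
) -> Dict[str, List[str]]:
    """Split reads into human-mapped, exogenous-mapped and discarded reads.

    human_hits and exogenous_hits are assumed inputs: the identifiers of reads
    that matched the first reference genome and the further reference
    genome(s) or database(s), respectively.
    """
    groups: Dict[str, List[str]] = {"human": [], "exogenous": [], "discarded": []}
    for read_id in read_ids:
        if read_id in human_hits:
            # The first reference genome takes precedence.
            groups["human"].append(read_id)
        elif read_id in exogenous_hits:
            # Only reads unmatched on the first reference reach this branch.
            groups["exogenous"].append(read_id)
        else:
            # Matched neither reference: dropped from further analysis.
            groups["discarded"].append(read_id)
    return groups
```

In this sketch, the human-mapped and exogenous-mapped groups together form the mapped sequence reads that are passed to the computer-processing step.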
According to the present invention, the methods comprise a step of computer-processing the mapped sequence reads.
In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic, metabolic and/or metagenomic biomarkers of cancer in said mapped sequence reads.
In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and epigenetic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and transcriptomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic and transcriptomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - transcriptomic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - transcriptomic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic and transcriptomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic, transcriptomic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic, transcriptomic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic and metabolic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads. In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
In one embodiment, computer-processing comprises identifying - or assessing the presence of - epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
In one embodiment, computer-processing comprises identifying - or assessing the presence of - genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer in said mapped sequence reads.
The term “genetic biomarker” is broader than a gene. A “genetic biomarker” refers to a fragment of nucleic acid, such as a fragment of DNA, with an identifiable physical location on a chromosome whose inheritance can be followed. A genetic biomarker can have a function and thus be, or be a fragment of, a gene. Alternatively, a genetic biomarker can be a fragment of nucleic acid, such as a fragment of DNA, with no known function.
Examples of genetic biomarkers of cancer include, but are not limited to, genomic alterations (such as, e.g., somatic mutations and single-nucleotide polymorphisms), telomere length evaluation, and retrotransposon sequence detection. Hence, identifying - or assessing the presence of - genetic biomarkers of cancer in said mapped sequence reads may comprise analyzing the position, gene and impact of genomic alterations and/or nucleosome footprint in said mapped sequence reads.
The term “genomic alteration”, in the context of the present invention, refers to a change (or mutation) in the nucleotide sequence of the genome of a cancer cell, which change is not present in a non-cancer cell genome. Such genomic alterations include, but are not limited to, base pair substitutions (such as, e.g., single-nucleotide polymorphism), base pair insertions, base pair deletions, copy number alteration, gene rearrangement (such as, e.g., gene fusion), short tandem repeat polymorphism (such as, e.g., STR), chromosomal abnormalities and any combination thereof. Base pair substitutions, insertions and deletions are commonly included under the general term “base pair mutation”.
In particular, assessing the presence (or absence) of genomic alterations of cancer in a sample may provide information on cancer-specific alterations, such as single mutations, mutational signatures and chromosomal abnormalities. In one embodiment, the genomic alteration may be defined relative to the locus or gene in which it is present in cancer cells relative to a non-cancer cell genome. In one embodiment, the genomic alteration may be defined relative to trinucleotide frequencies in the genome of cancer cells relative to a non-cancer cell genome.
Examples of cancer-specific base pair mutations include, but are not limited to, mutations in any one of the following genes: ACVR2A, AFF3, ALK, APC, AR, ARID1A, ARID2, ATM, ATRX, BAP1, BCOR, BRAF, CAMTA1, CDH1, CDKN2A, CREBBP, CTCF, CTNNB1, EBF1, EGFR, EP300, ERBB2, ERBB3, ERBB4, ERCC2, ESR1, FAT1, FAT4, FBXW7, FGFR2, FGFR3, FHIT, FOXP1, GATA3, GRIN2A, HRAS, KDM6A, KDM5C, KDR, KEAP1, KIT, KMT2A, KMT2C, KMT2D, KRAS, LPP, LRP1B, MAP3K1, MET, MTOR, MSH6, NF1, NF2, NRAS, PBRM1, PIK3CA, PIK3R1, POLE, PPP2R1A, PREX2, PTEN, PTPRK, PTPRT, RAD51B, RB1, RNF43, ROS1, RUNX1, SETD2, SMAD4, SMARCA4, SPOP, STAG2, STK11, TCF7L2, TGFBR2, TP53, TP63, TRRAP, TSC1, TSC2, VHL, and/or ZFHX3. For example, the presence of certain base pair mutations in given genes is known to be associated with certain types of cancers. Table 1 provides examples of base pair mutation occurrences (expressed in %) observed in certain types of cancers.
Table 1. Extracted from GRCh38 COSMIC v90.
Figure imgf000034_0001
Figure imgf000035_0001
Differential trinucleotide frequencies in the genome of cancer cells relative to a non-cancer cell genome have been termed “mutational signatures” in the art. Examples of such mutational signatures are described, e.g., in Alexandrov et al., 2013. Nature. 500(7463):415-21. For example, certain mutational signatures are known to be associated with certain types of cancers. Table 2 provides examples of mutational signatures and their correlation with certain types of cancers.
Table 2. “+” indicates that the mutational signature is typically observed in the given type of cancer. Signatures 1A, 1B and 2 to 15 are as defined in Supplementary Figures S2 to S17 of Alexandrov et al., 2013. Nature. 500(7463):415-21. These figures are herein incorporated by reference and form part of the present disclosure.
Figure imgf000036_0001
A further example of mutational signatures and their correlation with cancers is described in Figure 3 of Alexandrov et al, 2013. Nature. 500(7463):415-21. This figure is herein incorporated by reference and forms part of the present disclosure.
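As an illustration of how trinucleotide frequencies may be tabulated, the sketch below counts single-base substitutions by trinucleotide context, using the pyrimidine-reference convention of Alexandrov et al. (96 substitution classes). It assumes that each substitution is supplied as a reference trinucleotide, with the mutated base in the middle, together with the alternate base; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def substitution_class(context: str, alt: str) -> str:
    """Name a substitution by its trinucleotide context, e.g. 'A[C>T]G'.

    context is the reference trinucleotide with the mutated base in the
    middle. Substitutions at a purine (A or G) are reported on the opposite
    strand so that the reference base is always C or T.
    """
    ref = context[1]
    if ref in "AG":
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def signature_profile(mutations: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Relative frequency of each observed substitution class."""
    counts = Counter(substitution_class(ctx, alt) for ctx, alt in mutations)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

A profile obtained in this way could then be compared with published signatures; how that comparison is performed is not specified here.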
Further examples of mutational signatures and their correlation with specific cancers are described in Supplementary Figures S59 to S88 of Alexandrov et al., 2013. Nature. 500(7463):415-21. These figures are herein incorporated by reference and form part of the present disclosure.
Examples of cancer-specific copy number alteration include, but are not limited to, increased copy number (i.e., gain of gene copy number in cancer cells relative to non-cancer cells) of any one of the following genes: AARD, APCDD1L, ATP11B, ATP5F1E, CSMD3, CTSZ, DCUN1D1, DPP6, EIF3H, EXT1, GNAS, HDAC9, LAMP3, MAL2, MCCC1, PRELID3B, RAB22A, RAD21, SAMD12, SLC30A8, SOX2, TG, TOX2, UTP23, and VAPB.
Examples of cancer-specific copy number alteration include, but are not limited to, decreased copy number (i.e., loss of gene copy number in cancer cells relative to non-cancer cells) of any one of the following genes: CDKN2A, CDKN2B, CSMD1, DNAAF5, EML4/ALK, FHIT, MTAP, RBFOX1, and SCAPER.
Examples of cancer-specific gene rearrangement (such as, e.g., gene fusion) include, but are not limited to, ETV6/NTRK3 fusion, MYB/NFIB fusion, TMPRSS2/ERG fusion, and TRPS1 fusion.
Examples of cancer-specific short tandem repeat polymorphism include, but are not limited to, IGF-I and AR.
For example, the presence of any one of the following genomic alterations is known to be associated with breast cancer: ETV6/NTRK3 fusion, MYB/NFIB fusion, TRPS1 fusion; increased copy number of AARD, CSMD3, EIF3H, EXT1, MAL2, RAD21, SAMD12, SLC30A8, and/or UTP23; and decreased copy number of CSMD1, DNAAF5, and/or SCAPER.
For example, the presence of any one of the following genomic alterations is known to be associated with colorectal cancer: increased copy number of APCDD1L, ATP5F1E, CTSZ, GNAS, PRELID3B, RAB22A, TOX2, and/or VAPB; and decreased copy number of FHIT and/or RBFOX1.
For example, the presence of any one of the following genomic alterations is known to be associated with kidney cancer: ETV6/NTRK3 fusion; and decreased copy number of SCAPER.
For example, the presence of any one of the following genomic alterations is known to be associated with lung and bronchus cancer: increased copy number of ATP11B, DCUN1D1, LAMP3, MCCC1, and/or SOX2; and decreased copy number of CDKN2B, EML4/ALK, and/or SCAPER.
For example, the presence of any one of the following genomic alterations is known to be associated with melanoma: increased copy number of DPP6, HDAC9, and/or TG; and decreased copy number of CDKN2A, CDKN2B, MTAP, and/or SCAPER.
For example, the presence of the following genomic alteration is known to be associated with prostate cancer: TMPRSS2/ERG fusion. For example, the presence of the following genomic alteration is known to be associated with prostate cancer: CAG repeat length in AR.
For example, the presence of the following genomic alteration is known to be associated with breast cancer: AC repeat length in IGF-I and/or CAG repeat length in AR.
Assessing the size - or size distribution - of telomeres in said mapped sequence reads may provide information on the presence of cancer cells. In one embodiment, short telomeres are indicative of cancer.
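One simple proxy for telomere content in short-read data is the fraction of reads that carry consecutive copies of the human telomeric repeat TTAGGG (or its reverse complement CCCTAA). The sketch below is offered as an assumption-laden illustration rather than as the method of the invention: the minimum of four consecutive repeats and the function name are arbitrary choices, and a read fraction is only an indirect indicator of telomere size.

```python
import re
from typing import Iterable, Pattern

# Four or more consecutive telomeric hexamers on either strand.
# The minimum repeat count of 4 is an illustrative assumption.
TELOMERE_PATTERN = re.compile(r"(?:TTAGGG){4,}|(?:CCCTAA){4,}")


def telomeric_read_fraction(
    reads: Iterable[str],
    pattern: Pattern[str] = TELOMERE_PATTERN,
) -> float:
    """Fraction of reads containing a run of telomeric repeats."""
    total = 0
    telomeric = 0
    for sequence in reads:
        total += 1
        if pattern.search(sequence.upper()):
            telomeric += 1
    # An empty input yields 0.0 rather than a division by zero.
    return telomeric / total if total else 0.0
```

A lower fraction in a sample than in substantially healthy samples would be consistent with shorter telomeres, although any decision threshold would need to be established on reference data.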
Identifying - or assessing the presence of - retrotransposon sequences in said mapped sequence reads may comprise analyzing the number, position and impact of retrotransposon sequences. Examples of retrotransposon sequences include, but are not limited to, short interspersed nuclear elements (such as, e.g., Alu sequences and mammalian-wide interspersed repeats), long interspersed nuclear elements (such as, e.g., LINE1 and LINE2), and long terminal repeats (such as, e.g., HERV, MER4 and retroposons).
The term “epigenetic biomarker” refers to a modification in a nucleic acid, such as in a DNA molecule, by a process or processes that do not change the nucleic acid sequence itself. Examples of epigenetic biomarkers of cancer include, but are not limited to, DNA hypermethylation or hypomethylation (when taken in comparison to a substantially healthy, i.e., non-cancerous, sample), nucleosome footprint and nucleic acid fragment size. Hence, identifying - or assessing the presence of - epigenetic biomarkers of cancer in said mapped sequence reads may comprise analyzing the position, CpG count and methylation status of said mapped sequence reads.
In particular, assessing the presence (or absence) of DNA hypermethylation or hypomethylation in a sample may provide information on cancer-specific methylation status. In one embodiment, the methylation status may be defined relative to a locus or a gene.
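A minimal sketch of how a per-locus methylation status could be derived is given below. It assumes that methylated and unmethylated read counts are available for each locus and that a reference methylation fraction from substantially healthy samples is known; the thresholds `min_delta` and `min_coverage`, the function name and the example loci are all hypothetical.

```python
from typing import Dict, Tuple


def methylation_status(
    sample_counts: Dict[str, Tuple[int, int]],
    reference_fraction: Dict[str, float],
    min_delta: float = 0.2,
    min_coverage: int = 10,
) -> Dict[str, str]:
    """Label loci as hypermethylated, hypomethylated or unchanged.

    sample_counts maps a locus ("chromosome:position") to a pair
    (methylated reads, unmethylated reads). reference_fraction maps the same
    loci to the methylated fraction seen in healthy reference samples.
    Both thresholds are illustrative assumptions.
    """
    status: Dict[str, str] = {}
    for locus, (methylated, unmethylated) in sample_counts.items():
        coverage = methylated + unmethylated
        if coverage < min_coverage or locus not in reference_fraction:
            continue  # too few reads, or no reference value to compare with
        delta = methylated / coverage - reference_fraction[locus]
        if delta >= min_delta:
            status[locus] = "hypermethylated"
        elif delta <= -min_delta:
            status[locus] = "hypomethylated"
        else:
            status[locus] = "unchanged"
    return status


# Toy numbers for illustration only; they are not measured data.
example = methylation_status(
    {"1:1000": (18, 2), "2:2000": (1, 19), "3:3000": (10, 10), "4:4000": (2, 1)},
    {"1:1000": 0.1, "2:2000": 0.8, "3:3000": 0.5, "4:4000": 0.5},
)
```

In this toy example the fourth locus is skipped because its coverage is below the assumed minimum.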
Examples of cancer-specific DNA hypermethylation (i.e., increased presence of methylated nucleotides in cancer cells relative to non-cancer cells) include, but are not limited to, hypermethylation of any one of the following loci: 1:147545131, 1:159010051, 1:184867071, 1:234772479, 1:234772634, 1:9626465, 2:111494677,
2:111699234, 2:2318016, 2:237687894, 2:239052858, 2:80497084, 4:113355678, 4:1494607, 4:19455540, 4:6322902, 4:634860, 5:112329851, 5:179354562, 5:3764427, 5:40841488, 6:10528259, 6:30163104, 6:30163224, 6:30769291, 6:422935, 6:70312472, 7:1177297, 7:158428678, 7:17234713, 7:6059619, 7:75776649, 7:82805693, 8:127795817, 8:142465497, 8:88957006, 10:10795855, 10:11164258, 10:11685287,
10:125201246, 10:130045534, 10:133259456, 10:26642983, 10:26642983, 11:9759865, 12:122898852, 12:76183708, 13:101706760, 13:24511163, 13:24511531, 14:75124193, 15:67150555, 15:69761059, 15:70487007, 16:26488645, 16:80027393, 16:80027393, 16:80027460, 17:79131405, 17:79979217, 17:79979289, 18:74499068, 20:53740372, and/or 21:35049560.
Examples of cancer-specific DNA hypomethylation (i.e., decreased presence of methylated nucleotides in cancer cells relative to non-cancer cells) include, but are not limited to, hypomethylation of any one of the following loci: 1:10673454, 1:182952716, 1:2267440, 1:227561011, 1:227561018, 1:54781467, 2:130213591, 2:208124524, 2:231396296, 2:239309155, 2:86809503, 4:3371279, 4:79964827, 5:135399947, 5:141419191, 5:141478374, 5:42953488, 6:104940793, 6:104953110, 6:104953118, 6:27582968, 6:31728646, 6:32096805, 6:50823489, 7:100656169, 7:149692578, 7:16465977, 7:27141861, 7:561866, 7:771569, 8:102729552, 8:98947202, 10:11275824, 10:127737114, 10:45427926, 11:20596684, 12:54259580, 12:95549131, 13:112773780,
14:20435452, 14:74501506, 15:98760640, 17:19344605, 17:2238547, 17:77373009, 17:7929534, 17:81648522, 17:82836398, 19:1907973, 19:36152168, 19:38211354, 19:4305075, 19:46297345, 21:45425245, 22:46413791, X:118499399, and/or X:126552818.
For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with bladder cancer: DNA hypermethylation at locus 1:147545131, 2:2318016, 4:113355678, 4:1494607, 5:179354562, 6:30163104, 7:6059619, 10:26642983, and/or 15:70487007; and DNA hypomethylation at locus 2:130213591, 2:86809503, 5:135399947, 5:141419191, 5:141419191, 5:42953488, 6:104953110, 6:104953118, 6:50823489, 17:77373009, and/or X:126552818.
For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with breast cancer: DNA hypermethylation at locus 1:234772634, 2:2318016, 2:80497084, 21:35049560, 6:422935, 8:142465497,
10:10795855, 10:11164258, 16:26488645, and/or 17:79131405; and DNA hypomethylation at locus 1:182952716, 1:2267440, 7:100656169, 8:102729552, 14:74501506, 17:82836398, and/or 19:4305075.
For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with colorectal cancer: DNA hypermethylation at locus 1:234772479, 1:234772634, 2:111494677, 4:6322902, 5:112329851, 5:40841488, 6:30769291, 6:70312472, 7:17234713, 7:82805693, 13:101706760, 15:67150555, and/or
15:69761059; and DNA hypomethylation at locus 5:141478374, 6:32096805,
7:27141861, 7:561866, 7:771569, 10:127737114, 11:20596684, 12:95549131,
13:112773780, 15:98760640, 17:19344605, and/or 22:46413791. For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with endometrial cancer: DNA hypermethylation at locus 1:9626465, 4:19455540, 4:634860, 6:10528259, 6:30163104, 6:30163224, 7:1177297, 7:158428678, 8:88957006, 10:130045534, 12:122898852, 13:24511163, 13:24511531, and/or 18:74499068; and DNA hypomethylation at locus 1:227561011, 1:227561018, 1:54781467, 2:208124524, 2:239309155, 5:141419191, 6:104940793, 6:104953110, 6:104953118, 6:27582968, 7:149692578, 10:45427926, 14:20435452, 17:2238547, 17:7929534, 17:81648522, and/or 19:36152168.
For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with kidney cancer: DNA hypermethylation at locus 1:159010051, 1:234772634, 2:111699234, 2:237687894, 7:17234713, 10:11685287, 10:133259456, 12:76183708, 14:75124193, 16:80027393, 16:80027393, and/or 16:80027460; and DNA hypomethylation at locus 1:10673454, 4:79964827, 6:31728646, 10:11275824, 19:1907973, 21:45425245, and/or X:118499399.
For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with lung and bronchus cancer: DNA hypermethylation at locus 1:184867071, 5:3764427, 6:30163224, 6:30163224, 7:17234713, 7:75776649, 8:127795817, 10:125201246, 11:9759865, and/or 20:53740372; and DNA hypomethylation at locus 1:2267440, 1:227561018, 4:3371279, 6:104953110, and/or 8:98947202.
For example, the presence of any one of the following DNA hyper- or hypomethylation is known to be associated with prostate cancer: DNA hypermethylation at locus 2:239052858, 10:26642983, 17:79979217, and/or 17:79979289; and DNA hypomethylation at locus 2:208124524, 2:231396296, 6:104953110, 6:104953118, 7:16465977, 12:54259580, 19:38211354, and/or 19:46297345.
The term “nucleosome footprint” refers to the mapping of nucleosome occupancy, which correlates with nuclear architecture, gene structure and gene expression observed in a given type of cell. Hence, nucleosome footprinting allows identification of the cell type of origin based on the fragmentation pattern of cfNA, expression of genes, presence of mitochondrial DNA, and the like.
Assessing the size - or size distribution - of said mapped sequence reads may provide information on the type of cell death responsible for the release of the cfNAs. In one embodiment, small size-mapped sequence reads are indicative of apoptosis. In one embodiment, large size-mapped sequence reads are indicative of necrosis.
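The size distribution of mapped fragments can be summarized as in the sketch below. The cut-offs of 200 bp for “small” and 1000 bp for “large” fragments are illustrative assumptions, not values taken from the present disclosure, and the function name is hypothetical.

```python
from statistics import median
from typing import Dict, Sequence


def fragment_size_summary(
    lengths: Sequence[int],
    short_max: int = 200,
    long_min: int = 1000,
) -> Dict[str, float]:
    """Median fragment length and fractions of short and long fragments.

    short_max and long_min (in base pairs) are assumed cut-offs chosen only
    for illustration.
    """
    if not lengths:
        raise ValueError("no fragment lengths given")
    n = len(lengths)
    return {
        "median": median(lengths),
        "short_fraction": sum(1 for x in lengths if x <= short_max) / n,
        "long_fraction": sum(1 for x in lengths if x >= long_min) / n,
    }
```

Under the embodiments above, a high short fraction would point towards apoptosis and a high long fraction towards necrosis as the source of the cfNAs.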
The term “transcriptomic biomarker”, as used herein, refers to a nucleic acid fragment, such as RNA fragment, with an identified physical location. A transcriptomic biomarker can be a count of nucleic fragment aligned at a position to represent gene expression level or the determination of alternative transcript expression, or the identification of small RNA of interest like miRNA implicated in gene expression regulation.
The term “metabolic biomarker”, as used herein, refers to the mitochondria quantity. Mitochondria quantity can be readily evaluated by quantifying sequencing reads aligned on the mitochondrial chromosome (chrM). In humans, the mitochondrial chromosome is a closed circular molecule that contains 16,569 bp. Each mitochondrial chromosome in a mitochondrion normally contains a full set of all the mitochondrial genes. On average, a human mitochondrion contains approximately 5 such mitochondrial chromosomes, with a quantity usually ranging from 1 to 15.
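Quantifying the reads aligned on chrM can be sketched as follows. The example assumes that the reference sequence name of each mapped read is available as a string; normalizing by the total number of mapped reads is an assumption made here so that samples of different sequencing depth are comparable, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def mitochondrial_read_fraction(
    read_chromosomes: Iterable[str],
    mito_name: str = "chrM",
) -> float:
    """Fraction of mapped reads that align on the mitochondrial chromosome.

    read_chromosomes yields one reference sequence name per mapped read.
    mito_name depends on the naming used by the reference (e.g. "chrM" or "MT").
    """
    counts = Counter(read_chromosomes)
    total = sum(counts.values())
    # Counter returns 0 for a missing key, so samples without chrM reads give 0.0.
    return counts[mito_name] / total if total else 0.0
```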
The term “metagenomic biomarker”, as used herein, refers to a microbial sequence, such as, e.g, a nucleic acid sequence matching with an archaeal, bacterial, fungal, protist, protozoal, or viral reference genome or a portion thereof. In one embodiment, a “metagenomic biomarker” refers to a nucleic acid sequence matching with a bacterial and/or viral reference genome or a portion thereof.
In particular, assessing the presence (or absence) of pathogenic biomarkers of cancer in a sample may provide information on viral sequences originating, e.g, from cancer-inducing viruses; or on bacterial sequences originating, e.g, from cancer-associated bacteria. Hence, the step of computer-processing the mapped sequence reads, and identifying - or assessing the presence of - pathogenic biomarkers of cancer in said mapped sequence reads is preferably performed on exogenous-mapped sequence reads identified in previous steps of the methods. Examples of cancer-inducing viruses include, but are not limited to, cytomegalovirus (CMV), Epstein-Barr virus (EBV), hepatitis B virus (HBV), hepatitis C virus (HCV), Kaposi’s sarcoma-associated herpesvirus (KSHV, formally known as HHV-8), human immunodeficiency virus (HIV), human papillomavirus (HPV), human T-lymphotropic virus (also known as human T-cell lymphotropic virus or human T-cell leukemia-lymphoma virus, HTLV).
For example, the presence of EBV is known to be associated with Hodgkin’s and non-Hodgkin’s lymphoma, nasopharyngeal cancer, and Burkitt lymphoma. The presence of HBV is known to be associated with hepatocellular carcinoma. The presence of HCV is known to be associated with hepatocellular carcinoma. The presence of HHV-8 is known to be associated with Kaposi sarcoma. The presence of HIV is known to be associated with several cancers. The presence of HPV is known to be associated with endometrial cancer. The presence of HTLV is known to be associated with lymphoma and leukemia. The presence of CMV is known to be associated with colorectal cancer.
Examples of cancer-associated bacteria include, but are not limited to, Bacteroides fragilis, Borrelia burgdorferi, Campylobacter jejuni, Chlamydia pneumoniae, Chlamydia trachomatis, Chlamydophila psittaci, Clostridium spp., Cutibacterium acnes, Helicobacter bilis, Helicobacter bizzozeronii, Helicobacter felis, Helicobacter heilmannii, Helicobacter hepaticus, Helicobacter pylori, Helicobacter salomonis, Helicobacter suis, Mycoplasma, Mycoplasma fermentans, Mycoplasma hyorhinis, Mycoplasma penetrans, Neisseria gonorrhoeae, Opisthorchis viverrini, Salmonella enterica serovar Typhimurium, Salmonella enterica serovar Paratyphi, Salmonella Typhi, Schistosoma haematobium, Streptococcus bovis, and Treponema pallidum.
For example, the presence of Helicobacter hepaticus, Salmonella enterica serovar Typhimurium, Salmonella enterica serovar Paratyphi and/or Opisthorchis viverrini in a sample is known to be associated with bile duct cancer. The presence of Neisseria gonorrhoeae, Cutibacterium acnes and/or Treponema pallidum is known to be associated with prostate cancer. The presence of Neisseria gonorrhoeae, Cutibacterium acnes, Treponema pallidum, Helicobacter bilis, Salmonella Typhi and/or Schistosoma haematobium is known to be associated with bladder cancer. The presence of Bacteroides fragilis, Clostridium spp., Mycoplasma fermentans, Mycoplasma hyorhinis, Mycoplasma penetrans and/or Streptococcus bovis is known to be associated with colorectal cancer. The presence of Chlamydia trachomatis is known to be associated with endometrial cancer. The presence of Chlamydophila psittaci is known to be associated with eye cancer. The presence of Borrelia burgdorferi, Helicobacter bizzozeronii, Helicobacter felis, Helicobacter heilmannii, Helicobacter pylori, Helicobacter salomonis, Helicobacter suis, Mycoplasma fermentans, Mycoplasma hyorhinis and/or Mycoplasma penetrans is known to be associated with gastric cancer. The presence of Chlamydia pneumoniae, Mycoplasma fermentans, Mycoplasma hyorhinis and/or Mycoplasma penetrans is known to be associated with lung cancer. The presence of Mycoplasma fermentans, Mycoplasma hyorhinis and/or Mycoplasma penetrans is known to be associated with ovarian cancer. The presence of Campylobacter jejuni is known to be associated with small intestine cancer.
In one embodiment, biomarkers of cancer are identified in the methods of the invention based on results obtained after sequencing and comparatively analyzing multiple samples labeled as cancer samples and substantially healthy samples (i.e., without any evidence of cancers).
In one embodiment, biomarkers of cancer are identified in the methods of the invention based on known information available in databases.
Examples of such databases include, but are not limited to, the International Nucleotide Sequence Database (at www.insdc.org), GenBank (at www.ncbi.nlm.nih.gov), the European Nucleotide Archive (at www.ebi.ac.uk/ena/browser/home), and the DNA Data Bank of Japan (at www.ddbj.nig.ac.jp). Other suitable examples include, without limitation, 23andMe, 1000 Genomes Project, ArrayExpress, Bioinformatic Harvester, ClinVar, COSMIC, dbSNP, ENCODE, Ensembl, Ensembl Genomes, Gene Disease Database, Gene Expression Omnibus (GEO), GTEx, HapMap, Human Microbiome Project (HMP), Human Protein Atlas (HPA), Online Mendelian Inheritance in Man (OMIM), Personal Genome Project, RefSeq, SNPedia, and TCGA.
In one embodiment, biomarkers of cancer are identified in the methods of the invention using a learning algorithm.
The terms “learning algorithm” or “machine learning algorithm” refer to computer-executed algorithms that automate analytical model building, e.g., for clustering, classification or profile recognition. Learning algorithms perform analyses on training datasets provided to the algorithm. Learning algorithms output a “model”, also referred to as a “classifier”, “classification algorithm” or “diagnostic algorithm”. Models receive, as input, test data and produce, as output, an inference or a classification of the input data as belonging to one or another class, cluster group or position on a scale, such as diagnosis, stage, prognosis, disease progression, responsiveness to a drug, etc. A variety of learning algorithms can be used to infer a condition or state of a subject. Machine learning algorithms may be supervised or unsupervised.
Examples of learning algorithms include, but are not limited to, artificial neural networks (e.g., back propagation networks), discriminant analyses (e.g., Bayesian classifier, Fisher analysis), support vector machines, decision trees (e.g., recursive partitioning processes, such as classification and regression trees [CART]), random forests, linear classifiers (e.g., multiple linear regression [MLR], partial least squares [PLS] regression, principal components regression [PCR]), hierarchical clustering and cluster analysis. The learning algorithm generates a model or classifier that can be used to make an inference, e.g., an inference about a disease state of a subject.
In one embodiment, the learning algorithm was previously trained with at least one training dataset. In one embodiment, the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from at least one reference subject.
In one embodiment, the reference subject is an animal, preferably a mammal.
Examples of mammals include, but are not limited to, humans, non-human primates (such as, e.g, chimpanzees, and other apes and monkey species), farm animals (such as, e.g, cattle, horses, sheep, goats, and swine), domestic animals (such as, e.g, rabbits, dogs, and cats), laboratory animals (such as, e.g, rats, mice and guinea pigs), and the like. The term does not denote a particular age or gender, unless explicitly stated otherwise.
In one embodiment, the reference subject is a primate, including human and non-human primates. In one embodiment, the reference subject is a human.
In one embodiment, the reference subject is a substantially healthy subject.
A “substantially healthy subject” has not been previously or will not be diagnosed or identified as having or suffering from cancer.
In one embodiment, the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from a healthy reference population.
The term “healthy reference population” refers to a group of substantially healthy subjects, either of similar or different origin, ethnical background, gender, age, etc., such as a group of at least 10, preferably at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500 or more substantially healthy subjects.
In one embodiment, the reference subject is a cancer subject.
A “cancer subject” has been previously or will be diagnosed or identified as having or suffering from cancer.
In one embodiment, the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from a cancer reference population.
The term “cancer reference population” refers to a group of cancer subjects, either of similar or different origin, ethnical background, gender, age, etc., such as a group of at least 10, preferably at least 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 450, 500 or more cancer subjects. In one embodiment, the cancer reference population may comprise cancer subjects who have been previously or will be diagnosed or identified as having or suffering from one type of cancer. In one embodiment, the cancer reference population may comprise cancer subjects who have been previously or will be diagnosed or identified as having or suffering from any type of cancer.
According to the present invention, the methods comprise a step of assigning a score to each biomarker of cancer identified in previous steps of the methods.
The term “score” refers to a value computed to summarize multiple results into a single one.
Examples of scoring methods include, but are not limited to, the mean, the median, the sum, or the average of the probabilities of the positive class (p_i) for all features (n) associated with a sample multiplied by the number of positive detected features, as illustrated, e.g., by the following formula:
Figure imgf000047_0001
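The formula itself is provided as an image and is not reproduced here. One possible reading of the accompanying text, offered only as an assumption, is score = n_pos × (1/n) × Σ p_i, where n_pos is the number of positive detected features. The sketch below implements that reading; treating a feature as “positive” when its probability exceeds 0.5 is a further assumption, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def combined_score(
    probabilities: Sequence[float],
    positive_threshold: float = 0.5,
) -> float:
    """Average positive-class probability times the number of positive features.

    probabilities holds p_i for each of the n features of a sample.
    positive_threshold is an assumed cut-off defining a "positive detected
    feature"; it is not specified in the text.
    """
    if not probabilities:
        return 0.0
    n_positive = sum(1 for p in probabilities if p > positive_threshold)
    return mean(probabilities) * n_positive
```

With this reading, a sample with many confidently detected features scores higher than a sample with the same average probability but fewer positive features.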
In one embodiment, the scoring method is applied to each group of biomarkers of cancer independently. In one embodiment, the scoring method is applied to several group of biomarkers of cancer based on their functional impact. In one embodiment, the scoring method is applied to each detected biomarker of cancer.
According to the present invention, the methods comprise a step of classifying the subject’s sample based on the scores assigned to each identified biomarker of cancer in previous steps of the methods. According to the present invention, the methods comprise a step of concluding, based on classification of the subject’s sample: on the probability of the subject to be affected with cancer; or on the diagnosis of cancer in the subject; or on the determination of the origin of a tumor in the subject; or on the determination of a personalized course of treatment for the subject. According to the present invention, the methods may comprise a further step of treating the subject.
As used herein, the terms “treating” or “treatment” or “alleviation” refer to therapeutic treatment, excluding prophylactic or preventative measures; wherein the object is to slow down (lessen) a given disease, such as, e.g., cancer. Those in need of treatment include those already with the disease (such as, e.g., cancer) as well as those suspected to have the disease (such as, e.g., cancer). A subject is successfully “treated” for a given disease (such as, e.g., cancer) if, after receiving a therapeutic amount of a therapeutic agent, said subject shows observable and/or measurable reduction in or absence of one or more of the following: one or more of the symptoms associated with the disease (such as, e.g., cancer); reduced morbidity and mortality; and/or improvement in quality of life issues. The above parameters for assessing successful treatment and improvement in a given disease (such as, e.g., cancer) are readily measurable by routine procedures familiar to a physician.
In one embodiment, treating the subject for cancer is carried out by any of - or a combination of two or more of - surgery, radiation therapy, chemotherapy, activation immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
The term “radiation therapy”, also termed “radiotherapy” and often abbreviated as “RT”, “RTx” or “XRT”, refers to a therapy using ionizing radiation, to control or kill malignant cells. Examples of radiation therapies include, but are not limited to, external beam radiotherapy (such as, e.g, superficial X-rays therapy, orthovoltage X-rays therapy, megavoltage X-rays therapy, radiosurgery, stereotactic radiation therapy, cobalt therapy, electron therapy, fast neutron therapy, neutron-capture therapy, proton therapy, and the like); brachytherapy; unsealed source radiotherapy; tomotherapy; and the like. The term “chemotherapy” refers to a therapy using a chemotherapeutic agent, i.e., any molecule that is effective in inhibiting tumor growth.
Suitable examples of chemotherapeutic agents include those described under subgroup L01 of the Anatomical Therapeutic Chemical Classification System.
Suitable examples of chemotherapeutic agents include, but are not limited to: alkylating agents, such as, e.g. :
nitrogen mustards, including chlormethine, cyclophosphamide, ifosfamide, trofosfamide, chlorambucil, melphalan, prednimustine, bendamustine, uramustine, chlornaphazine, cholophosphamide, estramustine, mechlorethamine, mechlorethamine oxide hydrochloride, novembichin, phenesterine, uracil mustard and the like;
nitrosoureas, including carmustine, lomustine, semustine, fotemustine, nimustine, ranimustine, streptozocin, chlorozotocin, and the like;
alkyl sulfonates, including busulfan, mannosulfan, treosulfan, and the like;
aziridines, including carboquone, thiotepa, triaziquone, triethylenemelamine, benzodopa, meturedopa, uredopa, and the like; hydrazines, including procarbazine, and the like;
triazenes, including dacarbazine, temozolomide, and the like; ethylenimines and methylamelamines, including altretamine, triethylenemelamine, triethylenephosphoramide, triethylenethiophosphoramide, trimethylolomelamine and the like;
and others, including mitobronitol, pipobroman, actinomycin, bleomycin, mitomycins (including mitomycin C, and the like), plicamycin, and the like; acetogenins, such as, e.g., bullatacin, bullatacinone, and the like; benzodiazepines, such as, e.g., 2-oxoquazepam, 3-hydroxyphenazepam, bromazepam, camazepam, carburazepam, chlordiazepoxide, cinazepam, cinolazepam, clonazepam, cloniprazepam, clorazepate, cyprazepam, delorazepam, demoxepam, desmethylflunitrazepam, devazepide, diazepam, diclazepam, difludiazepam, doxefazepam, elfazepam, ethyl carfluzepate, ethyl dirazepate, ethyl loflazepate, flubromazepam, fletazepam, fludiazepam, flunitrazepam, flurazepam, flutemazepam, flutoprazepam, fosazepam, gidazepam, halazepam, iclazepam, irazepine, kenazepine, ketazolam, lorazepam, lormetazepam, lufuradom, meclonazepam, medazepam, menitrazepam, metaclazepam, motrazepam, N-desalkylflurazepam, nifoxipam, nimetazepam, nitemazepam, nitrazepam, nitrazepate, nordazepam, nortetrazepam, oxazepam, phenazepam, pinazepam, pivoxazepam, prazepam, proflazepam, quazepam, QH-II-66, reclazepam, RO4491533, Ro5-4864, SH-I-048A, sulazepam, temazepam, tetrazepam, tifluadom, tolufazepam, triflunordazepam, tuclazepam, uldazepam, arfendazam, clobazam, CP-1414S, lofendazam, triflubazam, girisopam, GYKI-52466, GYKI-52895, nerisopam, talampanel, tofisopam, adinazolam, alprazolam, bromazolam, clonazolam, estazolam, flualprazolam, flubromazolam, flunitrazolam, nitrazolam, pyrazolam, triazolam, bretazenil, climazolam, EVT-201, FG-8205, flumazenil, GL-II-73, imidazenil, 123I-iomazenil, L-655,708, loprazolam, midazolam, PWZ-029, remimazolam, Ro15-4513, Ro48-6791, Ro48-8684, Ro4938581, sarmazenil, SH-053-R-CH3-2′F, cloxazolam, flutazolam, haloxazolam, mexazolam, oxazolam, bentazepam, clotiazepam, brotizolam, ciclotizolam, deschloroetizolam, etizolam, fluclotizolam, israpafant, JQ1, metizolam, olanzapine, telenzepine, lopirazepam, zapizolam, razobazam, ripazepam, zolazepam, zomebazam, zometapine,
premazepam, clazolam, anthramycin, avizafone, rilmazafone, and the like; antimetabolites, such as, e.g. :
antifolates, including aminopterin, methotrexate, pemetrexed, pralatrexate, pteropterin, raltitrexed, denopterin, trimetrexate, pemetrexed, and the like;
purine analogues, including pentostatin, cladribine, clofarabine, fludarabine, nelarabine, tioguanine, mercaptopurine, and the like;
pyrimidine analogues, including fluorouracil, capecitabine, doxifluridine, tegafur, tegafur/gimeracil/oteracil, carmofur, floxuridine, cytarabine, gemcitabine, azacytidine, decitabine, and the like; and
hydroxycarbamide; anti-adrenals, such as, e.g., aminoglutethimide, mitotane, trilostane, and the like; folic acid replenishers, such as, e.g., folinic acid, and the like; maytansinoids, such as, e.g., maytansine, ansamitocins, and the like; platinum analogs, such as, e.g., platinum, carboplatin, cisplatin, dicycloplatin, nedaplatin, oxaliplatin, satraplatin, and the like; trichothecenes, such as, e.g., T-2 toxin, verrucarin A, roridin A, anguidine and the like; taxoids, such as, e.g., cabazitaxel, docetaxel, larotaxel, ortataxel, paclitaxel, tesetaxel, and the like; others, such as, e.g., camptothecin (including the synthetic analogue topotecan); bryostatin; callystatin; CC-1065 (including its adozelesin, carzelesin and bizelesin synthetic analogues); cryptophycins (including cryptophycin 1 and cryptophycin 8); dolastatin; duocarmycin (including its synthetic analogues
KW-2189 and CBI-TMI); eleutherobin; pancratistatin; sarcodictyin; spongistatin; aclacinomycins; anthramycin; azaserine; bleomycin; cactinomycin; carabicin; carminomycin; carzinophilin; chromomycins; dactinomycin; daunorubicin; detorubicin; 6-diazo-5-oxo-L-norleucine; doxorubicin (including morpholino-doxorubicin, cyanomorpholino-doxorubicin, 2-pyrrolino-doxorubicin, deoxydoxorubicin, and the like); epirubicin; esorubicin; idarubicin; marcellomycin; mycophenolic acid; nogalamycin; olivomycins; peplomycin; porfiromycin; puromycin; quelamycin; rodorubicin; streptonigrin; streptozocin; tubercidin; ubenimex; zinostatin; zorubicin; aceglatone; aldophosphamide glycoside; aminolevulinic acid; amsacrine; bestrabucil; bisantrene; edatraxate; defofamine; demecolcine; diaziquone; elfornithine; elliptinium acetate; epothilone; etoglucid; gallium nitrate; hydroxyurea; lentinan; lonidamine; mitoguazone; mitoxantrone; mopidamol; nitracrine; phenamet; pirarubicin; podophyllinic acid; 2-ethylhydrazide; PSK®; razoxane; rhizoxin; sizofiran; spirogermanium; tenuazonic acid; 2,2′,2″-trichlorotriethylamine; urethan; vindesine; dacarbazine; mannomustine; mitobronitol; mitolactol; pipobroman; gacytosine; arabinoside; 6-thioguanine; vinblastine; etoposide; vincristine; vinorelbine; navelbine; novantrone; teniposide; daunomycin; xeloda; ibandronate; CPT-11; topoisomerase inhibitor RFS 2000; topoisomerase I inhibitor SN38; difluoromethylornithine; retinoic acid; and the like.
The term “activation immunotherapy” refers to the artificial stimulation of the immune system to treat cancer, using activation immunotherapeutic agents (or immunostimulatory agents), such as, e.g., monoclonal antibodies, oncolytic viruses, CAR T-cells, dendritic cells, cancer vaccines, cytokines (including interferons and interleukins), and the like. Suitable examples of immunostimulatory agents include, but are not limited to, cytokines (such as, e.g.
, filgrastim, pegfilgrastim, lenograstim, molgramostim, sargramostim, ancestim, albinterferon, interferon alfa natural, interferon alfa 2a, peginterferon alfa-2a, interferon alfa 2b, peginterferon alfa-2b, interferon alfa-n1, interferon alfacon-1, interferon alpha-n3, interferon beta natural, interferon beta 1a, interferon beta 1b, interferon gamma, aldesleukin, oprelvekin, and the like); immune checkpoint inhibitors (such as, e.g., inhibitors of CTLA4, PD-1, PD-L1, LAG-3, B7-H3, B7-H4, TIM3, A2AR, and/or IDO, including nivolumab, pembrolizumab, pidilizumab, AMP-224, MPDL3280A, MDX-1105, MEDI-4736, arelumab, ipilimumab, tremelimumab, pidilizumab, IMP321, MGA271, BMS-986016, lirilumab, urelumab, PF-05082566, IPH2101, MEDI-6469, CP-870,893, mogamulizumab, varlilumab, avelumab, galiximab, AMP-514, AUNP 12, indoximod, NLG-919, INCB024360, and the like); toll-like receptor agonists (such as, e.g., buprenorphine, carbamazepine, ethanol, fentanyl, GS-9620, imiquimod, lefitolimod, levorphanol, methadone, morphine,
(+)-morphine, morphine-3-glucuronide, oxcarbazepine, oxycodone, pethidine, resiquimod, SD-101, tapentadol, tilsotolimod, VTX-2337, glucuronoxylomannan from Cryptococcus, MALP-2 from Mycoplasma, MALP-404 from Mycoplasma, OspA from Borrelia, porin from Neisseria or Haemophilus, hsp60, hemagglutinin, LcrV from Yersinia, bacterial flagellin, lipopolysaccharide, lipoteichoic acid, lipomannan from Mycobacterium, glycosylphosphatidylinositol, lysophosphatidylserine, lipophosphoglycan from Leishmania, zymosan from Saccharomyces,
Pam2CGDPKHPKSF, Pam3CSK4, CpG oligodeoxynucleotides, poly(I:C) nucleic acid sequences, poly(A:U) nucleic acid sequences, double-stranded viral RNA, and the like); STING receptor agonists (such as, e.g., those described in WO2017100305, vadimezan, CL656, ADU-S100, 3’3’-cGAMP, 2’3’-cGAMP, ML RR-S2 CDG, ML RR-S2 cGAMP, cyclic di-GMP, DMXAA, DiABZI, and the like); CD1 ligands; growth hormone; immunocyanin; pegademase; prolactin; tasonermin; female sex steroids; histamine dihydrochloride; poly ICLC; vitamin D; lentinan; plerixafor; roquinimex; mifamurtide; glatiramer acetate; thymopentin; thymosin α1; thymulin; polyinosinic:polycytidylic acid; pidotimod; Bacillus Calmette-Guerin; melanoma vaccine; sipuleucel-T; and the like.
The term “targeted therapy” refers to a therapy using a targeted therapy agent, i.e., any molecule which aims at one or more particular target molecules (such as, e.g, proteins) involved in tumor genesis, tumor progression, tumor metastasis, tumor cell proliferation, cell repair, and the like.
Suitable examples of targeted therapy agents include, but are not limited to, tyrosine-kinase inhibitors, serine/threonine kinase inhibitors, monoclonal antibodies and the like. Suitable examples of targeted therapy agents include, but are not limited to, HER1/EGFR inhibitors (such as, e.g., brigatinib, erlotinib, gefitinib, olmutinib, osimertinib, rociletinib, vandetanib, and the like); HER2/neu inhibitors (such as, e.g., afatinib, lapatinib, neratinib, and the like); C-kit and PDGFR inhibitors (such as, e.g., axitinib, masitinib, pazopanib, sunitinib, sorafenib, toceranib, and the like); FLT3 inhibitors (such as, e.g., lestaurtinib, and the like); VEGFR inhibitors (such as, e.g., axitinib, cediranib, lenvatinib, nintedanib, pazopanib, regorafenib, semaxanib, sorafenib, sunitinib, tivozanib, toceranib, vandetanib, and the like); RET inhibitors (such as, e.g., vandetanib, entrectinib, and the like); c-MET inhibitors (such as, e.g., cabozantinib, and the like); bcr-abl inhibitors (such as, e.g., imatinib, dasatinib, nilotinib, ponatinib, radotinib, and the like); Src inhibitors (such as, e.g., bosutinib, dasatinib, and the like); Janus kinase inhibitors (such as, e.g., lestaurtinib, momelotinib, ruxolitinib, pacritinib, and the like); MAP2K inhibitors (such as, e.g., cobimetinib, selumetinib, trametinib, binimetinib, and the like); EML4-ALK inhibitors (such as, e.g., alectinib, brigatinib, ceritinib, crizotinib, and the like); Bruton’s tyrosine kinase inhibitors (such as, e.g., ibrutinib, and the like); mTOR inhibitors (such as, e.g., everolimus, temsirolimus, and the like); hedgehog inhibitors (such as, e.g., sonidegib, vismodegib, and the like); CDK inhibitors (such as, e.g., palbociclib, ribociclib, and the like); anti-HER1/EGFR monoclonal antibodies (such as, e.g., cetuximab, necitumumab, panitumumab, and the like); anti-HER2/neu monoclonal antibodies (such as, e.g., ado-trastuzumab emtansine, pertuzumab, trastuzumab, trastuzumab-dkst, and the like); anti-EpCAM monoclonal antibodies (such
as, e.g, catumaxomab, edrecolomab, and the like); anti-VEGF monoclonal antibodies (such as, e.g, bevacizumab, bevacizumab-awwb, and the like); anti-CD20 monoclonal antibodies (such as, e.g, ibritumomab, obinutuzumab, ocrelizumab, ofatumumab, rituximab, tositumomab, and the like); anti-CD30 monoclonal antibodies (such as, e.g. , brentuximab, and the like); anti-CD33 monoclonal antibodies (such as, e.g, gemtuzumab, and the like); and anti-CD52 monoclonal antibodies (such as, e.g, alemtuzumab, and the like).
The term “hormone therapy” refers to the artificial manipulation of the endocrine system through exogenous or external administration of specific hormones, in particular steroid hormones, or drugs which inhibit the production or activity of such hormones (i.e., inhibitors of hormone synthesis and hormone receptor antagonists).
Suitable examples of hormones include, but are not limited to, androgens (such as, e.g., androstenediol dipropionate, boldenone undecylenate, clostebol, clostebol acetate, clostebol caproate, clostebol propionate, cloxotestosterone acetate, prasterone, prasterone enanthate, prasterone sulfate, quinbolone, testosterone, testosterone cypionate, testosterone enanthate, testosterone propionate, testosterone undecanoate, testosterone ester mixtures, deposterona, omnadren, sustanon, testoviron depot, androstanolone, androstanolone esters, bolazine capronate, drostanolone propionate, epitiostanol, mepitiostane, mesterolone, metenolone acetate, metenolone enanthate, stenbolone acetate, bolandiol dipropionate, nandrolone decanoate, nandrolone phenylpropionate, norclostebol, norclostebol acetate, oxabolone cypionate, trenbolone acetate, trenbolone hexahydrobenzylcarbonate, bolasterone, calusterone, chlorodehydromethyltestosterone, fluoxymesterone, formebolone, metandienone, methandriol, methandriol bisenanthoyl acetate, methandriol dipropionate, methandriol propionate, methyltestosterone, methyltestosterone 3-hexyl ether, oxymesterone, penmesterol, tiomesterone, androisoxazole, furazabol, mebolazine, mestanolone, oxandrolone, oxymetholone, stanozolol, ethylestrenol, mibolerone, norethandrolone, normethandrone, propetandrol, norvinisterone, danazol, gestrinone, progestins, tibolone, medroxyprogesterone acetate, and the like), progestagens (such as, e.g., progesterone, quingestrone, dydrogesterone, trengestone, acetomepregenol, algestone acetophenide, anagestone acetate, chlormadinone acetate, chlormethenmadinone acetate, cyproterone acetate, delmadinone acetate, flugestone acetate, flumedroxone acetate, hydroxyprogesterone acetate, hydroxyprogesterone caproate, hydroxyprogesterone heptanoate, medroxyprogesterone acetate, megestrol acetate, melengestrol acetate, methenmadinone acetate, osaterone acetate, pentagestrone acetate, proligestone, medrogestone, haloprogesterone,
gestonorone caproate, nomegestrol acetate, norgestomet, segesterone acetate, demegestone, promegestone, trimegestone, danazol, dimethisterone, ethisterone, allylestrenol, altrenogest, dienogest, etynodiol diacetate, lynestrenol, norethisterone, norethisterone acetate, norethisterone enanthate, noretynodrel, norgesterone, norgestrienone, normethandrone, norvinisterone, oxendolone, quingestanol acetate, tibolone, desogestrel, etonogestrel, gestodene, gestrinone, levonorgestrel, norelgestromin, norgestimate, norgestrel, drospirenone, nandrolone and esters thereof, trenbolone and esters thereof, ethylestrenol, norethandrolone, and the like), estrogens (such as, e.g., alfatradiol, testosterone, testosterone esters, methyltestosterone, metandienone, nandrolone esters, norethisterone, noretynodrel, etynodiol diacetate, tibolone, clomestrone, cloxestradiol acetate, conjugated estriol, conjugated estrogens, epiestriol, epimestrol, esterified estrogens, estetrol, estradiol, estradiol acetate, estradiol benzoate, estradiol cypionate, estradiol enanthate, estradiol undecylate, estradiol valerate, polyestradiol phosphate, estradiol ester mixtures, climacteron, estramustine phosphate, estriol, estriol succinate, polyestriol phosphate, estrone, estrone esters, estrone sulfate, estropipate, piperazine estrone sulfate, ethinylestradiol, ethinylestradiol sulfonate, hydroxyestrone diacetate, mestranol, methylestradiol, moxestrol, nilestriol, prasterone, prasterone enanthate, prasterone sulfate, promestriene, quinestradol, quinestrol, benzestrol, bifluranol, chlorotrianisene, dienestrol, dienestrol diacetate, diethylstilbestrol, dimestrol, fosfestrol, mestilbol, doisynoestrol, hexestrol, methallenestril, methestrol dipropionate, paroxypropione, quadrosilan, triphenylbromoethylene, triphenylchloroethylene, zeranol, and the like), and somatostatin analogues (such as, e.g., BIM-23052, CH-275, cortistatin-14, depreotide, edotreotide, ilatreotide, L-803,087, L-817,818, lanreotide,
NNC 26-9100, octreotate, octreotide, pasireotide, pentetreotide, RC-160, seglitide, somatostatin, somatostatin (1-28), SRIF-14, SRIF-28, TT-232, vapreotide, veldoreotide, and the like). Suitable examples of inhibitors of hormone synthesis and hormone receptor antagonists include, but are not limited to, anti-estrogens (such as, e.g., tamoxifen, raloxifene, aromatase inhibiting 4(5)-imidazoles, 4-hydroxytamoxifen, trioxifene, keoxifene, LY117018, onapristone, toremifene, and the like), and anti-androgens (such as, e.g., flutamide, nilutamide, bicalutamide, leuprolide, goserelin, and the like).
The term “stem cell transplant” refers to a transplantation of stem cells (either autologous or allogenic) aiming at replacing or reinforcing pre-existing bone marrow cells that may have been partially or totally destroyed by cancer or by therapy.
The present invention also relates to a computer system for estimating the probability of a subject to be affected with cancer.
It also relates to a computer system for diagnosing cancer in a subject in need thereof.
It also relates to a computer system for determining the origin of a tumor in a subject in need thereof.
It also relates to a computer system for determining a personalized course of treatment in a subject affected with cancer.
As used herein, the term “computer system” refers to any and all devices capable of storing and processing information and/or capable of using the stored information to control the behavior or execution of the device itself, regardless of whether such devices are electronic, mechanical, logical, or virtual in nature. The term “computer system” can refer to a single computer, but also to a plurality of computers working together to perform the function described as being performed on or by a computer system.
In one embodiment, the computer system according to the present invention comprises:
(i) a processor, and
(ii) a storage medium that stores code readable by the processor.
As used herein, the term “processor” is meant to include any integrated circuit or other electronic device capable of performing an operation on at least one instruction word, such as, e.g., executing instructions, codes, computer programs, and scripts which it accesses from a storage medium.
Examples of processors include, but are not limited to, central processing units (CPUs), microprocessors, digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), and other equivalent integrated or discrete logic circuitry.
In one embodiment, the code stored on the storage medium, when executed by the processor, causes the computer system to:
a. optionally, receive at least one raw sequencing signal from a sequencing experiment of nucleic acids, preferably cfNAs previously extracted from a sample from the subject, as described hereinabove;
b. optionally, base-call and demultiplex said at least one raw sequencing signal, thereby obtaining at least one sequence read or a plurality of sequence reads;
c. assign said at least one sequence read or the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining at least one mapped sequence read or a plurality of mapped sequence reads, as described hereinabove;
d. identify - or assess the presence of - genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer in said mapped sequence reads, as described hereinabove;
e. derive a probability score via at least one machine learning algorithm;
f. generate an output, wherein the output is the classification label or the probability score; and
g. estimate the probability of the subject to be affected with cancer based on the output; or diagnose cancer in the subject based on the output; or determine the origin of a tumor in the subject based on the output; or determine a personalized course of treatment for the subject based on the output.
In one embodiment, the learning algorithm was previously trained with at least one training dataset, as described hereinabove. In one embodiment, the training dataset comprises information relating to genetic, epigenetic, metagenomic and/or pathogenic biomarkers of cancer from samples obtained from at least one reference subject, as described hereinabove.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a flowchart illustrating the bioinformatic steps of the methods carried out after the sequencing step.
Figures 2A-B represent the “Silico mix1” description. Figure 2A shows the ratio of THP1 DNA mixed into HeLa DNA for all samples, including controls (samples without THP1 DNA mixed into HeLa DNA); Figure 2B shows the distribution of simulated depth for controls (annotated as 0) and THP1-positive samples (annotated as 1).
Figures 3A-B represent the “Silico mix2” description. Figure 3A shows the ratio of THP1 or HeLa DNA mixed into normal plasma DNA from a healthy donor for all samples, including controls (samples without HeLa or THP1 DNA); Figure 3B shows the simulated depth for controls (annotated as 0) and THP1- or HeLa-positive samples (annotated as 1).
Figures 4A-D represent the methylation biomarkers analysis. Figure 4A shows the distribution of the methylation score regarding ratio of abnormal DNA in the sample from “Silico mix1”. Figure 4B shows the ROC analysis of methylation scores that illustrates the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix1”. Figure 4C shows the distribution of the methylation score regarding ratio of abnormal DNA in the sample from “Silico mix2”. Figure 4D shows the ROC analysis of methylation scores that illustrates the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix2”.
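For readers unfamiliar with ROC analysis, the area under the ROC curve equals the probability that a randomly chosen positive sample (with abnormal DNA) receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy scores are illustrative assumptions, not data or code from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from the pairwise ranking definition.

    `labels` uses 1 for samples with abnormal DNA and 0 for controls,
    matching the annotation used in the figures.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute an AUC")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above the control
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/control pairs are correctly ordered.
toy_auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

A value of 0.5 corresponds to a score with no discriminating power, and 1.0 to a score that separates the two groups perfectly.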
Figures 5A-D represent the variants biomarkers analysis. Figure 5A shows the distribution of the variant score regarding ratio of abnormal DNA in the sample from “Silico mix1”. Figure 5B shows the ROC analysis of variant scores that illustrates the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix1”. Figure 5C shows the distribution of the variant score regarding ratio of abnormal DNA in the sample from “Silico mix2”. Figure 5D shows the ROC analysis of variant scores that illustrates the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix2”.
Figures 6A-D represent the nucleosome footprint biomarkers analysis. Figure 6A shows the distribution of the nucleosome score regarding ratio of abnormal DNA in the sample from “Silico mix1”. Figure 6B shows the ROC analysis of nucleosome scores that illustrates the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix1”. Figure 6C shows the distribution of the nucleosome score regarding ratio of abnormal DNA in the sample from “Silico mix2”. Figure 6D shows the ROC analysis of nucleosome scores that illustrates the diagnostic ability of the score to discriminate samples with and without abnormal DNA in “Silico mix2”.
Figure 7 represents the distribution of the transposons score regarding ratio of abnormal DNA in the sample for “Silico mix2”.
Figure 8 represents the distribution of the telomere length score regarding ratio of abnormal DNA in the sample for “Silico mix2”.
Figure 9 represents the biomarker performances to discriminate samples. Performance for all samples (left panel) and samples with “abnormal” DNA ratio below 5 % (right panel) are shown.
Figures 10A-B represent the detection performance of THP1 “abnormal” DNA into “Silico mix1” after classification of samples based on different combinations of 1 to 3 different biomarkers (M: methylation; V: variant; N: nucleosome footprint). Figure 10A shows the balanced accuracy. Figure 10B shows the recall rate.
Figure 11 represents the performance of “abnormal” DNA detection for THP1 and HeLa samples in the two silico mixes for different biomarkers’ combinations (M: methylation; V: variant; N: nucleosome footprint).
Figures 12A-C represent the detection performances of THP1 or HeLa “abnormal” DNA into “Silico mix2” after classification of samples based on different combinations of 1 to 6 different biomarkers (M: methylation; V: variant; N: nucleosome footprint; T: transposon; Mi: mitochondria; Tel: telomeres length). Figure 12A shows the balanced accuracy. Figure 12B shows the precision rate. Figure 12C shows the recall rate.
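The three performance measures named in these figures have standard definitions in terms of true/false positives and negatives. The following is a minimal sketch of those standard definitions; the function name `classification_metrics` and the toy labels are illustrative assumptions and do not reproduce the study's code or results.

```python
def classification_metrics(y_true, y_pred):
    """Balanced accuracy, precision and recall for binary labels (1 = abnormal DNA)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Recall (sensitivity): fraction of abnormal samples that are detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of controls correctly called negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: fraction of positive calls that are truly abnormal.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
    }


# Toy example: 3 abnormal samples and 5 controls.
toy_metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
)
```

Balanced accuracy averages sensitivity and specificity, which makes it less sensitive than plain accuracy to an unequal number of controls and positive samples.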
Figures 13A-B represent the quantification of “abnormal” DNA into “Silico mix1”. Figure 13A shows the correlation factor (Pearson) for different biomarkers combinations (M: methylation; V: variant; N: nucleosome footprint). Figure 13B is the correlation plot for the best biomarkers’ combination.
Figures 14A-B represent the quantification of “abnormal” DNA into “Silico mix1”. Figure 14A shows the correlation factor (Pearson) for different biomarkers combinations for both HeLa DNA (left panel) and THP1 DNA (right panel). Figure 14B shows the correlation plot for the best biomarkers’ combinations for both HeLa DNA and THP1 DNA.
Figure 15 represents the ROC curve of sample classification accuracy, based on the type of abnormal DNA.
Figure 16 represents the throughput obtained from each sequencing test of cfDNA. Test1 corresponds to cfDNA obtained from in vitro culture of cell lines. Test2 was performed using the same protocol used for the in vitro test (default Nanopore protocol). Test3 was done by adapting the beads over DNA ratio to optimize the capture of small reads.
Figure 17 represents the percentage of reads with a length below 1,000 bp from each sequencing test of cfDNA. Test1 corresponds to cfDNA obtained from in vitro culture of cell lines. Test2 was performed using the same protocol used for the in vitro test (default Nanopore protocol). Test3 was done by adapting the beads over DNA ratio to optimize the capture of small reads.
Figures 18A-C represent the description of sequencing data obtained from DNA extracted with two different commercial kits. Figure 18A shows the sequencing throughput in reads count. Figure 18B shows the percentage of small reads (size under 1,000 bp). Figure 18C shows the quality of reads estimated by mean BASEQ of sequenced nucleotides.
Figures 19A-D represent the reads size distribution for all samples sequenced in test3, done by adapting the beads over DNA ratio to optimize the capture of small reads. Figure 19A: sample 1. Figure 19B: sample 2. Figure 19C: sample 3. Figure 19D: sample 4.
Figure 20 represents the methylated fraction of CpG for all samples sequenced in test3, done by adapting the beads over DNA ratio to optimize the capture of small reads.
Figure 21 represents the correlation between methylation frequencies in all samples sequenced in test3, done by adapting the beads over DNA ratio to optimize the capture of small reads. Methylation frequencies indicate the methylation status of the position: a value above 0.5 indicates that the site is more frequently methylated. First row and column: sample 1; second row and column: sample 2; third row and column: sample 3; fourth row and column: sample 4.
Figures 22A-C represent the nucleosome analysis for all samples sequenced in test3, done by adapting the beads over DNA ratio to optimize the capture of small reads. Figure 22A shows the number of regions that have a large coverage and could indicate nucleosomes. Figure 22B shows the proportion of nucleosomes found in all other samples. Figure 22C shows the proportion of nucleosomes found in common between samples 2 and 3, which have the same count of total identified nucleosomes.
Figure 23 represents the expression of nucleosomes in samples sequenced in test3, done by adapting the beads over DNA ratio to optimize the capture of small reads. Expression is described by the mean reads depth at nucleosome position. First row and column: sample 1; second row and column: sample 2; third row and column: sample 3; fourth row and column: sample 4.
EXAMPLES
The present invention is further illustrated by the following examples.
Example 1: In vitro cultures
The goal of this study was to validate in vitro the ability of the Nanopore® device to sequence cfDNA extracted from cell cultures. We cultured in vitro two cell lines of different origins, HeLa and THP1, and sequenced them on Nanopore®. The reads were then mixed in silico to create artificial samples with increasing ratios of abnormal DNA (THP1) mixed into background DNA (HeLa). Two data sets were generated using one or two different cell lines. On the generated samples, we analyzed different biomarkers to evaluate whether the presence of abnormal DNA can be detected and quantified. We also used biomarkers to try to discriminate between two different abnormal DNAs.
Aims
To develop extraction procedures from liquid medium containing cfDNA.
To develop shotgun sequencing library protocols of the whole cfDNA without DNA amplification on a Nanopore® sequencing device.
To compare the genetic and epigenetic background of cfDNA to that of gDNA, thereby assessing the ability of cfDNA to reflect cell status.
Material & Methods
Cells culture and cell-free DNA extraction
Two cell lines, THP1 and HeLa, were cultivated in vitro in 75 cm² flasks with 12 mL of dedicated medium. When cells reached confluence, the medium was changed once. Medium was collected after 6 hours and 24 hours and centrifuged for 5 minutes at 1,200 g to remove all viable cells. At 24 hours, cells were pelleted with a second centrifugation at 1,200 g. Medium and cell pellets were stored at -80°C until DNA extraction. Cell-free DNA (cfDNA) was extracted directly from the culture medium using the QIAamp Circulating Nucleic Acid Kit (Qiagen®). cfDNA quality was validated by quantification on a Qubit™ fluorometer (Invitrogen®). Genomic DNA (gDNA) was extracted from the cell pellet using the DNeasy Blood & Tissue Kit (Qiagen®). gDNA quality was validated by quantification on a Qubit™ fluorometer (Invitrogen®).
Cell-free DNA extraction from plasma samples
Plasma samples, stored at -80°C, were thawed at room temperature. Cell-free DNA was extracted from 200 µL of plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen®).
Sequencing library preparation
After extraction of cfDNA and gDNA from each cell line, shotgun sequencing libraries were prepared from 50 ng of DNA using library kit SGK-LSK009 (Nanopore®). The protocol was adapted to select short reads. Several samples were multiplexed using native barcoding kit EXP-NBD104 (Nanopore®) and sequenced on a MinION or GridION device (Nanopore®). Sequencing was performed for 48 hours, until all pores were inactivated. The raw signal was base-called and demultiplexed with high accuracy using the latest version of Guppy.
In silico samples generation
Reads generated by sequencing the two cell lines were mixed in silico to combine DNA from different origins. A first group of samples, hereafter named “Silico mix1”, was generated by mixing THP1 reads into a background of HeLa reads at different ratios. A second group of samples, hereafter named “Silico mix2”, was generated by mixing either THP1 reads or HeLa reads into “normal” cfDNA reads obtained from the sequencing of cfDNA extracted from a healthy donor.
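By way of non-limiting illustration, the in silico mixing described above can be sketched as follows. The function name `mix_reads`, the read identifiers and the sampling without replacement are assumptions made for the example; the text does not specify how reads were drawn.

```python
import random

def mix_reads(background, abnormal, abnormal_ratio, total_reads, seed=0):
    """Draw reads without replacement so that `abnormal_ratio` of the mix is abnormal."""
    if not 0.0 <= abnormal_ratio <= 1.0:
        raise ValueError("abnormal_ratio must be between 0 and 1")
    rng = random.Random(seed)
    n_abnormal = round(total_reads * abnormal_ratio)
    n_background = total_reads - n_abnormal
    mixed = rng.sample(abnormal, n_abnormal) + rng.sample(background, n_background)
    rng.shuffle(mixed)  # avoid any ordering by origin
    return mixed

# Hypothetical read identifiers standing in for real sequence reads.
hela_reads = [f"hela_read_{i}" for i in range(1000)]
thp1_reads = [f"thp1_read_{i}" for i in range(1000)]

# A "Silico mix1"-like sample: 5 % THP1 reads in a HeLa background.
sample = mix_reads(hela_reads, thp1_reads, abnormal_ratio=0.05, total_reads=200)
```

Varying `total_reads` at a fixed ratio would be one way to simulate the different sequencing depths mentioned in the results.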
Data analysis
Sequence reads from each sample were then analyzed using the workflow illustrated in Figure 1. Reads were filtered based on their quality, estimated by the BASEQ score: reads with a global quality below 10 were removed with NanoFilt. High-quality reads were then aligned to the human genome (version hg38 - RefSeq assembly accession: GCF_000001405.26) with minimap2. From these alignment results, several biomarkers were analyzed:
- methylation of reads aligned to the human genome was evaluated with NanoPolish by integrating the raw sequencing signal. Methylated cytosines were identified in CpG islands by NanoPolish, and methylation confidence was evaluated by a log ratio score;
- the presence of nucleosomes was identified from the alignment files: over-covered regions were identified by analyzing the coverage at each genome position, and corresponded to regions where a nucleosome was present (nucleosome footprint);
- variants were analyzed by a genotyping approach. The base composition at each position was analyzed using samtools, and target variants were searched for directly in these results. For this step, we did not use variant caller algorithms;
- mitochondria quantity was evaluated by computing the alignment depth on the mitochondrial chromosome;
- telomere length was evaluated by searching for the telomere pattern in the reads using the TelSeq tool;
- transposons were searched for in the alignment file. Long insertions and deletions were identified using sniffles, and transposon-like abnormalities were identified using the TLDR tool.
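By way of non-limiting illustration, the read quality filter that opens this workflow can be sketched as follows. The sketch assumes Phred+33 encoded quality strings and averages quality in error-probability space; both are assumptions about how a "global quality" might be computed, and the function names are hypothetical rather than part of any tool cited above.

```python
import math

def mean_read_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over per-base error probabilities."""
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))

def keep_read(quality_string, min_quality=10.0):
    """Apply the threshold of 10 stated in the text."""
    return mean_read_quality(quality_string) >= min_quality

# '5' encodes Q20 and '!' encodes Q0 in Phred+33.
good_read_quality = "5555"
poor_read_quality = "5!5!"
kept = [q for q in (good_read_quality, poor_read_quality) if keep_read(q)]
```

Averaging error probabilities rather than raw Phred values penalizes reads with a few very poor bases, which is why the second read is rejected even though half of its bases are Q20.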
Identification of specific biomarkers
Specific biomarkers were identified by comparing all biomarkers from the different DNA sequencing results. Transposons, methylated CpG islands, genetic variants and nucleosome footprints were analyzed to find biomarkers specific to a cell type. Two sets of biomarkers were identified, corresponding to the two simulated groups of samples (“Silico mix1” and “Silico mix2”). A first comparison identified THP1-specific biomarkers by comparing THP1 and HeLa results: biomarkers present in the THP1 sample but absent from the HeLa samples were selected. A second set of biomarkers was identified by comparing both THP1 and HeLa results with normal human DNA. The count of biomarkers in each group is summarized in Table 3.
Table 3: Count of specific biomarkers used for each in silico simulated sample group (table provided as image imgf000065_0001).
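By way of non-limiting illustration, the selection of cell-type-specific biomarkers ("present in one sample but absent from the others") amounts to a set difference. The biomarker identifiers below are invented for the example, and the helper name `specific_biomarkers` is hypothetical.

```python
def specific_biomarkers(target, *others):
    """Return biomarkers seen in `target` and in none of the `others`."""
    excluded = set().union(*others)
    return set(target) - excluded

# Invented identifiers: position plus biomarker category.
thp1 = {"chr1:1000:variant", "chr2:500:methylation", "chr7:42:transposon"}
hela = {"chr2:500:methylation", "chr9:77:variant"}
normal = {"chr7:42:transposon"}

# First set: THP1 against a HeLa background ("Silico mix1").
thp1_vs_hela = specific_biomarkers(thp1, hela)

# Second set: each cell line against normal human DNA ("Silico mix2").
thp1_vs_normal = specific_biomarkers(thp1, normal)
hela_vs_normal = specific_biomarkers(hela, normal)
```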
Prediction of the presence of “abnormal” DNA in samples
Each biomarker was tracked back into each sample. For each category of biomarkers, a unique score was computed by integrating the results of each biomarker.
The methylation score was computed from the log-likelihood ratio of each read that displayed a CpG of interest. The mean of the positive log-likelihood ratios only was computed and weighted based on the depth at the position. The sum of all weighted means was then computed, and the score was finally weighted using the global depth of the sample.
The variant score was computed from the specific detection of variants in the alignment files. The variants found were counted, and the count was normalized using the global depth of the sample.
The nucleosome footprint was determined by the presence of high-coverage clusters on the reference genome. Coverage depth was evaluated for each cluster specific to a cell line and then weighted by the global depth of the sample. The normalized depths of all clusters were finally summed.
The transposon score was determined with the same approach as the variant score: the transposons identified in the alignment file were counted, and the count was normalized by the sequencing depth of the sample.
The mitochondria score is a quantification of reads aligned to chrM, normalized by the sequencing depth of the sample. The telomere score derives from the estimation of telomere length by searching for “TTAGGG” motifs (SEQ ID NO: 1) in aligned reads.
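By way of non-limiting illustration, one possible reading of the methylation score is sketched below. The text does not give the exact weighting, so two assumptions are made: the per-site weight is the fraction of reads at the site with a positive log-likelihood ratio, and the final weighting is a division by the global depth. The function name and input layout are hypothetical.

```python
def methylation_score(llr_by_site, global_depth):
    """Sum of depth-weighted means of positive log-likelihood ratios, per unit depth."""
    total = 0.0
    for llrs in llr_by_site.values():
        positives = [x for x in llrs if x > 0]
        if not positives:
            continue  # no read supports methylation at this site
        mean_positive = sum(positives) / len(positives)
        # Assumed weight: share of the site's reads that support methylation.
        total += mean_positive * len(positives) / len(llrs)
    return total / global_depth

# Invented per-read log-likelihood ratios at two CpG sites of interest.
llr_by_site = {
    "chr1:100": [2.0, 4.0, -1.0, -3.0],
    "chr2:50": [-2.0],
}
score = methylation_score(llr_by_site, global_depth=3.0)
```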
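By way of non-limiting illustration, the nucleosome footprint score can be sketched as follows. The minimum depth used to call a high-coverage cluster and the use of the mean depth within each cluster are assumptions; the text only states that cluster depths are normalized by the global depth and summed.

```python
def find_clusters(coverage, min_depth):
    """Return half-open (start, end) intervals where coverage stays >= min_depth."""
    clusters, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            clusters.append((start, pos))
            start = None
    if start is not None:
        clusters.append((start, len(coverage)))
    return clusters

def nucleosome_score(coverage, clusters, global_depth):
    """Sum over clusters of the mean cluster depth divided by the global depth."""
    score = 0.0
    for start, end in clusters:
        mean_depth = sum(coverage[start:end]) / (end - start)
        score += mean_depth / global_depth
    return score

# Invented per-position coverage along a short stretch of the reference.
coverage = [1, 1, 5, 6, 5, 1, 0, 4, 4]
clusters = find_clusters(coverage, min_depth=4)
score = nucleosome_score(coverage, clusters, global_depth=2.0)
```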
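By way of non-limiting illustration, the mitochondria and telomere scores can be approximated as follows. The telomere part is a simplified proxy (fraction of reads carrying at least a minimum number of TTAGGG repeats, on either strand) and is not the algorithm of the TelSeq tool; the repeat threshold and all function names are assumptions.

```python
TELOMERE_MOTIF = "TTAGGG"
REVERSE_COMPLEMENT = "CCCTAA"

def count_telomere_repeats(read):
    """Count motif occurrences on whichever strand shows more of them."""
    return max(read.count(TELOMERE_MOTIF), read.count(REVERSE_COMPLEMENT))

def telomeric_read_fraction(reads, min_repeats=4):
    """Fraction of reads with at least `min_repeats` telomere motifs (assumed proxy)."""
    if not reads:
        return 0.0
    telomeric = sum(1 for r in reads if count_telomere_repeats(r) >= min_repeats)
    return telomeric / len(reads)

def mitochondria_score(chrm_read_count, global_depth):
    """Reads aligned to chrM, normalized by the sequencing depth of the sample."""
    return chrm_read_count / global_depth

# Invented reads: two telomeric, two not.
reads = ["TTAGGG" * 5, "ACGT" * 10, "CCCTAA" * 4, "TTAGGGAC"]
telomere_fraction = telomeric_read_fraction(reads)
mito = mitochondria_score(chrm_read_count=120, global_depth=4.0)
```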
To detect the presence of “abnormal” DNA in a sample, we used all the scores described above in a logistic regression. The model was trained on 80 % of the dataset, and validation, as well as performance evaluation, was done on the remaining 20 % of the dataset. The quantification of the “abnormal” DNA was performed identically, using the same scores but with a linear regression. As described previously, training was performed on 80 % of the dataset and validation on the remaining 20 %.
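By way of non-limiting illustration, a logistic regression trained on 80 % of a dataset and evaluated on the remaining 20 % can be sketched in plain Python. The gradient-descent fit, the learning rate, the synthetic three-score dataset and all names are assumptions for the example; the text does not state which implementation was used.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(model, xi):
    w, b = model
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

def split_80_20(X, y, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

# Synthetic scores (e.g. methylation, variant, nucleosome); label 1 = abnormal DNA present.
rng = random.Random(1)
X, y = [], []
for i in range(100):
    label = i % 2
    X.append([rng.gauss(2.0 * label, 0.3), rng.gauss(1.5 * label, 0.3),
              rng.gauss(3.0 * label, 0.3)])
    y.append(label)

X_train, y_train, X_test, y_test = split_80_20(X, y)
model = train_logistic(X_train, y_train)
accuracy = sum(predict(model, xi) == yi for xi, yi in zip(X_test, y_test)) / len(y_test)
```

The same split could feed a linear regression on the known mixing ratio for the quantification step.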
Results
In silico samples description
Two sets of data were generated in silico. In the first set (“Silico mix1”), THP1 DNA was mixed into HeLa DNA at various ratios ranging from 0.66 % to 50 % (Fig. 2A). Several sequencing depths were simulated to assess the detection threshold of the methods (Fig. 2B). A second group of samples was generated by mixing either THP1 or HeLa DNA into normal human DNA obtained from a healthy donor (“Silico mix2”). Ratios ranged from 1 % to 20 % for each cell line (Fig. 3A). Various depths were also simulated (Fig. 3B).
Read sizes obtained after sequencing
We analyzed the size of the sequenced reads. For all samples, we observed a read size pattern that oscillated every 164 bp, a length that corresponds to a chromatosome. This length pattern indicates that the sequenced DNA resulted from apoptosis of cells, which is the main origin of cfDNA, and confirms that our model enables the study of cfDNA from the cultivated cell lines.
Biomarkers analysis in silico samples
Methylation analysis in silico samples
The methylation score was computed for each sample of each silico mix group. The score ranged from 0 to 6 and was correlated with the ratio of abnormal DNA in all samples. The ROC analysis of the score showed that it was a good tool for discriminating samples with or without abnormal DNA (Fig. 4):
- for “Silico mix1” (Fig. 4A-B): a threshold at 0.54 enabled high-accuracy discrimination of samples, with a false positive rate (FPR) of 0.00 and a true positive rate (TPR) of 0.83;
- for “Silico mix2” (Fig. 4C-D): the accuracy was lower because HeLa samples were more difficult to discriminate from negative samples. However, a threshold at 1.83 enabled high-accuracy discrimination of samples, with a FPR of 0.33 and a TPR of 0.68.
Variants analysis in silico samples
The variant score was computed for each sample of each silico mix group. The score ranged from 0 to 1.6 and was correlated with the ratio of abnormal DNA in all samples. The ROC analysis of the score showed that it was a good tool for discriminating samples with or without abnormal DNA (Fig. 5):
- for “Silico mix1” (Fig. 5A-B): a threshold at 0.22 enabled high-accuracy discrimination of samples, with a FPR of 0.2 and a TPR of 0.6;
- for “Silico mix2” (Fig. 5C-D): the accuracy was lower because HeLa samples were more difficult to discriminate from negative samples. However, a threshold at 0.22 enabled high-accuracy discrimination of samples, with a FPR of 0.5 and a TPR of 0.82.
In all samples, the variant score showed lower performance than the methylation score (Fig. 4A and 4B). One limitation was the recall, which was lower with this biomarker than with the methylation score. For samples with a ratio of abnormal DNA below 5 %, the recall decreased from 0.91 to 0.38 in “Silico mix1” and from 0.73 to 0.62 in “Silico mix2”. This resulted in low sensitivity for the detection of samples with a low proportion of abnormal DNA.
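By way of non-limiting illustration, the true and false positive rates reported at a given score threshold can be computed as follows. The text does not say how thresholds were selected; maximizing Youden's J (TPR minus FPR) is used here purely as an assumption, and the scores and labels are invented.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when samples scoring >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg

def best_threshold(scores, labels):
    """Pick the observed score that maximizes Youden's J = TPR - FPR (assumed criterion)."""
    def youden(threshold):
        tpr, fpr = rates_at_threshold(scores, labels, threshold)
        return tpr - fpr
    return max(sorted(set(scores)), key=youden)

# Invented scores: label 1 = sample contains abnormal DNA, label 0 = control.
scores = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.9, 1.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = best_threshold(scores, labels)
tpr, fpr = rates_at_threshold(scores, labels, threshold)
```

Sweeping the threshold over all observed scores and plotting TPR against FPR yields the ROC curve itself.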
Nucleosome footprint analysis in silico samples
The nucleosome score was computed for each sample of each silico mix group. The score ranged from 0 to 80 and was correlated with the ratio of abnormal DNA in all samples. The ROC analysis of the score showed that it was a good tool for discriminating samples with or without abnormal DNA (Fig. 6):
- for “Silico mix1” (Fig. 6A-B): a threshold at 7.11 enabled high-accuracy discrimination of samples, with a FPR of 0.13 and a TPR of 0.83;
- for “Silico mix2” (Fig. 6C-D): the accuracy was lower because HeLa samples were more difficult to discriminate from negative samples. However, a threshold at 60.5 enabled high-accuracy discrimination of samples, with a FPR of 0.25 and a TPR of 0.85.
Transposon analysis in silico samples
Transposons were searched for in all samples of the “Silico mix2”. There were not enough specific transposon biomarkers for each cell line to perform correlation or ROC analysis, but the presence of a transposon was observed to be highly specific to the presence of “abnormal” DNA (Fig. 7).
Telomere length analysis in silico samples
The telomere length was computed for each sample of the “Silico mix2” and the results were compared to negative controls. HeLa samples had shorter telomeres and THP1 samples had longer telomeres than negative controls (Fig. 8).
Identification of “abnormal DNA” by the integration of several biomarkers
All scores evaluated independently enabled the discrimination of samples based on the presence of “abnormal” DNA. For most biomarkers, the discrimination accuracy was good. In most cases, the precision of detection was better than the recall. The precision was not impacted by a decrease in the ratio of “abnormal” DNA: precision was above 80 % for 3 of the 4 biomarkers. The recall was more impacted by the quantity of “abnormal” DNA (Fig. 9).
All these biomarkers were used with a machine learning model to differentiate samples with “abnormal” DNA from controls. The dataset was split into a training set and a testing set. The model was trained on the training set, and performance was evaluated by predicting the testing set. This procedure was repeated 50 times to avoid bias due to the random selection of samples from a small dataset. Median balanced accuracy, recall and precision were returned (Fig. 10).
On the “Silico mix1”, the use of a single biomarker in our models reached the same performance as the ROC analysis, with an accuracy ranging between 0.77 and 0.83 (Fig. 10A). Increasing the number of biomarkers improved the detection accuracy to 0.83-0.87 (Fig. 10A, hatched bars) for two or more biomarkers. The best accuracy was observed for the two biomarkers nucleosome footprint and methylation. The use of the variant score did not improve the detection performance. The recall (Fig. 10B) also increased compared to the ROC analysis when several biomarkers were used, with the best recall obtained with the two biomarkers nucleosome footprint and methylation.
For the “Silico mix2”, the accuracy for HeLa detection was limited when using only the 3 biomarkers used previously (Fig. 11). This observation suggests that the most suitable biomarkers could differ depending on the DNA being looked for. In this group of samples, the best accuracy score for the two cell lines was obtained when we integrated more scores in the model (Fig. 12A). For both HeLa and THP1 abnormal DNA, the best accuracy was 0.86 and was obtained from the 6 biomarkers methylation, nucleosome footprint, variants, transposon, mitochondria and telomere length. The precision was maximal with the same 6 biomarkers for both THP1 and HeLa (Fig. 12B). However, the recall was better when using fewer biomarkers: in HeLa, the best recall was obtained with only the nucleosome footprint and methylation scores and, in THP1, the best recall was obtained with only the variant and methylation scores, suggesting that the addition of information with new biomarkers can increase background noise for negative samples (Fig. 12C).
Quantification of “abnormal DNA” by the integration of several biomarkers
A model was finally created to quantify the “abnormal” DNA present in a sample. For the two silico mixes, the addition of biomarkers increased the correlation between the expected and predicted ratios of “abnormal” DNA. In the first group of samples, “Silico mix1”, the performance ranged from 0.67 to 0.93. The best quantification was obtained with three biomarkers: methylation, variant and nucleosome footprint (Fig. 13A). The correlation plot is shown in Fig. 13B.
In the second group of samples, “Silico mix2”, the quantification performance was limited for HeLa with the three biomarkers previously used to quantify THP1. However, the addition of new biomarkers increased the quantification performance, with a correlation factor that rose to 0.84 with the 6 biomarkers previously used for “abnormal” DNA detection (Fig. 14A). The correlation plot is shown in Fig. 14B.
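By way of non-limiting illustration, the repeated train/test evaluation can be sketched as follows. A single-score midpoint threshold stands in for the machine learning model, and the split is stratified so that both classes appear in every test set; both choices, like the toy data, are assumptions made to keep the example short.

```python
import random
import statistics

def stratified_split(labels, rng, train_fraction=0.8):
    """Split indices per class so each test set contains both classes."""
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(train_fraction * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test

def fit_threshold(scores, labels):
    """Stand-in model: midpoint between the class means of the training scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def balanced_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    return 0.5 * (tp / n_pos + tn / (len(y_true) - n_pos))

def repeated_evaluation(scores, labels, n_repeats=50, seed=0):
    """Median balanced accuracy over `n_repeats` random splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        train, test = stratified_split(labels, rng)
        thr = fit_threshold([scores[i] for i in train], [labels[i] for i in train])
        y_true = [labels[i] for i in test]
        y_pred = [1 if scores[i] >= thr else 0 for i in test]
        results.append(balanced_accuracy(y_true, y_pred))
    return statistics.median(results)

# Toy, well-separated scores: 10 controls followed by 10 abnormal samples.
scores = [0.1 * i for i in range(10)] + [2.0 + 0.1 * i for i in range(10)]
labels = [0] * 10 + [1] * 10
median_balanced_accuracy = repeated_evaluation(scores, labels)
```

Reporting the median over repeats, as in the text, limits the influence of a few unlucky splits on a small dataset.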
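By way of non-limiting illustration, the agreement between expected and predicted ratios of “abnormal” DNA can be measured as follows. The text speaks only of a "correlation factor"; Pearson's coefficient and a single-score least-squares line are assumptions for the example, and the numbers are invented.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    slope = cov / var_x
    return slope, my - slope * mx

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Invented data: one integrated score per sample and the known mixing ratio in percent.
score = [0.5, 1.0, 2.5, 5.0, 10.0]
expected_ratio = [1.0, 2.0, 5.0, 10.0, 20.0]

slope, intercept = fit_line(score, expected_ratio)
predicted_ratio = [slope * s + intercept for s in score]
r = pearson(expected_ratio, predicted_ratio)
```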
Classification of samples type using biomarkers
Finally, we tried to go a step further in the classification of samples. Beyond identifying abnormal samples, we wanted to classify samples according to their type. We used the different biomarkers themselves instead of computed scores. However, when we looked at the different biomarkers, we did not observe any obvious biomarker able to discriminate samples according to their type (heatmap not shown).
Using two successive models (logistic regression and random forest), it was however possible to discriminate samples regarding the type of abnormal DNA. For all groups, precision was superior to 0.86 and recall superior to 0.82. Sample classification is summarized in Table 4. The ROC curve of classification accuracy is shown in Fig. 15.
Table 4: Confusion matrix of sample classification (table provided as image imgf000070_0001).
Conclusion
In this study, we showed that “abnormal” DNA can be detected in a sample based on the integration of several biomarkers. We observed that, in some cases, the more biomarkers were used, the more precise the detection and quantification of the “abnormal” DNA were. For THP1, 3 biomarkers gave good performance, and for HeLa detection, the best accuracy was obtained with all 6 biomarkers used in this study. This proof-of-concept study demonstrated that several biomarkers are required to identify several types of cellular DNA in a normal background.
Example 2: Ex vivo analysis
Aims
The goal of this study was to develop cfDNA sequencing from real plasma samples. We started from plasma samples and tested a cfDNA extraction method. After DNA extraction, we tested the sequencing library protocol with small adaptations to increase cfDNA sequencing and to limit gDNA contamination.
Material & Methods
Plasma sample
Plasma samples were obtained from 4 different healthy donors.
DNA extraction
cfDNA was extracted using two different commercial kits: QIAamp MinElute (Qiagen®) and Apostle MiniMax™ (Beckman®). Both methods are based on the capture of small DNA fragments using magnetic beads.
Library preparation
Library preparation was performed using the ligation protocol developed by Nanopore®. This method requires the ligation of barcodes at both ends of the DNA fragments, after a previous step that prepares the ends of the DNA. After barcode ligation, sequencing adaptors were attached to enable the sequencing of the fragments. Between these steps, the DNA was washed to remove the reagents used at each step. The washing was performed by capturing the DNA on magnetic beads. The bead-to-DNA ratio had an impact on the size of the DNA fragments retained after washing, and we modulated this step during our tests to preferentially capture the cfDNA (data not shown).
Sequencing results analysis
Sequencing data were analyzed using the workflow previously described in Example 1.
Results
Throughput obtained from different runs
The throughput of each run was estimated by counting the total reads sequenced per sample. For each run, we used a quantity of DNA ranging from 10 ng to 30 ng. The read count was not correlated with the quantity of DNA used for the library. We observed an increase in the reads obtained after sequencing in our third test (Fig. 16).
For this test, we increased the ratio of beads over DNA up to 2x (instead of 0.5x). This increase made it possible to keep small reads during washing and to keep most of the DNA after washing. Indeed, the percentage of small reads was the highest in the third test performed.
We hypothesized that, for in vitro tests (test1, i.e., cfDNA obtained from in vitro culture of cell lines), the low percentage of small reads compared to test2 (for which no protocol modification was made) could be explained by other mechanisms of DNA release occurring in vitro, like necrosis or active secretion, which have been described to produce longer reads (> 20000 bp) (Fig. 17).
When we compared the two different extraction kits, we did not observe any difference in the sequencing data obtained. The read count obtained was similar in the two conditions, as were the percentage of small reads and the quality of reads (Fig. 18A-C).
Read sizes obtained after sequencing
We analyzed the size of the sequenced reads from the four samples sequenced in our third test (test3, i.e., with the beads over DNA ratio adapted to optimize the capture of small reads). For all samples, we observed a read size pattern that oscillated every 164 bp, a length that corresponds to a chromatosome. This length pattern indicates that the sequenced DNA resulted from apoptosis of cells, which is the main origin of cfDNA (Fig. 19A-D).
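By way of non-limiting illustration, the periodic read size pattern can be summarized by counting reads whose length falls near a multiple of the 164 bp unit stated above. The tolerance of 20 bp, the function name and the read lengths are assumptions made for the example.

```python
from collections import Counter

CHROMATOSOME_BP = 164  # periodicity reported in the text

def nucleosome_multiples(read_lengths, unit=CHROMATOSOME_BP, tolerance=20):
    """Count reads whose length lies within `tolerance` bp of k * unit, for k >= 1."""
    counts = Counter()
    for length in read_lengths:
        k = round(length / unit)
        if k >= 1 and abs(length - k * unit) <= tolerance:
            counts[k] += 1
    return counts

# Invented read lengths in bp.
read_lengths = [160, 170, 330, 325, 500, 90, 250]
counts = nucleosome_multiples(read_lengths)
```

A read length histogram dominated by the keys 1, 2 and 3 would be consistent with mono-, di- and tri-chromatosome fragments.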
Biomarkers pattern in normal plasma samples
Methylation analysis in plasma samples
We analyzed the methylation pattern in the 4 previously described samples. For each sample, we observed a majority of methylated CpG, as expected for blood cells (Fig. 20).
When we compared methylated sites, we observed a good correlation between samples, with a correlation factor (Spearman) ranging from 0.5 to 0.7 depending on the samples.
However, we observed that some positions were highly divergent between samples. For example, when comparing sample 4 and sample 1, we observed many sites that were always methylated in sample 1 (frequency equal to 1) but that could be methylated or not in sample 4. These observations can be the result of intra-sample heterogeneity due to different factors, like the age of the donor (Fig. 21).
Nucleosome analysis in plasma samples
The nucleosome pattern was analyzed next. We observed a variable number of nucleosome-like regions. This count was highly correlated with the sequencing depth (Fig. 22A).
We compared all these nucleosome-like regions between samples and counted 27 350 regions found in common between these four samples. These regions corresponded to 5 % to more than 50 % of the total regions found in each sample (Fig. 22B). The difference in ratio could again be explained by sequencing depth. We compared samples 2 and 3, which were identical in the total number of nucleosome-like regions identified. When we compared these samples, we observed that more than 60 % of the regions were common between samples, which was expected for these two biological replicates (Fig. 22C). When we looked at the expression of the 27 350 common nucleosome-like regions in all samples, we observed a high correlation for all samples (Fig. 23). This observation tends to confirm that the nucleosomes identified correspond to real nucleosome positions that were located at similar positions in blood cells.
Mutagenesis analysis in plasma samples
When we looked at mutagenesis in normal plasma, we did not observe a high rate of mutagenesis. Between 0 and 13 variants were found per sample, and most of them corresponded to polymorphisms.
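By way of non-limiting illustration, a Spearman correlation between the per-site methylation frequencies of two samples can be computed as follows. The implementation (average ranks for ties, then Pearson correlation on the ranks) is a standard formulation written out for the example; the frequencies are invented.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Invented methylation frequencies at the same four CpG sites in two samples.
sample_1 = [0.1, 0.5, 0.9, 1.0]
sample_4 = [0.2, 0.4, 0.8, 0.95]
rho = spearman(sample_1, sample_4)
```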
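By way of non-limiting illustration, the regions shared by all samples and the share they represent in each sample can be obtained with a set intersection. The sketch assumes that regions have already been reduced to comparable keys (here a chromosome and a bin start) and that every sample has at least one region; the data are invented.

```python
def shared_regions(regions_by_sample):
    """Return the regions common to all samples and their fraction in each sample."""
    region_sets = {name: set(regions) for name, regions in regions_by_sample.items()}
    common = set.intersection(*region_sets.values())
    fractions = {name: len(common) / len(regions) for name, regions in region_sets.items()}
    return common, fractions

# Invented nucleosome-like regions keyed by (chromosome, bin start).
samples = {
    "sample1": [("chr1", 0), ("chr1", 200), ("chr2", 400)],
    "sample2": [("chr1", 0), ("chr1", 200)],
    "sample3": [("chr1", 0), ("chr1", 200), ("chr3", 0), ("chr3", 200)],
}
common, fractions = shared_regions(samples)
```

A sample with many regions, typically a deeper one, shows a smaller shared fraction, which mirrors the depth effect described above.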
Microbiome analysis in plasma samples
We also searched cfDNA for bacteria or viruses that could reflect a circulating microbiome in plasma. We found bacterial reads in two samples. In both cases, the bacteria were Proteobacteria: we found Gammaproteobacteria, Alphaproteobacteria and Enterobacteriaceae in the two samples. However, the microbiome is rare in healthy donor plasma and represented fewer than 100 reads in both cases.
Conclusion
This study allowed the development of the sequencing of cfDNA extracted from subject plasma samples. This experimental protocol enabled the sequencing of small reads on a Nanopore® device. The size pattern of the reads was consistent with chromatosome length and corresponded to DNA released by apoptosis. We did not find many long reads. Such reads can come from gDNA that contaminates plasma through blood cell lysis, or can be released by mechanisms other than apoptosis, like necrosis or active secretion. Some biomarker patterns have been evaluated and are consistent between samples, which tends to validate the method of the present invention.

Claims

1. A method for estimating the probability of a subject to be affected with cancer, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) estimating the probability of the subject to be affected with cancer based on the classification of the subject at step (f).
2. A method for diagnosing cancer in a subject in need thereof, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) diagnosing cancer in the subject based on the classification of the subject at step (f).
3. A method for determining the origin of a tumor in a subject in need thereof, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) determining the origin of the tumor in the subject based on the classification of the subject at step (f).
4. A method for determining a personalized course of treatment in a subject affected or likely to be affected with cancer, comprising the steps of:
(a) extracting nucleic acids from a sample previously obtained from said subject,
(b) sequencing the extracted nucleic acids, thereby obtaining a plurality of sequence reads,
(c) assigning the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining a plurality of mapped sequence reads,
(d) computer-processing said plurality of mapped sequence reads to identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer,
(e) assigning a score to each biomarker identified at step (d),
(f) classifying the subject based on the scores assigned at step (e), and
(g) determining a personalized course of treatment for the subject based on the classification of the subject at step (f).
5. The method according to any one of claims 1 to 4, wherein the sample is a bodily fluid.
6. The method according to any one of claims 1 to 5, wherein the sample is a bodily fluid selected from the group consisting of blood, lymph, ascitic fluid, cystic fluid, urine, gastric juices, pancreatic juices, bile, nipple exudate, synovial fluid, bronchoalveolar lavage fluid, mucus, sputum, amniotic fluid, peritoneal fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, semen, milk, saliva, sweat, tears, feces, stools, and alveolar macrophages.
7. The method according to any one of claims 1 to 6, wherein the sample is a bodily fluid selected from the group consisting of whole blood, plasma and serum.
8. The method according to any one of claims 1 to 7, wherein the nucleic acids are cell-free nucleic acids (cfNAs).
9. The method according to any one of claims 1 to 8, wherein the nucleic acids are cell-free circulating DNA (cfDNA).
10. The method according to any one of claims 1 to 9, wherein the extracted nucleic acids are sequenced by single-molecule nucleic acid sequencing.
11. The method according to any one of claims 1 to 10, wherein the extracted nucleic acids are sequenced by a sequencing method selected from the group comprising nanopore sequencing, single molecule real-time sequencing (SMRT), annular dark-field scanning transmission electron microscopy sequencing, Heliscope sequencing, nano-knife-edge probe sequencing.
12. The method according to any one of claims 1 to 11, wherein the extracted nucleic acids are sequenced by nanopore sequencing.
13. The method according to any one of claims 1 to 12, wherein assigning the plurality of sequence reads at step (c) comprises:
c1) aligning the plurality of sequence reads on the human genome, thereby obtaining human-mapped sequence reads;
c2) discarding sequence reads that did not match with the human genome at step c1);
c3) optionally, aligning sequence reads discarded at step c2) on at least one further reference genome or a portion thereof; thereby obtaining exogenous-mapped sequence reads;
c4) discarding sequence reads that did not match with the at least one further reference genome or a portion thereof at step c3).
14. The method according to claim 13, wherein the at least one further reference genome or a portion thereof is at least one pathogen genome, preferably a bacterial and/or viral genome.
15. The method according to any one of claims 1 to 14, wherein genetic, epigenetic, transcriptomic, metabolic and metagenomic biomarkers of cancer include genomic alterations, telomere length, retrotransposon sequence, DNA hypermethylation or hypomethylation, nucleosome footprint, nucleic acid fragment size, mitochondria quantity, cancer-inducing virus sequences and cancer-associated bacteria sequences.
16. The method according to claim 15, wherein genomic alterations include base pair mutations, differential trinucleotide frequencies, mutational signatures, copy number alterations, gene rearrangements, short tandem repeat polymorphism, and/or chromosomal abnormalities.
17. The method according to any one of claims 1 to 16, wherein computer-processing the plurality of mapped sequence reads at step d) comprises correlating the mapped sequence reads with information available in databases and/or with information obtained from at least one reference subject, preferably from a reference population.
18. The method according to any one of claims 1 to 17, wherein the at least one reference subject is a substantially healthy subject; or the at least one reference subject is a cancer subject.
19. A method for treating a subject affected with cancer, comprising the steps of:
1) estimating the probability of said subject to be affected with cancer, diagnosing cancer in said subject or determining the origin of a tumor in said subject with the method according to any one of claims 1 to 18; and
2) treating said subject depending on the estimation, diagnosis, or determination of step 1).
20. The method according to claim 19, wherein treating said subject is carried out by any one of, or a combination of two or more of: surgery, radiation therapy, chemotherapy, activation immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
21. A computer system for:
estimating the probability of a subject to be affected with cancer; or
diagnosing cancer in a subject in need thereof; or
determining the origin of a tumor in a subject in need thereof; or
determining a personalized course of treatment in a subject affected with cancer;
comprising:
a) a processor and
b) a storage medium that stores code readable by the processor;
wherein the code stored on the storage medium, when executed by the processor, causes the computer system to:
a. optionally, receive at least one raw sequencing signal from a sequencing experiment of nucleic acids, preferably of cell-free nucleic acids (cfNAs), more preferably of cell-free circulating DNA (cfDNA), previously extracted from a sample from the subject;
b. optionally, base-call and demultiplex said at least one raw sequencing signal, thereby obtaining at least one sequence read or a plurality of sequence reads;
c. assign said at least one sequence read or the plurality of sequence reads to at least one reference genome or a portion thereof, thereby obtaining at least one mapped sequence read or a plurality of mapped sequence reads;
d. identify or assess the presence of genetic, epigenetic, transcriptomic, metabolic and/or metagenomic biomarkers of cancer in said mapped sequence reads;
e. derive a probability score via at least one machine learning algorithm;
f. generate an output, wherein the output is the classification label or the probability score; and
g. estimate the probability of the subject to be affected with cancer based on the output; or diagnose cancer in the subject based on the output; or determine the origin of a tumor in the subject based on the output; or determine a personalized course of treatment for the subject based on the output.
PCT/EP2020/084760 2019-12-06 2020-12-04 Methods and apparatuses for diagnosing cancer from cell-free nucleic acids Ceased WO2021110987A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962944502P 2019-12-06 2019-12-06
US62/944,502 2019-12-06

Publications (1)

Publication Number Publication Date
WO2021110987A1 true WO2021110987A1 (en) 2021-06-10

Family

ID=73793183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/084760 Ceased WO2021110987A1 (en) 2019-12-06 2020-12-04 Methods and apparatuses for diagnosing cancer from cell-free nucleic acids

Country Status (1)

Country Link
WO (1) WO2021110987A1 (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160203260A1 (en) * 2015-01-13 2016-07-14 The Chinese University Of Hong Kong Applications of plasma mitochondrial dna analysis
WO2017100305A2 (en) 2015-12-07 2017-06-15 Opi Vi - Ip Holdco Llc Composition of antibody construct-agonist conjugates and methods of use thereof
WO2019191649A1 (en) * 2018-03-29 2019-10-03 Freenome Holdings, Inc. Methods and systems for analyzing microbiota
WO2019200410A1 (en) * 2018-04-13 2019-10-17 Freenome Holdings, Inc. Machine learning implementation for multi-analyte assay of biological samples
WO2019209954A1 (en) * 2018-04-24 2019-10-31 Grail, Inc. Systems and methods for using pathogen nucleic acid load to determine whether a subject has a cancer condition

Non-Patent Citations (11)

* Cited by examiner, † Cited by third party
Title
ALEXANDROV ET AL., NATURE, vol. 500, no. 7463, 2013, pages 415 - 21
HOMER N ET AL., PLOS ONE, 2009, pages e7767
JOULIN ET AL.: "15th Conference of the European Chapter of the Association for Computational Linguistics (Eacl 2017): Valencia, Spain", 2017, ASSOCIATION FOR COMPUTATIONAL LINGUISTICS
JOULIN, ARXIV:1607.01759, 2016
LANGMEAD B ET AL., GENOME BIOL, vol. 10, 2009, pages R25
LI H, DURBIN R, BIOINFORMATICS, vol. 25, 2009, pages 1966 - 67
MENEGAUX, VERT, J COMPUT BIOL, vol. 26, no. 6, 2019, pages 509 - 518
NIEDRINGHAUS ET AL., ANAL CHEM., vol. 83, no. 12, 2011, pages 4327 - 41
RIVALS E. ET AL., LECTURE NOTES IN COMPUTER SCIENCE, vol. 5724, 2009, pages 246 - 260
RIZK G, LAVENIER D, BIOINFORMATICS, vol. 26, 2010, pages 2534 - 2540
XU ET AL., SMALL, vol. 5, no. 23, 2009, pages 2638 - 49

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115762746A (en) * 2021-09-02 2023-03-07 双链云(武汉)科技有限公司 Method and model for constructing early cancer prediction model based on genome open region
WO2023067597A1 (en) * 2021-10-18 2023-04-27 Yissum Research Development Company Of The Hebrew University Of Jerusalem Ltd. Use of nanopore sequencing for determining the origin of circulating dna
CN114574576A (en) * 2021-12-24 2022-06-03 南京世和医学检验有限公司 Application of bile cfDNA in diagnosis and treatment of gallbladder metastatic cancer
CN114574576B (en) * 2021-12-24 2023-01-03 南京世和医学检验有限公司 Application of bile cfDNA in diagnosis and treatment of gallbladder metastatic cancer
WO2023197825A1 (en) * 2022-04-15 2023-10-19 南京世和基因生物技术股份有限公司 Multi-cancer early screening model construction method and detection device
WO2024010875A1 (en) * 2022-07-06 2024-01-11 The Regents Of The University Of California Repeat-aware profiling of cell-free rna
WO2024168114A1 (en) * 2023-02-08 2024-08-15 Battelle Memorial Institute Technologies for individualized metagenomic profiling
WO2025201509A1 (en) * 2024-03-28 2025-10-02 深圳湾实验室 Method for obtaining cancer risk prediction marker and cancer risk assessment method
WO2025247632A1 (en) 2024-05-27 2025-12-04 European Molecular Biology Laboratory Preparation of cell-free fragmented nucleic acids for genetic analysis sequencing

Similar Documents

Publication Publication Date Title
WO2021110987A1 (en) Methods and apparatuses for diagnosing cancer from cell-free nucleic acids
Lebofsky et al. Circulating tumor DNA as a non-invasive substitute to metastasis biopsy for tumor genotyping and personalized medicine in a prospective trial across all tumor types
US20240141432A9 (en) Detection and treatment of disease exhibiting disease cell heterogeneity and systems and methods for communicating test results
ES2911613T3 (en) Analysis of haplotype methylation patterns in tissues in a DNA mixture
Siravegna et al. Clonal evolution and resistance to EGFR blockade in the blood of colorectal cancer patients
Barris et al. Detection of circulating tumor DNA in patients with osteosarcoma
JP2021520816A (en) Methods for Cancer Detection and Monitoring Using Personalized Detection of Circulating Tumor DNA
US11124836B2 (en) Method for selecting personalized tri-therapy for cancer treatment
JP2021516962A (en) Improved variant detection
KR20220157976A (en) Analysis method of cell-free nucleic acid and its application
EP3778924A1 (en) Molecular predictors of patient response to radiotherapy treatment
CN106381332A (en) Detection kit for detecting AML related gene group
EP4623107A1 (en) Systems and methods for tracking personalized methylation biomarkers for the detection of disease
WO2024081769A2 (en) Methods and systems for detection of cancer based on dna methylation of specific cpg sites
EP4605937A1 (en) Method of determining loss of heterozygosity status of a tumor
US20250272835A1 (en) Predicting treatment efficacy by analyzing non-cancer cells
AU2023384165A1 (en) Cell-free dna methylation test for breast cancer
WO2024031097A2 (en) Systems and methods for cancer screening
Jin et al. Genetic heterogeneity in hepatocellular carcinoma and paired bone metastasis revealed by next-generation sequencing
US20210217493A1 (en) Reducing noise in sequencing data
US20250372256A1 (en) Ancestry-related kras co-alteration patterns as prognostic biomarkers
US20250197932A1 (en) Disease subtype classification using genomic features and clustering
US20250305060A1 (en) Pole variant classification strategy identifies patients who may have a favorable prognosis and benefit from immunotherapy
Bult et al. A multi-omics analysis of transformed nodal marginal zone lymphoma
Schrader et al. IgM Immunohistochemical Expression is a Potential Risk Factor for Extracutaneous Dissemination in Patients With Primary Cutaneous Follicle Center Lymphoma

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20823750

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20823750

Country of ref document: EP

Kind code of ref document: A1