
WO2025005860A1 - Method of identifying and adjusting for systematic variability in DNA abundance measurements

Info

Publication number
WO2025005860A1
Authority
WO
WIPO (PCT)
Prior art keywords
genomic
dna abundance
bins
sample
dna
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/SE2024/050638
Other languages
English (en)
Inventor
Johan Lindberg
Karl Markus MAYRHOFER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Definitions

  • the present disclosure relates to the field of generating improved DNA abundance measurements suitable for performing copy number analysis on DNA.
  • the present disclosure relates to a method of identifying and adjusting for systematic variability in DNA abundance measurements from DNA samples.
  • Background. Genetic changes are known to occur in cells and may lead to the development and progression of cancer.
  • Such mutations include many types of chromosomal abnormalities, which are mainly observed as copy number alterations (abnormal number of copies per cell), structural rearrangements (deletions, insertions and translocations) and point mutations (small substitutions, insertions and deletions).
  • Somatic mutations are genetic changes that occur in the cells of the body after conception and are not present in the germline cells (eggs or sperm). These changes can occur in any cell type in the body, including normal cells and cancer cells, and they can be caused by a variety of factors, such as exposure to environmental toxins, error during DNA replication, or exposure to ionising radiation. Somatic mutations are passed on to daughter cells and may eventually be common to a large number of somatic cells, such as tumours.
  • Germline mutations are genetic changes that have occurred in germline cells and are subsequently present in all somatic cells as well as passed on to future generations.
  • Oncogenes are genes that have the potential to promote cancer development when they are activated, while tumour suppressor genes can contribute to the development of cancer when deactivated.
  • Oncogenes can be activated in different ways. Point mutations can change the coding sequence of the gene, leading to the production of an altered protein. Copy number alteration can lead to the presence of multiple copies (amplification) of the gene and therefore to overexpression. Translocations can lead to the fusion of two genes, leading to the expression of an oncogenic protein. Examples of oncogenes include ERBB2, KRAS and MYC.
  • Tumour suppressor genes are genes that act to inhibit cell growth and division, or prevent the accumulation of DNA damage, and their inactivation can contribute to the development and progression of cancer. Tumour suppressor genes can be inactivated through different mechanisms. Point mutations can change the coding sequence of the gene, leading to the production of a non-functional transcript or protein. Deletion of the gene, or part of it (that is, loss of genomic sequence leading to a decrease in copy number), can lead to a complete loss of function of the gene. In addition, epigenetic changes, such as hypermethylation of the promoter region, can silence the gene by preventing its transcription.
  • tumour suppressor genes typically require two mutations, one affecting each homologous copy. Deletions can be classified as homozygous or hemizygous deletions. Homozygous deletions occur where both homologous copies are deleted, leading to complete loss of a deleted region. Hemizygous deletions occur where only one homologous copy is deleted, and this can contribute to a loss of function of the gene if the remaining copy is also inactivated by mutation or epigenetic changes. Examples of tumour suppressor genes include APC, BRCA2 and TP53. The analysis of tumour DNA is an important aspect of cancer research and clinical management.
  • Tumour DNA can be obtained by biopsy or surgical resection of the tumour tissue, and sometimes from blood plasma as circulating tumour DNA (ctDNA).
  • ctDNA: circulating tumour DNA
  • An important feature of DNA extracted from a tumour sample is that the original sample typically contains a significant fraction of cells that do not belong to the clonal expansion of cells that carry the somatic mutations that drive the disease, resulting in such mutations being present in just a fraction of the sample.
  • ctDNA typically only comprises a fraction of a sample of cell-free DNA (cfDNA).
  • cfDNA: cell-free DNA
  • a computer-implemented method for identifying and adjusting for systematic variability in a set of DNA abundance measurements of a query sample, wherein the DNA abundance measurements correspond to DNA abundance measurements in a biological sample, and wherein each DNA abundance measurement is associated with one of a plurality of genomic bins
  • the method comprising: a) receiving the DNA abundance measurements of the query sample; b) receiving a first score for each genomic bin, wherein the first score is indicative of, for each genomic bin and corresponding DNA abundance measurement, an expected impact of a first systematic variability in the DNA abundance measurements; c) identifying a plurality of genomic regions wherein each genomic region comprises a different sub-plurality of the plurality of genomic bins; d) determining a perturbation effect for at least a first genomic region of the plurality of genomic regions, wherein the perturbation effect is indicative of how a deviation in DNA abundance for that genomic region in a query sample would influence an estimation of a first systematic variability; e) defining a backbone sub
  • the method may further comprise a step of receiving DNA abundance measurements of at least one reference sample, wherein each DNA abundance measurement of the at least one reference sample is associated with a genomic bin of the at least one reference sample, and wherein the genomic bins of the at least one reference sample correspond to the genomic bins of the query sample and wherein the first score for each respective genomic bin is calculated from the DNA abundance measurements of the at least one reference sample.
  • the method may further comprise performing multidimensional scaling on the reference DNA abundance measurements to obtain a first latent feature and a first latent feature score associated with each genomic bin of the plurality of genomic bins wherein the first latent feature represents a bias affecting some reference sample DNA abundance measurements and the first latent feature score is indicative of the impact of the first latent feature bias on the reference DNA abundance measurements of the corresponding genomic bin and wherein the first latent feature score is the first score indicative of a first systematic variability and wherein the bias is the first systematic variability.
  • the step of calculating a first systematic variability for each DNA abundance measurement of the plurality of genomic bins, using the backbone subset of genomic bins and their corresponding first score and DNA abundance measurement may describe the steps of: using at least the backbone subset of the genomic bins, creating a statistical model of the dependency of query sample DNA abundance measurements on the first latent feature scores; and calculating the first systematic variability in query sample DNA abundance measurements of each genomic bin by applying the statistical model to predict query sample DNA abundance measurements from first latent feature scores.
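The two steps above (fit a model on the backbone bins, then apply it to every bin) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the model is taken to be ordinary least-squares linear regression with one predictor, the adjustment is taken to be a simple subtraction of the predicted value, and the function name `adjust_with_backbone` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def adjust_with_backbone(
    query: Sequence[float],
    scores: Sequence[float],
    backbone: Sequence[int],
) -> Tuple[List[float], List[float]]:
    """Fit query ~ intercept + slope * score on backbone bins only,
    then predict the systematic variability for every bin.

    Returns (adjusted measurements, estimated systematic variability).
    Assumption: adjustment = measurement minus predicted variability.
    """
    xs = [scores[i] for i in backbone]
    ys = [query[i] for i in backbone]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("backbone scores have no variance; cannot fit a slope")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x

    # Apply the model to all bins, including those excluded from the fit.
    variability = [intercept + slope * s for s in scores]
    adjusted = [q - v for q, v in zip(query, variability)]
    return adjusted, variability


# Example: bin 4 carries a true gain of +1.0 and is left out of the backbone.
scores = [0.0, 1.0, 2.0, 3.0, 4.0]
query = [0.1, 0.6, 1.1, 1.6, 3.1]
adjusted, variability = adjust_with_backbone(query, scores, backbone=[0, 1, 2, 3])
```

Because bin 4 is excluded from the fit, its deviation does not distort the estimated slope and survives the adjustment as signal.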
  • the backbone subset of genomic bins comprises, for each of a selection of genomic regions, none or no more than a predetermined number of genomic bins located within that genomic region.
  • each of the selected genomic regions may be one or more of known, observed or suspected to have one or more of: a) a reasonably high probability to be affected by focal deletion or amplification in a query sample; b) a score distribution deviating to a certain extent from that of the other genomic bins; c) for a sequence-related feature, a distribution deviating to a certain extent from that of the other genomic bins; d) a corresponding germline copy number that is influenced by patient sex.
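The backbone constraint described above (none, or no more than a predetermined number, of bins from each selected genomic region) can be sketched as a simple capped selection. The function name `select_backbone`, the per-bin region labels and the choice to keep the first bins encountered in each region are all illustrative assumptions; the text does not specify which bins within a region are retained.

```python
from typing import Dict, List, Optional, Sequence


def select_backbone(
    bin_regions: Sequence[Optional[str]],
    max_per_region: int = 1,
) -> List[int]:
    """Return indices of bins forming a backbone subset.

    bin_regions[i] is the label of the selected genomic region containing
    bin i, or None if bin i lies outside every selected region.
    Bins outside selected regions are always kept; each selected region
    contributes at most max_per_region bins (0 excludes it entirely).
    """
    kept: List[int] = []
    used: Dict[str, int] = {}
    for idx, region in enumerate(bin_regions):
        if region is None:
            kept.append(idx)
        elif used.get(region, 0) < max_per_region:
            used[region] = used.get(region, 0) + 1
            kept.append(idx)
    return kept


# Hypothetical labelling: three bins inside MYC, one inside TP53.
labels = [None, "MYC", "MYC", "MYC", None, "TP53"]
capped = select_backbone(labels, max_per_region=1)
excluded = select_backbone(labels, max_per_region=0)
```

Capping the contribution of each region limits how strongly a focal amplification or deletion in that region can influence the estimate of systematic variability.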
  • the or each reference sample may be characterised by one or more of: a) being a sample in which there are no or less than a predetermined number of known copy-number alterations; b) being a sample that contains less than a predetermined fraction of cancer DNA or cells; c) being a sample that contains no cancer DNA or cells; and d) originating from a cell type resembling the cell type from which the query sample is known or suspected to originate.
  • the DNA abundance measurement of a particular genomic bin is computed from, for each sample, a number of sequence reads or read pairs that can be aligned to the corresponding genomic bin using sequence alignment.
  • the DNA abundance measurements may be provided as one or both of: log transformed; and centred around a point estimate derived from all or a subset of the DNA abundance measurements corresponding to that sample.
  • the point estimate used may be the median.
  • the DNA abundance measurement for a genomic bin known or expected to deviate from normal diploid for that sample may be adjusted before proceeding with any multidimensional scaling, in any of the following ways: a) doubled, or equivalently adjusted based on the expected relative measurement effect of twice the DNA abundance, if on a haploid sex chromosome and not a pseudoautosomal region; b) similarly adjusted based on the theoretical effect of the local copy number on the DNA abundance measurement, relative to copy number two; c) removed and optionally replaced with an imputed, random or constant value; d) if the DNA abundance measurement is already log transformed, performing any of the above adjustments so that the adjustment is effectively applied to the corresponding untransformed value.
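Options a), b) and d) above can be illustrated with a short sketch. It assumes that the measurement scales linearly with copy number and that any log transform is base 2; both are assumptions made for illustration, and the function name `adjust_for_copy_number` is hypothetical.

```python
import math
from typing import Optional


def adjust_for_copy_number(
    value: float,
    copy_number: int,
    log2_transformed: bool = True,
) -> Optional[float]:
    """Rescale a reference measurement to what copy number two would give.

    Assumes abundance is proportional to copy number. For log2 values the
    multiplicative factor becomes an additive offset, so the adjustment is
    effectively applied to the untransformed value (option d).
    Returns None when the bin should be removed instead (option c).
    """
    if copy_number <= 0:
        return None  # nothing to rescale; caller may drop or impute the bin
    factor = 2.0 / copy_number
    if log2_transformed:
        return value + math.log2(factor)
    return value * factor


# A haploid X-chromosome bin in a male sample: doubling equals +1 in log2.
male_x_log2 = adjust_for_copy_number(-1.0, copy_number=1)
male_x_raw = adjust_for_copy_number(50.0, copy_number=1, log2_transformed=False)
```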
  • the multidimensional scaling may be performed such that it obtains a plurality of latent features and corresponding latent feature scores, wherein the plurality of latent features includes the first latent feature and at least a second latent feature, and the latent feature scores comprise at least the first latent feature scores and second latent feature scores.
  • the method may further comprise, for a second latent feature and corresponding second latent feature scores: a) using at least a second backbone subset of the genomic bins, creating a second statistical model of the dependency of query sample DNA abundance measurements, or previously adjusted query sample DNA abundance measurements, on the second latent feature scores; b) estimating a second systematic variability in the query sample DNA abundance measurements, or previously adjusted query sample DNA abundance measurements, of each genomic bin, for the second latent feature by applying the second statistical model to predict query sample DNA abundance measurements, or previously adjusted DNA abundance measurements, from the second latent feature scores; c) obtaining adjusted or further adjusted query sample DNA abundance measurements for each genomic bin using the DNA abundance measurements, or previously adjusted query sample DNA abundance measurements, of the query sample and their associated estimated second systematic variability.
  • the model may be a multivariate model, simultaneously estimating and adjusting for systematic variability corresponding to two or more latent features.
  • the multidimensional scaling may be principal component analysis and the plurality of latent features may be a plurality of principal components.
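As an illustration of obtaining per-bin latent feature scores by principal component analysis, the sketch below extracts the first principal component by power iteration using only the standard library. The data layout (rows are genomic bins, columns are reference samples), the function name `first_pc_scores` and the iteration count are assumptions; a production implementation would more likely use an established linear algebra library.

```python
import math
from typing import List, Sequence


def first_pc_scores(ref: Sequence[Sequence[float]], iterations: int = 200) -> List[float]:
    """Return the first principal component score of each bin.

    ref[i][j] is the measurement of bin i in reference sample j.
    The sign of a principal component is arbitrary; power iteration from
    an all-positive start vector fixes it only by convention.
    """
    n_bins, n_samples = len(ref), len(ref[0])
    # Centre each sample (column) across bins.
    means = [sum(row[j] for row in ref) / n_bins for j in range(n_samples)]
    x = [[row[j] - means[j] for j in range(n_samples)] for row in ref]
    # Unnormalised sample-by-sample covariance matrix X^T X.
    cov = [
        [sum(r[a] * r[b] for r in x) for b in range(n_samples)]
        for a in range(n_samples)
    ]
    # Power iteration converges to the leading eigenvector unless the
    # start vector happens to be orthogonal to it.
    v = [1.0 / math.sqrt(n_samples)] * n_samples
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(n_samples)) for a in range(n_samples)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0:
            break
        v = [t / norm for t in w]
    # Project each centred bin onto the leading eigenvector.
    return [sum(r[j] * v[j] for j in range(n_samples)) for r in x]


# Two reference samples sharing one bias pattern at different strengths.
reference = [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
pc1 = first_pc_scores(reference)
```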
  • the model may be a linear regression model.
  • the method may further comprise, after the step of performing multidimensional scaling on the reference sample DNA abundance measurements, for each genomic bin: a) determining if a latent feature score of the genomic bin is outside of a predetermined acceptable band; b) if a latent feature score is outside of the predetermined acceptable band, removing the genomic bin and associated query and reference DNA abundance measurements from further analysis.
  • the method may further comprise repeating the step of performing multidimensional scaling, and then proceeding with the analysis using only the retained genomic bins.
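The filtering step can be sketched as below. The band limits, the function name `retain_bins_within_band` and the layout (one list of per-bin scores per latent feature) are illustrative assumptions; the text does not state how the acceptable band is chosen.

```python
from typing import List, Sequence


def retain_bins_within_band(
    scores_per_feature: Sequence[Sequence[float]],
    lower: float,
    upper: float,
) -> List[int]:
    """Return indices of bins whose score is inside [lower, upper]
    for every latent feature supplied."""
    n_bins = len(scores_per_feature[0])
    return [
        i
        for i in range(n_bins)
        if all(lower <= feature[i] <= upper for feature in scores_per_feature)
    ]


# One latent feature; bins 1 and 3 fall outside the band and are dropped.
feature_scores = [[0.1, 5.0, -0.2, -4.0]]
retained = retain_bins_within_band(feature_scores, lower=-3.0, upper=3.0)

# The retained indices would then be used to subset both the query and the
# reference measurements before multidimensional scaling is repeated.
query = [0.3, 2.2, -0.1, -1.9]
query_retained = [query[i] for i in retained]
```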
  • a method of performing copy number analysis comprising the steps of the first aspect and further comprising determining nucleic acid copy-number alterations in at least one genomic location of the query sample based on the adjusted query sample DNA abundance measurements.
  • a computer program product comprising computer program code configured to, when executed on a processor, perform the method of either of the first aspect or the second aspect.
  • a method for identifying and adjusting for systematic variability in a set of DNA abundance measurements of a query sample wherein the DNA abundance measurements correspond to DNA abundance measurements in a biological sample, and with each measurement associated with one of a plurality of genomic bins
  • the method comprising: a) receiving the DNA abundance measurements of the query sample; b) receiving DNA abundance measurements of at least one reference sample, wherein each DNA abundance measurement of the at least one reference sample is associated with a genomic bin of the at least one reference sample, and wherein the genomic bins of the at least one reference sample correspond to the genomic bins of the query sample; c) performing multidimensional scaling on the reference DNA abundance measurements to obtain a first latent feature and a first latent feature score associated with each genomic bin of the plurality of genomic bins wherein the first latent feature represents a bias affecting some reference sample DNA abundance measurements and the first latent feature score is indicative of the impact of the first latent feature bias on the reference DNA abundance measurements of the corresponding genomic bin; d)
  • the subset of genomic bins may comprise, for each of a selection of genomic regions, none or no more than a predetermined number of genomic bins located within that genomic region.
  • each of the selected genomic regions may be one or more of known, observed or suspected to have: a) a reasonably high probability to be affected by focal deletion or amplification in a query sample; b) for a latent feature, a score distribution deviating to a certain extent from that of the other genomic bins; c) for a sequence-related feature, a distribution deviating to a certain extent from that of the other genomic bins; d) a corresponding germline copy number that is influenced by patient sex.
  • the or each reference sample may be characterised by one or more of: a) being a sample in which there are no or less than a predetermined number of known copy-number alterations; b) being a sample that contains less than a predetermined fraction of cancer DNA or cells; c) being a sample that contains no cancer DNA or cells; and d) originating from a cell type resembling the cell type from which the query sample is known or suspected to originate.
  • the DNA abundance measurement of a particular genomic bin may be computed from, for each sample, a number of sequence reads or read pairs that can be aligned to the corresponding genomic bin using sequence alignment.
  • the DNA abundance measurements may be provided as one or both of log transformed and centred around a point estimate derived from all or a subset of the DNA abundance measurements corresponding to that sample.
  • the point estimate used may be the median.
  • the DNA abundance measurement for a genomic bin known or expected to deviate from normal diploid for that sample may be adjusted before proceeding with dimensional reduction, in any of the following ways: a) doubled, or equivalently adjusted based on the expected relative measurement effect of twice the DNA abundance, if on a haploid sex chromosome and not a pseudoautosomal region; b) similarly adjusted based on the theoretical effect of the local copy number on the DNA abundance measurement, relative to copy number two; c) removed and optionally replaced with an imputed, random or constant value; d) if the DNA abundance measurement is already log transformed, performing any of the above adjustments so that the adjustment is effectively applied to the corresponding untransformed value.
  • the multidimensional scaling may be performed such that it obtains a plurality of latent features and corresponding latent feature scores, wherein the plurality of latent features includes the first latent feature and at least a second latent feature, and the latent feature scores comprise at least the first latent feature scores and second latent feature scores.
  • the method may further comprise, for a second latent feature and corresponding second latent feature scores: a) using at least a second backbone subset of the genomic bins, creating a second statistical model of the dependency of query sample DNA abundance measurements, or previously adjusted query sample DNA abundance measurements, on the second latent feature scores; b) estimating a second systematic variability in the query sample DNA abundance measurements, or previously adjusted query sample DNA abundance measurements, of each genomic bin, for the second latent feature by applying the second statistical model to predict query sample DNA abundance measurements, or previously adjusted DNA abundance measurements, from the second latent feature scores; c) obtaining adjusted or further adjusted query sample DNA abundance measurements for each genomic bin using the DNA abundance measurements, or previously adjusted query sample DNA abundance measurements, of the query sample and their associated estimated second systematic variability.
  • the model may be a multivariate model, simultaneously estimating and adjusting for systematic variability corresponding to two or more latent features.
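A multivariate model of this kind can be sketched as a least-squares fit with one coefficient per latent feature, estimated on the backbone bins and applied to all bins. The sketch solves the normal equations by Gaussian elimination so that it needs no external libraries; the function names, the inclusion of an intercept and the subtraction used for the adjustment are assumptions made for illustration.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("latent feature scores are collinear on the backbone")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multivariate(query: Sequence[float],
                     features: Sequence[Sequence[float]],
                     backbone: Sequence[int]) -> List[float]:
    """Return [intercept, coef_1, coef_2, ...] fitted on backbone bins."""
    rows = [[1.0] + [f[i] for f in features] for i in backbone]
    ys = [query[i] for i in backbone]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(p)]
    return solve_linear(xtx, xty)


def adjust_multivariate(query: Sequence[float],
                        features: Sequence[Sequence[float]],
                        backbone: Sequence[int]) -> List[float]:
    """Subtract the jointly predicted systematic variability from every bin."""
    coef = fit_multivariate(query, features, backbone)
    predicted = [
        coef[0] + sum(c * f[i] for c, f in zip(coef[1:], features))
        for i in range(len(query))
    ]
    return [q - p for q, p in zip(query, predicted)]


# Two latent features; bin 5 carries a true gain of +1.0.
f1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
query = [-0.1, 0.7, 0.9, 1.7, 1.9, 3.7]
coefficients = fit_multivariate(query, [f1, f2], backbone=[0, 1, 2, 3, 4])
adjusted = adjust_multivariate(query, [f1, f2], backbone=[0, 1, 2, 3, 4])
```

Fitting all latent features at once avoids the order dependence of adjusting for one feature after another, at the cost of requiring that the features are not collinear on the backbone bins.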
  • the multidimensional scaling may be principal component analysis and the plurality of latent features are a plurality of principal components.
  • the model may be a linear regression model.
  • the method may further comprise, after the step of performing multidimensional scaling on the reference sample DNA abundance measurements, for each genomic bin: a) determining if a latent feature score of the genomic bin is outside of a predetermined acceptable band; b) if a latent feature score is outside of the predetermined acceptable band, removing the genomic bin and associated query and reference DNA abundance measurements from further analysis.
  • the method may further comprise repeating the step of performing multidimensional scaling, and then proceeding with the analysis using only the retained genomic bins.
  • a method of performing copy number analysis comprising the steps of the first aspect and further comprising determining nucleic acid copy-number alterations in at least one genomic bin of the query sample based on the adjusted query sample DNA abundance measurements.
  • a computer program product comprising computer program code configured to, when executed on a processor, perform the method of any of the first or second aspects.
  • Figure 1 shows a simplified illustration of a plurality of chromosomes and a segment of DNA which makes up one of those chromosomes
  • Figure 2 shows a method according to the present disclosure
  • Figure 3 shows a method of performing copy number analysis according to the present disclosure
  • Figure 4 shows an example computer readable medium
  • Figure 5 shows experimental results which demonstrate the efficacy of the disclosed method on a set of genomic bins of a single query sample
  • Figure 6 shows experimental results which demonstrate the efficacy of the disclosed method on a population of query samples.
  • the present disclosure provides a method for identifying and adjusting for systematic variability in a set of DNA abundance measurements of a query sample.
  • Copy number analysis is a technique used to investigate the number of copies per cell of specific chromosomal regions, using DNA abundance measurements corresponding to genomic bins defined using locations on a reference genome. The analysis is performed on a DNA sample typically representing a population of cells. Copy number analysis can be applied to both cancer and non-cancer samples and it is frequently used to identify genetic alterations that can contribute to the development and progression of cancer. Problematically, a measure of DNA abundance can be impacted by a wide range of potential sources of systematic variability which take the form of noise in the signal.
  • sequence GC content in a genomic bin can be a significant source of systematic variability.
  • in the present disclosure, systematic variability refers to a predictable error between an obtained measurement and the corresponding true DNA abundance that the measurement is intended to estimate.
  • the true DNA abundance is referred to as the signal.
  • Copy number alterations can be referred to as deletions and amplifications and will, in the current disclosure, correspond to signal and not systematic variability.
  • Systematic variability affecting the measurement might have been introduced at any point during the processing of the sample such as for example during DNA extraction or enrichment.
  • a DNA abundance measurement of a particular genomic bin, defined using a reference genome may, for example, be computed from a number of sequence reads or read pairs that can be aligned to a corresponding genomic bin using sequence alignment.
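A read-count based abundance measurement can be sketched as follows. The sketch assumes that alignment has already been performed and that each read is summarised by a chromosome name and a start position; the half-open bin coordinates, the requirement that bins on one chromosome do not overlap and the function name `count_reads_per_bin` are illustrative assumptions.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

Bin = Tuple[str, int, int]   # (chromosome, start, end), half-open [start, end)
Read = Tuple[str, int]       # (chromosome, aligned start position)


def count_reads_per_bin(bins: Sequence[Bin], reads: Sequence[Read]) -> List[int]:
    """Count reads whose aligned start position falls inside each bin.

    Assumes bins on the same chromosome do not overlap.
    """
    # Group bins per chromosome and sort them by start coordinate.
    by_chrom: Dict[str, List[Tuple[int, int, int]]] = {}
    for idx, (chrom, start, end) in enumerate(bins):
        by_chrom.setdefault(chrom, []).append((start, end, idx))
    for entries in by_chrom.values():
        entries.sort()
    starts = {c: [e[0] for e in entries] for c, entries in by_chrom.items()}

    counts = [0] * len(bins)
    for chrom, pos in reads:
        if chrom not in by_chrom:
            continue
        # Rightmost bin whose start is <= pos, if any.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            _, end, idx = by_chrom[chrom][i]
            if pos < end:
                counts[idx] += 1
    return counts


bins = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 50)]
reads = [("chr1", 5), ("chr1", 99), ("chr1", 100), ("chr2", 60), ("chr3", 1)]
counts = count_reads_per_bin(bins, reads)
```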
  • the DNA abundance measurements are provided as one or both of: log-transformed; and centred around a point estimate derived from all or a subset of DNA abundance measurements corresponding to that particular sample.
  • Other techniques may be used; however, log-transforming the measurements and centring them around a point estimate may conform with one or more typical approaches in the field, thereby improving the usability of the abundance measurements.
  • the point estimate may be the median DNA abundance measurement for that sample.
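Log transformation and median centring can be sketched in a few lines. The base-2 logarithm and the pseudocount added to avoid taking the logarithm of zero are assumptions; the text only states that the measurements may be log-transformed and centred.

```python
import math
from statistics import median
from typing import List, Sequence


def log2_median_centre(counts: Sequence[float], pseudocount: float = 0.5) -> List[float]:
    """Log2-transform per-bin counts and centre them on the sample median.

    The pseudocount is an illustrative choice that keeps empty bins finite.
    """
    logged = [math.log2(c + pseudocount) for c in counts]
    centre = median(logged)
    return [value - centre for value in logged]


# Counts chosen so that count + 0.5 is an exact power of two.
centred = log2_median_centre([1.5, 3.5, 7.5])
```

After centring, a value of 0 corresponds to the typical bin of that sample, which makes samples of different sequencing depth directly comparable.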
  • the method according to the present disclosure proposes to use multidimensional scaling, such as principal component analysis, to identify and adjust for systematic variability from DNA abundance measurements in order to allow for more accurate copy number measurements.
  • DNA abundance measurements may be adjusted in order to reduce or completely remove one or more sources of systematic variability from the DNA abundance measurements.
  • Figure 1 shows an example simplified depiction of the human genome 100, which comprises 46 chromosomes 101 arranged in 23 chromosomal pairs: 22 autosomal pairs and one pair of sex chromosomes.
  • Each chromosome comprises a DNA molecule in the form of a double strand, or helix, consisting of nucleotide pairs, in which are encoded a plurality of genes, where each gene is made up of a section of the chromosomal DNA 102, which therefore also comprises a string of nucleotide pairs.
  • the nucleotide bases are adenine (A), cytosine (C), guanine (G) and thymine (T), and their sequence can be adequately described as one sequence, as complementary bases (A and T, C and G) are consistently paired opposite one another.
  • a reference genome is typically a haploid representation of a genome.
  • chromosomes 1-22, the autosomes, of which there are two near-identical copies in a normal cell, are each represented once in the reference genome; consequently, the copy number per cell throughout the autosomes is normally two.
  • the sex chromosomes, X and Y, are different enough to be represented separately in the human reference genome. Their normal copy numbers are therefore 2 and 0, respectively, in a female cell, and 1 each in a male cell.
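The normal copy numbers stated above can be captured in a small lookup. The function name, the chromosome naming scheme ("1" to "22", "X", "Y") and the decision to ignore pseudoautosomal regions are simplifying assumptions.

```python
def expected_copy_number(chromosome: str, sex: str) -> int:
    """Normal germline copy number per cell for a reference chromosome.

    Pseudoautosomal regions are ignored in this sketch.
    """
    if sex not in ("female", "male"):
        raise ValueError("sex must be 'female' or 'male'")
    if chromosome == "X":
        return 2 if sex == "female" else 1
    if chromosome == "Y":
        return 0 if sex == "female" else 1
    if chromosome.isdigit() and 1 <= int(chromosome) <= 22:
        return 2  # autosomes: two near-identical copies
    raise ValueError(f"unknown chromosome: {chromosome}")
```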
  • For the purposes of separating the genome into manageable data points for analysis, it is possible to define a plurality of genomic bins 103, where each genomic bin 103 is a defined section or location on a chromosome, and therefore on the genome 100, specified by a start and end position. Genomic bins as defined herein relate to such sections or locations on the genome that are used for analysis according to the methods disclosed herein.
  • the size of the genomic bins 103 can be defined in terms of a number of base pairs. The genomic bins may equally be referred to as data points or genomic data points, and such bins may be defined to have a length greater than one base pair, a length of a single base pair, or no defined length.
  • a chromosome 101 may contain any plurality of such genomic bins 103, based on the available data and the aim of the analysis.
  • a genomic bin may be about 200 base pairs long where the corresponding DNA sequence has been enriched for sequencing.
  • Other genomic bins 103 may be significantly larger, for example between 1 Mb and 3 Mb in length. Other lengths are also possible. It will be appreciated that the size of genomic bins 103 may differ from one another based on the available data and the purpose and circumstances of the analysis.
  • Some genomic bins 103 may be located within one or more genes, of which some genes may be of a certain interest for copy number analysis. In contrast to genomic bins, the present disclosure will also make reference to genomic regions.
  • a genomic region refers to a section of the genome that is generally larger than at least some genomic bins, and itself may include a plurality of genomic bins.
  • a genomic region may cover a whole gene, a cluster of genes, an enhancer or another genomic feature. In other examples, a genomic region may not cover a specific genomic feature but rather may be a segment of the genome. It will be appreciated that in the present disclosure, both a genomic bin and a genomic region are understood to be much smaller than the reference genome chromosome on which they are located, and that the size of a genomic region is on the order of the size of a gene, and that a genomic region may refer to a section of the genome using chromosome, start and end coordinates.
  • FIG. 2 shows a method 200 according to the present disclosure.
  • The DNA abundance measurements for which it is desired to adjust the systematic variability are associated with a query sample.
  • the term query sample is used here simply as a nomenclature to refer to the sample in which systematic variability is to be identified and adjusted for.
  • the query sample may alternatively be referred to as a test sample, or an interrogation sample.
  • the query sample is a biological sample for which DNA abundance measurements, such as sequence read counts, can be obtained for each genomic bin.
  • DNA abundance measurements can be efficiently adjusted for systematic variability (noise) regardless of how the corresponding genomic bins are distributed over the reference genome.
  • Each DNA abundance measurement is associated with a single genomic bin.
  • the DNA abundance measurements may be obtained by way of any known technique for acquiring DNA abundance measurements of a genomic bin, such as DNA sequencing of the genome, exome, or selected targeted regions, or hybridization to a microarray.
  • the method comprises receiving 201 DNA abundance measurements of the query sample.
  • the DNA abundance measurements of the query sample may be received in any appropriate way.
  • the DNA abundance measurements may be loaded from a memory or received via a user input device coupled with the computing device which is configured to perform the method.
  • the method further comprises receiving 202 DNA abundance measurements of at least one reference sample, wherein each DNA abundance measurement of the at least one reference sample is associated with a genomic bin of the at least one reference sample, and wherein the genomic bins of the at least one reference sample correspond to the genomic bins of the query sample.
  • the DNA abundance measurements of the at least one reference sample may be received in any appropriate way.
  • the DNA abundance measurements may be loaded from a memory or received via a user input device coupled with the computing device which is configured to perform the method.
  • the DNA abundance measurements associated with the genomic bins of the at least one reference sample may be stored in a library of reference sample abundance measurements, for example.
  • reference samples may have been processed and analysed similarly to the query sample, but do not have any of the copy number alterations that one would seek to detect in a query sample.
  • the range of expected biological and technical variation in query samples, including cell or tissue type, sample processing and technical factors such as sequencing depth, may be to at least some extent represented in the reference sample set.
  • Some of the reference samples and associated DNA abundance measurements may be non-aberrant samples where it is known, to a defined level of confidence, that no copy-number aberrations are present that would resemble aberrations of interest. While it may be preferable to use known non-aberrant reference samples, use of some aberrant reference samples may have a negligible impact on the quality of results.
  • Some reference samples may be tumour samples that contain no more than a predetermined fraction of cancer cells or DNA, for example 1% or 5%.
  • some reference samples may be selected to match or resemble the cell type from which the query sample is known or suspected to originate. It is possible to use a single reference sample; however, results may improve significantly when a plurality of reference samples is used.
  • reference samples are processed as described using multidimensional scaling to obtain a first latent feature and first latent feature scores associated with systematic variability as described, in other embodiments, reference samples can be applied implicitly. In one such embodiment, reference samples may have been processed elsewhere and results may be adapted as or processed into first latent feature scores for processing a query sample.
  • point estimates such as a mean or a median of DNA abundance measurements, for each bin and over a set of reference samples selected according to some criteria, may be used as a single reference sample.
  • Multidimensional scaling of the reference sample may in some embodiments become trivial, such that the value per bin can serve as or be scaled into a first latent feature score.
  • An analysis of DNA abundance measurements may have resulted in an established dependency between DNA abundance measurements and a sequence-related feature such as GC content.
  • Where the sequence-related feature can effectively predict DNA abundance measurements in reference samples, it may constitute a proxy for a first latent feature score.
  • One or more such features could therefore, in some embodiments, directly serve as or be further processed into a first score associated with systematic variability as described.
  • The disclosed method will obtain a score that is appropriate, and therefore associated with systematic variability as described. Such an appropriate score is henceforth referred to as a latent feature score (or a first or second latent feature score, and so on), and the corresponding feature or latent feature is henceforth referred to as a latent feature (or a first or second latent feature, and so on).
  • the genomic bins of the reference sample or samples used in the method of the present disclosure will be the same genomic bins as those used for analysis in the query sample. Thus, the abundance measurements from the same genomic bins will be received from each reference sample.
  • Reference sample DNA abundance measurements of a genomic bin for which copy number is expected to deviate from a value of two for that sample may be adjusted to resemble the expected value if the copy number had been two, prior to performing the method of the present disclosure, or at least prior to performing multidimensional scaling, as described below.
  • The measurements may be doubled for bins on a haploid sex chromosome, but not for bins in a pseudoautosomal region.
  • the measurements may be removed and optionally replaced with imputed, random or constant values for the Y chromosome in a female reference sample. If the DNA abundance measurement is already log-transformed, any of the foregoing copy number adjustments may be performed such that the adjustment is effectively applied to the corresponding untransformed value instead of the log-transformed value.
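The copy number adjustment described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `adjust_to_diploid`, the per-bin `expected_copies` input and the choice to mark bins with zero expected copies as NaN (so that the caller can remove or impute them) are all assumptions made for the example.

```python
import numpy as np

def adjust_to_diploid(values, expected_copies, log2_transformed=False):
    """Rescale abundance values so they resemble a copy number of two.

    values          : per-bin abundance for one reference sample
    expected_copies : per-bin expected copy number (e.g. 1 for a haploid
                      sex chromosome, 0 for chrY in a female sample)
    Bins with zero expected copies become NaN (hypothetical convention).
    """
    values = np.asarray(values, dtype=float)
    copies = np.asarray(expected_copies, dtype=float)
    factor = np.full_like(values, np.nan)
    present = copies > 0
    factor[present] = 2.0 / copies[present]
    if log2_transformed:
        # Multiplying the untransformed value by f equals adding log2(f)
        # to the log2-transformed value.
        return values + np.log2(factor)
    return values * factor
```

For a log2-transformed measurement of a haploid bin, the adjustment adds exactly one, which corresponds to doubling the untransformed value.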
  • the method further comprises performing 203 multidimensional scaling on the reference DNA abundance measurements to obtain a first latent feature and a first latent feature score associated with each genomic bin of the plurality of genomic bins.
  • Multidimensional scaling is a category of techniques which are capable of identifying and extracting latent features from a dataset.
  • Latent features are variables or properties which may not be directly observed but can be inferred from observations.
  • the latent feature itself may be regarded as an identified feature of the genomic bins that can be associated with variability in the DNA abundance measurements. While there are known features such as GC content that can affect DNA abundance measurements, the latent features are defined using reference measurements. Where multidimensional scaling is performed on a plurality of genomic bins, each genomic bin will receive a latent feature score for each latent feature.
  • the first latent feature score is a value indicative of the impact of a first latent feature bias on the reference DNA abundance measurements of the corresponding genomic bin.
  • Latent features and corresponding latent feature scores obtained are appropriately associated with a first systematic variability as described. Where multidimensional scaling would produce a score associated primarily with copy number variation over reference samples, that score would not be considered appropriate as a latent feature score as described.
  • Any copy number variation over reference samples is instead adjusted for or filtered as described, prior to the multidimensional scaling step from which appropriate latent features and latent feature scores are obtained.
  • While a latent feature score is indicative of the impact of the systematic variability of a latent feature on reference DNA abundance measurements of the corresponding genomic bin, it can be indicative in different ways. In some embodiments it may be proportional to some observed systematic variability attributable to the first latent feature in reference samples. In other embodiments it can be applied to predict systematic variability in a different way, such as through a nonlinear correlation. Principal component analysis is a subset of multidimensional scaling which allows for the identification of latent features in data.
  • The latent features identified in principal component analysis are generally referred to as principal components, and the corresponding scores indicative of their impact in a genomic bin may be referred to as principal component scores. While a plurality of different multidimensional scaling methods may be suitable for extracting latent features and corresponding latent feature scores, principal component analysis may be particularly effective in the present disclosure because of its computational efficiency and linear, orthogonal nature. Where it is desired to obtain a plurality of latent features (or, correspondingly, principal components), multidimensional scaling only needs to be performed a single time, because it is capable of extracting latent features and latent feature scores for a plurality of latent features in a single application.
  • The method may comprise performing multidimensional scaling on the reference DNA abundance measurements to obtain a plurality of latent features and a plurality of corresponding latent feature scores associated with each genomic bin of the plurality of genomic bins.
  • Each latent feature represents an apparent bias affecting some reference sample DNA abundance measurements, and each latent feature score is indicative of the impact of its corresponding latent feature bias on the reference DNA abundance measurements of the corresponding genomic bin. While a plurality of latent features and corresponding latent feature scores can be obtained in a single step of performing multidimensional scaling, the multidimensional scaling does not necessarily need to be performed in a single step.
  • A step of performing multidimensional scaling to obtain a plurality of latent features and corresponding latent feature scores may refer to either performing multidimensional scaling a single time or performing multidimensional scaling a plurality of times.
  • The plurality of latent features and corresponding latent feature scores may comprise a first latent feature, a second latent feature and so on, along with corresponding first latent feature scores, second latent feature scores, third latent feature scores and so on. It will be appreciated that the use of the terms “first”, “second” and “third” here is not necessarily indicative of an order in which these latent features are calculated since, as described above, these can be identified simultaneously. Instead, this nomenclature is used only for simplicity of reference to the different latent features, without limiting the disclosure to any particular latent features.
  • node weights of a neural network layer with or without further processing could be applied as latent feature scores.
  • principal component analysis is applied to compute principal components from the reference sample DNA measurements of each genomic bin, with the principal component analysis applied such that a principal component score is calculated for each genomic bin and then further applied as latent feature scores.
  • the principal component analysis could instead be applied to compute principal components wherein a principal component score is computed for each reference sample and a loading weight computed for each bin, and the loading weights subsequently applied as latent feature scores.
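As an illustration of the first of these two orientations (bins treated as observations, so that each genomic bin receives a principal component score), the following sketch computes per-bin scores with a singular value decomposition. The function name `bin_scores_from_pca`, the matrix layout and the centring of each sample column are assumptions made for the example; the disclosure does not prescribe a particular implementation.

```python
import numpy as np

def bin_scores_from_pca(ref, n_features=2):
    """Principal component scores per genomic bin.

    ref : array of shape (n_bins, n_samples) holding reference sample
          DNA abundance measurements (hypothetical layout).
    Returns an (n_bins, n_features) array; column j holds the scores
    that could serve as the (j+1)-th latent feature scores.
    """
    ref = np.asarray(ref, dtype=float)
    # Centre each sample column so components describe variation over bins.
    centred = ref - ref.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    # Scores are the projections of the centred rows onto the components.
    return u[:, :n_features] * s[:n_features]
```

In the alternative orientation, the same decomposition applied to the transposed matrix would give one score per reference sample and one loading weight per bin, with the loading weights then used as latent feature scores.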
  • Partial least squares regression, independent component analysis or neural networks such as autoencoders may constitute the multidimensional scaling.
  • Multidimensional scaling can be a combination of multidimensional scaling methods.
  • In some embodiments, the multidimensional scaling involves only the reference sample DNA abundance measurements.
  • In other embodiments, the multidimensional scaling involves other information in addition to reference sample DNA abundance measurements, such as query sample DNA abundance measurements in the example of partial least squares regression.
  • the method further comprises using 204 at least a subset of the genomic bins to create a statistical model of the dependency of query sample DNA abundance measurements on the first latent feature scores of the at least one reference sample.
  • the genomic bins that belong to the subset of genomic bins will be referred to herein as the backbone subset of genomic bins.
  • a backbone subset is a subset containing all or some of the plurality of genomic bins, wherein the extent that bins corresponding to one or more genomic regions may contribute to the creation of a statistical model has been restrained, relative to other bins, to address a perturbation effect as described below.
  • This will be referred to as applying restriction to the one or more genomic regions.
  • applying such restriction entails defining a vector of weights, with a default weight of one, for the plurality of genomic bins, wherein restriction is applied by setting the weight to zero for some of the bins corresponding to each genomic region for which restriction is to be applied, thereby defining a retained subset of the plurality of genomic bins, each with a weight of one.
  • weights other than zero and one may be involved and still attain the restriction.
  • the restriction may be applied for example through setting some weights to any of 0.001, 0.1, 0.5, 1-100 or other values, thereby potentially applying restriction for one or more genomic regions without excluding corresponding genomic bins completely from the backbone subset and subsequently the creation of the statistical model. It will be appreciated that regardless of whether bins are removed from the backbone subset of genomic bins, applying a restriction as described does not imply that genomic regions with restriction applied, or selected bins therein, are completely removed from further analysis, but only removed or downweighted in what would serve as training data when subsequently creating a statistical model.
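The weight vector described above can be sketched as follows. The function name `backbone_weights`, the per-bin region labels and the choice to apply the same weight to every bin of a restricted region are assumptions made for illustration only.

```python
import numpy as np

def backbone_weights(region_of_bin, restricted_regions, weight=0.0):
    """Per-bin weights for building the statistical model.

    region_of_bin      : label of the genomic region each bin belongs to
    restricted_regions : labels of regions to which restriction is applied
    weight             : weight given to restricted bins (0 removes them
                         from the training data, values such as 0.1 only
                         downweight them)
    """
    region_of_bin = np.asarray(region_of_bin)
    weights = np.ones(region_of_bin.shape[0], dtype=float)  # default weight of one
    weights[np.isin(region_of_bin, list(restricted_regions))] = weight
    return weights
```

The weights affect only which bins train the model; every bin, including those with weight zero, can still be adjusted and analysed afterwards.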
  • all of the genomic bins forming part of the plurality of genomic bins may be used to create the statistical model. However, excluding certain genomic bins from the process of creating the statistical model may be preferable.
  • the backbone subset of genomic bins may be selected to comprise, for each of a selected set of genomic regions, none or no more than a predetermined number of genomic bins located within that genomic region.
  • The backbone subset of genomic bins comprises, for each of a selection of genomic regions, only a predefined number of genomic bins.
  • For example, the backbone subset of genomic bins may comprise only ten genomic bins from each such genomic region.
  • Each genomic region in this case may be the same length or a different length to another genomic region.
  • the genomic regions may simply be contiguous sets of base pairs or might be defined by certain features like a first genomic region corresponding to a particular gene while a second genomic region corresponds to a particular enhancer and a third genomic region corresponds to a second gene.
  • the selected genomic regions may for example be all genes, in which case for any gene, no more than a predetermined number of genomic bins would be included. This would ensure some representation of every gene in the model, while for any gene where a large enough number of genomic bins is located, restrain the potential for an amplification or deletion of that gene to introduce a perturbation to the statistical model. This perturbation can be seen as the true signal sought in the analysis being mistaken as systematic variability.
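A minimal sketch of limiting each selected genomic region to a predetermined number of bins is given below. The integer region labels (with -1 meaning the bin lies outside every selected region), the function name `cap_bins_per_region` and the random choice of which bins to retain are assumptions for the example; the text above does not specify how the retained bins are chosen.

```python
import numpy as np

def cap_bins_per_region(region_of_bin, max_bins, seed=0):
    """Boolean mask keeping at most max_bins bins from each region.

    region_of_bin : integer region label per bin, -1 for bins outside
                    every selected region (hypothetical encoding)
    """
    region_of_bin = np.asarray(region_of_bin)
    rng = np.random.default_rng(seed)
    keep = np.ones(region_of_bin.shape[0], dtype=bool)
    for region in np.unique(region_of_bin):
        if region < 0:
            continue  # unrestricted bins are always kept
        idx = np.flatnonzero(region_of_bin == region)
        if idx.size > max_bins:
            # Drop the surplus bins; the retained ones still represent the region.
            dropped = rng.choice(idx, size=idx.size - max_bins, replace=False)
            keep[dropped] = False
    return keep
```

With genes as regions, every gene keeps some representation in the model, while a large gene can no longer dominate the fit if it is amplified or deleted in the query sample.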
  • a perturbation is error introduced in a statistical model that models the dependency of query sample DNA abundance measurements on a first latent feature score.
  • the perturbation is introduced due to a dependency between a signal in the query sample, such as a deletion or amplification of a genomic region, and the first latent feature score.
  • As systematic variability and associated latent feature scores are typically nonuniformly distributed over the genome, modest such perturbation may be introduced by any query sample copy number alteration. Unacceptable perturbation may however occur when a genomic region is deleted or amplified in the query sample, and corresponds to a relatively large fraction of the plurality of genomic bins wherein the corresponding bins have a first latent feature score distribution deviating significantly from that of the plurality of bins.
  • the model will subsequently introduce error to the predicted systematic variability.
  • error may appear as noise disproportionally affecting genomic bins with relatively high or low first latent feature scores relative to the first latent feature score distribution of the plurality of bins, potentially leading to false positive or false negative observations of copy number alteration wherever on the reference genome a local set of bins feature a first latent feature score distribution that deviates significantly from that of other nearby bins.
  • the disclosed method will entail application of a backbone subset to remove or sufficiently reduce the effect of such a perturbation without any need to know in advance whether a perturbation would occur in a specific query sample, i.e.
  • a potential for such a perturbation is referred to as a perturbation effect, and can be determined, for a first latent feature score and for a genomic region corresponding to a set of genomic bins, from the first latent feature score of bins within and not within the genomic region.
  • a perturbation effect is determined from the distribution of the first latent feature score over the genomic bins corresponding to the currently assessed genomic region and from the distribution of the first latent feature score over all other bins of the plurality of genomic bins.
  • an unacceptable perturbation effect may be determined from an observation of perturbation-induced noise or bias in adjusted query sample DNA abundance measurements.
  • the disclosed method entails determining that a perturbation effect for a genomic region would be unacceptable, for example by quantifying the perturbation effect as a numeric value, and determining that the value is above an upper acceptable threshold or below a lower acceptable threshold, and defining a backbone subset such that when the perturbation effect is determined using the backbone subset of genomic bins, that perturbation effect is between the acceptable upper threshold and acceptable lower threshold.
  • Quantifying the perturbation effect for a genomic region and a first latent feature score may for example entail calculating a perturbation factor reflecting how much, for each unit of difference in DNA abundance measurement between genomic bins within and not within that genomic region, the slope of a least squares regression line between DNA abundance measurements and the first latent feature score would change, with the regression calculated using the plurality of genomic bins.
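One way to compute such a perturbation factor follows directly from the least squares slope formula. If the measurements inside the region are shifted by d, the slope changes by d multiplied by the sum of centred scores inside the region divided by the sum of squared centred scores over all bins. The sketch below implements that quantity; the function name `perturbation_factor` and the optional `weights` argument (used to see the effect of a candidate backbone subset) are assumptions for illustration.

```python
import numpy as np

def perturbation_factor(scores, in_region, weights=None):
    """Change in regression slope per unit shift of the region's bins.

    scores    : first latent feature score per bin
    in_region : 1 (or True) for bins inside the assessed region, else 0
    weights   : optional per-bin weights of a candidate backbone subset
    """
    x = np.asarray(scores, dtype=float)
    r = np.asarray(in_region, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    x_centred = x - np.average(x, weights=w)
    # (Weighted) covariance of score and region membership over the
    # (weighted) variance of the score.
    return np.sum(w * x_centred * r) / np.sum(w * x_centred ** 2)
```

A factor near zero means that an amplification or deletion of the region would barely move the fitted slope. Giving the region's bins a weight of zero drives the factor to zero.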
  • determining the perturbation effect may instead constitute observing that adjusted DNA abundance measurements of one or more query samples that feature unacceptable perturbation-induced noise when a backbone subset is not applied, instead feature acceptable or no perturbation-induced noise when that backbone subset is applied. It will be further appreciated that whether a perturbation effect is acceptable or unacceptable will depend on the requirements of the analysis performed with the DNA abundance measurements.
  • a perturbation to the extent predicted by the perturbation effect may be unacceptable because it is considered too likely to result in false positive, false negative or unclear results. In other applications that perturbation may be acceptable as it is considered sufficiently unlikely to result in false positive, false negative or unclear results.
  • applying this restriction to all genes may be impractical. Representation of variability, over features for which systematic variability (bias) may be present in the query sample, is important when building the statistical model, and some such features may also be systematically different between, for example, genomic bins in gene bodies and other genomic bins, motivating their representation when building the model. To attain a backbone subset with better representation of certain features typical to, for example, gene bodies, the restriction can instead be applied to just some genomic regions.
  • genomic regions where the restriction would be applied could be selected based on known, observed or suspected properties, with examples defined in the following paragraphs. It will be appreciated though that in the present disclosure, applying the restrictions described in any of these examples would result in a backbone subset as described only if at least one of the selected genomic regions would otherwise have an unacceptable perturbation effect for a corresponding first latent feature and first latent feature score as described, and that a backbone subset of bins, as well as a perturbation effect, thereby will correspond to at least a first latent feature and a first latent feature score.
  • a restriction could be applied where there is a relatively high probability of deletion or amplification in a query sample.
  • The relatively high probability may be equal to or greater than one, two, four, seven or ten percent, or the criterion may simply be that the gene is known or suspected to sometimes be affected by this type of alteration in query samples subjected to analysis.
  • Deletion or amplification could here be just focal deletions or amplifications, which are relatively small copy number alterations affecting one or a few genes. That is, the backbone subset of genomic bins may comprise no more than a predetermined number of genomic bins from each of selected genomic regions where a relatively strong signal, such as deletion or amplification, is anticipated to have a greater than, for example, one percent probability (or another percentage probability, as listed above) of being observed in the query sample.
  • A restriction could be applied where the latent feature score distribution for genomic bins located in a genomic region deviates by more than a predetermined amount from that of other genomic bins. That is, the backbone subset of genomic bins may comprise no more than a predetermined number of genomic bins from each of selected genomic regions in which the latent feature score distribution, for genomic bins located there, differs to a predetermined extent from that of other genomic bins.
  • the score distribution may be characterised in any of a large number of ways. For example, the distribution may be characterised by way of its mean or median or with a more complex metric.
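As one hypothetical example of such a characterisation, the sketch below compares the median score inside a region with the median score of all other bins, expressed in units of the median absolute deviation of the other bins. The metric, the function name `score_shift` and the scaling are assumptions chosen for illustration; any other summary of the two distributions could be substituted.

```python
import numpy as np

def score_shift(scores, in_region):
    """Median score difference between a region and all other bins,
    scaled by the median absolute deviation (MAD) of the other bins."""
    x = np.asarray(scores, dtype=float)
    r = np.asarray(in_region, dtype=bool)
    inside, outside = x[r], x[~r]
    mad = np.median(np.abs(outside - np.median(outside)))
    return (np.median(inside) - np.median(outside)) / mad
```

A region whose value exceeds a predetermined amount in either direction would then be a candidate for restriction.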
  • a restriction could similarly be applied where, for a sequence related feature such as GC content, the distribution deviates by more than a predetermined amount from that of other genomic bins.
  • a restriction could be applied where the copy number is influenced by a patient’s sex. For example, this may include genomic regions on the sex chromosomes. It will be appreciated though that in the present disclosure, applying a backbone subset as described entails applying at least one restriction to address a perturbation effect as described, and that regardless of whether restriction is applied to any genomic region that is located on a sex chromosome or otherwise associated with some reference sample copy number variation, the at least one restriction that will be applied to define a backbone subset as described will be applied specifically to address a perturbation effect as described.
  • The statistical model may be a regression model, devised to predict query sample DNA abundance measurements from a first latent feature score.
  • the statistical model may be, for example, a linear regression model that correlates the first principal component scores obtained from the multidimensional scaling (which may be plotted along the x-axis in a scatter plot) to the DNA abundance measurements (which may be plotted along the y-axis) for each genomic bin.
  • two parameters could be defined: a gradient, k, and an intercept, m, of the model such that a line is described that for any x value has a corresponding y value.
  • It will be appreciated that the intercept m can be assumed to be near zero, or even constrained to always be zero, in which case the gradient or slope k could alone represent the model. In such an embodiment, the latent feature scores may be devised such that k can be computed from them without explicit creation of a further described or implemented statistical model. Even in an embodiment where the model is thus represented implicitly by a set of model parameters such as k, the disclosed method would still specifically apply one or more backbone subsets of bins when computing them.
  • a linear regression model is one possible form that the statistical model may take. This may be computationally efficient while providing for good results.
  • the regression model may be a robust regression model and/or a more complex statistical model such as polynomial regression or LOESS regression.
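For the linear case, the gradient k and intercept m can be obtained with a weighted least squares fit in which the weights encode the backbone subset. The closed-form sketch below is illustrative; the function name `fit_backbone_line` and the argument names are assumptions.

```python
import numpy as np

def fit_backbone_line(scores, query, weights):
    """Weighted least squares fit of query abundance on latent feature score.

    scores  : first latent feature score per bin (x)
    query   : query sample DNA abundance measurement per bin (y)
    weights : backbone weights per bin (0 excludes a bin from the fit)
    Returns the gradient k and intercept m of the fitted line.
    """
    x = np.asarray(scores, dtype=float)
    y = np.asarray(query, dtype=float)
    w = np.asarray(weights, dtype=float)
    x_mean = np.average(x, weights=w)
    y_mean = np.average(y, weights=w)
    k = np.sum(w * (x - x_mean) * (y - y_mean)) / np.sum(w * (x - x_mean) ** 2)
    m = y_mean - k * x_mean
    return k, m
```

With all weights equal to one this reduces to ordinary least squares over the plurality of genomic bins.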
  • The model can be generalized, with notation denoting that the backbone subset of genomic bins and the scores corresponding to the first latent feature are used. [Equation not recoverable from the source text.]
  • The method further includes a step of estimating 205 a first systematic variability in the query sample DNA abundance measurements of each of the plurality of genomic bins by applying the statistical model to predict query sample DNA abundance measurements from the first latent feature scores. This is performed for all of the genomic bins of the plurality of genomic bins for which the analysis is performed, and not only for the backbone subset of genomic bins used to create the statistical model.
  • the disclosed method would still specifically apply one or more backbone subsets of bins when computing systematic variability. It will be further appreciated that when creating the statistical model, application of a backbone subset of bins that features weights other than zero and one will require the model to be a weighted model, or, if the model is implicit, a weighted calculation of model parameters or systematic variability estimates.
  • the method comprises a step of obtaining 206 adjusted query sample DNA abundance measurements for each genomic bin using the DNA abundance measurements of the query sample and their associated estimated first systematic variability. This step may comprise, for example, subtracting the first systematic variability estimate of each genomic bin from the query sample DNA abundance measurements for the corresponding genomic bin.
  • the query sample DNA abundance measurements may be adjusted by a mathematical technique other than subtraction.
  • the adjustment of the DNA abundance measurements will generally be performed in order to reduce or eliminate the systematic variability arising from a given latent feature.
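The estimation and adjustment steps can be sketched together as follows: the model is fitted with backbone weights, then applied to every bin, and the prediction is subtracted. This is a simplified illustration assuming a linear model and log-scale measurements; the function name `adjust_query` is hypothetical. Note that `numpy.polyfit` applies its `w` argument to the unsquared residuals, so the square root of the backbone weights is passed.

```python
import numpy as np

def adjust_query(scores, query, weights):
    """Estimate systematic variability for all bins and subtract it.

    The line is fitted using the backbone weights only, but the
    prediction and the subtraction cover every bin, including bins
    with weight zero.
    """
    x = np.asarray(scores, dtype=float)
    y = np.asarray(query, dtype=float)
    w = np.asarray(weights, dtype=float)
    k, m = np.polyfit(x, y, deg=1, w=np.sqrt(w))
    systematic = k * x + m      # estimated systematic variability per bin
    return y - systematic, systematic
```

Because the restricted bins do not train the model, a true copy number signal in those bins remains in the adjusted values instead of being absorbed into the estimated systematic variability.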
  • The steps of: using at least a backbone subset of the genomic bins, creating a statistical model of the dependency of query sample DNA abundance measurements on the first latent feature scores; estimating a first systematic variability in query sample DNA abundance measurements of each genomic bin by applying the statistical model to predict query sample DNA abundance measurements from the first latent feature scores; and obtaining adjusted query sample DNA abundance measurements for each genomic bin using the DNA abundance measurements of the query sample and their associated estimated first systematic variability, are performed for each of multiple latent features and latent feature scores of the plurality of latent features.
  • It will be appreciated that estimation of systematic variability is performed with the application of a backbone subset of bins for at least a first, but potentially also a second, several or all of the latent features. What constitutes a backbone subset of bins with respect to a first latent feature and a first latent feature score may or may not constitute a backbone subset of bins with respect to a second latent feature and a second latent feature score. Application of dissimilar backbone subsets of bins for a first and second latent feature may constitute efficient use of available data, by applying restriction to genomic regions only to the extent motivated by perturbation effects with respect to each latent feature.
  • multiple latent features can be processed simultaneously such that systematic variability is estimated and adjusted for using a first and second latent feature at the same time, using a multivariate rather than univariate model.
  • It will be appreciated that a backbone subset applied will be a backbone subset for at least a first latent feature and a first latent feature score. What constitutes a backbone subset for a first latent feature may or may not constitute a backbone subset for a second latent feature, and application of dissimilar backbone subsets of bins for a first and second latent feature may or may not be practical in that embodiment.
  • the statistical model used may be a multivariate model which simultaneously estimates and adjusts for systematic variability corresponding to the two or more latent features.
  • a multivariate statistical model may be used for any number of latent features that is computationally affordable.
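A multivariate variant can be sketched with a weighted linear least squares fit over several latent feature scores at once. The function name `adjust_multivariate`, the inclusion of an intercept column and the use of a single shared weight vector for all latent features are assumptions for the example.

```python
import numpy as np

def adjust_multivariate(score_matrix, query, weights):
    """Adjust for several latent features simultaneously.

    score_matrix : (n_bins, n_features) latent feature scores
    query        : (n_bins,) query sample DNA abundance measurements
    weights      : (n_bins,) backbone weights shared by all features
    Returns the adjusted measurements and the fitted coefficients
    (intercept first).
    """
    scores = np.asarray(score_matrix, dtype=float)
    y = np.asarray(query, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    design = np.column_stack([np.ones(scores.shape[0]), scores])
    # Scaling rows by sqrt(weight) turns ordinary into weighted least squares.
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    systematic = design @ coef
    return y - systematic, coef
```

A single weight vector is the simplest choice here; it illustrates why applying dissimilar backbone subsets per latent feature may be less practical in a multivariate model than in a sequence of univariate models.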
  • The method may include a plurality of steps required to obtain the DNA abundance measurements of the query sample. It will be appreciated, however, that these steps are not essential to the method, as the DNA abundance measurements can be obtained from another source, such as another lab, without the person performing the method of the present disclosure obtaining these themselves.
  • the method may comprise, prior to receiving the DNA abundance measurements of the query sample, performing one or more preparatory steps on the DNA query sample to prepare the DNA query sample for DNA sequencing.
  • the method may also comprise performing DNA sequencing on the prepared DNA query sample in order to obtain the DNA abundance measurements of the plurality of genomic bins of the query sample.
  • the method may further comprise a step of enriching the query DNA sample for DNA corresponding to some or all of the plurality of genomic bins.
  • the method may comprise a plurality of steps required to obtain the DNA abundance measurements of the one or more reference samples. It will be appreciated, however, that these steps are not essential to the method, as the DNA abundance measurements can be obtained from another source, such as another lab or reference library, without the person performing the method of the present disclosure obtaining these themselves.
  • the method may further comprise, after the step of performing multidimensional scaling on the at least one reference sample DNA abundance measurements, for each genomic bin, determining if a latent feature score of the genomic bin is outside of a predetermined acceptable band; and, if the latent feature score is outside of the predetermined acceptable band, removing the genomic bin and associated query and reference DNA abundance measurements from further analysis.
  • the method may further comprise repeating the step of performing multidimensional scaling and proceeding with the analysis as described above using only the retained genomic bins. This approach may allow for the removal of outlier genomic bins which may otherwise introduce errors or bias into the multidimensional scaling approach, its resulting latent feature scores and the subsequent statistical models.
  • Such a filtering step may identify and remove from further analysis genomic bins affected by copy number variation in the reference sample set. Such variation could otherwise result in, for example, a principal component being primarily associated with copy number variation rather than with systematic variability as required for a latent feature in the present disclosure, and such a principal component would not constitute an appropriate latent feature as described. In a preferred embodiment, bins where reference sample DNA abundance measurements feature a signal, for example on sex chromosomes or at copy number polymorphisms, are adjusted for prior to multidimensional scaling as described, or filtered from further analysis as described here.
  • Some reference biological samples from which DNA abundance measurements of the at least one reference sample are obtained may be body fluid samples comprising cell-free DNA.
  • A query biological sample from which the DNA abundance measurements of the query sample are obtained may be a body fluid sample comprising cell-free DNA.
  • The query biological sample may be a body fluid sample comprising circulating tumour DNA.
  • Figure 3 shows an example method 300 of performing copy number analysis which comprises performing 301 the steps of the method of figure 2 followed by determining 302 copy number alterations in at least one genomic region of the query sample based on the adjusted query sample DNA abundance measurements. The method may further comprise using all or a subset of the copy number estimates or alterations, or corrected values, for determining a copy number estimate or alteration in at least one defined genomic location in a source material or patient corresponding to the query biological sample.
  • In some embodiments, the DNA abundance measurements are sequence read counts.
  • the method comprises a) calculating, for each bin of a plurality of bins representing different nucleic acid sequences (the genomic bins) and for each sequenced reference nucleic acid library of a plurality of sequenced reference nucleic acid libraries (the reference samples) from reference biological samples comprising genomic material apparently not harbouring copy-number alterations relevant for the copy-number analysis, a reference value based on a ratio between i) a number of sequence reads mapping to or matching the bin (DNA abundance measurements) and ii) a point estimate representative of the distribution of the number of sequence reads from the sequenced reference nucleic acid library mapping to each of all or a subset S (a backbone subset) of the plurality of bins.
  • The method further comprises calculating, for each bin of the plurality of bins and for a sequenced query nucleic acid library prepared from a query biological sample, a value $V_0$ based on a ratio between i) a number of sequence reads mapping to the bin and ii) a point estimate representative of the distribution of the number of sequence reads from the sequenced query nucleic acid library mapping to each of all or the subset $S$ of the plurality of bins.
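The ratio described above can be sketched with the median as the point estimate. The function name `normalised_values` and the choice of the median are assumptions; the text only requires a point estimate representative of the read count distribution over all bins or over the subset S. The same calculation applies per reference library to obtain the reference values.

```python
import numpy as np

def normalised_values(read_counts, subset_mask=None):
    """Per-bin read count divided by a point estimate over all bins or S.

    read_counts : number of sequence reads mapping to each bin
    subset_mask : optional boolean mask selecting the subset S
    """
    counts = np.asarray(read_counts, dtype=float)
    if subset_mask is None:
        basis = counts
    else:
        basis = counts[np.asarray(subset_mask, dtype=bool)]
    # The median is used here as the point estimate (an assumption).
    return counts / np.median(basis)
```

Restricting the point estimate to the subset S keeps a strongly amplified region from distorting the normalisation of all other bins.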
  • the method further comprises applying dimensional reduction analysis, and preferably principal component analysis (PCA), to the reference values to calculate, for each bin of the plurality of bins, a score for each principal component wherein the principal components represent latent features affecting sequence coverage and each score quantifies an effect of the latent feature represented by the principal component on the reference values for the bin. That is, this step is the step of performing multidimensional scaling in order to obtain the latent features and the corresponding latent feature scores.
  • PCA: principal component analysis
  • the method yet further comprises: for each i ∈ 1...k of a set of subsets 1...k, each comprising one or more of the principal components: estimating, for the query sample and for each bin of the plurality of bins, a systematic variability Ei based on the scores for principal components PCi, using a univariate or multivariate regression analysis model Mi generated from all or the subset S of the plurality of bins and their scores for principal components PCi and their values V0 or previously corrected values CVi-1; calculating, for the query sample and for each bin of the plurality of bins, a corrected value CVi based on its estimated systematic variability Ei and its value V0 or previously corrected value CVi-1; and determining, for at least one bin of the plurality of bins, a copy-number estimate or alteration based on the corrected values CVk.
  • results from an implementation of the present disclosure are compared with two references: (i) results labelled as the disclosed method but with systematic variability estimated using all data points, produced by a reference method that is otherwise identical but does not apply a backbone set of bins as described; and (ii) results labelled as conventional correction, produced by a reference method with identical bins applied and retained. The disclosed method improves results relative to both the otherwise identical reference algorithm and conventional correction.
  • Figure 5 shows experimental results which demonstrate the efficacy of the disclosed method on a set of genomic bins of a single query sample, using targeted sequencing of a prostate cancer sample.
  • Genomic bins corresponding to chromosome 13 are plotted in order on the x axis, with log2-transformed DNA abundance measurements on the y axis that were adjusted for systematic variability using 500) conventional correction where the estimated systematic variability is, for each data point, the median over reference samples; 501) the disclosed method, but with systematic variability estimated using all data points; 502) the disclosed method, with systematic variability estimated using a selected backbone subset of data points as described. Variability is reduced, without reducing signal, using the disclosed method, and further reduced using the selected backbone subset of data points for estimating systematic variability, improving the observability of a signal 503 indicating a deletion.
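The steps described above (ratio to a backbone point estimate, PCA on the reference values, regression on the per-bin scores, and correction) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function names `normalise`, `bin_scores` and `correct` are hypothetical, the point estimate is assumed to be the median, values are assumed to be log2-transformed with a 0.5 pseudocount, the regression is assumed to be ordinary least squares, and the corrected value is assumed to be the previous value minus the estimated systematic variability. The backbone subset is simply given here; how it is selected is not covered by the sketch.

```python
import numpy as np


def normalise(counts, backbone):
    """Log2 ratio of each bin's read count to the backbone median of its sample.

    counts: (n_bins,) or (n_bins, n_samples) read counts.
    backbone: boolean mask of length n_bins selecting the subset S.
    The median and the 0.5 pseudocount are assumptions of this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    point_estimate = np.median(counts[backbone], axis=0)
    return np.log2((counts + 0.5) / point_estimate)


def bin_scores(ref_values, n_components):
    """PCA with bins as observations; returns one score per bin per component."""
    centred = ref_values - ref_values.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def correct(query_values, scores, backbone, pc_subsets):
    """Iteratively remove the variability explained by each subset of components."""
    corrected = np.array(query_values, dtype=float)  # starts as V0
    for pcs in pc_subsets:
        design = np.column_stack([np.ones(len(scores)), scores[:, pcs]])
        # Fit the model on backbone bins only, then predict E_i for every bin.
        coef, *_ = np.linalg.lstsq(design[backbone], corrected[backbone], rcond=None)
        estimate = design @ coef
        corrected = corrected - estimate  # CV_i = CV_(i-1) - E_i (assumed form)
    return corrected


# Synthetic demonstration: two latent per-bin biases shared by all libraries.
rng = np.random.default_rng(0)
n_bins, n_refs = 200, 10
g1 = rng.normal(scale=0.5, size=n_bins)
g2 = rng.normal(scale=0.5, size=n_bins)
w1 = np.linspace(0.2, 1.0, n_refs)   # per-reference strength of bias 1
w2 = np.linspace(-0.5, 0.5, n_refs)  # per-reference strength of bias 2
ref_counts = 1000.0 * 2.0 ** (np.outer(g1, w1) + np.outer(g2, w2))
query_counts = 1000.0 * 2.0 ** (0.8 * g1 - 0.4 * g2)
query_counts[150:170] *= 0.5          # simulated loss: signal halved in 20 bins

backbone = np.ones(n_bins, dtype=bool)
backbone[150:170] = False             # altered bins are kept out of the fit

scores = bin_scores(normalise(ref_counts, backbone), n_components=2)
corrected = correct(normalise(query_counts, backbone), scores, backbone, [[0, 1]])
print(corrected[backbone].std(), corrected[~backbone].mean())
```

Passing `[[0, 1]]` fits one multivariate model on both components; passing `[[0], [1]]` would instead apply two univariate models in sequence, each operating on the previously corrected values. Because the model is fitted only on the backbone bins, the simulated loss does not pull the fit towards itself and remains visible after correction.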

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a method for identifying and adjusting systematic variability in a set of DNA abundance measurements of a query sample, the method comprising: receiving DNA abundance measurements of both the query sample and at least one reference sample; performing multidimensional scaling on the reference DNA abundance measurements to obtain a first latent feature and a first latent feature score for each genomic bin of the plurality of genomic bins; creating a statistical model of the dependence of query sample DNA abundance measurements on the first latent feature scores; estimating a first systematic variability in query sample DNA abundance measurements by applying the statistical model to predict query sample DNA abundance measurements from the first latent feature scores; and obtaining adjusted query sample DNA abundance measurements for each genomic bin using the DNA abundance measurements of the query sample and their associated estimated first systematic variability.
PCT/SE2024/050638 2023-06-26 2024-06-26 Method for identifying and adjusting systematic variability in DNA abundance measurements Pending WO2025005860A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE2350786 2023-06-26
SE2350786-6 2023-06-26

Publications (1)

Publication Number Publication Date
WO2025005860A1 (fr) 2025-01-02

Family

ID=91856215

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2024/050638 Pending WO2025005860A1 (fr) 2023-06-26 2024-06-26 Method for identifying and adjusting systematic variability in DNA abundance measurements

Country Status (1)

Country Link
WO (1) WO2025005860A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014190286A2 (fr) * 2013-05-24 2014-11-27 Sequenom, Inc. Methods and systems for non-invasive assessment of genetic variations
US20180032666A1 (en) * 2016-07-27 2018-02-01 Sequenom, Inc. Methods for Non-Invasive Assessment of Genomic Instability
WO2018140521A1 (fr) * 2017-01-24 2018-08-02 Sequenom, Inc. Methods and processes for assessment of genetic variations
US20180327844A1 (en) * 2015-11-16 2018-11-15 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations

Similar Documents

Publication Publication Date Title
US11568957B2 (en) Methods and systems for copy number variant detection
US20220130488A1 (en) Methods for detecting copy-number variations in next-generation sequencing
CN103201744B Method for estimating genome-wide copy number variation
Ronen et al. netSmooth: Network-smoothing based imputation for single cell RNA-seq
US7937225B2 (en) Systems, methods and software arrangements for detection of genome copy number variation
CN109949861B Tumor mutation burden detection method, device and storage medium
US20160117444A1 (en) Methods for determining absolute genome-wide copy number variations of complex tumors
WO2019213811A1 (fr) Method, apparatus and system for detecting chromosomal aneuploidy
US20200082910A1 (en) Systems and Methods for Determining Effects of Genetic Variation of Splice Site Selection
US20140336950A1 (en) Clustering copy-number values for segments of genomic data
CN111210873A Copy number variation detection method and system based on exome sequencing data, terminal and storage medium
WO2025005860A1 (fr) Method for identifying and adjusting systematic variability in DNA abundance measurements
CN109390039B Method, device and storage medium for counting DNA copy number information
US20200105374A1 (en) Mixture model for targeted sequencing
Dutheil et al. Optimization of sequence alignments according to the number of sequences vs. number of sites trade-off
Yang et al. Improved detection algorithm for copy number variations based on hidden Markov model
Chen Statistical considerations on NGS data for inferring copy number variations
CN117497056B Control-free HRD detection method, system and device
Shen et al. Algorithm for DNA copy number variation detection with read depth and paramorphism information
O’Fallon et al. Algorithmic improvements for discovery of germline copy number variants in next-generation sequencing data
Alsaedi Evaluating the Application of Allele Frequency in the Saudi Population Variant Detection
Deshpande A new computational framework for the classification and function prediction of long non-coding RNAs
Huang Novel computational methods for transcript reconstruction and quantification using rna-seq data
SK882023A3 Methods and system for detecting microsatellite instability from sequenced cell-free circulating DNA
CN120966967A Method for dynamically setting the limit of blank for short sequence variants

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24740247

Country of ref document: EP

Kind code of ref document: A1