WO2025226580A1 - Methods and systems for hla loss determination - Google Patents
Methods and systems for HLA loss determination
- Publication number
- WO2025226580A1 (PCT/US2025/025565)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- hla
- allele
- tumor
- normal
- sequence reads
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6881—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for tissue or cell typing, e.g. human leukocyte antigen [HLA] probes
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/50—Mutagenesis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Definitions
- The present disclosure relates generally to methods for evaluating human leukocyte antigen (HLA) loss. More specifically, this disclosure provides methods and systems for detecting HLA loss (e.g., HLA loss of heterozygosity (HLA LOH)) in a subject based on an analysis of (i) sequence read ratios for a first and second HLA allele in a tumor sample, and (ii) sequence read ratios for the first and second HLA allele in a normal sample from the same subject.
- HLA genes are genes that form part of the major histocompatibility complex (MHC) - a collection of closely-linked polymorphic genes that encode the cell surface proteins involved in the adaptive immune system. They play a significant role in disease and immune defense.
- HLA genes include the HLA Class I (HLA-I) genes (i.e., HLA-A, HLA-B, and HLA-C genes) and HLA Class II (HLA-II) genes (i.e., the HLA-DR, HLA-DQ, and HLA-DP genes).
- HLA loss is the absence or decrease of HLA allelic presence or expression due to genetic modification, epigenetic modification, and/or indirect regulation.
- HLA LOH can indicate the loss of a functional HLA gene allele due to a genetic modification (also referred to as a hard modification).
- HLA loss (including HLA LOH) is often associated with unhealthy or diseased tissue, e.g., a tumor, and may contribute to, for example, immune system evasion by cancer and/or to therapeutic resistance on the part of cancer cells (e.g., due to the loss or reduced availability of a vehicle (e.g., an MHC complex) for presenting a therapeutic antigen (e.g., a neoantigen) on the cell surface).
- Loss of one HLA gene allele through deletion or mutation can contribute to immune system evasion because each allele routinely serves a different function in presenting distinct tumor-associated neoantigens to the extracellular immune surveillance system.
- the T1/T2 sequence read count ratio should decrease accordingly. Since a diploid organism is expected to have one copy of each gene allele, the expected allelic ratio is 1.0, regardless of sequencing depth. It is possible that one gene allele is more easily sequenced than another gene allele (resulting in a sequencing bias for one allele over the other), which would cause a deviation away from the expected value of 1.0.
- the two HLA gene alleles in tumor are derived from the same pair of gene alleles that occur in the normal sample, so that the sequence read count ratio for the two alleles in the normal sample can serve as a baseline or control for the sequence read count ratio in the tumor sample for each HLA gene that is free of sequencing bias artifacts.
- An analysis (e.g., a statistical analysis) is performed to determine if the difference between an observed T1/T2 ratio and a baseline value for a given HLA gene is statistically significant.
- the baseline value can be determined by measuring N1/N2 values for a plurality of non-HLA genomic loci (e.g., a plurality of single nucleotide polymorphism (SNP) loci), fitting the N1, N2, T1, and T2 values determined for the plurality of non-HLA genomic loci (e.g., SNP loci) to a multinomial model, estimating count overdispersion, adjusting standard errors with that overdispersion, and computing a p-value for the difference between the observed T1/T2 value and the baseline.
- the disclosed systems and methods enable improved accuracy in detecting HLA LOH (including for tumor samples of lower purity, and for WES sequence read data) due to: (i) reduced sensitivity to sequencing depth variations, (ii) reduced sensitivity to differential sequence capture for the two HLA gene alleles by bait molecules, (iii) improved alignment of sequence reads to HLA gene alleles based on the use of intron-level identifiers, and (iv) novel methods for efficient computation of allelic ratios at multiple genomic loci (e.g., SNP loci).
- the methods comprising: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a plurality of sequence reads derived from a normal sample from the subject (paired tumor and normal samples); receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA
- the HLA alteration is an HLA loss of heterozygosity. In some embodiments, the HLA alteration is an HLA copy number change. In some embodiments, the HLA alteration is an HLA imbalance.
- the plurality of sequence reads is derived by sequencing nucleic acid molecules extracted from the paired tumor and normal samples using a whole exome sequencing (WES) technique.
- detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value.
- the expected value is the normal allelic ratio.
- detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of a log ratio of the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene from an expected value.
- detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the log ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene from an expected value.
- detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation from an expected value of a log odds ratio corresponding to the log ratio of (i) the ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene and (ii) the ratio of the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene.
- Performing a statistical analysis may comprise comparing a log ratio or log odds ratio to an expected value under a null hypothesis, wherein said comparison comprises computing a test statistic and/or comparing a confidence interval around said log ratio or log odds ratio to a predetermined threshold associated with a null hypothesis.
- the null hypothesis corresponds to a log ratio and/or a log odds ratio of 0 or a log ratio corresponding to a baseline estimate.
- a baseline estimate may be obtained using counts from heterozygous SNPs at non-HLA loci.
- an expected value for a log ratio corresponds to an absence of imbalance between the tumor and normal samples.
- an expected value for the log odds ratio corresponds to a lack of HLA allelic imbalance.
- the expected value for the log odds ratio corresponds to a log odds ratio of 0.
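The log odds ratio described above can be sketched in a few lines. This is a minimal illustration, not the applicant's implementation: the function name `log_odds_ratio`, the example counts, and the use of the textbook Wald standard error for a 2 x 2 table (the square root of the summed reciprocal counts) are assumptions introduced here.

```python
import math


def log_odds_ratio(t1, t2, n1, n2):
    """Return log((T2/T1) / (N2/N1)) and its unadjusted Wald standard error.

    t1, t2: unique tumor-derived read counts for alleles 1 and 2.
    n1, n2: unique normal-derived read counts for alleles 1 and 2.
    The Wald standard error is an assumption of this sketch.
    """
    counts = (t1, t2, n1, n2)
    if min(counts) <= 0:
        raise ValueError("all four read counts must be positive")
    lor = math.log(t2 / t1) - math.log(n2 / n1)
    se = math.sqrt(sum(1.0 / c for c in counts))
    return lor, se


# Hypothetical counts: a balanced tumor, and a tumor that lost half of its allele 2 signal.
balanced_lor, balanced_se = log_odds_ratio(t1=400, t2=400, n1=500, n2=500)
loss_lor, loss_se = log_odds_ratio(t1=400, t2=200, n1=500, n2=500)
```

A log odds ratio of 0 corresponds to the null hypothesis of no allelic imbalance; in this sketch, a negative value indicates relative loss of allele 2 in the tumor.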
- the statistical analysis comprises: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; determining, based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele and a number of unique tumor-derived sequence reads for a second SNP allele for each of one or more of the plurality of heterozygous SNP loci; determining, based on the sequence read data, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele for each of one or more of the plurality of heterozygous SNP loci; and estimating a degree of overdispersion in sequence read counts based on fitting the number of unique normal-derived sequence reads for the
- the method further comprises determining: a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci, and a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci.
- the method further comprises determining a baseline log ratio for each of the first and second SNP alleles of each of the plurality of heterozygous SNP loci.
- estimating a degree of overdispersion in sequence read counts comprises calculating an overdispersion statistic based on the comparison of the observed counts at the plurality of heterozygous SNPs to corresponding expected sequence read counts based on fitting the multinomial model.
- calculating an overdispersion statistic comprises calculating a χ² statistic.
- calculating an overdispersion statistic comprises calculating a residual deviance (D) statistic.
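One way the two overdispersion statistics could be computed is sketched below. The source does not spell out the multinomial model, so this sketch assumes the simplest null: at each heterozygous SNP locus, the allele split is independent of sample type, so the expected counts for the 2 x 2 table come from its margins. The function names and the choice of one degree of freedom per locus are assumptions of this example.

```python
import math


def expected_counts(t1, t2, n1, n2):
    """Expected (T1, T2, N1, N2) when allele and sample type are independent."""
    total = t1 + t2 + n1 + n2
    tumor, normal = t1 + t2, n1 + n2
    allele1, allele2 = t1 + n1, t2 + n2
    if min(tumor, normal, allele1, allele2) <= 0:
        raise ValueError("every row and column total must be positive")
    return (tumor * allele1 / total, tumor * allele2 / total,
            normal * allele1 / total, normal * allele2 / total)


def overdispersion(loci):
    """Return (Pearson chi-squared / dof, residual deviance / dof) over all loci.

    loci: iterable of (t1, t2, n1, n2) count tuples, one per heterozygous SNP.
    Values near 1 suggest multinomial-like noise; larger values suggest overdispersion.
    """
    chi2 = 0.0
    deviance = 0.0
    dof = 0
    for observed in loci:
        expected = expected_counts(*observed)
        for o, e in zip(observed, expected):
            chi2 += (o - e) ** 2 / e
            if o > 0:  # a zero count contributes nothing to the deviance
                deviance += 2.0 * o * math.log(o / e)
        dof += 1  # assumption: one degree of freedom per 2 x 2 table
    if dof == 0:
        raise ValueError("at least one locus is required")
    return chi2 / dof, deviance / dof


# Hypothetical single-locus inputs used only to exercise the function.
phi_balanced, dev_balanced = overdispersion([(50, 50, 50, 50)])
phi_skewed, dev_skewed = overdispersion([(60, 40, 50, 50)])
```

In practice the statistic would be pooled over thousands of loci; the single-locus calls here only show the arithmetic.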
- the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log ratio of the tumor allelic ratio and normal allelic ratio.
- the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log ratio of the second allele ratio (T2/N2) and first allele ratio (T1/N1). In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a log of the product of the tumor allelic ratio and the inverse of the normal allelic ratio (log(T2/T1*N1/N2)) or the difference between the second allele log ratio and the first allele log ratio (log(T2/N2)-log(T1/N1)). In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a first allele log ratio (log(T1/N1)). In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a second allele log ratio (log(T2/N2)).
- the method further comprises using the adjusted standard error for the log odds ratio to adjust a p-value or confidence interval for the detected HLA alteration.
- the method comprises using the adjusted standard error for the log odds ratio to determine a p-value for a test statistic associated with a test that the observed log odds ratio is different from a null hypothesis associated with absence of HLA imbalance.
- the method comprises using the adjusted standard error for the log odds ratio to determine a confidence interval for the log odds ratio.
- the method comprises using the adjusted standard error for the log first ratio to determine a confidence interval for the log odds ratio.
- the method comprises using the adjusted standard error for the first allele log ratio to determine a p-value for a test statistic associated with a test that the observed log ratio is different from a null hypothesis associated with absence of difference between the tumor and normal counts at the first allele. In some embodiments, the method comprises using the adjusted standard error for the second allele log ratio to determine a p-value for a test statistic associated with a test that the observed log ratio is different from a null hypothesis associated with absence of difference between the tumor and normal counts at the second allele. In some embodiments, the method comprises using the adjusted standard error for the first allele log ratio to determine a confidence interval around the first allele log ratio. In some embodiments, the method comprises using the adjusted standard error for the second allele log ratio to determine a confidence interval around the second allele log ratio.
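The adjustment of standard errors, p-values, and confidence intervals described above might look like the following sketch. It assumes the common quasi-likelihood convention of scaling the standard error by the square root of the dispersion estimate and a normal (Wald) approximation for the test; the function name `adjusted_inference`, the flooring of the dispersion at 1, and the example numbers are all assumptions rather than details from the source.

```python
import math
from statistics import NormalDist


def adjusted_inference(log_ratio, se, phi, null_value=0.0, level=0.95):
    """Return (adjusted SE, two-sided p-value, confidence interval).

    log_ratio: observed log ratio or log odds ratio.
    se: unadjusted standard error of that estimate.
    phi: estimated overdispersion (e.g., chi-squared divided by degrees of freedom).
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    # Assumption: never shrink the standard error when phi < 1.
    adj_se = se * math.sqrt(max(phi, 1.0))
    z = (log_ratio - null_value) / adj_se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) * adj_se
    return adj_se, p_value, (log_ratio - half_width, log_ratio + half_width)


# Hypothetical estimate: log odds ratio of log(0.5) with unadjusted SE 0.11.
se_plain, p_plain, ci_plain = adjusted_inference(math.log(0.5), 0.11, phi=1.0)
se_over, p_over, ci_over = adjusted_inference(math.log(0.5), 0.11, phi=4.0)
```

With a larger dispersion estimate the interval widens and the p-value grows, so the same observed ratio provides weaker evidence of imbalance.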
- the plurality of heterozygous SNP loci comprises at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci. In some embodiments, the plurality of heterozygous SNP loci are selected from a predetermined set of known SNPs. In some embodiments, the plurality of heterozygous SNP loci are selected from a predetermined set of known SNPs for which there is at least a predetermined number of reads in the normal sample.
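Selecting heterozygous SNP loci from a predetermined set with a minimum read depth in the normal sample could be sketched as below. The data layout (a dictionary from SNP identifier to reference and alternate read counts), the function name, the default depth of 20 reads, and the 0.3 to 0.7 allele-fraction window used to call a locus heterozygous are all hypothetical choices, not values from the source.

```python
def select_heterozygous_snps(normal_counts, known_snps, min_reads=20,
                             het_range=(0.3, 0.7)):
    """Return the known SNP ids that look heterozygous in the normal sample.

    normal_counts: dict mapping SNP id -> (ref_reads, alt_reads) in the normal sample.
    known_snps: iterable of SNP ids from the predetermined set.
    """
    low, high = het_range
    selected = []
    for snp_id in known_snps:
        ref_reads, alt_reads = normal_counts.get(snp_id, (0, 0))
        depth = ref_reads + alt_reads
        if depth == 0 or depth < min_reads:
            continue  # too few reads in the normal sample
        alt_fraction = alt_reads / depth
        if low <= alt_fraction <= high:
            selected.append(snp_id)
    return selected


# Hypothetical SNP ids and counts.
example_counts = {
    "snp_a": (30, 28),   # heterozygous, well covered
    "snp_b": (55, 1),    # homozygous reference
    "snp_c": (5, 4),     # heterozygous but below the depth threshold
}
kept = select_heterozygous_snps(example_counts, ["snp_a", "snp_b", "snp_c", "snp_d"])
```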
- the plurality of heterozygous SNP loci is filtered to remove artifactual SNP loci resulting from misalignment of sequence reads to the non-HLA region of the subject’s genome.
- the plurality of heterozygous SNP loci are detected by aligning the sequence read data to a reference sequence using a SNP tolerant alignment method.
- the SNP tolerant alignment method uses a predetermined set of known SNPs.
- the plurality of heterozygous SNP loci are detected using a non-sorting-based method for de-duplicating and tallying sequence read counts.
- the non-sorting-based method for de-duplicating and tallying sequence read counts comprises: performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index.
- the at least one HLA gene comprises HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DQ, HLA-DP, or any combination thereof.
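The two-scan, non-sorting de-duplication described above can be illustrated with a hash-based index. This is a sketch under stated assumptions: reads are represented as hypothetical `(chrom, start, strand, allele)` tuples, two reads count as duplicates when they share chromosome, start position, and strand, and the first read seen at a position is the one kept. The source only states that genomic positions are stored in an index and that duplicates are identified from it.

```python
from collections import Counter


def tally_unique_reads(reads):
    """Count non-duplicate reads per allele using two linear scans and no sort.

    reads: list of (chrom, start, strand, allele) tuples for aligned reads.
    """
    # Scan 1: record the first read index seen at each alignment position.
    first_seen = {}
    for i, (chrom, start, strand, _allele) in enumerate(reads):
        first_seen.setdefault((chrom, start, strand), i)

    # Scan 2: a read is a duplicate if an earlier read owns its position.
    tally = Counter()
    for i, (chrom, start, strand, allele) in enumerate(reads):
        if first_seen[(chrom, start, strand)] == i:
            tally[allele] += 1
    return tally


# Hypothetical aligned reads; the second read duplicates the first.
example_reads = [
    ("chr6", 100, "+", "allele_1"),
    ("chr6", 100, "+", "allele_1"),
    ("chr6", 100, "-", "allele_1"),
    ("chr6", 150, "+", "allele_2"),
]
unique_counts = tally_unique_reads(example_reads)
```

Because the index is a hash map, both scans run in time linear in the number of reads, which avoids the cost of sorting alignments by coordinate.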
- the subject-specific reference sequence for the HLA region of the subject’s genome is generated by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to the HLA region of a reference genome sequence.
- the distribution of sequence reads aligned to the HLA region of the reference genome includes sequence reads that align to exons in the HLA region of the reference genome sequence. In some embodiments, the distribution of sequence reads aligned to the HLA region of the reference genome sequence includes sequence reads that partially align to introns of the HLA region.
- detecting an HLA loss of heterozygosity for the at least one HLA gene does not require a determination of copy number for the at least one HLA gene.
- the paired tumor and normal samples comprise paired tumor and normal surgical resection samples. In some embodiments, the paired tumor and normal samples comprise paired tumor and normal tissue biopsy samples.
- the method further comprises diagnosing or confirming a diagnosis of a disease based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises identifying the subject for treatment of a disease based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises identifying a treatment for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises predicting a clinical outcome for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the disease is a cancer.
- systems comprising: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform any of the methods described herein.
- Disclosed herein are systems comprising: a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject; one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform any of the methods described herein.
- Non-transitory computer-readable storage media storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform any of the methods described herein.
- FIG. 1 provides a schematic diagram illustrating a non-limiting example of an evaluation system for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIG. 2 provides a non-limiting example of a process for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIG. 3 provides a schematic diagram illustrating HLA loss of heterozygosity.
- FIG. 4 provides a flow diagram illustrating an example of a process for quantifying allele-read alignments for HLA genes, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIGS. 5A-5D provide non-limiting examples of WES sequence read alignments for the HLA B*51:01:01:01 allele (Allele 1) and the HLA B*07:02:01:01 allele (Allele 2) for paired normal and tumor samples, where the tumor sample exhibits a loss of heterozygosity.
- FIG. 5A provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus in the normal sample.
- FIG. 5B provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus in the normal sample.
- FIG. 5C provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus in the tumor sample.
- FIG. 5D provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus in the tumor sample.
- FIGS. 6A-6B provide non-limiting schematic illustrations that compare prior copy number-based methods for detecting HLA LOH (FIG. 6A) to the allelic ratio-based method (FIG. 6B) described herein.
- FIGS. 7A-7D provide schematic illustrations of 2 x 2 contingency tables for sequence read data (e.g., HLA sequence read data) that might be expected for different genetic modifications to a tumor genome.
- FIG. 7A provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in normal and tumor samples.
- FIG. 7B provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a double deletion.
- FIG. 7C provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a single deletion.
- FIG. 7D provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a copy neutral loss of heterozygosity.
- FIGS. 8A-8C provide non-limiting examples of 2 x 2 contingency tables for HLA sequence read counts that were observed for cells from ovarian cell line 23882 that exhibited different genetic modifications in a tumor genome.
- FIG. 8A provides sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA A*26:01 allele.
- FIG. 8B provides sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA B*35:01 allele.
- FIG. 8C provides sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA C*04:01 allele.
- FIGS. 9A-9B provide non-limiting schematic illustrations of the HLA (FIG. 9A) and non-HLA (FIG. 9B) genomic loci used to determine sequence read counts and evaluate allelic ratios for a subject, which are then used as input for a statistical model used to detect HLA loss of heterozygosity in the subject, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIG. 10 provides a non-limiting example of a process flowchart for detecting an HLA alteration, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIG. 11 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the NCH672 chr1 region.
- FIG. 12 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the B34996 chr1 region.
- FIG. 13 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the Pirn 1603 chr6 region.
- FIG. 14 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the NCH672 chr6 region.
- FIG. 15 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the B23882 chr6 region.
- FIG. 16 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the B34996 chr6 region.
- FIG. 17 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the NCI20009 chr6 region.
- FIGS. 18A-18F provide a non-limiting example of sensitivity analysis data for detection of HLA LOH based on allelic ratio data, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIG. 18A provides tumor/normal coverage data for a 10% tumor sample.
- FIG. 18B provides tumor/normal coverage data for a 20% tumor sample.
- FIG. 18C provides tumor/normal coverage data for a 30% tumor sample.
- FIG. 18D provides tumor/normal coverage data for a 40% tumor sample.
- FIG. 18E provides tumor/normal coverage data for a 50% tumor sample.
- FIG. 18F provides tumor/normal coverage data for a 100% tumor sample.
- FIG. 19 provides a non-limiting example of a block diagram of a computer system, in accordance with one or more implementations of the systems and methods disclosed herein.
- FIGS. 20A-20D provide non-limiting examples of read counts obtained for tumor and normal samples at polymorphic loci using a standard reference-based alignment method (FIG. 20A and FIG. 20C) and a SNP-tolerant alignment method (FIG. 20B and FIG. 20D).
- FIGS. 21A-21B provide a non-limiting example of read counts (as a fraction of total counts at the locus) on chromosome 6 (comprising the HLA locus) in a sample not subject to HLA loss (FIG. 21A) and corresponding statistics obtained using a method disclosed herein (FIG. 21B).
- FIGS. 21C-21D provide a non-limiting example of read counts (as a fraction of total counts) on chromosome 6 (comprising the HLA locus) in a sample subject to HLA loss (FIG. 21C) and corresponding statistics obtained using a method disclosed herein (FIG. 21D).
- FIG. 22 provides a non-limiting example of the number of known HLA LOH events that can be detected in synthetic cell line mixtures with decreasing tumor purities, using methods of the disclosure.
- HLA alterations (e.g., HLA allelic imbalance, HLA-LOH and/or HLA copy number alterations)
- sequencing data (e.g., next generation sequencing data)
- Detecting HLA loss (e.g., through HLA loss of heterozygosity)
- gain (e.g., via HLA copy number alteration)
- a tumor can use various mechanisms (e.g., genetic modification, epigenetic modification, or indirect regulation) to cause HLA loss that enables the tumor to escape or evade therapy and have a selective advantage.
- a particular immunotherapy may be ineffective and unable to recognize a tumor-specific antigen (e.g., a neoantigen) and thereby fail to activate an immune response when there is HLA loss.
- a tumor cell that does not have or that has a reduced expression of a particular HLA allele or HLA alleles may not be recognized or killed by T cells that are reactive to a given antigen on that tumor cell.
- a tumor cell that does not have or that has a reduced expression of a particular HLA allele or HLA alleles may be more likely to be recognized or killed by NK cells.
- detecting HLA loss in, for example, a subject's tumor can facilitate development and/or personalization of immunotherapies including, but not limited to, T-cell therapies, vaccines and/or natural killer (NK) cell therapies.
- the disclosed methods can further comprise: (i) diagnosing or confirming a diagnosis of a disease, (ii) identifying the subject for treatment of a disease (or identifying the subject as a candidate for treatment of a disease), (iii) identifying a treatment for a disease with which the subject has been diagnosed, (iv) predicting a clinical outcome for a disease with which the subject has been diagnosed, (v) designing or manufacturing a personalized treatment for the subject, or (vi) identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA loss (e.g., a detected loss of HLA heterozygosity) for at least one HLA gene.
- the disease can be a cancer.
- the systems and methods described herein are based on an allelic ratio approach (i.e., a ratio of the number of unique sequence read counts for a first gene allele to the number of unique sequence read counts for a second gene allele) rather than a copy number approach.
- the allelic ratio approach comprises analyzing the ratio (T1/T2) of the number of unique sequence read counts for HLA gene allele 1 in a tumor sample (T1) to the number of unique sequence read counts for HLA gene allele 2 in the tumor sample (T2), and the ratio (N1/N2) of the number of unique sequence read counts for HLA gene allele 1 in a normal sample (N1) to the number of unique sequence read counts for HLA gene allele 2 in the normal sample (N2).
- the T1/T2 sequence read count ratio should decrease accordingly. Since a diploid organism is expected to have one copy of each gene allele, the expected allelic ratio N1/N2 is 1.0, regardless of sequencing depth. It is possible that one gene allele is more easily sequenced than another gene allele (resulting in a sequencing bias for one allele over the other), which would cause a deviation away from the expected value of 1.0 for N1/N2.
- the two HLA gene alleles in tumor are derived from the same pair of gene alleles that occur in the normal sample, so that the sequence read count ratio for the two alleles in the normal sample can serve as a baseline or control for the sequence read count ratio in the tumor sample for each HLA gene that is free of sequencing bias artifacts.
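As a minimal sketch of the allelic ratio comparison described above, the tumor ratio T1/T2 can be normalized by the matched normal ratio N1/N2 so that any allele-specific sequencing bias shared by both samples cancels out. The function names and the read counts below are hypothetical and chosen only for illustration; the log transform is an assumption made here for symmetry around zero.

```python
import math

def allelic_ratio(count_allele1: int, count_allele2: int) -> float:
    """Ratio of unique read counts for allele 1 to unique read counts for allele 2."""
    if count_allele1 <= 0 or count_allele2 <= 0:
        raise ValueError("both alleles need at least one unique read")
    return count_allele1 / count_allele2

def normalized_log_ratio(t1: int, t2: int, n1: int, n2: int) -> float:
    """log((T1/T2) / (N1/N2)).

    A value of 0.0 means the tumor ratio matches the normal baseline;
    a negative value means allele 1 is under-represented in the tumor.
    """
    return math.log(allelic_ratio(t1, t2) / allelic_ratio(n1, n2))

# Hypothetical counts: the normal sample shows a mild capture bias toward
# allele 1 (330 vs 300 reads), while allele 1 is depleted in the tumor.
t1, t2, n1, n2 = 120, 400, 330, 300
print(round(allelic_ratio(n1, n2), 3))       # baseline ratio in the normal sample
print(round(normalized_log_ratio(t1, t2, n1, n2), 3))
```

Because the normal ratio is used as the baseline, a locus whose tumor and normal ratios are equally biased yields a normalized log ratio of zero.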
- An analysis e.g., statistical analysis
- the statistical significance can be evaluated using a confidence interval based on the estimated variability around a statistic that captures this difference.
- the confidence interval can be determined at least in part by measuring N1/N2 values for a plurality of non-HLA genomic loci (e.g., a plurality of single nucleotide polymorphism (SNP) loci), and fitting the N1, N2, T1, and T2 values for the plurality of non-HLA genomic loci (e.g., SNP loci) to a multinomial model, estimating a degree of overdispersion in the counts observed at the plurality of non-HLA genomic loci, adjusting standard errors for estimates of log ratios and/or log odds ratios based on the estimated degree of overdispersion, and computing a p-value for the difference between the observed ratios and a baseline assuming no allelic imbalance and/or a confidence interval estimate for said ratios using the adjusted standard errors estimates.
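The overdispersion adjustment described above can be illustrated with a simplified sketch. It is not the multinomial model of the disclosure: it assumes a normal approximation for the log odds ratio, assumes that most background SNP loci are balanced, and estimates overdispersion as the mean squared z-score across those loci. All function names, the 0.5 continuity correction, and the example counts are assumptions made for illustration.

```python
import math
from statistics import mean

def log_odds_ratio(n1, n2, t1, t2):
    """Return (log odds ratio, naive standard error) for one locus.

    Adds 0.5 to each count (an assumed continuity correction) so that
    zero counts do not break the logarithm.
    """
    a, b, c, d = (x + 0.5 for x in (t1, t2, n1, n2))
    lor = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return lor, se

def estimate_overdispersion(snp_counts):
    """Mean squared z-score over background SNP loci, floored at 1.0.

    Assumption: background loci are mostly balanced, so z-scores would
    have variance 1 without overdispersion.
    """
    z_squared = []
    for n1, n2, t1, t2 in snp_counts:
        lor, se = log_odds_ratio(n1, n2, t1, t2)
        z_squared.append((lor / se) ** 2)
    return max(1.0, mean(z_squared))

def evaluate_locus(counts, phi, z_crit=1.96):
    """Return (log odds ratio, two-sided p-value, confidence interval)."""
    lor, se = log_odds_ratio(*counts)
    adjusted_se = se * math.sqrt(phi)  # inflate SE by the overdispersion factor
    z = lor / adjusted_se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    ci = (lor - z_crit * adjusted_se, lor + z_crit * adjusted_se)
    return lor, p_value, ci

# Hypothetical (N1, N2, T1, T2) counts at background SNP loci and one HLA gene.
background = [(210, 190, 205, 215), (95, 110, 120, 100), (300, 280, 260, 310)]
phi = estimate_overdispersion(background)
print(evaluate_locus((300, 300, 100, 400), phi))
```

A larger estimated overdispersion widens the confidence interval and raises the p-value, which makes the call more conservative when the background loci are noisy.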
- a plurality of non-HLA genomic loci e.g., a plurality of single nucleotide polymorphism (SNP) loci
- the disclosed systems and methods enable improved accuracy in detecting HLA LOH (including for tumor samples of lower purity, and for WES sequence read data) due to: (i) reduced sensitivity to sequencing depth variations, and (ii) reduced sensitivity to differential sequence capture for the two HLA gene alleles by bait molecules.
- Embodiments of the methods described herein further improve on this through one or more of: (iii) improved alignment of sequence reads to HLA gene alleles based on the use of intron-level identifiers, (iv) improved accuracy of quantification of read counts at polymorphic loci in the HLA locus and outside of it due to the use of SNP-tolerant read alignment, and (v) novel methods for efficient computation of allelic ratios at multiple genomic loci (e.g., SNP loci).
- the present disclosure provides various systems, methods, and non-transitory computer readable media for evaluating HLA loss (e.g., HLA LOH or HLA copy number changes) in a sample obtained from a subject as compared to another sample.
- HLA loss e.g., HLA LOH or HLA copy number changes
- the methods described herein enable analyzing HLA gene alleles in two samples to estimate whether there is any HLA loss of heterozygosity between the two samples.
- these two samples include a normal or otherwise healthy sample and a tumor or otherwise unhealthy or diseased sample.
- the former may be referred to herein as “normal” sample for simplicity, with associated read counts at any heterozygous locus or HLA gene allele labelled as N1, N2.
- the latter may be referred to herein as “tumor” sample for simplicity, with associated read counts at any heterozygous locus or HLA gene allele labelled as T1, T2.
- this terminology refers generally to a sample comprising cells or genetic material derived therefrom that can be assumed to not be subject to HLA alteration (“normal” sample) and a related sample comprising cells or genetic material derived therefrom that have an unknown status in relation to HLA alteration (“tumor” sample).
- HLA alleles can be typed based on variances (e.g., polymorphisms) within the exon regions of the HLA alleles or based on variances (e.g., polymorphisms) within the intron regions of the HLA alleles.
- using exon-resolution or intron-resolution identifiers for an HLA allele allows the allele sequence for that HLA gene to be more accurately identified, which can enable improved alignments between HLA gene alleles and sequence reads obtained from nucleic acid sequencing.
- exon-resolution or intron-resolution HLA typing are described in, for example, PCT International Patent Application Publication No. WO 2022/192304, which is incorporated herein by reference in its entirety.
- Alternative methods include e.g. Optitype (Szolek et al. Bioinformatics, Volume 30, Issue 23, December 2014, Pages 3310-3316) or combinations of Optitype and the methods described in WO 2022/192304.
- the disclosed methods can comprise: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from a subject; receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA gene based on: (i) a tumor allelic ratio comprising a ratio of the determined
- detecting an HLA alteration for the at least one HLA gene can comprise performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value (e.g., a value of 1).
- the statistical analysis can comprise: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; determining, based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele and a number of unique tumor-derived sequence reads for a second SNP allele for each of the plurality of heterozygous SNP loci; determining, based on the sequence read data, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP
- the systems and methods described herein may be used in various ways to develop and/or personalize immunotherapies such as T cell therapies or cancer vaccines.
- a T cell therapy or cancer vaccine may be designed to be reactive to or include an antigen presented by another HLA allele in that patient for which HLA loss is not detected. This type of HLA loss detection may be crucial to providing effective and timely therapy to individual subjects.
- substantially means sufficient to work for the intended purpose.
- the term “substantially” thus allows for minor, insignificant variations from an absolute or perfect state, dimension, measurement, result, or the like such as would be expected by a person of ordinary skill in the field but that do not appreciably affect overall performance.
- substantially means within ten percent.
- the term “plurality” can be 2, 3, 4, 5, 6, 7, 8, 9, 10, or more.
- the term “set of” means one or more. For example, a set of items includes one or more items.
- the phrase “at least one of”, when used with a list of items, means different combinations of one or more of the listed items may be used and only one of the items in the list may be needed.
- the item may be a particular object, thing, step, operation, process, or category.
- “at least one of” means any combination of items or number of items may be used from the list, but not all of the items in the list may be required.
- “at least one of item A, item B, or item C” means item A; item A and item B; item B; item A, item B, and item C; item B and item C; or item A and item C.
- “at least one of item A, item B, or item C” means, but is not limited to, two of item A, one of item B, and ten of item C; four of item B and seven of item C; or some other suitable combination.
- a “model” includes at least one of an algorithm, a formula, a mathematical technique, a machine algorithm, a probability distribution or model, or another type of mathematical or statistical representation.
- a “subject” may refer to a mammal being assessed for treatment and/or being treated, a mammal participating in a clinical trial, a mammal undergoing anticancer therapies, or any other mammal of interest.
- the terms “subject”, “individual”, and “patient” are used interchangeably herein.
- a subject can be a healthy or asymptomatic individual, an individual that has or is suspected of having a disease (e.g., cancer) or a pre-disposition to the disease, an individual that is in need of therapy or suspected of needing therapy, or a combination thereof.
- a subject may be, for example, without limitation, an individual having cancer or an individual having an autoimmune disease.
- a subject may be human.
- a subject may be a non-human mammal.
- a subject may be a mammal used in forming laboratory models for human disease.
- Such mammals include, but are not limited to, model animals, mice, rats, primates (e.g., cynomolgus monkey), etc.
- sample can refer to a “biological sample” of a subject.
- a sample can include tissue (e.g., a biopsy), a single cell, multiple cells, fragments of cells or an aliquot of body fluid.
- the sample may have been taken from a subject, by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
- nucleotide comprises a nucleoside and a phosphate group.
- a “nucleoside” as used herein comprises a nucleobase and a five-carbon sugar (e.g., ribose, deoxyribose, or analogs thereof).
- when the nucleobase is bonded to ribose, the nucleoside may be referred to as a ribonucleoside.
- when the nucleobase is bonded to deoxyribose, the nucleoside may be referred to as a deoxyribonucleoside.
- a “nucleobase”, which may also be referred to as a “nitrogenous base”, can take the form of one of five types: adenine (A), guanine (G), thymine (T), uracil (U), and cytosine (C).
- a “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleotides (or nucleosides joined by internucleosidic linkages). Generally, a polynucleotide comprises at least three nucleotides. Generally, an oligonucleotide is comprised of nucleotides that range in number from a few nucleotides (or monomeric units) to several hundreds of nucleotides (monomeric units).
- a polynucleotide such as an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG”, it will be understood that the nucleotides are in 5' —> 3' order or direction from left to right and that “A” denotes adenine, “C” denotes cytosine, “G” denotes guanine, and “T” denotes thymine, unless otherwise noted.
- the letters A, C, G, and T may be used to refer to the nucleobases themselves, as described above, the nucleosides that include those nucleobases, or the nucleotides that include those bases, as is standard in the art.
- Deoxyribonucleic acid is a chain of nucleotides consisting of 4 types of nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G).
- Ribonucleic acid (RNA) is comprised of 4 types of nucleotides: A, C, G, and uracil (U). Certain pairs of nucleotides specifically bind to one another in a complementary fashion, which may be referred to as complementary base pairing. For example, C pairs with G and A pairs with T. In the case of RNA, however, A pairs with U.
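The 5' to 3' convention and the complementary base pairing rules above can be illustrated with a short sketch. The function names are hypothetical; the example sequence “ATGCCTG” is the one used in the text.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the complementary strand, also written in 5' to 3' order."""
    return seq.upper().translate(DNA_COMPLEMENT)[::-1]

def to_rna(seq: str) -> str:
    """Write a DNA sequence with RNA letters (uracil in place of thymine)."""
    return seq.upper().replace("T", "U")

print(reverse_complement("ATGCCTG"))  # CAGGCAT
print(to_rna("ATGCCTG"))              # AUGCCUG
```

The complement is reversed because the two strands of a duplex run in opposite directions, so reading the partner strand 5' to 3' means reading it from the other end.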
- nucleic acid sequencing data denote any information or data that is indicative of the order of the nucleotide bases (e.g., A, C, G, T/U) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.
- sequence reads may be DNA sequence reads. It should be understood that the present disclosure contemplates that this sequence information may be obtained using any of the available varieties of techniques, platforms, or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic-based systems, etc., or a combination thereof.
- a genome is stored on one or more chromosomes comprised of DNA sequences.
- DNA includes, for example, genes, noncoding DNA, and mitochondrial DNA.
- the human genome typically contains 23 pairs of chromosomes: 22 pairs of autosomal chromosomes (autosomes) plus the sex-determining X and Y chromosomes.
- the 23 pairs of chromosomes include one copy from each parent.
- the DNA that makes up the chromosomes is referred to as chromosomal DNA and is present in the nucleus of human cells (nuclear DNA).
- a “gene” is a discrete portion of heritable, genomic sequence that affects a subject's traits by being expressed as a functional product or by regulation of gene expression.
- the total complement of genes in a subject or cell is known as the subject's or cell's genome.
- a region of a chromosome at which a particular gene is located is called its locus.
- Each locus contains one allele of a gene.
- a pair of chromosomes together has two loci that each contain an allele of the gene to form an allele pair.
- the two alleles may be the same or may be different (e.g., have slightly varying gene sequences).
- an “allele” is a sequence, or variant thereof, of a gene.
- One allele of a gene may differ from another allele of the same gene in various ways.
- two alleles for a same gene may differ by, for example, differences in the encoded protein (e.g., differences in the amino acid sequence of the encoded protein), other (e.g., silent or synonymous) variances in the exon regions that do not affect the amino acid sequence, variances in the intron regions, or some combination of these variances.
- a “sequence” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., A, C, G, T/U) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA.
- Sequence information may be obtained using any of the available varieties of techniques, platforms, or technologies, including, but not limited to, capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic-based systems, etc., or a combination thereof.
- sequence information may be obtained using next generation sequencing.
- next generation sequencing refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches. These sequencing technologies have, for example, the ability to generate hundreds of thousands of relatively small sequence reads or "reads" in a single sequencing run.
- next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization.
- a “read” or “sequence read” includes a string of nucleic acid bases corresponding to a nucleic acid molecule that has been sequenced.
- a read can refer to the sequence of nucleotides determined for a nucleic acid fragment that has been subjected to sequencing, such as, for example, next generation sequencing (“NGS”).
- NGS next generation sequencing
- Reads can be any sequence of any number of nucleotides, with the number of nucleotides defining the read length.
- sequence read may refer to reads obtained by sequencing DNA in a sample.
- the DNA may be genomic DNA.
- sequence read data refers to any information characterizing a sequence read, including the sequence of the read itself or information from which the sequence can be derived. Sequence read data may include, in addition to this, one or more of: one or more quality metrics, alignment information, one or more flags, etc. Sequence read data may be in the form of e.g. a BCL file, a FASTQ file, a SAM file, a BAM file, or any other file format from which read counts for specific loci can be derived as described herein.
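As one illustration of sequence read data that carries both the read sequence and a quality metric, the sketch below reads FASTQ records. It assumes the common four-line record layout without line wrapping and Phred+33 quality encoding; the record and function names are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO

class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str

def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)

def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]

example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
for record in parse_fastq(example):
    print(record.read_id, record.sequence, phred_scores(record.quality))
```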
- a “major histocompatibility complex gene” or “MHC gene” is a gene that encodes a system, complex, or group of cell-surface proteins responsible for the regulation of the immune system.
- a “human leukocyte antigen gene” or “HLA gene” is a gene that encodes a system, complex, or group of cell-surface proteins responsible for the regulation of the immune system.
- An HLA system or complex is encoded by the MHC gene complex in humans.
- MHC molecules that present antigens on cells are categorized as belonging to one of three classes of MHC molecules, MHC class I, MHC class II, and MHC class III.
- Certain HLA genes including, for example, HLA-A, HLA-B, HLA-C, correspond to MHC class I.
- Certain HLA genes including, for example, HLA-DP, HLA-DM, HLA-DO, HLA-DQ, and HLA-DR, correspond to MHC class II.
- HLA genes that are known include, for example, HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, HLA-G, HLA-H, HLA-J, HLA-K, HLA-L, HLA-N, HLA-P, HLA-S, HLA-T, HLA-U, HLA-V, HLA-W, HLA-X, HLA-Y, HLA-Z, HLA-DRA, HLA-DRB, HLA-DQ, HLA-DOA, HLA-DOB, HLA-DMA, HLA-DMB, HLA-DPA, HLA-DPB, and HFE.
- Other genes that are found in the HLA region include, for example, TAP1, TAP2, PSMB9, PSMB8, MICA, MICB, MICC, MICD, and MICE.
- T cell also known as a T lymphocyte
- T cells develop in the thymus gland and play a central role in the immune response of the body. T cells can be distinguished from other lymphocytes by the presence of a T cell receptor (TCR) on the cell surface. These immune cells originate as precursor cells, derived from bone marrow, and then develop into several distinct types of T cells once they have migrated to the thymus gland. T cell differentiation continues even after they have left the thymus. T cells include, but are not limited to, helper T cells, cytotoxic T cells, memory T cells, regulatory T cells, and killer T cells. Helper T cells stimulate B cells to make antibodies and help killer cells develop.
- TCR T cell receptor
- T cells can also include T cells that express αβ TCR chains, T cells that express γδ TCR chains, as well as unique TCR co-expressors (i.e., hybrid αβ-γδ T cells) that co-express the αβ and γδ TCR chains.
- T cells can also include engineered T cells that can attack specific cancer cells.
- Engineered T cells may be designed to recognize MHC-presented peptides.
- an engineered T cell may be designed to recognize an antigen that is not subject to HLA loss (e.g. by engineering the T cell to express a T cell receptor that recognizes said antigen in the context of an HLA allele that is not subject to HLA loss in a subject).
- Engineered T cells can be expanded in culture and then infused into a patient's body.
- Engineered T cells may be designed to multiply and recognize the cancer cells that express a specific protein or neoantigen. This type of technology may be used in potential next-generation immunotherapy treatment.
- immunotherapy refers to a treatment or class of treatments that uses one or more parts of a subject's immune system to fight a disease such as, for example, without limitation, cancer. Immunotherapy can use substances made by the body or synthesized outside of the body to improve how the immune system works to find and destroy cancer cells.
- An immunotherapy may be a cell therapy, such as e.g. a T cell or NK cell therapy.
- An immunotherapy may be a vaccine, such as e.g. a nucleic acid, protein or cell based vaccine.
- a “neoantigen” is a tumor-specific antigen derived from one or more somatic mutations in a tumor.
- a neoantigen can be presented by a subject's cancer cells and antigen presenting cells. This can lead to an immune response against cells expressing the neoantigen.
- Neoantigen therapies such as, but not limited to, neoantigen vaccines, are a relatively new approach for providing individualized cancer treatment.
- a “tumor associated antigen” is an antigen that is expressed exclusively or primarily in tumor cells, but that does not arise from somatic mutation in the protein comprising the antigen.
- a tumor associated antigen may arise through somatic amplification or other overexpression mechanism, or through post-translational modification.
- Cancer vaccines, also referred to herein as “neoantigen vaccines”, can prime a subject's T cells to recognize and attack cancer cells expressing one or more particular tumor neoantigens and/or tumor associated antigens. This approach generates a tumor-specific immune response that spares healthy cells while targeting tumor cells.
- An individualized vaccine may be engineered or selected based on a subject-specific tumor antigen profile.
- the tumor antigen profile can be defined by determining DNA and/or RNA sequences from a subject's tumor cell and using the sequences to identify neoantigens and/or tumor associated antigens that are present in tumor cells but absent in normal cells.
- HLA loss e.g., HLA allelic loss or imbalance, HLA allelic expression loss
- HLA loss e.g., HLA allelic expression loss
- MHC MHC alignment input, MHC allele, MHC gene, etc.
- HLA allelic imbalance refers to the departure from an expected 1 : 1 ratio of representation of two alleles at a heterozygous locus.
- HLA copy number alteration includes copy-neutral loss of heterozygosity (LOH), in which a first allele for a gene is lost (resulting in zero copies of the first allele) and a second allele for that gene is gained (resulting in two copies of the second allele).
- LOH loss of heterozygosity
- references to evaluating or detecting “HLA loss” or “HLA alteration” encompass detecting HLA allelic imbalance of any kind, whether through the copy number alteration (including amplification and deletion) of one or both alleles in a pair of HLA alleles.
- FIG. 1 provides a schematic diagram illustrating a non-limiting example of an evaluation system for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein.
- Evaluation system 100 is implemented using hardware, software, firmware, or a combination thereof.
- Evaluation system 100 may be implemented using, for example, computer system 102.
- Computer system 102 includes a single computer or multiple computers in communication with each other. When computer system 102 includes multiple computers, in some instances, one computer may be located remotely with respect to at least one other computer.
- Evaluation system 100 includes allelic type generator 104, alignment analyzer 106, statistics generator 108, or a combination thereof.
- Each of allelic type generator 104, alignment analyzer 106, and statistics generator 108 is implemented using hardware, software, firmware, or a combination thereof.
- each of allelic type generator 104, alignment analyzer 106, and statistics generator 108 can be implemented as a distinct compiled computer program, interpreted language script, another type of software, or a combination thereof.
- Alignment analyzer 106 and statistics generator 108 form HLA loss evaluator 110.
- HLA loss evaluator 110 can be implemented in various ways.
- alignment analyzer 106 and statistics generator 108 are separate programs with alignment analyzer 106 generating an output that is sent as input into statistics generator 108. In other instances, alignment analyzer 106 and statistics generator 108 are integrated or the actions that would be performed by alignment analyzer 106 and statistics generator 108 are integrated to form HLA loss evaluator 110. Accordingly, HLA loss evaluator 110 is implemented using hardware, software, or a combination thereof. In some instances, HLA loss evaluator 110 is implemented as a compiled computer program, interpreted language script, another type of software, or a combination thereof. In other instances, HLA loss evaluator 110 is implemented as a plurality of programs working together.
- HLA loss evaluator 110 refers to alignment analyzer 106, statistics generator 108, a combination of alignment analyzer 106 and statistics generator 108, operations that would be performed by alignment analyzer 106, operations that would be performed by statistics generator 108, operations that would be performed by a combination of alignment analyzer 106 and statistics generator 108, or a combination thereof.
- Evaluation system 100 receives read data 112 as input.
- One or more of allelic type generator 104 and HLA loss evaluator 110 receives read data 112 as input.
- Read data 112 includes one or more datasets.
- Read data 112 includes, for example, one or more sequencing datasets.
- a sequencing dataset includes, for example, sequence read data for a plurality of reads.
- evaluation system 100 retrieves read data 112 from data store 114.
- Data store 114 includes, for example, but is not limited to, at least one of a database, a data storage unit, a spreadsheet, a file, a server, a cloud storage unit, a cloud database, or some other type of data store.
- data store 114 comprises one or more data storage devices separate from but in communication with computer system 102.
- data store 114 is at least partially integrated as part of computer system 102.
- Read data 112 includes reads (e.g., sequence reads) that are generated using, for example, one or more next-generation sequencing (NGS) systems.
- the reads are generated using, for example, whole-exome sequencing (WES), whole genome sequencing (WGS), shallow whole genome sequencing (sWGS), targeted (panel) sequencing, or a combination thereof.
- WES whole-exome sequencing
- WGS whole genome sequencing
- sWGS shallow whole genome sequencing
- the reads can be generated using, for example, paired-end sequencing.
- read data 112 includes at least a plurality of reads 116.
- Reads 116 are generated for a corresponding first sample which may be, for example, a biological sample.
- the first sample can be obtained from, for example, a subject (e.g., a live subject).
- read data 112 further includes a plurality of reads 118 that are generated for a corresponding second sample that is different from the first sample.
- This second sample may be, for example, a biological sample obtained from a subject that is the same as or different from the subject from which reads 116 are generated.
- Reads 118 are, in some examples, generated via simulation, via a sampling from a collection of reads generated for multiple subjects, or in some other manner.
- the first and second samples may comprise cells or genetic material derived therefrom, obtained from the same subject. Such samples may be referred to as “paired” or “matched”.
- reads 116 and reads 118 are paired-end reads.
- paired-end sequencing of a fragment results in two sequences, a sequence generated beginning at the 5' end of the fragment, and a sequence generated beginning at the 3' end of the fragment. These two sequences form a paired-end read.
- a biological sample for which at least a portion (e.g., reads 116, reads 118) of read data 112 is generated may be, for example, a sample of unhealthy or diseased tissue, a sample of tumor tissue, a sample of tissue that includes tumor cells, a sample of healthy or normal tissue, a sample of tissue that includes normal cells, a sample of tissue taken at a first stage or point in time during a cancer progression, a sample of tissue taken at a second stage or point in time during the cancer progression, or another type of sample.
- reads 116 are generated for a sample of healthy or normal tissue and reads 118 are generated for a sample of unhealthy or diseased tissue (e.g., a tumor).
- reads 116 are referred to as normal reads or healthy reads
- reads 118 are referred to as unhealthy reads, diseased reads, or tumor reads.
- a sample of unhealthy or diseased tissue refers to a sample comprising unhealthy or diseased cells, or genetic material derived therefrom.
- a sample of unhealthy or diseased tissue may also comprise healthy/normal cells or genetic material derived therefrom.
- Such samples may be described as having a particular purity (also referred to as “tumor purity” in the context of cancer), referring to the proportion of the cells represented in the sample that are diseased cells.
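To see why lower purity makes allelic imbalance harder to detect, the expected tumor-sample allelic ratio can be sketched as a purity-weighted mixture of tumor and normal cells. This is an illustrative simplification: it assumes read counts are proportional to DNA copies, that normal cells carry one copy of each allele, and that there is no capture bias. The function and parameter names are hypothetical.

```python
def expected_tumor_allelic_ratio(purity: float,
                                 tumor_cn1: float,
                                 tumor_cn2: float,
                                 normal_cn1: float = 1.0,
                                 normal_cn2: float = 1.0) -> float:
    """Expected T1/T2 for a sample in which a fraction `purity` of cells is tumor.

    tumor_cn1 and tumor_cn2 are the copy numbers of allele 1 and allele 2
    in the tumor cells; normal cells are assumed to carry one copy of each.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    copies_allele1 = purity * tumor_cn1 + (1.0 - purity) * normal_cn1
    copies_allele2 = purity * tumor_cn2 + (1.0 - purity) * normal_cn2
    if copies_allele2 == 0:
        raise ValueError("allele 2 is absent from the sample")
    return copies_allele1 / copies_allele2

# Loss of allele 1 in the tumor cells (0 copies of allele 1, 1 copy of allele 2).
for purity in (0.8, 0.5, 0.2):
    print(purity, round(expected_tumor_allelic_ratio(purity, 0, 1), 3))
```

Under these assumptions the expected ratio moves toward 1.0 as purity falls, so the same loss event produces a smaller deviation from the normal baseline in a low-purity sample.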
- Allelic type generator 104 generates at least one applicable or probable allelic type for one or more genes of interest within a sample using read data 112 (e.g., based on reads 116 or reads 118). Allelic type generator 104 may generate at least one applicable or probable allelic type for each of a plurality of HLA genes. Allelic type generator 104 identifies, in some instances, a set of alleles 120 relevant to a subject using reads 116 generated for the subject. Set of alleles 120 is a set of one or more HLA alleles that is determined to most likely be present within the sample from which reads 116 are generated for a given HLA gene.
- Allelic type generator 104 may identify a set of alleles 120 by identifying a final set of allelic identifiers 122 for set of alleles 120.
- An allelic identifier for an allele can take various forms. For example, an allelic identifier may be comprised of various letter and/or digits that form one or more fields for representing different pieces of information about an allele.
- An allelic identifier refers to any information from which at least the exon sequence of an HLA allele can be derived.
- An allelic identifier can have varying levels of resolution in which higher levels of resolution provide more information than lower levels of resolution. In some cases, additional letters and/or digits provide additional information.
- a 6-digit allelic identifier has a lower resolution than an 8-digit allelic identifier (or 8-digit identifier).
- allelic identifiers are referred to by the number of fields of information represented in these allelic identifiers.
- a 6-digit identifier may be referred to as or generally provide the same level of information as a 3-field identifier.
- An 8-digit identifier may be referred to as or generally provide the same level of information as a 4-field identifier.
- An HLA allele for an HLA gene can be identified using identifiers of varying resolutions including, but not limited to, exon-resolution identifiers and intron-resolution identifiers.
- An exon-resolution identifier for an HLA allele describes an allele group, a specific allele protein, and exon region information for the corresponding HLA allele.
- the exon-resolution identifier is a 6-digit identifier in which the first and second digits identify the allele group; the third and fourth digits identify the specific allele protein; and the fifth and sixth digits identify the exon region information.
- the specific allele protein is determined based on DNA sequence and differences within the amino acid sequence of the encoded protein.
- the exon region information captures changes in one or more exon regions of the HLA allele such as, for example, synonymous nucleotide substitutions.
- the 6-digit identifier may include one or more letters that indicate the corresponding HLA gene.
- the exon-resolution identifier is a 3-field identifier in which each field is comprised of any number of letters, digits, symbols, or combination thereof.
- An intron-resolution identifier provides more information than an exon-resolution identifier and therefore has a higher level of resolution than an exon-resolution identifier.
- An intron-resolution identifier for an HLA allele describes an allele group, a specific allele protein, exon region information, and intron region information for the corresponding HLA allele.
- the intron-resolution identifier is an 8-digit identifier that adds, to a 6-digit identifier as described above, seventh and eighth digits that identify intron region information.
- the intron region information captures changes in one or more intron regions of the HLA allele such as, for example, polymorphisms in the intron regions.
- Final set of allelic identifiers 122 is a final set of intron-resolution identifiers in some instances.
- the 8-digit identifier may include one or more letters that indicate the corresponding HLA gene.
- the intron-resolution identifier is a 4-field identifier that adds one field (e.g., comprised of any number of letters, digits, symbols, or combination thereof) to the 3-field identifier.
- a 6-digit or 8-digit identifier may, in addition to the 6 or 8 digits mentioned above, include one or more optional suffixes indicating respective properties of the allele, such as e.g. whether the protein encoded by the allele is expressed as a soluble molecule, whether it has been shown not to be expressed, etc. These may be removed or ignored prior to use in embodiments of methods of the disclosure.
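As a non-limiting illustration of the field structure described above, the following sketch splits an allelic identifier into its gene, its colon-separated fields, and an optional suffix. The regular expression, the `AllelicIdentifier` type, and the function names are assumptions introduced here for illustration and are not part of the disclosure. The suffix is simply dropped when reducing resolution, consistent with the statement that suffixes may be removed or ignored prior to use.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical pattern: optional "HLA-" prefix, gene name, "*", two to four
# colon-separated numeric fields, and an optional single-letter suffix.
_ALLELE_RE = re.compile(
    r"^(?:HLA-)?(?P<gene>[A-Z0-9]+)\*(?P<fields>\d+(?::\d+){1,3})(?P<suffix>[A-Z])?$"
)


class AllelicIdentifier(NamedTuple):
    gene: str
    fields: Tuple[str, ...]
    suffix: Optional[str]

    @property
    def resolution(self) -> str:
        # 3 fields ~ 6-digit (exon resolution); 4 fields ~ 8-digit (intron resolution).
        return {2: "two-field", 3: "exon", 4: "intron"}[len(self.fields)]


def parse_allelic_identifier(text: str) -> AllelicIdentifier:
    """Split an identifier such as 'HLA-B*51:01:01:01' into its parts."""
    match = _ALLELE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized allelic identifier: {text!r}")
    return AllelicIdentifier(
        gene=match["gene"],
        fields=tuple(match["fields"].split(":")),
        suffix=match["suffix"],
    )


def to_exon_resolution(identifier: AllelicIdentifier) -> AllelicIdentifier:
    """Keep at most the first three fields and drop any suffix."""
    return AllelicIdentifier(identifier.gene, identifier.fields[:3], None)
```

For example, parsing `HLA-B*51:01:01:01` under these assumptions yields four fields and an "intron" resolution label, and reducing it to exon resolution keeps `51`, `01`, `01`.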
- Allele type generator 104 outputs final set of allelic identifiers 122 that correspond to a given HLA gene as determined using reads 116.
- the final set of allelic identifiers may be exon-level identifiers or intron-level identifiers.
- the final set of allelic identifiers may be intron-level identifiers obtained from a first set of allelic identifiers that are exon-level identifiers. For example, 6-digit HLA identifiers may be identified in a first step and converted to 8-digit identifiers in a second step. The resolution of HLA identifiers in a final set of allelic identifiers may vary between HLA alleles.
- HLA loss evaluator 110 receives final set of allelic identifiers 122 as input. HLA loss evaluator 110 uses final set of allelic identifiers 122 and read data 112 to generate HLA loss information 124. HLA loss information 124 may include various pieces of information that describe HLA loss or imbalance in the tumor sample, based on the comparison between the two samples, from reads 116 and reads 118.
- HLA loss information 124 includes one or more pieces of information that can be used to identify, quantify and/or qualify HLA alteration (e.g. loss or other imbalance) in the sample associated with reads 118 as compared to the sample associated with reads 116.
- HLA loss of an HLA allele for an HLA gene refers to an absence or decrease of presence of that HLA allele in a tissue or population of cells.
- HLA loss information 124 provides information that can be used to identify, quantify and/or qualify this absence or decrease.
- An example of information included in HLA loss information 124 is statistics generated based on alignment between a sequence corresponding to an allelic identifier for an HLA allele and various reads.
- Another example of information included in HLA loss information 124 is one or more conclusions or inferences made using statistics.
- Yet another example of information included in HLA loss information 124 is an estimation of the amount of HLA loss (e.g., a percentage, a degree, etc.).
- HLA loss information 124 is generated by operations involving alignment analyzer 106 and statistics generator 108.
- Alignment analyzer 106 receives final set of allelic identifiers 122 and read data 112 as input and generates alignment output 126 based on this input.
- Alignment analyzer 106 performs an analysis of the alignment of each allele identified by a corresponding one of final set of allelic identifiers 122 with reads 116 and reads 118.
- alignment analyzer 106 aligns reads 116 and reads 118 to a subject-specific reference sequence which includes the sequence of HLA alleles identified in the final set of allelic identifiers 122.
- Alignment output 126 provides a quantification of these alignments.
- alignment output 126 also includes a quantification of alignments between non-HLA genes with reads 116 and reads 118.
- Alignment analyzer 106 outputs alignment output 126 and statistics generator 108 receives alignment output 126 as input.
- Statistics generator 108 performs statistical analysis 128 using alignment output 126 and, in some cases, other information.
- Statistics generator 108 performs statistical analysis 128 using one or more algorithms for generating statistics, one or more mathematical formulas or equations, one or more other types of analysis techniques, or a combination thereof.
- HLA loss evaluator 110 generates HLA loss information 124 using the results of statistical analysis 128, one or more inferences or conclusions made based on the results of statistical analysis 128, or a combination thereof.
- HLA loss evaluator 110 generates, in some instances, report 130 using HLA loss information 124.
- Report 130 may include, for example, without limitation, at least one of a table, a spreadsheet, a database, a file, a presentation, an alert, a graph, a chart, one or more graphics, or a combination thereof.
- HLA loss evaluator 110 optionally displays report 130 on display system 132.
- Display system 132 comprises one or more display devices in communication with computer system 102. Display system 132 may be separate from or at least partially integrated as part of computer system 102.
- evaluation system 100 is capable of identifying a set of alleles for each of a plurality of HLA genes in a sample and evaluating HLA loss using the identified sets of alleles.
- allelic type generator 104 is shown as part of evaluation system 100, in other instances, allelic type generator 104 may be separate from evaluation system 100.
- FIG. 2 provides a non-limiting example of a process 200 for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein.
- Process 200 can be implemented using the HLA loss evaluator 110 described in FIG. 1.
- process 200 may use the statistics generator 108 used to perform statistical analysis 128 in FIG. 1.
- a set of normal sequence reads 116 (e.g., WES sequence reads) derived from a normal sample from a subject are provided as input to, e.g., allele type generator 104 as described in FIG. 1, which identifies a set of HLA alleles 120 and/or HLA allelic identifiers 122 present in the sample for at least one HLA gene.
- the identified HLA alleles may be identified using 6-digit allelic identifiers (e.g., exon-resolution identifiers as described elsewhere herein).
- the identified HLA alleles may comprise 8-digit allelic identifiers (e.g., intron-resolution identifiers as described elsewhere herein).
- the set of alleles 120 and/or allelic identifiers 122 can then be provided as input to, e.g., alignment analyzer 106 as described in reference to FIG. 1 along with a reference genome 202 (e.g., the GRCh38 human reference genome (Genome Reference Consortium)) to generate a subject-specific reference genome comprising a subject-specific HLA genome (HLA-ome) 204 (e.g., a subject-specific reference genome for the HLA region of the genome).
- the subject-specific reference genome can combine the reference genome 202 with the subject-specific HLA genome by replacing the sequences of the HLA genes in the reference genome 202 with the sequences in the subject-specific HLA genome.
- the subject-specific HLA genome can be constructed based on allelic identifiers 122 and an HLA sequence database such as, e.g., IMGT™ (www.imgt.org/).
- the set of allelic identifiers may be generated (e.g., by allele type generator 104 in FIG. 1) through the use of a set-covering algorithm to find a set of HLA gene alleles that best explains the observed set of normal sequence reads (e.g., normal WES sequence reads) in a sample.
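The disclosure does not state which set-covering algorithm is used. As a hedged illustration only, the sketch below assumes a simple greedy heuristic: at each step it picks the candidate allele that explains the most reads not yet explained, stopping after two alleles per gene. The input mapping `read_hits` (candidate allele to the set of read IDs compatible with it) and the function name are hypothetical.

```python
from typing import Dict, List, Set


def greedy_allele_cover(read_hits: Dict[str, Set[str]], max_alleles: int = 2) -> List[str]:
    """Greedy set cover: choose alleles that together explain the most reads.

    read_hits maps each candidate allele to the IDs of normal-sample reads
    that are compatible with it (an assumed, pre-computed input).
    """
    uncovered: Set[str] = set()
    for reads in read_hits.values():
        uncovered |= reads

    chosen: List[str] = []
    while uncovered and len(chosen) < max_alleles:
        # Sorting first gives a deterministic tie-break (alphanumeric order).
        best = max(sorted(read_hits), key=lambda allele: len(read_hits[allele] & uncovered))
        gained = read_hits[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```

A single allele is returned when it already explains every read, which corresponds to the homozygous case under this simplified model.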
- the subject-specific HLA genome 204 or subject-specific reference genome, and sets of tumor sequence reads 118 and normal sequence reads 116 are provided as input to, e.g., alignment analyzer 106 as described in reference to FIG.
- sequence read count ratios based on: (i) a total number of unique sequence reads (N1) from the normal sample that align to a first HLA gene allele (HLA gene allele 1) for the at least one HLA gene, (ii) a total number of unique sequence reads (N2) from the normal sample that align to a second HLA gene allele (HLA gene allele 2) for the at least one HLA gene, (iii) a total number of unique sequence reads (T1) from the tumor sample that align to HLA gene allele 1, and (iv) a total number of unique sequence reads (T2) from the tumor sample that align to HLA gene allele 2.
- the sequence read count ratio (N1/N2) 206 for the normal sample and the sequence read count ratio (T1/T2) 208 for the tumor sample can be input into, e.g., statistics generator 108 as described in reference to FIG. 1, and processed by statistical analysis 128 as described in reference to FIG. 1, to output a p-value or log odds ratio 212 used to test whether the sequence read count ratio (T1/T2) 208 for the tumor sample is significantly different from the sequence read count ratio (N1/N2) 206 for the normal sample at a given locus in view of an expected or baseline value.
- the expected or baseline value can correspond to the observed sequence read count ratio (T1/T2) 208 for the tumor sample being the same as the sequence read count ratio (N1/N2) 206 for the normal sample.
- the statistical significance of the difference can be determined, for example, by modeling sequence read count data for a plurality of non-HLA heterozygous genomic loci (e.g., heterozygous SNPs identified in the non-HLA region of the subject’s genome).
- the modeling may comprise fitting the N1, N2, T1, and T2 values determined for a plurality of heterozygous SNPs 210 identified (e.g., by alignment analyzer 106) in the non-HLA region of the subject’s genome to a multinomial model, then estimating the degree of overdispersion (i.e., a situation in which the variance in sequence read count numbers is much larger than expected, given the mean sequence read count values) in the sequence read data for the normal and tumor samples.
- the statistical analysis 128 may be configured, for example, to output a p-value for the difference between the observed T1/T2 value and the N1/N2 or expected/baseline value at a given HLA gene locus, a log odds ratio estimate for the T1/T2 value and the N1/N2 value (i.e., a difference between the log ratio for allele 2 and the log ratio for allele 1, e.g., LOR = log(T2/N2) - log(T1/N1)), a confidence interval around the log odds ratio estimate, a p-value for the log odds ratio being different from a baseline estimate corresponding to the two alleles being balanced, a log ratio estimate for the first allele (log(T1/N1)), a log ratio estimate for the second allele (log(T2/N2)), a confidence interval around the log ratio estimate for the first allele, a confidence interval around the log ratio estimate for the second allele, a p-value for the observed log ratio estimate for the first allele being different from a baseline estimate corresponding to the first allele being present in the same amounts in the tumor and normal samples, and/or a p-value for the observed log ratio estimate for the second allele being different from a baseline estimate corresponding to the second allele being present in the same amounts in the tumor and normal samples.
- P-values and confidence intervals may be calculated using standard error estimates that are adjusted for a degree of overdispersion that is estimated using the N1, N2, T1, and T2 values determined for a plurality of heterozygous SNPs 210 identified in the non-HLA region of the subject’s genome.
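One way to picture the log odds ratio, its confidence interval, and an overdispersion-adjusted p-value is the sketch below. It is an assumption-laden illustration, not the disclosed model: it uses the textbook Wald standard error for a 2 x 2 log odds ratio, sqrt(1/N1 + 1/N2 + 1/T1 + 1/T2), scales the variance by a supplied overdispersion factor, adds a hypothetical pseudocount of 0.5 to avoid division by zero, and uses a normal approximation for the two-sided p-value. The function and parameter names are introduced here for illustration.

```python
import math
from typing import Tuple


def log_odds_ratio_test(
    n1: int, n2: int, t1: int, t2: int,
    overdispersion: float = 1.0,
    pseudocount: float = 0.5,
) -> Tuple[float, float, float, Tuple[float, float]]:
    """Return (LOR, standard error, two-sided p-value, 95% confidence interval).

    LOR follows the convention in the text: log(T2/N2) - log(T1/N1).
    """
    a, b, c, d = (x + pseudocount for x in (n1, n2, t1, t2))
    lor = math.log(d / b) - math.log(c / a)
    # Wald variance of a log odds ratio, inflated by the overdispersion factor.
    se = math.sqrt(overdispersion * (1 / a + 1 / b + 1 / c + 1 / d))
    z = lor / se
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (lor - 1.96 * se, lor + 1.96 * se)
    return lor, se, p_value, ci
```

With balanced counts the LOR is zero and the p-value is 1. Increasing the overdispersion factor widens the interval and raises the p-value for the same counts.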
- the p-value or log odds ratio 212 output by statistical analysis 128 may be compared to a predetermined threshold (e.g., by p-value threshold comparator 214) to output HLA loss information 124.
- the functionality of p-value threshold comparator 214 may be provided by, for example, HLA loss evaluator 110 as described in reference to FIG. 1.
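The comparison against a predetermined threshold can be sketched as a small decision function. The threshold value of 0.01, the returned labels, and the function name are hypothetical choices made for illustration; the sign convention assumes LOR = log(T2/N2) - log(T1/N1), so a positive value means allele 1 is reduced relative to allele 2 in the tumor sample.

```python
def compare_to_threshold(p_value: float, log_odds_ratio: float, alpha: float = 0.01) -> str:
    """Turn a p-value and a log odds ratio into a coarse HLA loss label."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value >= alpha:
        return "no significant allelic imbalance"
    if log_odds_ratio > 0:
        return "allele 1 reduced relative to allele 2"
    return "allele 2 reduced relative to allele 1"
```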
- HLA loss information 124 may comprise, for example, a determination of HLA loss of heterozygosity in the subject’s tumor sample for one or more HLA genes.
- process 200 as illustrated in FIG. 2 may be performed using tumor sequence reads 118 and normal sequence reads 116 (e.g., WES sequence reads derived from tumor and normal samples collected from a subject) for at least one HLA gene.
- process 200 may be performed using tumor sequence reads 118 and normal sequence reads 116 (e.g., WES sequence reads derived from tumor and normal samples collected from the subject) for at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more than 20 HLA genes.
- the statistical analysis 128 can be performed to determine if an observed difference between an allelic ratio for a tumor sample (e.g., a ratio of the number of unique sequence reads that align to allele 1 to the number of unique sequence reads that align to allele 2) and a corresponding allelic ratio for a paired normal sample is statistically significant.
- An accurate statistical test requires an understanding of the variance of the underlying ratio statistic derived from sequence read data. Past studies have shown that in short-read sequencing data, the variance in sequence read counts can be higher than expected, resulting in a phenomenon called overdispersion, which can vary from sample to sample.
- the disclosed methods use a determination of baseline sequence read count ratios at a plurality of heterozygous genomic loci (e.g., a plurality of heterozygous SNP loci) outside of the HLA region in the genome of the normal sample.
- logR (i.e., log2 of the ratio of observed sequence read counts to expected read counts, where the expectation is based on the number of read counts observed in the normal sample)
- BAF (i.e., the “B allele frequency”; a normalized measure of the allelic sequence read count ratio of two alleles (A and B), such that a BAF of 1 or 0 indicates the complete absence of one of the two alleles (e.g., the pair of alleles at the gene locus is either AA or BB), and a BAF of 0.5 indicates the equal presence of both alleles (e.g., the pair of alleles at the gene locus is AB)).
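The two quantities defined above can be written out directly. In this minimal sketch, `depth_scale` is a hypothetical normalization factor (for example, the ratio of tumor to normal library depth) that converts the normal-sample count into an expected tumor count; it and the function names are assumptions, not terms from the disclosure.

```python
import math


def log_r(tumor_count: float, normal_count: float, depth_scale: float = 1.0) -> float:
    """log2 of observed over expected counts, with expected = normal_count * depth_scale."""
    expected = normal_count * depth_scale
    return math.log2(tumor_count / expected)


def b_allele_frequency(a_reads: int, b_reads: int) -> float:
    """Fraction of reads supporting allele B among reads supporting A or B."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("no reads at this locus")
    return b_reads / total
```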
- sequence read counts are determined for each allele, thus requiring four measurements instead of two, where the measurements are taken for genomic loci where two separate alleles are present.
- the disclosed allelic ratio approach to detecting HLA loss therefore requires methodology to identify allelic differences genome-wide for a given patient.
- the standard methodology for identifying genomic differences is very time-consuming, and has thus hindered the development of an allelic ratio-based approach to detection of HLA loss.
- Heterozygous SNPs are identified in the sequence read data for a normal sample obtained from the subject, and the corresponding number of sequence read counts are determined for the tumor sample at those same SNP locations, thereby yielding four allele-specific sequence read count values at a plurality of genomic loci.
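The step described above can be sketched as a filter over per-locus allele counts. The input layout (a mapping from locus to a pair of reference and alternate read counts), the minimum depth of 20, and the allele-fraction window of 0.3 to 0.7 used to call a SNP heterozygous in the normal sample are all hypothetical values chosen for illustration.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (reads for allele 1, reads for allele 2)


def select_heterozygous_snps(
    normal_counts: Dict[str, Counts],
    tumor_counts: Dict[str, Counts],
    min_depth: int = 20,
    fraction_window: Tuple[float, float] = (0.3, 0.7),
) -> Dict[str, Tuple[int, int, int, int]]:
    """Return {locus: (N1, N2, T1, T2)} for SNPs that look heterozygous in normal."""
    low, high = fraction_window
    selected = {}
    for locus, (n1, n2) in normal_counts.items():
        depth = n1 + n2
        if depth < min_depth:
            continue  # too few reads to judge heterozygosity
        if not low <= n2 / depth <= high:
            continue  # looks homozygous in the normal sample
        t1, t2 = tumor_counts.get(locus, (0, 0))
        selected[locus] = (n1, n2, t1, t2)
    return selected
```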
- the statistical analysis of the four sequence read count values used in the disclosed allelic ratio-based approach to detection of HLA loss is facilitated by the use of, e.g., a multinomial model (rather than a binomial model) and corresponding methods for estimating overdispersion.
- an overdispersed multinomial model is fit to the sequence read count data for the non-HLA genomic loci, and used to assess the statistical significance of differences in the observed allelic ratios of HLA sequence read counts between tumor and normal samples.
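The disclosure does not spell out the fitting procedure. As a stand-in, the sketch below estimates an overdispersion factor with a common quasi-likelihood shortcut: compute the Pearson chi-square statistic for the 2 x 2 table (N1, N2; T1, T2) at each non-HLA heterozygous locus and average the statistics, since each has one degree of freedom and an expected value of 1 when there is no overdispersion and no true imbalance. This is a swapped-in technique, named plainly as such, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def pearson_chi2_2x2(n1: int, n2: int, t1: int, t2: int) -> float:
    """Pearson chi-square statistic for independence in a 2 x 2 table."""
    observed = ((n1, n2), (t1, t2))
    rows = (n1 + n2, t1 + t2)
    cols = (n1 + t1, n2 + t2)
    total = sum(rows)
    if 0 in rows or 0 in cols:
        return 0.0  # uninformative table
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def estimate_overdispersion(loci: Iterable[Tuple[int, int, int, int]]) -> float:
    """Mean per-locus chi-square (1 degree of freedom each), floored at 1."""
    stats = [pearson_chi2_2x2(*counts) for counts in loci]
    if not stats:
        raise ValueError("at least one locus is required")
    return max(1.0, sum(stats) / len(stats))
```

A factor estimated this way could be used to inflate the variance in a log odds ratio test at the HLA locus. A median-based estimate may be more robust when many non-HLA loci carry genuine copy number changes, which is a design choice this sketch leaves open.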
- the HLA alleles present in the subject are identified.
- the subject is HLA-typed. This HLA typing can be performed with respect to each known HLA gene (e.g., HLA-A, HLA-B, HLA-C, etc.).
- the exact two HLA alleles present in a subject for each HLA gene in the subject are identified. These two HLA alleles may be the same (i.e., homozygosity) or different (i.e., heterozygosity). In other cases, the most likely options for the two HLA alleles present in the subject for each HLA gene are identified.
- HLA loss information 124 in FIG. 1 or FIG. 2 may comprise, for example, a determination of HLA loss of heterozygosity in the subject’s tumor sample for one or more HLA genes.
- FIG. 3 provides a schematic diagram illustrating HLA loss of heterozygosity for three HLA genes, i.e., HLA-A, HLA-B, and HLA-C.
- a normal sample comprises two copies of each gene, where the two copies may comprise a same allele or a different allele (in this example, all three gene loci are heterozygous, i.e., they each comprise two different alleles as indicated by the shading).
- loss of an allele e.g., through deletion of all or a portion of one copy of a gene locus
- FIG. 4 provides a flow diagram illustrating an example of a process 400 for quantifying allele-read alignments for HLA genes in accordance with one or more implementations of the disclosed methods.
- Process 400 is one example of a process that is implemented by evaluation system 100 or at least a portion of evaluation system 100 in FIG. 1.
- Process 400 may be implemented by, for example, HLA loss evaluator 110 in FIG. 1.
- Process 400 may be implemented by, for example, alignment analyzer 106 in FIG. 1.
- Step 402 in FIG. 4 includes receiving a first set of intron-resolution identifiers for a first allele for an HLA gene and a second set of intron-resolution identifiers for a second allele for the HLA gene.
- the first allele and the second allele are the same such that the first set of intron-resolution identifiers and the second set of intron-resolution identifiers are also the same.
- Step 404 in FIG. 4 includes receiving a first plurality of reads for a first sample and a second plurality of reads for a second sample.
- the first plurality of reads, the second plurality of reads, or both may be reads generated via, for example, WES or WGS.
- the first sample may be, for example, a sample of healthy or normal tissue.
- the second sample may be, for example, a sample of unhealthy or diseased tissue (e.g., a tumor).
- the first sample and the second sample are samples of tissue taken at first and second points in time, respectively, in the progression of a disease (e.g., a tumor, cancer, etc.).
- Step 406 in FIG. 4 includes identifying a first allele sequence for the first allele for the HLA gene using a selected one of the first set of intron-resolution identifiers and a second allele sequence for the second allele for the HLA gene using a selected one of the second set of intron-resolution identifiers.
- multiple combinations of alleles may be made when the first set of intron-resolution identifiers, the second set of intron-resolution identifiers, or both include multiple intron-resolution identifiers.
- the selection, for step 406, of one of the first set of intron-resolution identifiers and/or one of the second set of intron-resolution identifiers may be performed in different ways.
- the selection may be performed via random selection.
- the selection may be performed in an ordered manner (e.g., selecting the first one of the first set of intron-resolution identifiers, selecting the first one of the second set of intron-resolution identifiers alphanumerically, etc.).
- Step 408 in FIG. 4 includes quantifying allele-read alignments between the first allele sequence and each of the first plurality of reads and the second plurality of reads and between the second allele sequence and each of the first plurality of reads and the second plurality of reads to form an alignment output.
- Step 408 may include, for example, generating an alignment output based on alignment using the first allele sequence, the second allele sequence, the first plurality of reads, and the second plurality of reads.
- Step 408 includes, for example, but is not limited to, counting the number of alignments (e.g., exact alignments, alignments within 1 or 2 base pairs, alignments within 3 or 4 base pairs, etc.) between different combinations of the first allele and the second allele with the first sample and the second sample.
- step 408 can include generating a first count for a number of first allele and first sample alignments, a second count for a number of first allele and second sample alignments, a third count for a number of second allele and first sample alignments, and a fourth count for a number of second allele and second sample alignments.
- the first allele and first sample alignments are alignments between the first allele sequence and the first plurality of reads associated with the first sample.
- the first allele and second sample alignments are alignments between the first allele sequence and the second plurality of reads associated with the second sample.
- the second allele and first sample alignments are alignments between the second allele sequence and the first plurality of reads.
- the second allele and second sample alignments are alignments between the second allele sequence and the second plurality of reads.
- the alignment output in step 408 may take a different form.
- the alignment output may include, for example, but is not limited to, ratios, percentages, or other types of quantification formats that characterize the allele-read alignments between the first allele sequence, the second allele sequence, the first plurality of reads, and the second plurality of reads.
- the alignment output may include a first ratio and a second ratio.
- the first ratio may be, for example, a ratio of the number of alignments between the first allele sequence and the first plurality of reads and the number of alignments between the second allele sequence and the first plurality of reads.
- the second ratio may be, for example, a ratio of the number of alignments between the first allele sequence and the second plurality of reads and the number of alignments between the second allele sequence and the second plurality of reads.
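The four counts and two ratios described for step 408 can be sketched as follows. The input format (one `(sample, allele)` pair per counted alignment, with the labels `"first"`/`"second"` and `"allele1"`/`"allele2"`) is a hypothetical simplification; a real alignment output would come from an aligner and carry far more detail.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

SAMPLES = ("first", "second")
ALLELES = ("allele1", "allele2")


def quantify_alignments(alignments: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Count allele-read alignments per (sample, allele) and derive two ratios."""
    tally = Counter(alignments)
    counts = {(s, a): tally.get((s, a), 0) for s in SAMPLES for a in ALLELES}

    def ratio(sample: str) -> Optional[float]:
        # Allele 1 alignments over allele 2 alignments; None if undefined.
        denominator = counts[(sample, "allele2")]
        return counts[(sample, "allele1")] / denominator if denominator else None

    return {
        "counts": counts,
        "first_ratio": ratio("first"),
        "second_ratio": ratio("second"),
    }
```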
- steps 406 and 408 are repeated for various combinations of selected ones from the first set of intron-resolution identifiers and selected ones of the second set of intron-resolution identifiers. For example, all possible combinations using each possible intron-resolution identifier for the first allele and each possible intron-resolution identifier for the second allele may be evaluated to generate alignment output.
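Selecting one identifier per allele in an ordered manner, or enumerating every pairing so that steps 406 and 408 can be repeated, might look like the sketch below. The function names are hypothetical, and the identifier strings in the usage note are illustrative placeholders rather than typing results.

```python
from itertools import product
from typing import Iterable, List, Tuple


def ordered_selection(identifiers: Iterable[str]) -> str:
    """Pick the alphanumerically first identifier (one possible ordered selection)."""
    return min(identifiers)


def all_combinations(
    first_set: Iterable[str], second_set: Iterable[str]
) -> List[Tuple[str, str]]:
    """Every pairing of a first-allele identifier with a second-allele identifier."""
    return list(product(sorted(first_set), sorted(second_set)))
```

For instance, a first set containing `B*51:01:01:01` and `B*51:01:01:02` and a second set containing only `B*07:02:01:01` gives two combinations to evaluate.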
- HLA typing may be performed as described elsewhere herein (e.g. using any method known in the art, such as e.g. using the methods described in PCT International Patent Application Publication No. WO 2022/192304, Optitype (Szolek et al.
- FIGS. 5A-5D provide non-limiting examples of WES sequence read alignments for the HLA B*51:01:01:01 allele (Allele 1, where the 8-digit identifier indicates that this is an allele within allele group 51 for the HLA-B gene, i.e., specifically the allele that encodes for the B*51:01 HLA protein but that differs from the allele for the B*51:01 HLA protein by the presence of a synonymous mutation in the coding region, but that has no difference in the noncoding region) and the HLA B*07:02:01:01 allele (Allele 2) for paired normal and tumor samples, where the tumor sample exhibits a loss of heterozygosity.
- FIG. 5A provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus as a function of position within the gene locus in the normal sample.
- FIG. 5B provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus as a function of position within the gene locus in the normal sample.
- FIG. 5C provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus as a function of position within the gene locus in the tumor sample.
- FIG. 5D provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus as a function of position within the gene locus in the tumor sample.
- Sequence reads that aligned to unique genomic loci are shown in pink and blue in these figures, while sequence reads that mapped to more than one genomic locus are shown in green and yellow. As can be seen in this non-limiting example, fewer sequence reads were observed for allele 1 in the tumor sample, indicating that a loss of the allele has occurred in at least a fraction of the tumor sample.
- the number of aligned sequence reads in FIG. 5C may be non-zero because of stromal contamination of the tumor sample and/or tumor heterogeneity.
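The effect of normal-cell admixture on the tumor allelic ratio can be illustrated with a simple mixture model. This is an assumption made for illustration, not a formula from the disclosure: normal cells are taken to carry one copy of each allele, all tumor cells are taken to share the same allele copy numbers, and read counts are taken to be proportional to the number of allele copies in the sample.

```python
def expected_tumor_allelic_ratio(
    purity: float, tumor_copies_allele1: float, tumor_copies_allele2: float
) -> float:
    """Expected T1/T2 for a sample mixing tumor cells (fraction = purity) with normal cells."""
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    allele1 = (1.0 - purity) * 1.0 + purity * tumor_copies_allele1
    allele2 = (1.0 - purity) * 1.0 + purity * tumor_copies_allele2
    return allele1 / allele2
```

Under these assumptions, a complete single deletion of allele 1 at 80% purity still leaves an expected ratio of about 0.2 rather than zero, which is consistent with a non-zero read count for the lost allele.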
- FIGS. 6A-6B provide non-limiting schematic illustrations that compare prior copy number-based methods for detecting HLA LOH (FIG. 6A) to the allelic ratio-based method (FIG. 6B) described herein.
- copy number-based methods for detecting HLA LOH typically rely on determining tumor-to-normal ratios of sequence read counts, e.g., T1/N1 for allele 1 and T2/N2 for allele 2, as illustrated in FIG. 6A, and using a statistical analysis to determine if the sequence read count ratio observed for a given allele, T/N, is significantly different from an expected value of 1:1.
- the presently disclosed methods are based on determining and comparing allelic ratios for both tumor-derived sequence read data (e.g., T1/T2) and normal-derived sequence read data (e.g., N1/N2), as illustrated in FIG. 6B, and using a different statistical analysis to determine if the allelic ratio for the tumor-derived data is significantly different from the allelic ratio determined for the normal-derived data.
- the statistical analysis can comprise measuring sequence read counts or read count ratio (N1/N2) values in the sequence read data for the normal sample and sequence read counts or read count ratio (T1/T2) values in the sequence read data for the tumor sample at a plurality of non-HLA genomic loci (e.g., a plurality of heterozygous SNP loci located outside the HLA region of the genome), and fitting the N1, N2, T1, and T2 data to a multinomial model to estimate overdispersion and correct for the degree of overdispersion for a given pair of samples.
- FIGS. 7A-7D provide schematic illustrations of 2 x 2 contingency tables for sequence read data (e.g., HLA sequence read data) that might be expected for different genetic modifications to a tumor genome.
- FIG. 7A provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in normal and tumor samples.
- N1 is the number of unique sequence reads derived from a normal sample that align to allele 1.
- N2 is the number of unique sequence reads derived from the normal sample that align to allele 2.
- T1 is the number of unique sequence reads derived from a tumor sample (corresponding to the matched normal sample) that align to allele 1.
- T2 is the number of unique sequence reads derived from the tumor sample that align to allele 2.
- FIG. 7B provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a double deletion, as indicated by the lighter shading for both allele 1 and allele 2 in the tumor sample in comparison to the shading for allele 1 and allele 2 in the normal sample.
- FIG. 7C provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a single deletion, as indicated by the lighter shading for allele 1 in the tumor sample in comparison to that for the normal sample.
- FIG. 7D provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a copy neutral loss of heterozygosity (e.g., conversion of allele 1 to another copy of allele 2), as indicated by the lighter shading for allele 1 and the darker shading for allele 2 in the tumor sample in comparison to that for the normal sample.
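The three patterns illustrated in FIGS. 7B-7D can be distinguished, at least schematically, by the direction of change of each allele between normal and tumor. The sketch below is a deliberately crude illustration: the tolerance of 0.3 on the natural-log ratio, the `depth_scale` normalization, and the labels are hypothetical, and it ignores the statistical testing and overdispersion correction described elsewhere herein.

```python
import math


def classify_pattern(
    n1: int, n2: int, t1: int, t2: int,
    depth_scale: float = 1.0, tolerance: float = 0.3,
) -> str:
    """Label a 2 x 2 table by whether each allele goes down, up, or stays flat."""

    def state(tumor: int, normal: int) -> str:
        log_ratio = math.log((tumor / depth_scale) / normal)
        if log_ratio < -tolerance:
            return "down"
        if log_ratio > tolerance:
            return "up"
        return "flat"

    states = sorted((state(t1, n1), state(t2, n2)))
    if states == ["down", "down"]:
        return "double deletion"
    if states == ["down", "flat"]:
        return "single deletion"
    if states == ["down", "up"]:
        return "copy-neutral LOH"
    if states == ["flat", "flat"]:
        return "no change"
    return "other"
```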
- FIGS. 8A-8C provide non-limiting examples of 2 x 2 contingency tables for HLA sequence read counts that were observed for cells from ovarian cell line 23882 that exhibited different genetic modifications in a tumor genome.
- FIG. 8A provides a non-limiting example of sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA A*26:01 allele (allele 2), as indicated by the greatly decreased number of sequence reads that aligned to the HLA A*26:01 allele (allele 2) in the tumor sample in comparison to the number of sequence reads that aligned to the A*23:01 allele (allele 1) in both samples, and in comparison to the number of sequence reads that aligned to the HLA A*26:01 allele (allele 2) in the normal sample.
- FIG. 8B provides a non-limiting example of sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA B*35:01 allele (allele 1), as indicated by the greatly decreased number of sequence reads that aligned to the HLA B*35:01 allele (allele 1) in the tumor sample in comparison to the number of sequence reads that aligned to allele 1 in the normal sample, and in comparison to the number of sequence reads that aligned to the HLA B*49:01 allele (allele 2) in both the normal and tumor samples.
- FIG. 8C provides a non-limiting example of sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA C*04:01 allele (allele 1), as indicated by the greatly decreased number of sequence reads that aligned to the HLA C*04:01 allele (allele 1) in the tumor sample in comparison to the number of sequence reads that aligned to allele 1 in the normal sample, and in comparison to the number of sequence reads that aligned to the HLA C*07:01 allele (allele 2) in both the normal and tumor samples.
- FIGS. 9A-9B provide non-limiting schematic illustrations of the HLA (FIG. 9A) and non-HLA (FIG. 9B) genomic loci used to determine sequence read counts for a subject and evaluate allelic ratios, which are then used as input for a statistical model used to detect HLA loss of heterozygosity in the subject, in accordance with one or more implementations of the systems and method disclosed herein.
- In FIG. 9A, the two alleles (allele 1 and allele 2) for a given HLA gene are illustrated for tumor and normal samples. The ratio of sequence reads for the two alleles is determined for each sample (T1/T2 and N1/N2) based on the number of unique sequence reads that align to each allele.
- FIG. 9B illustrates a non-HLA genomic locus (e.g., a heterozygous single nucleotide polymorphism (SNP) locus).
- the plurality of heterozygous non-HLA genomic loci can comprise at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci.
- FIG. 10 provides a non-limiting example of a flowchart for a process 1000 (e.g., a computer-implemented method) for detecting an HLA alteration, in accordance with one or more implementations of the systems and methods disclosed herein.
- Process 1000 may be implemented using the evaluation system 100 described in FIG. 1.
- process 1000 or portions thereof may be performed by allele type generator 104 and HLA loss evaluator 110 (including processing by alignment analyzer 106 and statistics generator 108) as described in FIG.
- sequence read data (e.g., read data 112 in FIG. 1) for a plurality of sequence reads derived from a tumor sample (e.g., reads 116 in FIG. 1) and a normal sample (e.g., reads 118 in FIG. 1) from a subject (e.g., a patient) are received (e.g., by one or more processors of evaluation system 100 described in FIG. 1).
- the sequence read data may be derived, for example, by sequencing nucleic acid molecules (e.g. DNA) extracted from paired tumor and normal samples using a whole exome sequencing (WES) technique, a whole genome sequencing (WGS) technique, or both.
- sequence reads may be generated using, for example, a paired-end sequencing technique.
- the paired tumor and normal samples can be paired tumor and normal surgical resection samples.
- the paired tumor and normal samples can be paired tumor and normal tissue biopsy samples.
- the paired tumor and normal samples can comprise a tumor sample (e.g. biopsy or tumor resection) and a normal sample (e.g. biopsy or blood sample) from the same subject.
- a subject-specific reference sequence for an HLA region of the subject’s genome (e.g., a subject-specific HLA-ome) is received (e.g., by alignment analyzer 106 in FIG. 1).
- the subject-specific reference sequence for the HLA region may be generated by alignment analyzer 106 in FIG. 1 after the sequence read data has been processed by allele type generator 104, e.g. using a set of HLA allele identifiers or sequences associated with these identifiers (e.g. from an HLA sequence database).
- the subject-specific reference sequence for the HLA region can be generated, for example, by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to the HLA region of a reference genome sequence (e.g., the GRCh38 human reference genome (Genome Reference Consortium)).
- the distribution of sequence reads aligned to the HLA region of the reference genome includes sequence reads that align to exons in the HLA region of the reference genome sequence.
- the distribution of sequence reads aligned to the HLA region of the reference genome sequence includes sequence reads that partially align to introns of the HLA region.
- the subject-specific reference sequence for the HLA region can be generated, for example, by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to a reference set of HLA allele sequences.
- the subject-specific reference genome for the HLA region may be generated through the use of a set-covering algorithm to find a set of HLA gene alleles that best explains the observed set of normal sequence reads (e.g., normal WES sequence reads) in a sample as aligned to a set of HLA gene alleles.
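- The set-covering idea above can be sketched with a simple greedy heuristic. This is only an illustration, not the disclosed algorithm: the function name `greedy_allele_cover`, the `reads_by_allele` mapping (allele identifier to the set of normal read names that align to it), and the default limit of two alleles per gene are all assumptions made for the example.

```python
from typing import Dict, List, Set


def greedy_allele_cover(reads_by_allele: Dict[str, Set[str]],
                        max_alleles: int = 2) -> List[str]:
    """Greedily pick alleles that explain the most not-yet-explained reads.

    reads_by_allele is a hypothetical input mapping each candidate allele
    to the names of the normal reads that align to it.
    """
    uncovered: Set[str] = set().union(*reads_by_allele.values())
    chosen: List[str] = []
    while uncovered and len(chosen) < max_alleles:
        # Candidate explaining the most unexplained reads; ties go to the
        # alphabetically first allele because max() keeps the first maximum.
        best = max(sorted(reads_by_allele),
                   key=lambda allele: len(reads_by_allele[allele] & uncovered))
        gained = reads_by_allele[best] & uncovered
        if not gained:
            break  # no remaining allele explains any additional read
        chosen.append(best)
        uncovered -= gained
    return chosen


example_reads = {
    "A*23:01": {"r1", "r2", "r3"},
    "A*26:01": {"r3", "r4", "r5"},
    "A*02:01": {"r1"},
}
selected = greedy_allele_cover(example_reads)
```

- A greedy choice is a common approximation for set cover; an exact or likelihood-weighted formulation could select a different allele set on real data.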
- the identified HLA alleles may comprise 6-digit allelic identifiers (e.g., exon-resolution identifiers as described elsewhere herein).
- the identified HLA alleles may comprise 8-digit allelic identifiers (e.g., intron-resolution identifiers as described elsewhere herein).
- a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene, and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene are determined (e.g., by alignment analyzer 106 in FIG. 1) based on the sequence read data and the subject-specific reference sequence for the HLA region.
- the at least one HLA gene can comprise HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DQ, HLA-DP, or any combination thereof.
- the at least one HLA gene can comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more than 20 HLA genes.
- a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene are determined (e.g., by alignment analyzer 106 in FIG. 1) based on the sequence read data and the subject-specific reference sequence for the HLA region.
- an HLA alteration (e.g., an HLA loss of heterozygosity or an HLA copy number change)
- a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first allele and the determined number of unique tumor-derived sequence reads for the second allele
- a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first allele and the determined number of unique normal-derived sequence reads for the second allele.
- detecting HLA loss of heterozygosity for the at least one HLA gene can comprise performing a statistical analysis of the allelic ratio data (e.g., statistical analysis 128 performed by statistics generator 108 in FIG. 1) to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value.
- the expected value is the value determined for the normal allelic ratio (e.g., in order to account for potential sequencing and/or analysis artifacts). In some instances, for example, the expected value is 1.
- the expected value is associated with a similar allelic balance in the tumor and normal samples.
- the expected value is associated with a log odds ratio of 0 (or similarly, an odds ratio of 1) for the allelic ratio in the tumor and normal samples.
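- As a small worked illustration of the quantities above, the sketch below computes the tumor allelic ratio T1/T2, the normal allelic ratio N1/N2, and the natural log of their quotient. The function name and the example counts are made up for illustration; the sign convention (tumor ratio over normal ratio) is one of the two equivalent choices.

```python
import math
from typing import Tuple


def allelic_log_odds_ratio(t1: int, t2: int, n1: int, n2: int) -> Tuple[float, float, float]:
    """Return (tumor allelic ratio, normal allelic ratio, log odds ratio).

    Counts must be positive; pseudocounts can be added beforehand to
    avoid division by zero.
    """
    tumor_ratio = t1 / t2
    normal_ratio = n1 / n2
    log_odds_ratio = math.log(tumor_ratio / normal_ratio)
    return tumor_ratio, normal_ratio, log_odds_ratio


# Hypothetical counts: allele 1 loses half of its reads in the tumor.
tumor_ratio, normal_ratio, lor = allelic_log_odds_ratio(50, 100, 100, 100)
# A balanced tumor gives a log odds ratio of 0 (odds ratio of 1).
_, _, balanced_lor = allelic_log_odds_ratio(80, 80, 100, 100)
```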
- the statistical analysis may comprise computing one or more statistical metrics.
- the statistical analysis may include computing a t-statistic or z-score, using a binomial or multinomial model.
- the methods may comprise computing a t-statistic adjusted for overdispersion, testing the null hypothesis that the tumor-vs.-normal HLA allelic log ratio matches a genomic background tumor-vs.-normal log ratio estimated by comparing aggregated tumor-vs.-normal counts at non-HLA SNP loci (e.g. SNP loci for which both tumor and normal counts are consistent with diploid genomic backgrounds).
- the standard error for log(T1/N1) can be calculated as (LR1-LRnull)/SE(LR1), and (
- a t-statistic for a log odds ratio may be calculated as the difference between an estimate of log((N1*T2)/(N2*T1)) at the HLA gene and an estimate of background log((N1*T2)/(N2*T1)) estimated from non-HLA SNP loci, divided by a standard error estimate.
- the standard error around the LOR estimate can be calculated prior to overdispersion adjustment as sqrt(1/N1 + 1/T1 + 1/N2 + 1/T2).
- Any of the above standard errors can then be adjusted for overdispersion prior to being used to calculate a t-statistic and/or a confidence interval around the estimated log ratios and/or log odds ratio.
- Any of the above t-statistics can be used to obtain a p-value, for example a p-value for a two-sided t-test with a standard Gaussian reference distribution.
- the adjustment for overdispersion may be performed by using a standard deviation for the calculation of the t-statistic or the calculation of the confidence interval that is estimated as the product of the standard error estimates above (also referred to as “naive” standard error estimates) and an overdispersion factor estimated using the counts at non-HLA SNP loci.
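- The test described above can be sketched as follows. This is a minimal illustration under stated assumptions: the naive standard error is taken as the square root of the sum of the inverse counts, the background log odds ratio and the overdispersion factor are supplied by the caller (both parameter names are hypothetical), and the two-sided p-value uses a standard Gaussian reference distribution.

```python
import math
from typing import Tuple


def lor_t_statistic(t1: float, t2: float, n1: float, n2: float,
                    background_lor: float = 0.0,
                    overdispersion: float = 1.0) -> Tuple[float, float]:
    """Return (t-statistic, two-sided p-value) for the HLA log odds ratio."""
    # Log odds ratio at the HLA gene: log((N1*T2)/(N2*T1)).
    lor = math.log((n1 * t2) / (n2 * t1))
    # Naive standard error under multinomial sampling.
    naive_se = math.sqrt(1 / t1 + 1 / t2 + 1 / n1 + 1 / n2)
    # Multiplicative adjustment by the overdispersion factor.
    adjusted_se = naive_se * overdispersion
    t_stat = (lor - background_lor) / adjusted_se
    # Two-sided Gaussian tail probability: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value


t_naive, p_naive = lor_t_statistic(50, 100, 100, 100)
t_adj, p_adj = lor_t_statistic(50, 100, 100, 100, overdispersion=2.0)
```

- Doubling the overdispersion factor halves the t-statistic, so a call that looks significant under the naive standard error can become non-significant after adjustment.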
- a multinomial model may use 4 counts at each non-HLA SNP locus (counts in the tumor and normal sample for each allele at the locus, i.e. T1, T2, N1, N2), as in the example above.
- a binomial model may use 2 counts at each or a plurality of reference loci (non HLA loci, which may be heterozygous SNP loci or otherwise, i.e.
- An overdispersion factor can be calculated using a Pearson goodness-of-fit statistic, where the overdispersion factor is the square root of said statistic, or any other method known in the art to calculate overdispersion in multinomial data.
- the results or information generated by performing the statistical analysis may be referred to as a statistical output.
- the statistical analysis can comprise: (i) detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; (ii) determining, based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele and a number of unique tumor-derived sequence reads for a second SNP allele for each of the plurality of heterozygous SNP loci; (iii) determining, based on the sequence read data, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele for each of the plurality of heterozygous SNP loci; and (iv) estimating a degree of overdispersion in sequence read counts based on fitting the
- Estimating a degree of overdispersion in sequence reads counts may comprise estimating, based on the counts in (ii) and (iii), locus-specific multinomial probabilities for a count to be from normal or tumor chromosomes.
- Npi are the expected counts under a multinomial distribution where a total of N reads (where N is the sum of the observed counts for alleles 1 and 2 in the tumor and normal samples) are sampled from categories with probabilities pi (where pi are the estimated probabilities that the counts would be from a normal chromosome - which can assume that normal chromosomes are diploid and therefore a single probability can be estimated - or from either of the tumor chromosomes, which can have distinct frequencies).
- An estimated degree of overdispersion can be calculated using a residual deviance statistic.
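- The goodness-of-fit calculations mentioned above can be sketched for a single locus as follows. The category order (T1, T2, N1, N2), the function names, and the `df` argument are assumptions for illustration; with `df=1` the factor is the plain square root of the statistic, while dividing by residual degrees of freedom is a common alternative convention.

```python
import math
from typing import List, Sequence


def expected_counts(observed: Sequence[float], probs: Sequence[float]) -> List[float]:
    """Expected counts N * p_i, where N is the total observed read count."""
    total = sum(observed)
    return [total * p for p in probs]


def pearson_statistic(observed: Sequence[float], probs: Sequence[float]) -> float:
    """Pearson goodness-of-fit: sum of (O - E)^2 / E over categories."""
    expected = expected_counts(observed, probs)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def residual_deviance(observed: Sequence[float], probs: Sequence[float]) -> float:
    """Multinomial deviance: 2 * sum of O * ln(O / E); zero counts contribute 0."""
    expected = expected_counts(observed, probs)
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def overdispersion_factor(statistic: float, df: float = 1.0) -> float:
    """Square root of a goodness-of-fit statistic, optionally scaled by df."""
    return math.sqrt(statistic / df)


# Hypothetical locus: counts ordered (T1, T2, N1, N2), all four equally likely.
counts = [30, 20, 25, 25]
probs = [0.25, 0.25, 0.25, 0.25]
x2 = pearson_statistic(counts, probs)
dev = residual_deviance(counts, probs)
```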
- Expected counts at heterozygous loci e.g. non HLA SNPs
- a multinomial distribution or equivalently, probabilities for a read to be drawn from each of the alleles in each of the normal and tumor samples
- EM expectation maximization
- the EM algorithm finds values of parameters (here a set of probabilities pi) that maximize the log likelihood of the observed counts.
- the EM algorithm may be implemented using one or more of the following assumptions: (i) the normal sample may be assumed to be diploid, such that reads from each of alleles 1 and 2 in the normal samples are expected to occur with equal frequency (i.e.
- the tumor sample may be allowed to have different underlying copy numbers at the locus, such that reads from alleles 1 and 2 in the tumor samples may not be expected to occur with equal frequency (i.e.
- the log likelihood of the observed counts at a locus may be estimated taking into account that at each SNP two configurations are possible, one where the counts T1 are associated with the major chromosome (and counts T2 are associated with the minor chromosome, like in the normal sample; i.e.
- the EM algorithm may estimate posterior probabilities that counts T1 are associated with the major chromosome and counts T2 are associated with the minor chromosome, and that counts T2 are associated with the major chromosome and counts T1 are associated with the minor chromosome.
- An estimated degree of overdispersion can be calculated for each locus for each of these two possible configurations.
- An average or weighted average of the obtained estimates of overdispersion can then be obtained at each locus.
- a weighted average of the two overdispersion estimates that is weighted by the probability of the respective configuration i.e. posterior probabilities estimated by the EM algorithm
- a first overdispersion estimate may be obtained using the estimated posterior probabilities that counts T1 are associated with the major chromosome and counts T2 are associated with the minor chromosome
- a second overdispersion estimate may be obtained using the estimated posterior probabilities that counts T2 are associated with the major chromosome and counts T1 are associated with the minor chromosome.
- a weighted average may then be obtained in which the first overdispersion estimate is weighted by the estimated posterior probabilities that counts T1 are associated with the major chromosome, and the second overdispersion estimate is weighted by the estimated posterior probabilities that counts T2 are associated with the major chromosome.
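- The posterior-weighted average above can be illustrated for a single locus. This sketch does not run the EM algorithm itself: it assumes the tumor major-chromosome read fraction `q` has already been estimated, assumes equal prior weight for the two configurations, and uses the square root of a Pearson statistic as the per-configuration overdispersion estimate. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def configuration_posteriors(t1: int, t2: int, q: float) -> Tuple[float, float]:
    """Posterior that T1 (first value) or T2 (second value) is on the major chromosome.

    Uses binomial likelihoods with equal priors; the binomial coefficient
    is shared by both configurations and cancels.
    """
    log_a = t1 * math.log(q) + t2 * math.log(1 - q)
    log_b = t2 * math.log(q) + t1 * math.log(1 - q)
    shift = max(log_a, log_b)  # subtract the maximum for numerical stability
    w_a, w_b = math.exp(log_a - shift), math.exp(log_b - shift)
    return w_a / (w_a + w_b), w_b / (w_a + w_b)


def pearson_factor(observed: Sequence[float], probs: Sequence[float]) -> float:
    """Square root of the Pearson statistic against expected counts N * p_i."""
    total = sum(observed)
    return math.sqrt(sum((o - total * p) ** 2 / (total * p)
                         for o, p in zip(observed, probs)))


def weighted_overdispersion(t1: int, t2: int, n1: int, n2: int, q: float) -> float:
    """Average the two configuration-specific estimates, weighted by posterior."""
    observed: List[float] = [t1, t2, n1, n2]
    r = (t1 + t2) / sum(observed)  # share of all reads that come from the tumor
    # Normal sample assumed diploid: both normal alleles equally likely.
    probs_a = [r * q, r * (1 - q), (1 - r) / 2, (1 - r) / 2]  # T1 on major
    probs_b = [r * (1 - q), r * q, (1 - r) / 2, (1 - r) / 2]  # T2 on major
    post_a, post_b = configuration_posteriors(t1, t2, q)
    return post_a * pearson_factor(observed, probs_a) + post_b * pearson_factor(observed, probs_b)
```

- Because the SNPs are unphased, swapping T1 and T2 gives the same weighted estimate, which is the behavior the two-configuration treatment is meant to provide.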
- alleles 1 and 2 may be associated with the reference and alternative alleles, respectively, in both the normal and tumor samples (e.g. instead of the major and minor chromosome / locus).
- the log likelihood for a SNP locus may be a weighted sum of log likelihoods of the counts at the locus and counts at loci within a predetermined genomic distance (e.g.
- a moving window or counts of a predetermined number of loci closest to the locus (referred to as “nearby SNP loci”), where weighting is based on distance between the SNP locus and the nearby SNP loci.
- the estimated parameters of the multinomial model at each SNP can be used to calculate expected counts at each SNP, which can in turn be used to calculate a degree of overdispersion at each SNP.
- a plurality of estimated degrees of dispersion (e.g. obtained for each of a plurality of SNP loci, e.g. using expected counts obtained using posterior probabilities from an EM model as explained above, or in any other way) are summarized (e.g. by using the mean or median of these values) and used to adjust a standard error estimate as explained above.
- pseudocounts may be added to the counts of reads used in any estimation of log ratios or log odds ratios, in order to avoid division by zero.
- a pseudocount is a value that is added automatically to a count according to a predetermined scheme.
- a predetermined pseudocount is added to each count.
- a first pseudocount is added to the tumor counts and a second pseudocount is added to the normal counts, where the values of the first and second pseudocounts depend on the relative coverage (i.e. total number of reads over the whole genome or a portion of the genome, such as e.g. a chromosome, e.g. chromosome 6) in the tumor and normal sample.
- a first pseudocount equal to a predetermined value multiplied by 2*R, where R is the tumor fraction (calculated as tumor-to-normal coverage ratio/(1 + tumor-to-normal coverage ratio)) can be added to the tumor counts
- a second pseudocount equal to a predetermined value multiplied by 2*(1-R) can be added to the normal counts (where 1-R is the normal fraction, calculated as 1-the tumor fraction).
- the values of the first and second pseudocounts depend on the total number of reads in the respective sample across both alleles.
- the first pseudocount value may be equal to the total number of reads at the locus in the tumor sample (T1+T2) multiplied by a factor that depends on the relative coverage between tumor and normal samples (e.g. (T1+T2)*2*R).
- the second pseudocount value may be equal to the total number of reads at the locus in the normal sample (N1+N2) multiplied by a factor that depends on the relative coverage between tumor and normal samples (e.g. (N1+N2)*2*(1-R)).
- the predetermined values used when calculating the first and second pseudocount may be user defined values, such as e.g. 1.
- the first pseudocount value may be equal to 1*2*R and the second pseudocount value may be equal to 1*2*(1-R), such that a single count is added to both the tumor and the normal counts when the tumor and normal counts are equal.
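- The coverage-dependent pseudocount scheme above reduces to a few lines of arithmetic. The sketch below is illustrative only; the function name and the `base` parameter (the predetermined value, 1 by default) are assumptions.

```python
from typing import Tuple


def coverage_pseudocounts(tumor_coverage: float, normal_coverage: float,
                          base: float = 1.0) -> Tuple[float, float]:
    """Return (tumor pseudocount, normal pseudocount).

    R is the tumor fraction: ratio / (1 + ratio), where ratio is the
    tumor-to-normal coverage ratio.
    """
    ratio = tumor_coverage / normal_coverage
    tumor_fraction = ratio / (1 + ratio)
    return base * 2 * tumor_fraction, base * 2 * (1 - tumor_fraction)


# Equal coverage: one count is added to both tumor and normal counts.
equal = coverage_pseudocounts(1000.0, 1000.0)
# Tumor sequenced three times deeper than normal: R = 0.75.
deeper_tumor = coverage_pseudocounts(3000.0, 1000.0)
```

- With a tumor allele count of zero, adding the tumor pseudocount keeps the allelic ratio and its logarithm finite.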
- the overdispersion parameter estimate can be used to multiplicatively scale a standard error for a log odds ratio computed for a 2x2 contingency table as illustrated in FIG. 7A.
- this log odds ratio is equivalent to the log ratio of the respective tumor and normal allelic ratios (i.e., the log odds ratio is the natural logarithm of the ratio of the tumor allelic ratio divided by the normal allelic ratio, or the natural logarithm of the ratio of the normal allelic ratio divided by the tumor allelic ratio - the two having the same absolute value).
- Estimating the standard error of the log odds ratio can be performed using an established approach under multinomial sampling, e.g. by calculating the square root of the sum of the inverse of the individual counts.
- the estimated overdispersion parameter obtained by fitting the sequence read data to an overdispersed multinomial model is used to scale the "established" standard error.
- the method can further comprise using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log ratio of the tumor allelic ratio and normal allelic ratio.
- the adjusted standard error for the log odds ratio can be used to adjust a p-value or confidence interval for the log ratios and/or the log odds ratio of a detected HLA alteration (e.g., HLA loss of heterozygosity or HLA copy number change).
- the plurality of heterozygous SNP loci can comprise at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci.
- the plurality of heterozygous SNP loci can be filtered to remove artifactual SNP loci resulting from misalignment of sequence reads to the non-HLA region of the subject’s genome.
- the plurality of heterozygous SNP loci are detected using a non-sorting-based method for de-duplicating and tallying sequence read counts.
- the non-sorting based method for de-duplicating and tallying sequence read counts can comprise: (i) performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and (ii) performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index.
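- The two linear scans can be sketched with a hash-based index, which avoids sorting the alignments. This is an illustration under assumptions: reads are treated as duplicates when they share chromosome, start position, and strand, and the first read seen at a position is kept. The `AlignedRead` type and `mark_duplicates` function are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    start: int
    strand: str


def mark_duplicates(reads: List[AlignedRead]) -> List[bool]:
    """Return one flag per read; True marks a duplicate of an earlier read."""
    index: Dict[Tuple[str, int, str], int] = {}
    # First linear scan: record where each genomic position was first seen.
    for i, read in enumerate(reads):
        index.setdefault((read.chrom, read.start, read.strand), i)
    # Second linear scan: a read is a duplicate if an earlier read owns its position.
    return [index[(read.chrom, read.start, read.strand)] != i
            for i, read in enumerate(reads)]


reads = [
    AlignedRead("a", "chr6", 100, "+"),
    AlignedRead("b", "chr6", 100, "+"),  # same position and strand as "a"
    AlignedRead("c", "chr6", 100, "-"),
    AlignedRead("d", "chr6", 250, "+"),
]
flags = mark_duplicates(reads)
unique_count = flags.count(False)
```

- Both scans run in time linear in the number of reads, at the cost of holding the position index in memory.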
- the plurality of heterozygous SNP loci are detected using reads that have been aligned using a SNP-tolerant alignment method.
- the SNP-tolerant alignment method may use a predetermined set of SNPs.
- the predetermined set of SNPs may be known SNPs, such as e.g. SNPs obtained from a human genome polymorphism database, such as dbSNP (Sherry et al. Nucleic Acids Research, Volume 29, Issue 1, 1 January 2001, Pages 308-311; www.ncbi.nlm.nih.gov/snp/).
- the plurality of heterozygous SNP loci are detected in the normal sample.
- the plurality of heterozygous SNP loci are selected as SNP loci from a predetermined set of SNPs where at least a predetermined number of reads or proportion of reads at the locus in the normal sample includes the SNP (rather than the reference allele).
- a predetermined portion of the genome such as e.g. chromosome 6 or a portion thereof, excluding the HLA locus - as explained elsewhere herein
- a set of loci corresponding to a predetermined set of SNPs may be analysed in the normal sample and a subset of the set of loci may be selected for overdispersion estimation when the number or proportion of reads at the locus in the normal sample exceeds a predetermined threshold.
- a major genotype (allele) at each such locus may be defined as the reference allele or the allele with more counts in the normal sample. Then a 2 x 2 table of counts including T1, T2, N1 and N2 counts for each of the selected SNP loci may be obtained.
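- The locus selection and table construction above can be sketched as follows. The thresholds (`min_alt_reads`, `min_alt_fraction`), the requirement that both be met, and the `SnpCounts` record are assumptions chosen for the example; allele 1 is taken to be the allele with more reads in the normal sample, with ties going to the reference allele.

```python
from typing import Dict, Iterable, NamedTuple


class SnpCounts(NamedTuple):
    locus: str
    normal_ref: int
    normal_alt: int
    tumor_ref: int
    tumor_alt: int


def select_heterozygous_loci(loci: Iterable[SnpCounts],
                             min_alt_reads: int = 5,
                             min_alt_fraction: float = 0.2) -> Dict[str, Dict[str, int]]:
    """Keep loci with enough SNP-supporting normal reads; return 2 x 2 tables."""
    tables: Dict[str, Dict[str, int]] = {}
    for snp in loci:
        depth = snp.normal_ref + snp.normal_alt
        if depth == 0 or snp.normal_alt < min_alt_reads:
            continue
        if snp.normal_alt / depth < min_alt_fraction:
            continue
        if snp.normal_alt > snp.normal_ref:
            # Alternative allele is the major allele (allele 1).
            tables[snp.locus] = {"T1": snp.tumor_alt, "T2": snp.tumor_ref,
                                 "N1": snp.normal_alt, "N2": snp.normal_ref}
        else:
            # Reference allele is the major allele (allele 1).
            tables[snp.locus] = {"T1": snp.tumor_ref, "T2": snp.tumor_alt,
                                 "N1": snp.normal_ref, "N2": snp.normal_alt}
    return tables


candidate_loci = [
    SnpCounts("rs_example_1", normal_ref=40, normal_alt=60, tumor_ref=10, tumor_alt=90),
    SnpCounts("rs_example_2", normal_ref=100, normal_alt=1, tumor_ref=90, tumor_alt=0),
]
tables = select_heterozygous_loci(candidate_loci)
```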
- the method can further comprise diagnosing or confirming a diagnosis of a disease based on a detected HLA loss (e.g., a detected loss of HLA heterozygosity or a detected change in HLA copy number) for the at least one HLA gene.
- the method can further comprise identifying the subject for treatment of a disease (or identifying the subject as a candidate for treatment of a disease) based on a detected HLA loss for the at least one HLA gene.
- the method can further comprise identifying a treatment for a disease with which the subject has been diagnosed based on a detected HLA loss for the at least one HLA gene.
- the method can further comprise predicting a clinical outcome for a disease with which the subject has been diagnosed based on a detected HLA loss for the at least one HLA gene. In some instances, the method can further comprise identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA loss for the at least one HLA gene.
- the disease can be a cancer. Examples of cancers that may be associated with HLA LOH, for example, include, but are not limited to head and neck cancer, squamous cell lung cancer, stomach adenocarcinoma, diffuse large B-cell lymphoma and colon cancer.
- an immunotherapy is a therapy that targets one or more selected cancer antigens present in the subject
- the method comprising detecting HLA loss using a method as described herein, and selecting one or more cancer antigens that are predicted to be presented by an HLA allele that is not subject to HLA loss in the subject.
- Reference to an antigen being predicted to be presented by an HLA allele encompasses the antigen being predicted to bind to the HLA allele, to be presented by the HLA allele, or to be immunogenic in the context of the HLA allele.
- Such a method can comprise one or more of: identifying one or more candidate cancer antigens present in the subject, predicting whether the one or more candidate cancer antigens bind to and/or are presented by and/or are immunogenic in the context of one or more HLA alleles identified in the subject, and excluding from a set of one or more candidate cancer antigens one or more candidate cancer antigens that have been predicted to be exclusively or primarily bound by and/or presented by and/or immunogenic in the context of an HLA allele that has been determined to be subject to HLA loss using a method as described herein.
- a cancer antigen may be a neoantigen or a tumor associated antigen.
- the method may further comprise manufacturing the immunotherapy.
- the immunotherapy may be a vaccine.
- the vaccine may be an RNA-based vaccine, a DNA-based vaccine, a cell vaccine (e.g. dendritic cell vaccine) or a peptide-based vaccine.
- the vaccine may comprise one or more cancer antigen peptides or nucleic acids (e.g. RNA or DNA) encoding for one or more cancer antigen peptides, or antigen presenting cells presenting one or more cancer antigen peptides.
- the immunotherapy may be a T-cell based therapy, comprising a population of T cells that recognize one or more cancer antigens.
- the T cells may be engineered T cells, such as e.g. T cells engineered to express a T cell receptor that specifically recognizes a cancer antigen peptide.
- Predicting whether a candidate cancer antigen binds to and/or is presented by and/or is immunogenic in the context of one or more HLA alleles identified in the subject may be performed using any method known in the art, such as e.g. as described in WO 2022/016125 or as implemented in NetMHCpan-4.1 (Reynisson et al. Nucleic Acids Res. 2020 Jul 2; 48(W1): W449-W454), MHCflurry 2.0 (O’Donnell et al. Cell Systems, vol. 11, Issue 1, 22 July 2020, pp. 42-48.e7), or PRIME (Schmidt et al. Cell Rep Med. 2021 Feb 16; 2(2): 100194).
- An antigen may be considered to be primarily bound by and/or presented by and/or be immunogenic in the context of an HLA allele when the binding affinity of the antigen to the HLA allele and/or the probability of presentation of the antigen by the HLA allele and/or the probability of immunogenicity of the antigen in the context of the HLA allele is above a first predetermined threshold and the binding affinities of the antigen to all other HLA alleles in the subject that are not predicted to be subject to loss and/or the probabilities of presentation of the antigen by all other HLA alleles in the subject that are not predicted to be subject to loss and/or the probabilities of immunogenicity of the antigen in the context of all other HLA alleles in the subject that are not predicted to be subject to loss are below a second predetermined threshold.
- the first and second predetermined thresholds may be the same or different.
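- The two-threshold rule above can be expressed as a small filter over per-allele prediction scores. The sketch assumes that a higher score means stronger predicted binding, presentation, or immunogenicity, and that scores come from an external predictor; the function names, the score dictionaries, and the threshold values are hypothetical.

```python
from typing import Dict, List, Set


def primarily_presented_by_lost_allele(scores: Dict[str, float],
                                       lost_alleles: Set[str],
                                       first_threshold: float,
                                       second_threshold: float) -> bool:
    """True if a lost allele scores above the first threshold while every
    retained allele scores below the second threshold."""
    retained = [allele for allele in scores if allele not in lost_alleles]
    retained_all_weak = all(scores[allele] < second_threshold for allele in retained)
    return retained_all_weak and any(
        scores.get(allele, 0.0) > first_threshold for allele in lost_alleles)


def filter_candidates(candidates: Dict[str, Dict[str, float]],
                      lost_alleles: Set[str],
                      first_threshold: float = 0.5,
                      second_threshold: float = 0.5) -> List[str]:
    """Drop antigens predicted to depend on an allele that is subject to loss."""
    return [name for name, scores in candidates.items()
            if not primarily_presented_by_lost_allele(
                scores, lost_alleles, first_threshold, second_threshold)]


candidates = {
    "peptide_A": {"A*23:01": 0.9, "A*26:01": 0.05},  # presented by a retained allele
    "peptide_B": {"A*23:01": 0.8, "A*26:01": 0.7},   # presented by both alleles
    "peptide_C": {"A*23:01": 0.1, "A*26:01": 0.9},   # depends on the lost allele
}
kept = filter_candidates(candidates, lost_alleles={"A*26:01"})
```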
- the immunotherapy may be a natural killer (NK) cell therapy.
- the method may comprise selecting a subject for treatment with an NK cell therapy when the subject has been identified as having one or more HLA alleles subject to loss. While loss of HLA has been established as a mechanism of immune escape to T cell based immunity, susceptibility to NK cell mediated lysis has been shown to be inversely correlated with expression of surface MHC class I molecules on target cells. Thus, tumor cells that are subject to HLA loss (and which may therefore have reduced expression of one or more HLA alleles) may be more sensitive to NK cell therapy.
- the method may further comprise administering the immunotherapy to the subject. Therefore, also described herein are methods of treating a subject who has been diagnosed as having or being likely to have cancer, the methods comprising manufacturing an immunotherapy as described herein and administering the immunotherapy to the subject.
- the present disclosure also contemplates systems configured to perform the disclosed methods.
- a system will generally comprise: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform any of the methods described herein.
- the system may further comprise a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject.
- Non-transitory computer-readable storage media storing one or more programs are also disclosed, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform any of the methods described herein.
- FIG. 11 provides a non-limiting example of data for the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples from a test subject plotted as a function of the normalized genomic position in the NCI1672 chr1 region.
- the plot represents a compilation of the 2 x 2 contingency table sequence read count data for a plurality of heterozygous SNPs that allows one to view local trends across chromosomes in a new way (while still using only WES bulk read data).
- There are four data points plotted per genomic locus: the number of tumor-derived read counts for allele 1 and allele 2, and the number of normal-derived read counts for allele 1 and allele 2.
- the alignment analyzer 106 in FIG. 1 returns allele-specific sequence read counts for heterozygous SNPs in non-HLA regions of the subject’s genome. This allows one to infer regions with genomic copy number changes visually and also analytically.
- the plot in FIG. 11 shows the fraction of tumor-derived sequence reads aligned to allele 1 plotted in orange, and the fraction of normal-derived sequence reads aligned to allele 1 plotted in blue. The SNPs are unphased, so the data generates mirror-image “Rorschach” plots. The trend lines indicate the local estimate for allele 1 to allele 2 read count ratios.
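- The quantities behind such a plot can be sketched without any plotting library. The example below computes per-locus allele 1 fractions, their mirror images (since the SNPs are unphased), and a simple moving average as a stand-in for a local trend estimate. How the trend lines in the figures were actually computed is not stated, so the smoothing method, the window size, and all names here are assumptions.

```python
from typing import List, Sequence, Tuple


def allele1_fractions(counts: Sequence[Tuple[int, int]]) -> List[float]:
    """Fraction of reads on allele 1 for each (allele 1, allele 2) count pair."""
    return [a1 / (a1 + a2) for a1, a2 in counts if a1 + a2 > 0]


def mirrored(fractions: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair each fraction f with 1 - f, since the allele labels are unphased."""
    return [(f, 1.0 - f) for f in fractions]


def folded(fractions: Sequence[float]) -> List[float]:
    """Fold fractions onto the upper half, max(f, 1 - f), before smoothing."""
    return [max(f, 1.0 - f) for f in fractions]


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window is truncated at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical tumor counts at four consecutive heterozygous SNPs.
tumor_counts = [(50, 50), (75, 25), (25, 75), (80, 20)]
tumor_fractions = allele1_fractions(tumor_counts)
tumor_trend = moving_average(folded(tumor_fractions), window=3)
```

- Folding before smoothing keeps the two mirror-image branches from cancelling each other, so a sustained departure of the trend from 0.5 in the tumor but not in the normal sample is what would suggest allelic imbalance.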
- the alignment analyzer may use a SNP-tolerant alignment method.
- a SNP-tolerant alignment method is a method that allows alignment of sequencing reads to a plurality of possible genotypes at one or more polymorphic sites.
- a SNP-tolerant alignment method aligns reads not just to a single reference sequence, but to a reference ‘space’ of all possible combinations of major and minor alleles from a set of known SNPs.
- a set of known SNPs may be obtained from a human genome polymorphism database, such as e.g. dbSNP.
- aligning with a SNP-tolerant alignment method to a reference space instead of a single reference sequence avoids treating minor alleles as mismatches and thereby results in a more accurate alignment and hence more accurate read counts.
- Examples of SNP-tolerant alignment methods include the GSNAP aligner described in Wu & Nacu, Bioinformatics, Volume 26, Issue 7, April 2010, Pages 873-881; and graph-based aligners (see e.g. Rakocevic et al. Nature Genetics volume 51, pages 354-362, 2019). The plot in FIGS.
- FIGS. 20A-20D show read counts (total vs reference allele counts) for a representative window of 51 known SNPs in a normal sample and a matched tumor sample using a standard reference genome-based alignment method (FIG. 20A and FIG. 20C) and using a SNP-tolerant alignment method (FIG. 20B and FIG. 20D).
- the plots on the left (standard reference genome) show that using a standard reference genome alignment method can cause a bias towards the reference allele (showing as a deviation from the diagonal in the normal sample) which is corrected by the use of the SNP-tolerant alignment method.
- the use of the SNP-tolerant alignment method also identified additional SNPs that were missed entirely in the standard reference genome alignment method (shown as additional points in a different color in the plots on the right). Further, the use of the SNP-tolerant alignment method also makes it easier to detect loss of heterozygosity, as indicated by the shift away from the diagonal in the tumor sample (bottom right plot), compared to the standard reference genome alignment method (bottom left plot).
- FIG. 12 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the B34996 chr1 region.
- the plot in FIG. 12 shows the fraction of tumor-derived sequence reads aligned to allele 1 plotted in orange, and the fraction of normal-derived sequence reads aligned to allele 1 plotted in blue.
- the trend lines again indicate the local estimate for allele 1 to allele 2 read count ratios.
- the large deviations between the tumor trend line and the sample trend line suggest that large copy number changes have occurred in the tumor sample in the fractional position ranges of 0.1 - 0.2 and possibly 0.8 to 0.9 within the B34996 chr1 region for this sample.
- FIG. 13 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the Pirn 1603 chr6 region.
- the data in this plot provides an example of a partial loss of the Pirn 1603 chr6 region in the tumor sample, e.g., in the vicinity of the HLA-DRB and HLA-DPB loci.
- allelic ratio-based methods can also be applied to an assessment of allele-specific loss over the non-HLA region surrounding the HLA genes (since allelic ratios can be used to estimate allelic imbalance in any region), thereby providing additional evidence for better discrimination between HLA normal and HLA loss, since genomic HLA loss is typically achieved by tumor cells deleting a genomic segment that encompasses the entire HLA region.
- FIG. 14 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the NCH672 chr6 region. In this example, there is no HLA loss observed. Labels indicate the locations of HLA genes A, DQA1, DRB1, DQB1, DPB1.
- FIG. 15 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the B23882 chr6 region.
- HLA loss is observed across most of the B23882 chr6 region.
- Labels indicate the locations of HLA genes A, B, C.
- FIG. 16 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the B34996 chr6 region.
- HLA loss was observed across the entire B34996 chr6 region.
- Labels indicate the locations of HLA genes A, B, C, DQA1, DRB1, DQB1, DPB1, DRA.
- FIG. 17 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the NCI20009 chr6 region.
- HLA loss was observed across a significant portion of the NCI20009 chr6 region.
- Labels indicate the locations of HLA genes B, C, DQA1, DQB1, DRB1, DPB1, DRA.
- FIGS. 18A-18F provide a non-limiting example of data that was generated for sensitivity analysis testing of detection of HLA LOH based on allelic ratio data, in accordance with one or more implementations of the systems and methods disclosed herein.
- varying percentages of tumor DNA were mixed into a normal DNA sample.
- Each panel provides the tumor/normal coverage determined from sequence read data plotted as a function of chromosome 6 coordinate.
- FIG. 18A tumor/normal coverage data for a 10% tumor sample.
- FIG. 18B tumor/normal coverage data for a 20% tumor sample.
- FIG. 18C tumor/normal coverage data for a 30% tumor sample.
- FIG. 18D tumor/normal coverage data for a 40% tumor sample.
- FIG. 18E tumor/normal coverage data for a 50% tumor sample.
- FIG. 18F tumor/normal coverage data for a 100% tumor sample. Decreasing percentages of tumor samples make it progressively more difficult to determine HLA loss. Sensitivity in such situations is important because the percentage of tumor sample (or the percentage of tumor sample with a particular genetic alteration, such as HLA LOH) is variable in real clinical samples due to stromal contamination and/or tumor heterogeneity.
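Why lower tumor content weakens the signal can be illustrated with a small calculation. The following is a minimal sketch under a simple two-population mixing model that is an assumption for illustration only and is not taken from the disclosure: normal cells carry one copy of each allele, tumor cells carry the assumed copy numbers, and the normal sample is perfectly balanced. The function name `expected_abs_lor` and its parameters are hypothetical.

```python
import math

def expected_abs_lor(purity, tumor_retained_copies=1, tumor_lost_copies=0):
    """Expected magnitude of the tumor-vs-normal log odds ratio.

    Assumed model: a fraction `purity` of cells are tumor cells, the rest
    are normal cells with one copy of each allele, and read counts are
    proportional to allele copy number in the mixture.
    """
    retained = (1 - purity) * 1 + purity * tumor_retained_copies
    lost = (1 - purity) * 1 + purity * tumor_lost_copies
    if lost <= 0:
        return math.inf  # pure tumor with complete loss of one allele
    # The normal sample is assumed balanced, so its log allelic ratio is 0.
    return math.log(retained / lost)

# Expected signal for a few admixture levels (one allele deleted in tumor cells).
for purity in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(purity, round(expected_abs_lor(purity), 3))
```

Under these assumptions the expected signal shrinks toward zero as the tumor fraction decreases, so a fixed threshold becomes harder to exceed once sampling noise is taken into account.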
- FIGS. 21A-21B provide a non-limiting example of read counts (as a fraction of total counts at the locus, i.e., at each SNP, the 2x2 contingency table is normalized to sum to 1) on chromosome 6 (comprising the HLA locus) in a sample not subject to HLA loss (FIG. 21A, where normalized counts for alleles 1 and 2 are shown as red and green dots, respectively, as a function of normalized chromosomal coordinates) and corresponding statistics obtained using a method disclosed herein (FIG. 21B).
- FIGS. 21C-21D provide a non-limiting example of read counts (as a fraction of total counts, as above) on chromosome 6 (comprising the HLA locus) in a sample subject to HLA loss (FIG. 21C, where normalized counts for alleles 1 and 2 are shown as red and green dots, respectively, as a function of normalized chromosomal coordinates) and corresponding statistics obtained using a method disclosed herein (FIG. 21D). Allele 1 may refer to the major allele in the normal sample and allele 2 may refer to the minor allele in the normal sample. Alternatively, when using known SNPs, allele 1 may refer to the reference allele and allele 2 may refer to the alternate allele.
- the black lines are estimated multinomial frequencies for the major and minor alleles calculated over a moving window.
- the model is constrained to have a single parameter for the two normal alleles (as the normal gene is expected to be diploid, with equal frequencies of the two alleles at heterozygous loci), and a parameter for each of the two tumor alleles, giving rise to 3 parameters.
- the vertical lines in FIG. 21A and FIG. 21C show the location of the HLA region.
- the plots in FIG. 21B and FIG. 21D show statistical estimates (points) and associated confidence intervals (vertical bars) for each of the HLA genes indicated on the x-axis.
- log ratios of tumor vs normal counts for allele 1: log(T1/N1)
- log ratios of tumor vs normal counts for allele 2: log(T2/N2)
- Log ratios near the genome-wide tumor-to-normal ratio in the left and middle plots indicate similar counts in the tumor and normal samples for the respective allele.
- Log odds ratios (LORs) near zero in the right plots indicate similar allelic balance in the normal sample and the tumor sample. Log odds ratios away from zero indicate allelic imbalance that differs between the normal and tumor samples. As the normal sample is generally assumed to have no allelic imbalance, LORs away from zero indicate a likely allelic imbalance in the tumor sample.
- the confidence intervals shown around the estimates are the 95% confidence intervals estimated as ± 1.96 SE, where SE is the standard error for the estimate. The standard error estimate for the LOR can be estimated as $\sqrt{\frac{1}{N_1} + \frac{1}{N_2} + \frac{1}{T_1} + \frac{1}{T_2}}$.
- the standard error estimate for the log ratios can be estimated for allele 1 or allele 2, respectively.
- the standard error estimates for the log ratios and/or log odds ratios can be adjusted for overdispersion using an estimated degree of overdispersion obtained from counts of non-HLA SNPs (as is the case in the data in FIGS. 21A-21B, which uses known SNPs on chr 6, excluding the HLA locus, to estimate a degree of overdispersion using a multinomial model and a χ² statistic).
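The statistics described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function name `allelic_statistics` is hypothetical, the sign convention LOR = log((T2/T1)/(N2/N1)) is an assumption, and the standard error is the usual large-sample estimate for a log odds ratio from a 2x2 table of counts (no overdispersion correction is applied here).

```python
import math

def allelic_statistics(t1, t2, n1, n2, z=1.96):
    """Log ratios, log odds ratio (LOR), its standard error and a confidence interval.

    t1, t2: tumor read counts for allele 1 and allele 2.
    n1, n2: normal read counts for allele 1 and allele 2.
    """
    if min(t1, t2, n1, n2) <= 0:
        raise ValueError("all four counts must be positive")
    log_ratio_1 = math.log(t1 / n1)  # tumor vs normal, allele 1
    log_ratio_2 = math.log(t2 / n2)  # tumor vs normal, allele 2
    # Assumed sign convention: LOR = log((T2/T1) / (N2/N1)).
    lor = log_ratio_2 - log_ratio_1
    se = math.sqrt(1 / n1 + 1 / n2 + 1 / t1 + 1 / t2)
    return {
        "log_ratio_1": log_ratio_1,
        "log_ratio_2": log_ratio_2,
        "lor": lor,
        "se": se,
        "ci": (lor - z * se, lor + z * se),
    }

# Hypothetical counts: allele 2 is depleted in the tumor sample.
print(allelic_statistics(t1=400, t2=100, n1=250, n2=250))
```

Because the LOR compares the allelic ratio in the tumor sample with the allelic ratio in the normal sample, a uniform difference in sequencing depth between the two samples cancels out of it.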
- Embodiments of the methods described herein use the OR or LOR statistics to determine whether an HLA allele is subject to loss. This advantageously makes it possible to discriminate between HLA loss in a tumor sample and differences in coverage between the tumor and normal samples.
- a threshold on the LOR can then be applied to determine whether one of the alleles has been lost in the tumor sample (or more specifically whether there is allelic imbalance in the tumor sample that is not present in the normal sample - where the value of the LOR or a simple investigation of the read counts can resolve which allele has been lost).
- Double-deletions are not detectable using the LOR but can be detected by looking for changes in both HLA alleles relative to the background tumor-vs-normal rate, i.e.
- a p-value can be calculated associated with a two-sided test of the hypothesis that the true odds ratio equals one using a t-test as explained above.
- a threshold on the LOR can be applied by requiring that the confidence interval of the LOR is entirely above a predetermined value. The threshold may vary depending on the desired balance of false positive to false negative rates. Lower thresholds e.g. 0.75 or below may include more false positives than higher thresholds, e.g. 1, but higher thresholds may include more false negatives than lower thresholds.
- the confidence interval may be associated with a predetermined level of confidence, e.g. a 90% confidence interval, a 95% confidence interval or a 98% confidence interval.
- a 95% confidence interval may be used.
- the 95% confidence interval around a LOR can be estimated as LOR ± 1.96 SE, where SE is the standard error for the LOR estimate, or a corrected version thereof (e.g. a standard error estimate corrected for overdispersion of counts as explained elsewhere herein).
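The thresholding rule described above (requiring the whole confidence interval to clear a predetermined value) can be sketched as follows. This is an illustrative sketch only: the function name `call_allelic_imbalance` and the return labels are hypothetical, and treating negative LOR values symmetrically (interval entirely below the negated threshold) is an assumption made so that loss of either allele can be flagged.

```python
def call_allelic_imbalance(lor, se, threshold=0.75, z=1.96):
    """Call imbalance only when the whole confidence interval clears the threshold.

    lor: log odds ratio estimate; se: its (optionally corrected) standard error.
    threshold: predetermined LOR value, e.g. 0.75 or the more conservative 1.
    z: normal quantile for the chosen confidence level (1.96 for 95%).
    """
    lower, upper = lor - z * se, lor + z * se
    if lower > threshold:
        return "imbalance_positive_lor"
    if upper < -threshold:  # assumed symmetric handling of negative LORs
        return "imbalance_negative_lor"
    return "no_call"

# The same estimate can pass a lower threshold and fail a higher one.
print(call_allelic_imbalance(1.2, 0.2, threshold=0.75))  # imbalance_positive_lor
print(call_allelic_imbalance(1.2, 0.2, threshold=1.0))   # no_call
```

Requiring the entire interval, rather than the point estimate, to exceed the threshold means that noisy estimates from low read counts are less likely to produce a call.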
- the variability in the counts may be larger than assumed by the above tests. This means that the true standard error estimate of the log odds ratio and odds ratios may be larger than the value mentioned above, which may lead to some false positive results.
- the present inventors have designed a method that uses read counts at polymorphic locations in the genome to estimate overdispersion patterns. This can then be used to correct the standard error estimate around the estimates provided above, thereby reducing the risk of false positives (i.e. samples being identified as subject to HLA allelic loss or imbalance when the data does not support such a conclusion).
- Overdispersion manifests itself in the presence of a highly variable total number of reads (T1+T2+N1+N2), ratio of tumor vs normal total reads ((T1+T2)/(N1+N2)) and counts of allele 1 vs allele 2 (N1+T1 vs N2+T2) across polymorphic loci.
- a sliding window i.e. taking into account counts at SNPs within a window
- an overdispersion estimate can be obtained for each window and an aggregate estimate (e.g. median) can then be obtained using the estimates for each of the windows.
- the use of a sliding window for estimating overdispersion reduces the risk of overdispersion estimates being obtained across regions that have different copy numbers, particularly in the tumor sample. Indeed, estimation of overdispersion for a set of data assumes that the data comes from the same underlying distribution.
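The windowed estimate and its aggregation can be sketched as below. This is a simplified illustration, not the disclosed model: it assumes alleles are labeled consistently across the SNPs in a window, fits a single shared set of multinomial cell probabilities per window, and uses the Pearson χ² statistic divided by its degrees of freedom as the dispersion estimate. The function names, window size and step are hypothetical.

```python
from statistics import median

def window_dispersion(rows):
    """Pearson chi-square / degrees of freedom for one window.

    rows: list of (t1, t2, n1, n2) read counts, one tuple per SNP.
    """
    k = len(rows)
    if k < 2:
        raise ValueError("a window needs at least two SNPs")
    totals = [sum(r) for r in rows]
    grand = sum(totals)
    # Shared multinomial cell probabilities estimated from pooled counts.
    cell_p = [sum(r[j] for r in rows) / grand for j in range(4)]
    chi2 = 0.0
    for r, n in zip(rows, totals):
        for obs, p in zip(r, cell_p):
            expected = n * p
            if expected > 0:
                chi2 += (obs - expected) ** 2 / expected
    return chi2 / ((k - 1) * 3)

def sliding_overdispersion(rows, window=20, step=10):
    """Median of per-window dispersion estimates along the region."""
    starts = range(0, len(rows) - window + 1, step)
    estimates = [window_dispersion(rows[i:i + window]) for i in starts]
    if not estimates:
        raise ValueError("region is shorter than one window")
    return median(estimates)

# Perfectly proportional counts show no excess variability.
print(sliding_overdispersion([(10, 10, 10, 10)] * 30))  # 0.0
```

Taking the median across windows keeps a few windows that straddle a copy number change from dominating the aggregate estimate.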
- the region of the genome may comprise the whole of chromosome 6 excluding SNPs in HLA genes.
- the region of the genome may be any region comprising at least 100, at least 200, at least 300, at least 400, or at least 500 heterozygous SNPs.
- the methods estimate locus-specific multinomial probabilities for a count to be from a normal chromosome (where the normal genome can be assumed to be diploid, such that a single probability can be estimated), and the probabilities or the counts to be from either tumor chromosome, which can have distinct locus-specific frequencies. These probabilities are then used to estimate overdispersion relative to multinomial counts in the observed count data.
- Embodiments of the methods may use unphased SNP data.
- polymorphic loci for tumor alleles in a window are assigned probabilistically as coming from the major or minor tumor chromosome (i.e., the chromosome with the greater or lesser copy number at the locus, respectively) and the frequencies for these two chromosomes estimated via an EM algorithm.
- Maximum likelihood estimates of the probabilities are used to estimate multinomial count expected values and variances, which are in turn used to estimate local overdispersion.
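The probabilistic assignment of unphased SNPs can be illustrated with a reduced EM example. This sketch is an assumption-laden simplification of what is described above: it looks only at tumor allele counts, models each SNP as a 50/50 mixture of two binomials (allele 1 on the major chromosome, or allele 1 on the minor chromosome), and estimates a single major-chromosome read fraction for the window. The function names and the initial value are hypothetical.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def em_major_fraction(tumor_counts, p_init=0.6, n_iter=100, tol=1e-9):
    """Estimate the fraction of tumor reads from the major chromosome.

    tumor_counts: list of (allele1_reads, allele2_reads), one pair per SNP.
    Returns (p, resp), where resp[i] is the posterior probability that
    allele 1 of SNP i lies on the major chromosome.
    """
    eps = 1e-9
    p = p_init
    total = sum(a + b for a, b in tumor_counts)
    resp = []
    for _ in range(n_iter):
        logit = math.log(p / (1.0 - p))
        # E-step: posterior odds are (p / (1 - p)) ** (a - b).
        resp = [_sigmoid((a - b) * logit) for a, b in tumor_counts]
        # M-step: expected share of reads assigned to the major chromosome.
        major = sum(r * a + (1.0 - r) * b for r, (a, b) in zip(resp, tumor_counts))
        p_new = min(max(major / total, eps), 1.0 - eps)
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p, resp

# Hypothetical unphased counts: the major chromosome alternates between alleles.
counts = [(80, 20), (20, 80), (78, 22), (25, 75)]
p_hat, resp = em_major_fraction(counts)
print(round(p_hat, 4), [round(r, 3) for r in resp])
```

Starting the iteration away from 0.5 matters in this reduced model, because at exactly 0.5 both mixture components are identical and the update does not move.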
- the present inventors have found that phasing did not improve the accuracy of the LOH calling, and therefore the use of unphased SNPs results in a method that is more computationally efficient while preserving LOH calling accuracy.
- the difference between the observed variance in counts vs. expected variance in counts under a multinomial distribution can then be used to adjust the standard error around the OR or LOR estimates obtained as described herein. For example, standard error estimates may be multiplied by an overdispersion estimate based on this difference.
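The adjustment can be sketched as a rescaling of the standard error before the confidence interval is formed. The exact form of the multiplier is not spelled out in this passage, so the choices here are assumptions: the dispersion is treated as a variance ratio (observed over expected), the standard error is scaled by its square root, and values below 1 are not allowed to shrink the interval. The function name is hypothetical.

```python
import math

def adjusted_confidence_interval(lor, se, dispersion, z=1.96):
    """Inflate the standard error for overdispersion and recompute the interval.

    dispersion: assumed ratio of observed to expected (multinomial) variance.
    """
    # Assumption: never shrink the standard error when dispersion < 1.
    factor = math.sqrt(max(dispersion, 1.0))
    se_adj = se * factor
    return se_adj, (lor - z * se_adj, lor + z * se_adj)

# With fourfold excess variance the interval doubles in width.
print(adjusted_confidence_interval(1.0, 0.2, 4.0))
```

A wider interval makes it harder for the whole interval to clear a LOR threshold, which is how the correction lowers the false positive rate.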
- FIG. 22 shows using synthetic cell line mixtures that the methods of the disclosure are able to call HLA LOH down to small tumor purity samples (e.g. samples with 50% and below 50% tumor content, i.e. tumor samples in which 50% or more of the sample comprises normal cell contamination), demonstrating the excellent sensitivity of the method. This is the case even using the more conservative threshold of 1 on LO
- Synthetic cell line mixtures are obtained which comprise decreasing percentages of tumor cell lines with known HLA loss (specifically, ovarian cancer cell line B23882 with a loss of an HLA-A, an HLA-B and an HLA-C allele, tumor cell line B34996 with a loss of an HLA-B and an HLA-C allele, lung cancer cell line NCI-H1672 homozygous for all of HLA-A, -B and -C, and adenocarcinoma cell line NCI-H2009 homozygous for HLA-A, and with a loss of an HLA-B and an HLA-C allele) relative to matched normal cells (respectively germline cell lines corresponding to the tumor cell lines B23882 and B34996, and NCI-BL1672 and NCI-BL2009 derived from blood and corresponding to NCI-H1672 and NCI-H2009).
- HLA loss was called as described herein.
- the plots show for each cell line the number of HLA LOH events in HLA-A, -B, -C that were called (expected numbers are 3 in B23882 mixtures, 2 in B34996 mixtures, none in NCH672 mixtures, and 2 in NCI2009 mixtures) as a function of the tumor purity of the synthetic mixture, using a LOR threshold of 0.75 and using a more conservative LOR threshold of 1.
- the results show that in the B23882 mixtures, all 3 known HLA LOH events are still called with sample purities as low as 40% with the 0.75 threshold, and as low as 50% with the more conservative threshold of 1.
- process 1000 depicted in FIG. 10 can be used to make various types of decisions with respect to at least one of treating or predicting the progression or outcome of a disease such as a tumor or cancer.
- these processes provide a way of detecting (e.g., estimating or determining) HLA loss (e.g., HLA LOH) in one sample as compared to another sample.
- HLA loss e.g., loss of one or more HLA alleles
- T cell cancer therapy can be personalized to account for the loss of certain HLA alleles that would prevent, for example, T cells from reacting to a neoantigen associated with those HLA alleles.
- loss of expression e.g., via absence of the HLA alleles in at least part of the tumor cell population, or reduced expression
- peptides may be presented to the cell surface of a cell with one or more of a subject's HLA alleles (e.g., HLA-A, HLA-B, HLA-C) for immune surveillance. If an antigen (e.g., peptide) appears foreign to the immune system, that cell is killed by the immune system. If HLA allele loss has been detected, a prediction can be made regarding which foreign antigens would have been presented by the lost HLA allele. This type of prediction would help refine the selection of foreign antigens used as targets for tumor therapy (e.g., tumor vaccines). In some cases, if a subject has lost many or all HLA alleles, a determination may be made that the subject is not a good candidate for a tumor vaccine and should be considered for other types of therapies (e.g., therapies involving NK cells).
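The refinement step described above can be pictured as a filter over candidate antigens. This is a minimal sketch with hypothetical data structures: the function name `filter_candidates`, the dictionary keys, the placeholder peptide labels and the example allele assignments are all illustrative, and the predicted presenting allele for each candidate is assumed to come from some upstream prediction step.

```python
def filter_candidates(candidates, lost_alleles):
    """Split candidates by whether their presenting HLA allele was lost.

    candidates: list of dicts with keys "peptide" and "hla_allele".
    lost_alleles: iterable of HLA allele names called as lost in the tumor.
    """
    lost = set(lost_alleles)
    kept = [c for c in candidates if c["hla_allele"] not in lost]
    dropped = [c for c in candidates if c["hla_allele"] in lost]
    return kept, dropped

# Placeholder candidates; peptide labels and allele assignments are made up.
candidates = [
    {"peptide": "PEPTIDE_1", "hla_allele": "HLA-A*02:01"},
    {"peptide": "PEPTIDE_2", "hla_allele": "HLA-B*07:02"},
    {"peptide": "PEPTIDE_3", "hla_allele": "HLA-A*02:01"},
]
kept, dropped = filter_candidates(candidates, lost_alleles=["HLA-A*02:01"])
print([c["peptide"] for c in kept])     # ['PEPTIDE_2']
print([c["peptide"] for c in dropped])  # ['PEPTIDE_1', 'PEPTIDE_3']
```

Keeping the dropped list, rather than discarding it, makes it easy to see how many candidates depended on a lost allele, which is relevant to judging whether a subject remains a good vaccine candidate.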
- Neoantigen vaccines can prime a subject's T cells to recognize and attack cancer cells expressing one or more particular tumor neoantigens. This approach can generate a tumor-specific immune response that spares healthy cells while targeting tumor cells.
- the individualized vaccine may be engineered or selected based on the information generated by the various embodiments described above.
- An immunotherapy such as, for example, without limitation, a cancer treatment may include collecting a sample (e.g., a blood sample) from a subject. T cells can be isolated and stimulated. The isolation can be performed using, for example, density gradient sedimentation (e.g., and centrifugation), immunomagnetic selection, and/or antibody-complex filtering.
- the stimulation may include, for example, antigen-independent stimulation, which may use a mitogen (e.g., PHA or Con A) or anti-CD3 antibodies (e.g., to bind to CD3 and activate the T-cell receptor complex) and anti-CD28 antibodies (e.g., to bind to CD28 and stimulate T cells).
- a set of peptides can be selected to use in the treatment of the subject based on the information provided by the various embodiments described above corresponding to predictions as to whether and/or an extent to which each of the set of peptides would bind to an MHC molecule (or HLA molecule) of the subject, be presented by the MHC molecule of the subject and/or trigger an immune response in the subject.
- the set of peptides can be selected based on the detection of HLA loss within one or more tumor samples. This HLA loss may include the loss of one or more HLA alleles.
- the set of peptides can be used to produce antigen, e.g. mutant peptide (for example, neoantigen) specific T cells.
- peripheral blood T cells can be isolated from a subject and contacted with one or more mutant peptides or tumor-associated peptides to induce, identify, select or enrich the T cells for mutant peptide-specific T-cell populations that can be administered to a subject.
- the T cell receptor sequence of the mutant/tumor associated peptide-reactive T cells can be sequenced.
- T cells can be engineered to include the T cell receptor that specifically recognizes the mutant peptide. These engineered T cells can then be administered to a subject. See, e.g., Matsuda et al. "Induction of Neoantigen-Specific Cytotoxic T Cells and Construction of T-cell Receptor Engineered T Cells for Ovarian Cancer," Clin. Cancer Res. 1-11 (2018), hereby incorporated by reference in its entirety for all purposes.
- the T cells can be expanded in vitro and/or ex vivo prior to administration to a subject.
- the subject may then be administered (e.g., infused with) a composition that includes the expanded population of T cells.
- the treatment is administered to an individual in an amount effective to, for example, prime, activate and expand T cells in vivo.
- the above examples provide some examples of different types of immunotherapies that may be developed based on HLA loss detection.
- the detection of HLA loss can be used to personalize immunotherapy (e.g., personalize a cancer immunotherapy), determine when to include and when to exclude an antigen that would be presented by an HLA allele as a potential target for an immunotherapy, and/or inform other decisions regarding immunotherapy.
- the immunotherapy may be selected from a group consisting of a T cell therapy, a personalized cancer therapy, an antigen-specific immunotherapy, an antigen-dependent immunotherapy, a vaccine, and a natural killer (NK) cell therapy.
- FIG. 19 provides a non-limiting example of a block diagram for a computer system 1900, in accordance with some embodiments.
- Computer system 1900 can be an example of one implementation for computer system 102 described above in FIG. 1.
- Computer system 1900 can be a host computer connected to a network.
- Computer system 1900 can be a client computer or a server.
- computer system 1900 can be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device (portable electronic device), such as a phone or tablet.
- the device can include, for example, one or more of processor 1910, input device 1920, output device 1930, storage 1940, and communication device 1960.
- Input device 1920 and output device 1930 can generally correspond to those described elsewhere herein, and they can either be connectable or integrated with the computer.
- Input device 1920 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device.
- Output device 1930 can be any suitable device that provides output, such as a touch screen, haptics device, or speaker.
- Storage 1940 can be any suitable device that provides storage, such as an electrical, magnetic, or optical memory including a RAM, cache, hard drive, or removable storage disk.
- Communication device 1960 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device.
- the components of the computer can be connected in any suitable manner, such as via a physical bus 1970 or wirelessly.
- Software 1950, which can be stored in memory / storage 1940 and executed by processor 1910, can include, for example, the programming that embodies the functionality of the present disclosure (e.g., as embodied in the methods described above).
- Software 1950 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions.
- a computer-readable storage medium can be any medium, such as storage 1940, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.
- Software 1950 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions.
- a transport medium can be any medium that can communicate, propagate, or transport programming for use by or in connection with an instruction execution system, apparatus, or device.
- the transport readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation medium.
- Computer system 1900 may be connected to a network, which can be any suitable type of interconnected communication system.
- the network can implement any suitable communications protocol and can be secured by any suitable security protocol.
- the network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
- Computer system 1900 can implement any operating system suitable for operating on the network.
- Software 1950 can be written in any suitable programming language, such as C, C++, Java, or Python.
- application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a web browser as a web-based application or web service, for example.
- Embodiments disclosed herein may include:
- a method for detecting Human Leukocyte Antigen (HLA) alterations for a subject comprising: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a plurality of sequence reads derived from a normal sample from the subject; receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA gene based on: a tumor allelic ratio comprising
- HLA alteration is an HLA copy number change.
- the plurality of sequence reads is derived by sequencing nucleic acid molecules extracted from the tumor and normal samples from the subject using a whole exome sequencing (WES) technique.
- detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value.
- detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for one or more of: a deviation of a log ratio of the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene from an expected value, a deviation of the log ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene from an expected value, and a deviation from an expected value of a log odds ratio corresponding to the log ratio of (i) the ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene and (ii) the ratio of the number of unique
- an expected value for a log ratio corresponds to an absence of imbalance between the tumor and normal samples, and/or the expected value for the log odds ratio corresponds to a lack of HLA allelic imbalance, optionally wherein the expected value for the log odds ratio corresponds to a log odds ratio of 0.
- the statistical analysis comprises: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; determining, for each of the plurality of heterozygous SNPs and based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele, a number of unique tumor-derived sequence reads for a second SNP allele, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele; optionally determining: a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci, and a
- estimating a degree of overdispersion in sequence read counts comprises calculating an overdispersion statistic based on the comparison of the observed counts at the plurality of heterozygous SNPs to corresponding expected sequence read counts based on fitting the multinomial model.
- estimating a degree of overdispersion in sequence read counts comprises calculating a χ² statistic and/or a residual deviance (D) statistic.
- non-sorting based method for deduplicating and tallying sequence read counts comprises: performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index.
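The two-scan idea can be sketched with a hash-based index in place of a sort. This is an illustrative sketch only: the read representation as a `(chrom, pos, strand)` tuple, the function name `mark_duplicates`, and the rule that the first read seen at a position is kept as the representative are all assumptions, not details taken from the disclosure.

```python
def mark_duplicates(reads):
    """Flag duplicate reads with two linear scans and no sorting.

    reads: list of (chrom, pos, strand) tuples for aligned reads.
    Returns a list of booleans, True where the read is a duplicate.
    """
    # First linear scan: index each genomic position by the first read seen there.
    first_seen = {}
    for i, key in enumerate(reads):
        first_seen.setdefault(key, i)
    # Second linear scan: any read that is not the representative is a duplicate.
    return [first_seen[key] != i for i, key in enumerate(reads)]

reads = [
    ("chr6", 100, "+"),
    ("chr6", 100, "+"),
    ("chr6", 150, "-"),
    ("chr6", 100, "+"),
]
flags = mark_duplicates(reads)
print(flags)                      # [False, True, False, True]
print(sum(not f for f in flags))  # 2 unique reads
```

Both scans visit each read once, so the work grows linearly with the number of reads, whereas a sort-based approach would add a logarithmic factor.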
- tumor and normal samples from the subject comprise paired tumor and normal tissue biopsy samples.
- a system comprising: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform the method of any one of embodiments 1 to 31.
- a system comprising: a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject; one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform the method of any one of embodiments 1 to 31.
- a non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform the method of any one of embodiments 1 to 31.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Organic Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Immunology (AREA)
- Wood Science & Technology (AREA)
- Zoology (AREA)
- General Engineering & Computer Science (AREA)
- Pathology (AREA)
- Microbiology (AREA)
- Biochemistry (AREA)
- Cell Biology (AREA)
- Hospice & Palliative Care (AREA)
- Oncology (AREA)
- Physiology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present disclosure relates to systems and methods for detecting HLA alterations (e.g., HLA loss of heterozygosity or HLA copy number alterations) in a sample from a subject. The disclosed methods may comprise, for example, receiving sequence read data derived from a tumor sample and a normal sample from the subject; receiving a subject-specific reference sequence for an HLA region of the subject's genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads and unique normal-derived sequence reads for each of a first allele and a second allele of an HLA gene; and detecting an HLA alteration for the HLA gene based on a tumor allelic ratio and a normal allelic ratio that are determined based on the number of unique tumor-derived sequence reads and unique normal-derived sequence reads for the first and second alleles of the HLA gene.
Description
METHODS AND SYSTEMS FOR HLA LOSS DETERMINATION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the priority benefit of United States Provisional Patent Application Serial No. 63/637,294, filed April 22, 2024, the contents of which are incorporated herein by reference in their entirety.
FIELD
[0002] The present disclosure relates generally to methods for evaluating human leukocyte antigen (HLA) loss. More specifically, this disclosure provides methods and systems for detecting HLA loss (e.g., HLA loss of heterozygosity (HLA LOH)) in a subject based on an analysis of (i) sequence read ratios for a first and second HLA allele in a tumor sample, and (ii) sequence read ratios for the first and second HLA allele in a normal sample from the same subject.
BACKGROUND
[0003] HLA genes are genes that form part of the major histocompatibility complex (MHC) - a collection of closely-linked polymorphic genes that encode the cell surface proteins involved in the adaptive immune system. They play a significant role in disease and immune defense. Non-limiting examples of HLA genes include the HLA Class I (HLA-I) genes (i.e., HLA-A, HLA-B, and HLA-C genes) and HLA Class II (HLA-II) genes (i.e., the HLA-DR, HLA-DQ, and HLA-DP genes).
[0004] HLA loss is the absence or decrease of HLA allelic presence or expression due to genetic modification, epigenetic modification, and/or indirect regulation. HLA LOH, for example, can indicate the loss of a functional HLA gene allele due to a genetic modification (also referred to as a hard modification). HLA loss (including HLA LOH) is often associated with unhealthy or diseased tissue, e.g., a tumor, and may contribute to, for example, immune system evasion by cancer and/or to therapeutic resistance on the part of cancer cells (e.g., due to the loss or reduced availability of a vehicle (e.g., an MHC complex) for presenting a therapeutic antigen (e.g., a neoantigen) on the cell surface). Loss of one HLA gene allele through deletion or mutation can contribute to immune system evasion because each allele routinely serves a different function in presenting distinct tumor-associated neoantigens to the extracellular immune surveillance system.
[0005] Existing nucleic acid sequencing-based methods for detecting HLA LOH typically rely on a copy number analysis to determine a T/N ratio of the number of tumor-derived sequence read counts (T) to the number of normal-derived sequence read counts (N) for a given HLA gene allele. Such analyses are dependent on the relative depth of sequencing of DNA extracted from tumor and normal samples, and can yield noisy data with tumor-to-normal sequence read count ratios that deviate significantly from the expected value of 1.0. Copy number-based approaches require the determination of an expected or baseline T/N value from the noisy sequence read count data for other genomic loci (e.g., genomic loci outside of the HLA region) to assess whether the observed T/N ratio for the given HLA gene allele is significantly different. The copy number-based approach to HLA LOH determination thus requires a statistical analysis to compare observed T/N ratios to the baseline T/N ratio, is highly dependent on computing the baseline accurately, and can miss clear cases of HLA loss in unambiguous cell line samples. This problem is exacerbated for whole-exome sequencing (WES) (as opposed to whole-genome sequencing (WGS)) because obtaining WES sequence reads for HLA genes is dependent upon the use of exome baits designed for the general human population, and these exome baits may work less effectively on particular HLA alleles. Furthermore, existing copy number-based methods are restricted to academic use only, and have not been commercialized or approved for clinical use. Thus, there remains a need for improved methods for detecting HLA LOH.
SUMMARY
[0006] Disclosed herein are systems and methods for detecting HLA LOH in a subject based on sequencing data (e.g., next generation sequencing data) that enable improved accuracy. The systems and methods described herein are based on an allelic ratio approach (i.e., a ratio of the number of unique sequence read counts for a first gene allele to the number of unique sequence read counts for a second gene allele) rather than a copy number approach. Rather than analyzing the T/N ratio for each HLA gene allele, the ratio (T1/T2) of the number of unique sequence read counts for HLA gene allele 1 (i.e., the first allele for a given HLA gene) in a tumor sample (T1) to the number of unique sequence read counts for HLA gene allele 2 in the tumor sample (T2), and the ratio (N1/N2) of the number of unique sequence read counts for HLA gene allele 1 in a normal sample (N1) to the number of unique sequence read counts for HLA gene allele 2 in the normal sample (N2) are analyzed. In the case of either a deletion of HLA gene allele 1 or a conversion of HLA gene allele 1 to HLA gene allele 2 in the tumor sample, the T1/T2 sequence read count ratio should decrease accordingly. Since a diploid organism is expected to have one copy of each gene allele, the expected allelic ratio is 1.0, regardless of sequencing depth. It is possible that one gene allele is more easily sequenced than another gene allele (resulting in a sequencing bias for one allele over the other), which would cause a deviation away from the expected value of 1.0. However, the two HLA gene alleles in tumor are derived from the same pair of gene alleles that occur in the normal sample, so that the sequence read count ratio for the two alleles in the normal sample can serve as a baseline or control for the sequence read count ratio in the tumor sample for each HLA gene that is free of sequencing bias artifacts. An analysis (e.g., a statistical analysis) is performed to determine if the difference between an observed T1/T2 ratio and a baseline value for a given HLA gene is statistically significant. The baseline value can be determined by measuring N1/N2 values for a plurality of non-HLA genomic loci (e.g., a plurality of single nucleotide polymorphism (SNP) loci), and fitting the N1, N2, T1, and T2 values determined for the plurality of non-HLA genomic loci (e.g., SNP loci) to a multinomial model, estimating count overdispersion, adjusting standard errors with that overdispersion, and computing a p-value for the difference between the observed T1/T2 value and the baseline.
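The bias-cancellation argument above can be made concrete with a short worked relation. This is an illustrative sketch only: the symbols b (the relative capture or sequencing efficiency of allele 1 versus allele 2) and ρ (the true allele 1 to allele 2 copy ratio in the tumor sample) are assumptions introduced here and do not appear in the disclosure.

```latex
\frac{N_1}{N_2} \approx b,
\qquad
\frac{T_1}{T_2} \approx b\,\rho
\quad\Longrightarrow\quad
\frac{T_1/T_2}{N_1/N_2} \approx \rho
```

Under this simplified picture, the allele-specific bias b affects the tumor and normal ratios equally and drops out of their quotient, so a quotient near 1 is consistent with no imbalance and a quotient well below 1 is consistent with loss of allele 1.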
[0007] The disclosed systems and methods enable improved accuracy in detecting HLA LOH (including for tumor samples of lower purity, and for WES sequence read data) due to: (i) reduced sensitivity to sequencing depth variations, (ii) reduced sensitivity to differential sequence capture for the two HLA gene alleles by bait molecules, (iii) improved alignment of sequence reads to HLA gene alleles based on the use of intron-level identifiers, and (iv) novel methods for efficient computation of allelic ratios at multiple genomic loci (e.g., SNP loci).
[0008] Disclosed herein are methods for detecting Human Leukocyte Antigen (HLA) alterations for a subject, the methods comprising: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a plurality of sequence reads derived from a normal sample from the subject (paired tumor and normal samples); receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA gene based on: a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first allele and the determined number of unique tumor-derived sequence reads for the second allele, and a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first allele and the determined number of unique normal-derived sequence reads for the second allele.
[0009] In some embodiments, the HLA alteration is an HLA loss of heterozygosity. In some embodiments, the HLA alteration is an HLA copy number change. In some embodiments, the HLA alteration is an HLA imbalance.
[0010] In some embodiments, the plurality of sequence reads is derived by sequencing nucleic acid molecules extracted from the paired tumor and normal samples using a whole exome sequencing (WES) technique.
[0011] In some embodiments, detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value. In some embodiments, the expected value is the normal allelic ratio. In some embodiments, detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of a log ratio of the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene from an expected value. In some embodiments, detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the log ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene from an expected value. In some embodiments, detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation from an expected value of a log odds ratio corresponding to the log ratio of (i) the ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene and (ii) the ratio of the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene. Performing a statistical analysis may comprise comparing a log ratio or log odds ratio to an expected value under a null hypothesis, wherein said comparison comprises computing a test statistic and/or comparing a confidence interval around said log ratio or log odds ratio to a predetermined threshold associated with a null hypothesis. In some embodiments, the null hypothesis corresponds to a log ratio and/or a log odds ratio of 0 or a log ratio corresponding to a baseline estimate. A baseline estimate may be obtained using counts from heterozygous SNPs at non-HLA loci. In some embodiments, an expected value for a log ratio corresponds to an absence of imbalance between the tumor and normal samples. In some embodiments, an expected value for the log odds ratio corresponds to a lack of HLA allelic imbalance. In some embodiments, the expected value for the log odds ratio corresponds to a log odds ratio of 0.
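The log ratios and the log odds ratio described in this paragraph can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name, the optional pseudocount, and the use of the textbook Wald standard error for a 2 x 2 table (the square root of the sum of reciprocal counts) are assumptions made here for illustration.

```python
import math

def log_ratio_stats(t1, t2, n1, n2, pseudocount=0.5):
    """Return (first allele log ratio, second allele log ratio,
    log odds ratio, unadjusted standard error of the log odds ratio).

    t1, t2: unique tumor-derived read counts for alleles 1 and 2.
    n1, n2: unique normal-derived read counts for alleles 1 and 2.
    pseudocount: hypothetical continuity correction (an assumption) that
    keeps the logarithms finite when a count is zero.
    """
    t1, t2, n1, n2 = (x + pseudocount for x in (t1, t2, n1, n2))
    log_ratio_1 = math.log(t1 / n1)             # log(T1/N1)
    log_ratio_2 = math.log(t2 / n2)             # log(T2/N2)
    log_odds = math.log((t2 / t1) / (n2 / n1))  # log((T2/T1)/(N2/N1))
    # Standard large-sample (Wald) standard error for a log odds ratio.
    se = math.sqrt(1 / t1 + 1 / t2 + 1 / n1 + 1 / n2)
    return log_ratio_1, log_ratio_2, log_odds, se

# Example with made-up counts: allele 1 drops by half in the tumor sample.
lr1, lr2, lor, se = log_ratio_stats(50, 100, 100, 100, pseudocount=0.0)
```

Note that the log odds ratio equals the difference of the two per-allele log ratios, log(T2/N2) - log(T1/N1), so a value of 0 corresponds to the no-imbalance null hypothesis described above.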
[0012] In some embodiments, the statistical analysis comprises: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; determining, based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele and a number of unique tumor-derived sequence reads for a second SNP allele for each of one or more of the plurality of heterozygous SNP loci; determining, based on the sequence read data, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele for each of one or more of the plurality of heterozygous SNP loci; and estimating a degree of overdispersion in sequence read counts based on fitting the number of unique normal-derived sequence reads for the first SNP allele, the number of unique normal-derived sequence reads for the second SNP allele, the number of unique tumor-derived sequence reads for the first SNP allele, and the number of unique tumor-derived sequence reads for the second SNP allele determined for each of the plurality of heterozygous SNP loci to a multinomial model. In embodiments, the method further comprises determining: a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci, and a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci. In some embodiments, the method further comprises determining a baseline log ratio for each of the first and second SNP alleles of each of the plurality of heterozygous SNP loci.
[0013] In embodiments, estimating a degree of overdispersion in sequence read counts comprises calculating an overdispersion statistic based on the comparison of the observed counts at the plurality of heterozygous SNPs to corresponding expected sequence read counts based on fitting the multinomial model. In embodiments, calculating an overdispersion statistic comprises calculating a χ² statistic. In embodiments, calculating an overdispersion statistic comprises calculating a residual deviance (D) statistic. In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log ratio of the tumor allelic ratio and normal allelic ratio. In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log ratio of the second allele ratio (T2/N2) and first allele ratio (T1/N1). In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a log of the product of the tumor allelic ratio and the inverse of the normal allelic ratio (log(T2/T1*N1/N2)) or the difference between the second allele log ratio and the first allele log ratio (log(T2/N2)-log(T1/N1)). In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a first allele log ratio (log(T1/N1)). In some embodiments, the method further comprises using the estimated degree of overdispersion to adjust an estimated standard error for a second allele log ratio (log(T2/N2)).
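One way the χ²-based overdispersion estimate and standard error adjustment could look is sketched below. This is an illustration under stated assumptions, not the disclosed implementation: it assumes the fitted multinomial model is the independence (no allelic imbalance) model for each per-locus 2 x 2 table, that each locus contributes one degree of freedom, and that the standard error is inflated by the square root of the dispersion factor, as in a quasi-likelihood adjustment. All function names are hypothetical.

```python
import math

def pearson_chi2_2x2(n1, n2, t1, t2):
    """Pearson chi-square for one locus against the independence model."""
    total = n1 + n2 + t1 + t2
    rows = (n1 + n2, t1 + t2)   # normal total, tumor total
    cols = (n1 + t1, n2 + t2)   # allele 1 total, allele 2 total
    observed = ((n1, n2), (t1, t2))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def estimate_overdispersion(snp_counts):
    """Dispersion factor: summed chi-square over degrees of freedom.

    snp_counts: iterable of (n1, n2, t1, t2) tuples, one per heterozygous
    non-HLA SNP locus. Loci with an empty row or column are skipped because
    their expected counts would be zero (an assumption of this sketch).
    """
    usable = [c for c in snp_counts
              if min(c[0] + c[1], c[2] + c[3], c[0] + c[2], c[1] + c[3]) > 0]
    if not usable:
        raise ValueError("no usable SNP loci")
    chi2_total = sum(pearson_chi2_2x2(*c) for c in usable)
    dof = len(usable)  # one degree of freedom per 2 x 2 table
    # Never deflate the standard error below the multinomial expectation.
    return max(1.0, chi2_total / dof)

def adjust_standard_error(se, dispersion):
    """Inflate a model-based standard error by sqrt(dispersion)."""
    return se * math.sqrt(dispersion)
```

A dispersion factor near 1 suggests the counts behave as the multinomial model predicts; larger values widen the confidence interval around the HLA log odds ratio accordingly.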
[0014] In some embodiments, the method further comprises using the adjusted standard error for the log odds ratio to adjust a p-value or confidence interval for the detected HLA alteration. In some embodiments, the method comprises using the adjusted standard error for the log odds ratio to determine a p-value for a test statistic associated with a test that the observed log odds ratio is different from a null hypothesis associated with absence of HLA imbalance. In some embodiments, the method comprises using the adjusted standard error for the log odds ratio to determine a confidence interval for the log odds ratio. In some embodiments, the method comprises using the adjusted standard error for the log first ratio to determine a confidence interval for the log odds ratio. In some embodiments, the method comprises using the adjusted standard error for the first allele log ratio to determine a p-value for a test statistic associated with a test that the observed log ratio is different from a null hypothesis associated with absence of difference between the tumor and normal counts at the first allele. In some embodiments, the method comprises using the adjusted standard error for the second allele log ratio to determine a p-value for a test statistic associated with a test that the observed log ratio is different from a null hypothesis associated with absence of difference between the tumor and normal counts at the second allele. In some embodiments, the method comprises using the adjusted standard error for the first allele log ratio to determine a confidence interval around the first allele log ratio. In some embodiments, the method comprises using the adjusted standard error for the second allele log ratio to determine a confidence interval around the second allele log ratio.
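Given a log ratio or log odds ratio and its overdispersion-adjusted standard error, a p-value and confidence interval could be obtained as in the sketch below. The choice of a two-sided normal (Wald) test, the function name, and the default 95% confidence level are assumptions for illustration; the disclosure does not specify a particular test.

```python
from statistics import NormalDist

def wald_test(estimate, adjusted_se, null_value=0.0, confidence=0.95):
    """Two-sided Wald test and confidence interval for a log (odds) ratio.

    estimate: observed log ratio or log odds ratio.
    adjusted_se: standard error after the overdispersion adjustment.
    null_value: value under the null hypothesis (0 for no imbalance, or a
    baseline estimate derived from non-HLA SNP loci).
    Returns (z statistic, p-value, (lower bound, upper bound)).
    """
    if adjusted_se <= 0:
        raise ValueError("standard error must be positive")
    standard_normal = NormalDist()
    z = (estimate - null_value) / adjusted_se
    p_value = 2.0 * (1.0 - standard_normal.cdf(abs(z)))
    z_crit = standard_normal.inv_cdf(0.5 + confidence / 2.0)
    interval = (estimate - z_crit * adjusted_se,
                estimate + z_crit * adjusted_se)
    return z, p_value, interval
```

An HLA alteration call could then be made when the p-value falls below a chosen threshold or, equivalently, when the confidence interval excludes the null value.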
[0015] In some embodiments, the plurality of heterozygous SNP loci comprises at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci. In some embodiments, the plurality of heterozygous SNP loci are selected from a predetermined set of known SNPs. In some embodiments, the plurality of heterozygous SNP loci are selected from a predetermined set of known SNPs for which there is at least a predetermined number of reads in the normal sample.
[0016] In some embodiments, the plurality of heterozygous SNP loci is filtered to remove artifactual SNP loci resulting from misalignment of sequence reads to the non-HLA region of the subject’s genome. In some embodiments, the plurality of heterozygous SNP loci are detected by aligning the sequence read data to a reference sequence using a SNP tolerant alignment method. In some embodiments, the SNP tolerant alignment method uses a predetermined set of known SNPs.
[0017] In some embodiments, the plurality of heterozygous SNP loci are detected using a non-sorting-based method for de-duplicating and tallying sequence read counts. In some embodiments, the non-sorting based method for de-duplicating and tallying sequence read counts comprises: performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index.
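The two-scan, non-sorting approach described above can be sketched with a hash-based index. This is a simplified illustration, not the disclosed implementation: the `AlignedRead` record, the choice of (chromosome, start, strand, allele) as the duplicate key, and the rule of keeping the first read seen at each key are all assumptions made here.

```python
from collections import Counter
from typing import NamedTuple

class AlignedRead(NamedTuple):
    """Hypothetical minimal record for an aligned read."""
    read_id: str
    chrom: str
    start: int
    strand: str
    allele: str

def tally_unique_reads(reads):
    """Count unique reads per allele using two linear scans and no sorting.

    reads: a list of AlignedRead records with distinct read_id values.
    Returns (Counter of unique reads per allele, list of duplicate read ids).
    """
    # First scan: index each genomic position by the first read seen there.
    index = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand, read.allele)
        index.setdefault(key, read.read_id)

    # Second scan: a read is a duplicate if another read owns its position.
    counts = Counter()
    duplicates = []
    for read in reads:
        key = (read.chrom, read.start, read.strand, read.allele)
        if index[key] == read.read_id:
            counts[read.allele] += 1
        else:
            duplicates.append(read.read_id)
    return counts, duplicates

reads = [
    AlignedRead("r1", "chr6", 100, "+", "A"),
    AlignedRead("r2", "chr6", 100, "+", "A"),  # same position as r1
    AlignedRead("r3", "chr6", 100, "+", "B"),
    AlignedRead("r4", "chr6", 200, "-", "A"),
]
counts, duplicates = tally_unique_reads(reads)
```

Because both scans are single passes over the reads with constant-time dictionary lookups, the running time grows linearly with the number of reads, whereas a sort-based approach grows as n log n.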
[0018] In some embodiments, the at least one HLA gene comprises HLA-A, HLA-B, HLA- C, HLA-DR, HLA-DQ, HLA-DP, or any combination thereof.
[0019] In some embodiments, the subject-specific reference sequence for the HLA region of the subject’s genome is generated by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to the HLA region of a reference genome sequence.
[0020] In some embodiments, the distribution of sequence reads aligned to the HLA region of the reference genome includes sequence reads that align to exons in the HLA region of the reference genome sequence. In some embodiments, the distribution of sequence reads aligned to the HLA region of the reference genome sequence includes sequence reads that partially align to introns of the HLA region.
[0021] In some embodiments, detecting an HLA loss of heterozygosity for the at least one HLA gene does not require a determination of copy number for the at least one HLA gene.
[0022] In some embodiments, the paired tumor and normal samples comprise paired tumor and normal surgical resection samples. In some embodiments, the paired tumor and normal samples comprise paired tumor and normal tissue biopsy samples.
[0023] In some embodiments, the method further comprises diagnosing or confirming a diagnosis of a disease based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises identifying the subject for treatment of a disease based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises identifying a treatment for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises predicting a clinical outcome for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the method further comprises identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA alteration for the at least one HLA gene. In some embodiments, the disease is a cancer.
[0024] The methods described herein are computer-implemented unless indicated otherwise. Disclosed herein are systems comprising: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform any of the methods described herein.
[0025] Disclosed herein are systems comprising: a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject; one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform any of the methods described herein.
[0026] Disclosed herein are non-transitory computer-readable storage media storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform any of the methods described herein.
[0027] The terms and expressions which have been employed are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding any equivalents of the features shown and described or portions thereof, but it is recognized that various modifications are possible within the scope of the invention claimed. Thus, it should be understood that although the present invention as claimed has been disclosed by specific, exemplary implementations and optional features, modification and variation of the concepts herein disclosed can be resorted to by those skilled in the art, and that such modifications and variations are considered to be within the scope of this invention as defined by the appended claims.
INCORPORATION BY REFERENCE
[0028] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference in their entirety to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference in its entirety. In the event of a conflict between a term herein and a term in an incorporated reference, the term herein controls.
BRIEF DESCRIPTION OF THE DRAWINGS
[0029] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
[0030] The present disclosure is described in conjunction with the appended figures:
[0031] FIG. 1 provides a schematic diagram illustrating a non-limiting example of an evaluation system for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein.
[0032] FIG. 2 provides a non-limiting example of a process for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein.
[0033] FIG. 3 provides a schematic diagram illustrating HLA loss of heterozygosity.
[0034] FIG. 4 provides a flow diagram illustrating an example of a process for quantifying allele-read alignments for HLA genes, in accordance with one or more implementations of the systems and methods disclosed herein.
[0035] FIGS. 5A-5D provide non-limiting examples of WES sequence read alignments for the HLA B*51:01:01:01 allele (Allele 1) and the HLA B*07:02:01:01 allele (Allele 2) for paired normal and tumor samples, where the tumor sample exhibits a loss of heterozygosity. FIG. 5A provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus in the normal sample. FIG. 5B provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus in the normal sample. FIG. 5C provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus in the tumor sample. FIG. 5D provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus in the tumor sample.
[0036] FIGS. 6A-6B provide non-limiting schematic illustrations that compare prior copy number-based methods for detecting HLA LOH (FIG. 6A) to the allelic ratio-based method (FIG. 6B) described herein.
[0037] FIGS. 7A-7D provide schematic illustrations of 2 x 2 contingency tables for sequence read data (e.g., HLA sequence read data) that might be expected for different genetic modifications to a tumor genome. FIG. 7A provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in normal and tumor samples. FIG. 7B provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a double deletion. FIG. 7C provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a single deletion. FIG. 7D provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a copy neutral loss of heterozygosity.
[0038] FIGS. 8A-8C provide non-limiting examples of 2 x 2 contingency tables for HLA sequence read counts that were observed for cells from ovarian cell line 23882 that exhibited different genetic modifications in a tumor genome. FIG. 8A provides sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA A*26:01 allele. FIG. 8B provides sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA B*35:01 allele. FIG. 8C provides sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA C*04:01 allele.
[0039] FIGS. 9A-9B provide non-limiting schematic illustrations of the HLA (FIG. 9A) and non-HLA (FIG. 9B) genomic loci used to determine sequence read counts and evaluate allelic ratios for a subject, which are then used as input for a statistical model used to detect HLA loss of heterozygosity in the subject, in accordance with one or more implementations of the systems and methods disclosed herein.
[0040] FIG. 10 provides a non-limiting example of a process flowchart for detecting an HLA alteration, in accordance with one or more implementations of the systems and methods disclosed herein.
[0041] FIG. 11 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the NCH672 chr1 region.
[0042] FIG. 12 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the B34996 chr1 region.
[0043] FIG. 13 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the Pirn 1603 chr6 region.
[0044] FIG. 14 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the NCH672 chr6 region.
[0045] FIG. 15 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the B23882 chr6 region.
[0046] FIG. 16 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the B34996 chr6 region.
[0047] FIG. 17 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of genomic position in the NCI20009 chr6 region.
[0048] FIGS. 18A-18F provide a non-limiting example of sensitivity analysis data for detection of HLA LOH based on allelic ratio data, in accordance with one or more implementations of the systems and methods disclosed herein. FIG. 18A provides tumor/normal coverage data for a 10% tumor sample. FIG. 18B provides tumor/normal coverage data for a 20% tumor sample. FIG. 18C provides tumor/normal coverage data for a 30% tumor sample. FIG. 18D provides tumor/normal coverage data for a 40% tumor sample. FIG. 18E provides tumor/normal coverage data for a 50% tumor sample. FIG. 18F provides tumor/normal coverage data for a 100% tumor sample.
[0049] FIG. 19 provides a non-limiting example of a block diagram of a computer system, in accordance with one or more implementations of the systems and methods disclosed herein.
[0050] FIGS. 20A-20D provide non-limiting examples of read counts obtained for tumor and normal samples at polymorphic loci using a standard reference-based alignment method (FIG. 20A and FIG. 20C) and a SNP-tolerant alignment method (FIG. 20B and FIG. 20D).
[0051] FIGS. 21A-21B provide a non-limiting example of read counts (as a fraction of total counts at the locus) on chromosome 6 (comprising the HLA locus) in a sample not subject to HLA loss (FIG. 21A) and corresponding statistics obtained using a method disclosed herein (FIG. 21B). FIGS. 21C-21D provide a non-limiting example of read counts (as a fraction of total counts) on chromosome 6 (comprising the HLA locus) in a sample subject to HLA loss (FIG. 21C) and corresponding statistics obtained using a method disclosed herein (FIG. 21D).
[0052] FIG. 22 provides a non-limiting example of the number of known HLA LOH events that can be detected in synthetic cell line mixtures with decreasing tumor purities, using methods of the disclosure.
DETAILED DESCRIPTION
I. Overview
[0053] Systems and methods for detecting HLA alterations (e.g., HLA allelic imbalance, HLA-LOH and/or HLA copy number alterations) in a subject based on sequencing data (e.g., next generation sequencing data) that enable improved accuracy are described. Detecting HLA loss (e.g., through HLA loss of heterozygosity) and/or gain (e.g., via HLA copy number
alteration) is important to developing and/or managing certain types of immunotherapies including, but not limited to, cancer immunotherapy. For example, a tumor can use various mechanisms (e.g., genetic modification, epigenetic modification, or indirect regulation) to cause HLA loss that enables the tumor to escape or evade therapy and have a selective advantage. A particular immunotherapy (e.g., a T cell therapy, a cancer vaccine) may be ineffective and unable to recognize a tumor-specific antigen (e.g., a neoantigen) and thereby fail to activate an immune response when there is HLA loss. For example, a tumor cell that does not have or that has a reduced expression of a particular HLA allele or HLA alleles may not be recognized or killed by T cells that are reactive to a given antigen on that tumor cell. Conversely, a tumor cell that does not have or that has a reduced expression of a particular HLA allele or HLA alleles may be more likely to be recognized or killed by NK cells. Thus, detecting HLA loss in, for example, a subject's tumor can facilitate development and/or personalization of immunotherapies including, but not limited to, T-cell therapies, vaccines and/or natural killer (NK) cell therapies. In some instances, the disclosed methods can further comprise: (i) diagnosing or confirming a diagnosis of a disease, (ii) identifying the subject for treatment of a disease (or identifying the subject as a candidate for treatment of a disease), (iii) identifying a treatment for a disease with which the subject has been diagnosed, (iv) predicting a clinical outcome for a disease with which the subject has been diagnosed, (v) designing or manufacturing a personalized treatment for the subject, or (vi) identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA loss (e.g., a detected loss of HLA heterozygosity) for at least one HLA gene. In some instances, for example, the disease can be a cancer.
[0054] The systems and methods described herein are based on an allelic ratio approach (i.e., a ratio of the number of unique sequence read counts for a first gene allele to the number of unique sequence read counts for a second gene allele) rather than a copy number approach. The allelic ratio approach comprises analyzing the ratio (T1/T2) of the number of unique sequence read counts for HLA gene allele 1 in a tumor sample (T1) to the number of unique sequence read counts for HLA gene allele 2 in the tumor sample (T2), and the ratio (N1/N2) of the number of unique sequence read counts for HLA gene allele 1 in a normal sample (N1) to the number of unique sequence read counts for HLA gene allele 2 in the normal sample (N2). In the case of either a deletion of HLA gene allele 1 or a conversion of HLA gene allele 1 to HLA gene allele 2 in the tumor sample (e.g., copy-neutral LOH), the T1/T2 sequence read count ratio should decrease accordingly. Since a diploid organism is expected to have one copy of
each gene allele, the expected allelic ratio N1/N2 is 1.0, regardless of sequencing depth. It is possible that one gene allele is more easily sequenced than another gene allele (resulting in a sequencing bias for one allele over the other), which would cause a deviation away from the expected value of 1.0 for N1/N2. The two HLA gene alleles in the tumor are derived from the same pair of gene alleles that occur in the normal sample, so that the sequence read count ratio for the two alleles in the normal sample can serve as a baseline or control for the sequence read count ratio in the tumor sample for each HLA gene that is free of sequencing bias artifacts. An analysis (e.g., statistical analysis) is performed to determine if the difference between an observed T1/T2 ratio and a baseline value N1/N2 for a given HLA gene is statistically significant. The statistical significance can be evaluated using a confidence interval based on the estimated variability around a statistic that captures this difference. The confidence interval can be determined at least in part by measuring N1/N2 values for a plurality of non-HLA genomic loci (e.g., a plurality of single nucleotide polymorphism (SNP) loci), and fitting the N1, N2, T1, and T2 values for the plurality of non-HLA genomic loci (e.g., SNP loci) to a multinomial model, estimating a degree of overdispersion in the counts observed at the plurality of non-HLA genomic loci, adjusting standard errors for estimates of log ratios and/or log odds ratios based on the estimated degree of overdispersion, and computing a p-value for the difference between the observed ratios and a baseline assuming no allelic imbalance and/or a confidence interval estimate for said ratios using the adjusted standard error estimates.
The disclosed systems and methods enable improved accuracy in detecting HLA LOH (including for tumor samples of lower purity, and for WES sequence read data) due to: (i) reduced sensitivity to sequencing depth variations, and (ii) reduced sensitivity to differential sequence capture for the two HLA gene alleles by bait molecules. Embodiments of the methods described herein further improve on this through one or more of: (iii) improved alignment of sequence reads to HLA gene alleles based on the use of intron-level identifiers, (iv) improved accuracy of quantification of read counts at polymorphic loci in the HLA locus and outside of it due to the use of SNP-tolerant read alignment, and (v) novel methods for efficient computation of allelic ratios at multiple genomic loci (e.g., SNP loci).
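As an illustration of the comparison described in paragraph [0054], the following is a minimal sketch, not the disclosed implementation. It assumes the difference between T1/T2 and N1/N2 is captured as a log odds ratio, that its standard error follows the usual large-sample formula for a 2x2 table of counts, and that overdispersion is applied as a multiplicative factor on the variance. The function name, the `dispersion` parameter, and the normal approximation for the p-value are assumptions made for illustration.

```python
import math


def allelic_log_odds_ratio(t1, t2, n1, n2, dispersion=1.0):
    """Compare the tumor ratio T1/T2 against the normal baseline N1/N2.

    Hypothetical helper: the statistic, the standard error formula, and the
    way `dispersion` enters are illustrative assumptions, not taken from
    the source text.
    """
    if min(t1, t2, n1, n2) <= 0:
        raise ValueError("all four read counts must be positive")
    if dispersion <= 0:
        raise ValueError("dispersion must be positive")

    # log((T1/T2) / (N1/N2)); zero means the tumor matches the baseline.
    log_or = math.log(t1 / t2) - math.log(n1 / n2)

    # Large-sample standard error of a log odds ratio, with the variance
    # inflated by the overdispersion factor.
    se = math.sqrt(dispersion * (1 / t1 + 1 / t2 + 1 / n1 + 1 / n2))

    z = log_or / se
    # Two-sided p-value under a standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))

    half_width = 1.959963984540054 * se  # 95% confidence interval
    return {
        "log_or": log_or,
        "se": se,
        "z": z,
        "p": p_value,
        "ci": (log_or - half_width, log_or + half_width),
    }


# Example: allele 1 appears depleted in the tumor relative to the normal.
result = allelic_log_odds_ratio(t1=20, t2=100, n1=100, n2=100)
print(result["log_or"], result["ci"], result["p"])
```

Because the normal ratio N1/N2 enters the statistic directly, an allele-specific capture or sequencing bias that affects both samples equally cancels out, which is the motivation given above for using the normal sample as a baseline.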
[0055] The present disclosure provides various systems, methods, and non-transitory computer readable media for evaluating HLA loss (e.g., HLA LOH or HLA copy number changes) in a sample obtained from a subject as compared to another sample. For example, the methods described herein enable analyzing HLA gene alleles in two samples to estimate whether there is any HLA loss of heterozygosity between the two samples. In various
examples, these two samples include a normal or otherwise healthy sample and a tumor or otherwise unhealthy or diseased sample. The former may be referred to herein as the “normal” sample for simplicity, with associated read counts at any heterozygous locus or HLA gene allele labelled as N1, N2. The latter may be referred to herein as the “tumor” sample for simplicity, with associated read counts at any heterozygous locus or HLA gene allele labelled as T1, T2. However, this terminology refers generally to a sample comprising cells or genetic material derived therefrom that can be assumed to not be subject to HLA alteration (“normal” sample) and a related sample comprising cells or genetic material derived therefrom that have an unknown status in relation to HLA alteration (“tumor” sample).
[0056] In some instances, disclosed systems and methods can utilize sequence read data for HLA gene alleles that are typed at an exon level or intron level of resolution (possible even for whole exome sequencing data because whole exome sequencing techniques also capture information about adjacent introns). For example, HLA alleles can be typed based on variances (e.g., polymorphisms) within the exon regions of the HLA alleles or based on variances (e.g., polymorphisms) within the intron regions of the HLA alleles. Using exon-resolution or intron-resolution identifiers for an HLA allele allows the allele sequence for that HLA gene to be more accurately identified, which can enable improved alignments between HLA gene alleles and sequence reads obtained from nucleic acid sequencing. Examples of methods for performing exon-resolution or intron-resolution HLA typing are described in, for example, PCT International Patent Application Publication No. WO 2022/192304, which is incorporated herein by reference in its entirety. Alternative methods include, e.g., Optitype (Szolek et al. Bioinformatics, Volume 30, Issue 23, December 2014, Pages 3310-3316) or combinations of Optitype and the methods described in WO 2022/192304.
[0057] In some instances, for example, the disclosed methods (e.g., computer-implemented methods) can comprise: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from a subject; receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived
sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA gene based on: (i) a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first allele and the determined number of unique tumor-derived sequence reads for the second allele, and (ii) a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first allele and the determined number of unique normal-derived sequence reads for the second allele. In some instances, the HLA alteration can be an HLA loss of heterozygosity. In some instances, the HLA alteration can be an HLA copy number change. In some instances, the HLA alteration can be an allelic imbalance.
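The counting step in paragraph [0057] can be sketched as follows. This is a hypothetical illustration only: the text here does not define "unique", so the sketch assumes a read is counted once per read identifier and only if all of its alignments point to a single allele of the subject-specific reference. The input layout (pairs of read identifier and allele name) and the function name are likewise assumptions.

```python
from collections import defaultdict


def count_allele_specific_reads(alignments, allele_1, allele_2):
    """Count reads assigned unambiguously to each of two alleles.

    `alignments` is an iterable of (read_id, allele_name) pairs, a
    hypothetical simplification of real alignment records.
    """
    alleles_by_read = defaultdict(set)
    for read_id, allele in alignments:
        alleles_by_read[read_id].add(allele)

    counts = {allele_1: 0, allele_2: 0}
    for alleles in alleles_by_read.values():
        # Assumption: reads aligning to both alleles are uninformative
        # for the allelic ratio and are skipped.
        if len(alleles) == 1:
            (allele,) = alleles
            if allele in counts:
                counts[allele] += 1
    return counts[allele_1], counts[allele_2]


# Toy tumor alignments against two alleles of one HLA gene.
tumor_alignments = [
    ("r1", "A*01:01"),
    ("r1", "A*01:01"),  # duplicate record for the same read
    ("r2", "A*02:01"),
    ("r3", "A*01:01"),
    ("r3", "A*02:01"),  # ambiguous read, ignored
    ("r4", "A*01:01"),
]
t1, t2 = count_allele_specific_reads(tumor_alignments, "A*01:01", "A*02:01")
print(t1, t2, t1 / t2)
```

Running the same function on the normal-sample alignments would give N1 and N2, from which the normal allelic ratio follows in the same way.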
[0058] In some instances, detecting an HLA alteration for the at least one HLA gene can comprise performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value (e.g., a value of 1). In some instances, the statistical analysis can comprise: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; determining, based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele and a number of unique tumor-derived sequence reads for a second SNP allele for each of the plurality of heterozygous SNP loci; determining, based on the sequence read data, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele for each of the plurality of heterozygous SNP loci; optionally determining: a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci, and a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci; and estimating a degree of overdispersion in sequence read counts based on fitting the number of unique normal-derived sequence reads for the first SNP allele, the number of unique normal-derived sequence reads for the second SNP allele, the number of unique tumor-derived sequence reads for the first SNP allele, and the number of unique tumor-derived sequence reads for the second SNP allele determined for each of the plurality of heterozygous SNP loci to a multinomial model.
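One way the overdispersion estimate in paragraph [0058] might be computed is sketched below. The text states only that the four counts at each SNP locus are fitted to a multinomial model; the specific estimator used here (the mean Pearson chi-square statistic of a 2x2 independence fit per locus, floored at 1) is a common quasi-likelihood choice that has been swapped in for illustration and is not confirmed by the source. Function names and the flooring are assumptions.

```python
def pearson_chi2_2x2(t1, t2, n1, n2):
    """Pearson chi-square for the table [[T1, T2], [N1, N2]].

    Expected counts come from a multinomial model in which sample
    (tumor/normal) and allele are independent, i.e. no allelic imbalance
    in the tumor relative to the normal.
    """
    observed = ((t1, t2), (n1, n2))
    rows = (t1 + t2, n1 + n2)
    cols = (t1 + n1, t2 + n2)
    total = rows[0] + rows[1]
    if min(rows) <= 0 or min(cols) <= 0:
        raise ValueError("each sample and each allele needs at least one read")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def estimate_dispersion(snp_counts):
    """Estimate an overdispersion factor from non-HLA heterozygous SNP loci.

    `snp_counts` is a sequence of (T1, T2, N1, N2) tuples, one per locus.
    Each locus contributes one degree of freedom, so the mean statistic is
    about 1 when the counts follow the multinomial model exactly.
    """
    if not snp_counts:
        raise ValueError("at least one SNP locus is required")
    stats = [pearson_chi2_2x2(*counts) for counts in snp_counts]
    # Assumption: never shrink standard errors below the multinomial case.
    return max(1.0, sum(stats) / len(stats))


loci = [(48, 52, 101, 99), (60, 40, 90, 110), (35, 65, 120, 80)]
print(estimate_dispersion(loci))
```

The resulting factor can then be used to widen the standard error of the log ratio or log odds ratio computed for an HLA gene, so that noise seen genome-wide at ordinary heterozygous SNPs is not mistaken for HLA allelic imbalance.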
[0059] The systems and methods described herein may be used in various ways to develop and/or personalize immunotherapies such as T cell therapies or cancer vaccines. For example, if HLA loss is determined to be present in a tumor sample (e.g., a sample of tumor tissue) for one HLA allele, a T cell therapy or cancer vaccine may be designed to be reactive to or include an antigen presented by another HLA allele in that patient for which HLA loss is not detected. This type of HLA loss detection may be crucial to providing effective and timely therapy to individual subjects.
II. Terminology
[0060] Unless otherwise defined, scientific and technical terms used in connection with the present teachings described herein shall have the meanings that are commonly understood by those of ordinary skill in the art. Further, unless otherwise required by context, singular terms shall include pluralities and plural terms shall include the singular. Generally, nomenclatures utilized in connection with, and techniques of, chemistry, biochemistry, molecular biology, pharmacology, and toxicology as described herein are those well-known and commonly used in the art.
[0061] As used herein, “substantially” means sufficient to work for the intended purpose. The term "substantially" thus allows for minor, insignificant variations from an absolute or perfect state, dimension, measurement, result, or the like such as would be expected by a person of ordinary skill in the field but that do not appreciably affect overall performance. When used with respect to numerical values or parameters or characteristics that can be expressed as numerical values, "substantially" means within ten percent.
[0062] As used herein the term “ones” means more than one.
[0063] As used herein, the term “plurality” can be 2, 3, 4, 5, 6, 7, 8, 9, 10, or more. As used herein, the term “set of” means one or more. For example, a set of items includes one or more items.
[0064] As used herein, the phrase “at least one of”, when used with a list of items, means different combinations of one or more of the listed items may be used and only one of the items in the list may be needed. The item may be a particular object, thing, step, operation, process, or category. In other words, “at least one of” means any combination of items or number of items may be used from the list, but not all of the items in the list may be required. For example, without limitation, “at least one of item A, item B, or item C” means item A; item A and item B; item B; item A, item B, and item C; item B and item C; or item A and item C. In some cases, “at least one of item A, item B, or item C” means, but is not limited to, two of item A, one of item B, and ten of item C; four of item B and seven of item C; or some other suitable combination.
[0065] Where reference is made to a list of elements (e.g., elements a, b, c), such reference is intended to include any one of the listed elements by itself, any combination of less than all of the listed elements, and/or a combination of all of the listed elements.
[0066] As used herein, a “model” includes at least one of an algorithm, a formula, a mathematical technique, a machine algorithm, a probability distribution or model, or another type of mathematical or statistical representation.
[0067] As used herein, a “subject” may refer to a mammal being assessed for treatment and/or being treated, a mammal participating in a clinical trial, a mammal undergoing anticancer therapies, or any other mammal of interest. In various embodiments, the terms “subject”, “individual”, and “patient” are used interchangeably herein. A subject can be a healthy or asymptomatic individual, an individual that has or is suspected of having a disease (e.g., cancer) or a pre-disposition to the disease, an individual that is in need of therapy or suspected of needing therapy, or a combination thereof. A subject may be, for example, without limitation, an individual having cancer or an individual having an autoimmune disease. A subject may be human. In other cases, a subject may be a non-human mammal. For example, a subject may be a mammal used in forming laboratory models for human disease. Such mammals include, but are not limited to, model animals, mice, rats, primates (e.g., cynomolgus monkey), etc.
[0068] As used herein, a “sample” can refer to a “biological sample” of a subject. A sample can include tissue (e.g., a biopsy), a single cell, multiple cells, fragments of cells, or an aliquot of body fluid. The sample may have been taken from a subject by means including venipuncture, excretion, ejaculation, massage, biopsy, needle aspirate, lavage sample, scraping, surgical incision, or intervention or other means known in the art.
[0069] As used herein, a “nucleotide” comprises a nucleoside and a phosphate group.
[0070] A “nucleoside” as used herein comprises a nucleobase and a five-carbon sugar (e.g., ribose, deoxyribose, or analogs thereof). When the nucleobase is bonded to ribose, the nucleoside may be referred to as a ribonucleoside. When the nucleobase is bonded to deoxyribose, the nucleoside may be referred to as a deoxyribonucleoside. A “nucleobase”
which may be also referred to as a “nitrogenous base”, can take the form of one of five types: adenine (A), guanine (G), thymine (T), uracil (U), and cytosine (C).
[0071] As used herein, a “polynucleotide”, “nucleic acid”, or “oligonucleotide” refers to a linear polymer of nucleotides (or nucleosides joined by internucleosidic linkages). Generally, a polynucleotide comprises at least three nucleotides. Generally, an oligonucleotide is comprised of nucleotides that range in number from a few nucleotides (or monomeric units) to several hundreds of nucleotides (monomeric units). Whenever a polynucleotide such as an oligonucleotide is represented by a sequence of letters, such as “ATGCCTG”, it will be understood that the nucleotides are in 5' → 3' order or direction from left to right and that “A” denotes adenine, “C” denotes cytosine, “G” denotes guanine, and “T” denotes thymine, unless otherwise noted. The letters A, C, G, and T may be used to refer to the nucleobases themselves, as described above, the nucleosides that include those nucleobases, or the nucleotides that include those bases, as is standard in the art.
[0072] Deoxyribonucleic acid (DNA) is a chain of nucleotides consisting of 4 types of nucleotides: adenine (A), thymine (T), cytosine (C), and guanine (G). Ribonucleic acid (RNA) is comprised of 4 types of nucleotides: A, C, G, and uracil (U). Certain pairs of nucleotides specifically bind to one another in a complementary fashion, which may be referred to as complementary base pairing. For example, C pairs with G and A pairs with T. In the case of RNA, however, A pairs with U. When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand. As used herein, “nucleic acid sequencing data”, “nucleic acid sequencing information”, “nucleic acid sequence”, “genomic sequence”, “genetic sequence”, “fragment sequence”, “sequence reads”, or “nucleic acid sequencing read” denote any information or data that is indicative of the order of the nucleotide bases (e.g., A, C, G, T/U) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA. The sequence reads may be DNA sequence reads. It should be understood that the present disclosure contemplates that this sequence information may be obtained using any of the available varieties of techniques, platforms, or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic-based systems, etc., or a combination thereof.
[0073] The term “genome”, as used herein, refers to the genetic material of a cell or organism, including animals, such as mammals (e.g., humans), and comprises nucleic acids, such as DNA. A genome is stored on one or more chromosomes comprised of DNA sequences. In humans, DNA includes, for example, genes, noncoding DNA, and mitochondrial DNA. The human genome typically contains 23 pairs of chromosomes: 22 pairs of autosomal chromosomes (autosomes) plus the sex-determining X and Y chromosomes. The 23 pairs of chromosomes include one copy from each parent. The DNA that makes up the chromosomes is referred to as chromosomal DNA and is present in the nucleus of human cells (nuclear DNA).
[0074] As used herein, a “gene” is a discrete portion of heritable, genomic sequence that affects a subject's traits by being expressed as a functional product or by regulation of gene expression. The total complement of genes in a subject or cell is known as the subject's or cell's genome. A region of a chromosome at which a particular gene is located is called its locus. Each locus contains one allele of a gene. Thus, a pair of chromosomes together has two loci that each contain an allele of the gene to form an allele pair. The two alleles may be the same or may be different (e.g., have slightly varying gene sequences).
[0075] As used herein, an “allele” is a sequence, or variant thereof, of a gene. One allele of a gene may differ from another allele of the same gene in various ways. For example, two alleles for a same gene may differ by, for example, differences in the encoded protein (e.g., differences in the amino acid sequence of the encoded protein), other (e.g., silent or synonymous) variances in the exon regions that do not affect the amino acid sequence, variances in the intron regions, or some combination of these variances.
[0076] As used herein, a “sequence” denotes any information or data that is indicative of the order of the nucleotide bases (e.g., A, C, G, T/U) in a molecule (e.g., whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, fragment, etc.) of DNA or RNA. Sequence information may be obtained using any of the available varieties of techniques, platforms, or technologies, including, but not limited to, capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, electronic-based systems, etc., or a combination thereof. As one example, sequence information may be obtained using next generation sequencing.
[0077] As used herein, “next generation sequencing” (NGS) refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary
electrophoresis-based approaches. These sequencing technologies have, for example, the ability to generate hundreds of thousands of relatively small sequence reads or "reads" in a single sequencing run. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization.
[0078] As used herein, a “read” or “sequence read” includes a string of nucleic acid bases corresponding to a nucleic acid molecule that has been sequenced. For example, a read can refer to the sequence of nucleotides determined for a nucleic acid fragment that has been subjected to sequencing, such as, for example, next generation sequencing (“NGS”). Reads can be any sequence of any number of nucleotides, with the number of nucleotides defining the read length. The term “sequence read” may refer to reads obtained by sequencing DNA in a sample. The DNA may be genomic DNA. The term “sequence read data” refers to any information characterizing a sequence read, including the sequence of the read itself or information from which the sequence can be derived. Sequence read data may include, in addition to this, one or more of: one or more quality metrics, alignment information, one or more flags, etc. Sequence read data may be in the form of, e.g., a BCL file, a FASTQ file, a SAM file, a BAM file, or any other file format from which read counts for specific loci can be derived as described herein.
[0079] As used herein, a “major histocompatibility complex gene” or “MHC gene” is a gene that encodes a system, complex, or group of cell-surface proteins responsible for the regulation of the immune system.
[0080] As used herein, a “human leukocyte antigen gene” or “HLA gene” is a gene that encodes a system, complex, or group of cell-surface proteins responsible for the regulation of the immune system. An HLA system or complex is encoded by the MHC gene complex in humans. MHC molecules that present antigens on cells are categorized as belonging to one of three classes of MHC molecules, MHC class I, MHC class II, and MHC class III. Certain HLA genes including, for example, HLA-A, HLA-B, HLA-C, correspond to MHC class I. Certain HLA genes including, for example, HLA-DP, HLA-DM, HLA-DO, HLA-DQ, and HLA-DR, correspond to MHC class II. HLA genes that are known include, for example, HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, HLA-G, HLA-H, HLA-J, HLA-K, HLA-L, HLA-N, HLA-P, HLA-S, HLA-T, HLA-U, HLA-V, HLA-W, HLA-X, HLA-Y, HLA-Z, HLA-DRA, HLA-DRB, HLA-DQ, HLA-DOA, HLA-DOB, HLA-DMA, HLA-DMB, HLA-DPA, HLA-DPB, and HFE. Other genes that are found in the HLA region include, for example, TAP1, TAP2, PSMB9, PSMB8, MICA, MICB, MICC, MICD, and MICE.
[0081] As used herein, a “T cell”, also known as a T lymphocyte, refers to a type of adaptive immune cell. T cells develop in the thymus gland and play a central role in the immune response of the body. T cells can be distinguished from other lymphocytes by the presence of a T cell receptor (TCR) on the cell surface. These immune cells originate as precursor cells, derived from bone marrow, and then develop into several distinct types of T cells once they have migrated to the thymus gland. T cell differentiation continues even after they have left the thymus. T cells include, but are not limited to, helper T cells, cytotoxic T cells, memory T cells, regulatory T cells, and killer T cells. Helper T cells stimulate B cells to make antibodies and help killer cells develop. Based on the T cell receptor chain, T cells can also include T cells that express αβ TCR chains, T cells that express γδ TCR chains, as well as unique TCR co-expressors (i.e., hybrid αβ-γδ T cells) that co-express the αβ and γδ TCR chains.
[0082] T cells can also include engineered T cells that can attack specific cancer cells. Engineered T cells may be designed to recognize MHC-presented peptides. For example, an engineered T cell may be designed to recognize an antigen that is not subject to HLA loss (e.g., by engineering the T cell to express a T cell receptor that recognizes said antigen in the context of an HLA allele that is not subject to HLA loss in a subject). Engineered T cells can be expanded in culture and then infused into a patient's body. Engineered T cells may be designed to multiply and recognize the cancer cells that express a specific protein or neoantigen. This type of technology may be used in potential next-generation immunotherapy treatment.
[0083] As used herein, "immunotherapy" refers to a treatment or class of treatments that uses one or more parts of a subject's immune system to fight a disease such as, for example, without limitation, cancer. Immunotherapy can use substances made by the body or synthesized outside of the body to improve how the immune system works to find and destroy cancer cells. An immunotherapy may be a cell therapy, such as e.g. a T cell or NK cell therapy. An immunotherapy may be a vaccine, such as e.g. a nucleic acid, protein or cell based vaccine.
[0084] As used herein, a "neoantigen" is a tumor-specific antigen derived from one or more somatic mutations in a tumor. A neoantigen can be presented by a subject's cancer cells and antigen presenting cells. This can lead to an immune response against cells expressing the neoantigen. Neoantigen therapies, such as, but not limited to, neoantigen vaccines, are a relatively new approach for providing individualized cancer treatment. As used herein a “tumor
associated antigen” is an antigen that is expressed exclusively or primarily in tumor cells, but which does not arise from somatic mutation in the protein comprising the antigen. For example, a tumor associated antigen may arise through somatic amplification or other overexpression mechanism, or through post-translational modification. Cancer vaccines (also referred to herein as “neoantigen vaccines”) can prime a subject's T cells to recognize and attack cancer cells expressing one or more particular tumor neoantigens and/or tumor associated antigens. This approach generates a tumor-specific immune response that spares healthy cells while targeting tumor cells. An individualized vaccine may be engineered or selected based on a subject-specific tumor antigen profile. The tumor antigen profile can be defined by determining DNA and/or RNA sequences from a subject's tumor cell and using the sequences to identify neoantigens and/or tumor associated antigens that are present in tumor cells but absent in normal cells.
III. Evaluation of HLA Loss
[0085] The systems and methods described herein are generally presented with respect to evaluation of HLA loss (e.g., HLA allelic loss or imbalance, HLA allelic expression loss) for a given HLA gene (e.g., through HLA loss of heterozygosity or HLA copy number changes). It should be understood, however, that they could be similarly used to evaluate HLA loss for multiple HLA genes. For example, various processes described below for evaluating HLA loss (e.g., HLA LOH) for a given HLA gene may be repeated or modified as needed to enable evaluating HLA allele loss across multiple HLA genes.
[0086] Further, the systems and methods described herein are generally presented with respect to evaluation of HLA loss (e.g., HLA allelic loss or imbalance, HLA allelic expression loss) for a given HLA gene. It should be understood, however, that they could be similarly used to evaluate MHC loss for one or more MHC genes. Accordingly, terms used herein that include the modifier "HLA" (e.g., HLA loss, HLA loss of heterozygosity, HLA alignment input, HLA allele, HLA gene, etc.) may be interchangeable with the modifier MHC (e.g., MHC loss, MHC loss of heterozygosity, MHC alignment input, MHC allele, MHC gene, etc.).
[0087] Still further, the systems and methods described herein are generally presented with respect to evaluation of HLA loss (e.g., HLA allelic loss or imbalance, HLA allelic expression loss). It should be understood, however, that they could be similarly used to evaluate HLA gain for one or more HLA genes. More generally, the methods described herein can be used to evaluate HLA allelic imbalance. HLA allelic imbalance refers to the departure from an
expected 1:1 ratio of representation of two alleles at a heterozygous locus. Thus, the various methods and systems described below for evaluating HLA allele loss for a given HLA gene may be repeated or modified as needed to enable evaluating HLA allele gain across one or more HLA genes. Further, it should be understood that in various embodiments, references to HLA loss may be interchangeable with one or more other forms of HLA copy number alteration, which is an alteration in the number of copies of at least one HLA allele for an HLA gene. In some instances described herein, HLA copy number alteration includes copy-neutral loss of heterozygosity (LOH), in which a first allele for a gene is lost (resulting in zero copies of the first allele) and a second allele for that gene is gained (resulting in two copies of the second allele). Thus, references to evaluating or detecting “HLA loss” or “HLA alteration” encompass detecting HLA allelic imbalance of any kind, whether through copy number alteration (including amplification and deletion) of one or both alleles in a pair of HLA alleles.
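To make the departure from a 1:1 ratio concrete, the following worked sketch computes the allelic ratio one would expect for a bulk tumor sample under a simple mixing assumption. The mixing formula and the `purity` parameter (fraction of cells in the sample that are tumor cells) are assumptions introduced here for illustration and are not stated in the text; normal cells are assumed to carry one copy of each allele and capture bias is ignored.

```python
def expected_allelic_ratio(purity, copies_allele_1, copies_allele_2):
    """Expected allele 1 : allele 2 ratio in a tumor/normal cell mixture.

    Hypothetical helper: normal cells contribute one copy of each allele,
    tumor cells contribute the given copy numbers.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    normal_fraction = 1.0 - purity
    allele_1 = normal_fraction * 1 + purity * copies_allele_1
    allele_2 = normal_fraction * 1 + purity * copies_allele_2
    if allele_2 == 0:
        raise ValueError("allele 2 has no expected copies in the sample")
    return allele_1 / allele_2


# Deletion of allele 1 in the tumor cells (0 copies vs 1 copy).
deletion = expected_allelic_ratio(0.5, 0, 1)
# Copy-neutral LOH: allele 1 lost, allele 2 duplicated (0 copies vs 2).
copy_neutral_loh = expected_allelic_ratio(0.5, 0, 2)
# No alteration: the ratio stays at 1 whatever the purity.
balanced = expected_allelic_ratio(0.3, 1, 1)
print(deletion, copy_neutral_loh, balanced)
```

Under these assumptions both a deletion and a copy-neutral LOH pull the ratio below 1, and the shift becomes smaller as purity decreases, which is why detection in low-purity samples depends on a well-calibrated estimate of the variability of the ratio.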
III. A. System for Evaluation of HLA Loss
[0088] FIG. 1 provides a schematic diagram illustrating a non-limiting example of an evaluation system for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein. Evaluation system 100 is implemented using hardware, software, firmware, or a combination thereof. Evaluation system 100 may be implemented using, for example, computer system 102. Computer system 102 includes a single computer or multiple computers in communication with each other. When computer system 102 includes multiple computers, in some instances, one computer may be located remotely with respect to at least one other computer.
[0089] Evaluation system 100 includes allelic type generator 104, alignment analyzer 106, statistics generator 108, or a combination thereof. Each of allelic type generator 104, alignment analyzer 106, and statistics generator 108 is implemented using hardware, software, firmware, or a combination thereof. For example, each of allelic type generator 104, alignment analyzer 106, and statistics generator 108 can be implemented as a distinct compiled computer program, interpreted language script, another type of software, or a combination thereof. Alignment analyzer 106 and statistics generator 108 form HLA loss evaluator 110. HLA loss evaluator 110 can be implemented in various ways.
[0090] In some instances, alignment analyzer 106 and statistics generator 108 are separate programs with alignment analyzer 106 generating an output that is sent as input into statistics generator 108. In other instances, alignment analyzer 106 and statistics generator 108 are
integrated or the actions that would be performed by alignment analyzer 106 and statistics generator 108 are integrated to form HLA loss evaluator 110. Accordingly, HLA loss evaluator 110 is implemented using hardware, software, or a combination thereof. In some instances, HLA loss evaluator 110 is implemented as a compiled computer program, interpreted language script, another type of software, or a combination thereof. In other instances, HLA loss evaluator 110 is implemented as a plurality of programs working together. Reference herein to HLA loss evaluator 110 refers to alignment analyzer 106, statistics generator 108, a combination of alignment analyzer 106 and statistics generator 108, operations that would be performed by alignment analyzer 106, operations that would be performed by statistics generator 108, operations that would be performed by a combination of alignment analyzer 106 and statistics generator 108, or a combination thereof.
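The arrangement in which alignment analyzer 106 produces an output that is passed to statistics generator 108 can be pictured with the following sketch. It is a schematic illustration only: the class and function names, the in-memory data layout, and the toy components are all hypothetical and stand in for whatever programs or scripts an actual implementation would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# (T1, T2, N1, N2) read counts for one HLA gene; hypothetical layout.
Counts = Tuple[int, int, int, int]


@dataclass
class HLALossEvaluator:
    """Composes two stages, mirroring elements 106 and 108 of FIG. 1."""

    alignment_analyzer: Callable[[dict], Dict[str, Counts]]
    statistics_generator: Callable[[Dict[str, Counts]], Dict[str, float]]

    def evaluate(self, read_data: dict) -> Dict[str, float]:
        # Output of the first stage is the input of the second stage.
        counts_by_gene = self.alignment_analyzer(read_data)
        return self.statistics_generator(counts_by_gene)


def toy_alignment_analyzer(read_data: dict) -> Dict[str, Counts]:
    # Placeholder: a real analyzer would align reads and count them.
    return dict(read_data["precomputed_counts"])


def toy_statistics_generator(counts: Dict[str, Counts]) -> Dict[str, float]:
    # Placeholder statistic: tumor ratio divided by normal ratio.
    return {
        gene: (t1 / t2) / (n1 / n2)
        for gene, (t1, t2, n1, n2) in counts.items()
    }


evaluator = HLALossEvaluator(toy_alignment_analyzer, toy_statistics_generator)
result = evaluator.evaluate(
    {"precomputed_counts": {"HLA-A": (20, 100, 100, 100)}}
)
print(result)
```

Keeping the two stages behind plain callables reflects the statement above that they may be separate programs or may be integrated into a single evaluator; either stage can be replaced without changing the other.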
[0091] Evaluation system 100 receives read data 112 as input. One or more of allelic type generator 104 and HLA loss evaluator 110 (e.g., alignment analyzer 106, statistics generator 108, or both) receives read data 112 as input. Read data 112 includes one or more datasets. Read data 112 includes, for example, one or more sequencing datasets. A sequencing dataset includes, for example, sequence read data for a plurality of reads. In some instances, evaluation system 100 retrieves read data 112 from data store 114. Data store 114 includes, for example, but is not limited to, at least one of a database, a data storage unit, a spreadsheet, a file, a server, a cloud storage unit, a cloud database, or some other type of data store. In some examples, data store 114 comprises one or more data storage devices separate from but in communication with computer system 102. In other examples, data store 114 is at least partially integrated as part of computer system 102.
[0092] Read data 112 includes reads (e.g., sequence reads) that are generated using, for example, one or more next-generation sequencing (NGS) systems. The reads are generated using, for example, whole-exome sequencing (WES), whole genome sequencing (WGS), shallow whole genome sequencing (sWGS), targeted (panel) sequencing, or a combination thereof. The reads can be generated using, for example, paired-end sequencing.
[0093] For example, read data 112 includes at least a plurality of reads 116. Reads 116 are generated for a corresponding first sample which may be, for example, a biological sample. The first sample can be obtained from, for example, a subject (e.g., a live subject). In some instances, read data 112 further includes a plurality of reads 118 that are generated for a corresponding second sample that is different from the first sample. This second sample may
be, for example, a biological sample obtained from a subject that is the same as or different from the subject from which reads 116 are generated. Reads 118 are, in some examples, generated via simulation, via a sampling from a collection of reads generated for multiple subjects, or in some other manner. The first and second samples may comprise cells or genetic material derived therefrom, obtained from the same subject. Such samples may be referred to as “paired” or “matched”.
[0094] In some instances, reads 116 and reads 118 are paired-end reads. For example, paired-end sequencing of a fragment results in two sequences, a sequence generated beginning at the 5' end of the fragment, and a sequence generated beginning at the 3' end of the fragment. These two sequences form a paired-end read.
[0095] A biological sample for which at least a portion (e.g., reads 116, reads 118) of read data 112 is generated may be, for example, a sample of unhealthy or diseased tissue, a sample of tumor tissue, a sample of tissue that includes tumor cells, a sample of healthy or normal tissue, a sample of tissue that includes normal cells, a sample of tissue taken at a first stage or point in time during a cancer progression, a sample of tissue taken at a second stage or point in time during the cancer progression, or another type of sample. In some instances, reads 116 are generated for a sample of healthy or normal tissue and reads 118 are generated for a sample of unhealthy or diseased tissue (e.g., a tumor). In some examples, reads 116 are referred to as normal reads or healthy reads, and reads 118 are referred to as unhealthy reads, diseased reads, or tumor reads. A sample of unhealthy or diseased tissue refers to a sample comprising unhealthy or diseased cells, or genetic material derived therefrom. A sample of unhealthy or diseased tissue may also comprise healthy/normal cells or genetic material derived therefrom. Such samples may be described as having a particular purity (also referred to as “tumor purity” in the context of cancer), referring to the proportion of the cells represented in the sample that are diseased cells.
[0096] Allelic type generator 104 generates at least one applicable or probable allelic type for one or more genes of interest within a sample using read data 112 (e.g., based on reads 116 or reads 118). Allelic type generator 104 may generate at least one applicable or probable allelic type for each of a plurality of HLA genes. Allelic type generator 104 identifies, in some instances, a set of alleles 120 relevant to a subject using reads 116 generated for the subject. Set of alleles 120 is a set of one or more HLA alleles that is determined to most likely be present within the sample from which reads 116 are generated for a given HLA gene. Allelic type
generator 104 may identify a set of alleles 120 by identifying a final set of allelic identifiers 122 for set of alleles 120. An allelic identifier for an allele can take various forms. For example, an allelic identifier may be comprised of various letters and/or digits that form one or more fields for representing different pieces of information about an allele. An allelic identifier refers to any information from which at least the exon sequence of an HLA allele can be derived. An allelic identifier can have varying levels of resolution in which higher levels of resolution provide more information than lower levels of resolution. In some cases, additional letters and/or digits provide additional information. As one example, a 6-digit allelic identifier (or 6-digit identifier) has a lower resolution than an 8-digit allelic identifier (or 8-digit identifier). In some instances, allelic identifiers are referred to by the number of fields of information represented in these allelic identifiers. For example, a 6-digit identifier may be referred to as or generally provide the same level of information as a 3-field identifier. An 8-digit identifier may be referred to as or generally provide the same level of information as a 4-field identifier.
[0097] An HLA allele for an HLA gene can be identified using identifiers of varying resolutions including, but not limited to, exon-resolution identifiers and intron-resolution identifiers. An exon-resolution identifier for an HLA allele describes an allele group, a specific allele protein, and exon region information for the corresponding HLA allele. In some instances, the exon-resolution identifier is a 6-digit identifier in which the first and second digits identify the allele group; the third and fourth digits identify the specific allele protein; and the fifth and sixth digits identify the exon region information. The specific allele protein is determined based on DNA sequence and differences within the amino acid sequence of the encoded protein. The exon region information captures changes in one or more exon regions of the HLA allele such as, for example, synonymous nucleotide substitutions. The 6-digit identifier may include one or more letters that indicate the corresponding HLA gene. In other instances, the exon-resolution identifier is a 3-field identifier in which each field is comprised of any number of letters, digits, symbols, or combination thereof.
[0098] An intron-resolution identifier provides more information than an exon-resolution identifier and therefore has a higher level of resolution than an exon-resolution identifier. An intron-resolution identifier for an HLA allele describes an allele group, a specific allele protein, exon region information, and intron region information for the corresponding HLA allele. In some instances, the intron-resolution identifier is an 8-digit identifier that adds, to a 6-digit identifier as described above, seventh and eighth digits that identify intron region information.
The intron region information captures changes in one or more intron regions of the HLA allele such as, for example, polymorphisms in the intron regions. Final set of allelic identifiers 122 is a final set of intron-resolution identifiers in some instances. The 8-digit identifier may include one or more letters that indicate the corresponding HLA gene. In other instances, the intron-resolution identifier is a 4-field identifier that adds one field (e.g., comprised of any number of letters, digits, symbols, or combination thereof) to the 3-field identifier. A 6-digit or 8-digit identifier may, in addition to the 6 or 8 digits mentioned above, include one or more optional suffixes indicating respective properties of the allele, such as e.g. whether the protein encoded by the allele is expressed as a soluble molecule, whether it has been shown not to be expressed, etc. These may be removed or ignored prior to use in embodiments of methods of the disclosure.
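The field structure described above can be illustrated with a small parser. This is a minimal sketch, assuming identifiers are written in the colon-delimited form used by public HLA nomenclature (e.g., "HLA-A*02:01:01:02L"); the names `HlaIdentifier`, `parse_hla_identifier`, `resolution`, and `truncate` are hypothetical and do not come from the disclosure.

```python
import re
from typing import NamedTuple, Optional, Tuple


class HlaIdentifier(NamedTuple):
    gene: str                 # letters indicating the HLA gene, e.g. "A"
    fields: Tuple[str, ...]   # e.g. ("02", "01", "01", "02")
    suffix: Optional[str]     # optional expression suffix, e.g. "N" or "L"


# Assumed notation: optional "HLA-" prefix, gene, "*", 1-4 numeric fields, optional suffix.
_PATTERN = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+(?::\d+){0,3})([A-Z]?)$")


def parse_hla_identifier(name: str) -> HlaIdentifier:
    """Split an identifier into gene, numeric fields, and optional suffix."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized HLA identifier: {name!r}")
    gene, digits, suffix = match.groups()
    return HlaIdentifier(gene, tuple(digits.split(":")), suffix or None)


def resolution(identifier: HlaIdentifier) -> str:
    """Map the field count to the resolution levels named in the text."""
    return {3: "exon", 4: "intron"}.get(len(identifier.fields), "lower")


def truncate(identifier: HlaIdentifier, n_fields: int) -> HlaIdentifier:
    """Drop trailing fields and the suffix, e.g. 4-field -> 3-field."""
    return HlaIdentifier(identifier.gene, identifier.fields[:n_fields], None)
```

Under these assumptions, `truncate(parse_hla_identifier("HLA-A*02:01:01:02L"), 3)` yields the exon-resolution form of the allele with the suffix removed, mirroring the statement that suffixes may be removed or ignored prior to use.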
[0099] Allelic type generator 104 outputs final set of allelic identifiers 122 that correspond to a given HLA gene as determined using reads 116. The final set of allelic identifiers may be exon-level identifiers or intron-level identifiers. The final set of allelic identifiers may be intron-level identifiers obtained from a first set of allelic identifiers that are exon-level identifiers. For example, 6-digit HLA identifiers may be identified in a first step and converted to 8-digit identifiers in a second step. The resolution of HLA identifiers in a final set of allelic identifiers may vary between HLA alleles. For example, an 8-digit identifier (intron-level identifier) may be used wherever available, and a lower resolution identifier (e.g., exon-level identifier, 6-digit identifier) may be used otherwise. HLA loss evaluator 110 receives final set of allelic identifiers 122 as input. HLA loss evaluator 110 uses final set of allelic identifiers 122 and read data 112 to generate HLA loss information 124. HLA loss information 124 may include various pieces of information that describe HLA loss or imbalance in the tumor sample, based on the comparison between the two samples, from reads 116 and reads 118. For example, HLA loss information 124 includes one or more pieces of information that can be used to identify, quantify and/or qualify HLA alteration (e.g., loss or other imbalance) in the sample associated with reads 118 as compared to the sample associated with reads 116.
[0100] As previously described elsewhere herein, HLA loss of an HLA allele for an HLA gene refers to an absence or decrease of presence of that HLA allele in a tissue or population of cells. HLA loss information 124 provides information that can be used to identify, quantify and/or qualify this absence or decrease. An example of information included in HLA loss information 124 is statistics generated based on alignment between a sequence corresponding
to an allelic identifier for an HLA allele and various reads. Another example of information included in HLA loss information 124 is one or more conclusions or inferences made using statistics. Yet another example of information included in HLA loss information 124 is an estimation of the amount of HLA loss (e.g., a percentage, a degree, etc.).
[0101] In some instances, HLA loss information 124 is generated by operations involving alignment analyzer 106 and statistics generator 108. Alignment analyzer 106, for example, receives final set of allelic identifiers 122 and read data 112 as input and generates alignment output 126 based on this input. Alignment analyzer 106 performs an analysis of the alignment of each allele identified by a corresponding one of final set of allelic identifiers 122 with reads 116 and reads 118. Thus, alignment analyzer 106 aligns reads 116 and reads 118 to a subject-specific reference sequence which includes the sequence of HLA alleles identified in the final set of allelic identifiers 122. Alignment output 126 provides a quantification of these alignments. In some instances, alignment output 126 also includes a quantification of alignments between non-HLA genes with reads 116 and reads 118.
[0102] Alignment analyzer 106 outputs alignment output 126 and statistics generator 108 receives alignment output 126 as input. Statistics generator 108 performs statistical analysis 128 using alignment output 126 and, in some cases, other information. Statistics generator 108 performs statistical analysis 128 using one or more algorithms for generating statistics, one or more mathematical formulas or equations, one or more other types of analysis techniques, or a combination thereof. HLA loss evaluator 110 generates HLA loss information 124 using the results of statistical analysis 128, one or more inferences or conclusions made based on the results of statistical analysis 128, or a combination thereof.
[0103] HLA loss evaluator 110 generates, in some instances, report 130 using HLA loss information 124. Report 130 may include, for example, without limitation, at least one of a table, a spreadsheet, a database, a file, a presentation, an alert, a graph, a chart, one or more graphics, or a combination thereof. HLA loss evaluator 110 optionally displays report 130 on display system 132. Display system 132 comprises one or more display devices in communication with computer system 102. Display system 132 may be separate from or at least partially integrated as part of computer system 102.
[0104] Although the operations of evaluation system 100 are generally described with respect to a given HLA gene, evaluation system 100 is capable of identifying a set of alleles for each of a plurality of HLA genes in a sample and evaluating HLA loss using the identified
sets of alleles. Further, although allelic type generator 104 is shown as part of evaluation system 100, in other instances, allelic type generator 104 may be separate from evaluation system 100.
[0105] FIG. 2 provides a non-limiting example of a process 200 for evaluating HLA loss, in accordance with one or more implementations of the systems and methods disclosed herein. Process 200 can be implemented using the HLA loss evaluator 110 described in FIG. 1. For example, process 200 may use the statistics generator 108 that performs statistical analysis 128 in FIG. 1.
[0106] In FIG. 2, a set of normal sequence reads 116 (e.g., WES sequence reads) derived from a normal sample from a subject are provided as input to, e.g., allelic type generator 104 as described in FIG. 1, which identifies a set of HLA alleles 120 and/or HLA allelic identifiers 122 present in the sample for at least one HLA gene. In some instances, the identified HLA alleles may be identified using 6-digit allelic identifiers (e.g., exon-resolution identifiers as described elsewhere herein). In some instances, the identified HLA alleles may comprise 8-digit allelic identifiers (e.g., intron-resolution identifiers as described elsewhere herein). Methods for generating 6-digit and 8-digit allelic identifiers based on a plurality of sequence reads are described in more detail in, for example, PCT International Patent Application Publication No. WO 2022/192304, which is incorporated herein by reference in its entirety.
[0107] As indicated in FIG. 2, the set of alleles 120 and/or allelic identifiers 122 can then be provided as input to, e.g., alignment analyzer 106 as described in reference to FIG. 1 along with a reference genome 202 (e.g., the GRCh38 human reference genome (Genome Reference Consortium)) to generate a subject-specific reference genome comprising a subject-specific HLA genome (HLA-ome) 204 (e.g., a subject-specific reference genome for the HLA region of the genome). The subject-specific reference genome can combine the reference genome 202 with the subject-specific HLA genome by replacing the sequences of the HLA genes in the reference genome 202 with the sequences in the subject-specific HLA genome. The subject-specific HLA genome can be constructed based on allelic identifiers 122 and an HLA sequence database such as, e.g., IMGT™ (www.imgt.org/). In some instances, the set of allelic identifiers may be generated (e.g., by allelic type generator 104 in FIG. 1) through the use of a set-covering algorithm to find a set of HLA gene alleles that best explains the observed set of normal sequence reads (e.g., normal WES sequence reads) in a sample.
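The set-covering idea mentioned above can be illustrated with the classic greedy approximation. This is only a sketch under stated assumptions: the disclosure does not specify which set-covering algorithm is used, and the input `compatible_reads` (a hypothetical map from candidate allele to the identifiers of the normal reads it can explain) and the function name `greedy_allele_cover` are introduced here for illustration.

```python
from typing import Dict, List, Set


def greedy_allele_cover(compatible_reads: Dict[str, Set[str]],
                        max_alleles: int = 2) -> List[str]:
    """Greedily pick up to `max_alleles` alleles that explain the most reads.

    Assumption: at most two alleles are chosen per HLA gene (one per
    chromosome copy); a homozygous subject yields a single allele.
    """
    uncovered = set().union(*compatible_reads.values())
    chosen: List[str] = []
    while uncovered and len(chosen) < max_alleles:
        # Sorting first makes tie-breaking deterministic (alphanumeric order).
        best = max(sorted(compatible_reads),
                   key=lambda allele: len(compatible_reads[allele] & uncovered))
        gained = compatible_reads[best] & uncovered
        if not gained:
            break  # no remaining allele explains any uncovered read
        chosen.append(best)
        uncovered -= gained
    return chosen
```

The greedy choice is not guaranteed to be optimal in general, so a production HLA typer may instead solve the covering problem exactly (for example, as an integer program).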
[0108] As indicated in FIG. 2, the subject-specific HLA genome 204 or subject-specific reference genome, and sets of tumor sequence reads 118 and normal sequence reads 116 are
provided as input to, e.g., alignment analyzer 106 as described in reference to FIG. 1 to determine sequence read count ratios based on: (i) a total number of unique sequence reads (N1) from the normal sample that align to a first HLA gene allele (HLA gene allele 1) for the at least one HLA gene, (ii) a total number of unique sequence reads (N2) from the normal sample that align to a second HLA gene allele (HLA gene allele 2) for the at least one HLA gene, (iii) a total number of unique sequence reads (T1) from the tumor sample that align to HLA gene allele 1, and (iv) a total number of unique sequence reads (T2) from the tumor sample that align to HLA gene allele 2. These values are used to calculate a sequence read count ratio (N1/N2) 206 for the normal sample and a sequence read count ratio (T1/T2) 208 for the tumor sample. For a more detailed description of a process for quantifying allele-read alignments for HLA genes, see FIG. 4.
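The four counts and the two ratios derived from them can be held in a small record. The following is a minimal sketch; the class name `AlleleReadCounts` and its methods are hypothetical, and raising an error on a zero denominator is an assumption made here rather than behavior described in the disclosure.

```python
from dataclasses import dataclass


def _ratio(numerator: int, denominator: int) -> float:
    if denominator == 0:
        raise ValueError("ratio undefined: no reads aligned to allele 2")
    return numerator / denominator


@dataclass(frozen=True)
class AlleleReadCounts:
    n1: int  # unique normal-sample reads aligned to HLA gene allele 1
    n2: int  # unique normal-sample reads aligned to HLA gene allele 2
    t1: int  # unique tumor-sample reads aligned to HLA gene allele 1
    t2: int  # unique tumor-sample reads aligned to HLA gene allele 2

    def normal_ratio(self) -> float:
        """Sequence read count ratio N1/N2 for the normal sample."""
        return _ratio(self.n1, self.n2)

    def tumor_ratio(self) -> float:
        """Sequence read count ratio T1/T2 for the tumor sample."""
        return _ratio(self.t1, self.t2)
```

For illustrative counts N1=120, N2=100, T1=30, T2=90, the normal ratio is 1.2 while the tumor ratio is about 0.33, the kind of shift that the downstream statistical test is meant to assess.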
[0109] As indicated in FIG. 2, the sequence read count ratio (N1/N2) 206 for the normal sample and the sequence read count ratio (T1/T2) 208 for the tumor sample can be input into, e.g., statistics generator 108 as described in reference to FIG. 1, and processed by statistical analysis 128 as described in reference to FIG. 1, to output a p-value or log odds ratio 212 used to test whether the sequence read count ratio (T1/T2) 208 for the tumor sample is significantly different from the sequence read count ratio (N1/N2) 206 for the normal sample at a given locus in view of an expected or baseline value. The expected or baseline value can correspond to the observed sequence read count ratio (T1/T2) 208 for the tumor sample being the same as the sequence read count ratio (N1/N2) 206 for the normal sample. The statistical significance of the difference can be determined, for example, by modeling sequence read count data for a plurality of non-HLA heterozygous genomic loci (e.g., heterozygous SNPs identified in the non-HLA region of the subject’s genome). In some instances, for example, the modeling may comprise fitting the N1, N2, T1, and T2 values determined for a plurality of heterozygous SNPs 210 identified (e.g., by alignment analyzer 106) in the non-HLA region of the subject’s genome to a multinomial model, then estimating the degree of overdispersion (i.e., a situation in which the variance in sequence read count numbers is much larger than expected, given the mean sequence read count values) in the sequence read data for the normal and tumor samples. The statistical analysis 128 may be configured, for example, to output a p-value for the difference between the observed T1/T2 value and the N1/N2 or expected/baseline value at a given HLA gene locus, a log odds ratio estimate for the T1/T2 value and the N1/N2 value (i.e., a difference between the log ratio for allele 2 and the log ratio for allele 1, e.g., LOR = log(T2/N2) - log(T1/N1)), a confidence interval around the log odds ratio estimate, a p-value for the log odds ratio being different from a baseline estimate corresponding to the two alleles being balanced, a log ratio estimate for the first allele (log(T1/N1)), a log ratio estimate for the second allele (log(T2/N2)), a confidence interval around the log ratio estimate for the first allele, a confidence interval around the log ratio estimate for the second allele, a p-value for the observed log ratio estimate for the first allele being different from a baseline estimate corresponding to the first allele being present in the same amounts in the tumor and normal samples, and/or a p-value for the observed log ratio estimate for the second allele being different from a baseline estimate corresponding to the second allele being present in the same amounts in the tumor and normal samples. P-values and confidence intervals may be calculated using standard error estimates that are adjusted for a degree of overdispersion that is estimated using the N1, N2, T1, and T2 values determined for a plurality of heterozygous SNPs 210 identified in the non-HLA region of the subject’s genome.
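The log odds ratio and its adjusted standard error can be illustrated as follows. This is a minimal sketch, not the disclosed implementation: it assumes the standard large-sample standard error for a log odds ratio, sqrt(1/N1 + 1/N2 + 1/T1 + 1/T2), inflated by a multiplicative dispersion factor, and a two-sided normal approximation for the p-value. The function name `log_odds_ratio_test` and the `dispersion` parameter are hypothetical.

```python
import math


def log_odds_ratio_test(n1: int, n2: int, t1: int, t2: int,
                        dispersion: float = 1.0,
                        z_crit: float = 1.959964) -> dict:
    """Estimate LOR = log(T2/N2) - log(T1/N1) with an adjusted SE.

    `dispersion` is an assumed variance-inflation factor (1.0 means no
    overdispersion); `z_crit` is the normal quantile for a 95% interval.
    """
    if min(n1, n2, t1, t2) <= 0:
        raise ValueError("all four read counts must be positive")
    lor = math.log(t2 / n2) - math.log(t1 / n1)
    # Assumed form: large-sample SE of a log odds ratio, scaled by dispersion.
    se = math.sqrt(dispersion * (1 / n1 + 1 / n2 + 1 / t1 + 1 / t2))
    z = lor / se
    # Two-sided p-value against the balanced baseline LOR = 0.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"lor": lor, "se": se,
            "ci": (lor - z_crit * se, lor + z_crit * se),
            "p_value": p_value}
```

A result such as `log_odds_ratio_test(120, 100, 30, 90)["p_value"]` could then be compared against a predetermined threshold; a larger `dispersion` widens the confidence interval and raises the p-value for the same counts.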
[0110] As indicated in FIG. 2, in some instances, the p-value or log odds ratio 212 output by statistical analysis 128 may be compared to a predetermined threshold (e.g., by p-value threshold comparator 214) to output HLA loss information 124. In some instances, the functionality of p-value threshold comparator 214 may be provided by, for example, HLA loss evaluator 110 as described in reference to FIG. 1. HLA loss information 124 may comprise, for example, a determination of HLA loss of heterozygosity in the subject’s tumor sample for one or more HLA genes.
[0111] In some instances, process 200 as illustrated in FIG. 2 may be performed using tumor sequence reads 118 and normal sequence reads 116 (e.g., WES sequence reads derived from tumor and normal samples collected from a subject) for at least one HLA gene. In some instances, process 200 may be performed using tumor sequence reads 118 and normal sequence reads 116 (e.g., WES sequence reads derived from tumor and normal samples collected from the subject) for at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more than 20 HLA genes.
[0112] As noted above, the statistical analysis 128 can be performed to determine if an observed difference between an allelic ratio for a tumor sample (e.g., a ratio of the number of unique sequence reads that align to allele 1 to the number of unique sequence reads that align to allele 2) and a corresponding allelic ratio for a paired normal sample is statistically significant. An accurate statistical test requires an understanding of the variance of the underlying ratio statistic derived from sequence read data. Past studies have shown that in
short-read sequencing data, the variance in sequence read counts can be higher than expected, resulting in a phenomenon called overdispersion, which can vary from sample to sample. To determine the degree of overdispersion for a given sample, the disclosed methods use a determination of baseline sequence read count ratios at a plurality of heterozygous genomic loci (e.g., a plurality of heterozygous SNP loci) outside of the HLA region in the genome of the normal sample.
[0113] For copy number-based approaches to determining HLA loss (e.g., HLA LOH), determinations of tumor and normal sequence read counts aligned to the HLA gene locus of interest are required for the computation of logR (i.e., log2 of the ratio of observed sequence read counts to expected read counts, where the expectation is based on the number of read counts observed in the normal sample) and BAF (i.e., the “B allele frequency”; a normalized measure of the allelic sequence read count ratio of two alleles (A and B), such that a BAF of 1 or 0 indicates the complete absence of one of the two alleles (e.g., the pair of alleles at the gene locus is either AA or BB), and a BAF of 0.5 indicates the equal presence of both alleles (e.g., the pair of alleles at the gene locus is AB)).
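For comparison, the two copy number-based quantities defined above can be computed as in the sketch below. The `depth_scale` parameter (a factor correcting for different overall sequencing depth in the tumor and normal samples) is an assumption added for illustration, as are the function names.

```python
import math


def log_r(tumor_reads: int, normal_reads: int, depth_scale: float = 1.0) -> float:
    """log2 of observed tumor read count over the count expected from normal.

    `depth_scale` is a hypothetical tumor/normal library-size correction.
    """
    expected = normal_reads * depth_scale
    if tumor_reads <= 0 or expected <= 0:
        raise ValueError("read counts must be positive")
    return math.log2(tumor_reads / expected)


def b_allele_frequency(a_reads: int, b_reads: int) -> float:
    """Fraction of allele-specific reads that support allele B."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("no allele-specific reads at this locus")
    return b_reads / total
```

These functions use two measurements per locus, whereas the allelic ratio-based approach works from all four allele-specific counts.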
[0114] For the presently disclosed allelic ratio-based approach to determining HLA loss (e.g., HLA LOH), sequence read counts are determined for each allele, thus requiring four measurements instead of two, where the measurements are taken for genomic loci where two separate alleles are present. The disclosed allelic ratio approach to detecting HLA loss therefore requires methodology to identify allelic differences genome-wide for a given patient. The standard methodology for identifying genomic differences is very time-consuming, and has thus hindered the development of an allelic ratio-based approach to detection of HLA loss.
[0115] Existing software for identifying genomic differences based on sequence reads involves mapping and sorting sequence read alignments by their genomic position, removing duplicate sequence reads that can result from, e.g., PCR artifacts, and then scanning through the sorted, de-duplicated sequence read alignments to count the nucleotides observed at each genomic position. Sorting takes time proportional to $n\log(n)$, where $n$ is the number of elements, and can be very time-consuming when DNA sequencing runs can generate tens to hundreds of millions of sequence reads. To address this engineering problem, software has been developed to perform de-duplication and tally nucleotide counts from aligned sequence reads using two separate linear scans through the alignments (aligned sequence reads) instead of requiring a sorting step. The first linear scan stores the genomic order of each alignment in
an index, and the second linear scan uses this index to determine whether a given alignment has duplicates in the rest of the data set. This approach is also space efficient, since it does not require disk storage of the alignments before or after sorting. Rather, the sequence read alignment pipeline generates alignments twice, once for each linear scan, obviating the need for disk storage of the alignments.
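The two-scan idea can be sketched as follows. This is an illustration under several assumptions, not the disclosed software: alignments are modeled as hypothetical `(contig, position, strand, sequence)` tuples with ungapped sequences, duplicates are defined as alignments sharing contig, position, and strand, and the index is a hash map recording the last occurrence of each key. The alignment stream is produced twice by calling `make_stream`, mirroring the statement that the pipeline generates alignments once per scan.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical alignment record: (contig, 0-based start, strand, read sequence).
Alignment = Tuple[str, int, str, str]


def tally_bases_two_pass(
    make_stream: Callable[[], Iterable[Alignment]],
) -> Dict[Tuple[str, int], Counter]:
    """De-duplicate and tally per-position base counts without sorting."""
    # Scan 1: record, for each duplicate key, the last ordinal where it occurs.
    last_seen: Dict[Tuple[str, int, str], int] = {}
    for ordinal, (contig, pos, strand, _seq) in enumerate(make_stream()):
        last_seen[(contig, pos, strand)] = ordinal

    # Scan 2: keep an alignment only if it has no duplicate later in the data.
    counts: Dict[Tuple[str, int], Counter] = {}
    for ordinal, (contig, pos, strand, seq) in enumerate(make_stream()):
        if last_seen[(contig, pos, strand)] != ordinal:
            continue  # a duplicate appears later; skip this copy
        for offset, base in enumerate(seq):
            counts.setdefault((contig, pos + offset), Counter())[base] += 1
    return counts
```

Both scans run in time linear in the number of alignments, and only the index and the tallies are held in memory. Real duplicate marking typically also considers mate positions and gapped alignments, which this sketch omits.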
[0116] Heterozygous SNPs are identified in the sequence read data for a normal sample obtained from the subject, and the corresponding number of sequence read counts are determined for the tumor sample at those same SNP locations, thereby yielding four allele-specific sequence read count values at a plurality of genomic loci. The statistical analysis of the four sequence read count values used in the disclosed allelic ratio-based approach to detection of HLA loss is facilitated by the use of, e.g., a multinomial model (rather than a binomial model) and corresponding methods for estimating overdispersion. For estimating overdispersion, an overdispersed multinomial model is fit to the sequence read count data for the non-HLA genomic loci, and used to assess the statistical significance of differences in the observed allelic ratios of HLA sequence read counts between tumor and normal samples.
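One simple way to obtain a dispersion estimate from the non-HLA heterozygous SNPs is sketched below. Note the substitution: rather than fitting an overdispersed multinomial model as described above, this sketch uses a Pearson chi-square (quasi-likelihood style) estimate, in which each SNP contributes a one-degree-of-freedom statistic from its 2x2 table of counts and the mean statistic serves as the variance-inflation factor. The function names and the floor at 1.0 are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def pearson_chi_square_2x2(n1: int, n2: int, t1: int, t2: int) -> float:
    """Pearson statistic for independence of sample (normal/tumor) and allele."""
    observed = ((n1, n2), (t1, t2))
    rows = (n1 + n2, t1 + t2)
    cols = (n1 + t1, n2 + t2)
    total = rows[0] + rows[1]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def estimate_dispersion(snp_counts: Iterable[Tuple[int, int, int, int]]) -> float:
    """Mean per-SNP chi-square (1 degree of freedom each), floored at 1.0."""
    stats = [pearson_chi_square_2x2(*c) for c in snp_counts if min(c) > 0]
    if not stats:
        raise ValueError("no usable heterozygous SNP loci")
    return max(1.0, sum(stats) / len(stats))
```

Under the null hypothesis of equal allelic ratios in tumor and normal, each statistic has an expected value near 1, so a mean well above 1 suggests overdispersion. Because the estimate assumes most non-HLA SNPs are allelically balanced between samples, widespread copy number changes in the tumor would inflate it.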
III.B. Typing of HLA Alleles
[0117] Before HLA loss in a sample of a subject is evaluated, the HLA alleles present in the subject are identified. The subject is HLA-typed. This HLA typing can be performed with respect to each known HLA gene (e.g., HLA-A, HLA-B, HLA-C, etc.). In some cases, the exact two HLA alleles present in a subject for each HLA gene in the subject are identified. These two HLA alleles may be the same (i.e., homozygosity) or different (i.e., heterozygosity). In other cases, the most likely options for the two HLA alleles present in the subject for each HLA gene are identified. Methods for performing HLA typing are described in, for example, PCT International Patent Application Publication No. WO 2022/192304, Szolek et al. 2014, and Rimmer et al. (Nat Genet. 2014 Aug; 46(8): 912-918), which are incorporated herein by reference in their entirety.
III.C. Alignment of Sequence Reads to HLA Alleles
[0118] As noted above, HLA loss information 124 in FIG. 1 or FIG. 2 may comprise, for example, a determination of HLA loss of heterozygosity in the subject’s tumor sample for one or more HLA genes. FIG. 3 provides a schematic diagram illustrating HLA loss of heterozygosity for three HLA genes, i.e., HLA-A, HLA-B, and HLA-C. As indicated in the upper panel of FIG. 3, a normal sample comprises two copies of each gene, where the two copies may comprise a same allele or a different allele (in this example, all three gene loci are heterozygous, i.e., they each comprise two different alleles as indicated by the shading). As indicated in the lower panel of FIG. 3, loss of an allele (e.g., through deletion of all or a portion of one copy of a gene locus) can lead to a loss of heterozygosity and elimination of the corresponding expressed protein.
[0119] As noted above, detection of HLA loss (e.g., HLA loss of heterozygosity) by the disclosed methods is based on determining the number of unique sequence reads that align to a first HLA allele and a second HLA allele in a set of tumor and normal samples from a subject (e.g., a patient). FIG. 4 provides a flow diagram illustrating an example of a process 400 for quantifying allele-read alignments for HLA genes in accordance with one or more implementations of the disclosed methods. Process 400 is one example of a process that is implemented by evaluation system 100 or at least a portion of evaluation system 100 in FIG. 1. Process 400 may be implemented by, for example, HLA loss evaluator 110 in FIG. 1. Process 400 may be implemented by, for example, alignment analyzer 106 in FIG. 1.
[0120] Step 402 in FIG. 4 includes receiving a first set of intron-resolution identifiers for a first allele for an HLA gene and a second set of intron-resolution identifiers for a second allele for the HLA gene. In one or more instances, the first allele and the second allele are the same such that the first set of intron-resolution identifiers and the second set of intron-resolution identifiers are also the same.
[0121] Step 404 in FIG. 4 includes receiving a first plurality of reads for a first sample and a second plurality of reads for a second sample. The first plurality of reads, the second plurality of reads, or both may be reads generated via, for example, WES or WGS. The first sample may be, for example, a sample of healthy or normal tissue. The second sample may be, for example, a sample of unhealthy or diseased tissue (e.g., a tumor). In other embodiments, the first sample and the second sample are samples of tissue taken at first and second points in time, respectively, in the progression of a disease (e.g., a tumor, cancer, etc.).
[0122] Step 406 in FIG. 4 includes identifying a first allele sequence for the first allele for the HLA gene using a selected one of the first set of intron-resolution identifiers and a second allele sequence for the second allele for the HLA gene using a selected one of the second set of intron-resolution identifiers. In step 406, in some instances, multiple combinations of alleles may be made when the first set of intron-resolution identifiers, the second set of intron-resolution identifiers, or both include multiple intron-resolution identifiers. The selection of the combination used for step 406, comprising a selected one of the first set of intron-resolution identifiers and/or a selected one of the second set of intron-resolution identifiers, may be performed in different ways. For example, without limitation, the selection may be performed via random selection. In other instances, the selection may be performed in an ordered manner (e.g., selecting the first one of the first set of intron-resolution identifiers, selecting the first one of the second set of intron-resolution identifiers alphanumerically, etc.).
[0123] Step 408 in FIG. 4 includes quantifying allele-read alignments between the first allele sequence and each of the first plurality of reads and the second plurality of reads and between the second allele sequence and each of the first plurality of reads and the second plurality of reads to form an alignment output. Step 408 may include, for example, generating an alignment output based on alignment using the first allele sequence, the second allele sequence, the first plurality of reads, and the second plurality of reads. Step 408 includes, for example, but is not limited to, counting the number of alignments (e.g., exact alignments, alignments within 1 or 2 base pairs, alignments within 3 or 4 base pairs, etc.) between different combinations of the first allele and the second allele with the first sample and the second sample. For example, step 408 can include generating a first count for a number of first allele and first sample alignments, a second count for a number of first allele and second sample alignments, a third count for a number of second allele and first sample alignments, and a fourth count for a number of second allele and second sample alignments. The first allele and first sample alignments are alignments between the first allele sequence and the first plurality of reads associated with the first sample. The first allele and second sample alignments are alignments between the first allele sequence and the second plurality of reads associated with the second sample. The second allele and first sample alignments are alignments between the second allele sequence and the first plurality of reads. The second allele and second sample alignments are alignments between the second allele sequence and the second plurality of reads.
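The four counts of step 408 can be illustrated with a toy matcher. This is a sketch only: a real implementation would use a read aligner, whereas here an "alignment" is assumed to mean that a read matches some window of the allele sequence with at most `max_mismatches` substitutions (no insertions or deletions). The function names and dictionary keys are hypothetical.

```python
from typing import Dict, Sequence


def aligns(read: str, allele_seq: str, max_mismatches: int = 0) -> bool:
    """True if the read matches any ungapped window within the mismatch budget."""
    k = len(read)
    for start in range(len(allele_seq) - k + 1):
        window = allele_seq[start:start + k]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            return True
    return False


def quantify_alignments(allele1: str, allele2: str,
                        first_reads: Sequence[str],
                        second_reads: Sequence[str],
                        max_mismatches: int = 0) -> Dict[str, int]:
    """Return the four allele-by-sample alignment counts, in the order of step 408."""
    def count(allele_seq: str, reads: Sequence[str]) -> int:
        return sum(aligns(r, allele_seq, max_mismatches) for r in reads)

    return {
        "allele1_sample1": count(allele1, first_reads),
        "allele1_sample2": count(allele1, second_reads),
        "allele2_sample1": count(allele2, first_reads),
        "allele2_sample2": count(allele2, second_reads),
    }
```

In this toy version a read that matches both allele sequences is counted for both; how such non-discriminating reads are treated in practice is a design choice this sketch does not settle.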
[0124] In other instances, the alignment output in step 408 may take a different form. The alignment output may include, for example, but is not limited to, ratios, percentages, or other types of quantification formats that characterize the allele-read alignments between the first allele sequence, the second allele sequence, the first plurality of reads, and the second plurality of reads. As one example, the alignment output may include a first ratio and a second ratio. The first ratio may be, for example, a ratio of the number of alignments between the first allele
sequence and the first plurality of reads and the number of alignments between the second allele sequence and the first plurality of reads. The second ratio may be, for example, a ratio of the number of alignments between the first allele sequence and the second plurality of reads and the number of alignments between the second allele sequence and the second plurality of reads.
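The four counts and two ratios described for step 408 can be sketched as follows. This is a minimal illustration only: the function names and the input format (one `(allele, sample)` tuple per accepted alignment) are assumptions and are not taken from the disclosure.

```python
from collections import Counter


def count_alignments(alignments):
    """Tally alignments into the four allele/sample counts of step 408.

    `alignments` is an iterable of (allele, sample) tuples, where allele is
    "allele1" or "allele2" and sample is "sample1" or "sample2"
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter(alignments)
    return {
        "allele1_sample1": counts[("allele1", "sample1")],
        "allele1_sample2": counts[("allele1", "sample2")],
        "allele2_sample1": counts[("allele2", "sample1")],
        "allele2_sample2": counts[("allele2", "sample2")],
    }


def allelic_ratios(counts):
    """Return (first ratio, second ratio); None where the denominator is 0."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    first = ratio(counts["allele1_sample1"], counts["allele2_sample1"])
    second = ratio(counts["allele1_sample2"], counts["allele2_sample2"])
    return first, second
```

Returning `None` for an empty denominator is a choice made for this sketch; the pseudocount scheme described elsewhere herein is another way to avoid division by zero.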
[0125] In various instances, steps 406 and 408 are repeated for various combinations of selected ones from the first set of intron-resolution identifiers and selected ones of the second set of intron-resolution identifiers. For example, all possible combinations using each possible intron-resolution identifier for the first allele and each possible intron-resolution identifier for the second allele may be evaluated to generate alignment output. Instead of or in addition to this, HLA typing may be performed as described elsewhere herein (e.g., using any method known in the art, such as the methods described in PCT International Patent Application Publication No. WO 2022/192304, Optitype (Szolek et al. Bioinformatics, Volume 30, Issue 23, December 2014, Pages 3310-3316), or combinations of Optitype and the methods described in WO 2022/192304), thereby identifying one or two exon-resolution or intron-resolution identifiers for each of one or more HLA genes, and an alignment may only be performed at steps 406-408 using the sequences corresponding to those identified alleles (i.e., using a subject-specific reference sequence).
[0126] FIGS. 5A-5D provide non-limiting examples of WES sequence read alignments for the HLA B*51:01:01:01 allele (Allele 1, where the 8-digit identifier indicates that this is an allele within allele group 51 for the HLA-B gene, i.e., specifically the allele that encodes for the B*51:01 HLA protein but that differs from the allele for the B*51:01 HLA protein by the presence of a synonymous mutation in the coding region, but that has no difference in the noncoding region) and the HLA B*07:02:01:01 allele (Allele 2) for paired normal and tumor samples, where the tumor sample exhibits a loss of heterozygosity. FIG. 5A provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus as a function of position within the gene locus in the normal sample. FIG. 5B provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus as a function of position within the gene locus in the normal sample. FIG. 5C provides a plot of sequence reads for Allele 1 aligned to the HLA B gene locus as a function of position within the gene locus in the tumor sample. FIG. 5D provides a plot of sequence reads for Allele 2 aligned to the HLA B gene locus as a function of position within the gene locus in the tumor sample. Sequence reads that aligned to unique genomic loci
are shown in pink and blue in these figures, while sequence reads that mapped to more than one genomic locus are shown in green and yellow. As can be seen in this non-limiting example, fewer sequence reads were observed for allele 1 in the tumor sample, indicating that a loss of the allele has occurred in at least a fraction of the tumor sample. The number of aligned sequence reads in FIG. 5C may be non-zero because of stromal contamination of the tumor sample and/or tumor heterogeneity.
III.D. Detection of HLA Loss (e.g., HLA LOH)
[0127] FIGS. 6A-6B provide non-limiting schematic illustrations that compare prior copy number-based methods for detecting HLA LOH (FIG. 6A) to the allelic ratio-based method (FIG. 6B) described herein. As noted elsewhere herein, copy number-based methods for detecting HLA LOH typically rely on determining tumor-to-normal ratios of sequence read counts, e.g., T1/N1 for allele 1 and T2/N2 for allele 2, as illustrated in FIG. 6A, and using a statistical analysis to determine if the sequence read count ratio observed for a given allele, T/N, is significantly different from an expected value of 1:1. The presently disclosed methods, in contrast, are based on determining and comparing allelic ratios for both tumor-derived sequence read data (e.g., T1/T2) and normal-derived sequence read data (e.g., N1/N2), as illustrated in FIG. 6B, and using a different statistical analysis to determine if the allelic ratio for the tumor-derived data is significantly different from the allelic ratio determined for the normal-derived data. As described elsewhere herein, in some instances, the statistical analysis can comprise measuring sequence read counts or read count ratio (N1/N2) values in the sequence read data for the normal sample and sequence read counts or read count ratio (T1/T2) values in the sequence read data for the tumor sample at a plurality of non-HLA genomic loci (e.g., a plurality of heterozygous SNP loci located outside the HLA region of the genome), and fitting the N1, N2, T1, and T2 data to a multinomial model to estimate overdispersion and correct for the degree of overdispersion for a given pair of samples.
[0128] FIGS. 7A-7D provide schematic illustrations of 2 x 2 contingency tables for sequence read data (e.g., HLA sequence read data) that might be expected for different genetic modifications to a tumor genome. FIG. 7A provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in normal and tumor samples. In this diagram, N1 is the number of unique sequence reads derived from a normal sample that align to allele 1, and N2 is the number of unique sequence reads derived from the normal sample that align to allele 2. T1 is the number of unique sequence reads derived
from a tumor sample (corresponding to the matched normal sample) that align to allele 1, and T2 is the number of unique sequence reads derived from the tumor sample that align to allele 2. FIG. 7B provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a double deletion, as indicated by the lighter shading for both allele 1 and allele 2 in the tumor sample in comparison to the shading for allele 1 and allele 2 in the normal sample. FIG. 7C provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a single deletion, as indicated by the lighter shading for allele 1 in the tumor sample in comparison to that for the normal sample. FIG. 7D provides a schematic diagram illustrating a non-limiting example of a 2 x 2 contingency table for HLA sequence read counts in a tumor sample that exhibits a copy neutral loss of heterozygosity (e.g., conversion of allele 1 to another copy of allele 2), as indicated by the lighter shading for allele 1 and the darker shading for allele 2 in the tumor sample in comparison to that for the normal sample.
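To make the contingency-table scenarios concrete, the sketch below computes the log of the tumor allelic ratio divided by the normal allelic ratio for a few invented 2 x 2 tables. All counts are hypothetical and chosen only for illustration; they are not taken from FIGS. 7A-7D or from any sample, and the function name is an assumption.

```python
import math


def log_odds_ratio(n1, n2, t1, t2):
    """Natural log of (T1/T2) divided by (N1/N2) for a 2 x 2 table."""
    return math.log((t1 / t2) / (n1 / n2))


# Hypothetical read counts, ordered as (N1, N2, T1, T2).
tables = {
    "no alteration": (100, 100, 100, 100),
    "double deletion": (100, 100, 50, 50),
    "single deletion of allele 1": (100, 100, 20, 100),
    "copy-neutral LOH (allele 1 lost)": (100, 100, 20, 180),
}

for label, (n1, n2, t1, t2) in tables.items():
    print(f"{label}: log odds ratio = {log_odds_ratio(n1, n2, t1, t2):.3f}")
```

In this toy example, an equal reduction of both alleles leaves the allelic ratio unchanged (log odds ratio of 0), whereas losing one allele shifts the log odds ratio away from 0. The non-zero tumor count for the lost allele mirrors the residual reads attributed above to stromal contamination and/or tumor heterogeneity.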
[0129] FIGS. 8A-8C provide non-limiting examples of 2 x 2 contingency tables for HLA sequence read counts that were observed for cells from ovarian cell line 23882 that exhibited different genetic modifications in a tumor genome. FIG. 8A provides a non-limiting example of sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA A*26:01 allele (allele 2), as indicated by the greatly decreased number of sequence reads that aligned to the HLA A*26:01 allele (allele 2) in the tumor sample in comparison to the number of sequence reads that aligned to the A*23:01 allele (allele 1) in both samples, and in comparison to the number of sequence reads that aligned to the HLA A*26:01 allele (allele 2) in the normal sample. FIG. 8B provides a non-limiting example of sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA B*35:01 allele (allele 1), as indicated by the greatly decreased number of sequence reads that aligned to the HLA B*35:01 allele (allele 1) in the tumor sample in comparison to the number of sequence reads that aligned to allele 1 in the normal sample, and in comparison to the number of sequence reads that aligned to the HLA B*49:01 allele (allele 2) in both the normal and tumor samples. FIG. 8C provides a non-limiting example of sequence read counts derived from an ovarian cell line 23882 sample that exhibits a single deletion of the HLA C*04:01 allele (allele 1), as indicated by the greatly decreased number of sequence reads that aligned to the HLA C*04:01 allele (allele 1) in the tumor sample in comparison to the number of sequence reads that aligned to allele 1 in the normal sample, and in comparison
to the number of sequence reads that aligned to the HLA C*07:01 allele (allele 2) in both the normal and tumor samples.
[0130] FIGS. 9A-9B provide non-limiting schematic illustrations of the HLA (FIG. 9A) and non-HLA (FIG. 9B) genomic loci used to determine sequence read counts for a subject and evaluate allelic ratios, which are then used as input for a statistical model used to detect HLA loss of heterozygosity in the subject, in accordance with one or more implementations of the systems and methods disclosed herein. In FIG. 9A, the two alleles (allele 1 and allele 2) for a given HLA gene are illustrated for tumor and normal samples. The ratio of sequence reads for the two alleles is determined for each sample (T1/T2 and N1/N2) based on the number of unique sequence reads that align to each allele. As noted above, a statistical analysis is required to determine if an observed difference between the two allelic ratios (T1/T2 and N1/N2) is significant. FIG. 9B illustrates a non-HLA genomic locus (e.g., a heterozygous single nucleotide polymorphism (SNP) locus). The statistical analysis used in the presently disclosed methods for determining whether or not an observed difference between the two allelic ratios (T1/T2 and N1/N2) is significant can be based on determining allele 1 / allele 2 sequence read count ratios for a plurality of such heterozygous non-HLA genomic loci (e.g., a plurality of heterozygous SNP loci). In some instances, the plurality of heterozygous non-HLA genomic loci (e.g., heterozygous non-HLA SNP loci) can comprise at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci.
[0131] FIG. 10 provides a non-limiting example of a flowchart for a process 1000 (e.g., a computer-implemented method) for detecting an HLA alteration, in accordance with one or more implementations of the systems and methods disclosed herein. Process 1000 may be implemented using the evaluation system 100 described in FIG. 1. For example, process 1000 or portions thereof may be performed by allele type generator 104 and HLA loss evaluator 110 (including processing by alignment analyzer 106 and statistics generator 108) as described in FIG. 1.
[0132] At step 1002 in FIG. 10, sequence read data (e.g., read data 112 in FIG. 1) for a plurality of sequence reads derived from a tumor sample (e.g., reads 116 in FIG. 1) and a normal sample (e.g., reads 118 in FIG. 1) from a subject (e.g., a patient) are received (e.g., by one or more processors of evaluation system 100 described in FIG. 1).
[0133] The sequence read data may be derived, for example, by sequencing nucleic acid molecules (e.g. DNA) extracted from paired tumor and normal samples using a whole exome
sequencing (WES) technique, a whole genome sequencing (WGS) technique, or both. In some instances, the sequence reads may be generated using, for example, a paired-end sequencing technique.
[0134] In some instances, for example, the paired tumor and normal samples can be paired tumor and normal surgical resection samples. In some instances, the paired tumor and normal samples can be paired tumor and normal tissue biopsy samples. In some instances, the paired tumor and normal samples can comprise a tumor sample (e.g. biopsy or tumor resection) and a normal sample (e.g. biopsy or blood sample) from the same subject.
[0135] At step 1004 in FIG. 10, a subject-specific reference sequence for an HLA region of the subject’s genome (e.g., a subject-specific HLA-ome) is received (e.g., by alignment analyzer 106 in FIG. 1). In some instances, the subject-specific reference sequence for the HLA region may be generated by alignment analyzer 106 in FIG. 1 after the sequence read data has been processed by allele type generator 104, e.g., using a set of HLA allele identifiers or sequences associated with these identifiers (e.g., from an HLA sequence database).
[0136] In some instances, the subject-specific reference sequence for the HLA region can be generated, for example, by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to the HLA region of a reference genome sequence (e.g., the GRCh38 human reference genome (Genome Reference Consortium)). In some instances, the distribution of sequence reads aligned to the HLA region of the reference genome includes sequence reads that align to exons in the HLA region of the reference genome sequence. In some instances, the distribution of sequence reads aligned to the HLA region of the reference genome sequence includes sequence reads that partially align to introns of the HLA region. In some instances, the subject-specific reference sequence for the HLA region can be generated, for example, by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to a reference set of HLA allele sequences.
[0137] In some instances, for example, the subject-specific reference genome for the HLA region (the subject-specific HLA-ome) may be generated through the use of a set-covering algorithm to find a set of HLA gene alleles that best explains the observed set of normal sequence reads (e.g., normal WES sequence reads) in a sample as aligned to a set of HLA gene alleles. In some instances, the identified HLA alleles may comprise 6-digit allelic identifiers (e.g., exon-resolution identifiers as described elsewhere herein). In some instances, the identified HLA alleles may comprise 8-digit allelic identifiers (e.g., intron-resolution
identifiers as described elsewhere herein). Methods for generating 6-digit and 8-digit allelic identifiers based on a plurality of sequence reads are described in more detail in, for example, PCT International Patent Application Publication No. WO 2022/192304, which is incorporated herein by reference in its entirety.
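The set-covering idea can be illustrated with a generic greedy heuristic, shown below. This is an assumption-laden sketch and not the method of WO 2022/192304 or of the present disclosure: the greedy strategy, the function name, and the input format (a mapping from a candidate allele label to the set of read identifiers that align to it) are all chosen here for illustration.

```python
def greedy_allele_cover(reads_by_allele):
    """Greedily pick alleles until every aligned read is explained.

    `reads_by_allele` maps a candidate allele label to the set of read
    identifiers aligned to that allele (hypothetical input format).
    """
    uncovered = set().union(*reads_by_allele.values())
    chosen = []
    while uncovered:
        # Pick the allele that explains the most still-unexplained reads.
        best = max(reads_by_allele,
                   key=lambda allele: len(reads_by_allele[allele] & uncovered))
        gained = reads_by_allele[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen
```

A practical typing method would typically also constrain the number of alleles per gene (at most two) and weigh alignment quality; neither is modeled in this sketch.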
[0138] At step 1006 in FIG. 10, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene, and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene, are determined (e.g., by alignment analyzer 106 in FIG. 1) based on the sequence read data and the subject-specific reference sequence for the HLA region.
[0139] In some instances, the at least one HLA gene can comprise HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DQ, HLA-DP, or any combination thereof. In some instances, the at least one HLA gene can comprise at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more than 20 HLA genes.
[0140] At step 1008 in FIG. 10, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene are determined (e.g., by alignment analyzer 106 in FIG. 1) based on the sequence read data and the subject-specific reference sequence for the HLA region.
[0141] At step 1010 in FIG. 10, an HLA alteration (e.g., an HLA loss of heterozygosity or an HLA copy number change) is detected for the at least one HLA gene based on: (i) a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first allele and the determined number of unique tumor-derived sequence reads for the second allele, and (ii) a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first allele and the determined number of unique normal-derived sequence reads for the second allele. Note that the methods for detecting HLA LOH for at least one HLA gene described herein do not require a determination of copy number for the at least one HLA gene.
[0142] In some instances, detecting HLA loss of heterozygosity for the at least one HLA gene can comprise performing a statistical analysis of the allelic ratio data (e.g., statistical analysis 128 performed by statistics generator 108 in FIG. 1) to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value. In some instances, the expected value is the value determined for the normal allelic ratio (e.g., in order to account
for potential sequencing and/or analysis artifacts). In some instances, for example, the expected value is 1. In some instances, the expected value is associated with a similar allelic balance in the tumor and normal samples. In some instances, the expected value is associated with a log odds ratio of 0 (or similarly, an odds ratio of 1) for the allelic ratio in the tumor and normal samples.
[0143] The statistical analysis may comprise computing one or more statistical metrics. In one or more instances, the statistical analysis may include computing a t-statistic or z-score, using a binomial or multinomial model. For example, for comparing tumor-vs.-normal within one HLA allele, the methods may comprise computing a t-statistic adjusted for overdispersion, testing the null hypothesis that the tumor-vs.-normal HLA allelic log ratio matches a genomic background tumor-vs.-normal log ratio estimated by comparing aggregated tumor-vs.-normal counts at non-HLA SNP loci (e.g., SNP loci for which both tumor and normal counts are consistent with diploid genomic backgrounds). For example, a t-statistic may be calculated for each allele-specific log ratio (LR1=log(T1/N1) and LR2=log(T2/N2)). This can be calculated as (LR1-LRnull)/SE(LR1) and (LR2-LRnull)/SE(LR2), respectively, where SE refers to the standard error estimate for the indicated log ratio, and LRnull refers to the expected log ratio under a null hypothesis, such as the tumor and normal counts being equal (i.e., LRnull=0) or the tumor and normal counts corresponding to expected tumor and normal counts estimated from non-HLA SNPs. For example, counts in genome regions (e.g., bins) can be aggregated to obtain log ratios that can then be summarized across regions (e.g., using a median, mean or weighted version thereof) and used as a null hypothesis. The standard error for log(T1/N1) can be calculated as $\sqrt{1/T1 + 1/N1}$, and the standard error for log(T2/N2) can be calculated as $\sqrt{1/T2 + 1/N2}$. As another example, a t-statistic for a log odds ratio (LOR) may be calculated as the difference between an estimate of log((N1T2)/(N2T1)) at the HLA gene and an estimate of background log((N1T2)/(N2T1)) estimated from non-HLA SNP loci, divided by a standard error estimate. This can be calculated as (LOR-LORnull)/SE(LOR), where LOR=LR2-LR1=log(T2/N2)-log(T1/N1), SE is the standard error around this estimate, and LORnull refers to the expected log odds ratio under a null hypothesis, such as the alleles being balanced (i.e., LORnull=0). The standard error around the LOR estimate can be calculated prior to overdispersion adjustment as $\sqrt{1/N1 + 1/T1 + 1/N2 + 1/T2}$. Any of the above standard errors can then be adjusted for overdispersion prior to being used to calculate a t-statistic and/or a confidence interval around the estimated log ratios and/or log odds ratio. Any of the above t-statistics can be used to obtain a p-value, for example a p-value for a two-sided t-test with a standard Gaussian reference distribution. The adjustment for overdispersion may be performed by using a standard deviation for the calculation of the t-statistic or the calculation of the confidence interval that is estimated as the product of the standard error estimates above (also referred to as “naive” standard error estimates) and an overdispersion factor estimated using the counts at non-HLA SNP loci. A multinomial model may use 4 counts at each non-HLA SNP locus (counts in the tumor and normal sample for each allele at the locus, i.e., T1, T2, N1, N2), as in the example above. A binomial model may use 2 counts at each of a plurality of reference loci (non-HLA loci, which may be heterozygous SNP loci or otherwise, i.e., total T and total N counts at the locus). An overdispersion factor can be calculated using a Pearson goodness-of-fit statistic, where the overdispersion factor is the square root of said statistic, or any other method known in the art to calculate overdispersion in multinomial data. A confidence interval (CI) can be calculated for a log ratio estimate or a log odds ratio estimate as known in the art, e.g., as the estimate ± 1.96 × the standard error for a 95% confidence interval. The results or information generated by performing the statistical analysis may be referred to as a statistical output.
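The log ratio and log odds ratio statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the overdispersion factor is supplied by the caller (1.0 leaves the naive standard error unchanged), and the p-value uses a standard Gaussian reference distribution computed with `math.erfc`.

```python
import math


def log_ratio_test(t, n, lr_null=0.0, overdispersion_factor=1.0):
    """Allele-specific tumor-vs-normal log ratio, adjusted SE, statistic, 95% CI."""
    lr = math.log(t / n)
    se = math.sqrt(1 / t + 1 / n) * overdispersion_factor
    stat = (lr - lr_null) / se
    ci = (lr - 1.96 * se, lr + 1.96 * se)
    return lr, se, stat, ci


def log_odds_ratio_test(n1, n2, t1, t2, lor_null=0.0, overdispersion_factor=1.0):
    """LOR = log(T2/N2) - log(T1/N1), adjusted SE, statistic, two-sided p-value."""
    lor = math.log(t2 / n2) - math.log(t1 / n1)
    se = math.sqrt(1 / n1 + 1 / t1 + 1 / n2 + 1 / t2) * overdispersion_factor
    stat = (lor - lor_null) / se
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|stat|)).
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return lor, se, stat, p_value
```

For example, `log_odds_ratio_test(100, 100, 20, 100)` uses made-up counts in which allele 1 is depleted in the tumor; increasing `overdispersion_factor` widens the standard error and shrinks the statistic accordingly.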
[0144] In some instances, for example, the statistical analysis can comprise: (i) detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; (ii) determining, based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele and a number of unique tumor-derived sequence reads for a second SNP allele for each of the plurality of heterozygous SNP loci; (iii) determining, based on the sequence read data, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele for each of the plurality of heterozygous SNP loci; and (iv) estimating a degree of overdispersion in sequence read counts based on fitting the number of unique normal-derived sequence reads for the first SNP allele, the number of unique normal-derived sequence reads for the second SNP allele, the number of unique tumor-derived sequence reads for the first SNP allele, and the number of unique tumor-derived sequence reads for the second SNP allele determined for each of the plurality of heterozygous SNP loci to a multinomial model (e.g., an overdispersed multinomial model). Overdispersion in multinomial data can be estimated by estimating an overdispersion parameter $\varphi$ which is used to scale the variance of estimates (or its square root is used to scale the standard error of estimates). Any methods for estimating an overdispersion parameter $\varphi$ in multinomial counts data known in the art may be used. This can include, e.g., estimating $\varphi = \chi^2/(n-p)$ or $\varphi = D/(n-p)$, where $\chi^2$ is the Pearson’s goodness-of-fit statistic, D is the residual deviance, n is the total number of observations, and p is the number of parameters estimated (e.g., estimated probabilities $p_i$ for a multinomial model, which are estimated probabilities that reads are sampled from each of the two alleles and tumor or normal chromosomes). Estimating a degree of overdispersion in sequence read counts may comprise estimating, based on the counts in (ii) and (iii), locus-specific multinomial probabilities for a count to be from normal or tumor chromosomes. An estimated degree of overdispersion can be calculated using a Pearson goodness-of-fit statistic (Pearson’s chi-squared test, $\chi^2$). This can be calculated as $\chi^2 = \sum_{i=1}^{k} (O_i - N p_i)^2 / (N p_i)$, where $O_i$ are the observed counts (i.e., observed tumor counts for SNP allele 1, observed tumor counts for SNP allele 2, observed normal counts for SNP allele 1, observed normal counts for SNP allele 2), and $N p_i$ are the expected counts under a multinomial distribution where a total of N reads (where N is the sum of the observed counts for alleles 1 and 2 in the tumor and the normal samples) are sampled from categories with probabilities $p_i$ (where $p_i$ are the estimated probabilities that the counts would be from a normal chromosome - which can assume that normal chromosomes are diploid and therefore a single probability can be estimated - or from either of the tumor chromosomes, which can have distinct frequencies). An estimated degree of overdispersion can be calculated using a residual deviance statistic. This can be calculated as $D = 2 \sum_{i=1}^{k} O_i \log(O_i / (N p_i))$, where $O_i$ are the observed counts and $N p_i$ are the expected counts under a multinomial distribution where a total of N reads are sampled from categories with probabilities $p_i$. Log ratios for each HLA allele (tumor vs normal log ratio for HLA allele 1, tumor vs normal log ratio for HLA allele 2) and log odds ratios (log odds ratio for allele 2 vs allele 1) for heterozygous HLA loci can be estimated. Naive asymptotic standard errors for binomial / multinomial models can then be estimated as described above. These can then be inflated by the square root of the estimated degree of overdispersion. In embodiments, all of the LR1, LR2 and LOR estimates, as well as confidence intervals around these (optionally adjusted for overdispersion) and/or p-values of t-statistics associated with these estimates, are obtained.
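The Pearson and residual deviance statistics, and the resulting overdispersion parameter, can be sketched as follows. The function names are hypothetical, the category probabilities are assumed to have been estimated already, and the deviance uses the standard multinomial form (twice the sum of observed count times the log of observed over expected), with zero counts contributing nothing.

```python
import math


def pearson_chi2(observed, probs):
    """Sum over categories of (O_i - N*p_i)^2 / (N*p_i)."""
    total = sum(observed)
    return sum((o - total * p) ** 2 / (total * p)
               for o, p in zip(observed, probs))


def residual_deviance(observed, probs):
    """2 * sum of O_i * log(O_i / (N*p_i)); zero counts contribute 0."""
    total = sum(observed)
    return 2.0 * sum(o * math.log(o / (total * p))
                     for o, p in zip(observed, probs) if o > 0)


def overdispersion_parameter(statistic, n_observations, n_parameters):
    """Statistic divided by (n - p)."""
    return statistic / (n_observations - n_parameters)


def inflate_standard_error(naive_se, phi):
    """Scale a naive standard error by the square root of phi."""
    return naive_se * math.sqrt(phi)
```

With the illustrative counts `[30, 20, 25, 25]` (ordered T1, T2, N1, N2) and equal probabilities of 0.25, the expected count is 25 per category and the Pearson statistic is 2.0; the deviance is close to it, as is usual when observed and expected counts are similar.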
[0145] Expected counts at heterozygous loci (e.g. non HLA SNPs) under a multinomial distribution (or equivalently, probabilities for a read to be drawn from each of the alleles in each of the normal and tumor samples) can be obtained using an expectation maximization
(EM) algorithm. The EM algorithm finds values of parameters (here a set of probabilities $p_i$) that maximize the log likelihood of the observed counts. The EM algorithm may be implemented using one or more of the following assumptions: (i) the normal sample may be assumed to be diploid, such that reads from each of alleles 1 and 2 in the normal samples are expected to occur with equal frequency (i.e., denoting $p_{N1}$ and $p_{N2}$ as the probability of sampling a read from allele 1 and a read from allele 2, respectively, in the normal sample, we have $p_{N1} = p_{N2} = p_N$); (ii) the tumor sample may be allowed to have different underlying copy numbers at the locus, such that reads from alleles 1 and 2 in the tumor samples may not be expected to occur with equal frequency (i.e., denoting $p_{T1}$ and $p_{T2}$ as the probability of sampling a read from allele 1 and a read from allele 2, respectively, in the tumor sample, we have $p_{T1}$, $p_{T2}$ as two different parameters to be estimated); (iii) $0 \le p_N < 1/2$; (iv) $p_m < p_M$, where $p_m = p_{T1}$ or $p_{T2}$ depending on which of allele 1 and allele 2 is the minor allele (m) in the tumor sample, and $p_M = p_{T2}$ or $p_{T1}$ depending on which of allele 1 and allele 2 is the major allele (M) in the tumor sample, i.e., which of alleles 1 and 2 is on the chromosome that has a higher frequency (e.g., higher copy number) in the tumor sample (where the major allele is present at a higher frequency than the minor allele, and by convention allele 1 is the major or reference allele in the normal sample and allele 2 is the minor or alternative allele in the normal sample); and (v) $2 p_N + p_m + p_M = 1$, such that $p_m = 1 - 2 p_N - p_M$. When using unphased SNP data, the log likelihood of the observed counts at a locus may be estimated taking into account that at each SNP two configurations are possible, one where the counts T1 are associated with the major chromosome (and counts T2 are associated with the minor chromosome, like in the normal sample; i.e., 
T1 is associated with the major allele, $p_M = p_{T1}$, and T2 is associated with the minor allele, $p_m = p_{T2}$), and one where the counts T1 are associated with the minor chromosome (and counts T2 are associated with the major chromosome, contrary to in the normal sample; i.e., T2 is associated with the major allele, $p_M = p_{T2}$, and T1 is associated with the minor allele, $p_m = p_{T1}$). Thus for each of a plurality of loci, the EM algorithm may estimate posterior probabilities that counts T1 are associated with the major chromosome and counts T2 are associated with the minor chromosome, and that counts T2 are associated with the major chromosome and counts T1 are associated with the minor chromosome. An estimated degree of overdispersion can be calculated for each locus for each of these two possible configurations. An average or weighted average of the obtained estimates of overdispersion can then be obtained at each locus. For example, a weighted average of the two overdispersion estimates that is weighted by the probability of the respective configuration (i.e., posterior probabilities estimated by the EM
algorithm) may be used. For example, a first overdispersion estimate may be obtained using the estimated posterior probabilities that counts T1 are associated with the major chromosome and counts T2 are associated with the minor chromosome, and a second overdispersion estimate may be obtained using the estimated posterior probabilities that counts T2 are associated with the major chromosome and counts T1 are associated with the minor chromosome. A weighted average may then be obtained in which the first overdispersion estimate is weighted by the estimated posterior probabilities that counts T1 are associated with the major chromosome, and the second overdispersion estimate is weighted by the estimated posterior probabilities that counts T2 are associated with the major chromosome. In embodiments, alleles 1 and 2 may be associated with the reference and alternative alleles, respectively, in both the normal and tumor samples (e.g. instead of the major and minor chromosome / locus). In embodiments, the log likelihood for a SNP locus may be a weighted sum of log likelihoods of the counts at the locus and counts at loci within a predetermined genomic distance (e.g. moving window) or counts of a predetermined number of loci closest to the locus (referred to as “nearby SNP loci”), where weighting is based on distance between the SNP locus and the nearby SNP loci. The estimated parameters of the multinomial model at each SNP can be used to calculate expected counts at each SNP, which can in turn be used to calculate a degree of overdispersion at each SNP. In embodiments, a plurality of estimated degrees of dispersion (e.g. obtained for each of a plurality of SNP loci, e.g. using expected counts obtained using posterior probabilities from an EM model as explained above, or in any other way) are summarized (e.g. by using the mean or median of these values), and used to adjust a standard error estimate as explained above.
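The posterior weighting of the two configurations at a single unphased SNP locus can be sketched as below. This is not a full EM implementation: the multinomial parameters (`p_normal`, `p_major`, `p_minor`) are assumed to have been estimated already, the two configurations are given equal prior weight (an assumption of this sketch), and the function names are hypothetical.

```python
import math


def configuration_posteriors(t1, t2, p_major, p_minor):
    """Posterior that T1 is on the major chromosome, and that T2 is.

    The normal-sample terms are identical in both configurations and
    cancel, so only the tumor counts enter the comparison.
    """
    ll_t1_major = t1 * math.log(p_major) + t2 * math.log(p_minor)
    ll_t2_major = t1 * math.log(p_minor) + t2 * math.log(p_major)
    shift = max(ll_t1_major, ll_t2_major)  # avoid underflow in exp()
    w1 = math.exp(ll_t1_major - shift)
    w2 = math.exp(ll_t2_major - shift)
    return w1 / (w1 + w2), w2 / (w1 + w2)


def pearson_chi2(observed, probs):
    """Sum over categories of (O_i - N*p_i)^2 / (N*p_i)."""
    total = sum(observed)
    return sum((o - total * p) ** 2 / (total * p)
               for o, p in zip(observed, probs))


def weighted_overdispersion(t1, t2, n1, n2, p_normal, p_major, p_minor):
    """Posterior-weighted average of the Pearson statistic over both configurations."""
    post_t1_major, post_t2_major = configuration_posteriors(t1, t2, p_major, p_minor)
    observed = [t1, t2, n1, n2]
    chi_t1_major = pearson_chi2(observed, [p_major, p_minor, p_normal, p_normal])
    chi_t2_major = pearson_chi2(observed, [p_minor, p_major, p_normal, p_normal])
    return post_t1_major * chi_t1_major + post_t2_major * chi_t2_major
```

The parameters passed in are expected to satisfy 2*p_normal + p_minor + p_major = 1, as in assumption (v) above; the sketch does not enforce this.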
[0146] In embodiments, pseudocounts may be added to the counts of reads used in any estimation of log ratios or log odds ratios, in order to avoid division by zero. A pseudocount is a value that is added automatically to a count according to a predetermined scheme. In embodiments, a predetermined pseudocount is added to each count. In embodiments, a first pseudocount is added to the tumor counts and a second pseudocount is added to the normal counts, where the values of the first and second pseudocounts depend on the relative coverage (i.e. total number of reads over the whole genome or a portion of the genome, such as e.g. a chromosome, e.g. chromosome 6) in the tumor and normal sample. This ensures that any pseudocounts added are reflective of the tumor-to-normal global coverage and therefore do not introduce bias in the statistical estimates (e.g. log ratios, log odds ratios) that are produced from the counts. For example, a first pseudocount equal to a predetermined value multiplied by 2*R, where R is the tumor fraction (calculated as tumor-to-normal coverage ratio/(1+tumor-to-normal coverage ratio)) can be added to the tumor counts, and a second pseudocount equal to a predetermined value multiplied by 2*(1-R) can be added to the normal counts (where 1-R is the normal fraction, calculated as 1-the tumor fraction). In embodiments, the values of the first and second pseudocounts depend on the total number of reads in the respective sample across both alleles. For example, the first pseudocount value may be equal to the total number of reads at the locus in the tumor sample (T1+T2) multiplied by a factor that depends on the relative coverage between tumor and normal samples (e.g. (T1+T2)*2*R). Similarly, the second pseudocount value may be equal to the total number of reads at the locus in the normal sample (N1+N2) multiplied by a factor that depends on the relative coverage between tumor and normal samples (e.g. (N1+N2)*2*(1-R)). Alternatively, the predetermined values used when calculating the first and second pseudocount may be user defined values, such as e.g. 1. For example, the first pseudocount value may be equal to 1*2*R and the second pseudocount value may be equal to 1*2*(1-R), such that a single count is added to both the tumor and the normal counts when the tumor and normal coverage are equal (i.e. when R is 0.5).
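The coverage-scaled pseudocount scheme described above can be illustrated with a minimal Python sketch. The function names, the argument layout, and the default predetermined value of 1 are assumptions made for illustration only; the disclosure does not prescribe an implementation.

```python
def coverage_scaled_pseudocounts(tumor_coverage, normal_coverage, base=1.0):
    """Return (tumor_pseudocount, normal_pseudocount).

    tumor_coverage / normal_coverage: total read counts over the chosen
    region (e.g. chromosome 6) in each sample.
    base: the predetermined value (hypothetical default of 1).
    """
    if tumor_coverage <= 0 or normal_coverage <= 0:
        raise ValueError("coverage totals must be positive")
    ratio = tumor_coverage / normal_coverage   # tumor-to-normal coverage ratio
    r = ratio / (1.0 + ratio)                  # tumor fraction R
    return base * 2.0 * r, base * 2.0 * (1.0 - r)


def add_pseudocounts(t1, t2, n1, n2, tumor_coverage, normal_coverage, base=1.0):
    """Add the tumor pseudocount to T1, T2 and the normal pseudocount to N1, N2."""
    pt, pn = coverage_scaled_pseudocounts(tumor_coverage, normal_coverage, base)
    return t1 + pt, t2 + pt, n1 + pn, n2 + pn
```

With equal coverage the tumor fraction R is 0.5, so each count is incremented by exactly the predetermined value; with a 3:1 tumor-to-normal coverage ratio, R is 0.75 and the increments become 1.5 and 0.5.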
[0147] The overdispersion parameter estimate can be used to multiplicatively scale a standard error for a log odds ratio computed for a 2x2 contingency table as illustrated in FIG. 7A. For example, in the case of a heterozygous HLA gene, this log odds ratio is equivalent to the log ratio of the respective tumor and normal allelic ratios (i.e., the log odds ratio is the natural logarithm of the tumor allelic ratio divided by the normal allelic ratio, or the natural logarithm of the normal allelic ratio divided by the tumor allelic ratio - the two having the same absolute value). Estimating the standard error of the log odds ratio can be performed using an established approach under multinomial sampling, e.g. by calculating the square root of the sum of the inverses of the individual counts. The estimated overdispersion parameter obtained by fitting the sequence read data to an overdispersed multinomial model is used to scale the "established" standard error.
[0148] Thus, in some instances, the method can further comprise using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log ratio of the tumor allelic ratio and normal allelic ratio. In some instances, the adjusted standard error for the log odds ratio can be used to adjust a p-value or confidence interval for the log ratios and/or the log odds ratio of a detected HLA alteration (e.g., HLA loss of heterozygosity or HLA copy number change).
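As a rough illustration of the adjustment described in the two preceding paragraphs, the following Python sketch computes the log odds ratio of a 2x2 table, its standard error as the square root of the sum of inverse counts, and a version of that standard error multiplied by an overdispersion estimate. The function names and the use of a 1.96 multiplier for a 95% interval are illustrative assumptions, and all four counts are assumed to be positive (e.g. after pseudocounts have been added).

```python
import math


def log_odds_ratio(t1, t2, n1, n2):
    """log((T2/N2) / (T1/N1)) = log((N1*T2) / (N2*T1)) for the 2x2 table."""
    return math.log((n1 * t2) / (n2 * t1))


def lor_standard_error(t1, t2, n1, n2, overdispersion=1.0):
    """Square root of the sum of inverse counts, multiplicatively scaled.

    overdispersion=1.0 reproduces the unadjusted multinomial standard error.
    """
    se = math.sqrt(1.0 / t1 + 1.0 / t2 + 1.0 / n1 + 1.0 / n2)
    return overdispersion * se


def confidence_interval(estimate, se, z=1.96):
    """Symmetric interval estimate +/- z*SE (z=1.96 for roughly 95%)."""
    return estimate - z * se, estimate + z * se
```

Because the scaling is multiplicative, an overdispersion estimate above 1 widens the confidence interval without moving the point estimate, which is what makes a borderline call less likely to pass a threshold.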
[0149] In some instances, the plurality of heterozygous SNP loci can comprise at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci. In some instances, the plurality of heterozygous SNP loci can be filtered to remove artifactual SNP loci resulting from misalignment of sequence reads to the non-HLA region of the subject’s genome.
[0150] In some instances, the plurality of heterozygous SNP loci are detected using a non-sorting-based method for de-duplicating and tallying sequence read counts. For example, the non-sorting-based method for de-duplicating and tallying sequence read counts can comprise: (i) performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and (ii) performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index. In embodiments, the plurality of heterozygous SNP loci are detected using reads that have been aligned using a SNP-tolerant alignment method. The SNP-tolerant alignment method may use a predetermined set of SNPs. The predetermined set of SNPs may be known SNPs, such as e.g. SNPs obtained from a human genome polymorphism database, such as dbSNP (Sherry et al. Nucleic Acids Research, Volume 29, Issue 1, 1 January 2001, Pages 308-311; www.ncbi.nlm.nih.gov/snp/). In embodiments, the plurality of heterozygous SNP loci are detected in the normal sample. In embodiments, the plurality of heterozygous SNP loci are selected as SNP loci from a predetermined set of SNPs where at least a predetermined number of reads or proportion of reads at the locus in the normal sample includes the SNP (rather than the reference allele). For example, within a predetermined portion of the genome (such as e.g. chromosome 6 or a portion thereof, excluding the HLA locus - as explained elsewhere herein), a set of loci corresponding to a predetermined set of SNPs may be analysed in the normal sample and a subset of the set of loci may be selected for overdispersion estimation when the number or proportion of reads at the locus in the normal sample exceeds a predetermined threshold. A major genotype (allele) at each such locus may be defined as the reference allele or the allele with more counts in the normal sample. Then a 2 x 2 table of counts including T1, T2, N1 and N2 counts for each of the selected SNP loci may be obtained.
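The two-scan, non-sorting de-duplication outlined above can be sketched in Python as follows. This is only one possible reading of the two steps: the `AlignedRead` record, the choice of (chromosome, start, strand) as the position key, and the rule that the first read seen at a position is retained are all assumptions introduced here for illustration.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    # Hypothetical minimal read record; real alignments carry more fields.
    name: str
    chrom: str
    start: int
    strand: str


def flag_duplicates(reads):
    """Return a list of booleans, True where a read duplicates an earlier one."""
    # First linear scan: index each genomic position by the first read seen there.
    first_seen = {}
    for i, read in enumerate(reads):
        key = (read.chrom, read.start, read.strand)
        first_seen.setdefault(key, i)
    # Second linear scan: a read is a duplicate if the index points elsewhere.
    return [
        first_seen[(read.chrom, read.start, read.strand)] != i
        for i, read in enumerate(reads)
    ]
```

Both scans are linear in the number of reads and rely on hash lookups, so the reads never need to be sorted by coordinate.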
[0151] In some instances, the method can further comprise diagnosing or confirming a diagnosis of a disease based on a detected HLA loss (e.g., a detected loss of HLA heterozygosity or a detected change in HLA copy number) for the at least one HLA gene. In some instances, the method can further comprise identifying the subject for treatment of a disease (or identifying the subject as a candidate for treatment of a disease) based on a detected
HLA loss for the at least one HLA gene. In some instances, the method can further comprise identifying a treatment for a disease with which the subject has been diagnosed based on a detected HLA loss for the at least one HLA gene. In some instances, the method can further comprise predicting a clinical outcome for a disease with which the subject has been diagnosed based on a detected HLA loss for the at least one HLA gene. In some instances, the method can further comprise identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA loss for the at least one HLA gene. In some instances, for example, the disease can be a cancer. Examples of cancers that may be associated with HLA LOH, for example, include, but are not limited to head and neck cancer, squamous cell lung cancer, stomach adenocarcinoma, diffuse large B-cell lymphoma and colon cancer. Also described herein is a method of designing or manufacturing an immunotherapy for a subject, wherein an immunotherapy is a therapy that targets one or more selected cancer antigens present in the subject, the method comprising detecting HLA loss using a method as described herein, and selecting one or more cancer antigens that are predicted to be presented by an HLA allele that is not subject to HLA loss in the subject. Reference to an antigen being predicted to be presented by an HLA allele encompasses the antigen being predicted to bind to the HLA allele, to be presented by the HLA allele, or to be immunogenic in the context of the HLA allele. 
Such a method can comprise one or more of: identifying one or more candidate cancer antigens present in the subject, predicting whether the one or more candidate cancer antigens bind to and/or are presented by and/or are immunogenic in the context of one or more HLA alleles identified in the subject, and excluding from a set of one or more candidate cancer antigens one or more candidate cancer antigens that have been predicted to be exclusively or primarily bound by and/or presented by and/or immunogenic in the context of an HLA allele that has been determined to be subject to HLA loss using a method as described herein. A cancer antigen may be a neoantigen or a tumor associated antigen. The method may further comprise manufacturing the immunotherapy. The immunotherapy may be a vaccine. The vaccine may be an RNA-based vaccine, a DNA-based vaccine, a cell vaccine (e.g. dendritic cell vaccine) or a peptide-based vaccine. The vaccine may comprise one or more cancer antigen peptides or nucleic acids (e.g. RNA or DNA) encoding one or more cancer antigen peptides, or antigen presenting cells presenting one or more cancer antigen peptides. The immunotherapy may be a T-cell based therapy, comprising a population of T cells that recognize one or more cancer antigens. The T cells may be engineered T cells, such as e.g. T cells engineered to express a T cell receptor that specifically recognizes a cancer antigen peptide. Predicting
whether a candidate cancer antigen binds to and/or is presented by and/or is immunogenic in the context of one or more HLA alleles identified in the subject may be performed using any method known in the art, such as e.g. as described in WO 2022/016125 or as implemented in NetMHCpan-4.1 (Reynisson et al. Nucleic Acids Res. 2020 Jul 2; 48(W1): W449-W454), MHCflurry 2.0 (O’Donnell et al. Cell Systems, vol. 11, Issue 1, 22 July 2020, pp. 42-48.e7), or PRIME (Schmidt et al. Cell Rep Med. 2021 Feb 16; 2(2): 100194). An antigen may be considered to be primarily bound by and/or presented by and/or immunogenic in the context of an HLA allele when two conditions are met: (i) the binding affinity of the antigen to the HLA allele and/or the probability of presentation of the antigen by the HLA allele and/or the probability of immunogenicity of the antigen in the context of the HLA allele is above a first predetermined threshold; and (ii) the binding affinities of the antigen to all other HLA alleles in the subject that are not predicted to be subject to loss and/or the probabilities of presentation of the antigen by all other HLA alleles in the subject that are not predicted to be subject to loss and/or the probabilities of immunogenicity of the antigen in the context of all other HLA alleles in the subject that are not predicted to be subject to loss are below a second predetermined threshold. The first and second predetermined thresholds may be the same or different. The immunotherapy may be a natural killer (NK) cell therapy. The method may comprise selecting a subject for treatment with an NK cell therapy when the subject has been identified as having one or more HLA alleles subject to loss. While loss of HLA has been established as a mechanism of immune escape to T cell based immunity, susceptibility to NK cell mediated lysis has been shown to be inversely correlated with expression of surface MHC class I molecules on target cells. 
Thus, tumor cells that are subject to HLA loss (and which may therefore have reduced expression of one or more HLA alleles) may be more sensitive to NK cell therapy. The method may further comprise administering the immunotherapy to the subject. Therefore, also described herein are methods of treating a subject who has been diagnosed as having or being likely to have cancer, the methods comprising manufacturing an immunotherapy as described herein and administering the immunotherapy to the subject.
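The exclusion rule for candidate antigens described above (score above a first threshold for a lost allele, and below a second threshold for every retained allele) can be sketched as follows. The nested-dictionary layout, the function name, and the assumption that higher scores mean stronger predicted binding, presentation or immunogenicity are illustrative choices, not part of the disclosure; the scores themselves would come from a separate predictor.

```python
def filter_antigens(scores, lost_alleles, first_threshold, second_threshold):
    """Keep antigens that are not primarily presented by a lost HLA allele.

    scores: {antigen: {allele: predicted score}} (hypothetical layout,
            higher score = stronger predicted presentation).
    lost_alleles: set of alleles called as subject to HLA loss.
    """
    kept = []
    for antigen, by_allele in scores.items():
        # Scores for alleles that are not subject to loss in this subject.
        retained = [s for a, s in by_allele.items() if a not in lost_alleles]
        primarily_lost = any(
            s > first_threshold and all(r < second_threshold for r in retained)
            for a, s in by_allele.items()
            if a in lost_alleles
        )
        if not primarily_lost:
            kept.append(antigen)
    return kept
```

An antigen that scores highly for a lost allele is still kept if at least one retained allele is also predicted to present it, which matches the "exclusively or primarily" wording of the exclusion criterion.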
[0152] As described in more detail elsewhere herein, the present disclosure also contemplates systems configured to perform the disclosed methods. Such a system will generally comprise: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform any of the methods described herein. In some
instances, the system may further comprise a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject.
[0153] Non-transitory computer-readable storage media storing one or more programs are also disclosed, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform any of the methods described herein.
Examples - Statistical Analysis Data
[0154] FIG. 11 provides a non-limiting example of data for the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples from a test subject plotted as a function of the normalized genomic position in the NCI1672 chr1 region. The plot represents a compilation of the 2 x 2 contingency table sequence read count data for a plurality of heterozygous SNPs that allows one to view local trends across chromosomes in a new way (while still relying only on WES bulk read data). There are four data points plotted per genomic locus - the number of tumor-derived read counts for allele 1 and allele 2, and the number of normal-derived read counts for allele 1 and allele 2.
[0155] The alignment analyzer 106 in FIG. 1 returns allele-specific sequence read counts for heterozygous SNPs in non-HLA regions of the subject’s genome. This allows one to infer regions with genomic copy number changes visually and also analytically. The plot in FIG. 11 shows the fraction of tumor-derived sequence reads aligned to allele 1 plotted in orange, and the fraction of normal-derived sequence reads aligned to allele 1 plotted in blue. The SNPs are unphased, so the data generates mirror-image “Rorschach” plots. The trend lines indicate the local estimate for allele 1 to allele 2 read count ratios. The large deviation between the tumor trend line and the normal trend line in, for example, the fractional position range of 0.6 to 1.0 within the NCI1672 chr1 region suggests large copy number changes in this tumor sample. The variability across the trend lines is used to infer the degree of overdispersion for the sample using a statistical analysis as described elsewhere herein. The alignment analyzer may use a SNP-tolerant alignment method. A SNP-tolerant alignment method is a method that allows alignment of sequencing reads to a plurality of possible genotypes at one or more polymorphic sites. In embodiments, a SNP-tolerant alignment method aligns reads not just to a single reference sequence, but to a reference ‘space’ of all possible combinations of major and minor alleles from a set of known SNPs. A set of known SNPs may be obtained from a human genome polymorphism database, such as e.g. dbSNP. The use of a SNP-tolerant alignment method to a reference space instead of a single reference sequence avoids treating minor alleles as mismatches and thereby results in a more accurate alignment and hence more accurate read counts. Examples of SNP-tolerant alignment methods include the GSNAP aligner described in Wu & Nacu, Bioinformatics, Volume 26, Issue 7, April 2010, Pages 873-881; and graph-based aligners (see e.g. Rakocevic et al. Nature Genetics volume 51, pages 354-362, 2019). The plots in FIGS. 20A-20D show read counts (total vs reference allele counts) for a representative window of 51 known SNPs in a normal sample and a matched tumor sample using a standard reference genome-based alignment method (FIG. 20A and FIG. 20C) and using a SNP-tolerant alignment method (FIG. 20B and FIG. 20D). The plots on the left (standard reference genome) show that using a standard reference genome alignment method can cause a bias towards the reference allele (showing as a deviation from the diagonal in the normal sample) which is corrected by the use of the SNP-tolerant alignment method. The use of the SNP-tolerant alignment method also identified additional SNPs that were missed entirely in the standard reference genome alignment method (shown as additional points in a different color in the plots on the right). Further, the use of the SNP-tolerant alignment method also makes it easier to detect loss of heterozygosity, as indicated by the shift away from the diagonal in the tumor sample (bottom right plot), compared to the standard reference genome alignment method (bottom left plot).
[0156] FIG. 12 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the B34996 chr1 region. Again, the plot in FIG. 12 shows the fraction of tumor-derived sequence reads aligned to allele 1 plotted in orange, and the fraction of normal-derived sequence reads aligned to allele 1 plotted in blue. The trend lines again indicate the local estimate for allele 1 to allele 2 read count ratios. Here, the large deviations between the tumor trend line and the normal trend line suggest that large copy number changes have occurred in the tumor sample in the fractional position ranges of 0.1 to 0.2 and possibly 0.8 to 0.9 within the B34996 chr1 region for this sample.
[0157] FIG. 13 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the Pirn 1603 chr6
region. The data in this plot provides an example of a partial loss of the Pirn 1603 chr6 region in the tumor sample, e.g., in the vicinity of the HLA-DRB and HLA-DPB loci.
[0158] Prior copy number-based methods for estimating HLA loss (i.e., methods based on determining T/N sequence read count ratios) are sensitive to the choice of allelic loss threshold used to discriminate between HLA normal and HLA loss (see, e.g., PCT International Patent Application Publication No. WO 2022/192304, which is incorporated herein by reference in its entirety). This is not the case with the presently disclosed methods based on allelic ratios, thereby providing for better discrimination between HLA normal and HLA loss than aggregate tumor/normal read count ratios. Further, the currently disclosed allelic ratio-based methods can also be applied to an assessment of allele-specific loss over the non-HLA region surrounding the HLA genes (since allelic ratios can be used to estimate allelic imbalance in any region), thereby providing additional evidence for better discrimination between HLA normal and HLA loss, since genomic HLA loss is typically achieved by tumor cells deleting a genomic segment that encompasses the entire HLA region.
[0159] FIG. 14 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the NCI1672 chr6 region. In this example, there is no HLA loss observed. Labels indicate the locations of HLA genes A, DQA1, DRB1, DQB1, DPB1.
[0160] FIG. 15 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the B23882 chr6 region. In this example, HLA loss is observed across most of the B23882 chr6 region. Labels indicate the locations of HLA genes A, B, C.
[0161] FIG. 16 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the B34996 chr6 region. In this example, HLA loss was observed across the entire B34996 chr6 region. Labels indicate the locations of HLA genes A, B, C, DQA1, DRB1, DQB1, DPB1, DRA.
[0162] FIG. 17 provides a non-limiting example of the number of allele 1 sequence read counts (specified as a fraction of the total number of sequence reads) observed for tumor and normal samples plotted as a function of the normalized genomic position in the NCI20009 chr6
region. In this example, HLA loss was observed across a significant portion of the NCI20009 chr6 region. Labels indicate the locations of HLA genes B, C, DQA1, DQB1, DRB1, DPB1, DRA.
[0163] FIGS. 18A-18F provide a non-limiting example of data that was generated for sensitivity analysis testing of detection of HLA LOH based on allelic ratio data, in accordance with one or more implementations of the systems and methods disclosed herein. In this study, varying percentages of tumor DNA were mixed into a normal DNA sample. Each panel provides the tumor/normal coverage determined from sequence read data plotted as a function of chromosome 6 coordinate. FIG. 18A: tumor/normal coverage data for a 10% tumor sample. FIG. 18B: tumor/normal coverage data for a 20% tumor sample. FIG. 18C: tumor/normal coverage data for a 30% tumor sample. FIG. 18D: tumor/normal coverage data for a 40% tumor sample. FIG. 18E: tumor/normal coverage data for a 50% tumor sample. FIG. 18F: tumor/normal coverage data for a 100% tumor sample. Decreasing percentages of tumor samples make it progressively more difficult to determine HLA loss. Sensitivity in such situations is important because the percentage of tumor sample (or the percentage of tumor sample with a particular genetic alteration, such as HLA LOH) is variable in real clinical samples due to stromal contamination and/or tumor heterogeneity.
[0164] FIGS. 21A-21B provide a non-limiting example of read counts (as a fraction of total counts at the locus - i.e. at each SNP, the 2x2 contingency table is normalized to sum to 1) on chromosome 6 (comprising the HLA locus) in a sample not subject to HLA loss (FIG. 21A - where normalized counts for alleles 1 and 2 are shown as red and green dots, respectively, as a function of normalized chromosomal coordinates) and corresponding statistics obtained using a method disclosed herein (FIG. 21B). FIGS. 21C-21D provide a non-limiting example of read counts (as a fraction of total counts - as above) on chromosome 6 (comprising the HLA locus) in a sample subject to HLA loss (FIG. 21C - where normalized counts for alleles 1 and 2 are shown as red and green dots, respectively, as a function of normalized chromosomal coordinates) and corresponding statistics obtained using a method disclosed herein (FIG. 21D). Allele 1 may refer to the major allele in the normal sample and allele 2 may refer to the minor allele in the normal sample. Alternatively, when using known SNPs, allele 1 may refer to the reference allele and allele 2 may refer to the alternate allele. The black lines are estimated multinomial frequencies for the major and minor alleles calculated over a moving window. The model is constrained to have a single parameter for the
two normal alleles (as the normal gene is expected to be diploid, with equal frequencies of the two alleles at heterozygous loci), and a parameter for each of the two tumor alleles, giving rise to 3 parameters. The vertical lines in FIG. 21A and FIG. 21C show the location of the HLA region. The plots in FIG. 21B and FIG. 21D show statistical estimates (points) and associated confidence intervals (vertical bars) for each of the HLA genes indicated on the x-axis. The statistical estimates shown are, from left to right: log ratios of tumor vs normal counts for allele 1 (log(T1/N1)), log ratios of tumor vs normal counts for allele 2 (log(T2/N2)), and log odds ratio (tumor vs normal, allele 1 vs allele 2, i.e. the log odds ratio associated with the tumor-normal-allele 1-allele 2 two-by-two contingency table, log((T2/N2)/(T1/N1))=log((N1T2)/(N2T1))). Log ratios near the genome-wide tumor-to-normal ratio in the left and middle plots indicate similar counts in the tumor and normal samples for the respective allele. Log odds ratios (LORs) near zero in the right plots indicate similar allelic balance in the normal sample and the tumor sample. Log odds ratios away from zero indicate allelic imbalance that differs between the normal and tumor samples. As the normal sample is generally assumed to have no allelic imbalance, LORs away from zero indicate a likely allelic imbalance in the tumor sample. The confidence intervals shown around the estimates are the 95% confidence intervals estimated as ± 1.96SE where SE is the standard error for the estimate. The standard error estimate for the LOR can be estimated as $\sqrt{\frac{1}{N_1}+\frac{1}{N_2}+\frac{1}{T_1}+\frac{1}{T_2}}$. The standard error estimate for the log ratios can be estimated as $\sqrt{\frac{1}{N_1}+\frac{1}{T_1}}$ or $\sqrt{\frac{1}{N_2}+\frac{1}{T_2}}$ for allele 1 or allele 2, respectively. As explained above, the standard error estimates for the log ratios and/or log odds ratios can be adjusted for overdispersion using an estimated degree of overdispersion obtained from counts of non-HLA SNPs (as is the case in the data in FIGS. 21A-21B, which uses known SNPs on chr 6 - excluding the HLA locus - to estimate a degree of overdispersion using a multinomial model and a χ² statistic).
[0165] Embodiments of the methods described herein use the OR or LOR statistics to determine whether an HLA allele is subject to loss. This advantageously makes it possible to discriminate between HLA loss in a tumor sample and differences in coverage between the tumor and normal samples. A threshold on the LOR can then be applied to determine whether one of the alleles has been lost in the tumor sample (or more specifically whether there is allelic imbalance in the tumor sample that is not present in the normal sample - where the value of the LOR or a simple investigation of the read counts can resolve which allele has been lost).
Double-deletions are not detectable using the LOR but can be detected by looking for changes in both HLA alleles relative to the background tumor-vs-normal rate, i.e. by looking at the log ratios of tumor vs normal counts for allele 1 (log(T1/N1)) and allele 2 (log(T2/N2)). Instead of or in addition to comparing the LOR to a threshold, a p-value can be calculated associated with a two-sided test of the hypothesis that the true odds ratio equals one using a t-test as explained above. A threshold on the LOR can be applied by requiring that the confidence interval of the LOR is entirely above a predetermined value. The threshold may vary depending on the desired balance of false positive to false negative rates. Lower thresholds, e.g. 0.75 or below, may include more false positives than higher thresholds, e.g. 1, but higher thresholds may include more false negatives than lower thresholds. Depending on the intended use of the method, reducing false positives may be prioritized over reducing false negatives, or vice-versa. The confidence interval may be associated with a predetermined level of confidence, e.g. a 90% confidence interval, a 95% confidence interval or a 98% confidence interval. In embodiments, a 95% confidence interval may be used. As explained above, the 95% confidence interval around a LOR can be estimated as LOR ± 1.96SE where SE is the standard error for the LOR estimate, or a corrected version thereof (e.g. a standard error estimate corrected for overdispersion of counts as explained elsewhere herein).
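A threshold rule of the kind described above can be sketched in Python as follows. The function name, the returned labels, and the symmetric treatment of negative LORs are assumptions for illustration; the sign convention follows the definition LOR = log((T2/N2)/(T1/N1)) given earlier, under which a positive LOR means allele 1 is depleted in the tumor relative to allele 2.

```python
def call_allelic_loss(lor, se, threshold=0.75, z=1.96):
    """Return 'allele 1', 'allele 2', or None for a single HLA gene.

    lor: log((T2/N2)/(T1/N1)); se: its (optionally overdispersion-adjusted)
    standard error; threshold: predetermined LOR cutoff (e.g. 0.75 or 1);
    z: multiplier for the confidence level (1.96 for roughly 95%).
    """
    lower, upper = lor - z * se, lor + z * se
    if lower > threshold:
        # Whole interval above the cutoff: allele 1 under-represented in tumor.
        return "allele 1"
    if upper < -threshold:
        # Mirror case: allele 2 under-represented in tumor.
        return "allele 2"
    return None
```

Requiring the whole interval to clear the cutoff, rather than the point estimate alone, means that a larger standard error (for example after overdispersion adjustment) makes a call less likely. Losses affecting both alleles would not be flagged by this rule and would need the per-allele log ratios instead.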
[0166] Standard error estimates around OR and LOR metrics as explained above assume that the underlying counts arise from a multinomial distribution (where the probability of sampling N1, N2, T1 and T2 reads given that N1+N2+T1+T2 reads are sampled in total is given by the multinomial distribution - with variance around the counts given by n*pij*(1-pij) where pij is the probability of a DNA molecule in the tumor (j=2) or normal sample (j=1) carrying allele i=1, 2, and n is the number of reads sampled, i.e. N1+N2+T1+T2). However, in some cases there may be an overdispersion of the counts, i.e. the variability in the counts may be larger than assumed by the above tests. This means that the true standard error of the log odds ratio and odds ratio may be larger than the value mentioned above, which may lead to some false positive results. In order to correct for this, the present inventors have designed a method that uses read counts at polymorphic locations in the genome to estimate overdispersion patterns. This can then be used to correct the standard error estimate around the estimates provided above, thereby reducing the risk of false positives (i.e. samples being identified as subject to HLA allelic loss or imbalance when the data does not support such a conclusion). Overdispersion manifests itself in the presence of highly variable total number of reads (T1+T2+N1+N2), ratio of tumor vs normal total reads ((T1+T2)/(N1+N2)) and counts of
allele 1 vs allele 2 (N1+T1 vs N2+T2) across polymorphic loci. In order to estimate the amount of overdispersion in counts at polymorphic loci in a pair of tumor and normal samples, embodiments of the disclosure estimate two normal allele frequencies (i.e. pN1, pN2, corresponding to pij=p11 and pij=p21, respectively using the notation above) and two tumor frequencies (i.e. pT1, pT2, corresponding to pij=p12 and pij=p22, respectively using the notation above) for heterozygous SNPs using a sliding window (i.e. taking into account counts at SNPs within a window) over a region of the genome. For example, an overdispersion estimate can be obtained for each window and an aggregate estimate (e.g. median) can then be obtained using the estimates for each of the windows. The use of a sliding window for estimating overdispersion reduces the risk of overdispersion estimates being obtained across regions that have different copy numbers, particularly in the tumor sample. Indeed, estimation of overdispersion for a set of data assumes that the data comes from the same underlying distribution. In the case of tumor samples, copy number changes are common, which would give rise to different distributions. For example, if tumor and normal genomes were sampled to the same depth but there is an extra chromosome 1 in the tumor sample and a deletion of chromosome arm 8q, we might expect normalized counts of 0.33 and 0.67 for alleles 1 and 2 on chromosome 1, but 1 and 0 for chromosome arm 8q in pure samples (with variable counts in real samples depending on stromal contamination and tumor heterogeneity). This is handled by estimating overdispersion separately over regions that can be expected to have the same copy number. The region of the genome may comprise or consist of a portion of chromosome 6 not including the HLA locus. The region of the genome may comprise the whole of chromosome 6 excluding SNPs in HLA genes. 
The region of the genome may be any region comprising at least 100, at least 200, at least 300, at least 400, or at least 500 heterozygous SNPs. Specifically, the methods estimate locus-specific multinomial probabilities for a count to be from a normal chromosome (where the normal genome can be assumed to be diploid, such that a single probability can be estimated), and the probabilities for a count to be from either tumor chromosome, which can have distinct locus-specific frequencies. These probabilities are then used to estimate overdispersion relative to multinomial counts in the observed count data. In embodiments, the two normal allele frequencies may be assumed to be equal as the normal sample is expected to be diploid (i.e. pN1=pN2=p11=p21). Embodiments of the methods may use unphased SNP data. Thus, at each of a fine grid of values along a chromosome, polymorphic loci for tumor alleles in a window are assigned probabilistically as coming from the major or minor tumor chromosome (i.e., the chromosome with the greater or
lesser copy number at the locus, respectively) and the frequencies for these two chromosomes estimated via an EM algorithm. Maximum likelihood estimates of the probabilities are used to estimate multinomial count expected values and variances, which are in turn used to estimate local overdispersion.
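The final step of this paragraph, comparing observed counts with their multinomial expectations, can be illustrated with a simplified Python sketch. It deliberately skips the EM step: the fitted probabilities are taken as given, and each locus is assumed to be already oriented so that counts are ordered (T1, T2, N1, N2). The Pearson chi-square-over-degrees-of-freedom estimator, the per-locus degrees of freedom, and the median summary are assumptions chosen for illustration; the disclosure does not fix these details, nor whether the resulting factor is applied to the standard error directly or after a square root.

```python
from statistics import median


def pearson_dispersion(counts, probs):
    """Pearson chi-square divided by degrees of freedom for one locus.

    counts: observed (T1, T2, N1, N2); probs: fitted multinomial
    probabilities in the same order (assumed positive, summing to 1).
    A value near 1 is consistent with multinomial sampling.
    """
    n = sum(counts)
    chi2 = sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, probs))
    return chi2 / (len(counts) - 1)


def summarize_dispersion(loci, probs):
    """Median of per-locus dispersion estimates within one window."""
    return median(pearson_dispersion(counts, probs) for counts in loci)
```

For instance, counts of (40, 10, 25, 25) against expected probabilities of 0.25 each give a chi-square of 18 on 3 degrees of freedom, i.e. a dispersion of 6, whereas perfectly balanced counts give 0.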
[0167] The present inventors have found that phasing did not improve the accuracy of the LOH calling, and therefore the use of unphased SNPs results in a method that is more computationally efficient while preserving LOH calling accuracy. The difference between the observed variance in counts vs. expected variance in counts under a multinomial distribution can then be used to adjust the standard error around the OR or LOR estimates obtained as described herein. For example, standard error estimates may be multiplied by an overdispersion estimate based on this difference.
[0168] FIG. 22 shows, using synthetic cell line mixtures, that the methods of the disclosure are able to call HLA LOH down to low tumor purity samples (e.g. samples with 50% tumor content or less, i.e. tumor samples in which 50% or more of the sample comprises normal cell contamination), demonstrating the excellent sensitivity of the method. This is the case even using the more conservative threshold of 1 on LOR, with the lower threshold of 0.75 having even higher sensitivity of LOH detection.
[0169] Synthetic cell line mixtures were obtained which comprise decreasing percentages of tumor cell lines with known HLA loss (specifically, ovarian cancer cell line B23882 with a loss of an HLA-A, an HLA-B and an HLA-C allele, tumor cell line B34996 with a loss of an HLA-B and an HLA-C allele, lung cancer cell line NCI-H1672 homozygous for all of HLA-A, -B and -C, and adenocarcinoma cell line NCI-H2009 homozygous for HLA-A, and with a loss of an HLA-B and an HLA-C allele) relative to matched normal cells (respectively germline cell lines corresponding to the tumor cell lines B23882 and B34996, and NCI-BL1672 and NCI-BL2009 derived from blood and corresponding to NCI-H1672 and NCI-H2009). These were sequenced and HLA loss was called as described herein. The plots show for each cell line the number of HLA LOH events in HLA-A, -B, -C that were called (expected numbers are 3 in B23882 mixtures, 2 in B34996 mixtures, none in NCI-H1672 mixtures, and 2 in NCI-H2009 mixtures) as a function of the tumor purity of the synthetic mixture, using a LOR threshold of 0.75 and using a more conservative LOR threshold of 1. The results show that in the B23882 mixtures, all 3 known HLA LOH events are still called with sample purities as low as 40% with the 0.75 threshold, and as low as 50% with the more conservative threshold of 1. In the B34996 mixtures
both expected HLA LOH events are successfully called down to purities of 20% and 30%, respectively, for the 0.75 and 1 LOR thresholds. In the NCI-H2009 mixtures, both expected HLA LOH events are successfully called down to purities of 10% and 20%, respectively, for the 0.75 and 1 LOR thresholds. This demonstrates excellent robustness and sensitivity. This was further validated in a cohort of 50 urothelial carcinoma patients (Balar et al. The Lancet, Volume 389, Issue 10064, p67-76, January 07, 2017) with known HLA status from HLA-probe-enriched SNP arrays, showing a sensitivity of 1 and a specificity of 0.93 with the LOR threshold of 0.75 (i.e. LOH if LOR >=0.75), and a sensitivity of 0.86 and specificity of 0.96 with the LOR threshold of 1 (i.e. LOH if LOR >=1). By comparison, the SpecHLA method described in Wang et al. (Cell Reports Methods, Volume 3, Issue 9, 100589, September 25, 2023) achieved lower sensitivities of 0.82 and 0.79 on the same data.
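By way of non-limiting illustration only, the threshold rule stated above (LOH if LOR >= threshold) and the resulting sensitivity and specificity against a reference status may be sketched as follows. The function names are hypothetical, and the sketch is not tied to any particular dataset.

```python
def call_loh(lor, threshold=0.75):
    """Call LOH when the log odds ratio meets or exceeds the threshold."""
    return lor >= threshold


def sensitivity_specificity(lors, truth, threshold):
    """Compare threshold-based calls with a reference LOH status.

    lors: LOR estimates, one per HLA gene and sample.
    truth: booleans, True where the reference method reports LOH.
    Returns (sensitivity, specificity); a value is NaN when undefined.
    """
    tp = fn = tn = fp = 0
    for lor, is_loh in zip(lors, truth):
        called = call_loh(lor, threshold)
        if is_loh and called:
            tp += 1
        elif is_loh:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

Raising the threshold can only remove calls, so sensitivity cannot increase and specificity cannot decrease as the threshold moves from 0.75 to 1.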
III.E. Decision-Making Based on Evaluation of HLA Loss
[0170] The information provided by the processes described above (e.g., process 1000 depicted in FIG. 10) can be used to make various types of decisions with respect to at least one of treating or predicting the progression or outcome of a disease such as a tumor or cancer. In one or more instances, these processes provide a way of detecting (e.g., estimating or determining) HLA loss (e.g., HLA LOH) in one sample as compared to another sample.
[0171] Once HLA loss (e.g., loss of one or more HLA alleles) has been detected, this information can be used to develop and/or personalize immunotherapy, including T cell therapy. For example, T cell cancer therapy can be personalized to account for the loss of certain HLA alleles that would prevent, for example, T cells from reacting to a neoantigen associated with those HLA alleles. Thus, it would be important to develop a T cell therapy that can be activated in a subject based on HLA alleles for which loss of expression (e.g., via absence of the HLA alleles in at least part of the tumor cell population, or reduced expression) has not been detected.
[0172] For example, peptides may be presented at the surface of a cell by one or more of a subject's HLA alleles (e.g., HLA-A, HLA-B, HLA-C) for immune surveillance. If an antigen (e.g., peptide) appears foreign to the immune system, that cell is killed by the immune system. If HLA allele loss has been detected, a prediction can be made regarding which foreign antigens would have been presented by the lost HLA allele. This type of prediction would help refine the selection of foreign antigens used as targets for tumor therapy (e.g., tumor vaccines). In some cases, if a subject has lost many or all HLA alleles, a
determination may be made that the subject is not a good candidate for a tumor vaccine and should be considered for other types of therapies (e.g., therapies involving NK cells).
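By way of non-limiting illustration only, the refinement of candidate antigens described above may be sketched as follows. All names are hypothetical: the sketch assumes that an upstream predictor has already paired each candidate peptide with the HLA allele expected to present it, and it flags a subject as unsuitable for a vaccine only in the extreme case where no allele is retained (the "many" case would need a policy that the disclosure does not specify).

```python
def filter_candidates(candidates, germline_alleles, lost_alleles):
    """Split candidate antigens by whether their presenting allele is retained.

    candidates: list of (peptide, presenting_allele) pairs (hypothetical
        output of an upstream presentation predictor).
    germline_alleles: HLA alleles typed in the normal sample.
    lost_alleles: alleles for which HLA loss was detected in the tumor.
    Returns (kept, dropped, has_retained_allele).
    """
    lost = set(lost_alleles)
    retained = {a for a in germline_alleles if a not in lost}
    kept = [(p, a) for p, a in candidates if a in retained]
    dropped = [(p, a) for p, a in candidates if a in lost]
    # With no retained allele, no target can be selected from these genes.
    return kept, dropped, bool(retained)
```

Candidates whose presenting allele appears in neither list are left out of both outputs, which makes typing mismatches visible rather than silently keeping them.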
[0173] Neoantigen vaccines can prime a subject's T cells to recognize and attack cancer cells expressing one or more particular tumor neoantigens. This approach can generate a tumor-specific immune response that spares healthy cells while targeting tumor cells. The individualized vaccine may be engineered or selected based on the information generated by the various embodiments described above.
[0174] An immunotherapy such as, for example, without limitation, a cancer treatment may include collecting a sample (e.g., a blood sample) from a subject. T cells can be isolated and stimulated. The isolation can be performed using, for example, density gradient sedimentation (e.g., and centrifugation), immunomagnetic selection, and/or antibody-complex filtering. The stimulation may include, for example, antigen-independent stimulation, which may use a mitogen (e.g., PHA or Con A) or anti-CD3 antibodies (e.g., to bind to CD3 and activate the T-cell receptor complex) and anti-CD28 antibodies (e.g., to bind to CD28 and stimulate T cells). A set of peptides (e.g., mutant peptides) can be selected to use in the treatment of the subject based on the information provided by the various embodiments described above corresponding to predictions as to whether and/or an extent to which each of the set of peptides would bind to an MHC molecule (or HLA molecule) of the subject, be presented by the MHC molecule of the subject and/or trigger an immune response in the subject. For example, the set of peptides can be selected based on the detection of HLA loss within one or more tumor samples. This HLA loss may include the loss of one or more HLA alleles.
[0175] In some instances, the set of peptides (or precursors thereof) can be used to produce antigen-specific T cells, e.g. T cells specific for a mutant peptide (for example, a neoantigen). For example, peripheral blood T cells can be isolated from a subject and contacted with one or more mutant peptides or tumor-associated peptides to induce, identify, select or enrich the T cells for mutant peptide-specific T-cell populations that can be administered to a subject. In some examples, the T cell receptor sequence of the mutant/tumor-associated peptide-reactive T cells can be sequenced. Once the T-cell receptor sequence (e.g., amino-acid T-cell receptor sequence) is obtained, T cells can be engineered to include the T cell receptor that specifically recognizes the mutant peptide. These engineered T cells can then be administered to a subject. See, e.g., Matsuda et al. "Induction of Neoantigen-Specific Cytotoxic T Cells and Construction of T-cell Receptor Engineered T Cells for Ovarian Cancer," Clin. Cancer Res. 1-11 (2018), hereby incorporated
by reference in its entirety for all purposes. The T cells can be expanded in vitro and/or ex vivo prior to administration to a subject. The subject may then be administered (e.g., infused with) a composition that includes the expanded population of T cells. In one or more embodiments, the treatment is administered to an individual in an amount effective to, for example, prime, activate and expand T cells in vivo.
[0176] Thus, the above examples provide some examples of different types of immunotherapies that may be developed based on HLA loss detection. The detection of HLA loss can be used to personalize immunotherapy (e.g., personalize a cancer immunotherapy), determine when to include and when to exclude an antigen that would be presented by an HLA allele as a potential target for an immunotherapy, and/or inform other decisions regarding immunotherapy. In some instances, the immunotherapy may be selected from a group consisting of a T cell therapy, a personalized cancer therapy, an antigen-specific immunotherapy, an antigen-dependent immunotherapy, a vaccine, and a natural killer (NK) cell therapy.
IV. Example Computer System
[0177] FIG. 19 provides a non-limiting example of a block diagram for a computer system 1900, in accordance with some embodiments. Computer system 1900 can be an example of one implementation for computer system 102 described above in FIG. 1.
[0178] Computer system 1900 can be a host computer connected to a network. Computer system 1900 can be a client computer or a server. As shown in FIG. 19, computer system 1900 can be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device (portable electronic device), such as a phone or tablet. The device can include, for example, one or more of processor 1910, input device 1920, output device 1930, storage 1940, and communication device 1960. Input device 1920 and output device 1930 can generally correspond to those described elsewhere herein, and they can either be connectable or integrated with the computer.
[0179] Input device 1920 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device. Output device 1930 can be any suitable device that provides output, such as a touch screen, haptics device, or speaker.
[0180] Storage 1940 can be any suitable device that provides storage, such as an electrical, magnetic, or optical memory including a RAM, cache, hard drive, or removable storage disk.
Communication device 1960 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the computer can be connected in any suitable manner, such as via a physical bus 1970 or wirelessly.
[0181] Software 1950, which can be stored in memory / storage 1940 and executed by processor 1910, can include, for example, the programming that embodies the functionality of the present disclosure (e.g., as embodied in the methods described above).
[0182] Software 1950 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of the present disclosure, a computer-readable storage medium can be any medium, such as storage 1940, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.
[0183] Software 1950 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of the present disclosure, a transport medium can be any medium that can communicate, propagate, or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared wired or wireless propagation medium.
[0184] Computer system 1900 may be connected to a network, which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.
[0185] Computer system 1900 can implement any operating system suitable for operating on the network. Software 1950 can be written in any suitable programming language, such as C, C++, Java, or Python. In various embodiments, application software embodying the
functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a web browser as a web-based application or web service, for example.
EXAMPLE EMBODIMENTS
[0186] Embodiments disclosed herein may include:
1. A method for detecting Human Leukocyte Antigen (HLA) alterations for a subject, the method comprising: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a plurality of sequence reads derived from a normal sample from the subject; receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA gene based on: a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first allele and the determined number of unique tumor-derived sequence reads for the second allele, and a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first allele and the determined number of unique normal-derived sequence reads for the second allele.
2. The method of embodiment 1, wherein the HLA alteration is an HLA loss of heterozygosity and/or a HLA allelic imbalance.
3. The method of embodiment 1, wherein the HLA alteration is an HLA copy number change.
4. The method of any one of embodiments 1 to 3, wherein the plurality of sequence reads is derived by sequencing nucleic acid molecules extracted from the tumor and normal samples from the subject using a whole exome sequencing (WES) technique.
5. The method of any one of embodiments 1 to 4, wherein detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value.
6. The method of embodiment 5, wherein the expected value is the normal allelic ratio.
7. The method of any one of embodiments 1 to 6, wherein detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for one or more of: a deviation of a log ratio of the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene from an expected value, a deviation of the log ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene from an expected value, and a deviation from an expected value of a log odds ratio corresponding to the log ratio of (i) the ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene and (ii) the ratio of the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene, optionally wherein performing a statistical analysis comprises comparing a log ratio or log odds ratio to an expected value under a null hypothesis, wherein said comparison comprises computing a test statistic and/or comparing a confidence interval around said log ratio or log odds ratio to a predetermined threshold associated with the null hypothesis, optionally wherein the null hypothesis corresponds to a log ratio and/or a log odds ratio of 0 or a log ratio corresponding to a baseline estimate.
8. The method of embodiment 7, wherein an expected value for a log ratio corresponds to an absence of imbalance between the tumor and normal samples, and/or the expected value for the log odds ratio corresponds to a lack of HLA allelic imbalance, optionally wherein the expected value for the log odds ratio corresponds to a log odds ratio of 0.
9. The method of any one of embodiments 5 to 8, wherein the statistical analysis comprises: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject; determining, for each of the plurality of heterozygous SNPs and based on the sequence read data, a number of unique tumor-derived sequence reads for a first SNP allele, a number of unique tumor-derived sequence reads for a second SNP allele, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele; optionally determining: a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci, and a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first SNP allele and the second SNP allele for each of the plurality of heterozygous SNP loci, and a baseline log ratio for each of the first and second SNP alleles; and estimating a degree of overdispersion in sequence read counts based on fitting the number of unique normal-derived sequence reads for the first SNP allele, the number of unique normal-derived sequence reads for the second SNP allele, the number of unique tumor-derived sequence reads for the first SNP allele, and the number of unique tumor-derived sequence reads for the second SNP allele determined for each of the plurality of heterozygous SNP loci to a multinomial model.
10. The method of embodiment 9, wherein estimating a degree of overdispersion in sequence read counts comprises calculating an overdispersion statistic based on the comparison of the observed counts at the plurality of heterozygous SNPs to corresponding expected sequence read counts based on fitting the multinomial model.
11. The method of embodiment 10, wherein estimating a degree of overdispersion in sequence read counts comprises calculating a χ2 statistic and/or a residual deviance (D) statistic.
12. The method of any one of embodiments 9 to 11, further comprising using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log of the product of the tumor allelic ratio and the inverse of the normal allelic ratio or the difference between the second allele log ratio and the first allele log ratio.
13. The method of embodiment 12, further comprising using the adjusted standard error for the log odds ratio to adjust a confidence interval for the log odds ratio, and/or using the adjusted standard error to determine a p-value for a test statistic associated with a test that the observed log odds ratio is different from a null hypothesis associated with absence of HLA imbalance.
14. The method of any one of embodiments 9 to 13, wherein the plurality of heterozygous SNP loci comprises at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci.
15. The method of any one of embodiments 9 to 14, wherein the plurality of heterozygous SNP loci is filtered to remove artifactual SNP loci resulting from misalignment of sequence reads to the non-HLA region of the subject’s genome.
16. The method of any one of embodiments 9 to 15, wherein the plurality of heterozygous SNP loci are detected using a non-sorting-based method for de-duplicating and tallying sequence read counts.
17. The method of embodiment 16, wherein the non-sorting based method for deduplicating and tallying sequence read counts comprises: performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index.
18. The method of any one of embodiments 9 to 17, wherein the plurality of heterozygous SNP loci are detected by aligning the sequence read data to a reference sequence using a SNP tolerant alignment method, optionally wherein the SNP tolerant alignment method uses a predetermined set of known SNPs.
19. The method of any one of embodiments 1 to 18, wherein the at least one HLA gene comprises HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DQ, HLA-DP, or any combination thereof.
20. The method of any one of embodiments 1 to 19, wherein the subject-specific reference sequence for the HLA region of the subject’s genome is generated by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to the HLA region of a reference genome sequence.
21. The method of embodiment 20, wherein the distribution of sequence reads aligned to the HLA region of the reference genome includes sequence reads that align to exons in the HLA region of the reference genome sequence.
22. The method of embodiment 21, wherein the distribution of sequence reads aligned to the HLA region of the reference genome sequence includes sequence reads that partially align to introns of the HLA region.
23. The method of any one of embodiments 1 to 22, wherein detecting an HLA loss of heterozygosity for the at least one HLA gene does not require a determination of copy number for the at least one HLA gene.
24. The method of any one of embodiments 1 to 23, wherein the tumor and normal samples from the subject comprise paired tumor and normal surgical resection samples.
25. The method of any one of embodiments 1 to 23, wherein the tumor and normal samples from the subject comprise paired tumor and normal tissue biopsy samples.
26. The method of any one of embodiments 1 to 25, further comprising diagnosing or confirming a diagnosis of a disease based on a detected HLA alteration for the at least one HLA gene.
27. The method of any one of embodiments 1 to 26, further comprising identifying the subject for treatment of a disease based on a detected HLA alteration for the at least one HLA gene.
28. The method of any one of embodiments 1 to 27, further comprising identifying a treatment for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene.
29. The method of any one of embodiments 1 to 28, further comprising predicting a clinical outcome for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene.
30. The method of any one of embodiments 1 to 29, further comprising identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA alteration for the at least one HLA gene.
31. The method of any one of embodiments 26 to 30, wherein the disease is a cancer.
32. A system comprising: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform the method of any one of embodiments 1 to 31.
33. A system comprising: a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject; one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform the method of any one of embodiments 1 to 31.
34. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform the method of any one of embodiments 1 to 31.
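By way of non-limiting illustration only, the two linear scans recited in embodiment 17 above may be sketched as follows. The function names and the read representation are hypothetical, and the sketch assumes that two reads are duplicates when they share chromosome, alignment position and strand; a practical implementation would likely use a richer key (for example, mate position for paired reads).

```python
def mark_duplicates(reads):
    """Flag duplicate reads with two linear scans and no sorting.

    reads: list of (read_id, chrom, pos, strand) tuples in any order.
    Returns a list of booleans, True where the read is a duplicate of an
    earlier read with the same (chrom, pos, strand) key.
    """
    # First scan: index the first read seen at each alignment key.
    first_seen = {}
    for i, (_read_id, chrom, pos, strand) in enumerate(reads):
        first_seen.setdefault((chrom, pos, strand), i)
    # Second scan: a read is a duplicate if its key was claimed earlier.
    return [
        first_seen[(chrom, pos, strand)] != i
        for i, (_read_id, chrom, pos, strand) in enumerate(reads)
    ]


def count_unique(reads):
    """Number of reads remaining after duplicates are removed."""
    return mark_duplicates(reads).count(False)
```

Because the index is a hash map, both scans take time proportional to the number of reads, whereas a sorting-based approach would first order the reads by position.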
[0187] The description provides preferred example embodiments only, and is not intended to limit the scope, applicability or configuration of the disclosure. Rather, the description of the preferred example embodiments will provide those skilled in the art with an enabling description for implementing various embodiments. It is understood that various changes can be made in the function and arrangement of elements without departing from the spirit and scope as set forth in the appended claims.
Claims
1. A method for detecting Human Leukocyte Antigen (HLA) alterations for a subject, the method comprising: receiving sequence read data for a plurality of sequence reads derived from a tumor sample and a plurality of sequence reads derived from a normal sample from the subject; receiving a subject-specific reference sequence for an HLA region of the subject’s genome; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique tumor-derived sequence reads for a first allele of at least one HLA gene and a number of unique tumor-derived sequence reads for a second allele of the at least one HLA gene; determining, based on the sequence read data and the subject-specific reference sequence, a number of unique normal-derived sequence reads for the first allele of the at least one HLA gene and a number of unique normal-derived sequence reads for the second allele of the at least one HLA gene; and detecting an HLA alteration for the at least one HLA gene based on: a tumor allelic ratio comprising a ratio of the determined number of unique tumor-derived sequence reads for the first allele and the determined number of unique tumor-derived sequence reads for the second allele, and a normal allelic ratio comprising a ratio of the determined number of unique normal-derived sequence reads for the first allele and the determined number of unique normal-derived sequence reads for the second allele.
2. The method of claim 1, wherein the HLA alteration is an HLA allelic imbalance, and/or a loss of heterozygosity.
3. The method of claim 1, wherein the HLA alteration is an HLA copy number change.
4. The method of any one of claims 1 to 3, wherein the plurality of sequence reads is derived by sequencing nucleic acid molecules extracted from the tumor and normal samples from the subject using a whole exome sequencing (WES) technique.
5. The method of any one of claims 1 to 4, wherein detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for a deviation of the tumor allelic ratio from an expected value.
6. The method of claim 5, wherein the expected value is the normal allelic ratio.
7. The method of any one of claims 1 to 6, wherein detecting the HLA alteration for the at least one HLA gene comprises performing a statistical analysis to determine a statistical significance for one or more of: a deviation of a log ratio of the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene from an expected value, a deviation of the log ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene from an expected value, and a deviation from an expected value of a log odds ratio corresponding to the log ratio of (i) the ratio of the number of unique tumor-derived sequence reads for the second allele of the at least one HLA gene to the number of unique tumor-derived sequence reads for the first allele of the at least one HLA gene and (ii) the ratio of the number of unique normal-derived sequence reads for the second allele of the at least one HLA gene to the number of unique normal-derived sequence reads for the first allele of the at least one HLA gene.
8. The method of claim 7, wherein an expected value for a log ratio corresponds to an absence of imbalance between the tumor and normal samples, and/or an expected value for a log odds ratio corresponds to a lack of HLA allelic imbalance.
9. The method of any one of claims 5 to 8, wherein the statistical analysis comprises: detecting a plurality of heterozygous single nucleotide polymorphism (SNP) loci in a non-HLA region of the subject’s genome based on sequence read data for a subset of the plurality of sequence reads derived from the normal sample from the subject;
determining, for each of one or more of the plurality of heterozygous SNPs and based on the sequence read data: a number of unique tumor-derived sequence reads for a first SNP allele, a number of unique tumor-derived sequence reads for a second SNP allele for each of the plurality of heterozygous SNP loci, a number of unique normal-derived sequence reads for the first SNP allele and a number of unique normal-derived sequence reads for the second SNP allele for each of the plurality of heterozygous SNP loci; and estimating a degree of overdispersion in sequence read counts based on fitting the number of unique normal-derived sequence reads for the first SNP allele, the number of unique normal-derived sequence reads for the second SNP allele, the number of unique tumor-derived sequence reads for the first SNP allele, and the number of unique tumor-derived sequence reads for the second SNP allele determined for each of the one or more of the plurality of heterozygous SNP loci to a multinomial model.
10. The method of claim 9, wherein estimating a degree of overdispersion in sequence read counts comprises calculating an overdispersion statistic based on the comparison of the observed counts at the one or more of the plurality of heterozygous SNPs to corresponding expected sequence read counts based on fitting the multinomial model.
11. The method of claim 9 or claim 10, wherein estimating a degree of overdispersion in sequence read counts comprises calculating a χ2 statistic and/or a residual deviance (D) statistic.
12. The method of any one of claims 9 to 11, further comprising using the estimated degree of overdispersion to adjust an estimated standard error for a log odds ratio formed by a log of the product of the tumor allelic ratio and the inverse of the normal allelic ratio or the difference between the second allele log ratio and the first allele log ratio.
13. The method of claim 12, further comprising using the adjusted standard error for the log odds ratio to adjust a p-value or confidence interval for the log odds ratio or a test statistic associated with a test that the observed log odds ratio is different from a null hypothesis associated with absence of HLA imbalance.
14. The method of any one of claims 9 to 13, wherein the plurality of heterozygous SNP loci comprises at least 5,000, 10,000, 15,000, 20,000, 25,000, or 30,000 heterozygous SNP loci.
15. The method of any one of claims 9 to 14, wherein the plurality of heterozygous SNP loci is filtered to remove artifactual SNP loci resulting from misalignment of sequence reads to the non-HLA region of the subject’s genome.
16. The method of any one of claims 9 to 15, wherein the plurality of heterozygous SNP loci are detected using a non-sorting-based method for de-duplicating and tallying sequence read counts.
17. The method of claim 16, wherein the non-sorting-based method for de-duplicating and tallying sequence read counts comprises: performing a first linear scan through aligned sequence reads to store a genomic position of each aligned sequence read in an index; and performing a second linear scan through the aligned sequence reads to identify duplicate sequence reads based on the index.
18. The method of any one of claims 9 to 17, wherein the plurality of heterozygous SNP loci are detected by aligning the sequence read data to a reference sequence using a SNP tolerant alignment method, optionally wherein the SNP tolerant alignment method uses a predetermined set of known SNPs.
19. The method of any one of claims 1 to 18, wherein the at least one HLA gene comprises HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DQ, HLA-DP, or any combination thereof.
20. The method of any one of claims 1 to 19, wherein the subject-specific reference sequence for the HLA region of the subject’s genome is generated by determining a set of HLA alleles based on an observed distribution of sequence reads aligned to the HLA region of a reference genome sequence.
21. The method of claim 20, wherein the distribution of sequence reads aligned to the HLA region of the reference genome includes sequence reads that align to exons in the HLA region of the reference genome sequence.
22. The method of claim 21, wherein the distribution of sequence reads aligned to the HLA region of the reference genome sequence includes sequence reads that partially align to introns of the HLA region.
23. The method of any one of claims 1 to 22, wherein detecting an HLA loss of heterozygosity for the at least one HLA gene does not require a determination of copy number for the at least one HLA gene.
24. The method of any one of claims 1 to 23, wherein the tumor and normal samples from the subject comprise paired tumor and normal surgical resection samples.
25. The method of any one of claims 1 to 23, wherein the tumor and normal samples from the subject comprise paired tumor and normal tissue biopsy samples.
26. The method of any one of claims 1 to 25, further comprising diagnosing or confirming a diagnosis of a disease based on a detected HLA alteration for the at least one HLA gene.
27. The method of any one of claims 1 to 26, further comprising identifying the subject for treatment of a disease based on a detected HLA alteration for the at least one HLA gene.
28. The method of any one of claims 1 to 27, further comprising identifying a treatment for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene.
29. The method of any one of claims 1 to 28, further comprising predicting a clinical outcome for a disease with which the subject has been diagnosed based on a detected HLA alteration for the at least one HLA gene.
30. The method of any one of claims 1 to 29, further comprising identifying the subject for inclusion in a clinical trial for treatment of a disease based on a detected HLA alteration for the at least one HLA gene.
31. The method of any one of claims 26 to 30, wherein the disease is a cancer.
32. A system comprising: one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform the method of any one of claims 1 to 31.
33. A system comprising: a sequencer for obtaining sequence read data for a plurality of sequence reads derived from a tumor sample and a normal sample from the subject; one or more processors; and a memory communicatively coupled to the one or more processors and configured to store instructions that, when executed by the one or more processors, cause the system to perform the method of any one of claims 1 to 31.
34. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a system, cause the system to perform the method of any one of claims 1 to 31.
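By way of illustration only, the overdispersion estimate and standard-error adjustment recited in claims 9 to 13 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the per-SNP multinomial model is assumed to have independent sample (tumor/normal) and allele margins, the overdispersion statistic is assumed to be the Pearson χ² divided by its degrees of freedom, the estimate is floored at 1, and the p-value uses a normal approximation. The function names `pearson_dispersion` and `adjusted_log_odds_test` are hypothetical.

```python
import math


def pearson_dispersion(snp_counts):
    """Estimate an overdispersion factor phi from heterozygous SNP counts.

    snp_counts: iterable of (n1, n2, t1, t2) unique-read counts per SNP, i.e.
    normal allele 1, normal allele 2, tumor allele 1, tumor allele 2.
    Assumption: expected counts come from a multinomial whose cell
    probabilities are the product of the sample and allele margins.
    """
    chi2 = 0.0
    df = 0
    for n1, n2, t1, t2 in snp_counts:
        total = n1 + n2 + t1 + t2
        rows = (n1 + n2, t1 + t2)   # normal total, tumor total
        cols = (n1 + t1, n2 + t2)   # allele 1 total, allele 2 total
        if total == 0 or 0 in rows or 0 in cols:
            continue                # uninformative SNP, skipped
        observed = ((n1, n2), (t1, t2))
        for i in range(2):
            for j in range(2):
                expected = rows[i] * cols[j] / total
                chi2 += (observed[i][j] - expected) ** 2 / expected
        df += 1                     # one degree of freedom per 2x2 table
    if df == 0:
        raise ValueError("no informative SNPs")
    # Assumption: never let the correction shrink the standard error.
    return max(1.0, chi2 / df)


def adjusted_log_odds_test(n1, n2, t1, t2, phi):
    """Log odds ratio for one HLA gene with an overdispersion-adjusted test.

    Returns (log_or, adjusted_se, two_sided_p). All counts must be positive.
    """
    if min(n1, n2, t1, t2) <= 0:
        raise ValueError("all four counts must be positive")
    # Log of (tumor allelic ratio) x (inverse of normal allelic ratio).
    log_or = math.log((t1 / t2) * (n2 / n1))
    se = math.sqrt(1 / n1 + 1 / n2 + 1 / t1 + 1 / t2)
    se_adj = se * math.sqrt(phi)    # quasi-likelihood style inflation
    z = log_or / se_adj             # null hypothesis: log odds ratio = 0
    p = math.erfc(abs(z) / math.sqrt(2))
    return log_or, se_adj, p
```

In this sketch, an overdispersion factor above 1 widens the standard error by its square root, so the same observed log odds ratio yields a larger p-value than it would under the unadjusted model.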
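The two linear scans recited in claim 17 can likewise be illustrated with a short sketch. It assumes, purely for illustration, that a duplicate is any read sharing chromosome, start position, and strand with an earlier read, and that the index is an in-memory hash map; the `AlignedRead` type and `mark_duplicates` function are hypothetical, and a practical pipeline may key on additional fields.

```python
from typing import List, NamedTuple, Sequence


class AlignedRead(NamedTuple):
    """Hypothetical minimal record for an aligned sequence read."""
    name: str
    chrom: str
    pos: int
    strand: str


def mark_duplicates(reads: Sequence[AlignedRead]) -> List[bool]:
    """Flag duplicate reads using two linear scans and no sorting."""
    index = {}
    # First scan: record the ordinal of the first read seen at each position.
    for i, read in enumerate(reads):
        index.setdefault((read.chrom, read.pos, read.strand), i)
    # Second scan: a read is a duplicate if an earlier read owns its position.
    return [
        index[(read.chrom, read.pos, read.strand)] != i
        for i, read in enumerate(reads)
    ]


reads = [
    AlignedRead("A", "chr6", 100, "+"),
    AlignedRead("B", "chr6", 100, "+"),  # same position and strand as A
    AlignedRead("C", "chr6", 100, "-"),  # opposite strand, kept
    AlignedRead("D", "chr6", 250, "+"),
]
flags = mark_duplicates(reads)
unique_reads = [r for r, dup in zip(reads, flags) if not dup]
```

Because both scans are single passes over the reads with constant-time hash lookups, the expected cost grows linearly with the number of reads rather than with the n log n cost of a sort-based approach.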
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463637294P | 2024-04-22 | 2024-04-22 | |
| US63/637,294 | 2024-04-22 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025226580A1 (en) | 2025-10-30 |
Family
ID=95743633
Family Applications (1)
| Application Number | Status | Publication | Priority Date | Filing Date | Title |
|---|---|---|---|---|---|
| PCT/US2025/025565 | Pending | WO2025226580A1 (en) | 2024-04-22 | 2025-04-21 | Methods and systems for hla loss determination |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025226580A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020168016A1 (en) * | 2019-02-12 | 2020-08-20 | Tempus Labs, Inc. | Detection of human leukocyte antigen loss of heterozygosity |
| WO2022016125A1 (en) | 2020-07-17 | 2022-01-20 | Genentech, Inc. | Attention-based neural network to predict peptide binding, presentation, and immunogenicity |
| WO2022192304A1 (en) | 2021-03-08 | 2022-09-15 | Genentech, Inc. | Estimating hla expression loss |
| US20230064530A1 (en) * | 2019-02-12 | 2023-03-02 | Tempus Labs, Inc. | Detection of Genetic Variants in Human Leukocyte Antigen Genes |
| WO2024026275A1 (en) * | 2022-07-29 | 2024-02-01 | Foundation Medicine, Inc. | Methods and systems for identifying hla-i loss of heterozygosity |
Non-Patent Citations (11)
| Title |
|---|
| BALAR ET AL., THE LANCET, vol. 389, 7 January 2017 (2017-01-07), pages 67 - 76 |
| MATSUDA ET AL.: "Induction of Neoantigen-Specific Cytotoxic T Cells and Construction of T-cell Receptor Engineered T Cells for Ovarian Cancer", CLIN. CANCER RES., vol. 1-11, 2018 |
| O'DONNELL ET AL., CELL SYSTEMS, vol. 11, 22 July 2020 (2020-07-22), pages 42 - 48 |
| RAKOCEVIK ET AL., NATURE GENETICS, vol. 51, 2019, pages 354 - 362 |
| REYNISSON ET AL., NUCLEIC ACIDS RES., vol. 48, no. W1, 2 July 2020 (2020-07-02), pages W449 - W454 |
| RIMMER ET AL., NAT GENET., vol. 46, no. 8, August 2014 (2014-08-01), pages 912 - 918 |
| SCHMIDT ET AL., CELL REP MED., vol. 2, no. 2, 16 February 2021 (2021-02-16), pages 100194 |
| SHERRY ET AL., NUCLEIC ACIDS RESEARCH, vol. 29, 1 January 2001 (2001-01-01), pages 308 - 311 |
| SZOLEK ET AL., BIOINFORMATICS, vol. 30, December 2014 (2014-12-01), pages 3310 - 3316 |
| WANG, CELL REPORTS METHODS, vol. 3, 25 September 2023 (2023-09-25), pages 100589 |
| WU AND NACU, BIOINFORMATICS, vol. 26, April 2010 (2010-04-01), pages 873 - 881 |
Similar Documents
| Publication | Title |
|---|---|
| JP7668255B2 (en) | Methods and processes for non-invasive assessment of genetic mutations |
| JP7773301B2 (en) | Methods and processes for non-invasive assessment of genetic variation |
| US11081210B2 (en) | Detection of human leukocyte antigen loss of heterozygosity |
| JP6561046B2 (en) | Methods and treatments for non-invasive assessment of genetic variation |
| JP2024029174A (en) | Systems and methods for detection and treatment of diseases exhibiting disease cell heterogeneity and communicating test results |
| Hanscombe et al. | Genetic fine mapping of systemic lupus erythematosus MHC associations in Europeans and African Americans |
| CN108624650B (en) | Method for judging whether solid tumor is suitable for immunotherapy and detection kit |
| Sung et al. | Integrative analysis of risk factors for immune-related adverse events of checkpoint blockade therapy in cancer |
| US20230154563A1 (en) | Detection of Human Leukocyte Antigen Loss of Heterozygosity |
| Rabinowitz et al. | Bayesian-based noninvasive prenatal diagnosis of single-gene disorders |
| Quach et al. | Living in an adaptive world: Genomic dissection of the genus Homo and its immune response |
| WO2017156290A9 (en) | A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing |
| Cagliani et al. | Ancient and recent selective pressures shaped genetic diversity at AIM2-like nucleic acid sensors |
| Cagliani et al. | A trans-specific polymorphism in ZC3HAV1 is maintained by long-standing balancing selection and may confer susceptibility to multiple sclerosis |
| Moutsianas et al. | Multiple Hodgkin lymphoma–associated loci within the HLA region at chromosome 6p21.3 |
| WO2017218798A1 (en) | Systems and methods for diagnosing familial hypercholesterolemia |
| Jo et al. | Distant regulatory effects of genetic variation in multiple human tissues |
| EP4363616A1 (en) | Detection of human leukocyte antigen loss of heterozygosity |
| US20230420076A1 (en) | Estimating hla expression loss |
| JP2025019066A (en) | Composite biomarkers for cancer immunotherapy |
| WO2025226580A1 (en) | Methods and systems for hla loss determination |
| WO2017058904A1 (en) | Linkage disequilibrium method and database |
| CN114231628A (en) | A marker combination for predicting the efficacy of immune checkpoint inhibitors in gastrointestinal tumors and its application |
| Dibble | Investigating the genetic and immunological aetiology of myalgic encephalomyelitis/chronic fatigue syndrome |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25725651; Country of ref document: EP; Kind code of ref document: A1 |