
WO2025090739A1 - Systems and methods for assessing similarity between samples using genotype signatures - Google Patents


Info

Publication number
WO2025090739A1
WO2025090739A1 PCT/US2024/052772 US2024052772W WO2025090739A1 WO 2025090739 A1 WO2025090739 A1 WO 2025090739A1 US 2024052772 W US2024052772 W US 2024052772W WO 2025090739 A1 WO2025090739 A1 WO 2025090739A1
Authority
WO
WIPO (PCT)
Prior art keywords
snps
sample
genotype
signature
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/052772
Other languages
French (fr)
Inventor
Ashton TENG
Subashini Srinivasan
Jim CLUNE
Vlad STERZHANOV
Samuel S. Gross
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Grail Inc
Original Assignee
Grail Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Grail Inc
Publication of WO2025090739A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the present disclosure relates generally to the field of clinical diagnostics and, more specifically, to systems and methods for verifying the identity of biological samples using genotype signatures in clinical samples.
  • systems and methods are described for utilizing genotype signatures to accurately correlate patients with obtained samples.
  • the present disclosure provides for systems and methods that may leverage genotype signatures associated with each sample to identify and/or rectify sample swaps and mislabeling, even when dealing with high sample volumes and multiple time points.
  • the computer-implemented method may include: receiving, at a computer system, genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generating, using a processor associated with the computer system, a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling, using the processor, the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing, using the processor, the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying, based on the comparing and responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identifying,
  • SNPs single nucleotide polymorphisms
  • a system for verifying similarity between biological samples may include: one or more processors; one or more computer readable media storing instructions that are executable by the one or more processors to perform operations to: receive genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generate a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assemble the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; compare the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identify, responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identify, responsive to determining that the first genotype signature does not share the threshold level of similar
  • a non-transitory computer-readable medium configured to store computer-executable instructions which, when executed by a server, cause the server to perform operations which may include: receiving genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generating a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying, based on the comparing and responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identifying, based on
  • a computer-implemented method for generating a genotype signature may include: receiving, at a computer system, genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); identifying, using a processor of the computer system, a total number of SNPs in the genomic data; retaining, using the processor and from the total number of the SNPs, a first subset of SNPs that match a reference subset of SNPs; retaining, using the processor and from the first subset of SNPs, a second subset of SNPs having a coverage depth greater than a predetermined minimum coverage threshold and smaller than a predetermined maximum coverage threshold; computing, using the processor, a variant allele frequency (VAF) for each of the second subset of SNPs; identifying, using the processor and based on a VAF calling threshold, a genetic variant associated with each of the second subset of SNP
  • VAF variant allele frequency
  • a computer-implemented method for comparing genotype signatures may include: identifying a first total number of single nucleotide polymorphisms (SNPs) that are common between a first genotype signature and a second genotype signature; determining whether the first total number of SNPs is greater than a first predetermined threshold; identifying, responsive to determining that the first total number of SNPs is greater than the first predetermined threshold, a second total number of SNPs having a matching genetic variant type between the first genotype signature and the second genotype signature; calculating, based on the identifying, a ratio of the second total number of SNPs to the first total number of SNPs; determining whether the ratio is greater than a second predetermined threshold; and identifying that the first genotype signature is a match with the second genotype signature responsive to determining that the ratio is greater than the second predetermined threshold, or identifying that the first genotype signature is not a match with the second
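The comparison steps listed above can be sketched in Python. This is a minimal illustration, not the disclosed implementation: the dictionary layout (SNP identifier mapped to an integer variant code), the function name, and both threshold parameters are assumptions, and treating "too few common SNPs" as a non-match is also an assumption because the excerpt does not say how that case is handled.

```python
from typing import Dict


def signatures_match(sig_a: Dict[str, int], sig_b: Dict[str, int],
                     min_common_snps: int, min_match_ratio: float) -> bool:
    """Return True when two genotype signatures are judged to match.

    Each signature maps a (hypothetical) SNP identifier to an integer code
    for the variant type called at that SNP.
    """
    # First total: SNPs present in both signatures.
    common = sig_a.keys() & sig_b.keys()
    if not common or len(common) <= min_common_snps:
        # Assumption: too few shared SNPs is reported as "no match".
        return False
    # Second total: shared SNPs whose variant type agrees.
    matching = sum(1 for snp in common if sig_a[snp] == sig_b[snp])
    # Ratio of the second total to the first, tested against the second threshold.
    return matching / len(common) > min_match_ratio
```

Both comparisons are strict ("greater than"), mirroring the wording of the method.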
  • a computer-implemented method for assembling a reference subset of single nucleotide polymorphisms may include: receiving, at a computer system, a dataset containing a first set of SNPs; generating, using a processor associated with the computer system and by applying one or more filters against the first set of SNPs, a second set of SNPs; and utilizing the second set of SNPs in a genotype signature encoding process.
  • FIG. 1 depicts an exemplary system environment, according to one or more embodiments of the present disclosure.
  • FIG. 2 depicts a process flow for an encoding module of a genotype signature component, according to one or more embodiments of the present disclosure.
  • FIG. 3 depicts an example illustration of the process flow of the encoding module illustrated in FIG. 2, according to one or more embodiments of the present disclosure.
  • FIG. 4 depicts a process flow for a concordance results module of the genotype signature component, according to one or more embodiments of the present disclosure.
  • FIG. 5 depicts a process flow for verifying the similarity between participant samples, according to one or more embodiments of the present disclosure.
  • FIG. 6 depicts an example computing system, according to one or more aspects of the present disclosure.
  • samples may be collected from subjects, often involving solid (e.g., tissue, bone marrow) and/or liquid biopsy (e.g., whole blood, blood plasma, urine, saliva) collection in containers.
  • a liquid biopsy, such as a blood sample, may be collected in specially designed tubes.
  • multiple samples may be collected from a participant.
  • two tubes may be collected from each participant: one designated for primary processing and analysis, and the other reserved as a backup to address potential technical failures or errors.
  • This multi-sample or dual-sample (e.g., multitube or dual-tube) approach may be used to create redundancy in case of sample loss.
  • samples may be collected from the same participant over a period of time in a longitudinal study. Once collected, the samples may undergo a series of processing steps, which may include, e.g., plasma isolation, assay procedures, and subsequent data analysis, depending on the type of biopsy sample being processed. These steps can be intricate and multifaceted, involving transfers of sample material into different containers (e.g., tubes or wells) for processing and experimentation. However, at each juncture, the potential exists for sample mislabeling, contamination, or swaps to occur, e.g., where samples from one participant are inadvertently attributed to another due to human error, miscommunication, or other factors.
  • sample processing, such as plasma isolation, may inadvertently introduce errors.
  • misalignment between sample labels and the layout of sample containment plates used in further assay processing steps may lead to incorrect sample attribution, impacting downstream analyses and conclusions drawn from the data.
  • multiple samples from the same subject, collected at different time points or processed in different ways, may need to be accurately matched to avoid confusion and error.
  • the consequences of sample swaps may include misdiagnoses, improper treatment, compromised clinical studies, and/or inaccurate research findings. Additionally, the complexity and scale of modern clinical studies may exacerbate the risk of sample swaps. More particularly, with the processing of millions of samples annually, even a small percentage of errors may result in a substantial number of sample swaps or contamination events. For instance, previous clinical studies have reported instances of large-scale sample swaps that have generated misleading results, highlighting the urgency for robust solutions to detect and prevent such errors.
  • the present disclosure provides a novel approach for sample identification and verification via utilization of a genotype signature component (GSC) of a sample analysis system, which is configured to analyze genetic information to tackle the challenges of sample mislabeling and sample swaps.
  • GSC genotype signature component
  • the GSC leverages characteristics of single nucleotide polymorphisms (SNPs) within the genomic data to uniquely identify and verify samples, thereby ensuring that they are correctly attributed to the corresponding participant.
  • the GSC may be configured to generate a genotype signature for each sample.
  • the genotype signature is a numerical representation of SNPs at a subset of positions within the genomic data of the sample.
  • the genotype signatures may then be compared against one another to determine whether they share a threshold level of similarity.
  • Samples having genotype signatures that share similarity above this threshold level may be considered to be associated with an identical participant, whereas samples having genotype signatures that share similarity below this threshold level may be considered to be associated with different participants.
  • a threshold level of similarity may be used instead of a pure match, because a pure match may not exist, and a threshold allows for flexibility in the event of, e.g., condition-related or treatment-related modifications to SNPs, rare conditions such as mosaicism, etc.
  • the concepts described herein utilize a process involving nucleic acid (e.g., DNA or RNA) genotyping, encoding, and digital signature creation to ensure accurate sample matching and detection of sample swaps or mislabeling.
  • the encoding process generates specific numeric values that are representative of different genetic variants, thereby transforming genetic data into a structured and standardized format that can be efficiently compared by computing systems in high-throughput environments.
  • the encoded numeric values simplify and accelerate the process of comparing genotype signatures between samples. More particularly, the ability to process large datasets of genotype signatures efficiently is a technical advantage over conventional techniques that involve manual comparisons. This scalability offers practical benefits for industries dealing with numerous samples, such as healthcare, research, and diagnostics.
  • RNA may alternatively or additionally be used in embodiments of the disclosure.
  • the genotype signature offers a computational solution to these challenges, enhancing the precision and efficiency of sample identification. More particularly, the genotype signature creation process involves the encoding of genetic data into compact digital genotype signatures that can be rapidly compared across large datasets using computers.
  • By transforming complex genomic information into a binary format for storage and comparison, the system enables high-speed, large-scale computations that were not feasible with traditional methods and represents a process that cannot practically be performed in the human mind.
  • the improvements may also extend to the automated linking of participants, which benefits from advanced computational techniques like matching algorithms and machine learning.
  • subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware, or any combination thereof. The following detailed description is, therefore, not intended to be taken in a limiting sense.
  • the system environment 100 may include participant (subject) 10, sample(s) 15, sample data 20, and computing device 102. Although depicted in FIG. 1 as components all belonging to a single computing device 102, it should be understood that one or more components, or portions thereof, may, in some embodiments, be integrated with or incorporated on other devices.
  • computing device 102 may be a user device that may be configured to interact with another device on which genotype signature component 105 may be incorporated.
  • operations or aspects of one or more of the components listed above may be distributed amongst one or more other components.
  • the one or more other components may be physically co-located or may be physically distributed (e.g., in a cloud computing environment).
  • the one or more components may be owned and operated by one or more owners, although the overall orchestration of the components relevant to this disclosure can be performed at the direction of a single entity. Any suitable arrangement and/or integration of the various systems and devices of the environment 100 may be used.
  • the components of the computing device 102 may be associated with a common entity (e.g., a single business or organization, etc.). Alternatively, one or more of the components may be associated with a different entity than another.
  • the computing device 102 may be a computer system such as, for example, a desktop computer, a mobile device, a tablet device, laptop computer, a hybrid device, etc.
  • the computing device 102 may include a display/user interface (UI) 102A, a processor 102B, a memory 102C, a database 102D, and/or a network interface 102E.
  • the computing device may execute, by the processor 102B, an operating system (O/S) and at least one electronic application (each stored in memory 102C).
  • the electronic application may be a desktop program, a browser program, a web client, or a mobile application program (which may also be a browser program in a mobile O/S), system control software, system monitoring software, software development tools, or the like.
  • the application may manage the memory 102C, such as a database, to store and provide genotype signatures associated with certain samples.
  • the display/UI 102A may be a touch screen or a display with other input systems (e.g., mouse, keyboard, etc.) so that the user(s) may interact with the application and/or the O/S.
  • the network interface 102E may be a TCP/IP network interface for, e.g., Ethernet or wireless communications with a network (not illustrated).
  • the processor 102B while executing the application, may generate data and/or receive user inputs from the display/UI 102A and/or receive/transmit messages to external components.
  • the electronic application executed by processor 102B of computing device 102, may generate one or many points of data that can be accessed, viewed, and/or interacted with by a user of the computing device 102.
  • the electronic application may enable users to view, edit, and control processing of sequence reads associated with received genomic data.
  • a user may further utilize the electronic application to generate and compare genotype signatures between samples, as further described herein.
  • the computing device 102 may include an electronic data system, computer-readable memory such as a hard drive, flash drive, disk, etc.
  • the computing device 102 includes and/or interacts with an application programming interface for exchanging data with other systems, e.g., one or more of the other components of the environment.
  • the computing device 102 may include and/or act as the host for an application platform (e.g., a sample comparison platform, etc.) that may be accessible by users and/or other components.
  • the processor 102B may include and/or execute instructions to implement a genotype signature component 105, which may include encoding module 105A and concordance result module 105B.
  • Encoding module 105A may be configured to encode the genetic data associated with a sample into a numerical representation and assemble those numerals into a genotype signature.
  • Concordance result module 105B may be configured to compare the generated genotype signature against one or more other genotype signatures associated with other samples. The comparison process may ultimately determine whether two samples are attributable to the same individual.
  • encoding module 105A and concordance result module 105B may both be contained within genotype signature component 105 on computing device 102.
  • encoding module 105A may reside on computing device 102 and concordance result module 105B may reside on another computing device or server (not illustrated).
  • a process for encoding genomic data from a sample into a genotype signature is provided, according to one or more aspects of the present disclosure.
  • the encoding process described herein is utilized to represent the different genetic variants found at selected SNP positions in a consistent and compact manner. Since there are different possible alleles for each SNP, the encoding process assigns specific numeric values to each possibility. This numeric representation ultimately simplifies the subsequent comparison of genotype signatures, which allows for efficient data storage, and also enables sample comparisons to be conducted quickly across a large scale of samples.
  • a total number of SNPs within sequenced sample data 20 received at encoding module 105A of genotype signature component 105 may be identified. More particularly, genetic material is sequenced from both the forward and reverse strands of the DNA. Each strand provides information about the nucleotides present at specific positions. SNPs are identified by examining differences in nucleotide sequences between the forward and reverse strands. Forward and reverse strand SNP pileup counts may be determined by analyzing the sequencing data and a pileup count may indicate the number of times a specific nucleotide is observed at a particular position in the sequenced reads.
  • positions where differences are observed are identified. These differences may include substitutions, insertions, or deletions of nucleotides.
  • the cumulative number of identified SNPs across the genome constitutes the total number of SNPs, which reflects the variability in genetic information among individuals at specific genomic positions.
  • encoding module 105A may be configured to retain a first subset of the total number of identified SNPs associated with the sample data.
  • the first subset of SNPs may be specifically chosen based on its correlation to a reference subset.
  • the reference subset may be a pre-defined collection of SNPs that have been determined to be indicative of ethnic identity or genetic variation.
  • the reference subset of SNPs may serve as a representative collection of genetic markers that are known (based on prior research, public databases, publications, etc.) to exhibit variations among different ethnic groups. These SNPs may have been previously identified as being strongly associated with different ethnic backgrounds or populations and may provide information about how genetic sequences can differ between individuals of different ethnicities.
  • the selection of the SNPs in the reference subset may be based on research and analysis. More particularly, genetic studies involving diverse populations may be conducted to identify SNPs that are highly informative of ethnic identity. Researchers may look for SNPs that are consistently different across specific ethnic groups while being relatively stable within those groups. For instance, if a reference genome shows a certain DNA base (e.g., adenine or “A”) at a particular position, researchers may have identified that in individuals of one ethnicity, that “A” may be a guanine or “G,” while in individuals of another ethnicity, it may be a thymine or “T.”
  • one or more different computational techniques/processes may be employed to generate the reference subset of SNPs from a larger SNP database. These techniques may involve one or more various extraction, filtration, reduction, and/or comparison steps. For instance, all SNPs from a database (e.g., 1000 Genomes Project, dbSNP, or other population genetics resource) may be initially extracted. In some aspects, the variability (or allele frequency) of each SNP across different populations may be then calculated, whereby SNPs with a high degree of variability (meaning they show distinct allele frequencies in different ethnic groups) may be strong candidates for inclusion in the reference subset.
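One way to picture the variability calculation is to score each SNP by how far its allele frequency differs between populations. The sketch below is only an illustration: the spread metric (maximum minus minimum frequency), the function names, and the input layout are assumptions, since the text does not name a specific statistic.

```python
from typing import Dict, List


def frequency_spread(freqs_by_population: Dict[str, float]) -> float:
    """Difference between the highest and lowest per-population allele frequency."""
    values = list(freqs_by_population.values())
    return max(values) - min(values)


def rank_by_spread(snp_freqs: Dict[str, Dict[str, float]]) -> List[str]:
    """Order SNP identifiers from most to least variable across populations."""
    return sorted(snp_freqs,
                  key=lambda snp: frequency_spread(snp_freqs[snp]),
                  reverse=True)
```

SNPs at the top of the ranking would be the stronger candidates for inclusion in the reference subset under this assumed metric.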
  • duplicates or highly correlated SNPs may be removed, and SNPs that have been extensively studied and confirmed in multiple independent research studies may be prioritized.
  • one or more validation processes may be utilized to test the selected SNPs on independent datasets or populations to ensure their reliability in distinguishing ethnic backgrounds.
  • the steps in a reference SNP subset selection process may be performed wholly by a computer.
  • machine learning algorithms may be leveraged to automatically identify and select SNPs that best differentiate between particular types of population groups.
  • one or more steps may require manual human involvement. For instance, qualified individuals (e.g., researchers, geneticists, bioinformaticians, etc.) with expertise in population genetics may review outcomes at different points in the SNP selection process and make informed decisions based thereon.
  • the reference subset of SNPs may be identified using a process that involves a plurality of sequential filtering steps. For instance, previous research and analysis may identify approximately 1 million total SNPs that are present within a sample group.
  • all insertion-deletion (“indel”) SNPs, multi-allelic SNPs, and/or guanine “G”/cytosine “C” SNPs may be removed from the original set because reliable data may not be available on those SNP types/positions.
  • all SNPs that are missing in all of the training data samples e.g.
  • a more stringent filter may be employed (e.g., a MAF filter of between 0.2 and 0.8) to only retain those SNP positions that vary within a population (e.g., derived from the “1000 Genomes Project”), thereby further reducing the SNP count to approximately 14,000.
  • a standard quality control step may be employed to remove the few SNPs not in Hardy-Weinberg equilibrium.
  • the correlated SNPs that are in significant linkage disequilibrium with each other may also be removed.
  • the SNPs remaining after the fifth and sixth filtering steps may constitute the reference subset, and may number approximately 10,000.
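Some of the sequential filters described above can be sketched as follows. This is a hedged illustration under several assumptions: the `CandidateSnp` record and its fields are hypothetical, a "G/C SNP" is interpreted as a site whose two alleles are G and C, and the 0.2 to 0.8 frequency window is applied to an alternate-allele frequency. The Hardy-Weinberg and linkage disequilibrium steps are omitted because the text gives no parameters for them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CandidateSnp:
    snp_id: str
    ref: str                 # reference allele
    alts: Tuple[str, ...]    # alternate allele(s)
    allele_freq: float       # population frequency of the alternate allele


def is_simple_biallelic(snp: CandidateSnp) -> bool:
    """True for single-base, two-allele sites (excludes indels and multi-allelic sites)."""
    return len(snp.alts) == 1 and len(snp.ref) == 1 and len(snp.alts[0]) == 1


def is_g_c_snp(snp: CandidateSnp) -> bool:
    """True when the two alleles are G and C (assumed meaning of a G/C SNP)."""
    return {snp.ref, snp.alts[0]} == {"G", "C"}


def select_reference_subset(candidates: List[CandidateSnp],
                            min_af: float = 0.2,
                            max_af: float = 0.8) -> List[CandidateSnp]:
    """Apply the type filters, then the frequency window, in sequence."""
    kept = []
    for snp in candidates:
        if not is_simple_biallelic(snp):
            continue
        if is_g_c_snp(snp):
            continue
        if not (min_af <= snp.allele_freq <= max_af):
            continue
        kept.append(snp)
    return kept
```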
  • the total number of identified SNPs associated with the sample data in step 205 may be narrowed down to the first subset of SNPs by identifying those SNP positions in the total number of SNPs associated with the sample data that are identical to the reference subset SNP positions.
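Narrowing the sample's SNPs to the reference positions amounts to a set-membership test. The sketch below assumes positions are keyed by a (chromosome, coordinate) tuple and that each SNP carries an arbitrary data record; both choices are illustrative rather than taken from the disclosure.

```python
from typing import Dict, Set, Tuple

# Hypothetical key format: (chromosome name, 1-based coordinate).
Position = Tuple[str, int]


def retain_reference_snps(sample_snps: Dict[Position, dict],
                          reference_positions: Set[Position]) -> Dict[Position, dict]:
    """Keep only the sample SNPs whose position appears in the reference subset."""
    return {pos: data for pos, data in sample_snps.items()
            if pos in reference_positions}
```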
  • the number of reference subset SNP positions may change (for example, as the available SNP positions change due to a change in assay chemistry or panel density)
  • an exemplary representative number of reference subset SNP positions utilized throughout this disclosure is 10,000.
  • 1,000,000 total SNPs may have been identified at step 205 and the creation of the first subset of SNPs in step 210 may reduce that number to approximately 10,000. It is important to note that this number of reference subset SNP positions is not limiting and, in some embodiments, the number of reference subset SNP positions may be more or less (e.g., 10, 100, 1000, etc.).
  • the first subset of SNPs associated with the sample data may be subject to a depth coverage filter (e.g., when using a DNA methylation assay). If step 215 is performed, encoding module 105A may be configured to filter out those SNPs in the first subset based on their depth of coverage. Coverage depth in sequencing data indicates how well a specific genomic position has been sequenced, representing the number of times the base at that position has been read. High coverage depth provides confidence in the accuracy of the sequenced base, while low coverage may lead to uncertainty or may indicate missing data.
  • encoding module 105A may be configured to leverage a noise model to filter out those SNPs in the first subset that may be indicative of noise and that may not be needed in the genotype signature creation process.
  • a noise model may be implemented at either step 210 or step 215 in order to filter out SNPs having insufficient depth range or that are indicative of noise.
  • two coverage thresholds may be defined: a minimum coverage threshold and a maximum coverage threshold. These thresholds may be configured to delineate the range of coverage depths that are acceptable for the selected SNPs. SNPs with coverage depths below the minimum coverage threshold may be considered to have insufficient coverage and may be considered more prone to sequencing errors or inaccuracies. SNPs with coverage depth exceeding the maximum coverage threshold may be considered to have too high coverage. More particularly, although high coverage provides confidence in the accuracy of the data, excessively high coverage may not significantly improve accuracy and in some instances, may indicate contamination.
  • the SNPs from the first subset that fall within the defined coverage thresholds may be retained for further processing as a second subset.
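The coverage filter can be sketched as a simple range check. The function name and the depth dictionary are assumptions, and no threshold values are implied by the text; the strict inequalities follow the "greater than minimum, smaller than maximum" wording used elsewhere in the disclosure.

```python
from typing import Dict


def filter_by_depth(snp_depths: Dict[str, int],
                    min_depth: int, max_depth: int) -> Dict[str, int]:
    """Keep SNPs whose coverage depth lies strictly between the two thresholds."""
    return {snp: depth for snp, depth in snp_depths.items()
            if min_depth < depth < max_depth}
```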
  • the variant allele frequency VAF
  • the VAF represents the proportion of alleles at each SNP position that differ from the reference allele. More particularly, at any given SNP position, there are two possible alleles: the reference allele (i.e., the allele present in the reference genome) and the alternate allele (the variant allele).
  • the VAF is a measure of the prevalence of the alternate allele at a specific SNP position within an individual’s genetic sequence. It may be calculated as the ratio of the number of alternate alleles to the total number of alleles at that position. Resulting VAF values range from 0 to 1. A VAF of 0 indicates that all alleles at the SNP position are reference alleles, while a VAF of 1 indicates that all alleles are alternate alleles. Accordingly, a high VAF suggests a strong presence of the alternate allele (indicating a higher likelihood of a genetic variant at that position), whereas a low VAF indicates a dominance of the reference allele.
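As a small worked sketch of the ratio described above, the function below computes a VAF from observed allele counts. Using read-level counts of reference and alternate alleles as inputs, and returning `None` when nothing was observed, are assumptions made for illustration.

```python
from typing import Optional


def variant_allele_frequency(ref_count: int, alt_count: int) -> Optional[float]:
    """Alternate-allele count divided by the total allele count at one SNP position."""
    total = ref_count + alt_count
    if total == 0:
        # No observations at this position, so no frequency can be computed.
        return None
    return alt_count / total
```

For example, 6 alternate and 6 reference observations give a VAF of 0.5, while 0 alternate observations out of 10 give a VAF of 0.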
  • the genetic variant type at each SNP in the second subset associated with the sample data may be determined based on the calculated VAF at step 220.
  • the genetic variant type may indicate whether the individual’s genetic sequence at each SNP position corresponds to a homozygous reference allele (i.e., where both alleles at the SNP position match the reference allele), a homozygous alternate allele (i.e., where both alleles at the SNP position match the alternate allele), or a heterozygous allele (i.e., where two alleles at the SNP position are different, with one being the reference allele and the other being the alternate allele).
  • the VAF calculated in step 220 for each SNP position associated with the sample data may provide a quantitative measure of the presence of the alternate allele. Based on the VAF value, the genetic variant type at each SNP position may be determined. For example, if VAF is below a first threshold (e.g., closer to 0), the alleles at the corresponding SNP position are primarily reference alleles, indicating a homozygous reference allele.
  • if the VAF is above a second threshold (e.g., closer to 1), the alleles at the corresponding SNP position are primarily alternate alleles, indicating a homozygous alternate allele. If the VAF is between the first and second thresholds, the alleles at the SNP position are a mix of reference and alternate alleles, indicating a heterozygous allele. If the VAF cannot be accurately calculated due to insufficient coverage, the variant type for that SNP position is considered missing data. The missing data designation may indicate that the genetic information at that position is not reliable due to coverage limitations.
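  • the threshold-based determination of the genetic variant type may be sketched as follows. The first and second thresholds (0.2 and 0.8), the minimum depth, and the function name are hypothetical values chosen only for illustration; the disclosure states only that the thresholds lie closer to 0 and closer to 1, respectively.

```python
def call_genotype(ref_count, alt_count, low=0.2, high=0.8, min_depth=10):
    """Classify one SNP position from its allele counts.

    low / high are the (assumed) first and second VAF thresholds;
    min_depth is an (assumed) cutoff below which the call is missing.
    """
    depth = ref_count + alt_count
    if depth == 0 or depth < min_depth:
        return "MISSING"          # VAF cannot be calculated reliably
    vaf = alt_count / depth
    if vaf < low:
        return "HOM_REF"          # primarily reference alleles
    if vaf > high:
        return "HOM_ALT"          # primarily alternate alleles
    return "HET"                  # mix of reference and alternate alleles
```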
  • encoding module 105A may assign numeric values to represent the identified genetic variant types. More particularly, the determined genetic variant types at each SNP position within the second subset are encoded into bytes for ease of computationally-aided comparison. Each genetic variant type is assigned a unique numeric value, and these values are combined to create a compact binary representation, i.e., genotype signature 25.
  • each of the genetic variant types may be assigned a specific numeric value for encoding purposes. For instance:
  • HOM_REF Homozygous Reference Allele
  • HOM_ALT Homozygous Alternate Allele
  • Diagram 30 provides a plurality of SNPs 32 (A-F) having nucleotide base designations for each allele at each SNP position.
  • the two alleles associated with SNP-1 32A are A and A
  • the two alleles associated with SNP-3 32C are T and C
  • no alleles were identified for SNP-2 32B (e.g., due to inadequate coverage).
  • Section 34 identifies the nucleotide base in the reference genome and provides a genotype designation.
  • the nucleotide base in the reference genome is “A” and is designated by 0/0.
  • the nucleotide base in the reference genome is “T” and is designated by 0/1.
  • Table 1 provides an encoding designation based on the genotype. For instance, Table 1 provides that each homozygous reference allele may be encoded as 0, each heterozygous allele may be encoded as 1, each homozygous alternate allele may be encoded as 2, and any types of missing data, such as at position SNP-2 32B, may be encoded as 3. Accordingly, the genotype signature for this illustrative sample may be 031201 (001101100001).
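  • the encoding of Table 1 may be sketched as follows, using two bits per SNP so that the illustrative signature 031201 becomes 001101100001. The function names, the string labels for the genotype categories, and the zero-padding used when packing into whole bytes are assumptions for illustration; the disclosure does not specify a byte layout.

```python
# Numeric codes from Table 1.
CODES = {"HOM_REF": 0, "HET": 1, "HOM_ALT": 2, "MISSING": 3}


def encode_signature(calls):
    """Return (digit string, bit string) using two bits per SNP."""
    digits = [CODES[call] for call in calls]
    digit_str = "".join(str(d) for d in digits)
    bit_str = "".join(format(d, "02b") for d in digits)
    return digit_str, bit_str


def pack_bits(bit_str):
    """Pack a bit string into bytes, right-padding with zeros (assumed layout)."""
    if not bit_str:
        return b""
    n_bytes = (len(bit_str) + 7) // 8
    padded = bit_str.ljust(n_bytes * 8, "0")
    return int(padded, 2).to_bytes(n_bytes, "big")


# Genotype calls corresponding to the illustrative signature 031201.
calls = ["HOM_REF", "MISSING", "HET", "HOM_ALT", "HOM_REF", "HET"]
digit_str, bit_str = encode_signature(calls)
packed = pack_bits(bit_str)
```

  • six SNPs occupy twelve bits, so this sample packs into two bytes; four SNPs fit in each byte under this assumed layout.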
  • genotype signature comparison between two samples is provided, according to one or more aspects of the present disclosure.
  • the goal of genotype signature comparison between two samples is to assess the degree of genetic similarity or dissimilarity.
  • genotype signatures 40, 45 associated with the two samples being compared may be received at a concordance result module 105B.
  • Genotype signatures 40, 45 may have been identified using the method described in reference to FIG. 2, described above.
  • a total number of SNPs at specified locations may be identified for the first and second signatures.
  • the specified locations may be selected so as to provide sufficient variability such that they can be used as identifying information.
  • the total number of SNPs at common locations may correspond to the positions where both signatures have genetic variant information.
  • Signature A has genetic variant information at SNP positions 1, 3, 5, and 7.
  • Signature B has genetic variant information at SNP positions 2, 4, 5, and 7.
  • SNP positions 5 and 7 are common SNPs because both Signature A and Signature B have genetic variant information at these positions.
  • SNP positions 1, 3, 2, and 4 are not common SNPs because only one of the signatures has genetic variant information at these positions.
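  • the identification of common SNPs in the Signature A and Signature B example may be sketched as follows. Representing each signature as a mapping from SNP position to encoded value, and the specific non-missing codes shown, are assumptions for illustration; only the positions with and without information come from the example above.

```python
MISSING = 3  # code for missing data per Table 1


def common_positions(sig_a, sig_b, missing=MISSING):
    """Positions at which both signatures carry a non-missing variant call."""
    shared = sig_a.keys() & sig_b.keys()
    return sorted(p for p in shared
                  if sig_a[p] != missing and sig_b[p] != missing)


# Signature A has calls at positions 1, 3, 5, 7; Signature B at 2, 4, 5, 7.
# The non-missing code values themselves are made up.
signature_a = {1: 0, 2: 3, 3: 1, 4: 3, 5: 2, 7: 0}
signature_b = {1: 3, 2: 1, 3: 3, 4: 0, 5: 2, 7: 1}
common = common_positions(signature_a, signature_b)
```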
  • a threshold check may be performed to determine whether the total number of common SNPs between the first and second genotype signatures is greater than a predetermined threshold.
  • This threshold check serves as a criterion for determining if there is a sufficient level of overlap between the two signatures to proceed with further comparison.
  • the predetermined threshold value may vary based on factors such as the nature of the genetic data, the goals of the comparison, and the desired level of statistical significance. For instance, the predetermined threshold for the comparison of two genotype signatures associated with samples of the same type (e.g., two cfDNA samples) may be higher than if the two genotype signatures were associated with samples of different types (e.g., one cfDNA sample vs. one tissue sample or two tissue samples). Additionally or alternatively, as another example, the predetermined threshold may be higher and more stringent if the sample analysis was associated with a disease state decision or treatment recommendation than it may be for a research study.
  • responsive to determining, at step 410, that the total number of common SNPs is not greater than the predetermined threshold, an embodiment may designate, at step 415, a result identifying that there is insufficient overlap between the two genotype signatures for a meaningful comparison. This may be due to factors such as low coverage of genetic data or limited shared genetic information. Accordingly, an insufficient overlap may halt the comparison from proceeding further.
  • conversely, responsive to determining that the total number of common SNPs is greater than the predetermined threshold, an embodiment may conclude that there was a sufficient amount of overlap between the two signatures and proceed further in the process to step 420.
  • the total number of SNPs that have identical variant calls (e.g., homozygous reference, homozygous alternate, and heterozygous) between first and second genotype signatures 40, 45 may be determined.
  • the variant call for each position in each genotype signature may be identified, and if the variant call at a position is the same in both signatures, then the total count may be incremented.
  • two signatures may be considered, Signature X and Signature Y.
  • the signatures share common SNP positions at positions 5 and 9.
  • at position 5, Signature X may have a homozygous reference call
  • Signature Y may also have a homozygous reference call. Because both signatures have the same variant call, the SNP at position 5 may be considered identical between signatures.
  • at position 9, Signature X may have a homozygous reference call
  • Signature Y may have a heterozygous call. Because the signatures have different variant calls, the SNP at position 9 may be considered not identical.
  • a concordance ratio between the two genotype signatures 40, 45 may be computed.
  • the concordance ratio provides a quantitative measure of the degree of genetic similarity between the two signatures. It may help in assessing the degree of overlap in genetic variant types and provides insight into how closely the genetic profiles of the two samples match at the common SNP positions.
  • the concordance ratio may be computed by dividing the number of identical SNPs (determined in step 420) by the total number of common SNPs (determined in step 405).
  • the concordance ratio ranges between 0 and 1, where a concordance ratio of 0 indicates no genetic agreement (i.e., no identical SNPs) between the signatures, whereas a concordance ratio of 1 indicates complete genetic agreement (all common SNPs are identical) between the signatures.
  • Signature X and Signature Y may have 5,000 common SNPs, out of which 4,000 SNPs have identical variant calls.
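  • the concordance computation may be sketched as follows, reproducing the example of 5,000 common SNPs with 4,000 identical variant calls. The function name and the representation of signatures as equal-length sequences of Table 1 codes are illustrative assumptions.

```python
def concordance(sig_x, sig_y, missing=3):
    """Return (identical, common, ratio) for two aligned signatures.

    A position is common when neither signature has missing data there;
    the ratio is None when there are no common positions.
    """
    common = 0
    identical = 0
    for x, y in zip(sig_x, sig_y):
        if x == missing or y == missing:
            continue
        common += 1
        if x == y:
            identical += 1
    ratio = identical / common if common else None
    return identical, common, ratio


# Synthetic signatures: 5,000 common SNPs, of which 4,000 agree.
signature_x = [0] * 5000
signature_y = [0] * 4000 + [1] * 1000
identical, common, ratio = concordance(signature_x, signature_y)
```

  • for these synthetic inputs, the ratio is 4,000 / 5,000 = 0.8.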
  • a kinship coefficient (φ), or “coefficient of relationship,” may be calculated.
  • the kinship coefficient may correspond to the measure of genetic relatedness between two samples. It may be calculated based on the sharing of alleles at specific genetic markers, e.g., SNPs, and may provide insight into how closely two or more samples are related by blood, or, in this case, how likely two samples are to be from the same individual.
  • the kinship coefficient may be calculated by identifying, at each SNP, how many alleles are shared, which may be one of three possibilities: homozygous reference (e.g., if both individuals have the same reference allele at a given SNP), heterozygous reference (e.g., if one individual has the reference allele and the other has a variant allele), or homozygous variant (e.g., if both individuals have the same variant allele at a given SNP).
  • the kinship coefficient may be calculated by summing the shared status at each SNP (e.g., 0 for homozygous reference, 1 for heterozygous, or 2 for homozygous variant) and then averaging across all of the total number of common SNPs.
  • the resulting coefficient may range from 0 (indicating no genetic relatedness) to 1 (indicating complete genetic identity).
  • the values may indicate different degrees of genetic relatedness (e.g., a value of 0.25 may suggest a grandparent-grandchild relationship or an uncle/aunt-niece/nephew relationship, a value of 0.5 may indicate a parent-child or a sibling relationship, and a value of 1 may suggest identical twins).
  • a value of 1 may indicate that the samples are from the same individual, whereas increasingly smaller values may indicate an increasingly greater likelihood that two samples are not from the same individual.
  • the computed concordance ratio in step 425 may be compared against a concordance threshold.
  • the threshold value may be used as a criterion in determining whether the genetic profiles are similar enough to be considered a positive match.
  • the threshold value may be higher or lower based on certain factors, as described above. For instance, the predetermined threshold for the comparison of two genotype signatures associated with samples of the same type (e.g., two cfDNA samples) may be higher than if the two genotype signatures were associated with samples of different types (e.g., one cfDNA sample vs. one tissue sample). Additionally or alternatively, as another example, the predetermined threshold may be higher and more stringent if the sample analysis was associated with a disease state decision or treatment recommendation than it may be for a research study.
  • an embodiment may generate, at step 440, a result indicating a match between the genotype signatures.
  • an embodiment may generate, at step 445, a result indicating a mismatch between the genotype signatures.
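  • the overall decision flow of steps 410 through 445 may be sketched as follows. The threshold values, the result labels, and the function name are hypothetical; the strict greater-than comparisons follow the wording above, under which a count or ratio that is not greater than its threshold fails the check.

```python
def compare_signatures(sig_a, sig_b, min_common=3,
                       concordance_threshold=0.9, missing=3):
    """Return (result, ratio) for two aligned signatures of Table 1 codes."""
    pairs = [(a, b) for a, b in zip(sig_a, sig_b)
             if a != missing and b != missing]
    # Steps 410/415: overlap check on the number of common SNPs.
    if len(pairs) <= min_common:
        return "insufficient_overlap", None
    # Steps 420/425: identical calls divided by common SNPs.
    ratio = sum(a == b for a, b in pairs) / len(pairs)
    # Steps 430/440/445: compare against the concordance threshold.
    if ratio > concordance_threshold:
        return "match", ratio
    return "mismatch", ratio


reference = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
same = list(reference)
different = [1, 0, 0, 2, 2, 2, 0, 1, 2, 0]   # 5 of 10 calls differ
sparse = [3] * 8 + [2, 0]                     # only 2 usable positions
```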
  • situations may exist in which the genotype signature of a particular sample is compared against the genotype signatures of a plurality of other samples.
  • a subset of samples in the plurality may be identified as being matched with the particular sample (e.g., at step 440). More particularly, a concordance ratio generated from the comparison of the genotype signatures of the particular sample and one of the subset samples may be determined to be greater than the concordance threshold identified in step 430.
  • the genotype signature of a first sample may be compared against the genotype signatures of 1 million other samples.
  • the concordance threshold may act as an initial filter on a pool of data.
  • the threshold may be adjustable based on need and/or context (e.g., the threshold may be increased or decreased based on the desired goals of the comparison process).
  • the concordance threshold value may change based on the type of samples that the genotype signatures are associated with. More particularly, different types of biological samples (e.g., blood, urine, tissue, etc.) may vary in the extent of genetic variation they contain. Some samples, such as blood, may have a relatively stable and consistent genetic profile, while others, like tumor tissue, may exhibit higher levels of genetic heterogeneity due to mutations and clonal evolution. Accordingly, a first concordance threshold may be established for the comparison of two genotype signatures that are both associated with cfDNA, whereas a second concordance threshold may be established for the comparison of samples with inherently higher genetic variation (e.g., cfDNA vs. tissue). The second concordance threshold may be more relaxed to account for the expected diversity between the samples.
  • the concordance threshold may additionally or alternatively be adjusted based on other factors as well. For instance, in longitudinal studies involving repeated sampling from the same participant over time, genetic changes may occur due to factors like disease progression, treatment response, natural aging, or one or more other natural variations. For such studies, the concordance threshold may be adjusted to accommodate expected genetic drift while still identifying samples from the same individual as matching. As another example, the clinical context and specific application of the genotype signature comparison may at least in part influence the choice of the concordance threshold. For example, in diagnostic scenarios in which accurate patient identification is crucial, a stricter concordance threshold may be chosen to reduce the risk of false positive matches.
  • the concordance value may provide various non-binary decisions or insights that are not limited to binary “match” or “mismatch” outcomes. For instance, upon analyzing the genetic profiles of the individuals associated with two samples, A and B, the concordance value may provide insight into the degree of genetic relatedness, or the likelihood that two samples are from the same individual.
  • a concordance value of 0.8 may indicate that 80% of the common SNPs between samples A and B have identical variants. Based on this, it may be concluded from the concordance value that the individuals associated with samples A and B share a significant portion of their genetic variants, suggesting a high degree of genetic similarity or relatedness, or may indicate the likelihood that the samples are from the same individual.
  • the concordance value may be used to determine whether samples A and B are associated with a single individual, and the system of the embodiments may be configured to output a confidence percentage that the samples are from the same person. In other words, the system may output a degree of genetic similarity between the two samples.
  • an exemplary process flow 500 is depicted for determining whether two samples are matched to a subject, according to one or more aspects of the present disclosure.
  • the exemplary process flow 500 may be implemented by system environment 100 and may incorporate the encoding and comparing processes described with reference to FIGs. 2 and 4.
  • genomic data associated with a first sample may be received at a genotype signature component 105 of a computing device 100.
  • the genomic data may include sequenced sample data 20 from a sample 15 belonging to a participant 10.
  • sample 15 may be a blood sample, tissue sample, bone marrow sample, urine sample, saliva sample, plasma sample, etc.
  • a single sample may be collected from a subject or, alternatively, multiple samples may be collected from the subject (e.g., multiple samples may be collected for the subject at a single time point, multiple samples may be collected for the subject across two or more different time points, etc.). Samples collected at the same time and/or samples collected at different times may be subsequently processed at different times.
  • sequenced sample data 20 may be generated using one of a variety of different sequencing techniques on extracted DNA from sample 15.
  • the sequenced sample data 20 may be processed by encoding module 105A of genotype signature component 105 to generate a genotype signature.
  • the encoding process facilitated by encoding module 105A is configured to convert genetic information, specifically the genetic variants present in the DNA sample, into a compact and standardized numerical representation, i.e., the genotype signature. This process simplifies and condenses the genetic data, making it easier to compare and analyze samples efficiently.
  • Encoding may begin by first identifying the total number of SNPs present in sequenced sample data 20. From the total number of identified SNPs, a first subset of SNPs may be retained. This subset may be chosen based on its correlation with an indicative reference subset of SNPs.
  • a second subset may be created based on coverage depth. More particularly, the second subset may include SNPs with coverage depths falling within a range defined by a minimum and maximum coverage threshold, as described above. This ensures that only SNPs with reliable and consistent data are included in the genotype signature.
  • the VAF may be calculated to identify how common a particular genetic variant is within the sample.
  • the genetic variant at each SNP is classified into categories, such as homozygous reference allele, homozygous alternate allele, heterozygous allele, or missing data. These categories provide insight into genetic differences and similarities between samples.
  • the identified genetic variant categories may thereafter be encoded into numerical values. A specific numerical value may be assigned to each category (e.g., homozygous reference is assigned 0, homozygous alternate is assigned 2, heterozygous is assigned 1, and missing data is assigned 3).
  • the encoded numerical values for each SNP position are combined to create the genotype signature for the sample.
  • the genotype signature derived by encoding module 105A may then be utilized by concordance result module 105B to identify whether the genotype signature matches a second genotype signature associated with a second sample.
  • the comparison process involves assessing the similarity and concordance between the two genotype signatures, which may be used to determine whether the two samples likely originate from the same individual, different individuals, or whether there are discrepancies that require further investigation.
  • the comparison process may begin by identifying the total number of SNPs that are common between the two genotype signatures. These common SNPs represent the genetic positions where both samples have been evaluated for similarity. Once the common SNPs are identified, the comparison may evaluate whether the total number of common SNPs is greater than a predetermined threshold.
  • This threshold serves as a criterion for determining whether there is a meaningful amount of overlap in the genetic information being evaluated. If the number of common SNPs falls below the threshold, it suggests that there may not be sufficient overlap for a reliable comparison. Assuming the number of common SNPs surpasses the threshold, the comparison proceeds to identify the subset of common SNPs that have matching variant calls (i.e., genetic variant types) between the two genotype signatures. These identical SNPs are positions where the genetic information aligns between the samples, as described above. The next step involves computing the concordance ratio between the two genotype signatures, which provides a quantitative measure of the genetic similarity between the two samples. The concordance ratio may be obtained by dividing the number of identical SNPs by the total number of common SNPs.
  • biological samples may be collected from a participant at different points in time to facilitate longitudinal studies, track disease progression, or assess treatment responses.
  • Embodiments of the disclosure may be used to determine whether the two samples likely originate from the same individual, different individuals, or whether there are discrepancies between the genotypic signatures for the different samples that require further investigation.
  • the first sample may be obtained from a participant at an initial visit, representing the “first time” in the study timeline. This sample may be used for analysis at that time, such as genotyping or other molecular assays.
  • the second sample described above may be obtained from the same participant during a follow-up visit at a later date, e.g., a “second time,” which may be several days, weeks, or months after the first collection.
  • multiple samples may be collected at a single time point, even though only a portion of them is processed at that time point.
  • multiple tubes of blood may be drawn from a single participant during a single visit. While a subset of the tubes of blood may be processed at that time for initial testing and analysis, the remaining subset of the tubes of blood may be stored for future use. These stored samples may be analyzed later without requiring an additional collection from the participant.
  • the first and second time points for sample procurement and/or processing may be substantially the same.
  • embodiments of the disclosure may be used to determine whether the samples taken at the same time but analyzed at different time points likely originate from the same individual, different individuals, or whether there are discrepancies between the genotypic signatures for the different samples that require further investigation.
  • the computed concordance ratio is then compared against a predetermined concordance threshold, which represents the level of genetic similarity required to consider the two samples a match. If the computed concordance ratio is below the threshold, it indicates that the genetic information is not concordant enough to establish a match, and system 100 may output, at step 525, a genotype signature mismatch result.
  • the genotype signature mismatch result may include an indication that the samples are not from the same individual, or that there is sufficient variability between the two samples to call into question whether the two samples are from the same individual.
  • the genotype signature mismatch result may provide an indication that inconclusive results were achieved and may provide a recommendation to rerun the analysis. Conversely to the foregoing, if the ratio exceeds the threshold(s), it suggests that the genetic information is sufficiently concordant, and system 100 may output, at step 530, a genotype signature match result.
  • the foregoing processes may be utilized to identify an unexpected mismatch, e.g., a situation in which a sample matched to a subject is not actually derived from that subject.
  • a set of 5 samples may be conventionally identified (e.g., based on manually recorded clinical data associated with each sample) as originating from a single subject (e.g., Subject A).
  • one of the 5 samples may not be derived from Subject A but may, in fact, be derived from another subject.
  • the system described herein may compare the genotype signatures of each sample of Subject A against one another to identify a sample that is not matched with the rest of the sample set.
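  • one way to sketch the all-against-all comparison within a sample set is shown below. The majority rule used to flag a sample, the threshold, and all names and signature values are assumptions for illustration; the disclosure does not prescribe how the unmatched sample is singled out.

```python
from itertools import combinations


def concordance_ratio(sig_a, sig_b, missing=3):
    """Identical calls divided by common (non-missing) positions, or None."""
    pairs = [(a, b) for a, b in zip(sig_a, sig_b)
             if a != missing and b != missing]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


def find_unmatched(signatures, threshold=0.9):
    """Flag samples that fail to match more than half of the other samples.

    Pairs with no usable overlap are counted as non-matching here, which
    is a simplification.
    """
    failures = {sample_id: 0 for sample_id in signatures}
    for s, t in combinations(signatures, 2):
        ratio = concordance_ratio(signatures[s], signatures[t])
        if ratio is None or ratio <= threshold:
            failures[s] += 1
            failures[t] += 1
    n_others = len(signatures) - 1
    return [sid for sid, count in failures.items() if count > n_others / 2]


# Five samples labeled as Subject A; the fifth carries a different profile.
base = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
subject_a = {"s1": base, "s2": base, "s3": base, "s4": base,
             "s5": [1, 0, 0, 2, 2, 1, 1, 0, 0, 2]}
flagged = find_unmatched(subject_a)
```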
  • the foregoing processes may be utilized to identify an unexpected match, e.g., a situation in which a given sample matches a sample set that it may not have been originally paired with. For instance, continuing on with the previous example, upon generating a genotype signature associated with the originally mismatched sample, a user may identify that the generated genotype signature of the originally mismatched sample matches one or more samples associated with another subject, e.g., Subject B. In this situation, the originally mismatched sample may then be paired with the appropriate sample set.
  • the subject that a mismatched sample may have been derived from may not always be known. This may be due to various factors, including: the swap occurring before receipt of the samples by the receiving organization, the swap occurring between studies, or a false negative result (e.g., subject had marrow transplant between intervals, non-cancer to cancer transitions, etc.). In such a situation, if any tests were run using the mismatched sample, then the results of those tests may be discarded and/or not returned to the subject.
  • the process described herein may be utilized to identify why, in the absence of a mismatch, samples associated with separate subjects may be substantially similar to each other.
  • the genotype signature comparison process may identify that all samples associated with Subject A share a genotype signature match with all samples associated with Subject B. In one circumstance, such an occurrence may result if Subjects A and B were identical twins.
  • samples received by a receiving organization may originate from different sources. In some situations, samples originating from the same subject may be received from these separate sources, but may not be appropriately grouped together (e.g., due to improper labeling or incomplete mapping, etc.). In another circumstance, this occurrence may result from a false positive test, e.g., one that indicates that all samples from two participants are matched when they are not. Comparing genotype signatures for all samples may confirm whether a false positive was present or not.
  • the processes described herein may be utilized to identify whether a given sample may have any associations at all. For instance, circumstances may arise where researchers may be left with a lone sample, e.g., one that is not matched with any known subject. In these situations, the genotype signature comparison process may be utilized to determine whether the lone sample shares the same genotype signature as one or more control samples that were utilized in a particular assay. If the genotype signature comparison process reveals that the lone sample is a control sample, then that sample may be disregarded and/or paired with the appropriate grouping of control samples.
  • in a sixth exemplary situation, the genotype signature comparison process may be unable to return an informative result. For instance, the SNP intersection between the genotype signatures of the samples may be too low (e.g., as indicated and described with respect to step 410 in FIG. 4) to derive an appropriate concordance ratio. This may happen when the processing of one of the samples has “failed” and correspondingly yields low binary coverage. In this instance, sample results may not be returned to the subject and/or a sample redraw from the relevant subject may need to be implemented.
  • Other situations that may prevent the sample matching process from working include those situations in which there is no key (e.g., where there is no output directory, etc.) or domain (e.g., when a study or domain is not approved or added to an “allowed” list, etc.) or when the data associated with the samples is not added to the domain.
  • the processes described herein may be utilized to identify that a particular sample shares no matches at all. More particularly, the genotype signature comparison process may return an indication that the genotype signature associated with a given sample has no resemblance to a genotype signature associated with the samples of any other subject or any other control sample.
  • the processes described herein may be utilized to identify and link participants who have taken repeated tests, but are initially treated as separate individuals in the system due to discrepancies in how their samples are registered.
  • a participant may be supplied with test kits from different sources, such as both a provider portal and a patient portal, resulting in the system recognizing the samples collected from these two different kits as two distinct participants, rather than multiple samples from the same participant.
  • using the genotype signatures described herein, the genetic information from both samples may be analyzed and compared, revealing that they actually belong to the same individual. This identification enables the clinical lab or other suitable facility to merge or link these participants within the system, thereby correctly consolidating the test results and participant data for accurate medical records and downstream assessments/processes.
  • the processes described herein may be utilized to address errors caused when inputting participant information into a data collection system, e.g., such as typos in participant names. For example, slight variations or spelling mistakes in participant names may cause the system to treat the same individual as two independent participants.
  • the processes described herein may allow the lab to identify that the samples originate from the same person. This may prompt the lab to link the profiles, thereby rectifying the issue and ensuring that all relevant data is associated with the correct participant, or that different samples from the same subject aren’t considered samples from unique subjects.
  • an indication that the samples are contaminated may also be generated.
  • the second sample may be a control sample known to have been acquired from a different individual than the individual from whom the first sample was acquired.
  • a match output would not be appropriate in light of the different known sample origins, and thus the match output may be indicative of contamination or other errors.
  • a mismatch output may likewise indicate sample contamination or other errors, for example, if the first and second samples are known to have been acquired from the same individual.
  • the second sample could be a control sample from the same individual from whom the first sample was acquired, and a match output may thus indicate that the first and second samples were processed correctly. In this way, embodiments of the disclosure may function at least in part as quality control.
  • the sample match or mismatch may be compared to an expected sample match status between the first sample and the second sample.
  • the results of the comparison of the sample match or sample mismatch with the expected sample match status may be reported to a user or to a second computer system.
  • a user may be provided with additional context with which to assess the sample match or sample mismatch determination that is output. For example, if a match was output, but the expected sample match status was a mismatch, then it may be inferred that an error, such as contamination, may have occurred.
  • one or more alerts or indications of potential error or contamination may be output, and the assay may ultimately need to be redone.
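  • the comparison of an output match status against the expected match status may be sketched as follows. The status labels, the alert wording, and the function name are hypothetical and serve only to illustrate the check.

```python
VALID_STATUSES = ("match", "mismatch")


def check_against_expectation(observed, expected):
    """Return None when the output agrees with expectation, else an alert."""
    if observed not in VALID_STATUSES or expected not in VALID_STATUSES:
        raise ValueError("status must be 'match' or 'mismatch'")
    if observed == expected:
        return None
    # Disagreement in either direction may indicate contamination or
    # another processing error, so the assay may need to be redone.
    return (f"ALERT: observed {observed} but expected {expected}; "
            "possible contamination or other error")


alert = check_against_expectation(observed="match", expected="mismatch")
```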
  • the baseline allele frequencies in the population may also be considered. More particularly, for each SNP position, information may be available (e.g., in a local database, in an online database, etc.) for how common each allele is in the general population across different ethnic groups. Thereafter, for each SNP being compared, the probability of observing a match between the alleles in the two genotype signatures may be calculated, considering the population statistics. This calculation is facilitated by applying Bayesian probability calculations and takes into account the likelihood of observing the specific allele types given the population frequencies. In addition to allele frequencies, this approach considers the potential variations that may occur at a specific SNP position, which may make the comparison more comprehensive and robust.
  • the probabilities calculated for each SNP may be combined to yield an overall concordance probability for the entire genotype signature comparison. This probability reflects the likelihood that the two genotype signatures match based on the population statistics and the variations observed. The comparison decision is then made based on the overall concordance probability. If the computed probability exceeds a certain threshold, it indicates a higher likelihood that the samples match. Conversely, if the computed probability falls below the threshold, it suggests a lower likelihood of a match.
  • a central database may be created that stores sample genotype records. This database may be continually updated (e.g., as new results appear) and may be leveraged to allow genotype match searches.
  • the database genotype search may be facilitated through a linear search, which may involve comparing the genotype information associated with a query sample with the genotype information of all samples stored in the central database.
  • a user may initiate a genotype match search by providing, e.g., to an application programming interface (API), information associated with a query sample.
  • the query sample information may contain information about the genetic makeup, or genotype, of a specific individual.
  • the genotype data may include details about various SNPs or other genetic markers or mutations.
  • the query sample may additionally include identifying information associated with the individual, such as a sample ID, various types of metadata (e.g., date of collection, collection method, collection location, etc.), or other relevant details that may help distinguish the sample from other samples.
  • the system may calculate the similarity between the genotypes associated with the query sample against those of each sample stored in the database. More particularly, in-memory encodings may be utilized that make each pairwise genotype comparison efficient and fast.
  • the encoding may be a binary encoding for which the genotype similarity may be computed using various metrics, such as Hamming distance using vector CPU instructions.
  • the system may iterate through each sample record in the database, comparing the genotype information of the query sample with the genotype information of the samples in the database.
  • a similarity threshold may be established to define what constitutes a “match.” Samples with genotype similarities that meet or exceed the threshold are considered potential matches (e.g., indicating a higher likelihood that they originate from the same individual).
  • the central database may be cached into the memory of a server that implements the foregoing API. This reduces the need for repeated disk reads, as data is readily available in memory for comparisons.
  • the server may periodically update its cache by reading records from the central database that were added or modified since the last update. If pipeline run IDs are not guaranteed to be sequential, a range search mechanism may be utilized to efficiently locate and retrieve updated records from the database.
  • locality sensitive hashing (LSH)
  • each data point (e.g., genotype information associated with a sample)
  • each dimension of the vector may correspond to a genetic marker or feature.
  • the LSH process may employ a set of hash functions that map these high-dimensional vectors to a lower-dimensional space. Specifically, LSH creates multiple hash functions tailored for the genetic data and these functions map genotype vectors to lower-dimensional hash codes.
  • multiple hash tables may be constructed, each using a different hash function and each table may store references to samples based on their hash codes.
  • the system may then search hash tables for potential matches based on the hash values of the query. Any retrieved candidates from the hash tables may be potential nearest neighbors. To determine if they are true matches, full sample genotypes of the candidates may be obtained and compared to make a final determination.
  • various parameters of the foregoing process may be tuned to ensure that true genotype matches may be found with high probability.
  • each of the hash tables described above may be stored in a type of cloud system and may be configured to scale to 1 billion samples or more.
  • the genotype signatures described herein may be associated with a “version number” that may represent which version of the genotype signature encoding process a sample was encoded with. Differences in the signature version may correspond to differences in the parameters encompassed or leveraged by each encoding process. For instance, an original version of the genotype signature encoding process may be Version 0.
  • Version 0 may be associated with a variety of parameters, e.g., including: 2 bit encoding per variant, a predefined set of SNPs, a predefined minimum number of intersecting SNPs between two samples, a predefined concordance threshold, and/or other variables (e.g., a predefined minimum and/or maximum depth at a position, predefined thresholds relevant to variant calling, etc.).
  • a change in any of the above parameters may trigger initiation of a new encoding process.
  • the assay panel from which the samples were obtained may change, which may affect the specific type and/or number of SNPs that are utilized in the comparison.
  • each subsequent version of the genotype signature encoding process may sequentially increment.
  • samples having genotype signatures associated with the same version may be comparable. For example, two samples encoded with a version 1 genotype signature may be compared against each other. In some situations, samples having genotype signatures of different versions (e.g., version 1 vs. version 2) may also be compared after one or more factors are addressed (e.g., certain changes may need to be made to certain thresholds, etc.).
  • any process discussed in this disclosure that is understood to be computer-implementable may be performed by one or more processors of a computer system, such as system environment 100, as described above.
  • a process or process step performed by one or more processors may also be referred to as an operation.
  • the one or more processors may be configured to perform such processes by having access to instructions (e.g., software or computer-readable code) that, when executed by the one or more processors, cause the one or more processors to perform the processes.
  • the instructions may be stored in a memory of the computer server.
  • a processor may be a central processing unit (CPU), a graphics processing unit (GPU), or any other suitable type of processing unit.
  • a computer system, such as system environment 100, may include one or more computing devices. If the one or more processors of the computer system are implemented as a plurality of processors, the plurality of processors may be included in a single computing device or distributed among a plurality of computing devices. If a system environment comprises a plurality of computing devices, the memory of the computer system may include the respective memory of each computing device of the plurality of computing devices.
  • FIG. 6 is a simplified functional block diagram of a computer system
  • any of the systems herein may be an assembly of hardware including, for example, a data communication interface 620 for packet data communication.
  • the platform also may include a central processing unit (“CPU”) 602, in the form of one or more processors, for executing program instructions.
  • the platform may include an internal communication bus 608, and a storage unit 606 (such as ROM, HDD, SSD, etc.) that may store data on a computer readable medium 622, although the system 600 may receive programming and data via network communications over electronic network 625 (e.g., voice, video, audio, images, or any other data over the electronic network 625).
  • the system 600 may also have a memory 604 (such as RAM) storing instructions 624 for executing techniques presented herein, although the instructions 624 may be stored temporarily or permanently within other modules of system 600 (e.g., processor 602 and/or computer readable medium 622).
  • the system 600 also may include input and output ports 612 and/or a display 610 to connect with input and output devices such as keyboards, mice, touchscreens, monitors, displays, etc.
  • the various system functions may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load. Alternatively, the systems may be implemented by appropriate programming of one computer hardware platform.
  • the term “based on” means “based at least in part on.”
  • the singular forms “a,” “an,” and “the” include plural referents unless the context dictates otherwise.
  • the term “exemplary” is used in the sense of “example” rather than “ideal.”
  • the terms “comprises,” “comprising,” “includes,” “including,” or other variations thereof, are intended to cover a non-exclusive inclusion such that a process, method, or product that comprises a list of elements does not necessarily include only those elements, but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus.
  • the term “user” generally encompasses any person or entity, such as a researcher and/or a care provider (e.g., a doctor, etc.), that may desire information, resolution of an issue, or engage in any other type of interaction with a provider of the systems and methods described herein (e.g., via an application interface resident on their electronic device, etc.).
  • the term “electronic application” or “application” may be used interchangeably with other terms like “program,” or the like, and generally encompasses software that is configured to interact with, modify, override, supplement, or operate in conjunction with other software.
  • Storage type media include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks.
  • Such communications may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the mobile communication network into the computer platform of a server and/or from a server to the mobile device.
  • another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • the physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software.
  • terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
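
The binary encoding and linear database search described in the list above can be illustrated with a minimal sketch. It assumes a 2-bit code per SNP, that every signature covers the same SNP positions in the same order, and that the similarity threshold is 0.95; the function names, the thresholds, and the use of Python integers in place of vector CPU instructions are all assumptions made for illustration and are not taken from the disclosure.

```python
from typing import Dict, List, Sequence, Tuple


def pack_codes(codes: Sequence[int]) -> int:
    """Pack a sequence of 2-bit codes (0..3) into a single integer."""
    packed = 0
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("each code must fit in 2 bits")
        packed |= code << (2 * i)
    return packed


def count_mismatches(a: int, b: int, n_snps: int) -> int:
    """Count SNP positions whose 2-bit codes differ between a and b."""
    if n_snps == 0:
        return 0
    diff = a ^ b
    # Mask with a 1 in the low bit of every 2-bit pair: 0b0101...01
    low_bits = int("01" * n_snps, 2)
    # Fold each pair so its low bit is set if either bit of the pair differs.
    folded = (diff | (diff >> 1)) & low_bits
    return bin(folded).count("1")


def linear_search(
    query: int,
    database: Dict[str, int],
    n_snps: int,
    min_similarity: float = 0.95,  # hypothetical threshold
) -> List[Tuple[str, float]]:
    """Compare the query against every stored signature and keep close ones."""
    hits = []
    for sample_id, packed in database.items():
        similarity = 1 - count_mismatches(query, packed, n_snps) / n_snps
        if similarity >= min_similarity:
            hits.append((sample_id, similarity))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Folding each 2-bit pair before counting means the result is the number of differing SNPs rather than the number of differing bits, which keeps the similarity score interpretable per position.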

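The locality sensitive hashing search described in the list above can likewise be sketched. The disclosure does not name a hash family, so this example assumes position sampling (a standard LSH family for Hamming distance): each hash function reads the codes at a fixed random subset of SNP positions. The class name `GenotypeLSH`, the parameters `n_tables` and `k`, and the similarity threshold are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class GenotypeLSH:
    """Illustrative LSH index over per-SNP genotype codes."""

    def __init__(self, n_snps: int, n_tables: int = 8, k: int = 16, seed: int = 0):
        rng = random.Random(seed)
        # One hash function per table: k randomly sampled SNP positions.
        self.positions = [
            sorted(rng.sample(range(n_snps), k)) for _ in range(n_tables)
        ]
        self.tables = [defaultdict(set) for _ in range(n_tables)]
        self.signatures: Dict[str, Tuple[int, ...]] = {}

    def _key(self, codes: Sequence[int], table_index: int) -> Tuple[int, ...]:
        return tuple(codes[p] for p in self.positions[table_index])

    def add(self, sample_id: str, codes: Sequence[int]) -> None:
        self.signatures[sample_id] = tuple(codes)
        for t, table in enumerate(self.tables):
            table[self._key(codes, t)].add(sample_id)

    def query(
        self, codes: Sequence[int], min_similarity: float = 0.95
    ) -> List[Tuple[str, float]]:
        # Step 1: gather candidates that collide with the query in any table.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates |= table.get(self._key(codes, t), set())
        # Step 2: verify each candidate against its full signature.
        hits = []
        for sample_id in candidates:
            stored = self.signatures[sample_id]
            same = sum(a == b for a, b in zip(codes, stored))
            similarity = same / len(stored)
            if similarity >= min_similarity:
                hits.append((sample_id, similarity))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Raising `n_tables` increases the chance that a near-identical signature collides with the query in at least one table, while raising `k` reduces spurious candidates; these are the kinds of parameters that would be tuned so that true matches are found with high probability.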
Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Systems and methods for verifying the similarity between biological samples are disclosed. One method may include: receiving genomic data associated with a first sample from a participant; generating a numerical representation for each single nucleotide polymorphism (SNP) in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying a sample match or a sample mismatch based on the comparing. Other aspects are described and claimed.

Description

SYSTEMS AND METHODS FOR ASSESSING SIMILARITY BETWEEN SAMPLES USING GENOTYPE SIGNATURES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/593,758, filed October 27, 2023, the entirety of which is hereby incorporated by reference herein.
TECHNICAL FIELD
[0002] The present disclosure relates generally to the field of clinical diagnostics and, more specifically, to systems and methods for verifying the identity of biological samples using genotype signatures in clinical samples.
BACKGROUND
[0003] In the realm of clinical diagnostics and genetic testing, promoting the accuracy and reliability of test results is important. The integrity of patient care, medical research, and therapeutic development depends on the correct identification and analysis of biological samples. However, the potential for sample swaps and other contamination at various stages of processing (e.g., the plasma isolation stage, assay processing stage, etc.) poses a significant challenge that may lead to erroneous outcomes and subsequent negative consequences. Existing methods for sample tracking and verification are limited and often lack scalability, efficiency, and accuracy.
[0004] The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art, or suggestions of the prior art, by inclusion in this section.
SUMMARY OF THE DISCLOSURE
[0005] According to certain aspects of the disclosure, systems and methods are described for utilizing genotype signatures to accurately correlate patients with obtained samples. In some aspects, the present disclosure provides for systems and methods that may leverage genotype signatures associated with each sample to identify and/or rectify sample swaps and mislabeling, even when dealing with high sample volumes and multiple time points.
[0006] In summary, one aspect provides a computer-implemented method for verifying similarity between biological samples. The computer-implemented method may include: receiving, at a computer system, genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generating, using a processor associated with the computer system, a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling, using the processor, the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing, using the processor, the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying, based on the comparing and responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identifying, based on the comparing and responsive to determining that the first genotype signature does not share the threshold level of similarity with the second genotype signature, a sample mismatch between the first sample and the second sample.
[0007] In another aspect, a system for verifying similarity between biological samples is provided. The system may include: one or more processors; one or more computer readable media storing instructions that are executable by the one or more processors to perform operations to: receive genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generate a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assemble the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; compare the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identify, responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identify, responsive to determining that the first genotype signature does not share the threshold level of similarity with the second genotype signature, a sample mismatch between the first sample and the second sample.
[0008] In yet another aspect, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium is configured to store computer-executable instructions which, when executed by a server, cause the server to perform operations which may include: receiving genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generating a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying, based on the comparing and responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identifying, based on the comparing and responsive to determining that the first genotype signature does not share the threshold level of similarity with the second genotype signature, a sample mismatch between the first sample and the second sample.
[0009] In yet another aspect, a computer-implemented method for generating a genotype signature is provided. The computer-implemented method may include: receiving, at a computer system, genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); identifying, using a processor of the computer system, a total number of SNPs in the genomic data; retaining, using the processor and from the total number of the SNPs, a first subset of SNPs that match a reference subset of SNPs; retaining, using the processor and from the first subset of SNPs, a second subset of SNPs having a coverage depth greater than a predetermined minimum coverage threshold and smaller than a predetermined maximum coverage threshold; computing, using the processor, a variant allele frequency (VAF) for each of the second subset of SNPs; identifying, using the processor and based on a VAF calling threshold, a genetic variant associated with each of the second subset of SNPs, wherein the genetic variant is one of a first type, a second type, a third type, or a fourth type; and generating, using the processor and based on the identified genetic variant type associated with each of the second subset of SNPs, the genotype signature containing a numerical representation, wherein each numerical representation is based on an allele characteristic associated with each of the SNPs.
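The signature-generation steps in the preceding paragraph can be sketched as follows. The paragraph does not say what the four genetic variant types are, so this example assumes homozygous reference, heterozygous, homozygous alternate, and an ambiguous "no call" category; the numeric codes, the depth limits, the VAF calling threshold of 0.1, and every name in the code are hypothetical values chosen for illustration.

```python
from typing import Dict, Iterable, NamedTuple, Set


class SnpCall(NamedTuple):
    snp_id: str      # identifier format (e.g., an rsID) is an assumption
    ref_reads: int   # reads supporting the reference allele
    alt_reads: int   # reads supporting the alternate allele


# Hypothetical numeric codes for the four assumed variant types.
NO_CALL, HOM_REF, HET, HOM_ALT = 0, 1, 2, 3


def encode_signature(
    calls: Iterable[SnpCall],
    reference_snps: Set[str],
    min_depth: int = 20,         # hypothetical minimum coverage threshold
    max_depth: int = 1000,       # hypothetical maximum coverage threshold
    vaf_threshold: float = 0.1,  # hypothetical VAF calling threshold
) -> Dict[str, int]:
    """Return a mapping from SNP identifier to a numeric genotype code."""
    signature: Dict[str, int] = {}
    for call in calls:
        # Retain only SNPs that belong to the reference subset.
        if call.snp_id not in reference_snps:
            continue
        # Retain only SNPs whose depth lies strictly between the limits.
        depth = call.ref_reads + call.alt_reads
        if not (max(min_depth, 0) < depth < max_depth):
            continue
        # Variant allele frequency: fraction of reads with the alternate allele.
        vaf = call.alt_reads / depth
        if vaf <= vaf_threshold:
            code = HOM_REF
        elif vaf >= 1 - vaf_threshold:
            code = HOM_ALT
        elif abs(vaf - 0.5) <= vaf_threshold:
            code = HET
        else:
            code = NO_CALL
        signature[call.snp_id] = code
    return signature
```

Because each code takes one of four values, it fits in two bits, which is consistent with a compact binary signature, although the exact bit layout here is only one possible choice.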
[0010] In yet another aspect, a computer-implemented method for comparing genotype signatures is provided. The computer-implemented method may include: identifying a first total number of single nucleotide polymorphisms (SNPs) that are common between a first genotype signature and a second genotype signature; determining whether the first total number of SNPs is greater than a first predetermined threshold; identifying, responsive to determining that the first total number of SNPs is greater than the first predetermined threshold, a second total number of SNPs having a matching genetic variant type between the first genotype signature and the second genotype signature; calculating, based on the identifying, a ratio of the second total number of SNPs to the first total number of SNPs; determining whether the ratio is greater than a second predetermined threshold; and identifying that the first genotype signature is a match with the second genotype signature responsive to determining that the ratio is greater than the second predetermined threshold, or identifying that the first genotype signature is not a match with the second genotype signature responsive to determining that the ratio is not greater than the second predetermined threshold.
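The comparison steps in the preceding paragraph reduce to two threshold checks, sketched below. Signatures are assumed to be mappings from SNP identifier to a numeric genotype code; the threshold defaults and the "insufficient_overlap" outcome for signatures that share too few SNPs are assumptions, since the paragraph does not state what happens in that case.

```python
from typing import Dict, Optional, Tuple


def compare_signatures(
    sig_a: Dict[str, int],
    sig_b: Dict[str, int],
    min_common_snps: int = 100,    # hypothetical first threshold
    min_concordance: float = 0.9,  # hypothetical second threshold
) -> Tuple[str, int, Optional[float]]:
    """Return (status, number of common SNPs, concordance ratio or None)."""
    # First total: SNPs present in both signatures.
    common = sig_a.keys() & sig_b.keys()
    n_common = len(common)
    if n_common <= min_common_snps:
        # Assumed handling: too little overlap to make a determination.
        return "insufficient_overlap", n_common, None
    # Second total: common SNPs whose variant type is the same in both.
    n_matching = sum(1 for snp in common if sig_a[snp] == sig_b[snp])
    ratio = n_matching / n_common
    status = "match" if ratio > min_concordance else "mismatch"
    return status, n_common, ratio
```

Both comparisons are strict ("greater than"), mirroring the wording of the method, so a ratio exactly equal to the second threshold is reported as a mismatch.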
[0011] In yet another aspect, a computer-implemented method for assembling a reference subset of single nucleotide polymorphisms (SNPs) is provided. The computer-implemented method may include: receiving, at a computer system, a dataset containing a first set of SNPs; generating, using a processor associated with the computer system and by applying one or more filters against the first set of SNPs, a second set of SNPs; and utilizing the second set of SNPs in a genotype signature encoding process.
[0012] Additional objects and advantages of the disclosed embodiments will be set forth in part in the description that follows, and in part will be apparent from the description, or may be learned by practice of the disclosed embodiments. The objects and advantages of the disclosed embodiments will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
[0013] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments and together with the description, serve to explain the principles of the disclosure.
[0015] FIG. 1 depicts an exemplary system environment, according to one or more embodiments of the present disclosure.
[0016] FIG. 2 depicts a process flow for an encoding module of a genotype signature component, according to one or more embodiments of the present disclosure.
[0017] FIG. 3 depicts an example illustration of the process flow of the encoding module illustrated in FIG. 2, according to one or more embodiments of the present disclosure.
[0018] FIG. 4 depicts a process flow for a concordance results module of the genotype signature component, according to one or more embodiments of the present disclosure.
[0019] FIG. 5 depicts a process flow for verifying the similarity between participant samples, according to one or more embodiments of the present disclosure.
[0020] FIG. 6 depicts an example computing system, according to one or more aspects of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
[0021] The terminology used below may be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of the present disclosure. Indeed, certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this Detailed Description section. Both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the features, as claimed.
[0022] In the realm of modern clinical studies, the accurate identification and tracking of biological samples are foundational to the integrity and credibility of research outcomes. Advances in genomic analysis and personalized medicine have led to an exponential increase in the volume of biological samples processed for various assays, including genotyping and other molecular assays. However, this surge in sample throughput has amplified the challenges associated with maintaining accurate sample identity throughout the complex workflows inherent to these studies.
[0023] At the outset, samples may be collected from subjects, often involving solid (e.g., tissue, bone marrow) and/or liquid biopsy (e.g., whole blood, blood plasma, urine, saliva) collection in containers. For example, a liquid biopsy, such as a blood sample, may be collected in specially designed tubes. In some cases, multiple samples may be collected from a participant. In the case of liquid biopsies, for example, two tubes may be collected from each participant: one designated for primary processing and analysis, and the other reserved as a backup to address potential technical failures or errors. This multi-sample or dual-sample (e.g., multitube or dual-tube) approach may be used to create redundancy in case of sample loss. In the context of a clinical trial, the use of multiple sample containers may help to ensure that the study can proceed even if one sample container (e.g., tube) encounters errors. In some aspects, samples may be collected from the same participant over a period of time in a longitudinal study. Once collected, the samples may undergo a series of processing steps, which may include, e.g., plasma isolation, assay procedures, and subsequent data analysis, depending on the type of biopsy sample being processed. These steps can be intricate and multifaceted, involving transfers of sample material into different containers (e.g., tubes or wells) for processing and experimentation. However, at each juncture, the potential exists for sample mislabeling, contamination, or swaps to occur, e.g., where samples from one participant are inadvertently attributed to another due to human error, miscommunication, or other factors.
[0024] As an example, external vendors involved in sample processing, such as plasma isolation, may inadvertently introduce errors. In some aspects, misalignment between sample labels and the layout of sample containment plates used in further assay processing steps may lead to incorrect sample attribution, impacting downstream analyses and conclusions drawn from the data. In other scenarios, multiple samples from the same subject, collected at different time points or processed in different ways, may need to be accurately matched to avoid confusion and error.
[0025] The ramifications of sample swaps may include misdiagnoses, improper treatment, compromised clinical studies, and/or inaccurate research findings. Additionally, the complexity and scale of modern clinical studies may exacerbate the risk of sample swaps. More particularly, with the processing of millions of samples annually, even a small percentage of errors may result in a substantial number of sample swaps or contamination events. For instance, previous clinical studies have reported instances of large-scale sample swaps that have generated misleading results, highlighting the urgency for robust solutions to detect and prevent such errors.
[0026] Conventional methods for sample swap detection often lack scalability, efficiency, and accuracy. More particularly, current techniques heavily rely on manual record-keeping, bar code scanning, and label cross-referencing. Unfortunately, these methods can be error-prone, labor-intensive, and unsuitable for the vast sample volumes encountered in modern high-throughput studies. Additionally, these methods may not be equipped to address the complexities of comparing multiple samples from the same participant or to identify mismatches between clinical information and genetic data. There is therefore an urgent need for a scalable, accurate, and efficient solution for verifying sample identity in activities involving high-volume genomic data analysis, such as clinical studies, clinical analysis, and the like.
[0027] Accordingly, the present disclosure provides a novel approach for sample identification and verification via utilization of a genotype signature component (GSC) of a sample analysis system, which is configured to analyze genetic information to tackle the challenges of sample mislabeling and sample swaps. More particularly, the GSC leverages characteristics of single nucleotide polymorphisms (SNPs) within the genomic data to uniquely identify and verify samples, thereby ensuring that they are correctly attributed to the corresponding participant. Specifically, the GSC may be configured to generate a genotype signature for each sample. The genotype signature is a numerical representation of SNPs at a subset of positions within the genomic data of the sample. The genotype signatures may then be compared against one another to determine whether they share a threshold level of similarity. Samples having genotype signatures that share similarity above this threshold level may be considered to be associated with an identical participant, whereas samples having genotype signatures that share similarity below this threshold level may be considered to be associated with different participants. A threshold level of similarity may be used instead of a pure match, because a pure match may not exist, and a threshold allows for flexibility in the event of, e.g., condition-related or treatment-related modifications to SNPs, rare conditions such as mosaicism, etc.
[0028] The concepts described herein utilize a process involving nucleic acid (e.g., DNA or RNA) genotyping, encoding, and digital signature creation to ensure accurate sample matching and detection of sample swaps or mislabeling. Specifically, the encoding process generates specific numeric values that are representative of different genetic variants, thereby transforming genetic data into a structured and standardized format that can be efficiently compared by computing systems in high-throughput environments. The encoded numeric values simplify and accelerate the process of comparing genotype signatures between samples. More particularly, the ability to process large datasets of genotype signatures efficiently is a technical advantage over conventional techniques that involve manual comparisons. This scalability offers practical benefits for industries dealing with numerous samples, such as healthcare, research, and diagnostics. Furthermore, various thresholds utilized in the encoding and comparing processes may promote accurate results, even for samples with varying characteristics (e.g., different sample types, such as a cell-free DNA (cfDNA) or cell-free RNA (cfRNA) sample vs. tissue sample). Although the specification discusses DNA in particular for convenience, it will be understood that any nucleic acid, such as RNA, may alternatively or additionally be used in embodiments of the disclosure.
[0029] The concepts described herein represent improvements to computer technology, particularly in the realms of data processing, accuracy, and scalability in clinical and research environments. Traditional systems for tracking biological samples often rely on manual processes such as barcoding, label scanning, and record keeping. These methods are error-prone and inefficient, particularly as the volume of biological samples increases in high-throughput environments. The use of a genotype signature offers a computational solution to these challenges, enhancing the precision and efficiency of sample identification. More particularly, the genotype signature creation process involves the encoding of genetic data into compact digital genotype signatures that can be rapidly compared across large datasets using computers. By transforming complex genomic information into a binary format for storage and comparison, the system enables high-speed, large-scale computations that were not feasible with traditional methods and represents a process that cannot practically be performed in the human mind. In an aspect, the improvements may also extend to the automated linking of participants, which benefits from advanced computational techniques like matching algorithms and machine learning.
[0030] The subject matter of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, which form a part hereof, and which show, by way of illustration, specific exemplary embodiments. An embodiment or implementation described herein as “exemplary” is not to be construed as preferred or advantageous, for example, over other embodiments or implementations; rather, it is intended to reflect or indicate that the embodiment(s) is/are “example” embodiment(s). Subject matter may be embodied in a variety of different forms and, therefore, covered or claimed subject matter is intended to be construed as not being limited to any exemplary embodiments set forth herein; exemplary embodiments are provided merely to be illustrative. Likewise, a reasonably broad scope for claimed or covered subject matter is intended. Among other things, for example, subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware, or any combination thereof. The following detailed description is, therefore, not intended to be taken in a limiting sense.
[0031] Throughout the specification and claims, terms may have nuanced meanings suggested or implied in context beyond an explicitly stated meaning. Likewise, the phrase “in one embodiment” or “in some embodiments,” or “in one aspect” or “in some aspects” as used herein does not necessarily refer to the same embodiment or aspect, and the phrase “in another embodiment” or “in another aspect” as used herein does not necessarily refer to a different embodiment or aspect. It is intended, for example, that claimed subject matter include combinations of exemplary embodiments in whole or in part.
[0032] Referring now to FIG. 1, an exemplary system environment 100 is depicted that may be utilized to verify the similarity between samples. The system environment 100 may include participant (subject) 10, sample(s) 15, sample data 20, and computing device 102. Although depicted in FIG. 1 as components all belonging to a single computing device 102, it should be understood that one or more components, or portions thereof, may, in some embodiments, be integrated with or incorporated on other devices. For example, computing device 102 may be a user device that may be configured to interact with another device on which genotype signature component 105 may be incorporated. In some embodiments, operations or aspects of one or more of the components listed above may be distributed amongst one or more other components. The one or more other components may be physically co-located or may be physically distributed (e.g., in a cloud computing environment). The one or more components may be owned and operated by one or more owners, although the overall orchestration of the components relevant to this disclosure can be performed at the direction of a single entity. Any suitable arrangement and/or integration of the various systems and devices of the environment 100 may be used.
[0033] In some embodiments, the components of the computing device 102 may be associated with a common entity (e.g., a single business or organization, etc.). Alternatively, one or more of the components may be associated with a different entity than another. In some embodiments, the computing device 102 may be a computer system such as, for example, a desktop computer, a mobile device, a tablet device, a laptop computer, a hybrid device, etc. The computing device 102 may include a display/user interface (UI) 102A, a processor 102B, a memory 102C, a database 102D, and/or a network interface 102E. The computing device may execute, by the processor 102B, an operating system (O/S) and at least one electronic application (each stored in memory 102C). The electronic application may be a desktop program, a browser program, a web client, or a mobile application program (which may also be a browser program in a mobile O/S), system control software, system monitoring software, software development tools, or the like. In an aspect, the application may manage the memory 102C, such as a database, to store and provide genotype signatures associated with certain samples. The display/UI 102A may be a touch screen or a display with other input systems (e.g., mouse, keyboard, etc.) so that the user(s) may interact with the application and/or the O/S. The network interface 102E may be a TCP/IP network interface for, e.g., Ethernet or wireless communications with a network (not illustrated). The processor 102B, while executing the application, may generate data and/or receive user inputs from the display/UI 102A and/or receive/transmit messages to external components.
[0034] The electronic application, executed by processor 102B of computing device 102, may generate one or many points of data that can be accessed, viewed, and/or interacted with by a user of the computing device 102. As an example, the electronic application may enable users to view, edit, and control processing of sequence reads associated with received genomic data. A user may further utilize the electronic application to generate and compare genotype signatures between samples, as further described herein.
[0035] The computing device 102 may include an electronic data system, computer-readable memory such as a hard drive, flash drive, disk, etc. In some embodiments, the computing device 102 includes and/or interacts with an application programming interface for exchanging data to other systems, e.g., one or more of the other components of the environment. The computing device 102 may include and/or act as the host for an application platform (e.g., a sample comparison platform, etc.) that may be accessible by users and/or other components.
[0036] The processor 102B may include and/or execute instructions to implement a genotype signature component 105, which may include encoding module 105A and concordance result module 105B. Encoding module 105A may be configured to encode the genetic data associated with a sample into a numerical representation and assemble those numerals into a genotype signature. Concordance result module 105B may be configured to compare the generated genotype signature against one or more other genotype signatures associated with other samples. The comparison process may ultimately determine whether two samples are attributable to an individual. In an embodiment, encoding module 105A and concordance result module 105B may both be contained within genotype signature component 105 on computing device 102. Alternatively, in another embodiment, one or both of the foregoing modules may reside on other components associated with the system environment 100. For example, encoding module 105A may reside on computing device 102 and concordance result module 105B may reside on another computing device or server (not illustrated).
[0037] Referring now to FIG. 2, a process for encoding genomic data from a sample into a genotype signature is provided, according to one or more aspects of the present disclosure. In general, the encoding process described herein is utilized to represent the different genetic variants found at selected SNP positions in a consistent and compact manner. Since there are different possible alleles for each SNP, the encoding process assigns specific numeric values to each possibility. This numeric representation ultimately simplifies the subsequent comparison of genotype signatures, which allows for efficient data storage, and also enables sample comparisons to be conducted quickly across a large scale of samples.
[0038] At step 205, a total number of SNPs within sequenced sample data 20 received at encoding module 105A of genotype signature component 105 may be identified. More particularly, genetic material is sequenced from both the forward and reverse strands of the DNA. Each strand provides information about the nucleotides present at specific positions. SNPs are identified by examining differences in nucleotide sequences between the forward and reverse strands. Forward and reverse strand SNP pileup counts may be determined by analyzing the sequencing data and a pileup count may indicate the number of times a specific nucleotide is observed at a particular position in the sequenced reads. By comparing the nucleotide sequences from the forward and reverse strands, positions where differences (e.g., SNPs) are observed are identified. These differences may include substitutions, insertions, or deletions of nucleotides. The cumulative number of identified SNPs across the genome constitutes the total number of SNPs, which reflects the variability in genetic information among individuals at specific genomic positions.
[0039] At step 210, encoding module 105A may be configured to retain a first subset of the total number of identified SNPs associated with the sample data. In an aspect, the first subset of SNPs may be specifically chosen based on its correlation to a reference subset. The reference subset may be a pre-defined collection of SNPs that have been determined to be indicative of ethnic identity or genetic variation. More particularly, the reference subset of SNPs may serve as a representative collection of genetic markers that are known (based on prior research, public databases, publications, etc.) to exhibit variations among different ethnic groups. These SNPs may have been previously identified as being strongly associated with different ethnic backgrounds or populations and may provide information about how genetic sequences can differ between individuals of different ethnicities.
[0040] In an aspect, the selections of the SNPs in the reference subset may be based on research and analysis. More particularly, genetic studies involving diverse populations may be conducted to identify SNPs that are highly informative of ethnic identity. Researchers may look for SNPs that are consistently different across specific ethnic groups while being relatively stable within those groups. For instance, if a reference genome shows a certain DNA base (e.g., adenine or “A”) at a particular position, researchers may have identified that in individuals of one ethnicity, that “A” may be a guanine or “G,” while in individuals of another ethnicity, it may be a thymine or “T.”
[0041] In an aspect, one or more different computational techniques/processes may be employed to generate the reference subset of SNPs from a larger SNP database. These techniques may involve one or more various extraction, filtration, reduction, and/or comparison steps. For instance, all SNPs from a database (e.g., 1000 Genomes Project, dbSNP, or other population genetics resource) may be initially extracted. In some aspects, the variability (or allele frequency) of each SNP across different populations may be then calculated, whereby SNPs with a high degree of variability (meaning they show distinct allele frequencies in different ethnic groups) may be strong candidates for inclusion in the reference subset. In some aspects, duplicates or highly correlated SNPs may be removed, and SNPs that have been extensively studied and confirmed in multiple independent research studies may be prioritized. In some aspects, one or more validation processes may be utilized to test the selected SNPs on independent datasets or populations to ensure their reliability in distinguishing ethnic backgrounds.
[0042] In an aspect, the steps in a reference SNP subset selection process may be performed wholly by a computer. For example, machine learning algorithms may be leveraged to automatically identify and select SNPs that best differentiate between particular types of population groups. Alternatively, in another aspect, one or more steps may require manual human involvement. For instance, qualified individuals (e.g., researchers, geneticists, bioinformaticians, etc.) with expertise in population genetics may review outcomes at different points in the SNP selection process and make informed decisions based thereon.
[0043] As one non-limiting example of a possible SNP selection technique, in an aspect, the reference subset of SNPs may be identified using a process that involves a plurality of sequential filtering steps. For instance, previous research and analysis may identify approximately 1 million total SNPs that are present within a sample group. In a first filter step, all insertion-deletion (“indel”) SNPs, multi-allelic SNPs, and/or guanine “G”/cytosine “C” SNPs (e.g., when using a DNA methylation assay) may be removed from the original set because reliable data may not be available on those SNP types/positions. In a second filter step, all SNPs that are missing in all of the training data samples (e.g., after each of the samples has gone through a pre-filtering process) may be removed, thereby reducing the total number of SNPs from 1 million to approximately 110,000. In a third filter step, all SNPs missing in greater than a predetermined percentage (e.g., more than 80%) of the samples may be removed, thereby reducing the SNP count to approximately 53,000. In a fourth filter step, a more stringent filter may be employed (e.g., a MAF filter of between 0.2 and 0.8) to only retain those SNP positions that vary within a population (e.g., derived from the “1000 Genomes Project”), thereby further reducing the SNP count to approximately 14,000. In a fifth filter step, a standard quality control step may be employed to remove the few SNPs not in Hardy-Weinberg equilibrium. In a sixth filter step, the correlated SNPs that are in significant linkage disequilibrium with each other may also be removed. The SNPs remaining after the fifth and sixth filtering steps may constitute the reference subset, and may number approximately 10,000.
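As a minimal sketch only, the third and fourth filter steps described above might be expressed as follows. The `SnpStats` record, its field names, and the `filter_candidates` helper are hypothetical and are not part of the disclosure; the default cutoffs simply mirror the example values given above, and the remaining steps (indel removal, Hardy-Weinberg and linkage disequilibrium filtering) are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpStats:
    """Hypothetical per-SNP summary; field names are illustrative only."""
    snp_id: str
    missing_fraction: float  # fraction of training samples lacking a call
    allele_frequency: float  # population frequency used by the frequency filter


def filter_candidates(snps, max_missing=0.8, af_low=0.2, af_high=0.8):
    """Keep SNPs that pass the missingness filter and the frequency filter."""
    kept = []
    for snp in snps:
        if snp.missing_fraction > max_missing:
            continue  # third filter step: missing in too many samples
        if not (af_low <= snp.allele_frequency <= af_high):
            continue  # fourth filter step: too little population variability
        kept.append(snp)
    return kept


candidates = [
    SnpStats("snp_a", 0.10, 0.45),  # passes both filters
    SnpStats("snp_b", 0.90, 0.50),  # missing in more than 80% of samples
    SnpStats("snp_c", 0.05, 0.02),  # frequency outside the 0.2 to 0.8 window
]
reference_subset = filter_candidates(candidates)
```

In this toy input only `snp_a` survives both filters.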
[0044] In view of the foregoing, the total number of identified SNPs associated with the sample data in step 205 may be narrowed down to the first subset of SNPs by identifying those SNP positions in the total number of SNPs associated with the sample data that are identical to the reference subset SNP positions. Although the number of reference subset SNP positions may change (for example, as the available SNP positions change due to a change in assay chemistry or panel density), an exemplary representative number of reference subset SNP positions utilized throughout this disclosure is 10,000. For instance, and as a non-limiting example, 1,000,000 total SNPs may have been identified at step 205 and the creation of the first subset of SNPs in step 210 may reduce that number to approximately 10,000. It is important to note that this number of reference subset SNP positions is not limiting and, in some embodiments, the number of reference subset SNP positions may be more or less (e.g., 10, 100, 1000, etc.).
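The narrowing performed at step 210 amounts to keeping only those sample SNPs whose positions appear in the reference subset. The sketch below illustrates this under the assumption that positions are keyed by (chromosome, coordinate) pairs; that key layout, the placeholder genotype strings, and the `retain_reference_positions` name are illustrative choices, not details from the disclosure.

```python
def retain_reference_positions(sample_snps, reference_positions):
    """Return only the sample SNPs whose position is in the reference subset."""
    return {
        pos: call
        for pos, call in sample_snps.items()
        if pos in reference_positions
    }


# Positions are (chromosome, coordinate) pairs; values are placeholder payloads.
sample_snps = {
    ("chr1", 1000): "A/G",
    ("chr1", 2500): "C/C",
    ("chr2", 300): "T/T",
}
reference_positions = {("chr1", 1000), ("chr2", 300), ("chr3", 42)}

first_subset = retain_reference_positions(sample_snps, reference_positions)
```

Storing the reference positions in a set keeps each membership check at constant time on average, which matters when roughly a million sample SNPs are screened against roughly ten thousand reference positions.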
[0045] At optional step 215, the first subset of SNPs associated with the sample data may be subject to a depth coverage filter (e.g., when using a DNA methylation assay). If step 215 is performed, encoding module 105A may be configured to filter out those SNPs in the first subset based on their depth of coverage. Coverage depth in sequencing data indicates how well a specific genomic position has been sequenced, representing the number of times the base at that position has been read. High coverage depth provides confidence in the accuracy of the sequenced base, while low coverage may lead to uncertainty or may indicate missing data. Additionally or alternatively, in another optional aspect, encoding module 105A may be configured to leverage a noise model to filter out those SNPs in the first subset that may be indicative of noise and that may not be needed in the genotype signature creation process. A noise model may be implemented at either step 210 or step 215 in order to filter out SNPs having insufficient depth range or that are indicative of noise.
[0046] In an aspect, two coverage thresholds may be defined: a minimum coverage threshold and a maximum coverage threshold. These thresholds may be configured to delineate the range of coverage depths that are acceptable for the selected SNPs. SNPs with coverage depths below the minimum coverage threshold may be considered to have insufficient coverage and may be considered more prone to sequencing errors or inaccuracies. SNPs with coverage depth exceeding the maximum coverage threshold may be considered to have too high coverage. More particularly, although high coverage provides confidence in the accuracy of the data, excessively high coverage may not significantly improve accuracy and in some instances, may indicate contamination. Accordingly, the SNPs from the first subset that fall within the defined coverage threshold (i.e., greater than the minimum coverage threshold but less than the maximum coverage threshold) may be retained for further processing as a second subset.
[0047] At step 220, the variant allele frequency (VAF) may be computed for each SNP in the second subset associated with the sample data. The VAF represents the proportion of alleles at each SNP position that differ from the reference allele. More particularly, at any given SNP position, there are two possible alleles: the reference allele (i.e., the allele present in the reference genome) and the alternate allele (the variant allele). The VAF is a measure of the prevalence of the alternate allele at a specific SNP position within an individual’s genetic sequence. It may be calculated as the ratio of the number of alternate alleles to the total number of alleles at that position. Resulting VAF values range from 0 to 1. A VAF of 0 indicates that all alleles at the SNP position are reference alleles, while a VAF of 1 indicates that all alleles are alternate alleles. Accordingly, a high VAF suggests a strong presence of the alternate allele (indicating a higher likelihood of a genetic variant at that position), whereas a low VAF indicates a dominance of the reference allele.
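The coverage filter and the VAF calculation can be illustrated together. The sketch below assumes that allele counts are taken from read pileups (reference reads and alternate reads per position); the depth thresholds, the pileup values, and both function names are hypothetical and chosen only for illustration.

```python
def within_coverage(depth, min_depth, max_depth):
    """True when depth is strictly between the minimum and maximum thresholds."""
    return min_depth < depth < max_depth


def variant_allele_frequency(ref_count, alt_count):
    """VAF = alternate-allele observations / all observations at the position."""
    total = ref_count + alt_count
    if total == 0:
        return None  # no reads at this position, so the VAF is undefined
    return alt_count / total


# Hypothetical pileup counts: SNP name -> (reference reads, alternate reads).
pileups = {
    "snp_1": (30, 0),
    "snp_2": (2, 1),    # depth 3, below the minimum threshold
    "snp_3": (14, 16),
    "snp_4": (0, 40),
}
MIN_DEPTH, MAX_DEPTH = 10, 500  # illustrative thresholds, not from the disclosure

# Second subset: only SNPs inside the coverage window receive a VAF.
vafs = {
    name: variant_allele_frequency(ref, alt)
    for name, (ref, alt) in pileups.items()
    if within_coverage(ref + alt, MIN_DEPTH, MAX_DEPTH)
}
```

Here `snp_2` is dropped by the coverage filter, `snp_1` and `snp_4` sit at the two extremes of the VAF range, and `snp_3` falls near the middle.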
[0048] At step 225, the genetic variant type at each SNP in the second subset associated with the sample data may be determined based on the calculated VAF at step 220. In an aspect, the genetic variant type may indicate whether the individual’s genetic sequence at each SNP position corresponds to a homozygous reference allele (i.e., where both alleles at the SNP position match the reference allele), a homozygous alternate allele (i.e., where both alleles at the SNP position match the alternate allele), or a heterozygous allele (i.e., where two alleles at the SNP position are different, with one being the reference allele and the other being the alternate allele). An additional classification may be assigned to those SNP positions where data is missing, e.g., due to low coverage at the SNP position.
[0049] In an aspect, the VAF calculated in step 220 for each SNP position associated with the sample data may provide a quantitative measure of the presence of the alternate allele. Based on the VAF value, the genetic variant type at each SNP position may be determined. For example, if VAF is below a first threshold (e.g., closer to 0), the alleles at the corresponding SNP position are primarily reference alleles, indicating a homozygous reference allele. If the VAF is higher than a second threshold (e.g., closer to 1), the alleles at the corresponding SNP position are primarily alternate alleles, indicating a homozygous alternate allele. If the VAF is between the first and second thresholds, the alleles at the SNP position are a mix of reference and alternate alleles, indicating a heterozygous allele. If the VAF cannot be accurately calculated due to insufficient coverage, the variant type for that SNP position is considered missing data. The missing data designation may indicate that the genetic information at that position is not reliable due to coverage limitations.
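A minimal sketch of this threshold-based classification follows. The first and second thresholds (0.2 and 0.8 here) are illustrative assumptions, since no numeric values are given above, and the use of `None` to signal a VAF that could not be computed is likewise a convention adopted only for this example.

```python
HOM_REF, HET, HOM_ALT, MISSING = "HOM_REF", "HET", "HOM_ALT", "MISSING"


def call_variant_type(vaf, low=0.2, high=0.8):
    """Classify a SNP from its VAF; None means the VAF could not be computed."""
    if vaf is None:
        return MISSING   # insufficient coverage at this position
    if vaf < low:
        return HOM_REF   # primarily reference alleles
    if vaf > high:
        return HOM_ALT   # primarily alternate alleles
    return HET           # mix of reference and alternate alleles


calls = [call_variant_type(v) for v in (0.02, 0.47, 0.98, None)]
```

A VAF that falls exactly on either threshold is treated as heterozygous in this sketch; the handling of boundary values is not specified in the text.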
[0050] At step 230, encoding module 105A may assign numeric values to represent the identified genetic variant types. More particularly, the determined genetic variant types at each SNP position within the second subset are encoded into bytes for ease of computationally-aided comparison. Each genetic variant type is assigned a unique numeric value, and these values are combined to create a compact binary representation, i.e., genotype signature 25.
[0051] In an aspect, each of the genetic variant types may be assigned a specific numeric value for encoding purposes. For instance:
• Homozygous Reference Allele (HOM_REF): Encoded as 0 (00)
• Heterozygous Allele (HET): Encoded as 1 (01)
• Homozygous Alternate Allele (HOM_ALT): Encoded as 2 (10)
• Missing Data: Encoded as 3 (11)
[0052] For each SNP position, the assigned numeric value corresponding to the determined genetic variant is combined into a sequence representing the entire genotype signature for the sample. In an aspect, encoding genetic variant types for the selected positions into bytes may result in a compact unique representation of the individual’s genetic profile: the genotype signature. This binary format requires little storage space, making it efficient for storage and data management. Additionally, the compact binary representation allows for rapid comparison and analysis against other samples, as further described herein.
[0053] Referring now to FIG. 3, an illustration of an example encoding process is provided. Diagram 30 provides a plurality of SNPs 32 (A-F) having nucleotide base designations for each allele at each SNP position. For instance, the two alleles associated with SNP-1 32A are A and A, the two alleles associated with SNP-3 32C are T and C, and no alleles were identified for SNP-2 32B (e.g., due to inadequate coverage). Section 34 identifies the nucleotide base in the reference genome and provides a genotype designation. For example, at the position for SNP-1 32A, the nucleotide base in the reference genome is “A” and is designated by 0/0. As another example, at the position for SNP-3 32C, the nucleotide base in the reference genome is “T” and is designated by 0/1.
[0054] Table 1, below, provides an encoding designation based on the genotype. For instance, Table 1 provides that each homozygous reference allele may be encoded as 0, each heterozygous allele may be encoded as 1, each homozygous alternate allele may be encoded as 2, and any types of missing data, such as at position SNP-2 32B, may be encoded as 3. Accordingly, the genotype signature for this illustrative sample may be 031201 (001101100001).
Table 1
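To make the byte encoding concrete, the sketch below packs the two-bit codes four to a byte and checks the result against the six-SNP example from FIG. 3. The byte layout (most significant bits first, final byte zero-padded) and the `pack_signature` and `unpack_signature` helpers are assumptions made for illustration; the disclosure does not specify a particular layout.

```python
CODES = {"HOM_REF": 0, "HET": 1, "HOM_ALT": 2, "MISSING": 3}


def pack_signature(variant_types):
    """Pack one 2-bit code per SNP into bytes, four SNPs per byte."""
    packed = bytearray()
    for start in range(0, len(variant_types), 4):
        chunk = variant_types[start:start + 4]
        byte = 0
        for variant in chunk:
            byte = (byte << 2) | CODES[variant]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial chunk
        packed.append(byte)
    return bytes(packed)


def unpack_signature(packed, n_snps):
    """Recover the numeric codes; n_snps discards the zero padding."""
    codes = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            codes.append((byte >> shift) & 0b11)
    return codes[:n_snps]


# The six-SNP example from FIG. 3: codes 0, 3, 1, 2, 0, 1.
example = ["HOM_REF", "MISSING", "HET", "HOM_ALT", "HOM_REF", "HET"]
signature = pack_signature(example)
bits = "".join(f"{b:08b}" for b in signature)[: 2 * len(example)]
```

With this layout the six codes occupy two bytes, and the first twelve bits reproduce the binary string 001101100001 given in the example. Because padding bits are indistinguishable from homozygous reference codes, the SNP count has to be stored or known separately.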
[0055] Referring now to FIG. 4, a process for comparing genotype signatures of different samples is provided, according to one or more aspects of the present disclosure. In an aspect, the goal of genotype signature comparison between two samples is to assess the degree of genetic similarity or dissimilarity.
[0056] At step 405, genotype signatures 40, 45 associated with the two samples being compared may be received at a concordance result module 105B. Genotype signatures 40, 45 may have been identified using the method described in reference to FIG. 2, described above. Thereafter, a total number of SNPs at specified locations may be identified for the first and second signatures. The specified locations may be selected so as to provide sufficient variability such that they can be used as identifying information. The total number of SNPs at common locations may correspond to the positions where both signatures have genetic variant information.
[0057] As an example of the foregoing, consider two genotype signatures, Signature A and Signature B. Signature A has genetic variant information at SNP positions 1, 3, 5, and 7. Signature B has genetic variant information at SNP positions 2, 4, 5, and 7. SNP positions 5 and 7 are common SNPs because both Signature A and Signature B have genetic variant information at these positions. SNP positions 1, 2, 3, and 4 are not common SNPs because only one of the signatures has genetic variant information at each of these positions.
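The Signature A and Signature B example can be reproduced with a short sketch. It assumes each signature is held as a mapping from SNP position to numeric code, with code 3 marking missing data as in the encoding step; the specific non-missing codes below are arbitrary, and `common_positions` is a hypothetical helper name.

```python
MISSING = 3  # code assigned to missing data in the encoding step


def common_positions(sig_a, sig_b):
    """Positions where both signatures carry a non-missing variant code."""
    return sorted(
        pos
        for pos in sig_a.keys() & sig_b.keys()
        if sig_a[pos] != MISSING and sig_b[pos] != MISSING
    )


# Signature A has information at positions 1, 3, 5, 7; Signature B at 2, 4, 5, 7.
signature_a = {1: 0, 2: 3, 3: 1, 4: 3, 5: 2, 7: 0}
signature_b = {1: 3, 2: 1, 3: 3, 4: 0, 5: 2, 7: 1}

shared = common_positions(signature_a, signature_b)
```

The result contains only positions 5 and 7, matching the example.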
[0058] At step 410, a threshold check may be performed to determine whether the total number of common SNPs between the first and second genotype signatures is greater than a predetermined threshold. This threshold check serves as a criterion for determining if there is a sufficient level of overlap between the two signatures to proceed with further comparison. In an aspect, the predetermined threshold value may vary based on factors such as the nature of the genetic data, the goals of the comparison, and the desired level of statistical significance. For instance, the predetermined threshold for the comparison of two genotype signatures associated with samples of the same type (e.g., two cfDNA samples) may be higher than if the two genotype signatures were associated with samples of different types (e.g., one cfDNA sample vs. one tissue sample or two tissue samples). Additionally or alternatively, as another example, the predetermined threshold may be higher and more stringent if the sample analysis was associated with a disease state decision or treatment recommendation than it may be for a research study.
[0059] Responsive to determining, at step 410, that the total number of common SNPs is not greater than the predetermined threshold, an embodiment may designate, at step 415, a result identifying that there is insufficient overlap between the two genotype signatures for a meaningful comparison. This may be due to factors such as low coverage of genetic data or limited shared genetic information. Accordingly, an insufficient overlap may halt the comparison from proceeding further. In contrast, responsive to determining, at step 410, that the total number of common SNPs was greater than the predetermined threshold, an embodiment may conclude that there was a sufficient amount of overlap between the two signatures and proceed further in the process to step 420.
[0060] At step 420, the total number of SNPs that have identical variant calls (e.g., homozygous reference, homozygous alternate, and heterozygous) between first and second genotype signatures 40, 45 may be determined. To facilitate this determination, the variant call for each position in each genotype signature may be identified, and if the variant call at a position is the same in both signatures, then the total count may be incremented.
[0061] For example, two genotype signatures may be considered, Signature X and Signature Y. The signatures share common SNP positions at positions 5 and 9. At position 5, Signature X may have a homozygous reference call, and Signature Y may also have a homozygous reference call. Because both signatures have the same variant call, the SNP at position 5 may be considered identical between signatures. Conversely, at position 9, Signature X may have a homozygous reference call, and Signature Y may have a heterozygous call. Because the signatures have different variant calls, the SNP position at position 9 may be considered not identical.
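The counting at step 420 can be sketched using the Signature X and Signature Y example. The position-to-code mapping and the `count_identical` helper are illustrative assumptions; the codes follow the encoding described for step 230.

```python
def count_identical(sig_x, sig_y, positions):
    """Count positions whose variant codes agree in both signatures."""
    return sum(1 for pos in positions if sig_x[pos] == sig_y[pos])


# 0 = homozygous reference, 1 = heterozygous (codes from the encoding step).
signature_x = {5: 0, 9: 0}
signature_y = {5: 0, 9: 1}

identical = count_identical(signature_x, signature_y, positions=[5, 9])
```

Only position 5 agrees, so the count of identical SNPs for this pair is 1.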
[0062] At step 425, a concordance ratio between the two genotype signatures 40, 45 may be computed. The concordance ratio provides a quantitative measure of the degree of genetic similarity between the two signatures. It may help in assessing the degree of overlap in genetic variant types and provides insight into how closely the genetic profiles of the two samples match at the common SNP positions. The concordance ratio may be computed by dividing the number of identical SNPs (determined in step 420) by the total number of common SNPs (determined in step 405). Mathematically, in some examples, the formula for computing the concordance ratio is: Concordance Ratio = Number of Identical SNPs / Total Number of Common SNPs. The concordance ratio ranges between 0 and 1, where a concordance ratio of 0 indicates no genetic agreement (i.e., no identical SNPs) between the signatures, whereas a concordance ratio of 1 indicates complete genetic agreement (all common SNPs are identical) between the signatures. As a non-limiting example of the foregoing, Signature X and Signature Y may have 5,000 common SNPs, out of which 4,000 SNPs have identical variant calls. The concordance ratio would be calculated as: 4,000 / 5,000 = 0.8. In this example, the concordance ratio is 0.8, which means that 80% of the common SNPs have identical variants.
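The concordance ratio calculation is a single division, sketched below with the numbers from the example. The `concordance_ratio` name and the choice to raise an error when there are no common SNPs are assumptions of this sketch; in the described process, the overlap check at step 410 would normally prevent that case from reaching this step.

```python
def concordance_ratio(n_identical, n_common):
    """Number of identical SNPs divided by the number of common SNPs."""
    if n_common <= 0:
        raise ValueError("no common SNPs; the ratio is undefined")
    return n_identical / n_common


# 4,000 identical calls out of 5,000 common SNPs.
ratio = concordance_ratio(4_000, 5_000)
```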
[0063] Additionally or alternatively to the foregoing, in another aspect, a kinship coefficient (φ), or “coefficient of relationship,” may be calculated. The kinship coefficient may correspond to the measure of genetic relatedness between two samples. It may be calculated based on the sharing of alleles at specific genetic markers, e.g., SNPs, and may provide insight into how closely two or more samples are related by blood, or, in this case, how likely two samples are to be from the same individual. In an aspect, the kinship coefficient may be calculated by identifying, at each SNP, how many alleles are shared, which may be one of three possibilities: homozygous reference (e.g., if both individuals have the same reference allele at a given SNP), heterozygous reference (e.g., if one individual has the reference allele and the other has a variant allele), or homozygous variant (e.g., if both individuals have the same variant allele at a given SNP). The kinship coefficient may be calculated by summing the shared status at each SNP (e.g., 0 for homozygous reference, 1 for heterozygous, or 2 for homozygous variant) and then averaging across all of the total number of common SNPs. The resulting coefficient may range from 0 (indicating no genetic relatedness) to 1 (indicating complete genetic identity). In an aspect, the values may indicate different degrees of genetic relatedness (e.g., a value of 0.25 may suggest a grandparent-grandchild relationship or an uncle-aunt/niece-nephew relationship, a value of 0.5 may indicate a parent-child or a sibling relationship, and a value of 1 may suggest identical twins). In this case, a value of 1 may indicate that the samples are from the same individual, whereas increasingly smaller values may indicate an increasingly greater likelihood that two samples are not from the same individual.
[0064] At step 430, the computed concordance ratio in step 425 may be compared against a concordance threshold. The threshold value may be used as a criterion in determining whether the genetic profiles are similar enough to be considered a positive match. In an aspect, the threshold value may be higher or lower based on certain factors, as described above. For instance, the predetermined threshold for the comparison of two genotype signatures associated with samples of the same type (e.g., two cfDNA samples) may be higher than if the two genotype signatures were associated with samples of different types (e.g., one cfDNA sample vs. one tissue sample). Additionally or alternatively, as another example, the predetermined threshold may be higher and more stringent if the sample analysis was associated with a disease state decision or treatment recommendation than it may be for a research study.
[0065] In an aspect, responsive to determining, at step 430, that the concordance ratio is greater than the concordance threshold, an embodiment may generate, at step 440, a result indicating a match between the genotype signatures. Conversely, responsive to determining, at step 430, that the concordance ratio is less than the concordance threshold, an embodiment may generate, at step 445, a result indicating a mismatch between the genotype signatures.
[0066] In some embodiments, situations may exist in which the genotype signature of a particular sample is compared against the genotype signatures of a plurality of other samples. At the conclusion of the comparison process (e.g., as represented in FIG. 4), a subset of samples in the plurality may be identified as being matched with the particular sample (e.g., at step 440). More particularly, a concordance ratio generated from the comparison of the genotype signatures of the particular sample and one of the subset samples may be determined to be greater than the concordance threshold identified in step 430. As a non-limiting example, the genotype signature of a first sample may be compared against the genotype signatures of 1 million other samples. After the comparison process depicted in FIG. 4 is complete, approximately 1000 samples may be found to contain genotype signatures that match the genotype signature of the first sample. In this circumstance, the resultant subset of samples may then be passed to one or more different modules so that more granular comparison processes may be conducted (e.g., on the sequencing data associated with each sample) to ultimately identify a 1:1 match. In this way, the concordance threshold may act as an initial filter on a pool of data. In an aspect, the threshold may be adjustable based on need and/or context (e.g., the threshold may be increased or decreased based on the desired goals of the comparison process).
[0067] In an aspect, the concordance threshold value may change based on the type of samples that the genotype signatures are associated with. More particularly, different types of biological samples (e.g., blood, urine, tissue, etc.) may vary in the extent of genetic variation they contain. Some samples, such as blood, may have a relatively stable and consistent genetic profile, while others, like tumor tissue, may exhibit higher levels of genetic heterogeneity due to mutations and clonal evolution. Accordingly, a first concordance threshold may be established for the comparison of two genotype signatures that are both associated with cfDNA, whereas a second concordance threshold may be established for the comparison of samples with inherently higher genetic variation (e.g., cfDNA vs. tissue). The second concordance threshold may be more relaxed to account for the expected diversity between the samples.
[0068] The concordance threshold may additionally or alternatively be adjusted based on other factors as well. For instance, in longitudinal studies involving repeated sampling from the same participant over time, genetic changes may occur due to factors like disease progression, treatment response, natural aging, or one or more other natural variations. For such studies, the concordance threshold may be adjusted to accommodate expected genetic drift while still identifying samples from the same individual as matching. As another example, the clinical context and specific application of the genotype signature comparison may at least in part influence the choice of the concordance threshold. For example, in diagnostic scenarios in which accurate patient identification is crucial, a stricter concordance threshold may be chosen to reduce the risk of false positive matches. On the other hand, in research studies focusing on broader population analysis, a more lenient threshold may be applied to reflect the wider range of genetic diversity between samples. In yet another example, genetic variations may also differ between different ethnic populations. If the samples being compared are from diverse ethnic backgrounds, the concordance threshold may be tailored to account for population-specific genetic variation.
[0069] Additionally or alternatively to the foregoing, the concordance value may provide various non-binary decisions or insights that are not limited to binary “match” or “mismatch” outcomes. For instance, upon analyzing the genetic profiles of the individuals associated with two samples, A and B, the concordance value may provide insight into the degree of genetic relatedness, or the likelihood that two samples are from the same individual. For example, a concordance value of 0.8 may indicate that 80% of the common SNPs between samples A and B have identical variants. Based on this, it may be concluded from the concordance value that the individuals associated with samples A and B share a significant portion of their genetic variants, suggesting a high degree of genetic similarity or relatedness, or may indicate the likelihood that the samples are from the same individual. In yet another example, the concordance value may be used to determine whether samples A and B are associated with a single individual, and the system of the embodiments may be configured to output a confidence percentage that the samples are from the same person. In other words, the system may output a degree of genetic similarity between the two samples.
[0070] Referring now to FIG. 5, an exemplary process flow 500 is depicted for determining whether two samples are matched to a subject, according to one or more aspects of the present disclosure. The exemplary process flow 500 may be implemented by system environment 100 and may incorporate the encoding and comparing processes described with reference to FIGs. 2 and 4.
[0071] At step 505, genomic data associated with a first sample may be received at a genotype signature component 105 of a computing device 100. The genomic data may include sequenced sample data 20 from a sample 15 belonging to a participant 10. In an aspect, sample 15 may be a blood sample, tissue sample, bone marrow sample, urine sample, saliva sample, plasma sample, etc. In an aspect, a single sample may be collected from a subject or, alternatively, multiple samples may be collected from the subject (e.g., multiple samples may be collected for the subject at a single time point, multiple samples may be collected for the subject across two or more different time points, etc.). Samples collected at the same time and/or samples collected at different times may be subsequently processed at different times. In an aspect, sequenced sample data 20 may be generated using one of a variety of different sequencing techniques on extracted DNA from sample 15.
[0072] At step 510, the sequenced sample data 20 may be processed by encoding module 105A of genotype signature component 105 to generate a genotype signature. The encoding process facilitated by encoding module 105A is configured to convert genetic information, specifically the genetic variants present in the DNA sample, into a compact and standardized numerical representation, i.e., the genotype signature. This process simplifies and condenses the genetic data, making it easier to compare and analyze samples efficiently. Encoding may begin by first identifying the total number of SNPs present in sequenced sample data 20. From the total number of identified SNPs, a first subset of SNPs may be retained. This subset may be chosen based on its correlation with an indicative reference subset of SNPs. These reference SNPs may be selected because they are indicative of identity and may represent positions where DNA bases may change in different individuals, thereby helping to identify samples from other individuals. With the retained first subset of SNPs, a second subset may be created based on coverage depth. More particularly, the second subset may include SNPs with coverage depths falling within a range defined by a minimum and maximum coverage threshold, as described above. This ensures that only SNPs with reliable and consistent data are included in the genotype signature. Next, for each SNP in the coverage-filtered subset, the VAF may be calculated to identify how common a particular genetic variant is within the sample. Based on the computed VAF, the genetic variant at each SNP is classified into categories, such as homozygous reference allele, homozygous alternate allele, heterozygous allele, or missing data. These categories provide insight into genetic differences and similarities between samples. The identified genetic variant categories may thereafter be encoded into numerical values. A specific numerical value may be assigned to each category (e.g., homozygous reference is assigned 0, homozygous alternate is assigned 2, heterozygous is assigned 1, and missing data is assigned 3). The encoded numerical values for each SNP position are combined to create the genotype signature for the sample.
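The encoding steps above may be sketched, in a non-limiting manner, as follows. The sketch assumes per-SNP reference and alternate read counts are already available. The depth limits (20 and 1000), the VAF cut-offs (0.2 and 0.8), and all function and parameter names are hypothetical values chosen for illustration and are not taken from the disclosure. As a further assumption, SNPs that fail the coverage filter or are absent from the data are recorded with the missing-data code so that every signature has the same length.

```python
# Numerical codes from the text: 0 hom-ref, 1 het, 2 hom-alt, 3 missing.
HOM_REF, HET, HOM_ALT, MISSING = 0, 1, 2, 3


def encode_snp(ref_reads, alt_reads, min_depth=20, max_depth=1000,
               het_low=0.2, het_high=0.8):
    """Classify one SNP from read counts; all thresholds are hypothetical."""
    depth = ref_reads + alt_reads
    # Coverage filter: depth must lie within [min_depth, max_depth].
    if depth == 0 or depth < min_depth or depth > max_depth:
        return MISSING
    vaf = alt_reads / depth  # variant allele fraction
    if vaf < het_low:
        return HOM_REF
    if vaf > het_high:
        return HOM_ALT
    return HET


def encode_signature(read_counts, reference_snps):
    """Build a signature over an ordered reference SNP list.

    read_counts maps a SNP identifier to (ref_reads, alt_reads).
    """
    signature = []
    for snp_id in reference_snps:
        if snp_id not in read_counts:
            signature.append(MISSING)  # SNP not observed in this sample
        else:
            signature.append(encode_snp(*read_counts[snp_id]))
    return signature
```

For example, with the hypothetical defaults, a SNP covered by 25 reference and 25 alternate reads has a VAF of 0.5 and is encoded as heterozygous (1), whereas a SNP covered by only 5 reads falls below the minimum depth and is encoded as missing (3).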
[0073] At step 515, the genotype signature derived by encoding module 105A may then be utilized by concordance result module 105B to identify whether the genotype signature matches a second genotype signature associated with a second sample. In this regard, the comparison process involves assessing the similarity and concordance between the two genotype signatures, which may be used to determine whether the two samples likely originate from the same individual, different individuals, or whether there are discrepancies that require further investigation. The comparison process may begin by identifying the total number of SNPs that are common between the two genotype signatures. These common SNPs represent the genetic positions where both samples have been evaluated for similarity. Once the common SNPs are identified, the comparison may evaluate whether the total number of common SNPs is greater than a predetermined threshold. This threshold serves as a criterion for determining whether there is a meaningful amount of overlap in the genetic information being evaluated. If the number of common SNPs falls below the threshold, it suggests that there may not be sufficient overlap for a reliable comparison. Assuming the number of common SNPs surpasses the threshold, the comparison proceeds to identify the subset of common SNPs that have matching variant calls (i.e., genetic variant types) between the two genotype signatures. These identical SNPs are positions where the genetic information aligns between the samples, as described above. The next step involves computing the concordance ratio between the two genotype signatures, which provides a quantitative measure of the genetic similarity between the two samples. The concordance ratio may be obtained by dividing the number of identical SNPs by the total number of common SNPs.
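The comparison described above may be sketched, in a non-limiting manner, as follows. The sketch assumes two equal-length signatures using the 0/1/2/3 codes, in which 3 denotes missing data, so that a SNP is “common” when neither signature is missing at that position. The function name, the default minimum of three common SNPs, and the choice to return None when overlap is insufficient are illustrative assumptions.

```python
MISSING = 3  # missing-data code used in the genotype signature


def concordance_ratio(sig_a, sig_b, min_common=3):
    """Return identical / common SNPs, or None if overlap is insufficient.

    min_common is a hypothetical placeholder for the predetermined
    minimum number of common SNPs.
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must cover the same SNP positions")
    # Common SNPs: positions called (not missing) in both signatures.
    common = [(a, b) for a, b in zip(sig_a, sig_b)
              if a != MISSING and b != MISSING]
    if len(common) < min_common:
        return None  # too little overlap for a reliable comparison
    identical = sum(1 for a, b in common if a == b)
    return identical / len(common)
```

In this sketch, signatures [0, 1, 2, 3, 1] and [0, 1, 1, 2, 1] share four common SNPs, three of which carry identical calls, giving a concordance ratio of 0.75.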
[0074] In an aspect, biological samples may be collected from a participant at different points in time to facilitate longitudinal studies, track disease progression, or assess treatment responses. Embodiments of the disclosure may be used to determine whether the two samples likely originate from the same individual, different individuals, or whether there are discrepancies between the genotypic signatures for the different samples that require further investigation. For instance, the first sample may be obtained from a participant at an initial visit, representing the “first time” in the study timeline. This sample may be used for analysis at that time, such as genotyping or other molecular assays. The second sample described above may be obtained from the same participant during a follow-up visit at a later date, e.g., a “second time,” which may be several days, weeks, or months after the first collection. Alternatively, in some studies, multiple samples may be collected at a single time point, even though only a portion of them is processed at that time point. For example, in certain clinical trials, multiple tubes of blood may be drawn from a single participant during a single visit. While a subset of the tubes of blood may be processed at that time for initial testing and analysis, the remaining subset of the tubes of blood may be stored for future use. These stored samples may be analyzed later without requiring an additional collection from the participant. In this scenario, the first and second time points for sample procurement and/or processing may be substantially the same. In this aspect, too, embodiments of the disclosure may be used to determine whether the samples taken at the same time but analyzed at different time points likely originate from the same individual, different individuals, or whether there are discrepancies between the genotypic signatures for the different samples that require further investigation.
[0075] At step 520, the computed concordance ratio is then compared against a predetermined concordance threshold, which represents the level of genetic similarity required to consider the two samples a match. If the computed concordance ratio is below the threshold, it indicates that the genetic information is not concordant enough to establish a match, and system 100 may output, at step 525, a genotype signature mismatch result. In some aspects, the genotype signature mismatch result may include an indication that the samples are not from the same individual, or that there is sufficient variability between the two samples to call into question whether the two samples are from the same individual. Alternatively, in some aspects, if the computed concordance ratio is below the predetermined concordance threshold, but above a lower second predetermined concordance threshold, then the genotype signature mismatch result may provide an indication that inconclusive results were achieved and may provide a recommendation to rerun the analysis. Conversely to the foregoing, if the ratio exceeds the threshold(s), it suggests that the genetic information is sufficiently concordant, and system 100 may output, at step 530, a genotype signature match result.
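The threshold logic of steps 520 through 530, including the optional lower second threshold, may be sketched as follows. The threshold values of 0.95 and 0.80, the function name, and the returned labels are hypothetical. The handling of a ratio exactly equal to a threshold is not specified above; the sketch assumes that a ratio must strictly exceed a threshold to pass it.

```python
def classify_comparison(ratio, match_threshold=0.95,
                        inconclusive_threshold=0.80):
    """Map a concordance ratio to 'match', 'inconclusive', or 'mismatch'.

    Both thresholds are illustrative placeholders, not disclosed values.
    """
    if not 0.0 <= inconclusive_threshold <= match_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if ratio > match_threshold:
        return "match"         # step 530: genotype signature match result
    if ratio > inconclusive_threshold:
        return "inconclusive"  # below upper, above lower: consider a rerun
    return "mismatch"          # step 525: genotype signature mismatch result
```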
[0076] Provided below are a plurality of exemplary use cases in which the processes described herein may be leveraged to address various types of sample mismatches that may occur in conventional settings.
[0077] In a first exemplary situation, the foregoing processes may be utilized to identify an unexpected mismatch, e.g., a situation in which a sample matched to a subject is not actually derived from that subject. For example, a set of 5 samples may be conventionally identified (e.g., based on manually recorded clinical data associated with each sample) as originating from a single subject (e.g., Subject A). In actuality, one of the 5 samples may not be derived from Subject A but may, in fact, be derived from another subject. To facilitate this mismatch identification, the system described herein may compare the genotype signatures of each sample of Subject A against one another to identify a sample that is not matched with the rest of the sample set.
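One non-limiting way to sketch the pairwise comparison described in this first situation is shown below. The sketch assumes signatures using the 0/1/2/3 codes (3 denoting missing data), a hypothetical concordance threshold of 0.95, and a simple majority rule under which a sample is flagged if it matches fewer than half of the other samples attributed to the subject. The majority rule and all names are assumptions made for illustration.

```python
from itertools import combinations


def _concordance(sig_a, sig_b, missing=3):
    """Fraction of commonly called SNPs with identical codes."""
    common = [(a, b) for a, b in zip(sig_a, sig_b)
              if a != missing and b != missing]
    if not common:
        return 0.0
    return sum(a == b for a, b in common) / len(common)


def find_outlier_samples(signatures, threshold=0.95):
    """Return IDs of samples that match fewer than half of the others.

    signatures maps a sample ID to its genotype signature.
    """
    match_counts = {sample_id: 0 for sample_id in signatures}
    # Compare every pair of samples attributed to the same subject.
    for (id_a, sig_a), (id_b, sig_b) in combinations(signatures.items(), 2):
        if _concordance(sig_a, sig_b) > threshold:
            match_counts[id_a] += 1
            match_counts[id_b] += 1
    n_others = len(signatures) - 1
    return sorted(sample_id for sample_id, count in match_counts.items()
                  if count < n_others / 2)
```

Applied to five samples in which four share a signature and one differs, the sketch returns only the differing sample's identifier, which may then be compared against the samples of other subjects as described in the second situation below.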
[0078] In a second exemplary situation, the foregoing processes may be utilized to identify an unexpected match, e.g., a situation in which a given sample matches a sample set that it may not have been originally paired with. For instance, continuing on with the previous example, upon generating a genotype signature associated with the originally mismatched sample, a user may identify that the generated genotype signature of the originally mismatched sample matches one or more samples associated with another subject, e.g., Subject B. In this situation, the originally mismatched sample may then be paired with the appropriate sample set.
[0079] In a third exemplary situation, the subject that a mismatched sample may have been derived from may not always be known. This may be due to various factors, including: the swap occurring before receipt of the samples by the receiving organization, the swap occurring between studies, or a false negative result (e.g., subject had marrow transplant between intervals, non-cancer to cancer transitions, etc.). In such a situation, if any tests were run using the mismatched sample, then the results of those tests may be discarded and/or not returned to the subject.
[0080] In a fourth exemplary situation, the process described herein may be utilized to identify why, in the absence of a mismatch, samples associated with separate subjects may be substantially similar to each other. For instance, the genotype signature comparison process may identify that all samples associated with Subject A share a genotype signature match with all samples associated with Subject B. In one circumstance, such an occurrence may result if Subjects A and B were identical twins. In another circumstance, samples received by a receiving organization may originate from different sources. In some situations, samples originating from the same subject may be received from these separate sources, but may not be appropriately grouped together (e.g., due to improper labeling or incomplete mapping, etc.). In another circumstance, this occurrence may be resultant from a false positive test, e.g., that indicates that all samples from two participants are matched when they really are not. Comparing genotype signatures for all samples may confirm whether a false positive was present or not.
[0081] In a fifth exemplary situation, the processes described herein may be utilized to identify whether a given sample may have any associations at all. For instance, circumstances may arise where researchers may be left with a lone sample, e.g., one that is not matched with any known subject. In these situations, the genotype signature comparison process may be utilized to determine whether the lone sample shares the same genotype signature as one or more control samples that were utilized in a particular assay. If the genotype signature comparison process reveals that the lone sample is a control sample, then that sample may be disregarded and/or paired with the appropriate grouping of control samples.
[0082] In a sixth exemplary situation, the genotype signature comparison process may not be able to return an informative result. For instance, the SNP intersection between the genotype signatures of the samples may be too low (e.g., as indicated and described with respect to step 410 in FIG. 4) to derive an appropriate concordance ratio. This may happen when the processing of one of the samples has “failed” and correspondingly yields low binary coverage. In this instance, sample results may not be returned to the subject and/or a sample redraw from the relevant subject may need to be implemented. Other situations that may prevent the sample matching process from working include those situations in which there is no key (e.g., where there is no output directory, etc.) or domain (e.g., when a study or domain is not approved or added to an “allowed” list, etc.) or when the data associated with the samples is not added to the domain.
[0083] In a seventh exemplary situation, the processes described herein may be utilized to identify that a particular sample shares no matches at all. More particularly, the genotype signature comparison process may return an indication that the genotype signature associated with a given sample has no resemblance to a genotype signature associated with the samples of any other subject or any other control sample.
[0084] In an eighth exemplary situation, the processes described herein may be utilized to identify and link participants who have taken repeated tests, but are initially treated as separate individuals in the system due to discrepancies in how their samples are registered. For example, in some instances, a participant may be supplied with test kits from different sources, such as both a provider portal and a patient portal, resulting in the system recognizing the samples collected from these two different kits as two distinct participants, rather than multiple samples from the same participant. However, by leveraging the genotype signatures described herein, the genetic information from both samples may be analyzed and compared, revealing that they actually belong to the same individual. This identification enables the clinical lab or other suitable facility to merge or link these participants within the system, thereby correctly consolidating the test results and participant data for accurate medical records and downstream assessments/processes.
[0085] In a ninth exemplary situation, the processes described herein may be utilized to address errors caused when inputting participant information into a data collection system, e.g., such as typos in participant names. For example, slight variations or spelling mistakes in participant names may cause the system to treat the same individual as two independent participants. By comparing the genotype signatures from the samples attributed to each name, the processes described herein may allow the lab to identify that the samples originate from the same person. This may prompt the lab to link the profiles, thereby rectifying the issue and ensuring that all relevant data is associated with the correct participant, or that different samples from the same subject aren’t considered samples from unique subjects.
[0086] In some aspects, if a sample match is output, an indication that the samples are contaminated may also be generated. For example, the second sample may be a control sample known to have been acquired from a different individual than the individual from whom the first sample was acquired. In this instance, a match output would not be appropriate in light of the different known sample origins, and thus the match output may be indicative of contamination or other errors. In other aspects, a mismatch output may likewise indicate sample contamination or other errors, for example, if the first and second samples are known to have been acquired from the same individual. As another example, the second sample could be a control sample from the same individual from whom the first sample was acquired, and a match output may thus indicate that the first and second samples were processed correctly. In this way, embodiments of the disclosure may function at least in part as quality control.
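The quality-control use described above may be sketched, in a non-limiting manner, by comparing the observed comparison result with the match status expected from the known sample origins. The function name and the returned labels are hypothetical.

```python
def quality_control_flag(observed_match, expected_match):
    """Compare an observed match result with the expected match status.

    Both arguments are booleans: True for a match, False for a mismatch.
    """
    if observed_match == expected_match:
        return "consistent"        # result agrees with known sample origins
    if observed_match:
        # e.g., a match against a control known to be from another individual
        return "unexpected_match"
    # e.g., a mismatch between samples known to be from the same individual
    return "unexpected_mismatch"
```

Either of the two unexpected outcomes may be treated as an indication of possible contamination or another processing error.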
[0087] In still other aspects, responsive to identifying a sample match or a sample mismatch between the first sample and the second sample, the sample match or mismatch may be compared to an expected sample match status between the first sample and the second sample. The results of the comparison of the sample match or sample mismatch with the expected sample match status may be reported to a user or to a second computer system. In this way, a user may be provided with additional context with which to assess the sample match or sample mismatch determination that is output. For example, if a match was output, but the expected sample match status was a mismatch, then it may be inferred that an error, such as contamination, may have occurred. As a result, one or more alerts or indications of potential error or contamination may be output, and the assay may ultimately need to be redone. In another example, if a mismatch was output but the expected sample match status was a match, then it may be inferred that an error, such as contamination, may have occurred. As a result, one or more alerts or indications of potential error or contamination may be output, and the assay may ultimately need to be redone.
[0088] Additionally or alternatively to the processes described above, an alternate approach to genotype signature comparison may be utilized that leverages Bayesian probability methods and population statistics to enhance the accuracy of match determination. This approach goes beyond simple counting and considers the underlying allele frequencies in the population. In an aspect, instead of relying solely on counts of matching alleles, the baseline allele frequencies in the population may also be considered. More particularly, for each SNP position, information may be available (e.g., in a local database, in an online database, etc.) for how common each allele is in the general population across different ethnic groups. Thereafter, for each SNP being compared, the probability of observing a match between the alleles in the two genotype signatures may be calculated, considering the population statistics. This calculation is facilitated by applying Bayesian probability calculations and takes into account the likelihood of observing the specific allele types given the population frequencies. In addition to allele frequencies, this approach considers the potential variations that may occur at a specific SNP position, which may make the comparison more comprehensive and robust. Ultimately, the probabilities calculated for each SNP may be combined to yield an overall concordance probability for the entire genotype signature comparison. This probability reflects the likelihood that the two genotype signatures match based on the population statistics and the variations observed. The comparison decision is then made based on the overall concordance probability. If the computed probability exceeds a certain threshold, it indicates a higher likelihood that the samples match. Conversely, if the computed probability falls below the threshold, it suggests a lower likelihood of a match.
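One possible, non-limiting formulation of the Bayesian comparison described above is sketched below. The disclosure does not specify a particular probabilistic model, so the following are assumptions made for illustration: genotype frequencies follow Hardy-Weinberg proportions derived from the population alternate-allele frequency, each observed call equals the true genotype with probability 1 minus a genotyping error rate, SNPs are treated as independent, and the prior probability that the samples share an origin is 0.5. All function and parameter names are hypothetical.

```python
import math


def hwe_genotype_probs(alt_freq):
    """Hardy-Weinberg probabilities for genotypes 0, 1, 2 (assumed model)."""
    ref_freq = 1.0 - alt_freq
    return [ref_freq ** 2, 2 * alt_freq * ref_freq, alt_freq ** 2]


def _obs_prob(observed, true, error_rate):
    """P(observed call | true genotype) under a symmetric error model."""
    return 1.0 - error_rate if observed == true else error_rate / 2.0


def same_individual_posterior(sig_a, sig_b, alt_freqs, error_rate=0.01,
                              prior_same=0.5, missing=3):
    """Posterior probability that two signatures share one individual.

    error_rate must lie strictly between 0 and 1; prior_same likewise.
    """
    log_odds = math.log(prior_same / (1.0 - prior_same))
    for g_a, g_b, alt_freq in zip(sig_a, sig_b, alt_freqs):
        if g_a == missing or g_b == missing:
            continue  # skip SNPs not called in both samples
        priors = hwe_genotype_probs(alt_freq)
        # Same individual: a single true genotype underlies both calls.
        like_same = sum(priors[g] * _obs_prob(g_a, g, error_rate)
                        * _obs_prob(g_b, g, error_rate) for g in range(3))
        # Different individuals: the two calls are independent draws.
        marg_a = sum(priors[g] * _obs_prob(g_a, g, error_rate)
                     for g in range(3))
        marg_b = sum(priors[g] * _obs_prob(g_b, g, error_rate)
                     for g in range(3))
        log_odds += math.log(like_same) - math.log(marg_a * marg_b)
    # Numerically stable logistic transform of the combined log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    exp_term = math.exp(log_odds)
    return exp_term / (1.0 + exp_term)
```

In this formulation, agreement at a SNP whose shared genotype is rare in the population raises the posterior more than agreement at a SNP whose genotype is common, which is the effect that the population statistics are intended to capture. The resulting posterior may then be compared against a threshold as described above.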
[0089] Additionally or alternatively to the foregoing concepts, there may be other ways in which to compare the signature from a sample against one or more other samples. For instance, in an aspect, a central database may be created that stores sample genotype records. This database may be continually updated (e.g., as new results appear) and may be leveraged to allow genotype match searches.
[0090] In one implementation, the database genotype search may be facilitated through a linear search, which may involve comparing the genotype information associated with a query sample with the genotype information of all samples stored in the central database. In this regard, a user may initiate a genotype match search by providing, e.g., to an application programming interface (API), information associated with a query sample. Specifically, the query sample information may contain information about the genetic makeup, or genotype, of a specific individual. The genotype data may include details about various SNPs or other genetic markers or mutations. In an aspect, the query sample may additionally include identifying information associated with the individual, such as a sample ID, various types of metadata (e.g., date of collection, collection method, collection location, etc.), or other relevant details that may help distinguish the sample from other samples.
[0091] To facilitate the linear search, the system may calculate the similarity between the genotypes associated with the query sample against those of each sample stored in the database. More particularly, in-memory encodings may be utilized that make each pairwise genotype comparison efficient and fast. For instance, the encoding may be a binary encoding for which the genotype similarity may be computed using various metrics, such as Hamming distance using vector CPU instructions. The system may iterate through each sample record in the database, comparing the genotype information of the query sample with the genotype information of the samples in the database. In an aspect, a similarity threshold may be established to define what constitutes a “match.” Samples with genotype similarities that meet or exceed the threshold are considered potential matches (e.g., indicating a higher likelihood that they originate from the same individual).
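A binary encoding of the kind mentioned above may be sketched, in a non-limiting manner, as follows. The sketch assumes each SNP code (0 to 3) is packed into two bits of a Python integer, and it counts the number of SNP positions at which two packed signatures differ by combining an XOR with a mask and a population count. Python integers stand in here for the vector CPU instructions mentioned above; the packing layout and function names are assumptions. The missing-data code is treated as an ordinary symbol in this sketch.

```python
def pack_signature(codes):
    """Pack 2-bit genotype codes (0-3) into one integer, SNP i at bits 2i, 2i+1."""
    packed = 0
    for index, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError("each code must fit in two bits")
        packed |= code << (2 * index)
    return packed


def count_differing_snps(packed_a, packed_b, n_snps):
    """Count SNP positions whose 2-bit codes differ between two signatures."""
    diff = packed_a ^ packed_b  # nonzero 2-bit field wherever codes differ
    # Mask selecting the low bit of every 2-bit field: 0b0101...01.
    low_bit_mask = int("01" * n_snps, 2) if n_snps else 0
    # Fold each field's high bit onto its low bit so each differing SNP
    # contributes exactly one set bit, then count the set bits.
    collapsed = (diff | (diff >> 1)) & low_bit_mask
    return bin(collapsed).count("1")
```

A plain population count of the XOR would count differing bits rather than differing SNPs; the fold-and-mask step is what makes the count per SNP. Dividing the number of non-differing positions by the number of positions gives a similarity that may be compared against the similarity threshold.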
[0092] In an aspect, the central database, or a portion thereof, may be cached into the memory of a server that implements the foregoing API. This reduces the need for repeated disk reads, as data is readily available in memory for comparisons. To keep the central database up-to-date, the server may periodically update its cache by reading records from the central database that were added or modified since the last update. If pipeline run IDs are not guaranteed to be sequential, a range search mechanism may be utilized to efficiently locate and retrieve updated records from the database.
[0093] In another implementation, locality sensitive hashing (LSH) may be utilized to avoid the need for a linear search through the database. The LSH search technique may be helpful when dealing with large datasets and searching for data points that are similar to a given query point based on a similarity metric. In an aspect, to facilitate LSH, each data point (e.g., genotype information associated with a sample) may be represented as a high-dimensional vector, wherein each dimension of the vector may correspond to a genetic marker or feature. The LSH process may employ a set of hash functions that map these high-dimensional vectors to a lower-dimensional space. Specifically, LSH creates multiple hash functions tailored for the genetic data and these functions map genotype vectors to lower-dimensional hash codes. In an aspect, multiple hash tables may be constructed, each using a different hash function and each table may store references to samples based on their hash codes. When a query is made, the same hashing functions may be applied to the query data, thereby generating corresponding hash codes. The system may then search hash tables for potential matches based on the hash values of the query. Any retrieved candidates from the hash tables may be potential nearest neighbors. To determine if they are true matches, full sample genotypes of the candidates may be obtained and compared to make a final determination. In an aspect, various parameters of the foregoing process may be tuned to ensure that true genotype matches may be found with high probability. In an aspect, each of the hash tables described above may be stored in a type of cloud system and may be configured to scale to 1 billion samples or more.
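The LSH search described above may be sketched, in a non-limiting manner, using bit sampling, which is one LSH family suited to Hamming-style similarity. In this sketch, each hash table keys a signature by its codes at a small random subset of SNP positions, and candidates retrieved from any table are then verified against the full stored signature. The class name, the number of tables, the number of sampled positions per table, and the in-memory dictionaries standing in for cloud-hosted tables are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class GenotypeLSHIndex:
    """Bit-sampling LSH over fixed-length genotype signatures (a sketch)."""

    def __init__(self, n_snps, n_tables=8, positions_per_table=6, seed=0):
        rng = random.Random(seed)
        # Each table's hash function samples a fixed subset of SNP positions.
        self.position_sets = [
            tuple(sorted(rng.sample(range(n_snps), positions_per_table)))
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(set) for _ in range(n_tables)]
        self.signatures = {}

    def _hash_codes(self, signature):
        """Lower-dimensional hash code of a signature for every table."""
        return [tuple(signature[i] for i in positions)
                for positions in self.position_sets]

    def add(self, sample_id, signature):
        self.signatures[sample_id] = list(signature)
        for table, code in zip(self.tables, self._hash_codes(signature)):
            table[code].add(sample_id)

    def query(self, signature, min_concordance=0.9):
        """Return IDs of stored samples that pass full verification."""
        candidates = set()
        for table, code in zip(self.tables, self._hash_codes(signature)):
            candidates |= table.get(code, set())
        matches = []
        for sample_id in candidates:
            stored = self.signatures[sample_id]
            same = sum(a == b for a, b in zip(signature, stored))
            if same / len(stored) >= min_concordance:
                matches.append(sample_id)
        return sorted(matches)
```

Increasing the number of tables raises the probability that a true match collides with the query in at least one table, while increasing the number of sampled positions per table reduces the number of spurious candidates; these are the kinds of parameters that may be tuned as described above.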
[0094] In some aspects, the genotype signatures described herein may be associated with a “version number” that may represent which version of the genotype signature encoding process a sample was encoded with. Differences in the signature version may correspond to differences in the parameters encompassed or leveraged by each encoding process. For instance, an original version of the genotype signature encoding process may be Version 0. Version 0 may be associated with a variety of parameters, e.g., including: 2 bit encoding per variant, a predefined set of SNPs, a predefined minimum number of intersecting SNPs between two samples, a predefined concordance threshold, and/or other variables (e.g., a predefined minimum and/or maximum depth at a position, predefined thresholds relevant to variant calling, etc.). In an aspect, a change in any of the above parameters may trigger initiation of a new encoding process. For example, the assay panel from which the samples were obtained may change, which may affect the specific type and/or number of SNPs that are utilized in the comparison. In an aspect, each subsequent version of the genotype signature encoding process may sequentially increment. For example, if a current version is Version 0, then the next version may be Version 1. In an aspect, samples having genotype signatures associated with the same version may be comparable. For example, two samples encoded with a version 1 genotype signature may be compared against each other. In some situations, samples having genotype signatures of different versions (e.g., version 1 vs. version 2) may also be compared after one or more factors are addressed (e.g., certain changes may need to be made to certain thresholds, etc.).
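The versioning scheme described above may be sketched, in a non-limiting manner, as a registry that maps version numbers to encoding parameters and assigns the next sequential version whenever any parameter changes. The class, field, and function names, as well as the example parameter values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingParams:
    """Parameters that define one version of the encoding process."""
    bits_per_variant: int
    snp_panel: tuple
    min_common_snps: int
    concordance_threshold: float


def register_params(registry, params):
    """Return the version for params, adding a new one if they changed.

    registry maps version number -> EncodingParams and is updated in place.
    """
    for version, existing in registry.items():
        if existing == params:
            return version  # unchanged parameters keep their version
    new_version = max(registry, default=-1) + 1  # sequential increment
    registry[new_version] = params
    return new_version


def directly_comparable(version_a, version_b):
    """Signatures with the same version may be compared without adjustment."""
    return version_a == version_b
```

In this sketch, registering a first parameter set yields Version 0, and registering a parameter set with, for example, a different SNP panel yields Version 1.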
[0095] In general, any process discussed in this disclosure that is understood to be computer-implementable may be performed by one or more processors of a computer system, such as system environment 100, as described above. A process or process step performed by one or more processors may also be referred to as an operation. The one or more processors may be configured to perform such processes by having access to instructions (e.g., software or computer-readable code) that, when executed by the one or more processors, cause the one or more processors to perform the processes. The instructions may be stored in a memory of the computer server. A processor may be a central processing unit (CPU), a graphics processing unit (GPU), or any suitable types of processing unit.
[0096] A computer system, such as system environment 100, may include one or more computing devices. If the one or more processors of the computer system are implemented as a plurality of processors, the plurality of processors may be included in a single computing device or distributed among a plurality of computing devices. If a system environment comprises a plurality of computing devices, the memory of the computer system may include the respective memory of each computing device of the plurality of computing devices.
[0097] FIG. 6 is a simplified functional block diagram of a computer system 600 that may be configured as a computing device for executing the processes described herein, according to exemplary embodiments of the present disclosure. In various embodiments, any of the systems herein may be an assembly of hardware including, for example, a data communication interface 620 for packet data communication. The platform also may include a central processing unit (“CPU”) 602, in the form of one or more processors, for executing program instructions. The platform may include an internal communication bus 608, and a storage unit 606 (such as ROM, HDD, SSD, etc.) that may store data on a computer readable medium 622, although the system 600 may receive programming and data via network communications via electronic network 625 (e.g., voice, video, audio, images, or any other data over the electronic network 625). The system 600 may also have a memory 604 (such as RAM) storing instructions 624 for executing techniques presented herein, although the instructions 624 may be stored temporarily or permanently within other modules of system 600 (e.g., processor 602 and/or computer readable medium 622). The system 600 also may include input and output ports 612 and/or a display 610 to connect with input and output devices such as keyboards, mice, touchscreens, monitors, displays, etc. The various system functions may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load. Alternatively, the systems may be implemented by appropriate programming of one computer hardware platform.
[0098] In this disclosure, the term “based on” means “based at least in part on.” The singular forms “a,” “an,” and “the” include plural referents unless the context dictates otherwise. The term “exemplary” is used in the sense of “example” rather than “ideal.” The terms “comprises,” “comprising,” “includes,” “including,” or other variations thereof, are intended to cover a non-exclusive inclusion such that a process, method, or product that comprises a list of elements does not necessarily include only those elements, but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Relative terms, such as “about,” “approximately,” “substantially,” and “generally,” are used to indicate a possible variation of ±10% of a stated or understood value. In addition, the term “between” used in describing ranges of values is intended to include the minimum and maximum values described herein. The use of the term “or” in the claims and specification is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” As used herein “another” may mean at least a second or more.
[0099] As used herein, the term “user” generally encompasses any person or entity, such as a researcher and/or a care provider (e.g., a doctor, etc.), that may desire information, resolution of an issue, or engage in any other type of interaction with a provider of the systems and methods described herein (e.g., via an application interface resident on their electronic device, etc.). The term “electronic application” or “application” may be used interchangeably with other terms like “program,” or the like, and generally encompasses software that is configured to interact with, modify, override, supplement, or operate in conjunction with other software.
[00100] Program aspects of the technology may be thought of as “products” or “articles of manufacture,” typically in the form of executable code and/or associated data that is carried on or embodied in a type of machine-readable medium. “Storage” type media include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the mobile communication network into the computer platform of a server and/or from a server to the mobile device. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
[00101] Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
[00102] Thus, while certain embodiments have been described, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added or deleted to methods described within the scope of the present invention.
[00103] The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other implementations, which fall within the true spirit and scope of the present disclosure. Thus, to the maximum extent allowed by law, the scope of the present disclosure is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. While various implementations of the disclosure have been described, it will be apparent to those of ordinary skill in the art that many more implementations are possible within the scope of the disclosure. Accordingly, the disclosure is not to be restricted except in light of the attached claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A computer-implemented method for verifying similarity between biological samples, comprising: receiving, at a computer system, genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generating, using a processor associated with the computer system, a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling, using the processor, the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing, using the processor, the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying, based on the comparing and responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identifying, based on the comparing and responsive to determining that the first genotype signature does not share the threshold level of similarity with the second genotype signature, a sample mismatch between the first sample and the second sample.
2. The computer-implemented method of claim 1, wherein the generating the numerical representation for each of the SNPs comprises: identifying a total number of SNPs in the genomic data; retaining, from the total number of the SNPs, a first subset of SNPs that match a reference subset of SNPs; retaining, from the first subset of SNPs, a second subset of SNPs having a coverage depth greater than a predetermined minimum coverage threshold and smaller than a predetermined maximum coverage threshold; computing a variant allele frequency (VAF) for each of the second subset of SNPs; identifying, based on a VAF calling threshold, a genetic variant associated with each of the second subset of SNPs, wherein the genetic variant is one of a first type, a second type, a third type, or a fourth type; and assigning, based on the identified genetic variant type associated with each of the second subset of SNPs, the numerical representation to each of the second subset of SNPs.
3. The computer-implemented method of claim 2, wherein the first type corresponds to a reference allele, the second type corresponds to a variant allele, the third type corresponds to a heterozygous allele, and the fourth type corresponds to a missing variant.
4. The computer-implemented method of claim 1, wherein the comparing comprises: identifying a first total number of SNPs that are common between the first genotype signature and the second genotype signature; determining whether the first total number of SNPs is greater than a first predetermined threshold; identifying, responsive to determining that the first total number of SNPs is greater than the first predetermined threshold, a second total number of SNPs having a matching genetic variant type between the first genotype signature and the second genotype signature; calculating, based on the identifying, a ratio of the second total number of SNPs to the first total number of SNPs; and determining whether the ratio is greater than a second predetermined threshold.
5. The computer-implemented method of claim 4, wherein the identifying the sample match comprises: identifying, responsive to determining that the ratio is greater than the second predetermined threshold, that the first genotype signature is a match with the second genotype signature.
6. The computer-implemented method of claim 4, wherein the identifying the sample mismatch comprises: identifying, responsive to determining that the ratio is not greater than the second predetermined threshold, that the first genotype signature is not a match with the second genotype signature.
7. The computer-implemented method of claim 4, further comprising: identifying, responsive to determining that the first total number of SNPs is lower than the first predetermined threshold, that insufficient data exists to determine whether a match exists between the first genotype signature and the second genotype signature.
8. The computer-implemented method of claim 1, further comprising storing the first genotype signature and the second genotype signature in a database.
9. The computer-implemented method of claim 1, wherein the threshold level of similarity depends at least in part upon a sample type associated with the first sample and the second sample.
10. The computer-implemented method of claim 1, wherein the first sample and the second sample are associated with a sample type selected from the group consisting of: cell-free DNA (cfDNA), cell-free RNA (cfRNA), bone marrow, urine, tissue, saliva, or plasma.
11. The computer-implemented method of claim 1, wherein the first sample was obtained from the participant at a first time, and the second sample was obtained from the participant at a second time, the first time being different from the second time.
12. The computer-implemented method of claim 1, wherein the second sample is a control sample, and wherein a sample match between the first sample and the second sample indicates sample contamination.
13. The computer-implemented method of claim 1, wherein the participant is a first participant, wherein the second sample is obtained from a second participant, and wherein a sample match between the first sample and the second sample indicates sample contamination.
14. The computer-implemented method of claim 1, further comprising, responsive to identifying the sample match or sample mismatch between the first sample and the second sample, comparing the sample match or sample mismatch to an expected sample match status between the first sample and the second sample; and reporting the comparison of the sample match or sample mismatch and the expected sample match status to a user of the computer system or to a second computer system.
15. The computer-implemented method of claim 1, wherein the comparing the first genotype signature to the second genotype signature is performed using: a linear search of a database comprising the second genotype signature; locality sensitive hashing of the first genotype signature and the second genotype signature; or a nearest neighbor search of the database comprising the second genotype signature.
16. A system for verifying similarity between biological samples, comprising: one or more processors; one or more computer readable media storing instructions that are executable by the one or more processors to perform operations to: receive genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generate a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assemble the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; compare the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identify, responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identify, responsive to determining that the first genotype signature does not share the threshold level of similarity with the second genotype signature, a sample mismatch between the first sample and the second sample.
17. The system of claim 16, wherein the operations to generate the numerical representation for each of the SNPs comprise operations to: identify a total number of SNPs in the genomic data; retain, from the total number of the SNPs, a first subset of SNPs that match a reference subset of SNPs; retain, from the first subset of SNPs, a second subset of SNPs having a coverage depth greater than a predetermined minimum coverage threshold and smaller than a predetermined maximum coverage threshold; compute a variant allele frequency (VAF) for each of the second subset of SNPs; identify, based on a VAF calling threshold, a genetic variant associated with each of the second subset of SNPs, wherein the genetic variant is one of a first type, a second type, a third type, or a fourth type; and assign, based on the identified genetic variant type associated with each of the second subset of SNPs, the numerical representation to each of the second subset of SNPs.
18. The system of claim 17, wherein the first type corresponds to a reference allele, the second type corresponds to a variant allele, the third type corresponds to a heterozygous allele, and the fourth type corresponds to a missing allele.
19. The system of claim 16, wherein the operations to compare comprise operations to: identify a first total number of SNPs that are common between the first genotype signature and the second genotype signature; determine whether the first total number of SNPs is greater than a first predetermined threshold; identify, responsive to determining that the first total number of SNPs is greater than the first predetermined threshold, a second total number of SNPs having a matching genetic variant type between the first genotype signature and the second genotype signature; calculate, based on the identifying, a ratio of the second total number of SNPs to the first total number of SNPs; and determine whether the ratio is greater than a second predetermined threshold.
20. The system of claim 19, wherein the operations to identify the sample match comprise operations to: identify, responsive to determining that the ratio is greater than the second predetermined threshold, that the first genotype signature is a match with the second genotype signature.
21. The system of claim 19, wherein the operations to identify the sample mismatch comprise operations to: identify, responsive to determining that the ratio is not greater than the second predetermined threshold, that the first genotype signature is not a match with the second genotype signature.
22. The system of claim 19, wherein the operations further comprise operations to: identify, responsive to determining that the first total number of SNPs is lower than the first predetermined threshold, that insufficient data exists to determine whether a match exists between the first genotype signature and the second genotype signature.
23. The system of claim 16, wherein the threshold level of similarity depends at least in part upon a sample type associated with the first sample and the second sample.
24. The system of claim 16, wherein the first sample and the second sample are associated with a sample type selected from the group consisting of: DNA, RNA, cell-free DNA (cfDNA), cell-free RNA (cfRNA) in bone marrow, urine, tissue, saliva, or plasma.
25. A non-transitory computer-readable medium storing computer-executable instructions which, when executed by a system, cause the system to perform operations comprising: receiving genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); generating a numerical representation for each of the SNPs in the genomic data that satisfy predetermined criteria, wherein a value of each numerical representation is based on an allele characteristic associated with each of the SNPs; assembling the numerical representation for each of the SNPs into a first genotype signature associated with the first sample; comparing the first genotype signature associated with the first sample to a second genotype signature associated with a second sample; and identifying, based on the comparing and responsive to determining that the first genotype signature shares a threshold level of similarity with the second genotype signature, a sample match between the first sample and the second sample, or identifying, based on the comparing and responsive to determining that the first genotype signature does not share the threshold level of similarity with the second genotype signature, a sample mismatch between the first sample and the second sample.
26. A computer-implemented method for generating a genotype signature, comprising: receiving, at a computer system, genomic data associated with a first sample from a participant, wherein the genomic data includes single nucleotide polymorphisms (SNPs); identifying, using a processor of the computer system, a total number of SNPs in the genomic data; retaining, using the processor and from the total number of the SNPs, a first subset of SNPs that match a reference subset of SNPs; retaining, using the processor and from the first subset of SNPs, a second subset of SNPs having a coverage depth greater than a predetermined minimum coverage threshold and smaller than a predetermined maximum coverage threshold; computing, using the processor, a variant allele frequency (VAF) for each of the second subset of SNPs; identifying, using the processor and based on a VAF calling threshold, a genetic variant associated with each of the second subset of SNPs, wherein the genetic variant is one of a first type, a second type, a third type, or a fourth type; and generating, using the processor and based on the identified genetic variant type associated with each of the second subset of SNPs, the genotype signature containing a numerical representation, wherein each numerical representation is based on an allele characteristic associated with each of the SNPs.
27. A computer-implemented method for comparing genotype signatures, comprising: identifying a first total number of single nucleotide polymorphisms (SNPs) that are common between a first genotype signature and a second genotype signature; determining whether the first total number of SNPs is greater than a first predetermined threshold; identifying, responsive to determining that the first total number of SNPs is greater than the first predetermined threshold, a second total number of SNPs having a matching genetic variant type between the first genotype signature and the second genotype signature; calculating, based on the identifying, a ratio of the second total number of SNPs to the first total number of SNPs; determining whether the ratio is greater than a second predetermined threshold; and identifying that the first genotype signature is a match with the second genotype signature responsive to determining that the ratio is greater than the second predetermined threshold, or identifying that the first genotype signature is not a match with the second genotype signature responsive to determining that the ratio is not greater than the second predetermined threshold.
28. A computer-implemented method for assembling a reference subset of single nucleotide polymorphisms (SNPs), comprising: receiving, at a computer system, a dataset containing a first set of SNPs; generating, using a processor associated with the computer system and by applying one or more filters against the first set of SNPs, a second set of SNPs; and utilizing the second set of SNPs in a genotype signature encoding process.
29. The computer-implemented method of claim 28, wherein the applying the one or more filters comprises: removing, at a first filter step, a first subset of SNPs from the first set of SNPs, wherein the first subset of SNPs includes: indel SNPs, multi-allelic SNPs, and G/C SNPs; removing, at a second filter step, a second subset of SNPs from the first set of SNPs having undergone the first filter step, wherein the second subset of SNPs are missing in all samples associated with the dataset; removing, at a third filter step, a third subset of SNPs from the first set of SNPs having undergone the second filter step, wherein the third subset of SNPs are missing in a predetermined percentage of the samples; removing, at a fourth filter step, a fourth subset of SNPs from the first set of SNPs having undergone the third filter step, wherein the fourth subset of SNPs do not vary within a population associated with the dataset; removing, at a fifth filter step, a fifth subset of SNPs from the first set of SNPs having undergone the fourth filter step, wherein the fifth subset of SNPs are not in Hardy-Weinberg equilibrium; and removing, at a sixth filter step, a sixth subset of SNPs from the first set of SNPs having undergone the fifth filter step, wherein the sixth subset of SNPs are in linkage disequilibrium with each other, wherein the second set of SNPs correspond to the first set of SNPs having undergone the sixth filter step.
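By way of non-limiting illustration, the signature generation recited in claim 26 and the comparison recited in claim 27 may be sketched in Python as follows. All function names, parameter names, default threshold values, and genotype code values are hypothetical. The sketch also makes three assumptions not stated in the claims: a SNP whose coverage depth falls outside the permitted range is encoded as the missing type rather than dropped, a single pair of lower and upper VAF cutoffs stands in for the VAF calling threshold, and a count of shared SNPs equal to the first threshold is treated as insufficient data.

```python
# Hypothetical codes for the four genetic variant types.
REF, VAR, HET, MISSING = 0, 1, 2, 3


def encode_snp(ref_reads, alt_reads, min_depth=20, max_depth=1000,
               low_vaf=0.2, high_vaf=0.8):
    """Assign a genotype code to one SNP from its read counts."""
    depth = ref_reads + alt_reads
    # Coverage must be greater than the minimum and smaller than the maximum.
    if not (min_depth < depth < max_depth):
        return MISSING
    vaf = alt_reads / depth  # variant allele frequency
    if vaf < low_vaf:
        return REF
    if vaf > high_vaf:
        return VAR
    return HET


def build_signature(counts, reference_snps, **thresholds):
    """Encode each reference SNP; SNPs absent from the data are MISSING.

    counts maps a SNP identifier to a (ref_reads, alt_reads) pair.
    """
    return [
        encode_snp(*counts[snp], **thresholds) if snp in counts else MISSING
        for snp in reference_snps
    ]


def compare(sig_a, sig_b, min_shared=3, concordance=0.9):
    """Return 'match', 'mismatch', or 'insufficient data'."""
    # SNPs common to both signatures: called (non-missing) in both.
    shared = [(a, b) for a, b in zip(sig_a, sig_b)
              if a != MISSING and b != MISSING]
    if len(shared) <= min_shared:
        return "insufficient data"
    # Ratio of SNPs with a matching variant type to SNPs in common.
    ratio = sum(a == b for a, b in shared) / len(shared)
    return "match" if ratio > concordance else "mismatch"
```

Both signatures are assumed to have been built against the same ordered reference subset of SNPs, so that position i in each signature refers to the same SNP.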
PCT/US2024/052772 2023-10-27 2024-10-24 Systems and methods for assessing similarity between samples using genotype signatures Pending WO2025090739A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363593758P 2023-10-27 2023-10-27
US63/593,758 2023-10-27

Publications (1)

Publication Number Publication Date
WO2025090739A1 (en) 2025-05-01

Family

ID=93463407

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/052772 Pending WO2025090739A1 (en) 2023-10-27 2024-10-24 Systems and methods for assessing similarity between samples using genotype signatures

Country Status (1)

Country Link
WO (1) WO2025090739A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210151126A1 (en) * 2018-06-06 2021-05-20 Lexent Bio, Inc. Methods for fingerprinting of biological samples
US20220336045A1 (en) * 2017-09-07 2022-10-20 Regeneron Pharmaceuticals, Inc. Systems and methods for leveraging relatedness in genomic data analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUSTAVO GLUSMAN ET AL: "Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints", FRONTIERS IN GENETICS, vol. 8, 26 September 2017 (2017-09-26), XP055550925, DOI: 10.3389/fgene.2017.00136 *
SOHEIL YOUSEFI ET AL: "A SNP panel for identification of DNA and RNA specimens", BMC GENOMICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 19, no. 1, 25 January 2018 (2018-01-25), pages 1 - 12, XP021252938, DOI: 10.1186/S12864-018-4482-7 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24805294

Country of ref document: EP

Kind code of ref document: A1