
WO2023129953A2 - Variant calling without a target reference genome - Google Patents

Variant calling without a target reference genome

Info

Publication number
WO2023129953A2
Authority
WO
WIPO (PCT)
Prior art keywords
target
variants
variant
reference genome
target species
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2022/082462
Other languages
English (en)
Other versions
WO2023129953A3 (fr
Inventor
Hong Gao
Tobias HAMP
Joshua Goodwin Jon McMaster-Schraiber
Laksshman Sundaram
Kai-How FARH
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Illumina Cambridge Ltd
Illumina Inc
Original Assignee
Illumina Cambridge Ltd
Illumina Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/952,198 external-priority patent/US20230207051A1/en
Priority claimed from US17/952,194 external-priority patent/US12499974B2/en
Application filed by Illumina Cambridge Ltd, Illumina Inc
Priority to JP2024539855A priority Critical patent/JP2024546515A/ja
Priority to EP22850997.2A priority patent/EP4457816A2/fr
Priority to KR1020247024941A priority patent/KR20240124392A/ko
Priority to CN202280086391.6A priority patent/CN118475983A/zh
Priority to CA3242595A priority patent/CA3242595A1/fr
Publication of WO2023129953A2 publication Critical patent/WO2023129953A2/fr
Publication of WO2023129953A3 publication Critical patent/WO2023129953A3/fr
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • the technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks.
  • the technology disclosed relates to using deep convolutional neural networks to analyze ordered data.
  • Protein language models capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. For example, protein language models can predict structural contacts from single sequences in an unsupervised way.
  • MSAs: multiple sequence alignments
  • Language models were initially developed for natural language processing and operate on a simple but powerful principle: they acquire linguistic understanding by learning to fill in missing words in a sentence, akin to a sentence completion task in standardized tests. Language models develop powerful reasoning capabilities by applying this principle across large text corpora.
  • the Bidirectional Encoder Representations from Transformers (BERT) model instantiated this principle using Transformers, a class of neural networks in which attention is the primary component of the learning system.
  • each token in the input sentence can “attend” to all other tokens by exchanging activation patterns corresponding to the intermediate outputs of neurons in a neural network.
  • Protein language models like the MSA Transformer have been trained to perform inference from MSAs of evolutionarily related sequences.
  • the MSA Transformer interleaves per-sequence (“row”) attention with per-site (“column”) attention to incorporate coevolution. Combinations of row attention heads in the MSA Transformer have led to state-of-the-art unsupervised structural contact predictions.
  • End-to-end deep learning approaches for variant effect prediction are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (see Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”).
  • PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information.
  • PrimateAI in particular uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks.
  • Such an approach that utilizes the protein sequences for pathogenicity prediction is promising because it can avoid the problem of circularity and overfitting to previous knowledge.
  • PrimateAI outperforms prior methods when trained directly upon sequence alignments. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from the training data consisting of about 120,000 human samples. PrimateAI substantially exceeds the performance of other variant pathogenicity prediction tools in differentiating benign and pathogenic de novo mutations in candidate developmental disorder genes, and in reproducing prior knowledge in ClinVar. These results suggest that PrimateAI is an important step forward for variant classification tools that may lessen the reliance of clinical reporting on prior knowledge.
  • Figure 1 is a flow diagram that illustrates a process of a system for variant calling for a particular Target Species 120 without using the target reference genome.
  • Figure 2 is a sequential flow diagram representing the process of identifying false positive variants by means of a set comparison for identification of true positive variants from sequenced reads.
  • Figure 3 illustrates an example of genetic variants from an example reference genetic sequence A with two example variant sequences in which the variant sequences possess a respective single nucleotide variant in a single base position but otherwise possess an identical composition to the reference sequence.
  • Figure 4 graphically represents the process of base-resolution variant calling sensitivity detection in a series of example flow diagrams mapping sequenced reads to reference genomes.
  • Figure 5 is a schematic illustrating the process of mapping sequenced reads to a reference genome.
  • Figure 6 represents a graphical flow diagram of a process for alternative detection of the second variant set in one implementation of the technology disclosed where a target species reference genome is available.
  • Figure 7 is a schematic illustrating the evolutionary relationship between the target species and the pseudo-target species using simplified phylogenic tree graphics.
  • Figure 8 is a schematic illustrating the evolutionary relationship between the target species and the pseudo-target species using a simplified phylogenic tree graphic.
  • Figure 9 is a schematic representing genome-resolution variant calling sensitivity detection of sequenced reads.
  • Figure 10 is a schematic flow diagram demonstrating how the output data generated by the set comparison of sequenced reads mapped to a non-target reference genome and a pseudo-target reference genome can be used as a training dataset for a quality classifier to predict variant quality where a reference genome is not available, in contrast to methods wherein the quality is determined by mapping to non-target or pseudo-target genomes.
  • Figure 11 is an illustrative example of the variant features in the plurality of variant features describing guanine-cytosine content.
  • Figure 12 is an illustrative example of the variant feature in the plurality of variant features describing local composition complexity.
  • Figure 13 is an illustrative example of the variant feature in the plurality of variant features describing allelic count.
  • Figure 14 is an illustrative example of the variant feature in the plurality of variant features describing the process of mapping sequenced reads to a reference genome wherein an additional step is added to compute a quality metric of the mapping.
  • Figure 15 is an illustrative example of the variant feature in the plurality of variant features describing strand bias in sequenced reads when mapped to a reference genome.
  • Figure 16 is an illustrative example of the variant features in the plurality of variant features describing the depth and coverage of sequenced reads mapped to a reference genome.
  • Figure 17 is an illustrated flow diagram representing the variant quality classifier configured as a random forest model to classify a target variant as either belonging to the high quality class or the low quality class.
  • Figure 18 is an illustrated flow diagram representing the variant quality classifier configured as a logistic regression model to classify a target variant as either belonging to the high quality class or the low quality class.
  • Figure 19 is an illustrated flow diagram representing the variant quality classifier configured as a neural network model to classify a target variant as either belonging to the high quality class or the low quality class.
  • Figure 20 is a flow diagram of the unique mapper overview process to further improve the quality of a set of called variants following variant calling or machine learning variant classification.
  • Figure 21 schematically illustrates the gene annotation filter in the series of cascaded filters for variant quality filtering.
  • Figure 22 schematically illustrates the process of codon transcription and translation and filtering for codon match.
  • Figure 23 illustrates the process of filtering genes based on a distribution of machine learning scores.
  • Figure 24 illustrates deviation from Hardy-Weinberg Equilibrium in an example population.
  • Figure 25 is an illustration of a nonsense variant.
  • Figure 26 contains a graph of collected results demonstrating the cascading filter effect on the number of nonsense variants per sample.
  • Figure 27 contains a graph of collected results demonstrating the cascading filter effect on the missense:synonymous ratio of called variants per sample.
  • Figure 28 contains a graph of collected results demonstrating the cascading filter effect on the number of insertion-deletion variants (indels) per sample.
  • Figure 29 shows an example computer system that can be used to implement the technology disclosed.
  • modules can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved.
  • the modules in the figures can also be thought of as flowchart steps in a method.
  • a module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.
  • the technologies disclosed can be used to improve the quality of pathogenic variant calling.
  • the technology disclosed can be used to improve the quality of variant calling in scenarios where desired reference genomes are unavailable.
  • we developed various methods to reduce false positives, including random forest classifiers, linear regression models, and neural network models.
  • We also devised a unique-mapper score to identify regions that do not map one-to-one between the species, which further reduces variant calling errors.
  • the technology disclosed relates to determining feasibility of using a reference genome of a non-target species for variant calling a sample of a target species.
  • the technology disclosed relates to mapping sequenced reads of a sample of a target species to a reference genome of a non-target species to detect a first set of variants in the sequenced reads of the sample of the target species, and mapping the sequenced reads of the sample of the target species to a reference genome of a pseudo-target species to detect a second set of variants in the sequenced reads of the sample of the target species.
  • the technology disclosed relates to variant calling of sequenced reads of a sample of a target species against a reference genome of a pseudo-target species.
  • Low-quality variants are identified as false positive variants that are present in the second set of variants but absent from the first set of variants.
  • the technology disclosed relates to a variant quality classifier that is configured to process a plurality of features of a target variant and generate a quality indication for the target variant.
  • the variant quality classifier is trained on a set of high-quality variants and a set of low-quality variants.
  • High-quality variants are identified as true positive variants that are common between a first set of variants detected by variant calling sequenced reads of a sample of a target species against a reference genome of a non-target species and a second set of variants detected by variant calling of sequenced reads of a sample of a target species against a reference genome of a pseudo-target species.
  • Low-quality variants are identified as false positive variants that are present in the second set of variants but absent from the first set of variants.
  • Figure 1 is a flow diagram that illustrates a process 100 of a system for variant calling for a particular Target Species 120 without using the target reference genome.
  • Mapping sequenced reads from a Target Species 120 to a Non-Target Reference Genome 102 detects a First Set of Variants in the Sequenced Reads of the Target Species 104.
  • the Non-Target Reference Genome 102 is from a non-target species other than the Target Species 120.
  • the Non-Target Reference Genome 102 is non-homologous with the genome of Target Species 120, as determined by a homology threshold (such as a percentage homology below 30%, 40%, or 50%, or a double-bounded range of acceptable homology percentages such as 30-40% or 40-50%).
  • the non-target species and Target Species 120 belong to the same taxonomical genus, family, order, or class.
  • Mapping sequenced reads from a Target Species 120 to a Pseudo-Target Reference Genome 142 detects a Second Set of Variants in the Sequenced Reads of the Target Species 144.
  • the Pseudo-Target Reference Genome 142 is from a pseudo-target species other than the Target Species 120.
  • the Pseudo-Target Reference Genome 142 is homologous with the genome of Target Species 120, as determined by a homology threshold (such as a percentage homology above 80%, 90%, or 95%, or a double-bounded range of acceptable homology percentages such as 85-90% or 80-89%).
  • a homology threshold set to determine degree of homology between the pseudo-target species and target species may be the same as a homology threshold set to determine degree of homology between the non-target species and target species, or the respective homology thresholds may differ. In some embodiments, the homology threshold set to determine degree of homology between the non-target species and target species may be informed by the degree of homology between the pseudo-target species and target species, or vice versa.
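As a rough sketch of how such thresholds might be applied, the snippet below labels a candidate reference genome from a single percentage-homology value. The function name `classify_reference`, the default cutoffs (50% and 80%, borrowed from the example values above), and the assumption that a percentage homology has already been computed are illustrative only and are not taken from the disclosure.

```python
def classify_reference(percent_homology, non_target_max=50.0, pseudo_target_min=80.0):
    """Label a candidate reference genome by its homology to the target species.

    Both cutoffs are hypothetical defaults; the two thresholds may be set
    independently, as the text allows.
    """
    if not 0.0 <= percent_homology <= 100.0:
        raise ValueError("percent_homology must be between 0 and 100")
    if percent_homology > pseudo_target_min:
        return "pseudo-target"   # homologous enough to stand in for the target
    if percent_homology < non_target_max:
        return "non-target"      # treated as non-homologous
    return "unclassified"        # falls between the two thresholds


labels = [classify_reference(p) for p in (92.5, 35.0, 65.0)]
```

Keeping the two cutoffs as separate parameters mirrors the statement that the respective homology thresholds may be the same or may differ.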
  • the Comparison 126 of the first set of variants and second set of variants identifies a subset of False Positive Variants 128 (i.e., overlapping variants identified by mapping to the Pseudo-Target Reference Genome 142 cannot be considered as reliable positive variants on the basis of homology when the variants are also identified by mapping to Non-Target Reference Genome 102).
  • Figure 2 is a sequential flow diagram 200 representing the process of identifying false positive variants by means of a set comparison for identification of true positive variants from sequenced reads.
  • Mapping sequenced reads from a Target Species 202 to a Non-Target Species Reference Genome 204 detects a First Set of Variants 223.
  • Mapping sequenced reads from a Target Species 206 to a Pseudo-Target Species Reference Genome 208 detects a Second Set of Variants 227.
  • a Venn diagram represents the union Variant Set 1 ∪ Variant Set 2, wherein the set difference Variant Set 1 - Variant Set 2 is represented by the left downward diagonal shaded area 244 and the set difference Variant Set 2 - Variant Set 1 is represented by the right upward diagonal shaded area 246.
  • the intersection Variant Set 1 ∩ Variant Set 2 is represented by the center diamond crosshatch shaded area for a set of True Positive Variants 266.
  • the intersection Variant Set 1 ∩ Variant Set 2 (i.e., called variants identified both in the First Set of Variants 223 detected by mapping to a Non-Target Reference Genome 204 and the Second Set of Variants 227 detected by mapping to a Pseudo-Target Reference Genome 208) translates to the set of True Positive Variants 266.
  • the set difference Variant Set 2 - Variant Set 1 (i.e., called variants identified in the Second Set of Variants 227 detected by mapping to a Pseudo-Target Reference Genome 208 but not identified in the First Set of Variants 223 detected by mapping to a Non-Target Reference Genome 204) translates to the Set of False Positive Variants 268.
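The set comparison of Figure 2 can be sketched with ordinary set operations. The example below is a minimal illustration, assuming each variant is represented by a hypothetical `(chromosome, position, ref, alt)` tuple and that both call sets have already been expressed in a shared coordinate system; the variant values themselves are made up.

```python
# Hypothetical variant calls; tuples are (chromosome, position, ref, alt).
variant_set_1 = {  # called against the non-target reference genome
    ("chr1", 100, "A", "C"),
    ("chr2", 250, "G", "T"),
}
variant_set_2 = {  # called against the pseudo-target reference genome
    ("chr1", 100, "A", "C"),
    ("chr3", 75, "T", "G"),
}

# Intersection: variants found with both references are treated as true positives.
true_positives = variant_set_1 & variant_set_2

# Set difference (Set 2 - Set 1): variants found only with the pseudo-target
# reference are treated as false positives.
false_positives = variant_set_2 - variant_set_1
```

Representing variants as hashable tuples is a design choice that keeps the intersection and difference a direct transcription of the Venn diagram.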
  • Figure 3 illustrates an example of Genetic Variants 300 from an example Reference Genetic Sequence A 302 with two example Variant Sequences 1A 322 and 1B 342 in which the variant sequences possess a respective single nucleotide variant in a single base position but otherwise possess an identical composition to the reference sequence.
  • a single nucleotide substitution is shown as Adenine 326 in Variant 1A 322 and Thymine 336 in Variant 1B 342 as compared to Cytosine 306 in the Reference Genetic Sequence A 302.
  • Figure 4 graphically represents the process 400 of base-resolution variant calling sensitivity detection in a series of example flow diagrams mapping sequenced reads to reference genomes.
  • Sequenced Read X 410 is mapped to Pseudo-Target Reference Genome 402 and will be called as a variant due to the A->C single nucleotide variant in position 5.
  • Sequenced Read X 410 is also mapped to Non-Target Reference Genome 404 and will not be called as a variant as a single nucleotide variant is not identified.
  • Sequenced Read X 410 belongs to the set difference between the called variant set from mapping to the Pseudo-Target Reference Genome 402 and the variant set from mapping to the Non-Target Reference Genome 404; therefore, Sequenced Read X 410 is a false positive.
  • Sequenced Read Y 412 is mapped to Pseudo-Target Reference Genome 422 and will be called as a variant due to the A->C single nucleotide variant in position 5.
  • Sequenced Read Y 412 is also mapped to Non-Target Reference Genome 424 and will be called as a variant due to the A->C single nucleotide variant in position 5.
  • Sequenced Read Y 412 belongs to the set intersection between the called variant set from mapping to the Pseudo-Target Reference Genome 422 and the variant set from mapping to the Non-Target Reference Genome 424; therefore, Sequenced Read Y 412 is a true positive.
  • Sequenced Read Z 414 is mapped to Pseudo-Target Reference Genome 442 and will not be called as a variant despite the cytosine and guanine not being equivalent at position 5. Due to base pairing, the complementary strand of the Pseudo-Target Reference Genome 442 possesses a cytosine at position 5 and the complementary strand of the Sequenced Read Z 414 possesses a guanine at position 5. As a result, Sequenced Read Z 414 is not a variant when mapped to Pseudo-Target Reference Genome 442.
  • Sequenced Read Z 414 is also mapped to Non-Target Reference Genome 444 and will not be called as a variant due to complementary bases being present at position 5. As a result, Sequenced Read Z 414 belongs to the complement of both the called variant set from mapping to the Pseudo-Target Reference Genome 442 and the called variant set from mapping to the Non-Target Reference Genome 444; therefore, Sequenced Read Z 414 is a true negative.
  • Figure 5 is a schematic 500 illustrating the process of mapping sequenced reads to a Reference Genome. Sequenced Reads 502 are mapped to Reference Genome 505 resulting in Mapping 555.
  • each sequenced read from the set of Sequenced Reads 502 aligns with a given genomic region within the Reference Genome 505.
  • mapped sequenced reads may align with genomic regions that are mutually exclusive (i.e., do not overlap, such as the leftmost sequenced read and the middle sequenced read in Mapping 555) or are not mutually exclusive from each other (i.e., do overlap, such as the middle sequenced read and the rightmost sequenced read in Mapping 555).
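Whether two mapped reads cover overlapping genomic regions can be checked with a simple interval test. The sketch below assumes each mapped read is summarized by a hypothetical half-open `(start, end)` coordinate pair on the reference genome; the three example reads are invented to mirror the leftmost, middle, and rightmost reads described for Mapping 555.

```python
def reads_overlap(read_a, read_b):
    """Return True if two half-open intervals (start, end) share at least one base."""
    start_a, end_a = read_a
    start_b, end_b = read_b
    return start_a < end_b and start_b < end_a


# Hypothetical mapped coordinates on the reference genome.
leftmost, middle, rightmost = (0, 50), (60, 110), (100, 150)

left_vs_middle = reads_overlap(leftmost, middle)     # mutually exclusive regions
middle_vs_right = reads_overlap(middle, rightmost)   # overlapping regions
```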
  • Figure 6 represents a graphical flow diagram 600 of a process for alternative detection of Variant Set Two in one implementation of the technology disclosed where a Target Species Reference Genome 622 is available. Sequenced Reads from A Target Species 602 are mapped to the Target Species Reference Genome 622. The mapped sequence reads of the target species are lifted over to a Non-Target Species Reference Genome 642. Variants that are detected in the Non-Target Species Reference Genome 642 but not detected in the Target Species Reference Genome 622 comprise Variant Set Two 662 wherein the plurality of variants in Variant Set Two 662 are false positive variants.
  • the alternative detection of Variant Set Two 662 is useful in generating training data comprising known ground truth data to be used in training a machine learning classifier for the detection of true positive variants and false positive variants.
  • Figure 7 is a schematic 700 illustrating the evolutionary relationship between the target species and the pseudo-target species using simplified phylogenic tree graphics.
  • the pseudo-target species and the target species are different species that are not orthologous.
  • the evolutionary relationship between the target species and the pseudo-target species is reflective of that shown in Phylogenic Tree A 702.
  • the pseudo-target species and the target species are different species that are orthologous.
  • the evolutionary relationship between the target species and the pseudo-target species is reflective of that shown in Phylogenic Tree B 704.
  • Phylogenic Tree C 706 the pseudo-target species and the target species are the same species.
  • the evolutionary relationship between the target species and the pseudo-target species is reflective of that shown in Phylogenic Tree C 706.
  • Figure 8 is a schematic 800 illustrating the evolutionary relationship between the target species and the pseudo-target species using a simplified phylogenic tree graphic.
  • the non-target species and the target species are different species wherein the non-target species is a human and the target species is a non-human primate.
  • the evolutionary relationship between the target species and the non-target species reflects that of Phylogenic Tree D 802.
  • primate species samples and reference genomes are leveraged to infer the pathogenicity of orthologous human variants by variant calling to closely-related primate species genomes, variant calling to non-target, non-homologous primate species genomes, and contrasting the results as demonstrated within Figures 1-5.
  • machine learning classifiers are trained to detect false positive variants, further refining the identification process of true positive variants.
  • Figure 9 is a schematic 900 representing genome-resolution variant calling sensitivity detection of sequenced reads.
  • Sequenced Read A 902 from a Target Species is equivalent to Sequenced Read A 942 from a Target Species and maps to Region One 922 in the Non-Target Reference Genome 924 and Region Two 962 in the Pseudo-Target Reference Genome 964.
  • Region One 922 and Region Two 962 are not equivalent (i.e., not orthologous), therefore Sequenced Read A does not map to the same genomic region in the Non-Target Reference Genome 924 and in the Pseudo-Target Reference Genome 964.
  • Sequenced Read A 902 does not result in a called variant in the genomic region within the Non-Target Reference Genome 924 that is orthologous to the genomic region that Sequenced Read A 942 maps to within the Pseudo-Target Reference Genome 964.
  • this variant will belong to the called variant set from mapping to the Pseudo-Target Reference Genome 964 but will not belong to the called variant set from mapping to the Non-Target Reference Genome 924 and results in a false positive.
  • Sequenced Read B 982, Sequenced Read B 984, and Sequenced Read B 986 from a Target Species are equivalent. Region Three 972, Region Four 978, and Region Five 980 belong to the non-target reference genome and are not equivalent. Sequenced Read B 982 from the Target Species maps to multiple regions within the non-target reference genome. As with Sequenced Read A 902, Sequenced Read B 982 will map to a different genomic region within the non-target species reference genome than the orthologous genomic region that Sequenced Read B 982 maps to within the pseudo-target reference genome, due to the multiplicity of mappings within the non-target reference genome. Consequently, a sequenced read that maps to more than two genomic regions within the non-target reference genome will result in a false positive.
  • Figure 10 is a schematic flow diagram 1000 demonstrating how the output data generated by the set comparison of sequenced reads mapped to a non-target reference genome and a pseudo-target reference genome can be used as a training dataset for a quality classifier to predict variant quality where a reference genome is not available, in contrast to methods wherein the quality is determined by mapping to non-target or pseudo-target genomes.
  • Sequenced Reads from a Target Species 1002 are mapped to reference genomes as previously described in Figures 1, 2, and 3.
  • the intersection of Variant Set One 1004 and Variant Set Two 1006 corresponds to the Set of True Positive Variants 1022.
  • the set difference between Variant Set Two 1006 and Variant Set One 1004 corresponds to the Set of False Positive Variants 1008.
  • the Set of True Positive Variants 1022 is further coded as a Set of High Quality Variants 1024.
  • the Set of False Positive Variants 1008 is further coded as a Set of Low Quality Variants 1010.
  • the combined set of High Quality Variants 1024 and Low Quality Variants 1010 comprises the set of Ground Truth Data 1020.
  • the Quality Classifier 1064 undergoes a Model Training Process 1040 on the Ground Truth Data 1020.
  • the Quality Classifier 1064 takes an Input Target Variant 1062 represented as a vector of variant features $\{x_1, \ldots, x_n\}$, where each $x_i$ is a variant feature within the plurality of variant features describing the Target Variant 1062.
  • additional variant features can be extracted from Variant Call Format (.vcf) files.
  • the Quality Classifier 1064 is a binary classification model with output classes for High Quality 1066 and Low Quality 1068.
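The training flow of Figure 10 can be illustrated with a toy binary classifier. The sketch below uses a small logistic regression (one of the model types named for the quality classifier) written in plain Python; the two features, their values, the learning rate, and the function names `train_logistic` and `predict_quality` are all assumptions made for illustration and do not reproduce the disclosed model or its feature set.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_quality(x, weights, bias):
    """Return the predicted class label for one feature vector."""
    p_high = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return "high" if p_high >= 0.5 else "low"


# Made-up ground truth: each vector is [scaled mapping quality, scaled depth].
high_quality = [[0.9, 0.8], [0.8, 0.9], [0.95, 0.7]]   # from the true positive set
low_quality = [[0.2, 0.3], [0.1, 0.2], [0.3, 0.1]]     # from the false positive set
features = high_quality + low_quality
labels = [1] * len(high_quality) + [0] * len(low_quality)

weights, bias = train_logistic(features, labels)
prediction = predict_quality([0.85, 0.85], weights, bias)
```

The key point is the labeling step: true positives from the set comparison become the positive class and false positives become the negative class, so no target reference genome is needed to obtain labels.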
  • Figure 11 is an illustrative example 1100 of the variant features in the plurality of variant features describing guanine-cytosine content.
  • a short Genetic Sequence B 1102 contains a proportion of adenine, thymine, guanine, and cytosine nucleic acids.
  • the guanine-cytosine content (GC) of a genetic sequence corresponds to the proportion of guanine and cytosine nucleic acids within the sequence.
  • GC content is a physicochemical descriptor of nucleic acid sequences that can be used as a proxy for the thermostability of nucleic acid sequences, because guanine-cytosine pairs differ in chemical bonding behavior from adenine-thymine pairs.
  • Equation 1122 is used for a sample calculation for the GC content of Genetic Sequence B 1102 wherein GC content is equivalent to the ratio of guanine and cytosine count to the overall count of all nucleic acids.
  • Equation 1124 is used for a sample calculation of the GC skew of Genetic Sequence B 1102 wherein GC skew is determined as the ratio of the difference between guanine count and cytosine count to the sum of guanine count and cytosine count for a given window size.
  • the Window Size Example 1164 illustrates a window size of five. When the window size of five is applied to Genetic Sequence B 1102, GC skew is calculated as shown in Table 1184.
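The two quantities described for Figure 11 can be written as GC content = (G + C) / N and GC skew = (G - C) / (G + C). The sketch below computes both, assuming a sliding window that advances one base at a time (the step size is not stated in the text) and using a made-up example sequence rather than Genetic Sequence B 1102.

```python
def gc_content(seq):
    """Fraction of bases that are guanine or cytosine."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C); defined here as 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_skew(seq, window=5):
    """GC skew for every window of the given size, sliding one base at a time."""
    return [gc_skew(seq[i:i + window]) for i in range(len(seq) - window + 1)]


example = "GGCATGCCAT"            # hypothetical sequence
content = gc_content(example)      # 3 G + 3 C out of 10 bases
skews = windowed_gc_skew(example, window=5)
```

Returning 0.0 for a window without any G or C is a convention chosen here to avoid division by zero; other conventions are possible.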
  • Figure 12 is an illustrative example 1200 of the variant feature in the plurality of variant features describing local composition complexity.
  • Local composition complexity is a measure of entropy within a genetic sequence.
  • Genetic Sequence X 1202 contains no variability of nucleic acid composition and therefore has low entropy.
  • Genetic Sequence Z 1242 has high variability of nucleic acid composition and therefore has high entropy.
  • Genetic Sequence Y 1222 contains more variability than Genetic Sequence X 1202 but less than Genetic Sequence Z 1242; therefore, it can be described as having a medium (i.e., moderate) level of entropy.
  • Equation 1224 computes the entropy of a genetic sequence in the format of local composition complexity.
  • a sample calculation for Genetic Sequence B 1204 results in an entropy value of 1.92, wherein entropy is the negative sum, over the nucleic acids, of each nucleic acid's probability multiplied by the logarithm of that probability.
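The entropy described above corresponds to the Shannon form $H = -\sum_b p_b \log p_b$, summed over the nucleic acids $b$ present in the sequence. The sketch below assumes a base-2 logarithm, which gives a maximum of 2 bits for four equally frequent nucleotides; the logarithm base and the example sequences are assumptions, not values from the figure.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the nucleotide composition of a sequence."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


low = shannon_entropy("AAAAAAAA")    # no variability in composition
medium = shannon_entropy("AAAACCCC")  # two nucleotides, equally frequent
high = shannon_entropy("ACGTACGT")    # four nucleotides, equally frequent
```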
  • Figure 13 is an illustrative example 1300 of the variant feature in the plurality of variant features describing allelic count.
  • Variant One 1302 is shown in shaded grey and Variant Two 1304 is shown in white.
  • Population 1322 contains samples of numerous genetic sequences belonging to either Variant One 1302 or Variant Two 1304. Within Population 1322, there are six total samples belonging to Variant One 1302, thus the total allelic count of Variant One 1302 is six. Within Population 1322, there are nine total samples belonging to Variant Two 1304, thus the total allelic count of Variant Two 1304 is nine.
  • The error rate for detecting heterozygous called variants is higher than the comparable error rate for homozygous called variants (i.e., heterozygous false positives occur at a higher rate than homozygous false positives).
  • Figure 14 is an illustrative example 1400 of the variant feature in the plurality of variant features describing the process of mapping sequenced reads to a reference genome, wherein an additional step is added to compute a quality metric of the mapping.
  • Sequenced Reads 1402 are mapped to a Reference Genome 1404 to produce a Mapping 1444.
  • Mapping quality scores quantify the likelihood that a sequenced read has been misplaced on the reference genome. Mapping quality is determined by the total number of possible alignments for a given sequenced read and the count of mismatched base pairs within the alignment. The mapping quality score is reported on a Phred scale, a commonly used logarithmic scale for error rates in sequencing analysis.
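  • As a non-limiting illustration, the Phred scale relates a score Q to an error probability p by Q = -10 * log10(p). The following sketch converts in both directions; the function names are hypothetical, and how an aligner estimates p for a given read is outside the scope of the sketch.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Phred-scaled score for an error probability p in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Error probability implied by a Phred-scaled score q."""
    return 10.0 ** (-q / 10.0)


q30 = phred_from_error_probability(0.001)  # a 1-in-1000 chance of misplacement
p20 = error_probability_from_phred(20.0)   # a score of 20 implies a 1-in-100 chance
```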
  • Figure 15 is an illustrative example 1500 of the variant feature in the plurality of variant features describing strand bias in sequenced reads when mapped to a reference genome.
  • Sequenced Reads 1502 contain reads with different strand orientation (i.e., strands oriented in the 5’->3’ direction and strands oriented in the 3’->5’ direction).
  • The Mapping 1544 generated will display sequencing biases based on strand orientation, wherein one DNA strand is favored over the other. Strand bias may result in a higher error rate for allele count.
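  • As a non-limiting illustration of one way strand bias can be quantified, the following sketch computes a two-sided p-value of Fisher's exact test on a 2x2 table of read counts, with rows for the reference and alternate allele and columns for forward and reverse strand. The table layout, the function name, and the tolerance used when comparing table probabilities are assumptions of this sketch.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of observing x in the top-left cell
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed table.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical read counts: (ref forward, ref reverse, alt forward, alt reverse).
balanced = fisher_exact_two_sided(10, 10, 10, 10)
biased = fisher_exact_two_sided(10, 0, 0, 10)
```

  • A small p-value, as in the second hypothetical table, indicates that the alternate allele is supported disproportionately by reads of one strand orientation.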
  • Figure 16 is an illustrative example 1600 of the variant features in the plurality of variant features describing the depth and coverage of sequenced reads mapped to a reference genome.
  • Depth and coverage of a particular mapping are measures of mapping quality: the quality of the particular mapping increases with both the sequencing coverage and the depth of sequencing coverage.
  • Sequenced reads 1602 are mapped to a reference genome 1604 at the various genomic regions along the X-axis. The total percentage of target bases within the reference genome to which sequenced reads are mapped is quantified as the coverage of the genome.
  • The average depth of sequencing coverage is the number of reads multiplied by the read length, divided by the total length of the reference genome.
  • This concept is illustrated by visualizing the X-axis as the length of the reference genome 1604, with coverage corresponding to the total breadth of the aligned sequenced reads 1602, whereas the Y-axis visualizes the depth by which the reference genome 1604 is covered.
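  • As a non-limiting illustration, coverage and average depth can be computed from read placements as follows. Representing each mapped read as a (start, length) pair on a zero-based reference is an assumption of this sketch, as are the function name and the example values.

```python
def coverage_and_depth(reads: list[tuple[int, int]], genome_length: int) -> tuple[float, float]:
    """Return (coverage, average depth) for reads given as (start, length) pairs."""
    depth = [0] * genome_length
    for start, length in reads:
        # Clip each read to the reference boundaries before counting it.
        for pos in range(max(0, start), min(genome_length, start + length)):
            depth[pos] += 1
    coverage = sum(1 for d in depth if d > 0) / genome_length
    average_depth = sum(depth) / genome_length
    return coverage, average_depth


# Three hypothetical reads of length 4 on a reference of length 10.
coverage, average_depth = coverage_and_depth([(0, 4), (2, 4), (6, 4)], genome_length=10)
```

  • When every read lies fully within the reference, the average depth equals the number of reads multiplied by the read length and divided by the reference length (here 3 x 4 / 10).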
  • Figure 17 is an illustrated flow diagram 1700 representing the variant quality classifier configured as a random forest model to classify a Target Variant 1762 as either belonging to the High Quality Class 1766 or the Low Quality Class 1768.
  • The Random Forest Model 1744 takes an Input Target Variant 1762, represented as a vector {x_1, ..., x_n} in which each value x_i is a variant feature in the plurality of variant features describing the Target Variant 1762, and generates a classification.
  • A plurality of decision trees each generate a respective output result of the target variant class, and a final result is generated by aggregating the outputs of the decision trees (e.g., by majority voting or averaging).
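  • As a non-limiting illustration of the aggregation step only, the following sketch combines per-tree class outputs by majority vote. The class labels and function name are hypothetical, the training of the individual decision trees is not shown, and ties are resolved in favor of the label encountered first, which is an arbitrary choice of this sketch.

```python
from collections import Counter


def majority_vote(tree_outputs: list[str]) -> str:
    """Return the class predicted by the largest number of trees."""
    if not tree_outputs:
        raise ValueError("at least one tree output is required")
    return Counter(tree_outputs).most_common(1)[0][0]


# Hypothetical outputs of five decision trees for one target variant.
votes = ["high_quality", "low_quality", "high_quality", "high_quality", "low_quality"]
final_class = majority_vote(votes)
```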
  • Figure 18 is an illustrated flow diagram 1800 representing the variant quality classifier configured as a logistic regression model to classify a Target Variant 1862 as either belonging to the High Quality Class 1866 or the Low Quality Class 1868.
  • The Quality Classifier 1844 takes an Input Target Variant 1862, represented as a vector {x_1, ..., x_n} in which each value x_i is a variant feature in the plurality of variant features describing the Target Variant 1862, and generates a classification from Logistic Regression Model 1844.
  • In the Logistic Regression Model 1844, the model generates an output value in the range [0, 1], and a decision threshold boundary determines whether an input (i.e., Target Variant 1862) will be classified as an output of 0 or 1 (e.g., a decision threshold boundary of 0.5 leads to values in the range [0, 0.5) generating an output of 0 and values in the range [0.5, 1] generating an output of 1). The optimal decision threshold boundary may be determined by optimizing a particular performance metric when training the logistic regression model, such as accuracy, precision, recall, or a particular error function.
  • The binary output values 0 and 1 are assigned to two output classes, High Quality Class 1866 or Low Quality Class 1868.
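  • As a non-limiting illustration, the scoring and thresholding steps of a logistic regression classifier can be sketched as follows. The weights, bias, and feature values are hypothetical placeholders rather than trained parameters, and which of the outputs 0 and 1 corresponds to the High Quality Class is left open here, as it is in the description above.

```python
import math


def logistic_score(features: list[float], weights: list[float], bias: float) -> float:
    """Sigmoid of the weighted feature sum, giving a value in [0, 1]."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Use the algebraically equivalent form for negative z to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(score: float, threshold: float = 0.5) -> int:
    """Return 1 when the score reaches the decision threshold boundary, else 0."""
    return 1 if score >= threshold else 0


score = logistic_score([0.4, 1.2], weights=[1.5, -0.5], bias=0.0)  # hypothetical values
label = classify(score)
```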
  • Figure 19 is an illustrated flow diagram 1900 representing the variant quality classifier configured as a neural network to classify a Target Variant 1962 as either belonging to the High Quality Class 1966 or the Low Quality Class 1968.
  • The Quality Classifier 1944 takes an Input Target Variant 1962, represented as a vector {x_1, ..., x_n} in which each value x_i is a variant feature in the plurality of variant features describing the Target Variant 1962, and generates a classification from Neural Network 1944.
  • The neural network model processes the Input Target Variant 1962 via a series of connected layers of nodes which each perform a respective weighted data transformation.
  • Backpropagation through the network updates the weights of each node iteratively during the training process and the final trained model generates an output belonging to the High Quality Class 1966 or Low Quality Class 1968 for the Input Target Variant 1962.
  • Variants identified as high quality or low quality may undergo further filtering steps in certain embodiments, as described further below.
  • Figure 20 is a flow diagram 2000 of the unique mapper overview process to further improve the quality of a set of called variants following variant calling or machine learning variant classification.
  • Sequenced Reads of a Sample of a Target Species 2002 undergo a filtering process via Cascading Filters 2004 to remove Low Quality Sequenced Reads 2024.
  • The sequencing data may be obtained from Binary Alignment Map (.bam) files.
  • The series of Cascading Filters 2004 comprises filters that remove variants with an incorrect codon match between primates and humans, filters that remove variants with annotation errors, gene-specific filters (e.g., for a skewed distribution of variant machine learning classifier quality scores compared with exomewide scores, or for deviations from the Hardy-Weinberg equilibrium), and filters that remove variants that do not meet a particular machine learning classifier performance metric threshold.
  • The resulting Intermediate Set of Sequenced Reads 2006 is mapped to a Pseudo-Target Reference Genome 2008 and to a Non-Target Reference Genome 2026.
  • The Pseudo-Target Reference Genome 2008 is divided into a number of bins (i.e., sequential non-overlapping genomic regions of specified equal length).
  • The Non-Target Reference Genome 2026 is also divided into an equivalent number of bins of equivalent size in comparison to the Pseudo-Target Reference Genome 2008 bins. Bins are compared on a one-to-one basis to determine the degree of mapping homology between corresponding bins. The best-mapped bin is identified as the bin wherein the degree of match (i.e., alignment between the mapped genomes for the bin) is highest, and is used to generate a Unique Mapper Score 2040.
  • The Unique Mapper Score 2040 is specific to each sample, and the Unique Mapper Scores across all samples of a specific reference target species are averaged to obtain a single mean Unique Mapper Score, which applies to all variants of the reference target species that fall into each respective bin.
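  • As a non-limiting illustration of the averaging step only, the following sketch computes a mean Unique Mapper Score per bin across the samples of one reference target species. The per-sample scores are treated as given inputs, and the data layout (one dictionary per sample, keyed by bin index) and the function name are assumptions of this sketch.

```python
from collections import defaultdict


def mean_unique_mapper_scores(per_sample_scores: list[dict[int, float]]) -> dict[int, float]:
    """Average each bin's score over all samples that report a score for that bin."""
    totals: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for sample_scores in per_sample_scores:
        for bin_index, score in sample_scores.items():
            totals[bin_index] += score
            counts[bin_index] += 1
    return {bin_index: totals[bin_index] / counts[bin_index] for bin_index in totals}


# Two hypothetical samples of the same species, each with scores for bins 0 and 1.
samples = [{0: 0.8, 1: 0.4}, {0: 0.6, 1: 0.6}]
mean_scores = mean_unique_mapper_scores(samples)
```

  • Every variant of the species that falls into a given bin would then be assigned that bin's mean score.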
  • Figure 21 schematically illustrates the Gene Annotation Filter 2100 in the series of cascaded filters for variant quality filtering.
  • Gene annotation includes the labeling of a genome for features such as gene location, coding and non-coding regions, and various descriptors of genetic function. Incorrect gene annotation can lead to error in the variant calling process.
  • Gene A 2102 is located at a genomic region and includes Feature X at a specific location within its structure.
  • Gene A 2104 is the correct gene annotation of Gene A 2102 wherein the genomic structure is correctly annotated with Feature X properly located.
  • Gene B 2106 is in a different location and contains different structure (i.e., contains Feature Y rather than Feature X).
  • Gene A 2102 may be incorrectly annotated as Gene B 2106.
  • Any resulting mapping to Gene B 2106 is erroneous (i.e., despite being labeled as Gene B, the genomic sequence belongs to Gene A). Called variants mapped to genes with annotation errors are filtered out.
  • Figure 22 schematically illustrates the process 2200 of codon transcription and translation and of filtering for codon match.
  • Genetic Sequence A 2202 consists of nucleic acids. The nucleic acid sequence in Genetic Sequence A 2202 undergoes transcription to generate mRNA Transcript A 2242. Following transcription, mRNA Transcript A 2242 is translated to generate Amino Acid Sequence A 2262. Each amino acid is translated from a sequence of three nucleic acids referred to as a codon, as highlighted by grey shaded boxes for five total codons across Genetic Sequence A 2202, mRNA Transcript A 2242, and Amino Acid Sequence A 2262. Codon B 2282 and Codon C 2284 contain identical nucleic acids and will therefore be transcribed and translated into the same amino acid.
  • Codon D 2286 and Codon E 2288 differ in the third nucleic acid position and will not transcribe and translate to the same amino acid. If a non-target reference genome contained Codon D 2286 and a pseudo-target reference genome contained Codon E 2288 at the same aligned position, these aligned codons would not match and called variants that align to the genomic region corresponding to Codon D 2286 and Codon E 2288 are filtered out.
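  • As a non-limiting illustration, a codon match check can be sketched as follows using the standard genetic code. This sketch assumes that codons are given as coding-strand DNA and that two aligned codons match when they translate to the same amino acid; a stricter reading that requires identical nucleic acids would compare the codon strings directly. The function name and the example codons are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_match(non_target_codon: str, pseudo_target_codon: str) -> bool:
    """True when both aligned codons translate to the same amino acid ('*' denotes stop)."""
    return CODON_TABLE[non_target_codon.upper()] == CODON_TABLE[pseudo_target_codon.upper()]


keep_region = codons_match("TGT", "TGC")    # both translate to cysteine
filter_region = codons_match("TGC", "TGA")  # cysteine versus a stop codon
```

  • Called variants aligned to a position where the check fails, as in the second hypothetical pair, would be filtered out.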
  • Figure 23 illustrates the process 2300 of filtering genes based on a distribution of machine learning scores. Scores from a variant quality classifier are plotted on a graph measuring frequency for both a specific gene, represented by a Specific Gene Distribution 2304, and the whole exome, represented by an Exomewide Distribution 2302. A Wilcoxon rank sum test determines whether the Specific Gene Distribution 2304 is skewed in comparison to the Exomewide Distribution 2302. The test assesses the significance of a departure from the hypothesis that the probability of a randomly selected machine learning score from the Specific Gene Distribution 2304 being greater than a randomly selected machine learning score from the Exomewide Distribution 2302 is equal to the probability of a randomly selected machine learning score from the Exomewide Distribution 2302 being greater than a randomly selected machine learning score from the Specific Gene Distribution 2304.
  • Called variants mapped to genes whose score distributions are determined to be skewed relative to the Exomewide Distribution 2302 are filtered out. A determination of skew when comparing the Exomewide Distribution 2302 and the Specific Gene Distribution 2304 identifies the gene as an outlier with potential error.
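  • As a non-limiting illustration, a rank sum comparison of two score distributions can be sketched as follows. The sketch uses the equivalent Mann-Whitney U statistic with a normal approximation and applies neither a tie correction nor a continuity correction; these simplifications, the function name, and the example scores are assumptions of the sketch.

```python
import math


def rank_sum_test(gene_scores: list[float], exome_scores: list[float]) -> tuple[float, float]:
    """Return (U, two-sided p-value) comparing gene scores against exomewide scores."""
    n, m = len(gene_scores), len(exome_scores)
    # U counts the pairs in which the gene score exceeds the exomewide score;
    # a tied pair contributes one half.
    u = sum(
        1.0 if g > e else 0.5 if g == e else 0.0
        for g in gene_scores
        for e in exome_scores
    )
    mean_u = n * m / 2.0
    sd_u = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return u, p_value


# Hypothetical scores: a gene whose scores are all higher than the exomewide scores.
exome = [i / 100 for i in range(10)]
gene = [0.5 + i / 100 for i in range(10)]
u_stat, p_value = rank_sum_test(gene, exome)
```

  • A small p-value indicates that one distribution tends to produce larger scores than the other, which is the skew that the gene-specific filter is described as detecting.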
  • Figure 24 illustrates deviation from Hardy-Weinberg Equilibrium in an example population 2400.
  • A filter within the Cascading Filters 2004 removes variants that deviate from the Hardy-Weinberg Equilibrium.
  • Dominant alleles are represented by the letter ‘p’ and recessive alleles are represented by the letter ‘q’.
  • Homozygous dominant genotypes i.e., ‘pp’; 2402
  • Heterozygous genotypes (‘pq’; 2404) are represented by diamond crosshatched circles.
  • Homozygous recessive genotypes (‘qq’; 2406) are represented by downward diagonal shaded circles.
  • The population shown includes 25 samples with respective genotypes.
  • In Generation One 2442, each genotype has a respective frequency counted as the proportion of the respective genotype to the total population count.
  • In Generation Two 2444, each genotype has an updated respective frequency counted as the proportion of the respective genotype to the total population count. Populations with unchanged genotype frequencies in sequential generations are considered to be in Hardy-Weinberg Equilibrium. The genotype frequencies for the example population shown in Figure 24 are different in Generation Two 2444 from Generation One 2442; therefore, the population deviates from the Hardy-Weinberg Equilibrium.
  • Deviations from the Hardy-Weinberg Equilibrium can result in overcalling of heterozygous genotypes; as a result, called variants mapped to genes that are not in Hardy-Weinberg Equilibrium, as determined by large population databases, are filtered out.
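  • As a non-limiting illustration, one conventional way to test a single set of genotype counts for deviation from Hardy-Weinberg Equilibrium is a chi-square comparison against the expected proportions p^2, 2pq, and q^2. The description above does not state which test is applied, so the use of the chi-square statistic, the function name, and the example counts are assumptions of this sketch.

```python
def hwe_chi_square(n_pp: int, n_pq: int, n_qq: int) -> float:
    """Chi-square statistic of observed genotype counts against Hardy-Weinberg expectations."""
    n = n_pp + n_pq + n_qq
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_pp + n_pq) / (2 * n)  # frequency of the 'p' allele
    q = 1.0 - p                      # frequency of the 'q' allele
    observed = (n_pp, n_pq, n_qq)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Genotype classes with zero expected count are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


in_equilibrium = hwe_chi_square(4, 12, 9)   # 25 hypothetical samples matching p = 0.4
excess_hets = hwe_chi_square(0, 25, 0)      # 25 hypothetical samples, all heterozygous
```

  • With one degree of freedom, a statistic above approximately 3.84 corresponds to a deviation significant at the 0.05 level; the cutoff actually used for filtering is not specified above.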
  • Figure 25 is an illustration 2500 of a nonsense variant.
  • A filter within the Cascading Filters 2004 removes nonsense variants.
  • Nonsense variants, also referred to as ‘stop gain variants’, result from single nucleotide polymorphisms that change a codon such that a codon that previously translated to an amino acid instead translates to a stop codon in the mutated sequence.
  • Premature stop codons prevent the remainder of the mRNA transcript from being translated and as a result the amino acid sequence is terminated early.
  • Genetic Sequence B 2502 is transcribed to mRNA Transcript B 2522, and mRNA Transcript B 2522 is translated into Amino Acid Sequence B 2542 for a total of five codons.
  • Single Nucleotide Polymorphism 2540 at position 12 results in a change from a guanine nucleic acid to a thymine nucleic acid.
  • The fourth codon has changed from ACG to ACT and will subsequently be transcribed and translated to a stop codon rather than to a cysteine amino acid residue.
  • The premature stop codon ends translation, and the fifth codon will never be translated.
  • Genetic Sequence B 2562 is transcribed to mRNA Transcript B 2582, and mRNA Transcript B 2582 is translated into Amino Acid Sequence B 2510 for a total of four codons, where the fourth codon is now a stop codon.
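  • As a non-limiting illustration, the stop gain described above can be checked as follows. The sketch assumes that the DNA codons are written on the template strand, so that transcription yields the base-by-base complement with uracil in place of thymine; this assumption is consistent with ACG corresponding to cysteine (UGC) and ACT to a stop codon (UGA). The function names are hypothetical.

```python
# Template-strand DNA base -> complementary mRNA base.
COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}
STOP_CODONS = {"UAA", "UAG", "UGA"}


def transcribe_template_codon(dna_codon: str) -> str:
    """Return the mRNA codon complementary to a template-strand DNA codon."""
    return "".join(COMPLEMENT[base] for base in dna_codon.upper())


def is_stop_gain(reference_codon: str, variant_codon: str) -> bool:
    """True when the variant turns an amino-acid codon into a stop codon."""
    reference_mrna = transcribe_template_codon(reference_codon)
    variant_mrna = transcribe_template_codon(variant_codon)
    return reference_mrna not in STOP_CODONS and variant_mrna in STOP_CODONS


nonsense = is_stop_gain("ACG", "ACT")  # the G-to-T change in the fourth codon
```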
  • Figure 26 contains a graph 2600 of collected results demonstrating the cascading filter effect on number of nonsense variants per sample.
  • The number of nonsense variants per sample is compared between samples from non-human primate species and human samples. Without filtering, the called variants from samples from non-human primate species include a significantly higher number of nonsense variants per sample than the corresponding human level.
  • The called variants from samples from non-human primate species undergo cascaded filters including a codon match filter, gene annotation error filter, machine learning distribution skew filter, Hardy-Weinberg Equilibrium deviation filter, Unique Mapper filter (called variants with Unique Mapper scores less than 0.6 removed), and random forest score filter (called variants with a random forest score greater than 0.17 removed).
  • Boxplots show that the average number of stop-gained variants per sample of each primate reference species was gradually reduced to close to the human level after a series of variant filtering steps, including requiring codon match; removing SNPs in poorly annotated genes, in genes with a skewed random forest (RF) score distribution, or in genes deviating from Hardy-Weinberg equilibrium; and removing SNPs with a unique-mapper score < 0.6 or an RF score > 0.17.
  • Each dot represents the average number of stop-gained variants of each primate reference species.
  • The horizontal line shows the average number of stop-gained variants of human samples from the Platinum genome project.
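  • As a non-limiting illustration, the two score-based filters named above can be applied to called variants as follows. The thresholds of 0.6 and 0.17 are those stated above; the dictionary field names, the function name, and the example variants are hypothetical, and the other filters in the cascade are not shown.

```python
def passes_score_filters(variant: dict, min_unique_mapper: float = 0.6,
                         max_rf_score: float = 0.17) -> bool:
    """Keep a variant unless its Unique Mapper score is below 0.6 or its RF score exceeds 0.17."""
    return (variant["unique_mapper_score"] >= min_unique_mapper
            and variant["rf_score"] <= max_rf_score)


# Hypothetical called variants with precomputed scores.
called_variants = [
    {"id": "v1", "unique_mapper_score": 0.90, "rf_score": 0.05},
    {"id": "v2", "unique_mapper_score": 0.40, "rf_score": 0.05},  # low Unique Mapper score
    {"id": "v3", "unique_mapper_score": 0.90, "rf_score": 0.30},  # high RF score
]
retained = [v["id"] for v in called_variants if passes_score_filters(v)]
```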
  • Figure 27 contains a graph 2700 of collected results demonstrating the cascading filter effect on the missense:synonymous ratio (MSR) of called variants per sample.
  • The called variants from samples from non-human primate species undergo cascaded filters including a codon match filter, gene annotation error filter, machine learning distribution skew filter, Hardy-Weinberg Equilibrium deviation filter, Unique Mapper filter (called variants with Unique Mapper scores less than 0.6 removed), and random forest score filter (called variants with a random forest score greater than 0.17 removed). Boxplots show that the missense:synonymous ratios decreased after the variant filtering steps. Each dot represents the MSR of each primate reference species. The black line represents the MSR of human samples.
  • Figure 28 contains a graph 2800 of collected results demonstrating the cascading filter effect on number of insertion-deletion variants (indels) per sample.
  • The called variants from samples from non-human primates undergo cascading filters including a gene annotation error filter, machine learning distribution skew filter, Hardy-Weinberg Equilibrium deviation filter, and Unique Mapper filter (called variants with Unique Mapper scores less than 0.6 removed).
  • The average number of indels per sample of each primate reference species diminished after the filtering steps.
  • Figure 29 shows an example computer system 2900 that can be used to implement the technology disclosed.
  • Computer system 2900 includes at least one central processing unit (CPU) 2924 that communicates with a number of peripheral devices via bus subsystem 2922.
  • These peripheral devices can include a storage subsystem 2910 including, for example, memory devices and a file storage subsystem 2918, user interface input devices 2920, user interface output devices 2928, and a network interface subsystem 2926.
  • The input and output devices allow user interaction with computer system 2900.
  • Network interface subsystem 2926 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.
  • The random forest model 1744 is communicably linked to the storage subsystem 2910 and the user interface input devices 2920.
  • User interface input devices 2920 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices.
  • Use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 2900.
  • User interface output devices 2928 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
  • The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
  • The display subsystem can also provide a non-visual display such as audio output devices.
  • Use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 2900 to the user or to another machine or computer system.
  • Storage subsystem 2910 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 2930.
  • Processors 2930 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs).
  • Processors 2930 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™.
  • Examples of processors 2930 include Google’s Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX29 Rackmount Series™, NVIDIA DGX-1™, Microsoft’s Stratix V FPGA™, Graphcore’s Intelligent Processor Unit (IPU)™, Qualcomm’s Zeroth Platform™ with Snapdragon processors™, NVIDIA’s Volta™, NVIDIA’s DRIVE PX™, NVIDIA’s JETSON TX1/TX2 MODULE™, Intel’s Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM’s DynamicIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.
  • Memory subsystem 2912 used in the storage subsystem 2910 can include a number of memories including a main random access memory (RAM) 2914 for storage of instructions and data during program execution and a read only memory (ROM) 2916 in which fixed instructions are stored.
  • A file storage subsystem 2918 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
  • The modules implementing the functionality of certain implementations can be stored by file storage subsystem 2918 in the storage subsystem 2910, or in other machines accessible by the processor.
  • Bus subsystem 2922 provides a mechanism for letting the various components and subsystems of computer system 2900 communicate with each other as intended. Although bus subsystem 2922 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.
  • Computer system 2900 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 2900 depicted in Figure 29 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 2900 are possible having more or fewer components than the computer system depicted in Figure 29.
Clauses
  • One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
  • One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
  • Implementations of the clauses described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section.
  • Implementations of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.
  • a computer-implemented method of determining feasibility of using a reference genome of a non-target species for variant calling a sample of a target species including: mapping sequenced reads of a sample of a target species to a reference genome of a non-target species to detect a first set of variants in the sequenced reads of the sample of the target species; mapping the sequenced reads of the sample of the target species to a reference genome of a pseudo-target species to detect a second set of variants in the sequenced reads of the sample of the target species; comparing the first set of variants and the second set of variants, and identifying a subset of true positive variants that are common between the first set of variants and the second set of variants; comparing the first set of variants and the second set of variants, and identifying a subset of false positive variants that are present in the second set of variants but absent from the first set of variants; and based on a count of the subset of false positive variants determining the feasibility of using the reference genome
  • a system including one or more processors coupled to memory, the memory loaded with computer instructions to determine feasibility of using a reference genome of a non-target species for variant calling a sample of a target species, the instructions, when executed on the processors, implement actions comprising: mapping sequenced reads of a sample of a target species to a reference genome of a non-target species to detect a first set of variants in the sequenced reads of the sample of the target species; mapping the sequenced reads of the sample of the target species to a reference genome of a pseudo-target species to detect a second set of variants in the sequenced reads of the sample of the target species; comparing the first set of variants and the second set of variants, and identifying a subset of true positive variants that are common between the first set of variants and the second set of variants; comparing the first set of variants and the second set of variants, and identifying a subset of false positive variants that are present in the second set of variants but absent from the first set of variants; and
  • a non-transitory computer readable storage medium impressed with computer program instructions to determine feasibility of using a reference genome of a non-target species for variant calling a sample of a target species the instructions, when executed on a processor, implement a method comprising: mapping sequenced reads of a sample of a target species to a reference genome of a non-target species to detect a first set of variants in the sequenced reads of the sample of the target species; mapping the sequenced reads of the sample of the target species to a reference genome of a pseudo-target species to detect a second set of variants in the sequenced reads of the sample of the target species; comparing the first set of variants and the second set of variants, and identifying a subset of true positive variants that are common between the first set of variants and the second set of variants; comparing the first set of variants and the second set of variants, and identifying a subset of false positive variants that are present in the second set of variants but absent from the first set of variants; and
  • The non-transitory computer readable storage medium of clause 23, further including detecting the second set of variants by mapping the sequenced reads of the sample of the target species to the reference genome of the target species, and then lifting-over the mapped sequenced reads of the sample of the target species to the reference genome of the non-target species.
  • The non-transitory computer readable storage medium of clause 23, further including applying a first filter to filter out low-quality variants from the first set of variants and the second set of variants.
  • The non-transitory computer readable storage medium of clause 23, further including applying a second filter to filter out, from the first set of variants and the second set of variants, fixed substitutions shared between the reference genome of the non-target species and the reference genome of the pseudo-target species.
  • a system comprising: a variant quality classifier configured to process a plurality of features of a target variant, and generate a quality indication for the target variant, wherein the variant quality classifier is trained on a set of high-quality variants and a set of low-quality variants, wherein high-quality variants in the set of high-quality variants are identified as true positive variants that are common between a first set of variants and a second set of variants, wherein low-quality variants in the set of low-quality variants are identified as false positive variants that are present in the second set of variants but absent from the first set of variants, wherein the first set of variants is detected by variant calling sequenced reads of a sample of a target species against a reference genome of a non-target species, and wherein the second set of variants is detected by variant calling the sequenced reads of the sample of the target species against a reference genome of a pseudo-target species.
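  • As a non-limiting illustration, the identification of true positive and false positive variants from the first and second sets of variants can be sketched as follows. Representing each variant as a (chromosome, position, reference allele, alternate allele) tuple in a shared coordinate system is an assumption of this sketch, as are the function name and the example variants; the mapping and variant calling steps that produce the two sets are not shown.

```python
Variant = tuple[str, int, str, str]  # (chromosome, position, reference allele, alternate allele)


def label_variants(first_set: set[Variant], second_set: set[Variant]) -> tuple[set[Variant], set[Variant]]:
    """Return (true positives, false positives) as described in the clauses.

    True positives are common to both sets; false positives are present in the
    second set but absent from the first set.
    """
    true_positives = first_set & second_set
    false_positives = second_set - first_set
    return true_positives, false_positives


# Hypothetical call sets for one sample of a target species.
first = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")}
second = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")}
high_quality, low_quality = label_variants(first, second)
```

  • The two resulting subsets correspond to the set of high-quality variants and the set of low-quality variants on which the variant quality classifier is trained, and the size of the false positive subset is the count used in the feasibility determination.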
  • a feature in the plurality of features of the target variant is a guanine-cytosine (GC) content within the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a guanine-cytosine (GC) skew within the sequenced reads of the target variant, wherein the GC skew represents a normalized excess of cytosine over guanine in a given sequenced read of the target variant.
  • a feature in the plurality of features of the target variant is a local composition complexity within one hundred base pairs upstream or downstream of the target variant.
  • a feature in the plurality of features of the target variant is an allelic count of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mapping quality of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a p-value of Fisher’s exact test to detect strand bias in the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a symmetric odds ratio to detect strand bias in the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a variant quality by depth of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a genotype quality of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a read depth of the target variant normalized by a mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a fraction alternative allele read depth out of a target variant coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is an existence of insertion and/or deletion (indel) mutations within five base pairs upstream or downstream of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is an existence of insertion and/or deletion (indel) mutations within ten base pairs upstream or downstream of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mean coverage of flanking regions one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by the mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mean coverage of flanking regions five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by the mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of heterozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of heterozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of homozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of homozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of alternate homozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of alternate homozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a computer-implemented method of processing a plurality of features of a target variant and generating a quality indication for the target variant, including: training a variant quality classifier on a set of high-quality variants and a set of low-quality variants; identifying high-quality variants in the set of high-quality variants as true positive variants that are common between a first set of variants and a second set of variants; identifying low-quality variants in the set of low-quality variants as false positive variants that are present in the second set of variants but absent from the first set of variants; detecting the first set of variants by variant calling sequenced reads of a sample of a target species against a reference genome of a non-target species; and detecting the second set of variants by variant calling the sequenced reads of the sample of the target species against a reference genome of a pseudo-target species.
  • a feature in the plurality of features of the target variant is a guanine-cytosine (GC) content within the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a guanine-cytosine (GC) skew within the sequenced reads of the target variant, wherein the GC skew represents a normalized excess of cytosine over guanine in a given sequenced read of the target variant.
  • a feature in the plurality of features of the target variant is a local composition complexity within one hundred base pairs upstream or downstream of the target variant.
  • a feature in the plurality of features of the target variant is an allelic count of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a symmetric odds ratio to detect strand bias in the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a variant quality by depth of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a genotype quality of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a read depth of the target variant normalized by a mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a fraction alternative allele read depth out of a target variant coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is an existence of insertion and/or deletion (indel) mutations within five base pairs upstream or downstream of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is an existence of insertion and/or deletion (indel) mutations within ten base pairs upstream or downstream of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mean coverage of flanking regions one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by the mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mean coverage of flanking regions five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by the mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of heterozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of heterozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of homozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of homozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of alternate homozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of alternate homozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a guanine-cytosine (GC) content within the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a guanine-cytosine (GC) skew within the sequenced reads of the target variant, wherein the GC skew represents a normalized excess of cytosine over guanine in a given sequenced read of the target variant.
  • a feature in the plurality of features of the target variant is a p-value of Fisher’s exact test to detect strand bias in the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a symmetric odds ratio to detect strand bias in the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a read depth of the target variant normalized by a mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is an existence of insertion and/or deletion (indel) mutations within five base pairs upstream or downstream of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is an existence of insertion and/or deletion (indel) mutations within ten base pairs upstream or downstream of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mean coverage of flanking regions one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by the mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a mean coverage of flanking regions five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by the mean coverage of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of heterozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of heterozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of homozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of homozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of alternate homozygote single nucleotide polymorphisms within one hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a feature in the plurality of features of the target variant is a number of alternate homozygote single nucleotide polymorphisms within five hundred base pairs upstream or downstream of the sequenced reads of the target variant normalized by a median count of variants within the same length regions of the sequenced reads of the target variant.
  • a computer-implemented method of identifying and excluding regions that do not have one-to-one mapping between a first reference genome and a second reference genome including: accessing sequenced reads of a sample of a target species; identifying and removing, from the sequenced reads, low-quality sequenced reads based on applying a mapping quality filter to the sequenced reads, thereby pruning high-quality sequenced reads from the sequenced reads; segmenting a non-target reference genome of a non-target species into a plurality of bins, and then, on a bin-by-bin basis, mapping the high-quality sequenced reads to the plurality of bins in the non-target reference genome; segmenting a pseudo-target reference genome of a pseudo-target species into the plurality of bins, and then, on the bin-by-bin basis, mapping the high-quality sequenced reads to the plurality of bins in the pseudo-target reference genome; identifying a best-mapped bin in the pseudo-target reference genome based on a greatest degree
  • a filter within the plurality of cascading filters is configured to detect and exclude genes within a reference genome possessing a skewed distribution of variant classifier scores in comparison to a distribution of variant classifier scores for the complete reference genome.
  • a filter within the plurality of cascading filters is configured to detect and remove single nucleotide polymorphisms with a random forest score of greater than 0.17.
  • a one-to-one mapping describes the fraction of the reads in one bin within the non-target reference genome that map to a single corresponding region within the pseudo-target reference genome.
  • a system including one or more processors coupled to memory, the memory loaded with computer instructions to identify and exclude regions that do not have one-to-one mapping between a first reference genome and a second reference genome, the instructions, when executed on the processors, implement actions comprising: accessing sequenced reads of a sample of a target species; identifying and removing, from the sequenced reads, low-quality sequenced reads based on applying a mapping quality filter to the sequenced reads, thereby pruning high-quality sequenced reads from the sequenced reads; segmenting a non-target reference genome of a non-target species into a plurality of bins, and then, on a bin-by-bin basis, mapping the high-quality sequenced reads to the plurality of bins in the non-target reference genome; segmenting a pseudo-target reference genome of a pseudo
  • a filter within the plurality of cascading filters is configured to detect and exclude codons that do not match between the pseudo-target species reference genome and the non-target species reference genome.
  • a filter within the plurality of cascading filters is configured to detect and exclude genes within a reference genome possessing a skewed distribution of variant classifier scores in comparison to a distribution of variant classifier scores for the complete reference genome.
  • a filter within the plurality of cascading filters is configured to detect and remove single nucleotide polymorphisms with a random forest score of greater than 0.17.
  • the non-transitory computer readable storage medium of clause 159, wherein a filter within the plurality of cascading filters is configured to detect and exclude genetic regions possessing incorrect gene annotation in a reference genome.
  • a filter within the plurality of cascading filters is configured to detect and exclude codons that do not match between the pseudo-target species reference genome and the non-target species reference genome.
  • a filter within the plurality of cascading filters is configured to detect and exclude genes within a reference genome possessing a skewed distribution of variant classifier scores in comparison to a distribution of variant classifier scores for the complete reference genome.
  • the non-transitory computer readable storage medium of clause 159, wherein a filter within the plurality of cascading filters is configured to detect and remove single nucleotide polymorphisms with a random forest score of greater than 0.17.
  • the non-transitory computer readable storage medium of clause 157, wherein the non-target species is a human.
  • the non-transitory computer readable storage medium of clause 157, wherein the non-target species is a non-human primate.
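
As a rough illustration of how a few of the per-variant features named in the clauses above might be computed, the following Python sketch covers GC content, GC skew, alternative allele fraction, depth normalized by mean coverage, and the 0.17 random forest score cutoff. The function names, the representation of a read as a plain string, and the integer depth inputs are assumptions made for this example and are not taken from the disclosure. GC skew follows the clause wording (excess of cytosine over guanine, normalized by their sum).

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a read sequence (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq: str) -> float:
    """Normalized excess of cytosine over guanine: (C - G) / (C + G)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0  # no G or C bases, so the skew is undefined; report 0.0
    return (c - g) / (c + g)


def alt_allele_fraction(alt_depth: int, total_depth: int) -> float:
    """Alternative allele read depth as a fraction of coverage at the variant."""
    return alt_depth / total_depth if total_depth else 0.0


def normalized_depth(depth: int, mean_coverage: float) -> float:
    """Read depth at the variant divided by the sample's mean coverage."""
    return depth / mean_coverage if mean_coverage else 0.0


def passes_score_filter(rf_score: float, threshold: float = 0.17) -> bool:
    """Keep a SNP unless its random forest score is greater than the threshold."""
    return rf_score <= threshold
```

For example, `gc_skew("CCCG")` gives 0.5 under this definition, and a SNP scored 0.2 would be removed by `passes_score_filter`.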

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The technology disclosed relates to determining the feasibility of using a reference genome of a non-target species for variant calling a sample of a target species. In particular, the technology disclosed relates to mapping sequenced reads of a sample of a target species to a reference genome of a non-target species to detect a first set of variants in the sequenced reads of the sample of the target species, and mapping the sequenced reads of the sample of the target species to a reference genome of a pseudo-target species to detect a second set of variants in the sequenced reads of the sample of the target species.
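
The comparison of the two variant sets can be sketched in Python as a set intersection and a set difference. This is a minimal sketch under stated assumptions: the `Variant` tuple and the `label_variants` function are hypothetical names, and both call sets are assumed to have already been expressed in a shared coordinate system, a step the sketch does not perform. Variants found in both sets are labeled high-quality (true positives), and variants found only in the second (pseudo-target) set are labeled low-quality (false positives), as the clauses describe.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical variant key; assumes coordinates are already comparable."""
    chrom: str
    pos: int
    ref: str
    alt: str


def label_variants(
    first_set: Iterable[Variant],   # calls against the non-target reference
    second_set: Iterable[Variant],  # calls against the pseudo-target reference
) -> Tuple[Set[Variant], Set[Variant]]:
    """Return (high_quality, low_quality) training labels."""
    first, second = set(first_set), set(second_set)
    high_quality = second & first  # common to both sets: true positives
    low_quality = second - first   # only in the second set: false positives
    return high_quality, low_quality
```

The two returned sets could then serve as the positive and negative training examples for a classifier of the practitioner's choice.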
PCT/US2022/082462 2021-12-29 2022-12-28 Variant calling without a target reference genome Ceased WO2023129953A2 (fr)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2024539855A JP2024546515A (ja) 2021-12-29 2022-12-28 Variant calling without a target reference genome
EP22850997.2A EP4457816A2 (fr) 2021-12-29 2022-12-28 Variant calling without a target reference genome
KR1020247024941A KR20240124392A (ko) 2021-12-29 2022-12-28 Variant calling without a target reference genome
CN202280086391.6A CN118475983A (zh) 2021-12-29 2022-12-28 Variant calling without a target reference genome
CA3242595A CA3242595A1 (fr) 2021-12-29 2022-12-28 Variant calling without a target reference genome

Applications Claiming Priority (18)

Application Number Priority Date Filing Date Title
US202163294816P 2021-12-29 2021-12-29
US202163294827P 2021-12-29 2021-12-29
US202163294820P 2021-12-29 2021-12-29
US202163294828P 2021-12-29 2021-12-29
US202163294830P 2021-12-29 2021-12-29
US202163294813P 2021-12-29 2021-12-29
US63/294,816 2021-12-29
US63/294,813 2021-12-29
US63/294,828 2021-12-29
US63/294,830 2021-12-29
US63/294,827 2021-12-29
US63/294,820 2021-12-29
US17/952,198 US20230207051A1 (en) 2021-12-29 2022-09-23 Unique mapper tool for excluding regions without one-to-one mapping between a set of two reference genomes
US17/952,194 2022-09-23
US17/952,192 US20230207057A1 (en) 2021-12-29 2022-09-23 Variant calling without a target reference genome
US17/952,198 2022-09-23
US17/952,192 2022-09-23
US17/952,194 US12499974B2 (en) 2022-09-23 Quality detection of variant calling using a machine learning classifier

Publications (2)

Publication Number Publication Date
WO2023129953A2 (fr) 2023-07-06
WO2023129953A3 (fr) 2023-08-10

Family

ID=85150222

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/082462 Ceased WO2023129953A2 (fr) 2021-12-29 2022-12-28 Variant calling without a target reference genome

Country Status (1)

Country Link
WO (1) WO2023129953A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
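CN119741975A (zh) * 2024-12-26 2025-04-01 Shanghai Jiao Tong University Method for automatically constructing metabolic models of non-model species based on protein language models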

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130332081A1 (en) * 2010-09-09 2013-12-12 Omicia Inc Variant annotation, analysis and selection tool
MX2019014690A (es) * 2017-10-16 2020-02-07 Illumina Inc Deep learning-based techniques for training deep convolutional neural networks
EP4048810A4 (fr) * 2019-10-22 2023-11-22 Genembryomics Pty. Ltd Method for screening IVF embryos
US11475981B2 (en) * 2020-02-18 2022-10-18 Tempus Labs, Inc. Methods and systems for dynamic variant thresholding in a liquid biopsy assay

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JAGANATHAN, K. ET AL.: "Predicting splicing from primary sequence with deep learning", CELL, vol. 176, 2019, pages 535 - 548
SUNDARAM, L. ET AL.: "Predicting the clinical impact of human mutation with deep neural networks", NAT. GENET., vol. 50, 2018, pages 1161 - 1170, XP036902750, DOI: 10.1038/s41588-018-0167-z


Also Published As

Publication number Publication date
WO2023129953A3 (fr) 2023-08-10

Similar Documents

Publication Publication Date Title
US20230207051A1 (en) Unique mapper tool for excluding regions without one-to-one mapping between a set of two reference genomes
WO2023129953A2 (fr) Variant calling without a target reference genome
EP4413577A1 (fr) Predicting variant pathogenicity from evolutionary conservation using three-dimensional (3D) protein structure voxels
WO2023129621A1 (fr) Rare variant polygenic risk scores
US12499974B2 (en) Quality detection of variant calling using a machine learning classifier
US20230343413A1 (en) Protein structure-based protein language models
US20230108368A1 (en) Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples
US20240112751A1 (en) Copy number variation (cnv) breakpoint detection
US20250201348A1 (en) Artificial intelligence-based detection of gene conservation and expression preservation at base resolution
US20230207067A1 (en) Optimized burden test based on nested t-tests that maximize separation between carriers and non-carriers
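CN118475983A (zh) Variant calling without a target reference genome
CN118922887A (zh) Computer-implemented method for identifying rare variants that cause extreme levels of gene expression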
Danzi et al. Deep structured learning realizes variant prioritization for Mendelian diseases
Vu et al. Exploration of chaos game representation and integrative deep learning approaches for whole-genome sequencing-based grapevine genetic testing
EP4457818B1 (fr) Optimized burden test based on nested t-tests that maximize separation between carriers and non-carriers
Romero Better understanding genomic architecture with the use of applied statistics and explainable artificial intelligence
Rochefort-Boulanger et al. A Transparent and Generalizable Deep Learning Framework for Genomic Ancestry Prediction
WO2023059751A1 (fr) Predicting variant pathogenicity from evolutionary conservation using three-dimensional (3D) protein structure voxels
Huang et al. FusDRM-m5C: a hybrid model for accurate prediction of 5-methylcytosine modification sites based on feature fusion and attention mechanism
Liu A Rank Score Model of Variants Prioritization for Rare Disease
WO2023129622A1 (fr) Covariate correction for temporal data from phenotype measurements for different drug usage patterns
WO2023129619A1 (fr) Optimized burden test based on nested t-tests that maximize separation between carriers and non-carriers
Zhang et al. Feature Selection for SNP data based on Relief-SVM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22850997

Country of ref document: EP

Kind code of ref document: A2

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
WWE Wipo information: entry into national phase

Ref document number: 202280086391.6

Country of ref document: CN

Ref document number: 3242595

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2024539855

Country of ref document: JP

ENP Entry into the national phase

Ref document number: 20247024941

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022850997

Country of ref document: EP

Effective date: 20240729