[go: up one dir, main page]

US20210020266A1 - Phase-aware determination of identity-by-descent dna segments - Google Patents

Phase-aware determination of identity-by-descent dna segments Download PDF

Info

Publication number
US20210020266A1
US20210020266A1 US16/947,107 US202016947107A US2021020266A1 US 20210020266 A1 US20210020266 A1 US 20210020266A1 US 202016947107 A US202016947107 A US 202016947107A US 2021020266 A1 US2021020266 A1 US 2021020266A1
Authority
US
United States
Prior art keywords
ibd
haplotype
segments
sites
potential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/947,107
Inventor
William A. Freyman
Kimberly F. McManus
Suyash S. Shringarpure
Ethan M. Jewett
Adam Auton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
23andMe Inc
Original Assignee
23andMe Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 23andMe Inc filed Critical 23andMe Inc
Priority to US16/947,107 priority Critical patent/US20210020266A1/en
Assigned to 23ANDME, INC. reassignment 23ANDME, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: FREYMAN, WILLIAM A., JEWETT, ETHAN M., MCMANUS, KIMBERLY F., AUTON, ADAM, SHRINGARPURE, SUYASH S.
Publication of US20210020266A1 publication Critical patent/US20210020266A1/en
Priority to US17/249,520 priority patent/US20210193257A1/en
Priority to US18/503,841 priority patent/US12046327B1/en
Priority to US18/737,679 priority patent/US12260936B2/en
Priority to US19/057,224 priority patent/US20250191684A1/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30Data warehousing; Computing architectures
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813Hybridisation assays
    • C12Q1/6827Hybridisation assays for detection of mutation or polymorphism

Definitions

  • IBD estimates can best be exploited when they are made using phased haplotypes; this means each individual in the data set is represented by two sequences each of which consists of alleles co-located on the same chromosome and inherited from a different parent. IBD estimates that are phase aware can improve relationship and pedigree inference, allow health and trait inheritance to be traced, and make possible a range of other inferences regarding demographic history and ancestry that are not possible when IBD estimates are made using only unphased genotype data. Therefore, methods and systems that can improve performance of phase aware IBD estimates have significant value.
  • the disclosed implementations concern methods, apparatus, systems, and computer program products for processing haplotype data to accurately estimate IBD segments between individuals.
  • a first aspect of the disclosure provides computer-implemented methods for estimating IBD segments between individuals.
  • the system involves: a sequencer for sequencing nucleic acids of the test sample; a processor; and one or more computer-readable storage media having stored thereon instructions for execution on said processor to estimate IBD segments between individuals.
  • Another aspect of the disclosure provides a computer program product including a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement the methods above for estimating IBD segments.
  • a computer implemented method of processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes including: providing a first digital template including a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites; providing a second digital template including a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different; providing two or more haplotypes strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying the second digital template to the two or more haplotype
  • the first and second templates each have a size of at least four consecutive haplotype sites.
  • identifying the first matches among alleles at unmasked sites includes sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings.
  • computationally identifying IBD segments between the two or more haplotype strings further includes: computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and wherein merging the first and second matches among alleles to produce a merged set of IBD segments further includes computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments.
  • computationally identifying additional matches among alleles at unmasked sites employs a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template.
  • the first through sixth digital templates each include two masked sites and two unmasked sites.
  • the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
  • the two or more haplotype strings include at least one thousand haplotype strings. In some embodiments, the two or more haplotype strings include at least one million haplotype strings.
  • computationally identifying IBD segments between the two or more haplotype strings includes performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings. In some embodiments, computationally merging the first and second matches among alleles is performed while considering individual polymorphic sites of the two or more haplotype strings using the PBWT. In some embodiments, the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window.
  • PBWT positional Burrows-Wheeler transform
  • the total number of digital templates is k!/(m!*(k ⁇ m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • applying the first digital template comprises a deterministic process employing the first arrangement of masked and unmasked sites.
  • a system for processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes including: (a) one or more processors and associated memory; (b) computer readable instructions for: providing a first digital template including a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites; providing a second digital template including a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different; providing two or more haplotypes strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches
  • the first and second templates each have a size of at least four consecutive haplotype sites.
  • the instructions for identifying the first matches among alleles at unmasked sites includes instructions for sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings.
  • the instructions for identifying IBD segments between the two or more haplotype strings further include instructions for: computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and wherein merging the first and second matches among alleles to produce a merged set of IBD segments further includes computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments.
  • the instructions for identifying additional matches among alleles at unmasked sites employ a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template.
  • the first through sixth digital templates each include two masked sites and two unmasked sites.
  • the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
  • the two or more haplotype strings include at least one thousand haplotype strings. In some embodiments, the two or more haplotype strings include at least one million haplotype strings.
  • the instructions for identifying IBD segments between the two or more haplotype strings include instructions performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings. In some embodiments, the instructions for merging the first and second matches among alleles include instructions for performing the merging while considering individual polymorphic sites of the two or more haplotype strings using the PBWT.
  • PBWT positional Burrows-Wheeler transform
  • the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window. In some embodiments, the total number of digital templates is k!/(m!*(k ⁇ m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • a method of identifying IBD segments between two or more haplotypes strings, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism including: (a) computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of two or more haplotype strings at unmasked sites produced by applying a first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying a second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the first digital template includes a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the second digital template includes a second arrangement of masked and unmasked sites in the window of consecutive
  • a computer implemented method of determining identity by descent (IBD) segments including: determining first potential IBD segments among phased haplotype data for a plurality of individuals, wherein the first potential IBD segments have an end site; determining second potential IBD segments among haplotype data for the plurality of individuals, wherein the second potential IBD segments have a start site; determining that the end site of the first potential IBD segments and the start site of the second potential IBD segments are within a threshold number of sites of each other; and merging the first potential IBD segments and the second potential IBD segments to form a combined potential IBD segment.
  • IBD identity by descent
  • the first potential IBD segments and the second potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further includes: determining a phase switch error occurred at a site between the first potential IBD segment and the second potential IBD segment for the individual; and swapping the haplotypes for the individual from the position of the phase switch error.
  • the first potential IBD segments and the second potential IBD segments overlap for an individual of the plurality of individuals.
  • the first potential IBD segment and the second potential IBD segment each span at least the threshold number of sites. In some embodiments, the threshold number of sites is between about 0 and 500 SNPs.
  • the plurality of individuals do not share a parent-child relationship.
  • the method further includes: determining a third potential IBD segments among phased haplotype data for a plurality of individuals, wherein the third potential IBD segments have a start site; determining that the end site of the combined potential IBD segments and the start site of the third potential IBD segments are within the threshold number of SNPs of each other; and merging the combined potential IBD segments and the third potential IBD segments.
  • the combined potential IBD segments and the third potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further includes: determining a phase switch error occurred at a site between the combined potential IBD segment and the third potential IBD segment for the individual; and swapping the haplotypes for the individual from the position of the phase switch error. In some embodiments, the method further includes determining that the combined potential IBD segments have a minimum length in centimorgans and storing the combined potential IBD segments as IBD segments for the plurality of individuals.
  • a computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes including: providing two or more paired haplotypes strings for identification of IBD segments therebetween, each of the two or more paired haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and computationally iterating through the two or more paired haplotype strings by: (i) identifying a first potential IBD segment between the two or more haplotype strings by identifying matches among alleles of the haplotype strings; (ii) comparing the first site of the first potential IBD segment to the last site of a previously identified second potential IBD segment (iii) determining that the last site of the second potential IBD segment and the first site of the first potential IBD segment are within a threshold number of sites of each other; and (iv) merging the first potential IBD segment and the second potential IBD segment to form a combined potential I
  • a computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes including: (a) computationally identifying initial IBD segments between two or more haplotype strings by identifying first matches among alleles of the haplotype strings using a plurality of templates, each including a unique arrangement of masked and unmasked sites in a window of consecutive haplotype sites; and (b) providing information characterizing the initial IBD segments to a hidden Markov model (HMM) which removes potential phase switch errors to produce final IBD segment, wherein the HMM analyzes the information characterizing the initial IBD segments using distances between consecutive haplotype sites on a chromosome, one or more rates of recombination based on meiosis, and one or more rates of phase switch error based on a phasing method employed to phase the haplotypes.
  • HMM hidden Markov model
  • the method further includes, after (a) and before (b), removing some initial IBD segments determined to belong to haplotypes having less than a threshold amount of initial IBD segments, wherein the initial IBD segments provided to the HEIM in (b) have had some initial IBD segments removed.
  • the threshold amount of initial IBD segments is less than two initial IBD segments per chromosome.
  • a computer implemented method of determining identical-by-descent (IBD) segments including: (a) for each polymorphic site in a series of polymorphic sites of two individuals, obtaining an IBD state that indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and, if so, which of the two individuals' phased haplotypes are part of the IBD segment, wherein the series of polymorphic sites are included in one or more pairs of chromosomes; and (b) applying a hidden Markov model (HMM) to the IBD states to produce one or more error-corrected IBD segments, wherein the HMM model takes as input, in addition to the IBD states as observed IBD states, (i) a rate of recombination based on a number of meioses, and (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes; wherein applying the HMM model takes as input, in addition to the IBD states as
  • the HMM takes as input: (iii) genetic distances between consecutive sites on a chromosome.
  • transition rates of the HMM are based on a rate at which IBD segments start, which rate is modeled as a function of the number of meioses.
  • the rate at which IBD segments start ( ⁇ s ) is modeled as follows:
  • transition rates of the HMM are based on a rate at which IBD segments end.
  • the rate at which IBD segments end is modeled as a function of the number of meioses.
  • the rate at which IBD segments end ( ⁇ e ) is modeled as follows:
  • the IBD states include nine different IBD states, which indicate nine conditions of zero IBD, half IBD, and full IBD.
  • transition rates of the HMM are based on a transition matrix Q a in FIG. 8B .
  • transition rates of the HMM are weighted by a probability that full IBD between the two individuals is truly present.
  • the probability that full IBD between the two individuals is truly present is modeled as a logistic function of an amount of estimated full IBD.
  • the probability that full IBD between the two individuals is truly present ( ⁇ ) is modeled as follows:
  • the transition rates are weighted by weighting transitions into full IBD states with ⁇ , and weighting transitions out of full IBD states with 1/ ⁇ .
  • the IBD states include 9 different IBD states, and the transition rates are based on a transition matrix Q ⁇ in (Eq. 5).
  • transition rates of the HMM are based on the at least one rate of phase switch error.
  • the at least one rate of phase switch error includes a rate of phase switch error for each of the two individuals, there are 4 types of phase switch errors, the IBD states include 9 different IBD states, and the transition rates are based on a 36 ⁇ 36 transition matrix Q in (Eq. 6). In some embodiments, transition probabilities of the HMM are based on the genetic distances between consecutive sites on a chromosome.
  • the transition probabilities are obtained by exponentiating a transition matrix.
  • transition probabilities of hidden IBD states Y l+1 given hidden IBD states Y l are modeled as: P(Y l+1
  • Y l , m, ⁇ 0 , ⁇ 1 , ⁇ 2 ) e Qd l wherein m is the number of meioses, ⁇ 0 is a phase switch error rate for a first individual of the two individuals, ⁇ 1 is a phase switch error rate for a second individual of the two individuals, ⁇ 2 is an amount of estimated full IBD, Q is a transition matrix described by Eq. (Q), and d l is the genetic distances between sites l and l+1.
  • emission probabilities of the HMM are dependent on phase switch errors.
  • the emission probabilities are defined by a uniform error term that weights probabilities of observed IBD states based on four different ways the two individuals may be in phase switch errors.
  • (b) includes using transition probabilities and emission probabilities of the HMM to identify the most likely sequence of hidden IBD states given the observed states.
  • the mostly likely sequence of hidden IBD states is identified using a Viterbi dynamic programming process.
  • the method further includes: performing (a) and (b) for a plurality of iterations, each iteration using a different number of meioses for the rate of recombination, thereby producing a plurality of sets of error-corrected IBD segments; and selecting a set of error-corrected IBD segments having a highest likelihood or probability in the plurality of sets as a final estimate of one or more IBD segments.
  • the different numbers of meioses are in a range from 1 to 14.
  • the method is initiated when the two individuals' IBD segments including the series of polymorphic sites meet a criterion.
  • the two individuals' IBD segments include two or more IBD segments on a single chromosome.
  • the two individuals' IBD segments exceed a minimum total amount of shared IBD
  • FIG. 1 presents possible phase switch errors in long IBD segments.
  • FIGS. 2A and 2B present example process flows according to various embodiments discussed herein.
  • FIG. 3 presents an example process for finding IBD segments.
  • FIG. 4 illustrates haplotype strings, a positional prefix array, and a divergence array in accordance with various embodiments discussed herein.
  • FIGS. 5A and 5B present process flows according to various embodiments herein.
  • FIG. 5C presents pseudocode for an algorithm according to various embodiments herein.
  • FIG. 5D presents an illustration of the various data structures used in accordance with a Templated PBWT process as described herein.
  • FIGS. 6A and 6C present process flows for a phase switch error correction heuristic according to various embodiments herein.
  • FIG. 6B presents various types of phase switch errors that may occur between two pairs of haplotypes.
  • FIG. 7 illustrates using a HMM to process four haplotypes of two individuals to correct phase switch errors.
  • FIG. 8 presents a flow diagram for correcting phase switch errors in IBD segments using a HMM according to various embodiments.
  • FIG. 9A illustrates the structure of an example HMM model.
  • FIG. 9B illustrates a transition matrix that may be used as part of a HMM according to various embodiments herein.
  • FIG. 10 presents a functional diagram of a computer system for performing various embodiments disclosed herein.
  • FIG. 11 presents a block diagram of an IBD-based personal genomics service according to various embodiments herein.
  • FIG. 12 presents a plot comparing the speed of various IBD inference methods.
  • FIGS. 13-16 present plots comparing the IBD estimate errors of various methods.
  • FIGS. 17 and 18 present plots of errors of various methods for various simulated pairs of haplotypes.
  • FIG. 19 presents plots of the false positive and false negative rates of various methods.
  • FIGS. 20A and 20B illustrate the runtime of various methods and the parameters used for assessing runtime.
  • FIGS. 21 and 22 present illustrates of IBD haplotype sharing across Mexican states as determined by a Templated PBWT method as described herein.
  • the disclosure concerns methods, apparatus, systems, and computer program products for estimating IBD segments between individuals using haplotype data.
  • nucleic acids are written left to right in 5′ to 3′ orientation and amino acid sequences are written left to right in amino to carboxy orientation, respectively.
  • plurality refers to more than one element.
  • the term is used herein in reference to a number of nucleic acid molecules or sequence reads that is sufficient to identify significant differences in repeat expansions in test samples and control samples using the methods disclosed herein.
  • a DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment.
  • An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals.
  • DNA segments that are IBD are IBS per definition, but segments that are not IBD can still be IBS due to the same mutations in different individuals or recombinations that do not alter the segment.
  • nucleic acid refers to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester group to the 5′ position of the pentose of the next.
  • the nucleotides include sequences of any form of nucleic acid, including, but not limited to RNA and DNA molecules such as cell-free DNA (cfDNA) molecules.
  • cfDNA cell-free DNA
  • polynucleotide includes, without limitation, single- and double-stranded polynucleotides.
  • parameter herein refers to a numerical value that characterizes a physical property. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, a ratio (or function of a ratio) between the number of sequence tags mapped to a chromosome and the length of the chromosome to which the tags are mapped, is a parameter.
  • a site refers to a unique position (i.e. chromosome ID, chromosome position and orientation) on a reference genome.
  • a site may be a residue, a sequence tag, or a segment's position on a sequence.
  • the specific quantitative value may be based on multiple other quantities, not just the one identified.
  • chromosome refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is employed herein.
  • IBD identical-by-descent
  • phase aware IBD segments is challenging not only because of the large sizes of the genomic data sets but also due to two types of error that break up IBD segments: genotyping error and phase switch/phasing error. Because IBD segments are broken up by meiotic recombination events they are expected to be longer for close relatives. However, long IBD segments are more likely to be impacted by genotype and phasing errors compared to short segments. Thus, errors are particularly problematic when detecting IBD among individuals that are closely related (e.g. first, second, and third degree relatives) since long IBD segments are more likely to be fragmented by these errors. This makes accurate inference of phase aware IBD among close relatives particularly problematic.
  • Genotyping error is error introduced via genotyping (e.g., sequencing) in which an allele that is actually of one type (e.g., a T) gets called a different allele (e.g., an A). This commonly has the effect of prematurely terminating a sequence of matches that would otherwise be long enough to be considered an IBD segment.
  • Phase switch error is error introduced by phasing: maternal and paternal copies are reversed. See FIG. 1 . In some typical cases, statistical phasing processes introduce a phase switch error, on average, about every 12 centimorgans.
  • FIG. 1 illustrates phase switch error in fragment long IBD segments.
  • two IBD segments are shared on a single chromosome in two related individuals (top and bottom).
  • phase switch errors (illustrated by vertical dashed lines) occur at different positions along the chromosome in the two individuals, fragmenting the true IBD segments into many erroneous false IBD segments.
  • the individual represented by the top two haplotypes has six phase switch errors and the individual represented by the bottom two haplotypes has four phase switch errors.
  • phased haplotype data is processed using a method based on the positional Burrows-Wheeler transform (PBWT) and a probabilistic hidden Markov model (HMM).
  • phased haplotype data is processed using a method based on the PBWT and a heuristic to correct phase switch errors.
  • IBD segment finding methods may minimize genotyping and phasing error using one or more of the following techniques:
  • an IBD segment computation procedure passes multiple digital templates over phased haplotypes to differentially mask haplotype sites, thereby temporarily ignoring different sites of potential genotype error.
  • the IBD segments being generated by the different templates are combined to effectively remove fragmentation caused by genotyping error.
  • the procedure is a templated positional Burrows-Wheeler transform. This is the positional Burrows-Wheeler transform (PBWT; Durbin 2014) with substantial modifications to handle genotyping errors and missing data.
  • IBD segments that are likely fragmented by phase switch errors introduced during statistical phasing are addressed using a heuristic that recognizes likely occurrences of phase switch errors and/or a probabilistic model that accounts for error rates of the phasing techniques.
  • the model is applied to haplotype/IBD segment data generated using the templating approach described in 1.
  • Some implementations apply a hidden Markov model (HMM) that accounts for both recombination and the phase switching process to reduce these errors.
  • HMM passes along the chromosomes of the two individuals “stitching” fractured IBD segments together.
  • a heuristic is applied that identifies fractured IBD segments based on potential IBD segments that start or end within a distance of the start or end of another IBD segment.
  • the heuristic may stitch the IBD segments together by, for example, swapping haplotype segments in a target individual.
  • the heuristic may be applied as potential IBD segments are generated using the templating approach described in 1 without a separate iteration over the IBD segment data.
  • the heuristic assumes that IBD segments within a threshold distance of each other are likely to be a single IBD segment fractured by one or more phase switch errors.
  • phased sequence data for two or more individuals who are to be compared to detect IBD segments. See block 203 .
  • phased data for many more individuals is obtained.
  • hundreds, thousands, or even millions of individuals have their phased haplotype data compared using methods disclosed herein.
  • data for greater than about 10 million individuals' phased haplotype data are compared.
  • the operation 203 involves identifying individuals or haplotypes on which an IBD comparison is to be run.
  • the phased data includes two strings of haplotype data for each individual (one per chromosome). In other words, there are four haplotypes to be considered for two individuals.
  • Phased haplotype data may be obtained from various sources including statistical techniques such as BEAGLE, FINCH, EAGLE, and other known techniques.
  • An example discussion of phasing techniques is presented in U.S. Pat. No. 9,836,576, filed Mar. 13, 2013, and incorporated herein by reference in its entirety.
  • the haplotype data may be represented as strings of allele values (e.g., 1s and 0s) for sites in the haplotype, each of which is the site of a polymorphism. Each such site may be referred to as an index in the haplotype string.
  • the process assumes that each haplotype site is a biallelic site on a chromosome. It may be given a value of, e.g., 0 for one allele and a value of 1 for the other allele.
  • a typical chromosome may provide hundreds of thousands of sites.
  • Each haplotype may be given its own unique identifier, which may be arbitrarily set.
  • the phased haplotype data is provided to a first processing block as illustrated by a block 205 .
  • This processing may reduce the fragmenting impact of genotyping error in IBD segment finding.
  • multiple operations are performed in parallel, sequentially, or in some combination thereof.
  • significant computational efficiency is realized by performing these operations together for a given haplotype (e.g., in an inner loop of a software routine as illustrated in the sample code below).
  • the operations performed in the first phase include the following: (a) applying digital templates with masked and unmasked positions to exclude certain haplotype sites along the length of the haplotype, (b) identifying matching allele values at unmasked positions along the haplotypes to identify putative IBD segments for the various digital templates, and (c) merging the resulting IBD segments (e.g., as they are being generated) from the various digital templates.
  • the digital templates are constructed as small windows that can be “slid” or “ratcheted” along the length of the haplotype strings, considering consecutive sub-segments of the haplotype sites as they go. Criteria to consider in choosing template structures are the length of the template, the number of masked or null sites in the template, and the arrangement of masked and unmasked sites in the template. Typically, full sets of templates are used in the process that contain all possible arrangements of masked and unmasked sites in a template length. An example set of four site digital templates, each employing two masked and two unmasked positions is described below. Of course, the process may alternatively employ larger (or smaller) templates and/or use templates having a higher or lower proportion of null positions per template.
  • the output of the first processing block implemented in operation 205 is a set of IBD segments or other haplotype matching data for combinations of the various individuals whose phased haplotype data is processed. This data is then passed to a second processing block for processing as illustrated in an operation 207 .
  • a goal of this second operation may involve reducing the fragmenting impact of phase switch error in IBD segment finding.
  • the haplotype/IBD data is subjected to a probabilistic model that accounts for recombination rates based on meioses, which vary based on degree or relatedness of any two individuals, and rates of phase switch errors introduced by the phasing technique(s) employed to generate the phased haplotype data.
  • the model may also account for other inputs such as the genetic distance between adjacent sites on the haplotype and/or the probability of having full IBD state.
  • a hidden Markov model may be used to implement the probabilistic model.
  • An optional last operation of the process 201 involves presenting the processed IBD information in a way that can show the degree of relatedness of the two or more haplotypes that are compared. See the operation represented in block 209 .
  • a process 251 begins by initially obtaining phased sequence data for two or more individuals who are to be compared for identifying IBD segments. See block 253 .
  • phased data for many more individuals may be obtained, e.g., hundreds to millions of individuals.
  • the operation of block 253 involves identifying individuals and/or haplotypes on which an IBD comparison is to be run.
  • Process 251 Computational aspects of process 251 include sequentially considering haplotype positions for genotype errors and phase switch errors within individual haplotypes while keeping track of match segments between haplotypes. Processing each new haplotype position is initiated at a process operation 254 , which selects the next position in the haplotypes under consideration.
  • a first processing operation 255 considers possible errors in the individual haplotypes using multiple templates such as those described elsewhere herein. Haplotype position matches are determined using these templates, and, from these results, an overall decision on matching segments is made. In some implementations, the resulting match segments have reduced genotyping error.
  • the haplotype position under consideration is then analyzed in a processing operation as illustrated in block 257 .
  • a result of this operation may involve reducing the fragmenting impact of genotyping error in IBD segment finding.
  • the haplotype/IBD data may be analyzed by one or more phase switch heuristics and/or models that identify situations where phase switch errors are likely to have occurred.
  • a heuristic may identify situations where one or more IBD segments between individuals end at a first position and then new IBD segments begin at a second position within a threshold distance from the first position.
  • An identified likely phase switch error may be corrected by joining the IBD segments in an individual identified to possess the likely phase switch error. In some cases, error is corrected by swapping haplotype segments within the identified individual.
  • a result of the operation may involve reducing the fragmenting impact of phase switch error in IBD segment finding.
  • An optional last operation of the process 251 involves presenting the processed IBD information in a way that can show the degree of relatedness of the two or more haplotypes that are compared. See the operation represented in block 259 .
  • a reference sequence may be the sequence of a whole genome, the sequence of a chromosome, the sequence of a sub-chromosomal region, etc. From a computational perspective, repeats create ambiguities in alignment, which, in turn, can produce biases and errors even at the whole chromosome counting level. Paired end reads coupled with adjustable insert length in various embodiments can help to eliminate ambiguity in alignment of repeating sequences and detection of repeat expansion.
  • a goal of the process is to use alignment of multiple haplotypes to determine genetic relationship(s) between two or more individuals, or in some cases that potentially involve inbreeding, within a single individual.
  • the process determines relationships between two haplotypes. IBD may be used for this purpose.
  • IBD can be understood in the context of meiosis and recombinable DNA. Because of recombination and independent assortment of chromosomes, the autosomal DNA and X chromosome DNA (collectively referred to as recombinable DNA) from the parents is shuffled at the next generation, with small amounts of mutation. Thus, only relatives will share long stretches of genomic regions where their recombinable DNA is completely or nearly identical. Such regions are referred to as “identical-by-descent” (IBD) regions because they arose from the same DNA sequences in an earlier generation/common ancestor.
  • IBD identical-by-descent
  • locating IBD regions includes sequencing the entire genomes of the individuals and comparing the genome sequences. In some embodiments, locating IBD regions includes assaying a large number of markers that tend to vary in different individuals and comparing the markers. Examples of such markers include Single Nucleotide Polymorphisms (SNPs), which are points along the genome with two or more variations; e.g., Short Tandem Repeats (STRs), which are repeated patterns of two or more repeated nucleotide sequences adjacent to each other; and Copy-Number Variants (CNVs), which include longer sequences of DNA that could be present in varying numbers in different individuals. Long stretches of DNA sequences from different individuals' genomes in which markers in the same locations are the same indicate that the rest of the sequences, although not assayed directly, are also likely identical.
  • SNPs Single Nucleotide Polymorphisms
  • STRs Short Tandem Repeats
  • CNVs Copy-Number Variants
  • PBWT positional Burrows-Wheeler transform
  • a PBWT process is implemented according to the following description. Initially, each haplotype under consideration is given its own unique identifier, which may be arbitrarily set. Then, during execution, the method steps through the sites of all haplotypes under consideration, position-by-position, starting at a first position, which may be identified as position 0. As the method steps through the haplotype sites, it keeps track of two arrays, which are updated for every position (index) in the haplotypes. Also, during a pass through the haplotype sites, a templated PBWT process may apply one, some, or all of the digital templates at each position.
  • the first array is a “positional prefix array” that contains a list of all haplotypes under consideration. It is populated with IDs of all the haplotypes. A separate instance of the positional prefix array is produced each time a new site is encountered while traversing the haplotype string. Over the course of the process, and while certain haplotypes have identical allele values from one position to the next, these haplotypes are grouped together in the positional prefix array. In other words, haplotypes having matching allele values, remain together (in the same block) within the positional prefix array for as long as their alleles match. By keeping the haplotypes together while alleles match, the positional prefix array contains information about putative IBD segments.
  • the second array is a “divergence array” that indicates where matches between any two haplotypes under consideration began. It reflects how many positions/sites back in the haplotype string until there was a difference. In other words, this matrix keeps track of the last time that two haplotypes did not match by, e.g., providing the index value of the last mismatch for any two haplotypes.
  • FIG. 3 An example of a general IBD segment finding process 301 is depicted in FIG. 3 . As illustrated, the process begins by receiving haplotype strings representing allele values across all haplotypes to be considered in the process. See block 303 . This may correspond to block 203 in FIG. 2A and/or block 253 of FIG. 2B .
  • the process lists all haplotypes in the positional prefix array. It may do this randomly or in some order, but typically it does not yet account for the allele values at any haplotype site.
  • the individual haplotypes may be listed by unique identifiers.
  • the values in divergence array are all set 0 because there are no previous sites that have been considered.
  • the array initializations are illustrated by operation 305 in process 301 of FIG. 3 .
  • the process goes through all haplotypes in the order of the positional prefix array (which may initially be random or otherwise arbitrary) and orders the haplotypes such that those that have a first allele value (e.g., 0) in the current position are grouped together at the top, and all that have a second allele value (e.g., 1) in the position are grouped together at the bottom. See operation 309 .
  • first allele value e.g., 0
  • second allele value e.g., 1
  • this operation produces a new positional prefix array in which all haplotype indexes that have a 0 at the current position are grouped together in the array, and all haplotype indexes that have a 1 at the position are grouped together.
  • grouped together it is meant that haplotype identifiers are provided in adjacent positions in the positional prefix array. This is illustrated in FIG. 4 which shows a group of six haplotype strings and the associated positional prefix array at a few haplotype sites. Obviously, a typical haplotype has many more sites than illustrated for the haplotypes in FIG. 4 . Further, a typical IBD segment finding process may simultaneously evaluate many more haplotype strings (e.g., hundreds, thousands, or millions).
  • the process notes that all potential IBD segments begin at site 0 and therefore they effectively have a mismatch at position 0. Therefore, the first entry in the divergence array is all zeros. See operation 311 in process 301 and the first column in the divergence array of FIG. 4 .
  • the order of haplotypes in the divergence array is the same as in the positional prefix array.
  • the values in the divergence array are, for currently matching haplotypes, the sites (index values) of the first matching position between the two adjacent haplotypes within the array.
  • the value in the divergence array is the first matching position of the current segment.
  • the method assigns the next site to the new div array even though it has not peeked ahead and checked if the two haplotypes actually match at the next site. If, in the next iteration, the method learns that the segments still do not match, the relevant value in the divergence array simply gets updated again.
  • the process again goes through the haplotypes and again rearranges the haplotype identifiers in the positional prefix array so that those having the same allele value at the current position are grouped together, e.g., all haplotypes having a 0 allele value in the current position are grouped at the top of the array and all those having a 1 allele value are grouped at the bottom.
  • Haplotype strings that have the same alleles over multiple consecutive positions stay near one another in the positional prefix array.
  • the divergence array uses the new arrangement of haplotypes (from the positional prefix array), flags any mismatches between adjacent haplotypes and the current position and inserts the next haplotype site number for mismatching pairs.
  • the next site number is the location of the next possible start position for a new match segment.
  • element i in the divergence array indicates when a current segment match began between the haplotype at ppa[i] and the haplotype at ppa[i ⁇ 1].
  • the positional prefix array has the following values:
  • haplotypes 2 and 3 have a match that extends from the beginning of the alignment (position 0) to the current position (position 5).
  • Haplotype 1 matches with haplotype 3 from position 3 to position 5 (which also implies haplotype 1 and haplotype 2 have the same match).
  • haplotype 4 matches with haplotype 1 from position 2 to position 5 (which also implies haplotype 4 matches haplotypes 1, 2, and 3 between positions 3 and 5).
  • the routines use the alleles at the current position in the alignment to construct the divergence array for the next position.
  • the routine inserts 6 into the position of the haplotypes in the divergence array. Note that the method does not check whether the two haplotypes actually match at position 6, which is why the divergence array does not always contain the beginning position of matches. Once the method actually visits position 6 this value will be overwritten if the haplotypes do not match at site 7. The method continues in this fashion (overwriting values in the divergence array) until actual matches are found.
  • the process flags the two haplotypes as having a potential IBD segment. In the example of process 301 , it does this by creating new match segment records when two haplotypes have a number of consecutive shared matches that is greater than threshold number of consecutive sites. See operation 313 .
  • the threshold value may be chosen to balance speed and sensitivity. In certain examples, the threshold number is between about 50 and 1000 sites (e.g., about 200 matching sites).
  • the process may complete a match segment record when two matching haplotypes finally diverge in allele values, thereby ending the match segment. See operation 315 .
  • the process may maintain a separate report populated with matches of greater than the threshold length.
  • the matches may be identified by start position and end position (indexes) and the haplotypes involved in the match (e.g., haplotype ID #11 and haplotype ID #5).
  • the match segment includes both the starting and ending sites of the match segment.
  • the process may still flag the match segment for further consideration.
  • the match is identified by the two matching haplotypes and their starting index for the match segment.
  • the ending index for the match region is the site at the end of the haplotype.
  • FIG. 3 does not illustrate treatment of the haplotype strings by different digital templates. This will be discussed below.
  • the process 301 proceeds through operations 307 - 315 for each successive haplotype site and reorders the haplotype IDs in the positional prefix array based on matches (the haplotypes having a 0 at the current position are grouped together and those having a 1 are grouped together).
  • the haplotypes that have long stretches of matched sites stay together in the positional prefix array for long durations. This is because all matching haplotypes stay together in the positional prefix array until one of them has a different allele value at a particular haplotype position. At that point, the one or more haplotypes that diverge from the larger group are moved to a different position in the positional prefix array.
  • the method keeps sufficient information to reconstruct all IBD segments for any two haplotypes. This includes all haplotypes under consideration, including the first and last haplotypes in the positional prefix array.
  • FIG. 4 presents an example of positional prefix array values and divergence array values for a few sites on an alignment of haplotype strings.
  • potential IBD segments When there are no further haplotype sites to consider, as indicated by process block 317 , potential IBD segments have been identified and these may be processed in various ways such as, optionally, being used in a relatedness analysis of individuals whose haplotypes were considered in the analysis. See operation 319 .
  • the potential IBD segments may be further processed by a model such as an HMM model to correct for phase shift errors.
  • this additional processing is optional, particularly in situations where phased haplotype data can be expected to have relatively few phase shift errors.
  • potential IBD segments that are flagged by the PBWT process, but are shorter than a defined genetic length are excluded from further consideration.
  • An example of a threshold genetic length is between about 1.5 and 3 centimorgans. Thus, for example, if a segment extends over 200 sites but does not extend the full required genetic distance, the method discards the segment.
  • the PBWT process assumes that there are no errors. If there is in fact an error, it may prematurely truncate a sequence of matches and/or artificially prolong a sequence. Typically, long matches that in fact exist (e.g., between close relatives) are prematurely broken due to genotyping and/or phase switch errors.
  • integrating digital templates into an IBD segment finding process can mitigate the impact of some errors, particularly genotyping errors.
  • One approach employs a digital template that shifts over the haplotype strings and masks certain haplotype sites from consideration as it goes. This approach takes a normal haplotype alignment but applies the template to skip over some sites that would otherwise be considered. With the excluded sites removed from consideration, the process identifies putative IBD segment matches using a general approach such as PBWT. By masking some sites from comparison, sites of erroneous calls may be ignored. Some templates may consider the erroneously called alleles while others exclude them. By considering putative IBD segments created using all the templates, the process can remove breaks and more accurately identify complete IBD segments.
  • the template provides a sliding window of consecutive sites having, in some embodiments, a fixed mask pattern.
  • the template is moved successively along the haplotype string, typically with no overlap of sites between one application of the template and the next.
  • the process is similar to a generic IBD segment finding method such as the PBWT process. That is, in some embodiments, the process generates a positional prefix array and a divergence array for haplotype strings modified by the template.
  • the computational system flags and records matching segments as before. But the matching segments produced by single templates have some sites excluded.
  • the templating function employs a probabilistic function to pick mask sites.
  • the mask pattern is deterministic based on the template.
  • the masking of sites may follow a specific pattern, based on each template, rather than a random selection or masking of sites.
  • the mask pattern may remain fixed as the process moves from one haplotype site to the next. In some cases, however, the mask pattern may vary as the process moves over haplotype sites, but such variation may be deterministic rather than random.
  • the process employs multiple fixed templates for a given matching problem.
  • templates include ⁇ h ⁇ h (all odd sites), h ⁇ h ⁇ (all even sites), ⁇ hh, hh ⁇ , ⁇ hh ⁇ , and h ⁇ h, where sites at ⁇ will be masked out and only sites at h will be used to construct the method.
  • the choice of templates to use together in a process may be made such that for the fixed length of the templates (e.g., the four site templates exemplified here), may guarantee that if there were any errors (e.g., two errors) within this window, at least one of these templates correctly report a match. For example, if there were errors at sites 2 and 4 at one application of a four site template, only the h ⁇ h ⁇ would give an error-free read.
  • the total number of digital templates may be between 1 and k, where k is the number of haplotype sites in the window. In some implementations, the total number of digital templates is k!/(m!*(k ⁇ m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • the templates are characterized by a ratio of masked to unmasked sites which ranges between about 1/w and (w ⁇ 1)/w, where w is the length of the template window
  • the templates are characterized by a length equivalent to the total size of the haplotype alignment. As examples, a range of template lengths is between about three and ten consecutive sites.
  • a templating function can be tuned to alter sensitivity to error.
  • one templating function may be implanted as a decision tree that uses a window size of 4 haplotype sites and 6 templates, and so guarantees any matches within that 4 site window as long as there are no more than 2 errors. If i is the current template (range 0 to 5) and k is the current position within a template window (range 0 to 3), then this templating function TQ, k) may be represented as:
  • FIG. 5A illustrates application of templates to an IBD segment finding method.
  • the depicted process 501 begins by receiving data and setting up parameters for the routine. This may involve receiving phased haplotypes to be used in the templated matching routine (block 503 ), defining templates to apply (block 505 ), and setting up any needed matrices and arrays such as a positional prefix array and a divergence array.
  • the depicted process loops over the various sites of the haplotypes, and at each site it loops over the available templates. This is depicted as follows.
  • the process increments to the next haplotype site at an operation 507 , and while at the current site, it iterates over the various templates, starting by incrementing to the next template at an operation 509 .
  • the routine is fixed at a particular template, the process identifies matches and mismatches among the haplotype strings (block 511 ) and merges match segments for the various templates (block 513 ).
  • Operation 511 identifies matches only if the current haplotype site is unmasked in the current template. Assuming that the current site is unmasked, operation 511 may be implemented in various ways such as by updating positional prefix and divergence arrays. Note that each template may have its own match segment information. Using this information, operation 513 may merge currently pending segments (at the current haplotype site) from among the various templates.
  • Operation 515 serves to iterate over all the templates while the process is fixed at a given haplotype site and operation 517 serves to iterate over all haplotype sites. Ultimately all haplotype sites are considered and the error-corrected IBD segments are completed. See operation 519 .
  • FIG. 5B illustrates another application of templating to an IBD segment finding method.
  • the depicted process 551 begins by receiving data and set up parameters for the routine. This may involve receiving phased haplotypes (block 553 ) to be used in the templated matching routine, defining templates to apply (block 555 ), and setting up any needed matrices and arrays such as a positional prefix array and a divergence array.
  • the depicted process loops over the various sites of the haplotypes, and at each site it loops over the available templates. This is depicted in the figure as follows.
  • the process increments to the next haplotype site (the current haplotype site) at an operation 557 , and while at the current site, it iterates over the various templates, starting by incrementing to the next template at an operation 559 .
  • the routine is fixed at a particular template, the process identifies matches and mismatches among the haplotype strings (block 561 ) and merges match segments for the various templates (block 563 ).
  • An operation 561 identifies matches only if the current haplotype site is unmasked in the current template.
  • operation 561 may be implemented in various ways such as by updating positional prefix and divergence arrays.
  • each template may have its own match segment information.
  • operation 563 may merge currently pending segments (at the current haplotype site) from among the various templates.
  • Operation 565 serves to iterate over all the templates while the process is fixed at a given haplotype site.
  • phase switch errors may be addressed using a heuristic that recognizes typical phase switch errors.
  • An operation 567 serves to iterate over all haplotype sites. If any of the templates indicates a continuous sequence of matching sites including the current site or sites adjacent to the current site, the match sequence is deemed to continue, even if one or more of the templates indicates a gap in the match sequence. Ultimately all haplotype sites are considered and the error-corrected IBD segments are completed. See operation 569 .
  • Merging may involve aligning the putative IBD segments from each templated result, and then scanning through the template-specific segments for pairs of haplotypes. During this process, as long as one of the six templates (or however many are used) still shows a continuing segment, the method keeps a merged IBD segment intact.
  • the methods assume that any IBD start or end points within an otherwise continuous IBD segment are caused by errors. This is a reasonable assumption because the comparison is made between two individuals. There is a very low probability that two haplotypes will match, for greater than a threshold length, by chance.
  • an additional filtering operation to remove some putative IBD segments is performed after one of the above-described processes such as process 301 or process 501 .
  • the filter may operate by discarding putative IBD segments of size below three centimorgans.
  • templated PBWT Given M haplotypes with N bi-allelic sites, the PBWT algorithm can identify identical subsequences of the haplotypes in O(NM) time.
  • a limitation of PBWT is that it requires exact subsequence matches with no errors or missing data.
  • a templated PBWT may be used.
  • a templated PBWT may be designed or configured to identify matching subsequences of the haplotypes despite missing data and genotyping errors with only a small linear increase in computational time compared to the PBWT.
  • One approach for extending PBWT to report matching haplotypes that include some errors involves constructing multiple replicates of the PBWT data structure. Each of these PBWTs is built by masking the haplotype alignment using a different repeating template. Each PBWT may then be individually swept through identifying exact subsequence matches. The matching subsequences from all PBWTs (each from a different template) may then be merged to produce all matching subsequences within the full (unmasked) haplotype alignment.
  • One example uses different repeating templates: for example ⁇ h ⁇ h, h ⁇ h ⁇ , ⁇ hh, hh ⁇ , ⁇ hh ⁇ , and h ⁇ h, where sites at ⁇ will be masked out and only sites at h are used to construct the IBD segments using, e.g., PBWT.
  • These example templates address haplotype data with no more than two errors per four site window.
  • the design of these six specific templates guarantees that all matches across any given four site window will be found as long as there are no more than two errors within the window. This is because given any possible arrangement of two errors across four sites in the original haplotype alignment at least one of the PBWT replicates will have those errors masked out and therefore still deliver the match correctly.
  • This method's sensitivity to errors may be modified by changing the arrangement and number of templates. For example, more templates could be utilized to ensure matches across longer windows; indeed (n/k) templates are required to ensure all matches across windows of size n with no more than k errors per window. In practice genotyping errors are often low enough that six templates would be adequate (templates of length 4 with two sites masked); even with a genotyping error rate as high as 0.001 the probability of three errors within a four site window is 3.996 ⁇ 10 ⁇ 9 . Running each templated PBWT replicate can be easily parallelized.
  • Templating the PBWT as described above to handle errors and return subsequence matches can be executed in linear time by passing through the data only once and avoiding the need for a post-hoc merging algorithm.
  • two arrays are constructed: ppak the positional prefix array and divk the divergence array (Durbin 2014).
  • ppak is a list of the haplotypes sorted so that their reversed prefixes (from k ⁇ 1 to 0) are ordered. This ordering ensures that haplotypes that match through position k ⁇ 1 will end up adjacent to one another in ppak.
  • the divergence array divk keeps track of where those matches began, the ith element in divk represents the beginning of the match between the ith element in ppak and the i ⁇ 1th element in ppak.
  • the method constructs a separate ppaj,k and divj,k for each template j used at site k.
  • a set of templates (as described above) may be formalized as an indicator function T (j, k) with the value 0 when the template j skips over site k and 1 if template j processes site k.
  • T (j, k) is called for each template j; if T (j, k) is 1 then ppaj,k and divj,k are assembled accordingly.
  • auxiliary data structures Ps and Pe are both M by M two dimensional arrays in which the position x, y holds the start/end positions of the match between haplotype x and haplotype y. If another subsegment has already been stored the routine checks to see if the new matching subsegment overlaps and possibly extends the existing subsegment. If they do not overlap, the routine checks if the old matching segment has a genetic length (in cM) of at least Lf and then reports it. The new matching subsegment is then stored in its place.
  • the “templating” of the haplotype alignment is performed within this modified form of the PBWT itself, and matching subsegments from each template are merged and extended directly as the haplotype alignment is passed through.
  • the templated PBWT has a worst-case time complexity of O(NMt) where t represents the number of templates defined within T (j, k); thus the method represents a linear tradeoff between the speed of PBWT and sensitivity to error.
  • An example templated PBWT is further detailed as pseudocode in Algorithm 1.
  • the algorithm employs 2 parameters: (1) Lm is the minimum number of sites that a sub-segment must span within the haplotype alignment to be merged and extend other sub-segments, and (2) Lf is the final minimum length (in cM) that a segment must have to be reported by the algorithm.
  • Lm is the minimum number of sites that a sub-segment must span within the haplotype alignment to be merged and extend other sub-segments
  • Lf is the final minimum length (in cM) that a segment must have to be reported by the algorithm.
  • the algorithm handles missing data by extending the current longest match.
  • the longest matching haplotype to haplotype ppaj,k will be either ppaj ⁇ 1, k or ppaj+1, k, so if missing data in ppaj,k is encountered it is simply assumed the haplotype continues to extend the longest match.
  • FIG. 5D illustrates the Templated PBWT data structures.
  • the TPBWT passes once through an M by N by t three-dimensional structure where M is the number of haplotypes, N is the number of bi-allelic sites, and t is the number of templates.
  • M is the number of haplotypes
  • N is the number of bi-allelic sites
  • t is the number of templates.
  • Each template is a pattern at which sites are masked out (shaded out in the figure).
  • two arrays are updated.
  • the positional prefix array ppa and the divergence array div are both two dimensional arrays of size M by t.
  • each of the t columns of ppa and div are updated for the templates that are not masked out.
  • Each of the t columns in ppa contains the haplotypes sorted in order of their reversed prefixes.
  • each of the t columns in div contains the position at which matches began between haplotypes adjacent to one another in the sorted order of ppa.
  • short fragments of IBD shared between haplotypes i and j, broken up by errors are identified by each of the t templates (green arrows). As these fragments are identified they are merged and extended with one another in the current match arrays P s and P e . While merging and extending IBD fragments a heuristic may be used to scan for and fix putative phase switch errors, as will be discussed further herein.
  • FIG. 5C shows pseudocodes of the algorithm.
  • long IBD segments may be fractured by phase switch errors introduced by phasing techniques used to phase the haplotype data of the individuals. The locations and frequencies of such fractures may occur in predictable ways.
  • a heuristic is employed to correct phase switch errors as IBD segments are identified. As noted herein in FIG. 5B and operation 566 , a heuristic may be applied to identify and correct potential phase switch errors by analysis of sequential haplotype sites. In certain embodiments, a heuristic involves merging adjacent potential IBD segments (matching segments) that end within a threshold distance and then joining or swapping haplotype segments of a given individual, as necessary, to correct phase switch errors.
  • this heuristic may improve IBD determination because it is biologically unlikely that two IBD segments on the same or opposite haplotypes of an individual both end within a particular distance (e.g., about 500 or fewer SNPs on the same or opposite haplotypes).
  • the phase switch heuristic is turned off between closely related pairs of individuals, e.g., between parent and child. For example, if an individual is trio-phased (phasing a child's genotype compared to the parent's genotype), the phasing is considered highly accurate and there are few to no phase switch errors. While the phase switch heuristic is discussed in the context of a Templated PBWT process, the heuristic may be used alone or in conjunction with any of various other algorithms that identify IBD segments for phased haplotype data. Such other algorithms may or may not include analyses that identify and/or correct genotyping and similar errors.
  • the start position of the new IBD segment is compared to the end position of an adjacent IBD segment.
  • the start position of the new IBD segment and the end position of the adjacent IBD segment are on the same haplotype with a gap between them.
  • the start position of the new IBD segment and the end position of the adjacent IBD segment are on opposing haplotypes with either a gap between them or an overlap.
  • the two IBD segments may be merged to form a single IBD segment.
  • the threshold value is between about 0-500 SNPs, about 0-300 SNPs, about 200-300 SNPs, or about 0-100 SNPs.
  • the threshold value for merging adjacent IBD segments is the same threshold value for determining that two haplotypes have a minimum number of sites that a sub-segment spans to be considered a potential IBD segment (Lm). If the two IBD segments are on opposite haplotypes, portions of the haplotypes (i.e., haplotype segments) may be swapped starting at the location of a break in the IBD segments.
  • the haplotypes remain swapped unless/until the heuristic determines another phase switch error has occurred and swaps the haplotypes.
  • the haplotypes used to identify potential IBD segments remain swapped.
  • the heuristic is used to correct the actual haplotypes for phase switch errors in addition to correcting IBD segments.
  • the merged potential IBD segment must have a minimum length Lf to be deemed an IBD segment.
  • Lf minimum length
  • FIG. 6A illustrates via a series of diagrams how the heuristic may be applied for four individuals that share IBD segments with a focal person.
  • Each pair of haplotypes (0 and 1; dotted lines) represents a copy of the haplotypes of the focal person, while the grey bars represent the IBD segments the focal person shares with four other individuals.
  • the top pair of haplotypes shows IBD segments of the focal person and another individual
  • the second pair of haplotypes shows IBD segments of the same focal person but with a second other individual, and so on.
  • a focal person is provided for purposes of illustrating the heuristic, in some embodiments the heuristic is applied to correct phase switch errors in multiple or all individuals simultaneously.
  • a focal person may be a new user or customer that is added to the database of, e.g., a person genetics platform.
  • the process runs using the new user or customer as the focal individual against the entire database.
  • Panels B through F represent the TPBWT's sweep along the chromosome from left to right, with the black arrow labeled TPBWT representing the current position.
  • TPBWT sweeps along the haplotypes identifying IBD matches it uses a heuristic to identify and fix putative phase switch errors.
  • diagram A two haplotypes (0 and 1; dotted lines) of the focal person and the IBD segments they share with the four other individuals in the haplotype alignment are plotted.
  • the focal person has two phase switch errors (red dashed lines) that break up long IBD segments.
  • the Templated PBWT scans left to right along the chromosome, keeping track of IBD segments shared among all pairs of individuals.
  • phase switch error is inferred within one of the other individuals, then that other individual's haplotypes are swapped, and the focal person's haplotypes remain unswitched.
  • diagram E when the arrangement of IBD segments on the complementary haplotypes again suggests another phase switch error has been encountered the algorithm swaps the focal person's haplotypes again, but this time at the location of the other phase switch error.
  • diagram F the Templated PBWT continues to the end of haplotypes after successfully identifying phase switch errors and “stitching” IBD fragments back into correct long IBD segments.
  • the heuristic is applied to correct phase switch errors when a new potential IBD segment is identified.
  • a potential IBD segment is identified when the Templated PBWT reaches the rightmost end of the potential IBD segment.
  • a potential IBD segment is identified when the Templated PBWT reaches the rightmost end of the potential IBD segment.
  • FIGs C and E only a single new potential IBD segment is identified because the TPBWT has not reached the end of the other IBD segments, triggering their identification as potential IBD segments and application of the heuristic.
  • panel E the rightmost fragment in the second from top haplotype pair has not yet been identified since the TPBWT operation has not reached the fragment's rightmost end.
  • panel F the TPBWT has scanned further right along the chromosome and identified that fragment and applied the heuristic to it (which merged it into the long IBD segment).
  • the heuristic is applied as the Templated PWBT iterates through successive sites along all chromosomes. As potential IBD segments are identified, the proximity of the identified IBD segment to a prior IBD segment on either haplotype is determined to infer whether there is a phase switch error and the IBD segments should be ‘stitched’ together to form a single IBD segment.
  • FIG. 6B illustrates the possible scenarios considered by the Templated PBWT for adjacent IBD segments.
  • Diagram A shows a first IBD segment shared by P and Q.
  • Diagrams B-E show the various combinations of second IBD segments between P and Q that may be considered by the heuristic (the grey box indicates the two potential IBD segments are within a threshold length of each other). The second IBD segments are within a threshold number of SNPs of the first IBD segments.
  • all IBD segments are on the same haplotype, but are separated by a gap. This may result from an even number of phase switch errors, causing the haplotypes to be swapped at both ends of the gap such that the potential IBD segments are on the same haplotype.
  • phase switch errors within the gap may be unknown and is not necessary to infer which individual harbors the phase switch error(s).
  • phase switch errors may have occurred in P, Q, or P and Q, and the heuristic may be applied as a result of two potential IBD segments being identified within a threshold range of each other, as discussed herein.
  • the Templated PBWT may be used to correct for short gaps, e.g., 1-3 SNPs
  • the gap illustrated here may be larger, for example up to about 100 SNPs, or about 300 SNPs, or about 500 SNPs. This may be caused by various errors, including multiple phase switch errors within the gap, such that the matching sites are insufficiently long to be considered potential IBD segments.
  • the heuristic as described herein infers that two segments within the threshold distance are likely to be a single segment broken up by errors, and thus merges them despite the gap.
  • Diagram C illustrates the second IBD segments being on opposite haplotypes for both P and Q, which may be the result of a phase switch error in both individuals. In such cases, the haplotypes may be swapped in both individuals.
  • Diagrams D and E illustrate either Q or P, respectively, having second IBD segments on the opposite haplotype. In these scenarios, if the second IBD segments are within a threshold distance of the first IBD segments but on the opposite haplotype, a phase switch error is inferred and the haplotypes from the second IBD segments forward may be swapped and the first and second IBD merged.
  • the Templated PBWT handles haplotype error (miscalls) and missing data. It is also robust to “blip” phase switch errors in which the phase at a single site is swapped. However, phase switch errors spaced out along the chromosome will cause long regions of the haplotypes to be swapped and fragment IBD segments as illustrated in FIG. 1 . To handle these errors the Templated PBWT may apply a phase correction heuristic that scans for certain patterns of haplotype sharing to identify and correct phase switch errors. Note that for haploid data sets such as human male sex chromosomes this heuristic can be turned off. Large cohorts of samples have patterns of haplotype sharing that are highly informative regarding the location of phase switch errors.
  • phase switch errors in an individual will fragment all IBD segments shared with that individual at the position of the switch error.
  • Each IBD segment that spans the switch error will be broken into two fragments at the position of the error: these fragments will be on complementary haplotypes within the individual with the error and yet may remain on the same haplotype within the other individual.
  • this pattern of haplotype sharing may be the result of actual recombination patterns, however for the majority of more distantly related individuals the pattern can be used to identify phase switch errors.
  • the new segment begins near the end of the existing segment and the new segment is on the same haplotype as the existing segment in individual P but on the complementary haplotypes in individual Q, then possibly there was a phase switch error in individual Q. And of course, the opposite pattern could exist suggesting a phase switch error in individual P.
  • the TPBWT will swap the haplotypes for the individuals containing the error (See FIG. 6B ). Now the new IBD segments merge and extend the fragments on the complementary haplotype that were broken up by the phase switch error.
  • the algorithm stops swapping the individual's haplotypes. This simple heuristic continues to the end of haplotypes “stitching” short stretches of IBD fragmented by errors back into the correct long IBD segments.
  • FIG. 6C illustrates a process for using a phase switching error correction heuristic as described herein.
  • the process 600 begins by receiving haplotype data for a plurality of individuals.
  • the process 600 may be performed while scanning the haplotype data using a method such as a Templated PBWT (e.g., during operation 566 in FIG. 5B ) or may be performed as a standalone process.
  • the process begins by receiving phased haplotype data to be considered.
  • the process loops over multiple haplotype sites, considering each one separately, but also considering adjacent and/or near sites that contain IBD segments, particularly nearly terminated IBD segments along with newly started IBD segments.
  • An operation 602 sets the next haplotype site for consideration.
  • this site incrementing operation may already have been performed.
  • An operation 603 determines phased haplotypes of at least two individuals have first potential IBD segments that terminate at a first location.
  • first potential IBD segments may be identified after a matching subsequence having at least Lm sites terminates. The subsequences must possess more than a threshold length Lm to be considered possible IBD segments.
  • the first potential IBD segments terminate at a first location which may be stored for later reference.
  • An operation 605 determines that the at least two individuals have second IBD segments that start at a second location within a threshold distance of where the first IBD segment ended.
  • the threshold distance may be as described above. In some implementations, the distance may be either a gap or an overlap between the first and second potential IBD segments.
  • An operation 607 identifies or infers which individual, from among those having second IBD segments that starts at a location within the threshold distance of where the first IBD segment ended, likely has a phase switch error.
  • the second potential IBD segments may be between any combination of haplotypes of the at least two individuals. See FIG. 6B . When the second potential IBD segment begins on the opposite haplotype as the first potential IBD segment in at least one of the individuals, a phase switch error is implied in the individual that has the first and second IBD segments on opposite haplotypes.
  • the operation infers that that individual has the phase switch error, and only that individual's haplotypes need correction for the phase switch error.
  • the operation may infer that both individuals have phase switch errors at the same or proximate positions.
  • operation 607 may be skipped. For example, as shown in diagram B of FIG. 6B , above, the potential IBD segments to be merged are on the same haplotype. In such embodiments it may be difficult to determine which individual had a phase switch error and also unnecessary to properly merge the potential IBD segments (as swapping the haplotypes is not necessary to have the potential IBD segments on the same haplotype).
  • the first potential IBD segments and the second potential IBD segments are merged. If the first potential IBD segments and the second potential IBD segments are on opposite haplotypes for any of the at least two individuals (i.e., a phase switch error occurred for those individuals), the haplotypes may be swapped for those individuals. The swap may occur at the location of the phase switch error.
  • Operation 611 is an optional operation to determine whether each potential IBD segment is sufficiently long and/or meets other criteria to be considered a true IBD (e.g., a minimum length Lf). If the criteria are met, the potential IBD segments are determined to be actual IBD segments.
  • Operation 613 is an optional operation to correct for potential genotyping errors. See e.g., the discussion of the Templated PBWT.
  • the current haplotype site is checked for whether it is the last haplotype site. If it is the last haplotype site, the process finishes. If it is not the last haplotype site, the process returns to operation 602 to select the next haplotype site and continue scanning for IBD segments.
  • process 600 is part of another method to identify IBD segments, e.g., a Templated PBWT, the loop may also allow for the Templated PBWT algorithm to continue scanning the next haplotype site.
  • HMM Hidden Markov Model
  • FIG. 7 schematically illustrates using a HMM to process four haplotypes of two individuals (individual 1 and individual 2 , two haplotypes for each individual on chromosome 5) to correct phase switch errors, “stitching” IBD segments fractured by phase switch errors.
  • the HMM process covers the full span of the four haplotypes from left to right shown sequentially in the top panel, lower left panel, and lower right panel.
  • FIG. 8 shows a flow diagram illustrating process 800 for correcting phase switch errors in IBD segments using a hidden Markov model (HMM) according to some implementations.
  • the error correction optionally is initiated or triggered when IBD segments of the two individuals being compared meet one or more criteria. See decision box 802 .
  • This conditional trigger can avoid processing IBD segments that may not need corrections.
  • the HMM error correction process is triggered when the two individuals' IBD segments include two or more IBD segments on a single chromosome. This can avoid applying error correction when there is only a single IBD segment on a single chromosome where no phase switch errors have occurred.
  • the criterion is met when the two individuals' IBD segments exceed a minimum total amount of shared IBD.
  • process 800 proceeds to obtain an IBD state for each polymorphic site of a series of polymorphic sites of the two individuals. See the box 802 , “Yes” branch and box 804 .
  • the IBD state indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and if so, which of the two individuals' phased haplotypes are part of the IBD segment.
  • the series of polymorphic sites are located in one or more pairs of chromosomes of each individual.
  • the polymorphic sites are biallelic sites. In other implementations, more than two alleles may be implemented at a site.
  • the IBD states indicate different conditions of zero IBD, half IBD, and full IBD. In some implementations when the polymorphic site is a biallelic site, the IBD states include nine different IBD states corresponding to nine conditions of zero IBD, half IBD, and full IBD as further described in examples hereinafter.
  • Process 800 then involves applying the HMM to the IBD states. Box 806 .
  • the HMM model takes the IBD states as inputs and uses them as observed states of the model.
  • the HMM model also takes as input (i) a rate of recombination based on a number of meioses (m), (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes, and, optionally, (iii) genetic distances between consecutive sites on a chromosome. In some implementations, genetic distances between consecutive sites on a chromosome may be omitted.
  • model input herein refers to both variables and parameters.
  • the HMM model's transmission rates or probabilities depend on (i) and (ii), and optionally (iii).
  • the application of the HMM model removes likely phase switch errors and produces error corrected IBD segments based on a most likely sequence of hidden IBD states given the observed IBD states. See block 808 .
  • Applying the HMM involves using transition probabilities and emission probabilities of the HMM to identify the most likely sequence of hidden IBD states given the observed IBD states.
  • the most likely sequence of hidden IBD states is identified using the Viterbi dynamic programming process.
  • Process 800 is implemented using a computer. It is not practical or feasible to apply the model without a computer due to the complexity of the model. For example, applying the HMM requires using a 36 ⁇ 36 transmission matrix and a 36 ⁇ 36 emission matrix for each polymorphic site, often at hundreds of thousands of polymorphic site, to calculate a most likely sequence. It can take many years and errors for a person to calculate just a single Viterbi sequence.
  • the error correction process involves only the operations illustrated in boxes 804 , 806 , and 808 .
  • Such implementations include: (a) for each polymorphic site in a series of polymorphic sites of two individuals, obtaining an IBD state that indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and, if so, which of the two individuals' phased haplotypes are part of the IBD segment, wherein the series of polymorphic sites are comprised in or lie along one or more pairs of chromosomes; and (b) applying a hidden Markov model (HMM) to the IBD states to produce one or more error-corrected IBD segments, wherein the HMM model takes as input, in addition to the IBD states as observed IBD states, (i) a rate of recombination based on a number of meioses, (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes, and (ii
  • Some implementations of the disclosure include multiple iterations of applying the HMM to test different numbers of meioses (m). As illustrated in FIG. 8 , process 800 determines whether there are additional values of m that need to be tested. If so, the process loops back to box 804 to obtain IBD states and apply the HMM using a different value of m. In some implementations, the different numbers of meioses are in the range from 0.1 to 14 crossovers. In some implementations, the values of m are in the range from 1 to 14. See block 810 , “Yes” branch. If there are no additional values of m to be tested, process 800 proceeds to use the set of error corrected IBD segments having the highest probability as a final estimate of IBD segments for the two individuals. See decision block 810 , “No” branch, and block 812 . Thereafter, process 800 ends at block 814 .
  • FIG. 9A schematically illustrates the structure of the HMM model. It includes a series of hidden states (illustrated as circles on top) representing the ground-truth IBD states at a series of polymorphic sites and a series of observed states (illustrated as circles at the bottom) representing the observed IBD states based on phased haplotype data of the two individuals.
  • the arrows in the diagram denote conditional dependencies.
  • the hidden states obey the Markov property, such that the hidden state at any site depends on only the hidden state at the immediately previous site. In other words, H l depends only on H l ⁇ 1 . Moreover, the observed state at a particular site depends only on the hidden state at the particular site. In other words, O l depends on only H l .
  • the state space of the hidden variable is discrete.
  • the parameters of a HMM are of two types, transition probabilities and emission probabilities.
  • the transition probabilities between site l ⁇ 1 and site l determine the probability of H l given H l ⁇ 1 .
  • the emission probabilities at site l determine the probability of O l given H l .
  • Pr ( H 1 , H 2 , H 3 , ... ⁇ , O 1 , O 2 , O 3 , ... ⁇ ) Pr ⁇ ( H 1 ) ⁇ Pr ⁇ ( O 1 ⁇ H 1 ) ⁇ Pr ⁇ ( H 2 ⁇ H 1 ) ⁇ Pr ⁇ ( O 2 ⁇ H 2 ) ⁇ Pr ⁇ ( H 3 ⁇ H 2 ) ⁇ Pr ⁇ ( O 3 ⁇ H 3 ) ( Eq . ⁇ 1 )
  • H i ) are emission probabilities/parameters
  • H i ⁇ 1 ) are the transition probabilities/parameters.
  • the hidden state space assumes one of N possible values, modeled as a discrete distribution. For each of the N possible states that a hidden variable at point l can be in, there is a transition probability from this state to each of the N possible states of the hidden variable at point l+1, for a total of N 2 transition probabilities. Note that the set of transition probabilities for transitions from any given state must sum to 1. As such, the N ⁇ N matrix of transition probabilities is a Markov matrix.
  • the emission probabilities governing the distribution of the observed variable at a particular point given the state of the hidden variable at that point.
  • the size of this set depends on the nature of the observed variable. For example, if the observed variable is discrete with M possible values, governed by a discrete distribution, there will be a total of N ⁇ M emission probabilities.
  • each polymorphic site is biallelic, and the IBD states at any site can include nine different IBD states, indicating nine conditions of zero IBD, half IBD, and full IBD.
  • site l can be observed as IBD between the two individuals.
  • the IBD state at site l notated as c* ⁇ is represented by a string of 4 integers each corresponding to the 4 haplotypes. The first two integers refer to the maternal and paternal haplotypes in individual 0 and the last two integers refer to the maternal and paternal haplotypes in individual 1.
  • the haplotype at site l is not IBD practitioners represent it as a 0.
  • the IBD states are expanded by multiplying these different 9 conditions of IBD with four types of phase switch errors. But if one disregards the phase switch error types, there would be 9 ⁇ 9 transition rates between hidden states of two consecutive sites.
  • transition rates of the HMM are based upon a rate at which IBD segments start.
  • the rate at which IBD segments start is modeled as a function of the number of meioses. See box 706 , input (i).
  • the rate at which IBD segments start ( ⁇ s ) is modeled as follows.
  • m is the number of meioses, and the recombination rate is assumed to be 1 crossover per 100 cM.
  • transition rates of hidden IBD states are based on a rate at which IBD segments end.
  • the rate at which IBD segments end is modeled as a function of the number of meioses.
  • the rate at which IBD segments ends ( ⁇ e ) is modeled as follows.
  • m is the number of meioses.
  • the IBD states include nine different IBD states, and transition rates are based on a transition matrix Q a in FIG. 9B .
  • each row includes rates for transitioning from the IBD state denoted by the four-letter string at site l to nine IBD states at l+1.
  • the transition rates of hidden IBD states are weighted by a probability that full IBD between the two individuals is truly present.
  • the probability that the full IBD between the two individuals is truly present is modeled as a logistic function of an amount of estimated full IBD.
  • the probability that full IBD between the two individuals is truly present ( ⁇ ) is modeled as follows.
  • ⁇ 2 is the amount of estimated full IBD
  • is an empirical parameter defining the steepness of the logistic function
  • the transition rates of hidden IBD states are weighted by weighting transitions into full IBD states with ⁇ , and waiting transitions out of full IBD states with 1/ ⁇ .
  • the IBD states include nine different IBD states, and the transition rates of hidden IBD states are based on a transition matrix as follows.
  • the transition rates of hidden IBD states are based on the at least one rate of phase switch error. See block 706 , model input (ii).
  • the IBD states include nine different IBD states as described herein.
  • the at least one rate of phase switch error includes a rate of phase switch error for each of the two individuals, ⁇ 1 and ⁇ 2 , respectively.
  • the phase switch error rates for the two individuals are the same when the same phasing method is used for both individuals.
  • the transition rates are based on the 36 ⁇ 36 transition matrix described as follows.
  • transition probabilities of hidden IBD states are based upon genetic distances between consecutive sites on a chromosome. See box 706 , model input (iii). In some implementations, transition probabilities of hidden IBD states are obtained by exponentiating a transition matrix. In some implementations, transition probabilities of hidden IBD states Y l+1 given hidden IBD states Y l are modeled as:
  • ⁇ 0 is a phase switch error rate for a first individual of the two individuals
  • ⁇ 1 is a phase switch error rate for a second individual of the two individuals
  • ⁇ 2 is an amount of estimated full IBD
  • Q is a transition matrix described by Eq. 6
  • d l is the genetic distances between sites l and l+1.
  • the emission probabilities of the HMM are dependent on phase switch errors. In some implementations, the emission probabilities are defined by a uniform error term that weights probabilities of observed IBD states based on the four different ways the two individuals may be in phase switch errors.
  • IBD segments shared between two related individuals are generated by passing along the four haplotypes of the two individuals. IBD segments begin and end following a Poisson process with rates that are determined by the number of meioses m that occurred on the pedigree between the two individuals. Phase switch errors occur following a Poisson process with a rate ⁇ determined by empirically testing statistical phasing methods.
  • Y l represents the different ways site l could be observed as IBD plus the different ways the two individuals may be in a phase switch error.
  • site l can be observed as IBD between the two individuals.
  • Practitioners notate the IBD state at site l as c* l , which is represented by a string of 4 integers each corresponding to the 4 haplotypes. The first two integers refer to the maternal and paternal haplotypes in individual 0 and the last two integers refer to the maternal and paternal haplotypes in individual 1.
  • haplotype at site 1 is not IBD inventors represent it as a 0.
  • c* l 0000 indicates that the two individuals at site 1 are not IBD, or zero IBD. Accordingly, there are 4 different ways the two individuals could be half IBD: 0101 is when the individuals are IBD through their paternal haplotypes, 1001 is when the individual 0's maternal haplotype is IBD with individual 1's paternal haplotype, 0110 is when the individual 0's paternal haplotype is IBD with individual 1's maternal haplotype, and 1010 is when the individuals are IBD through their maternal haplotypes.
  • hidden states Y l represents the different ways site l could be observed as IBD and also includes information about the different ways in which the two individuals may or may not be in a switch error.
  • Practitioners model the transitions among hidden states Y l with an instantaneous transition rate matrix. If, for a moment, practitioners do not consider transitions in which phase switch errors may occur and practitioners only consider transitions among the 9 IBD states that can be observed, practitioners can define the transition matrix Q a shown in FIG. 9A .
  • the matrix Q a defines the way the model moves between zero, half, and full IBD states. As the model passes along the chromosome ⁇ s is the rate at which IBD segments begin
  • ⁇ e represents the length of the IBD segments shared between individuals 0 and 1.
  • ⁇ s represents the length of segments with no IBD shared between the two individuals.
  • Phase switch errors break up half IBD segments into shorter adjacent half IBD segments on different haplotypes. Since the templated PBWTs procedure described above imperfectly estimates the start and end positions of IBD segments, when the lengths of the two adjacent half IBD segments are over estimated this can result in a short region of erroneous full IBD. Since full IBD is not expected for most pairs of relatives we model the error in the observed proportion of full IBD using a simple logistic function. Practitioners indicate the probability of full IBD truly being present as ⁇ , which is defined as
  • ⁇ 2 is the amount of full IBD estimated by the templated PBWTs.
  • ⁇ 2 ⁇ 25% the amount expected for full siblings
  • ⁇ 2 approaches zero
  • also approaches zero.
  • the probability of observing the IBD state c* l given the possible phase switch errors at site l is P(c* l
  • Phased IBD Phased IBD. It is used in the experiments described hereinafter. It has two stages: First the templated PBWT and then the phase-correcting HMM.
  • the templated PBWT stage generates the IBD segments among all haplotypes very quickly and efficiently.
  • the second stage of the algorithm the HMM
  • the HMM the second stage of the algorithm
  • phase switch errors have not broken up their observed IBD segments and so the HMM does not apply.
  • the HMM the slow stage of the 2-part algorithm, is thus only applied to the small number of individuals within the dataset that are closely related. Practitioners require a pair of individuals to have at least 2 observed IBD segments on a single chromosome before running them through the phase-correcting HMM, though additionally we can require a minimum total amount of shared IBD (in cM) to increase the speed of the entire algorithm.
  • IBD segments can be used for a wide range of purposes. For instance, the amount (length and number) of IBD sharing depends on the familial relationships between the tested individuals. Therefore, one application of IBD segment detection is to quantify relatedness. For example, methods for using IBD segments to quantify relatedness are described in U.S. Pat. No. 8,463,554, issued Jul. 11, 2013, which is incorporated by reference in its entirety for all purposes.
  • the number of shared IBD segments and the amount of DNA shared by two users are computed based on the IBD segments obtained as described above. In some implementations, the longest IBD segment is determined. In some implementations, the amount of DNA shared includes the sum of the lengths of IBD regions and/or percentage of DNA shared. The sum is referred to as IBDhalf or half IBD because the individuals share DNA identical by descent for at least one of the homologous chromosomes. The predicted relationship between the users, the range of possible relationships, or both, is determined using the IBDhalf and number of segments, based on the distribution pattern of IBDhalf and shared segments for different types of relationships.
  • the individuals have IBDhalf that is 100% the total length of all the autosomal chromosomes and 22 shared autosomal chromosome segments; in a second degree grandparent/grandchild relationship, the individuals have IBDhalf that is approximately half the total length of all the autosomal chromosomes and many more shared segments; in each subsequent degree of relationship, the percentage of IBDhalf of the total length is about 50% of the previous degree. Also, for more distant relationships, in each subsequent degree of relationship, the number of shared segments is approximately half of the previous number.
  • the distribution patterns are determined empirically based on survey of real populations. Different population groups may exhibit different distribution patterns. For example, the level of homozygosity within endogamous populations is found to be higher than in populations receiving gene flow from other groups.
  • the bounds of particular relationships are estimated using simulations of IBD using generated family trees. Based at least in part on the distribution patterns, the IBDhalf, and shared number of segments, the degree of relationship between two individuals can be estimated.
  • IBD segments can also be used determine ethnicity or ancestry. See, e.g., U.S. patent application Ser. No. 15/664,619, filed Jul. 31, 2017, which is incorporated by reference in its entirety for all purposes.
  • IBD can be used to perform genotype imputation.
  • Genotype imputation refers to the statistical inference of genotype information not directed assayed. This is especially helpful because many individuals only have sparsely assayed genotype data, usually targeting a limited number of genetic markers in the genome.
  • IBD segments are determined between two individuals, it can be inferred that the genotype of the two individuals are the same in the IBD segments.
  • the known genotype information of an IBD segment of one of the two individuals can be “imputed” into that of the other individual.
  • This further allows association study between phenotypes and genotypes even using individuals that have only the phenotype data collected but not the genotype data assayed. See, e.g., U.S. patent application Ser. No. 15/256,388, filed Sep. 2, 2016, which is incorporated by reference in its entirety for all purposes.
  • FIG. 10 is a functional diagram illustrating a programmed computer system for performing the pipelined ancestry prediction process in accordance with some implementations.
  • Computer system 100 which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a processor or a central processing unit (CPU)) 102 .
  • processor 102 can be implemented by a single-chip processor or by multiple processors.
  • processor 102 is a general purpose digital processor that controls the operation of the computer system 100 . Using instructions retrieved from memory 110 , the processor 102 controls the reception and manipulation of input data, and the output and display of data on output devices (e.g., display 118 ).
  • processor 102 includes and/or is used to provide phasing, genotype error correction, and/or phasing error correction, etc. as described herein.
  • Processor 102 is coupled bi-directionally with memory 110 , which can include a first primary storage, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM).
  • primary storage can be used as a general storage area and as scratch-pad memory, and can also be used to store input data and processed data.
  • Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 102 .
  • primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 102 to perform its functions (e.g., programmed instructions).
  • memory 110 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional.
  • processor 102 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).
  • a removable mass storage device 112 provides additional data storage capacity for the computer system 100 , and is coupled either bi-directionally (read/write) or uni-directionally (read only) to processor 102 .
  • storage 112 can also include computer-readable media such as magnetic tape, flash memory, PC-CARDS, portable mass storage devices, holographic storage devices, and other storage devices.
  • a fixed mass storage 120 can also, for example, provide additional data storage capacity. The most common example of mass storage 120 is a hard disk drive.
  • Mass storage 112 , 120 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 102 . It will be appreciated that the information retained within mass storage 112 and 120 can be incorporated, if needed, in standard fashion as part of memory 110 (e.g., RAM) as virtual memory.
  • bus 114 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 118 , a network interface 116 , a keyboard 104 , and a pointing device 106 , as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed.
  • the pointing device 106 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.
  • the network interface 116 allows processor 102 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown.
  • the processor 102 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps.
  • Information often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network.
  • An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 102 can be used to connect the computer system 100 to an external network and transfer data according to standard protocols.
  • various process implementations disclosed herein can be executed on processor 102 , or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing.
  • Additional mass storage devices can also be connected to processor 102 through network interface 116 .
  • auxiliary I/O device interface can be used in conjunction with computer system 100 .
  • the auxiliary I/O device interface can include general and customized interfaces that allow the processor 102 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.
  • various implementations disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations.
  • the computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system.
  • Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices.
  • Examples of program code include both machine code, as produced, for example, by a compiler, or files containing higher level code (e.g., script) that can be executed using an interpreter.
  • the computer system shown in FIG. 10 is but an example of a computer system suitable for use with the various implementations disclosed herein.
  • Other computer systems suitable for such use can include additional or fewer subsystems.
  • bus 114 is illustrative of any interconnection scheme serving to link the subsystems.
  • Other computer architectures having different configurations of subsystems can also be utilized.
  • FIG. 11 is a block diagram illustrating an implementation of an IBD-based personal genomics services system that provides services based on IBD information, which include but are not limited to relatedness estimation, relative detection, ancestry determination, and genotype-phenotype association.
  • a user uses a client device 1102 to communicate with an IBD-based personal genomics services system 1106 via a network 1104 .
  • Examples of device 1102 include a laptop computer, a desktop computer, a smart phone, a mobile device, a tablet device or any other computing device.
  • IBD-based personal genomics services system 1106 is used to perform a pipelined process to predict ancestry based on a user's IBD information.
  • IBD-based personal genomics services system 1106 can be implemented on a networked platform (e.g., a server or cloud-based platform, a peer-to-peer platform, etc.) that supports various applications.
  • a networked platform e.g., a server or cloud-based platform, a peer-to-peer platform, etc.
  • implementations of the platform perform ancestry prediction and provide users with access (e.g., via appropriate user interfaces) to their personal genetic information (e.g., genetic sequence information and/or genotype information obtained by assaying genetic materials such as blood or saliva samples) and predicted ancestry information.
  • the platform also allows users to connect with each other and share information.
  • Device 110 can be used to implement 1102 or 1106 .
  • DNA samples e.g., saliva, blood, etc.
  • the genotype information is obtained (e.g., from genotyping chips directly or from genotyping services that provide assayed results) and stored in database 1108 and is used by system 1106 to make ancestry predictions.
  • Reference data including genotype data of reference individuals, simulated data (e.g., results of machine-based processes that simulate biological processes such as recombination of parents' DNA), pre-computed data (e.g., a precomputed reference haplotype data used in phasing and model training) and the like can also be stored in database 1108 or any other appropriate storage unit.
  • This experiment compares a method according to some implementations as described above to other computer implemented methods known in the art. All of these methods are computer-implemented. IBD accuracies and computer performances are compared among the methods.
  • Phased IBD Phased IBD. It includes techniques as described in the templated PBWT and the HMM examples above.
  • PBWT Burrows-Wheeler transform
  • RaPID RaPID
  • A. Naseri X. Liu, S. Zhang, and D. Zhi. Ultra-fast identity by descent detection in biobank-xcale cohorts using positional Burrows-Wheeler transform. bioRxiv, page 103325, 2017.
  • Browning The method described by Browning is labeled as Refined IBD. See, B. L. Browning and S. R. Browning. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics, 194(2):459-471, 2013.
  • hap-IBD A method described by Zhou is labeled hap-IBD. See, Y. Zhou, S. R. Browning, and B. L. Browning. A fast and simple method for detecting identity by descent segments in large-scale data. BioRxiv, 2019.
  • Shemirani A method described by Shemirani is labeled iLASH. See, R. Shemirani, G. M. Belbin, C. L. Avery, E. E. Kenny, C. R. Gignouz, and J. L. Ambite. Rapid detection of identity-by-descent tracts for mega-scale datasets. BioRxiv, page 749507, 2019.
  • FIG. 12 shows results comparing the speed of different IBD inference methods.
  • IBD segments were computed for 50781 SNPs from human chromosome 1.
  • the x-axis shows the number of haplotypes and the y-axis shows the time in seconds to infer IBD.
  • Phased IBD used a minimum IBD segment length of 200 sites and 1.5 cM.
  • Durbin' s PBWT used a 200 site minimum.
  • RaPID version 1.2.3 used 10 runs, 2 successes, window size of 35, and minimum length of 1.5 cM.
  • Refined IBD used a minimum length of 1.5 cM and all default parameter value settings.
  • the results of FIG. 11 show that Phased IBD and PBWT are the fastest among the four methods and similar to each other. RaPID is the slowest.
  • Phased IBD can correct various errors (including genotyping errors and phase switch errors) that cannot be addressed by PBWT, it is noteworthy that Phased IBD achieves similar computational speed as PBWT.
  • RaPID and Refined IBD can correct errors, albeit to a lesser extent than Phased IBD as shown in FIGS. 14 and 15 , they require significantly longer computer run time.
  • FIGS. 13-16 compare the IBD estimate errors (or the opposite of accuracy) of various methods.
  • FIG. 13 compares the absolute error in number of IBD segments between the Templated PBWT method (x-axis) and Phased IBD (y-axis) that includes both Templated PBWT component and the HMM component.
  • the results of FIG. 13 show that the HAIM process greatly reduce the error rates, reducing maximum error segments by about three folds from about 300 to 100.
  • FIG. 14 shows that Phased IBD is more accurate than PBWT.
  • Each axis shows the proportion of the genome with incorrect IBD estimates.
  • PBWT x-axis
  • y-axis is sensitive to both genotyping and phasing errors compared to Phased IBD
  • FIG. 15 shows that Phased IBD is more accurate than Refined IBD.
  • Refined IBD x-axis
  • Phased IBD y-axis
  • Refined IBD outperforms both PBWT and RaPID.
  • FIG. 16 shows that Phased IBD is more accurate than RaPID (version 1.2.3).
  • RaPID x-axis
  • PBWT PBWT
  • Phased IBD “stitches” together long IBD segments highly fragmented by phasing and genotyping errors.
  • Recombination was simulated using a Poisson model with a rate of 1 expected crossover per 100 cM. This resulted in simulated haplotypes for 2000 closely related pairs of individuals with perfectly known IBD segments, 400 pairs of each relationship type: parent-child, grandparent-grandchild, aunt-niece, first cousins, and siblings.
  • Genotyping errors were introduced into the simulated data set using a simple model. At each position along the simulated chromosomes an error in the genotype call was introduced with a probability of 0.001. When a site was selected for an error, half of the genotype call would be flipped with equal probability (e.g., a 0/0 genotype would be converted to a 1/0 or a 0/1 with equal probability).
  • Statistical phasing errors were also introduced into the simulated haplotype datasets. All of the simulated haplotypes were converted into their respective diploid genotypes and then the statistical haplotype phasing method Eagle2 was used. For the phasing reference panel a phasing panel that included about 200000 non-Europeans and about 300000 Europeans was used.
  • FIG. 17 shows that Templated PBWT had less error in the estimated number of IBD segments shared between relatives than all other methods analyzed.
  • the y-axis represents the number of erroneous IBD segments estimated for a simulated pair of relatives. Error was highest in closely related pairs that shared long IBD segments, particularly parent-child and siblings.
  • FIG. 18 shows that Templated PBWT had less error in the estimated percentage of the genome that is IBD in simulated relatives than other methods.
  • PBWT had less than error than other methods except Templated PBWT, while hap-IBD and Refined IBD had the largest error. Error was higher in simulated pairs that shared long IBD segments, such as parent-child, compared to more distance relatives pairs such as first cousins. Compared to Templated PBWT, the other methods were more sensitive to phasing and genotyping errors in estimated IBD segments.
  • FIG. 19 shows false negative (charts 1901 and 1905 ) and false positive rates (charts 1903 a - b and 1907 a - b ) of inferring IBD by various methods. Rates were calculated for bins of IBD segment lengths. False negative rate by segment is the proportion of true segments in a size bin that do not overlap any segment compared to the total number of true segments in the size bin. False negative rate by segment coverage is the proportion of the length of true segments in a size bin not covered by any estimated segment compared to the total length of true segments in the size bin. False positive rate by segment is the proportion of estimated segments in a size bin that do not overlap any true segment compared to the total number of estimated segments in the size bin.
  • False positive rate by segment coverage is the proportion of the length of estimated segments in a size bin not covered by any true segment compared to the total length of estimated segments in the size bin.
  • Plots 1903 b and 1907 b present the false positive rate with a smaller y-axis scale than plots 1903 a and 1907 a, respectively.
  • IBD segments ⁇ 4 cM all methods had low false positive rates.
  • IBD segments greater than ⁇ 6 cM the Templated PBWT outperformed all other methods.
  • FIG. 20A shows IBD computation runtimes for various methods. All methods were run using 1 CPU core. Templated PBWT was faster than all other methods except Durbin' s PBWT. The relative time shows the runtime to compute IBD for each haplotype in sample sizes of 400 to 20000 haplotypes relative to the time needed to compute IBD for each haplotype in a sample size of 400. A slope near zero indicates linear time complexity, while a positive slope indicates super-linear time complexity. Templated PBWT shows a near linear time complexity. FIG. 20B provides additional compute times for parallelized IBD analyses with large sample sizes.
  • Times are shown for in-sample IBD computes on 1 million individuals, out-of-sample IBD computes on 10 k individuals against 1 million, and out-of-sample IBD computes on 10 k individuals against 10 million.
  • the first two rows show the compute times measured when IBD was estimated over 42927 sites of human chromosome 1.
  • the last three rows show those compute times extrapolated to 23 chromosomes with a total of 600 k sites.
  • the last row additionally extrapolates the time for an out-of-sample analysis on 1 million to 10 million individuals.
  • CPU time is the sum of the computation time for all compute cores.
  • Wall clock time is the “real” time that the entire analysis took to run.
  • the v4 platform had 453065 SNPs and v5 platform had 544042 SNPs.
  • Haplotypes were phased using Eagle2 as described in Loh et. al., Reference-based phasing using the haplotype reference consortium panel. Nature genetics, 48(11):1443, 2016. Individuals on the v4 platform were phased with a reference panel containing 691759 samples. Individuals on the v5 platform were phased with a reference panel containing 286305 samples.
  • IBD sharing among the 9517 individuals was computed using the Templated PBWT with the parameters described in Table 1. IBD estimates among individuals on the same genotyping platform were made using the in-sample method described above, and estimates made among individuals on different platforms was made using the out-of-sample approach described above over the intersection of platform SNPs (only the SNPs present in both the v4 and v5 genotyping platforms).
  • Hierarchical clustering of the mean pairwise IBD haplotype sharing across Mexican states was performed using Ward's method (Ward Jr 1963) in R. To remove close relatives we excluded any pair of individuals that shared more than 20 cM. Geographic maps of the mean pairwise IBD shared across Mexican states were made using the R packages mxmaps, ggplot2, and viridis (Valle-Jones 2019; Wickham 2016; Gamier 2018).
  • FIGS. 21 and 22 show IBD haplotype sharing across Mexican states as determined by a Templated PBWT method.
  • Hierarchal clustering of IBD sharing across Mexican states identified geographic clusters with elevated levels of haplotype sharing, as shown in FIG. 21 . There were two large clusters: one cluster in the Yucatan peninsula and the southern Mexican states, and another cluster representing Mexico City and the central and northern states. The clusters were further subdivided into individual states.
  • Mean pairwise IBD haplotype sharing was highest within states and among geographically neighboring states, as shown in FIG. 22 .
  • mean IBD shared among individuals with all 4 grandparents from Nuevo Leon was over 12 cM
  • the mean pairwise IBD shared between individuals with all 4 grandparents from Nuevo Leon and individuals with all 4 grandparents from neighboring Coahuila and Tamaulipas was over 10 cM
  • mean pairwise sharing between individuals with all 4 grandparents from Nuevo Leon and individuals with all 4 grandparents from Yucatan was less than 6 cM. Similar geographically stratified IBD sharing was found throughout Mexico, as shown in FIG. 22 .

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Organic Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Databases & Information Systems (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Bioethics (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The disclosed embodiments concern methods, apparatus, systems and computer program products for estimating IBD segments. Some implementations use a templated positional Burrows-Wheeler transform (PBWT) technique and a phase switch error heuristic to correct genotyping errors and phase switch errors to make fast and accurate phase aware IBD estimates. In some implementations a templated PBWT technique and a probabilistic hidden Markov model (HMM) are used to correct genotyping errors and phase switch errors.

Description

    INCORPORATION BY REFERENCE
  • An Application Data Sheet is filed concurrently with this specification as part of the present application. Each application that the present application claims benefit of or priority to as identified in the concurrently filed Application Data Sheet is incorporated by reference herein in its entirety and for all purposes.
  • BACKGROUND
  • Modern genetic data sets already number in the millions of genomes and are growing rapidly. Inferring the genomic location and length of identical-by-descent (IBD) segments among the related individuals in these data sets is a central step in many genetic analyses.
  • IBD estimates can best be exploited when they are made using phased haplotypes; this means each individual in the data set is represented by two sequences each of which consists of alleles co-located on the same chromosome and inherited from a different parent. IBD estimates that are phase aware can improve relationship and pedigree inference, allow health and trait inheritance to be traced, and make possible a range of other inferences regarding demographic history and ancestry that are not possible when IBD estimates are made using only unphased genotype data. Therefore, methods and systems that can improve performance of phase aware IBD estimates have significant value.
  • All patents, patent applications, and other publications, including all sequences disclosed within these references, referred to herein are expressly incorporated herein by reference, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated by reference. All documents cited are, in relevant part, incorporated herein by reference in their entireties for the purposes indicated by the context of their citation herein. However, the citation of any document is not to be construed as an admission that it is prior art with respect to the present disclosure.
  • SUMMARY
  • The disclosed implementations concern methods, apparatus, systems, and computer program products for processing haplotype data to accurately estimate IBD segments between individuals.
  • A first aspect of the disclosure provides computer-implemented methods for estimating IBD segments between individuals.
  • Another aspect of the disclosure provides systems for estimating IBD segments. In some implementations, the system involves: a sequencer for sequencing nucleic acids of the test sample; a processor; and one or more computer-readable storage media having stored thereon instructions for execution on said processor to estimate IBD segments between individuals.
  • Another aspect of the disclosure provides a computer program product including a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to implement the methods above for estimating IBD segments.
  • A computer implemented method of processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes is provided, the method including: providing a first digital template including a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites; providing a second digital template including a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different; providing two or more haplotypes strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying the second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the merged set of IBD segments has reduced impact from genotyping errors compared to a set of IBD segments generated without applying the first and second digital templates.
  • In some embodiments, the first and second templates each have a size of at least four consecutive haplotype sites. In some embodiments, identifying the first matches among alleles at unmasked sites includes sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings. In some embodiments, computationally identifying IBD segments between the two or more haplotype strings further includes: computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and wherein merging the first and second matches among alleles to produce a merged set of IBD segments further includes computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments. In some embodiments, computationally identifying additional matches among alleles at unmasked sites employs a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template. In some embodiments, the first through sixth digital templates each include two masked sites and two unmasked sites. In some embodiments, the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
  • In some embodiments, the two or more haplotype strings include at least one thousand haplotype strings. In some embodiments, the two or more haplotype strings include at least one million haplotype strings. In some embodiments, computationally identifying IBD segments between the two or more haplotype strings includes performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings. In some embodiments, computationally merging the first and second matches among alleles is performed while considering individual polymorphic sites of the two or more haplotype strings using the PBWT. In some embodiments, the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window. In some embodiments, the total number of digital templates is k!/(m!*(k−m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window. In some embodiments, applying the first digital template comprises a deterministic process employing the first arrangement of masked and unmasked sites.
  • In another aspect of the embodiments provided herein, a system for processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes is provided, the system including: (a) one or more processors and associated memory; (b) computer readable instructions for: providing a first digital template including a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites; providing a second digital template including a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different; providing two or more haplotypes strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying the second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the merged set of IBD segments has reduced impact from genotyping errors compared to a set of IBD segments generated without applying the first and second digital templates.
  • In some embodiments, the first and second templates each have a size of at least four consecutive haplotype sites. In some embodiments, the instructions for identifying the first matches among alleles at unmasked sites includes instructions for sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings. In some embodiments, the instructions for identifying IBD segments between the two or more haplotype strings further include instructions for: computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and wherein merging the first and second matches among alleles to produce a merged set of IBD segments further includes computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments. In some embodiments, the instructions for identifying additional matches among alleles at unmasked sites employ a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template. In some embodiments, the first through sixth digital templates each include two masked sites and two unmasked sites. In some embodiments, the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
  • In some embodiments, the two or more haplotype strings include at least one thousand haplotype strings. In some embodiments, the two or more haplotype strings include at least one million haplotype strings. In some embodiments, the instructions for identifying IBD segments between the two or more haplotype strings include instructions performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings. In some embodiments, the instructions for merging the first and second matches among alleles include instructions for performing the merging while considering individual polymorphic sites of the two or more haplotype strings using the PBWT. In some embodiments, the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window. In some embodiments, the total number of digital templates is k!/(m!*(k−m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • In another aspect of the embodiments herein, a method of identifying IBD segments between two or more haplotypes strings, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism is provided, the method including: (a) computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of two or more haplotype strings at unmasked sites produced by applying a first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying a second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the first digital template includes a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the second digital template includes a second arrangement of masked and unmasked sites in the window of consecutive haplotype sites, and wherein the first and second arrangements are different; and (b) identifying a potential phase switch error in at least one of the two or more haplotype strings; and (c) correcting the phase switch error. In some embodiments, identifying the potential phase switch error includes identifying proximate IBD segments in at least one pair of the two or more haplotype strings.
  • In another aspect of the embodiments herein, a computer implemented method of determining identity by descent (IBD) segments is provided, the method including: determining first potential IBD segments among phased haplotype data for a plurality of individuals, wherein the first potential IBD segments have an end site; determining second potential IBD segments among haplotype data for the plurality of individuals, wherein the second potential IBD segments have a start site; determining that the end site of the first potential IBD segments and the start site of the second potential IBD segments are within a threshold number of sites of each other; and merging the first potential IBD segments and the second potential IBD segments to form a combined potential IBD segment.
  • In some embodiments, the first potential IBD segments and the second potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further includes: determining a phase switch error occurred at a site between the first potential IBD segment and the second potential IBD segment for the individual; and swapping the haplotypes for the individual from the position of the phase switch error. In some embodiments, the first potential IBD segments and the second potential IBD segments overlap for an individual of the plurality of individuals. In some embodiments, the first potential IBD segment and the second potential IBD segment each span at least the threshold number of sites. In some embodiments, the threshold number of sites is between about 0 and 500 SNPs. In some embodiments, the plurality of individuals do not share a parent-child relationship.
  • In some embodiments, the method further includes: determining a third potential IBD segments among phased haplotype data for a plurality of individuals, wherein the third potential IBD segments have a start site; determining that the end site of the combined potential IBD segments and the start site of the third potential IBD segments are within the threshold number of SNPs of each other; and merging the combined potential IBD segments and the third potential IBD segments. In some embodiments, the combined potential IBD segments and the third potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further includes: determining a phase switch error occurred at a site between the combined potential IBD segment and the third potential IBD segment for the individual; and swapping the haplotypes for the individual from the position of the phase switch error. In some embodiments, the method further includes determining that the combined potential IBD segments have a minimum length in centimorgans and storing the combined potential IBD segments as IBD segments for the plurality of individuals.
  • In another aspect of the embodiments herein, a computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes is provided, the method including: providing two or more paired haplotypes strings for identification of IBD segments therebetween, each of the two or more paired haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and computationally iterating through the two or more paired haplotype strings by: (i) identifying a first potential IBD segment between the two or more haplotype strings by identifying matches among alleles of the haplotype strings; (ii) comparing the first site of the first potential IBD segment to the last site of a previously identified second potential IBD segment (iii) determining that the last site of the second potential IBD segment and the first site of the first potential IBD segment are within a threshold number of sites of each other; and (iv) merging the first potential IBD segment and the second potential IBD segment to form a combined potential IBD segment.
  • In another aspect of the embodiments disclosed herein, a computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes is provided, the method including: (a) computationally identifying initial IBD segments between two or more haplotype strings by identifying first matches among alleles of the haplotype strings using a plurality of templates, each including a unique arrangement of masked and unmasked sites in a window of consecutive haplotype sites; and (b) providing information characterizing the initial IBD segments to a hidden Markov model (HMM) which removes potential phase switch errors to produce final IBD segment, wherein the HMM analyzes the information characterizing the initial IBD segments using distances between consecutive haplotype sites on a chromosome, one or more rates of recombination based on meiosis, and one or more rates of phase switch error based on a phasing method employed to phase the haplotypes.
  • In some embodiments, the method further includes, after (a) and before (b), removing some initial IBD segments determined to belong to haplotypes having less than a threshold amount of initial IBD segments, wherein the initial IBD segments provided to the HEIM in (b) have had some initial IBD segments removed. In some embodiments, the threshold amount of initial IBD segments is less than two initial IBD segments per chromosome.
  • In another aspect of the embodiments described herein, a computer implemented method of determining identical-by-descent (IBD) segments is provided, the method including: (a) for each polymorphic site in a series of polymorphic sites of two individuals, obtaining an IBD state that indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and, if so, which of the two individuals' phased haplotypes are part of the IBD segment, wherein the series of polymorphic sites are included in one or more pairs of chromosomes; and (b) applying a hidden Markov model (HMM) to the IBD states to produce one or more error-corrected IBD segments, wherein the HMM model takes as input, in addition to the IBD states as observed IBD states, (i) a rate of recombination based on a number of meioses, and (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes; wherein applying the HMM removes likely phase switch errors and produces the error-corrected IBD segments based on a most likely sequence of hidden IBD states; and wherein operations (a) and (b) are performed by one or more processors of a computer system.
  • In some embodiments, the HMM takes as input: (iii) genetic distances between consecutive sites on a chromosome. In some embodiments, transition rates of the HMM are based on a rate at which IBD segments start, which rate is modeled as a function of the number of meioses. In some embodiments, the rate at which IBD segments start (αs) is modeled as follows:
  • α s = 1 1 + ( 100.0 × m )
  • wherein m is the number of meioses. In some embodiments, transition rates of the HMM are based on a rate at which IBD segments end. In some embodiments, the rate at which IBD segments end is modeled as a function of the number of meioses. In some embodiments, the rate at which IBD segments end (αe) is modeled as follows:
  • α e = m 100.0
  • wherein m is the number of meioses. In some embodiments, the IBD states include nine different IBD states, which indicate nine conditions of zero IBD, half IBD, and full IBD. In some embodiments, transition rates of the HMM are based on a transition matrix Qa in FIG. 8B. In some embodiments, transition rates of the HMM are weighted by a probability that full IBD between the two individuals is truly present. In some embodiments, the probability that full IBD between the two individuals is truly present is modeled as a logistic function of an amount of estimated full IBD. In some embodiments, the probability that full IBD between the two individuals is truly present (β) is modeled as follows:
  • β = 1 1 + e η ( 0.125 - l 2 )
  • wherein ι2 is the amount of estimated full IBD, and η is an empirical parameter defining the steepness of the logistic function. In some embodiments, the transition rates are weighted by weighting transitions into full IBD states with β, and weighting transitions out of full IBD states with 1/β. In some embodiments, the IBD states include 9 different IBD states, and the transition rates are based on a transition matrix Qβ in (Eq. 5). In some embodiments, transition rates of the HMM are based on the at least one rate of phase switch error. In some embodiments, the at least one rate of phase switch error includes a rate of phase switch error for each of the two individuals, there are 4 types of phase switch errors, the IBD states include 9 different IBD states, and the transition rates are based on a 36×36 transition matrix Q in (Eq. 6). In some embodiments, transition probabilities of the HMM are based on the genetic distances between consecutive sites on a chromosome.
  • In some embodiments, the transition probabilities are obtained by exponentiating a transition matrix. In some embodiments, transition probabilities of hidden IBD states Yl+1 given hidden IBD states Yl are modeled as: P(Yl+1|Yl, m, μ0, μ1, ι2)=eQd l wherein m is the number of meioses, μ0 is a phase switch error rate for a first individual of the two individuals, μ1 is a phase switch error rate for a second individual of the two individuals, ι2 is an amount of estimated full IBD, Q is a transition matrix described by Eq. (Q), and dl is the genetic distances between sites l and l+1.
  • In some embodiments, emission probabilities of the HMM are dependent on phase switch errors. In some embodiments, the emission probabilities are defined by a uniform error term that weights probabilities of observed IBD states based on four different ways the two individuals may be in phase switch errors. In some embodiments, (b) includes using transition probabilities and emission probabilities of the HMM to identify the most likely sequence of hidden IBD states given the observed states. In some embodiments, the mostly likely sequence of hidden IBD states is identified using a Viterbi dynamic programming process. In some embodiments, the method further includes: performing (a) and (b) for a plurality of iterations, each iteration using a different number of meioses for the rate of recombination, thereby producing a plurality of sets of error-corrected IBD segments; and selecting a set of error-corrected IBD segments having a highest likelihood or probability in the plurality of sets as a final estimate of one or more IBD segments. In some embodiments, the different numbers of meioses are in a range from 1 to 14. In some embodiments, the method is initiated when the two individuals' IBD segments including the series of polymorphic sites meet a criterion. In some embodiments, the two individuals' IBD segments include two or more IBD segments on a single chromosome. In some embodiments, the two individuals' IBD segments exceed a minimum total amount of shared IBD
  • Although the examples herein concern humans and the language is primarily directed to human concerns, the concepts described herein are applicable to genomes from any biological organism. These and other objects and features of the present disclosure will become more fully apparent from the following description and appended claims in conjunction with the associated drawings.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings, which are included as part of the present specification, illustrate embodiments and, together with the general description given above and the detailed description of the embodiment given below, serve to explain and teach the principles described herein.
  • FIG. 1 presents possible phase switch errors in long IBD segments.
  • FIGS. 2A and 2B present example process flows according to various embodiments discussed herein.
  • FIG. 3 presents an example process for finding IBD segments.
  • FIG. 4 illustrates haplotype strings, a positional prefix array, and a divergence array in accordance with various embodiments discussed herein.
  • FIGS. 5A and 5B present process flows according to various embodiments herein.
  • FIG. 5C presents pseudocode for an algorithm according to various embodiments herein.
  • FIG. 5D presents an illustration of the various data structures used in accordance with a Templated PBWT process as described herein.
  • FIGS. 6A and 6C present process flows for a phase switch error correction heuristic according to various embodiments herein.
  • FIG. 6B presents various types of phase switch errors that may occur between two pairs of haplotypes.
  • FIG. 7 illustrates using a HMM to process four haplotypes of two individuals to correct phase switch errors.
  • FIG. 8 presents a flow diagram for correcting phase switch errors in IBD segments using a HMM according to various embodiments.
  • FIG. 9A illustrates the structure of an example HMM model.
  • FIG. 9B illustrates a transition matrix that may be used as part of a HMM according to various embodiments herein.
  • FIG. 10 presents a functional diagram of a computer system for performing various embodiments disclosed herein.
  • FIG. 11 presents a block diagram of an IBD-based personal genomics service according to various embodiments herein.
  • FIG. 12 presents a plot comparing the speed of various IBD inference methods.
  • FIGS. 13-16 present plots comparing the IBD estimate errors of various methods.
  • FIGS. 17 and 18 present plots of errors of various methods for various simulated pairs of haplotypes.
  • FIG. 19 presents plots of the false positive and false negative rates of various methods.
  • FIGS. 20A and 20B illustrate the runtime of various methods and the parameters used for assessing runtime.
  • FIGS. 21 and 22 present illustrates of IBD haplotype sharing across Mexican states as determined by a Templated PBWT method as described herein.
  • DETAILED DESCRIPTION
  • The disclosure concerns methods, apparatus, systems, and computer program products for estimating IBD segments between individuals using haplotype data.
  • Numeric ranges are inclusive of the numbers defining the range. It is intended that every maximum numerical limitation given throughout this specification includes every lower numerical limitation, as if such lower numerical limitations were expressly written herein. Every minimum numerical limitation given throughout this specification will include every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification will include every narrower numerical range that falls within such broader numerical range, as if such narrower numerical ranges were all expressly written herein.
  • The headings provided herein are not intended to limit the disclosure.
  • Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Various scientific dictionaries that include the terms included herein are well known and available to those in the art. Although any methods and materials similar or equivalent to those described herein find use in the practice or testing of the embodiments disclosed herein, some methods and materials are described.
  • The terms defined immediately below are more fully described by reference to the Specification as a whole. It is to be understood that this disclosure is not limited to the particular methodology, protocols, and reagents described, as these may vary, depending upon the context they are used by those of skill in the art.
  • Definitions
  • As used herein, the singular terms “a,” “an,” and “the” include the plural reference unless the context clearly indicates otherwise.
  • Unless otherwise indicated, nucleic acids are written left to right in 5′ to 3′ orientation and amino acid sequences are written left to right in amino to carboxy orientation, respectively.
  • The term “plurality” refers to more than one element. For example, the term is used herein in reference to a number of nucleic acid molecules or sequence reads that is sufficient to identify significant differences in repeat expansions in test samples and control samples using the methods disclosed herein.
  • A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. DNA segments that are IBD are IBS per definition, but segments that are not IBD can still be IBS due to the same mutations in different individuals or recombinations that do not alter the segment.
  • The terms “polynucleotide,” “nucleic acid” and “nucleic acid molecules” are used interchangeably and refer to a covalently linked sequence of nucleotides (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3′ position of the pentose of one nucleotide is joined by a phosphodiester group to the 5′ position of the pentose of the next. The nucleotides include sequences of any form of nucleic acid, including, but not limited to RNA and DNA molecules such as cell-free DNA (cfDNA) molecules. The term “polynucleotide” includes, without limitation, single- and double-stranded polynucleotides.
  • The term “parameter” herein refers to a numerical value that characterizes a physical property. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, a ratio (or function of a ratio) between the number of sequence tags mapped to a chromosome and the length of the chromosome to which the tags are mapped, is a parameter.
  • The term “site” refers to a unique position (i.e. chromosome ID, chromosome position and orientation) on a reference genome. In some embodiments, a site may be a residue, a sequence tag, or a segment's position on a sequence.
  • The term “based on” when used in the context of obtaining a specific quantitative value, herein refers to using another quantity as input to calculate the specific quantitative value as an output. The specific quantitative value may be based on multiple other quantities, not just the one identified.
  • As used herein the term “chromosome” refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones). The conventional internationally recognized individual human genome chromosome numbering system is employed herein.
  • Introduction and Overview
  • Modern genetic data sets already number in the millions of genomes and are growing rapidly. Inferring the genomic location and length of identical-by-descent (IBD) segments among the related individuals in these data sets is a central step in many genetic analyses. IBD estimates can best be exploited when they are made using phased haplotypes; this means each individual in the data set is represented by two sequences each of which consist of alleles co-located on the same chromosome and inherited from a different parent. IBD estimates that are phase aware can improve relationship and pedigree inference, allow health and trait inheritance to be traced, and make possible a range of other inferences regarding demographic history and ancestry that are not possible when IBD estimates are made using only unphased genotype data.
  • Estimating phase aware IBD segments is challenging not only because of the large sizes of the genomic data sets but also due to two types of error that break up IBD segments: genotyping error and phase switch/phasing error. Because IBD segments are broken up by meiotic recombination events they are expected to be longer for close relatives. However, long IBD segments are more likely to be impacted by genotype and phasing errors compared to short segments. Thus, errors are particularly problematic when detecting IBD among individuals that are closely related (e.g. first, second, and third degree relatives) since long IBD segments are more likely to be fragmented by these errors. This makes accurate inference of phase aware IBD among close relatives particularly problematic.
  • Genotyping error is error introduced via genotyping (e.g., sequencing) in which an allele that is actually of one type (e.g., a T) gets called a different allele (e.g., an A). This commonly has the effect of prematurely terminating a sequence of matches that would otherwise be long enough to be considered an IBD segment. Phase switch error is error introduced by phasing: maternal and paternal copies are reversed. See FIG. 1. In some typical cases, statistical phasing processes introduce a phase switch error, on average, about every 12 centimorgans.
  • FIG. 1 illustrates phase switch error in fragment long IBD segments. As shown in the left panel, two IBD segments (thicker lines) are shared on a single chromosome in two related individuals (top and bottom). As shown in the right panel, phase switch errors (illustrated by vertical dashed lines) occur at different positions along the chromosome in the two individuals, fragmenting the true IBD segments into many erroneous false IBD segments. The individual represented by the top two haplotypes has six phase switch errors and the individual represented by the bottom two haplotypes has four phase switch errors.
  • Various embodiments disclosed herein address phase shift and/or genotyping error and thereby improve IBD segment identification using phased haplotype data. In certain embodiments, computational methods employ different templates of masked and unmasked regions to exclude certain haplotype sites during IBD segment identification. By combining results using different templates, the computational methods mitigate genotyping error. In certain embodiments, a probabilistic model considering meiotic rates, phase switch error in phased data, and distance between adjacent haplotype sites is used to identify and remove some phase shift error. In some implementations, phased haplotype data is processed using a method based on the positional Burrows-Wheeler transform (PBWT) and a probabilistic hidden Markov model (HMM). In some implementations, phased haplotype data is processed using a method based on the PBWT and a heuristic to correct phase switch errors. These approaches have been found to make fast and accurate phase aware IBD estimates.
  • Various IBD segment finding methods to compute phase aware IBD segments are disclosed herein. These methods may minimize genotyping and phasing error using one or more of the following techniques:
  • 1. To handle genotyping error, an IBD segment computation procedure passes multiple digital templates over phased haplotypes to differentially mask haplotype sites, thereby temporarily ignoring different sites of potential genotype error. The IBD segments being generated by the different templates are combined to effectively remove fragmentation caused by genotyping error. In some embodiments, the procedure is a templated positional Burrows-Wheeler transform. This is the positional Burrows-Wheeler transform (PBWT; Durbin 2014) with substantial modifications to handle genotyping errors and missing data.
  • 2. To handle phase switch error reduction, IBD segments that are likely fragmented by phase switch errors introduced during statistical phasing are addressed using a heuristic that recognizes likely occurrences of phase switch errors and/or a probabilistic model that accounts for error rates of the phasing techniques.
  • In certain embodiments, the model is applied to haplotype/IBD segment data generated using the templating approach described in 1. Some implementations apply a hidden Markov model (HMM) that accounts for both recombination and the phase switching process to reduce these errors. The HMM passes along the chromosomes of the two individuals “stitching” fractured IBD segments together.
  • In some embodiments, a heuristic is applied that identifies fractured IBD segments based on potential IBD segments that start or end within a distance of the start or end of another IBD segment. The heuristic may stitch the IBD segments together by, for example, swapping haplotype segments in a target individual. In some embodiments, the heuristic may be applied as potential IBD segments are generated using the templating approach described in 1 without a separate iteration over the IBD segment data. In some embodiments, the heuristic assumes that IBD segments within a threshold distance of each other are likely to be a single IBD segment fractured by one or more phase switch errors.
  • An example workflow is illustrated in the flow chart of FIG. 2A. As shown, a process 201 begins by initially obtaining phased sequence data for two or more individuals who are to be compared to detect IBD segments. See block 203. In various embodiments, phased data for many more individuals is obtained. In some contexts, hundreds, thousands, or even millions of individuals have their phased haplotype data compared using methods disclosed herein. In some implementations data for greater than about 10 million individuals' phased haplotype data are compared. In any event, the operation 203 involves identifying individuals or haplotypes on which an IBD comparison is to be run.
  • The phased data includes two strings of haplotype data for each individual (one per chromosome). In other words, there are four haplotypes to be considered for two individuals. Phased haplotype data may be obtained from various sources including statistical techniques such as BEAGLE, FINCH, EAGLE, and other known techniques. An example discussion of phasing techniques is presented in U.S. Pat. No. 9,836,576, filed Mar. 13, 2013, and incorporated herein by reference in its entirety.
  • The haplotype data may be represented as strings of allele values (e.g., 1s and 0s) for sites in the haplotype, each of which is the site of a polymorphism. Each such site may be referred to as an index in the haplotype string. In various embodiments, the process assumes that each haplotype site is a biallelic site on a chromosome. It may be given a value of, e.g., 0 for one allele and a value of 1 for the other allele. A typical chromosome may provide hundreds of thousands of sites. Each haplotype may be given its own unique identifier, which may be arbitrarily set.
  • The phased haplotype data is provided to a first processing block as illustrated by a block 205. This processing may reduce the fragmenting impact of genotyping error in IBD segment finding. In the first processing block, multiple operations are performed in parallel, sequentially, or in some combination thereof. In various embodiments, significant computational efficiency is realized by performing these operations together for a given haplotype (e.g., in an inner loop of a software routine as illustrated in the sample code below). The operations performed in the first phase include the following: (a) applying digital templates with masked and unmasked positions to exclude certain haplotype sites along the length of the haplotype, (b) identifying matching allele values at unmasked positions along the haplotypes to identify putative IBD segments for the various digital templates, and (c) merging the resulting IBD segments (e.g., as they are being generated) from the various digital templates.
  • As explained in more detail elsewhere herein, the IBD segment finding logic may be implemented using the PBWT method. Regardless of whether PBWT or another IBD segment finding process is used, the method produces multiple putative IBD sub-segments, one for each different digital template used. The method also merges or stitches together the IBD sub-segment as generated using each of the various templates. As suggested, in some embodiments, the templating, comparing, and merging operations are performed at a given site, before considering the next site, where the three operations are again performed.
  • There are many different ways to construct the digital templates. In various embodiments, they are constructed as small windows that can be “slid” or “ratcheted” along the length of the haplotype strings, considering consecutive sub-segments of the haplotype sites as they go. Criteria to consider in choosing template structures are the length of the template, the number of masked or null sites in the template, and the arrangement of masked and unmasked sites in the template. Typically, full sets of templates are used in the process that contain all possible arrangements of masked and unmasked sites in a template length. An example set of four site digital templates, each employing two masked and two unmasked positions is described below. Of course, the process may alternatively employ larger (or smaller) templates and/or use templates having a higher or lower proportion of null positions per template.
  • The output of the first processing block implemented in operation 205 is a set of IBD segments or other haplotype matching data for combinations of the various individuals whose phased haplotype data is processed. This data is then passed to a second processing block for processing as illustrated in an operation 207. A goal of this second operation may involve reducing the fragmenting impact of phase switch error in IBD segment finding.
  • In operation 207, the haplotype/IBD data is subjected to a probabilistic model that accounts for recombination rates based on meioses, which vary based on degree or relatedness of any two individuals, and rates of phase switch errors introduced by the phasing technique(s) employed to generate the phased haplotype data. The model may also account for other inputs such as the genetic distance between adjacent sites on the haplotype and/or the probability of having full IBD state. As exemplified below, a hidden Markov model may be used to implement the probabilistic model.
  • An optional last operation of the process 201 involves presenting the processed IBD information in a way that can show the degree of relatedness of the two or more haplotypes that are compared. See the operation represented in block 209.
  • Another example workflow is illustrated in the flow chart of FIG. 2B. As shown, a process 251 begins by initially obtaining phased sequence data for two or more individuals who are to be compared for identifying IBD segments. See block 253. As with other embodiments, phased data for many more individuals may be obtained, e.g., hundreds to millions of individuals. Regardless, the operation of block 253 involves identifying individuals and/or haplotypes on which an IBD comparison is to be run.
  • Computational aspects of process 251 include sequentially considering haplotype positions for genotype errors and phase switch errors within individual haplotypes while keeping track of match segments between haplotypes. Processing each new haplotype position is initiated at a process operation 254, which selects the next position in the haplotypes under consideration.
  • For a given haplotype position, a first processing operation 255 considers possible errors in the individual haplotypes using multiple templates such as those described elsewhere herein. Haplotype position matches are determined using these templates, and, from these results, an overall decision on matching segments is made. In some implementations, the resulting match segments have reduced genotyping error.
  • The haplotype position under consideration is then analyzed in a processing operation as illustrated in block 257. A result of this operation may involve reducing the fragmenting impact of genotyping error in IBD segment finding.
  • In operation 257, the haplotype/IBD data may be analyzed by one or more phase switch heuristics and/or models that identify situations where phase switch errors are likely to have occurred. As an example, a heuristic may identify situations where one or more IBD segments between individuals end at a first position and then new IBD segments begin at a second position within a threshold distance from the first position. An identified likely phase switch error may be corrected by joining the IBD segments in an individual identified to possess the likely phase switch error. In some cases, error is corrected by swapping haplotype segments within the identified individual. Regardless of the technique implemented in operation 257, a result of the operation may involve reducing the fragmenting impact of phase switch error in IBD segment finding.
  • An optional last operation of the process 251 involves presenting the processed IBD information in a way that can show the degree of relatedness of the two or more haplotypes that are compared. See the operation represented in block 259.
  • IBD Segments
  • Various repeat expansion analyses using DNA samples involve aligning or mapping sequence reads from a sequencer to a reference sequence. A reference sequence may be the sequence of a whole genome, the sequence of a chromosome, the sequence of a sub-chromosomal region, etc. From a computational perspective, repeats create ambiguities in alignment, which, in turn, can produce biases and errors even at the whole chromosome counting level. Paired end reads coupled with adjustable insert length in various embodiments can help to eliminate ambiguity in alignment of repeating sequences and detection of repeat expansion.
  • In various embodiments, a goal of the process is to use alignment of multiple haplotypes to determine genetic relationship(s) between two or more individuals, or in some cases that potentially involve inbreeding, within a single individual. Fundamentally, the process determines relationships between two haplotypes. IBD may be used for this purpose.
  • IBD can be understood in the context of meiosis and recombinable DNA. Because of recombination and independent assortment of chromosomes, the autosomal DNA and X chromosome DNA (collectively referred to as recombinable DNA) from the parents is shuffled at the next generation, with small amounts of mutation. Thus, only relatives will share long stretches of genomic regions where their recombinable DNA is completely or nearly identical. Such regions are referred to as “identical-by-descent” (IBD) regions because they arose from the same DNA sequences in an earlier generation/common ancestor.
  • In some embodiments, locating IBD regions includes sequencing the entire genomes of the individuals and comparing the genome sequences. In some embodiments, locating IBD regions includes assaying a large number of markers that tend to vary in different individuals and comparing the markers. Examples of such markers include Single Nucleotide Polymorphisms (SNPs), which are points along the genome with two or more variations; e.g., Short Tandem Repeats (STRs), which are repeated patterns of two or more repeated nucleotide sequences adjacent to each other; and Copy-Number Variants (CNVs), which include longer sequences of DNA that could be present in varying numbers in different individuals. Long stretches of DNA sequences from different individuals' genomes in which markers in the same locations are the same indicate that the rest of the sequences, although not assayed directly, are also likely identical.
  • Techniques for matching individual haplotypes, e.g., by using IBD are known. Some of them cannot efficiently handle large numbers of haplotypes and even those that can, do not adequately account for errors. Some discussion of IBD segments may be found in U.S. Pat. No. 8,463,554 filed Dec. 22, 2009, and incorporated herein by reference in its entirety.
  • Templated PBWT
  • To initially detect IBD segments one may extend the positional Burrows-Wheeler transform (PBWT; Durbin 2014). In certain embodiments, a PBWT process is implemented according to the following description. Initially, each haplotype under consideration is given its own unique identifier, which may be arbitrarily set. Then, during execution, the method steps through the sites of all haplotypes under consideration, position-by-position, starting at a first position, which may be identified as position 0. As the method steps through the haplotype sites, it keeps track of two arrays, which are updated for every position (index) in the haplotypes. Also, during a pass through the haplotype sites, a templated PBWT process may apply one, some, or all of the digital templates at each position.
  • The first array is a “positional prefix array” that contains a list of all haplotypes under consideration. It is populated with IDs of all the haplotypes. A separate instance of the positional prefix array is produced each time a new site is encountered while traversing the haplotype string. Over the course of the process, and while certain haplotypes have identical allele values from one position to the next, these haplotypes are grouped together in the positional prefix array. In other words, haplotypes having matching allele values, remain together (in the same block) within the positional prefix array for as long as their alleles match. By keeping the haplotypes together while alleles match, the positional prefix array contains information about putative IBD segments.
  • The second array is a “divergence array” that indicates where matches between any two haplotypes under consideration began. It reflects how many positions/sites back in the haplotype string until there was a difference. In other words, this matrix keeps track of the last time that two haplotypes did not match by, e.g., providing the index value of the last mismatch for any two haplotypes.
  • An example of a general IBD segment finding process 301 is depicted in FIG. 3. As illustrated, the process begins by receiving haplotype strings representing allele values across all haplotypes to be considered in the process. See block 303. This may correspond to block 203 in FIG. 2A and/or block 253 of FIG. 2B.
  • Before considering the first site in the haplotypes, the process lists all haplotypes in the positional prefix array. It may do this randomly or in some order, but typically it does not yet account for the allele values at any haplotype site. The individual haplotypes may be listed by unique identifiers. Further, before considering the first site on the haplotype, the values in divergence array are all set 0 because there are no previous sites that have been considered. The array initializations are illustrated by operation 305 in process 301 of FIG. 3.
  • At the first site of the haplotype strings (which may be reflected as the first column in an aligned set of the haplotype strings), the process goes through all haplotypes in the order of the positional prefix array (which may initially be random or otherwise arbitrary) and orders the haplotypes such that those that have a first allele value (e.g., 0) in the current position are grouped together at the top, and all that have a second allele value (e.g., 1) in the position are grouped together at the bottom. See operation 309. Of course, this order could be reversed or even extended in the case of multiallelic sites. Either way, this operation produces a new positional prefix array in which all haplotype indexes that have a 0 at the current position are grouped together in the array, and all haplotype indexes that have a 1 at the position are grouped together. By “grouped together,” it is meant that haplotype identifiers are provided in adjacent positions in the positional prefix array. This is illustrated in FIG. 4 which shows a group of six haplotype strings and the associated positional prefix array at a few haplotype sites. Obviously, a typical haplotype has many more sites than illustrated for the haplotypes in FIG. 4. Further, a typical IBD segment finding process may simultaneously evaluate many more haplotype strings (e.g., hundreds, thousands, or millions).
  • Regarding the divergence array, by considering the allele values at the first index position with those in the previous position (which does not exist in this iteration), the process notes that all potential IBD segments begin at site 0 and therefore they effectively have a mismatch at position 0. Therefore, the first entry in the divergence array is all zeros. See operation 311 in process 301 and the first column in the divergence array of FIG. 4.
  • The order of haplotypes in the divergence array is the same as in the positional prefix array. Using this order of haplotypes in the divergence array, the values in the divergence array are, for currently matching haplotypes, the sites (index values) of the first matching position between the two adjacent haplotypes within the array. Thus, for adjacent haplotypes that currently match, the value in the divergence array is the first matching position of the current segment. However if two adjacent haplotypes no longer match at the current site the method assigns the next site to the new div array even though it has not peeked ahead and checked if the two haplotypes actually match at the next site. If, in the next iteration, the method learns that the segments still do not match, the relevant value in the divergence array simply gets updated again.
  • Returning to FIG. 3 and ignoring for now operations 313 and 315, when the process has completed operations on a given haplotype site, it determines whether there are any further sites to consider. See operation 317. If there are such sites, the process loops back to an operation 307, where the next site/index is selected for consideration.
  • At the next column (associated with the next site in the aligned haplotype strings), the process again goes through the haplotypes and again rearranges the haplotype identifiers in the positional prefix array so that those having the same allele value at the current position are grouped together, e.g., all haplotypes having a 0 allele value in the current position are grouped at the top of the array and all those having a 1 allele value are grouped at the bottom. Haplotype strings that have the same alleles over multiple consecutive positions stay near one another in the positional prefix array.
  • The divergence array uses the new arrangement of haplotypes (from the positional prefix array), flags any mismatches between adjacent haplotypes and the current position and inserts the next haplotype site number for mismatching pairs. The next site number is the location of the next possible start position for a new match segment.
  • Thus, in some implementations, element i in the divergence array indicates when a current segment match began between the haplotype at ppa[i] and the haplotype at ppa[i−1]. For purposes of example, consider the case where at position 5 in the haplotype alignment, the positional prefix array has the following values:
    • 2
    • 3
    • 1
    • 4
      And the divergence array has the following values:
    • 0
    • 0
    • 3
    • 2
  • In this example haplotypes 2 and 3 have a match that extends from the beginning of the alignment (position 0) to the current position (position 5). Haplotype 1 matches with haplotype 3 from position 3 to position 5 (which also implies haplotype 1 and haplotype 2 have the same match). And haplotype 4 matches with haplotype 1 from position 2 to position 5 (which also implies haplotype 4 matches haplotypes 1, 2, and 3 between positions 3 and 5). There are no mismatched haplotypes at the current site in this example. The routines use the alleles at the current position in the alignment to construct the divergence array for the next position. So, in the above example, if at position 5 two haplotypes do not match, the routine inserts 6 into the position of the haplotypes in the divergence array. Note that the method does not check whether the two haplotypes actually match at position 6, which is why the divergence array does not always contain the beginning position of matches. Once the method actually visits position 6 this value will be overwritten if the haplotypes do not match at site 7. The method continues in this fashion (overwriting values in the divergence array) until actual matches are found.
  • Over the course of the process, the lengths of matching segments of all haplotype pairs are tracked. When the match length of two haplotypes is greater than a preset threshold, the process flags the two haplotypes as having a potential IBD segment. In the example of process 301, it does this by creating new match segment records when two haplotypes have a number of consecutive shared matches that is greater than threshold number of consecutive sites. See operation 313. The threshold value may be chosen to balance speed and sensitivity. In certain examples, the threshold number is between about 50 and 1000 sites (e.g., about 200 matching sites). Similarly, the process may complete a match segment record when two matching haplotypes finally diverge in allele values, thereby ending the match segment. See operation 315. To this end, the process may maintain a separate report populated with matches of greater than the threshold length. The matches may be identified by start position and end position (indexes) and the haplotypes involved in the match (e.g., haplotype ID #11 and haplotype ID #5). In certain embodiments, the match segment includes both the starting and ending sites of the match segment.
  • In cases where a match segment does not end—such as when the end of a haplotype string ends and there are no further sites to consider—the process may still flag the match segment for further consideration. As with matches having defined end points, the match is identified by the two matching haplotypes and their starting index for the match segment. The ending index for the match region is the site at the end of the haplotype.
  • For the sake of clarity, the example of FIG. 3 does not illustrate treatment of the haplotype strings by different digital templates. This will be discussed below.
  • The process 301 proceeds through operations 307-315 for each successive haplotype site and reorders the haplotype IDs in the positional prefix array based on matches (the haplotypes having a 0 at the current position are grouped together and those having a 1 are grouped together).
  • In general, the haplotypes that have long stretches of matched sites stay together in the positional prefix array for long durations. This is because all matching haplotypes stay together in the positional prefix array until one of them has a different allele value at a particular haplotype position. At that point, the one or more haplotypes that diverge from the larger group are moved to a different position in the positional prefix array. By preserving the positional prefix arrays for each position in the haplotype, the method keeps sufficient information to reconstruct all IBD segments for any two haplotypes. This includes all haplotypes under consideration, including the first and last haplotypes in the positional prefix array.
  • In the divergence array, the values for two adjacent haplotypes (in the array) remain at a value of their first match until they again mismatch, at which time the corresponding divergence array value is marked with the next index position after the new mismatch. FIG. 4 presents an example of positional prefix array values and divergence array values for a few sites on an alignment of haplotype strings.
  • When there are no further haplotype sites to consider, as indicated by process block 317, potential IBD segments have been identified and these may be processed in various ways such as, optionally, being used in a relatedness analysis of individuals whose haplotypes were considered in the analysis. See operation 319. Alternatively, and as described with reference to FIG. 2A and as further described below, the potential IBD segments may be further processed by a model such as an HMM model to correct for phase shift errors. However, it should be understood that this additional processing is optional, particularly in situations where phased haplotype data can be expected to have relatively few phase shift errors. In some embodiments, potential IBD segments that are flagged by the PBWT process, but are shorter than a defined genetic length, are excluded from further consideration. An example of a threshold genetic length is between about 1.5 and 3 centimorgans. Thus, for example, if a segment extends over 200 sites but does not extend the full required genetic distance, the method discards the segment.
  • The PBWT process, as well as many other IBD segment finding processes, assumes that there are no errors. If there is in fact an error, it may prematurely truncate a sequence of matches and/or artificially prolong a sequence. Typically, long matches that in fact exist (e.g., between close relatives) are prematurely broken due to genotyping and/or phase switch errors.
  • As mentioned, integrating digital templates into an IBD segment finding process can mitigate the impact of some errors, particularly genotyping errors. One approach employs a digital template that shifts over the haplotype strings and masks certain haplotype sites from consideration as it goes. This approach takes a normal haplotype alignment but applies the template to skip over some sites that would otherwise be considered. With the excluded sites removed from consideration, the process identifies putative IBD segment matches using a general approach such as PBWT. By masking some sites from comparison, sites of erroneous calls may be ignored. Some templates may consider the erroneously called alleles while others exclude them. By considering putative IBD segments created using all the templates, the process can remove breaks and more accurately identify complete IBD segments.
  • The template provides a sliding window of consecutive sites having, in some embodiments, a fixed mask pattern. The template is moved successively along the haplotype string, typically with no overlap of sites between one application of the template and the next. Other than masking some sites, the process is similar to a generic IBD segment finding method such as the PBWT process. That is, in some embodiments, the process generates a positional prefix array and a divergence array for haplotype strings modified by the template. In the course of the process, the computational system flags and records matching segments as before. But the matching segments produced by single templates have some sites excluded. In alternative embodiments, the templating function employs a probabilistic function to pick mask sites.
  • In some embodiments, the mask pattern is deterministic based on the template. The masking of sites may follow a specific pattern, based on each template, rather than a random selection or masking of sites. For example, the mask pattern may remain fixed as the process moves from one haplotype site to the next. In some cases, however, the mask pattern may vary as the process moves over haplotype sites, but such variation may be deterministic rather than random.
  • In certain embodiments, the process employs multiple fixed templates for a given matching problem. Examples of templates include ØhØh (all odd sites), hØhØ (all even sites), ØØhh, hhØØ, ØhhØ, and hØØh, where sites at Ø will be masked out and only sites at h will be used to construct the method. The choice of templates to use together in a process may be made such that for the fixed length of the templates (e.g., the four site templates exemplified here), may guarantee that if there were any errors (e.g., two errors) within this window, at least one of these templates correctly report a match. For example, if there were errors at sites 2 and 4 at one application of a four site template, only the hØhØ would give an error-free read.
  • Depending on the number of indexes/sites in a template and the number excluded within a window, different numbers of templates may be used in a given process. Broadly, the total number of digital templates may be between 1 and k, where k is the number of haplotype sites in the window. In some implementations, the total number of digital templates is k!/(m!*(k−m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
  • In certain embodiments, such as the case of a four site template with two null sites, six templates may be used. However, other template combinations may be employed; e.g., at least two templates, at least six templates, at least eight templates, at least ten templates, etc. In various embodiments, the templates are characterized by a ratio of masked to unmasked sites which ranges between about 1/w and (w−1)/w, where w is the length of the template window In various embodiments, the templates are characterized by a length equivalent to the total size of the haplotype alignment. As examples, a range of template lengths is between about three and ten consecutive sites.
  • A templating function can be tuned to alter sensitivity to error. As suggested, one templating function may be implanted as a decision tree that uses a window size of 4 haplotype sites and 6 templates, and so guarantees any matches within that 4 site window as long as there are no more than 2 errors. If i is the current template (range 0 to 5) and k is the current position within a template window (range 0 to 3), then this templating function TQ, k) may be represented as:
  • T(i, k):
     if i == 0 and (k == 0 or k == 2):
      return 1
     else if i == 1 and (k == 1 or k == 3):
      return 1
     else if i == 2 and (k == 0 or k == 1):
      return 1
     else if i == 3 and (k == 2 or k == 3):
      return 1
     else if i == 4 and (k == 0 or k == 3):
      return 1
     else if i == 5 and (k == 1 or k == 2):
      return 1
     else
      return 0
  • FIG. 5A illustrates application of templates to an IBD segment finding method. The depicted process 501 begins by receiving data and setting up parameters for the routine. This may involve receiving phased haplotypes to be used in the templated matching routine (block 503), defining templates to apply (block 505), and setting up any needed matrices and arrays such as a positional prefix array and a divergence array. The depicted process loops over the various sites of the haplotypes, and at each site it loops over the available templates. This is depicted as follows.
  • The process increments to the next haplotype site at an operation 507, and while at the current site, it iterates over the various templates, starting by incrementing to the next template at an operation 509. While the routine is fixed at a particular template, the process identifies matches and mismatches among the haplotype strings (block 511) and merges match segments for the various templates (block 513). Operation 511 identifies matches only if the current haplotype site is unmasked in the current template. Assuming that the current site is unmasked, operation 511 may be implemented in various ways such as by updating positional prefix and divergence arrays. Note that each template may have its own match segment information. Using this information, operation 513 may merge currently pending segments (at the current haplotype site) from among the various templates.
  • Operation 515 serves to iterate over all the templates while the process is fixed at a given haplotype site and operation 517 serves to iterate over all haplotype sites. Ultimately all haplotype sites are considered and the error-corrected IBD segments are completed. See operation 519.
  • FIG. 5B illustrates another application of templating to an IBD segment finding method. The depicted process 551 begins by receiving data and set up parameters for the routine. This may involve receiving phased haplotypes (block 553) to be used in the templated matching routine, defining templates to apply (block 555), and setting up any needed matrices and arrays such as a positional prefix array and a divergence array.
  • The depicted process loops over the various sites of the haplotypes, and at each site it loops over the available templates. This is depicted in the figure as follows. The process increments to the next haplotype site (the current haplotype site) at an operation 557, and while at the current site, it iterates over the various templates, starting by incrementing to the next template at an operation 559. While the routine is fixed at a particular template, the process identifies matches and mismatches among the haplotype strings (block 561) and merges match segments for the various templates (block 563). An operation 561 identifies matches only if the current haplotype site is unmasked in the current template. Assuming that the current site is unmasked, operation 561 may be implemented in various ways such as by updating positional prefix and divergence arrays. Note that each template may have its own match segment information. Using this information, operation 563 may merge currently pending segments (at the current haplotype site) from among the various templates. Operation 565 serves to iterate over all the templates while the process is fixed at a given haplotype site.
  • An optional operation 566 identifies and addresses phase switch errors at the current haplotype position. As indicated, in some embodiments, phase switch errors may be addressed using a heuristic that recognizes typical phase switch errors.
  • An operation 567 serves to iterate over all haplotype sites. If any of the templates indicates a continuous sequence of matching sites including the current site or sites adjacent to the current site, the match sequence is deemed to continue, even if one or more of the templates indicates a gap in the match sequence. Ultimately all haplotype sites are considered and the error-corrected IBD segments are completed. See operation 569.
  • While processes 501 and 551 are implemented in a manner in which all templates are considered at one site before incrementing to the next site, the templating process need not be implemented in this manner. A different approach considers all sites for one template, saves that template's putative IBD segments, considers all sites for a second template, saves that template's putative IBD segments, and so on until all templates are considered. The resulting putative, template-specific IBD segments may then be merged.
  • Merging may involve aligning the putative IBD segments from each templated result, and then scanning through the template-specific segments for pairs of haplotypes. During this process, as long as one of the six templates (or however many are used) still shows a continuing segment, the method keeps a merged IBD segment intact.
  • In various embodiments, the methods assume that any IBD start or end points within an otherwise continuous IBD segment are caused by errors. This is a reasonable assumption because the comparison is made between two individuals. There is a very low probability that two haplotypes will match, for greater than a threshold length, by chance.
  • In some embodiments, an additional filtering operation to remove some putative IBD segments is performed after one of the above-described processes such as process 301 or process 501. For example, the filter may operate by discarding putative IBD segments of size below three centimorgans.
  • To help illustrate the range of implementations, the following example description of templated PBWT is provided. Given M haplotypes with N bi-allelic sites, the PBWT algorithm can identify identical subsequences of the haplotypes in O(NM) time. A limitation of PBWT is that it requires exact subsequence matches with no errors or missing data. To reduce sensitivity to error and missing data, a templated PBWT may be used. A templated PBWT may be designed or configured to identify matching subsequences of the haplotypes despite missing data and genotyping errors with only a small linear increase in computational time compared to the PBWT.
  • One approach for extending PBWT to report matching haplotypes that include some errors involves constructing multiple replicates of the PBWT data structure. Each of these PBWTs is built by masking the haplotype alignment using a different repeating template. Each PBWT may then be individually swept through identifying exact subsequence matches. The matching subsequences from all PBWTs (each from a different template) may then be merged to produce all matching subsequences within the full (unmasked) haplotype alignment.
  • Many different digital template repeating structures may be used. One example uses different repeating templates: for example ØhØh, hØhØ, ØØhh, hhØØ, ØhhØ, and hØØh, where sites at Ø will be masked out and only sites at h are used to construct the IBD segments using, e.g., PBWT. These example templates address haplotype data with no more than two errors per four site window. The design of these six specific templates guarantees that all matches across any given four site window will be found as long as there are no more than two errors within the window. This is because given any possible arrangement of two errors across four sites in the original haplotype alignment at least one of the PBWT replicates will have those errors masked out and therefore still deliver the match correctly.
  • This method's sensitivity to errors may be modified by changing the arrangement and number of templates. For example, more templates could be utilized to ensure matches across longer windows; indeed (n/k) templates are required to ensure all matches across windows of size n with no more than k errors per window. In practice genotyping errors are often low enough that six templates would be adequate (templates of length 4 with two sites masked); even with a genotyping error rate as high as 0.001 the probability of three errors within a four site window is 3.996×10−9. Running each templated PBWT replicate can be easily parallelized.
  • Templating the PBWT as described above to handle errors and return subsequence matches can be executed in linear time by passing through the data only once and avoiding the need for a post-hoc merging algorithm. In some PBWT implementations, at each position k within the haplotype alignment two arrays are constructed: ppak the positional prefix array and divk the divergence array (Durbin 2014). ppak is a list of the haplotypes sorted so that their reversed prefixes (from k−1 to 0) are ordered. This ordering ensures that haplotypes that match through position k−1 will end up adjacent to one another in ppak. The divergence array divk keeps track of where those matches began, the ith element in divk represents the beginning of the match between the ith element in ppak and the i−1th element in ppak.
  • In certain embodiments, to create a templated PBWT, the method constructs a separate ppaj,k and divj,k for each template j used at site k. In this approach, a set of templates (as described above) may be formalized as an indicator function T (j, k) with the value 0 when the template j skips over site k and 1 if template j processes site k. As the haplotype alignment is passed through, T (j, k) is called for each template j; if T (j, k) is 1 then ppaj,k and divj,k are assembled accordingly. When a matching subsequence of at least Lm sites terminates at site k under template j the start and end positions of the match are stored in auxiliary data structures Ps and Pe, respectively. Ps and Pe are both M by M two dimensional arrays in which the position x, y holds the start/end positions of the match between haplotype x and haplotype y. If another subsegment has already been stored the routine checks to see if the new matching subsegment overlaps and possibly extends the existing subsegment. If they do not overlap, the routine checks if the old matching segment has a genetic length (in cM) of at least Lf and then reports it. The new matching subsegment is then stored in its place. In this way the “templating” of the haplotype alignment is performed within this modified form of the PBWT itself, and matching subsegments from each template are merged and extended directly as the haplotype alignment is passed through. Depending on the choice of T (j, k), the templated PBWT has a worst-case time complexity of O(NMt) where t represents the number of templates defined within T (j, k); thus the method represents a linear tradeoff between the speed of PBWT and sensitivity to error. However, using the templates described in the paragraphs above the templated PBWT takes time O(NM3) because t=6 and N becomes N/2 since each template only processes 2 out of every 4 positions in the alignment.
  • An example templated PBWT is further detailed as pseudocode in Algorithm 1. The algorithm employs 2 parameters: (1) Lm is the minimum number of sites that a sub-segment must span within the haplotype alignment to be merged and extend other sub-segments, and (2) Lf is the final minimum length (in cM) that a segment must have to be reported by the algorithm. Notably the algorithm handles missing data by extending the current longest match. At site k the longest matching haplotype to haplotype ppaj,k will be either ppaj−1, k or ppaj+1, k, so if missing data in ppaj,k is encountered it is simply assumed the haplotype continues to extend the longest match. One additional detail is omitted in Algorithm 1 for space considerations; after passing through all sites in the haplotype alignment the routine loops through the haplotypes one last time to report any “trailing” matches (matches that extend all the way through the end of the haplotypes).
  • FIG. 5D illustrates the Templated PBWT data structures. To identify haplotype sharing among a large panel of haplotypes, the TPBWT passes once through an M by N by t three-dimensional structure where M is the number of haplotypes, N is the number of bi-allelic sites, and t is the number of templates. Each template is a pattern at which sites are masked out (shaded out in the figure). During the left-to-right pass through this structure, at each site k, two arrays are updated. The positional prefix array ppa and the divergence array div are both two dimensional arrays of size M by t. At site k, each of the t columns of ppa and div are updated for the templates that are not masked out. Each of the t columns in ppa contains the haplotypes sorted in order of their reversed prefixes. Similarly, each of the t columns in div contains the position at which matches began between haplotypes adjacent to one another in the sorted order of ppa. During the left-to-right pass through this structure, short fragments of IBD shared between haplotypes i and j, broken up by errors, are identified by each of the t templates (green arrows). As these fragments are identified they are merged and extended with one another in the current match arrays Ps and Pe. While merging and extending IBD fragments a heuristic may be used to scan for and fix putative phase switch errors, as will be discussed further herein.
  • Algorithm 1
  • Templated PBWT algorithm to find matching subsequences. Here t represents the number of templates defined within the templating function T(t, k), N is the number of sites in the haplotype alignment, and M is the number of haplotypes. Additionally Aj,k is the allele at position k for haplotype ppat,j. FIG. 5C shows pseudocodes of the algorithm.
  • Phase Correction Heuristic
  • As noted above, long IBD segments may be fractured by phase switch errors introduced by phasing techniques used to phase the haplotype data of the individuals. The locations and frequencies of such fractures may occur in predictable ways. In some embodiments a heuristic is employed to correct phase switch errors as IBD segments are identified. As noted herein in FIG. 5B and operation 566, a heuristic may be applied to identify and correct potential phase switch errors by analysis of sequential haplotype sites. In certain embodiments, a heuristic involves merging adjacent potential IBD segments (matching segments) that end within a threshold distance and then joining or swapping haplotype segments of a given individual, as necessary, to correct phase switch errors. While not wishing to be bound by any particular theory, this heuristic may improve IBD determination because it is biologically unlikely that two IBD segments on the same or opposite haplotypes of an individual both end within a particular distance (e.g., about 500 or fewer SNPs on the same or opposite haplotypes).
  • In some embodiments, the phase switch heuristic is turned off between closely related pairs of individuals, e.g., between parent and child. For example, if an individual is trio-phased (phasing a child's genotype compared to the parent's genotype), the phasing is considered highly accurate and there are few to no phase switch errors. While the phase switch heuristic is discussed in the context of a Templated PBWT process, the heuristic may be used alone or in conjunction with any of various other algorithms that identify IBD segments for phased haplotype data. Such other algorithms may or may not include analyses that identify and/or correct genotyping and similar errors.
  • As a new IBD segment is identified, the start position of the new IBD segment is compared to the end position of an adjacent IBD segment. In some cases, the start position of the new IBD segment and the end position of the adjacent IBD segment are on the same haplotype with a gap between them. In some cases, the start position of the new IBD segment and the end position of the adjacent IBD segment are on opposing haplotypes with either a gap between them or an overlap.
  • If the length of the gap or overlap between adjacent IBD segments is within a threshold value, then the two IBD segments may be merged to form a single IBD segment. In some embodiments, the threshold value is between about 0-500 SNPs, about 0-300 SNPs, about 200-300 SNPs, or about 0-100 SNPs. In some embodiments, the threshold value for merging adjacent IBD segments is the same threshold value for determining that two haplotypes have a minimum number of sites that a sub-segment spans to be considered a potential IBD segment (Lm). If the two IBD segments are on opposite haplotypes, portions of the haplotypes (i.e., haplotype segments) may be swapped starting at the location of a break in the IBD segments. Thus, as the process proceeds to the next haplotype site and beyond, the haplotypes remain swapped unless/until the heuristic determines another phase switch error has occurred and swaps the haplotypes. As the Templated PBWT continues along the chromosome sites, the haplotypes used to identify potential IBD segments remain swapped. In effect, in some embodiments the heuristic is used to correct the actual haplotypes for phase switch errors in addition to correcting IBD segments.
  • In some implementations, the merged potential IBD segment must have a minimum length Lf to be deemed an IBD segment. Of course, a merged segment that does not initially qualify as an IBD segment may grow to a length required to be an IBD segment.
  • FIG. 6A illustrates via a series of diagrams how the heuristic may be applied for four individuals that share IBD segments with a focal person. Each pair of haplotypes (0 and 1; dotted lines) represents a copy of the haplotypes of the focal person, while the grey bars represent the IBD segments the focal person shares with four other individuals. In other words, the top pair of haplotypes shows IBD segments of the focal person and another individual, the second pair of haplotypes shows IBD segments of the same focal person but with a second other individual, and so on. While a focal person is provided for purposes of illustrating the heuristic, in some embodiments the heuristic is applied to correct phase switch errors in multiple or all individuals simultaneously. In such embodiments, there may not be a focal person, per se. However, to simplify illustration of such embodiments, correcting phase switch errors may be presented with reference to a “focal” individual. In some embodiments, the process is implemented using one or more focal individuals. For example, a focal person may be a new user or customer that is added to the database of, e.g., a person genetics platform. In some implementations, the process runs using the new user or customer as the focal individual against the entire database.
  • Panels B through F represent the TPBWT's sweep along the chromosome from left to right, with the black arrow labeled TPBWT representing the current position. As the TPBWT sweeps along the haplotypes identifying IBD matches it uses a heuristic to identify and fix putative phase switch errors. In diagram A, two haplotypes (0 and 1; dotted lines) of the focal person and the IBD segments they share with the four other individuals in the haplotype alignment are plotted. The focal person has two phase switch errors (red dashed lines) that break up long IBD segments. In diagram B, the Templated PBWT scans left to right along the chromosome, keeping track of IBD segments shared among all pairs of individuals. When a phase switch error is encountered in the focal person all IBD segments shared with the focal person are fragmented at the position of the switch error. In diagram C, the Templated PBWT continues to scan left to right and finds another IBD segment. If the new segment begins near the end of all the old segments but on the complementary haplotype of the focal person, then the Templated PBWT infers a phase switch error to have occurred. In diagram D, since a phase switch error is inferred within the focal person, the focal person's haplotypes are now swapped (at or near the point of the phase switch error) so new IBD segments now merge and extend the fragments on the complementary haplotype that were broken up by the phase switch error. If instead the phase switch error is inferred within one of the other individuals, then that other individual's haplotypes are swapped, and the focal person's haplotypes remain unswitched. In diagram E, when the arrangement of IBD segments on the complementary haplotypes again suggests another phase switch error has been encountered the algorithm swaps the focal person's haplotypes again, but this time at the location of the other phase switch error. In diagram F, the Templated PBWT continues to the end of haplotypes after successfully identifying phase switch errors and “stitching” IBD fragments back into correct long IBD segments.
  • In some embodiments, the heuristic is applied to correct phase switch errors when a new potential IBD segment is identified. As described above, a potential IBD segment is identified when the Templated PBWT reaches the rightmost end of the potential IBD segment. For example, in diagrams C and E only a single new potential IBD segment is identified because the TPBWT has not reached the end of the other IBD segments, triggering their identification as potential IBD segments and application of the heuristic. To illustrate this point, note that in panel E the rightmost fragment in the second from top haplotype pair has not yet been identified since the TPBWT operation has not reached the fragment's rightmost end. However, by panel F, the TPBWT has scanned further right along the chromosome and identified that fragment and applied the heuristic to it (which merged it into the long IBD segment).
  • As may be appreciated by FIG. 5B, the heuristic is applied as the Templated PWBT iterates through successive sites along all chromosomes. As potential IBD segments are identified, the proximity of the identified IBD segment to a prior IBD segment on either haplotype is determined to infer whether there is a phase switch error and the IBD segments should be ‘stitched’ together to form a single IBD segment.
  • FIG. 6B illustrates the possible scenarios considered by the Templated PBWT for adjacent IBD segments. Diagram A shows a first IBD segment shared by P and Q. Diagrams B-E show the various combinations of second IBD segments between P and Q that may be considered by the heuristic (the grey box indicates the two potential IBD segments are within a threshold length of each other). The second IBD segments are within a threshold number of SNPs of the first IBD segments. In Diagram B, all IBD segments are on the same haplotype, but are separated by a gap. This may result from an even number of phase switch errors, causing the haplotypes to be swapped at both ends of the gap such that the potential IBD segments are on the same haplotype. It should also be understood that the exact number of phase switch errors within the gap may be unknown and is not necessary to infer which individual harbors the phase switch error(s). For example, in Diagram B phase switch errors may have occurred in P, Q, or P and Q, and the heuristic may be applied as a result of two potential IBD segments being identified within a threshold range of each other, as discussed herein.
  • It should be also be understood that while the Templated PBWT may be used to correct for short gaps, e.g., 1-3 SNPs, the gap illustrated here may be larger, for example up to about 100 SNPs, or about 300 SNPs, or about 500 SNPs. This may be caused by various errors, including multiple phase switch errors within the gap, such that the matching sites are insufficiently long to be considered potential IBD segments. The heuristic as described herein infers that two segments within the threshold distance are likely to be a single segment broken up by errors, and thus merges them despite the gap.
  • Diagram C illustrates the second IBD segments being on opposite haplotypes for both P and Q, which may be the result of a phase switch error in both individuals. In such cases, the haplotypes may be swapped in both individuals. Diagrams D and E illustrate either Q or P, respectively, having second IBD segments on the opposite haplotype. In these scenarios, if the second IBD segments are within a threshold distance of the first IBD segments but on the opposite haplotype, a phase switch error is inferred and the haplotypes from the second IBD segments forward may be swapped and the first and second IBD merged. In cases D and E, the individuals having the first and second IBD segments on opposite haplotypes are the ones inferred to have the phase switch error, and only those individuals' haplotypes are swapped. The swaps begin at the points on or near where the breaks between the first and second IBD segments occur.
  • As described above, the Templated PBWT handles haplotype error (miscalls) and missing data. It is also robust to “blip” phase switch errors in which the phase at a single site is swapped. However, phase switch errors spaced out along the chromosome will cause long regions of the haplotypes to be swapped and fragment IBD segments as illustrated in FIG. 1. To handle these errors the Templated PBWT may apply a phase correction heuristic that scans for certain patterns of haplotype sharing to identify and correct phase switch errors. Note that for haploid data sets such as human male sex chromosomes this heuristic can be turned off. Large cohorts of samples have patterns of haplotype sharing that are highly informative regarding the location of phase switch errors. The phase switch errors in an individual will fragment all IBD segments shared with that individual at the position of the switch error. Each IBD segment that spans the switch error will be broken into two fragments at the position of the error: these fragments will be on complementary haplotypes within the individual with the error and yet may remain on the same haplotype within the other individual. For some closely related pairs (parent-child) this pattern of haplotype sharing may be the result of actual recombination patterns, however for the majority of more distantly related individuals the pattern can be used to identify phase switch errors.
  • As the TPBWT scans left to right through the haplotype alignment finding new IBD segments it keeps track of previously found IBD segments shared among pairs of haplotypes in Ps and Pe. When a new segment shared between two individuals P and Q is found to be adjacent to an existing segment (either slightly overlapping or with a small gap between them; determined by the parameter Lm) there are a number of possible scenarios (FIG. 6C). If the new segment is on the same haplotypes as the existing segment, then possibly the two segments are fragments of a longer segment broken up by error and should be merged. If the new segment begins near the end of the existing segment and the new segment is not on the same haplotypes as the old segment, then possibly there was a phase switch error in both individuals. If the new segment begins near the end of the existing segment and the new segment is on the same haplotype as the existing segment in individual P but on the complementary haplotypes in individual Q, then possibly there was a phase switch error in individual Q. And of course, the opposite pattern could exist suggesting a phase switch error in individual P. If a phase switch error has been identified in either individual P, Q, or both, then the TPBWT will swap the haplotypes for the individuals containing the error (See FIG. 6B). Now the new IBD segments merge and extend the fragments on the complementary haplotype that were broken up by the phase switch error. When the arrangement of IBD segments on the complementary haplotypes again suggests another phase switch error has been encountered the algorithm stops swapping the individual's haplotypes. This simple heuristic continues to the end of haplotypes “stitching” short stretches of IBD fragmented by errors back into the correct long IBD segments.
  • FIG. 6C illustrates a process for using a phase switching error correction heuristic as described herein. The process 600 begins by receiving haplotype data for a plurality of individuals. The process 600 may be performed while scanning the haplotype data using a method such as a Templated PBWT (e.g., during operation 566 in FIG. 5B) or may be performed as a standalone process. The process begins by receiving phased haplotype data to be considered. The process loops over multiple haplotype sites, considering each one separately, but also considering adjacent and/or near sites that contain IBD segments, particularly nearly terminated IBD segments along with newly started IBD segments. An operation 602 sets the next haplotype site for consideration. In embodiments where this heuristic is integrated with another analysis, e.g., Templated PBWT, this site incrementing operation may already have been performed. An operation 603 determines phased haplotypes of at least two individuals have first potential IBD segments that terminate at a first location. In some embodiments, first potential IBD segments may be identified after a matching subsequence having at least Lm sites terminates. The subsequences must possess more than a threshold length Lm to be considered possible IBD segments. The first potential IBD segments terminate at a first location which may be stored for later reference.
  • An operation 605 determines that the at least two individuals have second IBD segments that start at a second location within a threshold distance of where the first IBD segment ended. The threshold distance may be as described above. In some implementations, the distance may be either a gap or an overlap between the first and second potential IBD segments.
  • An operation 607 identifies or infers which individual, from among those having second IBD segments that starts at a location within the threshold distance of where the first IBD segment ended, likely has a phase switch error. The second potential IBD segments may be between any combination of haplotypes of the at least two individuals. See FIG. 6B. When the second potential IBD segment begins on the opposite haplotype as the first potential IBD segment in at least one of the individuals, a phase switch error is implied in the individual that has the first and second IBD segments on opposite haplotypes. In various embodiments, where two fragments of a fragmented IBD segment occur on opposite haplotype strings of single individual, the operation infers that that individual has the phase switch error, and only that individual's haplotypes need correction for the phase switch error. In cases where both individuals sharing an IBD segment have IBD fragments on opposite haplotype strings, the operation may infer that both individuals have phase switch errors at the same or proximate positions. In some cases operation 607 may be skipped. For example, as shown in diagram B of FIG. 6B, above, the potential IBD segments to be merged are on the same haplotype. In such embodiments it may be difficult to determine which individual had a phase switch error and also unnecessary to properly merge the potential IBD segments (as swapping the haplotypes is not necessary to have the potential IBD segments on the same haplotype).
  • In operation 609, based on the first potential IBD segments and the second potential IBD segments being within the threshold distance of each other, the first potential IBD segments and the second potential IBD segments are merged. If the first potential IBD segments and the second potential IBD segments are on opposite haplotypes for any of the at least two individuals (i.e., a phase switch error occurred for those individuals), the haplotypes may be swapped for those individuals. The swap may occur at the location of the phase switch error.
  • Operation 611 is an optional operation to determine whether each potential IBD segment is sufficiently long and/or meets other criteria to be considered a true IBD (e.g., a minimum length Lf). If the criteria are met, the potential IBD segments are determined to be actual IBD segments.
  • Operation 613 is an optional operation to correct for potential genotyping errors. See e.g., the discussion of the Templated PBWT.
  • In block 615 the current haplotype site is checked for whether it is the last haplotype site. If it is the last haplotype site, the process finishes. If it is not the last haplotype site, the process returns to operation 602 to select the next haplotype site and continue scanning for IBD segments. In some embodiments where process 600 is part of another method to identify IBD segments, e.g., a Templated PBWT, the loop may also allow for the Templated PBWT algorithm to continue scanning the next haplotype site.
  • Error Correction Using Hidden Markov Model (HMM)
  • Description and Application of HMM
  • As shown in FIG. 1, long IBD segments are likely fractured by phase switch errors introduced by phasing techniques used to phase the haplotype data of the individuals. FIG. 7 schematically illustrates using a HMM to process four haplotypes of two individuals (individual 1 and individual 2, two haplotypes for each individual on chromosome 5) to correct phase switch errors, “stitching” IBD segments fractured by phase switch errors. The HMM process covers the full span of the four haplotypes from left to right shown sequentially in the top panel, lower left panel, and lower right panel.
  • FIG. 8 shows a flow diagram illustrating process 800 for correcting phase switch errors in IBD segments using a hidden Markov model (HMM) according to some implementations. In some implementations, the error correction optionally is initiated or triggered when IBD segments of the two individuals being compared meet one or more criteria. See decision box 802. This conditional trigger can avoid processing IBD segments that may not need corrections. In some implementations, for example, the HMM error correction process is triggered when the two individuals' IBD segments include two or more IBD segments on a single chromosome. This can avoid applying error correction when there is only a single IBD segment on a single chromosome where no phase switch errors have occurred. In some implementations, the criterion is met when the two individuals' IBD segments exceed a minimum total amount of shared IBD.
  • In these implementations, if the IBD segments of two individuals do not meet the criterion or criteria, the process ends. See box 802, “No” branch, and box 814. If the IBD segments of the two individuals meet the criterion or criteria, process 800 proceeds to obtain an IBD state for each polymorphic site of a series of polymorphic sites of the two individuals. See the box 802, “Yes” branch and box 804. The IBD state indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and if so, which of the two individuals' phased haplotypes are part of the IBD segment. The series of polymorphic sites are located in one or more pairs of chromosomes of each individual. In some implementations, the polymorphic sites are biallelic sites. In other implementations, more than two alleles may be implemented at a site. In some implementations, the IBD states indicate different conditions of zero IBD, half IBD, and full IBD. In some implementations when the polymorphic site is a biallelic site, the IBD states include nine different IBD states corresponding to nine conditions of zero IBD, half IBD, and full IBD as further described in examples hereinafter.
  • Process 800 then involves applying the HMM to the IBD states. Box 806. The HMM model takes the IBD states as inputs and uses them as observed states of the model. The HMM model also takes as input (i) a rate of recombination based on a number of meioses (m), (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes, and, optionally, (iii) genetic distances between consecutive sites on a chromosome. In some implementations, genetic distances between consecutive sites on a chromosome may be omitted. The term “model input” herein refers to both variables and parameters. The HMM model's transmission rates or probabilities depend on (i) and (ii), and optionally (iii). The application of the HMM model removes likely phase switch errors and produces error corrected IBD segments based on a most likely sequence of hidden IBD states given the observed IBD states. See block 808. Applying the HMM involves using transition probabilities and emission probabilities of the HMM to identify the most likely sequence of hidden IBD states given the observed IBD states. In some implementations, the most likely sequence of hidden IBD states is identified using the Viterbi dynamic programming process.
  • Process 800 is implemented using a computer. It is not practical or feasible to apply the model without a computer due to the complexity of the model. For example, applying the HMM requires using a 36×36 transmission matrix and a 36×36 emission matrix for each polymorphic site, often at hundreds of thousands of polymorphic site, to calculate a most likely sequence. It can take many years and errors for a person to calculate just a single Viterbi sequence.
  • In some implementations, the error correction process involves only the operations illustrated in boxes 804, 806, and 808. Such implementations include: (a) for each polymorphic site in a series of polymorphic sites of two individuals, obtaining an IBD state that indicates whether alleles of the two individuals at the polymorphic site are part of an IBD segment, and, if so, which of the two individuals' phased haplotypes are part of the IBD segment, wherein the series of polymorphic sites are comprised in or lie along one or more pairs of chromosomes; and (b) applying a hidden Markov model (HMM) to the IBD states to produce one or more error-corrected IBD segments, wherein the HMM model takes as input, in addition to the IBD states as observed IBD states, (i) a rate of recombination based on a number of meioses, (ii) at least one rate of phase switch error based on a phasing method employed to phase the haplotypes, and (iii) genetic distances between consecutive sites on a chromosome. In some implementations, genetic distances between consecutive sites on a chromosome may be omitted.
  • Some implementations of the disclosure include multiple iterations of applying the HMM to test different numbers of meioses (m). As illustrated in FIG. 8, process 800 determines whether there are additional values of m that need to be tested. If so, the process loops back to box 804 to obtain IBD states and apply the HMM using a different value of m. In some implementations, the different numbers of meioses are in the range from 0.1 to 14 crossovers. In some implementations, the values of m are in the range from 1 to 14. See block 810, “Yes” branch. If there are no additional values of m to be tested, process 800 proceeds to use the set of error corrected IBD segments having the highest probability as a final estimate of IBD segments for the two individuals. See decision block 810, “No” branch, and block 812. Thereafter, process 800 ends at block 814.
  • In some implementations, m is fixed at 1, requiring no multiple iterations. In other implementations, m=1 provides the set of error corrected IBD segments with the highest probability among multiple values of m.
  • FIG. 9A schematically illustrates the structure of the HMM model. It includes a series of hidden states (illustrated as circles on top) representing the ground-truth IBD states at a series of polymorphic sites and a series of observed states (illustrated as circles at the bottom) representing the observed IBD states based on phased haplotype data of the two individuals. The arrows in the diagram denote conditional dependencies. The hidden states obey the Markov property, such that the hidden state at any site depends on only the hidden state at the immediately previous site. In other words, Hl depends only on Hl−1. Moreover, the observed state at a particular site depends only on the hidden state at the particular site. In other words, Ol depends on only Hl.
  • In the standard type of hidden Markov model considered here, the state space of the hidden variable is discrete. The parameters of a HMM are of two types, transition probabilities and emission probabilities. The transition probabilities between site l−1 and site l determine the probability of Hl given Hl−1.The emission probabilities at site l determine the probability of Ol given Hl.
  • The probability of a series of observed states and a series of hidden states is:
  • Pr ( H 1 , H 2 , H 3 , , O 1 , O 2 , O 3 , ) = Pr ( H 1 ) Pr ( O 1 H 1 ) Pr ( H 2 H 1 ) Pr ( O 2 H 2 ) Pr ( H 3 H 2 ) Pr ( O 3 H 3 ) ( Eq . 1 )
  • where probabilities Pr(Oi|Hi) are emission probabilities/parameters, and probabilities Pr(Hi|Hi−1) are the transition probabilities/parameters.
  • The hidden state space assumes one of N possible values, modeled as a discrete distribution. For each of the N possible states that a hidden variable at point l can be in, there is a transition probability from this state to each of the N possible states of the hidden variable at point l+1, for a total of N2 transition probabilities. Note that the set of transition probabilities for transitions from any given state must sum to 1. As such, the N×N matrix of transition probabilities is a Markov matrix.
  • In addition, for each of the N possible states, there is a set of emission probabilities governing the distribution of the observed variable at a particular point given the state of the hidden variable at that point. The size of this set depends on the nature of the observed variable. For example, if the observed variable is discrete with M possible values, governed by a discrete distribution, there will be a total of N×M emission probabilities.
  • In some implementations, each polymorphic site is biallelic, and the IBD states at any site can include nine different IBD states, indicating nine conditions of zero IBD, half IBD, and full IBD. Considering 4 phased haplotypes for 2 individuals there are 9 different ways site l can be observed as IBD between the two individuals. In some implementations, the IBD state at site l notated as c*ι is represented by a string of 4 integers each corresponding to the 4 haplotypes. The first two integers refer to the maternal and paternal haplotypes in individual 0 and the last two integers refer to the maternal and paternal haplotypes in individual 1. When the haplotype at site l is not IBD practitioners represent it as a 0. Therefore, c*ι=0000 indicates that the two individuals at site l are not IBD, or zero IBD. Accordingly, there are 4 different ways the two individuals could be half IBD: 0101 is when the individuals are IBD through their paternal haplotypes, 1001 is when the individual 0's maternal haplotype is IBD with individual 1's paternal haplotype, 0110 is when the individual 0's paternal haplotype is IBD with individual 1's maternal haplotype, and 1010 is when the individuals are IBD through their maternal haplotypes.
  • Similarly there are 4 different ways the two individuals could be full IBD: 1212, 2112, 1221, and 2121. According to the model IBD segments follow a Markovian process in which segments begin and end independently. This means that when modeling full IBD practitioners must keep track of the identity of each of the two IBD segments so one cannot simply represent full IBD as 1111.
  • Of course, other notations may be used to represent the nine different IBD conditions to a similar effect.
  • In some implementations, the IBD states are expanded by multiplying these different 9 conditions of IBD with four types of phase switch errors. But if one disregards the phase switch error types, there would be 9×9 transition rates between hidden states of two consecutive sites.
  • In some implementations, transition rates of the HMM are based upon a rate at which IBD segments start. In some implementations, the rate at which IBD segments start is modeled as a function of the number of meioses. See box 706, input (i). In some implementations, the rate at which IBD segments start (αs) is modeled as follows.
  • α s = 1 1 + ( 100.0 × m ) ( Eq . 2 )
  • wherein m is the number of meioses, and the recombination rate is assumed to be 1 crossover per 100 cM.
  • In some implementations, transition rates of hidden IBD states are based on a rate at which IBD segments end. In some implementations, the rate at which IBD segments end is modeled as a function of the number of meioses. In some implementations, the rate at which IBD segments ends (αe) is modeled as follows.
  • α e = m 100.0 ( Eq . 3 )
  • wherein m is the number of meioses.
  • In some implementations, the IBD states include nine different IBD states, and transition rates are based on a transition matrix Qa in FIG. 9B. In matrix Qa, each row includes rates for transitioning from the IBD state denoted by the four-letter string at site l to nine IBD states at l+1.
  • In some implementations, the transition rates of hidden IBD states are weighted by a probability that full IBD between the two individuals is truly present. In some implementations, the probability that the full IBD between the two individuals is truly present is modeled as a logistic function of an amount of estimated full IBD. In some implementations, the probability that full IBD between the two individuals is truly present (β) is modeled as follows.
  • β = 1 1 + e η ( 0.125 - l 2 ) ( Eq . 4 )
  • wherein ι2 is the amount of estimated full IBD, and η is an empirical parameter defining the steepness of the logistic function.
  • In some implementations, the transition rates of hidden IBD states are weighted by weighting transitions into full IBD states with β, and waiting transitions out of full IBD states with 1/β. In some implementations, the IBD states include nine different IBD states, and the transition rates of hidden IBD states are based on a transition matrix as follows.
  • Q β ij = { Q α ij × β , if i 5 and j > 5 Q α ij × 1 β , if i > 5 and j 5 Q α ij , otherwise ( Eq . 5 )
  • In some implementations, the transition rates of hidden IBD states are based on the at least one rate of phase switch error. See block 706, model input (ii). In some implementations, there are four types of phase switch errors in the two individuals: no switch error in either individual, switch error in individual 0, switch error in individual 1, and switch error in both individuals. The IBD states include nine different IBD states as described herein. The at least one rate of phase switch error includes a rate of phase switch error for each of the two individuals, μ1 and μ2, respectively. In some implementations, the phase switch error rates for the two individuals are the same when the same phasing method is used for both individuals. In some implementations, the transition rates are based on the 36×36 transition matrix described as follows.
  • Q = [ no switch switch in 0 switch in 1 switch in both Q β μ 0 Q β μ 1 Q β μ 0 μ 1 Q β μ 0 Q β Q β μ 0 μ 1 Q β μ 1 Q β μ 1 Q β μ 0 μ 1 Q β Q β μ 0 Q β μ 0 μ 1 Q β μ 1 Q β μ 0 Q β Q β ] ( Eq . 6 )
  • In some implementations, transition probabilities of hidden IBD states are based upon genetic distances between consecutive sites on a chromosome. See box 706, model input (iii). In some implementations, transition probabilities of hidden IBD states are obtained by exponentiating a transition matrix. In some implementations, transition probabilities of hidden IBD states Yl+1 given hidden IBD states Yl are modeled as:

  • P(Y l+1 |Y l , m, μ 0, μ1, ι2)=e Qd l   (Eq. 7)
  • where m is the number of meioses, μ0 is a phase switch error rate for a first individual of the two individuals, μ1 is a phase switch error rate for a second individual of the two individuals, ι2 is an amount of estimated full IBD, Q is a transition matrix described by Eq. 6, and dl is the genetic distances between sites l and l+1.
  • In some implementations, the emission probabilities of the HMM are dependent on phase switch errors. In some implementations, the emission probabilities are defined by a uniform error term that weights probabilities of observed IBD states based on the four different ways the two individuals may be in phase switch errors.
  • Example HMM Implementation
  • To help illustrate the range of implementations, the following example description of using HMM to determine IBD segments, correcting phase switch error is provided. Under the model, IBD segments shared between two related individuals are generated by passing along the four haplotypes of the two individuals. IBD segments begin and end following a Poisson process with rates that are determined by the number of meioses m that occurred on the pedigree between the two individuals. Phase switch errors occur following a Poisson process with a rate μ determined by empirically testing statistical phasing methods.
  • Let c*={c*1, . . . , c*L} be the L sites observed along a chromosome. Practitioners represent the different ways site l can be observed as IBD between the two individuals as c*l. Additionally, let {right arrow over (d)}={d1, . . . , dL} be a vector of genetic distances where the distance between sites l and l+1 is dl. Finally, let ε be an error term that captures the probability that the IBD state practitioners observed at site l is incorrect due to phasing and/or genotyping errors. The conditional probability P(c*|m, μ, {right arrow over (d)}, ε) is structured as a hidden Markov model (HMM) with latent variables {right arrow over (Y)}={Y1, . . . , YL}. Here Yl represents the different ways site l could be observed as IBD plus the different ways the two individuals may be in a phase switch error.
  • State Space
  • When considering the 4 phased haplotypes for 2 individuals, there are 9 different ways site l can be observed as IBD between the two individuals. Practitioners notate the IBD state at site l as c*l, which is represented by a string of 4 integers each corresponding to the 4 haplotypes. The first two integers refer to the maternal and paternal haplotypes in individual 0 and the last two integers refer to the maternal and paternal haplotypes in individual 1. When the haplotype at site 1 is not IBD inventors represent it as a 0.
  • Therefore, c*l=0000 indicates that the two individuals at site 1 are not IBD, or zero IBD. Accordingly, there are 4 different ways the two individuals could be half IBD: 0101 is when the individuals are IBD through their paternal haplotypes, 1001 is when the individual 0's maternal haplotype is IBD with individual 1's paternal haplotype, 0110 is when the individual 0's paternal haplotype is IBD with individual 1's maternal haplotype, and 1010 is when the individuals are IBD through their maternal haplotypes.
  • Similarly there are 4 different ways the two individuals could be full IBD: 1212, 2112, 1221, and 2121. According to the model IBD segments follow a Markovian process in which segments begin and end independently. This means that when modeling full IBD practitioners must keep track of the identity of each of the two IBD segments so one cannot simply represent full IBD as 1111.
  • For this HMM, hidden states Yl represents the different ways site l could be observed as IBD and also includes information about the different ways in which the two individuals may or may not be in a switch error. There are 4 ways in which the two individuals may or may not be in a switch error: neither are in a switch error, individual 0 is in a switch error, individual 1 is in a switch error, or both individuals could be in a switch error. Since there are 9 ways site l could be observed to be IBD and 4 ways phase switch errors could obfuscate the true IBD state, there are a total of 36 states for the latent variables Yl.
  • IBD Segment Model
  • Practitioners model the transitions among hidden states Yl with an instantaneous transition rate matrix. If, for a moment, practitioners do not consider transitions in which phase switch errors may occur and practitioners only consider transitions among the 9 IBD states that can be observed, practitioners can define the transition matrix Qa shown in FIG. 9A.
  • The matrix Qa defines the way the model moves between zero, half, and full IBD states. As the model passes along the chromosome αs is the rate at which IBD segments begin
  • α s = 1 1 + ( 100.0 × m ) ( Eq . 2 )
  • which depends on the number of meioses m on the pedigree between the two individuals being compared. Similarly, the rate at which IBD segments end is
  • α e = m 100.0 ( Eq . 3 )
  • Another way to interpret αe is that it represents the length of the IBD segments shared between individuals 0 and 1. Likewise αs represents the length of segments with no IBD shared between the two individuals.
  • Full IBD Error Model
  • Phase switch errors break up half IBD segments into shorter adjacent half IBD segments on different haplotypes. Since the templated PBWTs procedure described above imperfectly estimates the start and end positions of IBD segments, when the lengths of the two adjacent half IBD segments are over estimated this can result in a short region of erroneous full IBD. Since full IBD is not expected for most pairs of relatives we model the error in the observed proportion of full IBD using a simple logistic function. Practitioners indicate the probability of full IBD truly being present as β, which is defined as
  • β = 1 1 + e η ( 0.125 - l 2 ) ( Eq . 4 )
  • Here ι2 is the amount of full IBD estimated by the templated PBWTs. When ι2≥25% (the amount expected for full siblings) then the probability of there truly being full IBD is β=1. As the amount of full IBD estimated by the templated PBWTs ι2 approaches zero, β also approaches zero. The steepness of the logistic curve is defined by η. Simulation tests showed η=192 was sufficient to reduce the error in full IBD introduced by the templated PBWTs.
  • We incorporate β into the HMM by weighting the transitions into full IBD states with β and by weighting the transitions out of full IBD states 1/β. More explicitly, practitioners define Qβ as:
  • Q β ij = { Q α ij × β , if i 5 and j > 5 Q α ij × 1 β , if i > 5 and j 5 Q α ij , otherwise ( Eq . 5 )
  • When β=1:0 note that Qβ is equal to Qa.
  • Phase Switch Error Model
  • Finally practitioners are ready to incorporate phase switch errors into the HMM. Practitioners now expand the 9 state Qβ matrix into the full 36 states that are possible among the hidden states Yl.
  • Q = [ no switch switch in 0 switch in 1 switch in both Q β μ 0 Q β μ 1 Q β μ 0 μ 1 Q β μ 0 Q β Q β μ 0 μ 1 Q β μ 1 Q β μ 1 Q β μ 0 μ 1 Q β Q β μ 0 Q β μ 0 μ 1 Q β μ 1 Q β μ 0 Q β Q β ] ( Eq . 6 )
  • Here practitioners incorporate two switch error rates μ0 and μ1 for individual 0 and individual 1, respectively. If both individuals were phased using the same method then one may set μ01. Given the genetic distance dl between sites l and l+1, the number of meioses m, the switch error rates μ0 and μ1, and the amount of full IBD observed ι2, practitioners can find the probability of transitioning between the states Yl and Yl+1 by exponentiating matrix Q:

  • P(Y l+1 |Y l , m, μ 0, μ1, ι2)=e Qd l   (Eq. 7)
  • Probability of Observed States
  • The probability of observing the IBD state c*l given the possible phase switch errors at site l is P(c*l|Yl), which are emission probabilities. Practitioners define these emission probabilities using a simple uniform error term ε≥1 that weights observed states based upon the 4 different ways the two individuals may be in phase switch errors. For example, when there are no phase switch errors in either individual, practitioners weight the emission probability so that it is more probable that the observed IBD state is the “true” hidden state:

  • P(c* l|Yl)=P(0101|0101; no switch errors)=ε×P(0101|1001; no switch errors)   (Eq. 8)
  • Conversely, when there is a switch error in one or both of the two individuals practitioners do not expect the “true” hidden IBD state to be the same as the observed IBD state. For example, if there is a switch error in individual 0 then if the “true” IBD state is 1001 practitioners would most probably observe the state 0101:
  • P ( c * l Y l ) = P ( 0101 1001 ; switch error in ind 0 ) = ɛ × P ( 0101 0101 ; switch error in ind 0 ) ( Eq . 9 )
  • Similarly if there were switch errors in both individuals, when the “true” IBD state is 2121 practitioners would observe 1212 with the highest probability. As ∈→∞, the phase switch errors will entirely determine what IBD state can be observed. If ∈=1, all IBD states can be observed with equal probability regardless of the hidden state Yl.
  • Integrating the Templated PBWT and the HMM
  • An implementation described here is named Phased IBD. It is used in the experiments described hereinafter. It has two stages: First the templated PBWT and then the phase-correcting HMM. The templated PBWT stage generates the IBD segments among all haplotypes very quickly and efficiently. However, if run on many pairs (e.g., thousands or millions pairs) of individuals, the second stage of the algorithm (the HMM) would be prohibitively slow for large datasets. In those datasets, though, most individuals will not share any IBD and so they do not require any phase switch error correction. Moreover, the vast majority of related individuals will only be distantly related and share few IBD segments. If they share at most one IBD segment per chromosome then phase switch errors have not broken up their observed IBD segments and so the HMM does not apply. The HMM, the slow stage of the 2-part algorithm, is thus only applied to the small number of individuals within the dataset that are closely related. Practitioners require a pair of individuals to have at least 2 observed IBD segments on a single chromosome before running them through the phase-correcting HMM, though additionally we can require a minimum total amount of shared IBD (in cM) to increase the speed of the entire algorithm.
  • Processing IBD Segments
  • Identified IBD segments can be used for a wide range of purposes. For instance, the amount (length and number) of IBD sharing depends on the familial relationships between the tested individuals. Therefore, one application of IBD segment detection is to quantify relatedness. For example, methods for using IBD segments to quantify relatedness are described in U.S. Pat. No. 8,463,554, issued Jul. 11, 2013, which is incorporated by reference in its entirety for all purposes.
  • In some implementations, the number of shared IBD segments and the amount of DNA shared by two users are computed based on the IBD segments obtained as described above. In some implementations, the longest IBD segment is determined. In some implementations, the amount of DNA shared includes the sum of the lengths of IBD regions and/or percentage of DNA shared. The sum is referred to as IBDhalf or half IBD because the individuals share DNA identical by descent for at least one of the homologous chromosomes. The predicted relationship between the users, the range of possible relationships, or both, is determined using the IBDhalf and number of segments, based on the distribution pattern of IBDhalf and shared segments for different types of relationships. For example, in a first degree parent/child relationship, the individuals have IBDhalf that is 100% the total length of all the autosomal chromosomes and 22 shared autosomal chromosome segments; in a second degree grandparent/grandchild relationship, the individuals have IBDhalf that is approximately half the total length of all the autosomal chromosomes and many more shared segments; in each subsequent degree of relationship, the percentage of IBDhalf of the total length is about 50% of the previous degree. Also, for more distant relationships, in each subsequent degree of relationship, the number of shared segments is approximately half of the previous number.
  • There is a statistical range of possible relationships for the same IBDhalf and shared segment number. In some implementations, the distribution patterns are determined empirically based on survey of real populations. Different population groups may exhibit different distribution patterns. For example, the level of homozygosity within endogamous populations is found to be higher than in populations receiving gene flow from other groups. In some implementations, the bounds of particular relationships are estimated using simulations of IBD using generated family trees. Based at least in part on the distribution patterns, the IBDhalf, and shared number of segments, the degree of relationship between two individuals can be estimated.
  • IBD segments can also be used determine ethnicity or ancestry. See, e.g., U.S. patent application Ser. No. 15/664,619, filed Jul. 31, 2017, which is incorporated by reference in its entirety for all purposes.
  • Moreover, IBD can be used to perform genotype imputation. Genotype imputation refers to the statistical inference of genotype information not directed assayed. This is especially helpful because many individuals only have sparsely assayed genotype data, usually targeting a limited number of genetic markers in the genome. If IBD segments are determined between two individuals, it can be inferred that the genotype of the two individuals are the same in the IBD segments. Thus the known genotype information of an IBD segment of one of the two individuals can be “imputed” into that of the other individual. This further allows association study between phenotypes and genotypes even using individuals that have only the phenotype data collected but not the genotype data assayed. See, e.g., U.S. patent application Ser. No. 15/256,388, filed Sep. 2, 2016, which is incorporated by reference in its entirety for all purposes.
  • Apparatus and Systems
  • FIG. 10 is a functional diagram illustrating a programmed computer system for performing the pipelined ancestry prediction process in accordance with some implementations. Computer system 100, which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a processor or a central processing unit (CPU)) 102. For example, processor 102 can be implemented by a single-chip processor or by multiple processors. In some implementations, processor 102 is a general purpose digital processor that controls the operation of the computer system 100. Using instructions retrieved from memory 110, the processor 102 controls the reception and manipulation of input data, and the output and display of data on output devices (e.g., display 118). In some implementations, processor 102 includes and/or is used to provide phasing, genotype error correction, and/or phasing error correction, etc. as described herein.
  • Processor 102 is coupled bi-directionally with memory 110, which can include a first primary storage, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM). As is well known in the art, primary storage can be used as a general storage area and as scratch-pad memory, and can also be used to store input data and processed data. Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 102. Also as is well known in the art, primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 102 to perform its functions (e.g., programmed instructions). For example, memory 110 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional. For example, processor 102 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).
  • A removable mass storage device 112 provides additional data storage capacity for the computer system 100, and is coupled either bi-directionally (read/write) or uni-directionally (read only) to processor 102. For example, storage 112 can also include computer-readable media such as magnetic tape, flash memory, PC-CARDS, portable mass storage devices, holographic storage devices, and other storage devices. A fixed mass storage 120 can also, for example, provide additional data storage capacity. The most common example of mass storage 120 is a hard disk drive. Mass storage 112, 120 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 102. It will be appreciated that the information retained within mass storage 112 and 120 can be incorporated, if needed, in standard fashion as part of memory 110 (e.g., RAM) as virtual memory.
  • In addition to providing processor 102 access to storage subsystems, bus 114 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 118, a network interface 116, a keyboard 104, and a pointing device 106, as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed. For example, the pointing device 106 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.
  • The network interface 116 allows processor 102 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown. For example, through the network interface 116, the processor 102 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps. Information, often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network. An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 102 can be used to connect the computer system 100 to an external network and transfer data according to standard protocols. For example, various process implementations disclosed herein can be executed on processor 102, or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing. Additional mass storage devices (not shown) can also be connected to processor 102 through network interface 116.
  • An auxiliary I/O device interface (not shown) can be used in conjunction with computer system 100. The auxiliary I/O device interface can include general and customized interfaces that allow the processor 102 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.
  • In addition, various implementations disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations. The computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of program code include both machine code, as produced, for example, by a compiler, or files containing higher level code (e.g., script) that can be executed using an interpreter.
  • The computer system shown in FIG. 10 is but an example of a computer system suitable for use with the various implementations disclosed herein. Other computer systems suitable for such use can include additional or fewer subsystems. In addition, bus 114 is illustrative of any interconnection scheme serving to link the subsystems. Other computer architectures having different configurations of subsystems can also be utilized.
  • FIG. 11 is a block diagram illustrating an implementation of an IBD-based personal genomics services system that provides services based on IBD information, which include but are not limited to relatedness estimation, relative detection, ancestry determination, and genotype-phenotype association. In this example, a user uses a client device 1102 to communicate with an IBD-based personal genomics services system 1106 via a network 1104. Examples of device 1102 include a laptop computer, a desktop computer, a smart phone, a mobile device, a tablet device or any other computing device. IBD-based personal genomics services system 1106 is used to perform a pipelined process to predict ancestry based on a user's IBD information. IBD-based personal genomics services system 1106 can be implemented on a networked platform (e.g., a server or cloud-based platform, a peer-to-peer platform, etc.) that supports various applications. For example, implementations of the platform perform ancestry prediction and provide users with access (e.g., via appropriate user interfaces) to their personal genetic information (e.g., genetic sequence information and/or genotype information obtained by assaying genetic materials such as blood or saliva samples) and predicted ancestry information. In some implementations, the platform also allows users to connect with each other and share information. Device 110 can be used to implement 1102 or 1106.
  • In some implementations, DNA samples (e.g., saliva, blood, etc.) are collected from genotyped individuals and analyzed using DNA microarray or other appropriate techniques. The genotype information is obtained (e.g., from genotyping chips directly or from genotyping services that provide assayed results) and stored in database 1108 and is used by system 1106 to make ancestry predictions. Reference data, including genotype data of reference individuals, simulated data (e.g., results of machine-based processes that simulate biological processes such as recombination of parents' DNA), pre-computed data (e.g., a precomputed reference haplotype data used in phasing and model training) and the like can also be stored in database 1108 or any other appropriate storage unit.
  • Experimental
  • This experiment compares a method according to some implementations as described above to other computer implemented methods known in the art. All of these methods are computer-implemented. IBD accuracies and computer performances are compared among the methods.
  • The method according to some implementations is labeled as Phased IBD. It includes techniques as described in the templated PBWT and the HMM examples above.
  • A method described by Durbin is labeled as PBWT. See, R. Durbin. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30(9): 1266-1272, 2014.
  • The method described by Naseri is labeled as RaPID. See, A. Naseri, X. Liu, S. Zhang, and D. Zhi. Ultra-fast identity by descent detection in biobank-xcale cohorts using positional Burrows-Wheeler transform. bioRxiv, page 103325, 2017.
  • The method described by Browning is labeled as Refined IBD. See, B. L. Browning and S. R. Browning. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics, 194(2):459-471, 2013.
  • A method described by Zhou is labeled hap-IBD. See, Y. Zhou, S. R. Browning, and B. L. Browning. A fast and simple method for detecting identity by descent segments in large-scale data. BioRxiv, 2019.
  • A method described by Shemirani is labeled iLASH. See, R. Shemirani, G. M. Belbin, C. L. Avery, E. E. Kenny, C. R. Gignouz, and J. L. Ambite. Rapid detection of identity-by-descent tracts for mega-scale datasets. BioRxiv, page 749507, 2019.
  • FIG. 12 shows results comparing the speed of different IBD inference methods. IBD segments were computed for 50781 SNPs from human chromosome 1. The x-axis shows the number of haplotypes and the y-axis shows the time in seconds to infer IBD. Phased IBD used a minimum IBD segment length of 200 sites and 1.5 cM. Durbin' s PBWT used a 200 site minimum. RaPID (version 1.2.3) used 10 runs, 2 successes, window size of 35, and minimum length of 1.5 cM. Refined IBD used a minimum length of 1.5 cM and all default parameter value settings. The results of FIG. 11 show that Phased IBD and PBWT are the fastest among the four methods and similar to each other. RaPID is the slowest.
  • As explained above and illustrated below, Phased IBD can correct various errors (including genotyping errors and phase switch errors) that cannot be addressed by PBWT, it is noteworthy that Phased IBD achieves similar computational speed as PBWT.
  • On the other hand, although RaPID and Refined IBD can correct errors, albeit to a lesser extent than Phased IBD as shown in FIGS. 14 and 15, they require significantly longer computer run time.
  • FIGS. 13-16 compare the IBD estimate errors (or the opposite of accuracy) of various methods. FIG. 13 compares the absolute error in number of IBD segments between the Templated PBWT method (x-axis) and Phased IBD (y-axis) that includes both Templated PBWT component and the HMM component. The results of FIG. 13 show that the HAIM process greatly reduce the error rates, reducing maximum error segments by about three folds from about 300 to 100.
  • FIG. 14 shows that Phased IBD is more accurate than PBWT. Each axis shows the proportion of the genome with incorrect IBD estimates. PBWT (x-axis) is sensitive to both genotyping and phasing errors compared to Phased IBD (y-axis).
  • FIG. 15 shows that Phased IBD is more accurate than Refined IBD. Refined IBD (x-axis) has higher error than Phased IBD (y-axis), although Refined IBD outperforms both PBWT and RaPID.
  • FIG. 16 shows that Phased IBD is more accurate than RaPID (version 1.2.3). RaPID (x-axis), like PBWT, has high error in relationship types with long IBD segments like parent-child and siblings. Unlike other methods, Phased IBD “stitches” together long IBD segments highly fragmented by phasing and genotyping errors.
  • Simulation Study
  • A simulation study was performed to assess the accuracy of IBD inference methods. Simulated haplotype data sets in which the IBD segments shared were perfectly known were created and then modified to introduce realistic levels of genotyping and phasing errors to test the impact of the errors on IBD segment determinations. Haplotypes inherited with recombination over 400 replicated pedigrees were simulated. Each pedigree had three generations and included at least one pair of each type of close relatives that were used for the simulation study: parent-child, grandparent-grandchild, aunt-niece, first cousins, and siblings. Each pedigree founder consisted of a randomly sampled and unrelated research consented 23andMe customer. Recombination was simulated using a Poisson model with a rate of 1 expected crossover per 100 cM. This resulted in simulated haplotypes for 2000 closely related pairs of individuals with perfectly known IBD segments, 400 pairs of each relationship type: parent-child, grandparent-grandchild, aunt-niece, first cousins, and siblings.
  • Genotyping errors were introduced into the simulated data set using a simple model. At each position along the simulated chromosomes an error in the genotype call was introduced with a probability of 0.001. When a site was selected for an error, half of the genotype call would be flipped with equal probability (e.g., a 0/0 genotype would be converted to a 1/0 or a 0/1 with equal probability).
  • Statistical phasing errors were also introduced into the simulated haplotype datasets. All of the simulated haplotypes were converted into their respective diploid genotypes and then the statistical haplotype phasing method Eagle2 was used. For the phasing reference panel a phasing panel that included about 200000 non-Europeans and about 300000 Europeans was used.
  • The various methods used to analyze the simulated data had the following parameters:
  • TABLE 1
    Algorithm parameter values used for the IBD inference
    methods during analysis of simulated data.
    Software Parameters
    TPBWT Default templates, L_m =
    200, L_f = 3.0, phase
    switch error correction heuristic on
    PBWT longWithin 200
    Hap-IBD v1.0 Default options, except
    nthreads = 1 length = 3.0
    RaPID v1.7 w 3 −r 10 −s 2 −d 3.0
    Refined IBD Default options, except
    v16May19 nthreads = 1 length = 3.0
    iLASH Perm_count 12 shingle_size 20
    shingle_overlap 0 bucket_count 4 max_thread
    1 match_threshold 0.99 interest_threshold 0.70
    min_length 3.0 auto_slice 1 cm_overlap 1.4
  • FIG. 17 shows that Templated PBWT had less error in the estimated number of IBD segments shared between relatives than all other methods analyzed. The y-axis represents the number of erroneous IBD segments estimated for a simulated pair of relatives. Error was highest in closely related pairs that shared long IBD segments, particularly parent-child and siblings.
  • FIG. 18 shows that Templated PBWT had less error in the estimated percentage of the genome that is IBD in simulated relatives than other methods. PBWT had less than error than other methods except Templated PBWT, while hap-IBD and Refined IBD had the largest error. Error was higher in simulated pairs that shared long IBD segments, such as parent-child, compared to more distance relatives pairs such as first cousins. Compared to Templated PBWT, the other methods were more sensitive to phasing and genotyping errors in estimated IBD segments.
  • FIG. 19 shows false negative (charts 1901 and 1905) and false positive rates (charts 1903 a-b and 1907 a-b) of inferring IBD by various methods. Rates were calculated for bins of IBD segment lengths. False negative rate by segment is the proportion of true segments in a size bin that do not overlap any segment compared to the total number of true segments in the size bin. False negative rate by segment coverage is the proportion of the length of true segments in a size bin not covered by any estimated segment compared to the total length of true segments in the size bin. False positive rate by segment is the proportion of estimated segments in a size bin that do not overlap any true segment compared to the total number of estimated segments in the size bin. False positive rate by segment coverage is the proportion of the length of estimated segments in a size bin not covered by any true segment compared to the total length of estimated segments in the size bin. Plots 1903 b and 1907 b present the false positive rate with a smaller y-axis scale than plots 1903 a and 1907 a, respectively. For IBD segments ≥4 cM all methods had low false positive rates. For IBD segments greater than ≥6 cM the Templated PBWT outperformed all other methods.
  • FIG. 20A shows IBD computation runtimes for various methods. All methods were run using 1 CPU core. Templated PBWT was faster than all other methods except Durbin' s PBWT. The relative time shows the runtime to compute IBD for each haplotype in sample sizes of 400 to 20000 haplotypes relative to the time needed to compute IBD for each haplotype in a sample size of 400. A slope near zero indicates linear time complexity, while a positive slope indicates super-linear time complexity. Templated PBWT shows a near linear time complexity. FIG. 20B provides additional compute times for parallelized IBD analyses with large sample sizes. Times are shown for in-sample IBD computes on 1 million individuals, out-of-sample IBD computes on 10 k individuals against 1 million, and out-of-sample IBD computes on 10 k individuals against 10 million. The first two rows show the compute times measured when IBD was estimated over 42927 sites of human chromosome 1. The last three rows show those compute times extrapolated to 23 chromosomes with a total of 600 k sites. The last row additionally extrapolates the time for an out-of-sample analysis on 1 million to 10 million individuals. CPU time is the sum of the computation time for all compute cores. Wall clock time is the “real” time that the entire analysis took to run.
  • Haplotype Sharing in Mexico
  • To demonstrate the utility of the IBD estimates made using the Templated PBWT and the 23andMe database a brief case study was performed to examine the geographic patterns of haplotype sharing within Mexico. 9517 research consented 23andMe customers who self reported that all 4 of their grandparents were from the same Mexican state were identified. Each customer was genotyped on either the 23andMe v4 or v5 bead chip genotyping platform. SNPs with <85% genotyping rate, SNPs with MAF <0.001, SNPs with low trio concordance (effect <0.6 and p-value <1e-20), and SNPs with allele counts of 0 within the samples selected for the phasing reference panel were removed. After this quality control filtering the v4 platform had 453065 SNPs and v5 platform had 544042 SNPs. Haplotypes were phased using Eagle2 as described in Loh et. al., Reference-based phasing using the haplotype reference consortium panel. Nature genetics, 48(11):1443, 2016. Individuals on the v4 platform were phased with a reference panel containing 691759 samples. Individuals on the v5 platform were phased with a reference panel containing 286305 samples.
  • IBD sharing among the 9517 individuals was computed using the Templated PBWT with the parameters described in Table 1. IBD estimates among individuals on the same genotyping platform were made using the in-sample method described above, and estimates made among individuals on different platforms was made using the out-of-sample approach described above over the intersection of platform SNPs (only the SNPs present in both the v4 and v5 genotyping platforms). Hierarchical clustering of the mean pairwise IBD haplotype sharing across Mexican states was performed using Ward's method (Ward Jr 1963) in R. To remove close relatives we excluded any pair of individuals that shared more than 20 cM. Geographic maps of the mean pairwise IBD shared across Mexican states were made using the R packages mxmaps, ggplot2, and viridis (Valle-Jones 2019; Wickham 2016; Gamier 2018).
  • FIGS. 21 and 22 show IBD haplotype sharing across Mexican states as determined by a Templated PBWT method. Hierarchal clustering of IBD sharing across Mexican states identified geographic clusters with elevated levels of haplotype sharing, as shown in FIG. 21. There were two large clusters: one cluster in the Yucatan peninsula and the southern Mexican states, and another cluster representing Mexico City and the central and northern states. The clusters were further subdivided into individual states.
  • Mean pairwise IBD haplotype sharing was highest within states and among geographically neighboring states, as shown in FIG. 22. For example, mean IBD shared among individuals with all 4 grandparents from Nuevo Leon was over 12 cM, and the mean pairwise IBD shared between individuals with all 4 grandparents from Nuevo Leon and individuals with all 4 grandparents from neighboring Coahuila and Tamaulipas was over 10 cM. In contrast, mean pairwise sharing between individuals with all 4 grandparents from Nuevo Leon and individuals with all 4 grandparents from Yucatan was less than 6 cM. Similar geographically stratified IBD sharing was found throughout Mexico, as shown in FIG. 22.
  • CONCLUSION
  • In the description above, for purposes of explanation only, specific nomenclature is set forth to provide a thorough understanding of the present disclosure. However, it will be apparent to one skilled in the art that these specific details are not required to practice the teachings of the present disclosure.
  • The language used to disclose various embodiments describes, but should not limit, the scope of the claims. For example, in the previous description, for purposes of clarity and conciseness of the description, not all of the numerous components shown in the figures are described. The numerous components are shown in the drawings to provide a person of ordinary skill in the art a thorough, enabling disclosure of the present specification. The operation of many of the components would be understood and apparent to one skilled in the art. Similarly, the reader is to understand that the specific ordering and combination of process actions described is merely illustrative, and the disclosure may be performed using different or additional process actions, or a different combination of process actions.
  • Each of the additional features and teachings disclosed herein can be utilized separately or in conjunction with other features and teachings for protective coverings. Representative examples using many of these additional features and teachings, both separately and in combination, are described in further detail with reference to the attached drawings. This detailed description is merely intended for illustration purposes to teach a person of skill in the art further details for practicing preferred aspects of the present teachings and is not intended to limit the scope of the claims. Therefore, combinations of features disclosed in the detailed description may not be necessary to practice the teachings in the broadest sense, and are instead taught merely to describe particularly representative examples of the present disclosure. Additionally and obviously, features may be added or subtracted as desired without departing from the broader spirit and scope of the disclosure. Accordingly, the disclosure is not to be restricted except in light of the attached claims and their equivalents.
  • Moreover, the various features of the representative examples and the dependent claims may be combined in ways that are not specifically and explicitly enumerated in order to provide additional useful embodiments of the present teachings. It is also expressly noted that all value ranges or indications of groups of entities disclose every possible intermediate value or intermediate entity for the purpose of original disclosure, as well as for the purpose of restricting the claimed subject matter. It is also expressly noted that the dimensions and the shapes of the components shown in the figures are designed to help to understand how the present teachings are practiced, but not intended to limit the dimensions and the shapes shown in the examples.
  • None of the pending claims includes limitations presented in “means plus function” or “step plus function” form. (See, 35 USC § 112(f)). It is Applicant's intent that none of the claim limitations be interpreted under or in accordance with 35 U.S.C. § 112(f).

Claims (42)

What is claimed is:
1. A computer implemented method of processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes, the method comprising:
providing a first digital template comprising a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites;
providing a second digital template comprising a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different;
providing two or more haplotypes strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and
computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying the second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the merged set of IBD segments has reduced impact from genotyping errors compared to a set of IBD segments generated without applying the first and second digital templates.
2. The method of claim 1, wherein the first and second templates each have a size of at least four consecutive haplotype sites.
3. The method of claim 1, wherein identifying the first matches among alleles at unmasked sites comprises sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings.
4. The method of claim 1, wherein computationally identifying IBD segments between the two or more haplotype strings further comprises:
computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and
wherein merging the first and second matches among alleles to produce a merged set of IBD segments further comprises computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments.
5. The method of claim 4, wherein computationally identifying additional matches among alleles at unmasked sites employs a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template.
6. The method of claim 1, wherein the two or more haplotype strings comprise at least one thousand haplotype strings.
7. The method of claim 1, wherein computationally identifying IBD segments between the two or more haplotype strings comprises performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings.
8. The method of claim 7, wherein computationally merging the first and second matches among alleles is performed while considering individual polymorphic sites of the two or more haplotype strings using the PBWT.
9. The method of claim 1, wherein applying the first digital template comprises a deterministic process employing the first arrangement of masked and unmasked sites.
10. A system for processing haplotypes to reduce genotyping errors when determining identity by descent (IBD) segments between haplotypes, the system comprising:
(a) one or more processors and associated memory;
(b) computer readable instructions for:
providing a first digital template comprising a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites;
providing a second digital template comprising a second arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the first and second arrangements are different;
providing two or more haplotype strings for identification of IBD segments therebetween, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and
identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of the haplotype strings at unmasked sites produced by applying the first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying the second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the merged set of IBD segments has reduced impact from genotyping errors compared to a set of IBD segments generated without applying the first and second digital templates.
11. The system of claim 10, wherein the first and second templates each have a size of at least four consecutive haplotype sites.
12. The system of claim 10, wherein the instructions for identifying the first matches among alleles at unmasked sites comprises instructions for sequentially applying the first digital template to the two or more haplotype strings, each time moving to a next sequential section of the two or more haplotype strings.
13. The system of claim 10, wherein the instructions for identifying IBD segments between the two or more haplotype strings further comprise instructions for:
computationally identifying additional matches among alleles at unmasked sites produced by applying one or more additional digital templates to the two or more haplotype strings, wherein the one or more additional digital templates have additional arrangements of masked and unmasked sites in windows of consecutive haplotype sites, and each of the additional arrangements is different from both the first and the second arrangements, and
wherein merging the first and second matches among alleles to produce a merged set of IBD segments further comprises computationally merging the additional matches with the first and second matches to produce the merged set of IBD segments.
14. The system of claim 13, wherein the instructions for identifying additional matches among alleles at unmasked sites employ a third digital template, a fourth digital template, a fifth digital template, and a sixth digital template.
15. The system of claim 14, wherein the first through sixth digital templates each comprise two masked sites and two unmasked sites.
16. The system of claim 10, wherein the first digital template and the second digital template each have a ratio of masked sites to unmasked sites of between about 2:1 to about 1:2.
17. The system of claim 10, wherein the two or more haplotype strings comprise at least one thousand haplotype strings.
18. The system of claim 10, wherein the two or more haplotype strings comprise at least one million haplotype strings.
19. The system of claim 10, wherein the instructions for identifying IBD segments between the two or more haplotype strings comprise instructions performing a positional Burrows-Wheeler transform (PBWT) on the unmasked sites produced by applying the first and second templates to the two or more haplotype strings.
20. The system of claim 19, wherein the instructions for merging the first and second matches among alleles comprise instructions for performing the merging while considering individual polymorphic sites of the two or more haplotype strings using the PBWT.
21. The system of claim 10, wherein the total number of digital templates is between 2 and k, where k is the number of haplotype sites in the window.
22. The system of claim 10, wherein the total number of digital templates is k!/(m!*(k−m)!), where k is the number of haplotype sites in the window and m is the number of masked sites in the window.
23. The system of claim 10, wherein applying the first digital template comprises a deterministic process employing the first arrangement of masked and unmasked sites.
24. A method of identifying IBD segments between two or more haplotype strings, each of the two or more haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism, the method comprising:
(a) computationally identifying IBD segments between the two or more haplotype strings by (i) identifying first matches among alleles of two or more haplotype strings at unmasked sites produced by applying a first digital template to the two or more haplotype strings, (ii) identifying second matches among alleles of the haplotype at unmasked sites produced by applying a second digital template to the two or more haplotype strings, and (iii) merging the first and second matches among alleles to produce a merged set of IBD segments, wherein the first digital template comprises a first arrangement of masked and unmasked sites in a window of consecutive haplotype sites, wherein the second digital template comprises a second arrangement of masked and unmasked sites in the window of consecutive haplotype sites, and wherein the first and second arrangements are different; and
(b) identifying a potential phase switch error in at least one of the two or more haplotype strings; and
(c) correcting the phase switch error.
25. The method of claim 24, wherein identifying the potential phase switch error comprises identifying proximal IBD segments in at least one pair of the two or more haplotype strings.
26. A computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes, the method comprising:
providing two or more paired haplotypes strings for identification of IBD segments therebetween, each of the two or more paired haplotype strings representing a sequence of allele values at polymorphic sites in a haplotype of an organism; and
computationally iterating through the two or more paired haplotype strings by:
(i) identifying a first potential IBD segment between the two or more paired haplotype strings by identifying matches among alleles of the haplotype strings;
(ii) comparing the first site of the first potential IBD segment to the last site of a previously identified second potential IBD segment
(iii) determining that the last site of the second potential IBD segment and the first site of the first potential IBD segment are within a threshold number of sites of each other; and
(iv) merging the first potential IBD segment and the second potential IBD segment to form a combined potential IBD segment.
27. The method of claim 26, further comprising, after (a) and before (b), removing some initial IBD segments determined to belong to haplotypes having less than a threshold amount of initial IBD segments, wherein the initial IBD segments provided to the HEIM in (b) have had some initial IBD segments removed.
28. The method of claim 27, wherein the threshold amount of initial IBD segments is less than two initial IBD segments per chromosome.
29. A computer implemented method of determining identity by descent (IBD) segments, the method comprising:
determining first potential IBD segments among phased haplotype data for a plurality of individuals, wherein the first potential IBD segments have an end site;
determining second potential IBD segments among haplotype data for the plurality of individuals, wherein the second potential IBD segments have a start site;
determining that the end site of the first potential IBD segments and the start site of the second potential IBD segments are within a threshold number of sites of each other; and
merging the first potential IBD segments and the second potential IBD segments to form a combined potential IBD segment.
30. The method of claim 29, wherein the first potential IBD segments and the second potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the method further comprises:
determining that a phase switch error occurred at a site between the first potential IBD segment and the second potential IBD segment for the individual; and
swapping the haplotypes for the individual from the position of the phase switch error onward.
31. The method of claim 29, wherein the first potential IBD segment and the second potential IBD segment each span at least the threshold number of sites.
32. The method of claim 29, wherein the threshold number of sites is between about 0 and 500 SNPs.
33. A system for determining identity by descent (IBD) segments, the system comprising:
(a) one or more processors and associated memory; and
(b) computer readable instructions for:
determining first potential IBD segments among phased haplotype data for a plurality of individuals, wherein the first potential IBD segments have an end site;
determining second potential IBD segments among haplotype data for the plurality of individuals, wherein the second potential IBD segments have a start site;
determining that the end site of the first potential IBD segments and the start site of the second potential IBD segments are within a threshold number of sites of each other; and
merging the first potential IBD segments and the second potential IBD segments to form a combined potential IBD segment.
34. The system of claim 33, wherein the first potential IBD segments and the second potential IBD segments are on different haplotypes for an individual of the plurality of individuals, and the computer readable instructs further comprise instructions for:
determining that a phase switch error occurred at a site between the first potential IBD segment and the second potential IBD segment for the individual; and
swapping the haplotypes for the individual from the position of the phase switch error onward.
35. The system of claim 33, wherein the first potential IBD segments and the second potential IBD segments overlap for an individual of the plurality of individuals.
36. The system of claim 33, wherein the first potential IBD segment and the second potential IBD segment each span at least the threshold number of sites.
37. The system of claim 33, wherein the threshold number of sites is between about 0 and 500 SNPs.
38. The system of claim 33, wherein the plurality of individuals do not share a parent-child relationship.
39. The system of claim 33, wherein the computer readable instructions further comprise instructions for:
determining a third potential IBD segment among phased haplotype data for a plurality of individuals, wherein the third potential IBD segment has a start site;
determining that the end site of the combined potential IBD segment and the start site of the third potential IBD segments are within the threshold number of SNPs of each other; and
merging the combined potential IBD segments and the third potential IBD segments.
40. The system of claim 39, wherein the combined potential IBD segment and the third potential IBD segment are on different haplotypes for an individual of the plurality of individuals, and the computer readable instructions further comprise instructions for:
determining that a phase switch error occurred at a site between the combined potential IBD segment and the third potential IBD segment for the individual; and
swapping the haplotypes for the individual from the position of the phase switch error.
41. The system of claim 33, wherein the computer readable instructions further comprise instructions for determining that the combined potential IBD segments have a minimum length in centimorgans and storing the combined potential IBD segments as IBD segments for the plurality of individuals.
42. A computer implemented method of processing haplotypes to reduce errors when determining identity by descent (IBD) segments between haplotypes, the method comprising:
(a) computationally identifying initial IBD segments between two or more haplotype strings by identifying first matches among alleles of the haplotype strings using a plurality of templates, each comprising a unique arrangement of masked and unmasked sites in a window of consecutive haplotype sites; and
(b) providing information characterizing the initial IBD segments to a hidden Markov model (HMM) which removes potential phase switch errors to produce final IBD segment, wherein the HMM analyzes the information characterizing the initial IBD segments using distances between consecutive haplotype sites on a chromosome, one or more rates of recombination based on meiosis, and one or more rates of phase switch error based on a phasing method employed to phase the haplotypes.
US16/947,107 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments Abandoned US20210020266A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US16/947,107 US20210020266A1 (en) 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments
US17/249,520 US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments
US18/503,841 US12046327B1 (en) 2019-07-19 2023-11-07 Identity-by-descent relatedness based on focal and reference segments
US18/737,679 US12260936B2 (en) 2019-07-19 2024-06-07 Identity-by-descent relatedness based on focal and reference segments
US19/057,224 US20250191684A1 (en) 2019-07-19 2025-02-19 Identity-by-descent relatedness based on focal and reference segments

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962876497P 2019-07-19 2019-07-19
US16/947,107 US20210020266A1 (en) 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US17/249,520 Continuation US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments
US18/503,841 Continuation US12046327B1 (en) 2019-07-19 2023-11-07 Identity-by-descent relatedness based on focal and reference segments

Publications (1)

Publication Number Publication Date
US20210020266A1 true US20210020266A1 (en) 2021-01-21

Family

ID=74192449

Family Applications (5)

Application Number Title Priority Date Filing Date
US16/947,107 Abandoned US20210020266A1 (en) 2019-07-19 2020-07-17 Phase-aware determination of identity-by-descent dna segments
US17/249,520 Abandoned US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments
US18/503,841 Active US12046327B1 (en) 2019-07-19 2023-11-07 Identity-by-descent relatedness based on focal and reference segments
US18/737,679 Active US12260936B2 (en) 2019-07-19 2024-06-07 Identity-by-descent relatedness based on focal and reference segments
US19/057,224 Pending US20250191684A1 (en) 2019-07-19 2025-02-19 Identity-by-descent relatedness based on focal and reference segments

Family Applications After (4)

Application Number Title Priority Date Filing Date
US17/249,520 Abandoned US20210193257A1 (en) 2019-07-19 2021-03-04 Phase-aware determination of identity-by-descent dna segments
US18/503,841 Active US12046327B1 (en) 2019-07-19 2023-11-07 Identity-by-descent relatedness based on focal and reference segments
US18/737,679 Active US12260936B2 (en) 2019-07-19 2024-06-07 Identity-by-descent relatedness based on focal and reference segments
US19/057,224 Pending US20250191684A1 (en) 2019-07-19 2025-02-19 Identity-by-descent relatedness based on focal and reference segments

Country Status (4)

Country Link
US (5) US20210020266A1 (en)
EP (1) EP4000070A4 (en)
CA (1) CA3147888A1 (en)
WO (1) WO2021016114A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200321073A1 (en) * 2019-04-03 2020-10-08 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US11049589B2 (en) 2008-12-31 2021-06-29 23Andme, Inc. Finding relatives in a database
US11171962B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genome sharing
US11170873B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genetic comparisons between grandparents and grandchildren
US11170047B2 (en) 2012-06-06 2021-11-09 23Andme, Inc. Determining family connections of individuals in a database
US11514627B2 (en) 2019-09-13 2022-11-29 23Andme, Inc. Methods and systems for determining and displaying pedigrees
US11521708B1 (en) 2012-11-08 2022-12-06 23Andme, Inc. Scalable pipeline for local ancestry inference
US11531445B1 (en) 2008-03-19 2022-12-20 23Andme, Inc. Ancestry painting
US20230019141A1 (en) * 2021-07-07 2023-01-19 Mars, Incorporated System, method, and apparatus for predicting genetic ancestry
US20230260608A1 (en) * 2022-02-16 2023-08-17 Ancestry.Com Dna, Llc Relationship prediction
US11783919B2 (en) 2020-10-09 2023-10-10 23Andme, Inc. Formatting and storage of genetic markers
US11817176B2 (en) 2020-08-13 2023-11-14 23Andme, Inc. Ancestry composition determination
US12046327B1 (en) * 2019-07-19 2024-07-23 23Andme, Inc. Identity-by-descent relatedness based on focal and reference segments
US20250307271A1 (en) * 2019-08-02 2025-10-02 Ancestry.Com Dna, Llc Determining data inheritance of data segments

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050191731A1 (en) * 1999-06-25 2005-09-01 Judson Richard S. Methods for obtaining and using haplotype data

Family Cites Families (135)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4632724A (en) * 1985-08-19 1986-12-30 International Business Machines Corporation Visibility enhancement of first order alignment marks
US5692501A (en) 1993-09-20 1997-12-02 Minturn; Paul Scientific wellness personal/clinical/laboratory assessments, profile and health risk managment system with insurability rankings on cross-correlated 10-point optical health/fitness/wellness scales
US20040229213A1 (en) 1998-06-25 2004-11-18 Institut Pasteur Exhaustive analysis of viral protein interactions by two-hybrid screens and selection of correctly folded viral interacting polypeptides
US6703228B1 (en) 1998-09-25 2004-03-09 Massachusetts Institute Of Technology Methods and products related to genotyping and DNA analysis
US20070150978A1 (en) * 1999-05-17 2007-06-28 Byrum Joseph R Nucleic acid molecules and other molecules associated with plants
US20080008996A1 (en) * 1999-06-29 2008-01-10 Byrum Joseph R Nucleic acid molecules and other molecules associated with plants
US20070277267A1 (en) * 1999-10-15 2007-11-29 Byrum Joseph R Nucleic acid molecules and other molecules associated with plants
US6730023B1 (en) 1999-10-15 2004-05-04 Hemopet Animal genetic and health profile database management
US6640211B1 (en) 1999-10-22 2003-10-28 First Genetic Trust Inc. Genetic profiling and banking system and method
US20020133495A1 (en) 2000-03-16 2002-09-19 Rienhoff Hugh Y. Database system and method
US7142205B2 (en) 2000-03-29 2006-11-28 Autodesk, Inc. Single gesture map navigation graphical user interface for a personal digital assistant
US6570567B1 (en) 2000-05-31 2003-05-27 Alan Eaton System and method for using a graphical interface for the presentation of genealogical information
WO2002033520A2 (en) 2000-10-18 2002-04-25 Genomic Health, Inc. Genomic profile information systems and methods
US20030130798A1 (en) 2000-11-14 2003-07-10 The Institute For Systems Biology Multiparameter integration methods for the analysis of biological networks
US20030113727A1 (en) 2000-12-06 2003-06-19 Girn Kanwaljit Singh Family history based genetic screening method and apparatus
US6988040B2 (en) 2001-01-11 2006-01-17 Affymetrix, Inc. System, method, and computer software for genotyping analysis and identification of allelic imbalance
US7957907B2 (en) 2001-03-30 2011-06-07 Sorenson Molecular Genealogy Foundation Method for molecular genealogical research
US7005293B2 (en) 2001-12-18 2006-02-28 Agilent Technologies, Inc. Multiple axis printhead adjuster for non-contact fluid deposition devices
US20040002818A1 (en) 2001-12-21 2004-01-01 Affymetrix, Inc. Method, system and computer software for providing microarray probe data
US7343565B2 (en) 2002-03-20 2008-03-11 Mercurymd, Inc. Handheld device graphical user interfaces for displaying patient medical records
US7135286B2 (en) 2002-03-26 2006-11-14 Perlegen Sciences, Inc. Pharmaceutical and diagnostic business systems and methods
US8855935B2 (en) 2006-10-02 2014-10-07 Ancestry.Com Dna, Llc Method and system for displaying genetic and genealogical data
US20080154566A1 (en) 2006-10-02 2008-06-26 Sorenson Molecular Genealogy Foundation Method and system for displaying genetic and genealogical data
US20040012633A1 (en) 2002-04-26 2004-01-22 Affymetrix, Inc., A Corporation Organized Under The Laws Of Delaware System, method, and computer program product for dynamic display, and analysis of biological sequence data
US20040175700A1 (en) 2002-05-15 2004-09-09 Elixir Pharmaceuticals, Inc. Method for cohort selection
US20070037182A1 (en) 2002-05-28 2007-02-15 Gaskin James Z Multiplex assays for inferring ancestry
US20040229231A1 (en) 2002-05-28 2004-11-18 Frudakis Tony N. Compositions and methods for inferring ancestry
US7217807B2 (en) 2002-11-26 2007-05-15 Rosetta Genomics Ltd Bioinformatically detectable group of novel HIV regulatory genes and uses thereof
US20040146870A1 (en) 2003-01-27 2004-07-29 Guochun Liao Systems and methods for predicting specific genetic loci that affect phenotypic traits
US20060257888A1 (en) 2003-02-27 2006-11-16 Methexis Genomics, N.V. Genetic diagnosis using multiple sequence variant analysis
EP1613734A4 (en) 2003-04-04 2007-04-18 Agilent Technologies Inc Visualizing expression data on chromosomal graphic schemes
US20050039110A1 (en) 2003-04-28 2005-02-17 De La Vega Francisco M. Methodology and graphical user interface to visualize genomic information
EP2360472B1 (en) 2003-12-17 2015-04-15 Fred Hutchinson Cancer Research Center Methods and materials for canine breed identification
US20060046256A1 (en) 2004-01-20 2006-03-02 Applera Corporation Identification of informative genetic markers
US7629122B2 (en) 2004-05-03 2009-12-08 The Children's Hospital Of Philadelphia Methods and compositions for the diagnosis of Cornelia de Lange Syndrome
JP4514687B2 (en) 2004-11-08 2010-07-28 株式会社東芝 Pattern recognition device
US20060161460A1 (en) 2004-12-15 2006-07-20 Critical Connection Inc. System and method for a graphical user interface for healthcare data
US20060166224A1 (en) 2005-01-24 2006-07-27 Norviel Vernon A Associations using genotypes and phenotypes
US8719045B2 (en) 2005-02-03 2014-05-06 The United States Of America As Represented By The Secretary Of The Department Of Health And Human Services, Centers For Disease Control And Prevention Personal assessment including familial risk analysis for personalized disease prevention plan
US7951078B2 (en) 2005-02-03 2011-05-31 Maren Theresa Scheuner Method and apparatus for determining familial risk of disease
AU2006214039A1 (en) 2005-02-18 2006-08-24 Dna Print Genomics Multiplex assays for inferring Ancestry
US7681177B2 (en) 2005-02-28 2010-03-16 Skyler Technology, Inc. Method and/or system for transforming between trees and strings
US20060287876A1 (en) 2005-06-20 2006-12-21 Davor Jedlicka Computer system and method for assessing family structures using affinographs
EP1904938A2 (en) 2005-06-28 2008-04-02 Metacarta, Inc. User interface for geographic search
US8285486B2 (en) 2006-01-18 2012-10-09 Dna Tribes Llc Methods of determining relative genetic likelihoods of an individual matching a population
US20070178500A1 (en) 2006-01-18 2007-08-02 Martin Lucas Methods of determining relative genetic likelihoods of an individual matching a population
US8340950B2 (en) 2006-02-10 2012-12-25 Affymetrix, Inc. Direct to consumer genotype-based products and services
US7818281B2 (en) 2006-02-14 2010-10-19 Affymetrix, Inc. Computer software for visualizing recombination events in a group of individuals from recombination breakpoints and assignments in high density SNP genotyping data by generating a color-coded view for each individual chromosome and a whole genome view for the group
WO2007098805A1 (en) 2006-02-28 2007-09-07 Mentor Graphics Corp. Monitoring physical parameters in an emulation environment
WO2008052344A1 (en) 2006-11-01 2008-05-08 0752004 B.C. Ltd. Method and system for genetic research using genetic sampling via an interactive online network
US8990198B2 (en) 2006-11-02 2015-03-24 Ilan Cohn Method and system for computerized management of related data records
US20080131887A1 (en) 2006-11-30 2008-06-05 Stephan Dietrich A Genetic Analysis Systems and Methods
US7844609B2 (en) 2007-03-16 2010-11-30 Expanse Networks, Inc. Attribute combination discovery
WO2009002942A2 (en) 2007-06-22 2008-12-31 Apple Inc. Touch screen device, method, and graphical user interface for providing maps, directions, and location-based information
WO2009010948A1 (en) 2007-07-18 2009-01-22 Famillion Ltd. Method and system for use of a database of personal data records
US20090043752A1 (en) 2007-08-08 2009-02-12 Expanse Networks, Inc. Predicting Side Effect Attributes
US20090099789A1 (en) 2007-09-26 2009-04-16 Stephan Dietrich A Methods and Systems for Genomic Analysis Using Ancestral Data
US8589437B1 (en) 2007-10-15 2013-11-19 23Andme, Inc. De-identification and sharing of genetic data
US9336177B2 (en) 2007-10-15 2016-05-10 23Andme, Inc. Genome sharing
US10275569B2 (en) 2007-10-15 2019-04-30 22andMe, Inc. Family inheritance
US8510057B1 (en) 2007-10-15 2013-08-13 23Andme, Inc. Summarizing an aggregate contribution to a characteristic for an individual
US7983893B2 (en) 2008-01-08 2011-07-19 Mentor Graphics Corporation Fault support in an emulation environment
US20090182579A1 (en) 2008-01-10 2009-07-16 Edison Liu Method of processing genomic information
US20090198519A1 (en) 2008-01-31 2009-08-06 Mcnamar Richard Timothy System for gene testing and gene research while ensuring privacy
US8214192B2 (en) 2008-02-27 2012-07-03 Mentor Graphics Corporation Resource remapping in a hardware emulation environment
US20170330358A1 (en) 2008-03-19 2017-11-16 23Andme, Inc. Ancestry painting
EP2227780A4 (en) 2008-03-19 2011-08-03 Existence Genetics Llc Genetic analysis
US8214195B2 (en) 2008-03-21 2012-07-03 Mentor Graphics Corporation Testing in a hardware emulation environment
AU2009279434A1 (en) 2008-08-08 2010-02-11 Navigenics, Inc. Methods and systems for personalized action plans
US9218451B2 (en) 2008-08-26 2015-12-22 23Andme, Inc. Processing data from genotyping chips
US8645343B2 (en) 2008-08-26 2014-02-04 23Andme, Inc. Processing data from genotyping chips
US8200509B2 (en) 2008-09-10 2012-06-12 Expanse Networks, Inc. Masked data record access
US20100063830A1 (en) 2008-09-10 2010-03-11 Expanse Networks, Inc. Masked Data Provider Selection
US20100063835A1 (en) 2008-09-10 2010-03-11 Expanse Networks, Inc. Method for Secure Mobile Healthcare Selection
US7917438B2 (en) 2008-09-10 2011-03-29 Expanse Networks, Inc. System for secure mobile healthcare selection
US20100076988A1 (en) 2008-09-10 2010-03-25 Expanse Networks, Inc. Masked Data Service Profiling
US20100070292A1 (en) 2008-09-10 2010-03-18 Expanse Networks, Inc. Masked Data Transaction Database
US20100063865A1 (en) 2008-09-10 2010-03-11 Expanse Networks, Inc. Masked Data Provider Profiling
US20100076950A1 (en) 2008-09-10 2010-03-25 Expanse Networks, Inc. Masked Data Service Selection
BRPI0918889A2 (en) 2008-09-12 2015-12-01 Navigenics Inc method for generating a score for an environmental genetic composite index (egci) for an individual's disease or condition
US20100281401A1 (en) 2008-11-10 2010-11-04 Signature Genomic Labs Interactive Genome Browser
WO2010065139A1 (en) 2008-12-05 2010-06-10 23Andme, Inc. Gamete donor selection based on genetic calculations
US20100169338A1 (en) 2008-12-30 2010-07-01 Expanse Networks, Inc. Pangenetic Web Search System
US8108406B2 (en) 2008-12-30 2012-01-31 Expanse Networks, Inc. Pangenetic web user behavior prediction system
US20100169313A1 (en) 2008-12-30 2010-07-01 Expanse Networks, Inc. Pangenetic Web Item Feedback System
US8386519B2 (en) 2008-12-30 2013-02-26 Expanse Networks, Inc. Pangenetic web item recommendation system
US8255403B2 (en) 2008-12-30 2012-08-28 Expanse Networks, Inc. Pangenetic web satisfaction prediction system
US20100169262A1 (en) 2008-12-30 2010-07-01 Expanse Networks, Inc. Mobile Device for Pangenetic Web
EP2370929A4 (en) 2008-12-31 2016-11-23 23Andme Inc LOOKING FOR PARENTS IN A DATABASE
US8473218B2 (en) 2009-01-29 2013-06-25 Microsoft Corporation Refining HLA data
WO2010144395A2 (en) 2009-06-10 2010-12-16 Ancestralhunt Partners, Llc System and method for the collaborative collection, assignment, visualization, analysis and modification of probable genealogical relationships based on geo-spatial and temporal proximity
HUE061110T2 (en) 2009-11-05 2023-05-28 Univ Hong Kong Chinese Fetal genome analysis from maternal biological sample
US8187811B2 (en) 2009-11-30 2012-05-29 23Andme, Inc. Polymorphisms associated with Parkinson's disease
US20110257889A1 (en) 2010-02-24 2011-10-20 Pacific Biosciences Of California, Inc. Sequence assembly and consensus sequence determination
JP5517752B2 (en) 2010-05-31 2014-06-11 キヤノン株式会社 Image forming apparatus and computer program
WO2012099890A1 (en) 2011-01-18 2012-07-26 University Of Utah Research Foundation Estimation of recent shared ancestry
US8786603B2 (en) 2011-02-25 2014-07-22 Ancestry.Com Operations Inc. Ancestor-to-ancestor relationship linking methods and systems
US10127346B2 (en) 2011-04-13 2018-11-13 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for interpreting a human genome using a synthetic reference sequence
PT2702160T (en) 2011-04-27 2020-07-30 Amyris Inc Methods for genomic modification
US9153142B2 (en) 2011-05-26 2015-10-06 International Business Machines Corporation User interface for an evidence-based, hypothesis-generating decision support system
US9928338B2 (en) 2011-06-01 2018-03-27 The Board Of Trustees Of The Leland Stanford Junior University Method and system for phasing individual genomes in the context of clinical interpretation
US10790041B2 (en) 2011-08-17 2020-09-29 23Andme, Inc. Method for analyzing and displaying genetic information between family members
US9367663B2 (en) 2011-10-06 2016-06-14 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
US8990250B1 (en) 2011-10-11 2015-03-24 23Andme, Inc. Cohort selection with privacy protection
TWI619038B (en) 2011-11-07 2018-03-21 Admedec Co Ltd Safety box
US10437858B2 (en) 2011-11-23 2019-10-08 23Andme, Inc. Database and data processing system for use with a network-based personal genetics services platform
US10777302B2 (en) 2012-06-04 2020-09-15 23Andme, Inc. Identifying variants of interest by imputation
US10025877B2 (en) 2012-06-06 2018-07-17 23Andme, Inc. Determining family connections of individuals in a database
US9116882B1 (en) 2012-08-02 2015-08-25 23Andme, Inc. Identification of matrilineal or patrilineal relatives
US10847248B2 (en) 2012-08-10 2020-11-24 The Board Of Trustees Of The Leland Stanford Junior University Techniques for determining haplotype by population genotype and sequence data
US9449143B2 (en) 2012-08-28 2016-09-20 Inova Health System Ancestral-specific reference genomes and uses thereof
EP2893478A1 (en) 2012-09-06 2015-07-15 Ancestry.com DNA, LLC Using haplotypes to infer ancestral origins for recently admixed individuals
US10114922B2 (en) 2012-09-17 2018-10-30 Ancestry.Com Dna, Llc Identifying ancestral relationships using a continuous stream of input
US9213947B1 (en) 2012-11-08 2015-12-15 23Andme, Inc. Scalable pipeline for local ancestry inference
US9213944B1 (en) * 2012-11-08 2015-12-15 23Andme, Inc. Trio-based phasing using a dynamic Bayesian network
AU2014218418A1 (en) 2013-03-15 2015-10-01 Ancestry.Com Dna, Llc Family networks
EP3621080B1 (en) 2014-10-14 2023-09-06 Ancestry.com DNA, LLC Reducing error in predicted genetic relationships
AU2015331621B2 (en) 2014-10-17 2021-02-25 Ancestry.Com Dna, Llc Ancestral human genomes
US20170329902A1 (en) 2014-10-29 2017-11-16 23Andme, Inc. Estimation of admixture generation
US20170329899A1 (en) 2014-10-29 2017-11-16 23Andme, Inc. Display of estimated parental contribution to ancestry
MX389458B (en) 2014-11-06 2025-03-19 Ancestryhealth Com Llc PREDICTION OF HEALTH OUTCOMES.
NZ737553A (en) * 2015-05-30 2017-11-24
US10957422B2 (en) 2015-07-07 2021-03-23 Ancestry.Com Dna, Llc Genetic and genealogical analysis for identification of birth location and surname information
US10558930B2 (en) 2015-07-13 2020-02-11 Ancestry.Com Dna, Llc Local genetic ethnicity determination system
US20170329915A1 (en) 2015-08-27 2017-11-16 23Andme, Inc. Systems and methods for generating a modular web page template to display personal genetic and physiological condition information
EP3432198B1 (en) 2017-07-19 2024-04-17 Tata Consultancy Services Limited Crowdsourcing and deep learning based segmenting and karyotyping of chromosomes
MX2021010260A (en) * 2019-02-27 2021-09-21 Ancestry Com Dna Llc GRAPHIC USER INTERFACE THAT SHOWS RELATIONSHIP BASED ON SHARED DNA.
US11848073B2 (en) * 2019-04-03 2023-12-19 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
EP4000070A4 (en) * 2019-07-19 2023-08-09 23Andme, Inc. Phase-aware determination of identity-by-descent dna segments
EP4029020A4 (en) 2019-09-13 2023-09-20 23Andme, Inc. METHOD AND SYSTEMS FOR DETERMINING AND REPRESENTING FAMILY TREES
US20220044761A1 (en) 2020-05-27 2022-02-10 23Andme, Inc. Machine learning platform for generating risk models
WO2021243094A1 (en) 2020-05-27 2021-12-02 23Andme, Inc. Machine learning platform for generating risk models
US11817176B2 (en) 2020-08-13 2023-11-14 23Andme, Inc. Ancestry composition determination
WO2022076909A1 (en) 2020-10-09 2022-04-14 23Andme, Inc. Formatting and storage of genetic markers
WO2022087478A1 (en) 2020-10-23 2022-04-28 23Andme, Inc. Machine learning platform for generating risk models

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050191731A1 (en) * 1999-06-25 2005-09-01 Judson Richard S. Methods for obtaining and using haplotype data

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Centimorgan." CentiMorgan - ISOGG Wiki, 10 July 2010, https://isogg.org/wiki/CentiMorgan. (Year: 2010) *
Browning, Brian L., and Sharon R. Browning. "Improving the accuracy and efficiency of identity-by-descent detection in population data." Genetics 194.2 (2013): 459-471. (Year: 2013) *
Browning, Sharon R., and Brian L. Browning. "High-resolution detection of identity by descent in unrelated individuals." The American Journal of Human Genetics 86.4 (2010): 526-539. (Year: 2011) *
Li, Hong, et al. "Relationship estimation from whole-genome sequence data." PLoS genetics 10.1 (2014): e1004144. (Year: 2014) *
Upton, Alex, et al. "High-performance computing to detect epistasis in genome scale data sets." Briefings in bioinformatics 17.3 (2016): 368-379. (Year: 2016) *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11683315B2 (en) 2007-10-15 2023-06-20 23Andme, Inc. Genome sharing
US12327615B2 (en) 2007-10-15 2025-06-10 23Andme, Inc. Genetic comparisons between grandparents and grandchildren
US11171962B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genome sharing
US11170873B2 (en) 2007-10-15 2021-11-09 23Andme, Inc. Genetic comparisons between grandparents and grandchildren
US12431221B2 (en) 2007-10-15 2025-09-30 23Andme, Inc. Window-based method for determining inherited segments
US11531445B1 (en) 2008-03-19 2022-12-20 23Andme, Inc. Ancestry painting
US11508461B2 (en) 2008-12-31 2022-11-22 23Andme, Inc. Finding relatives in a database
US11468971B2 (en) 2008-12-31 2022-10-11 23Andme, Inc. Ancestry finder
US11322227B2 (en) 2008-12-31 2022-05-03 23Andme, Inc. Finding relatives in a database
US11049589B2 (en) 2008-12-31 2021-06-29 23Andme, Inc. Finding relatives in a database
US11170047B2 (en) 2012-06-06 2021-11-09 23Andme, Inc. Determining family connections of individuals in a database
US11521708B1 (en) 2012-11-08 2022-12-06 23Andme, Inc. Scalable pipeline for local ancestry inference
US12354710B1 (en) 2012-11-08 2025-07-08 23Andme, Inc. Scalable pipeline for local ancestry inference
US20240087671A1 (en) * 2019-04-03 2024-03-14 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US12237051B2 (en) * 2019-04-03 2025-02-25 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US20200321073A1 (en) * 2019-04-03 2020-10-08 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US11848073B2 (en) * 2019-04-03 2023-12-19 University Of Central Florida Research Foundation, Inc. Methods and system for efficient indexing for genetic genealogical discovery in large genotype databases
US12260936B2 (en) * 2019-07-19 2025-03-25 23Andme, Inc. Identity-by-descent relatedness based on focal and reference segments
US12046327B1 (en) * 2019-07-19 2024-07-23 23Andme, Inc. Identity-by-descent relatedness based on focal and reference segments
US20240321398A1 (en) * 2019-07-19 2024-09-26 23Andme, Inc. Identity-by-descent relatedness based on focal and reference segments
US20250307271A1 (en) * 2019-08-02 2025-10-02 Ancestry.Com Dna, Llc Determining data inheritance of data segments
US12073495B2 (en) 2019-09-13 2024-08-27 23Andme, Inc. Methods and systems for determining and displaying pedigrees
US11514627B2 (en) 2019-09-13 2022-11-29 23Andme, Inc. Methods and systems for determining and displaying pedigrees
US12159690B2 (en) 2020-08-13 2024-12-03 23Andme, Inc. Ancestry composition determination
US11817176B2 (en) 2020-08-13 2023-11-14 23Andme, Inc. Ancestry composition determination
US11783919B2 (en) 2020-10-09 2023-10-10 23Andme, Inc. Formatting and storage of genetic markers
US20230019141A1 (en) * 2021-07-07 2023-01-19 Mars, Incorporated System, method, and apparatus for predicting genetic ancestry
US20230260608A1 (en) * 2022-02-16 2023-08-17 Ancestry.Com Dna, Llc Relationship prediction

Also Published As

Publication number Publication date
US20210193257A1 (en) 2021-06-24
US20250191684A1 (en) 2025-06-12
US20240321398A1 (en) 2024-09-26
EP4000070A4 (en) 2023-08-09
US12260936B2 (en) 2025-03-25
US12046327B1 (en) 2024-07-23
WO2021016114A1 (en) 2021-01-28
EP4000070A1 (en) 2022-05-25
CA3147888A1 (en) 2021-01-28

Similar Documents

Publication Publication Date Title
US12260936B2 (en) Identity-by-descent relatedness based on focal and reference segments
US12217832B2 (en) Deep learning-based variant classifier
US20230402132A1 (en) Error Correction in Ancestry Classification
US10600217B2 (en) Methods for the graphical representation of genomic sequence data
Freyman et al. Fast and robust identity-by-descent inference with the templated positional burrows–wheeler transform
CA2964902C (en) Ancestral human genomes
US9367800B1 (en) Ancestry painting with local ancestry inference
Cleary et al. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data
Bzikadze et al. UniAligner: a parameter-free framework for fast sequence alignment
Llinares-López et al. Genome-wide genetic heterogeneity discovery with categorical covariates
Yang et al. Improved detection algorithm for copy number variations based on hidden Markov model
Wu et al. A practical algorithm based on particle swarm optimization for haplotype reconstruction
Stram SNP Imputation for Association Studies
CN110504004A (en) A method for identifying controllability genes based on complex network structure
Zhang Biobank-scale ancestral recombination graphs: inference and applications to the analysis of complex traits
Choudhury et al. HAPI-Gen: Highly accurate phasing and imputation of genotype data
HK40062519A (en) Deep learning-based variant classifier
Wang Missing SNP Genotype Imputation
Huang Computational Methods Using Large-Scale Population Whole-Genome Sequencing Data
NZ789499A (en) Deep learning-based variant classifier
NZ791625A (en) Variant classifier based on deep neural networks
HK40025635B (en) Deep learning-based variant classifier
HK40025635A (en) Deep learning-based variant classifier
Kretzschmar Methods for phasing and imputation of very low coverage sequencing data

Legal Events

Date Code Title Description
AS Assignment

Owner name: 23ANDME, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FREYMAN, WILLIAM A.;MCMANUS, KIMBERLY F.;SHRINGARPURE, SUYASH S.;AND OTHERS;SIGNING DATES FROM 20200806 TO 20200807;REEL/FRAME:053556/0220

STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION