
WO2025212586A1 - High throughput intramolecular consensus reads - Google Patents

High throughput intramolecular consensus reads

Info

Publication number
WO2025212586A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
base
read
nucleic acid
consensus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/022458
Other languages
French (fr)
Inventor
Jagadeeswaran CHANDRASEKAR
Amal CHATURVEDI
Mahdi Golkaram
Mark Stamatios Kokoris
Miroslav KUKRICAR
Igor MANDRIC
John MANNION
Robert Mcruer
III John Robert MICHAEL
Sayed Mohammadebrahim SAHRAEIAN
Siamak SALARI SHARIF
Xixi WANG
Daniel ZINDER
Erfan Sayyari
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Roche Sequencing Solutions Inc
Original Assignee
Roche Sequencing Solutions Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Roche Sequencing Solutions Inc
Publication of WO2025212586A1
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Definitions

  • Given its high-throughput sequencing capacity, SBX technology generates an enormous amount of digital data (e.g., raw sequencing data, alignment files, intermediate and final result files, etc.) that must be processed, transferred, stored, and archived. This poses major challenges for processing the data in real time, where the practical limitations of the computing system (e.g., available storage space, how fast the files can be accessed once stored, etc.) must be considered.
  • digital data e.g., raw sequencing data, alignment files, intermediate and final result files, etc.
  • Techniques described herein relate to a method for determining a partial consensus sequence of a double-stranded nucleic acid molecule, the method comprising: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of base calls; sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of base calls; identifying a first set of concordant positions and a second set of discordant positions using the first sequence of base calls and the second sequence of base calls; representing each of the first set of concordant positions by a concordant value of a first group of four concordant values, each concordant value representing a concordant pair of bases on the first strand and the second strand; representing each of the second set of discordant positions by a discordant value of a second group of at least 12 discordant values, each discordant value representing a discordant pair of bases
  • the first group of four concordant values is specified using two binary bits and includes A<->T, C<->G, G<->C, and T<->A.
  • the second group of at least 12 discordant values is specified using at least four binary bits and includes A<->A, A<->C, A<->G, C<->A, C<->C, C<->T, G<->A, G<->G, G<->T, T<->C, T<->G, and T<->T.
  • the second group of at least 12 discordant values includes at least 20 discordant values.
  • at least 20 discordant values are specified using five binary bits.
  • generating the partial consensus sequence includes: including metadata that specifies the second set of discordant positions.
  • the concordant values for the first set of concordant positions, and the discordant values for the second set of discordant positions are usable to recover the base calls of the first sequence and the second sequence at the first set of concordant positions and the second set of discordant positions.
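  • The two-bit and four-bit representation described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the specific bit assignments, the function names `encode` and `decode`, and the assumption that the two strands are already aligned position by position are all hypothetical. The list of discordant positions plays the role of the metadata that specifies where the four-bit values occur.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Four concordant (strand-1, strand-2) pairs; list index is the 2-bit code (assumed ordering).
CONCORDANT = [(b, COMPLEMENT[b]) for b in BASES]
# Twelve discordant pairs; list index is the 4-bit code (assumed ordering).
DISCORDANT = [(a, b) for a in BASES for b in BASES if COMPLEMENT[a] != b]


def encode(strand1, strand2):
    """Encode two position-matched strands as a bit string plus discordant positions."""
    if len(strand1) != len(strand2):
        raise ValueError("strands must be aligned to the same length")
    bits = []
    discordant_positions = []
    for i, pair in enumerate(zip(strand1, strand2)):
        if pair in CONCORDANT:
            bits.append(format(CONCORDANT.index(pair), "02b"))
        else:
            # Raises ValueError for anything other than A, C, G, T.
            bits.append(format(DISCORDANT.index(pair), "04b"))
            discordant_positions.append(i)
    return "".join(bits), discordant_positions


def decode(bits, length, discordant_positions):
    """Recover both strands' base calls from the bit string and the position metadata."""
    discordant = set(discordant_positions)
    strand1, strand2 = [], []
    offset = 0
    for i in range(length):
        if i in discordant:
            a, b = DISCORDANT[int(bits[offset:offset + 4], 2)]
            offset += 4
        else:
            a, b = CONCORDANT[int(bits[offset:offset + 2], 2)]
            offset += 2
        strand1.append(a)
        strand2.append(b)
    return "".join(strand1), "".join(strand2)
```

    In this sketch a read with few discordant positions costs close to two bits per position, while both original base-call sequences remain recoverable.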
  • PATENT Client Reference No.: P39048-WO-1
  • the method further comprising: transmitting the partial consensus sequence to a computer system.
  • the method further comprising: aligning the first sequence of base calls, the second sequence of base calls, or both to a reference genome, wherein the first set of concordant positions do not match the reference genome; identifying a third set of concordant positions that match the reference genome; and representing each of the third set of concordant positions with an indication of a genomic coordinate in the reference genome.
  • the indication of the genomic coordinate in the reference genome includes a starting genomic coordinate of the first sequence of base calls and a binary bit that specifies whether the concordant position matches the reference genome or not.
  • the first weight and the second weight are dependent on base calls adjacent to the discordant position.
  • determining the consensus base call at an initially discordant position of the second set of discordant positions includes: changing the initially discordant position to be a concordant position for a first base call of the first strand based on the first quality score being higher than the second quality score for a second base call of the second strand.
  • the initially discordant position is changed to be the concordant position for the first base call of the first strand further based on a concordant base on the second strand having a measured signal that is adjacent to the second base call.
  • determining the consensus base call at an initially discordant position of the second set of discordant positions includes: changing the initially discordant position to be a concordant position for a first base call of the first strand based on the first weight being higher than the second weight.
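  • The score-based resolution of a discordant position described above can be sketched as follows. The function name `resolve_discordant`, the symmetric handling of the second strand, and the choice to leave ties unresolved are assumptions made for illustration; the score may be a quality score or a weight.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def resolve_discordant(base1, score1, base2, score2):
    """Turn a discordant pair into a concordant pair by trusting the higher-scoring call.

    Returns (strand-1 base, strand-2 base), or None when the scores tie and the
    position is left discordant (an assumption of this sketch).
    """
    if score1 > score2:
        # Keep the first strand's call; the second strand becomes its complement.
        return base1, COMPLEMENT[base1]
    if score2 > score1:
        # Assumed symmetric case: keep the second strand's call instead.
        return COMPLEMENT[base2], base2
    return None
```

    For example, `resolve_discordant("A", 30, "C", 12)` keeps the first strand's A and yields the concordant pair A<->T.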
  • the consensus sequence is a partial consensus sequence.
  • identifying the first set of concordant positions and the second set of discordant positions includes: aligning the first sequence of base calls to the second sequence of base calls.
  • aligning the first sequence of base calls to the second sequence of base calls includes: aligning the first sequence of base calls to a reference genome; and aligning the second sequence of base calls to the reference genome.
  • the second sequence of base calls is aligned to a second strand of the reference genome.
  • the first sequence of base calls is directly aligned to the second sequence of base calls.
  • sequencing the first strand of the double-stranded nucleic acid molecule to obtain the first sequence of base calls includes: measuring signals for a window of a compound corresponding to the first strand of the double-stranded nucleic acid molecule, the compound comprising a plurality of units, each corresponding to a nucleotide; and determining a base call for a genomic position within the window by comparing the signals to known signal patterns corresponding to different nucleotides.
  • comparing the signals to known patterns corresponding to different nucleotides is performed by a machine learning model trained using the known signal patterns.
  • the compound is (1) the first strand of the double-stranded nucleic acid molecule, a reporter element corresponding to a nucleotide or (2) a surrogate molecule created from the first strand of the double-stranded nucleic acid molecule, the surrogate molecule including one or more reporter elements corresponding to each nucleotide.
  • sequencing the double-stranded nucleic acid molecule includes: creating a surrogate molecule from the double-stranded nucleic acid molecule, the surrogate molecule including one or more reporter elements corresponding to each nucleotide; passing the surrogate molecule through a nanopore to obtain electrical signals; and determining the first sequence of base calls and the second sequence of base calls of nucleotides in the double-stranded nucleic acid molecule using the electrical signals.
  • the method of any of the preceding claims further comprising repeating the method for at least 10,000 nucleic acid molecules.
  • Techniques described herein relate to a system comprising modules that respectively perform the steps of any of the above methods.
  • Techniques described herein relate to a sequencing device for determining consensus sequences of double-stranded nucleic acid molecules, the sequencing device comprising: a set of sequencing cells, each configured to perform: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements; and sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements, the set of sequencing cells including at least 10,000 sequencing cells; a consensus circuit electrically connected with the set of sequencing cells, wherein the comparator circuit is configured to perform, for each of the double-stranded nucleic acid molecules: receiving the first sequence of base measurements and the second sequence of base measurements; for each of a plurality of positions of the double-stranded nucleic acid molecule: comparing one or more of the first base measurements to one or more of the second base measurements; and determining
  • comparing a first base measurement to a second base measurement comprises: determining a first base call using the one or more of the first base measurements; determining a second base call using the one or more of the second base measurements; and comparing the first base call and the second base call.
  • the comparator circuit is further configured to perform: determining whether a position of the plurality of positions is concordant or discordant based on the comparing, wherein the base call value is dependent on whether the position is concordant or discordant.
  • a number of bits used for the base call value is dependent on whether the position is concordant or discordant.
  • FIG.2 illustrates a block diagram of an example system for processing data captured by an example nanopore-based sequencing chip according to certain embodiments.
  • FIG.3 shows a flow chart illustrating a process for determining a consensus sequence of a target molecule according to certain embodiments.
  • FIG.4 is a condensed schematic summarizing one embodiment of a method of generating a duplex nucleic acid construct that is sequenced and analyzed using the methods described herein.
  • FIG.5A shows an example of a single molecule-multi-molecular trace event with a typical sequencing by expansion waveform.
  • FIG.5B illustrates a molecule not clearing a pore.
  • FIGs.11A-11C show the synthesis and expected sequences of an “n” pass HDD read construct.
  • FIG.12 shows a variety of different HDD reads classes that are generated during sequencing.
  • FIG.13A is a hypothetical read structure resulting from the sequencing of a HDD construct.
  • FIG.13B shows on the hypothetical read structure the positions of the target insert molecule.
  • FIG.13C provides an example of a One+ read.
  • FIG.14 shows an example of an alignment object, resulting from the alignment between an example read 1 string and a read 2 string for reads of length equal to 100 base pairs. In this particular alignment object, three discordant positions can be seen, and are highlighted in red.
  • FIG.15 shows a flowchart illustration for determining a partial order consensus sequence of a double-stranded nucleic acid molecule according to certain embodiments.
  • FIG.16 shows a flow chart illustrating an example method of compressing a sub-stream of base call data according to certain embodiments.
  • FIG.17 shows a graphical representation of various interpretations for homopolymer alignment.
  • FIG.18 shows a flowchart illustration for determining a consensus sequence of a double-stranded nucleic acid molecule using read orientation according to certain embodiments.
  • FIGs.19A and 19B show an exemplary variable length encoding strategy using a variable length encoding algorithm (FIG.19A) and an example of an alignment object with the resulting consensus read, header data, and a computation of the number of bits required for the header (FIG.19B).
  • FIGs.20A and 20B show a second variable length vector of data, sometimes referred to as part of or as a full header string, that is used to encode the differences in the sequence of read 2 relative to the sequence of read 1 (FIG.20A) and an example of an alignment object with the resulting consensus read, header data, and a computation of the number of bits required for the header (FIG.20B).
  • FIGs.25A and 25B show examples of different adapter architectures.
  • FIG.26 shows a flowchart illustrating a segmentation method for detecting different components of adapters using machine learning techniques in accordance with various embodiments.
  • FIG.27 shows a flowchart illustrating a classification method for detecting different components of adapters using machine learning techniques in accordance with various embodiments.
  • FIG.28 shows an example of frequency components represented in a complex plane, where the x-axis represents the real part (Re) and the y-axis represents the imaginary part (Im) of the complex numbers.
  • FIGs.29A-29G show graphs displaying the cross-correlation signals for seven different candidate adapter sequences at every position of the read construct.
  • FIG.30 shows a flowchart illustrating a method for determining the location and the sequence of an adapter in a sequencing read using cross-correlation frequency-based methods.
  • FIG.31 shows a graph illustrating how the cross-correlation signal would look in real space of the IFFT where a number of bases are deleted from the loop.
  • FIG.32 shows a graph illustrating a case in which the cross-correlation method does not generate an interpretable signal for determining the adapter location and sequences.
  • FIG.33 shows an exemplary graph illustrating the autocorrelation signal for a read sequence, which is used to identify the adapter location in the sequence.
  • FIG.34 shows a flowchart illustrating a method for determining the location and the sequence of an adapter in a sequencing read using autocorrelation frequency-based methods.
  • FIG.35 illustrates a measurement system according to embodiments of the present invention.
  • FIG.36 shows a block diagram of an example computer system usable with systems and methods according to embodiments of the present invention.
  • DETAILED DESCRIPTION [0080] The following description recites various aspects and embodiments of the present compositions and methods.
  • Consensus sequences are generated by combining the sequences of a plurality of sequence reads that align to the same region of a template nucleic acid molecule to form a single high-quality consensus sequence. If the alignment of the plurality of sequence reads occurs between different nucleic acid molecules (e.g., between a sequence read and reference genome, between complementary plus and minus sequence reads, etc.), an intermolecular consensus read is generated.
  • an intramolecular consensus read is generated.
  • An example of an intermolecular sequencing strategy includes unique molecular identifier (UMI)-based intermolecular consensus sequencing.
  • sequencing methods e.g., SBX, next generation, sequencing by synthesis (SBS), other nanopore-based sequencing methods, e.g., sequencing by basetag (“SBT”), single-molecule real-time (SMRT) sequencing, biological and solid state nanopore sequencing, etc.
  • SBT sequencing by basetag
  • SMRT single-molecule real-time sequencing
  • biological and solid state nanopore sequencing etc.
  • barcodes e.g., UMIs
  • a major disadvantage of UMI-based intermolecular consensus workflows is that members of the same UMI family are typically dispersed randomly throughout the physical sample, such that each member of a UMI-Original Molecule Family may be read at a different time throughout a run. Such a run may conceivably last an hour, several hours, 24 hours, multiple days, or another duration of time.
  • the clustering step e.g., data processing step
  • the UMI Based Intermolecular Consensus algorithmic workflows cannot be completed until all reads from the entire run have been produced and collected.
  • Another challenge of the intermolecular consensus approach is that reads from paired plus and minus strands from the target nucleic acid may not both be outputted after sequencing.
  • the clusters can have reads from only one strand, either the plus or the minus. When those clusters have reads from both strands, they can be referred to as duplex clusters, meaning they have representation from both the plus and minus.
  • HDD reads comprise single or multi-pass read pairs that are physically coupled together (e.g., via a hairpin structure). As a result of being physically coupled, the single pass reads that they produce are naturally grouped together in the time dimension.
  • higher accuracy consensus reads may be formed from the two, coupled, single pass reads (e.g., both the plus and minus strand), without the need to perform a first informatic clustering step.
  • the clustering step is already taken care of given their physical and temporal coupling. Avoiding the processing steps required for clustering saves significantly on computational resources and allows for in-line processing and production of the consensus reads in real time.
  • intramolecular consensus sequencing using HDD reads also presents several solutions.
  • a first solution is based on a “reference-based consensus calling” process that aligns both strands of the target nucleic acid to a reference genome.
  • the advantage of this approach lies in the accuracy of the consensus reads. This method can preserve information for as long as possible and all of the raw read information until the point of variant calling.
  • a second solution is based on a “reference-free consensus calling” process that instead of aligning the strands to a reference genome, aligns the adapters and the two nucleic acid strands to themselves.
  • a third solution to data processing and compression of HDD reads involves calling consensus just on parts of the HDD read where perfect agreement exists and refraining from making a consensus call on any positions for which there is a disagreement or discordance across the two read pairs. This third approach can use either reference-free or reference-based consensus calling.
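  • The third solution can be illustrated with a short sketch that calls consensus only where the two reads agree and withholds a call elsewhere. The function name `partial_consensus`, the `N` placeholder for uncalled positions, and the assumption that the second-strand read is already position-matched to the first are hypothetical choices for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def partial_consensus(read1, read2_aligned, placeholder="N"):
    """Return (consensus, discordant_positions) for two position-matched reads.

    read1 holds first-strand base calls; read2_aligned holds the second-strand
    base call at the same position. A position is concordant when the two
    bases are complementary.
    """
    if len(read1) != len(read2_aligned):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    discordant_positions = []
    for i, (a, b) in enumerate(zip(read1, read2_aligned)):
        if COMPLEMENT.get(a) == b:
            consensus.append(a)  # perfect agreement: make the call
        else:
            consensus.append(placeholder)  # disagreement: refrain from calling
            discordant_positions.append(i)
    return "".join(consensus), discordant_positions
```

    The discordant positions are returned alongside the consensus so that a downstream step can revisit them, for example with quality scores or read orientation.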
  • a fourth solution to data processing and compression of HDD reads involves collecting information regarding the alignment orientation of a read, e.g., which strand or internal copy (e.g., daughter) from which a base call was generated.
  • Read orientation can be particularly useful in determining consensus and concordant calls as certain sequence modifications (e.g., errors arising from DNA damage, epigenetic modifications, measurement or data processing) occur more commonly on one strand orientation versus its complement.
  • the collected information may include base calling quality scores, base calling weights, rate of DNA damage or other mutations, etc.
  • the sequencing device may comprise a set of sequencing cells, such as at least 10,000 sequencing cells, where each individual cell is configured to sequence a first and second strand of a double-stranded nucleic acid molecule to generate a first and second sequence of base measurements.
  • Xpandomers are synthesized that translate the sequence of the first and second strand of the double-stranded nucleic acid molecule, respectively, into measurable polymers that can be sequenced using a sequencing device or sequencing instrument.
  • the sequencing device compares the first base measurements to the second base measurements to determine a base call value for each position.
  • a concordant base call is made based on the base call value; however, in some cases, a discordant base call may be made.
  • any (or a combination) of the above-mentioned solutions may be used to generate a consensus sequence.
  • Implementation of any (or any combination) of the above-mentioned solutions would allow for the sequencing instrument to produce HDD consensus reads directly on the sequencing instrument itself. They would also help address the problem of limited channel capacity for information transmission channels along the path between the instrument and location of secondary analysis or storage.
  • Nucleic acid may refer to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form.
  • the term may encompass nucleic acids containing known nucleotide analogs or modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, which have similar binding properties as the reference nucleic acid, and which are metabolized in a manner similar to the reference nucleotides.
  • Examples of such analogs may include, without limitation, phosphorothioates, phosphoramidites, methyl phosphonates, chiral-methyl phosphonates, 2-O-methyl ribonucleotides, and peptide-nucleic acids (PNAs).
  • the nucleic acid may also be represented by surrogate molecules, which are inserted into the original nucleic acid, with each surrogate molecule corresponding to a particular nucleotide.
  • nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated.
  • degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues, as described in, e.g., Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell.
  • nucleic acid is used interchangeably with gene, cDNA, mRNA, oligonucleotide, and polynucleotide.
  • nucleotide, in addition to referring to the naturally occurring ribonucleotide or deoxyribonucleotide monomers, may be understood to refer to related structural variants thereof, including derivatives and analogs (e.g., X-NTPs used in SBX-sequencing), that are functionally equivalent with respect to the particular context in which the nucleotide is being used (e.g., hybridization to a complementary base), unless the context clearly indicates otherwise.
  • “raw data” or “raw signal data” refers to data produced by sensors in a sequencing device.
  • Raw data includes signal values associated with sequencing a nucleic acid molecule.
  • signal value may refer to a value of the sequencing signal output from a sequencing cell.
  • the sequencing signal may be an electrical signal that is measured and/or output from a point in a circuit of one or more sequencing cells, e.g., the signal value may be (or represent) a voltage or a current.
  • the signal value may represent the results of a direct measurement of voltage and/or current and/or may represent an indirect measurement, e.g., the signal value may be a measured duration of time for which it takes a voltage or current to reach a specified value.
  • a signal value may represent any measurable quantity that correlates with the features of the sequencing device. For example, in a nanopore sequencing device, the resistivity of a nanopore can affect the signal value, and the resistivity and/or conductance of the nanopore (threaded and/or unthreaded) may be derived from the signal value.
  • the signal value may correspond to a light intensity, e.g., from a fluorophore attached to a nucleotide being catalyzed to a nucleic acid with a polymerase.
  • the term “bright period” may generally refer to the time period when a molecule, Xpandomer, or a portion thereof, is forced into a nanopore by an electric field applied through an AC signal.
  • the term “dark period” may generally refer to the time period when a molecule, Xpandomer, or a portion thereof, is pushed out of the nanopore by the electric field applied through the AC signal.
  • An AC cycle may include the bright period and the dark period.
  • the polarity of the voltage signal applied to a nanopore cell to put the nanopore cell into the bright period (or the dark period) may be different.
  • the bright periods and the dark periods can correspond to different portions of an alternating signal relative to a reference voltage.
  • the term “raw read data” or “read data” refers to data generated from the raw data or the raw signal data.
  • the raw read data includes read data stream(s).
  • a read data stream includes sub-streams of data corresponding to a respective nucleic acid molecule including an identifier or header sub-stream, a nucleic acid base call sub-stream, and a quality score sub-stream.
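  • One way to picture the three sub-streams is as the per-field columns of a collection of read records. The sketch below is an assumption-laden illustration: the `ReadRecord` class, its field names, and the `split_substreams` helper are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReadRecord:
    """One read: identifier/header, base calls, and one quality score per base call."""
    header: str
    base_calls: str
    quality_scores: List[int]

    def __post_init__(self):
        if len(self.base_calls) != len(self.quality_scores):
            raise ValueError("expected one quality score per base call")


def split_substreams(records: List[ReadRecord]) -> Tuple[List[str], List[str], List[List[int]]]:
    """Separate a read data stream into header, base call, and quality score sub-streams."""
    headers = [r.header for r in records]
    base_calls = [r.base_calls for r in records]
    quality_scores = [r.quality_scores for r in records]
    return headers, base_calls, quality_scores
```

    Keeping the sub-streams separate lets each one be compressed with a scheme suited to its content.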
  • the base call data may also include other possible base calls such as an undetermined nucleotide.
  • quality score data refers to data generated from the raw data that provides a measure of confidence that a base call was made correctly for a nucleic acid (e.g., among the four bases).
  • the quality score can be reflective of the stochastic behavior that is inherent to single molecule observations.
  • the quality of base calls may not degrade with time or with read length, but there can be different quality scores for different base calls randomly at different points in time on a given nucleic acid.
  • the quality scores of bases in a read may show a dependence on read length or position of base within a read. A higher quality score for a base call can indicate greater confidence in the base call being correct.
  • a signal value that is near a peak of a probability distribution function (PDF) can result in a base call having a higher quality score than a signal value that is far from a peak of a PDF.
  • “header data” or “read ID data” refers to information that identifies a read within a larger collection of reads.
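  • The relationship between a signal value, the PDF peaks, and the resulting quality score can be sketched numerically. Everything specific here is an assumption for illustration: the Gaussian signal model, the per-base mean levels in the example, equal priors over the four bases, and the Phred-style scale Q = -10 log10(p_error), which the text above does not specify.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Probability density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def call_base(signal, means, sigma=1.0):
    """Return (base, quality) for one signal value.

    means maps each base to the assumed peak of its signal PDF. The quality is
    a Phred-style score computed from the posterior probability of the call.
    """
    likelihoods = {base: gaussian_pdf(signal, m, sigma) for base, m in means.items()}
    total = sum(likelihoods.values())
    if total == 0.0:
        return "N", 0.0  # signal too far from every peak: undetermined nucleotide
    base = max(likelihoods, key=likelihoods.get)
    p_error = max(1.0 - likelihoods[base] / total, 1e-10)  # floor avoids log10(0)
    return base, -10.0 * math.log10(p_error)
```

    With hypothetical peaks at 0, 3, 6, and 9 for A, C, G, and T, a signal of 3.0 sits on the C peak and scores higher than a signal of 4.4, which lies between the C and G peaks.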
  • the raw read data stream generated for a portion of the raw data has the same header data across the raw read data stream for that portion.
  • the raw data can include a plurality of portions of raw data generated simultaneously or at different times for the same nucleic acid molecule (e.g., template nucleic acid molecule) or for different nucleic acid molecules (e.g., different template nucleic acid molecules).
  • Consensus sequence read refers to a nucleic acid sequence read generated from aligning a plurality of sequence reads that correspond to different parts of the same nucleic acid molecule (e.g., different strands and different internal copies), the same template nucleic acid molecule (e.g., amplicons of same molecule), or molecular family (e.g., same barcode). Consensus reads may be intermolecular (i.e., between different molecules) or they may be intramolecular (i.e., within a molecule).
  • Intermolecular and intramolecular consensus sequence reads may be generated by aligning the plurality of sequence reads to one another.
  • an intramolecular consensus read is generated when the plurality of sequence reads are physically coupled together (e.g., via a hairpin segment) so that the plus and minus sequence reads are compared to each other.
  • Consensus reads may also be generated by aligning each of the plurality of sequence reads to a reference genome or to each other.
  • a “concordant position” has a pair of concordant bases on the two strands for the given genomic position in a double-stranded nucleic acid molecule. Concordant bases are ones that hybridize to each other. Thus, pairs of concordant bases are A<->T, C<->G, G<->C, and T<->A.
  • a “machine learning model” can refer to a software module configured to be run on one or more processors to provide a classification or numerical value of a property of one or more samples.
  • a ML model can include various parameters (e.g., for coefficients, weights, thresholds, functional properties of function, such as activation functions).
  • a ML model can include at least 10, 100, 1,000, 5,000, 10,000, 50,000, 100,000, or one million parameters.
  • a ML model can be generated using sample data (e.g., training samples) to make predictions on test data.
  • sample data e.g., training samples
  • Various number of training samples can be used, e.g., at least 10, 100, 1,000, 5,000, 10,000, 50,000, 100,000, or at least 200,000 training samples.
  • unsupervised learning model is another example.
  • supervised learning model that can be used with embodiments of the present disclosure. Examples of supervised learning models may include different approaches and algorithms including, but not limited to, analytical learning, statistical models, artificial neural network (e.g., including convolutional and/or transformer layers), boosting (meta-algorithm), Bayesian statistics, case-based reasoning, decision tree learning, inductive logic programming, Gaussian process regression, genetic programming, group method of data handling, kernel estimators, learning automata, learning classifier systems, minimum message length (decision trees, decision graphs, etc.), multilinear subspace learning, naive Bayes classifier, maximum entropy classifier, conditional random field, nearest neighbor algorithm, probably approximately correct (PAC) learning, ripple down rules (a knowledge acquisition methodology), symbolic machine learning algorithms, subsymbolic machine learning algorithms, minimum complexity machines (MCM), random forests, ordinal classification, data pre-processing, handling imbalanced datasets, statistical relational learning, or Proaftn (a multicriteria classification algorithm), or an ensemble of any of these types.
  • MCM minimum complexity machines
  • the ML model may include linear regression, logistic regression, deep recurrent neural network (e.g., long short term memory, LSTM), hidden Markov model (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithm, support vector machine (SVM), or any model described herein.
  • Supervised learning models can be trained in various ways using various cost/loss functions that define the error from the known label (e.g., least squares and absolute difference from known classification) and various optimization techniques, e.g., using backpropagation, steepest descent, conjugate gradient, and Newton and quasi-Newton techniques.
  • Example devices and measurement pipelines for performing embodiments of the present disclosure are now described.
  • The specific examples that follow describe constructs and methods used in SBX sequencing, but the skilled artisan will appreciate that the techniques described herein can also be used to analyze data derived from any sequencing method (e.g., sequencing by synthesis (SBS) or other nanopore-based sequencing methods, such as sequencing by basetag (“SBT”), single-molecule real-time (SMRT) sequencing, biological and solid state nanopore sequencing, etc.).
  • Alternative sequencing techniques that may employ the methods described herein include, but are not limited to, NGS platforms (e.g., MiSeq, NextSeq, and NovaSeq Sequencing Platforms) by Illumina, Inc. (San Diego, CA); Aviti System by Element Biosciences, Inc. (San Diego, CA); UG 100 System by Ultima Genomics, Inc. (Fremont, CA); G4 and G4X Systems by Singular Genomics Systems, Inc.
  • FIG.1 is a top view of an embodiment of a nanopore sensor chip 100 having an array 140 of nanopore cells 150.
  • the array 140 of nanopore cells includes at least 10,000 sequencing cells 150.
  • Each nanopore cell 150 includes a control circuit integrated on a silicon substrate of nanopore sensor chip 100.
  • side walls 136 may be included in array 140 to separate groups of nanopore cells 150 so that each group may receive a different sample for characterization.
  • Each nanopore cell may be used to sequence a nucleic acid.
  • each nanopore cell may be used to sequence a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements and sequence a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements.
  • nanopore sensor chip 100 may include a cover plate 130.
  • nanopore sensor chip 100 may also include a plurality of pins 110 for interfacing with other circuits, such as a computer processor.
  • nanopore sensor chip 100 may include multiple chips in a same package, same printed circuit board (PCB), or same integrated circuit (IC), such as, for example, a Multi-Chip Module (MCM) or System-in-Package (SiP).
  • the chips may include, for example, a memory, a processor, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), data converters, a high-speed I/O interface, etc.
  • nanopore sensor chip 100 can include consensus circuit 155.
  • the memory, the processor, the FPGA, the ASIC, data converters, the high-speed I/O interface, etc. may be external circuits operatively connected to the sequencing chip.
  • nanopore sensor chip 100 may be coupled to (e.g., docked to) a nanochip workstation 120, which may include various components for carrying out (e.g., automatically carrying out) various embodiments of the processes disclosed herein, including, for example, analyte delivery mechanisms, such as pipettes for delivering lipid suspension or other membrane structure suspension, analyte solution, and/or other liquids, suspension or solids, robotic arms, computer processor, and/or memory.
  • a plurality of polynucleotides may be detected on array 140 of nanopore cells 150.
  • each nanopore cell 150 can be individually addressable.
  • the nanopore sensor chip 100 includes or is electrically coupled to a consensus circuit 155.
  • the consensus circuit 155 is configured to receive the first sequence of base measurements and the second sequence of base measurements. With the first sequence and second sequence of base measurements, consensus circuit 155 determines a first base call (e.g., using one or more first base measurements) and a second base call (e.g., using one or more second base measurements). Consensus circuit 155 then compares one or more of the first base measurements to one or more of the second base measurements and determines a base call value based on the comparison. In some cases, the base call value is for complementary concordant positions. In other cases, the base call value is for non-complementary discordant positions.
  • the base call value may be stored in a number of bits, depending on whether the position is found to be concordant or discordant by consensus circuit 155. Further, the consensus circuit may be further configured to generate metadata (e.g., header string, consensus quality score, vector, etc.) that identifies which positions are discordant. This process repeats for each of the plurality of positions of the double-stranded nucleic acid molecule. Once all base call values are determined for each position, a consensus sequence can be generated using the base call values. The consensus sequence may also include the metadata. The consensus sequence (and metadata) is then transmitted to a computer system by a transmitter.
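  • As an illustration only, the per-position comparison performed by the consensus circuit can be sketched in software as follows. The function and class names are hypothetical, the two reads are assumed to be already aligned position by position, and the use of "N" as a placeholder at discordant positions is an assumption of this sketch rather than an encoding stated in the disclosure.

```python
from dataclasses import dataclass

# Watson-Crick complements used to decide whether two strand calls agree.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


@dataclass
class ConsensusResult:
    sequence: str           # one consensus base (or "N") per position
    discordant: list        # metadata: 0-based positions where the strands disagreed


def duplex_consensus(first_calls: str, second_calls: str) -> ConsensusResult:
    """Compare per-position base calls from the two strands of one molecule.

    Assumption: second_calls is already oriented position-by-position against
    first_calls, so a concordant position is one where the second call is the
    complement of the first call.
    """
    if len(first_calls) != len(second_calls):
        raise ValueError("this sketch assumes pre-aligned, equal-length reads")
    bases = []
    discordant = []
    for pos, (b1, b2) in enumerate(zip(first_calls, second_calls)):
        if COMPLEMENT.get(b1) == b2:
            bases.append(b1)        # complementary: keep the first-strand call
        else:
            bases.append("N")       # non-complementary: placeholder (assumption)
            discordant.append(pos)  # record the position in the metadata
    return ConsensusResult("".join(bases), discordant)


result = duplex_consensus("ACGT", "TGCT")
print(result.sequence, result.discordant)
```

  • In this sketch the list of discordant positions plays the role of the metadata that accompanies the consensus sequence; a hardware implementation could instead encode it as a header string or bit vector.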
  • FIG.2 illustrates a block diagram of an example system for processing data captured by a nanopore-based sequencing sensor chip 210 (e.g., same as nanopore sensor chip 100 described with respect to FIG.1), according to embodiments of the present disclosure.
  • System 200 comprises sequencing device 205 for generating sequencing data that may be transmitted via a bus interface unit 280.
  • Sequencing device 205 includes sensor chip 210, consensus circuit 220, transmitter 221, and local memory 225.
  • Sensor chip 210 may include thousands or millions or more of cells.
  • the data may be captured by the cells of sensor chip 210 during various phases of cell formation and sequencing, including, for example, before the formation of the lipid layer (e.g., to check open/short of the electrical circuit), after the formation of a thick lipid layer, during the thinning of the lipid layer, after the formation of the bilayer, after the formation of the nanopore (e.g., to determine the number of nanopores for each cell or to measure open channel data for normalization), and during the sequencing of a sample (e.g., for normalization).
  • a sensor chip 210 may include thousands or millions of cells, such as 100,000 or more cells, 1 million or more cells, 2 million or more cells, 4 million or more cells, or 8 million or more cells.
  • sensor chip 210 may include 1 million cells, where each cell of the 1 million cells may be a nanopore-based sensor cell as described above with respect to FIG.1, and may capture, for example, ten data sample points in one cycle of an AC signal at 100 Hz.
  • each cell of the 1 million cells may capture one data point represented by one byte (e.g., 8 bits), and one raw data frame including 1 million bytes (MB) of data from the 1 million cells may be generated.
  • the data point may be a raw data point from an analog-to-digital converter (ADC) output (i.e., ADC value).
  • the data point may be the difference between two consecutive raw data points from the ADC output.
  • a local event detector may be used to determine whether an event has occurred at a cell and the output data point may indicate whether an event has occurred on a cell. For example, the local event detector may detect an event if a difference between a new ADC value and previous ADC value (or other reference value) is greater than a selected threshold.
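  • A minimal sketch of such a local event detector is shown below. The function name and the sample values are hypothetical, and comparing the absolute difference against the threshold is an assumption of this sketch; the text only requires that the difference exceed a selected threshold.

```python
def detect_events(new_frame, reference_frame, threshold):
    """Return one boolean per cell: True if the ADC value changed by more than threshold.

    new_frame and reference_frame hold one ADC value per cell; the reference
    can be the previous frame or any other reference value per cell.
    """
    if len(new_frame) != len(reference_frame):
        raise ValueError("frames must cover the same cells")
    return [abs(n - r) > threshold for n, r in zip(new_frame, reference_frame)]


# Three cells: only the middle cell changes by more than the threshold of 5.
print(detect_events([10, 50, 12], [11, 20, 12], threshold=5))
```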
  • a data frame may indicate no event or state change on some cells and events or state changes on some other cells. Thus, a data frame comprises all of the data points across the cells at a given time.
  • the raw data frame may be represented by, for example, an image file that includes 8 million pixels, where the data point from each cell may be represented by the gray scale or color and/or intensity of a pixel of the image file.
  • ten raw data frames may be generated, one at each sample point. For example, four sample points may be taken in the bright period and six sample points may be taken in the dark period, or vice versa.
  • 1000 raw data frames may be generated per second, which may include 1 gigabyte (GB) (1 MB per frame × 1000 frames) of data from 1 million cells.
  • the output data rate of sensor chip 210 may be 1 GB per second (GBPS) for a sensor chip with 1 million cells.
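  • The arithmetic behind this data rate can be checked with a short calculation. The function and parameter names are hypothetical; the numbers are the example values given above (1 million cells, one byte per data point, ten sample points per AC cycle, 100 Hz AC signal).

```python
def raw_data_rate_bytes_per_s(num_cells, bytes_per_point, samples_per_cycle, ac_frequency_hz):
    """Raw output rate of a sensor chip in bytes per second."""
    frame_bytes = num_cells * bytes_per_point               # one data point per cell per frame
    frames_per_second = samples_per_cycle * ac_frequency_hz  # one frame per sample point
    return frame_bytes * frames_per_second


rate = raw_data_rate_bytes_per_s(
    num_cells=1_000_000, bytes_per_point=1, samples_per_cycle=10, ac_frequency_hz=100
)
print(rate / 1e9, "GB per second")
```

  • With these inputs each frame is 1 MB and 1000 frames are produced per second, which reproduces the 1 GB per second figure.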
  • data captured by sensor chip 210 may be sent to a consensus circuit 220 (e.g., including FPGA(s), ASIC(s), and/or GPU(s)) for preprocessing.
  • Consensus circuit 220 may store the received data to a local memory 225 at a data rate of, for example, 12 GBPS.
  • Consensus circuit 220 may directly send the received data through (or process the received data and then send the preprocessed data through), for example, a Peripheral Component Interconnect Express (PCIe) interface, to a PCIe bus 280, which may have a maximum data transfer rate of, for example, 8 GBPS.
  • Each raw data frame only includes one data sample point from a cell, while each base is determined based on a plurality of sample data points as described above.
  • a data processor may not have sufficient resources to process the raw data frames in real time. Therefore, the raw data frames may be stored first and then be processed together when raw data frames sufficient for determining a base are available.
  • data from consensus circuit 220 may be stored in one or more standard disk drives 260 or one or more fast capture drives 250.
  • Each standard disk drive 260 may have a maximum write speed of 0.2 GBPS, while each fast capture drive 250 may have a maximum write speed of 1 GBPS.
  • data from consensus circuit 220 may be sent to network storage devices through a network interface 270, which may have a maximum data rate of 0.1 GBPS.
  • the usable bandwidth of PCIe bus 280 may be less than the full bandwidth of 8 GBPS, such as, for example, 6 GBPS (75% of the full bandwidth) due to other data transportations on the bus.
  • the data from consensus circuit 220 may not be saved to the storage drive fast enough.
  • a large buffer may be used for temporarily storing the data, or some data may be dropped.
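  • To give a sense of scale for the buffering problem, the following sketch estimates how much data accumulates when the chip produces data faster than a storage device can write it. The function name is hypothetical; the rates are the example figures from this section (1 GBPS chip output, 0.2 GBPS standard disk drive).

```python
def buffer_growth_bytes(produce_rate, drain_rate, seconds):
    """Bytes accumulated in a buffer when data arrives faster than it is written."""
    return max(0, produce_rate - drain_rate) * seconds


GB = 1_000_000_000
backlog = buffer_growth_bytes(produce_rate=1 * GB, drain_rate=200_000_000, seconds=60)
print(backlog / GB, "GB of backlog after one minute")
```

  • Under these example rates the backlog grows by 0.8 GB every second, so a buffer sized for a multi-hour run would be very large unless the data is reduced before storage.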
  • Consensus circuit 220 may optionally include a base caller circuit, which can be implemented on a graphics processing unit (GPU). As an Xpandomer is passed through a nanopore, raw base calls may be made in real-time and written out into raw sequencing data files (e.g., FASTQ files).
  • Each nucleic acid base (modified and unmodified) generates its own unique electrical signal (e.g., voltage or electrical current pattern) that is captured by the nanopore cell as the base transits the nanopore.
  • the raw sequencing data files may be input into the base caller circuit, where a base calling algorithm, which may be referred to as a base caller, decodes the sequences of bases in real-time, after the sequencing run is complete, or any combination thereof.
  • the base caller is a machine learning model, for example, a neural network (e.g., recurrent neural network (RNN), a convolutional neural network (CNN), a bidirectional hybrid RNN + CNN, and the like).
  • the base caller may be a non-neural network machine learning model or a statistical model.
  • a GPU that implements a base caller circuit may include hundreds or even thousands of parallel processing cores, making it suitable for the processing of sequencing data from the thousands or millions of cells of sensor chip 210.
  • a host processor 240 may be used to process the stored data.
  • Host processor 240 may include a communication interface having a maximum bandwidth of, for example, about 22 GBPS, which may not be fully utilized due to the bandwidth limitation of PCIe bus 280.
  • Host processor 240 may access a main memory 245 (e.g., a DRAM) at a maximum data rate of, for example, 12 GBPS. In various implementations, host processor 240 may access main memory 245 directly or through, for example, a north bridge.
  • the base caller circuit may need to read the data back from the storage device, and the data processing speed may be limited by the speed of the data read-back. Thus, if sensor chip 210 is used to sample data, for example, for 2 hours or more for an assay, 2 hours or more may be needed to read the stored data back. Thus, the data processing time may be very long.
  • a sequencing device may be used for determining consensus sequences of double-stranded nucleic acid molecules.
  • the sequencing device comprises a set of sequencing cells that can include at least 10,000 individual sequencing cells.
  • Each sequencing cell may be configured to sequence a first strand and a second strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements and a second sequence of second base measurements, respectively.
  • the first sequence of first base measurements and the second sequence of second base measurements are raw sequencing data that is transmitted to a consensus circuit on the sequencing device at a rate.
  • the consensus circuit may be electrically connected with the set of sequencing cells.
  • the consensus circuit is configured to receive the first and second sequences of base measurements for each position of each double-stranded nucleic acid molecule (e.g., raw data).
  • the data for the first and second sequences can include base calls, quality scores, and other sub-streams (e.g., header information) from the raw data.
  • the rate of transmission can be at least 12 gigabytes per second (GB/s).
  • the consensus circuit can include multiple cores or chips.
  • FIG.3 shows a flowchart illustrating a method 300 for determining a consensus sequence of a target molecule.
  • the method 300 depicted in FIG.3 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine).
  • the software may be stored on a non-transitory storage medium (e.g., on a memory device).
  • the method 300 presented in FIG.3 and described below is intended to be illustrative and non-limiting. Although FIG.3 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting.
  • nucleic acid material e.g., DNA
  • the nucleic acid material may be genomic DNA, mitochondrial DNA, cell-free DNA (cfDNA), circulating tumor DNA (ctDNA), or a combination thereof.
  • genomic DNA may be obtained and isolated using any method known in the art.
  • the isolated DNA is fragmented into a plurality of shorter double stranded DNA target fragments through physical (e.g., sonication) or enzymatic (e.g., restriction enzyme digestion) methods.
  • a daughter strand is produced by a template-directed synthesis, wherein the daughter strand includes a plurality of XNTP subunits (i.e., XATP, XCTP, XGTP and XTTP) coupled in a sequence corresponding to a contiguous nucleotide sequence of all or a portion of the target nucleic acid material.
  • the individual XNTP subunits of the daughter strand comprise a reporter construct, a nucleobase residue, and a selectively cleavable bond.
  • Upon cleavage of the selectively cleavable bond, the Xpandomer is released and sequenced in a nanopore sequencing system, and the reporter constructs in the Xpandomer are used to parse genetic information in a sequence that corresponds to the contiguous nucleotide sequence of all or a portion of the target nucleic acid (see, e.g., U.S. Patent Application Publication No. 2022/0411458 for a description of Sequencing by Expansion, which is herein incorporated by reference in its entirety).
  • Xpandomers are ready to be sequenced on a sequencing device (such as the sequencing device described in section I-A) and the measured signals corresponding to the different nucleotides are determined.
  • a sliding window of measured signals can be used to determine a base at respective positions.
  • signals 1-20 may correspond to seven bases where the middle base (base 4) is being called. After base 4 is called, the signal window may shift some number of signals, for example three signals, to determine the base call for the 5th base (base 5).
  • the new window is now signals 4-23.
  • the sliding window method allows for only those signals proximal to the base being called to potentially influence the called base, rather than the signal for the entire molecule.
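  • The sliding window described above can be sketched as follows, using the example values from the text (a window of 20 signals shifted by three signals). The function names are hypothetical, and `call_middle_base` stands in for whatever base calling function is applied to each window; the placeholder used in the usage lines simply returns the middle signal and is not a real base caller.

```python
def sliding_windows(signals, window_size=20, step=3):
    """Yield (start_index, window) pairs over a list of measured signals."""
    for start in range(0, len(signals) - window_size + 1, step):
        yield start, signals[start:start + window_size]


def call_bases(signals, call_middle_base, window_size=20, step=3):
    """Apply a per-window caller so only nearby signals influence each call."""
    return [call_middle_base(window)
            for _, window in sliding_windows(signals, window_size, step)]


# Signals numbered 1..26 so the window contents are easy to read.
signals = list(range(1, 27))
for start, window in sliding_windows(signals):
    print(f"signals {window[0]}-{window[-1]}")

# Placeholder caller (assumption): returns the signal in the middle of the window.
print(call_bases(signals, lambda w: w[len(w) // 2]))
```

  • The first two windows printed are signals 1-20 and signals 4-23, matching the example in the text.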
  • the measured signal of the base in question may be compared to a threshold, which can be determined based on a number of different parameters such as the separation between the signal values of the different bases.
  • the threshold may be based on (i) previously determined base calls at the position in question, (ii) measurements of base calls upstream and/or downstream of the position in question, or (iii) any combination thereof.
  • For concordant positions, it can also be determined whether the base call matches the reference genome.
  • the two strands may be aligned to a reference genome in a process known as reference-based alignment (see section V-A).
  • concordant and discordant positions are identified based on how well each sequenced strand aligns to the reference genome. Concordant positions are those positions where the sequenced strand matches with the reference genome’s sequence, while discordant positions are those positions where the sequenced strand does not match the reference genome’s sequence.
  • quality scores and weights may be used to resolve discordant positions and determine whether the position should actually be concordant.
  • Consensus reads may be determined using various intramolecular consensus workflows, depending on the type of read construct generated from the library prep. Reads may be physically coupled, for example via a hairpin segment. Adapter segments can be identified and removed from the read construct. For an intramolecular consensus workflow, the reads can be aligned to each other (directly or via a reference sequence) to generate an alignment object that is used to form a consensus sequence.
  • [0133] At 360, the aligned sequencing data is compressed to aid in the processing, transfer, storage, and archiving of the data. Several techniques may be used to compress the data. One such technique may include partial consensus compression.
  • An intermolecular consensus workflow can use a UMI-based approach.
  • In intermolecular consensus calling, because reads are not physically coupled together, many reads that align to the same region of the genome may be combined to generate a single consensus read for that genomic region.
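  • As a hedged illustration of combining many reads into one consensus read, the sketch below groups reads by UMI and takes a per-position majority vote. The majority vote, the function names, and the assumption that reads in a group are already aligned and of equal length are all choices made for this sketch; the disclosure does not specify how the reads are combined.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs into a dict of umi -> list of sequences."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    return dict(groups)


def majority_consensus(seqs):
    """Per-position majority vote over equal-length, pre-aligned reads."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("this sketch assumes equal-length, pre-aligned reads")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


reads = [("UMI1", "ACGT"), ("UMI1", "ACGA"), ("UMI1", "ACGT"), ("UMI2", "TTGA")]
consensus = {umi: majority_consensus(seqs) for umi, seqs in group_by_umi(reads).items()}
print(consensus)
```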
  • Example Duplex Sequencing [0135] Embodiments described herein may be applied to any suitable sequencing platform, including next generation and nanopore sequencing, but are particularly useful for SBX Sequencing. Sequencing by Expansion is described in International Publication No. WO 2020/236526, entitled “Translocation control elements, reporter codes, and further means for translocation control for use in nanopore sequencing,” filed May 14, 2020, and U.S. Patent No.
  • DNA from a biological sample is obtained or provided.
  • the DNA obtained or provided from the biological sample may be genomic DNA, mitochondrial DNA, cell-free DNA (cfDNA), circulating tumor DNA (ctDNA), or a combination thereof.
  • DNA samples may be obtained from a patient or subject, from an environmental sample, or from an organism of interest.
  • the DNA sample is extracted, purified, or derived from a cell or collection of cells, a body fluid, a tissue sample, an organ, and/or an organelle.
  • the sample DNA is whole genomic DNA.
  • DNA may be obtained from the same biological sample or source.
  • Many different methods and technologies are available for the isolation of DNA. In general, such methods involve disruption and lysis of the starting material followed by the removal of proteins and other contaminants and finally recovery of the DNA.
  • Cell lysis procedures and reagents are known in the art and may generally be performed by chemical (e.g., detergent, hypotonic solutions, enzymatic procedures, and the like), physical (e.g., French press, sonication, and the like), or electrolytic lysis methods. Removal of proteins can be achieved, for example, by digestion with proteinase K, followed by salting-out, organic extraction, gradient separation, or binding of the DNA to a solid-phase support (either anion-exchange or silica technology).
  • DNA may be recovered by precipitation using ethanol or isopropanol.
  • There are also commercial kits available for the isolation of nuclear DNA. The choice of a method depends on many factors including, for example, the amount of sample, the required quantity and molecular weight of the DNA, the purity required for downstream applications, and the time and expense.
  • the DNA sample is circulating cell-free DNA (cfDNA), which is DNA found in the blood and is not present within a cell.
  • the cfDNA can be isolated from blood or plasma using methods known in the art.
  • Kits are available for isolation of cfDNA including, for example, the Circulating DNA Kit by Qiagen, N.V.
  • sonication instruments by Covaris, LLC (Woburn, MA) are commercially available and are acoustic devices for breaking DNA into 100 bp - 5 kb fragments. Covaris also manufactures tubes (e.g., gTubes) which will process samples in the 6-20 kb range for Mate-Pair libraries.
  • Another example is the Bioruptor® by Diagenode, LLC (Denville, NJ), a sonication device utilized for shearing chromatin, DNA and disrupting tissues. Small volumes of DNA can be sheared to 150 bp - 1 kb in length.
  • the Digilab Hydroshear® by Thermo Fisher Scientific is another example and utilizes hydrodynamic forces to shear DNA.
  • DNA may be treated with DNase I, or a combination of maltose binding protein (MBP)-T7 Endo I and a non-specific nuclease such as Vibrio vulnificus nuclease (Vvn).
  • DNA may be treated with NEBNext® dsDNA Fragmentase from New England Biolabs, Inc. (Ipswich, MA).
  • NEBNext® dsDNA Fragmentase generates dsDNA breaks in a time-dependent manner to yield 50-1,000 bp DNA fragments depending on reaction time.
  • NEBNext® dsDNA Fragmentase contains two enzymes: one randomly generates nicks on dsDNA, and the other recognizes the nicked site and cuts the opposite DNA strand across from the nick, producing dsDNA breaks. The resulting DNA fragments contain short overhangs, 5'-phosphates, and 3'-hydroxyl groups. [0142] In some instances, the DNA sample is fragmented into specific size ranges of target fragments.
  • the DNA sample may be fragmented into fragments in the range of about 25-100 bp, about 25-150 bp, about 50-200 bp, about 25-200 bp, about 50-250 bp, about 25-250 bp, about 50-300 bp, about 25-300 bp, about 50-500 bp, about 25-500 bp, about 150-250 bp, about 100-500 bp, about 200-800 bp, about 500-1300 bp, about 750-2500 bp, about 1000-2800 bp, about 500-3000 bp, about 800-5000 bp, or any other size range within these ranges.
  • the DNA sample may be fragmented into fragments of about 50-250 bp. In some instances, the fragments may be larger or smaller by about 25 bp.
  • the fragments are treated to produce blunt ends that are compatible with ligation to a first adapter having a compatible blunt end. Any convenient method for producing blunt ends may be employed, including treatment with one or more (e.g., E. coli Exonuclease III) and/or performing a fill- No limitation in this regard is intended.
  • the target sequence used in SBX sequencing may be a sample of double stranded nucleic acid fragments (generated by the above-described process) overhangs).
  • nucleic acid fragments can be of any size or size range and can include DNA, RNA, DNA-RNA hybrids (e.g., molecules produced by first-strand synthesis during cDNA preparation have one mRNA strand and one complementary DNA strand), genomic DNA, cDNA, mRNA, tRNA, etc.
  • paired-end and duplex may be used interchangeably as they relate to template constructs for Xpandomer synthesis.
  • Sequencing of the single, contiguous Xpandomer that incorporates the features of copies of the paired-end template provides duplexed reads of the original nucleic acid target fragments.
  • the paired-end Xpandomer template constructs can be single nucleic acid chains that each have the following structure: adapter region 1, sense (i.e., forward) nucleic acid strand of the target fragment, adapter region 2, anti-sense (i.e., reverse) nucleic acid strand of the target fragment, adapter region 3.
  • adapter region 2 forms a classic “hairpin” structure in which the stem portion of the hairpin adapter is double stranded and is ligated to one end of the double stranded nucleic acid target fragment.
  • Form (SEQ ID NO.): wt DPO4 DNA polymerase (SEQ ID NO: 1). Amino Acid Sequence: MIVLFVDFDYFYAQVEEVLNPSLKGKPVVVCVFSGRFEDSGAVATANYE ARKFGVKAGIPIVEAKKILPNAVYLPMRKEVYQQVSSRIMNLLREYSEKIEI ASIDEAYLDISDKVRDYREAYNLGLEIKNKILEKEKITVTVGISKNKVFAKIA ADMAKPNGIKVIDDEEVKRLIRELDIADVPGIGNITAEKLKKLGINKLVDTL SIEFDKLKGMIGEAKAKYLISLARDEYNEPIRTRVRKSIGRIVTMKRNSRNL EEIKPYLFRAIEESYYKLDKRIPKAIHVVAVTEDLDIVSRGRTFPHGISKETAY SESVKLLQKILEEDERKIRRIGVRFSKFIEAIGLDKFFDT
  • C7326
  • a variant of DPO4 polymerase suitable for the practice of the present invention may be a variant that is at least 85% identical to SEQ ID NO: 3.
  • C. Example Sequencing Operation Using AC Signal [0153] In some embodiments, once an Xpandomer is introduced to the nanopore cell, during a “bright period,” Xpandomer molecules are captured and begin to translocate through the nanopore due to a combination of both baseline and TCE applied voltage pulses.
  • Baseline voltages are sufficient to read the tag code at each XNTP position, and the short, higher voltage TCE pulses are designed to overcome the energetic barrier associated with a TCE.
  • each TCE pulse results in translocation past a single TCE barrier, thus moving the Xpandomer further into the pore in the forward direction by an amount of one “base” position.
  • applied voltage patterns are designed so that there are a fixed number of TCE pulses during each bright period, which cause the Xpandomer to translate in the “forward” direction by a number of bases, corresponding to the number of TCE pulses, or until the Xpandomer fully translocates, and is released into the fluidic “trans” chamber below the membrane.
  • an Xpandomer molecule may not fully translocate prior to the end of a single bright period. This may happen due to the molecule being captured late in the bright period and having an Xpandomer length with more base positions than there are TCE pulses remaining in the bright period.
  • a molecule can get stuck while attempting to translocate in the forward direction for a variety of reasons. There may be a base position which has a defect (such as a failed cleavage event) which makes it impossible or very difficult for the molecule to translocate past that point. In such circumstances, and for other reasons, an Xpandomer may not be able to fully translocate during the bright period, regardless of the number of TCE pulses in a bright period. In such situations, it may be observed that a number of base positions in the beginning of the read are sequenced and generate the expected signal levels, until the defective position is reached. The last tag code level located just before the defect can then be observed for the remainder of the bright period.
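  • The three outcomes described above (full translocation, late capture with too few remaining pulses, and stalling at a defect) can be summarized in a toy model in which each TCE pulse advances the molecule by exactly one base position. The function name and the numeric values are hypothetical and chosen only for illustration.

```python
def bases_read_in_bright_period(read_length, pulses_remaining, defect_position=None):
    """Return (bases_advanced, fully_translocated) for one bright period.

    defect_position is the 0-based index of a position the molecule cannot
    pass (e.g., a failed cleavage event), or None if there is no defect.
    Assumption: one TCE pulse always advances the molecule by one position.
    """
    blocked_at = read_length
    if defect_position is not None:
        blocked_at = min(defect_position, read_length)
    advanced = min(blocked_at, pulses_remaining)
    return advanced, advanced == read_length


print(bases_read_in_bright_period(150, 200))                      # fully translocates
print(bases_read_in_bright_period(150, 40))                       # captured late
print(bases_read_in_bright_period(150, 200, defect_position=60))  # stalls at a defect
```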
  • FIG.5A shows an example of a single molecule-multi-molecular trace (SM3T) event with a typical sequencing by expansion waveform.
  • the graph shows time in seconds on the x-axis.
  • the graph shows voltage readings on the y-axis.
  • Other electrical measurements may be used instead of voltage, including voltage equivalents (e.g., ADC counts) or current.
  • Dark periods 504 and 508 are normal dark periods, where the pore is clear.
  • Bright period 512 shows signals 516a and 516b of molecule 1 and molecule 2, respectively.
  • Signal 520 shows a molecule during a bright period. The event shows that the molecule gets stuck in the pore and does not clear over several cycles (dark periods 524, 528, 532 and signals 536, 540, and 544 in bright periods). Eventually, the molecule clears in a dark cycle, as indicated by the change from signal 548a to signal 548b (when the molecule clears). This event may result from properties of the Xpandomer’s leader segment, which create difficulty in the leader translocating in the reverse direction.
  • FIG.5B illustrates a possible mechanism behind the trace in FIG.5A. Diagram 552 shows a bright period. The translocation direction is downward.
  • Xpandomer molecules can be designed with properties in the leader portion of the Xpandomer that cause the leader to behave differently in the forward and reverse directions. During the bright period (forward direction), the leader may have characteristics that allow the leader to be captured into the pore from the cis side with relatively high capture rates under reasonably applied voltages.
  • the leader may protrude from the underside of the pore (trans side of membrane), as TCE pulses cause the molecule to process steadily through the pore.
  • During the dark period (reverse direction), the molecule should begin to translocate in the reverse direction under a negative applied voltage.
  • the leader may remain on the trans side of the barrel.
  • a base call for each position in a nucleic acid molecule is generated by measuring a unique signal for each individual subunit of the Xpandomer.
  • a simple example of base calling can compare the signal value to a plurality of cutoff (threshold) values, where the value falling in a corresponding range can indicate a particular base.
  • Machine learning techniques can also be used.
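  • The simple cutoff-based approach can be sketched as below. The cutoff values, the normalized signal scale, and the ordering of bases by increasing signal (A, C, G, T) are assumptions made for illustration; the function name is hypothetical.

```python
from bisect import bisect_right


def call_base(signal, cutoffs, bases="ACGT"):
    """Map a signal value to a base using sorted cutoff (threshold) values.

    cutoffs must be sorted ascending and contain len(bases) - 1 values; the
    range a signal falls into selects the base at the same index in bases.
    """
    if len(cutoffs) != len(bases) - 1:
        raise ValueError("need exactly one cutoff between each pair of adjacent bases")
    return bases[bisect_right(cutoffs, signal)]


# Hypothetical cutoffs on a normalized 0..1 signal scale.
CUTOFFS = [0.25, 0.50, 0.75]
print("".join(call_base(s, CUTOFFS) for s in [0.1, 0.3, 0.6, 0.9]))
```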
  • each nanopore cell (e.g., nanopore cell 150 described with respect to FIG.1) produces a new datapoint (voltage/current change) at a kilohertz or higher rate.
  • waveform e.g., electrical signal type
  • multiple datapoints may be generated per base or just one datapoint per base.
  • a direct current (DC) signal can be applied to the nanopore cell (e.g., so that the direction at which the nucleic acid molecule moves through the nanopore is not reversed).
  • an alternating current (AC) waveform can reduce the electro-migration to avoid these undesirable effects, and therefore an AC waveform may instead be used.
  • the AC signal can recharge electrochemically the capacitor and the electrochemical cell at the bottom of the well.
  • Suitable conditions for measuring a change in an electrical property that results from the passage of a molecule through the nanopores are known in the art and examples are provided herein.
  • the measurement may be carried out with a voltage applied across the membrane and pore.
  • the voltage used may range from -400 mV to +400 mV.
  • the voltage used is preferably in a range having a lower limit selected from -400 mV, -300 mV, -200 mV, -150 mV, -100 mV, -50 mV, -20 mV, and 0 mV, and an upper limit independently selected from +10 mV, +20 mV, +50 mV, +100 mV, +150 mV, +200 mV, +300 mV, and +400 mV.
  • the voltage used may be more preferably in the range of 100 mV to 240 mV and most preferably in the range of 160 mV to 240 mV.
  • ADC analog-to-digital converter
  • an AC voltage signal is applied across the nanopore at, e.g., about 100 Hz, and an acquisition rate of the ADC can be about 2000 Hz per cell.
  • Data points corresponding to one cycle of the AC waveform may be referred to as a set.
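  • By way of example only, with an AC frequency of about 100 Hz and an ADC acquisition rate of about 2000 Hz per cell, each AC cycle yields about 20 datapoints. The sketch below groups a stream of ADC samples into such sets; the function name and the choice to drop a trailing incomplete cycle are assumptions.

```python
AC_FREQ_HZ = 100     # approximate AC frequency from the text
ADC_RATE_HZ = 2000   # approximate ADC acquisition rate per cell


def split_into_sets(samples, ac_freq_hz=AC_FREQ_HZ, adc_rate_hz=ADC_RATE_HZ):
    """Group consecutive ADC samples into sets, one set per AC cycle."""
    per_cycle = adc_rate_hz // ac_freq_hz          # 20 datapoints per cycle
    n_full = len(samples) // per_cycle             # ignore a partial last cycle
    return [samples[i * per_cycle:(i + 1) * per_cycle] for i in range(n_full)]


sets = split_into_sets(list(range(50)))
```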
  • the ADC signals may be processed by a base calling algorithm (e.g., neural network, other machine learning model, or statistical model) to determine the corresponding sequence of bases in a nucleic acid molecule.
  • a quality score can be determined.
  • a low-quality base can result when there is an equal or similar probability between two bases, e.g., near an edge of a cutoff value separating two bases or similar probability from an ML model.
  • Knowledge of a low-quality score can be combined with knowledge of which bases have similar signal values (e.g., the bases could have increasing signal values in the order of A, C, G, and T, with T having the highest signal value). Other orders can be used.
  • if a base call for T has a low-quality score, then it can be surmised that a likely other base is a G.
  • Such a determination can also be known when the base caller uses more complex techniques. This information can be used when determining an intramolecular consensus read.
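  • As one hedged illustration of how a low-quality call can point to a likely alternative base, the sketch below returns the bases adjacent to a called base in the signal ordering. The A, C, G, T ordering is the example ordering from the text; the function name is hypothetical.

```python
SIGNAL_ORDER = "ACGT"  # example ordering, lowest to highest signal value


def likely_alternatives(base: str, order: str = SIGNAL_ORDER):
    """Return the bases whose signal ranges border that of `base`."""
    i = order.index(base)
    return [order[j] for j in (i - 1, i + 1) if 0 <= j < len(order)]


alternatives_for_t = likely_alternatives("T")
```

Such alternatives could be carried alongside the base call and consulted when an intramolecular consensus read is determined.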
  • the error rate of the polymerase used to synthesize Xpandomer surrogate molecules is directly related to the state of the target DNA molecule.
  • the state of the target DNA molecule is influenced by many factors such as whether the stretch of the target DNA molecule within proximity to the polymerase active site is in single or double stranded form. Error rates are also influenced by local sequence context itself (i.e. kmer context).
  • the synthesis of each of the various HDD DNA constructs outlined below in sections III-A, III-B, and III-C has its own error rate that can be anticipated, for example, based on whether the target HDD DNA construct is expected to be (or have been) in a single vs. double stranded state.
  • Two pass HDD Read Constructs comprise the parent plus and parent minus strands of a single insert molecule (also referred to as a target molecule).
  • FIG.6A shows an exemplary two pass HDD read construct and its 3-step synthesis.
  • the first step involves ligating the double stranded target molecule to a Y-adapter that is on a solid support (indicated by gray box).
  • the Y-adapter is ligated onto the 3’ end of the parent minus (Parent -) strand and the 5’ end of the parent plus (Parent +) strand.
  • the hairpin adapter is ligated to the 5’ end of the parent minus strand and the 3’ end of the parent plus strand.
  • the strand-adapter complexes are dissociated from molecules that do not have a hairpin followed by a wash.
  • the final step involves releasing the two pass HDD read construct from the solid support beads.
  • FIG.6B shows an example of a fully synthesized two pass HDD read construct.
  • the construct undergoes Xpandomer synthesis, SBX, and base calling.
  • the structure of the Xpandomer makes it easy to keep track of all the segments that comprise the Xpandomer molecule such as the portion of the Y-adapter ligated to the 3’ end of the parent minus strand (e.g., Xpandomer primer region (pink), runway-SID sequence (orange), and Stem (light blue)), the synthetic Xpandomer polymer of the target molecule (e.g., daughter minus (purple), hairpin sequence (red), daughter plus (purple)), and the portion of the Y-adapter ligated to the 5’ end of the parent plus strand (e.g., stem minus (light blue) and blocker sequence to stop Xpandomer synthesis (green)).
  • the portion of the Y-adapter ligated to the 3’ end of the parent plus strand e
  • FIG.7 shows example error types that can occur during either sequencing or synthesis of the Xpandomer molecule.
  • a first example is a biological mutation (e.g., single nucleotide polymorphism (SNP)) where the reference sequence has a C in the 7th position, but the parent plus sequence of the original insert DNA sequence instead has a T. The SNP mutation is carried through during Xpandomer synthesis and is consistently detected across the parent-daughter pairs.
  • a second error type example shown is DNA damage (e.g., 8-Oxoguanine) where the G in the 21st position of the parent minus strand is indicated by a G*.
  • 8-oxoguanine occurs at a rate of ~0.02-0.8 x 10^-6 in human primary and cancer cells, and as high as 10^-5 to 10^-4 in prepped samples.
  • 8-oxoguanine causes a G→T mutation, which is observed in the parent-daughter minus pair where the daughter minus Xpandomer sequence has an A instead of the expected C observed in the parent-daughter plus strands.
  • a third example is the rate of random error incorporation during the synthesis of the daughter plus and minus Xpandomer sequences. As shown, there are four random errors that occur on both the daughter plus and minus Xpandomer sequences.
  • a total raw read error rate of ~1% is assumed.
  • the rate at which two random errors (one in each subread) could occur at the same consensus position (e.g., the A at the 47th position of the daughter plus Xpandomer sequence and the T at the 47th position of the daughter minus Xpandomer sequence) is a rate of about (0.01)^2 or 10^-4.
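  • The arithmetic can be checked with a short sketch, taking the assumed raw read error rate of about 1% as a per-position probability of 0.01 in each subread and treating errors in the two subreads as independent. Both are simplifying assumptions; the sketch also ignores the further requirement that the two errors yield complementary bases.

```python
def coincident_error_rate(raw_error_rate: float) -> float:
    """Probability that both subreads carry a random error at one position,
    assuming the two errors are independent."""
    return raw_error_rate ** 2


rate = coincident_error_rate(0.01)  # about 1e-4 under the stated assumptions
```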
  • FIG.8D the structure of the four pass HDD read construct makes it easy to keep track of all the segments that comprise the Xpandomer molecule. Finally, the four pass HDD read construct undergoes adapter segmentation, effectively removing the Y-Open-Hairpin-adapter and the hairpin adapter so only the daughter -/+ strands and the parent -/+ strands are left.
  • FIG.9 shows example error types that can occur during either sequencing or synthesis of the Xpandomer molecule.
  • the hairpin adapter is ligated to the 5’ end of the second parent minus strand and the 3’ end of the first parent plus strand.
  • the strand-adapter complexes are dissociated from molecules that do not have a hairpin followed by a wash.
  • the four pass HDD read construct is released from the solid support. [0187] FIG.11B shows steps 3 and 4. Following release from the solid support, the four pass HDD read construct is extended with a strand displacing polymerase allowing for complementary daughter minus and daughter plus strands to be synthesized.
  • Consensus will be required on the multiple readouts, where the proximity of two bright cycles with near identical sequences can establish grouping. Alternatively, sequence similarity, or the sequential second pass readout at the immediate beginning of the bright cycle can serve in grouping reads. [0190] For deduplication of readouts without UMI, a positional deduping (genomic position start and end) approach can be used, where the proximity of reads in terms of subsequent bright cycles can be used to eliminate false positive calls. [0191] Once synthesis of the Xpandomer molecule corresponding to each of the above-described HDD read constructs is complete, the Xpandomer is sequenced.
  • Xpandomer sequencing occurs on a sequencing device, like the sequencing device described with respect to section I-A.
  • a solution comprising a plurality of surrogate Xpandomer molecules is loaded onto a nanopore sensor chip (such as the one described in FIG.1).
  • the chip houses thousands of nanopore proteins embedded into a membrane that pass an electrical signal through the nanopore, where changes in electrical signal correspond to the different bases (e.g., XNTP) inserted into the nanopore.
  • raw HDD sequencing data may be stored in sequencing files (e.g., FASTQ files) for downstream processing. IV.
  • FIG.12 shows exemplary HDD reads of different classes that are output during sequencing. Reads which only include the start adapter and one strand orientation are referred to as partial reads. Duplex reads which include a full forward insert, the hairpin adapter and only part of the reverse complement read are referred to as One+ reads. Duplex reads which include full or partial complementary forward and reverse segments and are missing the hairpin adapter are referred to as U-turn reads.
  • a full HDD read includes sequencing of an entire dsDNA template both forward and reverse strands including a start, hairpin and end adapters. Sample IDs and unique molecular identifiers can be assigned to the different adapter segments. Additional read classes are possible, including combinations, e.g. a One+ U-Turn read. [0194]
  • FIG.13A shows an illustration of a hypothetical read structure resulting from the sequencing of a Hairpin-Duplex (HD) construct. The actual read would be a linear string of base calls and an associated linear string of quality scores. Here the linear read string of base calls is depicted as being folded over on itself.
  • FIG.13A shows the target “insert” sequence as a solid line and read segments originating from adapter sequences are shown as dashed lines.
  • FIG.13B shows additional labeling on the read structure.
  • the terms ‘Insert Read 1’ and ‘Insert Read 2’ are used to describe the parts of the overall read, which correspond to the first and second passes of the insert sequence, respectively.
  • Insert Read 1 and Insert Read 2 are both full length passes of the insert segment of the construct, then the sequences contained in them are, or are close to, reverse complements of one another.
  • This is an example of a “Full HDD” read.
  • For Full HDD reads, performing a local or global pairwise alignment between the two insert reads would most often result in alignment with a relatively high alignment score.
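  • As a simplified, non-limiting illustration of the reverse-complement relationship between Insert Read 1 and Insert Read 2 in a Full HDD read, the sketch below reverse complements the second insert read and measures ungapped identity against the first. A practical workflow would use a gapped pairwise alignment; the ungapped comparison and the function names are assumptions for brevity.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def identity_after_rc(insert_read_1: str, insert_read_2: str) -> float:
    """Fraction of matching positions between read 1 and the reverse
    complement of read 2, compared without gaps."""
    rc2 = reverse_complement(insert_read_2)
    n = min(len(insert_read_1), len(rc2))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(insert_read_1, rc2)) / n


identity = identity_after_rc("AACGTC", "GACGTT")
```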
  • FIG.13C shows an example of a “One+ Read.”
  • the subread corresponding to the first pass of an insert i.e. Insert Read 1
  • the subread corresponding to the second pass of an insert i.e. Insert Read 2 may be a partial subread.
  • This category of HDD reads is referred to as “One+ Reads,” given that they include one full insert subread, corresponding to the first pass on the insert segment, plus some additional sequence content in a second subread, from a second pass on the insert.
  • a local pairwise alignment of Insert Read 1 and Insert Read 2 would typically result in a high scoring alignment between the shorter of the two subreads and part of the longer of the two subreads.
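  • The local alignment behavior described for One+ Reads can be sketched with a score-only Smith-Waterman recurrence, which is one standard local alignment technique and is named here as an illustrative choice rather than as the method of any embodiment. The scoring values and example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


insert_read_1 = "ACGTACGTAC"   # full first pass
insert_read_2 = "CGTAC"        # partial second pass, as read off the device
score = local_alignment_score(insert_read_1, reverse_complement(insert_read_2))
```

Here the shorter subread aligns in full to part of the longer subread, so the score equals the match score times the length of the shorter subread.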
  • V. Alignment Techniques to Generate Consensus Sequences [0197] To determine a consensus read from a plurality of reads of different strands and possible daughter copies within the same molecule, the base calls at corresponding positions are determined.
  • the reads can be aligned to each other or via a reference sequence to determine the base calls that correspond to the same position.
  • This sequence alignment can be performed using various software packages, such as (but not limited to) BLAST, FASTA, Bowtie, BWA, BFAST, SHRiMP, SSAHA2, NovoAlign, and SOAP. Described below are several exemplary, and non-limiting, alignment techniques (e.g., reference-based alignment, reference-free alignment, reference guided HDD pair mapping and alignment, and three-way alignment) that may be used to generate intramolecular consensus reads.
  • the first sequence of base calls and the second sequence of base calls may be aligned to a reference genome to determine which base position on one strand corresponds to which base position on another strand. Once the alignment is done, the bases at the same position on each strand can be compared to determine whether they are concordant or discordant.
  • One advantage of this approach is that it allows for accuracy of consensus reads, depending on the motifs present. Further, this approach preserves information for as long as possible, and preserves all of the raw read information until potentially the point of variant calling. An overview of reference-based guided pairwise alignment and its implications are provided below. [0199] At a first stage, demultiplexing, adapter detection, and optional UMI extraction can be performed.
  • the start, mid- and end adapter are detected, and the position of the adapters is annotated.
  • This step is optional, and an alternative approach may involve direct alignment of the HDD read to the reference where adapter trimming will take place post alignment. If sample identifiers (SIDs) or unique molecular identifiers (UMIs) are present these are extracted, trimmed, and annotated at this step.
  • SIDs sample identifiers
  • UMIs unique molecular identifiers
  • HDD subreads (e.g., per strand or for each copy for more than two passes) are aligned to a reference genome (e.g., using BWA MEM or other alignments software) that corresponds to the origin of the sequence sample (e.g., human sequence sample is aligned to a human reference genome).
  • HDD reads can have the same read name but a different subread identifier. Mapping and alignment results can be stored in a BAM alignment file.
  • a subread can be aligned in its entirety or just the two ends of the subread can be aligned, e.g., 60 bases on each end of the read.
  • the reference can include two strands that are perfectly complementary to each other.
  • Each subread can be aligned to the corresponding strand version of the reference.
  • a subread can be converted before alignment by switching to the complementary base, and thus a reference for only one strand is used.
  • An intramolecular consensus can then be determined.
  • intramolecular consensus is done prior to an optional intermolecular UMI based consensus.
  • HDD subreads are processed to form a single HDD consensus read as well as additional streams of data.
  • the merging process involves using the alignment information from both subreads in a pair against a reference sequence (e.g., CIGAR information) to recreate pairwise alignment between the two subreads.
  • the HDD consensus read and associated streams of data store concordant bases and annotates information on discordant bases as described in detail herein.
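  • One hedged way to recreate a pairwise alignment between two subreads from their separate alignments to a reference is sketched below. Each subread's CIGAR string is expanded into a map from reference position to base, and the two maps are compared at shared positions. The sketch assumes both subreads are already expressed in reference orientation, handles only the M, =, X, I, D, and S operations, and discards inserted bases; all function names are hypothetical.

```python
import re


def ref_to_base(seq: str, ref_start: int, cigar: str) -> dict:
    """Map reference positions to read bases ('-' marks a deleted position)."""
    mapping = {}
    read_i, ref_i = 0, ref_start
    for length, op in re.findall(r"(\d+)([MIDS=X])", cigar):
        n = int(length)
        if op in "M=X":                 # consumes read and reference
            for k in range(n):
                mapping[ref_i + k] = seq[read_i + k]
            read_i += n
            ref_i += n
        elif op in "IS":                # consumes read only (dropped here)
            read_i += n
        else:                           # 'D' consumes reference only
            for k in range(n):
                mapping[ref_i + k] = "-"
            ref_i += n
    return mapping


def compare_subreads(m1: dict, m2: dict):
    """Split shared reference positions into concordant and discordant."""
    concordant, discordant = {}, {}
    for pos in sorted(m1.keys() & m2.keys()):
        if m1[pos] == m2[pos]:
            concordant[pos] = m1[pos]
        else:
            discordant[pos] = (m1[pos], m2[pos])
    return concordant, discordant


m1 = ref_to_base("ACGTA", 100, "5M")
m2 = ref_to_base("ACTA", 100, "2M1D2M")
concordant, discordant = compare_subreads(m1, m2)
```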
  • An optional intermolecular consensus can be determined. As an optional step, HDD reads are grouped based on positional deduplication in combination with UMI information to form intermolecular consensus.
  • reference based compression can be performed. Following HDD read segmentation, intramolecular alignment and consensus, a reference-based compression algorithm (see section VI.B for detailed description) may be used to further compress the intramolecular consensus reads thus achieving lower data rates in situations with limited bandwidth available for data transfer. [0204] In some embodiments, realignment of consensus reads can also be performed.
  • consensus reads are realigned using different alignment parameters to generate an output consensus alignment (currently a BAM file).
  • Realignment may offer better or alternative matching alignments of the consensus reads and may use a different reference genome. For instance, realignment may consider a graph or more detailed reference genome and depend on higher accuracy reads as input.
  • Variant calling can then be performed.
  • Variants of consensus reads can be determined in relation to a reference genome.
  • B. Reference-Free Alignment [0206]
  • the first sequence of base calls and the second sequence of base calls may be aligned to each other via the adapter segments, thus alignment to the reference genome does not occur.
  • a consensus read can be formed from the information present in the original full HDD read.
  • demultiplexing, adapter detection and optional UMI extraction can be performed. Initially, for each HDD read, the start, mid and end adapter are detected, and the position of the adapters is annotated. If SIDs or UMIs are present these are extracted, trimmed, and annotated at this step. This step is optional, and an alternative approach considers rough splitting of HDD reads by half, where UMI and SID extraction takes place after intramolecular alignment and consensus.
  • intramolecular alignment can be performed. Mapping and alignment results can be stored in a BAM alignment file.
  • pairwise intramolecular alignment involves alignment of the first and second reverse complementary sections of an HDD read construct.
  • For Four-Pass constructs, or any other number of passes greater than two, multiple sequence alignment, or the equivalent, should be performed on the additional three or more insert passes.
  • Alignment and comparison results may be stored using partial order alignment or through one or more of a variety of lossy and lossless compression embodiments, examples of which are provided below.
  • Methylation sequences may require different alignment parameters (e.g., alignment penalties, scoring, etc.) depending on the use of conversion steps for methylation detection or the inclusion of wobble nucleotides or conversion to facilitate processivity through certain nucleotide motifs.
  • UMI and SID extraction as well as adapter trimming may take place in this step, where concordant read 1, read 2, UMI, and SID bases are annotated, facilitating high accuracy detection of shorter UMI and SID sequences in comparison to raw SBX reads.
  • reference based compression can be performed. Following HDD read segmentation, intramolecular alignment, and consensus, a reference-based compression algorithm (see section VI.B for detailed description) may be used to further compress the intramolecular consensus reads thus achieving lower data rates in situations with limited bandwidth available for data transfer.
  • consensus read mapping and alignment can be performed. Following formation of consensus reads, the consensus reads and the associated data are mapped and aligned to a reference genome. Alignment may use information on pairwise discordant and concordant bases. Subreads may require realignment and/or recovery of original subreads to improve alignment to a reference. Such realignment may be local or encompass the entire subreads.
  • an optional intermolecular consensus can be determined.
  • HDD intramolecular consensus reads are grouped based on positional deduplication in combination with UMI information to form intermolecular consensus.
  • Intramolecular consensus reads contain information that may be leveraged in intermolecular consensus.
  • Intramolecular consensus reads may require realignment and/or recovery of original subreads to improve intermolecular consensus calls. Such realignment may be local or encompass the entire subreads.
  • Variant calling can then be performed. Variants of consensus reads are determined in relation to a reference genome.
  • FIG.14 shows an example of an alignment object, resulting from the alignment between an example read 1 string and a read 2 string for reads of length equal to 100 base pairs.
  • the row names (‘read 1’ and ‘read 2’) are not necessarily part of the alignment object, nor are the column names (numbers 0 through 100, in this case), but are included in the above figure for ease of viewing.
  • three discordant positions can be seen, and are highlighted in red. Namely, there is one substitution, one insertion and one deletion within read 2 relative to read 1.
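  • As a minimal sketch of how discordant positions in such an alignment object could be classified, the code below walks two gapped rows column by column and labels each difference in read 2 relative to read 1. The gap symbol, the short example rows, and the function name are assumptions and do not reproduce the content of FIG.14.

```python
def classify_columns(read1_row: str, read2_row: str, gap: str = "-"):
    """List (column, event, detail) for each discordant alignment column,
    describing read 2 relative to read 1."""
    if len(read1_row) != len(read2_row):
        raise ValueError("aligned rows must have equal length")
    events = []
    for col, (a, b) in enumerate(zip(read1_row, read2_row)):
        if a == b:
            continue                                  # concordant column
        if a == gap:
            events.append((col, "insertion", b))      # extra base in read 2
        elif b == gap:
            events.append((col, "deletion", a))       # base missing in read 2
        else:
            events.append((col, "substitution", f"{a}>{b}"))
    return events


events = classify_columns("ACGT-ACGTA", "ACCTTAC-TA")
```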
  • D. Reference Guided HDD Pair Mapping and Alignment [0216] Higher throughput and lower raw read accuracy place constraints on real-time performance for an SBX end-to-end workflow. Further, real-time mapping and alignment to the human genome are challenging. To overcome this challenge, one exemplary embodiment may include exporting, from a base calling station, the demux (e.g., demultiplexed reads) and trimming information in a file, such as hdf5 (station or on-prem). HDD reads are maintained as a continuous sequence annotated at start and end insert positions.
  • Mapping of HDD reads in tandem utilizes longer and sparser seed matching compared to single end SBX reads from either forward or reverse reads (on-prem or cloud).
  • Deduping e.g., deduplication
  • consensus calling on-prem or cloud
  • HDD reads are expected to align to the exact same locations in the genome supporting the use of seeds matching either read 1 or read 2.
  • the algorithmic approach to this process includes: (i) using longer seeds matching either the reverse or forward read; (ii) mapping only on seeds concordant between the first and second subread; (iii) mapping only on more than one shorter concordant seed match; and (iv) on unmapped and/or low mapping quality reads, using pairwise alignment to provide high accuracy reference free base calls on HDD reads.
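  • Item (ii) above can be illustrated, in a hedged and simplified form, by collecting the k-mer seeds that appear in both subreads once the second subread has been brought into the orientation of the first. The seed length, example sequences, and function names are assumptions, and a practical mapper would additionally look the retained seeds up in a reference index.

```python
def kmer_set(seq: str, k: int) -> set:
    """All k-mers present in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def concordant_seeds(read1: str, read2_in_read1_orientation: str, k: int) -> set:
    """Seeds supported by both subreads; discordant seeds are excluded."""
    return kmer_set(read1, k) & kmer_set(read2_in_read1_orientation, k)


# The second subread differs at one base, which removes every seed
# overlapping that base from the concordant set.
seeds = concordant_seeds("ACGTTGCAAT", "ACGTTGGAAT", k=5)
```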
  • the benefits include maintenance of all base call information for pair enabling improved deduped consensus utilizing matches by only one of two HDD subreads. Also, this approach utilizes reference guided alignment of both read pairs reducing the impact of discordance in alignment between reads in consensus.
  • the three-way alignment algorithm comprises the following steps: 1) Each read is mapped to the reference genome.
  • each read is the prediction from an independent classifier.
  • Classifier 1 and Classifier 2 give different predictions (e.g. read 1 and read 2)
  • Bayes' theorem can be used to calculate the probability that Classifier 1 is correct. Let's denote: 1) P(C1) as the probability that Classifier 1 is correct, which is 0.99; 2) P(C2) as the probability that Classifier 2 is correct, which is 0.90; 3) P(D) as the probability that the classifiers disagree; and 4) P(C1|D) as the probability that Classifier 1 is correct given that there is a disagreement.
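  • A worked example under stated assumptions may help. The sketch below assumes the two classifiers err independently and that, when they disagree, exactly one of them is correct (the case where both are wrong with different predictions is neglected). Under those assumptions, P(D) = P(C1)(1 - P(C2)) + (1 - P(C1))P(C2), and Bayes' theorem gives P(C1|D) = P(C1)(1 - P(C2)) / P(D). The function name is hypothetical.

```python
def p_c1_given_disagreement(p_c1: float, p_c2: float) -> float:
    """P(Classifier 1 correct | classifiers disagree), assuming independent
    errors and that exactly one classifier is correct when they disagree."""
    c1_right_c2_wrong = p_c1 * (1.0 - p_c2)
    c1_wrong_c2_right = (1.0 - p_c1) * p_c2
    p_d = c1_right_c2_wrong + c1_wrong_c2_right
    return c1_right_c2_wrong / p_d


# With P(C1) = 0.99 and P(C2) = 0.90: 0.099 / (0.099 + 0.009), about 0.917.
posterior = p_c1_given_disagreement(0.99, 0.90)
```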
  • the machine learning model may recognize patterns for quality of consensus not only based on whether a particular position is concordant or discordant, but also based on other features of the aligned sequences.
  • the features can include the other base calls near the concordant/discordant position (a.k.a. k-mer information), the raw quality score of the base caller, read orientation, or various other related features.
  • the machine learning algorithm can be a neural network that processes two or more aligned reads and generates a vector of quality scores for each position of the aligned reads.
  • Such a model can be trained to generate a continuous range of quality scores, e.g., between 0 and 50, 60, 70, or more, for consensus calls at both concordant and discordant positions.
  • one such neural model is a convolutional neural network (CNN) that takes aligned reads in terms of A, C, G, T and Gap and their respective quality scores (0 for gaps) as input and outputs the consensus calls and corresponding quality scores for each consensus call position.
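  • As a non-limiting sketch of how aligned reads might be presented to such a network, the code below encodes each aligned symbol as a one-hot vector over A, C, G, T, and Gap, followed by its quality score, with 0 used at gaps. The channel layout and function name are assumptions; the network itself is not shown.

```python
CHANNELS = "ACGT-"   # assumed channel order; '-' denotes a gap


def encode_alignment(rows, quals):
    """Encode aligned reads as nested lists shaped
    [read][column][5 one-hot channels + 1 quality channel]."""
    tensor = []
    for row, row_quals in zip(rows, quals):
        encoded_row = []
        for symbol, q in zip(row, row_quals):
            one_hot = [1.0 if symbol == c else 0.0 for c in CHANNELS]
            quality = 0.0 if symbol == "-" else float(q)   # 0 for gaps
            encoded_row.append(one_hot + [quality])
        tensor.append(encoded_row)
    return tensor


features = encode_alignment(
    ["AC-T", "ACGT"],
    [[30, 20, 0, 25], [28, 22, 15, 27]],
)
```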
  • CNN convolutional neural network
  • another such neural network adopts an architecture similar to U-Net, which is widely used in image segmentation applications and is described at least in Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation,” Computer Vision and Pattern Recognition, arXiv:1505.04597 (2015), which is hereby incorporated by reference in its entirety.
  • the neural networks described above can be implemented as part of an alignment workflow when making intramolecular consensus reads.
  • Lossless Compressions for Consensus Sequence [0230] Lossless compression allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Embodiments can treat concordant positions differently than discordant positions. Additionally, a reference-based compression can be used.
  • A. Partial Order Alignment [0231] Pairwise alignment and consensus results between HDD read 1 and read 2 can be ambiguous. This may be more often the case on homopolymers (as described in section VI.C below) and on tandem repeats, which are of clinical relevance in cancer microsatellite instability detection and clinical variant calling.
  • discordant positions may be resolved later, e.g., using intermolecular consensus.
  • the partial order alignment generates a sorted DAG (directed acyclic graph), which can be sorted and exported as a linear sequence encoding.
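  • The export of a directed acyclic graph as a linear sequence encoding can be illustrated, in simplified form, with a topological sort. The toy graph below represents two reads, A-C-G-A and A-C-T-A, that share all nodes except for a single-base bubble; the node numbering and graph layout are assumptions, and the sketch uses the Python standard library `graphlib` module rather than any particular partial order alignment tool.

```python
from graphlib import TopologicalSorter

# Node id -> base. Nodes 2 (G) and 3 (T) form the discordance bubble.
nodes = {0: "A", 1: "C", 2: "G", 3: "T", 4: "A"}

# Node id -> set of predecessor node ids.
predecessors = {1: {0}, 2: {1}, 3: {1}, 4: {2, 3}}

# A topological order places every node after all of its predecessors,
# so the graph can be written out as a linear sequence of nodes.
order = list(TopologicalSorter(predecessors).static_order())
linear_encoding = "".join(nodes[n] for n in order)
```

The relative order of the two bubble nodes is not fixed by the graph, which reflects the ambiguity that the partial order alignment retains.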
  • DAG directed acyclic graph
  • the ambiguity in alignment will be maintained, enabling the assignment of low-quality scores to discordance bubbles vs. single base mismatches. Such will be the case on homopolymers where exact positions of indels cannot be assigned.
  • the HDD pairs can contain information about their relative alignment in addition to the single pair discordance. This information can be reused ‘as is’ in downstream variant calling.
  • partial order alignment can maintain ambiguity in alignment, which is beneficial in determining concordant read 1 read 2 vs. discordant read 1 read 2 base positions in the read and the confidence in making such calls.
  • partial order alignment export is expected to result in a similar compression ratio on HDD vs. raw reads as pairwise alignment and assignment of discordant bases, that is ~40% when not considering the benefits coming from extraction and classification of adapters and UMIs, if present.
  • Example Encoding [0234] Table 2 below shows examples of lossless encoding of pairwise concordant and discordant calls following pairwise alignment. The examples below merely illustrate the number of possible values for concordant positions and discordant positions, as opposed to a compression technique.
  • concordant positions can be represented using less data than the discordant positions, potentially using only two bits.
  • 00 can be A; 01 can be C; 10 can be G; and 11 can be T.
  • Other representations are possible that use more bits.
  • For the discordant positions there are at least twelve possible discordant values, with eight more for a total of twenty if indels are taken into account. Thus, such positions can be represented using 4 bits if only the first 12 possible discordant values are used or 5 bits if all 20 possible discordant values are used.
  • additional metadata can be used to decompress such a compressed sequence, where not all of the positions are represented with the same amount of data.
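  • The bit counts above can be checked with a short sketch. It enumerates the twelve ordered pairs of differing bases and, as an assumption about how the eight indel values arise, pairs each of the four bases with a gap in either read. It also packs a run of concordant bases at two bits per base using the example code assignment from the text.

```python
from itertools import permutations
from math import ceil, log2

BASES = "ACGT"
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

substitutions = list(permutations(BASES, 2))                    # 12 pairs
indels = [(b, "-") for b in BASES] + [("-", b) for b in BASES]  # 8 (assumed)

bits_substitutions_only = ceil(log2(len(substitutions)))            # 4 bits
bits_with_indels = ceil(log2(len(substitutions) + len(indels)))     # 5 bits


def pack_concordant(seq: str) -> int:
    """Pack concordant bases into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | TWO_BIT[base]
    return value


packed = pack_concordant("ACGT")   # bit pattern 00 01 10 11
```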
  • FIG.15 shows a flowchart illustrating method 1500 for determining a partial order consensus sequence of a double-stranded nucleic acid molecule.
  • Another set of concordant positions can match the reference, where all those positions can be assumed to be the same as the reference, with only the non-matching concordant positions being identified in metadata as to where those positions are. Regardless of the type of partial consensus sequence generated (e.g., full hairpin duplex, partial, “one+”, “U-turn”, or any combination thereof), such sets of concordant/discordant positions can be determined.
  • the first set of concordant positions and the second set of discordant positions are identified by aligning the first sequence of base calls to the second sequence of base calls (e.g., aligning the base calls to each other).
  • the genomic coordinate indication can include a starting genomic coordinate of the first sequence of base calls and metadata specifying the concordant positions that do not match the reference genome. This method is referred to as reference-based alignment and is described in more detail in section V-A.
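  • A minimal, substitution-only sketch of such a genomic coordinate indication follows. A sequence of base calls is stored as a starting coordinate, a length, and a list of positions that do not match the reference, from which the original sequence can be recovered. The record layout and function names are hypothetical, and indels are not handled.

```python
def encode_vs_reference(read: str, reference: str, start: int) -> dict:
    """Store only the start, length, and bases that differ from the reference."""
    diffs = [(i, b) for i, b in enumerate(read) if reference[start + i] != b]
    return {"start": start, "length": len(read), "diffs": diffs}


def decode_vs_reference(record: dict, reference: str) -> str:
    """Rebuild the read from the reference and the stored differences."""
    s, n = record["start"], record["length"]
    bases = list(reference[s:s + n])
    for offset, base in record["diffs"]:
        bases[offset] = base
    return "".join(bases)


reference = "TTACGTACGTAA"
record = encode_vs_reference("ACGAAC", reference, start=2)
restored = decode_vs_reference(record, reference)
```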
  • Each of the first set of concordant positions may be represented by a concordant value of a first group of four concordant values.
  • the first group of four concordant values is specified using two binary bits and includes A<->T, C<->G, G<->C, and T<->A. Accordingly, each concordant value represents a concordant pair of bases between the first strand and the second strand of the double-stranded nucleic acid molecule.
  • each discordant value represents a discordant pair of bases between the first strand and the second strand.
  • the second group of at least twelve discordant values includes at least twenty discordant values (accounting for insertions and deletions) and the at least twenty discordant values may be specified using five binary bits.
  • the partial consensus sequence is generated using: (1) the concordant values at the first set of concordant positions; and (2) the discordant values at the second set of discordant positions.
  • the partial consensus sequence may not be for the whole double-stranded nucleic acid molecule, e.g., when reference-based compression is used. In such situations, the partial consensus sequence can correspond to the concordant positions that do not match the reference and to the discordant positions.
  • Generating the partial consensus sequence can include using metadata that specifies the second set of discordant positions (e.g., using headers that indicate which positions are discordant).
  • the metadata allows for the concordant values for the first set of concordant positions, and the discordant values for the second set of discordant positions, to be used to recover the base calls of the first sequence and the second sequence at the first set of concordant positions and the second set of discordant positions.
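  • The lossless recovery described above can be sketched as a round trip. Two aligned rows of equal length, with the second row already expressed in the orientation of the first and gaps written as '-', are split into a concordant stream, a discordant stream, and metadata listing the discordant positions; decoding rebuilds both rows exactly. The stream layout and function names are assumptions.

```python
def encode_pair(row1: str, row2: str):
    """Split two aligned rows into concordant values, discordant values,
    and metadata giving the discordant positions."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    concordant, discordant, positions = [], [], []
    for pos, (a, b) in enumerate(zip(row1, row2)):
        if a == b:
            concordant.append(a)
        else:
            positions.append(pos)
            discordant.append((a, b))
    return "".join(concordant), discordant, positions


def decode_pair(concordant: str, discordant, positions):
    """Recover both aligned rows from the two streams and the metadata."""
    conc, disc, marked = iter(concordant), iter(discordant), set(positions)
    row1, row2 = [], []
    for pos in range(len(concordant) + len(discordant)):
        a, b = next(disc) if pos in marked else (next(conc),) * 2
        row1.append(a)
        row2.append(b)
    return "".join(row1), "".join(row2)


streams = encode_pair("ACGTAC", "ACCTA-")
recovered = decode_pair(*streams)
```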
  • For two pass HDD read constructs only two reads are generated to determine a consensus sequence and call concordant and discordant bases. In the event a discordant base is called, because there are only two reads, a tie between the bases is reached.
  • the tie in base call may be resolved by leveraging the Q score, or identifying that one of the discordant bases displays a very poor ADC signal. In the event the tie cannot be resolved, the information for both discordant bases is preserved. Because of this limitation, two pass HDD read constructs often reach the upper limit of compression that may be achieved. On the other hand, four pass HDD read constructs have four reads compared to only the two reads generated for two pass HDD read constructs. The additional reads significantly decrease the rate at which discordant positions cannot be resolved, thus a greater compression is achieved. [0248] At 1520, the partial consensus sequence is transmitted to a computer system.
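  • The tie handling described above for two pass HDD read constructs can be sketched as follows. When two subreads disagree, the call with the clearly higher Q score is kept; otherwise both bases are preserved. The minimum Q score difference is a hypothetical parameter, and the use of ADC signal quality is not modeled.

```python
def resolve_tie(base1: str, q1: int, base2: str, q2: int, min_q_gap: int = 10):
    """Return a single base when the calls agree or one Q score clearly
    dominates; otherwise return both bases so no information is lost."""
    if base1 == base2:
        return base1
    if abs(q1 - q2) >= min_q_gap:       # hypothetical resolution threshold
        return base1 if q1 > q2 else base2
    return (base1, base2)               # unresolved: preserve both calls


resolved = resolve_tie("G", 35, "T", 12)
unresolved = resolve_tie("G", 20, "T", 18)
```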
  • the computer system may be a computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause the computer system to perform method 1500 for determining a partial order consensus sequence of a double-stranded nucleic acid molecule.
  • the computer system may also comprise one or more processors configured to execute instructions stored on the computer readable medium.
  • FIG.16 is a flow chart illustrating method 1600 to compress a base call sub-stream from the raw read data generated by a sequencing device (e.g., nanopore-based sequencing device).
  • the base call data can include a sequence of base calls (also referred to as a sequence read) for each of the at least 100,000 nucleic acid molecules, or for other numbers of molecules, such as at least 2, 3, 4, 5, 10, 50, 100, 1000, 10,000, 100,000, 500,000, one million or more nucleic acid molecules (in each of the same number of sequencing cells).
  • the base call data comprises the base calls for each position in the sequence read.
  • Method 1600 can be performed for each sequence of base calls corresponding to a respective nucleic acid molecule.
  • the compressing can be of the second sub-stream of base call data described above.
  • the base call data sub-stream stores the sequence of bases in a nucleic acid molecule (e.g., DNA or RNA), referred to hereinafter as sequence read(s).
  • a sequence read in a base call data sub-stream may comprise a nucleic acid sequence as a string of A, T, C, G, U or N’s, where each letter denotes adenine (A), thymine (T), guanine (G), cytosine (C), uracil (U), or not determined or ambiguous (N).
  • the sequence read is aligned relative to a reference sequence to obtain the genomic location information.
  • This sequence alignment can be performed using various software packages, such as (but not limited to) BLAST, FASTA, Bowtie, BWA, BFAST, SHRiMP, SSAHA2, NovoAlign, and SOAP, or the techniques embodied with the software, or other techniques as known to the skilled person.
  • the reference sequence can be a human reference sequence, such as hg18 or hg38.
  • the sequence alignment can generate an identifier that identifies the location within the reference sequence to which the read aligns.
  • the identifier may comprise the genomic start and end locations of the reference sequence on a chromosome (e.g., a human chromosome) from the reference genome (e.g., human genome) to which the sequence read aligns.
  • the alignment position relative to the reference genome may be determined.
  • the first or last aligned position of the read e.g., closest to a 3’ or 5’ end of the reference sequence
  • the read may be a positive strand or a negative strand. A read is considered “positive” strand if a read aligns without reverse complementing the sequence read.
  • An alignment is considered “negative” strand if a sequence read is to be reverse complemented prior to alignment.
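The strand convention above can be sketched in a few lines. In this illustration an exact substring match stands in for a real aligner, which is an assumption made only to keep the example short; the function names are hypothetical.

```python
# Complement table; U is treated like T and N stays N.
_COMPLEMENT = str.maketrans("ACGTUN", "TGCAAN")


def reverse_complement(read: str) -> str:
    """Complement every base, then reverse the read."""
    return read.translate(_COMPLEMENT)[::-1]


def strand_of(read: str, reference: str) -> str:
    """Return '+' if the read matches as-is, '-' if only its reverse
    complement matches, and '?' if neither matches."""
    if read in reference:
        return "+"
    if reverse_complement(read) in reference:
        return "-"
    return "?"
```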
  • Optimal alignment may be determined with the use of any suitable algorithm for aligning sequences, non-limiting examples of which include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, algorithms based on the Burrows-Wheeler Transform (e.g. the Burrows Wheeler Aligner), ClustalW, Clustal X, BLAST (e.g., BLASTn at http://www.ncbi.nlm.nih.gov/), Novoalign by Novocraft Technologies Sdn Bhd (Petaling Jaya, Malaysia), ELAND by Illumina, Inc.
  • the bit string or text that is encoded at the base level can then be compressed in later steps.
  • the encodings include a match, the 4 substitutions, 4 soft clips (the end of a read is not aligned), 4 insertions, and a deletion.
  • the nucleotides in the first portion can be replaced by a start location relative to the reference sequence, a number that shows the length of the portion, and the code that represents a mismatch.
  • the one or more mismatches may then remain as encoded.
  • Any portion of matching sequences may similarly be replaced (i.e., to compress the sequence data) by a start location corresponding to the position of a first matching nucleotide and a length of the portion of matching sequences.
  • the code for a sequence match may or may not be included.
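A minimal sketch of this replacement, assuming a gap-free alignment with substitutions only: runs of matching bases collapse to a start location and a length, while mismatches remain encoded individually. The operation tags `'M'` and `'X'` and the function names are hypothetical; insertions, deletions, and soft clips are left out for brevity.

```python
def encode_against_reference(read: str, reference: str, start: int):
    """Return ('M', ref_start, length) match runs and ('X', ref_pos, base) substitutions."""
    ops = []
    run_start, run_len = None, 0
    for offset, base in enumerate(read):
        ref_pos = start + offset
        if base == reference[ref_pos]:
            if run_len == 0:
                run_start = ref_pos
            run_len += 1
        else:
            if run_len:  # close the current match run
                ops.append(("M", run_start, run_len))
                run_len = 0
            ops.append(("X", ref_pos, base))
    if run_len:
        ops.append(("M", run_start, run_len))
    return ops


def decode_against_reference(ops, reference: str) -> str:
    """Rebuild the read from the operations and the reference."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, ref_start, length = op
            parts.append(reference[ref_start:ref_start + length])
        else:
            parts.append(op[2])
    return "".join(parts)
```

The longer the matching runs, the fewer operations are needed, which is where the compression comes from.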
  • This encoding may also be beneficial since a subset of error modes preferentially creates or applies to homopolymers as opposed to other k-mers: in template slippage, in voltage and temperature dependent insertions, and in dysfunctional state inserts.
  • the quality of forward and reverse complement homopolymers can be uniquely determined based on empirical observations and applied to the entire homopolymer rather than a specific base. Importantly, quality scores can be considered for read 1 vs. read 2, and parent vs. daughter as well as base vs. complement base in determining empirical consensus base quality.
  • a raw signal can include raw voltage signals that produce a base call and a quality score.
  • 2. Read Quality Score (read orientation)
  • [0267] Which read a base call is on can be tracked. For example, which strand the base call is on can be tracked. Also, whether the read is from a daughter read or a parent read can be tracked. These different reads can have different associated errors. Each read can have an associated quality score, which can translate into a corresponding weight when determining the consensus base. For instance, a base call can be modified/weighted (e.g., multiplied) by its base quality score and its read score, and then a weighted sum for each of the different base calls can be determined, thereby obtaining a final score for each base call. The final score can be used to determine whether to make a concordant or discordant call.
  • a concordant call might be made when a first final score of a first base call is higher by a threshold value than a second final score of a second base call.
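The weighting just described can be sketched as follows. The weight values, the read labels, and the name `call_position` are illustrative assumptions; the description does not specify them.

```python
def call_position(observations, read_weights, threshold):
    """observations: list of (base, base_quality, read_label) at one position.

    Each base call is weighted by its base quality times its read weight,
    the weighted values are summed per base, and a concordant call is made
    only if the best final score leads the runner-up by at least `threshold`.
    """
    scores = {}
    for base, quality, label in observations:
        scores[base] = scores.get(base, 0.0) + quality * read_weights[label]
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_base, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score - runner_up >= threshold:
        return best_base, "concordant"
    return None, "discordant"
```

With read 1 weighted at 1.0 and read 2 at 0.5 (hypothetical values), an A at Q30 on read 1 outweighs a G at Q20 on read 2, whereas an A at Q20 against a G at Q36 falls inside a threshold of 5 and stays discordant.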
  • Read Orientation
  • The effects of data compression algorithms are typically only considered in the downstream processing, transferring, storage, and archival of high-throughput sequencing data.
  • errors that occur in upstream processes such as sample preparation, library preparation, and sequencing also contribute to the amount of data generated by these downstream processes.
  • Xpandomer synthesis and the sequencing process each have a respective error profile that influences downstream processing steps such as consensus read calling, base calling, variant calling, and the like.
  • improvements in consensus read calling, base calling, variant calling, etc. reduce the overall size of the sequencing data.
  • read 1 and read 2 can have different error profiles as a result of the different kinetic rates at which they were synthesized.
  • when Xpandomers are used, an Xpandomer is typically generated from target DNA molecules that are double-stranded.
  • the error rate of the first processed read could be different from the error rate of the second processed read.
  • Differences in strand synthesis error rate may arise because: (i) certain modifications (e.g., biochemical or DNA damaging) are inherently more common on one strand orientation versus its complement; and (ii) sequencing errors can be influenced by the surrounding bases.
  • differences in the observed Xpandomer synthesis error profile can arise for homopolymer motifs when the target DNA molecule stretch is in a double- vs single-stranded state.
  • HDD raw reads may be tracked.
  • An algorithmic approach to this problem may include collecting, in addition to read 1 and read 2 base information, information regarding the alignment orientation of a read to the original parent molecule. Read orientation is of particular interest in consensus and concordance calls and in the assignment of base quality to concordant bases.
  • FIG.18 shows a flowchart illustrating method 1800 for determining a consensus sequence of a double-stranded nucleic acid molecule.
  • the method 1800 depicted in FIG.18 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine).
  • the software may be stored on a non-transitory storage medium (e.g., on a memory device).
  • FIG.18 and described below is intended to be illustrative and non-limiting.
  • Although FIG.18 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.
  • a first strand of the double-stranded nucleic acid molecule is sequenced to obtain a first sequence of base calls.
  • Each of the first sequence of base calls has a first quality score and a first label that corresponds to the first strand.
  • a second strand of the double-stranded nucleic acid molecule is sequenced to obtain a second sequence of base calls.
  • Each of the second sequence of base calls has a second quality score and a second label that corresponds to the second strand.
  • the first weight and the second weight can be dependent on base calls adjacent to the discordant position.
  • a first set of concordant positions and a second set of discordant positions are identified using the first sequence of base calls and the second sequence of base calls.
  • the first set of concordant positions and the second set of discordant positions are identified by aligning the first sequence of base calls to the second sequence of base calls (e.g., aligning the base calls to each other). This method is referred to as reference free alignment and is described in more detail in section V-B.
  • the first sequence of base calls, the second sequence of base calls, or both may be aligned to a reference genome corresponding to the origin of the sequence (e.g., if human sequence, will be aligned to a human reference genome).
  • the first sequence of base calls may be aligned to a first strand of a reference genome.
  • the second sequence of base calls may be aligned to a second strand of a reference genome.
  • a portion of the first set of concordant positions may not match the reference genome.
  • a third set of concordant positions may be identified that match the reference genome and each of the third set of concordant positions may be represented with an indication of a genomic coordinate in the reference genome.
  • the genomic coordinate indication can include a starting genomic coordinate of the first sequence of base calls and metadata specifying the concordant positions that do not match the reference genome. This method is referred to as reference-based alignment and is described in more detail in section V-A.
  • a consensus base call is determined using the first quality score, the second quality score, a first weight corresponding to the first label, and a second weight corresponding to the second label.
  • the consensus base call is determined at an initial discordant position of the second set of discordant positions. This involves changing the initially discordant position to a concordant position for the first base call on the first strand.
  • This change may be based on either: (i) the first quality score is higher than the second quality score for a second base call of the second strand; (ii) the first weight is higher than the second weight; or (iii) the concordant base on the second strand has a measured signal that is adjacent to the second base call.
  • the reasoning behind options (i) and (ii) regards the notion that, typically, the first base call has higher accuracy compared to the second base call.
  • the consensus sequence is generated using: (1) the concordant values at the first set of concordant positions; and (2) the consensus base calls at the second set of discordant positions.
  • the consensus sequence may be a partial consensus sequence as described above.
  • the consensus sequence is transmitted to a computer system.
  • the computer system may be a computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause the computer system to perform method 1800 for determining a partial order consensus sequence of a double-stranded nucleic acid molecule.
  • the computer system may also comprise one or more processors configured to execute instructions stored on the computer readable medium.
  • C. Wobble Nucleotides
  • [0281] In some circumstances it may also be beneficial to include wobble nucleotides such as inosine to facilitate synthesis and reduce complementarity on hard to process or amplify motifs.
  • Table 4 provides examples of lossless encoding table of pairwise concordant and discordant calls following pairwise alignment in the presence of wobble nucleotides.
  • Table 4. Example Wobble Additives to Daughter Strand
  • Read Read Category Encoding 1 2 Inosine Uracil Adenine A T A A Cytosine C G C C Concordant Calls Guanine G C G G Thymine (or Uracil) T A T T Weak A a weak t T Strong C c Discordant Calls or Wobble strong g G concordance pYrimidine C t pyrimidine Wobble T call t C daughter A paired I complemented with C Keto G t keto t G puRine Wobble A call daughter-T paired with U A g U complemented with G purine Wobble G call daughter-C paired g A I complemented with A aMino A c amino a C
  • VIII. Example Compression Techniques
  • Described below are a variety of non-limiting exemplary approaches that may be used to achieve read compression that are particular to determining intramolecular consensus. Importantly, the examples provided below demonstrate that based on the compression technique used, a higher compression ratio may be achieved.
  • A. Lossless
  • 1. Consensus Quality Score Vector
  • [0283] Exemplary embodiment 1 for generating a consensus read from an alignment object is outlined below.
  • 1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
  • 2) For discordant positions one of the two characters in either read 1 or read 2 is chosen according to some selection criteria. That selection criteria may involve, for example, one of the following options:
  • a. Comparing the raw read quality scores between the read 1 character and the read 2 character (see section VII-A.1 for description).
  • b. Comparing the kmer context for each of the two characters on read 1 and read 2 (see section VII-A.3).
  • c. Considering which read the characters occurs on (i.e., base quality score based on read orientation; see section VI-A.2).
  • d. Considering error profiles for upstream processing events such as physical processing of the molecule of interest, sample preparation, library preparation, target enrichment, and sequencing preparation. Each of these methods may comprise information about which base transversions (e.g., conversion of a single purine to a pyrimidine, or vice versa) are more or less likely to result from possible chemical lesions in the original DNA molecules during each upstream processing event. See sections III-A, B, and C as well as section VII-B for additional descriptions.
  • e. Some combination of the above criteria.
  • 3) An arbitrary base may be randomly selected to occupy the positions in the consensus read corresponding to each discordant set of positions in the alignment object (e.g., wobble nucleotides discussed in section VII-C).
  • 4) A second vector of data less than or equal in length to the alignment object, with a single bit per element, can be used to encode whether each position in the consensus read was generated from concordant or discordant characters. This second vector of data might be a Consensus Quality Score vector, where the quality scores are represented by some specific number of bits per consensus base, such as a single bit per consensus base as one example.
  • 5) A third variable length vector of data, sometimes referred to as part of or as a full header string, is used to encode the combination of characters that represents the original two characters in read 1 and read 2 at each of the discordant positions.
  • 6) Each character in the variable length vector of data might be assigned five bits, for example.
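The three data structures of exemplary embodiment 1 can be sketched together. Here the selection criterion is simply the higher raw quality score (option a), with ties going to read 1; that tie rule, the list-based bit vector, and the name `encode_embodiment_1` are assumptions made for illustration.

```python
def encode_embodiment_1(read1, read2, qual1, qual2):
    """Return (consensus read, 1-bit-per-base flag vector, header of discordant pairs)."""
    consensus, flags, header = [], [], []
    for b1, b2, q1, q2 in zip(read1, read2, qual1, qual2):
        if b1 == b2:
            consensus.append(b1)
            flags.append(0)          # 0 = generated from concordant characters
        else:
            # Option a: keep the character with the higher raw quality score.
            consensus.append(b1 if q1 >= q2 else b2)
            flags.append(1)          # 1 = generated from discordant characters
            header.append(b1 + b2)   # original (read 1, read 2) characters
    return "".join(consensus), flags, header
```

The header grows only with the number of discordant positions, so reads that mostly agree cost little beyond the consensus itself and one bit per base.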
  • Variable Length Encoding uses a different number of bits to encode different symbols. For example, an A in a sequence may be encoded by ‘0’ and only use 1 bit.
  • C may be encoded by ‘10’ using 2 bits
  • G may be encoded by ‘110’ using 3 bits
  • T may be encoded by ‘1111’ using 4 bits.
  • the process of transforming symbols (A, C, G, and T) into their binary word or sequence (0, 10, 110, and 1111, respectively) is referred to as (variable length) encoding and is performed by an encoder.
  • Variable length encoding can use short codewords, requiring fewer bits, for common symbols with a high probability of occurring and longer codewords, requiring more bits, for symbols with a low probability of occurring.
  • An advantage of this method is that less storage space is needed and transmitting the data from one place to another occurs very rapidly.
  • variable length encoding depending on the complexity, can make decoding more difficult and increase the demand of computational power and circuit cost.
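A small sketch of such an encoder and decoder, using the codewords from the example above (A as 0, C as 10, G as 110, T as 1111). Representing bits as a text string of '0' and '1' characters is a simplification for readability, not a storage format from the description.

```python
CODEBOOK = {"A": "0", "C": "10", "G": "110", "T": "1111"}


def encode(sequence: str) -> str:
    """Concatenate the codeword of each symbol."""
    return "".join(CODEBOOK[base] for base in sequence)


def decode(bits: str) -> str:
    """Read bits left to right, emitting a symbol whenever a codeword completes.

    This works because no codeword is a prefix of another codeword.
    """
    reverse = {word: base for base, word in CODEBOOK.items()}
    out, word = [], ""
    for bit in bits:
        word += bit
        if word in reverse:
            out.append(reverse[word])
            word = ""
    if word:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)
```

The decoder has to inspect the stream bit by bit because codeword boundaries are not known in advance, which illustrates the extra decoding cost mentioned above.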
  • Additional, and non-limiting, examples of variable length encodings that are described in the examples below may include: (i) encoding concordant pairs with 1 bit and discordant pairs with at least 2 bits; and (ii) frequently occurring or common mutations, sequencing errors, etc. may be assigned shorter codewords requiring fewer bits, while infrequent or rare mutations may be assigned longer codewords requiring more bits.
  • a “code” or a “codebook” refers to a mapping between symbols and binary (or non-binary) words (e.g., codeword).
  • a codebook provides information on the structure, contents, and layout of a data file.
  • a codebook can include: column locations and widths for each variable, definitions of different record types, response codes for each variable, codes used to indicate nonresponse and missing data, exact questions and skip patterns used in a survey, other indications of the content and characteristics of each variable, etc.
  • FIGS.19A & 19B show exemplary embodiment 2 for generating a consensus read from an alignment object, outlined below. 1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads). 2) For discordant positions one of the two characters in either read 1 or read 2 is chosen according to some selection criteria. That selection criteria may involve, for example, one of the following options: a.
  • FIG.21 shows exemplary embodiment 5 for generating a consensus read from an alignment object, outlined below. 1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads). 2) For discordant positions in the alignment object, a 5th character ‘N’ is chosen.
  • FIG.22 shows exemplary embodiment 6 for generating a consensus read from an alignment object.
  • one of the two bases at the discordant positions is selected based on selection criteria.
  • a second vector of data of less than or equal length to the alignment object comprises a single bit per element. The second vector is used to encode whether each position in the consensus read is generated from concordant or discordant characters.
  • the second vector of data is a consensus quality score vector, where the quality scores are represented by a single bit per consensus base.
  • the steps are outlined below. 1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads). 2) For discordant positions one of the two bases (or more generally referred to as characters in the alignment object to be inclusive of dashes) in either read 1 or read 2 is chosen according to some selection criteria. That selection criteria may involve, for example, one of the following: a. Comparing the raw read quality scores between the read 1 character and the read 2 character (see section VII-A.1 for description). b. Comparing the kmer context for each of the two characters on read 1 and read 2 (see section VII-A.3). c.
  • a second vector of data less than or equal in length to the alignment object, with a single bit per element, can be used to encode whether each position in the consensus read was generated from concordant or discordant characters.
  • This second vector of data might be considered to be a consensus quality score vector, where the quality scores are represented by a single bit per consensus base, for example binary 0 and 1.
  • a naive implementation of the above algorithm and resulting data structure would include allocating 3 bits per base in the consensus read, and a loss of information at discordant locations, though slightly less of a loss of information as compared with Embodiment 1.
  • IX. Compressions for Different Read Constructs
  • [0293] Further details are provided for handling the different read classes in FIG.12 and the different read constructs generated with differing numbers of passes.
  • A. Treatment of Full HDD vs. One+ and other non-Full HDD Reads
  • [0294] As described with respect to FIG.13C, the “Duplex Start Position”, p, is identified as the first position on Insert Read 1 at which bases from Insert Read 2 are aligned with Insert Read 1 bases.
  • For “Full HDD Reads” the Duplex Start Position, p, is equal to 0. For “One+” reads, the Duplex Start Position will be some value p greater than 0. Both Full HDD and One+ reads can be accommodated by the encoding strategies described in sections VII and VIII on HDD read classes, and others, by allocating a first bit to a header string (or sub-header string) which indicates whether the HDD read, and therefore alignment object as well, corresponds to a Full HDD versus a One+ read. The encoding methods then treat the two scenarios differently by inserting a step early in the variable length header string generation for One+ reads only.
  • Described below is a sub-method for accommodating the possibility of both Full HDD read constructs and One+ read constructs.
  • 1) An algorithm classifies the alignment object as either a Full HDD read construct or a One+ read construct and a corresponding value is stored in the 0th bit of the variable length header code.
  • 2) If the algorithm classifies the alignment object as a Full HDD read construct, then the algorithm skips steps 3 and 4 and begins to look for concordant or discordant positions.
  • 3) If the algorithm classifies the alignment object as a One+ read construct, then the algorithm allocates a certain number of bits to store the Starting Duplex Position (SDP) relative to the alignment object or to read 1. The number of bits allocated depends on the compression method chosen.
  • 4) The number of bits allocated to store the SDP may be chosen to achieve the highest average compression ratios for the given run conditions. The range of the SDP number may depend, for example, on the expected DNA insert length distribution for a given experimental run condition. The number of bits allocated to SDP values should not be more than necessary to accommodate the expected insert length distribution for a given experiment.
  • [0296] In some instances, an alternative approach for One+ read constructs can be used. Instead of recording the start position of the duplex segment relative to position 0 on read 1, the second stream of data, which records read 2 in a lossless way, may start recording information about read 2 concordant or discordant from the hairpin adapter side of the HDD alignment object.
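The header handling for Full HDD versus One+ constructs can be sketched as below. The flag values (0 for Full HDD, 1 for One+), the fixed-width binary SDP field, and the function names are assumptions for illustration; the description leaves the bit width to the chosen compression method.

```python
def build_header(duplex_start: int, sdp_bits: int) -> str:
    """Return the leading header bits as a string of '0'/'1' characters."""
    if duplex_start < 0 or duplex_start >= 2 ** sdp_bits:
        raise ValueError("duplex start does not fit in the allocated SDP bits")
    if duplex_start == 0:
        return "0"  # Full HDD: flag bit only, no SDP field
    # One+: flag bit followed by the SDP as a fixed-width binary number.
    return "1" + format(duplex_start, f"0{sdp_bits}b")


def parse_header(bits: str, sdp_bits: int):
    """Return (duplex start position, number of header bits consumed)."""
    if bits[0] == "0":
        return 0, 1
    return int(bits[1:1 + sdp_bits], 2), 1 + sdp_bits
```

Choosing `sdp_bits` just large enough for the expected insert length keeps the overhead on One+ reads small, while Full HDD reads pay only the single flag bit.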
  • n > 2 pass HDD read constructs that have been developed for that lossy approaches to compressing the insert reads alignment object are satisfactory, and are even preferred, relative to lossless compression of information contained within the n > 2 pass multiple alignment object; however, both lossy and lossless approaches are relevant.
  • Compression Approaches to n > 2 Pass HDD Read Constructs
  • [0299] For two pass HDD read constructs, a pairwise alignment between the read 1 and read 2 passes may be sufficient.
  • For n > 2 pass HDD read constructs, a multiple sequence alignment, or a sequence of steps that results in the equivalent of a multiple sequence alignment, may first be required.
  • exemplary embodiments 7 and 8 can be performed to compress the multiple sequence alignment object.
  • Exemplary embodiment 7 describes compression of n > 2 pass HDD read constructs, outlined below.
  • 1) Read 1 is recorded as the reference read.
  • 2) For each of the other reads, namely read 2, read 3 .... read n, a second variable length stream of data is generated to record the deltas between each of the other reads and read 1.
  • Approaches such as the exemplary embodiment 4 described in section VIII-B of the 2-pass read section can be applied in series to each of the reads and their pairing with read 1.
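Exemplary embodiment 7 can be sketched as follows, assuming the passes have already been multiple-aligned to equal length and put in the same orientation. Representing each delta stream as a list of (position, base) pairs, and the function names, are hypothetical choices.

```python
def deltas_to_reference(reads):
    """Record read 1 as the reference and each other read as its differences."""
    reference = reads[0]
    streams = []
    for read in reads[1:]:
        if len(read) != len(reference):
            raise ValueError("sketch assumes equal-length aligned reads")
        streams.append(
            [(pos, b) for pos, (a, b) in enumerate(zip(reference, read)) if a != b]
        )
    return reference, streams


def restore(reference, streams):
    """Rebuild every read from the reference read and the delta streams."""
    reads = [reference]
    for delta in streams:
        chars = list(reference)
        for pos, base in delta:
            chars[pos] = base
        reads.append("".join(chars))
    return reads
```

A pass that agrees with read 1 everywhere yields an empty stream, so the cost of each extra pass scales with its disagreements rather than its length.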
  • Exemplary embodiment 8 describes compression of n > 2 pass HDD read constructs, outlined below. 1) A (potentially lossy) consensus read may be generated first by calling the most probable base for each of the consensus read positions, given the evidence in the multiple sequence alignment object, as well as the associated raw read quality scores. 2) This consensus read may be recorded as a first stream of data. For each of the insert passes, namely read 1, read 2, read 3 ....
  • a consensus quality score can correspond to the final quality score assigned to concordant and discordant positions, e.g., based on the criteria mentioned above.
  • Decompression/Decoding
  • [0305] Various embodiments can decode compressed consensus reads in preparation for processing by downstream processes.
  • If a context dependent code is used to encode HDD alignment objects, then both the compressed data and information specifying unknown aspects of the codebook would need to be transmitted to the location of the downstream processing storage and compute elements.
  • A. Intramolecular Consensus
  • To decompress such a compressed sequence, where not all of the positions are represented with the same amount of data, additional metadata can be used. For example, a separate bit vector can specify which positions are concordant (e.g., with a 0) or discordant (e.g., with a 1). As another example, a header file can specify the positions that are discordant.
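A sketch of decoding with a separate bit vector: positions flagged 0 take the consensus base for both reads, and positions flagged 1 pull the original pair from a header in order. The header layout (one two-character string per discordant position) and the name `decode_consensus` are assumptions for illustration.

```python
def decode_consensus(consensus: str, flags, pairs):
    """Recover read 1 and read 2 from a consensus read, a bit vector, and a header."""
    remaining = iter(pairs)
    read1, read2 = [], []
    for base, flag in zip(consensus, flags):
        if flag == 0:
            # Concordant: both reads carried the consensus base.
            read1.append(base)
            read2.append(base)
        else:
            # Discordant: the header holds the original two characters.
            b1, b2 = next(remaining)
            read1.append(b1)
            read2.append(b2)
    return "".join(read1), "".join(read2)
```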
  • HDD reads for 5mC methylation calls are that when intramolecular consensus calls are made, they provide an unmethylated read that can be readily aligned to the reference genome, as well as methylation calls.
  • EM-Seq or bisulfite methylation workflow either need to be assigned unique UMIs in order to be grouped together, or alternatively the converted reads need to be aligned to a macerated genome where C’s were converted to T’s or G’s to A’s.
  • Step 2: perform pairwise alignment using a modified transition matrix. This transition matrix allows for mismatch errors between converted (non-methylated C) and G. The substitution matrix will have the following general form:
          A    C    T    G
      A  1.0  0.0  0.0  0.0
      C  0.0  1.0  0.0  0.0
      T  0.0  1.0  1.0  0.0
      G  1.0  0.0  0.0  1.0
  • result of converting methylated C’s to T’s.
  • Step 3: After the initial pairwise alignment, it is possible to detect T/C pairs which correspond to the conversion of the methylated bases. Once the methylated bases are detected, we randomly reassign CG pairs to these bases and repeat the above step. Each iteration improves the alignment to some extent.
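To make the role of the modified matrix concrete, the sketch below reads the matrix of Step 2 as rows A, C, T, G by columns A, C, T, G and scores a gap-free pairing of two reads. Which read supplies the row base and which the column base is an assumption, as are the function names; a real alignment would also handle gaps.

```python
BASES = "ACTG"
MATRIX = [
    [1.0, 0.0, 0.0, 0.0],  # row A
    [0.0, 1.0, 0.0, 0.0],  # row C
    [0.0, 1.0, 1.0, 0.0],  # row T: T/C is tolerated like a match
    [1.0, 0.0, 0.0, 1.0],  # row G: G/A is tolerated like a match
]


def pair_score(row_base: str, col_base: str) -> float:
    """Look up one cell of the asymmetric substitution matrix."""
    return MATRIX[BASES.index(row_base)][BASES.index(col_base)]


def ungapped_score(row_read: str, col_read: str) -> float:
    """Sum the pair scores of two equal-length reads without gaps."""
    return sum(pair_score(a, b) for a, b in zip(row_read, col_read))


def tc_pairs(row_read: str, col_read: str):
    """Positions holding a T/C pair, the candidates examined in Step 3."""
    return [i for i, (a, b) in enumerate(zip(row_read, col_read)) if a == "T" and b == "C"]
```

The matrix is deliberately asymmetric: T over C scores 1.0 while C over T scores 0.0, so only the conversion direction is forgiven.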
  • substitution dictionary: 'TA': 'N', 'TG': 'N', 'AC': 'N', 'AT': 'N', 'AA': 'A', 'AG': 'N', ... (labels: methyl, methyl (rc))
  • Methylation often occurs on CpG islands, which can cause ambiguity in alignment in the presence of indels and when considering the ambiguity generated by EM or bisulfite conversion processes.
  • a solution to this problem is to implement partial order alignment of HDD reads and include penalties to account for base pairs derived from C - U pairing.
  • a way to approach this algorithmically is to have read 1 and read 2 undergo partial order alignment with methylation detection proposed alignment weights, maintaining both read 1 and read 2 information in a lossless manner.
  • the partial order alignment generates a sorted DAG, which can be exported as a linear sequence encoding.
  • template nucleic acid molecules may be amplified during library preparation prior to sequencing. Thus, multiple nucleic acid molecules (e.g., copies and original) of the template can be sequenced.
  • raw data corresponding to these nucleic acid molecules or portions thereof may be generated by the sequencing device (e.g., at different time points).
  • Sequence reads e.g., from raw read data
  • the number of sequence reads that are used to generate the consensus read can be limited to a cutoff number (threshold) or until a consensus read is considered complete or substantially accurate.
  • data from any raw read data that corresponds to the same nucleic acid molecule or portions thereof may be discarded and excluded from further analysis.
  • the corresponding new raw read data may be removed from the instrument to reduce the amount of data in the memory and the amount of data that needs to be output from the memory.
  • an identifier e.g., a unique molecular identifier (UMI), a random sequence barcode (randomer), or content of a sequence read. This information may then be used in real time to discard or retain the sequence read.
  • Identifiers such as UMIs
  • NGS library prep workflows to: (i) identify and cluster sequences belonging to the same population of nucleic acid molecules; and (ii) perform error correction through oversampling of raw reads and consensus read forming strategies.
  • each cluster may contain a plurality of sequence reads that correspond to a nucleic acid molecule.
  • sequence reads may be collapsed into a single sequence read representing a consensus sequence.
  • the consensus sequence of a cluster is a single nucleotide sequence, in which every position is a nucleotide that is most commonly called amongst all the sequence reads in that cluster.
  • the consensus sequence may be generated by performing a multiple alignment between all the sequence reads in a cluster.
  • the consensus sequence may be generated by aligning each sequence read in a cluster to a reference genome. Then, for every position in the multiple alignment or alignment to a reference genome, the most common nucleotide amongst all reads can be selected.
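The per-position majority rule can be sketched in a few lines, assuming the reads of a cluster are already aligned to equal length. The name `majority_consensus` is hypothetical, and the behaviour on ties (the base seen first wins) is simply a property of this sketch rather than something the description specifies.

```python
from collections import Counter


def majority_consensus(reads):
    """Pick the most commonly called nucleotide at every aligned position."""
    if not reads:
        raise ValueError("cluster is empty")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("sketch assumes equal-length aligned reads")
    # zip(*reads) yields one column of base calls per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))
```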
  • each sequence read may contain random errors that can be randomly produced during nucleic acid amplification and sequencing processes.
  • a consensus sequence, generated from a plurality of sequence reads, may therefore more accurately represent a nucleic acid molecule.
  • Including more sequence reads to form a consensus sequence read may lead to a consensus sequence read that may correspond to the actual sequence of the nucleic acid molecule more accurately.
  • a cutoff can be applied to a number of sequence reads that are used in building the consensus.
  • a highly accurate consensus sequence may be generated from at most about 100, 50, 40, 30, 20, 10, or fewer sequence reads.
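The per-position majority vote described above can be sketched as follows. This is a minimal illustration, assuming the reads in a cluster have already been aligned to equal length; the function name `majority_consensus` and the `max_reads` cutoff parameter are hypothetical and not taken from the disclosure.

```python
from collections import Counter

def majority_consensus(aligned_reads, max_reads=None):
    """Column-wise majority vote over reads already aligned to equal length."""
    # Optional cutoff on the number of reads used to build the consensus.
    reads = aligned_reads[:max_reads] if max_reads else aligned_reads
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*reads):
        # Ties resolve to the base seen first in the column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

cluster = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAA"]
print(majority_consensus(cluster))  # ACGTAC
```

Random errors in individual reads (the third and fourth reads here) are outvoted as long as they do not recur at the same position in most reads.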
  • Barcode and UMI technologies, and methods of labeling nucleic acid molecules with a barcode or UMI sequence are well known in the art (see, e.g., Fu et al., Proc. Nat’l.
  • the amplification and sampling process results in uneven representation across UMI-labeled nucleic acid molecules (or UMI-molecular families).
  • the sampling may include random sampling of the molecules generated in the amplification process. For example, a fraction of the amplified molecules (i.e., including the original template molecules) may be sampled for sequencing. Different parameters in an amplification process (e.g., number of PCR cycles) used to generate different molecular families prior to sequencing may cause the molecular families to contain different numbers of nucleic acid molecules.
  • an initial amount (e.g., concentration) of a nucleic acid molecule may be greater than that of other nucleic acid molecules in a sample, leading to a molecular family that contains more progeny with the same barcode and content (i.e., nucleotide sequence). Therefore, the number of sequence reads generated by the sequencing device corresponding to a nucleic acid molecule or a molecular family may vary significantly across different molecules or molecular families. Consequently, a nucleic acid molecule or molecular family may be over- or under-sampled.
  • each UMI-molecular family e.g., 10x
  • the resulting intermolecular consensus families may hit that average 10x read depth, but the variance across families will be high.
  • some molecular families may have insufficient representation, while others may have orders of magnitude more reads than are required. Families with extremely high depth of coverage may not benefit the assay much, while the UMI-molecular families with membership number lower than the desired depth will be unable to generate high quality consensus reads.
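The gap between average depth and per-family depth can be made concrete with a small sketch. It assumes each read has already been assigned a UMI string; `family_depth_report` and its return keys are illustrative names only.

```python
from collections import Counter

def family_depth_report(read_umis, min_depth=10):
    """Summarize read depth per UMI family against a desired minimum depth."""
    depths = Counter(read_umis)
    if not depths:
        return {"mean_depth": 0.0, "under_sampled": [], "excess_reads": 0}
    # Families that cannot reach the desired consensus depth.
    under = sorted(u for u, d in depths.items() if d < min_depth)
    # Reads beyond the desired depth, which add little to the assay.
    excess = sum(d - min_depth for d in depths.values() if d > min_depth)
    mean_depth = sum(depths.values()) / len(depths)
    return {"mean_depth": mean_depth, "under_sampled": under, "excess_reads": excess}

umis = ["AAT"] * 25 + ["CGG"] * 3 + ["TTA"] * 2
print(family_depth_report(umis, min_depth=10))
# {'mean_depth': 10.0, 'under_sampled': ['CGG', 'TTA'], 'excess_reads': 15}
```

In this toy run the mean depth meets the 10x target, yet two of the three families fall short while the third carries 15 surplus reads.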
  • each family labeled using a UMI may represent a region of interest in a genome.
  • the sequencing throughput requirements have to be raised in order for all regions of interest to be covered by at least the minimum required depth.
  • the regions of interest can be the subject of targeted sequencing, e.g., enrichment of DNA from those regions, as may be done by amplification of DNA or capture probes.
  • Another major disadvantage of conventional UMI-based intermolecular consensus workflows is the fact that members of the same UMI family are typically dispersed randomly throughout the physical sample, such that each member of a UMI-Original Molecule Family may be read at a different time throughout a run. Such a run may conceivably last an hour, several hours, 24 hours, multiple days, or another duration of time.
  • Genomic variants are naturally occurring alterations to the DNA sequence not found in a reference sequence.
  • genomic variants include small variants (less than 50 base pairs) such as single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), insertions, and deletions (sometimes referred to as indels), and structural variants (greater than 50 base pairs) such as insertions, deletions, chromosomal rearrangements (e.g., translocations, inversions, and fusions), and copy number variations (CNVs).
  • Variant calling generally involves comparing a sequence read to a reference genome and reporting any variation between them.
  • a reference genome is an established, high-quality and well-accepted sequence of a given organism, for example the hg38 human reference genome.
  • Reference genomes comprise pieces of multiple genomes put together to generate a “consensus” reference genome with one assigned nucleotide for every position.
  • variant callers score and filter aligned sequencing data to call true sequence variations. As discussed above, alignment of reads can identify concordant and discordant positions, and the variant caller is responsible for determining which of the discordant positions are true positives or true negatives. After alignment to a reference genome, a next step is variant calling.
  • the system e.g., a de novo software application
  • sequence mutations single base changes and small indels
  • the system can extract candidate variants from the alignment, score a number of individual metrics for each variant, and apply these scores both individually and in combination to identify bona fide sequence mutations and to exclude sequence artifacts.
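As a rough illustration of extracting and filtering candidate variants, the sketch below reports substitutions supported by a minimum number of aligned reads. It is deliberately simplified: reads are assumed to be gap-free alignments given as `(offset, sequence)` pairs, indels are ignored, and read support stands in for the richer scoring metrics a real variant caller would apply. `call_substitutions` and `min_support` are hypothetical names.

```python
from collections import Counter

def call_substitutions(reference, aligned_reads, min_support=2):
    """Report (position, ref_base, read_base) seen in at least min_support reads."""
    support = Counter()
    for offset, read in aligned_reads:
        for i, read_base in enumerate(read):
            ref_base = reference[offset + i]
            if read_base != ref_base:
                # Discordant position: count how many reads agree on it.
                support[(offset + i, ref_base, read_base)] += 1
    return sorted(v for v, n in support.items() if n >= min_support)

reference = "ACGTACGTAC"
reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (1, "CGTACGAA")]
print(call_substitutions(reference, reads))  # [(4, 'A', 'T')]
```

The A-to-T change at position 4 is seen in two reads and is kept; the single-read mismatch at position 7 is treated as a likely artifact and dropped.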
  • Intramolecular consensus reads and/or intermolecular consensus reads can be used to perform the variant calling. For example, intramolecular consensus reads can be output by a first consensus circuit and then used by a second consensus circuit to determine an intermolecular consensus read for a particular family of molecules (e.g., sharing a same bar code). This can be done across families for a given sequencing run.
  • XIII. Detecting Components of Adapter Constructs
  • Efficient detection of added artificial sequences in nucleic acid fragments (e.g., DNA and/or RNA) is essential for their proper identification during sequencing. In so doing, DNA fragments belonging to the same sample in a pool of samples may be identified and processed. To distinguish DNA fragments from one another, adapter sequences comprising sequence IDs (SIDs) and/or unique barcodes may be used, where each SID corresponds to a single sample in the pool of samples and the unique barcodes distinguish fragments within a single sample.
  • a hairpin adapter is ligated to an end of a double-stranded nucleic acid molecule thereby forming a resulting molecule (e.g., a hairpin duplex construct).
  • the resulting molecule comprises a hybridized portion (e.g., nucleotides with base pairing) and a non-hybridized portion (e.g., nucleotides without base pairing).
  • the hairpin adapter includes a hairpin loop, with a known loop length, that comprises nucleotides that are not hybridized to other nucleotides.
  • the computer system identifies candidate locations for the hairpin adapter, using a window sliding technique.
  • This technique involves sliding the window structure, based on a specified step size, over a plurality of positions of the output sequence (e.g., sequence read).
  • the defined step size may be 1, 2, 3, 4, 5, or 10 or more bases, or any whole number between 1 and 10 bases.
  • the step size is one base, meaning the window structure slides along the sequence base by base.
  • the window structure includes a first window portion separated from a second window portion by the known loop length. The loop length corresponds to the length of the ligated hairpin adapter sequence added to the end of the double stranded nucleic acid molecule at 2405.
  • an edit distance between a first sequence (e.g., a first substring) in the first window portion and a reverse complement of a second sequence (e.g., a second substring) in the second window portion is determined.
  • a set of edit distances is also determined.
  • the plurality of positions includes a specified number of positions before and after a middle of the output sequence.
  • the edit distance corresponds to the number of changes required in either the first sequence or the second sequence to make the other sequence (e.g., either the second sequence or the first sequence, respectively) a perfect reverse complement.
  • determining the location of the hairpin loop in the output sequence based on the edit distances comprises (i) determining the edit distance for each position of the window structure, and (ii) selecting a maximum of the set of edit distances. Different stopping criteria may be used to cease sliding of the window structure. For example, sliding may stop when the maximum edit distance in the set of edit distances is found, or sliding may stop when a threshold for the edit distance is achieved.
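The window-sliding search can be sketched as below. The helper names, the Levenshtein implementation, and the toy read are all illustrative. The sketch picks the placement with the smallest edit distance, on the assumption that the two flanks of the loop are the closest to perfect reverse complements; the selection rule is a one-line change if the distances are scored differently.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def scan_for_loop(read, window, loop_length, step=1):
    """Map each candidate loop start to the edit distance between its flanks."""
    distances = {}
    last_start = len(read) - (2 * window + loop_length)
    for start in range(0, last_start + 1, step):
        first = read[start:start + window]
        second_start = start + window + loop_length
        second = read[second_start:second_start + window]
        # Compare the first window with the reverse complement of the second.
        distances[start + window] = edit_distance(first, reverse_complement(second))
    return distances

insert = "GATTACAGGC"
read = insert + "TTTT" + reverse_complement(insert)  # toy hairpin read
distances = scan_for_loop(read, window=6, loop_length=4)
loop_start = min(distances, key=distances.get)
print(loop_start, distances[loop_start])  # 10 0
```

A threshold-based stopping criterion could replace the full scan by returning as soon as a distance at or below the threshold is found.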
  • a measured identity sequence e.g., SID
  • sample identification comprises comparing the measured identity sequence (e.g., SID) to a look-up table (LUT) to determine if there is a sample in the sample pool that corresponds to the measured identity sequence of the hairpin. Once the location of the hairpin loop in the output sequence is found, the LUT is queried, and if a non-null value is returned, it is reported as a possible SID.
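A LUT query of this kind reduces to a dictionary lookup. In the sketch below the table contents, the sample names, and the assumption that the SID begins at the detected loop start are all hypothetical; the actual position of the SID relative to the loop depends on the adapter design.

```python
# Hypothetical LUT mapping SID sequences to sample names.
SID_LUT = {"ACGTACGTAC": "sample_01", "TTGCAATGCA": "sample_02"}

def identify_sample(read, loop_start, sid_length, lut):
    """Return the measured SID and the matching sample, or None if absent."""
    measured_sid = read[loop_start:loop_start + sid_length]
    # dict.get returns None (a null value) when the SID is not in the pool.
    return measured_sid, lut.get(measured_sid)

read = "GGGGG" + "TTGCAATGCA" + "CCCCC"
print(identify_sample(read, loop_start=5, sid_length=10, lut=SID_LUT))
# ('TTGCAATGCA', 'sample_02')
```

An exact lookup fails on any read error inside the SID, which is one motivation for the model-based matching described next.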
  • sample identification can comprise inputting the measured identity sequence into a machine learning model that is trained on various input sequences of a same length as a sample identifier used in the hairpin adapter.
  • the measured identity sequence allows sample sets in a pool of samples to be separated from one another. Furthermore, accurate detection of the hairpin loop in the output sequence ensures correct adapter removal during adapter trimming processes in downstream computational analysis.
  • a first strand sequence of a first strand of the double-stranded nucleic acid molecule and a second strand sequence of a second strand of the double-stranded nucleic acid molecule may be determined using the location of the hairpin loop in the output sequence.
  • the first strand sequence and the second strand sequence represent the nucleotide sequence of the double stranded nucleic acid molecule, which is the target sequence used in downstream computational analysis.
  • FIGs.25A and 25B illustrate exemplary adapter architectures that may be used during sequencing.
  • the SIDs 2520 are selected from a pool of ‘x’-base fixed sequences, where ‘x’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. Moreover, the SIDs 2520 function as sample identifiers that distinguish one sample from another in a pool of samples.
  • the stems 2525 are reverse complementary pairs (i.e., ST reverse complements ST’) that are also of an ‘s’-base fixed sequence, where ‘s’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases.
  • the UMIs 2530 are the sequences that fall between the stem and the anchor and function to distinguish between original molecules and PCR duplicates within a sample.
  • UMI sequences 2530 may be either randomers or semi-randomers comprising about 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 bases. Similar to the stems 2525, the anchors 2535 are also reverse complementary pairs that have a ‘t’-base fixed sequence, where ‘t’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. Between the adapter components lies the nucleic acid molecule (e.g., double stranded DNA molecule), which is also referred to as an insert 2540. This is the segment to be sequenced.
  • [0356] FIG.25B shows another exemplary adapter structure for a linearized nucleic acid molecule.
  • the overhangs 2545, comprising the bases GT on the 5’ end and AC on the 3’ end, are used during adapter ligation to the insert DNA sequence 2540, which is the sequencing target.
  • Training, Testing, and Implementation of Machine Learning Models
  • [0357] During sequencing, there is a chance for sequencing errors to occur, such as during the physical/chemistry processing, signal measurement (e.g., optical or electrical), and/or in determining base calls. Such errors cause problems in identifying the positions of adapters in the sequence reads and thus in identifying the DNA segments. These problems are further exacerbated when the adapters include variable components.
  • machine learning models may be used to account for different patterns and types of errors that may occur during the sequencing process so that the adapters and their various components may still be identified.
  • machine learning models are procedures that are run on datasets (e.g., training and validation datasets) and can perform pattern recognition on datasets, learn from the datasets, and/or are fit on the datasets. Examples of machine learning models include linear and logistic regression, decision trees, artificial neural networks, k-means, and k-nearest neighbor.
  • models include, without limitation, linear regression, logistic regression, decision tree, Support Vector Machines, Naive Bayes algorithm, Bayesian classifier, linear classifier, K-Nearest Neighbors, K-Means, random forest, dimensionality reduction algorithms, grid search algorithm, genetic algorithm, AdaBoosting algorithm, Gradient Boosting Machines, and Artificial Neural Networks such as convolutional neural network (“CNN”), an inception neural network, a U-Net, a V-Net, a residual neural network (“Resnet”), a transform neural network, a recurrent neural network, a Generative adversarial network (GAN), or other variants of Deep Neural Networks (“DNN”) (e.g., a multi-label n-binary DNN classifier or multi-class DNN classifier).
  • These models can be implemented using various machine learning libraries and frameworks such as TensorFlow, PyTorch, Keras, and scikit-learn, which provide extensive tools and features to facilitate model building, training, validation, and testing.
  • input data is collected and preprocessed if necessary.
  • Data collection can include exploring various data sources such as public datasets, private data collections, or real-time data streams, depending on a project’s needs.
  • the collected data comprises sequencing read data generated from sequencing methods (e.g., Xpandomer sequencing as described with respect to section II).
  • the read data can include the adapters and the DNA segment, where the adapter architecture may include the architectures described with respect to FIGs.25A and 25B.
  • the collected read data is used to synthesize new reads that comprise sequencing errors, which can occur during the physical/chemistry processing, signal measurement (e.g., optical or electrical), and/or in determining base calls.
  • Sequencing errors that can be introduced include, without limitation, insertions, deletions, and substitutions.
  • techniques such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) may be used to generate new data examples.
  • Another option for simulating library-prep errors and sequencing errors can include incorporating errors based on a hyper-parameter designated probability.
  • high-performance GPUs Graphics Processing Units
  • CPUs Central Processing Units
  • TPUs Tensor Processing Units
  • frameworks and libraries including TensorFlow, PyTorch, Keras, and scikit-learn.
  • the model iteratively adjusts its internal model parameters (e.g., weights, coefficients, trees, feature importance, and/or biases) to minimize or maximize an objective function (e.g., a loss function, a cost function, a contrastive loss function, a cross-entropy loss function, an Out-of-Bag (OOB) score, etc.).
  • Various techniques may be used to perform the optimization. For example, to train machine learning models such as a neural network, optimization can be done using back propagation. The current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in such a way that the error is minimized. The weights are modified using the optimization function.
  • Validating is another phase of developing machine learning models where the model is checked for deficiencies in performance and the hyperparameters are optimized based on validation data provided from the training and validation datasets.
  • the validation data helps to evaluate the model's performance, such as accuracy, precision, recall, or F1-score, to gauge how well the model is likely to perform in real-world scenarios.
  • Hyperparameter optimization, on the other hand, involves adjusting the settings that govern the model's learning process (e.g., learning rate, number of layers, size of the layers in neural networks) to find the combination that yields the best performance on the validation data.
  • the validation process includes iterative operations of inputting the validation subset of data into the trained model(s) using a validation technique such as K-Fold Cross-Validation, Leave-one-out Cross-Validation, Leave-one-group-out Cross-Validation, Nested Cross-Validation, or the like, to fine-tune the hyperparameters and ultimately find the optimal set of hyperparameters.
  • the test dataset serves as new, unseen data for the model, mimicking how the model would perform when deployed in actual use.
  • During testing, the model’s predictions are compared against the true values in the test dataset using various performance metrics such as accuracy, precision, recall, and mean squared error, depending on the nature of the problem (classification or regression). This process helps to verify the generalizability of the model (its ability to perform well across different data samples and environments), highlighting potential issues like overfitting or underfitting and ensuring that the model is robust and reliable for practical applications.
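For a binary classification problem, the listed metrics follow directly from counts of true positives, false positives, and false negatives. The function below is a plain-Python sketch with hypothetical names; libraries such as scikit-learn provide equivalent, more complete utilities.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```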
  • the machine learning models are fully validated and tested once the output predictions have been deemed acceptable by user defined acceptance parameters.
  • Acceptance parameters may be determined using correlation techniques such as Bland-Altman method and the Spearman’s rank correlation coefficients and calculating performance metrics such as the error, accuracy, precision, recall, receiver operating characteristic curve (ROC), etc.
  • Deploying the machine learning models includes moving the models from a development environment (e.g., a training and validation subsystem, where it has been trained, validated, and tested), into a production environment where it can make inferences on real-world data. This step typically starts with the model being saved after training, including its parameters and configuration such as final architecture and hyperparameters. It is then converted, if necessary, into a format that is suitable for deployment, depending on the deployment environment.
  • a model trained in a scientific computing environment such as Python might be converted into a Java-friendly format for integration into a larger enterprise application.
  • Deployment can be conducted on various platforms, including on-premises servers or cloud environments like AWS, Azure, or Google. The foregoing description can apply to any machine learning model described herein.
  • 3. Machine Learning Models for Locating an Adapter Sequence
  • [0366] As described above, prior to making inferences on real-world data, machine learning models are trained, and potentially validated and tested. Data sets for each of these steps may be generated from a primary collection of data that is divided into training, validation, and testing datasets.
  • the primary collection of data can be obtained from various data sources such as public datasets, private data collections, or real-time data streams, depending on a project’s needs.
  • the collected data comprises sequencing read data generated from sequencing methods (e.g., Xpandomer sequencing as described with respect to section II).
  • the read data can include the adapters and the DNA segment, where the adapter architecture may include the architectures described with respect to FIGs.25A and 25B.
  • the read data is used to synthesize or generate simulated read data that comprise sequencing errors, which can occur during the physical/chemistry processing, signal measurement (e.g., optical or electrical), and/or in determining base calls.
  • Sequencing errors that can be introduced include, without limitation, insertions, deletions, and substitutions. Simulating library-prep errors and sequencing errors can include incorporating such errors based on a hyper-parameter designated probability. For example, one or more hyper-parameters may include the error rate at which one or more error types are incorporated to generate deviations from the expected sequence.
  • the error rate may be based on known error rates associated with the sequencing method (e.g., Xpandomer synthesis), random error rate for SNV and/or indels during synthesis (e.g., based on probability distributions corresponding to the specific error), the rate of DNA damage, the sequence of the read (e.g., comprising homopolymers or repetitive regions of some k-mer in length), and the like as described in more detail with respect to FIGs.7, 9 and 10.
  • the error rate may be set by a user to match the error rate of the data they wish to analyze.
  • the error rate hyper-parameter for any given error type may be set so that the mean value of the error rate is about 1-2% across (i) the whole read sequence (e.g., the adapters and the DNA segment), (ii) the whole or a portion of the whole adapter sequence, (iii) the whole or a portion of the whole DNA segment, (iv) or any combination thereof.
  • the models learn the many nuances that can result in sequencing error.
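A simple way to synthesize error-containing reads from clean reads is to apply each error type independently per base at a hyper-parameter designated rate. The sketch below assumes uniform, position-independent rates, which is a simplification; context-dependent rates (e.g., for homopolymers) would need a richer model. The function and parameter names are hypothetical.

```python
import random

def simulate_errors(sequence, sub_rate=0.01, ins_rate=0.01, del_rate=0.01, rng=None):
    """Return a copy of sequence with random substitutions, insertions, deletions."""
    if rng is None:
        rng = random.Random()
    bases = "ACGT"
    out = []
    for base in sequence:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base
        if r < del_rate + sub_rate:
            # substitution: replace with one of the other three bases
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(bases))  # insertion after the current base
    return "".join(out)

rng = random.Random(0)  # seeded so a simulated training set is reproducible
noisy = simulate_errors("ACGTACGTACGTACGTACGT", 0.02, 0.02, 0.02, rng=rng)
print(noisy)
```

Setting each rate to roughly 0.01 to 0.02 would correspond to the mean error rate of about 1-2% mentioned above.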
  • FIG.26 shows a flowchart illustrating method 2600 for using machine learning models to segment components of adapter architecture from nucleic acid sequencing read data.
  • the sequencing read data may be generated by any of the methods described herein, such as the sequencing methods described with respect to section II of the disclosure.
  • the method 2600 depicted in FIG.26 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine).
  • the software may be stored on a non-transitory storage medium (e.g., on a memory device).
  • FIG.26 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.
  • sequence segment of a nucleic acid molecule is received.
  • the sequence segment specifies nucleotides at positions within the nucleic acid molecule.
  • the sequence segment can be all, or a portion of the sequence read, which can be all or a portion of the nucleic acid molecule.
  • the sequence read includes a first sequence portion corresponding to at least a portion of a nucleic acid segment from a biological sample (e.g., the nucleic acid insert) and a second sequence portion corresponding to an adapter segment that was added to the nucleic acid segment.
  • the sequence segment may be equal to the length of the sequence read.
  • the sequence segment (and thus the sequence read) may have a mean, median, average, or absolute length of about 15bp to about 1000bp.
  • the sequencing segment may be about 15bp, 16bp, 17bp, 18bp, 19bp, 20bp, 25bp, 50bp, 100bp, 150bp, 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, or about 1000bp or about any integer value between 15bp and 1000bp.
  • the sequence segment is equal in length to a portion of the sequence read, including the starting and ending portions of the sequence read.
  • 100 bases may be sequenced at one end of a DNA molecule to obtain the sequence read, and a sequence segment of only 64 might be used.
  • the sequence segment can be generated by cutting (e.g., extracting) a fixed number of bases “L” from the start and end positions of the sequence read.
  • the number of bases that are extracted is dependent on the size of the adapter sequence ligated to the end of the nucleic acid molecule, which is known in advance. Accordingly, the number of extracted bases cut from the start and end of the sequencing read may comprise 50, 55, 60, 65, 70, or 80 bases, or any whole number between 50 and 80 bases.
  • the sequence segment comprises 64 bases where some proportion of the bases correspond to at least a portion of the nucleic acid segment from a biological sample (e.g., the first sequence portion), while the second sequence portion corresponds to the adapter (also referred to as ‘extracted adapter sequence’) added to the nucleic acid segment.
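Cutting a fixed number of bases “L” from each end of a read is a slicing operation. The sketch below uses 64 bases as in the example above; the function name and the decision to reject reads shorter than 2L are assumptions made for illustration.

```python
def extract_end_segments(read, n_bases=64):
    """Return the first and last n_bases of a read as two sequence segments."""
    if len(read) < 2 * n_bases:
        # Assumed policy: the two segments should not overlap.
        raise ValueError("read is too short for two non-overlapping segments")
    return read[:n_bases], read[-n_bases:]

read = "A" * 64 + "C" * 100 + "G" * 64
start_segment, end_segment = extract_end_segments(read, n_bases=64)
print(len(start_segment), len(end_segment))  # 64 64
```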
  • the nucleic acid molecule may be either deoxyribonucleic acid (DNA) molecules or ribonucleic acid (RNA) molecules and polymers thereof in either single- or double-stranded form. Additionally, the nucleic acid molecule can comprise combinations of deoxyribonucleic acids and ribonucleic acids.
  • the nucleic acid molecule is a double stranded DNA molecule, which may also be referred to as an “insert” or “DNA insert”.
  • the nucleic acid molecule is obtained or provided from a biological sample and may include, but is not limited to, any cell, tissue or biological fluid comprising nucleic acid molecules.
  • the biological sample can be at least one cell, fetal cell(s), cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, and the like.
  • the biological sample can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
  • a first feature vector is generated using the nucleotides of the sequence segment at the positions within the nucleic acid molecule.
  • the first feature vector can include a series of data items, where each data item in the series indicates a nucleotide at a corresponding position in the sequence segment.
  • the series of data items represent encoded nucleotides that correspond to the nucleotide sequence of the nucleic acid molecule.
  • the nucleotide sequence of the nucleic acid molecule may be encoded using various methods that convert categorical data into a numerical format (e.g., one-hot encoding).
  • the nucleotide bases A, T, C, and G can be converted into a binary numerical format such as 2-bit encoding or 4-bit encoding as non-limiting examples.
  • the first feature vector would include a series of data items where the nucleotide bases may be encoded as follows: A: [0, 0]; C: [0, 1]; G: [1, 0]; T: [1, 1].
  • the first feature vector can include a series of data items where the nucleotide bases may be encoded as follows: A: [1, 0, 0, 0]; C: [0, 1, 0, 0]; G: [0, 0, 1, 0]; T: [0, 0, 0, 1].
  • the sequence segment may be received in this format, and thus the generation of the first feature vector may simply use the sequence segment.
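The two encodings listed above can be written as lookup tables. The tables reproduce the mappings given in the text; the `encode` helper is illustrative, and its handling of ambiguous bases such as N (it raises `KeyError`) is an assumption, since the text does not specify one.

```python
TWO_BIT = {"A": [0, 0], "C": [0, 1], "G": [1, 0], "T": [1, 1]}
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def encode(segment, table):
    """Build a feature vector: one encoded data item per position."""
    return [table[base] for base in segment.upper()]

print(encode("ACGT", TWO_BIT))  # [[0, 0], [0, 1], [1, 0], [1, 1]]
print(encode("GA", ONE_HOT))    # [[0, 0, 1, 0], [1, 0, 0, 0]]
```

The one-hot form uses more memory per base but avoids implying any ordering or distance between bases, which is often preferred for neural network inputs.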
  • a first adapter location in the sequence segment of a component of the adapter segment is determined by processing the first feature vector using a first machine learning model.
  • the first feature vector is fed into a first machine learning model, which can be trained to: (i) confirm one or more components of the adapter exists; and (ii) provide a location for the one or more components of the adapter sequence, which may be done for all components of the adapter sequence.
  • the first machine learning model can be a neural network, and more specifically a segmentation neural network.
  • the second sequence portion of the adapter can include a plurality of components including a first component and a second component.
  • the first component can be a fixed sequence (e.g., non-variable sequence) comprising the stems and/or anchor sequences of the adapter and the first component corresponds to the first adapter location.
  • the second component can be a variable sequence comprising SIDs, UMIs, and/or portion of the DNA insert.
  • the first machine learning model can use segmentation techniques to partition the series of data items from the first feature vector (e.g., extracted adapter sequences) into meaningful regions, such as the individual components of the adapter sequence.
  • the first machine learning model can locate the positions of fixed sequences in the adapter (e.g., the stems and/or anchors), which may be identified by their consistent sequence across all the adapters used.
  • the stem component comprises the nucleic acid sequences GACGTGTGCTCTTCCGATCT on the 5’ end and AGATCGGAAGAGCGTCGTGT on 3’ end. Accordingly, the first machine learning model identifies and segments the stem component of the adapter based on this known sequence. This same concept may be used to identify the location of the anchor sequence.
  • variable sequences e.g., SIDs, UMIs, and/or portion of the DNA insert
  • the method described herein may further comprise using the first adapter location to determine the variable sequence of the second component. For example, at the 3’ end of the extracted adapter sequence, the end position of an SID is 10 bases to the right of the end position of the stem, assuming the SID is a fixed length of 10 bases. This process of variable sequence extraction continues until both SIDs, both UMIs, and/or the portion of the DNA insert are identified.
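Once the end of a fixed sequence is known, a variable sequence of known length can be read off by offset. In the sketch below an exact string search stands in for the first machine learning model, purely to obtain a stem location for the example; the toy segment, the SID value, and the function name are hypothetical. Positions are half-open, so `stem_end` is the index just past the last stem base.

```python
def extract_sid(segment, stem_end, sid_length=10):
    """Return the SID assumed to start where the stem ends."""
    return segment[stem_end:stem_end + sid_length]

stem = "AGATCGGAAGAGCGTCGTGT"          # 3' stem sequence given in the text
segment = "GG" + stem + "ACGTACGTAC" + "TTT"

# Stand-in for the model output: locate the stem by exact match.
stem_start = segment.find(stem)
stem_end = stem_start + len(stem)

print(stem_end, extract_sid(segment, stem_end))  # 22 ACGTACGTAC
```

The same offset arithmetic can be chained to reach the UMI and the start of the DNA insert when their lengths are fixed.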
  • the first machine learning model can generate an output vector in which each base is labeled with an integer number (e.g., 0, 1, or 2) indicating whether the base belongs to a fixed sequence (e.g., stem, anchor) or to the rest of the adapter.
  • 1 may be used to label the stem
  • 2 to label the anchor
  • the start stem and start anchor may be located, and based on this information, the locations of variable sequences may be inferred (as noted above). For each position, a probability can be determined for each classification.
  • the integer label assigned to each base can be based on the probability, calculated by the first machine learning model, indicating whether the base corresponds to a stem sequence, an anchor sequence, or the rest of the adapter (e.g., other). Based on which category (e.g., stem, anchor, other) has the highest probability the appropriate integer label is assigned to that base.
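Assigning the label with the highest probability is an argmax per base. The sketch assumes the model emits, for each base, probabilities ordered as (other, stem, anchor), so that the index of the largest value is the integer label (0, 1, or 2); that ordering and the helper names are illustrative assumptions.

```python
def label_bases(probabilities):
    """Pick, for each base, the class index with the highest probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]

def runs_of(labels, target):
    """Return half-open (start, end) spans of consecutive target labels."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == target and start is None:
            start = i
        elif label != target and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

probs = [
    [0.90, 0.05, 0.05],  # other
    [0.10, 0.80, 0.10],  # stem
    [0.20, 0.70, 0.10],  # stem
    [0.10, 0.10, 0.80],  # anchor
    [0.60, 0.20, 0.20],  # other
]
labels = label_bases(probs)
print(labels)               # [0, 1, 1, 2, 0]
print(runs_of(labels, 1))   # [(1, 3)]
```

The spans returned by `runs_of` give the stem and anchor locations from which the positions of the variable sequences can be inferred.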
  • a quality control step may be performed to confirm that the identified fixed sequences and/or variable sequences fall into an acceptable range of their expected lengths. Any that do not pass quality control can be removed from the analysis. Those sequence segments that do pass quality control may optionally be compared to a LUT or classified by a second machine learning model (described in more detail below) to determine what sample the sequence segments originated from. 4.
  • FIG.27 shows a flowchart illustrating method 2700 for using a machine learning model to classify components of adapter architecture from nucleic acid sequencing read data.
  • the output vector from the first machine learning model comprising encoded/labeled fixed components, and non-labeled variable components is received.
  • the variable components/sequences may be used to determine which sample and/or molecule the sequence corresponds to.
  • the variable sequences e.g., segmented adapter components
  • the second feature vector may be generated using nucleotides at positions within the variable sequences segmented by the first machine learning model.
  • the second feature vector comprises a series of data items representing the encodings of the variable sequences.
  • the second feature vector is processed by a second machine learning model.
  • the second machine learning model can be a neural network (e.g., a classification neural network).
  • Processing the second feature vector by the second machine learning model can include: (i) encoding the second feature vector to a multidimensional data point of N dimensions; and (ii) comparing the multidimensional data point to a set of reference data points generated by applying the second machine learning model to a set of adapters used for the variable sequence in a sequencing run. This second step may also be used to determine to which sample and/or which molecule the sequence corresponds.
a. Fixed Pool

  • An objective of the second machine learning model is to match the variable sequences (e.g., SIDs, UMIs) to a pool of possible sequences. For example, the second machine learning model may be trained to match the SID sequence to a fixed pool of all the possible SID sequences.
  • the second machine learning model is trained using a simulated training set that was generated by randomly modifying a fixed pool of adapters for the variable sequence. Because the second machine learning model was trained on simulated SID data that comprises insertions, deletions, and substitutions to model data-preparation errors and sequencing errors, the model can match a sample SID that may comprise errors to an SID in the fixed pool. The second machine learning model can map the SID sequence to the sequence in the fixed pool that is closest in edit distance. The second machine learning model can output an identifier identifying a particular adapter from the fixed pool of adapters.
  • [0386] In some implementations, if the extracted SID is shorter than its theoretical length, one or more additional values may be added to pad the extracted sequence.
  • the extracted SID may be padded with a fifth value, 4, in addition to the values 0, 1, 2, and 3 used for A, T, C, and G.
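  • A short sketch of this padding scheme, using the integer codes stated in the text (0, 1, 2, 3 for A, T, C, G and 4 for padding). The function name `encode_sid` and the choice to truncate over-long sequences are assumptions made for illustration.

```python
BASE_TO_INT = {"A": 0, "T": 1, "C": 2, "G": 3}  # order follows "0,1,2,3 for ATCG"
PAD = 4  # fifth value used only for padding


def encode_sid(seq, length):
    """Integer-encode an extracted SID and right-pad it to a fixed length."""
    codes = [BASE_TO_INT[base] for base in seq[:length]]  # truncation is an assumption
    return codes + [PAD] * (length - len(codes))


padded = encode_sid("ATCG", 6)
```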
  • the sequence with the closest edit distance is output and used to: (i) determine which sample the sequence read belongs to in a pool of samples; and/or (ii) determine where downstream processes, such as adapter trimming, may begin.
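  • The target behavior of the fixed-pool classifier (returning the pool member closest in edit distance) can be sketched with an explicit Levenshtein computation. This is a brute-force reference for illustration only, not the neural network described above; the pool contents and names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion from a
                           cur[j - 1] + 1,              # insertion into a
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def closest_in_pool(seq, pool):
    """Return the identifier of the pool sequence nearest to `seq`."""
    return min(pool, key=lambda name: edit_distance(seq, pool[name]))


# Hypothetical two-member pool; the observed SID has one deleted base.
pool = {"SID00": "ACGTACGTAC", "SID01": "TTGGCCAATT"}
match = closest_in_pool("ACGTACTAC", pool)
```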
  • a fixed pool of SIDs is not used and an arbitrary SID is used.
  • the second machine learning model encodes the measured sequence (presumptive SID) and an actual SID into fixed-dimension vectors, and the distance between the encoded vectors is used to approximate the edit distance.
  • the distance can be determined between the measured sequence and each SID in the pool.
  • the SID that has the minimal metric distance from the sequence is deemed the correct classification.
  • a Q-score can be calculated as the metric distance between the sequence and the second-best SID minus the metric distance between the sequence and the best-matching SID.
  • a minimum Q-score can be required for a sequence to be classified to ensure there is enough separation between the top two SID candidates. Selection of the minimum Q-score threshold can be set according to accuracy requirements and sequence length on a case-by-case basis. For example, arbitrary SID classification can leverage the knowledge of which SIDs are actually in an experiment, making it a Bayesian classification.
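  • A hedged sketch of the Q-score rule described above. It assumes the sequences have already been encoded into fixed-dimension vectors and uses Euclidean distance as the metric; both the metric and the example vectors are assumptions, and the pool must contain at least two SIDs.

```python
import math


def classify_sid(query_vec, pool_vecs, min_q):
    """Return (best SID or None, Q-score) for an encoded query sequence."""
    distances = {name: math.dist(query_vec, vec) for name, vec in pool_vecs.items()}
    ranked = sorted(distances, key=distances.get)
    best, runner_up = ranked[0], ranked[1]
    q_score = distances[runner_up] - distances[best]  # second-best minus best
    return (best if q_score >= min_q else None), q_score


# Hypothetical 2-D embeddings of three SIDs.
pool_vecs = {"SID00": (0.0, 0.0), "SID01": (3.0, 4.0), "SID02": (6.0, 8.0)}
confident = classify_sid((0.0, 0.0), pool_vecs, min_q=1.0)
ambiguous = classify_sid((1.5, 2.0), pool_vecs, min_q=1.0)
```

  • In the second call the query is equidistant from the top two candidates, so the Q-score is zero and no SID is assigned.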
  • a third machine learning model may optionally be fed the second feature vector to classify the UMI sequences.
  • the third machine learning model is a neural network. More specifically, the third machine learning model may be a classification neural network trained to classify UMI sequences.
  • the third machine learning model has been trained in a similar process as the second machine learning model, where simulated UMI data is generated comprising insertions, deletions, and substitutions to model data-prep errors and sequencing errors. Furthermore, the third machine learning model can also apply the fixed pooling method and/or the arbitrary method described with respect to the second machine learning model. In various embodiments, the UMI in the second feature vector is mapped to a pool of UMI sequences that comprises at least 200 sequences.
  • [0390] Any of the methods described herein may use sequencing methods to generate the sequence reads, sequence segments, etc. used herein. The sequencing methods may be any one of the sequencing methods described in section II of the disclosure.
  • sequencing the double-stranded nucleic acid molecule includes: (i) creating a surrogate molecule from the double-stranded nucleic acid molecule, wherein the surrogate molecule includes one or more reporter elements corresponding to each nucleotide; (ii) passing the surrogate molecule through a nanopore to obtain electrical signals; and (iii) determining the first sequence of base calls and the second sequence of base calls of nucleotides in the double-stranded nucleic acid molecule using the electrical signals.
  • the methods described herein may further comprise repeating the sequencing method(s), repeating the method for using machine learning models to segment components of adapter architecture from nucleic acid sequencing read data, and/or repeating the method for using a machine learning model to classify components of adapter architecture from nucleic acid sequencing read data for at least 10,000 nucleic acid molecules.
  • a computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause a computer system to perform the methods of any one of the methods described herein.
  • [0394] Table 7 below shows the accuracy of a portion of the data collected from an experimental sequencing run.
  • the ground truth is obtained from the alignment approach. In this run, there are six SIDs.
  • the encoding classifier also simulates the case that there are 100 SIDs in the run.
  • Table 8

                      Classifier trained on     Encoding classifier    Encoding classifier
                      a fixed pool (848 SIDs)   (6 known SIDs)         (assume 100 SIDs)
    Yield             117012                    117855                 116698
    SID error rate    2E-4                      0                      8E-6
    Running time      95 seconds                93 seconds             581 seconds

C. Detecting Adapters using Frequency-Based Methods

  • Described herein are methods based on properties of mathematical (integral) transforms (e.g., Fourier transforms) to detect and identify adapter sequences (e.g., hairpin adapters or adapters at the end) in DNA fragments.
  • Frequency based algorithms are used to analyze functions or signals with respect to frequency, rather than time. For this to occur, signals represented as a function of time in the time domain are converted into frequencies in the frequency domain, where the signal is represented by its constituent frequencies and their respective amplitudes and phases.
  • Various basis functions may be used, such as sines, cosines, plane waves, wavelets, and the like.
  • Non-limiting examples of frequency-based algorithms include Fourier transform (including Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT)), wavelet transform (including Continuous Wavelet Transform (CWT) and Discrete Wavelet Transform (DWT)), Short-Time Fourier Transform (STFT), Z-transform, Laplace transform, Goertzel Algorithm, Hilbert-Huang Transform (HHT), Cepstrum Analysis, Spectrogram Analysis, Autoregressive (AR) Models, Cross-Spectral Analysis, Principal Component Analysis (PCA) in the Frequency Domain, Filter Design Algorithms (including FIR (Finite Impulse Response) Filters and IIR (Infinite Impulse Response) Filters), Convolution and Correlation (including Frequency Domain Convolution and cross-correlation), Music Information Retrieval (MIR) Algorithms, Harmonic Analysis, and Frequency Modulation (FM) and Demodulation.
  • the cross-correlation may be determined in the frequency space, which improves the efficiency of the computation.
  • the most common approach is to zero-pad the shorter signal to match the length of the longer signal before performing the Fourier transform and cross-correlation in the frequency domain; this allows you to compare the signals at the same time scale despite their differing lengths.
  • zeros are added to the shorter sequence before (or optionally after) the actual sequence.
  • once the resulting sequence signals are the same length (e.g., padded if necessary), they can be encoded (e.g., as described above) and transformed into the frequency space using frequency-based techniques (e.g., Fourier transformations including Fast Fourier Transform (FFT), Z-transformations, Laplace transformations, and the like).
  • the frequency transform can be applied to the encoded signal. This transformation decomposes the function or signal into its constituent waveform components, each characterized by a specific frequency, amplitude, and phase.
  • a frequency-based cross-correlation is determined using the frequency-encoded sequences.
  • the cross-correlation is computed in the frequency domain by multiplying the first frequency-encoded signal with the complex conjugate of the second frequency-encoded signal.
  • the complex conjugate is used because in the frequency domain, cross-correlation is equivalent to multiplying one signal with the complex conjugate of the other's frequency-encoded signal.
  • the result of this multiplication is an array representing the cross-correlation in the frequency domain.
  • the frequency-domain cross-correlation array can then be transformed (by applying the Inverse Fast Fourier Transform (IFFT)) to bring it back to the time domain. This step ensures that no information is lost in the transformation processes and gives the cross-correlation signal.
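  • The steps above (zero-pad, encode, transform, multiply by the complex conjugate, inverse transform) can be sketched with NumPy as follows. The complex encoding shown is a hypothetical stand-in for the encoding of FIG.28, which is not reproduced here, and the read and adapter are toy sequences. Because both signals are padded only to the read length, the correlation is circular (shifts near the end wrap around).

```python
import numpy as np

# Hypothetical unit-magnitude complex encoding (assumption, not the FIG.28 encoding).
ENCODING = {"A": 1 + 0j, "C": 0 + 1j, "G": 0 - 1j, "T": -1 + 0j}


def encode(seq):
    """Map each base to its complex code."""
    return np.array([ENCODING[base] for base in seq], dtype=complex)


def cross_correlate(read, adapter):
    """Cross-correlate an adapter against a longer read via the frequency domain."""
    n = len(read)
    read_signal = encode(read)
    adapter_signal = np.zeros(n, dtype=complex)
    adapter_signal[:len(adapter)] = encode(adapter)  # zeros added after the sequence
    read_freq = np.fft.fft(read_signal)
    adapter_freq = np.fft.fft(adapter_signal)
    product = read_freq * np.conj(adapter_freq)  # cross-correlation in frequency space
    return np.fft.ifft(product).real  # back to nucleotide (position) space


read = "TTGACCA" + "ACGTTGCA" + "GGATCCTA"
adapter = "ACGTTGCA"
signal = cross_correlate(read, adapter)
start = int(np.argmax(signal))  # position of the maximum similarity index
```

  • With unit-magnitude codes, a perfect match contributes 1 per base, so the peak height equals the adapter length at the adapter start position.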
  • the cross-correlation signal provides several key pieces of information: (i) the overall cross-correlation signal (made up of individual nucleotide-space signals) corresponding to the similarity between the signals (e.g., the adapter and the sequence read); and (ii) the nucleotide-space signal for each nucleic acid base comprising the sequence of the adapter sequence and the read sequence, the latter being a maximum at the base position (location) of the start or end of the adapter in the sequence read.
  • similarity index refers to a measure that quantifies the degree of similarity between two signals or data sets and is derived from the cross-correlation signal, e.g., the value of the cross-correlation signal at each position.
  • the maximized index represents the point of maximum similarity between the two signals (i.e., how much one signal needs to be shifted for the best alignment with the other).
  • the cross-correlation signal can be analyzed to find the maximum similarity index (e.g., the highest peak signal) between the two signals (e.g., the adapter sequence and the read sequence) for each adapter.
  • the overall highest maximum similarity index, which generally should be higher than the others by a large amount, can identify the location/position of the adapter in the sequence read.
  • SID number 00 (FIG.29A) has the highest autocorrelation value near base 210.
  • This maximum similarity index indicates 1) the adapter sequence added to the read sequence comprises the SID sequence associated to SID 00, which is a known sequence, and 2) the center location of the adapter sequence.
  • the peak signal near base 210 also indicates the “shift” or the translation of a signal in the time (or spatial) domain and its corresponding effect in the frequency domain.
  • the method used to find the starting position of the adapter is also used on the reverse complement portion of the read construct to find the end position.
  • the reverse of the sequence read and the adapter can be taken, and then encoded, transformed, and cross-correlated.
  • the highest peak (maximum similarity index) for this reverse analysis provides the position of the ending of the adapter sequence.
  • the cross-correlation can be applied between (i) the read sequence and the hairpin, (ii) the read sequence and the SID, (iii) the read sequence and the reverse complement SID, or (iv) any combination thereof.
  • FIG.30 shows a flowchart illustrating method 3000 for determining the location and the sequence of an adapter in a sequencing read using cross-correlation frequency-based methods.
  • the method 3000 depicted in FIG.30 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine).
  • the software may be stored on a non-transitory storage medium (e.g., on a memory device).
  • the method presented in FIG.30 and described below is intended to be illustrative and non-limiting. Although FIG.30 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel. [0411]
  • an adapter sequence and a sequencing read of a nucleic acid molecule are received.
  • the adapter sequence comprises at least 20 unique known adapter sequences.
  • the set of adapter sequences can comprise a variety of adapter architectures such as hairpin adapters, Y-adapters, dumbbell adapters, the like, or any combination thereof.
  • the sequence read specifies nucleotides at positions within the nucleic acid molecule. Additionally, the sequence read includes a first sequence portion corresponding to at least a portion of a nucleic acid segment from a biological sample (e.g., a DNA insert) and a second sequence portion corresponding to the adapter that was added to the nucleic acid segment. [0412] At 3010, the nucleotides of the sequence read are encoded into a first series of nucleotide encodings, wherein each nucleotide has a different encoding.
  • the set of adapter sequences are also encoded into a set of second series of nucleotide encodings, where each adapter sequence in the set of adapter sequences has a different encoding.
  • the different nucleotide encodings (i) do not overlap with each other, (ii) are orthogonal to each other, (iii) are in complex space (see FIG.28), or (iv) use at least four dimensions (see Encoding section above).
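  • As one illustration of a non-overlapping, orthogonal, four-dimensional encoding, the sketch below one-hot encodes each base and correlates each of the four channels separately. Summing the channels then gives the number of matching bases at every shift. One-hot encoding is an assumed example of option (iv), and the function names and sequences are hypothetical.

```python
import numpy as np

BASES = "ACGT"  # channel order is an arbitrary assumption


def one_hot(seq):
    """Encode a sequence as a 4 x len(seq) array with one orthogonal channel per base."""
    out = np.zeros((4, len(seq)))
    for pos, base in enumerate(seq):
        out[BASES.index(base), pos] = 1.0
    return out


def match_counts(read, adapter):
    """Number of matching bases for each circular shift of the adapter along the read."""
    n = len(read)
    read_freq = np.fft.fft(one_hot(read), axis=1)
    adapter_freq = np.fft.fft(one_hot(adapter), n, axis=1)  # zero-padded to length n
    per_channel = np.fft.ifft(read_freq * np.conj(adapter_freq), axis=1).real
    return np.rint(per_channel.sum(axis=0)).astype(int)


counts = match_counts("GGACGTGG", "ACGT")
best_shift = int(np.argmax(counts))
```

  • Because the channels are orthogonal, a mismatch contributes nothing to the sum, so the peak height can be read directly as a match count.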
  • the nucleotide encodings of the first series of nucleotides and the set of second series of nucleotides are transformed into a first frequency domain signal and a set of second frequency domain signals, respectively.
  • the transformation is done using a frequency-based algorithm (e.g., Fourier transformations, Z-transformations, Laplace transformations, and the like).
  • the frequency-based algorithm is the Fast Fourier Transform (FFT) algorithm.
  • a maximum similarity index determined from the cross-correlation signals, is used to determine the location and the sequence of the true adapter sequence corresponding to the adapter that was added to the nucleic acid segment. Furthermore, the adapter location can be used to determine a segment location of the nucleic acid segment, wherein the adapter location can be a start position of the adapter in the sequence read. [0416] In various embodiments, the adapter sequence from 3005 is a first adapter of a set of adapters used in a sequencing library. Accordingly, the method described herein can further comprise repeating the process of obtaining the cross-correlation signal for other adapters in the set of adapters.
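  • Repeating the cross-correlation for every adapter in a set and keeping the overall highest peak can be sketched as below. The encoding, the candidate names and sequences, and the use of equal-length candidates (so that peak heights are directly comparable) are assumptions made for illustration.

```python
import numpy as np

# Hypothetical unit-magnitude complex encoding (assumption).
ENCODING = {"A": 1 + 0j, "C": 0 + 1j, "G": 0 - 1j, "T": -1 + 0j}


def encode(seq):
    return np.array([ENCODING[base] for base in seq], dtype=complex)


def best_peak(read_freq, adapter, n):
    """Return (peak value, peak position) of one adapter's cross-correlation signal."""
    adapter_freq = np.fft.fft(encode(adapter), n)  # n > len(adapter) zero-pads the adapter
    signal = np.fft.ifft(read_freq * np.conj(adapter_freq)).real
    position = int(np.argmax(signal))
    return float(signal[position]), position


def identify_adapter(read, adapters):
    """Pick the candidate whose maximum similarity index is highest overall."""
    read_freq = np.fft.fft(encode(read))  # transform the read once, reuse per adapter
    results = {name: best_peak(read_freq, seq, len(read))
               for name, seq in adapters.items()}
    name = max(results, key=lambda key: results[key][0])
    return name, results[name][1], results[name][0]


read = "TTGACCA" + "ACGTTGCA" + "GGATCCTA"
candidates = {"SID00": "ACGTTGCA", "SID01": "CTAGGATC", "SID02": "TGCAACGT"}
name, start, score = identify_adapter(read, candidates)
```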
  • FIG.33 shows a graph illustrating how autocorrelation methods may be used to find the center location of an adapter sequence when cross-correlation analysis does not produce conclusive results.
  • the x-axis indicates the “lag” (i.e., how much a signal was shifted), and the y-axis indicates the magnitude of the autocorrelation signal for each discrete amount of “lag.”
  • the autocorrelation signal at every base except the base near 200 bases of “lag” is very low.
  • a peak in the autocorrelation signal is observed, which indicates a position of the hairpin adapter because a location of the peak in the autocorrelation signal shows the best alignment of the sequence and its reverse complement.
  • Autocorrelation can be used to analyze the symmetrical nature of HD sequence reads.
  • the sequence read specifies nucleotides at positions within the nucleic acid molecule.
  • the sequence read includes first sequence portions corresponding to two nucleic acid segments from a biological sample and a second sequence portion corresponding to the adapter that was added between the two nucleic acid segments.
  • the sequence read includes multiple copies of at least one strand of the two strands of the nucleic acid molecule.
  • the adapter indicates an origin of the nucleic acid segment.
  • the different nucleotide encodings (i) do not overlap with each other, (ii) are orthogonal to each other, (iii) are in complex space (see FIG.28), or (iv) use at least four dimensions (see Encoding section above).
  • the reverse complement of the sequence read is also encoded into a second series of nucleotide encodings. This takes advantage of the symmetrical nature of the sequence read.
  • the encoding into different coordinates can use vectors (e.g., independent vectors, orthogonal vectors, normalized vectors, orthonormal vectors, and the like) to be mapped in four-dimensional space or use coordinates in complex space (see FIG.28).
  • the frequency-domain cross-correlation signal is transformed into a time domain signal to obtain cross-correlation signals.
  • an adapter location of the adapter is determined using a maximum of the cross-correlation signal.
  • the adapter location corresponds to the middle of the adapter.
  • the adapter is optionally a hairpin adapter. Once the location of the center of the adapter (e.g., SID + hairpin) structure in the sequence is detected using autocorrelation, the specific adapter sequence can be narrowed down using the location and known adapter sequences from the set of adapter candidates.
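  • A sketch of the symmetry-based approach, correlating a read with its own reverse complement to estimate the center of a hairpin adapter. For a read of length n whose two passes are complementary about a center c, the peak lag k satisfies c = (k + n - 1) / 2. The encoding, the toy insert, and the self-complementary hairpin used here are assumptions chosen so the example is easy to check by hand.

```python
import numpy as np

# Hypothetical encoding in which complementary bases have opposite codes (assumption).
ENCODING = {"A": 1 + 0j, "T": -1 + 0j, "C": 0 + 1j, "G": 0 - 1j}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def encode(seq):
    return np.array([ENCODING[base] for base in seq], dtype=complex)


def hairpin_center(read):
    """Return (center index, peak lag) from read vs. reverse-complement correlation."""
    n = len(read)
    size = 2 * n  # zero-padding to 2n avoids circular wrap-around
    read_freq = np.fft.fft(encode(read), size)
    rc_freq = np.fft.fft(encode(reverse_complement(read)), size)
    signal = np.fft.ifft(read_freq * np.conj(rc_freq)).real
    lag = int(np.argmax(signal))
    if lag >= n:  # indices at or past n represent negative lags
        lag -= size
    return (lag + n - 1) / 2.0, lag


insert = "ACGGTCATGCA"
hairpin = "GAATTC"  # toy self-complementary hairpin occupying read indices 11..16
read = insert + hairpin + reverse_complement(insert)[:7]  # truncated second pass
center, lag = hairpin_center(read)
```

  • The returned center of 13.5 falls midway between indices 11 and 16, i.e., the middle of the toy hairpin, even though the second pass is shorter than the first.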
  • a read sequence comprises adapters at the 3’ and 5’ ends of the read sequence, or if the read sequences are two pass HDD read constructs (described with respect to FIGs.6A and 6B), four pass HDD read constructs (described with respect to FIGs.8A-8D), or “n” pass HDD read constructions (described with respect to FIGs.11A-11C) described with respect to section III (e.g., Alternative HDD Read constructs). Accordingly, multiple maxima in the cross-correlation signal are observed.
  • the methods for adapter detection may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine).
  • the software may be stored on a non-transitory storage medium (e.g., on a memory device).
  • the one or more processors include CPUs, GPUs, TPUs, FPGAs, DSPs, ASICs, MCUs, NPUs, vector processors, quantum processors, FPAAs, SoC, the like, or any combination thereof.
  • the GPU may include hundreds or even thousands of parallel processing cores that allow them to handle multiple tasks simultaneously.
  • Other architectures besides GPUs may also be used, e.g., processors having single instruction multiple data (SIMD).
  • one of skill in the art can appreciate how the aforementioned demultiplexing methods (i.e., adapter detection via (i) dual sliding window algorithm, (ii) machine learning models, and (iii) frequency-based methods) can be modified to detect other adapter constructions not described in FIGs.25A and 25B.
  • hairpin adapters, Y-Open-Hairpin-adapters, dumbbell adapters, and other adapter architectures specific to certain sequencing methods may also be contemplated.
  • the aforementioned demultiplexing methods may also be modified to detect the one or more adapter sequences within the two pass HDD read constructs (described with respect to FIGs.6A and 6B), the four pass HDD read constructs (described with respect to FIGs.8A-8D), and the “n” pass HDD read constructions (described with respect to FIGs.11A-11C) described with respect to section III (e.g., Alternative HDD Read constructs).
  • Modification of the aforementioned demultiplexing methods would not unreasonably broaden the scope of the described methods, as all three methods take advantage of the natural sequence symmetry that is inherent to the sequencing constructions described herein.
  • a system comprising the computer product of any one of the disclosed methods and one or more processors configured to execute the instructions of any of the disclosed methods stored on the computer readable medium.
  • the system comprises the means for performing any of the disclosed methods as well as one or more processors configured to perform any of the disclosed methods.
  • the system comprises modules that respectively perform the steps of any of the disclosed methods.
  • a sequencing device for determining consensus sequences of double-stranded nucleic acid molecules.
  • the sequencing device comprises a set of sequencing cells (e.g., at least 10,000 sequencing cells), each configured to perform (i) sequencing of a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements, and (ii) sequencing of a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements.
  • the sequencing device also comprises a consensus circuit electrically connected with the set of sequencing cells.
  • the comparator circuit for each of the double-stranded nucleic acid molecules, is configured to perform the process of (i) receiving the first sequence of base measurements and the second sequence of base measurements and (ii) generating a consensus sequence using base call values.
  • comparing a first base measurement to a second base measurement comprises (i) determining a first base call using the one or more of the first base measurements, (ii) determining a second base call using the one or more of the second base measurements, and (iii) comparing the first base call and the second base call.
  • the comparator circuit is further configured to determine whether a position of the plurality of positions is concordant or discordant based on the comparing, wherein the base call value is dependent on whether the position is concordant or discordant. [0439] In various embodiments, a number of bits used for the base call value is dependent on whether the position is concordant or discordant. [0440] In various embodiments, the comparator circuit is further configured to generate metadata identifying which positions are discordant, and wherein the consensus sequence includes the metadata. [0441] In various embodiments, the set of sequence cells and the comparator circuit are on a same printed circuit board.
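  • The comparison logic described above can be illustrated in software as follows. This is only a sketch of the concordant/discordant decision, not a description of the circuit: it assumes the second strand's base calls have already been reverse-complemented and aligned to the first, and the "N" placeholder for discordant positions and the function name are hypothetical.

```python
def partial_consensus(first_calls, second_calls, discordant_symbol="N"):
    """Compare aligned base calls and return (consensus string, discordant positions)."""
    consensus = []
    discordant = []
    for pos, (first, second) in enumerate(zip(first_calls, second_calls)):
        if first == second:
            consensus.append(first)  # concordant: keep the agreed base call
        else:
            consensus.append(discordant_symbol)  # discordant: placeholder value
            discordant.append(pos)  # metadata identifying the discordant position
    return "".join(consensus), discordant


sequence, flagged = partial_consensus("ACGTAC", "ACTTAC")
```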
  • FIG.35 illustrates a measurement system 3500 according to an embodiment of the present disclosure.
  • the system as shown includes a sample 3505, such as Xpandomers within an assay device 3510, where an assay 3508 can be performed on sample 3505.
  • sample 3505 can be contacted with reagents of assay 3508 to provide a signal (e.g., an intensity signal) of a physical characteristic 3515 (e.g., sequence information of a cell-free nucleic acid molecule).
  • Assay 3508 may include sequencing by expansion with an assay device 3510.
  • An example of an assay device 3510 can be a well plate that includes Xpandomers.
  • Physical characteristic 3515 e.g., a fluorescence intensity, a voltage, or a current
  • Detector 3520 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal.
  • an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
  • Assay device 3510 and detector 3520 can form an assay system, e.g., a PCR system or a sequencing system that performs sequencing according to embodiments described herein.
  • a data signal 3525 is sent from detector 3520 to logic system 3530.
  • data signal 3525 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA).
  • Data signal 3525 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 3505, and thus data signal 3525 can correspond to multiple signals.
  • Data signal 3525 may be stored in a local memory 3535, an external memory 3540, or a storage device 3545.
  • the assay system can be comprised of multiple assay devices 3510 and detectors 3520.
  • Logic system 3530 may be, or may include, a computer system, ASIC, microprocessor, graphics processing unit (GPU), etc.
  • Logic system 3530 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 3520 and/or assay device 3510.
  • Logic system 3530 may also include software that executes in a processor 3550.
  • Logic system 3530 may include a computer readable medium storing instructions for controlling system 3500 to perform any of the methods described herein.
  • logic system 3530 can provide commands to a system that includes assay device 3510 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order. Such physical operations may be performed by a robotics system, e.g., including a robotic arm, as may be used to obtain a sample and perform an assay 3508. Logic system 3530 can perform any steps of methods described herein that perform computer processing. [0446] Measurement system 3500 may also include a treatment device 3560, which can provide a treatment to the subject. Treatment device 3560 can determine a treatment and/or be used to perform a treatment.
  • Measurement system 3500 may also include a reporting device 3555, which can present results of any of the methods described herein, e.g., as determined using the measurement system 3500. Reporting device 3555 can be in communication with a reporting module within logic system 3530 that can aggregate, format, and send a report to reporting device 3555.
  • the reporting module can present information determined using any of the methods described herein.
  • the information can be presented by reporting device 3555 in any format that can be recognized and interpreted by a user of the measurement system 3500.
  • the information can be presented by reporting device 3555 in a displayed, printed, or transmitted format, or any combination thereof.
  • Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in the computer system of FIG.36.
  • the computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
  • a computer system can include multiple computer apparatuses, each being a subsystem, with internal components.
  • a computer system can include desktop and laptop computers, tablets, mobile phones and other mobile devices.
  • the subsystems shown in FIG.36 are interconnected via a system bus 3675. Additional subsystems such as a printer 3674, keyboard 3678, storage device(s) 3679, monitor 3676 (e.g., a display screen, such as an LED), which is coupled to display adapter 3682, and others are shown.
  • Peripherals and input/output (I/O) devices, which couple to I/O controller 3671, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 3677 (e.g., USB, FireWire®).
  • I/O port 3677 or external interface 3681 can be used to connect computer system 3610 to a wide area network such as the Internet, a mouse input device, or a scanner.
  • the interconnection via system bus 3675 allows the central processor 3673 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 3672 or the storage device(s) 3679 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
  • the system memory 3672 and/or the storage device(s) 3679 may embody a computer readable medium.
  • a computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 3681, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component.
  • computer systems, subsystems, or apparatuses can communicate over a network.
  • one computer can be considered a client and another computer a server, where each can be part of a same computer system.
  • a client and a server can each include multiple systems, subsystems, or components.
  • methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices.
  • Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages.
  • Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Described herein are methods, systems, and apparatuses for determining a partial consensus sequence of a double-stranded nucleic acid molecule. Both strands of the nucleic acid molecule may be sequenced to generate a sequence of base calls. The base calls may be used to identify sets of concordant and discordant positions. A partial consensus sequence may be generated by using concordant values derived from the concordant positions and discordant values derived from the discordant positions. Also described herein are methods, systems, and apparatuses for determining a consensus sequence of a double-stranded nucleic acid molecule. Both strands of the nucleic acid molecule may be sequenced to generate base calls and quality scores corresponding to each strand. Concordant and discordant positions may be identified using the sequences of base calls. Discordant positions may also use the quality scores and weights. A consensus sequence may be determined using the concordant and discordant positions.

Description

PATENT Client Reference No.: P39048-WO-1
INTERNATIONAL PATENT APPLICATION
Title: HIGH THROUGHPUT INRAMOLECULAR CONSENSUS READS
Inventors:
Jagdeesh Chandrasekar, a U.S. citizen, resident of Seattle, WA
Amal Chaturvedi, an Indian citizen, resident of San Jose, CA
Mahdi Golkaram, a U.S. and Iranian citizen, resident of San Diego, CA
Mark Kokoris, a U.S. citizen, resident of Bothell, WA
Miroslav Kukricar, a U.S. citizen, resident of Dublin, CA
Igor Mandric, a citizen of Moldova, resident of San Diego, CA
John Mannion, a U.S. citizen, resident of Menlo Park, CA
Robert McRuer, a U.S. citizen, resident of Mercer Island, WA
J. Robert Michael, a U.S. citizen, resident of Spring Hill, TN
Mohammad Sahraeian, a U.S. citizen, resident of Belmont, CA
Sam Salari, a Canadian citizen, resident of East Palo Alto, CA
Xixi Wang, a U.S. citizen, resident of Fremont, CA
Daniel Zinder, a U.S. citizen, resident of San Jose, CA
Erfan Sayarri, an Iranian citizen, resident of Sunnyvale, CA
Assignee: Roche Sequencing Solutions, Inc., 4300 Hacienda Drive, Pleasanton, CA 94588, United States of America
Entity: Large
HIGH THROUGHPUT INRAMOLECULAR CONSENSUS READS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This Application claims the benefit of United States Provisional Patent Application No. 63/573,191, titled “HIGH THROUGHPUT INRAMOLECULAR CONSENSUS READS”, filed on April 2, 2024, and United States Provisional Patent Application No. 63/689,578, titled “HIGH THROUGHPUT INRAMOLECULAR CONSENSUS READS”, filed on August 30, 2024, the entire contents of which are hereby incorporated by reference.
BACKGROUND
[0002] Sequencing by Expansion (SBX) is a high-throughput sequencing technology that utilizes a biochemical process to transcribe the sequence of DNA onto a measurable polymer called an "Xpandomer," which is described in more detail in Kokoris et al., U.S. Pat. No. 7,939,259, entitled "High Throughput Nucleic Acid Sequencing by Expansion", which is herein incorporated by reference in its entirety. The transcribed sequence is encoded along the Xpandomer backbone in high signal-to-noise reporters that are separated by ~10 nm and are designed for high-signal-to-noise, well-differentiated responses. These differences provide significant performance enhancements in sequence read efficiency and accuracy of Xpandomers relative to native DNA. Xpandomers can enable several next generation DNA sequencing detection technologies and are well suited to nanopore sequencing.
[0003] Given its high-throughput sequencing capacity, SBX technology generates an enormous amount of digital data (e.g., raw sequencing data, alignment files, intermediate and final result files, etc.) that are processed, transferred, stored, and archived. This poses major challenges for processing the data in real time, where the practical limitations of the computing system (e.g., available storage space, how fast the files can be accessed once stored, etc.) must be considered.
SUMMARY
[0004] Techniques described herein relate to a method for determining a partial consensus sequence of a double-stranded nucleic acid molecule, the method comprising: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of base calls; sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of base calls; identifying a first set of concordant positions and a second set of discordant positions using the first sequence of base calls and the second sequence of base calls; representing each of the first set of concordant positions by a concordant value of a first group of four concordant values, each concordant value representing a concordant pair of bases on the first strand and the second strand; representing each of the second set of discordant positions by a discordant value of a second group of at least 12 discordant values, each discordant value representing a discordant pair of bases on the first strand and the second strand; and generating the partial consensus sequence using (1) the concordant values at the first set of concordant positions and (2) the discordant values at the second set of discordant positions.
[0005] According to one embodiment, the first group of four concordant values is specified using two binary bits and includes A<>T, C<>G, G<>C, and T<>A.
[0006] According to one embodiment, the second group of at least 12 discordant values is specified using at least four binary bits and includes A<>A, A<>C, A<>G, C<>A, C<>C, C<>T, G<>A, G<>G, G<>T, T<>C, T<>G, and T<>T.
[0007] According to one embodiment, the second group of at least 12 discordant values includes at least 20 discordant values.
[0008] According to one embodiment, the at least 20 discordant values are specified using five binary bits.
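The two value groups described above can be illustrated with a minimal Python sketch. It assumes the two reads are already aligned without gaps, that `second[i]` is the base opposite `first[i]`, and that discordant positions are flagged by a separate position list; the function name, the code assignments, and the bit tally are illustrative assumptions rather than the encoding used in any embodiment.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Four concordant pairs (first-strand base, second-strand base): 2 bits each.
CONCORDANT = [(b, COMPLEMENT[b]) for b in BASES]
# Twelve discordant pairs (every non-complementary pairing): 4 bits each.
DISCORDANT = [(a, b) for a, b in product(BASES, BASES) if COMPLEMENT[a] != b]

CONCORDANT_CODE = {pair: i for i, pair in enumerate(CONCORDANT)}
DISCORDANT_CODE = {pair: i for i, pair in enumerate(DISCORDANT)}


def encode_partial_consensus(first, second):
    """Return (values, discordant_positions, payload_bits) for two aligned reads."""
    if len(first) != len(second):
        raise ValueError("reads must be aligned to the same length")
    values, discordant_positions = [], []
    for i, pair in enumerate(zip(first, second)):
        if pair in CONCORDANT_CODE:
            values.append(CONCORDANT_CODE[pair])
        else:
            values.append(DISCORDANT_CODE[pair])
            discordant_positions.append(i)
    n_disc = len(discordant_positions)
    # Payload only; the cost of storing the position list is not counted here.
    payload_bits = 2 * (len(values) - n_disc) + 4 * n_disc
    return values, discordant_positions, payload_bits
```

For the aligned pair `"ACGTA"` / `"TGCAA"`, only the last position is discordant (A opposite A), so the payload is 4 × 2 + 1 × 4 = 12 bits, compared with 20 bits for storing both five-base reads at two bits per base.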
[0009] According to one embodiment, generating the partial consensus sequence includes: including metadata that specifies the second set of discordant positions.
[0010] According to one embodiment, the concordant values for the first set of concordant positions and the discordant values for the second set of discordant positions are usable to recover the base calls of the first sequence and the second sequence at the first set of concordant positions and the second set of discordant positions.
[0011] According to one embodiment, the method further comprises: transmitting the partial consensus sequence to a computer system.
[0012] According to one embodiment, the method further comprises: aligning the first sequence of base calls, the second sequence of base calls, or both to a reference genome, wherein the first set of concordant positions do not match the reference genome; identifying a third set of concordant positions that match the reference genome; and representing each of the third set of concordant positions with an indication of a genomic coordinate in the reference genome.
[0013] According to one embodiment, the indication of the genomic coordinate in the reference genome includes a starting genomic coordinate of the first sequence of base calls and a binary bit that specifies whether the concordant position matches the reference genome or not.
[0014] According to one embodiment, the indication of the genomic coordinate in the reference genome includes a starting genomic coordinate of the first sequence of base calls and metadata specifying the concordant positions that do not match the reference genome.
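The reference-based representation described above (a starting genomic coordinate plus one bit per position) might look like the following sketch. It assumes a gapless alignment and treats the reference as a plain string; the function names and the returned tuple layout are hypothetical.

```python
def encode_against_reference(read, reference, start):
    """Encode a read as (start, match_bits, mismatched_bases).

    match_bits[i] is 1 when read[i] equals reference[start + i]; bases are
    stored only for positions whose bit is 0.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    match_bits = [1 if r == g else 0 for r, g in zip(read, window)]
    mismatched = [r for r, bit in zip(read, match_bits) if bit == 0]
    return start, match_bits, mismatched


def decode_against_reference(start, match_bits, mismatched, reference):
    """Rebuild the read from the reference and the stored mismatches."""
    remaining = iter(mismatched)
    return "".join(
        reference[start + i] if bit else next(remaining)
        for i, bit in enumerate(match_bits)
    )
```

Positions that match the reference cost one bit each under this scheme, and only the non-matching positions carry an explicit base.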
[0015] Techniques described herein relate to a method for determining a consensus sequence of a double-stranded nucleic acid molecule, the method comprising: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of base calls, each having a first quality score and a first label corresponding to the first strand; sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of base calls, each having a second quality score and a second label corresponding to the second strand; identifying a first set of concordant positions and a second set of discordant positions using the first sequence of base calls and the second sequence of base calls; for each discordant position of the second set of discordant positions: determining a consensus base call using the first quality score, the second quality score, a first weight corresponding to the first label, and a second weight corresponding to the second label; and generating the consensus sequence using (1) concordant values at the first set of concordant positions and (2) the consensus base calls at the second set of discordant positions.
[0016] According to one embodiment, the first weight and the second weight are dependent on base calls adjacent to the discordant position.
[0017] According to one embodiment, determining the consensus base call at an initially discordant position of the second set of discordant positions includes: changing the initially discordant position to be a concordant position for a first base call of the first strand based on the first quality score being higher than the second quality score for a second base call of the second strand.
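As a rough illustration of resolving a discordant position from quality scores and strand weights, the sketch below multiplies each quality score by the weight for its strand and keeps the call with the larger product. That combination rule, the default weights, and the function names are assumptions made for this example; the text above does not specify how scores and weights are combined.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def resolve_position(first_base, first_q, second_base, second_q,
                     first_weight=1.0, second_weight=1.0):
    """Return a consensus base in first-strand orientation."""
    second_as_first = COMPLEMENT[second_base]
    if first_base == second_as_first:
        return first_base  # concordant: nothing to resolve
    # Assumed rule: weighted quality score decides the discordant position.
    first_score = first_weight * first_q
    second_score = second_weight * second_q
    return first_base if first_score >= second_score else second_as_first


def consensus(first, first_quals, second, second_quals,
              first_weight=1.0, second_weight=1.0):
    """Build a consensus from two gaplessly aligned reads and their scores."""
    return "".join(
        resolve_position(a, qa, b, qb, first_weight, second_weight)
        for a, qa, b, qb in zip(first, first_quals, second, second_quals)
    )
```

With equal weights, a low-quality first-strand call loses to a high-quality second-strand call; raising `first_weight` can reverse that outcome, which is the role the strand labels play in this sketch.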
[0018] According to one embodiment, the initially discordant position is changed to be the concordant position for the first base call of the first strand further based on a concordant base on the second strand having a measured signal that is adjacent to the second base call.
[0019] According to one embodiment, determining the consensus base call at an initially discordant position of the second set of discordant positions includes: changing the initially discordant position to be a concordant position for a first base call of the first strand based on the first weight being higher than the second weight.
[0020] According to one embodiment, the consensus sequence is a partial consensus sequence.
[0021] According to one embodiment, identifying the first set of concordant positions and the second set of discordant positions includes: aligning the first sequence of base calls to the second sequence of base calls.
[0022] According to one embodiment, aligning the first sequence of base calls to the second sequence of base calls includes: aligning the first sequence of base calls to a reference genome; and aligning the second sequence of base calls to the reference genome.
[0023] According to one embodiment, the second sequence of base calls is aligned to a second strand of the reference genome.
[0024] According to one embodiment, the first sequence of base calls is directly aligned to the second sequence of base calls.
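Directly aligning the first sequence of base calls to the second can be sketched as follows. The example reverse-complements the second read so both are in the same orientation, then runs a textbook global alignment (Needleman-Wunsch with unit scores). The choice of aligner, the scoring values, and the function names are assumptions for illustration only.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring the diagonal move.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def discordant_columns(first_read, second_read):
    """Alignment columns where the two strands disagree (including gaps)."""
    aligned_first, aligned_second = global_align(
        first_read, reverse_complement(second_read))
    return [k for k, (x, y) in enumerate(zip(aligned_first, aligned_second)) if x != y]
```

Every column in which the two gapped strings differ is a discordant position; all remaining columns are concordant.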
[0025] According to one embodiment, sequencing the first strand of the double-stranded nucleic acid molecule to obtain the first sequence of base calls includes: measuring signals for a window of a compound corresponding to the first strand of the double-stranded nucleic acid molecule, the compound comprising a plurality of units, each corresponding to a nucleotide; and determining a base call for a genomic position within the window by comparing the signals to known signal patterns corresponding to different nucleotides.
[0026] According to one embodiment, comparing the signals to known patterns corresponding to different nucleotides is performed by a machine learning model trained using the known signal patterns.
[0027] According to one embodiment, the compound is (1) the first strand of the double-stranded nucleic acid molecule, a reporter element corresponding to a nucleotide or (2) a surrogate molecule created from the first strand of the double-stranded nucleic acid molecule, the surrogate molecule including one or more reporter elements corresponding to each nucleotide.
[0028] According to one embodiment, sequencing the double-stranded nucleic acid molecule includes: creating a surrogate molecule from the double-stranded nucleic acid molecule, the surrogate molecule including one or more reporter elements corresponding to each nucleotide; passing the surrogate molecule through a nanopore to obtain electrical signals; and determining the first sequence of base calls and the second sequence of base calls of nucleotides in the double-stranded nucleic acid molecule using the electrical signals.
[0029] According to one embodiment, the method further comprises repeating the method for at least 10,000 nucleic acid molecules.
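Comparing a window of measured signals to known signal patterns can be illustrated with a simple nearest-template rule, shown below. This stands in for the trained machine learning model mentioned above and is not that model; the pattern values are invented for the example and do not reflect real reporter signal levels.

```python
# Hypothetical reference patterns: three signal samples per nucleotide.
KNOWN_PATTERNS = {
    "A": [0.20, 0.20, 0.30],
    "C": [0.45, 0.50, 0.45],
    "G": [0.70, 0.65, 0.70],
    "T": [0.90, 0.95, 0.90],
}


def call_base(window, patterns=KNOWN_PATTERNS):
    """Return the nucleotide whose pattern is closest to the signal window."""
    def squared_distance(nucleotide):
        return sum((s - p) ** 2 for s, p in zip(window, patterns[nucleotide]))
    return min(patterns, key=squared_distance)
```

A trained classifier would replace `squared_distance` with a learned decision function, but the interface (signal window in, base call out) would stay the same in this sketch.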
[0030] Techniques described herein relate to a computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause a computer system to perform any of the above methods.
[0031] Techniques described herein relate to a system comprising: the computer product from above; and one or more processors configured to execute instructions stored on the computer readable medium.
[0032] Techniques described herein relate to a system comprising means for performing any of the above methods.
[0033] Techniques described herein relate to a system comprising one or more processors configured to perform any of the above methods.
[0034] Techniques described herein relate to a system comprising modules that respectively perform the steps of any of the above methods.
[0035] Techniques described herein relate to a sequencing device for determining consensus sequences of double-stranded nucleic acid molecules, the sequencing device comprising: a set of sequencing cells, each configured to perform: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements; and sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements, the set of sequencing cells including at least 10,000 sequencing cells; a comparator circuit electrically connected with the set of sequencing cells, wherein the comparator circuit is configured to perform, for each of the double-stranded nucleic acid molecules: receiving the first sequence of base measurements and the second sequence of base measurements; for each of a plurality of positions of the double-stranded nucleic acid molecule: comparing one or more of the first base measurements to one or more of the second base measurements; and determining a base call value based on the comparison; and generating a consensus sequence using the base call values; and a transmitter configured to transmit the consensus sequence to a computer system.
[0036] According to one embodiment, comparing a first base measurement to a second base measurement comprises: determining a first base call using the one or more of the first base measurements; determining a second base call using the one or more of the second base measurements; and comparing the first base call and the second base call.
[0037] According to one embodiment, the comparator circuit is further configured to perform: determining whether a position of the plurality of positions is concordant or discordant based on the comparing, wherein the base call value is dependent on whether the position is concordant or discordant.
[0038] According to one embodiment, a number of bits used for the base call value is dependent on whether the position is concordant or discordant.
[0039] According to one embodiment, the comparator circuit is further configured to generate metadata identifying which positions are discordant, and wherein the consensus sequence includes the metadata.
[0040] According to one embodiment, the set of sequencing cells and the comparator circuit are on a same printed circuit board.
[0041] According to one embodiment, the set of sequencing cells and the comparator circuit are on a same integrated circuit.
BRIEF DESCRIPTION OF THE DRAWINGS
[0042] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
[0043] FIG.1 is a top view of a nanopore sensor chip having an array of nanopore cells.
[0044] FIG.2 illustrates a block diagram of an example system for processing data captured by an example nanopore-based sequencing chip according to certain embodiments.
[0045] FIG.3 shows a flow chart illustrating a process for determining a consensus sequence of a target molecule according to certain embodiments.
[0046] FIG.4 is a condensed schematic summarizing one embodiment of a method of generating a duplex nucleic acid construct that is sequenced and analyzed using the methods described herein.
[0047] FIG.5A shows an example of a single molecule-multi-molecular trace event with a typical sequencing by expansion waveform.
[0048] FIG.5B illustrates a molecule not clearing a pore.
[0049] FIGs.6A and 6B show the synthesis and expected sequences of a two pass HDD read construct.
[0050] FIG.7 illustrates example error types that occur during the synthesis of a two pass HDD read sequencing or during Xpandomer synthesis.
[0051] FIGs.8A-8D show synthesis and expected sequences of a four pass HDD read construct.
[0052] FIG.9 illustrates example error types that occur during the synthesis of a four pass HDD read sequencing or during Xpandomer synthesis.
[0053] FIG.10 shows an exemplary intramolecular consensus call out of a four pass HDD read with various error types.
[0054] FIGs.11A-11C show the synthesis and expected sequences of an “n” pass HDD read construct.
[0055] FIG.12 shows a variety of different HDD reads classes that are generated during sequencing.
[0056] FIG.13A is a hypothetical read structure resulting from the sequencing of a HDD construct. FIG.13B shows on the hypothetical read structure the positions of the target insert molecule. FIG.13C provides an example of a One+ read.
[0057] FIG.14 shows an example of an alignment object, resulting from the alignment between an example read 1 string and a read 2 string for reads of length equal to 100 base pairs. In this particular alignment object, three discordant positions can be seen, and are highlighted in red. Namely, there is one substitution, one insertion and one deletion within read 2 relative to read 1.
[0058] FIG.15 shows a flowchart illustration for determining a partial order consensus sequence of a double-stranded nucleic acid molecule according to certain embodiments.
[0059] FIG.16 shows a flow chart illustrating an example method of compressing a sub-stream of base call data according to certain embodiments.
[0060] FIG.17 shows a graphical representation of various interpretations for homopolymer alignment.
[0061] FIG.18 shows a flowchart illustration for determining a consensus sequence of a double-stranded nucleic acid molecule using read orientation according to certain embodiments.
[0062] FIGs.19A and 19B show an exemplary variable length encoding strategy using a variable length encoding algorithm (FIG.19A) and an example of an alignment object with the resulting consensus read, header data, and a computation of the number of bits required for the header (FIG.19B).
[0063] FIGs.20A and 20B show a second variable length vector of data, sometimes referred to as part of or as a full header string, that is used to encode the differences in the sequence of read 2 relative to the sequence of read 1 (FIG.20A) and an example of an alignment object with the resulting consensus read, header data, and a computation of the number of bits required for the header (FIG.20B).
[0064] FIG.21 shows an example of an alignment object, resulting from the alignment between an example read 1 string and a read 2 string for reads of length equal to 100 base pairs. In this particular alignment object, three discordant positions are assigned a 5th character ‘N’.
[0065] FIG.22 shows an example of an alignment object, resulting from the alignment between an example read 1 string and a read 2 string for reads of length equal to 100 base pairs.
[0066] FIG.23 shows an example of a sliding window technique that may be used to locate hairpin adapters in a nucleic acid molecule.
[0067] FIG.24 shows a flowchart illustrating a method for detecting hairpin adapters using a dual sliding window in accordance with various embodiments.
[0068] FIGs.25A and 25B show examples of different adapter architectures.
[0069] FIG.26 shows a flowchart illustrating a segmentation method for detecting different components of adapters using machine learning techniques in accordance with various embodiments.
[0070] FIG.27 shows a flowchart illustrating a classification method for detecting different components of adapters using machine learning techniques in accordance with various embodiments.
[0071] FIG.28 shows an example of frequency components represented in a complex plane, where the x-axis represents the real part (Re) and the y-axis represents the imaginary part (Im) of the complex numbers.
[0072] FIGs.29A-29G show graphs displaying the cross-correlation signals for seven different candidate adapter sequences at every position of the read construct.
[0073] FIG.30 shows a flowchart illustrating a method for determining the location and the sequence of an adapter in a sequencing read using cross-correlation frequency-based methods.
[0074] FIG.31 shows a graph illustrating how the cross-correlation signal would look in real space of the IFFT where a number of bases are deleted from the loop.
[0075] FIG.32 shows a graph illustrating a case in which the cross-correlation method does not generate an interpretable signal to determine the adapter location and sequences.
[0076] FIG.33 shows an exemplary graph illustrating the autocorrelation signal for a read sequence, which is used to identify the adapter location in the sequence.
[0077] FIG.34 shows a flowchart illustrating a method for determining the location and the sequence of an adapter in a sequencing read using autocorrelation frequency-based methods.
[0078] FIG.35 illustrates a measurement system according to embodiments of the present invention.
[0079] FIG.36 shows a block diagram of an example computer system usable with systems and methods according to embodiments of the present invention.
DETAILED DESCRIPTION
[0080] The following description recites various aspects and embodiments of the present compositions and methods. No particular embodiment is intended to define the scope of the methods. Rather, the embodiments merely provide non-limiting examples that are at least included within the scope of the disclosed methods. The description is to be read from the perspective of one of ordinary skill in the art; therefore, information well known to the skilled artisan is not necessarily included.
[0081] Techniques disclosed herein relate to accurate high throughput sequencing hardware and methods (e.g., data compression algorithms), and more specifically, to the processing and compression of data generated from duplex or higher sequencing. Duplex/higher sequencing methods can generate consensus reads by combining multiple copies of nucleic acid reads for a specific region into a single high-quality sequence. In so doing, a more accurate ‘consensus’ sequence with reduced random sequencing errors can be achieved. However, this process generates an enormous amount of data that can become problematic during processing and storage. Data compression algorithms can reduce the size of these data files while still preserving all or nearly all of the data stored in the various sequencing files.
[0082] Consensus sequences are generated by combining the sequences of a plurality of sequence reads that align to the same region of a template nucleic acid molecule to form a single high-quality consensus sequence. If the alignment of the plurality of sequence reads occurs between different nucleic acid molecules (e.g., between a sequence read and reference genome, between complementary plus and minus sequence reads, etc.), an intermolecular consensus read is generated.
On the other hand, if the alignment of the plurality of sequence reads occurs within the same nucleic acid molecule (e.g., complementary plus and minus sequence reads physically connected), an intramolecular consensus read is generated.
[0083] An example of an intermolecular sequencing strategy includes unique molecular identifier (UMI)-based intermolecular consensus sequencing. These sequencing methods (e.g., SBX, next generation, sequencing by synthesis (SBS), other nanopore-based sequencing methods, e.g., sequencing by basetag (“SBT”), single-molecule real-time (SMRT) sequencing, biological and solid state nanopore sequencing, etc.) add barcodes (e.g., UMIs) to the ends of Xpandomers (for SBX) or to the nucleic acid molecule itself (for next generation and other nanopore-based methods). From the compute perspective, a major disadvantage of UMI-based intermolecular consensus workflows is that members of the same UMI family are typically dispersed randomly throughout the physical sample, such that each member of a UMI-Original Molecule family may be read at a different time throughout a run. Such a run may conceivably last an hour, several hours, 24 hours, multiple days, or another duration of time. Thus, the clustering step (e.g., data processing step) of the UMI-based intermolecular consensus algorithmic workflows cannot be completed until all reads from the entire run have been produced and collected. Namely, the “clustering step,” in which all raw read members of a UMI-based intermolecular consensus family are clustered, or grouped together, usually must remain in an unfinished state until the run has completed and no new members of the UMI-Original Molecule family are expected.
[0084] Another challenge of the intermolecular consensus approach is that reads from paired plus and minus strands from the target nucleic acid may not both be output after sequencing.
The clusters can have reads from only one strand, either the plus or the minus. When clusters have reads from both strands, they can be referred to as duplex clusters, meaning they have representation from both the plus and minus strands. Duplex clusters can achieve much higher consensus accuracies than clusters that have representation from just one strand or the other.
[0085] To overcome the above-mentioned challenges, methods and techniques described herein relate to the use of intramolecular consensus sequencing (e.g., using hairpin-direct-duplex (HDD) reads), which has several advantages over intermolecular consensus sequencing. One such advantage is that HDD reads comprise single or multi-pass read pairs that are physically coupled together (e.g., via a hairpin structure). As a result of being physically coupled, the single pass reads that they produce are naturally grouped together in the time dimension. In this scenario, higher accuracy consensus reads may be formed from the two coupled single pass reads (e.g., both the plus and minus strand) without the need to perform a first informatic clustering step. The clustering step is already taken care of given their physical and temporal coupling. Avoiding the processing steps required for clustering saves significantly on computational resources and allows for in-line processing and production of the consensus reads in real time.
[0086] With respect to data processing and decompression, intramolecular consensus sequencing using HDD reads also presents several solutions. Although examples may refer to duplex reads, such examples are also applicable to higher levels of consensus reads, e.g., at least 3, 4, 5, 6, 7, or 8 reads within a molecule can be compared to determine the consensus.
[0087] A first solution is based on a “reference-based consensus calling” process that aligns both strands of the target nucleic acid to a reference genome.
The advantage of this approach lies in the accuracy of the consensus reads. This method can preserve information for as long as possible and all of the raw read information until the point of variant calling.
[0088] A second solution is based on a “reference-free consensus calling” process that, instead of aligning the strands to a reference genome, aligns the adapters and the two nucleic acid strands to themselves. This approach essentially forms a consensus read from the information that is present in the original full read (i.e., the full HDD read). This is particularly advantageous for compressing the data as early as possible and as close to the sequencing instrument as possible.
[0089] A third solution to data processing and compression of HDD reads involves calling consensus just on parts of the HDD read where perfect agreement exists and refraining from making a consensus call on any positions for which there is a disagreement or discordance across the two read pairs. This third approach can use either reference-free or reference-based consensus calling. Essentially, instead of making a consensus call for the entire nucleic acid molecule, a consensus call is made only for certain base pairs; for the base pairs that are discordant, some of the raw data (e.g., quality scores) for those positions is output so that the discordant positions may be resolved. This approach is lossless: whenever a discordance is called, any of a variety of techniques can be used to encode both of the bases that disagreed.
[0090] A fourth solution to data processing and compression of HDD reads involves collecting information regarding the alignment orientation of a read, e.g., which strand or internal copy (e.g., daughter) from which a base call was generated. Read orientation can be particularly useful in determining consensus and concordant calls as certain sequence modifications (e.g., errors arising from DNA damage, epigenetic modifications, measurement or data processing) occur more commonly on one strand orientation versus its complement. The collected information may include base calling quality scores, base calling weights, rate of DNA damage or other mutations, etc. By knowing this type of information, more discordant positions are likely to be resolved prior to consensus calling; thus, there are ultimately fewer discordant positions that require extra bits of storage. This approach is lossy: whenever a discordance is called, any of a variety of techniques can be used to encode both of the bases that disagreed.
[0091] For the above-mentioned techniques, the sequence reads and consensus sequences are generated using a sequencing device or sequencing instrument. The sequencing device may comprise a set of sequencing cells, such as at least 10,000 sequencing cells, where each individual cell is configured to sequence a first and second strand of a double-stranded nucleic acid molecule to generate a first and second sequence of base measurements. In a specific embodiment that employs SBX sequencing, Xpandomers are synthesized that translate the sequence of the first and second strand of the double-stranded nucleic acid molecule, respectively, into measurable polymers that can be sequenced using a sequencing device or sequencing instrument. At each base position of the nucleic acid molecule or Xpandomer, the sequencing device compares the first base measurements to the second base measurements to determine a base call value for each position. In most cases, a concordant base call is made based on the base call value; however, in some cases, a discordant base call may be made. In this event, any (or a combination) of the above-mentioned solutions may be used to generate a consensus sequence.
[0092] Implementation of any (or any combination) of the above-mentioned solutions would allow for the sequencing instrument to produce HDD consensus reads directly on the sequencing instrument itself. They would also help address the problem of limited channel capacity for information transmission channels along the path between the instrument and location of secondary analysis or storage. Provided there are sufficient compute resources available on the sequencing instrument to execute the operations required for raw HDD read to HDD consensus read formation, a significant data reduction could be achieved in real time, prior to moving data off of the instrument or through a subsequent bandwidth limited channel.
Terms
[0093] As used herein, “Nucleic acid” may refer to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form. The term may encompass nucleic acids containing known nucleotide analogs or modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, which have similar binding properties as the reference nucleic acid, and which are metabolized in a manner similar to the reference nucleotides. Examples of such analogs may include, without limitation, phosphorothioates, phosphoramidites, methyl phosphonates, chiral-methyl phosphonates, 2-O-methyl ribonucleotides, and peptide-nucleic acids (PNAs). The nucleic acid may also be represented by surrogate molecules, which are inserted into the original nucleic acid, with each surrogate molecule corresponding to a particular nucleotide.
[0094] Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated.
Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues, as described in, e.g., Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); and Rossolini et al., Mol. Cell. Probes 8:91-98 (1994). The term nucleic acid is used interchangeably with gene, cDNA, mRNA, oligonucleotide, and polynucleotide.
[0095] The term “nucleotide,” in addition to referring to the naturally occurring ribonucleotide or deoxyribonucleotide monomers, may be understood to refer to related structural variants thereof, including derivatives and analogs (e.g., X-NTPs used in SBX sequencing), that are functionally equivalent with respect to the particular context in which the nucleotide is being used (e.g., hybridization to a complementary base), unless the context clearly indicates otherwise.
[0096] The term “raw data” or “raw signal data” refers to data produced by sensors in a sequencing device. Raw data includes signal values associated with sequencing a nucleic acid molecule.
[0097] The term “signal value” may refer to a value of the sequencing signal output from a sequencing cell. According to certain embodiments, the sequencing signal may be an electrical signal that is measured and/or output from a point in a circuit of one or more sequencing cells, e.g., the signal value may be (or represent) a voltage or a current. The signal value may represent the results of a direct measurement of voltage and/or current and/or may represent an indirect measurement, e.g., the signal value may be a measured duration of time for which it takes a voltage or current to reach a specified value. A signal value may represent any measurable quantity that correlates with the features of the sequencing device.
For example, in a nanopore sequencing device, the signal value may represent a quantity that correlates with the resistivity of a nanopore and from which the resistivity and/or conductance of the nanopore (threaded and/or unthreaded) may be derived. As another example, the signal value may correspond to a light intensity, e.g., from a fluorophore attached to a nucleotide being catalyzed to a nucleic acid with a polymerase.
[0098] The term “bright period” may generally refer to the time period when a molecule, Xpandomer, or a portion thereof, is forced into a nanopore by an electric field applied through an AC signal. The term “dark period” may generally refer to the time period when a molecule, Xpandomer, or a portion thereof, is pushed out of the nanopore by the electric field applied through the AC signal. An AC cycle may include the bright period and the dark period. In different embodiments, the polarity of the voltage signal applied to a nanopore cell to put the nanopore cell into the bright period (or the dark period) may be different. The bright periods and the dark periods can correspond to different portions of an alternating signal relative to a reference voltage.
[0099] The term “raw read data” or “read data” refers to data generated from the raw data or the raw signal data. The raw read data includes read data stream(s). A read data stream includes sub-streams of data corresponding to a respective nucleic acid molecule including an identifier or header sub-stream, a nucleic acid base call sub-stream, and a quality score sub-stream.
[0100] The term “base call data” refers to data generated from the raw data that identifies a nucleotide (e.g., a nitrogen-containing base of a nucleotide) at a given location in a nucleic acid sequence. Each entry in the base call data represents a nucleotide and can include one code for the corresponding nucleotide.
The base call data can include primary nucleotides such as adenine (A), thymine (T), guanine (G), cytosine (C), and uracil (U) or a synthetic nucleotide. The base call data may also include other possible base calls such as an undetermined nucleotide.
[0101] The term “quality score data” refers to data generated from the raw data that provides a measure of confidence that a base call made for a nucleic acid is correct (e.g., among the four bases). The quality score can be reflective of the stochastic behavior that is inherent to single molecule observations. The quality of base calls may not degrade with time or with read length, but there can be different quality scores for different base calls randomly at different points in time on a given nucleic acid. Alternatively, the quality scores of bases in a read may show a dependence on read length or position of base within a read. A higher quality score for a base call can indicate greater confidence in the base call being correct. For example, a signal value that is near a peak of a probability distribution function (PDF) can result in a base call having a higher quality score than a signal value that is far from a peak of a PDF.
[0102] The terms “header data” and “read ID data” refer to information that identifies a read within a larger collection of reads. For example, the raw read data stream generated for a portion of the raw data has the same header data across the raw read data stream for that portion. The raw data can include a plurality of portions of raw data generated simultaneously or at different times for the same nucleic acid molecule (e.g., template nucleic acid molecule) or for different nucleic acid molecules (e.g., different template nucleic acid molecules).
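The relationship in paragraph [0101] between a signal value's distance from a PDF peak and the resulting quality score can be illustrated with a toy model. The sketch below is an assumption-laden example rather than the disclosed base caller: it assumes one Gaussian signal distribution per base, the means in `BASE_MEANS` and the spread `SIGMA` are invented numbers, and the "quality" returned is simply the normalized likelihood of the called base.

```python
import math

# Hypothetical per-base signal levels (arbitrary units) and a shared
# spread; illustrative assumptions, not values from the disclosure.
BASE_MEANS = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}
SIGMA = 0.3


def gaussian_pdf(x, mean, sigma):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def call_with_quality(signal):
    """Return (base, confidence) for a single signal value.

    The base with the highest likelihood is called; the confidence is
    that likelihood divided by the sum over all four bases.
    """
    likelihoods = {b: gaussian_pdf(signal, m, SIGMA) for b, m in BASE_MEANS.items()}
    total = sum(likelihoods.values())
    if total == 0.0:
        # Signal is so far from every peak that all densities underflow:
        # report an undetermined nucleotide.
        return "N", 0.0
    base = max(likelihoods, key=likelihoods.get)
    return base, likelihoods[base] / total
```

With these assumed numbers, a signal of 2.0 (at the peak for C) and a signal of 2.45 (between the peaks for C and G) are both called as C, but the second call receives a noticeably lower confidence.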
[0103] The terms “consensus sequence read,” “consensus sequence,” “consensus read,” or “consensus” refer to a nucleic acid sequence read generated from aligning a plurality of sequence reads that correspond to different parts of the same nucleic acid molecule (e.g., different strands and different internal copies), the same template nucleic acid molecule (e.g., amplicons of same molecule), or molecular family (e.g., same barcode). Consensus reads may be intermolecular (i.e., between different molecules) or they may be intramolecular (i.e., within a molecule). Intermolecular and intramolecular consensus sequence reads may be generated by aligning the plurality of sequence reads to one another. In this instance, an intramolecular consensus read is generated when the plurality of sequence reads are physically coupled together (e.g., via a hairpin segment) so that the plus and minus sequence reads are compared to each other. For intermolecular consensus reads, the complementary plus and minus strands are not physically coupled together. Consensus reads may also be generated by aligning each of the plurality of sequence reads to a reference genome or to each other.
[0104] The term “real-time” or “live” refers to processing raw data from a nucleic acid molecule at a rate equal to or greater than the rate at which the raw data is generated. Real-time processing of the raw data eliminates the need to store raw data or read data in a long term memory (e.g., disc, hard drive, cloud storage, or any external memory device).
[0105] A “concordant position” has a pair of concordant bases on the two strands for the given genomic position in a double-stranded nucleic acid molecule. Concordant bases are ones that hybridize to each other. Thus, pairs of concordant bases are A<>T, C<>G, G<>C, and T<>A.
[0106] A “discordant position” has a pair of discordant bases on the two strands for the given genomic position in a double-stranded nucleic acid molecule.
Discordant bases are ones that do not hybridize to each other. Since each base is concordant with one other base, each base would be discordant with the three other bases. Pairs of discordant bases include A<>A, A<>C, A<>G, C<>A, C<>C, C<>T, G<>A, G<>G, G<>T, T<>C, T<>G, and T<>T. Additional examples of discordant pairs include A<>-, C<>-, G<>-, T<>-, -<>T, -<>G, -<>C, and -<>A, where “-” indicates an insertion or a deletion.
[0107] A “machine learning model” (ML model) can refer to a software module configured to be run on one or more processors to provide a classification or numerical value of a property of one or more samples. A ML model can include various parameters (e.g., for coefficients, weights, thresholds, functional properties of function, such as activation functions). As examples, a ML model can include at least 10, 100, 1,000, 5,000, 10,000, 50,000, 100,000, or one million parameters. A ML model can be generated using sample data (e.g., training samples) to make predictions on test data. Various numbers of training samples can be used, e.g., at least 10, 100, 1,000, 5,000, 10,000, 50,000, 100,000, or at least 200,000 training samples. One example is an unsupervised learning model. Another example is a supervised learning model that can be used with embodiments of the present disclosure. Examples of supervised learning models may include different approaches and algorithms including, but not limited to, analytical learning, statistical models, artificial neural network (e.g.
including convolutional and/or transformer layers), boosting (meta-algorithm), Bayesian statistics, case-based reasoning, decision tree learning, inductive logic programming, Gaussian process regression, genetic programming, group method of data handling, kernel estimators, learning automata, learning classifier systems, minimum message length (decision trees, decision graphs, etc.), multilinear subspace learning, naive Bayes classifier, maximum entropy classifier, conditional random field, nearest neighbor algorithm, probably approximately correct (PAC) learning, ripple down rules, a knowledge acquisition methodology, symbolic machine learning algorithms, subsymbolic machine learning algorithms, minimum complexity machines (MCM), random forests, ordinal classification, data pre-processing, handling imbalanced datasets, statistical relational learning, or Proaftn (a multicriteria classification algorithm), or an ensemble of any of these types. The ML model may include linear regression, logistic regression, deep recurrent neural network (e.g., long short term memory, LSTM), hidden Markov model (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithm, support vector machine (SVM), or any model described herein. Supervised learning models can be trained in various ways using various cost/loss functions that define the error from the known label (e.g., least squares and absolute difference from known classification) and various optimization techniques, e.g., using backpropagation, steepest descent, conjugate gradient, and Newton and quasi-Newton techniques.
I. Overview of Example Pipelines
[0108] Example devices and measurement pipelines for performing embodiments of the present disclosure are now described.
The specific examples that follow describe constructs and methods used in SBX sequencing, but the skilled artisan will appreciate that the techniques described herein can also be used to analyze data derived from any sequencing method (e.g., sequencing by synthesis (SBS) or other nanopore-based sequencing methods, such as sequencing by basetag (“SBT”), single-molecule real-time (SMRT) sequencing, biological and solid state nanopore sequencing, etc.). For example, alternative sequencing techniques that may employ the methods described herein include, but are not limited to, NGS platforms (e.g., MiSeq, NextSeq, and NovaSeq Sequencing Platforms) by Illumina, Inc. (San Diego, CA); Aviti System by Element Biosciences, Inc. (San Diego, CA); UG 100 System by Ultima Genomics, Inc. (Fremont, CA); G4 and G4X Systems by Singular Genomics Systems, Inc. (San Diego, CA); Revio, Onso, and Sequel Systems by Pacific Biosciences of California, Inc., (Menlo Park, CA); MinION, GridION, and PromethION Systems by Oxford Nanopore Technologies, plc (Oxford, UK); and Ion Torrent NGS Systems by Thermo Fisher Scientific, Inc. (Waltham, MA).
A. Example Sequencing Device
[0109] FIG. 1 is a top view of an embodiment of a nanopore sensor chip 100 having an array 140 of nanopore cells 150. In some embodiments, the array 140 of nanopore cells includes at least 10,000 sequencing cells 150. Each nanopore cell 150 includes a control circuit integrated on a silicon substrate of nanopore sensor chip 100. In some embodiments, side walls 136 may be included in array 140 to separate groups of nanopore cells 150 so that each group may receive a different sample for characterization. Each nanopore cell may be used to sequence a nucleic acid.
For example, each nanopore cell may be used to sequence a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements and sequence a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements. In some embodiments, nanopore sensor chip 100 may include a cover plate 130. In some embodiments, nanopore sensor chip 100 may also include a plurality of pins 110 for interfacing with other circuits, such as a computer processor. An exemplary nanopore sensor chip is described, e.g., in U.S. Patent Application Publication No. 20210148886A1, as well as U.S. Patent Nos. 10,371,664; 10,809,243; 10,920,312; 10,174,437; 11,098,354; and 9,322,062. The foregoing publication and granted patents are incorporated herein by reference in their entireties.
[0110] In some embodiments, nanopore sensor chip 100 may include multiple chips in a same package, same printed circuit board (PCB), or same integrated circuit (IC), such as, for example, a Multi-Chip Module (MCM) or System-in-Package (SiP). The chips may include, for example, a memory, a processor, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), data converters, a high-speed I/O interface, etc. For example, nanopore sensor chip 100 can include consensus circuit 155. In other embodiments, the memory, the processor, the FPGA, the ASIC, data converters, the high-speed I/O interface, etc. may be external circuits operatively connected to the sequencing chip.
[0111] In some embodiments, nanopore sensor chip 100 may be coupled to (e.g., docked to) a nanochip workstation 120, which may include various components for carrying out (e.g., automatically carrying out) various embodiments of the processes disclosed herein, including, for example, analyte delivery mechanisms, such as pipettes for delivering lipid suspension or other membrane structure suspension, analyte solution, and/or other liquids, suspension or solids, robotic arms, computer processor, and/or memory. A plurality of polynucleotides may be detected on array 140 of nanopore cells 150. In some embodiments, each nanopore cell 150 can be individually addressable.
[0112] In some embodiments, the nanopore sensor chip 100 includes or is electrically coupled to a consensus circuit 155. For each of the double-stranded nucleic acid molecules, the consensus circuit 155 is configured to receive the first sequence of base measurements and the second sequence of base measurements. With the first sequence and second sequence of base measurements, consensus circuit 155 determines a first base call (e.g., using one or more first base measurements) and a second base call (e.g., using one or more second base measurements). Consensus circuit 155 then compares one or more of the first base measurements to one or more of the second base measurements and determines a base call value, based on the comparison. In some cases, the base call value is for complementary concordant positions. In other cases, the base call value is for non-complementary discordant positions. The base call value may be stored in a number of bits, depending on whether the position is found to be concordant or discordant by consensus circuit 155. Further, the consensus circuit may be configured to generate metadata (e.g., header string, consensus quality score, vector, etc.) that identifies which positions are discordant.
This process repeats for each of the plurality of positions of the double-stranded nucleic acid molecule. Once all base call values are determined for each position, a consensus sequence can be generated using the base call values. The consensus sequence may also include the metadata. The consensus sequence (and metadata) is then transmitted to a computer system by a transmitter.
[0113] FIG. 2 illustrates a block diagram of an example system for processing data captured by a nanopore-based sequencing sensor chip 210 (e.g., same as nanopore sensor chip 100 described with respect to FIG. 1), according to embodiments of the present disclosure. System 200 comprises sequencing device 205 for generating sequencing data that may be transmitted via a bus interface unit 280. Sequencing device 205 includes sensor chip 210, consensus circuit 220, transmitter 221, and local memory 225. Sensor chip 210 may include thousands or millions or more of cells. As described above, the data may be captured by the cells of sensor chip 210 during various phases of cell formation and sequencing, including, for example, before the formation of the lipid layer (e.g., to check open/short of the electrical circuit), after the formation of a thick lipid layer, during the thinning of the lipid layer, after the formation of the bilayer, after the formation of the nanopore (e.g., to determine the number of nanopores for each cell or to measure open channel data for normalization), and during the sequencing of a sample (e.g., for normalization).
[0114] A sensor chip 210 may include thousands or millions of cells, such as 100,000 or more cells, 1 million or more cells, 2 million or more cells, 4 million or more cells, or 8 million or more cells.
In an example system, sensor chip 210 may include 1 million cells, where each cell of the 1 million cells may be a nanopore-based sensor cell as described above with respect to FIG. 1, and may capture, for example, ten data sample points in one cycle of an AC signal at 100 Hz. Thus, at a given time, each cell of the 1 million cells may capture one data point represented by one byte (e.g., 8 bits), and one raw data frame including 1 million bytes (1 MB) of data from the 1 million cells may be generated. In some implementations, the data point may be a raw data point from an analog-to-digital converter (ADC) output (i.e., ADC value). In some implementations, rather than outputting the actual ADC values, the data point may be the difference between two consecutive raw data points from the ADC output. In some implementations, a local event detector may be used to determine whether an event has occurred at a cell and the output data point may indicate whether an event has occurred on a cell. For example, the local event detector may detect an event if a difference between a new ADC value and previous ADC value (or other reference value) is greater than a selected threshold. A data frame may indicate no event or state change on some cells and events or state changes on some other cells. Thus, a data frame comprises all of the data points across the cells at a given time. Further details regarding the data points can be found in, for example, U.S. Patent Application Publication No. 2017/0089858, entitled “Encoding State Change of Nanopore to Reduce Data Size,” now U.S. Patent No. 10,935,512, which is herein incorporated by reference in its entirety.
[0115] The raw data frame may be represented by, for example, an image file that includes 1 million pixels (one pixel per cell), where the data point from each cell may be represented by the gray scale or color and/or intensity of a pixel of the image file.
In each AC cycle, ten raw data frames may be generated, one at each sample point. For example, four sample points may be taken in the bright period and six sample points may be taken in the dark period, or vice versa. Thus, in one second, 1000 (100 cycles × 10 raw data frames per cycle) raw data frames may be generated, which may include 1 gigabyte (GB) (1 MB per frame × 1000 frames) of data from 1 million cells. In other words, the output data rate of sensor chip 210 may be 1 GB per second (GBPS) for a sensor chip with 1 million cells.
[0116] As shown in FIG. 2, data captured by sensor chip 210 may be sent to a consensus circuit 220 (e.g., including FPGA(s), ASIC(s), and/or GPU(s)) for preprocessing. Consensus circuit 220 may store the received data to a local memory 225 at a data rate of, for example, 12 GBPS. Alternatively, data captured by sensor chip 210 may be sent directly to local memory 225. Consensus circuit 220 may directly send the received data through (or process the received data and then send the preprocessed data through), for example, a Peripheral Component Interconnect Express (PCIe) interface, to a PCIe bus 280, which may have a maximum data transfer rate of, for example, 8 GBPS.
[0117] Each raw data frame only includes one data sample point from a cell, while each base is determined based on a plurality of sample data points as described above. Furthermore, a data processor may not have sufficient resources to process the raw data frames in real time. Therefore, the raw data frames may be stored first and then be processed together when raw data frames sufficient for determining a base are available. For example, data from consensus circuit 220 may be stored in one or more standard disk drives 260 or one or more fast capture drives 250. Each standard disk drive 260 may have a maximum write speed of 0.2 GBPS, while each fast capture drive 250 may have a maximum write speed of 1 GBPS.
Additionally or alternatively, data from consensus circuit 220 may be sent to network storage devices through a network interface 270, which may have a maximum data rate of 0.1 GBPS. Thus, to save data at, for example, 1 GBPS, multiple drives or network interfaces may be needed, which may significantly increase the cost of the system. Furthermore, the usable bandwidth of PCIe bus 280 may be less than the full bandwidth of 8 GBPS, such as, for example, 6 GBPS (75% of the full bandwidth) due to other data transportations on the bus. Thus, in some cases, the data from consensus circuit 220 may not be saved to the storage drive fast enough. A large buffer may be used for temporarily storing the data, or some data may be dropped.
[0118] Consensus circuit 220 may optionally include a base caller circuit, which can be implemented on a graphics processing unit (GPU). As an Xpandomer is passed through a nanopore, raw base calls may be made in real-time and written out into raw sequencing data files (e.g., FASTQ files). Each nucleic acid base (modified and unmodified) generates its own unique electrical signal (e.g., voltage or electrical current pattern) that is captured by the nanopore cell as the base transits the nanopore. The raw sequencing data files may be input into the base caller circuit, where a base calling algorithm, which may be referred to as a base caller, decodes the sequences of bases in real-time, after the sequencing run is complete, or any combination thereof. In some cases, the base caller is a machine learning model, for example, a neural network (e.g., recurrent neural network (RNN), a convolutional neural network (CNN), a bidirectional hybrid RNN + CNN, and the like). In other instances, the base caller may be a non-neural network machine learning model or a statistical model.
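Returning to the data rates quoted for the 1-million-cell example, the arithmetic can be checked with a short Python calculation. The figures are the example values given in the text; the variable and function names are introduced here for illustration only, and rates are kept as integers in MB per second to avoid floating-point rounding.

```python
# Example values from the 1-million-cell system described in the text.
CELLS = 1_000_000
BYTES_PER_DATA_POINT = 1        # one byte per cell per sample point
SAMPLES_PER_AC_CYCLE = 10
AC_FREQUENCY_HZ = 100

frame_bytes = CELLS * BYTES_PER_DATA_POINT                 # one raw data frame
frames_per_second = SAMPLES_PER_AC_CYCLE * AC_FREQUENCY_HZ
output_rate_mb_s = frame_bytes * frames_per_second // 10**6

# Example maximum rates of the storage and network sinks, in MB/s.
SINK_RATES_MB_S = {
    "standard disk drive 260": 200,     # 0.2 GBPS
    "fast capture drive 250": 1000,     # 1 GBPS
    "network interface 270": 100,       # 0.1 GBPS
}


def sinks_needed(rate_mb_s, sink_rate_mb_s):
    """Smallest number of identical sinks that can absorb rate_mb_s."""
    return -(-rate_mb_s // sink_rate_mb_s)   # integer ceiling division


required = {name: sinks_needed(output_rate_mb_s, r) for name, r in SINK_RATES_MB_S.items()}

# Usable PCIe bandwidth when 75% of the 8 GBPS bus is available.
usable_pcie_mb_s = 8000 * 75 // 100
```

Under these example numbers the chip produces 1000 frames of 1 MB each per second (1 GBPS), which would take five standard disk drives, one fast capture drive, or ten network interfaces to absorb without buffering, while the usable PCIe bandwidth is 6 GBPS.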
[0119] A GPU that implements a base caller circuit may include hundreds or even thousands of parallel processing cores, making it suitable for the processing of sequencing data from the thousands or millions of cells of sensor chip 210.
[0120] After the data sampling by sensor chip 210 is complete, a host processor 240 may be used to process the stored data. Host processor 240 may include a communication interface having a maximum bandwidth of, for example, about 22 GBPS, which may not be fully utilized due to the bandwidth limitation of PCIe bus 280. Host processor 240 may access a main memory 245 (e.g., a DRAM) at a maximum data rate of, for example, 12 GBPS. In various implementations, host processor 240 may access main memory 245 directly or through, for example, a north bridge.
[0121] To process the stored data, the base caller circuit may need to read the data back from the storage device, and the data processing speed may be limited by the speed of the data read-back. Thus, if sensor chip 210 is used to sample data, for example, for 2 hours or more for an assay, 2 hours or more may be needed to read the stored data back. Thus, the data processing time may be very long. Therefore, to reduce the cost of the data processing system and improve the data processing efficiency of the system, it may be desirable to process the data captured by sensor chip 210 in real time and reduce the amount of data transfer between different functional blocks of the system.
[0122] Accordingly, in one exemplary embodiment, a sequencing device may be used for determining consensus sequences of double-stranded nucleic acid molecules. The sequencing device comprises a set of sequencing cells that can include at least 10,000 individual sequencing cells.
Each sequencing cell may be configured to sequence a first strand and a second strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements and a second sequence of second base measurements, respectively. The first sequence of first base measurements and the second sequence of second base measurements are raw sequencing data that is transmitted to a consensus circuit on the sequencing device at a rate. The consensus circuit may be electrically connected with the set of sequencing cells.
[0123] The consensus circuit is configured to receive the first and second sequences of base measurements for each position of each double-stranded nucleic acid molecule (e.g., raw data). The data for the first and second sequences can include base calls, quality scores, and other sub-streams (e.g., header information) from the raw data. In some embodiments, the rate of transmission can be at least 12 gigabytes per second (GB/s). As examples, the consensus circuit can include multiple cores or chips. For instance, embodiments could have multiple GPUs (e.g., 4, 6, 8, etc.) connected by extremely high bandwidth links such as a wire-based serial multi-lane near-range communications link (e.g., NVlinks). In some instances, a dynamic random-access memory (DRAM) of one GPU can also have access to the DRAM of the next GPU. This is important because the rate at which raw data is generated is much higher than the rates at which data is transmitted to and from the storage device. Therefore, there is a need for compressing the data in real-time as it is generated in consensus circuit 155 (see, e.g., U.S. Patent Nos. 9,494,554, 9,290,805, 10,663,423, 10,809,244, and 9,041,420 for a description of exemplary electrical connections, each of which is herein incorporated by reference in its entirety).
In some instances, each of the double stranded nucleic acid molecules comprises a plurality of positions where one or more base measurements can be taken. The one or more first base measurements may be compared to the one or more second base measurements, and a base call value based on each such comparison is determined. The base calls may also be used to generate a consensus sequence that may be transmitted to a computer system for intermolecular consensus calling and variant calling.
B. Example Pipelines
[0124] FIG. 3 shows a flowchart illustrating a method 300 for determining a consensus sequence of a target molecule. The method 300 depicted in FIG. 3 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method 300 presented in FIG. 3 and described below is intended to be illustrative and non-limiting. Although FIG. 3 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in a different order, or some steps may also be performed in parallel, unless clearly contradicted by context.
[0125] At 310, nucleic acid material (e.g., DNA) from a biological sample is obtained. The nucleic acid material may be genomic DNA, mitochondrial DNA, cell-free DNA (cfDNA), circulating tumor DNA (ctDNA), or a combination thereof. In some instances, genomic DNA may be obtained and isolated using any method known in the art. Typically, the isolated DNA is fragmented into a plurality of shorter double stranded DNA target fragments through physical (e.g., sonication) or enzymatic (e.g., restriction enzyme digestion) methods.
The plurality of double stranded DNA target fragments may undergo various library preparations based on the sequencing method being used.
[0126] Briefly, by way of example, a DNA library preparation protocol may include the following steps: (i) DNA fragment end repair to remove overhangs, (ii) A tailing, (iii) adapter ligation, (iv) size selection, and (v) polymerase chain reaction (PCR) enrichment. Typically, the adapters comprise unique barcodes so DNA fragments from the same sample can be identified. The adapters may also be different depending on the specific sequencing technique employed.
[0127] For SBX technology, Xpandomers are synthesized following library preparation. Briefly, a daughter strand is produced by a template-directed synthesis, wherein the daughter strand includes a plurality of XNTP subunits (i.e., XATP, XCTP, XGTP and XTTP) coupled in a sequence corresponding to a contiguous nucleotide sequence of all or a portion of the target nucleic acid material. The individual XNTP subunits of the daughter strand comprise a reporter construct, a nucleobase residue, and a selectively cleavable bond. Upon cleavage of the selectively cleavable bond, the Xpandomer is released and sequenced in a nanopore sequencing system, and the reporter constructs in the Xpandomer are used to parse genetic information in a sequence that corresponds to the contiguous nucleotide sequence of all or a portion of the target nucleic acid (see, e.g., U.S. Patent Application Publication No. 2022/0411458 for a description of Sequencing by Expansion, which is herein incorporated by reference in its entirety).
[0128] At 320, Xpandomers are ready to be sequenced on a sequencing device (such as the sequencing device described in section I-A) and the measured signals corresponding to the different nucleotides are determined.
During SBX sequencing, the Xpandomer is passed through the nanopore, which is embedded in a membrane (e.g., a bilayer lipid membrane). As the Xpandomer passes through the nanopore, an electrical signal (e.g., voltage or current) is measured corresponding to the different bases of the target nucleic acid. The membrane has thousands of nanopore proteins embedded into it allowing for high-throughput sequencing performance similar to next generation sequencing.
[0129] At 330, base calls for the first and second strand of the double stranded DNA target molecules (or at least a portion of each) are determined using a pre-trained base caller machine learning model. Base calling algorithms can learn how to determine nucleotide sequence via machine learning. They are trained using data (e.g., fluorescent signal, voltage, or current) of a known sequence, which guides the algorithm to make correct predictions. Once trained they are validated with a subset of reads not included in the training dataset. Through this process, base calling algorithms can distinguish natural nucleotides from modified nucleotides, as each has its own unique data signature.
[0130] In some embodiments, a sliding window of measured signals can be used to determine a base at respective positions. For example, signals 1-20 may correspond to seven bases where the middle base (base 4) is being called. After base 4 is called, the signal window may shift some number of signals, for example three signals, to determine the base call for the 5th base (base 5). Accordingly, the new window is now signals 4-23. The sliding window method allows for only those signals proximal to the base being called to potentially influence the called base, rather than the signal for the entire molecule.
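The window bookkeeping in the sliding-window example can be written out as a short Python generator. This is a sketch that only reproduces the index arithmetic of the example (a 20-signal window, a shift of three signals per base, and base 4 as the first base called); the function name and the fixed step are assumptions, and an actual base caller may shift by a variable number of signals.

```python
def sliding_windows(num_signals, window=20, step=3, first_base=4):
    """Yield (base_index, first_signal, last_signal) for each full window.

    Signal and base indices are 1-based and inclusive, matching the
    numbering used in the example (signals 1-20 for base 4, then
    signals 4-23 for base 5, and so on).
    """
    base = first_base
    start = 1
    while start + window - 1 <= num_signals:
        yield base, start, start + window - 1
        start += step   # shift the window by `step` signals
        base += 1       # the next window is centred on the next base


# Example: a read with 26 measured signals yields three full windows.
windows = list(sliding_windows(26))
```

Each tuple names the base being called and the range of signals allowed to influence that call; signals outside the range are ignored for that base.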
As an alternative example, to determine a base call at a respective position, the measured signal of the base in question may be compared to a threshold, which can be determined based on a number of different parameters such as the separation between the signal values of the different bases. The threshold may be based on (i) previously determined base calls at the position in question, (ii) measurements of base calls upstream and/or downstream of the position in question, or (iii) any combination thereof. By taking into account information from (i), (ii), or (iii) a base calling procedure (e.g., a machine learning base calling model) can be trained on these patterns to determine the base call at the position in question. [0131] At 340, the first and second sequenced strands are aligned. Alignment allows for any difference between the two strands to be identified. In some cases, the first and second strands are aligned directly to each other in a process known as reference-free alignment (see section V-B). Reference-free alignment relies on canonical Watson-Crick base pairing between the two strands to identify positions of concordance (matching base pairs) and positions of discordance (non-matching base pairs). For concordant positions, it can also be determined whether the base call matches the reference genome. In other cases, the two strands may be aligned to a reference genome in a process known as reference-based alignment (see section V-A). In this scenario, concordant and discordant positions are identified based on how well each sequenced strand aligns to the reference genome. Concordant positions are those positions where the sequenced strand matches the reference genome’s sequence, while discordant positions are those positions where the sequenced strand does not match the reference genome’s sequence.
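As a minimal sketch of the reference-free case, the following example projects the second strand onto first-strand coordinates by reverse complementing it and then labels each position as concordant or discordant. The function names are hypothetical, and the sketch assumes gap-free reads of equal length; an actual alignment would also have to handle insertions and deletions.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the Watson-Crick reverse complement of a base string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify_positions(first_strand, second_strand):
    """Return (concordant, discordant) 0-based positions in first-strand coordinates.

    Assumption for illustration: both reads cover the same region with no indels.
    """
    projected = reverse_complement(second_strand)
    if len(projected) != len(first_strand):
        raise ValueError("sketch assumes gap-free reads of equal length")
    concordant, discordant = [], []
    for pos, (a, b) in enumerate(zip(first_strand, projected)):
        (concordant if a == b else discordant).append(pos)
    return concordant, discordant


# The second read carries one miscalled base, seen at position 2 of the first strand.
print(classify_positions("ACGTAC", "GTAAGT"))  # ([0, 1, 3, 4, 5], [2])
```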
For both scenarios, quality scores and weights may be used to resolve discordant positions and determine whether the position should actually be concordant. That is, one of the initial base calls can be determined as more reliable than the other. [0132] At 350, a consensus read is determined. Consensus reads may be determined using various intramolecular consensus workflows, depending on the type of read construct generated from the library prep. Reads may be physically coupled, for example via a hairpin segment. Adapter segments can be identified and removed from the read construct. For an intramolecular consensus workflow, the reads can be aligned to each other (directly or via a reference sequence) to generate an alignment object that is used to form a consensus sequence. [0133] At 360, the aligned sequencing data is compressed to aid in the processing, transfer, storage, and archiving of the data. Several techniques may be used to compress the data. One such technique may include partial consensus compression. In this method, the positions identified as discordant are represented with four or five bits (as there are 12-20 possible values) and these discordant positions are identified in metadata (e.g., headers) for the decompression. Positions that are concordant are represented with two bits since there are four possible bases. Another technique includes reference-based compression for the concordant positions. If the base at a concordant position is the same as the reference, the data may be dropped because the bases are known to be the same as the reference genome. Thus, the metadata can specify (a) which positions are discordant and (b) which positions are concordant but different from the reference. [0134] At 370, the consensus sequences are transmitted to an external system, which can perform intermolecular consensus calling and variant calling.
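The bit widths given for partial consensus compression in paragraph [0133] can be illustrated with a short sketch. The function names and the bit layout are hypothetical. The sketch also assumes that the 12 values correspond to the ordered pairs of two different bases (4 x 3), and that 20 would arise if a fifth symbol such as a no-call were allowed (5 x 4); the text gives only the range, so this reading is an assumption.

```python
from itertools import permutations
from math import ceil, log2

BASES = "ACGT"
# Assumed code book: the 12 ordered pairs of two different bases.
PAIR_CODE = {pair: i for i, pair in enumerate(permutations(BASES, 2))}


def bits_needed(n_values):
    """Smallest number of bits that can distinguish n_values symbols."""
    return ceil(log2(n_values))


def encode(first, second_projected):
    """Encode two aligned reads: 2 bits per concordant position, 4 per discordant.

    Returns the discordant positions (for the metadata) and the packed bit string.
    """
    discordant, bits = [], []
    for pos, (a, b) in enumerate(zip(first, second_projected)):
        if a == b:
            bits.append(format(BASES.index(a), "02b"))
        else:
            discordant.append(pos)
            bits.append(format(PAIR_CODE[(a, b)], "04b"))
    return discordant, "".join(bits)


print(bits_needed(4), bits_needed(12), bits_needed(20))  # 2 4 5
print(encode("ACGTAC", "ACTTAC"))  # ([2], '00011000110001')
```

A decoder would read two bits per position except at the positions listed in the metadata, where it would read four.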
An intermolecular consensus workflow can use a UMI-based approach. For intermolecular consensus calling, because reads are not physically coupled together, many reads that align to the same region of the genome may be combined to generate a single, consensus read for that genomic region. II. Example Duplex Sequencing [0135] Embodiments described herein may be applied to any suitable sequencing platform, including next generation and nanopore sequencing, but are particularly useful for SBX Sequencing. Sequencing by Expansion is described in International Publication No. WO 2020/236526, entitled “Translocation control elements, reporter codes, and further means for translocation control for use in nanopore sequencing,” filed May 14, 2020, and U.S. Patent No. 7,939,259, “High throughput nucleic acid sequencing by expansion,” filed June 19, 2008, which are both herein incorporated by reference in their entireties. A. Library Generation [0136] In one aspect, DNA from a biological sample is obtained or provided. The DNA obtained or provided from the biological sample may be genomic DNA, mitochondrial DNA, cell-free DNA (cfDNA), circulating tumor DNA (ctDNA), or a combination thereof. [0137] DNA samples may be obtained from a patient or subject, from an environmental sample, or from an organism of interest. In embodiments, the DNA sample is extracted, purified, or derived from a cell or collection of cells, a body fluid, a tissue sample, an organ, and/or an organelle. In some embodiments, the sample DNA is whole genomic DNA. [0138] In some instances, DNA may be obtained from the same biological sample or source. Many different methods and technologies are available for the isolation of DNA. In general, such methods involve disruption and lysis of the starting material followed by the removal of proteins and other contaminants and finally recovery of the DNA.
Cell lysis procedures and reagents are known in the art and may generally be performed by chemical (e.g., detergent, hypotonic solutions, enzymatic procedures, and the like), physical (e.g., French press, sonication, and the like), or electrolytic lysis methods. Removal of proteins can be achieved, for example, by digestion with proteinase K, followed by salting-out, organic extraction, gradient separation, or binding of the DNA to a solid-phase support (either anion-exchange or silica technology). DNA may be recovered by precipitation using ethanol or isopropanol. There are also commercial kits available for the isolation of nuclear DNA. The choice of a method depends on many factors including, for example, the amount of sample, the required quantity and molecular weight of the DNA, the purity required for downstream applications, and the time and expense. [0139] In some embodiments, the DNA sample is circulating cell-free DNA (cfDNA), which is DNA found in the blood and is not present within a cell. The cfDNA can be isolated from blood or plasma using methods known in the art. Commercial kits are available for isolation of cfDNA including, for example, the Circulating DNA Kit by Qiagen N.V. The DNA sample may result from an enrichment step, including, but not limited to, antibody immunoprecipitation, chromatin immunoprecipitation, restriction enzyme digestion-based enrichment, hybridization-based enrichment, or chemical labeling-based enrichment. [0140] In some instances, the isolated DNA is fragmented into a plurality of shorter double stranded DNA target fragments. In general, fragmentation of DNA may be performed physically or enzymatically. For example, physical fragmentation may be performed by acoustic shearing, sonication, microwave irradiation, or hydrodynamic shear. Acoustic shearing and sonication are the main physical methods used to shear DNA.
For example, a number of sonication instruments by Covaris, LLC (Woburn, MA) are commercially available and are acoustic devices for breaking DNA into 100 bp - 5 kb fragments. Covaris also manufactures tubes (e.g., gTubes) which will process samples in the 6-20 kb range for Mate-Pair libraries. Another example is the Bioruptor® by Diagenode, LLC (Denville, NJ), a sonication device utilized for shearing chromatin, DNA and disrupting tissues. Small volumes of DNA can be sheared to 150 bp - 1 kb in length. The Digilab Hydroshear® by Thermo Fisher Scientific is another example and utilizes hydrodynamic forces to shear DNA. Thermo Fisher Scientific also manufactures nebulizers, which can also be used to atomize liquid using compressed air, shearing DNA into 100 bp - 3 kb fragments in seconds. As nebulization may result in loss of sample, in some instances, it may not be a desirable fragmentation method for samples of limited quantity. Sonication and acoustic shearing may be better fragmentation methods for smaller sample volumes because the entire amount of DNA from a sample may be retained more efficiently. Other physical fragmentation devices and methods that are known or developed can also be used. [0141] Various enzymatic methods may also be used to fragment DNA. For example, DNA may be treated with DNase I, or a combination of maltose binding protein (MBP)-T7 Endo I and a non-specific nuclease such as Vibrio vulnificus nuclease (Vvn). The combination of non-specific nuclease and T7 Endo work synergistically to produce non-specific nicks and counter nicks, generating fragments that disassociate 8 nucleotides or less from the nick site. In another example, DNA may be treated with NEBNext® dsDNA Fragmentase from New England Biolabs, Inc. (Ipswich, MA). NEBNext® dsDNA Fragmentase generates dsDNA breaks in a time-dependent manner to yield 50-1,000 bp DNA fragments depending on reaction time.
NEBNext® dsDNA Fragmentase contains two enzymes, one randomly generates nicks on dsDNA and the other recognizes the nicked site and cuts the opposite DNA strand across from the nick, producing dsDNA breaks. The resulting DNA fragments contain short overhangs, 5'-phosphates, and 3'-hydroxyl groups. [0142] In some instances, the DNA sample is fragmented into specific size ranges of target fragments. For example, the DNA sample may be fragmented into fragments in the range of about 25-100 bp, about 25-150 bp, about 50-200 bp, about 25-200 bp, about 50-250 bp, about 25-250 bp, about 50-300 bp, about 25-300 bp, about 50-500 bp, about 25-500 bp, about 150-250 bp, about 100-500 bp, about 200-800 bp, about 500-1300 bp, about 750-2500 bp, about 1000-2800 bp, about 500-3000 bp, about 800-5000 bp, or any other size range within these ranges. For example, the DNA sample may be fragmented into fragments of about 50-250 bp. In some instances, the fragments may be larger or smaller by about 25 bp. [0143] In certain embodiments, the fragments are treated to produce blunt ends that are compatible with ligation to a first adapter having a compatible blunt end. Any convenient method for producing blunt ends may be employed, including treatment with one or more (e.g., E. coli Exonuclease III) and/or performing a fill- No limitation in this regard is intended. [0144] In a specific embodiment, the target sequence used in SBX sequencing may be a sample of double stranded nucleic acid fragments (generated by the above-described process) overhangs). These nucleic acid fragments can be of any size or size range and can include DNA, RNA, DNA-RNA hybrids (e.g., molecules produced by first-strand synthesis during cDNA preparation have one mRNA strand and one complementary DNA strand), genomic DNA, cDNA, mRNA, tRNA, etc.
The terms “paired-end” and “duplex” (or “duplexed”) may be used interchangeably as they relate to template constructs for Xpandomer synthesis. [0145] Sequencing of the single, contiguous Xpandomer that incorporates the features of copies of the paired-end template provides duplexed reads of the original nucleic acid target fragments. The paired-end Xpandomer template constructs can be single nucleic acid chains that each have the following structure: adapter region 1, sense (i.e., forward) nucleic acid strand of the target fragment, adapter region 2, anti-sense (i.e., reverse) nucleic acid strand of the target fragment, adapter region 3. In some embodiments, adapter region 2 forms a classic “hairpin” structure in which the stem portion of the hairpin adapter is double stranded and is ligated to one end of the double stranded nucleic acid target fragment. The loop portion of the hairpin adapter is single-stranded and operably joins (i.e., covalently links or couples) the sense and antisense strands of the double stranded nucleic acid target fragment. In some embodiments, adapter region 1 and adapter region 3 are derived from a classic “Y adapter” structure, in which the stem portion of the Y adapter is double stranded and is ligated to the opposite end of the double stranded nucleic acid target fragment. The arms of the Y adapter may be single stranded and provide a free 3’ end and a free 5’ end to the paired-end template construct. [0146] SBX has been applied to duplex sequencing for genomic and epigenomic sequence analysis as described in, e.g., International Application No. PCT/US2024/061051, filed December 19, 2024, which is herein incorporated by reference in its entirety. [0147] FIG.4 depicts one exemplary method of generating a duplex template construct to be analyzed using the methods described herein.
In this embodiment, a library fragment 400 (i.e., a double stranded DNA target fragment) includes parental (sense) strand 400a and parental (antisense) strand 400b. The library fragment is end repaired and A-tailed to generate single 3’ A overhangs on each strand, as discussed herein. In step 1, target fragment 400 is brought into contact with hairpin adapter 410, which includes a single T overhang to facilitate alignment with the target fragment, and a DNA ligase enzyme under DNA ligation conditions. The desired product of the ligation reaction is an asymmetric duplex template construct 425 that includes a single hairpin adapter ligated to one end of the double stranded target fragment. In certain embodiments, the ratio of the hairpin adapter to the library fragment may be optimized to preferentially generate asymmetric duplex template construct 425. In step 2, Y adapter 430 is immobilized on solid support 433 via linker 435 that is covalently bound at one end to the single stranded 5’ arm of the Y adapter and at the other end to the solid support. In this embodiment, linker 435 includes a poly(U) sequence. The ligation products of step 1, including the asymmetric duplex template construct 425 are then brought into contact with the immobilized Y adapter and a DNA ligase enzyme under DNA ligation conditions. In some embodiments, the products of the ligation of step 1 are first denatured, followed by size selection purification, to remove undesired side products. Other side products of the ligation reaction of step 1, including any symmetric duplex templates (i.e., constructs with hairpin adapters ligated to both double stranded ends of the target fragment) are not capable of ligating to the Y adapter, and thus will not be associated with solid support 433.
In contrast, asymmetric duplexed template construct 425 can be ligated to Y adapter 430 to form Y adapter-ligated duplexed template construct 440, bound to solid support 433 via linker 435. In step 3, Y adapter-ligated duplex template construct 440 is treated with, e.g., a USER enzyme, to cleave the poly(U) sequence in linker 435 and release the Y adapter-ligated duplex template construct from the solid support. In other embodiments, linker 435 may include other suitable selectively cleavable moieties known in the art, such as a photocleavable moiety. B. Xpandomer Synthesis [0148] In some embodiments, duplex template construct 450 may be directly used as a template for Xpandomer synthesis without prior PCR amplification. Advantageously, this reduces the likelihood of sequencing errors due to nucleotide misincorporations during amplification, particularly in homopolymer sequences in the target fragment. For Xpandomer synthesis, duplex template construct 450 is contacted with an extension oligonucleotide with a sequence complementary to a sequence in the single stranded 3’ arm of the Y adapter 430 under nucleic acid hybridization conditions. The free 3’ end of the extension oligonucleotide provides the initiation site for Xpandomer synthesis. In certain embodiments a blocker oligonucleotide with a sequence complementary to a sequence in the 5’ single stranded arm of Y adapter 430 may also be contacted with the duplex template construct 450 under nucleic acid hybridization conditions. The blocker oligonucleotide can function to terminate Xpandomer synthesis. [0149] The Xpandomer copy of duplex template construct 450 will include copies of both the parental (sense) strand 400a and the parental (antisense) strand 400b covalently joined by a copy of the intervening hairpin adapter 410. [0150] The Xpandomer synthesis reaction is responsible for accurately transcribing the sequence of the DNA template of interest into the sequence of the Xpandomer. 
Xpandomer synthesis requires a complex reaction mixture which includes a DNA polymerase and many additives that enable incorporation of the bulky XNTP substrates by the polymerase into the very large Xpandomer macromolecule. In certain embodiments, a non-limiting Xpandomer synthesis reaction mixture may include the following reagents: a buffer/salt system, a polymerase cofactor, a polymerase enhancing moiety (PEM), a DNA polymerase, XNTP substrates, a phosphate shield molecule, a solvent, a crowding agent, and optionally, additional additives. In some embodiments, the buffer/salt system may include TrisCl and NaCl; the polymerase cofactors may include MnCl2 formulated in MES; the PEMs may include molecules disclosed in International Application No. PCT/US2018/067763 and PCT/US2020/038682, which are herein incorporated by reference in their entireties; the DNA polymerase may include a variant of DPO4 polymerase as disclosed in U.S. Patent Nos. 11,299,725, 11,708,566, 11,530,392 and International Application No. PCT/EP2024/079005, each of which is herein incorporated by reference in its entirety; the phosphate shield molecule may include hexametaphosphate (HMP); the solvent may include NMP and DMSO; the crowding agent may include PEG8k; and the additional additives may include imidazole and betaine. [0151] In certain embodiments, the Xpandomer synthesis reaction comprises a variant of wildtype DPO4 polymerase, designated C7326, with the following amino acid substitutions: F37T_D39L_K56Y_A57S_I59M_E63R_M76W_K78E_E79P_Q82W_Q83G_S86E_K152A_I1 289W__E291S_D292R_L293W_D294N_I295S_V296Q_S297Y_G299W_R300S_T301W_K32 -352. The sequences of the wild-type and C7326 variants are provided below: Table 1
Form (SEQ ID NO.): Amino Acid Sequence
wt DPO4 DNA polymerase (SEQ ID NO: 1): MIVLFVDFDYFYAQVEEVLNPSLKGKPVVVCVFSGRFEDSGAVATANYE ARKFGVKAGIPIVEAKKILPNAVYLPMRKEVYQQVSSRIMNLLREYSEKIEI ASIDEAYLDISDKVRDYREAYNLGLEIKNKILEKEKITVTVGISKNKVFAKIA ADMAKPNGIKVIDDEEVKRLIRELDIADVPGIGNITAEKLKKLGINKLVDTL SIEFDKLKGMIGEAKAKYLISLARDEYNEPIRTRVRKSIGRIVTMKRNSRNL EEIKPYLFRAIEESYYKLDKRIPKAIHVVAVTEDLDIVSRGRTFPHGISKETAY SESVKLLQKILEEDERKIRRIGVRFSKFIEAIGLDKFFDT
C7326 (SEQ ID NO: 2): MIVLFVDFDYFYAQVEEVLNPSLKGKPVVVCVFSGRTELSGAVATANYEA RKFGVYSGMPIVRAKKILPNAVYLPWREPVYWGVSERIMNLLREYSEKIEI ASIDEAYLDISDKVRDYREAYNLGLEIKNKILEKEKITVTVGISKNKVFAAVA GSKAKPNGIKVIDDEEVKRLIRELNIADVQGIPYFTAQKLKKLGINKLVDTL SIEFDKLKGMIGEAKAKYLISLARDEYNEPIRTRVRKSIGRTVTMKRNSRNL EEIKPYLFRAIEECYYKLDKRIPKAIHVVAWRSRWNSQYRWSWFPHGISK ETAYSESVKLLQQILKKDKRKIRRIGVRFSKF
[0152] DPO4 polymerase variants suitable for use in the methods described herein as well as the wild type DPO4 polymerase are disclosed in U.S. Patent Nos. 11,299,725, 11,708,566, 11,530,392 and International Application No. PCT/EP2024/079005, each of which is herein incorporated by reference in their entireties. In some embodiments, a variant of DPO4 polymerase suitable for the practice of the present invention may be a variant that is at least 85% identical to SEQ ID NO: 3. C. Example Sequencing Operation Using AC Signal [0153] In some embodiments, once an Xpandomer is introduced to the nanopore cell, during a “bright period,” Xpandomer molecules are captured and begin to translocate through the nanopore due to a combination of both baseline and TCE applied voltage pulses. Baseline voltages are sufficient to read the tag code at each XNTP position, and the short, higher voltage TCE pulses are designed to overcome the energetic barrier associated with a TCE.
Ideally, each TCE pulse results in translocation past a single TCE barrier, thus moving the Xpandomer further into the pore in the forward direction by an amount of one “base” position. [0154] During typical operation, applied voltage patterns are designed so that there are a fixed number of TCE pulses during each bright period, which cause the Xpandomer to translocate in the “forward” direction by a number of bases, corresponding to the number of TCE pulses, or until the Xpandomer fully translocates, and is released into the fluidic “trans” chamber below the membrane. [0155] During typical operation, an Xpandomer molecule may not fully translocate prior to the end of a single bright period. This may happen due to the molecule being captured late in the bright period and having an Xpandomer length with more base positions than there are TCE pulses remaining in the bright period. A molecule can get stuck while attempting to translocate in the forward direction for a variety of reasons. There may be a base position which has a defect (such as a failed cleavage event) which makes it impossible or very difficult for the molecule to translocate past that point. In such circumstances, and for other reasons, an Xpandomer may not be able to fully translocate during the bright period, regardless of the number of TCE pulses in a bright period. In such situations, it may be observed that a number of base positions in the beginning of the read are sequenced and generate the expected signal levels, until the defective position is reached. The last tag code level located just before the defect can then be observed for the remainder of the bright period. In order that pores do not remain permanently clogged, a large negative voltage may be applied for some fraction of time in the dark period in order to clear out any stuck molecules by driving them hard in the reverse direction.
[0156] FIG.5A shows an example of a single molecule-multi-molecular trace (SM3T) event with a typical sequencing by expansion waveform. The graph shows time in seconds on the x-axis. The graph shows voltage readings on the y-axis. Other electrical measurements may be used instead of voltage, including voltage equivalents (e.g., ADC counts) or current. Dark periods 504 and 508 are normal dark periods, where the pore is clear. Bright period 512 shows signals 516a and 516b of molecule 1 and molecule 2, respectively. [0157] Signal 520 shows a molecule during a bright period. The event shows that the molecule gets stuck in the pore and does not clear over several cycles (dark periods 524, 528, 532 and signals 536, 540, and 544 in bright periods). Eventually, the molecule clears in a dark cycle, as indicated by the change from signal 548a to signal 548b (when the molecule clears). This event may result from properties of the Xpandomer’s leader segment, which create difficulty in the leader translocating in the reverse direction. [0158] FIG.5B illustrates a possible mechanism behind the trace in FIG.5A. Diagram 552 shows a bright period. The translocation direction is downward. Uncleaved position 556 would hit the pore after the next pulse. Normal tag code level is expected. Diagram 558 shows a dark period. The translocation direction is now upward. Leader 560 has trouble translocating in the reverse direction (upward) through the pore. Eventually, leader 560 goes through the pore. [0159] Xpandomer molecules can be designed with properties in the leader portion of the Xpandomer that cause the leader to behave differently in the forward and reverse directions. During the bright period (forward direction), the leader may have characteristics that allow the leader to be captured into the pore from the cis side with relatively high capture rates under reasonably applied voltages.
Following capture, but still during the same bright period, the leader may protrude from the underside of the pore (trans side of membrane), as TCE pulses cause the molecule to process steadily through the pore. [0160] During the dark period (reverse direction), if a molecule is still in the pore when a dark period begins, the molecule should begin to translocate in the reverse direction under a negative applied voltage. Once the Xpandomer molecule has almost fully reversed its position (i.e., it has almost fully backed out), the leader may remain on the trans side of the barrel. At this point, the desired property of the leader is that the leader has a high energetic barrier to entry into the barrel from the trans side, and thus is highly resistant to translocation through the barrel in the trans to cis direction (reverse direction). D. Measurement of Signals and Base calling [0161] In some embodiments, a base call for each position in a nucleic acid molecule is generated by measuring a unique signal for each individual subunit of the Xpandomer. A simple example of base calling can compare the signal value to a plurality of cutoff (threshold) values, where the value falling in a corresponding range can indicate a particular base. Machine learning techniques can also be used. Example techniques for measuring sequences of nucleic acids and determining a base call are described in International Application No. PCT/EP2017/069820, which is herein incorporated by reference in its entirety. [0162] Briefly, the signal measurement used in SBX sequencing is based on relative changes in an electrical signal (e.g., voltage/current) as the Xpandomer passes through the nanopore. Nanopore proteins are embedded into a membrane (e.g., a bilayer lipid membrane) that pass an electrical signal (via flow of ions) through the nanopore.
As an Xpandomer passes through the nanopore, a change in an electrical property is measured, thereby providing an identification of the nucleotide (e.g., XNTP) at a given position of the Xpandomer. Each nanopore cell (e.g., nanopore cell 150 described with respect to FIG.1) produces a new datapoint (voltage/current change) at a kilohertz or higher rate. Depending on the waveform (e.g., electrical signal type) applied across the membrane, multiple datapoints may be generated per base or just one datapoint per base. [0163] In some embodiments, a direct current (DC) signal can be applied to the nanopore cell (e.g., so that the direction at which the nucleic acid molecule moves through the nanopore is not reversed). However, operating a nanopore sensor for long periods of time using a direct current can change the composition of the electrode, unbalance the ion concentrations across the nanopore, and have other undesirable effects that can affect the lifetime of the nanopore cell. Applying an alternating current (AC) waveform can reduce the electro-migration to avoid these undesirable effects, and therefore an AC waveform may instead be used. During a dark period, the AC signal can electrochemically recharge the capacitor and the electrochemical cell at the bottom of the well. [0164] Suitable conditions for measuring a change in an electrical property that results from the passage of a molecule through the nanopores are known in the art and examples are provided herein. The measurement may be carried out with a voltage applied across the membrane and pore. In some embodiments, the voltage used may range from -400 mV to +400 mV. The voltage used is preferably in a range having a lower limit selected from -400 mV, -300 mV, -200 mV, -150 mV, -100 mV, -50 mV, -20 mV, and 0 mV, and an upper limit independently selected from +10 mV, +20 mV, +50 mV, +100 mV, +150 mV, +200 mV, +300 mV, and +400 mV.
The voltage used may be more preferably in the range of 100 mV to 240 mV and most preferably in the range of 160 mV to 240 mV. It is possible to increase discrimination between different nucleotides by a nanopore using an increased applied potential. Sequencing nucleic acids using AC waveforms and tagged nucleotides is described in US Patent Application Publication No. 2014/0134616, which is herein incorporated by reference in its entirety. [0165] To perform sequencing of a nucleic acid, the voltage level across the nanopore can be sampled and converted by an analog-to-digital converter (ADC), which converts a continuous-time signal of one quantity to a digital signal (e.g., voltage to a discrete time, quantized amplitude signal). For example, an AC voltage signal is applied across the nanopore at, e.g., about 100 Hz, and an acquisition rate of the ADC can be about 2000 Hz per cell. Thus, there can be about 20 data points (voltage measurements) captured per AC cycle (cycle of an AC waveform). Data points corresponding to one cycle of the AC waveform may be referred to as a set. The ADC signals may be processed by a base calling algorithm (e.g., neural network, other machine learning model, or statistical model) to determine the corresponding sequence of bases in a nucleic acid molecule. [0166] As part of base calling, a quality score can be determined. A low-quality base can result when there is an equal or similar probability between two bases, e.g., near an edge of a cutoff value separating two bases or similar probability from an ML model. Such knowledge of a low-quality score can be combined with knowledge of which bases have similar signal values (e.g., the bases could have increasing signal values in the order of A, C, G, and T, with T having the highest signal value). Other orders can be used. If a base call for T has a low-quality score, then it can be surmised that a likely other base is a G.
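As a non-limiting sketch of the cutoff-based example, the following code calls a base from a single normalized signal value, flags the call as low quality when the value lies near a cutoff, and reports the neighboring base on the other side of that cutoff as the likely alternative. The cutoff values, the margin, and the function name are hypothetical; only the A, C, G, T signal ordering comes from the example above.

```python
from bisect import bisect_right

ORDER = "ACGT"                # increasing signal value, per the example ordering
CUTOFFS = [0.25, 0.50, 0.75]  # hypothetical normalized cutoffs between neighbors


def call_base(signal, margin=0.05):
    """Return (base, low_quality, runner_up) for one normalized signal value."""
    idx = bisect_right(CUTOFFS, signal)  # which range the value falls in
    base = ORDER[idx]
    # The nearest cutoff decides both the quality flag and the likely other base.
    k = min(range(len(CUTOFFS)), key=lambda i: abs(signal - CUTOFFS[i]))
    low_quality = abs(signal - CUTOFFS[k]) < margin
    runner_up = ORDER[k] if idx == k + 1 else ORDER[k + 1]
    return base, low_quality, runner_up


print(call_base(0.77))  # ('T', True, 'G'): a low-quality T whose likely other base is G
print(call_base(0.10))  # ('A', False, 'C')
```

A machine learning base caller could report the same kind of information as the two highest class probabilities instead of a distance to a cutoff.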
Such a determination can also be known when the base caller uses more complex techniques. This information can be used when determining an intramolecular consensus read. For example, if one strand had a high-quality C and the other strand had a low-quality T, then the position could be called as concordant for C-G, where the base call for T is assumed to be an error. Such techniques are described in more detail herein for determining a consensus read. In other embodiments, knowledge of the specific other base that had the second highest probability can be provided from the base caller to the circuitry that determines the consensus read and that implements any data compression. III. Alternative HDD Read Constructs (Differing Number of Passes) [0167] As mentioned, the Xpandomer synthesis process can have a predictable error profile that is a function of many factors. For example, the error rate of the polymerase used to synthesize Xpandomer surrogate molecules is directly related to the state of the target DNA molecule. The state of the target DNA molecule is influenced by many factors such as whether the stretch of the target DNA molecule within proximity to the polymerase active site is in single or double stranded form. Error rates are also influenced by local sequence context itself (i.e., k-mer context). [0168] Additionally, the synthesis of the various HDD DNA constructs outlined below in sections III-A, III-B, and III-C have their own error rates that can be anticipated. For example, the error rate can depend on whether the target HDD DNA construct is expected to be (or have been) in a single vs. double stranded state. This knowledge can then be leveraged by the intramolecular consensus calling algorithm to optimize intramolecular consensus read calls and associated consensus base quality score calls.
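The high-quality C and low-quality T example of paragraph [0166] can be sketched as a simple resolution rule. The function name, the Phred-like quality values, and the `min_gap` threshold are hypothetical and are shown only to make the logic concrete; an actual consensus caller may weight strands by the error profiles discussed above.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def resolve(first_call, first_q, second_call, second_q, min_gap=10):
    """Resolve one duplex position from the calls on the two strands.

    Returns (first-strand base, second-strand base, status). A discordant
    position is resolved only if one call is clearly more reliable.
    """
    if COMPLEMENT[first_call] == second_call:
        return first_call, second_call, "concordant"
    if first_q - second_q >= min_gap:
        return first_call, COMPLEMENT[first_call], "resolved"
    if second_q - first_q >= min_gap:
        return COMPLEMENT[second_call], second_call, "resolved"
    return first_call, second_call, "discordant"


# High-quality C on one strand, low-quality T on the other: called as C-G.
print(resolve("C", 30, "T", 5))  # ('C', 'G', 'resolved')
```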
Accordingly, this information may be passed further downstream and leveraged during subsequent algorithmic stages, such as the (optional) subsequent formation of intermolecular consensus reads from multiple intramolecular consensus reads, or during variant calling stages.

A. Two Pass HDD Read Constructs

[0169] Two pass HDD read constructs comprise the parent plus and parent minus strands of a single insert molecule (also referred to as a target molecule).

[0170] FIG.6A shows an exemplary two pass HDD read construct and its 3-step synthesis. The first step involves ligating the double stranded target molecule to a Y-adapter that is on a solid support (indicated by gray box). The Y-adapter is ligated onto the 3’ end of the parent minus (Parent -) strand and the 5’ end of the parent plus (Parent +) strand. Then at the second step the hairpin adapter is ligated to the 5’ end of the parent minus strand and the 3’ end of the parent plus strand. The strand-adapter complexes are dissociated from molecules that do not have a hairpin followed by a wash. The final step involves releasing the two pass HDD read construct from the solid support beads.

[0171] FIG.6B shows an example of a fully synthesized two pass HDD read construct. The construct undergoes Xpandomer synthesis, SBX, and base calling. As illustrated in FIG.6B, the structure of the Xpandomer makes it easy to keep track of all the segments that comprise the Xpandomer molecule such as the portion of the Y-adapter ligated to the 3’ end of the parent minus strand (e.g., Xpandomer primer region (pink), runway-SID sequence (orange), and Stem (light blue)), the synthetic Xpandomer polymer of the target molecule (e.g., daughter minus (purple), hairpin sequence (red), daughter plus (purple)), and the portion of the Y-adapter ligated to the 5’ end of the parent plus strand (e.g., stem minus (light blue) and blocker sequence to stop Xpandomer synthesis (green)).
[0172] FIG.7 shows example error types that can occur during either sequencing or synthesis of the Xpandomer molecule. A first example is a biological mutation (e.g., single nucleotide polymorphism (SNP)) where the reference sequence has a C in the 7th position, but the parent plus sequence of the original insert DNA sequence instead has a T. The SNP mutation is carried through during Xpandomer synthesis and is consistently detected across the parent-daughter pairs.

[0173] A second error type example shown is DNA damage (e.g., 8-oxoguanine) where the G in the 21st position of the parent minus strand is indicated by a G*. 8-oxoguanine occurs at a rate of ~0.02-0.8 × 10^-6 in human primary and cancer cells, and as high as 10^-5 to 10^-4 in prepped samples. During DNA replication, and thus Xpandomer synthesis, 8-oxoguanine causes a G→T mutation, which is observed in the parent-daughter minus pair where the daughter minus Xpandomer sequence has an A instead of the expected C observed in the parent-daughter plus strands.

[0174] A third example is the rate of random error incorporation during the synthesis of the daughter plus and minus Xpandomer sequences. As shown, there are four random errors that occur on both the daughter plus and minus Xpandomer sequences. In this particular example, a total raw read error rate of ~1% is assumed. On the daughter minus Xpandomer sequence, at the 17th position no base was called when a C would have been expected, and at the 36th position a G was called when an A was expected. The rate at which two random errors (one in each subread) could occur at the same consensus position (e.g., the A at the 47th position of the daughter plus Xpandomer sequence and the T at the 47th position of the daughter minus Xpandomer sequence) is about (0.01)^2, or 10^-4.
However, those random errors will be of the “same” type (accounting for reverse complementarity), and thus undetectable, at rates which are significantly lower than 10^-4.

[0175] Accordingly, for substitution errors in the consensus sequence, when the expected daughter plus Xpandomer base is G and the expected daughter minus base is a C, the following error rates can be considered: 1) 98.3% of the consensus read positions will not have substitution errors on either Xpandomer strand, and 2) unflaggable multi-substitution errors occur at a total rate of 4 × 10^-6. For insertions, under the same expectations, 98.3% of the consensus read positions would be expected to not have insertion errors, but the total unflaggable rate for multi-insertion errors is expected to be 3 × 10^-7. Random base substitutions, insertions, or deletions during Xpandomer synthesis, such as A→G (e.g., at the 36th position of the daughter minus Xpandomer sequence), would be present in one of the two reads out of every 50 consensus positions (i.e., 2% of consensus positions).

B. Four Pass HDD Read Constructs

[0176] Four pass HDD read constructs comprise two copies of the insert/target molecule (e.g., two copies of the parent plus and parent minus strands). An exemplary four pass HDD read construct and its synthesis are shown in FIGs.8A-8D.

[0177] The first three steps of the synthesis are depicted in FIG.8A. First, the double stranded target molecule is ligated to a Y-Open-Hairpin-adapter that is on a solid support (indicated by the gray box). The Y-Open-Hairpin-adapter is ligated onto the 3’ end of the first parent minus (Parent -) strand and the 5’ end of the second parent plus (Parent +) strand. Then at the second step, the hairpin adapter is ligated to the 5’ end of the second parent minus strand and the 3’ end of the first parent plus strand. The strand-adapter complexes are dissociated from molecules that do not have a hairpin followed by a wash.
Next, the four pass HDD read construct is released from the solid support (e.g., beads).

[0178] Steps 3 and 4 are shown in FIG.8B. Following release from the solid support, the four pass HDD read construct is extended with a strand displacing polymerase allowing for complementary daughter minus and daughter plus strands to be synthesized.

[0179] As shown in FIG.8C, the extended molecule has an SBX adapter ligated to it and in FIG.8D the SBX adapter ligated molecule undergoes Xpandomer synthesis, sequencing, and base calling.

[0180] As illustrated in FIG.8D, the structure of the four pass HDD read construct makes it easy to keep track of all the segments that comprise the Xpandomer molecule. Finally, the four pass HDD read construct undergoes adapter segmentation, effectively removing the Y-Open-Hairpin-adapter and the hairpin adapter so only the daughter -/+ strands and the parent -/+ strands are left.

[0181] FIG.9 shows example error types that can occur during either sequencing or synthesis of the Xpandomer molecule. A first example is a biological mutation (e.g., single nucleotide polymorphism (SNP)) where the reference sequence has a C in the 7th position, but the parent plus sequence of the original insert DNA sequence instead has a T. The SNP mutation is carried through during Xpandomer synthesis and is consistently detected across the parent-daughter pairs.

[0182] A second error type example shown is DNA damage (e.g., 8-oxoguanine) where the G in the 21st position of the parent minus strand is indicated by a G*. 8-oxoguanine occurs at a rate of ~0.02-0.8 × 10^-6 in human primary and cancer cells, and as high as 10^-5 to 10^-4 in prepped samples.
During DNA replication, and thus Xpandomer synthesis, 8-oxoguanine causes a G→T mutation, which is observed in the daughter-daughter plus pair where the daughter plus sequence has an A instead of the expected C and the daughter plus Xpandomer sequence has a T instead of the expected G.

[0183] A third example is of homopolymer insertions at the end of the poly(T) tract of the parent plus and daughter plus Xpandomer sequences. If A and T insertions each occur at a raw rate of about 10% for homopolymers of length eight, then insertions will occur on two passes of the same 8-mer at a rate of ~1%, and insertions will occur on three passes of the same homopolymer about 0.1% of the time. Even when this occurs three times, this would be flagged for lower quality due to a quarter of the passes disagreeing. The exact same error occurring on all four passes of the same 8-mer would be undetectable but would occur at a rate of ~10^-4. Given the different error rates, an estimate is made of both the level of consensus accuracy that can be achieved and the rate of discordant base positions. Some discordant positions may still not be resolvable despite the estimates and four reads. For other discordant positions, a consensus call is made at the point of intramolecular consensus formation, resolving the discordant position. In so doing, a higher compression rate can be achieved by minimizing the number of unresolvable positions.

[0184] FIG.10 shows an example of intramolecular consensus calling output with errors described above in FIG.9. For efficiency, the rate at which possible substitutions need to be flagged is considered. “Confident” disagreement between parent plus consensus and parent minus consensus occurs at a rate of 10^-4 to 10^-5 due to true chemical single strand substitutions, or 10^-5 to 10^-6 due to the same mismatch in both parent and daughter of the + or - Xpandomer reads.
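The coincidence estimates in paragraphs [0174] and [0183] follow from multiplying per-pass error rates. The short Python sketch below reproduces that arithmetic under the simplifying assumption that errors on different passes are independent and that the per-pass rate is the same on every pass; the function name is hypothetical.

```python
def coincident_error_rate(per_pass_rate: float, passes: int) -> float:
    """Probability that an error occurs on each of `passes` independent passes."""
    return per_pass_rate ** passes


if __name__ == "__main__":
    # Homopolymer insertion example: ~10% raw rate per pass of an 8-mer.
    for k in (2, 3, 4):
        print(f"{k} passes at 10% per pass: {coincident_error_rate(0.10, k):.0e}")
    # Two-pass random error example: ~1% raw read error rate per subread.
    print(f"2 passes at 1% per pass: {coincident_error_rate(0.01, 2):.0e}")
```

Because the independence assumption ignores context-dependent errors (for example, errors that are systematically more likely within a given k-mer), the values are order-of-magnitude estimates only.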
For example, the consensus sequence denotes the discordant base due to 8-oxoguanine DNA damage as 1G2T to indicate that one G was turned into a T. Essentially, about 1 in 10^4 consensus bases can be flagged as being 50% uncertain. With respect to the rate at which homopolymer 8-mers must be flagged, disagreement of one subread from the others will occur at a rate of 2 × 10^-1 (over the cumulative 8-mer homopolymer stretches). Disagreement of two subreads from the other two will occur at a rate of 2 × 10^-2 (over the cumulative 8-mer homopolymer stretches). Homopolymers of length 4-10 themselves occur at a frequency of about 2 in 10^3 positions within the genome. Essentially, about 1 in 10^4 consensus bases can be flagged as being calls with low certainty due to homopolymer errors. By contrast, about 1 in 10^5 consensus bases would be flagged as being calls with only 50% certainty due to homopolymer errors.

C. “n” Pass HDD Read Constructs

[0185] “N” pass HDD read constructs essentially allow for “n” copies of the insert/target molecule to be incorporated into the HDD read, where “n” is any integer value. FIGs.11A-11C show an exemplary “n” pass HDD read construct and its synthesis.

[0186] The first three steps of the synthesis depicted in FIG.11A are very similar to the process described for four pass HDD read constructs. First, the double stranded target molecule is ligated to a Y-Open-Hairpin-adapter that is on a solid support (indicated by the gray box). The Y-Open-Hairpin-adapter is ligated onto the 3’ end of the first parent minus (Parent -) strand and the 5’ end of the second parent plus (Parent +) strand. Then at the second step, the hairpin adapter is ligated to the 5’ end of the second parent minus strand and the 3’ end of the first parent plus strand. The strand-adapter complexes are dissociated from molecules that do not have a hairpin followed by a wash.
Next, the four pass HDD read construct is released from the solid support.

[0187] FIG.11B shows steps 3 and 4. Following release from the solid support, the four pass HDD read construct is extended with a strand displacing polymerase allowing for complementary daughter minus and daughter plus strands to be synthesized.

[0188] As shown in FIG.11C, the extended molecule has another Y-Open-Hairpin-adapter ligated to it, and the molecule undergoes a second round of extension with a strand displacing polymerase. Steps 4 and 5 are repeated an arbitrary number of times to generate an “n” pass HDD read construct. As with the previous constructs, once the desired number of passes have been made, the “n” pass HDD read construct undergoes Xpandomer synthesis, sequencing, and base calling.

[0189] In some instances, it is possible to induce multiple readouts of the same HDD read through a process called “n” pass flossing of HDD reads. This is achieved by either (i) inducing a specific waveform (SM3T), or (ii) attaching an XNTP with a stronger energy barrier at the leader E-oligo end of an HDD read to induce multiple readouts of the same HDD read. The energetic barrier introduced by such a leader requires a higher voltage applied for extraction. Such a voltage can be applied during alternate dark cycles, which are not utilized for sequencing. This approach allows for resequencing the same XNTP molecule multiple times, thus eliminating, by way of consensus, most stochastic sequencing errors. On HDD reads the remaining errors will almost only be ones induced by Xpandomer synthesis, eliminating a large fraction of stochastic sequencing errors in read 1 and read 2 consensus. Consensus will be required on the multiple readouts, where the proximity of two bright cycles with near identical sequences can establish grouping.
Alternatively, sequence similarity, or the sequential second pass readout at the immediate beginning of the bright cycle can serve in grouping reads. [0190] For deduplication of readouts without UMI, a positional deduping (genomic position start and end) approach can be used, where the proximity of reads in terms of subsequent bright cycles can be used to eliminate false positive calls. [0191] Once synthesis of the Xpandomer molecule corresponding to each of the above- described HDD read constructs is complete, the Xpandomer is sequenced. Xpandomer sequencing occurs on a sequencing device, like the sequencing device described with respect to section I-A. A solution comprising a plurality of surrogate Xpandomer molecules is loaded onto a nanopore sensor chip (such as the one described in FIG.1). The chip houses thousands of nanopore proteins embedded into a membrane that pass an electrical signal through the nanopore, where changes in electrical signal correspond to the different bases (e.g., XNTP) inserted into the nanopore. In some instances, raw HDD sequencing data may be stored in sequencing files (e.g., FASTQ files) for downstream processing. IV. Hairpin-Directed-Duplex (HDD) Read Classes [0192] The sequencing process described in section II-A may generate a variety of different HDD reads that can be grouped based on their description. [0193] FIG.12 shows exemplary HDD reads of different classes that are output during sequencing. Reads which only include the start adapter and one strand orientation are referred to as partial reads. Duplex reads which include a full forward insert, the hairpin adapter and only part of the reverse complement read are referred to as One+ reads. Duplex reads which include full or partial complementary forward and reverse segments and are missing the hairpin adapter are referred to as U-turn reads. 
A full HDD read includes sequencing of an entire dsDNA template, both forward and reverse strands, including start, hairpin, and end adapters. Sample IDs and unique molecular identifiers can be assigned to the different adapter segments. Additional read classes are possible, including combinations, e.g., a One+ U-Turn read.

[0194] FIG.13A shows an illustration of a hypothetical read structure resulting from the sequencing of a Hairpin-Duplex (HD) construct. The actual read would be a linear string of base calls and an associated linear string of quality scores. Here the linear read string of base calls is depicted as being folded over on itself. This representation is analogous to the conformation in which information is stored in the physical construct, and thus it is easy to envision from where the information in each segment of the read originated. This conformational representation is also relevant for the read itself, as it shows which segments of the full read are expected to be complementary and thus align to one another. FIG.13A shows the target “insert” sequence as a solid line and read segments originating from adapter sequences are shown as dashed lines.

[0195] FIG.13B shows additional labeling on the read structure. Here, the terms ‘Insert Read 1’ and ‘Insert Read 2’ are used to describe the parts of the overall read, which correspond to the first and second passes of the insert sequence, respectively. When Insert Read 1 and Insert Read 2 are both full length passes of the insert segment of the construct, then the sequences contained in them are, or are close to, reverse complements of one another. This is an example of a “Full HDD” read. For Full HDD reads, performing a local or global pairwise alignment between the two insert reads would most often result in alignment with a relatively high alignment score.
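The relationship between Insert Read 1 and Insert Read 2 in a Full HDD read can be sketched in Python: the second insert read is reverse complemented and then globally aligned to the first. This is a minimal illustration using a textbook Needleman-Wunsch alignment with arbitrarily chosen scoring values (match, mismatch, gap); it is not the alignment method of any particular embodiment, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the gapped strings.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def discordant_columns(aln_a: str, aln_b: str) -> list:
    """Alignment columns where the two aligned insert reads disagree."""
    return [k for k, (x, y) in enumerate(zip(aln_a, aln_b)) if x != y]


if __name__ == "__main__":
    insert_read_1 = "ACGGTTAC"
    insert_read_2 = "GTAACGT"  # reverse complement of read 1 with one base missing
    aln1, aln2, s = global_align(insert_read_1, reverse_complement(insert_read_2))
    print(aln1)
    print(aln2)
    print("score:", s, "discordant columns:", discordant_columns(aln1, aln2))
```

For a One+ read, a local alignment (for example, Smith-Waterman) would be the more natural choice, since only part of the longer subread is expected to match the shorter one.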
[0196] FIG.13C shows an example of a “One+ Read.” For some HDD Reads, the subread corresponding to the first pass of an insert (i.e., Insert Read 1) may be a full subread, and the subread corresponding to the second pass of an insert (i.e., Insert Read 2) may be a partial subread. This category of HDD reads is referred to as “One+ Reads,” given that they include one full insert subread, corresponding to the first pass on the insert segment, plus some additional sequence content in a second subread, from a second pass on the insert. For “HD One+ Reads,” a local pairwise alignment of Insert Read 1 and Insert Read 2 would typically result in a high scoring alignment between the shorter of the two subreads and part of the longer of the two subreads.

V. Alignment Techniques to Generate Consensus Sequences

[0197] To determine a consensus read from a plurality of reads of different strands and possible daughter copies within the same molecule, the base calls at corresponding positions are determined. The reads can be aligned to each other or via a reference sequence to determine the base calls that correspond to the same position. This sequence alignment can be performed using various software packages, such as (but not limited to) BLAST, FASTA, Bowtie, BWA, BFAST, SHRiMP, SSAHA2, NovoAlign, and SOAP. Described below are several exemplary, and non-limiting, alignment techniques (e.g., reference-based alignment, reference-free alignment, reference guided HDD pair mapping and alignment, and three-way alignment) that may be used to generate intramolecular consensus reads.

A. Reference-Based Alignment

[0198] In some embodiments, the first sequence of base calls and the second sequence of base calls may be aligned to a reference genome to determine which base position on one strand corresponds to which base position on another strand.
Once the alignment is done, the bases at the same position on each strand can be compared to determine whether they are concordant or discordant. One advantage of this approach is that it allows for improved accuracy of consensus reads, depending on the motifs present. Further, this approach preserves information for as long as possible, and preserves all of the raw read information until potentially the point of variant calling. An overview of reference-based guided pairwise alignment and its implications is provided below.

[0199] At a first stage, demultiplexing, adapter detection, and optional UMI extraction can be performed. Initially, for each hairpin-directed-duplex (HDD) read, the start, mid, and end adapters are detected, and the position of the adapters is annotated. This step is optional, and an alternative approach may involve direct alignment of the HDD read to the reference where adapter trimming will take place post alignment. If sample identifiers (SIDs) or unique molecular identifiers (UMIs) are present, these are extracted, trimmed, and annotated at this step.

[0200] Once the reads of the different strands of the nucleic acids are determined (also referred to as subreads), reference alignment can be performed. HDD subreads (e.g., per strand or for each copy for more than two passes) are aligned to a reference genome (e.g., using BWA MEM or other alignment software) that corresponds to the origin of the sequence sample (e.g., human sequence sample is aligned to a human reference genome). HDD reads can have the same read name but a different subread identifier. Mapping and alignment results can be stored in a BAM alignment file. A subread can be aligned in its entirety or just the two ends of the subread can be aligned, e.g., 60 bases on each end of the read. The reference can include two strands that are perfectly complementary to each other.
Each subread can be aligned to the corresponding strand version of the reference. As another example, a subread can be converted before alignment by switching to the complementary base, and thus a reference for only one strand is used. [0201] An intramolecular consensus can then be determined. As an optional step, intramolecular consensus is done prior to an optional intermolecular UMI based consensus. In this step, HDD subreads are processed to form a single HDD consensus read as well as additional streams of data. Briefly, the merging process involves using the alignment information from both subreads in a pair against a reference sequence (e.g., CIGAR information) to recreate pairwise alignment between the two subreads. Since the reference has known positions, the resulting aligned positions for each subread can be compared to each other with confidence in knowing the positions correspond. For example, G at position 101 on first subread can be compared to C at position 101 on the second subread to determine they are concordant. For this reason, consensus accuracy of reference-guided methods is higher compared to pure pairwise methods (e.g., reference free methods). The HDD consensus read and associated streams of data store concordant bases and annotates information on discordant bases as described in detail herein. [0202] An optional intermolecular consensus can be determined. As an optional step, HDD reads are grouped based on positional deduplication in combination with UMI information to form intermolecular consensus. A more detailed description of intermolecular consensus calling is provided in section XII.A and has been described in International Application No. PCT/US2022/045624, which is herein incorporated by reference in its entirety. [0203] In some embodiments, reference based compression can be performed. 
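The reference-guided merging described in paragraph [0201] can be sketched as follows: the CIGAR string of each subread is walked to map reference positions to subread bases, and the two subreads are then compared at the reference positions they share. This Python sketch is a simplified illustration only. It assumes both subreads are already expressed in the forward orientation of the reference (as sequences are stored in SAM/BAM records), so concordance is tested by equality; it records deletions as "-" and drops inserted bases, which a full implementation would need to retain. All function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_to_base(seq: str, ref_start: int, cigar: str) -> dict:
    """Map each covered reference position to the subread base aligned to it."""
    mapping = {}
    ref_pos, read_pos = ref_start, 0
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op in "M=X":            # consumes both reference and read
            for k in range(length):
                mapping[ref_pos + k] = seq[read_pos + k]
            ref_pos += length
            read_pos += length
        elif op in "IS":           # consumes read only (inserted bases are skipped here)
            read_pos += length
        elif op == "D":            # deletion relative to the reference
            for k in range(length):
                mapping[ref_pos + k] = "-"
            ref_pos += length
        elif op == "N":            # skipped reference region
            ref_pos += length
        # H and P consume neither reference nor read
    return mapping


def compare_subreads(map1: dict, map2: dict):
    """Split shared reference positions into concordant and discordant calls."""
    shared = sorted(set(map1) & set(map2))
    concordant = [p for p in shared if map1[p] == map2[p]]
    discordant = [(p, map1[p], map2[p]) for p in shared if map1[p] != map2[p]]
    return concordant, discordant


if __name__ == "__main__":
    m1 = reference_to_base("ACGTAC", 100, "6M")
    m2 = reference_to_base("CGAAC", 101, "2M1D3M")
    concordant, discordant = compare_subreads(m1, m2)
    print("concordant positions:", concordant)
    print("discordant calls:", discordant)
```

Because the comparison is keyed by reference position, no separate pairwise alignment between the two subreads is needed in this approach.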
Following HDD read segmentation, intramolecular alignment, and consensus, a reference-based compression algorithm (see section VI.B for detailed description) may be used to further compress the intramolecular consensus reads, thus achieving lower data rates in situations with limited bandwidth available for data transfer.

[0204] In some embodiments, realignment of consensus reads can also be performed. As an optional step, consensus reads are realigned using different alignment parameters to generate an output consensus alignment (currently a BAM file). Realignment may offer better or alternative matching alignments of the consensus reads and may use a different reference genome. For instance, realignment may consider a graph or more detailed reference genome and depend on higher accuracy reads as input.

[0205] Variant calling can then be performed. Variants of consensus reads can be determined in relation to a reference genome.

B. Reference-Free Alignment

[0206] In some embodiments, the first sequence of base calls and the second sequence of base calls may be aligned to each other via the adapter segments, thus alignment to the reference genome does not occur. A consensus read can be formed from the information present in the original full HDD read. This approach is particularly advantageous for compressing the data quickly as the data can be compressed as soon as it comes off the instrument. An overview of reference-free guided pairwise alignment and its implications is provided below.

[0207] At a first stage, demultiplexing, adapter detection, and optional UMI extraction can be performed. Initially, for each HDD read, the start, mid, and end adapters are detected, and the position of the adapters is annotated. If SIDs or UMIs are present, these are extracted, trimmed, and annotated at this step.
This step is optional, and an alternative approach considers rough splitting of HDD reads by half, where UMI and SID extraction takes place after intramolecular alignment and consensus.

[0208] Once the reads of the different strands of the nucleic acids are determined (also referred to as subreads), intramolecular alignment can be performed. Mapping and alignment results can be stored in a BAM alignment file. For “Two-Pass” HDD read constructs, pairwise intramolecular alignment involves alignment of the first and second reverse complementary sections of an HDD read construct. For “Four-Pass” or any other number of passes greater than two, multiple sequence alignment, or the equivalent, should be performed on the additional three or more insert passes.

[0209] A consensus read (potentially including discordant base positions) can be determined using the aligned reads. Once the reads are aligned, the base calls at the same position can be compared. Alignment and comparison results may be stored using partial order alignment or through one or more of a variety of lossy and lossless compression embodiments, examples of which are provided below. Methylation sequences may require different alignment parameters (e.g., alignment penalties, scoring, etc.) depending on the use of conversion steps for methylation detection or the inclusion of wobble nucleotides or conversion to facilitate processivity through certain nucleotide motifs. Optionally, UMI and SID extraction as well as adapter trimming may take place in this step, where concordant read 1, read 2, UMI, and SID bases are annotated, facilitating high accuracy detection of shorter UMI and SID sequences in comparison to raw SBX reads.

[0210] In some embodiments, reference based compression can be performed.
Following HDD read segmentation, intramolecular alignment, and consensus, a reference-based compression algorithm (see section VI.B for detailed description) may be used to further compress the intramolecular consensus reads, thus achieving lower data rates in situations with limited bandwidth available for data transfer.

[0211] In some embodiments, consensus read mapping and alignment can be performed. Following formation of consensus reads, the consensus reads and the associated data are mapped and aligned to a reference genome. Alignment may use information on pairwise discordant and concordant bases. Subreads may require realignment and/or recovery of original subreads to improve alignment to a reference. Such realignment may be local or encompass the entire subreads.

[0212] In some embodiments, an optional intermolecular consensus can be determined. As an optional step, HDD intramolecular consensus reads are grouped based on positional deduplication in combination with UMI information to form intermolecular consensus. Intramolecular consensus reads contain information that may be leveraged in intermolecular consensus. Intramolecular consensus reads may require realignment and/or recovery of original subreads to improve intermolecular consensus calls. Such realignment may be local or encompass the entire subreads.

[0213] Variant calling can then be performed. Variants of consensus reads are determined in relation to a reference genome.

C. Example Alignment of HDD Read Construct

[0214] FIG.14 shows an example of an alignment object, resulting from the alignment between an example read 1 string and a read 2 string for reads of length equal to 100 base pairs. The row names (‘read 1’ and ‘read 2’) are not necessarily part of the alignment object, nor are the column names (numbers 0 through 100, in this case), but are included in the above figure for ease of viewing.
In this particular alignment object, three discordant positions can be seen, and are highlighted in red. Namely, there is one substitution, one insertion and one deletion within read 2 relative to read 1. The steps taken to perform efficient lossless compression of HDD insert read segments spanning the insert sequence are described below:

1) Sections of the HDD read construct corresponding to adapter sequences are identified and removed, e.g., by algorithmic approaches described herein. This results in identification of the Insert Read 1 sequence and Insert Read 2 sequence.

2) A pairwise alignment is produced between Insert Read 1 and Insert Read 2.

[0215] Options are described for representing the information contained within the “alignment object,” which is one possible name for the object that represents the result of the alignment between Insert Read 1 and Insert Read 2. An alignment object might include, for example, strings of characters representing the bases and dashes in both read 1 and read 2 after having been aligned. In order to represent the alignment object in a lossy or lossless, compressed form, a number of encoding algorithms may be used.

D. Reference Guided HDD Pair Mapping and Alignment

[0216] Higher throughput and lower raw read accuracy place constraints on real-time performance for an SBX end-to-end workflow. Further, real-time mapping and alignment to the human genome are challenging. To overcome this challenge, one exemplary embodiment may include exporting, from a base calling station, the demux (e.g., demultiplexed reads) and trimming information in a file, such as hdf5 (station or on-prem). HDD reads are maintained as a continuous sequence annotated at start and end insert positions. HDD reads are mapped in tandem, utilizing longer and sparser seed matching compared to single end SBX reads from either forward or reverse reads (on-prem or cloud).
Deduping (e.g., deduplication) and consensus calling (on-prem or cloud) are performed on reference guided reads, where each HDD read is represented as two aligned subreads. In contrast with paired end reads, HDD reads are expected to align to the exact same locations in the genome, supporting the use of seeds matching either read 1 or read 2. The algorithmic approach to this process includes: (i) use longer seeds matching either reverse or forward read; (ii) mapping only on seeds concordant between first and second subread; (iii) mapping only on more than one shorter concordant seed matches; and (iv) on unmapped and/or low mapping quality reads, use pairwise alignment to provide high accuracy reference free base calls on HDD reads. The benefits include maintenance of all base call information for the pair, enabling improved deduped consensus utilizing matches by only one of two HDD subreads. Also, this approach utilizes reference guided alignment of both read pairs, reducing the impact of discordance in alignment between reads in consensus.

E. Three Way Alignment

[0217] The three-way alignment algorithm comprises the following steps:

1) Each read is mapped to the reference genome.

2) If both reads have the same chromosomal position, the reference sequence of the mapped segment is extracted.

3) The raw reads and the reference sequence are aligned using multiple sequence alignment (MSA).

4) After the alignment, ONLY the raw reads are used to form a consensus.

5) Any discordant bases are identified and annotated (as N, as the first read's base, as a randomly assigned base, or by another method of calling consensus bases at discordant positions).

6) Any consensus (between raw reads) deletions are removed.

7) Final consensus is produced.

8) Quality scores are assigned as described below.

[0218] This alignment method can improve the solution space compared to the pairwise only alignment, leading to fewer random errors.
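Steps 4 through 7 of the three-way alignment can be sketched in Python as follows. The sketch assumes the multiple sequence alignment of step 3 has already been computed and is supplied as three equal-length gapped strings; it uses the "N" option of step 5 for discordant positions. The function and parameter names are hypothetical, and the reference row is accepted only to make explicit that it plays no part in the consensus itself.

```python
def three_way_consensus(read1_aln: str, read2_aln: str, ref_aln: str,
                        discordant_symbol: str = "N"):
    """Form a consensus from two aligned raw reads (the reference row is ignored).

    Returns the consensus string and the consensus indices of discordant bases.
    """
    if not (len(read1_aln) == len(read2_aln) == len(ref_aln)):
        raise ValueError("all MSA rows must have the same length")

    consensus = []
    discordant_positions = []
    for base1, base2 in zip(read1_aln, read2_aln):
        if base1 == base2:
            if base1 == "-":
                # Step 6: both raw reads agree on a deletion, so the column is dropped.
                continue
            consensus.append(base1)
        else:
            # Step 5: annotate the discordant position (here with "N").
            discordant_positions.append(len(consensus))
            consensus.append(discordant_symbol)
    # Step 7: the final consensus.
    return "".join(consensus), discordant_positions


if __name__ == "__main__":
    cons, flagged = three_way_consensus("ACG-TA", "ACGTTC", "ACGTTA")
    print(cons, flagged)
```

Other step 5 policies (taking the first read's base, or resolving the position with quality scores) would replace only the `else` branch.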
This is due to adding the reference genome sequence as an anchor, which reduces the degrees of freedom on the entire solution space.
[0219] In multiple sequence alignment, it can be assumed each read is the prediction from an independent classifier. When Classifier 1 and Classifier 2 give different predictions (e.g. read 1 and read 2), we can use Bayes' theorem to calculate the probability that Classifier 1 is correct. Let's denote:
1) P(C1) as the probability that Classifier 1 is correct, which is 0.99.
2) P(C2) as the probability that Classifier 2 is correct, which is 0.90.
3) P(D) as the probability that the classifiers disagree.
[0220] We want to find P(C1|D), the probability that Classifier 1 is correct given that there is a disagreement. According to Bayes' theorem:
$$P(C1 \mid D) = \frac{P(D \mid C1) \times P(C1)}{P(D)}$$
where P(D|C1) is the probability that there is a disagreement given that Classifier 1 is correct. Since Classifier 1 is correct, the disagreement must come from Classifier 2 being incorrect, so P(D|C1) = 1 - P(C2) = 0.10. To find P(D), we consider two scenarios: Classifier 1 is correct and Classifier 2 is wrong, or Classifier 1 is wrong and Classifier 2 is correct. Thus, P(D) is calculated as follows:
$$P(D) = P(C1) \times (1 - P(C2)) + (1 - P(C1)) \times P(C2) = 0.99 \times 0.10 + 0.01 \times 0.90 = 0.108$$
[0221] Finally, we can calculate P(C1|D) as follows:
$$P(C1 \mid D) = \frac{0.10 \times 0.99}{0.108} = 0.9167 \quad \text{(Eq. 3)}$$
So, the probability that Classifier 1 is correct when there is a disagreement between the two classifiers is approximately 91.67%.
[0222] The quality score when both reads have the same prediction can be calculated as follows. When both classifiers give the same prediction, the probability that both are wrong can be calculated using the complement of the probabilities that at least one of them is correct. Let's denote:
1) P(C1) as the probability that Classifier 1 is correct, which is 0.99.
2) P(C2) as the probability that Classifier 2 is correct, which is 0.90.
3) P(W1) as the probability that Classifier 1 is wrong, which is: 1 - P(C1) = 0.01.
4) P(W2) as the probability that Classifier 2 is wrong, which is: 1 - P(C2) = 0.10.
[0223] The probability that both classifiers are wrong when they give the same prediction can be calculated as:
$$P(W1 \cap W2 \mid \text{same prediction}) = \frac{P(W1 \cap W2)}{P(\text{same prediction})}$$
where P(W1 ∩ W2) is the probability that both classifiers are wrong, which is P(W1) × P(W2) = 0.01 × 0.10 = 0.001. P(same prediction) is the probability that both classifiers give the same prediction, which can be calculated as the sum of the probabilities that they are both correct or both wrong:
$$P(\text{same prediction}) = P(C1 \cap C2) + P(W1 \cap W2) = P(C1) \times P(C2) + P(W1) \times P(W2) \quad \text{(Eq. 5)}$$
which for the above example will be: 0.99 × 0.90 + 0.01 × 0.10 = 0.891 + 0.001 = 0.892. Therefore, the probability that both classifiers are wrong when they give the same prediction is:
$$P(W1 \cap W2 \mid \text{same prediction}) = \frac{0.001}{0.892} \approx 0.00112 \quad \text{(Eq. 6)}$$
So, the probability that both classifiers are wrong when they give the same prediction is approximately 0.112%.
[0224] In the case of DNA sequencing, P(W1) × P(W2) is much smaller due to the fact that the incorrectly called bases need to be the same (A/C/T/G/gap) for both reads 1 and 2. Therefore, the second term P(W1) × P(W2) can be ignored in Equation 5.
[0225] This method is generic, and can be applied to reference based, pairwise, and three-way alignment. The only difference would be the way raw read accuracy is calculated for reads 1 and 2. In the case of the reference-based method, the accuracy obtained by aligning each read to high confidence regions of the reference genome can be used as raw read accuracy.
[0226] In the case of three-way alignment, the accuracy can be calculated by comparing the alignment of read 1 and read 2 against the reference genome. In the case of pairwise alignment, the raw read accuracy can be calculated after alignment of the consensus and determining how often the raw read 1 and read 2 made mistakes with respect to the reference genome.
F.
Inference-Based Quality Scores
[0227] The method described above for calculating quality scores for an intramolecular consensus read based on accuracy of the independent classifiers produces a number of discrete quality scores according to whether the base calls for the aligned sequences are concordant or discordant. For example, in pairwise alignment, a quality score for a consensus read for concordant positions may be 30 (according to the probability calculated in Eq. 6 above, since -10 × log10(0.00112) ≈ 30), and a quality score for a consensus read for discordant positions may be 11 (according to the probability calculated in Eq. 3 above, since -10 × log10(1 - 0.9167) ≈ 11). Furthermore, when a consensus read cannot be called (e.g., because the quality of both raw reads at the discordant position is low), then the quality score may be 0. While this method provides a quality score associated with an intramolecular consensus read that can be used during downstream processing, such as in forming intermolecular consensus or variant calling, the quality of a particular consensus base call can have only one of two values (e.g., 11 or 30), which does not provide a measure of difference in quality between two different consensus base calls at different concordant positions, or a measure of difference in quality between two different consensus base calls at different discordant positions.
[0228] Another technique for assigning a quality score to a consensus read can utilize a machine learning model, trained to infer a quality score based on the two or more aligned sequences (e.g., insert read 1 and insert read 2). The machine learning model may recognize patterns for quality of consensus not only based on whether a particular position is concordant or discordant, but also based on other features of the aligned sequences. The features can include the other base calls near the concordant/discordant position (aka.
kmer information), the raw quality score of the base caller, read orientation, or various other related features. In particular, the machine learning algorithm can be a neural network that processes two or more aligned reads and generates a vector of quality scores for each position of the aligned reads. Such a model can be trained to generate a continuous range of quality scores, e.g., between 0 and 50, 60, 70, or more, for consensus calls at both concordant and discordant positions. These quality scores can produce better results in downstream processing, such as when forming intermolecular consensus or variant calling.
[0229] In an embodiment, one such neural model is a convolutional neural network (CNN) that takes aligned reads in terms of A, C, G, T and Gap and their respective quality scores (0 for gaps) as input and outputs the consensus calls and corresponding quality scores for each consensus call position. In yet another embodiment, another such neural network adopts an architecture similar to U-Net, which is widely used in image segmentation applications and is described at least in Ronneberger et al., “U-Net: Convolutional Networks for Biomedical Image Segmentation,” Computer Vision and Pattern Recognition, arXiv:1505.04597 (2015), which is hereby incorporated by reference in its entirety. A U-Net consists of an encoder-like path for extracting information and a decoder path for mapping features to their base position in the sequence read. The final layer of this augmented U-Net Neural Network architecture is a softmax layer to compute the probability of each possible nucleotide at each position in the sequence read. Similar to the conventional CNN architecture, the probabilities can be converted into quality scores (in Phred scale), and the possible nucleotides encoded for the consensus base calls are A, C, G, T, and Gap.
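As a rough illustration of the softmax-to-Phred conversion mentioned above, the following Python sketch turns per-position class probabilities into consensus calls and quality scores. The function name `consensus_from_probabilities`, the `max_q` cap, and the example probabilities are assumptions made for illustration; they are not taken from the described models.

```python
import math

# Output classes named in the text: four bases plus Gap (written "-" here).
ALPHABET = ["A", "C", "G", "T", "-"]

def consensus_from_probabilities(prob_rows, max_q=60):
    """Pick the most probable class per position and convert its
    error probability (1 - p) into a Phred-scaled quality score.

    prob_rows: list of 5-element probability rows, one per position.
    max_q: hypothetical cap so that a zero error probability stays finite.
    """
    calls, quals = [], []
    for row in prob_rows:
        best = max(range(len(ALPHABET)), key=lambda i: row[i])
        p_error = 1.0 - row[best]
        if p_error <= 0:
            q = max_q
        else:
            q = min(max_q, round(-10 * math.log10(p_error)))
        calls.append(ALPHABET[best])
        quals.append(q)
    return "".join(calls), quals

# Made-up softmax outputs for three positions.
probs = [
    [0.999, 0.0005, 0.0003, 0.0001, 0.0001],
    [0.05, 0.90, 0.02, 0.02, 0.01],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
consensus, quals = consensus_from_probabilities(probs)
print(consensus, quals)
```

Capping the score is one possible design choice; a trained model would more likely be calibrated against empirical error rates.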
In an embodiment, the neural network models described above (both the CNN embodiment and the U-Net-based embodiment) are trained using variant free regions of common biological samples. The neural networks described above can be implemented as part of an alignment workflow when making intramolecular consensus reads.
VI. Lossless Compressions for Consensus Sequence
[0230] Lossless compression allows the original data to be perfectly reconstructed from the compressed data with no loss of information. Embodiments can treat concordant positions differently than discordant positions. Additionally, a reference-based compression can be used.
A. Partial Order Alignment
[0231] Pairwise alignment and consensus results between HDD read 1 and read 2 can be ambiguous. This may be more often the case on homopolymers (as described in section VI.C below) and on tandem repeats, which are of clinical relevance in cancer microsatellite instability detection and clinical variant calling. Information on both read 1 and read 2 that is maintained in a lossless manner can be utilized in downstream consensus calling and variant calling. Tracking measurements on raw reads mitigates errors arising from incorrectly calling ambiguous discordant and concordant positions, but bottlenecks can occur during the transfer of raw HDD read data.
[0232] A solution to this problem may include partial order alignment of HDD reads and export of a serialized partial order alignment, which maintains both read 1 and read 2 information in a lossless manner. Instead of making a consensus call for the entire nucleic acid molecule, a consensus call is made only for certain base positions (concordant ones). For the base positions that have discordance, two or more of the bases can be output, potentially along with some of the raw data (e.g., quality scores). Those discordant positions may be resolved later, e.g., using intermolecular consensus.
The partial order alignment generates a directed acyclic graph (DAG), which can be sorted and exported as a linear sequence encoding.
[0233] In contrast with regular pairwise alignment, the ambiguity in alignment will be maintained, enabling the assignment of low-quality scores to discordance bubbles vs. single base mismatches. Such will be the case on homopolymers where exact positions of indels cannot be assigned. The HDD pairs can contain information about their relative alignment in addition to the single pair discordance. This information can be reused ‘as is’ in downstream variant calling. In contrast with pairwise alignment, partial order alignment can maintain ambiguity in alignment, which is beneficial in determining concordant vs. discordant read 1 and read 2 base positions in the read and the confidence in making such calls. In compression, partial order alignment export is expected to result in a similar compression ratio on HDD vs. raw reads as pairwise alignment and assignment of discordant bases, that is, ~40% when not considering the benefits coming from extraction and classification of adapters and UMIs, if present.
Example Encoding
[0234] Table 2 below shows examples of lossless encoding of pairwise concordant and discordant calls following pairwise alignment. The examples below merely illustrate the number of possible values for concordant positions and discordant positions, as opposed to a compression technique.
Table 2
Category | Encoding | Read 1 | Read 2 (reversed)
Concordant calls | Adenine | A | A
Concordant calls | Cytosine | C | C
Concordant calls | Guanine | G | G
Concordant calls | Thymine (or Uracil) | T | T
Discordant calls | Weak | A | T
Discordant calls | weak | T | A
Discordant calls | Strong | C | G
Discordant calls | strong | G | C
Discordant calls | pYrimidine | C | T
Discordant calls | pyrimidine | T | C
Discordant calls | Keto | G | T
Discordant calls | keto | T | G
Discordant calls | puRine | A | G
Discordant calls | purine | G | A
Discordant calls | aMino | A | C
Discordant calls | amino | C | A
Discordant indel | lower case a | A | -
Discordant indel | b | - | A
Discordant indel | lower case c | C | -
Discordant indel | d | - | C
Discordant indel | lower case g | G | -
Discordant indel | e | - | G
Discordant indel | lower case t | T | -
Discordant indel | u | - | T
[0235] For the concordant positions, there are only four possible concordant values: A, C, G, or T. Thus, such positions can be represented using less data than the discordant positions, potentially using only two bits. For example, 00 can be A; 01 can be C; 10 can be G; and 11 can be T. Other representations are possible that use more bits.
[0236] For the discordant positions, there are at least twelve possible discordant values, with eight more for a total of twenty if indels are taken into account. Thus, such positions can be represented using 4 bits if only the first 12 possible discordant values are used or 5 bits if all 20 possible discordant values are used.
[0237] To decompress such a compressed sequence, where not all of the positions are represented with the same amount of data, additional metadata can be used. For example, a separate bit vector can specify which positions are concordant (e.g., with a 0) or discordant (e.g., with a 1). As another example, a header file can specify the positions that are discordant. Thus, the decompression circuitry/module can assume that every two bits is a concordant base call until a discordant position is reached. The decompressor can keep a counter that increases after each position is read, and then cross-reference the current position with the next discordant position in the header file.
2. Method
[0238] FIG.15 shows a flowchart illustrating method 1500 for determining a partial order consensus sequence of a double-stranded nucleic acid molecule.
The method 1500 depicted in FIG.15 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method 1500 presented in FIG.15 and described below is intended to be illustrative and non-limiting. Although FIG.15 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.
[0239] At 1505, a first strand of the double-stranded nucleic acid molecule is sequenced to obtain a first sequence of base calls and a second strand of the double-stranded nucleic acid molecule is sequenced to obtain a second sequence of base calls. Ideally, the sequencing of the first and second strand of the double-stranded nucleic acid molecule produces a full hairpin duplex read (e.g., a full consensus sequence) that comprises the following segments: a start adapter, the first strand sequence (e.g., a forward or plus strand insert), a hairpin, the second strand sequence (e.g., a reverse or minus strand insert), and an end adapter. However, in some cases, a partial consensus sequence may be obtained that is missing any number of the above-described segments.
[0240] Examples of partial consensus sequences that may be obtained include, without limitation: (i) a partial sequence that includes the start adapter and a portion of the first strand of the double-stranded nucleic acid molecule, but is missing the hairpin, the second strand of the double-stranded nucleic acid molecule, and the end adapter; (ii) a “one+” sequence that includes the start adapter, the first strand of the double-stranded nucleic acid molecule, the hairpin and either part of or the entirety of the second strand of the double-stranded nucleic acid molecule, but is missing the end adapter; and (iii) a “U-turn” sequence that includes the start adapter, a portion of the first strand of the double-stranded nucleic acid molecule and a complementary portion of the second strand of the double-stranded nucleic acid molecule, and the end adapter, but is missing the hairpin. One of ordinary skill in the art would appreciate that additional examples of partial consensus sequences are possible, including any combination of examples (i), (ii), and (iii), for example, a partial consensus sequence combining a “one+” and a “U-turn” sequence.
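The read categories listed above can be sketched as a simple classification over which segments were detected in a read. The `ReadSegments` type, the `classify_read` function, and the returned labels are hypothetical names chosen for this sketch; how segments are actually detected (e.g., adapter and hairpin identification) is outside its scope.

```python
from dataclasses import dataclass

@dataclass
class ReadSegments:
    """Flags for which segments were found in a read (hypothetical type).
    A strand flag is True when any portion of that strand is present."""
    start_adapter: bool
    first_strand: bool
    hairpin: bool
    second_strand: bool
    end_adapter: bool

def classify_read(seg: ReadSegments) -> str:
    """Map detected segments onto the categories described in the text."""
    s = seg
    if s.start_adapter and s.first_strand and s.hairpin and s.second_strand and s.end_adapter:
        return "full"
    if s.start_adapter and s.first_strand and s.hairpin and s.second_strand and not s.end_adapter:
        return "one+"
    if s.start_adapter and s.first_strand and s.second_strand and s.end_adapter and not s.hairpin:
        return "U-turn"
    if s.start_adapter and s.first_strand and not (s.hairpin or s.second_strand or s.end_adapter):
        return "partial"
    # Combinations or anything not covered by the examples above.
    return "other"

print(classify_read(ReadSegments(True, True, True, True, False)))
```

Boolean flags cannot distinguish a complete strand from a portion of one, so a fuller implementation would likely carry segment coordinates instead.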
Regardless of the type of partial consensus sequence generated (e.g., full hairpin duplex, partial, “one+”, “U-turn”, or any combination thereof), such sets of concordant/discordant positions can be determined.
[0242] In some embodiments, the first set of concordant positions and the second set of discordant positions are identified by aligning the first sequence of base calls to the second sequence of base calls (e.g., aligning the base calls to each other). This method is referred to as reference free alignment and is described in more detail in section V-B. When aligning the first and second sequences of base calls to each other, the “duplex start position”, p, can be identified. The value p is defined as the first position on strand 1 at which bases from strand 2 are aligned with strand 1 bases (see FIG.13C). For “Full HD Reads” the duplex start position corresponds to p=0. For “One+” reads, the duplex start position is some value where p > 0, given that the trail region of the partial subread is considered to be discordant positions (i.e., they do not have another base to form consensus with). Accordingly, when One+ Reads are encoded, the start position of the duplex (designated as “p”) is communicated to the downstream module. This type of information, in addition to raw Q-scores and consensus Q-scores, may be encoded and have an extra bit allocated to it to preserve this information (e.g., in a full string header). For “HDD One+ Reads”, a local pairwise alignment of insert read 1 and insert read 2 would typically result in a high scoring alignment between the shorter of the two subreads and part of the longer of the two subreads.
[0243] In another embodiment, the first sequence of base calls, the second sequence of base calls, or both may be aligned to a reference genome corresponding to the origin of the sequence (e.g., a human sequence will be aligned to a human reference genome).
For example, the first sequence of base calls may be aligned to a first strand of a reference genome, while the second sequence of base calls may be aligned to the second strand of a reference genome. In some instances, a portion of the first set of concordant positions may not match the reference genome. Thus, a third set of concordant positions may be identified that match the reference genome and each of the third set of concordant positions may be represented with an indication of a genomic coordinate in the reference genome. The genomic coordinate indication can include a starting genomic coordinate of the first sequence of base calls and metadata specifying the concordant positions that do not match the reference genome. This method is referred to as reference-based alignment and is described in more detail in section V-A.
[0244] Each of the first set of concordant positions may be represented by a concordant value of a first group of four concordant values. The first group of four concordant values is specified using two binary bits and includes A<>T, C<>G, G<>C, and T<>A. Accordingly, each concordant value represents a concordant pair of bases between the first strand and the second strand of the double-stranded nucleic acid molecule. Each of the second set of discordant positions may be represented by a discordant value of a second group of at least twelve discordant values. The second group of at least twelve discordant values is specified using at least four binary bits and includes A<>A, A<>C, A<>G, C<>A, C<>C, C<>T, G<>A, G<>G, G<>T, T<>C, T<>G, and T<>T. Accordingly, each discordant value represents a discordant pair of bases between the first strand and the second strand.
In some embodiments, the second group of at least twelve discordant values includes at least twenty discordant values (accounting for insertions and deletions) and the at least twenty discordant values may be specified using five binary bits.
[0245] At 1515, the partial consensus sequence is generated using: (1) the concordant values at the first set of concordant positions; and (2) the discordant values at the second set of discordant positions. In some cases, the partial consensus sequence may not be for the whole double-stranded nucleic acid molecule, e.g., when reference-based compression is used. In such situations, the partial consensus sequence can correspond to the concordant positions that do not match the reference and to the discordant positions.
[0246] Generating the partial consensus sequence can include using metadata that specifies the second set of discordant positions (e.g., using headers that indicate which positions are discordant). In so doing, the metadata allows the concordant values for the first set of concordant positions and the discordant values for the second set of discordant positions to be used to recover the base calls of the first sequence and the second sequence at the first set of concordant positions and the second set of discordant positions.
[0247] For two pass HDD read constructs, only two reads are generated to determine a consensus sequence and call concordant and discordant bases. In the event a discordant base is called, because there are only two reads, a tie between the bases is reached. In some cases, the tie in base call may be resolved by leveraging the Q score, or by identifying that one of the discordant bases displays a very poor ADC signal. In the event the tie cannot be resolved, the information for both discordant bases is preserved. Because of this limitation, two pass HDD read constructs often reach the upper limit of compression that may be achieved.
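As a minimal sketch of the variable-width encoding and position metadata described above, the following Python code writes two bits for each concordant position and five bits for each of twenty discordant values, with a separate bit vector marking discordant positions. It assumes that read 2 has already been oriented so that equal characters mean a concordant position (as in Table 2), that both aligned strings have equal length, and that no position is a gap in both reads. The code assignments and the names `encode` and `decode` are illustrative choices, not the encoding of any particular embodiment.

```python
from itertools import permutations

BASES = "ACGT"
# Two-bit codes for concordant positions: A=00, C=01, G=10, T=11.
CONCORDANT = {b: format(i, "02b") for i, b in enumerate(BASES)}
# Twelve substitution pairs followed by eight indel pairs: twenty values, five bits each.
PAIRS = list(permutations(BASES, 2)) + [(b, "-") for b in BASES] + [("-", b) for b in BASES]
DISCORDANT = {pair: format(i, "05b") for i, pair in enumerate(PAIRS)}

def encode(read1, read2):
    """Return (flags, payload): flags has one bit per position (1 = discordant),
    payload is the concatenation of the 2-bit and 5-bit codes."""
    if len(read1) != len(read2):
        raise ValueError("aligned reads must have equal length")
    flags, payload = [], []
    for a, b in zip(read1, read2):
        if a == b:
            flags.append("0")
            payload.append(CONCORDANT[a])
        else:
            flags.append("1")
            payload.append(DISCORDANT[(a, b)])
    return "".join(flags), "".join(payload)

def decode(flags, payload):
    """Recover both aligned reads from the flags and the payload."""
    conc = {code: base for base, code in CONCORDANT.items()}
    disc = {code: pair for pair, code in DISCORDANT.items()}
    r1, r2, pos = [], [], 0
    for flag in flags:
        width = 2 if flag == "0" else 5
        code = payload[pos:pos + width]
        pos += width
        a, b = (conc[code], conc[code]) if flag == "0" else disc[code]
        r1.append(a)
        r2.append(b)
    return "".join(r1), "".join(r2)

flags, payload = encode("ACGT-A", "ACTTGA")
print(flags, payload, decode(flags, payload))
```

Bit strings are used here only for readability; a practical encoder would pack the bits and might use a list of discordant positions in a header instead of a full bit vector.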
On the other hand, four pass HDD read constructs have four reads compared to only the two reads generated for two pass HDD read constructs. The additional reads significantly decrease the rate at which discordant positions cannot be resolved, so greater compression is achieved.
[0248] At 1520, the partial consensus sequence is transmitted to a computer system. In one embodiment, the computer system may be a computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause the computer system to perform method 1500 for determining a partial order consensus sequence of a double-stranded nucleic acid molecule. The computer system may also comprise one or more processors configured to execute instructions stored on the computer readable medium.
B. Reference-Based Compression
[0249] FIG.16 is a flow chart illustrating method 1600 to compress a base call sub-stream from the raw read data generated by a sequencing device (e.g., nanopore-based sequencing device). The base call data can include a sequence of base calls (also referred to as a sequence read) for each of the at least 100,000 nucleic acid molecules, or for other numbers of molecules, such as at least 2, 3, 4, 5, 10, 50, 100, 1000, 10,000, 100,000, 500,000, one million or more nucleic acid molecules (in each of the same number of sequencing cells). For the sequence read corresponding to a nucleic acid molecule, the base call data comprises the base calls for each position in the sequence read. Method 1600 can be performed for each sequence of base calls corresponding to a respective nucleic acid molecule. The compressing can be of the second sub-stream of base call data described above.
[0250] The base call data sub-stream stores the sequence of bases in a nucleic acid molecule (e.g., DNA or RNA), referred to hereinafter as sequence read(s).
A sequence read in a base call data sub-stream may comprise a nucleic acid sequence as a string of A, T, C, G, U or N characters, where each letter denotes adenine (A), thymine (T), guanine (G), cytosine (C), uracil (U), or not determined or ambiguous (N).
[0251] In 1610, the sequence read is aligned relative to a reference sequence to obtain the genomic location information. This sequence alignment can be performed using various software packages, such as (but not limited to) BLAST, FASTA, Bowtie, BWA, BFAST, SHRiMP, SSAHA2, NovoAlign, and SOAP, or the techniques embodied with the software, or other techniques as known to the skilled person. The reference sequence can be a human reference sequence, such as hg18 or hg38.
[0252] The sequence alignment can generate an identifier that identifies the location within the reference sequence to which the read aligns. For example, the identifier may comprise the genomic start and end locations of the reference sequence on a chromosome (e.g., a human chromosome) from the reference genome (e.g., human genome) to which the sequence read aligns. Accordingly, the alignment position relative to the reference genome may be determined. For example, the first or last aligned position of the read (e.g., closest to a 3’ or 5’ end of the reference sequence) may be used to identify the alignment position or an alignment window. Other methods may be used to store the alignment coordinates. In some cases, the read may be a positive strand or a negative strand. A read is considered “positive” strand if the read aligns without reverse complementing the sequence read. An alignment is considered “negative” strand if the sequence read is to be reverse complemented prior to alignment.
Optimal alignment may be determined with the use of any suitable algorithm for aligning sequences, non-limiting examples of which include the Smith-Waterman algorithm, the Needleman-Wunsch algorithm, algorithms based on the Burrows-Wheeler Transform (e.g. the Burrows Wheeler Aligner), ClustalW, Clustal X, BLAST (e.g., BLASTn at http://www.ncbi.nlm.nih.gov/), Novoalign by Novocraft Technologies Sdn Bhd (Petaling Jaya, Malaysia), ELAND by Illumina, Inc. (San Diego, Calif.), SOAP (available at soap.genomics.org.cn), and Maq (available at maq.sourceforge.net).
[0253] In 1620, differences between the sequence read and the reference genome are identified. The difference can be of various forms, e.g., a substitution, insertion, or deletion.
[0254] At 1630, the outcome of the alignment, including the differences identified, may be used to encode the sequence read. Table 3 shows an example chart that can be used to encode a read that contains A, T, C, and G base calls using 14 possible encodings. The encodings shown in Table 3 are just an example and can be modified. The sequence read may then be encoded into a text or a bit string using the encodings. The bit string or text that is encoded at the base level can then be compressed in later steps. The encodings include a match, the 4 substitutions, 4 soft clips (the end of a read is not aligned), 4 insertions, and a deletion.
Table 3: Example Encodings
Char | Interpretation
= | Base matches reference
A | Base aligned with A substitution
C | Base aligned with C substitution
G | Base aligned with G substitution
T | Base aligned with T substitution
o | Softclip A base in read
p | Softclip C base in read
q | Softclip G base in read
r | Softclip T base in read
j | Inserted A
k | Inserted C
l | Inserted G
m | Inserted T
d | Deletion
[0255] In 1640, the genomic location information in the reference sequence is substituted for at least a portion of the sequence that matches the reference sequence. For example, if a portion of the nucleotides in the beginning of a sequence matches with the reference sequence and then there is one or more mismatches, the nucleotides in the first portion can be replaced by a start location relative to the reference sequence, a number that shows the length of the portion, and the code that represents a mismatch. The one or more mismatches may then remain as encoded. Any portion of matching sequences may similarly be replaced (i.e., to compress the sequence data) by a start location corresponding to the position of a first matching nucleotide and a length of the portion of matching sequences. The code for a sequence match may or may not be included. A portion of the sequence that matches with a reference sequence may be 2 bases, 3 bases, 5 bases, 10 bases, 20 bases, 30 bases, 40 bases, 100 bases, 500 bases, or longer. The portion can then be substituted with, for example, only 3 numbers including a chromosome number, a start location for a location of the first nucleotide in the portion that matches with the reference sequence, and the length of the portion. In some embodiments, the length of the read must be stored as part of the location and identification of the matching bases and may be used to decode the final compressed data.
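The Table 3 encoding and the replacement of matching runs by their length can be sketched in Python as follows. This is an illustrative sketch under several assumptions: the inputs are gapped alignment strings of equal length, a substitution is written as the base observed in the read, soft-clipped bases are passed in separately, and runs of "=" are collapsed to a decimal count. The function names `encode_read` and `collapse_matches` are hypothetical.

```python
import re

# Character assignments from Table 3.
SOFTCLIP = {"A": "o", "C": "p", "G": "q", "T": "r"}
INSERT = {"A": "j", "C": "k", "G": "l", "T": "m"}

def encode_read(ref_aln, read_aln, left_clip="", right_clip=""):
    """Encode one aligned read against the reference, one character per column.
    "-" in ref_aln marks an insertion in the read; "-" in read_aln marks a deletion."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    out = [SOFTCLIP[b] for b in left_clip]
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-":
            out.append(INSERT[read_base])
        elif read_base == "-":
            out.append("d")
        elif ref_base == read_base:
            out.append("=")
        else:
            out.append(read_base)  # assumption: substitution stores the read base
    out.extend(SOFTCLIP[b] for b in right_clip)
    return "".join(out)

def collapse_matches(encoded):
    """Replace each run of '=' with its length (an illustrative choice)."""
    return re.sub(r"=+", lambda m: str(len(m.group())), encoded)

encoded = encode_read("ACGT-ACGTA", "ACTTGAC-TA", left_clip="G")
print(encoded, collapse_matches(encoded))
```

Together with a chromosome identifier and a start coordinate from the aligner, the collapsed string would be enough to rebuild the read from the reference under these assumptions.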
[0256] In 1650, compressed base call data of the base call data sub-stream is generated using the location information, the encoded base calls, or a combination thereof. For example, an encoded sequence read may comprise a location relative to the reference genome such as a leftmost (or rightmost) position of the read, the positions where there is a match between the read and the reference sequence, and positions where there is an insertion, a deletion, or any other encoded mismatch. Compression of the encoded sequence read may then be performed by, for example, replacing the portions of the read that match the reference with the position number or a window of numbers. Different combinations of location and encoded sequence can be used to compress the sequence read.
C. Condensed Homopolymer Calls and Pairwise Alignment
[0257] Homopolymers, or the repeated presence of the same nucleotide in the template, offer unique challenges in sequencing and alignment. Following pairwise alignment, in the absence of additional information, a deletion or an insertion in a homopolymer cannot be assigned to a specific location. This creates systematic biases in homopolymer errors where a discordant indel in a homopolymer base is less likely to be detected since it cannot be assigned to a unique alignment position. In an algorithmic approach for homopolymer HDD consensus calls and pairwise alignment, it may prove beneficial to encode homopolymers as a combination of the base and the number of repeats. For instance, A7 will represent a repeat of an adenosine template base seven times: AAAAAAA. Such a condensed approach may be used in alignment when considering all or only longer homopolymers. This encoding may also be beneficial since a subset of error modes preferentially creates or applies to homopolymers as opposed to other k-mers: in template slippage, in voltage and temperature dependent insertions, and in dysfunctional state inserts.
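The condensed homopolymer representation (e.g., A7 for AAAAAAA) amounts to a run-length encoding of the base sequence. A minimal Python sketch follows; the function names and the `min_run` threshold, which models condensing "only longer homopolymers", are assumptions made for illustration.

```python
from itertools import groupby

def condense(seq, min_run=1):
    """Return a list of (base, count) pairs.
    Runs shorter than min_run are emitted base by base with count 1."""
    out = []
    for base, group in groupby(seq):
        n = len(list(group))
        if n >= min_run:
            out.append((base, n))
        else:
            out.extend((base, 1) for _ in range(n))
    return out

def expand(pairs):
    """Invert condense(): rebuild the original sequence."""
    return "".join(base * count for base, count in pairs)

def to_text(pairs):
    """Render pairs in the A7 style, omitting a count of 1."""
    return "".join(base if count == 1 else f"{base}{count}" for base, count in pairs)

pairs = condense("GAAAAAAAC")
print(pairs, to_text(pairs), expand(pairs))
```

Aligning the condensed pairs rather than individual bases lets a length difference in a homopolymer be treated as a single discordance on the run instead of an indel at an arbitrary position.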
[0258] In a pairwise condensed homopolymer call, the quality of forward and reverse complement homopolymers can be uniquely determined based on empirical observations and applied to the entire homopolymer rather than a specific base. Importantly, quality scores can be considered for read 1 vs. read 2, and parent vs. daughter as well as base vs. complement base in determining empirical consensus base quality. This may be done as part of a machine learning or neural network-based approach or tabular calibration per each scenario, e.g., A7 on read 1 and a T7 on read 2 may offer a higher probability of being accurate versus the complement T7 on read 1 and A7 on read 2.
[0259] Provided is another non-limiting example of how homopolymer regions can be challenging to resolve in sequencing and alignment in the absence of additional information. Below are four possible alignments of a homopolymer region, where the top sequence has a run of T’s inserted, while the bottom sequence has a run of A’s inserted that are discordant (bold), while the surrounding sequences are concordant:
Possibility 1:
C C G C - - - - T T T T C G C G
C C G C A A A A - - - - C G C G
Possibility 2:
C C G C T T T T - - - - C G C G
C C G C - - - - A A A A C G C G
Possibility 3:
C C G C - T T - T T - - C G C G
C C G C A - - A - - A A C G C G
Possibility 4:
C C G C - T - T - T - T C G C G
C C G C A - A - A - A - C G C G
[0260] As illustrated by the four possibilities above, it is not possible to know which alignment is correct with the information available. As shown in Possibilities 1 and 2, the A’s could go first/second, and the T’s could be second/first, where either the A’s or the T’s could be insertions and the dashes deletions (e.g., depending on which one is your ground truth target and which is the reference). On the other hand, Possibilities 3 and 4 integrate the A’s and T’s in between the dashes.
FIG.17 shows an illustration of how these possibilities may be viewed graphically.
VII. Lossy Compressions for Consensus Sequence
[0261] Lossy compression algorithms are techniques that reduce file size by discarding less important information. The algorithms accomplish this by using inexact approximations and partial data discarding to represent the content. When done effectively, lossy compression technology can reduce data size without degrading the quality of the data. Encoding discordances are described below in several exemplary, and not limiting, embodiments. Several possible approaches are enumerated here with example alignment objects, consensus read sequences and variable length header strings, in order to convey different conceptual approaches for achieving compressed representations.
A. Example Criteria for Selecting Base call
[0262] Various criteria can be used to determine a consensus base call when the base calls at a position are discordant.
1. Base Call Quality Score
[0263] As part of base calling, the quality of the identified base generated during sequencing may be measured with a quality score (also referred to as a Phred quality score or Q-score). Quality scores may be assigned to each base call during sequencing and stored in a file (e.g., a FASTQ sequencing file). Quality scores can relate (e.g., logarithmically) to the base calling error probabilities of the sequencing system. For example, a quality score of 30 can indicate that the base is called incorrectly once in every 1,000 bases, i.e., the call is 99.9% accurate. On the other hand, a quality score of 60 can indicate that the base is called incorrectly once in every 1,000,000 bases, i.e., the call is 99.9999% accurate. With respect to sequencing, Phred quality scores may be used to assess sequencing quality, recognize and remove low-quality sequences, and determine an accurate consensus sequence.
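The logarithmic relationship between a Phred quality score Q and the error probability p is conventionally Q = -10·log10(p). The following minimal sketch reproduces the Q30 and Q60 figures quoted above; the function names are hypothetical, and the Phred+33 character offset is the common FASTQ convention, assumed here and not stated in the text.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score implied by an error probability: Q = -10*log10(p)."""
    return -10 * math.log10(p)

def phred_to_fastq_char(q, offset=33):
    """ASCII character for a score, assuming Phred+33 FASTQ encoding."""
    return chr(q + offset)

for q in (30, 60):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.0e}, accuracy {100 * (1 - p):.4f}%")
```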
[0264] Quality scores can be determined as part of the sequencing workflow. A base caller can determine several parameters such as peak shape and resolution of each base. The base caller then uses these parameters to look up a corresponding quality score in hard coded, established look-up tables (LUTs). A low-quality base can result when there is an equal or similar probability between two bases, e.g., near an edge of a cutoff value separating two bases or similar probability from an ML model.
[0265] Such knowledge of a low-quality score and which bases have similar signal values (e.g., the bases could have increasing signal values in the order of A, C, G, and T, with T having the highest signal value) can be used to resolve discordant base calls in some instances. Other orders can be used. If a base call for T has a low-quality score, then it can be surmised that a likely other base is a G. Such a determination can also be known when the base caller uses more complex techniques. This information can be used when determining an intramolecular consensus read. For example, if one strand had a high-quality C and the other strand had a low-quality T, then the position could be called as concordant for C-G, where the base call for T is assumed to be an error. Such techniques are described in more detail herein for determining a consensus read. In other embodiments, knowledge of the specific other base that had the second highest probability can be provided from the base caller to the circuitry that determines the consensus read and that implements any data compression.
[0266] A raw signal can include raw voltage signals that produce a base call and a quality score.
2. Read Quality Score (read orientation)
[0267] Which read a base call is on can be tracked. For example, which strand the base call is on can be tracked. Also, whether the read is from a daughter read or a parent read can be tracked.
These different reads can have different associated errors. Each read can have an associated quality score, which can translate into a corresponding weight when determining the consensus base. For instance, a base call can be modified/weighted (e.g., multiplied) by its base quality score and its read score, and then a weighted sum for each of the different base calls can be determined, thereby obtaining a final score for each base call. The final score can be used to determine whether to make a concordant or discordant call. For example, a concordant call might be made when a first final score of a first base call is higher by a threshold value than a second final score of a second base call.
3. Kmers per strand
[0268] The read quality score and/or the base quality score can be influenced by the base calls that come before and/or after the current position being analyzed. Such a Kmer context can be determined for each position, and a quality score for the Kmer can be determined. For example, for K=7, there might be an A in read 1 and a G in read 2. One of those two base calls is wrong given that A pairs with T in base complementation. Assuming they have the same base quality score, but if the A is part of a particular 7-mer, training information can be collected about how accurate that A is depending on what the other bases are around it in the 7-mer. The A can be in a position of the 7-mer where there rarely is a mistake. Then for the other read, the quality can be low (i.e., more likely an error) for the T in that 7-mer on the other strand. In this way, one can weight not only by which read but also by k-mer context. Different reads (e.g., strands and whether a daughter) can have different Kmer context quality values.
B. Read Orientation
[0269] The effects of data compression algorithms are typically only considered in the downstream processing, transferring, storage, and archival of high-throughput sequencing data.
However, errors that occur in upstream processes such as sample preparation, library preparation, and sequencing also contribute to the amount of data generated by these downstream processes. For example, in the context of SBX, Xpandomer synthesis and the sequencing process each have a respective error profile that influences downstream processing steps such as consensus read calling, base calling, variant calling, and the like. By taking into account these error profiles, improvements in consensus read calling, base calling, variant calling, etc. reduce the overall size of the sequencing data.
[0270] In one instance, read 1 and read 2 can have different error profiles as a result of different kinetic rates at which they were synthesized. In addition, when Xpandomers are used, an Xpandomer is typically generated from target DNA molecules that are double-stranded. As a result, during Xpandomer synthesis, the error rate of the first processed read could be different from the error rate of the second processed read. Difference in strand synthesis error rate may be impacted by: (i) certain modifications (e.g., biochemical or DNA damaging) inherently being more common on one strand orientation versus its complement; and (ii) sequencing errors being influenced by the surrounding bases.
[0271] By way of example, without limitation, differences in observed Xpandomer synthesis error profiles can arise for homopolymer motifs when the target DNA molecule stretch is in a double vs single stranded state. Specifically, it may be observed that for the “read 1” stretch of the Xpandomer (which corresponds to the first pass on the insert when the insert is still in double stranded DNA form), homopolymer accuracy may be higher than for the “read 2” stretch of the Xpandomer (which typically corresponds to the second pass on the insert, when the complementary insert strand is now more likely to be in a single stranded state).
When accuracy deltas are expected as a function of kmer motif and read (or pass) number, these modeled differences in expected raw read accuracies can be leveraged to break ties where multiple passes over the same stretch of insert molecule differ. More generally, they may be used to properly weigh the evidence from multiple, disagreeing read passes, through the machinery of a statistical model or machine learning based consensus read generating algorithm.
[0272] To overcome the above-mentioned challenges, information on original and Xpandomer synthesis strand orientation from HDD raw reads and on HDD consensus calling may be tracked. An algorithmic approach to this problem may include collecting, in addition to read 1 and read 2 base information, information regarding the alignment orientation of a read to the original parent molecule. Read orientation is of particular interest in consensus and concordance calls and in the assignment of base quality to concordant bases. For instance, in a parent-daughter configuration, if a certain biochemical mechanism preferentially damages parent C nucleotides and converts them to uracil (U), the following amplification steps will generate T nucleotides. If this mechanism is much more prevalent than a different mechanism which converts G nucleotides to A nucleotides on the parent strand, an observation of a concordant T-T may be more likely to be erroneous compared to an A-A concordant base when aligning to the same reference position. This will be the case if both are the outcome of the respective parent C>T and daughter G>T, and parent G>A and daughter C>T mismatches respectively. If both forward and reverse orientations are sequenced, the same T-T concordant call may be more accurate depending on whether it is the complement or in the same orientation as the original DNA molecule template.
Strand orientation information can be recovered and tracked as part of the hairpin and/or adapter orientations and in combination with the orientation of read 1 and read 2 with respect to the reference genome. This additional information can be used in variant calling, assignment of quality scores, and consensus.
1. Method
[0273] FIG.18 shows a flowchart illustrating method 1800 for determining a consensus sequence of a double-stranded nucleic acid molecule. The method 1800 depicted in FIG.18 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method 1800 presented in FIG.18 and described below is intended to be illustrative and non-limiting. Although FIG.18 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.
[0274] At 1805, a first strand of the double-stranded nucleic acid molecule is sequenced to obtain a first sequence of base calls. Each of the first sequence of base calls has a first quality score and a first label that corresponds to the first strand. Similarly, a second strand of the double-stranded nucleic acid molecule is sequenced to obtain a second sequence of base calls. Each of the second sequence of base calls has a second quality score and a second label that corresponds to the second strand. The first weight and the second weight can be dependent on base calls adjacent to the discordant position.
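The quality scores and strand labels obtained at 1805 can feed a weighted comparison of the kind described earlier for read quality scores, where each base call is multiplied by its base quality score and a per-read weight and a call is made only when one final score exceeds the other by a threshold. The sketch below is one possible reading of that scheme, not the claimed implementation. The names `BaseCall` and `consensus_call`, the example weights, the threshold value, and the use of "N" for an unresolved position are all assumptions, and the read 2 call is assumed to have already been converted to the read 1 orientation.

```python
from dataclasses import dataclass

@dataclass
class BaseCall:
    base: str       # called base, expressed in the read 1 orientation
    quality: float  # base quality score (e.g., Phred)
    label: str      # which read/strand the call came from

def consensus_call(c1, c2, weights, threshold=5.0):
    """Pick a consensus base for one aligned position.

    weights maps a read label to its (assumed) read-level weight.
    Returns 'N' when neither weighted score wins by the threshold.
    """
    if c1.base == c2.base:          # concordant: nothing to resolve
        return c1.base
    s1 = c1.quality * weights[c1.label]
    s2 = c2.quality * weights[c2.label]
    if s1 - s2 >= threshold:
        return c1.base
    if s2 - s1 >= threshold:
        return c2.base
    return "N"                      # left as discordant

# Hypothetical weights: read 1 trusted slightly more than read 2.
weights = {"read1": 1.0, "read2": 0.8}
print(consensus_call(BaseCall("C", 40, "read1"),
                     BaseCall("T", 10, "read2"), weights))
```

A k-mer-dependent weight could be substituted for the fixed per-label weight by looking it up from the bases adjacent to the discordant position.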
[0275] Following sequencing, at 1810, a first set of concordant positions and a second set of discordant positions are identified using the first sequence of base calls and the second sequence of base calls, respectively.
[0276] In some embodiments, the first set of concordant positions and the second set of discordant positions are identified by aligning the first sequence of base calls to the second sequence of base calls (e.g., aligning the base calls to each other). This method is referred to as reference free alignment and is described in more detail in section V-B.
[0277] In another embodiment, the first sequence of base calls, the second sequence of base calls, or both may be aligned to a reference genome corresponding to the origin of the sequence (e.g., a human sequence will be aligned to a human reference genome). For example, the first sequence of base calls may be aligned to a first strand of a reference genome, while the second sequence of base calls may be aligned to a second strand of the reference genome. In some instances, a portion of the first set of concordant positions may not match the reference genome. Thus, a third set of concordant positions may be identified that match the reference genome and each of the third set of concordant positions may be represented with an indication of a genomic coordinate in the reference genome. The genomic coordinate indication can include a starting genomic coordinate of the first sequence of base calls and metadata specifying the concordant positions that do not match the reference genome. This method is referred to as reference-based alignment and is described in more detail in section V-A.
[0278] For each discordant position of the second set of discordant positions, a consensus base call is determined using the first quality score, the second quality score, a first weight corresponding to the first label, and a second weight corresponding to the second label.
The consensus base call is determined at an initial discordant position of the second set of discordant positions. This involves changing the initially discordant position to a concordant position for the first base call on the first strand. This change may be based on either: (i) the first quality score is higher than the second quality score for a second base call of the second strand; (ii) the first weight is higher than the second weight; or (iii) the concordant base on the second strand has a measured signal that is adjacent to the second base call. The reasoning behind options (i) and (ii) regards the notion that, typically, the first base call has higher accuracy compared to the second base call. [0279] Following consensus base calling, at 1815, the consensus sequence is generated using: (1) the concordant values at the first set of concordant positions; and (2) the consensus base calls at the second set of discordant positions. In some embodiments, the consensus sequence may be a partial consensus sequence as described above. [0280] Finally, at 1820, the consensus sequence is transmitted to a computer system. In one embodiment, the computer system may be a computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause the computer system to perform method 1800 for determining a partial order consensus sequence of a double-stranded nucleic acid molecule. The computer system may also comprise one or more processors configured to execute instructions stored on the computer readable medium. C. Wobble Nucleotides [0281] In some circumstances it may also be beneficial to include wobble nucleotides such as inosine to facilitate synthesis and reduce complementarity on hard to process or amplify motifs. 
One such circumstance may be when the base pairing is indeterminate, and thus by incorporating an inosine base, which is concordant to A, C, T, and G bases, synthesis and high discordance can be avoided. Such modifications may apply only to one of the two HDD subreads, for instance a synthesized daughter strand. Under such circumstances a certain set of discordant bases from the table below will be accepted as valid, though lower confidence base calls. The information on both read 1 and read 2 base calls may be encoded and become recoverable downstream to be leveraged in variant calling and/or applied to base quality. Table 4 provides an example lossless encoding table of pairwise concordant and discordant calls following pairwise alignment in the presence of wobble nucleotides.
Table 4
Example Wobble Additives to Daughter Strand Read Read Category Encoding 1 2 Inosine Uracil Adenine A T A A Cytosine C G C C Concordant Calls Guanine G C G G Thymine (or Uracil) T A T T Weak A a weak t T Strong C c Discordant Calls or Wobble strong g G concordance pYrimidine C t pyrimidine Wobble T call t C daughter A paired I complemented with C Keto G t keto t G puRine Wobble A call daughter-T paired with U A g U complemented with G purine Wobble G call daughter-C paired g A I complemented with A aMino A c amino a C
VIII. Example Compression Techniques
[0282] Described below are a variety of non-limiting exemplary approaches that may be used to achieve read compression that are particular to determining intramolecular consensus. Importantly, the examples provided below demonstrate that based on the compression technique used, a higher compression ratio may be achieved.
A. Lossless
1. Consensus Quality Score Vector
[0283] Exemplary embodiment 1 for generating a consensus read from an alignment object is outlined below.
1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
2) For discordant positions one of the two characters in either read 1 or read 2 is chosen according to some selection criteria. That selection criteria may involve, for example, one of the following options:
   a. Comparing the raw read quality scores between the read 1 character and the read 2 character (see section VII-A.1 for description).
   b. Comparing the kmer context for each of the two characters on read 1 and read 2 (see section VII-A.3).
   c. Considering which read the character occurs on (i.e., base quality score based on read orientation; see section VII-A.2).
   d. Considering error profiles for upstream processing events such as physical processing of the molecule of interest, sample preparation, library preparation, target enrichment, and sequencing preparation. Each of these methods may comprise information about which base transversions (e.g., conversion of a single purine to a pyrimidine, or vice versa) are more or less likely to result from possible chemical lesions in the original DNA molecules during each upstream processing event. See sections III-A, B, and C as well as section VII-B for additional descriptions.
   e. Some combination of the above criteria.
3) Alternatively (instead of step 2), an arbitrary base may be randomly selected to occupy the positions in the consensus read corresponding to each discordant set of positions in the alignment object (e.g., wobble nucleotides discussed in section VII-C).
4) A second vector of data, less than or equal in length to the alignment object, with a single bit per element, can be used to encode whether each position in the consensus read was generated from concordant or discordant characters.
This second vector of data might be a Consensus Quality Score vector, where the quality scores are represented by some specific number of bits per consensus base, such as a single bit per consensus base as one example.
5) A third variable length vector of data, sometimes referred to as part of or as a full header string, is used to encode the combination of characters that represents the original two characters in read 1 and read 2 at each of the discordant positions.
6) Each character in the variable length vector of data might be assigned five bits, for example. Of the 32 possible states encoded by five bits, 16 of those possible states could be assigned to the 16 possible combinations of nucleotide base mismatches, as well as base insertions and deletions. An example of a combination with 16 possible states is given by the set of “Discordant Calls” and “Discordant Indels” listed in Table 2.
7) A naive implementation of the above algorithm and resulting data structure would include allocating 3 bits per base in the consensus read, and an additional 5 bits for each discordant position, but no loss of information at discordant locations.
2. Variable Length Encoding
[0284] Variable length encoding uses a different number of bits to encode different symbols. For example, an A in a sequence may be encoded by ‘0’ and only use 1 bit. On the other hand, C may be encoded by ‘10’ using 2 bits, G may be encoded by ‘110’ using 3 bits, and T may be encoded by ‘111’ using 3 bits. The process of transforming symbols (A, C, G, and T) into their binary word or sequence (0, 10, 110, 111 respectively) is referred to as (variable length) encoding and is performed by an encoder. Variable length encoding can use short codewords, requiring fewer bits, for common symbols with a high probability of occurring and longer codewords, requiring more bits, for symbols with a low probability of occurring.
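As a minimal sketch of the variable length encoding just described, the codewords 0, 10, 110, and 111 listed above for A, C, G, and T form a prefix code, so a bit stream can be decoded unambiguously by reading bits until a codeword is recognized. The names `CODEBOOK`, `encode`, and `decode` are hypothetical, and bits are represented as a text string of '0' and '1' characters purely for readability.

```python
# Codewords taken from the parenthetical list in the text; no codeword
# is a prefix of another, which is what makes greedy decoding possible.
CODEBOOK = {"A": "0", "C": "10", "G": "110", "T": "111"}

def encode(seq, codebook=CODEBOOK):
    """Concatenate the codeword of each symbol."""
    return "".join(codebook[symbol] for symbol in seq)

def decode(bits, codebook=CODEBOOK):
    """Read bits left to right, emitting a symbol at each codeword."""
    inverse = {word: symbol for symbol, word in codebook.items()}
    symbols, word = [], ""
    for bit in bits:
        word += bit
        if word in inverse:
            symbols.append(inverse[word])
            word = ""
    if word:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(symbols)

bits = encode("AACGTTA")
print(bits, len(bits), decode(bits))
```

With this codebook a sequence dominated by A costs close to 1 bit per symbol, while a sequence dominated by G or T costs 3 bits per symbol, which is why the shortest codewords are assigned to the most probable symbols.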
An advantage of this method is that less storage space is needed and transmitting the data from one place to another is faster. However, variable length encoding, depending on the complexity, can make decoding more difficult and increase the demand for computational power and circuit cost.
[0285] Additional, and non-limiting, examples of variable length encodings that are described in the examples below may include: (i) encoding concordant pairs with 1 bit and discordant pairs with at least 2 bits; and (ii) frequently occurring or common mutations, sequencing errors, etc. may be assigned shorter codewords requiring fewer bits, while infrequent or rare mutations may be assigned longer codewords requiring more bits.
[0286] As used herein, a “code” or a “codebook” refers to a mapping between symbols and binary (or non-binary) words (e.g., codeword). A codebook provides information on the structure, contents, and layout of a data file. Typically, a codebook can include: column locations and widths for each variable, definitions of different record types, response codes for each variable, codes used to indicate nonresponse and missing data, exact questions and skip patterns used in a survey, other indications of the content and characteristics of each variable, etc. Additional elements that may be included in a codebook include: frequencies of response, survey objectives, concept definitions, a description of the survey design and methodology, a copy of the survey questionnaire, information on data collection, data processing, and data quality, etc. A ‘codeword’ refers to the binary (or non-binary) word or sequence used to represent the symbol.
[0287] FIGS.19A & 19B show exemplary embodiment 2 for generating a consensus read from an alignment object, outlined below.
1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
2) For discordant positions one of the two characters in either read 1 or read 2 is chosen according to some selection criteria. That selection criteria may involve, for example, one of the following options:
   a. Comparing the raw read quality scores between the read 1 character and the read 2 character (see section VII-A.1 for description).
   b. Comparing the kmer context for each of the two characters on read 1 and read 2 (see section VII-A.3).
   c. Considering which read the character occurs on (i.e., base quality score based on read orientation; see section VII-A.2).
   d. Considering error profiles for upstream processing events such as physical processing of the molecule of interest, sample preparation, library preparation, target enrichment, and sequencing preparation. Each of these methods may comprise information about which base transversions (e.g., conversion of a single purine to a pyrimidine, or vice versa) are more or less likely to result from possible chemical lesions in the original DNA molecules during each upstream processing event. See sections III-A, B, and C as well as section VII-B for additional descriptions.
   e. Some combination of the above criteria.
3) Alternatively (instead of step 2), an arbitrary base may be randomly selected to occupy the positions in the consensus read corresponding to each discordant set of positions in the alignment object (e.g., wobble nucleotides discussed in section VII-C).
4) A second variable length vector of data, sometimes referred to as part of or as a full header string, is used to encode the combination of characters that represents the original two characters in read 1 and read 2 at each of the discordant positions.
5) Information about which consensus positions were from concordant bases in the alignment object, which consensus positions were from discordant characters in the alignment object, and the set of values for each of those original discordant characters can be stored using one or more variable length encoding strategies within the variable length second vector of data.
6) An example variable length encoding strategy, where the “Discordant Calls” and “Discordant Indels” codes from Table 2 are mapped to the “Not a Match” table used by the variable length encoding algorithm of Embodiment 2, is provided in FIG.19A.
7) For the alignment object example chosen, the resulting example consensus read and header data are provided in FIG.19B, as is a computation of the number of bits required for the header in this particular case. With reference to Table 2, Header values of R, t, and e correspond to lossless encodings puRine (read 1 A and read 2 G), lower-case t (read 1 T and read 2 -), and lower-case e (read 1 C and read 2 -) respectively.
3. Context Aware
[0288] FIGS.20A & 20B show an exemplary embodiment 3 for generating a consensus read from an alignment object, outlined below.
1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
2) For discordant positions the characters in read 1 are chosen.
3) A second variable length vector of data, sometimes referred to as part of or as a full header string, is used to encode the differences in the sequence of read 2 relative to the sequence of read 1. An example of this context aware variable length encoding strategy is provided in FIG.20A.
4) For the example chosen, the resulting example consensus read and header data are provided in FIG.20B, as is a computation of the number of bits required for the header in this particular example case.
4.
Dynamic Codes
[0289] Dynamic codes can use predictive arithmetic coding to predict one bit at a time. This becomes especially useful when, during sequencing, a shift in distribution of error rate, mutation type, etc. is encountered and the codebook needs to be switched to a new codebook compatible with the distribution shift.
[0290] An exemplary embodiment 4 for generating a consensus read from an alignment object is outlined below.
1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
2) For discordant positions the characters in read 1 are chosen.
3) A second variable length vector of data, sometimes referred to as part of or as a full header string, is used to encode the differences in the sequence of read 2 relative to the sequence of read 1.
4) The encoding approaches described in embodiment 3 and embodiment 4 could be replaced by other variable length codes, either static or dynamic codes. Dynamic codes may be static over certain subsets of the data or of subsets of the run, or subsets of sets of runs or conditions, but dynamic over the full set of data from a run, or even at the scale of multiple runs or run conditions.
B. Lossy
1. Dropping Discordant Positions
[0291] FIG.21 shows exemplary embodiment 5 for generating a consensus read from an alignment object, outlined below.
1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
2) For discordant positions in the alignment object, a 5th character ‘N’ is chosen.
3) A naive implementation of the above algorithm and resulting data structure would include allocating 3 bits per base in the consensus read, and a loss of information at discordant locations.
2. Using Quality Scores and Read Orientation
[0292] FIG.22 shows exemplary embodiment 6 for generating a consensus read from an alignment object.
In this particular alignment object, one of the two bases at each discordant position is selected based on selection criteria. A second vector of data of less than or equal length to the alignment object comprises a single bit per element. The second vector is used to encode whether each position in the consensus read is generated from concordant or discordant characters. The second vector of data is a consensus quality score vector, where the quality scores are represented by a single bit per consensus base. The steps are outlined below.
1) At every concordant position, the consensus base is chosen (in this case it is the same base for both reads).
2) For discordant positions one of the two bases (or more generally referred to as characters in the alignment object to be inclusive of dashes) in either read 1 or read 2 is chosen according to some selection criteria. That selection criteria may involve, for example, one of the following:
   a. Comparing the raw read quality scores between the read 1 character and the read 2 character (see section VII-A.1 for description).
   b. Comparing the kmer context for each of the two characters on read 1 and read 2 (see section VII-A.3).
   c. Considering which read the character occurs on (i.e., base quality score based on read orientation; see section VII-A.2).
   d. Considering error profiles for upstream processing events such as physical processing of the molecule of interest, sample preparation, library preparation, target enrichment, and sequencing preparation. Each of these methods may comprise information about which base transversions (e.g., conversion of a single purine to a pyrimidine, or vice versa) are more or less likely to result from possible chemical lesions in the original DNA molecules during each upstream processing event. See sections III-A, B, and C as well as section VII-B for additional descriptions.
   e. Some combination of the above criteria.
3) Note that selection of one and only one of the two characters for recording in that position in the consensus read results in a loss of information relative to the full alignment object.
4) A second vector of data, less than or equal in length to the alignment object, with a single bit per element, can be used to encode whether each position in the consensus read was generated from concordant or discordant characters. This second vector of data might be considered to be a consensus quality score vector, where the quality scores are represented by a single bit per consensus base, for example binary 0 and 1.
5) A naive implementation of the above algorithm and resulting data structure would include allocating 3 bits per base in the consensus read, and a loss of information at discordant locations, though slightly less of a loss of information as compared with Embodiment 1.
IX. Compressions for Different Read Constructs
[0293] Further details are provided for handling the different read classes in FIG.12 and the different read constructs generated with differing number of passes.
A. Treatment of Full HDD vs. One+ and other non-Full HDD Reads
[0294] As described with respect to FIG.13C, the “Duplex Start Position”, p, is identified as the first position on Insert Read 1 at which bases from Insert Read 2 are aligned with Insert Read 1 bases. For “Full HDD Reads” the Duplex Start Position, p, is equal to 0. For “One+” reads, the Duplex Start Position will be some value p greater than 0. Both Full HDD and One+ reads can be accommodated by the encoding strategies described in sections VII and VIII on HDD read classes, and others, by allocating a first bit to a header string (or sub-header string) which indicates whether the HDD read, and therefore alignment object as well, corresponds to a Full HDD versus a One+ read.
The encoding methods then treat the two scenarios differently by inserting a step early in the variable length header string generation for One+ reads only.
[0295] Described below is a sub-method for accommodating the possibility of both Full HDD read constructs and One+ read constructs.
1) An algorithm classifies the alignment object as either a Full HDD read construct or a One+ read construct, and a corresponding value is stored in the 0th bit of the variable length header code.
2) If the algorithm classifies the alignment object as a Full HDD read construct, then the algorithm skips steps 3 and 4 and begins to look for concordant or discordant positions.
3) If the algorithm classifies the alignment object as a One+ read construct, then the algorithm allocates a certain number of bits to store the Starting Duplex Position (SDP) relative to the alignment object or to read 1. The number of bits allocated depends on the compression method chosen.
4) The number of bits allocated to store the SDP may be chosen to achieve the highest average compression ratios for the given run conditions. The range of the SDP number may depend, for example, on the expected DNA insert length distribution for a given experimental run condition. The number of bits allocated to SDP values should not be more than necessary to accommodate the expected insert length distribution for a given experiment.
[0296] In some instances, an alternative approach for One+ read constructs can be used. Instead of recording the start position of the duplex segment relative to position 0 on read 1, the second stream of data, which records read 2 in a lossless way, may start recording the concordant or discordant information about read 2 from the hairpin adapter side of the HDD alignment object. Accordingly, there would most often not be a gap between the “start” of read 1 and “start” of read 2.
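As an illustration of the sub-method in paragraph [0295], the following minimal Python sketch writes and reads the leading header bits. The flag polarity (0 for Full HDD, 1 for One+), the fixed-width SDP field, and the names `encode_header`, `decode_header`, and `sdp_bits` are assumptions made for this example only; the specification leaves the number of SDP bits to the chosen compression method and run conditions.

```python
from typing import Tuple


def encode_header(duplex_start: int, sdp_bits: int = 8) -> str:
    """Return the leading header bits as a string of '0' and '1'.

    Assumed convention: bit 0 is '0' for a Full HDD read (p == 0) and
    '1' for a One+ read (p > 0), followed by a fixed-width SDP field.
    """
    if duplex_start < 0:
        raise ValueError("duplex start position must be non-negative")
    if duplex_start == 0:
        # Full HDD: only the flag bit is written (steps 3 and 4 are skipped).
        return "0"
    if duplex_start >= (1 << sdp_bits):
        raise ValueError("duplex start position does not fit in sdp_bits")
    # One+: flag bit followed by the SDP in sdp_bits binary digits.
    return "1" + format(duplex_start, "0{}b".format(sdp_bits))


def decode_header(bits: str, sdp_bits: int = 8) -> Tuple[int, int]:
    """Return (duplex_start, number_of_header_bits_consumed)."""
    if bits[0] == "0":
        return 0, 1
    return int(bits[1:1 + sdp_bits], 2), 1 + sdp_bits


if __name__ == "__main__":
    print(encode_header(0))      # Full HDD read
    print(encode_header(5, 8))   # One+ read with p = 5
```

In this sketch a Full HDD read costs one header bit and a One+ read costs `1 + sdp_bits` bits, which is why the SDP width would be sized to the expected insert length distribution.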
When the end of the duplex segment is reached, information about read 2 would simply stop being recorded. This would provide a potential savings of the first segment of bits associated with recording the duplex start position at the beginning of the stream of data encoding deltas for many One+ read constructs.
[0297] For some HDD read constructs, the “compressed” version of the HDD read construct may actually be greater in size than the uncompressed version of the two insert reads. To accommodate such cases, an optional extra prefix bit may be allocated to the variable length header associated with every HDD read construct, indicating whether read 2 is stored in a compressed or uncompressed (raw) form. If an uncompressed (raw) form is chosen, then a number of different approaches can be used to store the read 2 sequence. For example, the full read 2 sequence with its raw quality scores can be appended to the read 1 sequence and quality scores. Further, an SDP may be used in the header to indicate the breakpoint between read 1 and read 2 sequences.
B. HDD Sequence Encoding Methods for n-Pass Reads, where n > 2
[0298] HDD DNA constructs that generate Xpandomers with greater than two read “passes” on the insert sequence are possible, as described in sections III-B (Four Pass HDD Read Construction) and III-C (“n” pass HDD Read Constructs). This section describes methods for processing and compressing n-pass HDD reads, both in lossy and lossless manners. Some of the same principles that have been developed for 2-pass HDD read constructs can be applied to n > 2 pass HDD read constructs. It may be that lossy approaches to compressing the insert reads alignment object are satisfactory, and are even preferred, relative to lossless compression of information contained within the n > 2 pass multiple alignment object; however, both lossy and lossless approaches are relevant.
1.
Compression Approaches to n > 2 Pass HDD Read Constructs
[0299] For 2-pass HDD read constructs, a pairwise alignment between the read 1 and read 2 passes may be sufficient. For n > 2 pass HDD read constructs, a multiple sequence alignment, or a sequence of steps that results in the equivalent of a multiple sequence alignment, may first be required. Concepts for achieving the best alignment results for the purposes of HDD lossy and lossless compression are described.
[0300] Assuming that a multiple sequence alignment has been achieved, which may be performed differently depending on whether lossless or lossy compression is the goal, then exemplary embodiments 7 and 8 can be performed to compress the multiple sequence alignment object.
2. Lossless Compression Approaches to N > 2 Pass HDD Reads
[0301] Exemplary embodiment 7 describes compression of n > 2 pass HDD read constructs, outlined below.
1) Read 1 is recorded as the reference read.
2) For each of the other reads, namely read 2, read 3, …, read n, a second variable length stream of data is generated to record the deltas between each of the other reads and read 1. Approaches, such as the exemplary embodiment 4 described in section VIII-B of the 2-pass read section, can be applied in series to each of the reads and their pairing with read 1.
[0302] Exemplary embodiment 8 describes compression of n > 2 pass HDD read constructs, outlined below.
1) A (potentially lossy) consensus read may be generated first by calling the most probable base for each of the consensus read positions, given the evidence in the multiple sequence alignment object, as well as the associated raw read quality scores.
2) This consensus read may be recorded as a first stream of data. For each of the insert passes, namely read 1, read 2, read 3, …, read n, a variable length stream of data is generated to record the deltas between each of the reads and the consensus read.
Approaches, such as the exemplary embodiment 4 described in section VIII-B of the 2-pass read section, can be applied in series to each of the reads and their pairing with the consensus read.
3. Compression Estimates
[0303] Table 5 shows compression estimates on reference free intramolecular consensus of HDD read constructs. For example, if an approximate error profile of 3% discordant insert positions is assumed for a duplex segment alignment, HDD read construct consensus calls compressed in a lossless way may achieve ~43.5% compression when the reference-based compression step is not applied. In addition, the higher accuracy of consensus reads relative to raw reads is expected to improve reference-based alignment compression by having a lower deviation from the reference in alignment of concordant calls.
Table 5
Read | Length | Entropy | Total excluding Adapters
HDD Raw | Insert x 2 | 2 bit base call assuming equal A/C/T/G frequency of 25%: H = -Σ p log2(p) = -4*0.25*log2(0.25) = 2 bits; 2 bit Q score | insert length x 2 x 2 bit base calls; insert length x 2 x 2 bit Q scores
HDD Insert Consensus | Assuming 3% discordant bases; assuming 97% concordant bases (assuming 98.5% base accuracy) | A, C, T, G concordant base calls each assumed at frequency ~97%/4; 20 extended codes at a total 3% frequency, assuming worst case scenario of equal fraction for each discordant base: H = -[4 * (0.97/4) * log2(0.97/4) + 20 * (0.03/20) * log2(0.03/20)] = 2.26 bit per call | insert length x 2 bit Q scores; insert length x 2.26 bit concordance call
[0304] On average without any consensus, there may be four bits per base: two bits per base call and two bits per consensus quality score, although more bits could be used for quality score. The number of bits is multiplied by the insert length and by two, because there are two reads for the duplex sequencing. With various encodings, one could expect to achieve an upper limit of 2.26 bits times the insert length for an estimate of 3% discordance.
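The entropy figures in Table 5 can be checked numerically. The short Python sketch below recomputes the 2-bit raw base call entropy and the 2.26-bit consensus call entropy from the stated assumptions (four concordant symbols sharing 97% and 20 extended codes sharing 3%). The final ratio is one possible reading of the ~43.5% figure, namely the consensus call compared against the two raw 2-bit base calls per insert position; that interpretation, and the variable names, are assumptions of this example.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Raw base call: A, C, T, G each assumed at 25%.
raw_call_bits = entropy_bits([0.25] * 4)

# Consensus call: 4 concordant symbols share 97%, 20 extended
# (discordant) codes share 3% equally (worst case in Table 5).
discordant_fraction = 0.03
extended_codes = 20
consensus_probs = (
    [(1 - discordant_fraction) / 4] * 4
    + [discordant_fraction / extended_codes] * extended_codes
)
consensus_call_bits = entropy_bits(consensus_probs)

# Assumed interpretation: two raw reads cost 2 x 2 bits of base calls
# per insert position, versus one consensus call per position.
raw_bits_per_position = 2 * raw_call_bits
saving = 1 - consensus_call_bits / raw_bits_per_position

if __name__ == "__main__":
    print("raw call entropy:       %.3f bits" % raw_call_bits)
    print("consensus call entropy: %.3f bits" % consensus_call_bits)
    print("base call saving:       %.1f%%" % (100 * saving))
```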
A consensus quality score can correspond to the final quality score assigned to concordant and discordant positions, e.g., based on the criteria mentioned above.
X. Decompression/Decoding
[0305] Various embodiments can decode compressed consensus reads in preparation for processing by downstream processes.
[0306] While it will be possible for some downstream post-primary or secondary algorithms to directly ingest and process various compressed representations of HDD reads as described above, other downstream algorithms may benefit from first decompressing such representations and reconstructing the original alignment object (possible in the case of lossless compression of an HDD alignment object).
[0307] Given a previously determined codebook, it would be possible to provide a decoding routine that can be executed at the location of downstream secondary analysis. In this case, just the compressed HDD read data would need to be transmitted through the communications channel to the location of the downstream processing storage and compute elements.
[0308] If a context-dependent code is used to encode HDD alignment objects, then both the compressed data and information specifying unknown aspects of the codebook would need to be transmitted to the location of the downstream processing storage and compute elements.
A. Intramolecular Consensus
[0309] To decompress such a compressed sequence, where not all of the positions are represented with the same amount of data, additional metadata can be used. For example, a separate bit vector can specify which positions are concordant (e.g., with a 0) or discordant (e.g., with a 1). As another example, a header file can specify the positions that are discordant. Thus, the decompression circuitry/module can assume that every two bits is a concordant base call until a discordant position is reached in the sequence.
The decompressor can keep a counter that increases after each position is read, and then cross-reference the current position with the next discordant position in the header file.
[0310] Full header strings can be associated with reads and include information to distinguish concordant positions from discordant positions. Such encoding information about the discordances can be in one variable length but relatively compact section of the header.
[0311] Other examples of metadata are provided in section VII, e.g., the consensus quality score vector, the variable length encoding, context aware encoding, and dynamic codes.
B. Decompression of Reference-Based Compressed Consensus
[0312] When reference-based compression is performed, various techniques can be used for the decompression/decoding. When the 14 encodings in Table 2 are used, four bits can be used to encode each position, and a table (e.g., like Table 2) can be used to determine the corresponding state of that position.
[0313] In other embodiments, a header file can provide a start position of the read (e.g., the 5’ end) or an end position of the read (e.g., 3’ end), and all mismatches from the reference can be identified relative to the start/end positions. Such a list can identify the concordant positions and the discordant positions, as different numbers of bits may be allocated for the different types of positions. The decoder can then keep track of whether the data at each position corresponds to a concordant or a discordant position. For instance, if position 10 is the first mismatched discordant position, then the first two bits (or other number of bits) can be read out. Then it can be determined that the discordant list has the next highest value at 32. The 4 or 5 bits (or other number of bits) of the discordant position can be read out, as control goes to the discordant decoder.
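The counter-based decoding described in paragraphs [0309] and [0313] can be sketched as follows. This is a minimal Python illustration, not the codebook of the specification: the 2-bit base mapping (00=A, 01=C, 10=G, 11=T), the fixed 5-bit discordant code width, and the function name `decode_positions` are assumptions, and discordant codes are returned as raw integers because the actual discordant codebook is defined elsewhere.

```python
BASES = "ACGT"  # assumed 2-bit mapping: 00=A, 01=C, 10=G, 11=T


def decode_positions(bits, n_positions, discordant_positions,
                     discordant_width=5):
    """Decode a bit string with 2-bit concordant and wider discordant codes.

    bits: string of '0'/'1' characters holding the compressed calls.
    discordant_positions: positions listed in the header as discordant.
    Returns a list of ("concordant", base) or ("discordant", code) tuples.
    """
    discordant = sorted(discordant_positions)
    next_discordant = 0   # index of the next discordant position to expect
    cursor = 0            # current offset into the bit string
    decoded = []
    for position in range(n_positions):  # the position counter
        if (next_discordant < len(discordant)
                and discordant[next_discordant] == position):
            # Control goes to the discordant decoder.
            code = int(bits[cursor:cursor + discordant_width], 2)
            decoded.append(("discordant", code))
            cursor += discordant_width
            next_discordant += 1
        else:
            # Every two bits is a concordant base call.
            decoded.append(("concordant", BASES[int(bits[cursor:cursor + 2], 2)]))
            cursor += 2
    if cursor != len(bits):
        raise ValueError("bit string length does not match the header")
    return decoded


if __name__ == "__main__":
    # Positions 0, 1, 3 concordant (A, C, T); position 2 discordant (code 7).
    print(decode_positions("00" + "01" + "00111" + "11", 4, [2]))
```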
Then it is determined which list (concordant or discordant) has the lowest position, and the corresponding decoder is used.
XI. Methylation
[0314] An advantage of HDD reads for 5mC methylation calls is that when intramolecular consensus calls are made, they provide an unmethylated read that can be readily aligned to the reference genome, as well as methylation calls. In contrast, in a typical EM-Seq or bisulfite methylation workflow, either the reads need to be assigned unique UMIs in order to be grouped together, or alternatively the converted reads need to be aligned to a macerated genome where C’s were converted to T’s or G’s to A’s. This latter step is often less accurate, as the new reference complexity is lower and as conversion can be partial, such that the methylated read does not align with an unconverted reference, requiring special alternative approaches to alignment. Table 6, set forth below, provides an example of lossless encoding of pairwise concordant and discordant calls following pairwise alignment and methylation detection.
Table 6 Methylation Status DNMT EM or parent EM or Encoding Read Read EM or EM or bisulfite into Category bisulfite 1 2 bisulfite bisulfite conversion of read2 both read1 read2 unmethylated daughter reads reads + EM or bisulfite Adenine A T read1 A read1 A read1 A read1 A parent A Cytosine C G read1 read1 read1 C parent 5mC 5mC 5mCG Concordant Calls Guanine G C read2 read1 G read2 parent 5mC 5mC G5mC Thymine T A read1 T read1 T read1 T read1 T parent T (or Uracil) Weak A a Discordant weak t T Calls or 5mC Strong C c methylation status strong g G pYrimidine C t pyrimidine t C Keto G t read1 read1 read1 G parent G G, G, read2 C read2 5mC keto t G read1 C read1 C read1 C parent C puRine A g purine g A aMino A c amino c A lower case A - a b - A lower case C - c d - C Discordant indel lowercase G - g e - G lower case T - t u - T
[0315] An additional complement of the table above covers cases where a complement HDD molecule is sequenced. Under this scenario, an additional flag pertaining to strand orientation can be added to the read, where the codes remain as above. The specific ASCII encodings shown are one of the possible encodings.
A. Methylation Detection Workflow Adjustments to Pairwise Alignment Algorithm
[0316] Step 1: Detect the hairpin and the adapters, then extract inserts 1 and 2.
[0317] Step 2: Perform pairwise alignment using a modified transition matrix. This transition matrix allows for mismatch errors between converted (non-methylated) C and G. The substitution matrix will have the following general form:
    A    C    T    G
A  1.0  0.0  0.0  0.0
C  0.0  1.0  0.0  0.0
T  0.0  1.0  1.0  0.0
G  1.0  0.0  0.0  1.0
result of converting methylated C’s to T’s.
[0318] Step 3: After the initial pairwise alignment, it is possible to detect T/C pairs which correspond to the conversion of the methylated bases. Once the methylated bases are detected, we randomly reassign CG pairs to these bases and repeat the above step.
Each iteration improves the alignment to some extent. The base conversion is done using the following dictionary:
substitution dictionary = { 'TA': 'N', 'TG': 'N', 'AC': 'N', 'AT': 'N', 'AA': 'A', 'AG': 'N', methyl methyl (rc)
B. Partial Order Alignment in the Context of Methylation Calling
[0319] In the context of methylation detection, an implied pairing between a C and a U converted base vs. a pairing between a C and G identifies a 5mC versus a C methylation event on a single strand. Methylation often occurs on CpG islands, which can cause ambiguity in alignment in the presence of indels and when considering the ambiguity generated by EM or bisulfite conversion processes. A solution to this problem is to implement partial order alignment of HDD reads and include penalties to account for base pairs derived from C-U pairing. A way to approach this algorithmically is to have read 1 and read 2 undergo partial order alignment with the proposed methylation detection alignment weights, maintaining both read 1 and read 2 information in a lossless manner. The partial order alignment generates a sorted DAG, which can be exported as a linear sequence encoding. In contrast with regular pairwise alignment, the ambiguity in alignment will be maintained, enabling the assignment of methylation calls to CpG islands, or other alignment-ambiguous positions when appropriate, vs. single position assignments. If an additional dedup step takes place, partial order alignment results can be utilized to weigh ambiguously positioned methylation calls.
XII. Intermolecular Consensus and Variant Calling
A. Intermolecular Consensus
[0320] In some embodiments, template nucleic acid molecules may be amplified during library preparation prior to sequencing. Thus, multiple nucleic acid molecules (e.g., copies and original) of the template can be sequenced.
Then, raw data corresponding to these nucleic acid molecules or portions thereof may be generated by the sequencing device (e.g., at different time points). Sequence reads (e.g., from raw read data) of two or more raw data corresponding to the same nucleic acid molecule may be used to generate a consensus read for the nucleic acid molecule. The number of sequence reads that are used to generate the consensus read can be limited to a cutoff number (threshold) or until a consensus read is considered complete or substantially accurate. When the limit/cutoff is reached, data from any raw read data that corresponds to the same nucleic acid molecule or portions thereof may be discarded and excluded from further analysis. The corresponding new raw read data may be removed from the instrument to reduce the amount of data in the memory and the amount of data that needs to be output from the memory. [0321] There are several methods known in the art used for fast identification of a sequence read corresponding to a nucleic acid molecule or a molecular family based on an identifier (e.g., a unique molecular identifier (UMI), a random sequence barcode (randomer), or content of a sequence read). This information may then be used in real time to discard or retain the sequence read. [0322] Identifiers, such as UMIs, are widely used in a variety of NGS library prep workflows to: (i) identify and cluster sequences belonging to the same population of nucleic acid molecules; and (ii) perform error correction through oversampling of raw reads and consensus read forming strategies. With respect to the first point, each cluster may contain a plurality of sequence reads that correspond to a nucleic acid molecule. To reduce the amount of data within a cluster, sequence reads may be collapsed into a single sequence read representing a consensus sequence. 
The consensus sequence of a cluster is a single nucleotide sequence, in which every position is a nucleotide that is most commonly called amongst all the sequence reads in that cluster. The consensus sequence may be generated by performing a multiple alignment between all the sequence reads in a cluster. Alternatively, the consensus sequence may be generated by aligning each sequence read in a cluster to a reference genome. Then, for every position in the multiple alignment or alignment to a reference genome, the most common nucleotide amongst all reads can be selected. Regarding the second point, each sequence read may contain random errors that can be randomly produced during nucleic acid amplification and sequencing processes. A consensus sequence, generated from a plurality of sequence reads, may therefore more accurately represent a nucleic acid molecule. Including more sequence reads to form a consensus sequence read may lead to a consensus sequence read that may correspond to the actual sequence of the nucleic acid molecule more accurately. On the other hand, including too many sequence reads to generate a consensus read may consume more time as well as more memory and computational resources. Therefore, to optimize generating accurate consensus data, a cutoff can be applied to the number of sequence reads that are used in building the consensus. For example, a highly accurate consensus sequence may be generated from at most about 100, 50, 40, 30, 20, 10, or fewer sequence reads. Barcode and UMI technologies, and methods of labeling nucleic acid molecules with a barcode or UMI sequence, are well known in the art (see, e.g., Fu et al., Proc. Nat’l. Acad. Sci. 111:1891-1896 (2014); Islam et al., Nat. Methods 11:163-168 (2014); Kivioja et al., Nat. Methods 9:72-74 (2012); U.S. Patent Nos. 5,604,097; 7,537,897; 8,715,967; 8,835,358; and International Application No. PCT/US2013/041031).
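The per-position majority vote and the read cutoff described above can be illustrated with a short Python sketch. It assumes the reads of one cluster are already aligned to equal length (for example, rows of a multiple alignment with '-' for gaps); the function name `majority_consensus`, the `max_reads` parameter, and the tie-breaking behavior (the character seen first among the tied ones wins) are choices made for this example and are not taken from the specification.

```python
from collections import Counter


def majority_consensus(aligned_reads, max_reads=10):
    """Collapse aligned reads of one cluster into a consensus string.

    Only the first max_reads reads are used (the cutoff); later reads
    for the same cluster would simply be ignored or discarded.
    """
    used = aligned_reads[:max_reads]
    if not used:
        raise ValueError("at least one read is required")
    length = len(used[0])
    if any(len(read) != length for read in used):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*used):
        # most_common(1) returns the most frequent character; ties go
        # to the character encountered first in the column.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    cluster = ["ACGT", "ACGA", "TCGT"]
    print(majority_consensus(cluster))
```

With a cutoff in place, memory per cluster is bounded by `max_reads` reads regardless of how deeply a molecular family was sampled.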
[0323] There exist some undesirable aspects of using UMI and PCR strategy in library preparation in combination with an in silico intermolecular consensus analysis, which determines a consensus of the sequence reads all corresponding to a same template nucleic acid molecule (i.e., part of a same cluster). In some cases, the amplification and sampling process results in uneven representation across UMI-labeled nucleic acid molecules (or UMI-molecular families). The sampling may include random sampling of the molecules generated in the amplification process. For example, a fraction of the amplified molecules (i.e., including the original template molecules) may be sampled for sequencing. Different parameters in an amplification process (e.g., number of PCR cycles) to generate different molecular families prior to sequencing may cause the molecular families to contain different numbers of nucleic acid molecules. This may be caused by, for example, over amplification (e.g., using PCR), or, in some cases, an initial amount (e.g., concentration) of a nucleic acid molecule may be more than other nucleic acid molecules in a sample, leading to a molecular family that contains more progenies with the same barcode and content (i.e., nucleotide sequence). Therefore, an amount of sequence reads generated by the sequencing device corresponding to a nucleic acid molecule or a molecular family may vary significantly across different molecules or molecular families. Consequently, a nucleic acid molecule or molecular family may be over- or under-sampled. This may also happen due to other factors such as sequencing errors.
[0324] This may be undesirable from an assay perspective. For example, if a particular assay has some desired depth of coverage for each UMI-molecular family (e.g., 10x), the resulting intermolecular consensus families (clusters) may hit that average 10x read depth, but the variance across families will be high.
Thus, some molecular families may have insufficient representation, while others may have orders of magnitude more reads than are required. Families with extremely high depth of coverage may not benefit the assay much, while the UMI-molecular families with membership number lower than the desired depth will be unable to generate high quality consensus reads. For example, each family labeled using a UMI may represent a region of interest in a genome. In order to satisfy assay needs for all regions of interest, the sequencing throughput requirements have to be raised in order for all regions of interest to be covered by at least the minimum required depth. The regions of interest can be the subject of targeted sequencing, e.g., enrichment of DNA from those regions, as may be done by amplification of DNA or capture probes.
[0325] From the compute perspective, another major disadvantage of conventional UMI-based intermolecular consensus workflows is the fact that members of the same UMI family are typically dispersed randomly throughout the physical sample, such that each member of a UMI-Original Molecule Family may be read at a different time throughout a run. Such a run may conceivably last an hour, several hours, 24 hours, multiple days, or another duration of time. Consequently, UMI Based Intermolecular Consensus algorithmic workflows cannot process the read data until all reads from the entire run have been produced and collected. Namely, the “clustering step”, in which all raw read members of a UMI Based Intermolecular consensus family are clustered, or grouped together, must remain in an unfinished state until the run has completed, and no new members of the UMI-Original Molecule family are expected. Accordingly, the processing steps for clustering require significant computational resources and prevent the processing and production of the consensus reads in real time.
B.
Variant Calling [0326] Genomic variants (also referred to as genomic alterations) are naturally occurring alterations to the DNA sequence not found in a reference sequence. Examples of genomic variants include small variants (less than 50 base pairs) such as single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), insertions, and deletions (sometimes referred to as indels), and structural variants (greater than 50 base pairs) such as insertions, deletions, chromosomal rearrangements (e.g., translocations, inversions, and fusions), and copy number variations (CNVs). SNVs/SNPs are the result of single point mutations that can cause synonymous changes (nucleotide change does not alter the encoded amino acid), missense changes (nucleotide change does alter the encoded amino acid), or nonsense changes (resulting amino acid change converts the encoded codon to a stop codon). [0327] Variant calling generally involves comparing a sequence read to a reference genome and reporting any variation between them. A reference genome is an established, high-quality and well-accepted sequence of a given organism, for example the hg38 human reference genome. Reference genomes comprise pieces of multiple genomes put together to generate a “consensus” reference genome with one assigned nucleotide for every position. [0328] Computational tools known as variant callers score and filter aligned sequencing data to call true sequence variations. As discussed above, alignment of reads can identify concordant and discordant positions, and the variant caller is responsible for determining which of the discordant positions are true positives or true negatives. After alignment to a reference genome, a next step is variant calling. The system (e.g., a de novo software application) can examine the mapped data and reference genome side-by-side to determine the existence of sequence mutations (single base changes and small indels). 
In some embodiments, the system can extract candidate variants from the alignment, score a number of individual metrics for each variant, and apply these scores both individually and in combination to identify bona fide sequence mutations and to exclude sequence artifacts. Any suitable program may be used to call variants. Variants may be reported in any suitable format such as the variant call format (*.VCF; a standard tab-delimited format for storing variant calls).
[0329] Intramolecular consensus reads and/or intermolecular consensus reads can be used to perform the variant calling. For example, intramolecular consensus reads can be output by a first consensus circuit and then used by a second consensus circuit to determine an intermolecular consensus read for a particular family of molecules (e.g., sharing a same bar code). This can be done across families for a given sequencing run.
XIII. Detecting Components of Adapter Constructs
[0330] Efficient detection of added artificial sequences in nucleic acid fragments (e.g., DNA and/or RNA) is essential for their proper identification during sequencing. In so doing, DNA fragments belonging to the same sample in a pool of samples may be identified and processed. To distinguish DNA fragments from one another, adapter sequences comprising sequence IDs (SIDs) and/or unique barcodes may be used, where each SID corresponds to a single sample in the pool of samples and the unique barcodes distinguish fragments within a single sample. Additionally, it is important to identify where the naturally occurring nucleic acid segment starts and ends so that such a segment is analyzed for the subject and not an adapter used in library preparation (e.g., a hairpin adapter).
[0331] Provided herein are various methods that may be used to detect adapter sequences and/or the individual components comprising adapter sequences.
In a first set of embodiments, a window sliding technique may be used to identify candidate locations for a hairpin adapter. In a second set of embodiments, machine learning models, such as neural networks, may be used for adapter segmentation and classification to identify the various components of an adapter construct. In a third set of embodiments, adapters may be detected using frequency-based methods where adapter signals may be converted from a time-based domain into the frequency domain, reducing the computational cost for detecting adapter sequences.
A. Detecting Hairpin Adapter using Dual Sliding Window
[0332] During duplex sequencing (or higher numbers of passes), a hairpin adapter can be added to the end of a double-stranded DNA (dsDNA) molecule, as described and illustrated with respect to FIGs. 4, 6, 8 and 11. As an example, the structure of the hairpin adapter may include three parts: a first sequence ID (SID), a loop sequence, and a SID’ (the reverse complement of the SID sequence). Assuming there are no sequencing errors accrued during sequencing, the loop is a fixed width sequence. The structure of the hairpin adapter is known and able to be distinguished from the dsDNA molecule sequence. Being able to accurately detect the hairpin adapter allows for (1) the separation of sample sets, (2) correct adapter removal during adapter trimming processes, and (3) correct identification of the nucleotide sequence of the dsDNA molecule. The sequence IDs can include one or more components, which may identify a particular sample (e.g., if different samples are pooled) or different molecules using, e.g., a unique molecular identifier (UMI) as described herein.
[0333] Detection of the hairpin duplex may include two computations for each of a set of positions in a read, potentially for all positions.
The first computation can include hairpin detection, which determines if a particular read position is a candidate for being a hairpin location. If a candidate hairpin location is identified, the second computation can include classifying the sample using the SID, e.g., using a look-up table (LUT) to determine if there is a sample in the sample pool that corresponds to the SID part of the hairpin. Other techniques can be used to determine a match between a measured SID and the known SID pool, e.g., using machine learning models as described herein.
[0334] Given the known SID pool when LUTs are used, a SID LUT can be pre-computed comprising every possible SID k-mer. For example, given a SID length equal to k bp, every possible k-mer out of a total of 4^k can be compared against every SID in the pool. If a k-mer is unambiguously close to one of the SIDs in the pool, then the sample identity of this best matching SID is recorded. The LUT can effectively represent a sparse array, and the matching SID is recorded in the location under the index equal to the lexicographical order of the k-mer (0 through 4^k - 1). Once a hairpin candidate location is found, the LUT is queried using the measured sequence corresponding to the presumed sample adapter portion; if a non-null value is returned, it is reported as a possible SID. The demultiplexing method can use the structure of the hairpin (e.g., a first sequence ID (SID), a loop sequence, and a SID’) to separate mixed sequence data into individual sample data sets after sequencing.
[0335] In this manner, the measured sequence can be compared to the k-mer column in the LUT to identify a match. There will be an exact match since every k-mer combination of length k bp exists in the LUT. For the matching entry (row), a second column can indicate the sample identity (actual SID), if the k-mer in the row was unambiguously close to the actual SID. Multiple k-mers can correspond to a same SID.
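A pre-computed SID look-up table of the kind described in paragraphs [0334] and [0335] can be sketched in Python as follows. The closeness rule used here (a unique nearest SID by Hamming distance, within `max_mismatches`) is an assumption standing in for "unambiguously close", and the names `build_sid_lut`, `kmer_index`, and `max_mismatches` are hypothetical. Entries that are ambiguous or too far from every SID hold `None`.

```python
from itertools import product

ALPHABET = "ACGT"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def kmer_index(kmer):
    """Lexicographic rank of a k-mer over A < C < G < T (0 .. 4^k - 1)."""
    index = 0
    for base in kmer:
        index = index * 4 + ALPHABET.index(base)
    return index


def build_sid_lut(sids, max_mismatches=1):
    """Return a list of length 4^k mapping k-mer rank to a SID index or None."""
    if not sids or any(len(sid) != len(sids[0]) for sid in sids):
        raise ValueError("SIDs must be non-empty and of equal length")
    k = len(sids[0])
    lut = [None] * (4 ** k)
    for letters in product(ALPHABET, repeat=k):
        kmer = "".join(letters)
        ranked = sorted((hamming(kmer, sid), i) for i, sid in enumerate(sids))
        best_distance, best_sid = ranked[0]
        tied = len(ranked) > 1 and ranked[1][0] == best_distance
        if best_distance <= max_mismatches and not tied:
            lut[kmer_index(kmer)] = best_sid
    return lut


if __name__ == "__main__":
    lut = build_sid_lut(["AAAA", "TTTT"], max_mismatches=1)
    measured = "AAAT"  # one mismatch from the first SID
    print(lut[kmer_index(measured)])
```

Because the table has 4^k entries, this direct construction is only practical for short SIDs; it is shown for k = 4 to keep the example small.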
An ambiguous k-mer can have a null value in the second column, indicating no clear match exists. For instance, a particular k-mer can have the same number of differences from two different SIDs. In some implementations, certain sequencing errors can be known to occur more frequently, so even if a k-mer has a same number of differences between two SIDs, one of the SIDs can be selected, since it is the more likely match.

1. Sliding Window Structure

[0336] To determine if a read position is a candidate hairpin location, the hairpin duplex molecule is linearized and a window sliding technique may be used to detect patterns and/or regions of interest. The window sliding can analyze sequences by examining fixed-length segments (windows) that move along the sequence by a specified number of bases (step size).

[0337] FIG.23 illustrates an exemplary window sliding technique that may be used to locate one or more hairpin adapters. A window structure 2310 shown in FIG.23 may comprise a first window 2312 (e.g., forward substring) and a second window 2314 (e.g., reverse complement substring) of equal length, and more specifically equal in length to the SID portion of the one or more hairpin adapters. For example, the window length (e.g., the SID length) may be 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 bases in length, or any whole number between 5 and 50 bases.

[0338] First window 2312 and second window 2314 are separated by a constant distance equal to a loop length 2316, such as 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, or 75 bases in length, or any whole number between 5 and 75 bases. Loop length 2316 corresponds to the length of the loop that does not hybridize to another portion of the adapter. Window structure 2310 can be shifted along a read sequence 2320 based on a defined step size. By way of example, the step size may be 1, 2, 3, 4, 5, or 10 or more bases.
In various embodiments, the step size is 1, meaning the window construct slides along the sequence base by base. Read sequence 2320 includes the sequence of a first strand 2321 and a second strand 2322 of the dsDNA molecule.

[0339] At each “step”, a first substring (also referred to as a first sequence) corresponding to first window 2312 and a second substring (also referred to as a second sequence) corresponding to second window 2314 are examined for reverse complementarity (i.e., sequence symmetry). This is because first strand 2321 and second strand 2322 of the dsDNA molecule are reverse complements of each other, as are the SID and SID’ of the adapter. Accordingly, the location of a hairpin candidate can be determined based on when the complementary substring corresponding to the second sliding window has a high level of concordance (e.g., close to a perfect match) with the substring corresponding to the first sliding window. In other words, until window structure 2310 has slid into the position where the gap (e.g., loop length) between the two windows is over the loop sequence 2323 (e.g., red box in FIG.23), low concordance values (e.g., indicating very few bases match between the two substrings) would be encountered. Once window structure 2310 aligns with the actual location 2325 of the hairpin loop, a high concordance value indicates the corresponding location has been detected.

[0340] In addition to setting the window length and step size, another parameter that may be adjusted for the window sliding technique is the number of mismatches that may be allowed between the first and second substrings that still allows for the hairpin adapter to be identified with a high level of confidence. The more stringent the mismatch parameter is (e.g., allowing 0, 1, or 2 mismatches) the fewer candidate hairpin matches; however, if a match is found, it is considered a highly confident match.
The looser the stringency for the mismatch parameter (e.g., 3 or more mismatches), the more candidate hairpin matches are found; however, the risk of a false positive match or the identification of a different biological sequence is much higher. In various embodiments, the number of allowed mismatches may be 0, 1, 2, or 3 mismatches between the first and second substrings.

[0341] As described herein, a mismatch refers to one or more pairs of bases with non-complementarity. Examples of mismatches include, without limitation, single nucleotide polymorphisms (SNPs), base insertions, and base deletions. For the base insertion and deletion mismatches, the number of allowed inserted or deleted bases is another parameter that can be set by a user. The fewer the allowed inserted/deleted bases, the more stringent the parameter, while the greater the number of allowed inserted/deleted bases, the less stringent the parameter. In various embodiments, one inserted and/or one deleted base may be allowed. By allowing some flexibility in the number of mismatches that can be tolerated for hairpin identification, mismatches due to sequencing error do not automatically disqualify a candidate hairpin match from being considered.

[0342] To speed up computation times for determining the number of matches between two substrings, the substring sequences may be stored in a bit encoding format. For example, if the length of the substring sequences is 16 bp, the two sequences can be stored in a 4-bit encoding format (also referred to as one-hot encoding), each requiring only one 64-bit machine word. The number of matches is determined by two bit operations, each requiring a single CPU instruction. Namely, each DNA base may be encoded as a binary-coded decimal (i.e., each base is represented by a fixed number of bits). As non-limiting examples, each DNA base may be encoded as follows: A - 0001, C - 0010, G - 0100, T - 1000.
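A minimal Python sketch of the 4-bit encoding of paragraph [0342] is given below. The function names are hypothetical, and Python integers are used in place of a 64-bit machine word; for 16-base substrings the packed value fits in 64 bits, so a compiled implementation could use one AND and one popcount instruction.

```python
# One-hot code per base, as listed in paragraph [0342].
ONE_HOT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}


def encode(seq):
    """Pack a DNA string into an integer, 4 bits per base.

    The first base ends up in the most significant nibble.
    """
    word = 0
    for base in seq:
        word = (word << 4) | ONE_HOT[base]
    return word


def count_matches(seq_a, seq_b):
    """Count positions at which two equal-length sequences carry the same base.

    AND-ing two one-hot nibbles leaves exactly one set bit when the bases
    agree and zero set bits when they differ, so the population count of the
    AND-ed words equals the number of matching positions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return bin(encode(seq_a) & encode(seq_b)).count("1")
```

To test reverse complementarity rather than identity, the second substring would first be reverse complemented (or encoded with a complemented table and reversed nibble order) before the comparison.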
As illustrated in FIG.23, by applying the binary AND operation followed by the CPU popcount function (or other function), which counts the number of set bits (bits with a value of 1) in a binary number, the number of matches between two sequences is computed. Counting mismatches and more generally edit distance can be performed in various ways.

[0343] As the window structure slides along the linear hairpin read, a substring concordance value (e.g., edit distance values) can be determined for each new position using the bases within the two windows. As described above, substring concordance is based on how well the first and second substrings complement each other. Such a concordance can be determined as the number of mismatches based on a direct comparison. As another example, the concordance can be measured as an edit distance, which corresponds to the number of changes required in one string to make the other string a perfect reverse complement. The edit distance can provide information on the number and type of mismatch(es) that can occur in a substring. For example, if all the bases in the two substrings complement each other, the edit distance for those substring sequences is zero. On the other hand, if a sequencing error is introduced, such as a SNP, one change is required to achieve perfect reverse complementarity, thus the edit distance is one. As another example, if an insertion of ‘x’ bases is incorporated, meaning that the bases to the right of the insertion all shift over by ‘x’, then the edit distance is equal to ‘x’.

[0344] In various embodiments, an exhaustive approach can be used to identify the location of the hairpin adapter (e.g., the position(s) with the lowest edit distance and highest concordance). This approach involves sliding the window construct along the entirety of the linear hairpin read and analyzing the data at each step.
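The exhaustive search of paragraph [0344] can be sketched as follows, using a plain mismatch count as the concordance measure. This is an illustrative sketch under stated assumptions: the function names are hypothetical, ties are resolved in favor of the earliest position, and the toy read in the usage note is invented for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def window_mismatches(read, pos, sid_len, loop_len):
    """Mismatches between the first window and the reverse complement of the
    second window, with the window structure starting at `pos`."""
    first = read[pos:pos + sid_len]
    start2 = pos + sid_len + loop_len
    second = read[start2:start2 + sid_len]
    return sum(a != b for a, b in zip(first, reverse_complement(second)))


def find_hairpin(read, sid_len, loop_len, step=1):
    """Slide the dual-window structure over the whole read.

    Returns (position of the first window, mismatch count) for the position
    with the fewest mismatches, or (None, None) if the read is too short.
    """
    span = 2 * sid_len + loop_len
    best_pos, best_mm = None, None
    for pos in range(0, len(read) - span + 1, step):
        mm = window_mismatches(read, pos, sid_len, loop_len)
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm
```

For a toy read built as insert + SID + loop + SID’ + reverse complement of the insert, e.g. `"ACCACAAC" + "GATTCG" + "AAAA" + "CGAATC" + "GTTGTGGT"`, calling `find_hairpin(read, sid_len=6, loop_len=4)` returns position 8 with zero mismatches, which is where the SID begins.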
Other embodiments can avoid such an exhaustive search and only test some of the positions. For example, the midpoint of the linear hairpin sequence can be identified, and the window sliding algorithm can start with the first base in the linear hairpin read or at a specified distance (e.g., N bases, such as 50) from the midpoint until the window construct reaches a “stop” base position (e.g., a base position selected by the user for the window construct to stop at once reached). The position with the lowest edit distance measured in this range can be used as the location of the hairpin loop, thereby leading to the determination of the locations of the barcodes and the DNA sequences.

[0345] In another embodiment, the edit distance can be compared to a threshold, corresponding to an expected edit distance that might occur when the matching position (actual hairpin location) is identified. In other words, the mismatch parameter set by the user may act as a threshold, where once the window construct identifies a candidate hairpin position where the number of allowed mismatches is met, the sliding can stop or possibly proceed for only 1, 2, or 3 more bases to confirm that a slightly different position does not match better. In this manner, further sliding would not be required. Further, the use of the range around the midpoint could be combined with this threshold technique.

2. Method

[0346] FIG.24 shows a flowchart illustrating method 2400 for detecting hairpin adapters using a dual sliding window and any of the aforementioned techniques. The method 2400 depicted in FIG.24 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device).
The method 2400 presented in FIG.24 and described below is intended to be illustrative and non-limiting. Although FIG.24 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.

[0347] At 2405, a hairpin adapter is ligated to an end of a double-stranded nucleic acid molecule thereby forming a resulting molecule (e.g., a hairpin duplex construct). The resulting molecule comprises a hybridized portion (e.g., nucleotides with base pairing) and a non-hybridized portion (e.g., nucleotides without base pairing). The hairpin adapter includes a hairpin loop, with a known loop length, that comprises nucleotides that are not hybridized to other nucleotides.

[0348] At 2410, the resulting molecule is separated or linearized, wherein the hybridized portions are separated from one another to generate a linear hairpin molecule. Methods for separating hybridized portions of double stranded nucleic acid molecules are well known in the art. For example, heat denaturation may be used to separate the hybridized portions. The single stranded molecule is sequenced to obtain a corresponding output sequence (e.g., a linear hairpin read) that is received by a computer system for processing. Examples of library preparations and sequencing methods that may be implemented for sequencing the linear hairpin molecule are described with respect to Section II of the disclosure. Moreover, an overview of example sequencing devices and pipelines is described with respect to Section I of the disclosure.

[0349] At 2415, the computer system identifies candidate locations for the hairpin adapter, using a window sliding technique.
This technique involves sliding the window structure, based on a specified step size, over a plurality of positions of the output sequence (e.g., sequence read). The defined step size may be 1, 2, 3, 4, 5, or 10 or more bases, or any whole number between 1 and 10 bases. In various embodiments, the step size is one base, meaning the window construct slides along the sequence base by base. The window structure includes a first window portion separated from a second window portion by the known loop length. The loop length corresponds to the length of the ligated hairpin adapter sequence added to the end of the double stranded nucleic acid molecule at 2405.

[0350] At 2420, at each position in the plurality of positions, an edit distance between a first sequence (e.g., a first substring) in the first window portion and a reverse complement of a second sequence (e.g., a second substring) in the second window portion is determined. In so doing, a set of edit distances is also determined. The plurality of positions includes a specified number of positions before and after a middle of the output sequence. The edit distance corresponds to the number of changes required in either the first sequence or the second sequence to make the other sequence (e.g., either the second sequence or the first sequence, respectively) a perfect reverse complement. In various embodiments, determining the edit distance can involve determining a number of changes required in the second sequence to obtain a matching reverse complement to the first sequence. In other embodiments, determining the edit distance can involve determining a number of mismatches between the first sequence and the reverse complement of the second sequence. The set of edit distances are used to identify the location of the hairpin loop in the output sequence.
In various embodiments, the edit distance at each of the plurality of positions is compared to a threshold, where the location of the hairpin loop is at a position having an edit distance less than the threshold.

[0351] Additionally or alternatively to the window structure sliding across the plurality of positions of the output sequence, the window structure can be slid over an entirety of the output sequence. In this instance, determining the location of the hairpin loop in the output sequence based on the edit distances comprises (i) determining the edit distance for each position of the window structure, and (ii) selecting a minimum of the set of edit distances, consistent with the hairpin location being the position with the lowest edit distance. Different stopping criteria may be used to cease sliding of the window structure. For example, sliding may stop when the minimum edit distance in the set of edit distances is found, or sliding may stop when a threshold for the edit distance is achieved.

[0352] When the location of the hairpin loop in the output sequence is determined, a measured identity sequence (e.g., SID) from the sequence read in the first window, the second window, or both at the location of the hairpin loop in the output sequence is determined. The measured identity sequence (e.g., SID) is used to identify a particular sample, a particular double-stranded nucleic acid molecule, or both. In various embodiments, sample identification comprises comparing the measured identity sequence (e.g., SID) to a LUT to determine if there is a sample in the sample pool that corresponds to the measured identity sequence of the hairpin. Once the location of the hairpin loop in the output sequence is found, the LUT is queried, and if a non-null value is returned, it is reported as a possible SID. In other embodiments, sample identification can comprise inputting the measured identity sequence into a machine learning model that is trained on various input sequences of a same length as a sample identifier used in the hairpin adapter.
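Steps 2415 and 2420, restricted to positions around the middle of the read and combined with the threshold test, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the default `flank` and `threshold` values are hypothetical, Levenshtein distance is used as the edit distance, and "middle" is taken to mean the window position that centers the loop gap on the read.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def locate_hairpin_near_middle(read, sid_len, loop_len, flank=50, threshold=2):
    """Test only window positions within `flank` bases of the centered position.

    Returns (position, edit distance) for the best position if its edit
    distance is below `threshold`; otherwise returns None.
    """
    span = 2 * sid_len + loop_len
    centred = (len(read) - span) // 2
    lo = max(0, centred - flank)
    hi = min(len(read) - span, centred + flank)
    best = None
    for pos in range(lo, hi + 1):
        first = read[pos:pos + sid_len]
        second = read[pos + sid_len + loop_len:pos + span]
        dist = edit_distance(first, reverse_complement(second))
        if best is None or dist < best[1]:
            best = (pos, dist)
    if best is not None and best[1] < threshold:
        return best
    return None
```

Once a position `pos` is returned, the measured identity sequence is `read[pos:pos + sid_len]`, and the two strand sequences lie before the first window and after the second window, respectively.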
Regardless of how the sample is identified, the measured identity sequence allows sample sets in a pool of samples to be separated from one another. Furthermore, accurate detection of the hairpin loop in the output sequence ensures correct adapter removal during adapter trimming processes in downstream computational analysis.

[0353] Additionally, a first strand sequence of a first strand of the double-stranded nucleic acid molecule and a second strand sequence of a second strand of the double-stranded nucleic acid molecule may be determined using the location of the hairpin loop in the output sequence. The first strand sequence and the second strand sequence represent the nucleotide sequence of the double stranded nucleic acid molecule, which is the target sequence used in downstream computational analysis.

B. Machine Learning Models for Identifying Different Components of Adapters

[0354] As previously described, adapters are a class of pre-defined sequences attached to DNA segments (e.g., DNA inserts) that occur in the sample obtained from the subject. They are sequenced together, and the adapter sequence can be used to identify the origin of the DNA molecule and establish the boundaries of the DNA segments. Adapters may be designed to have different components, which can include fixed or semi-fixed sequences for easy identification. These different components can provide different types of identifiers, examples of which are provided below. Some embodiments can use a first machine learning model to identify location(s) of one or more components of an adapter. A second machine learning model can be used to identify the correct sequence of the adapter (in case there are sequencing and/or base calling errors), thereby determining the origin of the DNA molecule.

1. Exemplary Adapter Architectures

[0355] FIGs.25A and 25B illustrate exemplary adapter architectures that may be used during sequencing.
As shown in FIG.25A, the adapter portion of the double stranded nucleic acid molecule comprises (i) an E-oligo (E+; 2505) and a Blocker oligo (B; 2510), (ii) a runway (R; 2515), (iii) SIDs (S1+ and S2+; 2520), (iv) stems (ST and ST’; 2525), (v) UMIs (U1, U1-, U2, U2-; 2530), and (vi) anchors (A+ and A-; 2535). The E-oligos 2505 on the 5’ ends of the parent and daughter strands function as primer binding sites for sequencing primers while the blocker-oligos 2510 function as a termination sequence during synthesis. The SIDs 2520 are selected from a pool of ‘x’-base fixed sequences, where ‘x’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. Moreover, the SIDs 2520 function as sample identifiers that distinguish one sample from another in a pool of samples. The stems 2525 are reverse complementary pairs (i.e., ST reverse complements ST’) that are also of an ‘s’-base fixed sequence, where ‘s’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. The UMIs 2530 are the sequences that fall between the stem and the anchor and function to distinguish between original molecules and PCR duplicates within a sample. UMI sequences 2530 may be either randomers or semi-randomers comprising about 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 bases. Similar to the stems 2525, the anchors 2535 are also reverse complementary pairs that have a ‘t’-base fixed sequence, where ‘t’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. Between the adapter components lies the nucleic acid molecule (e.g., double stranded DNA molecule) that is also referred to as an insert 2540. This is the segment to be sequenced.

[0356] FIG.25B shows another exemplary adapter structure for a linearized nucleic acid molecule.
The adapter architecture comprises (i) an E-oligo (E+; not shown) and a Blocker (B) oligo (not shown), (ii) a runway 2515, (iii) SIDs 2520, (iv) stems 2525, (v) UMIs 2530, (vi) overhangs 2545, and (vii) a DNA insert sequence 2540. In this example, the runway sequence 2515 is only on the 5’ end of the DNA insert sequence 2540 and is shown to have the sequence CAACAA. As before, the SIDs 2520 may be selected from a pool of ‘x’-base fixed sequences, where ‘x’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. In this example, ‘x’= 14 bases with SID 2520 comprising the nucleic acid bases GACAGAGACAGGCT, while its reverse complement, SID’ 2520’, comprises the nucleic acid bases TCGGACAGAGACAG. The stem sequence 2525 in this example comprises the nucleic acid bases GACGTGTGCTCTTCCGATCT on the 5’ end and AGATCGGAAGAGCGTCGTGT on the 3’ end. The UMIs 2530 are selected from a pool of ‘x’-base fixed sequences, where ‘x’ may be 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 bases. In this example, ‘x’= 12 random bases as denoted by ‘N’. The overhangs 2545, comprising the bases GT on the 5’ end and AC on the 3’ end, are used during adapter ligation to the insert DNA sequence 2540, which is the sequencing target.

2. Training, Testing, and Implementation of Machine Learning Models

[0357] During sequencing, there is a chance for sequencing errors to occur, such as during the physical/chemistry processing, signal measurement (e.g., optical or electrical), and/or in determining base calls. Such errors cause problems in identifying the positions of adapters in the sequence reads and thus in identifying the DNA segments. These problems are further exacerbated when the adapters include variable components.
To overcome these technical challenges, machine learning models may be used to account for different patterns and types of errors that may occur during the sequencing process so that the adapters and their various components may still be identified. As used herein, machine learning models are procedures that are run on datasets (e.g., training and validation datasets) and can perform pattern recognition on datasets, learn from the datasets, and/or are fit on the datasets. Examples of machine learning models include linear and logistic regression, decision trees, artificial neural networks, k-means, and k-nearest neighbor. The machine learning model can correspond to the program that, for a given model architecture, is saved after running an optimization technique (e.g., backpropagation or stochastic gradient descent) on training data and represents the rules, numbers, and any other algorithm-specific data structures required to make inferences. For example, a linear regression model may comprise a vector of coefficients with specific values, a decision tree model may comprise a tree of if-then statements with specific values, or a neural network model may comprise a graph structure with vectors or matrices of weights with specific values.

[0358] A wide range of model architectures suitable for different kinds of tasks and data may be used to identify different components of the adapter.
Examples of models include, without limitation, linear regression, logistic regression, decision tree, Support Vector Machines, Naive Bayes algorithm, Bayesian classifier, linear classifier, K-Nearest Neighbors, K-Means, random forest, dimensionality reduction algorithms, grid search algorithm, genetic algorithm, AdaBoost algorithm, Gradient Boosting Machines, and Artificial Neural Networks such as convolutional neural network (“CNN”), an inception neural network, a U-Net, a V-Net, a residual neural network (“Resnet”), a transform neural network, a recurrent neural network, a Generative adversarial network (GAN), or other variants of Deep Neural Networks (“DNN”) (e.g., a multi-label n-binary DNN classifier or multi-class DNN classifier). These models can be implemented using various machine learning libraries and frameworks such as TensorFlow, PyTorch, Keras, and scikit-learn, which provide extensive tools and features to facilitate model building, training, validation, and testing.

[0359] In order to train, validate, and test machine learning models, input data is collected and preprocessed if necessary. Data collection can include exploring various data sources such as public datasets, private data collections, or real-time data streams, depending on a project’s needs. In some instances, the collected data comprises sequencing read data generated from sequencing methods (e.g., Xpandomer sequencing as described with respect to section II). The read data can include the adapters and the DNA segment, where the adapter architecture may include the architectures described with respect to FIGs.25A and 25B. In some embodiments, the collected read data is used to synthesize new reads that comprise sequencing errors, which can occur during the physical/chemistry processing, signal measurement (e.g., optical or electrical), and/or in determining base calls.
Sequencing errors that can be introduced include, without limitation, insertions, deletions, and substitutions. Furthermore, techniques such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) may be used to generate new data examples. Another option for simulating library-prep errors and sequencing errors can include incorporating errors based on a hyper-parameter designated probability.

[0360] Once collected, generated, and preprocessed, the data may then be split into at least three subsets of data: training, validation, and testing. The training set is used to fit the model, where the machine learning model learns to make inferences based on the training data. The validation set, on the other hand, is utilized to tune hyperparameters and prevent overfitting by providing a sandbox for model selection. Finally, the test set serves as a new and unseen dataset for the model, used to simulate real-world application and evaluate the final model’s performance. The process of splitting ensures that the model can perform well not just on the data it was trained on, but also on new, unseen data, thereby validating and testing its ability to generalize. Various techniques can be employed to split the data effectively, with each method aiming to maintain a good representation of the overall dataset in each subset. A simple random split (e.g., 70/20/10%, 80/10/10%, or 60/25/15%) is the most straightforward approach, where examples from the data are randomly assigned to each of the three sets. However, more sophisticated methods may be necessary to preserve the underlying distribution of data.

[0361] Training and validation steps occur on a computer system that comprises a combination of specialized hardware and software to efficiently handle the computational demands required for training, validating, and testing machine learning models.
On the hardware side, high-performance GPUs (Graphics Processing Units) may be used for their ability to perform parallel processing, drastically speeding up the training of complex models, especially deep learning networks. CPUs (Central Processing Units), while generally slower for this task, may also be used for less complex model training or when parallel processing is less critical. TPUs (Tensor Processing Units), designed specifically for tensor calculations, provide another level of optimization for machine learning tasks. On the software side, a variety of frameworks and libraries are utilized, including TensorFlow, PyTorch, Keras, and scikit-learn. These tools offer comprehensive libraries and functions that facilitate the design, training, validation, and testing of a wide range of machine learning models across different computing platforms, whether local machines, cloud-based systems, or hybrid setups, enabling developers to focus more on model architecture and less on underlying computational details.

[0362] Training is the initial phase of developing machine learning models where the model learns to make predictions or decisions based on training data provided from the training and validation datasets. During this phase, the model iteratively adjusts its internal model parameters (e.g., weights, coefficients, trees, feature importance, and/or biases) to minimize or maximize an objective function (e.g., a loss function, a cost function, a contrastive loss function, a cross-entropy loss function, an Out-of-Bag (OOB) score, etc.). Various techniques may be used to perform the optimization. For example, to train machine learning models such as a neural network, optimization can be done using back propagation. The current error is typically propagated backwards to a previous layer, where it is used to modify the weights and bias in such a way that the error is minimized. The weights are modified using the optimization function.
[0363] Validating is another phase of developing machine learning models where the model is checked for deficiencies in performance and the hyperparameters are optimized based on validation data provided from the training and validation datasets. The validation data helps to evaluate the model's performance, such as accuracy, precision, recall, or F1-score, to gauge how well the model is likely to perform in real-world scenarios. Hyperparameter optimization, on the other hand, involves adjusting the settings that govern the model's learning process (e.g., learning rate, number of layers, size of the layers in neural networks) to find the combination that yields the best performance on the validation data. The validation process includes iterative operations of inputting the validation subset of data into the trained model(s) using a validation technique such as K-Fold Cross-Validation, Leave-one-out Cross-Validation, Leave-one-group-out Cross-Validation, Nested Cross-Validation, or the like, to fine-tune the hyperparameters and ultimately find the optimal set of hyperparameters.

[0364] Once a machine learning model has been trained and validated, it undergoes a final evaluation using test data provided from the training and validation datasets, which is a separate subset of the data that has not been used during the training or validation phases. This step is crucial as it provides an unbiased assessment of the model's performance in simulating real-world operation. The test dataset serves as new, unseen data for the model, mimicking how the model would perform when deployed in actual use. During testing, the model’s predictions are compared against the true values in the test dataset using various performance metrics such as accuracy, precision, recall, and mean squared error, depending on the nature of the problem (classification or regression).
This process helps to verify the generalizability of the model (its ability to perform well across different data samples and environments), highlighting potential issues like overfitting or underfitting and ensuring that the model is robust and reliable for practical applications. The machine learning models are fully validated and tested once the output predictions have been deemed acceptable by user defined acceptance parameters. Acceptance parameters may be determined using correlation techniques such as the Bland-Altman method and Spearman’s rank correlation coefficients and calculating performance metrics such as the error, accuracy, precision, recall, receiver operating characteristic curve (ROC), etc.

[0365] Deploying the machine learning models includes moving the models from a development environment (e.g., a training and validation subsystem, where it has been trained, validated, and tested), into a production environment where it can make inferences on real-world data. This step typically starts with the model being saved after training, including its parameters and configuration such as final architecture and hyperparameters. It is then converted, if necessary, into a format that is suitable for deployment, depending on the deployment environment. For instance, a model trained in a scientific computing environment such as Python might be converted into a Java-friendly format for integration into a larger enterprise application. Deployment can be conducted on various platforms, including on-premises servers or cloud environments like AWS, Azure, or Google. The foregoing description can apply to any machine learning model described herein.

3. Machine Learning Models for Locating an Adapter Sequence

[0366] As described above, prior to making inferences on real-world data, machine learning models are trained, and potentially validated and tested.
Data sets for each of these steps may be generated from a primary collection of data that is divided into training, validation, and testing datasets. The primary collection of data can be obtained from various data sources such as public datasets, private data collections, or real-time data streams, depending on a project’s needs. In some instances, the collected data comprises sequencing read data generated from sequencing methods (e.g., Xpandomer sequencing as described with respect to section II). The read data can include the adapters and the DNA segment, where the adapter architecture may include the architectures described with respect to FIGs.25A and 25B.

[0367] In some embodiments, the read data is used to synthesize or generate simulated read data that comprises sequencing errors, which can occur during the physical/chemistry processing, signal measurement (e.g., optical or electrical), and/or in determining base calls. Sequencing errors that can be introduced include, without limitation, insertions, deletions, and substitutions. Simulating library-prep errors and sequencing errors can include incorporating such errors based on a hyper-parameter designated probability. For example, one or more hyper-parameters may include the error rate at which one or more error types for generating deviations from the expected sequence are incorporated. The error rate may be based on known error rates associated with the sequencing method (e.g., Xpandomer synthesis), random error rate for SNV and/or indels during synthesis (e.g., based on probability distributions corresponding to the specific error), the rate of DNA damage, the sequence of the read (e.g., comprising homopolymers or repetitive regions of some k-mer in length), and the like as described in more detail with respect to FIGs.7, 9 and 10. In other instances, the error rate may be set by a user to match the error rate of the data they wish to analyze.
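Incorporating errors at a hyper-parameter designated probability, as described in paragraph [0367], can be sketched as below. This is a simplified illustration only: the function name and default rates are hypothetical, and the rates are applied uniformly and independently per base, whereas the disclosure also contemplates rates that depend on the sequencing method and on sequence context (e.g., homopolymers).

```python
import random

BASES = "ACGT"


def simulate_errors(seq, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, rng=None):
    """Return a copy of `seq` with random substitutions, insertions, deletions.

    Each rate is a per-base probability (the hyper-parameters). Pass a seeded
    random.Random instance as `rng` for reproducible simulated reads.
    """
    rng = rng or random.Random()
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: the base is dropped (no insertion is drawn)
        if r < del_rate + sub_rate:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # insertion after this position
    return "".join(out)
```

Applying the function to error-free reads that contain a known adapter layout yields pairs of (noisy read, true component positions) that could serve as labeled training examples; separate rates could be passed for the adapter and insert portions of a read.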
For example, the error rate hyper-parameter for any given error type may be set so that the mean value of the error rate is about 1-2% across (i) the whole read sequence (e.g., the adapters and the DNA segment), (ii) the whole or a portion of the whole adapter sequence, (iii) the whole or a portion of the whole DNA segment, or (iv) any combination thereof.
[0368] By training machine learning models (e.g., neural networks) with simulated data that incorporates error rates based on real data, the models learn the many nuances that can result in sequencing error. Accordingly, the neural network can learn that different motifs (e.g., k-mers) can trigger a particular error pattern (e.g., several errors in a cluster in a certain pattern all at once). As a result, the model can see past the pure number of errors, salvage that particular motif sequence, and output the correct error-free sequence. The models can accommodate these different k-mer dependent error modes, improving adapter detection and thus downstream data analysis.
[0369] FIG.26 shows a flowchart illustrating method 2600 for using machine learning models to segment components of adapter architecture from nucleic acid sequencing read data. The sequencing read data may be generated by any of the methods described herein, such as the sequencing methods described with respect to section II of the disclosure. The method 2600 depicted in FIG.26 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG.26 and described below is intended to be illustrative and non-limiting.
Although FIG.26 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.
[0370] At 2605, a sequence segment of a nucleic acid molecule is received. The sequence segment specifies nucleotides at positions within the nucleic acid molecule. Moreover, the sequence segment can be all, or a portion of the sequence read, which can be all or a portion of the nucleic acid molecule. The sequence read includes a first sequence portion corresponding to at least a portion of a nucleic acid segment from a biological sample (e.g., the nucleic acid insert) and a second sequence portion corresponding to an adapter segment that was added to the nucleic acid segment.
[0371] In various embodiments, the sequence segment may be equal to the length of the sequence read. As non-limiting examples, the sequence segment (and thus the sequence read) may have a mean, median, average, or absolute length of about 15bp to about 1000bp. For example, the sequence segment may be about 15bp, 16bp, 17bp, 18bp, 19bp, 20bp, 25bp, 50bp, 100bp, 150bp, 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, or about 1000bp or about any integer value between 15bp and 1000bp.
[0372] In other instances, the sequence segment is equal in length to a portion of the sequence read, including the starting and ending portions of the sequence read. For example, 100 bases may be sequenced at one end of a DNA molecule to obtain the sequence read, and a sequence segment of only 64 bases might be used. The sequence segment can be generated by cutting (e.g., extracting) a fixed number of bases “L” from the start and end positions of the sequence read.
The number of bases that are extracted depends on the size of the adapter sequence ligated to the end of the nucleic acid molecule, which is known in advance. Accordingly, the number of extracted bases cut from the start and end of the sequencing read may comprise 50, 55, 60, 65, 70, or 80 bases, or any whole number between 50 and 80 bases. In various embodiments, the number of extracted bases cut from the start and end of the sequencing read is 64 bases (e.g., ‘L’ = 64 bases). Accordingly, the sequence segment comprises 64 bases where some proportion of the bases correspond to at least a portion of the nucleic acid segment from a biological sample (e.g., the first sequence portion), while the second sequence portion corresponds to the adapter (also referred to as ‘extracted adapter sequence’) added to the nucleic acid segment.
[0373] In various embodiments, the nucleic acid molecule may be either deoxyribonucleic acid (DNA) molecules or ribonucleic acid (RNA) molecules and polymers thereof in either single- or double-stranded form. Additionally, the nucleic acid molecule can comprise combinations of deoxyribonucleic acids and ribonucleic acids. In some aspects, the nucleic acid molecule is a double stranded DNA molecule, which may also be referred to as an “insert” or “DNA insert”.
[0374] The nucleic acid molecule is obtained or provided from a biological sample and may include, but is not limited to, any cell, tissue or biological fluid comprising nucleic acid molecules. For example, the biological sample can be at least one cell, fetal cell(s), cell culture, tissue specimen, blood, serum, plasma, saliva, urine, tear, vaginal secretion, sweat, lymph fluid, cerebrospinal fluid, mucosa secretion, peritoneal fluid, ascites fluid, fecal matter, body exudates, umbilical cord blood, chorionic villi, amniotic fluid, embryonic tissue, multicellular embryo, lysate, extract, solution, and the like.
The biological sample can also include non-human sources, such as non-human primates, rodents and other mammals, other animals, plants, fungi, bacteria, and viruses.
[0375] At 2610, a first feature vector is generated using the nucleotides of the sequence segment at the positions within the nucleic acid molecule. The first feature vector can include a series of data items, where each data item in the series indicates a nucleotide at a corresponding position in the sequence segment. The series of data items represent encoded nucleotides that correspond to the nucleotide sequence of the nucleic acid molecule. The nucleotide sequence of the nucleic acid molecule may be encoded using various methods that convert categorical data into a numerical format (e.g., one-hot encoding). For example, the nucleotide bases A, T, C, and G can be converted into a binary numerical format such as 2-bit encoding or 4-bit encoding as non-limiting examples. For 2-bit encoding, the first feature vector would include a series of data items where the nucleotide bases may be encoded as follows: A: [0, 0]; C: [0, 1]; G: [1, 0]; T: [1, 1]. For 4-bit encoding, the first feature vector can include a series of data items where the nucleotide bases may be encoded as follows: A: [1, 0, 0, 0]; C: [0, 1, 0, 0]; G: [0, 0, 1, 0]; T: [0, 0, 0, 1]. In some embodiments, the sequence segment may be received in this format, and thus the generation of the first feature vector may simply use the sequence segment.
[0376] At 2615, a first adapter location in the sequence segment of a component of the adapter segment is determined by processing the first feature vector using a first machine learning model. The first feature vector is fed into a first machine learning model, which can be trained to: (i) confirm one or more components of the adapter exists; and (ii) provide a location for the one or more components of the adapter sequence, which may be done for all components of the adapter sequence.
In various embodiments, the first machine learning model can be a neural network, and more specifically a segmentation neural network. The second sequence portion, corresponding to the adapter, can include a plurality of components including a first component and a second component. The first component can be a fixed sequence (e.g., non-variable sequence) comprising the stems and/or anchor sequences of the adapter, and the first component corresponds to the first adapter location. The second component can be a variable sequence comprising SIDs, UMIs, and/or a portion of the DNA insert.
[0377] To confirm the presence of each adapter component, the first machine learning model can use segmentation techniques to partition the series of data items from the first feature vector (e.g., extracted adapter sequences) into meaningful regions, such as the individual components of the adapter sequence. The first machine learning model can locate the positions of fixed sequences in the adapter (e.g., the stems and/or anchors), which may be identified by their consistent sequence across all the adapters used. For example, with respect to FIG.25B, the stem component comprises the nucleic acid sequences GACGTGTGCTCTTCCGATCT on the 5’ end and AGATCGGAAGAGCGTCGTGT on the 3’ end. Accordingly, the first machine learning model identifies and segments the stem component of the adapter based on this known sequence. This same concept may be used to identify the location of the anchor sequence.
[0378] Once the fixed sequences are located, the variable sequences (e.g., SIDs, UMIs, and/or a portion of the DNA insert) are extracted based on their expected, known distances from the locations of the fixed sequences identified by the first machine learning model. In various embodiments, the method described herein may further comprise using the first adapter location to determine the variable sequence of the second component.
For example, at the 3’ end of the extracted adapter sequence, the end position of an SID is 10 bases to the right of the end position of the stem, assuming the SID is a fixed length of 10 bases. This process of variable sequence extraction continues until both SIDs, both UMIs, and/or the portion of the DNA insert are identified. In so doing, the first sequence portion of the nucleic acid segment, the second sequence portion corresponding to the adapter segment, or both may be determined by using the first adapter location in the sequence segment. Moreover, determining the first adapter location also indicates the origin of the nucleic acid segment (e.g., from the first sequence portion).
[0379] Additionally, because the locations of the variable sequences are inferred, they may be used to determine which sample and/or which molecule corresponds to the sequence.
[0380] The first machine learning model can generate an output vector in which each base is labeled with an integer (e.g., 0, 1, or 2) indicating whether the base belongs to a fixed sequence (e.g., stem, anchor) or to the rest of the adapter. As a non-limiting example, 1 may be used to label the stem, 2 to label the anchor, and 0 to label everything else. In so doing, the start stem and start anchor may be located, and based on this information, the locations of variable sequences may be inferred (as noted above). For each position, a probability can be determined for each classification. The integer label assigned to each base can be based on the probability, calculated by the first machine learning model, indicating whether the base corresponds to a stem sequence, an anchor sequence, or the rest of the adapter (e.g., other). The integer label of whichever category (e.g., stem, anchor, other) has the highest probability is assigned to that base.
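The labeling and offset-based extraction described in paragraphs [0378] to [0380] can be illustrated with a minimal Python sketch. It assumes the per-base class probabilities are already available as an array (the segmentation network itself is not modeled), and the function names, the toy 4-base "stem", and the toy read are hypothetical illustrations rather than anything specified in the disclosure.

```python
import numpy as np

# Label convention from the text: 0 = other, 1 = stem, 2 = anchor.
OTHER, STEM, ANCHOR = 0, 1, 2

def labels_from_probabilities(probs):
    """probs has shape (sequence_length, 3); return the most probable class per base."""
    return np.argmax(np.asarray(probs), axis=1)

def last_index_of(labels, target):
    """Index of the last base labeled `target`, or None if the label is absent."""
    hits = np.flatnonzero(np.asarray(labels) == target)
    return int(hits[-1]) if hits.size else None

def extract_sid_after_stem(sequence, labels, sid_length=10):
    """Take the `sid_length` bases immediately to the right of the stem's end."""
    stem_end = last_index_of(labels, STEM)
    if stem_end is None:
        return None
    sid = sequence[stem_end + 1 : stem_end + 1 + sid_length]
    # Simple length check standing in for a quality-control step.
    return sid if len(sid) == sid_length else None

# Toy read (hypothetical): 2-base flank, 4-base "stem", 10-base SID, 2-base flank.
sequence = "TT" + "GACG" + "ACGTACGTAC" + "GG"
# Hypothetical model output: high stem probability at positions 2..5 only.
probs = np.array([[0.1, 0.8, 0.1] if 2 <= i < 6 else [0.9, 0.05, 0.05]
                  for i in range(len(sequence))])
labels = labels_from_probabilities(probs)
sid = extract_sid_after_stem(sequence, labels)
print(labels.tolist(), sid)
```

Taking the arg-max per position mirrors the "highest probability" rule in the text; the fixed 10-base offset is the assumption stated in the example about SID length.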
[0381] A quality control step may be performed to confirm that the identified fixed sequences and/or variable sequences fall into an acceptable range of their expected lengths. Any that do not pass quality control can be removed from the analysis. Those sequence segments that do pass quality control may optionally be compared to a LUT or classified by a second machine learning model (described in more detail below) to determine what sample the sequence segments originated from.
4. Machine Learning Model for Classifying Variable Sequences
[0382] FIG.27 shows a flowchart illustrating method 2700 for using a machine learning model to classify components of adapter architecture from nucleic acid sequencing read data.
[0383] At 2705, the output vector from the first machine learning model comprising encoded/labeled fixed components, and non-labeled variable components is received. Recall that the variable components/sequences may be used to determine which sample and/or molecule the sequence corresponds to. To determine the sample and/or molecule, the variable sequences (e.g., segmented adapter components) in the output vector are used to generate a second feature vector. More specifically, the second feature vector may be generated using nucleotides at positions within the variable sequences segmented by the first machine learning model. Like the first feature vector, the second feature vector comprises a series of data items representing the encodings of the variable sequences.
[0384] At 2710, the second feature vector is processed by a second machine learning model. As an example, the second machine learning model can be a neural network (e.g., a classification neural network).
Processing the second feature vector by the second machine learning model can include: (i) encoding the second feature vector to a multidimensional data point of N dimensions; and (ii) comparing the multidimensional data point to a set of reference data points generated by applying the second machine learning model to a set of adapters used for the variable sequence in a sequencing run. This second step may also be used to determine to which sample and/or which molecule the sequence corresponds. a. Fixed Pool [0385] An objective of the second machine learning model is to match the variable sequences (e.g., SID, UMIs) to a pool of possible sequences. For example, the second machine learning model may be trained to match the SID sequence to a fixed pool of all the possible SID sequences. In various embodiments, the second machine learning model is trained using a simulated training set that was generated by randomly modifying a fixed pool of adapters for the variable sequence. Because the second machine learning model was trained on simulated SID data that comprises insertions, deletions, and substitutions to model data-preparation errors and sequencing errors, the model can match a sample SID, that may comprise errors, to a SID in the fixed pool. The second machine learning model can map the SID sequence to the sequence in the fixed pool that is closest in edit distance. The second machine learning model can output an identifier identifying a particular adapter from the fixed pool of adapters. [0386] In some implementations, if the extracted SID is shorter than its theoretical length, one or more additional values may be added to pad the extracted sequence. As a non-limiting example, the extracted SID may be padded with a 5th value: 4, on top of 0,1,2,3 for ATCG. 
The sequence with the closest edit distance is output and used to: (i) determine which sample the sequence read belongs to in a pool of samples; and/or (ii) determine where downstream processes, such as adapter trimming, may begin.
b. Arbitrary SID
[0387] Additionally or alternatively, a fixed pool of SIDs is not used and an arbitrary SID is used. In this approach, the second machine learning model encodes the measured sequence (presumptive SID) and an actual SID into fixed-dimension vectors, and the distance between the encoded vectors is used to approximate the edit distance. The distance can be determined between the measured sequence and each SID in the pool. The SID that has the minimal metric distance from the sequence (both after encoding) is deemed the correct classification.
[0388] A Q-score can be calculated as the metric distance between the sequence and the second-best SID minus the metric distance between the sequence and the best-matching SID. A minimum Q-score can be required for a sequence to be classified to ensure there is enough separation between the top two SID candidates. The minimum Q-score threshold can be set according to accuracy requirements and sequence length on a case-by-case basis. Arbitrary SID classification can leverage the knowledge of which SIDs are actually in an experiment, making it a Bayesian classification. For example, if there is only one SID in a run, then Bayesian classification is 100% accurate. However, if there is SID contamination or incorrect SID information is provided, the Bayesian classification is likely to yield false results. On the other hand, the fixed-pool model directly provides SID classification without considering which SIDs are in the experiment, and it remains unaffected by SID contamination.
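The Q-score rule in paragraph [0388] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoded vectors are taken as given (the trained encoder is not modeled), Euclidean distance stands in for the unspecified "metric distance", and the function name, SID names, and threshold value are hypothetical.

```python
import numpy as np

def classify_with_q_score(query_vec, sid_vecs, sid_names, min_q):
    """Return (best SID name or None, Q-score) for an already-encoded query.

    Q-score = distance to the second-best SID minus distance to the best SID.
    Requires at least two candidate SIDs. Euclidean distance is an assumption.
    """
    dists = np.linalg.norm(np.asarray(sid_vecs, dtype=float)
                           - np.asarray(query_vec, dtype=float), axis=1)
    order = np.argsort(dists)
    best, second = order[0], order[1]
    q_score = float(dists[second] - dists[best])
    # Reject the call when the top two candidates are not separated enough.
    return (sid_names[best] if q_score >= min_q else None), q_score

# Hypothetical 2-dimensional encodings of three known SIDs.
sid_names = ["SID_00", "SID_01", "SID_02"]
sid_vecs = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]

clear_call, clear_q = classify_with_q_score([0.0, 1.0], sid_vecs, sid_names, min_q=1.0)
ambiguous_call, ambiguous_q = classify_with_q_score([1.5, 2.0], sid_vecs, sid_names, min_q=1.0)
print(clear_call, clear_q, ambiguous_call, ambiguous_q)
```

In the second call the query is equidistant from two SIDs, so the Q-score is zero and no classification is made, which is the situation the minimum-Q-score requirement is meant to catch.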
[0389] In addition to the second machine learning model trained to classify SIDs from the second feature vector, a third machine learning model may optionally be fed the second feature vector to classify the UMI sequences. In various embodiments, the third machine learning model is a neural network. More specifically, the third machine learning model may be a classification neural network trained to classify UMI sequences. The third machine learning model has been trained in a similar process as the second machine learning model, where simulated UMI data is generated comprising insertions, deletions, and substitutions to model data-prep errors and sequencing errors. Furthermore, the third machine learning model can also apply the fixed pooling method and/or the arbitrary method described with respect to the second machine learning model. In various embodiments, the UMI in the second feature vector is mapped to a pool of UMI sequences that comprises at least 200 sequences.
[0390] Any of the methods described herein may use sequencing methods to generate the sequence reads, sequence segments, etc. used herein. The sequencing methods may be any one of the sequencing methods described in section II of the disclosure. In various embodiments, sequencing the first strand of the double-stranded nucleic acid molecule to obtain the first sequence of base calls can include: (i) measuring signals for a window of a compound corresponding to the first strand of the double-stranded nucleic acid molecule, wherein the compound comprises a plurality of units, each corresponding to a nucleotide; and (ii) determining a base call for a genomic position within the window by comparing the signals to known signal patterns corresponding to different nucleotides. Comparing the signals to known patterns corresponding to different nucleotides can be performed by a machine learning model trained using the known signal patterns.
In various instances, the compound may be the first strand of the double-stranded nucleic acid molecule with a reporter element corresponding to a nucleotide. In other instances, the compound may be a surrogate molecule created from the first strand of the double-stranded nucleic acid molecule, wherein the surrogate molecule includes one or more reporter elements corresponding to each nucleotide.
[0391] In various embodiments, sequencing the double-stranded nucleic acid molecule includes: (i) creating a surrogate molecule from the double-stranded nucleic acid molecule, wherein the surrogate molecule includes one or more reporter elements corresponding to each nucleotide; (ii) passing the surrogate molecule through a nanopore to obtain electrical signals; and (iii) determining the first sequence of base calls and the second sequence of base calls of nucleotides in the double-stranded nucleic acid molecule using the electrical signals.
[0392] In various embodiments, the methods described herein may further comprise repeating the sequencing method(s), repeating the method for using machine learning models to segment components of adapter architecture from nucleic acid sequencing read data, and/or repeating the method for using a machine learning model to classify components of adapter architecture from nucleic acid sequencing read data for at least 10,000 nucleic acid molecules.
[0393] In various embodiments, a computer product comprises a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause a computer system to perform any one of the methods described herein.
[0394] Table 7 below shows the accuracy of a portion of the data collected from an experimental sequencing run. The ground truth is obtained from the alignment approach.
Since there is only one SID in the run, the encoding classifier uses an artificial prior condition that there are 10 SIDs or 100 SIDs (including the real one) to test how the encoding classifier works. (Otherwise, it always predicts that single SID.)
Table 7
Metric | Classifier trained on a fixed pool (848 SIDs) | Encoding classifier (assume 10 SIDs) | Encoding classifier (assume 100 SIDs)
Yield | 107072 | 107399 | 105365
SID error rate | 8E-5 | 9E-6 | 5E-5
Running time | 69 seconds | 83 seconds | 340 seconds
[0395] Table 8 below shows the accuracy of a portion of the data collected from an additional experimental sequencing run. The ground truth is obtained from the alignment approach. In this run, there are six SIDs. The encoding classifier also simulates the case that there are 100 SIDs in the run.
Table 8
Metric | Classifier trained on a fixed pool (848 SIDs) | Encoding classifier (6 known SIDs) | Encoding classifier (assume 100 SIDs)
Yield | 117012 | 117855 | 116698
SID error rate | 2E-4 | 0 | 8E-6
Running time | 95 seconds | 93 seconds | 581 seconds
C. Detecting Adapters using Frequency-Based Methods
[0396] Described herein are methods based on properties of mathematical (integral) transforms (e.g., Fourier transforms) to detect and identify adapter sequences (e.g., hairpin adapters or adapters at the end) in DNA fragments. Frequency-based algorithms are used to analyze functions or signals with respect to frequency, rather than time. For this to occur, signals represented as a function of time in the time domain are converted into frequencies in the frequency domain, where the signal is represented by its constituent frequencies and their respective amplitudes and phases. Various basis functions may be used, such as sines, cosines, plane waves, wavelets, and the like.
Non-limiting examples of frequency-based algorithms that may be contemplated in the below described methods include Fourier transform (including Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT)), wavelet transform (including Continuous Wavelet Transform (CWT) and Discrete Wavelet Transform (DWT)), Short-Time Fourier Transform (STFT), Z-transform, Laplace transform, Goertzel Algorithm, Hilbert-Huang Transform (HHT), Cepstrum Analysis, Spectrogram Analysis, Autoregressive (AR) Models, Cross-Spectral Analysis, Principal Component Analysis (PCA) in the Frequency Domain, Filter Design Algorithms (including FIR (Finite Impulse Response) Filters and IIR (Infinite Impulse Response) Filters), Convolution and Correlation (including Frequency Domain Convolution and cross-correlation), Music Information Retrieval (MIR) Algorithms, Harmonic Analysis, and Frequency Modulation (FM) and Demodulation. One particular advantage of using a frequency-based algorithm is that, in the frequency domain, the cross-correlation for each adapter in a set of adapter sequences translates into a multiplication. In so doing, the computationally costly direct cross-correlation analysis is replaced by the multiplication, which is far cheaper to compute.
1. Encoding
[0397] Before transforming a sequence into the frequency domain, the bases in the sequence can be encoded into a particular storage representation. Sequencing reads (e.g., nucleotide sequences comprising a DNA insert and at least one adapter sequence) can be encoded using a variety of techniques. With respect to frequency-based methods, encoding can enhance data representation, increase storage efficiency, and increase processing speed.
[0398] For example, sequencing reads, adapter sequences, SIDs, and the like, may be encoded into complex numbers that are mapped to a space in a complex plane.
FIG.28 shows an example of nucleotide bases A, C, G, T represented in a complex plane, where the x-axis represents the real part (Re) and the y-axis represents the imaginary part (Im) of the complex numbers. The mapping for a signal of a given base may be determined based on the position that maximizes the correlation with itself and minimizes the correlation with all other signals.
[0399] In other embodiments, sequencing reads, adapter sequences, SIDs, and the like, may be encoded into 4-dimensional space, which corresponds to one-hot encoding. A point in 4-space can be represented by a tuple of four coordinates (e.g., x, y, z, w), where each coordinate represents a different dimension. As a non-limiting example, {A, C, G, T} can be mapped to 4-space as follows:
A: [1, 0, 0, 0]
C: [0, 1, 0, 0]
G: [0, 0, 1, 0]
T: [0, 0, 0, 1]
[0400] 2. Identifying Start and/or End of Adapter Sequences
[0401] An adapter sequence (e.g., each adapter of a set) may be compared to a sequence read to determine a cross-correlation between the two sequences and determine the location of the best match, thereby identifying a location of the adapter. Although the exact sequence of the adapter in the sequence read is unknown, the adapter sequence originates from a set of adapter sequences whose sequences are known. The location of the adapter sequence can be shifted to identify the location corresponding to the maximum cross-correlation, which corresponds to the adapter location in the sequence read. However, determining cross-correlation using this method is computationally expensive.
[0402] Alternatively, the cross-correlation may be determined in the frequency space, which improves the efficiency of the computation.
When calculating a frequency-based cross-correlation between signals of different lengths, the most common approach is to zero-pad the shorter signal to match the length of the longer signal before performing the Fourier transform and cross-correlation in the frequency domain; this allows the signals to be compared on the same scale despite their differing lengths. To make the lengths of the sequences equal, zeros are added to the shorter sequence before (or optionally after) the actual sequence.
[0403] Once the resulting sequence signals are the same length, e.g., padded if necessary, they can be encoded (e.g., as described above) and transformed into the frequency-space using frequency-based techniques (e.g., Fourier transformations including Fast Fourier Transform (FFT), Z-transformations, Laplace transformations, and the like). The frequency transform can be applied to the encoded signal. This transformation decomposes the function or signal into its constituent waveform components, each characterized by a specific frequency, amplitude, and phase.
[0404] After the signals are encoded and transformed, a frequency-based cross-correlation is determined using the frequency-encoded sequences. The cross-correlation is computed in the frequency domain by multiplying the first frequency-encoded signal with the complex conjugate of the second frequency-encoded signal. The complex conjugate is used because in the frequency domain, cross-correlation is equivalent to multiplying one signal with the complex conjugate of the other's frequency-encoded signal. The result of this multiplication is an array representing the cross-correlation in the frequency domain.
[0405] The frequency-domain cross-correlation array can then be transformed (by applying the Inverse Fast Fourier Transform (IFFT)) to bring it back to the time domain. Because the transform is invertible, no information is lost in the transformation processes, and this step gives the cross-correlation signal.
The cross-correlation signal provides several key pieces of information: (i) the overall cross-correlation signal (made up of individual nucleotide-space signals) corresponding to the similarity between the signals (e.g., the adapter and the sequence read); and (ii) the nucleotide-space signal for each nucleic acid base comprising the sequence of the adapter sequence and the read sequence, the latter being a maximum at the base position (location) of the start or end of the adapter in the sequence read.
[0406] As a cross-correlation is determined for each adapter in a pool used in the sequencing library relative to a given sequence read, which adapter is most similar to a given read can be determined. The term "similarity index" refers to a measure that quantifies the degree of similarity between two signals or data sets and is derived from the cross-correlation signal, e.g., the value of the cross-correlation signal at each position. In other words, the maximized index represents the point of maximum similarity between the two signals (i.e., how much one signal needs to be shifted for the best alignment with the other). The cross-correlation signal can be analyzed to find the maximum similarity index (e.g., the highest peak signal) between the two signals (e.g., the adapter sequence and the read sequence) for each adapter. The overall highest maximum similarity index, which generally should be higher by a large amount, can identify the location/position of the adapter in the sequence read. The maximized index can be determined by taking the absolute value of the cross-correlation array. In other embodiments, depending on the type of mapping used, the maximized index can be determined by comparing the real or imaginary components of the cross-correlation array rather than the absolute value. The output of this process can include two things: the index of maximum similarity and the cross-correlation signal itself.
Identification of the overall maximum similarity index across the pool of adapters used in the sequencing library reveals where the start of the adapter sequence is in the read sequence (e.g., large peak), and also the sequence of the adapter, and thus which adapter is used in the read sequence.
[0407] FIGs.29A-G show graphs displaying the correlation signals after FFT processing for seven different candidate adapter sequences at every position of the read construct. Compared to adapter sequences with SID numbers 01-06 (FIGs.29B-G), SID number 00 (FIG.29A) has the highest correlation value near base 210. This maximum similarity index indicates 1) the adapter sequence added to the read sequence comprises the SID sequence associated with SID 00, which is a known sequence, and 2) the center location of the adapter sequence. Moreover, the peak signal near base 210 also indicates the “shift” or the translation of a signal in the time (or spatial) domain and its corresponding effect in the frequency domain. The method used to find the starting position of the adapter is also used on the reverse complement portion of the read construct to find the end position.
[0408] To identify the end position of the adapter sequence, the reverse of the sequence read and the adapter can be taken, and then encoded, transformed, and cross-correlated. The highest peak (maximum similarity index) for this reverse analysis provides the position of the ending of the adapter sequence.
[0409] As an example, the cross-correlation can be applied between (i) the read sequence and the hairpin, (ii) the read sequence and the SID, (iii) the read sequence and the reverse complement SID, or (iv) any combination thereof. Then the matching score of each cross-correlation is determined, where the score for a particular cross-correlation is equal to the maximum value of the cross-correlation signal divided by the length of the sequence (e.g., hairpin, SID, etc.).
All three scores are combined along with the location of the peaks to find the maximized similarity index and eliminate false peaks. The scores may be combined by taking the sum of scores for each cross-correlation for each hairpin and setting a threshold (e.g., a sum of the individual thresholds for each cross-correlation). In other embodiments, the scores may be combined via other well-known techniques such as a weighted sum, a normalized sum, or the like. This method is especially helpful when there are similar-looking peaks for two or more hairpin sequences. This could happen due to missing and/or extra bases in the SID or the reverse complement SID of the ground truth sequence. Combining the scores helps address this issue and provides extra confidence in the result.

[0410] FIG.30 shows a flowchart illustrating method 3000 for determining the location and the sequence of an adapter in a sequencing read using cross-correlation frequency-based methods. The method 3000 depicted in FIG.30 may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device). The method presented in FIG.30 and described below is intended to be illustrative and non-limiting. Although FIG.30 depicts the various processing steps occurring in a particular sequence or order, this is not intended to be limiting. In certain alternative embodiments, the steps may be performed in some different order, or some steps may also be performed in parallel.

[0411] At 3005, an adapter sequence and a sequencing read of a nucleic acid molecule are received. The adapter sequence can be one of a set of at least 20 unique known adapter sequences.
The set of adapter sequences can comprise a variety of adapter architectures such as hairpin adapters, Y-adapters, dumbbell adapters, the like, or any combination thereof. The sequence read specifies nucleotides at positions within the nucleic acid molecule. Additionally, the sequence read includes a first sequence portion corresponding to at least a portion of a nucleic acid segment from a biological sample (e.g., a DNA insert) and a second sequence portion corresponding to the adapter that was added to the nucleic acid segment.

[0412] At 3010, the nucleotides of the sequence read are encoded into a first series of nucleotide encodings, wherein each nucleotide has a different encoding. The set of adapter sequences is also encoded into a set of second series of nucleotide encodings, where each adapter sequence in the set of adapter sequences has a different encoding. In various embodiments, the different nucleotide encodings (i) do not overlap with each other, (ii) are orthogonal to each other, (iii) are in complex space (see FIG.28), or (iv) use at least four dimensions (see Encoding section above).

[0413] At 3015, the nucleotide encodings of the first series of nucleotides and the set of second series of nucleotides are transformed into a first frequency domain signal and a set of second frequency domain signals, respectively. In various embodiments, the transformation is done using a frequency-based algorithm (e.g., Fourier transformations, Z-transformations, Laplace transformations, and the like). In various embodiments, the frequency-based algorithm is the Fast Fourier Transform (FFT) algorithm.

[0414] At 3020, a frequency-domain cross-correlation signal is determined between the first frequency domain signal and each of the second frequency domain signals from the set of second frequency domain signals. Each of the frequency-domain cross-correlation signals is then transformed into a time domain signal to obtain cross-correlation signals.
As such, there is one time domain signal, and thus one cross-correlation signal, for each of the frequency-domain cross-correlation signals.

[0415] At 3025, a maximum similarity index, determined from the cross-correlation signals, is used to determine the location and the sequence of the true adapter sequence corresponding to the adapter that was added to the nucleic acid segment. Furthermore, the adapter location can be used to determine a segment location of the nucleic acid segment, wherein the adapter location can be a start position of the adapter in the sequence read.

[0416] In various embodiments, the adapter sequence from 3005 is a first adapter of a set of adapters used in a sequencing library. Accordingly, the method described herein can further comprise repeating the process of obtaining the cross-correlation signal for other adapters in the set of adapters. The cross-correlation signals of the first adapter and the other adapters in the set of adapters can be compared to determine which adapter sequence has the highest (e.g., maximum) cross-correlation signal. In various embodiments, the first adapter has the highest cross-correlation signal among the set of adapters.

[0417] The methods described herein (e.g., sequence padding, Fourier transformations, mapping, IFFT, and cross-correlation analysis) may be repeated for the reverse complement of the sequence read and the adapter to find the ending position of the adapter sequence. In various embodiments, the first series of nucleotide encodings and the second series of nucleotide encodings are reversed. As such, the adapter location is an end position of the adapter in the sequence read.

3. Methods for Processing Hairpin Loop Sequences with Errors

[0418] In some instances, it is still possible (theoretically and chemically) that a ‘perfect’ loop is detected using the above method when in fact the loop comprises deletions, despite perfect SID complementation on both sides.
In this instance, the cross-correlation exhibits two equally sized peaks. FIG.31 shows an example of how the cross-correlation signal would look in real space of the IFFT where a number of bases are deleted from the loop. In this example, similar peaks are exhibited in different cross-correlations, creating a situation where it is not readily apparent which one is correct. In such cases, the peak location from autocorrelation can help identify the correct cross-correlation and, consequently, the correct adapter sequence. In order to still accurately identify the location of the adapter, various methods may be used and are outlined below.

[0419] A first method may include measuring the distance between both peaks and ensuring they are within the tolerance of both each other and the location of the loop based on autocorrelation (described in more detail below).

[0420] A second method may include choosing to err on the side of dropping bases (choose the first peak) or to err on the side of possibly including hairpin bases in the inserts (choose the second peak). The former is a more conservative approach, while the latter is more aggressive but would manifest itself in soft clips downstream, which could be hard to differentiate for secondary analysis.

4. Identifying Hairpin Adapter in Middle of Sequence Read

[0421] In some instances, the cross-correlation method described above does not generate an interpretable or ideal result for the known adapter sequences and the base called sequence, as shown in FIG.32. Therefore, in addition to or as an alternative to the cross-correlation method, an autocorrelation method is used to identify the center location and sequence of the adapter construct.

[0422] FIG.33 shows a graph illustrating how autocorrelation methods may be used to find the center location of an adapter sequence when cross-correlation analysis does not produce conclusive results.
The x-axis indicates the “lag” (i.e., how much a signal was shifted), and the y-axis indicates the magnitude of the autocorrelation signal for each discrete amount of “lag.” As shown, the autocorrelation signal is very low at every lag except near 200 bases of “lag.” Near base 200, however, a peak in the autocorrelation signal is observed, which indicates a position of the hairpin adapter because the location of the peak in the autocorrelation signal shows the best alignment of the sequence and its reverse complement.

[0423] Autocorrelation can be used to analyze the symmetrical nature of HD sequence reads. The autocorrelation analysis is performed between the original base called sequence and its reverse complement, e.g., when the only pattern that cannot match in the signal is the hairpin. Autocorrelation uses the same process as the cross-correlation and thus has the same computational complexity. Accordingly, it will not add much computational cost to the overall algorithm while helping to strengthen the detection and classification of the adapter sequence.

[0424] FIG.34 shows a flowchart illustrating method 3400 for determining the location and the sequence of an adapter in a sequencing read using autocorrelation frequency-based methods.

[0425] At 3405, a sequence read of a nucleic acid molecule comprising two strands is received. The sequence read specifies nucleotides at positions within the nucleic acid molecule. In various embodiments, the sequence read includes first sequence portions corresponding to two nucleic acid segments from a biological sample and a second sequence portion corresponding to the adapter that was added between the two nucleic acid segments. In various embodiments, the sequence read includes multiple copies of at least one strand of the two strands of the nucleic acid molecule. In various embodiments, the adapter indicates an origin of the nucleic acid segment.
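As an illustration only, the autocorrelation idea of paragraphs [0422] and [0423] (correlating a read with its own reverse complement in the frequency domain and reading the hairpin center off the peak) can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the one-hot four-channel encoding is just one of the orthogonal encodings the text allows, the small radix-2 FFT stands in for any FFT library, and the names `fft`, `match_counts`, `hairpin_center`, and the toy sequences are hypothetical.

```python
import cmath

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.
    The inverse transform is left unscaled (the caller divides by n)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def match_counts(x, y):
    """Return {lag: number of positions j with x[j + lag] == y[j]}.
    Computed as IFFT(FFT(x) * conj(FFT(y))), summed over four one-hot channels."""
    size = 1
    while size < len(x) + len(y):  # zero-pad so the correlation does not wrap
        size *= 2
    total = [0j] * size
    for base in BASES:  # assumed encoding: one orthogonal channel per nucleotide
        fx = fft([1.0 if b == base else 0.0 for b in x] + [0.0] * (size - len(x)))
        fy = fft([1.0 if b == base else 0.0 for b in y] + [0.0] * (size - len(y)))
        total = [t + a * b.conjugate() for t, a, b in zip(total, fx, fy)]
    corr = fft(total, inverse=True)
    return {lag: round(corr[lag % size].real / size)
            for lag in range(-(len(y) - 1), len(x))}


def hairpin_center(read):
    """Locate the symmetry center of a read from its reverse-complement correlation."""
    counts = match_counts(read, reverse_complement(read))
    best_lag = max(counts, key=counts.get)
    # A match at lag s pairs read[i] with the complement of read[N - 1 - i + s],
    # so the symmetry center sits at (N - 1 + s) / 2 (0-based, may be a half index).
    center = (len(read) - 1 + best_lag) / 2
    return best_lag, center, counts[best_lag]


# Toy construct (hypothetical sequences): prefix + insert + hairpin + revcomp(insert)
prefix, insert, hairpin = "TTGACA", "ACGGTCATTGCAGGCTTAAC", "CTCTAAAA"
read = prefix + insert + hairpin + reverse_complement(insert)
best_lag, center, matches = hairpin_center(read)
print(best_lag, center, matches)
```

In this toy read the hairpin occupies 0-based positions 26 to 33, so the reported center falls in the middle of the hairpin. The same `match_counts` helper can be called with a candidate adapter as the second argument, in which case the lag of the peak is the adapter start position and the peak height divided by the adapter length corresponds to the matching score described for the cross-correlation method.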
[0426] At 3410, the nucleotides of the sequence read are encoded into a first series of nucleotide encodings, wherein each nucleotide has a different encoding. In various embodiments, the different nucleotide encodings (i) do not overlap with each other, (ii) are orthogonal to each other, (iii) are in complex space (see FIG.28), or (iv) use at least four dimensions (see Encoding section above). Additionally, the reverse complement of the sequence read is also encoded into a second series of nucleotide encodings. This takes advantage of the symmetrical nature of the sequence read. The encoding into different coordinates can use vectors (e.g., independent vectors, orthogonal vectors, normalized vectors, orthonormal vectors, and the like) to be mapped in four-dimensional space or use coordinates in complex space (see FIG.28).

[0427] At 3415, the first series of nucleotide encodings and the second series of nucleotide encodings are transformed into a first and a second frequency domain signal, respectively. The transformation can be performed using various frequency-based functions such as Fourier transform, wavelet transform, STFT, Z-transform, Laplace transform, Goertzel Algorithm, HHT, Cepstrum Analysis, Spectrogram Analysis, AR Models, Cross-Spectral Analysis, PCA in the Frequency Domain, Filter Design Algorithms, Convolution and Correlation, MIR Algorithms, Harmonic Analysis, and FM and Demodulation.

[0428] At 3420, a frequency-domain autocorrelation signal between the first frequency domain signal and the second frequency domain signal is determined.

[0429] At 3425, the frequency-domain autocorrelation signal is transformed into a time domain signal to obtain an autocorrelation signal.

[0430] At 3430, an adapter location of the adapter is determined using a maximum of the autocorrelation signal. In various embodiments, the adapter location corresponds to the middle of the adapter.
In various embodiments, the adapter is optionally a hairpin adapter. Once the location of the center of the adapter (e.g., SID + hairpin) structure in the sequence is detected using autocorrelation, the specific adapter sequence can be narrowed down using the location and known adapter sequences from the set of adapter candidates.

[0431] Moreover, although the methods provided herein describe finding the location of one adapter, multiple adapter locations within a single read sequence may be identified. This occurs when a read sequence comprises adapters at the 3’ and 5’ ends of the read sequence, or if the read sequences are two pass HDD read constructs (described with respect to FIGs.6A and 6B), four pass HDD read constructs (described with respect to FIGs.8A-8D), or “n” pass HDD read constructs (described with respect to FIGs.11A-11C) described with respect to section III (e.g., Alternative HDD Read constructs). Accordingly, multiple maxima in the cross-correlation signal are observed. Each maximum corresponds to a different location of the adapter in the entire sequence read, which can include multiple copies of one or both of the DNA inserts.

[0432] The methods for adapter detection may be implemented in software (e.g., code, instructions, program) executed by one or more processing units (e.g., processors, cores) of the respective systems, hardware, or combinations thereof (e.g., the intelligent selection machine). The software may be stored on a non-transitory storage medium (e.g., on a memory device). The one or more processors include CPUs, GPUs, TPUs, FPGAs, DSPs, ASICs, MCUs, NPUs, vector processors, quantum processors, FPAAs, SoCs, the like, or any combination thereof. In various embodiments, the methods for adapter detection are executed using one or more GPUs.
As described, the adapter sequence added to a nucleic acid molecule (e.g., a DNA insert) is part of a set of adapters comprising unique SID sequences to distinguish one sample from another in a pool of samples. Accordingly, a pool of samples can comprise a set of SIDs (e.g., at least 20, 30, 40, 50, 60, 70, 80, 90, 100, or more) that may be detected using any of the methods described herein.

[0433] Processing of the set of SIDs can be completed in parallel, e.g., on a GPU, because the operations are element-wise and GPUs are well suited to element-wise computation. More specifically, a GPU may include hundreds or even thousands of parallel processing cores that allow it to handle multiple tasks simultaneously. Other architectures besides GPUs may also be used, e.g., processors having single instruction multiple data (SIMD).

[0434] Although not explicitly described, one of skill in the art can appreciate how the aforementioned demultiplexing methods (i.e., adapter detection via (i) dual sliding window algorithm, (ii) machine learning models, and (iii) frequency-based methods) can be modified to detect other adapter constructions not described in FIGs.25A and 25B. For example, hairpin adapters, Y-Open-Hairpin-adapters, dumbbell adapters, and other adapter architectures specific to certain sequencing methods may also be contemplated. Moreover, the aforementioned demultiplexing methods may also be modified to detect the one or more adapter sequences within the two pass HDD read constructs (described with respect to FIGs.6A and 6B), the four pass HDD read constructs (described with respect to FIGs.8A-8D), and the “n” pass HDD read constructs (described with respect to FIGs.11A-11C) described with respect to section III (e.g., Alternative HDD Read constructs).
Modification of the aforementioned demultiplexing methods would not unreasonably broaden the scope of the described methods, as all three methods take advantage of the natural sequence symmetry that is inherent to the sequencing constructions described herein.

[0435] Provided herein is a system comprising the computer product of any one of the disclosed methods and one or more processors configured to execute the instructions of any of the disclosed methods stored on the computer readable medium. The system comprises the means for performing any of the disclosed methods as well as one or more processors configured to perform any of the disclosed methods. In various embodiments, the system comprises modules that respectively perform the steps of any of the disclosed methods.

[0436] Provided herein is a sequencing device for determining consensus sequences of double-stranded nucleic acid molecules. The sequencing device comprises a set of sequencing cells (e.g., at least 10,000 sequencing cells), each configured to perform (i) sequencing of a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements, and (ii) sequencing of a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements. The sequencing device also comprises a comparator circuit electrically connected with the set of sequencing cells. The comparator circuit, for each of the double-stranded nucleic acid molecules, is configured to perform the process of (i) receiving the first sequence of base measurements and the second sequence of base measurements and (ii) generating a consensus sequence using base call values. For each of a plurality of positions of the double-stranded nucleic acid molecule, one or more of the first base measurements are compared to one or more of the second base measurements.
The comparison allows for the base call value to be determined for the consensus sequence. Finally, the sequencing device comprises a transmitter configured to transmit the consensus sequence to a computer system.

[0437] In various embodiments, comparing a first base measurement to a second base measurement comprises (i) determining a first base call using the one or more of the first base measurements, (ii) determining a second base call using the one or more of the second base measurements, and (iii) comparing the first base call and the second base call.

[0438] In various embodiments, the comparator circuit is further configured to determine whether a position of the plurality of positions is concordant or discordant based on the comparing, wherein the base call value is dependent on whether the position is concordant or discordant.

[0439] In various embodiments, a number of bits used for the base call value is dependent on whether the position is concordant or discordant.

[0440] In various embodiments, the comparator circuit is further configured to generate metadata identifying which positions are discordant, and wherein the consensus sequence includes the metadata.

[0441] In various embodiments, the set of sequencing cells and the comparator circuit are on a same printed circuit board.

[0442] In various embodiments, the set of sequencing cells and the comparator circuit are on a same integrated circuit.

XIV. Example Systems

[0443] FIG.35 illustrates a measurement system 3500 according to an embodiment of the present disclosure. The system as shown includes a sample 3505, such as Xpandomers, within an assay device 3510, where an assay 3508 can be performed on sample 3505. For example, sample 3505 can be contacted with reagents of assay 3508 to provide a signal (e.g., an intensity signal) of a physical characteristic 3515 (e.g., sequence information of a cell-free nucleic acid molecule).
Assay 3508 may include sequencing by expansion with an assay device 3510. An example of an assay device 3510 can be a well plate that includes Xpandomers. Physical characteristic 3515 (e.g., a fluorescence intensity, a voltage, or a current) from the sample is detected by detector 3520. Detector 3520 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal. In one embodiment, an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.

[0444] Assay device 3510 and detector 3520 can form an assay system, e.g., a PCR system or a sequencing system that performs sequencing according to embodiments described herein. A data signal 3525 is sent from detector 3520 to logic system 3530. As an example, data signal 3525 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA). Data signal 3525 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 3505, and thus data signal 3525 can correspond to multiple signals. Data signal 3525 may be stored in a local memory 3535, an external memory 3540, or a storage device 3545. The assay system can comprise multiple assay devices 3510 and detectors 3520.

[0445] Logic system 3530 may be, or may include, a computer system, ASIC, microprocessor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.). Logic system 3530 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 3520 and/or assay device 3510.
Logic system 3530 may also include software that executes in a processor 3550. Logic system 3530 may include a computer readable medium storing instructions for controlling system 3500 to perform any of the methods described herein. For example, logic system 3530 can provide commands to a system that includes assay device 3510 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order. Such physical operations may be performed by a robotics system, e.g., including a robotic arm, as may be used to obtain a sample and perform an assay 3508. Logic system 3530 can perform any steps of methods described herein that perform computer processing.

[0446] Measurement system 3500 may also include a treatment device 3560, which can provide a treatment to the subject. Treatment device 3560 can determine a treatment and/or be used to perform a treatment. Examples of such treatment can include surgery, radiation therapy, chemotherapy, immunotherapy, targeted therapy, hormone therapy, and stem cell transplant. Logic system 3530 may be connected to treatment device 3560, e.g., to provide results of a method described herein. The treatment device may receive inputs from other devices, such as an imaging device and user inputs (e.g., to control the treatment, such as controls over a robotic system).

[0447] Measurement system 3500 may also include a reporting device 3555, which can present results of any of the methods described herein, e.g., as determined using the measurement system 3500. Reporting device 3555 can be in communication with a reporting module within logic system 3530 that can aggregate, format, and send a report to reporting device 3555. The reporting module can present information determined using any of the methods described herein.
The information can be presented by reporting device 3555 in any format that can be recognized and interpreted by a user of the measurement system 3500. For example, the information can be presented by reporting device 3555 in a displayed, printed, or transmitted format, or any combination thereof.

[0448] Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in the computer system of FIG.36. In some embodiments, the computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system can include multiple computer apparatuses, each being a subsystem, with internal components. A computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.

[0449] The subsystems shown in FIG.36 are interconnected via a system bus 3675. Additional subsystems such as a printer 3674, keyboard 3678, storage device(s) 3679, monitor 3676 (e.g., a display screen, such as an LED), which is coupled to display adapter 3682, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 3671, can be connected to the computer system by any number of means known in the art such as input/output (I/O) port 3677 (e.g., USB, FireWire®). For example, I/O port 3677 or external interface 3681 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system 3610 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus 3675 allows the central processor 3673 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 3672 or the storage device(s) 3679 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
The system memory 3672 and/or the storage device(s) 3679 may embody a computer readable medium. Another subsystem is a data collection device 3685, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.

[0450] A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 3681, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components. In various embodiments, methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices. Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages. Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.

[0451] Aspects of embodiments can be implemented in the form of control logic using hardware circuitry (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software stored in a memory with a generally programmable processor in a modular or integrated manner, and thus a processor can include memory storing software instructions that configure hardware circuitry, as well as an FPGA with configuration instructions or an ASIC.
As used herein, a processor can include a single-core processor, multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked, as well as dedicated hardware. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present disclosure using hardware and a combination of hardware and software.

[0452] Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk) or Blu-ray disk, flash memory, and the like. The computer readable medium may be any combination of such devices. In addition, the order of operations may be re-arranged. A process can be terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

[0453] Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
As such, a computer readable medium may be created using a data signal encoded with such programs. Computer readable media encoded with the program code may be packaged with a compatible device (e.g., as firmware) or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.

[0454] Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor may be performed in real-time. The term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. As examples, a time constraint may be 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 4 hours, 1 day, or 7 days. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or at different times or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.

[0455] The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the disclosure.
However, other embodiments of the disclosure may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.

[0456] The above description of example embodiments of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form described, and many modifications and variations are possible in light of the teaching above.

[0457] A recitation of "a", "an" or "the" is intended to mean "one or more" unless specifically indicated to the contrary. The use of “or” is intended to mean an “inclusive or,” and not an “exclusive or” unless specifically indicated to the contrary. Reference to a “first” component does not necessarily require that a second component be provided. Moreover, reference to a “first” or a “second” component does not limit the referenced component to a particular location unless expressly stated. The term “based on” is intended to mean “based at least in part on.”

[0459] All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes. None is admitted as prior art. Where a conflict exists between the instant application and a reference provided herein, the instant application shall dominate.
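By way of illustration, the concordant/discordant comparison described for the comparator circuit in paragraphs [0437] to [0440] can be sketched in software as follows. This is a minimal sketch under stated assumptions rather than the disclosed circuit: the two strands are assumed to be already aligned position by position, the numeric codes assigned to each base pair are arbitrary, and the names `encode_partial_consensus` and `decode_partial_consensus` are hypothetical. The sketch uses 2 bits for each of the four concordant pairs, 4 bits for each of the twelve discordant pairs, and a list of discordant positions as metadata.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Four concordant pairs (first-strand base, second-strand base): A<>T, C<>G, G<>C, T<>A.
CONCORDANT = {(b, COMPLEMENT[b]): code for code, b in enumerate(BASES)}
# The remaining twelve pairs are discordant; the code numbering is an assumption.
DISCORDANT = {
    pair: code
    for code, pair in enumerate(
        (a, b) for a in BASES for b in BASES if COMPLEMENT[a] != b
    )
}
CONCORDANT_BITS, DISCORDANT_BITS = 2, 4


def encode_partial_consensus(first, second):
    """Encode aligned base calls; return (codes, discordant_positions, total_bits)."""
    codes, discordant_positions, total_bits = [], [], 0
    for position, pair in enumerate(zip(first, second)):
        if pair in CONCORDANT:
            codes.append(CONCORDANT[pair])
            total_bits += CONCORDANT_BITS
        else:
            codes.append(DISCORDANT[pair])
            discordant_positions.append(position)  # metadata for the data stream
            total_bits += DISCORDANT_BITS
    return codes, discordant_positions, total_bits


def decode_partial_consensus(codes, discordant_positions):
    """Recover both strands' base calls from the codes plus the metadata."""
    concordant_lookup = {code: pair for pair, code in CONCORDANT.items()}
    discordant_lookup = {code: pair for pair, code in DISCORDANT.items()}
    flagged = set(discordant_positions)
    first, second = [], []
    for position, code in enumerate(codes):
        table = discordant_lookup if position in flagged else concordant_lookup
        a, b = table[code]
        first.append(a)
        second.append(b)
    return "".join(first), "".join(second)


# Toy example (hypothetical calls): position 3 pairs T with T, which is discordant.
first_calls, second_calls = "ACGTAC", "TGCTTG"
codes, discordant_positions, total_bits = encode_partial_consensus(first_calls, second_calls)
recovered = decode_partial_consensus(codes, discordant_positions)
print(discordant_positions, total_bits, recovered)
```

The round trip through `decode_partial_consensus` shows why the metadata matters: without the list of discordant positions, a decoder could not tell a 2-bit concordant code from a 4-bit discordant code.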

Claims

1. A method for determining a partial consensus sequence of a double-stranded nucleic acid molecule, the method comprising: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of base calls; sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of base calls; identifying a first set of concordant positions and a second set of discordant positions using the first sequence of base calls and the second sequence of base calls; representing each of the first set of concordant positions by a concordant value of a first group of four concordant values, each concordant value representing a concordant pair of bases on the first strand and the second strand; representing each of the second set of discordant positions by a discordant value of a second group of at least 12 discordant values, each discordant value representing a discordant pair of bases on the first strand and the second strand; and generating the partial consensus sequence using (1) the concordant values at the first set of concordant positions and (2) the discordant values at the second set of discordant positions.
2. The method of claim 1, wherein the first group of four concordant values is specified using two binary bits and includes A<>T, C<>G, G<>C, and T<>A.
3. The method of claim 1, wherein the second group of at least 12 discordant values is specified using at least four binary bits and includes A<>A, A<>C, A<>G, C<>A, C<>C, C<>T, G<>A, G<>G, G<>T, T<>C, T<>G, and T<>T.
4. The method of claim 1, wherein the second group of at least 12 discordant values includes at least 20 discordant values.
5. The method of claim 4, wherein each of the at least 20 discordant values is specified using five binary bits.
6. The method of claim 1, wherein generating the partial consensus sequence includes: including, in a data stream, metadata that specifies the second set of discordant positions.
7. The method of claim 6, wherein the metadata, the concordant values for the first set of concordant positions, and the discordant values for the second set of discordant positions are usable to recover the base calls of the first sequence and the second sequence at the first set of concordant positions and the second set of discordant positions.
8. The method of claim 1, further comprising transmitting the partial consensus sequence to a computer system.
9. The method of claim 1, further comprising: aligning the first sequence of base calls, the second sequence of base calls, or both to a reference genome, wherein the first set of concordant positions do not match the reference genome; identifying a third set of concordant positions that match the reference genome; and representing, in a data stream, each of the third set of concordant positions with an indication of a genomic coordinate in the reference genome.
10. The method of claim 9, wherein the indication of the genomic coordinate in the reference genome includes a starting genomic coordinate of the first sequence of base calls and a binary bit that specifies whether the concordant position matches the reference genome or not.
11. The method of claim 9, wherein the indication of the genomic coordinate in the reference genome includes a starting genomic coordinate of the first sequence of base calls and metadata specifying the concordant positions that do not match the reference genome.
12. A method for determining a consensus sequence of a double-stranded nucleic acid molecule, the method comprising: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of base calls, each having a first quality score and a first label corresponding to the first strand; sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of base calls, each having a second quality score and a second label corresponding to the second strand; identifying a first set of concordant positions and a second set of discordant positions using the first sequence of base calls and the second sequence of base calls; and for each discordant position of the second set of discordant positions: determining a consensus base call using the first quality score, the second quality score, a first weight corresponding to the first label, and a second weight corresponding to the second label; and generating the consensus sequence using (1) concordant values at the first set of concordant positions and (2) the consensus base calls at the second set of discordant positions.
13. The method of claim 12, wherein the first weight and the second weight are dependent on base calls adjacent to the discordant position.
14. The method of claim 12, wherein determining the consensus base call at an initially discordant position of the second set of discordant positions includes: changing the initially discordant position to be a concordant position for a first base call of the first strand based on the first quality score being higher than the second quality score for a second base call of the second strand.
15. The method of claim 14, wherein the initially discordant position is changed to be the concordant position for the first base call of the first strand further based on a concordant base on the second strand having a measured signal that is adjacent to the second base call.
16. The method of claim 12, wherein determining the consensus base call at an initially discordant position of the second set of discordant positions includes: changing the initially discordant position to be a concordant position for a first base call of the first strand based on the first weight being higher than the second weight.
17. The method of claim 12, wherein the consensus sequence is a partial consensus sequence.
18. The method of claim 12, wherein identifying the first set of concordant positions and the second set of discordant positions includes: aligning the first sequence of base calls to the second sequence of base calls.
19. The method of claim 18, wherein aligning the first sequence of base calls to the second sequence of base calls includes: aligning the first sequence of base calls to a reference genome; and aligning the second sequence of base calls to the reference genome.
20. The method of claim 19, wherein the second sequence of base calls is aligned to a second strand of the reference genome.
21. The method of claim 18, wherein the first sequence of base calls is directly aligned to the second sequence of base calls.
22. A method comprising: ligating a hairpin adapter to an end of a double-stranded nucleic acid molecule, thereby forming a resulting molecule having a hybridized portion and a non-hybridized portion, wherein the hairpin adapter includes a hairpin loop that comprises nucleotides that are not hybridized to other nucleotides, the hairpin loop having a known loop length; separating the hybridized portion of the resulting molecule to form a single stranded molecule; sequencing the single stranded molecule to obtain an output sequence; receiving the output sequence at a computer system; sliding a window structure over a plurality of positions of the output sequence, wherein the window structure includes a first window separated from a second window by the known loop length; at each position of the plurality of positions, determining an edit distance between a first sequence in the first window and a reverse complement of a second sequence in the second window, thereby determining a set of edit distances; determining a location of the hairpin loop in the output sequence based on the set of edit distances; and determining, using the location of the hairpin loop in the output sequence, a first strand sequence of a first strand of the double-stranded nucleic acid molecule and a second strand sequence of a second strand of the double-stranded nucleic acid molecule.
23. The method of claim 22, further comprising: determining, using the location of the hairpin loop in the output sequence, a measured identity sequence from the sequence read in the first window, the second window, or both at the location of the hairpin loop in the output sequence; identifying a particular sample, a particular double-stranded nucleic acid molecule, or both using the measured identity sequence.
24. The method of claim 23, wherein identifying the particular sample, the particular double-stranded nucleic acid molecule, or both comprises: comparing the measured identity sequence to a look-up table.
25. The method of claim 23, wherein identifying the particular sample, the particular double-stranded nucleic acid molecule, or both comprises: inputting the measured identity sequence to a machine learning model that is trained on various input sequences of a same length as a sample identifier used in the hairpin adapter.
26. The method of claim 22, wherein the window structure is slid across the sequence read at a specified step size.
27. The method of claim 26, wherein the step size is one base.
28. The method of claim 22, wherein determining the edit distance between the first sequence in the first window and a reverse complement of the second sequence in the second window comprises: determining a number of changes required in the second sequence to obtain a matching reverse complement to the first sequence.
29. The method of claim 22, wherein determining the edit distance between the first sequence in the first window and a reverse complement of the second sequence in the second window comprises: determining a number of mismatches between the first sequence and the reverse complement of the second sequence.
30. The method of claim 22, wherein the window structure is slid over an entirety of the output sequence, and wherein determining the location of the hairpin loop in the output sequence based on the edit distances comprises: determining the edit distance for each position of the window structure; and selecting a maximum of the set of edit distances.
31. The method of claim 22, wherein the plurality of positions includes a specified number of positions before and after a middle of the output sequence.
32. The method of claim 22, further comprising: comparing the edit distance at each of the plurality of positions to a threshold; and determining the location of the hairpin loop to be a position having the edit distance less than the threshold.
33. A method comprising: receiving a sequence segment of a nucleic acid molecule, the sequence segment specifying nucleotides at positions within the nucleic acid molecule, wherein the sequence read includes a first sequence portion corresponding to at least a portion of a nucleic acid segment from a biological sample and a second sequence portion corresponding to an adapter segment that was added to the nucleic acid segment; generating a first feature vector using the nucleotides at the positions within the nucleic acid molecule, the first feature vector including a series of data items, each data item in the series indicating a nucleotide at a corresponding position in the sequence segment; determining a first adapter location in the sequence segment of a component of the adapter segment by processing the first feature vector using a first machine learning model; and determining, using the first adapter location in the sequence segment, the first sequence portion corresponding to the nucleic acid segment, the second sequence portion corresponding to the adapter segment, or both.
34. The method of claim 33, wherein the second sequence portion of the adapter segment includes a plurality of components, wherein a first component is a fixed sequence and a second component is a variable sequence, and wherein the first adapter location corresponds to the first component, the method further comprising: determining, using the first adapter location, the variable sequence of the second component; and determining, using the variable sequence, to which sample and/or which molecule the sequence corresponds.
35. The method of claim 34, wherein determining which sample and/or which molecule comprises: generating a second feature vector using nucleotides at positions within the variable sequence; and processing the second feature vector using a second machine learning model.
36. The method of claim 35, wherein the second machine learning model is trained using a training set that was generated by randomly modifying a fixed pool of adapters for the variable sequence, and wherein the second machine learning model outputs an identifier identifying a particular adapter from the fixed pool of adapters.
37. The method of claim 35, wherein processing the second feature vector using the second machine learning model includes: encoding the second feature vector to a multidimensional data point of N dimensions, and wherein determining to which sample and/or which molecule the sequence corresponds comprises: comparing the multidimensional data point to a set of reference data points generated by applying the second machine learning model to a set of adapters used for the variable sequence in a sequencing run.
38. The method of claim 33, wherein the adapter segment indicates an origin of the nucleic acid segment.
39. The method of claim 33, wherein sequencing the first strand of the double-stranded nucleic acid molecule to obtain the first sequence of base calls includes: measuring signals for a window of a compound corresponding to the first strand of the double-stranded nucleic acid molecule, the compound comprising a plurality of units, each corresponding to a nucleotide; and determining a base call for a genomic position within the window by comparing the signals to known signal patterns corresponding to different nucleotides.
40. The method of claim 39, wherein comparing the signals to known patterns corresponding to different nucleotides is performed by a machine learning model trained using the known signal patterns.
41. The method of claim 39, wherein the compound is (1) the first strand of the double-stranded nucleic acid molecule, or (2) a surrogate molecule created from the first strand of the double-stranded nucleic acid molecule.
42. The method of claim 41, wherein sequencing the double-stranded nucleic acid molecule includes: creating the surrogate molecule from the double-stranded nucleic acid molecule, the surrogate molecule including one or more reporter elements corresponding to each nucleotide; passing the surrogate molecule through a nanopore to obtain electrical signals; and determining the first sequence of base calls and the second sequence of base calls of nucleotides in the double-stranded nucleic acid molecule using the electrical signals.
43. The method of claim 33, further comprising repeating the method for at least 10,000 nucleic acid molecules.
44. A computer product comprising a non-transitory computer readable medium storing a plurality of instructions that, when executed, cause a computer system to perform the method of claim 1.
45. A method comprising: receiving an adapter sequence of an adapter; receiving a sequence read of a nucleic acid molecule, the sequence read specifying nucleotides at positions within the nucleic acid molecule, wherein the sequence read includes a first sequence portion corresponding to at least a portion of a nucleic acid segment from a biological sample and a second sequence portion corresponding to the adapter that was added to the nucleic acid segment; encoding the nucleotides of the sequence read into a first series of nucleotide encodings, wherein each nucleotide has a different encoding; encoding the adapter sequence into a second series of nucleotide encodings; transforming the first series of nucleotide encodings into a first frequency domain signal; transforming the second series of nucleotide encodings into a second frequency domain signal; determining a frequency-domain cross-correlation signal between the first frequency domain signal and second frequency domain signal; transforming the frequency-domain cross-correlation signal to a time domain to obtain a cross-correlation signal; determining an adapter location of the adapter using a maximum of the cross-correlation signal; and determining, using the adapter location, a segment location of the nucleic acid segment.
46. The method of claim 45, wherein the adapter is a first adapter of a set of adapters used in a sequencing library, the method further comprising: repeating obtaining the cross-correlation signal for other adapters in the set of adapters, wherein the maximum of the cross-correlation signal for the first adapter is highest among the set of adapters.
47. The method of claim 45, wherein the adapter location is a start position of the adapter in the sequence read.
48. The method of claim 45, wherein the first series of nucleotide encodings and the second series of nucleotide encodings are reversed, and wherein the adapter location is an end position of the adapter in the sequence read.
49. A method comprising: receiving a sequence read of two strands of a nucleic acid molecule, the sequence read specifying nucleotides at positions within the nucleic acid molecule, wherein the sequence read includes first sequence portions corresponding to two nucleic acid segments from a biological sample and a second sequence portion corresponding to the adapter that was added between the two nucleic acid segments; encoding the nucleotides of the sequence read into a first series of nucleotide encodings, wherein each nucleotide has a different encoding; encoding a reverse complement of the sequence read into a second series of nucleotide encodings; transforming the first series of nucleotide encodings into a first frequency domain signal; transforming the second series of nucleotide encodings into a second frequency domain signal; determining a frequency-domain cross-correlation signal between the first frequency domain signal and second frequency domain signal; transforming the frequency-domain cross-correlation signal to a time domain to obtain a cross-correlation signal; and determining an adapter location of the adapter using a maximum of the cross-correlation signal.
50. The method of claim 49, wherein different nucleotide encodings do not overlap with each other.
51. The method of claim 50, wherein the different nucleotide encodings are orthogonal to each other.
52. The method of claim 50, wherein the different nucleotide encodings are in complex space or use at least four dimensions.
53. The method of claim 49, wherein the sequence read includes multiple copies of at least one strand of the two strands of the nucleic acid molecule.
54. The method of claim 49, wherein the adapter location corresponds to the middle of the adapter, and wherein the adapter is optionally a hairpin adapter.
55. The method of claim 54, wherein the adapter location indicates an origin of the nucleic acid segment.
56. A sequencing device for determining consensus sequences of double-stranded nucleic acid molecules, the sequencing device comprising: a set of sequencing cells comprising at least 10,000 sequencing cells, each sequencing cell configured to perform: sequencing a first strand of the double-stranded nucleic acid molecule to obtain a first sequence of first base measurements; and sequencing a second strand of the double-stranded nucleic acid molecule to obtain a second sequence of second base measurements; a consensus circuit electrically connected with the set of sequencing cells, wherein the consensus circuit is configured to perform, for each of the double-stranded nucleic acid molecules: receiving the first sequence of base measurements and the second sequence of base measurements; for each of a plurality of positions of the double-stranded nucleic acid molecule: comparing one or more of the first base measurements to one or more of the second base measurements; and determining a base call value based on the comparison; and generating a consensus sequence using the base call values; a transmitter configured to transmit the consensus sequence to a computer system.
57. The sequencing device of claim 56, wherein comparing a first base measurement to a second base measurement comprises: determining a first base call using the one or more of the first base measurements; determining a second base call using the one or more of the second base measurements; and comparing the first base call and the second base call.
58. The sequencing device of claim 56, wherein the consensus circuit is further configured to perform: determining whether a position of the plurality of positions is concordant or discordant based on the comparing, wherein the base call value is dependent on whether the position is concordant or discordant.
59. The sequencing device of claim 58, wherein a number of bits used for the base call value is dependent on whether the position is concordant or discordant.
60. The sequencing device of claim 58, wherein the consensus circuit is further configured to generate metadata identifying which positions are discordant, and wherein the consensus sequence includes the metadata.
61. The sequencing device of claim 56, wherein the set of sequence cells and the consensus circuit are on a same printed circuit board.
62. The sequencing device of claim 56, wherein the set of sequence cells and the consensus circuit are on a same integrated circuit.
63. The method of claim 1, further comprising: generating, via a neural network, a set of quality scores for at least one of the first set of concordant positions and the second set of discordant positions, wherein the neural network receives the first sequence of base calls and the second sequence of base calls and predicts the set of quality scores.
64. The method of claim 63, wherein the neural network is a convolutional neural network.
65. The method of claim 63, wherein the neural network comprises a U-Net having an encoder portion and a decoder portion, and wherein the neural network further comprises a softmax layer configured to predict a probability that a corresponding consensus call at each position is accurate.
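By way of illustration only, and not as a limitation of any claim, the representation recited in claims 1 to 3, 6, and 7 may be sketched in Python as follows. The sketch assumes the two base-call sequences have already been aligned position by position, so that a concordant position holds a complementary pair. The function names (encode, decode, payload_bits), the particular integer assigned to each pair, and the use of a plain list of positions as the metadata are assumptions made for this example and are not taken from the disclosure.

```python
# Illustrative sketch only. Code tables, bit assignments, and function names
# are hypothetical; the disclosure does not specify them.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"

# Four concordant pairs (A<>T, C<>G, G<>C, T<>A): each fits in two bits.
CONCORDANT_CODE = {(b, COMPLEMENT[b]): i for i, b in enumerate(BASES)}

# Twelve discordant (non-complementary) pairs: each fits in four bits.
DISCORDANT_CODE = {}
for first in BASES:
    for second in BASES:
        if second != COMPLEMENT[first]:
            DISCORDANT_CODE[(first, second)] = len(DISCORDANT_CODE)

CONCORDANT_PAIR = {code: pair for pair, code in CONCORDANT_CODE.items()}
DISCORDANT_PAIR = {code: pair for pair, code in DISCORDANT_CODE.items()}


def encode(first_calls, second_calls):
    """Return (values, discordant_positions) for two aligned strands.

    second_calls[i] is the base called on the second strand opposite
    first_calls[i]. Only A, C, G, and T are handled; supporting no-calls
    would need a larger discordant table (compare claims 4 and 5).
    """
    if len(first_calls) != len(second_calls):
        raise ValueError("strands must be aligned to the same length")
    values, discordant_positions = [], []
    for pos, pair in enumerate(zip(first_calls, second_calls)):
        if pair in CONCORDANT_CODE:
            values.append(CONCORDANT_CODE[pair])
        else:
            values.append(DISCORDANT_CODE[pair])
            discordant_positions.append(pos)  # metadata in the sense of claim 6
    return values, discordant_positions


def decode(values, discordant_positions):
    """Recover both strands' base calls from the values and the metadata."""
    discordant = set(discordant_positions)
    first, second = [], []
    for pos, value in enumerate(values):
        table = DISCORDANT_PAIR if pos in discordant else CONCORDANT_PAIR
        a, b = table[value]
        first.append(a)
        second.append(b)
    return "".join(first), "".join(second)


def payload_bits(values, discordant_positions):
    """Bits for the values alone: 2 per concordant, 4 per discordant position."""
    n_discordant = len(discordant_positions)
    return 2 * (len(values) - n_discordant) + 4 * n_discordant


# Round trip on a six-base example with one discordant position (index 4).
values, discordant_positions = encode("ACGTAC", "TGCAGG")
assert decode(values, discordant_positions) == ("ACGTAC", "TGCAGG")
```

In this sketch the cost of storing the discordant positions themselves is not counted by payload_bits; how that metadata is packed into the data stream is left open here.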
PCT/US2025/022458 2024-04-02 2025-04-01 High throughput inramolecular consensus reads Pending WO2025212586A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202463573191P 2024-04-02 2024-04-02
US63/573,191 2024-04-02
US202463689578P 2024-08-30 2024-08-30
US63/689,578 2024-08-30

Publications (1)

Publication Number Publication Date
WO2025212586A1 (en) 2025-10-09

Family

ID=95519085

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/022458 Pending WO2025212586A1 (en) 2024-04-02 2025-04-01 High throughput inramolecular consensus reads

Country Status (1)

Country Link
WO (1) WO2025212586A1 (en)

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5604097A (en) 1994-10-13 1997-02-18 Spectragen, Inc. Methods for sorting polynucleotides using oligonucleotide tags
WO2008157696A2 (en) * 2007-06-19 2008-12-24 Stratos Genomics Inc. High throughput nucleic acid sequencing by expansion
US7537897B2 (en) 2006-01-23 2009-05-26 Population Genetics Technologies, Ltd. Molecular counting
US20130288902A1 (en) * 2012-04-30 2013-10-31 Life Technologies Corporation Systems and methods for paired end sequencing
US8715967B2 (en) 2010-09-21 2014-05-06 Population Genetics Technologies Ltd. Method for accurately counting starting molecules
US20140134616A1 (en) 2012-11-09 2014-05-15 Genia Technologies, Inc. Nucleic acid sequencing using tags
US8835358B2 (en) 2009-12-15 2014-09-16 Cellular Research, Inc. Digital counting of individual molecules by stochastic attachment of diverse labels
US9041420B2 (en) 2010-02-08 2015-05-26 Genia Technologies, Inc. Systems and methods for characterizing a molecule
US9290805B2 (en) 2012-02-27 2016-03-22 Genia Technologies, Inc. Sensor circuit for controlling, detecting, and measuring a molecular complex
US9322062B2 (en) 2013-10-23 2016-04-26 Genia Technologies, Inc. Process for biosensor well formation
US9494554B2 (en) 2012-06-15 2016-11-15 Genia Technologies, Inc. Chip set-up and high-accuracy nucleic acid sequencing
US20170089858A1 (en) 2015-09-24 2017-03-30 Genia Technologies, Inc. Encoding state change of nanopore to reduce data size
US10174437B2 (en) 2015-07-09 2019-01-08 Applied Materials, Inc. Wafer electroplating chuck assembly
US10371664B2 (en) 2016-01-21 2019-08-06 Roche Molecular Systems, Inc. Use of titanium nitride as a counter electrode
US10663423B2 (en) 2011-01-24 2020-05-26 Roche Sequencing Solutions, Inc. System for detecting electrical properties of a molecular complex
US10809243B2 (en) 2015-08-31 2020-10-20 Roche Sequencing Solutions, Inc. Small aperture large electrode cell
US10809244B2 (en) 2013-02-05 2020-10-20 Roche Sequencing Solutions, Inc. Nanopore arrays
WO2020236526A1 (en) 2019-05-23 2020-11-26 Stratos Genomics, Inc. Translocation control elements, reporter codes, and further means for translocation control for use in nanopore sequencing
US10920312B2 (en) 2015-08-31 2021-02-16 Roche Sequencing Solutions, Inc. Electrochemical cell with increased current density
US20210148886A1 (en) 2018-06-27 2021-05-20 Roche Sequencing Solutions, Inc. Multiplexing analog components in biochemical sensor arrays
US11098354B2 (en) 2015-08-05 2021-08-24 Roche Sequencing Solutions, Inc. Use of titanium nitride as an electrode in non-faradaic electrochemical cell
US11299725B2 (en) 2015-11-16 2022-04-12 Stratos Genomics, Inc. DP04 polymerase variants
US11530392B2 (en) 2017-12-11 2022-12-20 Stratos Genomics, Inc. DPO4 polymerase variants with improved accuracy
US11708566B2 (en) 2017-05-04 2023-07-25 Stratos Genomics, Inc. DP04 polymerase variants
US11767556B2 (en) * 2013-12-28 2023-09-26 Guardant Health, Inc. Methods and systems for detecting genetic variants

Patent Citations (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5604097A (en) 1994-10-13 1997-02-18 Spectragen, Inc. Methods for sorting polynucleotides using oligonucleotide tags
US7537897B2 (en) 2006-01-23 2009-05-26 Population Genetics Technologies, Ltd. Molecular counting
WO2008157696A2 (en) * 2007-06-19 2008-12-24 Stratos Genomics Inc. High throughput nucleic acid sequencing by expansion
US7939259B2 (en) 2007-06-19 2011-05-10 Stratos Genomics, Inc. High throughput nucleic acid sequencing by expansion
US8835358B2 (en) 2009-12-15 2014-09-16 Cellular Research, Inc. Digital counting of individual molecules by stochastic attachment of diverse labels
US9041420B2 (en) 2010-02-08 2015-05-26 Genia Technologies, Inc. Systems and methods for characterizing a molecule
US8715967B2 (en) 2010-09-21 2014-05-06 Population Genetics Technologies Ltd. Method for accurately counting starting molecules
US10663423B2 (en) 2011-01-24 2020-05-26 Roche Sequencing Solutions, Inc. System for detecting electrical properties of a molecular complex
US9290805B2 (en) 2012-02-27 2016-03-22 Genia Technologies, Inc. Sensor circuit for controlling, detecting, and measuring a molecular complex
US20130288902A1 (en) * 2012-04-30 2013-10-31 Life Technologies Corporation Systems and methods for paired end sequencing
US9494554B2 (en) 2012-06-15 2016-11-15 Genia Technologies, Inc. Chip set-up and high-accuracy nucleic acid sequencing
US20140134616A1 (en) 2012-11-09 2014-05-15 Genia Technologies, Inc. Nucleic acid sequencing using tags
US10809244B2 (en) 2013-02-05 2020-10-20 Roche Sequencing Solutions, Inc. Nanopore arrays
US9322062B2 (en) 2013-10-23 2016-04-26 Genia Technologies, Inc. Process for biosensor well formation
US11767556B2 (en) * 2013-12-28 2023-09-26 Guardant Health, Inc. Methods and systems for detecting genetic variants
US10174437B2 (en) 2015-07-09 2019-01-08 Applied Materials, Inc. Wafer electroplating chuck assembly
US11098354B2 (en) 2015-08-05 2021-08-24 Roche Sequencing Solutions, Inc. Use of titanium nitride as an electrode in non-faradaic electrochemical cell
US10809243B2 (en) 2015-08-31 2020-10-20 Roche Sequencing Solutions, Inc. Small aperture large electrode cell
US10920312B2 (en) 2015-08-31 2021-02-16 Roche Sequencing Solutions, Inc. Electrochemical cell with increased current density
US20170089858A1 (en) 2015-09-24 2017-03-30 Genia Technologies, Inc. Encoding state change of nanopore to reduce data size
US10935512B2 (en) 2015-09-24 2021-03-02 Roche Sequencing Solutions, Inc. Encoding state change of nanopore to reduce data size
US11299725B2 (en) 2015-11-16 2022-04-12 Stratos Genomics, Inc. DP04 polymerase variants
US10371664B2 (en) 2016-01-21 2019-08-06 Roche Molecular Systems, Inc. Use of titanium nitride as a counter electrode
US11708566B2 (en) 2017-05-04 2023-07-25 Stratos Genomics, Inc. DP04 polymerase variants
US11530392B2 (en) 2017-12-11 2022-12-20 Stratos Genomics, Inc. DPO4 polymerase variants with improved accuracy
US20210148886A1 (en) 2018-06-27 2021-05-20 Roche Sequencing Solutions, Inc. Multiplexing analog components in biochemical sensor arrays
US20220411458A1 (en) 2019-05-23 2022-12-29 Stratos Genomics, Inc. Translocation control elements, reporter codes, and further means for translocation control for use in nanopore sequencing
WO2020236526A1 (en) 2019-05-23 2020-11-26 Stratos Genomics, Inc. Translocation control elements, reporter codes, and further means for translocation control for use in nanopore sequencing

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
BATZER ET AL., NUCLEIC ACID RES, vol. 19, 1991, pages 5081
FU ET AL., PROC. NAT'L. ACAD. SCI., vol. 111, 2014, pages 1891 - 1896
ISLAM ET AL., NAT. METHODS, vol. 11, 2014, pages 163 - 168
KIVIOJA ET AL., NAT. METHODS, vol. 9, 2012, pages 72 - 74
OHTSUKA ET AL., J. BIOL. CHEM., vol. 260, 1985, pages 2605 - 2608
RONNEBERGER ET AL.: "U-Net: Convolutional Networks for Biomedical Image Segmentation", COMPUTER VISION AND PATTERN RECOGNITION, ARXIV:1505.04597, 2015
ROSSOLINI ET AL., MOL. CELL. PROBES, vol. 8, 1994, pages 91 - 98

Similar Documents

Publication Publication Date Title
AU2022200179B2 (en) Error Suppression In Sequenced DNA Fragments Using Redundant Reads With Unique Molecular Indices (UMIs)
JP7462993B2 (en) Determination of nucleic acid base modifications
US20250226056A1 (en) Variant classifier based on deep neural networks
CN108350494B (en) Systems and methods for genomic analysis
AU2018254595B2 (en) Using cell-free DNA fragment size to detect tumor-associated variant
US20190189242A1 (en) Machine learning system and method for somatic mutation discovery
US20210065847A1 (en) Systems and methods for determining consensus base calls in nucleic acid sequencing
WO2019200338A1 (en) Variant classifier based on deep neural networks
CN110799653A (en) Optimal Index Sequences for Multiplex Massively Parallel Sequencing
EP3520221B1 (en) Efficient clustering of noisy polynucleotide sequence reads
CN117083680A (en) Artificial intelligence-based cancer diagnosis and cancer type prediction method
WO2020132151A1 (en) Cancer tissue source of origin prediction with multi-tier analysis of small variants in cell-free dna samples
US20190139628A1 (en) Machine learning techniques for analysis of structural variants
JP2022550841A (en) Improved variant calling using single-cell analysis
WO2024010809A2 (en) Methods and systems for detecting recombination events
WO2025212586A1 (en) High throughput inramolecular consensus reads
Edwards Whole-genome sequencing for marker discovery
CN114613428A (en) Metabolite-protein interaction prediction method based on two-dimensional heterogeneous network
US20230340571A1 (en) Machine-learning models for selecting oligonucleotide probes for array technologies
US20250210141A1 (en) Enhanced mapping and alignment of nucleotide reads utilizing an improved haplotype data structure with allele-variant differences
WO2022109330A1 (en) Cellular clustering analysis in sequencing datasets
WO2025090883A1 (en) Detecting variants in nucleotide sequences based on haplotype diversity
Quan Accurate alignment of sequencing reads from various genomic origins
Teng et al. Detecting m6A RNA modification from nanopore sequencing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25721370

Country of ref document: EP

Kind code of ref document: A1