WO2025193747A1 - Machine-learning models for ordering and expediting sequencing tasks or corresponding nucleotide-sample slides - Google Patents
Machine-learning models for ordering and expediting sequencing tasks or corresponding nucleotide-sample slides
- Publication number
- WO2025193747A1 (PCT/US2025/019437)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequencing
- sample
- nucleotide
- ordering
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- a sequencing device e.g., sequencing machine or instrument
- primary sequencing tasks e.g., cluster generation, primer hybridization, image analysis, base calling, demultiplexing, and quality scoring for primary analysis
- existing sequencing-data-analysis software can cause computing devices to run secondary sequencing tasks (e.g., read alignment, variant-calling, structural variant detection, functional annotation, taxonomic classification, and genome assembly for secondary analysis) on such nucleotide reads to align the nucleotide reads with a reference genome and determine variant calls for genomic samples where such samples differ from the reference genome.
- existing sequencing management systems provide useful options to order and analyze the results of sequencing runs
- existing sequencing management systems (i) provide computationally limited or inefficient ordering mechanisms for ordering nucleotide-sample-slides for genomic analysis on specialized computing devices, (ii) provide computationally limited or inefficient ordering mechanisms for ordering sequencing tasks for genomic analysis, and (iii) limit functions and control of an end-to-end sequencing process for a genomic sample across the sequencing device and secondary sequencing-data-analysis devices.
- the system allocates more system resources (e.g., CPU processing power, memory, disk space, network bandwidth) than the system can handle effectively.
- the processor may become overloaded, leading to resource exhaustion.
- existing sequencing systems consume excess power and generate unnecessary heat, which not only wastes energy but also increases cooling costs.
- allocating too much memory or disk space can deplete these computing resources or lead to situations where tasks are waiting for resources, causing a deadlock where none of the pending tasks can proceed.
- DeepRM is a deep reinforcement learning-based resource management solution that employs a conventional deep Q-learning algorithm. While publications have also not been found suggesting that DeepRM has been used for ordering sequencing tasks or nucleotide-sample slides, even if DeepRM has been so used, DeepRM requires large amounts of data to train effectively, which can be complex and result in an increased computational burden.
- Because existing sequencing systems generally complete all primary sequencing tasks for a genomic sample before commencing secondary sequencing tasks, existing sequencing systems require storing and transferring large amounts of data consecutively, which taxes the bandwidth of network connections or other interfaces that connect processor cards with other hardware within a computing device. For example, in the 52-billion-read example mentioned above, existing sequencing systems analyzing primary sequencing tasks for a sequencing run with paired-end reads with a length of 150 base pairs produce approximately 16 Tb of data and require approximately 48 hours of run time. Consequently, existing sequencing systems require local storage of the 16 Tb of sequencing data and perform a subsequent batch data transfer over network devices that consumes approximately 7 hours (assuming a 5 Gb/s link).
- the disclosed system can utilize a relatively small neural network composed of fully connected layers combined with activation functions to produce alignment values that order sequencing tasks and/or nucleotide-sample-slides.
- the neural network can be trained via a genetic algorithm to determine a best scoring version of the model, whether for a sequencing-task ordering machine-learning model or a nucleotide-sample-slide ordering machine-learning model. For example, in certain instances, the system generates predicted ordering scores from candidate machine-learning models and determines makespan scores for each candidate model based on the predicted ordering scores.
- the system selects a highest performing candidate model as the ordering machine-learning model. Furthermore, the disclosed system can use a two-tier alignment function that utilizes two neural networks and incorporates a penalty value (or priority feature) to order and execute sequencing tasks more efficiently and with fewer computing resources.
- the disclosed system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run. In particular, based on processing requirements, the disclosed system can demultiplex and transmit base-call-data files specific to genomic samples to one or more computing devices during the sequencing run.
- FIG. 1 illustrates an environment in which a sequencing ordering system can operate in accordance with one or more embodiments of the present disclosure.
- FIG. 2A illustrates a schematic diagram of the sequencing ordering system determining task ordering scores for sequencing tasks and performing the sequencing tasks in a relative order according to the task ordering scores in accordance with one or more embodiments of the present disclosure.
- FIG. 2B illustrates a schematic diagram of the sequencing ordering system determining slide ordering scores for nucleotide-sample slides and processing the nucleotide-sample slides in a relative order according to the slide ordering scores in accordance with one or more embodiments of the present disclosure.
- FIG. 3 illustrates a schematic diagram of the sequencing ordering system utilizing the sequencing-task ordering machine-learning model to determine task ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure.
- FIG. 4 illustrates the sequencing ordering system providing primary and/or secondary sequencing task features to the sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- FIG. 5 illustrates an example architecture for a sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- FIGS. 6A-6B illustrate utilizing the sequencing ordering system to select the highest performing sequencing-task ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure.
- FIG. 7A illustrates the sequencing ordering system distributing sample-specific base-call-data files to one or more computing devices in accordance with one or more embodiments of the present disclosure.
- FIG. 7B illustrates the sequencing ordering system performing a demultiplexing operation on a subset of sequencing cycles with indexing cycles performed between genomic sequencing cycles in accordance with one or more embodiments of the present disclosure.
- FIG. 7C illustrates the sequencing ordering system performing an indexing-first approach to demultiplexing nucleotide reads by performing indexing cycles before genomic sequencing cycles in accordance with one or more embodiments of the present disclosure.
- FIG. 8 illustrates a schematic diagram of the sequencing ordering system utilizing the nucleotide-sample-slide ordering machine-learning model to determine task ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure.
- FIG. 9 illustrates the sequencing ordering system providing nucleotide-sample slide features to the nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- FIG. 10 illustrates an example architecture for a nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- FIGS. 11A-11C illustrate the sequencing ordering system selecting the highest performing nucleotide-sample-slide ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure.
- FIG. 12 illustrates a schematic diagram of the sequencing ordering system utilizing a combination of the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to order sequencing tasks in accordance with one or more embodiments of the present disclosure.
- FIG. 17 illustrates a flowchart of a series of acts for transmitting genomic samples to computing devices in accordance with one or more embodiments of the present disclosure.
- FIG. 19 illustrates a block diagram of an example computing device in accordance with one or more embodiments of the present disclosure.
- This disclosure describes one or more embodiments of a sequencing ordering system that provides a machine-learning model that can analyze features of sequencing tasks and generate task ordering scores upon which a computing system can order the processing of sequencing tasks.
- the sequencing ordering system can determine, for a set of sequencing tasks, a set of sequencing task features indicating at least a performance time associated with respective sequencing tasks of the set of sequencing tasks.
- the sequencing ordering system may further provide the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks.
- the sequencing ordering system may generate, utilizing the sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on available computing resources and the set of sequencing task features.
- Based on the task ordering scores, the sequencing ordering system performs the set of sequencing tasks. Furthermore, in some implementations, the disclosed system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run based on the processing requirements for the base-call-data files for the scheduled sequencing tasks.
- the sequencing ordering system utilizes a machine-learning model that takes as input data (e.g., a feature vector) representing the actual tasks in the pipeline (associated with multiple nucleotide-sample-slides with various densities and/or secondary analysis applications) in combination with data representing the compute resources available on and/or off the sequencing instrument for analyzing a nucleotide-sample-slide to reduce the makespan (overall time to complete the related tasks) of either primary or secondary sequencing tasks.
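The makespan described above can be illustrated with a minimal sketch. It assumes identical workers and a greedy dispatcher that hands the next task in the order to whichever worker frees up first; the task durations, the worker count, and the use of duration as a stand-in ordering score are all hypothetical and are not taken from the source.

```python
import heapq

def makespan(durations, order, workers):
    """Finish time of the last task when tasks are dispatched in `order`
    to whichever of `workers` identical workers becomes free first."""
    free_at = [0.0] * workers          # min-heap of times at which workers free up
    heapq.heapify(free_at)
    finish = 0.0
    for idx in order:
        start = heapq.heappop(free_at)  # earliest-available worker
        end = start + durations[idx]
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

durations = [1, 1, 1, 1, 4]            # hypothetical performance times (hours)
fifo_order = list(range(len(durations)))

# Hypothetical ordering scores: the duration itself, so longer tasks run first.
# A trained ordering model would supply these scores instead.
scores = durations
scored_order = sorted(fifo_order, key=lambda i: scores[i], reverse=True)

fifo_span = makespan(durations, fifo_order, workers=2)      # 6.0
scored_span = makespan(durations, scored_order, workers=2)  # 4.0
```

In this toy case the score-driven order finishes in 4 hours against 6 hours for FIFO, because the long task no longer starts last.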
- the sequencing ordering system determines sequencing task features (e.g., a performance time associated with each sequencing task) for sequencing tasks and further provides the sequencing task features to a sequencing-task ordering machine-learning model.
- By processing the sequencing task features and accounting for available computing resources (e.g., using model parameters), the sequencing-task ordering machine-learning model generates task ordering scores indicating a relative order of the sequencing tasks.
- the sequencing ordering system further performs the sequencing tasks according to the task ordering scores.
- the disclosed system can use a specialized machine-learning model to generate task ordering scores for either (i) primary sequencing tasks (e.g., real-time analysis) associated with base calling for a genomic sample’s nucleotide reads or (ii) secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of such nucleotide reads.
- the disclosed system determines sequencing task features (e.g., a performance time associated with each sequencing task) for sequencing tasks and further provides the sequencing task features to the sequencing-task ordering machine-learning model.
- By processing the sequencing task features and accounting for available computing resources (e.g., using model parameters), the sequencing-task ordering machine-learning model generates task ordering scores indicating a relative order of the sequencing tasks.
- the system further performs the sequencing tasks according to the task ordering scores.
- the sequencing ordering system can utilize a relatively small sequencing-task ordering neural network composed of fully connected layers combined with activation functions to generate task ordering scores that determine an order in which to perform sequencing tasks for a sequencing run or secondary analysis.
- a sequencing-task ordering neural network includes an input layer for the set of sequencing task features, two fully connected hidden layers, each equipped with an activation function, bias, and weights, and — after the fully connected hidden layers — an output layer that outputs task ordering scores.
- such a sequencing-task ordering neural network architecture includes adjustable parameters (e.g., 88 adjustable parameters) to generate the most efficient alignment function.
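As a rough illustration of how small such a network is, the sketch below builds a fully connected network with two hidden layers in plain Python. The layer widths (5, 7, 5, 1) are an assumption chosen only because they happen to yield 88 adjustable parameters; the source states the parameter count but not the widths, the activation function (tanh here), or the meaning of each input feature.

```python
import math
import random

# Hypothetical layer widths: input, two hidden layers, output.
LAYER_SIZES = [5, 7, 5, 1]

def init_params(sizes, rng):
    """Create one (weights, biases) pair per layer transition."""
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1, 1) for _ in range(n_out)]
        params.append((weights, biases))
    return params

def count_params(params):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)

def ordering_score(features, params):
    """Forward pass: tanh on the hidden layers, linear output layer."""
    x = features
    for layer, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if layer < len(params) - 1:
            x = [math.tanh(v) for v in x]
    return x[0]

rng = random.Random(0)
params = init_params(LAYER_SIZES, rng)
n_params = count_params(params)   # (5*7 + 7) + (7*5 + 5) + (5*1 + 1) = 88

# Hypothetical, already-normalized features for one sequencing task.
task_features = [0.5, 0.2, 0.0, 1.0, 0.3]
score = ordering_score(task_features, params)
```

Scoring every pending task with the same parameters and sorting by score gives a relative order; with untrained random parameters, as here, the scores carry no meaning yet.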
- the sequencing ordering system trains the sequencing-task ordering machine-learning model to determine scores indicating a best order of sequencing tasks.
- the sequencing-task ordering machine-learning model is trained via a genetic algorithm to determine a best version of the sequencing-task ordering machine-learning model. For instance, the sequencing ordering system identifies a set of parent sequencing-task ordering machine-learning models (e.g., 128 parent models filtered from an initial 8,192 models) and, from the parents, generates a set of candidate sequencing-task ordering machine-learning models (e.g., repopulated 8,192 candidate models) each comprising different weights and biases.
- the sequencing ordering system further generates predicted ordering scores from each candidate sequencing-task ordering machine-learning model and determines makespan scores for each candidate sequencing-task ordering machine-learning model based on the predicted ordering scores. By comparing the makespan scores for each candidate sequencing-task ordering machine-learning model using a loss function, the sequencing ordering system selects a highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model.
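The select-parents, repopulate, and score-by-makespan loop can be sketched as follows. This is a scaled-down illustration under several assumptions that are not in the source: a linear scorer stands in for the neural network, the population is 64 candidates with 8 parents (rather than 8,192 and 128), crossover is uniform, mutation is Gaussian, and the task features and worker count are made up.

```python
import heapq
import random

def makespan(durations, order, workers=2):
    """Greedy dispatch of tasks, in the given order, to the earliest-free worker."""
    free_at = [0.0] * workers
    for idx in order:
        heapq.heappush(free_at, heapq.heappop(free_at) + durations[idx])
    return max(free_at)

def fitness(weights, tasks):
    """Lower is better: makespan when tasks run in descending score order."""
    scores = [sum(w * f for w, f in zip(weights, feats)) for feats in tasks]
    order = sorted(range(len(tasks)), key=lambda i: scores[i], reverse=True)
    return makespan([feats[0] for feats in tasks], order)

def evolve(tasks, rng, population=64, parents=8, generations=20, sigma=0.3):
    n_weights = len(tasks[0])
    pool = [[rng.uniform(-1, 1) for _ in range(n_weights)] for _ in range(population)]
    history = []
    for _ in range(generations):
        pool.sort(key=lambda w: fitness(w, tasks))
        elite = pool[:parents]                       # parent models
        history.append(fitness(elite[0], tasks))
        children = []
        while len(children) < population - parents:  # repopulate candidates
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]   # uniform crossover
            child = [w + rng.gauss(0, sigma) for w in child]   # Gaussian mutation
            children.append(child)
        pool = elite + children                      # parents survive unchanged
    best = min(pool, key=lambda w: fitness(w, tasks))
    return best, history + [fitness(best, tasks)]

rng = random.Random(42)
# Each task: [performance time, hypothetical second feature].
tasks = [[rng.randint(1, 9), rng.random()] for _ in range(12)]
best_weights, history = evolve(tasks, rng)
fifo_span = makespan([t[0] for t in tasks], range(len(tasks)))
```

Because the parents are carried into each new generation unchanged, the best makespan recorded in `history` never gets worse from one generation to the next; comparing `history[-1]` with `fifo_span` shows how the selected candidate fares against a FIFO order on this toy workload.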
- the sequencing ordering system can analyze features of nucleotide-sample-slides and generate slide ordering scores upon which a computing system can order the processing of nucleotide-sample-slides. For example, the sequencing ordering system can determine, for a set of nucleotide-sample-slides, a set of nucleotide-sample-slide features indicating at least a performance time associated with processing data for each nucleotide-sample-slide of the set of nucleotide-sample-slides.
- the sequencing ordering system may further provide the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample-slides.
- the sequencing ordering system may generate, utilizing the nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample-slides based on available computing resources and the set of nucleotide-sample-slide features. Based on the slide ordering scores, the sequencing ordering system performs sequencing tasks.
- the sequencing ordering system can utilize a two-tier system incorporating both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to generate the slide ordering scores.
- the system accesses or determines nucleotide-sample-slide features (e.g., a performance time associated with processing data for each nucleotide-sample-slide) for respective nucleotide-sample-slides and provides the nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model.
- By processing the nucleotide-sample-slide features and accounting for available computing resources, the nucleotide-sample-slide ordering machine-learning model generates nucleotide-sample-slide ordering scores indicating a relative order for processing the different nucleotide-sample-slides. Based on the nucleotide-sample-slide ordering scores, the disclosed system performs sequencing tasks for the ordered nucleotide-sample-slides.
- the sequencing ordering system can utilize a relatively small nucleotide-sample-slide ordering neural network composed of fully connected layers combined with activation functions to generate slide ordering scores that determine an order in which to process nucleotide-sample slides.
- the sequencing ordering system utilizes a nucleotide-sample-slide ordering neural network that includes an input layer for the set of nucleotide-sample-slide features, four fully connected hidden layers, each equipped with an activation function, bias, and weights, and — after the fully connected hidden layers — an output layer that outputs slide ordering scores.
- the sequencing ordering system trains a nucleotide-sample-slide ordering neural network or other machine-learning model using genetic algorithms to determine scores indicating a best order of processing different nucleotide-sample slides. For example, similar to the method outlined above used to train the sequencing-task ordering machine-learning model, the sequencing ordering system trains a nucleotide-sample-slide ordering machine-learning model to produce slide ordering scores that indicate the order for processing nucleotide-sample-slides in a sequencing run or secondary analysis.
- the nucleotide-sample-slide ordering machine-learning model is trained via a genetic algorithm by selecting parent nucleotide-sample-slide ordering machine-learning models based on their fitness, generating candidate nucleotide-sample-slide ordering machine-learning models through crossover and mutation, and selecting a highest performing nucleotide-sample-slide ordering candidate model as the nucleotide-sample-slide ordering machine-learning model.
- the sequencing ordering system can integrate both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model into a two-tier sequencing ordering system. In this way, the sequencing ordering system can provide an order for the sequencing tasks based on both the slide ordering scores and the task ordering scores. In turn, the sequencing ordering system can perform the sequencing tasks for the set of nucleotide-sample slides according to both the slide ordering scores and task ordering scores.
- In addition to ordering sequencing tasks and/or nucleotide-sample-slides, in some cases, the sequencing ordering system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run.
- the sequencing ordering system can demultiplex and transmit base-call-data files to one or more computing devices during the sequencing run.
- Such different processing parameters may include a different secondary sequencing task for a genomic sample, different analysis rights for a genomic sample, a different category of analysis for the genomic sample, or a different sample size for a genomic sample.
- the disclosed system can demultiplex the indexed reads to determine which indexing sequences belong to which genomic samples after the completion of the first sequencing pass and efficiently begin transmitting the base-call data files to the appropriate computing device during the sequencing run (e.g., during the second sequencing pass).
- the sequencing ordering system can speed up its distribution of sample-specific base-call-data files.
- the sequencing ordering system expedites determining oligonucleotides belonging to respective genomic samples within a nucleotide-sample-slide pool (or other nucleotide-sample-substrate pool) by base calling the indexing sequences for both read pairs before base calling the genomic sequences in library templates for each sample.
- the sequencing ordering system determines which nucleotide reads belong to which genomic samples and a relative balance of genomic samples.
- the sequencing ordering system can begin generating and transmitting the base-call data files to the appropriate computing device after each genomic sequencing cycle of the sequencing run.
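The indexing-first flow described above, in which indexing sequences are resolved to samples and each sample's base calls are then routed to a destination, can be sketched as below. The index sequences, sample names, device names, and exact-match lookup are hypothetical illustrations; the source does not specify a data format or matching rule.

```python
from collections import defaultdict

# Hypothetical sample sheet: indexing sequence -> genomic sample.
INDEX_TO_SAMPLE = {"ACGT": "sample_1", "TTAG": "sample_2"}

# Hypothetical routing table: sample -> destination chosen from the
# sample's processing parameters (e.g., its secondary sequencing task).
SAMPLE_TO_DEVICE = {"sample_1": "on_instrument_fpga", "sample_2": "cloud_server"}

def demultiplex(reads):
    """Group (index, base_calls) records by sample; unknown indexes are set aside."""
    by_sample = defaultdict(list)
    for index, base_calls in reads:
        sample = INDEX_TO_SAMPLE.get(index, "undetermined")
        by_sample[sample].append(base_calls)
    return dict(by_sample)

def route(by_sample):
    """Map each destination device to the samples whose base calls it should receive."""
    plan = defaultdict(list)
    for sample in by_sample:
        if sample in SAMPLE_TO_DEVICE:
            plan[SAMPLE_TO_DEVICE[sample]].append(sample)
    return dict(plan)

reads = [
    ("ACGT", "GATTACA"),
    ("TTAG", "CCGGA"),
    ("ACGT", "TTTGC"),
    ("NNNN", "ACACA"),   # index that matches no sample
]
by_sample = demultiplex(reads)
plan = route(by_sample)
```

Once the index-to-sample mapping is known, the same lookup can be applied to the base calls produced in each later cycle, so per-sample data can leave the instrument while the run continues.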
- the sequencing-task ordering machine-learning model improves memory management by performing memory-intensive sequencing tasks in an order indicated by task ordering scores, thereby improving memory utilization and ensuring consistent memory availability for better resource management.
- the sequencing ordering system decreases a frequency of delays in performing sequencing tasks by 10-30% relative to a first-in-first-out (FIFO) method as measured by makespan scores. Such a decrease in sequencing-task delays translates into improved run times.
- FIFO: first-in-first-out
- Independent of the sequencing-task ordering machine-learning model, by processing different nucleotide-sample slides according to slide ordering scores generated by a nucleotide-sample-slide ordering machine-learning model, for instance, the sequencing ordering system likewise expedites computing run times (e.g., as measured by makespan scores) by improving completion time and saves memory and/or consumable reagents relative to existing systems.
- As shown in FIGS. 13A-13B, for example, by performing sequencing tasks based on slide ordering scores from the nucleotide-sample-slide ordering machine-learning model, the sequencing ordering system provides an improvement of nearly 15-25% in median makespan scores and 5-15% in average makespan scores relative to a FIFO method.
- the sequencing ordering system decreases a frequency of delays in performing sequencing tasks by 10% to 30% relative to a FIFO method as measured by makespan scores.
- the sequencing ordering system can utilize a combination of the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine learning model to order sequencing tasks according to the output ordering scores.
- This disclosure illustrates an embodiment of such a two-tier sequencing ordering system in FIG. 12. As shown in FIGS. 13A and 13B, such a two-tier sequencing ordering system expedites computing run times by ordering the sequencing tasks based on the task ordering scores and produces makespan scores generally 15% better than a FIFO method.
- such a two-tier sequencing ordering system outperforms a first-in-first-out algorithm by nearly 30% in median makespan scores and 20% in average makespan scores.
- the sequencing ordering system reduces the required run time by 20-30% in median makespan scores.
- the sequencing ordering system can be deployed both on-instrument and off-instrument and offers the flexibility of training/refining the sequencing ordering system with real data that reflects the real-life usage of the instrument. For example, based on ascertained need for a particular type of assay, the ordering machine-learning models can be tuned to reflect the real-life usage of a sequencing device (e.g., tuned to the size/requirements of particular nucleotide-sample-slides). To illustrate, some existing scheduling algorithms utilize a FIFO algorithm based on the upcoming tasks. In contrast, the sequencing ordering system can use a two-tier alignment function that utilizes neural networks incorporating a penalty value (or priority feature) to order and execute sequencing tasks more intelligently.
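One way to picture a two-tier order with a penalty value is the sketch below. The source does not say how slide ordering scores, task ordering scores, and the penalty are combined, so the lexicographic rule used here (slide score first, then task score minus penalty) is an assumption made only for illustration, as are all names and numbers.

```python
from typing import NamedTuple

class Task(NamedTuple):
    name: str
    slide: str
    task_score: float      # hypothetical output of the task-ordering model
    penalty: float = 0.0   # hypothetical penalty value; larger means run later

def two_tier_order(tasks, slide_scores):
    """Order by slide score first, then by penalized task score (both descending)."""
    return sorted(
        tasks,
        key=lambda t: (-slide_scores[t.slide], -(t.task_score - t.penalty)),
    )

# Hypothetical output of the slide-ordering model.
slide_scores = {"slide_A": 0.9, "slide_B": 0.4}

tasks = [
    Task("align_B1", "slide_B", 0.8),
    Task("align_A1", "slide_A", 0.3),
    Task("variant_A1", "slide_A", 0.7, penalty=0.5),
    Task("basecall_A1", "slide_A", 0.6),
]
ordered = [t.name for t in two_tier_order(tasks, slide_scores)]
# ordered == ["basecall_A1", "align_A1", "variant_A1", "align_B1"]
```

Here the penalty pushes `variant_A1` behind `align_A1` within its slide even though its raw task score is higher, while every task on the higher-scoring slide still precedes the tasks on `slide_B`.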
- the sequencing ordering system utilizes ordering machine-learning models to swiftly converge to a solution utilizing a relatively small amount of computing resources, outperforming the training speed seen by current methods of task scheduling.
- the sequencing ordering system beats existing heuristics when training on data for less than 5 days and less than 10 iterations deep with negligible time spent in order evaluation (e.g., 1-2% better than the Tetris heuristic model, and 15% over FIFO).
- the sequencing ordering system can determine to which computing devices and at which time of a sequencing run to distribute sample-specific base-call-data files from a sequencing device, thereby expediting the beginning of secondary analysis for different samples. For example, in some embodiments, the sequencing ordering system generates, demultiplexes, and transfers sample-specific base-call-data files to various processing devices during the sequencing run based on the processing requirements of the sample. As further illustrated in FIGS. 7A-7C, during the sequencing run, the sequencing ordering system can determine different subsets of indexing sequences corresponding to different genomic samples that have different corresponding processing parameters for sequencing tasks. Based on identifying different processing parameters for sequencing tasks corresponding to different genomic samples, the sequencing ordering system can preemptively transmit sample-specific base-call-data files to one or more computing devices during the sequencing run rather than storing the base-call-data files for transmission.
- By transmitting the sample-specific base-call-data files during the sequencing run, the sequencing ordering system both saves processing time and reduces storage requirements. For example, in the case of primary sequencing tasks for a sequencing run that produces approximately 16 Tb of data (e.g., paired-end reads with a length of 150 base pairs over approximately 48 hours of run time), existing sequencing systems require local storage of the 16 Tb of base-call-data files and waiting to begin secondary analysis until after a subsequent batch data transfer over network devices of approximately 7 hours (assuming a 5 Gb/s link).
- Although sequencing devices and servers can include increased memory (e.g., chips for a Field Programmable Gate Array (FPGA) or other configurable processors), this memory can be insufficient to store base-call data files for multiple sequential sequencing runs. Further, by waiting to transfer the primary sequencing task base-call-data files until the end of the instrument run time, existing sequencing systems require local storage of the primary sequencing task base-call-data files, tax the bandwidth of network connections, and delay the start of secondary analysis by up to 55 hours (e.g., 48 hours run time and 7 hours transfer time).
- FPGA: Field Programmable Gate Array
- the sequencing ordering system can transfer the sample-specific base-call-data files to one or more computing devices during the run, thereby relieving local storage requirements for base-call-data files, alleviating bandwidth strain to ensure more efficient network performance, and expediting the start of secondary analysis (e.g., by at least 7 hours).
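The transfer-time figure can be checked with simple arithmetic. The sketch below assumes decimal units and an ideal link with no protocol overhead. Under those assumptions, the roughly 7-hour figure is reproduced when "16 Tb" is read as 16 terabytes; read as 16 terabits, the same link would need under an hour. Which unit the source intends is not stated, so both readings are computed.

```python
def transfer_hours(size_bits, link_bits_per_s):
    """Ideal transfer time in hours, ignoring protocol overhead."""
    return size_bits / link_bits_per_s / 3600

LINK = 5e9        # 5 Gb/s link, as stated in the text
RUN_HOURS = 48    # instrument run time, as stated in the text

hours_if_terabits = transfer_hours(16e12, LINK)        # about 0.89 h
hours_if_terabytes = transfer_hours(16e12 * 8, LINK)   # about 7.1 h

# Delay before secondary analysis can start when the transfer waits for
# the end of the run, using the terabyte reading.
batch_delay_hours = RUN_HOURS + hours_if_terabytes     # about 55 h
```

The terabyte reading also matches the "up to 55 hours" total (48 hours of run time plus about 7 hours of transfer), which is the delay that transferring files during the run is meant to shorten.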
- genomic sample refers to a target genome or portion of a genome undergoing sequencing.
- a genomic sample includes a sequence of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence).
- a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases.
- a genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below.
- the genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
- nucleotide-sample slide refers to a plate or substrate, such as a flow cell, comprising oligonucleotides for sequencing nucleotide sequences from genomic samples or other sample nucleic-acid polymers.
- a nucleotide-sample slide can refer to a substrate containing fluidic channels through which reagents and buffers can travel as part of sequencing.
- the nucleotide-sample slide e.g., a patterned flow cell or non-patterned flow cell
- a flow cell can be an open substrate with one or more regions for oligonucleotide samples to be analyzed and the oligonucleotide samples may be positioned using charged pads or other means.
- the nucleotide-sample substrate can be a membrane having a nanopore through which one or more oligonucleotide samples may pass.
- a flow cell can include tiles and wells (e.g., nanowells) comprising clusters of oligonucleotides.
- a patterned flow cell may take on, but is not limited to, a square, hexagonal, and/or diamond shape.
- sample genomic sequence refers to a nucleotide sequence extracted from, copied from, or complementary to a sample’s chromosome.
- a sample genomic sequence includes a nucleotide sequence that has been separated or copied from chromosomal DNA of a sample or has been sequenced to be complementary to an extracted or copied nucleotide sequence.
- a sample genomic sequence includes genomic DNA (gDNA) for a particular unknown sample.
- the sequence-to-coverage system can use a sample complementary sequence comprising cDNA rather than a sample genomic sequence comprising gDNA in a sample library fragment or wherever suitable cDNA may replace gDNA as understood by a skilled artisan.
- any embodiment or nucleotide read in this disclosure that uses or includes a sample genomic sequence can also use or include a cDNA sequence corresponding to a genomic sample.
- indexing sequence refers to a unique and artificial nucleotide sequence that identifies nucleotide reads for a sample and that is ligated to a sample’s nucleotide sequence (e.g., a gDNA fragment or cDNA fragment) or to another sequence within a sample library fragment.
- an indexing sequence can be part of a sample library fragment.
- an indexing sequence can be used to sort nucleotide reads by sample or into different files, among other things, such as part of a demultiplexing process.
- a sample library fragment includes an indexing primer sequence that differs from a read priming sequence and that indicates a starting point or starting nucleobase for determining nucleobases of an indexing sequence.
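The sorting of nucleotide reads by indexing sequence described above can be illustrated with a minimal sketch. The sample names, indexing sequences, and reads below are hypothetical, and the exact-match lookup is an assumption made for brevity; the disclosure does not specify how mismatches are handled.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group read sequences by the sample whose indexing sequence they carry.

    `reads` is an iterable of (indexing_sequence, read_sequence) pairs.
    Reads whose indexing sequence is unknown go to an "undetermined" bin.
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = index_to_sample.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)


# Hypothetical indexing sequences and reads, for illustration only.
index_to_sample = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}
reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAC", "TTAGGCA"),
    ("GGGGGG", "ACACACA"),
]
demuxed = demultiplex(reads, index_to_sample)
```

Here `demuxed` maps each sample name to its list of reads, with the unmatched read collected under `"undetermined"`.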
- sequencing run refers to an iterative process on a sequencing device to determine a primary structure of nucleotide fragments from a sample (e.g., genomic sample).
- a sequencing run includes cycles of sequencing chemistry and imaging performed by a sequencing device that incorporate nucleobases into growing oligonucleotides to determine nucleotide reads from nucleotide sequences extracted from a sample (or other sequences within a library fragment) and seeded throughout a nucleotide-sample slide.
- a sequencing run includes replicating nucleotide fragments from one or more genome samples seeded in clusters throughout a nucleotide-sample slide (e.g., a flow cell).
- a sequencing device can generate nucleobase-call data in a file, such as a binary base call (BCL) sequence file or a fast-all quality (FASTQ) file.
- sequencing cycle refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to a sample’s sequence (e.g., a genomic or transcriptomic sequence from a sample) or a corresponding adapter sequence.
- a sequencing cycle includes an iteration of both incorporating nucleobases into clusters of oligonucleotides using sequencing chemistry and capturing images of such clusters attached to a flow cell.
- a sequencing cycle can include one or both of an indexing cycle and a genomic sequencing cycle.
- one cluster of oligonucleotides or a set of clusters of oligonucleotides may be undergoing a genomic sequencing cycle in which nucleobases corresponding to a sample genomic sequence are incorporated and another cluster of oligonucleotides or another set of clusters of oligonucleotides may be concurrently undergoing an indexing cycle in which nucleobases corresponding to an indexing sequence for a nucleotide read are incorporated.
- genomic sequencing cycle refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to a sample genomic sequence (or cDNA sequence).
- a genomic sequencing cycle can include an iteration of capturing and analyzing one or more images with data indicating individual nucleobases added or incorporated into an oligonucleotide or to oligonucleotides (in parallel) representing or corresponding to one or more sample genomic sequences.
- each genomic sequencing cycle involves capturing and analyzing images to determine either single reads of DNA (or RNA) strands representing part of a genomic sample (or transcribed sequence from a genomic sample).
- a genomic sequencing cycle, in some cases, is specific to a cluster of oligonucleotides or a set of clusters of oligonucleotides.
- indexing cycle refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to one or more indexing sequences.
- an indexing cycle can include an iteration of capturing and analyzing one or more images of clusters of oligonucleotides indicating one or more nucleobases added or incorporated into an oligonucleotide or to oligonucleotides (in parallel) representing or corresponding to one or more indexing sequences.
- An indexing cycle differs from a genomic sequencing cycle in that an indexing cycle includes sequencing of at least a nucleobase (or a majority of nucleobases) from one or more indexing sequences that identify or encode one or more sample library fragments. Because genomic sequencing cycles may be specific to a cluster or clusters of oligonucleotides, an indexing cycle for one cluster of oligonucleotides may be performed at a same time as a genomic sequencing cycle for another cluster of oligonucleotides.
- a nucleotide sequencing task refers to an operation or a process performed by a computing device as part of determining a sequence of nucleobases for one or more genomic samples (or other nucleotide polymers) or part of saving data from determining such a sequence or from a corresponding analysis.
- a nucleotide sequencing task can include an operation or a process performed by a sequencing device that determines nucleobase sequences of fragments from a genomic sample or performed by another computing device (e.g., server) to analyze data for the nucleobase sequences and/or determine variants within the nucleobase sequences with respect to a reference genome.
- a sequencing task can likewise include an operation or a process of preserving data generated from determining a nucleotide sequence (e.g., base-call data) or an analysis thereof.
- a nucleotide sequencing task can include, but is not limited to, (i) cluster generation, primer hybridization, image analysis, base calling, demultiplexing, or quality scoring for primary sequencing tasks or (ii) read alignment, variant calling, structural variant detection, functional annotation, taxonomic classification, and genome assembly for secondary sequencing tasks.
- set of sequencing tasks refers to a group of sequencing tasks performed by one or more computing devices that determine a sequence of nucleobases for one or more sample genomes (or other nucleotide polymers) or save data from determining such a sequence or from a corresponding analysis.
- a set of sequencing tasks can include a group of operations or processes (i) performed by a sequencing device to determine nucleobase sequences of fragments from a sample genome or save data related to the determined nucleobase sequences and (ii) performed by another computing device (e.g., server) to analyze data related to the determined nucleobase sequences, determine variants within the nucleobase sequences with respect to a reference genome, or save data resulting from the analyzed data.
- a set of sequencing tasks comprises the primary sequencing tasks and/or secondary sequencing tasks associated with a sequencing run for a genomic sample.
- a set of sequencing tasks comprises tasks starting from a sequencing run that generates base-call data through completing (and storing a copy of) variant analysis of the base-call data.
- secondary sequencing tasks refers to secondary analysis tasks performed on base-call data by a computing device to align nucleotide reads with a reference genome, determine genetic variants based on the aligned nucleotide reads, genotype call for a genomic sample, and/or interpret the determined genetic variants or nucleotide reads.
- a secondary sequencing task can include a secondary analysis task performed by a server executing variant-call software to perform genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
- a secondary sequencing task can include a tertiary analysis performed by a server executing bioinformatics software to determine potential genetic diseases (or genetic factors correlating with genetic diseases) based on determined genetic variants of a sample.
- nucleobase call refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a genomic sample.
- a nucleobase call can indicate a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls).
- base-call-data file refers to a digital file or other digital information indicating individual nucleobases or the sequence of nucleobases for a nucleic-acid polymer.
- a base-call-data file can include nucleotide reads comprising nucleobase calls for particular genomic samples.
- Nucleobase-call-data files can include intensity values (e.g., color or light intensity values for individual clusters) from images taken by a camera of a nucleotide-sample slide or other data that indicate individual nucleobases or the sequence of nucleobases for a nucleic-acid polymer.
- a nucleobase-call-data file may include chromatogram peaks or electrical current changes indicating individual nucleobases in a sequence. Additionally, in some embodiments, a nucleobase-call-data file includes individual nucleobase calls identifying the individual nucleobases (e.g., A, T, C, or G).
- a nucleobase-call-data file can comprise data for nucleobase calls in a sequence for a nucleic-acid polymer, the number of nucleobase calls corresponding to a particular base (e.g., adenine, cytosine, thymine, or guanine), as organized in a digital file, such as a Binary Base Call (BCL) file or a Fast-All Q (FASTQ) file.
- the format of the base-call data file can vary based upon the sequencing technology used and can include BCF, BAM, and QSEQ, as well as other formats.
- base-call-data file can include error/accuracy information, such as a quality metric associated with each nucleobase call.
- nucleobase-call data comprises information from a sequencing device that utilizes sequencing by synthesis (SBS).
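As an illustration of the per-call quality metric mentioned above, the following sketch decodes the quality line of a FASTQ record into Phred scores and converts a score into an error probability. It assumes the common Phred+33 ASCII encoding; the record contents are hypothetical, and other base-call-data formats store quality differently.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the estimated base-call error probability."""
    return 10 ** (-q / 10)


# Hypothetical four-line FASTQ record: header, bases, separator, qualities.
record = ["@read_1", "GATTACA", "+", "IIIII#5"]
scores = phred_scores(record[3])
probabilities = [error_probability(q) for q in scores]
```

Each entry of `scores` lines up with the nucleobase call at the same position in the read, so low-quality calls can be filtered or flagged downstream.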
- sequencing task feature refers to a factor, metric, or value that quantifies or represents a sequencing task or a computing resource related to one or more sequencing tasks.
- a sequencing task feature includes a value indicating a setting, boundary, environment variable, or feature vector in which a nucleobase of a particular nucleobase type can be accurately quantified or analyzed using a sequencing device.
- a sequencing task feature includes, but is not limited to, one or more of computing resources, such as accelerator resources, FPGA resources, CPU resources, GPU resources, performance time, and/or memory requirements associated with a sequencing task.
- nucleotide-sample-slide feature refers to a factor, metric, or value that quantifies or represents a nucleotide-sample slide or a computing resource related to one or more nucleotide-sample slides.
- a nucleotide-sample-slide feature includes a value indicating a setting, boundary, or environment variable in which a nucleotide-sample slide can be accurately quantified or analyzed using a sequencing device.
- a nucleotide-sample-slide feature includes processor usage for processing data associated with a nucleotide-sample slide of the set of nucleotide-sample slides, memory requirements for processing data associated with the nucleotide-sample slide, and performance time associated with processing data for the nucleotide-sample slide.
- the sequencing ordering system can generate ordering scores using one or more machine learning models.
- machine-learning model refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular sequencing task set through iterative outputs or predictions based on use of data.
- a machine learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness.
- Example machine learning models include various types of neural networks, decision trees, support vector machines, linear regression models, and Bayesian networks.
- the sequencing ordering system utilizes a sequencing-task ordering machine-learning model, such as a feedforward neural network, to generate or predict task ordering scores indicating a relative order of the set of sequencing tasks based on available computing resources and the set of sequencing task features.
- sequencing-task ordering machine-learning model refers to a machine-learning model that generates task ordering scores indicating a relative order of sequencing tasks.
- the sequencing-task ordering machine-learning model utilizes inputs of sequencing task features and available computing resources (e.g., using model parameters) to generate or predict task ordering scores indicating a relative order of the sequencing tasks.
- the sequencing-task ordering machine-learning model can generate or predict task ordering scores for either primary sequencing tasks associated with base calling for a genomic sample’s nucleotide reads or secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
- a sequencing-task ordering machine-learning model can include a neural network with an input layer for the set of sequencing task features, fully connected hidden layers, activation functions before and after the fully connected hidden layers, and an output layer that outputs the task ordering scores, such as a type of feedforward neural network (or a multilayer perceptron).
- nucleotide-sample-slide ordering machine-learning model refers to a machine-learning model that generates slide ordering scores indicating a relative order of processing nucleotide-sample slides.
- the nucleotide-sample-slide ordering machine-learning model can generate or predict slide ordering scores for determining an order of nucleotide-sample slides on which to perform primary sequencing tasks associated with base calling or for which to perform secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
- a nucleotide-sample-slide ordering machine-learning model includes a feedforward neural network that generates or predicts slide ordering scores indicating a relative order of sequencing tasks based on available computing resources and sequencing task features.
- the nucleotide-sample-slide ordering machine-learning model can include a neural network with an input layer for the set of sequencing task features, fully connected hidden layers, activation functions before and after the fully connected hidden layers, and an output layer that outputs the task ordering scores — such as a type of feedforward neural network (or a multilayer perceptron).
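The feedforward architecture described above (an input layer for features, fully connected hidden layers with activation functions, and an output layer producing an ordering score) can be sketched in plain Python. Everything concrete here is an assumption for illustration: the four feature names, the layer sizes, the ReLU activation, and the random, untrained weights. The disclosure does not specify these details, and a real model would learn its parameters from training data.

```python
import random


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Create untrained weights and zero biases for a layer (illustrative only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def ordering_score(features, layers):
    """Forward pass: ReLU after each hidden layer, single linear output score."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)[0]


rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]

# Hypothetical normalized features per task:
# [processor usage, memory requirement, performance time, available resources]
task_features = {
    "base_calling": [0.9, 0.6, 0.8, 0.5],
    "demultiplexing": [0.3, 0.2, 0.1, 0.5],
    "read_alignment": [0.7, 0.9, 0.6, 0.5],
}
scores = {name: ordering_score(f, layers) for name, f in task_features.items()}
```

Because the weights are random, the resulting scores carry no meaning; the sketch only shows how a feature vector flows through the layers to yield one score per task or slide.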
- the term “makespan score” refers to a measure of the total time or duration required to complete a set of tasks, such as sequencing tasks (e.g., primary or secondary sequencing tasks) or tasks for processing data corresponding to a nucleotide-sample slide.
- a makespan score is used to evaluate the efficiency and performance of scheduling algorithms, production processes, or resource allocation by the sequencing-task ordering machine-learning model or the nucleotide-sample-slide ordering machine-learning model.
- the makespan score quantifies the time taken from the start of the first sequencing task until the completion of the last sequencing task, considering factors such as sequencing task duration, resource availability, and sequencing task features.
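To make the makespan definition above concrete, the following sketch computes the time from the start of the first task to the completion of the last one for a given task order. It assumes identical processors, known task durations, and a simple rule that each task starts on whichever processor becomes free first; these are illustrative assumptions rather than the scheduling method of the disclosure.

```python
import heapq


def makespan(durations_in_order, num_processors):
    """Return the completion time of the last task under greedy list scheduling.

    Tasks are started in the given order, each on the processor that
    becomes available earliest.
    """
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    free_at = [0.0] * num_processors  # time at which each processor is free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations_in_order:
        start = heapq.heappop(free_at)
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical task durations (in hours) on two processors.
short_first = makespan([2, 3, 7], num_processors=2)
long_first = makespan([7, 3, 2], num_processors=2)
```

With these durations, running the short tasks first gives a makespan of 9 hours, while running the longest task first gives 7 hours, which shows why the relative order of tasks affects the makespan.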
- the term “configurable processor” refers to a circuit or chip that can be configured or customized to perform a specific application.
- a configurable processor includes an integrated circuit chip that is designed to be configured or customized on site by an end user’s computing device to perform a specific application.
- Configurable processors include, but are not limited to, an ASIC, ASSP, a coarse-grained reconfigurable array (CGRA), or FPGA.
- configurable processors do not include a CPU or GPU.
- the accelerated genotype-imputation system uses a configurable processor (e.g., FPGA) and/or a processor (e.g., CPU) to perform the various embodiments described herein.
- processing parameters refers to values, specifications, or variables that indicate how a computing device performs primary or secondary analysis or a particular sequencing task.
- processing parameters include a particular type of secondary analysis for a genomic sample (e.g., secondary analysis based on whole genome sequencing versus a cancer array, different sequencing tasks requiring an FPGA or other configurable processor versus a CPU or other non-configurable processor), analysis rights for a genomic sample (e.g., different laboratories or patients having different ownership rights to different samples, different privacy rights), a category of analysis for the genomic sample (e.g., methylation estimates versus variant calling), or a sample size for a genomic sample (e.g., different numbers of oligonucleotide clusters in a flow cell for samples).
- processing parameters can additionally or alternatively include other parameters, such as configuration data, clock settings, resource allocation, input/output definitions, signal timing, security settings, and functional unit configuration used to configure an ASIC, AS
- FIG. 1 illustrates a schematic diagram of a computing system 100 in which a sequencing ordering system 106 operates in accordance with one or more embodiments.
- the computing system 100 includes a server device(s) 102 connected to one or more server device(s) 110, a sequencing device 108, and a client device(s) 114 via a network 118. While FIG. 1 shows an embodiment of the sequencing ordering system 106, this disclosure describes alternative embodiments and configurations below.
- the server device(s) 102, the sequencing device 108, the server device(s) 110, and the client device(s) 114 can communicate with each other via the network 118.
- the network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 19 (computing device Fig).
- the sequencing device 108 comprises a device for sequencing a genomic sample or other nucleic-acid polymer.
- the sequencing device 108 analyzes nucleic-acid segments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems (described herein) either directly or indirectly on the sequencing device 108. More particularly, the sequencing device 108 receives nucleotide-sample slides (e.g., nucleotide-sample-slides) comprising nucleotide fragments extracted from samples and then copies and determines the nucleobase sequence of such extracted nucleotide fragments.
- the sequencing device 108 utilizes SBS to sequence nucleic-acid polymers into nucleotide reads.
- the sequencing device 108 bypasses the network 118 and communicates directly with the server device(s) 102 or the client device(s) 114.
- the server device(s) 102 is located at or near a same physical location of the sequencing device 108. Indeed, in some embodiments, the server device(s) 102 and the sequencing device 108 are integrated into a same computing device, as indicated by dotted lines 122.
- the server device(s) 102 may run a sequencing system 104 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data.
- the sequencing device 108 may send (and the server device(s) 102 may receive) base-call data generated during a sequencing run of the sequencing device 108.
- the server device(s) 110 are located remotely from the server device(s) 102 and the sequencing device 108. Similar to the server device(s) 102, in some embodiments, the server device(s) 110 can include a version of the sequencing system 104. Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as data for scheduling nucleobase calls or sequencing nucleic-acid polymers. Similarly, the sequencing device 108 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 108. The server device(s) 110 may also communicate with the client device(s) 114.
- the server device(s) 110 can send data to the client device(s) 114, including status information for nucleotide sequencing tasks, variant call format (VCF) files, binary base call (BCL) sequence files, sequence read archive (SRA) files, fast-all quality (FASTQ) files, or other information indicating nucleobase calls, sequencing metrics, error data, other sequencing related information, or other metrics.
- the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server.
- the client device(s) 114 can generate, store, receive, and send digital data.
- the client device(s) 114 can receive status data from the server device(s) 102 or receive sequencing metrics from the sequencing device 108.
- the client device(s) 114 may communicate with the server device(s) 102 or the server device(s) 110 to receive a VCF comprising nucleobase calls and/or other metrics, such as sequencing metrics, error data, or other metrics.
- the client device(s) 114 can accordingly present or display information pertaining to variant calls or other nucleobase calls to a user associated with the client device(s) 114. For instance, as shown in FIG.
- the sequencing ordering system 106 determines, during a sequencing run, task ordering scores for the set of sequencing tasks corresponding to genomic samples and parameters for secondary sequencing tasks corresponding to the genomic samples. Further, the server device(s) 102, the sequencing device 108, and/or the server device(s) 110 transmit the task ordering scores for the set of sequencing tasks, the parameters for secondary sequencing tasks, and/or the base-call-data files to the client device(s) 114 indicating a relative order of the secondary sequencing tasks corresponding to the genomic samples.
- FIG. 1 depicts the client device(s) 114 as a desktop or laptop computer
- the client device(s) 114 may comprise various types of client devices.
- the client device(s) 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices.
- the client device(s) 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device(s) 114 are discussed below with respect to FIG. 19.
- the client device(s) 114 includes a sequencing application 116.
- the sequencing application 116 may be a web application or a native application stored and executed on the client device(s) 114 (e.g., a mobile application, desktop application).
- the sequencing application 116 can include instructions that (when executed) cause the client device(s) 114 to receive data from the sequencing ordering system 106 and present, for display at the client device(s) 114, data concerning a status of a nucleotide sequencing task or data from a VCF.
- the sequencing application 116 can instruct the client device(s) 114 to display the status for nucleotide sequencing tasks.
- a version of the sequencing ordering system 106 may be located on the client device(s) 114 as part of the sequencing application 116 or on the server device(s) 110. Accordingly, in some embodiments, the sequencing ordering system 106 is implemented (e.g., located entirely or in part) on the client device(s) 114. In yet other embodiments, the sequencing ordering system 106 is implemented by one or more other components of the computing system 100, such as the server device(s) 110. In particular, the sequencing ordering system 106 can be implemented in a variety of different ways across the server device(s) 102, the sequencing device 108, the client device(s) 114, and the server device(s) 110.
- the sequencing ordering system 106 can be downloaded from the server device(s) 110 to the server device(s) 102 and/or the client device(s) 114 where all or part of the functionality of the sequencing ordering system 106 is performed at each respective device within the computing system 100.
- FIG. 2B illustrates a schematic diagram of the sequencing ordering system 106 determining slide ordering scores for nucleotide-sample slides and processing the nucleotide-sample slides in a relative order according to the slide ordering scores in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 identifies or receives data for a genomic sample(s) 202 to be queued for processing in a sequencing run or for secondary analysis. As further shown, the sequencing ordering system 106 determines sequencing tasks 204 associated with processing the genomic sample(s) 202 for the sequencing run or the secondary analysis. As mentioned, the sequencing tasks 204 can include both primary sequencing tasks and secondary sequencing tasks. For example, the sequencing ordering system 106 can identify data from a FASTQ or BCL file comprising nucleotide reads for a genomic sample(s) 202, which may include any biological specimen or culture that potentially contains the target of interest.
- clusters of oligonucleotides extracted from the genomic sample(s) 202 may be imaged or scanned for subsequent analysis utilizing the sequencing tasks 204.
- the sequencing ordering system 106 can perform the sequencing tasks 204 including primary sequencing tasks, such as indexing cycles to determine nucleobase calls for indexing sequences, generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for nucleotide reads of genomic samples, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, and/or base-call-quality scoring of base calls within the nucleotide reads.
- the sequencing ordering system 106 can perform the sequencing tasks 204 including secondary sequencing tasks, such as genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
- the sequencing ordering system 106 can determine sequencing task features 206 associated with the sequencing tasks 204 of the genomic sample(s) 202. For example, the sequencing ordering system 106 can determine the sequencing task features 206 indicating available computing resources, sequencing task processor usage, sequencing task memory requirements, and/or a sequencing task performance time for respective sequencing tasks of the sequencing tasks 204. As further examples, the sequencing ordering system 106 can determine the sequencing task features 206 that include the available accelerator resources (e.g., FPGA, CPU, GPU) for sequencing the genomic samples.
- the sequencing ordering system 106 can determine a sequencing task relative order 208 and task ordering scores 210.
- the sequencing ordering system 106 can generate the task ordering scores 210 that indicate a relative order for implementing the sequencing tasks 204.
- the sequencing ordering system 106 determines the sequencing task relative order 208 utilizing a sequencing-task ordering machine-learning model (e.g., a neural network) composed of fully connected layers combined with activation functions to produce alignment values that provide the task ordering scores 210 indicating the schedule for the sequencing tasks 204.
- the sequencing ordering system 106 can generate the task ordering scores 210 so that the resulting order minimizes a determined makespan value while also accounting for priority values.
- the sequencing ordering system 106 can then provide the task ordering scores 210 to the sequencing device indicating an order for the ordered tasks 212 and indicating an order to enact a set of sequencing tasks 214.
- the sequencing ordering system 106 provides the task ordering scores 210 indicating an order for the set of sequencing tasks 214 used to sequence the nucleic-acid polymers present in the genomic samples received by a sequencing device.
- the sequencing ordering system 106 provides the task ordering scores 210 indicating an order for the set of sequencing tasks 214 used to map the nucleotide reads to genomic coordinates of a reference genome.
- the sequencing ordering system 106 can provide the task ordering scores 210 for the set of sequencing tasks 214, thereby prompting the sequencing device to schedule primary sequencing tasks and/or secondary sequencing tasks for the sequencing tasks 204.
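How a device might turn task ordering scores into a relative order can be sketched as a simple sort. The task names and score values are hypothetical, and the convention that a higher score means an earlier position is an assumption; the disclosure states only that the scores indicate a relative order.

```python
def relative_order(task_scores):
    """Return task names sorted from highest to lowest ordering score.

    Ties are broken alphabetically so that the result is deterministic.
    """
    return sorted(task_scores, key=lambda name: (-task_scores[name], name))


# Hypothetical task ordering scores, for illustration only.
task_ordering_scores = {
    "read_alignment": 0.62,
    "base_calling": 0.91,
    "variant_calling": 0.35,
    "demultiplexing": 0.78,
}
ordered_tasks = relative_order(task_ordering_scores)
```

The same sort applies unchanged to slide ordering scores keyed by nucleotide-sample-slide identifiers.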
- the sequencing ordering system 106 can also analyze features of nucleotide-sample slides to determine slide ordering scores for nucleotide-sample slides and for processing the nucleotide-sample slides in a relative order according to the slide ordering scores. As shown in FIG. 2B, the sequencing ordering system 106 can determine slide ordering scores associated with a nucleotide-sample-slide relative order. More particularly, the sequencing ordering system 106 receives or detects nucleotide-sample-slide(s) 216 (e.g., flow cells) comprising oligonucleotides extracted from genomic samples.
- the nucleotide-sample-slide(s) 216 can refer to a slide containing fluidic channels through which reagents and buffers can travel as part of sequencing.
- the nucleotide-sample-slide(s) 216 includes a flow cell (e.g., a patterned nucleotide-sample-slide or non-patterned nucleotide-sample-slide) comprising small fluidic channels and short oligonucleotides complementary to binding adapter sequences.
- the nucleotide-sample-slide(s) 216 can include wells (e.g., nanowells) comprising clusters of oligonucleotides.
- the nucleotide-sample-slide(s) 216 may be imaged or scanned for subsequent analysis utilizing sequencing tasks 218.
- the sequencing ordering system 106 can perform the sequencing tasks 218 including primary sequencing tasks, such as generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of the genomic sample, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads.
- the sequencing ordering system 106 can perform the sequencing tasks 218 including secondary sequencing tasks such as genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant-calling for genomic samples based on the nucleotide reads, detecting structural variants or annotating phenotypes associated with variant calls.
- the sequencing ordering system 106 can determine nucleotide-sample-slide features 220 associated with processing data for each of the nucleotide-sample-slide(s) 216. For example, the sequencing ordering system 106 can determine the nucleotide-sample-slide features 220 indicating available computing resources, processor usage, memory requirements, and/or a performance time for processing data associated with the nucleotide-sample-slide(s) 216.
- the sequencing ordering system 106 can determine a nucleotide-sample-slide relative order 222 and slide ordering scores 224.
- the sequencing ordering system 106 can generate the slide ordering scores 224 that indicate a relative order of the nucleotide-sample-slide(s) 216 and the sequencing tasks 218.
- the sequencing ordering system 106 determines the nucleotide-sample-slide relative order 222 utilizing a sequencing-task ordering machine-learning model as shown in FIG. 5 to produce alignment values that provide slide ordering scores 224 indicating a schedule alignment for sequencing tasks 218.
- the sequencing ordering system 106 can generate the slide ordering scores 224 so that they minimize a determined makespan value while also accounting for priority values.
- the sequencing ordering system 106 can subsequently provide the slide ordering scores 224 to the sequencing device indicating an order for the sequencing device to perform the sequencing tasks 228 (e.g., by aligning the sequencing tasks 218). For example, the sequencing ordering system 106 determines the slide ordering scores 224 and provides the slide ordering scores 224 to the sequencing device. Furthermore, the sequencing device orders the nucleotide-sample-slide(s) 216 to process ordered nucleotide-sample-slide(s) 226 based on the slide ordering scores 224. As mentioned, the sequencing ordering system 106 can provide an order for scheduling primary sequencing tasks and/or secondary sequencing tasks of the sequencing tasks 228.
- the sequencing ordering system 106 can access or identify sequencing task features 306 indicating available computing resources as a metric, a setting, a boundary, an environment variable, and/or a feature vector.
- the sequencing task features can include a task processor usage feature 308, a task memory requirements feature 310, and/or a task performance time feature 312 for respective sequencing tasks of the sequencing tasks 304.
- the sequencing task features 306 can include features that assess the computational infrastructure required for primary sequencing tasks, such as reading the nucleotide sequences, and secondary sequencing tasks such as aligning and assembling these sequences into a genome.
- the sequencing task features 306 include values for the task processor usage feature 308 that quantify the computational power associated with the sequencing tasks 304, including the number and type of processors.
- the task processor usage feature 308 can include the number of FPGAs/CPUs/GPUs and the amount of available RAM associated with the sequencing tasks 304.
- the task processor usage feature 308 can also include data representing the computational load on the sequencing system 104 (e.g., sequencing device 108, server device(s) 102, server device(s) 110, and/or client device(s) 114) and can be operationalized as the percentage of processor time required or as the intensity of the computations needed.
- the task processor usage feature 308 includes values for required processing power that influence the capacity of the sequencing ordering system 106 to process primary sequencing tasks like nucleotide identification, and secondary tasks such as sequence assembly and annotation.
- the sequencing ordering system 106 can provide the task ordering scores 318 that account for the high processor usage of real-time base calling algorithms due to their computational intensity.
- the sequencing task features 306 include data representing the task performance time feature 312 to quantify the time requirements needed to execute the sequencing tasks 304.
- the sequencing ordering system 106 can utilize the task performance time feature 312 to account for the time taken to complete the sequencing tasks 304.
- the sequencing ordering system 106 can use the task performance time feature 312 to account for the throughput rate of the sequencer for primary sequencing tasks and use the task performance time feature 312 to account for the duration of computational analyses of secondary sequencing tasks such as comparative genomics.
- the sequencing ordering system 106 can include the sequencing task features 306 of the task performance time feature 312, the task processor usage feature 308 (CPU), the task memory requirements feature 310, and the task processor usage feature 308 (FPGA) such as:
- the sequencing ordering system 106 utilizes a sequencing-task ordering machine-learning model 314 to generate task ordering scores 318.
- By processing the sequencing task features 306 and accounting for available computing resources (e.g., using model parameters), the sequencing-task ordering machine-learning model generates task ordering scores 318 indicating a relative order for the primary/secondary sequencing tasks 320.
- the sequencing-task ordering machine-learning model 314 includes a neural network composed of fully connected layers combined with activation functions to produce alignment values that provide task ordering scores 318. This disclosure describes an example architecture for a sequencing-task ordering machine-learning model with respect to FIG. 5 below.
- the sequencing-task ordering machine-learning model 314 generates task ordering scores 318.
- the task ordering scores 318 represent values that reflect the assessed priority of the sequencing tasks 304 to maximize the runtime efficiency of the sequencing tasks within the sequencing ordering system 106.
- the sequencing-task ordering machine-learning model 314 generates task ordering scores 318 that can be used to execute the sequencing tasks 304 with more efficient utilization of resources, reduced turnaround times for the sequencing tasks 304, and an overall increase in the throughput of the genomic sequencing process.
- the sequencing-task ordering machine-learning model 314 generates the task ordering scores 318 that minimize the makespan value for performing the sequencing tasks 304. In this way, the sequencing ordering system 106 can strategically order the sequencing tasks 304 (e.g., particularly in high-volume environments) to provide significant improvements in productivity and efficiency.
- the sequencing-task ordering machine-learning model 314 utilizes the task ordering scores 318 to provide a ranking for the sequencing tasks 304 indicating a relative order for the primary/secondary sequencing tasks 320.
- the sequencing tasks 304 are arranged, not on an arbitrary basis, but in a sequence that reflects their assessed priority from the task ordering scores 318.
- the sequencing ordering system 106 causes the sequencing task 304 with the highest score of the task ordering scores 318 to be scheduled first.
- the sequencing ordering system 106 further causes the sequencing tasks 304 to be performed according to the task ordering scores 318 on the computing device(s) 322.
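The ranking step described above (highest task ordering score scheduled first) can be sketched as a simple sort. This is a minimal illustration only: the function name `rank_tasks`, the task names, and the score values are hypothetical and do not come from the disclosure.

```python
from typing import Dict, List


def rank_tasks(task_scores: Dict[str, float]) -> List[str]:
    """Return task names ordered from highest to lowest ordering score."""
    # sorted() is stable, so tasks with equal scores keep their input order.
    return sorted(task_scores, key=lambda name: task_scores[name], reverse=True)


# Hypothetical scores, invented for illustration only.
example_scores = {
    "base_calling": 0.91,
    "variant_calling": 0.40,
    "read_alignment": 0.73,
}
ordered = rank_tasks(example_scores)
```

Here `ordered` lists the task with the highest score first, which is the task that would be scheduled first under the ranking described above.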
- the computing device(s) 322 can include a sequencing device and/or a computing server device.
- the computing device(s) 322 can include one or more of the sequencing device 108, the server device(s) 102, the server device(s) 110, and the client device(s) 114 as described with relation to FIG. 1.
- the sequencing ordering system 106 provides sequencing task features to the sequencing-task ordering machine-learning model.
- FIG. 4 illustrates providing primary and/or secondary sequencing task features to the sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure. The following paragraphs provide examples of such primary and/or secondary sequencing task features.
- the sequencing ordering system 106 receives or identifies sequencing task features, including primary sequencing task features 404 and secondary sequencing task features 418, associated with a nucleotide-sample slide 402.
- the sequencing task features can include a metric, a setting, a boundary, an environment variable, or a feature vector representing the performance time, processor usage (e.g., CPU and FPGA), memory usage, and/or other resource requirements for each sequencing task.
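One way to picture a per-task feature vector of this kind is a small record with one field per resource requirement. The sketch below is an assumption for illustration: the class name, field names, units, and example values are hypothetical, and the disclosure does not prescribe this layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SequencingTaskFeatures:
    """Hypothetical container for the resource features of one sequencing task."""

    performance_time_s: float  # expected run time, in seconds (assumed unit)
    cpu_usage: float           # fraction of CPU capacity required, 0.0 to 1.0
    memory_gb: float           # memory requirement, in gigabytes (assumed unit)
    fpga_usage: float          # fraction of FPGA capacity required, 0.0 to 1.0

    def to_vector(self) -> List[float]:
        """Flatten the features into the order a model input layer would read."""
        return [
            self.performance_time_s,
            self.cpu_usage,
            self.memory_gb,
            self.fpga_usage,
        ]


# Invented values for illustration only.
example = SequencingTaskFeatures(3600.0, 0.75, 64.0, 0.5)
vector = example.to_vector()
```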
- the sequencing ordering system 106 can receive or identify the primary sequencing task features 404 including an oligonucleotide-cluster-generation feature 406, a hybridizing primers feature 408, an analyzing images feature 410, a base calling feature 412, a demultiplexing-nucleotide-reads feature 414, and/or a base-call-quality scoring feature 416.
- the sequencing ordering system 106 can access or identify the oligonucleotide-cluster-generation feature 406 associated with the nucleotide-sample slide 402.
- the oligonucleotide-cluster-generation feature 406 can include data quantifying the computing (e.g., processor, time, or memory) resources required for attaching oligonucleotides onto a specially coated slide or nucleotide-sample-slide so that they are spatially separated into distinct, individual clusters.
- the oligonucleotide-cluster-generation feature 406 can include data quantifying the computing resources required for amplifying the oligonucleotides to create a dense area of identical DNA fragments.
- the oligonucleotide-cluster-generation feature 406 includes data quantifying the computing resources required for bridge amplification, where each bound DNA fragment is copied in situ, creating a localized amplification of DNA sequences.
- the sequencing ordering system 106 can additionally or alternatively access the hybridizing primers feature 408 associated with sequencing the clusters of oligonucleotides on the nucleotide-sample slide 402.
- the hybridizing primers feature 408 can include data quantifying the computing (e.g., processor, time, or memory) resources required for heating the DNA to separate its two strands and then cooling it to allow the primers to bind, or anneal, to their complementary sequences on the single-stranded DNA.
- the hybridizing primers feature 408 can include data quantifying the computing resources required to align the primer sequence with a specific segment of the template DNA, and through hydrogen bonding, form stable, double-stranded structures at a complementary site.
- the sequencing ordering system 106 can additionally or alternatively access the analyzing images feature 410 associated with the nucleotide-sample slide 402.
- the analyzing images feature 410 includes data quantifying the computing (e.g., processor, time, memory) resources required to add fluorescently labeled nucleotides to sequence the clusters of DNA fragments after they have been amplified on a nucleotide-sample-slide.
- the analyzing images feature 410 can include data quantifying the computing resources required to capture the image of the fluorescent signal as each nucleotide is incorporated into the growing DNA strand during the sequencing reaction, which fluorescent signal corresponds to the identity of the incorporated nucleotide.
- the base calling feature 412 can include data quantifying the computing resources required to analyze the signals, which can consist of fluorescent or electrical changes, to assign a base to each signal peak.
- the base calling feature 412 can include data quantifying the computing resources required to determine a confidence score assessing the quality of each called base, which indicates the likelihood that each base was identified correctly.
- the sequencing ordering system 106 can additionally or alternatively access the demultiplexing-nucleotide-reads feature 414 associated with the resources required to index sequences corresponding to the genomic samples of the nucleotide-sample slide 402.
- the demultiplexing-nucleotide-reads feature 414 includes data quantifying the computing (e.g., processor, time, memory) resources required to separate mixed sequence data into distinct samples based on unique identifiers such as index sequences or barcodes.
- the demultiplexing-nucleotide-reads feature 414 can include data quantifying the computing resources required to demultiplex the combination of reads from all the genomic samples to identify the barcode sequences within each read and assign the read to the corresponding sample.
- the demultiplexing-nucleotide-reads feature 414 includes data quantifying the computing resources required to incorporate barcode design into sequencing libraries, provide quality control to ensure accurate read attribution, and sort data to organize reads by sample.
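The demultiplexing operation whose cost this feature quantifies can be illustrated with a short sketch. It assumes, purely for illustration, that each read carries its barcode as a fixed-length prefix and that barcodes must match exactly; real pipelines often use separate index reads and tolerate mismatches. The function name and sample labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def demultiplex(
    reads: List[str],
    barcode_to_sample: Dict[str, str],
    barcode_len: int,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group reads by sample using an exact-match barcode prefix."""
    by_sample: Dict[str, List[str]] = defaultdict(list)
    unassigned: List[str] = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Barcode not recognized: keep the whole read for later inspection.
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample), unassigned


# Invented reads and barcodes for illustration only.
assigned, leftover = demultiplex(
    ["ACGTTTGA", "TTAGCCCA", "ACGTGGGA", "GGGGAAAA"],
    {"ACGT": "sample_1", "TTAG": "sample_2"},
    barcode_len=4,
)
```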
- the sequencing ordering system 106 can access the base-call-quality scoring feature 416 corresponding to the genomic samples of the nucleotide-sample slide 402.
- the base-call-quality scoring feature 416 includes data quantifying the computing (e.g., processor, time, memory) resources required to assign a confidence value (indicative of how likely it is that each base call is correct) to each nucleotide identified in a DNA sequence during the sequencing process.
- the base-call-quality scoring feature 416 includes data quantifying the computing resources required to generate a score that is represented on a logarithmic scale, where a higher score denotes a higher confidence in the accuracy of the base call.
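The disclosure does not name the logarithmic scale. As an assumption, the sketch below uses the widely used Phred convention, Q = -10 * log10(P), where P is the estimated probability that a base call is wrong. The function names are hypothetical.

```python
import math


def error_probability_to_quality(p_error: float) -> float:
    """Convert a base-call error probability to a Phred-style quality score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def quality_to_error_probability(quality: float) -> float:
    """Invert the conversion: a higher score means a lower error probability."""
    return 10.0 ** (-quality / 10.0)
```

Under this convention, an error probability of 1 in 1,000 corresponds to a score of about 30, and each additional 10 points reflects a tenfold drop in the estimated error probability.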
- the base-call-quality scoring feature 416 can include data quantifying the computing resources required to interpret the strength and clarity of the signals that correspond to the incorporation of nucleotides in the DNA sequence based on factors such as chemical anomalies, sequencing-device errors, or issues with the sample itself.
- the sequencing ordering system 106 generates, receives, or identifies base-call-data files that may include the raw output (BCL, SRA, VCF, FASTQ) from a sequencing device and contain the nucleotide reads for one or more genomic samples. As shown in FIG. 4, in certain embodiments, the sequencing ordering system 106 generates, receives, or identifies a base-call-data file 403.
- the sequencing ordering system 106 can receive or identify the secondary sequencing task features 418, including a genotype-quality scoring feature 420, a mapping nucleotide reads feature 422, an aligning nucleotide reads feature 424, a variant calling feature 426, a detecting structural variants feature 428, and an annotating phenotypes feature 430.
- the sequencing ordering system 106 can access or identify the genotype-quality scoring feature 420 associated with the base-call-data file 403.
- the genotype-quality scoring feature 420 includes data quantifying the computing (e.g., processor, time, memory) resources required for generating a statistical measure of confidence in a genotype call associated with the base-call-data file 403.
- the genotype-quality scoring feature 420 can additionally or alternatively include data quantifying the computing resources required to analyze the alignment of sequencing reads against a reference genome and identify places where the sequenced DNA differs from the reference and assign a genotype-quality score based on the probability that the genotype call is correct.
- the genotype-quality scoring feature 420 can include data quantifying the computing resources required for evaluating the depth of coverage (number of reads supporting the call) and the agreement between those reads.
- the sequencing ordering system 106 can additionally or alternatively access the mapping nucleotide reads feature 422 to represent mapping the genomic coordinates of a reference genome for a base-call-data file 403.
- the mapping nucleotide reads feature 422 can include data quantifying the computing (e.g., processor, time, memory) resources required for aligning the nucleotide reads obtained from a sequencing device to a reference genome or assembling the nucleotide reads de novo if no reference is available.
- the mapping nucleotide reads feature 422 can include data quantifying the computing resources required for preprocessing the reads to trim adapters and filter out low-quality sequences.
- the mapping nucleotide reads feature 422 can include a representation of the computing resources required for specialized algorithms to perform genomic alignment, considering the complexities of the genome, such as repetitive regions and potential sequencing errors.
- the mapping nucleotide reads feature 422 can include data quantifying the computing resources required for post-processing the aligned reads to identify regions with low coverage, potential misalignments, and to mark duplicate sequences that result from PCR amplification.
- the sequencing ordering system 106 can additionally or alternatively access the aligning nucleotide reads feature 424, which can include data quantifying the computing (e.g., processor, time, memory) resources required to align the nucleotide reads with the reference genome for the base-call-data file 403.
- the aligning nucleotide reads feature 424 can include a representation of the computing resources required for arranging sequencing reads to a reference genome or alternative contiguous sequence.
- the aligning nucleotide reads feature 424 can include a representation of the computing resources required for quality filtering and trimming to ensure that only high-quality data is used for alignment.
- the aligning nucleotide reads feature 424 can include data quantifying the computing resources required for the use of alignment algorithms to take the processed reads and map them to the reference genome and account for mismatches, insertions, and deletions, which may represent either sequencing errors or genuine variants.
- the aligning nucleotide reads feature 424 can additionally or alternatively include data quantifying the computing resources required for sorting and indexing to flag duplicate reads and perform local realignment to improve accuracy at indel positions.
- the sequencing ordering system 106 can additionally or alternatively access the variant calling feature 426 based on the nucleotide reads for the base-call-data file 403.
- the variant calling feature 426 can include data quantifying the computing (e.g., processor, time, memory) resources required for identifying differences between the sequenced DNA and a reference sequence.
- the variant calling feature 426 can include data quantifying the computing resources required for analyzing the nucleotide read alignments to detect discrepancies that may indicate biological variations, such as single nucleotide polymorphisms (SNPs), insertions, and deletions (indels).
- the variant calling feature 426 can include data quantifying the computing resources required for utilizing probabilistic models to determine the likelihood of a variant being real versus a sequencing or alignment error incorporating factors like the base quality scores, alignment quality, and sequence context.
- the sequencing ordering system 106 can additionally or alternatively access the detecting structural variants feature 428 for the base-call-data file 403.
- the detecting structural variants feature 428 can include data quantifying the computing (e.g., processor, time, memory) resources required for identifying large-scale alterations in the genome such as deletions, insertions, duplications, inversions, and translocations that span more than 50 base pairs.
- the detecting structural variants feature 428 can include data quantifying the computing resources required for the analysis of read alignments for patterns that indicate a structural variant.
- the detecting structural variants feature 428 can include data quantifying the computing resources required for statistical modeling to differentiate true structural variants from alignment artifacts or normal genomic variation.
- the sequencing ordering system 106 can additionally or alternatively access the annotating phenotypes feature 430 for the base-call-data file 403.
- the annotating phenotypes feature 430 can include data quantifying the computing (e.g., processor, time, memory) resources required for the association of identified genetic variants with their potential phenotypic outcomes.
- the annotating phenotypes feature 430 can include data quantifying the computing resources required for the use of bioinformatics tools and software to align phenotypic information with known phenotype associations with the genomic data from the base-call-data file 403.
- the annotating phenotypes feature 430 can include data quantifying the computing resources required for predictive modeling to infer potential phenotypes based on the biological functions of genes impacted by the variants.
- the annotating phenotypes feature 430 can include data quantifying the computing resources required for predicting the phenotypic outcome or disease association of each variant and generating a pathogenicity assessment of the clinical relevance of each variant.
- the sequencing ordering system 106 provides the primary sequencing task features 404 and/or the secondary sequencing task features 418 to the sequencing-task ordering machine-learning model 440.
- the following paragraphs provide further details concerning embodiments of the sequencing-task ordering machine-learning model 440.
- FIG. 5 illustrates an example architecture for a sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- a sequencing-task ordering machine-learning model 510 is a neural network with two hidden layers that is fully connected and equipped with activation functions (e.g., a Multilayer Perceptron).
- the sequencing-task ordering machine-learning model 510 is configured with model parameter(s) 520 that include adjustable weights and biases (e.g., 88 parameters).
- the model parameter(s) 520 which include the weights and biases across the layers of the sequencing-task ordering machine-learning model 510 are optimized using a genetic algorithm.
- the sequencing ordering system 106 can utilize the sequencing-task ordering machine-learning model 510 with more or fewer hidden layers and neurons than shown in FIG. 5.
- the sequencing-task ordering machine-learning model 314 can utilize a fully connected feedforward neural network with two hidden layers, where connections between the nodes do not form a cycle (e.g., a multilayer perceptron (MLP)).
- the sequencing-task ordering machine-learning model 510 is a fully connected neural network, where each neuron in one layer is connected to all neurons in the subsequent layer, with the two hidden layers providing for the extraction of features at two different levels of hierarchy or abstraction.
- the sequencing-task ordering machine-learning model 510 utilizes activation functions to introduce non-linearity into the network and to model complex patterns that are not linearly separable.
- the sequencing-task ordering machine-learning model 510 can utilize activation functions including ReLU (Rectified Linear Unit), softmax, sigmoid, or tanh. As shown, the sequencing-task ordering machine-learning model 510 passes each neuron output through the activation function before being fed to the next layer. Furthermore, the sequencing-task ordering machine-learning model 510 utilizes biases added to the input of the activation functions for each neuron, thereby enabling the activation function to be shifted to the left or right.
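The named activation functions, and the way a bias shifts an activation left or right, can be sketched in plain Python as follows. The helper names are hypothetical and nothing here is specific to the disclosed model.

```python
import math
from typing import List


def relu(x: float) -> float:
    """Rectified Linear Unit: pass positive inputs, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return math.tanh(x)


def softmax(xs: List[float]) -> List[float]:
    """Normalize a list of values into probabilities that sum to one."""
    m = max(xs)  # subtracting the maximum keeps exp() in a safe range
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased(activation, bias: float):
    """Return an activation whose input is shifted by a bias term."""
    return lambda x: activation(x + bias)


# With a bias of -2.0, the sigmoid midpoint moves from x = 0 to x = 2.
shifted_sigmoid = biased(sigmoid, -2.0)
```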
- the sequencing-task ordering machine-learning model 510 includes a first hidden layer 514.
- the input data neurons 512 of the sequencing-task ordering machine-learning model 510 process the input data, which represent the sequencing tasks (e.g., sequencing tasks 304) and sequencing task features (e.g., sequencing task features 306), and pass the input data to the first hidden layer 514.
- each input data neuron of the input data neurons 512 in the input layer is connected to every neuron in the first hidden layer 514.
- the sequencing-task ordering machine-learning model 510 transmits a vector or data signal from each input data neuron of the input data neurons 512 to each neuron in the first hidden layer 514, multiplied by a corresponding weight (e.g., from model parameter(s) 520). These products are summed, resulting in a weighted sum for each hidden neuron of the first hidden layer 514.
- a bias term (e.g., from model parameter(s) 520), for each neuron in the first hidden layer 514, is added to the weighted sum, which allows the threshold of an activation function 515 to be adjusted.
- the result of the weighted sum plus the bias is passed through the activation function 515 (e.g., ReLU, Sigmoid, Tanh) for each neuron of the first hidden layer 514.
- This activation function 515 introduces non-linearity, allowing the sequencing-task ordering machine-learning model 510 to model complex relationships.
- the sequencing-task ordering machine-learning model 510 sends the activated value of each neuron in the first hidden layer 514 to each neuron in the second hidden layer 516. As shown, every neuron in the second hidden layer 516 is fully-connected to every neuron in the first hidden layer 514. As with the first hidden layer 514, the sequencing-task ordering machine-learning model 510 calculates a weighted sum of inputs for each neuron from the previous layer, adds a bias, and then applies an activation function 517.
- the sequencing-task ordering machine-learning model 510 adds a bias term (e.g., from model parameter(s) 520) for each neuron in the second hidden layer 516 to the weighted sum, which allows the threshold of the activation function 517 to be adjusted.
- After adding a bias term, as further indicated by FIG. 5, the sequencing-task ordering machine-learning model 510 passes features representing the result of the weighted sum plus the bias through an activation function (e.g., ReLU, Sigmoid, Tanh) for each neuron of the second hidden layer 516.
- the second hidden layer 516 has the capacity to learn even more complex patterns by combining the features extracted by the first hidden layer 514.
- the sequencing-task ordering machine-learning model 510 combines the activated outputs from the second hidden layer 516 with a set of weights and biases (e.g., model parameter(s) 520). As shown, the sequencing-task ordering machine-learning model 510 applies a final activation function 519 to obtain the task ordering scores 518. As mentioned, the sequencing-task ordering machine-learning model 510 provides the task ordering scores 518 based on the model parameter(s) 520.
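The forward pass described above (weighted sum, bias, activation, repeated for two hidden layers and an output) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the layer sizes, the ReLU hidden activations, the tanh output activation, and every weight value are hypothetical choices for illustration, not values from the disclosure.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One layer = (weights[out][in], biases[out]).
Layer = Tuple[List[List[float]], List[float]]


def relu(x: float) -> float:
    return max(0.0, x)


def dense(inputs: Sequence[float], layer: Layer,
          activation: Callable[[float], float]) -> List[float]:
    """Weighted sum of inputs plus bias, passed through the activation."""
    weights, biases = layer
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features: Sequence[float], layers: Sequence[Layer]) -> float:
    """Run one task's feature vector through two hidden layers and an output."""
    hidden_1 = dense(features, layers[0], relu)
    hidden_2 = dense(hidden_1, layers[1], relu)
    (score,) = dense(hidden_2, layers[2], math.tanh)  # single output neuron
    return score


def count_parameters(layers: Sequence[Layer]) -> int:
    """Total number of adjustable weights and biases."""
    return sum(sum(len(row) for row in w) + len(b) for w, b in layers)


# Tiny hand-written network (2 inputs, 2 + 2 hidden neurons, 1 output).
toy_layers: List[Layer] = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.5, 0.0]),
    ([[1.0, -1.0]], [-0.5]),
]
toy_score = forward([1.0, -2.0], toy_layers)
```

Scoring every task with the same `forward` call and then sorting by score yields a relative order of the kind the task ordering scores are described as providing.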
- the sequencing ordering system 106 uses a training process to select a highest performing sequencing-task ordering machine-learning model to generate the task ordering scores.
- FIGS. 6A-6B illustrate selecting the highest performing sequencing-task ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 trains a sequencing-task ordering machine-learning model by using a genetic algorithm to select, from among candidate models, a highest performing sequencing-task ordering machine-learning model for the sequencing-task ordering machine-learning model 510. As shown, the sequencing ordering system 106 determines a set of initial sequencing-task ordering machine-learning model(s) 610. The sequencing ordering system 106 randomly initializes each model of the initial sequencing-task ordering machine-learning model(s) 610 with different model parameters (e.g., weights and biases).
- the sequencing ordering system 106 initializes weights in one or more of the initial sequencing-task ordering machine-learning model(s) 610 to the inverse square root of a next layer size within the respective initial sequencing-task ordering machine-learning model. In one or more embodiments, the sequencing ordering system 106 utilizes frequency metadata indicating a number of times a given sequencing task occurs as part of a training data set. Further, in certain implementations, the sequencing ordering system 106 utilizes a set of the initial sequencing-task ordering machine-learning model(s) 610 with a population size of 8192.
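A population of randomly initialized candidate models could be built along these lines. The sketch reads "inverse square root of a next layer size" as a scale factor for uniformly drawn weights, which is one possible interpretation and an assumption here. The layer sizes, the seed, and the small population used in the example are also hypothetical (the text above mentions a population size of 8192).

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], biases[out])
Model = List[Layer]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Draw weights and biases uniformly, scaled by 1 / sqrt(next layer size)."""
    scale = 1.0 / math.sqrt(n_out)  # assumed reading of the initialization rule
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return weights, biases


def init_model(layer_sizes: Sequence[int], rng: random.Random) -> Model:
    """Build one candidate model from consecutive layer sizes."""
    return [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]


def init_population(size: int, layer_sizes: Sequence[int],
                    seed: int = 0) -> List[Model]:
    """Create `size` candidate models, each with different random parameters."""
    rng = random.Random(seed)
    return [init_model(layer_sizes, rng) for _ in range(size)]


# Small population for illustration; layer sizes are hypothetical.
population = init_population(8, [4, 8, 4, 1])
```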
- the sequencing ordering system 106 can determine makespan scores on a training set (e.g., 100 nucleotide-sample slides with 5 days of simulated time) to evaluate the fitness of the initial sequencing-task ordering machine-learning model(s) 610 based on the ordered sequencing tasks.
- the makespan score refers to a measure of the total time or duration required to complete a set of sequencing tasks as part of determining a sequence of nucleobases for one or more sample genomes (or other nucleotide polymers) or part of saving data from determining such a sequence or from a corresponding analysis.
- the sequencing ordering system 106 determines makespan scores that represent the cumulative time span for completing the sequencing tasks.
- a makespan score can depend on the specifications and variables of a sequencing run.
- up to 8 nucleotide-sample slides can be in the given sequencing device simultaneously and each nucleotide-sample slide (e.g., respective nucleotide-sample slides of the set of nucleotide-sample slides) can have oligonucleotide clusters from hundreds or thousands of genomic samples.
- the sequencing ordering system 106 runs imaging and chemistry cycles for every nucleotide-sample slide and the line scanner can process up to 4 nucleotide-sample slides at a time. Each nucleotide-sample slide requires a primary sequencing task and while the images are being produced from the previous phase, the sequencing device can begin processing.
- the sequencing ordering system 106 utilizes a computer processor (e.g., FPGA/CPU/GPU) with 24 cores and 512 GB of RAM for the primary sequencing tasks.
- the makespan score quantifies the time for completing primary sequencing tasks given the specifications and variables noted above for a sequencing run.
- the sequencing ordering system 106 can further execute secondary sequencing tasks with one job for each sample per nucleotide-sample slide. In some embodiments, the secondary sequencing tasks begin after the primary sequencing task. In certain implementations, the sequencing ordering system 106 utilizes a computer processor with 28 cores, 512 GB of RAM, and 2 FPGAs for the secondary sequencing tasks. The sequencing ordering system 106 utilizes the makespan score to quantify the time taken from the start of the first sequencing task until the completion of the last sequencing task based on factors such as sequencing task duration, resource availability, and sequencing task features.
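To make the makespan idea concrete, the sketch below simulates a much simpler setting than the one described above: tasks are taken in the given order and each is assigned to whichever of several identical workers becomes free first, and the makespan is the finish time of the last task. The worker model, the function name, and the durations are assumptions for illustration; the sketch ignores slide capacity, memory limits, FPGA availability, and the primary-before-secondary dependency.

```python
import heapq
from typing import Sequence


def makespan(ordered_durations: Sequence[float], n_workers: int) -> float:
    """Time from the first task start to the last task finish."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Min-heap of the times at which each worker next becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    finish = 0.0
    for duration in ordered_durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# The same three tasks in two different orders, on two workers.
short_first = makespan([1.0, 1.0, 4.0], n_workers=2)
long_first = makespan([4.0, 1.0, 1.0], n_workers=2)
```

In this toy case, starting the long task first gives a shorter makespan than starting it last, which illustrates why the order of tasks can change the total completion time even when the set of tasks is fixed.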
- the sequencing ordering system 106 selects a subset of the initial sequencing-task ordering machine-learning model(s) 610 to serve as the parent sequencing-task ordering machine-learning model(s) 620 for the next generation. As shown, the sequencing ordering system 106 evaluates all of the initial sequencing-task ordering machine-learning model(s) 610 using a fitness function (e.g., an objective function that evaluates the performance of the initial sequencing-task ordering machine-learning model(s) 610) based on a set of training data to generate predicted ordering scores and a makespan value.
- such a set of training data can include metadata indicating a frequency at which a given sequencing task occurs within the set of training data, such as a count quantifying a number of times each particular sequencing task was performed overall, within a particular time frame, or within a given sequencing run.
- the sequencing ordering system 106 can utilize selection strategies for the parent sequencing-task ordering machine-learning model(s) 620 that include (i) tournament selection, where random subsets of models compete, or (ii) roulette wheel selection, where the probability of selection is proportional to fitness as measured by makespan scores.
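- The two selection strategies named above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and converting a makespan into a fitness by taking its reciprocal (so that shorter makespans are fitter) is an assumption, since the text does not specify the mapping.

```python
import random


def fitness_from_makespan(makespans):
    # Assumption: a shorter makespan is better, so use the reciprocal.
    return [1.0 / m for m in makespans]


def tournament_select(population, fitness, k, rng):
    """Draw k distinct models at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


def roulette_select(population, fitness, rng):
    """Return one model, chosen with probability proportional to fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]


rng = random.Random(0)
models = ["m0", "m1", "m2", "m3"]
fitness = fitness_from_makespan([10.0, 5.0, 20.0, 8.0])
parent_a = tournament_select(models, fitness, k=2, rng=rng)
parent_b = roulette_select(models, fitness, rng)
```

- Tournament selection only needs a ranking of fitness values, whereas roulette wheel selection depends on their magnitudes, so the choice of fitness mapping matters more for the latter.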
- the sequencing ordering system 106 evaluates the output of the initial sequencing-task ordering machine-learning model(s) 610 to determine a makespan value that includes a penalty calculation (e.g., penalized makespan) based on a priority multiplier and includes a priority penalty that penalizes long or poorly scheduled tasks.
- the sequencing ordering system 106 can evaluate the loss (or fitness) of a model using:
- the sequencing ordering system 106 determines a set of the parent sequencing-task ordering machine-learning model(s) 620 with a population size of 128.
- the sequencing ordering system 106 combines pairs of the parent sequencing-task ordering machine-learning model(s) 620 to produce candidate sequencing-task ordering machine-learning model(s) 630 using crossover or recombination. To illustrate, the sequencing ordering system 106 selects crossover points, and the genetic information is mixed between two of the parent sequencing-task ordering machine-learning model(s) 620 to create one or more of the candidate sequencing-task ordering machine-learning model(s) 630.
- the candidate sequencing-task ordering machine-learning model(s) 630 “inherit” parameters (e.g., weights and biases) from both of the parent sequencing-task ordering machine-learning model(s) 620 and replace some of the less fit parent sequencing-task ordering machine-learning model(s) 620 from the previous generation.
- the sequencing ordering system 106 applies mutations to the candidate sequencing-task ordering machine-learning model(s) 630.
- the sequencing ordering system 106 can apply a random change (or specific change) in the parameters of the parent sequencing-task ordering machine-learning model(s) 620.
- the candidate sequencing-task ordering machine-learning model(s) 630 are then evaluated for their fitness in the same way as the initial sequencing-task ordering machine-learning model(s) 610 using a fitness function, that is, based on a set of training data to generate predicted ordering scores and a makespan value.
- the sequencing ordering system 106 selects, uniformly at random, a proportion p ∈ [0, 1] to represent how close the candidate sequencing-task ordering machine-learning model(s) 630 is to either of the parent sequencing-task ordering machine-learning model(s) 620 (e.g., 0 and 1 are exactly like one parent, 0.5 is an even blend of both).
- the sequencing ordering system 106 picks weights and biases with probability p to come from one of the parent sequencing-task ordering machine-learning model(s) 620 and probability 1-p from the other of the parent sequencing-task ordering machine-learning model(s) 620.
- Mutations in the candidate sequencing-task ordering machine-learning model(s) 630 occur by randomly perturbing parameters with noise drawn from a normal distribution with a mean of 0 and a standard deviation of 0.003. Further, in certain implementations, the sequencing ordering system 106 utilizes a set of candidate sequencing-task ordering machine-learning model(s) 630 with a population size of 8192.
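- The crossover and mutation operators described above can be sketched on flat parameter lists as follows. Representing a model as a plain list of floats and the function names are assumptions made for illustration; the uniform draw of the proportion, the per-parameter choice between parents, and the 0.003 standard deviation follow the description above.

```python
import random


def blend_crossover(parent_a, parent_b, rng):
    """Child = p * A + (1 - p) * B, with p drawn uniformly from [0, 1]."""
    p = rng.uniform(0.0, 1.0)
    return [p * a + (1.0 - p) * b for a, b in zip(parent_a, parent_b)]


def pick_crossover(parent_a, parent_b, p, rng):
    """Each parameter comes from A with probability p, otherwise from B."""
    return [a if rng.random() < p else b for a, b in zip(parent_a, parent_b)]


def mutate(params, rng, sigma=0.003):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [w + rng.gauss(0.0, sigma) for w in params]


rng = random.Random(1)
parent_a = [0.0, 0.0, 0.0, 0.0]
parent_b = [1.0, 1.0, 1.0, 1.0]
child = mutate(pick_crossover(parent_a, parent_b, 0.5, rng), rng)
```

- A small mutation scale such as 0.003 keeps each child close to its parents, so exploration comes mainly from recombination across a large candidate population.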
- the sequencing ordering system 106 selects a highest performing candidate sequencing-task ordering machine-learning model 640 as the fittest model from the candidate sequencing-task ordering machine-learning model(s) 630.
- the sequencing ordering system 106 can utilize the cycle of selection, crossover, mutation, and evaluation for a predetermined number of generations or until a satisfactory level of fitness is achieved.
- the sequencing ordering system 106 selects the sequencing-task ordering machine-learning model 650 from between a previously configured sequencing-task ordering machine-learning model 642 and the highest performing candidate sequencing-task ordering machine-learning model 640 based on a comparison of the fitness (e.g., makespan scores) of the previously configured sequencing-task ordering machine-learning model 642 and the fitness (e.g., makespan scores) of the highest performing candidate sequencing-task ordering machine-learning model 640.
- the sequencing ordering system 106 utilizes a validation test set (e.g., 25,000 nucleotide-sample slides, 2.5 years of simulated time) to evaluate the fitness and generate the makespan scores.
- if the sequencing ordering system 106 determines the previously configured sequencing-task ordering machine-learning model 642 is more fit based on a fitness evaluation on the validation test set, the sequencing ordering system 106 can maintain the previously configured sequencing-task ordering machine-learning model 642 as the sequencing-task ordering machine-learning model 650. Conversely, if the sequencing ordering system 106 determines the highest performing candidate sequencing-task ordering machine-learning model 640 is more fit based on a fitness evaluation on the validation test set, the sequencing ordering system 106 can assign the highest performing candidate sequencing-task ordering machine-learning model 640 as the sequencing-task ordering machine-learning model 650. In this way, the sequencing ordering system 106 can determine the best performing sequencing-task ordering machine-learning model for the validation test set.
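- The comparison between the previously configured model and the best new candidate can be sketched as a simple champion/challenger check. The `select_model` function, the `validation_makespan` callback, and the tie-breaking rule (ties keep the previously configured model) are hypothetical; the sketch also assumes that a lower validation makespan means a fitter model.

```python
def select_model(incumbent, challenger, validation_makespan):
    """Return whichever model has the shorter makespan on the validation
    set; ties keep the incumbent (an assumption of this sketch)."""
    if validation_makespan(challenger) < validation_makespan(incumbent):
        return challenger
    return incumbent


# Hypothetical validation results, in hours of simulated time.
scores = {"model_642": 1200.0, "model_640": 1150.0}
chosen = select_model("model_642", "model_640", scores.get)
print(chosen)  # model_640
```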
- the sequencing ordering system 106 can determine where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run.
- FIG. 7A illustrates the sequencing ordering system 106 distributing sample-specific base-call-data files to one or more computing devices in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 expedites determining oligonucleotides belonging to respective genomic samples within a nucleotide-sample-slide pool (or other nucleotide-sample-substrate pool) by base calling the indexing sequences for both read pairs before base calling the genomic sequences in library templates for each sample.
- the sequencing ordering system determines which nucleotide reads belong to which genomic samples and a relative balance of genomic samples. Furthermore, based on this determination, the sequencing ordering system can begin generating and transmitting the base-call-data files 712 to the appropriate computing device after each genomic sequencing cycle of the sequencing run.
- the sequencing ordering system 106 determines which oligonucleotide clusters in a nucleotide-sample slide correspond to which genomic sample through demultiplexing. In particular, after determining base calls for the indexing sequences, the sequencing ordering system 106 analyzes the raw sequencing data and uses index sequences (which function similarly to barcodes) to assign each read to its corresponding genomic sample. For example, the sequencing ordering system 106 accesses raw sequencing data comprising indexing sequences for a genomic sample A and indexing sequences for a genomic sample B.
- the indexing sequences comprise nucleobases that act as unique identifiers for each genomic sample, allowing for differentiation and sorting of the reads during demultiplexing.
- the indexing sequences for a genomic sample A indicate that the sample genomic sequence comes from genomic sample A.
- the indexing sequences for a genomic sample B indicate that the sample genomic sequence originates from genomic sample B.
- the sequencing ordering system 106 transmits the base-call-data files 712 that are sample-specific files to various computing devices. As shown, the sequencing ordering system 106 can transmit a first base-call-data file 723 (of the base-call-data files 712) to a first computing device 722 and a second base-call-data file 725 (of the base-call-data files 712) to a second computing device 724. As mentioned, such sample-specific file distribution provides additional security, saves processing time, and reduces storage requirements.
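- As an illustration of demultiplexing by index sequences, the following sketch groups reads into per-sample buckets keyed on their pair of index sequences. The `demultiplex` function, the tuple layout of a read, the example index sequences, and the "undetermined" bucket for unknown index pairs are assumptions for illustration; exact matching is used here, whereas production demultiplexers typically tolerate a limited number of mismatches.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group reads by sample using their (index1, index2) pair.

    reads: iterable of (index1, index2, sequence) tuples.
    index_to_sample: dict mapping (index1, index2) to a sample name.
    Reads with an unknown index pair go to "undetermined".
    """
    per_sample = defaultdict(list)
    for index1, index2, sequence in reads:
        sample = index_to_sample.get((index1, index2), "undetermined")
        per_sample[sample].append(sequence)
    return dict(per_sample)


# Hypothetical index pairs for genomic samples A and B.
sample_sheet = {("ACGT", "TTAG"): "A", ("GGCA", "CATC"): "B"}
reads = [
    ("ACGT", "TTAG", "GATTACA"),
    ("GGCA", "CATC", "CCGGTTA"),
    ("ACGT", "TTAG", "TTGACCA"),
    ("NNNN", "TTAG", "AAAAAAA"),
]
files = demultiplex(reads, sample_sheet)
```

- Each bucket in `files` corresponds to one sample-specific file, which could then be routed to the computing device assigned to that sample.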
- FIG. 7B illustrates the sequencing ordering system 106 performing a demultiplexing operation on a subset of sequencing cycles with indexing cycles performed between genomic sequencing cycles in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 demultiplexes to determine which clusters of oligonucleotides correspond to each genomic sample in the pool of genomic samples to generate sample-specific base-call data files (e.g., the base-call-data files 712).
- the sequencing ordering system 106 transmits the sample-specific base-call-data files (e.g., the base-call-data files 712) to various computing devices.
- the sequencing ordering system 106 can begin transmitting base-call-data files (e.g., the first base-call-data file 723) after performing the act 736 of determining base calls for a second indexing sequence.
- FIG. 7B illustrates the series of acts comprising the act 732 of determining base calls for a first indexing sequence.
- a first index primer 742 is annealed to the primer binding site appended to the sample genomic sequence 740.
- the sequencing ordering system 106 determines base calls for the first indexing sequence 746.
- the first indexing sequence 746 is appended to a sample genomic sequence 740 of a genomic sample.
- After determining base calls for the first indexing sequence 746, the sequencing ordering system 106 performs the act 734 of determining base calls for a first nucleotide read. More specifically, the sequencing ordering system 106 determines base calls for a first nucleotide read corresponding to a first portion of the sample genomic sequence 740. In a paired-end sequencing run, the sample genomic sequence 740 is sequenced from both ends, providing complementary information about the sample genomic sequence 740. As part of performing the act 734, the sequencing ordering system 106 anneals a first nucleotide read primer 748 to a read primer binding site, and the sequencing ordering system 106 sequences the first portion of the sample genomic sequence 740.
- the sequencing ordering system 106 performs the act 736 of determining base calls for a second indexing sequence.
- the sequencing ordering system 106 anneals a second index primer 752 to the primer binding site appended to the sample genomic sequence 740.
- the sequencing ordering system 106 determines base calls for the second indexing sequence 750.
- the second indexing sequence 750 is appended to the 5’ end of the sample genomic sequence 740 while the first indexing sequence 746 is appended to the 7’ end of the sample genomic sequence 740.
- the sequencing ordering system 106 performs the act 758 to demultiplex the clusters of oligonucleotides to determine the clusters that correspond to each genomic sample in the pool of genomic samples.
- the sequencing ordering system 106 can demultiplex the nucleotide reads (e.g., when demultiplexing the nucleotide reads 702) to generate sample-specific base-call-data files (e.g., the first base-call-data file 723) based on the act 734 of determining base calls for a first nucleotide read. Furthermore, the sequencing ordering system 106 can begin to transmit the sample-specific base-call-data files (e.g., the first base-call-data file 723) to a first computing device (e.g., first computing device 722) after completing the act 734.
- the sequencing ordering system 106 after performing the act 736, performs a pair-end turn. Generally, during the pair-end turn, the P7 region is cleaved and all fragments are attached by the P5 region. Prior to the pair-end turn, the P7 region is annealed to the surface of the nucleotide-sample slide. After the pair-end turn, the P5 region is attached to the nucleotide-sample slide. Following the pair-end turn, the sequencing ordering system 106 performs the act 738 of determining base calls for a second nucleotide read. The sequencing ordering system 106 anneals the second nucleotide read primer 754 to a second read primer binding site, and the sequencing ordering system 106 sequences the second portion of the sample genomic sequence 740.
- FIG. 7C illustrates performing an indexing-first approach to demultiplexing nucleotide reads by performing indexing cycles before genomic sequencing cycles.
- the sequencing ordering system 106 demultiplexes the oligonucleotides to determine which clusters of oligonucleotides correspond to each genomic sample in the pool of genomic samples to generate sample-specific base-call data files (e.g., the base-call-data files 712).
- FIG. 7C illustrates the series of acts comprising the act 762 of determining base calls for a first indexing sequence.
- the sequencing ordering system 106 anneals a first index primer 772, determines base calls for the first indexing sequence 776, and appends the first indexing sequence to the 7’ end of a sample genomic sequence 770.
- the sequencing ordering system 106 performs the act 764 of determining base calls for a second indexing sequence.
- the sequencing ordering system 106 anneals a second index primer 778, determines base calls for the second indexing sequence 780, and appends the second indexing sequence 780 to the 5’ end of the sample genomic sequence 770.
- the sequencing ordering system 106 performs the indexing cycles (e.g., the act 762 and the act 764) before performing the genomic sequencing cycles (e.g., act 766 and act 768).
- the sequencing ordering system 106 can perform the act 788 to demultiplex the clusters of oligonucleotides to determine the clusters that correspond to each genomic sample in the pool of genomic samples before performing the sequencing cycles (e.g., the act 766 and the act 768).
- the sequencing ordering system 106 can transmit the sample-specific first base-call-data files to various computing devices during the sequencing run as discussed in relation to FIG. 7A.
- the sequencing ordering system 106 can transmit the sample-specific first base-call-data files (e.g., the base-call-data files 712) for the first sample-specific base-call-data file (e.g., the first base-call-data file 723) and the second sample-specific base-call-data file (e.g., the second base-call-data file 725) of the sample genomic sequence 740 after the act 788 to demultiplex the clusters of oligonucleotides and during the sequencing run as discussed in relation to FIG. 7A.
- the sequencing ordering system 106 performs the act 766 of determining base calls for a first nucleotide read, anneals a first nucleotide read primer 782 to a read primer binding site, and sequences the first portion of the sample genomic sequence 770. In some embodiments, after performing the act 766, the sequencing ordering system 106 performs a pair-end turn.
- the sequencing ordering system 106 performs the act 768 of determining base calls for a second nucleotide read, anneals the second nucleotide read primer 784 to a second read primer binding site, and sequences the second portion of the sample genomic sequence 770.
- the use of an indexing-first approach expedites distributing the sample-specific base-call-data files by allowing the sequencing ordering system 106 to begin transmitting the sample-specific base-call-data files after performing act 764 and determining base calls for a second indexing sequence and before performing the genomic sequencing cycles.
- the sequencing ordering system 106 can transmit the sample-specific base-call-data files (e.g., the base-call-data files 712) for the first portion (e.g., the first base-call-data file 723) and the second portion (e.g., the second base-call-data file 725) of the sample genomic sequence 740 during each cycle of the sequencing run as discussed in relation to FIG. 7A.
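- To make the timing difference between the two cycle orders concrete, the following sketch counts how many sequencing cycles must elapse before both indexing sequences are known and demultiplexing can start. The segment labels, the `first_demux_cycle` function, and the cycle counts (8 cycles per index, 150 cycles per genomic read) are hypothetical values chosen only for illustration.

```python
def first_demux_cycle(segments):
    """Return the number of cycles after which both index reads are done.

    segments: ordered list of (kind, n_cycles) pairs, where kind is one of
    "index1", "index2", "read1", or "read2".
    """
    done = set()
    elapsed = 0
    for kind, n_cycles in segments:
        elapsed += n_cycles
        if kind.startswith("index"):
            done.add(kind)
        if {"index1", "index2"} <= done:
            return elapsed
    raise ValueError("run does not contain both index reads")


# Order of FIG. 7B: indexing cycles interleaved with genomic cycles.
interleaved = [("index1", 8), ("read1", 150), ("index2", 8), ("read2", 150)]
# Order of FIG. 7C: both indexing cycles first.
indexing_first = [("index1", 8), ("index2", 8), ("read1", 150), ("read2", 150)]

print(first_demux_cycle(interleaved))     # 166
print(first_demux_cycle(indexing_first))  # 16
```

- Under these assumed cycle counts, the indexing-first order allows sample assignment after 16 cycles instead of 166, so sample-specific files can be streamed during nearly the whole run.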
- the sequencing ordering system 106 utilizes a nucleotide-sample-slide ordering machine-learning model to determine slide ordering scores for scheduling tasks associated with a nucleotide-sample slide.
- FIG. 8 illustrates a schematic diagram of utilizing the nucleotide-sample-slide ordering machine-learning model 806 to determine ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 identifies or receives data for sequencing tasks associated with nucleotide-sample slide(s) 802a, 802b, through 802n.
- the sequencing ordering system 106 can identify or receive data for sequencing tasks for nucleotide-sample slide(s) 802a-802n comprising genomic samples for four different nucleobase types (e.g., A, T, C, G) associated with sample library fragments. As further shown, the sequencing ordering system 106 determines nucleotide-sample-slide features 804a, 804b, through 804n associated with the nucleotide-sample slide(s) 802a-802n.
- As further shown, the sequencing ordering system 106 utilizes a nucleotide-sample-slide ordering machine-learning model 806 to generate slide ordering scores 808.
- By utilizing the nucleotide-sample-slide features 804a-804n and accounting for available computing resources (e.g., using model parameters), the nucleotide-sample-slide ordering machine-learning model 806 generates slide ordering scores 808 indicating a relative order of the nucleotide-sample slide(s) 802a-802n. In one or more embodiments, the nucleotide-sample-slide ordering machine-learning model 806 is implemented as further described in relation to FIG. 10 to provide the slide ordering scores 808.
- the nucleotide-sample-slide ordering machine-learning model 806 generates slide ordering scores 808 that represent values for a slide order which maximizes the efficiency of the sequencing tasks for the nucleotide-sample slide(s) 802a-802n.
- the nucleotide-sample-slide ordering machine-learning model 806 generates slide ordering scores 808 that can be used to order the nucleotide-sample-slide(s) 802a-802n and provide a more efficient utilization of resources, reduced turnaround times for processing nucleotide-sample slides, and an overall increase in the throughput of the genomic sequencing process.
- the sequencing ordering system 106 can strategically provide an order for the nucleotide-sample slide(s) 802a-802n (e.g., particularly in high-volume environments) which can provide significant improvements in productivity and efficiency.
- the nucleotide-sample-slide ordering machine-learning model 806 utilizes the slide ordering scores 808 to provide a ranking for ordered slides 810 indicating a relative order for the nucleotide-sample slide(s) 802a-802n to perform primary sequencing tasks and/or secondary sequencing tasks (e.g., the relative order for primary/secondary sequencing tasks 320).
- the ordered slides 810 are arranged in a sequence that reflects their assessed priority from the slide ordering scores 808, with the highest score of the nucleotide-sample slide(s) 802a-802n scheduled first.
- the sequencing ordering system 106 further causes nucleotide-sample slide(s) 802a-802n to be scheduled according to the slide ordering scores 808 on the computing device(s) 812 (e.g., one or more of the sequencing device 108, the server device(s) 102, the server device(s) 110, and the client device(s) 114).
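- Turning slide ordering scores into a schedule amounts to a descending sort, as in the following sketch. The `order_slides` function and the example scores are hypothetical; keeping the input order for tied scores is a design choice of this sketch, not something the text specifies.

```python
def order_slides(slide_ids, scores):
    """Return slide ids sorted by descending score; ties keep input order."""
    ranked = sorted(zip(slide_ids, scores), key=lambda pair: -pair[1])
    return [slide_id for slide_id, _ in ranked]


print(order_slides(["802a", "802b", "802n"], [0.2, 0.9, 0.5]))
# ['802b', '802n', '802a']
```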
- the sequencing ordering system 106 provides nucleotide-sample slide features to the nucleotide-sample-slide ordering machine-learning model.
- FIG. 9 illustrates providing nucleotide-sample slide features to the nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 receives or identifies nucleotide-sample-slide features 904 associated with nucleotide-sample-slide(s) 902. Similar to the discussion in relation to FIG. 4, the nucleotide-sample-slide features 904 can include a metric, a setting, a boundary, an environment variable, or a feature vector representing the performance time, processor usage (e.g., CPU and FPGA), memory usage, and/or other resource requirements for the sequencing tasks associated with the nucleotide-sample-slide(s) 902. In particular, the sequencing ordering system 106 can receive or identify the nucleotide-sample-slide features 904 including a processor usage feature 906, a memory requirements feature 908, a performance time feature 910, and a priority feature 912.
- the sequencing ordering system 106 can access or identify the processor usage feature 906 associated with the nucleotide-sample-slide(s) 902.
- the processor usage feature 906 can include data for quantifying the number of FPGAs/CPUs/GPUs and the amount of available RAM.
- the processor usage feature 906 can include data quantifying the computational load on processors and can be operationalized as the percentage of processor time required or as the intensity of the computations needed.
- the processor usage feature 906 includes data quantifying the values for required processing power (or computational infrastructure) and the capacity of the sequencing ordering system 106 to process primary sequencing tasks like nucleotide identification, and/or secondary tasks such as sequence assembly and annotation for the nucleotide-sample-slide(s) 902.
- the processor usage feature 906 can include data quantifying the processor usage requirements based on the processor requirements associated with the sample sequencing depth, sample complexity, slide size, number of multiplexed samples, computational algorithm efficiency, system data throughput, and/or the system architecture.
- the processor usage feature 906 includes data representing CPU usage within the ranges of 5 - 10 cores per task, and FPGA usage of 1 - 3 FPGA subdivisions per sequencing task.
- the sequencing ordering system 106 can additionally or alternatively identify nucleotide-sample-slide features 904 which include the memory requirements feature 908 that quantifies the memory requirements, including the amount of RAM needed to perform the sequencing tasks for the nucleotide-sample-slide(s) 902.
- the memory requirements feature 908 indicates memory required for sequencing tasks, such as storing raw sequencing data during primary sequencing or processing large amounts of genomic data during secondary analyses.
- the memory requirements feature 908 can account for the large datasets (e.g., gigabytes of data per run) involved in primary sequencing tasks and secondary sequencing tasks.
- the memory requirements feature 908 can include data quantifying the computing (e.g., processor, time, memory) resources required based on the data volume, data complexity, parallel processing needs, temporary storage needs, and/or final storage needs.
- the sequencing ordering system 106 can identify additionally or alternatively the performance time feature 910 that can include data for quantifying the time requirements to perform the sequencing tasks for the nucleotide-sample-slide(s) 902.
- the performance time feature 910 includes data reflecting the throughput rate of the sequencer.
- the performance time feature 910 represents the duration of computational analyses such as comparative genomics.
- the nucleotide-sample-slide features 904 include the priority feature 912 that can include data for quantifying the priority of the nucleotide-sample-slide(s) 902.
- the priority feature 912 can include a priority value for scheduling the nucleotide-sample-slide(s) 902.
- the priority feature 912 can include a value indicating a relative priority value for scheduling the nucleotide-sample-slide(s) 902 in comparison to other of the nucleotide-sample-slide(s) 902.
- the priority feature 912 indicates an assessment of the sample urgency for sequencing the nucleotide-sample-slide(s) 902 based on time-sensitive analyses, sequencing project deadlines, customer requirements, and/or quality checks.
- the sequencing ordering system 106 provides nucleotide-sample-slide features 904 to the nucleotide-sample-slide ordering machine-learning model 916.
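- One way to package the features described above before handing them to the model is a small record type that flattens into a numeric vector. The class name, field names, units, and example values below are hypothetical; only the kinds of features (performance time, CPU usage, memory, FPGA usage, priority) come from the description.

```python
from dataclasses import dataclass


@dataclass
class SlideFeatures:
    performance_time_h: float  # expected duration of the slide's tasks (hours)
    cpu_cores: int             # CPU cores required per task
    memory_gb: float           # RAM required
    fpga_subdivisions: int     # FPGA subdivisions required per task
    priority: float            # relative scheduling priority

    def to_vector(self):
        """Flatten to the numeric vector passed to the model's input layer."""
        return [
            float(self.performance_time_h),
            float(self.cpu_cores),
            float(self.memory_gb),
            float(self.fpga_subdivisions),
            float(self.priority),
        ]


features = SlideFeatures(4.0, 8, 64.0, 2, 1.0)
print(features.to_vector())  # [4.0, 8.0, 64.0, 2.0, 1.0]
```

- In practice the features would likely be rescaled to comparable ranges before being fed to a neural network; that step is omitted here.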
- the sequencing ordering system 106 can access or identify the nucleotide-sample-slide features 904 of the performance time feature 910, the processor usage feature 906 (CPU), the memory requirements feature 908, the processor usage feature 906 (FPGA), and the priority feature 912 as indicated in the following table:
- the nucleotide-sample-slide ordering machine-learning model can be implemented utilizing a neural network.
- FIG. 10 illustrates an example architecture for a nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
- the nucleotide-sample-slide ordering machine-learning model 1010 can be implemented as a neural network with four hidden layers that is fully connected and equipped with activation functions (e.g., a Multilayer Perceptron).
- the nucleotide-sample-slide ordering machine-learning model 1010 can be configured with model parameter(s) 1030 that include adjustable weights and biases.
- the sequencing ordering system 106 can utilize the nucleotide-sample-slide ordering machine-learning model 1010 with more or fewer hidden layers and neurons than shown in FIG. 10.
- the nucleotide-sample-slide ordering machine-learning model 1010 includes a first hidden layer 1014, a second hidden layer 1016, a third hidden layer 1018, and a fourth hidden layer 1020.
- the input data neurons 1012 of the nucleotide-sample-slide ordering machine-learning model 1010 process the input data, which represent the nucleotide-sample slide (e.g., nucleotide-sample slide(s) 802a-802n) and nucleotide-sample-slide features (e.g., nucleotide-sample-slide features 804a-804n), and pass the input data to the first hidden layer 1014.
- the nucleotide-sample-slide ordering machine-learning model 1010 transmits a vector or data signal from each of the input data neurons 1012 to each of the neurons in the first hidden layer 1014, multiplied by a corresponding weight (e.g., from model parameter(s) 1030). These products are summed, resulting in a weighted sum for each hidden neuron of the first hidden layer 1014.
- a bias term e.g., from model parameter(s) 1030
- the result of the weighted sum plus the bias is passed through the activation function (e.g., ReLU, Sigmoid, Tanh) for each neuron of the first hidden layer 1014.
- the nucleotide-sample-slide ordering machine-learning model 1010 sends the activated value of each of the neurons in the first hidden layer 1014 to each neuron in the second hidden layer 1016.
- the nucleotide-sample-slide ordering machine-learning model 1010 calculates a weighted sum of inputs for each neuron from the previous layer, adds a bias, and then applies an activation function 1017.
- the nucleotide-sample-slide ordering machine-learning model 1010 repeats this process for the third hidden layer 1018 with a corresponding activation function 1019 and a fourth hidden layer 1020 and a corresponding activation function 1021.
- the nucleotide-sample-slide ordering machine-learning model 1010 has the capacity to learn even more complex patterns by combining the features extracted by each of the hidden layers. As shown, the nucleotide-sample-slide ordering machine-learning model 1010 applies a final activation function 1023 to obtain the slide ordering scores 1022. As mentioned, the nucleotide-sample-slide ordering machine-learning model 1010 provides the slide ordering scores 1022 based on the model parameter(s) 1030.
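- The forward pass described above (weighted sum, bias, activation, repeated over four hidden layers and a final activation) can be sketched in plain Python as follows. The hidden-layer width of 8, the use of ReLU in hidden layers and a sigmoid at the output, the random initialization range, and all function names are assumptions; the text names ReLU, Sigmoid, and Tanh only as example activation functions.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Numerically stable form: never exponentiates a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one row per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_params(layer_sizes, rng):
    """Random weights and zero biases for each pair of consecutive layers."""
    params = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def slide_score(features, params):
    """Forward pass: ReLU on hidden layers, sigmoid on the output neuron."""
    x = features
    for weights, biases in params[:-1]:
        x = dense(x, weights, biases, relu)
    weights, biases = params[-1]
    return dense(x, weights, biases, sigmoid)[0]


# 5 input features, four hidden layers of 8 neurons, 1 output score.
params = init_params([5, 8, 8, 8, 8, 1], random.Random(0))
score = slide_score([4.0, 8.0, 64.0, 2.0, 1.0], params)
```

- Because the parameters are stored as plain lists, they can be flattened into a single vector, which is convenient when weights and biases are optimized without gradients.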
- the model parameter(s) 1030, which include the weights and biases across the layers of the nucleotide-sample-slide ordering machine-learning model 1010, are optimized using a genetic algorithm.
- the sequencing ordering system 106 selects the nucleotide-sample-slide ordering machine-learning model 1150 utilizing a genetic algorithm. As shown in FIG. 11A, the sequencing ordering system 106 determines a set of initial nucleotide-sample-slide ordering machine-learning model(s) 1110 and randomly initializes each of the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 with different model parameters (e.g., weights and biases).
- the sequencing ordering system 106 utilizes a set of initial nucleotide-sample-slide ordering machine-learning model(s) 1110 with a population size of 8192. As described in more detail in relation to FIGS. 6A-6B, the sequencing ordering system 106 can determine makespan scores to evaluate the fitness of the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 based on the scheduled sequencing tasks.
- the sequencing ordering system 106 determines a set of the parent nucleotide-sample-slide ordering machine-learning model(s) 1120 with a population size of 128.
- As further shown, pairs of the parent nucleotide-sample-slide ordering machine-learning model(s) 1120 are combined to produce candidate sequencing-task ordering machine-learning model(s) 1130 using crossover or recombination. As disclosed in more detail in relation to FIG. 5, the sequencing ordering system 106 selects crossover points, and the genetic information is mixed between two parent nucleotide-sample-slide ordering machine-learning model(s) 1120 to create one or more candidate sequencing-task ordering machine-learning model(s) 1130.
- the candidate nucleotide-sample-slide ordering machine-learning model(s) 1130 are evaluated for their fitness in the same way as the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 using a fitness function based on a set of training data to generate predicted ordering scores and a makespan value.
- the sequencing ordering system 106 utilizes a set of candidate nucleotide-sample-slide ordering machine-learning model(s) 1130 with a population size of 8192.
- the sequencing ordering system 106 selects a highest performing candidate nucleotide-sample-slide ordering machine-learning model 1140 as the fittest model from the candidate nucleotide-sample-slide ordering machine-learning model(s) 1130.
- the sequencing ordering system 106 can utilize the cycle of selection, crossover, mutation, and evaluation for a predetermined number of generations or until a satisfactory level of fitness is achieved for the candidate nucleotide-sample-slide ordering machine-learning model(s) 1130.
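The cycle of selection, crossover, mutation, and evaluation can be sketched as follows. This is an illustrative sketch only: each model is reduced to a flat list of parameters, single-point crossover and Gaussian mutation are assumed operators, and `evolve`, `fitness`, and `mutation_scale` are hypothetical names. The default population sizes are deliberately small; the values 8192 and 128 described above could be passed instead.

```python
import random


def evolve(fitness, n_params, pop_size=64, n_parents=8, generations=20,
           mutation_scale=0.1, seed=0):
    """Return the best parameter vector found; lower fitness (e.g., makespan) is better."""
    rng = random.Random(seed)
    # Randomly initialize every model in the initial population.
    population = [[rng.uniform(-1, 1) for _ in range(n_params)]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)
    for _ in range(generations):
        # Selection: keep the fittest models as parents.
        parents = sorted(population, key=fitness)[:n_parents]
        population = []
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: mix parameters from two parents at a random cut point.
            cut = rng.randrange(1, n_params) if n_params > 1 else 0
            child = a[:cut] + b[cut:]
            # Mutation: perturb each parameter slightly.
            child = [w + rng.gauss(0, mutation_scale) for w in child]
            population.append(child)
        # Evaluation: track the fittest model seen so far.
        champion = min(population, key=fitness)
        if fitness(champion) < fitness(best):
            best = champion
    return best


# Toy fitness standing in for a makespan-based evaluation.
target = [0.5, -0.25, 0.75]
toy_fitness = lambda w: sum((x - t) ** 2 for x, t in zip(w, target))
print(evolve(toy_fitness, n_params=3))
```

A fixed number of generations is used here; stopping once a satisfactory fitness is reached would be an equally valid termination rule.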
- the sequencing ordering system 106 selects the nucleotide-sample-slide ordering machine-learning model 1150 from between a previously configured nucleotide-sample-slide ordering machine-learning model 1142 and the highest performing candidate nucleotide-sample-slide ordering machine-learning model 1140 based on a validation test set (e.g., 25,000 nucleotide-sample slide, 2.5 years of simulated time).
- the sequencing ordering system 106 can maintain a best model that is a previously configured nucleotide-sample-slide ordering machine-learning model 1142 or select the highest performing candidate nucleotide-sample-slide ordering machine-learning model 1140. In this way, the sequencing ordering system 106 can determine the best performing nucleotide-sample-slide ordering machine-learning model for the specific validation test set.
- the sequencing ordering system 106 can utilize a two-tier sequencing ordering system that integrates an embodiment of the nucleotide-sample-slide ordering machine-learning model and an embodiment of the sequencing-task ordering machine-learning model to order nucleotide-sample slides and sequencing tasks more efficiently.
- FIG. 12 illustrates a schematic diagram of utilizing a combination of the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to order sequencing tasks in accordance with one or more embodiments of the present disclosure.
- the sequencing ordering system 106 can utilize a nucleotide-sample-slide ordering machine-learning model 1206 to access or identify a set of the nucleotide-sample-slide features 1204 for a nucleotide-sample-slide(s) 1202 and generate slide ordering scores 1210 indicating a nucleotide-sample-slide relative order 1208 of the set of nucleotide-sample slides based on the set of the nucleotide-sample-slide features 1204.
- the sequencing ordering system 106 can generate slide ordering scores 1210 indicating the values for a slide order that maximizes the efficiency of the sequencing tasks for the nucleotide-sample-slide(s) 1202. As further shown, the sequencing ordering system 106 can select a nucleotide-sample slide from the nucleotide-sample-slide(s) 1202 based on the relative order for the set of nucleotide-sample slides as provided by the nucleotide-sample-slide relative order 1208 and the slide ordering scores 1210.
- the sequencing ordering system 106 can access or identify a set of the sequencing task features 1214 for the nucleotide-sample-slide(s) 1202 and provide the set of the sequencing task features 1214 to a sequencing-task ordering machine-learning model 1216 for ordering the set of sequencing tasks 1212.
- the sequencing-task ordering machine-learning model 1216 can generate task ordering scores 1220 indicating a sequencing task relative order 1218 for the set of sequencing tasks based on the sequencing task features 1214 for the set of sequencing tasks 1212 and perform the set of sequencing tasks 1212 according to the task ordering scores 1220.
- the sequencing ordering system 106 can incorporate a first tier that utilizes the nucleotide-sample-slide ordering machine-learning model 1206 to determine slide ordering scores 1210.
- the sequencing ordering system 106 can access or identify the nucleotide-sample-slide(s) 1202.
- the sequencing ordering system 106 can identify nucleotide-sample-slide features 1204 associated with the nucleotide-sample-slide(s) 1202.
- the sequencing ordering system 106 utilizes the nucleotide-sample-slide ordering machine-learning model 1206 to generate slide ordering scores 1210 and determine a nucleotide-sample-slide relative order 1208.
- the sequencing ordering system 106 can incorporate a second tier that utilizes the sequencing-task ordering machine-learning model 1216.
- the sequencing ordering system 106 provides the nucleotide-sample-slide relative order 1208 (e.g., the slide ordering scores 1210) for the nucleotide-sample-slide(s) 1202 from the nucleotide-sample-slide ordering machine-learning model 1206. Further, the sequencing ordering system 106 identifies the set of sequencing tasks 1212 and the sequencing task features 1214 for each of the nucleotide-sample-slide(s) 1202.
- the sequencing ordering system 106 utilizes the sequencing-task ordering machine-learning model 1216 to generate task ordering scores 1220 and determine a sequencing task relative order 1218 for each set of sequencing tasks 1212. As shown in FIG. 12, the sequencing ordering system 106 can iteratively utilize the sequencing-task ordering machine-learning model 1216 to generate task ordering scores 1220 and determine a sequencing task relative order for each set of sequencing tasks 1212 and sequencing task features 1214 for each of the nucleotide-sample-slide(s) 1202 based on the slide ordering scores 1210 provided by the nucleotide-sample-slide ordering machine-learning model 1206.
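The two-tier flow can be sketched as a slide-level ranking followed by a task-level ranking within each slide. In this hedged sketch the two models are stand-in callables that map a feature list to a score, higher scores are assumed to run first, and the dictionary layout and the name `two_tier_order` are hypothetical.

```python
def two_tier_order(slides, slide_model, task_model):
    """Order slides by slide score, then order each slide's tasks by task score.

    slides: {slide_id: {"features": [...], "tasks": {task_id: [...]}}}
    Returns a flat list of (slide_id, task_id) pairs in execution order.
    """
    # Tier 1: score every slide and rank slides, highest score first.
    slide_scores = {sid: slide_model(s["features"]) for sid, s in slides.items()}
    slide_order = sorted(slides, key=lambda sid: -slide_scores[sid])

    schedule = []
    for sid in slide_order:
        # Tier 2: score and rank the sequencing tasks of the selected slide.
        tasks = slides[sid]["tasks"]
        task_scores = {tid: task_model(f) for tid, f in tasks.items()}
        for tid in sorted(tasks, key=lambda t: -task_scores[t]):
            schedule.append((sid, tid))
    return schedule


# Stand-in model: prefer the smallest first feature (e.g., shortest performance time).
shortest_first = lambda features: -features[0]
slides = {
    "slide_1": {"features": [5.0], "tasks": {"align": [3.0], "basecall": [1.0]}},
    "slide_2": {"features": [2.0], "tasks": {"variant_call": [4.0]}},
}
print(two_tier_order(slides, shortest_first, shortest_first))
```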
- FIGS. 13A-13B illustrate graphs of the distribution of a penalized makespan utilizing different and existing ordering strategies in comparison with the sequencing ordering system 106.
- FIG. 13A illustrates the initial makespan distribution values, and FIG. 13B illustrates the makespan values for the tail of the makespan distribution after four hours.
- FIGS. 13A-13B represent the distribution of the penalized makespan over 200,000 nucleotide-sample slides utilizing four different ordering strategies.
- a sequencing device would need around 20 years to complete sequencing runs for 200,000 nucleotide-sample slides.
- the sequencing ordering system 106 utilizing only the sequencing-task ordering machine-learning model performs better than the FIFO Method, showing an improvement of 15-25% in median makespan scores and 5-15% in average makespan scores.
- the scheduling strategy utilizing only the nucleotide-sample-slide ordering machine-learning model performs better than the FIFO Method, showing an improvement of 5-15% in median makespan and average makespan scores.
- FIGS. 13A-13B depict results for the sequencing ordering system 106 using the nucleotide-sample-slide ordering machine-learning model depicted in FIG. 5 and the sequencing-task ordering machine-learning model depicted in FIG. 10.
- FIG. 14 illustrates a graph of the makespan compared against an average task load for the sequencing ordering system 106 in comparison to different and existing ordering strategies.
- FIG. 14 illustrates the intensity of a test case as a percentage of max load (sum of weight • time across tasks in units of normalized resources • hours).
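Working only from the parenthetical description above (a sum of weight times time across tasks), the intensity calculation might look like the following sketch. The tuple layout, the function names, and the way `max_load` is supplied are assumptions for illustration.

```python
def task_load(tasks):
    """Sum weight * time over tasks, in normalized-resource hours.

    tasks: iterable of (weight, hours) pairs (assumed layout).
    """
    return sum(weight * hours for weight, hours in tasks)


def intensity_percent(tasks, max_load):
    """Express the task load as a percentage of an assumed maximum load."""
    return 100.0 * task_load(tasks) / max_load


example = [(0.5, 2.0), (0.25, 4.0)]
print(task_load(example))                        # 2.0 normalized-resource hours
print(intensity_percent(example, max_load=8.0))  # 25.0 percent of the assumed maximum
```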
- the load can be defined as:
- the graph shows the 5th percentile (reference number 1402), the 95th percentile (reference number 1404), and quantile 1 (Q1) (reference number 1406) of makespan per nucleotide-sample slide against average task load (intensity) of the test set of 200,000 nucleotide-sample slides (20 years of simulated time).
- the sequencing ordering system 106 utilizing the two-tier sequencing ordering system of both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model provides the lowest makespan values.
- FIG. 14 depicts results for the sequencing ordering system 106 using the nucleotide-sample-slide ordering machine-learning model depicted in FIG. 5 and the two-tier sequencing ordering system depicted in FIG. 12.
- the two-tier sequencing ordering system of both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model also shows a noticeable improvement over both a Tetris Heuristic Model and a FIFO Method, and that improvement is even greater for higher intensity test cases.
- FIG. 15 illustrates a graph showing a comparison of the performance of the sequencing-task ordering machine-learning model when trained utilizing the genetic algorithm (as described above in relation to FIG. 12) when compared to a Tetris heuristic training model in accordance with one or more embodiments of the present disclosure.
- the sequencing-task ordering machine-learning model converges much quicker than the traditional Tetris heuristic training approach (at reference number 1504).
- the average performance 1502 of the sequencing ordering system 106 shows a 1%-2% improvement over the Tetris heuristics when trained on less than 5 days of data and less than 10 iterations deep.
- the sequencing ordering system 106 continues to outperform the Tetris heuristics baseline over further training iterations (shown by the average performance 1502 at 150 iterations deep).
- FIGS. 1-15, the corresponding text, and the examples provide a number of different methods, systems, devices, and non-transitory computer-readable media of the sequencing ordering system 106.
- one or more implementations can also be described in terms of flowcharts comprising acts for accomplishing a particular result, as shown in FIGS. 16-18.
- FIG. 16 illustrates a flowchart of a series of acts 1600 for generating task ordering scores and performing a set of sequencing tasks in accordance with one or more embodiments of the present disclosure.
- FIG. 17 illustrates a flowchart of a series of acts for transmitting genomic samples to computing devices in accordance with one or more embodiments of the present disclosure.
- While FIGS. 16-18 illustrate acts according to one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIGS. 16-18.
- the acts of FIGS. 16-18 can be performed as part of a method.
- a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device or a system to perform the acts depicted in FIGS. 16-18.
- a system comprising an imaging system, a fluidic system, and a computer comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by one or more processors, cause the system to perform the acts of FIGS. 16-18.
- the series of acts 1600 includes an act 1602 of determining a set of sequencing task features for a set of sequencing tasks, an act 1604 of providing the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks, an act 1606 of generating task ordering scores indicating a relative order of the set of sequencing tasks, and an act 1608 of performing the set of sequencing tasks according to the task ordering scores.
- the series of acts 1600 can include acts to perform any of the operations described in the following clauses:
- a computer-implemented method comprising: determining, for a set of sequencing tasks, a set of sequencing task features indicating at least available computing resources and a performance time associated with respective sequencing tasks of the set of sequencing tasks; providing the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks; generating, utilizing the sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on the set of sequencing task features; and performing the set of sequencing tasks according to the task ordering scores.
- the set of sequencing tasks comprises a set of primary sequencing tasks associated with base calling for nucleotide reads of a genomic sample; or the set of sequencing tasks comprises a set of secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
- CLAUSE 3 The computer-implemented method of clause 2, wherein the set of primary sequencing tasks includes one or more of generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of genomic samples, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads.
- the set of secondary sequencing tasks includes one or more of genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
- CLAUSE 6 The computer-implemented method of clause 1, further comprising training the sequencing-task ordering machine-learning model by: identifying a set of parent sequencing-task ordering machine-learning models; generating, from the set of parent sequencing-task ordering machine-learning models, a set of candidate sequencing-task ordering machine-learning models comprising different weights and biases; generating predicted ordering scores from each candidate sequencing-task ordering machine-learning model of the set of candidate sequencing-task ordering machine-learning models; determining makespan scores for each candidate sequencing-task ordering machine-learning model of the set of candidate sequencing-task ordering machine-learning models based on the predicted ordering scores; and selecting a highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model based on comparing the makespan scores for each candidate model using a loss function.
- CLAUSE 7 The computer-implemented method of clause 6, further comprising: comparing the makespan scores of the highest performing candidate sequencing-task ordering machine-learning model with a previously configured sequencing-task ordering machine-learning model; and selecting the highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model instead of the previously configured sequencing-task ordering machine-learning model based on a makespan score for the highest performing candidate sequencing-task ordering machine-learning model.
- CLAUSE 8 The computer-implemented method of clause 1, further comprising: determining a set of nucleotide-sample-slide features for a set of nucleotide-sample slides indicating at least the available computing resources and a performance time associated with processing data from respective nucleotide-sample slides of the set of nucleotide-sample slides; generating, utilizing a nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample slides based on the set of nucleotide-sample-slide features; selecting a nucleotide-sample slide from the set of nucleotide-sample slides based on the relative order of the set of nucleotide-sample slides; and performing the set of sequencing tasks for the selected nucleotide-sample slide based on the task ordering scores and the slide ordering scores.
- CLAUSE 10 The computer-implemented method of clause 1, further comprising performing the set of sequencing tasks in part by: determining, for a sequencing run, base calls for a set of indexing sequences within clusters of oligonucleotides on a nucleotide-sample slide; determining, during the sequencing run, a first subset of indexing sequences corresponding to a first genomic sample designated with a first set of processing parameters; determining, during the sequencing run, a second subset of indexing sequences corresponding to a second genomic sample designated with a second set of processing parameters; and transmitting, for the first genomic sample, a first base-call-data file to a first computing device based on the first set of processing parameters and, for the second genomic sample, a second base-call-data file to a second computing device based on the second set of processing parameters.
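The routing described in clause 10 can be illustrated with a small sketch that groups base-call records by indexing sequence and then looks up a destination from each sample's processing parameters. The record layout, the `destination` key, the device names, and the function name `route_base_calls` are hypothetical; the clause does not specify how processing parameters map to a computing device.

```python
from collections import defaultdict


def route_base_calls(reads, index_to_sample, processing_params):
    """Group reads per genomic sample and pair each group with a destination.

    reads: iterable of (indexing_sequence, read) pairs.
    index_to_sample: {indexing_sequence: sample_id}.
    processing_params: {sample_id: {"destination": device_name, ...}} (assumed layout).
    """
    per_sample = defaultdict(list)
    for index_seq, read in reads:
        sample = index_to_sample.get(index_seq)
        if sample is None:
            continue  # indexing sequence not assigned to any sample; skipped in this sketch
        per_sample[sample].append(read)
    # One base-call-data "file" (here, a list of reads) per sample and destination.
    return {
        sample: (processing_params[sample]["destination"], records)
        for sample, records in per_sample.items()
    }


reads = [("AAGG", "ACGT"), ("CCTT", "TTGA"), ("AAGG", "GGCA"), ("NNNN", "CCCC")]
index_to_sample = {"AAGG": "sample_1", "CCTT": "sample_2"}
params = {
    "sample_1": {"destination": "device_a"},
    "sample_2": {"destination": "device_b"},
}
print(route_base_calls(reads, index_to_sample, params))
```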
- the first set of processing parameters specify one or more of a secondary sequencing task for the first genomic sample, analysis rights for the first genomic sample, a category of analysis for the first genomic sample, or a sample size for the first genomic sample; and the second set of processing parameters specify one or more of a secondary sequencing task for the second genomic sample, analysis rights for the second genomic sample, a category of analysis for the second genomic sample, or a sample size for the second genomic sample.
- the series of acts 1800 includes an act 1802 of determining a set of nucleotide-sample-slide features for a set of nucleotide-sample slides, an act 1804 of providing the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample slides, an act 1806 of generating slide ordering scores indicating a relative order of the set of nucleotide-sample slides, and an act 1808 of performing sequencing tasks for the set of nucleotide-sample slides according to the slide ordering scores.
- the series of acts 1800 can include acts to perform any of the operations described in the following clauses:
- a computer-implemented method comprising: determining, for a set of nucleotide-sample slides, a set of nucleotide-sample-slide features indicating at least available computing resources and a performance time associated with processing data for respective nucleotide-sample slides of the set of nucleotide-sample slides; providing the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample slides; generating, utilizing the nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample slides based on the set of nucleotide-sample-slide features; and performing sequencing tasks for the set of nucleotide-sample slides according to the slide ordering scores.
- CLAUSE 14 The computer-implemented method of clause 13, wherein the set of nucleotide-sample-slide features comprise a set of priority features indicating a relative priority of the respective nucleotide-sample slides.
- CLAUSE 15 The computer-implemented method of clause 13, wherein the set of nucleotide-sample-slide features comprises one or more of processor usage for processing data associated with a nucleotide-sample slide of the set of nucleotide-sample slides, memory requirements for processing data associated with the nucleotide-sample slide, or performance time associated with processing data for the nucleotide-sample slide.
- the performance time associated with processing data from the respective nucleotide-sample slides of the set of nucleotide-sample slides comprises the performance time associated with a set of primary sequencing tasks associated with base calling for nucleotide reads of a genomic sample; or the performance time associated with processing data from the respective nucleotide-sample slides of the set of nucleotide-sample slides comprises the performance time associated with a set of secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
- the set of primary sequencing tasks includes one or more of generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of the genomic sample, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads.
- CLAUSE 18 The computer-implemented method of clause 16, wherein the set of secondary sequencing tasks includes one or more of genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
- CLAUSE 19 The computer-implemented method of clause 13, further comprising training the nucleotide-sample-slide ordering machine-learning model by: identifying a set of parent nucleotide-sample-slide ordering machine-learning models; generating, from the set of parent nucleotide-sample-slide ordering machine-learning models, a set of candidate nucleotide-sample-slide ordering machine-learning models comprising different weights and biases; generating predicted ordering scores from each candidate nucleotide-sample-slide ordering machine-learning model of the set of candidate nucleotide-sample-slide ordering machine-learning models; determining makespan scores for each candidate nucleotide-sample-slide ordering machine-learning model of the set of candidate nucleotide-sample-slide ordering machine-learning models based on the predicted ordering scores; and selecting a highest performing candidate nucleotide-sample-slide ordering machine
- CLAUSE 20 The computer-implemented method of clause 19, further comprising: comparing the makespan scores of the highest performing candidate nucleotide-sample-slide ordering machine-learning model with a previously configured nucleotide-sample-slide ordering machine-learning model; and selecting the highest performing candidate nucleotide-sample-slide ordering machine-learning model as the nucleotide-sample-slide ordering machine-learning model instead of the previously configured nucleotide-sample-slide ordering machine-learning model based on a makespan score for the highest performing candidate nucleotide-sample-slide ordering machine-learning model.
- CLAUSE 21 The computer-implemented method of clause 13, further comprising: selecting a set of sequencing tasks associated with a nucleotide-sample slide from the set of nucleotide-sample slides; determining a set of sequencing task features for the set of sequencing tasks indicating at least the available computing resources and a performance time associated with respective sequencing tasks of the set of sequencing tasks; generating, utilizing a sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on the set of sequencing task features; and performing the set of sequencing tasks for the nucleotide-sample slide according to the task ordering scores.
- The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another, are particularly applicable.
- the process to determine the nucleotide sequence of a target nucleic acid i.e., a nucleic-acid polymer
- Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
- SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand.
- a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery.
- more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
- SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties.
- Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below.
- the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery.
- the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
- SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like.
- the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used.
- the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by
- Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P.
- cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference.
- This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference.
- the availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing.
- Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
- the labels do not substantially inhibit extension under SBS reaction conditions.
- the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features.
- each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially and an image of the array can be obtained between each addition step.
- each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed, and analyzed as set forth herein. Following the image capture step, labels can be removed and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
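Because each of the four images corresponds to one label and feature positions do not move between images, a base call for a feature in a given cycle can be sketched as choosing the channel with the strongest signal. This is a simplified illustration: the channel-to-base assignment in `channel_bases` is an assumption, and practical base callers also correct for effects such as crosstalk and phasing, which are omitted here.

```python
def call_bases_four_channel(intensities, channel_bases="ACGT"):
    """Call one base per cycle for a single feature.

    intensities: list of cycles, each a sequence of 4 channel intensities
                 measured at the same feature position.
    channel_bases: assumed mapping from channel index to nucleotide type.
    """
    calls = []
    for cycle in intensities:
        # The brightest channel indicates which labeled nucleotide was incorporated.
        brightest = max(range(4), key=lambda c: cycle[c])
        calls.append(channel_bases[brightest])
    return "".join(calls)


feature_signal = [
    [9.0, 1.0, 0.0, 2.0],  # cycle 1
    [0.0, 1.0, 8.0, 1.0],  # cycle 2
    [1.0, 1.0, 1.0, 7.0],  # cycle 3
]
print(call_bases_four_channel(feature_signal))  # AGT
```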
- nucleotide monomers can include reversible terminators.
- reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15: 1767-1776 (2005), which is incorporated herein by reference).
- Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety).
- Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension, but could easily be deblocked by a short treatment with a palladium catalyst.
- the fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light.
- disulfide reduction or photocleavage can be used as a cleavable linker.
- Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP.
- the presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance.
- Some embodiments can utilize detection of four different nucleotides using fewer than four different labels.
- SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232.
- a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair.
- nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal.
- one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels.
- An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label that is not, or minimally, detected in either channel (e.g. dGTP having no label).
- sequencing data can be obtained using a single channel.
- the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated.
- the third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
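The two-channel labeling described above amounts to a small lookup table from signal patterns to bases. The following Python sketch illustrates that decoding under stated assumptions: the intensity threshold of 0.5, the function names, and the normalized intensity values are hypothetical and are not taken from the disclosure; only the base-to-channel assignment (dATP in the first channel, dCTP in the second, dTTP in both, dGTP in neither) follows the example in the text.

```python
from typing import Iterable, Tuple

# Signal pattern (first channel on?, second channel on?) -> base,
# following the example labeling in the text.
TWO_CHANNEL_CODE = {
    (True, False): "A",   # dATP: detected in the first channel only
    (False, True): "C",   # dCTP: detected in the second channel only
    (True, True): "T",    # dTTP: detected in both channels
    (False, False): "G",  # dGTP: unlabeled, detected in neither channel
}


def call_base(signal_1: float, signal_2: float, threshold: float = 0.5) -> str:
    """Return the base implied by two intensity measurements.

    The threshold is an illustrative assumption, not a value from the source.
    """
    return TWO_CHANNEL_CODE[(signal_1 > threshold, signal_2 > threshold)]


def call_read(cycles: Iterable[Tuple[float, float]]) -> str:
    """Decode one (signal_1, signal_2) pair per sequencing cycle."""
    return "".join(call_base(s1, s2) for s1, s2 in cycles)


example_cycles = [(0.9, 0.1), (0.1, 0.8), (0.9, 0.9), (0.05, 0.02)]
decoded = call_read(example_cycles)
```

The same table could plausibly serve the single-channel variant if the two values are read as the first and second images of one channel (label removed after the first image, label added only after the first image, label retained in both, no label in either), although the source does not state which nucleotide type takes which role in that variant.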
- Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides.
- the oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize.
- images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images.
- Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis." Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope." Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties).
- the target nucleic acid passes through a nanopore.
- the nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin.
- each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
- Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity.
- Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No.
- FRET fluorescence resonance energy transfer
- the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al.
- Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product.
- sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 A1; US 2009/0127589 A1; US 2010/0137143 A1; or US 2010/0282617 A1, each of which is incorporated herein by reference.
- Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
- the above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously.
- different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner.
- the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner.
- the target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface.
- the array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
- the methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
- An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above.
- an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like.
- a nucleotide-sample-slide can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary nucleotide-sample-slides are described, for example, in US 2010/0111768 A1 and US Ser. No. 13/273,666, each of which is incorporated herein by reference.
- one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method.
- one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above.
- an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods.
- Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
- the term "sample," and its derivatives, is used in its broadest sense and includes any specimen, culture and the like that is suspected of including a target.
- the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids.
- the sample can include any biological, clinical, surgical, agricultural, atmospheric, or aquatic-based specimen containing one or more nucleic acids.
- the term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen.
- the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA.
- the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
- the nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA).
- the sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another embodiment, low molecular weight material includes enzymatically or mechanically fragmented DNA.
- the sample can include cell-free circulating DNA.
- the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples.
- the sample can be an epidemiological, agricultural, forensic, or pathogenic sample.
- the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source.
- the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus, or fungus.
- the source of the nucleic acid molecules may be an archived or extinct sample or species.
- forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel.
- the nucleic acid sample may be a purified sample or a crude DNA containing lysate, for example derived from a buccal swab, paper, fabric, or other substrate that may be impregnated with saliva, blood, or other bodily fluids.
- the nucleic acid sample may comprise low amounts of, or fragmented portions of DNA, such as genomic DNA.
- target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine, and serum.
- target sequences can be obtained from hair, skin, tissue samples, autopsy, or remains of a victim.
- nucleic acids including one or more target sequences can be obtained from a deceased animal or human.
- target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA.
- target sequences or amplified target sequences are directed to purposes of human identification.
- the disclosure relates generally to methods for identifying characteristics of a forensic sample.
- the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein.
- a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
- the components of the sequencing ordering system 106 can include software, hardware, or both.
- the components of the sequencing ordering system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device(s) 114). When executed by the one or more processors, the computer-executable instructions of the sequencing ordering system 106 can cause the computing devices to perform the methods described herein.
- the components of the sequencing ordering system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the sequencing ordering system 106 can include a combination of computer-executable instructions and hardware.
- the components of the sequencing ordering system 106 performing the functions described herein with respect to the sequencing ordering system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model.
- components of the sequencing ordering system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device.
- the components of the sequencing ordering system 106 may be implemented in any application that provides sequencing services including, but not limited to Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
- Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below.
- Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
- one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein).
- a processor receives instructions from a non-transitory computer-readable medium (e.g., a memory) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
- Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system.
- Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices).
- Computer-readable media that carry computer-executable instructions are transmission media.
- embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
- Non-transitory computer-readable storage media include RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
- SSDs solid state drives
- PCM phase-change memory
- a “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices.
- a network or another communications connection can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
- program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa).
- computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system.
- non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
- Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions.
- computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure.
- the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
- the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like.
- the disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks.
- program modules may be located in both local and remote memory storage devices.
- a cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth.
- a cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS).
- SaaS Software as a Service
- PaaS Platform as a Service
- IaaS Infrastructure as a Service
- a cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
- a “cloud-computing environment” is an environment in which cloud computing is employed.
- FIG. 19 illustrates a block diagram of a computing device 1900 that may be configured to perform one or more of the processes described above.
- the computing device 1900 may implement the sequencing ordering system 106 and the sequencing system 104.
- the computing device 1900 can comprise a processor 1902, a memory 1904, a storage device 1906, an I/O interface 1908, and a communication interface 1910, which may be communicatively coupled by way of a communication infrastructure 1912.
- the computing device 1900 can include fewer or more components than those shown in FIG. 19. The following paragraphs describe components of the computing device 1900 shown in FIG. 19 in additional detail.
- the I/O interface 1908 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1900.
- the I/O interface 1908 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces.
- the I/O interface 1908 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers.
- the I/O interface 1908 is configured to provide graphical data to a display for presentation to a user.
- the graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
- the communication interface 1910 can include hardware, software, or both. In any event, the communication interface 1910 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1900 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1910 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
- NIC network interface controller
- WNIC wireless NIC
- the communication interface 1910 may facilitate communications with various types of wired or wireless networks.
- the communication interface 1910 may also facilitate communications using various communication protocols.
- the communication infrastructure 1912 may also include hardware, software, or both that couples components of the computing device 1900 to each other.
- the communication interface 1910 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein.
- the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
Abstract
This disclosure describes methods, non-transitory computer readable media, and systems that can analyze features of sequencing tasks and/or nucleotide-sample-slides and generate task ordering scores or nucleotide-sample-slide-ordering scores upon which a computing system can order the processing of sequencing tasks and/or nucleotide-sample-slides. For instance, the sequencing ordering system may generate task ordering scores, utilizing a sequencing-task ordering machine-learning model, indicating a relative order of the set of sequencing tasks, and slide ordering scores, utilizing a nucleotide-sample-slide ordering machine-learning model, indicating a relative order of the nucleotide-sample-slides, based on available computing resources and the set of sequencing task features and perform the set of sequencing tasks according to the task ordering scores. Furthermore, in some implementations, the disclosed system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run based on the processing parameters for the base-call-data files for the ordered sequencing tasks.
Description
MACHINE-LEARNING MODELS FOR ORDERING AND EXPEDITING SEQUENCING TASKS OR CORRESPONDING NUCLEOTIDE-SAMPLE SLIDES
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/564,251, entitled “MACHINE-LEARNING MODELS FOR ORDERING AND EXPEDITING SEQUENCING TASKS OR CORRESPONDING NUCLEOTIDE-SAMPLE SLIDES,” filed on March 12, 2024, which is incorporated herein by reference in its entirety.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining nucleobase calls for genomic samples. To generate and analyze nucleobase calls for genomic samples, some existing sequencing devices and sequencing-data-analysis software (together “existing sequencing systems”) configure sequencing parameters for controlling tasks or otherwise guiding existing sequencing systems when ordering or executing sequencing runs. During a sequencing run, for instance, a sequencing device (e.g., sequencing machine or instrument) performs primary sequencing tasks (e.g., cluster generation, primer hybridization, image analysis, base calling, demultiplexing, and quality scoring for primary analysis) to determine nucleobase calls for nucleotide reads of a genomic sample. After a sequencing run finishes, existing sequencing-data-analysis software can cause computing devices to run secondary sequencing tasks (e.g., read alignment, variant-calling, structural variant detection, functional annotation, taxonomic classification, and genome assembly for secondary analysis) on such nucleotide reads to align the nucleotide reads with a reference genome and determine variant calls for genomic samples where such samples differ from the reference genome. 
[0003] While such primary and secondary sequencing-task software (together “existing sequencing management systems”) provide useful options to order and analyze the results of sequencing runs, existing sequencing management systems (i) provide computationally limited or inefficient ordering mechanisms for ordering nucleotide-sample-slides for genomic analysis on specialized computing devices, (ii) provide computationally limited or inefficient ordering mechanisms for ordering sequencing tasks for genomic analysis, and (iii) limit functions and control of an end-to-end sequencing process for a genomic sample across the sequencing device and secondary sequencing-data-analysis devices.
[0004] On servers that perform tasks for sequencing devices, for instance, multiple nucleotide-sample-slides can get queued for sequencing runs or for post-sequencing secondary analysis applications, thereby straining limited server resources. For example, despite recent advances, existing ordering algorithms could use a dot product alignment function to order sequencing tasks for genome samples using a first-in-first-out algorithm based on the highest alignment tasks. Indeed, some existing sequencing systems prioritize and process the sequencing tasks and genomic analysis of the genomic samples based upon the order in which the genomic samples are received. As a result of processing genomic samples in the order they are received, existing sequencing systems execute ordering procedures that often over-allocate or under-allocate computing and other resources (e.g., consumable reagents for biochemistry). To illustrate, in cases where existing sequencing systems over-allocate sequencing tasks and/or genomic analysis, the system allocates more system resources (e.g., CPU processing power, memory, disk space, network bandwidth) than the system can handle effectively. By executing too many processes to run concurrently on a processor, the processor may become overloaded, leading to resource exhaustion. Furthermore, when sequencing computing resources are overutilized, existing sequencing systems consume excess power and generate unnecessary heat, which not only wastes energy but also increases cooling costs. Similarly, allocating too much memory or disk space can deplete these computing resources or lead to situations where tasks are waiting for resources, causing a deadlock where none of the pending tasks can proceed.
[0005] By contrast, in cases where existing sequencing systems under-allocate sequencing tasks and/or genomic analysis, the system allocates fewer system resources (e.g., CPU processing power, memory, disk space, network bandwidth) than the system can handle effectively, thereby resulting in resource wastage and suboptimal performance. Furthermore, the under-allocation of computing resources by existing sequencing systems results in longer execution times for sequencing runs, delaying results, and delaying the ordering of upcoming sequencing tasks. In addition, in cases where the ordering system under-allocates resources, existing systems can require more frequent intervention, maintenance, and/or manual adjustments to compensate for the inadequate resources.
[0006] In addition to misallocating computing resources in a sequencing run, many existing sequencing systems consecutively perform primary sequencing tasks followed by the corresponding secondary sequencing tasks, resulting in an inefficient use of resources and an excessive amount of run time. To illustrate, given a nucleotide-sample-slide sequencing run, some existing sequencing systems require 48 hours of processing time for the combination of cluster generation, genome sequencing, and base calling associated with around 52 billion reads. Indeed, for some existing sequencing systems, primary sequencing tasks for a sequencing run with paired-end reads with a length of 150 base pairs require approximately 48 hours of run time on the sequencing device. After the completion of the primary sequencing tasks, existing sequencing systems then transfer the base-call-data files to devices for a sequential secondary analysis.
[0007] Despite the need for effective ordering of sequencing tasks, generic ordering systems would consume an inordinate amount of computer processing time if trained to execute genomic sequencing operations. For example, both implementing and maintaining many generic ordering systems can be complex and result in substantial computational demands, which can strain system resources. Generic ordering systems require an extensive use of memory and processor capacity and can slow down other processes and reduce overall system efficiency due to the heavy resource consumption. As an example, the Tetris Heuristic ordering model utilizes a dot product alignment function that requires a careful tuning of parameters. While publications have not been found suggesting that the Tetris Heuristic ordering model has been used for ordering sequencing tasks or nucleotide-sample slides, even if the model has been so used, the parameter tuning required by the Tetris Heuristic ordering model is a meticulous and time-consuming process and, as described further below, the model’s generic design performs poorly relative to the disclosed system. As another example, DeepRM is a deep reinforcement learning-based resource management solution that employs a conventional deep Q-learning algorithm. While publications have also not been found suggesting that DeepRM has been used for ordering sequencing tasks or nucleotide-sample slides, even if DeepRM has been so used, DeepRM requires large amounts of data to train effectively, which can be complex and result in an increased computational burden.
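To make the dot product alignment idea mentioned above concrete, the following Python sketch shows a simplified, generic version of such a heuristic: each pending task's resource-demand vector is scored against the currently available resources, and the best-aligned task that fits is chosen. This is an illustration of the generic heuristic only, not the disclosed system; the task names, the two resource dimensions (CPU and memory fractions), and the function names are hypothetical.

```python
from typing import Dict, Optional, Sequence


def alignment_score(demand: Sequence[float], available: Sequence[float]) -> float:
    """Dot product of a task's resource demand and the free resources."""
    return sum(d * a for d, a in zip(demand, available))


def fits(demand: Sequence[float], available: Sequence[float]) -> bool:
    """A task is eligible only if every demanded resource is available."""
    return all(d <= a for d, a in zip(demand, available))


def pick_next_task(
    pending: Dict[str, Sequence[float]], available: Sequence[float]
) -> Optional[str]:
    """Return the eligible task with the highest alignment score, if any."""
    scores = {
        name: alignment_score(demand, available)
        for name, demand in pending.items()
        if fits(demand, available)
    }
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical demands expressed as (CPU fraction, memory fraction).
pending_tasks = {
    "read_alignment": (0.5, 0.2),
    "variant_calling": (0.2, 0.3),
    "genome_assembly": (0.7, 0.1),
}
free_resources = (0.6, 0.3)
chosen = pick_next_task(pending_tasks, free_resources)
```

In this example `genome_assembly` is excluded because its CPU demand exceeds what is free, and `read_alignment` wins over `variant_calling` on alignment score. Published versions of this kind of heuristic combine the alignment term with additional weighted terms, which is one place where the parameter tuning noted above could arise.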
[0008] Independent of generic ordering models used for unrelated technologies, many existing sequencing systems inefficiently transfer primary sequencing data files to processing devices for analysis in part due to the limits imposed by consecutively performing sequencing tasks and secondary analysis tasks. To illustrate, many existing sequencing systems perform all primary sequencing tasks for a genomic sample before transferring data files or performing any secondary sequencing tasks for the genomic sample. Indeed, these existing sequencing systems utilize a rigid two-stage approach to sequencing tasks and convert sequencing data into a readable format for the analysis of secondary sequencing tasks only after performing the primary sequencing tasks — thereby isolating information concerning a sequencing run and the sequencing-data analysis for variants and limiting control over the end-to-end sequencing process.
[0009] Because existing sequencing systems generally complete all primary sequencing tasks for a genomic sample before commencing secondary sequencing tasks, existing sequencing systems require storing and transferring large amounts of data consecutively, which taxes the bandwidth of network connections or other interfaces that connect processor cards with other hardware within a computing device. For example, in the 52-billion-read example mentioned above, existing sequencing systems analyzing primary sequencing tasks for a sequencing run with paired-end reads with a length of 150 base pairs produce approximately 16 Tb of data and require approximately 48 hours of run time. Consequently, existing sequencing systems require local storage of the 16 Tb of sequencing data and perform a subsequent batch data transfer over network devices that consumes approximately 7 hours (assuming a 5 Gb/s link). While hardware on sequencing devices and servers has increased memory (e.g., chips for a Field Programmable Gate Array (FPGA) or other configurable processors often include around 32 gigabytes of memory on the chip), existing server or device memory can be insufficient to store data for multiple sequential sequencing runs. By waiting to transfer the primary sequencing task data files until the end of the sequencing run, existing sequencing systems require local storage of the primary sequencing task data files, tax the bandwidth of network connections, and delay the start of analysis for the secondary sequencing tasks by up to 55 hours (e.g., 48 hours run time and 7 hours transfer time). Furthermore, network bottlenecks or other interface throughput interruptions can further delay the start of the analysis of the secondary sequencing tasks.
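The transfer and delay figures above can be checked with a short calculation. The following is a minimal sketch that assumes decimal units and an ideal, fully utilized link; the function name `transfer_hours` is hypothetical. Under those assumptions, the roughly 7-hour transfer time appears consistent with reading the 16 Tb figure as terabytes rather than terabits.

```python
# Worked arithmetic for the batch-transfer delay described above.
# Assumptions (not from the source): decimal units, no protocol overhead.
GBIT_PER_S = 5    # link speed from the text (5 Gb/s)
RUN_HOURS = 48    # sequencing run time from the text


def transfer_hours(size_bits: float, link_bits_per_s: float) -> float:
    """Hours needed to move size_bits over a link of link_bits_per_s."""
    return size_bits / link_bits_per_s / 3600


link = GBIT_PER_S * 1e9
as_terabits = transfer_hours(16e12, link)       # reading "16 Tb" as terabits
as_terabytes = transfer_hours(16e12 * 8, link)  # reading it as terabytes

print(f"16 terabits : {as_terabits:.2f} h")     # 0.89 h
print(f"16 terabytes: {as_terabytes:.2f} h")    # 7.11 h
print(f"delay before secondary analysis: {RUN_HOURS + round(as_terabytes)} h")  # 55 h
```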
[0010] In addition to inefficient computing-resource allocation and delayed secondary sequencing tasks, existing sequencing systems suffer from security concerns. Because some existing systems only convert primary sequencing data into a readable format after completing the primary sequencing tasks, the information concerning the sequencing run is stored locally — often irrespective of the sample ownership or information sensitivity. Consequently, confidential sequencing metadata regarding the primary sequencing tasks is retained within the local system memory until the completion of the primary sequencing tasks, potentially raising additional concerns over data management and security.
[0011] These, along with additional problems and issues, exist in existing sequencing management systems.
SUMMARY
[0012] This disclosure describes one or more embodiments of systems, methods, and non-transitory computer readable storage media that solve one or more of the problems described above or provide other advantages over the art. For example, the disclosed systems utilize a machine-learning model to analyze features of sequencing tasks and generate task ordering scores upon which a computing system can order the processing of sequencing tasks when sequencing nucleotides for genomic samples. As part of ordering sequencing tasks, in some embodiments, the disclosed system uses a specialized machine-learning model to generate task ordering scores for either primary sequencing tasks associated with base calling for a genomic sample’s nucleotide reads or secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of such nucleotide reads.
[0013] In addition or in the alternative to ordering sequencing tasks, in some embodiments, the disclosed systems utilize a machine-learning model to analyze features of nucleotide-sample-slides and generate slide ordering scores upon which a computing system can order the processing of nucleotide-sample-slides when sequencing nucleotides for such genomic samples. As part of ordering nucleotide-sample-slides, in some embodiments, the disclosed system uses a specialized machine-learning model to generate slide ordering scores for determining an order of nucleotide-sample slides on which to perform primary sequencing tasks associated with base calling or for which to perform secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of such nucleotide reads.
[0014] To illustrate but one example of a specialized machine-learning model, in some embodiments, the disclosed system can utilize a relatively small neural network composed of fully connected layers combined with activation functions to produce alignment values that order sequencing tasks and/or nucleotide-sample-slides. The neural network can be trained via a genetic algorithm to determine a best scoring version of the model, whether for a sequencing-task ordering machine-learning model or a nucleotide-sample-slide ordering machine-learning model. For example, in certain instances, the system generates predicted ordering scores from candidate machine-learning models and determines makespan scores for each candidate model based on the predicted ordering scores. By comparing the makespan scores for each candidate model using a loss function, the system selects a highest performing candidate model as the ordering machine-learning model. Furthermore, the disclosed system can use a two-tier alignment function that utilizes two neural networks and incorporates a penalty value (or priority feature) to order and execute sequencing tasks more efficiently and with fewer computing resources.
[0015] In addition, or in the alternative to ordering sequencing tasks or nucleotide-sample-slides, in some cases, the disclosed system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run. In particular, based on processing requirements, the disclosed system can demultiplex and transmit base-call-data files specific to genomic samples to one or more computing devices during the sequencing run.
[0016] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0017] The detailed description refers to the drawings briefly described below.
[0018] FIG. 1 illustrates an environment in which a sequencing ordering system can operate in accordance with one or more embodiments of the present disclosure.
[0019] FIG. 2A illustrates a schematic diagram of the sequencing ordering system determining task ordering scores for sequencing tasks and performing the sequencing tasks in a relative order according to the task ordering scores in accordance with one or more embodiments of the present disclosure.
[0020] FIG. 2B illustrates a schematic diagram of the sequencing ordering system determining slide ordering scores for nucleotide-sample slides and processing the nucleotide-sample slides in a relative order according to the slide ordering scores in accordance with one or more embodiments of the present disclosure.
[0021] FIG. 3 illustrates a schematic diagram of the sequencing ordering system utilizing the sequencing-task ordering machine-learning model to determine task ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure.
[0022] FIG. 4 illustrates the sequencing ordering system providing primary and/or secondary sequencing task features to the sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
[0023] FIG. 5 illustrates an example architecture for a sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
[0024] FIGS. 6A-6B illustrate utilizing the sequencing ordering system to select the highest performing sequencing-task ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure.
[0025] FIG. 7A illustrates the sequencing ordering system distributing sample-specific base-call-data files to one or more computing devices in accordance with one or more embodiments of the present disclosure.
[0026] FIG. 7B illustrates the sequencing ordering system performing a demultiplexing operation on a subset of sequencing cycles with indexing cycles performed between genomic sequencing cycles in accordance with one or more embodiments of the present disclosure.
[0027] FIG. 7C illustrates the sequencing ordering system performing an indexing-first approach to demultiplexing nucleotide reads by performing indexing cycles before genomic sequencing cycles in accordance with one or more embodiments of the present disclosure.
[0028] FIG. 8 illustrates a schematic diagram of the sequencing ordering system utilizing the nucleotide-sample-slide ordering machine-learning model to determine task ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure.
[0029] FIG. 9 illustrates the sequencing ordering system providing nucleotide-sample slide features to the nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
[0030] FIG. 10 illustrates an example architecture for a nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
[0031] FIGS. 11A-11C illustrate the sequencing ordering system selecting the highest performing nucleotide-sample-slide ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure.
[0032] FIG. 12 illustrates a schematic diagram of the sequencing ordering system utilizing a combination of the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to order sequencing tasks in accordance with one or more embodiments of the present disclosure.
[0033] FIGS. 13A-13B illustrate graphs of the distribution of a penalized makespan utilizing different and existing ordering strategies in comparison with a first-in-first-out (FIFO) sequencing ordering system.
[0034] FIG. 14 illustrates a graph of the makespan compared against an average task load for different and existing ordering strategies in comparison with the sequencing ordering system.
[0035] FIG. 15 illustrates a graph showing a comparison of the performance of the sequencing ordering system when trained utilizing the genetic algorithm compared to a Tetris heuristic model in accordance with one or more embodiments of the present disclosure.
[0036] FIG. 16 illustrates a flowchart of a series of acts for generating task ordering scores and performing a set of sequencing tasks according to the task ordering scores in accordance with one or more embodiments of the present disclosure.
[0037] FIG. 17 illustrates a flowchart of a series of acts for transmitting genomic samples to computing devices in accordance with one or more embodiments of the present disclosure.
[0038] FIG. 18 illustrates a flowchart of a series of acts for generating slide ordering scores and performing sequencing tasks according to the slide ordering scores in accordance with one or more embodiments of the present disclosure.
[0039] FIG. 19 illustrates a block diagram of an example computing device in accordance with one or more embodiments of the present disclosure.
DETAILED DESCRIPTION
[0040] This disclosure describes one or more embodiments of a sequencing ordering system that provides a machine-learning model that can analyze features of sequencing tasks and generate task ordering scores upon which a computing system can order the processing of sequencing tasks. For instance, the sequencing ordering system can determine, for a set of sequencing tasks, a set of sequencing task features indicating at least a performance time associated with respective sequencing tasks of the set of sequencing tasks. The sequencing ordering system may further provide the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks. The sequencing ordering system may generate, utilizing the sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on available computing resources and the set of sequencing task features. Based on the task ordering scores, the sequencing ordering system performs the set of sequencing tasks. Furthermore, in some implementations, the disclosed system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run based on the processing requirements for the base-call-data files for the scheduled sequencing tasks.
[0041] As part of generating task ordering scores, the sequencing ordering system utilizes a machine-learning model that takes as input data (e.g., a feature vector) representing the actual tasks in the pipeline (associated with multiple nucleotide-sample-slides with various densities and/or secondary analysis applications) in combination with data representing the compute resources available on and/or off the sequencing instrument for analyzing a nucleotide-sample-slide to reduce the makespan (overall time to complete the related tasks) of either primary or secondary sequencing tasks. For instance, the sequencing ordering system determines sequencing task features (e.g., a performance time associated with each sequencing task) for sequencing tasks and further provides the sequencing task features to a sequencing-task ordering machine-learning model. By processing the sequencing task features and accounting for available computing resources (e.g., using model parameters), the sequencing-task ordering machine-learning model generates task ordering scores indicating a relative order of the sequencing tasks. The sequencing ordering system further performs the sequencing tasks according to the task ordering scores.
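To make the makespan notion above concrete, the following is a minimal sketch of how the makespan of a given task order could be computed on a pool of identical workers. The dispatch rule (each task goes to the next free worker), the durations, and the stand-in scores are all assumptions for illustration; in the disclosed system the scores would come from the sequencing-task ordering machine-learning model.

```python
import heapq


def makespan(durations, order, n_workers):
    """Finish time when tasks are dispatched in `order` to the next free worker."""
    free_at = [0.0] * n_workers           # min-heap of worker free times
    finish = 0.0
    for i in order:
        start = heapq.heappop(free_at)    # earliest available worker
        end = start + durations[i]
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


durations = [2, 3, 2, 7]                  # hypothetical performance times (hours)
fifo = list(range(len(durations)))        # arrival order

# Stand-in ordering scores (assumption): longer tasks score higher.
scores = [float(d) for d in durations]
by_score = sorted(fifo, key=lambda i: scores[i], reverse=True)

print(makespan(durations, fifo, n_workers=2))      # 10.0
print(makespan(durations, by_score, n_workers=2))  # 7.0
```

In this toy workload, running the long task first lets the short tasks fill the other worker, which is the kind of delay that a FIFO order cannot avoid.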
[0042] To further illustrate how the disclosed system utilizes task ordering scores, the disclosed system can use a specialized machine-learning model to generate task ordering scores for either (i) primary sequencing tasks (e.g., real-time analysis) associated with base calling for a genomic sample’s nucleotide reads or (ii) secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of such nucleotide reads. In either case, the disclosed system determines sequencing task features (e.g., a performance time associated with each sequencing task) for sequencing tasks and further provides the sequencing task features to the sequencing-task ordering machine-learning model. By processing the sequencing task features and accounting for available computing resources (e.g., using model parameters), the sequencing-task ordering machine-learning model generates task ordering scores indicating a relative order of the sequencing tasks. The system further performs the sequencing tasks according to the task ordering scores.
[0043] In some implementations, the sequencing ordering system can utilize a relatively small sequencing-task ordering neural network composed of fully connected layers combined with activation functions to generate task ordering scores that determine an order in which to perform sequencing tasks for a sequencing run or secondary analysis. To illustrate, in certain implementations, a sequencing-task ordering neural network includes an input layer for the set of sequencing task features, two fully connected hidden layers, each equipped with an activation function, bias, and weights, and, after the fully connected hidden layers, an output layer that outputs task ordering scores. As illustrated below, such a sequencing-task ordering neural network architecture includes adjustable parameters (e.g., 88 adjustable parameters) to generate the most efficient alignment function.
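As a rough illustration of how small such a network is, the sketch below builds a fully connected network with two hidden layers and counts its weights and biases. The layer widths (5, 7, 5, 1), the tanh activation, and the feature values are assumptions; the text above does not state them. The widths were picked only so that the count matches the 88 adjustable parameters given as an example.

```python
import math
import random


def init_params(sizes, rng):
    """One (weights, biases) pair per layer; weights[j][i] links input i to unit j."""
    return [
        ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def count_params(params):
    """Total number of adjustable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


def score(params, features):
    """Forward pass: tanh on the hidden layers, linear output (assumed)."""
    x = features
    for k, (w, b) in enumerate(params):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]
        if k < len(params) - 1:           # no activation on the output layer
            x = [math.tanh(v) for v in x]
    return x[0]


SIZES = [5, 7, 5, 1]                      # hypothetical layer widths
params = init_params(SIZES, random.Random(0))
print(count_params(params))               # 88

task_features = [0.5, 0.1, 0.0, 1.0, 0.2]  # hypothetical normalized task features
ordering_score = score(params, task_features)
```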
[0044] As mentioned, in some embodiments, the sequencing ordering system trains the sequencing-task ordering machine-learning model to determine scores indicating a best order of sequencing tasks. To illustrate, the sequencing-task ordering machine-learning model is trained via a genetic algorithm to determine a best version of the sequencing-task ordering machine-learning model. For instance, the sequencing ordering system identifies a set of parent sequencing-task ordering machine-learning models (e.g., 128 parent models filtered from an initial 8,192 models) and, from the parents, generates a set of candidate sequencing-task ordering machine-learning models (e.g., repopulated 8,192 candidate models) each comprising different weights and biases. The sequencing ordering system further generates predicted ordering scores from each candidate sequencing-task ordering machine-learning model and determines makespan scores for each candidate sequencing-task ordering machine-learning model based on the predicted ordering scores. By comparing the makespan scores for each candidate sequencing-task ordering machine-learning model using a loss function, the sequencing ordering system selects a highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model.
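The select-parents, repopulate, and compare-makespans loop described above can be sketched as follows. This is a simplified illustration rather than the disclosed training procedure: a two-weight linear scorer stands in for the neural network, the population is 64 with 8 parents instead of 8,192 and 128, mean makespan serves as the loss, and the crossover and mutation operators, workloads, and function names are all assumptions.

```python
import heapq
import random


def makespan(durations, order, n_workers=2):
    """Finish time when tasks are dispatched in `order` to the next free worker."""
    free_at = [0.0] * n_workers
    for i in order:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + durations[i])
    return max(free_at)


def fitness(weights, workloads):
    """Mean makespan when tasks run in descending score order (lower is better)."""
    total = 0.0
    for tasks in workloads:               # tasks: list of (duration, memory)
        scores = [sum(w * f for w, f in zip(weights, t)) for t in tasks]
        order = sorted(range(len(tasks)), key=lambda i: -scores[i])
        total += makespan([t[0] for t in tasks], order)
    return total / len(workloads)


def evolve(workloads, pop_size=64, n_parents=8, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(pop_size)]
    pop[0] = [0.0, 0.0]                   # FIFO-equivalent baseline candidate
    for _ in range(generations):
        # Keep the candidates with the lowest mean makespan as parents.
        parents = sorted(pop, key=lambda w: fitness(w, workloads))[:n_parents]
        children = []
        while len(children) < pop_size - n_parents:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation.
            children.append([rng.choice(p) + rng.gauss(0, 0.1) for p in zip(a, b)])
        pop = parents + children          # elitism: the best candidate survives
    return min(pop, key=lambda w: fitness(w, workloads))


rng = random.Random(1)
workloads = [[(rng.uniform(1, 10), rng.uniform(0, 1)) for _ in range(8)]
             for _ in range(20)]
best = evolve(workloads)
fifo = fitness([0.0, 0.0], workloads)     # equal scores keep arrival order
print(fitness(best, workloads) <= fifo)   # True
```

Because the parents are carried into each new generation and the initial population includes the FIFO-equivalent candidate, the selected model can never score worse than FIFO on the training workloads in this sketch.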
[0045] In addition to ordering sequencing tasks, in some embodiments, the sequencing ordering system can analyze features of nucleotide-sample-slides and generate slide ordering scores upon which a computing system can order the processing of nucleotide-sample-slides. For example, the sequencing ordering system can determine, for a set of nucleotide-sample-slides, a set of nucleotide-sample-slide features indicating at least a performance time associated with processing data for each nucleotide-sample-slide of the set of nucleotide-sample-slides. The sequencing ordering system may further provide the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample-slides. The sequencing ordering system may generate, utilizing the nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample-slides based on available computing resources and the set of nucleotide-sample-slide features. Based on the slide ordering scores, the sequencing ordering system performs sequencing tasks. In addition, the sequencing ordering system can utilize a two-tier system incorporating both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to generate the slide ordering scores.
[0046] As part of determining such slide ordering scores, the system accesses or determines nucleotide-sample-slide features (e.g., a performance time associated with processing data for each nucleotide-sample-slide) for respective nucleotide-sample-slides and provides the nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model. By processing the nucleotide-sample-slide features and accounting for available computing resources, the nucleotide-sample-slide ordering machine-learning model generates nucleotide-sample-slide ordering scores indicating a relative order for processing the different nucleotide-sample-slides. Based on the nucleotide-sample-slide ordering scores, the disclosed system performs sequencing tasks for the ordered nucleotide-sample-slides.
[0047] In addition, similar to the relatively small sequencing-task ordering neural network utilized to generate task ordering scores, the sequencing ordering system can utilize a relatively small nucleotide-sample-slide ordering neural network composed of fully connected layers combined with activation functions to generate slide ordering scores that determine an order in which to process nucleotide-sample slides. To illustrate, in certain implementations, the sequencing ordering system utilizes a nucleotide-sample-slide ordering neural network that includes an input layer for the set of nucleotide-sample-slide features, four fully connected hidden layers, each equipped with an activation function, bias, and weights, and — after the fully connected hidden layers — an output layer that outputs slide ordering scores.
[0048] In some embodiments, the sequencing ordering system trains a nucleotide-sample-slide ordering neural network or other machine-learning model using genetic algorithms to determine scores indicating a best order of processing different nucleotide-sample slides. For example, similar to the method outlined above used to train the sequencing-task ordering machine-learning model, the sequencing ordering system trains a nucleotide-sample-slide ordering machine-learning model to produce slide ordering scores that indicate the order for processing nucleotide-sample-slides in a sequencing run or secondary analysis. In particular, the nucleotide-sample-slide ordering machine-learning model is trained via a genetic algorithm by selecting parent nucleotide-sample-slide ordering machine-learning models based on their fitness, generating candidate nucleotide-sample-slide ordering machine-learning models through crossover and mutation, and selecting a highest performing nucleotide-sample-slide ordering candidate model as the nucleotide-sample-slide ordering machine-learning model.
[0049] Furthermore, the sequencing ordering system can integrate both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model into a two-tier sequencing ordering system. In this way, the sequencing ordering system can provide an order for the sequencing tasks based on both the slide ordering scores and the task ordering scores. In turn, the sequencing ordering system can perform the sequencing tasks for the set of nucleotide-sample slides according to both the slide ordering scores and task ordering scores.
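One simple way to combine the two tiers is sketched below: tasks are ordered by their slide's score first and by their own task score within a slide. The lexicographic combination, the convention that a higher score runs earlier, and all names and values are assumptions made for illustration; the paragraph above does not specify how the two sets of scores are merged.

```python
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    slide: str
    task_score: float


def two_tier_order(tasks, slide_scores):
    """Order by slide score first, then by task score within a slide (higher first)."""
    return sorted(tasks, key=lambda t: (-slide_scores[t.slide], -t.task_score))


# Hypothetical scores from the slide-level and task-level models.
slide_scores = {"slide_A": 0.4, "slide_B": 0.9}
tasks = [
    Task("align", "slide_A", 0.7),
    Task("base_call", "slide_B", 0.2),
    Task("variant_call", "slide_A", 0.9),
    Task("demux", "slide_B", 0.8),
]

ordered = [t.name for t in two_tier_order(tasks, slide_scores)]
print(ordered)  # ['demux', 'base_call', 'variant_call', 'align']
```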
[0050] In addition to ordering sequencing tasks and/or nucleotide-sample-slides, in some cases, the sequencing ordering system determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run. In particular, based on different processing parameters, the sequencing ordering system can demultiplex and transmit base-call-data files to one or more computing devices during the sequencing run. Such different processing parameters may include a different secondary sequencing task for a genomic sample, different analysis rights for a genomic sample, a different category of analysis for the genomic sample, or a different sample size for a genomic sample. Based on such processing parameters, the disclosed system can demultiplex the indexed reads to determine which indexing sequences belong to which genomic samples after the completion of the first sequencing pass and efficiently begin transmitting the base-call data files to the appropriate computing device during the sequencing run (e.g., during the second sequencing pass).
[0051] By employing an indexing-first approach, the sequencing ordering system can speed up its distribution of sample-specific base-call-data files. In some cases, for example, the sequencing ordering system expedites determining oligonucleotides belonging to respective genomic samples within a nucleotide-sample-slide pool (or other nucleotide-sample-substrate pool) by base calling the indexing sequences for both read pairs before base calling the genomic sequences in library templates for each sample. By performing indexing cycles before the genomic sequencing cycles, the sequencing ordering system determines which nucleotide reads belong to which genomic samples and a relative balance of genomic samples. Furthermore, based on this determination, the sequencing ordering system can begin generating and transmitting the base-call data files to the appropriate computing device after each genomic sequencing cycle of the sequencing run.
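The indexing-first flow above can be sketched as a two-step routine: resolve each cluster's sample from its index read once, then route every later genomic cycle's base calls by sample. The index sequences, sample names, device names, and data layout below are hypothetical and exact-match lookup is a simplification; real demultiplexing typically tolerates index mismatches.

```python
from collections import defaultdict

# Hypothetical sample sheet: index sequence -> sample, and sample -> destination.
INDEX_TO_SAMPLE = {"ACGT": "sample_1", "TTAG": "sample_2"}
SAMPLE_TO_DEVICE = {"sample_1": "server_a", "sample_2": "server_b"}


def assign_clusters(index_calls):
    """After the indexing cycles, map each cluster to a sample (None if unknown)."""
    return {cluster: INDEX_TO_SAMPLE.get(idx) for cluster, idx in index_calls.items()}


def route_cycle(cycle_calls, cluster_to_sample):
    """Group one genomic cycle's base calls by destination device."""
    outbox = defaultdict(dict)
    for cluster, base in cycle_calls.items():
        sample = cluster_to_sample.get(cluster)
        if sample is not None:            # skip clusters with unrecognized indexes
            outbox[SAMPLE_TO_DEVICE[sample]][cluster] = base
    return dict(outbox)


index_calls = {0: "ACGT", 1: "TTAG", 2: "ACGT", 3: "GGGG"}  # cluster id -> index read
cluster_to_sample = assign_clusters(index_calls)

first_cycle = {0: "A", 1: "C", 2: "G", 3: "T"}              # cluster id -> base call
routed = route_cycle(first_cycle, cluster_to_sample)
print(routed)  # {'server_a': {0: 'A', 2: 'G'}, 'server_b': {1: 'C'}}
```

Because the cluster-to-sample map is fixed before the first genomic cycle, each cycle's output can be sent as soon as it is base called instead of being held until the end of the run.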
[0052] As indicated above, the sequencing ordering system provides several technical benefits relative to existing sequencing management systems, such as improving the efficiency and functionality of sequencing task scheduling and nucleotide-sample-slide scheduling relative to existing sequencing management systems. For example, the sequencing ordering system performs the sequencing tasks in an order that minimizes overall computing run times and alleviates delays caused by lengthy or complex tasks. By utilizing the sequencing-task ordering machine-learning model to generate task ordering scores and performing sequencing tasks according to the task ordering scores, for instance, the sequencing ordering system expedites computing run times (e.g., as measured by makespan scores) by decreasing sequencing-task delay frequency and saves memory and/or consumable reagents relative to existing systems. To illustrate, at high task loads, the sequencing-task ordering machine-learning model improves memory management by performing memory-intensive sequencing tasks in an order indicated by task ordering scores, which improves memory utilization and ensures consistent memory availability to provide better resource management. As shown in FIG. 13B, when performing tasks ordered based on task ordering scores from the sequencing-task ordering machine-learning model, the sequencing ordering system decreases a frequency of delays in performing sequencing tasks by 10-30% relative to a first-in-first-out (FIFO) method as measured by makespan scores. Such a decrease in sequencing-task delays translates into improved run times.
[0053] Independent of the sequencing-task ordering machine-learning model, by processing different nucleotide-sample slides according to slide ordering scores generated by a nucleotide-sample-slide ordering machine-learning model, for instance, the sequencing ordering system likewise expedites computing run times (e.g., as measured by makespan scores) by improving completion time and saves memory and/or consumable reagents relative to existing systems. As shown in FIGS. 13A-13B, for example, by performing sequencing tasks based on slide ordering scores from the nucleotide-sample-slide ordering machine-learning model, the sequencing ordering system provides an improvement of nearly 15-25% in median makespan scores and 5-15% in average makespan scores relative to a FIFO method. As further shown in FIG. 13B in particular, when performing tasks ordered based on task ordering scores from the sequencing-task ordering machine-learning model, the sequencing ordering system decreases a frequency of delays in performing sequencing tasks by 10% to 30% relative to a FIFO method as measured by makespan scores.
[0054] As indicated above, in some embodiments, the sequencing ordering system can utilize a combination of the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to order sequencing tasks according to the output ordering scores. This disclosure illustrates an embodiment of such a two-tier sequencing ordering system in FIG. 12. As shown in FIGS. 13A and 13B, such a two-tier sequencing ordering system expedites computing run times by ordering the sequencing tasks based on the task ordering scores and produces makespan scores generally 15% better than a FIFO method. At high task loads, when performing tasks based on scores from both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model, such a two-tier sequencing ordering system outperforms a first-in-first-out algorithm by nearly 30% in median makespan scores and 20% in average makespan scores. As discussed below in reference to FIGS. 13A, 13B, 14, and 15, given a nucleotide-sample-slide sequencing run, the sequencing ordering system reduces the required run time by 20-30% in median makespan scores.
[0055] In addition to improved computing efficiency and run times, the sequencing ordering system can be deployed both on-instrument and off-instrument and offers the flexibility of training/refining the sequencing ordering system with real data that reflects the real-life usage of the instrument. For example, based on ascertained need for a particular type of assay, the ordering machine-learning models can be tuned to reflect the real-life usage of a sequencing device (e.g., tuned to the size/requirements of particular nucleotide-sample-slides). To illustrate, some existing scheduling algorithms utilize a FIFO algorithm based on the upcoming tasks. In contrast, the sequencing ordering system can use a two-tier alignment function that utilizes neural networks incorporating a penalty value (or priority feature) to order and execute sequencing tasks more intelligently.
[0056] In addition to improved computing efficiency and flexibility during implementation, in some embodiments, the sequencing ordering system utilizes ordering machine-learning models to swiftly converge to a solution utilizing a relatively small amount of computing resources, outperforming the training speed seen by current methods of task scheduling. As illustrated by FIG. 15, for example, the sequencing ordering system beats existing heuristics when training on data for less than 5 days and less than 10 iterations deep with negligible time spent in order evaluation (e.g., 1-2% better than the Tetris heuristic model, and 15% over FIFO).
[0057] As mentioned, in the alternative or in addition to flexibly and efficiently ordering sequencing tasks or nucleotide-sample-slides, the sequencing ordering system can determine to which computing devices and at which time of a sequencing run to distribute sample-specific base-call-data files from a sequencing device, thereby expediting the beginning of secondary analysis for different samples. For example, in some embodiments, the sequencing ordering system generates, demultiplexes, and transfers sample-specific base-call-data files to various processing devices during the sequencing run based on the processing requirements of the sample. As further illustrated in FIGS. 7A-7C, during the sequencing run, the sequencing ordering system can determine different subsets of indexing sequences corresponding to different genomic samples that have different corresponding processing parameters for sequencing tasks. Based on identifying different processing parameters for sequencing tasks corresponding to different genomic samples, the sequencing ordering system can preemptively transmit sample-specific base-call-data files to one or more computing devices during the sequencing run rather than storing the base-call-data files for transmission.
[0058] By transmitting the sample-specific base-call-data files during the sequencing run, the sequencing ordering system both saves processing time and reduces storage requirements. For example, in the case of primary sequencing tasks for a sequencing run that produces approximately 16 Tb of data (e.g., paired-end reads with a length of 150 base pairs over approximately 48 hours of run time), existing sequencing systems require local storage of the 16 Tb of base-call-data files and waiting to begin secondary analysis until after a subsequent batch data transfer over network devices of approximately 7 hours (assuming a 5 Gb/s link). In addition, while the hardware on sequencing devices and servers can include increased memory (e.g., chips for a Field Programmable Gate Array (FPGA) or other configurable processors), this memory can be insufficient to store base-call-data files for multiple sequential sequencing runs. Further, by waiting to transfer the primary sequencing task base-call-data files until the end of the instrument run time, existing sequencing systems require local storage of the primary sequencing task base-call-data files, tax the bandwidth of network connections, and delay the start of secondary analysis by up to 55 hours (e.g., 48 hours run time and 7 hours transfer time). In contrast, the sequencing ordering system can transfer the sample-specific base-call-data files to one or more computing devices during the run, thereby relieving local storage requirements for base-call-data files, alleviating bandwidth strain to ensure more efficient network performance, and expediting the start of secondary analysis (e.g., by at least 7 hours).
[0059] As illustrated by the foregoing discussion, the present disclosure utilizes a variety of terms to describe features and advantages of the sequencing ordering system. As used herein, the term “genomic sample” refers to a target genome or portion of a genome undergoing sequencing. For example, a genomic sample includes a sequence of nucleotides isolated or extracted from a sample organism (or a copy of such an isolated or extracted sequence). In particular, a genomic sample includes a full genome that is isolated or extracted (in whole or in part) from a sample organism and composed of nitrogenous heterocyclic bases. A genomic sample can include a segment of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or other polymeric forms of nucleic acids or chimeric or hybrid forms of nucleic acids noted below. In some cases, the genomic sample is found in a sample prepared or isolated by a kit and received by a sequencing device.
[0060] As used herein, the term “nucleotide-sample slide” refers to a plate or substrate, such as a flow cell, comprising oligonucleotides for sequencing nucleotide sequences from genomic samples or other sample nucleic-acid polymers. In particular, a nucleotide-sample slide can refer to a substrate containing fluidic channels through which reagents and buffers can travel as part of sequencing. For example, in one or more embodiments, the nucleotide-sample slide (e.g., a patterned flow cell or non-patterned flow cell) may comprise small fluidic channels and oligonucleotide samples that can be bound to adapter sequences on the substrate. In other implementations, a flow cell can be an open substrate with one or more regions for oligonucleotide samples to be analyzed and the oligonucleotide samples may be positioned using charged pads or other means. In yet another implementation, the nucleotide-sample substrate can be a membrane having a nanopore through which one or more oligonucleotide samples may pass. As indicated above, a flow cell can include tiles and wells (e.g., nanowells) comprising clusters of oligonucleotides. In some cases, a patterned flow cell may take on, but is not limited to, a square, hexagonal, and/or diamond shape.
[0061] As used herein, the term “sample genomic sequence” refers to a nucleotide sequence extracted from, copied from, or complementary to a sample’s chromosome. For example, a sample genomic sequence includes a nucleotide sequence that has been separated or copied from chromosomal DNA of a sample or has been sequenced to be complementary to an extracted or copied nucleotide sequence. Accordingly, a sample genomic sequence includes genomic DNA (gDNA) for a particular unknown sample. Accordingly, as described herein, in some embodiments, the sequence-to-coverage system can use a sample complementary sequence comprising cDNA rather than a sample genomic sequence comprising gDNA in a sample library fragment or wherever suitable cDNA may replace gDNA as understood by a skilled artisan. Indeed, any embodiment or nucleotide read in this disclosure that uses or includes a sample genomic sequence can also use or include a cDNA sequence corresponding to a genomic sample.
[0062] As used herein, the term “indexing sequence” refers to a unique and artificial nucleotide sequence that identifies nucleotide reads for a sample and that is ligated to a sample’s nucleotide sequence (e.g., a gDNA fragment or cDNA fragment) or to another sequence within a sample library fragment. As indicated above, an indexing sequence can be part of a sample library fragment. Similarly, an indexing sequence can be used to sort nucleotide reads by sample or into different files, among other things, such as part of a demultiplexing process. In some cases, a sample library fragment includes an indexing primer sequence that differs from a read priming sequence and that indicates a starting point or starting nucleobase for determining nucleobases of an indexing sequence.
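By way of illustration only, sorting nucleotide reads by sample according to their indexing sequences can be sketched as follows. The index sequences, sample names, and the function `demultiplex` are hypothetical and chosen for this example; the sketch matches indexes exactly and does not tolerate mismatches, which a practical demultiplexing process may need to handle.

```python
from collections import defaultdict

# Hypothetical index-to-sample assignments; not taken from the disclosure.
SAMPLE_BY_INDEX = {"ACGTAC": "sample_A", "TTGCAA": "sample_B"}


def demultiplex(reads, sample_by_index):
    """Group (index_sequence, read_sequence) pairs by sample.

    Reads whose index is not in the sheet go to an "undetermined" bin.
    Matching is exact; mismatch tolerance is deliberately left out.
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = sample_by_index.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)


reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAC", "TTAGGCA"),
    ("NNNNNN", "AAAAAAA"),
]
binned = demultiplex(reads, SAMPLE_BY_INDEX)
for sample, sample_reads in binned.items():
    print(sample, sample_reads)
```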
[0063] As further used herein, the term “sequencing run” refers to an iterative process on a sequencing device to determine a primary structure of nucleotide fragments from a sample (e.g., genomic sample). In particular, a sequencing run includes cycles of sequencing chemistry and imaging performed by a sequencing device that incorporate nucleobases into growing oligonucleotides to determine nucleotide reads from nucleotide sequences extracted from a sample (or other sequences within a library fragment) and seeded throughout a nucleotide-sample slide. In some cases, a sequencing run includes replicating nucleotide fragments from one or more genome samples seeded in clusters throughout a nucleotide-sample slide (e.g., a flow cell). Upon completing a sequencing run, a sequencing device can generate nucleobase-call data in a file, such as a binary base call (BCL) sequence file or a fast-all quality (FASTQ) file.
[0064] Relatedly, as used herein, the term “sequencing cycle” refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to a sample’s sequence (e.g., a genomic or transcriptomic sequence from a sample) or a corresponding adapter sequence. In some cases, a sequencing cycle includes an iteration of both incorporating nucleobases into clusters of oligonucleotides using sequencing chemistry and capturing images of such clusters attached to a flow cell. A sequencing cycle can include one or both of an indexing cycle and a genomic sequencing cycle. For instance, one cluster of oligonucleotides or a set of clusters of oligonucleotides may be undergoing a genomic sequencing cycle in which nucleobases corresponding to a sample genomic sequence are incorporated and another cluster of oligonucleotides or another set of clusters of oligonucleotides may be concurrently undergoing an indexing cycle in which nucleobases corresponding to an indexing sequence for a nucleotide read are incorporated.
[0065] As further used herein, the term “genomic sequencing cycle” refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to a sample genomic sequence (or cDNA sequence). In particular, a genomic sequencing cycle can include an iteration of capturing and analyzing one or more images with data indicating individual nucleobases added or incorporated into an oligonucleotide or to oligonucleotides (in parallel) representing or corresponding to one or more sample genomic sequences. For example, in one or more embodiments, each genomic sequencing cycle involves capturing and analyzing images to determine single reads of DNA (or RNA) strands representing part of a genomic sample (or transcribed sequence from a genomic sample). As suggested above, however, a genomic sequencing cycle, in some cases, is specific to a cluster of oligonucleotides or a set of clusters of oligonucleotides.
[0066] By contrast, the term “indexing cycle” refers to an iteration of adding or incorporating one or more nucleobases to one or more oligonucleotides representing or corresponding to one or more indexing sequences. In particular, an indexing cycle can include an iteration of capturing and analyzing one or more images of clusters of oligonucleotides indicating one or more nucleobases added or incorporated into an oligonucleotide or to oligonucleotides (in parallel) representing or corresponding to one or more indexing sequences. An indexing cycle differs from a genomic sequencing cycle in that an indexing cycle includes sequencing of at least a nucleobase (or a majority of nucleobases) from one or more indexing sequences that identify or encode one or more sample library fragments. Because genomic sequencing cycles may be specific to a cluster or clusters of oligonucleotides, an indexing cycle for one cluster of oligonucleotides may be performed at a same time as a genomic sequencing cycle for another cluster of oligonucleotides.
[0067] As used herein, the term “sequencing task” refers to an operation or a process performed by a computing device as part of determining a sequence of nucleobases for one or more genomic samples (or other nucleotide polymers) or part of saving data from determining such a sequence or from a corresponding analysis. In particular, a nucleotide sequencing task can include an operation or a process performed by a sequencing device that determines nucleobase sequences of fragments from a genomic sample or performed by another computing device (e.g., server) to analyze data for the nucleobase sequences and/or determine variants within the nucleobase sequences with respect to a reference genome. A sequencing task can likewise include an operation or a process of preserving data generated from determining a nucleotide sequence (e.g., base-call data) or an analysis thereof. Accordingly, a nucleotide sequencing task can include, but is not limited to, (i) cluster generation, primer hybridization, image analysis, base calling, demultiplexing, or quality scoring for primary sequencing tasks or (ii) read alignment, variant calling, structural variant detection, functional annotation, taxonomic classification, and genome assembly for secondary sequencing tasks.
[0068] Relatedly, the term “set of sequencing tasks” refers to a group of sequencing tasks performed by one or more computing devices that determine a sequence of nucleobases for one or more sample genomes (or other nucleotide polymers) or save data from determining such a sequence or from a corresponding analysis. In particular, a set of sequencing tasks can include a group of operations or processes (i) performed by a sequencing device to determine nucleobase sequences of fragments from a sample genome or save data related to the determined nucleobase sequences and (ii) performed by another computing device (e.g., server) to analyze data related to the determined nucleobase sequences, determine variants within the nucleobase sequences with respect to a reference genome, or save data resulting from the analyzed data. In some cases, a set of sequencing tasks comprises the primary sequencing tasks and/or secondary sequencing tasks associated with a sequencing run for a genomic sample. In some cases, a set of sequencing tasks comprises tasks starting from a sequencing run that generates base-call data through completing (and storing a copy of) variant analysis of the base-call data.
[0069] Relatedly, as used herein, the term “primary sequencing tasks” refers to sequencing tasks performed for genomic samples by a computing device to generate nucleotide reads and corresponding base-call data. For example, a primary sequencing task can include a task performed by a specialized instrument to generate, transform, or package raw sequencing data, such as generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for nucleotide reads of genomic samples, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, base-call-quality scoring of base calls within the nucleotide reads, or the conversion of base-call data for secondary sequencing tasks.
[0070] Relatedly, as used herein, the term “secondary sequencing tasks” refers to secondary analysis tasks performed on base-call data by a computing device to align nucleotide reads with a reference genome, determine genetic variants based on the aligned nucleotide reads, call genotypes for a genomic sample, and/or interpret the determined genetic variants or nucleotide reads. For example, a secondary sequencing task can include a secondary analysis task performed by a server executing variant-call software to perform genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls. As a further example, a secondary sequencing task can include a tertiary analysis performed by a server executing bioinformatics software to determine potential genetic diseases (or genetic factors correlating with genetic diseases) based on determined genetic variants of a sample.
[0071] As further used herein, the term “nucleobase call” (or simply “base call”) refers to a determination or prediction of a particular nucleobase (or nucleobase pair) for an oligonucleotide (e.g., nucleotide read) during a sequencing cycle or for a genomic coordinate of a genomic sample. In particular, a nucleobase call can indicate a determination or prediction of the type of nucleobase that has been incorporated within an oligonucleotide on a nucleotide-sample slide (e.g., read-based nucleobase calls). In some cases, for a nucleotide read, a nucleobase call includes a determination or a prediction of a nucleobase based on intensity values resulting from fluorescent-tagged nucleotides added to an oligonucleotide of a nucleotide-sample slide (e.g., in a cluster of a flow cell). As suggested above, a single nucleobase call can be an adenine (A) call, a cytosine (C) call, a guanine (G) call, a thymine (T) call, or a uracil (U) call. In some embodiments, the type of base (e.g., adenine, cytosine, thymine, or guanine) can be determined based on intensity values for a signal emitted by labeled nucleobases in a cluster of oligonucleotides, such as signals in 16 quadrature amplitude modulation (QAM) or pulse amplitude modulation (PAM) 4 format.
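By way of illustration only, determining a nucleobase call from intensity values can be sketched as selecting the strongest of four signals per sequencing cycle. This is a deliberately simplified toy model: the one-channel-per-base layout, the intensity values, and the function `call_base` are assumptions made for this example, and the sketch omits the signal corrections and quality scoring a sequencing device would apply.

```python
# Hypothetical layout: one intensity value per base type for a single cluster
# in a single cycle. Real instruments may use fewer channels and corrections.
CHANNELS = ("A", "C", "G", "T")


def call_base(intensities):
    """Return the base whose channel shows the highest intensity."""
    best = max(range(len(CHANNELS)), key=lambda i: intensities[i])
    return CHANNELS[best]


# Made-up intensities for one cluster over three sequencing cycles.
cycles = [
    (0.90, 0.10, 0.05, 0.20),
    (0.10, 0.20, 0.80, 0.10),
    (0.00, 0.10, 0.10, 0.70),
]
read = "".join(call_base(c) for c in cycles)
print(read)
```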
[0072] Additionally, as used herein, the term “base-call-data file” (or “nucleobase-call-data file”) refers to a digital file or other digital information indicating individual nucleobases or the sequence of nucleobases for a nucleic-acid polymer. In particular, a base-call-data file can include nucleotide reads comprising nucleobase calls for particular genomic samples. Nucleobase-call-data files can include intensity values (e.g., color or light intensity values for individual clusters) from images taken by a camera of a nucleotide-sample slide or other data that indicate individual nucleobases or the sequence of nucleobases for a nucleic-acid polymer. In addition, or in the alternative to intensity values, the nucleobase-call-data file may include chromatogram peaks or electrical current changes indicating individual nucleobases in a sequence. Additionally, in some embodiments, a nucleobase-call-data file includes individual nucleobase calls identifying the individual nucleobases (e.g., A, T, C, or G). For example, a nucleobase-call-data file can comprise data for nucleobase calls in a sequence for a nucleic-acid polymer, the number of nucleobase calls corresponding to a particular base (e.g., adenine, cytosine, thymine, or guanine), as organized in a digital file, such as a Binary Base Call (BCL) file or a Fast-All Q (FASTQ) file. The format of the base-call-data file can vary based upon the sequencing technology used and can include BCF, BAM, and QSEQ, as well as other formats. Further, a base-call-data file can include error/accuracy information, such as a quality metric associated with each nucleobase call. In some embodiments, nucleobase-call data comprises information from a sequencing device that utilizes sequencing by synthesis (SBS).
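By way of illustration only, the following sketch reads nucleobase calls and a per-call quality metric from FASTQ-formatted text. It assumes the common four-line record layout and Phred+33 quality encoding, in which each quality character maps to a score Q = ord(character) - 33; the function name and the example record are hypothetical, and the sketch does not validate malformed input.

```python
def parse_fastq(text):
    """Parse FASTQ text into (read_id, sequence, quality_scores) tuples.

    Assumes four lines per record (header, sequence, '+', qualities) and
    Phred+33 encoding. No validation is performed on malformed records.
    """
    lines = [ln for ln in text.splitlines() if ln]
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        scores = [ord(ch) - 33 for ch in qual]  # Phred+33 decoding
        records.append((header[1:], seq, scores))
    return records


# Made-up record: six high-quality calls followed by one lower-quality call.
example = "@read1\nGATTACA\n+\nIIIIII5\n"
records = parse_fastq(example)
for read_id, seq, scores in records:
    print(read_id, seq, scores)
```

Under Phred scaling, a score of Q corresponds to an estimated error probability of 10^(-Q/10), so the scores of 40 and 20 in this example correspond to error probabilities of 0.0001 and 0.01, respectively.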
[0073] As further used herein, the term “sequencing task feature” refers to a factor, metric, or value that quantifies or represents a sequencing task or a computing resource related to one or more sequencing tasks. In particular, a sequencing task feature includes a value indicating a setting, boundary, environment variable, or feature vector in which a nucleobase of a particular nucleobase type can be accurately quantified or analyzed using a sequencing device. For instance, a sequencing task feature includes, but is not limited to, one or more of computing resources, such as accelerator resources, FPGA resources, CPU resources, GPU resources, performance time, and/or memory requirements associated with a sequencing task. By contrast, as used herein, the term “nucleotide-sample-slide features” refers to a factor, metric, or value that quantifies or represents a nucleotide-sample slide or a computing resource related to one or more nucleotide-sample slides. In particular, a nucleotide-sample-slide feature includes a value indicating a setting, boundary, or environment variable in which a nucleotide-sample slide can be accurately quantified or analyzed using a sequencing device. For instance, a nucleotide-sample-slide feature includes processor usage for processing data associated with a nucleotide-sample slide of the set of nucleotide-sample slides, memory requirements for processing data associated with the nucleotide-sample slide, and performance time associated with processing data for the nucleotide-sample slide.
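By way of illustration only, sequencing task features of the kinds listed above could be held in a simple record and flattened into a numeric feature vector. The class name, field names, units, and values below are hypothetical and chosen for this example; the disclosure does not prescribe a particular data layout.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class SequencingTaskFeatures:
    """Hypothetical container for per-task resource features."""
    cpu_cores: float
    gpu_count: float
    fpga_count: float
    performance_hours: float
    memory_gb: float

    def to_vector(self):
        """Flatten the features, in field order, into a list of floats."""
        return [float(v) for v in astuple(self)]


# Made-up values for a read-alignment task.
alignment = SequencingTaskFeatures(
    cpu_cores=16, gpu_count=0, fpga_count=1,
    performance_hours=1.5, memory_gb=64,
)
vector = alignment.to_vector()
print(vector)
```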
[0074] As mentioned above, the sequencing ordering system can generate ordering scores using one or more machine learning models. As used herein, the term “machine-learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular sequencing task set through iterative outputs or predictions based on use of data. For example, a machine learning model can utilize one or more learning techniques to improve in accuracy and/or effectiveness. Example machine learning models include various types of neural networks, decision trees, support vector machines, linear regression models, and Bayesian networks. As described in further detail below, the sequencing ordering system utilizes a sequencing-task ordering machine-learning model, such as a feedforward neural network, to generate or predict task ordering scores indicating a relative order of the set of sequencing tasks based on available computing resources and the set of sequencing task features.
[0075] As used herein, the term “sequencing-task ordering machine-learning model” refers to a machine-learning model that generates task ordering scores indicating a relative order of sequencing tasks. As described in further detail below, the sequencing-task ordering machine-learning model utilizes inputs of sequencing task features and available computing resources (e.g., using model parameters) to generate or predict task ordering scores indicating a relative order of the sequencing tasks. The sequencing-task ordering machine-learning model can generate or predict task ordering scores for either primary sequencing tasks associated with base calling for a genomic sample’s nucleotide reads or secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads. In some cases, a sequencing-task ordering machine-learning model can include a neural network with an input layer for the set of sequencing task features, fully connected hidden layers, activation functions before and after the fully connected hidden layers, and an output layer that outputs the task ordering scores, such as a type of feedforward neural network (or a multilayer perceptron).
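By way of illustration only, the forward pass of a small feedforward network that maps a sequencing task feature vector to a task ordering score can be sketched as follows. The layer sizes, weights, feature layout, task names, and the convention that a higher score means an earlier position are all assumptions made for this example. The weights are hand-picked and untrained, so the resulting order carries no meaning beyond showing the data flow.

```python
def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Untrained, hand-picked parameters (hypothetical): 3 inputs -> 2 hidden -> 1.
W1 = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
B1 = [0.0, 0.1]
W2 = [[1.0, -1.0]]
B2 = [0.0]


def task_ordering_score(features):
    """Input layer -> hidden layer with ReLU -> single output score."""
    hidden = relu(dense(features, W1, B1))
    return dense(hidden, W2, B2)[0]


# Hypothetical features: [estimated hours, memory in 100s of GB, needs FPGA].
# Available computing resources could be appended as further inputs.
tasks = {
    "base_calling": [2.0, 0.5, 1.0],
    "alignment": [1.0, 2.0, 1.0],
    "variant_calling": [3.0, 1.0, 0.0],
}
scores = {name: task_ordering_score(f) for name, f in tasks.items()}
# Assumed convention for this sketch: higher score is scheduled earlier.
order = sorted(scores, key=scores.get, reverse=True)
print(order)
```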
[0076] By contrast, the term “nucleotide-sample-slide ordering machine-learning model” refers to a machine-learning model that generates slide ordering scores indicating a relative order of processing nucleotide-sample slides. The nucleotide-sample-slide ordering machine-learning model can generate or predict slide ordering scores for determining an order of nucleotide-sample slides on which to perform primary sequencing tasks associated with base calling or for which to perform secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads. In some cases, a nucleotide-sample-slide ordering machine-learning model includes a feedforward neural network that generates or predicts slide ordering scores indicating a relative order of sequencing tasks based on available computing resources and sequencing task features. For example, the nucleotide-sample-slide ordering machine-learning model can include a neural network with an input layer for the set of sequencing task features, fully connected hidden layers, activation functions before and after the fully connected hidden layers, and an output layer that outputs the task ordering scores — such as a type of feedforward neural network (or a multilayer perceptron).
[0077] As used herein, the term “makespan score” refers to a measure of the total time or duration required to complete a set of tasks, such as sequencing tasks (e.g., primary or secondary sequencing tasks) or tasks for processing data corresponding to a nucleotide-sample slide. For example, a makespan score is used to evaluate the efficiency and performance of scheduling algorithms, production processes, or resource allocation by the sequencing-task ordering machine-learning model or the nucleotide-sample-slide ordering machine-learning model. In one or more embodiments, the makespan score quantifies the time taken from the start of the first sequencing task until the completion of the last sequencing task, considering factors such as sequencing task duration, resource availability, and sequencing task features.
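By way of illustration only, a makespan for a given task order can be computed by simulating a simple scheduler. The sketch below assumes identical, interchangeable workers and assigns each task, in order, to whichever worker becomes free first; this greedy list-scheduling rule, the task names, and the durations are assumptions made for this example and are not presented as the scheduling approach of the sequencing ordering system.

```python
import heapq


def makespan(durations_in_order, workers):
    """Time from the start of the first task to the end of the last one.

    Each task, taken in the given order, runs on the worker that frees up
    earliest. Workers are assumed identical and tasks independent.
    """
    free_at = [0.0] * workers  # min-heap of times at which workers free up
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations_in_order:
        start = heapq.heappop(free_at)
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Made-up task durations in hours.
durations = {"t1": 3.0, "t2": 3.0, "t3": 2.0, "t4": 2.0, "t5": 2.0}
order_a = ["t3", "t4", "t5", "t1", "t2"]
order_b = ["t1", "t3", "t4", "t2", "t5"]

makespan_a = makespan([durations[t] for t in order_a], workers=2)
makespan_b = makespan([durations[t] for t in order_b], workers=2)
print(makespan_a, makespan_b)
```

In this example the same five tasks finish in 7 hours under the first order and 6 hours under the second, which shows how the relative order of tasks alone can change the makespan.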
[0078] As used herein, the term “configurable processor” refers to a circuit or chip that can be configured or customized to perform a specific application. For instance, a configurable processor includes an integrated circuit chip that is designed to be configured or customized on site by an end user’s computing device to perform a specific application. Configurable processors include, but are not limited to, an ASIC, ASSP, a coarse-grained reconfigurable array (CGRA), or FPGA. By contrast, configurable processors do not include a CPU or GPU. In some embodiments, the accelerated genotype-imputation system uses a configurable processor (e.g., FPGA) and/or a processor (e.g., CPU) to perform the various embodiments described herein.
[0079] As used herein, for example, the term “processing parameters” refers to values, specifications, or variables that indicate how a computing device performs primary or secondary analysis or a particular sequencing task. For instance, processing parameters include a particular type of secondary analysis for a genomic sample (e.g., secondary analysis based on whole genome sequencing versus a cancer array, different sequencing tasks requiring an FPGA or other configurable processor versus a CPU or other non-configurable processor), analysis rights for a genomic sample (e.g., different laboratories or patients having different ownership rights to different samples, different privacy rights), a category of analysis for the genomic sample (e.g., methylation estimates versus variant calling), or a sample size for a genomic sample (e.g., different numbers of oligonucleotide clusters in a flow cell for samples). In some cases, processing parameters can additionally or alternatively include other parameters, such as configuration data, clock settings, resource allocation, input/output definitions, signal timing, security settings, and functional unit configuration used to configure an ASIC, ASSP, CGRA, and/or FPGA.
[0080] The following paragraphs describe the sequencing ordering system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a computing system 100 in which a sequencing ordering system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes a server device(s) 102 connected to one or more server device(s) 110, a sequencing device 108, and a client device(s) 114 via a network 118. While FIG. 1 shows an embodiment of the sequencing ordering system 106, this disclosure describes alternative embodiments and configurations below.
[0081] As shown in FIG. 1, the server device(s) 102, the sequencing device 108, the server device(s) 110, and the client device(s) 114 can communicate with each other via the network 118. The network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 19.
[0082] As indicated by FIG. 1, the sequencing device 108 comprises a device for sequencing a genomic sample or other nucleic-acid polymer. In some embodiments, the sequencing device 108 analyzes nucleic-acid segments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems (described herein) either directly or indirectly on the sequencing device 108. More particularly, the sequencing device 108 receives nucleotide-sample slides comprising nucleotide fragments extracted from samples and then copies and determines the nucleobase sequence of such extracted nucleotide fragments. In one or more embodiments, the sequencing device 108 utilizes SBS to sequence nucleic-acid polymers into nucleotide reads. In addition, or in the alternative to communicating across the network 118, in some embodiments, the sequencing device 108 bypasses the network 118 and communicates directly with the server device(s) 102 or the client device(s) 114.
[0083] As further indicated by FIG. 1, the server device(s) 102 is located at or near a same physical location of the sequencing device 108. Indeed, in some embodiments, the server device(s) 102 and the sequencing device 108 are integrated into a same computing device, as indicated by dotted lines 122. The server device(s) 102 may run a sequencing system 104 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As shown in FIG. 1, the sequencing device 108 may send (and the server device(s) 102 may receive) base-call data generated during a sequencing run of the sequencing device 108. By executing software from the sequencing system 104, the server device(s) 102 may align nucleotide reads with a reference genome and determine genetic variants based on the aligned nucleotide reads. The server device(s) 102 may also communicate with the client device(s) 114. In particular, the server device(s) 102 can send data to the client device(s) 114, including sequencing information for nucleotide sequencing tasks, binary base call (BCL) sequence files, sequence read archive (SRA) files, variant call format (VCF) files, fast-all quality (FASTQ) files, or other information indicating nucleobase calls, sequencing metrics, error data, other sequencing related information, or other metrics.
[0084] As further indicated by FIG. 1, the server device(s) 110 are located remotely from the server device(s) 102 and the sequencing device 108. Similar to the server device(s) 102, in some embodiments, the server device(s) 110 can include a version of the sequencing system 104. Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as data for scheduling nucleobase calls or sequencing nucleic-acid polymers. Similarly, the sequencing device 108 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 108. The server device(s) 110 may also communicate with the client device(s) 114. In particular, the server device(s) 110 can send data to the client device(s) 114, including status information for nucleotide sequencing tasks, binary base call (BCL) sequence files, sequence read archive (SRA) files, variant call format (VCF) files, fast-all quality (FASTQ) files, or other information indicating nucleobase calls, sequencing metrics, error data, other sequencing related information, or other metrics.
[0085] In some embodiments, the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server.
[0086] As further illustrated and indicated in FIG. 1, the client device(s) 114 can generate, store, receive, and send digital data. In particular, the client device(s) 114 can receive status data from the server device(s) 102 or receive sequencing metrics from the sequencing device 108. Furthermore, the client device(s) 114 may communicate with the server device(s) 102 or the server device(s) 110 to receive a VCF comprising nucleobase calls and/or other metrics, such as sequencing metrics or error data. The client device(s) 114 can accordingly present or display information pertaining to variant calls or other nucleobase calls to a user associated with the client device(s) 114. For instance, as shown in FIG. 1, in one or more embodiments, the sequencing ordering system 106 determines, during a sequencing run, task ordering scores for the set of sequencing tasks corresponding to genomic samples and parameters for secondary sequencing tasks corresponding to the genomic samples. Further, the server device(s) 102, the sequencing device 108, and/or the server device(s) 110 transmit the task ordering scores for the set of sequencing tasks, the parameters for secondary sequencing tasks, and/or the base-call-data files to the client device(s) 114 indicating a relative order of the secondary sequencing tasks corresponding to the genomic samples.
[0087] Although FIG. 1 depicts the client device(s) 114 as a desktop or laptop computer, the client device(s) 114 may comprise various types of client devices. For example, in some embodiments, the client device(s) 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device(s) 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device(s) 114 are discussed below with respect to FIG. 19.
[0088] As further illustrated in FIG. 1, the client device(s) 114 includes a sequencing application 116. The sequencing application 116 may be a web application or a native application stored and executed on the client device(s) 114 (e.g., a mobile application, desktop application). The sequencing application 116 can include instructions that (when executed) cause the client device(s) 114 to receive data from the sequencing ordering system 106 and present, for display at the client device(s) 114, data concerning a status of a nucleotide sequencing task or data from a VCF. Furthermore, the sequencing application 116 can instruct the client device(s) 114 to display the status for nucleotide sequencing tasks.
[0089] As further illustrated in FIG. 1, a version of the sequencing ordering system 106 may be located on the client device(s) 114 as part of the sequencing application 116 or on the server device(s) 110. Accordingly, in some embodiments, the sequencing ordering system 106 is implemented (e.g., located entirely or in part) on the client device(s) 114. In yet other embodiments, the sequencing ordering system 106 is implemented by one or more other components of the computing system 100, such as the server device(s) 110. In particular, the sequencing ordering system 106 can be implemented in a variety of different ways across the server device(s) 102, the sequencing device 108, the client device(s) 114, and the server device(s) 110. For example, the sequencing ordering system 106 can be downloaded from the server device(s) 110 to the server device(s) 102 and/or the client device(s) 114 where all or part of the functionality of the sequencing ordering system 106 is performed at each respective device within the computing system 100.
[0090] As indicated above, the sequencing ordering system 106 can analyze features of sequencing tasks or nucleotide-sample slides and generate task ordering scores or slide ordering scores. In accordance with one or more embodiments, FIGS. 2A-2B illustrate schematic diagrams of the sequencing ordering system 106 determining ordering scores for ordering sequencing tasks. FIG. 2A illustrates a schematic diagram of the sequencing ordering system 106 determining task ordering scores for sequencing tasks and performing the sequencing tasks in a relative order according to the task ordering scores in accordance with one or more embodiments of the present disclosure. FIG. 2B illustrates a schematic diagram of the sequencing ordering system 106 determining slide ordering scores for nucleotide-sample slides and processing the nucleotide-sample slides in a relative order according to the slide ordering scores in accordance with one or more embodiments of the present disclosure.
[0091] As shown in FIG. 2A, for instance, the sequencing ordering system 106 identifies or receives data for a genomic sample(s) 202 to be queued for processing in a sequencing run or for secondary analysis. As further shown, the sequencing ordering system 106 determines sequencing tasks 204 associated with processing the genomic sample(s) 202 for the sequencing run or the secondary analysis. As mentioned, the sequencing tasks 204 can include both primary sequencing tasks and secondary sequencing tasks. For example, the sequencing ordering system 106 can identify data from a FASTQ or BCL file comprising nucleotide reads for a genomic sample(s) 202, which may include any biological specimen or culture that potentially contains the target of interest.
[0092] As indicated above, clusters of oligonucleotides extracted from the genomic sample(s) 202 may be imaged or scanned for subsequent analysis utilizing the sequencing tasks 204. For example, the sequencing ordering system 106 can perform the sequencing tasks 204 including primary sequencing tasks, such as indexing cycles to determine nucleobase calls for indexing sequences, generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for nucleotide reads of genomic samples, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, and/or base-call-quality scoring of base calls within the nucleotide reads. For example, the sequencing ordering system 106 can perform the sequencing tasks 204 including secondary sequencing tasks, such as genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
[0093] Furthermore, the sequencing ordering system 106 can determine sequencing task features 206 associated with the sequencing tasks 204 of the genomic sample(s) 202. For example, the sequencing ordering system 106 can determine the sequencing task features 206 indicating available computing resources, sequencing task processor usage, sequencing task memory requirements, and/or a sequencing task performance time for respective sequencing tasks of the sequencing tasks 204. As further examples, the sequencing ordering system 106 can determine the sequencing task features 206 that include the available accelerator resources (e.g., FPGA, CPU, GPU) for sequencing the genomic samples.
[0094] As also shown in FIG. 2A, the sequencing ordering system 106 can determine a sequencing task relative order 208 and task ordering scores 210. In particular, the sequencing ordering system 106 can generate the task ordering scores 210 that indicate a relative order for implementing the sequencing tasks 204. In one or more embodiments, the sequencing ordering system 106 determines the sequencing task relative order 208 utilizing a sequencing-task ordering machine-learning model (e.g., a neural network) composed of fully connected layers combined with activation functions to produce alignment values that provide the task ordering scores 210 indicating the schedule for the sequencing tasks 204. The sequencing ordering system 106 can generate the task ordering scores 210 to minimize a determined makespan value while also accounting for priority values.
[0095] As further shown in FIG. 2A, the sequencing ordering system 106 can then provide the task ordering scores 210 to the sequencing device indicating an order for the ordered tasks 212 and indicating an order to enact a set of sequencing tasks 214. For example, the sequencing ordering system 106 provides the task ordering scores 210 indicating an order for the set of sequencing tasks 214 used to sequence the nucleic-acid polymers present in the genomic samples received by a sequencing device. As another example, the sequencing ordering system 106 provides the task ordering scores 210 indicating an order for the set of sequencing tasks 214 used to map the nucleotide reads to genomic coordinates of a reference genome. As mentioned, the sequencing ordering system 106 can provide the task ordering scores 210 for the set of sequencing tasks 214, thereby prompting the sequencing device to schedule primary sequencing tasks and/or secondary sequencing tasks of the sequencing tasks 204.
[0096] As indicated above, the sequencing ordering system 106 can also analyze features of nucleotide-sample slides to determine slide ordering scores for nucleotide-sample slides and for processing the nucleotide-sample slides in a relative order according to the slide ordering scores. As shown in FIG. 2B, the sequencing ordering system 106 can determine slide ordering scores associated with a nucleotide-sample-slide relative order. More particularly, the sequencing ordering system 106 receives or detects nucleotide-sample-slide(s) 216 (e.g., flow cells) comprising oligonucleotides extracted from genomic samples. In particular, the nucleotide-sample-slide(s) 216 can refer to a slide containing fluidic channels through which reagents and buffers can travel as part of sequencing. For example, in one or more embodiments, the nucleotide-sample-slide(s) 216 includes a flow cell (e.g., a patterned nucleotide-sample-slide or non-patterned nucleotide-sample-slide) comprising small fluidic channels and short oligonucleotides complementary to binding adapter sequences. The nucleotide-sample-slide(s) 216 can include wells (e.g., nanowells) comprising clusters of oligonucleotides.
[0097] As indicated above, the nucleotide-sample-slide(s) 216 may be imaged or scanned for subsequent analysis utilizing sequencing tasks 218. For example, the sequencing ordering system 106 can perform the sequencing tasks 218 including primary sequencing tasks, such as generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of the genomic sample, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads. For example, the sequencing ordering system 106 can perform the sequencing tasks 218 including secondary sequencing tasks, such as genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
[0098] As further shown, the sequencing ordering system 106 can determine nucleotide-sample-slide features 220 associated with processing data for each of the nucleotide-sample-slide(s) 216. For example, the sequencing ordering system 106 can determine the nucleotide-sample-slide features 220 indicating available computing resources, processor usage, memory requirements, and/or a performance time for processing data associated with the nucleotide-sample-slide(s) 216. For example, the sequencing ordering system 106 can determine the nucleotide-sample-slide features 220 that include the available resources for a set of primary and/or secondary sequencing tasks associated with processing base calls for the nucleotide-sample-slide(s) 216.
[0099] As also shown in FIG. 2B, the sequencing ordering system 106 can determine a nucleotide-sample-slide relative order 222 and slide ordering scores 224. In particular, the sequencing ordering system 106 can generate the slide ordering scores 224 that indicate a relative order of the nucleotide-sample-slide(s) 216 and the sequencing tasks 218. In one or more embodiments, the sequencing ordering system 106 determines the nucleotide-sample-slide relative order 222 utilizing a sequencing-task ordering machine-learning model as shown in FIG. 5 to produce alignment values that provide slide ordering scores 224 indicating a schedule alignment for sequencing tasks 218. The sequencing ordering system 106 can generate the slide ordering scores 224 to minimize a determined makespan value while also accounting for priority values.
[0100] As shown in FIG. 2B, the sequencing ordering system 106 can subsequently provide the slide ordering scores 224 to the sequencing device indicating an order for the sequencing device to perform the sequencing tasks 228 (e.g., by aligning the sequencing tasks 218). For example, the sequencing ordering system 106 determines the slide ordering scores 224 and provides the slide ordering scores 224 to the sequencing device. Furthermore, the sequencing device orders the nucleotide-sample-slide(s) 216 to process ordered nucleotide-sample-slide(s) 226 based on the slide ordering scores 224. As mentioned, the sequencing ordering system 106 can provide an order for scheduling both primary sequencing tasks and/or secondary sequencing tasks of the sequencing tasks 228.
[0101] As mentioned, the sequencing ordering system 106 utilizes a sequencing-task ordering machine-learning model to determine task ordering scores for scheduling tasks associated with a genomic sample. FIG. 3 illustrates a schematic diagram of utilizing the sequencing-task ordering machine-learning model to determine task ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure. As shown, the sequencing ordering system 106 identifies or receives data for a genomic sample 302. As further shown, the sequencing ordering system 106 determines sequencing tasks 304 associated with processing and sequencing the genomic sample 302. As mentioned, the sequencing tasks 304 include both primary and secondary sequencing tasks associated with processing and sequencing the genomic sample 302. In certain implementations, the sequencing ordering system 106 can include approximately 247 different types of sequencing tasks 304.
[0102] As shown in FIG. 3, the sequencing ordering system 106 can access or identify sequencing task features 306 indicating available computing resources as a metric, a setting, a boundary, an environment variable, and/or a feature vector. For example, the sequencing task features can include a task processor usage feature 308, a task memory requirements feature 310, and/or a task performance time feature 312 for respective sequencing tasks of the sequencing tasks 304. The sequencing task features 306 can include features that assess the computational infrastructure required for primary sequencing tasks, such as reading the nucleotide sequences, and secondary sequencing tasks, such as aligning and assembling these sequences into a genome. For example, the sequencing task features 306 include values for the task processor usage feature 308 that quantify the computational power associated with the sequencing tasks 304, including the number and type of processors. To illustrate, the task processor usage feature 308 can include the number of FPGAs/CPUs/GPUs and the amount of available RAM associated with the sequencing tasks 304.
[0103] The task processor usage feature 308 can also include data representing the computational load on the sequencing system 104 (e.g., sequencing device 108, server device(s) 102, server device(s) 110, and/or client device(s) 114) and can be operationalized as the percentage of processor time required or as the intensity of the computations needed. For example, the task processor usage feature 308 includes values for required processing power that influence the capacity of the sequencing ordering system 106 to process primary sequencing tasks, such as nucleotide identification, and secondary sequencing tasks, such as sequence assembly and annotation. To illustrate, by accounting for the task processor usage feature 308 when processing primary sequencing tasks, the sequencing ordering system 106 can provide the task ordering scores 318 that account for the high processor usage of real-time base calling algorithms due to their computational intensity. To further illustrate, by accounting for the task processor usage feature 308 when processing secondary sequencing tasks, the sequencing ordering system 106 can provide the task ordering scores 318 that account for the use of substantial processing power when comparing sequences against reference genomes and identifying genetic variations, such as for variant calling or genome-wide association studies. In certain implementations, the task processor usage feature 308 includes data representing CPU usage within the range of 5-10 cores per sequencing task and FPGA usage of 1-3 FPGA subdivisions per sequencing task.
[0104] The sequencing task features 306 can further include data representing the task memory requirements feature 310 to quantify the memory requirements, including the amount of RAM needed to execute the sequencing tasks 304. The sequencing ordering system 106 utilizes the task memory requirements feature 310 for genomic sequencing tasks in part because both primary and secondary tasks often involve large datasets. To illustrate, the sequencing ordering system 106 can utilize a high value for the task memory requirements feature 310 to account for sequencing tasks like storing raw sequencing data during primary sequencing or processing large amounts of genomic data during secondary analyses. The sequencing ordering system 106 can utilize the task memory requirements feature 310 to account for task memory requirements of primary sequencing tasks that can generate gigabytes of data per sequencing run, necessitating the use of significant memory for efficient base calling. The sequencing ordering system 106 can utilize the task memory requirements feature 310 to account for the task memory requirements of secondary sequencing tasks, like sequence alignment and annotation, which have memory requirements based on processing multiple genomic sequences simultaneously for comparison and analysis.
[0105] In some embodiments, the sequencing task features 306 include data representing the task performance time feature 312 to quantify the time requirements needed to execute the sequencing tasks 304. To illustrate, the sequencing ordering system 106 can utilize the task performance time feature 312 to account for the time taken to complete the sequencing tasks 304. For example, the sequencing ordering system 106 can use the task performance time feature 312 to account for the throughput rate of the sequencer for primary sequencing tasks and use the task performance time feature 312 to account for the duration of computational analyses of secondary sequencing tasks such as comparative genomics.
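The processor usage, memory requirements, and performance time features described above could be gathered into one numeric feature vector per sequencing task. The following is a minimal sketch under stated assumptions: the class name `TaskFeatures`, its field names, the units, and all example values are hypothetical and do not come from the disclosure, apart from the CPU and FPGA values being chosen inside the ranges mentioned above (5-10 cores and 1-3 FPGA subdivisions per task).

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class TaskFeatures:
    """Hypothetical per-task feature record; field names are illustrative only."""
    performance_time_s: float  # estimated time to complete the task
    cpu_cores: int             # processor usage (CPU)
    memory_gb: float           # memory requirement
    fpga_subdivisions: int     # processor usage (FPGA)

    def as_vector(self) -> List[float]:
        # Flatten the record into a numeric vector a model could consume.
        return [float(value) for value in astuple(self)]


# Made-up example values for a single secondary sequencing task.
alignment_features = TaskFeatures(
    performance_time_s=1800.0, cpu_cores=8, memory_gb=64.0, fpga_subdivisions=2
)
feature_vector = alignment_features.as_vector()
print(feature_vector)
```

In practice such values would likely be normalized before being passed to a model, since time, memory, and core counts sit on very different scales.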
[0106] To illustrate, the sequencing ordering system 106 can include the sequencing task features 306 of the task performance time feature 312, the task processor usage feature 308 (CPU), the task memory requirements feature 310, and the task processor usage feature 308 (FPGA) such as:
[0107] As further shown in FIG. 3, the sequencing ordering system 106 utilizes a sequencing-task ordering machine-learning model 314 to generate task ordering scores 318. By processing the sequencing task features 306 and accounting for available computing resources (e.g., using model parameters), the sequencing-task ordering machine-learning model generates task ordering scores 318 indicating a relative order for the primary/secondary sequencing tasks 320. In one or more embodiments, the sequencing-task ordering machine-learning model 314 includes a neural network composed of fully connected layers combined with activation functions to produce alignment values that provide task ordering scores 318. This disclosure describes an example architecture for a sequencing-task ordering machine-learning model with respect to FIG. 5 below.
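To make the idea of fully connected layers combined with activation functions concrete, the sketch below runs a forward pass of a small two-layer network that maps a four-value task feature vector to a single score. It is an illustration only: the layer sizes, the ReLU activation, the random untrained weights, the function names, and the normalized feature values are all assumptions, and the architecture actually described with respect to FIG. 5 may differ.

```python
import random


def relu(values):
    # Element-wise rectified linear activation.
    return [max(0.0, v) for v in values]


def dense(inputs, weights, bias):
    # Fully connected layer: one weight row and one bias per output unit.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    # Random, untrained weights; a real model would learn these.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def score_task(features, layers):
    # Apply each layer, with ReLU between layers but not after the last one.
    hidden = features
    for index, (weights, bias) in enumerate(layers):
        hidden = dense(hidden, weights, bias)
        if index < len(layers) - 1:
            hidden = relu(hidden)
    return hidden[0]


rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 1, rng)]

# Hypothetical normalized features: [time, CPU, memory, FPGA].
tasks = {
    "base_calling": [0.6, 0.8, 0.5, 0.33],
    "alignment": [0.9, 1.0, 0.9, 0.66],
}
scores = {name: score_task(features, layers) for name, features in tasks.items()}
print(scores)
```

Because the weights are untrained, the printed scores carry no meaning; the sketch only shows how feature vectors flow through the layers to yield one score per task.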
[0108] As mentioned, the sequencing-task ordering machine-learning model 314 generates task ordering scores 318. The task ordering scores 318 represent values that reflect the assessed priority of the sequencing tasks 304 to maximize the runtime efficiency of the sequencing tasks within the sequencing ordering system 106. For example, the sequencing-task ordering machine-learning model 314 generates task ordering scores 318 that can be used to execute the sequencing tasks 304 with more efficient utilization of resources, reduced turnaround times for the sequencing tasks 304, and an overall increase in the throughput of the genomic sequencing process. In particular, the sequencing-task ordering machine-learning model 314 generates the task ordering scores 318 that minimize the makespan value for performing the sequencing tasks 304. In this way, the sequencing ordering system 106 can strategically order the sequencing tasks 304 (e.g., particularly in high-volume environments) to provide significant improvements in productivity and efficiency.
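A makespan is the elapsed time from the start of the first task to the completion of the last task. The sketch below shows how the order of tasks can change the makespan. It assumes identical workers and a greedy rule that hands each task, in order, to the earliest-free worker; neither assumption is stated in this section, and the function name and durations are hypothetical.

```python
import heapq


def makespan(durations_in_order, num_workers):
    """Finish time of the last task when each task, taken in the given
    order, is assigned to whichever worker becomes free first."""
    free_at = [0.0] * num_workers  # time at which each worker is next free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations_in_order:
        start = heapq.heappop(free_at)
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# Hypothetical task durations (arbitrary time units) on two workers.
short_first = makespan([1.0, 1.0, 4.0], num_workers=2)
long_first = makespan([4.0, 1.0, 1.0], num_workers=2)
print(short_first, long_first)  # 5.0 4.0
```

Running the long task first lets the two short tasks share the other worker, which is the kind of gain an ordering score aimed at a lower makespan would try to capture.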
[0109] As further shown in FIG. 3, the sequencing-task ordering machine-learning model 314 utilizes the task ordering scores 318 to provide a ranking for the sequencing tasks 304 indicating a relative order for the primary/secondary sequencing tasks 320. As shown, the sequencing tasks 304 are arranged, not on an arbitrary basis, but in a sequence that reflects their assessed priority from the task ordering scores 318. In some cases, the sequencing ordering system 106 causes the sequencing task 304 with the highest score of the task ordering scores 318 to be scheduled first. The sequencing ordering system 106 further causes the sequencing tasks 304 to be performed according to the task ordering scores 318 on the computing device(s) 322. The computing device(s) 322 can include a sequencing device and/or a computing server device. To illustrate, the computing device(s) 322 can include one or more of the sequencing device 108, the server device(s) 102, the server device(s) 110, and the client device(s) 114 as described with relation to FIG. 1.
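Turning scores into a relative order, with the highest-scoring task scheduled first, can be sketched as a simple sort. The function name, task names, and score values below are hypothetical.

```python
def order_tasks(task_scores):
    """Return task names sorted so the highest score is scheduled first.
    Ties keep their original insertion order because sorted() is stable."""
    return sorted(task_scores, key=lambda name: task_scores[name], reverse=True)


# Made-up scores for three sequencing tasks.
task_scores = {"demultiplexing": 0.42, "base_calling": 0.91, "variant_calling": 0.17}
ordered = order_tasks(task_scores)
print(ordered)  # ['base_calling', 'demultiplexing', 'variant_calling']
```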
[0110] As mentioned, the sequencing ordering system 106 provides sequencing task features to the sequencing-task ordering machine-learning model. FIG. 4 illustrates providing primary and/or secondary sequencing task features to the sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure. The following paragraphs provide examples of such primary and/or secondary sequencing task features.
[0111] As shown in FIG. 4, in certain embodiments, the sequencing ordering system 106 receives or identifies sequencing task features, including primary sequencing task features 404 and secondary sequencing task features 418, associated with a nucleotide-sample slide 402. The sequencing task features can include a metric, a setting, a boundary, an environment variable, or a feature vector representing the performance time, processor usage (e.g., CPU and FPGA), memory usage, and/or other resource requirements for each sequencing task. Based on the genomic samples and the sequencing task requirements of the sequencing run, the sequencing ordering system 106 can receive or identify the primary sequencing task features 404 including an oligonucleotide-cluster-generation feature 406, a hybridizing primers feature 408, an analyzing images feature 410, a base calling feature 412, a demultiplexing-nucleotide-reads feature 414, and/or a base-call-quality scoring feature 416.
[0112] As shown, in certain embodiments, the sequencing ordering system 106 can access or identify the oligonucleotide-cluster-generation feature 406 associated with the nucleotide-sample slide 402. The oligonucleotide-cluster-generation feature 406 can include data quantifying the computing (e.g., processor, time, or memory) resources required for attaching oligonucleotides onto a specially coated slide or nucleotide-sample-slide so that they are spatially separated into distinct, individual clusters. As another example, the oligonucleotide-cluster-generation feature 406 can include data quantifying the computing resources required for amplifying the oligonucleotides to create a dense area of identical DNA fragments. As another example, in some embodiments, the oligonucleotide-cluster-generation feature 406 includes data quantifying the computing resources required for bridge amplification, where each bound DNA fragment is copied in situ, creating a localized amplification of DNA sequences.
[0113] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the hybridizing primers feature 408 associated with sequencing the clusters of oligonucleotides on the nucleotide-sample slide 402. In some cases, the hybridizing primers feature 408 can include data quantifying the computing (e.g., processor, time, or memory) resources required for heating the DNA to separate its two strands and then cooling it to allow the primers to bind, or anneal, to their complementary sequences on the single-stranded DNA. For example, the hybridizing primers feature 408 can include data quantifying the computing resources required to align the primer sequence with a specific segment of the template DNA, and through hydrogen bonding, form stable, double-stranded structures at a complementary site.
[0114] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the analyzing images feature 410 associated with the nucleotide-sample slide 402. In certain embodiments, the analyzing images feature 410 includes data quantifying the computing (e.g., processor, time, memory) resources required to add fluorescently labeled nucleotides to sequence the clusters of DNA fragments after they have been amplified on a nucleotide-sample-slide. For example, the analyzing images feature 410 can include data quantifying the computing resources required to capture the image of the fluorescent signal as each nucleotide is incorporated into the growing DNA strand during the sequencing reaction, which fluorescent signal corresponds to the identity of the incorporated nucleotide. For example, the analyzing images feature 410 can include data quantifying the computing resources required to process the signals to identify which of the four nucleotides (adenine, thymine, cytosine, or guanine) has been added to each cluster during each cycle of the sequencing process and assigning them to specific positions in the DNA sequence. As a further example, the analyzing images feature 410 can include data quantifying the computing resources required to account for variations in signal intensity and quality, as well as correct for any cross-talk between channels that detect different colors of fluorescence.
[0115] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the base calling feature 412 associated with the nucleotide-sample slide 402. In some embodiments, the base calling feature 412 includes data quantifying the computing (e.g., processor, time, memory) resources required to determine the sequence of nucleotides (adenine, thymine, cytosine, and guanine) in a strand of DNA from raw data obtained during the sequencing run. For example, the base calling feature 412 includes data quantifying the computing resources required to translate the complex signals captured from sequencing into the actual sequence of bases for further biological analysis and interpretation. As a further example, the base calling feature 412 can include data quantifying the computing resources required to analyze the signals, which can consist of fluorescent or electrical changes, to assign a base to each signal peak. In addition, the base calling feature 412 can include data quantifying the computing resources required to determine a confidence score assessing the quality of each called base, which indicates the likelihood that each base was identified correctly.
[0116] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the demultiplexing-nucleotide-reads feature 414 associated with the resources required to index sequences corresponding to the genomic samples of the nucleotide-sample slide 402. In some cases, the demultiplexing-nucleotide-reads feature 414 includes data quantifying the computing (e.g., processor, time, memory) resources required to separate mixed sequence data into distinct samples based on unique identifiers such as index sequences or barcodes. To further illustrate, the demultiplexing-nucleotide-reads feature 414 can include data quantifying the computing resources required to demultiplex the combination of reads from all the genomic samples to identify the barcode sequences within each read and assign the read to the corresponding sample. For example, the demultiplexing-nucleotide-reads feature 414 includes data quantifying the computing resources required to incorporate barcode design into sequencing libraries, provide quality control to ensure accurate read attribution, and sort data to organize reads by sample.
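The demultiplexing task whose resource cost this feature quantifies, assigning each read to a sample by its barcode, can be sketched as follows. This is a simplified illustration: it assumes the barcode sits at the start of each read and must match exactly, whereas real pipelines often read indexes separately and tolerate mismatches. The function name, barcodes, sample names, and reads are all made up.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by sample using an exact match on a leading barcode.
    Reads with an unknown barcode go to an 'undetermined' bin."""
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = ["ACGTGGCATT", "TTAGCCGATA", "ACGTTTTACG", "GGGGAAAACC"]
by_sample = demultiplex(reads, barcodes, barcode_len=4)
print(by_sample)
```

The work grows with the number of reads, which is one reason the time and memory cost of demultiplexing is a useful input when ordering tasks.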
[0117] As shown, in certain embodiments, the sequencing ordering system 106 can access the base-call-quality scoring feature 416 corresponding to the genomic samples of the nucleotide-sample slide 402. In some cases, the base-call-quality scoring feature 416 includes data quantifying the computing (e.g., processor, time, memory) resources required to assign a confidence value (indicative of how likely it is that each base call is correct) to each nucleotide identified in a DNA sequence during the sequencing process. As a further example, the base-call-quality scoring feature 416 includes data quantifying the computing resources required to generate a score that is represented on a logarithmic scale, where a higher score denotes a higher confidence in the accuracy of the base call. To further illustrate, the base-call-quality scoring feature 416 can include data quantifying the computing resources required to interpret the strength and clarity of the signals that correspond to the incorporation of nucleotides in the DNA sequence based on factors such as chemical anomalies, sequencing-device errors, or issues with the sample itself.
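One widely used logarithmic base-call quality scale is the Phred score, Q = -10 log10(p), where p is the estimated probability that the base call is wrong. The paragraph above says only that the score is on a logarithmic scale, so treating it as a Phred score is an assumption made for illustration; the function names are hypothetical.

```python
import math


def phred_quality(error_probability):
    # Phred convention: Q = -10 * log10(p); a higher Q means higher confidence.
    return -10.0 * math.log10(error_probability)


def error_probability(quality):
    # Inverse mapping: p = 10 ** (-Q / 10).
    return 10.0 ** (-quality / 10.0)


q30 = phred_quality(0.001)   # about 30: roughly 1 error in 1,000 calls
p20 = error_probability(20)  # about 0.01: roughly 1 error in 100 calls
print(round(q30, 6), round(p20, 6))
```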
[0118] As mentioned, the sequencing ordering system 106 generates, receives, or identifies base-call-data files that may include the raw output (BCL, SRA, VCF, FASTQ) from a sequencing device and contain the nucleotide reads for one or more genomic samples. As shown in FIG. 4, in certain embodiments, the sequencing ordering system 106 generates, receives, or identifies a base-call-data file 403. Based on the base-call-data file 403 and the requirements of the sequencing analysis, the sequencing ordering system 106 can receive or identify the secondary sequencing task features 418, including a genotype-quality scoring feature 420, a mapping nucleotide reads feature 422, an aligning nucleotide reads feature 424, a variant calling feature 426, a detecting structural variants feature 428, and an annotating phenotypes feature 430.
[0119] As shown, in certain embodiments, the sequencing ordering system 106 can access or identify the genotype-quality scoring feature 420 associated with the base-call-data file 403. In some cases, the genotype-quality scoring feature 420 includes data quantifying the computing (e.g., processor, time, memory) resources required for generating a statistical measure of confidence in a genotype call associated with the base-call-data file 403. To further illustrate, the genotype-quality scoring feature 420 can additionally or alternatively include data quantifying the computing resources required to analyze the alignment of sequencing reads against a reference genome and identify places where the sequenced DNA differs from the reference and assign a genotype-quality score based on the probability that the genotype call is correct. For example, the genotype-quality scoring feature 420 can include data quantifying the computing resources required for evaluating the depth of coverage (number of reads supporting the call) and the agreement between those reads.
[0120] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the mapping nucleotide reads feature 422 to represent mapping the genomic coordinates of a reference genome for a base-call-data file 403. In some cases, the mapping nucleotide reads feature 422 can include data quantifying the computing (e.g., processor, time, memory) resources required for aligning the nucleotide reads obtained from a sequencing device to a reference genome or assembling the nucleotide reads de novo if no reference is available. For example, the mapping nucleotide reads feature 422 can include data quantifying the computing resources required for preprocessing the reads to trim adapters and filter out low-quality sequences. Moreover, the mapping nucleotide reads feature 422 can include a representation of the computing resources required for specialized algorithms to perform genomic alignment, considering the complexities of the genome, such as repetitive regions and potential sequencing errors. Further, the mapping nucleotide reads feature 422 can include data quantifying the computing resources required for post-processing the aligned reads to identify regions with low coverage, potential misalignments, and to mark duplicate sequences that result from PCR amplification.
[0121] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the aligning nucleotide reads feature 424, which can include data quantifying the computing (e.g., processor, time, memory) resources required to align the nucleotide reads with the reference genome for the base-call-data file 403. The aligning nucleotide reads feature 424 can include a representation of the computing resources required for arranging sequencing reads to a reference genome or alternative contiguous sequence. Moreover, the aligning nucleotide reads feature 424 can include a representation of the computing resources required for quality filtering and trimming to ensure that only high-quality data is used for alignment. For example, the aligning nucleotide reads feature 424 can include data quantifying the computing resources required for the use of alignment algorithms to take the processed reads and map them to the reference genome and account for mismatches, insertions, and deletions, which may represent either sequencing errors or genuine variants. The aligning nucleotide reads feature 424 can additionally or alternatively include data quantifying the computing resources required for sorting and indexing to flag duplicate reads and perform local realignment to improve accuracy at indel positions.
[0122] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the variant calling feature 426 based on the nucleotide reads for the base-call-data file 403. The variant calling feature 426 can include data quantifying the computing (e.g., processor, time, memory) resources required for identifying differences between the sequenced DNA and a reference sequence. To further illustrate, the variant calling feature 426 can include data quantifying the computing resources required for analyzing the nucleotide read alignments to detect discrepancies that may indicate biological variations, such as single nucleotide polymorphisms (SNPs) and insertions and deletions (indels). For example, the variant calling feature 426 can include data quantifying the computing resources required for utilizing probabilistic models to determine the likelihood of a variant being real versus a sequencing or alignment error, incorporating factors like the base quality scores, alignment quality, and sequence context.
[0123] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the detecting structural variants feature 428 for the base-call-data file 403. The detecting structural variants feature 428 can include data quantifying the computing (e.g., processor, time, memory) resources required for identifying large-scale alterations in the genome such as deletions, insertions, duplications, inversions, and translocations that span more than 50 base pairs. For example, the detecting structural variants feature 428 can include data quantifying the computing resources required for the analysis of read alignments for patterns that indicate a structural variant. To further illustrate, the detecting structural variants feature 428 can include data quantifying the computing resources required for statistical modeling to differentiate true structural variants from alignment artifacts or normal genomic variation.
[0124] As shown, in certain embodiments, the sequencing ordering system 106 can additionally or alternatively access the annotating phenotypes feature 430 for the base-call-data file 403. The annotating phenotypes feature 430 can include data quantifying the computing (e.g., processor, time, memory) resources required for the association of identified genetic variants with their potential phenotypic outcomes. To illustrate, the annotating phenotypes feature 430 can include data quantifying the computing resources required for the use of bioinformatics tools and software to align phenotypic information and known phenotype associations with the genomic data from the base-call-data file 403. For example, the annotating phenotypes feature 430 can include data quantifying the computing resources required for predictive modeling to infer potential phenotypes based on the biological functions of genes impacted by the variants. As a further example, the annotating phenotypes feature 430 can include data quantifying the computing resources required for predicting the phenotypic outcome or disease association of each variant and generating a pathogenicity assessment of the clinical relevance of each variant.
[0125] As further shown in FIG. 4, the sequencing ordering system 106 provides the primary sequencing task features 404 and/or the secondary sequencing task features 418 to the sequencing-task ordering machine-learning model 440. The following paragraphs provide further details concerning embodiments of the sequencing-task ordering machine-learning model 440.
[0126] FIG. 5 illustrates an example architecture for a sequencing-task ordering machine-learning model in accordance with one or more embodiments of the present disclosure. As shown in FIG. 5, in certain embodiments, a sequencing-task ordering machine-learning model 510 is a neural network with two hidden layers that is fully connected and equipped with activation functions (e.g., a Multilayer Perceptron). The sequencing-task ordering machine-learning model 510 is configured with model parameter(s) 520 that include adjustable weights and biases (e.g., 88 parameters). The model parameter(s) 520, which include the weights and biases across the layers of the sequencing-task ordering machine-learning model 510, are optimized using a genetic algorithm. Notably, the sequencing ordering system 106 can utilize the sequencing-task ordering machine-learning model 510 with more or fewer hidden layers and neurons than shown in FIG. 5.
[0127] For example, the sequencing-task ordering machine-learning model 314 can utilize a fully connected feedforward neural network with two hidden layers, where connections between the nodes do not form a cycle (e.g., a multilayer perceptron (MLP)). As shown, the sequencing-task ordering machine-learning model 510 is a fully connected neural network, where each neuron in one layer is connected to all neurons in the subsequent layer, with the two hidden layers providing for the extraction of features at two different levels of hierarchy or abstraction. The sequencing-task ordering machine-learning model 510 utilizes activation functions to introduce non-linearity into the network and to model complex patterns that are not linearly separable. For example, the sequencing-task ordering machine-learning model 510 can utilize activation functions including ReLU (Rectified Linear Unit), softmax, sigmoid, or tanh. As shown, the sequencing-task ordering machine-learning model 510 passes each neuron output through the activation function before being fed to the next layer. Furthermore, the sequencing-task ordering machine-learning model 510 utilizes biases added to the input of the activation functions for each neuron, thereby enabling the activation function to be shifted to the left or right.
[0128] In particular, as shown in FIG. 5, the sequencing-task ordering machine-learning model 510 includes a first hidden layer 514. The input data neurons 512 of the sequencing-task ordering machine-learning model 510 process the input data, which represent the sequencing tasks (e.g., sequencing tasks 304) and sequencing task features (e.g., sequencing task features 306), and pass the input data to the first hidden layer 514. As shown, each input data neuron of the input data neurons 512 in the input layer is connected to every neuron in the first hidden layer 514. Further, in some cases, the sequencing-task ordering machine-learning model 510 transmits a vector or data signal from each input data neuron of the input data neurons 512 to each neuron in the first hidden layer 514, multiplied by a corresponding weight (e.g., from model parameter(s) 520). These products are summed, resulting in a weighted sum for each hidden neuron of the first hidden layer 514.
[0129] In some embodiments, a bias term (e.g., from model parameter(s) 520), for each neuron in the first hidden layer 514, is added to the weighted sum, which allows the threshold of an activation function 515 to be adjusted. As further indicated by FIG. 5, the result of the weighted sum plus the bias is passed through the activation function 515 (e.g., ReLU, Sigmoid, Tanh) for each neuron of the first hidden layer 514. This activation function 515 introduces non-linearity, allowing the sequencing-task ordering machine-learning model 510 to model complex relationships.
[0130] As further indicated by FIG. 5, in some embodiments, the sequencing-task ordering machine-learning model 510 sends the activated value of each neuron in the first hidden layer 514 to each neuron in the second hidden layer 516. As shown, every neuron in the second hidden layer 516 is fully-connected to every neuron in the first hidden layer 514. As with the first hidden layer 514, the sequencing-task ordering machine-learning model 510 calculates a weighted sum of inputs for each neuron from the previous layer, adds a bias, and then applies an activation function 517. In particular, the sequencing-task ordering machine-learning model 510 adds a bias term (e.g., from model parameter(s) 520) for each neuron in the second hidden layer 516 to the weighted sum, which allows the threshold of the activation function 517 to be adjusted.
[0131] After adding a bias term, as further indicated by FIG. 5, the sequencing-task ordering machine-learning model 510 passes features representing the result of the weighted sum plus the bias through an activation function (e.g., ReLU, Sigmoid, Tanh) for each neuron of the second hidden layer 516. Notably, the second hidden layer 516 has the capacity to learn even more complex patterns by combining the features extracted by the first hidden layer 514.
[0132] Further, as indicated by FIG. 5, the sequencing-task ordering machine-learning model 510 combines the activated outputs from the second hidden layer 516 with a set of weights and biases (e.g., model parameter(s) 520). As shown, the sequencing-task ordering machine-learning model 510 applies a final activation function 519 to obtain the task ordering scores 518. As mentioned, the sequencing-task ordering machine-learning model 510 provides the task ordering scores 518 based on the model parameter(s) 520.
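The forward pass described for FIG. 5 can be sketched as follows. This is a minimal illustration only: the layer sizes, the choice of ReLU for the hidden layers and sigmoid for the final activation, and the names `dense`, `forward`, and `random_params` are assumptions made for the example and are not taken from the disclosure.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(weighted sum + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(features, params):
    """Input -> first hidden layer -> second hidden layer -> one ordering score."""
    h1 = dense(features, params["w1"], params["b1"], relu)
    h2 = dense(h1, params["w2"], params["b2"], relu)
    return dense(h2, params["w3"], params["b3"], sigmoid)[0]

def random_params(sizes, rng):
    """Random weights and biases for hypothetical layer sizes such as [4, 5, 3, 1]."""
    params = {}
    for i, (n_in, n_out) in enumerate(zip(sizes, sizes[1:]), start=1):
        params[f"w{i}"] = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        params[f"b{i}"] = [rng.uniform(-1, 1) for _ in range(n_out)]
    return params

rng = random.Random(0)
params = random_params([4, 5, 3, 1], rng)  # layer sizes are illustrative
score = forward([0.2, 0.7, 0.1, 0.9], params)  # one task's feature vector
```

Under these assumptions, each sequencing task would be scored independently and the tasks then sorted by score to obtain an ordering.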
[0133] As noted above, in some embodiments, the sequencing ordering system 106 uses a training process to select a highest performing sequencing-task ordering machine-learning model to generate the task ordering scores. FIGS. 6A-6B illustrate selecting the highest performing sequencing-task ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure.
[0134] As shown in FIGS. 6A-6B, the sequencing ordering system 106 trains a sequencing-task ordering machine-learning model by using a genetic algorithm to select, from among candidate models, a highest performing sequencing-task ordering machine-learning model for the sequencing-task ordering machine-learning model 510. As shown, the sequencing ordering system 106 determines a set of initial sequencing-task ordering machine-learning model(s) 610. The sequencing ordering system 106 randomly initializes each model of the initial sequencing-task ordering machine-learning model(s) 610 with different model parameters (e.g., weights and biases). In certain embodiments, the sequencing ordering system 106 initializes weights in one or more of the initial sequencing-task ordering machine-learning model(s) 610 to the inverse square root of a next layer size within the respective initial sequencing-task ordering machine-learning model. In one or more embodiments, the sequencing ordering system 106 utilizes frequency metadata indicating a number of times a given sequencing task occurs as part of a training data set. Further, in certain implementations, the sequencing ordering system 106 utilizes a set of the initial sequencing-task ordering machine-learning model(s) 610 with a population size of 8192.
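As a rough sketch of the random initialization described above, the following example builds a small population of candidate models. It assumes one possible reading of the inverse-square-root rule, namely that weights are drawn uniformly from a range scaled by 1/sqrt(next layer size); the layer sizes, the uniform distribution, and the function names are hypothetical.

```python
import math
import random

def init_model(sizes, rng):
    """Random parameters; weight scale is 1/sqrt(next layer size) (assumed reading)."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_out)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-scale, scale) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def init_population(pop_size, sizes, seed=0):
    """Create pop_size models, each with different random parameters."""
    rng = random.Random(seed)
    return [init_model(sizes, rng) for _ in range(pop_size)]

# A population of 16 keeps the example small; the text mentions 8192.
population = init_population(pop_size=16, sizes=[4, 5, 3, 1])
```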
[0135] To facilitate training using a genetic algorithm, the sequencing ordering system 106 can determine makespan scores on a training set (e.g., 100 nucleotide-sample slides with 5 days of simulated time) to evaluate the fitness of the initial sequencing-task ordering machine-learning model(s) 610 based on the ordered sequencing tasks. The makespan score refers to a measure of the total time or duration required to complete a set of sequencing tasks as part of determining a sequence of nucleobases for one or more sample genomes (or other nucleotide polymers) or part of saving data from determining such a sequence or from a corresponding analysis.
[0136] In particular, the sequencing ordering system 106 determines makespan scores that represent the cumulative time span for completing the sequencing tasks. A makespan score can depend on the specifications and variables of a sequencing run. To illustrate, for a given sequencing device or pipeline, up to 8 nucleotide-sample slides can be in the given sequencing device simultaneously and each nucleotide-sample slide (e.g., respective nucleotide-sample slides of the set of nucleotide-sample slides) can have oligonucleotide clusters from hundreds or thousands of genomic samples. For the given sequencing device, the sequencing ordering system 106 runs imaging and chemistry cycles for every nucleotide-sample slide and the line scanner can process up to 4 nucleotide-sample slides at a time. Each nucleotide-sample slide requires a primary sequencing task and while the images are being produced from the previous phase, the sequencing device can begin processing. In certain implementations, the sequencing ordering system 106 utilizes a computer processor (e.g., FPGA/CPU/GPU) with 24 cores and 512 GB of RAM for the primary sequencing tasks. In some cases, the makespan score quantifies the time for completing primary sequencing tasks given the specifications and variables noted above for a sequencing run. The sequencing ordering system 106 can further execute secondary sequencing tasks with one job for each sample per nucleotide-sample slide. In some embodiments, the secondary sequencing tasks begin after the primary sequencing task. In certain implementations, the sequencing ordering system 106 utilizes a computer processor with 28 cores, 512 GB of RAM, and 2 FPGAs for the secondary sequencing tasks. The sequencing ordering system 106 utilizes the makespan score to quantify the time taken from the start of the first sequencing task until the completion of the last sequencing task based on factors such as sequencing task duration, resource availability, and sequencing task features.
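To make the notion of a makespan concrete, the following simplified sketch measures the time from the start of the first task to the completion of the last task when tasks are dispatched in a given order to the earliest-free worker. The greedy dispatch rule, the worker count, and the task durations are assumptions for illustration and do not model the slide, scanner, or processor constraints described above.

```python
import heapq

def makespan(durations, num_workers):
    """Run tasks in the given order on the earliest-free worker; return total span."""
    free_at = [0.0] * num_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + d
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

order_a = [1.0, 1.0, 4.0]  # long task dispatched last
order_b = [4.0, 1.0, 1.0]  # long task dispatched first
span_a = makespan(order_a, num_workers=2)  # 5.0 hours
span_b = makespan(order_b, num_workers=2)  # 4.0 hours
```

The same three tasks finish an hour sooner under the second ordering, which is the kind of difference an ordering model's fitness would reflect.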
[0137] As indicated by FIG. 6A, based on their fitness (e.g., makespan scores), the sequencing ordering system 106 selects a subset of the initial sequencing-task ordering machine-learning model(s) 610 to serve as the parent sequencing-task ordering machine-learning model(s) 620 for the next generation. As shown, the sequencing ordering system 106 evaluates all of the initial sequencing-task ordering machine-learning model(s) 610 using a fitness function (e.g., an objective function that evaluates the performance of the initial sequencing-task ordering machine-learning model(s) 610) based on a set of training data to generate predicted ordering scores and a makespan value. As noted above, such a set of training data can include metadata indicating a frequency at which a given sequencing task occurs within the set of training data, such as a count quantifying a number of times each particular sequencing task was performed overall, within a particular time frame, or within a given sequencing run. For example, the sequencing ordering system 106 can utilize selection strategies for the parent sequencing-task ordering machine-learning model(s) 620 that include (i) tournament selection, where random subsets of models compete, or (ii) roulette wheel selection, where the probability of selection is proportional to fitness as measured by makespan scores.
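The two selection strategies named above can be sketched as follows. Because a lower makespan is better, this example assumes that roulette wheel selection uses 1/makespan as the selection weight; that conversion, the tournament size `k`, and the function names are illustrative choices rather than details from the disclosure.

```python
import random

def tournament_select(makespans, k, rng):
    """Pick k random contestants; the one with the lowest makespan wins."""
    contestants = rng.sample(range(len(makespans)), k)
    return min(contestants, key=lambda i: makespans[i])

def roulette_select(makespans, rng):
    """Sample an index with probability proportional to 1 / makespan (assumed)."""
    weights = [1.0 / m for m in makespans]
    return rng.choices(range(len(makespans)), weights=weights, k=1)[0]

def select_parents(makespans, num_parents, rng, k=3):
    """Run repeated tournaments to choose parent model indices."""
    return [tournament_select(makespans, k, rng) for _ in range(num_parents)]

rng = random.Random(1)
makespans = [12.0, 9.5, 30.0, 11.0, 8.0, 15.0]  # one score per model
parents = select_parents(makespans, num_parents=4, rng=rng)
pick = roulette_select(makespans, rng)
```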
[0138] In some embodiments, the sequencing ordering system 106 evaluates the output of the initial sequencing-task ordering machine-learning model(s) 610 to determine a makespan value that includes a penalty calculation (e.g., penalized makespan) based on a priority multiplier and includes a priority penalty that penalizes for long or poorly scheduled tasks. For example, the sequencing ordering system 106 can evaluate the loss (or fitness) of a model using:
$$\sum_{\text{slide}} \left(\text{max task completion time since arrival (hours)}\right) \cdot \text{priority penalty}$$
In certain implementations, the sequencing ordering system 106 determines a set of the parent sequencing-task ordering machine-learning model(s) 620 with a population size of 128.
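As an illustration of the penalized makespan, the sketch below reads the expression above as a sum over slides of the latest task completion time since the slide's arrival (in hours) multiplied by that slide's priority penalty. Treating the per-slide terms as a sum, and the dictionary fields `arrival`, `completions`, and `priority_penalty`, are assumptions made for the example.

```python
def penalized_makespan(slides):
    """Sum over slides of (latest task completion since arrival, hours) * penalty."""
    total = 0.0
    for slide in slides:
        worst_hours = max(done - slide["arrival"] for done in slide["completions"])
        total += worst_hours * slide["priority_penalty"]
    return total

# Hypothetical slides: times are hours on a shared clock.
slides = [
    {"arrival": 0.0, "completions": [5.0, 9.0], "priority_penalty": 2.0},
    {"arrival": 4.0, "completions": [10.0, 7.0], "priority_penalty": 1.0},
]
loss = penalized_makespan(slides)  # 9 * 2 + 6 * 1 = 24.0
```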
[0139] As further shown in FIG. 6A, the sequencing ordering system 106 combines pairs of the parent sequencing-task ordering machine-learning model(s) 620 to produce candidate sequencing-task ordering machine-learning model(s) 630 using crossover or recombination. To illustrate, the sequencing ordering system 106 selects crossover points, and the genetic information is mixed between two of the parent sequencing-task ordering machine-learning model(s) 620 to create one or more of the candidate sequencing-task ordering machine-learning model(s) 630. The candidate sequencing-task ordering machine-learning model(s) 630 “inherit” parameters (e.g., weights and biases) from both of the parent sequencing-task ordering machine-learning model(s) 620 and replace some of the less fit parent sequencing-task ordering machine-learning model(s) 620 from the previous generation.
[0140] Further, to maintain genetic diversity and potentially introduce new parameters into the population, the sequencing ordering system 106 applies mutations to the candidate sequencing-task ordering machine-learning model(s) 630. For example, the sequencing ordering system 106 can apply a random change (or specific change) in the parameters of the parent sequencing-task ordering machine-learning model(s) 620. The candidate sequencing-task ordering machine-learning model(s) 630 are then evaluated for their fitness in the same way as the initial sequencing-task ordering machine-learning model(s) 610 using a fitness function, that is, based on a set of training data to generate predicted ordering scores and a makespan value.
[0141] To illustrate, in certain implementations, the sequencing ordering system 106, uniformly and at random, selects a proportion p ~ [0, 1] to represent how close each of the candidate sequencing-task ordering machine-learning model(s) 630 is to either of the parent sequencing-task ordering machine-learning model(s) 620 (e.g., 0 and 1 are exactly like one parent, 0.5 is an even blend of both). In addition, the sequencing ordering system 106 picks weights and biases with probability p to come from one of the parent sequencing-task ordering machine-learning model(s) 620 and probability 1-p from the other of the parent sequencing-task ordering machine-learning model(s) 620. Mutations in the candidate sequencing-task ordering machine-learning model(s) 630 occur by randomly perturbing parameters according to a normal distribution with a mean of 0 and a standard deviation of 0.003. Further, in certain implementations, the sequencing ordering system 106 utilizes a set of candidate sequencing-task ordering machine-learning model(s) 630 with a population of 8192.
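The crossover and mutation steps described above can be sketched as follows. The example assumes that a model's weights and biases are stored as one flat list (88 values, matching the example parameter count mentioned for FIG. 5) and that every parameter is perturbed during mutation; both are simplifications, and the function names are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Draw p uniformly from [0, 1); each parameter comes from parent_a with probability p."""
    p = rng.random()
    return [a if rng.random() < p else b for a, b in zip(parent_a, parent_b)]

def mutate(params, rng, sigma=0.003):
    """Perturb every parameter with Gaussian noise of mean 0 and std sigma."""
    return [x + rng.gauss(0.0, sigma) for x in params]

rng = random.Random(42)
parent_a = [0.5] * 88   # flat parameter vector of one parent
parent_b = [-0.5] * 88  # flat parameter vector of the other parent
child = crossover(parent_a, parent_b, rng)
mutated = mutate(child, rng)
```

With p near 0 or 1 the child is almost a copy of one parent, and with p near 0.5 it is an even mix, which matches the blend described in the paragraph above.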
[0142] Turning now to FIG. 6B, as shown in the figure, the sequencing ordering system 106 selects a highest performing candidate sequencing-task ordering machine-learning model 640 as the fittest model from the candidate sequencing-task ordering machine-learning model(s) 630. In this selection process, the sequencing ordering system 106 can utilize the cycle of selection, crossover, mutation, and evaluation for a predetermined number of generations or until a satisfactory level of fitness is achieved.
[0143] As further shown in FIG. 6B, in some embodiments, the sequencing ordering system 106 selects the sequencing-task ordering machine-learning model 650 from between a previously configured sequencing-task ordering machine-learning model 642 and the highest performing candidate sequencing-task ordering machine-learning model 640 based on a comparison of the fitness (e.g., makespan scores) of the previously configured sequencing-task ordering machine-learning model 642 and the fitness (e.g., makespan scores) of the highest performing candidate sequencing-task ordering machine-learning model 640. In certain embodiments, the sequencing ordering system 106 utilizes a validation test set (e.g., 25,000 nucleotide-sample slides, 2.5 years of simulated time) to evaluate the fitness and generate the makespan scores. For example, if the sequencing ordering system 106 determines the previously configured sequencing-task ordering machine-learning model 642 is more fit based on a fitness evaluation on the validation test set, the sequencing ordering system 106 can maintain the previously configured sequencing-task ordering machine-learning model 642 as the sequencing-task ordering machine-learning model 650. Conversely, if the sequencing ordering system 106 determines the highest performing candidate sequencing-task ordering machine-learning model 640 is more fit based on a fitness evaluation on the validation test set, the sequencing ordering system 106 can assign the highest performing candidate sequencing-task ordering machine-learning model 640 as the sequencing-task ordering machine-learning model 650. In this way, the sequencing ordering system 106 can determine the best performing sequencing-task ordering machine-learning model for the validation test set.
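The comparison between the previously configured model and the highest performing candidate can be sketched as a simple selection on validation fitness. In this example each model is represented by a hypothetical function that returns the makespan it achieves on a validation run, fitness is taken as the mean makespan, and ties keep the previous model; all three are assumptions for illustration.

```python
def mean_makespan(model, validation_runs):
    """Average makespan of a model over validation runs (lower is fitter)."""
    return sum(model(run) for run in validation_runs) / len(validation_runs)

def select_model(previous, candidate, validation_runs):
    """Keep the previous model unless the candidate is strictly fitter."""
    if mean_makespan(candidate, validation_runs) < mean_makespan(previous, validation_runs):
        return candidate
    return previous

# Hypothetical stand-ins: each "model" maps a run to the makespan it achieves.
def previous_model(run):
    return run["baseline_hours"]

def candidate_model(run):
    return run["baseline_hours"] * 0.9

runs = [{"baseline_hours": 40.0}, {"baseline_hours": 55.0}]
selected = select_model(previous_model, candidate_model, runs)
```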
[0144] As mentioned, the sequencing ordering system 106 can determine where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run. FIG. 7A illustrates the sequencing ordering system 106 distributing sample-specific base-call-data files to one or more computing devices in accordance with one or more embodiments of the present disclosure.
[0145] In some embodiments, the sequencing ordering system 106 determines where and when to distribute sample-specific base-call-data files from a sequencing device during a sequencing run. In particular, based on processing parameters, the sequencing ordering system can demultiplex and transmit base-call-data files 712 to one or more computing devices during the sequencing run. For example, the disclosed system can demultiplex the indexed reads to determine which indexing sequences belong to which genomic samples after the completion of the first sequencing pass and efficiently begin transmitting the base-call-data files 712 to the appropriate computing device during the sequencing run (e.g., during the second sequencing pass).
[0146] In some cases, the sequencing ordering system 106 expedites determining oligonucleotides belonging to respective genomic samples within a nucleotide-sample-slide pool (or other nucleotide-sample-substrate pool) by base calling the indexing sequences for both read pairs before base calling the genomic sequences in library templates for each sample. By performing indexing cycles before the genomic sequencing cycles, the sequencing ordering system determines which nucleotide reads belong to which genomic samples and a relative balance of genomic samples. Furthermore, based on this determination, the sequencing ordering system can begin generating and transmitting the base-call-data files 712 to the appropriate computing device after each genomic sequencing cycle of the sequencing run.
[0147] As shown in FIG. 7A, the sequencing ordering system 106 can demultiplex nucleotide reads 702 based on indexing sequences for a genomic sample A 704 and indexing sequences for a genomic sample B 706 (and similarly for additional genomic samples C, D, etc.). For example, after determining base calls for the indexing sequences (e.g., the indexing sequences for a genomic sample A 704 and the indexing sequences for a genomic sample B 706), the sequencing ordering system 106 determines which clusters of oligonucleotides correspond to each genomic sample in a pool of genomic samples.
[0148] The sequencing ordering system 106 may determine which oligonucleotide clusters in a nucleotide-sample slide correspond to which genomic sample through demultiplexing. In particular, after determining base calls for the indexing sequences, the sequencing ordering system 106 analyzes the raw sequencing data and uses index sequences (which function similarly to barcodes) to assign each read to its corresponding genomic sample. For example, the sequencing ordering system 106 accesses raw sequencing data comprising indexing sequences for a genomic sample A and indexing sequences for a genomic sample B. The indexing sequences comprise nucleobases that act as unique identifiers for each genomic sample, allowing for differentiation and sorting of the reads during demultiplexing. For example, the indexing sequences for a genomic sample A indicate that the sample genomic sequence comes from genomic sample A. The indexing sequences for a genomic sample B indicate that the sample genomic sequence originates from genomic sample B.
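The demultiplexing step described above can be sketched as a lookup from index sequences to samples. The example assumes exact matching of a dual-index pair and a catch-all "undetermined" bin for unmatched reads; the index sequences, field names, and sample labels are made up for illustration, and practical demultiplexers typically also tolerate a small number of index mismatches.

```python
from collections import defaultdict

def demultiplex(reads, index_to_sample):
    """Group cluster IDs by the sample whose (index1, index2) pair they carry."""
    by_sample = defaultdict(list)
    for read in reads:
        key = (read["index1"], read["index2"])
        sample = index_to_sample.get(key, "undetermined")
        by_sample[sample].append(read["cluster_id"])
    return dict(by_sample)

# Hypothetical index pairs acting as barcodes for two genomic samples.
index_to_sample = {("ACGT", "TTAG"): "sample_A", ("GGCA", "CATC"): "sample_B"}
reads = [
    {"cluster_id": 1, "index1": "ACGT", "index2": "TTAG"},
    {"cluster_id": 2, "index1": "GGCA", "index2": "CATC"},
    {"cluster_id": 3, "index1": "ACGT", "index2": "TTAG"},
    {"cluster_id": 4, "index1": "NNNN", "index2": "TTAG"},
]
clusters_by_sample = demultiplex(reads, index_to_sample)
```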
[0149] As further shown in FIG. 7A, the sequencing ordering system 106 generates the base-call-data files 712 for the base call data from each sample based on the demultiplexing operation described above. During the sequencing run, the system further determines that different subsets of indexing sequences correspond to different genomic samples that have different corresponding requirements or processing parameters 714 for sequencing tasks. To illustrate, the sequencing ordering system 106 can utilize processing parameters 714. For example, the processing parameters 714 include analysis rights for a genomic sample such as permissions, privacy rights, ownership rights, data user agreements, and/or clinician rights. For example, the sequencing ordering system 106 can utilize processing parameters 714 based on a category of analysis for a genomic sample such as categories of targeted sequencing (for specific genes or regions), forensic analysis, exome sequencing, methylation sequencing, and/or cancer panels. As an additional example, the sequencing ordering system 106 can utilize processing parameters 714 based on a sample size, a sequencing platform, a study purpose, a species complexity, a clinical setting, and/or an ethical consideration.
[0150] Based on the processing parameters 714 for the sequencing tasks, the sequencing ordering system 106 transmits the base-call-data files 712 that are sample-specific files to various computing devices. As shown, the sequencing ordering system 106 can transmit a first base-call-data file 723 (of the base-call-data files 712) to a first computing device 722 and a second base-call-data file 725 (of the base-call-data files 712) to a second computing device 724. As mentioned, such sample-specific file distribution provides additional security, saves processing time, and reduces storage requirements.
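One way to picture the distribution step is as a routing table keyed by a processing parameter. The sketch below assumes that each sample-specific file carries a single analysis category and that a rule table maps categories to destination devices; the file names, device labels, categories, and default rule are hypothetical and chosen only to echo the reference numerals above.

```python
def route_files(files, rules, default_device):
    """Map each sample-specific file to a device using its analysis category."""
    plan = {}
    for f in files:
        plan[f["name"]] = rules.get(f["category"], default_device)
    return plan

# Hypothetical routing rules derived from processing parameters.
rules = {"exome": "device_722", "forensic": "device_724"}
files = [
    {"name": "file_723", "category": "exome"},
    {"name": "file_725", "category": "forensic"},
    {"name": "file_other", "category": "methylation"},
]
plan = route_files(files, rules, default_device="device_722")
```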
[0151] FIG. 7B illustrates the sequencing ordering system 106 performing a demultiplexing operation on a subset of sequencing cycles with indexing cycles performed between genomic sequencing cycles in accordance with one or more embodiments of the present disclosure. As shown, after determining base calls for the indexing sequences, the sequencing ordering system 106 demultiplexes to determine which clusters of oligonucleotides correspond to each genomic sample in the pool of genomic samples to generate sample-specific base-call-data files (e.g., the base-call-data files 712). As previously described in relation to FIG. 7A, the sequencing ordering system 106 transmits the sample-specific base-call-data files (e.g., the base-call-data files 712) to various computing devices. In particular, as discussed in more detail below, the sequencing ordering system 106 can begin transmitting base-call-data files (e.g., the first base-call-data file 723) after performing the act 736 of determining base calls for a second indexing sequence.
[0152] In particular, FIG. 7B illustrates the series of acts comprising the act 732 of determining base calls for a first indexing sequence. A first index primer 742 is annealed to the primer binding site appended to the sample genomic sequence 740. After the first index primer 742 is annealed, the sequencing ordering system 106 determines base calls for the first indexing sequence 746. As shown in FIG. 7B, the first indexing sequence 746 is appended to a sample genomic sequence 740 of a genomic sample.
[0153] After determining base calls for the first indexing sequence 746, the sequencing ordering system 106 performs the act 734 of determining base calls for a first nucleotide read. More specifically, the sequencing ordering system 106 determines base calls for a first nucleotide read corresponding to a first portion of the sample genomic sequence 740. In a paired-end sequencing run, the sample genomic sequence 740 is sequenced from both ends, providing complementary information about the sample genomic sequence 740. As part of performing the act 734, the sequencing ordering system 106 anneals a first nucleotide read primer 748 to a read primer binding site, and the sequencing ordering system 106 sequences the first portion of the sample genomic sequence 740.
[0154] As further illustrated in FIG. 7B, the sequencing ordering system 106 performs the act 736 of determining base calls for a second indexing sequence. The sequencing ordering system 106 anneals a second index primer 752 to the primer binding site appended to the sample genomic sequence 740. The sequencing ordering system 106 determines base calls for the second indexing sequence 750. As further shown in FIG. 7B, the second indexing sequence 750 is appended to the 5’ end of the sample genomic sequence 740 while the first indexing sequence 746 is appended to the 7’ end of the sample genomic sequence 740. Further, the sequencing ordering system 106 performs the act 758 to demultiplex the clusters of oligonucleotides to determine the clusters that correspond to each genomic sample in the pool of genomic samples.
[0155] Based on the act 758 to demultiplex the clusters of oligonucleotides, the sequencing ordering system 106 can transmit the sample-specific first base-call-data files to various computing devices during the sequencing run as discussed in relation to FIG. 7A. In particular, the sequencing ordering system 106 can transmit the first sample-specific base-call-data files for the first portion of the sample genomic sequence 740 after the act 758 to demultiplex the clusters of oligonucleotides and during the cycles of the sequencing run as discussed in relation to FIG. 7A. To illustrate, in certain implementations, after the sequencing ordering system 106 performs the act 736 of determining the base calls for a second indexing sequence, the sequencing ordering system 106 can demultiplex the nucleotide reads (e.g., when demultiplexing the nucleotide reads 702) to generate sample-specific base-call-data files (e.g., the first base-call-data file 723) based on the act 734 of determining base calls for a first nucleotide read. Furthermore, the sequencing ordering system 106 can begin to transmit the sample-specific base-call-data files (e.g., the first base-call-data file 723) to a first computing device (e.g., first computing device 722) after completing the act 734.
[0156] In some embodiments, after performing the act 736, the sequencing ordering system 106 performs a pair-end turn. Generally, during the pair-end turn, the P7 region is cleaved and all fragments are attached by the P5 region. Prior to the pair-end turn, the P7 region is annealed to the surface of the nucleotide-sample slide. After the pair-end turn, the P5 region is attached to the nucleotide-sample slide. Following the pair-end turn, the sequencing ordering system 106 performs the act 738 of determining base calls for a second nucleotide read. The sequencing ordering system 106 anneals the second nucleotide read primer 754 to a second read primer binding site, and the sequencing ordering system 106 sequences the second portion of the sample genomic sequence 740.
[0157] In accordance with one or more embodiments, FIG. 7C illustrates performing an indexing-first approach to demultiplexing nucleotide reads by performing indexing cycles before genomic sequencing cycles. As shown, after determining base calls for the first and second indexing sequences, the sequencing ordering system 106 demultiplexes the oligonucleotides to determine which clusters of oligonucleotides correspond to each genomic sample in the pool of genomic samples to generate sample-specific base-call-data files (e.g., the base-call-data files 712). As previously described in relation to FIG. 7A, the sequencing ordering system 106 transmits the sample-specific base-call-data files (e.g., the base-call-data files 712) to various computing devices. In particular, the sequencing ordering system 106 can begin transmitting both the first base-call-data file (e.g., the first base-call-data file 723) and the second base-call-data file (e.g., the second base-call-data file 725) after performing act 764 of determining base calls for a second indexing sequence. Indeed, as shown in FIG. 7C, the use of an indexing-first approach expedites distributing the sample-specific base-call-data files by allowing the sequencing ordering system 106 to begin transmitting the sample-specific base-call-data files after performing act 764 of determining base calls for a second indexing sequence.
[0158] In particular, FIG. 7C illustrates the series of acts comprising the act 762 of determining base calls for a first indexing sequence. As described in more detail in reference to FIG. 7B, the sequencing ordering system 106 anneals a first index primer 772, determines base calls for the first indexing sequence 776, and appends the first indexing sequence to the 7’ end of a sample genomic sequence 770. As further illustrated in FIG. 7C, the sequencing ordering system 106 performs the act 764 of determining base calls for a second indexing sequence. In particular, the sequencing ordering system 106 anneals a second index primer 778, determines base calls for the second indexing sequence 780, and appends the second indexing sequence 780 to the 5’ end of the sample genomic sequence 770.
[0159] Notably, by using the indexing-first approach, the sequencing ordering system 106 performs the indexing cycles (e.g., the act 762 and the act 764) before performing the genomic sequencing cycles (e.g., the act 766 and the act 768). As a result, as shown in FIG. 7C, the sequencing ordering system 106 can perform the act 788 to demultiplex the clusters of oligonucleotides to determine the clusters that correspond to each genomic sample in the pool of genomic samples before performing the sequencing cycles (e.g., the act 766 and the act 768). Based on the act 788 to demultiplex the clusters of oligonucleotides, the sequencing ordering system 106 can transmit the sample-specific first base-call-data files to various computing devices during the sequencing run as discussed in relation to FIG. 7A. In particular, the sequencing ordering system 106 can transmit the sample-specific first base-call-data files (e.g., the base-call-data files 712) for the first sample-specific base-call-data file (e.g., the first base-call-data file 723) and the second sample-specific base-call-data file (e.g., the second base-call-data file 725) of the sample genomic sequence 740 after the act 788 to demultiplex the clusters of oligonucleotides and during the sequencing run as discussed in relation to FIG. 7A.
[0160] As further shown in FIG. 7C, and as described in reference to FIG. 7B, after determining base calls for the first indexing sequence 776 and the second indexing sequence 780, the sequencing ordering system 106 performs the act 766 of determining base calls for a first nucleotide read, anneals a first nucleotide read primer 782 to a read primer binding site, and sequences the first portion of the sample genomic sequence 770. In some embodiments, after performing the act 766, the sequencing ordering system 106 performs a pair-end turn. Following the pair-end turn, the sequencing ordering system 106 performs the act 768 of determining base calls for a second nucleotide read, anneals the second nucleotide read primer 784 to a second read primer binding site, and sequences the second portion of the sample genomic sequence 770.
[0161] As shown in FIG. 7C, the use of an indexing-first approach expedites distributing the sample-specific base-call-data files by allowing the sequencing ordering system 106 to begin transmitting the sample-specific base-call-data files after performing act 764 of determining base calls for a second indexing sequence and before performing the genomic sequencing cycles. In particular, the sequencing ordering system 106 can transmit the sample-specific base-call-data files (e.g., the base-call-data files 712) for the first portion (e.g., the first base-call-data file 723) and the second portion (e.g., the second base-call-data file 725) of the sample genomic sequence 740 during each cycle of the sequencing run as discussed in relation to FIG. 7A.
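As a rough illustration of the indexing-first idea, the following Python sketch assigns clusters to genomic samples as soon as both index reads have been base called, before any genomic sequencing cycle runs. The sample sheet, the index sequences, and the name `demultiplex_clusters` are hypothetical, and exact-match lookup is only a stand-in for whatever index matching the sequencing ordering system 106 actually applies.

```python
from collections import defaultdict

# Hypothetical sample sheet: (first index, second index) -> sample name.
SAMPLE_SHEET = {
    ("ACGTACGT", "TTGCAAGC"): "sample_1",
    ("GGATCCAA", "CAGTTGAC"): "sample_2",
}


def demultiplex_clusters(index_calls, sample_sheet):
    """Group cluster IDs by sample using only the two index reads.

    index_calls maps cluster_id -> (first index calls, second index calls).
    Clusters whose index pair is not in the sample sheet are collected
    under "undetermined".
    """
    clusters_by_sample = defaultdict(list)
    for cluster_id, index_pair in index_calls.items():
        sample = sample_sheet.get(index_pair, "undetermined")
        clusters_by_sample[sample].append(cluster_id)
    return dict(clusters_by_sample)


# Index base calls are available after the two indexing reads,
# so the cluster-to-sample mapping exists before the genomic reads start.
index_calls = {
    0: ("ACGTACGT", "TTGCAAGC"),
    1: ("GGATCCAA", "CAGTTGAC"),
    2: ("ACGTACGT", "TTGCAAGC"),
    3: ("NNNNNNNN", "CAGTTGAC"),
}
assignment = demultiplex_clusters(index_calls, SAMPLE_SHEET)
```

With the mapping fixed this early, base calls from each later genomic cycle could be appended directly to the matching sample-specific file and streamed out during the run.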
[0162] As mentioned, the sequencing ordering system 106 utilizes a nucleotide-sample-slide ordering machine-learning model to determine slide ordering scores for scheduling tasks associated with a nucleotide-sample slide. FIG. 8 illustrates a schematic diagram of utilizing the nucleotide-sample-slide ordering machine-learning model 806 to determine ordering scores indicating an order for sequencing tasks in accordance with one or more embodiments of the present disclosure. As shown, the sequencing ordering system 106 identifies or receives data for sequencing tasks associated with nucleotide-sample slide(s) 802a, 802b, through 802n. For example, the sequencing ordering system 106 can identify or receive data for sequencing tasks for nucleotide-sample slide(s) 802a-802n comprising genomic samples for four different nucleobase types (e.g., A, T, C, G) associated with sample library fragments. As further shown, the sequencing ordering system 106 determines nucleotide-sample-slide features 804a, 804b, through 804n associated with the nucleotide-sample slide(s) 802a-802n.
[0163] As further shown, the sequencing ordering system 106 utilizes a nucleotide-sample-slide ordering machine-learning model 806 to generate slide ordering scores 808. By utilizing the nucleotide-sample-slide features 804a-804n and accounting for available computing resources (e.g., using model parameters), the nucleotide-sample-slide ordering machine-learning model 806 generates slide ordering scores 808 indicating a relative order of the nucleotide-sample slide(s) 802a-802n. In one or more embodiments, the nucleotide-sample-slide ordering machine-learning model 806 operates as further described in relation to FIG. 10 to provide the slide ordering scores 808.
[0164] In particular, the nucleotide-sample-slide ordering machine-learning model 806 generates slide ordering scores 808 that represent values for a slide order which maximizes the efficiency of the sequencing tasks for the nucleotide-sample slide(s) 802a-802n. For example, the nucleotide-sample-slide ordering machine-learning model 806 generates slide ordering scores 808 that can be used to order the nucleotide-sample slide(s) 802a-802n and provide a more efficient utilization of resources, reduced turnaround times for processing nucleotide-sample slides, and an overall increase in the throughput of the genomic sequencing process. In this way, the sequencing ordering system 106 can strategically provide an order for the nucleotide-sample slide(s) 802a-802n (e.g., particularly in high-volume environments) which can provide significant improvements in productivity and efficiency.
[0165] As shown, the nucleotide-sample-slide ordering machine-learning model 806 utilizes the slide ordering scores 808 to provide a ranking for ordered slides 810 indicating a relative order for the nucleotide-sample slide(s) 802a-802n to perform primary sequencing tasks and/or secondary sequencing tasks (e.g., the relative order for primary/secondary sequencing tasks 320). In some cases, the ordered slides 810 are arranged in a sequence that reflects their assessed priority from the slide ordering scores 808, with the highest-scoring of the nucleotide-sample slide(s) 802a-802n scheduled first. The sequencing ordering system 106 further causes nucleotide-sample slide(s) 802a-802n to be scheduled according to the slide ordering scores 808 on the computing device(s) 812 (e.g., one or more of the sequencing device 108, the server device(s) 102, the server device(s) 110, and the client device(s) 114).
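The ranking step described above, in which the highest-scoring slide is scheduled first, can be sketched as a descending sort over the scores. The function name `rank_slides` and the score values are illustrative assumptions, not values from the disclosure.

```python
def rank_slides(slide_ids, ordering_scores):
    """Return slide IDs sorted so the highest ordering score comes first."""
    if len(slide_ids) != len(ordering_scores):
        raise ValueError("need exactly one ordering score per slide")
    paired = sorted(
        zip(slide_ids, ordering_scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [slide_id for slide_id, _ in paired]


# Hypothetical scores for three slides labeled after FIG. 8.
ordered_slides = rank_slides(["802a", "802b", "802n"], [0.21, 0.87, 0.55])
```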
[0166] As mentioned, the sequencing ordering system 106 provides nucleotide-sample-slide features to the nucleotide-sample-slide ordering machine-learning model. FIG. 9 illustrates providing nucleotide-sample-slide features to the nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
[0167] As shown, in certain embodiments, the sequencing ordering system 106 receives or identifies nucleotide-sample-slide features 904 associated with nucleotide-sample-slide(s) 902. Similar to the discussion in relation to FIG. 4, the nucleotide-sample-slide features 904 can include a metric, a setting, a boundary, an environment variable, or a feature vector representing the performance time, processor usage (e.g., CPU and FPGA), memory usage, and/or other resource requirements for the sequencing tasks associated with the nucleotide-sample-slide(s) 902. In particular, the sequencing ordering system 106 can receive or identify the nucleotide-sample-slide features 904 including a processor usage feature 906, a memory requirements feature 908, a performance time feature 910, and a priority feature 912.
[0168] For example, the sequencing ordering system 106 can access or identify the processor usage feature 906 associated with the nucleotide-sample-slide(s) 902. The processor usage feature 906 can include data for quantifying the number of FPGAs/CPUs/GPUs and the amount of available RAM. As another example, the processor usage feature 906 can include data quantifying the computational load on processors and can be operationalized as the percentage of processor time required or as the intensity of the computations needed. As another example, the processor usage feature 906 includes data quantifying the values for required processing power (or computational infrastructure) and the capacity of the sequencing ordering system 106 to process primary sequencing tasks like nucleotide identification, and/or secondary tasks such as sequence assembly and annotation for the nucleotide-sample-slide(s) 902. To illustrate, the processor usage feature 906 can include data quantifying the processor usage requirements based on the processor requirements associated with the sample sequencing depth, sample complexity, slide size, number of multiplexed samples, computational algorithm efficiency, system data throughput, and/or the system architecture. In certain implementations, the processor usage feature 906 includes data representing CPU usage within the ranges of 5 - 10 cores per task, and FPGA usage of 1 - 3 FPGA subdivisions per sequencing task.
[0169] The sequencing ordering system 106 can additionally or alternatively identify nucleotide-sample-slide features 904 which include the memory requirements feature 908 that quantifies the memory requirements, including the amount of RAM needed to perform the sequencing tasks for the nucleotide-sample-slide(s) 902. For example, the memory requirements feature 908 indicates memory required for sequencing tasks, such as storing raw sequencing data during primary sequencing or processing large amounts of genomic data during secondary analyses. To further illustrate, the memory requirements feature 908 can account for the large datasets (e.g., gigabytes of data per run) involved in primary sequencing tasks and secondary sequencing tasks. Furthermore, the memory requirements feature 908 can include data quantifying the computing (e.g., processor, time, memory) resources required based on the data volume, data complexity, parallel processing needs, temporary storage needs, and/or final storage needs.
[0170] The sequencing ordering system 106 can additionally or alternatively identify the performance time feature 910, which can include data for quantifying the time requirements to perform the sequencing tasks for the nucleotide-sample-slide(s) 902. For example, for primary sequencing tasks, the performance time feature 910 includes data reflecting the throughput rate of the sequencer. As another example, for secondary sequencing tasks, the performance time feature 910 represents the duration of computational analyses such as comparative genomics.
[0171] As shown, in certain embodiments, the nucleotide-sample-slide features 904 include the priority feature 912 that can include data for quantifying the priority of the nucleotide-sample-slide(s) 902. To illustrate, the priority feature 912 can include a priority value for scheduling the nucleotide-sample-slide(s) 902. As a further illustration, the priority feature 912 can include a value indicating a relative priority for scheduling one of the nucleotide-sample-slide(s) 902 in comparison to others of the nucleotide-sample-slide(s) 902. For example, in some cases, the priority feature 912 indicates an assessment of the sample urgency for sequencing the nucleotide-sample-slide(s) 902 based on time-sensitive analyses, sequencing project deadlines, customer requirements, and/or quality checks.
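One way to picture the four feature types described above is as a small record that is flattened into a numeric vector before being passed to a model. The field names, the units, and the ordering of entries in `to_vector` are assumptions made for this sketch; the disclosure does not fix a particular encoding.

```python
from dataclasses import dataclass


@dataclass
class SlideFeatures:
    """Hypothetical container for the features of one nucleotide-sample slide."""

    cpu_cores: float          # processor usage feature (CPU), e.g. cores per task
    fpga_subdivisions: float  # processor usage feature (FPGA)
    memory_gb: float          # memory requirements feature
    performance_hours: float  # performance time feature
    priority: float           # priority feature (larger = more urgent, by assumption)

    def to_vector(self):
        """Flatten the record into a model input (entry order is an assumption)."""
        return [
            self.performance_hours,
            self.cpu_cores,
            self.memory_gb,
            self.fpga_subdivisions,
            self.priority,
        ]


features = SlideFeatures(
    cpu_cores=8, fpga_subdivisions=2, memory_gb=64,
    performance_hours=3.5, priority=1.0,
)
feature_vector = features.to_vector()
```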
[0172] As further shown, the sequencing ordering system 106 provides nucleotide-sample-slide features 904 to the nucleotide-sample-slide ordering machine-learning model 916. To illustrate, the sequencing ordering system 106 can access or identify the nucleotide-sample-slide features 904 of the performance time feature 910, the processor usage feature 906 (CPU), the memory requirements feature 908, the processor usage feature 906 (FPGA), and the priority feature 912 as indicated in the following table:
[0173] As mentioned, the nucleotide-sample-slide ordering machine-learning model can be implemented utilizing a neural network. FIG. 10 illustrates an example architecture for a nucleotide-sample-slide ordering machine-learning model in accordance with one or more embodiments of the present disclosure.
[0174] As shown in FIG. 10, in certain implementations, the nucleotide-sample-slide ordering machine-learning model 1010 can be implemented as a fully connected neural network with four hidden layers and activation functions (e.g., a Multilayer Perceptron). The nucleotide-sample-slide ordering machine-learning model 1010 can be configured with model parameter(s) 1030 that include adjustable weights and biases. Notably, the sequencing ordering system 106 can utilize the nucleotide-sample-slide ordering machine-learning model 1010 with more or fewer hidden layers and neurons than shown in FIG. 10.
[0175] As shown, the nucleotide-sample-slide ordering machine-learning model 1010 includes a first hidden layer 1014, a second hidden layer 1016, a third hidden layer 1018, and a fourth hidden layer 1020. The input data neurons 1012 of the nucleotide-sample-slide ordering machine-learning model 1010 process the input data, which represent the nucleotide-sample slide (e.g., nucleotide-sample slide(s) 802a-802n) and nucleotide-sample-slide features (e.g., nucleotide-sample-slide features 804a-804n), and pass the input data to the first hidden layer 1014. Further, in some cases, similar to the neural network of FIG. 5, the nucleotide-sample-slide ordering machine-learning model 1010 transmits a vector or data signal from each of the input data neurons 1012 to each neuron in the first hidden layer 1014, multiplied by a corresponding weight (e.g., from model parameter(s) 1030). These products are summed, resulting in a weighted sum for each hidden neuron of the first hidden layer 1014. In some embodiments, a bias term (e.g., from model parameter(s) 1030) for each neuron in the first hidden layer 1014 is added to the weighted sum, which allows the threshold of an activation function 1015 to be adjusted. As further shown, the result of the weighted sum plus the bias is passed through the activation function (e.g., ReLU, Sigmoid, Tanh) for each neuron of the first hidden layer 1014.
[0176] Similarly, the nucleotide-sample-slide ordering machine-learning model 1010 sends the activated value of each neuron in the first hidden layer 1014 to each neuron in the second hidden layer 1016. As with the first hidden layer 1014, the nucleotide-sample-slide ordering machine-learning model 1010 calculates a weighted sum of inputs for each neuron from the previous layer, adds a bias, and then applies an activation function 1017. Further, the nucleotide-sample-slide ordering machine-learning model 1010 repeats this process for the third hidden layer 1018 with a corresponding activation function 1019 and for the fourth hidden layer 1020 with a corresponding activation function 1021.
[0177] Notably, by utilizing four hidden layers, the nucleotide-sample-slide ordering machine-learning model 1010 has the capacity to learn even more complex patterns by combining the features extracted by each of the hidden layers. As shown, the nucleotide-sample-slide ordering machine-learning model 1010 applies a final activation function 1023 to obtain the slide ordering scores 1022. As mentioned, the nucleotide-sample-slide ordering machine-learning model 1010 provides the slide ordering scores 1022 based on the model parameter(s) 1030. The model parameter(s) 1030, which include the weights and biases across the layers of the nucleotide-sample-slide ordering machine-learning model 1010, are optimized using a genetic algorithm.
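The layer-by-layer computation described above (weighted sum, plus bias, through an activation, repeated across four hidden layers and a final activation) can be sketched in plain Python as follows. The layer widths, the choice of ReLU for hidden layers and a sigmoid for the final activation, the random initialization range, and the normalized input values are all assumptions for illustration; the disclosure names ReLU, Sigmoid, and Tanh only as examples.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Numerically stable form that avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def init_parameters(layer_sizes, seed=0):
    """Randomly initialize (weights, biases) for each pair of adjacent layers."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
        params.append((weights, biases))
    return params


def forward(features, params):
    """Propagate one feature vector through every layer and return the outputs."""
    activations = list(features)
    last = len(params) - 1
    for layer_index, (weights, biases) in enumerate(params):
        activation = sigmoid if layer_index == last else relu
        activations = [
            activation(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# 5 input features, four hidden layers of 8 neurons each, 1 output score.
params = init_parameters([5, 8, 8, 8, 8, 1])
score = forward([0.35, 0.25, 0.25, 0.5, 1.0], params)[0]
```

Because every weight and bias lives in `params`, the whole model can be treated as one flat parameter set, which is what makes it convenient to optimize with a genetic algorithm rather than gradient descent.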
[0178] FIGS. 11A-11C illustrate selecting the highest performing nucleotide-sample-slide ordering machine-learning model utilizing a genetic algorithm in accordance with one or more embodiments of the present disclosure. As shown in FIGS. 11A-11C, the sequencing ordering system 106 selects a highest performing nucleotide-sample-slide ordering machine-learning model for the nucleotide-sample-slide ordering machine-learning model(s) 1110 utilizing a genetic algorithm.
[0179] Similar to the genetic algorithm described with respect to FIGS. 6A-6B and as shown in FIGS. 11A-11C, the sequencing ordering system 106 selects the nucleotide-sample-slide ordering machine-learning model 1150 utilizing a genetic algorithm. As shown in FIG. 11A, the sequencing ordering system 106 determines a set of initial nucleotide-sample-slide ordering machine-learning model(s) 1110 and randomly initializes each of the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 with different model parameters (e.g., weights and biases). In certain implementations, the sequencing ordering system 106 utilizes a set of initial nucleotide-sample-slide ordering machine-learning model(s) 1110 with a population size of 8192. As described in more detail in relation to FIGS. 6A-6B, the sequencing ordering system 106 can determine makespan scores to evaluate the fitness of the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 based on the scheduled sequencing tasks.
[0180] Based on their fitness (e.g., makespan scores), a subset of the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 is selected to serve as parent nucleotide-sample-slide ordering machine-learning model(s) 1120 for the next generation. In some embodiments, the sequencing ordering system 106 evaluates the output of the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 to determine a makespan value and includes a penalty calculation (e.g., penalized makespan) based on a priority multiplier, with a priority penalty that penalizes long or poorly scheduled nucleotide-sample slides and tasks. For example, the sequencing ordering system 106 can evaluate the loss (or fitness) of a model using:
$$\sum_{\text{slides}} \left(\text{max task completion time since arrival (hours)}\right)^{2} \cdot \text{priority penalty}$$
In certain implementations, the sequencing ordering system 106 determines a set of the parent nucleotide-sample-slide ordering machine-learning model(s) 1120 with a population size of 128.
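To make the penalized-makespan loss concrete, the sketch below reads the expression above as a sum over slides of the squared time from a slide's arrival to the completion of its last task, multiplied by that slide's priority penalty. The summation over slides is an interpretation of the extracted formula, and the dictionary keys and numeric values are hypothetical.

```python
def penalized_makespan_loss(slides):
    """Sum, over slides, of (longest completion time since arrival)^2 * penalty.

    Each slide is a dict with:
      arrival_hour           - when the slide entered the queue
      task_completion_hours  - absolute completion time of each of its tasks
      priority_penalty       - multiplier that grows with slide priority
    """
    total = 0.0
    for slide in slides:
        longest = max(slide["task_completion_hours"]) - slide["arrival_hour"]
        total += longest ** 2 * slide["priority_penalty"]
    return total


slides = [
    {"arrival_hour": 0.0, "task_completion_hours": [2.0, 3.0], "priority_penalty": 1.0},
    {"arrival_hour": 1.0, "task_completion_hours": [3.0, 5.0], "priority_penalty": 2.0},
]
loss = penalized_makespan_loss(slides)  # 3^2 * 1 + 4^2 * 2 = 41.0
```

Squaring the completion time makes one badly delayed slide cost more than several mildly delayed ones, so a model minimizing this loss is pushed away from schedules that starve individual slides.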
[0181] As further shown, pairs of the parent nucleotide-sample-slide ordering machine-learning model(s) 1120 are combined to produce candidate nucleotide-sample-slide ordering machine-learning model(s) 1130 using crossover or recombination. As disclosed in more detail in relation to FIG. 5, the sequencing ordering system 106 selects crossover points, and the genetic information is mixed between two parent nucleotide-sample-slide ordering machine-learning model(s) 1120 to create one or more candidate nucleotide-sample-slide ordering machine-learning model(s) 1130. The candidate nucleotide-sample-slide ordering machine-learning model(s) 1130 are evaluated for their fitness in the same way as the initial nucleotide-sample-slide ordering machine-learning model(s) 1110 using a fitness function based on a set of training data to generate predicted ordering scores and a makespan value. In certain implementations, the sequencing ordering system 106 utilizes a set of candidate nucleotide-sample-slide ordering machine-learning model(s) 1130 with a population size of 8192.
[0182] Similar to the genetic algorithm of FIGS. 6A-6B, the sequencing ordering system 106 selects a highest performing candidate nucleotide-sample-slide ordering machine-learning model 1140 as the fittest model from the candidate nucleotide-sample-slide ordering machine-learning model(s) 1130. In this selection process, the sequencing ordering system 106 can utilize the cycle of selection, crossover, mutation, and evaluation for a predetermined number of generations or until a satisfactory level of fitness is achieved for the candidate nucleotide-sample-slide ordering machine-learning model(s) 1130. As further shown, the sequencing ordering system 106 selects the nucleotide-sample-slide ordering machine-learning model 1150 from between a previously configured nucleotide-sample-slide ordering machine-learning model 1142 and the highest performing candidate nucleotide-sample-slide ordering machine-learning model 1140 based on a validation test set (e.g., 25,000 nucleotide-sample slides, 2.5 years of simulated time). For example, the sequencing ordering system 106 can maintain a best model that is a previously configured nucleotide-sample-slide ordering machine-learning model 1142 or select the highest performing candidate nucleotide-sample-slide ordering machine-learning model 1140. In this way, the sequencing ordering system 106 can determine the best performing nucleotide-sample-slide ordering machine-learning model for the specific validation test set.
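The cycle of selection, crossover, mutation, and evaluation described above can be sketched as a small loop over flat parameter lists. This is a minimal illustration under stated assumptions: single-point crossover, Gaussian mutation, small population sizes (the disclosure mentions 8192 candidates and 128 parents), and a toy quadratic loss standing in for the simulated penalized makespan. The names `evolve` and `toy_loss` are hypothetical.

```python
import random


def evolve(loss_fn, genome_length, population_size=64, parent_count=8,
           generations=20, mutation_scale=0.1, seed=0):
    """Return the lowest-loss genome found; a genome is a flat list of weights."""
    rng = random.Random(seed)
    # Random initialization of every model in the initial population.
    population = [
        [rng.uniform(-1.0, 1.0) for _ in range(genome_length)]
        for _ in range(population_size)
    ]
    best = min(population, key=loss_fn)
    for _ in range(generations):
        # Selection: keep the fittest models as parents.
        parents = sorted(population, key=loss_fn)[:parent_count]
        population = []
        for _ in range(population_size):
            mother, father = rng.sample(parents, 2)
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: small Gaussian perturbation of every parameter.
            child = [gene + rng.gauss(0.0, mutation_scale) for gene in child]
            population.append(child)
        # Keep the previous best unless a candidate is strictly better.
        challenger = min(population, key=loss_fn)
        if loss_fn(challenger) < loss_fn(best):
            best = challenger
    return best


# Toy loss (assumption): squared distance to a hidden target parameter set.
TARGET = [0.5, -0.25, 0.75, 0.0]


def toy_loss(genome):
    return sum((g - t) ** 2 for g, t in zip(genome, TARGET))


initial_best = evolve(toy_loss, genome_length=4, generations=0)
evolved_best = evolve(toy_loss, genome_length=4, generations=20)
```

The final comparison inside the loop mirrors the choice between a previously configured model and the highest performing candidate: the stored model is replaced only when a candidate scores better.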
[0183] As mentioned, the sequencing ordering system 106 can utilize a two-tier sequencing ordering system that integrates an embodiment of the nucleotide-sample-slide ordering machine-learning model and an embodiment of the sequencing-task ordering machine-learning model to order nucleotide-sample slides and sequencing tasks more efficiently. FIG. 12 illustrates a schematic diagram of utilizing a combination of the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model to order sequencing tasks in accordance with one or more embodiments of the present disclosure.
[0184] For example, the sequencing ordering system 106 can utilize a nucleotide-sample-slide ordering machine-learning model 1206 to access or identify a set of the nucleotide-sample-slide features 1204 for a nucleotide-sample-slide(s) 1202 and generate slide ordering scores 1210 indicating a nucleotide-sample-slide relative order 1208 of the set of nucleotide-sample slides based on the set of the nucleotide-sample-slide features 1204. In particular, as shown, the sequencing ordering system 106 can generate slide ordering scores 1210 indicating the values for a slide order that maximizes the efficiency of the sequencing tasks for the nucleotide-sample-slide(s) 1202. As further shown, the sequencing ordering system 106 can select a nucleotide-sample slide from the nucleotide-sample-slide(s) 1202 based on the relative order for the set of nucleotide-sample slides as provided by the nucleotide-sample-slide relative order 1208 and the slide ordering scores 1210. Furthermore, the sequencing ordering system 106 can access or identify a set of the sequencing task features 1214 for the nucleotide-sample-slide(s) 1202 and provide the set of the sequencing task features 1214 to a sequencing-task ordering machine-learning model 1216 for ordering the set of sequencing tasks 1212. As also shown, the sequencing-task ordering machine-learning model 1216 can generate task ordering scores 1220 indicating a sequencing task relative order 1218 for the set of sequencing tasks based on the sequencing task features 1214 for the set of sequencing tasks 1212 and perform the set of sequencing tasks 1212 according to the task ordering scores 1220.
[0185] To illustrate, as previously described in relation to FIG. 8, the sequencing ordering system 106 can incorporate a first tier that utilizes the nucleotide-sample-slide ordering machine-learning model 1206 to determine slide ordering scores 1210. For example, the sequencing ordering system 106 can access or identify the nucleotide-sample-slide(s) 1202. As further shown, the sequencing ordering system 106 can identify nucleotide-sample-slide features 1204 associated with the nucleotide-sample-slide(s) 1202. The sequencing ordering system 106 then utilizes the nucleotide-sample-slide ordering machine-learning model 1206 to generate slide ordering scores 1210 and determine a nucleotide-sample-slide relative order 1208.
[0186] Further, as previously described in relation to FIG. 3 and as depicted in FIG. 12, the sequencing ordering system 106 can incorporate a second tier that utilizes the sequencing-task ordering machine-learning model 1216. As shown, in some embodiments, the sequencing ordering system 106 provides the nucleotide-sample-slide relative order 1208 (e.g., the slide ordering scores 1210) for the nucleotide-sample-slide(s) 1202 from the nucleotide-sample-slide ordering machine-learning model 1206. Further, the sequencing ordering system 106 identifies the set of sequencing tasks 1212 and the sequencing task features 1214 for each of the nucleotide-sample-slide(s) 1202. As indicated above, the sequencing ordering system 106 utilizes the sequencing-task ordering machine-learning model 1216 to generate task ordering scores 1220 and determine a sequencing task relative order 1218 for each set of sequencing tasks 1212. As shown in FIG. 12, the sequencing ordering system 106 can iteratively utilize the sequencing-task ordering machine-learning model 1216 to generate task ordering scores 1220 and determine a sequencing task relative order for each set of sequencing tasks 1212 and sequencing task features 1214 for each of the nucleotide-sample-slide(s) 1202 based on the slide ordering scores 1210 provided by the nucleotide-sample-slide ordering machine-learning model 1206.
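The two-tier flow described above can be sketched as an outer ordering over slides followed by an inner ordering over each slide's tasks. The scoring functions here are simple stand-ins (slide priority, and shortest task first) for the two machine-learning models, and the data layout and names are assumptions for illustration only.

```python
def order_by_score(items, score_fn):
    """Sort items so the highest score comes first."""
    return sorted(items, key=score_fn, reverse=True)


def two_tier_schedule(slides, slide_score_fn, task_score_fn):
    """Order slides first (tier one), then each slide's tasks (tier two).

    slides maps slide_id -> {"features": ..., "tasks": {task_id: task_features}}.
    Returns a flat list of (slide_id, task_id) pairs in scheduled order.
    """
    schedule = []
    slide_order = order_by_score(
        list(slides), lambda sid: slide_score_fn(slides[sid]["features"])
    )
    for slide_id in slide_order:
        tasks = slides[slide_id]["tasks"]
        task_order = order_by_score(list(tasks), lambda tid: task_score_fn(tasks[tid]))
        schedule.extend((slide_id, task_id) for task_id in task_order)
    return schedule


# Stand-in scorers (assumptions): a trained model would replace each of these.
def slide_score(features):
    return features["priority"]


def task_score(features):
    return -features["hours"]  # shorter tasks receive higher scores


slides = {
    "slide_a": {
        "features": {"priority": 1.0},
        "tasks": {"align": {"hours": 4.0}, "base_call": {"hours": 1.0}},
    },
    "slide_b": {
        "features": {"priority": 3.0},
        "tasks": {"variant_call": {"hours": 2.0}, "demux": {"hours": 0.5}},
    },
}
schedule = two_tier_schedule(slides, slide_score, task_score)
```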
[0187] FIGS. 13A-13B illustrate graphs of the distribution of a penalized makespan utilizing different and existing ordering strategies in comparison with the sequencing ordering system 106. FIG. 13A illustrates the initial makespan distribution values and FIG. 13B illustrates the makespan values for the tail of the makespan distribution after four hours.
[0188] In particular, FIGS. 13A-13B represent the distribution of the penalized makespan over 200,000 nucleotide-sample slides utilizing four different ordering strategies. To give perspective, a sequencing device would need around 20 years to complete sequencing runs for 200,000 nucleotide-sample slides. As shown, the sequencing ordering system 106 utilizing only the sequencing-task ordering machine-learning model performs better than the FIFO Method, showing an improvement of 15-25% in median makespan and 5-15% in average makespan scores. In addition, the scheduling strategy utilizing only the nucleotide-sample-slide ordering machine-learning model performs better than the FIFO Method, showing an improvement of 5-15% in median makespan and average makespan scores. Further, the sequencing ordering system 106 utilizing the two-tier sequencing ordering system with both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model performs nearly 30% better in median makespan and 20% better in average makespan scores than the FIFO Method. FIGS. 13A-13B depict results for the sequencing ordering system 106 using the nucleotide-sample-slide ordering machine-learning model depicted in FIG. 5 and the sequencing-task ordering machine-learning model depicted in FIG. 10.
[0189] FIG. 14 illustrates a graph of the makespan compared against an average task load for the sequencing ordering system 106 in comparison to different and existing ordering strategies. In particular, FIG. 14 illustrates the intensity of a test case as a percentage of max load (sum of weight • time across tasks in units of normalized resources • hours). For example, as shown in FIG. 14, the load can be defined as:
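$$\text{Load (intensity)} = \sum_{\text{sequencing tasks}} \frac{1}{3}\left(\frac{\text{CPU usage}}{\text{CPU capacity}} + \frac{\text{RAM usage}}{\text{RAM capacity}} + \frac{\text{FPGA usage}}{\text{FPGA capacity}}\right) \cdot \left(\text{task duration (hours)}\right)$$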
The graph shows the 5th percentile (reference number 1402), the 95th percentile (reference number 1404), and quantile 1 (Q1) (reference number 1406) of makespan per nucleotide-sample slide against average task load (intensity) of the test set of 200,000 nucleotide-sample slides (20 years of simulated time). As shown, the sequencing ordering system 106 utilizing the two-tier sequencing ordering system of both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model provides the lowest makespan values. FIG. 14 depicts results for the sequencing ordering system 106 using the nucleotide-sample-slide ordering machine-learning model depicted in FIG. 5 and the two-tier sequencing ordering system depicted in FIG. 12. Notably, the two-tier sequencing ordering system of both the nucleotide-sample-slide ordering machine-learning model and the sequencing-task ordering machine-learning model also shows a noticeable improvement over both a Tetris Heuristic Model and a FIFO Method, and the improvement is even greater for higher intensity test cases.
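As a worked example of the load (intensity) definition, the sketch below reads it as a sum over sequencing tasks of the mean fractional CPU, RAM, and FPGA utilization multiplied by task duration in hours, which yields units of normalized resources times hours. The additive combination of the three ratios is an interpretation of the extracted formula, and the capacities and task values are hypothetical.

```python
def load_intensity(tasks, capacity):
    """Sum of (mean fractional resource usage) * (duration in hours) over tasks."""
    total = 0.0
    for task in tasks:
        utilization = (
            task["cpu"] / capacity["cpu"]
            + task["ram"] / capacity["ram"]
            + task["fpga"] / capacity["fpga"]
        ) / 3.0
        total += utilization * task["hours"]
    return total


# Hypothetical machine capacity and two sequencing tasks.
capacity = {"cpu": 32, "ram": 256, "fpga": 4}
tasks = [
    {"cpu": 8, "ram": 64, "fpga": 1, "hours": 2.0},    # 0.25 utilization * 2 h
    {"cpu": 16, "ram": 128, "fpga": 2, "hours": 4.0},  # 0.50 utilization * 4 h
]
load = load_intensity(tasks, capacity)  # 0.5 + 2.0 = 2.5
```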
[0190] FIG. 15 illustrates a graph showing a comparison of the performance of the sequencing-task ordering machine-learning model when trained utilizing the genetic algorithm (as described above in relation to FIG. 12) when compared to a Tetris heuristic training model in accordance with one or more embodiments of the present disclosure. Notably, the sequencing-task ordering machine-learning model converges much more quickly than the traditional Tetris heuristic training approach (at reference number 1504). Indeed, as shown, the average performance 1502 of the sequencing ordering system 106 shows a 1%-2% improvement over the Tetris heuristics when trained on less than 5 days of data and less than 10 iterations deep. As also shown, the sequencing ordering system 106 continues to outperform the Tetris heuristics baseline over further training iterations (shown by the average performance 1502 at 150 iterations deep).
[0191] FIGS. 1-15, the corresponding text, and the examples provide a number of different methods, systems, devices, and non-transitory computer-readable media of the sequencing ordering system 106. In addition to the foregoing, one or more implementations can also be described in terms of flowcharts comprising acts for accomplishing a particular result, as shown in FIGS. 16-18. FIG. 16 illustrates a flowchart of a series of acts 1600 for generating task ordering scores and performing a set of sequencing tasks in accordance with one or more embodiments of the present disclosure. FIG. 17 illustrates a flowchart of a series of acts for transmitting genomic samples to computing devices in accordance with one or more embodiments of the present disclosure. FIG. 18 illustrates a flowchart of a series of acts for generating slide ordering scores and performing sequencing tasks in accordance with one or more embodiments of the present disclosure. While FIGS. 16-18 illustrate acts according to one embodiment, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIGS. 16-18. The acts of FIGS. 16-18 can be performed as part of a method. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device or a system to perform the acts depicted in FIGS. 16-18. In still further embodiments, a system comprises an imaging system, a fluidic system, and a computer comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by one or more processors, cause the system to perform the acts of FIGS. 16-18.
[0192] As shown in FIG. 16, the series of acts 1600 includes an act 1602 of determining a set of sequencing task features for a set of sequencing tasks, an act 1604 of providing the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks, an act 1606 of generating task ordering scores indicating a relative order of the set of sequencing tasks, and an act 1608 of performing the set of sequencing tasks according to the task ordering scores. For example, the series of acts 1600 can include acts to perform any of the operations described in the following clauses:
CLAUSE 1. A computer-implemented method comprising: determining, for a set of sequencing tasks, a set of sequencing task features indicating at least available computing resources and a performance time associated with respective sequencing tasks of the set of sequencing tasks; providing the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks; generating, utilizing the sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on the set of sequencing task features; and performing the set of sequencing tasks according to the task ordering scores.
CLAUSE 2. The computer-implemented method of clause 1, wherein: the set of sequencing tasks comprises a set of primary sequencing tasks associated with base calling for nucleotide reads of a genomic sample; or the set of sequencing tasks comprises a set of secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
CLAUSE 3. The computer-implemented method of clause 2, wherein the set of primary sequencing tasks includes one or more of generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of genomic samples, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads.
CLAUSE 4. The computer-implemented method of clause 2, wherein the set of secondary sequencing tasks includes one or more of genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
CLAUSE 5. The computer-implemented method of clause 1, wherein the set of sequencing task features comprises one or more of task processor usage, task memory requirements, or task performance time.
CLAUSE 6. The computer-implemented method of clause 1, further comprising training the sequencing-task ordering machine-learning model by: identifying a set of parent sequencing-task ordering machine-learning models; generating, from the set of parent sequencing-task ordering machine-learning models, a set of candidate sequencing-task ordering machine-learning models comprising different weights and biases; generating predicted ordering scores from each candidate sequencing-task ordering machine-learning model of the set of candidate sequencing-task ordering machine-learning models; determining makespan scores for each candidate sequencing-task ordering machine-learning model of the set of candidate sequencing-task ordering machine-learning models based on the predicted ordering scores; and selecting a highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model based on comparing the makespan scores for each candidate model using a loss function.
CLAUSE 7. The computer-implemented method of clause 6, further comprising: comparing the makespan scores of the highest performing candidate sequencing-task ordering machine-learning model with a previously configured sequencing-task ordering machine-learning model; and selecting the highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model instead of the previously configured sequencing-task ordering machine-learning model based on a makespan score for the highest performing candidate sequencing-task ordering machine-learning model.
CLAUSE 8. The computer-implemented method of clause 1, further comprising: determining a set of nucleotide-sample-slide features for a set of nucleotide-sample slides indicating at least the available computing resources and a performance time associated with processing data from respective nucleotide-sample slides of the set of nucleotide-sample slides;
generating, utilizing a nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample slides based on the set of nucleotide-sample-slide features; selecting a nucleotide-sample slide from the set of nucleotide-sample slides based on the relative order of the set of nucleotide-sample slides; and performing the set of sequencing tasks for the selected nucleotide-sample slide based on the task ordering scores and the slide ordering scores.
CLAUSE 9. The computer-implemented method of clause 1, wherein the sequencing-task ordering machine-learning model comprises a neural network including an input layer for the set of sequencing task features, fully connected hidden layers, activation functions before and after the fully connected hidden layers, and an output layer that outputs the task ordering scores.
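By way of illustration only, the acts 1602-1608 can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function names `mlp_scores` and `order_tasks`, the single hidden layer with a tanh activation, the hand-picked weights, and the convention that a higher score runs earlier are all hypothetical choices made for the example.

```python
import math

def mlp_scores(features, w1, b1, w2, b2):
    """Score each task with a tiny fully connected network.

    Assumption: one hidden layer with tanh activation and a scalar output
    per task. features holds one [processor_usage, memory, time] row per task.
    """
    scores = []
    for x in features:
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)
        ]
        scores.append(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return scores

def order_tasks(task_names, scores):
    """Return task names sorted so the highest ordering score comes first
    (an assumed convention for this sketch)."""
    ranked = sorted(range(len(task_names)), key=lambda i: scores[i], reverse=True)
    return [task_names[i] for i in ranked]

# Hypothetical task features: [processor usage, memory in GB, performance time]
tasks = ["alignment", "variant_calling", "annotation"]
features = [[0.8, 16.0, 3.0], [0.5, 8.0, 1.0], [0.2, 4.0, 2.0]]

# Hand-picked weights so that the score equals -tanh(performance time),
# which places shorter tasks first.
w1, b1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]
w2, b2 = [0.0, -1.0], 0.0

order = order_tasks(tasks, mlp_scores(features, w1, b1, w2, b2))
print(order)  # ['variant_calling', 'annotation', 'alignment']
```

In practice the weights and biases would come from training rather than being set by hand; the example fixes them only so that the resulting order is easy to verify.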
[0193] As shown in FIG. 17, the series of acts 1700 includes an act 1702 of determining base calls for a set of indexing sequences; an act 1704 of determining a first subset of indexing sequences corresponding to a first genomic sample designated with a first set of processing parameters; an act 1706 of determining a second subset of indexing sequences corresponding to a second genomic sample designated with a second set of processing parameters; and an act 1708 of transmitting, for the first genomic sample, a first base-call-data file to a first computing device and, for the second genomic sample, a second base-call-data file to a second computing device. While acts 1702-1708 depicted in FIG. 17 can be performed independently from the acts depicted in FIG. 16 or FIG. 18, acts 1702-1708 can also be performed in conjunction with the acts depicted in FIG. 16 or FIG. 18. For example, the series of acts 1700 can include acts to perform any of the operations described in the following clauses:
CLAUSE 10. The computer-implemented method of clause 1, further comprising performing the set of sequencing tasks in part by: determining, for a sequencing run, base calls for a set of indexing sequences within clusters of oligonucleotides on a nucleotide-sample slide; determining, during the sequencing run, a first subset of indexing sequences corresponding to a first genomic sample designated with a first set of processing parameters; determining, during the sequencing run, a second subset of indexing sequences corresponding to a second genomic sample designated with a second set of processing parameters; and transmitting, for the first genomic sample, a first base-call-data file to a first computing device based on the first set of processing parameters and, for the second genomic sample, a second base-call-data file to a second computing device based on the second set of processing parameters.
CLAUSE 11. The computer-implemented method of clause 10, wherein: the first set of processing parameters specify one or more of a secondary sequencing task for the first genomic sample, analysis rights for the first genomic sample, a category of analysis for the first genomic sample, or a sample size for the first genomic sample; and the second set of processing parameters specify one or more of a secondary sequencing task for the second genomic sample, analysis rights for the second genomic sample, a category of analysis for the second genomic sample, or a sample size for the second genomic sample.
CLAUSE 12. The computer-implemented method of clause 10, further comprising determining base calls for a set of indexing sequences within clusters of oligonucleotides by: determining base calls for a first subset of indexing sequences appended to sample genomic sequences of a first genomic sample and base calls for a second subset of indexing sequences appended to sample genomic sequences of a second genomic sample; and after determining the base calls for the first subset of indexing sequences and the second subset of indexing sequences, determining, for the first genomic sample and the second genomic sample, base calls for first nucleotide reads and second nucleotide reads respectively corresponding to first portions and second portions of the sample genomic sequences of the first genomic sample and the second genomic sample.
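By way of illustration only, the acts 1702-1708 can be sketched as a lookup from base-called indexing sequences to genomic samples and their designated processing parameters. This is a minimal sketch under stated assumptions: the function name `route_reads`, the dictionary layout of `sample_sheet`, and the `device` field are hypothetical and do not come from the disclosure.

```python
def route_reads(index_calls, sample_sheet):
    """Group cluster identifiers by (sample, destination device).

    index_calls: {cluster_id: base-called indexing sequence}
    sample_sheet: {indexing sequence: {"sample": ..., "params": {...}}}
    Clusters whose index is not in the sample sheet are left unassigned
    in this sketch.
    """
    routed = {}
    for cluster_id, index_seq in index_calls.items():
        entry = sample_sheet.get(index_seq)
        if entry is None:
            continue
        key = (entry["sample"], entry["params"]["device"])
        routed.setdefault(key, []).append(cluster_id)
    return routed

# Hypothetical inputs for two genomic samples with different parameters.
sample_sheet = {
    "ACGT": {"sample": "S1", "params": {"task": "variant_calling", "device": "node-a"}},
    "TTAG": {"sample": "S2", "params": {"task": "alignment", "device": "node-b"}},
}
index_calls = {0: "ACGT", 1: "TTAG", 2: "ACGT", 3: "GGGG"}

routed = route_reads(index_calls, sample_sheet)
print(routed)  # {('S1', 'node-a'): [0, 2], ('S2', 'node-b'): [1]}
```

Each group in `routed` corresponds to the clusters whose base calls would be written to one base-call-data file and transmitted to one computing device.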
[0194] As shown in FIG. 18, the series of acts 1800 includes an act 1802 of determining a set of nucleotide-sample-slide features for a set of nucleotide-sample slides, an act 1804 of providing the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample slides, an act 1806 of generating slide ordering scores indicating a relative order of the set of nucleotide-sample slides, and an act 1808 of performing sequencing tasks for the set of nucleotide-sample slides according to the slide ordering scores. For example, the series of acts 1800 can include acts to perform any of the operations described in the following clauses:
CLAUSE 13. A computer-implemented method comprising: determining, for a set of nucleotide-sample slides, a set of nucleotide-sample-slide features indicating at least available computing resources and a performance time associated with processing data for respective nucleotide-sample slides of the set of nucleotide-sample slides; providing the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample slides; generating, utilizing the nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample slides based on the set of nucleotide-sample-slide features; and
performing sequencing tasks for the set of nucleotide-sample slides according to the slide ordering scores.
CLAUSE 14. The computer-implemented method of clause 13, wherein the set of nucleotide-sample-slide features comprise a set of priority features indicating a relative priority of the respective nucleotide-sample slides.
CLAUSE 15. The computer-implemented method of clause 13, wherein the set of nucleotide-sample-slide features comprises one or more of processor usage for processing data associated with a nucleotide-sample slide of the set of nucleotide-sample slides, memory requirements for processing data associated with the nucleotide-sample slide, or performance time associated with processing data for the nucleotide-sample slide.
CLAUSE 16. The computer-implemented method of clause 13, wherein: the performance time associated with processing data from the respective nucleotide-sample slides of the set of nucleotide-sample slides comprises the performance time associated with a set of primary sequencing tasks associated with base calling for nucleotide reads of a genomic sample; or the performance time associated with processing data from the respective nucleotide-sample slides of the set of nucleotide-sample slides comprises the performance time associated with a set of secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
CLAUSE 17. The computer-implemented method of clause 16, wherein the set of primary sequencing tasks includes one or more of generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of the genomic sample, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads.
CLAUSE 18. The computer-implemented method of clause 16, wherein the set of secondary sequencing tasks includes one or more of genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
CLAUSE 19. The computer-implemented method of clause 13, further comprising training the nucleotide-sample-slide ordering machine-learning model by: identifying a set of parent nucleotide-sample-slide ordering machine-learning models;
generating, from the set of parent nucleotide-sample-slide ordering machine-learning models, a set of candidate nucleotide-sample-slide ordering machine-learning models comprising different weights and biases; generating predicted ordering scores from each candidate nucleotide-sample-slide ordering machine-learning model of the set of candidate nucleotide-sample-slide ordering machine-learning models; determining makespan scores for each candidate nucleotide-sample-slide ordering machine-learning model of the set of candidate nucleotide-sample-slide ordering machine-learning models based on the predicted ordering scores; and selecting a highest performing candidate nucleotide-sample-slide ordering machine-learning model as the nucleotide-sample-slide ordering machine-learning model based on comparing the makespan scores for each candidate model using a loss function.
CLAUSE 20. The computer-implemented method of clause 19, further comprising: comparing the makespan scores of the highest performing candidate nucleotide-sample-slide ordering machine-learning model with a previously configured nucleotide-sample-slide ordering machine-learning model; and selecting the highest performing candidate nucleotide-sample-slide ordering machine-learning model as the nucleotide-sample-slide ordering machine-learning model instead of the previously configured nucleotide-sample-slide ordering machine-learning model based on a makespan score for the highest performing candidate nucleotide-sample-slide ordering machine-learning model.
CLAUSE 21. The computer-implemented method of clause 13, further comprising: selecting a set of sequencing tasks associated with a nucleotide-sample slide from the set of nucleotide-sample slides; determining a set of sequencing task features for the set of sequencing tasks indicating at least the available computing resources and a performance time associated with respective sequencing tasks of the set of sequencing tasks; generating, utilizing a sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on the set of sequencing task features; and performing the set of sequencing tasks for the nucleotide-sample slide according to the task ordering scores.
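By way of illustration only, the training recited in clauses 6 and 19 (parent models, candidate models with different weights, makespan scores, and selection of the highest performing candidate) can be sketched as follows. This is a minimal sketch under stated assumptions: a linear scoring model stands in for the neural network, candidates are produced by Gaussian perturbation of parent weights, the makespan is computed by greedy list scheduling over identical workers, and the makespan itself serves as the loss. None of these specifics, nor the names `makespan`, `evaluate`, and `evolve`, come from the disclosure.

```python
import heapq
import random

def makespan(durations_in_order, n_workers):
    """Greedy list scheduling: each item, in order, goes to the worker
    that becomes free first. Returns the time the last worker finishes."""
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for duration in durations_in_order:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)

def evaluate(weights, items, n_workers):
    """Order items by descending linear score, then return the makespan."""
    def score(item):
        return sum(w * f for w, f in zip(weights, item["features"]))
    ordered = sorted(items, key=score, reverse=True)
    return makespan([item["duration"] for item in ordered], n_workers)

def evolve(parents, items, n_workers, n_children=8, sigma=0.5, seed=0):
    """Perturb each parent to create candidates and keep the candidate
    (parents included) with the lowest makespan."""
    rng = random.Random(seed)
    candidates = [list(p) for p in parents]
    for parent in parents:
        for _ in range(n_children):
            candidates.append([w + rng.gauss(0.0, sigma) for w in parent])
    return min(candidates, key=lambda w: evaluate(w, items, n_workers))

# Hypothetical items: one feature (the duration) per task or slide.
items = [{"features": [d], "duration": d} for d in (1.0, 1.0, 1.0, 1.0, 4.0)]
parent = [-1.0]  # negative weight: shortest item first
best = evolve([parent], items, n_workers=2)
print(evaluate(parent, items, 2), evaluate(best, items, 2))
```

Because the parents are kept in the candidate pool, the selected model never has a worse makespan than the previously configured model, which mirrors the comparison described in clauses 7 and 20.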
[0195] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic-acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
[0196] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
[0197] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used, as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible, as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
[0198] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
[0199] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) "A sequencing method based on real-time pyrophosphate." Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of a nucleotide at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed, and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
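By way of illustration only, the variable number of incorporations per nucleotide delivery in terminator-free methods such as pyrosequencing can be sketched with an idealized flowgram. This is a minimal sketch under stated assumptions: the function name `flowgram`, the flow order, and the noise-free signal equal to the number of incorporated nucleotides are hypothetical simplifications.

```python
def flowgram(nascent, flow_order="TACG", max_flows=100):
    """Simulate idealized signals for one feature.

    nascent: the sequence of the strand being synthesized.
    Each flow delivers one nucleotide type; the signal is the number of
    consecutive matching bases incorporated during that flow (zero when
    the next base does not match the delivered type).
    """
    signals = []
    pos = 0
    flows = 0
    while pos < len(nascent) and flows < max_flows:
        base = flow_order[flows % len(flow_order)]
        run = 0
        while pos < len(nascent) and nascent[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
        flows += 1
    return signals

print(flowgram("TTACG"))  # [('T', 2), ('A', 1), ('C', 1), ('G', 1)]
print(flowgram("GGA"))    # [('T', 0), ('A', 0), ('C', 0), ('G', 2), ('T', 0), ('A', 1)]
```

Two features with different sequence content produce different signals for the same flow while keeping the same array position, which is the property the image analysis relies on.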
[0200] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina, Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
[0201] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed, and analyzed as set forth herein. Following the image capture step, labels can be removed and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
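By way of illustration only, the four-image case described above can be sketched as choosing, for each feature and each cycle, the detection channel with the strongest signal. This is a minimal sketch under stated assumptions: the channel-to-base order in `CHANNEL_TO_BASE`, the function name `call_read`, and the use of a plain maximum without crosstalk or phasing correction are hypothetical simplifications.

```python
# Hypothetical mapping from detection-channel index to nucleotide type.
CHANNEL_TO_BASE = ("A", "C", "G", "T")

def call_read(cycles):
    """Call one base per cycle for a single feature.

    cycles: list of 4-tuples, one intensity per detection channel,
    measured at the same feature position in each cycle's four images.
    """
    read = []
    for intensities in cycles:
        brightest = max(range(4), key=lambda ch: intensities[ch])
        read.append(CHANNEL_TO_BASE[brightest])
    return "".join(read)

cycles = [
    (0.9, 0.1, 0.0, 0.1),  # cycle 1: first channel brightest
    (0.1, 0.2, 0.8, 0.1),  # cycle 2: third channel brightest
    (0.0, 0.1, 0.1, 0.7),  # cycle 3: fourth channel brightest
]
print(call_read(cycles))  # AGT
```

The sketch depends on the feature keeping the same position across images, as the paragraph above notes, so that intensities from different channels and cycles can be compared for the same feature.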
[0202] In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include a fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15: 1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension, but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.
[0203] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.
[0204] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g., via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g., dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g., dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g., dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel (e.g., dGTP having no label).
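By way of illustration only, the exemplary two-channel configuration above (dATP in the first channel, dCTP in the second channel, dTTP in both channels, and unlabeled dGTP in neither) can be sketched as a simple decoding rule. This is a minimal sketch under stated assumptions: the function name `call_base_two_channel`, the normalized intensities, and the fixed `threshold` are hypothetical; a practical base caller would model intensity distributions rather than apply a single cutoff.

```python
def call_base_two_channel(ch1, ch2, threshold=0.5):
    """Decode one base from two channel intensities for one feature.

    Signal in channel 1 only -> A, channel 2 only -> C,
    both channels -> T, neither channel -> G (no label).
    """
    on1 = ch1 > threshold
    on2 = ch2 > threshold
    if on1 and on2:
        return "T"
    if on1:
        return "A"
    if on2:
        return "C"
    return "G"

# One (channel 1, channel 2) intensity pair per sequencing cycle.
cycles = [(0.9, 0.1), (0.1, 0.8), (0.9, 0.9), (0.05, 0.02)]
print("".join(call_base_two_channel(c1, c2) for c1, c2 in cycles))  # ACTG
```

Two images per cycle are enough here because the four nucleotide types occupy the four on/off combinations of the two channels.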
[0205] Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
[0206] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed, and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0207] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis." Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope." Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed, and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.
[0208] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed, and analyzed as set forth herein.
[0209] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009/0026082 Al; US 2009/0127589 Al; US 1910/0137143 Al; or US 1910/0282617 Al, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
[0210] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents, and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids are typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle, or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature), or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
[0211] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.
[0212] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and/or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines, and the like. A nucleotide-sample slide can be configured and/or used in an integrated system for detection of target nucleic acids. Exemplary nucleotide-sample slides are described, for example, in US 2010/0111768 A1 and US Ser. No. 13/273,666, each of which is incorporated herein by reference. As exemplified for nucleotide-sample slides, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13/273,666, which is incorporated herein by reference.
[0213] The sequencing system described above sequences nucleic-acid polymers present in samples received by a sequencing device. As defined herein, "sample" and its derivatives are used in their broadest sense and include any specimen, culture, and the like that is suspected of including a target. In some embodiments, the sample comprises DNA, RNA, PNA, LNA, chimeric or hybrid forms of nucleic acids. The sample can include any biological, clinical, surgical, agricultural, atmospheric, or aquatic-based specimen containing one or more nucleic acids. The term also includes any isolated nucleic acid sample such as genomic DNA, fresh-frozen or formalin-fixed paraffin-embedded nucleic acid specimen. It is also envisioned that the sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
[0214] The nucleic acid sample can include high molecular weight material such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another embodiment, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some embodiments, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some embodiments, the sample can be an epidemiological, agricultural, forensic, or pathogenic sample. In some embodiments, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another embodiment, the sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus, or fungus. In some embodiments, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0215] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and/or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation, or forensic samples obtained by law enforcement agencies, one or more military services, or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA-containing lysate, for example derived from a buccal swab, paper, fabric, or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of, DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including, but not limited to, blood, sputum, plasma, semen, urine, and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy, or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant, or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
[0216] The components of the sequencing ordering system 106 can include software, hardware, or both. For example, the components of the sequencing ordering system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device(s) 114). When executed by the one or more processors, the computer-executable instructions of the sequencing ordering system 106 can cause the computing devices to perform the methods described herein. Alternatively, the components of the sequencing ordering system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the sequencing ordering system 106 can include a combination of computer-executable instructions and hardware.
[0217] Furthermore, the components of the sequencing ordering system 106 performing the functions described herein with respect to the sequencing ordering system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and/or as a cloud-computing model. Thus, components of the sequencing ordering system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the sequencing ordering system 106 may be implemented in any application that provides sequencing services including, but not limited to, Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight” are either registered trademarks or trademarks of Illumina, Inc. in the United States and/or other countries.
[0218] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., a memory, etc.) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
[0219] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
[0220] Non-transitory computer-readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
[0221] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[0222] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
[0223] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[0224] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[0225] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
[0226] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[0227] FIG. 19 illustrates a block diagram of a computing device 1900 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1900 may implement the sequencing ordering system 106 and the sequencing system 104. As shown by FIG. 19, the computing device 1900 can comprise a processor 1902, a memory 1904, a storage device 1906, an I/O interface 1908, and a communication interface 1910, which may be communicatively coupled by way of a communication infrastructure 1912. In certain embodiments, the computing device 1900 can include fewer or more components than those shown in FIG. 19. The following paragraphs describe components of the computing device 1900 shown in FIG. 19 in additional detail.
[0228] In one or more embodiments, the processor 1902 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1902 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1904, or the storage device 1906 and decode and execute them. The memory 1904 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1906 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
[0229] The I/O interface 1908 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1900. The I/O interface 1908 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices, or a combination of such I/O interfaces. The I/O interface 1908 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O interface 1908 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.
[0230] The communication interface 1910 can include hardware, software, or both. In any event, the communication interface 1910 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1900 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1910 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
[0231] Additionally, the communication interface 1910 may facilitate communications with various types of wired or wireless networks. The communication interface 1910 may also facilitate communications using various communication protocols. The communication infrastructure 1912 may also include hardware, software, or both that couples components of the computing device 1900 to each other. For example, the communication interface 1910 may use one or more networks and/or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
[0232] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.
[0233] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps/acts or the steps/acts may be performed in differing orders. Additionally, the steps/acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps/acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims
1. A system comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to: determine, for a set of sequencing tasks, a set of sequencing task features indicating at least available computing resources and a performance time associated with respective sequencing tasks of the set of sequencing tasks; provide the set of sequencing task features to a sequencing-task ordering machine-learning model for ordering the set of sequencing tasks; generate, utilizing the sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on the set of sequencing task features; and perform the set of sequencing tasks according to the task ordering scores.
2. The system of claim 1, wherein: the set of sequencing tasks comprises a set of primary sequencing tasks associated with base calling for nucleotide reads of a genomic sample; or the set of sequencing tasks comprises a set of secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
3. The system of claim 2, wherein the set of primary sequencing tasks includes one or more of generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of genomic samples, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic samples, or base-call-quality scoring of base calls within the nucleotide reads.
4. The system of claim 2, wherein the set of secondary sequencing tasks includes one or more of genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
5. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the set of sequencing task features comprising one or more of task processor usage, task memory requirements, or task performance time.
6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to train the sequencing-task ordering machine-learning model by: identifying a set of parent sequencing-task ordering machine-learning models; generating, from the set of parent sequencing-task ordering machine-learning models, a set of candidate sequencing-task ordering machine-learning models comprising different weights and biases; generating predicted ordering scores from each candidate sequencing-task ordering machine-learning model of the set of candidate sequencing-task ordering machine-learning models; determining makespan scores for each candidate sequencing-task ordering machine-learning model of the set of candidate sequencing-task ordering machine-learning models based on the predicted ordering scores; and selecting a highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model based on comparing the makespan scores for each candidate model using a loss function.
7. The system of claim 6, further comprising instructions that, when executed by the at least one processor, cause the system to: compare the makespan scores of the highest performing candidate sequencing-task ordering machine-learning model with a previously configured sequencing-task ordering machine-learning model; and select the highest performing candidate sequencing-task ordering machine-learning model as the sequencing-task ordering machine-learning model instead of the previously configured sequencing-task ordering machine-learning model based on a makespan score for the highest performing candidate sequencing-task ordering machine-learning model.
8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine a set of nucleotide-sample-slide features for a set of nucleotide-sample slides indicating at least available computing resources and a performance time associated with processing data from respective nucleotide-sample slides of the set of nucleotide-sample slides; generate, utilizing a nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample slides based on the set of nucleotide-sample-slide features; select a nucleotide-sample slide from the set of nucleotide-sample slides based on the relative order of the set of nucleotide-sample slides; and perform the set of sequencing tasks for the selected nucleotide-sample slide based on the task ordering scores and the slide ordering scores.
9. The system of claim 1, wherein the sequencing-task ordering machine-learning model comprises a neural network including an input layer for the set of sequencing task features, fully connected hidden layers, activation functions before and after the fully connected hidden layers, and an output layer that outputs the task ordering scores.
10. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to perform the set of sequencing tasks in part by: determining, for a sequencing run, base calls for a set of indexing sequences within clusters of oligonucleotides on a nucleotide-sample slide; determining, during the sequencing run, a first subset of indexing sequences corresponding to a first genomic sample designated with a first set of processing parameters; determining, during the sequencing run, a second subset of indexing sequences corresponding to a second genomic sample designated with a second set of processing parameters; and transmitting, for the first genomic sample, a first base-call-data file to a first computing device based on the first set of processing parameters and, for the second genomic sample, a second base-call-data file to a second computing device based on the second set of processing parameters.
11. The system of claim 10, wherein: the first set of processing parameters specify one or more of a secondary sequencing task for the first genomic sample, analysis rights for the first genomic sample, a category of analysis for the first genomic sample, or a sample size for the first genomic sample; and the second set of processing parameters specify one or more of a secondary sequencing task for the second genomic sample, analysis rights for the second genomic sample, a category of analysis for the second genomic sample, or a sample size for the second genomic sample.
12. The system of claim 10, further comprising instructions that, when executed by the at least one processor, cause the system to determine base calls for a set of indexing sequences within clusters of oligonucleotides by: determining base calls for a first subset of indexing sequences appended to sample genomic sequences of a first genomic sample and base calls for a second subset of indexing sequences appended to sample genomic sequences of a second genomic sample; and after determining the base calls for the first subset of indexing sequences and the second subset of indexing sequences, determining, for the first genomic sample and the second genomic sample, base calls for first nucleotide reads and second nucleotide reads respectively corresponding to first portions and second portions of the sample genomic sequences of the first genomic sample and the second genomic sample.
13. A system comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to: determine, for a set of nucleotide-sample slides, a set of nucleotide-sample-slide features indicating at least available computing resources and a performance time associated with processing data for respective nucleotide-sample slides of the set of nucleotide-sample slides; provide the set of nucleotide-sample-slide features to a nucleotide-sample-slide ordering machine-learning model for ordering the set of nucleotide-sample slides; generate, utilizing the nucleotide-sample-slide ordering machine-learning model, slide ordering scores indicating a relative order of the set of nucleotide-sample slides based on the set of nucleotide-sample-slide features; and perform sequencing tasks for the set of nucleotide-sample slides according to the slide ordering scores.
14. The system of claim 13, wherein the set of nucleotide-sample-slide features comprise a set of priority features indicating a relative priority of the respective nucleotide-sample slides.
15. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to determine the set of nucleotide-sample-slide features comprising one or more of processor usage for processing data associated with a nucleotide-sample slide of the set of nucleotide-sample slides, memory requirements for processing data associated with the nucleotide-sample slide, or performance time associated with processing data for the nucleotide-sample slide.
16. The system of claim 13, wherein: the performance time associated with processing data from the respective nucleotide-sample slides of the set of nucleotide-sample slides comprises the performance time associated with a set of primary sequencing tasks associated with base calling for nucleotide reads of a genomic sample; or the performance time associated with processing data from the respective nucleotide-sample slides of the set of nucleotide-sample slides comprises the performance time associated with a set of secondary sequencing tasks associated with genotype calling based on the nucleotide reads or interpretation of the nucleotide reads.
17. The system of claim 16, wherein the set of primary sequencing tasks includes one or more of generating clusters of oligonucleotides on a nucleotide-sample slide, hybridizing primers within the clusters of oligonucleotides, analyzing images of the clusters of oligonucleotides, base calling for the nucleotide reads of the genomic sample, demultiplexing the nucleotide reads based on indexing sequences corresponding to the genomic sample, or base-call- quality scoring of base calls within the nucleotide reads.
18. The system of claim 16, wherein the set of secondary sequencing tasks includes one or more of genotype-quality scoring, mapping of the nucleotide reads to genomic coordinates of a reference genome, aligning the nucleotide reads with the reference genome, variant-calling for genomic samples based on the nucleotide reads, detecting structural variants, or annotating phenotypes associated with variant calls.
19. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to train the nucleotide-sample-slide ordering machine-learning model by: identifying a set of parent nucleotide-sample-slide ordering machine-learning models; generating, from the set of parent nucleotide-sample-slide ordering machine-learning models, a set of candidate nucleotide-sample-slide ordering machine-learning models comprising different weights and biases; generating predicted ordering scores from each candidate nucleotide-sample-slide ordering machine-learning model of the set of candidate nucleotide-sample-slide ordering machine-learning models; determining makespan scores for each candidate nucleotide-sample-slide ordering machine-learning model of the set of candidate nucleotide-sample-slide ordering machine-learning models based on the predicted ordering scores; and selecting a highest performing candidate nucleotide-sample-slide ordering machine-learning model as the nucleotide-sample-slide ordering machine-learning model based on comparing the makespan scores for each candidate model using a loss function.
20. The system of claim 19, further comprising instructions that, when executed by the at least one processor, cause the system to: compare the makespan scores of the highest performing candidate nucleotide-sample-slide ordering machine-learning model with a previously configured nucleotide-sample-slide ordering machine-learning model; and select the highest performing candidate nucleotide-sample-slide ordering machine-learning model as the nucleotide-sample-slide ordering machine-learning model instead of the previously configured nucleotide-sample-slide ordering machine-learning model based on a makespan score for the highest performing candidate nucleotide-sample-slide ordering machine-learning model.
21. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to: select a set of sequencing tasks associated with a nucleotide-sample slide from the set of nucleotide-sample slides; determine a set of sequencing task features for the set of sequencing tasks indicating at least the available computing resources and the performance time associated with respective sequencing tasks of the set of sequencing tasks; generate, utilizing a sequencing-task ordering machine-learning model, task ordering scores indicating a relative order of the set of sequencing tasks based on the set of sequencing task features; and perform the set of sequencing tasks for the nucleotide-sample slide according to the task ordering scores.
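By way of illustration only, and not as a limitation of the claims, the slide-ordering step of claim 13 and the candidate-selection step of claims 19 and 20 can be sketched in Python as follows. Everything in the sketch is an assumption made for the example rather than something the claims specify: the `Slide` fields, the three-element feature vector, the linear scoring model, the Gaussian perturbation used to derive candidates from parents, and the greedy assignment of slides to identical workers used to compute a makespan.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Slide:
    """Hypothetical per-slide inputs; field names are illustrative only."""
    name: str
    performance_time: float  # estimated processing time (arbitrary units)
    memory_gb: float         # estimated memory requirement
    priority: float          # relative priority feature


def features(slide, free_memory_gb):
    # Assumed feature vector: performance time, share of available memory, priority.
    return [slide.performance_time, slide.memory_gb / free_memory_gb, slide.priority]


def order_slides(weights, bias, slides, free_memory_gb):
    # Assumed linear model: a higher ordering score means the slide is processed earlier.
    # The bias does not change a linear ranking; it is kept to mirror "weights and biases".
    def score(slide):
        return sum(w * x for w, x in zip(weights, features(slide, free_memory_gb))) + bias
    return sorted(slides, key=score, reverse=True)


def makespan(ordered_slides, num_workers):
    # Greedy list scheduling: each slide, in order, goes to the worker that frees up first.
    finish = [0.0] * num_workers
    for slide in ordered_slides:
        worker = finish.index(min(finish))
        finish[worker] += slide.performance_time
    return max(finish)


def mutate(parent, rng, scale=0.5):
    # Derive a candidate from a parent by perturbing its weights and bias.
    weights, bias = parent
    return [w + rng.gauss(0.0, scale) for w in weights], bias + rng.gauss(0.0, scale)


def select_model(parents, slides, free_memory_gb, num_workers,
                 children_per_parent=20, seed=0):
    rng = random.Random(seed)
    # Parents stay in the pool so the selected model is never worse than a
    # previously configured one (a design choice of this sketch).
    candidates = list(parents)
    for parent in parents:
        candidates.extend(mutate(parent, rng) for _ in range(children_per_parent))

    def loss(model):
        weights, bias = model
        return makespan(order_slides(weights, bias, slides, free_memory_gb), num_workers)

    return min(candidates, key=loss)  # lowest makespan is treated as best
```

In this sketch the loss is the makespan itself, so the "highest performing" candidate is simply the one whose predicted order finishes soonest on the assumed workers. A task-level variant in the manner of claim 21 would apply the same pattern to the sequencing tasks of a single slide.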
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463564251P | 2024-03-12 | 2024-03-12 | |
| US63/564,251 | 2024-03-12 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025193747A1 (en) | 2025-09-18 |
Family
ID=95284444
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/019437 (Pending, WO2025193747A1 (en)) | Machine-learning models for ordering and expediting sequencing tasks or corresponding nucleotide-sample slides | 2024-03-12 | 2025-03-11 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025193747A1 (en) |
- 2025-03-11 WO PCT/US2025/019437 patent/WO2025193747A1/en active Pending
Patent Citations (34)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO1991006678A1 (en) | 1989-10-26 | 1991-05-16 | Sri International | Dna sequencing |
| US6172218B1 (en) | 1994-10-13 | 2001-01-09 | Lynx Therapeutics, Inc. | Oligonucleotide tags for sorting and identification |
| US6306597B1 (en) | 1995-04-17 | 2001-10-23 | Lynx Therapeutics, Inc. | DNA sequencing by parallel oligonucleotide extensions |
| US6210891B1 (en) | 1996-09-27 | 2001-04-03 | Pyrosequencing Ab | Method of sequencing DNA |
| US6258568B1 (en) | 1996-12-23 | 2001-07-10 | Pyrosequencing Ab | Method of sequencing DNA based on the detection of the release of pyrophosphate and enzymatic nucleotide degradation |
| US20050100900A1 (en) | 1997-04-01 | 2005-05-12 | Manteia Sa | Method of nucleic acid amplification |
| US6969488B2 (en) | 1998-05-22 | 2005-11-29 | Solexa, Inc. | System and apparatus for sequential processing of analytes |
| US6274320B1 (en) | 1999-09-16 | 2001-08-14 | Curagen Corporation | Method of sequencing a nucleic acid |
| US7001792B2 (en) | 2000-04-24 | 2006-02-21 | Eagle Research & Development, Llc | Ultra-fast nucleic acid sequencing device and a method for making and using the same |
| US7329492B2 (en) | 2000-07-07 | 2008-02-12 | Visigen Biotechnologies, Inc. | Methods for real-time single molecule sequence determination |
| US7211414B2 (en) | 2000-12-01 | 2007-05-01 | Visigen Biotechnologies, Inc. | Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity |
| US7057026B2 (en) | 2001-12-04 | 2006-06-06 | Solexa Limited | Labelled nucleotides |
| US7427673B2 (en) | 2001-12-04 | 2008-09-23 | Illumina Cambridge Limited | Labelled nucleotides |
| US20060188901A1 (en) | 2001-12-04 | 2006-08-24 | Solexa Limited | Labelled nucleotides |
| WO2004018497A2 (en) | 2002-08-23 | 2004-03-04 | Solexa Limited | Modified nucleotides for polynucleotide sequencing |
| US20070166705A1 (en) | 2002-08-23 | 2007-07-19 | John Milton | Modified nucleotides |
| US20060240439A1 (en) | 2003-09-11 | 2006-10-26 | Smith Geoffrey P | Modified polymerases for improved incorporation of nucleotide analogues |
| WO2005065814A1 (en) | 2004-01-07 | 2005-07-21 | Solexa Limited | Modified molecular arrays |
| US7315019B2 (en) | 2004-09-17 | 2008-01-01 | Pacific Biosciences Of California, Inc. | Arrays of optical confinements and uses thereof |
| WO2006064199A1 (en) | 2004-12-13 | 2006-06-22 | Solexa Limited | Improved method of nucleotide detection |
| US20060281109A1 (en) | 2005-05-10 | 2006-12-14 | Barr Ost Tobias W | Polymerases |
| WO2007010251A2 (en) | 2005-07-20 | 2007-01-25 | Solexa Limited | Preparation of templates for nucleic acid sequencing |
| US7405281B2 (en) | 2005-09-29 | 2008-07-29 | Pacific Biosciences Of California, Inc. | Fluorescent nucleotide analogs and uses therefor |
| WO2007123744A2 (en) | 2006-03-31 | 2007-11-01 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
| US20100111768A1 (en) | 2006-03-31 | 2010-05-06 | Solexa, Inc. | Systems and devices for sequence by synthesis analysis |
| US20080108082A1 (en) | 2006-10-23 | 2008-05-08 | Pacific Biosciences Of California, Inc. | Polymerase enzymes and reagents for enhanced nucleic acid sequencing |
| US20090127589A1 (en) | 2006-12-14 | 2009-05-21 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes using large scale FET arrays |
| US20090026082A1 (en) | 2006-12-14 | 2009-01-29 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes using large scale FET arrays |
| US20100282617A1 (en) | 2006-12-14 | 2010-11-11 | Ion Torrent Systems Incorporated | Methods and apparatus for detecting molecular interactions using fet arrays |
| US20100137143A1 (en) | 2008-10-22 | 2010-06-03 | Ion Torrent Systems Incorporated | Methods and apparatus for measuring analytes |
| US20120270305A1 (en) | 2011-01-10 | 2012-10-25 | Illumina Inc. | Systems, methods, and apparatuses to image a sample for biological or chemical analysis |
| US20130079232A1 (en) | 2011-09-23 | 2013-03-28 | Illumina, Inc. | Methods and compositions for nucleic acid sequencing |
| US20130260372A1 (en) | 2012-04-03 | 2013-10-03 | Illumina, Inc. | Integrated optoelectronic read head and fluidic cartridge useful for nucleic acid sequencing |
| WO2023129764A1 (en) * | 2021-12-29 | 2023-07-06 | Illumina Software, Inc. | Automatically switching variant analysis model versions for genomic analysis applications |
Non-Patent Citations (15)
| Title |
|---|
| COCKROFT, S. L.; CHU, J.; AMORIN, M.; GHADIRI, M. R.: "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution", J. AM. CHEM. SOC., vol. 130, 2008, pages 818 - 820, XP055097434, DOI: 10.1021/ja077082c |
| DEAMER, D. W.; AKESON, M.: "Nanopores and nucleic acids: prospects for ultrarapid sequencing", TRENDS BIOTECHNOL, vol. 18, 2000, pages 147 - 151, XP004194002, DOI: 10.1016/S0167-7799(00)01426-8 |
| DEAMER, D.; BRANTON, D.: "Characterization of nucleic acids by nanopore analysis", ACC. CHEM. RES., vol. 35, 2002, pages 817 - 825, XP002226144, DOI: 10.1021/ar000138m |
| HEALY, K.: "Nanopore-based single-molecule DNA analysis", NANOMED, vol. 2, 2007, pages 459 - 481, XP009111262, DOI: 10.2217/17435889.2.4.459 |
| KORLACH, J. ET AL.: "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures", PROC. NATL. ACAD. SCI., vol. 105, 2008, pages 1176 - 1181 |
| LEVENE, M. J. ET AL.: "Zero-mode waveguides for single-molecule analysis at high concentrations", SCIENCE, vol. 299, 2003, pages 682 - 686, XP002341055, DOI: 10.1126/science.1079700 |
| LI, J.; GERSHOW, M.; STEIN, D.; BRANDIN, E.; GOLOVCHENKO, J. A.: "DNA molecules and configurations in a solid-state nanopore microscope", NAT. MATER., vol. 2, 2003, pages 611 - 615, XP009039572, DOI: 10.1038/nmat965 |
| LUNDQUIST, P. M. ET AL.: "Parallel confocal detection of single molecules in real time", OPT. LETT., vol. 33, 2008, pages 1026 - 1028, XP001522593, DOI: 10.1364/OL.33.001026 |
| METZKER, GENOME RES, vol. 15, 2005, pages 1767 - 1776 |
| RONAGHI, M.: "Pyrosequencing sheds light on DNA sequencing", GENOME RES, vol. 11, no. 1, 2001, pages 3 - 11, XP000980886, DOI: 10.1101/gr.11.1.3 |
| RONAGHI, M.; KARAMOHAMED, S.; PETTERSSON, B.; UHLEN, M.; NYREN, P.: "Real-time DNA sequencing using detection of pyrophosphate release", ANALYTICAL BIOCHEMISTRY, vol. 242, no. 1, 1996, pages 84 - 9, XP002388725, DOI: 10.1006/abio.1996.0432 |
| RONAGHI, M.; UHLEN, M.; NYREN, P.: "A sequencing method based on real-time pyrophosphate", SCIENCE, vol. 281, no. 5375, 1998, pages 363, XP002135869, DOI: 10.1126/science.281.5375.363 |
| RUPAREL ET AL., PROC NATL ACAD SCI, vol. 102, 2005, pages 5932 - 7 |
| SONI, G. V.; MELLER, A.: "Progress toward ultrafast DNA sequencing using solid-state nanopores", CLIN. CHEM., vol. 53, 2007, pages 1996 - 2001, XP055076185, DOI: 10.1373/clinchem.2007.091231 |
| TONG ZHAO ET AL: "QL-HEFT: a novel machine learning scheduling scheme base on cloud computing environment", NEURAL COMPUTING AND APPLICATIONS, SPRINGER LONDON, LONDON, vol. 32, no. 10, 7 March 2019 (2019-03-07), pages 5553 - 5570, XP037110789, ISSN: 0941-0643, [retrieved on 20190307], DOI: 10.1007/S00521-019-04118-8 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25716873; Country of ref document: EP; Kind code of ref document: A1 |