WO2025006487A1 - Using flow cell spatial coordinates to link reads for improved genome analysis - Google Patents
- Publication number
- WO2025006487A1 (application PCT/US2024/035447)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- clusters
- substrate
- nucleic acid
- cluster
- distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6806—Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
Definitions
- template genomic sequences are first fragmented in solution into smaller pieces that are amenable to next-generation sequencing methods on a flowcell.
- One of the difficulties of this approach is that by the time the smaller sequence fragments from the template genomic sequences have been read, knowledge of their connectivity and proximity to each other in the original template genomic sequence is lost.
- the process of ordering the sequence fragments to arrive at the sequence of the original template genomic sequence is generally referred to as "assembly." Assembly processes can be computationally intensive and time-consuming.
- sequence and assembly errors can become a problem depending upon the sequencing methodology used and the quality of genomic DNA samples under evaluation.
- many genomes of interest contain more than one version of each chromosome.
- the human genome is diploid, having two sets of chromosomes— one set inherited from each parent.
- Some organisms have polyploid genomes with more than two sets of chromosomes.
- Examples of polyploid organisms include animals, such as salmon, and many plant species such as wheat, apple, oat and sugar cane.
- phasing information, that is, the identity of which fragments came from which set of chromosomes, is lost. This phasing information can be difficult or impossible to reconstruct using typical shotgun methods. Somewhat similar yet often more complex difficulties can arise when mixed samples are evaluated.
- Mixed samples can contain nucleic acid molecules, such as chromosomes, mRNA transcripts, plasmids, etc., from two or more organisms. Mixed samples having multiple organisms are often referred to as metagenomic samples. Other examples of mixed samples are different cells or tissues that, although derived from the same organism, have different characteristics. Examples include cancerous tissues, which may comprise a mixture of healthy cells and cancerous cells, a mixture of pre-cancerous cells and cancerous cells, or two or more different types of cancerous cells. Indeed, there may be a variety of different types of cancer cells, as is the case for cancer samples that have mosaicity. Another example of different cells derived from a single organism is a mixture of maternal and fetal cells obtained from a pregnant female (e.g.
- a method for assigning nucleic acid sequence reads to target polynucleotides including providing transposome complexes, wherein the transposome complexes include a transposase and a first polynucleotide including an end sequence and a first tag; contacting the transposome complexes with target polynucleotides under conditions to fragment the target polynucleotides; amplifying the fragmented target polynucleotides to form a plurality of nucleic acid clusters on a substrate; obtaining location information for the plurality of nucleic acid clusters on the substrate; determining the nucleic acid sequence reads of the fragmented nucleic acids in each of the nucleic acid clusters; and assigning the nucleic acid sequence reads to the target polynucleo
- a length of the target polynucleotides is greater than a length of the fragment.
- assigning the nucleic acid sequence reads includes determining the distance between each of the clusters and using the determined distance to assign reads to a specific target polynucleotide. In some embodiments, assigning the nucleic acid sequence reads includes determining a likelihood score that at least a first and a second cluster on the substrate derive from the same target polynucleotide. In some embodiments, the method further includes increasing the likelihood score for the first cluster when the spatial distance between at least the first and second clusters is below a threshold value.
- the method further includes increasing the likelihood score for the first cluster when a genomic distance between at least the first and second clusters is below a threshold value. In some embodiments, the method further includes increasing the likelihood score for the first cluster when both the spatial distance and a genomic distance between at least the first and second clusters are below a threshold value. In some embodiments, the likelihood score is influenced by a pitch of the substrate, size of the substrate, a pattern of the substrate, temperature, loading density, fragment directionality, or a combination thereof.
- the method further includes determining whether the target nucleic acid has a variant when the spatial distance between at least the first and second clusters is below a threshold value and the genomic distance is above a genomic distance threshold.
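The scoring rules in the preceding embodiments can be sketched as a pairwise comparison. This is a minimal illustration, not the claimed method: the `Cluster` fields, the function names, the unit increments, and treating clusters on different contigs as infinitely far apart are all assumptions made for the example, and the thresholds are left to the caller.

```python
import math
from dataclasses import dataclass


@dataclass
class Cluster:
    x: float      # first spatial coordinate on the substrate (hypothetical units)
    y: float      # second spatial coordinate on the substrate
    chrom: str    # reference contig the cluster's read aligns to
    pos: int      # aligned genomic position in base pairs


def spatial_distance(a: Cluster, b: Cluster) -> float:
    """Euclidean distance between two clusters on the substrate."""
    return math.hypot(a.x - b.x, a.y - b.y)


def genomic_distance(a: Cluster, b: Cluster) -> float:
    """Distance along the reference; infinite across contigs (assumption)."""
    if a.chrom != b.chrom:
        return math.inf
    return abs(a.pos - b.pos)


def compare(a: Cluster, b: Cluster, spatial_max: float, genomic_max: float):
    """Return (score_increment, possible_variant) for the pair.

    The score grows by one for each distance that falls below its
    threshold. A pair that is close on the substrate but far apart in the
    genome is flagged as a candidate variant.
    """
    near_on_substrate = spatial_distance(a, b) < spatial_max
    g = genomic_distance(a, b)
    near_in_genome = g < genomic_max
    increment = int(near_on_substrate) + int(near_in_genome)
    possible_variant = near_on_substrate and g > genomic_max
    return increment, possible_variant
```

For example, two clusters 5 units apart on the substrate whose reads align 500 bp apart receive an increment of 2 under thresholds of 10 units and 1,000 bp, while the same spatial pair aligning megabases apart receives an increment of 1 and is flagged for variant review.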
- the spatial distance threshold from a cluster forms a pattern of an ellipse or a circle around the cluster.
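An elliptical or circular threshold region can be tested with the standard ellipse inequality. The sketch below assumes an axis-aligned ellipse centered on the cluster; the function name and the semi-axis parameters are hypothetical, and a circle is the special case where both semi-axes are equal.

```python
def within_ellipse(cx: float, cy: float, x: float, y: float,
                   semi_x: float, semi_y: float) -> bool:
    """True when point (x, y) lies inside the axis-aligned ellipse
    centered at (cx, cy) with semi-axes semi_x and semi_y."""
    if semi_x <= 0 or semi_y <= 0:
        raise ValueError("semi-axes must be positive")
    dx = (x - cx) / semi_x
    dy = (y - cy) / semi_y
    return dx * dx + dy * dy <= 1.0
```

With semi-axes of 4 and 2, a neighbor offset by 3 units along the first axis is inside the region, while the same offset along the second axis is outside it.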
- the likelihood score of the first cluster is 0.
- the likelihood score of the second cluster is above 30.
- the likelihood score of one or more other clusters is above 30.
- the location information includes a first spatial coordinate and a second spatial coordinate in a Cartesian coordinate system.
- the method further includes sorting the plurality of the nucleic acid clusters by their spatial coordinates.
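One reason to sort clusters by their spatial coordinates is that nearby pairs can then be found without comparing every cluster to every other. The following sketch sorts by the first coordinate and scans only the window that can contain neighbors; the function name and the choice of a simple sweep (rather than a spatial index) are assumptions made for illustration.

```python
import bisect
import math


def neighbors_within(points, radius):
    """Return sorted index pairs (i, j), i < j, whose points lie within
    `radius` of each other. `points` is a list of (x, y) tuples."""
    # Sort indices by (x, y) so candidates sit in a contiguous x-window.
    order = sorted(range(len(points)), key=lambda i: points[i])
    xs = [points[i][0] for i in order]
    pairs = []
    for rank, i in enumerate(order):
        xi, yi = points[i]
        # Only points with x <= xi + radius can be within the radius.
        hi = bisect.bisect_right(xs, xi + radius)
        for j in order[rank + 1:hi]:
            xj, yj = points[j]
            if math.hypot(xj - xi, yj - yi) <= radius:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)
```

The returned pairs can then be passed to whatever scoring step decides whether two clusters derive from the same target polynucleotide.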
- the transposome complexes include a second polynucleotide including a region complementary to the transposon end sequence.
- the transposome complexes are present on the substrate at a density of at least 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or 10¹⁰ or more complexes per mm².
- the transposome complexes include a hyperactive Tn5 transposase.
- the substrate includes microparticles. In some embodiments, the substrate includes a patterned surface.
- the substrate includes wells.
- providing the transposome complexes includes providing the transposome complexes in solution.
- providing the transposome complexes includes providing the transposome complexes bound to the substrate.
- a system for assigning sequence reads on a substrate to their original target nucleic acid including a substrate including clusters of amplified fragments of target polynucleotides bound to the substrate in spatial locations; one or more processors having instructions that when executed perform a method including obtaining spatial location information for the clusters of amplified fragments on the substrate; sequencing the target polynucleotides in each of the clusters to determine a sequence read and spatial location from each of the amplified fragments; determining the geographic distance between each cluster on the substrate; and assigning sequence reads from each cluster to a target nucleic acid based on their geographic distance from one another.
- a length of the target polynucleotides is greater than a length of the fragment.
- assigning the nucleic acid sequence reads includes determining the distance between each of the clusters and using the determined distance to assign reads to a specific target polynucleotide. In some embodiments, assigning the nucleic acid sequence reads includes determining a likelihood score that at least a first and a second cluster on the substrate derive from the same target polynucleotide. In some embodiments, the one or more processors further perform a method including increasing the likelihood score for the first cluster when the spatial distance between at least the first and second clusters is below a threshold value.
- the one or more processors further perform a method including increasing the likelihood score for the first cluster when a genomic distance between at least the first and second clusters is below a threshold value. In some embodiments, the one or more processors further perform a method including increasing the likelihood score for the first cluster when both the spatial distance and a genomic distance between at least the first and second clusters are below a threshold value. In some embodiments, the likelihood score is influenced by a pitch of the substrate, size of the substrate, a pattern of the substrate, temperature, loading density, fragment directionality, or a combination thereof.
- the one or more processors further perform a method including determining whether the target nucleic acid has a variant when the spatial distance between the first and second clusters is below a threshold value and the genomic distance is above a genomic distance threshold.
- the spatial distance threshold from a cluster forms a pattern of an ellipse or a circle around the cluster.
- the likelihood score of the first cluster is 0.
- the likelihood score of the second cluster is above 30.
- the likelihood score of one or more other clusters is above 30.
- the location information includes a first spatial coordinate and a second spatial coordinate in a 2D coordinate system.
- the one or more processors further performs a method including sorting the plurality of the nucleic acid clusters by their spatial coordinates.
- the transposome complexes include a second polynucleotide including a region complementary to the transposon end sequence.
- the transposome complexes are present on the substrate at a density of at least 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or 10¹⁰ or more complexes per mm².
- the transposome complexes include a hyperactive Tn5 transposase.
- the substrate includes microparticles.
- the substrate includes a patterned surface.
- the substrate includes wells.
BRIEF DESCRIPTION OF THE DRAWINGS
- The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “figure” and “FIG.” herein).
- FIG. 1 illustrates the difficulty in mapping two identical sequences, A1 and A2.
- FIG. 2 is a flow diagram illustrating a method according to some embodiments.
- FIGS. 3A-3B are schematic diagrams illustrating the disambiguation of multi-mapped reads according to some embodiments.
- FIGS. 4A-4B illustrate reads based on spatial location and genomic distance according to some embodiments.
- FIG. 4A illustrates an update to the MAPQ value, and FIG. 4B illustrates a case in which the MAPQ value is not updated.
- FIG. 5 shows a non-limiting example of a solid support according to some embodiments.
- FIG. 6 illustrates clustering on a solid support according to some embodiments.
- FIG. 7 illustrates clustering on a solid support according to some embodiments.
- FIG. 8 illustrates non-limiting examples of metrics related to spatial distance information from the clusters according to some embodiments.
- FIG. 9 illustrates non-limiting examples of spatial information obtained from a nucleic acid.
- FIG. 10 is a flow diagram illustrating a non-limiting example of updating a MAPQ score of a MAP0 read.
- FIG. 11 is a flow diagram illustrating a non-limiting example of updating a MAPQ score of a MAP0 read.
DETAILED DESCRIPTION
- While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
- Embodiments of the invention relate to systems and methods for sequencing target nucleic acids by fragmenting the target nucleic acid and distributing the fragments onto a flow cell. As the fragments are distributed along the flow cell, they bind capture primers and are then used to create clusters by well-known technologies, such as those provided by Illumina Inc. (San Diego, CA). It has been discovered that fragments which were derived from the same template genomic sequence are more likely to bind to the flow cell in spatially nearby positions as compared to fragments that are from different template genomic sequences, particularly when the fragmentation is performed directly on the flow cell using immobilized transposome complexes on the surface of the flow cell.
- one embodiment is a method for assigning nucleic acid sequence reads to target polynucleotides, which includes providing transposome complexes.
- the transposome complexes include a transposase and a first polynucleotide having end sequences which can be used to fragment the target polynucleotides and insert into each fragment an end sequence or tag which can be used to bind to capture probes located on the substrate.
- the method can include contacting the transposome complexes with the target polynucleotides under conditions to fragment the target polynucleotides and add capture sequences to the ends of each fragment.
- the capture sequences include P5 or P7 sequences as provided by Illumina, Inc.
- the complexed strand and transposome are in solution and are then brought towards a substrate and immobilized thereon.
- prior to immobilization of the transposome complexes on the substrate, one or more of the transposome complexes bind the target polynucleotides in solution. In this embodiment, the transposome complexes in solution become immobilized to the substrate.
- the bound fragments can be amplified to form a plurality of nucleic acid clusters on the substrate.
- the location of each cluster on the flow cell can then be determined before, during or after performing sequencing-by-synthesis (SBS) reactions to obtain the nucleotide sequence of each fragment located in each cluster.
- the method can start to map those reads to determine the original target polynucleotide from which the read originated.
- the mapping process takes into account the spatial location of each cluster, such that clusters which are closer to each other on the flow cell are more likely to have originated from the same target polynucleotide.
- the library preparation steps are performed on the flow cell, which may reduce the complexity and the amount of equipment required for the systems. Furthermore, by mapping the sequenced fragments to target polynucleotides using the spatial information accompanying each cluster, the method performs more accurate mapping operations as compared to methods that do not take the spatial location of each cluster into account during the mapping process. Therefore, spatial information that includes relative distances between various clusters on a flow cell is leveraged to adjust mapping information, thereby increasing the read quality of previously identified multi-mapped reads. In the past, identified multi-mapped reads may have been discarded. Increasing the read quality of these previously discarded reads may improve the alignment information and quality of information used in certain genomic analysis applications including, but not limited to, variant calling.
- each flow cell is divided into swaths and tiles.
- Each swath is a longitudinal stripe of the flow cell, and each tile is an area within each stripe. More information on this schema can be found with reference to FIGS. 5, 6 and 7.
- each tile on the flow cell is given a unique tile number, which includes the corresponding swath where the tile is located.
- once the nucleotide sequence of a cluster is determined, it is stored using a filename or readname which includes not only the sequence information but also the identification of the swath and tile.
- a non-limiting example of storing such information includes storing the information in a specific file format, such as the FASTQ file format. Then, during the assembly process, if the process determines that a particular fragment maps to more than one possible target polynucleotide, the process reads the spatial location of the cluster and uses that spatial location to determine whether one of the mapping assignments is more likely than the other based on the spatial location of each read on the flow cell. Reads which are within a relatively short distance of one another are more likely to have derived from the same target polynucleotide, so the process assigns a read which is mapped to multiple target polynucleotides to a particular one based on their relative spatial locations on the flow cell.
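As an illustration of recovering the spatial location from a read name, the sketch below parses a colon-delimited FASTQ header. It assumes the common Illumina-style layout `@instrument:run:flowcell:lane:tile:x:y` and a four-digit tile code whose digits encode surface, swath, and tile index. That tile layout varies by instrument and is an assumption here, as are the `ReadLocation` type and the function name.

```python
from typing import NamedTuple


class ReadLocation(NamedTuple):
    lane: int
    surface: int
    swath: int
    tile: int
    x: int
    y: int


def parse_read_name(name: str) -> ReadLocation:
    """Extract lane, swath, tile, and x/y coordinates from a read name.

    Assumes '@instrument:run:flowcell:lane:tile:x:y [comment]' and a
    four-digit tile code laid out as <surface><swath><tile, 2 digits>.
    """
    fields = name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name: {name!r}")
    lane, tile_code, x, y = fields[3], fields[4], fields[5], fields[6]
    if len(tile_code) != 4 or not tile_code.isdigit():
        raise ValueError(f"unexpected tile code: {tile_code!r}")
    return ReadLocation(
        lane=int(lane),
        surface=int(tile_code[0]),
        swath=int(tile_code[1]),
        tile=int(tile_code[2:]),
        x=int(x),
        y=int(y),
    )
```

Two reads can only be compared by raw x/y coordinates when they share a lane, surface, swath, and tile; comparing across tiles would require an additional tile-to-flow-cell offset that this sketch does not model.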
- the sequencing system first determines the sequence and location of reads in each cluster on the flow cell.
- the system creates a BAM file for each such read which includes the read sequence and identification of one or more possible target polynucleotides.
- the system attempts to disambiguate the reads by referring to the spatial location information contained with each read.
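The disambiguation step can be sketched as a vote among spatially nearby clusters that already have confident alignments. This is one possible reading, not the system's actual procedure: the function name, the tuple layouts, the tie-handling rule, and the use of a MAPQ cutoff of 30 for "confident" neighbors are assumptions for the example, and the new MAPQ value to assign to the rescued read is deliberately left out because this section does not specify it.

```python
def disambiguate(candidates, neighbors, max_genomic_gap, min_neighbor_mapq=30):
    """Pick the candidate alignment best supported by nearby clusters.

    candidates: list of (chrom, pos) alignments for one multi-mapped read.
    neighbors:  list of (chrom, pos, mapq) for clusters already found to be
                spatially close to the read's cluster on the flow cell.
    Returns the chosen (chrom, pos), or None if no candidate is supported
    or the best support is tied.
    """
    support = []
    for chrom, pos in candidates:
        count = sum(
            1
            for n_chrom, n_pos, n_mapq in neighbors
            if n_mapq > min_neighbor_mapq
            and n_chrom == chrom
            and abs(n_pos - pos) <= max_genomic_gap
        )
        support.append(count)
    best = max(support, default=0)
    if best == 0 or support.count(best) > 1:
        return None  # leave the read ambiguous rather than guess
    return candidates[support.index(best)]
```

Returning `None` on ties keeps the behavior conservative: a read is reassigned only when the spatial neighborhood favors exactly one candidate location.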
- When used in the context of a compound, composition, or device, the term "comprising" means that the compound, composition, or device includes at least the recited features or components, but may also include additional features or components.
- "polynucleotide," "oligonucleotide," "nucleic acid" and "nucleic acid molecules" are used interchangeably herein and refer to a covalently linked sequence of nucleotides of any length (i.e., ribonucleotides for RNA, deoxyribonucleotides for DNA, analogs thereof, or mixtures thereof) in which the 3' position of the pentose of one nucleotide is joined by a phosphodiester group to the 5' position of the pentose of the next.
- the terms should be understood to include, as equivalents, analogs of either DNA, RNA, cDNA, or antibody-oligo conjugates made from nucleotide analogs and to be applicable to single stranded (such as sense or antisense) and double stranded polynucleotides.
- the term as used herein also encompasses cDNA, that is, complementary or copy DNA produced from an RNA template, for example by the action of reverse transcriptase. This term refers only to the primary structure of the molecule. Thus, the term includes, without limitation, triple-, double- and single-stranded deoxyribonucleic acid ("DNA"), as well as triple-, double- and single-stranded ribonucleic acid ("RNA").
- nucleotides include sequences of any form of nucleic acid.
- a nucleic acid can have a naturally occurring nucleic acid structure or a non-naturally occurring nucleic acid analog structure.
- a nucleic acid can contain phosphodiester bonds; however, in some embodiments, nucleic acids may have other types of backbones, comprising, for example, phosphoramide, phosphorothioate, phosphorodithioate, O-methylphosphoroamidite and peptide nucleic acid backbones and linkages.
- Nucleic acids can have positive backbones; non-ionic backbones, and non-ribose based backbones.
- Nucleic acids may also contain one or more carbocyclic sugars.
- the nucleic acids used in methods or compositions herein may be single stranded or, alternatively double stranded, as specified.
- a nucleic acid can contain portions of both double stranded and single stranded sequence, for example, as demonstrated by forked adapters.
- a nucleic acid can contain any combination of deoxyribo- and ribonucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, and base analogs such as nitropyrrole (including 3-nitropyrrole) and nitroindole (including 5-nitroindole), etc.
- a nucleic acid can include at least one promiscuous base.
- a promiscuous base can base-pair with more than one different type of base and can be useful, for example, when included in oligonucleotide primers or inserts that are used for random hybridization in complex nucleic acid samples such as genomic DNA samples.
- An example of a promiscuous base is inosine, which may pair with adenine, thymine, or cytosine. Other examples include hypoxanthine, 5-nitroindole, acyclic 5-nitroindole, 4-nitropyrazole, 4-nitroimidazole and 3-nitropyrrole.
- Promiscuous bases that can base-pair with at least two, three, four or more types of bases can be used.
- fragment when used in reference to a first nucleic acid, is intended to mean a second nucleic acid having a part or portion of the sequence of the first nucleic acid. Generally, the fragment and the first nucleic acid are separate molecules. The fragment can be derived, for example, by physical removal from the larger nucleic acid, by replication or amplification of a region of the larger nucleic acid, by degradation of other portions of the larger nucleic acid, a combination thereof or the like. The term can be used analogously to describe sequence data or other representations of nucleic acids.
- haplotype refers to a set of alleles at more than one locus inherited by an individual from one of its parents.
- a haplotype can include two or more loci from all or part of a chromosome. Alleles include, for example, single nucleotide polymorphisms (SNPs), short tandem repeats (STRs), gene sequences, chromosomal insertions, chromosomal deletions etc.
- phased alleles refers to the distribution of the particular alleles from a particular chromosome, or portion thereof. Accordingly, the "phase" of two alleles can refer to a characterization or representation of the relative location of two or more alleles on one or more chromosomes.
- nucleotide sequence is intended to refer to the order and type of nucleotide monomers in a nucleic acid polymer.
- a nucleotide sequence is a characteristic of a nucleic acid molecule and can be represented in any of a variety of formats including, for example, a depiction, image, electronic medium, series of symbols, series of numbers, series of letters, series of colors, etc.
- the information can be represented, for example, at single nucleotide resolution, at higher resolution (e.g. indicating molecular structure for nucleotide subunits) or at lower resolution (e.g. indicating chromosomal regions, such as haplotype blocks).
- solid support refers to a rigid substrate that is insoluble in aqueous liquid.
- the substrate can be non-porous or porous.
- the substrate can optionally be capable of taking up a liquid (e.g. due to porosity) but will typically be sufficiently rigid that the substrate does not swell substantially when taking up the liquid and does not contract substantially when the liquid is removed by drying.
- a nonporous solid support is generally impermeable to liquids or gases.
- Exemplary solid supports include, but are not limited to, glass and modified or functionalized glass, plastics (including acrylics, polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, polybutylene, polyurethanes, Teflon™, cyclic olefins, polyimides etc.), nylon, ceramics, resins, Zeonor, silica or silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses, optical fiber bundles, and polymers.
- Particularly useful solid supports for some embodiments are located within a flow cell apparatus. Exemplary flow cells are set forth in further detail below.
- flow cell is intended to mean a chamber having a surface across which one or more fluid reagents can be flowed. Generally, a flow cell will have an ingress opening and an egress opening to facilitate flow of fluid. A flow cell can have multiple surfaces.
- a solid support to which nucleic acids are attached in a method set forth herein will have a continuous or monolithic surface.
- fragments can attach at spatially random locations wherein the distance between nearest neighbor fragments (or nearest neighbor clusters derived from the fragments) will be variable.
- the resulting arrays will have a variable or random spatial pattern of features.
- a solid support used in a method set forth herein can include an array of features that are present in a repeating pattern.
- the features provide the locations to which modified nucleic acid polymers, or fragments thereof, can attach.
- Particularly useful repeating patterns are hexagonal patterns, rectilinear patterns, grid patterns, patterns having reflective symmetry, patterns having rotational symmetry, or the like.
- each feature can have an area that is smaller than about 1 mm², 500 µm², 100 µm², 25 µm², 10 µm², 5 µm², 1 µm², 500 nm², or 100 nm².
- each feature can have an area that is larger than about 100 nm², 250 nm², 500 nm², 1 µm², 2.5 µm², 5 µm², 10 µm², 100 µm², or 500 µm².
- a cluster or colony of nucleic acids that result from amplification of fragments on an array can similarly have an area that is in a range above or between an upper and lower limit selected from those exemplified above.
- the features can be discrete, being separated by interstitial regions.
- some or all of the features on a surface can be abutting (i.e. not separated by interstitial regions).
- the average size of the features and/or average distance between the features can vary such that arrays can be high density, medium density or lower density.
- High density arrays are characterized as having features with average pitch of less than about 15 µm.
- Medium density arrays have average feature pitch of about 15 to 30 µm, while low density arrays have average feature pitch of greater than 30 µm.
- An array useful in the invention can have feature pitch of, for example, less than 100 µm, 50 µm, 10 µm, 5 µm, 1 µm or 0.5 µm.
- the feature pitch can be, for example, greater than 0.1 µm, 0.5 µm, 1 µm, 5 µm, 10 µm, 50 µm, or 100 µm.
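The density classes defined by average feature pitch can be written as a small lookup. The function name is hypothetical, and because the text gives the boundaries only as "about" 15 µm and 30 µm, placing exactly 15 µm and 30 µm in the medium class is an assumption.

```python
def array_density(average_pitch_um: float) -> str:
    """Classify an array by average feature pitch in micrometers:
    high (< 15 µm), medium (15 to 30 µm), or low (> 30 µm)."""
    if average_pitch_um <= 0:
        raise ValueError("pitch must be positive")
    if average_pitch_um < 15:
        return "high"
    if average_pitch_um <= 30:
        return "medium"
    return "low"
```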
- the term "source" is intended to include an origin for a nucleic acid molecule, such as a tissue, cell, organelle, compartment, or organism.
- a source can be a particular organism in a metagenomic sample having several different species of organisms. In some embodiments the source will be identified as an individual origin (e.g. an individual cell or organism). Alternatively, the source can be identified as a species that encompasses several individuals of the same type in a sample (e.g. a species of bacteria or other organism in a metagenomic sample having several individual members of the species along with members of other species as well).
- the term "surface," when used in reference to a material, is intended to mean an external part or external layer of the material.
- the surface can be in contact with another material such as a gas, liquid, gel, polymer, organic polymer, second surface of a similar or different material, metal, or coat.
- the surface, or regions thereof, can be substantially flat.
- the surface can have surface features such as wells, pits, channels, ridges, raised regions, pegs, posts or the like.
- the material can be, for example, a solid support, gel, or the like.
- a physical map of the immobilized nucleic acid can then be generated.
- the physical map thus correlates the physical relationship of clusters after immobilized nucleic acid is amplified.
- the physical map is used to calculate the probability that sequence data obtained from any two clusters are linked, as described in the incorporated materials of WO 2012/025250.
- the physical map can be indicative of the genome of a particular organism in a metagenomic sample. In this latter case the physical map can indicate the order of sequence fragments in the organism's genome; however, the order need not be specified and instead the mere presence of two or more fragments in a common organism (or other source or origin) can be sufficient basis for a physical map that characterizes a mixed sample and one or more organisms therein.
- the physical map is generated by imaging the solid support to establish the location of the immobilized nucleic acid molecules across the surface.
- the immobilized nucleic acid is imaged by adding an imaging agent to the solid support and detecting a signal from the imaging agent.
- the imaging agent is a detectable label. Suitable detectable labels include, but are not limited to, protons, haptens, radionuclides, enzymes, fluorescent labels, chemiluminescent labels, and/or chromogenic agents.
- the imaging agent is an intercalating dye or non-intercalating DNA binding agent.
- a plurality of modified nucleic acid molecules is flowed onto a flow cell comprising a plurality of nano-channels.
- nano-channel refers to a narrow channel into which a long linear nucleic acid molecule is stretched.
- the individual nano-channels are separated by a physical barrier that prevents individual long strands of target nucleic acid from interacting with multiple nano-channels.
- the solid support comprises at least 10, 50, 100, 200, 500, 1000, 3000, 5000, 10000, 30000, 50000, 80000 or at least 100000 nano-channels.
- target, when used in reference to a nucleic acid polymer, is intended to linguistically distinguish the nucleic acid, for example, from other nucleic acids, modified forms of the nucleic acid, fragments of the nucleic acid, and the like. Any of a variety of nucleic acids set forth herein can be identified as target nucleic acids, examples of which include genomic DNA (gDNA), messenger RNA (mRNA), copy or complementary DNA (cDNA), and derivatives or analogs of these nucleic acids.
- transposase is intended to mean an enzyme that is capable of forming a functional complex with a transposon element-containing composition (e.g., transposons, transposon ends, transposon end compositions) and catalyzing insertion or transposition of the transposon element-containing composition into a target DNA with which it is incubated, for example, in an in vitro transposition reaction.
- the term can also include integrases from retrotransposons and retroviruses.
- Transposases, transposomes and transposome complexes are generally known to those of skill in the art, as exemplified by the disclosure of US Pat. App. Pub. No. 2010/0120098, which is incorporated herein by reference.
- transposome is intended to mean a transposase enzyme bound to a nucleic acid. Typically the nucleic acid is double stranded.
- the complex can be the product of incubating a transposase enzyme with double-stranded transposon DNA under conditions that support non-covalent complex formation.
- Transposon DNA can include, without limitation, Tn5 DNA, a portion of Tn5 DNA, a transposon element composition, a mixture of transposon element compositions or other nucleic acids capable of interacting with a transposase such as the hyperactive Tn5 transposase.
- the term "transposon element" is intended to mean a nucleic acid molecule, or portion thereof, that includes the nucleotide sequences that form a transposome with a transposase or integrase enzyme.
- a transposon element is capable of forming a functional complex with the transposase in a transposition reaction.
- transposon elements can include the 19-bp outer end (“OE”) transposon end, inner end (“IE”) transposon end, or “mosaic end” (“ME”) transposon end recognized by a wild-type or mutant Tn5 transposase, or the R1 and R2 transposon end as set forth in the disclosure of US Pat. App. Pub. No. 2010/0120098, which is incorporated herein by reference.
- Transposon elements can comprise any nucleic acid or nucleic acid analogue suitable for forming a functional complex with the transposase or integrase enzyme in an in vitro transposition reaction.
- the transposon end can comprise DNA, RNA, modified bases, non-natural bases, modified backbone, and can comprise nicks in one or both strands.
- a standard NGS sequencing run yields millions of short sequences that are eventually mapped on a reference genome. A percentage of good-quality reads (1-5%) are discarded because of ambiguous genomic location.
- FIG. 1 is a diagram 100 showing the difficulty resolving sequence reads from identical sequences A1 and A2 on a reference genome.
- FIG. 2 provides a method 200 for assigning nucleic acid sequence reads to target polynucleotides.
- the method 200 includes providing a substrate having transposome complexes immobilized thereon in step 202.
- the transposome complexes include a transposase and a first polynucleotide including an end sequence and a first tag in some embodiments.
- the method 200 then includes contacting the transposome complexes in step 204 with target polynucleotides under conditions to fragment the target polynucleotides.
- the method 200 then includes amplifying the fragmented target polynucleotides in step 206 to form a plurality of nucleic acid clusters on the substrate.
- the method 200 then includes obtaining location information in step 208 for the plurality of nucleic acid clusters on the substrate.
- the method 200 then includes determining the nucleic acid sequence reads of the fragmented nucleic acids in each of the nucleic acid clusters in step 210.
- the method 200 then includes assigning the nucleic acid sequence reads to the target polynucleotides using the obtained location information in step 212.
- a length of the target polynucleotides is greater than a length of the fragment.
- assigning the nucleic acid sequence reads in step 212 includes determining the distance between each of the clusters and using the determined distance to assign reads to a specific target polynucleotide.
- assigning the nucleic acid sequence reads includes determining a likelihood score that sequence reads from at least a first and a second cluster on the substrate derive from the same target polynucleotide. In some embodiments, sequence reads from more than two clusters are determined to be derived from the same target polynucleotide based on the likelihood score. In some embodiments, the method further includes increasing the likelihood score for sequence reads from at least the first cluster when the spatial distance and a genomic distance between the reads derived from the first and second clusters are below a threshold value.
- assigning the nucleic acid sequence reads to a particular target polynucleotide includes determining a likelihood score that a plurality of sequence reads from clusters on the substrate derive from the same target polynucleotide. In some embodiments, the method further includes increasing the likelihood score for sequence reads from one cluster when the spatial distance and a genomic distance between one cluster and one or more other clusters are below a threshold value. In some embodiments, the method further includes determining whether the target nucleic acid has a variant when the spatial distance between at least the first and second clusters is below a threshold value and when the genomic distance is above the genomic distance threshold.
- the method further includes determining whether the target nucleic acid has a variant when the spatial distances between the first cluster and the second cluster, and the first cluster and one or more other clusters are below a threshold value and when the genomic distance between the genomic locations of reads derived from the first cluster and the second cluster is above the genomic distance threshold.
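The pairwise rule described above (spatially close and genomically close suggests the same target; spatially close but genomically distant suggests a possible variant) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the Euclidean distance metric, the default thresholds of 50 flow cell units and 50 kb, and the assumption that both reads align to the same contig are all hypothetical choices made for the example.

```python
import math


def spatial_distance(xy1, xy2):
    """Euclidean distance between two (x, y) cluster coordinates."""
    return math.hypot(xy1[0] - xy2[0], xy1[1] - xy2[1])


def classify_pair(xy1, xy2, pos1, pos2, spatial_max=50.0, genomic_max=50_000):
    """Classify a pair of clusters from spatial and genomic distance.

    xy1, xy2   -- (x, y) coordinates of the two clusters on the substrate
    pos1, pos2 -- aligned genomic positions (same contig assumed)
    Thresholds are illustrative assumptions, not values fixed by the text.
    """
    near_on_substrate = spatial_distance(xy1, xy2) < spatial_max
    near_on_genome = abs(pos1 - pos2) < genomic_max
    if near_on_substrate and near_on_genome:
        return "same_target"        # supports raising the likelihood score
    if near_on_substrate:
        return "variant_candidate"  # close on substrate, far on genome
    return "unlinked"
```

A circular threshold is used here because it is the simplest of the patterns mentioned; an elliptical or other pattern would replace `spatial_distance` with a different metric.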
- the spatial distance threshold from a cluster forms a pattern of an ellipse or a circle around the cluster. Other patterns are contemplated including, but not limited to, symmetrical patterns, asymmetrical patterns, rectilinear patterns, hexagonal patterns, or the like.
- the likelihood score includes a score representing the likelihood that a read from a particular cluster maps to a target polynucleotide. This likelihood score may refer to a mapping quality (MAPQ) score, e.g., likelihood that the mapping location of a read to a particular target polynucleotide is correct.
- the likelihood score of a read from a first cluster is 0.
- the likelihood score of a read from second cluster is above 30.
- the likelihood score of reads from clusters other than the first and second clusters is above 30.
- the location information includes a first spatial coordinate and a second spatial coordinate in a 2D coordinate system.
- the method further includes sorting the plurality of reads determined from the nucleic acid clusters by their spatial coordinates. In some embodiments, the method further includes sorting each read from each cluster by the read’s MAPQ score.
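As an illustration of the sorting step, the sketch below orders reads by their 2D coordinates and then by MAPQ. The `ClusterRead` record and the choice to break coordinate ties by descending MAPQ are assumptions made for the example; the text does not fix a sort order.

```python
from typing import NamedTuple


class ClusterRead(NamedTuple):
    name: str
    x: int     # first spatial coordinate
    y: int     # second spatial coordinate
    mapq: int  # mapping quality of the read


def sort_reads(reads):
    """Sort by (x, y); reads sharing a coordinate are ordered by MAPQ, highest first."""
    return sorted(reads, key=lambda r: (r.x, r.y, -r.mapq))
```

Sorting by coordinate first makes spatially adjacent reads adjacent in the list, which can simplify a later neighbor search.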
- the transposome complexes include a second polynucleotide including a region complementary to the transposon end sequence. In some embodiments, the transposome complexes are present on the substrate at a density of at least 10³, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or 10¹⁰ or more complexes per mm². In some embodiments, the transposome complexes include a hyperactive Tn5 transposase.
- the likelihood score is influenced by a pitch of the substrate, size of the substrate, a pattern of the substrate, temperature, input nucleic acid size distribution, or a combination thereof. In some embodiments, the likelihood score is influenced by a pitch of the substrate, size of the substrate, a pattern of the substrate, temperature, loading density, fragment directionality, or a combination thereof.
- the substrate includes microparticles. In some embodiments, the substrate includes a patterned surface. In some embodiments, the substrate includes wells. In some embodiments, the substrate includes a flow cell.
- FIG. 3A shows a method 300 of updating a standard BAM file using spatial information to disambiguate reads that map to more than one target polynucleotide.
- the standard BAM file includes both uniquely mapped reads and multi-mapped reads, where the read could be assigned to more than one target polynucleotide.
- the uniquely mapped reads may be used for variant calling without any further processing.
- variant calls are stored in a variant call file (“vcf”).
- multi-mapped reads where a read cannot be assigned to a single target polynucleotide are not suitable for variant calling due to the ambiguity of whether such reads are mapped to a particular target (reference) sequence.
- multi-mapped reads are reads that are mapped to more than one target polynucleotide because the reads have not been resolved adequately. As shown, unique sites within the target polynucleotide or genome can be leveraged to resolve a particular read. Similarly, some multi-mapped reads may be linked to other multi-mapped reads. In this circumstance, the system may use differentiating sites to correct this issue.
- the system may also look to disambiguate some multi-mapped reads by using spatial information of the particular read and the possible multi-mapped target polynucleotides to determine if one of the target polynucleotides is more likely than others to be the correctly mapped sequence based on the spatial location on the flow cell of each read that is attributed to the target polynucleotide. If the multi-mapped read is disambiguated, then the system may perform variant calling and update the BAM file to include the correct mapping location of the particular multi-mapped read. In some embodiments, the system updates the mapping quality (MAPQ) score for the particular multi-mapped read to indicate that the mapping quality has increased if the system has determined the correct target polynucleotide based on the spatial location information.
- FIG. 3B shows a method 350 of updating a standard BAM file using spatial information to disambiguate reads which are mapped to more than one target polynucleotide, similarly to the method 300 shown in FIG. 3A. However, in a step 355 of FIG. 3B, the method 350 may update a MAPQ value and assign a unique position to MAPQ0 reads, even when the read may be mapped to a segmental duplication.
- proximal reads with high MAPQ are fetched from these homologous regions, e.g., high MAPQ reads from flanking regions, 75kb or more in length, of the multi-mapped read, and then all alternative alignment locations are saved.
- when more than one proximal linked read is found, the mean value of the distances is kept.
- the MAPQ score of the multi-mapped read is updated only when one of the alternative alignments has an average distance smaller than 50kb; otherwise the multi-mapped read is left as MAPQ0.
- if the distance of all of the alternative alignments is greater than 50kb, or more than one of the alternative alignments is less than 50kb, then the MAPQ score of the multi-mapped read remains zero. Distances greater or shorter than 50 kb are also contemplated. This may be understood more completely with reference to FIG. 4A, which shows the position of several clusters on a flow cell and the potential location of those reads on the target polynucleotide or genome.
- the unknown multi-mapped read is indicated as having a MAPQ of 0 (MAPQ0) due to the uncertainty that the indicated alignment is correct.
- the MAPQ0 read is shown as having a distance on the genome of less than an average of 50kb from the other reads having MAPQ scores of 30, the latter of which are found to also be geographically nearby to the MAPQ0 read. Because the distance d1 is less than 50kb, the system determines that there is high likelihood that this assignment of the MAPQ0 read to this position on the genome is accurate, and so the MAPQ0 score is updated to a MAPQ score of 20.
- the system can update the MAPQ score based on various factors to indicate that the assignment to this position on the genome is likely to be correct.
- the MAPQ0 read is possibly located within 50kb of a set of three other MAPQ30 reads and also possibly a distance d2 of more than 50kb from the same reads. Given this disparity in the read distances, the system determines that the proper assignment of this read is the position within 50kb of the other MAPQ30 reads and the MAPQ score can be updated to be a MAPQ of 20 to indicate that the assignment to this position is likely to be correct.
- the MAPQ0 read may be located spatially very near to other reads on the flow cell (distances greater or shorter than 50 flow cell units are contemplated).
- the threshold for nearness is 100 flow cell units.
- the average distance for each of those reads on the genome is more than 50kb, where d1 > 50kb. In this circumstance, it’s not clear that the MAPQ0 read is correctly assigned to this portion of the genome and so the MAPQ score would stay as zero.
- the MAPQ0 read may be positioned near other reads on the flow cell as shown in the clusters thereon, but the multiple possible locations of the MAPQ0 read on the lower genome diagram of the Figure make it difficult to assign one location on the genome.
- multiple MAPQ0 reads may be located less than 50kb from the other MAPQ30 reads in either of their possible locations (e.g., d1 < 50kb and d2 < 50kb).
- assigning the MAPQ0 reads to one location or the other is a challenge.
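The scenarios above reduce to one decision: a MAPQ0 read is resolved only when exactly one of its candidate genomic locations lies, on average, within the genomic distance threshold of the spatially nearby high-MAPQ reads. The sketch below is one possible reading of that rule, assuming all positions are on the same contig; the function name, argument layout, and the returned MAPQ of 20 (taken from the example in the text) are illustrative.

```python
from statistics import mean


def resolve_multimapped(candidates, linked_positions, gd_threshold=50_000, new_mapq=20):
    """Pick a genomic position for a MAPQ0 read, or leave it unresolved.

    candidates       -- possible genomic positions of the MAPQ0 read
    linked_positions -- positions of spatially proximal high-MAPQ reads
    Returns (position, mapq); (None, 0) means the read stays MAPQ0.
    """
    if not candidates or not linked_positions:
        return None, 0
    # Mean genomic distance from each candidate to the linked reads.
    mean_gd = [mean(abs(c - p) for p in linked_positions) for c in candidates]
    close = [c for c, d in zip(candidates, mean_gd) if d < gd_threshold]
    if len(close) == 1:
        return close[0], new_mapq  # one unambiguous location
    return None, 0                 # none close enough, or more than one
```

With candidates at 100 kb and 5 Mb and linked reads near 100 kb, only the first candidate is within threshold and the read is resolved; with two candidates both within 50 kb, the read is left as MAPQ0.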
- a flow cell 500 includes a plurality of lanes 510 as shown in FIG. 5. Each lane 510 includes a plurality of surfaces. As shown, in some embodiments of the flow cell 500, a lane includes a top surface 512 and a bottom surface 514. In some embodiments, each surface is subdivided into a plurality of tiles 520.
- a cluster 530 is located on a tile 520 that is designated as 1201. This designation serves as an illustrative example only and is not limited to the alphanumeric characters shown in the figure.
- the tile 520 includes 2D X-Y coordinates as shown to provide the spatial information between clusters.
- the X-Y coordinates are derived from FQU (fastq units).
- the subdivision of the surface into tiles 520 is an artificial separation so that the surface of the flow cell is not separated into physical tiles, but instead the images captured by a camera can be segmented into tiles.
- the tiles 520 are subdivided by swath, which is a width of a camera.
- the tile 520 denotes the size of an image that can be captured by the camera.
- the X-Y coordinates are pixel values.
- 1 unit of a tile 520 can be approximated to be 1/10th of a pixel.
- a physical separation is contemplated in some embodiments where the tile can have physical barriers, wells, or other structures which separate one portion of the flow cell from another portion of the flow cell. In a non-limiting example shown in FIG. 6, a plurality of clusters 610, 620, 630 are shown in tile 600.
- spatial information, including X-Y coordinates, for clusters 610, 620, 630 is obtained by the camera that processes the pixel values of the digital image, the processing of which is shown in the readnames 612, 622, 632 for clusters 610, 620, 630, respectively.
- readnames 612, 622, 632 for clusters 610, 620, 630 respectively, shown in FIG. 6, follow the format of: Instrument:Run:Flowcell ID:Lane #:Surface (1 or 2, top or bottom):Swath:Tile-X:Tile-Y.
- Other formats of the readnames are contemplated so long as the X-Y spatial information is obtained.
- the instrument value is A01298; the run number is 280; the flow cell ID is HYC2MDRX2; the lane number is 2; for 1201, the identified Surface is 1, the identified Swath is 2, and the identified Tile is 1; the Tile X-coordinate is 26928; and the Tile Y-coordinate is 18349.
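A readname in the colon-separated format above can be split into its fields to recover the X-Y coordinates. The sketch below is a hedged example: the readname string in the usage note is assembled from the field values listed above rather than copied from FIG. 6, and the assumption that the four-digit designation packs surface, swath, and tile as one, one, and two digits (so that 1201 gives surface 1, swath 2, tile 1) is an inference from that example. `ClusterLocation` and `parse_readname` are hypothetical names.

```python
from typing import NamedTuple


class ClusterLocation(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    surface: int  # 1 or 2 (top or bottom)
    swath: int
    tile: int
    x: int        # tile X-coordinate
    y: int        # tile Y-coordinate


def parse_readname(readname: str) -> ClusterLocation:
    """Split a readname into instrument, run, flow cell, lane, tile and X-Y fields."""
    # Drop a leading '@' and anything after the first space (FASTQ-style headers).
    core = readname.lstrip("@").split(" ")[0]
    fields = core.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell, lane, tile_code, x, y = fields
    # Assumption: the designation packs surface, swath and tile as 1 + 1 + 2 digits.
    if len(tile_code) != 4 or not tile_code.isdigit():
        raise ValueError(f"unexpected tile designation: {tile_code!r}")
    return ClusterLocation(
        instrument, int(run), flowcell, int(lane),
        int(tile_code[0]), int(tile_code[1]), int(tile_code[2:]),
        int(x), int(y),
    )
```

For example, `parse_readname("A01298:280:HYC2MDRX2:2:1201:26928:18349")` yields lane 2, surface 1, swath 2, tile 1, and coordinates (26928, 18349). Distances between two parsed clusters are only meaningful when they share the same lane, surface, and tile.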
- spatial information is used to link reads together.
- spatial information is used to link reads together, where the link between the reads can be physical or non-physical. In some embodiments, spatial information is used to link reads together to form a longer linked read with one or more read subpairs.
- the linked reads have expected properties such as, but not limited to, expected length distribution, distance between pairs, and number of pairs. These properties can be leveraged in genome analysis.
- Fragment 1 ----- Fragment 2 are linked using spatial information that confirm Fragments 1 and 2 are from the same polynucleotide.
- the length of the linked read construct is the length of Fragment 1 and Fragment 2 plus 5 units, where each unit is indicated by “-”.
- the distance between Fragment 1 and 2 is 5 units, such as 5 flow cell units.
- Fragment 3 ----- Fragment 4 ------ Fragment 5 are linked using spatial information to confirm that Fragments 3, 4, and 5 are from the same polynucleotide.
- the distance between Fragments 3 and 4 is 5 units, where each unit is indicated by “-”; and the distance between Fragments 4 and 5 is 6 units.
- Linking reads can be performed in a genomic reference dependent or independent manner. For genomic independent linking, links are formed between reads using spatial information only. In a non-limiting example, reads are linked when they are 50 flow cell distance units apart.
- reference-based information (including, but not limited to, genomic information and the like) is not considered in linking these reads.
- for genomic dependent linking, links are formed between reads using spatial and genomic information.
- a pair of reads or multiple reads are linked when they are 50 flowcell distance units apart and within 10 kbp distance when the reads are aligned to a reference genome.
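Both linking modes can be sketched with a single function in which the genomic constraint is optional. This is an illustrative sketch only: it reads "50 flow cell distance units apart" as "within 50 units", assumes all reads lie on the same tile and contig, and groups reads transitively with a small union-find, none of which is specified in the text. The name `link_reads` and its parameters are hypothetical.

```python
import math
from itertools import combinations


def link_reads(reads, max_spatial=50.0, max_genomic=None):
    """Group reads into linked sets.

    reads       -- dict mapping read name to (x, y, genomic_position)
    max_spatial -- maximum flow cell distance for a link
    max_genomic -- maximum genomic distance in bp; None gives
                   reference-independent linking (spatial only)
    """
    parent = {name: name for name in reads}

    def find(name):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(reads, 2):
        xa, ya, pa = reads[a]
        xb, yb, pb = reads[b]
        if math.hypot(xa - xb, ya - yb) > max_spatial:
            continue
        if max_genomic is not None and abs(pa - pb) > max_genomic:
            continue
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in reads:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)
```

Calling `link_reads(reads)` links on spatial information alone, while `link_reads(reads, max_genomic=10_000)` also requires the aligned positions to be within 10 kbp. The pairwise loop is quadratic and is meant for illustration rather than for full flow cell data.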
- a system for assigning sequence reads on a substrate to their original target nucleic acid includes a substrate including clusters of amplified fragments of target polynucleotides bound to the substrate in spatial locations, and one or more processors.
- the one or more processors include instructions that when executed perform a method that includes obtaining spatial location information for the clusters of amplified fragments on the substrate; sequencing the target polynucleotides in each of the clusters to determine a sequence read and spatial location from each of the amplified fragments; determining the geographic distance between each cluster on the substrate; and assigning sequence reads from each cluster to a target nucleic acid based on their geographic distance from one another.
- the obtaining of the spatial location information is performed during the sequencing of the target nucleic acids.
- an alignment of the target nucleic acids to the target polynucleotides is determined as the sequencing of the target nucleic acids is being performed.
- the one or more processors further include instructions that when executed perform a method that includes defining a definitive position of each cluster on the substrate.
- a length of the target polynucleotides is greater than a length of the fragment.
- the assigning of the nucleic acid sequence reads includes determining the distance between each of the clusters and using the determined distance to assign reads to a specific target polynucleotide.
- the assigning of the nucleic acid sequence reads includes determining a likelihood score that at least a first and a second cluster on the substrate derive from the same target polynucleotide.
- the one or more processors further perform a method including increasing the likelihood score for at least the first cluster when the spatial distance between at least the first and second clusters is below a threshold value. In some embodiments, the one or more processors further perform a method including increasing the likelihood score for at least the first cluster when a genomic distance between at least the first and second clusters is below the threshold value. In some embodiments, the one or more processors further perform a method including increasing the likelihood score for at least the first cluster when the spatial distance and a genomic distance between at least the first and second clusters are below a threshold value.
- the likelihood score is influenced by a pitch of the substrate, size of the substrate, a pattern of the substrate, temperature, input nucleic acid size distribution, or a combination thereof.
- the substrate includes microparticles.
- the substrate includes a patterned surface.
- the substrate includes wells.
- the substrate includes nanowells.
- a pitch of the nanowell affects the likelihood score.
- the one or more processors further perform a method including determining whether the target nucleic acid has a variant when the spatial distance between at least the first and second clusters is below a threshold value and when the genomic distance is above the genomic distance threshold.
- the spatial distance threshold from a cluster forms a pattern of an ellipse or a circle around the cluster.
- the likelihood score of the first cluster is 0.
- the likelihood score of the second cluster is above 30.
- the location information includes a first spatial coordinate and a second spatial coordinate in a 2D coordinate system.
- the one or more processors further perform a method including sorting the plurality of the nucleic acid clusters by their spatial coordinates.
- transposome complexes include a second polynucleotide including a region complementary to the transposon end sequence.
- the transposome complexes are present on the substrate at a density of at least 10³, 10⁴, 10⁵, or 10⁶ complexes per mm².
- the transposome complexes include a hyperactive Tn5 transposase.
- FIG. 8 provides a non-limiting example 800 of flow cell grid geometry 805 and spatial metrics 810. As shown at grid 805, the nanowells of the flow cell are organized into a grid. The nanowell grid 805 is overlaid with a pixel grid (not shown) which is derived from an image of the flow cell grid 805 by a CCD camera.
- 1 pixel in the captured image represents 10 FQUs and each pixel size is 345 nm.
- Various cameras at different magnification levels will have different pixel sizes and the pixel size disclosed in FIG. 8 is by way of example only.
- the physical FQU size is determined to be 34.5 nm, in this example.
- Other metrics obtained with this information include the nanowell pitch, diameter, and interstitial distance.
- the nanowell pitch is shown to be 624 nm, 1.81 pixels, and 18.1 FQUs; the nanowell diameter is 360 nm, 1.04 pixels, and 10.4 FQUs; and the nanowell interstitial distance is 282 nm, 0.82 pixel, and 8.2 FQUs. Also shown in this example, the nanowell grid is organized in a hexagonal pattern.
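The metrics above follow from two numbers in the example: a pixel size of 345 nm and 10 FQUs per pixel, which gives 34.5 nm per FQU. A short sketch of the conversions, using only those example values (the constant and function names are illustrative, and other cameras or magnifications would change the constants):

```python
PIXEL_NM = 345.0                   # example pixel size from FIG. 8
FQU_PER_PIXEL = 10                 # 1 pixel represents 10 FQUs
FQU_NM = PIXEL_NM / FQU_PER_PIXEL  # physical size of one FQU: 34.5 nm


def nm_to_pixels(nm):
    """Convert a physical distance in nm to camera pixels."""
    return nm / PIXEL_NM


def nm_to_fqu(nm):
    """Convert a physical distance in nm to FQUs."""
    return nm / FQU_NM


def fqu_to_nm(fqu):
    """Convert a distance in FQUs back to nm."""
    return fqu * FQU_NM
```

Under these constants, a 624 nm pitch is about 1.81 pixels or 18.1 FQUs, and a flow cell distance of 50 FQUs corresponds to 1725 nm.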
- FIG. 9 provides a non-limiting example 900 of a relationship between spatial and genomic metrics, the latter with respect to an exemplary linear strand of DNA 905. As shown, based on the information obtained by a camera (e.g., pixel size), the pitch between two nanowells is 642 nm and the FQU is 18.
- FIG. 10 provides a non-limiting example of a workflow for updating a MAPQ score of a MAPQ0 read.
- alignment data is obtained, which includes two homologous regions (A and B).
- the alignment data is obtained from BAM, BED, SEG, or any other file formats. After the alignment data has been obtained, the workflow proceeds to blocks 1020 and 1030.
- Block 1020 includes fetching high MAPQ (>30) reads from regions A and B (+75kb flanks).
- Block 1030 includes searching for MAPQ0 reads from regions A and B (+75kb flanks) and saving all possible alternative alignment (AA) locations. Once the search for MAPQ0 reads has been completed in block 1030, the workflow moves to a decision state 1032 to determine whether a MAPQ0 read has been found.
- the workflow moves to block 1040 to search for high MAPQ reads fetched from block 1020 that are within a predetermined distance threshold from each MAPQ0 read.
- the search in block 1040 searches for proximal reads with high MAPQ in regions A and B. For example, in one embodiment, the maximum flow cell distance is equal to 50. If a MAPQ0 read is not found at the decision block 1032, then the workflow moves to the end and the process 1000 terminates.
- the workflow 1000 moves to the block 1040 to search for high MAPQ reads within a distance threshold from each MAPQ0 read.
- the workflow 1000 moves to a decision state 1042 to determine whether high MAPQ reads that are within the distance threshold from each MAPQ0 read have been found. If such high MAPQ reads are found, then the workflow moves to block 1050 to obtain a genomic distance (GD) between high MAPQ reads and all AAs. If a high MAPQ read is not found, then the workflow moves to the end.
- measuring the distance between the high MAPQ reads and all alternative alignments (keep the mean value of the distances when more than one proximal link read is found).
- the workflow moves to a decision state 1052 to determine whether one of the AAs has an average GD within a GD threshold. In one example, if one of the alternative alignments has an average GD smaller than 50kb, then the workflow proceeds to block 1060 to update the MAPQ score of the MAPQ0 read. If all of the AAs are > 50kb, or if more than one AA is < 50kb and the rest are ≥ 50kb, then the workflow moves to block 1061 and the MAPQ score of the MAPQ0 read is not updated.
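The FIG. 10 workflow can be followed end to end on in-memory records, as in the sketch below. It is a simplified illustration under several assumptions: reads are already restricted to regions A and B with their flanks, all positions are on one contig, flow cell distance is Euclidean, and a resolved read receives a MAPQ of 20 as in the earlier example. `SpatialRead` and `update_mapq0` are hypothetical names, and no BAM parsing is attempted.

```python
import math
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class SpatialRead:
    name: str
    x: float
    y: float
    mapq: int
    pos: int                                            # reported alignment position
    alt_positions: list = field(default_factory=list)   # alternative alignments (AA)


def update_mapq0(reads, max_fc_distance=50.0, gd_threshold=50_000,
                 high_mapq=30, new_mapq=20):
    """Update MAPQ0 reads in place following the FIG. 10 blocks; returns the list."""
    high = [r for r in reads if r.mapq > high_mapq]           # block 1020
    for read in reads:
        if read.mapq != 0:                                    # block 1030 / state 1032
            continue
        neighbours = [h for h in high                         # block 1040
                      if math.hypot(h.x - read.x, h.y - read.y) <= max_fc_distance]
        if not neighbours:                                    # state 1042
            continue
        candidates = [read.pos, *read.alt_positions]
        mean_gd = [mean(abs(c - h.pos) for h in neighbours)   # block 1050
                   for c in candidates]
        close = [c for c, d in zip(candidates, mean_gd) if d < gd_threshold]
        if len(close) == 1:                                   # state 1052 -> block 1060
            read.pos, read.mapq = close[0], new_mapq
        # otherwise block 1061: the read stays MAPQ0
    return reads
```

Treating the reported position as one more candidate alongside the alternative alignments is a design choice of this sketch, so that every possible location is scored by the same mean-distance rule.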
- FIG. 11 provides a non-limiting example of a workflow of updating a MAPQ score of a MAPQ0 read.
- alignment data is obtained, which includes two homologous regions (A and B).
- the alignment data is obtained from BAM, BED, SEG, or any other file formats.
- Block 1120 includes fetching high MAPQ (>30) reads from A and B (+75kb flanks).
- Block 1130 includes searching for MAPQ0 reads from A and B (+75kb flanks) and saving all possible alternative alignment (AA) locations.
- the workflow moves to a decision state 1132 to determine whether a MAPQ0 read has been found. If a MAPQ0 read has been found, then the workflow moves to block 1140 to search for high MAPQ reads fetched from block 1120 that are within a distance threshold from each MAPQ0 read. If a MAPQ0 read is not found, then the workflow moves to block 1135 where the regional window to obtain alignment data is expanded to be greater than the starting +75kb flanks.
- the workflow moves to a decision state 1142 to determine whether high MAPQ reads that are within the distance threshold from each MAPQ0 read have been found. If such high MAPQ reads are found, then the workflow moves to block 1150 to obtain a genomic distance (GD) between high MAPQ reads and all AAs. If a high MAPQ read is not found, then the workflow moves to block 1135 where the regional window to obtain alignment data is expanded to be greater than the starting +75kb flanks.
- measuring the distance between the high MAPQ reads and all alternative alignments (keep the mean value of the distances when more than one proximal link read is found).
- the workflow moves to a decision state 1152 to determine whether one of the AAs has an average GD within a GD threshold. In this example, if one of the alternative alignments has an average GD smaller than 50kb, then the workflow proceeds to block 1160 to update the MAPQ score of the MAPQ0 read. If all of the AAs are > 50kb, or if more than one AA is < 50kb and the rest are ≥ 50kb, then the workflow moves to block 1161 and the MAPQ score of the MAPQ0 read is not updated.
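What distinguishes FIG. 11 from FIG. 10 is block 1135: when no MAPQ0 read or no proximal high-MAPQ read is found, the flanking window is widened and the search repeats. The sketch below isolates that retry loop. The step size, the upper limit on the flank, and the `try_resolve` callback (standing in for blocks 1120 through 1150 at a given flank size) are assumptions added for the example; the text does not say by how much the window grows or when expansion stops.

```python
def resolve_with_expanding_window(try_resolve, start_flank=75_000,
                                  step=25_000, max_flank=200_000):
    """Retry a search with progressively wider flanks.

    try_resolve -- callable taking a flank size in bp and returning a result,
                   or None when nothing usable was found at that size
    Returns (flank, result) for the first successful flank, or None.
    """
    flank = start_flank
    while flank <= max_flank:
        result = try_resolve(flank)
        if result is not None:
            return flank, result
        flank += step  # block 1135: expand the regional window
    return None
```

An explicit upper bound is included so that the loop terminates even when no flank size yields a usable read.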
- Various embodiments of the present disclosure may be a system, a method, and/or a computer program product at any possible technical detail level of integration.
- the computer program product may include a computer readable storage medium (or mediums) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
- the functionality described herein may be performed as software instructions are executed by, and/or in response to software instructions being executed by, one or more hardware processors and/or any other suitable computing devices.
- the software instructions and/or other executable code may be read from a computer readable storage medium (or mediums).
- Computer readable storage mediums may also be referred to herein as computer readable storage or computer readable storage devices.
- the computer readable storage medium can be a tangible device that can retain and store data and/or instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device (including any volatile and/or non-volatile electronic storage devices), a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a solid state drive, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
- Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages.
- Computer readable program instructions may be callable from other instructions or from themselves, and/or may be invoked in response to detected events or interrupts.
- Computer readable program instructions configured for execution on computing devices may be provided on a computer readable storage medium, and/or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution) that may then be stored on a computer readable storage medium.
- Such computer readable program instructions may be stored, partially or fully, on a memory device (e.g., a computer readable storage medium) of the executing computing device, for execution by the computing device.
- the computer readable program instructions may execute entirely on a user's computer (e.g., the executing computing device), partly on the user’s computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
- These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart(s) and/or block diagram(s) block or blocks.
- The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer.
- The remote computer may load the instructions and/or modules into its dynamic memory and send the instructions over a telephone, cable, or optical line using a modem.
- A modem local to a server computing system may receive the data on the telephone/cable/optical line and use a converter device including the appropriate circuitry to place the data on a bus.
- The bus may carry the data to a memory, from which a processor may retrieve and execute the instructions.
- The instructions received by the memory may optionally be stored on a storage device (e.g., a solid-state drive) either before or after execution by the computer processor.
- Each block in the flowchart or block diagrams may represent a service, module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- The functions noted in the blocks may occur out of the order noted in the Figures.
- Two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- Certain blocks may be omitted in some implementations.
- The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate.
- Any of the processes, methods, algorithms, elements, blocks, applications, or other functionality (or portions of functionality) described in the preceding sections may be embodied in, and/or fully or partially automated via, electronic hardware such as application-specific processors (e.g., application-specific integrated circuits (ASICs)), programmable processors (e.g., field programmable gate arrays (FPGAs)), application-specific circuitry, and/or the like (any of which may also combine custom hard-wired logic, logic circuits, ASICs, FPGAs, etc. with custom programming/execution of software instructions to accomplish the techniques).
- Any of the above-mentioned processors, and/or devices incorporating any of the above-mentioned processors, may be referred to herein as, for example, “computers,” “computer devices,” “computing devices,” “hardware computing devices,” “hardware processors,” “processing units,” and/or the like.
- Computing devices of the above embodiments may generally (but not necessarily) be controlled and/or coordinated by operating system software, such as Mac OS, iOS, Android, Chrome OS, Windows OS (e.g., Windows XP, Windows Vista, Windows 7, Windows 8, Windows 10, Windows 11, Windows Server, etc.), Windows CE, Unix, Linux, SunOS, Solaris, Blackberry OS, VxWorks, or other suitable operating systems.
- The computing devices may be controlled by a proprietary operating system.
- Conventional operating systems control and schedule computer processes for execution, perform memory management, provide file system, networking, and I/O services, and provide user interface functionality, such as a graphical user interface (“GUI”), among other things.
- Ranges provided herein include the stated range and any value or sub-range within the stated range, as if such value or sub-range were explicitly recited.
- A range from about 2 kbp to about 20 kbp should be interpreted to include not only the explicitly recited limits of from about 2 kbp to about 20 kbp, but also to include individual values, such as about 3.5 kbp, about 8 kbp, about 18.2 kbp, etc., and sub-ranges, such as from about 5 kbp to about 10 kbp, etc.
- Conditional language such as “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain examples include, while other examples do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular example.
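The range convention stated above (a stated range also covers every value and sub-range inside it) can be illustrated with a minimal sketch. The function names are hypothetical and not part of the disclosure, and the sketch treats the limits as exact numbers; it does not attempt to model the tolerance implied by "about".

```python
def contains_value(low, high, value):
    """Return True if value lies within the closed range [low, high]."""
    return low <= value <= high


def contains_subrange(low, high, sub_low, sub_high):
    """Return True if [sub_low, sub_high] is a valid range nested in [low, high]."""
    return low <= sub_low <= sub_high <= high


# The example range from the text, in kbp (limits treated as exact: an assumption).
STATED_RANGE_KBP = (2.0, 20.0)

# Individual values named in the text: 3.5, 8 and 18.2 kbp.
values_ok = [contains_value(*STATED_RANGE_KBP, v) for v in (3.5, 8.0, 18.2)]

# The sub-range named in the text: 5 kbp to 10 kbp.
subrange_ok = contains_subrange(*STATED_RANGE_KBP, 5.0, 10.0)

# A value outside the stated range, for contrast (not taken from the text).
outside = contains_value(*STATED_RANGE_KBP, 25.0)

print(values_ok, subrange_ok, outside)
```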
Landscapes
- Chemical & Material Sciences (AREA)
- Organic Chemistry (AREA)
- Life Sciences & Earth Sciences (AREA)
- Analytical Chemistry (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Immunology (AREA)
- Microbiology (AREA)
- Molecular Biology (AREA)
- Biotechnology (AREA)
- Physics & Mathematics (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Biochemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Genetics & Genomics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present disclosure relates to methods for assigning nucleic acid sequence reads to target polynucleotides, including: providing a substrate having transposome complexes immobilized thereon, the transposome complexes comprising a transposase and a first polynucleotide comprising an end sequence and a first tag; contacting the transposome complexes with target polynucleotides under conditions that fragment the target polynucleotides; amplifying the target polynucleotides to form a plurality of nucleic acid clusters on the substrate; determining the nucleic acid sequence reads of the fragmented nucleic acids in each of the nucleic acid clusters; and assigning the nucleic acid sequence reads to the target polynucleotides using the location information obtained.
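The final step of the abstract, assigning reads to target polynucleotides from location information, can be pictured with a small sketch. This is only an illustration under stated assumptions, not the claimed method: it assumes each read carries hypothetical (x, y) cluster coordinates on the substrate, and that clusters within a chosen distance threshold of one another derive from the same target polynucleotide. The function name, the threshold, and the single-linkage grouping (implemented with union-find) are all choices made for this example.

```python
from itertools import combinations
from math import hypot


def link_reads_by_proximity(reads, max_distance):
    """Group reads whose clusters lie within max_distance of each other.

    reads: list of (read_id, x, y) tuples; the coordinates are hypothetical
    cluster positions on the substrate, in arbitrary units.
    Returns a sorted list of sorted read-id lists, one per linked group.
    """
    parent = {read_id: read_id for read_id, _, _ in reads}

    def find(node):
        # Walk up to the group representative, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Merge every pair of reads within the threshold (O(n^2), fine for a sketch).
    for (id_a, xa, ya), (id_b, xb, yb) in combinations(reads, 2):
        if hypot(xa - xb, ya - yb) <= max_distance:
            parent[find(id_a)] = find(id_b)

    groups = {}
    for read_id, _, _ in reads:
        groups.setdefault(find(read_id), []).append(read_id)
    return sorted(sorted(group) for group in groups.values())


# Made-up coordinates: three neighbouring clusters and two distant ones.
reads = [
    ("r1", 0.0, 0.0),
    ("r2", 1.0, 0.5),
    ("r3", 1.8, 1.0),
    ("r4", 50.0, 50.0),
    ("r5", 50.5, 49.5),
]
groups = link_reads_by_proximity(reads, max_distance=1.5)
print(groups)
```

Because the grouping is single-linkage, r1 and r3 end up in the same group through r2 even though they are more than the threshold apart. A real pipeline would likely use a spatial index instead of all-pairs comparison and would combine proximity with sequence evidence.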
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363511593P | 2023-06-30 | 2023-06-30 | |
| US63/511,593 | 2023-06-30 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025006487A1 (fr) | 2025-01-02 |
Family
ID=91959237
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/035447 Pending WO2025006487A1 (fr) | 2023-06-30 | 2024-06-25 | Use of flow cell spatial coordinates to link reads for improved genome analysis |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025006487A1 (fr) |
Citations (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO1991006678A1 (fr) | 1989-10-26 | 1991-05-16 | Sri International | DNA sequencing |
| WO2004018497A2 (fr) | 2002-08-23 | 2004-03-04 | Solexa Limited | Modified nucleotides |
| US7057026B2 (en) | 2001-12-04 | 2006-06-06 | Solexa Limited | Labelled nucleotides |
| US7211414B2 (en) | 2000-12-01 | 2007-05-01 | Visigen Biotechnologies, Inc. | Enzymatic nucleic acid synthesis: compositions and methods for altering monomer incorporation fidelity |
| WO2007123744A2 (fr) | 2006-03-31 | 2007-11-01 | Solexa, Inc. | Systems and methods for sequencing-by-synthesis analysis |
| US7315019B2 (en) | 2004-09-17 | 2008-01-01 | Pacific Biosciences Of California, Inc. | Arrays of optical confinements and uses thereof |
| US7329492B2 (en) | 2000-07-07 | 2008-02-12 | Visigen Biotechnologies, Inc. | Methods for real-time single molecule sequence determination |
| US20080108082A1 (en) | 2006-10-23 | 2008-05-08 | Pacific Biosciences Of California, Inc. | Polymerase enzymes and reagents for enhanced nucleic acid sequencing |
| US7405281B2 (en) | 2005-09-29 | 2008-07-29 | Pacific Biosciences Of California, Inc. | Fluorescent nucleotide analogs and uses therefor |
| US20100120098A1 (en) | 2008-10-24 | 2010-05-13 | Epicentre Technologies Corporation | Transposon end compositions and methods for modifying nucleic acids |
| WO2012025250A1 (fr) | 2010-08-27 | 2012-03-01 | Illumina Cambridge Ltd. | Methods for sequencing polynucleotides |
| WO2012106546A2 (fr) * | 2011-02-02 | 2012-08-09 | University Of Washington Through Its Center For Commercialization | Massively parallel contiguity mapping |
| US20120282617A1 (en) | 2009-06-02 | 2012-11-08 | Biotium, Inc. | Detection using a dye and a dye modifier |
| WO2014108810A2 (fr) * | 2013-01-09 | 2014-07-17 | Illumina Cambridge Limited | Sample preparation on a solid support |
| WO2019028047A1 (fr) * | 2017-08-01 | 2019-02-07 | Illumina, Inc | Spatial indexing of genetic material and library preparation using hydrogel beads and flow cells |
| WO2019160820A1 (fr) * | 2018-02-13 | 2019-08-22 | Illumina, Inc. | DNA sequencing using hydrogel beads |
| WO2021021515A1 (fr) * | 2019-08-01 | 2021-02-04 | Illumina, Inc. | Flow cells |
| WO2023122755A2 (fr) * | 2021-12-23 | 2023-06-29 | Illumina Cambridge Limited | Flow cell and methods |
- 2024-06-25 WO PCT/US2024/035447 patent/WO2025006487A1/fr active Pending
Non-Patent Citations (1)
| Title |
|---|
| BENTLEY ET AL., NATURE, vol. 456, 2008, pages 53 - 59 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20240287507A1 (en) | | Massively parallel contiguity mapping |
| JP6743268B2 (ja) | | Synthetic nucleic acid spike-ins |
| Wadapurkar et al. | | Computational analysis of next generation sequencing data and its applications in clinical oncology |
| US20200056232A1 (en) | | Dna sequencing and epigenome analysis |
| CN109952612B (zh) | | Method for expression profile classification |
| CN114174530A (zh) | | Methods and compositions for analyzing nucleic acids |
| JP7541363B2 (ja) | | Methods and reagents for efficient genotyping of large numbers of samples via pooling |
| JP7707177B2 (ja) | | Methods and systems for determining fusion events |
| JP2018143257A (ja) | | Preservation of genomic connectivity information in fragmented genomic DNA samples |
| CN110520542A (zh) | | Methods for targeted nucleic acid sequence enrichment and applications in error-corrected nucleic acid sequencing |
| Day‐Williams et al. | | The effect of next‐generation sequencing technology on complex trait research |
| AU2020333348B2 (en) | | Method for detecting chromosomal abnormality by using information about distance between nucleic acid fragments |
| WO2025006487A1 (fr) | | Use of flow cell spatial coordinates to link reads for improved genome analysis |
| US20250210140A1 (en) | | Mapping resolution using spatial information of sequenced reads |
| WO2025160181A1 (fr) | | Methods for haplotype phasing |
| WO2025059045A1 (fr) | | Systems and methods for determining linkage of sequence reads on a flow cell |
| Bolognini | | Unraveling tandem repeat variation in personal genomes with long reads |
| EP4555084A2 (fr) | | Systems and methods for detecting variants in cells |
| HK40068259A (en) | | Method for detecting chromosomal abnormality by using information about distance between nucleic acid fragments |
| CTO et al. | | Human Genome Variation Discovery via |
| Feng | | Comprehensive Sequencing with Surface Tagmentation Based Technology |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24745555; Country of ref document: EP; Kind code of ref document: A1 |