US11078531B2 - Deepsimulator method and system for mimicking nanopore sequencing - Google Patents
Deepsimulator method and system for mimicking nanopore sequencing
- Publication number
- US11078531B2 (application US16/769,127)
- Authority
- US
- United States
- Prior art keywords
- nucleotide sequence
- electrical current
- deep learning
- current signals
- input nucleotide
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Definitions
- Embodiments of the subject matter disclosed herein generally relate to a system and method for obtaining nucleotide sequence reads, and more specifically, to mimicking all the stages of Nanopore sequencing.
- NGS Next-generation sequencing
- Nanopore sequencing offers the advantages of long reads (Byrne et al., 2017), point-of-care use (Lu et al., 2016), and PCR-free operation (Simpson et al., 2017), which enable de novo genome or transcriptome assembly with repetitive regions, real-time analysis in the field, and direct epigenetic detection, respectively.
- a method for sequencing biopolymers including selecting with a sequence generator module an input nucleotide sequence having plural k-mers; simulating with a deep learning simulator, actual electrical current signals corresponding to the input nucleotide sequence; identifying reads that correspond to the actual electrical current signals; and displaying the reads.
- the deep learning simulator includes a context-dependent deep learning model that takes into consideration a position of a k-mer of the plural k-mers on the input nucleotide sequence when calculating a corresponding actual electrical current.
- a computing device for sequencing biopolymers, the computing device including a processor and a display.
- the processor is configured to select with a sequence generator module an input nucleotide sequence having plural k-mers, to simulate with a deep learning simulator, actual electrical current signals corresponding to the input nucleotide sequence, and to identify reads that correspond to the actual electrical current signals.
- the display is configured to display the reads.
- the deep learning simulator includes a context-dependent deep learning model that takes into consideration a position of a k-mer of the plural k-mers on the input nucleotide sequence when calculating a corresponding actual electrical current.
- non-transitory computer readable medium including computer executable instructions, wherein the instructions, when executed by a processor, implement instructions for sequencing biopolymers as discussed above.
- FIG. 2A illustrates an actual nucleotide Nanopore sequencer while FIG. 2B illustrates the DeepSimulator that mimics the Nanopore sequencer;
- FIG. 4 illustrates how two different sets of data are transformed to a common space with a deep learning neural network algorithm
- FIG. 5 illustrates the structure of a context-dependent pore model component
- FIG. 6 illustrates a distribution used by a signal repeating component to repeat a signal generated by the context-dependent pore model
- FIG. 7 is a flowchart of a method for sequencing a biopolymer.
- FIG. 8 is a schematic diagram of a computing device that implements the DeepSimulator.
- a novel Nanopore simulator 100 requires the user only to supply, at an input/output module 102, a reference genome or assembled contigs 103, and to specify the coverage or the number of reads.
- the reference genome would then go through a sequence generator module 104 at a pre-processing stage, which produces several shorter sequences, satisfying the input coverage requirement and the read length distribution of the real Nanopore reads. Then, those sequences would pass through the signal generator module 106 , which contains the pore model component 106 A and the signal repeating component 106 B.
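By way of a non-limiting illustration, the pre-processing stage described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the function name `sample_reads`, the uniform choice of starting position, and the placeholder length sampler are assumptions made only for the example.

```python
import random

def sample_reads(reference, coverage, draw_length, rng):
    """Cut reads from `reference` until the requested coverage is reached.

    `draw_length` is a hypothetical callable that returns one read length
    per call; any read length distribution can be plugged in here.
    """
    target_bases = coverage * len(reference)
    reads, total = [], 0
    while total < target_bases:
        # Clamp the drawn length to the range [1, len(reference)].
        length = min(max(1, int(draw_length(rng))), len(reference))
        # Assumption: the starting position is chosen uniformly at random.
        start = rng.randrange(0, len(reference) - length + 1)
        reads.append(reference[start:start + length])
        total += length
    return reads

rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(1000))
# Placeholder length sampler (uniform between 50 and 200 bases).
reads = sample_reads(reference, coverage=5,
                     draw_length=lambda r: r.randint(50, 200), rng=rng)
```

Passing the length sampler as an argument keeps the coverage logic independent of whichever read length distribution the user selects.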
- the pore model component 106 A is used to model the expected current signal of a given k-mer (k usually equals 5 or 6 and here 5-mers are used without loss of generality), which is followed by the signal repeating component 106 B, which produces the simulated current signals. These simulated signals are similar to the real signals in both strength and scale. Finally, the simulated signal would go through Albacore2, the official basecaller module 108 , to produce the final simulated reads, which may be displayed on a display 110 .
- the various components discussed herein may be implemented in a processor 130 , the input/output module 102 may be implemented in dedicated circuitry 120 , and the display 110 may be just one component of a results presenting system 140 .
- the system 140 may include a printer, or a drawing, or an electronic file.
- A pictorial representation of an empirical experiment associated with Nanopore sequencing and of the simulation generated by the DeepSimulator is illustrated in FIGS. 2A and 2B.
- FIG. 2A shows the empirical experiment that is performed for determining the nucleotide sequence reads while
- FIG. 2B shows the simulation process of the DeepSimulator 100 .
- the DeepSimulator 100 is “deep” in two senses. First, instead of being a simulator that only mimics the result, this simulator mimics the entire Nanopore sequencing. Second, when translating the initial sequence into the current signal, a context-dependent pore model is created using deep learning methods. By mimicking the way the empirical Nanopore model works, the DeepSimulator simulates the complete Nanopore sequencing process, producing both the simulated current signals and the final reads.
- the DeepSimulator uses the basecaller module/software Albacore, which is also used by the Nanopore model.
- the DeepSimulator not only eliminates the procedure of learning the parameters in the profile, but also implicitly deploys the actual parameters.
- the DeepSimulator offers more flexibility. For instance, the user can choose to use a different basecaller, or tune the parameters in the signal generation module to obtain the final reads with different accuracies.
- the DeepSimulator attempts to mimic the entire pipeline of the Nanopore sequencing.
- the first stage is sample preparation, which would result in the nucleotide specimen 200 used in the experiment.
- the next stage is to measure the electrical current signals 210 of the nucleotide sequences 212 that form the specimen 200, using a Nanopore sequencing device, such as the MinION.
- the pore model component 106 A takes as input the nucleotide sequence 260 and outputs the context-dependent expected current signal for each 5-mer in the sequence 260 , which is discussed in detail later.
- the signal simulation component 106 B repeats the expected signal (which is output by the pore model component 106 A) several times, at each position, based on a signal repeat time distribution and then adds a random noise to produce the simulated current signals 270 . This component is also discussed later.
- the last module of the DeepSimulator is a commonly used basecaller 250, similar to the Nanopore system illustrated in FIG. 2A.
- a beta distribution is used to fit the reads for the E. coli genome.
- it cannot be fit using a single distribution e.g., reads for lambda phage genome.
- a mixture distribution of two gamma distributions is used to describe pattern 304 .
- the users can choose any of the three patterns. Alternatively, the user can also specify other distribution patterns for the read length.
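As an illustrative sketch only, drawing a read length from a mixture of two gamma distributions could look like the following. All parameter values (mixture weight, shapes, scales) are hypothetical placeholders rather than fitted values from the disclosure, and the function name `draw_read_length` is an assumption for the example.

```python
import random

def draw_read_length(rng, weight=0.5,
                     shape1=2.0, scale1=2000.0,
                     shape2=6.0, scale2=3000.0):
    """Draw one read length from a two-component gamma mixture.

    With probability `weight` the first component is used, otherwise the
    second. random.gammavariate(alpha, beta) takes shape and scale.
    """
    if rng.random() < weight:
        value = rng.gammavariate(shape1, scale1)
    else:
        value = rng.gammavariate(shape2, scale2)
    # A read has at least one base.
    return max(1, round(value))

rng = random.Random(42)
lengths = [draw_read_length(rng) for _ in range(1000)]
```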
- the context-dependent pore model 106 A is discussed.
- the first step is to simulate its corresponding current signal 270 .
- the current signal 270 is affected by the advance of the nucleotide sequence via the pore model 106 .
- the problem of building the pore model is formulated first, followed by the corresponding solution, BiLSTM-extended Deep Canonical Time Warping (BDCTW).
- BDCTW BiLSTM-extended Deep Canonical Time Warping
- the BDCTW algorithm is divided into three parts: (1) general framework of deep canonical time warping, (2) feature representation, and (3) neural network architecture. Then, the context-dependent pore model is generated.
- a first challenge is the scale difference. Because the original electrical current measurements 210 are sampled at 4000 Hz, which is about 8-10 times the rate at which the single-strand nucleotide sequence passes through the pore (the translocation speed is around 450 bases per second) (Stoiber and Brown, 2017), the temporal scale difference between the raw signals Ŷ and the nucleotide sequence X is large.
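As a rough check of the sampling rate and translocation speed quoted above, and assuming for illustration that the translocation speed is constant (an idealization), the expected number of current samples per base is:

```latex
\frac{4000\ \text{samples/s}}{450\ \text{bases/s}} \approx 8.9\ \text{samples per base}
```

This falls within the stated range of 8-10.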
- the sequence X is first encoded in step 402 by being, for example, digitized.
- the encoded sequence 406 is fed to a deep neural network (DNN) algorithm 408 to perform a spatial transformation.
- DNN deep neural network
- the DNN 408 is used to learn from known data (i.e., raw signal 210 ).
- the raw data 210 is also fed to a DNN to perform a spatial transformation. Note that the raw data 210 is actual data and thus, only the sequence 260 needs to be fit to the raw data.
- the transformed features f1 and f2 for X and Ŷ are not only temporally aligned with each other, but also maximally correlated.
- Fi(Xi; θi) represents the activation function of the final layer of the corresponding DNN for Xi, which has d maximally correlated units, where d ≤ min(d1, d2).
- F 1 in FIG. 4 corresponds to the sequence 260 and F 2 corresponds to the raw signal 210 .
- Such an operation reduces the input data samples to the same feature dimension and then performs a maximal correlation analysis, which essentially resembles the classical canonical correlation analysis (CCA) (Akaike, 1976).
- CCA canonical correlation analysis
- T1, T2 and T are the lengths of X, Ŷ, and the final alignment, respectively.
- Δi are the binary selection matrices that encode the alignment paths for Xi. That is, Δ1 and Δ2 remap the nucleotide sequence X with length T1 and raw signals Ŷ with length T2 to a common temporal scale T in space 400.
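To illustrate how binary selection matrices remap two sequences of different lengths to a common temporal scale, the following minimal sketch builds Δ1 and Δ2 from an alignment path. The path, the one-dimensional toy values, and the helper names `selection_matrix` and `remap` are hypothetical and chosen only for the example.

```python
def selection_matrix(indices, length):
    """Return a length x T binary matrix with exactly one 1 per column.

    Column t selects element indices[t] of the original sequence.
    """
    T = len(indices)
    delta = [[0] * T for _ in range(length)]
    for t, i in enumerate(indices):
        delta[i][t] = 1
    return delta

def remap(row, delta):
    """Compute the product row @ delta for a 1-D feature row."""
    T = len(delta[0])
    return [sum(row[i] * delta[i][t] for i in range(len(row)))
            for t in range(T)]

# Hypothetical alignment path of (sequence index, signal index) pairs.
path = [(0, 0), (0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
x = [10.0, 20.0, 30.0]                     # toy sequence features, T1 = 3
y = [9.8, 10.1, 10.3, 19.7, 20.2, 30.1]    # toy raw signal, T2 = 6

delta1 = selection_matrix([i for i, _ in path], len(x))
delta2 = selection_matrix([j for _, j in path], len(y))
x_aligned = remap(x, delta1)   # length T = 6
y_aligned = remap(y, delta2)   # length T = 6
```

After remapping, both sequences have the common length T given by the number of steps in the alignment path.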
- D is a diagonal matrix and I is the identity matrix.
- Vector 1 (0) is an appropriate dimensionality vector of all 1's (0's).
- Such an objective function can be solved via alternating optimization (Trigeorgis et al., 2016). Specifically, given the final layer output Fi(Xi; θi), the method employs dynamic time warping (DTW) (Salvador and Chan, 2007) to obtain the optimal warping matrices Δi, which temporally align the input sequence Xi and the final alignment.
- DTW dynamic time warping
- $\hat{\Sigma}_{ij} = \frac{1}{T-1} F_i(X_i;\theta_i)\,\Delta_i\, C_T\, \Delta_j^T\, F_j(X_j;\theta_j)^T$ denotes the empirical covariance between the transformed data sets, where $C_T$ is the centering matrix.
- the feature function F 1 (X 1 ; ⁇ 1 ) is extended in the original DCTW with bi-directional long short-term memory (Bi-LSTM) (Boza et al., 2017) to incorporate the contextual information.
- the DNN architecture in FIG. 4 is further discussed with reference to FIG. 5 .
- elements f 1 (x 1 ) correspond to the activation function F 1 of the final layer of DNN 408 and elements f 2 (x 2 ) correspond to the activation function F 2 of the final layer of DNN 410 .
- only the DNN 408 is used for the learning process as the raw signal 210 does not need to learn anything, i.e., only the sequence 260 is learning which electrical signals from the raw signals 210 correspond to each of its elements.
- this embodiment uses one “1” and 4^k − 1 “0”s (necessary for the one-hot encoding) to represent each k-mer (k ∈ {1, 3, 5}). Then, for each nucleotide sequence X 260 with a length T1, as shown in FIG. 5, the one-hot encoding 501 would produce three feature matrices 502, 504, 506, with dimensions T1×4, T1×64, and T1×1024, respectively. Each row in a feature matrix 502 or 504 or 506 represents a specific position and each column represents the appearance of a certain k-mer.
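The one-hot encoding described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the k-mer is taken to be centered on each position, and positions too close to either end of the sequence (where no full k-mer fits) are encoded as all-zero rows. Neither choice is specified in the text, and the helper names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def kmer_index(k):
    """Map every k-mer over ACGT to a column index (4**k columns)."""
    return {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}

def one_hot_kmers(sequence, k):
    """Return a T1 x 4**k matrix; row t encodes the k-mer centered at t."""
    index = kmer_index(k)
    half = k // 2
    rows = []
    for t in range(len(sequence)):
        row = [0] * (4 ** k)
        # Assumption: edge positions without a full k-mer stay all-zero.
        if t - half >= 0 and t + half < len(sequence):
            row[index[sequence[t - half:t + half + 1]]] = 1
        rows.append(row)
    return rows

seq = "ACGTACGTAC"
# Three feature matrices with 4, 64 and 1024 columns, respectively.
features = {k: one_hot_kmers(seq, k) for k in (1, 3, 5)}
```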
- the neural network architecture 500 is discussed with regard to FIG. 5 .
- the Bi-LSTM architecture 510 is used for the other feature mapping function F 1 (X 1 ; ⁇ 1 ) for the nucleotide sequence 260 (see FIG. 4 ).
- the Bi-LSTM architecture 510 is used. Specifically, as shown in FIG.
- this embodiment uses a Bi-LSTM block 510 A, 510 B, and 510 C, respectively, to obtain the hidden representation, with 50 forward LSTM cells 512 and 50 backward LSTM cells 514 .
- the method feeds the concatenated representation 522 into a fully-connected layer 530 with 200 nodes, which is followed by a regression layer 540, after which a transformed signal 550 is generated. All the weights are initialized using the Xavier method. To avoid overfitting, this embodiment utilizes weight decay with a coefficient of 1e−4. In one application, it is possible to choose Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 1e−4. Batch normalization (Ioffe and Szegedy, 2015) is deployed to accelerate training, and the batch size is set to 64 during training.
- the deep neural network model 500 is implemented using TensorFlow (Abadi, 2016) and can converge within 6 hours with the help of two Pascal Titan X cards, which is faster than the existing simulators.
- the deep neural network 500 in deep canonical time warping for feature mapping of the input nucleotide sequence 260 shown in FIG. 5 becomes the context-dependent pore model 106 A after training.
- the encodings then go through BiLSTM layers 510 , fully-connected layers 530 as well as the final regression layer 540 to generate the expected electrical signals 550 .
- the next simulation step performed by the signal generator module 106 is to repeat the expected electrical signals 550 at each position and add random noise. As previously discussed, this is achieved by the signal repeating component 106B. It is well known that during sequencing, the acquisition speed of the raw signal 210 is much faster than the moving speed of the DNA or RNA, causing a given 5-mer to be measured multiple times. Thus, to convert the expected signals 550 produced by the pore model 106A to the current signals 270, which can be fed into the basecaller module 250, it is necessary to repeat each expected signal 550 several times. Similar to the read length method discussed with regard to FIGS.
- the repeat time is modeled using a mixture of alpha distributions.
- the repeat time would be drawn from the distribution for each position on the expected signal, generating the simulated current signal by repeating that position for a certain number of times.
- the raw signals are extremely noisy due to the complicated sequencing environment, including voltage changes, noise and interactions between channels (David et al., 2016). Therefore, in one application, Gaussian noise is added with the user-defined variance parameter to each position of the simulated signals.
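The signal repeating and noise steps can be sketched as below. This is a minimal sketch, not the disclosed implementation: the repeat-time sampler is passed in as a hypothetical callable (a uniform placeholder is used here instead of the fitted mixture distribution), the expected levels are made-up values, and the noise is parameterized by its standard deviation, i.e., the square root of the user-defined variance.

```python
import random

def simulate_signal(expected, draw_repeat, noise_std, rng):
    """Repeat each expected level and add Gaussian noise to every sample.

    `draw_repeat` is a hypothetical callable that returns the repeat time
    for one position; `noise_std` is the noise standard deviation.
    """
    signal = []
    for value in expected:
        # Each position is measured at least once.
        repeats = max(1, int(draw_repeat(rng)))
        signal.extend(value + rng.gauss(0.0, noise_std)
                      for _ in range(repeats))
    return signal

rng = random.Random(0)
expected_levels = [85.2, 92.7, 78.4]   # made-up expected current levels
# Placeholder repeat-time sampler: uniform integer between 6 and 12.
simulated = simulate_signal(expected_levels,
                            draw_repeat=lambda r: r.randint(6, 12),
                            noise_std=1.5, rng=rng)
```

Setting `noise_std` to zero returns the pure repeated signal, which is convenient for checking the repeat logic in isolation.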
- One difficulty of this step is to get the statistics of the repeat time, as shown in FIG. 6 .
- in the first step (i), the method takes as input the reference genome 260, the raw signals 210 produced by the MinION, and the basecalled reads 240 from Albacore, and maps the reads onto the reference genome with Minimap (Li, 2016), which marks out the (at least approximate) ground truth sequence that corresponds to the raw signal.
- in the second step (ii), with the ground truth sequence, the method obtains the expected signal of each 5-mer in the sequence using the context-dependent pore model 106A.
- in step (iii), the method applies dynamic time warping (DTW) (Salvador and Chan, 2007) to map the raw signal 210 to the expected signal 550, which is based on the fact that those two signals should have a similar shape.
- step (iv) based on the mapping, it is possible to find out the repeat time from the raw signal positions that correspond to each expected signal position. Performing the above method on a large dataset, it is possible to obtain a stable statistic of the repeat time. Then, the method fits the distribution as a mixture model.
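Steps (iii) and (iv) can be illustrated with the following sketch. It uses the classic exact dynamic time warping recursion with an absolute-difference cost, which is a simplification swapped in for the example and not the approximate method of the cited reference. The toy signals and the function name `dtw_path` are hypothetical.

```python
from collections import Counter

def dtw_path(a, b):
    """Return the optimal DTW alignment path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

expected = [10.0, 20.0, 30.0]                          # toy expected signal
raw = [10.1, 9.9, 10.0, 20.2, 19.8, 30.1, 29.9, 30.0]  # toy raw signal

path = dtw_path(expected, raw)
# Repeat time: number of raw samples mapped to each expected position.
counts = Counter(i for i, _ in path)
repeat_counts = [counts[i] for i in range(len(expected))]
```

Collecting `repeat_counts` over many reads would give the empirical repeat-time statistics to which a mixture model can then be fitted.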
- Nanopore sequencing datasets from different species were used, ranging from three in-house datasets: lambda phage, E. coli K-12 sub-strain MG1655, and Pandoraea pnomenusa strain 6399, to the publicly available human data.
- all the samples were sequenced on the MinION device with 1D protocol on R9.4 flow cells (FLOMIN106 protocol).
- the publicly available human dataset is the human chromosome 21 from the Nanopore WGS Consortium (Jain et al., 2017b).
- the samples in this dataset were sequenced from the NA12878 human genome reference on the Oxford Nanopore MinION using 1D ligation kits (450 bp/s) with R9.4 flow cells.
- the Nanopore raw signal datasets in the FAST5 format were downloaded from nanopore-wgs-consortium.
- the reference genomes of the four datasets were downloaded from NCBI.
- the context-dependent pore model 106A of the second module 106 in the DeepSimulator 100 was trained on the Pandoraea pnomenusa dataset. To construct a dataset for checking the performance of the pore models, which is discussed later, 700 reads were randomly sampled from each of the remaining three species to form a dataset containing 2100 reads.
- sequence generator module 104 also provides an interface so that the user can enter a user-defined read length distribution.
- the distributions of the length of the reads simulated by the DeepSimulator on human, E. coli K-12 sub-strain MG1655, and lambda phage were found to be very similar to those of the experimental reads.
- the output reads of the DeepSimulator can have a basecalling accuracy ranging from 83% to 97%.
- the basecalling module is configured to assign a base to an actual electric current.
- the basecaller module outputs the reads 240 ′, which include the plural bases of the selected nucleotide sequence 260 that was originally input to the DeepSimulator.
- Nanopore sequencing has higher potential in genome assembly than the other sequencing technologies.
- one of the main applications for Nanopore sequencing is de novo assembly.
- Two widely recognized de novo assembly pipelines, Canu (Koren et al., 2017) and Miniasm (Li, 2016) with Racon (Vaser et al., 2017), were used to perform this task on two different sets of simulated reads generated by the DeepSimulator from the E. coli K-12 genome and the lambda phage genome, respectively. Both experiments succeeded in assembling the simulated reads into one contig. The comparison between the assemblies and the reference genome was plotted using MUMmer (Delcher et al., 1999).
- Single nucleotide polymorphisms are found to be involved in the etiology of many human diseases.
- hundreds of SNPs in the mitochondrial DNA (mtDNA) have been linked to aging-related genes (Stewart and Chinnery, 2015).
- mtDNA mitochondrial DNA
- the current methods, which are designed for detecting mitochondrial mutations from a population of cells, perform massively parallel sequencing of short DNA fragments and have difficulty in performing complete haplotyping.
- Nanopore sequencing, which has the potential of performing long-read single-molecule sequencing of mtDNA, may overcome this hurdle. To mimic the ideal single-molecule Nanopore sequencing scenario, experiments were conducted on the success rate of SNP detection with respect to the sequencing coverage, using the simulated reads from the DeepSimulator. Note that long-read sequencing is considered to have average fragment lengths of over 10,000 base-pairs.
- the proposed DeepSimulator is the first successful Nanopore simulator that mimics the entire procedure of the Nanopore sequencing. Unlike the previous simulators, which only simulate the reads from the statistical patterns of the real data, the DeepSimulator simulates both the raw electrical current signals and the nucleotide reads.
- the pipeline of the simulator is highly modularized, which makes it easy for users to customize. For example, the users can replace Albacore with another basecaller to obtain reads with the profile of that basecaller.
- owing to its modularization, the DeepSimulator is more likely than other simulators to keep up with the rapid development of the Nanopore sequencing technology. If one step of the Nanopore sequencing pipeline is updated, it is easy to update the corresponding module of the DeepSimulator without changing the entire pipeline.
- the simulated electrical current signals are useful for the development of basecallers and for the benchmarking of signal-level read mappers.
- the DeepSimulator can generate benchmark datasets to evaluate the newly developed methods for Nanopore sequencing data analysis. Unlike the empirical datasets whose ground truth is difficult to obtain, the DeepSimulator can be fully controlled, which makes it a practical complement to the empirical data. Second, as shown in the SNP detection experiments, it can act as a guidance to the empirical experiment by simulating the ideal case.
- the method includes a step 700 of selecting, with a sequence generator module 104, an input nucleotide sequence 260 having plural k-mers, a step 702 of simulating, with a deep learning simulator 100, actual electrical current signals 270 corresponding to the input nucleotide sequence 260, a step 704 of identifying reads 240′ that correspond to the actual electrical current signals 270, and a step 706 of displaying the reads 240′.
- the deep learning simulator 100 includes a context-dependent deep learning model 106A that takes into consideration a position of a k-mer of the plural k-mers on the input nucleotide sequence 260 when calculating a corresponding actual electrical current.
- the context-dependent deep learning model calculates transformed signals 550 by using a Bi-LSTM extended Deep Canonical Time Warping, which combines a bi-directional long short-term memory (Bi-LSTM) method with a deep canonical time warping (DCTW) method.
- the context-dependent deep learning model compares two linearly structured data sets having different lengths, and feature dimensionality, wherein the first data set corresponds to the input nucleotide sequence 260 and the second data set corresponds to measured electrical current signals 210 .
- the method may further include a step of applying a first deep neural network algorithm 408 to the input nucleotide sequence to obtain the first data set, and a step of applying a second deep neural network algorithm 410 to the measured electrical current signals to obtain the second data set, wherein the first and second data sets are in a common space.
- the method may also include a step of applying an objective function to the first and second data sets to temporally align the input nucleotide sequence and the measured electrical current signals, a step of repeating the transformed signals 550 , in a signal repeating module 106 B, at each position based on a mixture of alpha distributions, to generate the actual electrical current signals 270 , and/or a step of adding a random noise to the transformed signals.
- the method may include a step of using plural different k-mers for each base of the input nucleotide sequence, where the plural different k-mers include a 1-mer, a 3-mer and a 5-mer.
- the method may further include a step of encoding the bases of the input nucleotide sequence with a one-hot encoding using the plural different k-mers for each base.
- the method may include a step of randomly selecting a starting position along the input nucleotide and/or a step of selecting a length of a read based on one of three distributions.
- Computing device 800 of FIG. 8 is an exemplary computing structure that may be used in connection with such a system.
- the DeepSimulator 100 from FIG. 1 may be implemented in the computing device 800 .
- Exemplary computing device 800 suitable for performing the activities described in the exemplary embodiments may include a server 801 .
- a server 801 may include a central processor (CPU) 802 coupled to a random access memory (RAM) 804 and to a read-only memory (ROM) 806 .
- ROM 806 may also be other types of storage media to store programs, such as programmable ROM (PROM), erasable PROM (EPROM), etc.
- Processor 802 may communicate with other internal and external components through input/output (I/O) circuitry 808 and bussing 810 to provide control signals and the like.
- I/O input/output
- Processor 802 carries out a variety of functions as are known in the art, as dictated by software and/or firmware instructions.
- Server 801 may also include one or more data storage devices, including hard drives 812 , CD-ROM drives 814 and other hardware capable of reading and/or storing information, such as DVD, etc.
- software for carrying out the above-discussed steps may be stored and distributed on a CD-ROM or DVD 816 , a USB storage device 818 or other form of media capable of portably storing information. These storage media may be inserted into, and read by, devices such as CD-ROM drive 814 , disk drive 812 , etc.
- Server 801 may be coupled to a display 820 , which may be any type of known display or presentation screen, such as LCD, plasma display, cathode ray tube (CRT), etc.
- a user input interface 822 is provided, including one or more user interface mechanisms such as a mouse, keyboard, microphone, touchpad, touch screen, voice-recognition system, etc.
- Server 801 may be coupled to other devices, such as a smart device, e.g., a phone, tv set, computer, etc.
- the server may be part of a larger network configuration as in a global area network (GAN) such as the Internet 828 , which allows ultimate connection to various landline and/or mobile computing devices.
- GAN global area network
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Biotechnology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Data Mining & Analysis (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Organic Chemistry (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Zoology (AREA)
- Analytical Chemistry (AREA)
- Wood Science & Technology (AREA)
- General Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Biochemistry (AREA)
- Microbiology (AREA)
- Immunology (AREA)
- Genetics & Genomics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
Abstract
Description
-
- This application is a U.S. National Stage Application of International Application No. PCT/IB2018/058502, filed on Oct. 30, 2018, which claims priority to U.S. Provisional Patent Application No. 62/598,086, filed on Dec. 13, 2017, entitled “DEEPSIMULATOR: A DEEP SIMULATOR FOR MIMICKING NANOPORE SEQUENCING,” U.S. Provisional Patent Application No. 62/599,908, filed on Dec. 18, 2017, entitled “DEEPSIMULATOR: A DEEP SIMULATOR FOR MIMICKING NANOPORE SEQUENCING,” and U.S. Provisional Patent Application No. 62/702,161, filed on Jul. 23, 2018, entitled “DEEPSIMULATOR METHOD AND SYSTEM FOR MIMICKING NANOPORE SEQUENCING,” the disclosures of which are incorporated herein by reference in their entirety.
-
- Currently, the existing pore models (github.com/nanoporetech/kmer_models) are context-independent and they assign to each 5-mer a fixed value for the expected current signal, regardless of the location of the 5-mer on the nucleotide sequence. The novel pore model 106A is a context-dependent pore model, which takes advantage of a deep learning method, which has shown great potential in bioinformatics (see, for example, Alipanahi et al., 2015; Li et al., 2017; Dai et al., 2017). Nonetheless, as discussed later, it is challenging to train the deep learning model because the current signal is usually 8-10 times longer than the nucleotide sequence. To solve this difficulty, a novel deep learning strategy, BiLSTM-extended Deep Canonical Time Warping (BDCTW), which combines bi-directional long short-term memory (Bi-LSTM) (Graves and Schmidhuber, 2005) with deep canonical time warping (DCTW) (Trigeorgis et al., 2016), is used herein to solve the scale difference issue.
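For contrast, a context-independent pore model of the kind described above amounts to a fixed lookup table. The following sketch is illustrative only: the table holds four made-up entries (a real table would list all 4^5 = 1024 5-mers), the current values are invented, and the function name `expected_signal` is hypothetical.

```python
# Made-up expected current levels for a handful of 5-mers.
KMER_TABLE = {
    "ACGTA": 85.2,
    "CGTAC": 92.7,
    "GTACG": 78.4,
    "TACGT": 101.3,
}

def expected_signal(sequence, table, k=5):
    """Look up one fixed expected level per k-mer, ignoring its context."""
    return [table[sequence[i:i + k]]
            for i in range(len(sequence) - k + 1)]

# "ACGTA" occurs at positions 0 and 4 and receives the same level both
# times, because the lookup does not depend on where the 5-mer sits.
levels = expected_signal("ACGTACGTA", KMER_TABLE)
```

A context-dependent model, by comparison, may assign different expected levels to the two occurrences of the same 5-mer.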
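$$\underset{\theta_1,\theta_2,\Delta_1,\Delta_2}{\arg\min}\ \left\lVert F_1(X_1;\theta_1)\Delta_1 - F_2(X_2;\theta_2)\Delta_2 \right\rVert_F^2 \quad \text{subject to}\ F_i(X_i;\theta_i)\Delta_i \mathbf{1}_T = \mathbf{0}_d,\ F_1(X_1;\theta_1)\Delta_1\Delta_2^T F_2(X_2;\theta_2)^T = D,\ F_i(X_i;\theta_i)\Delta_i\Delta_i^T F_i(X_i;\theta_i)^T = I,\ \Delta_i \in \{0,1\}^{T_i \times T},\ i = 1, 2, \quad (1)$$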
where X1=X and X2=Ŷ. T1, T2 and T are the lengths of X, Ŷ, and the final alignment, respectively. Δi are the binary selection matrices that encode the alignment paths for Xi. That is, Δ1 and Δ2 remap the nucleotide sequence X with length T1 and raw signals Ŷ with length T2 to a common temporal scale T in space 400.
corr(F 1(X 1;θ1)Δ1 ,F 2(X 2;θ2)Δ2)=∥K DCTW∥*, (2)
where ∥·∥* is the nuclear norm, KDCTW={circumflex over (Σ)}11 −1/2{circumflex over (Σ)}12{circumflex over (Σ)}22 −1/2 is the kernel matrix of DCTW,
denotes the empirical covariance between the transformed data sets, where CT is the centering matrix,
where USVT=KDCTW is the singular value decomposition (SVD) of the kernel matrix KDCTW. By employing this equation as the sub-gradient, it is possible to optimize the parameters θi in each neural network DNN via back-propagation.
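As a rough illustration of the quantities above, the sketch below builds binary selection matrices from a given alignment path and evaluates the correlation for one-dimensional features. In that special case the kernel matrix is 1×1, so its nuclear norm reduces to the absolute Pearson correlation of the aligned sequences. The alignment path, the toy data, and all function names are assumptions made for illustration; the method described here learns the transformations with neural networks and finds the alignment by time warping, neither of which this sketch attempts.

```python
import math


def selection_matrix(indices, length):
    """Binary matrix of shape (length, T): column t selects element indices[t]."""
    T = len(indices)
    return [[1 if indices[t] == row else 0 for t in range(T)] for row in range(length)]


def remap(values, delta):
    """Compute the row vector `values @ delta` for a 1-D feature sequence."""
    T = len(delta[0])
    return [sum(values[r] * delta[r][t] for r in range(len(values))) for t in range(T)]


def correlation_1d(a, b):
    """|Pearson correlation|, i.e. the nuclear norm of the 1x1 kernel matrix."""
    T = len(a)
    mean_a, mean_b = sum(a) / T, sum(b) / T
    ca = [v - mean_a for v in a]  # centering, the role played by the centering matrix
    cb = [v - mean_b for v in b]
    s11 = sum(v * v for v in ca) / (T - 1)
    s22 = sum(v * v for v in cb) / (T - 1)
    s12 = sum(p * q for p, q in zip(ca, cb)) / (T - 1)
    return abs(s12 / math.sqrt(s11 * s22))


# Toy data (hypothetical): a length-3 feature sequence and a length-6 signal.
x = [1.0, 2.0, 3.0]
y = [1.1, 0.9, 2.0, 2.1, 3.0, 2.9]

# Hypothetical alignment path: each element of x is matched to two samples of y.
delta1 = selection_matrix([0, 0, 1, 1, 2, 2], len(x))  # shape (3, 6)
delta2 = selection_matrix([0, 1, 2, 3, 4, 5], len(y))  # shape (6, 6), identity

x_aligned = remap(x, delta1)  # x stretched to the common length T = 6
y_aligned = remap(y, delta2)  # unchanged, since delta2 is the identity
score = correlation_1d(x_aligned, y_aligned)
```

The selection matrix repeats elements of the shorter sequence so that both sequences share the common length T, which is how the length mismatch between nucleotide sequence and signal is absorbed before the correlation is computed.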
- Abadi, M. (2016). Tensorflow: Learning functions at scale. Acm Sigplan Notices, 51(9), 1-1.
- Akaike, H. (1976). Canonical correlation analysis of time series and the use of an information criterion. Mathematics in Science and Engineering, 126, 27-96.
- Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015). Predicting the sequence specificities of dna- and rna-binding proteins by deep learning. Nat Biotechnol, 33(8), 831-8.
- Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res, 25(17), 3389-402.
- Baker, E. A. G., Goodwin, S., McCombie, W. R., and Mendivil Ramos, O. (2016). Silico: A simulator of long read sequencing in pacbio and oxford nanopore. bioRxiv, page 76901.
- Boza, V., Brejova, B., and Vinar, T. (2017). Deepnano: Deep recurrent neural networks for base calling in minion nanopore reads. PloS one, 12(6), e0178751.
- Byrne, A., Beaudin, A. E., Olsen, H. E., Jain, M., Cole, C., Palmer, T., DuBois, R. M., Forsberg, E. C., Akeson, M., and Vollmers, C. (2017). Nanopore long-read rnaseq reveals widespread transcriptional variation among the surface receptors of individual b cells. bioRxiv, page 126847.
- Dai, H., Umarov, R., Kuwahara, H., Li, Y., Song, L., and Gao, X. (2017). Sequence2vec: A novel embedding approach for modeling transcription factor binding affinity landscape. Bioinformatics.
- David, M., Dursi, L. J., Yao, D., Boutros, P. C., and Simpson, J. T. (2016). Nanocall: an open source basecaller for oxford nanopore sequencing data. Bioinformatics, page btw569.
- Deamer, D., Akeson, M., and Branton, D. (2016). Three decades of nanopore sequencing. Nature biotechnology, 34(5), 518-525.
- Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L. (1999). Alignment of whole genomes. Nucleic Acids Research, 27(11), 2369-2376.
- Escalona, M., Rocha, S., and Posada, D. (2016). A comparison of tools for the simulation of genomic next-generation sequencing data. Nat Rev Genet, 17(8), 459-69.
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226-231. AAAI Press.
- Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602-610.
- Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
- Jain, C., Dilthey, A., Koren, S., Aluru, S., and Phillippy, A. M. (2017a). A fast approximate algorithm for mapping long reads to large reference databases. bioRxiv, page 103812.
- Jain, M., Koren, S., Quick, J., Rand, A. C., Sasani, T. A., Tyson, J. R., Beggs, A. D., Dilthey, A. T., Fiddes, I. T., Malla, S., Marriott, H., Miga, K. H., Nieto, T., O'Grady, J., Olsen, H. E., Pedersen, B. S., Rhie, A., Richardson, H., Quinlan, A., Snutch, T. P., Tee, L., Paten, B., Phillippy, A. M., Simpson, J. T., Loman, N. J., and Loose, M. (2017b). Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv.
- Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., and Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res, 27(5), 722-736.
- Lee, H., Gurtowski, J., Yoo, S., Marcus, S., McCombie, R. W., and Schatz, M. (2014). Error correction and assembly complexity of single molecule sequencing reads. BioRxiv, page 6395.
- Li, H. (2011). A statistical framework for snp calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
- Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103-2110.
- Li, H. (2017). Minimap2: fast pairwise alignment for long nucleotide sequences. arXiv.
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009). The sequence alignment/map format and samtools. Bioinformatics, 25(16), 2078-2079.
- Li, Y., Wang, S., Umarov, R., Xie, B., Fan, M., Li, L., and Gao, X. (2017). Deepre: sequence-based enzyme ec number prediction by deep learning. Bioinformatics.
- Lu, H., Giordano, F., and Ning, Z. (2016). Oxford nanopore minion sequencing and genome assembly. Genomics, proteomics & bioinformatics, 14(5), 265-279.
- MacLean, D., Jones, J. D. G., and Studholme, D. J. (2009). Application of ‘next-generation’ sequencing technologies to microbial genetics. Nature Reviews Microbiology, 7(4), 287-296.
- Metzker, M. L. (2010). Sequencing technologies—the next generation. Nature reviews. Genetics, 11(1), 31.
- Salvador, S. and Chan, P. (2007). Toward accurate dynamic time warping in linear time and space. Intell. Data Anal., 11(5), 561-580.
- Simpson, J. T., Workman, R. E., Zuzarte, P., David, M., Dursi, L., and Timp, W. (2017). Detecting dna cytosine methylation using nanopore sequencing. nature methods, 14(4), 407-410.
- Sović, I., Šikić, M., Wilm, A., Fenlon, S. N., Chen, S., and Nagarajan, N. (2016). Fast and sensitive mapping of nanopore sequencing reads with graphmap. Nature communications, 7, 11307.
- Stewart, J. B. and Chinnery, P. F. (2015). The dynamics of mitochondrial dna heteroplasmy: implications for human health and disease. Nature Reviews Genetics, 16(9), 530-542.
- Stoiber, M. and Brown, J. (2017). BasecRAWller: Streaming nanopore basecalling directly from raw signal. bioRxiv, page 133058.
- Swain, M. J. and Ballard, D. H. (1991). Color indexing. Int. J. Comput. Vision, 7(1), 11-32.
- Trigeorgis, G., Nicolaou, M. A., Zafeiriou, S., and Schuller, B. W. (2016). Deep canonical time warping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5110-5118.
- Vaser, R., Sovic, I., Nagarajan, N., and Sikic, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research.
- Yang, C., Chu, J., Warren, R. L., and Birol, I. (2017). Nanosim: nanopore sequence read simulator based on statistical characterization. GigaScience, 6(4), 1-6.
- Zeng, F., Jiang, R., and Chen, T. (2013). Pyrohmmvar: a sensitive and accurate method to call short indels and snps for ion torrent and 454 data. Bioinformatics, 29(22), 2859-2868.
Claims (18)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/769,127 US11078531B2 (en) | 2017-12-13 | 2018-10-30 | Deepsimulator method and system for mimicking nanopore sequencing |
Applications Claiming Priority (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201762598086P | 2017-12-13 | 2017-12-13 | |
| US201762599908P | 2017-12-18 | 2017-12-18 | |
| US201862702161P | 2018-07-23 | 2018-07-23 | |
| US16/769,127 US11078531B2 (en) | 2017-12-13 | 2018-10-30 | Deepsimulator method and system for mimicking nanopore sequencing |
| PCT/IB2018/058502 WO2019116119A1 (en) | 2017-12-13 | 2018-10-30 | Deepsimulator method and system for mimicking nanopore sequencing |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2018/058502 A-371-Of-International WO2019116119A1 (en) | 2017-12-13 | 2018-10-30 | Deepsimulator method and system for mimicking nanopore sequencing |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/355,823 Continuation US11851704B2 (en) | 2017-12-13 | 2021-06-23 | Deepsimulator method and system for mimicking nanopore sequencing |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20200370110A1 (en) | 2020-11-26 |
| US11078531B2 (en) | 2021-08-03 |
Family
ID=64477218
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/769,127 Active US11078531B2 (en) | 2017-12-13 | 2018-10-30 | Deepsimulator method and system for mimicking nanopore sequencing |
| US17/355,823 Active 2039-05-03 US11851704B2 (en) | 2017-12-13 | 2021-06-23 | Deepsimulator method and system for mimicking nanopore sequencing |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/355,823 Active 2039-05-03 US11851704B2 (en) | 2017-12-13 | 2021-06-23 | Deepsimulator method and system for mimicking nanopore sequencing |
Country Status (3)
| Country | Link |
|---|---|
| US (2) | US11078531B2 (en) |
| EP (1) | EP3724884A1 (en) |
| WO (1) | WO2019116119A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210317523A1 (en) * | 2017-12-13 | 2021-10-14 | King Abdullah University Of Science And Technology | Deepsimulator method and system for mimicking nanopore sequencing |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112481413B (en) * | 2021-01-13 | 2022-02-15 | 南京集思慧远生物科技有限公司 | Plant mitochondrial genome assembly method based on second-generation and third-generation sequencing technologies |
| CN113569055B (en) * | 2021-07-26 | 2023-09-22 | 东北大学 | Method for constructing open pit mine knowledge graph based on genetic algorithm optimization neural network |
| WO2024124521A1 (en) * | 2022-12-16 | 2024-06-20 | 深圳华大生命科学研究院 | Method and device for classifying nanopore sequencing time series electrical signal |
| CN118675621B (en) * | 2024-06-07 | 2025-09-30 | 中国科学院杭州医学研究所 | A method for nucleic acid sequence generation based on reinforcement learning and context-free grammar |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2013041878A1 (en) | 2011-09-23 | 2013-03-28 | Oxford Nanopore Technologies Limited | Analysis of a polymer comprising polymer units |
| WO2016181369A1 (en) | 2015-05-14 | 2016-11-17 | Uti Limited Partnership | Method for determining nucleotide sequence |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3724884A1 (en) * | 2017-12-13 | 2020-10-21 | King Abdullah University Of Science And Technology | Deepsimulator method and system for mimicking nanopore sequencing |
2018
- 2018-10-30 EP EP18808488.3A (EP3724884A1), Withdrawn
- 2018-10-30 WO PCT/IB2018/058502 (WO2019116119A1), Ceased
- 2018-10-30 US US16/769,127 (US11078531B2), Active
2021
- 2021-06-23 US US17/355,823 (US11851704B2), Active
Non-Patent Citations (47)
| Title |
|---|
| "Disease Watch," In the News, Nature Reviews, Microbiology, Feb. 2009, vol. 7, pp. 96-97. |
| Abadi, M., "TensorFlow: Learning Functions at Scale," Sep. 18-24, 2016, Acm Sigplan Notices, vol. 51, No. 9, 1 page. |
| Abnizova, I., et al., "Analysis of Context-Dependent Errors for Illumina Sequencing," Journal of Bioinformatics and Computational Biology, Apr. 1, 2012, vol. 10, No. 2 (20 pages). |
| Akaike, H., "Canonical Correlation Analysis of Time Series and the Use of an Information Criterion," Mathematics in Science and Engineering, 1976, vol. 126, pp. 27-96. |
| Alipanahi, B., et al., "Predicting the Sequence Specificities of DNA- and RNA-Binding Proteins by Deep Learning," Nature Biotechnology, Aug. 2015, vol. 33, No. 8, pp. 831-838. |
| Altschul, S.F., et al., "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, Sep. 1, 1997, vol. 25, No. 17, pp. 3389-3402. |
| Baker, E.A.G., "Silico: A Simulator of Long Read Sequencing in PacBio and Oxford Nanopore," bioRxiv, Sep. 22, 2016, pp. 76901-76903. |
| Boza, V., et al., "Deepnano: Deep Recurrent Neural Networks for Base Calling in Minion Nanopore Reads," PLOS One, Jun. 5, 2017, vol. 12, No. 6, 13 pages. |
| Byrne, A., et al., "Nanopore Long-Read RNAseq Reveals Widespread Transcriptional Variation Among the Surface Receptors of Individual B Cells," Nature Communications, Jul. 19, 2017, 11 pages. |
| Dai, H., et al., "Sequence2Vec: A Novel Embedding Approach for Modeling Transcription Factor Binding Affinity Landscape," Bioinformatics, Jul. 26, 2017, vol. 33, No. 2, pp. 3575-3583. |
| David, M., et al., "Nanocall: An Open Source Basecaller for Oxford Nanopore Sequencing Data," Bioinformatics, Sep. 10, 2016, vol. 33, No. 1, pp. 49-55. |
| Deamer, D., et al., "Three Decades of Nanopore Sequencing," Nature Biotechnology, May 2016, vol. 34, No. 5, pp. 518-525. |
| Delcher, A. L., et al., "Alignment of Whole Genomes," Nucleic Acids Research, Apr. 14, 1999, vol. 27, No. 11, pp. 2369-2376. |
| Escalona, M., et al., "A Comparison of Tools for the Simulation of Genomic Next-Generation Sequencing Data," Nature Reviews, Genetics, Jun. 20, 2016, vol. 17, No. 8, pp. 459-469. |
| Ester, M., et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, Aug. 2, 1996, pp. 226-231, AAAI Press. |
| Graves, A., "Generating Sequences with Recurrent Neural Networks," Aug. 4, 2013, arXiv preprint arXiv:1308.0850. |
| Graves, A., et al., "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures," Neural Networks, Jul.-Aug. 2005, vol. 18, No. 5, pp. 602-610. |
| International Search Report in corresponding/related International Application No. PCT/IB2018/058502, dated Mar. 18, 2019. |
| Ioffe, S., et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," Feb. 11, 2015, arXiv:1502.03167. |
| Jain, C., et al., "A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases," Research in Computational Molecular Biology, May 3-7, 2017, pp. 103812. |
| Jain, M. et al., "Nanopore Sequencing and Assembly of a Human Genome with Ultra-Long Reads," Nature Biotechnology, Apr. 2018, vol. 36, No. 4, pp. 338-345; additional Online Methods. |
| Kingma, D., et al., "Adam: A Method for Stochastic Optimization," Published as conference paper at ICLR 2015, Dec. 22, 2014, arXiv preprint arXiv:1412.6980, 15 pages. |
| Koren, S., et al., "Canu: Scalable and Accurate Long-Read Assembly Via Adaptive k-mer Weighting and Repeat Separation," Genome Res, Mar. 15, 2017, vol. 27, No. 5, pp. 722-736. |
| Lee, H., et al., "Error Correction and Assembly Complexity of Single Molecule Sequencing Reads," BioRxiv, Jun. 18, 2014, 17 pages. |
| Li, H., "A Statistical Framework for SNP Calling, Mutation Discovery, Association Mapping and Population Genetical Parameter Estimation from Sequencing Data," Bioinformatics, Sep. 8, 2011, vol. 27, No. 21, pp. 2987-2993. |
| Li, H., "Minimap and Miniasm: Fast Mapping and de novo Assembly for Noisy Long Sequences," Bioinformatics, Mar. 19, 2016, vol. 32, No. 14, pp. 2103-2110. |
| Li, H., "Minimap2: Fast Pairwise Alignment for Long Nucleotide Sequences," Bioinformatics, Sep. 15, 2018, vol. 34, No. 18, pp. 3094-3100. |
| Li, H., et al., "The Sequence Alignment/Map Format and SAMtools," Bioinformatics, Jun. 8, 2009, vol. 25, No. 16, pp. 2078-2079. |
| Li, Y. et al., "DEEPre: Sequence-Based Enzyme EC Number Prediction by Deep Learning," Bioinformatics, Oct. 23, 2017, vol. 34, No. 5, pp. 760-769. |
| Li, Y. et al., "DeepSimulator: A Deep Simulator for Nanopore Sequencing," Bioinformatics, Apr. 6, 2018, vol. 34, No. 17, pp. 2899-2908. |
| Loose, M., et al., "Real-Time Selective Sequencing using Nanopore Technology," Nature Methods, Jul. 25, 2016, vol. 13, No. 9, pp. 751-754. |
| Lu, H., et al., "Oxford Nanopore MinION Sequencing and Genome Assembly," Genomics, Proteomics & Bioinformatics, Sep. 17, 2016, vol. 14, No. 5, pp. 265-279. |
| MacLean, D., et al., "Application of ‘Next-Generation’ Sequencing Technologies to Microbial Genetics," Nature Reviews Microbiology, Feb. 23, 2009, vol. 7, No. 4, pp. 287-296. |
| Metzker, M. L., "Sequencing Technologies—The Next Generation," Nature Reviews, Genetics, Jan. 2010, vol. 11, No. 1, pp. 31-46. |
| Ratkovic, M., Deep Learning Model for Base Calling of MinION Nanopore Reads, Master Thesis No. 1417, University of Zagreb, Faculty of Electrical Engineering and Computing, Jun. 1, 2017, pp. 1-48, Zagreb, Croatia. |
| Salvador, et al., "Toward Accurate Dynamic Time Warping in Linear Time and Space," Intelligent Data Analysis, Oct. 1, 2007, vol. 11, No. 5, pp. 561-580. |
| Simpson, J.T., et al., "Detecting DNA Cytosine Methylation using Nanopore Sequencing," Nature Methods, Apr. 2017, vol. 14, No. 4, pp. 407-410. |
| Sovic, I., et al., "Fast and Sensitive Mapping of Nanopore Sequencing Reads with Graphmap," Nature Communications, Apr. 15, 2016, vol. 7, 11307 (11 pages). |
| Stewart, J.B., et al., "The Dynamics of Mitochondrial DNA Heteroplasmy: Implications for Human Health and Disease," Nature Reviews Genetics, Sep. 2015, vol. 16, No. 9, pp. 530-542. |
| Stoiber, M., et al., "BasecRAWller: Streaming Nanopore Basecalling Directly from Raw Signal," bioRxiv, May 1, 2017, 133058 (15 pages). |
| Swain, M.J., et al., "Color indexing," International Journal of Computer Vision, Jun. 6, 1991, vol. 7, No. 1, pp. 11-32. |
| Timp, W., et al., "DNA Base-Calling from a Nanopore Using a Viterbi Algorithm," Biophysical Journal, May 1, 2012, vol. 102, pp. L37-L39. |
| Trigeorgis, G., et al., "Deep Canonical Time Warping," In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5110-5118. |
| Vaser, R., et al., "Fast and Accurate de novo Genome Assembly from Long Uncorrected Reads," Genome Research, Jan. 18, 2017, 10 pages. |
| Written Opinion of the International Searching Authority in corresponding/related International Application No. PCT/IB2018/058502, dated Mar. 18, 2019. |
| Yang, C., et al., "NanoSim: Nanopore Sequence Read Simulator Based on Statistical Characterization." GigaScience, Feb. 21, 2017, vol. 6, No. 4, pp. 1-6. |
| Zeng F., et al., "PyroHMMvar: A Sensitive and Accurate Method to Call Short Indels and SNPs for Ion Torrent and 454 Data," Bioinformatics, Aug. 31, 2013, vol. 29, No. 22, pp. 2859-2868. |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20210317523A1 (en) * | 2017-12-13 | 2021-10-14 | King Abdullah University Of Science And Technology | Deepsimulator method and system for mimicking nanopore sequencing |
| US11851704B2 (en) * | 2017-12-13 | 2023-12-26 | King Abdullah University Of Science And Technology | Deepsimulator method and system for mimicking nanopore sequencing |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2019116119A1 (en) | 2019-06-20 |
| US11851704B2 (en) | 2023-12-26 |
| US20210317523A1 (en) | 2021-10-14 |
| EP3724884A1 (en) | 2020-10-21 |
| US20200370110A1 (en) | 2020-11-26 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Li et al. | DeepSimulator: a deep simulator for Nanopore sequencing | |
| US11851704B2 (en) | Deepsimulator method and system for mimicking nanopore sequencing | |
| Ono et al. | PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores | |
| Meisner et al. | Inferring population structure and admixture proportions in low-depth NGS data | |
| Sahraeian et al. | SMETANA: accurate and scalable algorithm for probabilistic alignment of large-scale biological networks | |
| Wee et al. | The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing | |
| Gymrek et al. | Interpreting short tandem repeat variations in humans using mutational constraint | |
| Magi et al. | Characterization of MinION nanopore data for resequencing analyses | |
| Tian et al. | Simulated maximum likelihood method for estimating kinetic rates in gene expression | |
| Ali et al. | Alignment-free protein interaction network comparison | |
| Tran et al. | A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data | |
| Loewe | A framework for evolutionary systems biology | |
| Paulsen et al. | Manifold based optimization for single-cell 3D genome reconstruction | |
| Husmeier et al. | Detecting recombination in 4-taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo | |
| Wajid et al. | Do it yourself guide to genome assembly | |
| Botero et al. | Network analyses in plant pathogens | |
| Azad et al. | Use of artificial genomes in assessing methods for atypical gene detection | |
| He et al. | Informative SNP selection methods based on SNP prediction | |
| Pelizzola et al. | Multiple haplotype reconstruction from allele frequency data | |
| Lee et al. | Survival prediction and variable selection with simultaneous shrinkage and grouping priors | |
| Bai et al. | KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate | |
| Mahony et al. | Self-organizing neural networks to support the discovery of DNA-binding motifs | |
| Bamezai et al. | Protein engineering in the computational age: An open source framework for exploring mutational landscapes in silico | |
| Krishnan et al. | Rhometa: Population recombination rate estimation from metagenomic read datasets | |
| Ye et al. | A multi-Poisson dynamic mixture model to cluster developmental patterns of gene expression by RNA-seq |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | AS | Assignment | Owner name: KING ABDULLAH UNIVERSITY OF SCIENCE AND TECHNOLOGY, SAUDI ARABIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GAO, XIN;LI, YU;WANG, SHENG;AND OTHERS;REEL/FRAME:053269/0034; Effective date: 20200603 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY; Year of fee payment: 4 |