WO2025223677A1

WO2025223677A1 - Method for analyzing molecules of interest, method for automatically selecting electrical signals related to a set of known barcode molecules, signal processing unit and computer program for executing the method

Info

Publication number: WO2025223677A1
Application number: PCT/EP2024/061629
Authority: WO
Inventors: Wiep VAN DER TOORN; Wang LIUWEI; Redmond SMYTH; Patrick BOHN; Max VON KLEIST
Original assignee: Helmholtz Zentrum fuer Infektionsforschung HZI GmbH; Robert Koch Institute; Freie Universitaet Berlin
Current assignee: Helmholtz Zentrum fuer Infektionsforschung HZI GmbH; Robert Koch Institute; Freie Universitaet Berlin
Priority date: 2024-04-26
Filing date: 2024-04-26
Publication date: 2025-10-30
Anticipated expiration: 2026-10-26
Also published as: WO2025224313A1

Abstract

A method for analyzing molecules of interest by use of an adaptor attached to the molecule of interest is described, wherein said adaptor comprises a barcode molecule, wherein the barcode molecule is one instance of a set of defined sequences of molecules. The method comprises the steps of: - temporarily aligning an electrical signal related to an unknown barcode molecule derived by electrically measuring at least of the part of the molecule of interest comprising the barcode molecule and a set of respective signals for known barcode molecules stored in a data memory, and - predicting the instance of barcode molecule from the similarity between the temporarily aligned electrical signal and the respective signals for known barcode molecules, wherein the electrical signal is temporarily aligned and the instance of barcode molecule is predicted automatically in a data processing unit. The method is also directed to a process of barcode design based on the similarity of temporarily aligned electrical signals related to the candidate barcode molecules and a set of designed target signals.

Description

Method for analyzing molecules of interest, method for automatically selecting electrical signals related to a set of known barcode molecules, signal processing unit and computer program for executing the method

Description:

The present invention related to a method for analyzing molecules of interest by use of an adaptor attached to the molecule of interest, said adaptor comprises a barcode molecule, wherein the barcode molecule is one instance of a set of defined sequences of molecules, for example nucleic acid sequences.

Further, the present invention relates to a method for automatically selecting electrical signals related to a set of known barcode molecules and storing the selected electrical signals for the set for use in the method of analyzing molecules of interest, wherein the analyzed molecules of interest comprising an adaptor attached to the molecule of interest, said adaptor comprising a barcode molecule.

The present invention further relates to a signal processing unit and a computer program arranged for executing the above-mentioned methods. Nanopore direct RNA sequencing (dRNA-seq) offers unique insights into transcriptomics and epitranscriptomics. However, its application in a wide range of interests is precluded by the lack of high accuracy, high yield sample multiplexing. Current methods are limited to only a limited number of barcodes with significant misclassification rates and provide no adaptive sampling solutions based on barcodes. dRNA-seq provides a powerful tool for transcriptomics as disclosed in reference [1], Unlike most next-generation sequencing methods, dRNA-seq does not require the conversion of RNA to cDNA and thus preserves native RNA base modifications, offering a unique opportunity to study the dynamic epitranscriptome as disclosed in reference [2],

In dRNA-seq, individual RNA molecules are threaded through a protein nanopore that is embedded in an electrical ly-resistant membrane across which a constant electric voltage is applied. The translocation of the RNA causes systematic disruptions to the ionic current flow, characteristic of the stretch of nucleotides passing through the pore at a given time, a 5-mer in the case of RNA as disclosed in reference [3],

The controlled translocation of the RNA is facilitated by a motor protein that guides the nucleotides through the pore in a sequential, step-by-step process. Each translocation is reflected by a signal ‘event’ — the repeated measurement of the characteristic signal of the nucleotides that reside in the pore. The number of repeated measurements per event is commonly termed the ‘dwell time’. Intrinsic fluctuation of dwell time is a characteristic feature of nanopore sequencing. The resulting temporal variation can pose challenges in accurately converting the signal into a nucleotide sequence (‘basecalling’), as well as in identifying barcode sequences based on the raw electric signal.

Sample multiplexing, or multiplex sequencing, is a common sequencing practice that uses barcodes to allow multiple samples to be pooled together and sequenced simultaneously during a single run. It reduces costs and increases efficiency and productivity, while minimizing experimental variability. Moreover, precise barcoding and multiplexing lay the foundation for intricate experimental designs, such as those seen in single-cell sequencing as disclosed in reference [4],

While single-cell transcriptomics using nanopore sequencing has been demonstrated by depending on a cDNA intermediate as disclosed in reference [5], achieving single-cell epitranscriptomics would necessitate accurate RNA barcoding within the dRNA-seq framework directly as disclosed in reference [6], However, while (de)multiplexing protocols for DNA-based nanopore sequencing have been developed, no protocols have been released for multiplexing dRNA-seq samples. Conceptually, dRNA-seq barcoding can be achieved by use of RNA barcodes, ligated to the 3‘ end of the transcript. Because RNA barcodes can be directly base-called by the RNA basecaller, read classification can be achieved by aligning the barcode basecalls to a reference. While such RNA barcodes may be useful in specific contexts, their use requires additional ligation and purification steps in the dRNA-seq library preparation process. This can lead to decreased efficiency and ligation biases and increases the complexity and duration of the experimental protocol. Moreover, the required RNA oligos represent a significant cost. In addition, the placement of such short RNA barcode sequences at the 3’ end of the poly-A tail, bordering the DNA adapter, may compound basecalling challenges, in addition to the already relatively high basecalling error rate of ~ 10% as disclosed in reference [7],

Alternatively, dRNA-seq barcoding can be achieved by incorporating a barcode into the dRNA-seq sequencing adapter. The sequencing adapter, known as the Reverse Transcription Adapter (RTA), contains an 18-nucleotide stretch that can be customized to serve as a barcode for read multiplexing. The ligation of the RTA to the 3’ end of the RNA is integral to dRNA-seq library preparation and such barcoding would thus not complicate the experimental protocol or increase its costs. However, since the RTA adapters are DNA- based, such DNA barcodes cannot be directly base-called with the RNA basecaller used in dRNA-seq. It is therefore necessary to process and classify the RTA barcodes in the ‘raw signal’ space.

Such a strategy employing customized RTA adapters was first realized by Smith et al. [8] in a method named DeePlexiCon (DPC). DPC was trained to demultiplex up to four barcoded samples. Its main algorithm transforms the adapter signal to 2-D images using the Gramian Angular Summation Field (GASF) transformation and classifies the images into one of the four barcodes using a deep convolutional neural network. The transformation and use of a deep learning model to classify the signal images requires extensive computational resources, such as GPUs, for performant run times. Additionally, intrinsic variation in the dwell time can produce inconsistent signal-transformed images when using DeePlexiCon, further complicating the classification task.

In contrast, Dynamic Time Warping (DTW) is a robust and efficient algorithm to compare sequences with temporal variation as disclosed in references [9, 10, 11], Object of the present invention is to provides an improved method for analysing molecules of interest, method for automatically selecting electrical signals related to a set of known barcode molecules, signal processing unit and computer program for executing the method.

The object is achieved by the method for analysing molecules of interest comprising the steps of claim 1 , the method for automatically selecting electrical signals related to a set of known barcode molecules comprising the steps of claim 9, the signal processing unit according to claim 11 and the computer program according to claim 12.

According to the invention, the method for analysing molecules of interest by use of an adaptor attached to the molecule of interest, said adaptor comprises a barcode molecule, comprises the steps of:

- temporarily aligning an electrical signal related to an unknown barcode molecule derived by electrically measuring at least of the part of the molecule of interest comprising the barcode molecule and a set of respective signals for known barcode molecules stored in a data memory, and

- predicting the instance of barcode molecule from the similarity between the temporarily aligned electrical signal and the respective signals for known barcode molecules, wherein the electrical signal is temporarily aligned and the instance of barcode molecule is predicted automatically in a data processing unit.

The method is executed automatically in a data processing unit, which is arranged for determining the similarity between the temporarily aligned electrical signal and the respective signals for known barcode molecules, for example by determining the dynamic time warping distance, and predicting the instance of barcode molecule by the above- mentioned processing steps.

The method can be implemented e.g. for dRNA-seq barcoding and demultiplexing combining

(i) design of large barcode sets with clear inter-barcode distinction in signal space,

(ii) tailored signal preprocessing, and

(iii) Support Vector Machine (SVM) classification of signal data with a Dynamic Time Warping Distance (DTWD) kernel.

Preferably, the method does not depend on basecalling nor requires specific Graphics Processor Unit (GPU) compute resources. A large set of barcode molecules, e.g. RNAs containing sequence tags, can be generated to train a computer implemented model by use of artificial intelligence means. With the present method, it can be exemplarily shown that, while requiring only a limited number of e.g. 400 training reads per barcode, the method according to the present invention using preferably the dynamic time warping significantly outperforms existing data-heavy solutions and boasts a significantly faster run time (15 ms per read).

The improvements in speed and accuracy make the method suitable for live (during- sequencing) demultiplexing and barcode-based adaptive sampling, e.g. of dRNA-seq, proteins or the like, in real-time.

Adaptive sampling by executing different barcode balancing schemes during sequencing of multiplexed samples results in enrichment of underrepresented samples and adds to the benefits of multiplexed sequencing e.g. of dRNA-Seq and the like.

The method enables vastly higher-throughput, cost-effective epitranscriptomic research and facilitate more comprehensive and intricate experimental designs.

Temporarily aligning the electrical signal can be performed with the step of partitioning the electrical signal derived by electrically measuring at least of the part of a molecule into segments and determining the similarity between electrical signals for the segments. This improves the speed of the method and reliability of the measurement result.

Temporarily aligning the electrical signal and predicting the instance of barcode molecule can be performed by determining the dynamic time warping distances between the electrical signal and the respective signals for known barcode molecules, wherein the dynamic time warping distance is a measure of the similarity between the temporarily aligned electrical signal related to the measured unknown barcode molecule and an related electrical signal for a known barcode molecule of the set of defined sequences of molecules.

The determination of the value for the dynamic time warping distance can be performed by temporal aligning the electrical signals or a substitute of signals derived from the electrical signal obtained by electrically measuring the barcode molecule against a known electrical signal derived by electrical measuring at least of a respective part of a known barcode molecule of a training set and determining the distance between the pair of temporally aligned electrical signals. Preferably, the method may comprise the steps of:

- recording the electrical signal generated by the barcode sequence for an instance of the pool of molecules of interest (i.e. recording the electrical signal derived by electrically measuring at least of the part of the barcode molecule in the adapter attached to the molecule of interest),

- preprocessing the recorded electrical signal by reducing the number of signals per instance by segmentation of the recorded electrical signal such that for each segment of the electrical signal the variance contained within the segment is decreased, followed by substitution of signals contained in a segment with a summarizing value, including but not limited to the arithmetic mean or median, and

- predicting the instance of barcode molecule by determining the dynamic time warping distance between the preprocessed electric signal of the unknown barcode and a collection of respective electric signals stored in a data memory related to known barcodes, i.e. electrical signals stemming for which the specific signal-sequence assignment is known, wherein the dynamic time warping distance is a measure of the similarity between the electrical signal related to the unknown barcode molecule and an electrical signal related to a known barcode molecule, wherein the dynamic time warping distance is determined and the instance of barcode molecule is predicted automatically in a data processing unit.

Preferably, the measuring of the molecule of interest is stopped based on the predicted instance of barcode molecule and/or an electrical signal derived by electrically measuring at least of the part of the molecule of interest related to a sequence of molecules. For example, a sequencing process of the molecule of interest can be stopped in case that the predicted instance of barcode molecule does not match with the targeted instance (i.e. type) of barcode molecule for the adaptor attached to said molecule of interest. The electric signal for the barcode molecule in the adaptor attached to the molecule of interest is provided before measuring, i.e. the sequencing process, the molecule of interest itself. The prediction of the instance of barcode molecule can be used as quality signal to reject samples before further sequencing the probe. This allows to speed up sequencing of multiple samples, in particular for sample multiplexing or multiplex sequencing by use barcodes to allow processing multiple samples pooled together and sequenced simultaneously during a single run. The sequencing process could also be stopped based on analyzing the sequence signal derived in the sequencing process for another part of the adaptor or the molecule of interest, e.g. PolyA sequence, frequency sequence and the like. This can be preferably combined with control by the predicted instance of barcode molecule. The step of determining the value for the dynamic time warping distance can be processed by temporal aligning pairs of electrical signals, namely the electrical signal extracted for a barcode molecule in a sequencing process of at least the part of the molecule comprising the adaptor including the barcode molecule and a known electrical signal derived in a sequencing processing of at least of a respective part of a known barcode molecule of a training set, and determining the distance between the pair of temporally aligned electrical signals.

The set of known barcode molecules used for predicting the instance of barcode molecule are selected as an optimized diverse subset of a group of barcode molecules having a similarity of the related electrical signal derived in a sequencing process of the barcode molecule and an targeted signal above a threshold value for the similarity and having a dissimilarity between the electrical signal of the unrelated barcode molecules of the set of barcode molecules above a threshold value for the dissimilarity.

The method can be performed in processing hybrid molecules, consisting of DNA barcode molecules ligated (i.e. attached) to RNA molecules.

Preferably, the method is performed in a sequencing process of RNA, DNA, proteins or polymers, in particular polypeptides.

Preferably, the electrical signal is measured in a sequencing process of at least a part of the molecule of interest comprising the barcode molecule to detect the presence of the molecule of interest in a probe and/or to measure the quantity of the molecule of interest in the probe.

In a preferred embodiment, the measuring of the electrical signal in the sequencing process of the molecule of interest comprises measuring electric current or voltage fluctuations in a sensor structure comprising a probe with the molecule of interest and an adaptor attached to the molecule of interest. Usually, the adaptor is attached to the molecule of interest by covalent bounding. Electrically measuring at least of the part of the molecule of interest can be performed, for example, in a nanopore structure of a sequencing unit.

In order to provide a set of best suitable barcode modules and related electrical signals, the method comprises the steps of automatically selecting electrical signals related to a set of known barcode molecules and storing the selected electrical signals for the set for use in the method of analysing molecules of interest. The analyzed molecules of interest comprises an adaptor attached to the molecule of interest, said adaptor comprising a barcode molecule. The selection step comprises processing the electrical signals in a data processing unit for selecting an optimized, diverse subset of a group of candidate barcode molecules with the steps of:

- determining the similarity of temporarily aligned electrical signals related to the candidate barcode molecules and a set of designed target signals,

- selecting a set of high-score candidate barcode molecules having a similarity of the related electrical signal derived from measuring of the candidate barcode molecule and a target signal above a threshold value for the similarity, and

- selecting the optimized, diverse subset of barcode molecules from those high-score candidate barcode molecules having a dissimilarity between the electrical signals of the unrelated barcode molecules of the selected set of high-score candidate barcode molecules above a threshold value for the dissimilarity.

This reduces the number of known barcode molecules considered in determining the similarity, for example by determining the dynamic time warping distances, between an extracted electrical signal for a barcode molecule when measuring a molecule of interest e.g. in a sequencing process. Thus, the required processing time and accuracy of the detection result can be improved.

Preferably, said method comprises a step of determining the value of the dynamic time warping distances between the electrical signal obtained in a sequencing process of a barcode molecule and a set of electrical signals for known barcode molecules and training a classifier with the determined dynamic time warping distances to determine the instance of barcode molecule with the electrical signal derived by the sequencing process of the barcode molecule.

The method may comprise the step of determining the part of the electrical signal related to the adaptor attached to the molecule of interest in the electrical signal extracted in a sequencing process of at least of a part of the molecule by use of the signal features of the electrical signal.

Preferably, the method can be used to determine the part of the electrical signal related to the adaptor, i.e. to identify the adaptor and its position in the electrical signal obtained by sequencing at least of a part of the molecule. Therefore, the method may comprise the step of determining the part of the electrical signal related to the adaptor attached to the molecule in the electrical signal extracted in a sequencing process of at least of a part of the molecule by use of the predicted instance of barcode from the determined value for the dynamic time warping distance for the part of the electrical signal related to the adaptor.

The method may comprise a step of preprocessing the electrical signal. This can include the steps of partitioning the electrical signal extracted in a sequencing process of at least of a part of the molecule into segments and determining the dynamic time warping distances for the segments. The preprocessing of the electric signal measured for an unknown barcode molecule can be performed by segmentation of the electrical signal such that for each segment of the electrical signal the variance contained within the segment is decreased, followed by substitution of signals contained in a segment with a summarizing value, including but not limited to the arithmetic mean or median.

The method can be used for different applications, for example by using the predicted instance of barcode molecule for multiplexed sequencing of molecules, for quantifying the amount of molecule attached to a specific barcode, for use in diagnostics and/or for control of sequencing devices for multiplexed sequencing of molecules.

The method can be performed automatically in a signal processing unit comprising a data memory and an input for receiving electrical signals derived from measuring of molecules, e.g. in a sequencing process, wherein the signal processing unit is arranged for automatically processing the above-mentioned method.

The signal processing part can be integral part of a sequencer device.

The method can be implemented in a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the above- mentioned steps of the method.

The invention if exemplarily described by an embodiment with the enclosed figures. It shows:

Fig. 1a-f - method for selecting an optimized set of barcodes;

Fig. 2a-f - selection of an optimized set of barcodes a resulting improvement of performance;

Figure 3 - diagram of the number of reads for each WDX-demultiplexed sample from a single Flongle sequencing run that were aligned uniquely to the Vero genome, SARS-CoV-2 genome and TurboGFP respectively;

Figure 4 - Abbreviated workflow of experimental protocol to generate training data; Figure 5 - diagrams showing the intrinsic variation in dwell time posing a source of noise;

Figure 6 - exemplary training set visualized per barcode class;

Figure 7 - diagrams of reads for generated experimental data sets using different RTA barcode designs;

Figure 8 - examples of mislabeled signal instances due to missed read splits;

Figure 9 - examples of noise class signal instances;

Figure 10 - diagrams showing the performance on validation set during model selection; Figure 11 - diagram of the run time per read as a function of the number of barcodes;

Figure 12 - diagrams showing precision-recall curves for prior art DeePlexiCon (left) and the WDX-DPC (right) model according to the present invention;

Figure 13 - kernel matrix used to train the WDX4 model according to the present invention;

Figure 14 - diagram showing the precision-recall curves of individual WDX-RTA barcodes used in WDX4 model according to the present invention;

Figure 15 - kernel matrix used to train the WDX12 model according to the present invention;

Figure 16 - diagram showing the precision-recall curves of individual WDX-RTA barcodes used in WDX12 model according to the present invention;

Figure 17 - Temporal Perturbation Analysis between DTWD transformation according to the present invention and GASF transformation;

Figure 18 - demultiplexing statistics of SARS-CoV-2 sequencing run;

Figure 19 - empirical confidence score Cumulative Distribution Function (CDF) for each barcode.

The method is described in the implementation of a scalable method for RTA-based barcoding in dRNA-seq. The method is based on dynamic time warping (DTW) and improves accuracy in barcode demultiplexing by proposing sets of optimized RTA barcodes with maximized inter-barcode distances in the signal space. An improved barcode signal extraction algorithm, improved signal preprocessing, and lastly, a barcode classification algorithm based on Support Vector Machine (SVM) with a Dynamic Time Warping Distance (DTWD) kernel is described.

To train and test the barcode demultiplexer, and to independently evaluate the classification method and barcode design, experimental data sets are generated with a set of optimized RTA barcodes and existing, non-optimized RTA barcodes used in the method DeePlexiCon (DPC) as described in reference [8], To do so, barcodes (classification input) were attached to a diverse set of labeled (classification target) RNA. While the method for barcode set design is not limited by the number of barcodes, here, a set of twelve is proposed and tested. The resultant demultiplexer model serves as a barcode classifier tailored for sequencing tasks that use these optimized barcodes. When benchmarked against the prior art method DeePlexiCon in reference [8], the method according to the present invention provided superior results in demultiplexing accuracy, yield (the percentage of classified reads) and run time, while tripling the number of accommodated barcodes.

As an application, the method according to the present invention can be used to demultiplex samples of SARS2-CoV-2 RNA from infected cells, extracted at different times post infection that were sequenced together on a single Flongle flow cell. This application allows to study viral replication kinetics, and illustrates the ability to facilitate the conception and execution of novel, complex dRNA-seq experimental designs.

As a result, optimized barcode molecules having sequences in 5' to 3'-direction had been found which form a selected set of high-score candidate barcode molecules.

Barcode ID Barcode sequence SEQ ID No.

Barcode 1 I I I I I ACTGCCAGTGACT 1

Barcode 2 AGGGGAGAGAGCCCCCCC 2

Barcode 3 CACGTCA I l l i CCACGTC 3

Barcode 4 GGAGGCCAGGCGGACCGA 4

Barcode 5 ACGGACC I l l i GACTTAA 5

Barcode 6 TATTGCATACTGCGCCGC 6

Barcode 7 CCACGGAGGGAGGATTGG 7

Barcode 8 TTACCGGCAGTGACGGAC 8

Barcode 9 CGAGATTGCATCCCCCCC 9

Barcode 10 TACCACCTGCCGGCGGCC 10

Barcode 11 GCCCGCCGGGGGAGAAGC 11

Barcode 12 I I I I I I I I ACCGGCAGTT 12

Table 1 : Optimized WarpDemuX barcodes, sequences in 5‘ to 3‘ direction

Figure 1 presents an exemplarily model and training data design according to the present invention.

In Figure 1 (a), the in vitro generation of training data is schematically shown. To train the demultiplexing classifier, RNA molecules are engineered with two distinct barcodes. A first RNA barcode is positioned at the 3’ end of the transcript, serving as a ground truth label. A second DNA barcode within a custom RTA is introduced during standard dRNA-seq library preparation, which serves as the target for classification.

In Figure 1 (b), a direct RNA sequencing raw signal example is shown. The raw signal is a time series of electric current measurement in Pico-Ampere when sequencing the molecules in a nanopore structure. The electric current signal consists of two main components: the RTA adapter signal (DNA), and the RNA signal including the poly-A tail. The last part of the electric signal related to the adapter contains the electric signal related to the RTA barcode to be classified.

In Figure 1 (c) it is shown that after locating the adapter component, the raw adapter signal is segmented into events, with each segment represented by its average signal. The final segments of the adapter, designated as the barcode ‘fingerprint’, identify the RTA barcode. Dynamic Time Warping Distance (DTWD) is used to quantify the similarity between fingerprints.

In Figure 1 (d) the training of the model and classification of barcode signal fingerprints into barcode classes using the DTWD kernel function is shown. This comprises training of a Support Vector Machine (SVM) classifier with the DTWD kernel function on barcode fingerprints from sequencing reads captures the nuances of the barcode signals and promotes robust, noise-resilient classification.

Figure 1 (e) and (f) the performance of models trained on existing, non-optimized RTA barcodes on withheld test sets derived by the DeePlexiCon method in reference [8] and the method according to the present invention (WarpDemuX DPC) with the SVM model is presented. It can be seen that the value of similarity between the related predicted and actual barcode molecules No. 1 to 4 in table in Figure (f) derived with the method according to the present invention outperforms the prior art DeePlexiCon CNN model with the results for the values of similarity shown in Figure 1 (e). The values presented reflect the confusion matrix, normalized over the ‘Actual’ class (true label), on both test sets combined, evaluated at confidence cutoff 0.5.

To train, evaluate and benchmark the method according to the present invention, dualbarcoded data sets Fig. 1 (a) are generated using different RTA barcodes: Two data sets had been generated with the DeePlexiCon (DPC)-RTA barcodes, as proposed in Smith et al. in reference [8], attached to in-line RNA-barcoded (i) Vibrio cholerae glycine riboswitch and

(ii) tetrahymena ribozyme transcripts (data sets 1 and 2 in Suppl. Table 4) and four data sets with WarpDemuX (WDX)-RTA barcodes (Table 1) attached to in-line RNA-barcoded

(iii) V. chol glycine riboswitch (iv) hc16,

(v) HCV IRES and

(vi) bacterial RNase P type A transcripts (data sets 3 to 6 in Suppl. Table 4).

Post-sequencing, reads are base-called and uniquely mapped to RNA reference sequences + RNA barcode references to infer their ground truth RTA-barcode labels based on their RNA-barcode. Sequencing yield is shown in Suppl. Fig. 7.

The first component of the method is preprocessing of the raw signal of the sequencer. This involves identifying the DNA/RNA boundary in the raw signal to extract the adapter signal and segmenting of the adapter signal into discrete sequencing events. Subsequently, the segments reflecting the barcode (the barcode ‘fingerprint’; Fig. 1 (b) to (c)) are extracted and classified. To classify the barcode fingerprints, support vector machines (SVMs) are trained. For this, a DTWD-based kernel function is used that captures essential spatial and temporal signal information implicitly by quantifying how similar a fingerprint is to the RTA barcode fingerprints of reads with known labels (Fig. 1 (c) - (d)).

A 4-class WarpDemuX DTW-SVM model had been trained on a subset of DPC’s original training data (400 reads per barcode) and the performance of this "WarpDemuX-DPC" model is benchmarked against that of DPC on the DPC-RTA data sets generated according to the present invention.

Similar to DPC, a confidence score is defined to quantify the certainty of a classification by the method according to the present invention (WarpDemuX) (see Methods). By enforcing a confidence cutoff and discarding all predictions with a confidence score below the cutoff as ‘unclassified’, users can adjust the model’s performance in terms of accuracy and yield (1- fraction of ‘unclassified’ reads) to fit their needs. However, the fraction of unclassified reads should not be viewed purely as an indicator of model performance, as it is also influenced by the dataset’s characteristics. Specifically, datasets with a higher amount of noisy reads tend to have a reduced yield, regardless of the model’s confidence levels. Therefore, to distinguish the deliberate trade-off between accuracy and yield from mere noise detection, and to enhance WarpDemuX’s robustness, a ‘noise class’ can be integrated into the training dataset. This class encompasses barcode fingerprints from reads affected by signal irregularities, such as stalled adapter sequences, blocked pore signals, or inaccurately detected DNA/RNA boundaries (Suppl. Fig. 9). Reads identified as noise, denoted by the class label -1 , are subsequently omitted from performance assessment analyses.

Using a confidence threshold of 0.5, which is the standard setting in the prior art method DeePlexiCon described in reference [8], the method according to the present invention WarpDemuX demonstrates superior performance over DeePlexiCon in both demultiplexing accuracy (achieving 0.97 compared to 0.92; Fig. 1 (e) - (f)) and yield, with a lower percentage of RTA barcodes left unclassified (7.7% vs. 11.4%), despite requiring just 1% of the training data per barcode (400 instances vs. 40,000). Furthermore, when aiming for maximum yield (at a confidence threshold of 0), the method according to the present invention WarpDemuX still outperformed DeePlexiCon, recording an accuracy of 0.94 against 0.87 (Suppl. Table 12). No data was classified as noise. These results, representing the weighted average across the two test sets, demonstrate WarpDemuX’s enhanced classification performance. Additional details on performances at different confidence thresholds per data set and further evaluation of the barcode performance are described in more detail in the Supplementary Chapter below. The improved accuracy and yield using WarpDemuX argues that classification based on DTWD effectively captured the nuances of the barcode signals and is more robust to noise emanating from dwell time variation than the employed methods in DeePlexiCon.

In addition, WarpDemux-DPC runs 10-times faster than DeePlexiCon, requiring about 14.4ms vs. 163ms per read on a standard laptop (see Table 2). On the described set-up, WarpDemuX-DPC classifies 4000 reads in less than 1 minute. Notably, with a peak memory usage (maximum resident set size) of less than 600 MB, and hence extensive computational resources, such as GPUs or large compute clusters, are not essential for WarpDemuX.

Table 2: Performance comparison across different models The results presented in Table 2 demonstrates the WarpDemuX’s superiority to DeePlexiCon, even on barcodes that are not explicitly optimized for DTWD-based separability. Optimized WDX barcodes further improve WDX performance, both in terms of predictive performance (accuracy, precision, recall) and percentage of reads classified. Reported values reflect the weighted average over test sets. Precision and recall reflect the average over barcode classes. The confidence cutoff is set to 0.5 for all models. Reported run times are the 10-fold cross-validated average computing time from 4000 reads on 8 CPU cores (11th Gen Intel(R) Core(TM) i7-1165G7 (2.80GHz) with 32GB memory).

Figure 2 shows optimized barcodes improving the WarpDemuX performance according to the method of the present invention. Figure 2 (a) presents the in silico barcode design. Wave-like signal patterns are proposed that enhance distinction between the individual target patterns. First, candidate barcodes are scored based on their similarity to a target pattern. Second, high scoring candidates are selected and an optimized set of barcodes is identified based on their mutual Dynamic Time Warping Distance. Figure 2 (b) shows that the proposed method for barcode set design reduces the computational cost of identifying good barcode sets. Results shown are based on barcodes of length 18 and sets of size 12. Figure 2 (c) and (d) shows the performance of WarpDemuX SVM models on optimized barcodes, using 4 barcodes (WDX4; Figure 2 (c) and 12 barcodes (WDX12; Figure 2 (d)). The use of optimized barcodes improves accuracy and yield, with the 12-barcode model maintaining high performance. Values reported reflect the confusion matrix, normalized over the ‘Actual’ class (true label), on both test sets combined, evaluated at confidence cutoff 0.5. Figure 2 (e) presents the relation between accuracy and percentage of unclassified reads, modulated by the confidence cutoff. WarpDemuX models surpass DeePlexiCon, with the incorporation of optimized barcodes further improving WarpDemuX performance over its performance on non-optimized DeePlexiCon barcodes. WarpDemuX indicates good generalizability, with high accuracy even at low confidence scores. Although there is an initial trade-off between accuracy and yield, the predictive performance of WDX4 and WDX12 respectively converges to approximately 0.99 and 0.98 accuracy around 91% and 83% yield at a confidence threshold of 0.8, with only marginal improvements at the expense of yield for more stringent confidence thresholds. Values reflect the performance per model as the weighted averaged over their respective test sets. Figure 2 (f) shows the performance trends of WarpDemuX with an increasing number of barcodes, evaluated at a confidence cutoff of 0.5. Linear regression analysis of the 10-fold cross-validated performance (weighted averaged over test sets) demonstrates the method’s consistent robustness and suggests scalability for a larger set of optimized barcodes. In Silico Designed barcodes improve WarpDemuX performance

To expand and go beyond the pool of 4 barcodes in DPC, a computational framework is implemented to design optimized barcodes by utilizing the DTWD as a metric to maximize their differentiation in signal space. In brief, a set of wave-like target signal patterns (Fig. 2 a) are constructed that exhibited high inter-target DTWD. Then, it is searched for barcode sequences whose putative signals derived from kmer models (github.com/nanoporetech/kmer_models) developed by Oxford Nanopore Technologies (ONT) would closely match these target patterns (Suppl. Material 1). Then it is searched for 12 candidate barcodes that are expected to maximize all pairwise DTWDs (see Methods for technical details). This design strategy avoids resorting to an exhaustive search across all possible sets of barcodes and thereby significantly reduces the computational costs of barcode set design (Fig. 2 (b)). The set of 12 optimized WarpDemuX RTA barcodes (WDX- RTA) is shown in Table 1.

To assess WarpDemuX’s barcode design strategy, the Support Vector Machine model (i.e. artificial intelligence) is first trained on four of the optimized WDX-RTA barcodes. The performance of this "WDX4" model on the withheld test sets (weighted average), is shown in Fig. 2 (c) and Table 2. Compared to the WarpDemuX’s performance according to the present invention on non-optimized RTA-barcodes (the WarpDemux-DPC model), the use of optimized barcodes improves accuracy, with 0.99 vs. 0.97, as well as yield, with 4.0% vs. 7.7% of RTA barcodes remaining unclassified, while run time remains unaltered (see Table 2).

Essentially, this result strongly supports the result of the present invention, that barcode design based on dynamic time-warping distance (DTWD), as implemented in WarpDemuX, increases the distinction between the barcodes and hence their identification.

To evaluate WarpDemuX’s scalability, a model capable of demultiplexing 12 distinct barcode classes is developed and tested (referred to as the "WDX12" model; Fig. 2 (d)). When comparing models designed to differentiate between four and twelve WDX-RTA barcodes (WDX4 vs. WDX12), WarpDemuX demonstrated stability in performance. Despite the complexity of distinguishing a larger number of barcode classes, the accuracy of WarpDemuX only slightly decreased from 0.99 to 0.98. However, as expected with the expansion of barcode classes, the challenge of classification increased, leading to a decrease in prediction confidence. This resulted in a higher rate of unclassified barcodes, growing from 4.0% to 8.8% at a confidence threshold of 0.5, as detailed in Table 2. Fig. 2 (e) illustrates the relation between accuracy and yield, modulated by varying confidence cutoffs. A higher cutoff emphasizes accuracy, while a lower one boosts the percentage of reads classified. Depending on the experimental setting, users can adapt the cutoff to fit their needs. Recommended thresholds per model, and corresponding performances per dataset are included in the supplementary material (Suppl. Table 9).

Remarkably, even with the added complexity from expanding the number of barcode classes, the WDX12 model consistently outperformed DPC across all confidence thresholds in both accuracy and yield (Fig 2 (e)). Also in terms of run time, WDX12 remained superior to DPC (see Table 2).

When evaluating the trend of WarpDemuX’s performance and yield over increasing number of barcodes (Fig. 2 (f)), model performance is predicted to decrease slowly with the number of barcodes used. Given a well-performing set of optimized barcodes, it is predicted that demultiplexing of 24 barcodes can be implemented at an accuracy of 96% with 14% of reads unclassified (confidence cutoff = 0.5).

Validation of WarpDemuX by in vivo sequencing of viruses

To validate effectiveness of the method according to the present invention (WarpDemuX) and illustrate the practical application of WarpDemuX in complex biological studies, the WDX12 model is applied to demultiplex in vivo transcribed RNA samples from cells infected with SARS-CoV-2. Specifically, first, RNA are extracted from cells at 8 and 24 hours postinfection, which were either uninfected (mock), infected with wild-type SARS-CoV-2, or infected with a GFP-expressing virus, and barcoded these samples with WDX-RTA barcodes 1 through 10 (details in Suppl. Table 14). Then, these samples are sequenced concurrently on a single Flongle flow cell.

While only a limited number of reads are obtained during this sequencing run (5222 reads passing a minimum qscore threshold of 11), 3874 reads could be retained after demultiplexing with a confidence threshold of 0.8 (74% yield, marginally lower than the 79% yield observed with in vitro transcribed RNA (Figure 2 (e))). Notably, read filtering by Phred qscore had improved the WDX confidence score distribution (Suppl. Figure 18), indicating that there is a common underlying source of noise that can affect both basecalling and WDX demultiplexing confidence.

It is observed that at the confidence threshold of 0.8 approximately 1.1% of the classified reads were inaccurately assigned to the two barcodes that were not employed in the experiment (specifically Barcode 11 and Barcode 12, Figure 18 (a)). Notably, when comparing the empirical confidence score cumulative density functions for each barcode (Figure 19 (b)), the false positive predictions associated with Barcodes 11 and 12 are found to be characterized by lower predictive confidence compared to the predictions for other barcodes.

As expected, viral RNA can be detected only in virus-infected samples, and GFP sequence are found only present in those samples infected with a virus expressing GFP (Figure 3), confirming the accuracy estimates measured with the in vitro transcribed RNAs.

Figure 3 presents a diagram of a number of reads for each WDX-demultiplexed sample from a single Flongle sequencing run that were aligned uniquely to the Vero genome, SARS-CoV-2 genome and TurboGFP respectively. The results are highly reproducible between the two replicates and indicative of accurate demultiplexing, with viral RNA only detected in virus-infected samples, and GFP sequence only present in those samples infected with a virus expressing GFP.

The relative abundances of GFP- and SARS-CoV-2 reads was also highly reproducible between the two replicates, thus providing a blueprint for direct RNA sequencing to study viral replication kinetics.

In the following, the methods are described in more detail.

1) Adapter Signal Processing and Alignment

To account for the intrinsic variation in the dwell time, as well as the within-event measurement noise, the raw signal of the adapter is partitioned into n segments, based on the method introduced in ONT’s signal analysis tool Tombo (github.com/nanoporetech/tombo; algorithmic details are described in Suppl. Material 5.1.1 below). Each segment is subsequently represented by its average signal (Fig. 1 (c)). To mitigate the risk of data loss due to missed events, the signal is intentionally oversegmented, detecting approximately 10% more segments than expected based on the number of bases in the RTA. The segmented adapter signal is normalized and trimmed to retain just the final 25 segments, which should contain the barcode signal based on its known sequence length and location within the RTA.

Since the segmentation is likely imperfect, subsequent Dynamic Time Warping (DTW) is performed to compute the distance between segmented barcode signals. DTW is a robust solution to align two sequences with temporal variation by compressing or stretching their temporal axis to minimize the cumulated Euclidean distance between corresponding elements (thus correcting for imperfect segmentation; Fig. 1 (c)).

To compute the Dynamic Time Warping Distance (DTWD) between two segmented signals s1 and s2, DDTW(s1 , s2) the method "dtaidistance" described in reference [12] can be used. The parameters used are detailed in Supplementary Chapter 9 below.

2) Barcode Design via Dynamic Time Warping Distances

A method is executed to generate sets of k RTA barcodes with distinct signal profiles de novo which avoids resorting to an exhaustive search across all sets of k barcodes (complexity for barcodes of length L, e.g. ~ 2.3 x 10¹²¹ for L = 18 and k = 12). First, a set of wave-like target signal patterns are constructed (Fig. 2 (a)) that exhibits high intertarget DTWD, since such patterns are expected to improve segmentation, as well as distinction between barcodes. Then, for barcode seguences is searched whose putative signal, constructed using the ONT kmer models (github.com/nanoporetech/kmer_models), would closely match these target patterns (see Suppl. Material 1). Subseguently, the top 50 barcodes are selected per target pattern. For the twelve target patterns high-lighted in Fig. 2 (a), this yielded a total of 600 barcodes, for which all 600 x 600 pairwise DTWDs are computed. Lastly, within the matrix of 600² pairwise DTWDs the Maximal Diversity Problem (MDP, below) is solved with k=12 to propose 12 candidate barcodes that are expected to be accurately identifiable and distinguishable. The MDP is defined as: where M denotes the entire set of barcode candidates (600 in the example), k is the number of distinct barcodes aimed to select, s, is the segmented signal of barcode i, DDTW(Sj, Sj) is the DTWD between s, and Sj, and b(i) is a delta dirac function indicating whether barcode i is selected. Fig. 2 (a) visualizes the steps of the in silico design process, additional details per step are described in Suppl. Material 1. The set of optimized, WarpDemuX RTA barcodes (WDX-RTA; k = 12) is shown in Table 1. Corresponding custom RTAs can be designed as described in Suppl. Materials 1 and ordered for benchmarking.

3) Design and Synthesis of Dual-Barcoded RNA for Classifier Training To evaluate the barcode design and train a classifier for demultiplexing labeled RTA signals are generated by synthesizing and sequencing various custom RTA adapters attached to RNA molecules with integrated (in-line) RNA barcodes — serving as ground truth label. Notably, the in-line RNA barcodes used were integrated into the RNA transcripts using PCR, positioning them at the 5’ end of the polyA tail to enhance the accuracy of basecalling and, consequently, the precision of barcode classification, compared to standard RNA barcodes that are ligated to the 3’ end of the polyA tail. The RNA in-line barcode-to-RTA barcode assignment is randomized across replicates to circumvent any potential of bias introduced by the in-line RNA barcode signal. The generated data sets (1-6) are detailed in Suppl. Tab. 4. a) RNA synthesis

The barcoded RNA designated for training the WarpDemuX classifier can be synthesized via in vitro transcription. To enrich the diversity of our dataset, a variety of RNA templates can be used.

Briefly, starting from different DNA templates, 12 PCRs per template are performed with a forward primer containing a T7 RNA polymerase promoter and varying reverse primers to add different RNA barcodes to the 3’ end, followed by a 10nt poly-A tail. After purification of PCR products, they were used as templates for in vitro transcription, followed by DNA digest and RNA clean-up. The RNA can then be further polyadenylated to ensure no signal spillover from the RNA barcode into the DNA signal. Further details and a visual summary of the protocol are included in Suppl. Material 2. b) Direct RNA sequencing and basecallinq

The direct RNA sequencing library preparation can be started by ligating 100 ng of polyadenylated barcoded RNA onto a custom RTA adapter. Further library preparation can be performed according to ONT’s recommendations with the exception of use of MarathonRT as reverse transcription enzyme. (Suppl. Material 3). After motor protein ligation, 7 ul of prepared library can be loaded on a Flongle FLO-FLG001 flow cell, and reads can be acquired until all pores became inactive. A detailed description of all steps as described in Supplementary Chapter 2 and 3 below. Reads can be basecalled post-acquisition with the direct RNA model of ONTs research basecaller ’dorado’, which had shown accuracy improvement over the ’guppy’ direct RNA basecaller. The full configuration is described in more detail in Supplementary Chapter 4 below.

4) WDX-RTA DNA barcode identification a) SVM training

A support vector machine (SVM) can be trained to classify the sequenced dual-barcoded RNA reads into RTA barcode categories (compare Suppl. Tab. 3) based on their RTA- barcode signal, using a combination of read mapping and RNA barcode classification to infer their ground truth labels. b) RNA labeling

To determine ground truth labels, reads can be first uniquely mapped to RNA reference sequences with the method LAST described in reference [13], Then, the RNA barcode region can be identified and reads can be associated with the RNA barcode that has the smallest Levenshtein distance to them. Reads for which this distance surpassed a defined threshold of 4 can be discarded. Based on the specific combinations of RNA and RNA-RTA barcode pairs used during library preparation, the read assignments can be used to establish the RTA-barcode ground truth labels required to train the classifier.

5) Data Set Selection and Preparation a) Training and Validation Sets Creation

We constructed training and validation sets for our models, selecting 400 reads per barcode for training and 200 for validation. These sets utilized data from datasets 3-5 for the WarpDemuX models (WDX4, WDX12) and datasets from the replicate 2-5 of the DeePlexiCon training data for the WarpDemuX-DPC model, as detailed in Smith et al. [8], Further details on training set size determination is described in Supplementary Chapter 9 below. b) Test Sets Configuration

For the WarpDemuX models, two distinct test sets can be designed: one comprising 600 reads per bar-code from datasets 3-5, and another using the entirety of dataset 6, which includes RNA without enzymatic polyadenylation. Datasets 1 and 2 serves as the test sets for the WarpDemuX-DPC and DPC models. c) Instance Selection and Noise Class

To select instances for the training set, first the pairwise Dynamic Time Warping Distances (DTWDs) are calculated for all instances outside of validation or test sets. Utilizing the robust Z-score (defined below) of the median within-class distance, potentially mislabeled instances can be identified and removed. Subsequently, robust Z-scores for median interclass distances can be calculated to detect noisy signals, randomly selecting 400 of these instances as training data for the noise class. Further elaboration and examples of signal analysis are described in the Supplementary Chapter 8 below.

The robust Z-score is calculated as follows, where Xi represents the value under evaluation (either median within-class or inter-class distance), Median(X) is the dataset’s median, and MAD is the Median Absolute Deviation: n Ro ,bust . r Z, score = X -i - M —e —dia -nt X)

MAD , ’ with MAD = Median(\Xi — Median(X)\) d) Optimizing Data Representation

To ensure the training set captures the full range of the original data distribution despite its limited size, both the Maximal- and Minimal Diversity Problem for the calculated pairwise distances of the remaining reads are addressed. This process is stratified across barcode classes, e.g. selecting 80 instances per barcode that maximized and minimized their mutual distance. The training set can be completed with e.g. 240 randomly sampled reads per barcode. e) Stratified Sampling Across Replicates

All sampling processes for validation, training, and test sets can be stratified across replicate groups to compensate for variations in sequencing yields, ensuring a balanced representation across all data. f) SVM training

The RTA barcode classifiers are trained on preprocessed RTA barcode signals. To obtain these RTA barcode signals, first the DNA adapter signal can be identified in the raw signal using ADAPT (github.com/wvandertoorn/ADAPT). Then the identified DNA adapter signal can be segmented, normalized and trimmed, as already described above in section 1) ‘Adapter Signal Processing and Alignment’. For each RTA barcode signal, a feature representation can be generated that quantifies how similar the signal is to the RTA barcode signals of reads with known labels. To do this, DTWDs between the signal and all labeled signals in the training data set are computed.

Support Vector Machines (SVMs) are then used for classification by introducing a Dynamic Time Warping Kernel function, K, which is defined as:

/<(si,s₂) = exp ( - y . DDTW(S₁,S₂)') where s1 and s2 are two preprocessed RTA barcode signal instances, and y = 1.2 is a hyper-parameter of the kernel function.

During training, the SVM kernel matrix is based on the pairwise DTWDs between the barcode signals in the training data.

An SVM implementation can be used based on libsvm as described in reference [14], with a one-vs-one scheme for multi-class support, as implemented in Scikit-Learn according to reference [15], Herein, calibrated class membership probability estimates are provided based on an internal 5-fold cross-validation combined with a multiclass extension to Platt scaling as described in reference [16], Further details on the model parameterization are described in Supplementary Chapter 9 below. g) WDX-RTA DNA barcodes classification

During inference, for each read to be classified, K^, s₂) is computed against all barcode signals in the training data, i.e. s₂ e B, with B = {B^ B₂,... Bk}, where Bj contains all segmented barcode signals that belong to RTA barcode i based on their ground truth label.

Each prediction in the classification system according to the present invention can be paired with a confidence score, indicating the reliability of the predicted class label. Reads for which the prediction has a confidence score below a user-defined cutoff are discarded (’unclassified’). For comparability, the same method can be adopted as used in the prior art method DeePlexiCon. The confidence score can be defined as the interval between the classifier’s probability estimate of the predicted class and the second most probable class. h) Sequencing of SARS-CoV-2 infected cells

To prepare the sequencing samples, 5 x 105 Vero E6 TMPRSS2 cells were infected in duplicate with two different recombinant SARS-CoV-2 viruses at MOI=0.01. Briefly, rSARS- CoV-2 WT and rSARS-CoV-2 GFP inoculum (generated as described in [17]) were prepared in DMEM supplemented with 1% FCS. Before the infection, Vero E6 TMPRSS2 cells were washed once with PBS and incubated with the respective inoculum for 1h at 37 °C with a gentle shake every 10 min. The inoculum was removed and fresh DMEM supplemented with 5% FCS, 100 mg/mL of penicillin and 100U/mL of streptomycin was added to the cells. Then, cells were harvested at 8 and 24 h post infection. A mock control of uninfected cells was also performed. Total RNA was extracted with Trizol (Invitrogen) using the manufacturer’s recommendations, precipitated by addition of 0.1 volumes of 3 M NaOAc and 3 volumes of ice cold 100 % EtOH and incubation at -20 °C for >12 h, followed by centrifugation at 16,000 xg for 30 min at 4 °C, a wash with 70 % EtOH and another round of centrifugation before resuspension in 20 ul RNase-free H2O. RNA concentration was quantified on Nanodrop and 1 ug of total RNA per sample was used for WarpDemuX- multiplexed direct RNA library preparation as described in Supplemental Section S 3 below in more detail. The generated reads were demultiplexed with the WDX 12 barcode model at a confidence threshold of 0.8 and reads were aligned with minimap (see Supplementary Chapter 13 for details).

6) Data and Code Availability

The present invention (WarpDemuX) can be implemented as a Python command line tool with Cython-based signal processing to achieve fast CPU run time. The tool can be easily installed using pip. Detailed instructions as well as test data are provided in the code repository of the tool. The implemented method WarpDemuX can manage multiple formats including the current ONT standard pod5, the community developed slow5 and blow5, and the deprecated fast5 format.

Supplemental Description

S1 Designing WarpDemuX Barcodes

1.1 Experimental design

The RTA adapter consists of two partially complementary oligos. To ensure compatibility with the sequencing chemistry the adapter oligos should be designed similarly to the tested design described in reference Smith et al. [8], with the exception of allowing variation of GC content. In detail, the two last nucleotides of the 5’ end of the forward RTA adapter (GG) were kept constant to reduce potential ligation biases. In addition, all adapters retained the 3’ adapter sequence that is required for the RMX motor protein to ligate onto. The reverse adapter was designed as a reverse complement of the forward adapter, with the exception of a 10 nucleotide dT 3’ overhang to hybridize to the RNA poly-A tail, as well as a Y anchor sequence 5’ of the reverse complement RTA barcode. All oligos were ordered from IDT with standard desalting purification, and with forward adapters containing a 5’ phosphorylation. The following scheme exemplifies the design:

Table 3: Custom RTA design scheme (5’ to 3’ orientation for both oligos)

1.2 In silico design Barcode optimization addresses the need for a set of distinguishable signals that can be easily detected and separated. For barcodes of length L, an exhaustive search across all possible sets of k barcodes has complexity (jQ, making such a search computationally infeasible: “ 2.3 x 10¹²¹ for L = 18, k = 12.

Instead, a set of wave-like target signal patterns are constructed since such patterns improves segmentation, as well as distinction between barcodes and thus make for good barcodes in the context of our classification method.

1.2.1 Target Patterns Derivation

Across target patterns, wave-like elements are combined with diverse frequencies and amplitudes. Additionally, consistent minimal and maximal values are maintained across all target patterns to mitigate potential normalization biases during signal processing.

The customizable section of the RTA that serves as barcode spans 18 nucleotides. Given the sequencing kmer size of 6, however, the preceding five constant nucleotides and the following constant two, as well as the first bases of the subsequent polyA tail also influence its signal. As a result, 20 DNA and 3 DNA-polyA signal events are expected. As there is no ONT kmer model for mixed DNA-polyA kmers, the signal of these kmers can be assumed to be an interpolation between the value of the last DNA kmer and the expected signal level of a pure polA segment (— 110 pA). When constructing the target patterns, the interpolation kmers are substituted with a singular polyA segment observation, thus yielding patterns that span 21 observations.

1 .2.2 Candidate Set Construction

It is searched for barcode sequences whose putative signal, constructed using the ONT kmer models (github. com/nanoporetech/kmer_models), would closely match these target patterns. These models are designed for DNA sequenced in the 5’ to 3’ direction. In dRNA- seq, however, the RTA is sequenced in 3’ to 5’ direction. To increase the likelihood that the modeled barcode signals reflect the sequencing data, the barcode design is limited to the subset of kmers for which the 5’-3’ to 3’-5’ directional reversal likely has little effect. Specifically, kmers are selected with minimal variation (less than 10 pA) in expected signal mean when compared to their mirrored counterparts. For example, if the expected signal level difference between the kmer AAACCC and the kmer CCCAAA is < 10 pA, then the kmer AAACCC is selected. Subsequently, the expected signal values are discretized for the selected kmers (1524 in total) and their mirrored counterparts respectively into five equally-sized, rank-based buckets. Only those kmers whose original and mirrored versions both fell within the lowest, middle, or highest rank buckets were further considered, leaving us with 531 kmers. This discretization and selection strategy enhances the probability of identifying wave-like patterns - marked by alternating low and high values - when assembling the kmers into barcodes. Additionally, this approach strengthens the stringency of our heuristic for ensuring robustness against directional reversal. From the filtered set of kmers, all conceivable combinations of five kmers, adhering to a stride of three, are generated. This means that the final three bases of one kmer overlap with the initial three bases of the following kmer. This approach yielded roughly 3.4 * 106 candidate barcode sequences, each 18 bases long.

For each, the full adapter sequence is constructed by pre- and appending the flanking regions of the adapter. Using the ONT kmer models the putative adapter signal can be constructed and appended with a singular observation of the expected polyA tail signal. The full adapter signal can be normalized to zero mean and unit variance. The last 21 observations of the idealized adapter signal corresponding to the variable barcode part are selected and used to calculate the distance to each target pattern.

1.2.3 Candidate Selection

To select the set of k = 12 barcodes the top 50 best-scoring (lowest distance) candidate barcodes per target pattern are selected. For the k = 12 target patterns highlighted in Fig. 2 (a), this yielded a total of 600 unique barcodes, for which all 6002 pairwise DTWDs are computed.

Lastly, within the matrix of 6002 pairwise DTWDs the Maximal Diversity Problem (MDP) is solved to propose 12 candidate barcodes that are expected to be accurately identifiable and distinguishable. The MDP can be solved using a greedy heuristic implemented as in Algorithm 1.2.3.1.

Algorithm 1.2.3.1 : Greedy heuristic for solving the Maximum Diversity Problem, input Distance matrix D e W^⁷¹, Number of elements to select m output : Indices of the selected elements n ‘-number of rows in D; first <- maxfXj, Dyf, 11 Select the element with the maximum sum of distances selected «- [/7rst]; while length of selected < m do maxMinDist < - 1; nextindex < - 1; for i <-0 to n - 1 do if i e selected then minDistZSelected <-min{Dij :j E selected}', if minDistZSelected > maxMinDist then maxMinDist <- minDistZSelected', next Index <- j; end end end

Append nextindex to selected', end return selected',

S2 Generating in vitro transcribed barcoded RNA 2.1 RNA barcoding strategy

In order to train and benchmark RTA-based barcoding strategies in vitro transcribed RNAs of different templates containing varying RNA-encoded barcodes are generated. This allows generating high accuracy ground truth labels for different RTA signals. To ensure that no signal spillover from RNA barcode into RTA signal is occurring, it is advisable to enzymatically poly-A tail these RNAs post in vitro transcription (although they already contain a 10 nt poly-A tail that can be used for direct ligation of RTA adapters). Figure 4 provides a visual summary of the protocol. Further, it is preferred to randomize RNA- barcode to RTA assignments to eliminate the possibility of inadvertently training on RNA signal in the rare possibility of RTA adapter signal failure 4.

Table 4: Run-RNA-barcode to RTA assignments for RTA training/evaluation data

2.2 RNA barcode design

The RNA barcode sequences were designed with high RNA sequencing error rates in mind. Specifically, a large pool of 8 nt long barcode candidates was generated by random sampling, with some additional candidates supplemented by BARCOSEL as described in reference [19], and a set of 18 barcodes that maximized inter-barcode Levenshtein edit distances was chosen. To balance purine and pyrimidine content in each barcode and double the number of variable nucleotides in the final barcodes a self-complementary stem- loop (i.e. [barcode]-GAGUA-[revcomp_barcode]) was generated for each barcode. To introduce these RNA barcodes into DNA templates reverse primers containing a 10 nt long T stretch at their 5’ end, followed by the reverse complement of the barcode stem-loop, and finally the reverse complement of the M13-reverse sequence were ordered with standard desalting purity from IDT (Table 5).

Table 5: RNA barcode design and reverse PCR primers to generate in vitro transcription templates. The constant M13r sequence is in lowercase.

2.3 PCR to generate barcoded DNA templates PCR was performed on templates of the Vibrio cholerae glycine riboswitch, hc16 ligase, bacterial RNase P type A, tetrahymena riboswitch and Hepatitis C Virus IRES cloned into pJet (Thermo Fisher) flanked by a T7 RNA promoter sequence and an M13r sequence at 5’ and 3’ ends respectively as described in Bohn et al. [20] using Primestar GXL (Takara) according to the manufacturer’s instructions. Reaction volume was 20 ul, annealing temp. 50 °C, extension time 30 seconds at 35 cycles with forward primers described in Table 6 and reverse primers as described in Table 5.

Table 6: Forward PCR primers to generate in vitro transcription templates.

2.4 In vitro transcription

PCR products were quality controlled via agarose gel and purified by addition of 18 ul H2O and 38 ul SPRI beads (Omega Biotek NGS). After hybridization for 5 min under light agitation (300 rpm on hula mixer) beads were pelleted using a DynaMag-96 Side Magnet (Invitrogen). Supernatant was removed, beads were washed twice with 100 ul 80% EtOH and eluted in 20 ul H2O. DNA concentration was measured with Nanodrop and 250 fmol of DNA was then used in a 10 ul in vitro transcription reaction: 5 mM NTPs, 5 ll/ul T7 RNA polymerase (homemade), 1 ul YIPP (NEB), 0.1 ul RNasin (Molox) in 40 mM Tris pH 7.5, 18 mM MgCI2, 10 mM DTT, 1 mM spermidine, 5 mM NTPs, 40 II RNasin (Molox) and homemade T7 RNA polymerase. After in vitro transcription for 2 h at 37 °C the DNA template was digested by addition of 10 ul of DNase mix containing 1 ul 10x DNase I buffer and 0.2 ul of DNase I (NEB) followed by a 15 min incubation at 37 °C. The RNA was then purified by addition of 1.4 volumes of SPRI beads and two washes with 80% EtOH. Purified RNA was then quality controlled with a 1 .5% agarose gel in 1x TAE and concentration was quantified via Nanodrop.

2.5 Poly-A tailing

For poly-A tailing 1 ug of IVT RNA was incubated with 0.5 ul E. coli Poly-A Polymerase (NEB), 1 ul 10 mM ATP in 1x poly-A polymerase reaction buffer in a 10 ul reaction for 30 min at 37 °C, followed by purification with 1.4 volumes of SPRI beads and two washes in 190 ul 80% EtOH before elution in 20 ul RNase-free H2O.

S3 Direct RNA sequencing library preparation

3.1 Annealing of RTA adapters

The forward and reverse strands of the RTA oligos were annealed following a procedure similar to that described in Smith et al. [8], Specifically, oligos were diluted to 10 uM each in 30 mM HEPES-KOH (pH 7.5), 100 mM K-Acetate, heated to 95 °C for 1 min and cooled to 25 °C at a rate of 0.5 °C/min. The annealed RTA adapters were then diluted to 1.4 uM with RNase-free H2O and stored at -20 °C until use.

3.2 RTA adapter ligation

Volumes were scaled for direct RNA sequencing on a Flongle flow cell according to the protocols. io protocol of Krause in reference [21] 2020. In detail, for each sample, 100 ng of polyadenylated in vitro transcribed RNA in 4.4 ul, was transferred into a PCR tube, followed by addition of 1 ul previously annealed custom RTA adapter, 1.6 ul 5x NEBNext Quick Ligation Buffer (NEB) and 1 ul T4 DNA Ligase high concentration (NEB). The reaction was mixed well by pipetting and incubated at room temperature for 20 min. To stop the ligation 2 ul of 0.5 M EDTA was added to each reaction, samples were pooled, purified with 0.8 volumes of SPRI beads washed twice with 80 % EtOH, and eluted in 20 ul RNase-free H2O.

3.3 Reverse Transcription

To perform the reverse transcription a master mix containing 1.9 ul H2O, 15.7 ul 3x MarathonRT buffer (150 mM Tris-HCI pH 8.3, 600 mM KCI, 60 % (v/v) glycerol), 2.35 ul 100 mM DTT, 0.2 ul RNasin (Molox), 1.9 ul 50 mM MgCI2, 3 ul of 10 mM dNTPs and 2 ul MarathonRT (produced in house as described in [20]) was prepared and added to the 20 ul RNA-RTA ligation. After incubation at 42 °C for 60 min the reaction was stopped by addition of 2 ul Proteinase K (NEB), followed by another incubation at 42 °C for 15 min. The reaction was purified with 1 volume SPRI beads, washed thrice with 190 ul 80 % EtOH, and eluted in 30 ul RNase-free water (volume was increased to adjust for higher total input amount due to multiplexing).

3.4 Motor protein ligation

To proceed with Nanopore sequencing next ligation on the motor protein is performed. For this 10 ul of purified reverse-transcribed RNA-RTA library can be transferred into a 1.5 ml DNA LoBind tube (Eppendorf), followed by addition of 2.5 ul H2O, 4 ul 5x NEBNext Quick Ligation Buffer (NEB), 2 ul RMX motor protein adapter (ONT SQK-RNA002), and 1.5 ul T4 DNA Ligase high cone. (NEB). The reaction was mixed by pipetting and incubated for 20 min at RT, followed by purification with 20 ul SPRI beads, washed twice with RNA wash buffer (WSB, ONT SQK-RNA002). The library was then eluted in 9 ul elution buffer (EB, ONT SQK-RNA002) and 8 ul was transferred into a new 1.5 ml DNA LoBind tube. The library was then loaded onto a Flongle FLO-FLG001 flow cell for sequencing.

54 Data Acquisition and Basecallinq software configuration

The following configuration is used:

• MinKNOW 22.12.7

• dorado 0.3.4

• protocol MIN106_RNA:FLO-FLG001 :SQK-RNA002

• dorado model: rna002_70bps_hac@v3

55 Pre-Processing of DNA Adapter Signal

5.1 Signal Segmentation

Figure 5 presents the Intrinsic variation in dwell time poses a source of noise. Figure 5A shows varying dwell time per event and between reads results in varying length adapter signals. Figure 5B shows segmentation of the raw signal into events, with each segment represented by its average signal, reduces signal noise caused by differences in dwell time and within-event measurement noise. Figure 5C shows that signal noise causes imperfect segmentation in which some events are missed while other spurious events are detected.

To account for the inherent variation in dwell time (Fig. 5A), the raw signal is segmented (Fig. 5B) by identifying the most significant shifts in current level based on the running difference between neighboring windows of raw signal. The pseudo code is described in section 5.1.1 below. Its parameters were tuned to reflect the translocation speed of the RTA, with a minimal segment length of 15 observations and a sliding window width of 30 observations. Considering the known length of the adapter sequence, the signal is expected to contain 98 events. However, because segmentation is likely to be imperfect (Fig. 5C), with missed events posing a greater challenge than spurious events, it is proposed to intentionally oversegment the signal by approximately 10% to detect 110 events. Each segmented event is then represented by its average signal. Post-segmentation, segment means are normalized to zero mean and unit variance. Only the variable barcode component of the segmented signal, expected to be represented within the last 25 segments, based on the known barcode sequence length and location in the adapter, is used for classification. Algorithm 5.1.1 : Greedy algorithm for raw signal segmentation based on the most significant shifts in signal intensity. input Raw signal X of length N l <^ 6\ // Minimal segment length k <- 12; // Width of sliding window n«- 110; // Number of segmentation points to identify

W<-0; // Empty list r<-0;

C«-[0JV];

S<— 0; for i«— 1 to N — k do

Append w_t to W\ end for i«— 1 to N — 2k do

Append t_{i i+1} to T ; end while size of C < n do

Find index i of max absolute value in T;

Append i + k to C\

Exclude values within range I of i from T; end

Sort values in C\ for each consecutive pair (c_1; c₂) in C do s ‘-mean of X[c_x : c₂] ;

Append s to S', end return .S’;

5.2 Pairwise DTWD Considerations Segmentation ensures that all inputs to the classification procedure are of uniform length. Due to the signal normalization, pairwise DTWDs are of similar magnitude across barcodes, negating the need for further scaling or normalization of the pairwise distance matrix.

S6 WDX-RTA barcodes

Figure 6 shows a diagram for a training set visualized per barcode class. Black lines indicate the dynamic time warped mediod of barcode signal instances, obtained through DTW Barycenter Averaging described in reference [22], Training instances are thereafter warped to their class mediod for visualization.

S7 Generated Data

Figure 7 presents generated experimental data sets using different RTA barcode designs. Coverage per barcode and replicate after mapping of sequenced reads. A: WDX-RTA dataset consisting of four replicates with our 12 optimized WDX-RTA barcodes. Replicate 1 ,2, and 3 contain enzymatic polyadenylated RNA, replicate 4 contains RNA without enzymatic polyadenylation to assess WarpDemuX’s performance on RNA with short polyA tails. B: DPC-RTA dataset consisting of two replicates with the 4 preexisting RTA barcodes used by Smith et al. [8] to train and evaluate DeePlexiCon.

Replicatel V. Replicate2 Replicate3 HCV Replicate4 bact Total c/?o/gly hc16 IRES RNase P Not

WDX-RTA Polyadenylated Polyadenylated Polyadenylated polyadenylated

Barcode 01 548 2604 411 872 4435

Barcode 02 633 3493 712 1476 6314

Barcode 03 433 3125 868 894 5320

Barcode 04 652 3078 188 1182 5100

Barcode 05 111 2132 237 285 2765

Barcode 06 162 963 626 1480 3231

Barcode 07 888 4806 701 1207 7602

Barcode 08 3120 2962 703 1894 8679

Barcode 09 324 5839 427 1305 7895

Barcode 10 1006 2802 857 2073 6738

Barcode 11 588 2459 888 1649 5584

Barcode 12 570 5575 360 749 7254

Total 9035 39838 6978 15066 70917

Table 7: Generated experimental WDX-RTA data, coverarge per barcode and replicate after mapping and adapter detection.

Replicatel V. chol gly Replicate2 tetrahymena Total

DPC-RTA Polyadenylated Polyadenylated

Barcode 1 601 1856 2457

Barcode 2 600 2482 3082

Barcode 3 191 3141 3332

Barcode 4 558 975 1533

Total 1950 8454 10404

Table 8: Generated experimental DPC-RTA data, coverage per barcode and replicate after mapping and adapter detection. Figure 8 shows examples of mislabeled signal instances due to missed read splits. (A) Concatenation of two complete reads, each consisting of an adapter, polyA and RNA signal. (B) Concatenation of an ‘adapter-only’ read and a complete read, wherein the first consists of an adapter and subsequent polyA tail signal, but misses the RNA signal.

S 8 Mislabeled instances and Noise Class

Potentially mislabeled instances are identified based on the robust Z-score of their median within-class distance. Reads with an absolute value of at least the 98th percentile were excluded.

Upon inspection of the excluded data, it can be noticed that a substantial part of these reads were the result of a sequencing artifact known as a missed read split. The two common types are:

(1) a concatenation of two complete reads, each consisting of an adapter, polyA and RNA signal (Fig. 8A), and

(2) a concatenation of an ‘adapter-only’ read and a complete read, wherein the first consists of an adapter and subsequent polyA tail signal, but misses the RNA signal (Fig. 8B).

In both cases, the adapter signal of the first read in the read pair is detected, processed and classified. However, the ground truth label for this read, inferred based on a combination of the base called RNA transcript and RNA barcode, may belong to the second read of the concatenated pair, rather than the one processed.

After filtering out mislabeled instances, the robust Z-scores of the median inter-class distances are calculated to identify noisy signals. For this, the minimal absolute value score across classes is used. Reads with a minimal absolute value score above the 99th percentile value were labeled as noise. From these, 400 instances are randomly sampled as training data for the noise class.

Upon inspection of the noise class data, stalls can be observed to be the main type of signal irregularity, potentially leading to mis-segmentation of the adapter itself (Fig. 9A). In addition, stalls can be observed in the beginning of the adapter to cause mis-identification of the DNA/RNA boundary (Fig. 9B). Figure 9 shows examples of noise class signal instances. Figure 9A shows stall in adapter signal leading to potential mis-segmentation. Figure 9B shows stall in the beginning of the adapter leading to misidentification of DNA/RNA boundary

S9 Parameterization

Parameters were tuned based on 10-fold cross validated performance on the validation set (200 reads per barcode).

9.1 Number of training instances per barcode

During WarpDemuX training, the number of training instances per barcode are selected based on the average 10-fold cross-validated model performance on the validation set. The model reached optimal performance at a small number of training instances per barcode (Fig. 10A), while run time kept increasing with the number of training instances as expected (Fig. 10B). The model’s performance was found to be robust with as few as 400 training instances per barcode.

Figure 10 shows the performance on validation set during model selection. Figure 10A shows the performance and yield for increasing number of training instances per barcode, evaluated at confidence threshold of 0.5, shows that the gains of increasing number of training instances per barcode on the model’s quality already diminish at small training set sizes. Values represent mean over 10-fold cross-validation with error bars indicating 95% Cl. Figure 10B shows the classification time per read for an increasing number of training instances per barcode. Shaded area reflects the average ±2std classification time per read, computed from 10 runs of 4000 reads on 8 CPU cores (11th Gen Intel(R) Core(TM) i7- 1165G7 (2.80GHz) with 32GB memory).

9.2 SVM

A C-Support Vector Classification SVM implementation can be used based on libsvm described in reference [14], with a one-vs-one scheme for multiclass support, as implemented in Scikit-Learn as disclosed in reference [15] (scikit-learn 1.3.1).

With C, the regularization parameter, set to 10.0, kernel = ‘precomputed’ and default settings otherwise.

9.3 DTWD

To compute the Dynamic Time Warping Distance (DTWD) between two barcode fingerprints, the Python Package dtaidistance 2.3.10 of reference [12] can be used, with a window size of 15 (upper bound to allowed shift from the diagonal) and an additive warping penalty of 0.1 per compression or expansion.

S10 Recommended confidence cutoffs To facilitate a fair comparison between models with a different number of barcodes, as well as to improve the ease-of-use for the user, two modes of operation per model are introduced: a High Accuracy mode (HA) and a High Recovery mode (HR), with respectively tuned confidence cutoffs based on the performance on the validation set. For the mode HA, the confidence cutoff is set to the value for a 90% yield on the validation set. For mode HR, the confidence cutoff is set to the value for a 95% yield on the validation set. The respective performances per dataset are reported in Suppl. Tab. 9 as follows:

.. . . _ . . Reads classified . Confidence Unclassified . _ . . _ „

Model Dataset _r„ , mode . ,, . _{ro/ 1} Accuracy Precision Recall as noise [97o] cutoff reads [%] ³

2 0.0000 HA 0.2250 3.0000 0.9510 0.9440 0.9370

Dr v

WDX4 3-5 1.7500 HR 0.7700 5.5000 0.9924 0.9923 0.9924

WDX4 3-5 1.7500 HA 0.9400 11.0400 0.9943 0.9941 0.9943

WDX4 6 4.6700 HR 0.7700 9.1900 0.9899 0.9900 0.9864

WDX4 6 4.6700 HA 0.9400 16.9200 0.9926 0.9925 0.9917

WDX6 3-5 2.1900 HR 0.6600 4.6900 0.9949 0.9949 0.9949

WDX6 3-5 2.1900 HA 0.8900 10.5000 0.9968 0.9969 0.9968

WDX6 6 5.7300 HR 0.6600 8.6900 0.9921 0.9925 0.9899

WDX6 6 5.7300 HA 0.8900 16.2700 0.9952 0.9957 0.9940

WDX8 3-5 2.5600 HR 0.5600 4.7500 0.9912 0.9914 0.9911

WDX8 3-5 2.5600 HA 0.8200 9.4200 0.9948 0.9948 0.9947

WDX8 6 5.6000 HR 0.5600 7.9800 0.9851 0.9853 0.9834

WDX8 6 5.6000 HA 0.8200 14.6800 0.9900 0.9902 0.9897

WDX10 3-5 2.5200 HR 0.4835 5.2700 0.9870 0.9871 0.9869

WDX10 3-5 2.5200 HA 0.7370 10.0200 0.9924 0.9925 0.9922

WDX10 6 5.3100 HR 0.4835 8.5700 0.9827 0.9818 0.9813

WDX10 6 5.3100 HA 0.7370 14.6500 0.9890 0.9886 0.9877

WDX12 3-5 2.1700 HR 0.4315 5.5100 0.9788 0.9791 0.9785

WDX12 3-5 2.1700 HA 0.7200 10.9300 0.9853 0.9855 0.9850

WDX12 6 4.7500 HR 0.4315 8.8500 0.9717 0.9697 0.9717

WDX12 6 4.7500 HA 0.7200 15.7000 0.9811 0.9798 0.9805

Table 9: Performance Comparison Across Models in Different Operation Modes

S11 Performance

11.1 Runtime

Table 10 shows the classification time per read for WarpDemuX models. Model _ Time [ms]

WDX4 1.61

WDX6 2.65

WDX8 4.05

WDX10 5.37

WDX12 _ 6.68

Table 10: Classification time per read for WarpDemuX models.

The reported times reflect average time per read, computed from 10 runs of 4000 reads on 8 CPU cores (11th Gen Intel(R) Core(TM) i7-1165G7 (2.80GHz) with 32GB memory).

Time per read [ms]

Model mean std

WDX4 14.37 0.07

WDX6 14.62 0.12

WDX8 14.93 0.07

WDX10 15.26 0.09

WDX12 15.56 0.14

Table 11 : Run time per read for WarpDemuX models.

Figure 11 presents the run time per read for WarpDemuX models which increases linearly with the number of barcodes. The reported times reflect average time per read, computed from 10 runs of 4000 reads on 8 CPU cores (11th Gen Intel(R) Core(TM) i7-1165G7 (2.80GHz) with 32GB memory).

11.2 Test set performances at different confidence thresholds

The test set performances are illustrated with table 12 below.

.. . . _ . . Reads classified Confidence Unclassified . _ . . _ ..

Model Dataset . _{o/ 1} . „ . _{ro/ 1} Accuracy Precision Recall as noise 1 %] cutoff reads [%] ³

1 0.000 0.500 9.487 0.965 0.957 0.967

Urv WDX-

0.000 0.800 18.462 0.982 0.977 0.983 DPC WDX- 0.000 0.000 0.000 0.937 0.929 0.919 DPC WDX- 0.000 0.300 4.093 0.957 0.950 0.944 DPC WDX- 0.000 0.500 7.275 0.969 0.964 0.958 DPC WDX- 0.000 0.800 14.443 0.985 0.980 0.979 DPC

WDX4 3-5 1.750 0.000 0.000 0.984 0.984 0.984

WDX4 3-5 1.750 0.300 1.375 0.987 0.987 0.987

WDX4 3-5 1.750 0.500 2.708 0.989 0.989 0.989

WDX4 3-5 1.750 0.800 6.083 0.992 0.992 0.992

WDX4 6 4.668 0.000 0.000 0.978 0.964 0.974

WDX4 6 4.668 0.300 2.541 0.984 0.978 0.980

WDX4 6 4.668 0.500 4.648 0.987 0.983 0.984

WDX4 6 4.668 0.800 9.977 0.990 0.990 0.988

WDX12 3-5 2.167 0.000 0.000 0.960 0.961 0.960

WDX12 3-5 2.167 0.300 3.750 0.976 0.976 0.976

WDX12 3-5 2.167 0.500 6.181 0.980 0.980 0.980

WDX12 3-5 2.167 0.800 13.444 0.986 0.986 0.986

WDX12 6 4.752 0.000 0.000 0.941 0.937 0.943

WDX12 6 4.752 0.300 6.000 0.965 0.963 0.965

WDX12 6 4.752 0.500 10.056 0.975 0.973 0.975

WDX12 6 4.752 0.800 19.010 0.984 0.983 0.983

Table 12: Comparative performance of model and barcode types at various confidence thresholds. Precision and recall as macro average across classes.

11.3 Test set performances on non-optimized barcodes

Figure 12 presents precision-recall curves for DeePlexiCon (left) and WDX-DPC (right) models. Evaluated on dataset 1-2 at a confidence cutoff of 0.5. Both models were trained on the same data (provided by the DeePlexiCon authors), and both evaluated on datasets 1 and 2. The left part of Figure 12 shows the performance of the published DeePlexiCon model. The right part of Figure 12 presents the performance of the WarpDemuX model according to the present invention.

11.4 Performance of individual optimized WDX-RTA barcodes

11.4.1 WDX4

Figure 13 shows the kernel matrix used to train the WDX4 model, with values averaged per class pair, shows the separation between barcode classes.

Figure 14 presents the precision-recall curves of individual WDX-RTA barcodes used in

WDX4 model and evaluated on dataset 3-6 at a confidence cutoff of 0.5.

11.4.2 WDX12

Figure 15 shows the kernel matrix used to train the WDX12 model, with values averaged per class pair, shows the separation between barcode classes. Figure 16 presents the precision-recall curves of individual WDX-RTA barcodes used in the

WDX12 model and evaluated on dataset 3-6 at a confidence cutoff of 0.5.

11.5 Best performing barcode sets

Based on a 10-fold cross-validated benchmark, the best performing subsets of barcodes are selected for experimental settings requiring less than 12 barcodes (Suppl. Tab. 13). The trained WarpDemuX models for these barcode subsets are available in the code repository.

Number of ... „

. WarpDemuX _v barcodes samples ^r

4 4 5 6 8

6 1 2 5 6 7 11

8 1 3 5 6 7 9 11 12

10 1 2 3 5 6 7 9 10 11 12

12 1 2 3 4 5 6 7 8 9 10 11 12

Table 13: Best performing barcodes sets for various numbers of barcodes, based on model selection benchmark.

S12 Temporal Perturbation Analysis between DTWD and GASF

While the DTWD is resilient to temporal perturbations, the GASF feature transformation exhibits vulnerability to such perturbations (Fig. 17). Consequently, stochastic fluctuations in dwell time introduce significant noise when employing a GASF-based approach. Although DeePlexiCon shows that a CNN can learn to discern between genuine correlations and noise in the signal-transformed image, the inherent complications of the GASF transformation intensify the classification task. This is manifested in the augmented training data requisite for the CNN and its diminished performance in comparison with WarpDemuX.

Figure 17 shows a Temporal Perturbation Analysis between DTWD and GASF. A comparative study illustrating the susceptibility of the GASF feature transformation to temporal perturbations, as opposed to the robustness exhibited by DTWD.

S13 Analysis of SARS-CoV-2 sequencing run

13.1 Quantification of primary aligned reads

The generated reads were demultiplexed with the WDX 12 barcode model at a confidence threshold of 0.8 (WDX-RTA to sample assignment is described in Table 14).

Table 14: WarpDemuX RTA adapter to sample assignment of SARS-CoV-2 sequencing run

Next, reads were aligned with minimap2 to the Vero reference genome (NCBI RefSeq: GCF_015252025.1), SARS-CoV-2 genome (NCBI RefSeq: LC528233) or TurboGFP sequence (NCBI RefSeq: GU452685.1 306 to 999): minimap2 — ax splice — u f — kl4 {reference_index} {fastq_file} > {bamfile}

Uniquely mapped reads to each reference were extracted with samtools: samtools view — F 0x100 {bamfile} | awk ’{{print $1, $3}}’ > {outfile}

When a multi-mapping read was detected it was assigned to refA - refB (for example reads that aligned to both SARS-CoV-2 and TurboGFP were assigned to SARS2-TurboGFP). Those aligning to two Vero reference contigs were assigned "Vero".

13.2 WarpDemuX demultiplexing statistics

Figure 18 shows the demultiplexing statistics of SARS-CoV-2 sequencing run:

(a) Number of demultiplexed reads per WDX barcode at confidence threshold of 0.8. Reads with mean Phred quality scores below 11 were filtered from analysis due to low alignment rates to references.

(b) WarpDemuX confidence score distribution of reads below (left) and above (right) a mean Phred quality score of 11 , indicating that a common source of noise exists between basecalling and WarpDemuX classification.

13.3 Poly A tail length estimation

To estimate the poly-A tail lengths of SARS-CoV-2 RNA in cells and virion as well as the effect of SARS-CoV-2 infection on cellular mRNA poly A tail length tail length estimation is performed with tailfindr disclosed in reference [23], Specifically, the raw pod5 files can be converted into fast5 files before rebasecalling with guppy 6.3.4 with ‘-fast5-out‘ option. Then, the tailfindr version 1.4 can be executed on the generated fast5 files and incorporated the per read poly A tail estimations as provided in the csv file into our analysis.

The data are re-basecalled with dorado 0.4.x with the -estimate-poly-a option.

Figure 19 presents an empirical confidence score Cumulative Distribution Function (CDF) for each barcode. Dashed curves represent Barcodes 11 and 12, which were not part of the experimental design, illustrating their distinctively lower predictive confidence compared to other barcodes, validating our model’s discriminating ability.

13.4 RNA m6A modification status detection

To evaluate differential m6A modification rate of mRNA after SARS-CoV-2 infection as well as m6A status of packaged SARS-CoV-2 RNA base calling is performed with m6A basecaller disclosed in reference [24] - a custom guppy model that contains methylation probabilities in its output.

S14 Live barcode balancing

Live barcode balancing in direct RNA sequencing has the prospect to substantially increase yield for multiplexed sequencing runs, as well as to increase pore half-life when some, but not all samples that are sequenced have high likelihood of blocking pores. It therefore can reduce labour and resource costs, making dRNA sequencing a more viable solution to prevalent questions related to epitranscriptomics.

Of note is that adaptive barcode balancing will not alter the relative abundance of barcodes, but it will result in higher turnover of reads, thus leading to an increase of pore occupancy for underrepresented barcodes.

To facilitate live barcode balancing multiple steps need to be performed with high accuracy and yield:

1.) streaming data quickly from the device,

2.) identifying and extracting adapter signals from start until the transition to the polyA tail/RNA,

3.) rapid processing and classification of the adapter signal, and

4.) decision on rejection and if so, sending the command.

For dRNA adaptive barcode sampling to result in substantial enrichment when compared to a standard sequencing run, the time from the read starting to enter the pore to the decision should be as short as possible. Additionally, rejection of RNA reads should not lead to a substantial increase in pore loss.

One possible source of pore loss can be the RNA refolding at the trans side of the pore. To reduce the likelihood of this, reads can be rejected while the poly A tail is still in the pore. To estimate the required turnaround time for this goal for RNA002 the following is calculated: DNA adapter translocation takes approximately 2 seconds, followed by translocation of RNA nucleotides at 70 bases per second. At a poly A tail length of 30 nucleotides, this means optimally the desire to achieve a time of up to 430 ms from poly A start to decision. The adaptive sampling process is made available through the read_until_api, which acts as an interface between the gRPC interface in MinKNOW and custom python code. Matt Loose’s team leveraged this API to develop the first iteration of DNA adaptive sampling [25], adding a critical component, an accumulating cache, to the read until api. This cache is critical because the near whole adapter signal is required at once (up until the poly A signal starts) to be able to segment and classify properly, and this signal is highly likely to stretch multiple individual chunks.

14.1 Reducing the Chunk Size

Chunks are blocks of signal that are received through the read_until_api. The size of chunks is determined in the sequencing protocol toml file, specifically by the break_reads_second parameter. By default chunks consist of 1 second of signal. However, in the attempt to achieve low latency, reducing the chunk size is a critical step. Indeed, reducing the break_read_seconds parameter to 0.1 second resulted in receiving chunks of size 300 samples, which is critical to enable rejection during or shortly after the poly A tail has passed the pore.

However, this change also resulted in nearly no reads being generated by MinKNOW. Investigating this issue further, the read_classification parameters as well as the channel_states are identified as the two key steps that required tuning. The reason is as follows: first, new reads are detected by an open pore event (which is not bound by the chunk size). Then, each chunk gets classified according to the boundaries specified in the read_classification parameters (duration means duration since the read started). The channel state is then determined by regex matching of patterns of chunk classifications, those are defined in the channel_states.toml file.

When decreasing the chunk size, statistics such as median and range (90%-10% percentiles) become less consistent, and these can be identified as the parameters that required the most tuning. Critically, the range was previously used to detect stalled reads, as it was unlikely (apart from the poly A tail) that the signal stayed within a small range for 1 second. In contrast, it may happen relatively frequently that the signal may stay constant in a 0.1 second window-thus it is decided to decrease the minimum range required. Upon replaying the simulation, these changes not only recovered the output of the initial sequencing run, but increased it by approximately 25%.

However, this change in isolation would likely lead to stalls not being detected, thus severely affecting overall flow cell yield. To address this problem a read_classification state "stalled" is added so that chunks where the range of the signal is near 0 should be classified in. However, as brief stalls are expected to occur within a read, the channel_state pattern of strand is allowed to contain up to 3 ‘stalled’ chunks in between two ’strand’ chunks. Only when at least 10 chunks are consecutively classified as ‘stalled’ the channel_state is changed to ‘blocked’, which will then initiate the standard channel unblocking procedure.

14.2 Read Until API changes

Having achieved a 100 ms chunk size, the next focus lies on the read until API. As previously mentioned, the accumulating cache that appends a chunk raw signal (instead of overwriting it) is critical, especially as the individual chunk sizes are decreased. However, it should be noticed that for some reads, the start of the signal did not match the start of the new read, and this may result in incorrect poly A detection, segmentation or classification. The source of this is identified to come from the accumulating cache filtering incoming chunks for their classification by default, and only accepting adapter- or strand-classified chunks. This filter should be removed during chunk accumulation, but instead a filter can be added in the get_read_chunks function that in the present case will only return accumulated chunks that contain at least one adapter classification. In addition, a max length to the accumulated raw data can be added to ensure that raw signal of long reads or inactive channels does not grow out of bounds.

14.3 Fast Poly A Detection

To ensure that all signals that are submitted for WarpDemuX classification according to the present invention contain the full adapter signal, poly A adapter signals are identified with a fast heuristic based primarily on moving mean and variance. Only if a position is found, and the median signal left of that position is sufficiently lower than on the right side of the position, does the raw data get submitted into the WarpDemuX classification queue. This allows to detect poly A tails in [XX %] of reads with mean qscores of 7 or above. 14.4 Live WarpDemuX Demultiplexing

The process of WarpDemuX demultiplexing according to the present invention contains of three distinct steps:

- Segmentation of signal into discrete chunks,

- dynamic time warping of the discretized signal against known signals, and finally

- classification by an SVM based on DTW distances to the known barcodes.

14.5 Decision Strategies

The last step after classifying raw signal of adapters is to decide on the fate of the read. The simplest approach would be rejection based on relative adapter occurrence. However, this has drawbacks with regards to unligated adapter sequences, which when evaluating raw data may be present in 50% of adapter events, and potentially at varying abundance between different samples.

Instead, two alternative strategies are proposed, that both employ additional information present in pod5 files that are generated during the sequencing run. In both strategies first adapters are classified and their read ids are stored. Then search in the pod5 files is performed as they are generated during the run to identify matching read ids, counting all matches for each barcode. As adapter-only reads are not written out to pod5 files this ensures that only on true RNA-containing reads are normalized. Intriguingly, the output of this analysis can subsequently be used to estimate the fraction of adapter-only reads in the library, helping to troubleshoot library preparation protocols. In the second strategy the same lookup in pod5 files is performed, but the number of detected events is integrated by MinKNOW. These events correspond closely to the number of bases in the read (with approximately 100 additional events introduced by the adapter), so incorporating this information allows us to balance different samples not only by read count, but by base count. Finally, by monitoring the end reason of reads in pod5 files, barcodes could be identified that resulted in higher than expected rates of pore blockage, and flag those for rejection.

Discussion

Direct RNA sequencing, with its capacity to retain native RNA base modifications, has become indispensable in unraveling the intricacies of the epitranscriptome. Although the technology brings remarkable capabilities, the challenge of effective demultiplexing in dRNA-seq has stymied the expansion of its potential applications. The method according to the present invention, called WarpDemuX, greatly enhances the efficiency and efficiency of dRNA-seq multiplexing. WarpDemuX distinguishes itself through three key features: scalability, data efficiency and computational efficiency. WarpDemuX outperforms by effective demultiplexing from four barcodes to at least twelve, with enhanced performance or yield. Furthermore, it achieves this with significantly less parameters and data than competing, data-intensive, deep learning approach DeePlexiCon. Finally, WarpDemuX provides significant improvements in run time for CPU-based settings. Altogether, WarpDemuX represents an efficient demultiplexing solution that is suitable for low-resource settings.

Excitingly, WarpDemux is poised to further improve the direct RNA sequencing method itself by enabling the systematic evaluation of different library preparation strategies all on a singular flow cell - thus eliminating batch effects from different flow cells.

The standout performance of WarpDemuX can be attributed to improvements across the entire demultiplexing pipeline, from optimized barcodes to improved signal preprocessing and more temporally robust classification: WarpDemuX introduces a DTWD kernel approach for barcode classification that improves performance even on barcodes that are not explicitly optimized for DTWD-based separability. By using the optimized barcodes however, the predictive confidence is improved even further, leading to higher yield. The classification efficiency is further augmented by enhanced signal preprocessing. This preprocessing reduces data noise, mitigates spurious temporal correlations from varying dwell times and reduces the size of the data. Consequently, it paves the way for a leaner classifier that requires minimal training data.

The improved accuracy of WarpDemuX over DeePlexiCon can be explained by their different sensitivity to temporal perturbation of the signal: A comparative analysis of prior art DeePlexiCon’s and WarpDemuX’s signal similarity quantification methods under temporal variation is included in Supplementary Chapter 12. While the DTWD is resilient to temporal perturbations, the GASF feature transformation employed in DeePlexiCon exhibits vulnerability to such perturbations. Given the inherent dwell time variability in dRNA-seq data, this vulnerability results in inconsistent GASF signal-images across barcodes which largely complicates the classification task. In contrast, DTWD’s resilience to temporal variation further underscores the aptness of DTWD as a measure of signal similarity in the context of dRNA-seq signal analysis.

Notably, WarpDemuX signal preprocessing hinges on first accurately detecting the start of the polyA tail in the raw signal, which becomes increasingly challenging as the polyA tail length decreases. Despite this, WarpDemuX also achieved high demultiplexing performance on RNA with short polyA tails (dataset 6, see Supplementary Chapter 11.2). This indicates that WarpDemuX can also be employed in experimental settings where enzymatic polyadenylation is not feasible, such as in the studies where the length of the polyA tail is of interest.

High accuracy direct RNA multiplexing, as showcased by WarpDemuX, will allow for more cost- and resource-effective profiling of RNA. An important feature of the approach according to the present invention is that it operates independently of basecalling. Thus, WarpDemuX is uninfluenced by basecaller updates and can, in principle, be employed live during sequencing. This real-time functionality opens doors to quantifying the number of reads per barcode near instantaneously, allowing for dynamic modulation during sequencing runs, even in low resource settings. For instance, reads from samples observed to generate high numbers of blocked pores could be actively rejected to extend flow cell lifespan and total yield. Additionally, such a live process could also be utilized to balance read numbers or total yield between barcodes, thereby ensuring a more consistent and uniform sequencing yield across samples. Notably, reads can be rejected 60 ms (TBD) after the poly A tail begins, thus virtually all poly-adenylated RNA can be rejected based on their barcode before the variable sequence starts to translocate through the pore, reducing the likelihood of pore blockage.

However, the current implementation does not show data enrichment close to the theoretical estimations. It appears that while individual rejections do not have an immediate catastrophic effect, they carry a likelihood of resulting in pore blockage. This appears to be recoverable in most cases, but in rare instances can lead to pore loss. In general, it reduces the throughput, thus limiting the achievable enrichment. Tuning the reject duration may increase rejection efficiency and reduce pore blockage.

The difficulty in detecting and subsequently unblocking pores that sequence RNA is amplified by the decreased chunk size in our sequencing config, which was required to achieve the low latency rejection. This is a limitation of the current implementation of the read until api, as signal data is only returned in full signal chunks, which are also used as the basis unit to classify signal into reads vs noise. A different implementation strategy that would allow retrieval of signal data independent of MinKNOW chunk sizes could circumvent the block detection issue, further decrease latency as well as reduce computational demands on MinKNOW, but would likely require interacting with MinKNOW beyond its gRPC interface. This general live barcode balancing strategy is also amendable to DNA sequencing, as the barcode there is also located at the start of the signal. Compared to the current latency of around 1 second (default chunk size), a WarpDemuX-like approach with improved read until api could reduce it to 0.3 seconds, thus reducing the time wasted on unwanted sequences by 70%, in turn increasing enrichment of on-target barcodes.

Furthermore, Oxford Nanopore Technologies recently unveiled the successor to the SQK- RNA002 kit: the SQK-RNA004. This newer version boasts an updated motor protein that translocates at a higher pace and an RNA-sequencing-specific pore, presumably producing an even higher quality signal. Notably, the general library preparation approach remains consistent between the two versions. Given this, it is expected that the DNA barcodes will seamlessly integrate with the SQK-RNA004 kit, extending the relevance and applicability of WarpDemuX. While the necessity to train a new model is anticipated to demultiplex SQK- RNA004 data, it could be shown that the associated demands in terms of data, time, and resources are minimal, making this process feasible and straightforward.

While the presented study underlines the strengths of WarpDemuX method according to the present invention, potential future developments could further amplify its applicability. Optimizing barcode signal extraction, signal segmentation and normalization, for instance, could enhance performance even further. Given that adapter extraction and segmentation currently account for 90% of run time (see Supplementary Chapter 11.1), computational advancements in these areas would also substantially improve run time. Additionally, while WarpDemuX has showcased its potential to scale to even more barcodes, it could be shown that the exact barcodes used will impact the model’s quality, most profoundly in terms of predictive confidence and therefore yield. Moreover, the possibility of reaching a saturation point in creating distinct signal patterns could set an upper limit on the number of available barcodes. Systematic efforts to test the limits of barcode design and classification will be valuable.

Additionally, as the number of DTWD calculations required for classification scales with the number of barcode classes, the use of learned DTW kernels rather than full signal-to-signal comparisons could be explored to further speed up the method.

In conclusion, WarpDemuX stands out as a promising solution in the landscape of dRNA- seq demultiplexing. In juxtaposition with the existing alternative, DeePlexiCon, the approach according to the present invention offers a clearer, noise-resilient interpretation of the underlying barcode signals, ensuring higher fidelity in classification. By effectively leveraging optimized barcodes, tailored signal preprocessing, and an SVM classifier with a DTWD kernel, it sets a new benchmark in the field. As dRNA-seq continues to evolve, tools like WarpDemuX can play a pivotal role in maximizing the benefits of this transformative technology, catalyzing advancements in transcriptomics and epitranscriptom ics research.

References

[1] Daniel R. Garalde et al. “Highly parallel direct RNA sequencing on an array of nanopores”. In: Nature Methods 15.3 (Mar. 2018), pp. 201-206.

[2] Oguzhan Begik, John S. Mattick, and Eva Maria Novoa. “Exploring the epitranscriptome by native RNA sequencing”. In: RNA (New York, N.Y.) 28.11 (Nov. 2022), pp. 1430-1439.

[3] Miten Jain et al. “Advances in nanopore direct RNA sequencing”. In: Nature Methods 19.10 (Oct. 2022), pp. 1160-1164.

[4] Aleksandra A Kolodziejczyk et al. “The technology and biology of single-cell RNA sequencing”. In: Molecular cell 58.4 (2015), pp. 610-620.

[5] Kevin Lebrigand et al. “High throughput error corrected Nanopore single cell transcriptome sequencing”. In: Nature communications 11.1 (2020), p. 4025.

[6] Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. “Single-cell RNA sequencing technologies and bioinformatics pipelines”. In: Experimental & molecular medicine 50.8 (2018), pp. 1-14.

[7] Wang Liu-Wei et al. “Sequencing accuracy and systematic errors of nanopore direct RNA sequencing”. In: bioRxiv (2023).

[8] Martin A. Smith et al. “Molecular barcoding of native RNAs using nanopore sequencing and deep learning”. In: Genome Research 30.9 (Sept. 2020), pp. 1345- 1353.

[9] Meinard Muller. “Dynamic time warping”. In: Information retrieval for music and motion (2007), pp. 69-84.

[10] Lawrence R Rabiner and Biing-Hwang Juang. Fundamentals of speech recognition. Tsinghua University Press, 1999.

[11] Hoang Anh Dau et al. The UCR Time Series Classification Archive. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/. Oct. 2018.

[12] Wannes Meert et al. DTAIDistance. Oct. 2022. DOI: 10.5281/zenodo.7158824.

[13] Michiaki Hamada et al. “Training alignment parameters for arbitrary sequencers with LAST-TRAIN”. In: Bioinformatics 33.6 (Nov. 2016), pp. 926-928.

[14] Chih-Chung Chang and Chih-Jen Lin. “LIB-SVM: A library for support vector machines”. In: ACM Transactions on Intelligent Systems and Technology 2.3 (Apr. 2011), pp. 1-27. [15] Fabian Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12.85 (2011), pp. 2825-2830.

[16] Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng. “Probability Estimates for Multiclass Classification by Pairwise Coupling”. In: The Journal of Machine Learning Research 5 (Dec. 2004), pp. 975-1005.

[17] Marco Olguin-Nava et al. “Building Blocks of Understanding: Constructing a Reverse Genetics Platform for studying determinants of SARS-CoV-2 replication”. In: bioRxiv (2024). DOI: 10. 1101/2024.02.05.578560.

[18] Loris Bennett, Bernd Melchers, and Boris Proppe. “Curta: A General-purpose High- Performance Computer at ZEDAT, Freie Universitat Berlin”. In: (2020). DOI: 10. 17169 / refubium-26754.

[19] Panu Somervuo et al. “BARCOSEL: a tool for selecting an optimal barcode set for high-throughput sequencing”. In: BMC Bioinformatics 19.1 (July 5, 2018), p. 257.

[20] Patrick Bohn et al. “Nano-DMS-MaP allows isoform-specific RNA structure determination”. In: Nat Methods 20.6 (June 2023), pp. 849-859.

[21] Maximilian Krause. Flongle DirectRNA Library preparation. 2020. URL: https://dx.doi.org/ 0.17504/protocols.io.bcwcixaw (visited on 05/23/2023).

[22] Frangois Petitjean, Alain Ketterlin, and Pierre Gangarski. “A global averaging method for dynamic time warping, with applications to clustering”. In: Pattern Recognition 44.3 (Mar. 2011), pp. 678-693.

[23] Maximilian Krause et al. “tailfindr: alignment-free poly(A) length measurement for Oxford Nanopore RNA and DNA sequencing”, eng. In: RNA 25.10 (2019). PMID: 31266821 , pp. 1229- 1241. ISSN: 1469-9001 (Electronic), 1355-8382 (Print), 1355- 8382 (Linking). DOI: 10.1261/rna. 071332.119.

[24] Sonia Cruciani et al. “De novo basecalling of m6A modifications at single molecule and single nucleotide resolution”. In: bioRxiv (2023). DOI: 10.1101/2023.11.13. 566801. eprint: https://www.biorxiv.Org/content/early/2023/11/13/2023.11.13. 566801.full.pdf. URL: https://www.biorxiv.org/content/early/ 2023/11/13/2023.11.13. 566801.

Claims

Claims:

1. Method for analyzing molecules of interest by use of an adaptor attached to the molecule of interest, said adaptor comprises a barcode molecule, wherein the barcode molecule is one instance of a set of defined sequences of molecules, characterized by: - temporarily aligning an electrical signal related to an unknown barcode molecule derived by electrically measuring at least of the part of the molecule of interest comprising the barcode molecule and a set of respective signals for known barcode molecules stored in a data memory, and

2. Method according to claim 1 , characterized by temporarily aligning the electrical signal with the step of partitioning the electrical signal derived by electrically measuring at least of the part of a molecule into segments and determining the similarity between electrical signals for the segments.

3. Method according to claim 1 or 2, characterized by temporarily aligning the electrical signal and predicting the instance of barcode molecule by determining the dynamic time warping distances between the electrical signal and the respective signals for known barcode molecules, wherein the dynamic time warping distance is a measure of the similarity between the temporarily aligned electrical signal related to the measured unknown barcode molecule and an related electrical signal for a known barcode molecule of the set of defined sequences of molecules.

4. Method according to claim 3, characterized by determining the value for the dynamic time warping distance by temporal aligning the electrical signals or a substitute of signals derived from the electrical signal obtained by electrically measuring the barcode molecule against a known electrical signal derived by electrical measuring at least of a respective part of a known barcode molecule of a training set and determining the distance between the pair of temporally aligned electrical signals.

5. Method according to one of claims 1 to 4, characterized by stopping the measuring of the molecule of interest based on the predicted instance of barcode molecule and/or an electrical signal derived by electrically measuring at least of the part of the molecule of interest related to a sequence of molecules.

6. Method according to one of claims 1 to 5, characterized in, that the set of known barcode molecules used for predicting the instance of barcode molecule are selected as an optimized diverse subset of a group of barcode molecules having a similarity of the related electrical signal derived in a sequencing process of the barcode molecule and an targeted signal above a threshold value for the similarity and having a dissimilarity between the electrical signal of the unrelated barcode molecules of the set of barcode molecules above a threshold value for the dissimilarity.

7. Method according to one of the preceding claims, characterized in processing hybrid molecules, consisting of DNA barcode molecules ligated to RNA molecules.

8. Method according to one of the preceding claims, characterized by measuring the electrical signal in a sequencing process of RNA, DNA, proteins or polymers, in particular polypeptides.

9. Method according to one of the preceding claims, characterized by measuring the electrical signal in a sequencing process of at least a part of the molecule of interest comprising the barcode molecule to detect the presence of the molecule of interest in a probe and/or to measure the quantity of the molecule of interest in the probe.

10. Method according to claim 9, characterized by measuring the electrical signal in the sequencing process of the molecule of interest by measuring electric current or voltage fluctuations in a sensor structure comprising a probe with the molecule of interest and an adaptor attached to the molecule of interest.

11. Method according to claim 9 or 10, characterized by electrically measuring at least of the part of the molecule of interest in a nanopore structure of a sequencing unit.

12. Method for automatically selecting electrical signals related to a set of known barcode molecules and storing the selected electrical signals for the set for use in the method of analyzing molecules of interest according to one or the preceding claims, wherein the analyzed molecules of interest comprising an adaptor attached to the molecule of interest, said adaptor comprising a barcode molecule, characterized by processing the electrical signals in a data processing unit for selecting an optimized, diverse subset of a group of candidate barcode molecules with the steps of:

- selecting the optimized, diverse subset of barcode molecules from those high- score candidate barcode molecules having a dissimilarity between the electrical signals of the unrelated barcode molecules of the selected set of high-score candidate barcode molecules above a threshold value for the dissimilarity.

13. Method according to claim 12, characterized by determining the value of the dynamic time warping distances between the electrical signal obtained in a sequencing process of a barcode molecule and a set of electrical signals for known barcode molecules and training a classifier with the determined dynamic time warping distances to determine the instance of barcode molecule with the electrical signal derived by the sequencing process of the barcode molecule.

14. Method according to one of the preceding claims, characterized by determining the part of the electrical signal related to the adaptor attached to the molecule of interest in the electrical signal extracted in a sequencing process of at least of a part of the molecule by use of the signal features of the electrical signal.

15. Method according to one of the preceding claims, characterized by preprocessing the electric signal measured for an unknown barcode molecule by segmentation of the electrical signal such that for each segment of the electrical signal the variance contained within the segment is decreased, followed by substitution of signals contained in a segment with a summarizing value, including but not limited to the arithmetic mean or median.

16. Method according to one of the preceding claims characterized in using the predicted instance of barcode molecule for multiplexed sequencing of molecules, for quantifying the amount of molecule attached to a specific barcode, for use in diagnostics and/or for control of sequencing devices for multiplexed sequencing of molecules.

17. Signal processing unit comprising a data memory and an input for receiving electrical signals derived from electrically measuring of molecules, characterized in, that the signal processing unit is arranged for automatically processing the method according to one of the preceding claims.

18. Computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of one of claims 1 to 16.

*****