
WO2025238371A1 - Polymer analysis using transformer neural network - Google Patents

Polymer analysis using transformer neural network

Info

Publication number
WO2025238371A1
Authority
WO
WIPO (PCT)
Prior art keywords
polymer
sequence
computer
signal
implemented method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/GB2025/051061
Other languages
French (fr)
Inventor
Michael Vella
Samuel George DAVIS
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oxford Nanopore Technologies PLC
Original Assignee
Oxford Nanopore Technologies PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oxford Nanopore Technologies PLC
Publication of WO2025238371A1
Legal status: Pending

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 - Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • C - CHEMISTRY; METALLURGY
    • C12 - BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q - MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 - Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 - Methods for sequencing
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01N - INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 - Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 - Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/483 - Physical analysis of biological material
    • G01N33/487 - Physical analysis of biological material of liquid biological material
    • G01N33/48707 - Physical analysis of biological material of liquid biological material by electrical means
    • G01N33/48721 - Investigating individual macromolecules, e.g. by translocation through nanopores
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 - Supervised data analysis

Definitions

  • the present disclosure relates to estimating a sequence of polymer units of a polymer.
  • the disclosure has particular, though not exclusive, relevance to basecalling a polynucleotide.
  • Various biochemical analysis systems provide measurements of polymer units for the purpose of determining the sequence.
  • One such type of analysis system uses a nanopore.
  • Biochemical analysis systems that use a nanopore have been the subject of much recent development.
  • successive measurements of a polymer are taken from a sensor element comprising a nanopore during translocation of the polymer through the nanopore.
  • Some property of the system depends on the polymer units in the nanopore, and measurements of that property are taken.
  • Biochemical analysis systems using nanopores can provide long continuous reads of polymers, for example in the case of polynucleotides ranging in length from 20 to >1Mb.
  • the data gathered in this way comprises measurements, such as measurements of ion current, where the translocation of the polymer through the nanopore results in a change in the measured property indicative of the sequence of polymer units.
  • Some biochemical analysis systems use machine learning models to estimate a sequence of polymer units from data comprising sequences of measurements collected as described above.
  • Suitable machine learning models include Hidden Markov Models (HMMs), neural networks including recurrent neural network (RNN) models, such as those comprising long short-term memory (LSTM) units, and combinations of the above.
  • a computer-implemented method of estimating a sequence of polymer units of a polymer comprising measurements of a signal generated during translocation of the polymer with respect to a nanopore, a data processing system comprising means for carrying out the method, and a computer program product (such as one or more non-transitory computer-readable storage media) comprising instructions which, when executed by a computer, cause the computer to carry out the method.
  • the method includes determining a sequence of hidden states associated with the signal using a transformer encoder network, and processing the sequence of hidden states using a network head to estimate a sequence of polymer units of the polymer.
  • the processing by the network head includes generating a respective array of values for each hidden state indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, where the set of symbols includes symbols associated with respective types of canonical polymer unit, and analysing the generated arrays of values to estimate the sequence of polymer units of the polymer.
  • the inventors have found that using the transformer encoder network and network head enables more accurate sequencing or basecalling at a given number of floating point operations per second (FLOPS) when compared with other types of machine learning model for processing sequential data. For example, the inventors were able to achieve around 1.5 Q higher accuracy than a state-of-the-art basecaller employing a long short-term memory (LSTM) architecture.
  • the transformer encoder network may use sliding window self-attention, whereby each hidden state depends on a respective portion of the signal.
  • the respective portion may be a contiguous portion or may include gaps, for example in the case of dilated sliding window self-attention or cross-attention between concatenated signal portions as described hereinafter.
  • the memory footprint of the method becomes independent of the length of the signal while the compute cost scales linearly with the length of the signal, in contrast with the quadratic scaling exhibited by global self-attention mechanisms.
  • This may enable very long polymers (for example, more than 100,000, more than 1,000,000, or more than 10,000,000 polymer units) to be sequenced or basecalled in a single pass, potentially in substantially real-time as the polymer is translocated with respect to the nanopore.
  • the performance of methods based on recurrent neural networks such as LSTMs tends to decline with very long sequence lengths, for example sequences with more than 100,000 polymer units.
  • the accuracy of the method may begin to saturate beyond a certain window size, for example at a window size of around 64, 128, 256, or 512, meaning that most of the benefits of the transformer encoder architecture may be reaped using a window size of between 64 and 512, without being subject to the quadratic scaling of compute costs (this may be in contrast with for example language modelling applications of transformers, in which long-range relationships between tokens may be critical and therefore global attention may be needed to achieve peak performance).
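  • To make the sliding-window mechanism concrete, below is a minimal single-head sketch in PyTorch (a framework the description later names among suitable libraries). The function name, tensor shapes, and window size are illustrative assumptions rather than the patent's implementation; the point is that each query attends only to a fixed window of keys, so memory grows with n × window rather than n².

```python
import math
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    # q, k, v: (n, d). Each query attends to `window` keys centred on it,
    # so memory is O(n * window) rather than O(n^2). `window` assumed odd.
    n, d = q.shape
    half = window // 2
    # Pad keys/values so every position has a full window of neighbours.
    k_pad = F.pad(k, (0, 0, half, half))                  # (n + window - 1, d)
    v_pad = F.pad(v, (0, 0, half, half))
    # Local window for each position: (n, window, d).
    k_win = k_pad.unfold(0, window, 1).transpose(1, 2)
    v_win = v_pad.unfold(0, window, 1).transpose(1, 2)
    # Scaled dot-product scores against the local window only: (n, window).
    scores = torch.einsum("nd,nwd->nw", q, k_win) / math.sqrt(d)
    # Mask window slots that fall outside the sequence (the zero padding).
    pos = torch.arange(n).unsqueeze(1) + torch.arange(window) - half
    scores = scores.masked_fill((pos < 0) | (pos >= n), float("-inf"))
    return torch.einsum("nw,nwd->nd", scores.softmax(dim=-1), v_win)

x = torch.randn(1000, 64)                     # 1,000 positions, model dim 64
out = sliding_window_attention(x, x, x, 129)  # window within the 64-512 range
print(out.shape)                              # torch.Size([1000, 64])
```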
  • Processing the sequence of feature vectors may include generating a sequence of intermediate states having the second frequency as an output of the transformer encoder network, and upsampling the sequence of intermediate states to determine the sequence of hidden states.
  • the sequence of hidden states may have a third frequency that is no less than the average translocation rate.
  • the frequency may nevertheless be at least partially restored upon generating the hidden states, otherwise it may not be possible for the network head to resolve all of the polymer units of the polynucleotide.
  • the third frequency may for example be between 1.5 and 5 times the average translocation rate.
  • the upsampling of the sequence of intermediate states may be performed for example using a linear upsampling layer, or a convolutional upsampling layer, or an attention upsampling layer.
  • the linear upsampling may achieve at least comparable performance with other upsampling methods, while being less computationally expensive and not significantly increasing the number of trainable parameters of the machine learning model.
  • Analysing the generated arrays of values may include applying a search algorithm to the generated arrays of values.
  • the set of symbols may include a blank symbol
  • analysing the generated arrays of values may include determining an intermediate sequence of symbols based on the generated arrays of values, and collapsing the intermediate sequence of symbols by merging repeated symbols and then removing instances of the blank symbol.
  • the search algorithm may correspond to a connectionist temporal classification (CTC) topology.
  • Providing a blank symbol and an algorithm for collapsing the intermediate sequence of symbols may enable the sequence of polymer units to be estimated without relying on a known alignment between measurements in the signal and the polymer units in the polymer. In this way, configuring the apparatus may be greatly simplified while also providing robustness against variations in translocation rate.
  • the network head may be configured (for example, trained) to generate the arrays of values as a conditional random field (CRF), which advantageously may allow for conditional dependence of the arrays of values which may result in improved accuracy compared with methods that assume conditional independence.
  • the resulting network head may be referred to as a CTC-CRF head, though it will be appreciated that other types of network head may be used, such as a CTC head, an autoregressive transformer decoder, or any other form of network head or decoder that can estimate or classify polymer units from the series of hidden states.
  • the polymer may be a first polymer
  • the signal may further comprise measurements of a second polymer by the sensor element during translocation of the second polymer with respect to the nanopore, the first polymer and the second polymer being a complementary pair.
  • the transformer encoder network may include an attention mechanism arranged to capture relationships between measurements associated with mutually corresponding segments of the first polymer and the second polymer.
  • the signal may include a first signal portion comprising measurements of the first polymer concatenated with a second signal portion comprising measurements of the second polymer.
  • the attention mechanism may then be a global self-attention mechanism which learns to cross-attend between corresponding parts of the first and second signal portions, which advantageously may obviate the need to align the first and second signal portions.
  • the attention mechanism may use two local windows respectively centred on (approximately) corresponding parts of the first and second signal portions.
  • This approach may result in a more scalable compute cost as the size of the sliding windows can remain constant, but may need at least partial alignment of the signal portions.
  • the transformer encoder may include between 8 and 64 transformer encoder blocks.
  • the transformer encoder may include between 12 and 20 transformer encoder blocks, which the inventors have found to provide a good balance in the trade-off between accuracy and speed.
  • the polymer may be a polynucleotide such as DNA or RNA, and the polymer units may be nucleotides.
  • the method may include providing a signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore, the signal being associated with a target sequence of polymer units, determining a sequence of hidden states associated with the signal using the transformer encoder network, processing the sequence of hidden states using the network head to generate a respective array of values indicating an estimated likeliness of each hidden state being associated with each of a set of symbols comprising symbols associated with respective types of canonical polymer unit, and updating parameters of the neural network based on a loss function depending on the generated arrays of values and the target sequence of polymer units.
  • the loss function may include at least one of a connectionist temporal classification (CTC) loss and a connectionist temporal classification-conditional random field (CTC-CRF) loss, which may enable training on signals where alignment between the measurements and the polymer units is not known a priori.
  • FIG. 3 is a flow diagram representing a method of analysing a signal according to an aspect of the present disclosure.
  • Fig. 4 is a flow diagram representing a method of training a neural network according to an aspect of the present disclosure.
  • Figs. 5A and 5B illustrate methods of analysing a composite signal comprising measurements of complementary strands of a polymer.
  • the canonical bases are adenine (A), cytosine (C), guanine (G), and thymine (T).
  • ribonucleic acid (RNA) comprises the canonical bases A, C and G, with uracil (U) in place of thymine.
  • a nucleotide may also lack a nucleobase and a sugar.
  • the nucleotide may be modified such as 5mC, 5hmC and N1-methylpseudouridine. Machine learning methods to detect modified bases are disclosed in WO23094806.
  • The apparatus 100 comprises a sensor element 104 comprising a nanopore 106 situated in a membrane 108, and a sensor device 110. Although only a single nanopore 106 is shown in Fig. 1, the apparatus 100 may employ many nanopores, for example arranged in an array, to provide parallelised collection of information.
  • the measurement system may comprise at least 10 nanopores, at least 100 nanopores, or at least 1,000 nanopores.
  • The nanopore 106 is a pore or hole, typically having a size of the order of nanometres, which may allow the passage of polymers therethrough.
  • the nanopore may be a protein pore or a solid state pore.
  • the nanopore 106 may, for example, be a protein pore such as a polypeptide or a collection of polypeptides. Alternatively, the nanopore may be composed of any other such molecules that allow the nanopore 106 to function as an aperture in the membrane 108.
  • the biological pore may be a transmembrane protein pore.
  • Transmembrane protein pores for use in accordance with the invention can be derived from β-barrel pores or α-helix bundle pores. β-barrel pores comprise a barrel or channel that is formed from β-strands.
  • Suitable α-helix bundle pores include, but are not limited to, inner membrane proteins and α outer membrane proteins, such as WZA and ClyA toxin.
  • the transmembrane pore may be derived from Msp or from α-hemolysin (α-HL).
  • the transmembrane pore may be derived from lysenin.
  • Suitable pores derived from lysenin are disclosed in WO 2013/153359.
  • Suitable pores derived from MspA are disclosed in WO-2012/107778.
  • the pore may be derived from CsgG, such as disclosed in WO-2016/034591 and WO2019/002893, both herein incorporated by reference in their entirety.
  • the pore may be a DNA origami pore.
  • a protein pore may be a naturally occurring pore or may be a mutant pore.
  • a protein pore may be inserted into an amphiphilic layer such as a biological membrane, for example a lipid bilayer.
  • An amphiphilic layer is a layer formed from amphiphilic molecules, such as phospholipids, which have both hydrophilic and lipophilic properties.
  • the amphiphilic layer may be a monolayer or a bilayer.
  • the amphiphilic layer may be a co-block polymer such as disclosed in Gonzalez-Perez et al., Langmuir, 2009, 25, 10447-10450, WO2014/064444, or US6723814 herein incorporated by reference in its entirety.
  • Such a solid-state layer is not of biological origin.
  • a solid state layer is not derived from or isolated from a biological environment such as an organism or cell, or a synthetically manufactured version of a biologically available structure.
  • Solid state layers can be formed from both organic and inorganic materials including, but not limited to, microelectronic materials, insulating materials such as Si3N4, Al2O3, and SiO, organic and inorganic polymers such as polyamide, plastics such as Teflon® or elastomers such as two-component addition-cure silicone rubber, and glasses.
  • the solid state layer may be formed from graphene. Suitable graphene layers are disclosed in WO-2009/035647, WO-2011/046706 or WO-2012/138357.
  • the property that is measured may be the ion current flowing through a nanopore.
  • These and other electrical properties may be measured using standard single channel recording equipment as described in Stoddart D et al., Proc Natl Acad Sci 2009;106(19):7702-7, Lieberman KR et al., J Am Chem Soc. 2010;132(50):17961-72, and WO-2000/28312.
  • measurements of electrical properties may be made using a multi-channel system, for example as described in WO-2009/077734, WO-2011/067559 or WO-2014/064443.
  • the property that is measured by the sensor device 110 may not necessarily be ion current.
  • the property may be a transmembrane current, such as ion current flow through a nanopore.
  • the ion current may typically be the DC ion current, although in principle an alternative is to use the AC current flow (i.e. the magnitude of the AC current flowing under application of an AC voltage).
  • Ionic solutions may be provided on either side of the membrane 108 or solid state layer, which ionic solutions may be present in respective compartments.
  • a sample containing the polymer analyte of interest may be added to one side of the membrane and allowed to move with respect to the nanopore, for example under a potential difference or chemical gradient.
  • the measurement signal may be derived during the movement of the polymer 102 with respect to the nanopore 106, for example taken during translocation of the polymer 102 through the nanopore 106.
  • the polymer may partially translocate the nanopore 106.
  • the rate of translocation can be controlled by a polymer binding moiety.
  • the moiety can move the polymer 102 through the nanopore with or against an applied field.
  • the moiety can be a molecular motor, using, for example, enzymatic activity in the case where the moiety is an enzyme, or can act as a molecular brake.
  • where the polymer is a polynucleotide, a number of methods have been proposed for controlling the rate of translocation, including the use of polynucleotide binding enzymes.
  • Suitable enzymes for controlling the rate of translocation of polynucleotides include, but are not limited to, polymerases, helicases, exonucleases, single stranded and double stranded binding proteins, and topoisomerases, such as gyrases.
  • the helicase may be any of the helicases, modified helicases or helicase constructs disclosed in WO2013/057495, WO 2013/098562, WO2013098561, WO 2014/013259; WO 2014/013262 and WO 2014013260.
  • the helicase may be added to the polynucleotide during sample preparation and stalled by one or more spacers as disclosed in WO2014135838.
  • moieties that interact with that polymer type can be used.
  • the polymer interacting moiety may be any disclosed in WO-2010/086603, WO-2012/107778, and Lieberman KR et al., J Am Chem Soc. 2010;132(50):17961-72, and for voltage gated schemes (Luan B et al., Phys Rev Lett. 2010;104(23):238103).
  • the leader sequence may form part of a Y adaptor typically comprising (a) a double stranded region and (b) a single stranded region or a region that is not complementary at the other end.
  • Leader sequences and Y adaptors suitable for use are disclosed for example in WO2017149316.
  • the effective surface concentration at the membrane surface can be enhanced by coupling the polynucleotide to the membrane. Suitable coupling moieties are disclosed for example in WO2017149316 and WO12164270.
  • the polymer binding moiety can be used in a number of ways to control the polymer motion. The moiety can move the polymer through the nanopore with or against the applied field.
  • the translocation may occur under an applied potential which may control the translocation.
  • the binding enzyme is typically held against the cis or trans opening of the nanopore during translocation of the polynucleotide through the nanopore under an applied potential.
  • The electronic circuit may be an electronic circuit as disclosed in WO2016059427, incorporated herein by reference in its entirety.
  • the electronic circuit may be arranged to control the application of bias voltages across each sensor element of the sensor device 110. During normal operation, the bias voltage is selected to enable translocation of a polymer through the pore of a sensor element. Such a bias voltage may typically be of a level up to -200 mV.
  • the bias voltage supplied by the electronic circuit may also be selected so that it is sufficient to eject the translocating polymer from the pore.
  • the sensor element By causing the electronic circuit to supply such a bias voltage, the sensor element is operable to eject a polymer that is translocating through the pore.
  • the bias voltage is typically a reverse bias, although that is not always essential.
  • the electronic circuit may be connected to electrodes on either side of the membrane 108.
  • the electronic circuit controls the application of bias voltages to generate a bias between the electrodes to control translocation of the polymer 102 as described above.
  • exonucleases that act progressively or processively on double stranded DNA can be used on the cis side of the nanopore 106 to feed the remaining single strand through under an applied potential or the trans side under a reverse potential.
  • a helicase that unwinds the double stranded DNA can also be used in a similar manner.
  • Some sequencing applications require strand translocation against an applied potential, in which case the DNA must first be “caught” by the enzyme under a reverse or no potential. With the potential then switched back following binding, the strand will pass cis to trans through the pore and be held in an extended conformation by the current flow.
  • the single strand DNA exonucleases or single strand DNA dependent polymerases can act as molecular motors to pull the recently translocated single strand back through the pore in a controlled stepwise manner, trans to cis, against the applied potential.
  • the data processing system 114 may be a dedicated device for sequencing polymers, such as a device produced by Oxford Nanopore Technologies (RTM).
  • the data processing system 114 includes one or more processors 116 and memory 118.
  • the data processing system 114 includes a power supply 120, which may include a mains power supply, a battery, a solar power supply, and/or the like, and the data processing system 114 may also include one or more interface devices 122.
  • the interface devices 122 may include input devices such as a keyboard, a touch screen, a touch pad, a mouse, a microphone, etc. to enable a user to control the data processing system 114, along with output devices such as a display, a speaker, etc.
  • the interface devices 122 may also include network interface devices for example to enable data or information generated by the data processing system 114 to be transmitted to other devices or systems.
  • the processor(s) 116 may include one or more of each of: a central processing unit (CPU), graphics processing unit (GPU), neural processing unit (NPU), neural network accelerator (NNA), tensor processing unit (TPU), application-specific integrated circuit (ASIC), application-specific standard product (ASSP), digital signal processor (DSP), field programmable gate array (FPGA), system-on-chip (SoC), or any other suitable form of integrated circuit.
  • the memory 118 is configured to store measurement signals generated by the sensor device 110, as well as program code and a machine learning model for implementing the methods described hereinafter.
  • the program code may include source code, object code, firmware, etc. in any suitable language.
  • the source code may be written in Python, C, C++, Rust, Julia, etc., and may use specific development frameworks, libraries, or packages, including PyTorch, TensorFlow, Keras, CUDA, etc.
  • Fig. 2 shows an example of a machine learning model for use in analysing a signal 202 such as a measurement signal determined using the apparatus of Fig. 1.
  • the signal 202 may include raw measurements of a polymer by a sensor element as described above.
  • the measurement frequency may be chosen to be high enough to capture a maximal degree of useful information about the polymer units before saturation.
  • the frequency of the signal 202 may for example be in the region of ten or more times higher than the average rate at which the polymer units pass the nanopore. In a particular example, the measurement frequency is 5 kHz and the average rate of polymer units passing the nanopore is 400 units/s, giving on average 12.5 measurements per polymer unit.
  • the signal 202 in this example is processed by one or more neural network layers 204, which in this example include one or more one-dimensional convolutional layers, each applying one or more kernels to generate a respective sequence of feature vectors 206.
  • the kernels may be applied sequentially to the signal 202 resulting in the sequence of feature vectors 206 being generated sequentially, for example in substantially real time as the signal 202 is generated, or alternatively the processing at different kernel positions may be parallelised across multiple processor cores.
  • At least one of the convolutional layers may use a stride greater than one.
  • the width of the kernel and the stride at each layer may be chosen such that portions of the signal processed at successive positions of the kernel overlap, meaning that every measurement in the signal 202 is processed by the kernel in at least one position.
  • the number of feature vectors 206 may be less than the number of measurements in the signal 202, or in other words the downsampled signal may have a lower frequency than the signal 202.
  • Successive convolutional layers may incrementally reduce the frequency, for example by each having a stride greater than one.
  • the frequency of the sequence of feature vectors 206 may be between 5 and 20 times less than the frequency of the signal 202.
  • the frequency of the sequence of feature vectors 206 may be 12 times less than the frequency of the signal 202, which may be achieved using two convolutional layers with stride two and one convolutional layer with stride three (in any order), or alternatively using one convolutional layer with stride four and one convolutional layer with stride three (see the sketch below). It will be appreciated that other combinations may be used to achieve other degrees of downsampling, and stride-one convolutions may also be included to increase the expressiveness of the feature representations encoded by the feature vectors 206.
  • the kernel used in a given convolutional layer may for example have a depth of 2, 4, 8, 16, or any other suitable depth.
  • Other types of layers or operations may additionally, or alternatively, be used in the downsampling process and/or otherwise to generate the sequence of feature vectors 206, such as pooling layers or predetermined filters.
  • the feature vectors 206 may be generated without any degree of downsampling.
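  • The sketch below illustrates the 12x downsampling front-end described above using strides 2, 2, and 3. The channel counts, kernel widths, and activations are illustrative assumptions, not the patent's configuration; the kernel widths exceed the strides so that successive kernel positions overlap and every measurement is processed at least once.

```python
import torch
import torch.nn as nn

# Three strided 1-D convolutions: 2 x 2 x 3 = 12x frequency reduction.
downsample = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=5, stride=2, padding=2),
    nn.SiLU(),
    nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),
    nn.SiLU(),
    nn.Conv1d(128, 512, kernel_size=9, stride=3, padding=4),
    nn.SiLU(),
)

signal = torch.randn(1, 1, 6000)   # e.g. 1.2 s of a 5 kHz ion-current signal
features = downsample(signal)      # (1, 512, 500): 12x fewer time steps
print(features.shape)
```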
  • the sequence of feature vectors 206 is processed by a transformer encoder network comprising one or more transformer encoder blocks 208.
  • the transformer encoder network includes N transformer encoder blocks 208, each including an attention layer 210 and a feed forward network 212.
  • the number N of transformer encoder blocks may be chosen to balance possibly competing requirements such as accuracy, speed, memory footprint, compute, and need for high volumes of training data.
  • N may be between 8 and 64 (inclusive), or between 12 and 20 (inclusive), though it will be appreciated that more or fewer transformer encoder blocks may be appropriate in particular settings, such as when very large or small volumes of training data are available, or when accuracy is paramount irrespective of compute cost, or when sequencing is to be performed on a mobile device and/or substantially in real time.
  • the purpose of the attention layer 210 is to capture relationships between vectors in a sequence.
  • for a given input vector, the attention layer 210 may generate a corresponding output vector that captures relationships between the given input vector and other input vectors within a receptive field of the corresponding output vector.
  • the corresponding output vector may be said to attend to the input vectors within its receptive field.
  • the attention layer 210 of a given transformer encoder block 208 may be configured to process a sequence of n input vectors x_1, ..., x_n to generate a sequence of n output vectors y_1, ..., y_n of dimension d.
  • the matrices W_Q, W_K, W_V are learned weight matrices arranged to transform a given input vector x_j into a query q_j = W_Q x_j, a key k_j = W_K x_j, or a value v_j = W_V x_j respectively, while the constant d_k is the inner dimension of the queries and keys.
  • the output vector y_i may be a sum of the values weighted by the scaled dot-product similarity of the keys and queries: y_i = sum_j softmax_j(q_i · k_j / sqrt(d_k)) v_j.
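  • In code, the formula above reads as follows (a single-head, global-attention sketch; the function name and dimensions are illustrative assumptions):

```python
import math
import torch

def scaled_dot_product_attention(x, W_q, W_k, W_v):
    # x: (n, d) input vectors; W_q, W_k, W_v: (d, d_k) learned weights.
    # Implements y_i = sum_j softmax_j(q_i . k_j / sqrt(d_k)) v_j.
    q, k, v = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values
    scores = q @ k.T / math.sqrt(k.shape[-1])    # (n, n) similarity matrix
    return scores.softmax(dim=-1) @ v            # weighted sum of values

n, d, d_k = 10, 16, 16
x = torch.randn(n, d)
y = scaled_dot_product_attention(x, *(torch.randn(d, d_k) for _ in range(3)))
print(y.shape)  # torch.Size([10, 16])
```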
  • a positional encoding technique may be applied to encode positional information about the respective sequence positions of the input vectors. For example, fixed or learned positional encodings, such as sinusoidal positional encodings, may be added to the input vectors.
  • rotary position embeddings (RoPE) may be applied, in which the queries and keys in the attention layer 210 are rotated, with each position in the sequence receiving a different rotation.
  • the dot-product between queries and keys may diminish for tokens that are distant from one another in the sequence, providing an effective means of encoding relative positions.
  • RoPE has been found to maintain a greater degree of the original token information while still providing the model with an effective way to understand sequence positions.
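  • As a sketch of the rotary scheme described above (under a common parameterisation, not necessarily the patent's), the following rotates each query or key vector in a position-dependent way, pairing dimension i with dimension i + d/2:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (n, d) queries or keys, d even. Position t rotates the pair
    # (x[t, i], x[t, i + d/2]) by angle t * base**(-i / (d/2)), so the
    # q.k dot product comes to depend on the relative offset between tokens.
    n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half) / half)   # per-pair rotation speeds
    angles = torch.arange(n).unsqueeze(1) * freqs  # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)
print(rope(q).shape)  # torch.Size([16, 64])
```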
  • as a further alternative, attention with linear biases (ALiBi) may be used to encode positional information, in which a distance-dependent bias is added to the attention scores.
  • the attention layer 210 may be a multi-headed attention layer which applies multiple attention functions (or “heads”) in parallel, with different learned weight matrices, and combines the resulting vectors to give a final output.
  • the output of the attention layer 210 may be determined by concatenating the outputs of the individual attention functions and projecting them to an output matrix comprising the sequence of output vectors.
  • the attention layer 210 may in principle be able to capture all possible relationships between the input vectors of the transformer encoder block 208, including possible long-range correlations or patterns.
  • a particularly advantageous implementation of the attention layer 210 for the present purpose is sliding window self-attention, in which the receptive field for a given output vector is a contiguous sequence of input vectors of fixed size.
  • Sliding window attention has the advantage that the computational complexity grows only linearly with the length of the sequence, O(n), while the memory footprint remains constant, because in effect the transformer encoder processes only a fixed-length sequence of the input signal at a time.
  • a window of moderate width, such as between 64 and 512, may be sufficient to capture the relevant information for sequencing or basecalling, whilst performance improvements may saturate beyond such window sizes.
  • dilated sliding window self-attention may be used in some blocks/heads, in which gaps are introduced into the windows, with different gaps being used for different blocks/heads to enable different correlations to be captured at different blocks/heads.
  • The sequence of output vectors generated by the attention layer 210 may be processed by a feed forward network 212.
  • the feed forward network 212 may apply a non-linear function independently to each output vector y_i generated by the attention layer 210, for example a linear transformation that expands the internal state, followed by a rectified-linear unit (ReLU) activation, followed by a further linear transformation that returns the internal state to its original dimension.
  • Expanding the internal state can enable the transformer encoder block 208 to be more expressive, at the cost of a greater number of parameters and therefore higher compute and memory requirements, while returning the internal state to its original dimensions enables arbitrary numbers of transformer encoder blocks 208 to be stacked.
  • the expansion factor, which may for example be defined as the ratio of the expanded internal dimension to the dimension of the output vectors, may be chosen to balance expressiveness against compute and memory requirements.
  • Other implementations of the feed forward network 212 are possible, for example in which the ReLU activation function is replaced by a Gaussian error linear unit (GeLU) or the Swish function, or in which a gated linear unit (GLU) variant is used.
  • the transformer encoder block 208 may perform additional operations.
  • the transformer encoder block 208 may apply residual connection(s) in parallel with the attention layer 210 and/or the feed forward network 212, in which the input vector is added to the output of the attention layer 210 and/or feed forward network 212. Residual connections mainly help mitigate vanishing gradients during training, whilst also encouraging the feed forward network 212 to learn representations that retain local information, which may be particularly valuable in the present setting where local effects may be more important than long-range effects. Additionally, or alternatively, the transformer encoder block 208 may include one or more normalisation functions, which may reduce training time and improve stability by preventing exploding parameter values. Examples of normalisation functions include batch normalisation and layer normalisation.
  • Normalisation functions may be applied after each of the attention layer 210 and the feed forward network 212 (so-called post-norm), or before each of the attention layer 210 and the feed forward network 212 (so-called pre-norm).
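  • Putting the components above together, the following is a minimal pre-norm encoder block sketch in PyTorch. The model dimension, head count, expansion factor of 4, and depth of 16 are illustrative assumptions within the ranges discussed above, and the standard nn.MultiheadAttention used here is global rather than windowed.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # Pre-norm transformer encoder block: attention and feed-forward
    # sub-layers, each preceded by layer normalisation and wrapped in a
    # residual connection.
    def __init__(self, d: int = 512, heads: int = 8, expansion: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(           # expand, non-linearity, project back
            nn.Linear(d, expansion * d),
            nn.GELU(),
            nn.Linear(expansion * d, d),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        return x + self.ffn(self.norm2(x))                 # residual around FFN

blocks = nn.Sequential(*[EncoderBlock() for _ in range(16)])  # e.g. 16 blocks
print(blocks(torch.randn(2, 500, 512)).shape)  # torch.Size([2, 500, 512])
```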
  • the output of the final transformer encoder block 208 is a sequence of intermediate hidden states 214.
  • the number (or frequency) of intermediate hidden states 214 may be equal to the number (or frequency) of feature vectors 206 generated by the downsampling neural network layers 204.
  • the frequency of the intermediate hidden states 214 may be comparable to, or even less than, the rate at which polymer units are translocated relative to the nanopore.
  • the machine learning model may apply an upsampling layer 216 to obtain a sequence of upsampled hidden states 218 which may have a sufficiently high frequency that at least one hidden state will map to each individual polymer unit.
  • the frequency of the sequence of upsampled hidden states 218 may be greater than the average translocation rate, for example between 1.5 and 5 times the average translocation rate.
  • the upsampling layer may increase the frequency by a factor of 2, 5, 8, or any other suitable upsampling factor depending on the frequency of the sequence of intermediate hidden states and the average (or minimum) translocation rate.
  • the upsampling layer 216 may be any layer or function capable of increasing the frequency or sampling rate of a sequence while maintaining information within the sequence.
  • the upsampling layer 216 may be a convolutional upsampling layer, an attention upsampling layer, or a linear upsampling layer.
  • Linear upsampling may be performed by increasing the dimension of an intermediate hidden state 214 of shape (1, d) to a stretched shape of (1, u·d), where u is the upsampling factor, then reshaping the resulting vector from shape (1, u·d) to (u, d).
  • the upsampling factor u may be, for example, 2, 3, 5, or any other value to achieve a suitable frequency for the upsampled hidden states 218.
  • Linear upsampling has the advantage of being computationally less expensive than convolutional upsampling or attention upsampling.
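  • A minimal sketch of this stretch-and-reshape linear upsampling, batched over a whole sequence (dimension names d and u follow the bullet above; the class name and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LinearUpsample(nn.Module):
    # One linear layer stretches each hidden state from d to u*d values,
    # which are then reshaped into u consecutive states of dimension d.
    def __init__(self, d: int, u: int):
        super().__init__()
        self.u = u
        self.proj = nn.Linear(d, u * d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d) -> (batch, n, u*d) -> (batch, n*u, d)
        b, n, d = x.shape
        return self.proj(x).reshape(b, n * self.u, d)

up = LinearUpsample(d=512, u=3)            # e.g. 3x upsampling
print(up(torch.randn(2, 100, 512)).shape)  # torch.Size([2, 300, 512])
```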
  • the upsampled hidden states 218 are processed using a network head 220 to estimate a sequence of polymer units of the polymer.
  • the network head 220 in this example includes a projection layer 222 for projecting each upsampled hidden state 218 to a respective array of values 224 indicating an estimated likeliness of the hidden state 218 being associated with each of a set of symbols, the set of symbols comprising symbols associated with respective types of canonical polymer unit.
  • the likeliness arrays may be vectors.
  • the term likeliness may refer to a measure of probability or any other quantity representing how likely an event is to happen (possibly, but not necessarily, normalised within a range of zero to one).
  • the set of symbols may correspond to canonical nucleotide bases such as adenine (A), cytosine (C), guanine (G) and thymine (T) for DNA basecalling, or adenine (A), cytosine (C), guanine (G) and uracil (U) for RNA basecalling.
  • A adenine
  • C cytosine
  • G guanine
  • T thymine
  • U uracil
  • the network head 220 analyses the arrays of values 224, for example by applying a search algorithm to the arrays of values 224, to estimate the sequence of polymer units of the polymer.
  • the frequency of the hidden states 218 arriving at the network head may be higher than the translocation rate of the polymer units, meaning that there is not a one-to-one mapping of arrays 224 to polymer units.
  • the network head may adopt a connectionist temporal classification (CTC) topology, and may therefore be referred to as a CTC head or a connectionist temporal classification-conditional random field (CTC-CRF) head, depending on how the arrays 224 are generated and how the network head 220 is trained.
  • using a CTC-CRF head may lead to improved accuracy as it may better capture conditional dependencies between hidden states 218 than a “vanilla” CTC head.
  • an additional “blank” symbol (e.g., “-”) is introduced to the set of symbols, and the algorithm includes determining and collapsing an intermediate sequence of symbols to arrive at a final estimate.
  • the search algorithm may collapse the intermediate sequence symbols by (1) merging repeated symbols and then (2) removing instances of the blank symbol.
  • an intermediate sequence 226 of symbols corresponding to DNA nucleotide bases is given by CGG-A-TT.
  • Repeated symbols (i.e. neighbouring symbols that are identical to one another) are merged, resulting in a partially collapsed sequence 228 of symbols given by CG-A-T.
  • instances of the blank symbol are then removed, resulting in a fully collapsed sequence 230 (in this example, CGAT), which may be returned as the estimated sequence of polymer units.
  • The network head 220 may determine the intermediate sequence 226 of symbols by selecting the most likely symbol from each array 224, which may result in the intermediate sequence with the highest possible posterior probability. This may in turn result in the estimated nucleotide sequence with the highest posterior probability, which is the desired behaviour of the network head 220.
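  • The following is a minimal sketch of this greedy decode and the collapse rule, reproducing the CGG-A-TT to CGAT example above. The symbol ordering (blank at index 0) and function name are assumptions for illustration.

```python
import torch

SYMBOLS = ["-", "A", "C", "G", "T"]   # index 0 is the blank symbol

def greedy_ctc_decode(log_probs: torch.Tensor) -> str:
    # log_probs: (T, 5) per-step symbol scores from the network head.
    best = log_probs.argmax(dim=-1).tolist()  # most likely symbol per step
    collapsed, prev = [], None
    for idx in best:
        if idx != prev:                       # (1) merge repeated symbols
            collapsed.append(idx)
        prev = idx
    return "".join(SYMBOLS[i] for i in collapsed if i != 0)  # (2) drop blanks

# The worked example above: the intermediate sequence CGG-A-TT.
steps = torch.eye(5)[[2, 3, 3, 0, 1, 0, 4, 4]].log()  # one-hot -> log "probs"
print(greedy_ctc_decode(steps))  # CGAT
```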
  • However, multiple intermediate sequences may map to a common fully collapsed sequence (for example, intermediate sequences CGG-A-TT and CCGA-TT both map to a fully collapsed sequence CGAT). Accordingly, the intermediate sequence with the highest posterior probability may not correspond to the fully collapsed sequence with the highest posterior probability.
  • the relevant metric is the cumulative posterior probability for all intermediate sequences corresponding to a given collapsed sequence.
  • a heuristic known as a beam search may be used, which moves along the sequence of arrays 224 in a sequential manner, computing an updated set of hypotheses at each step.
  • the updated set of hypotheses at each step is derived from the set of hypotheses at the previous step by extending each hypothesis with all possible symbols and keeping only the top B candidates, where B is a hyperparameter referred to as the beam width.
  • the beam width B may for example be 2, 4, 8, 16, 32, 64, or any other suitable size.
  • a score (e.g., probability) is determined at each step after collapsing repeated symbols and removing blank characters.
  • separate scores may be tracked for hypotheses that end in the blank symbol and for those that do not, and these scores may be combined to determine which hypothesis branches to maintain.
  • the intermediate sequence 226 may correspond to the remaining branch after the algorithm has run through all of the arrays 224.
  • other network heads may be used in place of the CTC head or CTC-CRF head described above, such as a transformer decoder head which may be arranged to generatively predict the sequence of polymers using the set of hidden states 218 as context.
  • a transformer decoder head would include additional blocks of attention layers and feed forward networks, which may significantly increase the computational complexity of performing inference, while the additional trainable parameters may result in more training runs being needed to achieve a comparable level of accuracy.
  • Fig. 3 shows an example of a computer-implemented method 300 of analysing a signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore.
  • the method 300 may use a neural network model comprising a transformer encoder network and a network head, as described above with reference to Fig.2.
  • the method 300 includes determining, at 302, a sequence of hidden states associated with the signal using the transformer encoder network, and processing the sequence of hidden states using the network head to estimate a sequence of polymer units of the polymer.
  • the processing by the network head includes generating, at 304, a respective array of values for each hidden state indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, where the set of symbols includes symbols associated with respective types of canonical polymer unit, and analysing the generated arrays of values to estimate, at 306, the sequence of polymer units of the polymer.
  • Fig. 4 shows an example of a computer-implemented method 400 of training a neural network model comprising a transformer encoder network and a network head.
  • the method 400 includes providing, at 402, a training signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore, the training signal being associated with a target sequence of polymer units.
  • the method 400 includes determining, at 404, a sequence of hidden states associated with the signal using the transformer encoder network, processing, at 406, the sequence of hidden states using the network head to generate a respective array of values indicating an estimated likeliness of each hidden state being associated with each of a set of symbols comprising symbols associated with respective types of canonical polymer unit, and updating, at 408, parameters of the neural network based on a loss function that depends on the generated arrays of values and the target sequence of polymer units.
  • the training signal and associated target sequence may be obtained by measuring a polymer with a known sequence using the sensor element, or by measuring a polymer with an unknown sequence using the sensor element and using an alternative sequencing method to determine the target sequence.
  • the training signal and the target sequence may not necessarily be aligned, meaning that there may not be a known correspondence between positions in the target sequence and sections of the training signal.
  • the training signal may be part of a dataset comprising a large number of training signals each having respective target sequences.
  • the dataset may for example include hundreds, thousands, tens of thousands, hundreds of thousands, or millions of training signals.
  • the dataset may include polymers of a given class, such as polynucleotides, DNA, RNA, animal DNA, plant DNA, animal RNA, plant RNA, etc. or may include polymers of a wide range of classes.
  • the updating of the parameters may be performed using a gradient-based optimisation method, in which one or more loss values corresponding to evaluations of the loss function on a batch of one or more training signals are backpropagated through the neural network to determine the gradient of the loss (or losses) with respect to the parameters of the neural network.
  • Values of the parameters (e.g., weights, biases) may then be updated in dependence on the determined gradient, for example using an optimiser based on stochastic gradient descent or a variant thereof.
  • the optimiser may have one or more hyperparameters such as learning rate, which may be held constant or may be varied according to a predetermined or learned schedule.
  • learning rate warm-up may be applied in which the learning rate increases gradually from a very small value in order to mitigate issues associated with exploding gradients in early phases of the training.
  • Techniques such as batch normalization may be used to improve the efficiency of the training process. This process may be repeated over many iterations covering one or more passes through the dataset (epochs), for example until a stopping condition is satisfied, such as a convergence condition or after a predetermined number of iterations or epochs (such as 10, 30, 50 or 100 epochs), at which point the neural network may be validated and tested using validation and test datasets before being deployed for use.
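  • As an illustration of this training loop, the sketch below uses AdamW with a linear learning-rate warm-up and gradient clipping. The model.loss method, batch iterator, and hyperparameter values are hypothetical placeholders, not part of the disclosure.

```python
import torch

def train(model, batches, steps=10_000, peak_lr=2e-4, warmup=1_000):
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    # Linear warm-up: ramp the learning rate from ~0 to peak_lr over
    # `warmup` steps to mitigate exploding gradients early in training.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda s: min(1.0, (s + 1) / warmup))
    for step, (signal, target) in zip(range(steps), batches):
        loss = model.loss(signal, target)   # hypothetical CTC(-CRF) loss method
        opt.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        sched.step()
```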
  • a CTC-CRF loss may be used, which may result in the arrays of values being determined as a CRF which can capture conditional dependencies between symbols.
  • in the CTC framework, an intermediate (hidden state) symbol sequence π is connected to the target sequence y by a collapsing mapping B : A'* → A*, where A' and A are the symbol sets for the hidden states and the target sequence respectively (A' additionally containing the blank symbol).
  • the CTC loss may then be written as L_CTC = -log P(y | x) = -log Σ_{π ∈ B⁻¹(y)} P(π | x), i.e. the negative log of the cumulative probability of all intermediate sequences that collapse to the target, which can be computed efficiently using a forward-backward algorithm.
  • the CTC-CRF loss can also be computed using a forward-backward algorithm.
  • the loss function may include a weighted linear combination of the CTC loss and the CTC-CRF loss (for example, with the CTC loss having a relatively lower weight such as 0.01 or 0.1 times the weight of the CTC-CRF loss). Including both loss terms may help with convergence during training.
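  • The CTC term of such a loss can be computed with PyTorch's built-in nn.CTCLoss, as in the minimal sketch below (the CRF term is omitted; shapes, lengths, and the blank index are illustrative assumptions):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, S = 1200, 8, 5                 # time steps, batch, symbols (blank+ACGT)
log_probs = torch.randn(T, B, S, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, S, (B, 400))               # reference sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 400, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                      # gradients for the optimiser step
print(float(loss))
```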
  • Regularisation and other techniques may be applied during training of the neural network model, for example to improve the convergence properties, stability and speed of the training.
  • Examples include layer drop in which certain layers are randomly dropped with a predetermined probability during training, and dropout in which certain parameters such as elements of the learned matrices are randomly zeroed with a predetermined probability during training.
  • a further option is batch normalisation. Suitable batch sizes for training may be in the region of 32, 64, 128, or 256 training signals.
  • a signal or signals may include measurements of multiple polymer strands, for example two complementary strands of a polynucleotide.
  • the transformer encoder network may be adapted to process measurements of both signals. By leveraging the information from both complementary strands, accuracy may be further improved as the effects of measurement errors, noise, and other sources of error may be mitigated by the inherent redundancy of information between the strands.
  • Fig. 5A shows an example of a composite signal formed of a first signal portion 502 concatenated with a second signal portion 504.
  • the first signal portion 502 comprises measurements of a first polynucleotide strand and the second signal portion 504 comprises measurements of a second polynucleotide strand, where the first and second polynucleotide strands are complementary strands.
  • the start of the first signal portion 502 is indicated by a first dashed line 506, while the start of the second signal portion 504 is indicated by a second dashed line 508.
  • the two signal portions are at least partially aligned in the sense that the start of the first signal portion 502 and the start of the second signal portion 504 correspond to approximately equal locations within the respective polynucleotide strands.
  • the concatenated signal portions 502, 504 may be processed according to the methods described above to generate a sequence of feature vectors 510.
  • the sequence of feature vectors 510 is processed using one or more transformer encoder blocks having a sliding window cross-attention layer 512.
  • the attention layer 512 in this example has a receptive field corresponding to two sliding windows 514 and 516.
  • This dual sliding window approach may benefit from the advantages of sliding window attention described elsewhere in the present disclosure, provided the degree of misalignment is sufficiently small that corresponding parts of the first signal portion 502 and the second signal portion 504 concurrently fall within the first and second sliding windows 514, 516.
  • Fig. 5B illustrates an alternative method of analysing the composite signal of Fig. 5A.
  • in the method of Fig. 5B, one or more transformer encoder blocks use a global self-attention layer 518 with a receptive field covering the feature vectors 510 from both signal portions, such that each output vector can attend to all 2n feature vectors (assuming n feature vectors are derived from each of the signal portions 502, 504).
  • the global self-attention layer 518 advantageously may not require any alignment between the first signal portion 502 and the second signal portion 504, so may be applicable in settings where such alignment cannot be readily obtained.
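  • A minimal sketch of this duplex arrangement: feature vectors from the two strands are concatenated along the time axis and passed through a standard global attention layer, whose receptive field then spans both portions with no alignment step. Sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

template = torch.randn(1, 400, 512)     # features from the first strand
complement = torch.randn(1, 400, 512)   # features from the second strand
both = torch.cat([template, complement], dim=1)   # (1, 800, 512)

# Global self-attention over the concatenation can cross-attend between
# corresponding parts of the two strands.
out, weights = attn(both, both, both)
print(out.shape, weights.shape)         # (1, 800, 512) and (1, 800, 800)
```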
  • Still further embodiments are envisaged.
  • further modifications to the neural network architectures discussed above may be employed, for example configuring one or more of the transformer encoder blocks as a mixture of experts to achieve more efficient training, and/or quantising one or more parts of the neural network model.
  • the estimated sequence may be output to a display or to a file/memory, or to another software application. Additionally, or alternatively, the estimated sequence may be used to determine whether to eject the polymer being sequenced, and/or to translocate the polymer at the same or a faster rate without making further electrical measurements, for example to mitigate unnecessary use of computational resources. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Biotechnology (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Immunology (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Wood Science & Technology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Zoology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Microbiology (AREA)
  • General Engineering & Computer Science (AREA)
  • Medicinal Chemistry (AREA)
  • Food Science & Technology (AREA)
  • Urology & Nephrology (AREA)
  • Hematology (AREA)

Abstract

A method of estimating a sequence of polymer units of a polymer comprising measurements of a signal generated during translocation of the polymer with respect to a nanopore includes determining a sequence of hidden states associated with the signal using a transformer encoder network, and processing the sequence of hidden states using a network head to estimate a sequence of polymer units of the polymer. The processing by the network head includes generating a respective array of values for each hidden state indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, where the set of symbols includes symbols associated with respective types of canonical polymer unit, and analysing the generated arrays of values to estimate the sequence of polymer units of the polymer.

Description

POLYMER ANALYSIS USING TRANSFORMER NEURAL NETWORK

Technical Field

The present disclosure relates to estimating a sequence of polymer units of a polymer. The disclosure has particular, though not exclusive, relevance to basecalling a polynucleotide.

Background

Various biochemical analysis systems provide measurements of polymer units for the purpose of determining the sequence. One such type of analysis system uses a nanopore. Biochemical analysis systems that use a nanopore have been the subject of much recent development. Typically, successive measurements of a polymer are taken from a sensor element comprising a nanopore during translocation of the polymer through the nanopore. Some property of the system depends on the polymer units in the nanopore, and measurements of that property are taken. This type of measurement system using a nanopore has considerable promise, particularly in the field of sequencing a polynucleotide such as deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), sometimes referred to as "basecalling". Biochemical analysis systems using nanopores can provide long continuous reads of polymers, for example in the case of polynucleotides ranging in length from 20 to >1 Mb. The data gathered in this way comprises measurements, such as measurements of ion current, where the translocation of the polymer through the nanopore results in a change in the measured property indicative of the sequence of polymer units.

Some biochemical analysis systems use machine learning models to estimate a sequence of polymer units from data comprising sequences of measurements collected as described above. Suitable machine learning models include Hidden Markov Models (HMMs), neural networks including recurrent neural network (RNN) models, such as those comprising long short-term memory (LSTM) units, and combinations of the above. However, the inventors have recognised that such models have limitations in terms of accuracy and scalability, and the performance of such models can deteriorate for very long sequences of measurements as may be needed to sequence polymers with hundreds of thousands, or more, polymer units.

Summary

According to aspects of the present disclosure, there are provided a computer-implemented method of estimating a sequence of polymer units of a polymer comprising measurements of a signal generated during translocation of the polymer with respect to a nanopore, a data processing system comprising means for carrying out the method, and a computer program product (such as one or more non-transitory computer-readable storage media) comprising instructions which, when executed by a computer, cause the computer to carry out the method. The method includes determining a sequence of hidden states associated with the signal using a transformer encoder network, and processing the sequence of hidden states using a network head to estimate a sequence of polymer units of the polymer. The processing by the network head includes generating a respective array of values for each hidden state indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, where the set of symbols includes symbols associated with respective types of canonical polymer unit, and analysing the generated arrays of values to estimate the sequence of polymer units of the polymer.
The inventors have found that using the transformer encoder network and network head enables more accurate sequencing or basecalling at a given number of floating point operations per second (FLOPS) when compared with other types of machine learning model for processing sequential data. For example, the inventors were able to achieve around 1.5 Q higher accuracy than a state-of-the-art basecaller employing a long short-term memory (LSTM) architecture. Furthermore, the accuracy of the proposed architecture is found to be more scalable with model size (e.g. depth, size of hidden layer, length of training runs, context length etc.) than existing approaches, leading to the possibility of even better performance as improvements in hardware and techniques such as quantisation enable larger models to be realised.

The transformer encoder network may use sliding window self-attention, whereby each hidden state depends on a respective portion of the signal. The respective portion may be a contiguous portion or may include gaps, for example in the case of diluted sliding window self-attention or cross-attention between concatenated signal portions as described hereinafter. By using sliding window self-attention, the memory footprint of the method becomes independent of the length of the signal while the compute cost scales linearly with the length of the signal, in contrast with the quadratic scaling exhibited by global self-attention mechanisms. This may enable very long polymers (for example, more than 100,000, more than 1,000,000, or more than 10,000,000 polymer units) to be sequenced or basecalled in a single pass, potentially in substantially real-time as the polymer is translocated with respect to the nanopore. By contrast, the performance of methods based on recurrent neural networks such as LSTMs tends to decline with very long sequence lengths, for example sequences with more than 100,000 polymer units. The inventors have discovered that the accuracy of the method may begin to saturate beyond a certain window size, for example at a window size of around 64, 128, 256, or 512, meaning that most of the benefits of the transformer encoder architecture may be reaped using a window size of between 64 and 512, without being subject to the quadratic scaling of compute costs (this may be in contrast with, for example, language modelling applications of transformers, in which long-range relationships between tokens may be critical and therefore global attention may be needed to achieve peak performance).

Determining the sequence of hidden states may include processing the signal using one or more neural network layers to generate a sequence of feature vectors, and processing the sequence of feature vectors using the transformer encoder network to determine the sequence of hidden states. The one or more neural network layers may include one or more convolutional layers. The polymer units of the polymer may pass through the nanopore at an average translocation rate, and the signal may have a first frequency that is greater (for example, at least 3 times greater) than the average translocation rate, thereby enabling multiple measurements to be made in the vicinity of each polymer unit. The sequence of feature vectors may have a second frequency that is lower than the first frequency.
Processing the sequence of feature vectors may include generating a sequence of intermediate states having the second frequency as an output of the transformer encoder network, and upsampling the sequence of intermediate states to determine the sequence of hidden states. The sequence of hidden states may have a third frequency that is no less than the average translocation rate. By generating a sequence of feature vectors with a lower frequency than the signal, the compute cost and memory footprint of the subsequent processing by the transformer encoder may be reduced significantly without reducing accuracy. By using one or more neural network layers (which may include one or more convolutional layers) to achieve this reduction in frequency, the feature vectors may nevertheless capture substantially all of the useful information within the signal. The inventors have found that frequency reduction by a factor of between 5 and 20 may be achieved without reducing accuracy. The frequency may nevertheless be at least partially restored upon generating the hidden states, otherwise it may not be possible for the network head to resolve all of the polymer units of the polynucleotide. To enable full resolution of the polynucleotide, the third frequency may for example be between 1.5 and 5 times the average translocation rate. The upsampling of the sequence of intermediate states may be performed for example using a linear upsampling layer, a convolutional upsampling layer, or an attention upsampling layer. The linear upsampling may achieve at least comparable performance with other upsampling methods, while being less computationally expensive and not significantly increasing the number of trainable parameters of the machine learning model.

Analysing the generated arrays of values may include applying a search algorithm to the generated arrays of values. For example, the set of symbols may include a blank symbol, and analysing the generated arrays of values may include determining an intermediate sequence of symbols based on the generated arrays of values, and collapsing the intermediate sequence of symbols by merging repeated symbols and then removing instances of the blank symbol. In this way, the search algorithm may correspond to a connectionist temporal classification (CTC) topology. Providing a blank symbol and an algorithm for collapsing the intermediate sequence of symbols may enable the sequence of polymer units to be estimated without relying on a known alignment between measurements in the signal and the polymer units in the polymer. In this way, configuring the apparatus may be greatly simplified while also providing robustness against variations in translocation rate. The intermediate sequence of symbols may be determined using a greedy search technique in which the most likely symbol indicated by each array of values is selected, which in many cases will result in the most likely sequence of polymer units being returned. However, due to the possible lack of alignment, multiple intermediate sequences may map to a given final estimate after the collapsing algorithm is applied. In that case, a better approach may be to identify the estimate whose corresponding intermediate sequences have the greatest cumulative posterior likelihood. To account for this situation, determining the intermediate sequence of symbols may include performing a beam search based on the generated arrays of values.
Using a wider beam may increase the probability of the best estimate being identified, at the expense of a higher compute cost. The network head may be configured (for example, trained) to generate the arrays of values as a conditional random field (CRF), which advantageously may allow for conditional dependence of the arrays of values, which may result in improved accuracy compared with methods that assume conditional independence. When the arrays of values are generated as a CRF and the collapse algorithm described above is then applied to arrive at the final estimate of the sequence of polymer units, the resulting network head may be referred to as a CTC-CRF head, though it will be appreciated that other types of network head may be used, such as a CTC head, an autoregressive transformer decoder, or any other form of network head or decoder that can estimate or classify polymer units from the series of hidden states.

The polymer may be a first polymer, and the signal may further comprise measurements of a second polymer by the sensor element during translocation of the second polymer with respect to the nanopore, the first polymer and the second polymer being a complementary pair. The transformer encoder network may include an attention mechanism arranged to capture relationships between measurements associated with mutually corresponding segments of the first polymer and the second polymer. For example, the signal may include a first signal portion comprising measurements of the first polymer concatenated with a second signal portion comprising measurements of the second polymer. The attention mechanism may then be a global self-attention mechanism which learns to cross-attend between corresponding parts of the first and second signal portions, which advantageously may obviate the need to align the first and second signal portions. Alternatively, the attention mechanism may use two local windows respectively centred on (approximately) corresponding parts of the first and second signal portions. This approach may result in a more scalable compute cost as the size of the sliding windows can remain constant, but may need at least partial alignment of the signal portions.

The transformer encoder may include between 8 and 64 transformer encoder blocks. For example, the transformer encoder may include between 12 and 20 transformer encoder blocks, which the inventors have found to provide a good balance in the trade-off between accuracy and speed. The polymer may be a polynucleotide such as DNA or RNA, and the polymer units may be nucleotides.

A further aspect of the present disclosure includes apparatus comprising a sensor element comprising a nanopore, means for translocating a polymer with respect to the nanopore, and a data processing system comprising means for carrying out the above computer-implemented method. The means may include a polymer binding enzyme for controlling the translocation of the polymer with respect to the nanopore. A still further aspect includes a method comprising translocating a polymer with respect to a nanopore, generating a signal comprising measurements of the polymer by a sensor element comprising the nanopore, and analysing the signal using the above computer-implemented method.
According to further aspects of the present disclosure, there are provided a computer-implemented method of training a neural network model comprising a transformer encoder network and a network head, a data processing system comprising means for carrying out the method, and a computer program product (such as one or more non-transitory computer-readable storage media) comprising instructions which, when executed by a computer, cause the computer to carry out the method. The method may include providing a signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore, the signal being associated with a target sequence of polymer units, determining a sequence of hidden states associated with the signal using the transformer encoder network, processing the sequence of hidden states using the network head to generate a respective array of values indicating an estimated likeliness of each hidden state being associated with each of a set of symbols comprising symbols associated with respective types of canonical polymer unit, and updating parameters of the neural network based on a loss function depending on the generated arrays of values and the target sequence of polymer units. The loss function may include at least one of a connectionist temporal classification (CTC) loss and a connectionist temporal classification-conditional random field (CTC-CRF) loss, which may enable training on signals where alignment between the measurements and the polymer units is not known a priori.

Further features and advantages will become apparent from the following description of preferred embodiments, given by way of example only, which is made with reference to the accompanying drawings.

Brief Description of the Drawings

Fig. 1 shows an example of apparatus for estimating a sequence of polymer units of a polymer. Fig. 2 shows schematically an example of a neural network for analysing a signal comprising measurements of a polymer. Fig. 3 is a flow diagram representing a method of analysing a signal according to an aspect of the present disclosure. Fig. 4 is a flow diagram representing a method of training a neural network according to an aspect of the present disclosure. Figs. 5A and 5B illustrate methods of analysing a composite signal comprising measurements of complementary strands of a polymer.

Detailed Description

Details of systems and methods according to examples will become apparent from the following description with reference to the figures. In this description, for the purposes of explanation, numerous specific details of certain examples are set forth. Reference in the specification to ‘an example’ or similar language means that a feature, structure, or characteristic described in connection with the example is included in at least that one example but not necessarily in other examples. It should be further noted that certain examples are described schematically with certain features omitted and/or necessarily simplified for the ease of explanation and understanding of the concepts underlying the examples. Embodiments of the present disclosure relate to sequencing polymers such as polynucleotides. In particular, embodiments described herein address challenges involved in accurately and efficiently sequencing long-chain polymers. Fig. 1 shows apparatus 100 for estimating a sequence or series of polymer units of a polymer 102.
The polymer 102 may be a polynucleotide strand of a polynucleotide (or nucleic acid), and the polymer units may be nucleotides. However, in general the polymer 102 may be of any type, for example a polypeptide such as a protein, or a polysaccharide. The polymer 102 may be natural or synthetic. In the case of a polynucleotide or nucleic acid, the polymer units are nucleotides. The nucleic acid is typically deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or a synthetic nucleic acid known in the art, such as peptide nucleic acid (PNA), glycerol nucleic acid (GNA), threose nucleic acid (TNA), locked nucleic acid (LNA) or other synthetic polymers with nucleotide side chains. The nucleic acid may be single-stranded, be double-stranded or comprise both single-stranded and double-stranded regions. The nucleic acid may comprise one strand of RNA hybridised to one strand of DNA. Typically cDNA, RNA, GNA, TNA or LNA are single stranded. The polymer units may be any type of nucleotide. The nucleotide can be naturally occurring or artificial. For instance, the method may be used to verify the sequence of a manufactured oligonucleotide. The polymer units may be canonical polymer units. For example, in the case that the polymer 102 is a DNA polynucleotide, the canonical bases are adenine (A), cytosine (C), guanine (G), and thymine (T). By contrast, ribonucleic acid (RNA) comprises the canonical bases A, C and G, with uracil (U) in place of thymine. A nucleotide may also lack a nucleobase and a sugar. The nucleotide may be modified, such as 5mC, 5hmC and N1-methylpseudouridine. Machine learning methods to detect modified bases are disclosed in WO23094806.

The apparatus 100 comprises a sensor element 104 comprising a nanopore 106 situated in a membrane 108, and a sensor device 110. Although only a single nanopore 106 is shown in Fig. 1, the apparatus 100 may employ many nanopores, for example arranged in an array, to provide parallelised collection of information. The measurement system may comprise at least 10 nanopores, at least 100 nanopores, or at least 1,000 nanopores. The nanopore 106 is a pore or hole, typically having a size of the order of nanometres, which may allow the passage of polymers therethrough. The nanopore may be a protein pore or a solid state pore. The nanopore 106 may, for example, be a protein pore such as a polypeptide or a collection of polypeptides. Alternatively, the nanopore may be composed of any other such molecules that allow the nanopore 106 to function as an aperture in the membrane 108. The biological pore may be a transmembrane protein pore. Transmembrane protein pores for use in accordance with the invention can be derived from β-barrel pores or α-helix bundle pores. β-barrel pores comprise a barrel or channel that is formed from β-strands. Suitable β-barrel pores include, but are not limited to, β-toxins, such as α-hemolysin, anthrax toxin and leukocidins, and outer membrane proteins/porins of bacteria, such as Mycobacterium smegmatis porin (Msp), for example MspA, MspB, MspC or MspD, lysenin, outer membrane porin F (OmpF), outer membrane porin G (OmpG), outer membrane phospholipase A and Neisseria autotransporter lipoprotein (NalP). α-helix bundle pores comprise a barrel or channel that is formed from α-helices. Suitable α-helix bundle pores include, but are not limited to, inner membrane proteins and α-outer membrane proteins, such as WZA and ClyA toxin. The transmembrane pore may be derived from Msp or from α-hemolysin (α-HL).
The transmembrane pore may be derived from lysenin. Suitable pores derived from lysenin are disclosed in WO 2013/153359. Suitable pores derived from MspA are disclosed in WO-2012/107778. The pore may be derived from CsgG, such as disclosed in WO-2016/034591 and WO2019/002893, both herein incorporated by reference in their entirety. The pore may be a DNA origami pore. A protein pore may be a naturally occurring pore or may be a mutant pore. A protein pore may be inserted into an amphiphilic layer such as a biological membrane, for example a lipid bilayer. An amphiphilic layer is a layer formed from amphiphilic molecules, such as phospholipids, which have both hydrophilic and lipophilic properties. The amphiphilic layer may be a monolayer or a bilayer. The amphiphilic layer may be a co-block polymer such as disclosed in Gonzalez-Perez et al., Langmuir, 2009, 25, 10447-10450, WO2014/064444, or US6723814, each herein incorporated by reference in its entirety. Alternatively, a protein pore may be inserted into an aperture provided in a solid state layer, for example as disclosed in WO2012/005857. A suitable apparatus for providing an array of nanopores is disclosed in WO 2014/064443. The nanopores may be provided across respective wells wherein electrodes are provided in each respective well in electrical connection with an ASIC for measuring current flow through each nanopore. A suitable current measuring apparatus may comprise the current sensing circuit as disclosed in WO-2016/181118.

The nanopore 106 may comprise an aperture formed in a solid state layer, which may be referred to as a solid state pore. The aperture may be a well, gap, channel, trench or slit provided in the solid state layer along or into which analyte may pass. Such a solid-state layer is not of biological origin. In other words, a solid state layer is not derived from or isolated from a biological environment such as an organism or cell, or a synthetically manufactured version of a biologically available structure. Solid state layers can be formed from both organic and inorganic materials including, but not limited to, microelectronic materials, insulating materials such as Si3N4, Al2O3, and SiO, organic and inorganic polymers such as polyamide, plastics such as Teflon® or elastomers such as two-component addition-cure silicone rubber, and glasses. The solid state layer may be formed from graphene. Suitable graphene layers are disclosed in WO-2009/035647, WO-2011/046706 or WO-2012/138357. Suitable methods to prepare an array of solid state pores are disclosed in WO-2016/187519. Such a solid state pore is typically an aperture in a solid state layer. The aperture may be modified, chemically or otherwise, to enhance its properties as a nanopore. A solid state pore may be used in combination with additional components which provide an alternative or additional measurement of the polymer, such as tunnelling electrodes (Ivanov AP et al., Nano Lett. 2011 Jan 12;11(1):279-85), or a field effect transistor (FET) device (as disclosed for example in WO-2005/124888). Solid state pores may be formed by known processes including, for example, those described in WO-00/79257. The nanopore 106 may be a hybrid of a solid state pore with a protein pore. The polymer 102 may be a single strand of a dual-stranded polymer 112. The nanopore 106 may be constructed to allow the polymer 102 to be moved or translocated through the nanopore 106.
The sensor device 110 may take a series or sequence of measurements of a property that depends on the polymer units of the polymer 102 being translocated with respect to the nanopore 106. The series of measurements may form a measurement signal. The property that is measured may be associated with an interaction between the polymer 102 and the nanopore 106. Such an interaction may occur at a constricted region of the pore. In one type of sensor device 110, the property that is measured may be the ion current flowing through a nanopore. These and other electrical properties may be measured using standard single channel recording equipment as described in Stoddart D et al., Proc Natl Acad Sci, 12;106(19):7702-7, Lieberman KR et al, J Am Chem Soc. 2010;132(50):17961-72, and WO-2000/28312. Alternatively, measurements of electrical properties may be made using a multi-channel system, for example as described in WO-2009/077734, WO-2011/067559 or WO-2014/064443. The property that is measured by the sensor device 110 may not necessarily be ion current. Some examples of alternative types of property include electrical properties and optical properties. A suitable optical method involving the measurement of fluorescence is disclosed in J. Am. Chem. Soc. 2009, 131, 1652-1653. Possible electrical properties include: ionic current, impedance, a tunnelling property, for example tunnelling current (for example as disclosed in Ivanov AP et al., Nano Lett. 2011 Jan 12;11(1):279-85), and a FET (field effect transistor) voltage (for example as disclosed in WO2005/124888). One or more optical properties may be used, optionally combined with electrical properties (Soni GV et al., Rev Sci Instrum. 2010 Jan;81(1):014301). The property may be a transmembrane current, such as ion current flow through a nanopore. The ion current may typically be the DC ion current, although in principle an alternative is to use the AC current flow (i.e. the magnitude of the AC current flowing under application of an AC voltage). Ionic solutions may be provided on either side of the membrane 108 or solid state layer, which ionic solutions may be present in respective compartments. A sample containing the polymer analyte of interest may be added to one side of the membrane and allowed to move with respect to the nanopore, for example under a potential difference or chemical gradient. The measurement signal may be derived during the movement of the polymer 102 with respect to the nanopore 106, for example taken during translocation of the polymer 102 through the nanopore 106. The polymer may partially translocate the nanopore 106.

In order to allow measurements to be taken as the polymer 102 passes through the nanopore 106, the rate of translocation can be controlled by a polymer binding moiety. Typically the moiety can move the polymer 102 through the nanopore with or against an applied field. The moiety can be a molecular motor using, for example, enzymatic activity in the case where the moiety is an enzyme, or can act as a molecular brake. Where the polymer is a polynucleotide there are a number of methods proposed for controlling the rate of translocation, including use of polynucleotide binding enzymes. Suitable enzymes for controlling the rate of translocation of polynucleotides include, but are not limited to, polymerases, helicases, exonucleases, single stranded and double stranded binding proteins, and topoisomerases, such as gyrases.
The helicase may be any of the helicases, modified helicases or helicase constructs disclosed in WO2013/057495, WO 2013/098562, WO2013098561, WO 2014/013259, WO 2014/013262 and WO 2014013260. The helicase may be added to the polynucleotide during sample preparation and stalled by one or more spacers as disclosed in WO2014135838. For other polymer types, moieties that interact with that polymer type can be used. The polymer interacting moiety may be any of those disclosed in WO-2010/086603, WO-2012/107778, and Lieberman KR et al, J Am Chem Soc. 2010;132(50):17961-72, and for voltage gated schemes (Luan B et al., Phys Rev Lett. 2010;104(23):238103). The rate of translocation of the polymer through the nanopore may be controlled by a voltage control pulse to step the polymer through the nanopore, such as disclosed in WO2019/006214. Translocation of the polymer may be controlled by a molecular hopper such as disclosed in WO2020/016573. The polynucleotide may comprise a polymer leader sequence which preferentially threads into the pore. The leader is preferably negatively charged and may be a polynucleotide, such as DNA or RNA, a modified polynucleotide (such as abasic DNA), PNA, LNA, polyethylene glycol (PEG) or a polypeptide. The leader sequence may form part of a Y adaptor typically comprising (a) a double stranded region and (b) a single stranded region or a region that is not complementary at the other end. Leader sequences and Y adaptors suitable for use are disclosed for example in WO2017149316. The effective surface concentration at the membrane surface can be enhanced by coupling the polynucleotide to the membrane. Suitable coupling moieties are disclosed for example in WO2017149316 and WO12164270. The polymer binding moiety can be used in a number of ways to control the polymer motion. The moiety can move the polymer through the nanopore with or against the applied field. The polynucleotide binding enzyme does not need to display enzymatic activity as long as it is capable of binding the target polynucleotide and controlling its movement through the pore. For instance, the enzyme may be modified to remove its enzymatic activity or may be used under conditions which prevent it from acting as an enzyme. Such conditions are discussed in more detail below. The polynucleotide binding enzyme may be a Dda helicase such as disclosed in WO2015055981, hereby incorporated by reference in its entirety.

Translocation of the polymer 102 through the nanopore may occur, either cis to trans or trans to cis, with or against an applied potential applied by an electronic circuit (not shown) as described hereafter. The translocation may occur under an applied potential which may control the translocation. The binding enzyme is typically held against the cis or trans opening of the nanopore during translocation of the polynucleotide through the nanopore under an applied potential. The electronic circuit may be an electronic circuit as disclosed in WO2016059427, incorporated herein by reference in its entirety. The electronic circuit may be arranged to control the application of bias voltages across each sensor element of the sensor device 110. During normal operation, the bias voltage is selected to enable translocation of a polymer through the pore of a sensor element. Such a bias voltage may typically be of a level up to -200 mV. The bias voltage supplied by the electronic circuit may also be selected so that it is sufficient to eject the translocating polymer from the pore.
By causing the electronic circuit to supply such a bias voltage, the sensor element is operable to eject a polymer that is translocating through the pore. To ensure reliable ejection, the bias voltage is typically a reverse bias, although that is not always essential. The electronic circuit may be connected to electrodes on either side of the membrane 108. The electronic circuit controls the application of bias voltages to generate a bias between the electrodes to control translocation of the polymer 102 as described above. In some examples, exonucleases that act progressively or processively on double stranded DNA can be used on the cis side of the nanopore 106 to feed the remaining single strand through under an applied potential, or on the trans side under a reverse potential. Likewise, a helicase that unwinds the double stranded DNA can also be used in a similar manner. There are also possibilities for sequencing applications that require strand translocation against an applied potential, but the DNA must be first “caught” by the enzyme under a reverse or no potential. With the potential then switched back following binding, the strand will pass cis to trans through the pore and be held in an extended conformation by the current flow. The single strand DNA exonucleases or single strand DNA dependent polymerases can act as molecular motors to pull the recently translocated single strand back through the pore in a controlled stepwise manner, trans to cis, against the applied potential. Alternatively, the single strand DNA dependent polymerases can act as a molecular brake slowing down the movement of a polynucleotide through the pore. Any moieties, techniques or enzymes described in WO-2012/107778 or WO-2012/033524 could be used to control polymer motion.

Measurement signals generated by the sensor device 110 may be provided to a data processing system 114 for analysis. Variations of the property measured by the sensor device 110 may be extremely small. For example, measurements of ionic current may be on the order of picoamps. Measurements of the property may therefore be amplified before being digitised to generate the measurement signal. The data processing system 114 may be a desktop computer, a laptop computer, a tablet computer, a server computer, or any combination thereof. In some examples, the data processing system 114 may be a dedicated device for sequencing polymers, such as a device produced by Oxford Nanopore Technologies (RTM). The data processing system 114 includes one or more processors 116 and memory 118. Additionally, the data processing system 114 includes a power supply 120, which may include a mains power supply, a battery, a solar power supply, and/or the like, and the data processing system 114 may also include one or more interface devices 122. The interface devices 122 may include input devices such as a keyboard, a touch screen, a touch pad, a mouse, a microphone, etc. to enable a user to control the data processing system 114, along with output devices such as a display, a speaker, etc. The interface devices 122 may also include network interface devices, for example to enable data or information generated by the data processing system 114 to be transmitted to other devices or systems.
The processor(s) 116 may include one or more of each of a central processing unit (CPU), graphics processing unit (GPU), neural processing unit (NPU), neural network accelerator (NNA), tensor processing unit (TPU), application-specific integrated circuit (ASIC), application-specific standard product (ASSP), digital signal processor (DSP), field programmable gate array (FPGA), system-on-chip (SoC), or any other suitable form of integrated circuit. In the present disclosure, the term memory is used to encapsulate both volatile and non-volatile working memory, as well as non-volatile storage. The memory 118 is configured to store measurement signals generated by the sensor device 110, as well as program code and a machine learning model for implementing the methods described hereinafter. The program code may include source code, object code, firmware, etc. in any suitable language. For example, the source code may be written in Python, C, C++, Rust, Julia, etc., and may use specific development frameworks, libraries, or packages, including PyTorch, TensorFlow, Keras, CUDA, etc.

Fig. 2 shows an example of a machine learning model for use in analysing a signal 202 such as a measurement signal determined using the apparatus of Fig. 1. The signal 202 may include raw measurements of a polymer by a sensor element as described above. The signal 202 may include a sequence of raw measurements or may have undergone pre-processing such as normalisation, outlier removal, amplification etc. The signal 202 may have a first frequency corresponding to the number of measurements taken by the sensor element per unit time. The frequency may for example be of the order of thousands of measurements per second (Hertz), for example 1kHz, 2kHz, 5kHz, 10kHz, or any other frequency sufficiently high for multiple measurements to be taken as each polymer unit passes the nanopore. The average rate at which polymer units pass the nanopore may for example be of the order of hundreds of polymer units per second, such as 100 units/s, 200 units/s, 400 units/s, 800 units/s. For a given rate, the measurement frequency may be chosen to be high enough to capture a maximal degree of useful information about the polymer units before saturation. The frequency of the signal 202 may for example be in the region of ten or more times higher than the average rate at which the polymer units pass the nanopore. In a particular example, the measurement frequency is 5kHz and the average rate of polymer units passing the nanopore is 400 units/s. The signal 202 in this example is processed by one or more neural network layers 204, which include one or more one-dimensional convolutional layers each applying one or more kernels to generate a respective sequence of feature vectors 206. The kernels may be applied sequentially to the signal 202, resulting in the sequence of feature vectors 206 being generated sequentially, for example in substantially real time as the signal 202 is generated, or alternatively the processing at different kernel positions may be parallelised across multiple processor cores. At least one of the convolutional layers may use a stride greater than one. The width of the kernel and the stride at each layer may be chosen such that portions of the signal processed at successive positions of the kernel overlap, meaning that every measurement in the signal 202 is processed by the kernel in at least one position.
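By way of illustration only, the following is a minimal PyTorch sketch of such a convolutional front end; the kernel widths, channel counts and activation function are assumptions, chosen so that the strides (2, 2, 3) give the twelvefold frequency reduction discussed below, and do not represent a definitive configuration.

import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    # Illustrative front end: three one-dimensional convolutions whose
    # strides (2, 2, 3) give an overall 12x reduction in frequency, with
    # kernel widths chosen so that successive kernel positions overlap.
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, stride=2, padding=2),
            nn.SiLU(),
            nn.Conv1d(16, 64, kernel_size=5, stride=2, padding=2),
            nn.SiLU(),
            nn.Conv1d(64, d_model, kernel_size=9, stride=3, padding=4),
            nn.SiLU(),
        )

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, samples) of raw measurements
        x = self.layers(signal.unsqueeze(1))  # (batch, d_model, samples/12)
        return x.transpose(1, 2)              # (batch, samples/12, d_model)

For example, ConvFrontEnd()(torch.randn(4, 6000)) yields a tensor of shape (4, 500, 256), i.e. a sequence of 500 feature vectors per 6000-sample signal.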
The number of feature vectors 206 may be less than the number of measurements in the signal 202, or in other words the downsampled signal may have a lower frequency than the signal 202. Successive convolutional layers may incrementally reduce the frequency, for example by each having a stride greater than one. The frequency of the sequence of feature vectors 206 may be between 5 and 20 times less than the frequency of the signal 202. For example, the frequency of the sequence of feature vectors 206 may be 12 times less than the frequency of the signal 202, which may be achieved using two convolutional layers with stride two and one convolutional layer with stride three (in any order), or alternatively using one convolutional layer with stride four and one convolutional layer with stride three. It will be appreciated that other combinations may be used to achieve other degrees of downsampling, and stride-one convolutions may also be included to increase the expressiveness of the feature representations encoded by the feature vectors 206. The kernel used in a given convolutional layer may for example have a depth of 2, 4, 8, 16, or any other suitable depth. Other types of layers or operations may additionally, or alternatively, be used in the downsampling process and/or otherwise to generate the sequence of feature vectors 206, such as pooling layers or predetermined filters. In some examples, the feature vectors 206 may be generated without any degree of downsampling.

The sequence of feature vectors 206 is processed by a transformer encoder network comprising one or more transformer encoder blocks 208. In the example of Fig. 2, the transformer encoder network includes $N$ transformer encoder blocks 208, each including an attention layer 210 and a feed forward network 212. The number $N$ of transformer encoder blocks may be chosen to balance possibly competing requirements such as accuracy, speed, memory footprint, compute, and need for high volumes of training data. In some examples, $N$ may be between 8 and 64 (inclusive), or between 12 and 20 (inclusive), though it will be appreciated that more or fewer transformer encoder blocks may be appropriate in particular settings, such as when very large or small volumes of training data are available, or when accuracy is paramount irrespective of compute cost, or when sequencing is to be performed on a mobile device and/or substantially in real time. The purpose of the attention layer 210 is to capture relationships between vectors in a sequence. More precisely, for a given input vector the attention layer 210 may generate a corresponding output vector that captures relationships between the given input vector and other input vectors within a receptive field of the corresponding output vector. The corresponding output vector may be said to attend to the input vectors within its receptive field. For example, the attention layer 210 of a given transformer encoder block 208 may be configured to process a sequence of $n$ input vectors $x_i$ to generate a sequence of $n$ output vectors $y_i$ of dimension $d$. The output vectors $y_i$ may be computed using a scaled dot-product attention kernel according to the following equation: $y_i = \sum_{j \in R_i} \mathrm{softmax}_j\left( q_i^\top k_j / \sqrt{d} \right) v_j$, where $q_i = W_Q x_i$, $k_j = W_K x_j$ and $v_j = W_V x_j$, with $R_i$ denoting the receptive field of the attention layer 210 for the output vector $y_i$. The matrices $W_Q$, $W_K$, $W_V$ are learned weight matrices arranged to transform a given input vector $x_i$ into a query, key, or value respectively, while the constant $d$ is the inner dimension of the queries and keys.
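The following sketch implements this attention computation directly for a single head; a loop over positions is used for clarity rather than efficiency, and all names are illustrative.

import math
import torch

def attention(x, W_q, W_k, W_v, receptive_fields):
    # x: (n, d_in) input vectors; W_q/W_k/W_v: (d_in, d) learned matrices.
    # receptive_fields[i] lists the indices R_i attended to by output i.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    d = q.shape[-1]
    outputs = []
    for i, R in enumerate(receptive_fields):
        scores = (k[R] @ q[i]) / math.sqrt(d)             # scaled dot products
        outputs.append(torch.softmax(scores, dim=0) @ v[R])
    return torch.stack(outputs)                           # (n, d) output vectors

A production implementation would batch these operations and fuse the masking into a single kernel, but the correspondence with the equation above is easier to see in this form.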
The output vector $y_i$ may be a sum of the values weighted by the scaled dot-product similarity of the keys and queries. Prior to the processing by the attention layer 210, a positional encoding technique may be applied to encode positional information about the respective sequence positions of the input vectors. For example, fixed or learned positional encodings, such as sinusoidal positional encodings, may be added to the input vectors. As another example, rotary position embeddings (RoPE) may be applied, in which the queries and keys in the attention layer 210 are rotated, with each position in the sequence receiving a different rotation. This way, the dot-product between queries and keys may diminish for tokens that are distant from one another in the sequence, providing an effective means of encoding relative positions. Compared with additive positional encodings, RoPE has been found to maintain a greater degree of the original token information while still providing the model with an effective way to understand sequence positions. In another example, attention with linear biases (ALiBi) may be applied, in which a fixed sequence of bias values is added to the key-query dot-product within the attention layer, or adaptive ALiBi may be applied, which replaces the fixed sequence with learned additive biases. The attention layer 210 may be a multi-headed attention layer which applies multiple attention functions (or “heads”) in parallel, with different learned weight matrices, and combines the resulting vectors to give a final output. For example, the output of the attention layer 210 may be determined by concatenating the outputs of the individual attention functions and projecting them to form an output matrix comprising the sequence of output vectors.

For global attention, the receptive field for each input vector may be the entire sequence, such that $R_i = \{1, \dots, n\}$. With global attention, the attention layer 210 may in principle be able to capture all possible relationships between the input vectors of the transformer encoder block 208, including possible long-range correlations or patterns. However, global attention leads to a computational complexity that grows quadratically with the number of input vectors (i.e. $O(n^2)$), which may be inefficient and possibly prohibitive for sequencing very long polymers. To mitigate such issues, alternative topologies may be used for the attention layer 210, such as sparse topologies. A particularly advantageous implementation of the attention layer 210 for the present purpose is sliding window self-attention, in which the receptive field for a given output vector is a contiguous sequence of input vectors of fixed size. For example, the receptive field for an output vector $y_i$ may be given by $R_i = \{i - w_l, \dots, i + w_r\}$, where the constants $w_l$ and $w_r$ denote the extent of the sliding window on either side of the $i$-th input vector. A symmetric sliding window of size $w$ may have $w_l = w_r = w/2$. Sliding window attention has the advantage that the computational complexity grows only linearly with the length of the sequence, $O(nw)$, while the memory footprint remains constant, because in effect the transformer encoder processes only a fixed-length section of the input signal at a time.
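Using the attention sketch above, the two receptive field topologies may be contrasted as follows (a sketch only; window edges are simply clipped at the sequence boundaries):

def sliding_windows(n, w_l, w_r):
    # R_i = {i - w_l, ..., i + w_r}, clipped to the sequence boundaries.
    return [list(range(max(0, i - w_l), min(n, i + w_r + 1)))
            for i in range(n)]

# Global attention: every output attends to the whole sequence, O(n^2).
global_fields = lambda n: [list(range(n)) for _ in range(n)]

# Symmetric sliding window of size w: w_l = w_r = w // 2, giving O(n * w)
# compute and a per-output memory footprint that is independent of n.
windowed_fields = lambda n, w: sliding_windows(n, w // 2, w // 2)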
As a result, long range correlations and patterns within the signal 202 may not be captured, but the inventors have found that a window of sufficient width, such as between 64 and 512, may be sufficient to capture the relevant information for sequencing or basecalling, whilst performance improvements may saturate beyond such window sizes. By using sliding window self-attention, arbitrarily long polymers may be sequenced without performance deterioration or prohibitive compute scaling. This contrasts with existing applications of transformers, such as natural language processing, in which performance can suffer significantly if long-range correlations are neglected. It will be appreciated that different window sizes or topologies may be used for the attention layer 210 in different transformer encoder blocks 208, or for different heads within a given transformer encoder block 208. For example, diluted sliding window self-attention may be used in some blocks/heads, in which gaps are introduced into the windows, with different gaps being used for different blocks/heads to enable different correlations to be captured at different blocks/heads.

The sequence of output vectors generated by the attention layer 210 may be processed by a feed forward network 212. The feed forward network 212 may apply a non-linear function independently to each output vector $y_i$ generated by the attention layer 210. The non-linear function may for example be composed of two learned linear transformations, with a rectified-linear unit (ReLU) activation function applied between the two learned linear transformations, as follows: $\mathrm{FFN}_{\mathrm{ReLU}}(y_i; W_1, W_2, b_1, b_2) = \max(0, y_i W_1 + b_1) W_2 + b_2$, where $W_1$, $W_2$ are learned weight matrices, and $b_1$, $b_2$ are (optional) learned bias vectors. Depending on the dimensions of these vectors and matrices, the feed forward network 212 may effectively expand and contract the size of the internal state. Expanding the internal state can enable the transformer encoder block 208 to be more expressive, at the cost of a greater number of parameters and therefore higher compute and memory requirements, while returning the internal state to its original dimensions enables arbitrary numbers of transformer encoder blocks 208 to be stacked. The expansion factor, which may for example be defined as the ratio of dimensions between the expanded internal state and the input vectors, may be of the order of 2, 5, 10 or more. Other implementations of the feed forward network 212 are possible, for example in which the ReLU activation function is replaced by a Gaussian error linear unit (GeLU) or the Swish function. Other implementations are based on the Gated Linear Unit (GLU), which employs a component-wise product of two linear transformations of the input, one of which is sigmoid-activated. The sigmoid activation may also be omitted, resulting in a so-called bilinear feed forward network. In a particularly preferred example, the feed forward network 212 comprises a sigmoid-weighted gated linear unit (SwiGLU), which employs a component-wise product of two linear transformations of the input, one of which is swish-activated, as shown below: $\mathrm{FFN}_{\mathrm{SwiGLU}}(y_i; W_1, V, W_2) = (\mathrm{Swish}_\beta(y_i W_1) \otimes y_i V) W_2$, where $W_1$, $V$, $W_2$ are learned weight matrices and the Swish function is defined as $\mathrm{Swish}_\beta(z) = z \, \sigma(\beta z)$, with $\sigma$ denoting the sigmoid function. It will be appreciated that variations on these functions are possible, for example by adding learned bias vectors or changing the value of the parameter $\beta$ of the SwiGLU function.
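A sketch of such a SwiGLU feed forward network in PyTorch is given below; the choice of β = 1 (i.e. nn.SiLU) and the omission of biases are assumptions for illustration.

import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    # FFN_SwiGLU(y) = (Swish(y W1) ⊗ y V) W2, with biases omitted.
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)
        self.v = nn.Linear(d_model, d_hidden, bias=False)
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)
        self.swish = nn.SiLU()  # Swish with beta = 1

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # Component-wise product of a swish-gated branch and a linear
        # branch, expanded to d_hidden and projected back to d_model.
        return self.w2(self.swish(self.w1(y)) * self.v(y))

Here d_hidden / d_model corresponds to the expansion factor discussed above, and returning to d_model allows blocks to be stacked arbitrarily.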
The transformer encoder block 208 may perform additional operations. For example, the transformer encoder block 208 may apply residual connection(s) in parallel with the attention layer 210 and/or the feed forward network 212, in which the input vector is added to the output of the attention layer 210 and/or feed forward network 212. Residual connections mainly help mitigate vanishing gradients during training, whilst also encouraging the feed forward network 212 to learn representations that retain local information, which may be particularly valuable in the present setting where local effects may be more important than long-range effects. Additionally, or alternatively, the transformer encoder block 208 may include one or more normalisation functions, which may reduce training time and improve stability by preventing exploding parameter values. Examples of normalisation functions include batch normalisation and layer normalisation. Normalisation functions may be applied after each of the attention layer 210 and the feed forward network 212 (so-called post-norm), or before each of the attention layer 210 and the feed forward network 212 (so-called pre-norm).

The output of the final transformer encoder block 208 is a sequence of intermediate hidden states 214. The number (or frequency) of intermediate hidden states 214 may be equal to the number (or frequency) of feature vectors 206 generated by the downsampling neural network layers 204. Depending on the level of downsampling applied by the neural network layers 204, the frequency of the intermediate hidden states 214 may be comparable to, or even less than, the rate at which polymer units are translocated relative to the nanopore. It may therefore not be possible to map individual intermediate hidden states to polymer units for classification without the possibility of some polymer units being missed, despite the hidden states 214 encoding sufficient information to identify all of the polymer units. To mitigate this issue, the machine learning model may apply an upsampling layer 216 to obtain a sequence of upsampled hidden states 218 which may have a sufficiently high frequency that at least one hidden state will map to each individual polymer unit. The frequency of the sequence of upsampled hidden states 218 may be greater than the average translocation rate, for example between 1.5 and 5 times the average translocation rate. The upsampling layer may increase the frequency by a factor of 2, 5, 8, or any other suitable upsampling factor depending on the frequency of the sequence of intermediate hidden states and the average (or minimum) translocation rate. The upsampling layer 216 may be any layer or function capable of increasing the frequency or sampling rate of a sequence while maintaining information within the sequence. For example, the upsampling layer 216 may be a convolutional upsampling layer, an attention upsampling layer, or a linear upsampling layer. Linear upsampling may be performed by increasing the dimension of an intermediate hidden state 214 of shape $(1, d)$ to have a stretched shape of $(1, kd)$, where $k$ is the upsampling factor, then reshaping the resulting vector from shape $(1, kd)$ to $(k, d)$. The upsampling factor may be, for example, 2, 3, 5, or any other value to achieve a suitable frequency for the upsampled hidden states 218. Linear upsampling has the advantage of being computationally less expensive than convolutional upsampling or attention upsampling.
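A minimal sketch of this linear upsampling in PyTorch, assuming an upsampling factor k (all names are illustrative):

import torch
import torch.nn as nn

class LinearUpsample(nn.Module):
    # Projects each intermediate hidden state of dimension d to k*d,
    # then reshapes so that each input state yields k upsampled states.
    def __init__(self, d: int, k: int):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(d, k * d)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, t, d) intermediate states -> (batch, t * k, d)
        batch, t, d = h.shape
        return self.proj(h).reshape(batch, t * self.k, d)

As an example of the arithmetic, with a 12x downsampling of a 5kHz signal and k = 3, the upsampled hidden states arrive at 1250 Hz, around three times a 400 units/s translocation rate.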
The upsampled hidden states 218 are processed using a network head 220 to estimate a sequence of polymer units of the polymer. The network head 220 in this example includes a projection layer 222 for projecting each upsampled hidden state 218 to a respective array of values 224 indicating an estimated likeliness of the hidden state 218 being associated with each of a set of symbols, the set of symbols comprising symbols associated with respective types of canonical polymer unit. The likeliness arrays may be vectors. Within the meaning of the present disclosure, the term likeliness may refer to a measure of probability or any other quantity representing how likely an event is to happen (possibly, but not necessarily, normalised within a range of zero to one). The set of symbols may correspond to canonical nucleotide bases such as adenine (A), cytosine (C), guanine (G) and thymine (T) for DNA basecalling, or adenine (A), cytosine (C), guanine (G) and uracil (U) for RNA basecalling. For other types of polymer, other symbols may be included.

The network head 220 analyses the arrays of values 224, for example by applying a search algorithm to the arrays of values 224, to estimate the sequence of polymer units of the polymer. As explained above, the frequency of the hidden states 218 arriving at the network head may be higher than the translocation rate of the polymer units, meaning that there is not a one-to-one mapping of arrays 224 to polymer units. To address this issue, the network head may adopt a connectionist temporal classification (CTC) topology, and may therefore be referred to as a CTC head or a conditional random field-CTC (CRF-CTC) head, depending on how the arrays 224 are generated and how the network head 220 is trained. As discussed in more detail hereinafter, using a CTC-CRF head may lead to improved accuracy as it may better capture conditional dependencies between hidden states 218 than a “vanilla” CTC head. According to the CTC topology, an additional “blank” symbol (e.g., “-”) is introduced to the set of symbols, and the algorithm includes determining and collapsing an intermediate sequence of symbols to arrive at a final estimate. For example, the search algorithm may collapse the intermediate sequence of symbols by (1) merging repeated symbols and then (2) removing instances of the blank symbol. In the example of Fig. 2, an intermediate sequence 226 of symbols corresponding to DNA nucleotide bases is given by CGG-A-TT. Repeated symbols (i.e. neighbouring symbols that are identical to one another) are merged, resulting in a partially collapsed sequence 228 of symbols given by CG-A-T. Instances of the blank symbol are then removed to determine a fully collapsed sequence 230, which may be returned as the estimated sequence of polymer units. The network head 220 may determine the intermediate sequence 226 of symbols by selecting the most likely symbol from each array 224, which may result in the intermediate sequence with the highest possible posterior probability. This may in turn result in the estimated nucleotide sequence with the highest posterior probability, which is the desired behaviour of the network head 220. However, in some cases multiple intermediate sequences may map to a common fully collapsed sequence (for example, intermediate sequences CGG-A-TT and CCGA-TT both map to a fully collapsed sequence CGAT). Accordingly, the intermediate sequence with the highest posterior probability may not correspond to the fully collapsed sequence with the highest posterior probability.
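This greedy decode-and-collapse procedure may be sketched as follows for DNA symbols, with “-” as the blank symbol (illustrative only):

import torch

SYMBOLS = ["-", "A", "C", "G", "T"]  # blank first, then canonical bases

def greedy_ctc_decode(arrays: torch.Tensor) -> str:
    # arrays: (t, 5) likeliness values, one row per upsampled hidden state.
    picks = arrays.argmax(dim=-1).tolist()       # most likely symbol per state
    intermediate = [SYMBOLS[i] for i in picks]
    # (1) merge repeated symbols, then (2) remove blanks.
    merged, prev = [], None
    for s in intermediate:
        if s != prev:
            merged.append(s)
        prev = s
    return "".join(s for s in merged if s != "-")

Applied to the Fig. 2 example, an intermediate pick of CGG-A-TT merges to CG-A-T and collapses to CGAT.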
The relevant metric is the cumulative posterior probability over all intermediate sequences corresponding to a given collapsed sequence. However, any exact algorithm for finding the collapsed sequence with the highest cumulative posterior probability faces a combinatorial explosion, which prohibits the use of such algorithms for long sequences of polymer units. As an alternative, a heuristic known as a beam search may be used, which moves along the sequence of arrays 224 in a sequential manner, computing an updated set of hypotheses at each step. The updated set of hypotheses at each step is derived from the set of hypotheses at the previous step by extending each hypothesis with all possible symbols and keeping only the top $k$ candidates, where $k$ is a hyperparameter referred to as the beam width. The beam width $k$ may for example be 2, 4, 8, 16, 32, 64, or any other suitable size. With the CTC topology, a score (e.g., probability) is determined at each step after collapsing repeated symbols and removing blank symbols. In some examples, separate scores may be tracked for hypotheses that end in the blank symbol and hypotheses that do not end in the blank symbol, and these separate scores may be combined to determine which hypothesis branches to maintain. In the case of a beam search, the intermediate sequence 226 may correspond to the remaining branch after the algorithm has run through all of the arrays 224. It will be appreciated that other network heads may be used in place of the CTC head or CTC-CRF head described above, such as a transformer decoder head which may be arranged to generatively predict the sequence of polymer units using the set of hidden states 218 as context. However, compared with a CTC-based head or another type of relatively light-weight head, a transformer decoder head would include additional blocks of attention layers and feed forward networks, which may significantly increase the computational complexity of performing inference, while the additional trainable parameters may result in more training runs being needed to achieve a comparable level of accuracy.

To summarise, Fig. 3 shows an example of a computer-implemented method 300 of analysing a signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore. The method 300 may use a neural network model comprising a transformer encoder network and a network head, as described above with reference to Fig. 2. The method 300 includes determining, at 302, a sequence of hidden states associated with the signal using the transformer encoder network, and processing the sequence of hidden states using the network head to estimate a sequence of polymer units of the polymer. The processing by the network head includes generating, at 304, a respective array of values for each hidden state indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, where the set of symbols includes symbols associated with respective types of canonical polymer unit, and analysing the generated arrays of values to estimate, at 306, the sequence of polymer units of the polymer. Fig. 4 shows an example of a computer-implemented method 400 of training a neural network model comprising a transformer encoder network and a network head for use in the computer-implemented method 300 of Fig. 3.
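Before turning to training, the beam search described above may be illustrated by the following compact sketch of a CTC prefix beam search, which tracks separate blank-ending and non-blank-ending scores per collapsed prefix. It works in the probability domain with no log-space arithmetic or length normalisation, and is illustrative only.

from collections import defaultdict

def ctc_beam_search(probs, symbols, beam_width=8, blank=0):
    # probs: iterable of per-step probability arrays over the symbol set,
    # with the blank symbol at index `blank`. Each hypothesis is a collapsed
    # prefix; p_b / p_nb are the probabilities of reaching it via paths
    # ending in a blank / non-blank symbol respectively.
    beams = {(): (1.0, 0.0)}
    for step in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(step):
                if s == blank:
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (b + (p_b + p_nb) * p, nb)
                elif prefix and prefix[-1] == s:
                    # Repeat symbol: only blank-ending paths start a new unit;
                    # non-blank-ending paths merge back into the same prefix.
                    b, nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (b, nb + p_b * p)
                    b2, nb2 = next_beams[prefix]
                    next_beams[prefix] = (b2, nb2 + p_nb * p)
                else:
                    b, nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (b, nb + (p_b + p_nb) * p)
        # Keep only the top-k prefixes by cumulative probability (beam width).
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
    return "".join(symbols[s] for s in best)

For example, ctc_beam_search(arrays.softmax(-1).tolist(), SYMBOLS) reuses the SYMBOLS list from the greedy sketch above, with arrays being the (t, 5) head output.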
The method 400 includes providing, at 402, a training signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore, the training signal being associated with a target sequence of polymer units. The method 400 includes determining, at 404, a sequence of hidden states associated with the signal using the transformer encoder network, processing, at 406, the sequence of hidden states using the network head to generate a respective array of values indicating an estimated likeliness of each hidden state being associated with each of a set of symbols comprising symbols associated with respective types of canonical polymer unit, and updating, at 408, parameters of the neural network based on a loss function that depends on the generated arrays of values and the target sequence of polymer units.

The training signal and associated target sequence may be obtained by measuring a polymer with a known sequence using the sensor element, or by measuring a polymer with an unknown sequence using the sensor element and using an alternative sequencing method to determine the target sequence. In either case, the training signal and the target sequence may not necessarily be aligned, meaning that there may not be a known correspondence between positions in the target sequence and sections of the training signal. The training signal may be part of a dataset comprising a large number of training signals each having respective target sequences. The dataset may for example include hundreds, thousands, tens of thousands, hundreds of thousands, or millions of training signals. The dataset may include polymers of a given class, such as polynucleotides, DNA, RNA, animal DNA, plant DNA, animal RNA, plant RNA, etc., or may include polymers of a wide range of classes. The updating of the parameters may be performed using a gradient-based optimisation method, in which one or more loss values corresponding to evaluations of the loss function over a batch of one or more training signals are backpropagated through the neural network to determine the gradient of the loss (or losses) with respect to the parameters of the neural network. Values of the parameters (e.g., weights, biases) may be updated using a gradient-based optimiser, such as stochastic gradient descent (SGD), Adam, RMSProp, or AdamW. The optimiser may have one or more hyperparameters such as learning rate, which may be held constant or may be varied according to a predetermined or learned schedule. For example, learning rate warm-up may be applied in which the learning rate increases gradually from a very small value in order to mitigate issues associated with exploding gradients in early phases of the training. Techniques such as batch normalisation may be used to improve the efficiency of the training process. This process may be repeated over many iterations covering one or more passes through the dataset (epochs), for example until a stopping condition is satisfied, such as a convergence condition or after a predetermined number of iterations or epochs (such as 10, 30, 50 or 100 epochs), at which point the neural network may be validated and tested using validation and test datasets before being deployed for use.

For a signal $x$ and associated target sequence $y$ within the training dataset, the loss function may include a CTC loss corresponding to a negative log-likelihood of the target sequence $y$ given the signal $x$, i.e. $\mathcal{L}_{\mathrm{CTC}}(x, y) = -\log P(y \mid x)$.
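For the CTC term, a sketch of a single training step using the built-in PyTorch CTC loss is given below; the stand-in model, shapes and hyperparameters are assumptions for illustration, and a real model would be the transformer encoder and network head described above.

import torch
import torch.nn as nn

class ToyBasecaller(nn.Module):
    # Stand-in for the full model: any module mapping raw signals of shape
    # (batch, samples) to arrays of values of shape (batch, t, n_symbols).
    def __init__(self, n_symbols: int = 5, d: int = 32):
        super().__init__()
        self.conv = nn.Conv1d(1, d, kernel_size=9, stride=6, padding=4)
        self.head = nn.Linear(d, n_symbols)

    def forward(self, signals: torch.Tensor) -> torch.Tensor:
        x = self.conv(signals.unsqueeze(1)).transpose(1, 2)
        return self.head(x)

model = ToyBasecaller()
ctc_loss = nn.CTCLoss(blank=0)  # blank symbol at index 0
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(signals, targets, target_lens):
    # signals: (batch, samples); targets: (batch, max_target_len) symbol
    # indices; target_lens: (batch,) true target lengths. Signals are
    # assumed unpadded here, so every output position is valid.
    log_probs = model(signals).log_softmax(-1).transpose(0, 1)  # (t, batch, n_symbols)
    t, batch = log_probs.shape[0], log_probs.shape[1]
    input_lens = torch.full((batch,), t, dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lens, target_lens)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

Note that no explicit alignment between signal and target is supplied; the CTC loss marginalises over alignments internally via the forward-backward algorithm described below.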
For long sequences, this term (and its gradient) can be prohibitively expensive to compute using the naïve approach of enumerating and summing the probabilities of each possible alignment. However, the CTC loss can be calculated efficiently using a dynamic programming algorithm referred to as a forward-backward algorithm, in which a forward variable and a backward variable are recursively calculated during forward and reverse passes through the sequence of hidden states and then used to determine the log-likelihood.

A possible drawback of using the CTC loss is that it works on the basis that every symbol is conditionally independent of every other symbol given the sequence of hidden states, and may therefore fail to leverage information pertaining to commonly occurring patterns or correlations in the symbols. To alleviate this issue, a CTC-CRF loss may be used, which may result in the arrays of values being determined as a CRF, which can capture conditional dependencies between symbols. The CRF models the conditional probability of a hidden state sequence $\pi$ given an input sequence $x$ as follows:

$$P(\pi \mid x) = \frac{\exp(\phi(\pi, x))}{\sum_{\pi'} \exp(\phi(\pi', x))},$$

where $\phi(\pi, x) = \sum_t \log p(\pi_t \mid x) + \log p(y)$ is a potential function. The hidden state sequence $\pi$ is connected to the target sequence $y$ by a mapping $B: \Sigma_\pi^* \to \Sigma_y^*$, where $\Sigma_\pi$ and $\Sigma_y$ are the symbol sets for the hidden states and the target sequence respectively. The CTC-CRF loss may then be given by:

$$\mathcal{L}_{\mathrm{CTC\text{-}CRF}}(x, y) = -\log \sum_{\pi \in B^{-1}(y)} P(\pi \mid x).$$

The CTC-CRF loss can also be computed using a forward-backward algorithm. In some examples, the loss function may include a weighted linear combination of the CTC loss and the CTC-CRF loss (for example, with the CTC loss having a relatively lower weight, such as 0.01 or 0.1 times the weight of the CTC-CRF loss). Including both loss terms may help with convergence during training.

Regularisation and other techniques may be applied during training of the neural network model, for example to improve the convergence properties, stability and speed of the training. Examples include layer drop, in which certain layers are randomly dropped with a predetermined probability during training, and dropout, in which certain parameters such as elements of the learned matrices are randomly zeroed with a predetermined probability during training. A further option is batch normalisation. Suitable batch sizes for training may be in the region of 32, 64, 128, or 256 training signals.

The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, while the methods described above may be applied to a single strand of a polynucleotide, in other examples a signal or signals may include measurements of multiple polymer strands, for example two complementary strands of a polynucleotide. In such cases, the transformer encoder network may be adapted to process the measurements of both strands. By leveraging the information from both complementary strands, accuracy may be further improved, as the effects of measurement errors, noise, and other sources of error may be mitigated by the inherent redundancy of information between the strands.

Fig. 5A shows an example of a composite signal formed of a first signal portion 502 concatenated with a second signal portion 504. The first signal portion 502 comprises measurements of a first polynucleotide strand and the second signal portion 504 comprises measurements of a second polynucleotide strand, where the first and second polynucleotide strands are complementary strands.
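The forward pass of the dynamic programming computation mentioned above can be illustrated as follows. This is a minimal sketch of the standard CTC forward recursion in log space, computing the negative log-likelihood in $O(T \cdot |y|)$ time rather than by enumerating alignments; function and variable names are chosen for illustration.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Forward pass of the CTC forward-backward algorithm.

    log_probs: (T, C) per-step log-probabilities over symbols.
    target:    non-empty sequence of symbol indices (no blanks).
    Returns -log P(target | signal), summing over all valid alignments."""
    # Interleave blanks: y1 y2 -> blank y1 blank y2 blank
    ext = [blank]
    for s in target:
        ext += [s, blank]
    S, T = len(ext), len(log_probs)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]            # start with a blank...
    alpha[0, 1] = log_probs[0, ext[1]]            # ...or with the first symbol
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]             # stay in the same state
            if s > 0:
                terms.append(alpha[t - 1, s - 1]) # advance by one state
            # Skip a blank only between two *different* non-blank symbols.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]
    # Valid paths end on the final blank or on the final symbol.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

The backward pass (not shown) fills an analogous variable from the end of the sequence; the forward and backward variables together yield the gradients needed for training, and a similar recursion over the CRF potentials can be used for the CTC-CRF loss.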
The start of the first signal portion 502 is indicated by a first dashed line 506, while the start of the second signal portion 504 is indicated by a second dashed line 508. The two signal portions are at least partially aligned, in the sense that the start of the first signal portion 502 and the start of the second signal portion 504 correspond to approximately equal locations within the respective polynucleotide strands. The concatenated signal portions 502, 504 may be processed according to the methods described above to generate a sequence of feature vectors 510.

In this example, the sequence of feature vectors 510 is processed using one or more transformer encoder blocks having a sliding window cross-attention layer 512. The attention layer 512 in this example has a receptive field corresponding to two sliding windows 514 and 516. For a feature vector 510 at position $i$ derived from the first signal portion 502, the receptive field may be given by $R^{(1)}(i) = W(i) \cup W(i + N_f)$, where $W(i)$ denotes a sliding window centred at position $i$ and $N_f$ is the number of feature vectors derived from each signal portion. For a feature vector 510 at position $i$ derived from the second signal portion 504, the receptive field may be given by $R^{(2)}(i) = W(i) \cup W(i - N_f)$. In this way, the output of the attention layer 512 attends to feature vectors within both sliding windows, allowing relationships between the first signal portion 502 and the second signal portion 504 to be captured. This dual sliding window approach may benefit from the advantages of sliding window attention described elsewhere in the present disclosure, provided the degree of misalignment is sufficiently small that corresponding parts of the first signal portion 502 and the second signal portion 504 concurrently fall within the first and second sliding windows 514, 516. A minimal sketch of such a dual-window attention mask is given after the closing remarks below.

Fig. 5B illustrates an alternative method of analysing the composite signal of Fig. 5A. In this example, one or more transformer encoder blocks use a global self-attention layer 518 with a receptive field covering all $2N_f$ feature vectors 510 from both signal portions (assuming $N_f$ feature vectors are derived from each of the signal portions 502, 504). The global self-attention layer 518 advantageously may not require any alignment between the first signal portion 502 and the second signal portion 504, and so may be applicable in settings where such alignment cannot be readily obtained.

Still further embodiments are envisaged. For example, further modifications to the neural network architectures discussed above may be employed, for example configuring one or more of the transformer encoder blocks as a mixture of experts to achieve more efficient training, and/or quantising one or more parts of the neural network model. Furthermore, additional steps may be carried out in dependence on the estimated sequence of polymer units of the polymer. For example, the estimated sequence may be output to a display, to a file or memory, or to another software application. Additionally, or alternatively, the estimated sequence may be used to determine whether to eject the polymer being sequenced, and/or to translocate the polymer at the same or a faster rate without making further electrical measurements, for example to mitigate use of computational resources.

It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments.
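Returning to the dual sliding window attention described with reference to Fig. 5A, one possible realisation is a boolean attention mask in which each position attends to a window around itself and a window around the corresponding position in the other signal portion. The following sketch is illustrative only; the function name, window width and portion length are assumed values.

```python
import numpy as np

def dual_window_mask(n_f, window=128):
    """Boolean attention mask for a dual sliding window scheme.

    n_f:    number of feature vectors per signal portion (assumed equal).
    window: total width of each sliding window.
    Returns a (2*n_f, 2*n_f) mask, True where attention is permitted."""
    n = 2 * n_f
    idx = np.arange(n)
    # First window: positions within window/2 of the query position itself.
    same = np.abs(idx[None, :] - idx[:, None]) <= window // 2
    # Corresponding position in the other portion: i + n_f for the first
    # portion, i - n_f for the second portion.
    partner = np.where(idx < n_f, idx + n_f, idx - n_f)
    # Second window: positions within window/2 of the partner position.
    cross = np.abs(idx[None, :] - partner[:, None]) <= window // 2
    return same | cross

# Example: two portions of 512 feature vectors each.
mask = dual_window_mask(512, window=128)
assert mask[0, 512]  # position 0 may attend to its partner at position 512
```

Such a mask keeps the cost advantages of sliding window attention, since each query attends to at most two fixed-width windows, while allowing the misalignment tolerance discussed above to be tuned via the window width.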
Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A computer-implemented method of estimating a sequence of polymer units of a polymer from a signal comprising measurements of the polymer generated during translocation of the polymer with respect to a nanopore, comprising: determining, using a transformer encoder network, a sequence of hidden states associated with the signal; processing the sequence of hidden states using a network head to estimate a sequence of polymer units of the polymer, the processing comprising: generating, for each hidden state, a respective array of values indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, the set of symbols comprising symbols associated with respective types of canonical polymer unit; and analysing the generated arrays of values to estimate the sequence of polymer units of the polymer.

2. The computer-implemented method of claim 1, wherein the transformer encoder network uses sliding window self-attention, whereby each hidden state depends on a respective portion of the signal.

3. The computer-implemented method of claim 2, wherein the polymer comprises at least 100,000 polymer units.

4. The computer-implemented method of claim 2 or claim 3, wherein the sliding window attention uses a window size of between 64 and 512.

5. The computer-implemented method of any preceding claim, wherein determining the sequence of hidden states comprises: processing, using one or more neural network layers, the signal to generate a sequence of feature vectors; and processing, using the transformer encoder network, the sequence of feature vectors to determine the sequence of hidden states.

6. The computer-implemented method of claim 5, wherein: polymer units of the polymer pass through the nanopore at an average translocation rate; the signal has a first frequency that is greater than the average translocation rate; the sequence of feature vectors has a second frequency that is lower than the first frequency; and processing the sequence of feature vectors comprises: generating, as an output of the transformer encoder network, a sequence of intermediate states having the second frequency; and upsampling the sequence of intermediate states to determine the sequence of hidden states, wherein the sequence of hidden states has a third frequency that is no less than the average translocation rate.

7. The computer-implemented method of claim 6, wherein the third frequency is less than the first frequency.

8. The computer-implemented method of claim 6 or claim 7, wherein the third frequency is between 1.5 and 5 times the average translocation rate.

9. The computer-implemented method of any of claims 6 to 8, wherein the first frequency is at least 3 times greater than the average translocation rate.

10. The computer-implemented method of any of claims 6 to 9, wherein a ratio between the first frequency and the second frequency is between 5 and 20.

11. The computer-implemented method of any of claims 6 to 10, wherein the upsampling is linear upsampling.

12. The computer-implemented method of any of claims 5 to 11, wherein the one or more neural network layers comprise one or more convolutional layers.

13. The computer-implemented method of any preceding claim, wherein: the set of symbols comprises a blank symbol; and analysing the generated arrays of values comprises: determining an intermediate sequence of symbols based on the generated arrays of values; and collapsing the intermediate sequence of symbols by merging repeated symbols and then removing instances of the blank symbol.

14. The computer-implemented method of claim 13, wherein determining the intermediate sequence of symbols comprises performing a beam search based on the generated arrays of values.

15. The computer-implemented method of any preceding claim, wherein the network head applies a linear projection to the sequence of hidden states to generate the arrays of values.

16. The computer-implemented method of any preceding claim, wherein the network head is configured to generate the arrays of values as a conditional random field (CRF).

17. The computer-implemented method of any preceding claim, wherein: the polymer is a first polymer; the signal further comprises measurements of a second polymer by the sensor element during translocation of the second polymer with respect to the nanopore, the first polymer and the second polymer being a complementary pair; and the transformer encoder network comprises an attention mechanism arranged to capture relationships between measurements associated with mutually corresponding segments of the first polymer and the second polymer.
18. A computer-implemented method of training a neural network model comprising a transformer encoder network and a network head, the method comprising: providing a signal comprising measurements of a polymer by a sensor element comprising a nanopore during translocation of the polymer with respect to the nanopore, the signal being associated with a target sequence of polymer units; determining, using the transformer encoder network, a sequence of hidden states associated with the signal; processing the sequence of hidden states using the network head to generate, for each hidden state, a respective array of values indicating an estimated likeliness of the hidden state being associated with each of a set of symbols, the set of symbols comprising symbols associated with respective types of canonical polymer unit; and updating parameters of the neural network based on a loss function depending on the generated arrays of values and the target sequence of polymer units.

19. The computer-implemented method of claim 18, wherein the loss function comprises at least one of a connectionist temporal classification (CTC) loss and a connectionist temporal classification-conditional random field (CTC-CRF) loss.

20. The computer-implemented method of any preceding claim, wherein the polymer is a polynucleotide, and the polymer units are nucleotides.

21. The computer-implemented method of any preceding claim, wherein the signal is an electrical signal.

22. The computer-implemented method of any preceding claim, wherein the signal is indicative of ion flow through the nanopore.

23. A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any preceding claim.
24. A data processing system comprising means for carrying out the method of any of claims 1 to 22.

25. Apparatus comprising: a sensor element comprising a nanopore; means for translocating a polymer with respect to the nanopore; and a data processing system comprising means for carrying out the method of any of claims 1 to 17 to analyse a signal comprising measurements of a polymer by the sensor element during translocation of the polymer with respect to the nanopore.

26. A method comprising: translocating a polymer with respect to a nanopore; generating a signal comprising measurements of the polymer by a sensor element comprising the nanopore; and analysing the signal using the computer-implemented method of any of claims 1 to 17.