WO2025184420A1

WO2025184420A1 - Hybrid neural networks with attention and recurrence

Info

Publication number: WO2025184420A1
Application number: PCT/US2025/017712
Authority: WO
Inventors: Samuel Laurence SMITH; Soham De; Aleksandar Stoyanov BOTEV; Anushan Kalinga FERNANDO; George-Cristian MURARU; Ruba MUTASIM HAROUN ALI; Albert GU; Razvan PASCANU; Caglar GULCEHRE; Joao Ferdinando GOMES DE FREITAS
Original assignee: DeepMind Technologies Ltd; Gdm Holding LLC
Current assignee: DeepMind Technologies Ltd; Gdm Holding LLC
Priority date: 2024-02-27
Filing date: 2025-02-27
Publication date: 2025-09-04
Anticipated expiration: 2026-08-27

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing input sequences using a hybrid neural network implementing both attention and recurrence to perform one or more machine learning tasks. In one example, a method performed by one or more computers is described. The method includes: receiving an input sequence including a respective input token at each of a number of input positions; and processing the input sequence, using a neural network, to generate a network output. The neural network includes a number of layer blocks, including: (i) one or more attention layer blocks, and (ii) one or more recurrent layer blocks. Each attention layer block includes an attention layer configured to apply an attention mechanism. Each recurrent layer block includes a recurrent layer configured to apply a recurrent operation.

Description

Attorney Docket No.45288-0438WO1 HYBRID NEURAL NETWORKS WITH ATTENTION AND RECURRENCE CROSS-REFERENCE TO RELATED APPLICATIONS [0001] This application claims priority to U.S. Provisional Patent Application No. 63/558,623, titled “HYBRID NEURAL NETWORKS WITH ATTENTION AND RECURRENCE”, filed on Feb. 27, 2024, which is hereby incorporated by reference in its entirety. TECHNICAL FIELD [0002] This disclosure relates generally to methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing input sequences using a neural network implementing attention and recurrence to perform one or more machine learning tasks. BACKGROUND [0003] This disclosure relates to neural networks. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current value inputs of a respective set of parameters. SUMMARY [0004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that processes input sequences, using a “hybrid” neural network implementing both attention and recurrence, to perform one or more machine learning tasks. [0005] Recurrent neural networks (“RNNs”) scale efficiently on long sequences at interface, but are difficult to train due to the exploding (and vanishing) gradient problem. Traditional RNNs also suffer from poor scaling during training, arising from sequential data processing that prevents parallelization. Attorney Docket No.45288-0438WO1 [0006] In contrast, implementations of the neural network described in this specification mix attention and recurrence to achieve state-of-the-art performance at inference while being significantly more efficient to train than other models, e.g., recurrent-only models, Transformers, and attention-only models. In experiments, the neural network matched or exceeded the performance of such models on downstream tasks, while being trained on significantly fewer tokens, e.g., 7 times fewer tokens in some cases. The neural network described in this specification was also able to extrapolate on sequences significantly longer than those seen during training. The neural network matched the hardware efficiency of Transformers during training, and at inference it had lower latency and significantly higher throughput. This specification also describes techniques for scaling the neural network up to 14 billion parameters or more, and how to shard the neural network for highly efficient distributed training. [0007] Implementations of the neural network can be configured through training to perform any kind of machine learning task. That is, the neural network can be configured to receive any kind of input sequence and process the input sequence to generate any kind of network output, e.g., a score, a classification, or a regression output, based on the input sequence. For brevity, examples of machine learning tasks that the neural network can perform are described at the end of this specification. [0008] In general, the input sequence includes a respective input token at each of multiple input positions. In some situations, the neural network can be referred to as an auto-regressive neural network, i.e., because the neural network auto-regressively generates an output sequence of tokens using its network outputs. More specifically, the neural network auto-regressively generates the output sequence by generating each particular token in the output sequence conditioned on a current input sequence that includes any tokens that precede the particular token in the output sequence, i.e., the tokens that have already been generated for any previous positions in the output sequence that precede the particular position of the particular token. [0009] Examples of the neural network include an input subnetwork, a residual network, and an output subnetwork. The input subnetwork is configured to encode the input sequence into an input embedding sequence. The residual network includes a sequence of residual layer blocks that continually update the input embedding sequence output by the input subnetwork. In general, each residual layer block includes: (i) an attention layer block that applies an attention operation via an attention layer, or (ii) a recurrent layer block that applies a recurrence operation via a recurrent Attorney Docket No.45288-0438WO1 layer (e.g., a gated linear recurrent layer). Each residual layer block can include a feedforward layer block (e.g., a gated multilayer perceptron (“MLP”) block) that applies an activation function, e.g., a unidirectional nonlinear operation, via a feedforward layer. The output subnetwork is configured to process an output embedding sequence to generate a network output including a score distribution, e.g., over a vocabulary of tokens of a tokenizer. For example, the output embedding sequence can be the residual embedding sequence or a combination of the input and residual embedding sequences. [0010] Unlike conventional attention or recurrent neural networks that each utilize only attention or recurrence operations, implementations of the neural network described in this specification utilize both attention and recurrence. Hence, the neural network may be referred to as a “hybrid neural network”. In many cases, the neural network can include significantly more recurrent layer blocks than attention layer blocks. For example, the neural network can include 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, or more recurrent layer blocks for each attention layer block. Using more recurrent layer blocks than attention layer blocks can speed up inference times and reduce memory costs compared to conventional attention-based neural networks (e.g., Transformers), while simultaneously having equal or better inference accuracy (and training loss) for the same model size. This is because the standard, global attention mechanism scales quadratically with respect to sequence length ^, involving ^^^^{^}^ time and space during inference. In contrast, recurrence scales linearly ^^^^ with sequence length at inference and is typically faster than attention mechanisms even for modest sequence lengths. [0011] Moreover, the recurrent layer of each recurrent layer block can be a gated linear recurrent layer (or “GLRU”), which is novel version of a linear recurrent layer utilizing gating mechanisms on the recurrence weights, but not on the hidden state itself. Thus, the recurrence relation is linear with respect to the hidden state which allows the GLRU to be parallelized during inference and training, e.g., using parallel (or associative) scans. Implementing parallel scans allows the neural network to be implemented efficiently on hardware that is optimized for parallel processing, e.g., one or more graphics processing units (“GPUs”). Linear recurrence can also mitigate (or eliminate) the exploding/vanishing gradient problem in conventional RNNs, which has limited their use in modern applications involving long-range sequence processing, e.g., natural language processing (“NLP”). Particularly, the neural network described in this specification can be scaled to sizes typical of large language models (“LLMs”) (e.g., having 14 billion or more parameters) and can Attorney Docket No.45288-0438WO1 efficiently process long-range input sequences (e.g., sequences including 2048 or more input tokens, e.g., 4096 or more input tokens, 8192 or more input tokens, 16384 or more input tokens, 32768 or more input tokens, 65536 or more input tokens). [0012] In general, the attention layer of each attention layer block is configured to: receive a layer input sequence; and apply an attention mechanism over the layer input sequence to generate a layer output sequence. The attention layer blocks can all include a global attention layer, a local attention layer, or a mix of both. For example, the attention layer blocks can be grouped into: (i) a subset of attention layer blocks that each include a global attention layer, and (ii) a complement of the subset that each include a local attention layer. [0013] Each global attention layer applies a global attention mechanism that, for each input position, attends over all of the input positions preceding or equal to the input position. The global attention mechanisms applied by the global attention layers can be dense attention mechanisms or sparse attention mechanisms. [0014] Each local attention layer, on the other hand, applies a local attention mechanism that, for each input position, attends only over a local subset of input positions that are within a local window of the input position. That is, unlike the global attention mechanisms, the local attention mechanism does not attend to any position that is outside of the local window of the input position. The local windows are generally “causal,” so that they include up to a fixed number of input positions that are closest to the input position and that precede or are equal to the input position, but not any input positions that are after the input position in the input sequence. The fixed number of input positions is generally smaller than the total number of input positions in the input sequence and is referred to as the size of the local window. For the local attention mechanism, for each input position, the input positions that are used to generate the queries, keys, and values for the input position are defined by the local window size for the local attention mechanism, i.e., non-zero attention weights for a given input position are computed only for input positions that are within the local window of the given input position. [0015] Each of the attention layers can also use an attention mechanism that applies a positional encoding to each of the input positions. “Positional Encoding” refers to modifying the operations applied by the attention layer for a given input position based on the absolute or relative position of the input position within the input sequence. For example, the positional encoding can be Rotary Attorney Docket No.45288-0438WO1 Positional Embedding (“RoPE”) or a different type of positional encoding, e.g., a relative positional encoding or an Attention with Linear Biases (“ALiBi”) positional encoding. [0016] Generally, to apply an attention mechanism, an attention layer includes one or more attention heads. Each attention head generates a set of queries, a set of keys, and a set of values, and then applies any of a variety of variants of query-key-value (“QKV”) attention, e.g., a dot product attention function or a scaled dot product attention function, using the queries, keys, and values to generate an output. Each query, key, value can be a vector that includes one or more vector elements. When there are multiple attention heads, the attention layer then combines the outputs of the multiple attention heads, e.g., by concatenating the outputs. In some cases, the attention mechanism uses multi-query attention (“MQA”), where each attention head shares a common set of keys and values but does not share queries. [0017] In some cases, because the attention mechanism applied by the attention layers is causal, the system can store, for any given attention mechanism and when generating the respective layer output embedding for any given input position, the layer output embeddings or the keys and values already computed for earlier input positions rather than re-computing the layer output embeddings (or the keys and values) for earlier input positions. Thus, in these cases, applying an attention mechanism over the layer input sequence refers to processing the respective layer input embedding for the last input position in the current input sequence using keys and values or layer output embeddings for the other input positions that have been retrieved from memory (e.g., from a “cache”). [0018] In general, the recurrent layer of each recurrent layer block is configured to, for each of the input positions: receive a layer input including: (i) a hidden state for a preceding input position, and (ii) a layer input embedding for the input position; process the layer input to generate a hidden state for the input position; and process the hidden state and layer input embedding for the input position to generate a layer output embedding for the input position. The recurrent layer blocks can all include a nonlinear recurrent layer, a linear recurrent layer, or a gated linear recurrent layer (“GLRU”). For example, in some implementations of the neural network, the recurrent layer of each recurrent layer block is a GLRU. [0019] In some implementations, each recurrent layer block is a gated recurrent layer block that includes: (i) a first channel, (ii) a second, parallel channel, and (iii) a multiplicative gate proceeding the first and second channels. The multiplicative gate can combine the outputs of the first and Attorney Docket No.45288-0438WO1 second channels by performing an elementwise multiplication operation between their respective layer output sequences. The first channel can include a feedforward layer configured to: receive a layer input sequence; and apply an activation function over the layer input sequence to generate a layer output sequence. For example, the feedforward layer can be a Rectified Linear Unit (“ReLU”), a Gaussian error Linear Unit (“GeLU”), a leaky ReLU, a sigmoid function, a hyperbolic tangent (“tanh”) function, a softmax function, or a swish function. The second channel can include: (i) a linear layer, (ii) a convolutional layer immediately proceeding the linear layer, and (iii) the recurrent layer immediately proceeding the convolutional layer. The linear layer is configured to: receive a layer input sequence; and apply a linear transformation over the layer input sequence to generate a layer output sequence. The convolutional layer is configured to: receive a layer input sequence; and apply a convolution operation over the layer input sequence to generate a layer output sequence. [0020] The feedforward layer of each feedforward layer block is configured to: receive a layer input sequence; and apply an activation function over the layer input sequence to generate a layer output sequence. For example, similar to above, the feedforward layer of each feedforward layer block can be a ReLU, a GeLU, a leaky ReLU, a sigmoid function, a tanh function, a softmax function, or a swish function. In some implementations, each feedforward layer block is a gated feedforward layer block that includes: (i) a first channel including the feedforward layer, (ii) a second, parallel channel including a linear layer; and (iii) a multiplicative gate proceeding the first and second channels. For example, the gated feedforward layer block can be a gated multi-layer perceptron (“MLP”) block. [0021] This specification also describes a method performed by one or more computers. The method includes: receiving an input sequence including a respective input token at each of multiple input positions; and processing the input sequence, using a neural network, to generate a network output. The neural network includes a number of layer blocks including: (i) one or more attention layer blocks, and (ii) one or more recurrent layer blocks. Each attention layer block includes an attention layer configured to: receive a layer input sequence comprising a respective layer input embedding for each of the input positions; and apply an attention mechanism over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the input positions. Each recurrent layer block includes a recurrent layer configured to implement one or more recurrence operations. Attorney Docket No.45288-0438WO1 [0022] In various implementations, the one or more recurrence operations can be implemented using parallel computation, e.g., using one or more parallel scans. The one or more recurrence operations can be implemented using parallel computing hardware. [0023] The one or more recurrence operations can include, for each of the input positions: receiving a layer input including: (i) a hidden state for a preceding input position, and (ii) a layer input embedding for the input position; processing the layer input to generate a hidden state for the input position; and processing the hidden state and layer input embedding for the input position to generate a layer output embedding for the input position. [0024] In particular, in various implementations, the recurrent layer of each recurrent layer block may be a linear recurrent layer, and for each linear recurrent layer, the hidden state for each input position may be linear in the hidden state for the preceding input position. [0025] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. [0026] In recent years, both deep learning and natural language processing (“NLP”) have been dominated by the Transformer architecture, which interleaves multi-layer perceptrons (“MLPs”) and multi-head attention (“MHA”). In practice, Transformers achieve better performance than recurrent neural networks (“RNNs”) and are efficient in utilizing modern hardware. The main reason for this is their ease of parallel training and efficient scalability on modern hardware. As a result, Transformer-based large language models (“LLMs”) trained on large datasets collected from the web have achieved many notable successes. [0027] However, despite their successes, Transformers are difficult to scale efficiently to long sequences due to the quadratic complexity of global attention. Additionally, the linear growth of their Key-Value (“KV”) cache with the sequence length makes them slow during inference time. While Multi-Query Attention (“MQA”) offers some mitigation, by reducing the cache size by a constant factor, it does not fully address the underlying issue. Recurrent-based language models present a compelling alternative as they compress the full sequence into a fixed-sized hidden state representation. [0028] Nevertheless, to replace Transformers, new RNN designs should demonstrate not only comparable performance at scale but also similar hardware efficiency. In this specification, a gated linear recurrent layer is introduced, referred to as the “GLRU”, around which a recurrent layer block is designed as a replacement for attention, e.g., MHA and/or MQA. Using the recurrent layer Attorney Docket No.45288-0438WO1 block, a “hybrid” neural network is introduced that interleaves feedforward layer blocks with a mix of recurrent layer blocks and attention layer blocks, e.g., where the recurrent layer blocks outnumber the attention layer blocks by factors of 2, 3, 4, 5, 6, 7, 8, 9, 10, or more. The neural network can speed up inference times and reduce memory costs compared to attention-only models, e.g., Transformers, while simultaneously having equal or better inference accuracy (and training loss) for the same model size. Due to implementing linear recurrence, the GLRU can also be parallelized during training and scaled to modern hardware in analogous manner as attention layers. [0029] Implementations of the neural network described in this specification have several other advantages when compared to Transformer and other attention-based architectures. For example, the neural network can exhibit power law scaling between held-out loss and training FLOPs (“Floating Point Operations”), e.g., up to and beyond 7 billion parameters, as observed for Transformers. The neural network can also achieve lower held-out loss and higher throughput than strong Transformer models, as well as achieving lower latency when sampling long sequences. Moreover, the neural network can perform better than Transformers when evaluated on input sequences longer than those seen during training, and can also efficiently learn copying and retrieval tasks. [0030] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims. BRIEF DESCRIPTION OF THE DRAWINGS [0031] FIG.1A is a schematic diagram depicting an example of a system configured to perform a machine learning task using a neural network. [0032] FIG.1B is a schematic diagram depicting an example of a residual layer block including a temporal mixing layer block and a feedforward layer block. [0033] FIG. 1C is a schematic diagram depicting an example of a temporal mixing layer block configured as a recurrent layer block including a recurrent layer. [0034] FIG. 1D is a schematic diagram depicting an example of a temporal mixing layer block configured as an attention layer block including an attention layer. [0035] FIG. 1E is a schematic diagram depicting an example of a feedforward layer block. Attorney Docket No.45288-0438WO1 [0036] FIG. 1F is a schematic diagram depicting an example of a recurrent layer configured as a Gated Linear Recurrent Unit (“GLRU”). [0037] FIG.2A is a flow diagram of an example process for processing an input sequence using a neural network to generate a network output. [0038] FIG. 2B is a flow diagram of an example process for processing an input embedding sequence using a residual network to generate a residual embedding sequence. [0039] FIG. 2C is a flow diagram of an example process for processing a block input sequence using a residual layer block to generate a block output sequence. [0040] FIG. 2D is a flow diagram of an example process for processing a layer input sequence using the GLRU to generate a layer output sequence. [0041] FIG. 3A is an experimental plot depicting scaling curves of models of a neural network with attention only (a Transformer (“TFM”) baseline), recurrence only (“Hawk”), and both attention and recurrence (“Griffin”). [0042] FIG. 3B is an experimental plot depicting maximum throughput of the Transformer baseline, Hawk, and Griffin models. [0043] FIGs. 4A-4C are experimental plots depicting training durations of the Transformer baseline and Griffin models versus sequence length for different sizes of the models. [0044] FIGs. 5A and 5B are experimental plots depicting latency of the Transformer baseline, Hawk, and Griffin models of the neural network versus different sampling prefills. [0045] FIGs.6A and 6B are experimental plots depicting performance of the Transformer baseline with no positional encoding (“NoPE”), Rotary Positional Embedding (“RoPE”), Hawk, and Griffin models at 1 billion parameters. [0046] FIGs.7A-7C are experimental plots depicting accuracy of the Transformer baseline, Hawk, and Griffin models on different copying and retrieval tasks. [0047] Like reference numbers and designations in the various drawings indicate like elements. DETAILED DESCRIPTION [0048] In recent years, both deep learning and neural language processing (“NLP”) have been dominated by the Transformer architecture, which interleaves multi-layer perceptrons (“MLPs”) and multi-head attention (“MHA”). Transformers typically achieve better performance than recurrent neural networks (“RNNs”) in practice and are also efficient at utilizing modern hardware. Attorney Docket No.45288-0438WO1 For example, transformer-based large language models (“LLMs”) trained on a large corpus of data collected from the web have achieved success. [0049] However, despite their successes, Transformers are difficult to scale efficiently to long sequences due to the quadratic complexity ^^{^}^^{^^} of global attention in the sequence length (^). Additionally, the linear growth of the Key-Value (“KV”) cache with the sequence length makes Transformers slow during inference. Although Multi-Query Attention (“MQA”) partially mitigates this issue by reducing the cache size by a constant factor, the cache still grows linearly in sequence length. Recurrent-based language models are a compelling alternative as they compress the entire sequence into a fixed-sized hidden state which is updated iteratively. However, to replace Transformers, new RNN models should demonstrate not only comparable performance at scale but also achieve similar hardware efficiency. [0050] To overcome these challenges, this specification introduces a novel gated linear recurrent layer, referred to as the “GLRU”, around which a new recurrent layer block is developed to replace attention, e.g., MHA and/or MQA. Several neural network models are described that implement this recurrent layer block, with example implementations of these models evaluated in a set of experiments. One example is a recurrent-only neural network, referred to as “Hawk”, that interleaves feedforward layer blocks with recurrent layer blocks. Another example is a hybrid attention/recurrent neural network, referred to as “Griffin”, that interleaves feedforward layer blocks with a mixture of recurrent layer blocks and attention layer blocks. Implementations of the Hawk and Griffin models in experiments showed that: 1. Hawk and Griffin exhibited power law scaling between held-out loss and training FLOPs (“Floating Point Operations”), up to and beyond 7 billion parameters, as previously observed for Transformer architectures. 2. Griffin achieved lower held-out loss than Transformer baselines at all model scales used in the experiments. 3. Hawk and Griffin were trained on 300 billion tokens at a range of model scales. Hawk-3B exceeds the reported performance of Mamba-3B on downstream tasks, despite being trained on half as many tokens. Griffin-7B and Griffin-14B matched the performance of Llama-2 despite being trained on roughly 7 times fewer tokens in some cases. Attorney Docket No.45288-0438WO1 4. Griffin achieved comparable training efficiency to Transformers on TPU-v3. Since diagonal RNN layers are memory bound, this was achieved using a kernel for the GLRU, implemented in Pallas, that minimized memory transfers. 5. During inference, both Hawk and Griffin achieved significantly higher throughput than Transformers, and they achieved lower latency when sampling long sequences. 6. Griffin performed better than Transformers when evaluated on sequences longer than those seen during training, and could also efficiently learned copy and retrieval tasks from the training data. [0051] These and other features relating to the neural network architectures for implementing both attention and recurrence are described in more detail below. [0052] FIG.1A is a schematic diagram depicting an example of a system 10 configured to perform a machine learning task using a neural network 100. The system 10 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented. [0053] At a high-level, the system 10 is configured to: receive an input sequence (^) 102; and ^{process the input sequence 102, using the neural network 100, to generate a network output (^ =} _{^^(^)) 104. Here, ^^ is a function representing the parametric model of the neural network 100,} parameterized by a set of network parameters (^). The network parameters (^) include the respective (learnable) parameters, e.g., weights and biases, of each neural network layer of the neural network 100. The neural network 100 is configured to perform the machine learning task on the input sequence 102 to generate the network output 104. For brevity, examples of machine learning tasks that the system 10 can perform using the neural network 100 are described at the end of this specification. _{[0054] In general, the input sequence 102, given as ^ = ^^^, ^^, … , ^^^, includes a respective} _{input token (^^) at each of multiple input positions ^ = 1,2, … , ^, where ^ is the length of the input} sequence 102. For example, the input sequence 102 can describe, e.g., in a natural language, the machine learning task to be performed by the neural network 100. On the other hand, the network output 104 can be any desired network output, e.g., a score, a classification, or a regression output. [0055] Of particular note, implementations of the neural network 100 can efficiently process input sequences 102 that are long-range sequences. A long-range sequence generally refers to a sequence of 2000 or more tokens, e.g., 4000 or more tokens, 8000 or more tokens, 16000 or more tokens, Attorney Docket No.45288-0438WO1 32000 or tokens, 64000 or more tokens, 128000 or more tokens. For example, the neural network 100 can perform machine learning tasks on an input sequence 102 including 16000 or more interacting tokens, e.g., machine learning tasks in the Long-Range Arena (“LRA”) such as PathFinder and PathX. Details of particular machine learning tasks in the LRA, such as training and test datasets, are provided by Yi Tay, et al., “Long-Range Arena: A Benchmark for Efficient Transformers.” arXiv preprint arXiv:2011.04006 (2020). [0056] In some implementations, the neural network 100 can be referred to as an auto-regressive neural network, that is, because the neural network 100 auto-regressively generates an output _{sequence ^ = ^^^, ^^, … , ^^^ by repeatedly updating and reprocessing the input sequence 102 for} _{each of multiple output positions ^ = 1,2, … , ^, where ^ is the length of the output sequence. The} output sequence includes a respective output token (^_^) at each output position. The neural network 100 auto-regressively generates the output sequence by generating the respective output token at each output position in the output sequence conditioned on an input sequence 102 that represents _{a current input sequence ^^ = ^^^, ^^:^^^^ for the output position. That is, the input sequence 102} _{includes an initial input} _{with any output tokens ^^:^^^ = ^^^, ^^, … , ^^^^^} that precede the output token (^_^), i.e., the output tokens that have generated for any previous output positions in the output sequence that precede the particular output position of the particular token. [0057] In more detail, the neural network 100 can generate the output sequence iteratively, i.e., one output token at a time, using an auto-regressive sampling procedure. For each ^-th output position in the output sequence, the neural network 100 processes the input sequence 102 for the ^{output position to generate a network output 104 including a score distribution ^^ = ℙ^ =} _{ℙ^^^!^^^ over a vocabulary of tokens of a tokenizer, e.g., a probability distribution in the form of} neural network 100 then samples the output token ^_^~ℙ^^_^!^^ for the ^-th output position from the score distribution. The neural network 100 subsequently updates the input _{sequence 102 by appending the output token thereto, ^^ → ^^$^ = ^^^ , ^^^ = ^^^, ^^:^^. The} neural network 100 then reprocesses the updated input sequence 102, as described above, to _{generate a network output 104 including a score distribution ^^$^ = ℙ^$^ over the vocabulary for} _{the (^ + 1)-th output position, and thereafter samples the output token ^^$^~ℙ^$^ at the output} position from the score distribution. The neural network 100 performs this process repeatedly until Attorney Docket No.45288-0438WO1 _{an end-of-sequence (“EOS”) token ^^ = ^&'( is reached, terminating the auto-regressive} sampling procedure. The neural network 100 then returns the output sequence, including each _{output token generated during the auto-regressive sampling procedure, given as ^ = ^^:^ =} _{^^^, ^^, … , ^^^. Note, returning the EOS token is optional, so the neural network 100 can} _{the output sequence as ^ = ^^:^^^ = ^^^, ^^, … , ^^^^^.} [0058] In the system 10 has trained the neural training data to perform the machine learning task, e.g., via a supervised, or unsupervised learning technique. The neural network 100 can also be pre-trained and the system 10 can fine-tune the neural network 100 for other machine learning tasks (e.g., downstream tasks). [0059] For example, the system 10 or another training system can train the neural network 100 through one or more of unsupervised learning, e.g., a language modeling objective, supervised learning, e.g., supervised fine-tuning, instruction tuning, direct preference optimization, and so on, or reinforcement learning, e.g., reinforcement learning from human or AI feedback, and so on. As a particular example, system 10 or the other training system can first train the neural network 100 on an initial data set through unsupervised learning, e.g., using a next token prediction objective or other appropriate objective, and then further train, i.e., “fine-tune” the neural network 100 on one or more additional objectives, e.g., through one or more of supervised fine-tuning, instruction tuning, direct preference optimization, and so on, or reinforcement learning, e.g., reinforcement learning from human or AI feedback, and so on. [0060] A general procedure for how the system 10 trains the neural network 100 to perform a generic machine learning starting from initialization of the network parameters is described below. Benchmarking results of the neural network 100 trained on various machine learning tasks are reported and described with reference to FIGs. 3A-7C. [0061] To begin, the system 10 first initializes the network parameters (^) of the neural network _{100. The system 10 then obtains a training dataset ) = *^^+, ^+^,./} _{+-^ related to a machine} ^{learning task, e.g., a task in the LRA. The training dataset 510 includes 01 training examples} _{^^+, ^+^. The training dataset can include any appropriate number of training examples for the} machine learning task, e.g., 10³ or more training examples, 10⁵ or more training examples, 10⁶ or more training examples, 10⁷ or more training examples, 10⁸ or more training examples, 10⁹ or more training examples, etc. Each training example includes: (i) a respective training network input (^₊), and (ii) a corresponding target network output (^₊). The system 10 then trains the Attorney Docket No.45288-0438WO1 neural network 100 on the training dataset (or one or more batches (ℬ) of training examples in the training dataset) to perform the machine learning task. In general, the system 10 trains the neural network 100 to produce the target network output in response to its training network input. That is, the system 10 processes the training network input of each training example, using the neural _{network 100, to generate a training network output ^;+ = ^^^^+^ that is an estimate of the target} network output for the training example. _{[0062] The system 10 then optimizes an objective function ℒ^^^, … , ^./ , ^;^, … , ^;./^ that depends} on the training and target network outputs of each training example in the training dataset (or a batch of training examples in the training dataset). Particularly, the system 10 minimizes (or maximizes) the objective function with respect to the network parameters of the neural network _{100, e.g., such as ^ = arg m} _{^in ℒ. Broadly speaking, the system 10 generally uses an objective} function that encourages each training network output to meet its respective target network output, e.g., as measured by some error or similarity metric between the two. In light of this, the objective function can be (or can include) any appropriate objective function for the training dataset and machine learning task the neural network 100 is trained to perform. For example, the objective function (e.g., loss function) can include a mean squared error loss or a mean absolute error loss for a regression task, a binary cross-entropy loss or a Hinge loss for a binary classification task, a categorical cross-entropy loss for a multi-class classification task, a Kullback-Leibler divergence loss for a generative or reinforcement learning task, a MinMax loss for an image segmentation task, etc. [0063] To optimize the objective function, the system 10 computes gradients of the objective function with respect to the network parameters of the neural network 100, e.g., using backpropagation. The system 10 then uses the gradients to update the network parameters of the neural network 100. For example, the system 10 can use a stochastic gradient descent method with a particular learning rate and/or weight decay, such as Implicit updates, Momentum, Adam, RMSProp, AdaGrad, etc., to update the network parameters with the values that optimize the objective function 520. The system 10 can perform any appropriate number of training iterations to optimize the objective function, e.g., 10³ or more training iterations, 10⁵ or more training iterations, 10⁶ or more training iterations, 10⁷ or more training iterations, 10⁸ or more training iterations, 10⁹ or more training iterations, etc. Attorney Docket No.45288-0438WO1 [0064] After training, the system 10 can then evaluate how well the neural network 100 performs the machine learning task using a test dataset, e.g., to benchmark the neural network 100 for inference accuracy, e.g., see FIGs. 3A-7C for benchmarking results of the experiments. [0065] The system 10 can be implemented in any appropriate location, e.g., on a user device (e.g., a mobile device), or on one or more computers in a data center, etc. In some implementations, users can interact with the system 10, e.g., by providing a query (e.g., including an input sequence 102 for the neural network 100) by way of an interface, e.g., a graphical user interface, or an application programming interface (“API”). In particular, a user can provide a query that includes: (i) a request to process an input sequence 102, and (ii) the input sequence 102. The input sequence 102 may describe, e.g., in a natural language, a machine learning task to be performed by the neural network 100. The system 10 can process the input sequence 102 using the neural network 100, responsive to the request, and provide the network output 104 (or an auto-regressively generated output sequence) resulting from the machine learning task performed by the neural network 100 to the user, e.g., for implementation on a user device of the user, or for storage in a data storage device. In some cases, the system 10 can transmit the network output 104 (or the auto-regressively generated output sequence) to a user device of the user, e.g., by way of a data communication network (e.g., the Internet). [0066] Referring to FIG.1A, the neural network 100 includes an input subnetwork 110, a residual network 120, an output subnetwork 130, a skip connection 140, and an additive gate 142. [0067] The input subnetwork 110 is configured to: receive the input sequence (^) 102; and process the input sequence 102 to generate an input embedding sequence (C) representing the input _{sequence 102. The input embedding sequence C = ^D^, D^, … , D^^ includes a respective embedding} (D_^) of each input token (^_^) in the input sequence 102. Here, the input subnetwork 110 is configured as an embedding network (e.g., an encoder network). For example, the input subnetwork 110 can be a linear embedding network, a lookup table-based embedding network, a subword or byte-level embedding network, a sequence-to-sequence embedding network, a Transformer-based embedding network, a convolutional-based embedding network, a recurrent- based embedding network, a long short-term memory (“LSTM”)-based embedding network, a graph-based embedding network, or a hybridization thereof. [0068] The residual network 120 follows the input subnetwork 110 and is the backbone of the neural network 100. The residual network 120 is configured to: receive the input embedding Attorney Docket No.45288-0438WO1 sequence (C); and process the input embedding sequence to generate a residual embedding _{sequence (CE). The residual embedding sequence CE = ^D^̃, D^̃, … , D^̃^ includes a respective residual} embedding (D_^̃) for each input position in the input 102. [0069] The additive gate 142 follows the residual additive gate 142 is connected to the output of the residual network 120 and the input of the residual network 120 via the skip connection 140. The additive gate 142 is configured to combine, e.g., via summation, the input (C) and residual (CE) embedding sequences to generate an output embedding sequence (CG). The output _{embedding sequence CG = C + CE = ^D^̂, D^̂, … , D^̂^ includes a respective output embedding (D^̂) for} each input position in the input sequence 102. [0070] The output subnetwork 130 follows the additive gate 142 and is connected to the output thereof. The output subnetwork 130 is configured to: receive the output embedding sequence (CG); and process the output embedding sequence to generate the network output (^) 104. Here, the output subnetwork 130 is configured as a projection network to generate the network output 104, e.g., including a score distribution over the vocabulary of tokens. For example, the output subnetwork 130 can be a softmax-based projection network (e.g., a normalization layer 201, followed by a linear layer 202, followed by a softmax function), a fully connected (dense) projection network, an attention-based projection network, a contrastive or distance-based projection network, a decoder-based projection network, a graph-based projection network, or a hybridization thereof. In some implementations, the weights of the output subnetwork 130 are shared with the input subnetwork 110. [0071] As shown in FIG.1A, the residual network 120 includes a sequence of residual layer blocks _{210-1 through 210-N arranged into a residual configuration, where I = 1,2, … , 0 indexes a} residual layer block 210-l, and 0 denotes the total number of residual layer blocks 210, i.e., the depth of the residual network 120. Generally, a residual configuration enables the neural network 100 to implement deep models, e.g., having tens, hundreds, thousands, or tens of thousands of neural network layers, that are easier to train and approach better accuracy with increasing numbers of layers. For example, the residual network 120 can include at least about 2, 5, 10, 20, 50, 100, 200, 500, 1000, or more residual layer blocks 210. _{[0072] Each residual layer block 210-l is configured to: receive a respective block input sequence} _{JK = ^LK , LK , K K} _{^ ^ … , L^ ^ including a respective block input embedding (L^) for each input position in} the input 102; and process the respective block input sequence to generate a respective Attorney Docket No.45288-0438WO1 _{block output sequence JMK = ^LNK , LNK , … , LNK ^ NK} _{^ ^ ^ including a respective block output embedding (L^) for} each input position in the input For the first residual layer block 210-1 in the residual _{network 120, the block input = C is the input embedding sequence to the residual} network 120. For each residual layer block 210-l in the residual network 120 proceeding the first _{residual layer block 210-1, the block input sequence JK = JMK^^ is the block output sequence output} by the preceding residual layer block 210-(l-1). For the last residual layer block 210-N in the _{residual network 120, the block output sequence JM. = CE is the residual embedding sequence} output by the residual network 120. [0073] FIG. 1B is a schematic diagram depicting an example of a residual layer block 210. The ^{residual layer block 210 corresponds to the base pattern of the residual network 120 that is repeated} _{0 times within the residual network 120. The residual layer block 210 includes a first} normalization layer 201-1, a temporal mixing layer block 220, a first skip connection 140-1, a first additive gate 142-1, a second normalization layer 201-2, a feedforward layer block 230, a second skip connection 140-2, and a second additive gate 142-2. The residual layer block 210 resembles a Transformer layer block, but with the modification that the temporal mixing layer block 220 can be a recurrent layer block 220R or an attention layer block 220A. [0074] The first normalization layer 201-1 is configured to: receive the block input sequence (J) that is input to the residual layer block 210; and apply a normalization operation over the block input sequence to generate a normalized version of the block input sequence (O). The normalized _{version of the block input sequence O = ^P^, P^, … , P^^ includes a normalized version (P^) of} the respective block input embedding input position in the input sequence 102. For example, the first normalization layer 201-1 can implement Root Mean Square normalization (“RMSNorm”), batch normalization (“BatchNorm”), layer normalization (“LayerNorm”), weight normalization (“WeightNorm”), or other normalization scheme. [0075] The temporal mixing block 220 is the component of the neural network 100 that aggregates hidden layer activations at different input positions in the input sequence 102. The temporal mixing layer block 220 follows the first normalization layer 201-1. The temporal mixing layer block 220 is configured to: receive the normalized version of the block input sequence (O); and process the normalized version of the block input sequence to generate a first block output sequence (J^{M^}). The Attorney Docket No.45288-0438WO1 _{first block output sequence JM^ = ^LN^} _{^, LN^} _{^, … , LN^} _{^^ includes a respective first block output} embedding (L^N _^ ^{^}) for each input position in the input sequence 102. [0076] mixing layer block 220 can be configured as a recurrent layer block 220R or an block 220A, each described below with reference to FIGs. 1C and 1D, respectively. Note, in many cases, the neural network 100 includes more recurrent layer blocks 220R than attention layer blocks 220A. For example, the neural network 100 can include 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, or more recurrent layers blocks 220R for each attention layer block 220A. In some implementations, the residual layer blocks 210 including the attention layer blocks 220A can interleave the residual layer blocks 210 including the recurrent layer blocks 210R, e.g., in a periodic fashion that repeats for each attention layer block 220A. Note, using more recurrent layer blocks 220R than attention layer blocks 220A can speed up inference times and reduce memory costs compared to conventional attention-based neural networks (e.g., Transformers), while simultaneously having equal or better inference accuracy (and training loss) for the same model size. [0077] The first additive gate 142-1 follows the temporal mixing layer block 220. The first additive gate 142-1 is connected to the output of the temporal mixing layer block 220 and the input of the first normalization layer 201-1 via the first skip connection 140-1. The first additive gate 142-1 is configured to combine, e.g., via summation, the block input sequence (J) and the first block output _{sequence (JM^) to generate an updated block input sequence (JQ). The updated block input sequence} _{JQ = J + JM^ = ^LQ} _{^ , LQ} _{^ , … , LQ} _{^ ^ includes a respective updated block input embedding (LQ} _{^) for each} input position in the input sequence 102. [0078] The second normalization layer 201-2 follows the first additive gate 142-1 and is connected to the output thereof. The second normalization layer 201-2 is configured to: receive the updated block input sequence (J^Q); and apply a normalization operation over the updated block input sequence to generate a normalized version of the updated block input sequence (O^Q). The _{normalized version of the updated block input sequence OQ = ^PQ Q Q} _{^ , P^ , … , P^^ includes a} normalized version (P^Q) of the respective updated block for each position in the input sequence 102. For example, the second normalization layer 201-2 can implement RMSNorm, BatchNorm, LayerNorm, WeightNorm, or another normalization scheme. [0079] The feedforward layer block 230 follows the second normalization layer 201-2. The feedforward layer block 230 is configured to: receive the normalized version of the updated block Attorney Docket No.45288-0438WO1 input sequence (O^Q); and process the normalized updated block input sequence to generate a second _{block output (JM^). The second block output sequence JM^ = ^LN^} _{^, LN^} _{^, … , LN^} _{^^ includes a} respective second block output embedding (L^{N^}) for each input [0080] The second additive gate 142-2 follows the feedforward The second additive gate 142-2 is connected to the output of the feedforward layer block 230 and the input of the second normalization layer 201-2 via the second skip connection 140-2. The second additive gate 142-2 is configured to combine, e.g., via summation, the updated block input sequence (J^Q) _{and the second block output sequence (JM^) to generate the block output sequence M J = JQ + JM^ that} is output by residual layer block 210. [0081] FIG. 1C is a schematic diagram depicting an example of the temporal mixing layer block 220 configured as a recurrent layer block 220R. The recurrent layer block 220R includes a first linear layer 202-1, a feedforward layer 203, a second linear layer 202-2, a convolutional layer 204, a recurrent layer 205, a multiplicative gate 152, and a third linear layer 202-3. [0082] Each of the first linear layer 202-1, feedforward layer 203, second linear layer 202-2, convolutional layer 204, recurrent layer 205, and third linear layer 202-3 are configured to: receive _{a respective layer input sequence R = ^S^, S^, … , S^^ including a respective layer input embedding} (S_^) for each input position of the input sequence 102; and process the respective layer input _{sequence to generate a respective layer output sequence T = ^U^, U^, … , U^^ including a respective} layer output embedding (U_^) for each input position of the input sequence 102. [0083] The linear layers 202-1, 202-2, and 202-3 are each configured to apply a respective linear transformation over its layer input sequence to generate its layer output sequence. For example, each of the linear layers 202-1, 202-2, and 202-3 can be configured as a fully connected (dense) layer, a projection layer, an expansion layer, or a contraction layer. Here, the first 202-1 and second 202-2 linear layers are configured as expansion layers that increase the dimensionality of each _{layer input embedding in its layer input sequence by an expansion factor of V > 1, and the third} linear layer 202-3 is configured as a contraction layer that decreases the dimensionality of each _{layer input embedding in its layer input sequence by a contraction factor of V^^ < 1. For example,} the expansion factor can be equal to 2, 3, 4, 5, 6, 7, 8, 9, 10, or more. [0084] The feedforward layer 203 is configured to apply an activation function over its layer input sequence to generate its layer output sequence. For example, the feedforward layer 203 can be configured as a Rectified Linear Unit (“ReLU”), a Gaussian error Linear Unit (“GeLU”), a leaky Attorney Docket No.45288-0438WO1 ReLU, a sigmoid function, a hyperbolic tangent (“tanh”) function, a softmax function, or a swish function. [0085] The convolutional layer 204 is configured to apply a convolution operation over its layer input sequence to generate its layer output sequence. For example, the convolutional layer 204 can be configured as a one-dimensional convolution (“Conv1D”) layer that applies a one-dimensional convolutional operation over its layer input sequence, e.g., with a temporal filter dimension of 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 15, 20, 25, 50, or more. The convolutional layer 204 can offer greater parallelizability of the recurrent layer block 220R, as well as efficiently capturing temporal and hierarchical patterns in the input sequence 102. [0086] The recurrent layer 205 is configured to apply a recurrent operation over its layer input sequence to generate its layer output sequence. For example, the recurrent layer 205 can be configured as a vanilla recurrent layer, a LSTM layer, a Gated Recurrent Unit (“GRU”), a bidirectional recurrent layer, a variational recurrent layer, a linear recurrent layer, or a Gated Linear Recurrent Unit (“GLRU”) 205G which is described in more detail with reference to FIG. 1F. [0087] As shown in FIG. 1C, the recurrent layer block 220R is configured as a gated recurrent layer block that utilizes two parallel channels of layers 150-1 and 150-2 to perform the gating operation. The first channel of layers 150-1 operates as a gate that modulates how much information passes through the second channel of layers 150-2, which performs the main processing of the recurrent layer block 220R. The gated configuration of the feedforward layer block 220R can improve expressivity and efficiency over a single channel of layers. [0088] The first linear layer 202-1 and the feedforward layer 203 are arranged into the first channel of layers 150-1. The layer input sequence ^R^ that is input to the first linear layer 202-1 is the normalized version of the block input sequence (O) that is input to the recurrent layer block 220R. The feedforward layer 203 follows the first linear layer 202-1 in the first channel of layers 150-1. The layer input sequence ^{^}R^{^} that is input to the feedforward layer 203 is the layer output sequence (T) that is output by the first linear layer 202-1. [0089] The second linear layer 202-2, the convolutional layer 204, and the recurrent layer 205 are arranged into the second channel of layers 150-2 parallel to the first channel of layers 150-1. The layer input sequence ^{^}R^{^} that is input to the second linear layer 202-1 is the normalized version of the block input sequence (O) that is input to the recurrent layer block 220R. The convolutional layer 204 follows the second linear layer 202-2 in the second channel of layers 150-2. The layer Attorney Docket No.45288-0438WO1 input sequence ^R^ that is input to the convolutional layer 204 is the layer output sequence (T) that is output by the second linear layer 202-1. The recurrent layer 205 follows the convolutional layer 204 in the second channel of layers 150-2. The layer input sequence ^R^ that is input to the recurrent layer 205 is the layer output sequence (T) that is output by the convolutional layer 204. [0090] The multiplicative gate 152 follows the first 150-1 and second 150-2 channels and is connected to the outputs thereof. The multiplicative gate 152 is configured to perform an elementwise multiplication operation between the layer output sequences (T) that are output by the feedforward layer 203 and the recurrent layer 205 to generate the layer input sequence ^{^}R^{^} that _{is input to the third linear layer 202-3. That is, R = T^^T^, where T^ is the output sequence of the} first channel of layers 150-1 and T^{^} is the output sequence of the second channel of layers 150-2. [0091] The third linear layer 202-3 follows the multiplicative gate 152 and is connected to the output thereof. The third linear layer 202-3 generates, as its layer output sequence (T), the first block output sequence (J^{M^}) that is output by the recurrent layer block 220R. [0092] FIG. 1D is a schematic diagram depicting an example of the temporal mixing layer block 220 configured as an attention layer block 220A. The attention layer block 220A includes a first (optional) linear layer 202-1, an attention layer 206, and a second (optional) linear layer 202-2. [0093] Each of the first linear layer 202-1, attention layer 206, and second linear layer 202-2 is _{configured to: receive a respective layer input sequence R = ^S^, S^, … , S^^ including a respective} layer input embedding (S_^) for each input position of the input sequence 102; and process the _{respective layer input sequence to generate a respective layer output sequence T = ^U^, U^, … , U^^} including a respective layer output embedding (U_^) for each input position of the input sequence 102. [0094] Here, the layer input sequence (R) that is input to the first linear layer 202-1 is the normalized version of the block input sequence (O) that is input to the attention layer block 220A. The attention layer 206 follows the first linear layer 202-1 such that the layer input sequence (R) that is input to the attention layer 206 is the layer output sequence (T) that is output by the first linear layer 202-1. The second linear layer 202-2 follows the attention layer 206 such that the layer input sequence (R) that is input to the second linear layer 202-2 is the layer output sequence (T) that is output by the attention layer 206. The second linear layer 202-2 generates, as its layer output sequence (T), the first block output sequence (J^{M^}) that is output by the attention layer block 220A. Attorney Docket No.45288-0438WO1 [0095] The linear layers 202-1 and 202-2 are each configured to apply a respective linear transformation over its layer input sequence to generate its layer output sequence. For example, each of the linear layers 202-1 and 202-2 can be configured as a fully connected (dense) layer, a projection layer, an expansion layer, or a contraction layer. Here, the first linear layer 202-1 is configured as an expansion layer that increases the dimensionality of each layer input embedding _{in its layer input sequence by the expansion factor of V > 1, and the second linear layer 202-2 is} configured as a contraction layer that decreases the dimensionality of each layer input embedding _{in its layer input sequence by the contraction factor of V^^ < 1.} [0096] The attention layer 206 is configured to apply an attention mechanism, e.g., a self-attention mechanism, over its layer input sequence to generate its layer output sequence. There are several types of attentions mechanisms that can be utilized by the attention layer 206. [0097] As one example, the attention layer 206 can be a global attention layer and the attention mechanism can be a global attention mechanism that, for each input position in the input sequence 102, attends over all of the input positions preceding or equal to the input position. The global attention mechanism applied by the global attention layer can be a dense attention mechanism or a sparse attention mechanism. [0098] As another example, the attention layer 206 can be a local (e.g., sliding window) attention layer and the attention mechanism can be a local attention mechanism that, for each input position in the input sequence 102, attends only over a local subset of the input positions that are within a local window of the input position. That is, unlike the global attention mechanism, the local attention mechanism does not attend to any position that is outside of the local window of the input position. The local attention mechanism applied by the local attention layer can also be a dense attention mechanism or a sparse attention mechanism. [0099] Here, dense and spare attention mechanisms refer to how many attention weights of the attention matrix are active (non-zero) within the attention window over which the attention layer 206 applies the attention mechanism. For global attention, the attention window for an input position includes all the input positions up to and including the input position. For local attention, the attention window for an input position is the local window of the input position. For a dense attention mechanism, the attention matrix is a dense attention matrix such that each attention weight is active. For sparse attention, the attention matrix is a sparse attention matrix such that at Attorney Docket No.45288-0438WO1 least one attention weight is inactive (zero). Hence, both global and local attention mechanism can be dense or sparse attention mechanism depending on the implementation. [0100] The local window is generally “causal,” so that it includes up to a fixed number of input positions that are closest to the input position and that precede or are equal to the input position, but not any input positions that are after the input position in the input sequence 102. The fixed number of input positions is generally smaller than the total number of input positions in the input sequence 102 and is referred to as the size of the local window. Note, one of the disadvantages of global attention is that its computational complexity ^^^^{^}^ grows quadratically in the length of the input sequence 102. Local attention addresses this issue by bounding the attention mechanism to the local window. This not only reduces the compute, but also limits the size of the key-value (“KV”) cache to the size of the local window, making it no longer quadratic in the sequence length. [0101] In some cases, because the attention mechanism applied by the attention layers is causal, the system 10 can store, for any given attention mechanism and when generating the respective layer output embedding for any given input position, the layer output embeddings or the keys and values already computed for earlier input positions rather than re-computing the layer output embeddings (or the keys and values) for earlier input positions. Thus, in these cases, applying an attention mechanism over the layer input sequence refers to processing the respective layer input embedding for the last input position in the current input sequence 102 using keys and values or layer output embeddings for the other input positions that have been retrieved from memory (e.g., from a “cache”). [0102] The attention layer 206 can use an attention mechanism that applies a positional encoding to each of the input positions in the input sequence 102. “Positional encoding” refers to modifying the operations applied by the attention layer 206 for a given input position based on the absolute or relative position of the input position within the input sequence 102. For example, the positional encoding can be Rotary Positional Embedding (“RoPE”) or a different type of positional encoding, e.g., a relative positional encoding or an Attention with Linear Biases (“ALiBi”) positional encoding. [0103] The attention layer 206 can use an attention mechanism including one or more attention heads. Each attention head generates a set of queries, a set of keys, and a set of values, and then applies any of a variety of variants of query-key-value (“QKV”) attention, e.g., a dot product attention function or a scaled dot product attention function, using the queries, keys, and values to Attorney Docket No.45288-0438WO1 generate the layer output. Each query, key, value can be a vector that includes one or more vector elements. When there are multiple attention heads, the attention layer 206 then combines the outputs of the multiple attention heads, e.g., by concatenating the outputs. In some cases, the attention mechanism is a multi-query attention (“MQA”) mechanism, where each attention head shares a common set of keys and values but does not share queries. For a local attention mechanism, for each input position of the input sequence 102, the input positions that are used to generate the queries, keys, and values for the input position are defined by the local window size for the local attention mechanism, i.e., non-zero attention weights for a given input position are computed only for input positions that are within the local window of the given input position. [0104] FIG. 1E is a schematic diagram depicting an example of the feedforward layer block 230. The feedforward layer block 230 includes a first linear layer 202-1, a feedforward layer 203, a second linear layer 202-2, a multiplicative gate 152, and a third linear layer 202-3. [0105] Each of the first linear layer 202-1, feedforward layer 203, second linear layer 202-2, and _{third linear layer 202-3 are configured to: receive a respective layer input sequence R =} _{^S^, S^, … , S^^ including a respective layer input embedding (S^) for each input position of the} input sequence 102; and process the respective layer input sequence to generate a respective layer _{output sequence T = ^U^, U^, … , U^^ including a respective layer output embedding (U^) for each} input position of the input sequence 102. [0106] The linear layers 202-1, 202-2, and 202-3 are each configured to apply a respective linear transformation over its layer input sequence to generate its layer output sequence. For example, each of the linear layers 202-1, 202-2, and 202-3 can be configured as a fully connected (dense) layer, a projection layer, an expansion layer, or a contraction layer. Here, the first 202-1 and second 202-2 linear layers are configured as expansion layers that increase the dimensionality of each _{layer input embedding in its layer input sequence by the expansion factor of V > 1, and the third} linear layer 202-3 is configured as a contraction layer that decreases the dimensionality of each _{layer input embedding in its layer input sequence by the contraction factor of V^^ < 1.} [0107] The feedforward layer 203 is configured to apply an activation function over its layer input sequence to generate its layer output sequence. For example, the feedforward layer 203 can be configured as a ReLU, a GeLU, a leaky ReLU, a sigmoid function, a tanh function, a softmax function, or a swish function. Attorney Docket No.45288-0438WO1 [0108] As shown in FIG.1E, the feedforward layer block 230 is configured as a gated feedforward layer block, e.g., a gated multi-layer perceptron (“MLP”) layer block, that utilizes two parallel channels of layers 150-1 and 150-2 to perform the gating operation. The first channel of layers 150-1 operates as a gate that modulates how much information passes through the second channel of layers 150-2, which performs the main processing of the feedforward layer block 230. The gated configuration of the feedforward layer block 230 can improve expressivity and efficiency over a single channel of layers. [0109] The first linear layer 202-1 and the feedforward layer 203 are arranged into the first channel of layers 150-1. The layer input sequence ^R^ that is input to the first linear layer 202-1 is the normalized version of the updated block input sequence (O^Q) that is input to the feedforward layer block 230. The feedforward layer 203 follows the first linear layer 202-1 in the first channel of layers 150-1. The layer input sequence ^{^}R^{^} that is input to the feedforward layer 203 is the layer output sequence (T) that is output by the first linear layer 202-1. [0110] The second linear layer 202-2 is arranged into the second channel of layers 150-2 parallel to the first channel of layers 150-1. The layer input sequence ^{^}R^{^} that is input to the second linear layer 202-1 is the normalized version of the updated block input sequence (O^Q) that is input to the feedforward layer block 230. [0111] The multiplicative gate 152 follows the first 150-1 and second 150-2 channels and is connected to the outputs thereof. The multiplicative gate 152 is configured to perform an elementwise multiplication operation between the layer output sequences (T) that is output by the feedforward layer 203 and the second liner layer 202-2 to generate the layer input sequence ^R^ _{that is input to the third linear layer 202-3. That is, R = T^^T^, where T^ is the output sequence} of the first channel of layers 150-1 and T^{^} is the output sequence of the second channel of layers 150-2. [0112] The third linear layer 202-3 follows the multiplicative gate 152 and is connected to the output thereof. The third linear layer 202-3 generates, as its layer output sequence (T), the second block output sequence (J^{M^}) that is output by the feedforward layer block 230. [0113] FIG. 1F is a schematic diagram depicting an example of a recurrent layer 205 configured as a GLRU 205G that implements a gated linear recurrence operation on its layer input sequence (R) 302 to generate its layer output sequence (T) 304. Attorney Docket No.45288-0438WO1 [0114] Operations of the GLRU 205G, for each input position, are summarized concisely in the following three equations: *_{Z^ , [^, \^, ]^, = *Z^S^^, [^S^^, \^S^^, ]^S^^,, (1a)} [0115] and, ℎ_{^ = Z^ℎ^^^ + [^S^, (1b)} [0116] and, U_{^ = \^ℎ^ + ]^S^ . (1c)} [0117] Here, ℎ_^ is the hidden state for the ^-th input position. The hidden state may also be referred to as the “recurrent state”, the “recurrent hidden state”, or the “hidden recurrent state”. _{[0118] As shown in Eq. (1a), the GLRU 205G includes a set of activation (or gating) functions} _{*Z, [, \, ],, including a first activation function (Z), a second activation function ([), a third} activation function (\), and a fourth activation function (]). Each activation function is configured to process a layer input embedding to generate a respective matrix that enters the linear recurrence relations in Eqs. (1b) and (1c). Note, each activation function in the set of activation functions can be parametrized by one or more learnable parameters of the GLRU 205G and, therefore, are trainable. Moreover, one or more of the activation functions can be constant functions, i.e., constant matrices, that have the same value for each layer input embedding and can also be _{trainable. For example, the GLRU 205G can use a constant third activation function \^S^^ = \} _{and a constant fourth activation function ]^S^^ = ], such as \ = 1 and ] = 0.} [0119] After receiving the layer input sequence 302, the GLRU 205G applies each activation _{function over the layer input sequence 302 to generate a respective set of matrices *Z^ , [^, \^, ]^,} _{for each input position. Each set of matrices includes: a respective first matrix Z^ = Z^S^^ (also} _{referred to as the transition matrix), a respective second matrix [^ = [^S^^, a respective third} _{matrix \^ = \^S^^, and a respective fourth matrix ]^ = ]^S^^. It is important to note that the set} of activation functions implement an element-by-element gating mechanism that does not explicitly enter the linear recurrence relations of Eqs. (1b) and (1c). That is, the hidden state (ℎ_^) for each input position is linear with respect to the hidden state at the preceding input position. Hence, the set of activation functions can be applied sequentially to each layer input embedding or in parallel to the layer input sequence 302. Attorney Docket No.45288-0438WO1 [0120] Upon applying the set of activation functions to the layer input sequence 302, the GLRU 205G then performs the linear recurrence described in Eqs. (1b) and (1c). Particularly, starting at an initial hidden state ℎ_^, for each input position in the layer input sequence 302, the GLRU 205G performs the following operations: ^[0121] The GLRU 205G receives the layer input embedding (S_^) for the input position and obtains, e.g., retrieves from memory, the hidden state (ℎ_^^^) for the preceding input position. The hidden state is generally a vector that characterizes the GLRU 205G currently holds at a particular input position. Note, the initial state (ℎ_^) can be initialized in any way during inference and training, e.g., as a default or random value, a hyperparameter, etc. A common _{initialization for the initial hidden state is ℎ^ = 0 but others may also be chosen, e.g., with ℎ^ ≠ 0.} The GLRU 205G computes a first matrix-vector product between: (i) the first matrix (Z_^) for the input position, and (ii) the hidden state (ℎ_^^^) for the preceding input position. The GLRU 205G computes a second matrix-vector product between: (i) the second matrix ([_^) for the input position, and (ii) the layer input embedding (S_^) for the input position. The GLRU 205G then combines, e.g., via summation, the first and second matrix-vector products to generate the hidden state (ℎ_^) for the input position. The GLRU 205G computes a third matrix-vector product between: (i) the third matrix (\_^) for the input position, and (ii) the hidden state (ℎ_^) for the input position. The GLRU 205G computes a fourth matrix-vector product between: (i) the fourth matrix (]_^) for the input position, and (ii) the layer input embedding (S_^) for the input position. The GLRU 205G then combines, e.g., via summation, the third and fourth matrix-vector products to generate the layer output embedding (U_^) for the input position. [0122] Since the recurrence operations described above with reference to Eqs. (1b) and (1c) are _{linear, they can be efficiently parallelized by the GLRU 205G until ^ = ^ − 1 using parallel scans.} A parallel (or associative) scan is an algorithm that computes prefix summations (or similar operations) efficiently using parallel computation. Such parallel scans can be used in parallel processing, GPU computing, and deep learning optimizations. Such implementations of the neural network 100 make the system 10 suitable as a parallel computing system. Note, this is in contrast to traditional RNNs where activation functions enter the recurrence relation and computation is performed sequentially, thereby hindering parallelization. An example of a parallel scan that the GLRU 205G can perform on a layer input sequence 302 to generate a layer output sequence 304 is described in more detail below. Further details related to parallel scans are provided by Eric Attorney Docket No.45288-0438WO1 Martin and Chris Cundy, “Parallelizing Linear Recurrent Neural Nets Over Sequence Length,” arXiv preprint arXiv:1709.04057 (2017), and Jimmy T.H. Smith, Andrew Warrington, and Scott W. Linderman, “Simplified State Space Layers for Sequence Modeling,” arXiv preprint arXiv:2208.04933 (2022). [0123] Another advantage of linear recurrence is that many properties of linear algebra can be utilized by the GLRU 205G. As shown in FIG. 1F, the recurrence relation of Eq. (1b) can be _{unrolled using the initialization ℎ^ = 0, as follows:} ^^^ _+^^ (2) ℎ_{^ = b cd Z^^^ e [^^+S^^+} . [0124] In traditional RNNs the of the activation function, while the hidden state of the GLRU 205G may potentially vanish or explode _{exponentially as ^ → ^ increases. This phenomenon can be mitigated by using diagonal activation} functions such as: g_{hi j ^} _{^ ^ , (3a)} _{Z^ = f , [^ = 1 − Z ^k} _{[0126] where ^ is the elementwise multiplication operation. Here, l^ is referred to as the} “recurrence gate”, and k_^ is referred to as the “input gate”. Here, the state is given as: ℎ_{^ = Z^^qirs + [^^pi , (3c)} _{[0127] and the layering output} _activation function (m) in Eq. (3b) is a sigmoid function but any nonlinear function that is bounded between zero and one can be implemented, such as a softmax or swish function. n_o and L_o is the learnable weight matrix and bias vector of the recurrence gate, and n_p and L_p is the learnable weight matrix and bias vector of the input gate. The recurrence gate can approximately interpolate between a linear recurrence update and the previous hidden state, which allows it to effectively discard the layer input embedding at any input position and preserve all information from the previous input positions. These and other properties of the GLRU 205 enable the neural network 100 to achieve super-exponential memory by reducing the influence of uninformative inputs. Attorney Docket No.45288-0438WO1 _{[0128] Note, the base parameter (f^ can be parametrized as f = m^Λ^, where Λ is a learnable} _{parameter of the GLRU 205G. This guarantees that 0 ≤ f ≤ 1, ensuring that the recurrence} relation is stable. The exponent parameter (v^ is a scalar-valued constant, e.g., set to a value of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more. For numerical stability, in practice, the GLRU 205G can compute the first matrix (Z_^) for each input position in log-space. The GLRU 205G has gates on both the layer input embedding (S_^^ and the recurrent weight (f). However, neither gate depends on the hidden state ℎ_^^^ at the preceding input position, which ensures that the recurrence computation can be efficiently on device, e.g., in parallel. For training, both weight matrices, n_o and n_p, can be initialized using LeCun initialization. Λ can be initialized such that f^g is uniformly distributed between 0.9 and 0.999 at the start of training. [0129] As a noteworthy application of the diagonal parameterization, the GLRU 205G can compute the recurrence relation Eq. (1b) in parallel, e.g., using parallel scans, to substantially speed up training and inference of the neural network 100. Particularly, since the matrices entering the ^{recurrence relation are diagonal, the parallelization time is on the order of ^^0 log ^) and involves} _{^^0^) space, where 0 is the number of diagonal elements of Z^ and [^. The GLRU 205G can} implement parallel scans using a work-efficient for diagonal matrices, the total computational cost of a parallel scan using ^ processors is on the order of ^^0^). [0130] Further details related to work-efficient parallel scans is provided Ladner, Richard E., and Michael J. Fischer. “Parallel prefix computation,” Journal of the ACM (JACM) 27.4 (1980): 831- 838. In other implementations, the GLRU 205G can implement parallel scans using other algorithms that may offer more parallelism (but may not be work-efficient), such as the algorithm proposed by Hillis, W. Daniel, and Guy L. Steele Jr., “Data parallel algorithms,” Communications of the ACM 29.12 (1986): 1170-1183. [0131] Numerous different algorithms can be implemented by the GLRU 205G to compute a parallel scan as the design space grows exponentially with the sequence length as ^^2^{^y} ^. The general technique for constructing such an algorithm is described below, which also holds for non- diagonal matrices. [0132] To implement a parallel scan for computing prefix summations, the GLRU 205G can first precompute a respective input tuple (v_^) for each input position as: v_{^ = ^Z^, [^S^^, (4a)} Attorney Docket No.45288-0438WO1 _{[0133] with the initialization v^ = ^k, 0^, and k being the identity matrix. The GLRU 205G can} then perform a prefix computation on the input tuples, which produces a respective hidden tuple (z_^) for each input position as: z_{^ = v^ ∘ z^^^ = v^ ∘ v^^^ ∘ … ∘ v^, (4b)} [0134] where ∘ is a binary associative operator of the prefix computation. The associative operator (or prefix operator) implements the recurrence relation in Eq. (1b), which performs an operation on any two operands v_^ and v_^ as: v_{^ ∘ v^ = ^Z^, [^S^^ ∘ ^Z^ , [^S^^ = ^Z^Z^ , Z^[^S^ + [^S^^. (4c)} _{[0135] Hence, the hidden} state (ℎ_^) for the input which the GLRU 205G can then use to compute the layer output _{embedding U^ = \^ℎ^ + ]^S^ for the input position.} [0136] the help of the prefix operator and Eq. (4c), hidden tuples may be extended to groups _{of contiguous input positions, from input position ^ to input position ^ (with ^ ≥ ^), as follows:} _{z^:^ = v^ ∘ v^^^ ∘ ⋯ ∘ v^. (5a)} [0137] Moreover, since the prefix z_{^:^ = z^:+ ∘ z+^^:^, (5b)} _{[0138] for ^ ≥ ~ > ^. Eq. (5b)} _{can be} decomposed into multiple contiguous subgroups of input positions. This allows the GLRU 205G to compute subgroups (z_^:^) in parallel and then combine them to compute the hidden tuples as: _{= z^:+ ∘ z+^^:^ = z^:+ ∘ z+^^ = z^:K ∘ zK^^:+ ∘ z+^^, (5c)} [0139] and so on. Hence the term “parallel scan”. _{[0140] As one example, consider a layer input sequence 302 of length ^ = 4. The GLRU 205G} _{can compute the hidden tuples as: z^ = v^ ∘ v^, z^ = v^ ∘ z^, z3 = v3 ∘ z^, and z5 = z5:3 ∘ z^, with} _{z5:3 = v3 ∘ v^. Introducing the additional term z5:3 breaks the dependency of z5 on z3, allowing} the GLRU 205G to compute the two in parallel. This parallelization can significantly reduce the number of sequential steps the GLRU 205G performs when the sequence length is large, since the parallel time scales logarithmically with the sequence length. Furthermore, it is suitable for implementation in parallel by multiple (^) processors, e.g., multiple cores of an integrated circuit. [0141] Referring again to the diagonalized parametrization of the GLRU 205G described above with reference to Eqs. (3a)-(3b). Such parameterizations reduce the overall number of parameters Attorney Docket No.45288-0438WO1 of the GLRU 205G, without limiting expressivity, as well as providing computational speedups for both training and inference of the neural network 100. The reasons for this are due, at least in part, to: (i) computing powers of diagonal matrices is computationally cheap (speeding up both training and inference), and (ii) unrolling a linear recurrence can be parallelized, as described above with reference to Eqs. (4a)-(5c), resulting in faster training and inference. Hence, such implementations of the GLRU 205G and neural network 100 make the system 10 particularly suitable as a parallel computing system. [0142] For reference, the recurrence gate of the GLRU 205G can be expressed in a slightly different, but mathematically equivalent form, e.g., for improving numerical stability. In particular, the GLRU 205G can compute the logarithm of the first matrix (Z_^) and then exponentiate it, instead of computing a sigmoid and then taking a power: l_{og Z gh} _{^ = log f i = log m^Λ^ghi − vsoftplus^Λ^^l^. (6a)} [0143] Note, the recurrence gate is quite different than other standard gates in the literature. In particular, most gating mechanisms allow full interpolation between the hidden state (ℎ_^^^) and the new observation (S_^). The recurrence gate of the GLRU 205G, on the other hand, is biased towards retaining information, and does not allow full discarding of the contribution of ℎ_^^^. This property depends on the value of Λ which influences the strength of the biasing. [0144] The GLRU 205G can be further extended to use complex numbers. To achieve this, GLRU _{205G parameterizes a complex diagonal recurrence via fE = m^Λ^^^^, where ^ = √−1 is the} imaginary unit (not to be confused with a sequence position index) and ^ is a learnable parameter _{of GLRU 205G. In addition, the GLRU 205G can split the layer input embedding S^ = ^S^ ^} _{^ , S^ ^} along its channel dimensions, and interpret its first half (S_^ ^{^}) as the real part of a complex vector _{(SE ), and its second first half (S^) ^ ^} _{^ ^ as the imaginary part of the same complex vector SE^ = S^ + ^S^ .} The hidden state is then given as: ℎ_{N^ = Z^^^ℎN^^^ + [N^^SE^, (6b)} [0145] with, i_{^ (6c)} _{^ ^ −} _{[0146] and the recurrent l^ =} _{the same.} Note that the number of dimensions of l_^, k_^, fE, and ℎ^N _^ are half those of the layer input embedding Attorney Docket No.45288-0438WO1 _{S^. The layer output embedding is then U^ = ^Re^ℎN^^, Im^ℎN^^^, where Re and Im denote the real} and imaginary parts, respectively. [0147] FIG.2A is a flow diagram of an processing an input sequence 102 using the neural network 100 to generate a network output 104. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a system, e.g., the system 10 of FIG. 1A, appropriately programmed in accordance with this specification, can perform the process 400 by implementing the neural network 100. [0148] The neural network 100 receives the input sequence 102 (410). [0149] The neural network 100 processes the input sequence 102, using the input subnetwork 110, to generate an input embedding sequence representing the input sequence 102 (420). [0150] The neural network 100 processes the input embedding sequence, using the residual network 120, to generate a residual embedding sequence (430). [0151] The neural network 100 combines the input and residual embedding sequences, using the additive gate 142, to generate an output embedding sequence (440). [0152] The neural network 100 processes the output embedding sequence, using the output subnetwork 130, to generate the network output 104 (450). [0153] FIG. 2B is a flow diagram of an example process 430 for processing an input embedding sequence using the residual network 120 to generate a residual embedding sequence. For convenience, the process 430 will be described as being performed by a system of one or more computers located in one or more locations. For example, a system, e.g., the system 10 of FIG. 1A, appropriately programmed in accordance with this specification, can perform the process 400 by implementing the residual network 120. [0154] For each residual layer block 210-1 to 210-N in the residual network 120: [0155] The residual network 120 receives a block input sequence for the residual layer block 210- l (432). If the residual layer block 210-l is the first residual layer block 210-1 in the residual network 210, the block input sequence is the residual embedding sequence. If the residual layer block 210-1 is not the first residual layer block 210-1 in the residual network 120, the block input sequence is the block output sequence for the previous residual layer block 210-(1-1) in the residual network 120. Attorney Docket No.45288-0438WO1 [0156] The residual network 120 processes the block input sequence, using the residual layer block 210-1, to generate a block output sequence for the residual layer block 210-l (432). If the residual layer block 210-1 is the last residual layer block 210-N in the residual network 210, the block output sequence is the residual embedding sequence. If the residual layer block 210-1 is not the last residual layer block 210-N in the residual network 120, the block output sequence is the block input sequence for the next residual layer block 210-(1+1) in the residual network 120. [0157] FIG.2C is a flow diagram of an example process 434 for processing a block input sequence using a residual layer block 210 to generate a block output sequence. For convenience, the process 434 will be described as being performed by a system of one or more computers located in one or more locations. For example, a system, e.g., the system 10 of FIG.1A, appropriately programmed in accordance with this specification, can perform the process 434 by implementing the residual layer block 210. [0158] The residual layer block 210 receives the block input sequence (462). [0159] The residual layer block 210 processes the block input sequence, using the first normalization layer 201-1, to generate a normalized version of the block input sequence (464). [0160] The residual layer block 210 processes the normalized version of the block input sequence, using the temporal mixing layer block 220, to generate first block output sequence (466). [0161] The residual layer block 210 combines the block input sequence and the first block output sequence, using the first additive gate 142-1, to generate an updated block input sequence (468). [0162] The residual layer block 210 processes the updated block input sequence, using the second normalization layer 201-2, to generate a normalized version of the updated block input sequence (470). [0163] The residual layer block 210 processes the normalized version of the updated block input sequence, using the feed-forward layer block 230, to generate a second block output sequence (472). [0164] The residual layer block 210 combines the updated block input sequence and the second block output sequence, using the second additive gate 140-2, to generate the block output sequence (474). [0165] FIG.2D is a flow diagram of an example process 500 for processing a layer input sequence 302 using the GLRU 205G to generate a layer output sequence 304. For convenience, the process 500 will be described as being performed by a system of one or more computers located in one or Attorney Docket No.45288-0438WO1 more locations. For example, a system, e.g., the system 10 of FIG.1A, appropriately programmed in accordance with this specification, can perform the process 500 by implementing the GLRU 205G. [0166] The GLRU 205G receives the layer input sequence 302 including a respective layer input embedding at each of multiple input positions. The GLRU 205G then processes the layer input sequence 302, in accordance with the process 500, to generate the layer output sequence 304 including a respective layer output embedding at each of the input positions. [0167] For each input position (^) of the layer input sequence 302: [0168] The GLRU 205G receives a layer input including: (i) a hidden state for a preceding input position, and (ii) the layer input embedding for the input position. (510). [0169] The GLRU 205G process the layer input embedding for the input position, using a set of _{activation functions *Z, [, \, ],, to generate a set of matrices *Z^, [^, \^, ]^, including a first (Z^),} second ([_^), third (\_^), and fourth (]_^) matrix for the input position (520). [0170] The GLRU 205G computes a first matrix-vector product between: (i) the first matrix for the input position, and (ii) the hidden state for the preceding input position (530). [0171] The GLRU 205G computes a second matrix-vector product between: (i) the second matrix for the input position, and (ii) the layer input embedding for the input position (540). [0172] The GLRU 205G combines, e.g., via summation, the first and second matrix-vector products to generate a hidden state for the input position (550). [0173] The GLRU 205G computes a third matrix-vector product between: (i) the third matrix for the input position, and (ii) the hidden state for the input position (560). [0174] The GLRU 205G computes a fourth matrix-vector product between: (i) the fourth matrix for the input position, and (ii) the layer input embedding for the input position (570). [0175] The GLRU 205G combines, e.g., via summation, the third and fourth matrix-vector products to generate the layer output embedding for the input position (580). [0176] FIGs. 3A-7C are experimental plots depicting results of several experiments that were performed to train and evaluate select models of the neural network 100, with detailed commentary on the experiments provided below. It is emphasized, however, that the experiments are presented merely as examples of the described architectures of the neural network 100. Many other variants of the neural network 100 than those in experiments are also possible, e.g., models with different configurations of the residual layer blocks 210 (e.g., different forms of attention and/or recurrence Attorney Docket No.45288-0438WO1 and different residual and gated configurations), different numbers of layers in each residual layer block 210, different numbers of the residual layer blocks 210, different model dimensions and sizes, and so on. [0177] Scaling studies provide important insights into how to tune the hyperparameters of a model and its behavior at scale. Here, the models of the neural network 100 that were evaluated in the experiments are defined, and provide scaling curves up to and beyond 7 billion parameters. Finally, the performance of the models was assessed on downstream tasks. Three model families of the neural network 100 were considered during the experiments: (i) Transformer baseline: The Transformer (“TFM”) baseline is a pure attention implementation of the neural network 100, that is, it uses the residual network 120 where the temporal mixing layer block 220 of each residual layer block 210 is configured as an attention layer block 220A. For the Transformer baseline, the attention layer 206 of each attention layer block 220A was configured with MQA. (ii) Hawk: Hawk is a pure recurrent implementation of the neural network 100, that is, it uses the residual network 120 where the temporal mixing layer block 220 of each residual layer block 210 is configured as a recurrent layer block 220R. For Hawk, the recurrent layer 205 of each recurrent layer block 220R was configured as the GLRU 205G. (iii)Griffin: Griffin is a hybrid implementation of the neural network 100, this is, it uses the residual network 120 where the temporal mixing layer block 220 of each residual layer block 210 is configured as an attention layer block 220A or a recurrent layer block 220R. For Griffin, the attention layer 206 of each attention layer block 220A was configured with MQA, and the recurrent layer 205 of each recurrent layer block 220R was configured as the GLRU 205G. [0178] The key advantage of recurrent layer blocks over global attention is that they use a fixed state size to summarize the sequence, whereas the size of the KV cache grows proportional to sequence length. Since local attention has the same property, mixing recurrent layer blocks 220R with attention layer blocks 220A implementing local attention preserves this benefit. The experiments indicated this combination is extremely effective, since the attention layers 206 accurately model the recent past, while the recurrent layers 205 can transmit information across long sequences. Griffin employs a layered structure by alternating two residual layer blocks 210 that each include a recurrent layer block 220R followed by one residual layer block 210 that includes an attention layer block 220A. Attorney Docket No.45288-0438WO1 [0179] FIG. 3A is an experimental plot depicting scaling curves of the Transformer baseline. Hawk, and Griffin models of the neural network 100. Particularly, FIG. 3A shows the validation loss of each of the models as function of training FLOPs. FIG.3B is an experimental plot depicting maximum throughput of the Transformer baseline, Hawk, and Griffin models of the neural network 100. Particularly, FIG. 3B shows the maximum tokens per second decoded of each of the models as a function of the number of tokens decoded. [0180] The main scaling results of the experiments are outlined in FIGs. 3A and 3B. All three model families of the neural network 100 were trained at a range of model scales from 100 million to 7 billion parameters, with an additional Griffin model with 14 billion parameters. The number of training tokens was increased to be roughly proportional to the number of parameters of the model, as prescribed by the Chinchilla scaling laws. Models were trained on the MassiveText dataset, previously used to train Gopher and Chinchilla, although a slightly different data subset distribution was used in the experiments. A sequence length (^) of 2048 tokens was used. All experiments used the AdamW optimizer. The learning rate, weight decay, and ^_^ parameters were tuned for small models, and these runs were used to identify scaling rules for the hyper-parameters which predict their optimal values for the 7 billion and 14 billion parameter models. [0181] All three model families of the neural network 100 demonstrated a linear scaling relationship between the validation loss and training FLOPs (see FIG. 3A; note both axes are in log scale), as previously observed for Transformers. Notably, Griffin achieved lower validation loss than the Transformer baseline across all FLOPs budgets despite not using any global attention layers. Hawk, on the other hand, achieved slightly higher validation loss, but this gap appeared to close as the training budget increased. [0182] In order to compare to other models in the literature, the models were trained on 300 billion tokens before being evaluated on downstream tasks. The two external baselines that were compared to were Mamba-3B, the strongest small recurrent model reported in the literature to date, and Llama-2, a widely used open Transformer model. Both external baselines have been trained on significantly more than 300 billion tokens – Mamba has been trained on 300 billion tokens, twice more, and Llama-2 has been trained on 2 trillion tokens, nearly seven times more. Note, however, that both Mamba and Llama-2 were trained on different datasets and with different hyper-parameter tuning strategies, which may partially explain the strong performance of the Attorney Docket No.45288-0438WO1 neural network 100 models. The Transformer baseline was, therefore, also included and trained on the same data and with the same hyper-parameter tuning budget as Hawk and Griffin. Table 1: Character normalized accuracy of models evaluated in the experiments. that both Hawk and Griffin achieved strong performance. In line with other works, the character normalized accuracy on MMLU, HellaSwag, PIQA, ARC-E and ARC-C are reported in Table 1, while absolute accuracy onWinoGrande is reported with partial scoring. The performance of Hawk improved significantly as its model size was increased, and Hawk-3B achieved stronger performance on downstream tasks than Mamba-3B, despite being trained on half as many tokens. Griffin-3B significantly outperformed Mamba-3B, and Griffin-7B and Griffin-14B achieved performance competitive with Llama-2, despite being trained on nearly 7 times fewer tokens. Hawk was also competitive with the Transformer baseline, while Griffin outperformed the Transformer baseline. [0184] Two main engineering challenges were encountered when developing and scaling the models of the neural network 100. First, how to efficiently shard the models across multiple devices. Second, how to efficiently implement linear recurrences to maximize training efficiency on TPUs. Both of these challenges are addressed in this section, before providing an empirical comparison of the training speed of Griffin and the Transformer baseline. [0185] As the model size of the neural network 100 increased, it could not be fit on a single device during training, even with a batch size of 1 per-device. To remedy this model parallelism was used Attorney Docket No.45288-0438WO1 to shard large models of the neural network 100 across devices during training. Since communication costs across different training devices are expensive, efficient sharding of the neural network 100 was employed for fast training at scale. [0186] For the feedforward layer block 230, Megatron-style sharding was used, which requires a single all-reduce operation in both the forward and the backward pass. Similarly, the same strategy was applied to the linear layers 202-1 and/or 202-2 in the attention layer block 220A, and the attention mechanism over its heads was additionally sharded. [0187] The recurrent layer block 220R contains two linear layers per channel 150-1 and 150-2, corresponding to the three linear layers 202-1, 202-2, and 202-3 and the GLRU 205G. This allows Megatron-style sharding of these layers in an equivalent fashion. The convolutional layer 204 operates independently across channels, enabling its parameters to be split across devices without incurring any communication overhead. To avoid additional cross-device communication, block- diagonal weights for the gates in the GLRU 205G were used (see Eqs. (3a) and (3b)), instead of dense matrices. For all experiments described in this specification, 16 blocks for both the recurrence and input gates were used. The diagonal structure of the recurrence offers the same advantage as the convolutional layer 204, allowing parameter sharding and computation without any communication. With this strategy, the recurrent layer block 220R’s communication requirements are equivalent to those of the feedforward layer block 230. [0188] Note, optimizer states can consume significant memory, exceeding the size of the model parameters themselves. To address this, ZeRO parallelism was employed, distributing both optimizer states and model parameters across the batch shards. The bfloat16 representation was used for model parameters and activations, minimizing any data transfer overhead. [0189] Current deep learning accelerators are optimized for classical architectures which are composed largely of matrix multiplications and convolutions. These operations have a high FLOPs-to-byte ratio, motivating the development of specialized hardware units like Nvidia GPUs’ TensorCores and Google TPUs’ MXUs. Classical RNNs also benefit from this due to their dense recurrence matrices. In contrast, the GLRU 205G has a low FLOPs-to-byte ratio. This fundamental difference poses a computational challenge, as existing accelerators lack optimization for such workloads. Since all the experiments were run on TPU-v3, the focus was on developing an efficient implementation tailored to this device. Attorney Docket No.45288-0438WO1 [0190] One challenge of utilizing a device like the TPU-v3 for the GLRU 205G is that the update equation of the hidden state in Eq. (3c) is an elementwise operation. For each element update it loads 6 bytes (assuming bfloat16 which involves 2 bytes for each of the variables ℎ_^^^, Z_^, and S_^) and writes 2 bytes (the hidden state ℎ_^) while the computation executes 6 FLOPs (number of arithmetic operations in Eq. (3c)) per element. This translates to a FLOPs-to- ratio of 0.75 – below the device’s capacity for elementwise operations of 4.2. Execution time is therefore influenced by memory transfers between high-bandwidth memory (“HBM”) and VMEM. [0191] To address this, a custom Pallas kernel was written for the computation of in Eq. (3c) using a linear scan. For reference, Pallas is a framework in JAX that enables writing efficient GPU/TPU kernels with relative ease. A linear scan is an algorithm that processes elements in a sequence one- by-one in a single pass. This allowed memory transfers to be minimized, by keeping the hidden state in VMEM all the time, and also to perform memory transfers in larger chunks rather than one at a time. In practice, this translates to almost 3x speed up over the native JAX implementation of the linear scan. Additionally, 10-20% lower training times per step was observed for the full Hawk model of the neural network 100, relative to the same model using the native Jax implementation. [0192] FIGs. 4A-4C are experimental plots depicting training durations of the Transformer baseline and Griffin models of the neural network 100 versus sequence length for different sizes of the models. FIG.4A shows runtimes of the models with 400 million parameters, FIG.4B shows runtimes of the models with 1 billion parameters, and FIG.4C shows runtimes of the models with 7 billion parameters. [0193] Training speeds were compared across different sizes of the models of the neural network 100, as well as across different sequence lengths, to investigate the computational advantages of the neural network 100 during training. For each model size, the total number of tokens per batch was fixed, meaning that as the sequence length increased, the number of sequences was proportionally decreased. In FIGs. 4A-4C, the relative runtimes of the Griffin model of the neural network 100 is compared to that of the Transformer baseline model of the neural network 100 at 2048 sequence length. At the lowest sequence length, the two models had similar training time, but as the sequence length increase, the Transformer baseline became slower, while Griffin’s training time remained the same. The drop in speed for the Transformer baseline was more pronounced at smaller model sizes and decreased at larger model sizes. This can be explained by the fact that all models contained a large number of linear layers 202. Their computation scales Attorney Docket No.45288-0438WO1 quadratically in the embedding dimension (model width) while the recurrent layer 205 scales linearly with embedding dimensions. This means that as the model width is increases compared to the sequence length, the linear layers 202 become the primary computational bottleneck, minimizing the efficiency gains from the recurrent layer block 220R. Therefore, replacing Transformers with Hawk or Griffin offers the most significant wall-clock time improvement when sequence length is sufficiently large relative to model width to ensure the attention computation constitutes a major portion of the total computation time. [0194] Inference in large language models (“LLMs”) is composed of two stages. In the “prefill” stage, a prompt is received and processed. This step is effectively performing a forward pass of the model. Since the prompt can be processed in parallel across the sequence, model operations are compute bound during this stage. Thus, the relative speeds of Transformers and recurrent models during the prefill stage is expected to be similar to the relative speeds of the same models during training. [0195] Prefill is followed by a “decode” stage, in which tokens are sampled auto-regressively from the model. As is shown below, recurrent models have lower latency and higher throughput during the decoding stage, especially for longer sequence lengths where the key-value (“KV”) cache used in attention can get large. [0196] There are two main metrics to consider when evaluating inference speed. The first is latency, which measures the time taken to generate a specified number of tokens at a certain batch size. The second is throughput, which measures the largest number of tokens per second that can be generated on a single device when sampling a specified number of tokens. Since throughput is given by tokens sampled times batch size divided by latency, one can improve throughput either by reducing the latency or by reducing memory usage to enable the use of larger batch sizes on device. Latency can be useful to consider for real-time applications that require a quick response time. Throughput is also useful to consider as it indicates the maximum number of tokens one could sample from a particular model in a given time. This property is useful when considering other language applications such as Reinforcement Learning from Human Feedback (“RLHF”) or scoring language model outputs such as done in AlphaCode where being able to output a large number of tokens in a given time is an appealing feature. [0197] Most components of language models are memory bound during decoding as long as batch size isn’t too big, e.g., less than about 128, and this is assumed for the remainder of this section. Attorney Docket No.45288-0438WO1 The largest memory overheads of Transformers typically come from the parameters themselves and the KV cache. Therefore, we can approximate the time required to generate a single token for each sequence in the batch [ during decoding as the time involved to load these two quantities from memory: Time to sample next token (7) p_{aram size + ^batch size^ × ^cache size^} _≈ _{memory bandwidth .} [0198] Here, cache size refers to either the size of the KV cache at batch size 1 (for Transformers), or to the size of the hidden state at batch size 1 (for RNNs). [0199] The difference in cache size relative to model parameters has important implications for sampling efficiency. In recurrent 220R and local attention 220A layer blocks, parameter loading is the primary bottleneck, (because the cache size is substantially smaller). In contrast, global attention’s KV cache scales with the sequence length ^ and can be comparable to, or even exceed, the size of the model parameters. This introduces considerable overhead when the sequence length is large enough (as shown in F.4). Consequently, an equally sized recurrent model can exhibit substantially lower latency than a Transformer when the sequence length is large. Note however that as the model size grows the sequence length at which we see latency benefits (where the KV cache size is comparable to parameter size) also increases. It is important to note that, as well as improving latency, having a small hidden state can also increase the largest batch size that fits in memory on a single device, leading to higher throughput. [0200] FIGs. 5A and 5B are experimental plots depicting latency of the Transformer baseline, Hawk, and Griffin models of the neural network 100 versus different sampling prefills for a range of sequence lengths. FIG. 5A shows latency for sampling from an empty prefill. FIG. 5B shows latency for sampling from a prefill of 4 thousand tokens. [0201] Here, inference results for model sizes of 1 billion parameters are reported. For the baseline, a MQA Transformer baseline model of the neural network 100 was used, which was significantly faster during inference than the standard MHA Transformer often used in the literature. The models that were compared include: i) Transformer baseline, ii) Hawk, and iii) Griffin. For comparing different models both latency and throughput are reported. [0202] The latency for models with a batch size of 16 with an empty prefill as well as a prefill of 4096 tokens are compared in FIGs. 5A and 5B respectively. Hawk and Griffin achieved faster Attorney Docket No.45288-0438WO1 sampling latency than Transformers for long sequences. This is particularly noticeable as the sequence length and the prefill length (which affect the size of the KV cache) increased. Griffin achieved similar latency to Hawk, demonstrating the excellent compatibility of linear recurrences and local attention. [0203] Comparisons were made for maximum throughput (tokens/s) of the same models when sampling 512, 1024, 2048 and 4196 tokens following an empty prompt in FIG. 3B. It can be seen that both Griffin and Hawk achieved significantly higher throughput than the Transformer baseline. This is partially due to recurrent models having lower latency but also primarily occurs because Griffin and Hawk can fit larger batch sizes than the Transformer baseline on a single device, since their cache size is smaller. Hawk achieved higher throughputs than Griffin, since the size of the local attention cache eventually becomes comparable to the size of the parameters when the batch size is large. [0204] FIGs.6A and 6B are experimental plots depicting performance of the Transformer baseline with no positional encoding (“NoPE”) and Rotary Positional Embedding (“RoPE”), Hawk, and Griffin models of the neural network 100 at 1 billion parameters on a held-out evaluation set of books. FIG.6A shows performance of each of the models trained with sequence length 2048. FIG. 6B shows performance of Hawk and Griffin models with sequence lengths of respectively 2048 (“2k”) and 8192 (“8k”). Hawk and Griffin were able to extrapolate to significantly longer sequences than the Transformer baselines, and further improved performance when trained on longer sequences. [0205] In this experiment, the effectiveness of Hawk and Griffin was evaluated when using in longer contexts to improve their next token prediction and investigate their extrapolation capabilities during inference. Additionally, the models’ performance on tasks that require copying and retrieval capabilities was explored, both for models that are trained on such tasks, as well as when testing for these capabilities with pre-trained language models. [0206] The ability of Hawk and Griffin was investigated to improve their predictions with longer contexts. In particular, the models were evaluated by measuring the loss on a held-out books dataset across a range of sequence lengths. Using these long documents allowed evaluation of the models’ extrapolation ability, i.e., the ability to accurately predict the next token given contexts that are longer than those seen during training. Attorney Docket No.45288-0438WO1 [0207] In Transformers, the ability to extrapolate is largely determined by the positional encoding used for the attention layers. For recurrent models, it is instead dictated by the capacity of the model to keep refining the representation stored in the hidden state as the context becomes longer. From FIG. 6A, it is observed that, up to some maximal length, both Hawk and Griffin improved next token prediction given longer contexts, and they were overall able to extrapolate to significantly longer sequences (at least 4 times longer) than they were trained on. In particular, Griffin extrapolated remarkably well even when using “RoPE” for local attention layers. The results presented so far evaluate models that have been trained on sequences of 2048 tokens. In order to assess whether the models could also effectively learn from longer contexts, 1 billion parameter models were trained on sequences of 8192 (“8k”) tokens on MassiveText, and compared to models trained on the same dataset but on sequences of length 2048 (“2k”) tokens. The total number of training tokens was kept the same across the models by reducing the batch size by a factor of 4 for the models trained on the sequence length of 8192 (while keeping the number of training steps fixed). As illustrated in FIG.6B, Hawk-8k and Griffin-8k achieved lower evaluation loss for sequences of length 8192 or larger, compared to Hawk-2k and Griffin-2k. This indicates that Hawk and Griffin were able to learn to use longer contexts during training. Interestingly, when evaluated at short sequence lengths, Hawk-2k and Griffin-2k performed slightly better than Hawk- 8k and Griffin-8k. This suggests that the training sequence length can be chosen according to the intended downstream use of the model. [0208] FIGs.7A-7C are experimental plots depicting accuracy of the Transformer baseline, Hawk, and Griffin models of the neural network 100 on different copying and retrieval tasks. FIG. 7A shows the performance of 5-block deep models on a held-out evaluation set when explicitly trained on a Selective Copying task. FIG. 7B shows the performance of 5-block deep models on a held- out evaluation set when explicitly trained on an Inductions Head task. FIG. 7C shows the performance of the models on a Phonebook Lookup Task when evaluating pre-trained Hawk and Griffin models with 7 billion parameters against the Transformer baseline with 6 billion parameters. [0209] Recent work has shown that Transformers can be significantly more efficient than state- space models (“SSMs”), a popular new family of RNNs, at learning synthetic tasks such as copying the context or retrieving relevant tokens from the context. Additionally, it has been demonstrated in the literature that pre-trained Transformers such as Pythia are better at copying and retrieval Attorney Docket No.45288-0438WO1 tasks at evaluation time compared to pre-trained SSM models such as Mamba. Experiments on the efficiency of Griffin and Hawk in learning how to copy and retrieve tokens from the context are presented in FIGs. 7A-7C. Additionally, pre-trained Hawk and Griffin models were evaluated on a Phonebook Lookup task designed to test both copying and retrieval capabilities. [0210] To investigate the efficiency of learning how to copy and retrieve relevant tokens from the context, the models were trained on two synthetic tasks: Selective Copying and Induction Heads. To be able to compare Transformers with Hawk and Griffin, 5-block deep networks with a model dimension of 64 were considered, totaling roughly 250 thousand parameters, where Griffin used a single attention layer block 220A in the middle of the residual network 120, i.e., in the third residual layer block 210-3. ^ Selective Copying Task: In this task, the model needs to learn to copy data tokens from a sequence while ignoring noise tokens from the context. For this experiment, a vocabulary size of 16 was used, and the models were trained on sequences of length 1024, containing 16 data tokens (randomly sampled from the vocabulary and at random locations), with the rest of the tokens set to the noise token. Griffin used a local attention window size of 512. ^ Induction Heads Task: In this task, the model needs to learn to recall the token immediately following a special token. This requires the model to learn the special token, and retrieve the token immediately following it in the context. If the model is able to learn the task, it should be able to extrapolate to significantly longer sequences than it was trained for. For this experiment, a vocabulary size of 16 was used, and the models were trained on sequences of length 256 where the tokens were sampled randomly, and the location of the special token in the sequence was randomly sampled. Griffin used a local attention window of size 128. [0211] The results of the copying and retrieval tasks are shown in FIGs. 7A-7C. On the Selective Copying task, it was found that all three models of the neural network 100 were able to solve the task perfectly. When comparing speed of learning on this task, Hawk was significantly slower than Transformers. Interestingly though, Griffin showed almost no slowdown, effectively matching the speed of learning of Transformers, despite using only a single local attention layer 206. [0212] On the Induction Heads task, while all three models of the neural network 100 solved the task perfectly up to the training sequence length, the Transformer baseline was not able to extrapolate to longer sequences during evaluation. While the Transformer baseline used RoPE, similar observations for Transformers have been made with a range of positional encodings. Hawk Attorney Docket No.45288-0438WO1 was able to perfectly extrapolate on this task to evaluation sequences several orders of magnitude longer than the training sequence length. Notably, Griffin, with local attention, also demonstrated exceptional ability to extrapolate on this task. [0213] The experiments showed that copying and retrieval capabilities naturally emerged in the pre-trained models.7 billion parameter Hawk and Griffin models of the neural network 100 and a 6 billion parameter Transformer baseline model of the neural network 100 were considered, all trained on 300 billion tokens on the MassiveText dataset. For the Phonebook Lookup task, the models were provided a synthetic phonebook containing names and numbers, and the model was asked to retrieve the correct phone number given a name. The prompt to the model was a phonebook including randomly sampled list of names and numbers of a certain length, followed by two randomly sampled examples of the task, followed by a randomly sampled name from the phonebook for which the model needs to retrieve the correct phone number. [0214] From FIG.7C, it can been that while Hawk performed reasonably well on the task for very short phonebook lengths, it failed to memorize and retrieve the correct phone number when the phonebook length grew. This is not particularly surprising since Hawk used a small fixed-size state. The Transformer baseline could almost perfectly solve this task up to the training sequence length, but failed to retrieve the correct phone number for context lengths longer than the training sequence length. Interestingly, Griffin could perfectly solve this task up to a context length that matches its local attention window size of 1024, in spite of using only a single local attention layer 206. Once the context length was long enough such that the local attention window did not cover the whole phonebook, performance started to degrade. Griffin was also able to extrapolate better to longer sequence lengths compared to Transformers. [0215] Some examples of machine learning tasks that the neural network 100 can be configured to perform are described in the following. It will be understood that the neural network 100 can be configured to perform any appropriate task, and the below are examples of such tasks. [0216] In any of the implementations below, the neural network 100 may be deployed as part of a chat bot, dialogue agent, or other software tool that receives inputs from users and provides outputs in response to the received input, e.g., as part of a conversation or dialogue. In these implementations, the input sequences 102 received by the neural network 100 are (generated from) user inputs and the network outputs 104, e.g., output sequences, generated by the neural network 100 can be used to generate responses to the user inputs. Attorney Docket No.45288-0438WO1 [0217] In any of the implementations below, the neural network 100 may be configured as, or include, a generative (e.g., large) language model or a multi-modal model, e.g., a visual and language model, to perform these example machine learning tasks. [0218] As one example, the machine learning task may be a neural machine translation task. For example, if the input sequence 102 to the neural network 100 is a sequence of text, e.g., a sequence of words, phrases, characters, or word pieces, in one language, an output sequence generated, e.g., auto-regressively generated, by the neural network 100 may be a translation of the sequence of text into another language, i.e., a sequence of text in the other language that is a translation of the input sequence 102 of text. The vocabulary for the input tokens may be words, wordpieces or characters of the first language, and the vocabulary for the output tokens may be words, wordpieces or characters of the other language. As a particular example, the machine learning task may be a multi-lingual machine translation task, where the neural network 100 is configured to translate between multiple different source language – target language pairs. In this example, the source language text may be augmented with an identifier that indicates the target language into which the neural network should translate the source language text. [0219] Some implementations may be used for automatic code generation. For example, the input tokens of the input sequence 102 may represent words, wordpieces or characters in a first natural language and the output tokens of an output sequence generated by the neural network 100 may represent instructions in a computer programming or markup language, or instructions for controlling an application program to perform a task e.g. build a data item such as an image or web page. [0220] As another example, the task may be an audio processing task. For example, if the input sequence 102 to the neural network 100 is a sequence representing a spoken utterance, the network output 104 generated by the neural network 100 may be a score for each of a set of pieces of text, each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance. As another example, if the input sequence 102 to the neural network 100 is a sequence representing a spoken utterance, the network output 104 generated by the neural network 100 can indicate whether a particular word or phrase (“hotword”) was spoken in the utterance. As another example, if the input sequence 102 to the neural network 100 is a sequence representing a spoken utterance, the network output 104 generated by the neural network 100 can be a Attorney Docket No.45288-0438WO1 classification of the spoken utterance into one of a plurality of categories, for example an identity of the natural language in which the utterance was spoken. [0221] As another example, the machine learning task can be a natural language processing or understanding task, e.g., an entailment task, a paraphrase task, a textual similarity task, a sentiment task, a sentence completion task, a grammaticality task, and so on, that operates on a sequence of text in some natural language. [0222] As another example, the machine learning task can be a text to speech task, where the input sequence 102 is text in a natural language or features of text in a natural language and the network output 104 is a spectrogram, a waveform, or other data defining audio of the text being spoken in the natural language. [0223] As another example, the machine task can be a health prediction task, where the input sequence 102 is a sequence derived from electronic health record data for a patient and the network output 104 is a prediction that is relevant to the future health of the patient, e.g., a predicted treatment that should be prescribed to the patient, the likelihood that an adverse health event will occur to the patient, or a predicted diagnosis for the patient. Such electronic health data may, for example, comprise one or more sequences of physiological data taken from a patient, with the output being a corresponding prediction that relates to those sequences of data. Examples of physiological data and a corresponding prediction include: blood glucose measurements, with the prediction being a predicted future blood glucose measurement or the prediction of a hyper- or hypo-glycemic event; a heart rate, with the prediction being the presence or absence of a heart condition, or a future cardiac event; blood pressure measurements, with the prediction being the risk of a future heart condition; or the like. [0224] As another example, the machine learning task can be a text generation task, where the input sequence 102 is a sequence of text, and the network output 104 is another sequence of text. e.g., a completion of the input sequence 102 of text, a response to a question posed in the input sequence 102, or a sequence of text that is about a topic specified by the input sequence 102 of text. As another example, the input sequence 102 for the text generation task can be an input other than text, e.g., one or more of image data, video data and audio data, and the network output 104 can be text that describes the input sequence 102. [0225] As another example, the machine learning task can be an image processing task, where the input sequence 102 is a conditioning input and the network output 104 is a sequence of intensity Attorney Docket No.45288-0438WO1 values for the pixels of an image. For instance, the conditioning input can include one or both of text data (e.g. a prompt) and image data. The image processing task can include one or more of image generation, image completion, image extrapolation, image up-scaling, etc. [0226] As another example, the machine learning task can be an image processing task, where the input sequence 102 includes image data and the output characterizes the input data. The input image data may include a sequence of intensity values for the pixels of an image. For instance, the image processing task may classify the image and/or may output text characterizing the image. For instance, the input sequence 102 may include an image and the network output 104 may include text describing the image. [0227] For an image processing task, any input image data can be converted into image tokens (e.g. embeddings of patches of the image(s) contained in the image data) (e.g. through an encoder). Similarly, any output image data can be output through decoding image tokens. The image data may be video data (e.g. may comprise a sequence of images (frames) over time). Accordingly, the image processing tasks described herein may be equally applied to process or generate video data. The video processing task may include one or more of video generation, frame completion, frame expansion, frame up-scaling, frame interpolation, video extrapolation, etc. [0228] In some implementations the input sequence 102 represents data to be classified, and the network output 104 includes a classification of the input sequence 102. For instance, the input sequence 102 may include one or more of a sequence of text data, a sequence of image data, a sequence of video data, a sequence of audio data, or a sequence of sensor data. The input sequence 2 may be encoded (e.g. embedded). That is, the input sequence 102 may include one or more of embedded text data, embedded image data, embedded video data, embedded audio data or embedded sensor data. [0229] In some implementations the input sequence 102 represents data to be compressed, e.g., image data, text data, audio data, or any other type of data, and the network output 104 is a compressed version of the data. The input and output tokens may each include any representation of the data to be compressed/compressed data e.g., symbols or embeddings generated/decoded by a respective neural network. [0230] As another example, the machine learning task can be an agent control task, where the input sequence 102 is a sequence of observations or other data characterizing states of an environment, and the network output 104 defines an action to be performed by the agent in Attorney Docket No.45288-0438WO1 response to the most recent data in the sequence. The agent can be, e.g., a real-world or simulated robot, a control system for an industrial facility, or a control system that controls a different kind of agent. The observations may comprise sensor data captured by sensors associated with (e.g., part of) the agent, for example visual data, LIDAR data, sonar data, agent configuration data (e.g., joint angles), agent orientation data, or the like. [0231] In some implementations, the environment is a real-world environment, the agent is a mechanical (or electro-mechanical) agent interacting with the real-world environment, e.g., a robot or an autonomous or semi-autonomous land, air, or sea vehicle operating in or navigating through the environment, and the actions are actions taken by the mechanical agent in the real-world environment to perform the task. For example, the agent may be a robot interacting with the environment to accomplish a specific task, e.g., to locate or manipulate an object of interest in the environment or to move an object of interest to a specified location in the environment or to navigate to a specified destination in the environment. [0232] In these implementations, the observations may include, e.g., one or more of: images, object position data, and sensor data to capture observations as the agent interacts with the environment, for example sensor data from an image, distance, or position sensor or from an actuator. For example, in the case of a robot, the observations may include data characterizing the current state of the robot, e.g., one or more of: joint position, joint velocity, joint force, torque or acceleration, e.g., gravity-compensated torque feedback, and global or relative pose of an item held by the robot. In the case of a robot or other mechanical agent or vehicle the observations may similarly include one or more of the positions, linear or angular velocity, force, torque or acceleration, and global or relative pose of one or more parts of the agent. The observations may be defined in 1, 2 or 3 dimensions, and may be absolute and/or relative observations. The observations may also include, for example, sensed electronic signals such as motor current or a temperature signal; and/or image or video data for example captured by a camera or a LIDAR sensor, e.g., data from sensors of the agent or data from sensors that are located separately from the agent in the environment. [0233] In these implementations, the actions may be control signals to control the robot or other mechanical agent, e.g., torques for the joints of the robot or higher-level control commands, or the autonomous or semi-autonomous land, air, sea vehicle, e.g., torques to the control surface or other control elements e.g. steering control elements of the vehicle, or higher-level control commands. Attorney Docket No.45288-0438WO1 The control signals can include for example, position, velocity, or force/torque/acceleration data for one or more joints of a robot or parts of another mechanical agent. The control signals may also or instead include electronic control data such as motor control data, or more generally data for controlling one or more electronic devices within the environment the control of which has an effect on the observed state of the environment. For example, in the case of an autonomous or semi-autonomous land or air or sea vehicle the control signals may define actions to control navigation e.g., steering, and movement e.g., braking and/or acceleration of the vehicle. [0234] In some implementations, the environment is a simulation of the above-described real- world environment, and the agent is implemented as one or more computers interacting with the simulated environment. For example, the system 10 implementing the neural network 100 may be used to select actions in the simulated environment during training or evaluation of the system 10 and, after training, or evaluation, or both, are complete, the action selection policy may be deployed for controlling a real-world agent in the particular real-world environment that was the subject of the simulation. This can avoid unnecessary wear and tear on and damage to the real-world environment or real-world agent and can allow the neural network 100 to be trained and evaluated on situations that occur rarely or are difficult or unsafe to re-create in the real-world environment. For example, the system 10 may be partly trained using a simulation of a mechanical agent in a simulation of a particular real-world environment, and afterwards deployed to control the real mechanical agent in the particular real-world environment. Thus, in such cases the observations of the simulated environment relate to the real-world environment, and the selected actions in the simulated environment relate to actions to be performed by the mechanical agent in the real-world environment. [0235] In some implementations, as described above, the agent may not include a human being (e.g., it is a robot). Conversely, in some implementations the agent comprises a human user of a digital assistant such as a smart speaker, smart display, or other device. Then the information defining the task can be obtained from the digital assistant, and the digital assistant can be used to instruct the user based on the task. [0236] For example, the system 10 implementing the neural network 100 may output to the human user, via the digital assistant, instructions for actions for the user to perform at each of a plurality of time steps. The instructions may for example be generated in the form of natural language (transmitted as sound and/or text on a screen) based on actions chosen by the system 10. The Attorney Docket No.45288-0438WO1 system 10 chooses the actions such that they contribute to performing a task. A monitoring system (e.g. a video camera system) may be provided for monitoring the action (if any) which the user actually performs at each time step, in case (e.g. due to human error) it is different from the action which the system 10 instructed the user to perform. Using the monitoring system, the system 10 can determine whether the task has been completed. The system 10 may identify actions which the user performs incorrectly with more than a certain probability. If so, when the system 10 instructs the user to perform such an identified action, the system 10 may warn the user to be careful. Alternatively, or additionally, the system 10 may learn not to instruct the user to perform the identified actions, i.e., ones which the user is likely to perform incorrectly. [0237] More generally, the digital assistant instructing the user may comprise receiving, at the digital assistant, a request from the user for assistance and determining, in response to the request, a series of tasks for the user to perform, e.g., steps or sub-tasks of an overall task. Then for one or more tasks of the series of tasks, e.g., for each task, e.g., until a final task of the series the digital assistant can be used to output to the user an indication of the task, e.g., step or sub-task, to be performed. This may be done using natural language, e.g., on a display and/or using a speech synthesis subsystem of the digital assistant. Visual, e.g., video, and/or audio observations of the user performing the task may be captured, e.g., using the digital assistant. The system 10 as described above may then be used to determine whether the user has successfully achieved the task e.g., step or sub-task, i.e., from the answer as previously described. If there are further tasks to be completed the digital assistant may then, in response, progress to the next task (if any) of the series of tasks, e.g., by outputting an indication of the next task to be performed. In this way the user may be led step-by-step through a series of tasks to perform an overall task. During the training of the neural network 100, training rewards may be generated e.g., from video data representing examples of the overall task (if corpuses of such data are available) or from a simulation of the overall task. [0238] As another example, the machine learning task can be a genomics task, where the input sequence 102 is a sequence representing a fragment of a DNA sequence or other molecule sequence and the network output 104 is either an embedding of the fragment for use in a downstream task, e.g., by making use of an unsupervised learning technique on a data set of DNA sequence fragments, or an output for the downstream task. Examples of downstream tasks include Attorney Docket No.45288-0438WO1 promoter site prediction, methylation analysis, predicting functional effects of non-coding variants, and so on. [0239] In some cases, the machine learning task is a combination of multiple individual machine learning tasks, i.e., the system 10 is configured to perform multiple different individual machine learning tasks, e.g., two or more of the machine learning tasks mentioned above. For example, the system 10 can be configured to perform multiple individual natural language understanding tasks, with the input sequence 102 includes an identifier for the individual natural language understanding task to be performed on the input sequence 102. [0240] In some cases, the machine learning task is a multi-modal processing task that requires processing multi-modal data. In general, multi-modal data is a combination of two or more different types of data, e.g., two or more of audio data, image data, text data, or graph data. As one example, the multi-modal data may comprise audio-visual data, comprising a combination of pixels of an image or of video and audio data representing values of a digitized audio waveform. As another example, the multi-modal data may comprise a combination of i) text data representing text in a natural language and ii) pixels of an image or of video or audio data representing values of an audio waveform. Optionally, but not necessarily, the different types of data may represent the same or overlapping objects using the different modalities (types), and when processing multi- modal data, the data may be mapped into a common embedding space. [0241] As a particular example, the machine learning task is a multi-modal processing task that requires processing both text and image inputs, so that the neural network 100 includes both a computer vision neural network and a text processing neural network. That is, the target output to be generated by the computer vision neural network for a given image depends on one or more outputs generated by the text processing neural network for one or more corresponding text inputs (and vice versa). Examples of such tasks include open-vocabulary image classification, open- vocabulary object detection, image captioning, text-based image search, image-based retrieval, and so on. [0242] More generally, the multi-modal processing task may correspond to any of the machine learning tasks previously described for any of the types of data making up the multi-modal combination. For example, an accuracy of the previously described tasks may be increased when the task is applied to multi-modal data combining the data for which the task has been previously Attorney Docket No.45288-0438WO1 described and another type of data. For example, detection or classification of an object or event may be improved when data of multiple different types (modalities) is processed. [0243] More generally, the machine learning task to be performed by the neural network 100 can be specified by the input sequence. As a particular example, the input sequence can include a prompt or an instruction that specifies the machine learning task that is to be performed by the neural network. Optionally, in this example, the input sequence also includes context for performing the machine learning task. [0244] In this specification, the term "configured" is used in relation to computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation. For instance, configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities. Similarly, one or more computer programs are "configured" to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions. [0245] The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof. The subject matter can be realized as one or more computer programs, essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware. The storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination of these. Additionally or alternatively, the program instructions can be encoded on a transmitted signal, such as a machine- generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware. Furthermore, implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud- Attorney Docket No.45288-0438WO1 based environments where components reside on different machines or within a cloud infrastructure. [0246] The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application- specific integrated circuits (ASICs). In addition to hardware, a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements. Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs, in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed. Similarly, TPUs excel at running optimized tensor operations crucial for many machine learning algorithms. By leveraging these accelerators and their specialized programming models, the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics. [0247] A computer program, also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments). A computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network. The specific implementation of the computer programs may involve a combination of traditional programming languages and Attorney Docket No.45288-0438WO1 specialized languages or libraries designed for GPGPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics. [0248] In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data pre-processing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors. [0249] The processes and logic flows described in this specification can be executed by one or more programmable computers running one or more computer programs to perform functions by operating on input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be utilized to enable concurrent execution of aspects of these processes and logic flows, significantly accelerating performance. This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedups and efficiency gains compared to relying solely on CPUs can be achieved. Alternatively or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), for even greater performance or energy efficiency in specific use cases. [0250] Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. Attorney Docket No.45288-0438WO1 These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both. The elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements. Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. The system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage. [0251] Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); optical media; and optical discs such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence. [0252] To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user. Input can be provided by the user through various means, including a keyboard), touchscreens, voice commands, gesture recognition, or other input modalities depending on the specific device and application. Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback. Furthermore, computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction. Attorney Docket No.45288-0438WO1 [0253] Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models. [0254] Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter. For instance, the described functionality could be implemented solely on a client device (e.g., for on-device machine learning) or deployed as a combination of front-end and back-end components for more complex applications. These components, when present, can be interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN) including the Internet. The specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience. [0255] The computing system can include clients and servers that may be geographically separated and interact through a communication network. The specific type of network, such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP/IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system. In certain embodiments, a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client. The client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities. [0256] In addition to the embodiments of the attached claims and the embodiments described above, the following numbered embodiments are also innovative. Attorney Docket No.45288-0438WO1 [0257] 1. A method performed by one or more computers, the method comprising: receiving an input sequence comprising a respective input token at each of a plurality of input positions; and processing the input sequence, using a neural network, to generate a network output, wherein: the neural network comprises a plurality of layer blocks including: (i) one or more attention layer blocks, and (ii) one or more recurrent layer blocks, each attention layer block comprises an attention layer configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply an attention mechanism over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions, and each recurrent layer block comprises a recurrent layer configured to, for each of the plurality of input positions: receive a layer input comprising: (i) a recurrent state for a preceding input position, and (ii) a layer input embedding for the input position; process the layer input to generate a recurrent state for the input position; and process the recurrent state and layer input embedding for the input position to generate a layer output embedding for the input position. [0258] 2. The method of embodiment 1, wherein the attention layer of each attention layer block in a subset of the one or more attention layer blocks is a global attention layer, and each global attention layer applies a global attention mechanism that, for each input position, attends over all of the plurality of input positions preceding or equal to the input position. [0259] 3. The method of embodiment 2, wherein the global attention mechanism is a dense attention mechanism. [0260] 4. The method of any one of embodiments 2-3, wherein the attention layer of each attention layer block in a complement of the subset of attention layer blocks is a local attention layer, and each local attention layer applies a local attention mechanism that, for each input position, attends only over a local subset of the plurality of input positions that are within a local window of the input position. [0261] 5. The method of any preceding embodiment, wherein for each attention layer, each of the layer input and output embeddings has a plurality of dimensions, and the attention mechanism applies positional encoding to each of the plurality of dimensions of the layer input and layer output embeddings. [0262] 6. The method of embodiment 5, wherein the positional encoding is a relative positional encoding or a Rotary Positional Embedding. Attorney Docket No.45288-0438WO1 [0263] 7. The method of any preceding embodiment, wherein the recurrent layer of each recurrent layer block is a linear recurrent layer, and for each linear recurrent layer, the recurrent state for each input position is linear in the recurrent state for the preceding input position. [0264] 8. The method of any preceding embodiment, wherein each recurrent layer block further comprises a convolutional layer immediately preceding the recurrent layer, and the convolutional layer is configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply a convolution operation over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions. [0265] 9. The method of embodiment 8, wherein each recurrent layer block further comprises a linear layer immediately preceding the convolutional layer, and the linear layer is configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply a linear transformation over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions. [0266] 10. The method of embodiment 9, wherein each recurrent layer block is a gated recurrent layer block comprising: a first channel comprising a feedforward layer configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply an activation function over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions; a second, parallel channel comprising: (i) the linear layer, (ii) the convolutional layer, and (iii) the recurrent layer; and a multiplicative gate proceeding the first and second channels. [0267] 11. The method of embodiment 10, wherein the feedforward layer of each gated recurrent layer block is a Rectified Linear Unit or a Gaussian error Linear Unit. [0268] 12. The method of any one of embodiments 10-11, wherein the first channel of each gated recurrent layer block further comprises a linear layer immediately preceding the feedforward layer. [0269] 13. The method of any preceding embodiment, wherein the plurality of layer blocks includes a plurality of feedforward layer blocks each comprising a feedforward layer. Attorney Docket No.45288-0438WO1 [0270] 14. The method of embodiment 12, wherein the feedforward layer of each feedforward layer block is a Rectified Linear Unit or a Gaussian error Linear Unit. [0271] 15. The method of any one of embodiments 13-14, wherein each feedforward layer block is a gated feedforward layer block comprising: a first channel comprising the feedforward layer; a second, parallel channel comprising a linear layer; and a multiplicative gate proceeding the first and second channels. [0272] 16. The method of embodiment 15, wherein the first channel of each gated feedforward layer block further comprises a linear layer immediately preceding the feedforward layer. [0273] 17. The method of any one of embodiments 13-16, wherein: the plurality of layer blocks is arranged into a sequence according to a base pattern repeating multiple times within the sequence, the base pattern comprises: (i) a layer block of a first type, and (ii) a layer block of a second type proceeding the layer block of the first type, each layer block of the first type comprises a corresponding one of the attention or recurrent layer blocks, and each layer block of the second type comprises a corresponding one of the feedforward layer blocks. [0274] 18. The method of embodiment 17, wherein: the base pattern further comprises: (i) a first normalization layer immediately preceding the layer block of the first type, and (ii) a second normalization layer immediately preceding the layer block of the second type, and each of the first and second normalization layers is configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply a normalization operation over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions. [0275] 19. The method of embodiment 18, where the normalization operation is a layer normalization operation, a batch normalization operation, a weight normalization operation, or a root mean square normalization operation. [0276] 20. The method of any one of embodiments 17-19, wherein the base pattern is a residual pattern comprising: a first skip connection between: (i) a first input preceding the layer block of the first type, and (ii) a first additive gate proceeding the layer block of the first type; and a second skip connection between: (i) a second input preceding the layer block of the second type, and (ii) a second additive gate proceeding the layer block of the second type. [0277] 21. The method of any preceding embodiment, wherein the plurality of layer blocks includes more recurrent layer blocks than attention layer blocks. Attorney Docket No.45288-0438WO1 [0278] 22. The method of embodiment 21, wherein the plurality of layer blocks includes 2 or more recurrent layer blocks for each attention layer block. [0279] 23. The method of embodiment 22, wherein the plurality of layer blocks includes 8 or more recurrent layer blocks for each attention layer block. [0280] 24. The method of any preceding embodiment, wherein the input sequence is a long- range sequence. [0281] 25. The method of embodiment 24, wherein the input sequence includes 2048 or more input tokens. [0282] 26. The method of any preceding embodiment, wherein the network output is a prediction of a token that follows the input token at a last input position in the input sequence. [0283] 27. The method of embodiment 26, further comprising: generating an output token using the network output; appending the output token to the input sequence; processing the input sequence appended with the output token, using the neural network, to generate a prediction of a token that follows the output token. [0284] 28. The method of embodiment 27, further comprising: performing multiple iterations of the method of embodiment 27 to generate an output sequence comprising a respective output token at each of a plurality of output positions. [0285] 29. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of embodiments 1-28. [0286] 30. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of embodiments 1-28. [0287] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as Attorney Docket No.45288-0438WO1 such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. [0288] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. [0289] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. [0290] What is claimed is:

Claims

Attorney Docket No.45288-0438WO1 CLAIMS 1. A method performed by one or more computers, the method comprising: receiving an input sequence comprising a respective input token at each of a plurality of input positions; and processing the input sequence, using a neural network, to generate a network output, wherein: the neural network comprises a plurality of layer blocks including: (i) one or more attention layer blocks, and (ii) one or more recurrent layer blocks, each attention layer block comprises an attention layer configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply an attention mechanism over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions, and each recurrent layer block comprises a recurrent layer configured to, for each of the plurality of input positions: receive a layer input comprising: (i) a hidden state for a preceding input position, and (ii) a layer input embedding for the input position; process the layer input to generate a hidden state for the input position; and process the hidden state and layer input embedding for the input position to generate a layer output embedding for the input position. 2. The method of claim 1, wherein the attention layer of each attention layer block in a subset of the one or more attention layer blocks is a global attention layer, and each global attention layer applies a global attention mechanism that, for each input position, attends over all of the plurality of input positions preceding or equal to the input position. 3. The method of claim 2, wherein the global attention mechanism is a dense attention mechanism. Attorney Docket No.45288-0438WO1 4. The method of any one of claims 2-3, wherein the attention layer of each attention layer block in a complement of the subset of attention layer blocks is a local attention layer, and each local attention layer applies a local attention mechanism that, for each input position, attends only over a local subset of the plurality of input positions that are within a local window of the input position. 5. The method of any preceding claim, wherein for each attention layer, each of the layer input and output embeddings has a plurality of dimensions, and the attention mechanism applies positional encoding to each of the plurality of dimensions of the layer input and layer output embeddings. 6. The method of claim 5, wherein the positional encoding is a relative positional encoding or a Rotary Positional Embedding. 7. The method of any preceding claim, wherein the recurrent layer of each recurrent layer block is a linear recurrent layer, and for each linear recurrent layer, the hidden state for each input position is linear in the hidden state for the preceding input position. 8. The method of any preceding claim, wherein each recurrent layer block further comprises a convolutional layer immediately preceding the recurrent layer, and the convolutional layer is configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply a convolution operation over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions. 9. The method of claim 8, wherein each recurrent layer block further comprises a linear layer immediately preceding the convolutional layer, and the linear layer is configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and Attorney Docket No.45288-0438WO1 apply a linear transformation over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions. 10. The method of claim 9, wherein each recurrent layer block is a gated recurrent layer block comprising: a first channel comprising a feedforward layer configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply an activation function over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions; a second, parallel channel comprising: (i) the linear layer, (ii) the convolutional layer, and (iii) the recurrent layer; and a multiplicative gate proceeding the first and second channels. 11. The method of claim 10, wherein the feedforward layer of each gated recurrent layer block is a Rectified Linear Unit or a Gaussian error Linear Unit. 12. The method of any one of claims 10-11, wherein the first channel of each gated recurrent layer block further comprises a linear layer immediately preceding the feedforward layer. 13. The method of any preceding claim, wherein the plurality of layer blocks includes a plurality of feedforward layer blocks each comprising a feedforward layer. 14. The method of claim 12, wherein the feedforward layer of each feedforward layer block is a Rectified Linear Unit or a Gaussian error Linear Unit. 15. The method of any one of claims 13-14, wherein each feedforward layer block is a gated feedforward layer block comprising: a first channel comprising the feedforward layer; Attorney Docket No.45288-0438WO1 a second, parallel channel comprising a linear layer; and a multiplicative gate proceeding the first and second channels. 16. The method of claim 15, wherein the first channel of each gated feedforward layer block further comprises a linear layer immediately preceding the feedforward layer. 17. The method of any one of claims 13-16, wherein: the plurality of layer blocks is arranged into a sequence according to a base pattern repeating multiple times within the sequence, the base pattern comprises: (i) a layer block of a first type, and (ii) a layer block of a second type proceeding the layer block of the first type, each layer block of the first type comprises a corresponding one of the attention or recurrent layer blocks, and each layer block of the second type comprises a corresponding one of the feedforward layer blocks. 18. The method of claim 17, wherein: the base pattern further comprises: (i) a first normalization layer immediately preceding the layer block of the first type, and (ii) a second normalization layer immediately preceding the layer block of the second type, and each of the first and second normalization layers is configured to: receive a layer input sequence comprising a respective layer input embedding for each of the plurality of input positions; and apply a normalization operation over the layer input sequence to generate a layer output sequence comprising a respective layer output embedding for each of the plurality of input positions. 19. The method of claim 18, where the normalization operation is a layer normalization operation, a batch normalization operation, a weight normalization operation, or a root mean square normalization operation. Attorney Docket No.45288-0438WO1 20. The method of any one of claims 17-19, wherein the base pattern is a residual pattern comprising: a first skip connection between: (i) a first input preceding the layer block of the first type, and (ii) a first additive gate proceeding the layer block of the first type; and a second skip connection between: (i) a second input preceding the layer block of the second type, and (ii) a second additive gate proceeding the layer block of the second type. 21. The method of any preceding claim, wherein the plurality of layer blocks includes more recurrent layer blocks than attention layer blocks. 22. The method of claim 21, wherein the plurality of layer blocks includes 2 or more recurrent layer blocks for each attention layer block. 23. The method of claim 22, wherein the plurality of layer blocks includes 8 or more recurrent layer blocks for each attention layer block. 24. The method of any preceding claim, wherein the input sequence is a long-range sequence. 25. The method of claim 24, wherein the input sequence includes 2048 or more input tokens. 26. The method of any preceding claim, wherein the network output is a prediction of a token that follows the input token at a last input position in the input sequence. 27. The method of claim 26, further comprising: generating an output token using the network output; appending the output token to the input sequence; processing the input sequence appended with the output token, using the neural network, to generate a prediction of a token that follows the output token. 28. The method of claim 27, further comprising: Attorney Docket No.45288-0438WO1 performing multiple iterations of the method of claim 27 to generate an output sequence comprising a respective output token at each of a plurality of output positions. 29. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-28. 30. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-28.