WO2025219918A1 - Duplicating a storable sub-element of a first storable element - Google Patents
Duplicating a storable sub-element of a first storable element
- Publication number
- WO2025219918A1 (Application PCT/IB2025/054023)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- previously calculated
- prompts
- attention content
- graphic processing
- value storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
Definitions
- LLM Large Language Models
- SLA Service Level Agreement
- KV-cache Key-Value cache
- SSDs Solid State Disks
- KV Key-Value
- GPUs Graphic Processing Units
- DPUs Data Processing Units
- KV SDK API Key-Value Software Development Kit Application Programming Interface
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Neurology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method for transformer inference, the method includes (a) receiving one or more prompts; and (b) responding to the one or more prompts by executing multiple prefill and decoding iterations. An executing of a prefill iteration that requires previously calculated attention content comprises retrieving the previously calculated attention content from a hardware key-value storage that is disaggregated from graphic processing units used to perform transformer related calculations during the multiple prefill and decoding iterations.
Description
DUPLICATING A STORABLE SUB-ELEMENT OF A FIRST STORABLE ELEMENT
CROSS REFERENCE
[001] This application claims priority from US provisional patent serial number 63/634,907, filed April 16, 2024, which is incorporated herein by reference.
BACKGROUND
[002] The GenAI revolution is powered by Large Language Models (LLMs) that are transforming the way humans interact with machines and each other. However, the deployment of LLMs is not without challenges. The computational requirements of LLMs are enormous, requiring expensive and power-hungry GPUs to deliver real-time inference. Since power provisioning for these GPUs is limited, and compute demand often exceeds supply, it is not uncommon for user queries to be denied or served in violation of Service Level Agreements (SLAs). This has led to a growing interest in developing efficient and cost-effective solutions for deploying LLMs at scale.
SUMMARY
[003] There may be provided a method, a system, and a non-transitory computer readable medium as illustrated in the application.
BRIEF DESCRIPTION OF THE DRAWINGS
[004] The subject matter regarded as the embodiments of the disclosure is particularly pointed out and distinctly claimed in the concluding portion of the specification. The embodiments of the disclosure, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
[005] FIG. 1 illustrates an example of performing in parallel (a) prefetching a next layer previously calculated attention content, (b) performing transformer related calculations to a current layer, and (c) storing previously calculated attention content of a previous layer;
[006] FIG. 2 illustrates an example of prefill operation and decode operations;
[007] FIG. 3 illustrates an example of a system that includes multiple compute nodes and multiple storage nodes that communicate with each other;
[008] FIG. 4 illustrates an example of a system that includes multiple compute nodes and multiple storage nodes that communicate with each other;
[009] FIG. 5 illustrates an example of a compute node and a storage node;
[0010] FIG. 6 illustrates an example of a compute node and a storage node;
[0011] FIG. 7 illustrates an example of a compute node and a storage node;
[0012] FIG. 8 illustrates an example of a software stack, a storage node software stack and functionalities of the GPU host, GPU and KV store supported by the software stack; and [0013] FIG. 9 illustrates an example of a method.
DETAILED DESCRIPTION OF THE DRAWINGS
[0014] The following abbreviations are used: LLM - Large Language Model; SLA - Service Level Agreement; KV - Key-Value; KV-cache - Key-Value cache; GPU - Graphic Processing Unit; DPU - Data Processing Unit; SSD - Solid State Disk; HBM - High Bandwidth Memory; TTFT - Time-To-First-Token; TPOT - Time-Per-Output-Token.
[0015] While various examples refer to LLMs, it should be noted that any reference to an LLM should be applied mutatis mutandis to transformers or to any machine learning process that calculates attention.
[0016] LLM applications are growing rapidly, leading to deployments of expensive and power-hungry GPUs at an unprecedented rate. This work presents a novel approach to offloading KV-cache tensors by leveraging a distributed KV-store to deliver high performance LLM inference at a fraction of the cost and power consumption of GPUs. We provide deep theoretical analysis and practical considerations that affect the gain of KV-cache offloading, which would help predict the gain for current as well as future models and systems.
[0017] We demonstrate the effectiveness of our approach by comparing it to vLLM, a state-of-the-art LLM inference engine, on a variety of benchmarks. Our approach achieves comparable performance to vanilla vLLM while consuming significantly less power and costing less to deploy. Specifically, we show 2-3x higher tokens per second (tps) per GPU and 2-3x tps per user vs. vanilla vLLM on real world workloads. We discuss how this approach can provide end-to-end gain in a wider range of applications and use cases, such as RAG-based inference and agentic workloads.
[0018] The suggested solution makes the existing infrastructure more efficient — enabling it to serve more queries faster under the same power budget and delivering measurable
benefits to cloud providers and users alike. Creating Key-Value cache (KV-cache) entries and running self-attention operations is the most compute-intensive part of LLM inference. The solution offloads the KV-cache to a dedicated hardware-based accelerated KV-store. KV-cache offloading moves KV-cache blocks from GPU memory to storage, enabling more potential KV-cache hits. By replacing compute with IO accesses, we achieve significant performance improvements and cost savings compared to existing LLM inference engines.
[0019] Popular LLM applications handle long prompts with repeated context. For example: chatbot sessions reuse prior messages for each new input, code analysis apps repeatedly process the same code repository, and assistant and recommendation systems use the same document chunks across queries. In these applications, the KV-cache is recomputed for each new input, even though the context is the same.
[0020] This inefficient repeated context recomputation inflates serving costs and constrains HBM bandwidth utilization and end-to-end performance by restricting batch size.
[0021] This application illustrates a solution that uses a hardware-based accelerator for KV-cache processing. The solution exhibits significant performance improvements and cost savings compared to existing LLM inference engines.
[0022] For simplicity of explanation, we will start by focusing on a simple single query-response use case, where a user submits a query and receives a response from the inference system. We then extend the discussion to multi-turn applications, such as chatbots and autonomous agents, where the system must maintain a conversation history and manage multiple queries simultaneously.
[0023] Basic Flow. Attention-based language models, such as those employing transformers, calculate attention scores signifying the correlation between each input token and all other preceding tokens in the sequence. This is a computationally intensive process, especially for long sequences. It is processed by the self-attention block at each layer of the model.
[0024] A query is handled in two parts: (1) the prefill stage, in which attention is computed for the entire prompt, and (2) the auto-regressive decoding stage, which generates the tokens that comprise the response to the query. In the decoding phase the context grows gradually as each decoding step adds a token to the context. The attention context is captured by two tensors called Key and Value. To optimize the performance and efficiency during inference a Key-Value cache (KV-cache) is used. The prefill phase computes and prefills the KV-cache with the keys and the values of the entire prompt. Decoding steps reuse the entries from the KV-cache and append new entries to it upon each generated token.
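For illustration only, a minimal framework-agnostic sketch of the two stages (the `project_kv` and `attend` callables and the tensor layout are hypothetical placeholders, not the application's implementation):

```python
# Minimal sketch of prefill vs. decode with a KV-cache (hypothetical helpers).
import numpy as np

def prefill(prompt_tokens, project_kv):
    """Compute keys/values for the whole prompt and fill the KV-cache."""
    kv_cache = {"K": [], "V": []}
    for tok in prompt_tokens:
        k, v = project_kv(tok)          # one Key and one Value vector per token
        kv_cache["K"].append(k)
        kv_cache["V"].append(v)
    return kv_cache

def decode_step(new_token, kv_cache, project_kv, attend):
    """Reuse cached keys/values, then append the new token's entries."""
    k, v = project_kv(new_token)
    kv_cache["K"].append(k)
    kv_cache["V"].append(v)
    K = np.stack(kv_cache["K"])         # context grows by one token per step
    V = np.stack(kv_cache["V"])
    return attend(new_token, K, V)      # attention over the full cached context
```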
[0025] Three main indicators are used to evaluate the performance of LLM inference systems: Time-To-First-Token (TTFT), Time-Per-Output-Token (TPOT), Throughput.
[0026] These metrics are crucial for understanding the efficiency and effectiveness of the system in handling user queries and generating responses.
[0027] TTFT measures how quickly users start seeing the model’s output after entering their query. Low waiting times for a response are essential in real-time interactions. This metric is driven by the time required to process the prompt, prefill the KV-cache and then generate the first output token.
[0028] TPOT measures time to generate an output token for each user query in the decode phase. The reciprocal metric is tokens-per-second-per-user (tps/user). This metric corresponds with how each user perceives the “speed” of the model. For example, a TPOT of 100 milliseconds/tok is 10 tokens per second per user, or approximately 450 words per minute, which is faster than a typical person can read.
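As a worked check of the arithmetic above (assuming roughly 0.75 words per token, a ratio not stated in the application):

```latex
\[
\mathrm{TPOT} = 100\ \mathrm{ms/token}
\;\Rightarrow\;
\frac{1}{\mathrm{TPOT}} = 10\ \mathrm{tokens/s}
= 600\ \mathrm{tokens/min}
\;\approx\; 600 \times 0.75\ \mathrm{words/min} = 450\ \mathrm{words/min}.
\]
```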
[0029] Throughput measures the number of output tokens per second the inference system can generate across all users and requests. This can be presented as tokens-per-second-per-GPU (tps/GPU).
[0030] Deployment Challenges. The inference system is composed of a cluster of GPU servers for handling multiple concurrent user requests. Each server is equipped with high-performance GPUs and is responsible for processing a subset of the incoming requests.
[0031] KV-cache is known to have a huge memory footprint. Hence, many techniques are aimed at reducing its size. Multi-Head Latent Attention is an advanced KV-cache compression technique that generalizes Grouped Query Attention. While these innovations optimize memory footprint and reduce memory bandwidth usage, they do not reduce the computational overhead for generating KV-cache - which is reduced by using the suggested solution.
[0032] In real-time online serving, multiple queries are submitted and must be served within SLAs. The system employs batching and load balancing techniques. Batching groups incoming requests into batches, which are then processed together to improve GPU utilization. Load balancing distributes the incoming requests across the available servers, ensuring an even distribution of the workload and preventing any single server from becoming a bottleneck. There are two common techniques for batching inference queries. Dynamic batching waits for a batch of queries to arrive and then processes them together until all complete. In continuous batching, instead of waiting for all sequences in a batch to finish, queries are grouped together as they arrive at the step level.
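For illustration only, the difference between the two batching modes can be sketched as a simplified scheduler loop (this is not the application's scheduler; `arrivals`, `step`, `finished` and the batch-size parameters are hypothetical):

```python
# Simplified contrast between dynamic and continuous batching (hypothetical helpers).
def dynamic_batching(arrivals, step, finished, batch_size):
    batch = [arrivals.get() for _ in range(batch_size)]   # wait for a full batch
    while batch:
        for req in batch:
            step(req)                                     # one decode step per request
        batch = [r for r in batch if not finished(r)]     # run until all complete

def continuous_batching(arrivals, step, finished, max_batch):
    batch = []
    while batch or not arrivals.empty():
        while len(batch) < max_batch and not arrivals.empty():
            batch.append(arrivals.get())                  # admit requests at the step level
        for req in batch:
            step(req)
        batch = [r for r in batch if not finished(r)]     # finished slots free up immediately
```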
[0033] Multi-turn Apps. Chatbots are the most widely adopted use case for leveraging the powerful chat and reasoning capabilities of LLMs. GPU memory capacity is limited, and the cache eviction policy discards stale conversations. It may even delete content related to a conversation round soon after the conversation round ends. In such a case, when a client resumes (after being idle for some time) the prefill phase must recompute the cache for the entire history. This re-computation incurs a computational cost quadratic in the dimension of token embedding and in the total conversation length. With the increase of sequence (context) length to hundreds of thousands to millions of tokens, and the dimension of token embeddings to more than 10K, quadratic computation at the prefill phase translates to hundreds of teraFLOPs.
[0034] This has 3 main consequences: a. Increasing TTFT of each round in a multi-turn conversation, negatively affecting user experience. b. Limiting the number of sequences the system processes at once (batch size), i.e., reducing system throughput, and therefore lowering its cost efficiency. c. When applying continuous batching, the long prefill stage of one user's query means a long iteration for generating the next tokens of the queries in the same batch that are in their decoding stage, again negatively affecting user experience.
[0035] This is true not only for multi-turn chatbots but also for a wider range of applications acting as autonomous agents. These applications utilize more advanced sharing patterns like irregular tree-structured sharing. As a result, these systems aim to significantly reduce TTFT by reducing or even eliminating the prefill compute required for restoring the KV-cache.
[0036] KV-Cache Offloading Requirements. KV-cache offloading moves large KV-caches from GPU memory to KV-cache pools that reside on another memory or storage device, enabling much higher KV-cache hits. The mission is to design a system to store and retrieve precomputed KV-cache entries efficiently. The goal is reducing the compute and accelerating the prefill phase, specifically in multi-turn applications with recurring requests. [0037] These are examples of the requirements for a system that can efficiently handle KV-cache offloading: a. Data sharing: allow multiple GPU servers to share the KV-cache data for optimal load balancing and efficiency.
b. Zero application friction: enable KV-cache retrieval based only on content; avoid additional metadata management for space and time efficiency, and avoid user session association for improved security. c. Performance: fast reads - retrieving KV data from storage should be significantly faster than recomputing it; non-blocking writes - storing newly generated KV-cache entries should not slow down ongoing computations; no degradation - when cached KV data is unavailable, the fallback to normal computation should not create any performance penalties. d. End-to-end gain: any acceleration results in lower TCO and lower power consumption.
[0038] Next, we review the implications of system requirements and deployment considerations on a storage system for KV-cache offloading.
[0039] Zero-Application Friction. The application manages KV-cache entries in storage via content-based indexing, so no application-specific metadata is needed for the index. A prominent design choice is spanning all prompts via a hash-based index, like the method applied in paged-attention, or a prefix-tree index, like the method applied in Lianmin Zheng et al., "SGLang: Efficient execution of structured language model programs", 2024.
[0040] It implies indexing and caching small blocks of the KV-cache as prefix-tree nodes. Typically, each KV-cache block, or prefix-tree node, is composed of 16-32 tokens, with one token carrying a few KBs - a total of less than 100KB of data per block. The benefit of small blocks is minimal redundancy, as multiple branching contexts can share prefix blocks.
[0041] On the one hand, this is ideal for chain-of-thought and multi-agent scenarios where requests often diverge at different points, for overlapping prompts (e.g., system instructions), and for prompt fractions that are repeated across requests, as in RAG-based inference. On the other hand, this imposes a challenge for existing file and object-based storage systems, as a vast number of small files and objects must be managed by the system.
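One possible realization of content-based, application-agnostic block keys is hashing the token prefix up to each block boundary (a sketch under that assumption; the application only requires that keys be derived from content):

```python
# Sketch of content-based keys for KV-cache blocks (hash of the token prefix).
import hashlib

BLOCK_TOKENS = 16  # tokens per KV-cache block (16-32 per the text)

def block_keys(token_ids):
    """Derive one key per block from the whole prefix up to that block, so
    branching contexts that share a prefix map to the same stored blocks."""
    keys = []
    for end in range(BLOCK_TOKENS, len(token_ids) + 1, BLOCK_TOKENS):
        prefix = bytes(str(token_ids[:end]), "utf-8")
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys
```

A lookup then issues one GET per key and falls back to recomputation only for the blocks that miss.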
[0042] Object Size. Modern techniques, such as GQA, reduce the vector size without reducing compute requirements. The result is a significantly smaller IO size. The object size per attention layer depends on D, the model hidden dimension (e.g., 5120, 8K, 16K); P, the precision in bytes per element (1/2, 1, 2), which can change across layers; C, the number of tokens indexed together in a block (16, 32); G, the GQA factor (1, 4, 8); and TP, the tensor parallelism. Across various parameter sets, single-layer object size can span from a few hundred bytes to several tens of kilobytes. For example, the object size for Llama-3.1-405B
is 2KB (D = 16K, G = 16, TP = 8, C = 16), whereas for the smaller (older) Llama-2-13B the size is 80KB (D = 5K, G = 1, TP = 1, C = 16). If the KV-cache of all layers is kept under the same object, the above is multiplied by 32-128, and the typical object size (or file) is 10KBs-10MBs.
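The two examples above are consistent with a per-layer object size of roughly D x P x C / (G x TP) bytes per tensor; this formula is a reconstruction from the quoted numbers (with P assumed to be 1 byte), not an equation given in the application:

```python
# Rough per-layer KV object size (bytes), reconstructed from the listed parameters.
def kv_object_size(D, P, C, G, TP):
    """D: hidden dim, P: bytes/element, C: tokens/block, G: GQA factor, TP: tensor parallelism."""
    return D * P * C // (G * TP)

# P = 1 is assumed here so that the results match the sizes quoted in the text.
print(kv_object_size(D=16 * 1024, P=1, C=16, G=16, TP=8))  # 2048 bytes ~ 2KB  (Llama-3.1-405B example)
print(kv_object_size(D=5 * 1024, P=1, C=16, G=1, TP=1))    # 81920 bytes ~ 80KB (Llama-2-13B example)
```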
[0043] Compression and adaptive quantization techniques further reduce the vector size. They also cause the vector size to vary significantly, with up to a 4x compression gain. Variable-size objects within files suffer from major overhead.
[0044] IO-Compute Overlap. Reads should be fast, otherwise they will become the main performance bottleneck. The storage system should be able to retrieve KV-cache entries at a rate that is significantly faster than recomputing them. In addition, KV-cache entries for newly added tokens need to be computed.
[0045] A non-optimal approach would be to first fetch entries of the history KV-cache from storage, then compute the new entries, and finally store the newly computed KV-cache entries (called delta prefill) back in storage. However, this can potentially double the prefill time. Therefore, we aim to parallelize IO with compute: prefetch the next layer's KV-cache entries while computing the KV-cache entries of the current layer and storing the entries of the previous layer. This IO-compute overlap can result in excessive CPU-GPU synchronization overhead, and small read IOs of size between 1KB and 20KB. Traditional file systems and DRAM-HBM data transfers perform poorly for many tiny requests.
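A minimal sketch of this layer-level overlap (the asynchronous `kv_get_async`/`kv_put_async` handles and `compute_layer` are hypothetical stand-ins for the KV I/O calls and the transformer kernels):

```python
# Sketch: prefetch layer i+1, compute layer i, store layer i-1, all overlapped.
def prefill_pipeline(num_layers, kv_get_async, kv_put_async, compute_layer):
    pending_get = kv_get_async(layer=0)              # warm up: fetch layer 0 KV blocks
    pending_put = None
    for i in range(num_layers):
        kv_in = pending_get.wait()                   # current layer's KV must be ready
        if i + 1 < num_layers:
            pending_get = kv_get_async(layer=i + 1)  # prefetch the next layer's KV blocks
        kv_out = compute_layer(i, kv_in)             # transformer math for layer i
        if pending_put is not None:
            pending_put.wait()                       # previous PUT completes off the critical path
        pending_put = kv_put_async(layer=i, data=kv_out)
    pending_put.wait()
```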
[0046] The solution includes LLM KV-Cache Offloading. According to an embodiment the solution leverages GPU-initiated KV storage to offload KV-cache processing to dedicated KV-store hardware (for example, the LightningAI of Pliops Ltd. of Tel Aviv, Israel). The solution replaces prefill by compute with prefill by IO operations. The KV-cache tensors are persisted to KV-storage as they are produced: either two dense vectors per token, or one sparse vector per token in more advanced models.
[0047] In a decode step a single token is persisted, and in a prefill step, all newly added prompt tokens are persisted. When a client resumes after being idle for some time, the prefill phase restores the KV-cache by retrieving it from the KV-storage instead of computing it from scratch. The lower the computational power of the GPU, the higher the benefit of replacing compute with IOs.
[0048] According to an embodiment, the full context of the user conversation is stored in storage. However, the application can choose to restore only a prefix of the history based on availability of resources, specifically HBM space, compute, and memory bandwidth based on served traffic. Moreover, the entire context of all user history is stored in KV
storage. It can be restored even days after the session was last visited. The application can manage the history storage: delete expired sessions (e.g., for GDPR compliance) or move them to "cold" storage. Users' context history store can be mined and be the basis for analytics, BI, personalization, and monetization opportunities for the app owner, and for a third-party data provider.
[0049] According to an embodiment the solution is implemented by the LightningAI, which is a generic infrastructure for AI applications and applies disaggregated KV storage, extreme performance, and GPU-initiated KV IO. The Pliops hardware KV solution saturates the fabric (including 400Gb and above) even when the traffic consists of extremely small random IOs in read and write.
[0050] Pliops XDP delivers the required efficiency. Combining hardware-accelerated KV with compression and quantization delivers end-to-end system efficiency: high IOPS per dollar/watt and low networking overhead.
[0051] The solution's performance gains are proportional to both IO speed and compute requirements. Since MLA, like GQA, reduces the KV-cache size without lowering compute needs, it results in a net gain for the solution. More broadly, any KV compression technique that does not proportionally reduce compute requirements - such as MQA, GQA, and MLA - inherently enhances the solution's efficiency.
[0052] Expected Gain Analysis. We first analyse the expected gain for static batching scheduling. Denote by TTFT(B) and TPOT(B) the time it takes to run a prefill step and a decode step, respectively, in a batch of size B.
[0053] Prefill is compute-bound, therefore TTFT(B) = B * TTFT(1).
[0054] The decode phase is memory bandwidth-bound, therefore TPOT(B) ~= TPOT(1), when assuming the model size is much larger than the aggregated KV-cache size. Denote by SLA_TTFT and SLA_TPOT the SLAs the application defined for completing a prefill step and a decode step, respectively.
[0055] The solution's source of gain is prefill time reduction. Replacing GPU compute with storage IO in prefill allows higher HBM bandwidth efficiency in decode steps via a larger batch size. Denote by x the prefill acceleration factor of TTFT(1) of the solution over vanilla: TTFT_kv4kv(B) = B * TTFT(1)/x. The solution can increase the batch size by a factor of x and still meet the SLAs.
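As a numeric illustration of the batch-size argument (the SLA and TTFT values below are hypothetical, chosen only to make the arithmetic visible):

```python
# Largest batch size that still meets the prefill SLA, with and without the acceleration factor x.
def max_batch(sla_ttft, ttft_1, x=1.0):
    # From TTFT_kv4kv(B) = B * TTFT(1) / x <= SLA_TTFT, solve for the largest integer B.
    return int(x * sla_ttft // ttft_1)

print(max_batch(sla_ttft=2.0, ttft_1=0.25))        # vanilla: 8
print(max_batch(sla_ttft=2.0, ttft_1=0.25, x=4))   # with a 4x prefill acceleration: 32
```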
[0056] Results. DeepSeek-V3 employs a Mixture of Experts (MoE) model with 671B parameters, but only 37B parameters are used per token. While MoE improves efficiency by selectively activating a subset of the model's parameters, it does not directly impact the
attention mechanism responsible for KV-cache management. From an inference perspective, the performance gain for KV-cache handling in DeepSeek-V3 is like that of a 37B parameter dense model. However, MoE models typically require multiple GPUs due to their size, meaning that KV-cache access is distributed across multiple GPUs. This distribution increases the need for frequent, small KV-cache accesses — a scenario where the solution excels, further strengthening its competitive advantage.
[0057] DeepSeek models incorporate architectural modifications that facilitate speculative decoding, where multiple tokens are predicted in each decoding round. This reduces HBM bandwidth requirements per token, particularly benefiting batched inference. The solution's performance gain is tied to HBM bandwidth efficiency. When speculative decoding reduces the HBM bandwidth tax per token, it enables larger batch sizes, which in turn amplifies the solution's advantages.
[0058] Prefill-decode disaggregation, an increasingly popular inference deployment strategy, adopts separate prefill and decode GPU clusters. For prefill-only GPUs, the solution delivers massive efficiency gains - up to 8x. By leveraging KV offloading technology, prefill GPU cluster footprints can be reduced by at least 5x, significantly improving deployment efficiency and cost-effectiveness.
[0059] According to an embodiment, the solution utilizes KV storage accesses to replace compute-based prefill with an IO-based prefill. Compute is quadratic in model dimension, while IO is linear in model dimension. Thus, the larger the model and its dimension, the higher the benefit of replacing compute with IO.
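One way to read the scaling claim above, per token and per layer, with constants omitted (this cost sketch is an interpretation, not a formula given in the application):

```latex
\[
C_{\mathrm{compute}}(D) = \Theta(D^2) \quad \text{(K/V projection matmuls per token per layer)},
\qquad
C_{\mathrm{IO}}(D) = \Theta(D) \quad \text{(bytes of K/V moved per token per layer)}.
\]
```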
[0060] Attention tensors are persisted to KV storage as they are produced: two dense vectors per layer per token - a single token in each decoding round, and all prompt tokens in the prefill phase of each conversation turn.
[0061] When a user resumes after being idle for some time, the prefill phase restores the attention tensors in HBM by retrieving them from the KV-storage instead of computing them from the prompt itself.
[0062] Requests are batched and allocated resources in a way that maximizes resource utilization - the batching may include applying continuous batching or another type of batching.
[0063] The solution disaggregates prefill from decoding and schedules them separately on dedicated resources when the context is long, or schedules them together otherwise.
[0064] In a multi-node setting, requests are routed to the node storing their past attention scores.
[0065] In a multi-GPU setting, when a model cannot fit into a single HBM, batches are split such that each sub-batch utilizes a different HBM at any time.
[0066] In case context attention is retrieved into a prefill GPU or CPU memory, it is passed to a decoding GPU via NVLink (a wire-based serial multi-lane near-range communications link developed by Nvidia); e.g., NVLink connecting Grace memory with Hopper memory in a Grace Hopper superchip.
[0067] The access (read and writes) to stored context attention can be GPU- initiated or CPU-initiated.
[0068] The full context of the user conversation is stored in storage. However, the application can choose to restore only a suffix of the history based on availability of resources, specifically HBM space/compute/bandwidth based on served traffic.
[0069] Moreover, the entire context of all user history is stored in KV storage by sessions. It can be restored even days after the session was last visited.
[0070] The application can manage the history storage: delete expired sessions (e.g., for GDPR compliance) or move them to "cold" storage.
[0071] Users' context history store can be mined for analytics, BI, personalization, and monetization opportunities for the app owner, and for a third-party data provider.
[0072] Figure 1 illustrates an example of performing in parallel (a) prefetching a next layer previously calculated attention content, (b) performing transformer related calculations to a current layer, and (c) storing previously calculated attention content of a previous layer.
[0073] Figure 1 illustrates: a. A computation of a token in relation to a transformer model having N layers - which shows a sequence of N computation steps (Compute L0 - Compute LN) and related get/put submission/completion kernel instructions/notifications denoted 21, 22, 23 and 3. b. Prefetching a next layer previously calculated attention content (IOs GET L0...IOs GET L4 12-0 - 12-4) in parallel to Compute L0 - Compute L3. For simplicity of explanation the prefetching related to the Nth layer (preceding Compute LN) is not shown. c. Storing previously calculated attention content of a previous layer (IOs PUT L0...IOs PUT L2 13-0 - 13-2) in parallel to Compute L1 - Compute L3. For simplicity of explanation the storing related to the Nth layer (following Compute LN) is not shown.
[0074] Figure 2 illustrates prefill operations 31 (most of which include retrieve-attention-state operations 33 instead of recalculating the previous attention state) and decode operations 32. Figure 2 also illustrates a decode operation 33 that includes a sequence of compute-token-per-layer operations 34 that are immediately followed by submission operations 35.
[0075] Figure 3 illustrates a system that includes multiple compute nodes 40 and multiple storage nodes 50 that communicate with each other. The compute nodes are disaggregated from the storage nodes 50.
[0076] The compute nodes 40 include GPUs 42 and application containers 41 that (at least) execute applications. The storage nodes 50 include memory units 52 that may buffer content, controllers 53 that manage the operation of the storage nodes - including managing KV content - and Solid State Disks (SSDs) 51.
[0077] Figure 4 illustrates a system that includes multiple compute nodes 40a and multiple storage nodes 50a that communicate with each other. The compute nodes 40a are disaggregated from the storage nodes 50a.
[0078] The compute nodes 40a include GPUs 42, a DPU/NIC 45 for managing communication, and application containers 41 that are illustrated as including an inference application 43 and a KV I/O SDK (a software development kit exposing a KV API for developers) 44. The storage node 50a includes a DPU/NIC 55 for managing communication, memory units such as NVMe-oF (NVMe over Fabrics, a protocol specification designed to connect hosts to storage targets) units 56, XDP controllers (a software service) 57 that manage the operation of the storage nodes - including managing KV content - and Solid State Disks (SSDs) 51.
[0079] Figure 5 illustrates a compute node 40b having GPU-initiated IO operations (using an NVMe link to DPUs 45a). The application containers 41 include an inference application (vLLM) 43, a multi IO KV API 46 and a KV I/O SDK 44. The storage nodes 50b include a DPU/NIC 55 for managing communication, memory units such as NVMe-oF targets 56, XDP controllers 57 that manage the operation of the storage node - including managing KV content - and Solid State Disks (SSDs) 51.
[0080] Figure 6 illustrates a compute node 40c having GPU-triggered IO operations without a DPU. The compute node includes GPUs 42, application containers 41 (that include an inference application (vLLM) 43, a multi IO KV API 46 and a KV I/O SDK 44), a KV gateway container (a software service for managing communication) 48 and NICs 45b. The KV gateway container 48 is in communication with the KV I/O SDK 44 and the NICs 45b.
[0081] The storage node 50b includes a DPU/NIC 55 for managing communication, memory units such as NVMe-oF targets 56, XDP controllers 57 that manage the operation of the storage nodes - including managing KV content - and Solid State Disks (SSDs) 51.
[0082] Figure 7 illustrates a compute node 40c having GPU-initiated IO operations without a DPU. The compute node includes GPUs 42, application containers 41 (that include an inference application (vLLM) 43, a multi IO KV API 46 and a KV I/O SDK 44), a KV gateway container 48 and NICs 45b. The KV gateway container 48 is in communication with the KV I/O SDK 44 and the NICs 45b.
[0083] Figure 8 illustrates an example of a software stack 60, storage node software stack 70 and functionalities of the GPU host, GPU and KV store supported by the software stack.
[0084] According to an embodiment, the software stack includes at least some of: a. A vLLM production stack, which is a combination of tools and infrastructure used to serve vLLM (a high-throughput and memory-efficient LLM inference engine) in production environments. b. vLLM + KV cache acceleration, which manages at least in part the interaction between vLLM and the key-value cache mechanism. c. A KV I/O SDK (a software development kit exposing a KV API for developers). d. Low-level system acceleration concepts related to Key-Value (KV) storage and I/O optimization across CPUs, GPUs, and DPUs (Data Processing Units), such as the KV SDK API, GPU KV I/O, DPU NVMe emulation, CPU NVMe emulation, triggered multi I/O, CPU KV I/O, and NVMe.
[0085] According to an embodiment, the KV SDK API (Key-Value Software Development Kit API) is a programming interface for interacting with key-value storage systems. It abstracts put/get/delete operations like in RocksDB, Redis, or KV-based flash storage. In accelerated environments (e.g., using GPUs or DPUs), the KV SDK may interface with specialized hardware to: Speed up access, Enable parallelism, Offload operations from the CPU.
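For illustration, a put/get/delete surface of the kind described might look as follows (hypothetical class and method names, backed here by a plain dictionary; this is not the actual Pliops SDK API):

```python
# Hypothetical shape of a KV SDK used for KV-cache offloading (not the real SDK).
class KVStore:
    def __init__(self):
        self._backend = {}             # stand-in for hardware-accelerated KV storage

    def put(self, key: bytes, value: bytes) -> None:
        self._backend[key] = value     # non-blocking in a real accelerated store

    def get(self, key: bytes) -> bytes | None:
        return self._backend.get(key)  # a miss means the caller falls back to recomputation

    def delete(self, key: bytes) -> None:
        self._backend.pop(key, None)   # e.g., removing expired sessions (GDPR)
```

In an accelerated deployment the same surface would be backed by the hardware KV-store rather than host memory.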
[0086] According to an embodiment, GPU KV I/O (GPU Key-Value Input/Output) refers to performing key-value operations directly on GPUs. This avoids round-tripping data back to the CPU, enabling zero-copy I/O and low-latency data access. It is useful for LLM inference engines like vLLM, where models can store and retrieve cached key-value tensors entirely within GPU memory.
[0087] According to an embodiment, the DPU NVMe Emulation is executed by a DPU that emulates an NVMe SSD, responding to block read/write commands.
[0088] According to an embodiment, the CPU NVMe Emulation is executed by a CPU that emulates an NVMe SSD, responding to block read/write commands.
[0089] According to an embodiment, the triggered Multi I/O is a high-performance I/O scheduling mechanism. Multiple I/O operations (like NVMe reads/writes) are triggered by a single event or condition, allowing batch execution. Especially useful in parallel systems (like GPUs or DPUs) to reduce the number of syscall context switches or DMA triggers.
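A minimal sketch of the triggered multi-I/O idea, where one trigger submits a whole batch of KV operations (the `submit_batch` handle and its `then` callback are hypothetical placeholders):

```python
# Sketch: one trigger event submits a batch of KV GETs, amortizing submission overhead.
def triggered_multi_get(keys, submit_batch, on_complete):
    """Submit all GETs for a layer in a single batched request."""
    batch = [("GET", k) for k in keys]   # many small IOs ...
    handle = submit_batch(batch)         # ... one submission / DMA trigger
    handle.then(on_complete)             # completion fans back out per key
    return handle
```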
[0090] Figure 8 illustrates, in addition to the software elements illustrated above, a software node production stack 70 which coordinates scheduling of inference tasks to different vLLM instances running on various GPU nodes.
[0091] Figure 9 illustrates an example of method 100 for transformer inference.
[0092] According to an embodiment, method 100 starts by step 110 of receiving one or more prompts.
[0093] According to an embodiment, step 110 is followed by step 120 of responding to the one or more prompts by executing multiple prefill and decoding iterations.
[0094] According to an embodiment, an executing of a prefill iteration that requires previously calculated attention content includes retrieving the previously calculated attention content from a hardware key-value storage that is disaggregated from graphic processing units used to perform transformer related calculations during the multiple prefill and decoding iterations.
[0095] According to an embodiment step 120 includes step 121 of disaggregating graphic processing unit prefill related calculations from graphic processing unit decoding related calculations.
[0096] According to an embodiment step 120 includes step 122 of storing attention content in the hardware key-value storage immediately following a calculating of the attention content.
[0097] According to an embodiment the multiple prefill and decoding iterations are associated with different layers of a transformer model.
[0098] According to an embodiment, step 120 includes step 123 of pipelining (i) retrieving operations, (ii) transformer related calculations and (iii) storing operations related to the hardware key-value storage.
[0099] According to an embodiment, step 123 includes performing in parallel (a) prefetching a next layer previously calculated attention content, (b) performing transformer related calculations to a current layer, and (c) storing previously calculated attention content of a previous layer.
[00100] According to an embodiment, step 120 includes step 124 of storing, by the hardware key-value storage, previously calculated attention content for a period that exceeds (for example by factors of 10, 100, 1000 and even more) a time-to-live period of content cached in at least one of a graphic processing unit cache, a local cache, or a data processing unit cache. The extended retention period guarantees that even when processing multiple threads and/or skipping from one task to the other, the required attention content will still reside in the key-value storage and can be used for retrieving the attention state from the key-value storage. According to an embodiment the size of the hardware key-value storage exceeds, by a factor of at least 100, 1000, or 10000, the size of the graphic processing unit cache, the local cache, or the data processing unit cache - which allows storing the entire (or significant selected portions of the) conversation history (even when the conversation is very long and is associated with an extensive amount of attention content).
[00101] According to an embodiment, step 120 includes step 125 of applying content-based and application-agnostic indexing of attention content items stored in the hardware key-value storage.
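One possible way to obtain content-based, application-agnostic keys, shown here only as a hedged sketch, is to hash fixed-size token blocks and chain each block's hash with the hash of its predecessor, so identical prefixes map to identical keys regardless of which application produced them. The block size and hashing scheme below are assumptions, not the claimed indexing.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per indexed block (illustrative value)

def block_keys(token_ids):
    # Returns one key per token block; the chained hash makes each key depend
    # on the entire preceding prefix, not on any application-specific metadata.
    keys, parent = [], b""
    for start in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + str(block).encode()).hexdigest()
        keys.append(digest)
        parent = digest.encode()
    return keys
```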
[00102] According to an embodiment, the one or more prompts are associated with a conversation history that is stored in the hardware key-value storage, and the previously calculated attention content forms the entire history or forms only a portion of the history.
[00103] According to an embodiment, the portion of the history is determined based on batching.
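As a purely illustrative example of determining the reused portion based on batching, the helper below caps the history reused per prompt so that the combined batch respects a token budget; the even-split policy and all names are assumptions, not the claimed mechanism.

```python
def history_portion_per_prompt(history_lengths, batch_token_budget):
    # Split the budget evenly; each prompt reuses at most its fair share of
    # previously calculated attention content.
    share = batch_token_budget // max(len(history_lengths), 1)
    return [min(length, share) for length in history_lengths]

# Example: three conversations with 4000, 1200 and 300 history tokens and a
# 3000-token budget reuse 1000, 1000 and 300 tokens respectively.
```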
[00104] According to an embodiment, step 120 includes step 126 of triggering the retrieving of the previously calculated attention content by at least one of the graphic processing units.
[00105] According to an embodiment, step 120 includes step 127 of initiating the retrieving of the previously calculated attention content by at least one of the graphic processing units.
[00106] According to an embodiment, a data processing unit controls the retrieving.
[00107] According to an embodiment, step 110 includes step 112 of batching multiple received prompts to provide a batch of the one or more prompts.
[00108] According to an embodiment, the hardware key-value storage includes multiple key-value storage nodes.
[00109] According to an embodiment, the graphic processing units are in multiple graphic processing nodes that are in communication with the multiple key-value storage nodes.
[00110] According to an embodiment, a storable element includes information and
[00111] Any reference to “may be” should also refer to “may not be.”
[00112] In the foregoing detailed description, numerous specific details are set forth to provide a thorough understanding of the one or more embodiments of the disclosure.
However, it will be understood by those skilled in the art that the present one or more embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present one or more embodiments of the disclosure.
[00113] It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
[00114] Because the illustrated embodiments of the disclosure may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present one or more embodiments of the disclosure and in order not to obfuscate or distract from the teachings of the present one or more embodiments of the disclosure.
[00115] Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
[00116] Any reference in the specification to a system and any other component should be applied mutatis mutandis to a method that may be executed by a system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
[00117] Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
[00118] Any combination of any module or unit listed in any of the figures, any part of the specification and/or any claims may be provided. Especially any combination of any claimed feature may be provided.
[00119] In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.
[00120] Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks, circuit elements, or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality.
[00121] Any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being "operably connected," or "operably coupled," to each other to achieve the desired functionality.
[00122] Any reference to “comprising,” “having” and/or “including” should be applied mutatis mutandis to “consisting” and/or “consisting essentially of.”
[00123] Furthermore, those skilled in the art will recognize that the boundaries between the above-described operations are merely illustrative. Multiple operations may be combined into a single operation, a single operation may be distributed in additional operations, and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
[00124] Also, for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.
[00125] However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.
[00126] In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of
other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an." The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first" and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.
[00127] While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
[00128] It is appreciated that various features of the embodiments of the disclosure which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the embodiments of the disclosure which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.
[00129] It will be appreciated by persons skilled in the art that the embodiments of the disclosure are not limited by what has been particularly shown and described hereinabove. Rather, the scope of the embodiments of the disclosure is defined by the appended claims and equivalents thereof.
Claims
WE CLAIM
1. A method for transformer inference, the method comprising: receiving one or more prompts; and responding to the one or more prompts by executing multiple prefill and decoding iterations; wherein an executing of a prefill iteration that requires previously calculated attention content comprises retrieving the previously calculated attention content from a hardware key-value storage that is disaggregated from graphic processing units used to perform transformer related calculations during the multiple prefill and decoding iterations.
2. The method according to claim 1, comprising disaggregating graphic processing unit prefill related calculations from graphic processing unit decoding related calculations.
3. The method according to claim 1, comprising storing attention content in the hardware key-value storage immediately following a calculating of the attention content.
4. The method according to claim 1, wherein the multiple prefill and decoding iterations are associated with different layers of a transformer model.
5. The method according to claim 4, comprising pipelining retrieving operations, transformer related calculations and storing operations related to the hardware key-value storage.
6. The method according to claim 5 wherein the pipelining comprises: prefetching a next layer previously calculated attention content in parallel to performing transformer related calculations to a current layer and in parallel to storing previously calculated attention content of a previous layer.
7. The method according to claim 1, wherein the hardware key-value storage stores previously calculated attention content for a period that exceeds a time to live period of content cached in at least one of a graphic processing unit cache, a local cache, or a data processing unit cache.
8. The method according to claim 1, comprising applying content-based and application-agnostic indexing of attention content items stored in the hardware key-value storage.
9. The method according to claim 1, wherein the one or more prompts are associated with a conversation history that is stored in the hardware key-value storage.
10. The method according to claim 9, wherein the previously calculated attention content forms only a portion of the history.
11. The method according to claim 10, comprising batching multiple received prompts to provide a batch of the one or more prompts, and wherein the portion of the history is determined based on the batching.
12. The method according to claim 1, comprising triggering the retrieving of the previously calculated attention content by at least one of the graphic processing units.
13. The method according to claim 1, comprising initiating the retrieving of the previously calculated attention content by at least one of the graphic processing units.
14. The method according to claim 1, wherein a data processing unit controls the retrieving.
15. The method according to claim 1, comprising batching multiple received prompts to provide a batch of the one or more prompts.
16. The method according to claim 1, wherein the hardware key-value storage comprises multiple key-value storage nodes.
17. The method according to claim 16, wherein the graphic processing units are located in multiple graphic processing nodes that are in communication with the multiple key-value storage nodes.
18. A non-transitory computer readable medium for transformer inference, the non-transitory computer readable medium stores instructions that, once executed by a computerized system, cause the computerized system to: receive one or more prompts; and respond to the one or more prompts by executing multiple prefill and decoding iterations; wherein an executing of a prefill iteration that requires previously calculated attention content comprises retrieving the previously calculated attention content from a hardware key-value storage that is disaggregated from graphic processing units used to perform transformer related calculations during the multiple prefill and decoding iterations.
19. A computerized system for transformer inference that comprises: graphic processing units; hardware key-value storage that is disaggregated from graphic processing units; wherein the computerized system is configured to receive one or more prompts and to respond to the one or more prompts by executing multiple prefill and decoding iterations; wherein the responding comprises performing, by the graphic
processing units, transformer related calculations during the multiple prefill and decoding iterations; and wherein an executing of a prefill iteration that requires previously calculated attention content comprises retrieving the previously calculated attention content from the hardware key-value storage.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463634907P | 2024-04-16 | 2024-04-16 | |
| US63/634,907 | 2024-04-16 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025219918A1 true WO2025219918A1 (en) | 2025-10-23 |
Family
ID=97403115
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2025/054023 Pending WO2025219918A1 (en) | 2024-04-16 | 2025-04-16 | Duplicating a storable sub-element of a first storable element |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025219918A1 (en) |
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030172149A1 (en) * | 2002-01-23 | 2003-09-11 | Andiamo Systems, A Delaware Corporation | Methods and apparatus for implementing virtualization of storage within a storage area network |
| US20070043771A1 (en) * | 2005-08-16 | 2007-02-22 | Ludwig Thomas E | Disaggregated resources and access methods |
| US20200356724A1 (en) * | 2019-05-06 | 2020-11-12 | University Of Electronic Science And Technology Of China | Multi-hop attention and depth model, method, storage medium and terminal for classification of target sentiments |
| US11531863B1 (en) * | 2019-08-08 | 2022-12-20 | Meta Platforms Technologies, Llc | Systems and methods for localization and classification of content in a data set |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25790193; Country of ref document: EP; Kind code of ref document: A1 |