WO2025101998A1 - Fractal core architecture system for implementing private and efficient large language models
Info
- Publication number
- WO2025101998A1 (PCT/US2024/055260 / US2024055260W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cores
- array
- processing
- core
- weights
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/008—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5012—Processor sets
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
Definitions
- the present disclosure relates generally to security for large language models. More particularly, aspects of this disclosure relate to a core based architecture for executing private large language models and efficient encryption to protect the data and models.
- GenAI Generative Artificial Intelligence
- a LLM may be considered as a nonlinear mapping from the input X (a query) to the output Y (a response) by performing the major processing blocks shown in FIG. 1A.
- the major processing blocks thus include Embedding and Encoding 10, Multiple Head Attentions 12, Feed Forward Perceptron 14, Layer Normalization 16 and Softmax functions 18.
- the Embedding and Encoding 10 is a pre-processing unit which mainly performs the matrix multiplications between the input matrix and the corresponding weight matrices.
- Multiple Head Attentions 12 perform multiple matrix multiplications according to three weight matrices called a Query Matrix 20, a Key Matrix 22, and a Value Matrix 24, respectively to jointly attend to information from different representation subspaces at different positions.
- the Feed Forward Perceptron layers 14 constitute a conventional feedforward neural network having at least one hidden layer.
- the Layer Normalization 16 simply normalizes each input.
- the Softmax function 18 is an activation function that scales numbers/logits into probabilities.
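- For illustration only, the following is a minimal numerical sketch of the blocks just described (a single attention head, toy dimensions, and random weights; the names Wq, Wk, Wv, W1, and W2 are illustrative assumptions rather than anything taken from the disclosure):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def llm_layer(X, Wq, Wk, Wv, W1, W2):
    # Attention block: Query, Key and Value projections followed by Softmax
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = layer_norm(X + attn)                  # layer normalization
    # Feed forward perceptron with one hidden layer
    hidden = np.maximum(0.0, X @ W1)
    return layer_norm(X + hidden @ W2)

rng = np.random.default_rng(0)
d, seq = 16, 8                                # toy embedding size and sequence length
X = rng.standard_normal((seq, d))             # embedded/encoded input
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W1, W2 = rng.standard_normal((d, 4 * d)), rng.standard_normal((4 * d, d))
Y = llm_layer(X, Wq, Wk, Wv, W1, W2)          # (seq, d) output of one layer
```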
- the output of a LLM can uniquely be generated by all the pre-determined weight matrices and its inputs. These weight matrices can be obtained during an off-line training stage.
- LLM simply performs all the above matrix multiplications and corresponding nonlinear functions such as layer normalization and Softmax operations.
- Different LLMs have different parameter sizes (the total element number of the weight matrices), which mainly depend on the numbers of attention heads and number of LLM layers. For example, GPT-3 has 96 heads and 96 layers and hence there are about 175 billion parameters (elements of weight matrices) in total.
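- As a rough, hedged check of that figure, assuming the publicly reported GPT-3 hidden size of 12,288 (96 heads of 128 dimensions each) and counting only the per-layer attention and feed-forward matrices:

```python
# Back-of-the-envelope estimate only; embeddings and biases are ignored.
d, layers = 12288, 96          # assumed GPT-3 hidden size and layer count
attention = 4 * d * d          # query, key, value and output projections
feed_forward = 2 * d * (4 * d) # two feed-forward matrices with a 4*d hidden layer
print((attention + feed_forward) * layers)   # ~1.74e11, i.e. roughly 175 billion
```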
- FIG. 1B shows a simplified diagram of a general LLM 40 between a client providing input X, the LLM executed on a server, and an output Y returned to the client.
- the General LLM is actually the LLM that is most commonly used.
- W denotes the entire weight matrix
- F (W, X) represents the nonlinear mapping that the LLM is to perform.
- the processing steps include: 1) a Client sending their data X to the party operating the server that owns LLM parameters; 2) the Server performing the LLM operation with X as inputs of the LLM as shown in FIG. 1A; and 3) the Server sending the Client the generated output Y, which equals F(W, X) as shown in FIG. 1B.
- a Fine-Tuned LLM involves taking a pre-existing general model that has been trained on a large dataset, such as a language model like GPT-3, and refining the general model for a specific task or domain. During fine-tuning, the model is further trained on a smaller, domain-specific dataset. This process adapts the parameters of the model to the nuances of the target task, improving its performance and making it more capable in handling specific tasks.
- Fine-tuned LLMs are a cost-effective and efficient way to leverage the knowledge learned by a pre-trained general model while tailoring the general model to specific applications. This reduces the need for extensive training from scratch. Fine-tuned LLMs allow for rapid development of domain specific Al solutions with high accuracy and applicability.
- FIG. 1C is a diagram of a fine-tuned LLM process 50.
- FIG. 1C is a representation of performance of a Fine-Tuned LLM task where ΔW is the difference between the new modified weight matrix (from specific datasets) and the original weight matrix W (from larger datasets).
- the processing steps of the fine-tuned LLM 50 in FIG. 1C include: 1) a Client sending their data, X, to a party operating the server with LLM parameters and fine-tuned parameters; 2) the Server performing the LLM operation with X as inputs of the Fine-Tuned LLM; and 3) the Server sending the generated output Y to the Client.
- the generated output, Y, may be represented by F(W, X) + F(ΔW, X) or F(W + ΔW, X).
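- The equality of the two forms above holds whenever the fine-tuning update enters through a linear operation such as the matrix multiplications of FIG. 1A; a quick plaintext check with toy sizes and random matrices (illustrative assumptions only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 16))     # input data
W = rng.standard_normal((16, 16))    # general (pre-trained) weights
dW = rng.standard_normal((16, 16))   # fine-tuning weight difference (delta W)

F = lambda weights, inputs: inputs @ weights          # linear part of the mapping
assert np.allclose(F(W, X) + F(dW, X), F(W + dW, X))  # F(W,X) + F(dW,X) == F(W+dW,X)
```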
- a client of a party that owns such LLMs is providing queries to the LLM owner and receives responses to these queries from the LLM owner.
- the queries from the client may contain intellectual property and the answers to these queries may also contain new and novel intellectual property that the owner of the LLM now has access to.
- the information that is used for fine-tuning the LLM may be highly sensitive, proprietary, classified, or protected under privacy laws such as the European GDPR rules or US HIPAA rules. There is a need to be able to perform training of LLMs without exposing the training information during the fine-tuning process. Such information may be protected by encryption.
- a second framework is OpenFHE, which supports multiple schemes including Brakerski-Gentry-Vaikuntanathan (BGV), Brakerski/Fan-Vercauteren (BFV), Cheon-Kim-Kim-Song (CKKS), TFHE, and FHEW.
- An example FHE operation is a Partial Result Sharing Approach for two parties, which is functionally equivalent to a (2,2) threshold approach.
- the first party generates their own FHE public key (PK1) and FHE private key (SK1) for their database.
- the second party generates their own FHE public key (PK2) and FHE private key (SK2) for their database.
- the public keys, PK1 and PK2, are shared between the parties, while the private keys, SK1 and SK2, are kept secret.
- a Joint Key is computed using the public keys PK1 and PK2.
- This key is neither a secret key nor a public key in the traditional sense but serves as a unified platform facilitating joint computations on encrypted data.
- Evaluation keys associated with the joint key, are generated. These evaluation keys are crucial for operations like “addition” and “multiplication” in the homomorphic encryption domain.
- the first party may encrypt its data (Data1) with PK1, resulting in Ciphertextdb1.
- the second party may encrypt its data (Data2) with PK2, resulting in Ciphertextdb2.
- a process often referred to as key switching is applied. Key switching converts Ciphertextdb1 and Ciphertextdb2 to be compatible with the joint key (JK), allowing for homomorphic operations without revealing the actual data.
- JK joint key
- the first party can no longer decrypt Ciphertextdb1 using only SK1
- the second party can no longer decrypt Ciphertextdb2 using only SK2.
- both parties must collaborate to decrypt the respective ciphertexts.
- Computations (such as addition, multiplication, etc.) may be performed on the encrypted data. These operations are executed under the joint key, ensuring consistency and validity. After computations, the process for joint decryption begins, ensuring that neither of the secret keys, SK1 nor SK2, is exposed to the other party.
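- A structural sketch of this partial-result-sharing flow is shown below; it is written as plain Python around a generic `fhe` object, and every method name on it (keygen, make_joint_key, gen_eval_keys, encrypt, key_switch, add, partial_decrypt, combine) is a hypothetical placeholder standing in for the corresponding primitive of an FHE library such as OpenFHE, not a real API:

```python
def two_party_joint_computation(fhe, data1, data2):
    """(2,2) threshold-style flow; all `fhe` methods are hypothetical placeholders."""
    # 1. Each party independently generates its own keypair
    pk1, sk1 = fhe.keygen()                 # first party
    pk2, sk2 = fhe.keygen()                 # second party

    # 2. A joint key and its evaluation keys are derived from the public keys
    jk = fhe.make_joint_key(pk1, pk2)
    eval_keys = fhe.gen_eval_keys(jk)       # needed for addition/multiplication

    # 3. Each party encrypts its own database under its own public key
    ct_db1 = fhe.encrypt(pk1, data1)
    ct_db2 = fhe.encrypt(pk2, data2)

    # 4. Key switching makes both ciphertexts compatible with the joint key
    ct_db1 = fhe.key_switch(ct_db1, jk)
    ct_db2 = fhe.key_switch(ct_db2, jk)

    # 5. Homomorphic computation under the joint key
    ct_result = fhe.add(ct_db1, ct_db2, eval_keys)

    # 6. Joint decryption: neither SK1 nor SK2 alone can recover the result
    share1 = fhe.partial_decrypt(sk1, ct_result)
    share2 = fhe.partial_decrypt(sk2, ct_result)
    return fhe.combine(share1, share2)
```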
- the Concrete library is an open-source library developed in Rust that builds on the state-of-art TFHE cryptosystem.
- the Concrete library provides a user friendly interface making FHE easy to integrate.
- the Concrete library deals with inputs of arbitrary format and comes with an extensive set of operations for manipulating ciphertexts, including a programmable bootstrapping process.
- Learning With Errors (LWE) is a quantum robust method of cryptography applicable to FHE.
- the LWE problem is conjectured to be hard to solve, and thus to be useful in cryptography.
- FHE is based on a quantum secure scheme for the LWE (learning with errors) problem.
- the FHE allows computations such as Boolean operations, integer arithmetic operations, and floating-point arithmetic operations on ciphertext without decryption.
- sensitive data analysis computations may be performed on encrypted data without ever decrypting the data.
- LLMs can be grouped into three levels by taking different data format (plaintext or ciphertext) for input query, weights and output responses of a LLM.
- a Level 1 private LLM encrypts input queries and output responses but leaves general weights in plaintext.
- the Level 1 private LLM does not incorporate fine-tuning.
- a Level 2 private LLM encrypts input queries and output responses, but leaves fine-tuning and general weights in plaintext.
- a Level 3 private LLM encrypts input queries, output responses, and fine-tuning weights, but leaves general weights in plaintext. The additional encryption required as well as the other computational operations required makes each successive level of the private LLM more impractical for current hardware.
- the present disclosure relates generally to security applications. More particularly, aspects of this disclosure relate to techniques to protect private large language models with efficient encryption.
- an example system to output a response to a query includes an array of processing cores arranged in a grid allowing each processing core to communicate directly to a neighboring processing core.
- the system includes an interconnection network coupled to each of the processing cores allowing communication between the processing cores.
- a first processing core of the array of processing cores is configured to receive an encrypted query.
- a second processing core of the array of processing cores is configured to input the encrypted query to a large language model.
- the second processing core is configured to execute the large language model having general weights in plaintext.
- the second processing core is configured to provide an encrypted output of the large language model.
- a further implementation of the example system includes a third processing core of the array of processing cores configured to decrypt the output of the large language model.
- the example system includes a third processing core configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query.
- the fine-tuned foundation layer includes a matrix of proprietary weights.
- the matrix of proprietary weights is in plaintext.
- the example system includes a fourth processing core of the array of processing cores configured to encrypt the matrix of proprietary weights.
- the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaptation (LoRA) algorithm.
- the large language model has a plurality of layers including the foundation layer.
- the other layers of the plurality of layers apply the general weights.
- the first and second cores include at least one of a Reduced Instruction Set Computing (RISC) processor core, a special purpose processing core, or a RISC processor core with a set of special purpose processing cores embedded within.
- RISC Reduced Instruction Set Computing
- the first and second cores are in an array of RISC processor cores interconnected with an array of special purpose processing cores.
- the encryption is performed via Fully Homomorphic Encryption (FHE).
- FHE Fully Homomorphic Encryption
- Another implementation is where the encrypted query is encrypted from a plaintext query input to an external device that transmits the encrypted query and evaluation keys for evaluation by the first processing core.
- the example system includes a third processing core configured to simultaneously process both plaintext and homomorphic ciphertext.
- the second processing core is configured to perform an encrypted summation.
- the example system includes a computational fabric having a plurality of individual integrated circuits. The array of processing cores is on at least one of the plurality of individual integrated circuits.
- the computational fabric allows communication between each of the plurality of individual integrated circuits.
- an example array of cores on an integrated circuit die includes an interconnection network coupled to each of the processing cores allowing communication between the processing cores.
- a first processing core or cores of the array of processing cores is configured to receive an encrypted query.
- a second processing core or cores of the array of processing cores is configured to input the encrypted query to a large language model.
- the second processing core or cores is configured to execute the large language model having general weights in plaintext.
- the second processing core or cores is configured to provide an encrypted output of the large language model.
- a further implementation of the example array of cores includes a third processing core or cores of the array of processing cores configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query.
- the fine-tuned foundation layer includes a matrix of proprietary weights.
- a fourth processing core or cores of the array of processing cores is configured to decrypt the output of the large language model.
- the matrix of proprietary weights is in plaintext.
- the example array of cores further includes a fifth set of processing cores of the array of processing cores configured to encrypt the matrix of proprietary weights.
- the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaptation (LoRA) algorithm.
- the large language model has a plurality of layers including the foundation layer. The other layers of the plurality of layers apply the general weights.
- Another disclosed example is a method of configuring an array of programmable cores including a plurality of programmable cores coupled via an interconnection network.
- a first processing core or cores of the array of processing cores is configured to receive an encrypted query.
- a second processing core or cores of the array of processing cores is configured to input the encrypted query to a large language model and execute the large language model having general weights in plaintext.
- the second processing core or cores of the array are configured to provide an encrypted output of the large language model.
- FIG. 1A is a diagram of a prior art LLM layer with different operations
- FIG. 1B is a simplified illustration of inputs and outputs to a general LLM
- FIG. 1C is a simplified illustration of inputs and outputs to a fine-tuned LLM
- FIG. 2A is a diagram of a chip having four dies each having multiple processing cores
- FIG. 2B is a simplified diagram of one of the dies on the chip shown in FIG. 2A;
- FIG. 3A is a block diagram of the array of cores in the die in FIG. 2B;
- FIG. 3B is a three-dimensional view of the array of cores in the die in FIG. 2B;
- FIG. 4 is an example reconfigurable arithmetic engine configuration of one of the cores in the core array in FIG. 2A;
- FIG. 5 is a diagram of four configurations of the array of cores in FIG. 2B as either a RISC-V processor or specialized ALU modules;
- FIG. 6A is an example of a Level 1 private LLM
- FIG. 6B is an example of a Level 2 private LLM
- FIG. 6C is an example of a Level 3 private LLM
- FIG. 7 is a table showing the cost and power comparisons for different LLMs by conventional hardware
- FIG. 8 is a diagram showing reduction of a matrix of fine-tuning weights using a low-rank adaptation algorithm by using two lower ranked matrices;
- FIG. 9 is a diagram of the data flow of the inference stage of a low-rank adaptation (LoRA) algorithm to reduce the size of the fine-tune proprietary weights for a private LLM; and
- FIG. 10 is a table showing the trillion multiplier equivalent (TME) values for conventional solutions for private LLM and the example core based architecture for private LLMs.
- TME trillion multiplier equivalent
- This present disclosure provides technical details of private LLM technology by addressing working principles, application scenarios, privacy and security requirements, and training and inference algorithms.
- the present disclosure also describes a silicon and chip implementation of private LLM technology that may be adapted to different products.
- the example private LLM technology may be classified into three application scenarios (levels).
- the present disclosure presents the corresponding algorithms related to training and inferences of the three application scenarios.
- the disclosed array of cores architecture converges Private LLM into an optimum silicon implementation that is superior to existing solutions (e.g., GPU based and ASIC based) in terms of power consumption, cost, and processing latency.
- the present disclosure is directed toward an effective solution for implementing a private large language model (LLM) (transformer neural network) system.
- LLM large-language-models
- the example method is directed toward protecting the LLM through secure encryption such as FHE, using plaintext for the publicly known weights, but protecting all other LLM components including any proprietary weights through FHE for fine tuning.
- FIG. 2A shows an example chip 100 that is subdivided into four identical dies 102, 104, 106, and 108.
- Each of the dies 102, 104, 106, and 108 include multiple processor cores, support circuits, serial interconnections and serial data control subsystems.
- the dies 102, 104, 106, and 108 may each have 4,096 processing cores as well as SERDES interconnection lanes to support different communication protocols.
- each of the dies 102, 104, 106, and 108 in this example are interconnected by Interlaken connections.
- the chip 100 is designed to allow one, two or all four of the dies 102, 104, 106, and 108 to be used.
- the pins on a package related to un-used dies are left unconnected in the package or the board.
- the dies are scalable as additional chips identical to the chip 100 may be implemented in a device or a circuit board.
- a single communication port such as an Ethernet port is provided for the chip 100.
- other ports may be provided, such as one or more ports for each die.
- FIG. 2B is a block diagram of one example of the die 102.
- the die 102 includes a fractal array 130 of processing cores.
- the processing cores in the fractal array 130 are interconnected with each other via a system interconnect 132.
- the entire array of cores 130 serves as the major processing engine of the die 102 and the chip 100.
- the system interconnection 132 is coupled to a series of memory input/output processors (MIOP) 134.
- the system interconnection 132 is coupled to a control status register (CSR) 136, a direct memory access (DMA) 138, an interrupt controller (IRQC) 140, an I2C bus controller 142, and two die to die interconnections 144.
- the two die to die interconnections 144 allow communication between the array of processing cores 130 of the die 102 and the two neighboring dies 104 and 108 in FIG. 2A.
- the chip includes a high bandwidth memory controller 146 coupled to a high bandwidth memory 148 that constitute an external memory sub-system.
- the chip also includes an Ethernet controller system 150, an Interlaken controller system 152, and a PCIe controller system 154 for external communications.
- each of the controller systems 150, 152, and 154 have a media access controller, a physical coding sublayer (PCS) and an input for data to and from the cores.
- PCS physical coding sublayer
- Each controller of the respective communication protocol systems 150, 152, and 154 interfaces with the cores to provide data in the respective communication protocol.
- the Interlaken controller system 152 has two Interlaken controllers and respective channels.
- a SERDES allocator 156 allows allocation of SERDES lines through quad M-PHY units 158 to the communication systems 150, 152 and 154.
- Each of the controllers of the communication systems 150, 152, and 154 may access the high bandwidth memory 148.
- the array 130 of directly interconnected cores is organized in tiles with 16 cores in each tile.
- the array 130 functions as a memory network on chip by having a high-bandwidth interconnect for routing data streams between the cores and the external DRAM through memory IO processors (MIOP) 134 and the high bandwidth memory controller 146.
- MIOP memory IO processors
- the array 130 functions as a link network on chip interconnection for supporting communication between distant cores including chip-to-chip communication through an “Array of Chips” Bridge module.
- the array 130 has an error reporter function that captures and filters fatal error messages from all components of array 130.
- FIG. 3A is a detailed diagram of the array of cores 130 in FIG. 2B.
- FIG. 3B is a three-dimensional image of the array of cores 130 in FIG. 2B.
- the array of cores 130 is organized into four-core clusters such as the clusters 200, 210, 220, and 230 shown in FIG. 3A.
- the cluster 200 includes cores 202a, 202b, 202c, and 202d.
- Each of the four cores in each cluster 200 such as cores 202a, 202b, 202c, and 202d are coupled together by a router 204.
- FIG. 3B shows other clusters 210, 220, and 230 with corresponding cores 212a-212d, 222a-222d and 232a-232d and corresponding routers 214, 224, and 234.
- each of the cores 202a, 202b, 202c, and 202d has up to four sets of three interconnections [L, A, R].
- a core in the center of the array such as the core 202d includes four sets of interconnections 240, 242, 244, and 246 each connected to one of four neighboring cores.
- core 202b is connected to the core 202d via the interconnections 240
- core 202c is connected to the core 202d via the interconnections 242
- core 212b is connected to the core 202d via the interconnections 244
- core 202c is connected to the core 202d via the interconnectors 246.
- a separate connector 248 is coupled to the wire router 204 of the cluster 200.
- each core in the middle of the array has four sets of interconnections, while border cores such as the core 202c only have three sets of interconnections 250, 252, and 246 that are connected to respective cores 202a, 202d, and 212a.
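- A minimal sketch of this neighbor relationship is shown below; the 64 × 64 grid size (matching the 4,096 cores per die noted above) and the (row, column) indexing are illustrative assumptions:

```python
def neighbors(row, col, rows=64, cols=64):
    """Return the (row, col) positions of the directly connected neighboring cores."""
    candidates = [(row - 1, col), (row + 1, col),   # north and south neighbors
                  (row, col - 1), (row, col + 1)]   # west and east neighbors
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]

print(len(neighbors(10, 10)))  # 4 -> interior core has four sets of interconnections
print(len(neighbors(0, 10)))   # 3 -> border core has three
print(len(neighbors(0, 0)))    # 2 -> corner core has two
```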
- FIG. 4 shows a block diagram of an example processing core 400 that includes a reconfigurable arithmetic engine (RAE) 410.
- the RAE 410 may be configured and reconfigured to perform relevant mathematical routines such as matrix multiplications, Fast Fourier Transforms (FFT), Inverse FFTs (IFFT), point wise multiplication, Softmax and other related nonlinear functions required in an LLM.
- the RAE 410 includes input reorder queues, a multiplier shifter-combiner network, an accumulator and logic circuits.
- the RAE 410 operates in several modes, such as operating as an ALU, and includes a number of floating point and integer arithmetic modes, logical manipulation modes (Boolean logic and shift/rotate), conditional operations, and format conversion.
- the RAE 410 includes three inputs 412, 414, and 416 and three outputs 422, 424, and 426.
- the RAE 410 receives the output data from a program executed by another RAE 430 and output data from another program executed by another RAE 432.
- An aggregator (AGG) 434 provides an output of aggregated data from different sources to the RAE 410.
- a memory read output 436 and a memory write output 438 also provide data to the RAE 410.
- the memory outputs 436 and 438 provide access to a memory such as an SRAM that stores operand data, and optionally may also store configurations or other instructions for the RAE 410.
- Each of the output data of the RAE 430, RAE 432, aggregator 434, memory read output 436 and the memory write output 438 are provided as inputs to three multiplexers 442, 444, and 446.
- the outputs of the respective multiplexers 442, 444, and 446 are coupled to the respective inputs 412, 414, and 416 of the RAE 410.
- a set of cores may be configured as a full RISC-V processor with associated SRAM able to execute traditional control flow programs as a function representing the computation within a dataflow node.
- RISC-V for Legacy code is supported by configuring multiple cores under software control. This may be used to produce software GPUs or other types of cores from the multiple cores.
- the processing cores, such as the FracTLcores® offered by Cornami, are an efficient set of transistors for streaming data-driven workloads, created from a set of RAE cores together with memory and a dynamic programming scheduler such as the TruStream programming scheduler offered by Cornami.
- the FracTLcores® can scale up to 64,000,000 cores across chips and systems at near linear scale. Combining the aspects of both data flow and reconfigurable computing to stream data, this architecture with highly functional computational elements can dynamically scale over many chips.
- the example architecture enables developers to take full advantage of both parallelism and pipelining to minimize latency and maximize overall application performance and throughput.
- the use of the architecture of processing cores results in reduction in processing cost.
- the cores may employ a data-flow programming model resulting in a 5x reduction in processing cost.
- a data-defining function computation for the cores may result in a 6x reduction in processing cost.
- a data Read/Write with a Tensor pattern applied to the cores may result in a 6x reduction in processing cost.
- FIG. 5 is a diagram of four configurations 510, 520, 530, and 540 of the array of cores in FIG. 2B as either a RISC-V processor or a specialized ALU internal module.
- the configurations 510, 520, 530, and 540 can dynamically switch from one type to the other by reconfiguring some or all of the computational cores in the configurations.
- the first configuration 510 is a set of cores configured as a full RISC processor with associated SRAM able to execute traditional Control Flow programs as a function representing the computation within a dataflow node.
- the RISC processor includes sixteen separate cores 512.
- Another configuration 520 is sixteen independently reconfigurable and programmable ALUs, each being one of the cores 522 (for example, FracTLcores® available from Cornami). Each of the cores 522 has associated SRAM supporting multiple simultaneous integer and floating point computations of up to 128 bits.
- the configuration 520 thus is a set of cores that are configured as individual FracTLcores®.
- the configuration 530 includes one or more RISC cores 532 that are a set of sixteen cores in this example.
- the RISC core 532 can have additional individual or multiple cores 534 incorporated within it to accelerate specific RISC functions. Alternatively, the additional cores 534 may be designated for data path/arithmetic acceleration, enhancing ALU performance.
- sixteen cores are configured to become the RISC-V.
- Optional additional cores may be added to the configuration to provide hardware acceleration to math operations performed by the RISC.
- a normal RISC processor does not have hardware to perform a cosine function.
- an additional core may be added and configured to perform a hardware cosine operation. This enhances the ISA instruction set of the RISC processor by adding the hardware accelerated cosine function that may be accessed by the RISC processor.
- the configuration 540 has a set of cores that are configured into two individual groupings of cores configured as RISC processors 542 and cores that are configured as ALUs (e.g., FracTLcores®) 544.
- a client may submit an encrypted query to the LLM architecture configured on an array of cores such as that in FIG. 2B.
- the results of the LLM will be output in an encrypted output that may be decrypted.
- private LLMs can be grouped into three levels by taking different data format (plaintext or ciphertext) for input query, weights and output responses of a LLM.
- the superscript “C” denotes the ciphertext format for the below discussion.
- FIG. 6A shows the processing steps of a Level 1 Private LLM process 600.
- the process 600 includes a client input stage 612, a server executing an LLM 614 and a client output stage 616.
- a mathematical representation 620 and process 600 show the following.
- the Client first encrypts their data X (Plaintext) to the ciphertext X^C by using algorithms that support Fully Homomorphic Encryption (FHE) such as CKKS or TFHE, where "Sk" and "Pk" denote the secret keys and public keys owned by the Client (612).
- FHE Fully Homomorphic Encryption
- the Server 614 performs the LLM operation with X^C as inputs of the general LLM as shown in the mathematical representation 620. More specifically, the example array of cores based chip in FIGs. 2A-2B first receives the ciphertext X^C from the client 612 via the data link interface and also reads the plaintext weights (W) of the LLM from memory, which could be SRAM, cache, or DRAM. The chip 100 performs all the matrix multiplications and nonlinear functions as shown in the mathematical representation 620. The server 614 then sends the generated output Y^C to the client via the data-link interface.
- W plaintext weights
- the generated output equals F(W, X^C) and also should be equal to (F(W, X))^C, that is, the ciphertext corresponding to the plaintext F(W, X).
- the client 612 decrypts the received ciphertext Y^C to finally get the desired result, which equals F(W, X).
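- The Level 1 flow may be summarized by the sketch below, written around a generic `fhe` object whose methods (keygen, encrypt, eval_llm, decrypt) are hypothetical placeholders for CKKS/TFHE library calls, not a real API:

```python
def level1_private_llm(fhe, W_plaintext, query_x):
    # Client side: encrypt the query with the client's own keypair (Sk, Pk)
    sk, pk = fhe.keygen()
    x_c = fhe.encrypt(pk, query_x)

    # Server side: evaluate the general LLM on the ciphertext input while the
    # general weights W stay in plaintext; the output Y^C remains encrypted
    y_c = fhe.eval_llm(W_plaintext, x_c)

    # Client side: only the client's secret key recovers Y = F(W, X)
    return fhe.decrypt(sk, y_c)
```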
- FIG. 6B shows the processing steps of a Level 2 Private LLM 630.
- the process 630 includes a client input stage 632, a server executing an LLM 634 and a client output stage 636.
- a mathematical representation 640 and process 630 show the following.
- the processing steps of the Level 2 Private LLM in FIG. 6B include the following.
- the Client first encrypts their data X (Plaintext) to the ciphertext X^C with FHE algorithms using the keypair "Sk" and "Pk".
- the Client sends their encrypted data X^C to the server 634.
- the operator of the server 634 owns the general LLM parameters and fine-tuning parameters, which are both still in plaintext format.
- the server 634 performs the fine-tuning LLM operation with X^C being inputs of the Fine-Tuned LLM as shown in FIG. 6B.
- the chip 100 in FIGs. 2A-2B first receives the ciphertext X^C from the Client via the data link interface and also reads both the plaintext weights W of the general LLM and the plaintext weights ΔW of the fine-tuned LLM from a memory.
- the chip 100 then performs all the matrix multiplications and nonlinear functions as shown in the second part of FIG. 6B.
- the Server sends to the Client the generated output Y^C via a data-link interface.
- the generated output equals F(W, X^C) + F(ΔW, X^C), and also should equal (F(W, X) + F(ΔW, X))^C, that is, the ciphertexts corresponding to the desired output.
- the Client uses the secret key "Sk" and decrypts the received ciphertext Y^C to finally get the desired result Y.
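- The server-side step of the Level 2 flow differs from Level 1 only in that the plaintext fine-tuning weights ΔW are read alongside W and the two branches are summed homomorphically; as before, the `fhe` methods are hypothetical placeholders rather than a real API:

```python
def level2_server_step(fhe, W_plain, dW_plain, x_c):
    y_general = fhe.eval_llm(W_plain, x_c)      # F(W, X^C)
    y_finetune = fhe.eval_llm(dW_plain, x_c)    # F(dW, X^C)
    return fhe.add(y_general, y_finetune)       # Y^C, later decrypted by the client with Sk
```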
- FIG. 6C shows the processing steps of a Level 3 Private LLM 650.
- the process 650 includes a client input stage 652, a server executing an LLM 654 and a client output stage 656.
- a mathematical representation 660 and process 650 show the following processing steps for the Level 3 Private LLM.
- the Client first encrypts their data X (Plaintext) to the ciphertext X^C with FHE algorithms using the keypair "Sk" and "Pk."
- the Client sends their encrypted data X^C to the server 654.
- the server operator owns the general LLM parameters in the format of plaintexts and the fine-tuning parameters in the format of ciphertexts encrypted by the server itself.
- the ciphertext version (ΔW)^C of the fine-tune weights ΔW can be pre-determined and stored in memory by using the keypair "SSk" and "SPk" as shown in FIG. 6C.
- the Server performs the fine-tuning private LLM operation with X^C being inputs as shown in FIG. 6C.
- the chip 100 first receives the ciphertext X^C from the Client via the data link interface.
- the chip 100 also reads the plaintext weights W of the general LLM and the ciphertext weights (ΔW)^C of the fine-tuned LLM from memory. The chip 100 then performs all the matrix multiplications and nonlinear functions as shown in the second part of FIG. 6C.
- the Server 654 sends the generated output Y^C and the public key "SPk" to the Client via the data-link interface.
- the generated output equals F(W, X^C) + F((ΔW)^C, X^C), and also should equal (F(W, X) + F(ΔW, X))^C, that is, the ciphertexts corresponding to the desired output.
- the Client decrypts the received ciphertext Y^C to finally get the desired result Y by using the secret key "Sk" and public key "SPk" from the server.
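- The corresponding Level 3 server-side step is sketched below; the fine-tuning weights are now themselves ciphertexts under the server keypair (SSk, SPk), so the fine-tune branch is a ciphertext-by-ciphertext evaluation (the `fhe` methods remain hypothetical placeholders):

```python
def level3_server_step(fhe, W_plain, dW_cipher, x_c):
    y_general = fhe.eval_llm(W_plain, x_c)          # plaintext W applied to ciphertext X^C
    y_finetune = fhe.eval_llm_ct(dW_cipher, x_c)    # ciphertext (dW)^C applied to ciphertext X^C
    return fhe.add(y_general, y_finetune)           # Y^C, decrypted by the client with Sk and SPk
```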
- FIG. 7 is a table 700 that presents a quantitative comparison of two kinds of representative implementation platforms (ASIC and GPU) in terms of the total cost and power consumption in order to generate the desired responses for a complete sequence of tokens using the parameter settings in GPT-3.
- the first column in the table 700 in FIG. 7 is the listing of five LLMs in rows 710, 712, 714, 716, and 718.
- the second and third columns of the table 700 in FIG. 7 indicate the number of trillion multiplier equivalents (TMEs) needed by an ASIC and a GPU, respectively, in order to generate the desired output for a complete sequence of tokens in GPT-3.
- TME trillion multiplier equivalent
- the fourth column of the table 700 shows the ratio of the five LLMs in the rows 710, 712, 714, 716, and 718 with the General LLM in the row 710 in terms of the number of GPUs.
- the example reconfigurable core architecture allows practical implementation of the different level private LLMs.
- the array of cores architecture in FIG. 2B and FIG. 3B combines aspects of dataflow and reconfigurable computing to stream data through a computational fabric architecture with highly functional computational elements that can dynamically scale over many chips.
- the computational fabric is represented by one or many custom ASIC chip(s) residing in one or multiple PCIe cards within one or multiple host servers. Each host server has an x86 processor(s) running Linux as an interface to the computational fabric.
- the custom ASICs have several key functional components that are linked by the following three types of core communication mechanisms.
- the first communication mechanism is the adjacent core-to-core in the array 130 which is one core communicating with a physically adjacent core as laid out on the silicon substrate.
- Adjacent core communication is the most efficient inter-core communication mechanism and takes place via the North, South, East, or West core interfaces.
- the second communication mechanism is a Network-On-Chip (NOC), which generalizes the core-to-core communication interface to cores that are not side-by-side on the same chip or that reside on different chips as shown in FIG. 3B.
- the third communication mechanism is a PCIe link for intra-system communications between the host and PCIe boards.
- This reconfigurable core array computing architecture allows different functions to be defined by dynamically changing the topological linkages of processing cores within a computational fabric to achieve superior silicon utilization in terms of application performance, throughput, power consumption, and processing latency.
- the computational fabric significantly reduces the dependence on memory to store intermediate computational results and exceeds the flexibility and programmability of an FPGA or DSP or GPU while still providing near ASIC level solution performance.
- the example reconfigurable core architecture in FIG. 2B and FIG. 3B is a very powerful hardware computing platform to perform extensive matrix multiplications required in executing General LLM and Private LLM with near zero programming complexity.
- the computational complexity involved in a Level 3 Private-LLM is still too high even for the example core based architecture as shown in FIG. 7. This is because all the computations related to fine-tuning weights in the Level-3 Private LLM need to be operated in ciphertext format.
- the reduction of the computational complexity for a Level 3 Private LLM can be accomplished by using a low-rank adaptation (LoRA) algorithm.
- LoRA low-rank adaptation
- the use of the LoRA concept in a Level 3 private LLM can reduce the size of the fine-tuning weights ΔW.
- LoRA Low-Rank Adaptation of Large Language Models
- a step-by-step description of how LoRA works in terms of training and inference is as follows.
- a large pre-trained language model such as GPT-3 is used as a starting point. These models have a massive number of parameters and require substantial computational resources for their training.
- Target layers are selected within the model for low-rank adaptation. These layers are typically chosen based on their computational intensity and importance to the target task.
- a technique called “low-rank factorization” is applied to these layers, which simplifies the model's calculations by using fewer numbers and thus reduces model complexity.
- the simplified model is fine-tuned by training the model on a specific task using these reduced parameters. This fine-tuning process tailors the model to the task.
- the adapted model may be used for specific inference applications as it is more efficient and memory-friendly for the specific application.
- the adapted model is thus suitable for real-world use in applications like chatbots, translation, or text generation.
- the general process starts by checking the task performance by the simplified model. This ensures that the simplified model still performs well for the target task. In many cases, the simplified model may achieve a good balance between efficiency and accuracy.
- the LoRA process uses several parameters such as a matrix scaling factor α and the matrix rank factor r. These determine how large the LoRA matrices will be in terms of their dimensions and values.
- An example rule of thumb of α = 2r has been shown to be effective in some experiments. There are sometimes tradeoffs between memory usage and training time.
- FIG. 8 is a matrix diagram 800 of an example LoRA.
- a weight matrix of size d × d can be decomposed into the multiplication of a low-rank matrix A (d × r) 812 and another low-rank matrix B (r × d) 814. Although both W and ΔW of FIG. 6C have the same dimensions and rank, ΔW (d × d) in an example solution matrix A × B 816 can be generated by two much lower-rank matrices 812 and 814 (A and B) as shown in FIG. 8, where the rank r could be as small as 10, which is 1000 times less than d.
- the fine-tuning weights may be obtained in the ciphertext domain and may be used to perform the desired LLM processing on ciphertexts. Since the dimensions of the matrices 812 and 814 (A and B) are much smaller, the corresponding computational complexity for ciphertext would be as low as the one for plaintexts, which means a reduction of 100 times can be achieved by the example LoRA based solution.
- the small size matrices A and B may first be encrypted and then their encrypted versions may be used in place of (ΔW)^C to further perform the computations in the ciphertext domain as required in all the processing steps of the Level 3 Private LLM shown in FIG. 6C. Since the dimensions of matrices A and B are much smaller, the corresponding computational complexity for ciphertext would be as low as the one for plaintexts, which means a reduction of 100 times can be achieved by the LoRA based algorithm.
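- A minimal numpy sketch of the decomposition is shown below; the dimension d = 4096, rank r = 8, the α = 2r scaling, and the zero initialization of B are illustrative assumptions rather than values taken from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 4096, 8
alpha = 2 * r                           # example scaling rule of thumb (alpha = 2r)

A = 0.01 * rng.standard_normal((d, r))  # trainable low-rank factor A (d x r)
B = np.zeros((r, d))                    # factor B (r x d); zero init keeps dW = 0 at start
dW = (alpha / r) * (A @ B)              # effective d x d fine-tuning update

print(dW.shape)                         # (4096, 4096) -- never needs to be stored directly
print((d * d) / (A.size + B.size))      # ~256x fewer elements to encrypt and store
```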
- FIG. 9 further illustrates the data flow of the inference stage of a LoRA based Level-3 Private LLM executed by a server 910 that communicates with a client 912.
- the client 912 generates a user prompt 920 that receives a query represented by data X.
- the query, Q (data X) is encoded to Q’ (922).
- using the keypair "Sk" and "Pk", the client 912 first encrypts their data X (Plaintext) to the ciphertext X^C with FHE algorithms (924).
- the client 912 then sends their encrypted data X^C as a ciphertext input 932 to the server 910, which holds an LLM 930 and general LLM parameters in the format of plaintext 934 and the fine-tuning weight parameters ΔW 936 in the format of ciphertext encrypted by the server itself.
- the ciphertext fine-tuning weights may be obtained in two ways: the former, (ΔW)^C = (A × B)^C, first performs the multiplications and then the encryption.
- the latter, A^C × B^C, first performs the encryption and then the multiplications.
- the processing unit, which may be the chip 100 in FIG. 2A, in the server 910 performs the LoRA based fine-tuned LLM operation with X^C being inputs as shown in FIG. 9.
- the server 910 first receives the ciphertext X^C from the client input (932) via the data link interface and also reads the plaintext weights W (934) of the general LLM and the ciphertext weights of the fine-tuned LLM (936) from memory.
- the ciphertext weights are broken down into two matrices 938 and 940 in the process described in FIG. 8.
- the chip 100 performs all the matrix multiplications and nonlinear functions as shown in FIG. 6C and FIG. 9.
- the resulting output from both the plaintext weights and the ciphertext weights are added to produce a generated output (942).
- the Server sends the generated output Y c and the public key “SPk” to the Client via the data-link interface.
- the generated output equals F(W, X^C) + F((A × B)^C, X^C) or F(W, X^C) + F(A^C × B^C, X^C), and also should equal (F(W, X) + F(A × B, X))^C, that is, the ciphertexts corresponding to the desired output.
- the client 912 receives the ciphertext and the public key (950).
- the client 912 decrypts the received ciphertext Y^C to finally get the desired result (952) Y by using the secret key "Sk" and the public key "SPk" from the server 910.
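- One consequence of the low-rank form, easy to verify in plaintext, is that the server never needs to materialize the d × d product of the factors: applying A and then B to the activations gives the same result as applying A × B (toy sizes and random data below are illustrative assumptions only):

```python
import numpy as np

rng = np.random.default_rng(3)
seq, d, r = 8, 512, 4
X = rng.standard_normal((seq, d))                  # activations (ciphertext in Level 3)
W = rng.standard_normal((d, d))                    # general weights (plaintext)
A, B = rng.standard_normal((d, r)), rng.standard_normal((r, d))

y_materialized = X @ W + X @ (A @ B)               # conceptual form F(W, X) + F(A x B, X)
y_low_rank = X @ W + (X @ A) @ B                   # low-rank evaluation order
assert np.allclose(y_materialized, y_low_rank)     # identical results, far fewer multiplies
```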
- FIG. 10 shows a table 1000 that is a quantitative comparison of a known GPU platform to the example array of cores platform in terms of the total cost and power consumption in order to generate the desired responses for a complete sequence of tokens using the parameter settings for GPT-3.
- the example array of cores platform costs only about one tenth of what the GPU platform costs for performing the same General LLM inference task of FIG. 1B.
- the GPU platform costs 179 times but the example platform costs only 3.03 times.
- the GPU platform costs 190 times but the example platform costs only 3.19 times.
- the GPU platform costs 19998 times but the example platform costs only 330 times.
- the GPU platform costs 196 times but the example platform costs only 3.51 times, which suggests that the GPU platform is still not practical.
- the example array of cores platform can serve as a feasible and practical solution for the deployment of all these three levels of Private LLMs into real-world applications.
Abstract
A system for outputting a response to a query through processing by a large language model is disclosed. The system comprises an array of processing cores arranged in a grid, allowing each processing core to communicate directly with a neighboring processing core. An interconnection network is coupled to each of the processing cores, allowing communication between the processing cores. A first processing core of the array of processing cores is configured to receive an encrypted query. A second processing core of the array of processing cores is configured to input the encrypted query into a large language model; to execute the large language model, the general weights of which are in plaintext; and to provide an encrypted output of the large language model.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363597561P | 2023-11-09 | 2023-11-09 | |
| US63/597,561 | 2023-11-09 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025101998A1 (fr) | 2025-05-15 |
Family
ID=95696675
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/055260 (WO2025101998A1, pending) | Fractal core architecture system for implementing private and efficient large language models | 2023-11-09 | 2024-11-08 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025101998A1 (fr) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160350648A1 (en) * | 2014-11-07 | 2016-12-01 | Microsoft Technology Licensing, Llc. | Neural networks for encrypted data |
| US20200349435A1 (en) * | 2016-06-22 | 2020-11-05 | Massachusetts Institute Of Technology | Secure Training of Multi-Party Deep Neural Network |
| US20190294805A1 (en) * | 2018-03-22 | 2019-09-26 | Via Science, Inc. | Neural-network training using secure data processing |
| US20200042856A1 (en) * | 2018-07-31 | 2020-02-06 | International Business Machines Corporation | Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit |
| US20220383126A1 (en) * | 2021-05-19 | 2022-12-01 | Microsoft Technology Licensing, Llc | Low-Rank Adaptation of Neural Network Models |
| US20220414223A1 (en) * | 2021-06-29 | 2022-12-29 | EMC IP Holding Company LLC | Training data protection for artificial intelligence model in partitioned execution environment |
Non-Patent Citations (1)
| Title |
|---|
| XUANQI LIU; ZHUOTAO LIU: "LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly Transformers", arXiv.org, Cornell University Library, 28 May 2023 (2023-05-28), XP091522789 * |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| Samardzic et al. | Craterlake: a hardware accelerator for efficient unbounded computation on encrypted data | |
| Samardzic et al. | F1: A fast and programmable accelerator for fully homomorphic encryption | |
| Turan et al. | HEAWS: An accelerator for homomorphic encryption on the Amazon AWS FPGA | |
| CN114816334B (zh) | Acceleration unit, related apparatus and method | |
| Nejatollahi et al. | CryptoPIM: In-memory acceleration for lattice-based cryptographic hardware | |
| Roy et al. | FPGA-based high-performance parallel architecture for homomorphic computing on encrypted data | |
| Zhao et al. | A high-performance domain-specific processor with matrix extension of RISC-V for module-LWE applications | |
| Feldmann et al. | F1: A fast and programmable accelerator for fully homomorphic encryption (extended version) | |
| Wold et al. | Pipeline and parallel-pipeline FFT processors for VLSI implementations | |
| US7325123B2 (en) | Hierarchical interconnect for configuring separate interconnects for each group of fixed and diverse computational elements | |
| Zhang et al. | Sok: Fully homomorphic encryption accelerators | |
| CN113468099B (zh) | Reconfigurable computing device, processor and method | |
| Yang et al. | Phantom: A cuda-accelerated word-wise homomorphic encryption library | |
| Huang et al. | Garbled circuits in the cloud using FPGA enabled nodes | |
| CN118525320A (zh) | Cryptographic processor for fully homomorphic encryption (FHE) applications | |
| KR102616119B1 (ko) | Hardware architecture for torus fully homomorphic encryption (TFHE) acceleration using streaming cores and a folded fully-pipelined FFT | |
| Wolfe et al. | Secret sharing MPC on FPGAs in the datacenter | |
| Hao et al. | FastSecNet: An efficient cryptographic framework for private neural network inference | |
| Haghi et al. | A reconfigurable compute-in-the-network fpga assistant for high-level collective support with distributed matrix multiply case study | |
| Zhou et al. | UFC: A unified accelerator for fully homomorphic encryption | |
| Kim et al. | Cifher: A chiplet-based fhe accelerator with a resizable structure | |
| Roy et al. | Accelerator for computing on encrypted data | |
| Liao et al. | Turbohe: Accelerating fully homomorphic encryption using fpga clusters | |
| CN118171748A (zh) | Quantum circuit construction method and related apparatus | |
| Yang et al. | Bandwidth efficient homomorphic encrypted matrix vector multiplication accelerator on fpga |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24889739; Country of ref document: EP; Kind code of ref document: A1 |