WO2025101998A1 - Fractal core architecture system for implementing efficient private large language models - Google Patents
Fractal core architecture system for implementing efficient private large language models
- Publication number
- WO2025101998A1 (PCT/US2024/055260)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cores
- array
- processing
- core
- weights
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/008—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5012—Processor sets
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
Definitions
- the present disclosure relates generally to security for large language models. More particularly, aspects of this disclosure relate to a core based architecture for executing private large language models and efficient encryption to protect the data and models.
- GenAI Generative Artificial Intelligence
- an LLM may be considered as a nonlinear mapping from an input X (a query) to an output Y (a response) by performing the following major processing blocks shown in FIG. 1A.
- the major processing blocks thus include Embedding and Encoding 10, Multiple Head Attentions 12, Feed Forward Perceptron 14, Layer Normalization 16 and Softmax functions 18.
- the Embedding and Encoding 10 is a pre-processing unit which mainly performs the matrix multiplications between the input matrix and the corresponding weight matrices.
- Multiple Head Attentions 12 perform multiple matrix multiplications according to three weight matrices called a Query Matrix 20, a Key Matrix 22, and a Value Matrix 24, respectively to jointly attend to information from different representation subspaces at different positions.
- the Feed Forward Perceptron layers 14 constitute a conventional feedforward neural network having at least one hidden layer.
- the Layer Normalization 16 simply normalizes each input.
- the Softmax function 18 is an activation function that scales numbers/logits into probabilities.
- the output of a LLM can uniquely be generated by all the pre-determined weight matrices and its inputs. These weight matrices can be obtained during an off-line training stage.
- LLM simply performs all the above matrix multiplications and corresponding nonlinear functions such as layer normalization and Softmax operations.
- Different LLMs have different parameter sizes (the total element number of the weight matrices), which mainly depend on the numbers of attention heads and number of LLM layers. For example, GPT-3 has 96 heads and 96 layers and hence there are about 175 billion parameters (elements of weight matrices) in total.
- FIG. 1B shows a simplified diagram of a general LLM 40 between a client with X input, the LLM executed in a server, and an output Y to the client.
- the General LLM is actually the LLM that is most commonly used.
- W denotes the entire weight matrix
- F (W, X) represents the nonlinear mapping that the LLM is to perform.
- the processing steps include: 1) a Client sending their data X to the party operating the server that owns LLM parameters; 2) the Server performing the LLM operation with X as inputs of the LLM as shown in FIG. 1A; and 3) the Server sending the Client the generated output Y, which equals F(W, X) as shown in FIG. 1B.
- a Fine-Tuned LLM involves taking a pre-existing general model that has been trained on a large dataset, such as a language model like GPT-3, and refining the general model for a specific task or domain. During fine-tuning, the model is further trained on a smaller, domain-specific dataset. This process adapts the parameters of the model to the nuances of the target task, improving its performance and making it more capable in handling specific tasks.
- Fine-tuned LLMs are a cost-effective and efficient way to leverage the knowledge learned by a pre-trained general model while tailoring the general model to specific applications. This reduces the need for extensive training from scratch. Fine-tuned LLMs allow for rapid development of domain specific Al solutions with high accuracy and applicability.
- FIG. 1C is a diagram of a fine-tuned LLM process 50.
- FIG. 1C is a representation of performance of a Fine-Tuned LLM task where ΔW is the difference between the new modified weight matrix (from specific datasets) and the original weight matrix W (from larger datasets).
- the processing steps of the fine-tuned LLM 50 in FIG. 1C include: 1) a Client sending their data, X, to a party operating the server with LLM parameters and fine-tuned parameters; 2) the Server performing the LLM operation with X as inputs of the Fine-Tuned LLM; and 3) the Server sending the generated output Y to the Client.
- the generated output, Y, may be represented by F(W, X) + F(ΔW, X) or F(W + ΔW, X).
- a client of a party that owns such LLMs is providing queries to the LLM owner and receives responses to these queries from the LLM owner.
- the queries from the client may contain intellectual property and the answers to these queries may also contain new and novel intellectual property that the owner of the LLM now has access to.
- the information that is used for fine-tuning the LLM may be highly sensitive, proprietary, classified, or protected under privacy laws such as European GDPR rules or US HIPAA rules. There is a need to be able to perform training of LLMs without exposing the training information during the fine-tuning process. Such information may be protected by encryption.
- a second framework is OpenFHE, which supports multiple schemes including Brakerski-Gentry-Vaikuntanathan (BGV), Brakerski/Fan-Vercauteren (BFV), Cheon-Kim-Kim-Song (CKKS), TFHE, and FHEW.
- An example FHE operation is a Partial Result Sharing Approach for two parties, which is functionally equivalent to a (2,2) threshold approach.
- the first party generates their own FHE public key (PK1) and FHE private key (SK1) for their database.
- the second party generates their own FHE public key (PK2) and FHE private key (SK2) for their database.
- the public keys, PK1 and PK2, are shared between the parties, while the private keys, SK1 and SK2, are kept secret.
- a Joint Key is computed using the public keys PK1 and PK2.
- This key is neither a secret key nor a public key in the traditional sense but serves as a unified platform facilitating joint computations on encrypted data.
- Evaluation keys, associated with the joint key, are generated. These evaluation keys are crucial for operations like “addition” and “multiplication” in the homomorphic encryption domain.
- the first party may encrypt its data (Data1) with PK1, resulting in Ciphertextdb1.
- the second party may encrypt its data (Data2) with PK2, resulting in Ciphertextdb2.
- a process often referred to as key switching is applied. Key switching converts Ciphertextdb1 and Ciphertextdb2 to be compatible with the joint key (JK), allowing for homomorphic operations without revealing the actual data.
- JK joint key
- the first party can no longer decrypt Ciphertextdb1 using only SK1
- the second party can no longer decrypt Ciphertextdb2 using only SK2.
- both parties must collaborate to decrypt the respective ciphertexts.
- Computations (such as addition, multiplication, etc.) may be performed on the encrypted data. These operations are executed under the joint key, ensuring consistency and validity. After computations, the process for joint decryption begins, ensuring that neither of the private keys, SK1 nor SK2, is exposed to the other party.
- the Concrete library is an open-source library developed in Rust that builds on the state-of-the-art TFHE cryptosystem.
- the Concrete library provides a user-friendly interface making FHE easy to integrate.
- the Concrete library deals with inputs of arbitrary format and comes with an extensive set of operations for manipulating ciphertexts, including a programmable bootstrapping process.
- Learning With Errors (LWE) is a quantum robust method of cryptography applicable to FHE.
- the LWE problem is conjectured to be hard to solve, and thus to be useful in cryptography.
- FHE is based on a quantum secure scheme for the LWE (learning with errors) problem.
- the FHE allows computations such as Boolean operations, integer arithmetic operations, and floating-point arithmetic operations on ciphertext without decryption.
- sensitive data analysis computations may be performed on encrypted data without ever decrypting the data.
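- As a purely illustrative sketch of these ideas (a toy, insecure LWE-style bit encryption with symmetric keys, not any scheme named in this disclosure), the following Python example adds two ciphertexts without decrypting them; the sum of the ciphertexts decrypts to the XOR of the underlying bits:

```python
import numpy as np

# Toy LWE-based bit encryption (illustrative only; insecure parameters).
n, q = 16, 257                     # secret dimension and modulus
rng = np.random.default_rng(2)

def keygen():
    return rng.integers(0, q, size=n)               # secret key s

def encrypt(s, bit):
    a = rng.integers(0, q, size=n)                  # random mask vector
    e = int(rng.integers(-4, 5))                    # small noise
    b = (int(a @ s) + e + bit * (q // 2)) % q
    return a, b

def decrypt(s, ct):
    a, b = ct
    d = (b - int(a @ s)) % q
    return 1 if q // 4 < d < 3 * q // 4 else 0      # nearest message plateau

def add(ct1, ct2):
    # Homomorphic addition of ciphertexts = XOR of the encrypted bits.
    (a1, b1), (a2, b2) = ct1, ct2
    return (a1 + a2) % q, (b1 + b2) % q

s = keygen()
c0, c1 = encrypt(s, 0), encrypt(s, 1)
print(decrypt(s, add(c0, c1)))   # 1  (0 XOR 1), computed without decrypting c0 or c1
print(decrypt(s, add(c1, c1)))   # 0  (1 XOR 1)
```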
- LLMs can be grouped into three levels according to the data format (plaintext or ciphertext) used for the input query, the weights, and the output responses of an LLM.
- a Level 1 private LLM encrypts input queries and output responses but leaves general weights in plaintext.
- the Level 1 private LLM does not incorporate fine-tuning.
- a Level 2 private LLM encrypts input queries and output responses, but leaves fine-tuning and general weights in plaintext.
- a Level 3 private LLM encrypts input queries, output responses, and fine-tuning weights, but leaves general weights in plaintext. The additional encryption and the other computational operations required make each successive level of the private LLM more impractical for current hardware.
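- Restating the three level definitions above in a compact form (P for plaintext, C for ciphertext under FHE; the dictionary layout itself is only an illustrative summary):

```python
# Restatement of the three private LLM levels described above.
# "P" = plaintext, "C" = ciphertext (FHE); None = not used at that level.
PRIVATE_LLM_LEVELS = {
    "Level 1": {"query": "C", "general_weights": "P", "fine_tuning_weights": None, "response": "C"},
    "Level 2": {"query": "C", "general_weights": "P", "fine_tuning_weights": "P",  "response": "C"},
    "Level 3": {"query": "C", "general_weights": "P", "fine_tuning_weights": "C",  "response": "C"},
}
```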
- the present disclosure relates generally to security applications. More particularly, aspects of this disclosure relate to techniques to protect private large language models with efficient encryption.
- an example system to output a response to a query includes an array of processing cores arranged in a grid allowing each processing core to communicate directly to a neighboring processing core.
- the system includes an interconnection network coupled to each of the processing cores allowing communication between the processing cores.
- a first processing core of the array of processing cores is configured to receive an encrypted query.
- a second processing core of the array of processing cores is configured to input the encrypted query to a large language model.
- the second processing core is configured to execute the large language model having general weights in plaintext.
- the second processing core is configured to provide an encrypted output of the large language model.
- a further implementation of the example system includes a third processing core of the array of processing cores configured to decrypt the output of the large language model.
- the example system includes a third processing core configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query.
- the fine-tuned foundation layer includes a matrix of proprietary weights.
- the matrix of proprietary weights is in plaintext.
- the example system includes a fourth processing core of the array of processing cores configured to encrypt the matrix of proprietary weights.
- the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaptation (LoRA) algorithm.
- the large language model has a plurality of layers including the foundation layer.
- the other layers of the plurality of layers apply the general weights.
- the first and second cores include at least one of a Reduced Instruction Set Computing (RISC) processor core, a special purpose processing core; or a RISC processor core with a set of special purpose processing cores embedded within.
- RISC Reduced Instruction Set Computing
- the first and second cores are in an array of RISC processor cores interconnected with an array of special purpose processing cores.
- the encryption is performed via Fully Homomorphic Encryption (FHE).
- FHE Fully Homomorphic Encryption
- Another implementation is where the encrypted query is encrypted from a plaintext query input to an external device that transmits the encrypted query and evaluation keys for evaluation by the first processing core.
- the example system includes a third processing core configured to simultaneously process both plaintext and homomorphic ciphertext.
- the second processing core is configured to perform an encrypted summation.
- the example system includes a computational fabric having a plurality of individual integrated circuits. The array of processing cores is on at least one of the plurality of individual integrated circuits.
- the computational fabric allows communication between each of the plurality of individual integrated circuits.
- an example array of cores on an integrated circuit die includes an interconnection network coupled to each of the processing cores allowing communication between the processing cores.
- a first processing core or cores of the array of processing cores is configured to receive an encrypted query.
- a second processing core or cores of the array of processing cores is configured to input the encrypted query to a large language model.
- the second processing core or cores is configured to execute the large language model having general weights in plaintext.
- the second processing core or cores is configured to provide an encrypted output of the large language model.
- a further implementation of the example array of cores includes a third processing core or cores of the array of processing cores configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query.
- the fine-tuned foundation layer includes a matrix of proprietary weights.
- a fourth processing core or cores of the array of processing cores is configured to decrypt the output of the large language model.
- the matrix of proprietary weights is in plaintext.
- the example array of cores further includes a fifth set of processing cores of the array of processing cores configured to encrypt the matrix of proprietary weights.
- the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaptation (LoRA) algorithm.
- the large language model has a plurality of layers including the foundation layer. The other layers of the plurality of layers apply the general weights.
- Another disclosed example is a method of configuring an array of programmable cores including a plurality of programmable cores coupled via an interconnection network.
- a first processing core or cores of the array of processing cores is configured to receive an encrypted query.
- a second processing core or cores of the array of processing cores is configured to input the encrypted query to a large language model and execute the large language model having general weights in plaintext.
- the second processing core or cores of the array are configured to provide an encrypted output of the large language model.
- FIG. 1 A is a diagram of a prior art LLM layer with different operations
- FIG. 1B is a simplified illustration of inputs and outputs to a general LLM
- FIG. 1C is a simplified illustration of inputs and outputs to a fine-tuned LLM
- FIG. 2A is a diagram of a chip having four dies each having multiple processing cores
- FIG. 2B is a simplified diagram of one of the dies on the chip shown in FIG. 2A;
- FIG. 3A is a block diagram of the array of cores in the die in FIG. 2B;
- FIG. 3B is a three-dimensional view of the array of cores in the die in FIG. 2B;
- FIG. 4 is an example reconfigurable arithmetic engine configuration of one of the cores in the core array in FIG. 2A;
- FIG. 6A is an example of a Level 1 private LLM
- FIG. 6B is an example of a Level 2 private LLM
- FIG. 6C is an example of a Level 3 private LLM
- FIG. 7 is a table showing the cost and power comparisons for different LLMs by conventional hardware
- FIG. 8 is a diagram showing reduction of a matrix of fine-tuning weights using a low-rank adaptation algorithm by using two lower-ranked matrices;
- FIG. 9 is a diagram of the data flow of the inference stage of a low-rank adaptation (LoRA) algorithm to reduce the size of the fine-tuned proprietary weights for a private LLM; and
- FIG. 10 is a table showing the trillion multiplier equivalent (TME) values for conventional solutions for private LLM and the example core based architecture for private LLMs.
- This present disclosure provides technical details of private LLM technology by addressing working principles, application scenarios, privacy and security requirements, and training and inference algorithms.
- the present disclosure also describes a silicon and chip implementation of private LLM technology that may be adapted to different products.
- the example private LLM technology may be classified into three application scenarios (levels).
- the present disclosure presents the corresponding algorithms related to training and inferences of the three application scenarios.
- the disclosed array of cores architecture converges Private LLM into an optimum silicon implementation that is superior to existing solutions (e.g., GPU based and ASIC based) in terms of power consumption, cost, and processing latency.
- the present disclosure is directed toward an effective solution for implementing a private large language model (LLM) (transformer neural network) system.
- LLM large-language-models
- the example method is directed toward protecting the LLM through secure encryption such as FHE, using plaintext for the publicly known weights, but protecting all other LLM components including any proprietary weights through FHE for fine tuning.
- FIG. 2A shows an example chip 100 that is subdivided into four identical dies 102, 104, 106, and 108.
- Each of the dies 102, 104, 106, and 108 include multiple processor cores, support circuits, serial interconnections and serial data control subsystems.
- the dies 102, 104, 106, and 108 may each have 4,096 processing cores as well as SERDES interconnection lanes to support different communication protocols.
- each of the dies 102, 104, 106, and 108 in this example are interconnected by Interlaken connections.
- the chip 100 is designed to allow one, two or all four of the dies 102, 104, 106, and 108 to be used.
- the pins on a package related to un-used dies are left unconnected in the package or the board.
- the dies are scalable as additional chips identical to the chip 100 may be implemented in a device or a circuit board.
- a single communication port such as an Ethernet port is provided for the chip 100.
- other ports may be provided, such as one or more ports for each die.
- FIG. 2B is a block diagram of one example of the die 102.
- the die 102 includes a fractal array 130 of processing cores.
- the processing cores in the fractal array 130 are interconnected with each other via a system interconnect 132.
- the entire array of cores 130 serves as the major processing engine of the die 102 and the chip 100.
- the system interconnection 132 is coupled to a series of memory input/output processors (MIOP) 134.
- the system interconnection 132 is coupled to a control status register (CSR) 136, a direct memory access (DMA) 138, an interrupt controller (IRQC) 140, an I2C bus controller 142, and two die to die interconnections 144.
- the two die to die interconnections 144 allow communication between the array of processing cores 130 of the die 102 and the two neighboring dies 104 and 108 in FIG. 2A.
- the chip includes a high bandwidth memory controller 146 coupled to a high bandwidth memory 148 that constitute an external memory sub-system.
- the chip also includes an Ethernet controller system 150, an Interlaken controller system 152, and a PCIe controller system 154 for external communications.
- each of the controller systems 150, 152, and 154 have a media access controller, a physical coding sublayer (PCS) and an input for data to and from the cores.
- PCS physical coding sublayer
- Each controller of the respective communication protocol systems 150, 152, and 154 interfaces with the cores to provide data in the respective communication protocol.
- the Interlaken controller system 152 has two Interlaken controllers and respective channels.
- a SERDES allocator 156 allows allocation of SERDES lines through quad M-PHY units 158 to the communication systems 150, 152 and 154.
- Each of the controllers of the communication systems 150, 152, and 154 may access the high bandwidth memory 148.
- the array 130 of directly interconnected cores is organized in tiles with 16 cores in each tile.
- the array 130 functions as a memory network on chip by having a high-bandwidth interconnect for routing data streams between the cores and the external DRAM through memory IO processors (MIOP) 134 and the high bandwidth memory controller 146.
- MIOP memory IO processors
- the array 130 functions as a link network on chip interconnection for supporting communication between distant cores including chip-to-chip communication through an “Array of Chips” Bridge module.
- the array 130 has an error reporter function that captures and filters fatal error messages from all components of array 130.
- FIG. 3A is a detailed diagram of the array of cores 130 in FIG. 2B.
- FIG. 3B is a three-dimensional image of the array of cores 130 in FIG. 2B.
- the array of cores 130 is organized into four-core clusters such as the clusters 200, 210, 220, and 230 shown in FIG. 3A.
- the cluster 200 includes cores 202a, 202b, 202c, and 202d.
- the four cores in each cluster, such as the cores 202a, 202b, 202c, and 202d of the cluster 200, are coupled together by a router 204.
- FIG. 3B shows other clusters 210, 220, and 230 with corresponding cores 212a-212d, 222a-222d and 232a-232d and corresponding routers 214, 224, and 234.
- each of the cores 202a, 202b, 202c, and 202d has up to four sets of three interconnections [L, A, R].
- a core in the center of the array such as the core 202d includes four sets of interconnections 240, 242, 244, and 246 each connected to one of four neighboring cores.
- core 202b is connected to the core 202d via the interconnections 240
- core 202c is connected to the core 202d via the interconnections 242
- core 212b is connected to the core 202d via the interconnections 244
- core 202c is connected to the core 202d via the interconnections 246.
- a separate connector 248 is coupled to the wire router 204 of the cluster 200.
- each core in the middle of the array has four sets of interconnections, while border cores such as the core 202c only have three sets of interconnections 250, 252, and 246 that are connected to respective cores 202a, 202d, and 212a.
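- As an illustrative sketch only (assuming a square 64 × 64 layout for a 4,096-core die, which is an assumption rather than a stated parameter of this disclosure), the following Python example enumerates the direct neighbors of each core and shows why interior cores have four sets of interconnections while border cores have fewer:

```python
from collections import Counter

def neighbors(row, col, size):
    """Directly adjacent (north, south, east, west) cores in a size x size grid."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < size and 0 <= c < size]

size = 64                                   # 64 x 64 = 4,096 cores (assumed layout)
counts = Counter(len(neighbors(r, c, size)) for r in range(size) for c in range(size))
print(counts)                               # Counter({4: 3844, 3: 248, 2: 4})
# Interior cores have 4 neighbor links, edge cores 3, and the four corner cores 2.
```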
- FIG. 4 shows a block diagram of an example processing core 400 that includes a reconfigurable arithmetic engine (RAE) 410.
- the RAE 410 may be configured and reconfigured to perform relevant mathematical routines such as matrix multiplications, Fast Fourier Transforms (FFT), Inverse FFTs (IFFT), point wise multiplication, Softmax and other related nonlinear functions required in an LLM.
- the RAE 410 includes input reorder queues, a multiplier shifter-combiner network, an accumulator and logic circuits.
- the RAE 410 operates in several modes, such as operating as an ALU, and includes a number of floating-point and integer arithmetic modes, logical manipulation modes (Boolean logic and shift/rotate), conditional operations, and format conversion.
- the RAE 410 includes three inputs 412, 414, and 416 and three outputs 422, 424, and 426.
- the RAE 410 receives the output data from a program executed by another RAE 430 and output data from another program executed by another RAE 432.
- An aggregator (AGG) 434 provides an output of aggregated data from different sources to the RAE 410.
- a memory read output 436 and a memory write output 438 also provide data to the RAE 410.
- the memory outputs 436 and 438 provide access to a memory such as an SRAM that stores operand data, and optionally may also store configurations or other instructions for the RAE 410.
- Each of the output data of the RAE 430, RAE 432, aggregator 434, memory read output 436 and the memory write output 438 are provided as inputs to three multiplexers 442, 444, and 446.
- the outputs of the respective multiplexers 442, 444, and 446 are coupled to the respective inputs 412, 414, and 416 of the RAE 410.
- a set of cores may be configured as a full RISC-V processor with associated SRAM able to execute traditional control flow programs as a function representing the computation within a dataflow node.
- RISC-V for Legacy code is supported by configuring multiple cores under software control. This may be used to produce software GPUs or other types of cores from the multiple cores.
- the processing cores such as the FracTLcores® offered by Cornami are an efficient set of transistors for streaming data driven workloads, with a dynamic programming scheduler such as the TruStream programming scheduler offered by Cornami and memory, created from a set of RAE Cores.
- the FracTLcores® can scale up to 64,000,000 cores across chips and systems at near linear scale. Combining the aspects of both data flow and reconfigurable computing to stream data, this architecture with highly functional computational elements can dynamically scale over many chips.
- the example architecture enables developers to take full advantage of both parallelism and pipelining to minimize latency and maximize overall application performance and throughput.
- the use of the architecture of processing cores results in reduction in processing cost.
- the cores may employ a data-flow programming model resulting in a 5x reduction in processing cost.
- a data-defining function computation for the cores may result in a 6x reduction in processing cost.
- a data Read/Write with a Tensor pattern applied to the cores may result in a 6x reduction in processing cost.
- FIG. 5 is a diagram of four configurations 510, 520, 530, and 540 of the array of cores in FIG. 2B as either a RISC-V processor or a specialized ALU internal module.
- the configurations 510, 520, 530, and 540 can dynamically switch from one type to the other by reconfiguring some or all of the computational cores in the configurations.
- the first configuration 510 is a set of cores configured as a full RISC processor with associated SRAM able to execute traditional Control Flow programs as a function representing the computation within a dataflow node.
- the RISC processor includes sixteen separate cores 512.
- Another configuration 520 is sixteen independently reconfigurable and programmable ALUs, each formed from cores 522 (for example, FracTLcores® available from Cornami). Each of the cores 522 has associated SRAM supporting multiple simultaneous integer and floating-point computations of up to 128 bits.
- the configuration 520 thus is a set of cores that are configured as individual FracTLcores®.
- the configuration 530 includes one or more RISC cores 532 that are a set of sixteen cores in this example.
- the RISC core 532 can have additional individual or multiple cores 534 incorporated within them to accelerate specific RISC functions. Alternatively, the additional cores 534 may be designated for data path/arithmetic acceleration, enhancing ALU performance.
- sixteen cores are configured to become the RISC-V processor.
- Optional additional cores may be added to the configuration to provide hardware acceleration to math operations performed by the RISC.
- a normal RISC processor does not have hardware to perform a cosine function.
- an additional core may be added and configured to perform a hardware cosine operation. This enhances the ISA instruction set of the RISC processor by adding the hardware accelerated cosine function that may be accessed by the RISC processor.
- the configuration 540 has a set of cores that are configured into two individual groupings of cores configured as RISC processors 542 and cores that are configured as ALUs (e.g., FracTLcores®) 544.
- a client may submit an encrypted query to the LLM architecture configured on an array of cores such as that in FIG. 2B.
- the results of the LLM will be output in an encrypted output that may be decrypted.
- private LLMs can be grouped into three levels according to the data format (plaintext or ciphertext) used for the input query, the weights, and the output responses of an LLM.
- the superscript “C” denotes the ciphertext format for the below discussion.
- FIG. 6A shows the processing steps of a Level 1 Private LLM process 600.
- the process 600 includes a client input stage 612, a server executing an LLM 614 and a client output stage 616.
- a mathematical representation 620 and process 600 show the following.
- the Client first encrypts their data X (plaintext) to the ciphertext X^C by using an algorithm supporting Fully Homomorphic Encryption (FHE), such as CKKS or TFHE, where "Sk" and "Pk" denote the secret key and public key owned by the Client (612).
- FHE Fully Homomorphic Encryption
- the Server 614 performs the LLM operation with X^C as the input of the general LLM as shown in the mathematical representation 620. More specifically, the example array of cores based chip in FIGs. 2A-2B first receives the ciphertext X^C from the client 612 via the data link interface and also reads the plaintext weights (W) of the LLM from memory, which could be SRAM, cache, or DRAM. The chip 100 performs all the matrix multiplications and nonlinear functions as shown in the mathematical representation 620. The server 614 then sends the generated output Y^C to the client via the data-link interface.
- W plaintext weights
- the generated output equals F(W, X^C) and should also be equal to (F(W, X))^C, that is, the ciphertext corresponding to the plaintext F(W, X).
- the client 612 decrypts the received ciphertext Y^C to finally get the desired result, which equals F(W, X).
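- Stated compactly (a restatement of the steps above, with Enc and Dec denoting FHE encryption and decryption under the Client's keypair Pk/Sk), the Level 1 round trip is:

```latex
\text{Client: } X^{C} = \mathrm{Enc}_{Pk}(X), \qquad
\text{Server: } Y^{C} = F(W, X^{C}) = \bigl(F(W, X)\bigr)^{C}, \qquad
\text{Client: } Y = \mathrm{Dec}_{Sk}(Y^{C}) = F(W, X).
```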
- FIG. 6B shows the processing steps of a Level 2 Private LLM 630.
- the process 630 includes a client input stage 632, a server executing an LLM 634 and a client output stage 636.
- a mathematical representation 640 and process 630 show the following.
- the processing steps of the Level 2 Private LLM in FIG. 6B include the following.
- the Client first encrypts their data X (plaintext) to the ciphertext X^C with FHE algorithms using the keypair, "Sk" and "Pk".
- the Client sends their encrypted data X^C to the server 634.
- the operator of the server 634 owns the general LLM parameters and fine-tuning parameters, which are both still in plaintext format.
- the server 634 performs the fine-tuning LLM operation with X^C as the input of the Fine-Tuned LLM as shown in FIG. 6B.
- the chip 100 in FIGs. 2A-2B first receives the ciphertext X^C from the Client via the data link interface and also reads both the plaintext weights W of the general LLM and the plaintext weights ΔW of the fine-tuned LLM from a memory.
- the chip 100 then performs all the matrix multiplications and nonlinear functions as shown in the second part of FIG. 6B.
- the Server sends to the Client the generated output Y^C via a data-link interface.
- the generated output equals F(W, X^C) + F(ΔW, X^C), and should also equal (F(W, X) + F(ΔW, X))^C, that is, the ciphertexts corresponding to the desired output.
- the Client uses the secret key "Sk" and decrypts the received ciphertext Y^C to finally get the desired result Y.
- FIG. 6C shows the processing steps of a Level 3 Private LLM 650.
- the process 650 includes a client input stage 652, a server executing an LLM 654 and a client output stage 656.
- a mathematical representation 660 and process 650 show the following processing steps for the Level 3 Private LLM.
- the Client first encrypts their data X (plaintext) to the ciphertext X^C with FHE algorithms using the keypair, "Sk" and "Pk."
- the Client sends their encrypted data X^C to the server 654.
- the server operator owns the general LLM parameters in the format of plaintext and the fine-tuning parameters in the format of ciphertext encrypted by the server itself.
- the ciphertext version (ΔW)^C of the fine-tuning weights ΔW can be pre-determined and stored in memory by using the keypair, "SSk" and "SPk," as shown in FIG. 6C.
- the Server performs the fine-tuning private LLM operation with X^C as the input as shown in FIG. 6C.
- the chip 100 first receives the ciphertext X^C from the Client via the data link interface.
- the chip 100 also reads the plaintext weights W of the general LLM and the ciphertext weights (ΔW)^C of the fine-tuned LLM from memory. The chip 100 then performs all the matrix multiplications and nonlinear functions as shown in the second part of FIG. 6C.
- the Server 654 sends the generated output Y^C and the public key "SPk" to the Client via the data-link interface.
- the generated output equals F(W, X^C) + F((ΔW)^C, X^C), and should also equal (F(W, X) + F(ΔW, X))^C, that is, the ciphertexts corresponding to the desired output.
- the Client decrypts the received ciphertext Y^C to finally get the desired result Y by using the secret key "Sk" and the public key "SPk" from the server.
- FIG. 7 is a table 700 that presents a quantitative comparison of two kinds of representative implementation platforms (ASIC and GPU) in terms of the total cost and power consumption in order to generate the desired responses for a complete sequence of tokens using the parameters setting in GPT-3.
- the first column in the table 700 in FIG. 7 is the listing of five LLMs in rows 710, 712, 714, 716, and 718.
- the second and third columns of the table 700 in FIG. 7 indicate the number of trillion multiplier equivalents (TME) needed by an ASIC and a GPU, respectively, in order to generate the desired output for a complete sequence of tokens in GPT-3.
- the fourth column of the table 700 shows the ratio of the five LLMs in the rows 710, 712, 714, 716, and 718 with the General LLM in the row 710 in terms of the number of GPUs.
- the example reconfigurable core architecture allows practical implementation of the different level private LLMs.
- the array of cores architecture in FIG. 2B and FIG. 3B combines aspects of dataflow and reconfigurable computing to stream data through a computational fabric architecture with highly functional computational elements that can dynamically scale over many chips.
- the computational fabric is represented by one or many custom ASIC chip(s) residing in one or multiple PCIe cards within one or multiple host servers. Each host server has an x86 processor(s) running Linux as an interface to the computational fabric.
- the custom ASICs have several key functional components that are linked by the following three types of core communication mechanisms.
- the first communication mechanism is adjacent core-to-core communication in the array 130, in which one core communicates with a physically adjacent core as laid out on the silicon substrate.
- Adjacent core communication is the most efficient inter-core communication mechanism and takes place via the North, South, East, or West core interfaces.
- the second communication mechanism is a Network-On-Chip (NOC), which generalizes core-to-core communication to cores that are not side-by-side on the same chip or that reside on different chips as shown in FIG. 3B.
- the third communication mechanism is a PCIe link for intra-system communications between the host and PCIe boards.
- This reconfigurable core array computing architecture allows different functions to be defined by dynamically changing the topological linkages of processing cores within a computational fabric to achieve superior silicon utilization in terms of application performance, throughput, power consumption, and processing latency.
- the computational fabric significantly reduces the dependence on memory to store intermediate computational results and exceeds the flexibility and programmability of an FPGA or DSP or GPU while still providing near ASIC level solution performance.
- the example reconfigurable core architecture in FIG. 2B and FIG. 3B is a very powerful hardware computing platform to perform extensive matrix multiplications required in executing General LLM and Private LLM with near zero programming complexity.
- the computational complexity involved in a Level 3 Private-LLM is still too high even for the example core based architecture as shown in FIG. 7. This is because all the computations related to fine-tuning weights in the Level-3 Private LLM need to be operated in ciphertext format.
- the reduction of the computational complexity for a Level 3 Private LLM can be accomplished by using a low-rank adaptation (LoRA) algorithm.
- LoRA low-rank adaptation
- the use of the LoRA concept in a Level 3 private LLM can reduce the size of the fine-tuning weights ΔW.
- LoRA Low-Rank Adaptation of Large Language Models
- a step-by-step description of how LoRA works in terms of training and inference is as follows.
- a large pre-trained language model such as GPT-3 is used as a starting point. These models have a massive number of parameters and require substantial computational resources to train.
- Target layers are selected within the model for low-rank adaptation. These layers are typically chosen based on their computational intensity and importance to the target task.
- a technique called “low-rank factorization” is applied to these layers, which simplifies the model's calculations by using fewer numbers and thus reduces model complexity.
- the simplified model is fine-tuned by training the model on a specific task using these reduced parameters. This fine-tuning process tailors the model to the task.
- the adapted model may be used for specific inference applications as it is more efficient and memory-friendly for the specific application.
- the adapted model is thus suitable for real-world use in applications like chatbots, translation, or text generation.
- the general process starts by checking the task performance of the simplified model. This ensures that the simplified model still performs well for the target task. In many cases, the simplified model may achieve a good balance between efficiency and accuracy.
- the LoRA process uses several parameters such as a matrix scaling factor α and a matrix rank factor r. These determine how large the LoRA matrices will be in terms of their dimensions and values.
- an example rule of thumb of α = 2r has been shown to be effective in some experiments. There are sometimes tradeoffs between memory usage and training time.
- FIG. 8 is a matrix diagram 800 of an example LoRA.
- a weight matrix of size d × d can be decomposed into the product of a low-rank matrix A (d × r) 812 and another low-rank matrix B (r × d) 814. Although both W and ΔW of FIG. 6C have the same dimensions, ΔW (d × d), shown as the solution matrix A × B 816, can be generated from two much lower-rank matrices 812 and 814 (A and B) as shown in FIG. 8, where the rank r could be as small as 10, which is 1000 times less than d.
- the fine-tuning weights may be obtained in the ciphertext domain and may be used to perform the desired LLM processing on ciphertexts. Since the dimensions of the matrices 812 and 814 (A and B) are much smaller, the corresponding computational complexity for ciphertext would be as low as that for plaintext, which means a reduction of 100 times can be achieved by the example LoRA-based solution.
- the small-size matrices A and B may first be encrypted, and their encrypted versions may then be used to perform the computations in the ciphertext domain as required in all the processing steps of the Level 3 Private LLM shown in FIG. 6C. Since the dimensions of matrices A and B are much smaller, the corresponding computational complexity for ciphertext would be as low as that for plaintext, which means a reduction of 100 times can be achieved by the LoRA-based algorithm.
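- A minimal NumPy sketch of this decomposition (with toy sizes d = 1024 and r = 8, and the α = 2r scaling mentioned above; the specific numbers are illustrative choices, not values from this disclosure) shows both the parameter reduction and the fact that ΔW never needs to be materialized as a full d × d matrix:

```python
import numpy as np

d, r, alpha = 1024, 8, 16                 # toy sizes; alpha = 2r per the rule of thumb above
rng = np.random.default_rng(5)
A = rng.normal(size=(d, r))               # low-rank factor A (d x r), item 812 in FIG. 8
B = rng.normal(size=(r, d))               # low-rank factor B (r x d), item 814 in FIG. 8
x = rng.normal(size=d)

# The fine-tuning contribution ΔW·x can be computed as (alpha/r)·A·(B·x),
# costing 2·d·r multiplies instead of the d·d multiplies of a full ΔW.
dWx_lowrank = (alpha / r) * (A @ (B @ x))
dWx_full = ((alpha / r) * (A @ B)) @ x
assert np.allclose(dWx_lowrank, dWx_full)

print(d * d, "elements in a full ΔW vs", 2 * d * r, "in A and B combined")
# 1048576 elements in a full ΔW vs 16384 in A and B combined (a 64x reduction)
```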
- FIG. 9 further illustrates the data flow of inference stage of a LoRA based Level-3 Private LLM executed by a server 910 that communicates with a client 912.
- the client 912 generates a user prompt 920 that receives a query represented by data X.
- the query, Q (data X) is encoded to Q’ (922).
- using the keypair "Sk" and "Pk," the client 912 first encrypts their data X (plaintext) to the ciphertext X^C with FHE algorithms (924).
- the client 912 then sends their encrypted data X^C as a ciphertext input 932 to the server 910, which holds an LLM 930, general LLM parameters in the format of plaintext 934, and the fine-tuning weight parameters ΔW 936 in the format of ciphertext encrypted by the server itself.
- in the former, (ΔW)^C = (A × B)^C, the multiplication is performed first and then the encryption.
- in the latter, A^C × B^C, the encryption is performed first and then the multiplication.
- the processing unit, which may be the chip 100 in FIG. 2A, in the server 910 performs the LoRA-based fine-tuned LLM operation with X^C as the input as shown in FIG. 9.
- the server 910 first receives the ciphertext X^C from the client input (932) via the data link interface and also reads the plaintext weights W (934) of the general LLM and the ciphertext weights of the fine-tuned LLM (936) from memory.
- the ciphertext weights are broken down into two matrices 938 and 940 in the process described in FIG. 8.
- the chip 100 performs all the matrix multiplications and nonlinear functions as shown in FIG 6C and FIG. 9.
- the resulting output from both the plaintext weights and the ciphertext weights are added to produce a generated output (942).
- the Server sends the generated output Y^C and the public key "SPk" to the Client via the data-link interface.
- the generated output equals F(W, X^C) + F((A × B)^C, X^C) or F(W, X^C) + F(A^C × B^C, X^C), and should also equal (F(W, X) + F(A × B, X))^C, that is, the ciphertexts corresponding to the desired output.
- the client 912 receives the ciphertext and the public key (950).
- the client 912 decrypts the received ciphertext Y^C to finally get the desired result (952) Y by using the secret key "Sk" and the public key "SPk" from the server 910.
- FIG. 10 shows a table 1000 that is a quantitative comparison of a known GPU platform to the example array of cores platform in terms of the total cost and power consumption in order to generate the desired responses for a complete sequence of tokens with using the parameters setting for GPT-3.
- the example array of cores platform costs only about one tenth of what the GPU platform costs for performing the same General LLM inference task of FIG. 1B.
- the GPU platform costs 179 times as much, but the example platform costs only 3.03 times as much.
- the GPU platform costs 190 times as much, but the example platform costs only 3.19 times as much.
- the GPU platform costs 19,998 times as much, but the example platform costs only 330 times as much.
- the GPU platform costs 196 times as much, but the example platform costs only 3.51 times as much, which suggests that the GPU platform is still not practical.
- the example array of cores platform can serve as a feasible and practical solution for the deployment of all three levels of Private LLMs into real-world applications.
Abstract
A system to output a response to a query via processing a large language model is disclosed. The system includes an array of processing cores arranged in a grid allowing each processing core to communicate directly to a neighboring processing core. An interconnection network is coupled to each of the processing cores allowing communication between the processing cores. A first processing core of the array of processing cores is configured to receive an encrypted query. A second processing core of the array of processing cores is configured to input the encrypted query to a large language model; execute the large language model having general weights in plaintext; and provide an encrypted output of the large language model.
Description
FRACTAL CORE ARCHITECTURE SYSTEM FOR IMPLEMENTING EFFICIENT PRIVATE LARGE LANGUAGE MODELS
PRIORITY CLAIM
[0001] The present disclosure claims benefit of and priority to U.S. Provisional Patent Application Serial No. 63/597,561, filed on November 9, 2023. The contents of that application are hereby incorporated by reference in their entirety.
TECHNICAL FIELD
[0002] The present disclosure relates generally to security for large language models. More particularly, aspects of this disclosure relate to a core based architecture for executing private large language models and efficient encryption to protect the data and models.
BACKGROUND
[0003] Large Language Model (LLM) and Generative Pre-trained Transformer (GPT) technology based Generative Artificial Intelligence (GenAI) is now finding use in almost every aspect of industry and society due to its powerful capabilities in extracting, processing, and expanding data, information, and knowledge. GenAI can help to meet increasing digital demands in terms of cost, power, capacity, coverage, latency, efficiency, flexibility, compatibility, quality of experience, and silicon convergence.
[0004] An LLM may be considered as a nonlinear mapping from an input X (a query) to an output Y (a response) by performing the following major processing blocks shown in FIG. 1A. The major processing blocks thus include Embedding and Encoding 10, Multiple Head Attentions 12, Feed Forward Perceptron 14, Layer Normalization 16 and Softmax functions 18. In an LLM layer as shown in FIG. 1A, the Embedding and Encoding 10 is a pre-processing unit which mainly performs the matrix multiplications between the input matrix and the corresponding weight matrices. Multiple Head Attentions 12 perform multiple matrix multiplications according to three weight matrices called a Query Matrix 20, a Key Matrix 22, and a Value Matrix 24, respectively, to jointly attend to information from different representation subspaces at different positions. The Feed Forward Perceptron layers 14 constitute a conventional feedforward neural network having at least one hidden layer. The Layer Normalization 16 simply normalizes each input. The Softmax function 18 is an activation function that scales numbers/logits into probabilities.
[0005] The output of a LLM can uniquely be generated by all the pre-determined weight matrices and its inputs. These weight matrices can be obtained during an off-line training stage. During an inference stage, LLM simply performs all the above matrix multiplications and corresponding nonlinear functions such as layer normalization and Softmax operations. Different LLMs have different parameter sizes (the total element number of the weight matrices), which mainly depend on the numbers of attention heads and number of LLM layers. For example, GPT-3 has 96 heads and 96 layers and hence there are about 175 billion parameters (elements of weight matrices) in total.
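As an illustration of the kinds of matrix operations described above, the following sketch implements a highly simplified, single-head LLM layer in Python/NumPy with toy dimensions. It is not the disclosed implementation; it only shows how such a layer reduces to matrix multiplications followed by the Softmax and layer-normalization nonlinearities.

```python
import numpy as np

def softmax(z, axis=-1):
    # Scale logits into probabilities (block 18 in FIG. 1A).
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each input vector (block 16 in FIG. 1A).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def llm_layer(X, Wq, Wk, Wv, W1, W2):
    """One simplified (single-head) LLM layer: attention plus feed-forward."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # query/key/value projections
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # attention weights
    H = layer_norm(X + A @ V)                     # attention output + residual
    F = np.maximum(H @ W1, 0) @ W2                # feed-forward perceptron
    return layer_norm(H + F)

# Toy dimensions: 4 tokens, model width 8, hidden width 16.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
Y = llm_layer(X, Wq, Wk, Wv, W1, W2)
print(Y.shape)  # (4, 8)
```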
[0006] In terms of training and inference, the applications and implementation scenarios of LLM may be mainly categorized into the following three groups, namely: 1) General LLMs; 2) Fine-Tuned LLMs; and 3) Private LLMs. FIG. 1B shows a simplified diagram of a general LLM 40 between a client with X input, the LLM executed in a server, and an output Y to the client. The General LLM is actually the LLM that is most commonly used. In this example, W denotes the entire weight matrix and F(W, X) represents the nonlinear mapping that the LLM is to perform.
[0007] In a General LLM, the processing steps include: 1) a Client sending their data X to the party operating the server that owns LLM parameters; 2) the Server performing the LLM operation with X as inputs of the LLM as shown in FIG. 1A; and 3) the Server sending the Client the generated output Y, which equals F(W, X) as shown in FIG. 1B.
[0008] It is possible to perform processing using very specific knowledge to produce LLMs that are experts in the narrow knowledge domain taking the existing pre-trained General LLM as a starting point. The use of the specific knowledge combined with a general LLM results in a Fine-Tuned LLM. A Fine-Tuned LLM involves taking a pre-existing general model that has been trained on a large dataset, such as a language model like GPT-3, and refining the general model for a specific task or domain. During fine-tuning, the model is further trained on a smaller, domain-specific dataset. This process adapts the parameters of the model to the nuances of the target task, improving its performance and making it more capable in handling specific tasks. Fine-tuned LLMs are a cost-effective and efficient way to leverage the knowledge learned by a pre-trained general model while tailoring the general model to specific applications. This reduces the need for extensive training from scratch. Fine-tuned LLMs allow for rapid development of domain specific Al solutions with high accuracy and applicability.
[0009] FIG. 1C is a diagram of a fine-tuned LLM process 50. FIG. 1C is a representation of performance of a Fine-Tuned LLM task where ΔW is the difference between the new modified weight matrix (from specific datasets) and the original weight matrix W (from larger datasets).
[0010] More specifically, the processing steps of the fine-tuned LLM 50 in FIG. 1C include: 1) a Client sending their data, X, to a party operating the server with LLM parameters and fine-tuned parameters; 2) the Server performing the LLM operation with X as inputs of the Fine-Tuned LLM; and 3) the Server sending the generated output Y to the Client. The generated output, Y, may be represented by F(W, X) + F(ΔW, X) or F(W + ΔW, X).
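For the linear (matrix multiplication) portions of the model, the equivalence of F(W, X) + F(ΔW, X) and F(W + ΔW, X) follows from linearity, as the following minimal NumPy check illustrates with arbitrary toy matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
W  = rng.normal(size=(8, 8))   # general (pre-trained) weights
dW = rng.normal(size=(8, 8))   # fine-tuning update, i.e. the ΔW of FIG. 1C
X  = rng.normal(size=(4, 8))   # a toy input

# For a linear layer, applying W and ΔW separately and summing the results
# equals applying the combined weights W + ΔW.
assert np.allclose(X @ W + X @ dW, X @ (W + dW))
```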
[0011] In the above General LLM and Fine-Tuned LLM, both the client and the server are fully transparent and there is no privacy, protection, or security for their input data and output data, or for the customized fine-tuning parameters owned by the server operator. Hence, there are three problems regarding privacy and data protection when using either model. [0012] From the perspective of the owner of the fine-tuned LLM model, the theft of the fine-tuned LLM parameters is catastrophic. The stolen LLM can enable competitors to produce products or offer services that compete with those of the LLM owner. Such information may also be simply released on the internet and thus drive the revenue of the LLM owner to zero. There is therefore a need to prevent the access and use by unauthorized third parties of a proprietary fine-tuned LLM.
[0013] A client of a party that owns such LLMs provides queries to the LLM owner and receives responses to these queries from the LLM owner. The queries from the client may contain intellectual property, and the answers to these queries may also contain new and novel intellectual property that the owner of the LLM now has access to. There is a need to prevent the leak of intellectual property of a client into the LLM from the user's queries as well as a need to protect any new and novel intellectual property from the responses of the LLM.
[0014] The information that is used for fine-tuning an LLM may be highly sensitive, proprietary, classified, or protected under privacy laws such as the European GDPR rules or the US HIPAA rules. There is a need to be able to perform training of LLMs without exposing the training information during the fine-tuning process. Such information may be protected by encryption.
[0015] Current encryption techniques rely on public/private key mechanisms that would require an intensive level of computing power to break by brute force. Such systems are currently secure because of that required computing power. However, with the potential advent of quantum computers, standard encryption techniques may become vulnerable to being solved by a quantum computer. Thus, new types of quantum-secure encryption have been proposed, such as fully homomorphic encryption (FHE). FHE allows computations on ciphertext without having to perform decryption. This allows delegation of sensitive data analysis computations on encrypted data.
There are several open-source frameworks for fully homomorphic encryption. One such framework is the Concrete library, which implements the Fully Homomorphic Encryption over the Torus (TFHE) procedure. A second framework is OpenFHE, which supports multiple schemes including Brakerski-Gentry-Vaikuntanathan (BGV), Brakerski/Fan-Vercauteren (BFV), Cheon-Kim-Kim-Song (CKKS), TFHE, and FHEW.
[0016] An example FHE operation is a Partial Result Sharing Approach for two parties, which is functionally equivalent to a (2,2) threshold approach. The first party generates their own FHE public key (PK1) and FHE private key (SK1) for their database. The second party generates their own FHE public key (PK2) and FHE private key (SK2) for their database. The public keys, PK1 and PK2, are shared between the parties, while the private keys, SK1 and SK2, are kept secret.
[0017] A Joint Key (JK) is computed using the public keys PK1 and PK2. This key is neither a secret key nor a public key in the traditional sense but serves as a unified platform facilitating joint computations on encrypted data. Evaluation keys, associated with the joint key, are generated. These evaluation keys are crucial for operations like “addition” and “multiplication” in the homomorphic encryption domain. Thus, the first party may encrypt its data (Data1) with PK1, resulting in Ciphertextdb1. The second party may encrypt its data (Data2) with PK2, resulting in Ciphertextdb2.
[0018] A process often referred to as key switching is applied. Key switching converts Ciphertextdb1 and Ciphertextdb2 to be compatible with the joint key (JK), allowing for homomorphic operations without revealing the actual data. After Ciphertextdb1 and Ciphertextdb2 are key switched to be under the joint key, the first party can no longer decrypt Ciphertextdb1 using only SK1, and the second party can no longer decrypt Ciphertextdb2 using only SK2. Thus, both parties must collaborate to decrypt the respective ciphertexts. Computations (such as addition, multiplication, etc.) may be performed on the encrypted data. These operations are executed under the joint key, ensuring consistency and validity. After computations, the process for joint decryption begins, ensuring that neither private key, SK1 nor SK2, is exposed to the other party.
[0019] The Concrete library is an open-source library developed in Rust that builds on the state-of-the-art TFHE cryptosystem. The Concrete library provides a user-friendly interface making FHE easy to integrate. The Concrete library deals with inputs of arbitrary format and comes with an extensive set of operations for manipulating ciphertexts, including a programmable bootstrapping process. Learning With Errors (LWE) is a quantum-robust method of cryptography applicable to FHE. The LWE problem is conjectured to be hard to solve, and
thus to be useful in cryptography. FHE is based on a quantum-secure scheme for the LWE (learning with errors) problem. FHE allows computations such as Boolean operations, integer arithmetic operations, and floating-point arithmetic operations on ciphertext without decryption. Thus, sensitive data analysis (computations) may be performed on encrypted data without ever decrypting the data.
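As an illustration of computing on ciphertext without decryption, the short sketch below assumes the Python frontend to the Concrete library (concrete-python); the exact decorator and method names reflect that frontend's documented usage and should be treated as an assumption rather than part of this disclosure.

```python
# Minimal sketch assuming the concrete-python frontend to the Concrete library.
from concrete import fhe

@fhe.compiler({"x": "encrypted", "y": "encrypted"})
def weighted_sum(x, y):
    # Integer arithmetic that will be evaluated homomorphically on ciphertexts.
    return 3 * x + y

# The inputset bounds the value ranges so the compiler can size the circuit.
inputset = [(0, 0), (7, 7), (3, 5)]
circuit = weighted_sum.compile(inputset)

# Encrypt, run on ciphertexts, and decrypt; the evaluation itself never sees plaintext.
assert circuit.encrypt_run_decrypt(3, 5) == 14
```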
[0020] In theory, privacy could be accomplished by using fully homomorphic encryption (FHE) approaches, but this approach is too computationally cumbersome for conventional hardware such as graphics processing units (GPUs). According to different requirements for privacy and security, private LLMs can be grouped into three levels based on the data format (plaintext or ciphertext) used for the input queries, weights, and output responses of an LLM.
[0021] A Level 1 private LLM encrypts input queries and output responses but leaves general weights in plaintext. The Level 1 private LLM does not incorporate fine-tuning. A Level 2 private LLM encrypts input queries and output responses, but leaves fine-tuning and general weights in plaintext. A Level 3 private LLM encrypts input queries, output responses, and fine-tuning weights, but leaves general weights in plaintext. The additional encryption and other computational operations required make each successive level of the private LLM more impractical for current hardware.
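This classification can be restated as a simple mapping from each LLM artifact to its data format. The sketch below is merely bookkeeping for the three levels described above; the field names are illustrative.

```python
# Which artifacts are ciphertext ("C") and which remain plaintext ("P") at each level.
# Level 1 has no fine-tuning weights at all.
PRIVACY_LEVELS = {
    1: {"query": "C", "response": "C", "fine_tuning_weights": None, "general_weights": "P"},
    2: {"query": "C", "response": "C", "fine_tuning_weights": "P",  "general_weights": "P"},
    3: {"query": "C", "response": "C", "fine_tuning_weights": "C",  "general_weights": "P"},
}

assert all(level["general_weights"] == "P" for level in PRIVACY_LEVELS.values())
```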
[0022] Thus, there is a need for a method that makes encryption for all three levels of private LLMs computationally practical. There is a further need for a computer architecture having an array of homogeneous cores configured to perform LLM functions. There is also a need for a computer architecture having an array of homogeneous cores configured to perform encryption of queries to an LLM and of weights for an LLM.
SUMMARY
[0023] The present disclosure relates generally to security applications. More particularly, aspects of this disclosure relate to techniques to protect private large language models with efficient encryption.
[0024] The term embodiment and like terms, e.g., implementation, configuration, aspect, example, and option, are intended to refer broadly to all of the subject matter of this disclosure and the claims below. Statements containing these terms should be understood not to limit the subject matter described herein or to limit the meaning or scope of the claims below. Embodiments of the present disclosure covered herein are defined by the claims below, not this summary. This summary is a high-level overview of various aspects of the disclosure and introduces some of the concepts that are further described in the Detailed Description section
below. This summary is not intended to identify key or essential features of the claimed subject matter. This summary is also not intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this disclosure, any or all drawings, and each claim.
[0025] According to certain aspects of the present disclosure, an example system to output a response to a query is disclosed. The system includes an array of processing cores arranged in a grid allowing each processing core to communicate directly with a neighboring processing core. The system includes an interconnection network coupled to each of the processing cores allowing communication between the processing cores. A first processing core of the array of processing cores is configured to receive an encrypted query. A second processing core of the array of processing cores is configured to input the encrypted query to a large language model. The second processing core is configured to execute the large language model having general weights in plaintext. The second processing core is configured to provide an encrypted output of the large language model.
[0026] A further implementation of the example system includes a third processing core of the array of processing cores configured to decrypt the output of the large language model. Another implementation is where the example system includes a third processing core configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query. The fine-tuned foundation layer includes a matrix of proprietary weights. Another implementation is where the matrix of proprietary weights is in plaintext. Another implementation is where the example system includes a fourth processing core of the array of processing cores configured to encrypt the matrix of proprietary weights. Another implementation is where the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaption (LoRA) algorithm. Another implementation is where the large language model has a plurality of layers including the foundation layer. The other layers of the plurality of layers apply the general weights. Another implementation is where the first and second cores include at least one of a Reduced Instruction Set Computing (RISC) processor core, a special purpose processing core; or a RISC processor core with a set of special purpose processing cores embedded within. Another implementation is where the first and second cores are in an array of RISC processor cores interconnected with an array of special purpose processing cores. Another implementation is where the encryption is performed via Fully Homomorphic Encryption (FHE). Another implementation is where the encrypted query is encrypted from a plaintext query input to an external device that transmits the encrypted query and evaluation keys for evaluation by the
first processing core. Another implementation is where the example system includes a third processing core configured to simultaneously process both plaintext and homomorphic ciphertext. Another implementation is where the second processing core is configured to perform an encrypted summation. Another implementation is where the example system includes a computational fabric having a plurality of individual integrated circuits. The array of processing cores is on at least one of the plurality of individual integrated circuits. Another implementation is where the computational fabric allows communication between each of the plurality of individual integrated circuits.
[0027] According to certain aspects of the present disclosure, an example array of cores on an integrated circuit die is disclosed. The array of cores includes an interconnection network coupled to each of the processing cores allowing communication between the processing cores. A first processing core or cores of the array of processing cores is configured to receive an encrypted query. A second processing core or cores of the array of processing cores is configured to input the encrypted query to a large language model. The second processing core or cores is configured to execute the large language model having general weights in plaintext. The second processing core or cores is configured to provide an encrypted output of the large language model.
[0028] A further implementation of the example array of cores includes a third processing core or cores of the array of processing cores configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query. The fine-tuned foundation layer includes a matrix of proprietary weights. A fourth processing core or cores of the array of processing cores is configured to decrypt the output of the large language model. Another implementation is where the matrix of proprietary weights is in plaintext. The example array of cores further includes a fifth set of processing cores of the array of processing cores configured to encrypt the matrix of proprietary weights. The encryption of the matrix of proprietary weights is reduced by applying a low-rank adaption (LoRA) algorithm. Another implementation is where the large language model has a plurality of layers including the foundation layer. The other layers of the plurality of layers apply the general weights.
[0029] Another disclosed example is a method of configuring an array of programmable cores including a plurality of programmable cores coupled via an interconnection network. A first processing core or cores of the array of processing cores is configured to receive an encrypted query. A second processing core or cores of the array of processing cores is configured to input the encrypted query to a large language model and execute the large language model having
general weights in plaintext. The second processing core or cores of the array are configured to provide an encrypted output of the large language model.
[0030] The above summary is not intended to represent each embodiment or every aspect of the present disclosure. Rather, the foregoing summary merely provides an example of some of the novel aspects and features set forth herein. The above features and advantages, and other features and advantages of the present disclosure, will be readily apparent from the following detailed description of representative embodiments and modes for carrying out the present invention, when taken in connection with the accompanying drawings and the appended claims. Additional aspects of the disclosure will be apparent to those of ordinary skill in the art in view of the detailed description of various embodiments, which is made with reference to the drawings, a brief description of which is provided below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0031] The disclosure will be better understood from the following description of exemplary embodiments together with reference to the accompanying drawings, in which:
[0032] FIG. 1A is a diagram of a prior art LLM layer with different operations;
[0033] FIG. 1B is a simplified illustration of inputs and outputs to a general LLM;
[0034] FIG. 1C is a simplified illustration of inputs and outputs to a fine-tuned LLM;
[0035] FIG. 2A is a diagram of a chip having four dies each having multiple processing cores;
[0036] FIG. 2B is a simplified diagram of one of the dies on the chip shown in FIG. 2A;
[0037] FIG. 3A is a block diagram of the array of cores in the die in FIG. 2B;
[0038] FIG. 3B is a three-dimensional view of the array of cores in the die in FIG. 2B;
[0039] FIG. 4 is an example reconfigurable arithmetic engine configuration of one of the cores in the core array in FIG. 2A;
[0040] FIG. 5 is a diagram of configurations of the array of cores in FIG. 2A as either a RISC-V or a specialized ALU internal module;
[0041] FIG. 6A is an example of a Level 1 private LLM;
[0042] FIG. 6B is an example of a Level 2 private LLM;
[0043] FIG. 6C is an example of a Level 3 private LLM;
[0044] FIG. 7 is a table showing the cost and power comparisons for different LLMs by conventional hardware;
[0045] FIG. 8 is a diagram showing reduction of a matrix of fine-tuning weights using a low-rank adaptation algorithm by using two lower ranked matrices;
[0046] FIG. 9 is a diagram of the data flow of the inference stage of a low-rank adaption (LoRA) algorithm to reduce the size of the fine-tune proprietary weights for a private LLM; and
[0047] FIG. 10 is a table showing the trillion multiplier equivalent (TME) values for conventional solutions for private LLM and the example core based architecture for private LLMs.
[0048] The present disclosure is susceptible to various modifications and alternative forms. Some representative embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
DETAILED DESCRIPTION
[0049] The present inventions can be embodied in many different forms. Representative embodiments are shown in the drawings and will herein be described in detail. The described embodiments are examples or illustrations of the principles of the present disclosure and are not intended to limit the broad aspects of the disclosure to the embodiments illustrated. To that extent, elements and limitations that are disclosed, for example, in the Abstract, Summary, and Detailed Description sections, but not explicitly set forth in the claims, should not be incorporated into the claims, singly or collectively, by implication, inference, or otherwise. For purposes of the present detailed description, unless specifically disclaimed, the singular includes the plural and vice versa; and the word “including” means “including without limitation.” Moreover, words of approximation, such as “about,” “almost,” “substantially,” “approximately,” and the like, can be used herein to mean “at,” “near,” or “nearly at,” or “within 3-5% of,” or “within acceptable manufacturing tolerances,” or any logical combination thereof, for example.
[0050] This present disclosure provides technical details of private LLM technology by addressing working principles, application scenarios, privacy and security requirements, and training and inference algorithms. The present disclosure also describes a silicon and chip implementation of private LLM technology that may be adapted to different products. As explained above, there is a need for applications of Private LLM technology. As will be explained, the example private LLM technology may be classified into three application scenarios (levels). The present disclosure presents the corresponding algorithms related to
training and inference of the three application scenarios. The disclosed array of cores architecture converges Private LLM into an optimum silicon implementation that is superior to existing solutions (e.g., GPU based and ASIC based) in terms of power consumption, cost, and processing latency.
[0051] The present disclosure is directed toward an effective solution for implementing a private large language model (LLM) (transformer neural network) system. The example method is directed toward protecting the LLM through secure encryption such as FHE, using plaintext for the publicly known weights, but protecting all other LLM components, including any proprietary weights for fine-tuning, through FHE.
[0052] FIG. 2A shows an example chip 100 that is subdivided into four identical dies 102, 104, 106, and 108. Each of the dies 102, 104, 106, and 108 includes multiple processor cores, support circuits, serial interconnections, and serial data control subsystems. For example, the dies 102, 104, 106, and 108 may each have 4,096 processing cores as well as SERDES interconnection lanes to support different communication protocols. There are die-to-die parallel connections between the dies 102, 104, 106, and 108. Thus, each of the dies 102, 104, 106, and 108 in this example are interconnected by Interlaken connections. The chip 100 is designed to allow one, two, or all four of the dies 102, 104, 106, and 108 to be used. The pins on a package related to unused dies are left unconnected in the package or the board. The dies are scalable as additional chips identical to the chip 100 may be implemented in a device or a circuit board. In this example, a single communication port such as an Ethernet port is provided for the chip 100. Of course, other ports may be provided, such as one or more ports for each die.
[0053] FIG. 2B is a block diagram of one example of the die 102. The die 102 includes a fractal array 130 of processing cores. The processing cores in the fractal array 130 are interconnected with each other via a system interconnect 132. The entire array of cores 130 serves as the major processing engine of the die 102 and the chip 100. In this example, there are 4096 cores in the fractal array 130 that are organized in a grid.
[0054] The system interconnection 132 is coupled to a series of memory input/output processors (MIOP) 134. The system interconnection 132 is coupled to a control status register (CSR) 136, a direct memory access (DMA) 138, an interrupt controller (IRQC) 140, an I2C bus controller 142, and two die-to-die interconnections 144. The two die-to-die interconnections 144 allow communication between the array of processing cores 130 of the die 102 and the two neighboring dies 104 and 108 in FIG. 2A.
[0055] The chip includes a high bandwidth memory controller 146 coupled to a high bandwidth memory 148 that constitute an external memory sub-system. The chip also includes an Ethernet controller system 150, an Interlaken controller system 152, and a PCIe controller system 154 for external communications. In this example, each of the controller systems 150, 152, and 154 has a media access controller, a physical coding sublayer (PCS), and an input for data to and from the cores. Each controller of the respective communication protocol systems 150, 152, and 154 interfaces with the cores to provide data in the respective communication protocol. In this example, the Interlaken controller system 152 has two Interlaken controllers and respective channels. A SERDES allocator 156 allows allocation of SERDES lines through quad M-PHY units 158 to the communication systems 150, 152, and 154. Each of the controllers of the communication systems 150, 152, and 154 may access the high bandwidth memory 148.
[0056] In this example, the array 130 of directly interconnected cores are organized in tiles with 16 cores in each tile. The array 130 functions as a memory network on chip by having a high-bandwidth interconnect for routing data streams between the cores and the external DRAM through memory IO processors (MIOP) 134 and the high bandwidth memory controller 146. The array 130 functions as a link network on chip interconnection for supporting communication between distant cores including chip-to-chip communication through an “Array of Chips” Bridge module. The array 130 has an error reporter function that captures and filters fatal error messages from all components of array 130.
[0057] FIG. 3A is a detailed diagram of the array of cores 130 in FIG. 2B. FIG. 3B is a three-dimensional view of the array of cores 130 in FIG. 2B. The array of cores 130 is organized into four-core clusters such as the clusters 200, 210, 220, and 230 shown in FIG. 3A. For example, the cluster 200 includes cores 202a, 202b, 202c, and 202d. The four cores in each cluster, such as the cores 202a, 202b, 202c, and 202d of the cluster 200, are coupled together by a router 204. FIG. 3B shows other clusters 210, 220, and 230 with corresponding cores 212a-212d, 222a-222d, and 232a-232d and corresponding routers 214, 224, and 234.
[0058] As may be seen specifically in FIG. 3B, in this example, each of the cores 202a, 202b, 202c, and 202d has up to four sets of three interconnections [L, A, R]. For example, a core in the center of the array such as the core 202d includes four sets of interconnections 240, 242, 244, and 246, each connected to one of four neighboring cores. Thus, the core 202b is connected to the core 202d via the interconnections 240, the core 202c is connected to the core 202d via the interconnections 242, the core 212b is connected to the core 202d via the interconnections 244, and the core 202c is connected to the core 202d via the interconnections 246. A separate connector 248 is coupled to the wire router 204 of the cluster 200. Thus, each core in the middle of the array has four sets of interconnections, while border cores such as the core 202c only have three sets of interconnections 250, 252, and 246 that are connected to the respective cores 202a, 202d, and 212a.
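The neighbor counts described above (four interconnection sets for interior cores, three for border cores) follow directly from the grid arrangement of the array 130. The short sketch below assumes, purely for illustration, that the 4,096 cores are laid out as a 64 × 64 grid; the actual physical layout is not specified here.

```python
# Count the direct neighbors of a core in an assumed 64 x 64 grid of 4,096 cores.
SIZE = 64  # 64 * 64 = 4096 cores (layout assumed for this sketch only)

def neighbors(row, col):
    """Return the (row, col) positions of directly connected neighboring cores."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < SIZE and 0 <= c < SIZE]

print(len(neighbors(10, 10)))  # 4 -> interior core: four sets of interconnections
print(len(neighbors(0, 10)))   # 3 -> border core: three sets of interconnections
print(len(neighbors(0, 0)))    # 2 -> corner core (implied by the grid assumption)
```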
[0059] In order to configure the cores of the example array 130 in FIG. 2A, the inputs of certain blocks may be changed to configure the blocks for one of the three different function blocks. The functions may be configured by simply changing the inputs of the processing cores. FIG. 4 shows a block diagram of an example processing core 400 that includes a reconfigurable arithmetic engine (RAE) 410. The RAE 410 may be configured and reconfigured to perform relevant mathematical routines such as matrix multiplications, Fast Fourier Transforms (FFT), Inverse FFTs (IFFT), point-wise multiplication, Softmax, and other related nonlinear functions required in an LLM. The RAE 410 includes input reorder queues, a multiplier shifter-combiner network, an accumulator, and logic circuits. The RAE 410 operates in several modes, such as operating as an ALU, and includes a number of floating point and integer arithmetic modes, logical manipulation modes (Boolean logic and shift/rotate), conditional operations, and format conversion. The RAE 410 includes three inputs 412, 414, and 416 and three outputs 422, 424, and 426. The RAE 410 receives the output data from a program executed by another RAE 430 and output data from another program executed by another RAE 432. An aggregator (AGG) 434 provides an output of aggregated data from different sources to the RAE 410. A memory read output 436 and a memory write output 438 also provide data to the RAE 410. The memory outputs 436 and 438 provide access to a memory such as an SRAM that stores operand data, and optionally may also store configurations or other instructions for the RAE 410.
[0060] Each of the output data of the RAE 430, RAE 432, aggregator 434, memory read output 436 and the memory write output 438 are provided as inputs to three multiplexers 442, 444, and 446. The outputs of the respective multiplexers 442, 444, and 446 are coupled to the respective inputs 412, 414, and 416 of the RAE 410.
[0061] There are two versions of configuration of computational cores, which can dynamically switch from one type to the other. A set of cores may be configured as a full RISC-V processor with associated SRAM able to execute traditional control flow programs as a function representing the computation within a dataflow node. RISC-V for legacy code is supported by configuring multiple cores under software control. This may be used to produce software GPUs or other types of cores from the multiple cores. The processing cores, such as the FracTLcores® offered by Cornami, are an efficient set of transistors for streaming data-driven workloads, combining a dynamic programming scheduler, such as the TruStream programming scheduler offered by Cornami, and memory, created from a set of RAE cores. In this example, the FracTLcores® can scale up to 64,000,000 cores across chips and systems at near linear scale. Combining the aspects of both data flow and reconfigurable computing to stream data, this architecture with highly functional computational elements can dynamically scale over many chips. The example architecture enables developers to take full advantage of both parallelism and pipelining to minimize latency and maximize overall application performance and throughput. The use of the architecture of processing cores results in a reduction in processing cost. The cores may employ a data-flow programming model resulting in a 5x reduction in processing cost. A data-defining function computation for the cores may result in a 6x reduction in processing cost. A data read/write with a Tensor pattern applied to the cores may result in a 6x reduction in processing cost.
[0062] FIG. 5 is a diagram of four configurations 510, 520, 530, and 540 of the array of cores in FIG. 2B as either a RISC-V processor or a specialized ALU internal module. The configurations 510, 520, 530, and 540 can dynamically switch from one type to the other by reconfiguring some or all of the computational cores in the configurations. The first configuration 510 is a set of cores configured as a full RISC processor with associated SRAM able to execute traditional Control Flow programs as a function representing the computation within a dataflow node. In this example, the RISC processor includes sixteen separate cores 512. Another configuration 520 is sixteen independently reconfigurable and programmable ALUs, each being one of the cores 522 (for example, FracTLcores® available from Cornami). Each of the cores 522 has associated SRAM supporting multiple simultaneous integer and floating point computations of up to 128 bits. The configuration 520 thus is a set of cores that are configured as individual FracTLcores®. The configuration 530 includes one or more RISC cores 532 that are a set of sixteen cores in this example. The RISC core 532 can have additional individual or multiple cores 534 incorporated within it to accelerate specific RISC functions. Alternatively, the additional cores 534 may be designated for data path/arithmetic acceleration, enhancing ALU performance.
[0063] Thus, to implement a standard 64-bit RISC processor such as the RISC-V processor in this example, sixteen cores are configured to become the RISC-V. Optional additional cores may be added to the configuration to provide hardware acceleration of math operations performed by the RISC. For example, a normal RISC processor does not have hardware to perform a cosine function. Thus, an additional core may be added and configured to perform a hardware cosine operation. This enhances the instruction set (ISA) of the RISC processor by adding the hardware accelerated cosine function, which may be accessed by the RISC processor. The configuration 540 has a set of cores that are configured into two individual groupings: cores configured as RISC processors 542 and cores configured as ALUs (e.g., FracTLcores®) 544.
[0064] In this example, a client may submit an encrypted query to the LLM architecture configured on an array of cores such as that in FIG. 2B. The results of the LLM are provided as an encrypted output that may be decrypted.
[0065] As will be explained, the example fractal core architecture applied to private LLMs solves the problems with existing hardware. This allows use of fully homomorphic encryption (FHE) approaches for maximum security. As explained above, private LLMs can be grouped into three levels based on the data format (plaintext or ciphertext) used for the input queries, weights, and output responses of an LLM. The superscript “C” denotes the ciphertext format in the discussion below.
[0066] FIG. 6A shows the processing steps of a Level 1 Private LLM process 600. The process 600 includes a client input stage 612, a server executing an LLM 614, and a client output stage 616. A mathematical representation 620 and the process 600 show the following. The Client first encrypts their data X (plaintext) to the ciphertext Xc by using an algorithm that supports fully homomorphic encryption (FHE), such as CKKS or TFHE, where “Sk” and “Pk” denote the secret keys and public keys owned by the Client (612). The Client 612 sends the encrypted data Xc to the server 614, which owns the general LLM parameters in plaintext format (W). The Server 614 performs the LLM operation with Xc as inputs of the general LLM as shown in the mathematical representation 620. More specifically, the example array of cores based chip in FIGs. 2A-2B first receives the ciphertext Xc from the client 612 via the data link interface and also reads the plaintext weights (W) of the LLM from memory, which could be SRAM, cache, or DRAM. The chip 100 performs all the matrix multiplications and nonlinear functions as shown in the mathematical representation 620. The server 614 then sends the generated output Yc to the client via the data-link interface. The generated output equals F(W, Xc) and should also equal (F(W, X))c, that is, the ciphertext corresponding to the plaintext F(W, X). By using the secret keys “Sk”, the client 612 decrypts the received ciphertext Yc to finally obtain the desired result, which equals F(W, X).
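The end-to-end data flow of the Level 1 process can be traced with the structural sketch below. The Cipher wrapper is not encryption; it only marks which values would be CKKS/TFHE ciphertexts in the real flow, and the single linear operation standing in for F(W, X) is an assumption made for brevity.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Cipher:
    """Stand-in marker for a ciphertext; NOT actual encryption."""
    value: np.ndarray

def client_encrypt(x):            # client side: X -> Xc (with Pk/Sk in a real system)
    return Cipher(x.copy())

def server_llm(w, x_cipher):      # server side: F(W, Xc); W stays in plaintext
    return Cipher(x_cipher.value @ w)   # would be a homomorphic evaluation in reality

def client_decrypt(y_cipher):     # client side: Yc -> Y (with Sk in a real system)
    return y_cipher.value

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))   # plaintext general weights
X = rng.standard_normal((1, 4))   # client query

Yc = server_llm(W, client_encrypt(X))   # the server only handles Xc and Yc
Y = client_decrypt(Yc)
assert np.allclose(Y, X @ W)            # equals F(W, X), matching representation 620
```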
[0067] FIG. 6B shows the processing steps of a Level 2 Private LLM 630. The process 630 includes a client input stage 632, a server executing an LLM 634, and a client output stage 636. A mathematical representation 640 and the process 630 show the following processing steps of the Level 2 Private LLM in FIG. 6B. The Client first encrypts their
data X (plaintext) to the ciphertext Xc with FHE algorithms using the keypair “Sk” and “Pk”. The Client sends the encrypted data Xc to the server 634. The operator of the server 634 owns the general LLM parameters and fine-tuning parameters, which are both still in plaintext format. The server 634 performs the fine-tuning LLM operation with Xc being inputs of the Fine-Tuned LLM as shown in FIG. 6B. In this stage, the chip 100 in FIGs. 2A-2B first receives the ciphertext Xc from the Client via the data link interface and also reads both the plaintext weights W of the general LLM and the plaintext weights ΔW of the fine-tuned LLM from a memory. The chip 100 then performs all the matrix multiplications and nonlinear functions as shown in the second part of FIG. 6B. The Server sends the generated output Yc to the Client via a data-link interface.
[0068] The generated output equals F(W, Xc) + F(ΔW, Xc), and should also equal (F(W, X) + F(ΔW, X))c, that is, the ciphertext corresponding to the desired output. Having received the output Yc, the Client uses the secret key “Sk” to decrypt the received ciphertext Yc and finally obtain the desired result Y.
[0069] In Level 3 Private LLMs, not only are inputs and outputs encrypted, but the fine-tuning weights are also encrypted by the Server side as shown. FIG. 6C shows the processing steps of a Level 3 Private LLM 650. The process 650 includes a client input stage 652, a server executing an LLM 654, and a client output stage 656. A mathematical representation 660 and the process 650 show the following processing steps for the Level 3 Private LLM. The Client first encrypts their data X (plaintext) to the ciphertext Xc with FHE algorithms using the keypair “Sk” and “Pk.” The Client sends the encrypted data Xc to the server 654. The server operator owns the general LLM parameters in the format of plaintext and the fine-tuning parameters in the format of ciphertext encrypted by the server itself. The ciphertext version (ΔW)c of the fine-tuning weights ΔW can be pre-determined and stored in memory by using the keypair “SSk” and “SPk” as shown in FIG. 6C. The Server performs the fine-tuning private LLM operation with Xc being inputs as shown in FIG. 6C. Like the Level 2 private LLM in FIG. 6B, the chip 100 first receives the ciphertext Xc from the Client via the data link interface. The chip 100 also reads the plaintext weights W of the general LLM and the ciphertext weights (ΔW)c of the fine-tuned LLM from memory. The chip 100 then performs all the matrix multiplications and nonlinear functions as shown in the second part of FIG. 6C. The Server 654 sends the generated output Yc and the public key “SPk” to the Client via the data-link interface. The generated output equals F(W, Xc) + F((ΔW)c, Xc), and should also equal (F(W, X) + F(ΔW, X))c, that is, the ciphertext corresponding to the desired output. The
Client decrypts the received ciphertext Yc to finally get the desired result Y by using the secret key “Sk” and public key “SPk” from the server.
[0070] In comparison with a general LLM and a fine-tuned LLM in plaintext format, the computational complexity and the corresponding hardware implementation cost and power consumption for a Private LLM are greatly increased, mainly because most matrix multiplications are performed on ciphertext. A large number of extra bootstrapping operations is required in order to limit the growth of noise incurred by FHE. FIG. 7 is a table 700 that presents a quantitative comparison of two kinds of representative implementation platforms (ASIC and GPU) in terms of the total cost and power consumption needed to generate the desired responses for a complete sequence of tokens using the parameter settings of GPT-3.
[0071] Since the cost and power consumption, even for exactly the same processing algorithm, vary greatly with many hardware factors such as the overall architecture, processing units, control and instruction units, data reads and writes (cache, DRAM, and SRAM), address generation, and process technologies, a new unified metric called the “hardware multiplier-equivalent (ME)” is used as a standard metric of computational and implementation complexity that takes all the above factors into account. For example, to perform a complex-valued multiplication, an ASIC platform would use at least 6 MEs, but a GPU platform would use at least 50 MEs because a data read for multiplication from DRAM in the GPU costs at least 5 times more than the multiplication operation itself.
[0072] The first column in the table 700 in FIG. 7 is a listing of five LLMs in rows 710, 712, 714, 716, and 718. The second and third columns of the table 700 in FIG. 7 indicate the number of trillion MEs (TME) needed by an ASIC and a GPU, respectively, to generate the desired output for a complete sequence of tokens in GPT-3. For the case of General LLM inference in the first row 710 as an example, 30.5 TME and 336 TME are required by the ASIC solution and the GPU solution, respectively. As shown in row 714, for the same inference task using a Level-1 Private LLM, 1,530 TME and 59,670 TME are required for the respective ASIC and GPU, which is more than 100 times that of the General LLM. As shown in row 718, for a Level-3 Private LLM, 175,950 TME and 6,719,333 TME are required for the respective ASIC and GPU, which is another roughly 100-fold increase. For further illustration, the fourth column of the table 700 shows the ratio of the five LLMs in the rows 710, 712, 714, 716, and 718 to the General LLM in the row 710 in terms of the number of GPUs. More specifically, if the total number of GPUs needed to perform the General LLM is taken as unity, 178 GPUs, 190 GPUs, and 19,998 GPUs are required for implementing Level-1, Level-2, and Level-3 Private LLMs, respectively, in rows 714, 716, and 718. This is impractical, and effectively impossible for a Level-3 Private LLM.
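The GPU-count ratios in the fourth column follow from dividing each GPU TME figure by the 336 TME General-LLM baseline. The short check below reproduces the two ratios that are stated explicitly above; the rounding convention is an assumption.

```python
# Reproduce the GPU-count ratios quoted from table 700 (values in trillion MEs).
general_llm_gpu_tme = 336
level1_gpu_tme = 59_670
level3_gpu_tme = 6_719_333

print(round(level1_gpu_tme / general_llm_gpu_tme))   # ~178 GPUs for a Level-1 Private LLM
print(round(level3_gpu_tme / general_llm_gpu_tme))   # ~19,998 GPUs for a Level-3 Private LLM
```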
[0073] The example reconfigurable core architecture allows practical implementation of the different level private LLMs. As described above, the array of cores architecture in FIG. 2B and FIG. 3B combines aspects of dataflow and reconfigurable computing to stream data through a computational fabric architecture with highly functional computational elements that can dynamically scale over many chips. The computational fabric is represented by one or many custom ASIC chip(s) residing in one or multiple PCIe cards within one or multiple host servers. Each host server has an x86 processor(s) running Linux as an interface to the computational fabric. The custom ASICs have several key functional components that are linked by the following three types of core communication mechanisms. The first communication mechanism is adjacent core-to-core communication in the array 130, which is one core communicating with a physically adjacent core as laid out on the silicon substrate. Adjacent core communication is the most efficient inter-core communication mechanism and takes place via the North, South, East, or West core interfaces. The second communication mechanism is a Network-On-Chip (NOC), which generalizes core-to-core communication to cores that are not side-by-side on the same chip or that reside on different chips as shown in FIG. 3B. The third communication mechanism is a PCIe link for intra-system communications between the host and PCIe boards.
[0074] This reconfigurable core array computing architecture allows different functions to be defined by dynamically changing the topological linkages of processing cores within a computational fabric to achieve superior silicon utilization in terms of application performance, throughput, power consumption, and processing latency. The computational fabric significantly reduces the dependence on memory to store intermediate computational results and exceeds the flexibility and programmability of an FPGA, DSP, or GPU while still providing near ASIC level solution performance.
[0075] The example reconfigurable core architecture in FIG. 2B and FIG. 3B is a very powerful hardware computing platform to perform extensive matrix multiplications required in executing General LLM and Private LLM with near zero programming complexity. The computational complexity involved in a Level 3 Private-LLM is still too high even for the example core based architecture as shown in FIG. 7. This is because all the computations related to fine-tuning weights in the Level-3 Private LLM need to be operated in ciphertext format. The reduction of the computational complexity for a Level 3 Private LLM can be
accomplished by using a low-rank adaption (LoRA) algorithm. The use of the LoRA concept in a Level 3 private LLM can reduce the size of the fine-tuning weights ΔW.
[0076] Low-Rank Adaptation of Large Language Models (LoRA) is a specific technique in the field of natural language processing for reducing the computational requirements of large models during fine-tuning and inference. LoRA leverages low-rank factorization techniques to approximate the weight matrices in these models, reducing their memory and computational footprint without significant loss in performance.
[0077] A step-by-step description of how LoRA works in terms of training and inference is as follows. A large pre-trained language model such as GPT-3 is used as a starting point. These models have a massive number of parameters and require substantial computational resources for their training. Target layers are selected within the model for low-rank adaptation. These layers are typically chosen based on their computational intensity and importance to the target task.
[0078] A technique called “low-rank factorization” is applied to these layers, which simplifies the model's calculations by using far fewer parameters and thus reduces model complexity. The simplified model is fine-tuned by training the model on a specific task using these reduced parameters. This fine-tuning process tailors the model to the task. The adapted model may be used for specific inference applications as it is more efficient and memory-friendly for the specific application. The adapted model is thus suitable for real-world use in applications like chatbots, translation, or text generation.
[0079] The general process starts by checking the task performance of the simplified model. This ensures that the simplified model still performs well for the target task. In many cases, the simplified model may achieve a good balance between efficiency and accuracy.
[0080] The LoRA process uses several parameters such as a matrix scaling factor α and a matrix rank factor r. These determine how large the LoRA matrices will be in terms of their dimensions and values. An example rule of thumb for alpha, α = 2r, has been shown to be effective in some experiments. There are sometimes tradeoffs between memory usage and training time.
[0081] FIG. 8 is a matrix diagram 800 of an example LoRA decomposition. A weight matrix of size d × d can be decomposed into the multiplication of a low-rank matrix A (d × r) 812 and another low-rank matrix B (r × d) 814. Although both W and ΔW of FIG. 6C have the same dimensions, ΔW (d × d) in an example solution matrix A × B 816 can be generated by the two much lower-rank matrices 812 and 814 (A and B) as shown in FIG. 8, where the rank r could be as small as 10, roughly 1000 times less than d. This allows the LoRA based fine-tuning weights ΔW (d × d) to be generated by, and replaced with, the two much lower-rank matrices A and B. Thus, ΔW = A × B as shown in FIG. 8. This means that the total size of the weights becomes (2r × d) in LoRA instead of (d × d). Because the rank r could be as small as 10, roughly 1000 times less than d, the total weight size in LoRA is roughly 500 times less than that of the original ΔW, and the computational resources required may be greatly reduced.
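The decomposition and the parameter-count reduction can be checked with the short sketch below. The small dimensions used to build the factors, and the GPT-3-like width d = 12,288 used for the ratio, are assumptions for illustration; the disclosure only states that r can be on the order of 1000 times smaller than d.

```python
import numpy as np

# Toy dimensions for the decomposition itself (kept small so the arrays stay tiny).
d_small, r_small = 64, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((d_small, r_small))   # low-rank factor A (d x r)
B = rng.standard_normal((r_small, d_small))   # low-rank factor B (r x d)
delta_W = A @ B                               # reconstructed fine-tuning update (d x d)
assert np.linalg.matrix_rank(delta_W) <= r_small

# Parameter-count comparison at an assumed GPT-3-like width.
d, r = 12_288, 10
print((d * d) / (2 * r * d))   # ~614x fewer stored values, on the order of the ~500x above
```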
[0082] With this decomposition, the fine-tuning weights may be obtained in the ciphertext domain and may be used to perform the desired LLM processing on ciphertexts. Since the dimensions of the matrices 812 and 814 (A and B) are much smaller, the corresponding computational complexity for ciphertext would be as low as that for plaintexts, which means a reduction of roughly 100 times can be achieved by the example LoRA based solution.
[0083] With this decomposition, instead of encrypting the weights ΔW, which have the same dimensions and size as W, the small matrices A and B may first be encrypted, and their encrypted versions may then be used in place of (ΔW)c to perform the computations in the ciphertext domain as required in all the processing steps of the Level 3 Private LLM shown in FIG. 6C. Since the dimensions of the matrices A and B are much smaller, the corresponding computational complexity for ciphertext would be as low as that for plaintexts, which means a reduction of roughly 100 times can be achieved by the LoRA based algorithm.
[0084] FIG. 9 further illustrates the data flow of the inference stage of a LoRA based Level-3 Private LLM executed by a server 910 that communicates with a client 912. The client 912 generates a user prompt 920 that receives a query represented by data X. The query Q (data X) is encoded to Q’ (922). Using the keypair “Sk” and “Pk”, the client 912 first encrypts their data X (plaintext) to the ciphertext Xc with FHE algorithms (924).
[0085] The client 912 then sends the encrypted data Xc as a ciphertext input 932 to the server 910, which holds an LLM 930 and general LLM parameters in the format of plaintext 934 and the fine-tuning weight parameters ΔW 936 in the format of ciphertext encrypted by the server itself. The ciphertext version (ΔW)c of the fine-tuning weights ΔW 936 can be generated and replaced by the LoRA decomposition ΔW = A × B, by using the keypair “SSk” and “SPk” owned by the server 910.
[0086] There are two options for this replacement, namely, (ΔW)c = (A × B)c and (ΔW)c = (Ac × Bc). The former, (ΔW)c = (A × B)c, first performs the multiplication and then the encryption. The latter, (ΔW)c = (Ac × Bc), first performs the encryption and then the multiplication.
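The two orderings can be contrasted with the structural sketch below. As in the Level 1 sketch above, the Cipher wrapper is not encryption and the homomorphic matrix multiply is simulated on the wrapped values; the sketch only shows that both orderings yield a ciphertext representing the same ΔW.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Cipher:
    """Stand-in marker for a ciphertext; NOT actual encryption."""
    value: np.ndarray

def enc(m):                # would be server-side encryption under (SSk, SPk)
    return Cipher(m.copy())

def he_matmul(ca, cb):     # would be a homomorphic matrix multiplication
    return Cipher(ca.value @ cb.value)

rng = np.random.default_rng(2)
A, B = rng.standard_normal((6, 2)), rng.standard_normal((2, 6))

option1 = enc(A @ B)                 # multiply in plaintext first, then encrypt: (A x B)c
option2 = he_matmul(enc(A), enc(B))  # encrypt the small factors first, then multiply: (Ac x Bc)

assert np.allclose(option1.value, option2.value)   # both represent (dW)c
```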
[0087] The processing unit, which may be the chip 100 in FIG. 2A, in the server 910 performs the LoRA based fine-tuned LLM operation with Xc being inputs as shown in FIG. 9. Like the process shown in FIG. 6C, the server 910 first receives the ciphertext Xc from the client input (932) via the data link interface and also reads the plaintext weights W (934) of the general LLM and the ciphertext weights (ΔW)c of the fine-tuned LLM (936) from memory. In this example, the ciphertext weights are broken down into the two matrices 938 and 940 in the process described in FIG. 8. The chip 100 then performs all the matrix multiplications and nonlinear functions as shown in FIG. 6C and FIG. 9. The resulting outputs from both the plaintext weights and the ciphertext weights are added to produce a generated output (942). The Server sends the generated output Yc and the public key “SPk” to the Client via the data-link interface. The generated output equals F(W, Xc) + F((A × B)c, Xc) or F(W, Xc) + F(Ac × Bc, Xc), and should also equal (F(W, X) + F(A × B, X))c, that is, the ciphertext corresponding to the desired output.
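The combination at block 942 can be followed in plaintext shapes with the short sketch below; in the actual Level 3 flow the query, the LoRA branch, and the output are ciphertexts, and the dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 4
W = rng.standard_normal((d, d))   # general weights in plaintext (934)
A = rng.standard_normal((d, r))   # LoRA factor (938)
B = rng.standard_normal((r, d))   # LoRA factor (940)
X = rng.standard_normal((1, d))   # query

# General-weight branch and LoRA fine-tuning branch evaluated separately, then summed (942).
Y = X @ W + (X @ A) @ B
assert np.allclose(Y, X @ (W + A @ B))   # matches the fine-tuned model F(W + dW, X)
# (X @ A) @ B only ever touches the (d x r) and (r x d) factors, which is the
# source of the complexity reduction when the factors are handled as ciphertext.
```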
[0088] The client 912 receives the ciphertext and the public key (950). The client 912 decrypts the received ciphertext Yc to finally get the desired result (952) Y by using the secret key “Sk” and the public key “SPk” from the server 910.
[0089] Using the same metric unit (TME) and the same ratio definition as those in the table in FIG. 7, FIG. 10 shows a table 1000 that is a quantitative comparison of a known GPU platform to the example array of cores platform in terms of the total cost and power consumption needed to generate the desired responses for a complete sequence of tokens using the parameter settings for GPT-3. It can be seen from the fifth column of the table 1000 that the example array of cores platform costs only about one tenth of what the GPU platform costs for performing the same General LLM inference task of FIG. 1B. For performing the Level-1 Private LLM inference task of FIG. 6A, the GPU platform costs 179 times but the example platform costs only 3.03 times. For performing the Level-2 Private LLM inference task of FIG. 6B, the GPU platform costs 190 times but the example platform costs only 3.19 times. For performing the Level-3 Private LLM inference task of FIG. 6C without using LoRA, the GPU platform costs 19,998 times but the example platform costs only 330 times. As shown in the last row, for performing the Level-3 Private LLM inference task using the LoRA based algorithm, the GPU platform costs 196 times but the example platform costs only 3.51 times, which suggests that the GPU platform is still not practical. Instead, the example array of cores platform
can serve as a feasible and practical solution for the deployment of all these three levels of Private LLMs into real-world applications.
[0090] The above shows that the example array of cores architecture can effectively implement all three levels of Private LLM technology so as to meet the increasing demands for privacy and security related to generative artificial intelligence.
[0091] The terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof, are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
[0092] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. Furthermore, terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[0093] While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Numerous changes to the disclosed embodiments can be made in accordance with the disclosure herein, without departing from the spirit or scope of the invention. Thus, the breadth and scope of the present invention should not be limited by any of the above described embodiments. Rather, the scope of the invention should be defined in accordance with the following claims and their equivalents.
[0094] Although the invention has been illustrated and described with respect to one or more implementations, equivalent alterations, and modifications will occur or be known to others skilled in the art upon the reading and understanding of this specification and the annexed drawings. In addition, while a particular feature of the invention may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application.
Claims
1. A system to output a response to a query, the system comprising:
an array of processing cores arranged in a grid allowing each processing core to communicate directly to a neighboring processing core;
an interconnection network coupled to each of the processing cores allowing communication between the processing cores;
a first set of processing cores of the array of processing cores configured to receive an encrypted query; and
a second set of processing cores of the array of processing cores configured to:
input the encrypted query to a large language model;
execute the large language model having general weights in plaintext; and
provide an encrypted output of the large language model.
2. The system of claim 1, further comprising a third set of processing cores of the array of processing cores configured to decrypt the output of the large language model.
3. The system of any one of claims 1-2, further comprising a third set of processing cores of the array of processing cores configured as a fine-tuned foundation layer of the large language model that accepts the encrypted query, wherein the fine-tuned foundation layer includes a matrix of proprietary weights.
4. The system of claim 3, wherein the matrix of proprietary weights is in plaintext.
5. The system of claim 3, further comprising a fourth set of processing cores of the array of processing cores configured to encrypt the matrix of proprietary weights.
6. The system of claim 5, wherein the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaption (LoRA) algorithm.
7. The system of any one of claims 3-6, wherein the large language model has a plurality of layers including the foundation layer, wherein the other layers of the plurality of layers apply the general weights.
8. The system of any one of claims 1-7, wherein the first and second set of processing cores include at least one of a Reduced Instruction Set Computing (RISC) processor core, a special purpose processing core; or a RISC processor core with a set of special purpose processing cores embedded within.
9. The system of any one of claims 1-8, wherein the first and second sets of processing cores are in an array of RISC processor cores interconnected with an array of special purpose processing cores.
10. The system of any one of claims 1-9, wherein the encryption is performed via Fully Homomorphic Encryption (FHE).
11. The system of any one of claims 1-10, wherein the encrypted query is encrypted from a plaintext query input to an external device that transmits the encrypted query and evaluation keys for evaluation by the first processing core.
12. The system of any one of claims 1-11, further comprising a third set of processing cores configured to simultaneously process both plaintext and homomorphic ciphertext.
13. The system of any one of claims 1-12, wherein the second set of processing cores is configured to perform an encrypted summation.
14. The system of any one of claims 1-13, further comprising a computational fabric having a plurality of individual integrated circuits, wherein the array of processing cores is on at least one of the plurality of individual integrated circuits.
15. The system of claim 14, wherein the computational fabric allows communication between each of the plurality of individual integrated circuits.
16. An array of cores on an integrated circuit die comprising:
an interconnection network coupled to each of the processing cores allowing communication between the processing cores;
a first processing core or cores of the array of processing cores is configured to receive an encrypted query; and
a second processing core or cores of the array of processing cores is configured to:
input the encrypted query to a large language model;
execute the large language model having general weights in plaintext; and
provide an encrypted output of the large language model.
17. The array of cores of claim 16, further comprising a third processing core or cores of the array of processing cores configured to decrypt the output of the large language model.
18. The array of cores of any one of claims 16-17, further comprising: a third processing core or cores of the array of processing cores configured as a finetuned foundation layer of the large language model that accepts the encrypted query, wherein the fine-tuned foundation layer includes a matrix of proprietary weights; and a fourth processing core or cores of the array of processing cores configured to decrypt the output of the large language model.
19. The array of cores of claim 18, wherein the matrix of proprietary weights is in plaintext.
20. The array of cores of any one of claims 17-19, further comprising a fourth processing core or cores of the array of processing cores configured to encrypt the matrix of proprietary weights.
21. The array of cores of claim 20, wherein the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaption (LoRA) algorithm.
22. The array of cores of claim 18, wherein the large language model has a plurality of layers including the foundation layer, wherein the other layers of the plurality of layers apply the general weights.
23. The array of cores of any one of claims 16-22, wherein the first and second processing cores include at least one of a Reduced Instruction Set Computing (RISC) processor core, a special purpose processing core; or a RISC processor core with a set of special purpose processing cores embedded within.
24. The array of cores of any one of claims 16-23, wherein the first and second processing cores are in an array of RISC processor cores interconnected with an array of special purpose processing cores.
25. The array of cores of any one of claims 16-24, wherein the encryption is performed via Fully Homomorphic Encryption (FHE).
26. The array of cores of any one of claims 16-25, wherein the encrypted query is encrypted from a plaintext query input to an external device that transmits the encrypted query and evaluation keys for evaluation by the first processing core.
27. The array of cores of any one of claims 16-26, further comprising a third processing core or cores configured to simultaneously process both plaintext and homomorphic ciphertext.
28. The array of cores of any one of claims 16-27, wherein the second processing core or cores is configured to perform an encrypted summation.
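Claims 21 and 34 state that the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaptation (LoRA) algorithm. The following sketch works through that arithmetic under assumed dimensions (model width d = 1024, rank r = 16); the variable names and the decision to count raw parameter values are illustrative only, not taken from the application.

```python
# Minimal sketch of a LoRA-style reduction: rather than protecting a full d x d
# matrix of proprietary weights, only the two low-rank factors need to be
# encrypted. Dimensions and the rank r = 16 are assumed for illustration.
import numpy as np

d, r = 1024, 16
rng = np.random.default_rng(1)

W0 = rng.normal(size=(d, d)) / np.sqrt(d)   # general weights, kept in plaintext
A  = rng.normal(size=(r, d)) * 0.01         # proprietary LoRA factor (r x d)
B  = rng.normal(size=(d, r)) * 0.01         # proprietary LoRA factor (d x r)
x  = rng.normal(size=d)

# Forward pass: the bulk of the computation uses the plaintext general weights;
# only the small A/B factors carry proprietary information and need protection.
y = W0 @ x + B @ (A @ x)

full_params = d * d            # values to protect if the whole update were encrypted
lora_params = A.size + B.size  # values to protect under the low-rank factorization
print(f"reduction in values to encrypt: {full_params / lora_params:.0f}x")  # 32x here
```

At d = 1024 and r = 16 the factorization shrinks the protected material from d^2 = 1,048,576 values to 2dr = 32,768 values, a 32x reduction; larger widths or smaller ranks increase the ratio accordingly.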
29. A method of configuring an array of programmable cores including a plurality of programmable cores coupled via an interconnection network, the method comprising: configuring a first processing core or cores of the array of processing cores to receive an encrypted query; and configuring a second processing core or cores of the array of processing cores to: input the encrypted query to a large language model; execute the large language model having general weights in plaintext; and provide an encrypted output of the large language model.
30. The method of claim 29, further comprising configuring a third processing core or cores of the array of processing cores to decrypt the output of the large language model.
31. The method of any one of claims 29-30, further comprising configuring a third processing core or cores of the array of processing cores as a fine-tuned foundation layer of the large language model that accepts the encrypted query, wherein the fine-tuned foundation layer includes a matrix of proprietary weights.
32. The method of claim 31, wherein the matrix of proprietary weights is in plaintext.
33. The method of any one of claims 31-32, further comprising configuring a fourth processing core or cores of the array of processing cores to encrypt the matrix of proprietary weights.
34. The method of claim 33, wherein the encryption of the matrix of proprietary weights is reduced by applying a low-rank adaptation (LoRA) algorithm.
35. The method of any one of claims 31-34, wherein the large language model has a plurality of layers including the foundation layer, wherein the other layers of the plurality of layers apply the general weights.
36. The method of any one of claims 29-35, wherein the first and second processing cores include at least one of a Reduced Instruction Set Computing (RISC) processor core, a special purpose processing core, or a RISC processor core with a set of special purpose processing cores embedded within.
37. The method of any one of claims 29-36, wherein the first and second processing cores are in an array of RISC processor cores interconnected with an array of special purpose processing cores.
38. The method of any one of claims 29-37, wherein the encryption is performed via Fully Homomorphic Encryption (FHE).
39. The method of any one of claims 29-38, wherein the encrypted query is encrypted from a plaintext query input to an external device that transmits the encrypted query and evaluation keys for evaluation by the first processing core.
40. The method of any one of claims 29-39, further comprising configuring a fifth processing core or cores to simultaneously process both plaintext and homomorphic ciphertext.
41. The method of any one of claims 29-40, further comprising configuring the second processing core or cores to perform an encrypted summation.
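Claims 29-41 recite configuring individual cores of an array to take on distinct roles: receiving the encrypted query, hosting a fine-tuned foundation layer, executing the remaining layers with general weights in plaintext, performing an encrypted summation, and optionally decrypting the output. The sketch below shows one way such a role assignment could be expressed in software; CoreRole, CoreArrayConfig, and the 4 x 4 layout are hypothetical constructs for illustration, not the application's method.

```python
# Illustrative configuration sketch for an array of programmable cores: roles are
# assigned to grid positions and ordered into a pipeline over an assumed
# interconnection network. All names here are hypothetical.
from dataclasses import dataclass, field
from enum import Enum, auto

class CoreRole(Enum):
    RECEIVE_ENCRYPTED_QUERY = auto()   # first processing core or cores
    FOUNDATION_LAYER        = auto()   # fine-tuned layer holding proprietary weights
    GENERAL_LLM_LAYERS      = auto()   # layers applying general weights in plaintext
    ENCRYPTED_SUMMATION     = auto()   # encrypted reduction of partial results
    DECRYPT_OUTPUT          = auto()   # optional, when the device holds the key

@dataclass
class CoreArrayConfig:
    rows: int
    cols: int
    roles: dict = field(default_factory=dict)   # (row, col) -> CoreRole

    def assign(self, row, col, role):
        self.roles[(row, col)] = role

    def pipeline(self):
        # Order the configured cores into processing stages by role.
        stage_order = list(CoreRole)
        return sorted(self.roles.items(), key=lambda kv: stage_order.index(kv[1]))

# Example: a 4 x 4 array with one ingress core, one foundation-layer core,
# a column of general-weight cores, and one summation core.
cfg = CoreArrayConfig(rows=4, cols=4)
cfg.assign(0, 0, CoreRole.RECEIVE_ENCRYPTED_QUERY)
cfg.assign(0, 1, CoreRole.FOUNDATION_LAYER)
for row in range(4):
    cfg.assign(row, 2, CoreRole.GENERAL_LLM_LAYERS)
cfg.assign(3, 3, CoreRole.ENCRYPTED_SUMMATION)

for (row, col), role in cfg.pipeline():
    print(f"core ({row},{col}) -> {role.name}")
```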
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363597561P | 2023-11-09 | 2023-11-09 | |
| US63/597,561 | 2023-11-09 | 2023-11-09 | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025101998A1 (en) | 2025-05-15 |
Family
ID=95696675
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/055260 (published as WO2025101998A1, pending) | Fractal core architecture system for implementing efficient private large language models | 2023-11-09 | 2024-11-08 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025101998A1 (en) |
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20160350648A1 (en) * | 2014-11-07 | 2016-12-01 | Microsoft Technology Licensing, Llc. | Neural networks for encrypted data |
| US20200349435A1 (en) * | 2016-06-22 | 2020-11-05 | Massachusetts Institute Of Technology | Secure Training of Multi-Party Deep Neural Network |
| US20190294805A1 (en) * | 2018-03-22 | 2019-09-26 | Via Science, Inc. | Neural-network training using secure data processing |
| US20200042856A1 (en) * | 2018-07-31 | 2020-02-06 | International Business Machines Corporation | Scheduler for mapping neural networks onto an array of neural cores in an inference processing unit |
| US20220383126A1 (en) * | 2021-05-19 | 2022-12-01 | Microsoft Technology Licensing, Llc | Low-Rank Adaptation of Neural Network Models |
| US20220414223A1 (en) * | 2021-06-29 | 2022-12-29 | EMC IP Holding Company LLC | Training data protection for artificial intelligence model in partitioned execution environment |
Non-Patent Citations (1)
| Title |
|---|
| XUANQI LIU; ZHUOTAO LIU: "LLMs Can Understand Encrypted Prompt: Towards Privacy-Computing Friendly Transformers", arXiv.org, Cornell University Library, 28 May 2023 (2023-05-28), XP091522789 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24889739; Country of ref document: EP; Kind code of ref document: A1 |