
WO2022118236A1 - Closely coupled hardware acceleration in a multi-processors environment - Google Patents


Info

Publication number
WO2022118236A1
Authority
WO
WIPO (PCT)
Prior art keywords
workers
master
algorithm
worker
electronic circuitry
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/IB2021/061227
Other languages
French (fr)
Inventor
Shaul Wilenski
Amit CEDERBAUM
Avey Gabrielli
Moshe Dan AZOGUI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jet Blockchain Inc
Original Assignee
Jet Blockchain Inc
Application filed by Jet Blockchain Inc
Publication of WO2022118236A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 2209/00 — Indexing scheme relating to G06F 9/00
    • G06F 2209/50 — Indexing scheme relating to G06F 9/50
    • G06F 2209/5017 — Task decomposition
    • G06F 2209/509 — Offload

Definitions

  • The present invention, in some embodiments thereof, relates to enhancing performance of computation intensive algorithms, and, more specifically, but not exclusively, to enhancing performance of computation intensive algorithms using a multi-processing environment integrated with hardware accelerators optimized for the computation intensive algorithms.
  • One such architecture is vector processing in which a plurality of processing pipelines may be utilized in parallel to process data simultaneously in an attempt to reduce computation time.
  • Another approach may be based on utilization of hardware elements configured to conduct computations, for example, arithmetic and/or logic operation at gate-level with little and/or no software intervention.
  • Some platforms aim to further combine these two architectures to provide multiple processing pipelines each supported at least partially by hardware acceleration elements.
  • an integrated electronic circuitry integrating hardware accelerators in a multi-processing environment for enhancing performance of computation intensive algorithms, comprising a plurality of worker processing units each coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker, and a master processing unit coupled to the plurality of workers.
  • the master is configured to:
  • Each of the plurality of partial outcomes is computed for a respective one of the plurality of data segments according to the respective algorithm by a respective one of the plurality of workers using its respective selected hardware accelerator and its respective memory.
  • Output a combined outcome aggregating the plurality of partial outcomes.
  • a method of using hardware accelerators integrated in a multi-processing environment to enhance performance of computation intensive algorithms comprising using an electronic circuitry integrating a plurality of worker processing units and a master processing unit coupled to the plurality of workers, each of the plurality of workers is coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker, the master is configured for:
  • Each of the plurality of partial outcomes is computed for a respective one of the plurality of data segments according to the respective algorithm by a respective one of the plurality of workers using its respective selected hardware accelerator and its respective memory.
  • the master is coupled to a respective memory exclusively associated with the master for storing program instructions and data.
  • the respective memory coupled with each of the workers comprises:
  • a respective Instruction Closely Coupled Memory (ICCM), and
  • a respective Data Closely Coupled Memory (DCCM).
  • the plurality of processing units are configured to access a common memory shared by the plurality of processing units.
  • each of the hardware accelerators of the set is optimized for executing one or more of the plurality of algorithms.
  • the instruction of the master to each of the plurality of workers further comprising one or more configuration parameters relating to the utilization of the respective selected hardware accelerator.
  • the instruction of the master to the plurality of workers further comprising instructing at least some of the plurality of workers to each utilize its respective selected hardware accelerator according to one or more respective configuration parameters.
  • At least some of the plurality of data segments distributed to at least some of the plurality of workers are similar.
  • the plurality of algorithms comprise one or more computation intensive algorithms applied for computing a hash value for a block recording transactions of an altcoin cryptocurrency in a blockchain.
  • one or more of the computation intensive algorithms implement the Ethash algorithm for computing the hash value.
  • one or more of the computation intensive algorithms implement the ProgPOW algorithm for computing the hash value.
  • one or more of the computation intensive algorithms implement the Equihash algorithm for computing the hash value.
  • one or more of the computation intensive algorithms implement the RandomX algorithm for computing the hash value.
  • the plurality of algorithms comprise one or more encryption algorithms.
  • the plurality of algorithms comprise one or more High Performance Computing (HPC) algorithms.
  • the plurality of algorithms comprise one or more algorithms applied for implementing a Machine Learning (ML) model.
  • the plurality of algorithms comprise one or more algorithms applied for training the ML model.
  • the master is further configured to conduct a plurality of iterations for computing the combined outcome.
  • the master is configured to perform the following in each of the plurality of iterations:
  • the master constructs an overall outcome for the input data by aggregating the respective combined outcomes computed in the plurality of iterations.
  • each of the plurality of processing units is implemented by an ARC Intellectual Property (IP) core.
  • each of the hardware accelerators of each respective set is coupled to the respective worker using ARC Processor Extension (APEX) technology.
  • Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks automatically. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
  • a data processor such as a computing platform for executing a plurality of instructions.
  • the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data.
  • a network connection is provided as well.
  • a display and/or a user input device such as a keyboard or mouse are optionally provided as well.
  • FIG. 1 is a schematic illustration of an exemplary integrated circuit integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention
  • FIG. 2 is a flowchart of exemplary processes executed by a master processing unit and a plurality of worker processing units of an integrated device to enhance performance of computation intensive algorithms, according to some embodiments of the present invention
  • FIG. 3 is a schematic illustration of building blocks of a Keccak algorithm
  • FIG. 4A and FIG. 4B are schematic illustrations of building blocks of a Blake2b algorithm
  • FIG. 5A and FIG. 5B are schematic illustrations of an Advanced Encryption Standard (AES) algorithm flow
  • FIG. 6 is a schematic illustration of an integrated device integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective hardware accelerator optimized for computing HASH values according to an Ethash algorithm, according to some embodiments of the present invention.
  • FIG. 7 is a schematic illustration of a flow of computing HASH values according to Equihash algorithm using an integrated circuit comprising a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention.
  • The present invention, in some embodiments thereof, relates to enhancing performance of computation intensive algorithms, and, more specifically, but not exclusively, to enhancing performance of computation intensive algorithms using a multi-processing environment integrated with hardware accelerators optimized for the computation intensive algorithms.
  • integrated circuits and methods for enhancing performance of computation intensive algorithms through utilization of an integrated electronic circuit implementing a multi-processing environment consisting of a plurality of processing units each exclusively coupled to local memory arrays and private hardware accelerators which are highly optimized for execution of the computation intensive algorithms.
  • the computation intensive algorithms may be directed to general High Performance Computing (HPC). However, one or more of the computation intensive algorithms may be used for more application specific computations, for example, cryptography and/or encryption algorithms, for example, Advanced Encryption Standard (AES), Keccak, Secure Hash Algorithm (e.g. SHA-2, SHA-3) and/or the like.
  • One or more of the computation intensive algorithms may be further directed to more specific applications, for example, data block hash computations for one or more blockchains, such as, for example, cryptocurrency blockchains (e.g. Ethereum, Zcash, Monero, etc.), smart contracts blockchains and/or the like.
  • Such hash intensive computation algorithms may include, for example, Keccak, Blake2b, Ethash, ProgPOW, Equihash, RandomX and/or the like.
  • The integrated circuit, for example, a component, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) and/or the like, may integrate a plurality of processing units deployed in a master-worker architecture comprising a master processing unit (designated master herein after) and a plurality of worker (slave) processing units (designated workers herein after).
  • the master and the workers which typically share similar architectures, may be integrated in the integrated circuit as Intellectual Property (IP) cores.
  • the master may execute one or more high level processes, applications, algorithms and/or the like (collectively designated high level application herein after) for managing the workers to execute one or more of the computation intensive algorithms.
  • the master may distribute and manage tasks relating to the computation intensive algorithm(s) which may be executed simultaneously (in parallel) by the workers thus significantly increasing computation performance of the computation intensive algorithm(s), in particular, reduce an execution time of the computation intensive algorithm(s).
  • The plurality of processing units may be typically connected to a global system memory shared by all processing units which may be integrated in the integrated circuit and/or external, for example, a Dynamic Random Access Memory (DRAM), specifically high speed DRAM, such as, for example, Double Data Rate (DDR) 4, DDR 5 and/or the like.
  • the global memory may be large in capacity in order to support storage of large data structures and/or extensive code segments but may typically be slow (high access time).
  • time sharing (arbitration) between the processing units must be applied thus further increasing the access time of each single processing unit to the global memory.
  • The master processing unit and each of the worker processing units may be coupled to one or more closely coupled memory arrays exclusively associated with the respective processing unit and thus available for use only by the respective processing unit.
  • These closely coupled memory arrays may include, for example, an Instruction Closely Coupled Memory (ICCM) for storing instructions (program code), a Data Closely Coupled Memory (DCCM) for storing data and/or the like.
  • Each of the processing units may further include one or more private cache memory arrays, for example, a data cache and/or an instruction cache.
  • the ICCM and DCCM may support high speed access (low access time) and may be configured to provide sufficient capacity for efficient program store (ICCM) and sufficient data storage space (DCCM) while maintaining low footprint in terms of electronic resources (typically referred to as gates).
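  • For illustration only, the following C sketch shows one way a worker's hot code and working data might be pinned to its private ICCM and DCCM when a GCC-style toolchain is used; the section names ".iccm" and ".dccm" are assumptions that would have to match the linker script of the actual integrated circuit and are not part of the described circuitry.

      /* Working buffers kept in the worker's private, low-latency DCCM
       * (section names are hypothetical and defined by the linker script). */
      #include <stdint.h>
      #include <stddef.h>

      #define SEGMENT_WORDS 256u

      static uint32_t segment_buf[SEGMENT_WORDS] __attribute__((section(".dccm")));
      static uint32_t partial_outcome[8]         __attribute__((section(".dccm")));

      /* Hot inner loop placed in the worker's private ICCM so instruction fetches
       * never compete with other processing units for the shared global memory. */
      __attribute__((section(".iccm")))
      void process_segment(size_t words)
      {
          uint32_t acc = 0;
          for (size_t i = 0; i < words && i < SEGMENT_WORDS; ++i)
              acc ^= segment_buf[i];      /* placeholder for the real algorithm */
          partial_outcome[0] = acc;
      }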
  • Each of the workers may be further coupled to a respective set of hardware accelerators.
  • Each hardware accelerator may be configured and optimized for executing and/or supporting execution of a respective one of the plurality of computation intensive algorithms.
  • Each of the hardware accelerators may include one or more functional modules implemented in gate-level (electronic resources) optimized for conducting the computations relating to a respective computation intensive algorithm via electronic resources with little and typically no software intervention.
  • Each hardware accelerator may be therefore designed and constructed to include hardware circuits, elements, blocks and/or modules customized and optimized for executing the respective algorithm.
  • the hardware accelerators may be integrated in the integrated circuit using one or more architectures, modes and/or implementations, for example, an IP core, a supplemental instruction set for the worker processing unit and/or the like.
  • the master executing the high level application may receive input data relating to a respective one of the computation intensive algorithms and may instruct each of the workers to prepare, for example, load, connect, assign, activate and/or the like their exclusively associated hardware accelerator optimized for the respective computation intensive algorithm.
  • the master may instruct the workers using one or more methods and/or implementations.
  • the master may write one or more code segments to the ICCM of one or more workers where the code segment(s) include instructions such that when executing the code segment(s), the worker(s) may apply the instructions.
  • the master may communicate with one or more of the workers via one or more communication and/or messaging channels established between the master and the workers.
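  • As a purely illustrative sketch of such a messaging channel (the field names, command codes and algorithm identifiers below are invented for illustration and are not defined by the description), the master may write a small command structure to an agreed location, for example at the start of a worker's DCCM, and the worker may poll it:

      #include <stdint.h>

      enum algo_id { ALGO_NONE = 0, ALGO_AES, ALGO_KECCAK, ALGO_ETHASH, ALGO_EQUIHASH };

      struct worker_mailbox {
          volatile uint32_t command;    /* 0 = idle, 1 = prepare accelerator, 2 = run */
          volatile uint32_t algorithm;  /* which accelerator of the set to prepare    */
          volatile uint32_t config;     /* optional configuration parameter           */
          volatile uint32_t ack;        /* worker copies 'command' here when done     */
      };

      /* Master side: instruct one worker to prepare the accelerator for 'algo'. */
      void master_instruct(struct worker_mailbox *mbox, enum algo_id algo, uint32_t cfg)
      {
          mbox->algorithm = (uint32_t)algo;
          mbox->config    = cfg;
          mbox->command   = 1u;                     /* "prepare accelerator" */
          while (mbox->ack != 1u) { /* wait for the worker to acknowledge */ }
      }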
  • the master may further segment the input data to a plurality of data segments and may distribute each of the plurality data segments to a respective one of the workers.
  • The plurality of workers, each using its respective closely coupled memory arrays and its respective hardware accelerator, may execute in parallel to simultaneously process the plurality of data segments according to the respective computation intensive algorithm and produce a plurality of partial outcomes each corresponding to a respective data segment.
  • the workers may process their respective data segments in a plurality of iterations as required by the respective computation intensive algorithm.
  • the plurality of workers may then transfer their partial outcomes to the master which may compute a combined outcome by aggregating the plurality of partial outcomes received from the workers.
  • the master may repeat the data segmentation and distribution in a plurality of iterations where in each iteration a portion of the input data may be processed and eventually all outcomes may be aggregated to produce an overall outcome.
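  • The overall master flow described above may be summarized, for illustration only, by the following C sketch; send_segment() and wait_partial_outcome() are hypothetical placeholders for whichever distribution and collection mechanism is used (DCCM writes, pointers or DMA transfers), and the XOR aggregation merely stands in for the aggregation rule of the actual algorithm:

      #include <stdint.h>
      #include <stddef.h>

      #define NUM_WORKERS 12u

      extern void send_segment(unsigned worker, const uint8_t *seg, size_t len);
      extern void wait_partial_outcome(unsigned worker, uint8_t out[32]);

      void master_session(const uint8_t *input, size_t len, uint8_t combined[32])
      {
          size_t   seg_len = (len + NUM_WORKERS - 1) / NUM_WORKERS;  /* segment the input */
          unsigned active  = 0;

          for (unsigned w = 0; w < NUM_WORKERS; ++w) {       /* distribute the segments */
              size_t off = (size_t)w * seg_len;
              if (off >= len)
                  break;
              size_t n = (len - off < seg_len) ? (len - off) : seg_len;
              send_segment(w, input + off, n);
              ++active;
          }

          for (unsigned i = 0; i < 32; ++i)
              combined[i] = 0;
          for (unsigned w = 0; w < active; ++w) {            /* collect and aggregate   */
              uint8_t partial[32];
              wait_partial_outcome(w, partial);
              for (unsigned i = 0; i < 32; ++i)
                  combined[i] ^= partial[i];                 /* placeholder aggregation */
          }
      }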
  • the integrated circuit may be further applied to train one or more Machine Learning (ML) models, for example, a neural network, a Support Vector Machine (SVM) and/or the like.
  • the master may transfer the ML model to at least some of the plurality of workers and may further transfer a respective one of the plurality of training datasets and/or a respective segment of the large training datasets to each of the workers.
  • the plurality of workers may operate in parallel to simultaneously train the ML model each with a respective training dataset such that the overall training time may be significantly reduced.
  • the integrated circuit specifically the master and the worker processing units may be implemented using one or more Central Processing Unit (CPU) models, hardware architectures and/or Instruction Set Architectures (ISA) as known in the art, for example, ARC® architecture by Synopsys®, Core® architecture by Intel®, Advanced RISC Machines (ARM) architecture and/or the like.
  • the processing architecture, platform and resources of the master and the workers may be selected to provide high performance computing combined with flexible operation modes and ability to efficiently connect to peripheral modules via high speed interconnection channels, buses and/or fabrics.
  • The ARC® HS Family of ultra-compact processing cores, which includes high performance, DSP-enhanced processors, is characterized by excellent code density, small footprint and very low power consumption, making these cores ideal for power-critical and area-sensitive embedded and deeply embedded applications.
  • the ARC® HS cores are equipped with high speed interfaces, ports and/or other provisions for connecting to multiple diverse and high speed memory and/or peripheral cores.
  • the ARC® HS cores further support ARC Processor Extension (APEX) technology for integration with additional optionally custom hardware modules, for example, the hardware accelerators customized and configured for efficient, high performance processing according to specific algorithms.
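  • For illustration only, worker code might invoke such an extension instruction roughly as sketched below; the mnemonic "accel_round", the build flag and the software fallback are invented for this sketch, since the actual instructions and intrinsics are defined by the specific APEX extension implementing each hardware accelerator:

      #include <stdint.h>

      static inline uint32_t accel_round(uint32_t state, uint32_t data)
      {
      #if defined(HAVE_ACCEL_EXTENSION)              /* hypothetical build option          */
          uint32_t r;
          __asm__ volatile("accel_round %0, %1, %2"  /* hypothetical extension instruction */
                           : "=r"(r) : "r"(state), "r"(data));
          return r;
      #else
          return state ^ data;                       /* software fallback for the sketch   */
      #endif
      }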
  • The multi-processing architecture integrating the highly optimized hardware accelerators may present significant advantages and benefits compared to currently existing architectures, devices, systems and methods for implementing computation intensive algorithms.
  • Some of the existing architectures may be based on one or more CPU models for executing the computation intensive algorithms in software. Since software execution is based on a very limited hardware register set of the CPU, such software implementations may be highly inefficient since the computation intensive algorithms may involve a plurality of arithmetic and/or logic operations over massive amounts of data which the limited hardware register set may be incapable of effectively handling.
  • the multi-processing architecture integrating the hardware acceleration may overcome this processing time limitation through the deployment of the plurality of worker processing units (processing pipelines) each utilizing its own respective hardware accelerator which is specifically customized and/or optimized for execution of the specific computation intensive algorithm currently executed.
  • Other existing architectures may be based on one or more hardware acceleration and/or vector processing architectures, for example, Graphic Processing Unit (GPU) and/or the like.
  • Such hardware acceleration architectures may typically be hard coded and thus very static, making them incapable of efficiently adapting to each specific computation intensive algorithm.
  • the existing devices and methods may deploy massive hardware acceleration comprising a plurality of processing pipelines which may be thus capable of executing the computation intensive algorithms but at the expense of an extremely high power consumption.
  • In contrast, each hardware accelerator specifically customized and/or optimized for efficiently executing the respective computation intensive algorithm may comprise only the hardware elements required (or essential) for the respective computation intensive algorithm and may thus significantly reduce the power consumption of the hardware accelerator and thus the overall power consumption of the integrated circuit.
  • Moreover, since each of the workers may be associated with a plurality of such hardware accelerators, each optimized for a respective computation intensive algorithm, the workers may select for each computation session the hardware accelerator most suitable for the currently executed computation intensive algorithm.
  • While a plurality of hardware accelerators are deployed in the integrated circuit, only one of these hardware accelerators may be actually put to use (activated) at any given time thus preventing redundant power consumption.
  • The master and the workers, each having a processing unit, may execute one or more code segments to provide this flexibility which may be lacking in existing solutions based on hardware acceleration alone.
  • Furthermore, allocating to each worker dedicated memory resources, for example, the ICCM and the DCCM, for private use by the respective worker independently and separately from all other workers may significantly increase memory availability and accessibility and/or reduce access latency for the workers to access their private memory array(s), thus further increasing the execution performance of the computation intensive algorithms.
  • aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer program code comprising computer readable program instructions embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire line, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • the computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • the computer readable program instructions for carrying out operations of the present invention may be written in any combination of one or more programming languages, such as, for example, assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a schematic illustration of an exemplary integrated circuit integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention.
  • An integrated electronic circuit 100 may integrate a master processing unit 102 (interchangeably designated master herein after) and a plurality of worker (slave) processing units 104 (interchangeably designated workers herein after), for example, a worker 1 104-1, a worker 2 104-2 and so on to worker N 104-N.
  • the integrated circuit 100 specifically the master processing unit 102 and the worker processing units 104 may be implemented using one or more CPU models, hardware architectures and ISAs as known in the art, for example, ARC® architecture by Synopsys®, Core® architecture by Intel®, ARM architecture and/or the like.
  • the master processing unit 102 and the worker processing units 104 which typically share similar architectures, may be integrated in the integrated circuit 100 as Intellectual Property (IP) cores.
  • Each of the master 102 and the workers 104 may execute one or more software modules, for example, a process, a script, an application, an agent, a utility, a tool, an Operating System (OS) and/or the like each comprising a plurality of program instructions stored in a non-transitory medium (program store).
  • the processing architecture of the master processing unit 102 and the worker processing units 104 may be selected to provide high performance computing combined with flexible operation modes and ability to efficiently connect to peripheral modules via high speed interconnection channels, buses and/or fabrics.
  • The ARC® HS Family of ultra-compact processing cores, which includes high performance, DSP-enhanced processors, is characterized by excellent code density, small size and very low power consumption, making these cores ideal for power-critical and area-sensitive embedded and deeply embedded applications.
  • the ARC® HS cores are equipped with high speed interfaces, ports and/or other provisions for connecting to multiple diverse and high speed memory and/or peripheral cores.
  • the ARC® HS cores further support APEX technology for integration with additional optionally custom hardware modules, for example, hardware accelerators customized and configured for efficient, high performance processing of data according to specific algorithms thus dramatically boosting performance and/or reducing power consumption.
  • each of the workers 104 may be coupled to a closely coupled memory exclusively associated with the respective worker 104.
  • The worker 104-1 may be closely coupled to an ICCM 106-1 and a DCCM 108-1, the worker 104-2 may be closely coupled to an ICCM 106-2 and a DCCM 108-2 and so on to the worker 104-N which may be closely coupled to an ICCM 106-N and a DCCM 108-N.
  • Each closely coupled ICCM 106-i and DCCM 108-i may be connected to the respective worker 104-i via high-speed and optionally dedicated interface(s) such that the data transfer rate between the respective worker 104-i and the respective ICCM 106-i and/or DCCM 108-i for read and/or write operations may be extremely high.
  • Each of the workers 104 may further include one or more cache memory arrays 110, for example, a data cache and/or an instruction cache.
  • The worker 104-1 may include a cache 110-1, the worker 104-2 may include a cache 110-2 and so on to the worker 104-N which may include a cache 110-N.
  • One or more operational parameters of the ICCMs 106, the DCCMs 108 and/or the caches 110 of the master 102 and/or of the workers 104 may be typically set, defined and/or selected according to one or more operational parameters of the integrated circuit 100, for example, an architecture of the master 102 and/or the workers 104 (32-bit, 64-bit, etc.), port width of the master 102 and/or the workers 104 and/or the like.
  • one or more of the operational parameters of the ICCMs 106, the DCCMs 108 and/or the caches 110 may be further set, defined and/or selected according to one or more characteristics (needs) of the application and/or algorithms executed using the integrated circuit 100, for example, performance, power consumption, footprint and/or the like.
  • Each of the workers 104 may be further coupled to a respective set 112 of hardware (HW) accelerators 114 each configured and optimized for executing and/or supporting execution of a respective one of a plurality of computation intensive algorithms.
  • the hardware accelerators 114 are hardware units comprising one or more functional modules implemented in gate-level optimized for conducting computations relating to the computation intensive algorithms via hardware (electronic) resources with little and typically no software intervention.
  • Each hardware accelerator 114 may be therefore designed and constructed to include hardware circuits, elements, blocks and/or modules customized and optimized for executing the respective algorithm.
  • each of the workers 104 is coupled to a respective set 112 of hardware accelerators 114 exclusively associated with the respective worker 104 such that each hardware accelerator 114 of the respective set 112 may be accessed and used only by the respective associated worker 104.
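  • The per-worker resources described above may be modelled, for illustration only, by a bookkeeping structure such as the following C sketch; the sizes, field names and the fixed-size accelerator list are assumptions made for the example:

      #include <stdint.h>

      #define MAX_ACCELERATORS 4u

      struct hw_accelerator {
          uint32_t id;             /* which algorithm this accelerator is optimized for */
          uint32_t base_address;   /* where its registers / extension resources live    */
      };

      struct worker_ctx {
          uint32_t iccm_base, iccm_size;                /* private instruction memory     */
          uint32_t dccm_base, dccm_size;                /* private data memory            */
          struct hw_accelerator set[MAX_ACCELERATORS];  /* exclusively associated set 112 */
          uint32_t selected;                            /* accelerator chosen for the
                                                           current computation session    */
      };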
  • One or more of the hardware accelerators 114 may be configured for computation intensive algorithms directed to general High Performance Computing (HPC) for a plurality of applications in a plurality of market segments, for example, Banking, financial services and insurance (BFSI), Information Technology (IT), Telecom, Military & Aerospace, Healthcare to name a few.
  • Such applications may include, for example, ML applications, ML platform, smart robots, computer vision and video recognition, Natural Language Processing (NLP) and speech recognition, recommendation engines, virtual assistants, gesture control and many more.
  • one or more of the hardware accelerators 114 may be configured and optimized for more application specific computations.
  • Such application specific computation intensive algorithms may include, for example, one or more cryptography and/or encryption algorithms, for example, Advanced Encryption Standard (AES), Keccak, Secure Hash Algorithm (e.g. SHA-2, SHA-3) and/or the like.
  • One or more of the hardware accelerators 114 may be further configured and optimized for even more specific computation algorithms, for example, block hash computation algorithms for one or more blockchains, such as, for example, cryptocurrency blockchains (e.g. Ethereum, Zcash, Monero, etc.), smart contracts blockchains and/or the like.
  • Such hash computation algorithms may include, for example, Keccak, Blake2b, Ethash, ProgPOW, Equihash, RandomX and/or the like.
  • one or more of the hardware accelerators 114 may be configured and optimized for utilizing one or more ML models, for example, a neural network, an SVM and/or the like.
  • one or more of the hardware accelerators 114 may be configured and optimized for training one or more of the ML models.
  • the hardware accelerators 114 may be implemented and/or integrated in the integrated circuit 100 using one or more methods, techniques and/or implementations as known in the art, for example, a hard IP core, a soft IP core, an extension to the instruction set of the worker processing units 104 and/or the like.
  • the hardware accelerators 114 may be implemented using the APEX extension of the ARC® HS architecture to expand the instruction set of the ARC® HS processing cores to further support the instructions, address mapping, additional resources and/or the like required to realize the hardware accelerators 114.
  • the hardware accelerators 114 may be implemented as IP cores attached to their respective associated worker 104 via one or more interconnection channels, for example, a bus, a link, a port and/or the like.
  • The integrated circuit 100 may include a memory controller 122 for connecting to one or more external memory arrays, typically high capacity memory, for example, a global (system) memory 150 implemented by one or more RAM devices, for example, Double Data Rate (DDR) 4, DDR5 and/or the like.
  • the global memory 150 and/or part thereof is integrated within the integrated circuit 100.
  • The integrated circuit 100 may further include one or more Direct Memory Access (DMA) engines 124 for transferring data to/from the global memory 150 from/to one or more of the functional modules integrated in the integrated circuit 100, for example, the master 102 and/or one or more of the workers 104.
  • The DMA engines 124 may be further used for transferring data among the functional modules integrated in the integrated circuit 100, for example, from/to the master 102 to/from one or more of the workers 104, between workers 104 and/or the like.
  • The integrated circuit 100 may also include one or more peripheral functional modules 126, for example, a serial communication controller configured to communicate via one or more serial communication channels (e.g. RS-232, RS-422, RS-485, etc.), a network controller configured to communicate via one or more networks (e.g. LAN, WAN, WLAN, etc.), a Joint Test Action Group (JTAG) controller configured to connect to one or more testing and/or debugging channels and/or tools and/or the like.
  • The integrated circuit 100 may include one or more interconnection channels 120, for example, a bus, a switch fabric, a link, a port and/or the like.
  • the interconnection channel(s) may be implemented using one or more architectures, modes and/or protocols as known in the art and/or proprietary interconnection channel(s), for example, parallel channels comprising multiple address and/or data lines (e.g. AXI, etc.), serial channels (e.g. PCI Express, etc.) and/or the like.
  • one or more of the interconnection channels 120 may typically be very high speed channels supporting high bandwidth and low latency.
  • the interconnection channels 120 may be deployed in the integrated circuit 100 to support efficient connectivity.
  • A first bus 120A may be deployed in the integrated circuit 100 to connect the master 102 to the plurality of workers 104.
  • the master 102 may communicate with one or more of the workers 104.
  • the master 102 may access one or more resources of one or more of the workers 104, for example, the ICCM(s) 106, the DCCM(s) 108, registers of the worker(s) 104 and/or the like.
  • one or more of the workers 104 may access, via the first bus 120A, one or more resources of the master 102, for example, the DCCM-M 108-M.
  • one or more of the workers 104 may access one or more of the resources of one or more other workers 104.
  • A second bus 120B may be deployed in the integrated circuit 100 to connect the master 102 and the plurality of workers 104 to the other functional modules integrated in the integrated circuit 100, for example, the memory controller 122, the DMA engine(s) 124, the peripherals 126 and/or the like.
  • The master 102 and/or one or more of the workers 104 may access the global system memory 150 for reading (fetching) data from the global system memory 150 and/or for writing (storing) data to the global system memory 150.
  • One or more operational parameters of the interconnection channels, for example, connected nodes (functional modules), number of lines, timing, protocol and/or the like may be predefined during the design, synthesis and/or build of the integrated circuit 100.
  • FIG. 2 is a flowchart of exemplary processes executed by a master processing unit and a plurality of worker processing units of an integrated device to enhance performance of computation intensive algorithms, according to some embodiments of the present invention.
  • An exemplary process 200 may be executed by a master processing unit such as the master processing unit 102 for managing a distributed processing of a task by a plurality of worker processing units such as the worker processing units 104.
  • the master 102 may distribute a plurality of data segments to the workers 104 and receive partial outcomes computed in parallel by the workers 104 each for a respective one of the data segments.
  • The master 102 may then aggregate the partial outcomes to produce a combined outcome which may be output.
  • The process 200 may thus be regarded as a high level process carried out, for example, by a high level application executed by the master 102 for managing the data processing and computations carried out by the workers 104.
  • Since each of the workers 104 is exclusively associated with a respective set such as the set 112 of hardware accelerators such as the hardware accelerators 114, each of the workers 104 may compute the respective partial outcome using one or more hardware accelerators which are exclusively associated with the respective worker 104 and are thus used solely by the respective worker 104.
  • the process 200 representing a computation session, starts with the master 102 receiving input data relating to a respective (certain) one of a plurality of algorithms, specifically computation intensive algorithms such as, for example, HPC algorithms, cryptography algorithms and/or encryption algorithms such as, for example, AES, Keccak, SHA (e.g. SHA-2, SHA-3) and/or the like.
  • one or more of the computation intensive algorithms may be specific and/or customized for one or more blockchain applications, specifically for computing hash values for blocks added in one or more blockchains, for example, a cryptocurrency blockchain (e.g. Ethereum, Zcash, Monero, etc.), a smart contracts blockchain and/or the like.
  • the master 102 may instruct each of the plurality of workers 104 to prepare one of their associated hardware accelerators 114 for executing the certain algorithm, specifically for processing the input data according to the certain algorithm.
  • each of the plurality of hardware accelerators 114 is configured and/or optimized for executing and/or supporting execution of a respective one of the plurality of computation intensive algorithms.
  • the master 102 may select the hardware accelerator 114 to be used by the workers 104 according to the certain algorithm which needs to be applied during the current session. For example, assuming the input data relates to an encryption algorithm, for example, AES, the master 102 may instruct the workers 104 to prepare a respective hardware accelerator 114 specifically adapted for AES computation. In another example, assuming the input data relates to computing a HASH value for a block of a certain blockchain, for example, Ethereum, the master 102 may instruct the workers 104 to prepare a respective hardware accelerator 114 specifically adapted for Ethash computation. In another example, assuming the input data relates to computing a HASH value for a block of another blockchain, for example, Zcash, the master 102 may instruct the workers 104 to prepare a respective hardware accelerator 114 specifically adapted for Equihash computation.
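  • The selection step may be expressed, for illustration only, as a simple mapping from the algorithm the input data relates to onto the accelerator each worker should prepare; the identifiers below are invented for this sketch:

      enum algo { ALG_AES, ALG_ETHASH, ALG_EQUIHASH, ALG_KECCAK };

      const char *accelerator_for(enum algo a)
      {
          switch (a) {
          case ALG_AES:      return "aes_accel";       /* AES encryption            */
          case ALG_ETHASH:   return "ethash_accel";    /* Ethereum block hashing    */
          case ALG_EQUIHASH: return "equihash_accel";  /* Zcash block hashing       */
          default:           return "keccak_accel";    /* generic hash accelerator  */
          }
      }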
  • the master 102 may optionally configure the operation mode of the hardware accelerator 114 of one or more of the workers 104.
  • the instructions of the master 102 to one or more of the workers 104 may include one or more configuration parameters, for example, an initial state, an execution mode, an execution parameter and/or the like.
  • the master 102 may naturally instruct all the workers 104 to utilize their respective associated selected hardware accelerator 114 using the same configuration parameters.
  • the master 102 may optionally instruct one or more of the workers 104 to utilize their respective hardware accelerators 114 according to one or more different (respective) configuration parameters.
  • the master 102 may instruct a first worker 104, for example, the worker 104-1 to utilize its associated selected hardware accelerator 114, for example, the hardware accelerator 114-11 using a certain configuration parameter while instructing a second worker 104, for example, the worker 104-2 to utilize its associated selected hardware accelerator 114-21 using another configuration parameter which is different from the certain configuration parameter.
  • the master 102 may instruct one or more of the workers 104 to prepare different hardware accelerators 114 for processing the input data.
  • For example, the master 102 may instruct the worker 104-1 to prepare a first hardware accelerator 114 of the set 112-1 associated with the worker 104-1 while instructing the worker 104-2 to prepare a second hardware accelerator 114 of the set 112-2 associated with the worker 104-2.
  • the master 102 may determine that only a subset of the plurality of workers 104 may be used for the current computation session.
  • the subset, specifically the number of workers 104 of the subset may be determined according to one or more of execution parameters defined for the current computation session.
  • the execution parameters may define one or more aspects, attributes, constraints and/or the like for the current computation session.
  • one or more of the execution parameters may define a tradeoff between a performance of the execution of the certain algorithm, i.e. of processing the input data according to the certain algorithm and a power consumption of the integrated electronic circuitry 100 required for processing the input data according to the certain algorithm.
  • For example, a certain execution rule may define a certain maximum power consumption that must not be exceeded. The master 102 may determine that in order not to exceed the maximum power consumption, only a subset comprising 4 workers 104 out of 12 available workers 104 may be utilized for processing the input data according to the certain algorithm.
  • In another example, a certain execution rule may define that the overall computation time must not exceed a certain maximum computation time. The master 102 may determine that in order not to exceed the maximum computation time, a subset comprising 10 workers 104 of the 12 available workers 104 must be utilized.
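  • The two examples above may be reproduced, for illustration only, by the following C sketch; the per-worker power figure and the single-worker processing time are assumed values chosen so that the power cap yields 4 of 12 workers and the time budget requires 10 of 12 workers:

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          const int    available     = 12;
          const double power_per_w   = 0.5;    /* W per active worker (assumed figure)    */
          const double power_budget  = 2.0;    /* W -> at most 4 workers may be active    */
          const double total_work_s  = 100.0;  /* single-worker processing time (assumed) */
          const double time_budget_s = 10.0;   /* s -> at least 10 workers are needed     */

          int max_by_power = (int)floor(power_budget / power_per_w);
          int min_by_time  = (int)ceil(total_work_s / time_budget_s);

          printf("power cap allows %d of %d workers\n", max_by_power, available);
          printf("time budget requires at least %d of %d workers\n", min_by_time, available);
          return 0;
      }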
  • the master 102 may instruct the workers 104 to use different (other) hardware accelerators 114 of the set 112 accordingly.
  • the respective hardware accelerator 114 selected by the master 102 for each computation session may be selected according to the input data received in the respective session, specifically according to the algorithm to which the input data relates.
  • The master 102 may apply one or more methods, modes, provisions and/or the like as known in the art for instructing the workers 104 to prepare their selected hardware accelerator(s) 114.
  • the master 102 may directly communicate with one or more of the workers 104, for example, via the second bus 120B, to transmit the instructions via one or more communication and/or messaging channels established between the master 102 and the workers 104.
  • the master may access one or more registers of one or more of the workers 104 to load the instructions.
  • the master may access the ICCM 106 of one or more of the workers 104 to load one or more code segments comprising the instructions and may instruct each respective worker 104 to execute the code segment(s) from its respective ICCM 106 such that when executing the code segment(s) loaded by the master 102 the respective worker 104 may apply the instructions.
  • In the process 250 executed by each of the workers 104, each worker 104 may prepare its respective hardware accelerator 114 as instructed by the master 102.
  • Preparing the hardware accelerators 114 may be done according to the implementation of the selected hardware accelerator 114. For example, assuming the selected hardware accelerator 114 is implemented using a dedicated instruction set which may be loaded into the worker 104, the master 102 may load the dedicated instruction set to the ICCM 106 of each of the plurality of workers 104. Continuing the previous example, assuming the selected hardware accelerator 114 is implemented using the APEX extension, each worker 104 may load the APEX extension instruction set of the selected hardware accelerator 114. In another example, where the selected hardware accelerator 114 is implemented as a dedicated IP core, each worker 104 may load the dedicated IP core from persistent storage into a gate level logic array of the respective worker 104.
  • the master 102 may segment the input data to a plurality of data segments.
  • the master 102 may segment the input data to the plurality of data segments according to the certain algorithm such that the data segments are at least partially independent of each other and may be thus processed simultaneously.
  • For example, assuming the input data relates to a certain encryption algorithm and is composed of a plurality of separate data blocks, the master 102 may segment the input data to the plurality of separate data blocks which may each be encrypted independently according to the certain encryption algorithm.
  • In another example, the input data may be arranged in columns and/or rows which may each be processed independently followed by an aggregation of the outcomes of the processed columns and/or rows. In such case, the master 102 may segment the input data to the plurality of columns and/or the plurality of rows.
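  • The row-wise case may be sketched, for illustration only, as follows; the descriptor layout and helper are invented for the example and simply hand each worker a contiguous range of rows it can process independently:

      #include <stddef.h>
      #include <stdint.h>

      struct segment { const uint8_t *data; size_t rows; size_t row_bytes; };

      /* Split 'total_rows' rows of 'row_bytes' bytes each across up to 'num_workers'
       * segments; returns the number of segments actually produced. */
      size_t segment_by_rows(const uint8_t *input, size_t total_rows, size_t row_bytes,
                             struct segment *out, size_t num_workers)
      {
          if (num_workers == 0)
              return 0;
          size_t per_worker = (total_rows + num_workers - 1) / num_workers;
          size_t used = 0, nseg = 0;

          while (nseg < num_workers && used < total_rows) {
              size_t rows = (total_rows - used < per_worker) ? (total_rows - used) : per_worker;
              out[nseg].data      = input + used * row_bytes;
              out[nseg].rows      = rows;
              out[nseg].row_bytes = row_bytes;
              used += rows;
              ++nseg;
          }
          return nseg;
      }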
  • the plurality of data segments may be identical.
  • For example, the same input data may need to be processed according to a certain algorithm using one or more different algorithm configuration parameters, for example, initial state, execution mode, execution parameter and/or the like.
  • the plurality of data segments created by the master 102 may represent the same data, which may practically include the input data.
  • the same input data may need to be processed according to a plurality of different algorithms.
  • the plurality of data segments created by the master 102 may also represent the same data, which may practically include the input data.
  • the master 102 may distribute the plurality of data segments of the input data to the plurality of workers 104 such that each of the workers 104 may receive a respective one of the data segments.
  • the master 102 may distribute the data segments only to the workers 104 of the selected subset.
  • the master 102 may employ one or more methods, techniques, modes and/or engines for distributing the data segments to the workers 104. For example, the master 102 may write the respective data segment to the DCCM 108 of each of one or more of the workers 104. In another example, the master 102 may store the plurality of data segments in its DCCM-M 108-M and/or in the global memory 150 and may transmit to each of the workers a respective pointer pointing to an address in the DCCM-M 108-M and/or in the global memory 150 where the respective data segment assigned to the respective worker 104 is stored. Each of the workers 104 may then access the DCCM-M 108-M and/or the global memory 150 to fetch its respectively assigned data segment.
  • In another example, the master 102 may store the plurality of data segments in its DCCM-M 108-M and/or in the global memory 150 and may operate one or more of the DMA engines 124 to transfer the data segments from the DCCM-M 108-M and/or the global memory 150 to the workers 104, for example, to the DCCMs 108 of the workers 104.
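  • For illustration only, the pointer and DMA based options above may be sketched with a small per-worker descriptor; the layout and helper below are assumptions made for the example, where the master stores the segments in its DCCM-M 108-M and/or the global memory 150 and hands each worker a descriptor telling it where to fetch its segment from:

      #include <stdint.h>

      struct segment_desc {
          uint32_t addr;     /* address of the segment in the DCCM-M / global memory  */
          uint32_t length;   /* segment length in bytes                               */
          uint32_t nonce;    /* example of an additional common value (see below)     */
      };

      /* Master side: fill one descriptor per worker; each descriptor would then be
       * pushed to the respective worker's DCCM, e.g. by a DMA engine. */
      void build_descriptors(uint32_t base, uint32_t seg_len, uint32_t nonce,
                             struct segment_desc *desc, unsigned num_workers)
      {
          for (unsigned w = 0; w < num_workers; ++w) {
              desc[w].addr   = base + w * seg_len;
              desc[w].length = seg_len;
              desc[w].nonce  = nonce;
          }
      }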
  • the master 102 may further transfer one or more additional data items which may be required for the workers 104, specifically for their respective hardware accelerator 114 to process their respective data segments.
  • The additional data may include, for example, one or more common values, such as a header, a nonce and/or the like, which may be required for executing the certain algorithm by all workers 104 regardless of the respective data segments.
  • each of the workers 104 may receive its respective data segment from the master 102.
  • each of the workers 104 may compute a respective partial outcome by processing its respective data segment using its exclusively associated hardware accelerator 114 as instructed by the master 102 according to the certain algorithm. As described in step 252, in response to the instruction from the master 102, each worker 104 prepared its respective selected hardware accelerator 114 which is uniquely associated with and hence solely used by the respective worker 104. Each worker may therefore use its respective hardware accelerator 114 to process its respective data segment according to the certain algorithm.
  • each of the workers 104 may transfer its respective partial outcome to the master 102.
  • each of the workers 104 may employ one or more methods, techniques, modes and/or engines for transferring its respective partial outcome to the master 102.
  • one or more of the workers 104 may write their respective partial outcomes to the DCCM-M 108-M of the master 102.
  • one or more of the workers 104 may store their respective partial outcomes in their respective DCCMs 108 and/or in the global memory 150 and may transmit to the master 102 pointers pointing to an address in the DCCM 108 and/or in the global memory 150 where the respective partial outcomes are stored.
  • the master 102 may then access the DCCMs 108 and/or the global memory 150 to fetch the partial outcomes.
  • one or more of the workers 104 may store their partial outcomes in their DCCMs 108 and/or in the global memory 150 and may operate one or more of the DMA engines 124 to transfer the partial outcomes from the DCCMs 108 and/or from the global memory 150 to the master 102, for example, to the DCCM-M 108-M of the master 102.
  • the master 102 may receive the partial outcomes from the workers 104.
  • In one example, where the workers 104 each encrypted a respective data segment, the master 102 may aggregate the encrypted values to produce a combined value for the input data.
  • In another example, the workers 104 may each compute a hash value for a respective row of data (data segment) of the input data according to a certain hash function (algorithm), for example, Keccak.
  • In such a case, the master 102 may aggregate the plurality of hash values to produce a combined hash value for the input data.
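  • For illustration only, the following minimal Python sketch mimics this flow in software: a "master" splits the input into segments, a pool of "workers" hashes each segment in parallel (standing in for the per-worker hardware accelerators 114), and the master aggregates the partial digests. The helper names are hypothetical, and hashlib.sha3_256 uses NIST SHA-3 padding, which differs from the original Keccak padding; it is a stand-in, not the claimed circuitry.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def split(data: bytes, n_workers: int) -> list[bytes]:
    """Master-side segmentation: one data segment per worker."""
    seg = -(-len(data) // n_workers)  # ceiling division
    return [data[i:i + seg] for i in range(0, len(data), seg)]

def worker_hash(segment: bytes) -> bytes:
    """Worker-side partial outcome: hash of the worker's own data segment.
    hashlib.sha3_256 stands in for the Keccak hardware accelerator 114."""
    return hashlib.sha3_256(segment).digest()

def master_combined_outcome(data: bytes, n_workers: int = 4) -> bytes:
    segments = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_hash, segments))   # workers run in parallel
    return hashlib.sha3_256(b"".join(partials)).digest()    # master-side aggregation

print(master_combined_outcome(b"example input data" * 1000).hex())
```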
  • Computing the combined outcome may sometimes require the master 102 to perform additional computations over the partial outcomes and/or part thereof.
  • Optionally, the process 200, the process 250 and/or parts thereof may be repeated in a plurality of iterations to compute the combined outcome.
  • the master 102 may initiate a plurality of iterations of the process 200, specifically steps 206-212, to process the input data after it is split into multiple portions (chunks).
  • in each iteration, the master 102 may segment a respective one of the data portions into a plurality of data segments and may distribute the data segments to the workers 104.
  • Each worker 104 may compute in each iteration a respective partial outcome for its respective data segment of the respective data portion and may transfer its respective partial outcome back to the master 102.
  • the plurality of workers 104 may compute a respective subset of a plurality of partial outcomes relating to the data segments of the respective data portion.
  • the master 102 may receive a respective subset of partial outcomes and may compute a respective combined outcome for the respective data portion by aggregating the respective subset of partial outcomes. After the plurality of iterations are complete, the master 102 may construct an overall outcome for the input data by aggregating the respective combined outcomes computed in the plurality of iterations.
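  • Continuing the software analogue above (illustrative only, and reusing the master_combined_outcome() helper and hashlib import from the earlier sketch), the outer iteration over data portions could look like:

```python
def master_overall_outcome(data: bytes, portion_size: int = 8192, n_workers: int = 4) -> bytes:
    """One iteration per data portion (chunk); reuses master_combined_outcome() from the sketch above."""
    per_portion = [master_combined_outcome(data[off:off + portion_size], n_workers)
                   for off in range(0, len(data), portion_size)]
    # The overall outcome aggregates the per-portion combined outcomes.
    return hashlib.sha3_256(b"".join(per_portion)).digest()
```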
  • the process 200 may include a single pass (single iteration) while the process 250 may comprise a plurality of iterations.
  • the master 102 may segment the input data to the plurality of data segments distributed to the plurality of workers 104.
  • one or more of the workers 104 may repeat the process 250, specifically step 256 to compute their respective partial outcome in a plurality of iterations.
  • the certain algorithm executed by the selected hardware accelerator 114 of each worker 104 may execute in a plurality of iterations where in each iteration the input data to the hardware accelerator 114 is the partial outcome, a part thereof and/or a derivative thereof, computed in the previous iteration.
  • each worker 104 may transfer its final partial outcome to the master 102.
  • the integrated circuit 100 may be applied for efficiently processing data according to a plurality of computation intensive algorithms employed for a plurality of applications, for example, HPC, cryptography, encryption, decryption, blockchain applications, ML model utilization, ML model training and/or the like.
  • the integrated circuit 100 may be applied for computing a hash value for a block that records transactions of one or more altcoin cryptocurrencies in a blockchain concatenating a plurality of such blocks in order to ensure security, irreversibility and immutability of the blockchain.
  • Such altcoin cryptocurrencies may include, for example, Ethereum, Zcash, Monero and/or the like.
  • FIG. 3 is a schematic illustration of building blocks of a Keccak algorithm.
  • the Keccak algorithm, as known in the art, is a variant of SHA-3 which it chronologically precedes, and is based on a novel approach known as sponge construction.
  • Sponge construction is based on a wide random function and/or random permutation, and allows inputting ("absorbing" in sponge terminology) any amount of data, and outputting (“squeezing") any amount of data, while acting as a pseudorandom function with regard to all previous inputs.
  • An exemplary hardware accelerator such as the hardware accelerator 114 optimized for executing the Keccak algorithm may include a padding block 302 and a permutation block 304 parameterized to enable maximum block size of 1088 bits.
  • for a 256-bit digest the block size parameter r is 1088 bits, while for a 512-bit digest the block size is 576 bits.
  • the exemplary implementation may enable both 64 bits and 32 bits encoding.
  • Basic implementation may support Keccak-f[1600], in which parameter b is set to 1600.
  • the size of the input to the padding block 302 is 64 bits and the output of the padding block 302 may be 1088 or 576 bits.
  • the size of the input to the permutation block 304 may be 1088 or 576 bits and its output may be 1600 or 800 bits according to the above configuration description.
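  • These block sizes are consistent with the standard Keccak sponge parameterization, in which the rate r equals the state width b minus a capacity of twice the digest length; a quick sanity check, assuming that standard parameterization:

```python
def keccak_rate(state_bits: int, digest_bits: int) -> int:
    """Sponge rate r = b - c, with capacity c = 2 * digest length (standard Keccak parameterization)."""
    return state_bits - 2 * digest_bits

assert keccak_rate(1600, 256) == 1088   # block size for a 256-bit digest
assert keccak_rate(1600, 512) == 576    # block size for a 512-bit digest
```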
  • An integrated circuit such as the integrated circuit 100 may include a plurality of Keccak hardware accelerators 114 each associated with a respective one of the plurality of workers 104.
  • the Keccak hardware accelerators 114 which may be implemented using the APEX interface may include one or more registers for Hardware Software Interface (HSI).
  • the registers may include, for example, a 64-bit SOURCE1 register which may be set by a software driver of the Keccak hardware accelerator 114 for controlling the Keccak hardware accelerator 114:
  • the registers may also include a 64-bit SOURCE2 register for storing the input data, specifically the respective data segment distributed to the respective worker 104.
  • the registers may further include one or more registers accessible via the SOURCE1 register, for example:
  • - NAME0 register at address 0x00 containing a first part, CORE_NAME0, of the Keccak hardware accelerator 114 IP core, for example, "KECK" (0x4b45434b).
  • This register is a 32-bit, read only register.
  • - NAME1 register at address 0x01 containing a second part, CORE_NAME1, of the Keccak hardware accelerator 114 IP core, for example, "e2" (0x61312020).
  • This register is a 32-bit, read only register.
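  • As a hypothetical driver-side illustration, the core-name registers described above could be used to identify the accelerator; only the addresses (0x00, 0x01) and the value 0x4b45434b come from the text, the decode helper is an assumption for demonstration:

```python
# Register addresses taken from the text; the decode helper is illustrative only.
NAME0_ADDR, NAME1_ADDR = 0x00, 0x01

def decode_core_name(word: int) -> str:
    """Interpret a 32-bit read-only name register value as four ASCII characters."""
    return word.to_bytes(4, "big").decode("ascii")

assert decode_core_name(0x4b45434b) == "KECK"   # CORE_NAME0 of the Keccak IP core
```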
  • Employing the plurality of workers 104 to simultaneously process segments of the input data each using its respective Keccak hardware accelerator 114 may significantly enhance performance of the Keccak algorithm execution by the order of the number of workers 104 utilized for the parallel computation where each worker 104 may process a respective block (segment) of the input data.
  • FIG. 4A and FIG. 4B are schematic illustrations of building blocks of a Blake2b algorithm which is, as known in the art, a cryptographic hash function configured to compress and mix input data.
  • An exemplary hardware accelerator such as the hardware accelerator 114 optimized for executing the Blake2b algorithm may implement 12 mix rounds as shown in FIG. 4B, with 4 mixers per round as shown in FIG. 4A.
  • An integrated circuit such as the integrated circuit 100 may include a plurality of Blake2b hardware accelerators 114 each associated with a respective one of a plurality of workers such as the plurality of workers 104.
  • the Blake2b hardware accelerators 114 which may be implemented using the APEX interface may include one or more registers for HSI.
  • the registers may include, for example, a 64-bit SOURCE1 register which may be set by a software driver of the Blake2b hardware accelerator 114 for controlling the Blake2b hardware accelerator 114.
  • the registers may also include a 64-bit SOURCE2 register for storing the input data, specifically the respective data segment distributed to the respective worker 104.
  • the registers may further include one or more registers accessible via the SOURCE1 register, for example:
  • Each of these registers is a 32-bit, read only register.
  • ADDR_STATUS register at address 0x09 storing the status of the Blake2b hardware accelerator 114 IP core.
  • This register is a 32-bit, read only register.
  • ADDR_BLOCK_W00 register at address 0x10 storing the data of the block input.
  • ADDR_DIGEST0 register at address 0x80 storing the first 64 bits of the data output of the Blake2b hardware accelerator 114.
  • Employing the plurality of workers 104 to simultaneously process segments of the input data each using its respective Blake2b hardware accelerator 114 may significantly enhance performance of the Blake2b algorithm execution by the order of the number of workers 104 utilized for the parallel computation where each worker 104 may process a respective block (segment) of the input data.
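  • The same parallel pattern shown earlier for Keccak applies here; as an illustrative sketch only, Python's standard hashlib.blake2b stands in for the per-worker Blake2b hardware accelerator and the names are hypothetical:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def blake2b_worker(segment: bytes) -> bytes:
    # Stand-in for the per-worker Blake2b hardware accelerator 114.
    return hashlib.blake2b(segment, digest_size=64).digest()

segments = [b"segment-%d" % i for i in range(8)]          # one data segment per worker
with ThreadPoolExecutor(max_workers=8) as pool:
    partials = list(pool.map(blake2b_worker, segments))   # simultaneous partial outcomes
combined = hashlib.blake2b(b"".join(partials), digest_size=64).hexdigest()
print(combined)
```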
  • the integrated circuit 100 constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for encrypting data according to one or more encryption algorithms, for example, AES and/or the like.
  • encryption algorithms may be applied for one or more applications, for example, communication, Hardware Security Modules (HSM) and/or the like.
  • the master 102 may receive a plurality of data segments that need to be encrypted according to a certain encryption algorithm.
  • the master 102 may therefore instruct each of the workers 104 to use a respective hardware accelerator 114 optimized for the certain encryption algorithm.
  • the master 102 may further distribute the plurality of received data segments to at least some of the plurality of workers 104 which may each utilize its respective uniquely associated hardware accelerator 114 which was selected and instructed accordingly by the master 102. After encrypting its respective data segment, each worker 104 may transfer a respective encrypted data stream back to the master 102 which may output the encrypted data streams received from the workers 104.
  • FIG. 5A and FIG. 5B are schematic illustrations of an AES algorithm flow.
  • An exemplary hardware accelerator such as the hardware accelerator 114 optimized for executing the AES algorithm, which is an encryption algorithm as known in the art, may implement optimized mixing and XORing function blocks of the AES algorithm.
  • the XORing function may receive two input data items, for example, two 64-bit words, XOR between them and return the result. This simple action may significantly reduce the computation load compared to software instructions that mix two separate 32-bit words and incur additional store and load instructions.
  • the AES algorithm further includes a swap function which is the main function of the rounds.
  • Some implementation forms of the AES hardware accelerator 114 may be limited in their interface width, for example, the APEX interface may be limited to 64 bits.
  • the AES hardware accelerator 114 may therefore implement the 128-bits output swap function using two identical function blocks differing from each other only in their output bits, one block may output the lower 64-bits of the swap function’s 128-bits output while the other block may output the upper 64-bits.
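  • A minimal sketch of the two operations described above, assuming values are handled as plain Python integers; the xor64/split128 names are illustrative and not part of the accelerator interface:

```python
MASK64 = (1 << 64) - 1

def xor64(a: int, b: int) -> int:
    """XOR of two 64-bit words, as exposed by the XORing function block."""
    return (a ^ b) & MASK64

def split128(swap_out: int) -> tuple:
    """Split a 128-bit swap-function result into (lower, upper) 64-bit words, matching the
    two identical function blocks that each drive a 64-bit interface."""
    return swap_out & MASK64, (swap_out >> 64) & MASK64

lo, hi = split128(0x0123456789abcdef_fedcba9876543210)
assert (hi << 64) | lo == 0x0123456789abcdef_fedcba9876543210
assert xor64(lo, hi) == lo ^ hi
```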
  • the flow of the AES swap function demonstrated in FIG. 5A and FIG. 5B may include the following:
  • a SubBytes step 502 which is a non-linear substitution step in which each byte is replaced with another byte according to a lookup table.
  • a ShiftRows step 504 which is a transposition step in which the last three rows of the state are shifted cyclically a certain number of steps.
  • a MixColumns step 506 which is a linear mixing operation which operates on the columns of the state, combining the four bytes in each column.
  • a final XOR step 508 executed by the AES hardware accelerator 114 using the swap blocks with the round sub key.
  • Employing the plurality of workers 104 to simultaneously process segments of the input data each using its respective AES hardware accelerator 114 may significantly enhance performance of the AES algorithm execution by the order of the number of workers 104 utilized for the parallel computation where each worker 104 may process a respective block (segment) of the input data.
  • the integrated circuit 100 constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for computing hash values according to an Ethash algorithm.
  • the Ethash algorithm is commonly used in one or more cryptocurrency blockchains such as, for example, Ethereum.
  • FIG. 6 is a schematic illustration of an integrated device integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective hardware accelerator optimized for computing HASH values according to an Ethash algorithm, according to some embodiments of the present invention.
  • An exemplary integrated circuit such as the integrated circuit 100 may include a master such as the master 102 and a plurality of workers such as the workers 104, for example, a worker 104- 1, a worker 104-2 to a worker 104-N.
  • Each of the workers 104 may be coupled to a respective DCCM such as the DCCM 108 exclusively associated with the respective worker 104, for example, the worker 104-1 may be coupled to a DCCM 108-1.
  • Each of the workers 104 may be further exclusively associated with a respective set such as the set 112 comprising a plurality of hardware accelerators such as the hardware accelerators 114, for example, an Ethash hardware accelerator 114J optimized for computing hash values according to the Ethash algorithm.
  • Each of the Ethash hardware accelerators 114J associated with and utilized by the respective workers 104 may include several functional modules or blocks, for example, a Keccak module 602, a first mix module 604, a Direct Acyclic Graph (DAG) page fetch module 606, a second mix module 608 and a post processing module 610.
  • the master 102 may communicate, for example, via a network controller peripheral 126A integrated in the integrated circuit 100, with one or more external devices, systems, nodes and/or the like to receive a DAG representing cryptocurrency transactions as known in the art and/or information relating to the DAG. Based on the received information the master 102 may construct the DAG and may store it in a global memory such as the global memory 150 accessible to all workers 104.
  • the master 102 may then instruct each of the workers 104 to prepare its exclusively associated Ethash hardware accelerator 114J.
  • the master 102 may further receive and/or compute a header relating to a new block to be created and added to the blockchain which is derived from a previous block of the blockchain.
  • the master 102 may also receive and/or compute a plurality of nonce values relating to the new block which are practically guessed (drawn) nonces for the hash computation.
  • the nonces may be randomly selected using a random number generator, pseudo random number generator and/or the like. However, the nonces may be also computed based on one or more equations and/or rules.
  • the master 102 may transfer the header and a respective one of the plurality of nonces to each of the workers 104 using one or more of the data transfer modes and/or implementations described herein before.
  • the master 102 may transfer the header and the respective nonce into the DCCM 108 of each of the workers 104.
  • the master 102 may instruct the plurality of workers 104 to start computing the hash values.
  • Each worker 104 may then apply its respective Ethash hardware accelerator 114J to compute the hash value.
  • the Keccak module 602 executing an SHA-3 like algorithm (e.g. Keccak-f) based on the header and the respective nonce may create an initial block, for example, a 128-byte mix which may be designated Mix 0.
  • the first mix module 604 may apply one or more mixing functions to create an address for fetching the next 128-byte page from the DAG.
  • the DAG page fetch module 606 may then access the global memory 150 to fetch the next 128-byte block of the DAG using the address computed by the first mix module 604.
  • the DAG page fetch module 606 may apply one or more data transfer modes and/or implementations as described herein before for fetching the data from the global memory 150.
  • the number of workers 104 that are applied for the Ethash hash computation may be therefore determined based on a trade-off between the bandwidth capacity of the interconnection channels, specifically a bus such as the bus 120B connecting the workers 104 to the global memory 150 and the processing capabilities of each of the workers 104.
  • the second mix module 608 may apply one or more Ethereum specific mixing functions to combine Mix 0 with the retrieved DAG page.
  • the DAG page fetch module 606 may then further fetch the next 128-byte block from the DAG stored in the global memory using the address computed by the second mix module 608.
  • the second mix module 608 may then again combine the previous mix computed by the second mix module 608 with the retrieved DAG page.
  • the steps of the DAG page fetch module 606 and the second mix module 608 may be repeated for a predefined number of iterations, for example, 64 iterations to produce a final mix designated Mix 64.
  • the post processing module 610 may then process Mix 64 to produce a shorter mix, for example, a 32-byte Mix Digest.
  • Each worker 104 may then compare the resulting Mix Digest against a predefined 32-byte Target Threshold. If the Mix Digest is less than or equal to the Target Threshold, then the respective nonce used by the respective worker 104 may be considered successful, and may be broadcasted to the Ethereum network. Otherwise, the respective nonce is considered invalid. The entire process may be then repeated by the master 102 using a different set of nonces distributed to the plurality of workers 104.
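  • For orientation only, the following toy Python sketch mirrors the structure of the flow above (initial Keccak mix, 64 fetch-and-mix iterations against a DAG, post processing, comparison against a target); the mixing functions, the DAG contents and the use of SHA-3 here are placeholders and do not implement the real Ethash arithmetic:

```python
import hashlib

PAGE = 128           # DAG page size in bytes, as described above
ROUNDS = 64          # number of fetch/mix iterations

def keccak512(data: bytes) -> bytes:
    # Stand-in for the Keccak module 602 (real Ethash uses Keccak-f with different padding).
    return hashlib.sha3_512(data).digest()

def toy_mix(mix: bytes, page: bytes) -> bytes:
    # Placeholder for the Ethereum-specific mixing functions of module 608 (not the real mixing).
    return bytes(a ^ b for a, b in zip(mix, page * (len(mix) // len(page) + 1)))

def toy_ethash(dag: bytes, header: bytes, nonce: int, target: int) -> bool:
    mix = keccak512(header + nonce.to_bytes(8, "little")) * 2           # "Mix 0", 128 bytes
    for _ in range(ROUNDS):
        index = int.from_bytes(mix[:4], "little") % (len(dag) // PAGE)  # address from current mix
        page = dag[index * PAGE:(index + 1) * PAGE]                     # DAG page fetch (module 606)
        mix = toy_mix(mix, page)                                        # second mix (module 608)
    digest = hashlib.sha3_256(mix).digest()                             # post processing to 32 bytes
    return int.from_bytes(digest, "big") <= target                      # nonce accepted if <= target

dag = bytes(range(256)) * 4096                                          # toy stand-in for the real DAG
print(toy_ethash(dag, b"header", nonce=42, target=1 << 252))
```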
  • Applying the plurality of workers 104 to simultaneously compute a plurality of Mix Digest values based on the plurality of different nonces may significantly increase the performance, in particular, reduce the time of the computation by the order of the number of workers 104 used in the process.
  • using the specifically optimized Ethash hardware accelerator 114J may further significantly improve the performance and/or reduce the power consumed by the integrated circuit 100 for the computation compared to software based solutions.
  • the integrated circuit 100 constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for computing hash values according to an Equihash algorithm.
  • the Equihash algorithm is commonly used in one or more cryptocurrency blockchains such as, for example, Zcash.
  • FIG. 7 is a schematic illustration of a flow of computing HASH values according to an Equihash algorithm using an integrated circuit comprising a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention.
  • the multi-processing master-worker architecture of an integrated circuit such as the integrated circuit 100 may be applied for computing hash values according to the Equihash algorithm. It should be noted that the numbers and values mentioned in the description are exemplary and should not be construed as limiting since other numbers and/or values may be used with the same architecture, flows and/or modules.
  • a master such as the master 102 may transfer 512 pairs of numbers in the range of [0 : 2^21] for correcting Random Noise Generators (RNG) of a plurality of workers such as the workers 104, for example, 300 workers 104, starting from a worker 104-1 through to a worker 104-300.
  • the master 102 may transmit to each worker 104 three concatenated numbers: a header (864b), a nonce (256b) and a random number (21b).
  • Each of the workers 104 may transfer 512 pairs of numbers in the range of [0 : 2^21] back to the master 102 for further refining by the master 102.
  • Each of the workers 104 may further send to the master 102 512 x 200b numbers which are the output of an XOR matrix of the respective worker 104.
  • Each worker 104 may use a respective RNG to create 512 random numbers in the range of [0 : 2^21], designated P1, ..., P512.
  • the 512 random numbers may then be pushed (input) into a Blake2b hardware accelerator 114 configured with the following parameters: input 256B, output 50B / 400b, where the 400b output is divided into 2 x 200b words.
  • Selecting each of the 512 pairs of the 200b words of the Blake2b output may be done based on whether the random numbers P1, ..., P512 are odd or even respectively, thus giving X1, ..., X256. As a result there are 512 numbers (P1, ..., P512) and 512 corresponding 200b words (x1, ..., x512).
  • the 512 numbers and their corresponding 512 200b words may be injected into a filter by constructing a 512 x 512 matrix where each element [i,j] of the matrix is 200b wide and is an XOR result of Xi and Xj.
  • the values of the XOR results may be checked for leading zeros. If the leading zeros count is higher than, for example, 25 (an empiric number), the pair Pi, Pj and the corresponding XOR result may be added to a list.
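  • A loose software sketch of this worker-side filter is shown below; it simplifies by hashing only the random number (rather than the concatenated header/nonce/random number), selects one 200b word per number by parity, and reduces N from 512 to keep the demo fast, so it is illustrative only:

```python
import hashlib, random

WORD_BITS = 200
ZERO_THRESHOLD = 25                 # empiric leading-zeros threshold from the text
N = 64                              # 512 in the text; reduced here for a fast demo

def leading_zeros(value: int, bits: int = WORD_BITS) -> int:
    return bits - value.bit_length()

rng = random.Random(0)
P = [rng.randrange(0, 2 ** 21) for _ in range(N)]        # random numbers P1..PN

# 400-bit Blake2b output split into two 200-bit words; one word selected by parity of Pi.
X = []
for p in P:
    out = int.from_bytes(hashlib.blake2b(p.to_bytes(3, "big"), digest_size=50).digest(), "big")
    low, high = out & ((1 << 200) - 1), out >> 200
    X.append(low if p % 2 else high)

# XOR "matrix": keep pairs whose XOR result has enough leading zero bits.
solutions = [(P[i], P[j], X[i] ^ X[j])
             for i in range(N) for j in range(i + 1, N)
             if leading_zeros(X[i] ^ X[j]) >= ZERO_THRESHOLD]
print(len(solutions), "candidate pairs")
```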
  • Each worker 104 may send its respective list to the master 102.
  • the list may contain between
  • the master 102 may process the received data in two steps.
  • the master 102 may:
  • the master 102 may:
  • the 2 rows are two 200b words to be used as the algorithm solution, or the POW (proof of work) to be sent to the node as part of the solution message.
  • the integrated circuit 100 constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for computing hash values according to a RandomX algorithm.
  • the RandomX algorithm is commonly used in one or more cryptocurrency blockchains such as, for example, Monero.
  • the integrated circuit 100 constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for implementing one or more ML models.
  • ML models, for example, classifiers, may heavily rely on a plurality of Multiply and Accumulate (MAC) operations.
  • One or more of the hardware accelerators 114 may be configured to support hardware execution of the MACs thus significantly increasing the computation performance of these MACs compared to software solutions. Applying the plurality of workers 104 to simultaneously compute a plurality of respective MACs using their respective hardware accelerators 114 may therefore significantly increase the performance, specifically reduce the computation time of the ML classification process by the order of the number of employed workers 104.
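  • For clarity, a MAC is simply a repeated multiply-then-add; the toy sketch below shows the chain of MACs behind a single classifier output, where each loop step corresponds to one operation a hardware accelerator 114 could execute directly (illustrative only, not the accelerator's actual interface):

```python
def mac_dot(weights: list, activations: list) -> float:
    """A chain of Multiply and Accumulate (MAC) operations - the core of many ML classifiers."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc += w * a          # one multiply-and-accumulate step
    return acc

# Toy example: one output neuron computed from four inputs.
print(mac_dot([0.5, -1.0, 2.0, 0.25], [1.0, 2.0, 3.0, 4.0]))   # -> 5.5
```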
  • the integrated circuit 100 may be further applied to train one or more ML models, for example, a neural network, an SVM and/or the like.
  • the master 102 may transfer the ML model to at least some of the plurality of workers 104 and may further transfer a respective one of the plurality of training datasets and/or a respective segment of the large training datasets to each of the workers 104.
  • the plurality of workers 104 may operate in parallel to simultaneously train the ML model each with a respective training dataset such that the overall training time may be significantly reduced.
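  • As a toy software analogue of this data-parallel training, the sketch below assumes a trivial one-parameter model and uses parameter averaging as the combination step; the text only specifies that the workers train in parallel, so the averaging and all helper names are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def train_worker(shard: list) -> float:
    """Toy 'training': fit y = w * x on the worker's shard by closed-form least squares.
    Stands in for a worker 104 training the ML model on its respective training dataset."""
    num = sum(x * y for x, y in shard)
    den = sum(x * x for x, y in shard)
    return num / den

# Master: split one large training dataset into per-worker shards and train in parallel.
data = [(float(x), 3.0 * float(x)) for x in range(1, 41)]        # underlying relation y = 3x
shards = [data[i::4] for i in range(4)]                          # 4 workers, one shard each
with ThreadPoolExecutor(max_workers=4) as pool:
    worker_models = list(pool.map(train_worker, shards))
print("per-worker w:", worker_models, "aggregated w:", mean(worker_models))
```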
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals there between.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Disclosed herein are integrated electronic circuits integrating hardware accelerators in a multi-processing environment and methods for use to enhance performance of computation intensive algorithms. The integrated circuits comprise a master processing unit and a plurality of worker processing units each coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker. In response to receiving input data relating to a respective one of a plurality of computation intensive algorithms, the master instructs each worker to execute the algorithm using one of the hardware accelerators of its respective set selected according to the algorithm. The master distributes a plurality of data segments of the input data to the workers which compute partial outcomes for their respective data segments according to the algorithm using their respective selected hardware accelerators. The master collects and aggregates the partial outcomes and produces a combined outcome.

Description

CLOSELY COUPLED HARDWARE ACCELERATION IN A
MULTI-PROCESSORS ENVIRONMENT
RELATED APPLICATION/S
This application claims the benefit of priority of U.S. Provisional Patent Application No. 63/120,256 filed on 2 December 2020, the contents of which are incorporated herein by reference in their entirety.
FIELD AND BACKGROUND OF THE INVENTION
The present invention, in some embodiments thereof, relates to enhancing performance of computation intensive algorithms, and, more specifically, but not exclusively, to enhancing performance of computation intensive algorithms using a multi-processing environment integrated with hardware accelerators optimized for the computation intensive algorithms.
In order to address and accommodate the ever increasing demand for higher computing power, the computing and processing platforms, hardware and architectures are also required to constantly evolve.
A plurality of hardware and electronic computing architectures have been and continue to be developed to comply with multiple requirements and constraints applied which may be highly challenging and often conflicting, for example, computing capacity, power (energy) consumption, size, cost and/or the like.
One such architecture is vector processing in which a plurality of processing pipelines may be utilized in parallel to process data simultaneously in attempt to reduce computation time. Another approach may be based on utilization of hardware elements configured to conduct computations, for example, arithmetic and/or logic operation at gate-level with little and/or no software intervention. Some platforms aim to further combine between these two architectures to provide multiple processing pipelines each supported at least partially by hardware acceleration elements.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention there is provided an integrated electronic circuitry integrating hardware accelerators in a multi-processing environment for enhancing performance of computation intensive algorithms, comprising a plurality of worker processing units each coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker, and a master processing unit coupled to the plurality of workers. The master is configured to:
- Receive input data relating to a respective one of a plurality of computation intensive algorithms.
Instruct each of the plurality of workers to execute the respective algorithm using one of the hardware accelerators of its respective set selected according to the respective algorithm.
Distribute a plurality of data segments of the input data to the plurality of workers.
Receive a plurality of partial outcomes computed simultaneously by the plurality of workers. Each of the plurality of partial outcomes is computed for a respective one of the plurality of data segments according to the respective algorithm by a respective one of the plurality of workers using its respective selected hardware accelerator and its respective memory.
Output a combined outcome aggregating the plurality of partial outcomes.
According to a second aspect of the present invention there is provided a method of using hardware accelerators integrated in a multi-processing environment to enhance performance of computation intensive algorithms, comprising using an electronic circuitry integrating a plurality of worker processing units and a master processing unit coupled to the plurality of workers, each of the plurality of workers is coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker, the master is configured for:
- Receiving input data relating to a respective one of a plurality of algorithms.
Instructing each of the plurality of workers to execute the respective algorithm using one of the hardware accelerators of its respective set selected according to the respective algorithm.
- Distributing a plurality of data segments of the input data to the plurality of workers.
- Receiving a plurality of partial outcomes computed simultaneously by the plurality of workers. Each of the plurality of partial outcomes is computed for a respective one of the plurality of data segments according to the respective algorithm by a respective one of the plurality of workers using its respective selected hardware accelerator and its respective memory.
Outputting a combined outcome aggregating the plurality of partial outcomes.
In a further implementation form of the first and/or second aspects, the master is coupled to a respective memory exclusively associated with the master for storing program instructions and data. In a further implementation form of the first and/or second aspects, the respective memory coupled with each of the workers comprises:
- Respective Instruction Closely Coupled Memory (ICCM) for program instructions store. The respective ICCM is accessible by the master for writing program instructions for execution by the respective worker, and
- Respective Data Closely Coupled Memory (DCCM) for storing data used by the respective worker. The respective DCCM is accessible by the master for distributing the respective data segment to the respective worker and for receiving the respective partial outcome computed by the respective worker.
In a further implementation form of the first and/or second aspects, the plurality of processing units are configured to access a common memory shared by the plurality of processing units.
In a further implementation form of the first and/or second aspects, each of the hardware accelerators of the set is optimized for executing one or more of the plurality of algorithms.
In a further implementation form of the first and/or second aspects, the instruction of the master to each of the plurality of workers further comprising one or more configuration parameters relating to the utilization of the respective selected hardware accelerator.
In a further implementation form of the first and/or second aspects, the instruction of the master to the plurality of workers further comprising instructing at least some of the plurality of workers to each utilize its respective selected hardware accelerator according to one or more respective configuration parameters.
In an optional implementation form of the first and/or second aspects, at least some of the plurality of data segments distributed to at least some of the plurality of workers is similar.
In a further implementation form of the first and/or second aspects, the master is further configured to operate only a subset of the plurality of workers, the number of workers in the subset is set according to one or more execution parameters defining a tradeoff between a performance of the execution of the respective algorithm and a power consumption of the integrated electronic circuitry.
In a further implementation form of the first and/or second aspects, the plurality of algorithms comprise one or more computation intensive algorithms applied for computing a hash value for a block recording transactions of an altcoin cryptocurrency in a blockchain.
In a further implementation form of the first and/or second aspects, one or more of the computation intensive algorithms implement the Ethash algorithm for computing the hash value.
In a further implementation form of the first and/or second aspects, one or more of the computation intensive algorithms implement the ProgPOW algorithm for computing the hash value. In a further implementation form of the first and/or second aspects, one or more of the computation intensive algorithms implement the Equihash algorithm for computing the hash value.
In a further implementation form of the first and/or second aspects, one or more of the computation intensive algorithms implement the RandomX algorithm for computing the hash value.
In a further implementation form of the first and/or second aspects, the plurality of algorithms comprise one or more encryption algorithms.
In a further implementation form of the first and/or second aspects, the plurality of algorithms comprise one or more High Performance Computing (HPC) algorithms.
In a further implementation form of the first and/or second aspects, the plurality of algorithms comprise one or more algorithms applied for implementing a Machine Learning (ML) model.
In a further implementation form of the first and/or second aspects, the plurality of algorithms comprise one or more algorithms applied for training the ML model.
In an optional implementation form of the first and/or second aspects, the master is further configured to conduct a plurality of iterations for computing the combined outcome. The master is configured to perform the following in each of the plurality of iterations:
- Distribute, to the plurality of workers, a respective subset of a plurality of data segments of a respective data portion of the input data.
- Receive a respective subset of a plurality of partial outcomes computed by the plurality of workers for the respective subset of a plurality of data segments.
Compute a respective combined outcome for the respective data portion by aggregating the respective subset of the plurality of partial outcomes,
Wherein after the plurality of iterations are complete, the master constructs an overall outcome for the input data by aggregating the respective combined outcomes computed in the plurality of iterations.
In a further implementation form of the first and/or second aspects, each of the plurality of processing units is utilized by an ARC Intellectual Property (IP) core.
In a further implementation form of the first and/or second aspects, each of the hardware accelerators of each respective set is coupled to the respective worker using ARC Processor Extension (APEX) technology.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims. Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks automatically. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.
For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of methods and/or systems as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars are shown by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a schematic illustration of an exemplary integrated circuit integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention;
FIG. 2 is a flowchart of exemplary processes executed by a master processing unit and a plurality of worker processing units of an integrated device to enhance performance of computation intensive algorithms, according to some embodiments of the present invention;
FIG. 3 is a schematic illustration of building blocks of a Keccak algorithm;
FIG. 4A and FIG. 4B which are schematic illustrations of building blocks of a Blake2b algorithm;
FIG. 5A and FIG. 5B are schematic illustrations of an Advanced Encryption Standard (AES) algorithm flow;
FIG. 6 is a schematic illustration of an integrated device integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective hardware accelerator optimized for computing HASH values according to an Ethash algorithm, according to some embodiments of the present invention; and
FIG. 7 is a schematic illustration of a flow of computing HASH values according to Equihash algorithm using an integrated circuit comprising a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention.
DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION
The present invention, in some embodiments thereof, relates to enhancing performance of computation intensive algorithms, and, more specifically, but not exclusively, to enhancing performance of computation intensive algorithms using a multi-processing environment integrated with hardware accelerators optimized for the computation intensive algorithms.
According to some embodiments of the present invention, there are provided integrated circuits and methods for enhancing performance of computation intensive algorithms through utilization of an integrated electronic circuit implementing a multi-processing environment consisting of a plurality of processing units each exclusively coupled to local memory arrays and private hardware accelerators which are highly optimized for execution of the computation intensive algorithms.
The computation intensive algorithms may be directed to general High Performance Computing (HPC). However, one or more of the computation intensive algorithms may be used for more application specific computations, for example, cryptography and/or encryption algorithms, for example, Advanced Encryption Standard (AES), Keccak, Secure Hash Algorithm (e.g. SHA-2, SHA-3) and/or the like. One or more of the computation intensive algorithms may be further directed to more specific applications, for example, data block hash computations for one or more blockchains, such as, for example, cryptocurrency blockchains (e.g. Ethereum, Zcash, Monero, etc.), smart contracts blockchains and/or the like. Such hash intensive computation algorithms may include, for example, Keccak, Blake2b, Ethash, ProgPOW, Equihash, RandomX and/or the like.
The integrated circuit, for example, a component, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) and/or the like may integrate plurality of processing units deployed in a master-worker architecture comprising a master processing unit (designated master herein after) and a plurality of worker (slave) processing units (designated workers herein after). The master and the workers which typically share similar architectures, may be integrated in the integrated circuit as Intellectual Property (IP) cores.
The master may execute one or more high level processes, applications, algorithms and/or the like (collectively designated high level application herein after) for managing the workers to execute one or more of the computation intensive algorithms. The master may distribute and manage tasks relating to the computation intensive algorithm(s) which may be executed simultaneously (in parallel) by the workers thus significantly increasing computation performance of the computation intensive algorithm(s), in particular, reduce an execution time of the computation intensive algorithm(s).
The plurality of processing units, namely the master and the workers, may typically be connected to a global system memory shared by all processing units, which may be integrated in the integrated circuit and/or external, for example, a Dynamic Random Access Memory (DRAM), specifically high speed DRAM, such as, for example, Double Data Rate (DDR) 4, DDR 5 and/or the like. The global memory may be large in capacity in order to support storage of large data structures and/or extensive code segments but may typically be slow (high access time). Moreover, since the global memory is shared by all of the processing units, time sharing (arbitration) between the processing units must be applied thus further increasing the access time of each single processing unit to the global memory.
In order to overcome the global memory limitation, the master processing unit and each of the worker processing units may be coupled to one or more closely coupled memory arrays exclusively associated with the respective processing unit and thus available for use only by the respective processing unit. These closely coupled memory arrays may include, for example, a Closely Coupled Instruction Memory (ICCM) for storing instructions (program code), a Closely Coupled Data Memory (DCCM) for storing data and/or the like. Each of the processing units may further include one or more private cache memory arrays, for example, a data cache and/or an instruction cache. The ICCM and DCCM may support high speed access (low access time) and may be configured to provide sufficient capacity for efficient program store (ICCM) and sufficient data storage space (DCCM) while maintaining low footprint in terms of electronic resources (typically referred to as gates).
Each of the workers may be further coupled to a respective set of hardware accelerators. Each hardware accelerator may be configured and optimized for executing and/or supporting execution of a respective one of the plurality of computation intensive algorithms. Each of the hardware accelerators may include one or more functional modules implemented in gate-level (electronic resources) optimized for conducting the computations relating to a respective computation intensive algorithm via electronic resources with little and typically no software intervention. Each hardware accelerator may be therefore designed and constructed to include hardware circuits, elements, blocks and/or modules customized and optimized for executing the respective algorithm. The hardware accelerators may be integrated in the integrated circuit using one or more architectures, modes and/or implementations, for example, an IP core, a supplemental instruction set for the worker processing unit and/or the like.
The master executing the high level application may receive input data relating to a respective one of the computation intensive algorithms and may instruct each of the workers to prepare, for example, load, connect, assign, activate and/or the like their exclusively associated hardware accelerator optimized for the respective computation intensive algorithm. The master may instruct the workers using one or more methods and/or implementations. For example, the master may write one or more code segments to the ICCM of one or more workers where the code segment(s) include instructions such that when executing the code segment(s), the worker(s) may apply the instructions. In another example, the master may communicate with one or more of the workers via one or more communication and/or messaging channels established between the master and the workers.
The master may further segment the input data to a plurality of data segments and may distribute each of the plurality data segments to a respective one of the workers. The plurality of workers, each using its respective closely coupled memory arrays and its respective hardware accelerator, may execute in parallel to process simultaneously the plurality of data segments according to the respective computation intensive algorithm and produce a plurality of partial outcomes each corresponding to a respective data segment. Optionally, the workers may process their respective data segments in a plurality of iterations as required by the respective computation intensive algorithm. The plurality of workers may then transfer their partial outcomes to the master which may compute a combined outcome by aggregating the plurality of partial outcomes received from the workers. Optionally, the master may repeat the data segmentation and distribution in a plurality of iterations where in each iteration a portion of the input data may be processed and eventually all outcomes may be aggregated to produce an overall outcome.
The integrated circuit may be further applied to train one or more Machine Learning (ML) models, for example, a neural network, a Support Vector Machine (SVM) and/or the like. For example, assuming a certain ML model needs to be trained using a plurality of training datasets and/or a very large training dataset. In such case the master may transfer the ML model to at least some of the plurality of workers and may further transfer a respective one of the plurality of training datasets and/or a respective segment of the large training datasets to each of the workers. The plurality of workers may operate in parallel to simultaneously train the ML model each with a respective training dataset such that the overall training time may be significantly reduced.
The integrated circuit, specifically the master and the worker processing units may be implemented using one or more Central Processing Unit (CPU) models, hardware architectures and/or Instruction Set Architectures (ISA) as known in the art, for example, ARC® architecture by Synopsys®, Core® architecture by Intel®, Advanced RISC Machines (ARM) architecture and/or the like. In particular, the processing architecture, platform and resources of the master and the workers may be selected to provide high performance computing combined with flexible operation modes and ability to efficiently connect to peripheral modules via high speed interconnection channels, buses and/or fabrics. For example, the ARC® HS Family ultra-compact processing cores which include high performance, DSP-enhanced processors are characterized by excellent code density, small footprint and very low power consumption, making them ideal for power-critical and area-sensitive embedded and deeply embedded applications. Moreover, the ARC® HS cores are equipped with high speed interfaces, ports and/or other provisions for connecting to multiple diverse and high speed memory and/or peripheral cores. In particular, the ARC® HS cores further support ARC Processor Extension (APEX) technology for integration with additional optionally custom hardware modules, for example, the hardware accelerators customized and configured for efficient, high performance processing according to specific algorithms.
The multi-processing architecture integrating the highly optimized hardware accelerators may present significant advantages and benefits compared to currently existing architecture, devices, systems and methods for implementing computation intensive algorithms. Some of the existing architectures may be based on one or more CPU models for executing the computation intensive algorithms in software. Since software execution is based on a very limited hardware register set of the CPU, such software implementations may be highly inefficient since the computation intensive algorithms may involve a plurality of arithmetic and/or logic operations over massive amounts of data which the limited hardware register set may be incapable to effectively handle. The multi-processing architecture integrating the hardware acceleration, on the other hand, may overcome this processing time limitation through the deployment of the plurality of worker processing units (processing pipelines) each utilizing its own respective hardware accelerator which is specifically customized and/or optimized for execution of the specific computation intensive algorithm currently executed.
Other existing architectures may be based on one or more hardware acceleration and/or vector processing architectures, for example, Graphic Processing Unit (GPU) and/or the like. However, such hardware acceleration architectures may typically be hard coded and thus very static, making them incapable of efficiently adapting to each specific computation intensive algorithm. In order to overcome this limitation, the existing devices and methods may deploy massive hardware acceleration comprising a plurality of processing pipelines which may be thus capable of executing the computation intensive algorithms but at the expense of an extremely high power consumption. This is in contrast to the multi-processing architecture integrating the hardware acceleration where each hardware accelerator specifically customized and/or optimized for efficiently executing the respective computation intensive algorithm may comprise only the hardware elements required (or essential) for the respective computation intensive algorithm and may thus significantly reduce the power consumption of the hardware accelerator and thus of the overall power consumption of the integrated circuit. Since each of the workers may be associated with a plurality of such hardware accelerators each optimized for a respective computation intensive algorithm, the workers may select for each computation session the most suitable hardware accelerator for the currently executed computation intensive algorithm. However, while a plurality of hardware accelerators are deployed in the integrated circuit, only one of these hardware accelerators may be actually put to use (activated) at any given time thus preventing redundant power consumption.
Moreover, since some flexibility is required for the execution of at least some of the computation intensive algorithms, the master and the workers each having a processing unit may execute one or more code segments to provide this flexibility which may be lacking in existing solutions based on hardware acceleration alone.
Furthermore, allocating to each worker dedicated memory resources, for example, the ICCM and the DCCM for private use by the respective worker independently and separately from all other workers may significantly increase memory availability, accessibility and/or reduce access latency for the workers to access their private memory array(s) thus further increasing the execution performance of the computation intensive algorithms.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer program code comprising computer readable program instructions embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire line, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
The computer readable program instructions for carrying out operations of the present invention may be written in any combination of one or more programming languages, such as, for example, assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Referring now to the drawings, FIG. 1 is a schematic illustration of an exemplary integrated circuit integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention.
An integrated electronic circuit 100, for example, a component, an ASIC, an FPGA and/or the like may integrate a master processing unit 102 (interchangeably designated master herein after) and a plurality of worker (slave) processing units 104 (interchangeably designated workers herein after), for example, a worker 1 104-1, a worker 2 104-2 and so on to worker N 104-N.
The integrated circuit 100, specifically the master processing unit 102 and the worker processing units 104 may be implemented using one or more CPU models, hardware architectures and ISAs as known in the art, for example, ARC® architecture by Synopsys®, Core® architecture by Intel®, ARM architecture and/or the like. In particular, the master processing unit 102 and the worker processing units 104, which typically share similar architectures, may be integrated in the integrated circuit 100 as Intellectual Property (IP) cores.
As such, each of the master 102 and the workers 104 may execute one or more software modules, for example, a process, a script, an application, an agent, a utility, a tool, an Operating System (OS) and/or the like, each comprising a plurality of program instructions stored in a non-transitory medium (program store). The processing architecture of the master processing unit 102 and the worker processing units 104 may be selected to provide high performance computing combined with flexible operation modes and ability to efficiently connect to peripheral modules via high speed interconnection channels, buses and/or fabrics. For example, the ARC® HS Family ultra-compact processing cores which include high performance, DSP-enhanced processors are characterized by excellent code density, small size and very low power consumption, making them ideal for power-critical and area-sensitive embedded and deeply embedded applications. Moreover, the ARC® HS cores are equipped with high speed interfaces, ports and/or other provisions for connecting to multiple diverse and high speed memory and/or peripheral cores. The ARC® HS cores further support APEX technology for integration with additional, optionally custom, hardware modules, for example, hardware accelerators customized and configured for efficient, high performance processing of data according to specific algorithms, thus dramatically boosting performance and/or reducing power consumption. The fact that the ARC® HS Family is described herein in more detail should not be construed as limiting, since other architectures as known in the art and/or will become known may be applied to utilize the master 102 and/or the workers 104.
The master 102 may be coupled to a closely coupled memory exclusively associated with the master 102. The closely coupled memory may include one or more memory arrays, for example, a Closely Coupled Instruction Memory (ICCM) 106-M for storing instructions (program code) and a Closely Coupled Data Memory (DCCM) 108-M for storing data. The closely coupled ICCM 106-M and DCCM 108-M may be connected to the master 102 via high-speed interface(s), optionally dedicated interface(s), such that the data transfer rate between the master 102 and the ICCM 106-M and/or the DCCM 108-M for read and/or write operations may be extremely high. The master 102 may further include one or more cache memory arrays 110-M, for example, a data cache and/or an instruction cache.
As described for the master 102, each of the workers 104 may be coupled to a closely coupled memory exclusively associated with the respective worker 104. For example, the worker 104-1 may be closely coupled to an ICCM 106-1 and a DCCM 108-1, the worker 104-2 may be closely coupled to an ICCM 106-2 and a DCCM 108-2 and so on to the worker 104-N which may be closely coupled to an ICCM 106-N and a DCCM 108-N. As described for the master 102, each closely coupled ICCM 106-i and DCCM 108-i (i between 1 and N) may be connected to the respective worker 104-i via high-speed and optionally dedicated interface(s) such that the data transfer rate between the respective worker 104-i and the respective ICCM 106-i and/or DCCM 108-i for read and/or write operations may be extremely high. Each of the workers 104 may further include one or more cache memory arrays 110, for example, a data cache and/or an instruction cache. For example, the worker 104-1 may include a cache 110-1, the worker 104-2 may include a cache 110-2 and so on to the worker 104-N which may include a cache 110-N.
One or more operational parameters of the ICCMs 106, the DCCMs 108 and/or the caches 110 of the master 102 and/or of the workers 104, for example, speed, size (density), arrangement (width) and/or the like may be typically set, defined and/or selected according to one or more operational parameters of the integrated circuit 100, for example, an architecture of the master 102 and/or the workers 104 (32-bit, 64-bit, etc.), port width of the master 102 and/or the workers 104 and/or the like. However, one or more of the operational parameters of the ICCMs 106, the DCCMs 108 and/or the caches 110 may be further set, defined and/or selected according to one or more characteristics (needs) of the application and/or algorithms executed using the integrated circuit 100, for example, performance, power consumption, footprint and/or the like.
Each of the workers 104 may be further coupled to a respective set 112 of hardware (HW) accelerators 114 each configured and optimized for executing and/or supporting execution of a respective one of a plurality of computation intensive algorithms. The hardware accelerators 114 are hardware units comprising one or more functional modules implemented at gate level and optimized for conducting computations relating to the computation intensive algorithms via hardware (electronic) resources with little and typically no software intervention. Each hardware accelerator 114 may be therefore designed and constructed to include hardware circuits, elements, blocks and/or modules customized and optimized for executing the respective algorithm.
As seen in the integrated circuit 100, while the hardware accelerators sets 112 may be typically similar for all the workers 104, each of the workers 104 is coupled to a respective set 112 of hardware accelerators 114 exclusively associated with the respective worker 104 such that each hardware accelerator 114 of the respective set 112 may be accessed and used only by the respective associated worker 104. For example, a hardware accelerators set 112-1 comprising L hardware accelerators 114-11, 114-12 through 114-1L may be exclusively associated with and coupled to the worker 104-1, a hardware accelerators set 112-2 comprising L hardware accelerators 114-21, 114-22 through 114-2L may be exclusively associated with and coupled to the worker 104-2 and so on to a hardware accelerators set 112-N comprising L hardware accelerators 114-N1, 114-N2 through 114-NL which may be exclusively associated with and coupled to the worker 104-N.
One or more of the hardware accelerators 114 may be configured for computation intensive algorithms directed to general High Performance Computing (HPC) for a plurality of applications in a plurality of market segments, for example, Banking, financial services and insurance (BFSI), Information Technology (IT), Telecom, Military & Aerospace, Healthcare to name a few. Such applications may include, for example, ML applications, ML platform, smart robots, computer vision and video recognition, Natural Language Processing (NLP) and speech recognition, recommendation engines, virtual assistants, gesture control and many more.
However, one or more of the hardware accelerators 114 may be configured and optimized for more application specific computations. Such application specific computation intensive algorithms may include, for example, one or more cryptography and/or encryption algorithms, for example, Advanced Encryption Standard (AES), Keccak, Secure Hash Algorithm (e.g. SHA-2, SHA-3) and/or the like. One or more of the hardware accelerators 114 may be further configured and optimized for even more specific computation algorithms, for example, block hash computation algorithms for one or more blockchains, such as, for example, cryptocurrency blockchains (e.g. Ethereum, Zcash, Monero, etc.), smart contracts blockchains and/or the like. Such hash computation algorithms may include, for example, Keccak, Blake2b, Ethash, ProgPOW, Equihash, RandomX and/or the like. Moreover, one or more of the hardware accelerators 114 may be configured and optimized for utilizing one or more ML models, for example, a neural network, an SVM and/or the like. Furthermore, one or more of the hardware accelerators 114 may be configured and optimized for training one or more of the ML models.
The hardware accelerators 114 may be implemented and/or integrated in the integrated circuit 100 using one or more methods, techniques and/or implementations as known in the art, for example, a hard IP core, a soft IP core, an extension to the instruction set of the worker processing units 104 and/or the like. Continuing the previous example of implementing the workers 104 using the ARC® HS cores, the hardware accelerators 114 may be implemented using the APEX extension of the ARC® HS architecture to expand the instruction set of the ARC® HS processing cores to further support the instructions, address mapping, additional resources and/or the like required to realize the hardware accelerators 114. In another example, the hardware accelerators 114 may be implemented as IP cores attached to their respective associated worker 104 via one or more interconnection channels, for example, a bus, a link, a port and/or the like.
The integrated circuit 100 may include a memory controller 122 for connecting to one or more external memory arrays, typically high capacity memory, for example, a global (system) memory 150 implemented by one or more RAM devices, for example, Double Data Rate (DDR) 4, DDR5 and/or the like. Optionally, the global memory 150 and/or part thereof is integrated within the integrated circuit 100.
The integrated circuit 100 may further include one or more Direct Memory Access (DMA) engines 124 for transferring data to/from the global memory 150 from/to one or more of the functional modules integrated in the integrated circuit 100, for example, the master 102 and/or one or more of the workers 104. One or more of the DMA engines 124 may be further used for transferring data among the functional modules integrated in the integrated circuit 100, for example, from/to the master 102 to/from one or more of the workers 104, between workers 104 and/or the like.
The integrated circuit 100 may also include one or more peripheral functional modules 126, for example, a serial communication controller configured to communicate via one or more serial communication channels (e.g. RS-232, RS-422, RS-485, etc.), a network controller configured to communicate via one or more networks (e.g. LAN, WAN, WLAN, etc.), a Joint Test Action Group (JTAG) controller configured to connect to one or more testing and/or debugging channels and/or tools and/or the like.
To support connectivity between the functional modules and IP cores integrated in it, the integrated circuit 100 may include one or more interconnection channels 120, for example, a bus, a switch fabric, a link, a port and/or the like. The interconnection channel(s) may be implemented using one or more architectures, modes and/or protocols as known in the art and/or proprietary interconnection channel(s), for example, parallel channels comprising multiple address and/or data lines (e.g. AXI, etc.), serial channels (e.g. PCI Express, etc.) and/or the like. As they are implemented within the integrated circuit 100, one or more of the interconnection channels 120 may typically be very high speed channels supporting high bandwidth and low latency.
The interconnection channels 120 may be deployed in the integrated circuit 100 to support efficient connectivity. For example, a first bus 120A may be deployed in the integrated circuit 100 to connect between the master 102 and the plurality of workers 104. Via the first bus 120A, the master 102 may communicate with one or more of the workers 104. Moreover, via the first bus 120A, the master 102 may access one or more resources of one or more of the workers 104, for example, the ICCM(s) 106, the DCCM(s) 108, registers of the worker(s) 104 and/or the like. Optionally, one or more of the workers 104 may access, via the first bus 120A, one or more resources of the master 102, for example the DCCM 108-M. More optionally, via the first bus 120A, one or more of the workers 104 may access one or more of the resources of one or more other workers 104. In another example, a second bus 120B may be deployed in the integrated circuit 100 to connect the master 102 and the plurality of workers 104 to the other functional modules integrated in the integrated circuit 100, for example the memory controller 122, the DMA engine(s) 124, the peripherals 126 and/or the like. As such, via the memory controller 122 connected to the second bus 120B, the master 102 and/or one or more of the workers 104 may access the global system memory 150 for reading (fetching) data from the global system memory 150 and/or for writing (storing) data to the global system memory 150. One or more operational parameters of the interconnection channels, for example, connected nodes (functional modules), number of lines, timing, protocol and/or the like may be predefined during the design, synthesis and/or build of the integrated circuit 100.
Reference is now made to FIG. 2, which is a flowchart of exemplary processes executed by a master processing unit and a plurality of worker processing units of an integrated device to enhance performance of computation intensive algorithms, according to some embodiments of the present invention.
An exemplary process 200 may be executed by a master processing unit such as the master processing unit 102 for managing a distributed processing of a task by a plurality of worker processing units such as the worker processing units 104. The master 102 may distribute a plurality of data segments to the workers 104 and receive partial outcomes computed in parallel by the workers 104, each for a respective one of the data segments. The master 102 may then aggregate the partial outcomes to produce a combined outcome which may be output. The process 200 may thus be regarded as a high level process carried out, for example, by a high level application executed by the master 102 for managing the data processing and computations carried out by the workers 104.
An exemplary process 250 may be executed by each worker 104-i of the plurality of workers 104 (i = 1, 2, . . . , N) to process a respective one of the plurality of data segments received from the master 102 and compute a respective partial outcome which is transferred back to the master 102. In particular, since each of the workers 104 is exclusively associated with a respective set such as the set 112 of hardware accelerators such as the hardware accelerators 114, each of the workers 104 may compute the respective partial outcome using one or more hardware accelerators which are exclusively associated with the respective worker 104 and are thus used solely by the respective worker 104.
As shown at 202, the process 200, representing a computation session, starts with the master 102 receiving input data relating to a respective (certain) one of a plurality of algorithms, specifically computation intensive algorithms such as, for example, HPC algorithms, cryptography algorithms and/or encryption algorithms such as, for example, AES, Keccak, SHA (e.g. SHA-2, SHA-3) and/or the like. Moreover, one or more of the computation intensive algorithms may be specific and/or customized for one or more blockchain applications, specifically for computing hash values for blocks added in one or more blockchains, for example, a cryptocurrency blockchain (e.g. Ethereum, Zcash, Monero, etc.), a smart contracts blockchain and/or the like. As shown at 204, the master 102 may instruct each of the plurality of workers 104 to prepare one of their associated hardware accelerators 114 for executing the certain algorithm, specifically for processing the input data according to the certain algorithm.
Since each of the plurality of hardware accelerators 114 is configured and/or optimized for executing and/or supporting execution of respective one of the plurality of computation intensive algorithms, the master 102 may select the hardware accelerator 114 to be used by the workers 104 according to the certain algorithm which needs to be applied during the current session. For example, assuming the input data relates to an encryption algorithm, for example, AES, the master 102 may instruct the workers 104 to prepare a respective hardware accelerator 114 specifically adapted for AES computation. In another example, assuming the input data relates to computing a HASH value for a block of a certain blockchain, for example, Ethereum, the master 102 may instruct the workers 104 to prepare a respective hardware accelerator 114 specifically adapted for Ethash computation. In another example, assuming the input data relates to computing a HASH value for a block of another blockchain, for example, Zcash, the master 102 may instruct the workers 104 to prepare a respective hardware accelerator 114 specifically adapted for Equihash computation.
The master 102 may optionally configure the operation mode of the hardware accelerator 114 of one or more of the workers 104. For example, the instructions of the master 102 to one or more of the workers 104 may include one or more configuration parameters, for example, an initial state, an execution mode, an execution parameter and/or the like. The master 102 may naturally instruct all the workers 104 to utilize their respective associated selected hardware accelerator 114 using the same configuration parameters. However, the master 102 may optionally instruct one or more of the workers 104 to utilize their respective hardware accelerators 114 according to one or more different (respective) configuration parameters. For example, the master 102 may instruct a first worker 104, for example, the worker 104-1 to utilize its associated selected hardware accelerator 114, for example, the hardware accelerator 114-11 using a certain configuration parameter while instructing a second worker 104, for example, the worker 104-2 to utilize its associated selected hardware accelerator 114-21 using another configuration parameter which is different from the certain configuration parameter.
Optionally, the master 102 may instruct one or more of the workers 104 to prepare different hardware accelerators 114 for processing the input data. For example, the master 102 may instruct the worker 104-1 to prepare a first hardware accelerator 114 of the set 112-1 associated with the worker 104-1 while instructing the worker 104-2 to prepare a second hardware accelerator 114 of the set 112-2 associated with the worker 104-2. Optionally, the master 102 may determine that only a subset of the plurality of workers 104 may be used for the current computation session. The subset, specifically the number of workers 104 of the subset, may be determined according to one or more execution parameters defined for the current computation session. The execution parameters may define one or more aspects, attributes, constraints and/or the like for the current computation session. For example, one or more of the execution parameters may define a tradeoff between a performance of the execution of the certain algorithm, i.e. of processing the input data according to the certain algorithm, and a power consumption of the integrated electronic circuitry 100 required for processing the input data according to the certain algorithm. For example, a certain execution rule may define that a certain maximum power consumption must not be exceeded. The master 102 may determine that in order not to exceed the maximum power consumption, only a subset comprising 4 workers 104 out of 12 available workers 104 may be utilized for processing the input data according to the certain algorithm. In another example, a certain execution rule may define that the overall computation time must not exceed a certain maximum computation time. The master 102 may determine that in order not to exceed the maximum computation time, a subset comprising 10 workers 104 of the 12 available workers 104 must be utilized.
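For illustration only, the following C sketch shows one way the master 102 might derive the size of such a worker subset from per-worker power and throughput figures. The constants, the helper name and the selection rule are assumptions made for the sake of the example and are not part of the described circuit.

#include <stddef.h>

/* Hypothetical per-worker figures; actual values depend on the synthesized circuit. */
#define WORKER_POWER_MW    120u   /* assumed active power per worker plus accelerator */
#define WORKER_RATE_MBPS    80u   /* assumed per-worker processing rate               */
#define TOTAL_WORKERS       12u

/* Return the largest worker subset that stays within the power budget while still
 * meeting the required aggregate throughput, or 0 if no such subset exists. */
static size_t select_worker_subset(unsigned power_budget_mw, unsigned required_mbps)
{
    size_t max_by_power = power_budget_mw / WORKER_POWER_MW;
    if (max_by_power > TOTAL_WORKERS)
        max_by_power = TOTAL_WORKERS;

    size_t min_by_rate = (required_mbps + WORKER_RATE_MBPS - 1) / WORKER_RATE_MBPS;

    return (min_by_rate <= max_by_power) ? max_by_power : 0;
}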
Naturally, since during different computation sessions, the input data received by the master 102 may relate to different algorithms, the master 102 may instruct the workers 104 to use different (other) hardware accelerators 114 of the set 112 accordingly. As such, the respective hardware accelerator 114 selected by the master 102 for each computation session may be selected according to the input data received in the respective session, specifically according to the algorithm to which the input data relates.
The master 102 may apply one or more methods, modes, provisions and/or the like as known in the art for instructing the workers 104 to prepare their selected hardware accelerator(s) 114. For example, the master 102 may directly communicate with one or more of the workers 104, for example, via the first bus 120A, to transmit the instructions via one or more communication and/or messaging channels established between the master 102 and the workers 104. In another example, the master 102 may access one or more registers of one or more of the workers 104 to load the instructions. In another example, the master 102 may access the ICCM 106 of one or more of the workers 104 to load one or more code segments comprising the instructions and may instruct each respective worker 104 to execute the code segment(s) from its respective ICCM 106 such that when executing the code segment(s) loaded by the master 102 the respective worker 104 may apply the instructions. As shown at 252 of the process 250 executed by each of the workers 104, each worker 104 may prepare its respective hardware accelerator 114 as instructed by the master 102.
Preparing the hardware accelerators 114 may be done according to the implementation of the selected hardware accelerator 114. For example, assuming the selected hardware accelerator 114 is implemented using a dedicated instruction set which may be loaded into the worker 104, the master 102 may load the dedicated instruction set to the ICCM 106 of each of the plurality of workers 104. Continuing the previous example, assuming the selected hardware accelerator 114 is implemented using the APEX extension, each worker 104 may load the APEX extension instruction set of the selected hardware accelerator 114. In another example, where the selected hardware accelerator 114 is implemented as a dedicated IP core, each worker 104 may load the dedicated IP core from persistent storage into a gate level logic array of the respective worker 104.
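For illustration only, the following C sketch shows one possible form of the preparation command the master 102 might issue to each worker 104, assuming a hypothetical mailbox region in each worker's address space. The command layout, the mailbox addresses and the field names are assumptions; the actual mechanism (register write, ICCM code load, messaging channel and/or the like) is implementation specific.

#include <stdint.h>

/* Hypothetical command layout written by the master into each worker's mailbox. */
typedef struct {
    uint32_t accel_id;     /* which accelerator 114 of the set 112 to prepare   */
    uint32_t init_state;   /* example configuration parameter: initial state    */
    uint32_t exec_mode;    /* example configuration parameter: execution mode   */
} accel_cmd_t;

/* Assumed mailbox base address and per-worker stride. */
#define WORKER_MAILBOX(i)  ((volatile accel_cmd_t *)(0x40000000u + (i) * 0x1000u))

static void master_prepare_workers(uint32_t accel_id, uint32_t init_state,
                                   uint32_t exec_mode, unsigned n_workers)
{
    for (unsigned i = 0; i < n_workers; ++i) {
        volatile accel_cmd_t *mb = WORKER_MAILBOX(i);
        mb->accel_id   = accel_id;    /* same accelerator selected for all workers ... */
        mb->init_state = init_state;  /* ... while parameters could differ per worker  */
        mb->exec_mode  = exec_mode;
    }
}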
As shown at 206, the master 102 may segment the input data to a plurality of data segments.
Specifically, the master 102 may segment the input data to the plurality of data segments according to the certain algorithm such that the data segments are at least partially independent of each other and may be thus processed simultaneously. For example, assuming the input data relates to a certain encryption algorithm and further assuming the input data is composed of a plurality of separate data blocks. In such case, the master 102 may segment the input data to the plurality of separate data blocks which may be each encrypted independently according to the certain encryption algorithm. In another example, assuming the input data is arranged in columns and/or rows which may be each processed independently followed by an aggregation of the outcomes of the processed columns and/or rows. In such case, the master 102 may segment the input data to the plurality of columns and/or the plurality of rows.
Optionally, the plurality of data segments may be identical. For example, in some scenarios the same input data may need to be processed according to a certain algorithm using one or more different algorithm configuration parameters, for example, initial state, execution mode, execution parameter and/or the like. In such case the plurality of data segments created by the master 102 may represent the same data, which may practically include the input data. In another example, the same input data may need to be processed according to a plurality of different algorithms. In such case the plurality of data segments created by the master 102 may also represent the same data, which may practically include the input data.
As shown at 208, the master 102 may distribute the plurality of data segments of the input data to the plurality of workers 104 such that each of the workers 104 may receive a respective one of the data segments. In case the master 102 determined that only a subset of the plurality of workers 104 is used for the current computation session, the master 102 may distribute the data segments only to the workers 104 of the selected subset.
The master 102 may employ one or more methods, techniques, modes and/or engines for distributing the data segments to the workers 104. For example, the master 102 may write the respective data segment to the DCCM 108 of each of one or more of the workers 104. In another example, the master 102 may store the plurality of data segments in its DCCM-M 108-M and/or in the global memory 150 and may transmit to each of the workers 104 a respective pointer pointing to an address in the DCCM-M 108-M and/or in the global memory 150 where the respective data segment assigned to the respective worker 104 is stored. Each of the workers 104 may then access the DCCM-M 108-M and/or the global memory 150 to fetch its respectively assigned data segment. In another example, the master 102 may store the plurality of data segments in its DCCM-M 108-M and/or in the global memory 150 and may operate one or more of the DMA engines 124 to transfer the data segments from the DCCM-M 108-M and/or the global memory 150 to the workers 104, for example, to the DCCMs 108 of the workers 104.
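For illustration only, the following C sketch shows the first of these options, namely the master 102 copying one segment into each worker's private DCCM 108. The base address, stride and offset are assumptions for the sake of the example; handing out pointers or programming a DMA engine 124 would be equally valid variants.

#include <stdint.h>
#include <string.h>

/* Hypothetical view of each worker's DCCM as seen from the master over the bus;
 * the base address, per-worker stride and segment offset are assumptions. */
#define WORKER_DCCM_BASE(i)   ((uint8_t *)(0x50000000u + (i) * 0x20000u))
#define SEGMENT_OFFSET        0x100u

/* Copy one data segment of the input data into each worker's private DCCM. */
static void distribute_segments(const uint8_t *input, size_t seg_len, unsigned n_workers)
{
    for (unsigned i = 0; i < n_workers; ++i)
        memcpy(WORKER_DCCM_BASE(i) + SEGMENT_OFFSET, input + i * seg_len, seg_len);
}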
The master 102 may further transfer one or more additional data items which may be required for the workers 104, specifically for their respective hardware accelerators 114, to process their respective data segments. The additional data may include, for example, one or more common values such as, for example, a header, a nonce and/or the like which may be required for executing the certain algorithm by all workers 104 regardless of the respective data segments.
As shown at 254, each of the workers 104 may receive its respective data segment from the master 102.
As shown at 256, each of the workers 104 may compute a respective partial outcome by processing its respective data segment using its exclusively associated hardware accelerator 114 as instructed by the master 102 according to the certain algorithm. As described in step 252, in response to the instruction from the master 102, each worker 104 prepared its respective selected hardware accelerator 114 which is uniquely associated with and hence solely used by the respective worker 104. Each worker 104 may therefore use its respective hardware accelerator 114 to process its respective data segment according to the certain algorithm.
As shown at 258, each of the workers 104 may transfer its respective partial outcome to the master 102.
As described for the distribution of the data segments by the master 102, each of the workers 104 may employ one or more methods, techniques, modes and/or engines for transferring its respective partial outcome to the master 102. For example, one or more of the workers 104 may write their respective partial outcomes to the DCCM-M 108-M of the master 102. In another example, one or more of the workers 104 may store their respective partial outcomes in their respective DCCMs 108 and/or in the global memory 150 and may transmit to the master 102 pointers pointing to an address in the DCCM 108 and/or in the global memory 150 where the respective partial outcomes are stored. The master 102 may then access the DCCMs 108 and/or the global memory 150 to fetch the partial outcomes. In another example, one or more of the workers 104 may store their partial outcomes in their DCCMs 108 and/or in the global memory 150 and may operate one or more of the DMA engines 124 to transfer the partial outcomes from the DCCMs 108 and/or from the global memory 150 to the master 102, for example, to the DCCM-M 108-M of the master 102.
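For illustration only, the following C sketch outlines the worker-side counterpart of steps 256 and 258: processing the private data segment with the prepared accelerator and writing the partial outcome into the master's DCCM-M 108-M. The helper names, the result address and the outcome size are assumptions for the sake of the example.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers wrapping the accelerator's register interface or
 * extension instructions, and the completion notification to the master. */
extern void accel_process(const uint8_t *segment, size_t len, uint8_t *partial_out);
extern void notify_master_done(unsigned worker_id);

#define MASTER_DCCM_RESULTS  ((volatile uint8_t *)0x60000000u)  /* assumed address      */
#define PARTIAL_OUT_LEN      32u                                /* assumed, e.g. a hash */

static void worker_compute_and_return(unsigned worker_id,
                                      const uint8_t *segment, size_t seg_len)
{
    uint8_t partial[PARTIAL_OUT_LEN];

    /* Process the private data segment with the exclusively associated accelerator. */
    accel_process(segment, seg_len, partial);

    /* Write the partial outcome into the master's DCCM-M 108-M ... */
    for (unsigned b = 0; b < PARTIAL_OUT_LEN; ++b)
        MASTER_DCCM_RESULTS[worker_id * PARTIAL_OUT_LEN + b] = partial[b];

    /* ... and signal completion (interrupt, status flag, message and/or the like). */
    notify_master_done(worker_id);
}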
As shown at 210, the master 102 may receive the partial outcomes from the workers 104.
As shown at 212, the master 102 may compute a combined outcome which aggregates the plurality of partial outcomes received from the workers 104.
For example, assuming the workers 104 each computed an encrypted value for a respective data block (data segment) of the input data according to a certain encryption algorithm, for example, AES, the master 102 may aggregate the encrypted values to produce a combined value for the input data. In another example, assuming the workers 104 each computed a hash value for a respective row of data (data segment) of the input data according to a certain hash function (algorithm), for example, Keccak, the master 102 may aggregate the plurality of hash values to produce a combined hash value for the input data.
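For illustration only, the following C sketch shows the simplest possible aggregation, namely concatenating the per-worker partial outcomes into a single combined buffer. The actual aggregation rule depends on the certain algorithm (for example, concatenation of encrypted blocks or a final hash round over the partial hash values), and the outcome size is an assumption.

#include <stdint.h>
#include <string.h>

#define PARTIAL_OUT_LEN 32u   /* assumed per-worker partial outcome size */

/* Concatenate the partial outcomes received from the workers into one combined outcome. */
static void aggregate_outcomes(const uint8_t *partials, unsigned n_workers,
                               uint8_t *combined)
{
    for (unsigned i = 0; i < n_workers; ++i)
        memcpy(combined + i * PARTIAL_OUT_LEN,
               partials + i * PARTIAL_OUT_LEN,
               PARTIAL_OUT_LEN);
}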
Computing the combined outcome may sometimes require the master 102 to perform additional computations over the partial outcomes and/or part thereof.
Optionally, the process 200, the process 250 and/or parts therefrom are repeated in a plurality of iterations to compute the combined outcome.
For example, the master 102 may initiate a plurality of iterations of the process 200, specifically steps 206-212, to process the input data after splitting it into multiple portions (chunks). In such case, in each iteration, the master 102 may segment a respective one of the data portions to a plurality of data segments and may distribute the data segments to the workers 104. Each worker 104, in turn, may compute in each iteration a respective outcome for its respective data segment of the respective data portion and may transfer its respective partial outcome back to the master 102. As such, during each iteration the plurality of workers 104 may compute a respective subset of a plurality of partial outcomes relating to the data segments of the respective data portion. In each iteration, the master 102 may receive a respective subset of partial outcomes and may compute a respective combined outcome for the respective data portion by aggregating the respective subset of partial outcomes. After the plurality of iterations are complete, the master 102 may construct an overall outcome for the input data by aggregating the respective combined outcomes computed in the plurality of iterations.
In another example, the process 200 may include a single path (single iteration) while the process 250 may comprise a plurality of iterations. In such case, the master 102 may segment the input data to the plurality of data segments distributed to the plurality of workers 104. However, one or more of the workers 104 may repeat the process 250, specifically step 256, to compute their respective partial outcome in a plurality of iterations. For example, the certain algorithm executed by the selected hardware accelerator 114 of each worker 104 may execute in a plurality of iterations where in each iteration the input data to the hardware accelerator 114 is the partial outcome, a part thereof and/or a derivative thereof, computed in the previous iteration. After the plurality of iterations are complete, each worker 104 may transfer its final partial outcome to the master 102.
As described herein before, due to its unique architecture, specifically, the multitude of workers 104 each exclusively associated with respective memory (ICCM 106, DCCM 108) and respective highly optimized hardware accelerators 114, the integrated circuit 100 may be applied for efficiently processing data according to a plurality of computation intensive algorithms employed for a plurality of applications, for example, HPC, cryptography, encryption, decryption, blockchain applications, ML model utilization, ML model training and/or the like.
For example, the integrated circuit 100 may be applied for computing a hash value for a block that records transactions of one or more altcoin cryptocurrencies in a blockchain concatenating a plurality of such blocks in order to ensure security, irreversibility and immutability of the blockchain. Such altcoin cryptocurrencies may include, for example, Ethereum, Zcash, Monero and/or the like.
Reference is now made to FIG. 3, which is a schematic illustration of building blocks of a Keccak algorithm. The Keccak algorithm, as known in the art, is a variant of SHA-3 which it chronologically precedes, and is based on a novel approach known as sponge construction. Sponge construction is based on a wide random function and/or random permutation, and allows inputting ("absorbing" in sponge terminology) any amount of data, and outputting ("squeezing") any amount of data, while acting as a pseudorandom function with regard to all previous inputs.
An exemplary hardware accelerator such as the hardware accelerator 114 optimized for executing the Keccak algorithm may include a padding block 302 and a permutation block 304 parameterized to enable maximum block size of 1088 bits.
Denoting c as a capacity of the Keccak algorithm, b as a block width (bits) and r as block size (rate) of the Keccak algorithm, the capacity of the Keccak algorithm may be represented by c = b - r. For Keccak 256 (c = 256) the block size parameter r is 1088, while for Keccak 512 (c = 512) the block size is 576. The exemplary implementation may enable both 64 bits and 32 bits encoding. Basic implementation may support Keccak-f[1600], in which parameter b is set to 1600. However, Keccak-f[800], with b = 800, may also be used by the exemplary block configuration.
The size of the input to the padding block 302 is 64 bit and the output of the padding block 302 may be 1088 or 576 bits. The size of the input to the permutation block 304 may be 1088 or 576 bits and the output of the permutation block 304 may be 1600 or 800 bits according to the above configuration description.
An integrated circuit such as the integrated circuit 100 may include a plurality of Keccak hardware accelerators 114 each associated with a respective one of the plurality of workers 104. The Keccak hardware accelerators 114 which may be implemented using the APEX interface may include one or more registers for Hardware Software Interface (HSI).
The registers may include, for example, a 64-bit SOURCE1 register which may be set by a software driver of the Keccak hardware accelerator 114 for controlling the Keccak hardware accelerator 114:
(The bit-field layout of the SOURCE1 register is provided as table figures in the original publication.)
The registers may also include a 64-bit SOURCE2 register for storing the input data, specifically the respective data segment distributed to the respective worker 104.
The registers may further include one or more registers accessible via the SOURCE1 register, for example:
- NAME0 register at address 0x00 containing a first part, CORE_NAME0, of the Keccak hardware accelerator 114 IP core, for example, "KECK" (0x4b45434b). This register is a 32-bit, read only register.
- NAME1 register at address 0x01 containing a second part, CORE_NAME1, of the Keccak hardware accelerator 114 IP core, for example, "e2" (0x61312020). This register is a 32-bit, read only register.
- VERSION register at address 0x02 containing a version, CORE_VERSION, of the Keccak hardware accelerator 114 IP core, for example, "0.10" (0x302e3130). This register is a 32-bit, read only register.
- ADDR_STATUS register at address 0x09 storing status of the Keccak hardware accelerator 114 IP core, for example:
BIT [01]: digest_valid_reg
BIT [00]: ready_reg
This register is a 32-bit, read only register.
- ADDR_BLOCK_W00 register at address 0x10 storing the address of the data of the block input.
- ADDR_DIGEST0 register at address 0x80 storing first 64-bit of the data output of the Keccak hardware accelerator 114.
- ADDR_DIGEST7 register at address 0x87 storing last 64-bit of the data output of the Keccak hardware accelerator 114.
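For illustration only, the following C sketch shows how a worker-side driver might poll the register map listed above. The base address, the register stride and the access path (for example, via the SOURCE1/SOURCE2 registers) are assumptions for the sake of the example.

#include <stdint.h>

/* Register offsets as listed above; base address and stride are assumptions. */
#define KECCAK_BASE          0x70000000u
#define REG_ADDR_STATUS      0x09u
#define REG_ADDR_BLOCK_W00   0x10u
#define REG_ADDR_DIGEST0     0x80u
#define REG_ADDR_DIGEST7     0x87u

#define STATUS_READY         (1u << 0)   /* BIT [00]: ready_reg        */
#define STATUS_DIGEST_VALID  (1u << 1)   /* BIT [01]: digest_valid_reg */

static inline uint32_t keccak_read32(uint32_t reg)
{
    /* Assumed memory-mapped access with an 8-byte register stride. */
    return *(volatile uint32_t *)(KECCAK_BASE + reg * 8u);
}

/* Busy-wait until the accelerator reports a valid digest; an interrupt-driven
 * driver would be an equally valid implementation choice. */
static void keccak_wait_digest(void)
{
    while ((keccak_read32(REG_ADDR_STATUS) & STATUS_DIGEST_VALID) == 0)
        ;
}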
Employing the plurality of workers 104 to simultaneously process segments of the input data each using its respective Keccak hardware accelerator 114 may significantly enhance performance of the Keccak algorithm execution by the order of the number of workers 104 utilized for the parallel computation where each worker 104 may process a respective block (segment) of the input data.
Reference is now made to FIG. 4A and FIG. 4B, which are schematic illustrations of building blocks of a Blake2b algorithm which is, as known in the art, a cryptographic hash function configured to compress and mix input data.
An exemplary hardware accelerator such as the hardware accelerator 114 optimized for executing the Blake2b algorithm may implement 12 mix rounds as shown in FIG. 4B, with 4 mixers per round as shown in FIG. 4A. The Blake2b hardware accelerator may therefore complete the entire processing pipeline as follows: 4 [clks] * 2 [mix] * 12 [rounds] + 2 [clks output] = 98 clocks.
An integrated circuit such as the integrated circuit 100 may include a plurality of Blake2b hardware accelerators 114 each associated with a respective one of a plurality of workers such as the plurality of workers 104. The Blake2b hardware accelerators 114 which may be implemented using the APEX interface may include one or more registers for HSI. The registers may include, for example, a 64-bit SOURCE1 register which may be set by a software driver of the Blake2b hardware accelerator 114 for controlling the Blake2b hardware accelerator 114:
(The bit-field layout of the SOURCE1 register is provided as a table figure in the original publication.)
The registers may also include a 64-bit SOURCE2 register for storing the input data, specifically the respective data segment distributed to the respective worker 104.
The registers may further include one or more registers accessible via the SOURCE1 register, for example:
- NAME0 register at address 0x00 containing a first part, CORE_NAME0, of the Blake2b hardware accelerator 114 IP core, for example, "blak" (0x626c616b). This register is a 32-bit, read only register.
- NAME1 register at address 0x01 containing a second part, CORE_NAME1, of the Blake2b hardware accelerator 114 IP core, for example, "e2" (0x61312020). This register is a 32-bit, read only register.
- VERSION register at address 0x02 containing a version, CORE_VERSION, of the Blake2b hardware accelerator 114 IP core, for example, "0.10" (0x302e3130). This register is a 32-bit, read only register.
- ADDR_CTRL register at address 0x04. This is a 32-bit register comprising the following fields:
BITs [23:16]: core_digest_len_new
BITs [15:04]: kMsgLen_new
BIT [02]: CTRL_FINAL_BIT
BIT [01]: CTRL_NEXT_BIT
BIT [00]: CTRL_INIT_BIT
- ADDR_STATUS register at address 0x09 storing status of the Blake2b hardware accelerator 114 IP core, for example:
BIT [01]: digest_valid_reg
BIT [00]: ready_reg
This register is a 32-bit, read only register.
- ADDR_BLOCK_W00 register at address 0x10 storing the address of the data of the block input.
- ADDR_DIGEST0 register at address 0x80 storing first 64-bit of the data output of the Blake2b hardware accelerator 114.
- ADDR_DIGEST15 register at address 0x8F storing last 64-bit of the data output of the Blake2b hardware accelerator 114.
Employing the plurality of workers 104 to simultaneously process segments of the input data each using its respective Blake2b hardware accelerator 114 may significantly enhance performance of the Blake2b algorithm execution by the order of the number of workers 104 utilized for the parallel computation where each worker 104 may process a respective block (segment) of the input data.
According to some embodiments of the present invention the integrated circuit 100, constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for encrypting data according to one or more encryption algorithms, for example, AES and/or the like. Such encryption algorithms may be applied for one or more applications, for example, communication, Hardware Security Modules (HSM) and/or the like.
In such embodiments, the master 102 may receive a plurality of data segments that need to be encrypted according to a certain encryption algorithm. The master 102 may therefore instruct each of the workers 104 to use a respective hardware accelerator 114 optimized for the certain encryption algorithm.
The master 102 may further distribute the plurality of received data segments to at least some of the plurality of workers 104 which may each utilize its respective uniquely associated hardware accelerator 114 which was selected and instructed accordingly by the master 102. After encrypting its respective data segment, each worker 104 may transfer a respective encrypted data stream back to the master 102 which may output the encrypted data streams received from the workers 104.
Reference is now made to FIG. 5A and FIG. 5B, which are schematic illustrations of an AES algorithm flow. An exemplary hardware accelerator such as the hardware accelerator 114 optimized for executing the AES algorithm which is an encryption algorithm as known in the art may be optimized for mixing and XORing function blocks implementing the AES algorithm.
The XORing function may receive two input data items, for example, two 64-bit words, XOR between them and return the result. This simple action may significantly reduce the computation load of instructions trying to mix two separated 32-bit words, causing additional store and load instructions. The AES algorithm further includes a swap function which is the round's main function. Some implementation forms of the AES hardware accelerator 114 may be limited in their interface width, for example, the APEX interface may be limited to 64-bits. The AES hardware accelerator 114 may therefore implement the 128-bits output swap function using two identical function blocks differing from each other only in their output bits, one block may output the lower 64-bits of the swap function's 128-bits output while the other block may output the upper 64-bits.
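For illustration only, the following C sketch shows how software running on a worker 104 might drive such a split 64-bit interface: the 128-bit swap output is obtained from two half-width blocks and then XORed with the round sub key. The intrinsic names stand in for the accelerator's actual extension instructions and are assumptions for the sake of the example.

#include <stdint.h>

/* Hypothetical intrinsics standing in for the accelerator's extension instructions. */
extern uint64_t aes_xor64(uint64_t a, uint64_t b);              /* XORing function block */
extern uint64_t aes_swap_lo64(uint64_t in_lo, uint64_t in_hi);  /* lower 64 bits of swap */
extern uint64_t aes_swap_hi64(uint64_t in_lo, uint64_t in_hi);  /* upper 64 bits of swap */

/* One way the 128-bit state could be pushed through the two 64-bit-wide swap
 * blocks and then XORed with the round sub key. */
static void aes_round_via_accel(uint64_t state[2], const uint64_t round_key[2])
{
    uint64_t lo = aes_swap_lo64(state[0], state[1]);
    uint64_t hi = aes_swap_hi64(state[0], state[1]);

    state[0] = aes_xor64(lo, round_key[0]);
    state[1] = aes_xor64(hi, round_key[1]);
}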
The flow of the AES swap function demonstrated in FIG. 5A and FIG. 5B may include the following:
A SubBytes step 502 which is a non-linear substitution step in which each byte is replaced with another byte according to a lookup table.
A ShiftRows step 504 which is a transposition step in which the last three rows of the state are shifted cyclically a certain number of steps.
A MixColumns step 506 which is a linear mixing operation which operates on the columns of the state, combining the four bytes in each column.
A final XOR step 508 executed by the AES hardware accelerator 114 using the swap blocks with the round sub key.
Employing the plurality of workers 104 to simultaneously process segments of the input data each using its respective AES hardware accelerator 114 may significantly enhance performance of the AES algorithm execution by the order of the number of workers 104 utilized for the parallel computation where each worker 104 may process a respective block (segment) of the input data.
According to some embodiments of the present invention the integrated circuit 100, constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for computing hash values according to an Ethash algorithm. The Ethash algorithm is commonly used in one or more cryptocurrency blockchains such as, for example, Ethereum.
Reference is now made to FIG. 6, which is a schematic illustration of an integrated device integrating a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective hardware accelerator optimized for computing HASH values according to an Ethash algorithm, according to some embodiments of the present invention.
An exemplary integrated circuit such as the integrated circuit 100 may include a master such as the master 102 and a plurality of workers such as the workers 104, for example, a worker 104-1, a worker 104-2 to a worker 104-N. Each of the workers 104 may be coupled to a respective DCCM such as the DCCM 108 exclusively associated with the respective worker 104, for example, the worker 104-1 may be coupled to a DCCM 1 108-1. Each of the workers 104 may be further exclusively associated with a respective set such as the set 112 comprising a plurality of hardware accelerators such as the hardware accelerators 114, for example, an Ethash hardware accelerator 114J optimized for computing hash values according to the Ethash algorithm.
Each of the Ethash hardware accelerators 114J associated with and utilized by the respective workers 104 may include several functional modules or blocks, for example, a Keccak module 602, a first mix module 604, a Directed Acyclic Graph (DAG) page fetch module 606, a second mix module 608 and a post-processing module 610.
Initially, the master 102 may communicate, for example, via a network controller peripheral 126 A integrated in the integrated circuit 100, with one or more external devices, systems, nodes and/or the like to receive a DAG representing cryptocurrency transactions as known in the art and/or information relating to the DAG. Based on the received information the master 102 may construct the DAG and may store it in a global memory such as the global memory 150 accessible to all workers 104.
The master 102 may then instruct each of the workers 104 to prepare its exclusively associated Ethash hardware accelerator 114J.
The master 102 may further receive and/or compute a header relating to a new block to be created and added to the blockchain which is derived from a previous block of the blockchain. The master 102 may also receive and/or compute a plurality of nonce values relating to the new block which are practically guessed (drawn) nonces for the hash computation. The nonces may be randomly selected using a random number generator, pseudo random number generator and/or the like. However, the nonces may be also computed based on one or more equations and/or rules.
The master 102 may transfer the header and a respective one of the plurality of nonces to each of the workers 104 using one or more of the data transfer modes and/or implementations described herein before. In particular, the master 102 may transfer the header and the respective nonce into the DCCM 108 of each of the workers 104.
After the DAG is available in the global memory 150 and each of the workers 104 has the header and the nonce stored in their local DCCM 108, the master 102 may instruct the plurality of workers 104 to start computing the hash values.
Each worker 104 may then apply its respective Ethash hardware accelerator 114J to compute the hash value. The Keccak module 602 executing an SHA-3 like algorithm (e.g. Keccak-f) based on the header and the respective nonce may create an initial block, for example, a 128-byte mix which may be designated Mix 0. The first mix module 604 may apply one or more mixing functions to create an address to get the next 128-byte from the DAG.
The DAG page fetch module 606 may then access the global memory 150 to fetch the next 128-byte block of the DAG using the address computed by the first mix module 604. The DAG page fetch module 606 may apply one or more data transfer modes and/or implementations as described herein before for fetching the data from the global memory 150. The number of workers 104 that are applied for the Ethash hash computation may be therefore determined based on a trade-off between the bandwidth capacity of the interconnection channels, specifically a bus such as the bus 120B connecting the workers 104 to the global memory 150 and the processing capabilities of each of the workers 104.
The second mix module 608 may apply one or more Ethereum specific mixing functions to combine Mix 0 with the retrieved DAG page.
The DAG page fetch module 606 may then further fetch the next 128-byte block from the DAG stored in the global memory using the address computed by the second mix module 608. The second mix module 608 may then again combine the previous mix computed by the second mix module 608 with the retrieved DAG page. The steps of the DAG page fetch module 606 and the second mix module 608 may be repeated for a predefined number of iterations, for example, 64 iterations to produce a final mix designated Mix 64.
The post-processing module 610 may then process Mix 64 to produce a shorter mix, for example, a 32-byte Mix Digest.
Each worker 104 may then compare the resulting Mix Digest against a predefined 32-byte Target Threshold. If the Mix Digest is less than or equal to the Target Threshold, then the respective nonce used by the respective worker 104 may be considered successful, and may be broadcast to the Ethereum network. Otherwise, the respective nonce is considered invalid. The entire process may then be repeated by the master 102 using a different set of nonces distributed to the plurality of workers 104.
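For illustration only, the following C sketch summarizes the per-nonce flow described above as it might be orchestrated by software running on a worker 104, with the accelerator blocks 602-610 and the DAG fetch from the global memory 150 hidden behind hypothetical helpers. The helper names, signatures and the byte-wise target comparison are assumptions for the sake of the example.

#include <stdint.h>
#include <string.h>

#define MIX_BYTES     128u
#define DIGEST_BYTES   32u
#define DAG_ROUNDS     64u

/* Hypothetical helpers wrapping the Ethash accelerator blocks and the DAG fetch. */
extern void keccak_init_mix(const uint8_t *header, uint64_t nonce, uint8_t mix[MIX_BYTES]);
extern uint32_t mix_to_dag_address(const uint8_t mix[MIX_BYTES]);
extern void fetch_dag_page(uint32_t address, uint8_t page[MIX_BYTES]);
extern void mix_with_page(uint8_t mix[MIX_BYTES], const uint8_t page[MIX_BYTES]);
extern void post_process(const uint8_t mix[MIX_BYTES], uint8_t digest[DIGEST_BYTES]);

/* Returns 1 if the nonce satisfies the Target Threshold, 0 otherwise. */
static int ethash_try_nonce(const uint8_t *header, uint64_t nonce,
                            const uint8_t target[DIGEST_BYTES])
{
    uint8_t mix[MIX_BYTES], page[MIX_BYTES], digest[DIGEST_BYTES];

    keccak_init_mix(header, nonce, mix);            /* Mix 0                        */

    for (unsigned i = 0; i < DAG_ROUNDS; ++i) {     /* Mix 1 .. Mix 64              */
        uint32_t addr = mix_to_dag_address(mix);    /* mix -> next DAG page address */
        fetch_dag_page(addr, page);                 /* fetch 128-byte DAG page      */
        mix_with_page(mix, page);                   /* combine mix with the page    */
    }

    post_process(mix, digest);                      /* 32-byte Mix Digest           */

    /* Byte-wise big-endian comparison against the Target Threshold (an assumption). */
    return memcmp(digest, target, DIGEST_BYTES) <= 0;
}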
Applying the plurality of workers 104 to simultaneously compute a plurality of Mix Digest values based on the plurality of different nonces may significantly increase the performance, in particular, reduce the time of the computation by the order of the number of workers 104 used in the process. Moreover, using the specifically optimized Ethash hardware accelerator 114J may further significantly improve the performance and/or reduce the power consumed by the integrated circuit 100 for the computation compared to software based solutions.
According to some embodiments of the present invention the integrated circuit 100, constructed using the multi-processing architecture in which the plurality of workers 104 each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for computing hash values according to an Equihash algorithm. The Equihash algorithm is commonly used in one or more cryptocurrency blockchains such as, for example, Zcash.
Reference is now made to FIG. 7, which is a schematic illustration of a flow of computing HASH values according to an Equihash algorithm using an integrated circuit comprising a master processing unit and a plurality of worker processing units each exclusively coupled to a respective local memory and a respective set of hardware accelerators, according to some embodiments of the present invention.
The multi-processing master-worker architecture of an integrated circuit such as the integrated circuit 100 may be applied for computing hash values according to the Equihash algorithm. It should be noted that the numbers and values mentioned in the description are exemplary and should not be construed as limiting since other numbers and/or values may be used with the same architecture, flows and/or modules.
A master such as the master 102 may transfer 512 pairs of numbers in the range of [0 : 2^21] for correcting Random Number Generators (RNG) of a plurality of workers such as the workers 104, for example, 300 workers 104, starting from a worker 104-1 through to a worker 104-300. Specifically, the master 102 may transmit to each worker 104 three concatenated numbers: a header (864b), a nonce (256b) and a random number (21b).
Each of the workers 104 may transfer back to the master 102 512 pairs of numbers, in the range of [0 : 2^21], for further refining by the master 102. Each of the workers 104 may further send to the master 102 512 x 200b numbers which are the output of an XOR matrix of the respective worker 104.
Each worker 104 may use a respective RNG to create 512 random numbers in the range of [0 : 2^21], designated P1, ..., P512. The 512 random numbers may then be pushed (input) into a Blake2b hardware accelerator 114 configured with the following parameters: input 256B, output 50B / 400b, where the 400b output is divided into 2 x 200b words.
Selecting one of the two 200b words of the Blake2b output for each of the 512 numbers may be done based on whether the respective random number P1, ..., P512 is odd or even, thus giving X1, ..., X512. As a result there are 512 numbers (P1, ..., P512) and 512 corresponding 200b words (X1, ..., X512).
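A minimal Python sketch of this per-worker word generation is shown below. It assumes the 256-byte Blake2b input is the zero-padded concatenation of the header, the nonce and the 21-bit random number, and that the even/odd parity of Pi selects the lower/upper 200b word respectively; the exact byte layout, the RNG seeding and the parity mapping are assumptions made for illustration only.

import hashlib
import random

def worker_words(header, nonce, rng_seed):
    # header: 108 bytes (864b), nonce: 32 bytes (256b); layout is assumed
    rng = random.Random(rng_seed)
    p = [rng.randrange(2 ** 21) for _ in range(512)]              # P1..P512
    x = []
    for pi in p:
        data = (header + nonce + pi.to_bytes(3, 'little')).ljust(256, b'\x00')
        digest = hashlib.blake2b(data, digest_size=50).digest()   # 50B / 400b output
        words = (digest[:25], digest[25:])                        # two 200b words
        x.append(words[pi % 2])                                   # select word by parity of Pi
    return p, x                                                   # 512 numbers and 512 200b words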
The 512 numbers and their corresponding 512 200b words may be injected into a filter by constructing a 512 x 512 matrix where each element [i,j] of the matrix is 200b wide and is the XOR result of Xi and Xj. The values of the XOR results may be checked for leading zeros. If the leading zeros count is higher than, for example, 25 (an empirical number), the pair Pi, Pj and the corresponding XOR result may be added to a list:
[Table omitted: list rows of the form (Pi, Pj, Xij XOR result)]
Each worker 104 may send its respective list to the master 102. The list may contain between 1 and 512^2 / (2 x 512) items. Assuming the list contains M row elements.
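The worker-side filter may be sketched in Python as follows. The sketch iterates over unique pairs (i < j) rather than the full 512 x 512 matrix, since the matrix is symmetric and its diagonal is zero; this simplification and the data representation are assumptions for illustration, while the leading-zeros threshold of 25 is the empirical value quoted above.

def leading_zero_bits(word):
    # word is a 25-byte (200b) value
    return 200 - int.from_bytes(word, 'big').bit_length()

def xor_filter(p, x, threshold=25):
    rows = []
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            xij = bytes(a ^ b for a, b in zip(x[i], x[j]))   # XOR of Xi and Xj
            if leading_zero_bits(xij) > threshold:
                rows.append((p[i], p[j], xij))               # row of the list sent to the master 102
    return rows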
The master 102 may process the received data in two steps.
In the first step, the master 102 may:
(1) Place each row's columns Pi and Pj such that the higher number is in the right column, giving Pi' and Pj'.
(2) Sort the M rows according to the right column Pj'.
(3) Sort the M rows according to the left column Pi'.
(4) Define a sub-matrix as a group of rows having the same value of Pi' (the left column).
(5) Extract the sub-matrix of maximum size.
(6) The result is a subset of the M rows, where the right value Pj' is bigger than the left value Pi'.
In the second step, the master 102 may:
(1) Push the sub-matrix from the first step to a 256-entry buffer, whose number of rows M' is lower than M and which has the structure of:
[Table omitted: buffer rows of the form (Pi', Pj', Xij' XOR result)]
(2) Filter - step K, K ∈ [1, ..., 8]:
(a) Build a matrix of (M' x M'), where each element is 200b wide.
(b) Each matrix element [i'', j''] is the XOR result of Xi' and Xj'.
(c) Check the values of the XOR results' leading zeros.
(d) If the leading zeros count is higher than 25+K, add the pair Pi', Pj' and the corresponding Xij'' XOR result to a list:
[Table omitted: list rows of the form (Pi', Pj', Xij'' XOR result)]
(e) In this manner, in each iteration the new table doubles the number of columns while the number of rows is reduced by half.
(f) If the number of rows in the next step of the table's creation is 2, repeat the process with more data from the workers 104.
(g) At the end of this iterative process there will be 256 columns (excluding the XOR result column) and 2 rows, while the buffer will contain 256 entries.
(h) The 2 rows are two 200b words to be used as the algorithm solution, or the POW (proof of work) to be sent to the node as part of the solution message.
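A simplified Python sketch of the master-side first step is given below. The list-of-tuples representation and the use of a stable sort to realize items (2) and (3) are assumptions for illustration only, not a description of the actual firmware.

from collections import defaultdict

def master_first_step(rows):
    # rows: (Pi, Pj, Xij) tuples collected from the workers 104
    ordered = [(min(pi, pj), max(pi, pj), xij) for pi, pj, xij in rows]   # (1) higher value on the right
    ordered.sort(key=lambda r: r[1])      # (2) sort by the right column Pj'
    ordered.sort(key=lambda r: r[0])      # (3) stable sort by the left column Pi'
    groups = defaultdict(list)
    for row in ordered:
        groups[row[0]].append(row)        # (4) sub-matrix = rows sharing the same Pi'
    return max(groups.values(), key=len)  # (5)/(6) largest sub-matrix, Pj' > Pi' in each row

The second, iterative step then repeats the XOR and leading-zeros test on the extracted sub-matrix with an increasing threshold of 25+K, as described in items (2)(a)-(h) above.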
According to some embodiments of the present invention, the integrated circuit 100, constructed using the multi-processing architecture in which the plurality of workers 104 are each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for computing hash values according to a RandomX algorithm. The RandomX algorithm is commonly used in one or more cryptocurrency blockchains such as, for example, Monero.
According to some embodiments of the present invention, the integrated circuit 100, constructed using the multi-processing architecture in which the plurality of workers 104 are each supported by an exclusively associated memory (ICCM 106 and DCCM 108) and a respective set 112 of hardware accelerators 114, may be applied for implementing one or more ML models. ML models, for example classifiers, may heavily rely on a plurality of Multiply and Accumulate (MAC) operations. One or more of the hardware accelerators 114 may be configured to support hardware execution of the MACs, thus significantly increasing the computation performance of these MACs compared to software solutions. Applying the plurality of workers 104 to simultaneously compute a plurality of respective MACs using their respective hardware accelerators 114 may therefore significantly increase the performance, specifically reduce the computation time of the ML classification process, by the order of the number of employed workers 104.
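The following sketch illustrates how a dot product, a basic building block of such classifiers, may be split into per-worker partial MAC sequences; in the integrated circuit 100 each partial sum would be computed by a worker's MAC hardware accelerator 114 rather than by the Python loop shown here, so the sketch only illustrates the data split and aggregation.

def partial_mac(weights, inputs):
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc += w * x                     # one multiply-and-accumulate operation
    return acc

def distributed_dot(weights, inputs, num_workers):
    chunk = (len(weights) + num_workers - 1) // num_workers
    partials = [partial_mac(weights[k:k + chunk], inputs[k:k + chunk])
                for k in range(0, len(weights), chunk)]   # one data segment per worker 104
    return sum(partials)                 # master 102 aggregates the partial outcomes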
Moreover, the integrated circuit 100 may be further applied to train one or more ML models, for example, a neural network, an SVM and/or the like. For example, assume a certain ML model needs to be trained using a plurality of training datasets and/or a very large training dataset. In such a case, the master 102 may transfer the ML model to at least some of the plurality of workers 104 and may further transfer a respective one of the plurality of training datasets and/or a respective segment of the large training dataset to each of the workers 104. The plurality of workers 104 may operate in parallel to simultaneously train the ML model, each with a respective training dataset, such that the overall training time may be significantly reduced.
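A data-parallel training flow of this kind may be sketched as follows; the aggregation by simple parameter averaging and the train_fn interface are assumptions introduced for illustration, since the description above does not fix a particular aggregation rule.

def train_distributed(model_params, dataset, num_workers, train_fn):
    # split the training data into one segment per worker 104
    chunk = (len(dataset) + num_workers - 1) // num_workers
    segments = [dataset[k:k + chunk] for k in range(0, len(dataset), chunk)]
    # each worker trains a copy of the model on its own segment
    # (performed in parallel on the integrated circuit 100)
    worker_params = [train_fn(list(model_params), segment) for segment in segments]
    # the master 102 aggregates the per-worker parameters (here: element-wise averaging)
    return [sum(values) / len(values) for values in zip(*worker_params)]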
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant systems, methods and computer programs will be developed and the scope of the terms computation intensive algorithms and blockchains are intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", “having” and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of and "consisting essentially of'.
The phrase "consisting essentially of means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, an instance or an illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the invention may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range. Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals there between.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.

Claims

WHAT IS CLAIMED IS:
1. An integrated electronic circuitry integrating hardware accelerators in a multiprocessing environment for enhancing performance of computation intensive algorithms, comprising:
a plurality of worker processing units each coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker; and
a master processing unit coupled to the plurality of workers;
the master is configured to:
receive input data relating to a respective one of a plurality of algorithms,
instruct each of the plurality of workers to execute the respective algorithm using one of the hardware accelerators of its respective set selected according to the respective algorithm,
distribute a plurality of data segments of the input data to the plurality of workers;
receive a plurality of partial outcomes computed simultaneously by the plurality of workers, each of the plurality of partial outcomes is computed for a respective one of the plurality of data segments according to the respective algorithm by a respective one of the plurality of workers using its respective selected hardware accelerator and its respective memory, and
output a combined outcome aggregating the plurality of partial outcomes.
2. The integrated electronic circuitry of claim 1, wherein the master is coupled to a respective memory exclusively associated with the master for storing program instructions and data.
3. The integrated electronic circuitry of claim 1, wherein the respective memory coupled with each of the workers comprises:
- a respective Instruction Closely Coupled Memory (ICCM) for program instructions store, the respective ICCM is accessible by the master for writing program instructions for execution by the respective worker, and
- a respective Data Closely Coupled Memory (DCCM) for storing data used by the respective worker, the respective DCCM is accessible by the master for distributing the respective data segment to the respective worker and for receiving the respective partial outcome computed by the respective worker.
4. The integrated electronic circuitry of claim 1, wherein the plurality of processing units are configured to access a common memory shared by the plurality of processing units.
5. The integrated electronic circuitry of claim 1, wherein each of the hardware accelerators of the set is optimized for executing at least one of the plurality of algorithms.
6. The integrated electronic circuitry of claim 1, wherein the instruction of the master to each of the plurality of workers further comprising at least one configuration parameter relating to the utilization of the respective selected hardware accelerator.
7. The integrated electronic circuitry of claim 1, wherein the instruction of the master to the plurality of workers further comprising instructing at least some of the plurality of workers to each utilize its respective selected hardware accelerator according to at least one respective configuration parameter.
8. The integrated electronic circuitry of claim 1, wherein at least some of the plurality of data segments distributed to at least some of the plurality of workers are similar.
9. The integrated electronic circuitry of claim 1, wherein the master is further configured to operate only a subset of the plurality of workers, the number of workers in the subset is set according to at least one execution parameter defining a tradeoff between a performance of the execution of the respective algorithm and a power consumption of the integrated electronic circuitry.
10. The integrated electronic circuitry of claim 1, wherein the plurality of algorithms comprise at least one computation intensive algorithm applied for computing a hash value for a block recording transactions of an altcoin cryptocurrency in a blockchain.
11. The integrated electronic circuitry of claim 10, wherein the at least one computation intensive algorithm implements Ethash algorithm for computing the hash value.
12. The integrated electronic circuitry of claim 10, wherein the at least one computation intensive algorithm implements ProgPOW algorithm for computing the hash value.
13. The integrated electronic circuitry of claim 10, wherein the at least one computation intensive algorithm implements Equihash algorithm for computing the hash value.
14. The integrated electronic circuitry of claim 10, wherein the at least one computation intensive algorithm implements RandomX algorithm for computing the hash value.
15. The integrated electronic circuitry of claim 1, wherein the plurality of algorithms comprise at least one encryption algorithm.
16. The integrated electronic circuitry of claim 1, wherein the plurality of algorithm comprise at least one High Performance Computing (HPC) algorithm.
17. The integrated electronic circuitry of claim 1, wherein the plurality of algorithms comprise at least one algorithm applied for implementing a Machine Learning (ML) model.
18. The integrated electronic circuitry of claim 17, wherein the plurality of algorithms comprise at least one algorithm applied for training the ML model.
19. The integrated electronic circuitry of claim 1, wherein the master is further configured to conduct a plurality of iterations for computing the combined outcome, the master is configured to perform the following in each of the plurality of iterations: distribute, to the plurality of workers, a respective subset of a plurality of data segments of a respective data portion of the input data, receive a respective subset of a plurality of partial outcomes computed by the plurality of workers for the respective subset of a plurality of data segments, and compute a respective combined outcome for the respective data portion by aggregating the respective subset of the plurality of partial outcomes, wherein after the plurality of iterations are complete, the master constructs an overall outcome for the input data by aggregating the respective combined outcomes computed in the plurality of iterations.
20. The integrated electronic circuitry of claim 1, wherein each of the plurality of processing units is utilized by an ARC Intellectual Property (IP) core.
21. The integrated electronic circuitry of claim 20, wherein each of the hardware accelerators of each respective set is coupled to the respective worker using ARC Processor Extension (APEX) technology.
22. A method of using hardware accelerators integrated in a multi-processing environment to enhance performance of computation intensive algorithms, comprising:
using an electronic circuitry integrating a plurality of worker processing units and a master processing unit coupled to the plurality of workers, each of the plurality of workers is coupled to a respective set of hardware accelerators and to a respective memory exclusively associated with the respective worker, the master is configured for:
receiving input data relating to a respective one of a plurality of algorithms,
instructing each of the plurality of workers to execute the respective algorithm using one of the hardware accelerators of its respective set selected according to the respective algorithm,
distributing a plurality of data segments of the input data to the plurality of workers;
receiving a plurality of partial outcomes computed simultaneously by the plurality of workers, each of the plurality of partial outcomes is computed for a respective one of the plurality of data segments according to the respective algorithm by a respective one of the plurality of workers using its respective selected hardware accelerator and its respective memory, and
outputting a combined outcome aggregating the plurality of partial outcomes.
PCT/IB2021/061227 2020-12-02 2021-12-02 Closely coupled hardware acceleration in a multi-processors environment Ceased WO2022118236A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063120256P 2020-12-02 2020-12-02
US63/120,256 2020-12-02

Publications (1)

Publication Number Publication Date
WO2022118236A1 true WO2022118236A1 (en) 2022-06-09

Family

ID=81852980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2021/061227 Ceased WO2022118236A1 (en) 2020-12-02 2021-12-02 Closely coupled hardware acceleration in a multi-processors environment

Country Status (1)

Country Link
WO (1) WO2022118236A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341525A1 (en) * 2017-05-23 2018-11-29 Kla-Tencor Corporation Scalable and Flexible Job Distribution Architecture for a Hybrid Processor System to Serve High Bandwidth Real Time Computational Systems Used in Semiconductor Inspection and Metrology Systems
US20200342297A1 (en) * 2018-01-12 2020-10-29 Huawei Technologies Co., Ltd. Tree Topology Based Computing System and Method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MENG, XIANDONG ET AL.: "A high-performance heterogeneous computing platform for biological sequence analysis", IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, vol. 21, no. 9, 2010, pages 1267 - 1280, XP011299513 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4535169A1 (en) * 2023-10-06 2025-04-09 Q.ant GmbH Method for computing a partition function of a symmetric matrix on at least one processing unit


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21900190

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21900190

Country of ref document: EP

Kind code of ref document: A1