
WO2009036987A2 - Parallel computer architecture based on cache-coherent non-uniform memory access (cc-NUMA) - Google Patents


Info

Publication number
WO2009036987A2
Authority
WO
WIPO (PCT)
Prior art keywords
machine
level
interacting
regardless
initial configuration
Application number
PCT/EP2008/007897
Other languages
French (fr)
Other versions
WO2009036987A3 (en)
Inventor
Emilio Billi
Original Assignee
Screenlogix S.R.L.
Application filed by Screenlogix S.R.L.
Publication of WO2009036987A2
Publication of WO2009036987A3

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs

Definitions

  • the machine is composed, at the hardware level, of three structural elements: the base element (CPU, memory, I/O), which can be identified as standard computer hardware; one or more interconnection boards, whose purpose is to physically make the electrical interconnection between the various nodes and to allow the data exchange by supporting the base SCI protocol communications; and a series of coprocessors adapted to extend the SCI protocol functionalities in order to provide the cache coherency, the shared memory and possibly the acceleration of several functionalities, such as the migration of the processes and the load balancing.
  • the processing unit 1 is preferably composed of a processor (CPU) 1a with its cache 1b, a local RAM memory and a series of input/output interfaces 1d, functioning as the base element for the achievement of the system.
  • one or more communication interfaces 3, 4 are connected to this unit 1.
  • Such interfaces 3, 4 implement the SCI communications both at the electrical level, by supplying the apparatus necessary for the transmission of the signals, and at the protocol level, by implementing the base read/write and request/response set typical of the SCI protocol.
  • the interfaces are provided with a series of inputs and outputs suitable for creating the point-to-point connection with the corresponding interfaces present on the other nodes composing the system.
  • every interface is provided with an input channel 5, 6 and an output channel 7, 8.
  • a board containing the coprocessors is attached to the system by means of the I/O interface 1d. Every node can be connected, by suitably configuring the interfaces, to the other nodes with different interconnection topologies and with a different number of parallel channels.
  • the board containing the coprocessors has an interface 1 which allows the connection of the coprocessors 2 and 5 to the BUS system (I/O interface 1d, figure 2).
  • the coprocessor 2 implements the cache coherency set of the SCI protocol, building the shared memory through a series of modules 4 and managing the access to the local memory and to the remote memory.
  • a series of RAM memories are directly connected to the coprocessor and are used as cache memories for the management of the cache coherency.
  • the prior-art approach of reserving a portion of the central memory instead involves a reduction of the performances, due to the further data load placed on the interconnection bus by the reading and writing of the data in the reserved portion of the central memory; in addition, a portion of the total memory available to the system, about 10%, is sacrificed, removing resources from processing tasks which require the storage of a great quantity of data.
  • the continuous access to the memory occurs by means of the interconnection system, and a high number of transactions negatively affects the total performances, sometimes by several orders of magnitude.
  • a second coprocessor 5 can be implemented for accelerating several recurring functions of the system kernel, such as for example those related to the global scheduler (figure 1, element 6).
  • This device achieves a direct acceleration of several functionalities related to the scheduler.
  • the balancing of the processes and the management of the overall work load of the machine are obtained by means of the global scheduler.
  • Such global scheduler can be interfaced with a specific coprocessor which allows a very quick performance of all the main operations which the scheduler is set to undertake.
  • the updating can be carried out through suitable tools designed for the specific function.
  • Said modules are specifically driven by suitable modules present in the SUPERBIOS.
  • the system is built with modular blocks whose minimum unit is constituted by the presence of the interconnection system with the SCI protocol, the SUPERBIOS and the cache coherency coprocessor, which in any case remains a dynamic element, since if necessary the system could be configured as an MPP shared-nothing type, which would eliminate the need of a coprocessor for the cache coherency and shared memory.
  • the presence of the second coprocessor is not strictly binding for the system functioning; its presence favours the increase of the performance and a more effective management of the migration of the processes and overall load balancing of the machine.
  • the SUPERBIOS architecture can be as follows.
  • the SUPERBIOS has a level structure composed of modules, each module overseeing specific functionalities which can be integrated, or not integrated, with a hardware acceleration entrusted to one or more FPGAs, or to an ASIC and one or more FPGAs.
  • the SUPERBIOS execution procedure is started; more specifically, a series of operations necessary for the creation of a joint system is carried out.
  • the system initialises a series of low-level drivers (point 1) which are necessary for the initialisation of the SCI (Scalable Coherent Interface) communication and of all the peripheral devices necessary thereto. Inside this phase, the various communication units are initialised and a unique NUMA address is assigned to them, with the object of identifying each interface unambiguously.
  • the system has initialised the communications hardware and has activated the communications ring.
  • the high-level drivers are then activated, that is, the libraries and the management interfaces of the SCI protocol; this layer contains all the APIs (Application Program Interfaces) necessary for the correct management of the functionalities related to the memory and the cache coherency.
  • at this point the system is ready to execute the merging of the resources: it activates the mechanism for managing and sharing the data through the entire system, both at the interconnection level and at the operating system level, making the access to the remote data transparent, initialising the cache coherency, managing the process and thread memory migration and organising the SCI shared memory.
  • Layer 4 oversees the management of the process migration, while layer 5 contains the instructions for the construction of the cache coherency.
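As a rough illustration of this layered execution procedure, the C sketch below runs five hypothetical initialisation stages in strict order. The stage names, messages and function signatures are assumptions made for this example, not the patent's actual firmware interface.

```c
/* Hypothetical sketch of the layered SUPERBIOS start-up described above.
 * All names are illustrative assumptions. */
#include <stdio.h>

typedef int (*init_fn)(void);

static int init_lowlevel_drivers(void)  { puts("layer 1: SCI links, NUMA addresses"); return 0; }
static int init_highlevel_drivers(void) { puts("layer 2: SCI libraries and APIs");    return 0; }
static int merge_resources(void)        { puts("layer 3: shared-memory merge");       return 0; }
static int init_process_migration(void) { puts("layer 4: process migration");         return 0; }
static int init_cache_coherency(void)   { puts("layer 5: cache coherency");           return 0; }

int main(void)
{
    /* The layers must run strictly in order: each one relies on the
     * services brought up by the previous one. */
    init_fn layers[] = { init_lowlevel_drivers, init_highlevel_drivers,
                         merge_resources, init_process_migration,
                         init_cache_coherency };
    for (unsigned i = 0; i < sizeof layers / sizeof *layers; i++)
        if (layers[i]() != 0)
            return 1;   /* abort the boot if any layer fails */
    return 0;
}
```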
  • the SCI protocol allows supporting the multiprocessing through a general shared-memory model made through the coherency of the caches of the CPUs.
  • Such mechanism is compatible with two separate programming models: on the one hand, some applications can maintain the cache coherency through a hardware-software control, while others can use mechanisms of MPI (message passing) type.
  • the SCI protocol permits easily supporting both models in a concurrent manner, i.e. simultaneously.
  • This feature allows the system to supply excellent performances for both programming models along with the flexibility for being able to use both applications of SMP type and MPI type.
  • the coherence mechanism can be implemented by working directly with the base request/response operations of the communication system.
  • the cache coherency system is based on a point-to-point scheme initiated by a requester (typically one of the CPUs) and terminated by the responder (typically a memory or another processor).
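A minimal sketch of such a request/response transaction follows; the message layout and the reduced set of line states are assumptions made for illustration, as the real SCI coherence protocol is considerably richer.

```c
/* Sketch of the point-to-point scheme: a requester (a CPU) sends a read
 * or write request, the responder (a memory or another processor) answers
 * with the data and a granted coherence state. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef enum { REQ_READ, REQ_WRITE } req_type;
typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE } line_state;

typedef struct { req_type type; uint64_t addr; int requester_id; } request;
typedef struct { uint64_t data; line_state granted; } response;

/* Responder side: grant shared access on reads, exclusive on writes
 * (a real protocol would first invalidate the other sharers). */
static response respond(const request *rq, const uint64_t *memory)
{
    response rs;
    rs.data = memory[rq->addr];
    rs.granted = (rq->type == REQ_READ) ? LINE_SHARED : LINE_EXCLUSIVE;
    return rs;
}

int main(void)
{
    uint64_t memory[16] = { [3] = 42 };
    request rq = { REQ_READ, 3, 1 };      /* CPU 1 reads address 3 */
    response rs = respond(&rq, memory);
    printf("data=%llu state=%d\n", (unsigned long long)rs.data, (int)rs.granted);
    return 0;
}
```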
  • This level is integrated with an FPGA which accelerates and oversees all the functionalities necessary for the creation and management of the cache coherency.
  • the system is seen by the operating system as a single machine having a distributed shared memory.
  • in order to facilitate the operating system and increase the performances of the machine itself, it is necessary to add a manager of the load and of the distribution of the work along all the nodes composing the machine itself.
  • the importance of balancing the load is strategic for being able to obtain high performances, reducing the work load of the interconnection system.
  • This layer 6 is defined as a global scheduler.
  • Every processor manages numerous tasks which must be balanced in a global vision in order to actually benefit from a distributed architecture of SSI type like that described.
  • the characteristic applied to the NUMA machines is that of preparing, in a memory portion "close" to the processor, the sequential calculation that such processor must execute.
  • the global scheduler algorithm is conceived for maintaining the data as close as possible to the CPU set for executing that data, making the blocks of processes migrate towards the processing units.
  • each process is assigned a home node, to which a memory portion taken from the local memory is allocated.
  • the scheduler can change the position of the home in consideration of the work load, via a bulk migration of the home and of its memory, maintaining the validity of the home principle associated with the local memory.
  • the scheduler is structured for balancing and monitoring the load of the CPUs, and allows: using MPI applications by configuring MPICH as if the machine were an SMP machine; managing the migration of the threads by balancing the load between the processing nodes through the shared memory; managing a global ID indexing of the processes and of their migration; and creating remote processes, indexing them and making them migrate.
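The sketch below illustrates the home-node principle under assumed data structures: a process is bound to a home node holding its locally allocated memory, and when the load becomes unbalanced the scheduler migrates the home and its memory in bulk, so the home/local-memory pairing stays valid. Node count and load figures are invented.

```c
/* Illustrative home-node bookkeeping; not the patent's actual scheduler. */
#include <stddef.h>
#include <stdio.h>

#define NODES 4

typedef struct { int pid; int home; size_t local_mem; } process;

/* Pick the least-loaded node as the new home and move the process
 * together with its memory. */
static void rebalance(process *p, const int load[NODES])
{
    int best = 0;
    for (int n = 1; n < NODES; n++)
        if (load[n] < load[best])
            best = n;
    if (best != p->home) {
        printf("bulk-migrating pid %d: node %d -> node %d (%zu bytes)\n",
               p->pid, p->home, best, p->local_mem);
        p->home = best;   /* the memory follows the process */
    }
}

int main(void)
{
    process p = { 101, 0, 4096 };
    int load[NODES] = { 90, 20, 55, 60 };   /* per-node CPU load, percent */
    rebalance(&p, load);
    return 0;
}
```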
  • a loader 7 is implemented.
  • the task of the loader is the following.
  • the system appears as an SSI NUMA-like system (Single System Image NUMA machine) .
  • the loader ensures that it is possible to install a non-modified operating system on the main node, which after the installation of the management drivers (SSI NUMA driver) is capable of "seeing" all the resources of the machine itself.
  • One variant is obtainable by inserting a loader 7 in every node.
  • This modification allows making a particular configuration in which it is possible to install standard virtualisation software without making substantial modifications to their structure.
  • the functioning mechanism is the following.
  • an additional loader allows emulating this procedure without having to modify the architecture of the machine itself, also permitting two options: the first, simply installing the communication drivers inside the kernel of the virtualisation system and then providing the nodes of the virtual machine with the use of the interconnection network; the second, modifying the kernel of the virtualisation machine in such a manner that it is possible to use the SMP resources of the system.
  • the global scheduler can be structured so it can be considered a configurable dynamic scheduling system.
  • there are essentially two scheduler types: the static scheduler and the dynamic scheduler. Both of these scheduler types can be of adaptive type, i.e. capable of being adapted to the task to be executed.
  • the scheduling system of static type organises the processes according to a series of policies (rules contained in one or more scripts).
  • a scheduler of static type migrates the processes at their origin, balancing the load of the system, but is not capable of making a process migrate once it is in execution.
  • the dynamic scheduling scheme is capable of making the processes migrate even while they are being executed. In this manner, if the machine is in an unbalanced load phase and the need for a system balancing is verified, the scheduler is capable of removing the process and reallocating it in a different position without stopping its execution.
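To make the distinction concrete, the following sketch contrasts the two scheduler types; the node count, load values and imbalance threshold are invented for illustration, and the real policies are defined by scripts as described above.

```c
/* Static policy: placement decided once, at launch. Dynamic policy:
 * may also relocate a task that is already executing. Illustrative. */
#include <stdio.h>

#define NODES 4

typedef struct { int pid; int node; int running; } task;

static int least_loaded(const int load[])
{
    int best = 0;
    for (int n = 1; n < NODES; n++)
        if (load[n] < load[best]) best = n;
    return best;
}

int main(void)
{
    int load[NODES] = { 80, 15, 60, 70 };
    task t = { 202, 0, 1 };                 /* already executing on node 0 */

    /* a static scheduler would only have acted at launch time */
    printf("static placement at launch: node %d\n", least_loaded(load));

    /* a dynamic scheduler migrates the running task on high imbalance */
    int best = least_loaded(load);
    if (t.running && load[t.node] - load[best] > 25) {
        printf("dynamic: migrating running pid %d from node %d to node %d\n",
               t.pid, t.node, best);
        t.node = best;   /* without stopping its execution */
    }
    return 0;
}
```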
  • in order to provide good integration and good system performances, the schedulers must be capable of correctly handling a series of tasks: sequential applications; parallel applications which communicate by means of shared memory; applications which communicate by means of message passing (MP); and the correct distribution of the load of applications which require a large quantity of memory or of CPU.
  • Beowulf-type systems offer a set of utilities for correctly configuring a scheduler and for managing the execution of MPI (message passing interface) applications or PVM (Parallel Virtual Machine) applications.
  • when an application is launched, the MPI system normally places the process on a node of the system, without prior knowledge of the state of that node in that moment.
  • This approach has numerous drawbacks: first of all, it is usable only by a user capable of managing the pertaining application from the beginning, who in addition must be capable of developing according to the MPI and PVM libraries.
  • This user type is very rare and does not represent the average of the users of a processing system.
  • Another negative point consists of the fact that the work load distribution along the nodes occurs without the system having any knowledge of the occupation state of the affected nodes, which in turn means that the balancing of the work load does not occur in an optimal manner.
  • By means of a dynamic scheduler, instead, one is capable of managing the processes using an active balancing of the work load, operating in a manner transparent to the user and to the application, i.e. it is not necessary to write explicit rules which automate the migration of the processes.
  • the migration of the processes takes into account the real occupation of the nodes.
  • Such a scheduler requires a mechanism which allows specifying new policies.
  • the processes can be created with the fork and execv commands.
  • the standard pthread interface allows creating and managing the threads in a shared-memory setting.
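For concreteness, here is a short, self-contained example of the two creation mechanisms just mentioned: fork()/execv() for processes and the standard pthread interface for threads in a shared-memory setting. The echoed message is arbitrary; compile with -lpthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* the thread shares the creating process's address space */
    printf("thread sees shared value %d\n", *(int *)arg);
    return NULL;
}

int main(void)
{
    /* process creation: the child replaces its image with a new program */
    pid_t pid = fork();
    if (pid == 0) {
        char *child_argv[] = { "/bin/echo", "child process", NULL };
        execv(child_argv[0], child_argv);
        _exit(1);               /* only reached if execv fails */
    }
    waitpid(pid, NULL, 0);

    /* thread creation in a shared-memory setting */
    int shared = 42;
    pthread_t t;
    pthread_create(&t, NULL, worker, &shared);
    pthread_join(t, NULL);
    return 0;
}
```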
  • the global scheduler provided by the present invention serves to implement a mechanism of such type in order to render the machine a virtual SMP system.
  • the global scheduler is composed of a plurality of layers, in particular three: a first probe layer, structured so as to detect the average occupation state of the nodes; a second analyser layer, which intercepts the management of the scheduling and highlights its errors (for example, an excessive load on a single node or a hardware failure of one or more nodes); and a third manager layer, which manages the overall load and solves the problems reported by the analyser layer, permitting the balancing of the load between all the nodes present (a sketch of the three layers follows the probe discussion below).
  • the probe layer measures the load present in every single CPU and the occupation state of the memory.
  • active and passive probe layer types are provided for, and every probe layer is interconnected with the analyser layer, to which it sends the related data.
  • the active probe layers work in synchronism with the system timer, while the passive probe layers react to single events.
  • the events can be of kernel type or of global scheduler type.
  • When a passive probe layer is activated by an event, it immediately sends the message to the analyser layer.
  • For example, an active probe layer is used for monitoring the load of a CPU: in this manner, the reported value is constantly updated and constantly sent to the analyser layer, while it is convenient to use a passive probe layer for monitoring the ping-pong of memory pages.
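The sketch below models the three layers under stated assumptions (node count, load figures and the overload threshold are invented): an active probe samples the CPU load, the analyser flags irregular states, and the manager, the only layer with a global view, reacts.

```c
/* Probe -> analyser -> manager pipeline; illustrative only. */
#include <stdio.h>

#define NODES 3
#define OVERLOAD_THRESHOLD 85   /* percent CPU load considered irregular */

typedef struct { int node; int cpu_load; } probe_sample;

/* active probe: runs on the system timer and always reports */
static probe_sample probe_cpu(int node, const int loads[])
{
    probe_sample s = { node, loads[node] };
    return s;
}

/* analyser: highlights irregular states and escalates them */
static int analyse(const probe_sample *s)
{
    return s->cpu_load > OVERLOAD_THRESHOLD;  /* 1 = scheduling request */
}

/* manager: global view over all nodes; reacts with the best move */
static void manage(const probe_sample samples[], const int flagged[])
{
    for (int n = 0; n < NODES; n++)
        if (flagged[n])
            printf("manager: migrate work away from node %d (load %d%%)\n",
                   samples[n].node, samples[n].cpu_load);
}

int main(void)
{
    int loads[NODES] = { 95, 30, 40 };
    probe_sample samples[NODES];
    int flagged[NODES];
    for (int n = 0; n < NODES; n++) {
        samples[n] = probe_cpu(n, loads);
        flagged[n] = analyse(&samples[n]);
    }
    manage(samples, flagged);
    return 0;
}
```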
  • the analyser layer synthesised in FPGA receives information from the probe layers, analyses it and highlights possible irregular system states.
  • A series of analyser layers can function in a parallel manner on each node. This permits an effective monitoring of the state of the system and an optimal management of the load balancing along all the nodes.
  • when an irregularity is detected, the local analyser layer sends a scheduling request to the manager layer; the manager layer has a global vision of the resources and reacts by carrying out the best possible move.
  • the system state analyses are executed in hardware, so that the monitoring has no impact on the work load of the CPUs.
  • This type of approach allows obtaining, on one hand, probe layers and analyser layers that are very fast, and on the other hand the ability to carry out a great number of tests and detections without compromising the processing power of every single CPU.
  • the manager layer is present in every node inside the FPGA and is connected to the local analyser layers .
  • the various manager layers which make up the system communicate with each other in order to exchange the information on the state of the nodes.
  • the manager layer is the only layer which has an overall vision of the system state.
  • the scheduler can be represented as a global supervisor of the system, which acts on the system and autonomously processes a large quantity of information.
  • this structure, permitting the quick analysis of a large quantity of information thanks to the speed obtainable through the acceleration hardware deriving from the use of the FPGA, makes it possible to obtain a very high efficiency scheduler capable of making decisions based on large quantities of information.
  • the global vision of the system is obtained by means of the information deriving from the probe layers implemented and allows detecting all the "unbalancing" problems which could be generated.
  • the manager layer is capable of carrying out a series of operations, such as migrating a process or executing the "checkpoints" of an application (a checkpoint is a procedure by which an application is photographed in order to permit its recovery in case of failure), in accordance with the programming policies of the scheduler, in order to obtain maximum efficiency.
  • the manager layer is capable of dynamically executing every scheduling policy.
  • the manager layers, the probe layers and the analyser layers can be configured according to a system interface which allows programming the machine in order to obtain an overall adaptive system capable of providing maximum performances according to the task which is assigned thereto.
  • the probe layer related to the CPU load is active, and the local analyser layer is configured for communicating the data to the manager layer when an overload is detected.
  • said hardware acceleration allows collecting and processing a large amount of data without impacting the real performances of the machine.
  • the great quantity of data allows the manager layer to operate with optimal results for the performances and for the general balancing of the machine.
  • if the state of inactivity of the nodes thus organised persists for more than a predetermined time, for example 10 minutes, one can order the scheduler to turn off the inactive nodes.
  • This management of the scheduler reduces the energy consumption by over 40%.
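A minimal sketch of this power-saving policy, assuming an invented node structure and using a print statement as a placeholder for the real power-off command:

```c
#include <stdio.h>
#include <time.h>

#define IDLE_TIMEOUT_SEC (10 * 60)   /* the 10-minute example above */

typedef struct { int id; time_t idle_since; int powered_on; } node;

static void power_policy(node *n, time_t now)
{
    if (n->powered_on && n->idle_since != 0 &&
        difftime(now, n->idle_since) > IDLE_TIMEOUT_SEC) {
        printf("scheduler: powering off idle node %d\n", n->id);
        n->powered_on = 0;   /* placeholder for the real power-off order */
    }
}

int main(void)
{
    time_t now = time(NULL);
    node n = { 2, now - 11 * 60, 1 };   /* idle for 11 minutes */
    power_policy(&n, now);
    return 0;
}
```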

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of the machine, comprising at least two SMP computers connected in parallel with each other by means of at least one interconnection BUS, and comprising the use of FPGA processors dedicated to specific functionalities and to the management of a dedicated physical memory separated from the system memory, adapted to make a cc-NUMA system of medium to large size managed by means of a BIOS system adapted to optimise the resources in question and to accelerate the capacities of the system.

Description

MACHINE ARCHITECTURE COMPOSED OF A SOFTWARE LEVEL AND A HARDWARE LEVEL INTERACTING WITH EACH OTHER REGARDLESS OF THE INITIAL CONFIGURATION OF SAID MACHINE AND PROCESS FOR MAKING SAID MACHINE ARCHITECTURE
DESCRIPTION
The present invention refers to a machine architecture composed of a software level and a hardware level interacting with each other regardless of the initial configuration of said machine, and a process for making said machine architecture.
As is known, there is an increasing need for resources and processing power available to high-performance servers, in order to meet the growing demands of applications such as databases, processing systems, Internet applications and Internet providers.
Currently, many technologies are available for meeting these needs, the most widespread being clusters and the SMP (symmetric multiprocessor) technology; the SMP systems are also of the UMA (uniform memory access) type. Clusters have been and still are very widely used, first of all due to their cost advantage and to the availability of standard hardware on the market, and they have the considerable advantage of being scalable.
Nevertheless, clusters are hard to implement and manage and require constant administration and maintenance.
On the other hand, the SMP systems offer overall improved performances with respect to clusters, a simple programming model and a Single-System Image (SSI) vision of the system, which permits much easier administration and maintenance.
Unlike clusters, which view the nodes as separate entities to be administered singly, capable of cooperating with each other in order to solve common problems, a Single-System Image system is administered as a single system where all of the resources are centralised.
However, the SMP systems are capable of managing only a small number of processors, due to the technological limitations intrinsic to the architecture itself. Thus, it is necessary to construct an architecture capable of meeting the high resource and processing power needs of the marketplace.
A third approach to the problem is given by cache-coherent, non-uniform memory access (cc-NUMA), which exceeds the limits of the SMP architecture systems and provides an overall vision of the machine in Single-System Image, simplifying the system management and maintenance.
In order to make such systems starting from standard, non-modified hardware, several software technologies were developed which, by implementing changes at the BIOS (Basic Input Output System) level of the single SMP units or nodes composing the system, operate a kind of merger, building a shared memory and a cache coherency between the nodes themselves and providing the operating system (essentially Linux) with a single vision of the machine (Single-System Image).
Such technology is carried out by utilising conventional interconnections of network type, mainly Infiniband or Ethernet.
In addition, in order to achieve the synchronisation of the cache memories, these systems reserve up to 10% of the total memory in order to create a kind of virtual exchange cache.
The above-described systems achieved according to the prior art have various drawbacks, due to the considerable performance and technical limitations caused by the chosen interconnection system.
Indeed, it is evident that the interconnection protocols of network type, such as those mentioned above, are not 100% suitable for providing a protocol infrastructure for creating multiprocessor systems.
The technical task which the present invention proposes is therefore that of making a machine architecture which allows eliminating the above-lamented drawbacks of the prior art.
In the scope of this technical task, one object of the invention is that of making a machine architecture which uses a protocol and a BUS originating as a processor-memory BUS and which contains within it the instructions required for building both the shared memory and the cache coherency of the system. Another object is to implement at the hardware level many functionalities that the known systems are obligated, for various reasons, to execute at the software level.
Another object is to bring into hardware the management functionalities of both the shared memory and the cache-level memory, allowing a performance increase of ten times that of the abovementioned systems without considerably increasing the achievement costs.
The technical task, as well as these and other objects, according to the present invention, are achieved by making a machine architecture constituted by a software level and a hardware level interacting with each other regardless of the initial configuration of said machine, characterised in that it comprises at least two SMP computers connected with each other in parallel by means of at least one interconnection bus and with the use of FPGA processors dedicated to specific functionalities and to the management of a physical memory separated from the system memory, adapted to achieve a medium-to-large cc-NUMA system managed through a BIOS system adapted to optimise the resources involved and to accelerate the system capacities.
Also forming the subject of the present invention is a process for making a machine architecture, characterised in that said architecture is constituted by two levels, one software and one hardware, which interact with each other, so that the overall functionalities are defined on one hand by the high-performance, low-latency interconnection and on the other hand by the functionalities which the SUPERBIOS assigns to this interconnection and to the accessories connected thereto (FPGA), in a manner adapted to configure said machine in order to obtain both said NUMA architecture and a series of sub-architectures obtained by not implementing, or implementing in a different manner, some or all of the memory and cache functionalities.
Further characteristics and advantages of the invention will be clearer from the description of a preferred but not exclusive embodiment of the machine architecture and of the process for its achievement according to the finding, illustrated as indicative and non-limiting in the attached drawings, wherein:
Fig. 1 shows a block diagram representing the execution procedure of the SUPERBIOS for the creation of the single system according to the finding.
Fig. 2 shows a processing unit composed of a CPU processor with its cache, a local RAM memory and a series of interfaces according to the finding.
Fig. 3 shows a containment diagram of the coprocessors according to the finding.
Fig. 4 schematically shows the operating system and the interaction between the hardware system and the software system according to the finding.
With reference to the above-described drawings, the machine architecture according to the finding consists of a set of hardware and software adapted to achieve parallel machines capable of emulating SMP architectures with a great number of processors, starting from standard hardware, by first making a cc-NUMA machine that can be masked as an SMP system by virtualising the system itself. This masking facilitates programming and gives operating systems theoretically not intended for NUMA the possibility to function correctly.
The NUMA architecture is made by using an interconnection system derived from a processor-memory BUS which implements all the characteristics necessary to create a NUMA system unaffected by the above-described limits, capable of being treated as a network-type system but with much higher performances with respect to the systems described above.
The proposed architecture deals with a set of SMP computers interconnected with each other through especially designed hardware, based both on the use of an interconnection BUS of processor-memory type and on the possibility to insert inside the achieved architecture a series of suitable FPGA processors dedicated to specific functionalities. Such functionalities include, for example, the creation of the cache system coherency and the management of a dedicated physical memory separated from the system memory, which is capable of making a cc-NUMA system of medium to large size, managed through the implementation of a suitable BIOS system capable of optimising the resources in question and accelerating the system performances in close cooperation with the hardware, starting from commercial hardware.
The main difference between this implementation type and those constituted by a simple software implementation (even if at the firmware level) of the functionalities dedicated to the creation of a Single System Image with shared memory and cache coherency consists of the construction of an architecture which is based on a specific BUS protocol supporting the shared memory and cache coherency functionalities in the protocol itself.
What is said above permits an enormous simplification of the management of the same functionalities by the system firmware. In addition, the buses of network type are designed for transferring data and implement a high number of control packets at the protocol level, which in the specific case are not necessary and rather slow down the system itself.
In addition, the latency of the BUS must be very low, since at the performance level even the latency of the interconnection system plays a critical role of extreme importance: a few milliseconds of additional latency, in fact, suffice to lower the actual performances of the system by more than 10%.
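A toy model, not taken from the patent, makes the order of magnitude plausible: if a job issues a number of remote operations on its critical path, the total time grows linearly with the per-operation latency, so a move from microseconds to milliseconds quickly dwarfs the computation itself. Every figure below is an assumption.

```c
/* T = T_compute + R * latency; compare a fast and a slow interconnect. */
#include <stdio.h>

int main(void)
{
    double t_compute  = 60.0;    /* seconds of pure computation (assumed) */
    double remote_ops = 1e4;     /* remote accesses on the critical path  */
    double lat_fast   = 2e-6;    /* 2 microseconds per access             */
    double lat_slow   = 2e-3;    /* 2 milliseconds per access             */

    double t_fast = t_compute + remote_ops * lat_fast;
    double t_slow = t_compute + remote_ops * lat_slow;
    printf("slowdown with the slower interconnect: %.0f%%\n",
           100.0 * (t_slow - t_fast) / t_fast);
    return 0;
}
```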
Protocols of network type are often not designed to have a very low latency, since this is not a required characteristic for their normal use.
The architecture, subject of the finding, is based on the merger of several computational nodes formed by SMP computers connected to each other by an interconnection system based on the SCI (Scalable Coherent Interface) protocol, which specifically is a BUS originating for the interconnection of multiprocessor systems or of processor-memory systems, and which has a very low latency.
As a variant, it is possible to have a series of suitably configured FPGAs carrying out the coprocessor function in the creation of the shared memory and cache coherency of the system, capable of making a NUMA or cc-NUMA architecture, to which a control BIOS capable of creating a Single System Image configuration has been added.
On this architecture of hardware type, a software layer is in turn superimposed, preferably at firmware level, which hereinbelow will be identified as the SUPERBIOS, capable of virtualising the discrete architecture of the hardware machine, transforming it into a cc-NUMA machine and optimising the management of the processor load until obtaining performances similar to those of an SMP machine, in which even systems not optimised for the NUMA architecture can function correctly.
The SUPERBIOS, i.e. the set of algorithms executed by the machine after having carried out all the BIOS base instructions (the standard BIOS boot of the single nodes), permits configuring the machine as a Single System Image of partitionable SMP type, in which the management and optimisation of the problems related to the NUMA architecture, for example the optimisation of the access to the memories distributed among the various programs, is carried out in a transparent manner with respect to the system itself.
The introduction at the hardware level of suitable FPGAs programmed for the creation and management both of the memory and of the cache allows accelerating the machine itself and obtaining very high performances. The FPGAs can in turn control a dedicated DDR memory or the like on the boards containing the FPGAs themselves, in order to eliminate the need to dedicate a portion of the local memory to the management of the caches, as occurs in some of the commercial systems in use today, as described above.
The use of a dedicated memory permits a further increase of the overall performances of the machine thus achieved.
Before describing the specific functioning of the system and the functionalities inherent in the SUPERBIOS, it should be specified that such architecture is composed of two levels, one software and one hardware, which interact with each other, and that therefore the overall functionalities are defined on one hand by the high-performance, low-latency interconnection and on the other hand by the functionalities which the SUPERBIOS assigns to this interconnection and to the accessories connected thereto (FPGA).
In this manner, it is possible to configure the machine in order to obtain both the NUMA architecture described above and a series of sub-architectures obtained by not implementing, or implementing in a different manner, some or all of the memory and cache functionalities.
In other words, by acting on the SUPERBIOS level, it is possible not to activate the cache coherency or the shared memory, making a pure and simple data passage on which, at the operating system level, it is possible to implement a top library, for example MPI (Message Passing Interface), thus obtaining the functionalities of a standard supercomputer or of a cluster.
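As an illustration of such a message-passing layer, here is a minimal MPI ring program in C; any MPI implementation (e.g. MPICH) can build it, and it should be run with at least two ranks. It is an illustrative example, not code from the patent.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;                         /* start the token */
        MPI_Send(&token, 1, MPI_INT, 1 % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("token back at rank 0: %d\n", token);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```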
In addition, it is possible to make a share-nothing type of architecture (i.e. the opposite of SSI, Single System Image), capable of being employed for the resolution of specific tasks.
Advantageously, this is a multi-level architecture, regardless of the initial configuration of the machine.
Between each level and the hardware there is a close interaction which can be configured and managed.
In this manner, unlike the systems currently on the market, one advantageously obtains a multiple architecture in a single machine which can be reconfigured according to the tasks to be carried out and to the types of software to be used.
In addition, the use of a dedicated bus with low latency allows obtaining a high speed also in the applications where the single system image is not implemented but a cluster-like system is made, since the SCI protocol in any case allows the data exchange by utilising direct access to the remote memory, which ensures a considerable bandwidth at a significant speed.
The proposed architecture allows overcoming the drawbacks present both in the current commercial implementations of SMP or NUMA driven software systems and in cluster type configurations.
In addition, as already said, unlike normal virtualisation systems, SUPERBIOS can be accelerated and coordinated by the algorithms contained inside the FPGA processor on the interconnection board.
These algorithms are not necessarily only those dedicated to the management of the memory and cache but can be extended to the management of other aspects of the machine. For example, once the reference operating system is chosen, it is possible to make a hardware acceleration of the base functionalities of the system kernel for managing the access to the discs, integrating everything into a single system which will have performances of considerable importance with respect to any similar system which can be made with conventional technologies.
In addition, specific data encryption and decryption functionalities can be implemented at the FPGA level, achieving dedicated coprocessors.
The advantage offered by the architecture of the present invention consists of the fact that it can integrate these algorithms in the management system of the machine, optimising the entire system for carrying out dedicated operations, so as to obtain an increase of performances, as well as a flexibility that cannot be achieved with other cc- NUMA architectures on the market.
Advantageously, the effect of managing at the SUPERBIOS level the main functionalities related to the scheduling of the processes and to the load balancing is that of making the complexity of the typical memory management of the cc-NUMA systems completely transparent to the software, also allowing non-NUMA operating systems to function correctly inside the architecture, without preventing the virtualisation of the architecture by means of third-party software, from which the machine itself can draw benefit.
A further advantage of this architecture is that of not being a static architecture: unlike conventional cc-NUMA architectures, the multi-level approach allows effectively emulating various types of architecture if necessary, as described above, obtaining greater performances than dedicated systems of conventional type.
As seen, unlike the normal interconnection systems, the system based on the SCI protocol (Scalable Coherent Interface, an open standard, IEEE 1596-1992, which defines a point-to-point interconnection protocol capable of providing characteristics similar to those commonly present in the processor-processor or processor-memory BUS systems) has a series of advantages which allow obtaining high performances.
The SCI protocol makes it possible to build discrete hardware both in cluster-type configurations and in finely interconnected (cc-NUMA) configurations.
In the SCI architecture, the memory of the system and the protocol associated therewith are completely scalable.
The SCI interface is designed for reaching a transfer rate of a gigabyte per second per channel, with the possibility to add more parallel channels, which allows linearly scaling the bandwidth as well.
The single nodes are connected with each other with a point-to-point connection, eliminating the need for a switch and the further latency it would introduce.
The point-to-point connections can be organised according to 2-dimensional or 3-dimensional torus ring types. Each of these types can be made parallel, obtaining multiple ring architectures, for example 2-dimensional or 3-dimensional multiple torus rings.
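The wrap-around wiring of a 2-dimensional torus can be sketched in a few lines of C; the 4x4 grid size is an arbitrary assumption and the node numbering is row-major, chosen only for illustration.

```c
#include <stdio.h>

#define W 4
#define H 4

static int node_id(int x, int y) { return y * W + x; }

int main(void)
{
    int x = 0, y = 0;   /* even a corner node is 4-connected on a torus */
    printf("node %d neighbours: east=%d west=%d north=%d south=%d\n",
           node_id(x, y),
           node_id((x + 1) % W, y),
           node_id((x - 1 + W) % W, y),
           node_id(x, (y - 1 + H) % H),
           node_id(x, (y + 1) % H));
    return 0;
}
```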
Certain housekeeping supervision tasks of topology maintenance type, regarding the identification of break points and the processing of alternative routings in case of failure, can be assigned to one or more nodes called scrubbers.
At the electrical level, the SCI BUS can be borne over a cable.
The combination of a bandwidth that is linearly scalable and a very low latency allows creating an interconnection system capable of reaching performances typical of supercomputers.
SCI supports, at the protocol level, the distributed shared memory and the cache coherency; in addition, it is possible to implement and use libraries of MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) type, typically used in supercomputer software.
Appropriately, SCI is a protocol suitable for making an interconnection BUS adapted for the above-described purpose.
More in detail, using a series of LVDS channels and an SCI processor together with one or more coprocessors made by means of FPGA, it is possible to make a board to be inserted inside SMP nodes constituted by standard hardware. For the connection of said board to systems on the market, it is possible to use the normally present expansion interfaces.
One of the interfaces most suitable for the purpose is PCI Express, but it is possible to utilise, for interfacing the board with the system, any BUS with characteristics similar to or better than the present or future PCI Express.
In order to make a system capable of building cc-NUMA multiprocessor machines starting from discrete hardware, it is necessary that the interface have a high pass-band, or that it permit obtaining a high pass-band by exploiting several channels simultaneously.
One possible configuration of the proposed hardware is one in which an interface towards the PCI Express is present which, by means of a suitable bridge, communicates with the protocol coprocessor; the latter integrates both the SCI instructions and the cache coherency/shared memory controller made at the logic level, following the characteristics and functionalities described in the protocol itself.
Such coprocessor could be composed of a controller for access to the local memory connected to a controller for the remote memory; the two controllers are then connected to an on-board memory controller capable of managing the preferably high-speed RAM memory, directly mounted on the board and usable as local cache.
Such controllers communicate, by means of a dual-port memory controller, with the LVDS transmission/reception interface or interfaces, which are made following the IEEE Scalable Coherent Interface specifications.
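Purely as an organisational sketch, under the assumption that the arrangement is as described above (all type and field names are hypothetical), the coprocessor could be modelled as follows:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch of the protocol coprocessor organisation described
     * above; all type and field names are illustrative, not taken from the
     * IEEE 1596 specification. */
    struct local_mem_ctrl  { uint64_t base, size;        }; /* local node RAM window   */
    struct remote_mem_ctrl { uint16_t target_node;       }; /* SCI requests to remotes */
    struct cache_ctrl      { uint64_t cache_base, bytes; }; /* on-board high-speed RAM */
    struct lvds_port       { int      channel;           }; /* IEEE 1596 LVDS channel  */

    struct sci_coprocessor {
        struct local_mem_ctrl  local;   /* controller for local memory access      */
        struct remote_mem_ctrl remote;  /* controller for remote memory access     */
        struct cache_ctrl      cache;   /* manages the RAM mounted on the board,
                                           used as local cache                     */
        struct lvds_port       port[2]; /* reached through a dual-port memory
                                           controller, as described above          */
    };

    int main(void)
    {
        printf("coprocessor model: %zu bytes\n", sizeof(struct sci_coprocessor));
        return 0;
    }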
A further coprocessor instead contains the (preferably reconfigurable) instructions related to the SUPERBIOS acceleration, i.e. it serves as an accelerator device of FPGA type, accelerating software algorithms in hardware.
Appropriately, such hardware configuration can be implemented in various modes and can also be made starting from several boards produced by third parties joined in a suitable manner.
The SCI interconnection system could be an independent external board, to which an FPGA coprocessor board containing the cache coherency/shared memory controller and other reprogrammable acceleration boards could be flanked. The acceleration board could be a board inserted directly in the processor BUS, using the socket of a secondary processor, or a further PCI Express board.
Such board can contain the second coprocessor, or the latter can be external.
The system used for obtaining the cache coherency can preferably be that defined by the SCI protocol itself or a suitably optimised one made by third parties.
At start-up time, a system made by means of the hardware architecture described above consists of a series of separate units; in other words, it appears as a series of independent computers. The boot procedure can be configured in various ways and is in any case not strictly binding.
It is possible, for example, to make a sequence in which a main node is started, followed by all the others, loading in sequence from such node all the necessary software layers .
On the other hand, it is possible to simultaneously start all the nodes and then carry out the start-up of the software layers composing the BIOS, in the same order or in another.
Once the first boot phase has terminated, the discrete machines are joined into a single machine by carrying out the instructions contained in the extended BIOS: the interconnection boards are initialised, the node labels are assigned, and the cache coherency and shared memory are started.
At this point, it is possible to start the SUPERBIOS which organises the management of the NUMA machine according to specific algorithms.
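A minimal sketch of this start-up sequence, in which every function name is a hypothetical placeholder for one of the phases described above, could be the following:

    #include <stdio.h>

    /* Hypothetical start-up sequence of the extended BIOS / SUPERBIOS;
     * each function stands for a phase described in the text. */
    static void init_interconnect_board(int node) { printf("node %d: SCI board up\n", node); }
    static void assign_node_label(int node)       { printf("node %d: NUMA id assigned\n", node); }
    static void start_cache_coherency(void)       { printf("cache coherency + shared memory on\n"); }
    static void start_superbios(void)             { printf("SUPERBIOS scheduling started\n"); }

    int main(void)
    {
        const int nodes = 4;              /* discrete SMP machines after first boot */
        for (int n = 0; n < nodes; n++) { /* extended-BIOS phase on every node      */
            init_interconnect_board(n);
            assign_node_label(n);
        }
        start_cache_coherency();          /* the discrete machines become one       */
        start_superbios();                /* global NUMA management takes over      */
        return 0;
    }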
It should be noted that the characteristics can be modified by modifying, or by not sending, the single blocks composing the BIOS, and that, by exploiting the reconfigurability of the FPGAs, the cache coherency can be started or not started.
In addition, it is possible to configure the system such that it uses the interconnection system exclusively as a communications BUS, thus achieving an MPP, cluster or grid architecture in which there is neither shared memory nor cache coherency. Advantageously, these particular features are not present in the prior art and allow the architecture, subject of the patent, to be open to an infinite spectrum of applications by simply configuring dynamically the hardware and software tasks which one intends the machine to perform.
Inside the hardware board, it is also possible to insert a software configuration setting which allows choosing and reconfiguring the most suitable work mode .
In particular, at the hardware level the machine is composed of three structural elements: the base element (CPU, MEMORY, I/O), which can be identified as standard computer-type hardware; one or more interconnection boards, whose object is that of physically making the electrical interconnection between the various nodes and allowing the data exchange by supporting the base SCI protocol communications; and a series of coprocessors adapted for extending the SCI protocol functionalities in order to make the cache coherency, the shared memory and possibly the acceleration of several functionalities, such as for example the migration of the processes and the load balancing.
With reference to fig. 2, the processing unit 1 is preferably composed of a processor (CPU) 1a with its cache 1b, a local RAM memory and a series of Input/Output interfaces 1d, functioning as the base element for the achievement of the system.
To this unit 1, one or more communication interfaces 3, 4 are connected. Such interfaces 3, 4 realise the SCI communications part both at the electrical level, by supplying the apparatus necessary for the transmission of the signals, and at the protocol level, by implementing the base read/write or request/response set typical of the SCI protocol.
The interfaces are provided with a series of inputs and outputs suitable for creating the point-to-point connections with the corresponding interfaces present on the other nodes composing the system.
Specifically, every interface is provided with an input channel 5, 6 and an output channel 7, 8.
A board containing the coprocessors is connected to the system by means of the I/O interface 1d. Every node can be connected, by suitably configuring the interfaces, to the other nodes with different interconnection topologies and with a different number of parallel channels.
More in detail, with reference to figure 3, the board containing the coprocessors has an interface 1 which allows the connection of the coprocessors 2 and 5 to the system BUS (I/O interface 1d, figure 2).
The coprocessor 2 implements the cache coherency set of the SCI protocol, realising the shared memory through a series of modules 4 and managing the access to the local memory and the access to the remote memory.
A series of RAM memories are directly connected to the coprocessor and are used as cache memories for the management of the cache coherency.
The use of a RAM memory, physically present on the coprocessor, allows obtaining a series of performance advantages .
In some systems on the market today, in fact, there is no physical memory dedicated to the cache and a portion of the RAM memory present on the node (about 10% of the total RAM memory present) is used.
This type of approach involves a reduction of the performances, due to the further data load present on the interconnection bus for the reading and writing of the data in the reserved portion of the central memory; in addition, a portion of the total memory available to the system is sacrificed, about 10%, removing resources from processes which require the storage of a great quantity of data.
The continuous access to the memory occurs by means of the interconnection system, and a high number of transactions negatively affects the total performances, sometimes by several orders of magnitude.
A second coprocessor 5 can be implemented for accelerating several recursive functions of the system Kernel, such as for example those related to the global scheduler (figure 1, element 6) .
It is in fact possible to accelerate several software algorithms in hardware, such algorithms executing specific functionalities, by increasing the performance by several orders of magnitude with respect to the equivalent non-accelerated algorithm.
This device achieves a direct acceleration of several functionalities related to the scheduler.
The balancing of the processes and the management of the overall work load of the machines is obtained by means of the global scheduler.
Such global scheduler can be interfaced with a specific coprocessor which allows a very quick performance of all the main operations which the scheduler is set to undertake.
The use of an acceleration hardware of the global scheduler makes the system very efficient, given the same work load, and minimises the response times of the system itself to the variations of the work load, executing a balancing of the system in much briefer times.
The presence of a reprogrammable coprocessor also allows updating the system without having to modify or change the hardware, permitting a continuous development and constant improvement of the global scheduler without having to worry about the hardware.
It is clear that the global scheduler is an element subject to continuous improvements, due to the refinement of the algorithms and to the continuous change of the architecture of the processors themselves; by its nature, it would be very disadvantageous in economic terms to bind hardware of non-reprogrammable type to the function of accelerating such an element.
The updating can be carried out through suitable tools designed for the specific function.
Finally, it is possible to have an overall vision of the functioning of the hardware and software sides of the system through figure 4, which summarises the interaction between the various hardware and the software system.
Inside the board 4, it is possible to note the interconnection ring, the communications module, the module with the co-processors on-board.
Said modules are specifically driven by suitable modules present in the SUPERBIOS. It should be noted that the system is built with modular blocks, whose minimum unit is constituted by the presence of the interconnection system with SCI protocol in the SUPERBIOS. To this minimum unit it is possible to add the cache coherency coprocessor, which in any case remains a dynamic element, since if necessary the system could be configured as MPP shared-nothing type, which would eliminate the need for a coprocessor for the cache coherency and shared memory.
The presence of the second coprocessor is not strictly binding for the system functioning; its presence favours the increase of the performance and a more effective management of the migration of the processes and overall load balancing of the machine .
The SUPERBIOS architecture can be as follows .
The SUPERBIOS has a level structure composed of modules, each module overseeing specific functionalities which may or may not be integrated with a hardware acceleration entrusted to one or more FPGAs, or to an ASIC and one or more FPGAs. With reference to fig. 1, after the system has terminated the standard boot phase, in which the base hardware composing the single node is initialised, the SUPERBIOS execution procedure is started; more specifically, a series of operations necessary for the creation of a joined system is carried out.
The system initialises a series of low-level drivers (point 1) which are necessary for the initialisation of the SCI (Scalable Coherent Interface) communication and of all the peripheral devices necessary thereto. In this phase, the various communication units are initialised and a unique NUMA address is assigned to each of them, with the object of identifying the interface in an unequivocal manner.
At the end of this first phase present in every node, the system has initialised the communications hardware and has activated the communications ring.
Subsequently, the high-level drivers are activated, that is, the libraries and the management interfaces of the SCI protocol; this layer contains all the APIs (Application Program Interfaces) necessary for the correct management of the functionalities related to the memory and to the cache coherency.
At the end of this phase, the system is ready to execute the merging of the resources: it activates the mechanism for managing and sharing data through the entire system, makes the access to remote data transparent at the operating system level, initialises the cache coherency, manages the migration of process and thread memory and organises the SCI shared memory.
This occurs by means of layers 3, 4 and 5.
Layer 4 oversees the management of the process migration, while layer 5 contains the instruction for the construction of the cache coherency.
Specifically, the SCI protocol allows supporting multiprocessing through a general shared memory model realised through the coherency of the CPU caches.
Such mechanism is compatible with two separate programming models: on the one hand, some applications can maintain the cache coherency through a hardware-software control, while others can use MPI (message passing) type mechanisms. The SCI protocol permits easily supporting both models in a concurrent manner, i.e. simultaneously.
This feature allows the system to supply excellent performances for both programming models along with the flexibility for being able to use both applications of SMP type and MPI type.
The coherence mechanism can be implemented by working directly with the base request/response operations of the communication system.
The cache coherency system is based on a point-to-point scheme initiated by a requester (typically one of the CPUs) and terminated by a responder (typically a memory or another processor).
This level is integrated with a FPGA which accelerates and oversees all the functionalities necessary for the creation and management of the cache coherency.
Once the preceding steps have been terminated, the system is seen by the operating system as a single machine having a distributed shared memory. In order to facilitate and increase the performances of the machine itself, it is necessary to add a manager of the load and the distribution of the work along all the nodes composing the machine itself .
The importance of balancing the load is strategic for being able to obtain high performances, reducing the work load of the interconnection system.
This layer 6 is defined as a global scheduler.
Every processor manages numerous tasks which must be balanced in a global vision in order to actually benefit from a distributed architecture of SSI type like that described.
One of the major problems affecting the NUMA (Non Uniform Memory Access) architecture machines is in fact the different access times to the various memory units .
This limit, if not correctly managed, can negatively and drastically affect the final performances of the machine itself.
It is evident that the need to access remote memory zones leads to the intensive use of the interconnection system, with an increase of the resource load used by the entire system. In addition, the access time to a remote resource is clearly higher than that necessary for accessing a local resource .
For this reason, the average access time to the memory, if the processes are not correctly organised, can become quite high.
In order to draw maximum benefit from the processing power deriving from an architecture as described above, it is necessary to optimise the load balancing process related to the various CPUs, and in particular to manage the memory access in an optimal manner.
In order to satisfy such need, it is necessary to consider several theoretical schemes which oversee and describe the functioning and performances of a parallel machine.
A starting point for an analysis of the implementation mechanisms of an efficient scheduler is to consider Amdahl's law, from which it is clear that a correct preparation of the processes and a correct organisation of the resources allow carrying out the tasks that cannot be parallelised while minimising the total execution time losses.
In other words, there is a mode for organising the processes so that the non-parallelised percentages are carried out on a node which has allocated in its memory all the data necessary for completing the task.
Hence, the characteristic applied to the NUMA machines is that of preparing, in a memory portion "close" to the processor, the sequential calculation that such processor must execute.
In such a manner, the execution is much faster than it would be if the processor had to wait for the arrival of data from a remote memory.
Keeping in mind the existence of a non- parallelisable code portion inside a scheduler is important for the maximisation of the performances.
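For reference, Amdahl's law states that with N processors and a parallelisable fraction p of the work, the speedup is bounded by 1 / ((1 - p) + p / N); a small numeric sketch shows how quickly the sequential fraction dominates:

    #include <stdio.h>

    /* Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N), where p is the
     * parallelisable fraction of the work and N the number of processors. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        /* Even with 95% parallelisable code, 64 CPUs give only about 15.4x:
         * the sequential 5% dominates, hence the importance of executing it
         * on a node that holds all of its data in local memory. */
        printf("p=0.95, N=64 -> %.1fx\n", amdahl_speedup(0.95, 64));
        printf("p=0.99, N=64 -> %.1fx\n", amdahl_speedup(0.99, 64));
        return 0;
    }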
The global scheduler algorithm is conceived for maintaining the data as close as possible to the CPU set for executing that data, making the block processes migrate towards the processing units. Of course, for every task, a home node is assigned to which a memory portion taken from the local memory is allocated.
The scheduler can change the position of the home in consideration of the work load via bulk migration of the home and its memory, maintaining the validity of the home principle associated with the local memory.
This system considerably reduces the work load related to the remote memory access .
The scheduler is structured for balancing and monitoring the load of the CPUs, and allows: using MPI applications by configuring MPICH as if the machine were an SMP machine; managing the migration of the threads, balancing the load between the processing nodes by utilising the shared memory; managing a global indexing ID of the processes and their migration; and creating remote processes, indexing them and making them migrate.
Only with regard to the first node (or main node), a loader 7 is implemented. The task of the loader is the following. At the end of the execution of the SUPERBIOS, the system appears as an SSI NUMA-like system (Single System Image NUMA machine). At this point, there arises the need to install an operating system capable of placing the user in conditions to operate with the machine itself.
The loader ensures that it is possible to install a non-modified operating system on the main node, which after the installation of the management drivers (SSI NUMA driver) is capable of "seeing" all the resources of the machine itself.
One variant is obtainable by inserting a loader 7 in every node .
This modification allows making a particular configuration in which it is possible to install standard virtualisation software without making substantial modifications to their structure.
The functioning mechanism is the following.
By inserting a loader in every single node, it is possible to load a kernel outside the BIOS of the node. It is thus also possible to load a remote image of a kernel and then carry out a system boot from a common source.
Most virtualisation software is made in a manner so as to carry out the boot of the nodes additional to the first node by means of the PXE protocol. This means that their structure is conceived so as to install a kernel on a primary node and diffuse this kernel onto the remote nodes.
The introduction of an additional loader allows emulating this procedure without having to modify the architecture of the machine itself, also permitting two options: the first, simply installing the communication drivers inside the kernel of the virtualisation system and then providing the use of the interconnection network to the nodes of the virtual machine; the second, modifying the kernel of the virtualisation machine in a manner such that it is possible to use the SMP resources of the system.
Reference is now made to the global scheduler.
The global scheduler can be structured so it can be considered a configurable dynamic scheduling system. At the state of the art, there are essentially two scheduler types : the static scheduler and the dynamic scheduler. Both of these scheduler types can be of adaptive type, i.e. capable of being adapted to the task to be executed.
A scheduling system of static type organises the processes according to a series of policies (rules contained in one or more scripts). A scheduler of static type migrates a process at its origin, balancing the load of the system, but is not capable of making the process migrate once it is in execution.
This signifies that the problem of balancing the load while the processes are being executed cannot be solved.
The dynamic scheduling scheme is capable of making the processes migrate even while these are being executed. In this manner, if the machine is found in an unbalanced load phase and the need for a system balancing is verified, the scheduler is capable of removing a process and reallocating it in a different position without stopping its execution. The schedulers, in order to provide good integration and good system performances, must be capable of managing a series of tasks: sequential applications, parallel applications which communicate by means of shared memory, applications which communicate by means of message passing (MP), and the correct distribution of the load of applications which require a large quantity of memory or of CPU.
These different tasks require a global management of the differentiated resources.
Most of the systems offer a scheduler of static type.
Many Beowulf-type systems offer a set of utilities for correctly configuring a scheduler and for managing the execution of MPI (message passing interface) applications or PVM (Parallel Virtual Machine) applications.
When an application is launched, the MPI system normally sets the process on a node of the system, without prior knowledge of the state of the node in that moment. This approach has numerous drawbacks: first of all, it is usually usable only by a user capable of managing the pertaining application from the beginning; in addition, he must be capable of developing according to the MPI and PVM libraries.
This user type is very rare and does not represent the average user of a processing system.
Another negative point consists of the fact that the work load distribution along the nodes occurs without the system having the least knowledge of the state of occupation of the affected nodes, which in turn signifies that the balancing of the work load does not occur in an optimal manner.
By means of a dynamic scheduler, one is capable of managing the processes by using an active balancing of the work load, operating in a transparent manner with regard to the user and to the application, i.e. it is not necessary to write explicit rules which automate the migration of the processes.
In this case, the migration of the processes accounts for the real occupation of the nodes .
Starting from the assumption that a task can be executed with real benefit on a series of distributed processors only if the load of the latter is effectively balanced, a scheduler must be configurable in order to be adapted to the specific task that the machine is called to undertake.
The scheduler also requires a mechanism which allows specifying new policies.
In standard Linux, processes and threads can be created with the fork and execv calls. In an SMP system, the standard pthread interface allows creating and managing threads in a shared memory setting.
The global scheduler provided by the present invention serves to implement a mechanism of such type in order to render the machine a virtual SMP system.
The global scheduler is composed of a plurality of layers, in particular three layers: a first probe layer structured so as to detect the average occupation state of the nodes; a second analyser layer which intercepts the management of the scheduling and highlights its errors (for example, an excessive load on a single node or a hardware failure of one or more nodes); and a third manager layer which manages the overall load and solves the problems reported by the analyser layer, permitting balancing the load between all the nodes present.
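The interaction between the three layers can be sketched as follows (a simplified illustration; all names and thresholds are hypothetical):

    #include <stdio.h>

    /* Simplified sketch of the probe -> analyser -> manager chain. */
    struct probe_sample { int node; double cpu_load; double mem_used; };

    /* Analyser layer: intercepts the samples and flags irregular states. */
    static int analyse(const struct probe_sample *s, double load_limit)
    {
        return s->cpu_load > load_limit;   /* 1 = scheduling request needed */
    }

    /* Manager layer: has the global vision and reacts to the requests. */
    static void manage(const struct probe_sample *s)
    {
        printf("manager: migrate a process away from node %d (load %.0f%%)\n",
               s->node, s->cpu_load * 100.0);
    }

    int main(void)
    {
        struct probe_sample samples[] = {  /* probe layer output, one per node */
            { 0, 0.35, 0.40 }, { 1, 0.82, 0.70 }, { 2, 0.20, 0.30 }
        };
        for (int i = 0; i < 3; i++)
            if (analyse(&samples[i], 0.60))  /* hypothetical 60% threshold */
                manage(&samples[i]);
        return 0;
    }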
The probe layer measures the load present in every single CPU and the occupation state of the memory.
Two probe layer types are provided for, and every probe layer is interconnected with the local analyser layer, to which it sends the related data.
The active probe layers work in synchronism with the system timer, while the passive probe layers react to single events.
The events can be of kernel type or of global scheduler type. When a passive probe layer is activated by an event, the probe layer immediately sends the message to the analyser layer.
For example, an active probe layer is used for monitoring the load of a CPU: in this manner, the reported value is constantly updated and constantly sent to the analyser layer, while it is convenient to use a passive probe layer for monitoring the ping-pong of memory pages. The analyser layer, synthesised in FPGA, receives information from the probe layers, analyses it and highlights possible irregular system states.
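A minimal sketch of the two probe layer types (names and values are hypothetical) could be:

    #include <stdio.h>

    /* Active probe: sampled in synchronism with the system timer. */
    static double sample_cpu_load(int node)
    {
        (void)node;
        return 0.42;  /* stub measurement; a real probe would read hardware counters */
    }

    /* Passive probe: reacts to a single event, e.g. the ping-pong of a memory page. */
    static void on_page_pingpong(int node, unsigned long page)
    {
        printf("passive probe: page %lu ping-pong on node %d -> message to analyser\n",
               page, node);
    }

    int main(void)
    {
        for (int tick = 0; tick < 3; tick++)  /* timer-driven, constantly updated */
            printf("active probe: node 0 load %.2f (tick %d)\n",
                   sample_cpu_load(0), tick);
        on_page_pingpong(0, 42UL);            /* event-driven, sent immediately   */
        return 0;
    }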
A series of analyser layers can function in a parallel manner on each node. This permits an effective monitoring of the state of the system and an optimal management of the load balancing along all the nodes.
For example, it is possible to monitor a CPU regarding the work load, monitor the temperature condition of the CPU and/or the power supplies of the system, such that the analyser is capable of processing all the information necessary for carrying out the most suitable operations .
If a problem related to a CPU is detected, the local analyser layer sends a scheduling request to the manager layer; the manager layer has a global vision of the resources and reacts by carrying out the best possible move.
In detail, the system state analyses are executed in hardware, so that the monitoring has no impact on the work load of the CPUs. This type of approach allows obtaining, on the one hand, probe layers and analyser layers that are very fast, and on the other hand the ability to carry out a great number of tests and detections without compromising the processing power of every single CPU.
The manager layer is present in every node inside the FPGA and is connected to the local analyser layers .
The various manager layers which make up the system (distributed in the single FPGA present in the coprocessors) communicate with each other in order to exchange the information on the state of the nodes .
The manager layer is the only layer which has an overall vision of the system state.
In order to make them communicate with each other, it is possible to use a communication barrier between the single FPGAs, whose task is that of accelerating the data exchange between the FPGAs present in the system without using the data communication network (SCI BUS). It is also possible to map these system communications on a dedicated Gbit Ethernet interface. It should be underlined that the use of a separate interconnection allows obtaining performances greater than any other traditional system in which the scheduler communicates on the data bus. This separation of the data connection from the connection between the manager layers is made possible by the implementation structure of the manager layers, which are autonomous from the data system.
The scheduler can be represented as a global supervisor of the system which acts on the system but which autonomously processes a large quantity of information.
In addition, this structure, permitting the quick analysis of a large quantity of information due to the speed obtainable through the FPGA acceleration hardware, makes it possible to obtain a very high efficiency scheduler capable of making decisions based on large quantities of information.
The global vision of the system is obtained by means of the information deriving from the probe layers implemented, and allows detecting all the "unbalancing" problems which could be generated. When a scheduling problem is detected, the manager layer is capable of carrying out a series of operations, such as migrating the process or executing the "checkpoint" of an application (a checkpoint is a procedure by which an application is photographed in order to permit its recovery in case of failure), in accordance with the programming policies of the scheduler, in order to obtain maximum efficiency.
The manager layer is capable of dynamically executing every scheduling policy. The manager layers, the probe layers and the analyser layers can be configured according to a system interface which allows programming the machine in order to obtain an overall adaptive system capable of providing maximum performances according to the task which is assigned thereto.
One example of a possible approach to the programming of the scheduler could be the following: when the work load of a CPU exceeds 60%, the process is migrated.
In order to implement this rule, it is necessary that the probe layer related to the CPU load be active, and that the local analyser layer be configured for communicating the data to the manager layer when an overrun is detected.
Once the configuration parameters are set, the functions are executed inside the FPGA coprocessor.
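Purely as an illustration, such a rule could be encoded as a configuration of the three layers in the following manner (all names are hypothetical):

    #include <stdio.h>

    /* Hypothetical encoding of the rule "when the work load of a CPU exceeds
     * 60%, the process is migrated": the CPU-load probe is activated and the
     * analyser threshold and manager action are set accordingly. */
    struct sched_policy {
        int    probe_cpu_load_active;  /* the CPU-load probe layer must be active */
        double analyser_threshold;     /* the analyser reports loads above this   */
        int    action_migrate;         /* manager response: migrate the process   */
    };

    int main(void)
    {
        const struct sched_policy migrate_over_60 = { 1, 0.60, 1 };
        printf("migrate when load > %.0f%%\n",
               migrate_over_60.analyser_threshold * 100.0);
        return 0;
    }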
As stated above, said hardware acceleration allows collecting and processing a high number of data without impacting the real performances of the machine. On the other hand, the great quantity of data allows managing the manager layer with optimal results for the performances and the general balancing of the machine.
The use of numerous probe layers is important for making a manager layer which is capable of acting in the best possible manner in many different situations.
By using the I2C interfaces present on normal motherboards and by extrapolating therefrom the current values, along with the fan speeds and the temperatures, it is possible to supply the scheduler with information related not only to the work load of the node, but also to its health state. This allows operating an intelligent scheduling of the system, in order to permit and facilitate maintenance in an automatic manner.
It is in fact possible to prearrange the scheduler such that, if the parameters meet a series of rules which identify a potential failure within a short time period, the processes are made to migrate onto other nodes, in order to free the damaged node of its work load and, for example, to proceed with signalling the imminent danger through a hardware signaller and/or with turning off the node itself.
Moving along the lines of the same principle, it is possible to prearrange the scheduler such that it administers the energy resources in an intelligent manner, diminishing the system consumptions.
In other words, it is possible to operate in the following manner: a policy is set which tells the scheduler that, if the load of a single node falls below a predetermined value, for example 10%, for more than a predetermined time, for example 10 minutes, the processes contained therein are made to migrate onto the remaining nodes. The rule also obliges collecting as much of the load as possible on a single node as soon as the load of the remaining nodes falls below a predetermined value, for example 15%; this signifies that, in case of low activity, the work load is concentrated on the smallest possible number of nodes.
If the state of inactivity of the nodes thus organised persists for more than a predetermined time, for example 10 minutes, the scheduler can be ordered to turn off the inactive nodes.
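A sketch of part of this energy policy (thresholds as in the example above; names are hypothetical, and the 15% consolidation step is omitted for brevity) could be:

    #include <stdio.h>

    /* Hypothetical sketch of the energy policy described above: nodes below
     * 10% load for 10 minutes are drained, and fully idle nodes are switched
     * off after a further period of inactivity. */
    struct node_state { int id; double load; int minutes_low; };

    static void energy_step(const struct node_state *n)
    {
        if (n->load == 0.0 && n->minutes_low >= 10)
            printf("node %d: inactive for %d min -> power off\n",
                   n->id, n->minutes_low);
        else if (n->load < 0.10 && n->minutes_low >= 10)
            printf("node %d: load %.0f%% -> migrate processes to remaining nodes\n",
                   n->id, n->load * 100.0);
    }

    int main(void)
    {
        const struct node_state nodes[] = { {0, 0.55, 0}, {1, 0.08, 12}, {2, 0.0, 15} };
        for (int i = 0; i < 3; i++)
            energy_step(&nodes[i]);
        return 0;
    }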
It is clear that at the moment in which the work load increases, the nodes are reactivated.
This management of the scheduler reduces the energy consumptions by over 40%.
The machine architecture and the process thus conceived are susceptible to numerous modifications and variants, all coming within the scope of the inventive concept.
In addition all the details can be substituted by technically equivalent elements.

Claims

1. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, characterised in that it comprises at least two SMP computers connected in parallel with each other by means of at least one interconnection BUS and uses FPGA processors dedicated to specific functionalities and to managing a dedicated physical memory separated from the system memory, adapted to make a cc-NUMA system of medium and large size managed by means of a BIOS system adapted to optimise the resources in question and to accelerate the capacities of the system.
2. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to claim 1, characterised in that said FPGA processors are adapted for creating a system cache coherency.
3. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SMP computers are of commercial type.
4. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said BUS has both the shared memory functionalities and said cache coherency functionalities, with an enormous simplification of the management of the same by the system firmware.
5. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said BUS has a very low latency.
6. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SMP computers are connected with each other by an interconnection system based on the SCI protocol (Scalable Coherent Interface) .
7. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SCI is a BUS adapted for interconnecting multiprocessor systems or for the interconnection of processor-memory systems.
8. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said BUS comprises a series of said FPGA suitably configured for carrying out the function of coprocessors in the creation of the shared memory and of said cache coherency for making a NUMA or cc-NUMA machine, to which a control BIOS is added capable of creating a Single System Image configuration .
9. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that it comprises a preferably firmware level software definable in a SUPERBIOS capable of virtualising the discrete architecture of said computers, transforming it into a cc-NUMA machine and optimising the load management of the processors in order to obtain performances similar to a SMP machine in which even systems non-optimised for the NUMA architecture can function correctly.
10. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SUPERBIOS comprises a set of algorithms which are sent to the machine after having carried out all the BIOS base instructions, i.e. the standard boot BIOS of the single nodes, said algorithms permitting the configuration of the machine as a Single System Image of partitionable SMP type.
11. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SUPERBIOS has a level structure composed of modules, every module oversees specific functionalities which can or cannot be integrated with a hardware acceleration entrusted to one or more FPGA or to an ASIC and one or more FPGA.
12. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the management and the optimisation of the problems related to the NUMA architecture are executed in a transparent manner with regard to the system itself.
13. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that at the hardware level it comprises suitable FPGAs programmed for the creation and management both of the memory and of the cache and which allow accelerating the machine itself and obtaining very high performances .
14. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said FPGAs can in turn control a DDR or similar memory placed on the board(s) containing the FPGAs themselves, in order to eliminate the need to dedicate a portion of the local memory for the management of the caches.
15. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that it uses a dedicated memory which allows a further increase of the overall performances of the machine itself thus achieved.
16. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the SCI interface is designed for reaching a transfer rate of a gigabyte per second per channel, with the possibility to add several parallel channels .
17. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that at the electrical level, the BUS SCI can be borne over cable .
18. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the combination of linearly scalable pass-band and a very low latency allows creating an interconnection system capable of reaching performances that are typical of supercomputers .
19. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SCI supports, at the protocol level, the distributed shared memory and the cache coherency, it being also possible to implement and use libraries of Message Passing (MPI) type or PVM (parallel virtual machine) type typically used in supercomputer software.
20. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said SCI is a protocol which is adapted for achieving an interconnection BUS.
21. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that by using a series of LVDS channels and an SCI processor together with one or more coprocessors made by means of FPGA, a board is made to be inserted inside the SMP nodes composed of standard hardware .
22. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that for the connection of said board to the systems on the market, it is possible to use the normally pre-existing expansion interfaces.
23. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said interface has a high pass-band or allows obtaining a high pass-band by utilising several channels simultaneously.
24. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that it has an interface towards the PCI-express which, by means of a suitable bridge, communicates with the protocol coprocessor which integrates both the SCI instructions and the cache coherency- shared memory controller made on the logic level, following the characteristics and functionalities described in the protocol itself.
25. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said coprocessor has a controller for the local memory access, connected to a controller for the remote memory, said two controllers being connected to an on-board memory controller for the management of the preferably high speed RAM memory mounted directly on a board usable as local cache .
26. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said controllers communicate by means of a dual-port memory controller with the LVDS transmission/reception interface or interfaces, which are made following the IEEE Scalable Coherent Interface specifications.
27. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that it has a further coprocessor containing the instructions related to the acceleration of the SUPERBIOS, in order to act as an accelerator device of FPGA type, accelerating software algorithms in hardware .
28. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the processing unit is preferably composed of processor with a cache thereof, a local RAM memory and a series of Input/Output interfaces functioning as a base element for achieving the system.
29. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that, to said unit, one or more communication interfaces are connected which are adapted for making the SCI communications part both at the electrical level by providing the necessary apparatus for the communication of the signal, and by implementing the reading/writing or calling and response base set typical of the SCI protocol .
30. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the board containing the coprocessors has an interface which allows the connection of the coprocessors to the system BUS.
31. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the coprocessor achieves the cache coherency set of the SCI protocol by making the shared memory, through a series of modules, managing the access to the local memory and the access to the remote memory .
32. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that a series of RAM memories are directly connected to the coprocessor and are used as cache memories for the management of the cache coherency.
33. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that a second coprocessor can be implemented for accelerating several recursive functions of the system Kernel, such as for example those related to the global scheduler.
34. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the balancing of the processes and the management of the overall work load of the machines is obtained by means of the global scheduler.
35. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said scheduler can be interfaced with a specific coprocessor which allows a very quick performance of all the main operations which the scheduler is set to undertake.
36. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that the algorithm of said global scheduler is conceived for maintaining the data as close as possible to the CPU set for executing that data, making the block processes migrate towards the processing units.
37. Machine architecture composed of a software level and a hardware level, interacting with each other regardless of the initial configuration of said machine, according to one or more of the preceding claims, characterised in that said scheduler is structured for balancing and monitoring the load of the CPUs, and allows using MPI applications by configuring MPICH as if it were an SMP machine, managing the migration of the threads by balancing the load between the processing nodes utilising the shared memory, managing a global indexing ID of the processes and their migration, and creating remote processes, indexing them and making them migrate.
38. Process for making a machine architecture according to one or more of the preceding claims, characterised in that said architecture is composed of two levels, hardware and software, which interact with each other such that the overall functionalities are defined on one hand by the high performance interconnection and low latency and on the other by the functionalities which the SUPERBIOS assigns to this interconnection and to the accessories connected thereto (FPGA) , so as to configure said machine for obtaining both said NUMA architecture and a series of sub- architectures, non-implementing or implementing in a different manner several or all the memory and cache functionalities.
39. Process for making a machine architecture according to one or more of the preceding claims, characterised in that, by operating on the SUPERBIOS level, it is possible to not activate the cache coherency or the shared memory, making a pure and simple data passage on which it is possible to implement, at the operating system level, a top library, for example MPI (message passing interface), thus making the functionalities of a standard supercomputer or cluster.
40. Process for making a machine architecture according to one or more of the preceding claims, characterised in that it is possible to make an architecture of share nothing type, opposite of the SSI (Single system image) , capable of being employed for the resolution of specific tasks.
41. Process for making a machine architecture according to one or more of the preceding claims, characterised in that said SUPERBIOS can be accelerated and coordinated by the algorithms contained inside the FPGA processor placed on the interconnection board.
42. Process for making a machine architecture according to one or more of the preceding claims, characterised in that said algorithms are not necessarily those dedicated to managing the memory and cache but can be extended to the management of other aspects of the machine .
43. Process for making a machine architecture according to one or more of the preceding claims, characterised in that once the reference operating system is chosen, it is possible to make a hardware acceleration of the base functionalities of the system kernel for managing the access to the disks, integrating everything into a single system which will have performances of considerable importance with respect to a similar system that can be made with conventional technologies .
44. Process for making a machine architecture according to one or more of the preceding claims, characterised in that, if there are specific data encrypting and decrypting algorithms, said algorithms can be implemented at the FPGA level by making dedicated coprocessors.
45. Process for making a machine architecture according to one or more of the preceding claims, characterised in that the SCI interface is designed for reaching a transfer rate of one gigabyte per second per channel, with the possibility to add several parallel channels .
46. Process for making a machine architecture according to one or more of the preceding claims, characterised in that the single nodes are connected with each other with a point-to- point connection, eliminating both the need of a switch and further latency.
47. Process for making a machine architecture according to one or more of the preceding claims, characterised in that the point-to- point connections are organised according to two- or three-dimension torus ring types, each of these types being parallelised, obtaining multiple ring architecture, with multiple two- dimensional or three-dimensional torus.
48. Process for making a machine architecture according to one or more of the preceding claims, characterised in that, at the end of the execution of the SUPERBIOS, the system appears as an SSI NUMA-like system (Single System Image NUMA machine), and at this point there arises the need of installing an operating system which is capable of placing the user in conditions such that he can operate the machine itself.
49. Process for making a machine architecture according to one or more of the preceding claims, characterised in that the loader ensures that it is possible to install a non-modified operating system on the main node, which following the installation of the management drivers (SSI NUMA driver) is capable of seeing all the resources of the machine itself.
50. Process for making a machine architecture according to one or more of the preceding claims, characterised in that, in a structural variant, it is possible to insert a loader into every node, so as to achieve a particular configuration in which it is possible to install standard virtualisation software without making substantial modifications to their structure.
51. Global scheduler for managing processes in a plurality of nodes, each comprising one or more processors, characterised in that it has a configurable dynamic stratified structure, a first probe layer being adapted to detect the average occupation state of the nodes, a second analyser layer being adapted to intercept the scheduling management in order to highlight errors, and a third management layer being adapted to manage the overall load and to solve the problems reported by the analyser layer, allowing balancing the load between all present nodes.
52. Global scheduler according to the preceding claim, characterised in that said probe layer measures the current load of every single CPU and the occupation state of the memory.
53. Global scheduler according to one or more preceding claims, characterised in that said manager layer is made by using the I2C interfaces present on the normal motherboards and by extrapolating therefrom the current values, along with the fan speeds and the temperatures, so as to provide the scheduler with information related not only to the work load of the node but to its entire health condition.
54. Global scheduler according to one or more preceding claims, characterised in that it prearranges the scheduler such that, if the parameters respond to a series of rules which identify a potential failure in a short time period, the processes are made to migrate onto other nodes in order to liberate the work load of the damaged node .
55. Global scheduler according to one or more preceding claims, characterised in that it prearranges the scheduler for the intelligent administration of the energy resources, diminishing the consumptions of the system, by setting a policy which tells the scheduler that, in case the load of a single node descends below a predetermined value for more than a predetermined time, the processes contained therein are made to migrate onto the remaining nodes.
56. Global scheduler according to one or more preceding claims, characterised in that it collects as much load as possible on a single node as soon as the load of the remaining nodes falls below a predetermined value, such that in case of low activity the work load is concentrated on the smallest possible number of nodes.
57. Global scheduler according to one or more preceding claims, characterised in that, if the state of inactivity of the nodes thus reorganised persists for more than the predetermined time, the scheduler is ordered to turn off the inactive nodes.
PCT/EP2008/007897 2007-09-21 2008-09-19 Parallel computer architecture based on cache- coherent non-uniform memory access (cc-numa) WO2009036987A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ITMI2007A001829 2007-09-21
ITMI20071829 ITMI20071829A1 (en) 2007-09-21 2007-09-21 MACHINE ARCHITECTURE COMPOSED OF A SOFTWARE LEVEL AND A HARDWARE LEVEL INTERACTING WITH EACH OTHER REGARDLESS OF THE INITIAL CONFIGURATION OF THE MACHINE, AND PROCESS FOR MAKING SAID MACHINE ARCHITECTURE

Publications (2)

Publication Number Publication Date
WO2009036987A2 true WO2009036987A2 (en) 2009-03-26
WO2009036987A3 WO2009036987A3 (en) 2009-06-18

Family

ID=40085619

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2008/007897 WO2009036987A2 (en) 2007-09-21 2008-09-19 Parallel computer architecture based on cache- coherent non-uniform memory access (cc-numa)

Country Status (2)

Country Link
IT (1) ITMI20071829A1 (en)
WO (1) WO2009036987A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5938765A (en) * 1997-08-29 1999-08-17 Sequent Computer Systems, Inc. System and method for initializing a multinode multiprocessor computer system
US6421775B1 (en) * 1999-06-17 2002-07-16 International Business Machines Corporation Interconnected processing nodes configurable as at least one non-uniform memory access (NUMA) data processing system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8977637B2 (en) 2012-08-30 2015-03-10 International Business Machines Corporation Facilitating field programmable gate array accelerations of database functions
US8983992B2 (en) 2012-08-30 2015-03-17 International Business Machines Corporation Facilitating field programmable gate array accelerations of database functions

Also Published As

Publication number Publication date
WO2009036987A3 (en) 2009-06-18
ITMI20071829A1 (en) 2009-03-22

Similar Documents

Publication Publication Date Title
JP5782445B2 (en) How to allocate a portion of physical computing resources to a logical partition
Quraishi et al. A survey of system architectures and techniques for fpga virtualization
US8930507B2 (en) Physical memory shared among logical partitions in a VLAN
KR102103596B1 (en) A computer cluster arragement for processing a computation task and method for operation thereof
CN107479943B (en) Multi-operating system operation method and device based on industrial Internet operating system
US20080288747A1 (en) Executing Multiple Instructions Multiple Data ('MIMD') Programs on a Single Instruction Multiple Data ('SIMD') Machine
TW200405206A (en) Virtualization of input/output devices in a logically partitioned data processing system
CN102521209B (en) Parallel multiprocessor computer design method
CN113778612A (en) Implementation Method of Embedded Virtualization System Based on Microkernel Mechanism
US20110197196A1 (en) Dynamic job relocation in a high performance computing system
US20250278312A1 (en) Embedded system running method and apparatus, and embedded system and chip
JPH1097509A (en) Method and apparatus for distributing interrupts in a symmetric multiprocessor system
Wulf et al. A survey on hypervisor-based virtualization of embedded reconfigurable systems
CN113312141A (en) Virtual serial port for virtual machines
Pagani et al. A Linux-based support for developing real-time applications on heterogeneous platforms with dynamic FPGA reconfiguration
KR20160105636A (en) Server Virtualization Method of Multi Node System and Apparatus thereof
Xia et al. Hypervisor mechanisms to manage FPGA reconfigurable accelerators
US11243800B2 (en) Efficient virtual machine memory monitoring with hyper-threading
WO2009036987A2 (en) Parallel computer architecture based on cache-coherent non-uniform memory access (cc-numa)
US10152341B2 (en) Hyper-threading based host-guest communication
Gantel et al. Dataflow programming model for reconfigurable computing
Iserte et al. Gsaas: A service to cloudify and schedule gpus
CN114281529A (en) Distributed virtualization guest operating system scheduling optimization method, system and terminal
JP2023510131A (en) System-on-Chip Operating Multiple CPUs of Different Types and Operation Method Thereof
Akshintala et al. Talk to my neighbors transport: Decentralized data transfer and scheduling among accelerators

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08802405

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08802405

Country of ref document: EP

Kind code of ref document: A2