
NL2014534A - Many-core operating system. - Google Patents

Many-core operating system

Info

Publication number
NL2014534A
NL2014534A
Authority
NL
Netherlands
Prior art keywords
clusters
core
core processor
tasks
cores
Prior art date
Application number
NL2014534A
Other languages
Dutch (nl)
Other versions
NL2014534B1 (en)
Inventor
Keimpe Rauwerda Gerhardus
Original Assignee
Recore Systems B V
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Recore Systems B V filed Critical Recore Systems B V
Priority to NL2014534A priority Critical patent/NL2014534B1/en
Priority to PCT/NL2016/050214 priority patent/WO2016159765A1/en
Publication of NL2014534A publication Critical patent/NL2014534A/en
Application granted granted Critical
Publication of NL2014534B1 publication Critical patent/NL2014534B1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 - Interprocessor communication
    • G06F15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356 - Indirect interconnection networks
    • G06F15/17362 - Indirect interconnection networks, hierarchical topologies
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)

Abstract

Many-core processor architecture comprising a plurality of clusters, wherein each of said clusters comprises a plurality of arithmetic cores, a shared memory arranged to be accessed by each of said plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting said plurality of clusters, and a control core arranged for locally scheduling tasks over said arithmetic cores within a same cluster, wherein said many-core processor architecture comprises a many-core operating system arranged for distributing tasks over said clusters and wherein said many-core operating system comprises a plurality of cooperating microkernels distributed over each of said clusters and arranged for running in a kernel space of said control core of each of said clusters for locally scheduling said tasks over said arithmetic cores within a same cluster.

Description

TECHNICAL FIELD
The present invention generally relates to semiconductor structures and more particularly to an operating system architecture for a semiconductor processor architecture employing a distributed and heterogeneous or homogeneous many-core architecture.
BACKGROUND OF THE INVENTION
The term multi-core processor is typically used for a single computing processor with two or a few independent central processing units, i.e. cores, which are used for reading and executing program instructions. By incorporating multiple cores on a single chip, multiple instructions can be executed at the same time, increasing overall speed for programs amenable to parallel computing. These cores are typically integrated onto a single integrated circuit or onto multiple dies in a single chip package. Both cases are referred to as a processor chip.
The terms many-core, or massively multi-core, are sometimes used to describe multi-core architectures with an especially high number of cores. As such, a many-core processor is typically used for a single computing processor with many independent central processing units, i.e. cores, such as tens, hundreds or even thousands of cores.
Several limitations of multi-core processor chips exist which led to the development of many-core processors, such as imperfect scaling, difficulties in software optimization and maintaining concurrency over a number of cores.
In the course of time, the need for more processing power was initially fulfilled by increasing the speed with which instructions on the processor could be executed, hence increasing the clock speed of the processor. This speed, however, cannot be increased indefinitely because physical constraints then play a major role and can even be problematic. Further, and no less important, the energy consumption of a processor with such a high clock speed increases rapidly. A more recent development is therefore towards parallel execution of instructions by multiple processors or cores thereof. Among others, Intel has played a large role in the emergence of Symmetric Multicore Processing, SMP, systems in which a processor comprises two or more identical, i.e. homogeneous, processor cores. These cores are treated equally and none of them has a specific purpose. These are all general purpose processor cores.
One of the advantages of these SMP multicore processors, when it comes to design and costs for example, is that they share resources. That, however, is also a major drawback. Performance can degrade, or at least not be at the level of the sum of the individual cores, due to the sharing of memory for example. Memory latency and bandwidth can cause problems, and the latency and bandwidth problems of one core can affect the others.
As such, multicore processors like SMP, and in particular processors with even more cores, i.e. many-core systems, have the potential to provide high performance processing power with a large number of cores. However, increasing the number of cores also increases the complexity of software management thereof. As such, there is a need for a many-core processing chip having a plurality of clusters able to provide high processing power, in which at least some of the above identified problems of known multiprocessing architectures are reduced. More in particular, there is a need for a many-core processing chip having a plurality of clusters able to provide high processing power, comprising an operating system able to make efficient use of the potential processing power comprised in the chip.
SUMMARY OF THE INVENTION
In a first aspect of the invention, there is provided a many-core processor architecture having a distributed shared memory, the architecture comprising a plurality of clusters, wherein each of the clusters comprises: - a plurality of arithmetic cores; - a private memory arranged to be accessed by each of the plurality of arithmetic cores comprised in a single, corresponding cluster; - a shared memory being part of the distributed shared memory, and arranged to be accessed by a plurality of arithmetic cores comprised by each of the plurality of clusters; - an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and - a control core which is arranged for locally scheduling tasks over the arithmetic cores within a same cluster, wherein the many-core processor chip comprises a distributed many-core operating system arranged for distributing tasks over the clusters and wherein the distributed many-core operating system comprises a plurality of cooperating micro-kernels distributed over each of the clusters and arranged for running in a kernel space of the control core of each of the clusters for locally scheduling the tasks over the arithmetic cores within a same cluster.
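The cluster recited in this aspect can be sketched as a plain data structure. The following C sketch is illustrative only: all type names, sizes and the mix of core kinds are assumptions for illustration, not part of the claims.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants; the description does not fix cluster sizes. */
#define CORES_PER_CLUSTER 16
#define PRIVATE_MEM_WORDS 4096
#define SHARED_MEM_WORDS  4096

/* One arithmetic core; the `kind` field models a heterogeneous mix. */
typedef enum { CORE_DSP, CORE_ACCEL, CORE_GPP } core_kind;

typedef struct {
    core_kind kind;
    int       busy;   /* 1 while a task is running on this core */
} arith_core;

/* One cluster: arithmetic cores, a private memory, a slice of the
 * distributed shared memory, an IO port, and a control core id. */
typedef struct {
    arith_core cores[CORES_PER_CLUSTER];
    uint32_t   private_mem[PRIVATE_MEM_WORDS]; /* cluster-local only   */
    uint32_t   shared_mem[SHARED_MEM_WORDS];   /* globally addressable */
    int        io_port;       /* NoC endpoint interconnecting clusters */
    int        control_core;  /* runs the local microkernel            */
} cluster;

/* Count idle arithmetic cores, as a local scheduler might. */
static inline int idle_cores(const cluster *c) {
    int n = 0;
    for (size_t i = 0; i < CORES_PER_CLUSTER; i++)
        if (!c->cores[i].busy) n++;
    return n;
}
```

The control core's microkernel would operate only on its own `cluster` instance, which is what keeps the local scheduling local.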
Applications running on a multiprocessing architecture such as an SMP architecture require different programming methods to achieve high performance. Some performance increase can be reached if applications which are not particularly programmed for these architectures, i.e. which are programmed for uniprocessor systems, are run on SMP architectures. Hardware interrupts which suspend application tasks can be executed on an idle processor instead, hence resulting in at least some performance increase.
However, the best performance increase is achieved if the application is programmed in a way that it can easily be executed on multiprocessing architectures. There are even applications that run very well on multiprocessing architectures, with a performance increase of nearly the sum of the individual processors.
The operating system managing the applications must support multiprocessing, otherwise the additional processors remain idle and the system in principle functions as a uniprocessor system. Both an operating system and the applications running thereon, when designed for a multiprocessor architecture, are considerably more complex, as regards instruction sets for example. Homogeneous processor systems can require different processor registers for dedicated instructions in order to be able to efficiently run such a multiprocessor system. Heterogeneous systems, on the other hand, can implement different types of hardware for different instructions or uses.
As already briefly mentioned above, SMP systems share a single shared system bus or the like for accessing shared memory. Although there are alternatives in which, for example, different memory banks are dedicated to different processors, which increases memory bandwidth, different problems then arise, for example in inter-process communication when data is to be transferred from one processor to another.
With a clustered architecture in which a plurality of clusters exist and wherein each of the clusters comprises a plurality of arithmetic cores, a shared memory arranged to be accessed by each of said plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and a control core arranged for locally scheduling tasks over the arithmetic cores within a same cluster, in principle at least some of the above identified problems of prior art SMP architectures can be solved. For example, due to the dedicated shared memory within the cluster, the amount of intra-cluster data throughput can be decreased significantly, hence reducing memory bandwidth problems within the chip. A different problem, however, then arises in managing the tasks within the architecture, i.e. over the whole of the clusters. The bottleneck of the system is most likely in managing these tasks, i.e. in efficient use of the processes of the operating system. In common implementations an operating system is running on a General Purpose Processor, GPP, outside of the clusters. This GPP comprises an operating system and, most importantly, the kernel of the operating system.
The kernel is the core of the operating system and handles amongst others the management of the memory, interrupts and scheduling of the processes or tasks over the one or more processor cores. Most kernels also handle input/output of the processes and most control of the hardware within the computer, such as a network interface, graphical processing unit and the like.
In a single-core architecture, or even in multiprocessing architectures having a few cores, the operating system, and in particular the kernel thereof, can efficiently manage its tasks. Although with many-core systems, wherein the number of cores is substantially higher, the processing power can in principle be very high, the difficulty of efficient management of such a system by an operating system prevents maximum employment of the system.
In a first aspect of the invention there is provided a many-core processor architecture comprising a plurality of heterogeneous clusters. Each of these clusters comprises a plurality of heterogeneous arithmetic cores, a shared memory arranged to be accessed by each of the plurality of arithmetic cores, an input-output, IO, interface arranged for inter-connecting the plurality of clusters, and a control core arranged for locally scheduling tasks over the arithmetic cores within a same cluster.
Since each cluster comprises both its own dedicated memory for handling all tasks within the cluster and its own control core for managing these tasks and the memory, these are, in principle, arranged for local scheduling of the tasks. As such, each control core of a cluster could comprise an operating system. Such would, however, require too large a resource footprint. When only the kernel is comprised therein the footprint is reduced, but still too large for efficient use. As such, there is provided in a first aspect of the invention a microkernel that is sufficiently small to be comprised in a control core of a cluster of a many-core processing architecture without the need for the control core to increase in resource footprint.
The operating system, and in particular the user space applications, can be executed from a central general purpose processor on the architecture, outside of the clusters, while the core of the operating system, the kernel, in particular the microkernel, can be distributed as a plurality of cooperating microkernels running in a kernel space of the control cores of each of the clusters for locally scheduling the tasks over the arithmetic cores within a same cluster.
In an example the plurality of cooperating micro-kernels are arranged for memory management of the shared memory within each of the clusters and for memory management of the shared memory for inter-communication between the clusters.
In another example the plurality of cooperating micro-kernels are arranged for IO management of the shared memory within each of said clusters for inter-connection between the clusters.
In yet another example a scheduler of the plurality of cooperating micro-kernels is arranged for first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first or fixed priority preemptive scheduling of tasks over the arithmetic cores within said same cluster.
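One of the listed strategies, earliest deadline first, can be sketched in a few lines: the local scheduler simply selects the ready task with the smallest absolute deadline. The `task` fields and function names below are assumptions for illustration, not taken from the description.

```c
#include <stddef.h>
#include <limits.h>

/* A runnable task with an absolute deadline, in ticks (hypothetical). */
typedef struct {
    int id;
    int deadline;
} task;

/* Earliest-deadline-first: return the index of the ready task with the
 * smallest deadline, or -1 when the ready queue is empty. */
int edf_pick(const task *ready, size_t n) {
    int best = -1;
    int best_deadline = INT_MAX;
    for (size_t i = 0; i < n; i++) {
        if (ready[i].deadline < best_deadline) {
            best_deadline = ready[i].deadline;
            best = (int)i;
        }
    }
    return best;
}
```

The other listed regimes differ only in the comparison key (remaining time, fixed priority, queue position), so a control core could swap policies without changing the surrounding dispatch loop.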
In a further example of the invention, the many-core processor architecture comprises a plurality of heterogeneous clusters. A homogeneous architecture with shared global memory is easier to program for parallelism, that is, when a program can make use of the whole of the cores, compared to a heterogeneous architecture where the cores do not have the same instruction set. It was the insight of the inventors, however, that in case of an application which lends itself to being partitioned into long-lived threads of control with little or regular communication, it makes sense to manually and/or automatically put the partitions onto cores that are specialized for that specific task.
In another example of the invention the plurality of arithmetic cores comprises any of a Digital Signal Processor, DSP, a Hardware Accelerator core and a Processing General Purpose Processor, GPP.
In yet a further example, the IO interface comprises a Network-on-Chip, NOC, interface for communication between the plurality of clusters.
In another example, the control core is a General Purpose Processor.
In a second aspect of the invention, there is provided a computing system comprising a many-core processor architecture according to any of the examples provided above.
The expressions, i.e. the wording, of the different aspects comprised by the many-core processor architecture according to the present invention should not be taken literally. The wording of the aspects is merely chosen to accurately express the rationale behind the actual function of the aspects.
The above-mentioned and other features and advantages of the invention will be best understood from the following description referring to the attached drawings. In the drawings, like reference numerals denote identical parts or parts performing an identical or comparable function or operation.
The invention is not limited to the particular examples disclosed below in connection with a particular type of many-core processor architecture or with a particular many-core processor system.
This invention is not limited to chip implementations in CMOS. The architecture is technology agnostic, and can be implemented in FPGA and/or ASIC. So, according to all examples of the invention, a many-core processor architecture is considered to be implemented in one or more chips, such as FPGA or as ASIC.
Brief description of the drawings
Figure 1 is a schematic view of a topology of a many-core processor architecture according to the present invention.
Figure 2 is a schematic view of a topology of a computing system comprising a plurality of many-core processor architectures according to the present invention.
Figure 3 is a schematic view of a cluster layout according to the present invention.
Figure 4 is a schematic view of a topology of a many-core processor architecture with an operating system implementation according to an aspect of the present invention.
Figure 5 is a schematic view of a kernel layout for a many-core processor architecture according to an aspect of the invention.
Figure 6 is a schematic view of assigning tasks of an application to cores of the many-core processor architecture according to an aspect of the invention.
Detailed description of the drawings
Figure 1 is a schematic view of a topology of a many-core processor architecture 1 according to the present invention. Here, the many-core processor architecture 1 comprises a plurality of clusters 2, of which only four are shown. Typically, the architecture 1 comprises many more clusters, for example 100 - 1000 clusters. The present invention is not limited in the number of clusters to be incorporated in the many-core architecture 1, as the topology of the present invention is scalable, i.e. it is applicable for just a few clusters 2 to many hundreds or even thousands of clusters 2. A cluster 2 comprises a plurality of arithmetic cores 5, 6, 7, for example Digital Signal Processors, DSPs, 5, Hardware Accelerator cores 6 and Processing General Purpose Processors, GPPs, 7. As such, the clusters 2 may be heterogeneous clusters 2 as they comprise a variety of different types of arithmetic cores 5, 6, 7. Each cluster 2 of the many-core processor 1 may further comprise different types of arithmetic cores 5, 6, 7. In other words, the topology of each of the clusters 2 does not need to be the same; each cluster 2 may have its own set of various arithmetic cores 5, 6, 7.
Each cluster 2 typically comprises a memory 3, which is a physical piece of hardware under control of a control core 4. The memory may be split into a private memory and a shared memory (not shown). The difference between these types of memories is that a private memory comprised in a particular cluster 2 may only be accessed by the arithmetic cores 5, 6, 7 of that same cluster.
According to the present invention, the shared memory comprised in a cluster is a part of a larger distributed shared memory, wherein each of the shared memories of each of the clusters 2 may be addressed as one logically shared address space. The term shared thus does not mean that there is a single centralized memory in the architecture 1, but means that the address space of all the physical shared memories of each of the clusters 2 is shared.
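A common way to realise such a logically shared address space, sketched here purely as an assumption (the description does not specify an encoding), is to reserve the upper bits of a global address for the cluster identifier and the lower bits for the offset within that cluster's physical shared memory:

```c
#include <stdint.h>

/* Hypothetical split of a 32-bit global address: the upper bits name
 * the cluster whose physical shared memory holds the word, the lower
 * OFFSET_BITS bits are the offset inside that cluster's shared memory. */
#define OFFSET_BITS 20u
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1u)

/* Compose a global address from a cluster id and a local offset. */
uint32_t dsm_global_addr(uint32_t cluster_id, uint32_t offset) {
    return (cluster_id << OFFSET_BITS) | (offset & OFFSET_MASK);
}

/* Decompose a global address back into its two parts. */
uint32_t dsm_cluster_of(uint32_t global) { return global >> OFFSET_BITS; }
uint32_t dsm_offset_of(uint32_t global)  { return global & OFFSET_MASK; }
```

Under this scheme the IO interface of a cluster can route a remote access simply by inspecting the upper bits, without any central memory controller.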
The access to the memory 3, i.e. the local memory as well as the shared memory, is controlled by a control core 4. In other words, a control core 4 comprised by a particular cluster 2 is responsible for controlling access to the local memory and the shared memory comprised by that particular cluster 2.
Each cluster further comprises an input/output, IO, interface 9 arranged for inter-connecting the plurality of clusters 2 together. This means that the clusters 2 can communicate with each other via their corresponding input/output interface 9. The control core 4 is also arranged to control the communication at the IO interface 9, i.e. the communication between a plurality of clusters 2. The client application core/system 10 is then also connected to the clusters 2 via a same type of IO interface 9.
Typically, the IO interface 9 comprises a Network-on-Chip, NOC, interface for communication between said plurality of clusters 2. The inventors found that the use of a bus between the clusters 2 may not be beneficial, especially in cases where many clusters 2 are comprised in a single architecture 1. Using a bus has several downsides: the bandwidth is limited and shared, the speed goes down as the number of clusters 2 grows, there is no concurrency, pipelining is tough, there is central arbitration and there are no layers of abstraction. The advantages of using a NOC are that the aggregate bandwidth grows, the link speed is unaffected by the number of clusters 2, there is concurrent spatial reuse, pipelining is built-in, arbitration is distributed and abstraction layers are separated.
Disclosed in figure 1 is a topology of a many-core processor architecture 1, which is a homogeneous topology. However, according to the present invention, the topology may also be a heterogeneous topology, wherein the layout of each cluster may be different.
Figure 2 is a schematic view of a topology of a computing system 50 comprising a plurality of many-core processor architectures according to the present invention.
Here, a computing system 50 is displayed having a hierarchical structure of clusters. The computing system 50 comprises a plurality of main clusters 61, 62, 63, 64, wherein each of said main clusters comprises a plurality of clusters 51, 52, 53, 54, i.e. sub-clusters.
One of the plurality of clusters 51, 52, 53, 54, i.e. sub-clusters, of a single main cluster 61, 62, 63, 64 is appointed as the responsible sub-cluster for the corresponding main cluster 61, 62, 63, 64. As such, that sub-cluster is arranged to, for example, distribute tasks among the sub-clusters and to handle communication between the sub-clusters.
The input-output, IO, interface may be arranged for inter-connecting said plurality of clusters, for communication between the plurality of clusters in the form of message queuing and for address space translation for access to the shared memory.
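The message queuing mentioned here could be realised, as one possible sketch, with a fixed-size ring buffer that a cluster's IO interface keeps per peer. None of the names or sizes below come from the description; they are illustrative assumptions.

```c
#include <stdint.h>

/* Capacity of one per-peer queue (hypothetical). */
#define MQ_CAP 8

typedef struct { uint32_t payload; } msg;

typedef struct {
    msg      buf[MQ_CAP];
    unsigned head, tail;   /* head: next read, tail: next write */
} msg_queue;

/* Enqueue a message; returns -1 when the queue is full. */
int mq_send(msg_queue *q, msg m) {
    if (q->tail - q->head == MQ_CAP) return -1;
    q->buf[q->tail % MQ_CAP] = m;
    q->tail++;
    return 0;
}

/* Dequeue the oldest message; returns -1 when the queue is empty. */
int mq_recv(msg_queue *q, msg *out) {
    if (q->tail == q->head) return -1;
    *out = q->buf[q->head % MQ_CAP];
    q->head++;
    return 0;
}
```

A full/empty return code rather than blocking keeps the sketch single-threaded; a real NoC endpoint would add flow control between the sending and receiving control cores.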
Figure 3 is a schematic view of a layout of a cluster 101 according to the present invention.
The cluster 101 comprises a plurality of arithmetic cores 102, 104, 105. As shown in figure 3, a variety of different cores may exist, for example three Digital Signal Processors, DSPs, 102, seven Hardware Accelerators 104 and six processing General Purpose Processors, GPPs, 105. Each of these arithmetic cores 102, 104, 105 may comprise its own memory in the form of a cache memory. A cache memory is thus intended to be used specifically for one arithmetic core.
The cache memory may be a tightly coupled memory, i.e. memory which is specifically coupled to a single core.
In this example, the cluster 101 comprises one physical hardware memory chip 106, which is further divided in a private memory 107 arranged to be accessed by each of said plurality of arithmetic cores comprised in that single, corresponding cluster 101, and a shared memory 108 being part of a distributed shared memory, and arranged to be accessed by a plurality of arithmetic cores comprised by each of a plurality of other clusters. This thus implies a memory chip wherein the physically separate memories 108 can be addressed as one logically shared address space. Here, the term shared does not mean that there is a single centralized memory but shared essentially means that the address space is shared.
The cluster 101 further comprises an input-output, IO, interface 109 arranged for inter-connecting said plurality of clusters 101, also via the connecting lines 111, and a control core 110 which is arranged for locally scheduling of tasks over said arithmetic cores within a same cluster, for controlling communication between said plurality of clusters via said IO interface, and for controlling access to said private memory and said shared memory. The control core 110 is thus responsible for the scheduling part, i.e. scheduling tasks over the plurality of arithmetic cores 102, 104, 105, the communication between clusters, and for the memory 106 access.
The architecture 1, 50 described in the figures, also referred to as Multi Processor System on Chip, MPSoC, or Multi Core System on Chip, MCSoC, comprises a plurality of arithmetic cores 5, 6, 7 distributed over the many clusters 2. In these cores 5, 6, 7 the actual processing of the tasks of the application running on the Operating System, OS, is performed. In, for example, a Digital Signal Processing, DSP, application, the cores 5, 6, 7 each process part of the data of the DSP application.
In Figure 4 the Operating System, OS, 410 of the many-core system 400 is shown. The OS comprises two parts: the part 411 running on the client application core and the part 412 running on the individual subsystems, i.e. clusters.
Processing of tasks is under control of the OS. The OS of a computer system has the role, amongst others, of managing resources of the hardware platform. The resources are made available to the application or applications running on the OS. Examples of the resources that are managed by the OS include the instructions that are to be executed on the processor core(s), the Input/Output, IO devices, the memory (allocation), interrupts, etc.
The simplest form of an OS can only run one single program at a time, hence referred to as a single-task OS. A modern OS, however, allows the computer system to run multiple programs, or threads of a program, concurrently, hence called a multi-tasking OS. A multi-tasking OS may be achieved by running concurrent tasks in a time-sharing manner, and/or by dividing tasks over multiple execution units, e.g. in a dual, multi or many core computer system. By employing time-sharing, the available resources of the processor core are divided between multiple processes, which are interrupted repeatedly in time-slices by a task scheduler which forms part of the OS.
In an example, pre-emptive multitasking may be employed. With pre-emptive scheduling any process executing on a processor core can be interrupted and suspended by the scheduler in favour of a different process that is to be invoked and executed on the processor core. It is up to the scheduler to determine when the current process is suspended and which process is to be executed next. This can be employed according to different types of scheduling regimes.
Pre-emptive scheduling has some advantages compared to a cooperative multi-tasking OS, wherein each process must explicitly be programmed to define if and when it may be suspended. With pre-emptive scheduling all processes will get some amount of CPU time. This makes the computer system more reliable by guaranteeing each process a regular slice of processor core time. Moreover, if for example data is received from an IO device that requires immediate attention, processor core time is made available and timely processing of the data can be guaranteed.
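Time-slice based pre-emption as described above can be sketched as a per-task tick counter: each timer tick spends one unit of the running task's slice, and when the slice is used up the scheduler preempts it and refills the slice for its next turn. The constant and field names are illustrative assumptions.

```c
/* Length of one time slice, in timer ticks (hypothetical). */
#define SLICE_TICKS 4

typedef struct {
    int id;
    int slice_left;   /* ticks remaining in the current slice */
} rr_task;

/* Account one timer tick against the running task.
 * Returns 1 when the task must be preempted on this tick, 0 otherwise. */
int tick(rr_task *t) {
    if (--t->slice_left > 0) return 0;
    t->slice_left = SLICE_TICKS;  /* refill for the task's next turn */
    return 1;
}
```

A cooperative scheduler, by contrast, would have no such counter; the task itself would decide when to yield.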
The OS consists of several components that each have their own function and work together to execute applications on the OS. All applications need to make use of the OS in order to access any hardware in the computer system. The components of the OS operate at different levels. These can be represented in a model with the following parts, in order from high level to low level: user applications, shared libraries, device drivers, kernel, hardware. In such a model, the levels above the kernel are also said to be user-space, or to run in user-mode.
One of the most important parts of an OS is the kernel, which provides basic control over the hardware in the computer system. Memory, CPU, IO, peripherals, etc. are all under control of the kernel. The applications running on the OS can thus make use of these parts of the computer via the kernel.
Although hybrid forms also exist, in general two types of kernels can be recognised: monolithic kernels and microkernels. A monolithic kernel is a kernel in which most of the processes are handled in a supervisor-mode and form part of the kernel itself. Microkernels, μ-kernels, as shown in figure 5, are kernels in which most of the processes are handled in a user-mode, i.e. in user-space, and thus do not directly form part of the kernel. These processes can communicate with each other directly, without interference of the kernel.
Microkernels are relatively small in size since they only comprise the most fundamental parts of a kernel, such as the scheduler, memory management and Inter-Process Communication, IPC, 510, 520, 530. These parts are controlled from the supervisor-mode, i.e. the kernel-space, with high restrictions; all other parts are controlled from a user-mode, with lower restrictions. As indicated above, microkernels generally provide a multi-level security hierarchy with a supervisor/kernel-mode at one end and a user-mode at the other end.
An application or task running in supervisor-mode has the privileges necessary to obtain access to the resources of the computer system, i.e. the hardware. This access can be considered unsafe, since misuse or an error can result in system failure. On the other hand, an application or task running in user-mode can only access those parts of the OS that are considered safe, i.e. virtual memory, threads, processes, etc.
In a microkernel the access to these unsafe parts can be given to an application or task running in user-mode. Much of the functionality that in a monolithic kernel resides inside the kernel, is in a microkernel moved to the user-space, i.e. running in user-mode.
As indicated above, in the course of time, the need for more processing power shifted from an increase in clock speed of the (single core) processor towards parallel execution of instructions on plural cores, i.e. multiple or even many-core systems with high numbers of arithmetic cores. Heterogeneous systems on an architecture with a high number of cores, i.e. MPSoCs, have the potential to outperform single core or homogeneous many-core systems. It is a preferred architecture when there is a demand for high performance at low power. MPSoCs, however, have some constraints. Execution of a few concurrent applications on a computer system with a single processor core can be performed simply by a conventional multi-tasking OS. Execution of plural concurrent applications on a computer system with many processor cores is, at least in an efficient manner, challenging. This makes the usability of such systems non-optimal, as extensive knowledge is needed for effective use of such systems. Conventional applications may not perform to the optimal potential of an MPSoC. Applications should be able to cope with high amounts of parallelism and employ most of the cores of the system.
To that end conventional applications have to be programmed, converted or redesigned to allow the application to be divided into sub-tasks dedicated to be executed on those many cores concurrently.
It is proposed to divide applications into tasks instead of threads, wherein the tasks are, in particular, small enough to be executed in a short amount of time on one of the cores and wherein, in particular, these tasks can depend on any one or more of the trade-offs between performance, resources, latency, and energy budgets.
In principle, a task requiring high performance cannot be executed at low energy consumption. Vice versa, low energy consumption will have a negative impact on performance. The same reasoning applies to resources and latency. Low latency will most likely require a large resource footprint.
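The trade-offs described above can be illustrated with a minimal sketch in which each task carries explicit budget annotations. The `TaskBudget` fields, the threshold values, and the feasibility rule below are illustrative assumptions, not taken from the patent text.

```python
from dataclasses import dataclass

@dataclass
class TaskBudget:
    """Hypothetical per-task annotation capturing the trade-offs
    between performance, energy, latency and resources."""
    performance: int  # required throughput, arbitrary units
    energy: int       # allowed energy budget, arbitrary units
    latency_ms: int   # maximum tolerated latency
    resources: int    # resource footprint, e.g. number of cores

def is_feasible(b: TaskBudget) -> bool:
    # Crude stand-in for the trade-off: demanding high performance
    # under a very low energy budget is rejected as infeasible.
    return not (b.performance > 10 and b.energy < 5)

# A task demanding high performance at a tiny energy budget conflicts
# with the trade-off; a modest task does not.
fast_cheap = TaskBudget(performance=20, energy=2, latency_ms=1, resources=8)
modest = TaskBudget(performance=5, energy=2, latency_ms=50, resources=1)
```

The point of the sketch is only that budgets are attached to tasks as data, so a scheduler can reason about them before placement.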
For specific applications or specific parts of applications, i.e. tasks within the applications, it is important to have guarantees on performance. The kernel has to assure these performance guarantees.
To this end a many-core OS 410 is proposed, as illustrated in Figure 4, which consists of at least two parts, being the part 411 running on the client application core/system and the part 412 running over the multiple subsystems/clusters. The first can be a conventional OS with a conventional kernel comprised therein. Examples thereof are Linux, Unix or Windows based OSs. This could, however, also be a custom OS, designed from scratch or, more likely, Linux, Unix or Windows based.
In an example the application runs on the client application side 10, of Figure 1, of the system. Thus the applications can be programmed for any of these common OS types. The OS running in the client general purpose processor is arranged to cooperate with the distributed OS 412 running over the total of clusters. Thus, in this example, the path of execution can be considered to start at the client system side, via the (conventional) OS 411 running thereon, towards a distributed OS 412 at a lower hierarchy level. Thus the assigning of the tasks towards the clusters is handled by the general purpose processor of the client system, and hence by the kernel of the OS running thereon.
Once the tasks are received by the subsystem, the individual general purpose processors 421, 422, 423, 424 thereof are arranged to handle the individual tasks within the cluster.
In order for the kernel of the OS on the client system to determine which tasks are to be assigned to which cluster, the kernel requires additional information on which to base that decision. To that end it is proposed to assign such information to the tasks, such that the kernel 415 can assign each task in an effective manner and not merely distribute the tasks evenly over the clusters. Due to the heterogeneous configuration of the system, the clusters can have different components. Some clusters are, for example, arranged for more DSP processing, while others are better arranged for other hardware acceleration tasks.
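The capability-aware assignment described above can be sketched as a simple lookup in which each cluster advertises what it is arranged for and the client-side kernel picks a matching cluster per task. The cluster names and capability tags are invented for illustration.

```python
# Hypothetical capability table: each heterogeneous cluster advertises
# the kinds of tasks it is arranged for.
clusters = {
    "cluster_dsp": {"dsp", "fft"},
    "cluster_accel": {"crypto", "video"},
    "cluster_generic": {"generic"},
}

def assign(task_requirement: str) -> str:
    """Return the first cluster advertising the required capability,
    falling back to a generic cluster when nothing matches."""
    for name, caps in clusters.items():
        if task_requirement in caps:
            return name
    return "cluster_generic"
```

A real kernel would of course also weigh load and budgets, but the essential point is that the extra per-task information steers placement instead of an even round-robin spread.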
In known multi-core systems, applications are divided into conventional low-level threads that are executed on an individual processor core on a time-scheduling basis, e.g. by pre-emptive scheduling. In the architecture and OS according to the invention a different programming model is used. Although from a programmer's perspective a homogeneous multi- or many-core system with shared memory is the most convenient, it is ineffective on a many-core scale due to the scalability issues at these amounts of cores. Examples of these scalability issues are increased energy consumption and fault tolerances. Therefore a more abstract model of programming is proposed that is arranged for execution of activities, i.e. operations instead of conventional instructions on a thread basis.
Activities are means to specify for which cores the application implemented its functions, what kernels and configurations are available for hardware accelerators, etc. As such the application is not divided into instructions executed as threads but decomposed into parts, i.e. tasks, assigned with additional scheduling information for handling by the scheduler of the kernel on the basis of, for example, responsibility, input-and-output, resource budget, performance budget, etc. Thus an application can, for example, roughly consist of Graphical User Interface, GUI, instructions and data processing instructions as well as data input-output instructions. In accordance with the invention it is proposed to decompose the application into individual tasks that are arranged to be executed within a cluster of the many-core system. These tasks have additional scheduling information on the basis of which the scheduler of the kernel 415 running in the general purpose processor of the client application system can decide to which subsystem/cluster each task is to be assigned. Communication thereof is arranged via network-on-chip communication units 441, 442, 443, 444 and the use of the shared address space within the distributed shared memory of the memory units of the clusters 431, 432, 433, 434. For example, GUI instructions have other requirements, like responsiveness, than I/O instructions, which rely more on resource usage. Thus a GUI task comprised of GUI-related instructions is more likely to be assigned a responsiveness profile and scheduled on a subsystem/cluster that comprises an architecture that is arranged to that end.
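The decomposition idea can be made concrete with a small sketch: an application is represented as a list of tasks, each tagged with a scheduling profile on which the kernel's scheduler can act. The task names and profile labels below are assumptions made for illustration only.

```python
# Illustrative decomposition of one application into profiled tasks.
application = [
    {"task": "render_gui", "profile": "responsiveness"},
    {"task": "decode_stream", "profile": "performance"},
    {"task": "write_log", "profile": "resource"},
]

def schedule_order(tasks):
    """Dispatch responsiveness-profiled tasks first, mirroring the idea
    that GUI tasks go to clusters arranged for low latency."""
    rank = {"responsiveness": 0, "performance": 1, "resource": 2}
    return sorted(tasks, key=lambda t: rank[t["profile"]])
```

In the architecture described here the profile would not merely order tasks but select a matching subsystem/cluster; the sketch only shows that the profile travels with the task as scheduling metadata.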
Once the task is assigned to any of the subsystems, the local microkernel 421, 422, 423, 424 of the distributed OS 412 is arranged to locally schedule the tasks over the individual arithmetic cores, such as the DSPs and HW accelerators.
On a cluster level, the general purpose processor of the cluster can determine an over- or under-capacity within the cluster. This depends on the architecture of the cluster, e.g. the amount of cores, types of cores, etc., as well as the amount of tasks assigned to the individual cores. If the micro-kernel determines a resource shortage within the cluster, it can issue a resource request to other (neighbouring) clusters to determine if tasks can be handed over to one of these other clusters. In this way the overall resource capacity of the system is used more efficiently. The other way round works as well. The micro-kernel can also determine inefficient use of local resources, for example due to a low instruction queue. In that case the micro-kernel can also signal its free capacity to other clusters, either directly via inter-cluster communication over the network-on-chips 441, 442, 443, 444 or via the higher-level coordinating scheduler of the kernel 415 in the client system.
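The hand-over of tasks from an overloaded cluster to one with signalled free capacity can be sketched as follows. The queue contents, capacity figures and the greedy rebalancing rule are illustrative assumptions; a real micro-kernel would negotiate over the network-on-chip.

```python
def rebalance(queues: dict, capacity: dict) -> dict:
    """Move tasks from clusters whose queue exceeds their capacity to
    clusters that still have free capacity (greedy sketch)."""
    queues = {k: list(v) for k, v in queues.items()}  # work on a copy
    overloaded = [c for c in queues if len(queues[c]) > capacity[c]]
    idle = [c for c in queues if len(queues[c]) < capacity[c]]
    for src in overloaded:
        for dst in idle:
            # Hand tasks over until the source fits or the target is full.
            while len(queues[src]) > capacity[src] and len(queues[dst]) < capacity[dst]:
                queues[dst].append(queues[src].pop())
    return queues

# Cluster c0 is over capacity, c1 has signalled free capacity.
before = {"c0": ["t1", "t2", "t3", "t4"], "c1": ["t5"]}
after = rebalance(before, {"c0": 2, "c1": 3})
```

No tasks are lost in the hand-over; they merely migrate, which is the efficiency gain the text describes.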
The use of the task-based programming model and the decomposition of the application into parts, i.e. tasks that can eventually be assigned to cores of the cluster, is illustrated in Figure 6. In Figure 6 an application is, according to an aspect of the invention, modelled as a plurality of tasks. The application is thus defined by multiple tasks, which may or may not rely on high amounts of communication. The communication between the tasks is known as the channels. The channels between these tasks thus have to guarantee a sufficient amount of bandwidth at low latency. As such, not all schedulers function optimally by utilizing the resources in an efficient manner. It is thus proposed to use a distributed many-core OS that comprises plural microkernels that control tasks in kernel mode such that they can be executed on the different cores 5, 6, 7, 102, 104, 105 of the cluster. The microkernel comprises a scheduler selected to efficiently employ the execution of the tasks on the individual cores. Such a scheduler is arranged for any one or more of the group first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first, or fixed priority pre-emptive scheduling of tasks over the arithmetic cores within said same cluster.
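As one example of the listed policies, an earliest-deadline-first (EDF) ordering within a cluster can be sketched in a few lines. The task tuples and deadline values are invented for illustration; a microkernel implementation would of course operate on its own task structures.

```python
import heapq

def edf_order(tasks):
    """Return task names in earliest-deadline-first order.
    Each task is a (deadline, name) tuple; a heap keyed on the
    deadline yields the EDF dispatch order."""
    heap = list(tasks)  # copy so the caller's list is untouched
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

tasks = [(30, "io"), (10, "dsp"), (20, "gui")]
```

The same skeleton accommodates the other listed policies by changing the ordering key, e.g. remaining execution time for shortest-remaining-time scheduling.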
Each of the GPPs runs a microkernel and each microkernel is arranged for controlling IPC, scheduling of the tasks, and memory management, e.g. private memory and/or shared memory. The microkernel running in the GPP of the cluster thus controls the local resources of the cluster and runs in kernel-mode.
In accordance with all examples of the present invention, a many-core architecture is also to be understood as a many-core processor architecture.
The present invention is not limited to the embodiments as disclosed above, and can be modified and enhanced by those skilled in the art within the scope of the present invention as disclosed in the appended claims, without having to apply inventive skills.

Claims (12)

1. A many-core processor architecture provided with a distributed shared memory, the architecture comprising a plurality of clusters, each of the plurality of clusters comprising: a plurality of arithmetic cores; private memory arranged to be addressed by each of the plurality of arithmetic cores comprised in a single, corresponding cluster; a shared memory that is part of the distributed shared memory and is arranged to be addressed by the plurality of arithmetic cores comprised in each of the plurality of clusters; an input-output, I/O, interface suitable for interconnecting the plurality of clusters; and a control core arranged for locally scheduling tasks over the arithmetic cores within one and the same cluster, wherein the many-core processor architecture comprises a distributed many-core operating system arranged for distributing tasks over the clusters, and wherein the distributed many-core operating system comprises a plurality of cooperating microkernels distributed over each of the clusters and arranged to be executed in a kernel space of the control core of each of the clusters, for locally scheduling the tasks over the arithmetic cores within one and the same cluster.

2. The many-core processor architecture according to claim 1, wherein the plurality of cooperating microkernels are arranged for memory management of shared memory within each of the cores and for memory management of the shared memory for mutual communication between the clusters.

3. The many-core processor architecture according to any of the preceding claims, wherein the plurality of cooperating microkernels are arranged for I/O management of the shared memory within each of the clusters for interconnection between the clusters.

4. The many-core processor architecture according to any of the preceding claims, wherein a scheduler of the plurality of cooperating microkernels is arranged for any one or more of the group first in first out, earliest deadline first, shortest remaining time, round-robin, multilevel queue, shortest job first, or fixed priority pre-emptive scheduling of tasks over the arithmetic cores within one and the same cluster.

5. The many-core processor architecture according to any of the preceding claims, wherein the many-core processor architecture comprises a plurality of heterogeneous clusters.

6. The many-core processor architecture according to any of the preceding claims, the plurality of arithmetic cores comprising one of a digital signal processor, DSP, a hardware accelerator core, and a Processing General Purpose Processor, GPP.

7. The many-core processor architecture according to any of the preceding claims, wherein the I/O interface comprises a network-on-chip, NoC, interface for communication between the plurality of clusters.

8. The many-core processor architecture according to any of the preceding claims, wherein the control core is a general purpose processor.

9. A computer system comprising one or more many-core processor architectures according to any of the preceding claims.

10. The computer system according to claim 9, wherein the system comprises multiple many-core processor architectures according to any of the preceding claims 1-8, wherein each of the multiple many-core processor architectures comprises an I/O interface for interconnecting the different many-core processor architectures, and wherein the operating system is arranged for distributing tasks over the different many-core processor architectures.

11. A method of programming an application for a computer system according to claim 9 or 10, wherein the method comprises the step of defining the plurality of tasks of the application for distributing the tasks over the plurality of many-core processor architectures.

12. The method according to claim 11, wherein each of the tasks can be assigned one or more of the group comprising a responsibility, input-output, low resource use, and low performance use.
NL2014534A 2015-03-27 2015-03-27 Many-core operating system. NL2014534B1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
NL2014534A NL2014534B1 (en) 2015-03-27 2015-03-27 Many-core operating system.
PCT/NL2016/050214 WO2016159765A1 (en) 2015-03-27 2016-03-29 Many-core processor architecture and many-core operating system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
NL2014534A NL2014534B1 (en) 2015-03-27 2015-03-27 Many-core operating system.

Publications (2)

Publication Number Publication Date
NL2014534A true NL2014534A (en) 2016-10-10
NL2014534B1 NL2014534B1 (en) 2017-01-11

Family

ID=53610939

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2014534A NL2014534B1 (en) 2015-03-27 2015-03-27 Many-core operating system.

Country Status (1)

Country Link
NL (1) NL2014534B1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020010844A1 (en) * 1998-06-10 2002-01-24 Karen L. Noel Method and apparatus for dynamically sharing memory in a multiprocessor system


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Grid and cooperative computing - GCC 2004 : third international conference, Wuhan, China, October 21 - 24, 2004IN: Lecture notes in computer science , ISSN 0302-9743 ; Vol. 3251", vol. 8374, 1 June 2013, SPRINGER VERLAG, DE, ISBN: 978-3-642-24711-8, ISSN: 0302-9743, article AARON LANDWEHR ET AL: "Toward a Self-aware System for Exascale Architectures", pages: 812 - 822, XP055236828, 032548, DOI: 10.1007/978-3-642-54420-0_79 *
ANTONIO BARBALACE ET AL: "Popcorn: a replicated-kernel OS based on Linux", THE 2014 OTTAWA LINUX SYMPOSIUM (OLS '14), 1 July 2014 (2014-07-01), New York, New York, USA, pages 123 - 137, XP055236751, ISBN: 978-1-4503-3238-5, DOI: 10.1145/2741948.2741962 *
BENJAMIN H SHELTON: "Popcorn Linux: enabling efficient inter-core communication in a Linux-based multikernel operating system", 2 May 2013 (2013-05-02), pages 1 - 95, XP055236806, Retrieved from the Internet <URL:https://vtechworks.lib.vt.edu/bitstream/handle/10919/23119/Shelton_BH_T_2013.pdf?sequence=1&isAllowed=y> [retrieved on 20151216] *
PAUL DUBRULLE ET AL: "A Dedicated Micro-Kernel to Combine Real-Time and Stream Applications on Embedded Manycores", PROCEDIA COMPUTER SCIENCE, vol. 18, 1 January 2013 (2013-01-01), AMSTERDAM, NL, pages 1634 - 1643, XP055236836, ISSN: 1877-0509, DOI: 10.1016/j.procs.2013.05.331 *

Also Published As

Publication number Publication date
NL2014534B1 (en) 2017-01-11


Legal Events

Date Code Title Description
PD Change of ownership

Owner name: QUARANTAINENET BV; NL

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT; FORMER OWNER NAME: RECORE SYSTEMS B.V.

Effective date: 20180508

PD Change of ownership

Owner name: TESORION NEDERLAND B.V.; NL

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT; FORMER OWNER NAME: RECORE SYSTEMS B.V.

Effective date: 20190904

MM Lapsed because of non-payment of the annual fee

Effective date: 20220401