US20050278720A1 - Distribution of operating system functions for increased data processing performance in a multi-processor architecture - Google Patents
- Publication number
- US20050278720A1 (application US11/091,362)
- Authority
- US
- United States
- Prior art keywords
- operating system
- set forth
- data processor
- subordinate
- processors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30123—Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
Abstract
An accelerated operating system can increase the data processing throughput of a data processor executing an application according to a sequential programming model. An application running on a main data processor is interfaced to an operating system which has been accelerated by distributing at least some of the operating system among a plurality of subordinate data processors which provide data processing support for the application running on the main data processor. The subordinate data processors can thus also provide operating system support for the application running on the main data processor. This decreases the processing burden on the main data processor, thereby increasing the main data processor's data processing throughput while executing the application.
Description
- The present invention claims priority under 35 U.S.C. § 119(e) to the following co-pending U.S. Provisional Applications:
- 1) U.S. Provisional Patent Application Ser. No. 60/575,589, entitled “DISTRIBUTION OF OPERATING SYSTEM FUNCTIONS IN THE ORION HIGH CAPACITY I/O PROCESSOR,” filed on May 27, 2004; and
- 2) U.S. Provisional Patent Application Ser. No. 60/575,590, entitled “HIGH PERFORMANCE ASYMMETRIC MULTI-PROCESSOR WITH SEQUENTIAL PROGRAMMING MODEL,” filed on May 27, 2004.
- The subject matter disclosed in each of Patent Application Ser. Nos. 60/575,589 and 60/575,590 is hereby incorporated by reference into the present disclosure as if fully set forth herein.
- The invention relates generally to data processing and, more particularly, to increasing data processing throughput in a data processing architecture.
- Various approaches have been used to increase data processing throughput in data processing architectures. These approaches are briefly described below.
- Vector computers and hardware use a pipelining approach, in which data is presented sequentially to a line of processing entities called stages; each stage performs a single operation, or a small set of operations, on each piece of data as it passes through. Data is fully processed only after it has passed through all stages. Parallel processing thus occurs because each stage of the pipeline operates simultaneously on a different piece of data. For example, Stage N operates on data i, Stage N-1 operates on data i+1, Stage N-2 operates on data i+2, and so on down to Stage 0. This approach is well suited to processing arrays.
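- To make the stage-by-stage scheduling concrete, the following C sketch (not drawn from the patent; the three stage functions and eight-element array are illustrative assumptions) simulates such a pipeline, in which on each step every stage operates on a different element:

```c
/* Simulates the staged pipeline described above: three illustrative
 * stages, each applying one small operation; on each "clock" t, stage
 * s works on element t - s, so all stages are busy at once. */
#include <stdio.h>

#define N_STAGES 3
#define N_DATA   8

static int stage0(int x) { return x + 1; }  /* e.g. normalize */
static int stage1(int x) { return x * 2; }  /* e.g. scale     */
static int stage2(int x) { return x - 3; }  /* e.g. adjust    */

int main(void) {
    int data[N_DATA] = {0, 1, 2, 3, 4, 5, 6, 7};
    int (*stages[N_STAGES])(int) = {stage0, stage1, stage2};

    for (int t = 0; t < N_DATA + N_STAGES - 1; t++)  /* pipeline clock */
        for (int s = 0; s < N_STAGES; s++) {
            int i = t - s;                 /* element at stage s now */
            if (i >= 0 && i < N_DATA)
                data[i] = stages[s](data[i]);
        }

    for (int i = 0; i < N_DATA; i++)
        printf("%d ", data[i]);            /* fully processed: 2x - 1 */
    printf("\n");
    return 0;
}
```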
- Superscalar computers, such as the Pentium Pro and the PowerPC, support a combination of vector and software pipelining, which amount to handling more than one instruction at a time by preloading the next instruction while operating on the current one and by using a fine grain scheduling technique along with instruction ordering techniques to parallelize loops such that one iteration of a loop can begin execution before the previous one completes.
- Automatic parallelism is typically achieved in compilers by parallelizing array operations: either by parallelizing loops, in a manner similar to software pipelining, or by parallelizing intrinsically parallel array operations.
- It is simple to allow parallel processors to run different programs. Also, some classes of problems, such as code breaking programs and graphics rendering, can be easily separated to explore different portions of the solution space by allocating portions to each of the parallel processors.
- Some parallel computing approaches use a distributed cluster of machines that are autonomous computing nodes, each with its own memory, operating system, and disk space. These approaches also use message passing to coordinate the processing. An example of this is the Search for Extra-Terrestrial Intelligence (SETI) At Home project, where screen savers on home computers receive computational tasks over the internet and e-mail any interesting findings back to the main facility. There are two approaches to distributed cluster processing—each processor executing the same program on different sets of data and different processors executing different programs on different sets of data.
- Shared Memory Processors or Symmetric Multi-Processors (SMP) refer to single computers with multiple processors. The processors may run different programs or a single program can run in parallel across the processors. Parallelizing a single program requires special processing techniques, including message passing between the processors or use of shared memory to coordinate activities. Synchronization techniques are required to coordinate the asynchronous processors to maintain the logical dependence and required order of precedence.
- Another approach to parallel computing is the Single Instruction Multiple Data (SIMD) approach where one instruction unit runs a large array of processors in lock-step. The instruction unit controls the entire array by fetching each instruction and commanding each processor in the array to carry it out, each with its own data. This approach is implemented in supercomputers such as the Control Data Systems CDC 6600 and 7600 and in Cray supercomputers via a technique called vector processing.
- Another multi-processor architecture, known as the Asymmetric Multi-Processor Operating System (“AsyMOS”), has been developed to exploit cost/performance advantages that accrue to small-scale SMPs due to shared memory and packaging. In this architecture, the multiple processors are partitioned into three (3) types: applications processors, disk processors, and device processors, based on the end functionality that they will perform. By virtue of this partition, specific network and disk processing can be off-loaded from the applications processors that are running the main applications.
- The aforementioned vector processing and SIMD approaches are only applicable to certain kinds of problems, such as array processing. They work best if the problem is formulated in an appropriate manner, for example, by using vector algebra constructs. The SMP approach works best if different programs are run on different processors. If an application is not readily separated into different parts, then parallel processing may require special structuring of the problem, or the application may need to be written specifically for the parallel processing environment. This can entail the use of specific parallel processing code with message passing to coordinate the activities of the various processors as they operate in parallel.
- It is desirable in view of the foregoing to provide for increasing the data processing throughput of a data processing architecture without requiring parallel processing techniques and their attendant difficulties.
- The present invention provides a data processing architecture with an accelerated operating system to increase the data processing throughput of an application executed according to a sequential programming model. An application running on a main data processor is interfaced to the operating system. By distributing at least some of the operating system among a plurality of subordinate data processors which provide data processing support for the application running on the main data processor, exemplary embodiments of the present invention permit the subordinate data processors to also provide operating system support for the application running on the main data processor. This off-loading of operating system functionality from the main data processor increases the data processing throughput of the main data processor.
- To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide an improved data processing apparatus. According to an advantageous embodiment of the present invention, the data processing apparatus comprises: 1) a main data processor capable of running an application; 2) a plurality of subordinate data processors which provide data processing support for the application running on the main data processor; 3) a plurality of communication paths which respectively couple the subordinate data processors to the main data processor; and 4) an operating system, and an application interface which interfaces the application to the operating system, the application interface being provided on the main data processor. At least some of the operating system is distributed among the subordinate data processors such that the subordinate data processors also provide operating system support for the application running on the main data processor.
- According to one embodiment of the present invention, the main data processor runs the application according to a sequential programming model.
- According to another embodiment of the present invention, the subordinate data processors provide the data processing support for the application running on the main data processor by inputting and outputting data from and to a site located physically separately from the data processing apparatus.
- According to still another embodiment of the present invention, the at least some of the operating system includes an operating system function that is accessed relatively frequently by the application running on the main data processor.
- According to yet another embodiment of the present invention, the operating system function includes one of an IP stack, a dispatcher, a scheduler, and a virtual memory paging function.
- According to a further embodiment of the present invention, the operating system is a Linux operating system.
- According to a still further embodiment of the present invention, the subordinate data processors execute program instructions to provide the data processing support.
- According to a yet further embodiment of the present invention, the application interface renders the distribution of the at least some of the operating system transparent to the application running on the main data processor.
- Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system, or part thereof that controls at least one operation; such a device may be implemented in hardware, firmware, or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most, instances such definitions apply to prior as well as future uses of such defined words and phrases.
- For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
- FIG. 1 illustrates a data processing architecture according to an exemplary embodiment of the invention;
- FIG. 2 illustrates the subordinate processors of FIG. 1 according to an exemplary embodiment of the invention;
- FIG. 3 illustrates a detailed example of the data processing architecture of FIG. 1 according to an exemplary embodiment of the invention;
- FIG. 4 illustrates the distributed operating system of FIG. 1 in more detail according to an exemplary embodiment of the invention; and
- FIG. 5 illustrates an expanded data processing architecture according to an exemplary embodiment of the invention.
- FIGS. 1 through 5, discussed below, and the various embodiments used to describe the principles of the present invention in this patent document are by way of illustration only and should not be construed in any way to limit the scope of the invention. Those skilled in the art will understand that the principles of the present invention may be implemented in any suitably arranged data processing architecture.
- FIG. 1 illustrates a data processing architecture according to an exemplary embodiment of the present invention. The data processing architecture 100 includes a main core processor 110, a plurality of subordinate processors 130, and memory 120. The main core processor 110 is coupled by bus structure 140 for communication with the subordinate processors 130 and with memory 120, and the subordinate processors 130 are also coupled by bus structure 140 for communication with memory 120. The subordinate processors 130 are cooperable with the main core processor 110 and memory 120 to provide the data processing architecture 100 with channelized I/O, as indicated by the multiple I/O channels illustrated generally at 135. The architecture 100 also includes a distributed operating system 150, which is described in more detail below.
- FIG. 2 is a simplified diagram illustrating the subordinate processors of FIG. 1 according to an exemplary embodiment of the invention. The subordinate processor 200 of FIG. 2 includes registers 210, a program control unit 220, an instruction execution unit 240, and a memory interface 250. The registers 210, program control unit 220, and memory interface 250 are connected to the bus structure 140 for communication with one another, and also with the main core processor 110 and memory 120 (see also FIG. 1). The program control unit 220 appropriately loads instructions and data from memory 120 into the registers 210.
- In an exemplary embodiment, a plurality of sets of registers at 210 may be used in order to implement a corresponding plurality of execution threads. In such a multiple-thread embodiment, a multiplexer 230 is connected between the registers 210 and the instruction execution unit 240, and the program control unit 220 controls the multiplexer 230 such that the registers associated with the desired thread are connected to the instruction execution unit 240. In an alternate embodiment, only a single register set and a corresponding single execution thread may be implemented. In such an embodiment, the single register set can be connected directly to the instruction execution unit 240, as indicated generally by broken line in FIG. 2.
- Under control of the program control unit 220, the instruction execution unit 240 executes the instructions that it receives. Under control of the instruction execution unit 240, the memory interface 250 reads data from memory 120 via bus structure 140 and outputs the data on I/O channel 260. Also under control of the instruction execution unit 240, the memory interface 250 receives data from the I/O channel 260 and writes the received data into memory 120 via bus structure 140. Each of the subordinate processors illustrated at 130 in FIG. 1 implements an I/O channel, such as shown at 260 in FIG. 2, thereby providing the multiple-channel, or channelized, I/O 135 in FIG. 1.
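- The multiple-register-set arrangement just described can be pictured with the short C sketch below. The structure layout, the four-thread count, and the round-robin selection standing in for the program control unit 220 and multiplexer 230 are assumptions made for illustration only, not details taken from the patent:

```c
/* Models the subordinate processor of FIG. 2: one register set per
 * execution thread, with a round-robin selection standing in for the
 * program control unit (220) driving the multiplexer (230). */
#include <stdint.h>
#include <stdio.h>

#define N_THREADS 4
#define N_REGS    16

typedef struct {
    uint32_t regs[N_REGS];   /* one register bank (210) per thread */
    uint32_t pc;             /* per-thread program counter         */
} reg_set_t;

typedef struct {
    reg_set_t banks[N_THREADS];
    int       active;        /* thread currently selected          */
} subordinate_cpu_t;

/* Program control unit: pick the next thread's register set. */
static reg_set_t *select_thread(subordinate_cpu_t *cpu) {
    cpu->active = (cpu->active + 1) % N_THREADS;   /* round-robin */
    return &cpu->banks[cpu->active];
}

/* Instruction execution unit (240): runs against whichever register
 * set the multiplexer currently presents. */
static void execute_one(reg_set_t *rs) {
    rs->regs[0] += 1;        /* stand-in for one real instruction */
    rs->pc += 4;
}

int main(void) {
    subordinate_cpu_t cpu = {0};
    for (int cycle = 0; cycle < 8; cycle++)
        execute_one(select_thread(&cpu));
    for (int t = 0; t < N_THREADS; t++)
        printf("thread %d: r0=%u pc=%u\n", t,
               cpu.banks[t].regs[0], cpu.banks[t].pc);
    return 0;
}
```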
- FIG. 3 illustrates a detailed example of the data processing architecture of FIGS. 1 and 2 according to an exemplary embodiment of the invention. In the example of FIG. 3, the data processing architecture is utilized as a gateway or file server in a storage area network (SAN) 300. The main core processor 110 is implemented as an X-SCALE processor in FIG. 3, and the subordinate processors 130 are implemented as microengines 130a, 130b, etc. In an exemplary embodiment of the present invention, the X-SCALE processor and microengines are provided in a conventional network processor integrated circuit, such as the IXP2800 chips commercially available from Intel Corporation. A single-chip network processor is indicated generally at 330 in FIG. 3.
- In FIG. 3, the memory 120 of FIG. 1 includes RDRAM 310, QDRAM 320, and scratchpad memory 321. In an exemplary embodiment of the present invention, the scratchpad memory 321 is provided on-chip with the X-SCALE processor and the microengines.
- The data processing architecture 100 is interfaced to a data network 350 and storage arrays 360 and 370 via an ASIC 340 (or an FPGA), Ethernet interfaces 341 and 342, SCSI interfaces 343a and 343b, and Fibre Channel (FC) interface 344. The interfaces at 341-344 are well known in the art. The ASIC 340 is designed to interface between the channelized I/O 135 of the data processing architecture 100 and the various interfaces 341-344. For example, in an embodiment which utilizes the IXP2800, the channelized I/O 135 is provided on the SPI-4 Phase 2 (SPI-4.2) I/O bus of the IXP2800. The ASIC 340 would thus interface to the SPI-4.2 bus and fan out the channelized I/O to the various external interfaces at 341-344.
- The QDRAM 320 is used primarily to provide temporary storage of data that is being transferred either to the channelized I/O 135 from the RDRAM 310, or from the channelized I/O 135 to the RDRAM 310. A work list is also maintained in the RDRAM 310. The X-SCALE processor 110 can write commands into this work list, and the microengines 130a, 130b, etc. can access the commands and execute the functions specified by the commands. One embodiment of the present invention may utilize 1-2 megabytes of QDRAM and two (2) gigabytes of RDRAM. In an exemplary embodiment of the present invention, the QDRAM and RDRAM are both provided on a single printed circuit board, together with the single-chip network processor 330.
- In an exemplary embodiment of the invention, the main core processor 110 stores commands in the work list of the RDRAM 310. For example, the main core processor could store a plurality of commands which respectively correspond to a plurality of desired storage disk accesses. The commands can indicate, for example, what instructions to execute, where data is (or will be) stored in memory, etc. The subordinate processors, acting independently as they become free to support the main core processor, can retrieve commands from the work list and make disk storage accesses in parallel by using SCSI interfaces 343a and 343b.
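- One plausible shape for the work list just described is sketched below in C: the main core processor posts commands into a ring in shared memory, and subordinate processors independently claim them as they become free. The command format, ring size, and C11 atomics are illustrative assumptions; an actual IXP2800 design would use its hardware queues and atomic operations rather than this portable sketch:

```c
/* Sketch of a shared work list: the main core (single producer) posts
 * commands; microengines (multiple consumers) claim them with a CAS. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

enum cmd_op { CMD_DISK_READ, CMD_DISK_WRITE };

typedef struct {
    enum cmd_op op;
    uint64_t    disk_lba;     /* where on disk                */
    uint32_t    rdram_addr;   /* buffer location in RDRAM 310 */
    uint32_t    length;       /* bytes to transfer            */
} command_t;

#define RING_SIZE 64
static command_t   work_list[RING_SIZE];  /* lives in RDRAM */
static atomic_uint head;                  /* next slot to fill  */
static atomic_uint tail;                  /* next slot to claim */

/* Main core processor side: post one command (single producer). */
static int post_command(const command_t *cmd) {
    unsigned h = atomic_load(&head);
    if (h - atomic_load(&tail) >= RING_SIZE)
        return -1;                        /* work list full */
    work_list[h % RING_SIZE] = *cmd;
    atomic_store(&head, h + 1);           /* publish the slot */
    return 0;
}

/* Subordinate processor side: claim the next unserviced command.
 * The CAS lets several microengines drain the list in parallel. */
static int claim_command(command_t *out) {
    unsigned t = atomic_load(&tail);
    do {
        if (t >= atomic_load(&head))
            return -1;                    /* nothing pending */
    } while (!atomic_compare_exchange_weak(&tail, &t, t + 1));
    *out = work_list[t % RING_SIZE];
    return 0;
}

int main(void) {
    command_t rd = { CMD_DISK_READ, 4096, 0x1000, 512 }, got;
    post_command(&rd);
    if (claim_command(&got) == 0)
        printf("servicing op=%d lba=%llu len=%u\n",
               got.op, (unsigned long long)got.disk_lba, got.length);
    return 0;
}
```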
- For a write to disk storage, the subordinate processor transfers data from the RDRAM 310 out to the disk storage unit (e.g., 360). For a read from disk storage, the subordinate processor transfers data received from the disk storage unit into the RDRAM 310. These data transfers can be accomplished by the memory interface 250 of FIG. 2, under control of the instruction execution unit 240 of FIG. 2. This distribution of instruction execution to support I/O processing avoids the bottlenecks that may occur in mainframe or supercomputer architectures, wherein all instructions that control channelized I/O are executed in a single central processing unit, rather than in the I/O channels themselves.
- Similar bottlenecks can of course also occur in conventional PC and other desktop architectures, where all I/O and data processing functionality is controlled by instruction execution performed in the central processing unit. In an exemplary embodiment of the present invention, the main core processor 110 can utilize the bus structure 140 to provide commands directly to the various subordinate processors.
- FIG. 4 illustrates the distributed operating system of FIG. 1 in more detail according to an exemplary embodiment of the invention. As shown in FIG. 4, the main core processor 110 runs applications 410, for example file server applications. These applications are supported by an operating system that is distributed into and among the subordinate processors 130. In particular, the main core processor provides an application interface 420, and may also provide some local operating system functionality 430. However, the remainder of the operating system functionality is distributed among the subordinate processors 130. This distribution of operating system functionality among the subordinate processors 130 is indicated generally by the remote operating system functions 450 in FIG. 4.
- In an exemplary embodiment of the present invention, the operating system is the well-known Linux operating system, and the IP stack functionality of the Linux operating system is distributed into the subordinate processors 130 as a remote operating system function. The IP stack functionality uses a well-defined socket interface that can be easily relocated from the main processor into the subordinate processors 130. As another example, the Linux scheduler functionality is relatively easy to move because it is triggered by a timer and every system call returns through the scheduler.
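- The patent states only that the well-defined socket interface makes the IP stack easy to relocate. As one hypothetical illustration of what such relocation could look like, the C sketch below keeps a send()-shaped call on the main processor while marshalling it into a message for the subordinate processor running the stack; every name and the message layout are invented for the sketch:

```c
/* Hypothetical socket shim: the call keeps its familiar shape on the
 * main processor but is marshalled to the subordinate processor that
 * now hosts the IP stack. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum sock_call { SC_SOCKET, SC_CONNECT, SC_SEND, SC_RECV, SC_CLOSE };

typedef struct {
    enum sock_call call;
    int32_t        fd;
    uint32_t       len;
    uint8_t        payload[256];
} sock_msg_t;

/* Stand-in for the path to the subordinate processor (in the patent's
 * architecture this could be the shared-memory work list). */
static int32_t send_to_ip_engine(const sock_msg_t *msg) {
    printf("forwarding socket call %d to subordinate processor\n",
           (int)msg->call);
    return 0;   /* pretend the remote stack replied with success */
}

/* Application-facing shim with a send(2)-like shape. */
static int32_t ipoff_send(int32_t fd, const void *buf, uint32_t len) {
    sock_msg_t msg = { .call = SC_SEND, .fd = fd };
    if (len > sizeof msg.payload)
        len = (uint32_t)sizeof msg.payload;  /* sketch: no fragmentation */
    msg.len = len;
    memcpy(msg.payload, buf, len);
    return send_to_ip_engine(&msg);
}

int main(void) {
    const char data[] = "hello";
    return (int)ipoff_send(3, data, sizeof data);  /* fd 3 is illustrative */
}
```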
- In an exemplary embodiment of the present invention, the applications interface 420 makes the distribution of the operating system into the subordinate processors completely transparent to the applications 410. Accordingly, the applications 410 can run without modification on the main core processor 110 in the same manner as if the entire operating system were implemented on the main core processor 110.
- If the distributed operating system is used to handle I/O requests from the main core processor 110, then the entire I/O process is rendered transparent to the application running on the main processor. More particularly, the application at 410 sees only the application interface 420, and the fact that the subordinate processors 130 handle the I/O operation is transparent to the application running on the main core processor 110. A typical disk storage read operation produces many interrupts before it is completed.
- However, by distributing into the subordinate processors the operating system functionality associated with disk storage accesses, the many interrupts are seen only by the subordinate processors and are invisible to the application running on the main core processor. As far as the application running on the main core processor is concerned, the application simply provides a disk storage read request to the applications interface 420, and this request results in a single interrupt, namely, an interrupt from the operating system indicating that the desired file is ready in RDRAM 310.
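- The application's view of such a distributed-OS disk read can be pictured with the C sketch below: one request through the application interface, one completion event back, with all device-level interrupts confined to the subordinate processors. The names and the polling-style completion are illustrative assumptions:

```c
/* The application's view: one read request, one completion event.
 * appif_* names and the polling loop are invented for this sketch;
 * the per-sector device interrupts never reach this code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    volatile bool done;        /* set once by the distributed OS     */
    uint32_t      rdram_addr;  /* where the file landed in RDRAM 310 */
} read_completion_t;

/* Application interface 420: hand the request to the distributed OS.
 * In this sketch the subordinate processors are assumed to finish
 * immediately, so completion is marked before the call returns. */
static void appif_read_file(const char *path, read_completion_t *c) {
    printf("requesting %s via the application interface\n", path);
    c->rdram_addr = 0x2000;    /* illustrative buffer address        */
    c->done = true;            /* the single completion the app sees */
}

int main(void) {
    read_completion_t c = { false, 0 };
    appif_read_file("/data/example.txt", &c);
    while (!c.done)
        ;                      /* wait for the one completion event */
    printf("file ready in RDRAM at 0x%x\n", c.rdram_addr);
    return 0;
}
```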
- In an exemplary embodiment, operating system functions that are relatively slow, relatively frequently accessed, or both, can be distributed among the subordinate processors 130, thereby off-loading from the main core processor 110 a relatively large processing burden, which in turn improves the data processing throughput that the main core processor can achieve while executing the application according to the sequential programming model.
- FIG. 5 illustrates an expanded data processing architecture according to an exemplary embodiment of the invention. In the expanded data processing architecture 500 of FIG. 5, a plurality of instances of the data processing architecture 100 described above relative to FIGS. 1-3, designated respectively as 100a, 100b, . . . 100c, are interconnected by a bus structure 510. In particular, the bus structure 510 interconnects the main core processors 110 of the respective data processing architectures 100a, 100b, . . . 100c. The arrangement of FIG. 5 thus results in an even higher performance architecture.
- Although the present invention has been described with an exemplary embodiment, various changes and modifications may be suggested to one skilled in the art. It is intended that the present invention encompass such changes and modifications as fall within the scope of the appended claims.
Claims (20)
1. A data processing apparatus, comprising:
a main data processor capable of running an application;
a plurality of subordinate data processors capable of providing data processing support for the application running on said main data processor;
a plurality of communication paths which respectively couple said subordinate data processors to said main data processor; and
an operating system and an application interface for interfacing said application to said operating system, wherein said application interface is provided on said main data processor and at least some of said operating system is distributed among said subordinate data processors such that said subordinate data processors also provide operating system support for the application running on said main data processor.
2. The apparatus as set forth in claim 1, wherein said main data processor runs the application according to a sequential programming model.
3. The apparatus as set forth in claim 1, wherein said subordinate data processors provide said data processing support for the application running on said main data processor by inputting and outputting data from and to a site located physically separately from said data processing apparatus.
4. The apparatus as set forth in claim 1, wherein said at least some of said operating system includes an operating system function that is accessed relatively frequently by the application running on said main data processor.
5. The apparatus as set forth in claim 4, wherein said operating system function includes one of an IP stack, a dispatcher, a scheduler, and a virtual memory paging function.
6. The apparatus as set forth in claim 1, wherein said operating system is a Linux operating system.
7. The apparatus as set forth in claim 1, wherein said subordinate data processors execute program instructions to provide said data processing support.
8. The apparatus as set forth in claim 1, wherein said application interface renders said distribution of said at least some of said operating system transparent to the application running on said main data processor.
9. The apparatus as set forth in claim 1, wherein said apparatus is implemented as a single integrated circuit.
10. The apparatus as set forth in claim 9, wherein said main data processor includes a RISC processor, and said subordinate data processors include respective RISC microengines.
11. The apparatus as set forth in claim 1, wherein each of said communication paths includes a memory that is shared by said main data processor and the associated subordinate data processor.
12. The apparatus as set forth in claim 11, wherein said main data processor and all of said subordinate data processors share said memory.
13. The apparatus as set forth in claim 1, wherein said at least some of said operating system includes an IP stack, a scheduler, a dispatcher, and a virtual memory paging function.
14. The apparatus as set forth in claim 1, wherein at least one of said subordinate processors provides said data processing support and said operating system support concurrently.
15. A method of accelerating a computer operating system, comprising:
interfacing an application running on a main data processor to a computer operating system; and
distributing at least some of the computer operating system among a plurality of subordinate data processors which provide data processing support for the application running on the main data processor, such that the subordinate data processors can also provide operating system support for the application running on the main data processor.
16. The method as set forth in claim 15, wherein the at least some of the computer operating system includes one of an IP stack, a dispatcher, a scheduler, and a virtual memory paging function.
17. The method as set forth in claim 15, wherein the at least some of the computer operating system includes an IP stack, a scheduler, a dispatcher, and a virtual memory paging function.
18. The method as set forth in claim 15, wherein the at least some of the computer operating system includes an IP stack and a scheduler.
19. The method as set forth in claim 15, wherein the step of interfacing includes the step of rendering the distribution of the at least some of the computer operating system transparent to the application running on the main data processor.
20. The method as set forth in claim 15, further comprising the step of utilizing the subordinate data processors to provide the data processing support and the operating system support concurrently.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/091,362 US20050278720A1 (en) | 2004-05-27 | 2005-03-28 | Distribution of operating system functions for increased data processing performance in a multi-processor architecture |
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US57559004P | 2004-05-27 | 2004-05-27 | |
| US57558904P | 2004-05-27 | 2004-05-27 | |
| US11/091,362 US20050278720A1 (en) | 2004-05-27 | 2005-03-28 | Distribution of operating system functions for increased data processing performance in a multi-processor architecture |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20050278720A1 (en) | 2005-12-15 |
Family
ID=37149389
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US11/091,362 Abandoned US20050278720A1 (en) | 2004-05-27 | 2005-03-28 | Distribution of operating system functions for increased data processing performance in a multi-processor architecture |
| US11/091,731 Expired - Fee Related US7562111B2 (en) | 2004-05-27 | 2005-03-28 | Multi-processor architecture with high capacity I/O |
Country Status (2)
| Country | Link |
|---|---|
| US (2) | US20050278720A1 (en) |
| KR (1) | KR100694212B1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20080048291A (en) * | 2006-11-28 | 2008-06-02 | (주)엠텍소프트 | Digital processing apparatus including a plurality of processors and a method of connecting a plurality of processors |
| US10908914B2 (en) * | 2008-10-15 | 2021-02-02 | Hyperion Core, Inc. | Issuing instructions to multiple execution units |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB0919253D0 (en) | 2009-11-03 | 2009-12-16 | Cullimore Ian | Atto 1 |
| US8875276B2 (en) | 2011-09-02 | 2014-10-28 | Iota Computing, Inc. | Ultra-low power single-chip firewall security device, system and method |
| US8904216B2 (en) * | 2011-09-02 | 2014-12-02 | Iota Computing, Inc. | Massively multicore processor and operating system to manage strands in hardware |
| DE102015119201A1 (en) | 2015-05-11 | 2016-11-17 | Dspace Digital Signal Processing And Control Engineering Gmbh | A method of configuring an interface unit of a computer system |
| KR101936942B1 (en) * | 2017-08-28 | 2019-04-09 | 에스케이텔레콤 주식회사 | Distributed computing acceleration platform and distributed computing acceleration platform control method |
| KR101973946B1 (en) * | 2019-01-02 | 2019-04-30 | 에스케이텔레콤 주식회사 | Distributed computing acceleration platform |
| CN113377857A (en) * | 2021-07-02 | 2021-09-10 | 招商局金融科技有限公司 | Data distribution method and device, electronic equipment and readable storage medium |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6216216B1 (en) * | 1998-10-07 | 2001-04-10 | Compaq Computer Corporation | Method and apparatus for providing processor partitioning on a multiprocessor machine |
| US20030140179A1 (en) * | 2002-01-04 | 2003-07-24 | Microsoft Corporation | Methods and system for managing computational resources of a coprocessor in a computing system |
| US20050071578A1 (en) * | 2003-09-25 | 2005-03-31 | International Business Machines Corporation | System and method for manipulating data with a plurality of processors |
| US20050081201A1 (en) * | 2003-09-25 | 2005-04-14 | International Business Machines Corporation | System and method for grouping processors |
| US20050223382A1 (en) * | 2004-03-31 | 2005-10-06 | Lippett Mark D | Resource management in a multicore architecture |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4153942A (en) * | 1977-01-24 | 1979-05-08 | Motorola, Inc. | Industrial control processor |
| US4665520A (en) * | 1985-02-01 | 1987-05-12 | International Business Machines Corporation | Optimistic recovery in a distributed processing system |
| US6714945B1 (en) * | 1995-11-17 | 2004-03-30 | Sabre Inc. | System, method, and article of manufacture for propagating transaction processing facility based data and for providing the propagated data to a variety of clients |
| US6108731A (en) * | 1996-02-09 | 2000-08-22 | Hitachi, Ltd. | Information processor and method of its component arrangement |
| CA2245963C (en) | 1998-08-26 | 2009-10-27 | Qnx Software Systems Ltd. | Distributed kernel operating system |
| JP2000268011A (en) | 1999-03-19 | 2000-09-29 | Nec Soft Ltd | Distributed job execution system and program recording medium therefor |
| KR100388065B1 (en) * | 1999-06-18 | 2003-06-18 | 주식회사 케이티 | Method and apparatus for providing shared library on distributed system using UNIX |
| JP2003330730A (en) | 2002-05-15 | 2003-11-21 | Matsushita Electric Ind Co Ltd | Operating system deployment device |
2005
- 2005-03-28: US application US11/091,362 filed (published as US20050278720A1; status: Abandoned)
- 2005-03-28: US application US11/091,731 filed (issued as US7562111B2; status: Expired - Fee Related)
- 2005-05-06: KR application KR1020050038154 filed (issued as KR100694212B1; status: Expired - Fee Related)
Also Published As
| Publication number | Publication date |
|---|---|
| KR20060045952A (en) | 2006-05-17 |
| US7562111B2 (en) | 2009-07-14 |
| US20050267930A1 (en) | 2005-12-01 |
| KR100694212B1 (en) | 2007-03-14 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |