WO1996020440A1 - Dual processor decoupling - Google Patents
Dual processor decoupling
- Publication number
- WO1996020440A1 WO1996020440A1 PCT/GB1995/003001 GB9503001W WO9620440A1 WO 1996020440 A1 WO1996020440 A1 WO 1996020440A1 GB 9503001 W GB9503001 W GB 9503001W WO 9620440 A1 WO9620440 A1 WO 9620440A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processor
- memory
- fetch
- program
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
Definitions
- the present invention relates to a method for operating a parallel computing system.
- VSM virtual shared memory
- a suitable compromise involves the updating of memory in a less explicit but controlled and synchronised way. This can be achieved either by programming with notions such as bulk synchronisation in mind or by relying on high-level language constructs (such as array operations in FORTRAN 90) to enable the relevant information to be extracted at compile time.
- Virtual Shared Memory is an extension of well-known virtual memory principles. Each processor in a parallel system has local random access memory which is accessed via a Memory Management Unit (MMU). The virtual address space is common to all processors but their local memory will, at any time, hold only a fraction of it. The MMU records those areas of virtual memory which are currently in local memory and performs the address translation necessary to access them. If access is made to an area of memory which is not currently locally resident, the MMU must signal this "miss", allowing a copy of the appropriate memory to be made, from another processor, before the computation proceeds. The major difference between this and a normal virtual memory system is that multiple copies of the same area of memory may exist and may be accessed concurrently. In these circumstances it is vital that any inconsistencies which may occur do not have a detrimental effect on the result of the computation.
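To make the mechanism concrete, here is a minimal C sketch of the translation step described above. It is illustrative only: the page size, table layout, and the `request_copy` callback are assumptions, not details taken from the patent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12           /* assumed 4 KB pages */
#define NUM_PAGES (1u << 16)    /* assumed size of the virtual page table */

/* Per-page record: is a copy currently resident locally, and where? */
typedef struct {
    bool     resident;
    uint8_t *local_base;        /* base of the local copy of the page */
} page_entry;

static page_entry page_table[NUM_PAGES];

/* Translate a virtual address to a local one.  On a miss, request a
 * copy of the page from another processor and report failure so the
 * caller can wait (or, as developed below, carry on regardless).
 * Bounds checking is omitted for brevity. */
uint8_t *mmu_translate(uintptr_t vaddr, void (*request_copy)(size_t page))
{
    size_t page   = (size_t)(vaddr >> PAGE_SHIFT);
    size_t offset = (size_t)(vaddr & ((1u << PAGE_SHIFT) - 1));

    if (!page_table[page].resident) {
        request_copy(page);     /* signal the "miss" */
        return NULL;
    }
    return page_table[page].local_base + offset;
}
```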
- MMU Memory Management Unit
- the address of a memory access is a function of a value which is itself currently being "pre-fetched" and is not therefore available.
- a control transfer is a function of a value which is currently being "pre-fetched" and the next control instruction cannot be determined.
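Both hazards can be seen in a fragment like the following (purely illustrative C, with all names invented): the address of `a[idx]` depends on `b[i]`, which may itself still be in flight, and the branch direction depends on a value that may not yet have arrived.

```c
/* Illustrative only -- names invented.  Two reasons pre-fetching must
 * sometimes pause: a data-dependent address and a data-dependent branch. */
int hazards(const int *a, const int *b, int i, int threshold)
{
    int idx = b[i];      /* if b[i] is still being pre-fetched, the      */
    int v   = a[idx];    /* address of a[idx] cannot yet be generated    */

    if (v > threshold)   /* if v is still being pre-fetched, the next    */
        return v;        /* control instruction cannot be determined     */
    return -v;
}
```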
- the code performing address generation and data accessing would be generated separately from that which used the data. This would then run on separate fetch and execute units with the phasing adjusted so that instructions using the data were delayed by the store latency. In this case, it would be necessary to delay the execution unit by a time equal to, or greater than, the upper bound set for a remote transfer. It would also be necessary to provide a mechanism for queuing the data values in correct order at the input to the execution unit. This could probably be achieved using a FIFO queue into which values found to be local are placed immediately, while remote accesses are tagged with the queue position at which they should be inserted when they become available.
- the fetch unit would compute values of i and f, and issue addresses to form the queue of data values.
- the execution unit, delayed by the appropriate amount, would consume these to form the final value of "sum".
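A minimal single-threaded C model of this arrangement might look as follows. The reduction loop and all names are assumptions, since the original program listing is not reproduced in this extract; the point is only the ordering discipline: the fetch side pushes operand values into a FIFO in issue order, and the execute side, running behind it, drains the queue to form `sum`.

```c
#include <stdio.h>

#define QLEN 1024   /* assumed queue capacity */

/* FIFO of operand values between the fetch and execute units. */
static double fifo[QLEN];
static int head, tail;

static void   fifo_push(double v) { fifo[tail++ % QLEN] = v; }
static double fifo_pop(void)      { return fifo[head++ % QLEN]; }

int main(void)
{
    double a[100];
    for (int i = 0; i < 100; i++)
        a[i] = 0.5 * i;

    /* Fetch unit: generate addresses and queue the operand values.
     * In hardware, remote accesses would be tagged with their queue
     * position and filled in when they arrive. */
    for (int i = 0; i < 100; i++)
        fifo_push(a[i]);

    /* Execute unit: delayed by the store latency, consume the queued
     * values in order to form the final value of "sum". */
    double sum = 0.0;
    for (int i = 0; i < 100; i++)
        sum += fifo_pop();

    printf("sum = %f\n", sum);
    return 0;
}
```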
- Figs. 2 to 4 show a comparison of various execution patterns which might occur with this simple program under various conditions.
- Fig. 2 shows the sequential fetch and execute cycles which would occur on a single processor where all data were local.
- Fig. 3 shows a demand copying VSM system where, assuming some locality of access to copied pages, the execution time would be extended by idle periods equal to the time L that it takes to copy a page from a remote processor.
- Fig. 4 shows the effect of decoupling assuming that the total fetch time is greater than the latency. In these circumstances, the performance is improved over the serial case due to the pipeline overlap of the fetch and execute phases (assuming that they are not significantly extended by being separated). This pipelining effect is a bonus; the major achievement of the decoupling is that it has eliminated the idle periods due to the copying latency.
- if the example is intended to represent a single thread of a parallel program, then such performance would be very acceptable.
- access latencies to non-local values may vary from a few machine cycles to as many as a thousand or more machine cycles and the latency will not be known until the access is made. Even then, if it is a remote access and the characteristics of the communication system are non-uniform, it will not be possible to do more than predict an upper bound.
- a method for operating a parallel computing system comprising first and second processors, a first memory for which the communication access latency is relatively short, and a second memory for which the communication access latency is relatively long, the system running a program which requires access to both of the first and second memories, wherein the first processor runs at least part of the program to identify data required from the second memory to run the program, the first processor fetches the identified data from the second memory, the fetched data is stored in the first memory but not operated upon by the first processor, and the second processor runs the program using the data fetched by the first processor and stored in the first memory, the operation of the second processor being delayed relative to that of the first processor to enable the first processor to pre-fetch the required data from the second memory.
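As a rough model of the claimed method, the following C sketch runs a fetch-only pass on one thread ("processor A") and the full computation, delayed by an assumed maximum latency, on another ("processor B"). Everything here is hypothetical scaffolding: in the real system the phasing guarantee is provided by the hardware delay, whereas in portable C the two passes would need explicit synchronisation.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define N 1000
#define MAX_LATENCY_US 500   /* assumed upper bound on a remote fetch */

static double global_data[N];  /* stands in for the "second" (remote) memory */
static double local_copy[N];   /* stands in for the "first" (local) memory   */

/* "Processor A": run only enough of the program to identify and fetch
 * the required data; it stores but never operates on the values. */
static void *fetch_pass(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++)
        local_copy[i] = global_data[i];   /* fetch into local memory */
    return NULL;
}

/* "Processor B": run the full program against local memory, delayed so
 * that A has had time to pre-fetch.  NB: in the patent the hardware
 * delay guarantees ordering; in portable C this sleep is only a model,
 * and a real program would need explicit synchronisation here. */
static void *execute_pass(void *arg)
{
    (void)arg;
    usleep(MAX_LATENCY_US);
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += local_copy[i];
    printf("sum = %f\n", sum);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        global_data[i] = (double)i;

    pthread_t a, b;
    pthread_create(&a, NULL, fetch_pass, NULL);
    pthread_create(&b, NULL, execute_pass, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```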
- Both processors may run the same code, the operation of the second processor being delayed relative to that of the first by the maximum latency of the second memory.
- One way of achieving this would be to form a queue of instructions to be performed by the second processor in a FIFO, instructions being added to the queue when processed by the first processor and removed from the queue when performed by the second processor.
- the code run by the first processor may be the same as that run by the second processor except for the omission of all but that part of the loop concerned with fetch operations.
- a loop count may be maintained, the loop count being incremented by the first processor each time the first processor completes a fetch cycle of a value needed by the second processor, and decremented by the second processor each time an execute cycle is completed.
- the second processor may be caused to miss instruction execution cycles if the loop count exceeds a value equivalent to the maximum fetch latency of the second memory.
- synchronisation check points may be inserted in the program code, the first processor being prevented from proceeding beyond a predetermined point in the code defined by a check point until authorised to proceed by the second processor.
- the first processor may hold a relative loop length variable which is loaded before a complete fetch sequence, the value of the variable being equal to or less than the ratio of the fetch/execute instruction sequence length to the fetch sequence length, and the first processor being prevented from proceeding if the value of this variable exceeds a predetermined value.
- the present invention also provides an apparatus for performing the above method.
- Fig. 1 is a simplified illustration of a decoupled VSM structure comprising separate fetch and execution processing units;
- Figs. 2, 3 and 4 illustrate execute patterns which might occur with the system of Fig. 1 under various conditions;
- Fig. 5 is a schematic illustration of an embodiment of the present invention;
- Fig. 6 illustrates the operation of a VSM system in accordance with the present invention assuming some page locality;
- Fig. 7 illustrates the operation of a modified embodiment of the present invention. Referring to Fig. 5, this illustrates an embodiment of the present invention.
- a first processor 1 accesses a local first memory 2 through a memory management unit 3 (MMU A).
- a second processor 4 accesses the local memory 2 through a memory management unit 5 (MMU B).
- the MMU's communicate with a second memory which is a shared global memory (not shown) through a communications processor 6.
- processors are both capable of performing the full range of fetch and execute operations.
- the processors run with the operation of the second processor lagging the operation of the first.
- the leading processor (processor A) will be performing the fetch function
- the lagging processor (processor B) performs program execution.
- processor A the leading processor
- processor B the lagging processor
- each of processors A and B is allocated its own stack space in virtual memory to hold local variables. It is assumed that the computation will consist of parallel "threads" and that the local variables belonging to a particular thread will never be accessed from outside one processor.
- MMU A has the following properties :-
- a read access to a local variable will always succeed and return a value.
- a write access to a local variable will always succeed and modify the value.
- a read access to a global variable may succeed or result in a page fault. In either case, the computation continues in processor A, but in the case of a page fault, the Communications Processor is notified that the page is needed. It is assumed that the communications processor will have a record of pages currently being fetched and will initiate a fetch if necessary. The value returned by this access will be a special null.
- a write access to a global variable will not be performed by the MMU A in any circumstances, but a request will be issued for a copy of that page and the computation will proceed.
- MMU B performs normal address translation with all reads and writes completed.
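The asymmetric MMU behaviour just described might be modelled in C as follows. All helper functions are hypothetical externs, and the "special null" encoding is an assumption. The essential points are that MMU A turns a global read miss into a fetch request plus a null value, never performs a global write, and never stalls, while MMU B simply translates.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPECIAL_NULL 0xDEADBEEFu   /* assumed encoding of the "special null" */

/* Hypothetical helpers -- none of these names come from the patent. */
extern bool     is_global(uintptr_t addr);      /* global vs. per-thread stack */
extern bool     page_is_local(uintptr_t addr);
extern void     request_page(uintptr_t addr);   /* notify the comms processor  */
extern uint32_t local_read(uintptr_t addr);
extern void     local_write(uintptr_t addr, uint32_t v);

/* MMU A: never stalls.  A global read miss becomes a fetch request plus
 * a null value; a global write is never performed, only converted into
 * a request for a copy of the page. */
uint32_t mmu_a_read(uintptr_t addr)
{
    if (is_global(addr) && !page_is_local(addr)) {
        request_page(addr);   /* comms processor de-duplicates in-flight fetches */
        return SPECIAL_NULL;  /* computation continues with a null value */
    }
    return local_read(addr);  /* local variables always succeed */
}

void mmu_a_write(uintptr_t addr, uint32_t v)
{
    if (is_global(addr)) {
        request_page(addr);   /* write dropped; a page copy is requested */
        return;
    }
    local_write(addr, v);
}

/* MMU B: ordinary translation.  By the time processor B arrives, the
 * pages requested on behalf of A should be resident, so all reads and
 * writes complete normally. */
uint32_t mmu_b_read(uintptr_t addr)              { return local_read(addr); }
void     mmu_b_write(uintptr_t addr, uint32_t v) { local_write(addr, v); }
```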
- Processor A calculates array addresses and causes the fetching of pages of global virtual memory into local memory by trying to read the array. However, because a queue of operands has not been formed, the code which runs on processor B must repeat the address calculations before accessing memory. As long as a page replacement strategy is operated which does not reject pages which have recently been accessed from processor A, all global variables required by the computation will be available for it to complete normally in processor B. The functionality defined for the MMU also allows the correct result to be achieved by running the complete program, phased appropriately, on both processors. If the elements of the array were not local when accessed by processor A, spurious values of A's local copy of "sum" will be formed, but this is of no consequence as the correct values will be locally available to processor B when required.
- the approach means running the complete code twice thereby losing the performance advantages of pipelining the fetch and execute operations.
- One benefit, though, is that such a structure can be built using "off the shelf" integrated CPU components rather than the more specialised custom circuits which might be required for dedicated fetch/execute hardware.
- acceptable performance can be achieved without specialised compiler technology because of the ability to run the same code, although higher performance might result with more intelligent separation of fetch and execute functions.
- decoupled VSM in accordance with the invention can hide remote access latencies in many of these more complex cases.
- the two processors A and B run code with a delay between them.
- there are two cases: processor A suspending because it needs the value of a non-local variable and, although it is not shown as happening in the above examples (because it will be avoided in most cases), processor B suspending if there has been a delay in a global value arriving. In both these cases, the need to suspend will be signalled by the MMU.
- processor B To ensure the lack of suspensions, it is necessary to keep processor B delayed by the correct amount. However, particularly in more complex cases where there may be speculative fetching, processor A must not get too far ahead, filling the real store with copies too far in advance.
- FIFO queue One possible technique is to use a FIFO queue, except that it would be a queue of instructions between the two processors rather than a queue of data items as in the above example shown in Fig. 1.
- the length of the queue would represent the delay between the processors so that the appropriate processor could be stalled if the queue either became too long or too short.
- This approach would work well for simple cases where both processors are running identical code but significant additional complexity would be needed to cope with the more general case.
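A sketch of that queueing discipline in C, with capacity thresholds chosen arbitrarily: instruction slots enter the FIFO as processor A retires instructions and leave as processor B retires them, and either processor stalls when the occupancy leaves a target window.

```c
#include <stdbool.h>

#define MIN_DEPTH   64   /* assumed: below this, B has caught up too far    */
#define MAX_DEPTH 3072   /* assumed: above this, A is running too far ahead */

typedef struct { unsigned depth; } instr_queue;

/* Called as processor A retires an instruction; returns true if A must
 * stall this cycle because the queue (i.e. the delay) is too long. */
static bool a_retire(instr_queue *q)
{
    if (q->depth >= MAX_DEPTH)
        return true;
    q->depth++;
    return false;
}

/* Called as processor B is about to execute an instruction; returns true
 * if B must stall because the delay has not yet been established. */
static bool b_issue(instr_queue *q)
{
    if (q->depth <= MIN_DEPTH)
        return true;
    q->depth--;
    return false;
}
```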
- processors may have their progress of instruction execution halted by a miss signalled by the MMU.
- processor A may be stalled if its execution gets too far ahead of processor B.
- a "loop count" could be maintained, incremented by processor A each time a fetch cycle of a value known to be needed by processor B is completed and decremented by processor B when the execute cycle is completed. If this count exceeded a value equivalent to the maximum fetch latency then processor A could be caused to miss instruction execution cycles.
- Processor A could hold a relative loop length variable, loaded before a complete fetch sequence was entered. This value should be equal to (or just less than) the ratio of the fetch/execute instruction sequence length to the fetch sequence length. A count would be incremented by this amount for each instruction executed in A and decremented by one for each instruction executed in B. The arithmetic required would notionally be fractional but could, of course, be implemented as a simple binary operation. Again, processor A would be stalled if this count exceeded a certain value. This would have the benefit of not adding significantly to the instruction count of the code, at the expense of a small hardware cost and some compilation effort to determine the magnitude of the ratio and where it should be loaded.
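The two throttling alternatives might be sketched in C as follows. The latency bound, stall thresholds, and 16.16 fixed-point representation are all assumptions; the fixed point merely realises the remark that the notionally fractional arithmetic can be implemented as a simple binary operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Variant 1: count completed fetch cycles in A against completed execute
 * cycles in B; make A skip cycles once the gap covers the worst case. */
#define MAX_FETCH_LATENCY_CYCLES 1024   /* assumed upper bound */

static int loop_count;

static bool a_fetch_done(void)          /* A completed a fetch cycle */
{
    loop_count++;
    return loop_count > MAX_FETCH_LATENCY_CYCLES;  /* true => A skips cycles */
}

static void b_execute_done(void) { loop_count--; }

/* Variant 2: a per-instruction count scaled by the ratio of the
 * fetch/execute sequence length to the fetch sequence length, kept in
 * 16.16 fixed point. */
#define FIX_ONE (1 << 16)
#define LIMIT   (512 * FIX_ONE)         /* assumed stall threshold */

static int32_t scaled_count;
static int32_t ratio_fix;               /* loaded before each fetch sequence */

static void load_ratio(int fe_len, int f_len)
{
    ratio_fix = (int32_t)(((int64_t)fe_len * FIX_ONE) / f_len);
}

static bool a_instruction_done(void)    /* each instruction retired in A */
{
    scaled_count += ratio_fix;
    return scaled_count > LIMIT;        /* true => stall A */
}

static void b_instruction_done(void) { scaled_count -= FIX_ONE; }
```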
- processor A will start executing code to ensure that required global values are made local.
- Processor B will start at the same time and only suspend if its MMU signals a "miss". It will be restarted by the arrival of the page of information requested by processor A. In this way the delay will only be introduced when necessary, resulting in the minimum total execution time.
- the invention deals with the problems of "page misses" in a paged distributed virtual shared memory system by using a dual processor configuration where one processor does a partial evaluation in an attempt to predict subsequent memory needs. It is believed that the technique can be implemented using standard processor components; the main functionality is provided by the memory management system.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multi Processors (AREA)
Abstract
The invention concerns a method of operating a parallel virtual shared memory computer system using dual processor decoupling. The system comprises first and second processors, a first memory for which the communication access latency is relatively short, and a second memory for which the communication access latency is relatively long. The system runs a program which requires access to both the first and second memories, and the first processor runs at least part of the program to identify the data required from the second memory to run the program. The first processor fetches the identified data from the second memory and stores that data in the first memory. The first processor does not, however, operate on the fetched data. The second processor runs the program using the data fetched by the first processor and stored in the first memory. The operation of the second processor is delayed relative to the first processor to enable the first processor to pre-fetch the required data from the second memory.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU42704/96A AU4270496A (en) | 1994-12-23 | 1995-12-21 | Dual processor decoupling |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB9426155.9A GB9426155D0 (en) | 1994-12-23 | 1994-12-23 | Dual processor decoupling |
| GB9426155.9 | 1994-12-23 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO1996020440A1 true WO1996020440A1 (fr) | 1996-07-04 |
Family
ID=10766542
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/GB1995/003001 Ceased WO1996020440A1 (fr) | 1994-12-23 | 1995-12-21 | Decouplage de processeur double |
Country Status (3)
| Country | Link |
|---|---|
| AU (1) | AU4270496A (fr) |
| GB (1) | GB9426155D0 (fr) |
| WO (1) | WO1996020440A1 (fr) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2001052061A3 (fr) * | 2000-01-14 | 2002-01-10 | Sun Microsystems Inc | Procede et appareil permettant d'utiliser un processeur auxiliaire pour effectuer des operations de prelecture de valeurs de donnees dans un processeur primaire |
| WO2002057909A3 (fr) * | 2001-01-16 | 2003-08-07 | Sun Microsystems Inc | Speculation sur valeur sur un processeur secondaire afin de faciliter la prelecture pour un processeur principal |
| US6681318B2 (en) | 2000-09-08 | 2004-01-20 | Sun Microsystems, Inc. | Method and apparatus for using an assist processor to prefetch instructions for a primary processor |
| US6772321B2 (en) | 2000-05-04 | 2004-08-03 | Sun Microsystems, Inc. | Method and apparatus for using an assist processor and value speculation to facilitate prefetching for a primary processor |
| WO2004068339A3 (fr) * | 2003-01-28 | 2005-12-22 | Sun Microsystems Inc | Processeur a fil de pistage pour bandes laterales |
| CN102298507A (zh) * | 2010-06-24 | 2011-12-28 | 宇瞻科技股份有限公司 | 具有虚拟光盘装置的存储装置 |
-
1994
- 1994-12-23 GB GBGB9426155.9A patent/GB9426155D0/en active Pending
-
1995
- 1995-12-21 WO PCT/GB1995/003001 patent/WO1996020440A1/fr not_active Ceased
- 1995-12-21 AU AU42704/96A patent/AU4270496A/en not_active Abandoned
Non-Patent Citations (5)
| Title |
|---|
| CHEN T -F ET AL: "A PERFORMANCE STUDY OF SOFTWARE AND HARDWARE DATA PREFETCHING SCHEMES", PROCEEDINGS OF THE ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, CHICAGO, APRIL 18 - 21, 1994, no. SYMP. 21, 18 April 1994 (1994-04-18), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 223 - 232, XP000480438 * |
| F. E. SAKALAY: "Storage Hierarchy Control System", IBM TECHNICAL DISCLOSURE BULLETIN, vol. 15, no. 4, September 1992 (1992-09-01), NEW YORK US, pages 1100 - 1101, XP002002415 * |
| G. L. REIJNS ET AL: "Application of Virtual Memory Technique for Coupled Microprocessors", FOURTH EUROMICRO SYMPOSIUM ON MICROPROCESSING AND MICROPROGRAMMING, 17 October 1978 (1978-10-17) - 19 October 1978 (1978-10-19), MUNICH, pages 130 - 139, XP002002416 * |
| J. E. SMITH, T. J. KAMINSKI: "Varieties of Decoupled Access/Execute Computer Architectures", TWENTIETH ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING, 6 October 1982 (1982-10-06) - 8 October 1982 (1982-10-08), MONTICELLO, ILLINOIS, pages 577 - 586, XP000212152 * |
| KURIAN L ET AL: "MEMORY LATENCY EFFECTS IN DECOUPLED ARCHITECTURES", IEEE TRANSACTIONS ON COMPUTERS, vol. 43, no. 10, 1 October 1994 (1994-10-01), pages 1129 - 1139, XP000474484 * |
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2001052061A3 (fr) * | 2000-01-14 | 2002-01-10 | Sun Microsystems Inc | Procede et appareil permettant d'utiliser un processeur auxiliaire pour effectuer des operations de prelecture de valeurs de donnees dans un processeur primaire |
| US6415356B1 (en) | 2000-01-14 | 2002-07-02 | Sun Microsystems, Inc. | Method and apparatus for using an assist processor to pre-fetch data values for a primary processor |
| KR100800941B1 (ko) * | 2000-01-14 | 2008-02-04 | 썬 마이크로시스템즈, 인코포레이티드 | 주프로세서용 데이터값의 사전인출을 위해 보조프로세서를이용하는 방법 및 장치 |
| JP4766816B2 (ja) * | 2000-01-14 | 2011-09-07 | オラクル・アメリカ・インコーポレイテッド | アシストプロセッサを使用して1次プロセッサのデータ値をプリフェッチするための方法および装置 |
| US6772321B2 (en) | 2000-05-04 | 2004-08-03 | Sun Microsystems, Inc. | Method and apparatus for using an assist processor and value speculation to facilitate prefetching for a primary processor |
| US6681318B2 (en) | 2000-09-08 | 2004-01-20 | Sun Microsystems, Inc. | Method and apparatus for using an assist processor to prefetch instructions for a primary processor |
| WO2002057909A3 (fr) * | 2001-01-16 | 2003-08-07 | Sun Microsystems Inc | Speculation sur valeur sur un processeur secondaire afin de faciliter la prelecture pour un processeur principal |
| WO2004068339A3 (fr) * | 2003-01-28 | 2005-12-22 | Sun Microsystems Inc | Processeur a fil de pistage pour bandes laterales |
| US7502910B2 (en) | 2003-01-28 | 2009-03-10 | Sun Microsystems, Inc. | Sideband scout thread processor for reducing latency associated with a main processor |
| CN102298507A (zh) * | 2010-06-24 | 2011-12-28 | 宇瞻科技股份有限公司 | 具有虚拟光盘装置的存储装置 |
Also Published As
| Publication number | Publication date |
|---|---|
| AU4270496A (en) | 1996-07-19 |
| GB9426155D0 (en) | 1995-02-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Steffan et al. | A scalable approach to thread-level speculation | |
| Dubois et al. | Memory access dependencies in shared-memory multiprocessors | |
| Mowry et al. | Automatic compiler-inserted I/O prefetching for out-of-core applications | |
| Etsion et al. | Task superscalar: An out-of-order task pipeline | |
| Anderson et al. | Performance of the Cray T3E multiprocessor | |
| Bitar et al. | Multiprocessor cache synchronization: Issues, innovations, evolution | |
| US5835950A (en) | Self-invalidation method for reducing coherence overheads in a bus-based shared-memory multiprocessor apparatus | |
| CN114667508A (zh) | 为加速器取回数据的方法和系统 | |
| US5896517A (en) | High performance processor employing background memory move mechanism | |
| Krishnan et al. | The need for fast communication in hardware-based speculative chip multiprocessors | |
| Kavi et al. | Design of cache memories for multi-threaded dataflow architecture | |
| Blake et al. | Bloom filter guided transaction scheduling | |
| Bousias et al. | Instruction level parallelism through microthreading—a scalable approach to chip multiprocessors | |
| VanderWiel et al. | A survey of data prefetching techniques | |
| Kim et al. | Leveraging cache coherence in active memory systems | |
| Scheurich et al. | Lockup-free caches in high-performance multiprocessors | |
| WO1996020440A1 (fr) | Decouplage de processeur double | |
| Torrellas et al. | The Illinois aggressive coma multiprocessor project (I-ACOMA) | |
| Muller et al. | Hiding miss latencies with multithreading on the Data Diffusion Machine | |
| Watson et al. | Decoupled pre-fetching for distributed shared memory | |
| Berrached et al. | A decoupled access/execute architecture for efficient access of structured data | |
| Bakshi et al. | Memory latency: to tolerate or to reduce | |
| Amamiya et al. | Datarol: a parallel machine architecture for fine-grain multithreading | |
| Vianès et al. | A case for second-level software cache coherency on many-core accelerators | |
| Watson et al. | An evaluation of DELTA, a decoupled pre-fetching virtual shared memory system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A1 Designated state(s): AL AM AT AU BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG US UZ VN |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG |
|
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: CA |