
WO1996020440A1 - Dual processor decoupling - Google Patents

Dual processor decoupling Download PDF

Info

Publication number
WO1996020440A1
WO1996020440A1 PCT/GB1995/003001 GB9503001W WO9620440A1 WO 1996020440 A1 WO1996020440 A1 WO 1996020440A1 GB 9503001 W GB9503001 W GB 9503001W WO 9620440 A1 WO9620440 A1 WO 9620440A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor
memory
fetch
program
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/GB1995/003001
Other languages
French (fr)
Inventor
Ian Watson
Alasdair Rawsthorn
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Manchester
Original Assignee
Victoria University of Manchester
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Victoria University of Manchester filed Critical Victoria University of Manchester
Priority to AU42704/96A priority Critical patent/AU4270496A/en
Publication of WO1996020440A1 publication Critical patent/WO1996020440A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

A method for operating a parallel virtual shared memory computing system which relies upon dual processor decoupling. The system has first and second processors, a first memory for which the communication access latency is relatively short, and a second memory for which the communication access latency is relatively long. The system runs a program which requires access to both of the first and second memories, and the first processor runs at least part of the program to identify data required from the second memory to run the program. The first processor fetches the identified data from the second memory and stores that data in the first memory. The first processor does not however operate upon the fetched data. The second processor runs the program using the data fetched by the first processor and stored in the first memory. The operation of the second processor is delayed relative to that of the first processor to enable the first processor to pre-fetch the required data from the second memory.

Description

DUAL PROCESSOR DECOUPLING
The present invention relates to a method for operating a parallel computing system.
Commercial parallel computers which present a machine wide view of memory have, until fairly recently, had a physically shared store structure resulting in limited extensibility. More highly parallel machines have used physically distributed store and presented the programmer with message passing models of computation for inter process communication. More recently, it has become much more widely accepted that, at the programmers model level, a global view of memory is highly desirable even if the machine has a physically distributed store structure.
There are two "mainstream" approaches to the provision of a "scalable" distributed shared memory model. The first, virtual shared memory (VSM). is an idea which was proposed in the area of declarative parallel systems some time ago. (I. Watson and P. Watson. "Graph Reduction in a Parallel Virtual Memory Environment". LNCS 279. Springer-Verlag. 1987. D.H.D. Warren and S. Haridi. "The Data Diffusion Machine - a Scalable Shared Virtual Memory Multiprocessor". Proceedings of the 1988 International Conference on Fifth Generation Computer Systems, pp 943-952. Tokyo. Japan. Dec 1988.) This idea has only recently been implemented in a practical commercial machine. (J. Rothnie. "Overview of the KSR1 Computer System". Technical Report TR-9202001. Kendal Square Research. 1992.) It is essentially an extension of a conventional virtual memon' system where a processor's local memon holds copies of the data currently in use which may be fetched or discarded as appropriate. The second, often referred to as the "Valiant" approach, in some ways, resembles the ideas incorporated in early graph reduction machines that large memo latency can be fully hidden by the use of a "high degree of excess parallelism. (M.D. Cripps. J. Darlington. A.J. Field. P.G. Harrison & M.J. Reeve. "The Design and Implementation of ALICE: A parallel Graph Reduction Machine", in Thakkar. S.S. (ed) Selected Reprints on Dataflow and Reduction Architectures. IEEE Computer Socieu Press. (1987).) The TERA machine is a commercial example of this approach. (Smith B et al. The TERA Computer System" ICS90 ACM Press 1990, ppl-6j.)
Although the above two areas are not entirely independent, their separation emphasises two important issues in the support of shared distributed memory. The first is that locality should be exploited to avoid non-local memory access whenever possible whereas the second is that efficient support of multiple threading can hide significant memory latency.
The attraction of a shared memory for parallel computing can readily be appreciated by reference to an analogy in human activity of the difference between the requirement to tell everyone who needs to know about a particular fact as opposed to posting information on a notice board for reading on demand. The latter is clearly far more flexible in complex situations where the use of information is unpredictable. Unfortunately, the analogy extends to the problem of large numbers of people needing to look at the information at the same time; if it is in a single place there will be access problems.
Physically shared memory therefore results in limited extensibility and the only practical solution is to distribute it. There are basically two choices: either all memory can be put equally distant from all processors, or each processor can have its own memory but that memory is accessible from elsewhere. The former has the benefit of consistency but discards the capability of a processor to access at least part of the memory locally, with all the known performance benefits of locality. Of course, a hybrid solution is possible using some totally private memory but this does not, in reality, alter the real problem of a processor needing to make global access to a memory to which it is not directly connected.
The restricted environment provided by message passing programming models simplifies the task of writing parallel programs, and the availability of global memory has to be combined with appropriate programming models if it is to be used effectively: uncontrolled use of shared state is well known to lead to significant complexity and hence erroneous programs. There is a wide range of "controlled" ways of using shared memory, ranging from explicit constructs to both lock and synchronise access to shared variables, to what is essentially the declarative approach of having one "once only" writer. The first of these is still probably far too detailed and hence complex for general purpose programming, while the latter is too restrictive, both in expressive power and in the requirement which it produces for complex memory management mechanisms to re-use physical memory.
It is probable that a suitable compromise involves the updating of memory in a less explicit but controlled and synchronised way. This can be achieved either by thinking about programming with notions such as bulk synchronisation or by relying on high level language constructs (such as array operations in FORTRAN 90) to enable relevant information to be extracted at compile time.
Virtual Shared Memory is an extension of well known virtual memory principles. Each processor in a parallel system has local random access memory which is accessed via a Memory Management Unit (MMU). The virtual address space is common to all processors but their local memory will, at any time, hold only a fraction of it. The MMU records those areas of virtual memory which are currently in local memory and performs the address translation necessary to access them. If access is made to an area of memory which is not currently locally resident, the MMU must signal this "miss", allowing for a copy of the appropriate memory to be made, from another processor, before the computation proceeds. The major difference between this and a normal virtual memory system is that multiple copies of the same area of memory may exist and may be being accessed concurrently. In these circumstances it is vital that any inconsistencies which may occur do not have a detrimental effect on the result of the computation.
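As an informal illustration of the translate-or-miss behaviour just described (not part of the original disclosure), a minimal MMU lookup might be sketched in C as follows; the page size, page table layout and miss-handling interface are all assumed purely for illustration:

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096           /* assumed page size */
    #define NOT_RESIDENT ((void *)0) /* marker for a page with no local copy */

    /* One entry per virtual page: where (if anywhere) it lives in local memory. */
    typedef struct {
        void *local_copy;            /* NOT_RESIDENT if the page is not local */
    } page_entry;

    /* Translate a virtual address; on a miss, ask the communications
     * processor for a copy of the page before the computation proceeds. */
    void *vsm_translate(page_entry *table, uintptr_t vaddr,
                        void (*request_page)(uintptr_t page_no))
    {
        uintptr_t page_no = vaddr / PAGE_SIZE;
        uintptr_t offset  = vaddr % PAGE_SIZE;

        if (table[page_no].local_copy == NOT_RESIDENT) {
            request_page(page_no);   /* signal the "miss"; copy fetched remotely */
            return NULL;             /* caller must wait for the page to arrive */
        }
        return (char *)table[page_no].local_copy + offset;
    }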
Decoupled Access/Execute architectures have been described in research literature over the last ten years, although no mainstream products have emerged in that period (J.E. Smith, "Decoupled Access/Execute Computer Architecture", ACM Transactions on Computer Systems, Vol. 2, No. 4, November 1984, pp. 289-308). Traditional Access/Execute decoupling is a single-processor architectural technique for hiding memory latency by allowing different parts of a program to execute asynchronously: the memory addressing part runs early, while the data accessing part runs later, delayed by whatever time is necessary to read data items from memory.
There are circumstances where decoupling breaks down; these can be broadly divided into two categories :-
The address of a memory access is a function of a value which is itself currently being "pre-fetched" and is not therefore available.
A control transfer is a function of a value which is currently being "pre-fetched" and the next control instruction cannot be determined.
If this happens then the progress of the computation must be delayed until the required data is available.
Most recently, decoupling has been investigated in the context of supercomputer design, and has been found to be very sympathetic to the design of processors for high-speed numeric computation (P.L. Bird, N.P. Topham and A. Rawsthorne, "The Effectiveness of Decoupling", ACM International Conference on Supercomputing, Tokyo, Japan, 1993). In particular, it has been found that in a large range of different programs, the incidence of events that cause decoupling to break down is rare.
This recent work is significant in two ways: it demonstrates that it is possible to optimize decoupled architectures in such a direction that they promise real performance, and secondly it shows how infrequently the remaining "Loss of Decoupling" events occur in a range of programs.
To consider the application of decoupling to VSM structures, assume a section of a parallel program, running on a processor of a parallel system which implements VSM, which operates on (for the moment, reads) elements of an array which are mapped on to multiple pages of virtual memory. An array data structure is assumed as it is probably the most important data structuring mechanism in parallel computation and also simplifies the initial description. Assume that many of the elements of the array are accessed but in a manner which is not readily determinable at compile time, and that the pages which hold the array are scattered around separate memories of the system.
A simple example of such code (in a suitable high level language) might be:-

    for i = 0 to N do
        sum = sum + A[f(i)]
It is assumed that the function f returns integer values which are within the bounds of the array but which, for successive values of i, are distributed across all possible values in a way which is not predictable except by computing the function at run time. This does not necessarily mean that successive accesses to the array A have no spatial locality, just that it cannot readily be predicted.
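For concreteness, the same loop might be rendered in C as below; the bound N, the element type and the indexing function f are assumptions made purely for illustration:

    /* Sum array elements selected by an index function whose values are
     * only known at run time, so the pages touched cannot be predicted. */
    int sum_indirect(const int *A, int N, int (*f)(int))
    {
        int sum = 0;
        for (int i = 0; i <= N; i++)
            sum = sum + A[f(i)];   /* f(i) may land on any page of A */
        return sum;
    }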
In these circumstances there is a high probability that a processor performing this computation will attempt to make memory accesses to addresses for which there are no local copies in the processor's local memory. The memory management unit will therefore signal that the page must be fetched from another memory in the system. In any real system, the communication overheads and network latency will ensure that hundreds of machine cycles will elapse before the copy appears in local memory and the computation can proceed. Clearly, this will have a significant impact on the performance of the system; given that the incidence of such page "misses" will increase with an increasing number of parallel processors, it is likely that the performance of any parallel machine implementing VSM using a simple "copy on demand" system will exhibit poor parallel speed-up.
If the access patterns were regular, it might be possible to determine (either automatically at compile time or by user provided annotations) which pages of memory were required before the accesses happened, and pre-fetch the pages to avoid having to suspend the computation whenever a new page was accessed (such "tiling directives" are crucial to the performance of KSR FORTRAN). However, in this case, unless we could pre-evaluate f(i), the only possibility is to pre-fetch all pages at the risk of significant wasted work and hence inefficiency; limited compile time evaluation of f(i) may be possible but a general scheme would soon become highly complex. In "real" programs it may prove impossible to make any sensible "compile time" decisions, particularly with more complex data structures.
Techniques related to decoupling can, however, enable us to do a significant amount of prediction at run-time. In any modern or structured approach to the above program, the variable i would be local to a small section of the program, probably allocated on the stack. Generally, therefore, the store location holding i is, and will remain, local to the processor on which the section of code is running and any accesses to it can be made rapidly (in a real processor, it could be assumed that it was in cache). Computations involving only such local variables will therefore never require data to be fetched from elsewhere and can always proceed unhindered. In this case, therefore, the generation of the addresses required to access A can run independently, i.e. they can be decoupled.
In a normal approach to decoupling, with the objective of trying to mask latencies which occur during main store access, the code performing address generation and data accessing would be generated separately from that which used the data. This would then run on separate fetch and execute units with the phasing adjusted so that instructions using the data were delayed by the store latency. In this case, it would be necessary to delay the execution unit by a time equal to, or greater than, the upper bound set for a remote transfer. It would also be necessary to provide a mechanism for queuing the data values in correct order at the input to the execution unit. This could probably be achieved using a FIFO queue into which values found to be local are placed directly, while remote accesses are tagged with the queue position at which their values should be inserted when they become available. It would be necessary to enter any data fetched into the local memory and make appropriate entries into the MMU. The length of the operand queue should, of course, be sufficient to hold all values fetched during the delay period. An outline of a possible physical structure is shown in Fig. 1.
In the simple array example, referred to above, the fetch unit would compute values of i and f(i), and issue addresses to form the queue of data values. The execution unit, delayed by the appropriate amount, would consume these to form the final value of "sum". Figs. 2 to 4 show a comparison of various execution patterns which might occur with this simple program under various conditions.
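A minimal sketch of such a decoupled fetch/execute pair with an operand FIFO is given below; this is not taken from the patent, and queue sizing, overflow handling and the stalling of either unit are deliberately omitted:

    /* Illustrative operand FIFO shared between the fetch and execute units;
     * it must be large enough to hold all values fetched during the delay. */
    typedef struct {
        int buf[1024];
        int head, tail;
    } fifo;

    static void fifo_put(fifo *q, int v) { q->buf[q->tail++ % 1024] = v; }
    static int  fifo_get(fifo *q)        { return q->buf[q->head++ % 1024]; }

    /* Fetch unit: runs ahead, computing addresses and queuing operand values. */
    void fetch_unit(const int *A, int N, int (*f)(int), fifo *q)
    {
        for (int i = 0; i <= N; i++)
            fifo_put(q, A[f(i)]);        /* may trigger a remote page copy */
    }

    /* Execute unit: started after a delay covering the worst-case remote
     * latency, it consumes queued operands to form the final value of "sum". */
    int execute_unit(int N, fifo *q)
    {
        int sum = 0;
        for (int i = 0; i <= N; i++)
            sum = sum + fifo_get(q);
        return sum;
    }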
Fig. 2 shows the sequential fetch and execute cycles which would occur on a single processor where all data were local. Fig. 3 shows a demand copying VSM system where, assuming some locality of access to copied pages, the execution time would be extended by idle periods equal to the time L that it takes to copy a page from a remote processor. Fig. 4 shows the effect of decoupling, assuming that the total fetch time is greater than the latency. In these circumstances, the performance is improved over the serial case due to the pipeline overlap of the fetch and execute phases (assuming that they are not significantly extended by being separated). This pipelining effect is a bonus; the major achievement of the decoupling is that it has eliminated the idle periods due to the copying latency. Remembering that the example is intended to represent a single thread of a parallel program, such performance would be very acceptable.
Although the example may look over-simple, it should be noted that there are real computations which are amenable to such techniques. For example, a simple matrix multiply which uses two input matrices to produce a third (i.e. does not use in-place update) could be executed with very high efficiency.
However, many real computations are not sufficiently simple that the fetch and execute can be separated totally; this requires that both the data addresses and the outcome of any control transfers rely only on local values. If a computation were to use global values either to calculate the address of further values or in the evaluation of a conditional control transfer, a long delay, needed to cope with the maximum latency, would arise around the fetch/execute loop. In many cases, due to locality, such a delay would be unnecessary, and it would be found that the performance was severely restricted.
When attempting to apply the principle of decoupling to VSM, the major difference between communications which take place between a processor and its local "main" memory and those which require the use of communication across a network to another processor's memory must be recognised. The first have a short, fixed, predictable latency and the total time required to communicate multiple values is almost directly proportional to their number. The second are longer and, furthermore, the latency may not be either predictable or uniform depending on the characteristics of the communication system and network. In general, because of a significant communication "setup" time, it is much more efficient to fetch a number of values at a single time rather than fetch them individually. Indeed, one of the major reasons for the use of VSM is to exploit locality so that the cost of communication can be amortized over multiple accesses. Thus, access latencies to non-local values may vary from a few machine cycles to as many as a thousand or more machine cycles, and the latency will not be known until the access is made. Even then, if it is a remote access and the characteristics of the communication system are non-uniform, it will not be possible to do more than predict an upper bound.
Thus it is clear that the application of conventional decoupling techniques to a VSM system would not provide acceptable results except in limited circumstances.
It is an object of the present invention to obviate or mitigate the problems outlined above.
According to the present invention, there is provided a method for operating a parallel computing system comprising first and second processors, a first memory for which the communication access latency is relatively short, and a second memory for which the communication access latency is relatively long, the system running a program which requires access to both of the first and second memories, wherein the first processor runs at least part of the program to identify data required from the second memory to run the program, the first processor fetches the identified data from the second memory, the fetched data is stored in the first memory but not operated upon by the first processor, and the second processor runs the program using the data fetched by the first processor and stored in the first memory, the operation of the second processor being delayed relative to that of the first processor to enable the first processor to pre-fetch the required data from the second memory.
Both processors may run the same code, the operation of the second processor being delayed relative to that of the first by the maximum latency of the second memory. One way of achieving this would be to form a queue of instructions to be performed by the second processor in a FIFO, instructions being added to the queue when processed by the first processor and removed from the queue when performed by the second processor.
In an alternative arrangement where the program to be run comprises a loop, part only of which is concerned with fetch operations, the code run by the first processor may be the same as that run by the second processor except for the omission of all but that part of the loop concerned with fetch operations. In order to control the relative rates of operation of the first and second processors, a loop count may be maintained, the loop count being incremented by the first processor each time the first processor completes a fetch cycle of a value needed by the second processor, and decremented by the second processor each time an execute cycle is completed. The first processor may be caused to miss instruction execution cycles if the loop count exceeds a value equivalent to the maximum fetch latency of the second memory.
In an alternative arrangement, synchronisation check points may be inserted in the program code, the first processor being prevented from proceeding beyond a predetermined point in the code defined by a check point until authorised to proceed by the second processor.
In a further alternative arrangement, the first processor may hold a relative loop length variable which is loaded before a complete fetch sequence, the values of the variable being equal to or less than the ratio of the fetch/execute instruction sequence length to the fetch sequence length, and the first processor being prevented from proceeding if the value of this variable exceeds a predetermined value.
The present invention also provides an apparatus for performing the above method.
Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:-
Fig. 1 is a simplified illustration of a decoupled VSM structure comprising separate fetch and execution processing units;
Figs. 2, 3 and 4 illustrate execute patterns which might occur with the system of Fig. 1 under various conditions;
Fig. 5 is a schematic illustration of an embodiment of the present invention;
Fig. 6 illustrates the operation of a VSM system in accordance with the present invention, assuming some page locality; and
Fig. 7 illustrates the operation of a modified embodiment of the present invention.

Referring to Fig. 5, this illustrates an embodiment of the present invention. A first processor 1 (processor A) accesses a local first memory 2 through a memory management unit 3 (MMU A). A second processor 4 (processor B) accesses the local memory 2 through a memory management unit 5 (MMU B). The MMUs communicate with a second memory, which is a shared global memory (not shown), through a communications processor 6.
Thus a dual processor structure is provided, but rather than relying upon a simple separation of the fetch and execute operations between the processors, the processors are both capable of performing the full range of fetch and execute operations. The processors run with the operation of the second processor lagging the operation of the first. In general, the leading processor (processor A) will be performing the fetch function, while the lagging processor (processor B) performs program execution. However, there is a wide range of possibilities depending on how the code is compiled and divided between the two processors. There are several options for achieving the necessary phasing between the processors.
Each of processors A and B is allocated its own stack space in virtual memory to hold local variables. It is assumed that the computation will consist of parallel "threads" and the local variables belonging to a particular thread will never be accessed from outside one processor.
However, global variables are common to both processors A & B and can be accessed from anywhere. MMU A has the following properties :-
A read access to a local variable will always succeed and return a value. A write access to a local variable will always succeed and modify the value.
A read access to a global variable may succeed or result in a page fault. In either case, the computation continues in processor A, but in the case of a page fault, the Communications Processor is notified that the page is needed. It is assumed that the communications processor will have a record of pages currently being fetched and will initiate a fetch if necessary. The value returned by this access will be a special null. A write access to a global variable will not be performed by MMU A in any circumstances, but a request will be issued for a copy of that page and the computation will proceed.
MMU B performs normal address translation with all reads and writes completed.
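The behaviour ascribed to MMU A above might be sketched as follows, reusing the vsm_translate sketch given earlier; the encoding of the "special null" and the interface to the communications processor are assumptions made for illustration only:

    #include <stdint.h>
    #include <stddef.h>

    /* From the earlier sketch */
    typedef struct { void *local_copy; } page_entry;
    void *vsm_translate(page_entry *table, uintptr_t vaddr,
                        void (*request_page)(uintptr_t page_no));

    /* Assumed encoding of the special null returned on a global-read miss. */
    #define SPECIAL_NULL 0

    /* Global reads never block processor A: a miss requests the page and
     * returns the special null so that A's computation can continue. */
    int mmu_a_read_global(page_entry *table, uintptr_t vaddr,
                          void (*request_page)(uintptr_t page_no))
    {
        int *p = vsm_translate(table, vaddr, request_page);
        if (p == NULL)
            return SPECIAL_NULL;   /* page fault: fetch requested, A carries on */
        return *p;
    }

    /* Global writes are never performed by MMU A, but a copy of the page
     * is still requested so that it will be local when B needs it. */
    void mmu_a_write_global(page_entry *table, uintptr_t vaddr,
                            void (*request_page)(uintptr_t page_no))
    {
        (void)vsm_translate(table, vaddr, request_page);
    }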
Processor A calculates array addresses and causes the fetching of pages of global virtual memory into local memory by trying to read the array. However, because a queue of operands has not been formed, the code which runs on processor B must repeat the address calculations before accessing memory. As long as a page replacement strategy is operated which does not reject pages which have recently been accessed from processor A, all global variables required by the computation will be available for it to complete normally in processor B. The functionality defined for the MMU also allows the correct result to be achieved by running the complete program, phased appropriately, on both processors. If the elements of the array were not local when accessed by processor A, spurious values of A's local copy of "sum" will be formed, but this is of no consequence as the correct values will be locally available to processor B when required.
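A sketch of such a pre-fetching pass on processor A is given below; again this is an illustration only, reusing the MMU A sketch above, and the array base address and index scaling are assumptions:

    #include <stdint.h>

    /* From the MMU A sketch above */
    typedef struct { void *local_copy; } page_entry;
    int mmu_a_read_global(page_entry *table, uintptr_t vaddr,
                          void (*request_page)(uintptr_t page_no));

    /* Processor A runs the same summation loop, but through MMU A: misses
     * return the special null and never block, so A's local "sum" may be
     * spurious, yet every page of the array that B will need has now been
     * requested. */
    int processor_a_prefetch_pass(page_entry *table, uintptr_t array_base,
                                  int N, int (*f)(int),
                                  void (*request_page)(uintptr_t page_no))
    {
        int sum = 0;                              /* local to A, never read by B */
        for (int i = 0; i <= N; i++)
            sum += mmu_a_read_global(table,
                                     array_base + (uintptr_t)f(i) * sizeof(int),
                                     request_page);
        return sum;                               /* value is of no consequence */
    }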
The approach means running the complete code twice, thereby losing the performance advantages of pipelining the fetch and execute operations. One benefit, though, is that such a structure can be built using "off the shelf" integrated CPU components rather than more specialised custom circuits which might be required for dedicated fetch/execute hardware. In addition, acceptable performance can be achieved without specialised compiler technology because of the ability to run the same code, although higher performance might result with more intelligent separation of fetch and execute functions.
Further benefits can be achieved in accordance with the invention if more complex approaches are adopted whilst still relying upon one processor to be primarily concerned with fetch operations.
Traditional fetch/execute decoupling copes only with cases similar to those described above where the phases can be separated fully. This is because the approach is normally used to hide a small fixed latency and, if the value being fetched is required to progress the next fetch computation, the process must be suspended for that time before proceeding. In decoupled VSM, the situation is somewhat different. The major latency is that of a remote fetch of a copy of memory and, assuming use of a page based VSM system, each fetch should result in the copying of a significant number of global values which, due to spatial locality, would be used in the future. This can be used to extend the capability of the decoupled system of the present invention. Traditional uses of decoupling usually only work well for array like data structures with relatively simple access patterns like those described above. An operation on a structure like a linked list, for example, is likely to perform very poorly because the fetching of each element of the list is totally dependent on the previous element. With appropriate compilation techniques, decoupled VSM in accordance with the invention can hide remote access latencies in many of these more complex cases.
Take a linked list composed of elements described by the following C like definitions:-

    typedef struct cell {
        int value;
        struct cell *next;
    } cell;

    cell *list;

And a computation of the form:-

    cell *ptr;
    int sum = 0;
    ptr = list;
    while (ptr != NULL) {
        sum = sum + ptr->value;
        ptr = ptr->next;
    }

This is the list based equivalent of the array example, simply traversing the list and summing the values. The major problem with this program is that each data item fetch uses an address which is contained in the previous element of the list and is likely to cause long pauses in the fetching process if a remote access is required. In practice, in a paged demand copying VSM system, a significant number of cells would be grouped on to pages (how well this happens depends critically on the memory management strategy in a practical system) and only a fraction of the accesses would result in a "miss". However, they would still have a significant impact on performance. Fig. 6 indicates the form of the execution which would result.
The access to "value" will cause a suspension if the cell pointed to is not local. When the value is available, either following a suspension or due to locality, the sum will be followed by the assignment and test of the next pointer.
The nature of the program is such that these suspensions in the fetch phase are inevitable. However, given adequate locality, decoupling techniques can be used to minimise the effect of them on the progress of the overall computation. It is possible to identify that part of the loop which is concerned with the fetch and separate it simply by omitting the statement in the body of the loop concerned with the addition:-

    while (ptr != NULL)
        ptr = ptr->next;
If this runs on processor A of the structure of Fig. 5, together with the complete program on processor B, the execution pattern shown in Fig. 7 can be achieved with correct phasing.
The ability of processor B to progress unhindered is determined by the latency L, the fetch time F (= f above), the total loop execution time E (= f & e above) and the number N of consecutive elements of the structure which are either grouped on a single page or found locally on other pages already fetched. As long as

    N * E > L + N * F,   i.e.   N > L / (E - F),

then the elements will always be available in processor B's store when required. In order to see whether this is realistic, consider some typical values. Compiling the loop for a SPARC, the result is F = 9 instructions and E = 14. Assuming an instruction execution time of 25 ns and a latency of 5 µs requires N > 40. Given that, for more complex loops, the value of E - F would almost certainly be significantly greater than this, the technique is promising.
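As a quick check of this arithmetic, using the figures quoted above as assumptions:

    #include <stdio.h>

    int main(void)
    {
        double cycle_ns   = 25.0;          /* instruction execution time     */
        double F          = 9  * cycle_ns; /* fetch-only loop time, ns       */
        double E          = 14 * cycle_ns; /* full fetch/execute loop, ns    */
        double latency_ns = 5000.0;        /* remote copy latency L, 5 us    */

        /* N * E > L + N * F  =>  N > L / (E - F) */
        printf("N must exceed %.0f\n", latency_ns / (E - F));  /* prints 40 */
        return 0;
    }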
The list is a simple example of a linked structure which represents the other major large data structuring technique (apart from arrays) that will probably be encountered in real computation. In practice, in parallel systems, it is likely that linked tree/graph structures will be important with parallel sections of a program operating on sub-trees. It should be clear that, as long as the trees are built in a way that ensures some locality, similar principles to those used above will apply.
A further issue which needs discussion is the way in which data structures, of all types, are used. The examples considered above have the common property that the size of the computation is fixed: the array example goes around the loop N times and the list example follows the list to its end. The important feature is, in fact, that the test which terminates the loop is a function of values that are certain to be local when the test is made, and therefore the "fetch" can be separated into a smaller loop which can be allowed to "run ahead", fetching values which are known to be required. There are computations, however, where the requirement for future parts of a structure depends on the value of those currently being processed. This happens with all sorts of data structures including arrays, typically where the computation involves searching the structure for a particular value contained in it.
To keep it simple, consider searching a list with a loop of the form:-

    while ((ptr != NULL) && (!found)) {
        found = (ptr->value == 42);
        ptr = ptr->next;
    }

Here it is not possible to identify a smaller loop which can be separated from the main computation to pre-fetch values because all parts of the computation are required in the loop test.
The solution to this problem is to omit that part of the test which depends on the loop body from the fetch computation, reducing it to be identical to the simpler loop case, i.e.:-

    while (ptr != NULL)
        ptr = ptr->next;
Because of the nature of the dual processor structure and the MMU, the effect of running this loop in processor A will be to cause those values which might be used in the computation to be fetched, avoiding any suspensions in processor B, but without affecting the result of the final computation. Clearly there is the danger of doing wasted work, and mechanisms will be required to prevent A running too far ahead.
The two processors A and B run code with a delay between them. To produce a proposal for a practical system, it is necessary to cater for cases where processor A suspends because it needs the value of a non-local variable and, although it is not shown as happening in the above examples (because it will be avoided in most cases), for the case of processor B suspending if there has been a delay in a global value arriving. In both these cases, the need to suspend will be signalled by the MMU.
To ensure the lack of suspensions, it is necessary to keep processor B delayed by the correct amount. However, particularly in more complex cases where there may be speculative fetching, processor A must not get too far ahead, filling the real store with copies too far in advance.
One possible technique is to use a FIFO queue, except that it would be a queue of instructions between the two processors rather than a queue of data items as in the above example shown in Fig. 1. The length of the queue would represent the delay between the processors so that the appropriate processor could be stalled if the queue either became too long or too short. This approach would work well for simple cases where both processors are running identical code but significant additional complexity would be needed to cope with the more general case.
The solution is to provide each processor with its own Program Counter such that the processor fetches instructions as normal from memory. Either processor may have its progress of instruction execution halted by a miss signalled by the MMU. In addition, processor A may be stalled if its execution gets too far ahead of processor B. There are a number of ways in which this could be implemented:-
A "loop count" could be maintained, incremented by processor A each time a fetch cycle of a value known to be needed by processor B is completed and decremented by processor B when the execute cycle is completed. If this count exceeded a value equivalent to the maximum fetch latency then processor A could be caused to miss instruction execution cycles.
A similar alternative would be to include synchronisation "check points" at places through the code but not necessarily on each loop. Processor A would not be permitted to proceed beyond a certain point until "unlocked" by B.
Processor A could hold a relative loop length variable, loaded before a complete fetch sequence was entered. This value should be equal to (or just less than) the ratio of the fetch/execute instruction sequence length to the fetch sequence length. A count would be incremented by this amount for each instruction executed in A and decremented by one for each instruction executed in B. The arithmetic required would notionally be fractional but could, of course, be implemented as a simple binary operation. Again, processor A would be stalled if this count exceeded a certain value. This would have the benefit that it would not add significantly to the instruction count of the code, at the expense of a small hardware cost and some compilation effort to determine the magnitude of the ratio and where it should be loaded. Any of the above solutions results in similar behaviour, except that the extent to which processor A is allowed to "run ahead" may vary. By removing the explicit delay between the two processors, the potential for efficient execution in the absence of MMU induced suspensions is increased. Processor A will start executing code to ensure that required global values are made local. Processor B will start at the same time and only suspend if its MMU signals a "miss". It will be restarted by the arrival of the page of information requested by processor A. In this way the delay will only be introduced when necessary, resulting in the minimum total execution time.
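The relative loop length scheme described above might be sketched as follows, with the fractional ratio held in fixed point to show how it could be implemented as a simple binary operation; the scale factor and stall threshold are assumptions:

    #include <stdint.h>

    #define FIX_ONE (1 << 16)            /* 16.16 fixed-point scale, assumed */

    static int32_t lead_count;           /* A's lead over B, in fixed point          */
    static int32_t ratio_fix;            /* (fetch/execute length)/(fetch length)    */
    static int32_t max_lead_fix;         /* stall threshold, chosen at compile time  */

    /* Loaded by the compiler before a complete fetch sequence is entered. */
    void load_relative_loop_length(double ratio)
    {
        ratio_fix = (int32_t)(ratio * FIX_ONE);
    }

    void processor_a_instruction(void)   /* per instruction executed in A */
    {
        lead_count += ratio_fix;
        if (lead_count > max_lead_fix) {
            /* processor A is stalled here until B has caught up */
        }
    }

    void processor_b_instruction(void)   /* per instruction executed in B */
    {
        lead_count -= FIX_ONE;           /* "decremented by one" in fixed point */
    }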
Thus the invention deals with the problems of "page misses" in a paged distributed virtual shared memory system by using a dual processor configuration where one processor does a partial evaluation in an attempt to predict subsequent memory needs. It is believed that the technique can be implemented using standard processor components; the main functionality is provided by the memory management system.
Simple programs where the progress of global data accessing is not itself dependent on other values can be executed efficiently using exactly the same code in the dual processors. In this case no special compiler technology is needed. More complex data access patterns will need some separation of fetch and execute phases during compilation.
Successful pre-fetching is thought to be crucial to the success of VSM, but it is not the only problem. All distributed memory systems have problems with shared variables if copies are taken and updates take place which render the copies out of date. The present invention neither helps nor hinders these issues of coherence, many of which may be ameliorated by higher level approaches to programming.

Claims

A method for operating a parallel computing system comprising first and second processors, a first memory for which the communication access latency is relatively short, and a second memory for which the communication access latency is relatively long, the system running a program which requires access to both of the first and second memories, wherein the first processor runs at least part of the program to identify data required from the second memory to run the program, the first processor fetches the identified data from the second memory, the fetched data is stored in the first memory but not operated upon by the first processor, and the second processor runs the program using the data fetched by the first processor and stored in the first memory, the operation of the second processor being delayed relative to that of the first processor to enable the first processor to pre-fetch the required data from the second memory.
A method according to claim 1, wherein both processors run the same code, the operation of the second processor being delayed relative to that of the first by the maximum latency of the second memory.
A method according to claim 1 or 2, wherein a queue of instructions to be performed by the second processor is formed in a FIFO, instructions being added to the queue when processed by the first processor and removed from the queue when performed by the second processor.
A method according to claim 1, wherein the program to be run comprises a loop, part only of which is concerned with fetch operations, and the code run by the first processor is the same as that run by the second processor except for the omission of all but that part of the loop concerned with fetch operations.
5. A method according to claim 4, wherein a loop count is maintained, the loop count being incremented by the first processor each time the first processor completes a fetch cycle of a value needed by the second processor, and decremented by the second processor each time an execute cycle is completed, and the first processor is caused to miss instruction execution cycles if the loop count exceeds a value equivalent to the maximum fetch latency of the second memory.
6. A method according to claim 4, wherein synchronisation check points are inserted in the program code, the first processor being prevented from proceeding beyond a predetermined point in the code defined by a check point until authorised to proceed by the second processor.
7. A method according to claim 4, wherein the first processor holds a relative loop length variable which is loaded before a complete fetch sequence, the value of the variable being equal to or less than the ratio of the fetch/execute instruction sequence length to the fetch sequence length, and the first processor being prevented from proceeding if the value of this variable exceeds a predetermined value.
8. An apparatus for performing a method in accordance with any preceding claim.
9. A method for operating a parallel computer system substantially as hereinbefore described with reference to Figs. 5, 6 and 7 of the accompanying drawings.
PCT/GB1995/003001 1994-12-23 1995-12-21 Dual processor decoupling Ceased WO1996020440A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU42704/96A AU4270496A (en) 1994-12-23 1995-12-21 Dual processor decoupling

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9426155.9 1994-12-23
GBGB9426155.9A GB9426155D0 (en) 1994-12-23 1994-12-23 Dual processor decoupling

Publications (1)

Publication Number Publication Date
WO1996020440A1 true WO1996020440A1 (en) 1996-07-04

Family

ID=10766542

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1995/003001 Ceased WO1996020440A1 (en) 1994-12-23 1995-12-21 Dual processor decoupling

Country Status (3)

Country Link
AU (1) AU4270496A (en)
GB (1) GB9426155D0 (en)
WO (1) WO1996020440A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001052061A3 (en) * 2000-01-14 2002-01-10 Sun Microsystems Inc Method and apparatus for using an assist processor to pre-fetch data values for a primary processor
WO2002057909A3 (en) * 2001-01-16 2003-08-07 Sun Microsystems Inc Value speculation on an assist processor to facilitate prefetching for a primary processor
US6681318B2 (en) 2000-09-08 2004-01-20 Sun Microsystems, Inc. Method and apparatus for using an assist processor to prefetch instructions for a primary processor
US6772321B2 (en) 2000-05-04 2004-08-03 Sun Microsystems, Inc. Method and apparatus for using an assist processor and value speculation to facilitate prefetching for a primary processor
WO2004068339A3 (en) * 2003-01-28 2005-12-22 Sun Microsystems Inc Multithreaded processor with recoupled data and instruction prefetch
CN102298507A (en) * 2010-06-24 2011-12-28 宇瞻科技股份有限公司 Storage device with virtual optical disk device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHEN T -F ET AL: "A PERFORMANCE STUDY OF SOFTWARE AND HARDWARE DATA PREFETCHING SCHEMES*", PROCEEDINGS OF THE ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, CHICAGO, APRIL 18 - 21, 1994, no. SYMP. 21, 18 April 1994 (1994-04-18), INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, pages 223 - 232, XP000480438 *
F. E. SAKALAY: "Storage Hierarchy Control System", IBM TECHNICAL DISCLOSURE BULLETIN, vol. 15, no. 4, September 1992 (1992-09-01), NEW YORK US, pages 1100 - 1101, XP002002415 *
G. L. REIJNS ET AL: "Application of Virtual Memory Technique for Coupled Microprocessors", FOURTH EUROMICRO SYMPOSIUM ON MICROPROCESSING AND MICROPROGRAMMING, 17 October 1978 (1978-10-17) - 19 October 1978 (1978-10-19), MUNICH, pages 130 - 139, XP002002416 *
J. E. SMITH, T. J. KAMINSKI: "Varieties of Decoupled Access/Execute Computer Architectures", TWENTIETH ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING, 6 October 1982 (1982-10-06) - 8 October 1982 (1982-10-08), MONTICELLO, ILLINOIS, pages 577 - 586, XP000212152 *
KURIAN L ET AL: "MEMORY LATENCY EFFECTS IN DECOUPLED ARCHITECTURES", IEEE TRANSACTIONS ON COMPUTERS, vol. 43, no. 10, 1 October 1994 (1994-10-01), pages 1129 - 1139, XP000474484 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001052061A3 (en) * 2000-01-14 2002-01-10 Sun Microsystems Inc Method and apparatus for using an assist processor to pre-fetch data values for a primary processor
US6415356B1 (en) 2000-01-14 2002-07-02 Sun Microsystems, Inc. Method and apparatus for using an assist processor to pre-fetch data values for a primary processor
KR100800941B1 (en) * 2000-01-14 2008-02-04 썬 마이크로시스템즈, 인코포레이티드 Method and apparatus for using coprocessor to prefetch data values for main processor
JP4766816B2 (en) * 2000-01-14 2011-09-07 オラクル・アメリカ・インコーポレイテッド Method and apparatus for prefetching primary processor data values using an assist processor
US6772321B2 (en) 2000-05-04 2004-08-03 Sun Microsystems, Inc. Method and apparatus for using an assist processor and value speculation to facilitate prefetching for a primary processor
US6681318B2 (en) 2000-09-08 2004-01-20 Sun Microsystems, Inc. Method and apparatus for using an assist processor to prefetch instructions for a primary processor
WO2002057909A3 (en) * 2001-01-16 2003-08-07 Sun Microsystems Inc Value speculation on an assist processor to facilitate prefetching for a primary processor
WO2004068339A3 (en) * 2003-01-28 2005-12-22 Sun Microsystems Inc Multithreaded processor with recoupled data and instruction prefetch
US7502910B2 (en) 2003-01-28 2009-03-10 Sun Microsystems, Inc. Sideband scout thread processor for reducing latency associated with a main processor
CN102298507A (en) * 2010-06-24 2011-12-28 宇瞻科技股份有限公司 Storage device with virtual optical disk device

Also Published As

Publication number Publication date
GB9426155D0 (en) 1995-02-22
AU4270496A (en) 1996-07-19

Similar Documents

Publication Publication Date Title
Steffan et al. A scalable approach to thread-level speculation
Dubois et al. Memory access dependencies in shared-memory multiprocessors
Mowry et al. Automatic compiler-inserted I/O prefetching for out-of-core applications
Etsion et al. Task superscalar: An out-of-order task pipeline
Anderson et al. Performance of the Cray T3E multiprocessor
Bitar et al. Multiprocessor cache synchronization: Issues, innovations, evolution
US5835950A (en) Self-invalidation method for reducing coherence overheads in a bus-based shared-memory multiprocessor apparatus
WO1993011493A1 (en) A latency tolerant risk-based multiple processor apparatus and method for supporting a locality based programming model
US5896517A (en) High performance processor employing background memory move mechanism
CN114667508A (en) Method and system for retrieving data for accelerator
Krishnan et al. The need for fast communication in hardware-based speculative chip multiprocessors
Kavi et al. Design of cache memories for multi-threaded dataflow architecture
Blake et al. Bloom filter guided transaction scheduling
Bousias et al. Instruction level parallelism through microthreading—a scalable approach to chip multiprocessors
VanderWiel et al. A survey of data prefetching techniques
Kim et al. Leveraging cache coherence in active memory systems
Scheurich et al. Lockup-free caches in high-performance multiprocessors
WO1996020440A1 (en) Dual processor decoupling
Torrellas et al. The Illinois aggressive coma multiprocessor project (I-ACOMA)
Muller et al. Hiding miss latencies with multithreading on the Data Diffusion Machine
Watson et al. Decoupled pre-fetching for distributed shared memory
Berrached et al. A decoupled access/execute architecture for efficient access of structured data
Bakshi et al. Memory latency: to tolerate or to reduce
Amamiya et al. Datarol: a parallel machine architecture for fine-grain multithreading
Vianès et al. A case for second-level software cache coherency on many-core accelerators

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU BB BG BR BY CA CH CN CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG US UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE LS MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: CA