
WO2001077818A2 - Method for predicting the instruction execution latency of a de-coupled configurable co-processor - Google Patents


Info

Publication number
WO2001077818A2
WO2001077818A2 (PCT/US2001/010687)
Authority
WO
WIPO (PCT)
Prior art keywords
coprocessor
cpu
runtime
execution
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2001/010687
Other languages
French (fr)
Other versions
WO2001077818A3 (en)
Inventor
Muhammad Afsar
Stash Czaja
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Infineon Technologies North America Corp
Original Assignee
Infineon Technologies North America Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infineon Technologies North America Corp filed Critical Infineon Technologies North America Corp
Publication of WO2001077818A2 publication Critical patent/WO2001077818A2/en
Publication of WO2001077818A3 publication Critical patent/WO2001077818A3/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3877: Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units



Abstract

A method and an apparatus for predicting the execution latency of a coprocessor are disclosed. As a method, a central processing unit (CPU) fetches an instruction to be executed by a de-coupled flexible coprocessor (FCOP). The instruction is decoded by the CPU into an opcode (command) and corresponding data, which are then passed to the FCOP for execution during coprocessor runtime. Since the CPU has the capability of predicting the corresponding coprocessor runtime, the CPU continues to execute other instructions concurrently with the FCOP executing the FCOP instruction. In this way, the CPU does not suspend operation during coprocessor runtime.

Description

TECHNIQUES FOR PREDICTING THE EXECUTION LATENCY OF A DE-COUPLED FLEXIBLE CO-PROCESSOR
FIELD OF THE INVENTION:
The present invention pertains to computing systems and the like. More specifically, the present invention relates to reducing the execution latency in a
microprocessor.
BACKGROUND OF THE INVENTION:
In most communications systems, a special purpose microprocessor, such as a
digital signal processor (DSP), executes specific tasks or algorithms. However, in order
to perform very specialized signal processing functions (such as convolutional or
concatenated code decoding), a specialized function unit referred to as a
coprocessor is used. As is well known in the art, a coprocessor is any computer
processor which assists the main processor (the "CPU") by performing certain special functions, usually much faster than the main processor could perform them in
software. Typically, the coprocessor acts as a "slave" device performing the execution
of specific (however, relatively infrequent) instructions unsuitable, or inefficient,
for the main processor. In a conventionally architectured computing system 100
shown in Fig. 1, a main CPU 102 receives an instruction 104 from a memory device
105 at a fetch/decoder unit 107. The fetch/decoder unit 107 then decodes the
instruction 104 into an opcode 106 that identifies the particular operation to be performed on a data field 108.
Typically, the decoded instruction (in the form of the opcode field 106 and the
data field 108) is stored in a general purpose register (GPR) 110. In some cases, the opcode 106 indicates that a particular specialized operation is to be performed by a coprocessor 112 coupled to the CPU 102. In these cases, based upon the opcode 106,
the CPU 102 sends the opcode 106 and the data field 108 to the coprocessor 112, which commences executing the instruction 104 during what is referred to as coprocessor runtime.
Unfortunately, since the coprocessor runtime dynamically changes (i.e., is
unpredictable), in a closely coupled system such as the system 100, the CPU 102 must suspend execution (referred to as CPU latency) until such time as the
coprocessor 112 has returned a result data field 114 to the GPR 110. It is only when the coprocessor 112 has returned the result data field 114 that the CPU 102 can
resume executing any others of the instructions fetched from the memory device 105.
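To make the cost concrete, the closely coupled behavior of the system 100 can be modeled with the following minimal C sketch. All names here are illustrative stand-ins, not taken from the patent, and coprocessor_start is a software dummy for what would be a memory-mapped hardware dispatch:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t opcode;    /* opcode field 106                    */
        uint32_t data;      /* data field 108                      */
        uint32_t result;    /* result data field 114               */
        volatile bool done; /* set when the result reaches the GPR */
    } gpr_slot_t;

    /* Software stand-in for coprocessor 112; a real system would issue a
     * memory-mapped command with an unpredictable completion time. */
    static void coprocessor_start(gpr_slot_t *slot)
    {
        slot->result = slot->opcode ^ slot->data; /* dummy "special" op */
        slot->done = true;
    }

    uint32_t dispatch_blocking(gpr_slot_t *slot)
    {
        slot->done = false;
        coprocessor_start(slot); /* coprocessor runtime begins          */
        while (!slot->done)      /* CPU latency: execution is suspended */
            ;                    /* until the result data field returns */
        return slot->result;
    }

Because the coprocessor runtime is unpredictable, the spin wait is the only safe policy in such a closely coupled system; this is precisely the latency the invention sets out to remove.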
However, even with the inability of the CPU 102 to predict the coprocessor
runtime and the resulting need to suspend execution, the system 100 is reasonably
well suited for most applications requiring substantially static bandwidth allocation (i.e.,
where the CPU latency does not substantially affect system performance). However, systems
(such as wireless communications systems, for example) that experience dynamic
changes in various system parameters require what is referred to in the art as dynamic
bandwidth allocation (due to changes in data rate, for example). In wireless
communications systems, such dynamic changes are due in part to the dynamic nature
of the associated wireless communications channels and the need for continually updated
capacity management which, in turn, is related to subscriber mobility within (and
outside of) a particular subscriber's grid. It is for this dynamic bandwidth
allocation that the conventionally architectured computing system 100, with its dynamic (i.e., unpredictable) coprocessor latency, is particularly unsuited,
resulting in substantial system performance degradation.
In view of the foregoing, a computing system that includes a flexible
coprocessor having predictable execution latency would be desirable.
SUMMARY OF THE INVENTION
An improved system for enhancing the performance of a computing system
having a microprocessor and a coprocessor is described. More specifically, the
system is arranged to provide a flexible, application-dependent coprocessor
having predictable execution latency so as to permit the concurrent execution of the
CPU and the coprocessor.
In one embodiment of the invention, a method for predicting an execution
latency of a coprocessor by a central processing unit (CPU) is disclosed. In the described
embodiment, the CPU is coupled to the coprocessor and is arranged to perform
executable instructions that form a program, whereas the coprocessor is arranged to
execute selected ones of the executable instructions. As a method, a received
instruction is decoded by the CPU into a command portion and an associated data
portion. If the command portion indicates that the corresponding instruction is to be
executed by the coprocessor, then the command portion and the data portion are
passed off to the coprocessor. The coprocessor then issues a runtime start status flag
indicating that the coprocessor has begun to execute the passed instruction. The CPU then uses the issued runtime start status flag to predict a coprocessor runtime latency
which, in turn, enables the CPU to concurrently execute others of the executable
instructions with the coprocessor.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference
numerals refer to similar elements and in which:
Fig. 1 illustrates a conventionally architectured computing system.
Fig. 2A illustrates a computing system having a CPU and an associated
coprocessor in accordance with an embodiment of the invention.
Fig. 2B illustrates a timing diagram for a multi-threaded computing system
implementation of the computing system of Fig. 2A.
Fig. 3 illustrates a flowchart detailing a process whereby a CPU passes off a
command and associated data to an associated coprocessor in accordance with an
embodiment of the invention.
Fig. 4 is a computing system suitably arranged for implementing the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the following detailed description of the present invention, numerous
specific embodiments are set forth in order to provide a thorough understanding of the
invention. However, as will be apparent to those skilled in the art, the present
invention may be practiced without these specific details or by using alternate
elements or processes. In other instances, well-known processes, procedures, components, and circuits have not been described in detail so as not to unnecessarily
obscure aspects of the present invention.
Referring initially to Fig. 2A, an illustration of a computing system 200 in accordance with an embodiment of the invention is shown. The computing system
200 includes a memory 202 connected to a CPU 204 by way of a memory bus 206. The CPU 204, in turn, includes a fetch/decoder unit 208 also connected to the memory bus 206. As is well known in the art, the fetch/decoder unit 208 provides for
retrieving a selected instruction from the memory 202 at the direction of the CPU 204.
Once retrieved, the fetch/decoder unit 208 decodes the fetched instruction into the
opcode 106 and the associated data 108. In the described embodiment, the CPU 204
includes a fetch unit cache memory 210, also referred to as a special function register
(SFR) 210, suitable for storing the opcode 106 and the data 108 in a command register
209 and a data register 211, respectively. In the described embodiment, in order to
execute selected instructions by a coprocessor 212 that is coupled to the CPU 204, an
interface unit 214 is arranged to mediate the flow of information, such as commands
and data, between the coprocessor 212 and the CPU 204.
In the described embodiment, the coprocessor 212 includes a command queue
216 suitably arranged to receive and store commands that can take the form of the
opcode 106. The coprocessor 212 also includes a status queue 218 coupled to the
CPU 204 and an execution block 220. In a preferred embodiment, the status queue
218 is arranged to store a variety of status flags provided by the execution block 220
before, during, and after coprocessor runtime. The various status flags include, but
are not limited to, a start status flag indicating the start of coprocessor runtime, a
coprocessor latency, and an end status flag indicating the end of coprocessor runtime. It
is these flags that are used by the CPU 204 to predict the coprocessor runtime
associated with a particular opcode (command) in such a way that the CPU 204 can continue to execute incoming instructions without resorting to suspending execution
of instructions from the memory 202.
Substantially simultaneously with the passing of the command to the
command queue 216, the SFR 210 passes the corresponding data field 108 to the
interface 214 which, in turn, passes it to a data queue 222 which in some cases is bi-
directionally coupled to the execution block 220. At the beginning of the coprocessor
runtime associated with the opcode stored in the command queue 216, the execution block 220 fetches the appropriate data stored in the data queue 222. At the end of
coprocessor runtime, the result data field 114 is returned to the data queue 222 where
it is then made available to the CPU 204 by being stored in a result register 213
included in the SFR 210.
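As a rough software model of the queue organization just described (a sketch only: the patent describes hardware queues, and the queue depth, field widths, and names below are assumptions made for illustration):

    #include <stdint.h>

    enum cop_status_flag {       /* flags carried by the status queue 218 */
        COP_RUNTIME_START,       /* start of coprocessor runtime          */
        COP_RUNTIME_STOP,        /* end of coprocessor runtime            */
        COP_LATENCY_REPORT       /* coprocessor latency indication        */
    };

    typedef struct {
        enum cop_status_flag flag;
        uint32_t latency_cycles; /* valid for COP_LATENCY_REPORT */
    } cop_status_t;

    #define QDEPTH 8             /* assumed queue depth */

    typedef struct {                    /* software model of coprocessor 212 */
        uint32_t     command_q[QDEPTH]; /* command queue 216: opcodes        */
        uint32_t     data_q[QDEPTH];    /* data queue 222: operands and the  */
                                        /* returned result data field 114    */
        cop_status_t status_q[QDEPTH];  /* status queue 218                  */
        unsigned     cmd_head, cmd_tail;
        unsigned     stat_head, stat_tail;
    } coprocessor_model_t;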
During an exemplary operation, an instruction 224 is fetched from the memory
202 and decoded by the fetch/decoder unit 208. The decoded instruction, in the form
of the opcode and data, is then passed to the SFR 210. In the described embodiment,
the opcode is stored in the opcode register 209 whereas the data is stored in
the data register 211, both of which are included in the SFR 210. Based upon the
opcode, the CPU 204 instructs the interface 214 to fetch the opcode 106 from the
opcode register 209 and to store it in the command queue 216. Substantially
simultaneously, the coprocessor 212 fetches the corresponding data 108 from the data
register 211 and stores it in the data queue 222 where it is made available to the
execution block 220 prior to the start of coprocessor runtime.
In the described embodiment, at the start of coprocessor runtime, the execution
block 220 sets a coprocessor start flag in the status queue 218 indicating to the CPU 204 that execution of the instruction 226 is commencing. In those cases where the CPU 204 has identified and learned the execution characteristics of a particular command,
the CPU 204 uses the various status flags to ascertain the corresponding coprocessor
execution (or runtime) latency. Since the CPU 204 can determine coprocessor
latency, it can concurrently execute additional instructions fetched from the memory
device 202 without resorting to suspending operations by, for example, invoking
interrupts. In this way, the CPU runtime efficiency is greatly improved. At the
conclusion of the coprocessor runtime, the execution block 220 sets a runtime stop flag in the status queue 218 and stores the result field 114 in the data queue 222.
Based in part upon receipt of the runtime stop flag, the CPU 204 retrieves the result
field 114 and processes it accordingly.
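One plausible way for the CPU 204 to "learn" a command's execution characteristics from the start and stop flags is a per-opcode latency table. The sketch below assumes latencies are measured as cycle counts and opcodes fit in one byte, neither of which the patent specifies:

    #include <stdint.h>

    #define MAX_OPCODES 256

    static uint32_t learned_latency[MAX_OPCODES]; /* cycles; 0 = not yet learned */

    /* On receipt of the runtime stop flag, record the observed
     * start-to-stop interval for this opcode. */
    void learn_latency(uint8_t opcode, uint32_t start_cycle, uint32_t stop_cycle)
    {
        learned_latency[opcode] = stop_cycle - start_cycle;
    }

    /* Predicted coprocessor runtime for an opcode, or 0 if the CPU has
     * not observed this command yet and must wait on the stop flag. */
    uint32_t predict_latency(uint8_t opcode)
    {
        return learned_latency[opcode];
    }

With such a table, a previously observed command lets the CPU keep issuing instructions for the predicted interval instead of suspending.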
Referring now to Fig. 2B, a timing diagram 250 for a multi-threaded
computing system in accordance with an embodiment of the invention is illustrated. It
should be noted that the timing diagram 250 is exemplary of any multi-threaded type
computing system having concurrency between two independent threads of execution.
As such, the timing diagram will be discussed with reference to the computing system
200 shown in Fig. 2A. As shown, at the start of the CPU 204 runtime, a CPU
execution thread 252 is instantiated at an initial time t = t0. At a subsequent time t =
t1, the coprocessor 212 begins coprocessor runtime by invoking a coprocessor
execution thread 254 substantially simultaneously with passing a runtime start status
flag. At a time t = t2, the coprocessor 212 completes coprocessor runtime by passing a
runtime end status flag. It should be noted that, since the CPU 204 received the
runtime start status flag along with the coprocessor execution latency, it was able to predict when the coprocessor runtime would complete and was thereby able to continue
execution of the CPU thread 252 without resorting to interrupts.
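Numerically (with made-up values for illustration only): if the runtime start flag is observed at cycle t1 and the reported or learned latency is L cycles, the CPU simply schedules retrieval of the result near t1 + L:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t t1 = 100;      /* cycle at which the runtime start flag arrives    */
        uint32_t latency = 400; /* latency reported by the coprocessor (assumed)    */
        uint32_t t2_pred = t1 + latency; /* predicted end of coprocessor runtime    */

        /* The CPU thread 252 keeps executing roughly 400 cycles of other
         * instructions and checks for the result data near cycle 500.    */
        printf("poll for the result data near cycle %u\n", t2_pred);
        return 0;
    }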
Fig. 3 is a flowchart detailing a process 300 for executing an instruction by a coprocessor in conjunction with a CPU in accordance with an embodiment of the
invention. The process 300 starts at 302 by the CPU receiving and decoding an instruction from, for example, a memory device coupled thereto. Once the instruction
has been received, the CPU decodes the fetched instruction into a command and a
data portion at 304 that are subsequently stored at 306. Next, at 308, the CPU sends
the command and data portions to an associated flexible coprocessor (FCOP) arranged
to carry out and execute the command portion of the fetched instruction. At 310, the
FCOP issues a start FCOP runtime status flag indicating that the FCOP starting to
process the data associated with the received command. At the same time FCOP
sends the latency to CPU. Next, at 312, based upon the issued start FCOP runtime
status flag and the latency the CPU predicts the FCOP runtime latency and
concurrently executes with the FCOP. At 314, the FCOP continues to process the
received data based upon the command concurrently with the CPU executing other
instructions. When the FCOP has completed processing, it issues a end FCOP
runtime status flag at 316 which the CPU uses to retrieve the result data at 318 which
then causes the FCOP to enter a wait state for the next executable command at 320.
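Putting process 300 together, here is a CPU-side sketch of steps 302 through 320. The helper functions are hypothetical software stand-ins for the interface unit and the FCOP; the patent does not specify a polling mechanism, a fixed latency, or any of these names:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical software model of the FCOP side; a real system would
     * use memory-mapped command, data, and status queues. */
    static uint32_t fcop_latency, fcop_cmd, fcop_data;
    static bool     fcop_running;

    static void fcop_send(uint32_t command, uint32_t data)  /* step 308 */
    {
        fcop_cmd = command;
        fcop_data = data;
        fcop_latency = 400;      /* illustrative fixed latency          */
        fcop_running = true;     /* start flag plus latency (step 310)  */
    }

    static bool fcop_started(uint32_t *latency_out)
    {
        *latency_out = fcop_latency;
        return fcop_running;
    }

    static bool fcop_stopped(void)   /* stop flag (step 316) */
    {
        fcop_running = false;        /* model instantaneous completion */
        return true;
    }

    static uint32_t fcop_result(void) { return fcop_cmd ^ fcop_data; }

    static void cpu_execute_other_instructions(uint32_t n)
    {
        (void)n;                 /* the CPU's concurrent work (312/314) */
    }

    uint32_t run_fcop_instruction(uint32_t command, uint32_t data)
    {
        uint32_t latency = 0;
        fcop_send(command, data);                /* step 308 */
        while (!fcop_started(&latency))          /* step 310 */
            ;
        cpu_execute_other_instructions(latency); /* steps 312 and 314 */
        while (!fcop_stopped())                  /* step 316 */
            ;
        return fcop_result();                    /* step 318; the FCOP then
                                                    waits for the next
                                                    command (step 320)    */
    }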
Fig. 4 illustrates a computer system 400 that can be employed to implement the present invention. The computer system 400 or, more specifically, CPUs 402,
may be arranged to support a virtual machine, as will be appreciated by those skilled
in the art. As is well known in the art, ROM acts to transfer data and instructions uni-directionally to the CPUs 402, while RAM is used typically to transfer data and
instructions in a bi-directional manner. CPUs 402 may generally include any number
of processors. Both primary storage devices 404, 406 may include any suitable
computer-readable media. A secondary storage medium 408, which is typically a
mass memory device, is also coupled bi-directionally to CPUs 402 and provides additional data storage capacity. The mass memory device 408 is a computer-readable
medium that may be used to store programs including computer code, data, and the
like. Typically, the mass memory device 408 is a storage medium such as a hard disk or a
tape which is generally slower than primary storage devices 404, 406. The mass memory
storage device 408 may take the form of a magnetic or paper tape reader or some other
well-known device. It will be appreciated that the information retained within the
mass memory device 408 may, in appropriate cases, be incorporated in standard
fashion as part of RAM 406 as virtual memory. A specific primary storage device 404, such as a CD-ROM, may also pass data uni-directionally to the CPUs 402.
CPUs 402 are also coupled to one or more input/output devices 410 that may
include, but are not limited to, devices such as video monitors, track balls, mice,
keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic
or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other
well-known input devices such as, of course, other computers. Finally, CPUs 402
optionally may be coupled to a computer or telecommunications network, e.g., an
Internet network or an intranet network, using a network connection as shown
generally at 412. With such a network connection, it is contemplated that the CPUs 402 might receive information from the network, or might output information to the
network in the course of performing the above-described method steps. Such information, which is often represented as a sequence of instructions to be executed
using CPUs 402, may be received from and outputted to the network, for example, in
the form of a computer data signal embodied in a carrier wave. The above-described
devices and materials will be familiar to those of skill in the computer hardware and
software arts.
The described arrangements have numerous advantages. One such advantage
is that the invention improves system performance by reducing central processing unit
(CPU) execution latency. In one embodiment, the CPU execution latency is reduced
by the CPU predicting an associated coprocessor's runtime latency, thereby enabling
the CPU to concurrently execute other instructions with the coprocessor. In this way,
the ability of the system to provide dynamic bandwidth allocation without resorting to
the CPU generating interrupts greatly improves system performance in applications
where dynamic bandwidth is important. Such applications include, but are not limited
to, cellular switching networks and the like. The described invention works well with
any computing system, including multi-threaded object-oriented computing systems and the like.
Although only a few embodiments of the present invention have been
described in detail, it should be understood that the present invention can be embodied
in many other specific forms without departing from the spirit or scope of the
invention. Particularly, although the invention has been described primarily in the context of integrated circuits having processor subsystems, the advantages, including
increased bus bandwidths, are equally applicable to any device capable of generating
large amounts of information, for example, multi-processor computing
systems.
Additionally, the characteristics of the invention can be varied in accordance
with the needs of a particular system. Therefore, the present examples are to be
considered as illustrative and not restrictive, and the invention is not to be limited to
the details given herein, but may be modified within the scope of the appended claims.

Claims

1. A method of predicting an execution latency of a coprocessor by a central
processing unit (CPU) coupled thereto and arranged to perform executable instructions, wherein the coprocessor is arranged to execute selected ones of the executable instructions,
comprising:
decoding a received instruction by the CPU into a command portion and an
associated data portion;
determining that the decoded instruction corresponds to any of the selected
executable instructions to be executed by the coprocessor;
passing off the command portion and the data portion to the coprocessor when
it is determined that the decoded instruction is to be executed by the coprocessor;
issuing a runtime start status flag and the execution latency by the coprocessor
when the coprocessor begins executing the passed instruction; and
predicting a coprocessor runtime latency based upon the issued runtime start
status information for the passed instruction, wherein the CPU executes others of the
executable instructions concurrently with the coprocessor executing the passed
instruction.
2. A method as recited in claim 1, wherein the coprocessor comprises:
a configurable execution block arranged to execute the selected ones of the executable instructions;
a command queue coupled to the execution block suitably arranged to receive and store the command portion, wherein the execution block executes the
selected ones of the executable instructions based upon the command portion;
a data queue coupled to the execution block suitably arranged to
receive and store the data portion, wherein the execution block processes the data portion based upon the command portion, thereby producing a result; and
a status flag queue coupled to the execution block suitably arranged to provide
a status flag to the CPU.
3. A method as recited in claim 2, further comprising:
issuing a coprocessor runtime stop status flag indicating that the execution block has
completed executing the passed instruction.
4. A method as recited in claim 3, wherein the predicting comprises:
identifying a coprocessor execution latency corresponding to the passed instruction
based upon the coprocessor runtime start status flag and the coprocessor runtime stop status flag.
5. A method as recited in claim 4, wherein the CPU learns the identified
coprocessor execution latency for the opcode corresponding to the passed instruction.
6. A method as recited in claim 5, wherein the CPU uses the learned coprocessor
execution latency to concurrently execute others of the executable instructions with
the coprocessor executing another of the selected instructions.
7. An apparatus for predicting a coprocessor execution latency for a coprocessor
having a configurable execution block coupled to a central processing unit (CPU),
comprising:
a status queue coupled to the CPU arranged to issue a coprocessor runtime status flag;
a data queue coupled to the CPU arranged to store a data field
corresponding to data to be processed by the execution block and to store a result data
field corresponding to the processed data; and
a command queue coupled to the CPU arranged to store a command that
provides the coprocessor with operating instructions,
wherein the CPU predicts the coprocessor latency based upon an issued
runtime start status flag for a passed instruction and executes others of the executable instructions concurrently with the coprocessor executing the passed
instruction.
8. An apparatus as recited in claim 7, wherein the CPU decodes the passed
instruction into the command and the data portion.
9. An apparatus as recited in claim 8, wherein a runtime start status flag is issued by
the coprocessor when the coprocessor begins executing the passed instruction.
PCT/US2001/010687 2000-04-05 2001-04-03 Method for predicting the instruction execution latency of a de-coupled configurable co-processor Ceased WO2001077818A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US54305100A 2000-04-05 2000-04-05
US09/543,051 2000-04-05

Publications (2)

Publication Number Publication Date
WO2001077818A2 true WO2001077818A2 (en) 2001-10-18
WO2001077818A3 WO2001077818A3 (en) 2002-06-27

Family

ID=24166383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/010687 Ceased WO2001077818A2 (en) 2000-04-05 2001-04-03 Method for predicting the instruction execution latency of a de-coupled configurable co-processor

Country Status (1)

Country Link
WO (1) WO2001077818A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2278452A1 (en) * 2009-07-15 2011-01-26 Nxp B.V. Coprocessor programming
US7933276B2 (en) * 2004-11-12 2011-04-26 Pmc-Sierra Israel Ltd. Dynamic bandwidth allocation processor

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63158657A (en) * 1986-12-23 1988-07-01 Fanuc Ltd Coprocessor control system
US5214764A (en) * 1988-07-15 1993-05-25 Casio Computer Co., Ltd. Data processing apparatus for operating on variable-length data delimited by delimiter codes
JP2771683B2 (en) * 1990-07-17 1998-07-02 三菱電機株式会社 Parallel processing method
JP2884831B2 (en) * 1991-07-03 1999-04-19 株式会社日立製作所 Processing equipment

Also Published As

Publication number Publication date
WO2001077818A3 (en) 2002-06-27


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): CN JP KR

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): CN JP KR

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP