
WO2001077818A2 - Method for predicting the instruction execution latency of a de-coupled configurable co-processor - Google Patents


Info

Publication number
WO2001077818A2
WO2001077818A2 (PCT/US2001/010687)
Authority
WO
WIPO (PCT)
Prior art keywords
coprocessor
cpu
runtime
execution
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2001/010687
Other languages
French (fr)
Other versions
WO2001077818A3 (en)
Inventor
Muhammad Afsar
Stash Czaja
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Infineon Technologies North America Corp
Original Assignee
Infineon Technologies North America Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infineon Technologies North America Corp filed Critical Infineon Technologies North America Corp
Publication of WO2001077818A2 publication Critical patent/WO2001077818A2/en
Publication of WO2001077818A3 publication Critical patent/WO2001077818A3/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3877: Concurrent instruction execution using a slave processor, e.g. coprocessor
    • G06F 9/3885: Concurrent instruction execution using a plurality of independent parallel functional units



Abstract

A method and an apparatus for predicting the execution latency of a coprocessor are disclosed. As a method, a central processing unit (CPU) fetches an instruction to be executed by a de-coupled flexible coprocessor (FCOP). The instruction is decoded by the CPU into an opcode (command) and corresponding data, which are then passed to the FCOP for execution during coprocessor runtime. Since the CPU has the capability of predicting the corresponding coprocessor runtime, the CPU continues to execute other instructions concurrently with the FCOP executing the FCOP instruction. In this way, the CPU does not suspend operation during coprocessor runtime.

Description

TECHNIQUES FOR PREDICTING THE EXECUTION LATENCY OF A DE-COUPLED FLEXIBLE CO-PROCESSOR
FIELD OF THE INVENTION:
The present invention pertains to computing systems and the like. More specifically, the present invention relates to reducing the execution latency in a
microprocessor.
BACKGROUND OF THE INVENTION:
In most communications systems, a special purpose microprocessor, such as a
digital signal processor (DSP), executes specific tasks or algorithms. However, in order
to perform very specialized signal processing functions (such as convolutional or
concatenated code decoding), a specialized function unit referred to as a
coprocessor is used. As is well known in the art, a coprocessor is any computer
processor which assists the main processor (the "CPU") by performing certain special functions, usually much faster than the main processor could perform them in
software. Typically, the coprocessor acts as a "slave" device performing the execution
of specific (however, relatively infrequent) instructions unsuitable, or inefficient,
for the main processor. In a conventionally architectured computing system 100
shown in Fig. 1, a main CPU 102 receives an instruction 104 from a memory device
105 at a fetch/decoder unit 107. The fetch/decoder unit 107 then decodes the
instruction 104 into an opcode 106 that identifies the particular operation to be performed on a data field 108.
Typically, the decoded instruction (in the form of the opcode field 106 and the
data field 108) is stored in a general purpose register (GPR) 110. In some cases, the opcode 106 indicates that a particular specialized operation is to be performed by a coprocessor 112 coupled to the CPU 102. In these cases, based upon the opcode 106,
the CPU 102 sends the opcode 106 and the data field 108 to the coprocessor 112, which commences executing the instruction 104 during what is referred to as coprocessor runtime.
Unfortunately, since the coprocessor runtime dynamically changes (i.e., is
unpredictable), in a closely coupled system such as the system 100, the CPU 102 must suspend execution (referred to as CPU latency) until such time as the
coprocessor 112 has returned a result data field 114 to the GPR 110. It is only when the coprocessor 112 has returned the result data field 114 that the CPU 102 can
resume executing any others of the instructions fetched from the memory device 105.
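To make the cost concrete, the closely coupled behavior of the system 100 can be modeled with the following minimal C sketch. All names here are illustrative stand-ins, not taken from the patent, and coprocessor_start is a software dummy for what would be a memory-mapped hardware dispatch:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t opcode;    /* opcode field 106                    */
        uint32_t data;      /* data field 108                      */
        uint32_t result;    /* result data field 114               */
        volatile bool done; /* set when the result reaches the GPR */
    } gpr_slot_t;

    /* Software stand-in for coprocessor 112; a real system would issue a
     * memory-mapped command with an unpredictable completion time. */
    static void coprocessor_start(gpr_slot_t *slot)
    {
        slot->result = slot->opcode ^ slot->data; /* dummy "special" op */
        slot->done = true;
    }

    uint32_t dispatch_blocking(gpr_slot_t *slot)
    {
        slot->done = false;
        coprocessor_start(slot); /* coprocessor runtime begins          */
        while (!slot->done)      /* CPU latency: execution is suspended */
            ;                    /* until the result data field returns */
        return slot->result;
    }

Because the coprocessor runtime is unpredictable, the spin wait is the only safe policy in such a closely coupled system; this is precisely the latency the invention sets out to remove.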
However, even with the inability of the CPU 102 to predict the coprocessor
runtime and the resulting need to suspend execution, the system 100 is reasonably
well suited for most applications requiring substantially static bandwidth allocation (i.e.,
where the CPU latency does not substantially affect system performance). However, systems
(such as wireless communications systems, for example) that experience dynamic
changes in various system parameters require what is referred to in the art as dynamic
bandwidth allocation (due to changes in data rate, for example). In wireless
communications systems, such dynamic changes are due in part to the dynamic nature
of the associated wireless communications channels and the need for continually updated
capacity management which, in turn, is related to subscriber mobility within (and
outside of) a particular subscriber's grid. It is for this dynamic bandwidth
allocation that the conventionally architectured computing system 100, with its dynamic (i.e., unpredictable) coprocessor latency, is particularly unsuited,
resulting in substantial system performance degradation.
In view of the foregoing, a computing system that includes a flexible
coprocessor having predictable execution latency would be desirable.
SUMMARY OF THE INVENTION
An improved system for enhancing the performance of a computing system
having a microprocessor and a coprocessor is described. More specifically, the
system is arranged to provide a flexible, application-dependent coprocessor
having predictable execution latency so as to permit the concurrent execution of the
CPU and the coprocessor.
In one embodiment of the invention, a method for predicting an execution
latency of a coprocessor by a central processing unit (CPU) is disclosed. In the described
embodiment, the CPU is coupled to the coprocessor and is arranged to perform
executable instructions that form a program, whereas the coprocessor is arranged to
execute selected ones of the executable instructions. As a method, a received
instruction is decoded by the CPU into a command portion and an associated data
portion. If the command portion indicates that the corresponding instruction is to be
executed by the coprocessor, then the command portion and the data portion are
passed off to the coprocessor. The coprocessor then issues a runtime start status flag
indicating that the coprocessor has begun to execute the passed instruction. The CPU then uses the issued runtime start status flag to predict a coprocessor runtime latency
which, in turn, enables the CPU to concurrently execute others of the executable
instructions with the coprocessor.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference
numerals refer to similar elements and in which:
Fig. 1 illustrates a conventionally architectured computing system.
Fig. 2A illustrates a computing system having a CPU and an associated
coprocessor in accordance with an embodiment of the invention.
Fig. 2B illustrates a timing diagram for a multi-threaded computing system
implementation of the computing system of Fig. 2A.
Fig. 3 illustrates a flowchart detailing a process whereby a CPU passes off a
command and associated data to an associated coprocessor in accordance with an
embodiment of the invention.
Fig. 4 is a computing system suitably arranged for implementing the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the following detailed description of the present invention, numerous
specific embodiments are set forth in order to provide a thorough understanding of the
invention. However, as will be apparent to those skilled in the art, the present
invention may be practiced without these specific details or by using alternate
elements or processes. In other instances, well-known processes, procedures, components, and circuits have not been described in detail so as not to unnecessarily
obscure aspects of the present invention.
Referring initially to Fig. 2A, an illustration of a computing system 200 in accordance with an embodiment of the invention is shown. The computing system
200 includes a memory 202 connected to a CPU 204 by way of a memory bus 206. The CPU 204, in turn, includes a fetch/decoder unit 208 also connected to the memory bus 206. As is well known in the art, the fetch/decoder unit 208 provides for
retrieving a selected instruction from the memory 202 at the direction of the CPU 204.
Once retrieved, the fetch/decoder unit 208 decodes the fetched instruction into the
opcode 106 and the associated data 108. In the described embodiment, the CPU 204
includes a fetch unit cache memory 210, also referred to as a special function register
(SFR) 210, suitable for storing the opcode 106 and the data 108 in a command register
209 and a data register 211, respectively. In the described embodiment, in order to
execute selected instructions by a coprocessor 212 that is coupled to the CPU 204, an
interface unit 214 is arranged to mediate the flow of information, such as commands
and data, between the coprocessor 212 and the CPU 204.
In the described embodiment, the coprocessor 212 includes a command queue
216 suitably arranged to receive and store commands that can take the form of the
opcode 106. The coprocessor 212 also includes a status queue 218 coupled to the
CPU 204 and an execution block 220. In a preferred embodiment, the status queue
218 is arranged to store a variety of status flags provided by the execution block 220
before, during, and after coprocessor runtime. The various status flags include, but
are not limited to, a start status flag indicating the start of coprocessor runtime, a
coprocessor latency, and an end status flag indicating the end of coprocessor runtime. It
is these flags that are used by the CPU 204 to predict the coprocessor runtime
associated with a particular opcode (command) in such a way that the CPU 204 can continue to execute incoming instructions without resorting to suspending execution
of instructions from the memory 202.
Substantially simultaneously with the passing of the command to the
command queue 216, the SFR 210 passes the corresponding data field 108 to the
interface 214 which, in turn, passes it to a data queue 222 which in some cases is bi-
directionally coupled to the execution block 220. At the beginning of the coprocessor
runtime associated with the opcode stored in the command queue 216, the execution block 220 fetches the appropriate data stored in the data queue 222. At the end of
coprocessor runtime, the result data field 114 is returned to the data queue 222 where
it is then made available to the CPU 204 by being stored in a result register 213
included in the SFR 210.
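As a rough software model of the queue organization just described (a sketch only: the patent describes hardware queues, and the queue depth, field widths, and names below are assumptions made for illustration):

    #include <stdint.h>

    enum cop_status_flag {       /* flags carried by the status queue 218 */
        COP_RUNTIME_START,       /* start of coprocessor runtime          */
        COP_RUNTIME_STOP,        /* end of coprocessor runtime            */
        COP_LATENCY_REPORT       /* coprocessor latency indication        */
    };

    typedef struct {
        enum cop_status_flag flag;
        uint32_t latency_cycles; /* valid for COP_LATENCY_REPORT */
    } cop_status_t;

    #define QDEPTH 8             /* assumed queue depth */

    typedef struct {                    /* software model of coprocessor 212 */
        uint32_t     command_q[QDEPTH]; /* command queue 216: opcodes        */
        uint32_t     data_q[QDEPTH];    /* data queue 222: operands and the  */
                                        /* returned result data field 114    */
        cop_status_t status_q[QDEPTH];  /* status queue 218                  */
        unsigned     cmd_head, cmd_tail;
        unsigned     stat_head, stat_tail;
    } coprocessor_model_t;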
During an exemplary operation, an instruction 224 is fetched from the memory
202 and decoded by the fetch/decoder unit 208. The decoded instruction, in the form
of the opcode and data, is then passed to the SFR 210. In the described embodiment,
the opcode is stored in the opcode register 209 whereas the data is stored in
the data register 211, both of which are included in the SFR 210. Based upon the
opcode, the CPU 204 instructs the interface 214 to fetch the opcode 106 from the
opcode register 209 and to store it in the command queue 216. Substantially
simultaneously, the coprocessor 212 fetches the corresponding data 108 from the data
register 211 and stores it in the data queue 222 where it is made available to the
execution block 220 prior to the start of coprocessor runtime.
In the described embodiment, at the start of coprocessor runtime, the execution
block 220 sets a coprocessor start flag in the status queue 218 indicating to the CPU 204 that execution of the instruction 226 is commencing. In those cases where the CPU 204 has identified and learned the execution characteristics of a particular command,
the CPU 204 uses the various status flags to ascertain the corresponding coprocessor
execution (or runtime) latency. Since the CPU 204 can determine coprocessor
latency, it can concurrently execute additional instructions fetched from the memory
device 202 without resorting to suspending operations by, for example, invoking
interrupts. In this way, the CPU runtime efficiency is greatly improved. At the
conclusion of the coprocessor runtime, the execution block 220 sets a runtime stop flag in the status queue 218 and stores the result field 114 in the data queue 222.
Based in part upon receipt of the runtime stop flag, the CPU 204 retrieves the result
field 114 and processes it accordingly.
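One plausible way for the CPU 204 to "learn" a command's execution characteristics from the start and stop flags is a per-opcode latency table. The sketch below assumes latencies are measured as cycle counts and opcodes fit in one byte, neither of which the patent specifies:

    #include <stdint.h>

    #define MAX_OPCODES 256

    static uint32_t learned_latency[MAX_OPCODES]; /* cycles; 0 = not yet learned */

    /* On receipt of the runtime stop flag, record the observed
     * start-to-stop interval for this opcode. */
    void learn_latency(uint8_t opcode, uint32_t start_cycle, uint32_t stop_cycle)
    {
        learned_latency[opcode] = stop_cycle - start_cycle;
    }

    /* Predicted coprocessor runtime for an opcode, or 0 if the CPU has
     * not observed this command yet and must wait on the stop flag. */
    uint32_t predict_latency(uint8_t opcode)
    {
        return learned_latency[opcode];
    }

With such a table, a previously observed command lets the CPU keep issuing instructions for the predicted interval instead of suspending.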
Referring now to Fig. 2B, a timing diagram 250 for a multi-threaded
computing system in accordance with an embodiment of the invention is illustrated. It
should be noted that the timing diagram 250 is exemplary of any multi-threaded type
computing system having concurrency between two independent threads of execution.
As such, the timing diagram will be discussed with reference to the computing system
200 shown in Fig. 2A. As shown, at the start of the CPU 204 runtime, a CPU
execution thread 252 is instantiated at an initial time t = t0. At a subsequent time t =
t1, the coprocessor 212 begins coprocessor runtime by invoking a coprocessor
execution thread 254 substantially simultaneously with passing a runtime start status
flag. At a time t = t2, the coprocessor 212 completes coprocessor runtime by passing a
runtime end status flag. It should be noted that, since the CPU 204 received the
runtime start status flag along with the coprocessor execution latency, it was able to predict when the coprocessor runtime would complete and was thereby able to continue
execution of the CPU thread 252 without resorting to interrupts.
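Numerically (with made-up values for illustration only): if the runtime start flag is observed at cycle t1 and the reported or learned latency is L cycles, the CPU simply schedules retrieval of the result near t1 + L:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t t1 = 100;      /* cycle at which the runtime start flag arrives    */
        uint32_t latency = 400; /* latency reported by the coprocessor (assumed)    */
        uint32_t t2_pred = t1 + latency; /* predicted end of coprocessor runtime    */

        /* The CPU thread 252 keeps executing roughly 400 cycles of other
         * instructions and checks for the result data near cycle 500.    */
        printf("poll for the result data near cycle %u\n", t2_pred);
        return 0;
    }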
Fig. 3 is a flowchart detailing a process 300 for executing an instruction by a coprocessor in conjunction with a CPU in accordance with an embodiment of the
invention. The process 300 starts at 302 by the CPU receiving and decoding an instruction from, for example, a memory device coupled thereto. Once the instruction
has been received, the CPU decodes the fetched instruction into a command and a
data portion at 304 that are subsequently stored at 306. Next, at 308, the CPU sends
the command and data portions to an associated flexible coprocessor (FCOP) arranged
to carry out and execute the command portion of the fetched instruction. At 310, the
FCOP issues a start FCOP runtime status flag indicating that the FCOP starting to
process the data associated with the received command. At the same time FCOP
sends the latency to CPU. Next, at 312, based upon the issued start FCOP runtime
status flag and the latency the CPU predicts the FCOP runtime latency and
concurrently executes with the FCOP. At 314, the FCOP continues to process the
received data based upon the command concurrently with the CPU executing other
instructions. When the FCOP has completed processing, it issues a end FCOP
runtime status flag at 316 which the CPU uses to retrieve the result data at 318 which
then causes the FCOP to enter a wait state for the next executable command at 320.
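Putting process 300 together, here is a CPU-side sketch of steps 302 through 320. The helper functions are hypothetical software stand-ins for the interface unit and the FCOP; the patent does not specify a polling mechanism, a fixed latency, or any of these names:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical software model of the FCOP side; a real system would
     * use memory-mapped command, data, and status queues. */
    static uint32_t fcop_latency, fcop_cmd, fcop_data;
    static bool     fcop_running;

    static void fcop_send(uint32_t command, uint32_t data)  /* step 308 */
    {
        fcop_cmd = command;
        fcop_data = data;
        fcop_latency = 400;      /* illustrative fixed latency          */
        fcop_running = true;     /* start flag plus latency (step 310)  */
    }

    static bool fcop_started(uint32_t *latency_out)
    {
        *latency_out = fcop_latency;
        return fcop_running;
    }

    static bool fcop_stopped(void)   /* stop flag (step 316) */
    {
        fcop_running = false;        /* model instantaneous completion */
        return true;
    }

    static uint32_t fcop_result(void) { return fcop_cmd ^ fcop_data; }

    static void cpu_execute_other_instructions(uint32_t n)
    {
        (void)n;                 /* the CPU's concurrent work (312/314) */
    }

    uint32_t run_fcop_instruction(uint32_t command, uint32_t data)
    {
        uint32_t latency = 0;
        fcop_send(command, data);                /* step 308 */
        while (!fcop_started(&latency))          /* step 310 */
            ;
        cpu_execute_other_instructions(latency); /* steps 312 and 314 */
        while (!fcop_stopped())                  /* step 316 */
            ;
        return fcop_result();                    /* step 318; the FCOP then
                                                    waits for the next
                                                    command (step 320)    */
    }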
Fig. 4 illustrates a computer system 400 that can be employed to implement the present invention. The computer system 400 or, more specifically, CPUs 402,
may be arranged to support a virtual machine, as will be appreciated by those skilled
in the art. As is well known in the art, ROM acts to transfer data and instructions uni-directionally to the CPUs 402, while RAM is used typically to transfer data and
instructions in a bi-directional manner. CPUs 402 may generally include any number
of processors. Both primary storage devices 404, 406 may include any suitable
computer-readable media. A secondary storage medium 408, which is typically a
mass memory device, is also coupled bi-directionally to CPUs 402 and provides additional data storage capacity. The mass memory device 408 is a computer-readable
medium that may be used to store programs including computer code, data, and the
like. Typically, the mass memory device 408 is a storage medium such as a hard disk or a
tape which is generally slower than primary storage devices 404, 406. The mass memory
storage device 408 may take the form of a magnetic or paper tape reader or some other
well-known device. It will be appreciated that the information retained within the
mass memory device 408 may, in appropriate cases, be incorporated in standard
fashion as part of RAM 406 as virtual memory. A specific primary storage device 404, such as a CD-ROM, may also pass data uni-directionally to the CPUs 402.
CPUs 402 are also coupled to one or more input/output devices 410 that may
include, but are not limited to, devices such as video monitors, track balls, mice,
keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic
or paper tape readers, tablets, styluses, voice or handwriting recognizers, or other
well-known input devices such as, of course, other computers. Finally, CPUs 402
optionally may be coupled to a computer or telecommunications network, e.g., an
Internet network or an intranet network, using a network connection as shown
generally at 412. With such a network connection, it is contemplated that the CPUs 402 might receive information from the network, or might output information to the
network in the course of performing the above-described method steps. Such information, which is often represented as a sequence of instructions to be executed
using CPUs 402, may be received from and outputted to the network, for example, in
the form of a computer data signal embodied in a carrier wave. The above-described
devices and materials will be familiar to those of skill in the computer hardware and
software arts.
The described arrangements have numerous advantages. One such advantage
is that the invention improves system performance by reducing central processing unit
(CPU) execution latency. In one embodiment, the CPU execution latency is reduced
by the CPU predicting an associated coprocessor's runtime latency, thereby enabling
the CPU to concurrently execute other instructions with the coprocessor. In this way,
the ability of the system to provide dynamic bandwidth allocation without resorting to
the CPU generating interrupts greatly improves system performance in applications
where dynamic bandwidth is important. Such applications include, but are not limited
to, cellular switching networks and the like. The described invention works well with
any computing system, including multi-threaded object-oriented computing systems and the like.
Although only a few embodiments of the present invention have been
described in detail, it should be understood that the present invention can be embodied
in many other specific forms without departing from the spirit or scope of the
invention. Particularly, although the invention has been described primarily in the context of integrated circuits having processor subsystems, the advantages, including
increased bus bandwidths, are equally applicable to any device capable of generating
large amounts of information, for example, multi-processor computing
systems.
Additionally, the characteristics of the invention can be varied in accordance
with the needs of a particular system. Therefore, the present examples are to be
considered as illustrative and not restrictive, and the invention is not to be limited to
the details given herein, but may be modified within the scope of the appended claims.

Claims

1. A method of predicting an execution latency of a coprocessor by a central
processing unit (CPU) coupled thereto and arranged to perform executable instructions, wherein the coprocessor is arranged to execute selected ones of the executable instructions,
comprising:
decoding a received instruction by the CPU into a command portion and an
associated data portion;
determining that the decoded instruction corresponds to any of the selected
executable instructions to be executed by the coprocessor;
passing off the command portion and the data portion to the coprocessor when
it is determined that the decoded instruction is to be executed by the coprocessor;
issuing a runtime start status flag and the execution latency by the coprocessor
when the coprocessor begins executing the passed instruction; and
predicting a coprocessor runtime latency based upon the issued runtime start
status information for the passed instruction, wherein the CPU executes others of the
executable instructions concurrently with the coprocessor executing the passed
instruction.
2. A method as recited in claim 1, wherein the coprocessor comprises:
a configurable execution block arranged to execute the selected ones of the executable instructions;
a command queue coupled to the execution block suitably arranged to receive and store the command portion, wherein the execution block executes the
selected ones of the executable instructions based upon the command portion;
a data queue coupled to the execution block suitably arranged to
receive and store the data portion, wherein the execution block processes the data portion based upon the command portion, thereby producing a result; and
a status flag queue coupled to the execution block suitably arranged to provide
a status flag to the CPU.
3. A method as recited in claim 2, further comprising:
issuing a coprocessor runtime stop status flag indicating that the execution block has
completed executing the passed instruction.
4. A method as recited in claim 3, wherein the predicting comprises:
identifying a coprocessor execution latency corresponding to the passed instruction
based upon the coprocessor runtime start status flag and the coprocessor runtime stop status flag.
5. A method as recited in claim 4, wherein the CPU learns the identified
coprocessor execution latency for the opcode corresponding to the passed instruction.
6. A method as recited in claim 5, wherein the CPU uses the learned coprocessor
execution latency to concurrently execute others of the executable instructions with
the coprocessor executing another of the selected instructions.
7. An apparatus for predicting a coprocessor execution latency for a coprocessor
having a configurable execution block coupled to a central processing unit (CPU),
comprising:
a status queue coupled to the CPU arranged to issue a coprocessor runtime status flag;
a data queue coupled to the CPU arranged to store a data field
corresponding to data to be processed by the execution block and to store a result data
field corresponding to the processed data; and
a command queue coupled to the CPU arranged to store a command that
provides the coprocessor with operating instructions,
wherein the CPU predicts the coprocessor latency based upon an issued
runtime start status flag for a passed instruction and executes others of the executable instructions concurrently with the coprocessor executing the passed
instruction.
8. An apparatus as recited in claim 7, wherein the CPU decodes the passed
instruction into the command and the data portion.
9. An apparatus as recited in claim 8, wherein a runtime start status flag is issued by
the coprocessor when the coprocessor begins executing the passed instruction.
PCT/US2001/010687 2000-04-05 2001-04-03 Method for predicting the instruction execution latency of a de-coupled configurable co-processor Ceased WO2001077818A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US54305100A 2000-04-05 2000-04-05
US09/543,051 2000-04-05

Publications (2)

Publication Number Publication Date
WO2001077818A2 true WO2001077818A2 (en) 2001-10-18
WO2001077818A3 WO2001077818A3 (en) 2002-06-27

Family

ID=24166383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/010687 Ceased WO2001077818A2 (en) 2000-04-05 2001-04-03 Method for predicting the instruction execution latency of a de-coupled configurable co-processor

Country Status (1)

Country Link
WO (1) WO2001077818A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2278452A1 (en) * 2009-07-15 2011-01-26 Nxp B.V. Coprocessor programming
US7933276B2 (en) * 2004-11-12 2011-04-26 Pmc-Sierra Israel Ltd. Dynamic bandwidth allocation processor

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63158657A (en) * 1986-12-23 1988-07-01 Fanuc Ltd Coprocessor control system
US5214764A (en) * 1988-07-15 1993-05-25 Casio Computer Co., Ltd. Data processing apparatus for operating on variable-length data delimited by delimiter codes
JP2771683B2 (en) * 1990-07-17 1998-07-02 三菱電機株式会社 Parallel processing method
JP2884831B2 (en) * 1991-07-03 1999-04-19 株式会社日立製作所 Processing equipment

Also Published As

Publication number Publication date
WO2001077818A3 (en) 2002-06-27


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): CN JP KR

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): CN JP KR

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP