HK1110970A - Pre-decode error handling via branch correction - Google Patents
- Publication number
- HK1110970A (Application HK08105675.3A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- instruction
- branch
- cache
- address
- decoded
- Prior art date
Abstract
In a pipelined processor where instructions are pre-decoded prior to being stored in a cache, an incorrectly pre-decoded instruction is detected during execution in the pipeline. The corresponding instruction is invalidated in the cache, and the instruction is forced to evaluate as a branch instruction. In particular, the branch instruction is evaluated as 'mispredicted not taken' with a branch target address of the incorrectly pre-decoded instruction's address. This, with the invalidated cache line, causes the incorrectly pre-decoded instruction to be re-fetched from memory with a precise address. The re-fetched instruction is then correctly pre-decoded, written to the cache, and executed.
Description
Technical Field
The present invention relates generally to the field of processors, and in particular to a method of correcting erroneous pre-decoded data associated with an instruction by forcing a branch correction procedure with the target address of the instruction.
Background
Microprocessors perform computational tasks in a variety of applications. There is almost always a need for improved processor performance to allow faster operation and/or increased functionality through software changes. In many embedded applications (e.g., portable electronic devices), conserving power is also an important goal in processor design and implementation.
Many modern processors use a pipelined architecture, in which sequential instructions are overlapped in execution to increase overall processor throughput. Maintaining smooth execution through the pipeline is critical to achieving high performance. Many modern processors also use hierarchical memories, in which fast, on-chip caches store local copies of recently accessed data and instructions. One pipeline optimization technique known in the art is pre-decoding instructions. That is, instructions are examined as they are read from memory, partially decoded, and certain information about them (referred to as pre-decode information) is stored in the cache along with the associated instructions. When the instructions are later fetched from the cache, the pre-decode information is fetched as well and used to assist in fully decoding them.
Sometimes, the pre-decode information contains errors. These errors may be detected during a decode stage in the pipeline. When an error is found, an exception occurs, and the pipeline must be flushed and all instructions, including the incorrectly pre-decoded one, re-fetched. This process incurs significant performance and power penalties.
Disclosure of Invention
The present invention relates in one embodiment to a method of correcting an incorrectly pre-decoded instruction. A pre-decode error is detected. In response to detecting the error, a branch correction procedure is forced with the target address of the incorrectly pre-decoded instruction.
The invention relates, in another embodiment, to a processor. The processor includes a pre-decoder inserted in an instruction fetch path that generates pre-decode information associated with a particular instruction. The processor also includes a pre-decode error detector and corrector that detects incorrect pre-decode information associated with the instruction and forces the instruction to execute as a mispredicted branch with the address of the instruction as the branch target address.
Drawings
FIG. 1 is a functional block diagram of a processor.
FIG. 2 is a functional block diagram of a portion of a memory, a pre-decoder, an instruction cache, and a processor pipeline.
FIG. 3 is a functional block diagram of branch correction logic.
Detailed Description
Pipelined processor architectures exploit parallelism by overlapping the execution of multiple sequential instructions, each having multiple execution steps. Typical instruction steps include instruction fetch, decode, execute, and write back. Each step is performed in the pipeline by one or more pipe stages, which comprise logic and memory elements such as latches or registers. The pipe stages are connected together to form the pipeline. Instructions enter the pipeline and are processed serially through the stages. Additional instructions enter the pipeline before previous instructions complete execution, so multiple instructions may be processed within the pipeline at any given time. This ability to exploit parallelism in a sequential instruction stream significantly improves processor performance. Under ideal conditions, in a processor that completes each pipe stage in one cycle, after the brief initial process of filling the pipeline an instruction may complete execution every cycle. Many realistic constraints make this ideal condition impossible to sustain; however, keeping the pipeline full and flowing smoothly is a common goal of processor design.
Modern processors also typically use a memory hierarchy that places a small amount of fast, expensive memory near the processor, backed by larger amounts of slower, inexpensive memory. A typical processor memory hierarchy comprises registers in the processor at the top; one or more on-chip caches (e.g., SRAM); possibly an off-chip second-level, or L2, cache (e.g., SRAM); main memory (typically DRAM); magnetic disk storage; and, at the lowest level, magnetic tape or optical disc. In embedded applications (e.g., portable electronic devices), there may be limited, if any, disk storage, so main memory (often limited in size) may be the lowest level of the memory hierarchy.
FIG. 1 depicts a functional block diagram of a representative processor 10, which employs both a pipelined architecture and a hierarchical memory structure. The processor 10 executes instructions in an instruction execution pipeline 12 in accordance with control logic 14. The pipeline includes various registers or latches 16, organized in pipe stages, and one or more Arithmetic Logic Units (ALU) 18. A General Purpose Register (GPR) file 20 provides the registers comprising the top of the memory hierarchy. The pipeline fetches instructions from an instruction cache (I-cache) 22, with memory addressing and permissions managed by an Instruction-side Translation Lookaside Buffer (ITLB) 24, and with some initial decoding of the instructions performed by a pre-decoder 21. Data is accessed from a data cache (D-cache) 26, with memory addressing and permissions managed by a main Translation Lookaside Buffer (TLB) 28. In various embodiments, the ITLB may comprise a copy of part of the TLB; alternatively, the ITLB and TLB may be integrated. Similarly, in various embodiments of the processor 10, the I-cache 22 and D-cache 26 may be integrated, or unified. Accesses that miss in the I-cache 22 and/or the D-cache 26 result in accesses to main (off-chip) memory 32, under the control of a memory interface 30. The processor 10 may include an input/output (I/O) interface 34 that controls access to various peripheral devices 36. Those skilled in the art will recognize that many variations of the processor 10 are possible. For example, the processor 10 may include a second-level (L2) cache for either or both of the I- and D-caches. In addition, one or more of the functional blocks depicted in the processor 10 may be omitted from a particular embodiment.
One known technique for improving processor performance and reducing power consumption is known as pre-decoding. The pre-decoder 21 comprises logic interposed in the path between main memory 32 and the instruction cache 22. Some of the instructions fetched from memory may be pre-decoded, pre-decode information generated, and both written to the I-cache 22 together. When the instructions are fetched from the cache for execution, the pre-decode information may assist one or more decode pipe stages in decoding them. For example, a pre-decoder may determine the length of a variable-length instruction and write pre-decode information into the cache that assists the decode pipe stage in retrieving the correct number of bits for that instruction. A variety of information may be pre-decoded and stored in the I-cache 22.
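The length pre-decoding described above can be sketched in a few lines. The encoding here is invented purely for illustration (a set high bit in the first byte marks a 4-byte instruction, otherwise 2 bytes); real instruction sets, and the patent's pre-decoder, differ.

```python
# Hypothetical sketch of length pre-decoding for a variable-length ISA.
# Illustrative encoding only: a first byte with the high bit set marks a
# 4-byte instruction; otherwise the instruction is 2 bytes long.

def predecode_line(line_bytes):
    """Scan a cache line from its start, tagging each instruction's length.

    Returns a list of (offset, length) pairs -- the pre-decode information
    that would be stored in the I-cache alongside the line.
    """
    info = []
    offset = 0
    while offset < len(line_bytes):
        length = 4 if line_bytes[offset] & 0x80 else 2
        info.append((offset, length))
        offset += length
    return info

line = bytes([0x81, 0, 0, 0,   # 4-byte instruction
              0x10, 0,         # 2-byte instruction
              0x90, 0, 0, 0])  # 4-byte instruction
print(predecode_line(line))    # [(0, 4), (4, 2), (6, 4)]
```

The decode pipe stage can then use the stored `(offset, length)` pairs to extract each instruction without re-examining the raw bytes.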
By removing logic from one or more decode pipe stages, the pre-decoder 21 improves performance by simplifying those stages, possibly allowing for shorter machine cycle times. The pre-decoder 21 also reduces power consumption by performing the pre-decode operations only once: since the hit rate of the I-cache 22 is commonly 90% or higher, considerable power may be saved by not repeating those logic operations each time an instruction is executed from the I-cache 22.
Sometimes, the pre-decoder 21 is in error. For example, if data such as parameters or intermediate values are stored in memory along with the instructions, a pre-decode operation that determines instruction lengths by simply counting bytes from the beginning of a cache line may erroneously identify bytes of one or more of those parameters or intermediate values as instructions further down the line. Other types of errors may exist as well, including random bit errors in the pre-decoder 21 or in the I-cache 22. Such errors are discovered in one or more decode pipe stages and will typically result in an exception, requiring the pipeline to be flushed and restarted and thereby incurring performance and power consumption penalties.
Pre-decode errors may be corrected in ways that avoid the exception and associated flush of the pipeline 12. FIG. 2 is a functional block diagram depicting portions of the processor 10 and pipeline 12. FIG. 2 also depicts an Instruction Cache Address Register (ICAR) 48, which indexes the I-cache 22. The address loaded into the ICAR 48 is generated and/or selected by the next fetch address calculation circuit 46. When an instruction is fetched from memory 32 (or the L2 cache), the instruction is pre-decoded by the pre-decoder 21, and the pre-decode information 23 is stored in the instruction cache 22 along with the corresponding instruction.
In the pipeline 12, instructions and associated pre-decode information 23 are fetched from the I-cache 22, at least partially decoded by decode logic 40, and the results stored in the DCD1 pipe stage latch 42. In many processors 10, the DCD1 pipe stage contains a branch predictor. In the case where the branch predictor predicts that a branch will be taken, the pipe stage may calculate the branch target address and provide it to the next fetch address calculation logic 46 along the branch predicted address path 44. This is one example of an address path from the pipe stage to the next fetch address calculation logic 46 (branches predicted not to be taken would simply allow continued instruction fetching).
In one exemplary embodiment, the fetched and partially decoded instructions then flow to the pipe stage DCD2, which includes the incorrect pre-decode detection and correction logic 50. If an error in the pre-decode information is detected, the DCD2 pipe stage may signal an exception and flush the pipeline 12, as discussed above.
Alternatively, pre-decode errors may be corrected by re-fetching instructions from memory 32. One way to accomplish this is to invalidate the instruction in the cache 22 and provide the instruction address to the next fetch address circuit 46 along path 54. This address will then be loaded into the ICAR 48. Because the instruction is invalidated in the cache 22, the cache access will miss, resulting in an access to the main memory 32. The instructions fetched from the main memory 32 will then be correctly pre-decoded by the pre-decoder 21 and placed back into the instruction cache 22. The instruction may then be refetched from the cache 22, along with the correct pre-decode information 23.
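The invalidate-and-refetch flow just described can be sketched in miniature. The class and its fields are illustrative stand-ins, not the patent's design: a map-based cache with a per-entry valid bit, backed by a memory map and a pre-decoder callable.

```python
# Sketch of the invalidate-and-refetch flow (simplified, assumed design:
# a map-based I-cache with a valid bit per entry; names are illustrative).

class ICache:
    def __init__(self, memory, predecoder):
        self.memory = memory          # backing store: addr -> instruction
        self.predecoder = predecoder  # callable: instruction -> pre-decode info
        self.lines = {}               # addr -> (instruction, predecode, valid)

    def fetch(self, addr):
        entry = self.lines.get(addr)
        if entry is None or not entry[2]:          # miss, or line invalidated
            instr = self.memory[addr]              # access main memory
            entry = (instr, self.predecoder(instr), True)
            self.lines[addr] = entry               # refill with fresh pre-decode
        return entry[0], entry[1]

    def invalidate(self, addr):
        if addr in self.lines:
            instr, pd, _ = self.lines[addr]
            self.lines[addr] = (instr, pd, False)  # clear the valid bit

memory = {0x1000: "add r1, r2, r3"}
cache = ICache(memory, predecoder=lambda i: {"len": len(i.split())})
cache.lines[0x1000] = ("add r1, r2, r3", {"len": 99}, True)  # bad pre-decode
cache.invalidate(0x1000)                 # step 1: invalidate the cached line
instr, pd = cache.fetch(0x1000)          # step 2: miss -> memory -> re-pre-decode
print(pd)                                # {'len': 4}
```

Because the bad entry was invalidated, the next fetch misses, goes to memory, and re-generates the pre-decode information correctly.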
The next fetch address calculation logic 46 typically lies on a critical path of the processor and thus limits machine cycle time. Adding a path 54 for the address of an incorrectly pre-decoded instruction would add logic to the next fetch address calculation 46, increasing machine cycle time and decreasing performance. This trade-off is particularly unattractive given that the pre-decode information 23 is rarely incorrect: optimizing for a rare case at the expense of the common case typically reduces overall processor performance.
According to one embodiment of the invention, the incorrect pre-decode path 54 (as indicated by the dashed line in FIG. 2) to the next fetch address calculator 46 is eliminated. Rather than providing a dedicated path to the next fetch address calculator 46, the incorrect pre-decode detection and correction logic 50 causes the pipeline 12 to evaluate the incorrectly pre-decoded instruction as a branch instruction. The predecode correction logic 50 may change the semantics of the incorrectly predecoded instruction to those of the branch instruction or, alternatively, may set a flag carried through the pipeline that indicates to the execution pipe stage that the instruction is to be treated as a branch.
Specifically, the incorrectly pre-decoded instruction is evaluated as a branch that was predicted not taken but evaluates taken, with the branch target address being the address of the incorrectly pre-decoded instruction itself. At some point down the pipeline 12 (depending on implementation details), the instruction is evaluated by an execution pipe stage 56 that evaluates the "branch taken" condition and generates the branch target address. The branch target address is provided to the next fetch address calculator 46 along the branch correction path 58. Branch condition evaluation logic, branch target address generation logic, and the associated control logic in the branch correction path 58 and the next fetch address calculator 46 are already present in any pipelined processor 10 that predicts branch behavior.
FIG. 3 is a functional diagram of one possible implementation of the branch correction logic. Within the EXE pipe stage latch 56 are a branch predicted taken (BPT) bit 60 and a branch condition evaluation (COND) bit 62. The BPT bit 60 is 1 if the branch predictor predicted the branch taken earlier in the pipeline 12, and 0 if the branch was predicted not taken. The COND bit 62 is 1 if the branch evaluates taken and 0 if it evaluates not taken. These two bits may be subjected to an exclusive-or (XOR) operation, as indicated by gate 66, to generate a multiplexer select or similar control signal provided to the next fetch address calculator 46, indicating that the branch correction path 58 should be selected as the source of the next fetch address. Table 1 below depicts the truth table for XOR 66.
| BPT | COND | Output of XOR 66 | Note |
|---|---|---|---|
| 0 | 0 | 0 | Correctly predicted not taken; no correction |
| 0 | 1 | 1 | Mispredicted not taken; the branch target address must be supplied to the next fetch address circuit on the branch correction path |
| 1 | 0 | 1 | Mispredicted taken; the sequential address must be supplied to the next fetch address circuit on the branch correction path |
| 1 | 1 | 0 | Correctly predicted taken; no correction |

Table 1: Branch prediction truth table
The condition evaluation bit 62 may additionally serve as the select input to a multiplexer 68 that selects between the sequential address and the calculated branch target address 64, generating the address placed on the branch correction path 58.
According to one embodiment of the invention, to process an incorrectly pre-decoded instruction, the BPT bit 60 may be forced to 0 and the COND bit 62 forced to 1, creating a "mispredicted not taken" situation. In this case, the calculated branch target address 64 is directed to the next fetch address circuit 46 via the branch correction path 58.
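The select logic of FIG. 3 and Table 1 can be modeled in a few lines. This is a behavioral sketch, not RTL; the function name and address values are illustrative.

```python
# Behavioral model of the branch-correction select logic of FIG. 3:
# XOR of the BPT and COND bits selects the branch correction path, and
# COND selects between the sequential address and the calculated target.

def branch_correction(bpt, cond, sequential_addr, target_addr):
    """Return (use_correction_path, corrected_address)."""
    mispredicted = bpt ^ cond                             # gate 66
    corrected = target_addr if cond else sequential_addr  # mux 68
    return mispredicted, corrected

# The four rows of Table 1:
assert branch_correction(0, 0, 0x2004, 0x3000)[0] == 0          # correct: not taken
assert branch_correction(1, 1, 0x2004, 0x3000)[0] == 0          # correct: taken
assert branch_correction(0, 1, 0x2004, 0x3000) == (1, 0x3000)   # supply target
assert branch_correction(1, 0, 0x2004, 0x3000) == (1, 0x2004)   # supply sequential

# Forcing a pre-decode correction: BPT=0, COND=1, target = the instruction's
# own address, so the fetch is redirected back to that very instruction.
bad_pc = 0x4000
print(branch_correction(0, 1, bad_pc + 4, bad_pc))  # (1, 16384)
```

Forcing BPT=0 and COND=1 thus reuses the existing mispredicted-not-taken datapath, with no new logic added to the next fetch address calculator.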
According to one embodiment of the invention, an incorrectly pre-decoded instruction is evaluated as a PC-relative branch instruction with a branch displacement field of zero. When this instruction is evaluated in the EXE pipe stage 56, the calculated branch target address will be the address of the incorrectly pre-decoded instruction itself (offset zero). In another embodiment, the incorrectly pre-decoded instruction is evaluated as a register branch instruction, and the branch target address register is loaded with the address of the incorrectly pre-decoded instruction. In the case where the branch target address register is loaded by an arithmetic operation, the operand registers may be loaded with values calculated to generate the incorrectly pre-decoded instruction's address. Many other methods of evaluating an incorrectly pre-decoded instruction as a mispredicted not taken branch whose target address is the instruction itself will be readily apparent to those of ordinary skill in the art and are included within the scope of this disclosure.
Referring again to FIG. 2, the forced mispredicted-not-taken branch instruction is executed at the EXE stage 56, and the branch target address, which is the address of the incorrectly pre-decoded instruction, is placed on the branch correction path 58. This address is selected by the next fetch address calculator 46, loaded into the ICAR 48, and an instruction fetch is performed in the I-cache 22.
Since the incorrect pre-decode detection and correction logic 50 has invalidated the cache line containing the incorrectly pre-decoded instruction, the I-cache 22 access will miss, forcing a fetch of the instruction from memory 32 (or the L2 cache). The instruction will then be correctly pre-decoded by the pre-decoder 21 and placed into the I-cache 22 along with the correct pre-decode information 23. The instruction and pre-decode information 23 may then be re-fetched from the I-cache 22, correctly decoded, and correctly executed in the pipeline 12. Offset errors due, for example, to data interspersed with instructions cannot recur in the pre-decoder 21, because the memory access is to the exact address of the instruction rather than the beginning of a cache line.
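The value of re-fetching at the precise address can be demonstrated with a toy scan. The encoding is again invented for illustration (high bit of the first byte set means a 4-byte instruction, else 2 bytes): a scan from the line start misinterprets data bytes as an opcode and lands mid-instruction, while a scan from the instruction's own address finds the correct boundary.

```python
# Why the precise-address re-fetch avoids offset errors. Illustrative
# encoding only: first byte with high bit set -> 4-byte instruction,
# otherwise -> 2-byte instruction.

def instr_length(byte0):
    return 4 if byte0 & 0x80 else 2

def scan(line_bytes, start):
    """Return the instruction-boundary offsets found scanning from `start`."""
    offsets = []
    off = start
    while off < len(line_bytes):
        offsets.append(off)
        off += instr_length(line_bytes[off])
    return offsets

line = bytes([0x80, 0x00,        # data, not code (happens to look 4-byte)
              0x81, 0, 0, 0])    # real 4-byte instruction at offset 2

print(scan(line, 0))  # [0, 4] -- data misread as an opcode; offset 4 is
                      # mid-instruction, so the pre-decode is wrong
print(scan(line, 2))  # [2]    -- scanning from the precise instruction
                      # address yields the correct boundary
```

This is exactly the failure mode a cache-line-relative pre-decode can hit, and why a fetch that starts at the instruction's own address cannot repeat it.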
It should be noted that the above description of memory accesses is conceptual. In any given implementation, accesses to memory 32 may occur in parallel with I-cache 22 accesses; I-cache 22 misses may be predicted, so that the I-cache 22 access is avoided; results of the memory 32 access may go directly into the pipeline 12 while being written in parallel into the I-cache 22; and so on. In general, the present disclosure encompasses all memory and/or cache performance optimizations whose operation may deviate from the description above.
Although the present invention has been described herein with respect to particular features, aspects and embodiments thereof, it will be apparent that numerous variations, modifications, and other embodiments are possible within the broad scope of the present invention, and accordingly, all variations, modifications and embodiments are to be regarded as being within the scope of the invention. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.
Claims (16)
1. A method of correcting an incorrectly pre-decoded instruction, comprising:
detecting a predecode error; and
in response to detecting the error, forcing a branch correction procedure with a target address of the incorrectly pre-decoded instruction.
2. The method of claim 1, further comprising invalidating the incorrectly pre-decoded instruction in a cache prior to forcing the branch correction procedure.
3. The method of claim 2, further comprising fetching the instruction from memory in response to the branch correction procedure.
4. The method of claim 3, further comprising pre-decoding the instruction and storing the instruction and pre-decode information associated with the instruction in the cache.
5. The method of claim 1, wherein forcing a branch correction procedure comprises forcing a branch condition evaluation of "taken" and forcing a branch prediction of "not taken".
6. The method of claim 1 wherein forcing a branch correction procedure with the target address of the incorrectly pre-decoded instruction comprises storing the address in a target address register and forcing register branch instruction correction.
7. The method of claim 6, wherein storing the address in a target address register comprises, in the case where the target address register is loaded with the result of an arithmetic operation on the contents of two operand registers, storing in the operand registers values calculated to produce the address from the arithmetic operation.
8. The method of claim 1 wherein forcing a branch correction procedure with the target address of the incorrectly pre-decoded instruction comprises forcing a PC-dependent branch instruction correction with a branch displacement of zero.
9. A processor, comprising:
a pre-decoder inserted in an instruction fetch path, the pre-decoder generating pre-decode information associated with an instruction; and
a pre-decode error detector and corrector that detects incorrect pre-decode information associated with the instruction and forces the instruction to execute as a mispredicted branch with the address of the instruction as the branch target address.
10. The processor of claim 9, further comprising
a cache that stores the instruction and the pre-decode information, and wherein the pre-decode error detector and corrector further invalidates the instruction in the cache upon detection of the pre-decode error.
11. The processor of claim 9, further comprising a branch predictor and a branch correction path that provides a corrected branch target address for instruction fetching in response to a conditional branch that was predicted not taken but evaluated taken.
12. The processor of claim 11 wherein the pre-decode error detector and corrector utilizes the branch correction path to force the incorrectly pre-decoded instruction to execute as a branch instruction incorrectly predicted not taken.
13. A method of correcting an incorrectly pre-decoded instruction, comprising:
detecting a predecode error; and
in response to detecting the error, correcting the pre-decode error by fetching the instruction from memory and pre-decoding the instruction.
14. The method of claim 13, wherein fetching the instruction from memory comprises:
invalidating the instruction in a cache; and
after invalidating the instruction, attempting to fetch the instruction from the cache.
15. The method of claim 13, wherein fetching the instruction from memory comprises evaluating the instruction as a branch, wherein an address of the instruction is taken as a branch target address.
16. The method of claim 15, wherein evaluating the instruction as a branch comprises evaluating the instruction as a mispredicted not taken branch.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10/995,858 | 2004-11-22 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1110970A true HK1110970A (en) | 2008-07-25 |