HK1112984A

HK1112984A - Suppressing update of a branch history register by loop-ending branches

Info

Publication number: HK1112984A
Application number: HK08108339.5A
Authority: HK
Inventors: 博胡斯拉夫‧雷赫利克
Original assignee: 高通股份有限公司
Priority date: 2005-02-24
Filing date: 2006-02-24
Publication date: 2008-09-19

Abstract

Conditional branch instructions that terminate code loops are detected, and a Branch History Register (BHR) is prevented from updating to store the loop-ending branch evaluations. This prevents the branch that implements loop iterations from displacing other branch evaluation histories from the BHR. The loop-ending branch may be detected statically, by a compiler using a specific type branch instruction or inserting indicator bits in the op code of a loop-ending branch instruction. A loop-ending branch instruction may be detected dynamically as any backwards branch, or by storing the PC of the last one or several branch instructions upon updating the BHR, and checking the PC of a branch instruction against the Last Branch PC (LBPC) register(s). If the branch PC matches, update of the BHR is suppressed. Keeping loop iteration branches out of the BHR improves branch prediction training time and accuracy.

Description

Suppressing update of branch history register by loop-ending branch

Technical Field

The present invention relates generally to the field of processors, and more particularly to a method of improving branch prediction by suppressing updates to branch history registers with loop-ending branch instructions.

Background

Microprocessors perform computational tasks in a wide variety of applications. Improved processor performance is almost always needed to allow faster operation and/or increased functionality through software changes. In many embedded applications, such as portable electronic devices, saving power is also a goal of processor design and implementation.

Many modern processors use a pipeline architecture in which sequential instructions (each having multiple execution steps) overlap in execution. To achieve improved performance, instructions should flow continuously through the pipeline. Any situation that causes instructions to stall in the pipeline can adversely affect performance. If an instruction is flushed from the pipeline and subsequently re-fetched, performance and power consumption suffer.

Most programs contain conditional branch instructions, whose actual branching behavior is not known until the instruction is evaluated deep in the pipeline. To avoid stalls due to waiting for actual evaluation of branch instructions, modern processors may employ some form of branch prediction, whereby the branching behavior of conditional branch instructions is predicted early in the pipeline. Based on the predicted branch evaluation, the processor speculatively fetches (prefetches) and executes instructions from a predicted address, either the branch target address (if the branch is predicted to be taken) or the next sequential address after the branch instruction (if the branch is predicted not to be taken). When the actual branch behavior is determined, if the branch is mispredicted, the speculatively fetched instructions must be flushed from the pipeline and new instructions fetched from the next correct address. Prefetching instructions in response to a misprediction of a branch may adversely affect processor performance and power consumption. Therefore, improving the accuracy of branch prediction is an important design goal.

Known branch prediction techniques include both static and dynamic prediction. The likely behavior of some branch instructions may be predicted statically by a programmer and/or compiler. One example of branch prediction is an error checking routine. Code will typically execute correctly and errors are rare. Thus, a branch instruction that implements a "branch on error" will evaluate "not taken" a very high percentage of the time. Such instructions may include static branch prediction bits in the opcode that are set by a programmer or compiler with knowledge of the most likely outcome of the branch condition.

Dynamic prediction is generally based on the branch evaluation history (and in some cases, the branch prediction accuracy history) of the branch instruction being predicted and/or other branch instructions in the same code. Extensive analysis of actual code indicates that recent past branch evaluation patterns may be a good indication of the evaluation of future branch instructions.

One known form of dynamic branch prediction depicted in FIG. 1 utilizes a Branch History Register (BHR)100 to store the past n branch evaluations. In a simple embodiment, the BHR 30 includes a shift register. The most recent branch evaluation results are shifted in (e.g., 1 indicates branch taken and 0 indicates branch not taken), while the oldest past evaluation in the register is replaced. The processor may maintain a local BHR 100 for each branch instruction. Alternatively (or additionally), the BHR 100 may contain recent past evaluations of all conditional branch instructions, sometimes referred to in the art as a global BHR or GHR. As used herein, BHR refers to both local and global branch history registers.

As depicted in FIG. 1, the BHR 100 may index a Branch Predictor Table (BPT)102, which BPT 102 may likewise be local or global. BHR 100 may index BPT 102 directly or may be combined with other information in BPT indexing logic 104, such as a Program Counter (PC) for a branch instruction. Other inputs to the BPT index logic 104 may additionally be utilized. BPT index logic 104 may chain the inputs together (commonly referred to in the art as gselect), exclusive or the inputs (gshare), perform a hash function, or combine or transform the inputs in a variety of ways.

In one example, BPT 102 may include multiple saturating counters, with their MSBs acting as bimodal branch predictors. For example, each table entry may comprise a 2-bit counter that takes one of four states, each of which is assigned a weighted prediction value, such as:

11-robust predictive adoption

10-weak prediction adoption

01-Weak prediction not to use

00-Strong prediction not adopted

The counter is incremented each time the respective branch instruction evaluates "taken" and decremented each time the instruction evaluates "not taken". The MSB of the counter is a bimodal branch predictor that will predict whether a branch is taken or not taken, regardless of the strength or weight of the base prediction. The saturation counter reduces prediction errors for infrequent branch evaluations. Branches that always evaluate in one direction will saturate the counter. Conversely, infrequent evaluation will change the counter value (and the strength of the prediction), but not the bimodal prediction value. Thus, infrequent evaluations will only mispredict once, rather than twice. The table of saturation counters is merely an illustrative example, and in general, a BHT may index a table containing a variety of branch prediction mechanisms.

Regardless of which branch prediction mechanism is employed in the BPT 102, the BHR 100 (alone or in combination with other information such as the branch instruction PC) indexes the BPT 102 to obtain branch predictions. By storing previous branch evaluations in the BHR 100 and using the evaluations in branch prediction, the branch instruction being predicted is correlated with past branch behavior — the behavior of its own past behavior in the case of a local BHR 100 and other branch instructions in the case of a global BHR 100. This correlation may be critical for accurate branch prediction, at least in the case of highly repetitive code.

Note that fig. 1 depicts a branch evaluation, i.e., the actual evaluation of a conditional branch instruction, stored in the BHR 100, which may only be known deep in the pipeline (e.g., in an execution pipe stage). While this is the final result, in practice, many high performance processors store the predicted branch evaluation from the BPT 102 in the BHR 100, and then correct the BHR 100 as part of an error prediction recovery operation when the prediction error is confirmed. For clarity, the drawings do not reflect such implementation features.

One common code structure that may reduce the efficacy of branch predictors employing BHR 100 is a loop. A loop ends with a conditional branch instruction that tests a loop end condition, such as whether the index variable incremented each time through the loop has reached a loop end value. If not, execution branches back to the beginning of the loop for another iteration and another loop-ending conditional branch evaluation. With respect to the n-bit BHR 100, there are three relevant cases with respect to cycling: the loop is not executed; the loop is executed through m iterations (where m < n); and executing the loop m times (where m ═ n).

If the loop is not executed, the forward branch at the beginning of the loop forms a branch on the loop body, resulting in a taken branch evaluation. This has minimal impact on the BHR 100, since the past branch evaluation history in the BHR 100 is replaced by only one branch evaluation (in fact, prediction accuracy may be improved by correlation with this branch evaluation).

If the loop executes through m iterations (where m > ═ n), the BHR 100 is saturated by the "taken" backward branch of the loop-ending branch instruction. That is, at the end of the loop, the n-bit BHR will always contain exactly n-1 ones followed by a single zero, which corresponds to a longer series of adopted evaluations resulting from loop iterations, and ends with a single unexploded evaluation at the end of the loop. This effectively compromises the efficacy of the BHR 100 because all of the correlations with the previous branch evaluation (for local or global BHR 100) are lost. In this case, the BHR 100 would likely map to the same BPT 102 entry to fetch a given branch instruction (depending on other inputs to the BPT indexing logic 104), rather than to an entry containing a branch prediction reflecting the relevance of the branch instruction to the previous branch evaluation.

Furthermore, a saturated BHR 100 may increase aliasing in the BPT 102. That is, if the BHR 100 directly indexes the BPT 102, all branch instructions after a loop with many iterations will map to the same BPT 102 entry. Even in the case where the BHR 100 is combined with other information, the possibility of confusion increases. This adversely affects prediction accuracy not only for branch instructions after a loop, but also for all branch instructions that are aliased to their entries in the BPT 102.

If the loop is executed through m iterations (where m < n), then the BHR 100 does not saturate and some previous branch evaluation history is maintained. However, the bits representing the history of previous branch evaluations are replaced by m bit positions. Especially in the case of m variations, this has two detrimental effects on branch prediction. First, branch instructions will map to a larger number of entries in the BPT 102 to capture the same correlation with previous branch evaluations, requiring a larger BPT 102 to support the same accuracy for the same number of branch instructions than would be required if the BHR 30 were not affected by the loop-ending branch. Second, the branch predictor in BPT 102 will take a longer time to "train," thereby increasing the amount of code that must be executed before BPT 102 begins to provide accurate branch predictions.

For example, consider an 8-bit BHR 100 and a code segment having branch instructions A-H, followed by a loop, and followed by branch instruction X. Branch X strongly correlates with the evaluation history of branches G and H. Various iterations of the inserted loop will produce the BHR results presented in table 1 below when X is predicted.

BHR								Note
BHR								Note	A	B	C	D	E	F	G	H	Loop execution once (not using initial forward or loop end backward branch)
B	C	D	E	F	G	H	1	Skipping cycles (taking one initial forward branch)	A	B	C	D	E	F	G	H
B	C	D	E	F	G	H	1	Skipping cycles (taking one initial forward branch)	C	D	E	F	G	H	1	0	2 iterations (branch taken once after loop end, then not taken)
D	E	F	G	H	1	1	0	3 iterations	C	D	E	F	G	H	1	0
D	E	F	G	H	1	1	0	3 iterations	E	F	G	H	1	1	1	0	4 iterations
F	G	H	1	1	1	1	0	5 iterations	E	F	G	H	1	1	1	0	4 iterations
F	G	H	1	1	1	1	0	5 iterations	G	H	1	1	1	1	1	0	6 iterations

Table 1: BHR 100 content after various numbers of loop iterations

In this example, there is in each case a desired correlation in the BHR 100 between the branch instruction X being predicted and the previous evaluation of branches G and H. However, they are located at different locations in the BHR 100, and thus each case will map to a different BPT 102 entry. This wastes BPT 102 space, increases branch prediction training time, and increases the likelihood of aliasing in BPT 102, all of which reduce prediction accuracy.

Disclosure of Invention

In one or more embodiments, the adverse effects of storing loop-ending branch instruction evaluations in a BHR are ameliorated by identifying loop-ending branch instructions and suppressing updates to the BHR in response to the loop-ending instructions. The loop-ending instruction is identified in a variety of ways.

In one embodiment, a branch prediction method includes, responsive to the nature of a branch instruction, optionally suppressing updates to the BHR upon execution of the branch instruction.

In another embodiment, a processor includes: a branch predictor operable to predict evaluation of conditional branch instructions; and an instruction execution pipeline operable to speculatively fetch and execute instructions based on a prediction from the branch predictor. The processor further includes: a BHR operable to store an evaluation of a conditional branch instruction; and control circuitry operable to suppress storing an evaluation of the conditional branch instruction responsive to a property of the branch instruction.

In yet another embodiment, a compiler or assembler operable to generate instructions in response to program code includes a loop-ending branch instruction marking function operable to indicate a conditional branch instruction that terminates a code loop.

Drawings

FIG. 1 is a functional block diagram of a prior art branch predictor circuit.

Fig. 2 is a functional block diagram of a processor.

FIG. 3 is a flow diagram of a method of executing a branch instruction.

FIG. 4 is a functional block diagram of a branch predictor circuit including one or more last branch PC registers.

Detailed Description

Fig. 1 depicts a functional block diagram of a processor 10. The processor 10 executes instructions in an instruction execution pipeline 12 in accordance with control logic 14. In some embodiments, the pipeline 12 may be a superscalar design, having multiple parallel pipelines. The pipeline 12 includes various registers or latches 16, organized in pipe stages, and one or more Arithmetic Logic Units (ALU) 18. A General Purpose Register (GPR) file 20 provides registers comprising the top of the memory hierarchy.

The pipeline 12 fetches instructions from an instruction cache (I-cache)22, with memory address translation and permissions managed by an instruction-side translation lookaside buffer (ITLB) 24. When a conditional branch instruction is decoded early in the pipeline 12, the branch predictor 26 predicts branch behavior and provides the prediction to the instruction prefetch unit 28. The instruction prefetch unit 28 speculatively fetches instructions from the instruction cache 22 at a branch target address calculated in the pipeline 12 for a "taken" branch prediction or at a next sequential address for a branch predicted to be "not taken". In either case, the prefetched instructions are loaded into the pipeline 12 for speculative execution.

The branch predictor 26 includes a Branch History Register (BHR)30, a Branch Predictor Table (BPT)32, BPT index logic 34, and BHR update logic 36. The branch predictor 26 may additionally include one or more last branch PC registers 38, described more fully herein below.

Data is accessed from a data cache (D-cache)40, with memory address translation and permissions managed by a main Translation Lookaside Buffer (TLB) 42. In various embodiments, the ITLB 24 may comprise a copy of part of the TLB 42. Alternatively, the ITLB 24 and TLB 42 may be integrated. Similarly, in various embodiments of the processor 10, the I-cache 22 and D-cache 40 may be integrated or merged. Misses in the I-cache 22 and/or the D-cache 40 cause an access to main (off-chip) memory 44, under the control of a memory interface 46.

The processor 10 may include an input/output (I/O) interface 46 that controls access to various peripheral devices 50. Those skilled in the art will recognize that many variations of the processor 10 are possible. For example, the processor 10 may include a second level (L2) cache for either or both the I-cache 22 and the D-cache 40. Moreover, one or more of the functional blocks depicted in the processor 10 may be omitted from a particular embodiment.

In accordance with one or more embodiments, branch prediction accuracy is improved by preventing loop-ending branches from corrupting one or more BHRs 30 in the branch predictor 26. This process is depicted in fig. 3 as a flow chart. The conditional branch instruction is decoded (block 52). A determination is made as to whether the branch is a loop-ending branch (block 54). If not, the BHR 30 is updated to record the branch evaluation (block 56), i.e., whether the branch instruction evaluates to "taken" or "not taken". Execution then continues at the branch target address or next sequential address, respectively (block 58). If the branch is not a loop-ending branch, updating the BHR 30 is suppressed to record the branch evaluation for the loop-ending branch instruction (as indicated by the path from block 54 to block 58). In this manner, loop iteration branches do not corrupt the contents of the BHR 30 by overriding the relevant branch evaluation history. The query (block 54) identifies the branch instruction as a loop-ending branch instruction, which may be accomplished in a variety of ways.

The loop iterates by branching from the end of the loop back to the beginning of the loop. According to one embodiment, conditional branch instructions (i.e., backward branches) that assume each branch target address is less than the branch instruction address or PC are loop-ending branch instructions and are prevented from updating the BHR 30. This embodiment provides the advantage of simplicity. At BHR 30 update time, the branch instruction PC is compared to the Branch Target Address (BTA) when the branch instruction is actually evaluated in the pipeline. If BTA < PC, the BHR 30 is not updated. This embodiment has the following disadvantages: address comparisons are required when determining branch target addresses, and some backward branches that are not loop-ending branches will not have their evaluations recorded in the BHR 30.

Another way to detect a loop-ending branch is to discern repeated execution of the same branch instruction. In one embodiment depicted in FIG. 4, a Last Branch PC (LBPC) register 38 stores the PC whose evaluation was the last branch instruction stored in the BHR 30. In the simple loop case, if the PC of the branch instruction matches the LBPC 38 (that is, the branch instruction is the last branch instruction evaluated), then the branch instruction is assumed to be a loop-ending branch instruction, and further updates to the BHR 30 are suppressed. As discussed above with reference to FIG. 1, while FIG. 4 depicts comparing the contents of the LBPC 38 to the actual branch evaluation in the BHR update logic 36 in any given implementation, the LBPC 38 may be compared to a predicted branch evaluation, with the BHR 30 corrected in the event of a misprediction. This embodiment stores only the first iteration of the loop, replacing only one previous branch evaluation from the BHR 30. This embodiment does not require compiler support and does not require the direction of the branch to be determined at BHR 30 update time.

A loop may contain one or more nested loops, or may contain other branches within the loop. In this case, the BHR 30 can be saturated by inhibiting the inner circulation by the LBPC method; however, the outer loop-ending branch will still be stored in the BHR 30. In one embodiment, two or more LBPC registers 38 may be provided, with the PCs of consecutively evaluated branch instructions being stored in the respective LBPC registers (LBPCs)₀，LBPC₁…LBPC_M)38, respectively. PC and LBPC if branch instruction_NAny of the registers 38 match, then updates to the BHR 30 may be suppressed.

Loop-ending branch instructions may also be statically marked by a compiler or assembler. In one embodiment, the compiler generates a particular type of branch instruction, such as "BRLP," for only loop-ending branches. The BRLP instruction is recognized, and the BHR 30 is never updated when the BRPE instruction evaluates in the execution pipe stage. In another embodiment, a compiler or assembler may embed a loop-ending branch indication in a branch instruction, such as by setting one or more predefined bits in the opcode. The loop-ending branch bit is detected and updates to the BHR 30 are suppressed when the branch instruction evaluates in the execution pipe stage. Static identification of loop-ending branches reduces hardware and computational complexity by moving the loop-ending identification function into a compiler or assembler.

Conditional branch instructions may have many properties including, for example, branch instruction address or PC, instruction type, and whether an indication bit is present in the opcode. As used herein, the nature of a branch operation and/or the nature of a program related to a branch is considered to be the nature of a branch instruction. For example, whether the branch instruction PC matches the contents of one or more LBPC registers 38 and whether the branch target address is forward or backward with respect to the branch instruction PC, are properties of the branch instruction.

Although the present invention has been described herein with respect to particular features, aspects and embodiments thereof, it will be apparent that numerous variations, modifications, and other embodiments are possible within the broad scope of the present invention, and accordingly, all variations, modifications and embodiments are to be regarded as being within the scope of the invention. The present embodiments are therefore to be construed in all aspects as illustrative and not restrictive and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.

Claims

1. A method of branch prediction, comprising:

updates to a Branch History Register (BHR) are optionally suppressed when executing the branch instruction, in response to the nature of the branch instruction.

2. The method of claim 1, wherein the nature of the branch instruction is that the branch is backward.

3. The method of claim 1, wherein a property of the branch instruction is that the branch is a loop-ending branch.

4. The method of claim 3, wherein the PC of the branch instruction matches contents of a Last Branch PC (LBPC) register of a PC storing a last branch instruction to update the BHR.

5. The method of claim 4, wherein the PC of the branch instruction matches the contents of any of a plurality of LBPC registers storing the PC of the last plurality of branch instructions to update the BHR.

6. The method of claim 3, wherein a property of the branch instruction is that the branch instruction is the only branch instruction generated by a compiler needle for ending a branch.

7. The method of claim 3, wherein a property of the branch instruction is that the branch instruction includes one or more bits indicating that it is a loop-ending branch instruction.

8. A processor, comprising:

a branch predictor operable to predict evaluation of conditional branch instructions;

an instruction execution pipeline operable to speculatively fetch and execute instructions based on a prediction from the branch predictor;

a Branch History Register (BHR) operable to store the evaluation of conditional branch instructions; and control circuitry operable to suppress storing the evaluation of the conditional branch instruction in response to a property of the branch instruction.

9. The processor of claim 8, further comprising a Last Branch PC (LBPC) register operable to store a PC of a branch instruction that updates the BHR, and wherein the control circuitry is operable to refrain from storing evaluation of the conditional branch instruction when the PC of the branch instruction matches contents of the LBPC register.

10. The method of claim 9, further comprising a plurality of LBPC registers operable to store a PC of a plurality of branch instructions that update the BHR, and wherein the control circuitry is operable to inhibit storing the evaluation of conditional branch instructions when the PC of the branch instruction matches the contents of any LBPC register.

11. The method of claim 8, wherein the control circuit is operable to suppress storing the evaluation of conditional branch instructions when the branch instruction includes an indication that it is an end-of-loop instruction.

12. The method of claim 11, wherein the indication that the branch instruction is a loop-ending instruction is an instruction type.

13. The method of claim 8 wherein the control circuitry is operable to suppress storing the evaluation of conditional branch instructions when a branch instruction target address is less than the branch instruction PC.

14. A compiler or assembler, comprising:

a compiler or assembler operable to generate instructions in response to program code; and

a loop-ending branch instruction marking function operable to indicate a conditional branch instruction that terminates the code loop.

15. The compiler or assembler of claim 14 wherein the loop-ending branch instruction marking function is operable to generate a unique type of branch instruction to end each loop.

16. The compiler or assembler of claim 14 wherein the loop-ending branch instruction marking function is operable to insert a loop-ending indicator in each conditional branch instruction that ends a loop.

17. The compiler or assembler of claim 16 wherein the loop-over indicator comprises one or more bits inserted in a predetermined field in the conditional branch instruction opcode.

18. A method of branch prediction using a Branch History Register (BHR) that stores evaluations of previous conditional branch instructions, comprising:

detecting a loop ending branch; and

updating of the BHR that will store the evaluation of the associated branch instruction is suppressed.

19. The method of claim 18, wherein detecting a loop-ending branch comprises detecting a match between a PC of the associated branch instruction and contents of a last branch PC (lbpc) register storing a PC of a last branch instruction to update the BHR.

20. The method of claim 18, wherein detecting a loop-ending branch comprises detecting a match between a PC of the associated branch instruction and contents of any of a plurality of LBPC registers storing a PC of a last plurality of branch instructions to update the BHR.

21. The method of claim 18, wherein detecting a loop-ending branch comprises decoding a unique branch instruction generated by a compiler for ending a branch.

22. The method of claim 18, wherein detecting a loop-ending branch comprises detecting one or more bits in the associated branch instruction opcode indicating that it is a loop-ending branch instruction.