
CN111984325B - Device and system for improving branch prediction throughput - Google Patents


Info

Publication number
CN111984325B
Authority
CN
China
Prior art keywords
branch
instruction
memory segment
branch instruction
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010439722.2A
Other languages
Chinese (zh)
Other versions
CN111984325A (en)
Inventor
M.S.S.戈文丹
邹浮舟
A.恩戈
W.T.昌瓦特斋
M.特卡奇克
G.D.祖拉斯基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority claimed from US 16/561,004 (US11182166B2)
Application filed by Samsung Electronics Co Ltd
Publication of CN111984325A
Application granted
Publication of CN111984325B

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3802: Instruction prefetching
    • G06F 9/3804: Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F 9/3806: Instruction prefetching for branches using address prediction, e.g. return stack, branch history buffer
    • G06F 9/3814: Implementation provisions of instruction buffers, e.g. prefetch buffer; banks

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

According to one general aspect, an apparatus may include a branch prediction circuit configured to predict whether a branch instruction will be taken or not taken. The apparatus may include a branch target buffer circuit configured to store a memory segment empty flag indicating whether a memory segment following a target address includes at least one other branch instruction, wherein the memory segment empty flag is created at a commit stage of a prior occurrence of the branch instruction. The branch prediction circuit may be configured to skip the memory segment if the memory segment empty flag indicates the absence of other branch instructions.

Description

Device and system for improving branch prediction throughput
Technical Field
The present disclosure relates to processor instruction flow and, more particularly, to improving branch prediction throughput by skipping cache lines that contain no branches.
Background
In a computer architecture, a branch predictor or branch prediction unit is a digital circuit that attempts to guess where a branch (e.g., if-then-else structure, jump instruction) will go before the result is actually computed and known. The goal of a branch predictor is typically to improve flow in an instruction pipeline. In many modern pipelined microprocessor architectures, branch predictors play a critical role in achieving high performance.
A conditional jump instruction is typically utilized to implement a bidirectional branch. The conditional jump may be "not taken", continuing execution with the first code segment immediately following the conditional jump, or "taken", jumping to a different location in program memory where the second code segment is stored. Whether a conditional jump will be taken or not taken is often uncertain until the condition has been calculated and the conditional jump has passed through the execution stage of the instruction pipeline.
Without branch prediction, the processor would typically have to wait for a conditional jump instruction to pass through the execution stage before the next instruction could enter the fetch stage of the pipeline. A branch predictor attempts to avoid this waste of time by guessing whether the conditional jump is most likely to be taken or not taken. The instructions at the guessed destination of the branch are then fetched and speculatively executed. If the execution stage later detects that the guess was wrong, the speculatively or partially executed instructions are typically discarded and the pipeline is restarted from the correct branch, resulting in a delay.
Disclosure of Invention
According to one general aspect, an apparatus may include a branch prediction circuit configured to predict whether a branch instruction is taken or not taken. The apparatus may include a branch target buffer circuit configured to store a memory segment empty flag indicating whether a memory segment following a target address includes at least one other branch instruction, wherein the memory segment empty flag is created at a commit stage of a prior occurrence of the branch instruction. The branch prediction circuit may be configured to skip the memory segment if the memory segment empty flag indicates the absence of other branch instructions.
According to another general aspect, an apparatus may include a branch detection circuit configured to detect a presence of at least one branch instruction stored within a portion of a memory segment during a commit phase of a current instruction. The apparatus may include a branch target buffer circuit configured to store a branch instruction address and a memory segment empty flag indicating whether a portion of a memory segment following the target address includes at least one other branch instruction.
According to another general aspect, a system may include a branch detection circuit configured to detect a presence of at least one branch instruction stored within a portion of a memory segment during a commit stage of a current instruction. The system may include a branch target buffer circuit configured to store a branch instruction address and a memory segment empty flag indicating whether a portion of a memory segment following the target address includes at least one other branch instruction. The system may include a branch prediction circuit configured to predict whether a branch instruction is taken or not taken, wherein the branch prediction circuit is configured to skip the memory segment if the associated memory segment empty flag indicates an absence of branch instructions.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
A system and/or method for processor instruction flow, substantially as shown in and/or described in connection with at least one of the figures and as set forth more completely in the claims, and more particularly for improving branch prediction throughput by skipping cache lines without branches.
Drawings
FIG. 1 is a block diagram of an example embodiment of a system in accordance with the disclosed subject matter.
FIG. 2 is a block diagram of an example embodiment of a data structure in accordance with the disclosed subject matter.
FIG. 3 is a diagram of an example embodiment of a data structure in accordance with the disclosed subject matter.
FIG. 4 is a block diagram of an example embodiment of a system in accordance with the disclosed subject matter.
FIG. 5 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter.
FIG. 6 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter.
FIG. 7 is a schematic block diagram of an information handling system that may include devices formed in accordance with the principles of the disclosed subject matter.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
Various example embodiments are described more fully hereinafter with reference to the accompanying drawings, in which some example embodiments are shown. The subject matter of the present disclosure may, however, be embodied in many different forms and should not be construed as limited to the example embodiments set forth herein. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosed subject matter to those skilled in the art. In the drawings, the size and relative sizes of layers and regions may be exaggerated for clarity.
It will be understood that when an element or layer is referred to as being "on," "connected to" or "coupled to" another element or layer, it can be directly on, connected or coupled to the other element or layer, or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element or layer, there are no intervening elements or layers present. Like numbers refer to like elements throughout. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another region, layer or section. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the presently disclosed subject matter.
Spatially relative terms, such as "under", "below", "over", "above", "over" and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the exemplary term "below" may include both an orientation above and below. The device may be otherwise oriented (rotated 90 degrees or other orientations) and the spatially relative descriptors used herein interpreted accordingly.
Also, for ease of description, electrical terms such as "high," "low," "pull-up," "pull-down," "1," "0," etc. may be used herein to facilitate description of other voltage levels or another element(s) or feature(s) relative to voltage levels or currents, as shown. It will be appreciated that the electrically relative terms are intended to encompass different reference voltages in use or operation of the device in addition to the voltages or currents depicted in the figures. For example, if the device or signal in the figure is inverted or other reference voltages, currents or charges are used, then the element described as "high" or "pull-up" will be "low" or "pull-down" as compared to the new reference voltage or current. Thus, the exemplary term "high" may encompass relatively low or high voltages or currents. Otherwise, the device may be based on a different electrical reference frame and interpret the electrical relative descriptors used herein accordingly.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting of the subject matter of the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Example embodiments are described herein with reference to cross-sectional views, which are schematic illustrations of idealized example embodiments (and intermediate structures). As such, variations in the shapes of the illustrations as a result, for example, of manufacturing techniques and/or tolerances, are to be expected. Thus, example embodiments should not be construed as limited to the particular shapes of regions illustrated herein but are to include deviations in shapes that result, for example, from manufacturing. For example, an implanted region shown as a rectangle will typically have rounded or curved features and/or implant concentration gradients at its edges rather than a binary change from implanted to non-implanted regions. Also, an implanted region formed by implantation may result in some implantation in the region between the implanted region and the surface through which implantation occurs. Thus, the regions illustrated in the figures are schematic in nature and their shapes are not intended to illustrate the actual shape of a region of a device and are not intended to limit the scope of the presently disclosed subject matter.
Unless defined otherwise, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the presently disclosed subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Hereinafter, example embodiments will be explained in detail with reference to the accompanying drawings.
FIG. 1 is a block diagram of an example embodiment of a system 100 in accordance with the disclosed subject matter. In various embodiments, system 100 may comprise a computer, a plurality of discrete integrated circuits, or a system on a chip (SoC). As described below, the system 100 may include many other components that are not shown in this figure so as not to obscure the disclosed subject matter.
In the illustrated embodiment, the system 100 includes a system memory 104. In various embodiments, system memory 104 may be comprised of Dynamic Random Access Memory (DRAM). It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect. In such embodiments, the system memory 104 may comprise on-module memory (e.g., a dual in-line memory module (DIMM)), may be an integrated chip that is soldered or otherwise fixedly integrated with the system 100, or may even be incorporated as part of an integrated chip (e.g., SoC) that includes the system 100. It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited thereto.
In the illustrated embodiment, the system memory 104 may be configured to store data segments or information. These data segments may include instructions that cause processor 102 to perform various operations. In general, system memory 104 may be part of a larger memory hierarchy including multiple caches. In various embodiments, the operations described herein may be performed by another layer or level of the memory hierarchy, such as a level 2 (L2) cache. Those skilled in the art will appreciate that while operations are described with reference to system memory 104, the disclosed subject matter is not limited to this illustrative example.
In the illustrated embodiment, the system 100 also includes a processor 102. The processor 102 may be configured to perform a plurality of operations indicated by the various instructions. These instructions may be executed by various execution units (most not shown), such as an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a load/store unit (LSU), an instruction fetch unit 116 (IFU), and the like. It is understood that the units are merely a collection of circuits that are combined together to perform a portion of the functionality of the processor 102. Typically, the units perform one or more operations in a pipelined architecture of the processor 102.
In the illustrated embodiment, the processor 102 may include a Branch Prediction Unit (BPU) 112. As described above, when the processor 102 is executing an instruction stream, some of the instructions may be branch instructions. A branch instruction is an instruction that causes an instruction stream to branch or diverge between two or more paths. A typical example of a branch instruction is an if-then structure, in which a first instruction set is executed if a certain condition is satisfied (e.g., the user clicks the "OK" button) and a second instruction set is executed if the condition is not satisfied (e.g., the user clicks the "Cancel" button). As described above, this is a problem in pipelined processor architectures because new instructions must enter the pipeline of the processor 102 before the outcome of the branch, jump, or if-then structure is known (because the pipeline stage that resolves the branch instruction is located deep in the pipeline). Thus, either new instructions must be prevented from entering the pipeline until the branch instruction is resolved (negating the major advantage of the pipelined architecture), or the processor 102 must guess which way the instruction stream will branch and speculatively place those instructions into the pipeline. The BPU 112 may be configured to predict how the instruction stream will branch. In the illustrated embodiment, the BPU 112 may be configured to output the predicted instruction or, more precisely, the memory address where the predicted instruction is stored.
In the illustrated embodiment, processor 102 includes a Branch Prediction Address Queue (BPAQ) 114. The BPAQ 114 may include a memory structure configured to store a plurality of addresses of instructions that have been predicted by the BPU 112. The BPAQ 114 may store the addresses of these predicted instructions in first-in-first-out (FIFO) order, such that the instruction addresses are output from the BPAQ 114 in the same order in which the BPU 112 predicted them.
In the illustrated embodiment, the processor 102 includes an Instruction Fetch Unit (IFU) 116 configured to fetch instructions from the memory hierarchy and place them into the pipeline of the processor 102. In such embodiments, the IFU 116 may be configured to take, from the BPAQ 114, the memory address associated with the oldest entry (the next instruction) and request the actual instruction from the memory hierarchy. Ideally, the instruction is provided quickly by the memory hierarchy and placed into the pipeline of the processor 102.
Ideally, instructions may be fetched (via a memory access or accesses) from the level 1 (L1) instruction cache 118. In such embodiments, as an upper level of the memory hierarchy, the L1 instruction cache 118 may be relatively fast and incur little or no pipeline delay. Occasionally, however, the L1 instruction cache 118 may not contain the desired instruction. This results in a cache miss, and the instruction must be fetched or loaded from a lower, slower level of the memory hierarchy (e.g., system memory 104). Such cache misses may delay the pipeline of the processor 102, because instructions will not be fed into the pipeline at a rate of one per cycle (or whatever the maximum rate of the processor architecture is).
In the illustrated embodiment, the processor 102 includes an instruction prefetch unit (IPFU) 120. The IPFU 120 is configured to prefetch instructions into the L1 instruction cache 118 before the actual fetch operation performed by the IFU 116, thereby reducing the cache misses experienced by the IFU 116. The IPFU 120 may do this by requesting a predicted instruction from the L1 instruction cache 118 before the IFU 116 does. In such an embodiment, if a cache miss occurs, the L1 instruction cache 118 will begin requesting the missing instruction from the system memory 104, so that the instruction may already have been received and stored in the L1 instruction cache 118 by the time the IFU 116 requests it.
Returning to the BPU 112, the processor 102 may include a Branch Target Buffer (BTB) circuit 122. In various embodiments, the BTB 122 may include memory that maps branch addresses to previously predicted target addresses (the addresses to which the branches will jump). In such embodiments, the BTB 122 may indicate the address to which the previous iteration of a branch instruction jumped, or was predicted to jump. This makes the operation of the BPU 112 simpler and faster, because the BPU 112 can simply request a predicted branch target address from the BTB 122 instead of performing a complete address prediction calculation.
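The mapping described above can be sketched in software. The following is a minimal, hypothetical model of a BTB as a lookup from branch address to last predicted target; the class and method names are our own, not the patent's, and a real BTB is a fixed-size, set-associative hardware table rather than an unbounded dictionary.

```python
# Hypothetical software sketch of a branch target buffer (BTB):
# a lookup from a branch instruction's address to its last
# observed or predicted target address.

class BranchTargetBuffer:
    def __init__(self):
        self._entries = {}  # branch address -> predicted target address

    def update(self, branch_addr, target_addr):
        # Record (or refresh) the target observed for this branch.
        self._entries[branch_addr] = target_addr

    def lookup(self, branch_addr):
        # Return the previously seen target, or None on a BTB miss,
        # in which case the predictor falls back to a full calculation.
        return self._entries.get(branch_addr)

btb = BranchTargetBuffer()
btb.update(0x4000, 0x5200)
assert btb.lookup(0x4000) == 0x5200
assert btb.lookup(0x4004) is None  # miss: no prior prediction recorded
```

On a hit, the BPU can redirect fetch immediately; on a miss, it would compute the target the slow way and install it for next time.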
Likewise, the processor 102 may include a Return Address Stack (RAS) circuit 124. In various embodiments, the RAS 124 may be a memory or data structure that stores the memory address to which execution returns once the current branch operation or instruction (typically a return instruction) has completed. For example, when the branch is a subroutine call, the subroutine, once completed, returns to the next instruction after the call's memory address. In various embodiments, the RAS calculation circuitry 126 may perform this return address calculation.
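The call/return behavior of the RAS can be sketched similarly. This is an illustrative model only; the fixed 4-byte instruction size and the method names are assumptions, and a hardware RAS has a fixed depth.

```python
# Illustrative return address stack (RAS): a call pushes the
# fall-through address (the instruction after the call); a
# return pops it, in last-in-first-out (LIFO) order.

class ReturnAddressStack:
    def __init__(self):
        self._stack = []

    def push_call(self, call_addr, instr_size=4):
        # Assumed fixed instruction size; the return target is the
        # instruction immediately after the call.
        self._stack.append(call_addr + instr_size)

    def pop_return(self):
        # Predicted return target, or None if the stack is empty.
        return self._stack.pop() if self._stack else None

ras = ReturnAddressStack()
ras.push_call(0x1000)
ras.push_call(0x2000)  # nested call
assert ras.pop_return() == 0x2004  # inner return first (LIFO)
assert ras.pop_return() == 0x1004
```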
Having shown the basic structure of the processor 102, FIG. 2 illustrates operations performed by the processor 102.
FIG. 2 is a block diagram of an example embodiment of a data structure 200 in accordance with the disclosed subject matter. In various embodiments, data structure 200 may represent memory storage of various instructions to be fetched and processed by processor 102 of FIG. 1.
In this context, the general term for a block or portion of memory is a memory "segment". For purposes of example, a memory segment may be a cache line, although in particular embodiments a segment may be larger or smaller. In this context, a cache line may be the unit of data transfer between the L1 instruction cache 118 and main memory (e.g., system memory 104). In various embodiments, the disclosed subject matter may relate to memory segments of multiple cache lines, portions of cache lines, or memory sizes that are not measured in cache lines at all. It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect.
In the illustrated embodiment, data structure 200 includes sequentially occurring cache lines 204 and 206. In such embodiments, as described above, the processor 102 typically fetches and processes instructions from the beginning (e.g., left) of the cache lines 204 and 206 to the end (e.g., right) of the cache lines 204 and 206.
Branch instructions A 211, B 212, C 213, D 214, E 215, F 216, and G 217 are included in the cache lines. In various embodiments, the BPU 112 of FIG. 1 may be configured to process each branch instruction (each considered a subroutine call for simplicity) and, as each branch returns to that point, continue to process the cache lines in order.
The BPU 112 may be configured to stop processing (for a clock cycle or cycles) at memory segment or cache line boundaries. For example, in processing cache line 204, the BPU 112 may process A 211 in a first cycle, B 212 in a second cycle, C 213 in a third cycle, and D 214 in a fourth cycle, then check portion 224 in a fifth cycle, stopping at the end of cache line 204 before moving on to E 215 of cache line 206 in a sixth cycle.
Since there are no branches to process in portion 224 (as opposed to portion 222), the cycle spent checking it (or the many cycles spent processing portion 224) is wasted. In various embodiments, portion 224 may comprise a complete cache line. The disclosed subject matter may eliminate or reduce such branch pipeline bubbles (cycles during which no useful prediction work occurs).
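As a rough illustration of the waste, a toy cycle model can count one cycle per branch processed plus, without the empty flag, one extra cycle spent confirming that a line's branchless tail holds no branches. The cycle accounting is illustrative only and does not model any particular pipeline.

```python
# Toy cycle model of the waste described above. Each cache line is
# represented as the list of branches it contains; every line here is
# assumed to end in a branchless tail (like portion 224 in FIG. 2).

def cycles_to_scan(lines, use_empty_flag):
    """Cycles the predictor spends scanning the given cache lines."""
    cycles = 0
    for branches in lines:
        cycles += len(branches)  # one cycle per branch processed
        if not use_empty_flag:
            cycles += 1          # extra cycle checking the branchless tail
    return cycles

# Cache line 204 holds branches A-D; line 206 holds E-G (as in FIG. 2).
lines = [["A", "B", "C", "D"], ["E", "F", "G"]]
assert cycles_to_scan(lines, use_empty_flag=False) == 9
assert cycles_to_scan(lines, use_empty_flag=True) == 7  # two bubbles removed
```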
In the disclosed subject matter, the BTB 122 and/or the RAS 124 may include an indication of whether portion 224, or more generally the portion following any given branch target, is empty of (devoid of) branch instructions. In such an embodiment, "empty" does not mean that no instructions are stored there, only that no branch instructions are stored in the memory segment. It is expected (but not required) that many non-branch instructions will fill portion 224.
For example, branch 202 (e.g., the return branch from call D 214) may return the program counter (PC) to the end of portion 222. After this return, the BPU 112 may examine the RAS 124 and determine that there are no more branch instructions after D 214 (i.e., none in portion 224). The BPU 112 may then begin processing the next cache line 206, saving the wasted computation time involved in checking portion 224 for branches.
Similarly, the BTB 122 may include a flag that indicates whether the memory segment following the target address of a branch holds no additional branch instructions. In such embodiments, if branch 202 is not a return (from a call) but another type of branch instruction (e.g., a call, an unconditional jump, etc.), the BTB 122 may include the target address (e.g., the address of the beginning of portion 224) and an indication of whether there are additional branch instructions between the target address and the end of the cache line (i.e., in portion 224).
FIG. 3 is a diagram of an example embodiment of data structures 300 and 301 in accordance with the disclosed subject matter. In such an embodiment, the data structure 300 may be stored by a branch target buffer (e.g., BTB 122 of fig. 1). In various embodiments, the data structure 301 may be stored by a return address stack (e.g., the RAS 124 of FIG. 1). It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited thereto.
In the illustrated embodiment, data structure 300 may illustrate a representative embodiment of the state of a BTB. In such embodiments, the BTB may include at least three columns or fields (although more may be used in various embodiments). The first field 302 includes the address (or other identifier) of the branch instruction. The second field 304 may include the predicted target address of the branch (i.e., the address to which the branch is expected to jump). In a conventional BTB, these two fields 302 and 304 may be the only columns or fields, aside from a valid flag (not shown) noting whether a row or entry is in use.
In such an embodiment, when the BPU encounters a branch instruction, it looks the instruction up via its memory address (first field 302) and determines where in memory to find the next instruction (via second field 304). As described above, in such embodiments, upon reaching the target address, the BPU may waste one or more cycles searching for a branch instruction that is not there (i.e., the memory segment starting at the target address is empty of branch instructions).
However, in the illustrated embodiment, the BPU may be configured to check the third field, the empty flag 306. In such embodiments, the empty flag 306 may indicate whether the memory segment containing the target address is empty of branch instructions. In various embodiments, the value of the empty flag 306 may be calculated the first time the branch instruction is encountered. In some embodiments, this may be done at the commit pipeline stage, when the correctness (or incorrectness) of the branch is fully resolved.
In various embodiments, the memory segment empty flag 306 may comprise a single bit or true/false value. In such an embodiment, the empty flag 306 may refer only to the immediate memory segment that includes the target address. In another embodiment, the empty flag 306 may indicate how many memory segments should be skipped. For example, the last row of data structure 300 has a value of 3, indicating that the current memory segment plus two further memory segments have no branch instructions.
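A sketch of how such a skip count might be consumed by the predictor, under the assumption of 64-byte memory segments; the constant, the encoding convention (0 meaning "no information"), and the function name are ours, not the patent's.

```python
# Sketch: the empty flag as a skip count. 0 means "unknown, do not
# skip"; 1 means the segment containing the target is branchless;
# N means the current segment plus N-1 following segments are
# branchless. Segment size is an assumed 64 bytes (one cache line).

SEGMENT_SIZE = 64  # bytes per memory segment

def next_scan_address(target_addr, skip_count):
    """Address where the predictor should resume scanning for branches."""
    if skip_count == 0:
        return target_addr  # no information: scan the target's own segment
    # Align down to the segment base, then step past skip_count segments.
    base = target_addr - (target_addr % SEGMENT_SIZE)
    return base + skip_count * SEGMENT_SIZE

assert next_scan_address(0x1010, 0) == 0x1010
assert next_scan_address(0x1010, 1) == 0x1040  # skip rest of current segment
assert next_scan_address(0x1010, 3) == 0x10C0  # skip current plus two more
```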
In another embodiment, the empty flag 306 may include a valid flag. In yet another embodiment, the valid flag may be stored as a separate field (not shown). In such embodiments, the valid flag may indicate whether the empty flag 306 has been calculated and may be relied upon. For example, an entry may be placed in the BTB during an instruction fetch pipeline stage, but the empty flag 306 may not be computed until the commit stage. In another example, the empty flag 306 may be valid only for branches predicted to be "taken" and invalid for branches predicted to be "not taken" (or vice versa). In yet another embodiment, the empty flag 306 may be valid only for certain types of branches (e.g., calls and returns). It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited in this respect.
In such an embodiment, the empty flag 306 may be extended by one bit. A valid and true (or set) empty flag may then be encoded as binary "11", while a valid and false (or cleared) empty flag may be "10", with the first bit being the valid bit and the second bit being the empty state. It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect.
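The two-bit encoding described above (a valid bit plus the empty state) can be captured with a pair of small helpers; the helper names and the convention that the valid bit is the high bit are our own.

```python
# Two-bit empty flag: high bit = valid, low bit = empty state.
# Binary 11 = valid and empty; 10 = valid and not empty;
# 0x (high bit clear) = invalid, so the flag must be ignored.

def encode_empty_flag(valid, empty):
    return (int(valid) << 1) | int(empty)

def is_segment_skippable(flag):
    # Skip only when the flag is both valid and set.
    return flag == 0b11

assert encode_empty_flag(True, True) == 0b11
assert encode_empty_flag(True, False) == 0b10
assert not is_segment_skippable(encode_empty_flag(False, True))  # invalid: no skip
assert is_segment_skippable(0b11)
```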
In the illustrated embodiment, the data structure 301 may illustrate a representative embodiment of the state of the RAS. In such embodiments, the RAS may include at least two columns or fields (although more may be used in various embodiments). Field 312 includes the return address (or other identifier) to which the calling branch instruction will return. In a conventional RAS, field 312 may be the only column or field, aside perhaps from a valid flag (not shown) noting whether a row, line, or entry may be used. Conventionally, return addresses are pushed onto the top of the data structure 301 and then popped from the top in a last-in, first-out (LIFO) manner.
In the illustrated embodiment, the BPU may be configured to check a second field, the empty flag 316. In such an embodiment, the empty flag 316 may indicate whether the portion of the memory segment past the target address (field 312) of the return instruction has no branch instruction, as described above. In various embodiments, the value of the empty flag 316 may be calculated the first time a call branch instruction is encountered. In various embodiments, the empty flag 316 may be similar to the flags described above, although the empty flag 306 of the BTB and the empty flag 316 of the RAS may differ in format or information.
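A minimal software sketch of a RAS whose entries carry the empty flag alongside the return address (the class, field names, and overflow policy are assumptions for illustration):

```python
from collections import namedtuple

RasEntry = namedtuple("RasEntry", ["return_addr", "empty"])

class ReturnAddressStack:
    """LIFO stack of return addresses, each tagged with an empty flag."""

    def __init__(self, depth: int = 16):
        self.depth = depth
        self.entries = []

    def push(self, return_addr: int, empty: bool) -> None:
        # On a call: record the return target and whether the rest of
        # its memory segment is known to hold no branch instruction.
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # drop the oldest entry on overflow
        self.entries.append(RasEntry(return_addr, empty))

    def pop(self) -> RasEntry:
        # On a return: the predictor reads both the target and the flag.
        return self.entries.pop()

ras = ReturnAddressStack()
ras.push(0x4000, empty=True)
entry = ras.pop()
assert entry.return_addr == 0x4000 and entry.empty
```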
Fig. 4 is a block diagram of an example embodiment of a system 400 in accordance with the disclosed subject matter. In various embodiments, system 400 may comprise a computer, a plurality of discrete integrated circuits, or a system on a chip (SoC). As described below, the system 400 may include many other components that are not shown in this figure so as not to obscure the disclosed subject matter.
In the illustrated embodiment, system 400 includes system memory 104. In various embodiments, system memory 104 may comprise Dynamic Random Access Memory (DRAM). It should be understood, however, that the foregoing is merely one illustrative example and that the disclosed subject matter is not limited in this respect. In such embodiments, the system memory 104 may comprise on-module memory (e.g., a dual in-line memory module (DIMM)), may be an integrated chip that is soldered or otherwise fixedly integrated with the system 400, or may even be incorporated as part of an integrated chip (e.g., an SoC) that includes the system 400. It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited thereto.
In the illustrated embodiment, the system memory 104 may be configured to store data segments or information. These data segments may include instructions that cause processor 102 to perform various operations. In general, system memory 104 may be part of a larger memory hierarchy including multiple caches. In various embodiments, the operations described herein may be performed by another layer or level of the memory hierarchy, such as a level 2 (L2) cache. Those skilled in the art will appreciate that while operations are described with reference to system memory 104, the disclosed subject matter is not limited to this illustrative example.
In the illustrated embodiment, the system 400 also includes a processor 102. The processor 102 may be configured to perform a plurality of operations indicated by the various instructions. These instructions may be executed by various execution units (most not shown), such as an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a load/store unit (LSU), an instruction fetch unit 116 (IFU), and the like. It is understood that the units are merely a collection of circuits that are combined together to perform a portion of the functionality of the processor 102. Typically, the units perform one or more operations in a pipelined architecture of the processor 102.
In various embodiments, the processor 102 may operate in various pipeline stages. In computing, a pipeline, also called a data pipeline, is a collection of data processing elements connected in series, where the output of one element is the input of the next. The elements of a pipeline are typically executed in parallel or in a time-sliced fashion, with a certain amount of buffer storage often inserted between elements.
In a classical Reduced Instruction Set Computer (RISC) pipeline, the stages include instruction fetch (most of which is shown in fig. 1), instruction decode, execute, memory access, and write back. In modern out-of-order and speculative execution processors, the processor 102 may execute instructions that turn out not to be needed. The pipeline stage in which it is determined whether an instruction (or its result) is needed is called the commit stage. If the commit stage is forced into the Procrustean bed of the classical RISC pipeline, it may be placed in the write-back stage. In various embodiments or architectures, the commit stage may be a separate pipeline stage.
In the illustrated embodiment, the processor 102 may include an execution unit 402 as described above. In the illustrated embodiment, the processor 102 may include a commit queue 404 in which completed instructions are placed in order of age.
In the illustrated embodiment, the processor 102 may include a register file 406. In such embodiments, when instructions are committed (rather than discarded), the results of those instructions may be placed or committed into the register file 406. In modern computers with register renaming, the commit action may include verifying or marking a value already stored in the register file 406 as correct. In various embodiments, the processor may include a cache 418 (e.g., a data cache) to which register file data is eventually moved, and from there to the system memory 104, as described above.
Further, in the illustrated embodiment, the processor 102 may include a branch detection circuit 420. In such embodiments, the branch detection circuit 420 may be configured to detect the presence of at least one branch instruction stored within a portion of a memory segment (e.g., a cache line) during the commit stage of a current instruction.
In such embodiments, once the branch detection circuit 420 has determined whether the memory segment portion contains any branch instructions, it may create or update a memory segment empty flag in the BTB 122, as described above. In various embodiments, this may include setting or clearing the empty flag associated with the branch instruction.
In some embodiments, the processor 102 or branch detection circuit 420 may include a last branch memory 422 that stores the last or current branch instruction encountered from the commit queue 404. In such embodiments, the last branch memory 422 may indicate the branch instruction associated with the empty flag currently being computed. In various embodiments, the last branch memory 422 may be active (an empty flag is being computed) or inactive (no empty flag is being computed).
In various embodiments, BTB 122 may be graph-based. In such embodiments, branches may be stored as nodes, and edges may represent control flows of a program or instruction set. In various embodiments, the disclosed subject matter may be limited to a first level BTB of a multi-level or hierarchical BTB structure. It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect.
In various embodiments, some designs define instruction blocks as instruction sequences ending in a branch. In such embodiments, BTB 122 may look up or index a branch based on the start address of the block rather than the actual address of the branch instruction, and the disclosed subject matter may be modified accordingly. In addition, the BTB metadata may be enhanced to store how many empty cache lines or memory segments may be skipped before the next branch instruction is encountered. It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited thereto.
In various embodiments, a Branch Target Buffer (BTB) may be configured to store metadata associated with a branch instruction, e.g., an empty flag. A Branch Prediction Pipeline (BPP) may be configured to detect branch instructions whose target cache line is partially or completely empty and skip branch prediction for any empty target cache line. In various embodiments, the BPP may do so by training on committed instruction cache lines. The BPP may mark a taken branch instruction whose target cache line is empty by setting the taken target cache line empty flag. The BPP may mark a not taken branch instruction by setting the not taken target cache line empty flag to true in the BTB entry of the branch instruction. The BPP may examine the BTB entry or Return Address Stack (RAS) entry of a branch instruction to determine whether the target cache line empty flag is set. If the target cache line empty flag is set, the BPP may skip branch prediction for one or more instruction cache lines of the target cache line.
FIG. 5 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter. In various embodiments, the technique 500 may be used or generated by a system such as that of fig. 4 or 7. It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited in this respect. It should also be appreciated that the disclosed subject matter is not limited by the order or number of actions illustrated by technique 500.
In various embodiments, technique 500 may illustrate an embodiment of a technique employed by a processor or branch detection unit for determining the correct state of a memory segment empty flag, as described above. In the illustrated embodiment, technique 500 may be specific to taken branches. In another embodiment, a technique may be employed for not-taken branches. In yet another embodiment, techniques may be employed for both taken and not-taken branches and/or various types of branch instructions (e.g., call, return, unconditional jump, conditional jump, jump on zero or other values, etc.). It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect.
Block 502 illustrates that, in one embodiment, a commit instruction may be checked to determine if it is a branch instruction. As described above, commit instructions may be provided by or stored in a commit queue that holds branch instructions or non-branch instructions in chronological order. In such embodiments, non-branch instructions may be grouped by the memory segment from which they come.
Block 504 illustrates that, in one embodiment, if the commit instruction is a branch instruction, the branch instruction (or its address) may be stored in the last branch memory, as described above. In various embodiments, the last branch memory may be marked as valid or marked as storing an address for empty flag determination.
Block 506 illustrates that in one embodiment, if the commit instruction is not a branch instruction, a check may be made to determine whether the last branch memory is valid or active.
Block 508 illustrates that, in one embodiment, if the commit instruction is not a branch instruction and the last branch memory value is valid, a null flag associated with the branch stored in the last branch memory may be set to a value indicating that the remainder of the memory segment does not contain a branch instruction. As described above, a null flag may be stored in the BTB.
Block 510 illustrates that, in one embodiment, if the commit instruction is not a branch instruction, the last branch memory value may be invalidated or marked as inactive. In various embodiments, block 510 may be skipped if the result of block 506 indicates that the last branch memory value has been invalidated.
Block 599 shows a stopping point. However, it should be appreciated that the technique 500 may be repeated for each commit instruction.
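The commit-stage flow of blocks 502-510 can be sketched as a simplified software model. The class and field names are assumptions, and the model treats each non-branch commit entry as a whole group of non-branch instructions from one memory segment, per the grouping described at block 502:

```python
class BranchDetector:
    """Commit-stage training of memory segment empty flags (blocks 502-510)."""

    def __init__(self):
        self.btb_empty_flags = {}   # branch address -> empty flag
        self.last_branch = None     # last committed branch address
        self.last_branch_valid = False

    def on_commit(self, addr: int, is_branch: bool) -> None:
        if is_branch:
            # Blocks 502/504: remember this branch for later training.
            self.last_branch = addr
            self.last_branch_valid = True
        elif self.last_branch_valid:
            # Blocks 506/508: a branch-free segment group committed after
            # a branch, so the segment past that branch's target held no
            # branch instruction; set its empty flag in the BTB.
            self.btb_empty_flags[self.last_branch] = True
            self.last_branch_valid = False  # block 510

det = BranchDetector()
det.on_commit(0x100, is_branch=True)
det.on_commit(0x104, is_branch=False)
assert det.btb_empty_flags[0x100] is True
```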
FIG. 6 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter. In various embodiments, technique 600 may be used or generated by a system such as that of fig. 1 or 7. It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited in this respect. It should also be appreciated that the disclosed subject matter is not limited by the order or number of actions illustrated by technique 600.
In various embodiments, technique 600 may illustrate an embodiment of a technique employed by a processor or branch prediction unit to determine whether to skip or pass over a portion of a memory segment or cache line, as described above. In the illustrated embodiment, technique 600 may be specific to taken branches. In another embodiment, a technique may be employed for not-taken branches. In yet another embodiment, a technique may be employed for both taken and not-taken branches and/or various types of branch instructions (e.g., call, return, unconditional jump, conditional jump, jump on zero or other values, etc.). It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect.
Block 602 illustrates that, in one embodiment, a determination may be made as to whether a branch instruction is predicted taken. If not, technique 600 may stop at block 699. It should be understood, however, that the above is merely one illustrative example and that the disclosed subject matter is not limited thereto.
Block 604 illustrates that, in one embodiment, a determination may be made as to which type of branch instruction has been encountered. In the illustrated embodiment, it is determined whether a branch may be a call, a return, or neither. It should be appreciated that the above is merely one illustrative example and that the disclosed subject matter is not limited in this respect.
Block 606 illustrates that, in one embodiment, if the branch instruction is neither a call nor a return, then a memory segment empty flag (associated with the branch instruction) may be read from the BTB, as described above.
Block 608 illustrates that, in one embodiment, if the branch instruction is a call branch instruction, the target of the corresponding return branch instruction may be determined. It may then be determined whether the remainder of the return's target memory segment or cache line has no other branch instructions. Once this determination is made and the memory segment empty flag is created, the memory segment empty flag may be pushed onto the RAS along with the return target address, as described above. In such an embodiment, once the RAS empty flag has been prepared for the eventual return of the call, the BPU may execute block 606 on the call instruction.
Block 610 illustrates that, in one embodiment, if the branch instruction is a return branch instruction, the empty flag of the RAS for that branch may be read (prepared by block 608), as described above.
Block 612 illustrates that, in one embodiment, the value of the empty flag (read from the BTB or the RAS, as determined by the branch type) may be examined, as described above. If the empty flag is not set (i.e., is cleared), or the empty flag indicates that the remainder of the memory segment is not free of branch instructions, technique 600 may stop at block 699 and branch processing may proceed normally.
Block 614 illustrates that, in one embodiment, it may be determined whether virtual-to-physical (V2P) address translation is available for the cache line containing the target address and the next sequential cache line after the target address. In various embodiments, this may be stored in a translation look-aside buffer (TLB). If a virtual to physical (V2P) address translation of the cache line containing the target address and the next sequential cache line after the target address is not available, an indication may be made to move to the next memory segment so that additional work, such as TLB fill, may be done. The technique 600 may stop at block 699.
Block 616 illustrates that, in one embodiment, it may be determined whether both the target cache line and the cache line following the target cache line are available in a cache (e.g., the instruction cache) and/or the BTB (a cache hit, not a miss). If not, the technique may not skip the empty memory, but instead move to block 699.
Block 618 illustrates that, in one embodiment, if the null flag is set (or indicates that the remainder of the target memory segment may be skipped) and both the target cache line and the cache line following the target cache line are available in the cache, then the BPU may skip or pass through the remainder of the current memory segment, as described above.
Block 699 shows the stopping point. It is to be appreciated, however, that as described above, the BPU may continue further processing of branch predictions and that technique 600 may be part of a larger branch prediction technique. Further, it should be appreciated that technique 600 may be repeated for each branch instruction.
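The decision of blocks 602-618 can be sketched as a software model (the predicate names and parameter set are illustrative assumptions; a hardware implementation would evaluate these conditions in the prediction pipeline):

```python
def should_skip_segment(predicted_taken: bool, branch_kind: str,
                        btb_empty: bool, ras_empty: bool,
                        tlb_has_translations: bool,
                        lines_cached: bool) -> bool:
    """Return True when the BPU may pass over the rest of the target segment.

    Mirrors technique 600: the empty flag comes from the RAS for returns
    (block 610) and from the BTB otherwise (block 606); skipping further
    requires V2P translations (block 614) and cached lines (block 616).
    """
    if not predicted_taken:          # block 602
        return False
    empty = ras_empty if branch_kind == "return" else btb_empty
    if not empty:                    # block 612
        return False
    if not tlb_has_translations:     # block 614
        return False
    if not lines_cached:             # block 616
        return False
    return True                      # block 618

assert should_skip_segment(True, "jump", True, False, True, True)
assert not should_skip_segment(True, "return", True, False, True, True)
```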
Fig. 7 is a schematic block diagram of an information handling system 700, which information handling system 700 may include a semiconductor device formed in accordance with the principles of the disclosed subject matter.
Referring to fig. 7, an information handling system 700 may include one or more devices constructed in accordance with the principles of the disclosed subject matter. In another embodiment, information handling system 700 may employ or perform one or more techniques in accordance with the principles of the disclosed subject matter.
In various embodiments, information handling system 700 may include a computing device such as a laptop computer, desktop computer, workstation, server, blade server, personal digital assistant, smart phone, tablet computer, or other suitable computer, or a virtual machine or virtual computing device thereof. In various embodiments, information handling system 700 may be used by a user (not shown).
The information handling system 700 according to the disclosed subject matter may also include a Central Processing Unit (CPU), logic, or processor 710. In some embodiments, processor 710 may include one or more Functional Unit Blocks (FUBs) or Combinational Logic Blocks (CLBs) 715. In such embodiments, the combinational logic block may include various boolean logic operations (e.g., NAND, NOR, NOT, XOR), stable logic devices (e.g., flip-flops, latches), other logic devices, or combinations thereof. These combinational logic operations may be configured in a simple or complex manner to process the input signals to achieve the desired results. It should be appreciated that while some illustrative examples of synchronous combinational logic operations are described, the disclosed subject matter is not so limited and may include asynchronous operations or mixtures thereof. In one embodiment, the combinational logic operation may include a plurality of Complementary Metal Oxide Semiconductor (CMOS) transistors. In various embodiments, these CMOS transistors may be arranged in gates that perform logic operations, although it should be understood that other techniques may be used and are within the scope of the disclosed subject matter.
The information handling system 700 according to the disclosed subject matter may also include volatile memory 720 (e.g., Random Access Memory (RAM)). The information handling system 700 according to the disclosed subject matter may also include non-volatile memory 730 (e.g., a hard disk drive, optical memory, NAND, or flash memory). In some embodiments, volatile memory 720, non-volatile memory 730, or combinations or portions thereof, may be referred to as a "storage medium". In various embodiments, volatile memory 720 and/or non-volatile memory 730 may be configured to store data in a semi-permanent or substantially permanent form.
In various embodiments, information handling system 700 may include one or more network interfaces 740 configured to allow information handling system 700 to become part of and communicate via a communication network. Examples of Wi-Fi protocols may include, but are not limited to, Institute of Electrical and Electronics Engineers (IEEE) 802.11g and IEEE 802.11n. Examples of cellular protocols may include, but are not limited to, IEEE 802.16m (also known as Wireless MAN (Metropolitan Area Network) Advanced), Long Term Evolution (LTE) Advanced, Enhanced Data rates for GSM (Global System for Mobile Communications) Evolution (EDGE), and Evolved High Speed Packet Access (HSPA+). Examples of wired protocols may include, but are not limited to, IEEE 802.3 (also known as Ethernet), Fibre Channel, and power line communications (e.g., HomePlug, IEEE 1901). It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited thereto.
The information handling system 700 according to the disclosed subject matter may also include a user interface unit 750 (e.g., a display adapter, a haptic interface, a human interface device). In various embodiments, the user interface unit 750 may be configured to receive input from a user and/or provide output to a user. Other types of devices may also be used to provide interaction with the user, for example, feedback provided to the user may be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback, and input from the user may be received in any form, including acoustic, speech, or tactile input.
In various embodiments, information handling system 700 may include one or more other devices or hardware components 760 (e.g., a display or monitor, keyboard, mouse, camera, fingerprint reader, video processor). It should be appreciated that the above are merely a few illustrative examples and that the disclosed subject matter is not limited thereto.
The information handling system 700 according to the disclosed subject matter may also include one or more system buses 705. In such embodiments, the system bus 705 may be configured to communicatively couple the processor 710, the volatile memory 720, the nonvolatile memory 730, the network interface 740, the user interface unit 750, and the one or more hardware components 760. Data processed by the processor 710 or data input from outside the nonvolatile memory 730 may be stored in the nonvolatile memory 730 or the volatile memory 720.
In various embodiments, information handling system 700 may include or execute one or more software components 770. In some embodiments, software component 770 may include an Operating System (OS) and/or applications. In some embodiments, the OS may be configured to provide one or more services to applications and manage or act as an intermediary between applications and the various hardware components of information handling system 700 (e.g., processor 710, network interface 740). In such embodiments, information handling system 700 may include one or more native applications that may be installed locally (e.g., within non-volatile memory 730) and configured to be executed directly by processor 710 and interact directly with the OS. In such embodiments, the native applications may include precompiled machine-executable code. In some embodiments, the native applications may include a script interpreter (e.g., C shell (csh), AppleScript, AutoHotkey) or a virtual execution machine (VM) (e.g., the Java Virtual Machine, the Microsoft Common Language Runtime) configured to convert source or object code into executable code that is then executed by the processor 710.
The above semiconductor devices may be packaged using various packaging techniques. For example, semiconductor devices constructed in accordance with the principles of the disclosed subject matter may be packaged using any of Package On Package (POP) technology, Ball Grid Array (BGA) technology, Chip Scale Package (CSP) technology, Plastic Leaded Chip Carrier (PLCC) technology, Plastic Dual In-line Package (PDIP) technology, Chip On Board (COB) technology, Ceramic Dual In-line Package (CERDIP) technology, Plastic Metric Quad Flat Package (PMQFP) technology, Plastic Quad Flat Package (PQFP) technology, Small Outline IC (SOIC) technology, Shrink Small Outline Package (SSOP) technology, Thin Small Outline Package (TSOP) technology, Thin Quad Flat Package (TQFP) technology, System In Package (SIP) technology, Multi-Chip Package (MCP) technology, Wafer-level Fabricated Package (WFP) technology, Wafer-level processed Stack Package (WSP) technology, or other technologies known to those skilled in the art.
The method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps may also be performed by, and apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
In various embodiments, a computer-readable medium may include instructions that, when executed, cause an apparatus to perform at least a portion of the method steps. In some embodiments, the computer readable medium may be included in a magnetic medium, an optical medium, other medium, or a combination thereof (e.g., CD-ROM, hard drive, read-only memory, flash drive). In such embodiments, the computer-readable medium may be an article of manufacture that is tangible and non-transitory.
While the principles of the disclosed subject matter have been described with reference to example embodiments, it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the disclosed concepts. Accordingly, it should be understood that the above embodiments are not limiting, but merely illustrative. Accordingly, the scope of the disclosed concept is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing description. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the embodiments.

Claims (20)

1. An apparatus, comprising:
A branch prediction circuit configured to predict that a branch instruction will be taken, and
A return address stack circuit configured to store a memory segment empty flag indicating that a memory segment following the return address does not include at least one other branch instruction and to determine that physical address translation of a next memory segment and subsequent sequential memory segments is available;
Wherein the branch prediction circuit is configured to skip a memory segment associated with a memory segment empty flag indicating the absence of at least one other branch instruction.
2. The apparatus of claim 1, wherein the branch prediction circuit is configured to:
determine that the next memory segment is stored in the instruction cache and the return address stack circuitry, and
skip the memory segment that the memory segment empty flag indicates lacks a branch instruction.
3. The apparatus of claim 1, wherein the branch prediction circuit is configured to, for a memory segment that includes at least one other branch instruction after the return address, move to a next instruction within the memory segment.
4. The apparatus of claim 1, wherein the memory segment is a cache line.
5. The apparatus of claim 1, wherein the branch prediction circuit is configured to determine whether the branch instruction is one of a call instruction or a return instruction.
6. The apparatus of claim 5, further comprising a branch target buffer circuit configured to store a memory segment empty flag of a target address, and
Wherein, in response to the branch instruction being a call instruction, the apparatus is configured to:
Determining that the memory segment following the associated return instruction includes at least one other branch instruction, an
The determination is stored as a memory segment empty flag within the return address stack circuit.
7. The apparatus of claim 5, further comprising a branch target buffer circuit configured to store a second memory segment empty flag indicating that a memory segment following the target address does not include at least one other branch instruction, wherein the second memory segment empty flag is created at a commit stage prior to occurrence of the branch instruction.
8. The apparatus of claim 1, wherein the branch prediction circuit is configured to:
skip the memory segment that the memory segment empty flag indicates lacks a branch instruction.
9. An apparatus, comprising:
A branch detection circuit configured to detect the presence of at least one branch instruction within a portion stored in a memory segment during a commit phase of a current instruction, wherein the commit phase includes a pipeline phase in which the apparatus determines that the instruction and a result of the instruction are to be required, and
Return address circuitry configured to store:
Return address, and
A memory segment empty flag indicating whether a portion of the memory segment following the return address includes at least one other branch instruction.
10. The apparatus of claim 9, wherein the memory segment is a cache line.
11. The apparatus of claim 9, wherein the apparatus comprises a commit queue circuit;
wherein the commit queue circuit is configured to store current commit instructions in chronological order.
12. The apparatus of claim 9, wherein the apparatus comprises a last committed branch memory configured to store previously committed branch instructions.
13. The apparatus of claim 12, wherein the branch detection circuit is configured to:
determining that the current instruction is a branch instruction, and
The current instruction is stored in the last committed branch memory.
14. The apparatus of claim 9, wherein the branch detection circuit is configured to determine that the previously stored last committed branch instruction is still valid in response to the current instruction not being a branch instruction.
15. The apparatus of claim 14, further comprising:
a branch target buffer circuit configured to store:
Branch instruction address, and
A second memory segment empty flag indicating whether a portion of the memory segment following the target address includes at least one other branch instruction, and
Wherein the branch detection circuit is configured to set a memory segment empty flag associated with the previously stored last committed branch instruction in the branch target buffer circuit in response to the current instruction not being a branch instruction and the previously stored last committed branch instruction remaining valid.
16. The apparatus of claim 14, wherein the branch detection circuit is configured to mark the previously stored last committed branch instruction as invalid if the current instruction is not a branch instruction and the previously stored last committed branch instruction is invalid.
17. The apparatus of claim 9, wherein the branch target buffer circuit comprises a graph-based branch target buffer circuit.
18. The apparatus of claim 9, wherein the memory segment empty flag indicates that a plurality of memory segments, or portions thereof, following the branch instruction address do not include at least one other branch instruction.
19. A system, comprising:
A branch detection circuit configured to detect the presence of at least one branch instruction stored within a portion of the memory segment during a commit phase of a current commit instruction, wherein the commit phase includes a pipeline phase in which the system determines that the instruction and a result of the instruction are to be required;
a branch target buffer circuit configured to store:
a branch instruction address, and
a memory segment empty flag indicating whether a portion of the memory segment following the target address includes at least one other branch instruction; and
a branch prediction circuit configured to predict that a branch instruction will be taken, wherein the branch prediction circuit is configured to skip over a memory segment whose associated memory segment empty flag indicates that the memory segment lacks a branch instruction.
20. The system of claim 19, wherein the memory segment empty flag is valid only for a taken branch instruction.
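The skip mechanism recited in claims 15 and 17-19 can be illustrated with a small software model. The following Python sketch is illustrative only: the names (`SkippingPredictor`, `BTBEntry`, `segment_empty`, `SEGMENT_SIZE`) are invented for this example and do not appear in the patent. It models a branch target buffer entry that carries a memory-segment-empty flag, allowing the predictor to start its next lookup past a segment known to contain no branches.

```python
# Illustrative model of a BTB with per-entry "memory segment empty" flags.
# All identifiers are hypothetical; segment size is assumed to be one
# 64-byte cacheline, consistent with the description of skipping over
# cachelines without branches.

from dataclasses import dataclass

SEGMENT_SIZE = 64  # bytes per memory segment (e.g., one cacheline)

@dataclass
class BTBEntry:
    target: int          # predicted branch target address
    segment_empty: bool  # no other branch in the remainder of the target's segment

class SkippingPredictor:
    def __init__(self):
        self.btb = {}  # branch instruction address -> BTBEntry

    def commit_branch(self, branch_addr, target, segment_empty):
        # Modeled on the commit-stage update of claims 13-15: record the
        # committed branch and whether the portion of the memory segment
        # following its target holds any other branch instruction.
        self.btb[branch_addr] = BTBEntry(target, segment_empty)

    def next_lookup(self, branch_addr):
        # Modeled on claim 19: for a taken branch whose entry says the
        # target's segment is branch-free, begin the next BTB lookup at
        # the following segment, saving one lookup cycle.
        entry = self.btb[branch_addr]
        segment_base = entry.target - (entry.target % SEGMENT_SIZE)
        if entry.segment_empty:
            return segment_base + SEGMENT_SIZE
        return entry.target
```

In this model, a taken branch to `0x208` whose segment-empty flag is set causes the next lookup to begin at `0x240` (the next 64-byte segment), whereas a cleared flag makes the predictor resume at the target itself.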
CN202010439722.2A 2019-05-23 2020-05-22 Device and system for improving branch prediction throughput Active CN111984325B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962852286P 2019-05-23 2019-05-23
US62/852,286 2019-05-23
US16/561,004 2019-09-04
US16/561,004 US11182166B2 (en) 2019-05-23 2019-09-04 Branch prediction throughput by skipping over cachelines without branches

Publications (2)

Publication Number Publication Date
CN111984325A CN111984325A (en) 2020-11-24
CN111984325B true CN111984325B (en) 2024-12-24

Family

ID=73442211


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579176B (en) * 2020-12-17 2023-03-28 成都海光微电子技术有限公司 Apparatus and method for recording address history

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
US8074055B1 (en) * 1999-01-28 2011-12-06 Ati Technologies Ulc Altering data storage conventions of a processor when execution flows from first architecture code to second architecture code
WO2003065165A2 (en) * 2002-01-31 2003-08-07 Arc International Configurable data processor with multi-length instruction set architecture
US7328332B2 (en) * 2004-08-30 2008-02-05 Texas Instruments Incorporated Branch prediction and other processor improvements using FIFO for bypassing certain processor pipeline stages
US8261047B2 (en) * 2008-03-17 2012-09-04 Freescale Semiconductor, Inc. Qualification of conditional debug instructions based on address
US20120311308A1 (en) * 2011-06-01 2012-12-06 Polychronis Xekalakis Branch Predictor with Jump Ahead Logic to Jump Over Portions of Program Code Lacking Branches
US9146739B2 (en) * 2012-06-14 2015-09-29 International Business Machines Corporation Branch prediction preloading
US10296463B2 (en) * 2016-01-07 2019-05-21 Samsung Electronics Co., Ltd. Instruction prefetcher dynamically controlled by readily available prefetcher accuracy
CN107273530B (en) * 2017-06-28 2021-02-12 南京理工大学 Internet information-based important ship target dynamic monitoring method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant