
HK1112983A - Unaligned memory access prediction - Google Patents


Info

Publication number: HK1112983A
Application number: HK08108338.6A
Authority: HK (Hong Kong)
Prior art keywords: memory access, instruction, misalignment, alignment, pipeline
Other languages: Chinese (zh)
Inventors: Jeffrey Todd Bridges; Victor Roberts Augsburg; James Norris Dieffenderfer; Thomas Andrew Sartorius
Original Assignee: Qualcomm Incorporated
Application filed by Qualcomm Incorporated
Publication of HK1112983A


Abstract

In an instruction execution pipeline, the misalignment of memory access instructions is predicted. Based on the prediction, an additional micro-operation is generated in the pipeline prior to the effective address generation of the memory access instruction. The additional micro-operation performs the second access required when the data falls across a predetermined address boundary. Predicting the misalignment and generating the micro-operation early in the pipeline ensures that sufficient pipeline control resources are available to generate and track the additional micro-operation, avoiding the pipeline flush that would be required if those resources were unavailable at the time of effective address generation. The misalignment prediction may employ known conditional branch prediction techniques, such as a flag, a bimodal counter, a local predictor, a global predictor, or combined predictors. A misalignment predictor may be enabled or biased by a memory access instruction flag or a misaligned instruction type.

Description

Unaligned memory access prediction
Technical Field
The present disclosure relates generally to the field of processors, and in particular to predicting unaligned memory accesses in pipelined processors.
Background
Portable electronic devices have become commonplace. Two trends in portable electronic devices are increased functionality and decreased size. The increase in functionality is facilitated by the increased computing power provided by faster and more powerful processors.
In addition to providing advanced features and functionality, portable electronic devices themselves are continually being reduced in size and weight. One effect of this shrinking trend is the ever decreasing size of batteries used to power the processor and other electronic components in the device. While improvements in battery technology partially offset the problem, the reduction in battery size imposes a stringent power budget on all portable electronic device electronics. A significant portion of the power budget for a portable electronic device is the power consumed by the processor.
Thus, processor improvements that increase performance and reduce power consumption are desirable for many applications, such as portable electronic devices. Modern processors typically employ a pipeline architecture in which sequential instructions, each having multiple execution steps, overlap in execution. For maximum performance, instructions should flow continuously through the pipeline. Any situation that causes instructions to be flushed from the pipeline and restarted adversely affects both performance and power consumption.
Certain pipeline resources (e.g., queue locations for instruction status and tracking) are allocated as instructions enter the pipeline. If a single instruction is later found in the pipeline to require more resources than the originally dispatched resources, then subsequent instructions may need to be flushed to allow their resources to be re-dispatched to the instructions requiring the resources.
A memory access instruction that loads misaligned data from memory or stores misaligned data to memory is one example of an instruction that may require more pipeline resources than those originally allocated to it, a need that may not be discovered until deep in the pipeline. Misaligned data are data that, as stored in memory, cross a predetermined memory boundary (e.g., a word or half-word boundary). Due to the way memory is logically constructed and addressed, and physically coupled to a memory bus, data crossing a memory boundary generally cannot be read or written in a single cycle. Instead, two consecutive bus cycles may be required: one to read or write the data on one side of the boundary, and another to read or write the remaining data.
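The boundary-crossing condition described above reduces to simple address arithmetic: an access is misaligned exactly when its first and last bytes fall in different aligned units. A minimal sketch in Python, assuming a 32-bit bus with word-aligned memory (the 4-byte word size is an illustrative choice, not something the text fixes):

```python
WORD_SIZE = 4  # bytes; illustrative: assumes a 32-bit bus, word-aligned memory

def is_misaligned(address: int, size: int) -> bool:
    """True if an access of `size` bytes starting at `address` crosses a
    word boundary and therefore needs two bus cycles."""
    first_word = address // WORD_SIZE
    last_word = (address + size - 1) // WORD_SIZE
    return first_word != last_word

# A 3-byte access at 0x0A spans bytes 0x0A-0x0C, crossing the 0x0C boundary.
print(is_misaligned(0x0A, 3))  # True
print(is_misaligned(0x08, 4))  # False: exactly one aligned word
```

The same test, with a different `WORD_SIZE`, covers half-word or double-word boundaries.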
Memory access instructions directed to misaligned data, i.e., load or store instructions, must generate additional instruction steps, or micro-operations, in the pipeline to perform the additional memory access the misaligned data requires. However, the alignment of the data cannot be determined until the effective address of the memory access and the data size are known, which may occur only deep in the pipeline, at an execute stage. When the effective address is generated and a data misalignment is discovered, there may be insufficient pipeline control resources to generate a micro-operation to perform the second memory access. In that case, at least all subsequent instructions must be flushed from the pipeline to free those resources. The flushed instructions must then be re-fetched and re-executed, degrading processor performance and wasting power.
Disclosure of Invention
Data misalignment for a memory access instruction may be predicted early in the pipeline prior to the effective address generation of the instruction. Pipeline resources may be allocated and the pipeline controlled to create a second micro-operation. A second memory access cycle is performed with the second micro-operation as needed for misaligned data.
The present invention, in one embodiment, relates to a method of processing memory access instructions in an instruction execution pipeline. A misalignment of a memory access instruction is predicted, and at least one micro-operation is generated in the pipeline in response to the prediction prior to generating an effective address for the memory access instruction, the micro-operation performing a second memory access for misaligned data.
In another embodiment, the invention relates to a pipeline processor. The processor includes an instruction execution pipeline including a plurality of pipe stages, and a misalignment predictor that generates a prediction that a memory access instruction will access misaligned data. The processor additionally includes control logic that generates additional micro-operations in the pipeline for the memory access instruction in response to the prediction prior to generating an effective address for the memory access instruction.
Drawings
FIG. 1 is a functional block diagram of a processor.
FIG. 2 is a diagram of memory organization.
FIG. 3 is a functional block diagram of a portion of a processor pipeline.
Detailed Description
Pipelining is a processor technique whereby multiple instructions overlap in execution. Each instruction in a typical architecture executes in multiple steps, such as fetch, decode, one or more execute steps, memory access, and write-back. A processor pipeline comprises multiple "pipe stages." Each pipe stage, which typically includes logic and a storage element, completes an execution step or a portion of an execution step of an instruction. The pipe stages are coupled together to form the pipeline. Instructions enter the pipeline and are processed sequentially through the stages. Additional instructions enter the pipeline before previous instructions complete execution; thus, multiple instructions may be processed within the pipeline at any given time. This ability to exploit parallelism among instructions in a sequential instruction stream contributes significantly to improved processor performance. In an ideal processor that completes each pipe stage in one cycle, after a brief initial process of filling the pipeline, an instruction may complete execution every cycle.
Such ideal conditions are rarely achieved in practice due to a variety of factors including data dependencies among instructions (data hazards), control dependencies such as branches (control hazards), processor resource allocation conflicts (structural hazards), interrupts, cache misses, and the like. Additionally, some instructions may need to pass through one or more of the pipe stages more than once. In this case, the processor may generate a plurality of micro-operations for the instruction. As used herein, a micro-operation is a logical entity that occupies one pipe stage at a time and flows through a pipeline. Ideally, most instructions contain a single micro-operation in the pipeline (to approach the goal of one instruction per cycle). However, the instruction may include two or more micro-operations, or may be split into two or more, each of which flows sequentially through the pipeline.
One form of structural pipeline hazard results from misaligned memory accesses. Many memory access instructions compute an effective memory address, i.e., the address from which data is loaded or to which data is stored, during an execute stage. However, processor resources, in particular the pipeline control resources that allocate pipe stages to instructions and track instructions through the pipeline, are allocated to a load or store instruction during the decode stage. In the common case of memory addresses aligned on word, half-word, or other predetermined address boundaries, a load or store operation may be performed in a single pipe stage (although the pipeline may stall while data is retrieved from memory).
If a memory access instruction is directed to data crossing a predetermined address boundary, two load or store operations are required, and hence two micro-operations must execute in the pipeline. However, it is likely that only one micro-operation was allocated to the load or store in the decode stage. The need to generate a new micro-operation in the execute stages of the pipeline presents problems. If pipeline resources are fully allocated, the need for a new micro-operation will cause an exception, requiring all subsequent instructions to be flushed from the pipeline to free up the necessary pipeline control resources.
Fig. 1 depicts a functional block diagram of a processor 10. The processor 10 executes instructions in an instruction execution pipeline 12 according to control logic 14, which control logic 14 may include an instruction execution FIFO 15. The pipeline includes various registers or latches 16, organized in pipe stages, and one or more Arithmetic Logic Units (ALU) 18. A General Purpose Register (GPR) file 20 provides registers comprising the top of the memory hierarchy. The pipeline fetches instructions from an instruction cache 22, with memory addressing and permissions managed by an instruction-side translation lookaside buffer (ITLB) 24. Data is accessed from a data cache 26, with memory addressing and permissions managed by a main Translation Lookaside Buffer (TLB) 28. In various embodiments, the ITLB may comprise a copy of part of the TLB. Alternatively, the ITLB and TLB may be integrated. Similarly, in various embodiments of the processor 10, the I-cache 22 and the D-cache 26 may be integrated or unified. Misses in the I-cache 22 and/or the D-cache 26 cause an access to main (off-chip) memory 32, under the control of a memory interface 30, which memory interface 30 may include a cache miss processing queue 29. The processor 10 may include an input/output (I/O) interface 34 that controls access to various peripheral devices 36. Those skilled in the art will recognize that numerous variations to processor 10 are possible. For example, the processor 10 may include a second level (L2) cache for either or both of the I and D caches. In addition, one or more of the functional blocks depicted in the processor 10 may be omitted from a particular embodiment.
Table 1 below depicts a diagram of representative pipeline operations. The instructions in the representative structure are executed in six steps:
IF - instruction fetch
ID - instruction decode
EX1 - execute (fetch address operands for memory access instructions)
EX2 - execute (effective address calculation for memory access instructions)
MEM - memory access
WB - write-back
Instruction        Clock cycle
                   1    2    3    4    5    6    7    8    9    10
i                  IF   ID   EX1  EX2  MEM  WB
i+1                     IF   ID   EX1  EX2  MEM  WB
i+2                          IF   ID   EX1  EX2  MEM  WB
i+3                               IF   ID   EX1  EX2  MEM  WB
i+4                                    IF   ID   EX1  EX2  MEM  WB
Table 1: representative pipeline operations
If each pipe stage executes in one cycle, and if there is no pipeline stall, hazard, or interrupt, then in each of cycles 6 through 10 one instruction completes and writes back its results. Clock cycles 1-4 represent the initial process of filling the pipeline, which occurs, for example, after a reset, context switch, interrupt, or any other flush of the pipeline. As additional instructions follow instruction i+4, completion of one instruction per cycle may continue indefinitely in the ideal case. The pipeline configuration of Table 1 is merely representative; in any given processor implementation, a pipeline may include any number of pipe stages for any instruction step.
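The fill-then-one-per-cycle behavior of Table 1 can be reproduced with a few lines of arithmetic. The sketch below is an illustration (not part of the patent): it assumes one cycle per pipe stage and no stalls, so instruction i enters IF in cycle i+1 and writes back five cycles later.

```python
STAGES = ["IF", "ID", "EX1", "EX2", "MEM", "WB"]

def schedule(n_instructions: int) -> dict:
    """Clock cycle in which each (instruction, stage) pair executes,
    assuming one cycle per pipe stage and no stalls or hazards."""
    return {(i, stage): i + s + 1
            for i in range(n_instructions)
            for s, stage in enumerate(STAGES)}

t = schedule(5)
print(t[(0, "WB")])  # 6: first write-back, after the pipeline fills
print(t[(4, "WB")])  # 10: instruction i+4 completes in cycle 10
```

Cycles 1-5 produce no write-backs (the fill); thereafter one instruction completes per cycle, matching Table 1.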
Load (LD) and store (ST) instructions access memory to read and write data, respectively. Memory is organized to access a predetermined amount of data at a time. Fig. 2 is a block diagram of one memory structure in an electronic device, including a processor 10, a bus 31, and a memory 32. In this example, the bus 31 is 32 bits wide, and the memory 32 (which may include, for example, cache memory in the processor or off-chip RAM) is aligned on word (32-bit) boundaries. As will be readily appreciated by those skilled in the art, a variety of different bus and memory structures may be implemented, with correspondingly different data alignment boundaries.
As an example of an unaligned memory access, Fig. 2 depicts a memory read by an LD instruction with an effective start byte address of 0x0A and a size of three bytes. With word-aligned memory 32, this data cannot be read from the memory in one bus cycle. Instead, the processor 10 must first read the full word starting at 0x08 (masking off bytes 0x08 and 0x09), and then read the full word starting at 0x0C (masking off bytes 0x0D-0x0F). In a system with double-word memory alignment and a 64-bit bus, this data could be read in a single cycle; however, 3 bytes of data starting at 0x07 could not. In general, any real bus 31 and memory 32 configuration may encounter unaligned memory accesses.
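The two-bus-cycle access of Fig. 2 can be sketched in software. The model below is illustrative only (a real pipeline does this in hardware): it reads every aligned word the access touches, one "bus cycle" per word, and masks off the unwanted bytes. The 4-byte word size matches the Fig. 2 example.

```python
WORD = 4  # bytes; illustrative: word-aligned memory on a 32-bit bus

def read_unaligned(memory: bytes, address: int, size: int) -> bytes:
    """Assemble an unaligned read from one or two aligned word reads,
    mimicking the two bus cycles described in the text."""
    def read_word(word_addr: int) -> bytes:
        base = word_addr * WORD
        return memory[base:base + WORD]  # one aligned bus cycle

    first, last = address // WORD, (address + size - 1) // WORD
    data = b"".join(read_word(w) for w in range(first, last + 1))
    offset = address - first * WORD
    return data[offset:offset + size]  # mask off the unwanted bytes

mem = bytes(range(16))  # byte value == byte address, for illustration
print(read_unaligned(mem, 0x0A, 3).hex())  # "0a0b0c", via two word reads
```

For the Fig. 2 case, the loop performs two word reads (words at 0x08 and 0x0C); an aligned access would perform only one.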
Because unaligned data requires two bus cycles, a memory access instruction directed to unaligned data (referred to herein as an "unaligned memory instruction") will result in the creation of two micro-operations in the execute stage. For example, if the (i +3) th instruction in Table 1 is an unaligned memory access instruction, the pipeline will execute as depicted in Table 2 below.
Instruction        Clock cycle
                   1    2    3    4    5    6    7         8    9    10   11
i                  IF   ID   EX1  EX2  MEM  WB
i+1                     IF   ID   EX1  EX2  MEM  WB
i+2                          IF   ID   EX1  EX2  MEM      WB
i+3                               IF   ID   EX1  EX2      MEM  WB
(i+3)2                                           gen μ-op EX2  MEM  WB
i+4                               IF   ID   EX1  (stall)  EX2  MEM  WB

((i+3)2 can be generated only when sufficient resources are available)
Table 2: representative pipeline with unaligned memory access instructions
During the EX2 stage, the effective address of memory access instruction i+3 is determined, in clock cycle 7. Only at this point can the processor control logic determine that the memory access is misaligned, that is, that it crosses a predetermined memory boundary and cannot be completed in a single bus cycle. If sufficient pipeline resources are available, the processor generates an additional micro-operation ("gen μ-op" in Table 2) for the memory access instruction, denoted (i+3)2. The initially allocated micro-operation and the new micro-operation then proceed sequentially through the remaining pipeline.
The initial micro-operation (whose memory address was already calculated at EX2) then accesses the first portion of the addressed data at MEM and writes the data at WB. The newly created micro-operation calculates the address of the second memory access (e.g., by adding a word offset) at EX2 in clock cycle 8, then accesses the second portion of the addressed data at MEM, and writes the data at WB.
Instruction i+4 stalls in clock cycle 8 because of the additional micro-operation required for the misaligned data of instruction i+3. To implement this stall, the clock to the EX1 latch must be gated, the output of the EX1 latch recirculated through a multiplexer at its input, or some other mechanism employed to hold the EX1 contents through both clock cycles 7 and 8. Similarly, the following instruction i+5 will stall farther up the pipeline at the DCD latch, and so on. Implementing such stall control presents design challenges, particularly when the need for a stall is only discovered late in the pipeline. In addition, the need to "recirculate" the pipe stage for the memory access instruction at EX2, as well as for other instructions in earlier pipe stages, adds multiplexer selection delay to one or more critical paths, potentially reducing performance by increasing machine cycle time. Minimizing events that may cause a pipeline stall therefore improves performance.
Misaligned memory accesses may be described in more detail with reference to FIG. 3, a functional block diagram of a portion of the instruction pipeline 12. The LD instruction is fetched from the instruction cache 22 and loaded into the IF latch 40. The instruction is decoded by decode logic 42. In one embodiment, the LD calculates its effective address by adding the base address in a first register r1 to the offset in a second register r2. The addresses of the two registers r1 and r2 and the size of the data are decoded from the instruction and latched in the DCD latch 44. The register addresses then index a register file (e.g., the GPR file 20), which returns the register contents, denoted (r1) and (r2). These values are stored in the EX1 latch 46 and added by the ALU 18, and the effective memory address is stored in the EX2 latch 48. The memory access then proceeds at 50, accessing the data cache 26. If the access misses in the data cache 26, the memory access operation 50 performs address translation if necessary and accesses the off-chip memory 32, stalling the pipeline until the memory access is resolved. In either case, the memory access operation 50 returns a first portion of the unaligned data, which is stored in the MEM latch 52.
When the effective address is generated by adding (r1) and (r2) at 48, the control logic 14 checks the effective address and size fields and first detects that the memory access is misaligned. If sufficient processor resources are available, the effective address is "recycled" at the EX2 latch 48, as indicated by arrow 54. The address is updated with a single-word offset to generate the effective address for the second memory access. This adds a micro-operation to the pipeline, and subsequent instructions are not allowed to proceed to the EX2 pipe stage. As the data fetched from the first word progresses down the pipeline, the second word is accessed at 50 and stored in the MEM latch 52. The data may then be written to the GPRs sequentially, or combined and then written, as appropriate.
Note that for purposes of explanation, FIG. 3 depicts each stage of the pipeline 12 holding the corresponding step of the LD instruction. In practice, once an LD instruction step completes in a pipe stage and the corresponding micro-operation moves to a subsequent stage, the micro-operation of another instruction is loaded into that pipe stage for processing. Thus, when the effective address of the LD instruction is latched at 48, the previous three pipe stages are loaded with three micro-operations, which may correspond to up to three different instructions. When the effective address is generated at the EX2 latch 48, if insufficient pipeline control resources are available to recirculate the address as depicted at 54 and generate a second micro-operation to perform the second memory access, a structural hazard arises and an exception will occur. In this case, all instructions following the misaligned LD instruction must be flushed from the pipeline to make the necessary control resources available. These instructions must be re-fetched and re-processed at a later time, incurring a performance penalty and wasting the power spent performing the operations twice.
The pipeline control resources that must be available to create a micro-operation in the EX2 pipe stage may include entries in the instruction tracking FIFO 15 (FIG. 1). The instruction tracking FIFO 15 includes an entry for each issued instruction, in program order. An entry in the FIFO 15 is allocated when the corresponding instruction issues and, some cycles later, is updated when the pipeline control logic 14 determines whether the instruction has an exception that may cause an interrupt. Entries are removed from the instruction tracking FIFO 15 in order, each popped off after it has been "validated" (i.e., the pipeline controller determines that the instruction will complete execution without an exception) and "confirmed" (i.e., the pipeline controller has recognized that the instruction and all instructions ahead of it have been validated, so it is certain that the instruction will complete execution in the pipeline).
The structure and control of the instruction tracking FIFO 15 are simplified if each entry corresponds to a single micro-operation. On the other hand, if an unaligned memory access instruction can cause an additional micro-operation to be generated deep in the pipeline (e.g., in the EX2 pipe stage), each entry in the instruction tracking FIFO 15 must be able to track multiple potential micro-operations, increasing the hardware resources and control complexity of every FIFO entry. This increased complexity and size is incurred for every FIFO entry, yet unaligned memory access instructions (the only instructions that require multiple micro-operations to be tracked late) are relatively rare. As an alternative to designing the instruction tracking FIFO 15 to track multiple micro-operations per entry, each entry may track only one micro-operation. In this case, a misaligned memory access instruction detected late will cause the pipeline (and the instruction tracking FIFO 15) to flush all instructions behind it, allocate two entries corresponding to the two micro-operations in the instruction tracking FIFO 15, and re-fetch and re-issue all following instructions. This results in significant performance and power loss.
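The one-micro-operation-per-entry discipline described above can be sketched as a small model. Everything here (class name, entry format, method names) is hypothetical and intended only to show why reserving both entries at decode avoids the late structural hazard: if both entries are claimed up front, a full FIFO is discovered at issue time, not deep in the pipeline.

```python
from collections import deque

class InstructionTrackingFIFO:
    """Minimal sketch (not the patent's implementation) of a tracking
    FIFO with one entry per micro-operation. Entries are allocated in
    program order at issue, marked validated once the pipeline decides
    no exception will occur, and popped only from the head."""
    def __init__(self, depth: int):
        self.depth = depth
        self.entries = deque()  # [(tag, uop_index), validated] pairs

    def can_issue(self, n_uops: int) -> bool:
        return len(self.entries) + n_uops <= self.depth

    def issue(self, tag, n_uops: int = 1):
        if not self.can_issue(n_uops):
            raise RuntimeError("structural hazard: FIFO full, flush needed")
        for u in range(n_uops):
            self.entries.append([(tag, u), False])

    def validate(self, tag):
        for entry in self.entries:
            if entry[0][0] == tag:
                entry[1] = True

    def retire(self):
        """Pop validated entries from the head; a head entry is confirmed
        once it and all its predecessors (already popped) are validated."""
        while self.entries and self.entries[0][1]:
            self.entries.popleft()

fifo = InstructionTrackingFIFO(depth=4)
fifo.issue("ld_misaligned", n_uops=2)  # both micro-ops reserved at decode
fifo.issue("add")
print(fifo.can_issue(2))  # False: only one slot remains for two micro-ops
```

Because the predicted-misaligned load claimed two entries at issue, a later second micro-operation never finds the FIFO unexpectedly full.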
In addition to the instruction tracking FIFO 15, another resource needed by misaligned memory instructions that may be unavailable late in the pipeline is an entry in the cache miss queue 29. When a data access misses in the data cache, the access may be placed in the queue 29 to access main memory over the external bus. If no queue entry is available, the pipeline must stall. For a memory access instruction, the cache miss queue 29 may be checked during the DCD stage, and if an entry is available, the control logic 14 allows the instruction to advance, knowing that a cache miss will not stall the pipeline 12. However, if an unaligned memory access instruction must generate an additional micro-operation late in the pipeline to perform the additional memory access, and that access misses in the cache, a second cache miss queue entry is required. Since only one was reserved in the DCD pipe stage, there may not be enough queue resources available, causing the pipeline 12 to stall.
According to one embodiment of the present invention, prior to generating an effective address for a memory access instruction, a data misalignment in the memory access instruction is predicted, and a micro-operation is created in response to the prediction. The new micro-operation performs a second memory access required to access the misaligned data. This allows pipeline control resources to be allocated early in the pipeline (e.g., immediately after the instruction is decoded).
Referring again to FIG. 3, once the memory access instruction is decoded at 42, the misalignment predictor 56 detects the instruction. In response to a misalignment prediction, a second micro-operation may be created immediately, as indicated by the "recycle" of the LD instruction at the IF latch 40. The second micro-operation follows the initial load micro-operation through the pipeline and is available to execute a second memory access cycle if the predicted data misalignment proves correct. The additional micro-operation does not actually need to perform the register accesses and address generation described above for the LD instruction, since the address of its memory access is known to be the address of the LD instruction's access plus, for example, a single-word offset. After the first memory access by the LD instruction at 50, if the misalignment prediction is correct, the address of the second memory access needed to read the misaligned data is calculated and stored in the EX2 latch 48 while the first data is stored in the MEM latch 52. The second memory access is then performed, and the second data is obtained from the cache 26 or memory 32 and loaded into the MEM latch 52.
If the misalignment prediction is wrong, the second memory access is not performed and the additional micro-operation is discarded. If the pipeline stalls, instructions behind the LD may advance, consuming the resources allocated for the second micro-operation and effectively removing it from the pipeline.
Table 3 below depicts a pipeline in which instruction i +3 is a misaligned memory access instruction whose misalignment is correctly predicted.
Instruction        Clock cycle
                   1    2    3    4    5         6        7    8    9    10   11
i                  IF   ID   EX1  EX2  MEM       WB
i+1                     IF   ID   EX1  EX2       MEM      WB
i+2                          IF   ID   EX1       EX2      MEM  WB
i+3                               IF   ID        EX1      EX2  MEM  WB
(i+3)2                                 gen μ-op  ID       EX1  EX2  MEM  WB
i+4                               IF   (stall)   ID       EX1  EX2  MEM  WB

(sufficient pipeline resources are known to be available for (i+3)2)
Table 3: representative pipeline of unaligned memory access instructions with correct prediction
In response to decoding the instruction as an LD and predicting the misalignment, a second micro-operation (i+3)2 is generated in clock cycle 5, at the decode pipe stage. Such early generation of the micro-operation (before the effective address of the LD instruction is computed in the EX2 pipe stage at clock cycle 7) ensures that sufficient pipeline control resources are available for the micro-operation (i+3)2. The timing of instruction execution is otherwise similar to that in Table 2, which assumed sufficient resources were available to create the second micro-operation (i+3)2 in the EX2 pipe stage. One difference is that instruction i+4 is stalled by the same amount, but the stall occurs earlier in its execution sequence because the micro-operation (i+3)2 is generated earlier in the pipeline.
If the misalignment prediction is accurate, exactly the pipeline control resources required to perform the misaligned memory access are allocated, and subsequent instructions can be loaded into the pipeline and executed without risk of being flushed because of the misalignment. If the misalignment prediction is wrong, processor performance and power efficiency suffer. However, the degradation is not symmetrical. Table 4 below depicts the relative performance and power impact of each combination of predicted and actual alignment.
Prediction   Actual       Impact               Note
Aligned      Aligned      Optimal              The common case; occurs 99+% of the time in typical applications
Aligned      Misaligned   Potentially severe   If insufficient resources are available to generate a micro-operation, an exception must be raised and the pipeline flushed
Misaligned   Aligned      Mildly adverse       A single-stage pipeline bubble: an unnecessary micro-operation created at DCD
Misaligned   Misaligned   Optimal              Ensures sufficient resources are available for the necessary micro-operations
Table 4: impact of misalignment prediction accuracy
A correct prediction provides optimal performance by allocating to the memory access instruction exactly the number of micro-operations needed to fully execute it. The misprediction case in which the access is predicted aligned but is actually misaligned is the situation described above: the data alignment is not known until the effective address of the memory access instruction is generated (in the EX2 pipe stage in the example above). As discussed, if sufficient pipeline control resources are available, the only performance cost is added latency for the instructions following the memory access instruction while a micro-operation is created to perform the second memory access. However, if insufficient pipeline control resources are available, an exception occurs and the pipeline 12 flushes all instructions loaded after the memory access instruction to free the resources needed to create and manage the additional micro-operation. This is the worst case for both performance and power.
A misprediction in which the access is predicted misaligned but is actually aligned generates an unneeded micro-operation, or "bubble," in the pipeline 12 behind the memory access instruction. Once the effective address of the memory access instruction is generated and the control logic 14 detects that the access is actually aligned, the redundant micro-operation may be discarded. If the memory access instruction misses in the cache 26, forcing an access to the off-chip memory 32, for example, the pipeline 12 stalls waiting for the memory access to complete. If an instruction behind the generated micro-operation encounters no hazard in the EX1 or EX2 pipe stages, it may advance to the slot just behind the memory access instruction, collapsing the bubble. In this case, although some power is wasted creating and managing the micro-operation, there is no performance loss. In the more likely case of a memory access instruction that hits in the cache 26 (with the pipeline 12 otherwise not stalled), the bubble flows through the pipeline 12, costing one cycle of performance (assuming one cycle per pipe stage). However, a mispredicted misalignment will not cause an exception or a flush of the pipeline 12 for lack of necessary control resources.
Misalignment prediction may be performed in a variety of ways, some of which are disclosed herein. However, the invention is not limited to the particular misalignment prediction algorithms disclosed. Predicting, in any manner, the misalignment of a memory access instruction and generating a micro-operation to access the misaligned data in response to the prediction, prior to generating the effective address of the instruction, is within the scope of the present invention.
Where misaligned data accesses are common, a reasonable minimal misalignment prediction algorithm may simply assume misalignment, always generating an additional micro-operation before the effective address of a memory access instruction is generated. This guarantees that no exception or pipeline flush occurs due to misalignment, at the cost of one cycle of execution performance for every actually aligned memory access. According to one embodiment of the present invention, such a "predict misaligned" mode is defined by a bit in a control register. When an application expects a large number of misaligned memory accesses, it may enable the mode by setting the bit; while the bit is set, all memory accesses are predicted to be misaligned. In another embodiment, misalignment prediction is controlled by an attribute in the page table entry covering the memory access instruction, so that all memory accesses by instructions from a given page are predicted the same way (either aligned or misaligned).
Most code does not exhibit misaligned memory accesses that are so conveniently confined to a particular code segment or memory region. A more sophisticated misalignment prediction approach is therefore needed, one that can remain enabled continuously without blindly predicting that all memory accesses, or all accesses from a particular page, will be misaligned. For example, in one embodiment, misalignment prediction may follow the alignment of the stack pointer: if the stack pointer is misaligned, memory accesses are predicted to be misaligned.
Methods of predicting the behavior of conditional branch instructions are well known, and many of them may be applied to predicting misalignment. For example, recent memory access behavior can be a good indicator of the alignment of future memory accesses. In one embodiment, a table of one-bit flags, indexed by address bits of the memory access instruction, indicates the alignment of the most recent memory access by each instruction: for example, a one indicates a misaligned access and a zero an aligned access (or vice versa). The misalignment flags may include tags that compare all or most of the memory access instruction address, to prevent aliasing between memory access instructions (which would reduce prediction accuracy). Alternatively, to save resources, the misalignment flags may be indexed using only a few of the least significant address bits.
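As an illustrative software model of the one-bit-flag scheme described above (the table size and index width are assumptions chosen for illustration, not taken from this disclosure):

```python
class OneBitMisalignmentPredictor:
    """Sketch of a one-bit-flag misalignment predictor: one flag per
    table entry records whether the last access by the instruction
    mapping to that entry was misaligned."""

    def __init__(self, index_bits=10):
        self.index_bits = index_bits
        self.flags = [0] * (1 << index_bits)  # 1 = last access misaligned

    def _index(self, pc):
        # Index by the least significant bits of the instruction address.
        return pc & ((1 << self.index_bits) - 1)

    def predict_misaligned(self, pc):
        return self.flags[self._index(pc)] == 1

    def update(self, pc, was_misaligned):
        # Record the alignment of the most recent access by this instruction.
        self.flags[self._index(pc)] = 1 if was_misaligned else 0
```

Note that without tags, two instructions whose addresses share the same low bits alias to the same flag, which is the prediction-accuracy hazard mentioned above.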
The corresponding misalignment flag is checked prior to generating the effective address of the memory access instruction, preferably as early as possible. If the most recent execution of the memory access instruction was misaligned, the pipeline controller may predict that the pending access will also be misaligned, and generate a micro-operation to perform a second memory access. Since the instruction type (i.e., memory access instruction) is first known in the instruction decode pipe stage, the micro-operation is preferably created there. However, the micro-operation may be created later in the pipeline. Any creation of a micro-operation in response to a misalignment prediction, prior to generating the effective address of the memory access instruction, is within the scope of the present invention.
One drawback of the single-bit misalignment flag is that an occasional misaligned memory access instruction in a stream of aligned memory accesses will mispredict twice: once when the misaligned instruction is first encountered, and again on the next aligned execution of the instruction (whose misalignment flag is by then set). A solution to this problem, also known from conditional branch prediction, is a bimodal misalignment predictor comprising a table of two-bit saturating counters indexed by memory access instruction address. Each counter has one of four states:
11-strong misalignment
10-weak misalignment
01-weak alignment
00-strong alignment
When the effective address of a memory access instruction is generated, the corresponding counter is updated. A misaligned memory access increments the state toward strong misalignment, and an aligned access decrements it toward strong alignment. Such a bimodal counter mispredicts only once for an occasional misaligned access in a stream of aligned accesses, at the cost of mispredicting twice at the beginning of a stream of misaligned accesses.
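The four-state saturating counter described above can be modeled in software as follows; this is a sketch for illustration, with the table size an arbitrary assumption:

```python
# Counter states: 00 strong aligned .. 11 strong misaligned
STRONG_ALIGNED, WEAK_ALIGNED, WEAK_MISALIGNED, STRONG_MISALIGNED = 0, 1, 2, 3

class BimodalMisalignmentPredictor:
    """Table of 2-bit saturating counters indexed by instruction address."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.counters = [WEAK_ALIGNED] * (1 << index_bits)

    def predict_misaligned(self, pc):
        # The MSB of the 2-bit counter is the prediction (states 10 and 11).
        return self.counters[pc & self.mask] >= WEAK_MISALIGNED

    def update(self, pc, was_misaligned):
        i = pc & self.mask
        if was_misaligned:
            self.counters[i] = min(self.counters[i] + 1, STRONG_MISALIGNED)
        else:
            self.counters[i] = max(self.counters[i] - 1, STRONG_ALIGNED)
```

Starting from the weak-aligned state, a single misaligned execution moves the counter to weak misaligned, so a lone misaligned access in an aligned stream costs only one misprediction before the counter swings back.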
Another misalignment prediction algorithm that may be borrowed from conditional branch prediction is the local misalignment predictor. The local misalignment predictor maintains two tables. The first is a local misalignment history table, indexed by address bits of the memory access instruction, which records the alignment/misalignment history of the n most recent executions of each memory access instruction. The second is a pattern history table. Like the bimodal predictor, this table contains bimodal counters; however, it is indexed by the misalignment history from the first table. To predict alignment, the misalignment history is looked up, and that history is then used to look up the bimodal counter that supplies the misalignment prediction.
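A software model of such a two-table local predictor might look as follows; the history length and table sizes are illustrative assumptions:

```python
class LocalMisalignmentPredictor:
    """Two-level local predictor: a per-instruction history table feeds
    a shared pattern history table of 2-bit saturating counters."""

    def __init__(self, history_bits=4, index_bits=10):
        self.hmask = (1 << history_bits) - 1
        self.imask = (1 << index_bits) - 1
        self.history = [0] * (1 << index_bits)    # per-instruction histories
        self.pattern = [1] * (1 << history_bits)  # counters, start weak aligned

    def predict_misaligned(self, pc):
        # First lookup: this instruction's recent alignment history.
        h = self.history[pc & self.imask]
        # Second lookup: the bimodal counter selected by that history.
        return self.pattern[h] >= 2

    def update(self, pc, was_misaligned):
        i = pc & self.imask
        h = self.history[i]
        if was_misaligned:
            self.pattern[h] = min(self.pattern[h] + 1, 3)
        else:
            self.pattern[h] = max(self.pattern[h] - 1, 0)
        # Shift the new outcome into this instruction's history.
        self.history[i] = ((h << 1) | int(was_misaligned)) & self.hmask
```

Because the pattern table is shared, instructions whose histories match also share a counter, which is what lets the predictor learn repeating alignment patterns.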
Yet another option for predicting misalignment is a global misalignment predictor, which exploits the fact that the alignment behavior of many memory accesses is strongly correlated with the history of other recent memory accesses. The global misalignment predictor keeps a single shift register, updated with the misalignment outcome of each memory access instruction executed, and uses this value to index a table of bimodal counters.
Alternatively, the table of bimodal counters may be indexed with the recent misalignment history concatenated with a few bits of the memory access instruction address, referred to as a gselect predictor. For small table sizes, gselect may produce more accurate results than local prediction. As another alternative, the memory access instruction address may be XORed with the global history rather than concatenated with it, which is referred to as the gshare predictor. For larger tables, gshare may produce more accurate misalignment predictions than gselect. Even where gselect and gshare are less accurate than local prediction, they may be preferred for implementation reasons: gselect and gshare require a single table lookup per prediction, while local prediction requires two table lookups in series.
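The two index computations can be sketched as follows; the bit widths are assumptions chosen purely for illustration:

```python
def gselect_index(pc, global_history, pc_bits=4, hist_bits=6):
    """gselect: concatenate low instruction address bits with recent
    global misalignment history to form the counter-table index."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = global_history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part

def gshare_index(pc, global_history, index_bits=10):
    """gshare: XOR the instruction address with the global history,
    hashing both into a single index of the same width."""
    mask = (1 << index_bits) - 1
    return (pc ^ global_history) & mask
```

gselect dedicates fixed bit fields to each source, while gshare folds both sources into the full index width, which is why gshare tends to use a large table more effectively.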
In 1993, Scott McFarling proposed combined branch predictors in Digital Western Research Laboratory Technical Note TN-36, "Combining Branch Predictors," which is incorporated herein by reference in its entirety. In accordance with the present invention, the techniques proposed by McFarling may be advantageously applied to the problem of predicting misaligned memory accesses, so as to generate pipeline micro-operations prior to the effective address generation of a memory access instruction.
In one embodiment, combined misalignment prediction uses three predictors in parallel: a bimodal predictor, a gshare predictor, and a bimodal-like chooser that selects, for each memory access instruction, whether the bimodal or the gshare prediction is used. The chooser is another table of 2-bit up/down saturating counters, whose MSB in this case selects the prediction to use. The chooser counter is updated whenever the bimodal and gshare predictions disagree, in favor of whichever predictor was accurate.
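A software sketch of such a combined predictor, with the chooser updated only on disagreement, might look as follows (table sizes, initial counter states, and which chooser polarity favors gshare are all assumptions):

```python
class CombinedMisalignmentPredictor:
    """Bimodal + gshare component predictors with a per-instruction
    2-bit chooser, after McFarling's combined branch predictor."""

    def __init__(self, index_bits=8):
        n = 1 << index_bits
        self.mask = n - 1
        self.bimodal = [1] * n   # 2-bit counters, start weakly aligned
        self.gshare = [1] * n
        self.chooser = [1] * n   # MSB set = use gshare (assumed polarity)
        self.ghist = 0           # global misalignment history register

    def _components(self, pc):
        b = self.bimodal[pc & self.mask] >= 2
        g = self.gshare[(pc ^ self.ghist) & self.mask] >= 2
        return b, g

    def predict_misaligned(self, pc):
        b, g = self._components(pc)
        return g if self.chooser[pc & self.mask] >= 2 else b

    def update(self, pc, was_misaligned):
        b, g = self._components(pc)
        i, j = pc & self.mask, (pc ^ self.ghist) & self.mask
        # Train the chooser only when the components disagree,
        # in favor of whichever component was correct.
        if b != g:
            if g == was_misaligned:
                self.chooser[i] = min(self.chooser[i] + 1, 3)
            else:
                self.chooser[i] = max(self.chooser[i] - 1, 0)
        step = 1 if was_misaligned else -1
        self.bimodal[i] = max(0, min(3, self.bimodal[i] + step))
        self.gshare[j] = max(0, min(3, self.gshare[j] + step))
        self.ghist = ((self.ghist << 1) | int(was_misaligned)) & self.mask
```

Updating both component tables on every outcome, while steering the chooser only on disagreement, lets each instruction settle on whichever component tracks its behavior better.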
In another embodiment, the misalignment predictor may maintain a misalignment cache, which may be fully associative or set associative, and which may be indexed by a portion of the memory access instruction address, or by a portion of the address concatenated or XORed with other recent misalignment history (as in the gselect and gshare variants above). The cache may be indexed early in the pipeline, such as during the instruction fetch pipe stage (i.e., even before the instruction is known to be a memory access instruction). If the misalignment cache hits, the memory access was recently misaligned, and may be predicted misaligned. If the cache access misses, the memory access is predicted to be aligned. Entries are added to the cache for misaligned memory accesses that were not predicted to be misaligned, and entries are removed from the cache for aligned memory accesses that were predicted to be misaligned.
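A simple software model of such a misalignment cache, here fully associative with oldest-first replacement (the capacity and replacement policy are assumptions), might look as follows:

```python
class MisalignmentCache:
    """Fully associative cache of instruction addresses whose accesses
    were recently misaligned; a hit predicts a misaligned access."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = []  # most recently inserted address at the end

    def predict_misaligned(self, pc):
        # A hit means this instruction was recently misaligned.
        return pc in self.entries

    def update(self, pc, predicted_misaligned, was_misaligned):
        if was_misaligned and not predicted_misaligned:
            # Misaligned access that was predicted aligned: add an entry.
            if len(self.entries) >= self.capacity:
                self.entries.pop(0)  # evict the oldest entry
            self.entries.append(pc)
        elif not was_misaligned and predicted_misaligned:
            # Aligned access that was predicted misaligned: remove it.
            if pc in self.entries:
                self.entries.remove(pc)
```

Because presence in the cache is itself the prediction, this structure needs no counters; the add/remove policy plays the role the counter updates play in the table-based predictors.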
Various other misalignment prediction algorithms are possible. For example, the misalignment predictor may maintain detailed statistics of alignment behavior of memory access instructions, and predict misalignment on an instruction-by-instruction basis or globally based on a statistical average of past alignment experiences. Similarly, the misalignment predictor may maintain a rolling average of the alignment of the n most recent memory access instructions.
Some instruction set architectures include static prediction bits in the opcode that may be specified by a programmer based on his or her specific knowledge of the application. For example, if branches are used in a "branch on error" scenario and the errors are relatively rare, the programmer may statically predict those branches as "not taken." Similarly, a programmer may have insight into the memory alignment behavior of a particular application. For example, many data processing applications use well-designed, ordered data structures, and little, if any, misaligned memory access would be expected. On the other hand, certain applications may expect a large number of misaligned data accesses. Examples may include a communication program extracting specific data from a continuous data stream on a shared channel, or a data acquisition application recording data from a continuous output in response to asynchronous triggers. In such applications, enabling or biasing misalignment prediction to a more aggressive mode may improve processor performance and power savings. According to one embodiment of the present invention, a programmer may influence the misalignment prediction behavior of a program via a flag in the memory access instructions, or via a set of dedicated misaligned memory access instruction types.
In one embodiment, memory access instructions (e.g., LD and ST instructions) include a flag in the parameter list indicating that misalignment prediction should be performed. Alternatively, the instruction set may include new instructions (e.g., LDMAL and STMAL) for potentially misaligned load and store operations, respectively. This flag or new instruction provides an input to the misalignment predictor 56 to enable misalignment prediction and to generate a micro-operation early, before effective address generation, to perform the additional memory access cycle needed to access the misaligned data.
In another embodiment, a misalignment prediction flag or instruction type places the misalignment predictor 56 in a mode in which it makes more aggressive misalignment predictions than it otherwise would. For example, a flag or instruction type may switch the misalignment predictor from using a two-bit bimodal saturating counter (as described above) to a three-bit saturating counter, in which five or six of the eight states indicate some degree of predicted misalignment. One advantage of such misalignment prediction flags or instruction types is that misalignment prediction is controlled by the programmer, who can rely on his or her knowledge of application behavior to better judge when misalignment prediction will yield processor performance and power improvements.
Although the present invention has been described herein with respect to particular features, aspects and embodiments thereof, it will be apparent that numerous variations, modifications, and other embodiments are possible within the broad scope of the present invention, and accordingly, all variations, modifications and embodiments are to be regarded as being within the scope of the invention. The present embodiments are, therefore, to be construed in all aspects as illustrative and not restrictive and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.

Claims (31)

1. A method of processing a memory access instruction in an instruction execution pipeline, the memory access instruction performing a first memory access comprising:
predicting a data misalignment for the memory access instruction; and
generating at least one micro-operation in the pipeline in response to the prediction, the micro-operation performing a second memory access on misaligned data, prior to generating an effective address for the memory access instruction.
2. The method of claim 1, wherein generating at least one micro-operation in the pipeline comprises generating the micro-operation in an instruction decode pipe stage.
3. The method of claim 1, wherein generating at least one micro-operation includes assigning pipeline control resources for the micro-operation.
4. The method of claim 3, wherein the pipeline control resource comprises at least one entry in an instruction tracking FIFO.
5. The method of claim 3, wherein the pipeline control resource comprises an available location in a cache miss queue.
6. The method of claim 1, wherein predicting data misalignment for the memory access instruction comprises setting a misalignment prediction bit in a control register such that when the bit is set, all memory access instructions are predicted to be misaligned.
7. The method of claim 1, wherein predicting data misalignment for the memory access instruction comprises setting one or more attributes on the memory access instruction page table entries such that all memory access instructions on a respective page are predicted to be misaligned when the attributes are set.
8. The method of claim 1, wherein predicting data misalignment for the memory access instruction comprises predicting data misalignment when a stack pointer is misaligned, and predicting data alignment when the stack pointer is aligned.
9. The method of claim 1, wherein predicting data misalignment for the memory access instruction includes storing an alignment history, and predicting misalignment in response to the alignment history.
10. The method of claim 9, wherein storing an alignment history comprises storing an alignment history associated with the memory access instruction.
11. The method of claim 10, wherein the alignment history is indexed by a plurality of instruction address bits associated with the memory access instruction.
12. The method of claim 11, wherein the alignment history includes a flag indicating an alignment of the most recent memory access instruction.
13. The method of claim 11, wherein
Storing the alignment history includes incrementing or decrementing a bimodal saturation counter in response to the alignment of each of the memory access instructions; and wherein
Predicting data misalignment in response to the alignment history includes outputting the MSB of the bimodal saturation counter.
14. The method of claim 11, wherein
Storing the alignment history includes incrementing or decrementing a bimodal saturation counter in response to the alignment of each of the memory access instructions; and wherein
Predicting data misalignment in response to the alignment history includes outputting a data misalignment prediction based on an encoding of the bits of the counter.
15. The method of claim 14, wherein
Storing the alignment history includes storing alignment indications for a predetermined number of the most recent memory access instructions; and wherein
Predicting data misalignment in response to the alignment history includes indexing a table of bimodal counters using the indications, and outputting the MSB of the indexed bimodal counter.
16. The method of claim 9, wherein
Storing the alignment history includes storing the alignment history associated with all memory access instructions; and wherein predicting misalignment in response to the alignment history comprises indexing a table of bimodal counters using the alignment history, and outputting the MSB of the indexed bimodal counter.
17. The method of claim 16, wherein the table of bimodal counters is indexed with the alignment history concatenated with a plurality of address bits associated with the memory access instruction.
18. The method of claim 16, wherein the table of bimodal counters is indexed with the alignment history exclusive-ORed with a plurality of address bits associated with the memory access instruction.
19. The method of claim 9, wherein storing an alignment history comprises:
incrementing or decrementing a separate bimodal saturation counter in response to the alignment of each said memory access instruction; and
storing a synthetic alignment history associated with all memory access instructions;
and wherein predicting misalignment in response to the alignment history comprises:
generating a first predictor including the MSB of the bimodal saturation counter associated with the memory access instruction;
generating a second predictor including the MSB of a bimodal counter in a table indexed by the synthetic alignment history exclusive-ORed with a plurality of address bits associated with the memory access instruction; and
outputting the MSB of a selector bimodal saturation counter, the selector counter being updated, when the first predictor disagrees with the second predictor, in a direction favoring whichever of the first and second predictors was accurate.
20. The method of claim 9, wherein storing past alignment history includes maintaining a statistical average of an alignment of past memory access instructions.
21. The method of claim 9, wherein storing past alignment experiences includes maintaining a rolling average of an alignment of a predetermined number of most recent memory access instructions.
22. The method of claim 9, wherein storing past alignment experiences includes maintaining a misaligned cache of misaligned memory accesses that are predicted to be aligned, and wherein predicting misalignment in response to the alignment history includes a hit in the misaligned cache.
23. The method of claim 22, wherein predicting misalignment in response to the alignment history further comprises indexing the misalignment cache prior to decoding the memory access instruction.
24. The method of claim 22, further comprising removing aligned memory accesses predicted to be misaligned from the cache.
25. The method of claim 1, wherein predicting data misalignment for the memory access instruction comprises predicting data misalignment in response to a flag in the memory access instruction.
26. The method of claim 1, wherein predicting data misalignment for the memory access instruction comprises predicting data misalignment in response to the memory access instruction comprising a likely misaligned memory access instruction.
27. A pipeline processor, comprising:
an instruction execution pipeline including a plurality of pipe stages;
a misalignment predictor that generates a prediction that a memory access instruction will access misaligned data; and
control logic to generate additional micro-operations in the pipeline to perform additional memory accesses in response to the prediction prior to generating an effective address for the memory access instruction.
28. The processor of claim 27 wherein the additional micro-operations are generated in an instruction decode pipe stage.
29. The processor of claim 27, wherein the micro-operation occupies at least one of the pipe stages.
30. The processor of claim 27, wherein the misalignment predictor comprises a memory that stores the memory access instruction alignment history.
31. The processor of claim 27 further comprising an instruction execution FIFO and wherein the control logic generates entries in the instruction execution FIFO corresponding to the micro-operations.
HK08108338.6A 2005-02-17 2006-02-16 Unaligned memory access prediction HK1112983A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/062,221 2005-02-17

Publications (1)

Publication Number Publication Date
HK1112983A true HK1112983A (en) 2008-09-19
