HK40081717A - Data cache with hybrid writeback and writethrough
- Publication number: HK40081717A (application HK62023070246.1A)
- Authority: HK (Hong Kong)
- Prior art keywords: data, cache line, coherency state, write, hit
Description
Technical Field
The present disclosure relates to data caches, and in particular to a data cache that implements hybrid write-back and write-through when the data cache is in a shared or exclusive coherency state.
Background
A data cache is a hardware component and/or a software component that stores data so that future requests for the data can be serviced more quickly. Typically, a data cache is either a write-back data cache or a write-through data cache, where the cache type controls when data stored in the data cache is written to a backing store, memory, or the like. Under the write-back policy, the write to the backing store is deferred until the modified contents of the data cache are about to be replaced by another cache block or until some other policy triggers the write. Under the write-through policy, the write to the backing store is performed synchronously with the write to the data cache.
Data caching in shared memory multiprocessor systems typically operates subject to cache coherency protocols and mechanisms that ensure that changes in the values of shared data propagate throughout the shared memory multiprocessor system in a timely manner. Two common cache coherency protocols are, for example, the modified, exclusive, shared, invalid (MESI) protocol and the modified, shared, invalid (MSI) protocol. In an implementation, an exclusive coherency protocol state may be referred to as a unique coherency protocol state. Typically, in the modified coherency protocol state, a cache line exists only in the current cache and is dirty. That is, the data in the cache line differs from the data in the backing store. In this case, the data cache is required to write the data back to the backing store at some future time before any other read of the (no longer valid) backing store data is permitted. After the write back is performed, the cache line changes to the shared coherency protocol state. In the exclusive coherency protocol state, the cache line is present only in the current data cache and is clean. That is, the data in the cache line matches the data in the backing store. In response to a read request, the cache line may change to the shared coherency protocol state at any time; alternatively, it may change to the modified coherency protocol state upon a write to the cache line. In the shared coherency protocol state, the cache line may also be stored in other caches of the system and is clean. That is, the data in the cache line matches the data in the backing store. A cache line in the shared state may be discarded (changed to the invalid coherency protocol state) at any time. In the invalid coherency protocol state, the cache line is invalid (unused).
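For illustration only, the state semantics above can be summarized in a short C++ sketch; the names used below (Mesi, isDirty, and so on) are assumptions made for the example and are not part of any described implementation.

```cpp
#include <cstdio>

// Illustrative encoding of the four MESI coherency states described above.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Only a Modified line differs from the backing store and must be written back.
bool isDirty(Mesi s) { return s == Mesi::Modified; }

// Only a Modified line may be written without first upgrading its state.
bool mayWriteFreely(Mesi s) { return s == Mesi::Modified; }

// Modified and Exclusive lines are guaranteed to be the only cached copy.
bool isSoleCopy(Mesi s) { return s == Mesi::Modified || s == Mesi::Exclusive; }

int main() {
    const char* names[] = {"Modified", "Exclusive", "Shared", "Invalid"};
    for (Mesi s : {Mesi::Modified, Mesi::Exclusive, Mesi::Shared, Mesi::Invalid}) {
        std::printf("%-9s dirty=%d writable-without-upgrade=%d sole-copy=%d\n",
                    names[static_cast<int>(s)], isDirty(s), mayWriteFreely(s), isSoleCopy(s));
    }
    return 0;
}
```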
In write-back data caches, a store (or a number of stores) may be issued to a cache line or cache block that is in a "clean" (invalid, shared, or exclusive) coherency protocol state, which classically carries only read permission. Writes may be freely performed only once the cache line is established in or upgraded to the modified coherency protocol state. A cache line in the exclusive coherency protocol state also has to be upgraded to the modified coherency protocol state to become globally visible.
Coherency protocol upgrades can be done using a coherency mechanism such as snooping, where each data cache monitors the address lines for accesses to its cached memory locations, or directories, where a backing controller remembers which cache(s) hold which coherency permission(s) on which cache block(s). This coherency protocol upgrade process takes time in the interconnect to send probes or snoops that demote other caches, resulting in store instruction retirement latency and performance degradation.
Drawings
The disclosure is best understood from the following detailed description when read with the accompanying drawing figures. It is emphasized that, according to common practice, the various features of the drawings are not to scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity.
FIG. 1 is a high-level block diagram of an example of a processing system for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure.
FIG. 2 is a high-level block diagram of an example load/store unit of a processing system for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure.
FIG. 3 is a flow diagram of an example technique or method for implementing hybrid write back and write through in accordance with an embodiment of the present disclosure.
FIG. 4 is a flow diagram of an example technique or method for implementing hybrid write back and write through in accordance with an embodiment of the present disclosure.
FIG. 5 is a diagram of an example technique for implementing hybrid write back and write through in accordance with an embodiment of the present disclosure.
FIG. 6 is a diagram of an example technique for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure.
FIG. 7 is a diagram of an example technique for implementing hybrid write back and write through in accordance with an embodiment of the present disclosure.
Detailed Description
Disclosed herein are systems and methods for hybrid write-back and write-through data caching. A multiprocessor processing system can include a plurality of processors and a shared memory. Each processor can have a data cache including an L1 data cache. The L1 data cache may be a hybrid write-back and write-through data cache that can mitigate latency associated with performing a coherency protocol upgrade and still comply with the policies of the cache coherency protocol.
The processor includes a hybrid write-back and write-through data cache, a write buffer that tracks the hybrid write-back and write-through data cache, and a store queue. The store queue writes data to a hit cache line in the hybrid write-back and write-through data cache and allocates an entry in the write buffer for the data, even though the cache line in the hybrid write-back and write-through data cache is in a shared or exclusive coherency state. This results in the hit cache line being in a shared coherency state with data and the allocated entry in the write buffer being in a modified coherency state with data. The write buffer messages the memory controller to upgrade the hit cache line to a modified coherency state with data, and the memory controller messages the hybrid write-back and write-through data cache accordingly. The write buffer then retires the data, and the hybrid write-back and write-through data cache writes the data back to memory for a defined event. If the processor receives a probe before the upgrade or the write back, the write buffer instead writes through either the updated data or the dirty data. For example, if the hit cache line in the hybrid write-back and write-through data cache is probed to demote to a shared or invalid coherency state, the write buffer will write through either the updated data or the dirty data.
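As a rough, illustrative sketch of this store path (the structure and function names such as LoadStoreUnit and storeRetire are assumed for the example and heavily simplified), the same store data ends up in two places with two different coherency states:

```cpp
#include <cstdint>
#include <vector>

enum class Coh { Modified, Exclusive, Shared, Invalid };

struct CacheLine { uint64_t addr; uint64_t data; Coh state; };
struct WriteBufferEntry { uint64_t addr; uint64_t data; Coh state; };

struct LoadStoreUnit {
    std::vector<CacheLine> l1;            // hybrid write-back/write-through L1 data cache
    std::vector<WriteBufferEntry> wbuf;   // write buffer that tracks the L1 (e.g., an MSHR)

    // Retire a store that hits a line currently in the Shared or Exclusive state.
    bool storeRetire(uint64_t addr, uint64_t data) {
        for (CacheLine& line : l1) {
            if (line.addr != addr) continue;
            if (line.state != Coh::Shared && line.state != Coh::Exclusive) continue;
            line.data = data;            // write the data into the hit cache line ...
            line.state = Coh::Shared;    // ... which is now merely Shared with data
            wbuf.push_back({addr, data, Coh::Modified});  // allocated entry carries the dirty copy
            return true;
        }
        return false;  // a miss would take the normal miss path instead (not modeled here)
    }
};

int main() {
    LoadStoreUnit lsu;
    lsu.l1.push_back({0x1000, 0, Coh::Exclusive});
    lsu.storeRetire(0x1000, 42);   // line ends up Shared with data; write buffer entry is Modified
    return 0;
}
```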
The updated data in the hit cache line in the hybrid write-back and write-through data cache is readable by loads from the local processor hart, which is a resource abstraction representing a running context that progresses independently within the execution environment. In other words, a hart is a resource within an execution environment that has state and follows a stream of instructions independently of other software executing within the same execution environment. The updated data is not readable by snoop probes. Stated differently, the updated data is not readable by non-local entities, which can include, for example, non-local processors, non-local cache controllers, non-local cores, and the like.
The use of a hybrid write-back and write-through data cache has the effect of extending the local store buffering allowed by the memory coherency ordering model into the actual contents of the data cache, which can then be made globally visible whenever the data cache is upgraded to a modified coherency state. The techniques for a hybrid write-back and write-through data cache implementation are applicable to the weak memory ordering (WMO) model in RISC-V and ARM processors and to the Total Store Order (TSO) model in x86 processors.
These and other aspects of the present disclosure are disclosed in the following detailed description, appended claims, and accompanying drawings.
As used herein, the term "processor" refers to one or more processors, such as one or more special purpose processors, one or more digital signal processors, one or more microprocessors, one or more controllers, one or more microcontrollers, one or more application processors, one or more Central Processing Units (CPUs), one or more Graphics Processing Units (GPUs), one or more Digital Signal Processors (DSPs), one or more Application Specific Integrated Circuits (ASICs), one or more application specific standard products, one or more field programmable gate arrays, any other type or combination of integrated circuits, one or more state machines, or any combination thereof.
The term "circuit" refers to an arrangement of electronic components (e.g., transistors, resistors, capacitors, and/or inductors) configured to perform one or more functions. For example, a circuit may include one or more transistors interconnected to form logic gates that collectively implement a logic function. For example, the processor can be a circuit.
As used herein, the terms "determine" and "identify," or any variation thereof, include selecting, ascertaining, calculating, looking up, receiving, determining, establishing, obtaining, or otherwise identifying or determining in any way, regardless of using one or more of the devices and methods shown and described herein.
As used herein, the terms "example," "embodiment," "implementation," "aspect," "feature," or "element" are intended to be used as examples, instances, or illustrations. Any example, embodiment, implementation, aspect, feature, or element is independent of each other example, embodiment, implementation, aspect, feature, or element and may be used in combination with any other example, embodiment, implementation, aspect, feature, or element, unless expressly indicated otherwise.
As used herein, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise, or clear from context, "X includes A or B" is intended to indicate any natural inclusive permutation. That is, if X includes A; X includes B; or X includes A and B, then "X includes A or B" is satisfied under any of the above circumstances. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.
Moreover, for simplicity of explanation, while the figures and descriptions herein may include a sequence or series of steps or stages, the elements of the methods disclosed herein may occur in various orders or concurrently. Additionally, elements of methods disclosed herein may occur in conjunction with other elements not explicitly shown or described herein. Moreover, not all elements of a method described herein may be required to implement a method in accordance with the present disclosure. Although aspects, features, and elements are described herein in particular combinations, each aspect, feature, or element can be used alone or in various combinations with or without other aspects, features, and elements.
It should be understood that the figures and descriptions of the embodiments have been simplified to illustrate elements that are relevant for a clear understanding, while eliminating, for purposes of clarity, many other elements found in a typical processor. One of ordinary skill in the art may recognize that other elements and/or steps are desirable and/or necessary in implementing the present disclosure. However, because such elements and steps do not facilitate a better understanding of the present disclosure, a discussion of such elements and steps is not provided herein.
Fig. 1 is a high-level block diagram of an example of a processing system 1000 for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. The processing system 1000 is capable of implementing a pipelined architecture. Processing system 1000 can be configured to decode and execute instructions of an Instruction Set Architecture (ISA) (e.g., RISC-V instruction set). Instructions can be executed speculatively and out of order in the processing system 1000. The processing system 1000 can be a computing device, microprocessor, microcontroller, or IP core. The processing system 1000 can be implemented as an integrated circuit.
The processing system 1000 includes at least one processor core 1100. Processor core 1100 can be implemented using one or more Central Processing Units (CPUs). Each processor core 1100 is capable of connecting to one or more memory modules 1200 via an interconnection network 1300 and a memory controller 1400. One or more memory modules 1200 can be referred to as external memory, main memory, backing store, coherent memory, or backing structure (collectively "backing structures").
Each processor core 1100 can include an L1 instruction cache 1500, the L1 instruction cache 1500 being associated with an L1 Translation Lookaside Buffer (TLB) 1510 for virtual-to-physical address translation. Instruction queue 1520 buffers instructions fetched from L1 instruction cache 1500 based on branch prediction 1525 and other fetch pipeline processing. Dequeued instructions are renamed in rename unit 1530 to avoid false data dependencies and then dispatched by dispatch/retirement unit 1540 to appropriate back-end execution units, including, for example, floating-point execution unit 1600, integer execution unit 1700, and load/store execution unit 1800. The floating-point execution unit 1600 can be allocated a physical register file, FP register file 1610, and the integer execution unit 1700 can be allocated a physical register file, INT register file 1710. The FP register file 1610 and the INT register file 1710 are also coupled to the load/store execution unit 1800, which can access the L1 data cache 1900 via an L1 data TLB 1910. The L1 data TLB 1910 is coupled to an L2 TLB 1920, which in turn is coupled to the L1 instruction TLB 1510. The L1 data cache 1900 is coupled to an L2 cache 1930, which L2 cache 1930 is coupled to the L1 instruction cache 1500.
Each element or component of processing system 1000 and processing system 1000 is illustrative and can include additional, fewer, or different devices, entities, elements, components, etc., which can be similarly or differently configured without departing from the scope of the description and claims herein. Further, the illustrated devices, entities, elements and components are capable of performing other functions without departing from the scope of the description and claims herein.
FIG. 2 is a high-level block diagram of an example load/store unit 2000 of a processing system for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. Load/store unit 2000 can include an issue queue 2100, where issue queue 2100 stores instructions dispatched from dispatch/retirement unit 1540 of FIG. 1. Issue queue 2100 is capable of issuing instructions into load/store tag pipeline 2200, which load/store tag pipeline 2200 is then capable of allocating entries in load/store data pipeline 2300, load queue 2400, store queue 2500, and miss status handling register (MSHR) 2600. Store instruction data is buffered in the store queue 2500 until committed, and the writes are then gathered at retirement into the L1 data cache 2700 or the MSHR 2600.
Each element or component in load/store unit 2000 and load/store unit 2000 is illustrative and can include additional, fewer, or different devices, entities, elements, components, etc., which can be similarly or differently configured without departing from the scope of the description and claims herein. Further, the illustrated devices, entities, elements and components are capable of performing other functions without departing from the scope of the description and claims herein.
FIG. 3 is a flowchart 3000 of an example technique or method for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. Flowchart 3000 can be implemented, for example, in processing system 1000 of FIG. 1, load/store unit 2000 of FIG. 2, and similar devices and systems. Flowchart 3000 describes the communication or interaction between load/store unit 3100 and backing structure 3200. In implementations, the backing structure 3200 can include a controller. Load/store unit 3100 can include store queue 3300, L1 data cache 3400, and write buffer 3500. The L1 data cache 3400 can include multiple cache lines. The write buffer 3500 can track the coherency state of the L1 data cache 3400. In the starting state of flowchart 3000, a hit cache line of L1 data cache 3400 can have either a shared coherency state or an exclusive coherency state (collectively referred to as the "starting coherency state"). In an implementation, the write buffer 3500 is an MSHR, such as MSHR 2600 of FIG. 2.
After dequeuing or retiring a store queue entry, store queue 3300 can write data to the hit cache line in L1 data cache 3400 (3610) and allocate an entry in write buffer 3500 (3620). As a result, the hit cache line in the L1 data cache 3400 is now in the starting coherency state with data and the allocated entry in the write buffer 3500 is now in a modified coherency state with data. Since the write buffer 3500 tracks the L1 data cache 3400, the write buffer 3500 can send a message to the backing structure 3200 to upgrade the coherency state of the hit cache line in the L1 data cache 3400 (3630). The backing structure 3200 can upgrade the hit cache line in the L1 data cache 3400 to a modified coherency state with data (3640). The write buffer 3500 retires the data in the allocated entry upon confirmation of the upgrade of the hit cache line in the L1 data cache 3400. Thus, the coherency state of the hit cache line in the L1 data cache 3400 now conforms to the coherency protocol. The L1 data cache 3400 can write the data in the hit cache line back to the backing structure 3200 as appropriate (3650).
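The sequence of flowchart 3000 can be sketched, for illustration only and under assumed names, as follows; the backing structure 3200 is reduced to a single upgrade call:

```cpp
#include <cstdio>

enum class Coh { Modified, Exclusive, Shared, Invalid };

struct Line  { Coh state = Coh::Shared; };                        // hit cache line after the store
struct Entry { Coh state = Coh::Modified; bool valid = true; };   // allocated write buffer entry

// 3630/3640: the write buffer messages the backing structure, which upgrades
// the hit cache line to a modified coherency state with data.
void upgradeOnRequest(Line& line) { line.state = Coh::Modified; }

// The write buffer retires the data in the allocated entry once the upgrade
// of the hit cache line is confirmed.
void onUpgradeConfirmed(Entry& entry) { entry.valid = false; }

// 3650: on a defined event (eviction, flush, ...), the data cache writes the
// dirty line back to the backing structure; the line is clean afterwards.
void writeBackOnDefinedEvent(Line& line) {
    if (line.state == Coh::Modified) {
        std::puts("write back to backing structure");
        line.state = Coh::Shared;
    }
}

int main() {
    Line line;
    Entry entry;
    upgradeOnRequest(line);          // 3630/3640
    onUpgradeConfirmed(entry);
    writeBackOnDefinedEvent(line);   // 3650
    return 0;
}
```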
FIG. 4 is a flow diagram 4000 of an example technique or method for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. Flowchart 4000 can be implemented, for example, in processing system 1000 of FIG. 1, load/store unit 2000 of FIG. 2, and similar devices and systems. Flowchart 4000 describes communication or interaction between load/store unit 4100 and backing structure 4200, non-local entity 4250, or both. In an implementation, backing structure 4200 can include a controller. Load/store unit 4100 can include a store queue 4300, an L1 data cache 4400, and a write buffer 4500. The L1 data cache 4400 can include multiple cache lines. The write buffer 4500 is capable of tracking the coherency state of the L1 data cache 4400. In the starting state of flowchart 4000, a hit cache line of L1 data cache 4400 can have either a shared coherency state or an exclusive coherency state (collectively referred to as the "starting coherency state"). In an implementation, the write buffer 4500 is an MSHR such as MSHR 2600 of FIG. 2.
After dequeuing or retiring a store queue entry, the store queue 4300 can write data to the hit cache line of the L1 data cache 4400 (4610) and allocate an entry in the write buffer 4500 (4620). As a result, the hit cache line in the L1 data cache 4400 is now in the starting coherency state with data and the allocated entry in the write buffer 4500 is now in a modified coherency state with data. A probe 4700 is received from the backing structure 4200 or the non-local entity 4250 before the cache line in the L1 data cache 4400 is upgraded (4630). Probe 4700 can be a demote to a shared coherency state or a demote to an invalid coherency state. The probe 4700 checks against both the hit cache line in the L1 data cache 4400 and the allocated entry in the write buffer 4500.
In the event probe 4700 is a demote to a shared coherency state, the hit cache line in L1 data cache 4400 remains valid in its current coherency state, which is the shared coherency state with data. The write buffer 4500 is not in the correct coherency state and writes the data through to the backing structure 4200 or the non-local entity 4250 (4640). Thus, the coherency state of the hit cache line in the L1 data cache 4400 now conforms to the coherency protocol.
In the event probe 4700 is a demote to an invalid coherency state, the hit cache line in L1 data cache 4400 is demoted to an invalid coherency state and the data in the cache line is discarded. The allocated entry in the write buffer 4500 is demoted to an invalid coherency state, and the write buffer 4500 writes the data through to the backing structure 4200 or the non-local entity 4250 (4640). Thus, the coherency state of the hit cache line in the L1 data cache 4400 now conforms to the coherency protocol.
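For illustration, the two probe outcomes of flowchart 4000 can be sketched as follows; writeThrough() is an assumed stand-in for the transfer to the backing structure 4200 or the non-local entity 4250:

```cpp
#include <cstdio>

enum class Coh { Modified, Exclusive, Shared, Invalid };
enum class Probe { DemoteToShared, DemoteToInvalid };

struct Line  { Coh state; bool hasData; };   // hit cache line
struct Entry { Coh state; bool valid;  };    // allocated write buffer entry

void writeThrough(const char* what) { std::printf("write %s through to backing structure\n", what); }

// Probe received before the upgrade (4630): it checks against both the hit
// cache line and the allocated entry in the write buffer.
void onProbe(Probe p, Line& line, Entry& entry) {
    if (p == Probe::DemoteToShared) {
        // The hit line is already Shared with data, so it stays valid as-is;
        // the Modified write buffer entry is no longer allowed and writes through.
        entry.state = Coh::Shared;
        writeThrough("updated data");
    } else {  // DemoteToInvalid
        line.state = Coh::Invalid;    // discard the data in the hit line
        line.hasData = false;
        entry.state = Coh::Invalid;   // demote the entry and push its data out
        writeThrough("dirty data");
    }
}

int main() {
    Line line{Coh::Shared, true};
    Entry entry{Coh::Modified, true};
    onProbe(Probe::DemoteToInvalid, line, entry);
    return 0;
}
```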
FIG. 5 is a diagram of an example technique 5000 for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. The technique includes: retiring 5100 a store queue entry in the store queue; writing 5200 data of the retired store queue entry in a hit cache line having a shared coherency state or an exclusive coherency state; allocating 5300 an entry in a write buffer and writing the data for the hit cache line having the shared coherency state or the exclusive coherency state; instructing 5400 the backing structure controller to upgrade the hit cache line; upgrading 5500 the hit cache line to a modified coherency state; retiring 5600 the data in the write buffer after confirming the hit cache line upgrade; and performing 5700 a write back to the backing structure as appropriate. The technique 5000 can be implemented, for example, in the processing system 1000 of FIG. 1, the load/store unit 2000 of FIG. 2, and similar devices and systems.
The technique 5000 includes retiring 5100 a store queue entry in the store queue. The store queue retires data from the store queue entry after commit.
The technique 5000 includes writing 5200 data of a retired store queue entry in a hit cache line having a shared coherency state or an exclusive coherency state. Even though the hit cache line is in a shared or exclusive coherency state, the retired data is written by the store queue to the data cache, and in particular to the cache line, when there is a hit with respect to a memory location or address. The hit cache line is now in a shared coherency state with the data. In the event of a data cache miss, the data is written to a secondary cache or a backing store in the hierarchical memory structure.
Technique 5000 includes allocating 5300 an entry in a write buffer and writing the data for the hit cache line having the shared coherency state or the exclusive coherency state. In addition to writing the data to the hit cache line, the store queue also allocates an entry in the write buffer and writes the same data to it. In an implementation, the write buffer is an MSHR. The allocated entry in the write buffer is now in a modified coherency state with the data.
Technique 5000 includes instructing 5400 the backing structure controller to upgrade the hit cache line. The write buffer is able to track the coherency state of the data cache. As a result, the write buffer can instruct the backing structure controller to upgrade the coherency state of the hit cache line to a modified coherency state with data.
The technique 5000 includes upgrading 5500 the hit cache line to a modified coherency state. In response to the instruction from the write buffer, the backing structure controller upgrades the hit cache line to a modified coherency state with the data.
The technique 5000 includes retiring 5600 the data in the write buffer after confirming the hit cache line upgrade. The write buffer confirms the upgrade of the hit cache line and retires the data in the allocated entry.
The technique 5000 includes performing 5700 a write back to the backing structure as appropriate. For a defined event, the data cache performs a write back of the data in the hit cache line to the backing structure.
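What qualifies as a defined event is implementation specific; eviction of the line on replacement is one common example. A minimal, illustrative sketch of that case (assumed names) follows:

```cpp
#include <cstdio>

enum class Coh { Modified, Exclusive, Shared, Invalid };

struct Line { unsigned long addr; unsigned long data; Coh state; };

// Evicting a line is one example of a defined event: a Modified line must be
// written back to the backing structure before the frame is reused, whereas a
// clean line may simply be dropped.
void evict(Line& line) {
    if (line.state == Coh::Modified) {
        std::printf("write back addr=%#lx data=%lu to backing structure\n", line.addr, line.data);
    }
    line.state = Coh::Invalid;   // the frame is now free for a new allocation
}

int main() {
    Line line{0x2000, 7, Coh::Modified};
    evict(line);
    return 0;
}
```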
FIG. 6 is a diagram of an example technique 6000 for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. The technique includes: retiring 6100 a store queue entry in the store queue; writing 6200 data of the retired store queue entry in a hit cache line having a shared coherency state or an exclusive coherency state; allocating 6300 an entry in the write buffer and writing the data for the hit cache line having either the shared coherency state or the exclusive coherency state; receiving 6400 a probe from a backing structure controller or a non-local entity to demote to an invalid coherency state for a hit cache line; checking 6500, by the probe, the coherency state of the hit cache line and the allocated entry in the write buffer; demoting 6600 the hit cache line to an invalid coherency state and discarding the data; and demoting 6700 the allocated entry in the write buffer to an invalid coherency state and performing a write-through of the data to the backing structure or to the non-local entity.
Technique 6000 includes retiring 6100 the store queue entry in the store queue. The store queue retires data from the store queue entry after commit.
Technique 6000 includes writing 6200 data of a retired store queue entry in a hit cache line having either a shared coherency state or an exclusive coherency state. Even though the hit cache line is in a shared or exclusive coherency state, the retired data is written by the store queue to the data cache, and in particular to the cache line, when there is a hit with respect to a memory location or address. The hit cache line is now in a shared coherency state with the data. In the event of a data cache miss, the data is written to a secondary cache or a backing store in the hierarchical memory structure.
Technique 6000 includes allocating 6300 an entry in the write buffer and writing the data for the hit cache line having either the shared coherency state or the exclusive coherency state. In addition to writing the data to the hit cache line, the store queue also allocates an entry in the write buffer and writes the same data to it. In an implementation, the write buffer is the MSHR. The allocated entry in the write buffer is now in a modified coherency state with the data.
Technique 6000 includes receiving 6400 a probe from a backing structure controller or a non-local entity to demote to an invalid coherency state for a hit cache line. A probe is received from a backing structure controller or a non-local entity for a downgrade to an invalid coherency state for a hit cache line before upgrading the coherency state of the hit cache line, writing data to the backing structure, or both.
Technique 6000 includes checking 6500, by the probe, the coherency state of the hit cache line and the allocated entry in the write buffer. The probe causes the data cache and the write buffer to demote the hit cache line and the allocated entry to an invalid coherency state.
Technique 6000 includes demoting 6600 the hit cache line to an invalid coherency state and discarding the data. After the demotion, the data in the hit cache line is discarded.
The technique 6000 includes demoting 6700 the allocated entry in the write buffer to an invalid coherency state and performing a write-through of the data to the backing structure or the non-local entity. After the demotion, the write buffer writes the data through to the backing structure or to the non-local entity.
FIG. 7 is a diagram of an example technique 7000 for implementing hybrid write-back and write-through in accordance with an embodiment of the present disclosure. The technique includes: retiring 7100 a store queue entry in the store queue; writing 7200 data of the retired store queue entry in a hit cache line having a shared coherency state or an exclusive coherency state; allocating 7300 an entry in a write buffer and writing the data for the hit cache line having the shared coherency state or the exclusive coherency state; receiving 7400 a probe with a demote to a shared coherency state for a hit cache line from a backing structure controller or a non-local entity; checking 7500, by the probe, the coherency state of the hit cache line and the allocated entry in the write buffer; maintaining 7600 the hit cache line in a shared coherency state; and demoting 7700 the allocated entry in the write buffer to a shared coherency state and performing a write-through of the data to the backing structure or to the non-local entity.
Technique 7000 includes retiring 7100 the store queue entries in the store queue. The store queue retires data from the store queue entry after commit.
Technique 7000 includes writing 7200 data of a retired store queue entry in a hit cache line having either a shared coherency state or an exclusive coherency state. Even though the hit cache line is in a shared or exclusive coherency state, the retired data is written by the store queue to the data cache, and in particular to the cache line, when there is a hit with respect to a memory location or address. The hit cache line is now in a shared coherency state with the data. In the event of a data cache miss, the data is written to a secondary cache or a backing store in the hierarchical memory structure.
Technique 7000 includes allocating 7300 an entry in the write buffer and writing the data for the hit cache line having either the shared coherency state or the exclusive coherency state. In addition to writing the data to the hit cache line, the store queue also allocates an entry in the write buffer and writes the same data to it. In an implementation, the write buffer is an MSHR. The allocated entry in the write buffer is now in a modified coherency state with the data.
Technique 7000 includes receiving 7400 a probe with a demote to a shared coherency state for a hit cache line from a backing structure controller or a non-local entity. A probe to demote the hit cache line to a shared coherency state is received from the backing structure controller or the non-local entity before upgrading the coherency state of the hit cache line, writing the data to the backing structure, or both.
Technique 7000 includes checking 7500, by the probe, the coherency state of the hit cache line and the allocated entry in the write buffer. The probe causes the write buffer to demote the allocated entry to a shared coherency state, while the hit cache line remains in the shared coherency state.
Technique 7000 includes maintaining 7600 the hit cache line in a shared coherency state.
Technique 7000 includes demoting 7700 the allocated entry in the write buffer to a shared coherency state and performing a write-through of the data to the backing structure or a non-local entity. After the demotion, the write buffer writes the data through to the backing structure or the non-local entity.
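For comparison, the contrasting probe outcomes of techniques 6000 and 7000 can be condensed into a short, illustrative sketch (assumed names only): both probes cause the write buffer to write its data through, and only the resulting coherency states of the hit cache line and the allocated entry differ.

```cpp
#include <cstdio>

enum class Coh { Modified, Shared, Invalid };
enum class Probe { DemoteToShared, DemoteToInvalid };

struct Outcome { Coh lineState; bool lineKeepsData; Coh entryState; };

// Technique 6000 (invalid probe) versus technique 7000 (shared probe):
// in both cases the write buffer performs a write-through of its data;
// only the resulting states of the line and the allocated entry differ.
Outcome resolveProbe(Probe p) {
    if (p == Probe::DemoteToInvalid)
        return {Coh::Invalid, false, Coh::Invalid};   // line discarded, entry invalidated
    return {Coh::Shared, true, Coh::Shared};          // line kept Shared, entry demoted to Shared
}

int main() {
    Outcome o = resolveProbe(Probe::DemoteToShared);
    std::printf("line keeps data: %d\n", o.lineKeepsData);
    return 0;
}
```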
In general, a processing system includes: a memory and an associated memory controller; and a processor coupled to the memory. The processor includes: a data cache comprising a plurality of cache lines; a write buffer configured to track the data cache; and a store queue configured to store one or more store operations and to write data to a hit cache line and to an allocated entry in the write buffer when the hit cache line is initially in at least a shared coherency state, resulting in the hit cache line being in a shared coherency state with data and the allocated entry being in a modified coherency state with data. The write buffer is configured to send a message to the memory controller to upgrade the hit cache line to a modified coherency state with the data. The memory controller is configured to upgrade the hit cache line to a modified coherency state with data. The write buffer is configured to retire the data after confirming the hit cache line upgrade. The data cache is configured to perform write back of data in the hit cache line to memory for a defined event. In an implementation, the processor is configured to receive a demote to invalid coherency probe from one of the memory controller or an entity that is not local to the processor prior to upgrading the coherency state of the hit cache line, the data cache is configured to demote the hit cache line to an invalid coherency state and delete the data, and the write buffer is configured to demote the allocated entry to the invalid coherency state and perform a write-through of the data to the memory. In an implementation, the processor is configured to receive a demote to invalid coherency probe from one of the memory controller or an entity not local to the processor prior to writing the data to the memory, the data cache is configured to demote the hit cache line to an invalid coherency state and delete the data, and the write buffer is configured to demote the allocated entry to the invalid coherency state and perform a write-through of the data to the memory. In an implementation, the processor is configured to receive a demote to shared coherency probe from one of the memory controller or an entity that is not local to the processor before upgrading the coherency state of the hit cache line, the hit cache line is configured to remain in a shared coherency state with data, and the write buffer is configured to demote the allocated entry to the shared coherency state with data and to perform a write-through of the data to the memory. In an implementation, the processor is configured to receive a demote to shared coherency probe from one of the memory controller or an entity that is not local to the processor prior to writing the data to the memory, the hit cache line is configured to remain in a shared coherency state with data, and the write buffer is configured to demote the allocated entry to the shared coherency state with data and perform a write-through of the data to the memory. In an implementation, the hit cache line is in an exclusive coherency state. In an implementation, the write buffer is a miss status handling register. In an implementation, an entity or process local to the processor can access the data stored in the hit cache line. In an implementation, an entity or process local to the processor can read the data stored in the hit cache line. In an implementation, a hit cache line in a modified coherency state with data is globally visible to an entity or process that is not local to the processor.
In general, a method for performing hybrid write-back and write-through includes: writing data from a retired store queue entry to a hit cache line in a data cache, wherein the hit cache line is initially in a shared coherency state; writing the data to an allocated entry in the write buffer while the hit cache line is in the shared coherency state, wherein the hit cache line is then in the shared coherency state with the data and the allocated entry is in the modified coherency state with the data; sending a message to a memory controller to upgrade the hit cache line to a modified coherency state with data; upgrading the hit cache line to the modified coherency state with data; retiring the data in the allocated entry after confirming the hit cache line upgrade; and performing write back of the data in the hit cache line to memory for a defined event. In an implementation, the method includes: receiving a demote to invalid coherency probe from one of a memory controller or an entity that is non-local to a processor associated with the data cache before one of upgrading the coherency state of the hit cache line or writing the data to the memory; demoting the hit cache line to an invalid coherency state; deleting the data in the hit cache line; demoting the allocated entry to an invalid coherency state; and performing, by the write buffer, a write-through of the data to memory. In an implementation, the method includes: receiving a demote to shared coherency probe from one of a memory controller or an entity non-local to a processor associated with the data cache before one of upgrading the coherency state of the hit cache line or writing the data to the memory; maintaining the hit cache line in a shared coherency state with the data; demoting the allocated entry to a shared coherency state with data; and performing, by the write buffer, a write-through of the data to memory. In an implementation, the write buffer is a miss status handling register. In an implementation, an entity or process local to the processor can access the data stored in the hit cache line. In an implementation, the method includes tracking, by the write buffer, a cache coherency state of the data cache.
In general, a method for performing hybrid write-back and write-through includes: writing data from the store queue to a hit cache line in the data cache, wherein the hit cache line is initially in a shared coherency state and the hit cache line is in the shared coherency state with the data after the data is written; writing the data to an allocated entry in the write buffer when the hit cache line is in the shared coherency state, wherein the allocated entry is in a modified coherency state with the data after the data is written; tracking, by the write buffer, a cache coherency state of the data cache; prior to receiving a probe from a non-local entity: upgrading, at the request of the write buffer, the hit cache line to a modified coherency state with data based on the tracked cache coherency state; retiring the data in the allocated entry after the hit cache line upgrade is confirmed; and performing write back of the data in the hit cache line to the memory for a defined event; and in the event of receiving a demotion to invalid coherency probe from a non-local entity: demoting the hit cache line to an invalid coherency state; deleting the data in the hit cache line; demoting the allocated entry to an invalid coherency state; and performing, by the write buffer, a write-through of the data to memory; and in the event of receiving a demotion to shared coherency probe from a non-local entity: maintaining the hit cache line in a shared coherency state with the data; demoting the allocated entry to a shared coherency state with data; and performing, by the write buffer, a write-through of the data to memory. In an implementation, the write buffer is a miss status handling register. In an implementation, the method includes enabling an entity or process local to the processor to access the data stored in the hit cache line. In an implementation, a hit cache line in a modified coherency state with data is globally visible to an entity or process that is not local to the processor.
Although some embodiments herein relate to methods, those skilled in the art will appreciate that they may also be embodied as a system or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "processor," "device," or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied therein. Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to CDs, DVDs, wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Aspects are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.
These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures.
While the disclosure has been described in connection with certain embodiments, it is to be understood that the disclosure is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications, combinations, and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures as is permitted under the law.
Claims (20)
1. A processing system, comprising:
a memory and an associated memory controller; and
a processor coupled to the memory, the processor comprising:
a data cache comprising a plurality of cache lines;
a write buffer configured to track the data cache;
a store queue configured to:
store one or more store operations; and
write data to a hit cache line and to an allocated entry in the write buffer when the hit cache line is initially in at least a shared coherency state, resulting in the hit cache line being in a shared coherency state with data and the allocated entry being in a modified coherency state with data;
the write buffer is configured to send a message to the memory controller to upgrade the hit cache line to a modified coherency state with data;
the memory controller is configured to upgrade the hit cache line to the modified coherency state with data;
the write buffer is configured to retire the data upon confirmation of the hit cache line upgrade; and
the data cache is configured to perform write back of data in the hit cache line to memory for a defined event.
2. The processing system of claim 1, further comprising:
the processor is configured to receive a demote to invalid coherency probe from one of the memory controller or an entity that is not local to the processor prior to promoting the coherency state of the hit cache line;
the data cache is configured to demote the hit cache line to an invalid coherency state and delete the data; and
the write buffer is configured to demote the allocated entry to an invalid coherency state and perform a write-through of the data to the memory.
3. The processing system of claim 1, further comprising:
the processor is configured to receive a demote to invalid coherency probe from one of the memory controller or an entity that is not local to the processor prior to writing the data to the memory;
the data cache is configured to demote the hit cache line to an invalid coherency state and delete the data; and
the write buffer is configured to demote the allocated entry to an invalid coherency state and perform a write-through of the data to the memory.
4. The processing system of claim 2, further comprising:
the processor is configured to receive a demote to shared coherency probe from one of the memory controller or an entity that is not local to the processor prior to promoting the coherency state of the hit cache line;
the hit cache line is configured to remain in the shared coherency state with data; and
the write buffer is configured to demote the allocated entry to the shared coherency state with data and perform a write-through of the data to the memory.
5. The processing system of claim 2, further comprising:
the processor is configured to receive a demote to shared coherency probe from one of the memory controller or an entity that is not local to the processor prior to writing the data to the memory;
the hit cache line is configured to remain in the shared coherency state with data; and
the write buffer is configured to demote the allocated entry to the shared coherency state with data and perform a write-through of the data to the memory.
6. The processing system of claim 1, wherein the hit cache line is in an exclusive coherency state.
7. The processing system of claim 1, wherein the write buffer is a miss status handling register.
8. The processing system of claim 1, wherein an entity or process local to the processor can access data stored in the hit cache line.
9. The processing system of claim 1, wherein an entity or process local to the processor is capable of reading data stored in the hit cache line.
10. The processing system of claim 1, wherein the hit cache line in the modified coherency state with data is globally visible to an entity or process that is not local to the processor.
11. A method for performing mixed write-back and write-through, the method comprising:
writing data from a retired store queue entry to a hit cache line in a data cache, wherein the hit cache line is initially in a shared coherency state;
writing the data to an allocated entry in a write buffer while the hit cache line is in a shared coherency state, wherein the hit cache line is then in a shared coherency state with data and the allocated entry is in a modified coherency state with data;
sending a message to a memory controller to upgrade the hit cache line to a modified coherency state with data;
upgrading the hit cache line to the modified coherency state with data;
retiring data in the allocated entry after confirming the hit cache line upgrade; and
performing write back of data in the hit cache line to memory for a defined event.
12. The method of claim 11, further comprising:
receiving a demote to invalid coherency probe from one of the memory controller or an entity that is non-local to the processor associated with the data cache prior to one of upgrading a coherency state of the hit cache line or writing the data to the memory;
demoting the hit cache line to an invalid coherency state;
deleting data in the hit cache line;
demoting the allocated entry to an invalid coherency state; and
performing, by the write buffer, a write-through of the data to the memory.
13. The method of claim 12, further comprising:
receiving a demote to shared coherency probe from one of the memory controller or an entity that is not local to a processor associated with the data cache prior to one of upgrading the coherency state of the hit cache line or writing the data to the memory;
maintaining the hit cache line in the shared coherency state with data;
demoting the allocated entry to the shared coherency state with data; and
performing, by the write buffer, a write-through of the data to the memory.
14. The method of claim 11, wherein the write buffer is a miss status handling register.
15. The method of claim 11, wherein an entity or process local to the processor can access the data stored in the hit cache line.
16. The method of claim 11, further comprising:
the cache coherency state of the data cache is tracked by the write buffer.
17. A method for performing hybrid write-back and write-through, the method comprising:
writing data from a store queue to a hit cache line in a data cache, wherein the hit cache line is initially in a shared coherency state and, after writing the data, the hit cache line is in a shared coherency state with the data;
writing the data to an allocated entry in a write buffer while the hit cache line is in a shared coherency state, wherein the allocated entry is in a modified coherency state with data after the data is written;
tracking, by the write buffer, a cache coherency state of the data cache;
prior to receiving a probe from a non-local entity:
upgrading, at the request of the write buffer, the hit cache line to the modified coherency state with data based on the tracked cache coherency state;
retiring data in the allocated entry after confirming the hit cache line upgrade; and
performing write back of data in the hit cache line to memory for a defined event; and
in the event of receiving a demotion to invalid coherency probe from the non-local entity:
demoting the hit cache line to an invalid coherency state;
deleting data in the hit cache line;
demoting the allocated entry to an invalid coherency state; and
performing, by the write buffer, a write-through of the data to the memory; and
in the event of receiving a demotion to shared coherency probe from the non-local entity:
maintaining the hit cache line in the shared coherency state with data;
demoting the allocated entry to the shared coherency state with data; and
performing, by the write buffer, a write-through of the data to the memory.
18. The method of claim 17, wherein the write buffer is a miss status handling register.
19. The method of claim 17, wherein an entity or process local to the processor can access the data stored in the hit cache line.
20. The method of claim 17, wherein the hit cache line in the modified coherency state with data is globally visible to an entity or process that is not local to the processor.
Applications Claiming Priority (1)
| Application Number | Priority Date |
|---|---|
| US16/797,478 | 2020-02-21 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40081717A (en) | 2023-05-25 |
| HK40081717B (en) | 2024-10-25 |