US20140236561A1 - Efficient validation of coherency between processor cores and accelerators in computer systems - Google Patents
- Publication number: US20140236561A1 (U.S. application Ser. No. 14/038,125)
- Authority: US (United States)
- Legal status: Abandoned (an assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
-
- G06F17/5009—
-
- G06F12/084—Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
Definitions
- the present invention generally relates to computer systems, and more particularly to a method of verifying the design of a computer system having a resource such as a cache memory which is shared among multiple devices such as processors and accelerators.
- a design-under-test is driven by vectors of inputs, and states encountered while walking through the sequence are checked for properties of correctness. This process is often performed by software simulation tools using different programming languages created for electronic design automation, including Verilog, VHDL and TDML.
- the verification process should include simulation of any shared resources in the design.
- One typical shared resource in a computer system is a cache memory.
- in a symmetric multi-processor (SMP) computer, all of the processing units are generally identical; that is, they all use a common set or subset of instructions and protocols to operate and generally have the same architecture.
- a processing unit includes a processor core having registers and execution units which carry out program instructions in order to operate the computer, and the processing unit can also have one or more caches such as an instruction cache and a data cache implemented using high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory.
- a cache is referred to as “on-board” when it is integrally packaged with the processor core on a single integrated chip.
- Each cache is associated with a cache controller that manages the transfer of data between the processor core and the cache memory.
- a processing unit can include additional caches such as a level 2 (L2) cache which supports the on-board (level 1) data and instruction caches.
- L2 caches can store a much larger amount of information (program instructions and operand data) than the on-board caches can, but at a longer access penalty.
- Multi-level cache hierarchies can be provided where there are many levels (L3, L4, etc.) of serially connected caches.
- the higher-level caches are typically shared by more than one processor core.
- to implement cache coherency in a computer system, processors typically communicate over a common generalized interconnect or bus, passing messages indicating their need to read or write memory locations. When an operation is placed on the interconnect, all of the other processors can monitor (snoop) this operation and decide whether the state of their caches allows the requested operation to proceed and, if so, under what conditions.
- One common cache coherency protocol is the MESI protocol in which a cache block can be in one of four states, M (Modified), E (Exclusive), S (Shared) or I (Invalid).
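The four MESI states and the snoop behavior described above can be sketched as a toy transition function. This is a simplified illustration only (no transient states, no distinction between read-with-intent-to-modify and plain writes), not the protocol of any particular design:

```python
from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def snoop_transition(state: MESI, bus_op: str) -> MESI:
    """Next state of a snooping cache's block when another device's
    operation is observed on the interconnect (simplified MESI)."""
    if bus_op == "read":                      # another cache reads the block
        if state in (MESI.MODIFIED, MESI.EXCLUSIVE):
            return MESI.SHARED                # supply the data, drop exclusivity
        return state                          # SHARED stays SHARED, INVALID stays INVALID
    if bus_op == "write":                     # another device gains write permission
        return MESI.INVALID                   # every other copy must be invalidated
    raise ValueError(f"unknown bus operation: {bus_op}")
```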
- cache intervention which allows a cache having control over a requested memory block to provide the data for that block directly to another cache requesting the value (for a read-type operation), bypassing the need to write the data to system memory.
- Cache coherency may extend beyond processor cores.
- hardware accelerators have their own specialized processing units according to their particular functions.
- the control software for an accelerator typically creates an accelerator-specific control block in system memory.
- the accelerator reads all the needed inputs from the accelerator-specific control block and performs the requested operations.
- Some of the fields in the accelerator control block might for example include the requested operation, source address, destination address and operation specific inputs such as key value and key size for encryption/decryption.
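The control block fields mentioned above might be modeled as follows. This is a hypothetical layout for illustration only; the field names and types are assumptions, not taken from any actual accelerator interface:

```python
from dataclasses import dataclass

@dataclass
class AcceleratorControlBlock:
    """Illustrative accelerator-specific control block in system memory.
    Field names are hypothetical, not from any real device."""
    operation: str          # requested operation, e.g. "encrypt"
    source_addr: int        # address the accelerator reads its input from
    dest_addr: int          # address the accelerator writes its result to
    key_value: bytes = b""  # operation-specific input: encryption key
    key_size: int = 0       # operation-specific input: key length in bits
```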
- the usage of system memory by an accelerator leads to the necessity of ensuring coherency with the accelerator in addition to the processors.
- One low-cost approach to such an accelerator coherency system is described in U.S. Pat. No. 7,814,279.
- the present invention is generally directed to a method of testing coherency in a system design having a shared resource, at least one processor which accesses the shared resource, and at least one accelerator which accesses the shared resource, by selecting an entry of the shared resource for targeted testing during a simulation of operation of the system design, allocating a first portion of the selected entry for use by a first instruction from the processor, allocating a second portion of the selected entry for use by a second instruction from the accelerator, executing the first and second instructions using the allocated first and second portions of the selected entry subject to a coherency protocol adapted to maintain data consistency of the shared resource, and verifying correctness of data stored in the entry.
- the shared resource may be a cache memory, and the entry a cache line of the cache memory.
- the first and second portions of the cache line can have different sizes.
- the processor and the accelerator can operate in the simulation at different frequencies.
- the first portion of the selected line can be randomly allocated for use by the first instruction, and the second portion of the selected line can be randomly allocated for use by the second instruction.
- Multiple processors and multiple accelerators can access the cache memory, in which case a single cache line can further be allocated for use by other processors and accelerators.
- the verification system can control execution of the cached instructions to invoke different coherency modes of the coherency mechanism.
- the invention provides a further opportunity to test any accelerator having an original function and an inverse function by allocating a first set of cache lines for the accelerator to generate an original function output based on an original function input, allocating a second set of cache lines for the accelerator to generate an inverse function output based on the original function output, and verifying correctness of the original and inverse functions by comparing the inverse function output to the original function input.
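The original/inverse round trip can be sketched in a few lines. The XOR "accelerator" below is a stand-in chosen because XOR is its own inverse; a real test would issue the two jobs through control block structures over the allocated cache-line sets:

```python
def validate_inverse(accel_fn, inverse_fn, original_input: bytes) -> bool:
    """Run the original function, feed its output to the inverse function,
    and check that the round trip reproduces the original input."""
    original_output = accel_fn(original_input)    # first set of cache lines
    inverse_output = inverse_fn(original_output)  # second set of cache lines
    return inverse_output == original_input       # third set: compare outputs

# Toy stand-in for an accelerator with an inverse: XOR with a fixed key
key = 0x5A
xor_fn = lambda data: bytes(b ^ key for b in data)
```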
- FIG. 1 is a block diagram of a computer system programmed to carry out verification of computer systems in accordance with one implementation of the present invention
- FIG. 2 is a block diagram of a verification environment for a computer system design having multiple processors and accelerators which share access to a cache memory in accordance with one implementation of the present invention
- FIG. 3 is a pictorial representation of a simulated cache memory allocated in accordance with one implementation of the present invention wherein a portion of a cache line is used by an accelerator while other portions of the same cache line are used by different processors;
- FIG. 4 is a chart depicting how the cache memory blocks used by an accelerator can be assigned for a function of the accelerator and an inverse of the function to yield an output which is the same as the original function input in accordance with one implementation of the present invention.
- FIG. 5 is a chart illustrating the logical flow for a coherency testing process in accordance with one implementation of the present invention.
- the validation environment includes a cache whose lines are divided into blocks or sectors which can be accessed independently by a processor and an accelerator.
- a single cache line can be accessed by multiple accelerators or multiple cores, including different processor operations.
- This approach enables comprehensive testing for different modes of operation of cache lines, such as intervention by other caches or accesses from memory via the coherency mechanism.
- accelerators can pull those lines from outside the bus interconnect between processors, thereby generating coherency stress between different mechanisms.
- the invention accordingly ensures coherency between processors and accelerators by detecting any holes in the hardware design.
- Computer system 10 is a symmetric multiprocessor (SMP) system having a plurality of processors 12 a , 12 b connected to a system bus 14 .
- System bus 14 is further connected to a combined memory controller/host bridge (MC/HB) 16 which provides an interface to system memory 18 .
- System memory 18 may be a local memory device or alternatively may include a plurality of distributed memory devices, preferably dynamic random-access memory (DRAM).
- MC/HB 16 also has an interface to peripheral component interconnect (PCI) Express links 20 a , 20 b , 20 c .
- Each PCI Express (PCIe) link 20 a , 20 b is connected to a respective PCIe adaptor 22 a , 22 b
- each PCIe adaptor 22 a , 22 b is connected to a respective input/output (I/O) device 24 a , 24 b
- MC/HB 16 may additionally have an interface to an I/O bus 26 which is connected to a switch (I/O fabric) 28 .
- Switch 28 provides a fan-out for the I/O bus to a plurality of PCI links 20 d , 20 e , 20 f .
- PCI links are connected to more PCIe adaptors 22 c , 22 d , 22 e which in turn support more I/O devices 24 c, 24 d, 24 e.
- the I/O devices may include, without limitation, a keyboard, a graphical pointing device (mouse), a microphone, a display device, speakers, a permanent storage device (hard disk drive) or an array of such storage devices, an optical disk drive, and a network card.
- Each PCIe adaptor provides an interface between the PCI link and the respective I/O device.
- MC/HB 16 provides a low latency path through which processors 12 a , 12 b may access PCI devices mapped anywhere within bus memory or I/O address spaces.
- MC/HB 16 further provides a high bandwidth path to allow the PCI devices to access memory 18 .
- Switch 28 may provide peer-to-peer communications between different endpoints and this data traffic does not need to be forwarded to MC/HB 16 if it does not involve cache-coherent memory transfers.
- Switch 28 is shown as a separate logical component but it could be integrated into MC/HB 16 .
- PCI link 20 c connects MC/HB 16 to a service processor interface 30 to allow communications between I/O device 24 a and a service processor 32 .
- Service processor 32 is connected to processors 12 a , 12 b via a JTAG interface 34 , and uses an attention line 36 which interrupts the operation of processors 12 a , 12 b.
- Service processor 32 may have its own local memory 38 , and is connected to read-only memory (ROM) 40 which stores various program instructions for system startup. Service processor 32 may also have access to a hardware operator panel 42 to provide system status and diagnostic information.
- computer system 10 may include modifications of these hardware components or their interconnections, or additional components, so the depicted example should not be construed as implying any architectural limitations with respect to the present invention.
- the invention may further be implemented in an equivalent cloud computing network.
- service processor 32 uses JTAG interface 34 to interrogate the system (host) processors 12 a , 12 b and MC/HB 16 . After completing the interrogation, service processor 32 acquires an inventory and topology for computer system 10 . Service processor 32 then executes various tests such as built-in-self-tests (BISTs), basic assurance tests (BATs), and memory tests on the components of computer system 10 . Any error information for failures detected during the testing is reported by service processor 32 to operator panel 42 . If a valid configuration of system resources is still possible after taking out any components found to be faulty during the testing then computer system 10 is allowed to proceed.
- Executable code is loaded into memory 18 and service processor 32 releases host processors 12 a , 12 b for execution of the program code, e.g., an operating system (OS) which is used to launch applications and in particular the circuit validation application of the present invention, results of which may be stored in a hard disk drive of the system (an I/O device 24 ).
- service processor 32 may enter a mode of monitoring and reporting any operating parameters or errors, such as the cooling fan speed and operation, thermal sensors, power supply regulators, and recoverable and non-recoverable errors reported by any of processors 12 a , 12 b , memory 18 , and MC/HB 16 .
- Service processor 32 may take further action based on the type of errors or defined thresholds.
- the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
- the computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device.
- the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
- a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- the computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave.
- the computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
- Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, written for a variety of platforms such as an AIX environment or operating systems such as Windows 7 or Linux.
- the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
- Such storage media excludes transitory media such as propagating signals.
- the computer program instructions may further be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
- Computer system 10 carries out program instructions for a coherency validation process that uses novel cache control techniques to stress the coherency mechanisms between processors and accelerators. Accordingly, a program embodying the invention may include conventional aspects of various validation and design tools, and these details will become apparent to those skilled in the art upon reference to this disclosure.
- Accelerators generally use specific protocols and interfaces to submit jobs and get results. Issuing a job to these accelerators is typically done through control block structures with specific alignment requirements for accessing these control blocks. So memory accesses from accelerators outside the processor fall into two kinds, accesses to the control block structures, and accesses to the source memory on which accelerators operate. Many of the accelerators do not directly fetch instructions themselves but instead rely on the main processor(s) to assign work to them.
- the software executing in main memory (e.g., an application program or the operating system) creates the accelerator-specific control block.
- the accelerator can read all the needed inputs from the accelerator-specific control block and perform the requested operation. Some of the fields in the accelerator control block could be the requested operation, source address, target address and operation specific inputs such as key value and key size for encryption/decryption.
- the present invention proposes a method through which the designer can allocate memory for any processor and accelerator mechanisms.
- FIG. 2 there is depicted a verification environment 50 for a computer system design which includes multiple processors 52 a , 52 b , 52 c (processor 0 , processor 1 , processor 2 ) and multiple accelerators 54 a , 54 b , 54 c (accelerator 0 , accelerator 1 , accelerator 2 ).
- the processors and accelerators access a memory hierarchy of the computer system design, for example through read or write operations, to load values to or store values from memory blocks.
- the memory hierarchy includes one or more cache memories 58 , and a coherency mechanism 60 .
- the processors and accelerators may operate at different frequencies, i.e., the processors operating at a first frequency f 1 and the accelerators operating at a second frequency f 2 (or multiple frequencies for different accelerators).
- the verification environment may include a variety of other computer system components, not shown, according to the particular design being tested.
- a verification control program 62 can be used to manage the entire simulation effort.
- the verification system further includes a correctness checker 64 which can check values (data) computed by environment 50 or values stored in any of the components as the simulation proceeds against predetermined values known to be correct.
- verification environment 50 is a software program structure representing a hardware design
- the various components are generally software modules, however, one or more of the components may be implemented in hardware for the verification procedure and subject to verification control 62 .
- the software modules are loaded onto an actual computer system such as computer system 10 to carry out the verification process, and are controlled by commands from verification control 62 which can also reside on computer system 10 , along with correctness checker 64 .
- Verification control 62 can command the execution of instructions in such a manner as to test different modes of operation of cache lines. For example, one portion of the cache line can be accessed by another cache memory to satisfy an instruction in a procedure known as cache intervention, as opposed to a mode where the access is directly to or from main memory.
- FIG. 3 illustrates one implementation for cache memory 58 .
- Cache memory 58 has a plurality of entries or cache lines which may be arranged in pages and groups or ways according to congruence class mappings. A given cache line is further decomposed into a plurality of blocks or sectors. The sectors may be different sizes such as word, double word, quad word, etc.
- Verification control 62 can allocate different sectors of a single cache line for use by both accelerators and processors. In FIG. 3 , the first sector of the cache lines has been allocated for use by the accelerators, while the remaining sectors have been allocated to different processors or central processing units (CPUs).
- the first sector of a given cache line could be allocated to accelerator 54 a
- the second sector of the cache line could be allocated to processor 52 a
- the third sector of the cache line could be allocated to processor 52 b
- the fourth sector of the cache line could be allocated to processor 52 c .
- different sectors of a single cache line could be allocated for use by different accelerators. Multiple cache lines can be so shared according to the same divisions between blocks. Sectors used by the processors can be used for different operations.
- a given cache line may be divided into blocks of different sizes.
- control structures accessed to submit jobs to accelerators often need to be quad word aligned, so the block allocated for an accelerator may be quad word size, but other blocks in the same line allocated to processors may be double word or single word size.
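The randomized sector allocation described above and pictured in FIG. 3 might be sketched as follows. The requester names and the shuffle-then-round-robin policy are illustrative assumptions, not the patent's specific algorithm:

```python
import random

def allocate_line(num_sectors: int, requesters: list) -> dict:
    """Randomly assign each sector of one cache line to a requester
    (an accelerator or a processor), so that devices, possibly running
    at different frequencies, contend for the same line."""
    sectors = list(range(num_sectors))
    random.shuffle(sectors)                       # randomize which sector goes where
    allocation = {}
    for i, sector in enumerate(sectors):
        allocation[sector] = requesters[i % len(requesters)]
    return allocation

# e.g. one accelerator and three processors sharing a four-sector line:
alloc = allocate_line(4, ["accel0", "cpu0", "cpu1", "cpu2"])
```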
- This approach enables a cache line to be accessed randomly from different cores to create scenarios of intervention and different allowed coherency states, increasing the stress on coherency mechanism 60 .
- correctness checker 64 will sample values throughout the cache lines and compare them to the known correct values to determine whether the coherency mechanism is performing properly.
- operations the processors could be performing include: processor accesses to the cache line for stores and/or loads while the accelerator is also trying to access the cache line, which can change the state of the cache; scenarios where an accelerator is trying to access the cache line and a processor serves the request (by the process called intervention); and a processor trying to fetch a cache line through a prefetch operation while accelerators are accessing it.
- a first contiguous set of cache lines in a page can be allocated for a control structure of the execution block for the accelerator, to invoke the first functionality.
- a second contiguous set of cache lines in the same page, immediately following the first set (consecutive), is allocated for a control structure of the execution block for the same accelerator, but the second set invokes the second, opposite functionality.
- the second set of cache lines will use the output of the first function as the input to the second function.
- a third set of contiguous cache lines in the same page is used for the output of the second function, which should be exactly the same as the original input to the first function.
- the number of cache lines in the sets depends on the particular control block structure for the accelerator.
- the first set of cache lines is used by the accelerator to generate the original function output, which then becomes the inverse input, i.e., the input to the inverse function.
- when the inverse function of the accelerator acts on the inverse input, it generates an inverse output, which should be the same as the original function input.
- the inverse output can then be compared to the original function input by correctness checker 64 to validate the functionalities of the accelerator. In this manner, the operation of any accelerator which has opposite functionalities can be validated while the coherency mechanism is contemporaneously validated.
- the verification control could designate the encryption function as the original function and the decryption function as the inverse function, or could alternatively designate the decryption function as the original function and the encryption function as the inverse function.
- Process 70 begins by selecting a congruence class to target ( 72 ), and selecting an instruction to apply to the cache memory ( 74 ).
- the congruence class and instruction may be selected according to any algorithm devised by the designer, or may be selected randomly.
- a check is then made to determine whether the instruction is accelerator-specific ( 76 ) or is for memory access ( 78 ). If the instruction is neither, then verification control 62 builds the instruction according to conventional testing methodology ( 80 ).
- a memory line is selected for the instruction within the targeted congruence class ( 82 ), and a sector index for that line is set ( 84 ).
- the line and index may also be selected randomly or according to other designer conditions. If the sector associated with the set sector index is free ( 86 ), that sector is used to build the instruction ( 88 ). If the sector indicated by the index is not free, a random number generator is used to pick an adjacent sector for examination ( 90 ), searching either to the left ( 92 ) or to the right ( 94 ) until a free sector is found. If a sector is free ( 96 ), that sector is used ( 88 ); otherwise, any sector is picked according to designer preferences or randomly ( 98 ).
- Termination criteria may be based for example on elapsed time or a predetermined number of instructions being built.
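The free-sector search of process 70 (steps 84 through 98) can be sketched as follows. This is a simplified rendering of the flow under the assumption that the chosen direction is searched to the edge of the line; the real process would then build the instruction with the chosen sector:

```python
import random

def pick_sector(free: list, start_index: int) -> int:
    """Pick a sector of the selected line for a new instruction.
    `free` is a list of booleans, one per sector. Use the indexed sector
    if free; otherwise walk left or right (direction chosen at random)
    looking for a free one; if none is found, fall back to any sector."""
    if free[start_index]:
        return start_index                    # step 86/88: indexed sector is free
    step = random.choice((-1, 1))             # step 90: search left (92) or right (94)
    i = start_index + step
    while 0 <= i < len(free):
        if free[i]:
            return i                          # step 96/88: free sector found
        i += step
    return random.randrange(len(free))        # step 98: no free sector, pick any
```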
- the invention achieves several advantages by allowing processors and accelerators to share the same set of cache lines. Since the processors can be generating random values during simulation, accelerators sharing the lines can be tested with different memory values. Different alignments of control blocks can also be validated. As processors and accelerators fight for the same set of cache lines, not only is the coherency mechanism tested but snoop and bus arbitration are also stressed. In particular, accelerators working at different frequencies than processors will result in different timing of snoop broadcasts on the interconnect bus. Execution of accelerator workloads can be sped up as memory lines used by the accelerators can be brought in earlier by the processor cores. In real world applications, the novel allocation of cache lines can prepare inputs for the accelerators ahead of the needed time, achieving maximum throughput. Finally, for accelerators having inverse functions, correctness validation of the accelerator functionality happens with much faster coverage of all corner cases, and no software simulation is needed to check the correctness of these functionalities.
Description
- This application is a continuation of copending U.S. patent application Ser. No. 13/770,711 filed Feb. 19, 2013.
- 1. Field of the Invention
- The present invention generally relates to computer systems, and more particularly to a method of verifying the design of a computer system having a resource such as a cache memory which is shared among multiple devices such as processors and accelerators.
- 2. Description of the Related Art
- When a new computer system (or subsystem) is designed, it is important to ensure that the design is going to work properly before proceeding with fabrication preparation for the integrated circuit devices making up the system, and their assembly into the finished product. A variety of tests can be performed to evaluate the design, but simulation remains the dominant strategy for functionally verifying high-end computer systems. A design-under-test is driven by vectors of inputs, and states encountered while walking through the sequence are checked for properties of correctness. This process is often performed by software simulation tools using different programming languages created for electronic design automation, including Verilog, VHDL and TDML.
- The verification process should include simulation of any shared resources in the design. One typical shared resource in a computer system is a cache memory. In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate and generally have the same architecture. A processing unit includes a processor core having registers and execution units which carry out program instructions in order to operate the computer, and the processing unit can also have one or more caches such as an instruction cache and a data cache implemented using high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory. These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip. Each cache is associated with a cache controller that manages the transfer of data between the processor core and the cache memory. A processing unit can include additional caches such as a level 2 (L2) cache which supports the on-board (level 1) data and instruction caches. Higher-level caches can store a much larger amount of information (program instructions and operand data) than the on-board caches can, but at a longer access penalty. Multi-level cache hierarchies can be provided where there are many levels (L3, L4, etc.) of serially connected caches. The higher-level caches are typically shared by more than one processor core.
- In systems which share resources such as cache memories, it is important to ensure that the resource is accessed in a consistent manner among all of the devices that use the resource, e.g., that one processor is not given a value for a cache entry that is inconsistent with a value for the same entry that was given to another requesting processor. A system that implements this consistency is said to be coherent. Different coherency protocols have been devised to control the movement of and write permissions for data, generally on a cache block basis. At the heart of all these mechanisms for maintaining coherency is the requirement that the protocols allow only one requesting device to have a “permission” that allows a write operation to a given memory location (cache block) at any given point in time. To implement cache coherency in a computer system, processors typically communicate over a common generalized interconnect or bus, passing messages indicating their need to read or write memory locations. When an operation is placed on the interconnect, all of the other processors can monitor (snoop) this operation and decide if the state of their caches can allow the requested operation to proceed and, if so, under what conditions. One common cache coherency protocol is the MESI protocol in which a cache block can be in one of four states, M (Modified), E (Exclusive), S (Shared) or I (Invalid). There are many, more complicated protocols which expand upon the MESI protocol. Other features may be implemented with the coherency protocol, such as cache intervention which allows a cache having control over a requested memory block to provide the data for that block directly to another cache requesting the value (for a read-type operation), bypassing the need to write the data to system memory.
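The MESI transitions described above can be sketched in a few lines. This is a minimal illustration of the textbook protocol from a snooping cache's point of view, not the patent's implementation; the function name and string-based states are assumptions for clarity.

```python
# A minimal sketch (not the patent's implementation) of how a snooping
# cache might update MESI state when it observes a bus operation.
# State names and transition rules follow the textbook MESI protocol.

def snoop_transition(state: str, bus_op: str) -> str:
    """Return the new MESI state after snooping a bus operation.

    bus_op is 'read' (another cache wants to read the block) or
    'write' (another cache wants write permission).
    """
    if bus_op == "read":
        # A Modified or Exclusive copy is demoted to Shared; the Modified
        # holder would also supply the data (cache intervention).
        return {"M": "S", "E": "S", "S": "S", "I": "I"}[state]
    if bus_op == "write":
        # Any valid copy must be invalidated so only one writer exists.
        return "I"
    raise ValueError(f"unknown bus operation: {bus_op}")
```

Real protocols that expand on MESI add further states and transitions, but the single-writer invariant shown here is the property the validation environment stresses.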
- Cache coherency may extend beyond processor cores. As computational demands have increased, computer systems have adapted by relying more on other hardware components for specific tasks such as graphics (video) management, data compression, or cryptography. These components are generally referred to as hardware accelerators, and have their own specialized processing units according to their particular functions. The control software for an accelerator typically creates an accelerator-specific control block in system memory. The accelerator reads all the needed inputs from the accelerator-specific control block and performs the requested operations. Some of the fields in the accelerator control block might for example include the requested operation, source address, destination address and operation specific inputs such as key value and key size for encryption/decryption. The usage of system memory by an accelerator leads to the necessity of ensuring coherency with the accelerator in addition to the processors. One low-cost approach to such an accelerator coherency system is described in U.S. Pat. No. 7,814,279.
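The control-block fields listed above (requested operation, source address, destination address, and operation-specific inputs such as key value and key size) might be modeled as follows. The field names, types, and values here are illustrative assumptions, not taken from any real accelerator interface.

```python
# Hypothetical model of an accelerator-specific control block; all field
# names and sizes are assumptions chosen to mirror the fields the text
# lists, not a real hardware layout.

from dataclasses import dataclass

@dataclass
class AcceleratorControlBlock:
    operation: str          # e.g. "encrypt", "decrypt", "compress"
    source_addr: int        # address of the input data in system memory
    dest_addr: int          # address where the result is written
    key_value: bytes = b""  # operation-specific input (encryption key)
    key_size: int = 0       # key size in bits

# Control software would build a block like this in system memory and
# then initiate the accelerator, which reads its inputs from the block.
cb = AcceleratorControlBlock("encrypt", 0x1000, 0x2000, b"\x01" * 16, 128)
```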
- The present invention is generally directed to a method of testing coherency in a system design having a shared resource, at least one processor which accesses the shared resource, and at least one accelerator which accesses the shared resource, by selecting an entry of the shared resource for targeted testing during a simulation of operation of the system design, allocating a first portion of the selected entry for use by a first instruction from the processor, allocating a second portion of the selected entry for use by a second instruction from the accelerator, executing the first and second instructions using the allocated first and second portions of the selected entry subject to a coherency protocol adapted to maintain data consistency of the shared resource, and verifying correctness of data stored in the entry. The shared resource may be a cache memory, and the entry a cache line of the cache memory. The first and second portions of the cache line can have different sizes. The processor and the accelerator can operate in the simulation at different frequencies. The first portion of the selected line can be randomly allocated for use by the first instruction, and the second portion of the selected line can be randomly allocated for use by the second instruction. Multiple processors and multiple accelerators can access the cache memory, in which case a single cache line can further be allocated for use by other processors and accelerators. The verification system can control execution of the cached instructions to invoke different coherency modes of the coherency mechanism. 
The invention provides a further opportunity to test any accelerator having an original function and an inverse function by allocating a first set of cache lines for the accelerator to generate an original function output based on an original function input, allocating a second set of cache lines for the accelerator to generate an inverse function output based on the original function output, and verifying correctness of the original and inverse functions by comparing the inverse function output to the original function input.
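The test flow just summarized can be sketched as a toy model: one cache line is represented as a byte array, disjoint portions of it are allocated to a processor instruction and an accelerator instruction, both execute, and the checker verifies each portion holds the expected data. The sizes, offsets, and patterns are illustrative assumptions.

```python
# A toy model of the test flow: allocate disjoint portions of one shared
# cache line to a processor instruction and an accelerator instruction,
# execute both, then verify correctness. Sizes are assumptions.

LINE_SIZE = 128
cache_line = bytearray(LINE_SIZE)

# First portion for the processor, second (differently sized) portion for
# the accelerator; the portions must not overlap.
proc_portion = (0, 8)     # (offset, size): double word for the processor
accel_portion = (16, 16)  # quad word for the accelerator

def store(portion: tuple, pattern: int) -> None:
    """Model an instruction writing its pattern into its portion."""
    offset, size = portion
    cache_line[offset:offset + size] = bytes([pattern]) * size

store(proc_portion, 0xAA)   # processor store
store(accel_portion, 0x55)  # accelerator store

# Correctness check: each portion holds exactly the data written to it.
assert cache_line[0:8] == b"\xaa" * 8
assert cache_line[16:32] == b"\x55" * 16
```

In the actual environment the two stores would be issued subject to the simulated coherency protocol, and the checker would compare against reference values rather than the patterns themselves.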
- The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
- The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
-
FIG. 1 is a block diagram of a computer system programmed to carry out verification of computer systems in accordance with one implementation of the present invention; -
FIG. 2 is a block diagram of a verification environment for a computer system design having multiple processors and accelerators which share access to a cache memory in accordance with one implementation of the present invention; -
FIG. 3 is a pictorial representation of a simulated cache memory allocated in accordance with one implementation of the present invention wherein a portion of a cache line is used by an accelerator while other portions of the same cache line are used by different processors; -
FIG. 4 is a chart depicting how the cache memory blocks used by an accelerator can be assigned for a function of the accelerator and an inverse of the function to yield an output which is the same as the original function input in accordance with one implementation of the present invention; and -
FIG. 5 is a chart illustrating the logical flow for a coherency testing process in accordance with one implementation of the present invention. - The use of the same reference symbols in different drawings indicates similar or identical items.
- As growing demands for performance have increased reliance on hardware accelerators, it has become more important and more difficult to ensure that accelerator operations are coherent with respect to memory operations by processors. Coherency within SMP systems is easily validated according to known methods, but coherency across two different mechanisms possibly operating at different frequencies, such as processors and an off-chip field-programmable gate array (FPGA), is a difficult challenge for the design and validation teams. Special alignment and synchronization requirements make it difficult to expose all possible corner-case errors in regular random tests.
- It would, therefore, be desirable to devise an improved method of validating coherency between processors and accelerators which can imitate real world scenarios. It would be further advantageous if the method could efficiently validate accelerators with the different coherency modes available in the system, to make sure the quality of the design is maintained, and also give a working model for better throughput in using the accelerators. The present invention achieves these advantages by allocating different portions of a single cache line for use by an accelerator and a processor. In the exemplary embodiment, the validation environment includes a cache whose lines are divided into blocks or sectors which can be accessed independently by a processor and an accelerator. A single cache line can be accessed by multiple accelerators or multiple cores, including different processor operations. This approach enables comprehensive testing for different modes of operation of cache lines, such as intervention by other caches or accesses from memory via the coherency mechanism. While coherency and intervention are managed across cores and caches, accelerators can pull those lines from outside the bus interconnect between processors, thereby generating coherency stress between different mechanisms. The invention accordingly ensures coherency between processors and accelerators by detecting any holes in the hardware design.
- With reference now to the figures, and in particular with reference to
FIG. 1 , there is depicted one embodiment 10 of a computer system in which the present invention may be implemented to carry out the testing of coherency systems in an integrated circuit design. Computer system 10 is a symmetric multiprocessor (SMP) system having a plurality of processors 12a, 12b connected to a system bus 14. System bus 14 is further connected to a combined memory controller/host bridge (MC/HB) 16 which provides an interface to system memory 18. System memory 18 may be a local memory device or alternatively may include a plurality of distributed memory devices, preferably dynamic random-access memory (DRAM). There may be additional structures in the memory hierarchy which are not depicted, such as on-board (L1) and second-level (L2) or third-level (L3) caches.
- MC/HB 16 also has an interface to peripheral component interconnect (PCI) Express links 20a, 20b, 20c. Each PCI Express (PCIe) link 20a, 20b is connected to a respective PCIe adaptor 22a, 22b, and each PCIe adaptor 22a, 22b is connected to a respective input/output (I/O) device 24a, 24b. MC/HB 16 may additionally have an interface to an I/O bus 26 which is connected to a switch (I/O fabric) 28. Switch 28 provides a fan-out for the I/O bus to a plurality of PCI links 20d, 20e, 20f. These PCI links are connected to more PCIe adaptors 22c, 22d, 22e which in turn support more I/O devices 24c, 24d, 24e. The I/O devices may include, without limitation, a keyboard, a graphical pointing device (mouse), a microphone, a display device, speakers, a permanent storage device (hard disk drive) or an array of such storage devices, an optical disk drive, and a network card. Each PCIe adaptor provides an interface between the PCI link and the respective I/O device. MC/HB 16 provides a low latency path through which processors 12a, 12b may access PCI devices mapped anywhere within bus memory or I/O address spaces. MC/HB 16 further provides a high bandwidth path to allow the PCI devices to access memory 18. Switch 28 may provide peer-to-peer communications between different endpoints, and this data traffic does not need to be forwarded to MC/HB 16 if it does not involve cache-coherent memory transfers. Switch 28 is shown as a separate logical component but it could be integrated into MC/HB 16.
- In this embodiment, PCI link 20c connects MC/HB 16 to a service processor interface 30 to allow communications between I/O device 24a and a service processor 32. Service processor 32 is connected to processors 12a, 12b via a JTAG interface 34, and uses an attention line 36 which interrupts the operation of processors 12a, 12b. Service processor 32 may have its own local memory 38, and is connected to read-only memory (ROM) 40 which stores various program instructions for system startup. Service processor 32 may also have access to a hardware operator panel 42 to provide system status and diagnostic information.
- In alternative embodiments, computer system 10 may include modifications of these hardware components or their interconnections, or additional components, so the depicted example should not be construed as implying any architectural limitations with respect to the present invention. The invention may further be implemented in an equivalent cloud computing network.
- When computer system 10 is initially powered up, service processor 32 uses JTAG interface 34 to interrogate the system (host) processors 12a, 12b and MC/HB 16. After completing the interrogation, service processor 32 acquires an inventory and topology for computer system 10. Service processor 32 then executes various tests such as built-in self-tests (BISTs), basic assurance tests (BATs), and memory tests on the components of computer system 10. Any error information for failures detected during the testing is reported by service processor 32 to operator panel 42. If a valid configuration of system resources is still possible after taking out any components found to be faulty during the testing, then computer system 10 is allowed to proceed. Executable code is loaded into memory 18 and service processor 32 releases host processors 12a, 12b for execution of the program code, e.g., an operating system (OS) which is used to launch applications, in particular the circuit validation application of the present invention, results of which may be stored in a hard disk drive of the system (an I/O device 24). While host processors 12a, 12b are executing program code, service processor 32 may enter a mode of monitoring and reporting any operating parameters or errors, such as the cooling fan speed and operation, thermal sensors, power supply regulators, and recoverable and non-recoverable errors reported by any of processors 12a, 12b, memory 18, and MC/HB 16. Service processor 32 may take further action based on the type of errors or defined thresholds.
- As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.)
or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
- Any combination of one or more computer usable or computer readable media may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. The computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this invention, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
- Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, written for a variety of platforms such as an AIX environment or operating systems such as Windows 7 or Linux. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks. Such storage media excludes transitory media such as propagating signals.
- The computer program instructions may further be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
-
Computer system 10 carries out program instructions for a coherency validation process that uses novel cache control techniques to stress the coherency mechanisms between processors and accelerators. Accordingly, a program embodying the invention may include conventional aspects of various validation and design tools, and these details will become apparent to those skilled in the art upon reference to this disclosure. - Accelerators generally use specific protocols and interfaces to submit jobs and get results. Issuing a job to these accelerators is typically done through control block structures with specific alignment requirements for accessing these control blocks. Memory accesses from accelerators outside the processor thus fall into two kinds: accesses to the control block structures, and accesses to the source memory on which the accelerators operate. Many of the accelerators do not directly fetch instructions themselves but instead rely on the main processor(s) to assign work to them. The software executing in the main memory (e.g., application program or operating system) creates an accelerator-specific control block in the memory and then initiates the accelerator. The accelerator can read all the needed inputs from the accelerator-specific control block and perform the requested operation. Some of the fields in the accelerator control block could be the requested operation, source address, target address and operation-specific inputs such as key value and key size for encryption/decryption. The present invention proposes a method through which the designer can allocate memory for any processor and accelerator mechanisms.
- Referring now to
FIG. 2 , there is depicted a verification environment 50 for a computer system design which includes multiple processors 52a, 52b, 52c (processor 0, processor 1, processor 2) and multiple accelerators 54a, 54b, 54c (accelerator 0, accelerator 1, accelerator 2). The processors and accelerators access a memory hierarchy of the computer system design, for example through read or write operations, to load values from or store values to memory blocks. The memory hierarchy includes one or more cache memories 58, and a coherency mechanism 60. The processors and accelerators may operate at different frequencies, i.e., the processors operating at a first frequency f1 and the accelerators operating at a second frequency f2 (or multiple frequencies for different accelerators). As those skilled in the art will appreciate, the verification environment may include a variety of other computer system components, not shown, according to the particular design being tested. A verification control program 62 can be used to manage the entire simulation effort. The verification system further includes a correctness checker 64 which can check values (data) computed by environment 50 or values stored in any of the components as the simulation proceeds against predetermined values known to be correct. - As
verification environment 50 is a software program structure representing a hardware design, the various components are generally software modules; however, one or more of the components may be implemented in hardware for the verification procedure and subject to verification control 62. The software modules are loaded onto an actual computer system such as computer system 10 to carry out the verification process, and are controlled by commands from verification control 62 which can also reside on computer system 10, along with correctness checker 64. Verification control 62 can command the execution of instructions in such a manner as to test different modes of operation of cache lines. For example, one portion of the cache line can be accessed by another cache memory to satisfy an instruction in a procedure known as cache intervention, as opposed to a mode where the access is directly to or from main memory. -
FIG. 3 illustrates one implementation for cache memory 58. Cache memory 58 has a plurality of entries or cache lines which may be arranged in pages and groups or ways according to congruence class mappings. A given cache line is further decomposed into a plurality of blocks or sectors. The sectors may be different sizes such as word, double word, quad word, etc. Verification control 62 can allocate different sectors of a single cache line for use by both accelerators and processors. In FIG. 3 , the first sector of the cache lines has been allocated for use by the accelerators, while the remaining sectors have been allocated to different processors or central processing units (CPUs). So for example, the first sector of a given cache line could be allocated to accelerator 54a, the second sector of the cache line could be allocated to processor 52a, the third sector of the cache line could be allocated to processor 52b, and the fourth sector of the cache line could be allocated to processor 52c. Alternatively, different sectors of a single cache line could be allocated for use by different accelerators. Multiple cache lines can be so shared according to the same divisions between blocks. Sectors used by the processors can be used for different operations. A given cache line may be divided into blocks of different sizes. For example, control structures accessed to submit jobs to accelerators often need to be quad word aligned, so the block allocated for an accelerator may be quad word size, but other blocks in the same line allocated to processors may be double word or single word size. This approach enables a cache line to be accessed randomly from different cores to create scenarios of intervention and different allowed coherency states, increasing the stress on coherency mechanism 60.
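The sector division just described can be sketched as follows. A 128-byte cache line, a 16-byte (quad word) accelerator sector, and 8-byte (double word) processor sectors are all assumed sizes chosen for illustration; the owner names are likewise hypothetical.

```python
# A sketch of one cache-line layout: the first sector is quad-word sized
# and aligned for an accelerator control structure, and the remaining
# space is divided into double-word sectors rotated among three CPUs.
# All sizes and owner names are illustrative assumptions.

def allocate_line(line_size: int = 128) -> list:
    """Return (owner, offset, size) tuples covering one cache line."""
    sectors = [("accelerator0", 0, 16)]  # quad-word aligned control data
    offset = 16
    cpu = 0
    while offset < line_size:
        sectors.append((f"cpu{cpu % 3}", offset, 8))  # double-word sectors
        offset += 8
        cpu += 1
    return sectors

layout = allocate_line()
# The sectors tile the line exactly, with no gaps or overlaps.
assert sum(size for _, _, size in layout) == 128
```

Because each owner touches a disjoint sector of the same line, the final data in the line is fully determined, so the checker can verify it even while the coherency mechanism is being stressed by contending accesses.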
As simulation of the verification environment proceeds, correctness checker 64 will sample values throughout the cache lines and compare them to the known correct values to determine whether the coherency mechanism is performing properly. Some examples of the various operations that the processors could be performing include: processor accesses to the cache line for stores and/or loads when the accelerator is also trying to access the cache line, which can change the state of the cache; scenarios where an accelerator is trying to access the cache line and a processor serves the request (by the process called intervention); and a processor trying to fetch a cache line through a prefetch operation while accelerators are accessing it. - This approach provides a convenient opportunity to test the accuracy of an accelerator. Many accelerators have functionalities which are complementary, i.e., generally opposite (inverse) operations, for example, encryption and decryption, or compression and decompression. The present invention can leverage this feature to validate the correctness of these functionalities. In such a case, as depicted in
FIG. 3 , a first contiguous set of cache lines in a page can be allocated for a control structure of the execution block for the accelerator, to invoke the first functionality. A second contiguous set of cache lines in the same page, immediately following the first set (consecutive), is allocated for a control structure of the execution block for the same accelerator, but the second set invokes the second, opposite functionality. The second set of cache lines will use the output of the first function as the input to the second function. A third set of contiguous cache lines in the same page (consecutive to the second set) is used for the output of the second function, which should be exactly the same as the original input to the first function. The number of cache lines in the sets depends on the particular control block structure for the accelerator. - Accordingly, as seen in
FIG. 4 , the first set of cache lines are used by the accelerator to generate the original function output, which then becomes the inverse input, i.e., the input to the inverse function. When the inverse function of the accelerator acts on the inverse input, it generates an inverse output, which should be the same as the original function input. These two values can then be compared by correctness checker 64 to validate the functionalities of the accelerator. In this manner, the operation of any accelerator which has opposite functionalities can be validated as the coherency mechanism is contemporaneously validated. It does not matter which of the two functions is designated as the original and which is designated as the inverse, e.g., for an encryption accelerator the verification control could designate the encryption function as the original function and the decryption function as the inverse function, or could alternatively designate the decryption function as the original function and the encryption function as the inverse function. - The invention may be further understood with reference to the chart of
FIG. 5 which illustrates the logical flow for a specific coherency testing process in accordance with one implementation carried out by computer system 10. Process 70 begins by selecting a congruence class to target (72), and selecting an instruction to apply to the cache memory (74). The congruence class and instruction may be selected according to any algorithm devised by the designer, or may be selected randomly. A check is then made to determine whether the instruction is accelerator-specific (76) or is for memory access (78). If the instruction is neither, then verification control 62 builds the instruction according to conventional testing methodology (80). If the instruction is accelerator-specific or is for memory access, a memory line is selected for the instruction within the targeted congruence class (82), and a sector index for that line is set (84). The line and index may also be selected randomly or according to other designer conditions. If the sector associated with the set sector index is free (86), that sector is used to build the instruction (88). If the sector indicated by the index is not free, a random number generator is used to pick an adjacent sector for examination (90), searching either to the left (92) or to the right (94) until a free sector is found. If a sector is free (96), that sector is used (88); otherwise, any sector is picked according to designer preferences or randomly (98). After any instruction is built by verification control 62, the process repeats iteratively at box 74 with additional instructions for as long as the verification process continues (100). Termination criteria may be based for example on elapsed time or a predetermined number of instructions being built. - The invention achieves several advantages by allowing processors and accelerators to share the same set of cache lines. 
Since the processors can generate random values during simulation, accelerators sharing the lines can be tested with different memory values. Different alignments of control blocks can also be validated. As processors and accelerators contend for the same set of cache lines, not only is the coherency mechanism tested but snoop and bus arbitration are also stressed. In particular, accelerators working at different frequencies than the processors will result in different timing of snoop broadcasts on the interconnect bus. Execution of accelerator workloads can be sped up since memory lines used by the accelerators can be brought in earlier by the processor cores. In real-world applications, this novel allocation of cache lines can prepare inputs for the accelerators ahead of the time they are needed, achieving maximum throughput. Finally, for accelerators having inverse functions, correctness validation of the accelerator functionality proceeds with much faster coverage of all corner cases, and no software simulation is needed to check the correctness of these functionalities.
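The sector-selection branch of process 70 (boxes 84 through 98 of FIG. 5) can be sketched as a short Python routine. This is a minimal illustration, not the patented implementation: the function name, the boolean free-list representation of sector occupancy, and the exact fallback policy are assumptions for the sketch.

```python
import random

def pick_sector(free, start):
    """Choose a sector within a selected memory line.

    `free` is a list of booleans (True = sector available) and `start`
    is the initially set sector index (box 84).  If the requested sector
    is free it is used directly (boxes 86/88); otherwise a random
    direction is chosen (box 90) and adjacent sectors are examined to
    the left (92) or right (94) until a free one is found (96).  If the
    search runs off the line, any sector is picked at random (box 98).
    """
    if free[start]:
        return start
    step = random.choice([-1, 1])     # pick search direction at random
    i = start + step
    while 0 <= i < len(free):         # scan one direction for a free sector
        if free[i]:
            return i
        i += step
    return random.randrange(len(free))  # fallback: any sector

# Example: sector 0 is free, so the requested index is used as-is.
print(pick_sector([True, False, True, False], 0))  # prints 0
```

The randomized direction and fallback mirror the flow chart's use of a random number generator, which helps the verification process exercise different sector-contention patterns across iterations.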
- Although the invention has been described with reference to specific embodiments, this description is not meant to be construed in a limiting sense. Various modifications of the disclosed embodiments, as well as alternative embodiments of the invention, will become apparent to persons skilled in the art upon reference to the description of the invention. For example, while the invention has been disclosed in the context of a cache memory, it is applicable to any shared resource that allows accesses from both processors and accelerators. It is therefore contemplated that such modifications can be made without departing from the spirit or scope of the present invention as defined in the appended claims.
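The inverse-function check attributed to correctness checker 64 can likewise be sketched in a few lines: run the original function, feed its output through the inverse function, and compare the result against the original input. The byte-rotation pair below merely stands in for a real accelerator pair such as encrypt/decrypt or compress/decompress; all names here are illustrative assumptions.

```python
def validate_inverse_pair(original_fn, inverse_fn, original_input):
    """Return True if inverse_fn(original_fn(x)) reproduces x."""
    original_output = original_fn(original_input)   # e.g. encryption output
    inverse_output = inverse_fn(original_output)    # e.g. decryption output
    return inverse_output == original_input         # correctness comparison

# Toy stand-in for an accelerator with opposite functionalities:
# byte-wise rotation by a fixed amount and its inverse.
ROT = 13
rot_fwd = lambda data: bytes((b + ROT) % 256 for b in data)
rot_inv = lambda data: bytes((b - ROT) % 256 for b in data)

print(validate_inverse_pair(rot_fwd, rot_inv, b"cache line data"))  # prints True
```

As the description notes, either function of the pair may be designated the original; the check is symmetric, so swapping `rot_fwd` and `rot_inv` in the call above validates the same pair.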
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/038,125 US20140236561A1 (en) | 2013-02-19 | 2013-09-26 | Efficient validation of coherency between processor cores and accelerators in computer systems |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/770,711 US9501408B2 (en) | 2013-02-19 | 2013-02-19 | Efficient validation of coherency between processor cores and accelerators in computer systems |
| US14/038,125 US20140236561A1 (en) | 2013-02-19 | 2013-09-26 | Efficient validation of coherency between processor cores and accelerators in computer systems |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/770,711 Continuation US9501408B2 (en) | 2013-02-19 | 2013-02-19 | Efficient validation of coherency between processor cores and accelerators in computer systems |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20140236561A1 true US20140236561A1 (en) | 2014-08-21 |
Family
ID=51351883
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/770,711 Active 2034-11-10 US9501408B2 (en) | 2013-02-19 | 2013-02-19 | Efficient validation of coherency between processor cores and accelerators in computer systems |
| US14/038,125 Abandoned US20140236561A1 (en) | 2013-02-19 | 2013-09-26 | Efficient validation of coherency between processor cores and accelerators in computer systems |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/770,711 Active 2034-11-10 US9501408B2 (en) | 2013-02-19 | 2013-02-19 | Efficient validation of coherency between processor cores and accelerators in computer systems |
Country Status (1)
| Country | Link |
|---|---|
| US (2) | US9501408B2 (en) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10025722B2 (en) | 2015-10-28 | 2018-07-17 | International Business Machines Corporation | Efficient translation reloads for page faults with host accelerator directly accessing process address space without setting up DMA with driver and kernel by process inheriting hardware context from the host accelerator |
| CN109669897A (en) * | 2017-10-13 | 2019-04-23 | 华为技术有限公司 | Data transmission method and device |
| US10289553B2 (en) | 2016-10-27 | 2019-05-14 | International Business Machines Corporation | Accelerator sharing |
| US10810070B2 (en) | 2016-02-19 | 2020-10-20 | Hewlett Packard Enterprise Development Lp | Simulator based detection of a violation of a coherency protocol in an incoherent shared memory system |
| CN113132178A (en) * | 2020-01-15 | 2021-07-16 | 普天信息技术有限公司 | Protocol consistency test method and device |
| US11347643B2 (en) * | 2018-06-29 | 2022-05-31 | Intel Corporation | Control logic and methods to map host-managed device memory to a system address space |
| CN115485671A (en) * | 2020-05-06 | 2022-12-16 | 国际商业机器公司 | Utilizing coherently attached interfaces in a network stack framework |
| CN115618801A (en) * | 2022-12-01 | 2023-01-17 | 北京智芯微电子科技有限公司 | Cache consistency check method, device and electronic equipment |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2011097482A1 (en) * | 2010-02-05 | 2011-08-11 | Maxlinear, Inc. | Conditional access integration in a soc for mobile tv applications |
| US9772944B2 (en) | 2014-06-27 | 2017-09-26 | International Business Machines Corporation | Transactional execution in a multi-processor environment that monitors memory conflicts in a shared cache |
| US10013351B2 (en) | 2014-06-27 | 2018-07-03 | International Business Machines Corporation | Transactional execution processor having a co-processor accelerator, both sharing a higher level cache |
| US9928127B2 (en) | 2016-01-29 | 2018-03-27 | International Business Machines Corporation | Testing a data coherency algorithm |
| JP7045921B2 (en) * | 2018-04-27 | 2022-04-01 | 株式会社日立製作所 | Semiconductor LSI design device and design method |
| US11445020B2 (en) * | 2020-03-24 | 2022-09-13 | Arm Limited | Circuitry and method |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020087809A1 (en) * | 2000-12-28 | 2002-07-04 | Arimilli Ravi Kumar | Multiprocessor computer system with sectored cache line mechanism for cache intervention |
| US20020087828A1 (en) * | 2000-12-28 | 2002-07-04 | International Business Machines Corporation | Symmetric multiprocessing (SMP) system with fully-interconnected heterogenous microprocessors |
| US6883071B2 (en) * | 2001-11-19 | 2005-04-19 | Hewlett-Packard Development Company, L.P. | Method for evaluation of scalable symmetric multiple processor cache coherency protocols and algorithms |
| US20090024892A1 (en) * | 2007-07-18 | 2009-01-22 | Vinod Bussa | System and Method of Testing using Test Pattern Re-Execution in Varying Timing Scenarios for Processor Design Verification and Validation |
| US20110252200A1 (en) * | 2010-04-13 | 2011-10-13 | Apple Inc. | Coherent memory scheme for heterogeneous processors |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6373493B1 (en) | 1995-05-01 | 2002-04-16 | Apple Computer, Inc. | Hardware graphics accelerator having access to multiple types of memory including cached memory |
| US7814279B2 (en) | 2006-03-23 | 2010-10-12 | International Business Machines Corporation | Low-cost cache coherency for accelerators |
| US20080010321A1 (en) | 2006-06-20 | 2008-01-10 | International Business Machines Corporation | Method and system for coherent data correctness checking using a global visibility and persistent memory model |
| US7865675B2 (en) | 2007-12-06 | 2011-01-04 | Arm Limited | Controlling cleaning of data values within a hardware accelerator |
| US8108197B2 (en) | 2008-12-04 | 2012-01-31 | International Business Machines Corporation | Method to verify an implemented coherency algorithm of a multi processor environment |
2013
- 2013-02-19 US US13/770,711 patent/US9501408B2/en active Active
- 2013-09-26 US US14/038,125 patent/US20140236561A1/en not_active Abandoned
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020087809A1 (en) * | 2000-12-28 | 2002-07-04 | Arimilli Ravi Kumar | Multiprocessor computer system with sectored cache line mechanism for cache intervention |
| US20020087828A1 (en) * | 2000-12-28 | 2002-07-04 | International Business Machines Corporation | Symmetric multiprocessing (SMP) system with fully-interconnected heterogenous microprocessors |
| US6883071B2 (en) * | 2001-11-19 | 2005-04-19 | Hewlett-Packard Development Company, L.P. | Method for evaluation of scalable symmetric multiple processor cache coherency protocols and algorithms |
| US20090024892A1 (en) * | 2007-07-18 | 2009-01-22 | Vinod Bussa | System and Method of Testing using Test Pattern Re-Execution in Varying Timing Scenarios for Processor Design Verification and Validation |
| US20110252200A1 (en) * | 2010-04-13 | 2011-10-13 | Apple Inc. | Coherent memory scheme for heterogeneous processors |
Non-Patent Citations (4)
| Title |
|---|
| Abd-El-Barr, Mostafa. "Fundamentals of Computer Organization and Architecture". Published Feb 22, 2005. pp. 121-122. * |
| Lebeck, Alvin R., et al. "Annotated memory references: A mechanism for informed cache management." Euro-Par'99 Parallel Processing. Springer Berlin Heidelberg, 1999. pp. 1251-1254. * |
| Raina, Rajesh and Molyneaux, Robert. "Random Self-Test Method Applications on PowerPC™ Microprocessor Caches". Published Feb 21, 1998. * |
| Sirowy et al. "Clock-Frequency Assignment for Multiple Clock Domain Systems-on-a-Chip". Published in 2007. * |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10025722B2 (en) | 2015-10-28 | 2018-07-17 | International Business Machines Corporation | Efficient translation reloads for page faults with host accelerator directly accessing process address space without setting up DMA with driver and kernel by process inheriting hardware context from the host accelerator |
| US10031858B2 (en) | 2015-10-28 | 2018-07-24 | International Business Machines Corporation | Efficient translation reloads for page faults with host accelerator directly accessing process address space without setting up DMA with driver and kernel by process inheriting hardware context from the host accelerator |
| US10810070B2 (en) | 2016-02-19 | 2020-10-20 | Hewlett Packard Enterprise Development Lp | Simulator based detection of a violation of a coherency protocol in an incoherent shared memory system |
| US10289553B2 (en) | 2016-10-27 | 2019-05-14 | International Business Machines Corporation | Accelerator sharing |
| US11068397B2 (en) | 2016-10-27 | 2021-07-20 | International Business Machines Corporation | Accelerator sharing |
| CN109669897A (en) * | 2017-10-13 | 2019-04-23 | 华为技术有限公司 | Data transmission method and device |
| US11347643B2 (en) * | 2018-06-29 | 2022-05-31 | Intel Corporation | Control logic and methods to map host-managed device memory to a system address space |
| US20220237121A1 (en) * | 2018-06-29 | 2022-07-28 | Intel Corporation | Host-managed coherent device memory |
| US11928059B2 (en) * | 2018-06-29 | 2024-03-12 | Intel Corporation | Host-managed coherent device memory |
| CN113132178A (en) * | 2020-01-15 | 2021-07-16 | 普天信息技术有限公司 | Protocol consistency test method and device |
| CN115485671A (en) * | 2020-05-06 | 2022-12-16 | 国际商业机器公司 | Utilizing coherently attached interfaces in a network stack framework |
| CN115618801A (en) * | 2022-12-01 | 2023-01-17 | 北京智芯微电子科技有限公司 | Cache consistency check method, device and electronic equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| US20140237194A1 (en) | 2014-08-21 |
| US9501408B2 (en) | 2016-11-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9501408B2 (en) | Efficient validation of coherency between processor cores and accelerators in computer systems | |
| CN108292337B (en) | Trusted opening of security fort regions in virtualized environments | |
| Hadidi et al. | Cairo: A compiler-assisted technique for enabling instruction-level offloading of processing-in-memory | |
| US10229043B2 (en) | Requesting memory spaces and resources using a memory controller | |
| TW201905714A (en) | Computing system operating method, computing system, vehicle and computer readable medium for direct input and output operations of a storage device with auxiliary processor memory | |
| US20120303948A1 (en) | Address translation unit, device and method for remote direct memory access of a memory | |
| Lee et al. | Performance characterization of data-intensive kernels on AMD fusion architectures | |
| US20200089621A1 (en) | Method, system, and apparatus for stress testing memory translation tables | |
| US20230025441A1 (en) | Ranking tests based on code change and coverage | |
| CN104346247B (en) | Multistage pipeline system for programmable circuit | |
| Patil et al. | Āpta: Fault-tolerant object-granular CXL disaggregated memory for accelerating FaaS | |
| CN104321750B (en) | Method and system for maintaining release coherency in shared memory programming | |
| US9009018B2 (en) | Enabling reuse of unit-specific simulation irritation in multiple environments | |
| EP4066122B1 (en) | Virtualized caches | |
| CN115372791A (en) | Test method and device of integrated circuit based on hardware simulation and electronic equipment | |
| JPWO2012127955A1 (en) | Semiconductor device | |
| WO2022212232A1 (en) | Configurable interconnect address remapper with event recognition | |
| CN116245054A (en) | Authentication method, authentication device, electronic device, and computer-readable storage medium | |
| JP2022533378A (en) | Address Translation Cache Invalidation in Microprocessors | |
| US9323702B2 (en) | Increasing coverage of delays through arbitration logic | |
| CN104572570B (en) | Chip-stacked cache extension with coherence | |
| US20230333861A1 (en) | Configuring a component of a processor core based on an attribute of an operating system process | |
| Mampaey | Memory page allocation in multi-chip-module GPUs | |
| Daoudi et al. | Improving Simulations of Task-Based Applications on Complex NUMA Architectures | |
| Wu et al. | Simulating CXL Shared Coherent Memory Using Shared Memory Among Virtual Machines |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AS | Assignment |
Owner name: GLOBALFOUNDRIES U.S. 2 LLC, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:036550/0001 Effective date: 20150629 |
|
| AS | Assignment |
Owner name: GLOBALFOUNDRIES INC., CAYMAN ISLANDS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GLOBALFOUNDRIES U.S. 2 LLC;GLOBALFOUNDRIES U.S. INC.;REEL/FRAME:036779/0001 Effective date: 20150910 |
|
| STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
| AS | Assignment |
Owner name: GLOBALFOUNDRIES U.S. INC., NEW YORK Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:WILMINGTON TRUST, NATIONAL ASSOCIATION;REEL/FRAME:056987/0001 Effective date: 20201117 Owner name: GLOBALFOUNDRIES U.S. INC., NEW YORK Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:WILMINGTON TRUST, NATIONAL ASSOCIATION;REEL/FRAME:056987/0001 Effective date: 20201117 |