WO1998038579A1 - Method for reducing the frequency of cache misses - Google Patents
Method for reducing the frequency of cache misses
- Publication number
- WO1998038579A1 (application PCT/IB1998/000197)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- blocks
- instructions
- selection
- cache
- block
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
- G06F8/4441—Reducing the execution time required by the program code
- G06F8/4442—Reducing the number of cache misses; Data prefetching
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0864—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
Definitions
- the invention relates to a method of selecting memory locations for placing instructions as described in the preamble of Claim 1.
- a machine like a computer typically contains a main memory, a cache memory and a processor. During execution of a program the computer loads instructions from main memory for execution by the processor and copies these instructions into the cache memory.
- the cache memory contains cache lines, each of which may hold several instructions at a time.
- the memory contains locations, and the location at which an instruction is stored determines into which cache line that instruction will be copied. When the instruction is copied into the cache line a previous content of that cache line is no longer available from the cache memory.
- a cache miss may occur when, after the time the instruction was last copied into a particular cache line, another instruction has been copied into that particular cache line. Whether this occurs depends on the main memory locations of the other instructions that have been executed since the time that the instruction was last copied into the cache line. If these locations are such that these instructions have to be copied into the particular cache line, a cache miss may occur. The number of cache misses can be minimized by a proper selection of these main memory locations so that copying into the particular cache line is not needed too often.
- a sample of execution of the program is obtained.
- the sample indicates which instructions the processor successively executes when the machine receives a given typical data input.
- a linear function is derived which calculates the number of cache misses for the sample of execution as a function of the locations where the instructions of the program are placed in main memory.
- a minimum of this linear function is found, which corresponds to optimal locations for placing the instructions in main memory.
- This known method has the disadvantage that it is very time consuming. Even for relatively small programs a computer needs an excessive amount of execution time to find an optimum.
- the known method reduces this amount of execution time by grouping the instructions into blocks of instructions that are always executed as a whole, and by calculating the number of cache misses at the level of blocks instead of individual instructions. But even with this improvement a computer still needs hours of execution time to find an optimum for relatively small programs.
- the known method makes very inefficient use of the locations in main memory, because it divides the program into blocks of instructions and introduces unused locations between the blocks to enforce that each cache line will contain instructions from only one block.
- the method according to the invention is characterized by the characterizing part of Claim 1.
- Potential selections are screened to select a potential selection which generally reduces the number of cache misses for the sample of execution of the program when the instructions are placed according to the potential selection instead of according to the original selection. This process is repeated, each time with the successor selection replacing the original selection.
- a computation of the number of cache misses is performed for potential successor selections.
- a score is used as a heuristic for deciding whether to compute the number of cache misses of potential selections from selections that differ from the original selection only by the movement of selected blocks.
- the program will be loaded into main memory according to the optimal selection.
- This main memory may be for example a conventional DRAM or a ROM or the like.
- Such a ROM can be used in a machine with a fixed, efficient program.
- the program can also be stored on a machine readable medium like a magnetic disk, in combination with information about the optimal selection for use in loading the program into the main memory according to the optimal selection.
- the machine containing the main memory will execute the program by fetching the instructions from main memory.
- the method according to the invention has an embodiment wherein the respective score for each block is a count of executions of that block during the sample of execution, only executions being counted which both cause a cache miss and are separated from a directly preceding execution of that block by execution of other blocks of which an aggregate size is less than a size of the cache memory.
- the cache misses that are counted are all conflict misses, i.e. cache misses that can be avoided by assigning different locations to instructions. It is likely that the number of cache misses occurring during the sample of execution can be reduced more by movement of a block for which this count is higher. Therefore use of this count makes the search for an optimal selection efficient.
- the set of at least one selected block is selected so that the respective scores indicate that each of the at least one selected blocks has at least as high a number of cache misses as any other block, the further blocks comprising all blocks for which the respective scores indicate that they have a lower number of cache misses than the at least one selected block or blocks. Movement of a block which causes the most cache misses is most likely to be a success for reducing the number of cache misses occurring during the sample of execution.
- the method according to the invention has an embodiment wherein, for said testing whether any one of the set of potential selections produces a smaller number of cache misses, self-conflicting potential selections are excluded from the set of potential selections. A potential selection is self-conflicting when any location it selects for placement of an instruction of one of the selected blocks may cause a cache conflict with any location selected for placement of instructions of that same block in the original selection.
- the invention has another embodiment wherein the locations selected for placement of the instructions of the blocks for each of the original selection and the potential selections are selected logically substantially contiguously, substantially no unused locations occurring between blocks. This is done even if it means that instructions from end and beginning, respectively, of two contiguous blocks will be loaded into the same cache line. Thus the method will find an optimal selection under the constraint that the amount of main memory space used for the program is minimal.
- the method according to the invention has a further embodiment wherein the set of at least one selected blocks comprises only one block and wherein the potential selections each have an order of blocks in main memory which differs from an order of blocks in main memory of the original selection in that only the instructions of the one block have been moved relative to the instructions of other blocks. This reduces the number of potential selections that need to be considered, and thereby the average number of times the total number of cache misses needs to be computed.
- Figure 1 shows a diagram of a machine with a main memory and a cache memory.
- Figure 2 shows a flow chart of a typical program.
- Figure 3 shows a flow-chart of an embodiment of the method according to the invention.
- Figure 4 illustrates two selections of locations for placing instructions in main memory.
- Figure 1 shows a diagram of a machine with a main memory 10, a cache memory 12 and a processor 14.
- the main memory 10 is for example a DRAM or a ROM.
- the processor 14 has an address output coupled to the cache memory 12.
- the cache memory 12 has an instruction output coupled to the processor 14 via a multiplexer 16.
- the cache memory 12 also has an address output coupled to the main memory 10.
- the main memory 10 has an instruction output coupled to the processor 14 via the multiplexer 16.
- the cache memory 12 has a control output coupled to the multiplexer 16 for controlling whether instructions are supplied to the processor 14 from the main memory 10 or the cache memory 12.
- the processor issues addresses to fetch instructions.
- the addresses refer to locations in the main memory 10.
- the cache memory 12 passes the address to the main memory 10.
- the main memory 10 supplies the instruction stored in the location referred to by the address to the instruction output of the main memory 10.
- the cache memory 12 controls the multiplexer 16 so that this instruction is supplied to the processor 14.
- the cache memory 12 checks the address to see whether a copy of that instruction is available in the cache memory owing to an earlier fetch of that instruction. If a copy is available the cache memory 12 supplies this copy to its instruction output and controls the multiplexer 16 to supply that copy to the processor 14 for execution. In this case the processor 14 does not need to wait for the main memory 10 to supply the instruction and no address needs to be supplied to the main memory 10.
- the cache memory 12 contains a number of cache lines 120-1..120-8. Only 8 cache lines are shown, but in practice the number of cache lines will be much larger, for example 512. Each cache line provides memory space, for example 64 bytes, for storing a number of copied instructions.
- the address of the location in main memory 10 in which the instruction is stored determines into which cache line that instruction will be copied. For example an instruction stored at a location with address "A" might be stored in cache line number A mod N (the integer remainder when A is divided by N). Instructions stored at different locations in main memory may be copied to the same cache line: in the example instructions stored at locations whose addresses differ by an integer multiple of N will be copied into the same cache line. These instructions are said to have "conflicting" addresses.
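The mapping rule in the example above can be sketched in a few lines of Python. This is only an illustration of the simplified "A mod N" rule as stated in the text; the function names `cache_line` and `conflicting` are hypothetical, and addresses are treated as plain integers.

```python
def cache_line(address: int, n_lines: int) -> int:
    """Cache line index for a main-memory address, per the 'A mod N' example."""
    return address % n_lines


def conflicting(a: int, b: int, n_lines: int) -> bool:
    """Two distinct addresses conflict when they map to the same cache line."""
    return a != b and cache_line(a, n_lines) == cache_line(b, n_lines)


N = 8  # number of cache lines, as drawn in Figure 1
# Addresses that differ by an integer multiple of N share a cache line.
print(cache_line(3, N), cache_line(3 + 2 * N, N))
print(conflicting(3, 3 + 2 * N, N), conflicting(3, 4, N))
```

In a cache whose lines hold several bytes, the address would first be divided by the line size before the remainder is taken; the sketch leaves that step out to stay close to the example in the text.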
- the architecture of the cache memory 12 determines which old instructions are no longer available. In a cache memory 12 with a direct mapped architecture none of the instructions with conflicting addresses that have been copied earlier into the cache line will be available any longer. In a cache memory with a set-associative architecture a number (for example 8) of ranges of instructions with mutually conflicting addresses can be retained together in the cache memory 12. When an instruction is copied into the cache memory 12 only one of that number of ranges will no longer be available. Which one depends on the cache replacement strategy of the architecture. In case of a "Least Recently Used" (LRU) strategy, the range of instructions with conflicting addresses that have been fetched least recently by the processor 14 is no longer available.
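The difference between the two architectures can be made concrete with a small simulation. The sketch below is an assumption-laden toy model, not the patent's own procedure: addresses are line-granular integers, `count_misses` is a hypothetical name, and LRU replacement is modelled with an ordered dictionary per set.

```python
from collections import OrderedDict


def count_misses(addresses, n_sets, ways):
    """Count misses for a sequence of line-granular addresses.

    ways == 1 models a direct-mapped cache; ways > 1 models a
    set-associative cache with LRU replacement inside each set.
    """
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in addresses:
        lines = sets[addr % n_sets]
        if addr in lines:
            lines.move_to_end(addr)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[addr] = True
    return misses


# Addresses 0 and 8 conflict when there are 8 sets (both map to set 0).
trace = [0, 8, 0, 8, 0, 8]
print(count_misses(trace, n_sets=8, ways=1))  # every access misses
print(count_misses(trace, n_sets=8, ways=2))  # only the first two miss
```

With a single way the two conflicting addresses keep evicting each other, while two ways are enough to retain both.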
- Figure 2 shows a flow chart of a typical program for the machine of figure 1.
- the program contains blocks of instructions 20-26.
- the instructions from the blocks are stored in the main memory 10 and executed by the processor 14.
- the program contains loops 27, 28, which cause the instructions of some blocks 21, 22 or 23, 24 to be executed more than once.
- This type of cache-miss may be avoided by placing the other instruction before execution in a location in the main memory 10 whose address does not conflict with the address of the particular instruction.
- the invention is concerned with a method of selecting locations for placing instructions in the main memory 10 so that the number of cache-misses is minimized.
- Figure 3 shows a flow-chart of an embodiment of the method according to the invention.
- a sample of an execution of the program is obtained.
- the blocks 20-26 will be executed in a certain sequence: the loops 27, 28 will be taken a number of times and in the loops 27, 28 either one of two blocks 22, 23 will be executed alternatively.
- the sample of execution describes the sequence in which the blocks are executed (for example 20, 21, 22, 24, 21, 22, 24, 25, 21, 23, 24, 21, 22, 24 etc.).
- If a block is repeatedly executed, then an execution of that block at a position "i" in the sequence is marked as a candidate for elimination of cache misses if there is a next preceding execution of the block at a position "j" in the sequence ($j < i$) and $S_j - S_i < \mathrm{MIN}$, where MIN is the number of cache lines in the cache memory 12 (plus the number of contents of a cache line that the cache memory 12 can hold simultaneously in the case of a set associative cache).
- A next preceding execution of the block is an execution such that there are no intermediate positions "k" ($j < k < i$) at which the block is executed.
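The marking of candidates can be sketched as follows. This is a minimal illustration under stated assumptions: it follows the wording used earlier in the text (the blocks executed between two successive executions of a block must have an aggregate size smaller than the cache), sizes are given in cache lines, each intervening execution contributes its block size, and the names `mark_candidates`, `size_in_lines` and `cache_lines` are hypothetical.

```python
def mark_candidates(sequence, size_in_lines, cache_lines):
    """Return the positions i whose block was last executed at some j < i
    with fewer than `cache_lines` lines of other code executed in between."""
    last_seen = {}   # block -> position of its most recent execution
    prefix = [0]     # prefix[p] = total size of sequence[:p]
    candidates = []
    for i, block in enumerate(sequence):
        if block in last_seen:
            j = last_seen[block]
            # Aggregate size of the executions at positions j+1 .. i-1.
            between = prefix[i] - prefix[j + 1]
            if between < cache_lines:
                candidates.append(i)
        last_seen[block] = i
        prefix.append(prefix[-1] + size_in_lines[block])
    return candidates


# The example sequence from the text; the block sizes are made up.
sample = [20, 21, 22, 24, 21, 22, 24, 25, 21, 23, 24, 21, 22, 24]
sizes = {block: 2 for block in set(sample)}
print(mark_candidates(sample, sizes, cache_lines=8))
```

An execution that is not marked has so much other code between it and the previous execution of the same block that a miss could be a capacity miss, which no placement can remove.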
- an original selection of locations for placing the instructions of the program is made in the first step 30 and the number of cache misses is computed that will occur during the sample of execution when the instructions are placed according to the original selection.
- This computation needs to consider only the execution of blocks 20-26 and not the execution of individual instructions, although the latter is possible in principle. For this computation it is determined for the original selection for each block which cache lines will be involved when the block is copied into the cache memory 12. Then execution according to the sample is simulated at block level, step by step for each block in the sequence of execution, keeping a record for each cache line of which block(s) will be available in the cache line after execution of each block in the sequence, given the architecture of the cache memory 12.
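A block-level simulation of this kind might look like the sketch below. It is only an illustration: it assumes a direct-mapped cache, contiguous placement of the blocks starting at address 0, block sizes in bytes, and it counts one miss per block execution that has to fetch at least one line. The names `memory_lines` and `total_misses` are hypothetical.

```python
def memory_lines(order, size, line_size):
    """Memory-line numbers covered by each block under contiguous placement."""
    covered, address = {}, 0
    for block in order:
        first = address // line_size
        last = (address + size[block] - 1) // line_size
        covered[block] = range(first, last + 1)
        address += size[block]
    return covered


def total_misses(order, size, sequence, line_size, n_lines):
    """Simulate the sample at block level for a direct-mapped cache."""
    covered = memory_lines(order, size, line_size)
    cache = {}   # cache line index -> memory line it currently holds
    misses = 0
    for block in sequence:
        missed = False
        for m in covered[block]:
            if cache.get(m % n_lines) != m:
                missed = True
                cache[m % n_lines] = m   # copy the line into the cache
        misses += missed
    return misses


size = {"a": 64, "b": 64, "c": 64, "d": 64}
sample = list("acacac")
print(total_misses(list("abcd"), size, sample, line_size=64, n_lines=2))
print(total_misses(list("acbd"), size, sample, line_size=64, n_lines=2))
```

Tracking which memory line each cache line holds, rather than which block, lets two adjacent blocks share a cache line without being treated as conflicting.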
- In the second step 31 of the flow-chart, it is counted for each block 20-26 how many cache misses are caused by this block 20-26 during the sample of execution. In the count only those executions in the sequence of executions are counted which have been marked in the first step as a candidate for elimination of cache misses. For these cache misses it is certain that they can be avoided by placing the block at a different location in main memory 10. The count is also performed at a block level as described for determining the total number of cache misses.
- In a third step 32 of the flow-chart, the block 20-26 is selected that has at least as high a count as the count for any other block as determined in the second step 31. This block is marked as "tried".
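The second and third steps can be sketched together. The code below is a hedged illustration, assuming a direct-mapped cache, contiguous placement, sizes in bytes, and a candidate test based on the aggregate size of the code executed since the block's previous execution; `conflict_counts` and `pick_block` are hypothetical names.

```python
from collections import Counter


def conflict_counts(order, size, sequence, line_size, n_lines):
    """Per-block count of executions that miss although the code run since
    the block's previous execution is smaller than the cache."""
    covered, address = {}, 0
    for block in order:
        covered[block] = range(address // line_size,
                               (address + size[block] - 1) // line_size + 1)
        address += size[block]

    cache, last_seen, prefix = {}, {}, [0]
    counts = Counter({block: 0 for block in order})
    capacity = line_size * n_lines
    for i, block in enumerate(sequence):
        missed = False
        for m in covered[block]:
            if cache.get(m % n_lines) != m:
                missed = True
                cache[m % n_lines] = m
        if missed and block in last_seen:
            between = prefix[i] - prefix[last_seen[block] + 1]
            if between < capacity:
                counts[block] += 1
        last_seen[block] = i
        prefix.append(prefix[-1] + size[block])
    return counts


def pick_block(counts, tried):
    """Untried block whose count is at least as high as any other's, or None."""
    untried = [b for b in counts if b not in tried]
    return max(untried, key=lambda b: counts[b]) if untried else None


size = {"a": 64, "b": 64, "c": 64, "d": 64}
counts = conflict_counts(list("abcd"), size, list("acacac"), 64, 2)
print(dict(counts), pick_block(counts, tried=set()))
```

The first execution of each block is never counted, since that miss cannot be removed by moving the block.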
- a potential selection of location for placing the instructions of the program is derived from the original selection of locations for placing the instructions.
- Figure 4 illustrates the original selection 40 and a potential selection 42.
- the horizontal dimension in figure 4 symbolizes the locations in the main memory 10 in order of logically increasing address "Addr". Dashed vertical lines mark the boundaries between ranges of addresses that address locations that will be copied into the same cache line.
- four blocks a-d are shown at a position symbolizing the locations of the instructions of those blocks in main memory 10.
- Figure 4 assumes that the second block "b" has been selected in the third step 32.
- a potential selection is derived from the original selection by taking the order in which the blocks appear in the original selection and moving the selected block "b" relative to the other blocks. For this purpose a position is selected to which the selected block is moved. This position must be “untried” as yet for the selected block in combination with the original position. When the position is selected it is marked as “tried”, so that it will not be used again for the selected block in combination with the original selection.
- the move of the selected block “b” results in the potential selection 42 where the second block “b” has been moved from its position between blocks a and c to a position between blocks c and d.
- the order of the blocks in the potential selection is (a c b d) instead of the order (a b c d) in the original selection.
- the blocks are located substantially contiguously in main memory 10, i.e. substantially without unused locations between the locations used for instructions of consecutive blocks (e.g. a-b) in the order.
- memory space is used very efficiently.
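Deriving potential selections by moving one block within a contiguous layout can be illustrated as follows. The helper names `move_block`, `potential_orders` and `start_addresses` and the block sizes are assumptions made for this sketch only.

```python
def move_block(order, block, position):
    """Return a new order with `block` removed and re-inserted at `position`
    (an index into the order of the remaining blocks)."""
    rest = [b for b in order if b != block]
    return rest[:position] + [block] + rest[position:]


def potential_orders(order, block):
    """All orders that differ from `order` only by the place of `block`."""
    original = order.index(block)
    return [move_block(order, block, p)
            for p in range(len(order)) if p != original]


def start_addresses(order, size):
    """Contiguous layout: each block starts where the previous one ends."""
    starts, address = {}, 0
    for b in order:
        starts[b] = address
        address += size[b]
    return starts


original = list("abcd")
print(move_block(original, "b", 2))      # block b moved between c and d
print(potential_orders(original, "b"))
print(start_addresses(list("acbd"), {"a": 100, "b": 40, "c": 60, "d": 30}))
```

Because the layout is recomputed from the order alone, moving one block also shifts the start addresses of the blocks between its old and new positions.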
- positions are tested to determine whether according to the resulting potential selection any instruction of the selected block is stored in a location that corresponds to a cache line that is used for the selected block according to the original selection. If this is the case, such a position is preferably not used for generating the potential selection and another position is selected. Thus only positions will be selected where the cache misses caused by execution of the selected block are eliminated. This saves the time needed for computing the total number of cache misses of potential selections according to the unused position and, more importantly, it prevents the flow-chart from substituting such a potential selection for the original selection, which would achieve a smaller reduction in the number of cache misses than would have been possible for other potential selections.
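The test for such self-conflicting positions can be sketched as a set intersection. This is an illustrative reading of the paragraph above, assuming contiguous placement from address 0 and sizes in bytes; `cache_lines_of` and `is_self_conflicting` are hypothetical names.

```python
def cache_lines_of(order, size, block, line_size, n_lines):
    """Set of cache line indices that `block` occupies for a given order."""
    address = 0
    for b in order:
        if b == block:
            first = address // line_size
            last = (address + size[b] - 1) // line_size
            return {m % n_lines for m in range(first, last + 1)}
        address += size[b]
    raise KeyError(block)


def is_self_conflicting(original, potential, size, block, line_size, n_lines):
    """True if the moved block still uses a cache line it used originally."""
    before = cache_lines_of(original, size, block, line_size, n_lines)
    after = cache_lines_of(potential, size, block, line_size, n_lines)
    return bool(before & after)


equal = {"a": 64, "b": 64, "c": 64, "d": 64}
big_c = {"a": 64, "b": 64, "c": 256, "d": 64}
print(is_self_conflicting(list("abcd"), list("acbd"), equal, "b", 64, 4))
print(is_self_conflicting(list("abcd"), list("acbd"), big_c, "b", 64, 4))
```

In the second call block c spans exactly four lines, so moving b past it brings b back onto the cache line it started on and the position would be skipped.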
- the total number of cache misses is computed that will occur during the sample of execution when the instructions are placed according to the potential selection 42. This total number of cache- misses will generally differ from the total number of cache misses computed for the original selection 40.
- When the instructions are placed in main memory 10 according to the potential selection 42, the instructions of the second block b will be copied into different cache lines in the cache memory than when the instructions are placed according to the original selection 40 (as can be seen from the fact that the second block b appears between different pairs of dashed vertical lines in the two selections). Therefore the cache misses caused by fetching of instructions of block b when the blocks are placed in main memory 10 according to the original selection will not occur when the blocks are placed in main memory according to the potential selection. However, the instructions of the block b may give rise to other cache misses, either when they are fetched or because they make other instructions unavailable when they are copied into the cache memory.
- Other blocks a, c, d may also cause different cache misses because the removal and insertion of the block b causes instructions of other blocks (e.g. block c) to be placed at different locations. All this generally results in a different number of cache misses for the potential selection 42 and the original selection 40.
- the total number of cache misses computed for the potential selection 42 is compared with the total number of cache misses computed for the original selection 40. If the total number of cache misses computed for the potential selection 42 is less than that computed for the original selection 40, then the original selection 40 is replaced by the potential selection in a substep 35a and the flow-chart resumes from the second step.
- Otherwise, a seventh step 36 is executed to determine whether there are any untried positions left for placing the selected block b. If so, the flow chart returns to the fourth step 33. If no untried positions are left, an eighth step 37 determines if there are any further blocks that may serve as selected block. If so, the flow chart returns to the third step 32.
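The loop formed by these steps can be sketched as a local search. The code below is a simplified illustration rather than the flow-chart itself: it assumes a direct-mapped cache and contiguous placement, scores blocks by all of their misses instead of only the marked candidates, and omits the self-conflict filter. The names `simulate` and `local_search` are hypothetical.

```python
def simulate(order, size, sequence, line_size, n_lines):
    """Return (total, per_block) counts of block executions that miss in a
    direct-mapped cache when blocks are placed contiguously in `order`."""
    covered, address = {}, 0
    for b in order:
        covered[b] = range(address // line_size,
                           (address + size[b] - 1) // line_size + 1)
        address += size[b]
    cache, per_block, total = {}, {b: 0 for b in order}, 0
    for b in sequence:
        missed = False
        for m in covered[b]:
            if cache.get(m % n_lines) != m:
                missed, cache[m % n_lines] = True, m
        if missed:
            total += 1
            per_block[b] += 1
    return total, per_block


def local_search(order, size, sequence, line_size, n_lines):
    """Repeatedly move the highest-scoring block to the first position that
    lowers the total number of misses, until no single move helps."""
    order = list(order)
    best, per_block = simulate(order, size, sequence, line_size, n_lines)
    improved = True
    while improved:
        improved = False
        # Try blocks in order of decreasing score.
        for block in sorted(order, key=lambda b: -per_block[b]):
            rest = [b for b in order if b != block]
            for p in range(len(order)):
                candidate = rest[:p] + [block] + rest[p:]
                if candidate == order:
                    continue
                total, counts = simulate(candidate, size, sequence,
                                         line_size, n_lines)
                if total < best:   # accept the first improving move
                    order, best, per_block = candidate, total, counts
                    improved = True
                    break
            if improved:
                break
    return order, best


size = {"a": 64, "b": 64, "c": 64, "d": 64}
print(local_search(list("abcd"), size, list("acacac"), 64, 2))
```

Every accepted move strictly lowers the total, so the loop terminates; the result is a local optimum with respect to single-block moves, not necessarily a global one.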
- Otherwise the flow-chart proceeds to a ninth step 38, in which a code module is generated, containing the instructions together with information where to place these instructions in main memory 10 according to the optimal selection.
- the code module is used to load the instructions into the main memory 10 according to the optimal selection, after which the processor 14 may start executing the program.
- the code module may be stored intermediately on a computer readable medium like a magnetic or optical disk from which it is loaded into the main memory.
- Alternatively, a less stringent criterion for exiting may be used, for example that no blocks are left untried that cause more than a certain number of cache misses.
- When the method is executed by a computer according to the flow-chart of figure 3, the greatest amount of computing time will be involved in computing the total number of cache misses for the sample of execution when the instructions are placed according to potential selections.
- this amount of computing time is minimized already by computing the number of cache misses for blocks and not for individual instructions, and this amount of computing time can be further minimized for example by only counting cache misses involving blocks that are executed more than once in the sample of execution.
- this amount of computing time is kept in check because the count determined in the second step 31 is used as a heuristic for selecting potential selections from selections that differ from the original selection only by the movement of a single block. This means that the original selection is replaced by a potential selection which reduces the number of cache misses without first computing the total number of cache misses for all possible potential selections. If a potential selection is found in which one selected block has been moved from its position in the original selection and which has a lower total number of cache misses than the original selection, this potential selection replaces the original selection. After that no further blocks are selected that have an equal or lower count as determined in the second step 31. The total number of cache misses does not need to be computed for those blocks and positions.
- this reduction in computation time can be achieved even if deviations from the flow-chart of figure 3 are made.
- the computation of the total number of cache misses for potential selections obtained using other selected blocks is avoided if a "better" potential selection than the original selection is found using the first selected block.
- the original selection is then replaced with a potential selection that has the least (or at most as low as any) number of cache misses of all the potential selections derived by moving respective ones of the several different selected blocks, provided that this number of cache misses is less than that of the original selection.
- the computation of the number of cache misses for potential selections obtained by moving the further blocks is avoided if a suitable potential selection is found using the several different selected blocks.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The number of cache misses is computed for the execution of the individual blocks of a number of blocks. This result is used in a local search heuristic that iteratively replaces an original selection by a selection that differs from the original selection only by the movement of a single block and that gives rise to fewer cache misses than the original selection for a sample of execution. In this way a selection of locations for placing the instructions of a program in memory is found that minimizes the number of cache misses occurring for a sample of a typical execution of the program.
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| DE69811159T DE69811159T2 (de) | 1997-02-27 | 1998-02-16 | Method for reducing the frequency of cache misses in a computer |
| JP10529257A JP2000509861A (ja) | 1997-02-27 | 1998-02-16 | Method for reducing the frequency of cache misses in a computer |
| EP98901461A EP0896701B1 (fr) | 1997-02-27 | 1998-02-16 | Method for reducing the frequency of cache misses |
| PCT/IB1998/000197 WO1998038579A1 (fr) | 1997-02-27 | 1998-02-16 | Method for reducing the frequency of cache misses |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP97200575.5 | 1997-02-27 | ||
| PCT/IB1998/000197 WO1998038579A1 (fr) | 1997-02-27 | 1998-02-16 | Method for reducing the frequency of cache misses |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO1998038579A1 | 1998-09-03 |
Family
ID=11004687
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB1998/000197 WO1998038579A1 (fr) | Method for reducing the frequency of cache misses | 1997-02-27 | 1998-02-16 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO1998038579A1 (fr) |
-
1998
- 1998-02-16 WO PCT/IB1998/000197 patent/WO1998038579A1/fr active IP Right Grant
Non-Patent Citations (2)
| Title |
|---|
| FRANK MUELLER, DAVID B. WHALLEY, "Fast Instruction Cache Analysis via Static Cache Simulation", PROCEEDINGS/ANNUAL SIMULATION SYMPOSIUM, April 1995, (Phoenix), pages 105-114. * |
| KARL PETTIS, ROBERT C. HANSEN, "Profile Guided Code Position", PROCEEDINGS OF THE ACM SIGPLAN'90 CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION WHITE PLAINS, New York, 20-22 June 1990, pages 16-27. * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US6453389B1 (en) | Optimizing computer performance by using data compression principles to minimize a loss function | |
| US6226715B1 (en) | Data processing circuit with cache memory and cache management unit for arranging selected storage location in the cache memory for reuse dependent on a position of particular address relative to current address | |
| US6813705B2 (en) | Memory disambiguation scheme for partially redundant load removal | |
| EP0449368B1 (fr) | Method for compiling computer instructions for increasing cache efficiency | |
| US6192450B1 (en) | Destage of data for write cache | |
| US5809566A (en) | Automatic cache prefetch timing with dynamic trigger migration | |
| JP3417984B2 (ja) | Compilation method for reducing cache conflicts | |
| US5963972A (en) | Memory architecture dependent program mapping | |
| Hsu et al. | On the minimization of loads/stores in local register allocation | |
| US6055621A (en) | Touch history table | |
| US5581721A (en) | Data processing unit which can access more registers than the registers indicated by the register fields in an instruction | |
| EP0373361B1 (fr) | Efficient code generator for a computer with dissimilar register spaces | |
| US6134633A (en) | Prefetch management in cache memory | |
| US7447845B2 (en) | Data processing system, processor and method of data processing in which local memory access requests are serviced by state machines with differing functionality | |
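| CN1476562A (zh) | Posted write-through cache for flash memory | |
| EP0412247A2 (fr) | Cache memory system | |
| EP0838755A2 (fr) | Binary program conversion apparatus and method | |
| CN1047245C (zh) | Method and system for enhanced instruction dispatch in a superscalar processor system utilizing independently accessed intermediate storage | |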
| US5339420A (en) | Partitioning case statements for optimal execution performance | |
| CN1359488A (zh) | Optimized execution of statically strongly predicted branch instructions | |
| US6301641B1 (en) | Method for reducing the frequency of cache misses in a computer | |
| KR100837479B1 (ko) | Cache memory and control method thereof | |
| US6651245B1 (en) | System and method for insertion of prefetch instructions by a compiler | |
| US7530063B2 (en) | Method and system for code modification based on cache structure | |
| EP0896701B1 (fr) | Method for reducing the frequency of cache misses |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): JP |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH DE DK ES FI FR GB GR IE IT LU MC NL PT SE |
| | WWE | Wipo information: entry into national phase | Ref document number: 1998901461; Country of ref document: EP |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | WWP | Wipo information: published in national office | Ref document number: 1998901461; Country of ref document: EP |
| | WWG | Wipo information: grant in national office | Ref document number: 1998901461; Country of ref document: EP |