WO2025128115A1 - Address space support for point-to-point cache coherency - Google Patents
- Publication number
- WO2025128115A1 (PCT/US2023/084209)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cores
- core
- address
- cache coherency
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
Definitions
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
- Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- Embodiment 1 is a system comprising: a plurality of cores, wherein each core is associated with a cache; a cache coherency subsystem comprising data processing apparatus configured to perform operations comprising: receiving an address from a first core of the plurality of cores; determining that a reserved field of the address specifies a pair of the cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the pair of cores of the plurality of cores.
- Embodiment 2 is the system of embodiment 1, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address does not specify a pair of cores; and in response, performing a full cache coherency process among the plurality of cores.
- Embodiment 3 is the system of embodiment 2, wherein the reduced cache coherency process checks cache coherency on fewer cores than the full cache coherency process.
- Embodiment 4 is the system of any one of embodiments 1-3, wherein one or more of the cores are configured to execute instructions to implement an operating system, and wherein the operating system is configured to perform operations comprising: maintaining a mapping between programs, cores that the programs execute on, and respective regions of memory that the programs share; and populating the reserved field of addresses with core identifiers whenever a physical address must be calculated for use by a program identified in the mapping as running on one or two cores.
- Embodiment 5 is the system of embodiment 4, wherein the operations further comprise: receiving a request to move a program from a first core and a second core to a third core and a fourth core; and recalculating the reserved field addresses with core identifiers associated with the third and fourth core.
- Embodiment 6 is the system of any one of embodiments 1-5, wherein the address is a physical address, and the reserved field of the address occupies fewer than all bits of the physical address.
- Embodiment 7 is the system of embodiment 4, wherein each program shares one or more regions of memory with one or more of each of the other programs.
- Embodiment 8 is the system of embodiment 4, wherein the cores in the plurality of cores are divided into pairs of cores, where each core shares one or more regions of memory with its paired core.
- Embodiment 9 is the system of embodiment 8, wherein the operations further comprise assigning memory sharing programs to each of the pairs of cores.
- Embodiment 10 is the system of any one of embodiments 1-9, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that a reserved field of the address specifies a pair of clusters of cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the cores in the pair of clusters of cores for the plurality of cores.
- Embodiment 11 is the system of any one of embodiments 1-10, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address specifies a single core; and in response, bypassing a cache coherency process for the plurality of cores.
- Embodiment 12 is a method comprising performing the operations of any one of embodiments 1 - 11.
- Embodiment 13 is a computer storage medium encoded with instructions that are operable, when executed by data processing apparatus, to cause the data processing apparatus to perform the operations of any one of embodiments 1-11.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
A system comprising: a plurality of cores, wherein each core is associated with a cache; a cache coherency subsystem comprising data processing apparatus configured to perform operations comprising: receiving an address from a first core of the plurality of cores; determining that a reserved field of the address specifies a pair of the cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the pair of cores of the plurality of cores.
Description
ADDRESS SPACE SUPPORT FOR POINT-TO-POINT CACHE COHERENCY
BACKGROUND
[1] This specification relates to systems having integrated circuit devices.
[2] A cache is a device that stores data retrieved from memory or data to be written to memory for one or more different hardware devices in a system. The hardware devices can be different components integrated into a system on a chip (SOC) or a system that includes cores on several different chips. In this specification, the devices that provide read requests and write requests through caches will be referred to as client devices.
[3] A multiprocessor system can have multiple processing devices that each have a separate cache. Such a system can therefore hold multiple copies of any shared data, i.e., one copy in a shared memory as well as one copy in each cache that stores it. In order to maintain cache coherency, when one copy of the data is changed, the other copies must also be changed or invalidated in the other caches that share the same data.
[4] A cache coherency system can keep data consistent among caches by communicating with various caches in the multiprocessor system that may have access to a particular piece of data when the data is updated. However, performing cache coherency processes can be expensive in multiple ways. For example, performing a cache coherency process can introduce more complex hardware, high latency, and wasted operations, e.g., when a cache does not have access to the particular piece of data that is the subject of the coherency process.
SUMMARY
[5] This specification describes techniques for modifying addresses in a way that triggers a cache coherency system to perform a reduced cache coherency process. For example, when data that needs to be updated is only shared among two cores in a plurality of cores, this allows the cache coherency system to communicate only with the caches associated with those two cores. Additionally, when the data that needs to be updated is only accessed by a single core, this allows the cache coherency system to communicate only with the cache associated with that core.
[6] According to a first aspect, there is provided a system comprising: a plurality of cores, wherein each core is associated with a cache; a cache coherency subsystem comprising data processing apparatus configured to perform operations comprising: receiving an address from a
first core of the plurality of cores; determining that a reserved field of the address specifies a pair of the cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the pair of cores of the plurality of cores.
[7] In some implementations, the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address does not specify a pair of cores; and in response, performing a full cache coherency process among the plurality of cores.
[8] In some implementations, the reduced cache coherency process checks cache coherency on fewer cores than the full cache coherency process.
[9] In some implementations, one or more of the cores are configured to execute instructions to implement an operating system, and the operating system is configured to perform operations comprising: maintaining a mapping between programs, cores that the programs execute on, and respective regions of memory that the programs share; and populating the reserved field of addresses with core identifiers whenever a physical address must be calculated for use by a program identified in the mapping as running on one or two cores.
[10] In some implementations, the operations further comprise: receiving a request to move a program from a first core and a second core to a third core and a fourth core; and recalculating the reserved field addresses with core identifiers associated with the third and fourth core.
[11] In some implementations, the address is a physical address, and the reserved field of the address occupies fewer than all bits of the physical address.
[12] In some implementations, each program shares one or more regions of memory with one or more of each of the other programs.
[13] In some implementations, the cores in the plurality of cores are divided into pairs of cores, where each core shares one or more regions of memory with its paired core.
[14] In some implementations, the operations further comprise assigning memory sharing programs to each of the pairs of cores.
[15] In some implementations, the operations further comprise: receiving a second address from one of the plurality of cores; determining that a reserved field of the address specifies a pair of clusters of cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the cores in the pair of clusters of cores for the plurality of cores.
[16] In some implementations, the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address specifies a single core; and in response, bypassing a cache coherency process for the plurality of cores.
[17] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Cache coherency can be a complex process that requires a large amount of circuitry and delay time. The cache coherency system can reduce coherency traffic for multiprocessor systems. When data is not shared among all cores in a multiprocessor system with multiple cores, the system can use an address that has a reserved field as an indication that the cache coherency system should perform only a reduced cache coherency process. The reduced cache coherency process checks cache coherency on fewer cores than a full cache coherency process does and thus is faster to perform, requires less complex operations, and generates less inter-core communications traffic.
[18] For example, in order to specify which cores to check cache coherency for during a cache coherency process, conventional cache coherency systems can broadcast each data access to all caches. This approach has scalability issues because it has N^2 complexity. The techniques described in this specification also avoid the problems of another conventional solution, which is to maintain a directory that stores the caches that each element of data is stored in. This can become very complex, can require additional silicon area, and can suffer from delays. In contrast, the cache coherency system described in this specification uses a reserved space in an address to map regions of memory to specify the cores for which a reduced cache coherency process should be performed.
[19] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[20] FIG. 1 is a diagram of an example system.
[21] FIG. 2 is a diagram illustrating an example address that has a reserved field that specifies a pair of cores.
[22] FIG. 3 is a flow chart illustrating an example process for determining a type of cache coherency process to perform.
[23] FIG. 4 is a flow chart illustrating an example process for populating an address based on a memory request.
[24] Like reference numbers and designations in the various drawings indicate like elements.
DETAILED DESCRIPTION
[25] FIG. 1 is a diagram of an example system 100. The system 100 includes a system on a chip (SOC) 102 communicatively coupled to a memory device 118. The SOC 102 has multiple client devices 110a-n that each have an associated local cache 112a-n, and a cache coherency subsystem 114. Each local cache 112a-n services memory requests from its associated client device 110a-n. The cache coherency subsystem 114 is a communication subsystem that performs a cache coherency process whenever data in any local cache 112a-n is updated or data is requested from a cache. The techniques described in this specification can also be used for systems having additional layers of caches.
[26] In order to improve the performance of maintaining coherent caches, the system 100 can modify some addresses in a way that triggers the cache coherency subsystem 114 to perform a reduced cache coherency process. For example, the cache coherency subsystem 114 can use a reserved space in an address to identify the client devices for which a reduced cache coherency process should be performed. For example, when data that needs to be updated is only shared among two client devices out of the multiple client devices 110a-n, the system can indicate this by populating the reserved field of addresses used by the client devices, which signals to the cache coherency system to only communicate with the caches associated with those two client devices when performing cache coherency maintenance. Additionally, when the data that needs to be updated is only accessed by a single client device, this allows the cache coherency system to only communicate with the cache associated with that client device or bypass communicating with any cache. These mechanisms improve performance of the system because the reduced cache coherency process is far less expensive in time and resources than a full cache coherency process that considers a larger subset of caches or all caches in the system.
[27] The SOC 102 is an example of a device that can be installed on or integrated into any appropriate computing device, which may be referred to as a host device. For example, the SOC
102 can be installed on a mobile host device, e.g., a smart phone, a smart watch or another wearable computing device, a tablet computer, or a laptop computer, to name just a few examples.
[28] The SOC 102 has multiple client devices 110a-n. Each of the client devices 110a-n can have one or more cores that implement any appropriate module, device, or functional component that is configured to read and store data in the memory device. For example, a client device can be a CPU, a DMA controller, or lower-level components of the SOC itself.
[29] The system 100 can include an operating system that can be configured to modify addresses in a way that indicates that one or more client devices share a region of memory. The client devices 110a-n can be configured to execute instructions to implement an operating system. The operating system can be configured to maintain mappings between programs or client devices 110a-n that share respective regions of memory, e.g., one mapping, for each region of memory, between that region and the one or more client devices that can access it. The shared regions of memory can, for example, be distinguished from one another using a memory region identifier. A memory region identifier can be an abstraction for contiguous data that can be mapped into a region of an address space. A memory region identifier can represent data in a memory storage device, such as the memory storage device 118.
[30] The operating system can be configured to populate the page table with addresses having reserved fields set with core identifiers. A core identifier can be a value to identify a set of one or more devices that use a particular cache. For example, each client device 110a-n can have multiple computing cores. But if a particular client device uses only one local cache, a single core identifier can be used for that client device. Alternatively or in addition, each client device can have a single computing core, in which case each client device can have a separate and distinct core identifier. In the case where a client device has multiple cores that use multiple caches respectively, a single client device can be associated with multiple core identifiers to distinguish between cores that use the multiple caches respectively. A reserved field of addresses can be allocated within an otherwise unused bit field within the address.
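As an illustrative sketch only, the record below shows one way an operating system might represent the mapping between a program, the cores it runs on, and the memory region identifiers it shares. The struct name, field names, and fixed array sizes are assumptions for illustration, not structures recited in this specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-program record: which cores the program runs on
 * and which shared memory regions (by memory region identifier) it
 * uses. */
struct program_mapping {
    int      program_id;
    int      num_cores;      /* number of cores the program runs on */
    uint8_t  core_ids[2];    /* valid only when num_cores is 1 or 2 */
    uint32_t region_ids[4];  /* memory region identifiers it shares */
    int      num_regions;
};

/* A program's physical addresses qualify for reserved-field core
 * identifiers only when the mapping shows it on one or two cores. */
static bool qualifies_for_reduced_coherency(const struct program_mapping *m)
{
    return m->num_cores == 1 || m->num_cores == 2;
}
```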
[31] The cache coherency subsystem 114 is a communications subsystem of the SOC 102 that ensures that all data stored in the local caches 112a-n remains coherent. In other words, the cache coherency subsystem 114 strives to give each cache a same view of values in memory 118. The values in the caches can be different from corresponding values in memory, e.g., in the case of a cached write not yet being flushed, but the caches themselves should have the same view of such
unflushed values. Whenever data is updated in one local cache 112a-n, the cache coherency subsystem can perform a cache coherency process to ensure that any other cache with access to that data does not contain a differing version of the data. The cache coherency subsystem 114 includes communications pathways that allow the client devices 110a-n to communicate with one another as well as to make requests to read and write data using the memory device 140. The cache coherency subsystem 114 can include any appropriate combination of communications hardware, e.g., buses or dedicated interconnect circuitry.
[32] As described above, programs running on the client devices 110a-n, or the client devices 110a-n themselves, can share regions of memory. For example, two processes running on different cores can read from and write to a shared region of memory. The cache coherency subsystem 114 can be configured to receive an address from a first client device and determine if a reserved field of the address specifies a pair, or a subset, of client devices that share a region of memory not shared with any other client devices. If the reserved field of the address specifies a pair of client devices, the cache coherency subsystem 114 can perform a reduced cache coherency process between the pair of client devices without needing to communicate with any other client devices in the SOC 102. The reduced cache coherency process checks cache coherency on fewer client devices than a full cache coherency process that would check cache coherency in all local caches 112a-n.
[33] When the operating system has populated a reserved address field to specify a pair of client devices, the cache coherency subsystem 114 can perform a reduced cache coherency process that checks cache coherency for only the pair of client devices. The reserved field can also specify one client device. When the reserved field specifies one client device, the cache coherency subsystem bypasses performing a cache coherency process.
[34] When the reserved field does not specify a pair of client devices or a single client device, the cache coherency subsystem 114 can perform a full cache coherency process that checks cache coherency for all client devices.
[35] The addresses can be physical addresses and the reserved field of the address can occupy fewer than all bits of the physical address. For example, if the reserved field is unpopulated, the address can read 0x000 ppppp. Using the same format, the reserved field can identify a client device labelled j and a client device labelled k by populating the bits of the address as 0x0jk0 ppppp. When the reserved field identifies a single client device, for example a client device j, the address reads 0x0jj0 ppppp.
[36] The reserved field can also identify clusters of client devices. In some examples, the reserved field of the address specifies a pair of clusters of client devices of the plurality of cores that share a region of memory. The cache coherency subsystem 114 can perform a reduced cache coherency process between the client devices in the pair of clusters of cores, without checking cache coherency for client devices that are not a part of either cluster.
[37] The caches 112a-n are positioned in the data pathway between the client devices 110a-n and the memory controller 130. The memory controller 130 can handle requests to and from the memory device 140.
[38] FIG. 2 is a diagram illustrating an example physical address 200 that has a reserved field that specifies a pair of cores. The address is part of an address space 202 for a 64-bit system. The address space 202 includes a reserved field for a first core identifier 204 and a second core identifier 206 that share a region of memory, as well as a field for a physical page 208 and offset bits 210.
[39] The field for the physical page 208 identifies a page that is shared among only two cores and the reserved field for the first core identifier 204 can identify one of the cores while the second core identifier 206 can identify the second core. The reserved field of the address can be fewer than all bits of the physical address. For example, if the reserved field is unpopulated, the address can read 0x000 ppppp. Using the same format, the reserved field can identify a core labeled j and a core labeled k by populating the bits of the address as 0x0jk0 ppppp, where the first core identifier 204 is populated by j and the second core identifier 206 is populated by k. When a page is only shared with one core, for example a core j, the address reads 0x0jj0 ppppp and both core identifiers are populated by j.
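As a concrete sketch of this encoding, the helpers below pack and unpack one-nibble core identifiers. The bit positions (CORE_J_SHIFT, CORE_K_SHIFT) and helper names are assumptions chosen to mirror the 0x0jk0 ppppp example, not values taken from FIG. 2, and identifier 0 is assumed to mean "unpopulated" so that real cores are numbered from 1.

```c
#include <stdint.h>

/* Assumed nibble positions for the two core identifiers; a real SOC
 * would fix these in hardware. */
#define CORE_J_SHIFT 28
#define CORE_K_SHIFT 24
#define CORE_ID_MASK 0xFULL

/* Stamp core identifiers j and k into the reserved field. */
static uint64_t set_reserved_field(uint64_t paddr, unsigned j, unsigned k)
{
    paddr &= ~((CORE_ID_MASK << CORE_J_SHIFT) | (CORE_ID_MASK << CORE_K_SHIFT));
    paddr |= (uint64_t)(j & CORE_ID_MASK) << CORE_J_SHIFT;
    paddr |= (uint64_t)(k & CORE_ID_MASK) << CORE_K_SHIFT;
    return paddr;
}

/* Read the identifiers back out of an address. */
static unsigned get_core_j(uint64_t paddr)
{
    return (unsigned)((paddr >> CORE_J_SHIFT) & CORE_ID_MASK);
}

static unsigned get_core_k(uint64_t paddr)
{
    return (unsigned)((paddr >> CORE_K_SHIFT) & CORE_ID_MASK);
}
```

With this layout, a page private to core j is stamped with j in both fields, and an address whose reserved nibbles are both zero falls through to the full coherency process.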
[40] FIG. 3 is a flow chart illustrating an example process for determining a type of cache coherency process to perform. The example process 300 can be performed by the components of the SOC 102, specifically the cache coherency subsystem 114.
[41] The cache coherency subsystem 114 receives an address from a first core of a plurality of cores (Step 310). The cores can be cores of a multiprocessor system that each have an associated local cache.
[42] The cache coherency subsystem 114 determines if a reserved field of the address specifies a pair of the cores of the plurality of cores that share a region of memory (Step 320). The address can be a physical address where the reserved field of the address occupies fewer than all bits of the physical address, e.g., the example address 200 of FIG. 2.
[43] If the reserved field of the address specifies a pair of cores that share a region of memory, the cache coherency subsystem 114 performs a reduced cache coherency process between the pair of cores (Step 330). The reduced cache coherency process checks cache coherency on fewer cores than the full cache coherency process, e.g., it checks only the two cores that are in the pair of cores.
[44] If the reserved field of the address does not specify a pair of cores that share a region of memory, the cache coherency subsystem 114 performs a full cache coherency process that checks cache coherency on all cores in the plurality of cores (Step 340).
[45] The reserved field of the address can also specify a single core. In some implementations, the cache coherency subsystem can determine that the reserved field of the address specifies a single core and bypass performing a cache coherency process, as the data does not need to be updated for any other core.
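Putting the three branches together, the decision logic of FIG. 3 might be dispatched as in the sketch below, which reuses get_core_j and get_core_k from the earlier encoding sketch; the enum names and the zero-means-unpopulated convention are assumptions.

```c
#include <stdint.h>

enum coherency_action {
    COHERENCY_FULL,     /* check every core's cache */
    COHERENCY_REDUCED,  /* check only the pair named in the address */
    COHERENCY_BYPASS    /* single core: no other cache can differ */
};

/* Decide which coherency process to run for an incoming physical
 * address, writing the sharer identifiers to *a and *b as needed. */
static enum coherency_action choose_coherency(uint64_t paddr,
                                              unsigned *a, unsigned *b)
{
    unsigned j = get_core_j(paddr);   /* from the earlier sketch */
    unsigned k = get_core_k(paddr);

    if (j == 0 && k == 0)
        return COHERENCY_FULL;        /* reserved field unpopulated */
    if (j == k) {
        *a = j;                       /* page private to one core */
        return COHERENCY_BYPASS;
    }
    *a = j;                           /* exactly two sharers */
    *b = k;
    return COHERENCY_REDUCED;
}
```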
[46] The operating system can designate which cores share regions of memory. Each program or core can share one or more regions of memory with one or more other programs or cores. In some implementations, each core can share a memory region identifier with a subset of the other cores in the system. An operating system can choose which cores to assign shared memory depending on which tasks are assigned to specific cores, e.g., based on the needs of a task.
[47] Alternatively, the operating system can divide the cores in the plurality of cores into pairs of cores. In some implementations, each core only shares one or more regions of memory with its paired core.
[48] The reserved field of the address can also specify clusters of cores. In some implementations, the reserved field of the address can specify a pair of clusters of cores of the plurality of cores that share a region of memory. For example, a cluster M can represent cores j and k while a cluster N can represent cores h and z. The cache coherency subsystem 114 can perform a reduced cache coherency process between the cores in the pair of clusters of cores, e.g., the cache coherency subsystem can check cache coherency for cores j, k, h, and z.
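When the identifiers name clusters rather than individual cores, the subsystem can expand each cluster into its member cores before running the reduced process. The sketch below assumes fixed two-core clusters and a made-up membership table; cluster identifier 0 is again kept as the unpopulated sentinel.

```c
#define CORES_PER_CLUSTER 2

/* Hypothetical membership table, indexed by cluster identifier - 1. */
static const unsigned cluster_members[][CORES_PER_CLUSTER] = {
    { 1, 2 },   /* cluster M: e.g., cores j and k */
    { 3, 4 },   /* cluster N: e.g., cores h and z */
};

/* Collect every core that a reduced process over a pair of clusters
 * must check; returns the number of cores written to out. */
static int cores_to_check(unsigned cl_a, unsigned cl_b, unsigned out[])
{
    int n = 0;
    for (int i = 0; i < CORES_PER_CLUSTER; i++)
        out[n++] = cluster_members[cl_a - 1][i];
    for (int i = 0; i < CORES_PER_CLUSTER; i++)
        out[n++] = cluster_members[cl_b - 1][i];
    return n;   /* four cores for a pair of two-core clusters */
}
```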
[49] FIG. 4 is a flow chart illustrating an example process for populating the bit fields of an address based on a configuration of shared memory between programs running on cores. The example process 400 can be performed by an operating system.
[50] The system maintains a mapping between programs that share regions of memory (step 410). Programs can share regions of memory for a variety of reasons. For example, a multiprocess application can have multiple processes that read from and write to the same region of memory. As another example, producer and consumer processes can use the shared region of memory as a buffer between the two processes.
[51] The shared region of memory can, for example, be identified by a memory region identifier. A memory region identifier is an abstraction for one or more regions of memory that are mapped into a region of an address space. A memory region identifier can represent portions of a memory storage device, such as the memory storage device 118.
[52] The system receives a request to update the page table for a program (Step 420). A page table update can happen for a variety of reasons, e.g., due to allocating memory to a program for the first time or due to encountering a page fault.
[53] Typically, updating the page table requires physical addresses to be calculated, implicitly or explicitly, from virtual addresses belonging to the program. The mapping between physical and virtual addresses or pages can then be added to a page table. The operating system uses a page table to store mappings of virtual addresses to physical addresses, where each mapping is a page table entry.
[54] The system determines whether the program is identified as running on one or two cores in the mapping of programs that share regions of memory (step 430). If so, the system can modify the physical addresses in the page table by writing the core identifiers into the reserved spaces of the physical addresses (branch to step 440). This modification will then cause the cache coherency system to perform a reduced cache coherency process whenever those physical addresses are encountered during execution. The system can make similar modifications to other architectural structures that maintain address translations, e.g., in one or more translation lookaside buffers.
[55] In some examples, the operating system may choose to maintain a mapping of pairs of cores rather than a mapping of programs. Each core can share one or more regions of memory with its paired core. The system can assign memory sharing programs to each of the pairs of cores. In other examples, the operating system can choose to track and map shared memory regions individually. Each core in the plurality of cores can share one or more regions of memory with each of the other cores in the plurality of cores.
[56] A program can use one or more memory region identifiers that are shared between more than two cores, e.g., between five cores, seven cores, and so on. When a program uses a memory region identifier that is shared with more than a designated number (e.g., two) of cores, the system does not modify the physical address and the cache coherency subsystem performs a full cache coherency process. At the same time, a program can also use one or more memory region identifiers that are shared between only one core or a pair of cores. When a program uses a memory region identifier that is shared with at most the designated number (e.g., two) of cores, the system does modify the physical address and the cache coherency subsystem performs a reduced cache coherency process.
[57] If the program is not identified in the mapping as running on one or two cores, the system can leave the reserved field of addresses unaltered (branch to step 450).
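A minimal sketch of the decision in steps 430 through 450, assuming a 64-bit physical address; lookup_cores_for_program and tag_reserved_field are hypothetical helper names invented for this sketch, standing in for whatever mapping lookup and bit encoding a platform actually uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, assumed to exist elsewhere in the kernel. */
bool lookup_cores_for_program(int program_id, int core_ids[2], int *num_cores);
uint64_t tag_reserved_field(uint64_t paddr, const int core_ids[2], int num_cores);

/* Step 430: is the program known to run on one or two cores?
 * Step 440: if so, write core identifiers into the reserved field.
 * Step 450: otherwise, leave the physical address unaltered. */
uint64_t maybe_tag_physical_address(uint64_t paddr, int program_id)
{
    int core_ids[2];
    int num_cores;

    if (lookup_cores_for_program(program_id, core_ids, &num_cores) &&
        num_cores <= 2) {
        return tag_reserved_field(paddr, core_ids, num_cores); /* step 440 */
    }
    return paddr;                                              /* step 450 */
}
```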
[58] When generating a physical address from a virtual address, the system can determine whether the virtual address is part of a memory region identifier that is known to be accessible to only one core or only two cores (or groups of cores). If so, the operating system can write core identifiers in a reserved field of the physical address in the page table. For example, the reserved field can identify a core labeled j and a core labeled k by populating the bits of the physical address as 0x0jk0 ppppp in the page table. Other bit encodings are possible, some of which may reduce the number of bits required for the encoding, perhaps with cooperation from hardware. For example, as few as one bit may be used if the hardware applies the reduced coherency protocol only between pairs of cores whose numbers differ only in the least significant bit, and the operating system sets that bit only for programs that use memory exclusively on those pairs of cores.
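The nibble layout of the 0x0jk0 ppppp example, and the single-bit variant, might look as follows in C. The bit positions and the choice of bit 35 for the pair hint are assumptions of this sketch, chosen only to match the illustration above.

```c
#include <stdint.h>

#define J_SHIFT 28
#define K_SHIFT 24
#define NIBBLE  0xFULL

/* Write core labels j and k into the reserved nibbles of a physical
 * address, producing the 0x0jk0 ppppp pattern described above. */
static inline uint64_t encode_pair(uint64_t paddr, unsigned j, unsigned k)
{
    paddr &= ~((NIBBLE << J_SHIFT) | (NIBBLE << K_SHIFT)); /* clear old field */
    return paddr | ((uint64_t)(j & 0xF) << J_SHIFT)
                 | ((uint64_t)(k & 0xF) << K_SHIFT);
}

/* Single-bit variant: hardware applies the reduced protocol only between
 * cores 2n and 2n+1, so one reserved bit (bit 35 here, by assumption) is
 * enough to request it. */
#define PAIR_HINT (1ULL << 35)

static inline uint64_t set_pair_hint(uint64_t paddr) { return paddr | PAIR_HINT; }
```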
[59] After the page table is populated with page table entries that map virtual addresses to physical addresses, including physical addresses with core identifiers, the system can receive a request for a core to access a specific address of memory. This can trigger a cache coherency process. When memory in a page that has core identifiers in the reserved field of the physical address is accessed, the cache coherency hardware will use a reduced cache coherency process.
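On the receiving side, the cache coherency hardware's choice might be modeled as below. The sketch assumes the nibble layout above and treats an all-zero field as "no hint", which in turn assumes the label 0 is not used for a real core; neither assumption is mandated by the specification.

```c
#include <stdint.h>

enum coherency_action { FULL_COHERENCY, REDUCED_PAIR, SINGLE_CORE };

/* Classify an incoming address by its reserved field (illustrative). */
static enum coherency_action classify(uint64_t paddr, unsigned *j, unsigned *k)
{
    *j = (unsigned)((paddr >> 28) & 0xF);
    *k = (unsigned)((paddr >> 24) & 0xF);

    if (*j == 0 && *k == 0)
        return FULL_COHERENCY; /* no hint: check every core's cache */
    if (*k == 0 || *j == *k)
        return SINGLE_CORE;    /* one sharer: coherency can be bypassed */
    return REDUCED_PAIR;       /* two sharers: point-to-point check only */
}
```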
[60] The operating system can move a program from one core to another. When the operating system moves a program to another core, it can recalculate the relevant physical addresses in the page table. For example, the operating system can move a program that runs on a first and second core to a second and third core instead. In some implementations, the system can populate the reserved field of the physical address with respective identifiers of the second core and the third core. For example, the third core can be a core labeled h. The first and second cores can be labeled j and k, respectively. The system can change the address from 0x0jk0 ppppp to 0x0hk0 ppppp.
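The recalculation after migration might be a single pass over the affected addresses; encode_pair() here is the helper from the earlier sketch, and the iteration over entries is schematic (TLB shootdown and locking are omitted).

```c
#include <stddef.h>
#include <stdint.h>

/* After moving a program from cores (j, k) to cores (h, k), rewrite the
 * reserved field of each affected physical address, e.g., 0x0jk0 ppppp
 * becomes 0x0hk0 ppppp (encode_pair() as sketched above). */
void retag_after_migration(uint64_t *phys_addrs, size_t n,
                           unsigned new_j, unsigned new_k)
{
    for (size_t i = 0; i < n; i++)
        phys_addrs[i] = encode_pair(phys_addrs[i], new_j, new_k);
}
```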
[61] The system can use an N-to-N crossbar (e.g., a butterfly circuit) to transfer cache data between arbitrary pairs of cores. When using an N-to-N crossbar, some regions of memory can be shared between cores 4 and 5, and other regions can be shared between cores 4 and 17. Each page in a memory region has its own set of j and k bits, so an N-to-N crossbar would allow process A to share one memory region identifier with process B, and process B to share another memory region identifier with process C, and the crossbar would support an A-B link and a B-C link as needed.
[62] In some examples, there can be point-to-point links between cores 0 and 1, 1 and 2, 2 and 3, 3 and 4... which could support data pipelining such as might happen in a media-heavy product. Instead of using an N-to-N crossbar for the point-to-point traffic between different cores, the system 100 can use simpler point-to-point connections between cores 0 and 1, 2 and 3, etc., and the operating system can assign memory-sharing programs to those appropriately-paired cores.
[63] In computer systems where most regions of memory are shared by programs running on more than two cores, there may not be enough bits to encode the cores that share the regions. If this is anticipated, the system can link small clusters of cores by relatively simple (e.g., snoop-based) coherency circuitry, and the j and k identifiers would then refer to a cluster number rather than an individual core number. Clustering the cores provides more CPU power for each set of programs accessing a piece of memory.
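With fixed-size clusters, mapping a core number to the cluster identifier that goes into the j and k fields can be a simple shift; the cluster size of four below is an assumption of this sketch, and irregular clusters would need a lookup table instead.

```c
/* Four cores per cluster (assumed): cores 4-7 form cluster 1, so a region
 * shared by cores 5 and 6 is tagged with j = k = 1 and kept coherent by
 * the snoop circuitry inside that cluster. */
static inline unsigned cluster_of(unsigned core_id)
{
    return core_id >> 2; /* log2(4) = 2 bits of core number per cluster */
}
```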
[64] This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
[65] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification
and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
[66] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[67] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
[68] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
[69] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
[70] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
[71] In addition to the embodiments described above, the following embodiments are also innovative:
Embodiment 1 is a system comprising: a plurality of cores, wherein each core is associated with a cache; a cache coherency subsystem comprising data processing apparatus configured to perform operations comprising: receiving an address from a first core of the plurality of cores; determining that a reserved field of the address specifies a pair of the cores of the
plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the pair of cores of the plurality of cores.
Embodiment 2 is the system of embodiment 1, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address does not specify a pair of cores; and in response, performing a full cache coherency process among the plurality of cores.
Embodiment 3 is the system of embodiment 2, wherein the reduced cache coherency process checks cache coherency on fewer cores than the full cache coherency process.
Embodiment 4 is the system of any one of embodiments 1-3, wherein one or more of the cores are configured to execute instructions to implement an operating system, and wherein the operating system is configured to perform operations comprising: maintaining a mapping between programs, cores that the programs execute on, and respective regions of memory that the programs share; and populating the reserved field of addresses with core identifiers whenever a physical address must be calculated for use by a program identified in the mapping as running on one or two cores.
Embodiment 5 is the system of embodiment 4, wherein the operations further comprise: receiving a request to move a program from a first core and a second core to a third core and a fourth core; and recalculating the reserved field of addresses with core identifiers associated with the third and fourth cores.
Embodiment 6 is the system of any one of embodiments 1-5, wherein the address is a physical address, and the reserved field of the address occupies fewer than all bits of the physical address.
Embodiment 7 is the system of embodiment 4, wherein each program shares one or more regions of memory with one or more of the other programs.
Embodiment 8 is the system of embodiment 4, wherein the cores in the plurality of cores are divided into pairs of cores, where each core shares one or more regions of memory with its paired core.
Embodiment 9 is the system of embodiment 8, wherein the operations further comprise assigning memory sharing programs to each of the pairs of cores.
Embodiment 10 is the system of any one of embodiments 1-9, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that a reserved field of the address specifies a pair of clusters of cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the cores in the pair of clusters of cores of the plurality of cores.
Embodiment 11 is the system of any one of embodiments 1-10, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address specifies a single core; and in response, bypassing a cache coherency process for the plurality of cores.
Embodiment 12 is a method comprising performing the operations of any one of embodiments 1 - 11.
Embodiment 13 is a computer storage medium encoded with instructions that are operable, when executed by data processing apparatus, to cause the data processing apparatus to perform the operations of any one of embodiments 1-11.
[72] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the
combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
[73] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[74] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.
[75] What is claimed is:
1. A system comprising: a plurality of cores, wherein each core is associated with a cache; a cache coherency subsystem comprising data processing apparatus configured to perform operations comprising: receiving an address from a first core of the plurality of cores; determining that a reserved field of the address specifies a pair of the cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the pair of cores of the plurality of cores.
2. The system of claim 1, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address does not specify a pair of cores; and in response, performing a full cache coherency process among the plurality of cores.
3. The system of claim 2, wherein the reduced cache coherency process checks cache coherency on fewer cores than the full cache coherency process.
4. The system of any preceding claim, wherein one or more of the cores are configured to execute instructions to implement an operating system, and wherein the operating system is configured to perform operations comprising: maintaining a mapping between programs, cores that the programs execute on, and respective regions of memory that the programs share; and populating the reserved field of addresses with core identifiers whenever a physical address must be calculated for use by a program identified in the mapping as running on one or two cores.
5. The system of claim 4, wherein the operations further comprise: receiving a request to move a program from a first core and a second core to a third core and a fourth core; and recalculating the reserved field of addresses with core identifiers associated with the third and fourth cores.
6. The system of any preceding claim, wherein the address is a physical address, and the reserved field of the address occupies fewer than all bits of the physical address.
7. The system of claim 4, wherein each program shares one or more regions of memory with one or more of the other programs.
8. The system of claim 4, wherein the cores in the plurality of cores are divided into pairs of cores, where each core shares one or more regions of memory with its paired core.
9. The system of claim 8, wherein the operations further comprise assigning memory sharing programs to each of the pairs of cores.
10. The system of any preceding claim, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that a reserved field of the address specifies a pair of clusters of cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the cores in the pair of clusters of cores of the plurality of cores.
11. The system of any preceding claim, wherein the operations further comprise: receiving a second address from one of the plurality of cores; determining that the reserved field of the address specifies a single core; and in response, bypassing a cache coherency process for the plurality of cores.
12. A method performed by a system comprising a plurality of cores, wherein each core is associated with a cache, the method comprising: receiving an address from a first core of the plurality of cores; determining that a reserved field of the address specifies a pair of the cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the pair of cores of the plurality of cores.
13. The method of claim 12, further comprising: receiving a second address from one of the plurality of cores; determining that the reserved field of the address does not specify a pair of cores; and in response, performing a full cache coherency process among the plurality of cores.
14. The method of claim 13, wherein the reduced cache coherency process checks cache coherency on fewer cores than the full cache coherency process.
15. The method of any one of claims 12-14, wherein one or more of the cores are configured to execute instructions to implement an operating system, and further comprising performing, by the operating system, operations comprising: maintaining a mapping between programs, cores that the programs execute on, and respective regions of memory that the programs share; and populating the reserved field of addresses with core identifiers whenever a physical address must be calculated for use by a program identified in the mapping as running on one or two cores.
16. The method of claim 15, further comprising: receiving a request to move a program from a first core and a second core to a third core and a fourth core; and recalculating the reserved field of addresses with core identifiers associated with the third and fourth cores.
17. The method of any one of claims 12-16, wherein the address is a physical address, and the reserved field of the address occupies fewer than all bits of the physical address.
18. The method of claim 15, wherein each program shares one or more regions of memory with one or more of the other programs.
19. The method of claim 15, wherein the cores in the plurality of cores are divided into pairs of cores, where each core shares one or more regions of memory with its paired core.
20. The method of claim 19, further comprising assigning memory sharing programs to each of the pairs of cores.
21. The method of any one of claims 12-20, further comprising: receiving a second address from one of the plurality of cores; determining that a reserved field of the address specifies a pair of clusters of cores of the plurality of cores that share a region of memory; and in response, performing a reduced cache coherency process between the cores in the pair of clusters of cores of the plurality of cores.
22. The method of any one of claims 12-21, further comprising: receiving a second address from one of the plurality of cores; determining that the reserved field of the address specifies a single core; and in response, bypassing a cache coherency process for the plurality of cores.