
WO2024045585A1 - Method for dynamically sharing storage space in parallel processor, and corresponding processor - Google Patents


Info

Publication number
WO2024045585A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
cache
tag
processor
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2023/083990
Other languages
French (fr)
Chinese (zh)
Inventor
苏叶华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Denglin Technologies Co Ltd
Shanghai Denglin Technology Co Ltd
Original Assignee
Beijing Denglin Technologies Co Ltd
Shanghai Denglin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Denglin Technologies Co Ltd and Shanghai Denglin Technology Co Ltd
Publication of WO2024045585A1 publication Critical patent/WO2024045585A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • G06F12/0884Parallel mode, e.g. in parallel with main memory or CPU
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox

Definitions

  • the present application relates to storage space management technology in high-performance parallel processors, and in particular to a method and system for sharing storage space between the processor's local memory and cache memory.
  • High-performance parallel processors, such as processors for general-purpose computing on graphics processing units (GPGPU), generally support both local memory and global memory programming models, which provides users with flexibility, allowing them to choose local memory or global memory according to the actual application.
  • Local memory and global memory have their own advantages and disadvantages: local memory is generally closer to the computing core and faster to access, but its capacity is smaller; global memory has a large capacity, but its access speed and latency are comparatively poor.
  • To address the access latency of global memory, GPGPUs use caching to relieve the global memory access bottleneck: a fast but small-capacity cache memory (Cache, also simply called a cache) is set up between the processor and global memory as a cache for global memory data, for example an L1 cache.
  • To further improve performance, a multi-level cache can also be introduced; for example, an L2 cache or even an L3 cache can be added behind the L1 cache to reduce access latency.
  • each computing core often contains a local memory and an L1 cache.
  • The inventor found through research that when users write applications on GPGPU processors in programming languages such as CUDA and OpenCL, they choose a programming model dominated by either local memory or global memory according to actual application needs. In other words, a user-written application does not require large capacity from both the local memory and the L1 cache at the same time. The inventor therefore designed a solution for dynamically sharing storage space in a parallel processor, so that the local memory and the L1 cache can share storage space and the user can dynamically allocate the size of the local memory or the L1 cache according to the actual application. Such a solution allows the same random access memory (RAM) space to be time-shared by the local memory and the L1 cache, thereby reducing chip area and hardware cost. Because the storage space of the local memory or the L1 cache usually accounts for about 80% of the area cost, a dynamically shared storage space solution can save approximately 40% of the area cost.
  • A method for dynamically sharing storage space in a parallel processor includes: the memory access control unit of the processor, according to received settings for the local memory size and the cache memory size, respectively updating the starting positions of the local memory and the cache memory in the memory of the processor, wherein one part of the storage space of the memory is used as the local memory and the other part is used as the data storage unit of the cache memory;
  • the memory access control unit of the processor updating the setting of the size of the index field in the cache memory's access address according to the received setting of the cache memory size; and
  • the cache memory, according to the cache memory size provided by the memory access control unit and its starting position in the processor's memory, determining the new data storage location corresponding to each cache block in the memory, and establishing a mapping between each tag storage location and the new data storage location of each cache block.
  • In such an embodiment, the local memory and the L1 cache in the processor share the same memory, but the storage space occupied by each is not fixed and can change dynamically with the configuration provided by the user.
  • This allows users, when writing applications on GPGPU processors in programming languages such as CUDA and OpenCL, to re-adjust the storage space sizes of the local memory and L1 cache in the processor according to the chosen local-memory-based or global-memory-based programming model, to better improve the processor's execution performance for the application. If the user currently chooses local-memory-based programming, the storage space of the local memory in the processor can be expanded appropriately; conversely, if the user currently chooses global-memory-based programming, the storage space of the L1 cache can be expanded appropriately. This not only improves the processor's execution performance for applications, but also does not increase chip area or hardware cost.
  • The method may further include: the memory access control unit of the processor, according to the received settings for the local memory size and cache memory size, first partitioning a storage space of the corresponding size as local memory starting from a preset address in the processor's memory, and then allocating storage space for the cache memory, wherein the starting position of the local memory is the preset address.
  • In this way, the starting position of the local memory does not need to change each time the space sizes are adjusted; only the new starting position of the cache memory in the processor's memory needs to be determined from the newly set local memory size. This not only simplifies storage space management, but also ensures that when the local memory size is updated, the data in the part of the local memory shared before and after the update is not lost.
  • The cache may be a set-associative cache, in which the size of the index field in the cache's access address is set according to the number of groups, obtained by dividing the cache size by the cache block size preset for the cache and by the number of tags contained in each group.
  • The method may further include, in response to receiving a memory access request for the local memory, the memory access control unit of the processor using the updated starting position of the local memory to locate the data to be accessed by the memory access request.
  • The method may further include: in response to receiving a memory access request for the global memory, the memory access control unit of the processor mapping the address in the memory access request into a memory access address of the cache memory and sending it to the cache memory, wherein the memory access address adopts the updated index field size; and the cache memory, in response to receiving the memory access request from the memory access control unit, locating the data to be accessed by the memory access request according to the established mapping between tag storage locations and new data storage locations.
  • The cache memory's locating, in response to receiving a memory access request from the memory access control unit, of the data to be accessed according to the established mapping between tag storage locations and new data storage locations can include:
  • in the case of a hit, determining the cache block corresponding to the hit tag based on the established mapping between tag storage locations and new data storage locations, and extracting the data to be accessed from that cache block as the response to the memory access request;
  • in the case of a miss, setting the tag binding bit of the cache block originally corresponding to the allocated tag storage location to indicate that it is not bound to a tag, then establishing a mapping relationship between that tag storage location and the cache block allocated to the memory access request, and setting the tag binding bit of the cache block allocated to the memory access request to indicate that it is bound to a tag;
  • and obtaining the data to be accessed by the memory access request from the next-level memory and storing it in the cache block allocated for the memory access request.
  • Each tag storage location in the cache is no longer fixedly bound to a certain cache block, but can be dynamically mapped or bound to any cache block, and the tag and the data in the cache block do not need to be updated synchronously.
  • When the storage space of the L1 cache changes, for example when the data storage unit 103 of the L1 cache is moved or allocated to another location in the shared storage space, cache blocks can be reallocated from the given starting position of the L1 cache in the shared storage space by correspondingly re-establishing the mapping between each tag storage location and the new data storage location of each cache block. In this way, when the storage space of the L1 cache changes or is adjusted according to the user's configuration, the data in the newly allocated cache blocks can be located based on the re-established mapping relationship.
  • A processor supporting dynamically shared storage space is also provided, which includes a memory access control unit, a memory, and a cache memory.
  • The cache memory includes a controller, a tag storage unit, a data storage unit composed of multiple cache blocks, and a mapping unit; one part of the storage space of the memory is used as local memory, and the other part is used as the data storage unit of the cache memory, where:
  • the memory access control unit is configured to: update the starting positions of the local memory and the cache memory in the processor's memory according to the received settings for the local memory size and the cache memory size; and update the setting of the size of the index field in the cache access address according to the received setting of the cache size;
  • the controller of the cache memory is configured to: determine the new data storage location corresponding to each cache block in the memory based on the cache memory size provided by the memory access control unit and its starting position in the processor's memory, and establish, in the mapping unit, a mapping between each tag storage location in the tag storage unit and the new data storage location of each cache block.
  • The memory access control unit may also be configured to: according to the received settings for the local memory size and cache memory size, first partition a storage space of the corresponding size starting from a preset address in the memory as local memory, and then allocate storage space for the cache memory, where the starting position of the local memory is the preset address.
  • the memory may be implemented in the form of random access memory.
  • The mapping unit in the cache memory can be implemented in the form of registers. Such a register-based mapping unit can further reduce the cost and area occupied by storing the mapping relationships in the cache memory, and improve the speed of resolving the mapping between tags and cache blocks.
  • Figure 1 is a schematic diagram of a structural module of a cache memory according to an embodiment of the present application.
  • Figure 2 is a schematic diagram of the mapping relationship between tags and cache blocks according to an embodiment of the present application.
  • FIG. 3 is a flowchart illustrating a method for dynamically sharing memory space in a parallel processor according to an embodiment of the present application.
  • Memory access requests from multiple threads of different computing cores are sent to the processor's memory access control unit (LSU, Load/Store Unit), and the LSU accesses local memory or global memory to fetch the data requested by these memory access requests.
  • In a programming model that uses local memory, local memory is usually accessed in the form of an array, so the memory access control unit can locate the specific storage location in local memory that each thread's memory access request wants to access based on the starting address of the local memory. This is not the case for global memory (also called main memory).
  • The global memory address to be accessed is first mapped to an access address of the cache memory (such as the L1 cache), and the L1 cache is then checked to see whether the corresponding data is already cached there. If the data to be accessed is already cached in the L1 cache (called a "hit"), the data is returned directly from the L1 cache to the processor; otherwise, the L1 cache controller fetches the data from the next-level memory (such as global memory), caches it, and returns it to the processor.
  • Embodiments of the present application provide a solution in which the storage space of the same random access memory (RAM) is shared by the local memory and the L1 cache in a parallel processor. Users can dynamically allocate the space sizes of the local memory and L1 cache in the processor through upper-layer software according to actual application requirements. Once the space allocation is set, the local memory will only access the space designated for it, and likewise the L1 cache will only access the space designated for it.
  • When the user's configuration changes, the space of the L1 cache and the local memory also changes accordingly.
  • Given the starting position of the local memory, the memory access control unit can locate any storage location in it; this is not the case for the L1 cache.
  • When the space size of the L1 cache changes, its access address and index space may change.
  • However, existing L1 cache control mechanisms are all designed for an L1 cache with a fixed space size, and cannot support a dynamically changing storage space.
  • a cache memory located between the processor and the global memory generally includes a controller, a tag storage unit for saving tags, and a data storage unit for saving data.
  • the data storage space of the L1 cache is evenly divided into multiple cache blocks (also called cache lines) of equal size.
  • the smallest unit of data transfer between the L1 cache and global memory is the cache block size.
  • Each cache block in the data storage unit is provided with a unique corresponding tag storage location in the tag storage unit.
  • Tags corresponding to the data stored in each cache block are stored in the corresponding tag storage location of the cache block.
  • When the data in a cache block is replaced, the tag in the tag storage location corresponding to the cache block is also updated; the two are replaced synchronously.
  • Although tags are also part of the L1 cache, the L1 cache size usually refers only to the maximum total amount of data the L1 cache can hold, that is, the storage space size of the data storage unit of the L1 cache.
  • Existing cache memories are usually divided into three types: direct-mapped cache, fully associative cache, and set-associative cache.
  • In a direct-mapped cache, each main memory block can only be mapped to a cache block at a fixed location in the L1 cache. Even if there are many empty locations in the cache, they cannot be used, so the cache storage space cannot be fully utilized; moreover, if the program happens to repeatedly access different main memory blocks that correspond to the same cache location, block conflicts occur frequently and constant replacement is required, which reduces the hit rate.
  • In a fully associative cache, each main memory block can be mapped to any cache block in the L1 cache. When accessing data, the address in the memory access request must be compared with the tag of every cache block to determine whether there is a hit.
  • In a set-associative cache, the storage space is divided into several groups.
  • Direct mapping is used between groups, while fully associative mapping is used among the cache blocks within a group.
  • The main memory is also partitioned according to the L1 cache size, and each partition is divided into several groups, each group containing several main memory blocks. In this way, each group in main memory can only be mapped to a specified group in the cache, but each main memory block within a group in main memory can be mapped to any cache block in the specified cache group.
  • the within-group mapping method is fully associative, while the between-group mapping method is direct mapping.
  • the set-associative cache is simpler and more efficient than the fully-associative cache in determining block hits and replacement algorithms.
  • the probability of block conflicts is lower than that of the direct-mapped cache, and its hit rate is between the direct-mapped cache and the fully-associative cache.
  • The search and comparison speed of tags directly affects the processor's data access efficiency, so the number of tags in the same group is usually kept small.
  • Common tag counts per group are 4, 8, or 16, and generally fewer than 32.
  • set-associative cache is more commonly used as the L1 cache.
  • an L1 cache of the set-associative cache type is used as an example.
  • the embodiments introduced in this application can also be applied to other types of L1 caches with appropriate adjustments or modifications.
  • When the L1 cache controller receives a memory access request, it determines whether the corresponding data has been cached in the L1 cache through the memory access address in the memory access request.
  • the memory access address usually includes three parts: tag, index and offset.
  • the offset is used to address a particular piece of data within the cache block;
  • the index is used to locate a particular group in the L1 cache;
  • the tag is compared with the tag of each cache block contained in the group specified by the index to determine whether there is a hit.
  • Take an L1 cache with a size of 64 bytes and a cache block size of 8 bytes as an example, and assume the L1 cache is divided into two groups, each with a storage space of 32 bytes, that is, each containing 4 cache blocks.
  • Since the cache block size is 8 bytes, its addressing range can be represented by 3 bits, so the lowest 3 bits of the access address are used as the offset field; the number of groups is 2, so 1 bit covers all groups, and the 1 bit adjacent to the offset in the access address is used as the index field; the remaining bits of the access address form the tag field, which is compared against the tags of all cache blocks in the group located by the index, as sketched below.
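  • As a minimal illustrative sketch (not part of the original patent text), the field split for this 64-byte example can be expressed in C as follows; the constants and helper names are assumptions chosen for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Field split for the 64-byte example above: 8-byte blocks give 3 offset
 * bits, 2 groups give 1 index bit, and the remaining bits form the tag. */
#define OFFSET_BITS 3
#define INDEX_BITS  1

static uint32_t addr_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1u); }
static uint32_t addr_index(uint32_t addr)  { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u); }
static uint32_t addr_tag(uint32_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0x1A5u; /* arbitrary example address */
    printf("tag=%u index=%u offset=%u\n", addr_tag(addr), addr_index(addr), addr_offset(addr));
    return 0;
}
```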
  • the tag field stores part of the address of the global memory to be accessed by the memory access request.
  • The corresponding group can be located through the index field of the address in the memory access request, and the tag field of the access address is then compared with the tags of each cache block contained in that group.
  • If there is a matching tag, the cache hits.
  • In that case, the corresponding data in the cache block of the matched tag can be extracted according to the offset field in the access address and returned to the processor. If no tag matches, the data to be accessed has not yet been cached in the L1 cache (a "miss").
  • On a miss, the L1 cache controller allocates a cache block and its corresponding tag for the data to be accessed, loads the data from main memory into the allocated cache block, and stores the tag field of the access address in the tag storage location corresponding to that cache block.
  • In a traditional cache, each tag storage location in the tag storage unit is fixedly bound to a cache block (i.e., a data storage location) in the data storage unit.
  • As a result, once the data storage space of the L1 cache changes (for example, when the locations of cache blocks are moved or changed), the L1 cache can no longer address the corresponding data based on the tag and index. The traditional L1 cache therefore cannot support dynamic changes in storage space size.
  • Moreover, the number of bits occupied by the offset field in the L1 cache's access address is determined by the cache block size, and the tag field, which corresponds to part of the global memory address to be accessed by the memory access request, is also preset; only the number of bits occupied by the index field changes with the L1 cache size, and changes in the index field do not affect external components such as global memory, only the internal addressing of the L1 cache.
  • For example, with a 128-byte cache block size and an 8-way set-associative organization, the offset determined by the cache block size occupies the lowest 7 bits of the access address (the [6:0] part). When the cache size is 32KB (i.e., 128*8*32), there are 32 groups in total and the index part of the access address occupies 5 bits, the [11:7] part; if the cache size becomes 64KB (128*8*64), there are 64 groups in total and the index part occupies 6 bits, the [12:7] part. That is, when the size of the L1 cache's data storage space changes, the size of the index field in the access address must change accordingly for addressing to remain correct.
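  • The index-width arithmetic above can be captured in a short hedged sketch; the helper names are illustrative assumptions, and all sizes are taken to be powers of two as in the examples:

```c
#include <stdint.h>

/* Number of index bits for a set-associative cache:
 * groups = cache_size / (block_size * ways); index bits = log2(groups). */
static unsigned log2u(uint32_t x) {
    unsigned n = 0;
    while (x > 1u) { x >>= 1; n++; }
    return n;
}

static unsigned index_bits(uint32_t cache_size, uint32_t block_size, uint32_t ways) {
    return log2u(cache_size / (block_size * ways));
}

/* index_bits(32 * 1024, 128, 8) == 5   (32 groups, address bits [11:7])
 * index_bits(64 * 1024, 128, 8) == 6   (64 groups, address bits [12:7]) */
```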
  • FIG. 1 shows a schematic structural module diagram of a cache memory 100 according to an embodiment of the present application.
  • the cache memory 100 includes a controller 101, a tag storage unit 102 for saving tags, a data storage unit 103 composed of a plurality of cache blocks, and a mapping unit 104.
  • Unlike a traditional cache, there is a dynamic mapping between tag storage locations and data storage locations (i.e., cache blocks) in the cache memory 100. That is, each tag storage location is no longer fixedly bound to a cache block, but can be dynamically mapped or bound to any cache block.
  • the mapping relationship between the tag storage location and the cache block is saved in the mapping unit 104 .
  • the mapping unit 104 may, for example, store a one-to-one mapping relationship between tag serial numbers and cache block serial numbers in the form of a table.
  • the tag serial number is used to indicate the location of each tag stored in the tag storage unit 102 .
  • the cache block serial number is used to indicate the location of each cache block in the data storage unit 103.
  • Figure 2 shows a schematic diagram of the mapping relationship between tag storage locations and cache blocks according to an example of this application.
  • there are k+1 tag storage locations in the tag storage unit and there are n+1 cache blocks in the data storage unit, where n and k are both natural numbers, and n is greater than or equal to k.
  • For example, the 1st tag t0 is currently mapped to the 6th cache block d5, the 2nd tag t1 is currently mapped to the 9th cache block d8, ..., and the (k+1)th tag tk is currently mapped to the 24th cache block d23.
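  • A possible (purely illustrative) representation of this mapping table, mirroring Figure 2, is a simple array indexed by tag serial number; the sizes below are assumed values, not taken from the patent:

```c
#include <stdint.h>

/* Illustrative mapping unit contents: entry i holds the serial number of
 * the cache block currently bound to tag slot i.
 * NUM_TAGS and NUM_BLOCKS are assumed values (k+1 tags, n+1 blocks, n >= k). */
#define NUM_TAGS   16
#define NUM_BLOCKS 32

static uint16_t tag_to_block[NUM_TAGS];

static void example_mapping(void) {
    tag_to_block[0] = 5;             /* tag t0 -> cache block d5  */
    tag_to_block[1] = 8;             /* tag t1 -> cache block d8  */
    /* ...                                                        */
    tag_to_block[NUM_TAGS - 1] = 23; /* tag tk -> cache block d23 */
}
```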
  • The mapping relationship saved by the mapping unit 104 actually reflects the mapping between the tags currently saved at each storage location in the tag storage unit 102 and the data blocks currently saved in the corresponding cache blocks in the data storage unit 103.
  • In this way, when the data storage unit 103 is moved to a new location, each tag in the tag storage unit 102 can be mapped to an individual cache block at that new location.
  • Therefore, the L1 cache can be moved or allocated to any part of the shared storage space; as long as the starting position of the L1 cache in the shared storage space is given, the data storage location corresponding to each tag storage location can be relocated by adjusting the mapping unit 104.
  • the number of cache blocks in the data storage unit 103 may be greater than the number of tags contained in the tag storage unit 102 .
  • Each cache block can be bound to a tag or not.
  • Each cache block is provided with a tag binding bit.
  • The tag binding bit indicates whether the cache block is bound to a tag. For example, when the cache block is bound to some tag, its tag binding bit can be set to 1, y, or T; when the cache block is not bound to any tag, its tag binding bit can be set to 0, n, or F.
  • A cache block is only allowed to release its data resources when it is not bound to any tag; only then can it participate in data replacement and be used to store new data.
  • In some embodiments, each cache block may also be provided with a status bit indicating whether the operations on the cache block have completed. For example, when the controller 101 determines that all read and write operations on the data currently stored in the cache block have completed, the status bit of the cache block can be set to 1, y, or T; otherwise it can be set to 0, n, or F.
  • In such embodiments, a cache block is only allowed to release its data resources when it is bound to no tag and all operations on it have completed; only then can it participate in data replacement and be used to store new data.
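  • As a hedged sketch of the per-block metadata just described (names and types are assumptions), the release condition can be expressed as:

```c
#include <stdbool.h>

/* Assumed per-cache-block metadata: a tag binding bit and a status bit.
 * A block may release its data resources (take part in replacement)
 * only when it is unbound and its outstanding operations have completed. */
struct block_meta {
    bool tag_bound;    /* true: bound to some tag; false: unbound      */
    bool ops_complete; /* true: all pending reads/writes have finished */
};

static bool can_release(const struct block_meta *b) {
    return !b->tag_bound && b->ops_complete;
}
```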
  • Dynamic mapping or binding between tags and the data in cache blocks is achieved by introducing the mapping unit 104.
  • The data in tags and cache blocks do not need to be updated synchronously. For example, when a tag stored at a certain storage location in the tag storage unit 102 is replaced with a new tag, a new cache block can be allocated in the data storage unit 103 for the data corresponding to the new tag, and a mapping between the new tag and the newly allocated cache block is simply established in the mapping unit 104; the data in the cache block corresponding to the old tag originally saved at that storage location is still retained in the data storage unit 103.
  • Likewise, when the storage space of the L1 cache changes, cache blocks can be reallocated from the new starting position, as long as the data storage location corresponding to each tag storage location is repositioned accordingly in the mapping unit 104.
  • When the storage space of the L1 cache is changed or adjusted according to the user's configuration, what is located based on the mapping relationships re-established in the mapping unit 104 is the data in the newly allocated cache blocks; the data saved in the L1 cache before the adjustment is not retained, and the space it occupies can be replaced by other data.
  • The mapping unit 104 can be implemented using a random access memory such as SRAM or DRAM, in which a data structure such as an array or a linked list stores the one-to-one mapping between tag serial numbers and cache block serial numbers. Taking an array as an example, the number of elements in the array equals the number of tags that can be stored in the tag storage unit 102: the first element stores the serial number of the cache block currently corresponding to the first tag in the tag storage unit 102, and so on.
  • the mapping unit 104 may be implemented in the form of a register.
  • the mapping unit 104 may be implemented as a set of registers.
  • Each register corresponds to the storage location of one tag in the tag storage unit 102, and the value of each register is the serial number of the cache block corresponding to the tag at that position.
  • Such a mapping unit implemented in the form of a register can further reduce the cost and area occupied by the storage of mapping relationships in the L1 cache, and improve the speed of parsing the mapping relationships between tags and cache blocks.
  • The controller 101 parses the memory access address contained in a received memory access request: it locates the corresponding group according to the index field in the memory access address, and then compares the tag field in the memory access request with the tags contained in the located group. If a matching tag is found, it is a cache hit, indicating that the data to be accessed by the memory access request has been cached in the cache memory. If no matching tag is found after comparing all tags, the data to be accessed has not been cached in the cache memory; in that case, the controller 101 needs to read the data to be accessed by the memory access request into the cache memory 100 from the next-level memory (such as the L2 cache or main memory).
  • In the case of a hit, the controller 101 determines the data storage location corresponding to the hit tag (i.e., a certain cache block in the data storage unit 103) according to the mapping relationship saved in the mapping unit 104, extracts the data to be accessed from that cache block according to the offset field in the memory access address, and returns it to the processor's memory access control unit (LSU) as the response to the memory access request.
  • In the case of a miss, the controller 101 assigns a tag to the memory access request, for example using the tag part of the memory access address contained in the request as the newly allocated tag, and allocates a storage location in the tag storage unit 102 to save this newly allocated tag.
  • The controller 101 also needs to allocate a cache block for the memory access request in the data storage unit 103, so that the data to be accessed can be stored into it from the next-level memory.
  • To establish the correspondence between the tag assigned to the memory access request and the cache block, the controller 101 further updates the tag-to-cache-block mapping in the mapping unit 104, so that a mapping is established between the tag allocated in the tag storage unit 102 and the cache block allocated in the data storage unit 103. For example, according to the serial number of the tag's storage location in the tag storage unit 102, the cache block serial number currently corresponding to that tag serial number is looked up in the mapping unit 104; the tag binding bit of that cache block in the data storage unit 103 is set to indicate that it is not bound to a tag, and the looked-up cache block serial number is replaced with the serial number of the cache block assigned to the memory access request.
  • The tag binding bit of the cache block allocated to the memory access request is then set to indicate that it is bound to a tag. In this way, the data to be accessed by the memory access request can be read from the next-level memory and stored in the cache block allocated for it, as sketched below.
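  • The rebinding on a miss described above might be sketched as follows; tag_to_block, meta, and fetch_from_next_level are hypothetical names, and the replacement-policy choice of the victim tag slot and new block is omitted:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical miss-path rebinding: unbind the block previously mapped to
 * the chosen tag slot, point the slot at the newly selected block, mark it
 * bound, and fill it from the next-level memory. */
#define NUM_TAGS   16
#define NUM_BLOCKS 32

struct block_meta { bool tag_bound; bool ops_complete; };

static uint16_t tag_to_block[NUM_TAGS];
static struct block_meta meta[NUM_BLOCKS];

static void fetch_from_next_level(uint16_t block) { (void)block; /* stub */ }

static void handle_miss(uint16_t tag_slot, uint16_t new_block) {
    uint16_t old_block = tag_to_block[tag_slot]; /* block formerly bound here */
    meta[old_block].tag_bound = false;           /* unbind the old block      */
    tag_to_block[tag_slot] = new_block;          /* remap slot to new block   */
    meta[new_block].tag_bound = true;            /* bind the new block        */
    fetch_from_next_level(new_block);            /* load the requested data   */
}
```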
  • Figure 3 shows a method for dynamically sharing memory space in a parallel processor according to an embodiment of the present application, which mainly includes the following steps:
  • In step S1), the memory access control unit of the processor respectively updates the starting positions of the local memory and the cache memory in the processor's memory, according to the received settings for the local memory size and the cache memory size.
  • When users write applications on GPGPU processors in programming languages such as CUDA and OpenCL, they can re-adjust the storage space sizes of the local memory and L1 cache in the processor according to the chosen local-memory-based or global-memory-based programming model, to better improve the processor's execution performance for the application. If the user currently chooses local-memory-based programming, the storage space of the local memory in the processor can be expanded appropriately; conversely, if the user currently chooses global-memory-based programming, the storage space of the L1 cache can be expanded appropriately.
  • As mentioned above, the storage space of the local memory and the L1 cache comes from the same random access memory (RAM) in the processor, and the storage space occupied by each is not fixed but changes dynamically with the configuration provided by the user. For example, when the user chooses local-memory-based programming, the local memory's share of the memory can be expanded; and when the user chooses global-memory-based programming, the L1 cache's share of the memory can be expanded.
  • The user can provide the current settings for the local memory size and cache memory size to the processor's memory access control unit by, for example, calling a configuration interface provided by the processor, providing a configuration file, selecting corresponding configuration options, or sending configuration commands.
  • The memory access control unit of the processor can partition storage spaces of the corresponding sizes from the processor's memory to serve as local memory and cache memory, respectively, based on the received settings, thereby re-dividing the shared memory. In fact, for memory shared by the local memory and the L1 cache, each re-division of the storage space they occupy can be achieved simply by determining the new starting positions of the local memory and the cache memory in the processor's memory.
  • In some embodiments, the memory access control unit of the processor partitions storage spaces of the corresponding sizes from the processor's memory as local memory and cache memory according to the received settings for the local memory size and cache memory size.
  • The storage space of the local memory is allocated first, and then the storage space of the L1 cache is allocated.
  • Each time, a storage space of the corresponding size is allocated for the local memory starting from a preset address in the processor's memory (such as the starting address of the memory), so that the local memory occupies the low-address part of the shared storage space.
  • In this way, the starting position of the local memory does not need to change on each update, and the new starting position of the updated cache memory in the processor's memory can be determined solely from the newly set local memory size, as the sketch below illustrates.
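  • A minimal sketch of this layout rule (assuming the names shown; not the patent's literal implementation): the local memory keeps the preset base address, and only the cache's starting position is recomputed from the new local memory size:

```c
#include <stdint.h>

/* Assumed layout rule for step S1): local memory always starts at a preset
 * base address of the shared memory, and the cache's data storage unit
 * starts immediately after it, so only the cache's start moves on resize. */
struct shared_layout {
    uint32_t local_mem_base; /* preset address; unchanged across updates */
    uint32_t cache_base;     /* local_mem_base + local_mem_size          */
};

static struct shared_layout update_layout(uint32_t preset_base,
                                          uint32_t local_mem_size) {
    struct shared_layout l;
    l.local_mem_base = preset_base;              /* local memory keeps its start */
    l.cache_base = preset_base + local_mem_size; /* cache follows local memory   */
    return l;
}
```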
  • In step S2), the memory access control unit of the processor updates the setting of the size of the index field in the cache memory's access address according to the received setting of the cache memory size.
  • As mentioned above, the number of bits occupied by the offset field in the L1 cache's access address is determined by the cache block size, and the tag field, which corresponds to part of the global memory address to be accessed by the memory access request, is also preset; only the number of bits occupied by the index field changes with the L1 cache size, and changes in the index field affect only the internal addressing of the L1 cache.
  • Therefore, when the L1 cache size changes, the size of the index field in the access address is changed accordingly to ensure correct addressing.
  • For example, for a cache block size of 128B and an 8-way set-associative cache (that is, each group contains 8 cache blocks): when the cache size is 32KB (that is, 128*8*32), there are 32 groups in total and the index field in the access address occupies 5 bits; if the cache size becomes 64KB (128*8*64), there are 64 groups in total and the index field occupies 6 bits.
  • The number of bits occupied by the index field is set according to the number of groups in the L1 cache, which is obtained by dividing the cache size by the cache block size and by the number of tags contained in each group.
  • Therefore, when the memory access control unit of the processor determines the size of the index field in the L1 cache's access address, the new number of groups in the L1 cache is determined as well.
  • The memory access control unit may include parameters such as the starting position of the cache memory in the processor's memory determined in step S1), the setting of the cache memory size, and the new group number determined in step S2) in a notification message indicating the space adjustment, and send it to the cache, so that the cache can re-partition and configure its data storage space.
  • In step S3), when the cache memory receives a notification message indicating a space adjustment from the processor's memory access control unit, it can, starting from the cache memory's starting position in the processor's memory, sequentially determine the new data storage location in the processor's memory corresponding to each group and each cache block contained therein.
  • Each cache block in the data storage unit of the cache memory needs a corresponding tag storage location in order to be correctly addressed. Therefore, in step S3), a new mapping relationship also needs to be established between the re-determined location of each cache block and each tag storage location of the cache memory's tag storage unit.
  • That is, a new mapping is established between each tag storage location in the cache's tag storage unit and each new data storage location in the data storage unit after the storage space adjustment, so that the cache block corresponding to each tag storage location can be located based on the newly established mapping relationship, as the sketch below illustrates.
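  • A hedged sketch of this remapping step (the identity assignment is just one possible allocation order, and all names are illustrative):

```c
#include <stdint.h>

/* Hypothetical re-mapping for step S3): after a resize, cache blocks are
 * laid out again from the cache's new starting position and every tag
 * storage location is bound to a freshly allocated block. */
#define NUM_TAGS 16

static uint16_t tag_to_block[NUM_TAGS];

/* Byte address of block block_no, given the cache's new base address. */
static uint32_t block_addr(uint32_t cache_base, uint32_t block_size, uint16_t block_no) {
    return cache_base + (uint32_t)block_no * block_size;
}

static void rebuild_mapping(void) {
    for (uint16_t t = 0; t < NUM_TAGS; t++)
        tag_to_block[t] = t; /* bind tag slot t to newly allocated block t */
}
```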
  • the user can dynamically adjust the storage space allocated to local memory or L1 cache in the processor's memory at any time according to the actual application.
  • When the memory access control unit of the processor receives a memory access request for the local memory, the updated starting position of the local memory can be used to locate the data to be accessed by the memory access request.
  • When the memory access control unit receives a memory access request for the global memory, it first maps the address in the memory access request to a memory access address of the cache memory and sends it to the cache memory, where the memory access address adopts the updated index field size.
  • When the cache memory receives the memory access request from the memory access control unit, it can locate the data to be accessed based on the newly established mapping between tag storage locations and new data storage locations.
  • In this way, the same random access memory (RAM) space in the processor can be time-shared by the local memory and the L1 cache, reducing chip area and hardware cost.
  • Because the storage space of the local memory or the L1 cache usually accounts for about 80% of the area cost, a dynamically shared storage space solution can save approximately 40% of the area cost.
  • At the same time, this solution allows users, when writing applications, to re-adjust the storage space sizes of the local memory and L1 cache in the processor according to the chosen local-memory-based or global-memory-based programming model, to better improve the processor's execution performance for applications.
  • After the storage space of the L1 cache has been adjusted through the above steps, when the L1 cache receives a memory access request from the memory access control unit, its controller parses the memory access address contained in the request: it extracts the index field based on the updated index field size, locates the corresponding group based on the extracted index, and then compares the tag field in the memory access address with each tag contained in the located group. If a matching tag is found, the cache hits, indicating that the data to be accessed has been cached in the L1 cache; if no matching tag is found after comparing all tags, the data to be accessed has not been cached in the L1 cache.
  • On a miss, the L1 cache needs to read the data to be accessed by the memory access request from the next-level memory (such as the L2 cache or main memory) into the L1 cache.
  • The data to be accessed is then returned to the memory access control unit as the response to the memory access request.
  • More specifically, on a miss, the L1 cache controller replaces the tag stored at one of the tag storage locations of the tag storage unit with the tag field from the memory access address of the request, determines which cache block the selected tag storage location corresponds to according to the mapping newly created in step S3), and then obtains the data to be accessed from the next-level memory and saves it in that cache block.
  • In some embodiments, the controller of the L1 cache allocates a tag storage location in the tag storage unit for the memory access request to store the tag field from the request's memory access address, and selects one of the cache blocks in the data storage unit that is not bound to a tag to allocate to the request.
  • In the mapping unit, the tag binding bit of the cache block originally corresponding to the allocated tag storage location is set to indicate that it is not bound to a tag; a mapping relationship is then established between the tag storage location and the cache block allocated to the memory access request, and the tag binding bit of the allocated cache block is set to indicate that it is bound to a tag; finally, the data to be accessed by the memory access request is obtained from the next-level memory and saved in the cache block allocated for it.
  • In some embodiments of the present application, a processor supporting dynamically shared storage space is also provided. Apart from the memory access control unit, the memory, and the cache memory, its other components are the same as those of existing processors and will not be described again.
  • The cache is a cache as described above in connection with Figures 1 and 2, which includes a controller, a tag storage unit for saving tags, a data storage unit composed of multiple cache blocks, and a mapping unit. One part of the storage space of the processor's memory is used as local memory, and the other part is used as the data storage unit of the cache memory.
  • The memory may be implemented in the form of random access memory (RAM) such as SRAM or DRAM.
  • The memory access control unit is configured to: update the starting positions of the local memory and the cache memory in the processor's memory according to the received settings for the local memory size and cache memory size; and update the setting of the size of the index field in the cache access address according to the received setting of the cache size.
  • The controller of the cache memory is configured to: determine the new data storage location corresponding to each cache block in the memory based on the cache memory size provided by the memory access control unit and its starting position in the processor's memory, and establish, in the mapping unit, a mapping between each tag storage location in the tag storage unit and the new data storage location of each cache block.
  • For specific details, please refer to the description of step S3) above.
  • The modules mentioned herein, such as the memory access control unit in the processor and the controller in the cache memory, and the method steps they execute, can be implemented entirely by logically programming the corresponding functional modules, processes, or steps, so that these modules implement the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, a controller or memory access control unit implemented in this way can be regarded as a hardware component, and the devices included therein for realizing various functions can also be regarded as internal structures of the hardware component; the means for implementing various functions can even be regarded both as structures within hardware components and as software modules implementing the relevant processes or method steps.
  • Appearances of the phrases "in various embodiments," "in some embodiments," "in one embodiment," or "in an embodiment" in various places throughout this specification are not necessarily referring to the same embodiment.
  • specific features, structures, or properties may be combined in any suitable manner in one or more embodiments.
  • A particular feature, structure, or property shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or properties of one or more other embodiments without limitation, as long as the combination is not illogical or non-functional.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Provided in the present application are a method for dynamically sharing a storage space in a parallel processor, and a corresponding processor. One part of a storage space of a memory of the processor is used as a local memory, and the other part is used as a cache. A memory access control unit of the processor respectively updates the starting positions of the local memory and the cache in the memory of the processor according to received settings for the size of the local memory and the size of the cache; and the cache determines a new data storage position corresponding to each cache block in the memory according to the size of the cache and the starting position of the cache in the memory of the processor, and establishes a mapping between each tag storage position and the new data storage position of each cache block. The solution allows a user to dynamically adjust the sizes of the storage spaces of the local memory and the cache in the processor, such that the execution performance of the processor for an application program is improved without increasing the chip area or the hardware cost.

Description

Method for dynamically sharing storage space in parallel processor, and corresponding processor

Technical Field

The present application relates to storage space management technology in high-performance parallel processors, and in particular to a method and system for sharing storage space between a processor's local memory and cache memory.

Background Art

The statements in this section merely provide background information related to the technical solution of the present application to aid understanding; they do not necessarily constitute prior art for the technical solution of the present application.

High-performance parallel processors, such as processors for general-purpose computing on graphics processing units (GPGPU), generally support both local memory and global memory programming models, which gives users the flexibility to choose local memory or global memory according to the actual application. Local memory and global memory each have advantages and disadvantages: local memory is generally closer to the computing core and faster to access, but smaller in capacity; global memory has a large capacity, but its access speed and latency are comparatively poor. To address the access latency of global memory, GPGPUs use caching to relieve the global memory access bottleneck: a fast but small-capacity cache memory (Cache, also simply called a cache) is set up between the processor and global memory as a cache for global memory data, for example an L1 cache. To further improve performance, a multi-level cache can also be introduced; for example, an L2 cache or even an L3 cache can be added behind the L1 cache to reduce access latency.

Therefore, in parallel processor architectures that include multiple computing cores, such as GPGPUs, each computing core often contains a local memory and an L1 cache. Generally speaking, the larger the local memory and L1 cache of each computing core, the better the performance; correspondingly, however, this increases the chip area and hardware cost of the processor.

It should be noted that the above content is only intended to help understand the technical solution of the present application and is not used as a basis for evaluating the prior art of the present application.

Summary of the Invention

The inventor found through research that when users write applications on GPGPU processors in programming languages such as CUDA and OpenCL, they choose a programming model dominated by either local memory or global memory according to actual application needs. In other words, a user-written application does not require large capacity from both the local memory and the L1 cache at the same time. The inventor therefore designed a solution for dynamically sharing storage space in a parallel processor, so that the local memory and the L1 cache can share storage space and the user can dynamically allocate the size of the local memory or the L1 cache according to the actual application. Such a solution allows the same random access memory (RAM) space to be time-shared by the local memory and the L1 cache, thereby reducing chip area and hardware cost. Because the storage space of the local memory or the L1 cache usually accounts for about 80% of the area cost, a dynamically shared storage space solution can save approximately 40% of the area cost.

According to a first aspect of the embodiments of the present application, a method for dynamically sharing storage space in a parallel processor is provided, which includes: updating, by a memory access control unit of the processor according to received settings of the local memory size and the cache memory size, the respective starting positions of the local memory and the cache memory in a memory of the processor, wherein one part of the storage space of the memory is used as the local memory and another part is used as the data storage unit of the cache memory; updating, by the memory access control unit of the processor according to the received setting of the cache memory size, the setting of the size of the index field in memory access addresses for the cache memory; and determining, by the cache memory according to the cache memory size and its starting position in the processor's memory as provided by the memory access control unit, the new data storage location in the memory corresponding to each cache block, and establishing a mapping between each tag storage location and the new data storage location of each cache block.

In such an embodiment, the local memory and the L1 cache in the processor share the same memory, but the sizes of the storage spaces they occupy are not fixed; rather, they can change dynamically with the configuration provided by the user. This allows a user who writes applications on a GPGPU processor in programming languages such as CUDA or OpenCL to re-adjust the storage space sizes of the local memory and the L1 cache in the processor according to the selected programming model, based on either local memory or global memory, so as to better improve the processor's execution performance for the application. If the user currently chooses local-memory-based programming, the storage space of the local memory in the processor can be enlarged appropriately; conversely, if the user currently chooses global-memory-based programming, the storage space of the L1 cache can be enlarged appropriately. In this way, the processor's execution performance for applications is improved without increasing chip area or hardware cost.

In some embodiments, the method may further include: according to the received settings of the local memory size and the cache memory size, the memory access control unit of the processor first partitioning, starting from a preset address in the processor's memory, a storage space of the corresponding size as the local memory, and then allocating storage space for the cache memory, where the starting position of the local memory is the preset address.

In such an embodiment, the starting position of the local memory never needs to change when the storage space sizes are adjusted; the new starting position of the cache memory in the processor's memory can be determined simply from the newly set local memory size. This not only simplifies the storage space management flow, but also ensures that when the size of the local memory is updated, the data in the part of the local memory shared before and after the update is not lost.

In some embodiments, the cache memory may be a set-associative cache, and the size of the index field in memory access addresses for the cache memory may be determined from the result of dividing the cache memory size by the cache block size preset for the cache and by the number of tags contained in each set.

In some embodiments, the method may further include: in response to receiving a memory access request for the local memory, the memory access control unit of the processor using the updated starting position of the local memory to locate the data to be accessed by the memory access request.

In some embodiments, the method may further include: in response to receiving a memory access request for the global memory, the memory access control unit of the processor mapping the address in the memory access request into a memory access address of the cache memory, where the memory access address uses the updated index field size, and sending it to the cache memory; and, in response to receiving the memory access request from the memory access control unit, the cache memory locating the data to be accessed by the memory access request according to the established mapping between tag storage locations and new data storage locations.

In some embodiments, locating, by the cache memory in response to receiving a memory access request from the memory access control unit, the data to be accessed by the memory access request according to the established mapping between tag storage locations and new data storage locations may include:

on a cache hit, determining, from the established mapping between tag storage locations and new data storage locations, the cache block corresponding to the hit tag, and extracting from that cache block the data to be accessed by the memory access request as the response to the memory access request;

on a cache miss, performing the following operations:

allocating a tag storage location for the memory access request to store the tag field of the memory access address of the request, and selecting, from among the cache blocks of the cache memory's data storage unit that are not bound to any tag, one cache block to allocate to the memory access request;

setting the tag binding bit of the cache block originally corresponding to the allocated tag storage location to indicate that it is not bound to a tag, then establishing a mapping relationship between that tag storage location and the cache block allocated to the memory access request, and setting the tag binding bit of the cache block allocated to the memory access request to indicate that it is bound to a tag; and

fetching the data to be accessed by the memory access request from the next-level memory and saving it in the cache block allocated for the memory access request.

In such an embodiment, each tag storage location in the cache memory is no longer fixedly bound to a particular cache block, but can be dynamically mapped to, or bound to, any cache block, and the tag and the data in the cache block need not be updated synchronously. When the storage space of the L1 cache changes, for example when the data storage unit 103 of the L1 cache is moved or allocated to another location in the shared storage space, cache blocks can be reallocated starting from the given starting position of the L1 cache in the shared storage space, simply by re-establishing, accordingly, the mapping between each tag storage location and the new data storage location of each cache block. In this way, when the storage space of the L1 cache changes or is adjusted according to the user's configuration, what is located or looked up based on the re-established mapping relationship is the data in the newly allocated cache blocks.

According to a second aspect of the embodiments of the present application, a processor supporting dynamically shared storage space is provided, which includes a memory access control unit, a memory and a cache memory, the cache memory including a controller, a tag storage unit for saving tags, a data storage unit composed of multiple cache blocks, and a mapping unit; wherein one part of the storage space of the memory is used as the local memory and another part is used as the data storage unit of the cache memory, and wherein:

the memory access control unit is configured to: update, according to received settings of the local memory size and the cache memory size, the respective starting positions of the local memory and the cache memory in the processor's memory; and update, according to the received setting of the cache memory size, the setting of the size of the index field in memory access addresses for the cache memory;

the controller of the cache memory is configured to: determine, according to the cache memory size and its starting position in the processor's memory as provided by the memory access control unit, the new data storage location in the memory corresponding to each cache block, and establish, in the mapping unit, a mapping between each tag storage location in the tag storage unit and the new data storage location of each cache block.

In some embodiments, the memory access control unit may be further configured to: according to the received settings of the local memory size and the cache memory size, first partition, starting from a preset address in the memory, a storage space of the corresponding size as the local memory, and then allocate storage space for the cache memory, where the starting position of the local memory is the preset address. In some embodiments, the memory may be implemented in the form of a random access memory. The mapping unit in the cache memory may be implemented in the form of registers. A mapping unit implemented in register form can further reduce the cost and area occupied by storing the mapping relationships in the cache memory and increase the speed of resolving the mapping relationship between tags and cache blocks.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present application.

Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description serve to explain the principles of the present application. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:

Figure 1 is a schematic block diagram of the structure of a cache memory according to one embodiment of the present application.

Figure 2 is a schematic diagram of the mapping relationship between tags and cache blocks according to one embodiment of the present application.

Figure 3 is a schematic flowchart of a method for dynamically sharing storage space in a parallel processor according to one embodiment of the present application.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below through specific embodiments with reference to the accompanying drawings. It should be understood that the described embodiments are some, but not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.

Furthermore, the described features, structures or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the present application. However, those skilled in the art will recognize that the technical solutions of the present application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so on. In other instances, well-known methods, devices, implementations or operations are not shown or described in detail, to avoid obscuring aspects of the present application.

The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flowcharts shown in the drawings are merely illustrative; they need not include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be merged or partially merged, so the actual order of execution may change according to the actual situation.

In a parallel processor such as a GPGPU, memory access requests from multiple threads of different compute cores are sent to the processor's memory access control unit (LSU, Load/Store Unit), and the LSU accesses the data requested by these memory access requests from the local memory or the global memory. In a programming model that uses local memory, the local memory is usually accessed in the form of arrays, so the memory access control unit can locate the particular storage location in the local memory that each thread's memory access request wants to access based on the starting address of the local memory. This is not the case for the global memory (which may also be called main memory). In a programming model that uses global memory, when the memory access control unit receives a memory access request for a global memory address, it first maps the global memory address to be accessed into a memory access address for the cache memory (for example, the L1 cache), and then checks whether the corresponding data is already cached in the L1 cache. If the data to be accessed is already cached in the L1 cache (called a "hit"), the data is returned to the processor directly from the L1 cache; otherwise the L1 cache's controller fetches the data from the next-level memory (for example, the global memory), caches it, and returns it to the processor.
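For illustration only, the following C++ sketch models this dispatch: local requests are resolved directly from the configured starting address of the local memory, while global requests would go through the L1 cache. All names (Model, load_local, load_global) are hypothetical, and the L1 lookup is deliberately elided, so a global access here behaves like a miss that has already been filled from the next-level memory.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal, hypothetical software model of the LSU dispatch described above.
struct Model {
    std::vector<uint8_t> shared_ram;  // RAM shared by local memory and L1 data
    std::vector<uint8_t> global_mem;  // backing "global memory"
    uint64_t local_base;              // start of local memory in shared_ram

    uint8_t load_local(uint64_t offset) {
        // Local memory is addressed directly from its starting position.
        return shared_ram.at(local_base + offset);
    }

    uint8_t load_global(uint64_t addr) {
        // Real hardware would first form an L1 access address
        // (tag | index | offset) and probe the cache before falling back.
        return global_mem.at(addr);
    }
};

int main() {
    Model m{std::vector<uint8_t>(1024, 0), std::vector<uint8_t>(4096, 7), 0};
    m.shared_ram[5] = 42;
    std::printf("local[5]=%d global[100]=%d\n", m.load_local(5), m.load_global(100));
    return 0;
}
```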

The inventor found through research that when users write applications on GPGPU processors in programming languages such as CUDA or OpenCL, they choose a programming model dominated by either local memory or global memory according to the actual application requirements. In other words, a user-written application does not need large capacity in both the local memory and the L1 cache at the same time. Therefore, embodiments of the present application provide a solution in which, in a parallel processor, the local memory and the L1 cache share the storage space of the same random access memory (RAM). The user can dynamically allocate the space sizes of the local memory and the L1 cache in the processor through upper-layer software according to the actual application requirements. Once the space allocation is set, the local memory will only access the space of the designated size, and likewise the L1 cache can only access the space designated for it. As the user's configuration changes, the spaces of the L1 cache and the local memory change as well. For the local memory, as long as its starting address and storage space size are given, the memory access control unit can locate any storage location within it. This is not the case for the L1 cache: once the size of the L1 cache's space changes, both its memory access addresses and its index space may change. Existing L1 cache control mechanisms are all designed for an L1 cache of fixed size and cannot support a dynamic storage space.

More specifically, the cache memory located between the processor and the global memory (collectively referred to below as the L1 cache) usually includes a controller, a tag storage unit for saving tags, and a data storage unit for saving data. The data storage space of the L1 cache is evenly divided into multiple cache blocks (also called cache lines) of equal size. The minimum unit of data transfer between the L1 cache and the global memory is the cache block size. Each cache block in the data storage unit has a uniquely corresponding tag storage location in the tag storage unit. The tag corresponding to the data stored in each cache block is saved in that cache block's corresponding tag storage location. When the data in a cache block is updated, the tag in the tag storage location corresponding to that cache block is updated along with it; the two are replaced synchronously. Although the tags are also part of the L1 cache, when the L1 cache size is mentioned it usually refers only to the maximum total amount of data the L1 cache can hold, that is, the size of the storage space of the L1 cache's data storage unit.

Existing cache memories are usually divided into three types: direct-mapped caches, fully associative caches and set-associative caches. In a direct-mapped cache, each main memory block can only be mapped to a cache block at a fixed location in the L1 cache; even if many locations in the cache are still empty, they cannot be occupied, so the cache's storage space is not fully utilized, and if a program happens to repeatedly access different main memory blocks that correspond to the same cache location, block conflicts occur frequently and constant replacement is required, which lowers the hit rate. In a fully associative cache, each main memory block can be mapped to any cache block in the L1 cache; when accessing data, the address in the memory access request must be compared with the tags of all cache blocks to determine whether there is a hit, but a miss can only be confirmed after all comparisons have been completed, and only then can replacement begin, which directly affects data caching efficiency and access efficiency. A set-associative cache divides the storage space into a number of sets; direct mapping is used between sets, while fully associative mapping is used among the cache blocks within a set. Correspondingly, the main memory is also partitioned by the L1 cache size, and each partition is divided into several groups, each group containing several main memory blocks. In this way, each group in main memory can only be mapped to the designated set in the cache, but each main memory block within a group in main memory can be mapped to any cache block within the designated set. That is, mapping within a set is fully associative, while mapping between sets is direct. A set-associative cache is simpler and more efficient than a fully associative cache in both hit determination and the replacement algorithm, and its probability of block conflicts is lower than that of a direct-mapped cache; its hit rate lies between those of the direct-mapped cache and the fully associative cache. In such an L1 cache, the speed of tag lookup and comparison directly affects the processor's data access efficiency, so the number of tags in a set is usually not large; common tag counts are 4, 8 or 16, and are generally less than 32.

In parallel computing processors, a set-associative cache is most commonly used as the L1 cache. The set-associative type of L1 cache is used as an example below, but it should be understood that the embodiments presented in this application, with appropriate adjustment or modification, are also applicable to other types of L1 cache.

When the L1 cache's controller receives a memory access request, it uses the memory access address in the request to determine whether the corresponding data is already cached in the L1 cache. The memory access address usually consists of three parts: a tag, an index and an offset. The offset is used to address a particular piece of data within a cache block; the index is used to locate a particular set in the L1 cache; and the tag is compared against the corresponding tags of the cache blocks contained in the set designated by the index to determine whether there is a hit. Taking again an L1 cache of 64 bytes with 8-byte cache blocks as an example, suppose the L1 cache is divided into two sets, each with a storage space of 32 bytes, i.e. each containing 4 cache blocks. With an 8-byte cache block, the addressing range within a block can be represented with 3 bits, so the lowest 3 bits of the access address can serve as the offset field; the number of sets is 2, so 1 bit is enough to cover all sets, and the access address can use the 1 bit adjacent to the offset as the index field, while the remaining bits of the access address form the tag field, which is compared against the tags of all cache blocks in the set located via the index. The tag field holds part of the global memory address to be accessed by the memory access request. Thus, when this example L1 cache receives a memory access request, the corresponding set can be located via the index field of the request's address, and the tag field of the access address is compared with the tag of each cache block contained in that set. If a matching tag exists, it is a cache hit, and the corresponding data can be extracted from the cache block corresponding to the matched tag according to the offset field of the access address and returned to the processor. If none match, the data to be accessed has not yet been cached in the L1 cache (a "miss"). On a cache miss, the L1 cache's controller allocates a cache block and a corresponding tag for the data to be accessed, loads the data from main memory into the allocated cache line, and saves the tag field corresponding to the access address in that cache block's tag.
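A minimal sketch of this address decomposition, using the field widths of the worked example above (64-byte cache, 8-byte blocks, two sets, so offset = 3 bits and index = 1 bit); the function and type names are illustrative, not part of the described design.

```cpp
#include <cstdint>
#include <cstdio>

// Field widths follow the worked example above.
constexpr unsigned kOffsetBits = 3;  // log2(8-byte block)
constexpr unsigned kIndexBits  = 1;  // log2(2 sets)

struct Fields { uint64_t tag, index, offset; };

Fields decompose(uint64_t addr) {
    Fields f;
    f.offset = addr & ((1ull << kOffsetBits) - 1);                    // lowest bits
    f.index  = (addr >> kOffsetBits) & ((1ull << kIndexBits) - 1);    // next bit(s)
    f.tag    = addr >> (kOffsetBits + kIndexBits);                    // remainder
    return f;
}

int main() {
    Fields f = decompose(0x1A5);  // arbitrary example address
    std::printf("tag=0x%llx index=%llu offset=%llu\n",
                (unsigned long long)f.tag, (unsigned long long)f.index,
                (unsigned long long)f.offset);
    return 0;
}
```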

In existing L1 caches, the tag storage locations in the tag storage unit are fixedly bound to the cache blocks (i.e. the data storage locations) in the data storage unit. When the data storage space of the L1 cache changes (for example, when the locations of the cache blocks are moved or changed), the L1 cache can no longer address the corresponding data by tag and index. Therefore, a traditional L1 cache cannot support dynamic changes in the storage space size.

In addition, the above analysis shows that the number of bits occupied by the offset field in the L1 cache's memory access address is determined by the size of the cache block, while the tag field corresponds to part of the global memory address to be accessed by the memory access request and is likewise set in advance; only the number of bits occupied by the index field changes with the L1 cache size, and changes to the index field do not affect external components such as the global memory, affecting only the internal addressing of the L1 cache. For example, for a cache block size of 128 B and an 8-way set-associative cache (i.e. each set contains 8 cache blocks), the offset determined by the cache block size occupies the lowest 7 bits of the access address (bits [6:0]); when the cache size is 32 KB (i.e. 128*8*32), there are 32 sets in total and the index part of the access address occupies 5 bits, namely bits [11:7]; if the cache size becomes 64 KB (128*8*64), there are 64 sets in total and the index part occupies 6 bits, namely bits [12:7]. That is, when the size of the L1 cache's data storage space changes, the size of the index field in the access address must be changed accordingly for correct addressing.
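The index-width arithmetic above can be checked with a few lines of code; the helper names are illustrative. The program reproduces 5 index bits at 32 KB and 6 at 64 KB for 128-byte blocks and 8 ways.

```cpp
#include <cstdio>

// log2 for power-of-two inputs.
unsigned log2u(unsigned v) { unsigned b = 0; while (v > 1) { v >>= 1; ++b; } return b; }

// sets = cache_size / (block_size * ways); index bits = log2(sets).
unsigned index_bits(unsigned cache_bytes, unsigned block_bytes, unsigned ways) {
    return log2u(cache_bytes / (block_bytes * ways));
}

int main() {
    std::printf("32KB -> %u index bits\n", index_bits(32 * 1024, 128, 8));  // 5
    std::printf("64KB -> %u index bits\n", index_bits(64 * 1024, 128, 8));  // 6
    return 0;
}
```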

Figure 1 shows a schematic block diagram of the structure of a cache memory 100 according to one embodiment of the present application. The cache memory 100 includes a controller 101, a tag storage unit 102 for saving tags, a data storage unit 103 composed of multiple cache blocks, and a mapping unit 104. Unlike existing cache memories, in which tag storage locations are fixedly bound to cache blocks, in the cache memory 100 the mapping between tag storage locations and data storage locations (i.e. cache blocks) is dynamic. That is, each tag storage location is no longer fixedly bound to a particular cache block, but can be dynamically mapped to, or bound to, any cache block. In this embodiment, the mapping relationship between tag storage locations and cache blocks is saved in the mapping unit 104. The mapping unit 104 may, for example, save a one-to-one mapping between tag serial numbers and cache block serial numbers in the form of a table. A tag serial number indicates the location of each tag saved in the tag storage unit 102. A cache block serial number indicates the location of each cache block in the data storage unit 103.

Figure 2 shows a schematic diagram of the mapping relationship between tag storage locations and cache blocks according to one example of the present application. As shown in Figure 2, there are k+1 tag storage locations in the tag storage unit and n+1 cache blocks in the data storage unit, where n and k are both natural numbers and n is greater than or equal to k. For example, the 1st tag t0 is currently mapped to the 6th cache block d5, the 2nd tag t1 is currently mapped to the 9th cache block d8, ..., and the (k+1)th tag is currently mapped to the 24th cache block d23. It can be seen that the mapping relationship saved by the mapping unit 104 in fact embodies or reflects the mapping between the tags currently saved at the individual storage locations of the tag storage unit 102 and the data blocks currently saved in the corresponding cache blocks of the data storage unit 103. When the location and storage space size of the L1 cache's data storage unit 103 change, adjusting the mapping, established by the mapping unit 104, between the storage locations of the tags and the storage locations of their corresponding data allows each tag in the tag storage unit 102 to be mapped to a cache block of the data storage unit 103 at its new location. Through this dynamic mapping and storage approach, the number of retrievable tags and cache blocks can be changed or adjusted, thereby supporting dynamic changes in the storage space of the L1 cache. Moreover, the L1 cache can be moved or allocated to any part of the shared storage space: as long as the starting position of the L1 cache in the shared storage space is given, the data storage location corresponding to each tag storage location can be relocated by adjusting the mapping unit 104.
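A plain-array rendering of the mapping unit of Figure 2, shown only as a sketch: one entry per tag storage location holding the serial number of the cache block the tag is currently bound to. The data layout and names are assumptions for illustration, not the hardware structure itself.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical software stand-in for the mapping unit 104.
struct MappingUnit {
    std::vector<int> block_of_tag;  // block_of_tag[t] = block bound to tag slot t

    explicit MappingUnit(int num_tags) : block_of_tag(num_tags, -1) {}
    void bind(int tag_slot, int block) { block_of_tag[tag_slot] = block; }
    int  lookup(int tag_slot) const   { return block_of_tag[tag_slot]; }
};

int main() {
    MappingUnit map(3);
    map.bind(0, 5);  // tag t0 -> block d5, as in Figure 2
    map.bind(1, 8);  // tag t1 -> block d8
    std::printf("t0 -> d%d, t1 -> d%d\n", map.lookup(0), map.lookup(1));
    return 0;
}
```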

In some embodiments, the number of cache blocks in the data storage unit 103 may be greater than the number of tags contained in the tag storage unit 102. Each cache block may be bound to one tag or not bound to any tag. Each cache block is provided with a tag binding bit, which indicates whether the cache block is bound to a tag; for example, when a cache block is bound to some tag, its tag binding bit can be set to 1, y or T, and when the cache block is no longer bound to any tag, its tag binding bit can be set to 0, n or F. Only when a cache block is not bound to any tag is it allowed to release its data resources, i.e. it can participate in data replacement and be used to store new data. In some embodiments, each cache block may further be provided with a status bit indicating whether operations on the cache block have been completed. For example, when the controller 101 determines that all read and write operations on the data currently stored in a cache block have been completed, the status bit of that cache block can be set to 1, y or T; otherwise, its status bit can be set to 0, n or F. Only when a cache block is not bound to any tag and all read operations on it have been completed is it allowed to release its data resources, i.e. it can participate in data replacement and be used to store new data.
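The reclaim rule above can be summarized in a short sketch; the field names (bound_to_tag, ops_done) are assumptions, and the victim selection is reduced to a simple linear scan for illustration.

```cpp
#include <cstdio>
#include <vector>

// Per-block metadata: a block may be reclaimed only when it is bound to
// no tag and all outstanding operations on it are finished.
struct CacheBlockMeta {
    bool bound_to_tag = false;  // tag binding bit
    bool ops_done     = true;   // status bit: all reads/writes completed

    bool reclaimable() const { return !bound_to_tag && ops_done; }
};

// Return the first reclaimable block, or -1 if none is currently free.
int find_victim(const std::vector<CacheBlockMeta>& blocks) {
    for (size_t i = 0; i < blocks.size(); ++i)
        if (blocks[i].reclaimable()) return static_cast<int>(i);
    return -1;
}

int main() {
    std::vector<CacheBlockMeta> blocks(4);
    blocks[0].bound_to_tag = true;                    // block 0 still bound
    std::printf("victim=%d\n", find_victim(blocks));  // prints 1
    return 0;
}
```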

In this embodiment, the introduction of the mapping unit 104 achieves dynamic mapping, or dynamic binding, between tags and the data in cache blocks. The tags and the data in the cache blocks need not be updated synchronously. For example, when the tag saved at some storage location of the tag storage unit 102 is replaced with a new tag, a new cache block can be allocated in the data storage unit 103 for the data corresponding to the new tag, and the mapping between the new tag and the newly allocated cache block can be established in the mapping unit 104, while the data in the cache block corresponding to the old tag originally saved at that storage location remains in the data storage unit 103. Correspondingly, when the storage space of the L1 cache changes, for example when the data storage unit 103 of the L1 cache is moved or allocated to another location in the shared storage space, cache blocks can be reallocated starting from the given starting position of the L1 cache in the shared storage space, as long as the data storage location corresponding to each tag storage location is relocated accordingly in the mapping unit 104. It should be understood, however, that when the storage space of the L1 cache changes or is adjusted according to the user's configuration, what is located or looked up based on the mapping relationship re-established by the mapping unit 104 is the data in the newly allocated cache blocks; the data saved in the L1 cache before the storage space adjustment is not retained, and the space it occupied can be taken over by other data.

In some embodiments, the mapping unit 104 may be implemented with a random access memory such as SRAM or DRAM, on which a data structure such as an array or a linked list saves the one-to-one mapping between tag serial numbers and cache block serial numbers. Taking an array as an example, the number of elements in the array is the number of tags that can be stored in the tag storage unit 102. The first element of the array holds the serial number of the cache block currently corresponding to the first tag in the tag storage unit 102, and so on. In still other embodiments, the mapping unit 104 may be implemented in the form of registers; for example, the mapping unit 104 may be implemented as a group of registers, each register corresponding to the storage location of a tag in the tag storage unit 102, with the value of each register being the serial number of the cache block corresponding to the tag at that location. A mapping unit implemented in register form can further reduce the cost and area occupied by storing the mapping relationships in the L1 cache and increase the speed of resolving the mapping relationship between tags and cache blocks.

Continuing with Figure 1, when the cache memory 100 receives a memory access request sent by the memory access control unit LSU, the controller 101 parses the memory access address contained in the received request. It locates the corresponding set according to the index field of the access address and then compares the tag field of the request with the tags contained in the located set. If a matching tag is found, it is a cache hit, indicating that the data to be accessed by the request has already been cached in the cache memory. If no matching tag is found after all tags have been compared, the data to be accessed by the request has not yet been cached, and the controller 101 must read the data to be accessed into the cache memory 100 from the next-level memory (for example, the L2 cache or the main memory).

In the case of a cache hit, the controller 101 determines, according to the mapping relationship saved in the mapping unit 104, the data storage location corresponding to the hit tag (i.e. a particular cache block in the data storage unit 103), extracts from the corresponding cache block, according to the offset field of the access address, the data to be accessed by the request, and returns it to the processor's memory access control unit LSU as the response to the request.
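For illustration, a sketch of this hit path: compare the request's tag field against every tag in the indexed set and, on a match, follow the mapping unit to the bound block before applying the offset. The structures here are simplified, hypothetical stand-ins for the hardware units named above.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct HitResult { bool hit; uint8_t data; };

HitResult probe(const std::vector<uint64_t>& set_tags,           // tags of one set
                const std::vector<int>& map,                     // tag slot -> block
                const std::vector<std::vector<uint8_t>>& blocks, // data store
                uint64_t req_tag, uint64_t offset) {
    for (size_t t = 0; t < set_tags.size(); ++t)
        if (set_tags[t] == req_tag)                 // tag match => cache hit
            return {true, blocks[map[t]][offset]};  // dynamic tag->block lookup
    return {false, 0};                              // miss: controller must fetch
}

int main() {
    std::vector<uint64_t> tags = {0x1A, 0x2B};
    std::vector<int> map = {5, 8};                  // as in Figure 2
    std::vector<std::vector<uint8_t>> blocks(16, std::vector<uint8_t>(8, 0));
    blocks[5][3] = 99;
    HitResult r = probe(tags, map, blocks, 0x1A, 3);
    std::printf("hit=%d data=%d\n", r.hit, r.data);  // hit=1 data=99
    return 0;
}
```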

In the case of a cache miss, the controller 101 allocates a tag for the memory access request, for example taking the tag portion of the access address contained in the request as the newly allocated tag, and saves the newly allocated tag in the tag storage unit 102; this requires replacing an original tag saved at one of the storage locations of the tag storage unit 102 with the newly allocated tag, thereby updating the tag. In effect, the tag storage unit 102 allocates a storage location for the request's tag. At the same time, the controller 101 must also allocate a cache block in the data storage unit 103 for the request, so that the data to be accessed, once read from the next-level memory, can be stored. To establish the correspondence between the tag and the cache block allocated to the request, the controller 101 must also update the tag-to-cache-block mapping in the mapping unit 104, so that a mapping is established between the tag allocated to the request in the tag storage unit 102 and the cache block allocated to the request in the data storage unit 103. For example, according to the serial number of the tag's storage location in the tag storage unit 102, the controller looks up the cache block serial number corresponding to that tag serial number in the mapping unit 104, sets the tag binding bit of the cache block in the data storage unit 103 corresponding to the found serial number to indicate that it is not bound to a tag, and replaces the found cache block serial number with the serial number, in the data storage unit 103, of the cache block allocated to the request. After the corresponding mapping has been established in the mapping unit 104, the tag binding bit of the cache block allocated to the request is set to indicate that it is bound to a tag. The data to be accessed by the request can then be read from the next-level memory and saved in the cache block allocated for the request.
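The miss path above can be sketched as follows, reduced to a single set and a caller-supplied victim tag slot. The names, the 128-byte line size and the linear block search are assumptions; a real controller would also honor the status bit before reusing a block.

```cpp
#include <cstdint>
#include <vector>

struct Block { bool bound = false; std::vector<uint8_t> data; };

struct MiniCache {
    std::vector<uint64_t> tags;    // tag storage unit (one set)
    std::vector<int>      map;     // mapping unit: tag slot -> block index
    std::vector<Block>    blocks;  // data storage unit (may exceed tag count)

    // Stand-in for fetching a line from the next-level memory.
    std::vector<uint8_t> fetch_line(uint64_t) { return std::vector<uint8_t>(128, 0); }

    void handle_miss(uint64_t new_tag, int victim_slot) {
        // 1) select a cache block that is not bound to any tag
        int nb = -1;
        for (size_t i = 0; i < blocks.size(); ++i)
            if (!blocks[i].bound) { nb = static_cast<int>(i); break; }
        if (nb < 0) return;  // no free block; a real controller would wait

        // 2) unbind the block previously mapped to the victim tag slot
        int old_block = map[victim_slot];
        if (old_block >= 0) blocks[old_block].bound = false;

        // 3) store the new tag, map its slot to the new block, set its bit
        tags[victim_slot] = new_tag;
        map[victim_slot]  = nb;
        blocks[nb].bound  = true;

        // 4) fill the new block from the next-level memory
        blocks[nb].data = fetch_line(new_tag);
    }
};

int main() {
    MiniCache c{{0}, {0}, std::vector<Block>(2)};
    c.blocks[0].bound = true;   // block 0 currently bound to tag slot 0
    c.handle_miss(0x42, 0);     // slot 0 is rebound to block 1
    return c.map[0] == 1 ? 0 : 1;
}
```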

Figure 3 shows a method for dynamically sharing storage space in a parallel processor according to one embodiment of the present application, which mainly includes the following steps:

In step S1), the memory access control unit of the processor updates, according to the received settings of the local memory size and the cache memory size, the respective starting positions of the local memory and the cache memory in the processor's memory. This embodiment allows a user who writes applications on a GPGPU processor in programming languages such as CUDA or OpenCL to re-adjust the storage space sizes of the local memory and the L1 cache in the processor according to the selected programming model, based on either local memory or global memory, so as to better improve the processor's execution performance for the application. If the user currently chooses local-memory-based programming, the storage space of the local memory in the processor can be enlarged appropriately; conversely, if the user currently chooses global-memory-based programming, the storage space of the L1 cache can be enlarged appropriately. To improve the processor's execution performance for applications without increasing chip area and hardware cost, in this embodiment the storage spaces of the local memory and the L1 cache both come from the same random access memory (RAM) in the processor, and the sizes of the storage spaces they occupy are not fixed but can change dynamically with the configuration provided by the user. For example, when the user chooses local-memory-based programming, the share of this memory occupied by the local memory can be enlarged; when the user chooses global-memory-based programming, the share occupied by the L1 cache can be enlarged. The user can provide the current settings of the local memory size and the cache memory size to the processor's memory access control unit, for example by calling a configuration interface provided by the processor, supplying a configuration file, selecting corresponding configuration options, or sending configuration commands. According to the received settings of the local memory size and the cache memory size, the processor's memory access control unit can partition storage spaces of the corresponding sizes out of the processor's memory to serve as the local memory and the cache memory respectively, thereby reallocating the shared memory. In fact, for the memory shared by the local memory and the L1 cache, each repartitioning of the storage space occupied by the local memory and the cache memory can be accomplished simply by determining the new starting positions of the local memory and the cache memory in the processor's memory.

In some embodiments, when the processor's memory access control unit partitions, according to the received settings of the local memory size and the cache memory size, storage spaces of the corresponding sizes out of the processor's memory to serve as the local memory and the cache memory respectively, the storage space of the local memory is allocated first and the storage space of the L1 cache afterwards. Moreover, the local memory is always allocated a storage space of the corresponding size starting from a preset address in the processor's memory (for example, the starting address of that memory), so that the local memory occupies the low-address portion of the shared memory space. In this way, the starting position of the local memory never needs to change on an update; the new starting position of the cache memory in the processor's memory can be determined simply from the newly set local memory size. This not only simplifies the storage space management flow, but also ensures that when the size of the local memory is updated, the data in the part of the local memory shared before and after the update is not lost. As mentioned above, this is not the case for the L1 cache: when its data storage space size is updated, the number of cache blocks involved changes and the index field of the access address changes as well, so the data saved in the L1 cache before the update is released or cleared, and the updated L1 cache cannot locate data cached before the update. For the L1 cache after the storage space adjustment, accessing a particular storage location in the L1 cache based on the L1 cache's starting position in the RAM amounts to adding an offset to the original address, this offset being the space size allocated to the local memory.
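A short sketch of this layout convention: local memory is always placed first at a preset base of the shared RAM, and the L1 data store follows, so only the cache's starting position moves on a resize. The structure and function names are illustrative only.

```cpp
#include <cstdint>
#include <cstdio>

struct Layout { uint64_t local_start, local_size, cache_start, cache_size; };

Layout partition(uint64_t ram_base, uint64_t local_size, uint64_t cache_size) {
    // The cache's offset from the base equals the local memory size.
    return Layout{ram_base, local_size, ram_base + local_size, cache_size};
}

int main() {
    Layout a = partition(0, 96 * 1024, 32 * 1024);  // local-memory-heavy config
    Layout b = partition(0, 64 * 1024, 64 * 1024);  // cache-heavy config
    std::printf("cache start moves: %llu -> %llu\n",
                (unsigned long long)a.cache_start, (unsigned long long)b.cache_start);
    return 0;
}
```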

In step S2), the memory access control unit of the processor updates, according to the received setting of the cache memory size, the setting of the size of the index field in memory access addresses for the cache memory. As mentioned above, the number of bits occupied by the offset field in the L1 cache's access address is determined by the cache block size; the tag field corresponds to part of the global memory address to be accessed and is likewise set in advance; only the number of bits occupied by the index field changes with the L1 cache size, and changes to the index field affect only the internal addressing of the L1 cache. With the cache block size and the number of tags per set unchanged, when the size of the L1 cache's data storage space changes, the size of the index field in the access address is changed accordingly for correct addressing. For example, for a cache block size of 128 B and an 8-way set-associative cache (i.e. each set contains 8 cache blocks), when the cache size is 32 KB (i.e. 128*8*32), there are 32 sets in total and the index field of the access address occupies 5 bits; if the cache size becomes 64 KB (128*8*64), there are 64 sets in total and the index field occupies 6 bits. That is, given the cache block size and the number of tags per set, the number of bits occupied by the index field is set according to the number of sets of the L1 cache, which can be obtained by dividing the cache size by the cache block size and the number of tags per set. When the processor's memory access control unit determines the index field size in the L1 cache's access address, this also means that the new number of sets held by the L1 cache has been determined. In one example, the memory access control unit may include parameters such as the starting position of the cache memory in the processor's memory determined in step S1), the setting of the cache memory size, and the new number of sets determined in step S2) in a notification message indicating the space adjustment, and send them together to the cache memory, so that the cache memory can re-adjust and configure its data storage space.
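For illustration, the space-adjustment notification described in this example could carry fields like the following; the field and function names are assumptions, not the actual interface of the design.

```cpp
#include <cstdint>

struct ResizeNotice {
    uint64_t cache_start;  // new starting position in the shared RAM (step S1)
    uint64_t cache_size;   // newly configured data-store size in bytes
    uint32_t num_sets;     // cache_size / (block_size * ways)
    uint32_t index_bits;   // log2(num_sets): updated index field width (step S2)
};

ResizeNotice make_notice(uint64_t start, uint64_t size,
                         uint32_t block_bytes, uint32_t ways) {
    uint32_t sets = static_cast<uint32_t>(size / (block_bytes * ways));
    uint32_t bits = 0;
    for (uint32_t s = sets; s > 1; s >>= 1) ++bits;  // log2 of power-of-two sets
    return ResizeNotice{start, size, sets, bits};
}

int main() {
    ResizeNotice n = make_notice(64 * 1024, 64 * 1024, 128, 8);
    return n.index_bits == 6 ? 0 : 1;  // 64 KB -> 64 sets -> 6 index bits
}
```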

Continuing to refer to Figure 3, in step S3), when the cache memory receives the notification message instructing the space adjustment from the processor's memory access control unit, it can, starting from the cache memory's starting position in the processor's memory, determine in sequence the new data storage position in the processor's memory corresponding to each set and to each cache block contained in it. As mentioned above, each cache block in the cache memory's data storage unit must have a corresponding tag storage location in order to be addressed correctly. Therefore, in step S3) a new mapping relationship also needs to be established between these redetermined cache block positions and the tag storage locations of the cache memory's tag storage unit. That is, a new mapping is established between each tag storage location in the cache's tag storage unit and each new data storage position in the resized data storage unit, so that the cache block corresponding to each tag can be located based on the newly established mapping.
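A minimal sketch of how the rebinding in step S3) could proceed, under the simplifying assumption that the mapping unit can be modeled as a flat array with one entry per tag storage location; all names here are hypothetical.

```c
#include <stdint.h>

enum { MAX_TAG_SLOTS = 512 }; /* illustrative upper bound on tag slots */

/* Hypothetical model: one entry per tag storage location, holding the
 * RAM address of the cache block that tag slot is now bound to. */
typedef struct {
    uint32_t block_addr[MAX_TAG_SLOTS];
} MapUnit;

/* Walk the resized cache from its new starting position in the RAM, set
 * by set and way by way, and rebind each tag slot to the block now
 * backing it. Assumes sets * ways <= MAX_TAG_SLOTS. */
static void rebuild_mapping(MapUnit *map, uint32_t cache_base,
                            uint32_t sets, uint32_t ways, uint32_t block_size)
{
    for (uint32_t s = 0; s < sets; s++)
        for (uint32_t w = 0; w < ways; w++) {
            uint32_t slot = s * ways + w;
            map->block_addr[slot] = cache_base + slot * block_size;
        }
}
```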

Through the above steps, the user can at any time dynamically adjust how much of the processor's memory is allocated to the local memory or to the L1 cache according to the actual application. When the processor's memory access control unit receives a memory access request for the local memory, it can use the updated starting position of the local memory to locate the data to be accessed by the request. When the memory access control unit receives a memory access request for the global memory, it first maps the address in the request to a cache memory access address, which uses the updated index field size, and sends it to the cache memory. On receiving the access request from the memory access control unit, the cache memory can locate the data to be accessed based on the newly established mapping between tag storage locations and new data storage positions. Under such a scheme, the same random access memory (RAM) space in the processor can be time-shared by the local memory and the L1 cache, thereby reducing chip area and hardware cost. Since the storage space of the local memory or the L1 cache typically accounts for about 80% of the area cost, adopting the dynamically shared storage space scheme can save roughly 40% of the area cost. Moreover, the scheme allows users, when writing an application, to readjust the sizes of the local memory and the L1 cache in the processor according to the chosen local-memory-based or global-memory-based programming model, thereby better improving the processor's execution performance for the application.
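For illustration, the address mapping performed by the memory access control unit can be pictured as a plain tag/index/offset split that uses the updated index width; the sketch below is a model built on assumed names (CacheAddr, split_addr), not the application's actual hardware logic.

```c
#include <stdint.h>

typedef struct { uint32_t tag, index, offset; } CacheAddr;

/* Decompose a global memory address using the updated index width; the
 * offset width is fixed by the cache block size (e.g. 7 bits for 128 B),
 * and the tag is whatever remains above index and offset. */
static CacheAddr split_addr(uint32_t addr, uint32_t index_bits,
                            uint32_t offset_bits)
{
    CacheAddr a;
    a.offset = addr & ((1u << offset_bits) - 1u);
    a.index  = (addr >> offset_bits) & ((1u << index_bits) - 1u);
    a.tag    = addr >> (offset_bits + index_bits);
    return a;
}
```

When the cache is resized from 32 KB to 64 KB in the earlier example, only the index_bits argument changes from 5 to 6; the offset width stays fixed, which is why the previously cached contents can no longer be located and must be cleared.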

In some embodiments, after the storage space of the L1 cache has been adjusted through the above steps, when the L1 cache receives a memory access request from the memory access control unit, its controller parses the access address contained in the received request. The corresponding index field is extracted from the access address according to the updated index field size, the corresponding set is located according to the extracted index field, and the tag field of the request is then compared with each tag contained in the located set. If a matching tag is found, it is a cache hit, meaning that the data targeted by the access request is already cached in the L1 cache. If no matching tag is found after all the tags have been compared, the requested data has not yet been cached in the L1 cache, and the L1 cache needs to read that data from the next-level memory (for example, an L2 cache or main memory) into the L1 cache. In the case of a cache hit, the cache block corresponding to the hit tag is determined from the tag-to-cache-block mapping newly established in step S3), and the requested data is extracted from that cache block according to the offset field in the access address and returned to the memory access control unit as the response to the request. In the case of a cache miss, the L1 cache's controller replaces the tag held in one of the tag storage locations of the tag storage unit with the tag field of the request's access address, determines which cache block the selected tag storage location corresponds to from the tag-to-cache-block mapping newly established in step S3), and then fetches the requested data from the next-level memory and stores it in that cache block.
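The hit/miss flow just described can be sketched as follows; the tag store layout and all names are illustrative assumptions, and a real controller would compare the tags of a set in parallel rather than in a loop.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t tag, index, offset; } CacheAddr;

/* Hypothetical tag store plus mapping unit, one entry per tag slot;
 * valid[] marks slots currently holding a live tag. */
typedef struct {
    uint32_t tags[512];
    bool     valid[512];
    uint32_t block_addr[512]; /* tag slot -> RAM address of its cache block */
} L1State;

/* Returns true on a hit and writes the RAM address of the requested byte
 * (block base from the step S3) mapping, plus the offset field) to *out.
 * On a miss the controller picks a victim slot, installs the new tag, and
 * fetches the line from the next-level memory (L2 or main memory). */
static bool lookup(const L1State *c, CacheAddr a, uint32_t ways, uint32_t *out)
{
    for (uint32_t w = 0; w < ways; w++) {
        uint32_t slot = a.index * ways + w;  /* tag slots of the located set */
        if (c->valid[slot] && c->tags[slot] == a.tag) {
            *out = c->block_addr[slot] + a.offset;
            return true;                     /* hit */
        }
    }
    return false;                            /* miss */
}
```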

In some embodiments in which the cache blocks of the L1 cache are provided with tag binding bits, in the case of a cache miss the L1 cache's controller allocates a tag storage location in the tag storage unit for the access request, to hold the tag field of the request's access address, and selects one of the cache blocks in the data storage unit that are not bound to any tag to allocate to the request; next, in the mapping unit, the tag binding bit of the cache block previously corresponding to the allocated tag storage location is set to indicate that it is not bound to a tag; then a mapping relationship is established between that tag storage location and the cache block allocated to the request, and the tag binding bit of the cache block allocated to the request is set to indicate that it is bound to a tag; and the data to be accessed by the request is fetched from the next-level memory and stored in the cache block allocated for the request.
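A sketch of this tag binding bit bookkeeping follows, modeling the binding bit as a per-block flag; the data structure and names are hypothetical and serve only to make the rebinding order concrete.

```c
#include <stdbool.h>
#include <stdint.h>

enum { SLOTS = 512, BLOCKS = 512 };

typedef struct {
    uint32_t tags[SLOTS];
    int      bound_block[SLOTS]; /* tag slot -> cache block index, -1 if none */
    bool     tag_bound[BLOCKS];  /* per-block tag binding bit                 */
} L1Map;

/* On a miss: install the new tag in the chosen slot, pick a block not yet
 * bound to any tag, unbind the slot's old block, bind the new one, then
 * fetch the requested line into the new block. Returns the block index,
 * or -1 if no unbound block exists in this simplified model. */
static int handle_miss(L1Map *m, uint32_t slot, uint32_t new_tag)
{
    int blk = -1;
    for (int b = 0; b < BLOCKS; b++)       /* find a block not bound to a tag */
        if (!m->tag_bound[b]) { blk = b; break; }
    if (blk < 0) return -1;

    if (m->bound_block[slot] >= 0)         /* unbind the slot's old block */
        m->tag_bound[m->bound_block[slot]] = false;

    m->tags[slot]        = new_tag;        /* install the new tag          */
    m->bound_block[slot] = blk;            /* map slot -> new block        */
    m->tag_bound[blk]    = true;           /* mark the block as bound      */
    /* ...fetch the requested line from next-level memory into block blk... */
    return blk;
}
```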

In still other embodiments of the present application, a processor supporting a dynamically shared storage space is provided. Apart from the memory access control unit, the memory, and the cache memory, its remaining components are the same as those of an existing processor and are not described again here. In this embodiment, the cache memory is the cache memory described above in connection with Figures 1 and 2, which includes a controller, a tag storage unit for holding tags, a data storage unit composed of a plurality of cache blocks, and a mapping unit. One part of the storage space of the processor's memory is used as local memory, while another part is used as the data storage unit of the cache memory. The memory may be implemented in the form of a random access memory (RAM) such as SRAM or DRAM. The memory access control unit is configured to: update the starting positions of the local memory and the cache memory in the processor's memory respectively according to the received settings for the local memory size and the cache memory size; and update the setting of the size of the index field in the cache memory access address according to the received setting of the cache memory size. For details, refer to the description of steps S1) and S2) above. The controller of the cache memory is configured to: determine, according to the cache memory size and its starting position in the processor's memory provided by the memory access control unit, the new data storage position in the memory corresponding to each cache block, and establish in the mapping unit a mapping between each tag storage location in the tag storage unit and the new data storage position of each cache block. For details, refer to the description of step S3) above.

It should be understood that the modules mentioned herein, such as the memory access control unit in the processor and the controller in the cache memory, and the method steps they perform, can be implemented not only as pure computer-readable program code; it is entirely possible to logically program the corresponding functional modules, processes, or steps so that these modules implement the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, a controller or memory access control unit implemented in this way can be regarded as a hardware component, and the means included within it for implementing various functions can also be regarded as structures internal to the hardware component. Or even, the means for implementing various functions can be regarded both as software modules implementing the relevant processes or method steps and as structures within the hardware component.

References in this specification to "various embodiments", "some embodiments", "one embodiment", or "an embodiment" and the like mean that a particular feature, structure, or property described in connection with the embodiment is included in at least one embodiment. Therefore, appearances of the phrases "in various embodiments", "in some embodiments", "in one embodiment", or "in an embodiment" in various places throughout this specification do not necessarily refer to the same embodiment. Furthermore, particular features, structures, or properties may be combined in any suitable manner in one or more embodiments. Therefore, a particular feature, structure, or property shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or properties of one or more other embodiments without limitation, provided that the combination is not illogical or inoperative.

In this specification, the terms "comprising" and "having" and expressions of similar meaning are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but may optionally also include steps or units that are not listed, or optionally also include other steps or units inherent to such a process, method, product, or device. "A" or "an" does not exclude a plurality. In addition, the elements in the drawings of the present application are for schematic illustration only and are not drawn to scale.

Although the present application has been described through the above embodiments, the present application is not limited to the embodiments described here, and also encompasses various modifications and variations made without departing from the scope of the present application.

Claims (10)

1. A method for dynamically sharing storage space in a parallel processor, comprising:

updating, by a memory access control unit of the processor, the starting positions of a local memory and a cache memory in a memory of the processor respectively according to received settings for a local memory size and a cache memory size, wherein one part of the storage space of the memory is used as the local memory and another part is used as a data storage unit of the cache memory;

updating, by the memory access control unit of the processor, the setting of the size of an index field in a memory access address for the cache memory according to the received setting of the cache memory size; and

determining, by the cache memory according to the cache memory size and its starting position in the processor's memory provided by the memory access control unit, a new data storage position in the memory corresponding to each cache block, and establishing a mapping between each tag storage location and the new data storage position of each cache block.

2. The method of claim 1, further comprising: according to the settings for the local memory size and the cache memory size received by the memory access control unit of the processor, first partitioning, starting from a preset address in the processor's memory, a storage space of the corresponding size as the local memory, and then allocating storage space for the cache memory, wherein the starting position of the local memory is the preset address.

3. The method of claim 1, wherein the cache memory is a set-associative cache, and wherein the size of the index field in the memory access address for the cache memory is determined based on the result obtained by dividing the cache memory size by the cache memory's preset cache block size and the number of tags contained in each set.

4. The method of claim 1, further comprising: in response to receiving a memory access request for the local memory, locating, by the memory access control unit of the processor, the data to be accessed by the request using the updated starting position of the local memory.

5. The method of any one of claims 1-4, further comprising:

in response to receiving a memory access request for the global memory, mapping, by the memory access control unit of the processor, the address in the request to a memory access address of the cache memory and sending it to the cache memory, wherein the memory access address uses the updated index field size; and

in response to receiving the memory access request from the memory access control unit, locating, by the cache memory, the data to be accessed by the request according to the established mapping between tag storage locations and new data storage positions.

6. The method of claim 5, wherein locating, by the cache memory in response to receiving the memory access request from the memory access control unit, the data to be accessed by the request according to the established mapping between tag storage locations and new data storage positions comprises:

on a cache hit, determining the cache block corresponding to the hit tag from the established mapping between tag storage locations and new data storage positions, and extracting from that cache block the data to be accessed by the request as the response to the request;

on a cache miss, performing the following operations:

allocating a tag storage location for the request to hold the tag field in the request's memory access address, and selecting one of the cache blocks in the cache memory's data storage unit that are not bound to any tag to allocate to the request;

setting the tag binding bit of the cache block previously corresponding to the allocated tag storage location to indicate that it is not bound to a tag, then establishing a mapping relationship between that tag storage location and the cache block allocated to the request, and setting the tag binding bit of the cache block allocated to the request to indicate that it is bound to a tag; and

fetching the data to be accessed by the request from a next-level memory and storing it in the cache block allocated for the request.

7. A processor supporting a dynamically shared storage space, comprising a memory access control unit, a memory, and a cache memory, the cache memory comprising a controller, a tag storage unit for holding tags, a data storage unit composed of a plurality of cache blocks, and a mapping unit; wherein one part of the storage space of the memory is used as local memory and another part is used as the data storage unit of the cache memory, and wherein:

the memory access control unit is configured to: update the starting positions of the local memory and the cache memory in the processor's memory respectively according to received settings for a local memory size and a cache memory size; and update the setting of the size of an index field in a memory access address for the cache memory according to the received setting of the cache memory size; and

the controller of the cache memory is configured to: determine, according to the cache memory size and its starting position in the processor's memory provided by the memory access control unit, a new data storage position in the memory corresponding to each cache block, and establish, in the mapping unit, a mapping between each tag storage location in the tag storage unit and the new data storage position of each cache block.

8. The processor of claim 7, wherein the memory access control unit is further configured to: according to the received settings for the local memory size and the cache memory size, first partition, starting from a preset address in the memory, a storage space of the corresponding size as the local memory, and then allocate storage space for the cache memory, wherein the starting position of the local memory is the preset address.

9. The processor of claim 7, wherein the memory is implemented in the form of a random access memory.

10. The processor of claim 7, wherein the mapping unit is implemented in the form of registers.
PCT/CN2023/083990 2022-09-02 2023-03-27 Method for dynamically sharing storage space in parallel processor, and corresponding processor Ceased WO2024045585A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211068433.1A CN115168247B (en) 2022-09-02 2022-09-02 Method for dynamically sharing memory space in parallel processor and corresponding processor
CN202211068433.1 2022-09-02

Publications (1)

Publication Number Publication Date
WO2024045585A1 true WO2024045585A1 (en) 2024-03-07

Family

ID=83481167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083990 Ceased WO2024045585A1 (en) 2022-09-02 2023-03-27 Method for dynamically sharing storage space in parallel processor, and corresponding processor

Country Status (2)

Country Link
CN (1) CN115168247B (en)
WO (1) WO2024045585A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119557104A * 2025-01-21 2025-03-04 Shandong Inspur Scientific Research Institute Co., Ltd. A GPGPU global memory allocation method, device and medium
CN120957429A * 2025-10-15 2025-11-14 Xinlai Zhirong Semiconductor Technology (Shanghai) Co., Ltd. Three-dimensional stacked chip system, memory access method and electronic equipment

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115168247B * 2022-09-02 2022-12-02 Beijing Denglin Technologies Co Ltd Method for dynamically sharing memory space in parallel processor and corresponding processor
CN116010109B * 2023-02-23 2023-07-04 Moore Threads Intelligent Technology (Beijing) Co., Ltd. Cache resource allocation method, device, electronic device and storage medium
CN116319665B * 2023-03-03 2024-07-05 Shanghai Flexem Information Technology Co., Ltd. Communication method, device, equipment and medium based on dynamic positioning PLC label address
CN117743248A * 2023-12-22 2024-03-22 Chengdu Beizhong Wangxin Technology Co., Ltd. Method, device, equipment and medium for realizing PCIe configuration space

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120089803A1 (en) * 2010-10-06 2012-04-12 Oracle International Corporation Cache index coloring for virtual-address dynamic allocators
CN103562883A * 2011-05-31 2014-02-05 Micron Technology, Inc. Dynamic memory cache size adjustment in a memory device
CN107408071A * 2015-08-21 2017-11-28 Huawei Technologies Co., Ltd. A memory access method, device and system
US20210240616A1 (en) * 2020-01-31 2021-08-05 Kove Ip, Llc External memory as an extension to local primary memory
CN113424160A * 2019-03-30 2021-09-21 Huawei Technologies Co., Ltd. Processing method, processing device and related equipment
CN113641596A * 2021-10-18 2021-11-12 Beijing Biren Technology Development Co., Ltd. Cache management method, cache management device and processor
CN115168247A * 2022-09-02 2022-10-11 Beijing Denglin Technologies Co Ltd Method for dynamically sharing memory space in parallel processors and corresponding processor

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106155923B * 2015-04-08 2019-04-12 Huawei Technologies Co., Ltd. Method and apparatus for memory sharing
CN106326150B * 2015-06-26 2020-09-15 ZTE Corporation Memory access processing method and device
CN109582214B * 2017-09-29 2020-04-28 Huawei Technologies Co., Ltd. Data access method and computer system
CN112749120B * 2019-10-29 2024-09-20 NVIDIA Corporation Technology that efficiently transfers data to the processor

Also Published As

Publication number Publication date
CN115168247A (en) 2022-10-11
CN115168247B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
WO2024045585A1 (en) Method for dynamically sharing storage space in parallel processor, and corresponding processor
CN115168248B (en) Cache memory supporting SIMT architecture and corresponding processor
US9792221B2 (en) System and method for improving performance of read/write operations from a persistent memory device
CN113138851B (en) A data management method, related device and system
US8266408B2 (en) System and method for storing data in a virtualized high speed memory system
CN100419712C (en) Method and apparatus for dynamic prefetch buffer configuration and replacement
AU2022203960B2 (en) Providing memory bandwidth compression using multiple last-level cache (llc) lines in a central processing unit (cpu)-based system
TWI536258B (en) Method and apparatus for adaptive granularity row-buffer caching, and memory apparatus
US8990505B1 (en) Cache memory bank selection
US11210020B2 (en) Methods and systems for accessing a memory
CN103544269B (en) Methods and node controllers for storing and enquiring directories
JP4006436B2 (en) Multi-level cache with overlapping sets of associative sets at different cache levels
US20110161597A1 (en) Combined Memory Including a Logical Partition in a Storage Memory Accessed Through an IO Controller
JP2019191909A (en) Memory system and control method
CN103345451A (en) Data buffering method in multi-core processor
CN115617709A (en) Cache management method and device, cache device, electronic device and medium
TW202125773A (en) Multi-level memory with improved memory side cache implementation
US6038642A (en) Method and system for assigning cache memory utilization within a symmetric multiprocessor data-processing system
WO2024045817A1 (en) Method for scheduling returned data of simt architecture processor, and corresponding processor
JP2024077215A (en) MEMORY SYSTEM AND CONTROL METHOD - Patent application
CN100407171C (en) Microprocessor and method for setting cache line fill bus access priority
CN119576801A (en) A memory allocation method and related device
EP3436952A1 (en) Providing memory bandwidth compression using compression indicator (ci) hint directories in a central processing unit (cpu)-based system
WO2010098152A1 (en) Cache memory system and cache memory control method
WO2022183571A1 (en) Buffer memory, gpu, processing system and cache access method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 23858627; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 23858627; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1