CN107851006A - Multithreading register mappings - Google Patents
- Publication number
- CN107851006A
- Authority
- CN
- China
- Prior art keywords
- register
- registers
- threads
- thread
- physical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30138—Extension of register space, e.g. register cache
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
- G06F9/384—Register renaming
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Executing Machine-Instructions (AREA)
Abstract
Description
Background
The present invention, in some embodiments thereof, relates to the implementation of multithreading and, more particularly but not exclusively, to architectural register management in multithreaded cores.
CPU cores, especially those aimed at the server market, increasingly support multithreading (MT). Demand for multithreaded cores has been growing rapidly across all server markets, particularly in the context of scale-out applications such as big data.
There are currently three types of MT implementation:
1. Fine-grain MT (FGMT): threads are interleaved on a clock-cycle basis;
2. Simultaneous MT (SMT): threads run at the same time and share all machine resources;
3. Coarse-grain MT (CGMT, also known as switch-on-event MT or SoE MT): a thread runs until it is blocked by an event, typically one that causes a long stall. The next waiting thread that is ready to run then replaces it.
Current MT implementations include:
1. Intel Larrabee (4-way FGMT);
2. Intel Xeon servers (2-way SMT);
3. Intel Itanium Montecito (2-way CGMT).
In MT, each thread carries the full architectural state of the machine. Each architectural register file set (ARF) typically includes:
1. An integer register file (for example, ARMv8 has 31 registers, each 64 bits wide);
2. A floating-point/SIMD register file (for example, ARMv8 has 32 registers, each 128 bits wide);
3. Status registers (for example, ARMv8 has about 6 registers, each 64 bits wide).
When a single chip supports multiple threads, this amount of state is multiplied accordingly. The registers must be available and readily accessible.
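To make the scaling concrete, the following minimal sketch adds up the per-thread state from the ARMv8 figures quoted above and multiplies it by the thread count. The function and dictionary names are illustrative only, and the status-register count is the approximate value ("about 6") given in the text.

```python
# Per-thread architectural state, using the ARMv8 figures quoted in the text.
# All names here are illustrative; the status-register count is approximate.
REGISTER_FILES = {
    "integer": (31, 64),    # (register count, width in bits)
    "fp_simd": (32, 128),
    "status": (6, 64),      # "about 6" in the text
}


def state_bits_per_thread(files=REGISTER_FILES):
    """Total architectural state of one thread, in bits."""
    return sum(count * width for count, width in files.values())


def replicated_state_bytes(num_threads, files=REGISTER_FILES):
    """Storage needed if the whole ARF is duplicated for every thread."""
    return num_threads * state_bits_per_thread(files) // 8


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        print(threads, "thread(s):", replicated_state_bytes(threads), "bytes")
```

Under these figures one thread carries 6464 bits (808 bytes) of register state, so full duplication grows linearly with the number of threads.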
Current MT implementations use one of the following strategies to handle the register file (RF):
1) Duplicating the ARF for each thread. This is used for FGMT and SMT and, in some cases, for CGMT as well (to avoid long switch times). Duplicating the ARF is very wasteful in terms of silicon area and energy consumption.
2) Keeping a single register file set and copying it back and forth on every switch (applicable to CGMT only). This approach is time-consuming: it makes switch times considerably longer, is inefficient, and severely degrades performance.
Summary of the invention
The object of the invention is to improve multithreading.
This object is achieved by the subject matter of the independent claims. The dependent claims protect further embodiments.
Embodiments presented herein map the most recently and/or most frequently used registers of the running (i.e. active) threads into physical registers. The registers of all threads are held in architectural registers, which may be stored in SRAM. When a requested register is not mapped to a physical register, the content of the architectural register is stored in an allocated physical register, possibly replacing previously stored content (for example, that of a suspended thread). This reduces silicon area and energy consumption and shortens switching time.
According to a first aspect of some embodiments of the invention, there is provided a system for processing register access requests. The system includes an interface that receives register access requests, and a processing unit. The processing unit dynamically maps a set of registers from a plurality of architectural registers to at least one of a plurality of physical registers, based on at least one of the recency of use and the access frequency of each architectural register of a plurality of multithreading (MT) threads. For each register access request, when no match is found in the physical registers, the processing unit looks for the match in the architectural registers.
According to the first aspect, in a first possible implementation of the system, the register access requests are submitted by the MT threads, which run in a multithreaded processor.
In a second possible implementation of the system, the register access requests are received via at least one pipeline engine.
In a third possible implementation of the system, the architectural registers are stored in static random access memory (SRAM).
In a fourth possible implementation of the system, the system further includes a memory that stores an access frequency data set. The processing unit updates the access frequency data set with the access frequency of each register and performs the mapping according to the access frequency data set.
In a fifth possible implementation of the system, the system further includes a memory that stores a recently used data set. The processing unit updates the recently used data set with the recency of use and performs the mapping according to the recently used data set.
According to the fifth implementation of the first aspect, in a second possible implementation of the system, the recently used data set includes a plurality of records. Each record logs the most recent use of the architectural registers by a respective one of the MT threads.
According to the fifth implementation of the first aspect, in a third possible implementation of the system, the recently used data set includes the respective allocation status of each of the architectural registers.
According to the fifth implementation of the first aspect, in a fourth possible implementation of the system, the architectural registers are allocated to both suspended and running threads among the MT threads, and the physical registers are allocated to the running threads among the MT threads.
According to the fifth implementation of the first aspect, in a fifth possible implementation of the system, the processing unit updates the recently used data set when the allocation of any one of the physical registers is switched from one of the MT threads to another of the MT threads.
In a sixth possible implementation of the system, the processing unit maps the architectural registers to the MT threads. In a seventh possible implementation of the system, the processing unit switches the mapping of any one of the architectural registers from one of the MT threads to another of the plurality of architectural registers.
In an eighth possible implementation of the system, the processing unit sets the respective status of the physical registers mapped to an active thread to available when the active thread has not switched activation to a different thread.
According to a second aspect of some embodiments of the invention, there is provided a method for processing register access requests. The method includes:
i) receiving a plurality of register access requests;
ii) dynamically mapping a set of registers from a plurality of architectural registers to at least one of a plurality of physical registers, based on at least one of the recency of use and the access frequency of each architectural register of a plurality of multithreading (MT) threads;
iii) for each of the register access requests, looking for a match in the plurality of architectural registers when the requested register is not found in the mapping of the physical registers.
According to the second aspect, in a first possible implementation of the method, the method further includes monitoring at least one of the recency of use and the access frequency by logging the plurality of register access requests received via at least one pipeline engine.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
Brief description of the drawings
Some embodiments of the invention are described herein, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a simplified block diagram of a system for processing register access requests, according to an embodiment of the present invention;
FIG. 2 is a simplified illustration of a register mapping scheme, according to an embodiment of the present invention;
FIG. 3 is a simplified block diagram of a method for processing register access requests, according to an embodiment of the present invention;
FIG. 4 is a simplified block diagram of a method based on thread context switching, according to an embodiment of the present invention;
FIG. 5 is a simplified flowchart of a method for processing register access requests, according to an embodiment of the present invention.
Detailed description
The present invention, in some embodiments thereof, relates to multithreading and, more particularly but not exclusively, to architectural register management in multithreaded cores.
Embodiments of the invention use a register mapping scheme to dynamically map recently and/or frequently used architectural registers into a smaller physical register file set, and fetch register contents as needed.
In some embodiments, when a new register access request is issued, the register map (also referred to herein as the mapping table) is checked to see whether the requested architectural register is mapped to a physical register. When the requested register is present in the physical register file (PRF), the physical register is used for the register access.
When the requested register is not mapped into the PRF, one or more physical registers are written back to the ARF so that registers in the PRF become available for storing other architectural register values. The requested architectural register is then written into a physical register, and the access proceeds from the PRF.
The mapping table is dynamically maintained and updated as needed while physical registers are allocated and reallocated to architectural registers.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include one or more computer-readable storage media having computer-readable program instructions for causing a processor to carry out aspects of the present invention.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of these.
The computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network and/or a wireless network.
The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by using state information of the instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
Some embodiments of the invention are based on a computing system that includes:
1) A memory (for example SRAM) that stores the architectural state of multiple MT threads;
2) Physical registers that store a physical register file (denoted herein as the PRF);
3) A register map that dynamically maps architectural registers to physical registers.
As used herein, the terms "architectural register file" and "ARF" mean a data set that includes the entire architectural state of all threads. These terms are not limited to a particular type of file, data organization, or storage element used to store the ARF.
The memory that stores the ARF has denser data storage than the physical registers, but its access time is longer. A reasonably sized PRF gives fast access to some of the architectural register contents without a large increase in silicon area. As used herein, the terms "physical register file" and "PRF" mean the data set stored in the physical registers. These terms are not limited to a particular type of file or data organization.
In some embodiments, the memory stores the entire architectural state of all threads. In such embodiments, fixed indexing can be implemented with simple logic, at a higher cost in area. In other embodiments, the memory stores only the architectural state that is not stored in the PRF, which reduces area but increases indexing complexity.
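As an illustration of why fixed indexing needs only simple logic when the memory holds the full state of every thread, each architectural register can live at a location computed directly from the thread number and register number. The function below is a hypothetical sketch; the row-major layout and the parameter names are assumptions, not something the specification prescribes.

```python
def arf_index(thread_id, reg, regs_per_thread):
    """Fixed ARF location of one register (assumed row-major layout:
    all registers of thread 0 first, then thread 1, and so on)."""
    if thread_id < 0:
        raise ValueError("thread_id must be non-negative")
    if not 0 <= reg < regs_per_thread:
        raise ValueError("register number out of range")
    # A multiply and an add; with a power-of-two regs_per_thread this
    # reduces to concatenating the thread bits and the register bits.
    return thread_id * regs_per_thread + reg


if __name__ == "__main__":
    print(arf_index(thread_id=2, reg=5, regs_per_thread=32))
```

In the alternative organization, where only state absent from the PRF is kept in memory, no such closed-form location exists and some additional lookup structure would be needed.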
In some embodiments, each active thread has a predetermined number of physical registers and cannot use physical registers allocated to other threads. In other embodiments, the physical registers are allocated dynamically among all active threads.
In some embodiments, when a new register access request is issued, its source and destination operands are looked up in the register map (also referred to herein as the mapping table). When the mapping table shows that the requested register is present in the PRF, the physical register is used for the register access. When one or more requested registers are not mapped into the PRF, a replacement cycle occurs. During the replacement cycle, one or more physical registers (for example, registers that have not been used recently) are written back to the architectural registers. These physical registers are then available for storing other architectural register values. After a warm-up period, all selected architectural registers (for example, the most recently used ones) are cached in the PRF, and execution requires relatively few replacement cycles. However, a new warm-up period may occur when processing moves to a new phase that uses different architectural registers. The mapping table is dynamically maintained and updated as needed during or after the replacement cycle.
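The lookup and replacement cycle described above can be modeled in software as a small cache in front of a larger backing store. The sketch below is a behavioral model under stated assumptions, not the hardware design: it shares one PRF dynamically among all threads, picks the least recently used entry as the victim, and writes a victim back only if it was modified. The class and method names are hypothetical.

```python
from collections import OrderedDict


class MappedRegisterFile:
    """Behavioral model: a small PRF caching entries of a larger ARF."""

    def __init__(self, num_threads, regs_per_thread, prf_size):
        # ARF: full architectural state of every thread, keyed by (thread, reg).
        self.arf = {(t, r): 0
                    for t in range(num_threads)
                    for r in range(regs_per_thread)}
        self.prf = [0] * prf_size
        self.free = list(range(prf_size))
        # Mapping table: (thread, reg) -> [physical index, dirty flag],
        # kept in least-recently-used-first order (an assumed policy).
        self.map = OrderedDict()
        self.replacements = 0

    def _lookup(self, key):
        if key not in self.arf:
            raise KeyError(key)
        entry = self.map.get(key)
        if entry is None:              # register miss
            entry = self._fill(key)
        self.map.move_to_end(key)      # mark as most recently used
        return entry

    def _fill(self, key):
        if self.free:
            idx = self.free.pop()
        else:
            # Replacement cycle: evict the least recently used mapping.
            victim, (idx, dirty) = self.map.popitem(last=False)
            if dirty:
                self.arf[victim] = self.prf[idx]   # write back to the ARF
            self.replacements += 1
        self.prf[idx] = self.arf[key]  # fetch the requested register
        entry = [idx, False]
        self.map[key] = entry
        return entry

    def read(self, thread, reg):
        idx, _dirty = self._lookup((thread, reg))
        return self.prf[idx]

    def write(self, thread, reg, value):
        entry = self._lookup((thread, reg))
        self.prf[entry[0]] = value
        entry[1] = True                # PRF copy now differs from the ARF
```

With a two-entry PRF, a third distinct register access forces one replacement; once the working set fits in the PRF, further accesses hit without touching the ARF, which corresponds to the end of the warm-up period.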
As used herein, the terms "register request" and "register access request" include requests for both read and write operations on a register.
The register mapping described herein is particularly beneficial for core implementations that run a large number of threads in order to exploit thread-level parallelism (for example, graphics accelerators and big data servers). A single core can support a larger number of threads, increasing thread-level parallelism (TLP), without duplicating the entire architectural state of all threads and without the overhead of the long thread-switch periods that limit CGMT operation. The potential benefit of the register mapping described herein increases as the number of threads per core grows.
Reference is now made to FIG. 1, which is a simplified block diagram of a system for processing register access requests, according to an embodiment of the present invention. System 100 includes an interface 110 and a processing unit 120.
The interface 110 receives register access requests. Optionally, the register access requests are submitted by multiple MT threads. Optionally, the register access requests are received via at least one pipeline engine.
The processing unit 120 dynamically maps a set of registers from the architectural registers 150 to the physical registers 140. Optionally, the mapping is based on:
i) access frequency by the MT threads ("frequently used");
ii) recency of use by the MT threads ("recently used"); and/or
iii) a combination of access frequency and recency of use.
In response to a register access request, the processing unit 120 determines from the mapping table whether the register value is stored in the physical registers 140. When no match is found in the physical registers 140, the processing unit 120 looks for a match for the requested register in the architectural registers 150.
Optionally, the architectural registers are stored in static random access memory (SRAM).
In some embodiments, the system 100 includes a memory that stores a recently used data set. The processing unit 120 updates the recently used data set with the most recent use of each register and performs the mapping at least partly according to the recently used data set. Additionally or alternatively, the memory stores an access frequency data set. The processing unit 120 updates the access frequency data set with the access frequency of each register and performs the mapping at least partly according to the access frequency data set.
Optionally, the recently used data set includes a plurality of records. Each record logs the most recent use of the architectural registers by a respective thread.
Optionally, the recently used data set includes the allocation status of each architectural register. The allocation status indicates when the architectural register is allocated to a physical register, in which case the architectural register value can be read from or written to the physical register, and optionally indicates which physical register the architectural register is allocated to.
Optionally, the architectural registers 150 are allocated to both the suspended (i.e. inactive) and the running (i.e. active) threads among the MT threads, and the physical registers 140 are allocated to the running threads.
Optionally, the processing unit 120 updates the recently used data set when the allocation of an architectural register is switched from one MT thread to another. This may happen when a thread is terminated or added.
Optionally, the processing unit 120 updates the recently used data set when the allocation of a physical register is switched from one MT thread to another. This may happen when a thread becomes inactive and its physical registers are reallocated to architectural registers of a different thread.
Optionally, the processing unit 120 maps the architectural registers to the respective MT threads.
Optionally, the processing unit 120 switches the mapping of an architectural register of a particular MT thread to a different architectural register.
Reference is now made to FIG. 2, which is a simplified illustration of a register mapping scheme, according to an embodiment of the present invention. In FIG. 2:
i) N denotes the number of active threads;
ii) M denotes the total number of threads (active and inactive);
iii) K denotes the total number of registers per thread;
iv) J denotes the number of registers per thread stored in the PRF 130.
Accordingly, the total number of registers in the ARF is M*K, whereas the PRF holds a smaller number, N*J.
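As a worked example of the two totals, the short sketch below evaluates M*K and N*J for one hypothetical configuration. The parameter values are invented purely for illustration, and the range check encodes the assumption that N cannot exceed M and J cannot exceed K.

```python
def register_counts(M, K, N, J):
    """Return (ARF size, PRF size) in registers for the FIG. 2 parameters."""
    # Assumed constraints: active threads are a subset of all threads, and
    # the PRF holds at most all of a thread's registers.
    if not (0 < N <= M and 0 < J <= K):
        raise ValueError("expected 0 < N <= M and 0 < J <= K")
    return M * K, N * J


if __name__ == "__main__":
    # Hypothetical values: 8 threads of 32 registers, 2 active, 8 cached each.
    arf, prf = register_counts(M=8, K=32, N=2, J=8)
    print("ARF registers:", arf)
    print("PRF registers:", prf)
    print("PRF / ARF ratio:", prf / arf)
```

For these illustrative values the PRF holds 16 registers against 256 in the ARF, which is the kind of reduction in fast storage the mapping scheme aims for.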
For clarity, in the non-limiting embodiment of FIG. 2, the registers stored in the PRF are selected based on access frequency ("frequently used"). In other embodiments, the registers stored in the PRF are selected by a different criterion (e.g., most recently used), and register mapping, access and processing are performed in a substantially similar manner.
The mapping table 210 specifies whether a requested register is allocated in the PRF 220, and also maintains additional information used to find candidates for replacement (e.g., the least frequently used register). In the embodiment of FIG. 2, the mapping table 210 holds the following fields for each register of an active thread:
i) Valid field: indicates whether the architectural register value is stored in the PRF;
ii) Index field: maps the architectural register to a physical register;
iii) Dirty field: indicates whether the value stored in the architectural register corresponds to the value of the mapped physical register;
iv) Access frequency field: may be used to select a physical register to overwrite when the requested architectural register is not in the PRF.
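The four fields listed above can be pictured as one record per architectural register of each active thread. The following Python sketch is only an illustration: the names `MapEntry` and `make_mapping_table` are hypothetical, and the polarity of the dirty flag (True meaning the PRF copy has diverged from the ARF copy) is an assumption, since the description only says the field indicates whether the two values correspond.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """One mapping-table row for a single architectural register (hypothetical layout)."""
    valid: bool = False     # True when the register value is held in the PRF
    index: int = -1         # physical register number; meaningful only when valid
    dirty: bool = False     # assumed polarity: True when the PRF copy differs from the ARF copy
    access_count: int = 0   # access frequency, consulted when choosing a register to overwrite


def make_mapping_table(active_threads, regs_per_thread):
    """Build one row of entries per active thread, all initially invalid."""
    return [[MapEntry() for _ in range(regs_per_thread)]
            for _ in range(active_threads)]


# Example: two active threads with four registers each; register 3 of thread 0
# is recorded as living in physical register 5.
table = make_mapping_table(active_threads=2, regs_per_thread=4)
table[0][3] = MapEntry(valid=True, index=5)
```

A hardware table would pack these fields into a few bits per entry; the dataclass is used here only to make the fields readable.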
In FIG. 2, the pipeline engine 200 runs N active threads. The active threads issue register access requests for architectural registers. When a register request is received from the pipeline engine 200, the mapping table 210 is used to determine whether the register value can be accessed relatively quickly from the PRF 220 or must be obtained from the ARF 230.
The ARF 230 stores the architectural register files of all threads, active and inactive. Data may be transferred between the ARF 230 and the PRF 220 to keep the architectural register values and the physical register values up to date as required for operation. The mapping table is updated accordingly.
In the case of a "register miss" (i.e., the requested architectural register is not in the PRF 220), a "victim" physical register is reallocated to the requested architectural register, and the content of the reallocated register is replaced. In some embodiments, inactive threads are the preferred source of victim physical registers.
Optionally, the victim physical register is selected based at least in part on data stored in the mapping table (e.g., access frequency and/or most recent access).
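One possible victim-selection policy consistent with the two paragraphs above is sketched below in Python. It is an assumption, not the method of the patent: it prefers physical registers owned by inactive threads and, among those, picks the least frequently accessed one. The names `PhysReg` and `select_victim` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    index: int          # physical register number
    owner_thread: int   # thread whose architectural register is currently held
    access_count: int   # access frequency recorded in the mapping table


def select_victim(phys_regs, active_threads):
    """Pick a register to reallocate on a register miss.

    Sort key: registers of inactive threads come first (False < True),
    then the lowest access count, then the lowest index as a tie-break.
    """
    return min(
        phys_regs,
        key=lambda r: (r.owner_thread in active_threads, r.access_count, r.index),
    )
```

For example, with registers owned by threads 0 and 1 and only thread 0 active, the register owned by thread 1 is chosen even if it has the highest access count; if both threads are active, the least frequently accessed register is chosen instead.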
Optionally, the remapping of source and destination registers is performed in the pipeline engine.
Reference is now made to FIG. 3, which is a simplified flowchart of a method for processing register access requests according to an embodiment of the present invention. In step 310, register access requests are received. In step 320, the register map is checked to determine whether each requested register is mapped to a physical register. In step 330, for each requested register that is not mapped to a physical register, a match is looked up in the architectural registers (i.e., the ARF). Optionally, in step 340, the architectural register value is stored in a physical register.
Optionally, in step 350, the requested register is accessed from the PRF.
In step 360, register mapping from the architectural registers to the physical registers (i.e., the PRF) is performed dynamically. The mapping may be based on the recent usage and/or the access frequency of each architectural register of the MT threads. Optionally, the mapping is performed based on alternative or additional mapping criteria.
Optionally, the recent usage of registers (physical registers and/or architectural registers) is monitored by recording the register access requests received via at least one pipeline engine.
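The monitoring described above can be sketched as a small recorder that is fed every register access request. This Python example is a minimal illustration under assumptions: the class name `UsageMonitor`, its methods, and the use of a logical tick counter as the "recent usage" timestamp are all hypothetical.

```python
from collections import Counter


class UsageMonitor:
    """Records register access requests to track frequency and recency."""

    def __init__(self):
        self.access_count = Counter()   # (thread, register) -> number of accesses
        self.last_use = {}              # (thread, register) -> tick of the latest access
        self._tick = 0                  # logical clock, advanced once per request

    def record(self, thread_id, reg):
        """Call once for each register access request seen by the pipeline engine."""
        self._tick += 1
        key = (thread_id, reg)
        self.access_count[key] += 1
        self.last_use[key] = self._tick

    def least_recently_used(self):
        """Return the (thread, register) pair whose latest access is oldest."""
        return min(self.last_use, key=self.last_use.get)
```

Either statistic (count or last-use tick) could then drive the dynamic mapping of step 360.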
Reference is now made to FIG. 4, which is a simplified block diagram of a method for processing a register access request according to an embodiment of the present invention. In step 400, the pipeline engine issues a register access request. In step 410, the mapping table is checked to determine whether the requested register is stored in the PRF (e.g., by checking the "valid" bit of the requested register).
In step 420, when the requested register is stored in the PRF, the register read or write access is performed. For a write operation, the write data is stored in the physical register mapped to the requested architectural register. For a read operation, the value stored in the physical register mapped to the requested architectural register is returned.
In step 430, when the requested register is not stored in the PRF, the PRF is searched for an available physical register to hold the data of the requested architectural register. In step 450, when no available physical register is found, a victim physical register is selected and its content is stored back to the ARF, thereby creating an available physical register.
In step 460, it is determined whether the access is a read request. In step 470, when the access is a read request, the requested register value is copied from the ARF to the available register in the PRF. The read or write operation of step 420 is then performed as described above.
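The flow of FIG. 4 can be sketched as a small software model. The Python class below is an illustration under stated assumptions, not the patented hardware: the name `RegisterFile`, the dictionary-based mapping table (presence of a key stands in for the valid bit), the least-frequently-used victim choice, and the unconditional write-back on eviction are all choices made for this example. Dirty-bit tracking is omitted for brevity.

```python
class RegisterFile:
    """Toy model: a full ARF backed by a small PRF and a mapping table."""

    def __init__(self, num_threads, regs_per_thread, prf_size):
        self.arf = {(t, r): 0 for t in range(num_threads)
                    for r in range(regs_per_thread)}
        self.prf = [0] * prf_size
        self.free = list(range(prf_size))   # physical registers not yet allocated
        self.table = {}   # (thread, reg) -> [physical index, access count]

    def _lookup(self, key, is_read):
        entry = self.table.get(key)          # step 410: is the register in the PRF?
        if entry is None:                    # register miss
            if self.free:                    # step 430: an available register exists
                index = self.free.pop()
            else:                            # step 450: evict a victim, store it back
                victim = min(self.table, key=lambda k: self.table[k][1])
                index = self.table.pop(victim)[0]
                self.arf[victim] = self.prf[index]
            if is_read:                      # steps 460-470: fill from the ARF on a read
                self.prf[index] = self.arf[key]
            entry = self.table[key] = [index, 0]
        entry[1] += 1                        # update the access frequency
        return entry[0]

    def read(self, thread, reg):             # step 420, read access
        return self.prf[self._lookup((thread, reg), True)]

    def write(self, thread, reg, value):     # step 420, write access
        self.prf[self._lookup((thread, reg), False)] = value
```

Note that a write miss allocates a physical register without first copying the old value from the ARF, mirroring the fact that step 470 applies only to read requests.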
Reference is now made to FIG. 5, which is a simplified block diagram of a thread context switching method according to an embodiment of the present invention.
In step 500, it is determined whether the thread switch is a hardware switch or a software switch.
In step 510, when the thread switch is a hardware switch, the engine pipeline switches the active thread to a different thread, temporarily blocking the previously active thread. In step 520, the valid bits of the currently active thread are set to 0. In step 530, only the registers marked as dirty in the mapping table are updated in the ARF.
In step 540, when the thread switch is a software switch, software switches the active thread to another thread, deactivating the previously active thread. In step 550, all physical registers mapped to architectural registers of the currently active thread are read and written to memory (i.e., updated in the ARF). In step 560, all valid bits of the currently active thread in the mapping table are set to 0. In step 570, the previously active thread is removed from the thread control register, which designates the active threads.
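The difference between the two switch paths of FIG. 5 can be sketched as follows. This Python function is a simplified illustration: the name `switch_out`, the dictionary layout of the mapping table, and the choice to write values back before clearing the valid bits are assumptions made for the example. Updating the thread control register (step 570) is not modeled.

```python
def switch_out(table, prf, arf, thread, hardware_switch):
    """Retire the PRF contents of `thread` when it stops being active.

    table: dict mapping (thread, reg) -> {"valid": bool, "index": int, "dirty": bool}
    prf:   list of physical register values
    arf:   dict mapping (thread, reg) -> architectural register value
    Returns the number of registers written back to the ARF.
    """
    written = 0
    for (t, reg), entry in table.items():
        if t != thread or not entry["valid"]:
            continue
        # Hardware switch (step 530): write back only dirty registers.
        # Software switch (step 550): write back every mapped register.
        if entry["dirty"] or not hardware_switch:
            arf[(t, reg)] = prf[entry["index"]]
            entry["dirty"] = False
            written += 1
        entry["valid"] = False   # steps 520 / 560: clear the valid bit
    return written
```

Under this sketch a hardware switch transfers less data, because clean registers already match their ARF copies, whereas a software switch transfers every mapped register of the outgoing thread.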
In summary, the above embodiments are useful for all MT implementations, including CGMT. The register mapping described herein significantly reduces CGMT overhead, because the ARF is not replicated for every thread and architectural registers are restored on demand. The registers of a new thread can be retrieved during the thread switch time (i.e., while the machine front end fetches instructions from the new thread). Suspended threads naturally provide candidate victim physical registers. Avoiding full ARF replication yields a significant reduction in area (die size) and energy consumption. Compared with a full ARF save and restore, the thread switch time is significantly shortened, and the switch can be performed largely in the background.
The descriptions of the various embodiments of the present invention are presented for purposes of illustration only, and are not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or the technical improvement over technologies found in the marketplace, or to enable others skilled in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant multithreading implementations, register files, architectural registers, physical registers, register mapping implementations and register access operations will be developed, and the scope of the terms multithreading, register file, architectural register, physical register, register mapping, register access and register access request is intended to include all such new technologies a priori.
The terms "including" and "having" mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments, and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of the invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as the individual numerical values within that range. For example, the description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within that range, for example 1, 2, 3, 4, 5 and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "within the range between a first indicated number and a second indicated number" and "within the range from a first indicated number to a second indicated number" are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals between them.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.
Claims (15)
- 1. A system for processing register access requests, comprising: an interface for receiving a plurality of register access requests; and a processing unit, connected to the interface, configured to: dynamically map a group of registers from a plurality of architectural registers into at least one plurality of physical registers, based on at least one of recent usage and access frequency of each of the plurality of architectural registers of a plurality of multithreading (MT) threads; and, when no match is found in the plurality of physical registers, search the plurality of architectural registers for a match for each of the register access requests.
- 2. The system according to claim 1, wherein the plurality of MT threads that submit the plurality of register access requests are in a multithreaded processor.
- 3. The system according to any one of the preceding claims, wherein the plurality of register access requests are received via at least one pipeline engine.
- 4. The system according to any one of the preceding claims, wherein the plurality of architectural registers are stored in a static random access memory (SRAM).
- 5. The system according to any one of the preceding claims, further comprising a memory for storing an access frequency data set; wherein the processing unit is configured to update the access frequency data set with the access frequency of each register, and to perform the mapping according to the access frequency data set.
- 6. The system according to any one of the preceding claims, further comprising a memory for storing a recent-usage data set; wherein the processing unit is configured to update the recent-usage data set with the recent usage, and to perform the mapping according to the recent-usage data set.
- 7. The system according to claim 6, wherein the recent-usage data set comprises a plurality of records, each recording the recent usage of the plurality of architectural registers by each of the plurality of MT threads.
- 8. The system according to any one of claims 6 or 7, wherein the recent-usage data set comprises an allocation status of each of the plurality of architectural registers.
- 9. The system according to any one of claims 6 to 8, wherein the plurality of architectural registers are mapped to allocations of suspended and running threads of the plurality of MT threads, and the plurality of physical registers are mapped to allocations of active threads of the plurality of MT threads.
- 10. The system according to any one of claims 6 to 9, wherein the processing unit is configured to update the recent-usage data set when the allocation of any one of the plurality of physical registers is switched from one of the plurality of MT threads to another of the plurality of MT threads.
- 11. The system according to any one of the preceding claims, wherein the processing unit is configured to map the plurality of architectural registers to the plurality of MT threads.
- 12. The system according to any one of the preceding claims, wherein the processing unit is configured to switch the mapping of any one of the plurality of architectural registers of one of the plurality of MT threads to another of the plurality of architectural registers.
- 13. The system according to any one of the preceding claims, wherein, when an active thread is switched to inactive and a different thread is activated, the processing unit is configured to set the status of each physical register mapped to that active thread to available.
- 14. A method for processing register access requests, comprising: receiving a plurality of register access requests; dynamically mapping a group of registers from a plurality of architectural registers into at least one plurality of physical registers, based on at least one of recent usage and access frequency of each of the plurality of architectural registers of a plurality of multithreading (MT) threads; and, when a requested register is not found in the mapping in the plurality of physical registers, searching the plurality of architectural registers for a match for each of the register access requests.
- 15. The method according to claim 14, further comprising: monitoring at least one of the recent usage and the access frequency by recording the plurality of register access requests received via at least one pipeline engine.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2015/068977 WO2017028909A1 (en) | 2015-08-18 | 2015-08-18 | Shared physical registers and mapping table for architectural registers of multiple threads |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN107851006A (en) | 2018-03-27 |
| CN107851006B (en) | 2020-12-04 |
Family
ID=54007684
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201580082261.5A (Active; CN107851006B) | Multithreaded register map | 2015-08-18 | 2015-08-18 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN107851006B (en) |
| WO (1) | WO2017028909A1 (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112445616A (en) * | 2020-11-25 | 2021-03-05 | 海光信息技术股份有限公司 | Resource allocation method and device |
| CN114616545A (en) * | 2019-10-30 | 2022-06-10 | 超威半导体公司 | Shadow latches in a register file for shadow latch configuration for thread storage |
| WO2023029591A1 (en) * | 2021-09-03 | 2023-03-09 | 海光信息技术股份有限公司 | Processor, physical register management method, and electronic apparatus |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11294683B2 (en) | 2020-03-30 | 2022-04-05 | SiFive, Inc. | Duplicate detection for register renaming |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101794214A (en) * | 2009-02-04 | 2010-08-04 | 世意法(北京)半导体研发有限责任公司 | Register renaming system using multi-block physical register mapping table and method thereof |
| CN102298514A (en) * | 2010-06-14 | 2011-12-28 | 英特尔公司 | Register mapping techniques for efficient dynamic binary translation |
| US8200949B1 (en) * | 2008-12-09 | 2012-06-12 | Nvidia Corporation | Policy based allocation of register file cache to threads in multi-threaded processor |
| US20120216004A1 (en) * | 2011-02-23 | 2012-08-23 | International Business Machines Corporation | Thread transition management |
| US20130086364A1 (en) * | 2011-10-03 | 2013-04-04 | International Business Machines Corporation | Managing a Register Cache Based on an Architected Computer Instruction Set Having Operand Last-User Information |
| US20140122841A1 (en) * | 2012-10-31 | 2014-05-01 | International Business Machines Corporation | Efficient usage of a register file mapper and first-level data register file |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7711898B2 (en) * | 2003-12-18 | 2010-05-04 | Intel Corporation | Register alias table cache to map a logical register to a physical register |
| US9501285B2 (en) * | 2010-05-27 | 2016-11-22 | International Business Machines Corporation | Register allocation to threads |
2015
- 2015-08-18: CN application CN201580082261.5A, patent CN107851006B (status: Active)
- 2015-08-18: WO application PCT/EP2015/068977, publication WO2017028909A1 (status: Ceased)
Non-Patent Citations (1)
| Title |
|---|
| 廖银 等: "动态二进制翻译中全寄存器直接映射方法", 《计算机应用与软件》 * |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2017028909A1 (en) | 2017-02-23 |
| CN107851006B (en) | 2020-12-04 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |