CN118535352A - A multi-granularity remote memory runtime method based on user mode
- Publication number
- CN118535352A (application number CN202410517975.5A)
- Authority
- CN
- China
- Prior art keywords
- memory
- remote memory
- runtime
- granularity
- remote
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/547—Remote procedure calls [RPC]; Web services
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0604—Improving or facilitating administration, e.g. storage management
- G06F3/0607—Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/441—Register allocation; Assignment of physical memory space to logical memory space
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/445—Exploiting fine grain parallelism, i.e. parallelism at instruction level
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/544—Buffers; Shared memory; Pipes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/544—Remote
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/549—Remote execution
Description
Technical Field
The present invention relates to the field of remote memory, and in particular to a multi-granularity remote memory runtime method based on user mode.
Background Art
With the advent of the big-data era, memory-intensive applications have become increasingly common in data centers. Such applications mainly include low-latency network applications (such as in-memory databases) and data-intensive applications (such as machine learning and graph computing). Against this background, memory has gradually become one of the main resource bottlenecks in data centers.
The problems caused by memory bottlenecks in data centers fall into two main areas. The first is service instability and performance degradation caused by insufficient memory on a single machine. Modern operating systems usually employ a memory swap mechanism that swaps colder pages out to lower-tier storage devices under memory pressure, providing a degree of memory elasticity. However, swapping causes a cliff-like drop in the performance of memory-intensive applications: reported data show that in Google's data centers roughly 790,000 tasks were terminated within one month due to memory pressure. Existing mitigations include swapping to SSDs with higher I/O performance, memory compression, and more aggressive page-reclamation policies, but these only alleviate the performance problem. The second is low memory utilization on individual machines. Google's data-center traces show that applications routinely over-request memory, leaving actual utilization at only about 50%, and Alibaba's studies likewise indicate that about 30% of data-center memory is cold. Because the traditional monolithic server architecture lacks the ability to share memory across machines, this leftover memory cannot be shared efficiently with other servers at the data-center level. The lack of elasticity in memory allocation is therefore one of the key memory-allocation problems in data centers.
Memory disaggregation is a solution to the above problems proposed by the academic community in recent years. Its core idea is to give applications the ability to access memory on other nodes, thereby improving the elasticity of data-center memory usage and the overall memory utilization. The technology is built on Remote Direct Memory Access (RDMA), which has matured considerably in recent years: with RDMA-capable network devices, a compute node can directly access the memory of a remote node at microsecond-level latency without involving the remote node's CPU. Multiple studies have shown that memory-intensive applications, once adapted to RDMA, can retain high performance even when their working sets span the memory of several nodes. However, applying RDMA to memory disaggregation so that applications can use remote memory with few or no changes requires redesigning the traditional software architecture, which targets local memory only.
Current research on remote memory falls into two mainstream directions.
The first is based on the kernel swap mechanism. The main idea is to abstract remote memory as a block device serving as one of the system's swap spaces and to reuse the kernel swap mechanism to access remote memory. When an application touches a page that is not in local memory, the CPU triggers a page fault and the swap mechanism pages the data in from the remote node. Through swapping, the system keeps hot pages in local memory while cold pages reside in remote memory. This approach manages remote memory at page granularity; its advantage is that applications need no modification, guaranteeing maximum compatibility. InfiniSwap first proposed and implemented this idea, and additionally preserves robustness by asynchronously writing swapped-out pages to SSD. However, every access must traverse the full kernel software stack, which incurs very high overhead, and the kernel's page-swap prefetching is ill-suited to fast remote-memory access. Moreover, the kernel swap mechanism was designed for disks: many of its internal algorithms (such as swap-entry allocation) are still being optimized even for SSD scenarios and cannot meet the performance demands of remote memory, and page-granularity management also introduces read/write amplification.
The second is based on user-mode runtimes. The main idea is to modify object-model mechanisms, extending a programming language's object lifetimes, garbage collection, and so on to use remote memory. The drawback of such mechanisms is poor compatibility: they either depend on one particular language runtime or require extensive modification of application source code.
Beyond these two mainstream directions, academia has also proposed a series of hardware/software co-designed schemes, but the problem remains compatibility.
In summary, current remote-memory solutions face a hard trade-off: they cannot combine high compatibility with high performance (low kernel overhead and low read/write amplification). To resolve this conflict, the present invention starts from an angle not attempted in prior work: using compile-time program analysis and program rewriting to introduce remote-memory support into applications. The present invention proposes a new user-mode remote-memory runtime that performs fine-grained management and swapping of remote memory in user mode; by introducing pointer-flow analysis and instruction-rewriting steps into program compilation, applications become compatible with the runtime with no source-level modification or only minor modification, combining the compatibility of kernel-mode solutions with the high performance of user-mode solutions.
Summary of the Invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is how to achieve both high compatibility and high performance.
To this end, the present invention provides a multi-granularity remote memory runtime method based on user mode, characterized in that compile-time program analysis and program rewriting are used to introduce remote-memory support into applications written in native C/C++ or any other language supported by the LLVM ecosystem, and a new user-mode remote-memory runtime is proposed that performs fine-grained management and swapping of remote memory in user mode. Through the compile-time steps, applications become compatible with the runtime with no source-level modification or only minor modification, combining the compatibility of kernel-mode solutions with the high performance of user-mode solutions.
Further, the compile-time part is based on the LLVM toolchain and is implemented mainly as a compiler plug-in module invoked in the LLVM optimizer. Because the plug-in operates on code in LLVM IR form, the compile-time functionality supports native C/C++ and any other language in the LLVM ecosystem that can be compiled to LLVM IR.
Further, for a native C/C++ code repository, the following steps are performed to generate an executable that supports the remote memory:
S1.1 Annotate the dynamic memory allocation sites in the code;
S1.2 The code is translated into LLVM IR intermediate representation by the Clang compiler;
S1.3 After the pre-optimization passes, pointer-flow analysis is performed using the results of S1.1 to mark all pointers that may point to dynamically allocated memory;
S1.4 According to the configuration file, all memory accesses through pointers that may point to dynamically allocated memory are rewritten at the instruction level, inserting calls to the remote-memory handling functions;
S1.5 The compiler performs data-flow analysis on the loops in the code, identifies their memory-access patterns, and inserts memory-prefetch code at appropriate locations to direct the runtime library's data prefetching;
S1.6 The code passes through the post-optimization passes to complete the full optimization pipeline. The linker links the rewritten LLVM IR against the remote-memory runtime library to produce the executable.
Further, in step S1.1, allocations made through the standard glibc dynamic-memory interfaces are annotated automatically by the clang compiler.
Further, in step S1.3, the pointer-flow analysis refers to the flow-insensitive, context-insensitive Andersen points-to analysis algorithm; in practice other, more precise pointer analyses may be substituted.
Further, the memory-access pattern in step S1.5 is sequential access.
Further, the run-time process is divided into an initialization process and a memory-access process.
The initialization process runs at application startup and is mainly responsible for establishing an RDMA connection with the memory-management program on the remote memory node and for initializing the local cache data structures and the remote memory.
Further, the initialization process includes the following steps:
S2.1 The application calls the initialization callback of the remote-memory runtime library; this call is inserted into the application's initialization logic automatically by the compiler plug-in at compile time;
S2.2 After the callback is invoked, the runtime library allocates a fixed-size region of local memory to serve as the cache for remote memory and initializes the local cache data structures;
S2.3 The runtime library establishes an RDMA connection with the memory-node program;
S2.4 After the connection is established, the memory-node program sends its BaseVA and R_key to the runtime library on the compute node; these two values are used to compute the virtual addresses of remote memory objects on the memory node and to issue RDMA read/write requests;
S2.5 The runtime library sends a request to the remote memory node to register a region of the specified size as remote memory exclusive to the application;
S2.6 On receiving the request, the memory-node program allocates a region of the specified size from its local memory area and initializes it;
S2.7 If the memory node has sufficient memory, the memory-node program returns a registration-success message to the runtime library; if registration fails, the entire initialization flow is aborted;
S2.8 On receiving the registration-success message, the runtime library returns control to the application.
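The handshake of S2.1 through S2.8 can be modeled as a toy sketch in C++. No real RDMA is used (a production system would go through libibverbs/rdma_cm rather than direct method calls), and every class, field, and method name below is an illustrative assumption, not the patent's implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of the memory-node side of the handshake.
struct MemoryNode {
    std::vector<uint8_t> pool;  // backing store for the remote memory
    explicit MemoryNode(size_t cap) { pool.reserve(cap); }
    uint64_t base_va() const {  // BaseVA sent in S2.4
        return reinterpret_cast<uint64_t>(pool.data());
    }
    uint32_t r_key() const { return 0x1234; }  // stand-in for the RDMA rkey
    // S2.5/S2.6: register and initialize a region of the requested size.
    bool register_region(size_t size) {
        if (size > pool.capacity()) return false;  // S2.7: not enough memory
        pool.assign(size, 0);                      // initialize the region
        return true;
    }
};

// Toy model of the compute-node runtime library.
struct Runtime {
    std::vector<uint8_t> local_cache;  // S2.2: fixed-size local cache
    uint64_t remote_base = 0;          // S2.4: BaseVA from the memory node
    uint32_t rkey = 0;                 //        R_key from the memory node
    bool init(MemoryNode &node, size_t cache_sz, size_t remote_sz) {
        local_cache.assign(cache_sz, 0);       // S2.2
        // S2.3: RDMA connection setup is omitted in this model.
        remote_base = node.base_va();          // S2.4
        rkey = node.r_key();
        if (!node.register_region(remote_sz)) // S2.5-S2.7
            return false;                      // abort initialization
        return true;                           // S2.8: return to the app
    }
    // BaseVA lets the runtime compute a remote object's virtual address:
    uint64_t remote_addr(uint64_t offset) const { return remote_base + offset; }
};
```

A caller would construct a MemoryNode, call Runtime::init, and treat a false return as the S2.7 failure path; remote_addr shows why BaseVA must be exchanged before any RDMA read or write can be issued.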
Further, the memory-access process is invoked while the application runs. Because the compiler plug-in has already translated remote-memory access instructions into callbacks into the runtime library at compile time, the application automatically enters the access process whenever it touches memory managed by the runtime library. To support memory management at multiple granularities, the runtime library manages memory of different sizes with different data structures: small-granularity memory is managed by a set-associative cache, while large-granularity memory is managed by a page cache organized as a linked list together with a separate page-address mapping table. Besides explicit application calls to the prefetch function, the runtime library also prefetches memory implicitly during accesses to improve remote-memory performance.
Further, the memory-access process includes the following steps:
S3.1 The application triggers the runtime library's access callback by accessing heap memory;
S3.2 The runtime library determines the access granularity from the callback's parameters, selects the cache data structure accordingly, and looks the memory up in that structure; if the lookup succeeds, the object is in the local cache and the process jumps directly to S3.6; if it fails, the object must be fetched from the remote node and the following steps continue;
S3.3 If the local cache is full, an eviction is required: the runtime library selects a victim object, writes it back to remote memory with an RDMA write request, and frees that cache slot;
S3.4 The runtime library issues an RDMA read request to bring the remote memory into the free local cache slot;
S3.5 If the runtime library recognizes the application's current access pattern, it prefetches ahead by a certain distance using one of its default prefetch strategies, such as sequential or strided access;
S3.6 After the above steps, control is returned to the application.
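The two cache structures used in S3.2 through S3.4 can be sketched minimally in C++. The set count, associativity, victim choice, and the rdma_reads counter (standing in for the RDMA reads of S3.4; write-back of S3.3 is omitted) are all illustrative simplifications, not the patent's actual parameters:

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

struct MultiGranCache {
    static constexpr int kSets = 64, kWays = 4;
    struct Line { uint64_t addr = 0; bool valid = false; };
    std::vector<std::vector<Line>> sets{kSets, std::vector<Line>(kWays)};
    std::list<uint64_t> pages;                   // page cache as an LRU list
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> page_map;
    size_t page_capacity = 8;
    int rdma_reads = 0;                          // stand-in for S3.4 fetches

    // S3.2: choose the data structure by access granularity.
    // Returns true on a local-cache hit, false when a fetch was needed.
    bool access(uint64_t addr, size_t size) {
        if (size < 4096) return access_small(addr);
        return access_page(addr & ~uint64_t(4095));
    }
    bool access_small(uint64_t addr) {           // set-associative path
        auto &set = sets[(addr >> 8) % kSets];
        for (auto &l : set)
            if (l.valid && l.addr == addr) return true;  // hit -> S3.6
        ++rdma_reads;                            // miss: S3.4 RDMA read
        set[0] = {addr, true};                   // S3.3: toy victim = way 0
        return false;
    }
    bool access_page(uint64_t page) {            // page-cache path
        auto it = page_map.find(page);
        if (it != page_map.end()) {              // hit: move to LRU front
            pages.splice(pages.begin(), pages, it->second);
            return true;
        }
        ++rdma_reads;                            // miss: S3.4 RDMA read
        if (pages.size() == page_capacity) {     // S3.3: evict the LRU page
            page_map.erase(pages.back());
            pages.pop_back();
        }
        pages.push_front(page);
        page_map[page] = pages.begin();
        return false;
    }
};
```

The first access to an object misses and is counted as a fetch; a second access to the same small object or to any address within an already cached page hits locally, which is the behavior the granularity split is meant to buy.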
The present invention has the following technical effects:
1. Applications written in native C/C++ or other languages supported by the LLVM ecosystem gain user-mode remote-memory support with no or only minor modification;
2. Remote memory is no longer accessed at the fixed granularity of an OS memory page, effectively reducing read/write amplification during remote-memory access;
3. Prefetch optimization raises the remote-memory prefetch hit rate under relatively complex access patterns and reduces access latency.
The concept, specific structure, and technical effects of the present invention are further described below in conjunction with the accompanying drawings, so that the purpose, features, and effects of the present invention can be fully understood.
Brief Description of the Drawings
FIG. 1 is a system diagram of a preferred embodiment of the present invention;
FIG. 2 shows the compile-time process and a code example of a preferred embodiment of the present invention;
FIG. 3 shows the run-time initialization process of a preferred embodiment of the present invention;
FIG. 4 illustrates the run-time memory-access process and data structures of a preferred embodiment of the present invention.
Detailed Description
Several preferred embodiments of the present invention are described below with reference to the drawings to make their technical content clearer and easier to understand. The present invention can be embodied in many different forms, and its scope of protection is not limited to the embodiments mentioned herein.
In the drawings, components with the same structure are denoted by the same reference numeral, and components with similar structures or functions are denoted by similar reference numerals. The size and thickness of each component in the drawings are shown arbitrarily; the present invention does not limit them. For clarity, the thickness of components is exaggerated in places.
As shown in FIG. 1, the present invention uses compile-time program analysis and program rewriting to introduce remote-memory support into applications written in native C/C++ or any other language supported by the LLVM ecosystem, and proposes a new user-mode remote-memory runtime that performs fine-grained management and swapping of remote memory in user mode. Through the aforementioned compile-time steps, applications become compatible with the runtime with no source-level modification or only minor modification, combining the compatibility of kernel-mode solutions with the high performance of user-mode solutions. The specific compile-time and run-time steps are described by example below.
1. Compile time
The compile-time part is based on the LLVM toolchain and is implemented mainly as a compiler plug-in module invoked in the LLVM optimizer. Because the plug-in operates on code in LLVM IR (intermediate representation) form, the compile-time functionality can in principle support native C/C++ and any other language in the LLVM ecosystem that can be compiled to LLVM IR. Referring to FIG. 2, for a native C/C++ code repository, the following steps generate an executable that supports remote memory:
S1.1 Annotate the dynamic memory allocation sites in the code. Allocations made through the standard glibc interfaces are annotated automatically by the clang compiler;
S1.2 The code is translated into LLVM IR intermediate representation by the Clang compiler;
S1.3 After the pre-optimization passes, pointer-flow analysis is performed using the results of S1.1 (shown in the figure as the flow-insensitive, context-insensitive Andersen points-to analysis; other, more precise pointer analyses may be substituted in practice) to mark all pointers that may point to dynamically allocated memory;
S1.4 According to the configuration file, all memory accesses through pointers that may point to dynamically allocated memory are rewritten at the instruction level, inserting calls to the remote-memory handling functions;
S1.5 The compiler performs data-flow analysis on the loops in the code, identifies their memory-access patterns (a sequential pattern in the figure), and inserts memory-prefetch code at appropriate locations to direct the runtime library's data prefetching;
S1.6 The code passes through the post-optimization passes to complete the full optimization pipeline. The linker links the rewritten LLVM IR against the remote-memory runtime library to produce the executable.
A concrete code-rewriting example is shown in FIG. 2. Note that among the steps above, only S1.1 requires a small modification of the application code; compared with other user-mode solutions, the modifications introduced into the application are minimal, which is a major innovation of this patent.
2.运行期2. Operation period
运行期过程分为初始化过程与访存过程。The runtime process is divided into initialization process and memory access process.
初始化过程在应用启动时进行,主要负责与远程内存节点上的内存管理程序建立RDMA链接并对本地缓存数据结构与远程内存进行初始化操作。如图3所示,该过程主要执行如下步骤:The initialization process is performed when the application starts. It is mainly responsible for establishing an RDMA link with the memory manager on the remote memory node and initializing the local cache data structure and the remote memory. As shown in Figure 3, the process mainly performs the following steps:
S2.1 The application calls the initialization callback of the remote-memory runtime library (hereafter "the runtime library"). This call is inserted into the application's initialization logic automatically by the compiler plug-in at compile time;
S2.2 Once the callback is invoked, the runtime library allocates a fixed-size region of local memory to serve as a cache for remote memory and initializes the local cache data structures;
S2.3 The runtime library establishes an RDMA connection with the memory-node program (the handshake steps required to set up the RDMA link are omitted here);
S2.4 After the connection is established, the memory-node program sends its BaseVA and R_key to the runtime library on the compute node; these two values are used to compute the virtual addresses of remote memory objects on the memory node and to issue RDMA read/write requests;
S2.5 The runtime library sends a request to the remote memory node to register a region of the specified size as remote memory exclusive to this application;
S2.6 On receiving the request, the memory-node program allocates a region of the specified size from its local memory and initializes it;
S2.7 If the memory node has sufficient memory, it returns a registration-success message to the runtime library; if registration fails, the entire initialization flow is aborted;
S2.8 On receiving the registration-success message, the runtime library returns control to the application.
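Steps S2.1–S2.8 can be sketched as follows, with the RDMA verbs replaced by an in-process stub (the class names, the BaseVA constant, and the R_key value are all hypothetical; only the shape of the handshake and the BaseVA-plus-offset address arithmetic from S2.4 follow the description above):

```python
# Sketch of the initialization handshake (S2.1-S2.8). Real RDMA connection
# setup is replaced by an in-process stub; the address arithmetic mirrors
# S2.4: remote_va = BaseVA + object offset.

class MemoryNodeStub:
    """Plays the part of the memory-node program on the remote side."""
    BASE_VA = 0x7F00_0000_0000   # hypothetical base virtual address
    R_KEY = 0x1234               # hypothetical RDMA remote key

    def __init__(self, capacity):
        self.capacity = capacity
        self.region = None

    def connect(self):                       # S2.3/S2.4: send BaseVA, R_key
        return self.BASE_VA, self.R_KEY

    def register(self, size):                # S2.5-S2.7
        if size > self.capacity:
            return False
        self.region = bytearray(size)        # S2.6: allocate + zero-init
        return True

class Runtime:
    def __init__(self, cache_size):
        self.cache = bytearray(cache_size)   # S2.2: fixed-size local cache

    def init(self, node, remote_size):       # S2.1 entry point
        self.base_va, self.r_key = node.connect()
        if not node.register(remote_size):
            raise RuntimeError("remote registration failed")  # S2.7: abort
        return True                          # S2.8: control back to the app

    def remote_va(self, offset):             # later used to target RDMA ops
        return self.base_va + offset

rt = Runtime(cache_size=4096)
node = MemoryNodeStub(capacity=1 << 20)
rt.init(node, remote_size=64 * 1024)
```

After `init` returns, every subsequent RDMA request can address remote objects purely by offset, since `remote_va` reconstructs the memory node's virtual address from BaseVA.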
The memory access process is invoked while the application runs. Because the compiler plug-in has already translated remote-memory access instructions into calls to the runtime library's access callbacks at compile time, the application enters the access process automatically whenever it touches memory managed by the runtime library. As shown in Figure 4, to support multiple granularities the runtime library manages memory of different sizes with different data structures: small-granularity memory (256 B, 512 B, 1 KB, 2 KB) is managed by a set-associative cache, while large-granularity memory (4 KB and above) is managed by a page cache organized as a linked list together with a separate page address mapping table. Besides explicit calls to the prefetch function by the application, the runtime library also prefetches memory implicitly during accesses to improve remote-memory performance. The access process proceeds as follows:
S3.1 The application triggers the runtime library's access callback by touching heap memory;
S3.2 The runtime library determines the access granularity from the callback's arguments, selects the cache data structure accordingly, and looks the memory object up in that structure. If the lookup succeeds, the object is in the local cache and control jumps directly to S3.6. If the lookup fails, the object must be fetched from the remote end, and the process continues with the following steps:
S3.3 If the local cache is full, an eviction is performed first: the runtime library selects a victim object, writes it back to remote memory with an RDMA write request, and frees its cache slot;
S3.4 The runtime library issues an RDMA read request to bring the remote memory object into the freed local cache slot;
S3.5 If the runtime library recognizes the application's current access pattern, it prefetches ahead by a certain distance using one of its default prefetch policies, such as sequential or strided access;
S3.6 After the steps above complete, control is returned to the application.
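The access path S3.1–S3.6 can be sketched as follows. This is a simplified model, not the patent's implementation: the set/way geometry, the FIFO victim choice, and the one-object-ahead sequential prefetch are illustrative assumptions standing in for the set-associative cache, the S3.3 eviction, and the S3.5 prefetch policies:

```python
# Sketch of the access path (S3.1-S3.6): small objects (<= 2 KB) go through
# a set-associative cache, large ones (>= 4 KB) through a page map; on a
# miss a victim may be evicted (RDMA write-back) and the object is fetched
# (RDMA read), with a simple sequential prefetch standing in for S3.5.

class AccessPath:
    def __init__(self, remote, sets=4, ways=2):
        self.remote = remote            # object id -> bytes ("remote memory")
        self.sets, self.ways = sets, ways
        self.small = [[] for _ in range(sets)]   # set-associative cache
        self.pages = {}                          # page cache + address map
        self.rdma_reads = 0

    def _fetch(self, oid):                       # S3.4: RDMA read
        self.rdma_reads += 1
        return self.remote[oid]

    def access(self, oid, size):                 # S3.1/S3.2 entry point
        if size >= 4096:                         # large granularity
            if oid not in self.pages:
                self.pages[oid] = self._fetch(oid)
            return self.pages[oid]               # S3.6: return to the app
        way = self.small[oid % self.sets]        # small granularity
        for key, val in way:
            if key == oid:                       # cache hit: straight to S3.6
                return val
        if len(way) == self.ways:                # S3.3: set full, evict victim
            victim_id, victim = way.pop(0)       # FIFO victim choice (assumed)
            self.remote[victim_id] = victim      # RDMA write-back
        data = self._fetch(oid)
        way.append((oid, data))
        if oid + 1 in self.remote:               # S3.5: sequential prefetch
            nxt = self.small[(oid + 1) % self.sets]
            if all(k != oid + 1 for k, _ in nxt) and len(nxt) < self.ways:
                nxt.append((oid + 1, self._fetch(oid + 1)))
        return data                              # S3.6

remote = {i: bytes([i]) * 256 for i in range(8)}
path = AccessPath(remote)
path.access(0, 256)                 # miss: fetches object 0, prefetches 1
reads_after_miss = path.rdma_reads
path.access(1, 256)                 # served from cache: no new RDMA read
```

The second access illustrates the payoff of S3.5: because the sequential pattern was recognized on the first miss, object 1 is already cached and no further RDMA round trip is needed.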
The preferred embodiments of the present invention are described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations based on the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can derive from the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of the prior art shall fall within the scope of protection defined by the claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410517975.5A CN118535352A (en) | 2024-04-26 | 2024-04-26 | A multi-granularity remote memory runtime method based on user mode |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118535352A true CN118535352A (en) | 2024-08-23 |
Family
ID=92379934
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410517975.5A Pending CN118535352A (en) | 2024-04-26 | 2024-04-26 | A multi-granularity remote memory runtime method based on user mode |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118535352A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119906754A (en) * | 2025-01-17 | 2025-04-29 | 中山大学 | A programmable data exchange system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |