
CN100428161C - Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor - Google Patents


Info

Publication number
CN100428161C
CN100428161C · CNB2007100223705A · CN200710022370A
Authority
CN
China
Prior art keywords
node
nodes
chip
target program
instruction
Prior art date
Legal status
Active
Application number
CNB2007100223705A
Other languages
Chinese (zh)
Other versions
CN101051276A (en)
Inventor
王学香
凌明
杨军
刘新宁
陆生礼
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CNB2007100223705A priority Critical patent/CN100428161C/en
Publication of CN101051276A publication Critical patent/CN101051276A/en
Application granted granted Critical
Publication of CN100428161C publication Critical patent/CN100428161C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The automatic memory layout method for the storage subsystem of an embedded microprocessor is applied in system-on-chip design. Its steps are as follows: run the binary target program generated by the external ARMCC toolchain entirely in off-chip synchronous DRAM and record the embedded microprocessor's memory accesses during execution; using the link information and the access records from the previous step, partition the binary target program into a series of data nodes and instruction nodes, and generate a relation matrix representing the priority relationships between nodes; select the nodes to be run from on-chip SRAM in order of priority, producing a selected-node list; derive a new binary target program from the selected-node list; and place the parts of the new binary target program corresponding to the nodes in the selected-node list into on-chip SRAM for execution.

Description

Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor

Technical Field

The invention is an automatic memory layout method for the storage subsystem of an embedded microprocessor used in system-on-chip design, and belongs to the field of embedded microprocessor design.

Background

The storage subsystem of a modern system-on-chip generally consists of three main parts: the embedded microprocessor, on-chip memory, and off-chip memory. Owing to their physical locations and organization, each part affects chip performance differently. The on-chip memory is usually very fast on-chip SRAM; its address space is separate from that of the off-chip memory, but the two share the address and data buses. The off-chip memory is usually an off-chip synchronous DRAM chip; compared with on-chip SRAM, its access latency is high and variable, and its access power consumption is large. As system-on-chip design advances, the mismatch between the low read speed of off-chip memory and the high clock frequency of the processor directly limits the overall performance of the processor chip and causes a loss of processor performance. In addition, the high power consumption of off-chip memory accesses increases the power consumption of the processor chip.

Summary of the Invention

Technical problem: The object of the present invention is to solve the problems in the prior art described above by providing an automatic memory layout method for the storage subsystem of an embedded microprocessor, which optimizes the memory layout of the storage subsystem, improves its performance, and reduces its power consumption.

Technical solution: To solve the technical problems in the prior art, the automatic memory layout method of the present invention targets a storage subsystem comprising an ARM7TDMI embedded microprocessor from ARM, on-chip SRAM, and off-chip synchronous DRAM. The steps are as follows: a) run the original binary target program generated by the external ARMCC toolchain entirely in the off-chip synchronous DRAM and obtain the record of the ARM7TDMI embedded microprocessor's accesses to the off-chip synchronous DRAM during execution; b) using the link information generated by the ARMCC toolchain and the access records from the previous step, partition the original binary target program into a series of data nodes and instruction nodes, and generate a relation matrix representing the priority relationships between nodes; c) select the nodes to be run from on-chip SRAM: rank all nodes by priority and select the node with the highest priority; taking the influence of the relation matrix into account, repeat the ranking and selection for the remaining nodes until no further node fits in the on-chip SRAM, yielding the selected-node list; d) modify the original binary target program according to the selected-node list to obtain a new binary target program; e) place the parts of the new binary target program corresponding to the nodes in the selected-node list into on-chip SRAM for execution.

The original binary target program is partitioned into data nodes and instruction nodes as follows: a) symbol table reconstruction: convert the text file containing the symbol table generated by the ARMCC toolchain (the toolchain provided by ARM for compiling the ARM instruction set) into a new symbol table that the partitioning process can use; b) DCD data analysis: determine which parts of the original binary target program are DCD data nodes and which are DCD instruction nodes (DCD is a pseudo-instruction defined in ARM assembly language for initializing 32-bit data storage units; the compiler also uses it to store constants and data, referred to here as DCD data); c) function partitioning: split each function into instruction nodes according to how often its parts execute and where its jump instructions lie, i.e. use the jump instructions inside each function as boundaries to cut the function into several instruction nodes, so that all functions in the original binary target program are eventually partitioned into a series of instruction nodes; d) function call analysis: analyze every function call instruction in all instruction nodes and build a preliminary extended control flow graph of the entire original binary target program; e) global stack analysis: encapsulate the global stack of the original binary target program into a new global data node whose size is 1.5 times the stack size measured while the program runs; f) data access analysis: determine the relationships between the instruction nodes and the global data nodes, completing the extended control flow graph of the original binary target program. The data nodes are of two kinds: the original global data nodes of the binary target program, and the new global data node obtained by encapsulating the program's global stack. The instruction nodes are the function basic blocks obtained by cutting, at their internal jump instructions, those functions of the binary target program that fit within the capacity of the on-chip SRAM. While the data nodes and instruction nodes are generated, a relation matrix is generated at the same time to express how, owing to the interrelationships between nodes, the priorities of the other nodes change once one or more nodes have been moved into the on-chip SRAM.
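
For concreteness, the following is a minimal sketch of how the nodes produced by this partitioning might be represented in memory. The type and field names are illustrative assumptions rather than the patent's own data layout; C is used because the patent's linker is described as being written in C.

```c
/* Illustrative node record for the extended control flow graph; the field
 * names are assumptions, not the patent's own data structures. */
#include <stdint.h>

typedef enum {
    NODE_INSTRUCTION,   /* function basic block cut at internal jumps    */
    NODE_GLOBAL_DATA,   /* original global data variable                 */
    NODE_GLOBAL_STACK   /* encapsulated global stack (1.5x observed use) */
} node_kind_t;

typedef struct {
    node_kind_t kind;
    uint32_t    addr;          /* start address in the original binary    */
    uint32_t    size;          /* size in bytes                           */
    uint64_t    access_count;  /* accesses seen during the profiling run  */
    double      time_saved;    /* execution time saved if moved to SRAM   */
    double      energy_saved;  /* power saved if moved to SRAM            */
} node_t;
```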

The selected-node list is obtained with a knapsack algorithm: first, rank all nodes by priority and put the node with the highest priority into the selected-node list; then, taking the influence of the relation matrix into account, repeat the ranking and selection for the remaining nodes until no further node fits in the on-chip SRAM, at which point the selected-node list is complete. The priority used to rank and select the nodes can be either the ratio of the program execution time saved by placing a node in on-chip SRAM to the node's size, or the ratio of the power saved by placing the node in on-chip SRAM to the node's size; the larger the ratio, the higher the priority.

Beneficial effects: The automatic memory layout method of the present invention uses an on-chip SRAM allocation algorithm that takes the relationships between nodes into account, accurately analyzes the performance optimization effect on the program, and changes the program's memory layout so that the code and data of computation-intensive applications are allocated appropriately between off-chip synchronous DRAM and on-chip SRAM. The most frequently called and most time-consuming code is guaranteed to reside in on-chip SRAM and the on-chip SRAM space is fully utilized, which improves the performance of the storage subsystem, reduces the large power consumption caused by external memory reads, and optimizes the memory layout of the embedded microprocessor's storage subsystem.

The automatic memory layout method of the present invention can improve the performance of applications such as MP3 decoders by 37%-52%. The following table compares tests of the relevant applications:

Brief Description of the Drawings

Figure 1 is a block diagram of the storage subsystem of the ARM7TDMI embedded microprocessor.

Figure 2 is the workflow diagram of the present invention.

Figure 3 is the extended control flow graph of an application program.

The figures show: ARM7TDMI embedded microprocessor 1, on-chip SRAM 2, off-chip synchronous DRAM 3, storage subsystem performance simulation module 4, program divider 5, SPM allocator (on-chip SRAM allocator) 6, and linker 7.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Figure 1 is a block diagram of the storage subsystem of the ARM7TDMI embedded microprocessor. The storage subsystem comprises the ARM7TDMI embedded microprocessor 1, the on-chip SRAM 2, and the off-chip synchronous DRAM 3. The ARM7TDMI embedded microprocessor 1 and the on-chip SRAM 2 are connected through the AMBA bus. The off-chip synchronous DRAM 3 is connected to the ARM7TDMI embedded microprocessor 1 indirectly through an external memory control interface: the ARM7TDMI embedded microprocessor 1 is connected to the external memory control interface through the AMBA bus, and the data and control interfaces of the off-chip synchronous DRAM 3 are connected, via the data bus and the control bus respectively, to the corresponding data and control interfaces of the external memory control interface.

The workflow of the present invention is as follows (see Figure 2):

Step 1: run the binary target program generated by the external ARMCC toolchain entirely in the off-chip synchronous DRAM 3 and obtain the record of the ARM7TDMI embedded microprocessor 1's accesses to the off-chip synchronous DRAM 3 during execution.

Step 2: the program divider 5 partitions the binary target program into a series of data nodes and instruction nodes according to the link information generated by the ARMCC toolchain and the access records, collects the access statistics of each node, and generates the relation matrix describing the relationships between the nodes.

Step 3: the SPM allocator (on-chip SRAM allocator) 6 uses a knapsack algorithm, based on the access information of each node and the relation matrix, to select the subset of nodes to be placed in the on-chip SRAM 2 and obtain the selected-node list: rank all nodes by priority and select the node with the highest priority; taking the influence of the relation matrix into account, repeat the ranking and selection for the remaining nodes until no further node fits in the on-chip SRAM, yielding the selected-node list.

Step 4: the linker 7 modifies the original binary target program according to the selected-node list to obtain a new binary target program.

Step 5: the parts of the new binary target program generated by the linker 7 that correspond to the nodes in the selected-node list are copied into the on-chip SRAM 2 during the initialization phase of the new binary target program, completing the automatic memory layout optimization of the storage subsystem.

In the present invention, the original binary target program generated by the external ARMCC toolchain is partitioned into data nodes and instruction nodes as follows: a) symbol table reconstruction: convert the text file containing the symbol table generated by the ARMCC toolchain into a new symbol table that the partitioning process can use; b) DCD data analysis: determine which parts of the original binary target program are DCD data nodes and which are DCD instruction nodes; c) function partitioning: split each function into instruction nodes according to how often its parts execute and where its jump instructions lie, i.e. use the jump instructions inside each function as boundaries to cut the function into several instruction nodes, so that all functions in the original binary target program are eventually partitioned into a series of instruction nodes. Because of jump instructions, the individual instructions of a program do not all execute the same number of times, so placing the most frequently executed instructions in on-chip memory saves the most execution time. Here, a jump instruction means either a backward jump whose target address lies inside the same function before the jump itself (for example, node v3 jumping back to node v2 in Figure 3) or a forward jump whose target address lies inside the same function several instructions after the jump (for example, node v2 jumping forward to node v5 in Figure 3); jumps whose target address lies outside the function are excluded, and both conditional and unconditional jump instructions are counted. d) function call analysis: analyze every function call instruction in all instruction nodes and build a preliminary extended control flow graph of the entire original binary target program; e) global stack analysis: encapsulate the global stack of the original binary target program into a new global data node whose size is 1.5 times the stack size measured while the program runs; f) data access analysis: determine the relationships between the instruction nodes and the global data nodes, completing the extended control flow graph of the original binary target program. The data nodes are of two kinds: the original global data nodes of the original binary target program, and the new global data node obtained by encapsulating the program's global stack. The instruction nodes are the function basic blocks obtained by cutting, at their internal jump instructions, those functions of the original binary target program that fit within the capacity of the on-chip SRAM 2.
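
As an illustration of step c), the sketch below cuts one function into instruction nodes at its internal jumps. It is a minimal sketch under stated assumptions: 32-bit ARM-state encodings (B has bits [27:24] = 1010, a signed 24-bit word offset, and a pipeline PC offset of +8); Thumb code and DCD literal words embedded in the function would need separate handling, and the helper names are illustrative, not the patent's implementation.

```c
/* Sketch: cutting a function into instruction nodes at its internal jumps.
 * Assumes 32-bit ARM-state code; helper names are illustrative. */
#include <stdint.h>
#include <stdio.h>

static int is_jump(uint32_t insn) {
    /* B (conditional or unconditional): bits[27:24] == 1010.  BL is treated
     * as a function call and left to the function call analysis. */
    return (insn & 0x0F000000u) == 0x0A000000u;
}

static uint32_t jump_target(uint32_t insn, uint32_t pc) {
    int32_t off = ((int32_t)(insn << 8)) >> 6;   /* sign-extend imm24, then <<2 */
    return pc + 8u + (uint32_t)off;              /* +8 for the ARM7 pipeline    */
}

/* Print the instruction-node boundaries of a function occupying [start, end). */
static void split_function(const uint32_t *code, uint32_t start, uint32_t end) {
    uint32_t node_start = start;
    for (uint32_t pc = start; pc < end; pc += 4) {
        uint32_t insn = code[(pc - start) / 4];
        if (!is_jump(insn))
            continue;
        uint32_t tgt = jump_target(insn, pc);
        if (tgt >= start && tgt < end) {         /* internal jump only */
            printf("node: 0x%08X..0x%08X\n", node_start, pc + 4);
            node_start = pc + 4;
        }
    }
    if (node_start < end)
        printf("node: 0x%08X..0x%08X\n", node_start, end);
}
```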

In one embodiment of the present invention, the priority used to rank and select the nodes is the ratio of the program execution time saved by placing a node in the on-chip SRAM 2 to the node's size; the larger the ratio, the higher the priority.

In another embodiment of the present invention, the priority used to rank and select the nodes is the ratio of the power saved by placing a node in the on-chip SRAM 2 to the node's size; the larger the ratio, the higher the priority.

The four key modules involved in the automatic memory layout method of the present invention are described below.

1. Storage subsystem performance simulation module 4:

The storage subsystem performance simulation module of the embedded microprocessor is built on the simulation capability provided by ARM's ARMCC toolchain, so that when an application runs on this module, a record of the application's accesses to the storage subsystem is obtained for use in the subsequent steps.
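
The patent does not spell out the format of this access record, so the sketch below shows one plausible per-access entry that would be sufficient for the per-node statistics computed later; the field names are assumptions.

```c
/* Assumed layout of one entry in the access record emitted by the
 * simulation module; the real trace format is not specified by the patent. */
#include <stdint.h>

typedef enum { ACC_IFETCH, ACC_DATA_READ, ACC_DATA_WRITE } access_kind_t;

typedef struct {
    uint64_t      cycle;    /* cycle at which the access was issued       */
    uint32_t      address;  /* address driven on the AMBA bus             */
    access_kind_t kind;     /* instruction fetch, data read or data write */
    uint8_t       width;    /* access width in bytes: 1, 2 or 4           */
} access_record_t;
```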

2. Program divider 5:

To take full advantage of the compilation and optimization capabilities of ARM's ARMCC toolchain, the on-chip SRAM optimization technique of the present invention directly analyzes the binary target program file output after ARMCC linking, rather than the program's C source code. By rebuilding the program's symbol table, it analyzes the entire content of the program, including functions, global data variables, and the global stack. All functions that fit within the capacity of the on-chip SRAM are cut into basic blocks at their internal jump instructions, and the program's global stack is encapsulated as a single global data variable. Because the program's stack size varies with its input, this data variable is sized at 1.5 times the stack size actually used during the profiling run in order to prevent stack overflow; the original global data variables of the binary target program are left unchanged. Finally, the program content passed to the SPM allocator (on-chip SRAM allocator) 6 for selection consists of the global data variables and the function basic blocks.
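
A minimal sketch of this stack-sizing rule follows, assuming the profiling run reports the lowest stack pointer value it observed (ARM conventionally uses a descending stack); the function name is illustrative.

```c
/* Size of the encapsulated global stack node: 1.5x the stack actually used
 * during the profiling run, to guard against overflow on other inputs. */
#include <stdint.h>

static uint32_t stack_node_size(uint32_t stack_top, uint32_t lowest_sp_seen) {
    uint32_t used = stack_top - lowest_sp_seen;   /* bytes used while profiling */
    return used + used / 2u;                      /* 1.5 x observed usage       */
}
```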

The program divider 5 partitions the original binary target program output by the ARMCC toolchain in six successive steps: symbol table reconstruction, DCD data analysis, function partitioning, function call analysis, global stack analysis, and data access analysis. After these six steps, the original binary target program has been divided into a series of instruction nodes and data nodes, where the data nodes are of two kinds, namely the original global data nodes of the original binary target program and the new global data node obtained by encapsulating its global stack, and the instruction nodes are the function basic blocks obtained by cutting, at their internal jump instructions, those functions of the original binary target program that fit within the capacity of the on-chip SRAM 2. To simplify the design and implementation of the algorithm in the subsequent SPM allocator (on-chip SRAM allocator) 6, the present invention builds an extended control flow graph to describe these program contents and the relationships between them. In Figure 3, V6 is a data node and the other nodes are instruction nodes; apart from the unconditional transfer relationship, the relationships between nodes are all illustrated in the figure: the conditional transfer relationship, the data access relationship, the sequential execution relationship, and the call relationship. Some ARM instructions place strict limits on the address range they can reach, for example jump instructions and data load instructions. Normally a single jump instruction cannot jump between off-chip memory and on-chip memory because the address distance is too large, so a jump instruction moved from off-chip memory into on-chip memory must be replaced by two instructions, which increases the size of the node moved into on-chip memory and thereby affects program performance and power consumption. It is therefore necessary to analyze, for each of the five kinds of relationship between nodes, how the node size, the program performance, and the power consumption change when the initiating node or the receiving node of the relationship is moved into the on-chip SRAM 2. The present invention uses relation matrices to express the influence of the relationships between nodes on node size, program performance, and power consumption: a performance matrix TRM_ij, a power matrix ERM_ij, and a size matrix SRM_ij. They express, if and only if node V_i is moved into the SPM, the effect of all relationships between node V_j and V_i on program execution time, on power savings, and on node size, as given by the following formulas:

$$TRM_{ij} = \sum_{type=1}^{5}\left(\Delta T_{start}\left(R_{type}(i,j)\right) + \Delta T_{end}\left(R_{type}(j,i)\right)\right)$$

$$ERM_{ij} = \sum_{type=1}^{5}\left(\Delta E_{start}\left(R_{type}(i,j)\right) + \Delta E_{end}\left(R_{type}(j,i)\right)\right)$$

$$SRM_{ij} = \sum_{type=1}^{5}\Delta S\left(R_{type}(i,j)\right)$$

In these formulas, R_type(i,j) denotes the five kinds of relationship between nodes, ΔT_start denotes the change in program execution time when the initiating node of a relationship is moved into the on-chip SRAM 2, ΔT_end denotes the change in program execution time when the receiving node of a relationship is moved into the on-chip SRAM 2, ΔE_start denotes the change in the power saved when the initiating node is moved into the on-chip SRAM 2, ΔE_end denotes the change in the power saved when the receiving node is moved into the on-chip SRAM 2, and ΔS denotes the change in node size when the node is moved into the on-chip SRAM 2.
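
The sums above can be accumulated edge by edge over the extended control flow graph. The sketch below shows one way to do this; the edge structure, the per-edge delta fields, and MAX_NODES are assumptions made for illustration, since the patent defines only the resulting matrices.

```c
/* Sketch: accumulating TRM, ERM and SRM from the graph edges. */
#include <stddef.h>

#define MAX_NODES 1024

typedef struct {
    int    src, dst;           /* relationship initiator V_i, receiver V_j    */
    int    type;               /* which of the five relationship kinds (1..5) */
    double dT_start, dT_end;   /* execution-time change if src / dst moves    */
    double dE_start, dE_end;   /* power-saving change if src / dst moves      */
    double dS;                 /* size change of the moved initiator node     */
} relation_edge_t;

double TRM[MAX_NODES][MAX_NODES];
double ERM[MAX_NODES][MAX_NODES];
double SRM[MAX_NODES][MAX_NODES];

void build_relation_matrices(const relation_edge_t *edges, size_t n_edges) {
    for (size_t k = 0; k < n_edges; ++k) {
        const relation_edge_t *e = &edges[k];
        /* Entry [i][j] collects the effect of every relation between V_i and
         * V_j that applies when V_i is the node moved into the SPM. */
        TRM[e->src][e->dst] += e->dT_start;   /* initiator V_i moved         */
        TRM[e->dst][e->src] += e->dT_end;     /* receiver V_j moved instead  */
        ERM[e->src][e->dst] += e->dE_start;
        ERM[e->dst][e->src] += e->dE_end;
        SRM[e->src][e->dst] += e->dS;
    }
}
```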

3. SPM allocator (on-chip SRAM allocator) 6:

The allocator uses a modified knapsack algorithm that takes the relationships between nodes into account to rank and select all nodes and obtain the list of nodes to be placed in the on-chip SRAM 2. The steps are: a) rank all nodes by priority and put the node with the highest priority into the selected-node list; b) taking the influence of the relation matrices into account, repeat the ranking and selection of the previous step for the remaining nodes until no further node fits in the on-chip SRAM 2, yielding the selected-node list. The criterion for selecting the nodes placed in the on-chip SRAM 2 can be either the performance gain obtained by placing each node in the on-chip SRAM 2 or the power saved by doing so; the larger the performance gain, or the greater the power saving, the higher the priority.
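
A greedy sketch of this selection loop follows, reusing the TRM and SRM arrays from the previous sketch; the benefit field and the way the matrices adjust it are illustrative simplifications of the patent's relation-aware knapsack algorithm (the power-driven variant would use ERM in place of TRM).

```c
/* Sketch of the relation-aware greedy knapsack selection. */
#include <stdbool.h>

#define MAX_NODES 1024
extern double TRM[MAX_NODES][MAX_NODES];  /* from the relation-matrix sketch */
extern double SRM[MAX_NODES][MAX_NODES];

typedef struct {
    double benefit;   /* time (or power) saved if moved into on-chip SRAM */
    int    size;      /* current node size in bytes                       */
    bool   selected;
} alloc_node_t;

void select_nodes(alloc_node_t *nodes, int n, int spm_capacity) {
    int used = 0;
    for (;;) {
        int best = -1;
        double best_ratio = 0.0;
        for (int i = 0; i < n; ++i) {
            if (nodes[i].selected || used + nodes[i].size > spm_capacity)
                continue;
            double ratio = nodes[i].benefit / nodes[i].size;
            if (ratio > best_ratio) { best_ratio = ratio; best = i; }
        }
        if (best < 0)                        /* nothing else fits: done */
            break;
        nodes[best].selected = true;
        used += nodes[best].size;
        /* Moving `best` into the SPM changes the remaining nodes' benefit
         * and size through the relation matrices, so re-rank next pass. */
        for (int j = 0; j < n; ++j) {
            if (!nodes[j].selected) {
                nodes[j].benefit += TRM[best][j];
                nodes[j].size    += (int)SRM[best][j];
            }
        }
    }
}
```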

4. Linker 7:

Once the list of nodes selected for placement in the on-chip SRAM under its current capacity has been obtained, the linker 7, written in C, relinks all nodes of the program and automatically generates a new binary file consisting of three parts: an initialization part, an off-chip synchronous DRAM part, and an on-chip SRAM part. The off-chip synchronous DRAM part is the original binary target program with its jump instructions modified according to the node placement; the initialization part and the on-chip SRAM part are added by the linker, and the on-chip SRAM part also has its jump instructions modified according to the node placement. The initialization part copies the on-chip SRAM part into the actual storage space of the on-chip SRAM 2 and then jumps to the start address of the original program's main function; if the first node of the main function has been selected for placement in the on-chip SRAM 2, the initialization part jumps directly to the corresponding location in the on-chip SRAM 2. The new binary target program begins execution at the initialization part.
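
As an illustration only, a C-level sketch of such an initialization part follows. The section symbols and the entry-point handoff are assumptions about how the linker 7 might lay out the new binary; real ARM7TDMI startup code would typically perform this copy in assembly before any C runtime is available.

```c
/* Sketch of the initialization part: copy the on-chip SRAM image from
 * SDRAM into the real on-chip SRAM 2, then jump to the program entry. */
#include <stdint.h>

extern uint32_t __spm_load_start[];   /* SRAM-part image stored in SDRAM */
extern uint32_t __spm_load_end[];
extern uint32_t __spm_base[];         /* base address of on-chip SRAM 2  */

void spm_init(void (*entry)(void)) {
    const uint32_t *src = __spm_load_start;
    uint32_t       *dst = __spm_base;

    while (src < __spm_load_end)      /* word-by-word copy of the SRAM part */
        *dst++ = *src++;

    /* `entry` is main()'s original first address in SDRAM, or its copy in
     * SRAM if main's first node was selected for placement there. */
    entry();
}
```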

Claims (3)

1. An automatic memory layout method for the storage subsystem of an embedded microprocessor, the storage subsystem comprising an ARM7TDMI embedded microprocessor (1), an on-chip SRAM (2), and an off-chip synchronous DRAM (3), characterized in that the automatic memory layout method comprises the following steps:

1a) running the original binary target program generated by the ARMCC toolchain entirely in the off-chip synchronous DRAM (3) and obtaining the record of the accesses made by the ARM7TDMI embedded microprocessor (1) to the off-chip synchronous DRAM (3) during execution;

1b) partitioning the original binary target program into a series of data nodes and instruction nodes according to the link information generated by the ARMCC toolchain and the access records of step 1a), and generating a relation matrix representing the priority relationships between nodes;

1c) selecting the nodes to be run from the on-chip SRAM (2):

1c1) ranking all of the nodes by priority and selecting the node with the highest priority;

1c2) in view of the influence of the relation matrix of step 1b), repeating the ranking and selection of step 1c1) for the remaining nodes until no further node fits in the on-chip SRAM (2), yielding the selected-node list;

1d) modifying the original binary target program according to the selected-node list to obtain a new binary target program;

1e) placing the parts of the new binary target program corresponding to the nodes in the selected-node list into the on-chip SRAM (2);

wherein the original binary target program is partitioned into data nodes and instruction nodes by the following steps:

2a) symbol table reconstruction: converting the text file containing the symbol table generated by the ARMCC toolchain into a new symbol table that the partitioning process can use;

2b) DCD data analysis: determining which parts of the original binary target program are DCD data nodes and which are DCD instruction nodes;

2c) function partitioning: splitting each function into instruction nodes according to how often its parts execute and where its jump instructions lie, i.e. using the jump instructions inside each function as boundaries to cut the function into several instruction nodes, so that all functions in the original binary target program are eventually partitioned into a series of instruction nodes;

2d) function call analysis: analyzing every function call instruction in all instruction nodes and building a preliminary extended control flow graph of the entire original binary target program;

2e) global stack analysis: encapsulating the global stack of the original binary target program into a new global data node whose size is 1.5 times the stack size measured while the program runs;

2f) data access analysis: determining the relationships between the instruction nodes and the global data nodes, completing the extended control flow graph of the original binary target program.

2. The automatic memory layout method for the storage subsystem of an embedded microprocessor according to claim 1, characterized in that the priority is the ratio of the program execution time saved by placing each node in the on-chip SRAM (2) to the node's size; the larger the ratio, the higher the priority.

3. The automatic memory layout method for the storage subsystem of an embedded microprocessor according to claim 1, characterized in that the priority is the ratio of the power saved by placing each node in the on-chip SRAM (2) to the node's size; the larger the ratio, the higher the priority.
CNB2007100223705A 2007-05-15 2007-05-15 Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor Active CN100428161C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100223705A CN100428161C (en) 2007-05-15 2007-05-15 Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor

Publications (2)

Publication Number Publication Date
CN101051276A CN101051276A (en) 2007-10-10
CN100428161C true CN100428161C (en) 2008-10-22

Family

ID=38782701

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100223705A Active CN100428161C (en) 2007-05-15 2007-05-15 Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor

Country Status (1)

Country Link
CN (1) CN100428161C (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291480B (en) * 2017-08-15 2020-12-15 中国农业银行股份有限公司 Function calling method and device
CN111177026B (en) * 2019-09-11 2024-09-27 腾讯科技(深圳)有限公司 Method and device for modifying variable memory layout and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6952821B2 (en) * 2002-08-19 2005-10-04 Hewlett-Packard Development Company, L.P. Method and system for memory management optimization
US20060080372A1 (en) * 2004-09-21 2006-04-13 Barua Rajeey K Compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Mahmut Kandemir et al. "A Compiler-Based Approach for Dynamically Managing Scratch-Pad Memories in Embedded Systems." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 23, No. 2, 2004. *
Pu Hanlai, Ling Ming, Jin Jing. "Extended Control Flow Graph Based Performance Optimization Using Scratch-Pad Memory." DATE'05, IEEE Computer Society, 2005. *
Pu Hanlai, Ling Ming, Jin Jing, Zhou Fan. "On-Chip Memory Allocation Strategy Based on Nested-Loop Instruction Analysis." Journal of Circuits and Systems, Vol. 11, No. 1, 2006 (in Chinese). *
Jin Jing, Pu Hanlai, Ling Ming. "On-Chip Memory Allocation Strategy for Power Optimization." Journal of Applied Sciences, Vol. 24, No. 2, 2006 (in Chinese). *

Also Published As

Publication number Publication date
CN101051276A (en) 2007-10-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant