
CN100428161C - Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor - Google Patents


Info

Publication number
CN100428161C
CN100428161C · CNB2007100223705A · CN200710022370A
Authority
CN
China
Prior art keywords
node
nodes
chip
target program
instruction
Prior art date
Legal status
Active
Application number
CNB2007100223705A
Other languages
Chinese (zh)
Other versions
CN101051276A (en)
Inventor
王学香
凌明
杨军
刘新宁
陆生礼
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CNB2007100223705A priority Critical patent/CN100428161C/en
Publication of CN101051276A publication Critical patent/CN101051276A/en
Application granted granted Critical
Publication of CN100428161C publication Critical patent/CN100428161C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The automatic memory layout method for the storage subsystem of an embedded microprocessor is applied in system-on-chip design. Its steps are as follows: run the binary target program generated by the external ARMCC toolchain entirely in off-chip synchronous DRAM and record the embedded microprocessor's memory accesses during execution; using the link information and the access records from the previous step, partition the binary target program into a series of data nodes and instruction nodes, and generate a relation matrix representing the priority relationships between nodes; select the nodes to be run from on-chip SRAM in order of priority, producing a selected-node list; derive a new binary target program from the selected-node list; and place the parts of the new binary target program corresponding to the nodes in the selected-node list into on-chip SRAM for execution.

Description

Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor

Technical Field

The invention is an automatic memory layout method for the storage subsystem of an embedded microprocessor used in system-on-chip design, and belongs to the field of embedded microprocessor design.

Background

The storage subsystem of a modern system-on-chip generally consists of three main parts: the embedded microprocessor, on-chip memory, and off-chip memory. Owing to their physical locations and organization, each part affects chip performance differently. The on-chip memory is usually very fast on-chip SRAM; its address space is separate from that of the off-chip memory, but the two share the address and data buses. The off-chip memory is usually an off-chip synchronous DRAM chip; compared with on-chip SRAM, its access latency is high and variable, and its access power consumption is large. As system-on-chip design advances, the mismatch between the low read speed of off-chip memory and the high clock frequency of the processor directly limits the overall performance of the processor chip and causes a loss of processor performance. In addition, the high power consumption of off-chip memory accesses increases the power consumption of the processor chip.

Summary of the Invention

Technical problem: The object of the present invention is to solve the problems in the prior art described above by providing an automatic memory layout method for the storage subsystem of an embedded microprocessor, which optimizes the memory layout of the storage subsystem, improves its performance, and reduces its power consumption.

Technical solution: To solve the technical problems in the prior art, the automatic memory layout method of the present invention targets a storage subsystem comprising an ARM7TDMI embedded microprocessor from ARM, on-chip SRAM, and off-chip synchronous DRAM. The steps are as follows: a) run the original binary target program generated by the external ARMCC toolchain entirely in the off-chip synchronous DRAM and obtain the record of the ARM7TDMI embedded microprocessor's accesses to the off-chip synchronous DRAM during execution; b) using the link information generated by the ARMCC toolchain and the access records from the previous step, partition the original binary target program into a series of data nodes and instruction nodes, and generate a relation matrix representing the priority relationships between nodes; c) select the nodes to be run from on-chip SRAM: rank all nodes by priority and select the node with the highest priority; taking the influence of the relation matrix into account, repeat the ranking and selection for the remaining nodes until no further node fits in the on-chip SRAM, yielding the selected-node list; d) modify the original binary target program according to the selected-node list to obtain a new binary target program; e) place the parts of the new binary target program corresponding to the nodes in the selected-node list into on-chip SRAM for execution.

The original binary target program is partitioned into data nodes and instruction nodes as follows: a) symbol table reconstruction: convert the text file containing the symbol table generated by the ARMCC toolchain (the toolchain provided by ARM for compiling the ARM instruction set) into a new symbol table that the partitioning process can use; b) DCD data analysis: determine which parts of the original binary target program are DCD data nodes and which are DCD instruction nodes (DCD is a pseudo-instruction defined in ARM assembly language for initializing 32-bit data storage units; the compiler also uses it to store constants and data, referred to here as DCD data); c) function partitioning: split each function into instruction nodes according to how often its parts execute and where its jump instructions lie, i.e. use the jump instructions inside each function as boundaries to cut the function into several instruction nodes, so that all functions in the original binary target program are eventually partitioned into a series of instruction nodes; d) function call analysis: analyze every function call instruction in all instruction nodes and build a preliminary extended control flow graph of the entire original binary target program; e) global stack analysis: encapsulate the global stack of the original binary target program into a new global data node whose size is 1.5 times the stack size measured while the program runs; f) data access analysis: determine the relationships between the instruction nodes and the global data nodes, completing the extended control flow graph of the original binary target program. The data nodes are of two kinds: the original global data nodes of the binary target program, and the new global data node obtained by encapsulating the program's global stack. The instruction nodes are the function basic blocks obtained by cutting, at their internal jump instructions, those functions of the binary target program that fit within the capacity of the on-chip SRAM. While the data nodes and instruction nodes are generated, a relation matrix is generated at the same time to express how, owing to the interrelationships between nodes, the priorities of the other nodes change once one or more nodes have been moved into the on-chip SRAM.
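
For concreteness, the following is a minimal sketch of how the nodes produced by this partitioning might be represented in memory. The type and field names are illustrative assumptions rather than the patent's own data layout; C is used because the patent's linker is described as being written in C.

```c
/* Illustrative node record for the extended control flow graph; the field
 * names are assumptions, not the patent's own data structures. */
#include <stdint.h>

typedef enum {
    NODE_INSTRUCTION,   /* function basic block cut at internal jumps    */
    NODE_GLOBAL_DATA,   /* original global data variable                 */
    NODE_GLOBAL_STACK   /* encapsulated global stack (1.5x observed use) */
} node_kind_t;

typedef struct {
    node_kind_t kind;
    uint32_t    addr;          /* start address in the original binary    */
    uint32_t    size;          /* size in bytes                           */
    uint64_t    access_count;  /* accesses seen during the profiling run  */
    double      time_saved;    /* execution time saved if moved to SRAM   */
    double      energy_saved;  /* power saved if moved to SRAM            */
} node_t;
```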

The selected-node list is obtained with a knapsack algorithm: first, rank all nodes by priority and put the node with the highest priority into the selected-node list; then, taking the influence of the relation matrix into account, repeat the ranking and selection for the remaining nodes until no further node fits in the on-chip SRAM, at which point the selected-node list is complete. The priority used to rank and select the nodes can be either the ratio of the program execution time saved by placing a node in on-chip SRAM to the node's size, or the ratio of the power saved by placing the node in on-chip SRAM to the node's size; the larger the ratio, the higher the priority.

Beneficial effects: The automatic memory layout method of the present invention uses an on-chip SRAM allocation algorithm that takes the relationships between nodes into account, accurately analyzes the performance optimization effect on the program, and changes the program's memory layout so that the code and data of computation-intensive applications are allocated appropriately between off-chip synchronous DRAM and on-chip SRAM. The most frequently called and most time-consuming code is guaranteed to reside in on-chip SRAM and the on-chip SRAM space is fully utilized, which improves the performance of the storage subsystem, reduces the large power consumption caused by external memory reads, and optimizes the memory layout of the embedded microprocessor's storage subsystem.

The automatic memory layout method of the present invention can improve the performance of applications such as MP3 decoders by 37%-52%. The following table compares tests of the relevant applications:

Brief Description of the Drawings

Figure 1 is a block diagram of the storage subsystem of the ARM7TDMI embedded microprocessor.

Figure 2 is the workflow diagram of the present invention.

Figure 3 is the extended control flow graph of an application program.

The figures show: ARM7TDMI embedded microprocessor 1, on-chip SRAM 2, off-chip synchronous DRAM 3, storage subsystem performance simulation module 4, program divider 5, SPM allocator (on-chip SRAM allocator) 6, and linker 7.

Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Figure 1 is a block diagram of the storage subsystem of the ARM7TDMI embedded microprocessor. The storage subsystem comprises the ARM7TDMI embedded microprocessor 1, the on-chip SRAM 2, and the off-chip synchronous DRAM 3. The ARM7TDMI embedded microprocessor 1 and the on-chip SRAM 2 are connected through the AMBA bus. The off-chip synchronous DRAM 3 is connected to the ARM7TDMI embedded microprocessor 1 indirectly through an external memory control interface: the ARM7TDMI embedded microprocessor 1 is connected to the external memory control interface through the AMBA bus, and the data and control interfaces of the off-chip synchronous DRAM 3 are connected, via the data bus and the control bus respectively, to the corresponding data and control interfaces of the external memory control interface.

The workflow of the present invention is as follows (see Figure 2):

Step 1: run the binary target program generated by the external ARMCC toolchain entirely in the off-chip synchronous DRAM 3 and obtain the record of the ARM7TDMI embedded microprocessor 1's accesses to the off-chip synchronous DRAM 3 during execution.

Step 2: the program divider 5 partitions the binary target program into a series of data nodes and instruction nodes according to the link information generated by the ARMCC toolchain and the access records, collects the access statistics of each node, and generates the relation matrix describing the relationships between the nodes.

Step 3: the SPM allocator (on-chip SRAM allocator) 6 uses a knapsack algorithm, based on the access information of each node and the relation matrix, to select the subset of nodes to be placed in the on-chip SRAM 2 and obtain the selected-node list: rank all nodes by priority and select the node with the highest priority; taking the influence of the relation matrix into account, repeat the ranking and selection for the remaining nodes until no further node fits in the on-chip SRAM, yielding the selected-node list.

Step 4: the linker 7 modifies the original binary target program according to the selected-node list to obtain a new binary target program.

Step 5: the parts of the new binary target program generated by the linker 7 that correspond to the nodes in the selected-node list are copied into the on-chip SRAM 2 during the initialization phase of the new binary target program, completing the automatic memory layout optimization of the storage subsystem.

In the present invention, the original binary target program generated by the external ARMCC toolchain is partitioned into data nodes and instruction nodes as follows: a) symbol table reconstruction: convert the text file containing the symbol table generated by the ARMCC toolchain into a new symbol table that the partitioning process can use; b) DCD data analysis: determine which parts of the original binary target program are DCD data nodes and which are DCD instruction nodes; c) function partitioning: split each function into instruction nodes according to how often its parts execute and where its jump instructions lie, i.e. use the jump instructions inside each function as boundaries to cut the function into several instruction nodes, so that all functions in the original binary target program are eventually partitioned into a series of instruction nodes. Because of jump instructions, the individual instructions of a program do not all execute the same number of times, so placing the most frequently executed instructions in on-chip memory saves the most execution time. Here, a jump instruction means either a backward jump whose target address lies inside the same function before the jump itself (for example, node v3 jumping back to node v2 in Figure 3) or a forward jump whose target address lies inside the same function several instructions after the jump (for example, node v2 jumping forward to node v5 in Figure 3); jumps whose target address lies outside the function are excluded, and both conditional and unconditional jump instructions are counted. d) function call analysis: analyze every function call instruction in all instruction nodes and build a preliminary extended control flow graph of the entire original binary target program; e) global stack analysis: encapsulate the global stack of the original binary target program into a new global data node whose size is 1.5 times the stack size measured while the program runs; f) data access analysis: determine the relationships between the instruction nodes and the global data nodes, completing the extended control flow graph of the original binary target program. The data nodes are of two kinds: the original global data nodes of the original binary target program, and the new global data node obtained by encapsulating the program's global stack. The instruction nodes are the function basic blocks obtained by cutting, at their internal jump instructions, those functions of the original binary target program that fit within the capacity of the on-chip SRAM 2.
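
As an illustration of step c), the sketch below cuts one function into instruction nodes at its internal jumps. It is a minimal sketch under stated assumptions: 32-bit ARM-state encodings (B has bits [27:24] = 1010, a signed 24-bit word offset, and a pipeline PC offset of +8); Thumb code and DCD literal words embedded in the function would need separate handling, and the helper names are illustrative, not the patent's implementation.

```c
/* Sketch: cutting a function into instruction nodes at its internal jumps.
 * Assumes 32-bit ARM-state code; helper names are illustrative. */
#include <stdint.h>
#include <stdio.h>

static int is_jump(uint32_t insn) {
    /* B (conditional or unconditional): bits[27:24] == 1010.  BL is treated
     * as a function call and left to the function call analysis. */
    return (insn & 0x0F000000u) == 0x0A000000u;
}

static uint32_t jump_target(uint32_t insn, uint32_t pc) {
    int32_t off = ((int32_t)(insn << 8)) >> 6;   /* sign-extend imm24, then <<2 */
    return pc + 8u + (uint32_t)off;              /* +8 for the ARM7 pipeline    */
}

/* Print the instruction-node boundaries of a function occupying [start, end). */
static void split_function(const uint32_t *code, uint32_t start, uint32_t end) {
    uint32_t node_start = start;
    for (uint32_t pc = start; pc < end; pc += 4) {
        uint32_t insn = code[(pc - start) / 4];
        if (!is_jump(insn))
            continue;
        uint32_t tgt = jump_target(insn, pc);
        if (tgt >= start && tgt < end) {         /* internal jump only */
            printf("node: 0x%08X..0x%08X\n", node_start, pc + 4);
            node_start = pc + 4;
        }
    }
    if (node_start < end)
        printf("node: 0x%08X..0x%08X\n", node_start, end);
}
```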

In one embodiment of the present invention, the priority used to rank and select the nodes is the ratio of the program execution time saved by placing a node in the on-chip SRAM 2 to the node's size; the larger the ratio, the higher the priority.

In another embodiment of the present invention, the priority used to rank and select the nodes is the ratio of the power saved by placing a node in the on-chip SRAM 2 to the node's size; the larger the ratio, the higher the priority.

The four key modules involved in the automatic memory layout method of the present invention are described below.

1. Storage subsystem performance simulation module 4:

The storage subsystem performance simulation module of the embedded microprocessor is built on the simulation capability provided by ARM's ARMCC toolchain, so that when an application runs on this module, a record of the application's accesses to the storage subsystem is obtained for use in the subsequent steps.
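
The patent does not spell out the format of this access record, so the sketch below shows one plausible per-access entry that would be sufficient for the per-node statistics computed later; the field names are assumptions.

```c
/* Assumed layout of one entry in the access record emitted by the
 * simulation module; the real trace format is not specified by the patent. */
#include <stdint.h>

typedef enum { ACC_IFETCH, ACC_DATA_READ, ACC_DATA_WRITE } access_kind_t;

typedef struct {
    uint64_t      cycle;    /* cycle at which the access was issued       */
    uint32_t      address;  /* address driven on the AMBA bus             */
    access_kind_t kind;     /* instruction fetch, data read or data write */
    uint8_t       width;    /* access width in bytes: 1, 2 or 4           */
} access_record_t;
```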

2. Program divider 5:

To take full advantage of the compilation and optimization capabilities of ARM's ARMCC toolchain, the on-chip SRAM optimization technique of the present invention directly analyzes the binary target program file output after ARMCC linking, rather than the program's C source code. By rebuilding the program's symbol table, it analyzes the entire content of the program, including functions, global data variables, and the global stack. All functions that fit within the capacity of the on-chip SRAM are cut into basic blocks at their internal jump instructions, and the program's global stack is encapsulated as a single global data variable. Because the program's stack size varies with its input, this data variable is sized at 1.5 times the stack size actually used during the profiling run in order to prevent stack overflow; the original global data variables of the binary target program are left unchanged. Finally, the program content passed to the SPM allocator (on-chip SRAM allocator) 6 for selection consists of the global data variables and the function basic blocks.
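
A minimal sketch of this stack-sizing rule follows, assuming the profiling run reports the lowest stack pointer value it observed (ARM conventionally uses a descending stack); the function name is illustrative.

```c
/* Size of the encapsulated global stack node: 1.5x the stack actually used
 * during the profiling run, to guard against overflow on other inputs. */
#include <stdint.h>

static uint32_t stack_node_size(uint32_t stack_top, uint32_t lowest_sp_seen) {
    uint32_t used = stack_top - lowest_sp_seen;   /* bytes used while profiling */
    return used + used / 2u;                      /* 1.5 x observed usage       */
}
```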

The program divider 5 partitions the original binary target program output by the ARMCC toolchain in six successive steps: symbol table reconstruction, DCD data analysis, function partitioning, function call analysis, global stack analysis, and data access analysis. After these six steps, the original binary target program has been divided into a series of instruction nodes and data nodes, where the data nodes are of two kinds, namely the original global data nodes of the original binary target program and the new global data node obtained by encapsulating its global stack, and the instruction nodes are the function basic blocks obtained by cutting, at their internal jump instructions, those functions of the original binary target program that fit within the capacity of the on-chip SRAM 2. To simplify the design and implementation of the algorithm in the subsequent SPM allocator (on-chip SRAM allocator) 6, the present invention builds an extended control flow graph to describe these program contents and the relationships between them. In Figure 3, V6 is a data node and the other nodes are instruction nodes; apart from the unconditional transfer relationship, the relationships between nodes are all illustrated in the figure: the conditional transfer relationship, the data access relationship, the sequential execution relationship, and the call relationship. Some ARM instructions place strict limits on the address range they can reach, for example jump instructions and data load instructions. Normally a single jump instruction cannot jump between off-chip memory and on-chip memory because the address distance is too large, so a jump instruction moved from off-chip memory into on-chip memory must be replaced by two instructions, which increases the size of the node moved into on-chip memory and thereby affects program performance and power consumption. It is therefore necessary to analyze, for each of the five kinds of relationship between nodes, how the node size, the program performance, and the power consumption change when the initiating node or the receiving node of the relationship is moved into the on-chip SRAM 2. The present invention uses relation matrices to express the influence of the relationships between nodes on node size, program performance, and power consumption: a performance matrix TRM_ij, a power matrix ERM_ij, and a size matrix SRM_ij. They express, if and only if node V_i is moved into the SPM, the effect of all relationships between node V_j and V_i on program execution time, on power savings, and on node size, as given by the following formulas:

$$TRM_{ij} = \sum_{type=1}^{5}\left(\Delta T_{start}\left(R_{type}(i,j)\right) + \Delta T_{end}\left(R_{type}(j,i)\right)\right)$$

$$ERM_{ij} = \sum_{type=1}^{5}\left(\Delta E_{start}\left(R_{type}(i,j)\right) + \Delta E_{end}\left(R_{type}(j,i)\right)\right)$$

$$SRM_{ij} = \sum_{type=1}^{5}\Delta S\left(R_{type}(i,j)\right)$$

In these formulas, R_type(i,j) denotes the five kinds of relationship between nodes, ΔT_start denotes the change in program execution time when the initiating node of a relationship is moved into the on-chip SRAM 2, ΔT_end denotes the change in program execution time when the receiving node of a relationship is moved into the on-chip SRAM 2, ΔE_start denotes the change in the power saved when the initiating node is moved into the on-chip SRAM 2, ΔE_end denotes the change in the power saved when the receiving node is moved into the on-chip SRAM 2, and ΔS denotes the change in node size when the node is moved into the on-chip SRAM 2.
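
The sums above can be accumulated edge by edge over the extended control flow graph. The sketch below shows one way to do this; the edge structure, the per-edge delta fields, and MAX_NODES are assumptions made for illustration, since the patent defines only the resulting matrices.

```c
/* Sketch: accumulating TRM, ERM and SRM from the graph edges. */
#include <stddef.h>

#define MAX_NODES 1024

typedef struct {
    int    src, dst;           /* relationship initiator V_i, receiver V_j    */
    int    type;               /* which of the five relationship kinds (1..5) */
    double dT_start, dT_end;   /* execution-time change if src / dst moves    */
    double dE_start, dE_end;   /* power-saving change if src / dst moves      */
    double dS;                 /* size change of the moved initiator node     */
} relation_edge_t;

double TRM[MAX_NODES][MAX_NODES];
double ERM[MAX_NODES][MAX_NODES];
double SRM[MAX_NODES][MAX_NODES];

void build_relation_matrices(const relation_edge_t *edges, size_t n_edges) {
    for (size_t k = 0; k < n_edges; ++k) {
        const relation_edge_t *e = &edges[k];
        /* Entry [i][j] collects the effect of every relation between V_i and
         * V_j that applies when V_i is the node moved into the SPM. */
        TRM[e->src][e->dst] += e->dT_start;   /* initiator V_i moved         */
        TRM[e->dst][e->src] += e->dT_end;     /* receiver V_j moved instead  */
        ERM[e->src][e->dst] += e->dE_start;
        ERM[e->dst][e->src] += e->dE_end;
        SRM[e->src][e->dst] += e->dS;
    }
}
```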

3. SPM allocator (on-chip SRAM allocator) 6:

The allocator uses a modified knapsack algorithm that takes the relationships between nodes into account to rank and select all nodes and obtain the list of nodes to be placed in the on-chip SRAM 2. The steps are: a) rank all nodes by priority and put the node with the highest priority into the selected-node list; b) taking the influence of the relation matrices into account, repeat the ranking and selection of the previous step for the remaining nodes until no further node fits in the on-chip SRAM 2, yielding the selected-node list. The criterion for selecting the nodes placed in the on-chip SRAM 2 can be either the performance gain obtained by placing each node in the on-chip SRAM 2 or the power saved by doing so; the larger the performance gain, or the greater the power saving, the higher the priority.
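
A greedy sketch of this selection loop follows, reusing the TRM and SRM arrays from the previous sketch; the benefit field and the way the matrices adjust it are illustrative simplifications of the patent's relation-aware knapsack algorithm (the power-driven variant would use ERM in place of TRM).

```c
/* Sketch of the relation-aware greedy knapsack selection. */
#include <stdbool.h>

#define MAX_NODES 1024
extern double TRM[MAX_NODES][MAX_NODES];  /* from the relation-matrix sketch */
extern double SRM[MAX_NODES][MAX_NODES];

typedef struct {
    double benefit;   /* time (or power) saved if moved into on-chip SRAM */
    int    size;      /* current node size in bytes                       */
    bool   selected;
} alloc_node_t;

void select_nodes(alloc_node_t *nodes, int n, int spm_capacity) {
    int used = 0;
    for (;;) {
        int best = -1;
        double best_ratio = 0.0;
        for (int i = 0; i < n; ++i) {
            if (nodes[i].selected || used + nodes[i].size > spm_capacity)
                continue;
            double ratio = nodes[i].benefit / nodes[i].size;
            if (ratio > best_ratio) { best_ratio = ratio; best = i; }
        }
        if (best < 0)                        /* nothing else fits: done */
            break;
        nodes[best].selected = true;
        used += nodes[best].size;
        /* Moving `best` into the SPM changes the remaining nodes' benefit
         * and size through the relation matrices, so re-rank next pass. */
        for (int j = 0; j < n; ++j) {
            if (!nodes[j].selected) {
                nodes[j].benefit += TRM[best][j];
                nodes[j].size    += (int)SRM[best][j];
            }
        }
    }
}
```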

4. Linker 7:

Once the list of nodes selected for placement in the on-chip SRAM under its current capacity has been obtained, the linker 7, written in C, relinks all nodes of the program and automatically generates a new binary file consisting of three parts: an initialization part, an off-chip synchronous DRAM part, and an on-chip SRAM part. The off-chip synchronous DRAM part is the original binary target program with its jump instructions modified according to the node placement; the initialization part and the on-chip SRAM part are added by the linker, and the on-chip SRAM part also has its jump instructions modified according to the node placement. The initialization part copies the on-chip SRAM part into the actual storage space of the on-chip SRAM 2 and then jumps to the start address of the original program's main function; if the first node of the main function has been selected for placement in the on-chip SRAM 2, the initialization part jumps directly to the corresponding location in the on-chip SRAM 2. The new binary target program begins execution at the initialization part.
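
As an illustration only, a C-level sketch of such an initialization part follows. The section symbols and the entry-point handoff are assumptions about how the linker 7 might lay out the new binary; real ARM7TDMI startup code would typically perform this copy in assembly before any C runtime is available.

```c
/* Sketch of the initialization part: copy the on-chip SRAM image from
 * SDRAM into the real on-chip SRAM 2, then jump to the program entry. */
#include <stdint.h>

extern uint32_t __spm_load_start[];   /* SRAM-part image stored in SDRAM */
extern uint32_t __spm_load_end[];
extern uint32_t __spm_base[];         /* base address of on-chip SRAM 2  */

void spm_init(void (*entry)(void)) {
    const uint32_t *src = __spm_load_start;
    uint32_t       *dst = __spm_base;

    while (src < __spm_load_end)      /* word-by-word copy of the SRAM part */
        *dst++ = *src++;

    /* `entry` is main()'s original first address in SDRAM, or its copy in
     * SRAM if main's first node was selected for placement there. */
    entry();
}
```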

Claims (3)

1. An automatic memory layout method for the storage subsystem of an embedded microprocessor, the storage subsystem comprising an ARM7TDMI embedded microprocessor (1), an on-chip SRAM (2), and an off-chip synchronous DRAM (3), characterized in that the automatic memory layout method comprises the following steps:

1a) running the original binary target program generated by the ARMCC toolchain entirely in the off-chip synchronous DRAM (3) and obtaining the record of the accesses made by the ARM7TDMI embedded microprocessor (1) to the off-chip synchronous DRAM (3) during execution;

1b) partitioning the original binary target program into a series of data nodes and instruction nodes according to the link information generated by the ARMCC toolchain and the access records of step 1a), and generating a relation matrix representing the priority relationships between nodes;

1c) selecting the nodes to be run from the on-chip SRAM (2):

1c1) ranking all of the nodes by priority and selecting the node with the highest priority;

1c2) in view of the influence of the relation matrix of step 1b), repeating the ranking and selection of step 1c1) for the remaining nodes until no further node fits in the on-chip SRAM (2), yielding the selected-node list;

1d) modifying the original binary target program according to the selected-node list to obtain a new binary target program;

1e) placing the parts of the new binary target program corresponding to the nodes in the selected-node list into the on-chip SRAM (2);

wherein the original binary target program is partitioned into data nodes and instruction nodes by the following steps:

2a) symbol table reconstruction: converting the text file containing the symbol table generated by the ARMCC toolchain into a new symbol table that the partitioning process can use;

2b) DCD data analysis: determining which parts of the original binary target program are DCD data nodes and which are DCD instruction nodes;

2c) function partitioning: splitting each function into instruction nodes according to how often its parts execute and where its jump instructions lie, i.e. using the jump instructions inside each function as boundaries to cut the function into several instruction nodes, so that all functions in the original binary target program are eventually partitioned into a series of instruction nodes;

2d) function call analysis: analyzing every function call instruction in all instruction nodes and building a preliminary extended control flow graph of the entire original binary target program;

2e) global stack analysis: encapsulating the global stack of the original binary target program into a new global data node whose size is 1.5 times the stack size measured while the program runs;

2f) data access analysis: determining the relationships between the instruction nodes and the global data nodes, completing the extended control flow graph of the original binary target program.

2. The automatic memory layout method for the storage subsystem of an embedded microprocessor according to claim 1, characterized in that the priority is the ratio of the program execution time saved by placing each node in the on-chip SRAM (2) to the node's size; the larger the ratio, the higher the priority.

3. The automatic memory layout method for the storage subsystem of an embedded microprocessor according to claim 1, characterized in that the priority is the ratio of the power saved by placing each node in the on-chip SRAM (2) to the node's size; the larger the ratio, the higher the priority.
CNB2007100223705A 2007-05-15 2007-05-15 Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor Active CN100428161C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100223705A CN100428161C (en) 2007-05-15 2007-05-15 Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor

Publications (2)

Publication Number Publication Date
CN101051276A CN101051276A (en) 2007-10-10
CN100428161C true CN100428161C (en) 2008-10-22

Family

ID=38782701

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100223705A Active CN100428161C (en) 2007-05-15 2007-05-15 Memory Automatic Layout Method for Storage Subsystem of Embedded Microprocessor

Country Status (1)

Country Link
CN (1) CN100428161C (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291480B (en) * 2017-08-15 2020-12-15 中国农业银行股份有限公司 Function calling method and device
CN111177026B (en) * 2019-09-11 2024-09-27 腾讯科技(深圳)有限公司 Method and device for modifying variable memory layout and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6952821B2 (en) * 2002-08-19 2005-10-04 Hewlett-Packard Development Company, L.P. Method and system for memory management optimization
US20060080372A1 (en) * 2004-09-21 2006-04-13 Barua Rajeey K Compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Mahmut Kandemir et al. "A Compiler-Based Approach for Dynamically Managing Scratch-Pad Memories in Embedded Systems." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 23, No. 2, 2004. *
Pu Hanlai, Ling Ming, Jin Jing. "Extended Control Flow Graph Based Performance Optimization Using Scratch-Pad Memory." DATE'05, IEEE Computer Society, 2005. *
Pu Hanlai, Ling Ming, Jin Jing, Zhou Fan. "On-Chip Memory Allocation Strategy Based on Nested-Loop Instruction Analysis." Journal of Circuits and Systems, Vol. 11, No. 1, 2006 (in Chinese). *
Jin Jing, Pu Hanlai, Ling Ming. "On-Chip Memory Allocation Strategy for Power Optimization." Journal of Applied Sciences, Vol. 24, No. 2, 2006 (in Chinese). *

Also Published As

Publication number Publication date
CN101051276A (en) 2007-10-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant