WO2017185256A1 - Rmsprop gradient descent algorithm execution apparatus and method - Google Patents
- Publication number
- WO2017185256A1 (PCT/CN2016/080354)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- vector
- instruction
- module
- unit
- updated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
Definitions
- The present invention relates to the field of RMSprop algorithm applications, and in particular to an apparatus and method for performing the RMSprop gradient descent algorithm, concerning the hardware implementation of the RMSprop gradient descent optimization algorithm.
- Gradient descent optimization algorithms are widely used in the fields of function approximation, optimization computation, pattern recognition, and image processing.
- The RMSprop algorithm is one such gradient descent optimization algorithm. It is widely used because it is easy to implement, computationally cheap, requires little storage space, and performs well when processing mini-batch data sets; implementing the RMSprop algorithm with a dedicated device can significantly increase its execution speed.
- A known method of executing the RMSprop gradient descent algorithm is to use a general-purpose processor.
- That method supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units.
- One disadvantage of this method is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck.
- In addition, the general-purpose processor must decode the operations of the RMSprop algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power overhead.
- Another known method of executing the RMSprop gradient descent algorithm is to use a graphics processing unit (GPU).
- That method supports the algorithm by executing general-purpose single-instruction-multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units.
- Since the GPU is a device dedicated to graphics operations and scientific computation, with no special support for the RMSprop gradient descent algorithm, it still requires a large amount of front-end decoding work to execute the operations of the algorithm, which incurs considerable extra overhead.
- Moreover, the GPU has only a small on-chip cache, so intermediate data required by the RMSprop gradient descent algorithm, such as the mean square vector, must be repeatedly transferred from off-chip; off-chip bandwidth thus becomes the main performance bottleneck and brings a large power overhead.
- In view of this, the main object of the present invention is to provide an apparatus and method for performing the RMSprop gradient descent algorithm, so as to solve the problems of insufficient arithmetic performance of general-purpose processors and large front-end decoding overhead, and to avoid repeatedly reading data from memory, thereby reducing the memory-access bandwidth.
- To achieve the above object, the present invention provides an apparatus for performing the RMSprop gradient descent algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:
- the direct memory access unit 1 is configured to access the external designated space, read and write data from and to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;
- the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache the instructions read;
- the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;
- the data cache unit 4 is configured to cache the mean square matrix during initialization and data updates;
- the data processing module 5 is configured to update the mean square vector and the parameters to be updated, write the updated mean square vector into the data cache unit 4, and write the updated parameters into the external designated space through the direct memory access unit 1.
- In the above scheme, the direct memory access unit 1 writes instructions from the external designated space into the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
- In the above scheme, the controller unit 3 decodes the read instruction into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5: it controls the direct memory access unit 1 to read data from the external designated address and write data to the external designated address, controls the data cache unit 4 to obtain the instructions required for an operation from the external designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls data transfer between the data cache unit 4 and the data processing module 5.
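As a rough software model of this decode-and-dispatch arrangement, the controller can be pictured as a fetch–decode–dispatch loop. Only the instruction names INSTRUCTION_IO, HYPERPARAMETER_IO, DATA_IO, and DATABACK_IO appear in the text; the unit objects, handler names, and the remaining opcodes below are illustrative assumptions, not part of the patent.

```python
def run_controller(instruction_cache, dma, data_cache, data_processor):
    """Sketch: fetch each cached instruction, decode it, and dispatch the
    resulting micro-instructions to the unit it controls (unit wiring and
    handler names are assumed for illustration)."""
    dispatch = {
        "INSTRUCTION_IO": dma.load_instructions,        # fill instruction cache
        "HYPERPARAMETER_IO": dma.load_hyperparameters,  # read alpha, delta, ct
        "DATA_IO": dma.load_parameters,                 # read theta_{t-1}, g_t
        "DATA_TRANSFER": data_cache.send_rms,           # move RMS_{t-1} out
        "RMS_UPDATE": data_processor.update_rms,        # mean square vector update
        "PARAM_UPDATE": data_processor.update_theta,    # parameter vector update
        "DATABACK_IO": dma.store_parameters,            # write theta_t back out
    }
    for opcode, operands in instruction_cache:
        dispatch[opcode](*operands)  # each instruction decodes to micro-instructions
```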
- In the above scheme, the data cache unit 4 initializes the mean square matrix RMS_t at initialization, and in each data update reads the mean square matrix RMS_{t-1} out into the data processing module 5, where it is updated to RMS_t and then written back into the data cache unit 4. Throughout the operation of the device, a copy of the mean square matrix RMS_t is always kept inside the data cache unit 4.
- In the above scheme, the data processing module 5 reads the mean square vector RMS_{t-1} from the data cache unit 4, and reads the parameter vector θ_{t-1} to be updated, the gradient vector g_t, the global update step size α, and the mean square vector update rate δ from the external designated space through the direct memory access unit 1; it updates the mean square vector RMS_{t-1} to RMS_t, updates the parameter θ_{t-1} to θ_t using RMS_t, writes RMS_t back into the data cache unit 4, and writes θ_t back to the external designated space through the direct memory access unit 1.
- In the above scheme, the data processing module 5 updates the mean square vector RMS_{t-1} to RMS_t according to the formula RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t, where ⊙ denotes the element-wise product.
- The data processing module 5 updates the vector θ_{t-1} to be updated to θ_t according to the formula θ_t = θ_{t-1} − α·g_t / √RMS_t, with all operations performed element-wise.
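For reference, these two updates can be traced with a short NumPy sketch. The function and variable names are illustrative, and the patent's formulas are used verbatim; practical RMSprop implementations usually add a small ε inside the square root to avoid division by zero, which the text does not mention.

```python
import numpy as np

def rmsprop_step(theta_prev, grad, rms_prev, alpha, delta):
    """One iteration of the update performed by data processing module 5.

    theta_prev: parameter vector theta_{t-1}   grad: gradient vector g_t
    rms_prev:   mean square vector RMS_{t-1}   alpha: global update step size
    delta:      mean square vector update rate. All operations are element-wise.
    """
    rms = (1.0 - delta) * rms_prev + delta * grad * grad  # RMS_t
    theta = theta_prev - alpha * grad / np.sqrt(rms)      # theta_t
    return theta, rms
```

A caller keeps rms between iterations, mirroring the copy of RMS_t that the data cache unit 4 retains on-chip so it never has to be re-fetched from the external space.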
- In the above scheme, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56.
- The vector addition, vector multiplication, vector division, and vector square root parallel operation sub-modules and the basic operation sub-module are connected in parallel, and the operation control sub-module 51 is connected in series with each of them.
- In the above scheme, when the apparatus operates on vectors, all vector operations are element-wise, and when an operation is performed on a vector, the elements at different positions are processed in parallel.
- The present invention also provides a method for performing the RMSprop gradient descent algorithm, the method comprising:
- A mean square vector RMS_0 is initialized, and the parameter vector θ_t to be updated and the corresponding gradient vector g_t are obtained from the designated storage unit.
- Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space.
- Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space and cache them into the instruction cache unit 2;
- Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the mean square vector update rate δ, and the convergence threshold ct from the external space, which are then sent to the data processing module 5;
- Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, initializes the mean square vector RMS_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing unit 5 to 1;
- Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector θ_{t-1} to be updated and the corresponding gradient vector g_t from the external designated space, which are then sent to the data processing module 5;
- Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the mean square vector RMS_{t-1} from the data cache unit 4 to the data processing unit 5.
- In the above scheme, the mean square vector RMS_t is updated from the mean square vector RMS_{t-1}, the gradient vector g_t, and the mean square vector update rate δ according to the formula RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t.
- The implementation specifically includes: the controller unit 3 reads a mean square vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the data cache unit 4 to perform the update of the mean square vector RMS_{t-1};
- the mean square vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 − δ); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute (1 − δ)·RMS_{t-1}, g_t ⊙ g_t, and δ·(g_t ⊙ g_t), where, for elements at corresponding positions, the computations are performed in sequence while elements at different positions are computed in parallel; it then sends operation instruction 3 (INS_3) to the vector addition parallel operation sub-module 52, driving it to compute (1 − δ)·RMS_{t-1} + δ·(g_t ⊙ g_t), obtaining the updated mean square vector RMS_t.
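Read as software, the INS_1–INS_3 sequence decomposes this update into one scalar operation, three element-wise products, and one element-wise addition; the sketch below mirrors that decomposition one operation per line (function and variable names are illustrative).

```python
def mean_square_update(rms_prev, grad, delta):
    """Mirror of INS_1..INS_3 on NumPy arrays; each line stands in for one
    sub-module operation."""
    one_minus_delta = 1.0 - delta          # INS_1: basic operation sub-module 56
    decayed = one_minus_delta * rms_prev   # INS_2: (1 - delta) * RMS_{t-1}
    grad_sq = grad * grad                  # INS_2: g_t (element-wise) g_t
    scaled_sq = delta * grad_sq            # INS_2: delta * (g_t * g_t)
    return decayed + scaled_sq             # INS_3: vector addition -> RMS_t
```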
- In the above scheme, after the mean square vector RMS_t has been updated, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated mean square vector RMS_t from the data processing unit 5 to the data cache unit 4.
- In the above scheme, the gradient vector is divided by the square root of the mean square vector and multiplied by the global update step size α to obtain the corresponding gradient descent amount, and the vector θ_{t-1} to be updated is updated to θ_t, according to the formula θ_t = θ_{t-1} − α·g_t / √RMS_t.
- The implementation specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and performs the parameter vector update according to the decoded micro-instruction; in this update, the parameter vector update instruction is sent to the operation control sub-module 51, which controls the relevant operation modules to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute −α and to increment the iteration count t by 1; it sends operation instruction 5 (INS_5) to the vector square root parallel operation sub-module 55, driving it to compute √RMS_t; it sends operation instruction 6 (INS_6) to the vector multiplication parallel operation sub-module 53, driving it to compute −α·g_t; after these two operations complete, it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute −α·g_t / √RMS_t; it then sends operation instruction 8 (INS_8) to the vector addition parallel operation sub-module 52, driving it to compute θ_{t-1} − α·g_t / √RMS_t, obtaining θ_t, where θ_{t-1} is the value of θ_0 before the update in the t-th cycle, and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module 51 sends operation instruction 9 (INS_9) to the vector division parallel operation sub-module 54, driving it to compute a vector temp; and it sends operation instruction 10 (INS_10) and operation instruction 11 (INS_11) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
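The parameter update reads the same way; this sketch follows INS_4 through INS_8 one operation per line, leaving out the convergence-related INS_9–INS_11 (names are illustrative).

```python
import numpy as np

def parameter_update(theta_prev, grad, rms, alpha):
    """Mirror of INS_4..INS_8 on NumPy arrays."""
    neg_alpha = -alpha              # INS_4: basic operation sub-module 56
    root = np.sqrt(rms)             # INS_5: vector square root sub-module 55
    scaled_grad = neg_alpha * grad  # INS_6: vector multiplication sub-module 53
    descent = scaled_grad / root    # INS_7: vector division sub-module 54
    return theta_prev + descent     # INS_8: vector addition -> theta_t
```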
- In the above scheme, after the vector θ_{t-1} to be updated has been updated to θ_t, the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
- In the above scheme, the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged.
- The specific determination proceeds as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2; according to the decoded micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
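The text fixes the convergence test itself (sum = Σ_i temp_i, temp2 = sum/n, converged when temp2 < ct), but the element-wise formula for the temp vector produced by INS_9 is given only as an image that is not reproduced here. The sketch below assumes temp is the per-element magnitude of the relative update, which is one plausible reading of the division performed by sub-module 54.

```python
import numpy as np

def has_converged(theta_new, theta_prev, ct):
    # Assumed form of the INS_9 result: |theta_t - theta_{t-1}| / |theta_t|,
    # element-wise (this formula is an assumption, not from the patent text).
    temp = np.abs(theta_new - theta_prev) / np.abs(theta_new)
    total = temp.sum()          # INS_10: sum = sum_i temp_i
    temp2 = total / temp.size   # INS_11: temp2 = sum / n
    return temp2 < ct
```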
- By adopting a device dedicated to executing the RMSprop gradient descent algorithm, the apparatus and method provided by the present invention solve the problems of insufficient arithmetic performance of general-purpose processors and large front-end decoding overhead, and accelerate the execution of related applications.
- In the apparatus and method provided by the present invention, because the data cache unit temporarily stores the vectors needed by the intermediate process, data need not be repeatedly read from memory; this reduces the IO operations between the device and the external address space, lowers the memory-access bandwidth, and removes the off-chip bandwidth bottleneck.
- In the apparatus and method provided by the present invention, because the data processing module uses the relevant parallel operation sub-modules for vector operations, the degree of parallelism is greatly improved.
- FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.
- FIG. 2 illustrates an example block diagram of a data processing module in an apparatus for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.
- FIG. 3 shows a flow chart of a method for performing an RMSprop gradient descent algorithm in accordance with an embodiment of the present invention.
- An apparatus and method according to an embodiment of the present invention perform the RMSprop gradient descent algorithm, accelerating its application. First, a mean square vector RMS_0 is initialized, and the parameter vector θ_t to be updated and the corresponding gradient vector g_t are obtained from the designated storage unit. Then, in each iteration, the mean square vector is first updated using the previous mean square vector RMS_{t-1}, the gradient vector g_t, and the mean square vector update rate δ, that is, RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t. After that, the gradient vector is divided by the square root of the mean square vector and multiplied by the global update step size α to obtain the corresponding gradient descent amount, and the vector to be updated is updated, that is, θ_t = θ_{t-1} − α·g_t / √RMS_t. The entire process is repeated until the vector to be updated converges.
- The device includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented as hardware circuits.
- The direct memory access unit 1 is configured to access the external designated space, read and write data from and to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space into the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
- The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache the instructions read.
- The controller unit 3 is configured to read instructions from the instruction cache unit 2, decode each instruction into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, and send each micro-instruction to the corresponding unit: it controls the direct memory access unit 1 to read data from the external designated address and write data to the external designated address, controls the data cache unit 4 to obtain the instructions required for an operation from the external designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls data transfer between the data cache unit 4 and the data processing module 5.
- The data cache unit 4 is configured to cache the mean square matrix during initialization and data updates. Specifically, the data cache unit 4 initializes the mean square matrix RMS_t at initialization; during each data update, the mean square matrix RMS_{t-1} is read out into the data processing module 5, updated there to the mean square matrix RMS_t, and then written back into the data cache unit 4.
- A copy of the mean square matrix RMS_t is always kept inside the data cache unit 4 throughout the operation of the device.
- Because the data cache unit temporarily stores the vectors needed by the intermediate process, data need not be repeatedly read from memory, which reduces the IO operations between the device and the external address space and lowers the memory-access bandwidth.
- The data processing module 5 is configured to update the mean square vector and the parameters to be updated, write the updated mean square vector into the data cache unit 4, and write the updated parameters into the external designated space through the direct memory access unit 1.
- Specifically, the data processing module 5 reads the mean square vector RMS_{t-1} from the data cache unit 4, and reads the parameter vector θ_{t-1} to be updated, the gradient vector g_t, the global update step size α, and the mean square vector update rate δ from the external designated space through the direct memory access unit 1.
- It first updates the mean square vector RMS_{t-1} to RMS_t, that is, RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t; it then updates the parameter θ_{t-1} to θ_t using RMS_t, that is, θ_t = θ_{t-1} − α·g_t / √RMS_t; finally, RMS_t is written back into the data cache unit 4, and θ_t is written back to the external designated space through the direct memory access unit 1.
- Because the data processing module performs vector operations using the relevant parallel operation sub-modules, the degree of parallelism is greatly improved; the operating frequency can therefore be low, and the power consumption overhead is small.
- The data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56. The vector addition, vector multiplication, vector division, and vector square root parallel operation sub-modules and the basic operation sub-module are connected in parallel, and the operation control sub-module 51 is connected in series with each of them.
- When the apparatus operates on vectors, all vector operations are element-wise, and when an operation is performed on a vector, the elements at different positions are processed in parallel.
- FIG. 3 shows a flow chart of a method for performing the RMSprop gradient descent algorithm in accordance with an embodiment of the present invention, which includes the following steps:
- Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space.
- Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space and cache them into the instruction cache unit 2;
- Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the mean square vector update rate δ, and the convergence threshold ct from the external space, which are then sent to the data processing module 5;
- Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, initializes the mean square vector RMS_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing unit 5 to 1;
- Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector θ_{t-1} to be updated and the corresponding gradient vector g_t from the external designated space, which are then sent to the data processing module 5;
- Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the mean square vector RMS_{t-1} from the data cache unit 4 to the data processing unit 5.
- Step S7: the controller unit 3 reads a mean square vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the data cache unit 4 to perform the update of the mean square vector RMS_{t-1}.
- In this update, the mean square vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 − δ); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute (1 − δ)·RMS_{t-1}, g_t ⊙ g_t, and δ·(g_t ⊙ g_t), where, for elements at corresponding positions, the computations are performed in sequence while elements at different positions are computed in parallel; it then sends operation instruction 3 (INS_3) to the vector addition parallel operation sub-module 52, driving it to compute (1 − δ)·RMS_{t-1} + δ·(g_t ⊙ g_t), obtaining the updated mean square vector RMS_t.
- Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated mean square vector RMS_t from the data processing unit 5 to the data cache unit 4.
- Step S9: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and performs the parameter vector update according to the decoded micro-instruction.
- In this update, the parameter vector update instruction is sent to the operation control sub-module 51, which controls the relevant operation modules to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute −α and to increment the iteration count t by 1; it sends operation instruction 5 (INS_5) to the vector square root parallel operation sub-module 55, driving it to compute √RMS_t; it sends operation instruction 6 (INS_6) to the vector multiplication parallel operation sub-module 53, driving it to compute −α·g_t; after these two operations complete, it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute −α·g_t / √RMS_t; it then sends operation instruction 8 (INS_8) to the vector addition parallel operation sub-module 52, driving it to compute θ_{t-1} − α·g_t / √RMS_t, obtaining θ_t, where θ_{t-1} is the value of θ_0 before the update in the t-th cycle, and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module 51 sends operation instruction 9 (INS_9) to the vector division parallel operation sub-module 54, driving it to compute a vector temp; and it sends operation instruction 10 (INS_10) and operation instruction 11 (INS_11) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
- Step S10: the controller unit 3 reads a write-back instruction for the quantity to be updated (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
- Step S11: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2; according to the decoded micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 and continues.
- By adopting a device dedicated to executing the RMSprop gradient descent algorithm, the present invention solves the problems of insufficient arithmetic performance of general-purpose processors and large front-end decoding overhead, and accelerates the execution of related applications.
- At the same time, the use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
Description
The present invention relates to the field of RMSprop algorithm application technology, and in particular to an apparatus and method for performing the RMSprop gradient descent algorithm, relating to the hardware implementation of the RMSprop gradient descent optimization algorithm.
Gradient descent optimization algorithms are widely used in the fields of function approximation, optimization computation, pattern recognition, and image processing. The RMSprop algorithm is one of the gradient descent optimization algorithms; it is widely used because it is easy to implement, computationally cheap, requires little storage space, and performs well when processing mini-batch data sets, and implementing the RMSprop algorithm with a dedicated device can significantly increase its execution speed.
Currently, a known method of executing the RMSprop gradient descent algorithm is to use a general-purpose processor. That method supports the algorithm by executing general-purpose instructions through a general-purpose register file and general-purpose functional units. One of the disadvantages of this method is that the arithmetic performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, the general-purpose processor must decode the operations of the RMSprop algorithm into a long sequence of arithmetic and memory-access instructions, and the processor's front-end decoding incurs a large power overhead.
Another known method of executing the RMSprop gradient descent algorithm is to use a graphics processing unit (GPU). That method supports the algorithm by executing general-purpose single-instruction-multiple-data (SIMD) instructions through a general-purpose register file and general-purpose stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computation, with no special support for the operations of the RMSprop gradient descent algorithm, it still requires a large amount of front-end decoding work to execute those operations, which incurs considerable extra overhead. In addition, the GPU has only a small on-chip cache, so intermediate data required while the RMSprop gradient descent algorithm runs, such as the mean square vector, must be repeatedly transferred from off-chip; off-chip bandwidth thus becomes the main performance bottleneck while incurring a large power overhead.
Summary of the invention
In view of this, the main object of the present invention is to provide an apparatus and method for performing the RMSprop gradient descent algorithm, so as to solve the problems of insufficient arithmetic performance of general-purpose processors and large front-end decoding overhead, and to avoid repeatedly reading data from memory, thereby reducing the memory-access bandwidth.
To achieve the above object, the present invention provides an apparatus for performing the RMSprop gradient descent algorithm, the apparatus comprising a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, wherein:

the direct memory access unit 1 is configured to access the external designated space, read and write data from and to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data;

the instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache the instructions read;

the controller unit 3 is configured to read instructions from the instruction cache unit 2 and decode them into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5;

the data cache unit 4 is configured to cache the mean square matrix during initialization and data updates;

the data processing module 5 is configured to update the mean square vector and the parameters to be updated, write the updated mean square vector into the data cache unit 4, and write the updated parameters into the external designated space through the direct memory access unit 1.
In the above scheme, the direct memory access unit 1 writes instructions from the external designated space into the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
In the above scheme, the controller unit 3 decodes the read instruction into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, so as to control the direct memory access unit 1 to read data from the external designated address and write data to the external designated address, control the data cache unit 4 to obtain the instructions required for an operation from the external designated address through the direct memory access unit 1, control the data processing module 5 to perform the update of the parameters to be updated, and control data transfer between the data cache unit 4 and the data processing module 5.
In the above scheme, the data cache unit 4 initializes the mean square matrix RMS_t at initialization; in each data update, the mean square matrix RMS_{t-1} is read out into the data processing module 5, updated there to the mean square matrix RMS_t, and then written back into the data cache unit 4. Throughout the operation of the device, a copy of the mean square matrix RMS_t is always kept inside the data cache unit 4.
In the above scheme, the data processing module 5 reads the mean square vector RMS_{t-1} from the data cache unit 4, reads the parameter vector θ_{t-1} to be updated, the gradient vector g_t, the global update step size α, and the mean square vector update rate δ from the external designated space through the direct memory access unit 1, updates the mean square vector RMS_{t-1} to RMS_t, updates the parameter θ_{t-1} to θ_t using RMS_t, writes RMS_t back into the data cache unit 4, and writes θ_t back to the external designated space through the direct memory access unit 1.
In the above scheme, the data processing module 5 updates the mean square vector RMS_{t-1} to RMS_t according to the formula RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t, and updates the vector θ_{t-1} to be updated to θ_t according to the formula θ_t = θ_{t-1} − α·g_t / √RMS_t, with all operations performed element-wise.
In the above scheme, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56, wherein the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56 are connected in parallel, and the operation control sub-module 51 is connected in series with each of them.
In the above scheme, when the apparatus operates on vectors, all vector operations are element-wise, and when an operation is performed on a vector, the elements at different positions are processed in parallel.
To achieve the above object, the present invention also provides a method for performing the RMSprop gradient descent algorithm, the method comprising:
initializing a mean square vector RMS_0, and obtaining the parameter vector θ_t to be updated and the corresponding gradient vector g_t from a designated storage unit;
when performing the gradient descent operation, first updating the mean square vector RMS_t using the mean square vector RMS_{t-1}, the gradient vector g_t, and the mean square vector update rate δ, then dividing the gradient vector by the square root of the mean square vector and multiplying by the global update step size α to obtain the corresponding gradient descent amount, updating the vector θ_{t-1} to be updated to θ_t, and outputting it; and repeating this process until the vector to be updated converges.
In the above scheme, initializing a mean square vector RMS_0 and obtaining the parameter vector θ_t to be updated and the corresponding gradient vector g_t from the designated storage unit includes:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space and cache them into the instruction cache unit 2;
Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the mean square vector update rate δ, and the convergence threshold ct from the external space, which are then sent to the data processing module 5;
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, initializes the mean square vector RMS_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing unit 5 to 1;
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector θ_{t-1} to be updated and the corresponding gradient vector g_t from the external designated space, which are then sent to the data processing module 5;
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the mean square vector RMS_{t-1} from the data cache unit 4 to the data processing unit 5.
In the above scheme, updating the mean square vector RMS_t using the mean square vector RMS_{t-1}, the gradient vector g_t, and the mean square vector update rate δ is performed according to the formula RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t, and specifically includes: the controller unit 3 reads a mean square vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the data cache unit 4 to perform the update of the mean square vector RMS_{t-1}; in this update, the mean square vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 − δ); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute (1 − δ)·RMS_{t-1}, g_t ⊙ g_t, and δ·(g_t ⊙ g_t), where, for elements at corresponding positions, the computations are performed in sequence while elements at different positions are computed in parallel; it then sends operation instruction 3 (INS_3) to the vector addition parallel operation sub-module 52, driving it to compute (1 − δ)·RMS_{t-1} + δ·(g_t ⊙ g_t), obtaining the updated mean square vector RMS_t.
In the above scheme, after the mean square vector RMS_t has been updated using the mean square vector RMS_{t-1}, the gradient vector g_t, and the mean square vector update rate δ, the method further includes: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated mean square vector RMS_t from the data processing unit 5 to the data cache unit 4.
In the above scheme, dividing the gradient vector by the square root of the mean square vector and multiplying by the global update step size α to obtain the corresponding gradient descent amount, and updating the vector θ_{t-1} to be updated to θ_t, is performed according to the formula θ_t = θ_{t-1} − α·g_t / √RMS_t, and specifically includes: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and performs the parameter vector update according to the decoded micro-instruction; in this update, the parameter vector update instruction is sent to the operation control sub-module 51, which controls the relevant operation modules to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute −α and to increment the iteration count t by 1; it sends operation instruction 5 (INS_5) to the vector square root parallel operation sub-module 55, driving it to compute √RMS_t; it sends operation instruction 6 (INS_6) to the vector multiplication parallel operation sub-module 53, driving it to compute −α·g_t; after these two operations complete, it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute −α·g_t / √RMS_t; it then sends operation instruction 8 (INS_8) to the vector addition parallel operation sub-module 52, driving it to compute θ_{t-1} − α·g_t / √RMS_t, obtaining θ_t, where θ_{t-1} is the value of θ_0 before the update in the t-th cycle, and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module 51 sends operation instruction 9 (INS_9) to the vector division parallel operation sub-module 54, driving it to compute a vector temp; and the operation control sub-module 51 sends operation instruction 10 (INS_10) and operation instruction 11 (INS_11) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
In the above scheme, after the vector θ_{t-1} to be updated has been updated to θ_t, the method further includes: the controller unit 3 reads a DATABACK_IO instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
In the above scheme, the step of repeating this process until the vector to be updated converges includes determining whether the vector to be updated has converged, as follows: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2; according to the decoded micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends.
It can be seen from the above technical solutions that the present invention has the following beneficial effects:
1. By adopting a device dedicated to executing the RMSprop gradient descent algorithm, the apparatus and method provided by the present invention solve the problems of insufficient arithmetic performance of general-purpose processors and large front-end decoding overhead, and accelerate the execution of related applications.

2. In the apparatus and method provided by the present invention, because the data cache unit temporarily stores the vectors needed by the intermediate process, data need not be repeatedly read from memory; this reduces the IO operations between the device and the external address space, lowers the memory-access bandwidth, and removes the off-chip bandwidth bottleneck.

3. In the apparatus and method provided by the present invention, because the data processing module uses the relevant parallel operation sub-modules for vector operations, the degree of parallelism is greatly improved.

4. In the apparatus and method provided by the present invention, because the data processing module uses the relevant parallel operation sub-modules for vector operations and the parallelism of the operations is high, the operating frequency can be low, so the power consumption overhead is small.
For a more complete understanding of the present invention and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 shows an example block diagram of the overall structure of an apparatus for performing the RMSprop gradient descent algorithm according to an embodiment of the present invention.

FIG. 2 shows an example block diagram of the data processing module in an apparatus for performing the RMSprop gradient descent algorithm according to an embodiment of the present invention.

FIG. 3 shows a flow chart of a method for performing the RMSprop gradient descent algorithm according to an embodiment of the present invention.

Throughout the drawings, the same devices, components, units, and the like are denoted by the same reference numerals.
Other aspects, advantages, and salient features of the present invention will become apparent to those skilled in the art from the following detailed description of exemplary embodiments of the present invention, taken in conjunction with the accompanying drawings.
In the present invention, the terms "include" and "comprise" and their derivatives are intended to be inclusive rather than limiting; the term "or" is inclusive, meaning and/or.
In this specification, the various embodiments described below for explaining the principles of the present invention are merely illustrative and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist a full understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Furthermore, the same reference numerals are used for similar functions and operations throughout the drawings.
An apparatus and method according to an embodiment of the present invention execute the RMSprop gradient descent algorithm, accelerating its application. First, a mean square vector RMS_0 is initialized, and the parameter vector θ_t to be updated and the corresponding gradient vector g_t are obtained from the designated storage unit. Then, in each iteration, the mean square vector is first updated using the previous mean square vector RMS_{t-1}, the gradient vector g_t, and the mean square vector update rate δ, that is, RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t. After that, the gradient vector is divided by the square root of the mean square vector and multiplied by the global update step size α to obtain the corresponding gradient descent amount, and the vector to be updated is updated, that is, θ_t = θ_{t-1} − α·g_t / √RMS_t. The entire process is repeated until the vector to be updated converges.
FIG. 1 shows an example block diagram of the overall structure of an apparatus for implementing the RMSprop gradient descent algorithm according to an embodiment of the present invention. As shown in FIG. 1, the apparatus includes a direct memory access unit 1, an instruction cache unit 2, a controller unit 3, a data cache unit 4, and a data processing module 5, all of which can be implemented as hardware circuits.

The direct memory access unit 1 is configured to access the external designated space, read and write data from and to the instruction cache unit 2 and the data processing module 5, and complete the loading and storing of data. Specifically, it writes instructions from the external designated space into the instruction cache unit 2, reads the parameters to be updated and the corresponding gradient values from the external designated space into the data processing module 5, and writes the updated parameter vector from the data processing module 5 directly to the external designated space.
The instruction cache unit 2 is configured to read instructions through the direct memory access unit 1 and cache the instructions read.
The controller unit 3 is configured to read instructions from the instruction cache unit 2, decode each instruction into micro-instructions that control the behavior of the direct memory access unit 1, the data cache unit 4, or the data processing module 5, and send each micro-instruction to the corresponding unit; it controls the direct memory access unit 1 to read data from the external designated address and write data to the external designated address, controls the data cache unit 4 to obtain the instructions required for an operation from the external designated address through the direct memory access unit 1, controls the data processing module 5 to perform the update of the parameters to be updated, and controls data transfer between the data cache unit 4 and the data processing module 5.
The data cache unit 4 is configured to cache the mean square matrix during initialization and data updates. Specifically, the data cache unit 4 initializes the mean square matrix RMS_t at initialization; in each data update, the mean square matrix RMS_{t-1} is read out into the data processing module 5, updated there to the mean square matrix RMS_t, and then written back into the data cache unit 4. A copy of the mean square matrix RMS_t is always kept inside the data cache unit 4 throughout the operation of the device. In the present invention, because the data cache unit temporarily stores the vectors needed by the intermediate process, data need not be repeatedly read from memory, which reduces the IO operations between the device and the external address space and lowers the memory-access bandwidth.
The data processing module 5 is configured to update the mean square vector and the parameters to be updated, write the updated mean square vector into the data cache unit 4, and write the updated parameters into the external designated space through the direct memory access unit 1. Specifically, the data processing module 5 reads the mean square vector RMS_{t-1} from the data cache unit 4, and reads the parameter vector θ_{t-1} to be updated, the gradient vector g_t, the global update step size α, and the mean square vector update rate δ from the external designated space through the direct memory access unit 1. It first updates the mean square vector RMS_{t-1} to RMS_t, that is, RMS_t = (1 − δ)·RMS_{t-1} + δ·g_t ⊙ g_t; it then updates the parameter θ_{t-1} to θ_t using RMS_t, that is, θ_t = θ_{t-1} − α·g_t / √RMS_t; finally, it writes RMS_t back into the data cache unit 4 and writes θ_t back to the external designated space through the direct memory access unit 1. In the present invention, because the data processing module uses the relevant parallel operation sub-modules for vector operations, the degree of parallelism is greatly improved; the operating frequency can therefore be low, which in turn keeps the power consumption overhead small.
FIG. 2 shows an example block diagram of the data processing module in an apparatus for implementing RMSprop gradient descent algorithm related applications according to an embodiment of the present invention. As shown in FIG. 2, the data processing module 5 includes an operation control sub-module 51, a vector addition parallel operation sub-module 52, a vector multiplication parallel operation sub-module 53, a vector division parallel operation sub-module 54, a vector square root parallel operation sub-module 55, and a basic operation sub-module 56, wherein the vector addition parallel operation sub-module 52, the vector multiplication parallel operation sub-module 53, the vector division parallel operation sub-module 54, the vector square root parallel operation sub-module 55, and the basic operation sub-module 56 are connected in parallel, and the operation control sub-module 51 is connected in series with each of them. When the apparatus operates on vectors, all vector operations are element-wise, and when an operation is performed on a vector, the elements at different positions are processed in parallel.
FIG. 3 shows a flow chart of a method for performing the RMSprop gradient descent algorithm according to an embodiment of the present invention, which specifically includes the following steps:
Step S1: an instruction prefetch instruction (INSTRUCTION_IO) is pre-stored at the first address of the instruction cache unit 2; the INSTRUCTION_IO instruction is used to drive the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space.
Step S2: the operation starts; the controller unit 3 reads the INSTRUCTION_IO instruction from the first address of the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read all instructions related to the RMSprop gradient descent computation from the external address space and cache them into the instruction cache unit 2;
Step S3: the controller unit 3 reads a hyperparameter read instruction (HYPERPARAMETER_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the global update step size α, the mean square vector update rate δ, and the convergence threshold ct from the external space, which are then sent to the data processing module 5;
Step S4: the controller unit 3 reads an assignment instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, initializes the mean square vector RMS_{t-1} in the data cache unit 4 and sets the iteration count t in the data processing unit 5 to 1;
Step S5: the controller unit 3 reads a parameter read instruction (DATA_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the direct memory access unit 1 to read the parameter vector θ_{t-1} to be updated and the corresponding gradient vector g_t from the external designated space, which are then sent to the data processing module 5;
Step S6: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the mean square vector RMS_{t-1} from the data cache unit 4 to the data processing unit 5.
Step S7: the controller unit 3 reads a mean square vector update instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, drives the data cache unit 4 to perform the update of the mean square vector RMS_{t-1}. In this update, the mean square vector update instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following operations: it sends operation instruction 1 (INS_1) to the basic operation sub-module 56, driving it to compute (1 − δ); it sends operation instruction 2 (INS_2) to the vector multiplication parallel operation sub-module 53, driving it to compute (1 − δ)·RMS_{t-1}, g_t ⊙ g_t, and δ·(g_t ⊙ g_t), where, for elements at corresponding positions, the computations are performed in sequence while elements at different positions are computed in parallel; it then sends operation instruction 3 (INS_3) to the vector addition parallel operation sub-module 52, driving it to compute (1 − δ)·RMS_{t-1} + δ·(g_t ⊙ g_t), obtaining the updated mean square vector RMS_t.
Step S8: the controller unit 3 reads a data transfer instruction from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated mean square vector RMS_t from the data processing unit 5 to the data cache unit 4.
Step S9: the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2 and performs the parameter vector update according to the decoded micro-instruction. In this update, the parameter vector update instruction is sent to the operation control sub-module 51, which controls the relevant operation modules to perform the following operations: it sends operation instruction 4 (INS_4) to the basic operation sub-module 56, driving it to compute −α and to increment the iteration count t by 1; it sends operation instruction 5 (INS_5) to the vector square root parallel operation sub-module 55, driving it to compute √RMS_t; it sends operation instruction 6 (INS_6) to the vector multiplication parallel operation sub-module 53, driving it to compute −α·g_t; after these two operations complete, it sends operation instruction 7 (INS_7) to the vector division parallel operation sub-module 54, driving it to compute −α·g_t / √RMS_t; it then sends operation instruction 8 (INS_8) to the vector addition parallel operation sub-module 52, driving it to compute θ_{t-1} − α·g_t / √RMS_t, obtaining θ_t, where θ_{t-1} is the value of θ_0 before the update in the t-th cycle, and the t-th cycle updates θ_{t-1} to θ_t; the operation control sub-module 51 sends operation instruction 9 (INS_9) to the vector division parallel operation sub-module 54, driving it to compute a vector temp; and the operation control sub-module 51 sends operation instruction 10 (INS_10) and operation instruction 11 (INS_11) to the vector addition parallel operation sub-module 52 and the basic operation sub-module 56 respectively, computing sum = Σ_i temp_i and temp2 = sum/n.
Step S10: the controller unit 3 reads a write-back instruction for the quantity to be updated (DATABACK_IO) from the instruction cache unit 2 and, according to the decoded micro-instruction, transfers the updated parameter vector θ_t from the data processing unit 5 to the external designated space through the direct memory access unit 1.
Step S11: the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2; according to the decoded micro-instruction, the data processing module 5 determines whether the updated parameter vector has converged; if temp2 < ct, it has converged and the operation ends; otherwise, the flow returns to step S5 and continues execution.
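Putting steps S1–S11 together, the device's iteration loop behaves like the sketch below, with the instruction fetching and DMA traffic elided. The gradient_of callback, the eps guard, the loop cap, and the form of the temp vector are illustrative assumptions rather than parts of the patent.

```python
import numpy as np

def rmsprop_device_loop(theta, gradient_of, alpha, delta, ct,
                        eps=1e-8, max_iter=100000):
    rms = np.zeros_like(theta)                            # S4: init RMS_{t-1}
    for t in range(1, max_iter + 1):                      # S4: t starts at 1
        grad = gradient_of(theta)                         # S5: fetch theta, g_t
        rms = (1 - delta) * rms + delta * grad * grad     # S7: update to RMS_t
        theta_new = theta - alpha * grad / np.sqrt(rms + eps)  # S9: INS_4..INS_8
        temp = np.abs(theta_new - theta) / np.abs(theta_new)   # assumed INS_9
        theta = theta_new                                 # S10: write theta_t back
        if temp.sum() / temp.size < ct:                   # S11: temp2 < ct
            break                                         # converged: done
        # not converged: return to S5 and continue
    return theta, t
```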
By adopting a device dedicated to executing the RMSprop gradient descent algorithm, the present invention solves the problems of insufficient arithmetic performance of general-purpose processors and large front-end decoding overhead, and accelerates the execution of related applications. At the same time, the use of the data cache unit avoids repeatedly reading data from memory and reduces the memory-access bandwidth.
The processes or methods depicted in the preceding figures may be performed by processing logic comprising hardware (e.g., circuitry, dedicated logic, etc.), firmware, software (e.g., software embodied on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods are described above in a certain order, it should be understood that some of the operations described may be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It is evident that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
Claims (16)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2016/080354 WO2017185256A1 (en) | 2016-04-27 | 2016-04-27 | Rmsprop gradient descent algorithm execution apparatus and method |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2016/080354 WO2017185256A1 (en) | 2016-04-27 | 2016-04-27 | Rmsprop gradient descent algorithm execution apparatus and method |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2017185256A1 true WO2017185256A1 (en) | 2017-11-02 |
Family ID: 60161731
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2016/080354 Ceased WO2017185256A1 (en) | 2016-04-27 | 2016-04-27 | Rmsprop gradient descent algorithm execution apparatus and method |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2017185256A1 (en) |
2016
- 2016-04-27 WO PCT/CN2016/080354 patent/WO2017185256A1/en not_active Ceased
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2016037351A1 (en) * | 2014-09-12 | 2016-03-17 | Microsoft Corporation | Computing system for training neural networks |
| CN105512723A (en) * | 2016-01-20 | 2016-04-20 | 南京艾溪信息科技有限公司 | An artificial neural network computing device and method for sparse connections |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113255270A (en) * | 2021-05-14 | 2021-08-13 | 西安交通大学 | Jacobian template calculation acceleration method, system, medium and storage device |
| CN113255270B (en) * | 2021-05-14 | 2024-04-02 | 西安交通大学 | Jacobian template calculation acceleration method, system, medium and storage device |
| CN114461579A (en) * | 2021-12-13 | 2022-05-10 | 杭州加速科技有限公司 | Processing method and system for parallel reading and dynamic scheduling of Pattern file and ATE (automatic test equipment) |
| CN114611809A (en) * | 2022-03-18 | 2022-06-10 | 北京工业大学 | Robot constant-force grinding optimization method, system, equipment and medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109522254B (en) | Arithmetic device and method | |
| WO2017185257A1 (en) | Device and method for performing adam gradient descent training algorithm | |
| CN111353588B (en) | Apparatus and method for performing inverse training of artificial neural networks | |
| CN106991477B (en) | Artificial neural network compression coding device and method | |
| CN111310904B (en) | A device and method for performing convolutional neural network training | |
| WO2017124642A1 (en) | Device and method for executing forward calculation of artificial neural network | |
| WO2017185411A1 (en) | Apparatus and method for executing adagrad gradient descent training algorithm | |
| CN109062608B (en) | Vectorized read and write mask update instructions for recursive computation on independent data | |
| JP6340097B2 (en) | Vector move command controlled by read mask and write mask | |
| CN113537481B (en) | Apparatus and method for performing LSTM neural network operation | |
| WO2017124648A1 (en) | Vector computing device | |
| WO2017185389A1 (en) | Device and method for use in executing matrix multiplication operations | |
| WO2018120016A1 (en) | Apparatus for executing lstm neural network operation, and operational method | |
| WO2018107476A1 (en) | Memory access device, computing device and device applied to convolutional neural network computation | |
| WO2017124647A1 (en) | Matrix calculation apparatus | |
| WO2017185393A1 (en) | Apparatus and method for executing inner product operation of vectors | |
| KR20230109791A (en) | Packed data alignment plus compute instructions, processors, methods, and systems | |
| CN107315570B (en) | Apparatus and method for executing Adam gradient descent training algorithm | |
| CN107341540B (en) | An apparatus and method for executing a Hessian-Free training algorithm | |
| WO2017185392A1 (en) | Device and method for performing four fundamental operations of arithmetic of vectors | |
| WO2017185256A1 (en) | Rmsprop gradient descent algorithm execution apparatus and method | |
| WO2017185404A1 (en) | Apparatus and method for performing vector logical operation | |
| WO2017185419A1 (en) | Apparatus and method for executing operations of maximum value and minimum value of vectors | |
| WO2017181336A1 (en) | Maxout layer operation apparatus and method | |
| CN107315569B (en) | An apparatus and method for performing RMSprop gradient descent algorithm |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 16899769; Country of ref document: EP; Kind code of ref document: A1 |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 16899769; Country of ref document: EP; Kind code of ref document: A1 |