WO2025105520A1 - Lightweight deep learning model quantization method - Google Patents
- Publication number
- WO2025105520A1 (application PCT/KR2023/018255)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- deep learning
- learning model
- quantization
- maximum values
- average
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- the present invention relates to a method for quantizing a deep learning model, and more specifically, to a method for updating a quantization scale for a hardware accelerator-based deep learning model installed in a mobile device.
- the present invention has been made to solve the above problems, and an object of the present invention is to provide a lightweight deep learning model quantization method that predicts and updates the quantization scale in the next learning step based on the parameter distribution in the previous learning step, as a quantization method applicable to a hardware accelerator model in a mobile device, which is a resource-limited environment.
- a method for quantizing a deep learning model includes: a step of storing and accumulating maximum values of deep learning operations during training of a deep learning model; a step of calculating an average of the accumulated maximum values; a step of updating a quantization scale with the calculated average; and a step of quantizing a deep learning model based on the updated quantization scale.
- the accumulation step may be to store the maximum values for each batch and accumulate them.
- the calculation step may be to calculate the average of the maximum values stored per epoch.
- the deep learning model quantization method according to the present invention may further include a step of performing learning of the next epoch on the quantized deep learning model.
- Deep learning operations can be convolution operations.
- the update step may update the quantization scale without knowing the distribution of the output feature map by the convolution operation.
- Quantization can be symmetric quantization.
- the accumulation step may be to store the maximum of the absolute values of the deep learning operation values.
- Deep learning models can be deployed on mobile devices.
- a deep learning operation device characterized by including: an operation unit that stores and accumulates maximum values of deep learning operations during training of a deep learning model, calculates an average of the accumulated maximum values, updates a quantization scale with the calculated average, and quantizes a deep learning model based on the updated quantization scale; and a memory that provides storage space required for the operation unit.
- a method for quantizing a deep learning model characterized by including the steps of: updating a quantization scale with an average of maximum values of deep learning operations during training of a deep learning model; quantizing the deep learning model based on the updated quantization scale; and training the quantized deep learning model.
- a deep learning operation device characterized by including: an operation unit that updates a quantization scale by an average of maximum values of deep learning operations during training of a deep learning model, quantizes the deep learning model based on the updated quantization scale, and trains the quantized deep learning model; and a memory that provides storage space required for the operation unit.
- Figure 1. Example of problems occurring in the hardware structure when a software quantization learning model is applied.
- Figures 4-5. Comparison of maximum and average value distributions when the current epoch's learning scale is applied based on the previous distribution values.
- Figures 6-9. Example of the scale update process by epoch and batch unit during the convolution operation process.
- Quantization technology for real-time operation is being widely used in various fields of object recognition related to mobile devices through cameras.
- lightweighting is a major challenge in implementing learning models in environments with limited memory, such as NPU hardware accelerator design.
- Figure 1 is a diagram illustrating the general process of quantization learning in software and the problems that arise when the method is applied as is to a hardware accelerator.
- the quantization scale of the input feature map before convolution and the quantization scale of the output feature map after convolution have different values because the intermediate operation changes the value distribution.
- a real-time value analysis and comparison process is required to quantize the parameters based on the accurate value distribution.
- Fig. 2 is a diagram illustrating a model update method based on a quantization scale calculation method applicable to an embodiment of the present invention.
- a symmetric quantization structure is applied as the quantization structure.
- in an asymmetric quantization structure, negative and positive values can be checked separately to minimize the value loss that occurs during quantization, but a separate calculation is added for the reference intermediate value, so it is not suitable for hardware design.
- the maximum absolute value of the values output after convolution is stored in a buffer.
- the maximum of the absolute values of the convolution outputs is updated and maintained while each batch is trained, and at the end of every batch the maximum is added to a cumulative sum in a separate buffer.
- at the point of each unit epoch, once the dataset has been trained over once, the value stored as a cumulative sum is converted to an average (the cumulative sum is divided by the number of batches) and applied as the quantization scale value for the next epoch.
- the same applies to the filter values, whose quantization scale is updated on an epoch basis.
- FIG. 3 is a diagram illustrating a quantization scale update method according to one embodiment of the present invention.
- the maximum value of the absolute value of the convolution output value is selected and stored for each batch. Since this is performed for each batch, the maximum values for each batch are accumulated. When training is completed for one epoch, the accumulated maximum values are divided by the number of batches to calculate the average of the maximum values.
- next, the quantization scale is updated with the calculated average of the maximum values, the weights and activation functions of the deep learning model are quantized based on the updated quantization scale, and learning of the next epoch is performed on the quantized deep learning model.
- quantization scale update is possible without understanding the distribution of the output feature map by convolution.
- FIG. 4 and FIG. 5 compare the distribution of the actual values with the distribution of the quantization scale values when the scale is updated according to the method proposed in the embodiment of the present invention. The scale value based on the average of the accumulated per-batch maxima (Mean: red, below) fits the parameter distribution of the current epoch's training even though the feature map distribution is not identified in real time. In contrast, a scale updated with only the simple maximum value (Prev. Xpoch Abs Max: blue, above) is updated based on an excessively large value and can cause performance degradation during training.
- the quantization scale update through the average of the per-batch maximum values cannot operate in real time the way a software implementation can, but it checks an average over many image data samples and, at the same time, gives a rough picture of how the distribution changes with the overall training tendency.
- it has the advantage that, because the step of identifying distribution parameters in real time is eliminated, a hardware accelerator implementation becomes possible.
- Figure 6-9 shows a more detailed process of a quantization scale update method according to an embodiment of the present invention for a convolution operation process.
- with limited memory, as in a mobile device, the batch size that can be trained at one time, i.e., the number of image data items, is necessarily small.
- FIG. 10 is a diagram illustrating a configuration of a mobile deep learning computing device according to another embodiment of the present invention.
- the mobile deep learning computing device according to the embodiment of the present invention is configured to include a communication interface (110), a deep learning operator (120), and a memory (130).
- the communication interface (110) communicates with an external host system to receive a dataset and parameters of a pre-trained deep learning model.
- the deep learning operator (120) quantizes and trains the loaded deep learning model according to the method presented in FIG. 3 described above.
- the memory (130) provides the storage space required for the deep learning operator (120) to perform the operation.
- a method for quantization training of a hardware accelerator model in a mobile device, which is a resource-limited environment, is presented, in which the average of the per-batch maxima of the absolute values of the deep learning operations in the previous epoch is predicted and applied as the quantization scale.
- the technical idea of the present invention can be applied to a computer-readable recording medium storing a computer program that performs the functions of the device and method according to the present embodiment.
- the technical idea according to various embodiments of the present invention can be implemented in the form of a computer-readable code recorded on a computer-readable recording medium.
- the computer-readable recording medium can be any data storage device that can be read by a computer and store data.
- the computer-readable recording medium can be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical disk, a hard disk drive, etc.
- the computer-readable code or program stored on the computer-readable recording medium can be transmitted through a network connected between computers.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Databases & Information Systems (AREA)
- Algebra (AREA)
- Neurology (AREA)
- Image Analysis (AREA)
Abstract
Description
The present invention relates to a method for quantizing a deep learning model, and more specifically, to a method for updating a quantization scale for a hardware accelerator-based deep learning model installed in a mobile device.
In conventional computer vision, most quantization-based object classification and object detection learning models are software-based. The core of that quantization technology, scale adjustment through real-time comparison of parameter values, is not easy to apply directly in a hardware accelerator-based design environment because it adds a step of re-checking the output parameters of each layer.
In addition, applying it to mobile devices requires a design that accounts for a small learning model and the correspondingly constrained storage memory, but design technology that satisfies these constraints is lacking. As a result, in the absence of an adaptive quantization technique suited to this environment, performance fluctuates depending on the model structure or the type of training data.
The present invention has been made to solve the above problems, and an object of the present invention is to provide a lightweight deep learning model quantization method that predicts and updates the quantization scale for the next learning step based on the parameter distribution in the previous learning step, as a quantization method applicable to a hardware accelerator model in a mobile device, which is a resource-limited environment.
According to one embodiment of the present invention for achieving the above purpose, a method for quantizing a deep learning model includes: a step of storing and accumulating maximum values of deep learning operations during training of a deep learning model; a step of calculating an average of the accumulated maximum values; a step of updating a quantization scale with the calculated average; and a step of quantizing the deep learning model based on the updated quantization scale.
The accumulation step may be to store the maximum values for each batch and accumulate them.
The calculation step may be to calculate the average of the maximum values stored per epoch.
The deep learning model quantization method according to the present invention may further include a step of performing learning of the next epoch on the quantized deep learning model.
The deep learning operation may be a convolution operation.
The update step may update the quantization scale without identifying the distribution of the output feature map produced by the convolution operation.
The quantization may be symmetric quantization.
The accumulation step may be to store the maximum of the absolute values of the deep learning operation values.
The deep learning model may be deployed on a mobile device.
According to another aspect of the present invention, a deep learning operation device is provided, characterized by including: an operation unit that stores and accumulates maximum values of deep learning operations during training of a deep learning model, calculates an average of the accumulated maximum values, updates a quantization scale with the calculated average, and quantizes the deep learning model based on the updated quantization scale; and a memory that provides the storage space required by the operation unit.
According to another aspect of the present invention, a method for quantizing a deep learning model is provided, characterized by including the steps of: updating a quantization scale with an average of maximum values of deep learning operations during training of a deep learning model; quantizing the deep learning model based on the updated quantization scale; and training the quantized deep learning model.
According to another aspect of the present invention, a deep learning operation device is provided, characterized by including: an operation unit that updates a quantization scale with an average of maximum values of deep learning operations during training of a deep learning model, quantizes the deep learning model based on the updated quantization scale, and trains the quantized deep learning model; and a memory that provides the storage space required by the operation unit.
As described above, according to embodiments of the present invention, predicting and updating the quantization scale for the next learning step based on the parameter distribution in the previous learning step makes fast, high-performance training possible when quantization-training a hardware accelerator model on a mobile device, which is a resource-constrained environment.
FIG. 1. Example of problems occurring in the hardware structure when a software quantization learning model is applied.
FIG. 2. Example of a learning process based on the quantization scale update technique.
FIG. 3. Learning method based on quantization scale updating.
FIGS. 4-5. Comparison of maximum and average value distributions when the current epoch's learning scale is applied based on the previous distribution values.
FIGS. 6-9. Example of the scale update process by epoch and batch unit during the convolution operation process.
FIG. 10. Mobile deep learning computing device.
Hereinafter, the present invention will be described in more detail with reference to the drawings.
Quantization technology for real-time operation is widely used in camera-based object recognition for mobile devices. In particular, making the model lightweight is a major challenge when implementing learning models in memory-limited environments such as NPU hardware accelerator designs.
However, most quantized models implemented on hardware are deployed only after training and quantization have already been completed, and implementing a quantization learning model suited to the hardware structure runs into several problems if the software quantization learning method is applied as is.
FIG. 1 illustrates the general process of quantization learning in software and the problems that arise when that method is applied as is to a hardware accelerator.
When the convolution operation part of a CNN model is quantized, the quantization scale of the input feature map before convolution and the quantization scale of the output feature map after convolution end up with different values because the intermediate operation changes the value distribution. For the output feature map scale, a real-time value analysis and comparison process is required to quantize the parameters based on an accurate value distribution.
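As a rough illustration of why the two scales diverge (the numbers below are arbitrary and chosen only for exposition, not taken from the publication), even a toy one-dimensional convolution produces outputs well outside the input range:

```python
import numpy as np

# Input feature map values stay within [-1, 1], but the filter weights
# amplify them, so a scale fitted to the input no longer fits the output.
x = np.array([0.9, -0.8, 0.7, -0.95])   # input feature map
w = np.array([1.5, -2.0, 0.5])           # filter weights
y = np.convolve(x, w, mode="valid")      # convolution output

print(np.abs(x).max())   # 0.95  -> input-side scale basis
print(np.abs(y).max())   # ~3.2  -> output-side scale basis is much larger
```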
However, if the output feature map values are analyzed in real time in hardware, every value produced by the convolution operation must be checked, so during the comparison the feature map values are written to memory one more time. This unconditionally consumes an additional cycle, which can be a fatal problem in hardware accelerator design.
Therefore, to construct a learning model that can be implemented in a hardware accelerator, the scale value for the output feature map must be available in advance. FIG. 2 illustrates a model update method based on a quantization scale calculation method applicable to an embodiment of the present invention. A symmetric quantization structure is applied. With an asymmetric quantization structure, negative and positive values can be checked separately to minimize the value loss that occurs during quantization, but a separate calculation is added for the reference intermediate value, so it is not suitable for hardware design.
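A minimal sketch of symmetric quantization as it could be applied here, assuming signed 8-bit integers; the function names and the bit width are illustrative assumptions rather than details taken from the publication:

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, max_abs: float, n_bits: int = 8):
    """Symmetric quantization: a single scale, zero-point fixed at 0,
    so no extra mid-point computation is needed in hardware."""
    q_max = 2 ** (n_bits - 1) - 1                # 127 for signed 8-bit
    scale = max(max_abs, 1e-12) / q_max          # real value represented by one integer step
    q = np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```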
To minimize the overall performance degradation of the model due to quantization, quantization and its associated computation are applied to the convolution process, where most of the operations take place, and the maximum absolute value of the values output after convolution is stored in a buffer.
The maximum of the absolute values of the convolution outputs is updated and maintained while each batch is trained, and at the end of every batch the maximum is added to a cumulative sum in a separate buffer. At the point of each unit epoch, once the dataset has been trained over once, the value stored as a cumulative sum is converted to an average (the cumulative sum is divided by the number of batches) and applied as the quantization scale value for the next epoch. The filter values are handled in the same way, with their quantization scale updated on an epoch basis.
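A minimal sketch of the buffer bookkeeping described in the preceding paragraph, written as plain Python; the class and method names are assumptions made for illustration:

```python
class MaxScaleTracker:
    """Tracks the per-batch maximum |conv output| and, at each epoch
    boundary, averages the accumulated maxima to form the value used
    as the next epoch's quantization scale basis."""

    def __init__(self, initial_max: float = 1.0):
        self.scale_max = initial_max  # max-abs basis used while the current epoch trains
        self.batch_max = 0.0          # running max |value| within the current batch
        self.accum = 0.0              # cumulative sum of per-batch maxima
        self.num_batches = 0

    def observe(self, conv_output) -> None:
        # Updated and maintained while a batch is being processed.
        self.batch_max = max(self.batch_max, float(abs(conv_output).max()))

    def end_batch(self) -> None:
        # At the end of each batch, fold the batch maximum into the cumulative sum.
        self.accum += self.batch_max
        self.num_batches += 1
        self.batch_max = 0.0

    def end_epoch(self) -> float:
        # Average of the accumulated maxima becomes the next epoch's scale basis.
        self.scale_max = self.accum / max(self.num_batches, 1)
        self.accum, self.num_batches = 0.0, 0
        return self.scale_max
```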
FIG. 3 is a diagram illustrating a quantization scale update method according to one embodiment of the present invention.
As shown, during training of the deep learning model, the maximum of the absolute values of the convolution outputs is selected and stored for each batch. Since this is done per batch, the per-batch maxima accumulate. When training for one epoch is completed, the accumulated maxima are divided by the number of batches to compute the average of the maximum values.
Next, the quantization scale is updated with the calculated average of the maxima, the weights and activation functions of the deep learning model are quantized based on the updated quantization scale, and training of the next epoch is performed on the quantized deep learning model.
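At a high level, the per-epoch flow of FIG. 3 might then look as follows; `model.forward`, `model.backward_and_step`, and `model.requantize` are hypothetical placeholders for the accelerator's training and re-quantization steps, not an API defined in the publication:

```python
def quantized_training(model, data_loader, num_epochs: int, tracker: MaxScaleTracker):
    for epoch in range(num_epochs):
        for batch in data_loader:
            conv_out = model.forward(batch)      # forward pass on the already-quantized model
            tracker.observe(conv_out)            # keep the running max |conv output|
            model.backward_and_step(batch)       # weight update for this batch
            tracker.end_batch()                  # accumulate this batch's maximum
        # Epoch boundary: average of per-batch maxima -> updated scale basis,
        # then re-quantize weights/activations before the next epoch starts.
        new_max = tracker.end_epoch()
        model.requantize(max_abs=new_max)        # e.g. scale = new_max / 127 for 8-bit
```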
In this way, in an embodiment of the present invention, the quantization scale can be updated without identifying the distribution of the output feature map produced by the convolution.
FIG. 4 and FIG. 5 compare the distribution of the actual values with the distribution of the quantization scale values when the scale is updated according to the method proposed in the embodiment of the present invention. The scale value based on the average of the accumulated per-batch maxima (Mean: red, below) fits the parameter distribution of the current epoch's training even though the feature map distribution is not identified in real time. In contrast, a scale updated with only the simple maximum value (Prev. Xpoch Abs Max: blue, above) is updated based on an excessively large value and can cause performance degradation during training.
Updating the quantization scale through the average of the per-batch maximum values cannot operate in real time the way a software implementation can, but it checks an average over many image data samples and, at the same time, gives a rough picture of how the distribution changes with the overall training tendency. It also has the advantage that, because the step of identifying distribution parameters in real time is eliminated, a hardware accelerator implementation becomes possible.
FIGS. 6-9 show in more detail the quantization scale update process according to an embodiment of the present invention for the convolution operation. With limited memory, as in a mobile device, the batch size that can be trained at one time, i.e., the number of image data items, is necessarily small.
This means the number of batch iterations per unit epoch is large, and a scale update computed from the average of the per-batch maxima can therefore draw on relatively many sample values in this constrained environment, making it a stable method that is less affected by a bias toward a high value in any single batch.
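A small numeric illustration of that robustness argument, using made-up per-batch maxima (not data from the publication):

```python
per_batch_max = [1.1, 0.9, 1.0, 7.5, 1.2]   # one batch contains an outlier activation

scale_from_global_max = max(per_batch_max)                        # 7.5: dominated by the outlier
scale_from_mean_of_max = sum(per_batch_max) / len(per_batch_max)  # 2.34: closer to the typical range
```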
FIG. 10 is a diagram illustrating the configuration of a mobile deep learning computing device according to another embodiment of the present invention. As illustrated, the mobile deep learning computing device according to the embodiment of the present invention is configured to include a communication interface (110), a deep learning operator (120), and a memory (130).
The communication interface (110) communicates with an external host system to receive a dataset and the parameters of a pre-trained deep learning model. The deep learning operator (120) quantizes and trains the loaded deep learning model according to the method presented in FIG. 3 described above. The memory (130) provides the storage space required for the deep learning operator (120) to perform its operations.
So far, preferred embodiments of the lightweight deep learning model quantization method have been described in detail.
In the above embodiments, a method for quantization training of a hardware accelerator model in a mobile device, which is an environment with limited resources, was presented in which the average of the absolute values of the per-batch maxima of the deep learning operations in the previous epoch is predicted and applied as the quantization scale.
This enables fast, high-performance training as a quantization model applicable to hardware accelerator models, through a universal quantization scale adjustment technique applicable to mobile devices in constrained environments.
Meanwhile, it goes without saying that the technical idea of the present invention can be applied to a computer-readable recording medium storing a computer program that performs the functions of the device and method according to the present embodiment. In addition, the technical idea according to various embodiments of the present invention can be implemented in the form of computer-readable code recorded on a computer-readable recording medium. The computer-readable recording medium can be any data storage device that can be read by a computer and store data. For example, the computer-readable recording medium can be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical disk, a hard disk drive, etc. In addition, the computer-readable code or program stored on the computer-readable recording medium can be transmitted through a network connecting computers.
In addition, although the preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described above, and various modifications may be made by a person skilled in the art without departing from the gist of the present invention as claimed in the claims. Such modifications should not be understood separately from the technical idea or prospect of the present invention.
Claims (12)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| KR1020230157219A KR20250070865A (en) | 2023-11-14 | 2023-11-14 | Lightweight deep learning model quantization method |
| KR10-2023-0157219 | 2023-11-14 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025105520A1 true WO2025105520A1 (en) | 2025-05-22 |
Family
ID=95743158
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/KR2023/018255 Pending WO2025105520A1 (en) | 2023-11-14 | 2023-11-14 | Lightweight deep learning model quantization method |
Country Status (2)
| Country | Link |
|---|---|
| KR (1) | KR20250070865A (en) |
| WO (1) | WO2025105520A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20200050284A (en) * | 2018-11-01 | 2020-05-11 | 삼성전자주식회사 | Encoding apparatus and method of image using quantization table adaptive to image |
| CN111401518A (en) * | 2020-03-04 | 2020-07-10 | 杭州嘉楠耘智信息科技有限公司 | Neural network quantization method and device and computer readable storage medium |
| KR20210004306A (en) * | 2019-07-04 | 2021-01-13 | 삼성전자주식회사 | Neural Network device and method of quantizing parameters of neural network |
| KR20210018352A (en) * | 2019-06-12 | 2021-02-17 | 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 | Method for determining quantization parameters of neural networks and related products |
| KR20220013946A (en) * | 2020-05-21 | 2022-02-04 | 상하이 센스타임 인텔리전트 테크놀로지 컴퍼니 리미티드 | Quantization training, image processing method and apparatus, and storage medium |
- 2023
- 2023-11-14 WO PCT/KR2023/018255 patent/WO2025105520A1/en active Pending
- 2023-11-14 KR KR1020230157219A patent/KR20250070865A/en active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR20200050284A (en) * | 2018-11-01 | 2020-05-11 | 삼성전자주식회사 | Encoding apparatus and method of image using quantization table adaptive to image |
| KR20210018352A (en) * | 2019-06-12 | 2021-02-17 | 상하이 캠브리콘 인포메이션 테크놀로지 컴퍼니 리미티드 | Method for determining quantization parameters of neural networks and related products |
| KR20210004306A (en) * | 2019-07-04 | 2021-01-13 | 삼성전자주식회사 | Neural Network device and method of quantizing parameters of neural network |
| CN111401518A (en) * | 2020-03-04 | 2020-07-10 | 杭州嘉楠耘智信息科技有限公司 | Neural network quantization method and device and computer readable storage medium |
| KR20220013946A (en) * | 2020-05-21 | 2022-02-04 | 상하이 센스타임 인텔리전트 테크놀로지 컴퍼니 리미티드 | Quantization training, image processing method and apparatus, and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| KR20250070865A (en) | 2025-05-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109766840B (en) | Facial expression recognition method, device, terminal and storage medium | |
| US11556761B2 (en) | Method and device for compressing a neural network model for machine translation and storage medium | |
| CN113128419B (en) | Obstacle recognition method and device, electronic equipment and storage medium | |
| EP4123513A1 (en) | Fixed-point method and apparatus for neural network | |
| WO2022080790A1 (en) | Systems and methods for automatic mixed-precision quantization search | |
| EP3568828A1 (en) | Image processing apparatus and method using multi-channel feature map | |
| US20210176174A1 (en) | Load balancing device and method for an edge computing network | |
| WO2021006650A1 (en) | Method and system for implementing a variable accuracy neural network | |
| WO2022146080A1 (en) | Algorithm and method for dynamically changing quantization precision of deep-learning network | |
| WO2021230463A1 (en) | Method for optimizing on-device neural network model by using sub-kernel searching module and device using the same | |
| WO2023003432A1 (en) | Method and device for determining saturation ratio-based quantization range for quantization of neural network | |
| WO2025105520A1 (en) | Lightweight deep learning model quantization method | |
| WO2023014124A1 (en) | Method and apparatus for quantizing neural network parameter | |
| CN113344214A (en) | Training method and device of data processing model, electronic equipment and storage medium | |
| WO2020091139A1 (en) | Effective network compression using simulation-guided iterative pruning | |
| US12198405B2 (en) | Method, device, and computer program product for training image classification model | |
| WO2025041887A1 (en) | Method for iteratively pruning neural network through self-distillation | |
| CN117809095A (en) | An image classification method, device, equipment and computer-readable storage medium | |
| WO2023177025A1 (en) | Method and apparatus for computing artificial neural network based on parameter quantization using hysteresis | |
| CN111814813A (en) | Neural network training and image classification method and device | |
| CN115761595A (en) | Low-quality video detection and video understanding model training method and device | |
| WO2023128024A1 (en) | Method and system for quantizing deep-learning network | |
| WO2023080292A1 (en) | Apparatus and method for generating adaptive parameter for deep learning acceleration device | |
| WO2025041871A1 (en) | Quantization method and apparatus based on deep learning network architecture encoding and quantization aware training parameter prediction model | |
| WO2023085458A1 (en) | Method and device for controlling lightweight deep learning training memory |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 23958947; Country of ref document: EP; Kind code of ref document: A1 |