KR20220109301A

KR20220109301A - Quantization method for deep learning model and apparatus thereof

Info

Publication number: KR20220109301A
Application number: KR1020210156647A
Authority: KR
Inventors: 웬롱 흐어; 이고르 바실초프; 강 선; 두앤후이 리우
Original assignee: Samsung Electronics Co., Ltd.
Priority date: 2021-01-28
Filing date: 2021-11-15
Publication date: 2022-08-04
Also published as: CN112906294B; CN112906294A

Abstract

Disclosed in the present invention are a quantization method and a quantization apparatus for a deep learning model. The quantization method comprises: a step (S110) of quantizing a first model based on quantization parameters to obtain a second model; a step (S120) of testing the second model to obtain actual values of a plurality of optimization object parameters; a step (S130) of calculating a loss function based on the actual values, expected values, and constraint values of the plurality of optimization object parameters; a step (S140) of updating the quantization parameters based on the calculated loss function and using the second model as the first model; and a step (S150) of repeating steps S110 to S140 until a predetermined condition is satisfied, obtaining optimal quantization parameters once the condition is satisfied, and using the first model quantized with the optimal quantization parameters as the final quantized model. Because it combines high-precision and low-precision quantization, the model obtained by the quantization according to the present invention converges faster in training, achieves better model compression, and can maintain model accuracy without fine-tuning.

Description

QUANTIZATION METHOD FOR DEEP LEARNING MODEL AND APPARATUS THEREOF

A quantization method and a quantization apparatus for a deep learning model are disclosed.

With the development of artificial intelligence (AI) technology, AI can already be applied in various fields. Among them, deep learning, an important branch of AI, has made breakthroughs in computer vision, language processing, and text processing. Both mobile terminals and data center services currently apply deep learning. At its current application stage, however, deep learning demands high computing performance, a large memory footprint, and high power consumption from the hardware it runs on, which can place a heavy load on mobile terminal applications and data center services.

Among related technologies, quantizing deep learning models (e.g., neural network models) makes it possible to use low-precision, low-power neural network chips (e.g., an NPU (Neural Network Processing Unit), a TPU (Tensor Processing Unit), or an FPGA (Field Programmable Gate Array)), which reduces energy consumption and inference latency and makes such models suitable for mobile terminal applications and data center services.

In the related art, approaches to quantizing deep learning models have the following problems: (1) when quantization is learned during training (quantization-aware training), the training of the quantized model may converge slowly; (2) when a model is quantized at a single fixed precision, the compression achieved by the quantized model may be limited; and (3) quantizing at low precision causes a large loss of accuracy, so fine-tuning may be required to recover it.

The present disclosure provides a quantization method and a quantization apparatus for a deep learning model.

A quantization method for a deep learning model according to an embodiment of the present disclosure includes: quantizing a first model based on a quantization parameter to obtain a second model; testing the second model to obtain actual (real) values of one or more optimization object parameters; calculating a loss function based on the actual values, expected values, and constraint values of the optimization object parameters; and updating the quantization parameter based on the loss function and using the second model as the first model. These steps are executed cyclically until a preset condition is satisfied. The method further includes, when the preset condition is satisfied, obtaining an optimal quantization parameter and using the first model quantized based on the optimal quantization parameter as the final quantized model.

In the method, quantizing the first model based on the quantization parameter to obtain the second model may include: performing quantization annotation on each operator to be quantized in the first model to obtain a simulated quantization model; performing quantization configuration on the simulated quantization model based on the quantization parameter; calculating quantization coefficients based on the configured simulated quantization model; and rewriting the simulated quantization model based on the quantization coefficients to obtain the second model.

In the method, the optimization object parameters may include at least one of the accuracy of the quantized model, its size, its energy consumption, and its inference latency.
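To make one of these optimization object parameters concrete, the following Python sketch estimates the weight-storage size of a model from per-layer bit widths. It is an illustration under stated assumptions, not the measurement procedure of this disclosure: it counts only weights (ignoring scales, zero points, and activations), and the layer sizes in the usage are hypothetical.

```python
def model_size_bytes(param_counts, bit_widths):
    """Estimate weight storage in bytes for a (possibly mixed-precision) model.

    Assumption: size is dominated by weights; per-layer scales, zero points
    and activation buffers are ignored.
    """
    if len(param_counts) != len(bit_widths):
        raise ValueError("need exactly one bit width per layer")
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / 8  # 8 bits per byte


# Hypothetical three-layer model: FP32 baseline vs. an INT8/INT4/INT16 mix.
layer_params = [1000, 2000, 500]
fp32_size = model_size_bytes(layer_params, [32, 32, 32])  # 14000.0 bytes
mixed_size = model_size_bytes(layer_params, [8, 4, 16])   # 3000.0 bytes
```

A real size objective would also account for quantization metadata; the point here is only that the bit-width assignment directly determines this objective.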

In the method, calculating the loss function may include calculating it based on the difference between the expected value and the actual value of each optimization object parameter and the difference between the constraint value and the actual value of each optimization object parameter.

In the method, the loss function may be expressed as

$$L = \sum_{j=1}^{M} \left[ w_j \Delta_{t_j}^{2} + \lambda_j \left( \max\left(0, \Delta_{c_j}\right) \right)^{2} \right]$$

where, for the j-th optimization object parameter, $t_j \in \mathbb{R}_+$ is its expected value; $c_j \in \mathbb{R}_+$ is its constraint value; $o_j \in \mathbb{R}_+$ is its actual value measured on the current quantized model; $\Delta_{t_j} = t_j - o_j$ is the difference between the expected value and the actual value; $\Delta_{c_j} = c_j - o_j$ is the difference between the constraint value and the actual value; $w_j \in \mathbb{R}_+$ is a weighting factor and $\Delta_{t_j}^2$ is the optimization term, so that when the loss is minimized each parameter is driven toward its expected value while the weighting factor adjusts its relative importance, and the terms $w_j \Delta_{t_j}^2$ let the final result account for every optimization object parameter; $\lambda_j \in \mathbb{R}_+$ is a penalty factor; $(\max(0, \Delta_{c_j}))^2$ is the penalty term, which penalizes the second model when the actual value of an optimization object parameter does not satisfy its constraint value, so that each optimization object parameter is driven to satisfy its constraint; and $M$ is the total number of optimization object parameters.

In the method, updating the quantization parameter based on the loss function may include: determining and recording a new quantization parameter set for the second model based on the value of the loss function and a target algorithm, where the target algorithm includes a Bayesian optimization algorithm; and replacing the current quantization parameters of the second model with the new quantization parameter set.

In the method, obtaining the optimal quantization parameter may include, when the preset condition is satisfied, selecting from the plurality of recorded quantization parameter sets the set that minimizes the value of the loss function and using it as the optimal quantization parameter. The preset condition may be that the number of iterations of the steps reaches a preset number or that the iteration time reaches a preset duration.
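The update-and-select loop can be sketched as follows. To stay dependency-free, this Python sketch swaps the Bayesian optimization algorithm named in the text for plain random search over per-layer bit widths; `evaluate` is a hypothetical callback that would quantize the model with a candidate set, test it, and return the loss value. Both stopping conditions from the text (an iteration count and a time budget) are supported, and every evaluated set is recorded so the best one can be selected at the end.

```python
import random
import time

PRECISIONS = (4, 8, 16)  # candidate bit widths: INT4, INT8, INT16


def search_quantization_params(num_layers, evaluate, max_iters=50,
                               max_seconds=None, seed=0):
    """Propose, evaluate and record per-layer bit-width sets; return the best.

    evaluate(params) -> loss is a hypothetical caller-supplied function.
    Random search stands in for the Bayesian optimization named in the text.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    rng = random.Random(seed)
    history = []               # recorded (loss, params) pairs
    params = [8] * num_layers  # initial set: every layer at INT8
    start = time.monotonic()
    for _ in range(max_iters):
        loss = evaluate(params)
        history.append((loss, list(params)))
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break  # time-budget stopping condition
        # Propose the next candidate (a Bayesian optimizer would use `history` here).
        params = [rng.choice(PRECISIONS) for _ in range(num_layers)]
    best_loss, best_params = min(history, key=lambda item: item[0])
    return best_params, best_loss, history


# Toy objective: fewer total bits counts as "better" (stands in for a real test run).
best, best_loss, history = search_quantization_params(3, evaluate=sum, max_iters=20)
```

Random search ignores the recorded history when proposing candidates; a Bayesian optimizer would instead fit a surrogate model to `history` and choose the next set accordingly, which is why it typically needs fewer expensive quantize-and-test runs.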

In the method, the precision types corresponding to the quantization parameter may include at least one of INT4, INT8, and INT16.

A quantization apparatus for a deep learning model according to another embodiment of the present disclosure includes a mixed-precision quantization module, one or more object optimization modules, and an automatic optimization module. The mixed-precision quantization module is configured to quantize a first model based on a quantization parameter to obtain a second model, and to test the second model to obtain actual values of a plurality of optimization object parameters. The one or more object optimization modules are configured to calculate a loss function based on the actual values, expected values, and constraint values of the optimization object parameters. The automatic optimization module is configured to update the quantization parameter based on the loss function and use the second model as the first model, and, when the update result of the quantization parameter satisfies a preset condition, to obtain an optimal quantization parameter and use the first model quantized based on the optimal quantization parameter as the final quantized model.

Optionally, the mixed-precision quantization module is configured to: perform quantization annotation on each operator to be quantized in the first model to obtain a simulated quantization model; perform quantization configuration on the simulated quantization model based on the quantization parameter; calculate quantization coefficients based on the configured simulated quantization model; and rewrite the simulated quantization model based on the quantization coefficients to obtain the second model.

Optionally, the mixed-precision quantization module is configured to insert simulated quantization operators for each operator to be quantized, namely a simulated quantization operator for quantizing the weights and a simulated quantization operator for quantizing the activation values.

Optionally, the mixed-precision quantization module is configured to analyze the correspondence between the quantization parameter and the layer order of the operators in the simulated quantization model, and to assign the quantization parameter to the corresponding operators according to that correspondence.

Optionally, the mixed-precision quantization module is configured to determine, for each simulated quantization operator inserted in the simulated quantization model, the data precision type corresponding to its configured quantization parameter; to simulate, based on the mapping between floating-point data and data of that precision type, the clipping (intercepting) error and rounding error of converting floating-point data to integer data; and then to calculate the quantization coefficient of each simulated quantization operator.

Optionally, the mixed-precision quantization module is configured to determine, for each simulated quantization operator in the simulated quantization model, its configured quantization parameter and the corresponding quantization coefficient, and to replace the simulated quantization operator with a low-precision operator that supports that quantization parameter and quantization coefficient.
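The operator replacement performed during model rewriting can be illustrated with a dot product, the core computation of a conv2d or dense layer. In this Python sketch (an illustration under assumptions, not the rewriting code of this disclosure), both operands are quantized symmetrically with per-tensor scales, multiplied and accumulated purely in integers, and rescaled once by the product of the two quantization coefficients. The function names are hypothetical.

```python
def quantize_symmetric(values, num_bits):
    """Per-tensor symmetric quantization: returns integer codes and the scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def int_dot(x, w, num_bits=8):
    """Low-precision stand-in for a float dot product.

    Products are summed as Python ints (an int32 accumulator on real
    hardware); the float rescale by sx * sw happens once at the end.
    """
    qx, sx = quantize_symmetric(x, num_bits)
    qw, sw = quantize_symmetric(w, num_bits)
    acc = sum(a * b for a, b in zip(qx, qw))  # integer-only accumulation
    return acc * sx * sw


x = [0.5, -1.0, 0.25]
w = [1.0, 0.5, -2.0]
exact = sum(a * b for a, b in zip(x, w))  # -0.5 in floating point
approx = int_dot(x, w)                    # close to -0.5 with INT8
```

Keeping the accumulation in integers is what lets a low-precision operator replace the simulated one; only the final rescale needs the quantization coefficients.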

Optionally, the optimization object parameters include at least one of the accuracy of the quantized model, its size, its energy consumption, and its inference latency.

Optionally, the object optimization modules are configured to calculate the loss function based on the difference between the expected value and the corresponding actual value of each optimization object parameter and the difference between the constraint value and the corresponding actual value of each optimization object parameter.

Optionally, the loss function is expressed as

$$L = \sum_{j=1}^{M} \left[ w_j \Delta_{t_j}^{2} + \lambda_j \left( \max\left(0, \Delta_{c_j}\right) \right)^{2} \right]$$

where, for the j-th optimization object parameter, $t_j \in \mathbb{R}_+$ is its expected value; $c_j \in \mathbb{R}_+$ is its constraint value; $o_j \in \mathbb{R}_+$ is its actual value measured on the current quantized model; $\Delta_{t_j} = t_j - o_j$ is the difference between the expected value and the actual value; $\Delta_{c_j} = c_j - o_j$ is the difference between the constraint value and the actual value; $w_j \in \mathbb{R}_+$ is a weighting factor and $\Delta_{t_j}^2$ is the optimization term, so that when the loss is minimized each parameter is driven toward its expected value while the weighting factor adjusts its relative importance, and the terms $w_j \Delta_{t_j}^2$ let the final result account for every optimization object parameter; $\lambda_j \in \mathbb{R}_+$ is a penalty factor; $(\max(0, \Delta_{c_j}))^2$ is the penalty term, which penalizes the second model when the actual value of an optimization object parameter does not satisfy its constraint value, so that each optimization object parameter is driven to satisfy its constraint; and $M$ is the total number of optimization object parameters.

Optionally, the automatic optimization module is configured to determine and record a new quantization parameter set for the second model based on the output value of the loss function and a target algorithm, and to replace the current quantization parameters of the second model with the new quantization parameter set, where the target algorithm includes a Bayesian optimization algorithm.

Optionally, when the preset condition is satisfied, the automatic optimization module is configured to select, from the plurality of recorded quantization parameter sets, the set that minimizes the output value of the loss function and use it as the optimal quantization parameter, where the preset condition includes the number of iterations reaching a preset number of iterations or the iteration time reaching a preset iteration time.

Optionally, the precision types corresponding to the quantization parameter include at least one of INT4, INT8, and INT16.

Optionally, the quantization apparatus further includes an initialization module configured to set an initial quantization parameter set for the first model in an initialization stage, the precision types of which include at least one of INT4, INT8, and INT16.

A computer-readable storage medium according to another embodiment of the present disclosure stores a computer program that, when executed by a processor, implements the quantization method for a deep learning model according to any one of the methods above.

An electronic device according to another embodiment of the present disclosure includes at least one processor and at least one memory storing computer-executable instructions that, when executed by the at least one processor, cause the at least one processor to execute the quantization method for a deep learning model according to any one of the methods above.

According to the disclosed embodiments, quantization and testing are executed cyclically on the first model to obtain the actual values of a plurality of optimization object parameters; a loss function is then calculated from the actual, expected, and constraint values of those parameters, and the quantization parameters are updated according to the loss until the optimal quantization parameters are obtained, from which the final quantized model is produced. Because quantization is driven by the quantization parameters, different parts of the model can be quantized with different precision types, which avoids the weaker compression that results from fixing the whole model to a single precision and allows high-precision and low-precision quantization to be mixed.
In addition, because the loss function combines the actual, expected, and constraint values of multiple optimization object parameters, conflicts among the optimization objectives are taken into account and the objectives are optimized jointly. Furthermore, the present disclosure quantizes the pre-trained model directly rather than performing quantization during training, which avoids the slow training convergence that occurs when quantization is carried out in the training stage. Accordingly, compared with prior-art quantized models, the quantized model of the present disclosure converges faster, compresses the model better, and, by mixing high-precision and low-precision quantization, can preserve model accuracy without fine-tuning.

Additional aspects and/or advantages of the general concept of the embodiments disclosed herein will be set forth in part in the description below; others will be apparent from the description or may be learned by practicing the general concept of the disclosed technology.

The above and other objects and features of the embodiments of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings, which illustrate the embodiments by way of example.
FIG. 1 is a flowchart illustrating a quantization method for a deep learning model according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating a quantization apparatus for a deep learning model according to an embodiment of the present disclosure.
FIG. 3 is a multi-objective automatic quantization flowchart illustrating a quantization method for a deep learning model according to an embodiment of the present disclosure.
FIG. 4 is an exemplary diagram illustrating the quantization flow of a convolution operator according to an embodiment of the present disclosure.
FIG. 5 is an exemplary diagram illustrating quantization execution according to an embodiment of the present disclosure.
FIG. 6 is an exemplary diagram illustrating a quantization configuration according to an embodiment of the present disclosure.
FIG. 7 is an exemplary diagram illustrating a model expression fragment of an initial model after quantization by the quantization method for a deep learning model according to an embodiment of the present disclosure.

The following description, provided with reference to the accompanying drawings, is intended to aid a comprehensive understanding of embodiments of the present invention as defined by the claims and their equivalents. It includes various specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and configurations may be omitted for clarity and conciseness.

In describing the embodiments, detailed descriptions of well-known related structures or functions are omitted when they are judged to obscure the present invention.

A quantization method for a deep learning model according to an embodiment of the present disclosure is disclosed. FIG. 1 is a flowchart illustrating this quantization method. Referring to FIG. 1, the quantization method includes steps S110 to S160, which are executed cyclically until a preset condition is satisfied.

In step S110, the first model is quantized based on the quantization parameter to obtain the second model.

Step S110 may be executed as follows. For example, as shown in FIG. 5, step S110 may be subdivided into four steps A to D: step A is a quantization annotation step, step B is a quantization configuration step, step C is a quantization coefficient calculation step, and step D is a model rewriting step.

Specifically, in step A, quantization annotation is performed on each operator to be quantized in the first model to obtain a simulated quantization model; in step B, quantization configuration is performed on the simulated quantization model based on the quantization parameter; in step C, quantization coefficients are calculated based on the configured simulated quantization model; and in step D, the simulated quantization model is rewritten based on the quantization coefficients to obtain the second model.

The first model may be a pre-trained model, that is, the Pre-Trained Model shown in FIGS. 3 to 5. The precision types corresponding to the quantization parameter include INT4, INT8, and/or INT16, where INT denotes integer.

Step A may be implemented, for example, by inserting simulated quantization operators for each operator to be quantized, namely a simulated quantization operator for quantizing the weights and a simulated quantization operator for quantizing the activation values.

Specifically, referring to FIG. 4 and taking the convolution operator conv2d as an example, the inserted simQ (weight) operator is the simulated quantization operator that quantizes the weights, and the simQ (input) operator is the simulated quantization operator that quantizes the activation values.
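A minimal sketch of quantization annotation on a toy graph representation, in Python: in front of each quantizable operator, one simQ node is inserted for its activation input and one for its weight, and the operator is rewired to consume their outputs. The `Op` record, the tensor-naming scheme, and the set of quantizable operator types are assumptions for illustration, not the graph format used in this disclosure.

```python
from dataclasses import dataclass, field

QUANTIZABLE = {"conv2d", "dense"}  # hypothetical operator types to quantize


@dataclass
class Op:
    name: str     # operator type, e.g. "conv2d", "relu", "simQ"
    inputs: list  # names of input tensors
    output: str   # name of the output tensor
    attrs: dict = field(default_factory=dict)


def annotate(ops):
    """Insert simQ(input) and simQ(weight) in front of every quantizable op."""
    annotated = []
    for op in ops:
        if op.name not in QUANTIZABLE:
            annotated.append(op)
            continue
        data, weight = op.inputs
        q_data = Op("simQ", [data], data + "_q", {"role": "activation"})
        q_weight = Op("simQ", [weight], weight + "_q", {"role": "weight"})
        rewired = Op(op.name, [q_data.output, q_weight.output], op.output, dict(op.attrs))
        annotated += [q_data, q_weight, rewired]
    return annotated


graph = [Op("conv2d", ["x", "w1"], "y"), Op("relu", ["y"], "z")]
for op in annotate(graph):
    print(op.name, op.inputs, "->", op.output)
# simQ ['x'] -> x_q
# simQ ['w1'] -> w1_q
# conv2d ['x_q', 'w1_q'] -> y
# relu ['y'] -> z
```

Because the rewired conv2d keeps its output name, downstream operators such as relu need no changes.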

Optionally, uniform quantization may be used. For example, the simulated quantization operator for the weights may adopt asymmetric quantization while the simulated quantization operator for the activation values adopts symmetric quantization; alternatively, the weights may be quantized symmetrically and the activation values asymmetrically.
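The two uniform schemes mentioned above differ in whether a zero point is used. The Python sketch below simulates (fake-quantizes) a tensor both ways using per-tensor min/max calibration, which is an assumption rather than something the text specifies. For a skewed, non-negative tensor such as a ReLU output, the asymmetric scheme spends its integer range more efficiently.

```python
def fake_quant_symmetric(values, num_bits):
    """Quantize to signed integers centred on zero (no zero point), then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def fake_quant_asymmetric(values, num_bits):
    """Quantize to unsigned integers with a zero point, then dequantize."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # include 0.0 so it stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    return [(max(qmin, min(qmax, round(v / scale) + zero_point)) - zero_point) * scale
            for v in values]


acts = [0.0, 0.1, 0.2, 0.9]  # skewed, ReLU-like activations
sym_err = max(abs(a - b) for a, b in zip(acts, fake_quant_symmetric(acts, 4)))
asym_err = max(abs(a - b) for a, b in zip(acts, fake_quant_asymmetric(acts, 4)))
```

At 4 bits the symmetric scheme wastes its negative half on this all-positive tensor, so its worst-case error is larger than the asymmetric scheme's.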

Step B may be implemented, for example, by analyzing the correspondence between the quantization parameter and the layer order of the operators in the simulated quantization model, and assigning the quantization parameter to the corresponding operators according to that correspondence.

The quantization parameter may be generated by the optimizer based on the loss function (described in detail below). Optionally, the correspondence between the operator layer levels of the simulated quantization model and the entries of the quantization parameter array may be a one-to-one mapping. Specifically, referring to FIG. 6, the Layers array represents the quantization parameter, and each entry is assigned to its corresponding operator. Here, the values 4, 8, and 16 indicate that the computation data of the corresponding simulated quantization operator is configured with the precision type INT4, INT8, or INT16, respectively.
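The one-to-one mapping from the Layers array to operators can be sketched in Python as follows. The operator names and the dictionary representation of the configuration are hypothetical; the sketch assumes only what the text states: the i-th array entry configures the i-th operator in layer order, and the values 4, 8, and 16 denote INT4, INT8, and INT16.

```python
PRECISION_NAMES = {4: "INT4", 8: "INT8", 16: "INT16"}


def configure(sim_quant_ops, layer_bits):
    """Assign the i-th entry of the Layers array to the i-th operator in layer order."""
    if len(sim_quant_ops) != len(layer_bits):
        raise ValueError(f"expected {len(sim_quant_ops)} entries, got {len(layer_bits)}")
    config = {}
    for op_name, bits in zip(sim_quant_ops, layer_bits):
        if bits not in PRECISION_NAMES:
            raise ValueError(f"unsupported bit width {bits} for {op_name}")
        config[op_name] = PRECISION_NAMES[bits]
    return config


# Hypothetical operator names listed in layer order.
config = configure(["conv2d_0", "conv2d_1", "dense_0"], [8, 4, 16])
```

Validating the array length and the allowed values up front catches optimizer proposals that do not match the model's layer count.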

For example, referring to FIG. 4, a particular convolution layer conv2d in the pre-trained model originally performs FP32 operations, where FP denotes floating point. The simulated quantization performed by the quantization operator implements the FP32-to-INT8 conversion, so that this conv2d can perform INT8 computation. In other words, step B determines the precision at which each computing operator in the model (e.g., each convolution layer) should perform its computation.

In step C, for example, the first step is to determine the precision type of the data corresponding to the quantization parameter configured for each simulation quantization operator inserted in the simulation quantization model. Next, based on the mapping relationship between floating-point data and data of that precision type, the clipping (intercepting) error and the rounding error of converting floating-point data to integer data are simulated. Finally, the quantization coefficients of each simulation quantization operator are calculated.

Referring to FIG. 6, the configured quantization parameters may be [8, 4, …… 16, … 4, 8, ……4, 16, …]. Here, the value 4 indicates that the precision type of the data corresponding to the configured quantization parameter is INT4; likewise, the value 8 corresponds to INT8 and the value 16 corresponds to INT16.

Note that the mapping relationship between floating-point data and data of a given precision type may be obtained in advance. Since this is not essential to the present disclosure, a detailed description is omitted.

The clipping (intercepting) error and rounding error may be "simulated" as follows. For example, activation values may be clipped and rounded using the KLD (Kullback-Leibler divergence) and Min/Max methods, and weight values may be clipped and rounded using the Max/Min method. The clipped and rounded values are then compared with the original floating-point data to determine the clipping error and the rounding error.
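The sketch below shows one simplified way to measure the two errors with Min/Max calibration; KLD calibration is omitted. It assumes a uniform grid of 2^bits levels anchored at the lower end of the clipping range and ignores zero-point handling. The helper names are hypothetical.

```python
def minmax_range(values):
    """Min/Max calibration: the clipping range is the observed min and max."""
    return min(values), max(values)


def clipping_and_rounding_error(values, lo, hi, bits):
    """Mean absolute clipping error and rounding error on a uniform grid over [lo, hi]."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    clip_err = 0.0
    round_err = 0.0
    for v in values:
        clipped = min(max(v, lo), hi)          # values outside [lo, hi] are clipped
        q = round((clipped - lo) / scale)      # integer index on the grid
        restored = lo + q * scale              # value after dequantization
        clip_err += abs(v - clipped)
        round_err += abs(clipped - restored)
    n = len(values)
    return clip_err / n, round_err / n
```

With a range calibrated on different data than it is applied to, both errors become visible. For example, `clipping_and_rounding_error([0.0, 0.5, 1.0, 4.0], 0.0, 1.0, 2)` has a mean clipping error of 0.75, all of it from the outlier 4.0. It has a mean rounding error of about 0.042, all of it from 0.5.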

Next, the quantization coefficients of each simulation quantization operator may be determined based on the mapping relationship, the clipping error, and the rounding error. The quantization coefficients keep the quantized data as close as possible to the original data distribution. They may include, for example, a scale factor and a zero point.

In step D, for example, the following is done for each simulation quantization operator in the simulation quantization model. First, the quantization parameter configured for that operator and its corresponding quantization coefficients are determined. The simulation quantization operator is then replaced with a low-precision operator that supports that quantization parameter and those quantization coefficients.

It should be understood that the "replacement" may be implemented in various ways. For example, referring to FIG. 4, the simulation operator may be processed with the Mul, Round, Clip, and Cast functions to obtain the low-precision operator.

- The Mul function computes FP32 × 1/scale, i.e., it divides the floating-point data by the scale.
- The Round function rounds the result to the nearest integer.
- The Clip function, Clip(Round(Mul(FP32, 1/scale)), -2^(bit-1), +2^(bit-1)-1), sets any value smaller than -2^(bit-1) to -2^(bit-1) and any value larger than +2^(bit-1)-1 to +2^(bit-1)-1. For example, if the operator's precision is INT4 (4 bits), then -2^(bit-1) is -8 and +2^(bit-1)-1 is +7. Values below -8 therefore become -8, and values above +7 become +7.
- The Cast function converts the FP32 value to the integer type, e.g., (FP32) +7 → (INT4) +7.
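Here is a per-value sketch of the Mul, Round, Clip, Cast chain for a signed integer type. As in the description above, the zero point is omitted. The function name is hypothetical. Python's built-in `round` rounds halves to even, which may differ from the rounding mode of a given target device.

```python
def quantize_value(x_fp32, scale, bits):
    """Apply Mul -> Round -> Clip -> Cast to a single floating-point value."""
    scaled = x_fp32 * (1.0 / scale)               # Mul: FP32 * 1/scale
    rounded = round(scaled)                       # Round: nearest integer (halves to even)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    clipped = min(max(rounded, qmin), qmax)       # Clip: saturate to [qmin, qmax]
    return int(clipped)                           # Cast: Python has no native INT4, so use int
```

With `bits=4` the saturation bounds are -8 and +7. For instance, `quantize_value(9.3, 1.0, 4)` returns 7 and `quantize_value(-12.0, 1.0, 4)` returns -8.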

In the quantization process above, the model is quantized according to the quantization parameters. To preserve accuracy under low-precision quantization, a different quantization mode may be adopted for each precision type. For example, INT8 and INT16 may use layer-level quantization (one set of quantization coefficients per layer), while INT4 uses channel-level quantization (one set per channel). High-precision operators help preserve accuracy, while low-precision operators reduce model size and inference latency. This avoids the poor compression that results from quantizing the whole model at a single fixed precision. By mixing high-precision and low-precision quantization in this way, the model's accuracy can be maintained without fine-tuning.

In step S120, the second model is tested to obtain the actual values of a plurality of optimization object parameters.

Referring to FIGS. 3 and 5, Val.Data is the test data set used to test the second model. Specifically, the quantized model (i.e., the second model) is compiled and run for prediction to obtain the actual values of the plurality of optimization object parameters.

Here, the optimization object parameters include the size, accuracy, power, and/or inference latency of the quantized model. They are not limited to these and may include other indicators that characterize model performance.
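The size objective, unlike accuracy or latency, can be estimated without running the model. Below is a minimal sketch. It assumes size is approximated by weight storage alone and ignores scales, zero points, and non-quantized tensors. The per-layer parameter counts in the example are made up, and the helper name is hypothetical.

```python
def model_size_bytes(param_counts, bit_widths):
    """Approximate weight storage of a mixed-precision model, in bytes.

    Ignores quantization coefficients (scale/zero point) and any tensors
    that stay in floating point.
    """
    if len(param_counts) != len(bit_widths):
        raise ValueError("one bit width per layer is required")
    total_bits = sum(n * b for n, b in zip(param_counts, bit_widths))
    return total_bits / 8
```

Consider two hypothetical layers with 1,000 and 2,000 weights. At [INT8, INT4] they take `model_size_bytes([1000, 2000], [8, 4])` = 2000.0 bytes, compared with 12000.0 bytes at 32 bits each.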

In step S130, a loss function is calculated based on the actual values, the expected values, and the constraint values of the plurality of optimization object parameters.

The loss function may be calculated, for example, from two sets of differences:

- the differences between the expected values of the plurality of optimization object parameters and their actual values;
- the differences between the constraint values and the actual values.

It may be expressed as

$$\mathrm{loss} = \sum_{j=1}^{M}\left[ w_j\,\Delta t_j^{2} + \lambda_j\left(\max\left(0,\ \Delta c_j\right)\right)^{2} \right]$$

where:

- t_j ∈ ℝ₊ is the expected value of the j-th optimization object parameter;
- c_j ∈ ℝ₊ is the constraint value (limit) on that parameter;
- o_j ∈ ℝ₊ is the actual value of that parameter for the current quantized model;
- Δt_j = t_j - o_j is the difference between the expected value and the actual value;
- Δc_j = c_j - o_j is the difference between the constraint value and the actual value;
- w_j ∈ ℝ₊ is a weighting factor. Δt_j² is the optimization term: minimizing the loss drives each optimization object parameter toward its expected value, while w_j adjusts the relative importance of each parameter, so that the w_j·Δt_j² terms make the final result account for every optimization object parameter;
- λ_j ∈ ℝ₊ is a penalty factor. (max(0, Δc_j))² is the penalty term: it penalizes any optimization object parameter of the second model whose actual value falls outside its constraint (Δc_j > 0), pushing each parameter to satisfy its constraint;
- M is the total number of optimization object parameters.
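This is a direct transcription of the loss above into Python, as a sketch; the function name is hypothetical. With Δc_j = c_j - o_j, the penalty is active when o_j < c_j. That suits lower-bound constraints such as accuracy. For upper-bound objectives such as size or latency, one possible approach (an assumption, not from the text) is to pass negated values (-o_j, -t_j, -c_j), so that exceeding the limit is penalized.

```python
def surrogate_loss(actual, expected, constraint, weights, penalties):
    """loss = sum_j w_j * (t_j - o_j)^2 + lambda_j * max(0, c_j - o_j)^2."""
    n = len(actual)
    if not all(len(seq) == n for seq in (expected, constraint, weights, penalties)):
        raise ValueError("all inputs need one entry per optimization object parameter")
    loss = 0.0
    for o, t, c, w, lam in zip(actual, expected, constraint, weights, penalties):
        delta_t = t - o                  # optimization term: distance to the expected value
        delta_c = c - o                  # positive when the actual value is below the constraint
        loss += w * delta_t ** 2 + lam * max(0.0, delta_c) ** 2
    return loss
```

Take accuracy with o = 0.70, t = 0.76, c = 0.72, w = 1, and λ = 10. The loss is 0.0036 + 0.004 = 0.0076. Now take a size of 120 against an expected 100 and a limit of 110, passed as negated values. `surrogate_loss([-120.0], [-100.0], [-110.0], [1.0], [1.0])` gives 400 + 100 = 500.0.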

In this way, the expected values and constraint values of multiple objectives are combined into a single surrogate loss function. The smaller the output value of the loss function, the closer the quantized model is to Pareto optimality.

In step S140, the quantization parameters are updated based on the calculated loss function, and the second model is used as the first model.

Specifically, the update may be implemented as follows. The output value of the loss function is input to an optimizer. The optimizer performs a calculation based on that value and a target algorithm (also called an optimization algorithm) to determine and record a new set of quantization parameters for the second model. The new set then replaces the current quantization parameters of the second model. The target algorithm may be a Bayesian optimization algorithm. It is not limited to Bayesian optimization and may also be, for example, reinforcement learning (RL) or a genetic algorithm.
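The text names Bayesian optimization as the target algorithm. Implementing it requires a probabilistic surrogate model, which is beyond a short example. The sketch below uses a deliberately simpler stand-in, closer in spirit to the genetic-algorithm alternative. It records every (configuration, loss) pair and proposes the next configuration by mutating the best one found so far. It is not Bayesian optimization, and the class and parameter names are hypothetical.

```python
import random

CHOICES = (4, 8, 16)  # bit widths named in the description


class MutationOptimizer:
    """Proposes new per-layer bit-width sets from observed losses.

    A simple evolutionary stand-in, NOT Bayesian optimization.
    """

    def __init__(self, initial, mutation_rate=0.2, seed=0):
        self.best = list(initial)
        self.best_loss = float("inf")
        self.history = []                  # recorded (config, loss) pairs
        self.rate = mutation_rate
        self.rng = random.Random(seed)

    def observe(self, config, loss):
        """Record the loss of an evaluated config and keep track of the best one."""
        self.history.append((list(config), loss))
        if loss < self.best_loss:
            self.best, self.best_loss = list(config), loss

    def propose(self):
        """Mutate the best config: each layer re-draws its bit width with probability `rate`."""
        return [self.rng.choice(CHOICES) if self.rng.random() < self.rate else b
                for b in self.best]
```

A Bayesian optimizer would expose the same observe/propose cycle but choose proposals from a posterior over the loss rather than by random mutation.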

In step S150, when the iterated steps satisfy the preset condition, the optimal quantization parameters are obtained. The first model, quantized based on those optimal quantization parameters, is used as the final quantized model.

Before step S150, it may also be determined whether the iterated steps satisfy the preset condition.

Here, the iterated steps are steps S110 to S140. Specifically, when the preset condition is satisfied, the recorded quantization parameter sets are screened. The set that minimizes the output value of the loss function is used as the optimal quantization parameters. The preset condition includes the number of iterations reaching a preset number or the iteration time reaching a preset time.

In one embodiment, the quantization method further includes setting an initial set of quantization parameters for the first model in an initialization step. The precision types include at least one of INT4, INT8, and INT16.

It will be understood that the initial quantization parameter set given to the pre-trained model in the initialization step triggers iterative execution of steps S110 to S140 until the preset condition is satisfied. Optionally, the precision type of every initial quantization parameter may be set to INT8.
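The loop below sketches steps S110 to S150 end to end, under these assumptions:

- `evaluate` is a hypothetical callback that would quantize the model with a configuration, test it, and return the surrogate loss.
- The search starts from an all-INT8 configuration.
- New configurations are drawn uniformly at random rather than by a Bayesian optimizer.
- The preset condition is either an iteration count or a wall-clock budget.

```python
import random
import time


def search(evaluate, num_layers, max_iters=50, max_seconds=60.0, seed=0):
    """Iterate S110-S140 until a preset budget is hit, then pick the best recorded set (S150).

    `evaluate(config) -> loss` is a hypothetical callback standing in for
    quantize + test + loss computation.
    """
    rng = random.Random(seed)
    config = [8] * num_layers                      # initial set: all INT8
    history = []                                   # recorded (config, loss) pairs
    start = time.monotonic()
    for _ in range(max_iters):                     # preset number of iterations
        history.append((config, evaluate(config)))
        if time.monotonic() - start >= max_seconds:  # preset iteration time
            break
        config = [rng.choice((4, 8, 16)) for _ in range(num_layers)]
    best_config, best_loss = min(history, key=lambda item: item[1])
    return best_config, best_loss
```

Consider a toy `evaluate` that scores a configuration by its distance from INT8. Then `search(lambda c: sum(abs(b - 8) for b in c), 3, max_iters=10)` returns `([8, 8, 8], 0)`, because the all-INT8 starting point is already optimal.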

Hereinafter, the quantization method for a deep learning model of the present disclosure is described with reference to FIGS. 3 and 7.

Referring to FIG. 3, the mixed-precision quantization module (MP-Quant) performs steps S110 and S120. Specifically, this module converts the FP32 operator of each layer of the original floating-point model into an integer operator at the precision set by the quantization configuration module. It then uses a low-precision compute unit, such as an NPU or GPU, to speed up computation. The activation values and weights of each layer (convolution layer or dense layer) are quantized together. The quantized model is then compiled and evaluated to obtain information such as model size and accuracy (Acc.). As described herein, different precisions use different quantization methods.

The multi-objective optimization module (MOO) is mainly used to perform step S130. Specifically, it quantifies multiple objective values jointly and converts the optimization task into a single surrogate loss function. Applying a neural network model in practice involves many considerations. Besides accuracy, the model's size, inference latency, and energy consumption also affect its use in real scenarios. In practice these objectives conflict, for example accuracy versus model size. A smaller model usually has lower accuracy after quantization, so a global optimum must take every objective into account. This module therefore defines a surrogate loss function, fed back from the quantized model to the optimizer, so that the objectives (accuracy, model size, inference time, etc.) are optimized jointly.

The automatic optimization module (Auto-Opt) is mainly used to perform steps S140 and S150. Specifically, it searches for the best overall result (the Pareto optimum). The optimizer treats the model to be quantized as a black-box function and takes the per-layer precision configuration as hyperparameters. It finds the Pareto optimum through iterative optimization. In each iteration, the optimizer takes as input the result of applying the previous hyperparameters to the model. It then updates its posterior probability distribution and generates new hyperparameters for the next iteration. The optimizer stops when the number of iterations reaches a preset target or the optimization time reaches a preset duration. It then outputs the Pareto-optimal strategy, i.e., the optimal mixed-precision configuration (the optimal quantization parameters). The Config Space shown in FIG. 3 mainly provides the optimizer with a filtered data set of candidate configurations.

As shown in FIG. 3, once the iterative optimization has been repeated until the preset condition is met, the optimal quantization parameter set is selected as the best quantization strategy (Best QStrategy). The model is then quantized according to that strategy to obtain the final quantized model.

Referring to the quantized model representation fragment shown in FIG. 7, the convolution operator conv2d is converted from the corresponding FP32 floating-point operator into INT4 and INT8 integer operators. For the specific conversion steps, refer to steps A to D of step S110; a detailed description is omitted.

According to another aspect of an embodiment of the present disclosure, a quantization apparatus for a deep learning model is provided. Referring to FIG. 2, the quantization apparatus 200 includes the following units, which may be coupled to one another:

- a mixed-precision quantization module 210;
- a multi-objective optimization module 220;
- an automatic optimization module 230;
- an initialization module 240.

The mixed-precision quantization module 210 is configured to quantize the first model based on the quantization parameters to obtain a second model. It is further configured to test the second model to obtain the actual values of the plurality of optimization object parameters. The multi-objective optimization module 220 is configured to calculate a loss function based on the actual values, expected values, and constraint values of the plurality of optimization object parameters. The automatic optimization module 230 is configured to update the quantization parameters based on the calculated loss function and use the second model as the first model. When the updating satisfies a preset condition, it is configured to obtain the optimal quantization parameters and use the first model quantized based on them as the final quantized model.

It should be understood that the specific features described above for the quantization method for a deep learning model apply similarly to the quantization apparatus. For brevity, they are not described again in detail.

Optionally, the mixed-precision quantization module is configured to insert a simulation quantization operator for each operator to be quantized. The simulation quantization operators include one for quantizing weights and one for quantizing activation values.

Optionally, the mixed-precision quantization module is configured to perform the following:

- determine the precision type of the data corresponding to the quantization parameter configured for each simulation quantization operator inserted in the simulation quantization model;
- simulate, based on the mapping relationship between floating-point data and data of that precision type, the clipping (intercepting) error and rounding error of converting floating-point data to integer data;
- calculate the quantization coefficients of each simulation quantization operator.

Optionally, the mixed-precision quantization module is configured to process each simulation quantization operator in the simulation quantization model. For each one, it determines the quantization parameter configured for that operator and the corresponding quantization coefficients. It then replaces the simulation quantization operator with a low-precision operator that supports that quantization parameter and those quantization coefficients.

Optionally, the multi-objective optimization module is configured to calculate a loss function from two sets of differences:

- the differences between the expected values of the plurality of optimization object parameters and the corresponding actual values;
- the differences between the constraint values and the corresponding actual values.

The loss function may be expressed as

$$\mathrm{loss} = \sum_{j=1}^{M}\left[ w_j\,\Delta t_j^{2} + \lambda_j\left(\max\left(0,\ \Delta c_j\right)\right)^{2} \right]$$

where:

- t_j ∈ ℝ₊ is the expected value of the j-th optimization object parameter;
- c_j ∈ ℝ₊ is the constraint value (limit) on that parameter;
- o_j ∈ ℝ₊ is the actual value of that parameter for the current quantized model;
- Δt_j = t_j - o_j is the difference between the expected value and the actual value;
- Δc_j = c_j - o_j is the difference between the constraint value and the actual value;
- w_j ∈ ℝ₊ is a weighting factor. Δt_j² is the optimization term: minimizing the loss drives each optimization object parameter toward its expected value, while w_j adjusts the relative importance of each parameter, so that the w_j·Δt_j² terms make the final result account for every optimization object parameter;
- λ_j ∈ ℝ₊ is a penalty factor. (max(0, Δc_j))² is the penalty term: it penalizes any optimization object parameter of the second model whose actual value falls outside its constraint (Δc_j > 0), pushing each parameter to satisfy its constraint;
- M is the total number of optimization object parameters.

Optionally, the automatic optimization module is configured to determine and record a new set of quantization parameters for the second model, based on the output value of the loss function and a target algorithm. It then replaces the current quantization parameters of the second model with the new set. Here, the target algorithm includes a Bayesian optimization algorithm.

Optionally, the automatic optimization module is configured to act when a preset condition is satisfied. It screens the recorded quantization parameter sets and uses the set that minimizes the output value of the loss function as the optimal quantization parameters. The preset condition includes the number of iterations reaching a preset number or the iteration time reaching a preset time.

Optionally, the quantization apparatus further includes an initialization module 240 configured to set an initial quantization parameter set for the first model in an initialization step. The precision types include at least one of INT4, INT8, and INT16.

It should be understood that each unit/module in the quantization system for a deep learning model according to an embodiment of the present disclosure may be implemented as a hardware component and/or a software component. A person of ordinary skill in the art may implement each unit/module according to the processing it performs. For example, an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit) may be used.

According to another aspect of an embodiment of the present disclosure, a computer-readable storage medium is provided. It stores a computer program that, when executed by a processor, implements the quantization method for a deep learning model of the present disclosure.

Specifically, the quantization method for a deep learning model according to an embodiment of the present disclosure may be implemented by computer program instructions recorded on a computer-readable storage medium. When executed by a processor or another type of computing device, these instructions implement the method. The storage medium may include program instructions, data files, data structures, and the like, alone or in combination.

Examples of computer-readable storage media include:

- magnetic media (e.g., hard disks, floppy disks, and magnetic tape);
- optical media (e.g., CD-ROM disks and DVDs);
- magneto-optical media (e.g., optical disks);
- hardware devices specially configured to store and execute program instructions (e.g., read-only memory (ROM), random access memory (RAM), and flash memory).

Examples of program instructions include files containing machine code (e.g., produced by a compiler) and higher-level code that can be executed by a computer using an interpreter. The hardware devices described may be configured to act as one or more software units to perform the operations and methods above, and vice versa. In addition, the computer-readable storage medium may be distributed over networked computer systems, so that computer-readable code or program instructions are stored and executed in a distributed manner.

Specifically, the electronic device may broadly be a tablet computer, a smartphone, a smartwatch, or any other electronic device with the necessary computing and/or processing capability. In one embodiment, the electronic device may include a processor, a memory, a network interface, a communication interface, and the like, connected through a system bus. The processor may provide the necessary computing, processing, and/or control functions. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium may store an operating system, computer programs, and the like. The internal memory may provide an environment for running the operating system and computer programs stored in the non-volatile storage medium. The network interface and communication interface may be used to connect to and communicate with external devices over a network.

The embodiments described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatuses, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field-programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications running on the OS. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or instruct the processing device independently or collectively. Software and/or data may be permanently or temporarily embodied in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to it. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in a computer-readable recording medium.

Although the embodiments have been described above with reference to a limited number of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or the described components of the system, structure, apparatus, circuit, etc. are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

200: quantization apparatus
210: mixed-accuracy quantization module
220: multi-object optimization module
230: automatic optimization module
240: initialization module

Claims

A quantization method for a deep learning model, the method comprising:
quantizing a first model based on a quantization parameter to obtain a second model;
testing the second model to obtain actual values of one or more optimization object parameters;
calculating a loss function based on the actual values, expected values, and constraint values of the optimization object parameters;
updating the quantization parameter based on the loss function and using the second model as the first model;
repeating the above steps until a preset condition is satisfied; and
obtaining an optimal quantization parameter when the preset condition is satisfied, and using the first model quantized based on the optimal quantization parameter as a final quantization model.
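The iterative procedure of claim 1 (steps S110 to S150 in the abstract) can be sketched as below. This is a toy under stated assumptions, not the patented implementation: the "model" is only a mapping from hypothetical layer names to bit widths, `evaluate` returns invented accuracy and size figures, `toy_loss` is an illustrative stand-in (the claimed loss is defined in claim 5), and a simple grid walk replaces the Bayesian optimization named in claim 6. All function names are hypothetical.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Model = Dict[str, int]   # toy stand-in: layer name -> bit width currently applied
Params = Dict[str, int]  # quantization parameter set: layer name -> bit width

BIT_CHOICES = (4, 8, 16)  # INT4 / INT8 / INT16, the precision types named in the claims


def quantize(model: Model, params: Params) -> Model:
    """Toy quantization step: apply the requested bit width to every layer."""
    return {layer: params[layer] for layer in model}


def evaluate(model: Model) -> Dict[str, float]:
    """Toy test step with made-up metrics: fewer bits -> smaller but less accurate."""
    accuracy = 0.9 - sum(0.02 * (16 - b) / 12 for b in model.values())
    size_bytes = sum(1000 * b / 8 for b in model.values())  # assumes 1000 weights per layer
    return {"accuracy": accuracy, "size": size_bytes}


def toy_loss(metrics: Dict[str, float]) -> float:
    """Toy loss: prefer small models, heavily penalize accuracy below 0.85."""
    shortfall = max(0.0, 0.85 - metrics["accuracy"])
    return metrics["size"] / 1000 + 1000 * shortfall ** 2


def search(model: Model, params: Params,
           loss_fn: Callable[[Dict[str, float]], float],
           max_iters: int) -> Tuple[Params, Model, float]:
    layers = sorted(params)
    # Stand-in proposer that walks a fixed grid; the claims use Bayesian optimization here.
    grid = itertools.cycle(itertools.product(BIT_CHOICES, repeat=len(layers)))
    history: List[Tuple[float, Params]] = []
    for _ in range(max_iters):                   # preset condition: iteration budget
        second = quantize(model, params)         # quantize first model -> second model
        loss = loss_fn(evaluate(second))         # test, then compute the loss
        history.append((loss, dict(params)))     # record the parameter set and its loss
        params = dict(zip(layers, next(grid)))   # update the quantization parameter
        model = second                           # use the second model as the first model
    best_loss, best_params = min(history, key=lambda item: item[0])
    return best_params, quantize(model, best_params), best_loss


start = {"conv1": 16, "fc": 16}
best_params, final_model, best_loss = search(dict(start), dict(start), toy_loss, max_iters=10)
print(best_params, best_loss)  # → {'conv1': 4, 'fc': 4} 1.0
```

With ten iterations the grid walk visits the starting set and all nine INT4/INT8/INT16 combinations, so the lowest-loss set is guaranteed to be recorded; a real optimizer would instead choose each new set from the losses seen so far.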

According to claim 1,
The quantization method for a deep learning model, wherein quantizing the first model based on the quantization parameter to obtain the second model comprises:
executing quantization annotation on each operator to be quantized in the first model to obtain a simulated quantization model;
executing a quantization configuration on the simulated quantization model based on the quantization parameter;
calculating quantization coefficients based on the simulated quantization model after the quantization configuration; and
obtaining the second model by executing model rewriting on the simulated quantization model based on the quantization coefficients.
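The four sub-steps of claim 2 (annotation, configuration, coefficient calculation, model rewriting) can be illustrated with a hedged sketch. The patent does not say how the quantization coefficients are computed; the example assumes symmetric per-tensor scaling, scale = max|w| / (2^(b-1) - 1), and treats a per-operator bit width as the quantization parameter. `Operator`, `QUANTIZABLE_KINDS`, and the four helper functions are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical set of operator kinds treated as "operators to be quantized".
QUANTIZABLE_KINDS = {"conv2d", "dense"}


@dataclass
class Operator:
    name: str
    kind: str
    weights: List[float]
    annotated: bool = False          # set by annotation (simulated quantization)
    bits: Optional[int] = None       # set by the quantization configuration
    scale: Optional[float] = None    # quantization coefficient
    int_weights: List[int] = field(default_factory=list)  # set by model rewriting


def annotate(ops: List[Operator]) -> None:
    """Mark each quantizable operator, yielding a simulated quantization model."""
    for op in ops:
        op.annotated = op.kind in QUANTIZABLE_KINDS


def configure(ops: List[Operator], params: Dict[str, int]) -> None:
    """Apply the quantization parameter (here: a bit width per operator name)."""
    for op in ops:
        if op.annotated:
            op.bits = params[op.name]


def compute_coefficients(ops: List[Operator]) -> None:
    """Symmetric per-tensor scale: max|w| maps to the largest signed integer."""
    for op in ops:
        if op.annotated:
            qmax = 2 ** (op.bits - 1) - 1
            max_abs = max((abs(w) for w in op.weights), default=0.0)
            op.scale = max_abs / qmax if max_abs > 0 else 1.0


def rewrite(ops: List[Operator]) -> None:
    """Replace float weights with clamped integers q = round(w / scale)."""
    for op in ops:
        if op.annotated:
            qmax = 2 ** (op.bits - 1) - 1
            op.int_weights = [max(-qmax, min(qmax, round(w / op.scale))) for w in op.weights]


ops = [Operator("conv1", "conv2d", [1.27, -0.64, 0.3]),
       Operator("act1", "relu", []),
       Operator("fc", "dense", [0.7, -0.2])]
annotate(ops)
configure(ops, {"conv1": 8, "fc": 4})
compute_coefficients(ops)
rewrite(ops)
print([op.int_weights for op in ops])  # → [[127, -64, 30], [], [7, -2]]
```

Mixing INT8 for `conv1` with INT4 for `fc` in one model is what the document calls mixed-accuracy quantization; the non-quantizable `relu` operator is left untouched.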

According to claim 1,
The quantization method for a deep learning model, wherein the optimization object parameters include at least one of the accuracy of the quantization model, the size of the quantization model, energy consumption, and inference latency.

According to claim 1,
The quantization method for a deep learning model, wherein calculating the loss function comprises:
calculating the loss function based on the difference between the expected value and the actual value of each optimization object parameter and the difference between the constraint value and the actual value of each optimization object parameter.

According to claim 4,
The quantization method for a deep learning model, wherein the loss function is expressed as:
$$\mathcal{L} = \sum_{j=1}^{M}\left[ w_j\,\Delta_{t_j}^{2} + \lambda_j\left(\max\left(0,\ \Delta_{c_j}\right)\right)^{2} \right]$$
where
$t_j \in \mathbb{R}_{+}$ is the expected value of the $j$-th optimization object parameter;
$c_j \in \mathbb{R}_{+}$ is the constraint value of the $j$-th optimization object parameter;
$o_j \in \mathbb{R}_{+}$ is the actual value of the $j$-th optimization object parameter of the current quantization model;
$\Delta_{t_j} = t_j - o_j$ is the difference between the expected value and the actual value;
$\Delta_{c_j} = c_j - o_j$ is the difference between the constraint value and the actual value;
$w_j \in \mathbb{R}_{+}$ is a weighting factor; in the optimization term $w_j \Delta_{t_j}^{2}$, the weighting factor adjusts the importance of each optimization object parameter when the loss is minimized, so that the final result accounts for every optimization object parameter;
$\lambda_j \in \mathbb{R}_{+}$ is a penalty factor, and $\lambda_j \left(\max(0, \Delta_{c_j})\right)^{2}$ is a penalty term that applies a penalty when the actual value of an optimization object parameter of the second model fails to satisfy its constraint value, driving each optimization object parameter to meet its constraint condition; and
$M$ is the total number of optimization object parameters.
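The claim 5 loss, L = Σ_j [w_j (t_j − o_j)² + λ_j max(0, c_j − o_j)²] summed over the M optimization object parameters, can be transcribed directly as a sketch. Note that with Δ_cj = c_j − o_j as literally defined, the penalty is nonzero only when o_j < c_j, which suits a lower-bound constraint such as minimum accuracy; for metrics where smaller is better (size, energy, latency) the sign of the difference would presumably have to be flipped, which the text does not state. `quantization_loss` is a hypothetical name.

```python
from typing import Sequence


def quantization_loss(expected: Sequence[float], constraint: Sequence[float],
                      actual: Sequence[float], weights: Sequence[float],
                      penalties: Sequence[float]) -> float:
    """L = sum_j w_j*(t_j - o_j)^2 + lambda_j*max(0, c_j - o_j)^2."""
    n = len(actual)
    if not all(len(seq) == n for seq in (expected, constraint, weights, penalties)):
        raise ValueError("all inputs need one entry per optimization object parameter")
    total = 0.0
    for t, c, o, w, lam in zip(expected, constraint, actual, weights, penalties):
        delta_t = t - o                          # expected minus actual
        delta_c = c - o                          # constraint minus actual
        total += w * delta_t ** 2 + lam * max(0.0, delta_c) ** 2
    return total


# Parameter 1: 2*0.5^2 + 4*0.25^2 = 0.75; parameter 2: 1*(-1)^2 + 0 = 1.0
print(quantization_loss([1.0, 2.0], [0.75, 1.0], [0.5, 3.0], [2.0, 1.0], [4.0, 100.0]))  # → 1.75
```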

According to claim 1,
The quantization method for a deep learning model, wherein updating the quantization parameter based on the loss function comprises:
determining and recording a new quantization parameter set for the second model based on a function value of the loss function and a target algorithm, the target algorithm including a Bayesian optimization algorithm; and
replacing the current quantization parameter of the second model with the new quantization parameter set.

According to claim 6,
The quantization method for a deep learning model, wherein obtaining the optimal quantization parameter comprises:
when the preset condition is satisfied, screening the plurality of recorded quantization parameter sets and using, as the optimal quantization parameter, the set that minimizes the function value of the loss function,
wherein the preset condition is that the number of repetitions of the steps reaches a preset number or the repetition time reaches a preset duration.
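Claim 7's preset condition (a repetition count or an elapsed-time limit) and the final screening of recorded parameter sets might be expressed as below. This is a sketch, not the patented code: `StopCondition` and `select_optimal` are hypothetical names, and the injectable clock is a design choice that lets the time limit be tested deterministically.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class StopCondition:
    """Stop after max_iters repetitions or max_seconds elapsed, whichever comes first."""

    def __init__(self, max_iters: Optional[int] = None, max_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_iters is None and max_seconds is None:
            raise ValueError("set max_iters, max_seconds, or both")
        self.max_iters = max_iters
        self.max_seconds = max_seconds
        self.clock = clock
        self.start = clock()  # reference point for the time limit

    def satisfied(self, iterations_done: int) -> bool:
        if self.max_iters is not None and iterations_done >= self.max_iters:
            return True
        if self.max_seconds is not None and self.clock() - self.start >= self.max_seconds:
            return True
        return False


def select_optimal(history: List[Tuple[float, Dict[str, int]]]) -> Dict[str, int]:
    """Screen recorded (loss, parameter set) pairs and keep the lowest-loss set."""
    if not history:
        raise ValueError("no quantization parameter sets were recorded")
    return min(history, key=lambda item: item[0])[1]


counted = StopCondition(max_iters=3)
print([counted.satisfied(i) for i in range(4)])  # → [False, False, False, True]
print(select_optimal([(2.0, {"fc": 8}), (0.5, {"fc": 4}), (1.0, {"fc": 16})]))  # → {'fc': 4}
```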

According to claim 1,
The quantization method for a deep learning model, wherein the accuracy (precision) type corresponding to the quantization parameter includes at least one of INT4, INT8, and INT16.
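For reference, the sketch below lists the integer ranges implied by the INT4, INT8, and INT16 precision types, assuming signed two's-complement integers (the claims do not say whether signed or unsigned types are used), together with the nominal weight-storage reduction relative to 32-bit floating point, ignoring scales and other metadata.

```python
from typing import Dict, Tuple

PRECISIONS: Dict[str, int] = {"INT4": 4, "INT8": 8, "INT16": 16}


def signed_range(bits: int) -> Tuple[int, int]:
    """Representable range of a two's-complement signed integer with `bits` bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def compression_vs_fp32(bits: int) -> float:
    """Weight-storage reduction relative to 32-bit floats (ignores scales/metadata)."""
    return 32 / bits


for name, bits in PRECISIONS.items():
    lo, hi = signed_range(bits)
    print(f"{name}: [{lo}, {hi}], {compression_vs_fp32(bits):.0f}x smaller than FP32")
# → INT4: [-8, 7], 8x smaller than FP32
# → INT8: [-128, 127], 4x smaller than FP32
# → INT16: [-32768, 32767], 2x smaller than FP32
```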

A quantization apparatus for a deep learning model, comprising:
a mixed-accuracy quantization module;
one or more object optimization modules; and
an automatic optimization module,
wherein the mixed-accuracy quantization module is configured to:
quantize a first model based on a quantization parameter to obtain a second model; and
test the second model to obtain actual values of one or more optimization object parameters;
the one or more object optimization modules are configured to:
calculate a loss function based on the actual values, expected values, and constraint values of the one or more optimization object parameters; and
the automatic optimization module is configured to:
update the quantization parameter based on the loss function and use the second model as the first model; and
obtain an optimal quantization parameter when the update result of the quantization parameter satisfies a preset condition, and use the first model quantized based on the optimal quantization parameter as a final quantization model.

According to claim 9,
The quantization apparatus for a deep learning model, wherein the mixed-accuracy quantization module is configured to:
execute quantization annotation on each operator to be quantized in the first model to obtain a simulated quantization model;
execute a quantization configuration on the simulated quantization model based on the quantization parameter;
calculate quantization coefficients based on the simulated quantization model after the quantization configuration; and
obtain the second model by executing model rewriting on the simulated quantization model based on the quantization coefficients.

According to claim 9,
The quantization apparatus for a deep learning model, wherein the optimization object parameters include at least one of the accuracy of the quantization model, the size of the quantization model, energy consumption, and inference latency.

According to claim 9,
The quantization apparatus for a deep learning model, wherein the one or more object optimization modules are configured to calculate the loss function based on the difference between the expected value and the actual value of each optimization object parameter and the difference between the constraint value and the actual value of each optimization object parameter.

According to claim 9,
The quantization apparatus for a deep learning model, wherein the automatic optimization module is configured to:
determine and record a new quantization parameter set for the second model based on a function value of the loss function and a target algorithm, the target algorithm including a Bayesian optimization algorithm; and
replace the current quantization parameter of the second model with the new quantization parameter set.

According to claim 9,
The quantization apparatus for a deep learning model, wherein the accuracy (precision) type corresponding to the quantization parameter includes at least one of INT4, INT8, and INT16.

A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the quantization method for a deep learning model according to any one of claims 1 to 8.

An electronic device, comprising:
at least one processor; and
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to execute the quantization method for a deep learning model according to any one of claims 1 to 8.