
WO2025089499A1 - Data-type recognition weight scaling apparatus and method for floating point quantization - Google Patents

Data-type recognition weight scaling apparatus and method for floating point quantization

Info

Publication number
WO2025089499A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
scaler
quantization
value
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/KR2023/020767
Other languages
French (fr)
Korean (ko)
Inventor
박세인
임지은
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sapeon Korea Inc
Original Assignee
Sapeon Korea Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020230182302A external-priority patent/KR20250060784A/en
Application filed by Sapeon Korea Inc filed Critical Sapeon Korea Inc
Publication of WO2025089499A1 publication Critical patent/WO2025089499A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0495 Quantised networks; Sparse networks; Compressed networks


Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Nonlinear Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Provided are a data-type recognition weight scaling apparatus and method for floating-point quantization. According to an embodiment of the present invention, the weight scaling apparatus comprises: a scaler calculation unit for calculating a weight scaler value for adjusting the scale of a weight by recognizing a data type; and a multiplication unit for multiplying the weight by the calculated weight scaler. The weight scaler is determined by the ratio of the standard deviation of the weights of a corresponding layer to the standard deviation of the weights of the layer following the corresponding layer.

Description

Data-type-aware weight scaling device and method for floating-point quantization

The present invention relates to a data-type-aware weight scaling device and method for floating-point quantization.

Deep learning models often use depthwise-separable convolution blocks (DS Conv Blocks) to reduce the number of learnable parameters (weights) and operations (FLOPs).

A depth-wise separable convolution block is composed of Pointwise Convolution (PW Conv) - Depthwise Convolution (DW Conv) - Pointwise Convolution (PW Conv) in that order, and each convolution layer (conv) is usually followed by a BatchNorm layer (BN) and a non-linear function such as ReLU.

Deep learning models use floating-point-based FP32 and FP16 data types for precise calculations. Recently, GPUs and NPUs have applied quantization to 16 bits or to 8 bits or fewer, which reduces the memory footprint and accelerates computation for efficient inference.

Floating-point-based quantization methods, which have the same memory footprint as the conventional integer-based 8-bit quantization (INT8) but show superior performance, are being widely adopted.

Floating-point types are divided into sign, exponent, and mantissa parts. Because there is currently no IEEE international standard for floating-point data types of fewer than 8 bits or fewer than 16 bits, various formats exist. For example, the FP8 (1-4-3) format consists of a 1-bit sign, a 4-bit exponent, and a 3-bit mantissa.
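
As an illustration of the 1-4-3 split, the short sketch below decodes an 8-bit code into its sign, exponent, and mantissa fields, assuming a conventional normalized encoding with an implicit leading 1; the patent does not spell out subnormal or special-value handling, so the helper name and the exact value mapping are assumptions made for illustration.

```python
# A minimal sketch of splitting an FP8(1-4-3) code into its fields
# (assumes an implicit leading 1 and no special values).
def decode_fp8_143(code: int, bias: int = 7) -> float:
    sign = (code >> 7) & 0x1          # 1 sign bit
    exponent = (code >> 3) & 0xF      # 4 exponent bits
    mantissa = code & 0x7             # 3 mantissa bits
    return (-1.0) ** sign * (1 + mantissa / 8) * 2.0 ** (exponent - bias)

print(decode_fp8_143(0b0_1000_011))   # +1.375 * 2^(8-7) = 2.75
```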

Typically, quantization is applied to both weights and activations. The representable range [v_min, V_Max] is determined by the default bias as in Equation 1:

[Equation 1]
v_min = (1 + 2^(-bit_mantissa)) · 2^(-bias_default)
V_Max = (2 - 2^(-bit_mantissa)) · 2^(2^bit_exponent - 1 - bias_default)

The bias_default (hereinafter referred to as the "default bias") can be set differently by each user. The FP8 format also has a very limited representable range; to address this, the representable range of the FP8 format can be adjusted by adding a bias_extra term (hereinafter referred to as the "additional bias"), as in Equation 2. In this patent, the default bias is set to bias_default = 2^(bit_exponent - 1) - 1 (i.e., 7 for FP8(1-4-3)).

[Equation 2]
v_min = (1 + 2^(-bit_mantissa)) · 2^(-(bias_default + bias_extra))
V_Max = (2 - 2^(-bit_mantissa)) · 2^(2^bit_exponent - 1 - (bias_default + bias_extra))

The representable range can be adjusted through the additional bias (bias_extra), which can reduce the quantization error. For example, in FP8(1-4-3) with a default bias of 7, the fixed quantization range of the positive region is 0.0087890625 to 480 when bias_extra = 0, and 0.00054931640625 to 30 when bias_extra = 4.
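
As a quick sanity check, the sketch below computes the positive representable range of an FP8(1-4-3)-style format from its field widths and bias. The closed-form range expressions (Equations 1 and 2 above) are inferred from the two ranges quoted in this paragraph, so treat the helper as an illustration rather than the patent's exact definition.

```python
# Positive representable range of an FP8(1-4-3)-style format for a given
# additional bias (illustrative helper; names are not from the patent).
def fp8_positive_range(bit_exponent=4, bit_mantissa=3, bias_extra=0):
    bias = (2 ** (bit_exponent - 1) - 1) + bias_extra      # default bias 7 + extra
    v_min = (1 + 2 ** -bit_mantissa) * 2.0 ** -bias
    v_max = (2 - 2 ** -bit_mantissa) * 2.0 ** (2 ** bit_exponent - 1 - bias)
    return v_min, v_max

print(fp8_positive_range(bias_extra=0))   # (0.0087890625, 480.0)
print(fp8_positive_range(bias_extra=4))   # (0.00054931640625, 30.0)
```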

In general, within a depth-wise separable convolution block, the weight distribution of the depth-wise convolution differs far more from channel to channel than that of the point-wise convolution, because the weight variance of some channels is very large.

In INT8 quantization, this problem can be alleviated by adjusting each channel to have a different quantization range (per-channel granularity).

FP8 quantization can likewise have a different quantization range for each channel, like integer quantization. However, because of the characteristics of FP8 quantization, the representable range is limited to ranges defined in advance by the formula through the user-adjustable additional bias, and FP8 is a non-uniform quantization whose quantization error differs from interval to interval, so it is not well suited to data with large variance.

The purpose of the present invention is to minimize the performance degradation that occurs when applying floating-point quantization of 8 bits or less or 16 bits or less to a deep learning model including depth-wise convolution.

The present invention provides a data-type-aware weight scaling device and method for floating-point quantization that can minimize the accuracy loss due to floating-point quantization.

A weight scaling device for floating-point quantization according to one embodiment of the present invention comprises a scaler calculation unit that recognizes a data type and calculates a weight scaler value that adjusts the scale of a weight, and a multiplication unit that multiplies the weight by the calculated weight scaler.

In one embodiment, the multiplication unit includes a first multiplier that multiplies a weight of the depth-wise convolution by the inverse of the weight scaler, a second multiplier that multiplies a bias applied to the depth-wise convolution by the inverse of the weight scaler, and a third multiplier that multiplies a weight of the point-wise convolution by the weight scaler.

In one embodiment, the scaler calculation unit determines the ratio of the standard deviation of the weights of the corresponding layer to the standard deviation of the weights of the next layer of the corresponding layer as the weight scaler value. If this scaler value is less than 1, the scaler value is set to 1.

In one embodiment, the weight scaling device can additionally reduce the quantization error by applying an additional bias to the bias value used for the point-wise convolution, choosing a value at which the mean square error when quantization is applied can be minimized.

A weight scaling method for floating-point quantization according to one embodiment of the present invention comprises a weight scaler calculation step of recognizing a data type and calculating a weight scaler value that adjusts the scale of a weight, and a scaling step of multiplying the weight by the calculated weight scaler.

In one embodiment, the scaling step may multiply the weights of the depth-wise convolution by the inverse of the weight scaler, multiply the bias applied to the depth-wise convolution by the inverse of the weight scaler, and multiply the weights of the point-wise convolution by the weight scaler.

In one embodiment, the scaler calculation step determines the ratio of the standard deviation of the weights of the corresponding layer to the standard deviation of the weights of the next layer of the corresponding layer as the weight scaler value. If this scaler value is less than 1, the scaler value is set to 1.

In one embodiment, the weight scaling method for floating-point quantization may further include a step of applying an additional bias to the bias value used for the point-wise convolution. The additional bias is selected as a value at which the mean square error when applying quantization can be minimized.

According to the present invention, the variance of the point-wise convolution weights, which have low sensitivity to quantization, is adjusted slightly higher, and the variance of the depth-wise convolution weights, which have high sensitivity to quantization, is lowered, thereby minimizing the overall accuracy loss due to floating-point quantization of 16 bits or less or 8 bits or less.

Figure 1 is a diagram comparing the effect of the quantization interval for floating-point quantization of a low-variance distribution and of a high-variance distribution.

Figure 2 is a diagram showing exemplary data distributions according to variance.

Figure 3 shows exemplary floating-point quantization error values according to variance.

Figure 4 is a block diagram showing a conventional configuration for applying a weight quantizer to successive depth-wise convolutions and point-wise convolutions.

Figure 5 is a block diagram showing a configuration for applying a weight quantizer to successive depth-wise convolutions and point-wise convolutions by scaling weights according to one embodiment of the present invention.

Figure 6 is a table comparing the quantization error reduction effect when the conventional method and the method of the present invention are applied to MobileNet V1.

Hereinafter, some embodiments of the present disclosure will be described in detail using exemplary drawings. When adding reference numerals to the components of each drawing, it should be noted that the same numerals are used for identical components as much as possible even if they are shown in different drawings. In addition, when describing the present disclosure, if it is determined that a specific description of a related known configuration or function may obscure the gist of the present disclosure, the detailed description thereof is omitted.

In describing components of embodiments according to the present disclosure, symbols such as first, second, i), ii), a), b), etc. may be used. These symbols are only for distinguishing the components from other components, and the nature, order, or sequence of the components is not limited by the symbols. When a part in the specification is said to 'include' or 'provide' a component, this does not mean that other components are excluded, but rather that other components can be further included, unless explicitly stated otherwise. In addition, terms such as 'part', 'module', etc. described in the specification mean a unit that processes at least one function or operation, and this can be implemented by hardware, software, or a combination of hardware and software.

The following description of the invention, together with the accompanying drawings, is intended to explain exemplary embodiments of the invention and is not intended to represent the only embodiments in which the invention may be practiced.

In the following description, FP8(1-4-3) (hereinafter referred to as FP8) quantization is used as an example for convenience of explanation, but the present invention is not limited thereto and can be applied, for example, to quantization of 16 bits or less.

Unlike integer quantization, which has the same quantization interval throughout the quantization range, FP8 quantization has both ranges with the same quantization interval and ranges with different quantization intervals, due to the characteristics of floating point, as illustrated in Fig. 1. Because of this characteristic, the quantization error tends to increase as the absolute value of the data increases. That is, in the graph of Fig. 1, the quantization interval of the part with large data values (the part far from 0 on the x-axis, indicated by the dotted box) is wider than the quantization interval of the part with small data values (the part close to 0 on the x-axis), so the quantization error also increases. On the other hand, in a low-variance distribution such as Fig. 1(a), most of the data is concentrated in the region with dense intervals, so the quantization error is smaller than in a high-variance distribution such as Fig. 1(b).

Figure 2 shows the data distribution for various variance values, and Figure 3 shows the corresponding quantization error values: (a) variance = 0.5, (b) variance = 1, (c) variance = 2, (d) variance = 4. As can be seen in Figure 3, the smaller the variance, the smaller the quantization error, and the larger the variance, the larger the quantization error.
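
To illustrate the trend behind Figures 2 and 3, the sketch below applies a rough FP8(1-4-3)-style rounding (3-bit mantissa, clipped to the fixed range) to normally distributed data of increasing variance. The stand-in quantizer and the exact MSE values are illustrative assumptions, but the direction of the effect matches the text: larger variance gives a larger quantization error.

```python
import numpy as np

# Rough FP8(1-4-3)-style rounding used only for illustration: round the
# mantissa to 3 bits and clip to the fixed range (not the patent's quantizer).
def fp8_round(x, mantissa_bits=3, v_max=480.0):
    m, e = np.frexp(np.clip(x, -v_max, v_max))       # x = m * 2**e, 0.5 <= |m| < 1
    m = np.round(m * 2 ** (mantissa_bits + 1)) / 2 ** (mantissa_bits + 1)
    return np.ldexp(m, e)

rng = np.random.default_rng(0)
for var in (0.5, 1.0, 2.0, 4.0):                      # the variances shown in Figure 2
    x = rng.normal(0.0, np.sqrt(var), 50_000)
    err = np.mean((x - fp8_round(x)) ** 2)
    print(f"variance {var}: quantization MSE {err:.2e}")   # MSE grows with variance
```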

For the FP8(1-4-3) data type, the representable range [v_min, V_Max] is determined by Equation 1 according to the user-predefined default bias (bias_default). Hereinafter, v_min is called the "minimum representable value" and V_Max the "maximum representable value". Depending on the user's implementation, however, an additional bias (bias_extra) can be added to adjust the representable range within the candidates determined by Equation 2. In this case, the optimal representable range among the candidates can be selected according to the distribution range of the input data, thereby further reducing the quantization error. Here, the quantization error includes the rounding error and the truncation error. However, under special conditions there is a limit to how much this method can reduce the quantization error.

When FP8 quantization is applied to a deep learning model that includes depth-wise convolution, the performance degradation is much larger than for a deep learning model that does not include depth-wise convolution. In general, the weight distribution of the depth-wise convolution has a larger variance than that of the subsequent point-wise convolution. This characteristic is reported to cause a large quantization error when FP8 quantization is applied to the weights of the depth-wise convolution, which in turn causes a large performance (e.g., accuracy) degradation of the deep learning model.

If the additional bias (bias_extra) is increased to solve this problem, the minimum representable value (v_min) becomes smaller, so values with small absolute values near 0 can be represented better; on the other hand, the maximum representable value (V_Max) also becomes smaller, so all weights with large absolute values are truncated to the maximum representable value, resulting in a large truncation error.

Conversely, if the additional bias is made small, the maximum representable value (V_Max) becomes larger, but the rounding error increases accordingly. In addition, the absolute value of the minimum representable value (v_min) also becomes larger, so data near 0 are truncated to the minimum representable value (v_min), which increases the truncation error.

Ultimately, when applying FP8 quantization to data with large variance, adjusting the additional bias is not very effective.

In the present invention, an appropriate weight scaler is applied to the weights of the depth-wise convolution to make the weight variance smaller than before. This reduces the FP8 quantization error, preserving the performance of the deep learning model to the maximum extent even after FP8 quantization is applied. The inverse of the applied weight scaler is then applied to the weights of the subsequent point-wise convolution within the depth-wise separable convolution block, so that mathematically identical outputs are produced.

In the following description, we assume a pair of a depth-wise convolution and a point-wise convolution in a simple depth-wise separable convolution block. We also assume that the FP8 quantization of the weights does not allow additional-bias adjustment on a per-channel basis. Basically, FP8 quantization is applied only on a per-tensor basis, but the present invention is applicable even if the quantization varies per channel.

The conventional FP8 quantization method applies an FP8 quantizer (40, 50) to each of the successive depth-wise convolution (10) and point-wise convolution (30), as illustrated in FIG. 4, to quantize the weights into the FP8 data type. The example of FIG. 4 shows a case where the i-th weight (W_i^l) (60) of the l-th layer and the j-th weight (W_j^(l+1)) (80) of the (l+1)-th layer are quantized into the FP8 data type. b^l denotes the bias (70) of the l-th layer, and b^(l+1) denotes the bias (90) of the (l+1)-th layer.

In the present invention, a scaler is applied to the weights and biases as illustrated in FIG. 5. To this end, the weight scaling device (160) of the present invention comprises a scaler calculation unit (161) that recognizes the data type and calculates a weight scaler S that adjusts the scale of the weights, and multipliers (162, 163, 164) that multiply the weights by the calculated weight scaler. The first multiplier (162) multiplies the weights of the depth-wise convolution (110) by the reciprocal of the weight scaler (1/S), the second multiplier (163) multiplies the bias applied to the depth-wise convolution (110) by the reciprocal of the weight scaler (1/S), and the third multiplier (164) multiplies the weights of the point-wise convolution (130) by the weight scaler S.

In this way, an appropriate weight scaler is applied to the weights of the depth-wise convolution to directly make their variance smaller than that of the original weights. This configuration reduces the FP8 quantization error, so the performance of the deep learning model can be preserved to the maximum extent even after FP8 quantization is applied.
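
The toy numerical check below illustrates this transformation on a single depth-wise/point-wise pair. It is a sketch under simplifying assumptions (one output position, NumPy instead of a real framework, the standard-deviation-ratio scaler described below with k_i = 1), not the patent's implementation. Because ReLU satisfies f(ax) = af(x) for a > 0, the block output is unchanged while the depth-wise weight variance shrinks.

```python
import numpy as np

# Sketch: scale depth-wise weights/bias by 1/S and point-wise weights by S,
# then check that the block output is unchanged (ReLU in between).
rng = np.random.default_rng(0)
C, K = 8, 3                                    # channels, depth-wise kernel size
A = rng.normal(size=(C, K, K))                 # one K x K input patch per channel
W_dw = rng.normal(size=(C, K, K)) * 4.0        # depth-wise weights (large variance)
b_dw = rng.normal(size=(C,))
W_pw = rng.normal(size=(C, C)) * 0.1           # point-wise weights (small variance)
b_pw = rng.normal(size=(C,))
relu = lambda x: np.maximum(x, 0.0)

def ds_block(w_dw, bias_dw, w_pw, bias_pw):
    dw_out = (w_dw * A).sum(axis=(1, 2)) + bias_dw    # depth-wise conv, one output pixel
    return w_pw @ relu(dw_out) + bias_pw              # point-wise conv

# Per-channel scaler: ratio of standard deviations, floored at 1 (k_i = 1).
S = np.maximum(1.0, W_dw.reshape(C, -1).std(axis=1) / W_pw.std())

out_ref = ds_block(W_dw, b_dw, W_pw, b_pw)
out_scaled = ds_block(W_dw / S[:, None, None], b_dw / S, W_pw * S, b_pw)
print(np.allclose(out_ref, out_scaled))               # True
print(W_dw.std(), (W_dw / S[:, None, None]).std())    # depth-wise variance shrinks
```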

We now explain the effect of applying the weight scaler in this way. Each layer of a deep learning model has weights (W) and an input activation (A). f(·) is the nonlinear function that follows the layer.

The output of the l-th layer can be expressed as in Equation 3, and the output of the (l+1)-th layer can be expressed as in Equation 4. If the layer is a convolution layer, its weight is a tensor whose last two dimensions, k_h and k_w, are the height (h) and width (w) of the convolution layer weights, respectively.

[Equation 3]
O^l = f(W^l * A^l + b^l)

[Equation 4]
O^(l+1) = f(W^(l+1) * A^(l+1) + b^(l+1))

Here, * denotes the convolution operation of the layer.

In a typical deep learning model, the l-th layer and the (l+1)-th layer are consecutive, so A^(l+1) = O^l. Therefore, substituting Equation 3 into Equation 4 gives Equation 5.

[Equation 5]
O^(l+1) = f(W^(l+1) * f(W^l * A^l + b^l) + b^(l+1))

If f(·) is a piece-wise linear function (e.g., ReLU or Leaky ReLU), then f(ax) = af(x) for a positive scale factor a, so Equation 5 can be expressed as Equation 6. In Equation 6, W' and b' denote the weight and bias after applying the weight scaler, respectively.

[Equation 6]
O^(l+1) = f((S · W^(l+1)) * f((W^l / S) * A^l + b^l / S) + b^(l+1)) = f(W'^(l+1) * f(W'^l * A^l + b'^l) + b^(l+1)),
where W'^l = W^l / S, b'^l = b^l / S, and W'^(l+1) = S · W^(l+1).

At this time, the variance of the new depth-wise convolution weight (W')^l decreases, as in Equation 7.

[Equation 7]
Var(W'^l_i) = Var(W^l_i / S_i) = Var(W^l_i) / S_i^2

As described above, in FP8 quantization the quantization error tends to become smaller as the data become smaller; therefore, according to the present invention, the weight variance is reduced, so a quantization error relatively smaller than the FP8 quantization error of the conventional depth-wise convolution can be achieved.

We now describe how the weight scaler S, with one component S_i per channel i, is determined in the present invention. If the weight scaler S_i for the i-th channel of the depth-wise convolution weights is chosen too large, there is conversely a risk that the variance of the point-wise convolution weight distribution becomes too large.

Therefore, the i-th weight scaler (S_i^l) of the l-th layer is determined by the ratio of the standard deviation of W_i^l to the standard deviation of W^(l+1). If this scaler value is less than 1, the scaler value is set to 1. The value of the scaler can also be adjusted by the factor k_i.

The method of determining the weight scaler (S_i^l) of the i-th channel can be expressed as in Equation 8.

[Equation 8]
S_i^l = max(1, k_i · std(W_i^l) / std(W^(l+1)))

k_i is a factor that controls the strength of the scaler and can be determined in various ways. For example, as in Equation 9, it can be set as the ratio of the power of two (lv) that is smaller than the absolute value of W_i^l and closest to it, to the power of two (lv) that is smaller than the absolute value of W_i^(l+1) and closest to it. Experimentally, the effect was most likely to be good when k_i = 1.

[Equation 9]
k_i = lv(W_i^l) / lv(W_i^(l+1))
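
Putting Equation 8 together with k_i = 1, a minimal sketch of the scaler-calculation step for tensors laid out like typical convolution weights could look as follows; the (C, 1, kh, kw) depth-wise and (C_out, C, 1, 1) point-wise layouts and the helper name are assumptions made for illustration, not the patent's implementation.

```python
import numpy as np

# Sketch of the scaler-calculation step (Equation 8 with k_i = 1).
def weight_scaler(w_dw: np.ndarray, w_pw: np.ndarray, k: float = 1.0) -> np.ndarray:
    c = w_dw.shape[0]
    std_dw = w_dw.reshape(c, -1).std(axis=1)     # per-channel std of W_i^l
    std_pw = w_pw.std()                           # std of the whole W^(l+1)
    return np.maximum(k * std_dw / std_pw, 1.0)   # floor at 1

rng = np.random.default_rng(0)
w_dw = rng.normal(0.0, 2.0, size=(32, 1, 3, 3))   # depth-wise weights
w_pw = rng.normal(0.0, 0.1, size=(64, 32, 1, 1))  # following point-wise weights
S = weight_scaler(w_dw, w_pw)

# Apply: W_dw / S and b_dw / S per channel; W_pw * S along its input-channel axis.
w_dw_scaled = w_dw / S[:, None, None, None]
w_pw_scaled = w_pw * S[None, :, None, None]
```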

Because this method multiplies the point-wise convolution weights by the inverse of the scaler, it can increase the variance of the point-wise convolution weights. However, point-wise convolution weights are generally reported to be robust to quantization error, since their variance is significantly smaller than that of depth-wise convolution weights and the number of weights is much larger. Therefore, the gain from applying the scaler to the depth-wise convolution weights is greater than the loss from applying the inverse of the weight scaler to the point-wise convolution weights.

However, because the inverse of the weight scaler is multiplied in, the maximum value of the point-wise convolution weights may become larger than the maximum value of FP8 with bias_extra = 4. In one embodiment, taking this into account, an additional bias may be applied to the bias value used for the point-wise convolution, and the optimal value of the additional bias may be found and adjusted as in Equation 10.

[Equation 10]
bias_extra* = argmin over bias_extra of MSE(W^(l+1), Q_FP8(W^(l+1); bias_default + bias_extra))

That is, by selecting an additional bias that minimizes the mean square error when quantization is applied, the loss from applying the inverse of the weight scaler to the point-wise convolution weights can be reduced.
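
The sketch below illustrates one way such a search could look: it sweeps candidate additional-bias values and keeps the one with the smallest quantization MSE on the point-wise weights. The snap-to-grid quantizer, the candidate range 0 to 7, and all helper names are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def fp8_grid(bias, bit_exponent=4, bit_mantissa=3):
    # Positive magnitudes of a rough FP8(1-4-3)-style grid (illustrative only).
    e = np.arange(2 ** bit_exponent)[:, None]
    m = np.arange(2 ** bit_mantissa)[None, :]
    return np.sort(((1 + m / 2 ** bit_mantissa) * 2.0 ** (e - bias)).ravel())

def quantize(x, grid):
    # Clip magnitudes to the grid and snap each value to the nearest grid point.
    mag = np.clip(np.abs(x), grid[0], grid[-1])
    idx = np.searchsorted(grid, mag).clip(1, len(grid) - 1)
    lo, hi = grid[idx - 1], grid[idx]
    return np.sign(x) * np.where(mag - lo < hi - mag, lo, hi)

w_pw = np.random.default_rng(0).normal(0.0, 0.1, size=(64, 32)).ravel()
bias_default = 7
mse = lambda be: np.mean((w_pw - quantize(w_pw, fp8_grid(bias_default + be))) ** 2)
best = min(range(8), key=mse)          # candidate bias_extra values 0..7 (assumed)
print("selected additional bias:", best, "MSE:", mse(best))
```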

In this way, the variance of the point-wise convolution weights, which are less sensitive to quantization, is adjusted slightly higher, and the variance of the depth-wise convolution weights, which are more sensitive to quantization, is lowered, thereby minimizing the accuracy loss due to the overall FP8 quantization. Since the method of the present invention is based on the characteristics of the complex quantization intervals of the FP data type itself, it will be applicable to all floating-point formats of 16 bits or less.

Fig. 6 is a table comparing the quantization errors when the conventional method and the method of the present invention are applied to MobileNet V1. As can be seen in Fig. 6, the FP8 format with the method of the present invention applied (FP8-WS) shows a performance improvement of +0.694 when the additional bias is 0 and +0.752 when the additional bias is 4, compared to the FP8 format without the method of the present invention (FP8).

Each component of the device or method according to the present invention may be implemented as hardware or software, or as a combination of hardware and software. In addition, the function of each component may be implemented as software, and a microprocessor may be implemented to execute the function of the software corresponding to each component.

Various implementations of the systems and techniques described herein can be realized as digital electronic circuits, integrated circuits, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation as one or more computer programs executable on a programmable system. The programmable system includes at least one programmable processor (which may be a special-purpose processor or a general-purpose processor) coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Computer programs (also known as programs, software, software applications, or code) include instructions for the programmable processor and are stored on a "computer-readable recording medium."

A computer-readable recording medium includes any type of recording device that stores data that can be read by a computer system. Such a computer-readable recording medium can be a non-volatile or non-transitory medium, such as a ROM, a CD-ROM, a magnetic tape, a floppy disk, a memory card, a hard disk, a magneto-optical disk, or a storage device, and may further include a transitory medium such as a data transmission medium. In addition, the computer-readable recording medium can be distributed over network-connected computer systems, so that computer-readable code can be stored and executed in a distributed manner.

Although the flowcharts in this specification describe the processes as being executed sequentially, this is merely an illustrative description of the technical idea of one embodiment of the present disclosure. In other words, a person having ordinary skill in the art to which one embodiment of the present disclosure belongs may apply various modifications and variations, such as changing the order described in the flowcharts or executing one or more of the processes in parallel, without departing from the essential characteristics of one embodiment of the present disclosure; therefore, the flowcharts in this specification are not limited to a chronological order.

The above description merely illustrates the technical idea of the present embodiment, and those skilled in the art will appreciate that various modifications and variations may be made without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that idea is not limited by these embodiments. The scope of protection of the present embodiment should be construed according to the following claims, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application claims priority to Korean Patent Application No. 10-2023-0145071, filed in Korea on October 26, 2023, and Korean Patent Application No. 10-2023-0182302, filed in Korea on December 14, 2023, both of which are incorporated herein by reference in their entirety.

Claims (8)

1. A weight scaling apparatus for floating-point quantization, comprising: a scaler calculation unit that recognizes a data type and calculates a weight scaler value for adjusting the scale of weights; and a multiplication unit that multiplies the weights by the calculated weight scaler.

2. The weight scaling apparatus for floating-point quantization of claim 1, wherein the multiplication unit comprises: a first multiplier that multiplies the weights of a depthwise convolution by the reciprocal of the weight scaler; a second multiplier that multiplies the bias applied to the depthwise convolution by the reciprocal of the weight scaler; and a third multiplier that multiplies the weights of a pointwise convolution by the weight scaler.

3. The weight scaling apparatus for floating-point quantization of claim 1 or 2, wherein the scaler calculation unit determines, as the weight scaler value, the ratio between the standard deviation of the weights of a given layer and the standard deviation of the weights of the layer following that layer.

4. The weight scaling apparatus for floating-point quantization of claim 3, wherein the weight scaling apparatus applies an additional bias to the bias value used for the pointwise convolution, the additional bias being selected as a value that minimizes the mean squared error when quantization is applied.

5. A weight scaling method for floating-point quantization, comprising: a weight scaler calculation step of recognizing a data type and calculating a weight scaler value for adjusting the scale of weights; and a scaling step of multiplying the weights by the calculated weight scaler.

6. The weight scaling method for floating-point quantization of claim 5, wherein the scaling step multiplies the weights of a depthwise convolution by the reciprocal of the weight scaler, multiplies the bias applied to the depthwise convolution by the reciprocal of the weight scaler, and multiplies the weights of a pointwise convolution by the weight scaler.

7. The weight scaling method for floating-point quantization of claim 5 or 6, wherein the scaler calculation step determines, as the weight scaler value, the ratio between the standard deviation of the weights of a given layer and the standard deviation of the weights of the layer following that layer.

8. The weight scaling method for floating-point quantization of claim 7, further comprising applying an additional bias to the bias value used for the pointwise convolution, the additional bias being selected as a value that minimizes the mean squared error when quantization is applied.
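For orientation only, and not as part of the claims, the following Python sketch illustrates one way the claimed scaling could be carried out for a depthwise/pointwise convolution pair. The function names, the NumPy representation of the weights, the choice of placing the current layer's standard deviation in the numerator of the ratio, and the grid search over candidate additional-bias values are assumptions made for illustration; the publication itself does not fix these details.

import numpy as np

def compute_weight_scaler(dw_weights, pw_weights):
    # Weight scaler taken as the ratio between the standard deviation of the
    # depthwise (current-layer) weights and that of the pointwise (next-layer)
    # weights; which value serves as the numerator is assumed here.
    return float(np.std(dw_weights) / np.std(pw_weights))

def apply_weight_scaling(dw_weights, dw_bias, pw_weights, scaler):
    # Depthwise weights and bias are multiplied by the reciprocal of the scaler,
    # and pointwise weights are multiplied by the scaler. With a positively
    # homogeneous activation (e.g. ReLU) between the two convolutions and a
    # positive scaler, the composed function is preserved while the dynamic
    # ranges of the two layers are rebalanced before quantization.
    return dw_weights / scaler, dw_bias / scaler, pw_weights * scaler

def select_additional_bias(pw_bias, quantize, candidates):
    # Illustrative selection of the additional bias applied to the pointwise
    # bias: the candidate offset that minimizes the mean squared error between
    # the shifted bias and its quantized counterpart is chosen. The error term
    # and candidate grid are assumptions; `quantize` is assumed to be a callable
    # implementing the floating-point quantizer in use.
    best_delta, best_mse = 0.0, float("inf")
    for delta in candidates:
        shifted = pw_bias + delta
        mse = float(np.mean((quantize(shifted) - shifted) ** 2))
        if mse < best_mse:
            best_delta, best_mse = delta, mse
    return best_delta

A typical use would compute the scaler from the unquantized weights of a depthwise/pointwise pair, rescale the pair with apply_weight_scaling, optionally shift the pointwise bias by the value returned from select_additional_bias, and only then apply the floating-point quantizer.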
PCT/KR2023/020767 2023-10-26 2023-12-15 Data-type recognition weight scaling apparatus and method for floating point quantization Pending WO2025089499A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR20230145071 2023-10-26
KR10-2023-0145071 2023-10-26
KR10-2023-0182302 2023-12-14
KR1020230182302A KR20250060784A (en) 2023-10-26 2023-12-14 Method and apparatus for data-type aware weight scaling for floating-point quantization

Publications (1)

Publication Number Publication Date
WO2025089499A1 (en)

Family

ID=95516189

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/020767 Pending WO2025089499A1 (en) 2023-10-26 2023-12-15 Data-type recognition weight scaling apparatus and method for floating point quantization

Country Status (1)

Country Link
WO (1) WO2025089499A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102566480B1 (en) * 2017-02-10 2023-08-11 삼성전자주식회사 Automatic thresholds for neural network pruning and retraining
CN110659725A (en) * 2019-09-20 2020-01-07 字节跳动有限公司 Compression and acceleration method of neural network model, data processing method and device
KR20210121946A (en) * 2020-03-31 2021-10-08 삼성전자주식회사 Method and apparatus for neural network quantization
KR20230104037A (en) * 2021-12-30 2023-07-07 주식회사 에임퓨처 Apparatus for enabling the conversion and utilization of various formats of neural network models and method thereof
CN115761830A (en) * 2022-09-09 2023-03-07 平安科技(深圳)有限公司 Face recognition model quantitative training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US20250265461A1 (en) Dynamic quantization of neural networks
US10491239B1 (en) Large-scale computations using an adaptive numerical format
CN113095486B (en) Integer tensor network data processing method
US20210004679A1 (en) Asymmetric quantization for compression and for acceleration of inference for neural networks
WO2021235656A1 (en) Electronic apparatus and control method thereof
CN113780523B (en) Image processing method, device, terminal equipment and storage medium
WO2023003432A1 (en) Method and device for determining saturation ratio-based quantization range for quantization of neural network
WO2021073638A1 (en) Method and apparatus for running neural network model, and computer device
US20060136540A1 (en) Enhanced fused multiply-add operation
WO2025089499A1 (en) Data-type recognition weight scaling apparatus and method for floating point quantization
CN111767993A (en) INT8 quantization method, system, device and storage medium for convolutional neural network
JPH06175826A (en) Logarithm arithmetic circuit
WO2023014124A1 (en) Method and apparatus for quantizing neural network parameter
CN113283591B (en) High-efficiency convolution implementation method and device based on Winograd algorithm and approximate multiplier
CN114330655A (en) Normalized quantization method and device, electronic equipment and storage medium
EP3748491B1 (en) Arithmetic processing apparatus and control program
KR0174498B1 (en) Approximation Method and Circuit of Log
KR20250060784A (en) Method and apparatus for data-type aware weight scaling for floating-point quantization
WO2023128024A1 (en) Method and system for quantizing deep-learning network
CN119249050A (en) Device and method for fast calculation of nonlinear activation function based on coefficient lookup table
WO2025089498A1 (en) Low-precision floating-point friendly quantization method and apparatus
KR20240067175A (en) Batch norm parameters training method for neural network and apparatus of thereof
CN112308216B (en) Data block processing method, device and storage medium
WO2023113445A1 (en) Method and apparatus for floating point arithmetic
CN114676829A (en) Neural network processing device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23956932

Country of ref document: EP

Kind code of ref document: A1