KR20240126334A - Temperature Decay Method on Differentiable Architecture Search

Info

Publication number: KR20240126334A
Application number: KR1020230018978A
Authority: KR
Inventors: 강대기; 신지용
Original assignee: 동서대학교 산학협력단
Priority date: 2023-02-13
Filing date: 2023-02-13
Publication date: 2024-08-20

Abstract

The present invention provides a temperature decay method for differentiable architecture search, characterized by applying a temperature to the mixed operation of DARTS (Differentiable Architecture Search) using gradient descent to control exploration and exploitation during architecture search. The present invention alleviates the mismatch problem between architecture parameters and one-hot encoded architectures by adjusting the temperature value, resulting in significant improvements in the accuracy and performance of the DARTS algorithm.

Description

Temperature Decay Method on Differentiable Architecture Search

The present invention relates to a temperature decay method in differentiable architecture search and, more specifically, to a method that alleviates the mismatch between the architecture parameters and the one-hot encoded architecture by adjusting a temperature value, thereby increasing the accuracy and performance of the DARTS algorithm.

Automated machine learning (AutoML) is a field that aims to speed up the development of artificial intelligence models and to minimize the resources required. Within it, neural architecture search (NAS) algorithms, which generate the architecture of a model, are being actively studied. The basic purpose of a NAS algorithm is to design an optimal model architecture while minimizing designer intervention and making model creation more efficient. NAS algorithms can be designed and implemented with several approaches, including reinforcement learning (RL) and evolutionary algorithms.

One line of work among these improves DARTS (Differentiable Architecture Search), which uses gradient descent.

Even before the present invention, several problems of DARTS had been identified and efforts had been made to solve them. As an example of prior patent art, Publication No. 10-2022-0105081 discloses a neural network structure search method comprising: a step of preparing a model that includes a cell containing a plurality of nodes and edges connecting the nodes to each other; a step of initializing the model by applying to each edge a first mixed operation formed from all N1 candidate operations available on that edge; and a first learning step of learning the weights assigned to each candidate operation applied to each edge.

In addition, Publication No. 10-2021-0135799 discloses a neural network structure search device and method.

ProxylessNAS [1], like One-Shot NAS and DARTS, uses a supernet (SuperNet) in which the search model converges in a single training run through backpropagation. Unlike the existing over-parameterized approach [2], however, it creates binarized edges, which greatly reduces the required memory. On the CIFAR-10 dataset it achieved a test error rate of 2.08%, an improvement over 2.83% for ENAS, 2.83% for DARTS, and 2.13% for AmoebaNet-B.

[1] H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct neural architecture search on target task and hardware," In International Conference on Learning Representations, 2019.

[2] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong, "PC-DARTS: Partial channel connections for memory-efficient differentiable architecture search," In Proceedings of The International Conference on Learning Representations (ICLR), vol. abs/1907.05737, 12 Jul 2019.

Fair-DARTS argues that the softmax function encourages unfair, exclusive competition: during the search, one or two operators come to dominate an edge. As a remedy, it replaces the softmax function with the sigmoid function and adds Gaussian noise to keep this unfair advantage from arising. The paper also disputes the basic premise of DARTS that the solution of the continuously encoded architecture is similar to the one-hot encoded architecture, and states that the smaller this discrepancy is, the more consistently the resulting architecture performs. It introduces a Zero-One auxiliary loss, similar to L1 regularization, that pushes the architecture weights toward the extreme values 0 or 1, which alleviates the discrepancy between the continuously encoded architecture and the discrete architecture derived from it [3]. As a result, it reported 75.6% accuracy on the CIFAR-10 dataset, outperforming the P-DARTS [1], PC-DARTS [2], SNAS [4], GDAS [5], and FBNet-C [6] algorithms [3]. This method minimizes the discrepancy, but because the architecture weights are encoded discretely, it amounts to selecting a single operator in an extreme manner. In the present invention, by contrast, the competition is gradually eased as the search progresses, so that competition proceeds naturally while the search environment ultimately comes to resemble the resulting architecture.

[3] X. Chu, T. Zhou, B. Zhang, and J. Li, "Fair DARTS: Eliminating unfair advantages in differentiable architecture search," In Proceedings of The European Conference on Computer Vision (ECCV), 2019.

[4] S. Xie, H. Zheng, C. Liu, and L. Lin, "SNAS: Stochastic neural architecture search," In Proceedings of International Conference on Learning Representations (ICLR), pp. 1761-1770, 2019.

[5] X. Dong and Y. Yang, "Searching for a robust neural architecture in four gpu hours," In Proceedings of The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2019.

[6] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, "FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search," In Proceedings of The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2019.

DARTS with Ensemble Gumbel-Softmax uses the Gumbel distribution to sample discrete variables from the architecture probability distribution. The discontinuous argmax is then replaced with a softmax, which yields Gumbel-Softmax, and the study ensembles these samplers to improve sampling performance. A temperature appears in this method as well, but only as a single hyperparameter: the work focuses on sampling binary codes and does not adjust the temperature during the search to regulate competition among architecture parameters. On the CIFAR-10 dataset, the method reduced the search cost from the 4 GPU days required by DARTS to 1.5 GPU days while achieving performance similar to DARTS [7].

[7] J. Chang, X. Zhang, Y. Guo, G. Meng, S. Xiang, and C. Pan, "Differentiable architecture search with ensemble Gumbel-Softmax," arXiv preprint arXiv:1905.01786, 2019.

β-DARTS improves the generalization performance of DARTS by using the domain adaptation introduced in AdaptNAS [8], and also presents a way to regularize the DARTS algorithm properly. It argues that, for the mixed edges of DARTS to be properly regularized, the mapping function that modulates α inside the softmax should not be affected by the amplitude of the architecture parameters α, yet should be able to reflect and adjust that amplitude. The authors define the softmax output of α as β and propose a new regularization method, β-Decay regularization, which keeps the variance of these values small and the values themselves close to the mean, producing an effect similar to weight decay. As a result, the method achieved state-of-the-art (SOTA) results on the CIFAR-100 dataset of NAS-Bench-201 with a validation accuracy of 73.49% and a test accuracy of 73.51%. On the CIFAR-10 dataset it reached 97.47% accuracy, better than the model using the second-order approximation of DARTS, at a speed similar to the model using the first-order approximation [9].

[8] Y. Li, Z. Yang, Y. Wang, and C. Xu, "Adapting neural architectures between domains," In Advances in Neural Information Processing Systems, vol. 33, pp. 789-798, 2020.

[9] P. Ye, B. Li, Y. Li, T. Chen, J. Fan, and W. Ouyang, "β-DARTS: Beta-decay regularization for differentiable architecture search," In Proceedings of The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), pp. 10864-10873, 2022.

However, the above prior art suffers from a mismatch between the architecture parameters and the encoded architecture, and the accuracy and performance of the DARTS algorithm are unsatisfactory.

Accordingly, the present invention has been devised to solve these problems, and aims to provide a temperature decay method in differentiable architecture search that alleviates the mismatch between the architecture parameters and the one-hot encoded architecture by adjusting a temperature value, thereby increasing the accuracy and performance of the DARTS algorithm.

The present invention relates to a temperature decay method in differentiable architecture search. To improve DARTS (Differentiable Architecture Search), which uses gradient descent, it applies a temperature to the mixed operation (Mixed Operation) of DARTS so that exploration and exploitation can be controlled during architecture search.

The present invention therefore has the notable effect of alleviating the mismatch between the architecture parameters and the one-hot encoded architecture by adjusting the temperature value, and of increasing the accuracy and performance of the DARTS algorithm.

Figure 1. Example comparison between the architecture weights α before the softmax function and the architecture weights β after the softmax function, for a small temperature value.
Figure 2. Example comparison between the architecture weights α before the softmax function and the architecture weights β after the softmax function, for a large temperature value.

In addition, the method is characterized in that the value of the temperature is adjusted.

In addition, the method is characterized in that the temperature is lowered from 10 to 0.1 during the architecture search.

The present invention is described in detail below with reference to the accompanying drawings.

The present invention focuses on improving DARTS (Differentiable Architecture Search), which uses gradient descent. A temperature is applied to the mixed operation (Mixed Operation) of DARTS so that exploration and exploitation can be controlled during architecture search. In addition, when building the mixed operation, the original DARTS algorithm assumes that the architecture parameters will be learned in a form similar to the one-hot encoded architecture; the resulting mismatch between the architecture parameters and the one-hot encoded architecture is alleviated by adjusting the temperature value. In the experiments, the temperature was lowered from 10 to 0.1 during the architecture search. The resulting accuracy on the CIFAR-10 dataset was 97.37%, which is 0.13 percentage points higher than the original DARTS algorithm.

Temperature-Decay Differentiable Architecture Search (TD-DARTS)

Temperature annealing in the mixed operation, and the effect of the temperature, are described as follows.

DARTS uses two softmax functions: one for the final classification, and one for building the mixed operation (mixed edge) shown in Equation 1.

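$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)}\right)} \, o(x)$$

Equation 1. The mixed operation, where $\mathcal{O}$ is the set of candidate operators on edge $(i,j)$ and $\alpha_o^{(i,j)}$ is the architecture parameter of operator $o$.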

Of the two, the softmax used to build the mixed edge is the function that can directly affect the choice of architecture itself. The mixed operation combines the entire pool of candidate operators into a single edge. Equation 2 applies a temperature T to the softmax function of the mixed operation in Equation 1.

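$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(\alpha_o^{(i,j)} / T\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)} / T\right)} \, o(x)$$

Equation 2. The mixed operation with temperature decay applied.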

Every architecture parameter α is rescaled by the temperature. Equation 3 denotes the softmax part of the mixed operation by β. The architecture parameters α are real-valued. When the α vector of an edge is fed into the softmax function, the output is a vector whose entries are all greater than 0 and sum to 1. Because the present description focuses on how this part changes, the output vector is given its own name, β.

$$\beta_o^{(i,j)} = \frac{\exp\left(\alpha_o^{(i,j)} / T\right)}{\sum_{o' \in \mathcal{O}} \exp\left(\alpha_{o'}^{(i,j)} / T\right)}$$

Equation 3. The architecture parameters before (α) and after (β) passing through the softmax function with temperature decay applied.

As a result, the softmax part of the mixed operation can be written as β, as in Equation 4. That is, β represents the contribution of each operator on the edge, and it ultimately serves as the measure for selecting which operator is optimal.

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \beta_o^{(i,j)} \, o(x)$$

Equation 4. The mixed operation with each operator multiplied by its contribution β.
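The temperature softmax and the β-weighted mixed operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patent's implementation: the function names `temperature_softmax` and `mixed_operation`, the scalar stand-in operators, and the α values are all assumptions chosen for the example.

```python
import math

def temperature_softmax(alpha, temperature):
    """Return beta = softmax(alpha / temperature) as a list of floats."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [a / temperature for a in alpha]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_operation(x, ops, alpha, temperature):
    """Beta-weighted sum of all candidate operators on one edge."""
    beta = temperature_softmax(alpha, temperature)
    return sum(b * op(x) for b, op in zip(beta, ops))

# Hypothetical scalar stand-ins for candidate operators (identity, scaling, zero).
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: 0.0]
alpha = [0.3, 0.1, -0.2]  # illustrative architecture parameters

beta = temperature_softmax(alpha, temperature=1.0)  # T = 1 is the plain softmax
y = mixed_operation(1.5, ops, alpha, temperature=1.0)
```

In an actual DARTS supernet the operators would be tensor-valued layers and α would be a learnable parameter; only the weighting logic is shown here.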

By adjusting the temperature in Equation 3, the spread of the resulting probability distribution can be narrowed or widened when the α values are transformed into the probability distribution β. When all input α values are divided by the same temperature value T before entering the softmax, the differences among the β values grow as T approaches 0, as shown in Figure 1, and shrink as T increases, as shown in Figure 2.
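A small numerical check makes this effect concrete. The sketch below, with illustrative α values that are not taken from the source, computes β at temperatures 10, 1, and 0.1 and records the gap between the largest and smallest contribution.

```python
import math

def temperature_softmax(alpha, temperature):
    """Return softmax(alpha / temperature) as a list of floats."""
    scaled = [a / temperature for a in alpha]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

alpha = [0.5, 0.2, -0.1, -0.4]  # illustrative values, not from the source

spread = {}
for temperature in (10.0, 1.0, 0.1):
    beta = temperature_softmax(alpha, temperature)
    spread[temperature] = max(beta) - min(beta)  # gap between top and bottom operator
    print(f"T={temperature}: beta={[round(b, 3) for b in beta]}")
```

The recorded spread is smallest at the high temperature, where β is close to uniform, and largest at the low temperature, where one operator takes almost all of the weight.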

Depending on the value chosen for T, the contribution of each operator becomes larger or smaller than it would be under the unscaled β. In other words, the contribution of an operator still tracks its weight, but its magnitude can be set differently. There are two reasons for doing this. First, DARTS selects operators with a gradient-based method through backpropagation. A softmax function ties together all candidate operators in each layer, and the weight that emerges from the softmax determines how much each operator contributes to the output. Every weight takes a value between 0 and 1, and every operator is assigned such a value. Because updates to α are backpropagated from the output through β and the softmax to α, all operators end up competing within the range 0 to 1. Applying T makes it possible to regulate this competition.

When T is large, the gap between small and large α values shrinks as they pass through the softmax, so competition intensifies; when T is small, the gaps between β values widen and operators with high α values dominate the edge. From the NAS point of view, where the problem is to find the optimal operator, the operator with the highest β is ultimately selected. With a high T, an operator that was dominant early in the search is therefore more likely to end up unselected, and with a low T an early-dominant operator is more likely to be selected at the end. Put differently, a high T admits more new information relative to existing information, and a low T relies more on existing information: a high T means more exploration and less exploitation, and a low T means less exploration and more exploitation.
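The backpropagation path from β to α can be made explicit with the Jacobian of the temperature softmax. This is a standard calculus result added here as an aid, not a formula taken from the source; the indices $i$, $j$, $k$ over operators on one edge and the Kronecker delta $\delta_{ij}$ are notation introduced for this note.

```latex
\beta_i = \frac{\exp(\alpha_i / T)}{\sum_k \exp(\alpha_k / T)}
\quad\Longrightarrow\quad
\frac{\partial \beta_i}{\partial \alpha_j}
  = \frac{1}{T}\,\beta_i\,\bigl(\delta_{ij} - \beta_j\bigr),
\qquad
\sum_i \frac{\partial \beta_i}{\partial \alpha_j} = 0 .
```

Because each column of the Jacobian sums to zero, any contribution gained by one operator is taken from the others, which is the competition described above, and the temperature enters both through the factor $1/T$ and through the β values themselves.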

T has a second significance. After the search ends, the DARTS algorithm must make a discrete choice, that is, select one operator from among operators one-hot encoded as 0 and 1. This amounts to converting the β values to 0 or 1, whereas during the actual search β does not consist of 0s and 1s and every operator contributes to the output to some degree. The resulting discrepancy can be alleviated by lowering T. The lower T is, the closer the β values come to 0 and 1. Searching in this regime means that updates are made under conditions similar to training the actual model, so the model selected by the search is more likely to deliver the expected performance when it is actually trained.
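One way to quantify this discrepancy is the L1 distance between β and the one-hot vector at its argmax. The measure, the name `discretization_gap`, and the α values below are assumptions made for illustration; the source does not define a specific metric.

```python
import math

def temperature_softmax(alpha, temperature):
    """Return softmax(alpha / temperature) as a list of floats."""
    scaled = [a / temperature for a in alpha]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def discretization_gap(beta):
    """L1 distance between beta and the one-hot vector at argmax(beta)."""
    winner = max(range(len(beta)), key=lambda i: beta[i])
    return sum(abs(b - (1.0 if i == winner else 0.0)) for i, b in enumerate(beta))

alpha = [0.5, 0.2, -0.1, -0.4]  # illustrative values, not from the source

gaps = {
    temperature: discretization_gap(temperature_softmax(alpha, temperature))
    for temperature in (10.0, 1.0, 0.1)
}
```

For these values the gap shrinks monotonically as the temperature is lowered, which matches the claim that a low T brings the search-time weights closer to the final one-hot selection.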

As for the temperature decay strategy: the higher the temperature T, the more intense the competition between operators, and the lower T, the wider the gap between operators. The early phase of the search is therefore run with a high T, which leaves open the possibility that non-dominant operators become dominant. In the late phase of the search, conversely, T is lowered to settle the competition and let the operator with the highest α value dominate, so that the search proceeds on something close to the actual resulting model. Various methods can be used for the temperature decay strategy; the method used in the present invention is shown in Table 1 below.

Table 1. Algorithm 2. TD algorithm: how the temperature applied at each epoch is computed and applied.

Algorithm 2 shows how temperature decay is applied in the present invention. The starting temperature is denoted T_initial, the final temperature T_final, each epoch t, the amount by which the temperature decays at every epoch Interval, and the temperature applied at a given epoch T; the total number of epochs is also an input. The initial and final temperatures are specified as hyperparameters. When the temperature is 1, the method is identical to the original DARTS algorithm, so the initial temperature is set to a number greater than 1 and the final temperature to a number less than 1. When choosing them, it helps to recall that the differences among the β values in Equation 4 shrink as the temperature grows and widen as it approaches 0. Interval is set so that the starting and final temperatures are both fixed. Because the temperature decays by the same amount at every epoch, the decay is linear.
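The linear decay described here can be sketched as follows. The source states only that Interval is chosen so that both endpoints are fixed; dividing by `num_epochs - 1`, so that the first epoch uses the starting temperature and the last epoch uses the final temperature, is an assumption, as are the function name `linear_temperature_schedule` and the choice of 50 epochs.

```python
def linear_temperature_schedule(t_initial, t_final, num_epochs):
    """Return one temperature per epoch, decreasing by a constant interval."""
    if num_epochs < 2:
        raise ValueError("num_epochs must be at least 2")
    # Assumed definition: equal steps that land exactly on t_final at the last epoch.
    interval = (t_initial - t_final) / (num_epochs - 1)
    return [t_initial - interval * epoch for epoch in range(num_epochs)]

# Temperatures 10 and 0.1 come from the description; 50 epochs is illustrative.
schedule = linear_temperature_schedule(t_initial=10.0, t_final=0.1, num_epochs=50)
```

During the search, the temperature for epoch `t` would be read from `schedule[t]` and passed to the softmax of every mixed operation before the architecture and weight updates of that epoch.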

Scheduling methods other than linear decay can also be tried. As with weight decay (Weight Decay) methods, options include lambda, exponential decay, cosine annealing, and cyclic decay.
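Two of the alternatives named above, exponential decay and cosine annealing, might look like the sketch below. The source gives no formulas for them, so both definitions are assumptions: they are the common textbook forms, parameterized so that the first epoch uses the starting temperature and the last epoch uses the final temperature.

```python
import math

def exponential_schedule(t_initial, t_final, num_epochs):
    """Multiply the temperature by a constant ratio at every epoch."""
    if num_epochs < 2:
        raise ValueError("num_epochs must be at least 2")
    ratio = (t_final / t_initial) ** (1.0 / (num_epochs - 1))
    return [t_initial * ratio ** epoch for epoch in range(num_epochs)]

def cosine_schedule(t_initial, t_final, num_epochs):
    """Follow half a cosine wave from t_initial down to t_final."""
    if num_epochs < 2:
        raise ValueError("num_epochs must be at least 2")
    return [
        t_final + 0.5 * (t_initial - t_final)
        * (1.0 + math.cos(math.pi * epoch / (num_epochs - 1)))
        for epoch in range(num_epochs)
    ]

# Temperatures 10 and 0.1 come from the description; 50 epochs is illustrative.
exp_temps = exponential_schedule(10.0, 0.1, 50)
cos_temps = cosine_schedule(10.0, 0.1, 50)
```

Compared with linear decay, the exponential form spends more epochs at low temperatures, while the cosine form changes slowly near both ends and fastest in the middle of the search.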

Claims

A temperature decay method in differentiable architecture search, characterized in that, to improve DARTS (Differentiable Architecture Search) using gradient descent, a temperature is applied to the mixed operation (Mixed Operation) of DARTS so that exploration and exploitation can be controlled during architecture search.

The temperature decay method in differentiable architecture search according to claim 1, characterized in that the value of the temperature is adjusted.

The temperature decay method in differentiable architecture search according to claim 2, characterized in that the temperature is lowered from 10 to 0.1 during the architecture search.