
CN115100459A - Image classification algorithm based on full attention network structure search - Google Patents

Image classification algorithm based on full attention network structure search

Info

Publication number
CN115100459A
CN115100459A
Authority
CN
China
Prior art keywords
search
network
self
parameters
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210660061.5A
Other languages
Chinese (zh)
Other versions
CN115100459B (en)
Inventor
周圆
王海洋
霍树伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210660061.5A priority Critical patent/CN115100459B/en
Publication of CN115100459A publication Critical patent/CN115100459A/en
Application granted granted Critical
Publication of CN115100459B publication Critical patent/CN115100459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image classification algorithm based on full attention network structure search. First, a staged search space is designed in which a different self-attention operation is selected at each stage of the network. Then a self-supervised search method is used: the weight parameters and architecture parameters inside the network are updated, and when the self-supervised search stage is completed, the architecture parameters are retained and used as initial values for the supervised search stage. Finally, a supervised search method is used to update the network's internal weight parameters and architecture parameters, and the optimal full attention network is obtained according to the architecture parameters. The search method proposed by the invention can discover a high-performance full attention network structure while ensuring the required search efficiency. Experimental results on an image classification task show that the searched full attention network model outperforms state-of-the-art architectures while greatly reducing the number of parameters.

Description

Image classification algorithm based on full attention network structure search

Technical Field

The invention relates to the field of image classification in computer vision, and more particularly to a full attention network structure search algorithm for efficiently designing full attention network structures for image classification tasks.

Background

In recent years, significant progress has been made in the design of attention network structures. The self-attention operation has become an important building block of neural networks thanks to its ability to capture long-range feature dependencies and its content-based parameter learning mechanism. Neural networks composed entirely of self-attention operations have achieved good results in various computer vision tasks. However, manually designing an excellent full attention neural network structure is a challenging and complex task: it requires extensive prior knowledge and experience, and consumes a great deal of time and resources for experimental verification, which greatly slows the development of attention networks.

Neural architecture search (NAS) provides a way to solve the above problems. NAS automates the design of neural network structures; its goal is, given certain prior knowledge, to automatically search a given search space for a network structure that outperforms manually designed models. NAS methods not only improve model performance but also free human experts from the tedious task of designing network structures. In recent years, NAS research has made many important advances. Current architecture search algorithms fall mainly into three types: methods based on reinforcement learning, methods based on evolutionary algorithms, and gradient-based differentiable architecture search algorithms. These lines of work mainly study how different search strategies can improve both search efficiency and the performance of the resulting networks. Compared with NAS methods based on reinforcement learning and evolutionary algorithms, gradient-based differentiable architecture search converges faster and its optimization objective is more flexible.

Existing NAS methods are not suitable for directly searching full attention networks. First, existing NAS methods usually adopt a cell-based search space, so in the searched network the cell structures of the shallow and deep layers are identical. This does not suit self-attention network search, because self-attention operations behave differently at different stages of the network. Moreover, existing NAS methods usually use a classification task to supervise the structure search; classification requires the model to focus on local regions related to the label information and does not need to consider content correlations between distant pixels. Self-attention models, however, focus on capturing long-range content dependencies between pixels to learn rich image representations. It is therefore inappropriate to search full attention network structures directly with existing NAS methods.

Summary of the Invention

To address the problems in the prior art, the present invention provides an image classification algorithm based on full attention network structure search, solving the inaccuracy that arises when existing NAS methods are used directly to search full attention networks.

The present invention is achieved through the following technical solutions:

An image classification algorithm based on full attention network structure search, comprising the following steps:

Design a staged search space in which a different self-attention operation is selected at each stage of the network;

Search with a self-supervised search method: input images into the network model of the staged search space and update the network's internal weight parameters and architecture parameters; when the self-supervised search stage is completed, retain the architecture parameters and use them as initial values for the supervised search stage;

Search with a supervised search method: input images into the network model of the staged search space, update the network's internal weight parameters and architecture parameters, and obtain the optimal full attention network according to the architecture parameters.

In the network structure of the staged search space, the first layer is a fixed local self-attention operation and the last two layers are an average pooling layer and a classification layer. The remaining middle part consists of five stages; a fixed pooling operation in the second and fourth stages halves the spatial size of the feature map and doubles the number of channels. Each stage has three searchable layers, and for each searchable layer the best-performing operation must be selected from a set of candidate operations. The candidates consist of 7 self-attention operations: one non-local self-attention operation and 6 local self-attention operations with different hyperparameters. The hyperparameters of a local self-attention operation are the spatial extent (3, 5, or 7) and the number of heads (4 or 8). In summary, the staged search space contains 15 searchable layers, each choosing from 7 candidate operations, for a total of 7^15 possible structures.

In the self-supervised search stage, a self-supervised search algorithm based on a context autoregression task is designed. In this task, multiple regions of the input image are randomly masked and the network is trained to predict the content of the missing parts. An encoder-decoder structure is used to extract features from the input image and reconstruct the missing image content, and this task is then used to search the full attention network. The full attention network serves as the feature encoder for the input image. The network contains two kinds of learnable parameters: the weight parameters w of the self-attention operations and the architecture parameters α corresponding to each candidate operation. The image dataset is divided into two independent sets, denoted DatasetA and DatasetB; DatasetA is used to optimize the weight parameters and DatasetB to optimize the architecture parameters. The L1 loss is used as the loss function, defined as:

$$\mathcal{L}_{1} = \frac{1}{M}\sum_{i=1}^{M}\lvert p_i - y_i\rvert$$

where M is the number of pixels, p_i is the input pixel, and y_i is the ground-truth value. A differentiable architecture search method then alternately optimizes the weight parameters w and the architecture parameters α: iterating between a gradient step on DatasetA, descending $\nabla_{w}\,\mathcal{L}_{A}(w,\alpha)$ to update the weight parameters, and a gradient step on DatasetB, descending $\nabla_{\alpha}\,\mathcal{L}_{B}(w,\alpha)$ to update the architecture parameters. When the self-supervised search stage is complete, the architecture parameters are stored and used as initial values for the supervised search stage.
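As an illustration, one alternating update of this bi-level optimization might be sketched in PyTorch as follows. The supernet, decoder, optimizers, batches, and masking routine are assumed components and are not part of the original disclosure:

```python
import torch

def l1_reconstruction_loss(pred, target):
    # L1 = (1/M) * sum_i |p_i - y_i|, averaged over all M pixels
    return (pred - target).abs().mean()

def self_supervised_search_step(supernet, decoder, w_opt, a_opt,
                                batch_a, batch_b, mask_fn):
    # 1) Update the weight parameters w on a DatasetA batch
    masked, target = mask_fn(batch_a)     # randomly mask image regions
    w_opt.zero_grad()
    recon = decoder(supernet(masked))     # encoder-decoder reconstruction
    l1_reconstruction_loss(recon, target).backward()
    w_opt.step()

    # 2) Update the architecture parameters alpha on a DatasetB batch
    masked, target = mask_fn(batch_b)
    a_opt.zero_grad()
    recon = decoder(supernet(masked))
    l1_reconstruction_loss(recon, target).backward()
    a_opt.step()
```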

In the supervised search stage, the differentiable architecture search method is used to search on an image classification dataset. The architecture parameters obtained in the self-supervised stage serve as initial values, and gradient descent alternately optimizes the architecture parameters α and the weight parameters w. The cross-entropy loss is used as the loss function, defined as:

$$\mathcal{L}_{CE} = -\sum_{k=1}^{K} y_k \log p_k$$

where K is the number of classes, p_k is the predicted probability of class k, and y_k is the class label. When the supervised search process ends, the architecture parameters are sorted and, for each searchable layer, the operation corresponding to the largest architecture parameter is selected, yielding the final architecture.
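A minimal sketch of this final selection step, assuming the architecture parameters are stored as a (layers, candidates) tensor; the candidate names are hypothetical labels, not from the original disclosure:

```python
import torch

# Hypothetical labels for the 7 candidates described above: one non-local
# operation plus local variants with extent k in {3, 5, 7}, heads h in {4, 8}.
CANDIDATES = ["nonlocal"] + [f"local_k{k}_h{h}"
                             for k in (3, 5, 7) for h in (4, 8)]

def derive_architecture(alphas: torch.Tensor) -> list:
    # alphas: (num_searchable_layers, num_candidates) architecture parameters.
    # For each layer, keep the candidate with the largest alpha.
    best = alphas.argmax(dim=1)
    return [CANDIDATES[i] for i in best.tolist()]

# Example with random parameters for 15 layers x 7 candidates:
print(derive_architecture(torch.randn(15, 7)))
```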

The beneficial effects of the present invention are: the proposed search method can discover a high-performance full attention network structure while ensuring the required search efficiency. Experimental results on image classification tasks show that the searched full attention network model outperforms state-of-the-art architectures while greatly reducing the number of parameters.

Description of Drawings

Fig. 1: Flow chart of the network structure search;

Fig. 2: Macroscopic network structure of the search space;

Fig. 3: Specific configurations of the candidate operations, where k denotes the spatial extent and h denotes the number of heads in the local self-attention operations;

Fig. 4: The best full attention network structure obtained by the search;

Fig. 5: Performance comparison between the proposed algorithm and existing advanced networks on CIFAR-10 and ImageNet: (a) experimental results on the CIFAR dataset; (b) experimental results on the ImageNet dataset.

Detailed Description

To make the technical solution of the present invention clearer, the invention is further described below with reference to the accompanying drawings.

The present invention proposes a full attention network structure search algorithm to efficiently design full attention network structures for image classification tasks. First, a staged search space is designed in which each stage of the network can select a different self-attention operation. Then, to efficiently find the optimal network structure in this search space, a new search strategy is proposed that jointly uses self-supervised search and supervised search to find an efficient full attention network structure for image classification. The algorithm flow chart is shown in Fig. 1; the specific details of the search algorithm are described below.

1. Staged Search Space

The present invention proposes a staged search space for searching full attention networks. Fig. 2 shows the macroscopic network structure of the search space, including the number of layers and the input dimensions at each stage. In this structure, the first layer is a fixed local self-attention operation, and the last two layers are an average pooling layer and a classification layer. The remaining middle part consists of five stages; a fixed pooling operation in the second and fourth stages halves the spatial size of the feature map and doubles the number of channels. Each stage has three searchable layers, and for each searchable layer the best-performing operation must be selected from the candidate operations. The candidates consist of 7 self-attention operations: one non-local self-attention operation and 6 local self-attention operations with different hyperparameters; their configurations are listed in Fig. 3. The hyperparameters of the local self-attention operations are the spatial extent (3, 5, or 7) and the number of heads (4 or 8). In summary, our search space contains 15 searchable layers, each choosing from 7 candidate operations, giving 7^15 possible structures.
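For illustration, the search space can be written down as plain configuration data. The following self-contained Python sketch uses our own names and layout, which are not part of the original disclosure:

```python
# 7 candidate operations: one non-local self-attention operation plus six
# local self-attention operations with spatial extent k in {3, 5, 7} and
# head count h in {4, 8} (cf. Fig. 3).
CANDIDATES = [{"op": "nonlocal"}] + [
    {"op": "local", "extent": k, "heads": h}
    for k in (3, 5, 7) for h in (4, 8)
]

STAGES = 5                    # five stages of three searchable layers each
LAYERS_PER_STAGE = 3
SEARCHABLE_LAYERS = STAGES * LAYERS_PER_STAGE   # 15 searchable layers
POOLED_STAGES = (2, 4)        # fixed pooling: halve spatial size, double channels

assert len(CANDIDATES) == 7
print(len(CANDIDATES) ** SEARCHABLE_LAYERS)     # 7**15 possible structures
```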

2. Search Strategy Combining Self-Supervised Search and Supervised Search

Step 1: Prepare the datasets.

For the image classification task, the present invention uses the CIFAR-10 and ImageNet classification datasets to test and compare algorithm performance. CIFAR-10 contains images of 10 classes with 6,000 images per class, 60,000 images in total, split into 50,000 training images and 10,000 test images; each image has a spatial resolution of 32×32. The ImageNet 2012 dataset contains 1,000 classes, with 1.28 million training images and 50,000 validation images. ImageNet images are cropped to 224×224.
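For reference, a standard torchvision-based preparation of the two datasets might look like the sketch below; only the image sizes come from the text, and the rest of the pipeline is an assumption:

```python
import torchvision
import torchvision.transforms as T

# CIFAR-10: 60,000 32x32 images in 10 classes (50,000 train / 10,000 test).
cifar_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
cifar_test = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=T.ToTensor())

# ImageNet images are cropped to 224x224; the resize step is an assumption.
imagenet_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])
```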

Step 2: Search with the self-supervised search method.

100 classes are randomly selected from the 1,000 original ImageNet classes to build a training set, and the images are resized to 32×32. The training set is split into two equal subsets: one for optimizing the network weight parameters and the other for optimizing the architecture parameters. During the self-supervised search, images are fed into the network model of the staged search space. The weight parameters are optimized with a stochastic gradient descent (SGD) optimizer, with weight decay 0.0003, momentum 0.9, and an initial learning rate of 0.025. The architecture parameters are optimized with the Adam optimizer, with a learning rate of 3×10^-5 and weight decay 0.001. After 20 epochs, the search ends and the architecture parameters are saved.
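Under the stated hyperparameters, the optimizer setup for this stage might be sketched as follows; supernet.weight_parameters() and supernet.arch_parameters() are assumed accessors for the two parameter groups and are not from the original disclosure:

```python
import torch

# SGD for the weight parameters: lr 0.025, momentum 0.9, weight decay 0.0003.
w_optimizer = torch.optim.SGD(
    supernet.weight_parameters(),   # assumed accessor for the weights w
    lr=0.025, momentum=0.9, weight_decay=3e-4)

# Adam for the architecture parameters: lr 3e-5, weight decay 0.001.
a_optimizer = torch.optim.Adam(
    supernet.arch_parameters(),     # assumed accessor for the alphas
    lr=3e-5, weight_decay=1e-3)

# Run the alternating updates (see the search-step sketch above) for 20 epochs.
```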

Step 3: Search with the supervised search method.

Using the architecture parameters obtained in the first stage as initialization, a supervised search is performed on the CIFAR-10 dataset. The CIFAR-10 training images are randomly split into two halves of 25,000 images each: one for optimizing the network weight parameters and one for the architecture parameters. During the supervised search, images from the classification dataset are fed into the network model of the staged search space. The weight parameters are optimized with SGD, with an initial learning rate of 0.025, momentum 0.9, and weight decay 0.0003; the architecture parameters are optimized with Adam, with a learning rate of 1×10^-4 and weight decay 0.001. After 50 epochs, the search ends and the optimal full attention network is obtained according to the architecture parameters. The searched network is shown in Fig. 4.
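One alternating update of this supervised stage might be sketched as follows, using the cross-entropy loss defined earlier; the supernet, optimizers, and batches are assumed to be set up as in the earlier sketches:

```python
import torch.nn.functional as F

def supervised_search_step(supernet, w_opt, a_opt, batch_w, batch_a):
    # 1) Weight update on one half of the CIFAR-10 training split
    images, labels = batch_w
    w_opt.zero_grad()
    F.cross_entropy(supernet(images), labels).backward()   # L_CE
    w_opt.step()

    # 2) Architecture update on the other half
    images, labels = batch_a
    a_opt.zero_grad()
    F.cross_entropy(supernet(images), labels).backward()
    a_opt.step()
```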

Step 4: Train and test the searched network on the CIFAR-10 and ImageNet datasets.

The network is first trained on CIFAR-10 with an SGD optimizer, with weight decay 4×10^-4, momentum 0.9, and an initial learning rate of 0.04 that decays gradually to 0 according to a cosine decay rule. The network's parameter weights are updated once per forward and backward pass; after 500 epochs, the trained network is obtained. At test time, the CIFAR-10 test set images are fed into the network model to obtain the test results, shown in Fig. 5(a).
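The cosine decay rule mentioned above can be sketched as follows; only the base learning rate (0.04) and the epoch count come from the text:

```python
import math

def cosine_lr(base_lr: float, epoch: int, total_epochs: int) -> float:
    # Decays from base_lr at epoch 0 toward 0 at the final epoch.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

for epoch in range(500):            # 500 epochs on CIFAR-10
    lr = cosine_lr(0.04, epoch, 500)
    # ... set the optimizer's learning rate, then run one training epoch
```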

The network is then trained on ImageNet with an SGD optimizer, with weight decay 3×10^-5, momentum 0.9, and an initial learning rate of 0.04 that decays gradually to 0 according to the cosine decay rule. The parameter weights are updated once per forward and backward pass; after 300 epochs, the trained network is obtained. At test time, the ImageNet test set images are fed into the network model to obtain the test results, shown in Fig. 5(b).

The classification results of the searched model are compared with existing advanced models. The experimental results show that, while maintaining search efficiency, the proposed algorithm greatly improves the classification accuracy of the searched full attention network structure.

Claims (4)

1. An image classification algorithm based on full attention network structure search, characterized in that it comprises the following steps:

designing a staged search space in which a different self-attention operation is selected at each stage of the network;

searching with a self-supervised search method: inputting images into the network model of the staged search space and updating the network's internal weight parameters and architecture parameters; when the self-supervised search stage is completed, retaining the architecture parameters and using them as initial values for the supervised search stage;

searching with a supervised search method: inputting images into the network model of the staged search space, updating the network's internal weight parameters and architecture parameters, and obtaining the optimal full attention network according to the architecture parameters.

2. The image classification algorithm based on full attention network structure search according to claim 1, characterized in that, in the network structure of the staged search space, the first layer is a fixed local self-attention operation and the last two layers are an average pooling layer and a classification layer; the remaining middle part consists of five stages, with a fixed pooling operation in the second and fourth stages that halves the spatial size of the feature map and doubles the number of channels; each stage has three searchable layers, and for each searchable layer the best-performing operation is selected from the candidate operations; the candidate operations consist of 7 self-attention operations, including one non-local self-attention operation and 6 local self-attention operations with different hyperparameters, the hyperparameters of a local self-attention operation being the spatial extent (3, 5, or 7) and the number of heads (4 or 8); in summary, the staged search space contains 15 searchable layers, each choosing from 7 candidate operations, and the search space contains 7^15 possible structures.

3. The image classification algorithm based on full attention network structure search according to claim 1, characterized in that, in the self-supervised search stage, a self-supervised search algorithm based on a context autoregression task is designed; in the context autoregression task, multiple regions of the input image are randomly masked and the network is trained to predict the content of the missing parts; an encoder-decoder structure is used to extract features of the input image and reconstruct the missing image content; this task is then used to search the full attention network.
The full attention network is used as the feature encoder to extract features from the input image; the network contains two kinds of learnable parameters: the weight parameters w of the self-attention operations and the architecture parameters α corresponding to each candidate operation; the image dataset is divided into two independent sets, denoted DatasetA and DatasetB; DatasetA is used to optimize the weight parameters and DatasetB to optimize the architecture parameters; the L1 loss is used as the loss function, defined as:

$$\mathcal{L}_{1} = \frac{1}{M}\sum_{i=1}^{M}\lvert p_i - y_i\rvert$$

where M is the number of pixels, p_i is the input pixel, and y_i is the ground-truth value; a differentiable architecture search method is then used to alternately optimize the weight parameters w and the architecture parameters α, iteratively updating the weight parameters by gradient descent $\nabla_{w}\,\mathcal{L}_{A}(w,\alpha)$ on DatasetA and the architecture parameters by gradient descent $\nabla_{\alpha}\,\mathcal{L}_{B}(w,\alpha)$ on DatasetB; when the self-supervised search stage is complete, the architecture parameters are stored and used as initial values for the supervised search stage.
4. The image classification algorithm based on full attention network structure search according to claim 1, characterized in that, in the supervised search stage, a differentiable architecture search method is used to search on an image classification dataset; the architecture parameters obtained in the self-supervised search stage are used as initial values, and gradient descent alternately optimizes the architecture parameters α and the weight parameters w; the cross-entropy loss is used as the loss function, defined as:

$$\mathcal{L}_{CE} = -\sum_{k=1}^{K} y_k \log p_k$$

where K is the number of classes, p_k is the predicted probability of class k, and y_k is the class label; when the supervised search process ends, the architecture parameters are sorted and the operation corresponding to the largest architecture parameter is selected, yielding the final architecture.
CN202210660061.5A 2022-06-13 2022-06-13 Image classification method based on full attention network structure search Active CN115100459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210660061.5A CN115100459B (en) 2022-06-13 2022-06-13 Image classification method based on full attention network structure search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210660061.5A CN115100459B (en) 2022-06-13 2022-06-13 Image classification method based on full attention network structure search

Publications (2)

Publication Number Publication Date
CN115100459A (en) 2022-09-23
CN115100459B CN115100459B (en) 2025-04-25

Family

ID=83290327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210660061.5A Active CN115100459B (en) 2022-06-13 2022-06-13 Image classification method based on full attention network structure search

Country Status (1)

Country Link
CN (1) CN115100459B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433980A (en) * 2023-04-19 2023-07-14 中科南京智能技术研究院 Image classification method, device, equipment and medium of spiking neural network structure

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633494A (en) * 2020-12-17 2021-04-09 电子科技大学 Automatic neural network structure searching method based on automatic machine learning
CN113469263A (en) * 2021-07-13 2021-10-01 润联软件系统(深圳)有限公司 Prediction model training method and device suitable for small samples and related equipment
CN114299344A (en) * 2021-12-31 2022-04-08 江南大学 A low-cost automatic search method of neural network structure for image classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633494A (en) * 2020-12-17 2021-04-09 电子科技大学 Automatic neural network structure searching method based on automatic machine learning
CN113469263A (en) * 2021-07-13 2021-10-01 润联软件系统(深圳)有限公司 Prediction model training method and device suitable for small samples and related equipment
CN114299344A (en) * 2021-12-31 2022-04-08 江南大学 A low-cost automatic search method of neural network structure for image classification

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433980A (en) * 2023-04-19 2023-07-14 中科南京智能技术研究院 Image classification method, device, equipment and medium of spiking neural network structure

Also Published As

Publication number Publication date
CN115100459B (en) 2025-04-25

Similar Documents

Publication Publication Date Title
Liu et al. Progressive neural architecture search
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN112508104B (en) A cross-task image classification method based on fast network architecture search
CN116129219B (en) SAR target class increment recognition method based on knowledge robust-rebalancing network
CN114492581B (en) A method based on transfer learning and attention mechanism meta-learning applied to small sample image classification
CN114626506B (en) A neural network unit structure search method and system based on attention mechanism
CN113642574A (en) Small sample target detection method based on feature weighting and network fine tuning
CN116089883A (en) A training method for improving the discrimination between old and new categories in incremental learning of existing categories
CN111783688B (en) A classification method of remote sensing image scene based on convolutional neural network
CN114742199B (en) A neural network macro-architecture search method and system based on attention mechanism
CN114067155A (en) Image classification method, device, product and storage medium based on meta learning
CN112101364A (en) A Semantic Segmentation Method Based on Incremental Learning of Parameter Importance
CN116310466A (en) A Few-Sample Image Classification Method Based on Locally Irrelevant Region Screening Graph Neural Networks
CN114781611B (en) Natural language processing method, language model training method and related equipment
CN113989655A (en) Radar or sonar image target detection and classification method based on automatic deep learning
CN115100459A (en) Image classification algorithm based on full attention network structure search
CN112733724B (en) Kinship verification method and device based on discriminative sample element miner
CN117195951B (en) A learning genetic inheritance method based on architecture search and self-knowledge distillation
CN117218409B (en) Image classification network architecture design method, device, equipment and medium
CN116776934B (en) Method for automatically searching neural network structure
JP6993250B2 (en) Content feature extractor, method, and program
CN115100694B (en) A fast fingerprint retrieval method based on self-supervised neural network
CN118587494A (en) An image classification method based on CNN neural network
CN118365952A (en) Crop pest image identification method based on causal intervention
CN116403051A (en) A Crater Age Classification Method Based on Multi-source Data Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant