CN114626506B - A neural network unit structure search method and system based on attention mechanism - Google Patents
- Publication number
- CN114626506B (application CN202210219650.XA)
- Authority
- CN
- China
- Prior art keywords
- unit structure
- attention
- network
- searching
- nodes
- Prior art date
- Legal status
- Active
Classifications
- G06N3/04—Neural networks; architecture, e.g. interconnection topology
- G06N3/063—Neural networks; physical realisation using electronic means
- G06N3/08—Neural networks; learning methods
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Abstract
The invention provides a neural network unit structure search method and system based on an attention mechanism. A macro-architecture super network is constructed in a search space; each layer's unit structure in the super network is a directed acyclic graph whose nodes are connected by edges, each edge representing a combination of several candidate operations in the search space. An attention module is added after the output feature maps of all candidate operations on each edge of the unit structure, yielding a network to be searched. The network to be searched is trained with a labelled data set; during training, the candidate operation with the smallest attention weight on each edge of the intermediate network's unit structure is deleted step by step, and when training reaches a preset number of iterations, all attention modules are removed from the current network to be searched, giving the neural network unit structure search result for the data set. The invention not only accounts for the interaction between operations but also retains every operation until the last step of the search.
Description
Technical Field
The invention relates to the technical fields of neural network architecture search and image classification within automated machine learning, and in particular to a neural network unit structure search method and device based on an attention mechanism.
Background
Automated machine learning (AutoML) refers to automating the machine learning steps of data preprocessing, feature selection, and algorithm selection, as well as the deep learning steps of designing a neural network architecture, optimizing hyper-parameters, and training a neural network model, so that the desired result is obtained without manual intervention. Neural architecture search (NAS) belongs to the network-design part of automated machine learning and refers to obtaining a neural network architecture automatically by search: for different computer vision tasks such as classification, detection, segmentation, and tracking, operations from a search space containing various candidates (such as convolution, pooling, and skip connection) are combined according to a search strategy into a neural network architecture, whose performance on the corresponding task is then measured under a specified evaluation strategy.
Early neural architecture search strategies, including reinforcement learning, evolutionary algorithms, random search, and Bayesian optimization, typically required retraining each candidate network to evaluate its performance, so the overall search process was computationally intensive and time-consuming. In recent years, differentiable search strategies have markedly reduced search time through weight sharing and gradient-descent optimization, attracting wide attention in academia and industry. A particularly representative example is differentiable architecture search (DARTS), which searches for a cell structure and then stacks the searched cells into a target network to verify performance.
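For reference, DARTS makes the search differentiable by relaxing the categorical choice of one operation per edge into a softmax-weighted mixture of all candidate operations; a standard statement of this relaxation (notation from the DARTS paper, not from this patent) is:

```latex
\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}}
  \frac{\exp\big(\alpha_o^{(i,j)}\big)}
       {\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)} \, o(x)
```

where $\mathcal{O}$ is the set of candidate operations and $\alpha^{(i,j)}$ are the architecture weights of edge $(i,j)$.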
However, the differentiable search strategy DARTS only considers the influence of the model's loss function on the weights of the operations in the search space, not the interaction among the operations themselves. StacNAS observed that multicollinearity among operations causes similar operations to split the vote during selection. It therefore first computes the correlation matrix of all operations in the original search space, groups the operations by correlation, and selects one operation from each group to represent it, so that the representatives form a compact search space. Within this compact space, StacNAS obtains the operation weights on each edge in the same way as DARTS, keeps only the operation with the largest weight on each edge, replaces it with the operations of its group in the original search space, and continues the search; the finally retained operations are then determined by the operation weights on each edge. Because both StacNAS and DARTS delete low-weight operations during the search, when the weight differences are small, a deleted operation never gets another chance to be selected, so only a suboptimal neural network architecture may be found.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a neural network unit structure searching method based on an attention mechanism, which comprises the following steps:
Step 1: construct a macro-architecture super network in a search space. Each layer's unit structure in the super network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes, and output nodes: the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all its preceding nodes within the unit structure, and the output node concatenates the feature maps of all intermediate nodes. The nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of several candidate operations in the search space;

Step 2: add an attention module after the output feature maps of all candidate operations on each edge of the unit structure, obtaining the network to be searched;

Step 3: train the network to be searched with a labelled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the intermediate network's unit structure during training; when training reaches the preset number of iterations, remove all attention modules from the current network to be searched, obtaining the neural network unit structure search result for the data set.
In the above neural network unit structure search method, the data set comprises several samples, each with a corresponding label; the samples are images and the labels are image classes, and the search space is the DARTS search space.
The neural network macro architecture searching method based on the attention mechanism further comprises the following steps:
Step 5: train the neural network unit structure search result with the data set to obtain an image classification model, and input the image to be classified into the model to obtain its image class.
In the above attention-mechanism-based method, each edge in the unit structure of the macro-architecture super network consists of several candidate operations. If an edge has m candidate operations, they produce m feature maps $\{F_1, F_2, \ldots, F_m\}$, each of size $h \times w \times c$. The m feature maps are concatenated along the channel dimension into the concatenated feature $F_{con} \in \mathbb{R}^{h \times w \times mc}$, which is input into the attention module to compute the attention weight of each candidate operation; the attention module consists of a global average pooling layer, fully connected layers, and a Sigmoid layer.

Step 3 includes computing the attention weight of each candidate operation on each edge of the unit structure of the network to be searched:

The concatenated feature $F_{con}$ of all candidate operations on each edge is globally average-pooled into the pooled feature $F_{avg} \in \mathbb{R}^{mc}$; $F_{avg}$ then passes through two fully connected layers and a Sigmoid layer, and the Sigmoid layer outputs the attention weights $A \in \mathbb{R}^{mc}$.
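As an illustration only, the attention module described above (global average pooling followed by two fully connected layers and a Sigmoid) could be sketched in PyTorch as follows; the class name, the ReLU between the two fully connected layers, the hidden-width reduction ratio, and the per-operation importance readout are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class EdgeAttention(nn.Module):
    """Sketch of the per-edge attention module: GAP -> FC -> FC -> Sigmoid."""
    def __init__(self, num_ops: int, channels: int, reduction: int = 4):
        super().__init__()
        total = num_ops * channels          # channels of the concatenated feature F_con
        self.fc = nn.Sequential(
            nn.Linear(total, total // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(total // reduction, total),
            nn.Sigmoid(),                   # per-channel attention weights in (0, 1)
        )
        self.num_ops, self.channels = num_ops, channels

    def forward(self, feats):
        # feats: list of m tensors, each of shape (B, c, h, w)
        f_con = torch.cat(feats, dim=1)     # (B, m*c, h, w)
        pooled = f_con.mean(dim=(2, 3))     # global average pooling -> (B, m*c)
        attn = self.fc(pooled)              # channel attention weights (B, m*c)
        # importance of each candidate operation = sum of its channels' weights
        op_importance = attn.view(-1, self.num_ops, self.channels).sum(dim=2)
        return attn, op_importance
```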
The invention also provides a neural network unit structure searching system based on the attention mechanism, which comprises:
The initialization module constructs a macro-architecture super network in a search space. Each layer's unit structure in the super network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes, and output nodes: the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all its preceding nodes within the unit structure, and the output node concatenates the feature maps of all intermediate nodes. The nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of several candidate operations in the search space.

The adding module adds an attention module after the output feature maps of all candidate operations on each edge of the unit structure, obtaining the network to be searched.

The search module trains the network to be searched with a labelled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the intermediate network's unit structure during training; when training reaches the preset number of iterations, it removes all attention modules from the current network to be searched, obtaining the neural network unit structure search result for the data set.
In the above neural network unit structure search system, the data set comprises several samples, each with a corresponding label; the samples are images and the labels are image classes, and the search space is the DARTS search space.
The neural network macro architecture search system based on the attention mechanism further comprises:
The image classification module trains the neural network unit structure search result with the data set to obtain an image classification model, and inputs the image to be classified into the model to obtain its image class.
In the above attention-mechanism-based system, each edge in the unit structure of the macro-architecture super network consists of several candidate operations. If an edge has m candidate operations, they produce m feature maps $\{F_1, F_2, \ldots, F_m\}$, each of size $h \times w \times c$. The m feature maps are concatenated along the channel dimension into the concatenated feature $F_{con} \in \mathbb{R}^{h \times w \times mc}$, which is input into the attention module to compute the attention weight of each candidate operation; the attention module consists of a global average pooling layer, fully connected layers, and a Sigmoid layer.

The search module computes the attention weight of each candidate operation on each edge of the unit structure of the network to be searched:

The concatenated feature $F_{con}$ of all candidate operations on each edge is globally average-pooled into the pooled feature $F_{avg} \in \mathbb{R}^{mc}$; $F_{avg}$ then passes through two fully connected layers and a Sigmoid layer, and the Sigmoid layer outputs the attention weights $A \in \mathbb{R}^{mc}$.
The invention also provides a storage medium storing a program for executing any of the above attention-mechanism-based neural network macro-architecture search methods.

The invention also provides a client for any of the above attention-mechanism-based neural network macro-architecture search systems.
The advantages of the invention are as follows:
The invention provides a neural network unit structure search method and device based on an attention mechanism (Attention-based Neural Cell Search, ANCS), which, on top of a differentiable search strategy, evaluates the importance of each operation, adds a regularization term to the loss function to sparsify the operation weights, and finally selects which operations, and how many, to retain according to constraints on computation, storage, and inference time. Compared with the prior art, the invention not only considers the interaction between operations but also retains every operation until the last step of the search, thereby obtaining a neural network architecture with excellent performance.
Drawings
FIG. 1 is a schematic diagram of a neural network unit structure search method based on an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a super network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an attention module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an apparatus for searching a neural network unit structure based on an attention mechanism according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a neural network unit structure searching method and device based on an attention mechanism. The invention mainly aims at searching a cell structure in a DARTS search space, and adopts an attention mechanism to measure the importance of each candidate operation in the cell.
In a first aspect, the present invention provides a neural network unit structure searching method based on an attention mechanism, which specifically includes the following steps:
Step 1: design a directed acyclic graph.
The directed acyclic graph consists of N (e.g., N = 3) intermediate nodes and E (e.g., E = 9) edges. Each node represents its corresponding feature map, and each edge represents a combination of several candidate operations in the search space. Both N and E are hyper-parameters; the larger they are, the larger the search space and the harder it is to find a good structure. The directed acyclic graph has two input nodes, which receive the output feature maps of the two preceding unit structures; each intermediate node aggregates the feature-map information of all preceding nodes in the unit structure; and the feature map of the output node is defined as the concatenation of the feature maps of all intermediate nodes.
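Written out under the usual cell-search convention (the symbols below follow the DARTS relaxation quoted in the Background and are not taken verbatim from the patent), the node computation is:

```latex
x_j = \sum_{i < j} \bar{o}^{(i,j)}(x_i), \qquad
x_{\mathrm{out}} = \mathrm{concat}\left(x_1, \ldots, x_N\right)
```

where $x_i$ denotes the feature map of node $i$, $\bar{o}^{(i,j)}$ is the combined operation on edge $(i,j)$, and the output node concatenates the $N$ intermediate nodes along the channel dimension.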
Step 2: add an attention module to each edge of the directed acyclic graph.
In the defined directed acyclic graph, each edge consists of several candidate operations. Suppose there are m candidate operations on an edge; each receives the edge's input node, yielding m corresponding feature maps $\{F_1, F_2, \ldots, F_m\}$, each of size $h \times w \times c$, where h and w are the height and width of the feature map, c is the number of channels, and b denotes the convolution kernel size of the corresponding operation. The m feature maps are then concatenated along the channel dimension into the concatenated feature $F_{con} \in \mathbb{R}^{h \times w \times mc}$ and input into the proposed attention module to compute the attention weight of each candidate operation. The attention module consists of a global average pooling layer, fully connected layers, and a Sigmoid layer.
Step 3: compute the importance of each candidate operation on each edge.
First, the concatenated feature $F_{con}$ of all candidate operations on each edge is globally average-pooled into the pooled feature $F_{avg} \in \mathbb{R}^{mc}$. $F_{avg}$ then passes through two fully connected layers and a Sigmoid layer, whose output $A \in \mathbb{R}^{mc}$ gives the attention weight, i.e. the importance, of each channel. The importance of each candidate operation is thus represented by the sum of the activation values of its channels. To allow the weights of the attention module to be updated, the activation value of each channel is multiplied with the original concatenated feature map to obtain the attention-weighted feature map $F_{att}$; to keep the same dimensions as the original input feature map, $F_{att}$ is combined with $F_{con}$ by element-wise addition.
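A minimal sketch of this step, assuming the `EdgeAttention` module from the earlier sketch and PyTorch tensor conventions (the function name and the exact residual form are assumptions):

```python
import torch

def apply_edge_attention(feats, attention):
    # feats: list of m tensors of shape (B, c, h, w), one per candidate operation
    f_con = torch.cat(feats, dim=1)           # concatenated feature (B, m*c, h, w)
    attn, op_importance = attention(feats)    # channel weights (B, m*c), op scores (B, m)
    f_att = f_con * attn[:, :, None, None]    # channel-wise reweighting
    out = f_con + f_att                       # element-wise addition keeps the shape
    return out, op_importance
```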
Step 4: train and update the weights of the candidate operations of the directed acyclic graph and the weights of the attention modules based on the selected task.
Depending on the target task, which may be object classification, object detection, semantic segmentation, instance segmentation, object tracking, and so on, the directed acyclic graph is trained on the received training data set using conventional machine learning techniques appropriate to the task (e.g., stochastic gradient descent with back-propagation), and the weights of the different operations and the attention weights on each edge are updated through back-propagation.
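A sketch of this training step for an image classification task, assuming a single optimizer over both operation and attention parameters (the optimizer choice and hyper-parameters below are assumptions; the patent does not fix them):

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_supernet(supernet, train_loader, epochs, lr=0.025, device="cuda"):
    # one optimizer over all parameters: candidate-operation weights and attention weights
    optimizer = optim.SGD(supernet.parameters(), lr=lr, momentum=0.9, weight_decay=3e-4)
    criterion = nn.CrossEntropyLoss()
    supernet.to(device).train()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(supernet(images), labels)
            loss.backward()                  # back-propagation updates both weight groups
            optimizer.step()
```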
Step 5: evaluate the performance of the directed acyclic graph with the validation data set of the selected task.
After the network trained in step 4 on the training set of the selected task converges, the performance of the directed acyclic graph is evaluated on the task's validation set and the attention weights of all candidate operations on each edge of the unit structure are obtained; the operations of the directed acyclic graph with low attention weights are then deleted first.
Step 6: repeat step 5, continuing training until convergence and then evaluating again with the validation set.
After the low-attention-weight operations on each edge of the cell structure are deleted in step 5, the accuracy of the super network drops, so it must be retrained on the training set until convergence and then re-evaluated on the validation set to obtain the attention weights of all candidate operations on each edge of the cell structure in the current state; the low-attention-weight operations of the directed acyclic graph are then deleted again.
Step 7: output the neural network corresponding to the directed acyclic graph.
When only the single operation with the largest attention weight remains on each edge of the cell structure, the optimal target neural network for the selected task is obtained. Finally, the network is retrained on the full data set and its final performance is verified.
In a second aspect, the present invention provides a neural network unit structure search device based on an attention mechanism, which specifically includes the following modules:
A. Unit structure construction module: builds the directed acyclic graph of the unit structure, consisting of N intermediate nodes and E edges.

B. Attention module: added to each edge of the unit structure to extract the attention weights of all candidate operations on that edge, i.e. the importance of each candidate operation.

C. Unit structure search and optimization module: feeds the training set into the unit structure for forward propagation and optimizes the weight parameters of the different candidate operations and of the attention modules through back-propagation.

D. Unit structure evaluation and update module: evaluates the importance of each candidate operation in the unit structure, deletes the operations with low attention weight according to the weight of each operation, and then updates the topology of the unit structure.

E. Unit structure acquisition module: obtains the optimal unit structure model according to the importance of each candidate operation on each edge of the unit structure.

F. Target network training and verification module: retrains the obtained optimal unit network structure model on the full training set to optimize the weight parameters of the unit structure model, and verifies the performance of the searched unit structure model on a test set.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
Example 1
The invention provides a neural network unit structure searching method based on an attention mechanism, which comprises the following steps:
s11, designing a directed acyclic graph of the unit structure.
In this step, the invention performs a neural network cell structure search, i.e., it searches for and determines the best candidate operation on each edge of the cell structure. Taking the DARTS search space as an example, each edge in this space contains several candidate operations, and the purpose of the unit structure search is to select the optimal one among them to form the final target structure.
S12: add an attention module to each edge of the directed acyclic graph.
In this step, the super network is formed by stacking L layers of cells. As shown in FIG. 2, c_{k-2} and c_{k-1} are input nodes whose input data are the outputs of the two preceding cells, 0, 1, 2 are intermediate nodes, and c_{k} is the output node. There are two types of cells, normal cells and downsampling (reduction) cells; the downsampling cells are located at the L/3-th and 2L/3-th layers of the network, and normal cells at the other layers. Inside every cell is a directed acyclic graph containing N nodes and E edges, each edge carrying m candidate operations such as the zero operation, convolution operations, and pooling operations. At the same time, the invention adds an attention module after the output feature maps of all candidate operations on each edge, as shown in FIG. 3. The attention weights of the attention module can be interpreted as the importance of each candidate operation.
The normal cell does not change the resolution of the feature map; the downsampling cell halves the resolution and doubles the number of channels. There are generally two downsampling cells in the search space, and they may be placed at different positions of the super network as required; the invention is not limited to two downsampling cells.
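For illustration, placing the two downsampling cells at one third and two thirds of the depth (the common DARTS convention; the helper name is an assumption) can be sketched as:

```python
def cell_types(num_layers: int):
    # reduction cells at 1/3 and 2/3 of the depth, normal cells elsewhere
    reduce_at = {num_layers // 3, 2 * num_layers // 3}
    return ["reduction" if i in reduce_at else "normal" for i in range(num_layers)]

# e.g. cell_types(8) -> ['normal', 'normal', 'reduction', 'normal', 'normal',
#                        'reduction', 'normal', 'normal']
```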
S13: feed the training-set images of the selected task into the directed acyclic graph, compute the gradients, and use an optimizer to update the weights of the unit structure and of the attention modules along the gradient direction.
In this step, the super network containing the attention modules is trained on the training set; the weights of the super network are updated during training, and the weights of the attention modules are updated along with them so as to learn the importance of each candidate operation in the super network.
S14: delete the operations with low attention weight according to the weight of each operation, then update the unit structure topology.
In this step, as training of the super network proceeds, the operations with lower attention weights on each edge of the cell structure are gradually deleted, and the topology of the cell structure is then updated. The step-by-step deletion may be performed at every iteration or after every preset number of iterations.
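A sketch of this progressive deletion, assuming each edge object exposes its candidate list and accumulated attention scores (all names here are hypothetical, not from the patent):

```python
def prune_step(supernet, iteration, every_k=50):
    # every `every_k` iterations, drop the weakest candidate on each edge
    if iteration % every_k != 0:
        return
    for edge in supernet.edges:                   # hypothetical iterable of cell edges
        if len(edge.candidates) > 1:
            scores = edge.op_importance()         # attention weight per remaining candidate
            weakest = min(range(len(scores)), key=scores.__getitem__)
            edge.remove_candidate(weakest)        # update the cell topology
```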
S15: check whether the algorithm has reached the specified number of iterations.
This step judges whether training of the super network containing the attention modules has reached the specified number of iterations, i.e., whether several candidate operations still remain on each edge of the unit structure. If the specified number has not been reached, return to step S13 and continue training; if it has been reached, i.e., only one operation remains on each edge, proceed to the next step.
S16: obtain the target optimal unit structure.
In this step, the candidate operations with low attention weights are gradually deleted according to steps S13 to S15 until only the operation with the largest attention weight remains on each edge of the final unit structure, which yields the target optimal unit structure for the task.
S17: retrain the target structure on the whole data set and verify its performance.
In this step, the target structure obtained in step S16 is retrained on the entire training set until convergence, and its performance is then measured on the test set.
Example 2
The embodiment of the invention also provides a neural network unit structure search device based on an attention mechanism which, as shown in FIG. 4, comprises a unit structure construction module 21, an attention module 22, a unit structure search and optimization module 23, a unit structure evaluation and update module 24, a unit structure acquisition module 25, and a target network training and verification module 26.
The unit structure construction module 21 builds the directed acyclic graph of the unit structure, consisting of N intermediate nodes and E edges. The attention module 22 consists of a global average pooling layer, fully connected layers, and a Sigmoid layer; it is added to each edge of the unit structure to extract the attention weights of all candidate operations on that edge, i.e. the importance of each candidate operation. The unit structure search and optimization module 23 feeds the training set into the unit structure for forward propagation and optimizes the weight parameters of the different candidate operations and of the attention modules through back-propagation. The unit structure evaluation and update module 24 evaluates the importance of each candidate operation in the unit structure, deletes the operations with low attention weight according to the weight of each operation, and then updates the topology of the unit structure. The unit structure acquisition module 25 obtains the optimal unit structure model according to the importance of each candidate operation on each edge of the unit structure. The target network training and verification module 26 retrains the obtained optimal unit network structure model to optimize its weight parameters and tests the performance of the searched unit structure model.
In the attention-mechanism-based device for differentiable neural network architecture search provided by the embodiment of the invention, the working process of each module has the same technical features as the attention-mechanism-based differentiable neural network architecture search method, so the same functions can likewise be realized; the details are not repeated here.
The following is a system embodiment corresponding to the above method embodiment, and the two may be implemented in cooperation. The technical details mentioned in the above embodiments remain valid in this embodiment and, to reduce repetition, are not repeated here; conversely, the technical details mentioned in this embodiment also apply to the above embodiments.
The invention also provides a neural network unit structure searching system based on the attention mechanism, which comprises:
The initialization module constructs a macro-architecture super network in a search space. Each layer's unit structure in the super network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes, and output nodes: the input nodes receive the output feature maps of the preceding unit structures, each intermediate node aggregates the feature maps of all its preceding nodes within the unit structure, and the output node concatenates the feature maps of all intermediate nodes. The nodes of the directed acyclic graph are connected by edges, and each edge represents a combination of several candidate operations in the search space.

The adding module adds an attention module after the output feature maps of all candidate operations on each edge of the unit structure, obtaining the network to be searched.

The search module trains the network to be searched with a labelled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the intermediate network's unit structure during training; when training reaches the preset number of iterations, it removes all attention modules from the current network to be searched, obtaining the neural network unit structure search result for the data set.
In the above neural network unit structure search system, the data set comprises several samples, each with a corresponding label; the samples are images and the labels are image classes, and the search space is the DARTS search space.
The neural network macro architecture search system based on the attention mechanism further comprises:
The image classification module trains the neural network unit structure search result with the data set to obtain an image classification model, and inputs the image to be classified into the model to obtain its image class.
In the above attention-mechanism-based system, each edge in the unit structure of the macro-architecture super network consists of several candidate operations. If an edge has m candidate operations, they produce m feature maps $\{F_1, F_2, \ldots, F_m\}$, each of size $h \times w \times c$. The m feature maps are concatenated along the channel dimension into the concatenated feature $F_{con} \in \mathbb{R}^{h \times w \times mc}$, which is input into the attention module to compute the attention weight of each candidate operation; the attention module consists of a global average pooling layer, fully connected layers, and a Sigmoid layer.

The search module computes the attention weight of each candidate operation on each edge of the unit structure of the network to be searched:

The concatenated feature $F_{con}$ of all candidate operations on each edge is globally average-pooled into the pooled feature $F_{avg} \in \mathbb{R}^{mc}$; $F_{avg}$ then passes through two fully connected layers and a Sigmoid layer, and the Sigmoid layer outputs the attention weights $A \in \mathbb{R}^{mc}$.
The invention also provides a storage medium storing a program for executing any of the above attention-mechanism-based neural network macro-architecture search methods.

The invention also provides a client for any of the above attention-mechanism-based neural network macro-architecture search systems.
Claims (8)
1. A neural network unit structure search method based on an attention mechanism, characterized by comprising the following steps:
Step 1: construct a macro-architecture super network in a search space, wherein each layer's unit structure in the super network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes, and output nodes, the input nodes receiving the output feature maps of the preceding unit structures, each intermediate node aggregating the feature maps of all its preceding nodes within the unit structure, the output node concatenating the feature maps of all intermediate nodes, the nodes of the directed acyclic graph being connected by edges, and each edge representing a combination of several candidate operations in the search space;

Step 2: add an attention module after the output feature maps of all candidate operations on each edge of the unit structure, obtaining a network to be searched;

Step 3: train the network to be searched with a labelled data set, gradually deleting the candidate operation with the smallest attention weight on each edge of the intermediate network's unit structure during training, and when training reaches the preset number of iterations, remove all attention modules from the current network to be searched, obtaining the neural network unit structure search result for the data set;

wherein the data set comprises several samples, each with a corresponding label, the samples being images and the labels being image classes, and the search space is the DARTS search space.
2. The neural network macro architecture search method based on an attention mechanism of claim 1, further comprising:
Step 5: train the neural network unit structure search result with the data set to obtain an image classification model, and input the image to be classified into the model to obtain its image class.
3. The attention-mechanism-based neural network macro-architecture search method of claim 1, wherein each edge in the unit structure of the macro-architecture super network consists of several candidate operations; an edge with m candidate operations produces m feature maps $\{F_1, F_2, \ldots, F_m\}$, each of size $h \times w \times c$; the m feature maps are concatenated along the channel dimension into the concatenated feature $F_{con} \in \mathbb{R}^{h \times w \times mc}$, which is input into the attention module to compute the attention weight of each candidate operation, the attention module consisting of a global average pooling layer, fully connected layers, and a Sigmoid layer;

step 3 includes computing the attention weight of each candidate operation on each edge of the unit structure of the network to be searched:

the concatenated feature $F_{con}$ of all candidate operations on each edge is globally average-pooled into the pooled feature $F_{avg} \in \mathbb{R}^{mc}$; $F_{avg}$ then passes through two fully connected layers and a Sigmoid layer, and the Sigmoid layer outputs the attention weights $A \in \mathbb{R}^{mc}$.
4. A neural network unit structure search system based on an attention mechanism, comprising:
an initialization module, which constructs a macro-architecture super network in a search space, wherein each layer's unit structure in the super network is a directed acyclic graph whose nodes comprise input nodes, intermediate nodes, and output nodes, the input nodes receiving the output feature maps of the preceding unit structures, each intermediate node aggregating the feature maps of all its preceding nodes within the unit structure, the output node concatenating the feature maps of all intermediate nodes, the nodes of the directed acyclic graph being connected by edges, and each edge representing a combination of several candidate operations in the search space;

an adding module, which adds an attention module after the output feature maps of all candidate operations on each edge of the unit structure, obtaining a network to be searched;

a search module, which trains the network to be searched with a labelled data set, gradually deletes the candidate operation with the smallest attention weight on each edge of the intermediate network's unit structure during training, and, when training reaches the preset number of iterations, removes all attention modules from the current network to be searched, obtaining the neural network unit structure search result for the data set;

wherein the data set comprises several samples, each with a corresponding label, the samples being images and the labels being image classes, and the search space is the DARTS search space.
5. The attention-based neural network macro architecture search system of claim 4, further comprising:
an image classification module, which trains the neural network unit structure search result with the data set to obtain an image classification model, and inputs the image to be classified into the model to obtain its image class.
6. The neural network macro-architecture search system of claim 4, wherein each edge in the unit structure of the macro-architecture super network consists of several candidate operations; an edge with m candidate operations produces m feature maps $\{F_1, F_2, \ldots, F_m\}$, each of size $h \times w \times c$; the m feature maps are concatenated along the channel dimension into the concatenated feature $F_{con} \in \mathbb{R}^{h \times w \times mc}$, which is input into the attention module to compute the attention weight of each candidate operation, the attention module consisting of a global average pooling layer, fully connected layers, and a Sigmoid layer;

the search module computes the attention weight of each candidate operation on each edge of the unit structure of the network to be searched:

the concatenated feature $F_{con}$ of all candidate operations on each edge is globally average-pooled into the pooled feature $F_{avg} \in \mathbb{R}^{mc}$; $F_{avg}$ then passes through two fully connected layers and a Sigmoid layer, and the Sigmoid layer outputs the attention weights $A \in \mathbb{R}^{mc}$.
7. A storage medium storing a program for executing the neural network macro architecture search method based on the attention mechanism according to any one of claims 1 to 3.
8. A client for the neural network macro architecture search system based on an attention mechanism of any of claims 4 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210219650.XA CN114626506B (en) | 2022-03-08 | 2022-03-08 | A neural network unit structure search method and system based on attention mechanism |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114626506A CN114626506A (en) | 2022-06-14 |
| CN114626506B true CN114626506B (en) | 2025-09-05 |
Family
ID=81899583
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210219650.XA Active CN114626506B (en) | 2022-03-08 | 2022-03-08 | A neural network unit structure search method and system based on attention mechanism |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114626506B (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |