Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an evolutionary neural architecture searching method and system based on performance level agent assistance. The method comprises the following steps:
Step 1, initializing a framework population, evolving a plurality of generations, actually evaluating all framework individuals in the population to obtain real performance, and recording the existing framework performance data;
Step 2, dividing the performance level of the architecture according to the real performance, marking a level label as training data, and training a proxy model by using the training data, wherein the proxy model establishes a mapping relation from the architecture to the performance level, so that the performance level of an unknown architecture can be predicted;
step 3, generating an architecture individual through a crossover operator and a mutation operator based on a simple path, predicting the performance level of the architecture individual by using a proxy model, reserving a high-level architecture, and directly eliminating a low-level architecture;
Step 4, the high-level architecture continues to participate in a one-time selection process, the real performance of the architecture finally reserved is evaluated, and the existing architecture performance data is added;
step 5, merging the evaluated high-level architecture with the current population, performing environment selection, and selecting a tournament to generate a new population in order to maintain the diversity of the population;
And 6, repeating the steps 3 to 5 until the stopping condition is met, and outputting the global optimal architecture.
In step 1, several generations of evolution are required in advance because the proxy model based on performance level prediction requires a partially high-performance architecture as a training sample. Only if the training dataset contains both high-performance and low-performance architectures, the proxy model can learn knowledge of the performance level. Therefore, through the evolution process of several generations, the performance of the architecture population is improved to a certain extent, and the architecture with partial high performance in the training data is ensured.
In step 1, after initializing a population P of frameworks, G generation is evolved, where G is less than or equal to 10, evaluating the performance of each framework on the task data set (the data set collected and prepared for a particular task or goal, for training, validating and testing the model, helping the model learn the patterns and relationships required for a particular task), and saving in recordsWhereinA pair of architecture performance is represented as a pair,Is the i-th architecture in the record,Is the performance corresponding to the ith architecture, and N is the total number of architectures in the record Archive.
In step 2, firstly, a training data set of the proxy model is made according to the existing architecture performance data division hierarchy to train the proxy model. The number of levels then requires a comprehensive search space size and complexity determination. The existing architecture data is ranked according to the performance, then the architecture is divided into layers according to the quantity, and the layers are correspondingly labeled. This will be used for supervised training of the proxy model. The step 2 specifically comprises the following steps:
Step 2.1, dividing the performance level of the architecture, and making a training data set:
for recording Architecture in terms of performanceOrdering from big to small, setting the performance level as K, and dividing the performance level into the following steps:
,
data is a training Data set of the agent Model, and is used for supervised training of the Model; representing the K-th level of the division, Is the number of architectures within the K-th hierarchy, K identifies the architectureIs a hierarchy of (2);
Step 2.2, designing the structure of the proxy model:
the agent Model comprises an architecture feature extractor and a hierarchy divider, wherein the architecture feature extractor and the hierarchy divider are mutually coupled to jointly train and complete tasks of architecture feature extraction and hierarchy division;
the architecture feature extractor comprises a graph convolution layer, and the architecture feature extractor finishes extracting architecture features of the neural network based on transmission and aggregation of information among nodes to obtain architecture representation;
the hierarchy divider comprises a full-connection layer, and the hierarchy divider completes prediction of the performance level of the neural network architecture based on learning of architecture representation and divides the corresponding architecture into the performance levels;
The neural network architecture can be viewed as a directed acyclic graph, with edges representing connections and nodes representing operations. For the data of the graph structure, the graph neural network can obtain effective architecture representation by utilizing the information packaged in the graph structure based on the transmission and aggregation of the information among the nodes. Thus, the architectural feature extractor is made up of a number of layers of the gallery stack. The features of each node are typically represented by a feature vector X. These features may be attributes of the node, embedded representations, or initial features obtained from other sources. In each layer of graph convolution, the feature vector of each node is updated by the feature vector of the neighboring node, which is described as:
,
Wherein, Is an activation function; The method is characterized in that the method comprises the steps of adding an adjacency matrix of a self-loop graph, wherein D is a degree matrix formed by all nodes and is a diagonal matrix, and the calculation method comprises the following steps: ; Is the first in the degree matrix D Line 1The elements of the column are arranged such that,Is an adjacency matrixMiddle (f)Elements of row b;
Is the trainable weight matrix of the first layer of the architecture feature extractor, namely the parameter matrix of the first layer; is the node characteristic vector of the first layer of the architecture characteristic extractor, is the node characteristic representation obtained by continuously transmitting and aggregating information through the formula, and Representing the input of an entire architectural feature extractorThe node characteristic information extracted by the graph convolution layer is combined with all the characteristic information through a global pooling layer to obtain a representation of the architecture in the hidden space;
The architecture representation is then input into a hierarchy divider, which is made up of a number of fully connected layers to establish a mapping of the architecture representation to the performance level, to predict the performance level to which the corresponding architecture belongs. In the invention, the extraction of the architecture features and the dividing modules of the performance level are mutually coupled, and the weight parameters of the two modules are trained and optimized simultaneously. This will facilitate the co-operation of the two modules while accomplishing the extraction of architectural features and the partitioning of the performance level.
And 2.3, the invention designs a loss function for the training process of the proxy model according to the task characteristics of the proxy model. In neural architecture searching, efficient architecture representation is critical to successful implementation of efficient searching and optimization. In one aspect, the architecture representation should adequately reflect the topology and operational information of the network architecture. On the other hand, the architecture representation would facilitate the prediction and search process of the proxy model if it were able to perform a corresponding distribution in hidden space according to performance. In other words, the architectural representation in hidden space is preferably capable of assuming a distribution according to performance. For example, the representation of high-performance architecture and low-performance architecture in hidden space should be as principle as possible to minimize interference with proxy model predictions, and similar-performance architecture should be aggregated in similar representation areas to facilitate proxy model predictions as much as possible. Accordingly, in a predictive performance level task, architectural representations of the same level should be as close as possible and architectural representations of different levels should be as far apart as possible. Therefore, in order to achieve an efficient architectural representation, it is necessary to design a special loss function.
First, the present invention employs a cross entropy loss function in hierarchical partitioning tasks. It is a widely used loss function in deep learning, especially for classification tasks. The method measures the difference between the actual label and the prediction probability distribution, is used for evaluating the prediction effect of the classification model, and guides model training by quantifying the inconsistency between the actual class label and the prediction probability distribution. The hierarchical division task may also be considered a classification task, so cross entropy loss functions may be used to guide the training of proxy models.
Designing a loss function, and guiding the training process of the proxy model:
The loss function of the design proxy model is expressed as:
,
,
,
Wherein, For the cross entropy loss function, m is the batch size of the input data,Is the level of the ith architecture in the batch predicted by the proxy model,Is the true performance level corresponding to the ith architecture; to design the characteristic loss function The aim is to guide the model to extract more advantageous architectural features. It will reduce the architectural features between the same levels, expand the architectural features between different levels, ensure similar performance of the architecture map to similar regions in the feature space. Wherein the function is,Is an architectural representationAnd architecture representationEuclidean distance between them; And Is the input of the hierarchy divider, i.e., the output of the architectural feature extractor, whenIntermediate parameters whenOtherwise, intermediate parameters;Is a critical value parameter for controlling the feature distance between different hierarchies, and the larger the hierarchy phase difference is, the larger the distance between the architecture features should be. For the same hierarchy of architecture, the closer the distance between them is, the better.Is a hyper-parameter that controls the importance between two different loss functions, it can also be set as a learnable parameter, letting the model automatically determine the optimal value according to the training process. And training the proxy model by adopting the total loss function L until convergence.
In step 3, most of the existing evolutionary neural architecture searches adopt a crossing mode based on an adjacency matrix. Firstly, leveling the adjacent matrix into one-dimensional vectors, connecting the one-dimensional vectors after node operation adopts one-hot coding, and then splicing the two one-dimensional vectors to be used as a representation of a framework. Based on this vector representation, conventional crossover operators are employed. Since conventional crossover operators are generally designed for linear structures, they are not suitable for processing the graph structure of the neural network, and may destroy the effective connection and structure of the original architecture, resulting in unstable performance of the generated sub-architecture. Moreover, common crossover operators tend to be non-intuitive for use on network architectures. The lack of understanding of the architecture characteristics of the neural network by the common crossover operator results in larger randomness of the generated child architecture, and is difficult to inherit the excellent characteristics of the parent.
The invention designs a crossover operator based on a simple path for generating a new architecture. The crossover operator based on the simple path is specially designed for the graph structure data, so that the topological structure of the neural network can be better processed, and the generated child architecture is ensured to accord with expectations. The crossover operator keeps the effective data processing path in the parent architecture as far as possible, ensures that the child architecture inherits the excellent characteristics of the parent, and improves the performance stability. The crossover process based on the simple path is more visual, is convenient to understand and realize, and can more effectively explore and optimize the neural network architecture. By adopting the crossover operator based on the simple path, the defects of the common crossover operator can be overcome, the neural architecture search can be more effectively carried out, and the performance and the stability of the final generated architecture are improved.
The step 3 specifically comprises the following steps:
Step 3.1, generating a new architecture by adopting a crossover operator based on a simple path;
and 3.2, predicting the performance level of the newly generated architecture by using the proxy model.
Step 3.1 comprises:
Step 3.1.1, selecting two father frameworks needing to be crossed, regarding the frameworks as directed graphs, numbering all nodes, and specifically, according to the fitness value of all framework individuals in a population, giving higher selection probability to the high-performance frameworks, giving lower selection probability to the low-performance frameworks, selecting the two father frameworks from the population according to a roulette algorithm, performing cross operation, regarding the two father frameworks as two directed graphs, and numbering the nodes in each graph from an input node to an output node in sequence;
Step 3.1.2, deconstructing two directed graphs to obtain all simple paths of the directed graphs to form a simple path candidate pool, wherein for each directed graph, all simple paths of the directed graph are found, namely, no repeated node sequence exists between an input node and an output node, all simple paths are combined into one path candidate pool, and node sequence numbers and operations are reserved;
In step 3.1.2, for each directed graph, all its simple paths, i.e. the order of nodes from the input node to the output node without repetition, are found, which in the network architecture are actually different data processing flows, such as an operation sequence of convolution-averaging pooling-convolution. All simple paths are combined into a path candidate pool, and node sequence numbers and operations of the path candidate pool are reserved. The candidate path pool provides diversified path selection, preserving efficient data processing paths from the parent architecture.
Step 3.1.3, randomly selecting a plurality of simple paths to reconstruct a directed graph of a sub-architecture, wherein the number of paths in a path candidate pool is M, a path number L is randomly generated from [1, M ], then, L simple paths are randomly selected from the path candidate pool again, and the sub-architecture graph is reconstructed by the L simple paths;
Step 3.1.4, inheriting the operation in different modes according to different nodes, wherein the operation directly reserves the node inherited by the father architecture, combines the generated new node, reserves one node from the node operation of the two father architectures randomly with a certain probability (generally 50%), generates a new sub-architecture through the intersection of the two father architectures, and can accelerate the convergence speed of the algorithm and find the architecture with high performance more quickly based on the intersection operator of the simple path because the superior characteristics of the father can be inherited more effectively. For complex neural network structures, the simple path-based crossover operator can better capture and utilize layering and modularized characteristics of the network, and has stronger adaptability.
Step 3.1.5, repeating the steps 3.1.1 to 3.1.4 until a set number of new structures are generated.
Step 3.2 comprises:
performance level prediction is performed on the newly generated architecture by using the constructed proxy model for predicting the architecture performance level, wherein for the set performance level K, the newly generated architecture is divided into K levels, which are expressed as:
,
Wherein, Representing the i-th architecture of the prediction,Is the predicted performance level of the ith architecture. For each new architecture, it is not practical to obtain all its performances in turn, which would cause huge computational and time consumption. Since a proxy model has been constructed that predicts the architecture performance level, the present invention uses the proxy model to perform performance level predictions on the newly generated architecture.
Step 4 comprises:
step 4.1, obtaining all feature representations of the training data by using an architecture feature extractor in the proxy model, and then calculating a hierarchy center of the performance hierarchy;
Step 4.2, using an architecture feature extractor in the proxy model to obtain a feature representation of each architecture in the high level, calculating the distance between the feature representation and the center of the highest performance level, and selecting the architecture closest to the center for performance evaluation;
and 4.3, if the updating condition of the proxy model is met, updating the proxy model by using the latest architecture performance data, and carrying out iterative updating on the proxy model on line.
Step 4.1 comprises:
for m architectures, the feature representation is extracted for all architectures in the training dataset Data using the proxy model, expressed as:
,
Wherein, The ith architecture of representation extractionIs characterized by;
According to the performance level, calculating the centers C of K levels respectively, and sequentially extracting each level architecture The center C of the hierarchy is calculated, expressed as:
,
Wherein, Representing the center of the level of the t-th level,Representing the number of architectures at the t-th level,A feature representation representing an ith architecture in a nth hierarchy.
Step 4.2 comprises:
In the result of agent model prediction, obtaining the feature representation of each architecture in the highest hierarchy, calculating the distance between the feature representation and the center of the highest hierarchy, selecting the V architectures with the smallest distance, wherein the V architectures are the network architectures most worthy of complete training by using a data set, so as to obtain the performance of the V architectures on a task data set, and storing the result in an Archive;
Step 4.3 comprises:
If the evolution of the agent model reaches the set algebra, namely after every evolution for 10 generations, the latest record Archive is used, the latest training Data set Data is manufactured along with the hierarchical division method, and the agent model is updated by using the training Data set Data. While each generation of updates will result in a better proxy model, the computational consumption and search time will correspondingly increase. In comprehensive consideration, the agent model is updated with a certain frequency, so that the agent model with high enough performance can be obtained while the calculation consumption is reduced.
Step 5 includes merging the most recently evaluated architecture with existing architectures in the current population. The consolidated population contains a variety of possible architectures that provide a rich candidate for subsequent environmental selection. On the one hand, in order to keep the current optimal architecture, the elite strategy is firstly adopted to select the part of the architecture with the forefront ranking to be directly kept, and meanwhile, in order to ensure that the diversity of the population is maintained in the selection process, the subsequent architecture adopts the tournament selection strategy. Tournament selection is a common selection method by randomly selecting several individuals in a population for comparison and then selecting the individual with the best performance among them for the next generation. This process is iterated until a new population is generated. By the method, the population can be prevented from converging to a local optimal solution prematurely while excellent individuals are maintained, and the diversity and exploratory capacity of the population are maintained. Finally, the population quality can be effectively improved in the optimization process, the diversity of the population is ensured, and the probability of finding the optimal neural network architecture is improved.
The invention also provides an evolutionary neural architecture search system based on performance level agent assistance, comprising:
the evolution operator module based on the simple path is used for generating a new neural network architecture and pushing the population evolution process;
The architecture feature extraction module based on the graph neural network is used for extracting topology information and operation information of a network architecture and obtaining vector representation of the network architecture;
and a proxy model based on performance hierarchy division, which is used for establishing a mapping relation between architecture representation and performance hierarchy and predicting the performance hierarchy of the neural network architecture.
According to the invention, by constructing a new proxy model, the calculation consumption of the evolutionary neural architecture search can be effectively reduced, the search process can be accelerated, the whole search space can be better explored, and the optimal neural network architecture can be stably searched. Based on the proxy model of performance level prediction, the performance prediction result of the unknown architecture can be accurately and stably given, and the search process is assisted. The invention also provides a method for treating the neural network architecture as a directed acyclic graph, and extracting architecture characteristics by using the graph neural network to obtain effective architecture representation. The architecture representation has greater adaptability and functionality than previously studied. Meanwhile, aiming at the characteristics of the network architecture, a crossover operator based on a simple path is designed, the efficient data processing path in the network architecture is reserved to the greatest extent, and the crossover process is more visual. New angles of view predicted from graph structures and performance levels bring new breakthrough and progress to the field of evolutionary neural architecture search. Compared with the existing evolutionary neural architecture searching method, the evolutionary neural architecture searching framework formed by the technology provided by the invention has better results.
The beneficial effects of the invention are as follows:
the method provides an evolutionary neural architecture searching method based on performance level agent assistance aiming at evolutionary neural architecture searching. The method extracts the architecture characteristic representation through the graph neural network, and extracts the high-efficiency architecture representation by adopting a designed loss function. And regarding the network architecture as a directed graph, and extracting topology information and operation information in the neural network architecture through an information transmission mechanism of the graph neural network. The constructed proxy model based on performance level prediction can efficiently predict the performance of the architecture from the new point of view of the performance level. A number of candidate architecture performance predictions can be given in an inexpensive manner, replacing the expensive architecture performance evaluations. Meanwhile, on the application of evolution calculation in neural architecture search, a crossover operator conforming to the characteristics of a network architecture is provided. The crossover operator based on the simple path accords with the characteristics of the neural network graph structure on one hand, and retains the effective data processing path in the neural network to the greatest extent on the other hand, and has high interpretation. The performance-level agent-assisted evolutionary neural architecture searching method provided by the invention increases the fit degree of evolutionary computation on neural architecture searching, and the proposed agent model method effectively reduces resource consumption, can promote the application of neural architecture searching on practical problems, and has higher practical value and wide application scenarios.
Detailed Description
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
In an embodiment of the present invention, as shown in fig. 1, an evolutionary neural architecture search method based on performance level agent assistance is provided, fig. 1 is a frame diagram of the whole method, and from the initialization of population to the output of results, the implementation flow and details of the evolutionary neural architecture search method based on performance level agent assistance provided in the present invention will be described in detail on a reference data set of the evaluation neural architecture search method, namely NAS-band-101. The method specifically comprises the following steps:
Initializing a framework population P with the size of N, an empty record Archive, evaluating all frameworks in the P on a task data set, and obtaining the actual performance values of the frameworks, wherein the actual performance values are recorded in the Archive, namely:
,
Wherein, Is a pair of architecture performance levels,Representing the i-th architecture of the system,Representing the performance value corresponding to the ith fabric, N is the total number of fabrics in the fabric.
Then evolves a small number of algebra to population P and evaluates each newly generated architecture while also adding the record Archive.
And 2, on the record Archive, according to the existing architecture performance data division hierarchy, making a training data set of the proxy model so as to train the proxy model. The architecture is ranked according to the performance, then the architecture is divided into K layers according to the number, and the layers are correspondingly labeled. Then, the proxy model is trained to build a mapping model of architecture to performance level.
Step 2.1, sorting the architecture in the architecture according to the performance from big to small, setting the performance level as K, equally dividing the performance level into K levels according to the number, wherein the number Q of the architecture of each level is as follows:
,
The result of the division is:
,
data is a training dataset of the proxy Model for supervised training of the Model. Representing the K-th level of the division,Is the number of architecture within this hierarchy, K identifies the hierarchy of architecture;
Step 2.2 constructing a proxy model as shown in fig. 2 (where C1 represents the number of channels of the input layer and C2 represents the number of channels of the output layer), consisting of an architecture feature extractor and a hierarchy divider. The architecture feature extractor is based on a graph convolution layer, the hierarchy divider is based on a full connection layer, a batch of architectures are input, after feature extraction, the architecture feature extractor is input into the hierarchy divider, and finally a predicted hierarchy is output.
And 2.3, before using the proxy Model, performing supervised training on the proxy Model by using Data, and adopting the designed loss function until the Model converges to obtain the proxy Model. Thereafter, feature representations are extracted for all the schemas in Data using ModelAnd the centers C of the K levels are calculated respectively according to the performance levels thereof.Representing the total number of architectures in the log Archive.
Step 3, as shown in fig. 3, after selecting two father frameworks needing to be crossed, deconstructing all simple paths to form a path pool, randomly selecting a certain number of simple paths from the path pool to reconstruct a new sub-framework, and then predicting the performance level of all the generated frameworks by using a proxy Model to reserve the framework of a high performance level. The method comprises the following steps:
And 3.1, generating a new architecture by adopting a crossover operator based on a simple path. And taking the two parent frameworks as directed graphs, and sequentially numbering all nodes from the input nodes of the network framework until the nodes are output. Then, all simple paths from the input node to the output node are obtained. And combining all the simple paths found by the two father frameworks to form a path pool, randomly selecting a certain number of simple paths from the path pool, combining the nodes with the same sequence number into a node according to the sequence number of the nodes in each simple path, and respectively retaining the nodes with different sequence numbers, so as to reconstruct a new framework. This process is repeated multiple times, yielding 10N architectures 。
And 3.2, predicting the performance level of the newly generated architecture by using the proxy model. Using Model pairsPredicting performance levels for each architecture in (a) to yield:
,
Wherein, Is the ith architectureThe predicted architecture performance level.
And 4, selecting a promising architecture to evaluate the performance on the target data set, and updating the proxy model by using the latest Archive when the condition is met. The method comprises the following steps:
step 4.1, obtaining the characteristic representation of all the frameworks in the architecture by the Model: Then, each hierarchical center C is calculated, expressed as:
,
Where t is the level of performance of a certain level, Is the number of architectures in the t-level,Is a feature representation of the ith architecture.
And 4.2, selecting out V high-level architectures. However, three different situations can occur. The method comprises the steps of firstly, calculating the distance between the representation of each architecture in the highest hierarchy and the center C of the existing hierarchy if the number of architectures in the highest hierarchy is larger than V, reserving the V architectures closest to the center C in the obtained result, secondly, reserving the number of architectures in the highest hierarchy which is smaller than V but not 0, wherein all architectures in the highest hierarchy are reserved, extracting the rest architectures from the next highest hierarchy, still selecting the architecture closest to the center of the hierarchy, thirdly, extracting the complementary number V of architectures from the subsequent hierarchy according to the first similar condition and the second similar condition when the number of architectures in the highest hierarchy is 0. Finally, the performance of the V architectures on the target dataset is evaluated and the results are incorporated into the Archive.
And 4.3, if the agent Model updating condition is met, carrying out hierarchical division on the Archive by adopting the division mode in the step 2 to obtain a new training Data set Data, and training the Model in the same mode until convergence.
And 5, merging the evaluated sub-architecture with the current population, participating in environment selection, and generating a next generation population P. The elite mechanism is adopted, the optimal architecture of the first 10% is selected from the previous generation population P to be directly reserved with the elite control rate of 0.1, the excellent architecture is ensured to be reserved, and then 90% of architecture and elite architecture group layer next generation population is selected from the previous generation population by adopting tournament in order to maintain the diversity of the population.
Repeating the steps until the stopping condition is reached, and then outputting the globally optimal architecture as a result.
NASBench-101 are a benchmark dataset for neural architecture searches that provide a standardized platform for evaluating and comparing different NAS methodologies. Researchers can use the same data set and evaluation index to compare, ensure consistency and repeatability of results. NAS-Bench-101 contains more than 423000 unique neural network architectures, each of which was evaluated on the CIFAR-10 dataset. This large search space allows researchers to search and optimize in a large set of architectures. The present invention will be tested in NAS-Bench-101 to verify validity.
In examining the crossover operator of the present invention, a performance comparison is made with random sampling and a crossover operator based on an adjacency matrix. Intersection operators based on adjacency matrices are often adopted by existing studies and are therefore chosen as baselines. 2000 architectures were randomly sampled at NAS-Bench-101, then 2000 architectures were generated by a cross operator based on adjacency matrix, and finally 2000 architectures were generated by a cross operator based on simple paths. Next, the Gaussian kernel density function that generated the architecture is calculated and the performance of the architecture and the corresponding Gaussian kernel density estimation curve are plotted, as shown in FIG. 4.
It can be found that the architecture based on the crossover operator generation of simple paths has higher overall performance than other approaches. The performance of a large number of architectures lies in a higher performance interval, and the most part of the architecture performance is higher than that generated in other modes. In addition, in NAS-Bench-101, the effect of the proxy model is verified at the same time, as shown in FIG. 5. Here, a confusion matrix of the prediction results of the proxy model is given, and the prediction capability of the proxy model can be clearly seen from the result of the confusion matrix. It can be found that the proxy model predicts the performance level of the architecture with high accuracy.
The invention provides a method and a system for searching an evolutionary neural architecture based on performance-level agent assistance, and the method and the way for realizing the technical scheme are numerous, the above description is only a preferred embodiment of the invention, and it should be noted that, for a person skilled in the art, several improvements and modifications can be made without departing from the principle of the invention, and the improvements and modifications should be regarded as the protection scope of the invention. The components not explicitly described in this embodiment can be implemented by using the prior art.