Disclosure of Invention
To address the defects of the prior art, the invention provides a method for dynamic differentiable space and architecture search, which comprises the following steps:
step 1, gathering a plurality of neural network architectures to be searched to obtain a search space, sharing weights among all the neural network architectures in the search space to generate a super network, and representing the directed acyclic graph of the super network as a matrix to obtain a space matrix, wherein the rows of the space matrix represent the edges of the graph and the columns represent candidate operations;
step 2, calculating an upper confidence bound (UCB) value for each element of the space matrix, sorting the upper confidence bound values of all the elements, and selecting the top-K largest to generate a subspace matrix, wherein the subspace matrix corresponds to a sampled subspace;
step 3, optimizing the weight parameters, spatial parameters, and structure parameters of the candidate operations in the subspace matrix using a back-propagation algorithm, wherein the optimization runs for M iterations and M controls the degree of convergence of the parameters in the subspace;
step 4, obtaining the importance degrees of all candidate operations of each edge in the subspace matrix from the spatial parameters and structure parameters to construct an importance matrix, judging whether the current number of iterations has reached the preset number, executing step 2 again if not until the preset number is reached, and then selecting for each edge the candidate operation with the greatest importance degree according to the importance matrix to form the final neural network architecture.
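The four steps above can be sketched as a single loop. The following is a minimal, hypothetical Python sketch; the function names, the fixed initial spatial probability of 0.5, and the stubbed-out `optimize_subspace` callback (standing in for the M-step back-propagation of step 3) are illustrative assumptions, not the patent's implementation:

```python
import math

def search(num_edges, num_ops, K, preset_iters, optimize_subspace):
    """Hedged sketch of the four-step loop: full space matrix -> UCB top-K
    subspace -> subspace optimization (delegated to optimize_subspace) ->
    importance accumulation, repeated preset_iters times."""
    # Step 1: full space matrix (all candidate operations admitted).
    importance = [[0.0] * num_ops for _ in range(num_edges)]
    p = [[0.5] * num_ops for _ in range(num_edges)]   # spatial probabilities
    t = [[0] * num_ops for _ in range(num_edges)]     # per-element sample counts
    for T in range(1, preset_iters + 1):
        # Step 2: rank elements by UCB = p + c * sqrt(ln T / t), keep top-K.
        scored = [(p[e][o] + 1.44 * math.sqrt(math.log(T) / max(t[e][o], 1)), e, o)
                  for e in range(num_edges) for o in range(num_ops)]
        scored.sort(reverse=True)
        sub = [[0] * num_ops for _ in range(num_edges)]
        for _, e, o in scored[:K]:
            sub[e][o] = 1
            t[e][o] += 1
        # Step 3: M-step differentiable optimization inside the subspace
        # (stubbed; returns updated spatial probs p and structure probs q).
        p, q = optimize_subspace(sub, p)
        # Step 4: accumulate importance only for sampled operations.
        for e in range(num_edges):
            for o in range(num_ops):
                if sub[e][o] == 1:
                    importance[e][o] += p[e][o] * q[e][o]
    # Final architecture: operation of maximum importance on each edge.
    return [max(range(num_ops), key=lambda o: importance[e][o])
            for e in range(num_edges)]
```

Passing a stub optimizer that leaves the probabilities unchanged already exercises the sampling and importance bookkeeping; a real implementation would update the probabilities by back-propagation.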
In the method for dynamic differentiable space and architecture search, the super network comprises the number of layers, the number of channels, the stride, the topology, and the candidate-operation information between nodes of the neural network.
In the method, the value of an element in the space matrix is 1 or 0, indicating whether the operation of that edge is a candidate operation of the search space.
In the method, the importance degree of each candidate operation of each edge in step 4 is computed as follows:
if the value at the corresponding position of the subspace matrix is 1, the importance degree at the i-th iteration equals the spatial probability × the structure conditional probability + the importance degree at the (i-1)-th iteration; otherwise, the importance degree at the i-th iteration equals the importance degree at the (i-1)-th iteration; the spatial probability is computed from the spatial parameters and the structure conditional probability from the structure parameters.
In the method, the specific process of calculating the upper confidence bound value in step 2 is as follows:
an upper confidence bound value is computed for each element of the space matrix as UCB = spatial probability + c × sqrt(ln T / t), where UCB is the upper confidence bound value; the spatial probability comes from an initial value at the first iteration and thereafter from the spatial parameters of step 3; c is a confidence coefficient; T is the total number of subspaces sampled; and t is the number of times the subspace matrix value of the operation has been 1.
In the method, the optimization process of step 3 is specifically as follows:
the set of candidate operations in the subspace matrix is O, with o denoting a particular operation and ō denoting the mixing operation; ω, ζ, and α are the weight, spatial, and structure parameters, and k indexes the k-th operation; the mixing operation between two nodes is then computed as
ō(x) = Σ_{k=1..n} p_k · q_k · o_k(x),
where p_k and q_k respectively denote the spatial probability of the k-th operation and the structure conditional probability within this space, with p_k = sigmoid(ζ_k) and q_k = I(Z_k = 1) · exp(α_k) / Σ_{k'=1..n} I(Z_{k'} = 1) · exp(α_{k'}), and Z_k is the subspace matrix value of the k-th operation between the nodes; the weight, spatial, and structure parameters are obtained through back-propagation updates. The number of back-propagation optimization steps is M, which controls the degree of convergence of the parameters in the subspace; the size of M is chosen according to the size of the subspace. When M is larger, the subspace converges better, but the search is prone to overfitting and falling into a local optimum; when M is small, the subspace does not converge well and the assessment of its importance becomes inaccurate.
The invention also provides a system for dynamic differentiable space and architecture search, comprising:
module 1, for gathering a plurality of neural network architectures to be searched to obtain a search space, sharing weights among all the neural network architectures in the search space to generate a super network, and representing the directed acyclic graph of the super network as a matrix to obtain a space matrix, wherein the rows of the space matrix represent the edges of the graph and the columns represent candidate operations;
module 2, for calculating the upper confidence bound value of each element of the space matrix, sorting the upper confidence bound values of all the elements, and selecting the top-K largest to generate a subspace matrix, wherein the subspace matrix corresponds to a sampled subspace;
module 3, for optimizing the weight parameters, spatial parameters, and structure parameters of the candidate operations in the subspace matrix using a back-propagation algorithm, wherein the optimization runs for M steps and M controls the degree of convergence of the parameters in the subspace;
module 4, for obtaining the importance degrees of all candidate operations of each edge in the subspace matrix from the spatial parameters and structure parameters to construct an importance matrix, judging whether the current number of iterations (i.e. the number of samplings) has reached the preset number, running module 2 again if not until the preset number is reached, and then selecting for each edge the candidate operation with the greatest importance degree according to the importance matrix to form the final neural network architecture.
In the system for dynamic differentiable space and architecture search, the super network comprises the number of layers, the number of channels, the stride, the topology, and the candidate-operation information between nodes of the neural network.
In the system, the value of an element in the space matrix is 1 or 0, indicating whether the operation of that edge is a candidate operation of the search space.
In the system, the importance degree of each candidate operation of each edge in module 4 is computed as follows:
if the value at the corresponding position of the subspace matrix is 1, the importance degree at the i-th iteration equals the spatial probability × the structure conditional probability + the importance degree at the (i-1)-th iteration; otherwise, the importance degree at the i-th iteration equals the importance degree at the (i-1)-th iteration; the spatial probability is computed from the spatial parameters and the structure conditional probability from the structure parameters;
the specific process of calculating the upper confidence bound value in module 2 is as follows:
an upper confidence bound value is computed for each element of the space matrix as UCB = spatial probability + c × sqrt(ln T / t), where UCB is the upper confidence bound value; the spatial probability comes from an initial value at the first iteration and thereafter from the spatial parameters of module 3; c is a confidence coefficient; T is the total number of subspaces sampled; and t is the number of times the subspace matrix value of the operation has been 1;
the optimization process of module 3 is specifically as follows:
the set of candidate operations in the subspace matrix is O, with o denoting a particular operation and ō denoting the mixing operation; ω, ζ, and α are the weight, spatial, and structure parameters, and k indexes the k-th operation; the mixing operation between two nodes is computed as
ō(x) = Σ_{k=1..n} p_k · q_k · o_k(x),
where p_k and q_k respectively denote the spatial probability of the k-th operation and the structure conditional probability within this space, with p_k = sigmoid(ζ_k) and q_k = I(Z_k = 1) · exp(α_k) / Σ_{k'=1..n} I(Z_{k'} = 1) · exp(α_{k'}), and Z_k is the subspace matrix value of the k-th operation between the nodes; I is the indicator function, equal to 1 when the condition in parentheses holds and 0 otherwise; the weight, spatial, and structure parameters are obtained through back-propagation updates.
Compared with the prior art, the invention has the following advantages: it combines space sampling with differentiable search, performing differentiable search only on the sampled space while updating the probability distribution over the search space during the search to guide the next sampling. After multiple iterations the algorithm converges, and the final architecture is then determined from the corresponding parameters. Optimizing only a subspace thus accelerates the search, and sampling allows the search to proceed over multiple subspaces, escaping the local optima caused by differentiable optimization and finding a better network structure.
Detailed Description
The sample space is a subspace of the entire search space; from a set-theoretic perspective, it is a subset of the entire search space, the two standing in an inclusion relation. The sample space may be any non-empty subset of the entire search space. For a given sampling, the part of the search space outside the sampled subspace is simply not sampled at that time and may be sampled later; no part of the search space is permanently excluded from sampling.
In a first aspect, the present invention provides a method for searching a dynamically differentiable space and architecture, comprising the steps of:
Step 1: determine a search space and generate a weight-shared super network.
Determining the search space means determining the set of neural network architectures to be searched. For example, a network of 10 layers in which each layer may choose between 2 convolution kernels, such as 3 × 3 and 5 × 5, yields 2^10 = 1024 network structures in total. Specifically, the super network includes information such as the number of layers, the number of channels, the stride, the topology, and the candidate operations between nodes of the neural network. All structures in the set are then merged through a weight-sharing strategy to generate a super network, i.e. a directed acyclic graph. The graph can be represented by a matrix, referred to as the space matrix. The rows of the space matrix represent the edges of the graph and the columns represent the candidate operations. The value of each element is 1 or 0: 1 indicates that the operation of that edge is a candidate operation of the search space, 0 that it is not. The matrix is initialized to all 1s, representing the full space.
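As an illustration of the space-matrix representation just described, the following pure-Python sketch (names hypothetical) builds an all-ones matrix and counts representable architectures; with one operation chosen per edge, the count is the product of the per-row candidate counts, reproducing the 2^10 = 1024 of the example:

```python
# Space matrix for a cell with 6 edges and 3 candidate operations per edge
# (the 6 x 3 example from the detailed description): rows are edges,
# columns are candidate operations, and 1 admits the operation.
NUM_EDGES, NUM_OPS = 6, 3
space_matrix = [[1] * NUM_OPS for _ in range(NUM_EDGES)]

def count_architectures(matrix):
    """Number of architectures representable by a space matrix: one
    operation is chosen per edge, so multiply per-edge candidate counts."""
    total = 1
    for row in matrix:
        total *= sum(row)
    return total
```

For the 10-layer, 2-kernel example, `count_architectures([[1, 1]] * 10)` gives 1024.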
Step 2: sample a subspace from the search space, i.e. sample a subgraph from the directed acyclic graph corresponding to the super network.
Before sampling, the search space needs to be evaluated to balance exploration and exploitation of the space. The evaluation is based on the UCB (Upper Confidence Bound) principle, i.e. a UCB value is calculated for each element of the matrix: for each candidate operation, UCB = spatial probability + c × sqrt(ln T / t). The spatial probability comes from an initial value at the first step and thereafter from the spatial probability value of step 4; c is a confidence coefficient, for example 1.44; T is the total number of subspace samplings; and t is the number of times the subspace matrix value of the operation has been 1. Each sampling generates a space matrix in which K values are 1 and the rest are 0, and each position whose value is 1 is counted once; the accumulated count of a position over several samplings is the number of times its matrix value has been 1 (one position here corresponds to one operation). Then the UCB values of all elements are sorted and the top-K largest are selected. This yields a space matrix of 1s and 0s in which the K ones correspond to the K top-ranked positions. This space matrix is the subspace matrix, corresponding to the sampled subspace.
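The UCB ranking and top-K selection just described can be sketched as follows. This is a hedged illustration: the infinite bonus for never-sampled elements is an assumption made to guarantee initial exploration, and c defaults to the example value 1.44:

```python
import math

def ucb_topk(p, t, T, K, c=1.44):
    """Sketch of step 2: UCB = spatial probability + c * sqrt(ln T / t),
    where T is the total number of subspace samplings so far and t[e][o]
    counts how often that element's subspace-matrix value has been 1.
    Elements never sampled (t = 0) get an infinite bonus so they are
    explored first."""
    rows, cols = len(p), len(p[0])
    ucb = [[p[e][o] + (c * math.sqrt(math.log(T) / t[e][o])
                       if t[e][o] > 0 else float("inf"))
            for o in range(cols)] for e in range(rows)]
    # Sort all elements by UCB and mark the top-K positions with 1.
    ranked = sorted(((ucb[e][o], e, o)
                     for e in range(rows) for o in range(cols)), reverse=True)
    sub = [[0] * cols for _ in range(rows)]
    for _, e, o in ranked[:K]:
        sub[e][o] = 1
    return sub
```

With equal counts the ranking reduces to ranking by spatial probability, so the two highest-probability operations win when K = 2.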
Step 3: perform differentiable search on the subspace.
According to the subspace matrix of step 2, the method optimizes only those candidate operations in the super network whose subspace matrix value is 1. The optimization uses a back-propagation algorithm. The optimized parameters are of three types: the weight parameters (e.g. the parameters of a convolution kernel) of the candidate operations, the parameters of the spatial probability distribution, and the parameters of the structure conditional probability distribution given the space. The spatial and structure parameters also form matrices of the same size as the space matrix. The weight parameters make the super network converge and in turn affect the spatial and structure parameters.
Step 4: evaluate the importance degrees of the candidate operations and the UCB values of the search space.
The importance degree is the variable that determines the final network architecture. It also forms a matrix, called the importance matrix, of the same size as the space matrix, each value being a non-negative real number. The matrix elements are computed as follows: if the value at the corresponding position of the subspace matrix is 1, the importance degree at the i-th iteration equals the spatial probability × the structure conditional probability + the importance degree at the (i-1)-th iteration; otherwise it equals the importance degree at the (i-1)-th iteration. The spatial probability and structure conditional probability are computed from the spatial and structure parameters of step 3, respectively. In addition, the UCB value is the quantity that balances exploration and exploitation and guides the next spatial sampling; its calculation formula is given in step 2.
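The importance recurrence above amounts to accumulating the product p × q only at positions sampled into the subspace, and leaving other positions unchanged; a minimal sketch (names hypothetical):

```python
def update_importance(importance, subspace, p, q):
    """Sketch of the importance recurrence: where the subspace matrix is 1,
    importance_i = spatial probability * structure conditional probability
    + importance_{i-1}; elsewhere the previous importance is kept."""
    return [[importance[e][o] + (p[e][o] * q[e][o] if subspace[e][o] == 1 else 0.0)
             for o in range(len(importance[0]))]
            for e in range(len(importance))]
```

The function is pure: it returns the importance matrix of iteration i from that of iteration i-1, matching the recurrence term by term.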
In a second aspect, the present invention provides an apparatus for dynamic differentiable space and architecture search, comprising the following modules:
A. a super network generation module, for constructing a weight-shared super network from the determined search space. The super network is a directed acyclic graph and can be represented by a space matrix whose rows are edges and columns are candidate operations. Each value is 0 or 1: 1 indicates that the operation of that edge is a candidate operation of the search space, 0 that it is not. The initial values of the matrix are all 1, representing the full space.
B. a subspace sampling module, for exploring and exploiting the search space and extracting from it the subspace of greatest expected profit, i.e. the subspace with the largest UCB values, for searching.
C. a differentiable optimization module, for updating the weight, spatial, and structure parameters by gradient.
D. an importance evaluation module, for calculating and updating the importance degree of each candidate operation.
E. a network architecture selection module, for determining the final network architecture according to the importance degrees.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Example 1
FIG. 1 shows the search method for dynamic differentiable space and architecture provided by the present invention, which comprises the following steps:
S11: determine a search space, generate a weight-shared super network, and construct a matrix for the directed acyclic graph of the super network.
In an embodiment of this step: as shown in fig. 2, the super network is formed by stacking n layers of cells, of which there are two types, normal cells and down-sampling cells (reduction cells); the down-sampling cells are located at the n/3-th and 2n/3-th layers of the network, and the normal cells occupy the other layers. The inside of a cell is a directed acyclic graph comprising v nodes and e edges, and each edge carries m candidate operations, such as the zero operation, convolution operations, and pooling operations. Thus, once the topology of the cell and the types of candidate operations are determined, the search space is determined. Through weight sharing, the cell contains all search structures, which together constitute the complete cell. Meanwhile, the method represents the entire search space in matrix form, called the space matrix. The rows of the space matrix represent edges and the columns represent candidate operations. Each value is 1 or 0: 1 indicates that the operation of that edge is a candidate operation of the search space, 0 that it is not. The initial values of the matrix are all 1, representing the full space. In fig. 2, a cell is composed of 4 nodes, the number of edges is 6, and the number of candidate operations is 3, so the size of the matrix is 6 × 3. Note that since the super network is formed by stacking the two types of cells, it is not necessary to construct a matrix for the whole super network; it suffices to construct one matrix for each of the two cell types.
S12: sample a subspace from the search space, i.e. sample a subgraph from the directed acyclic graph corresponding to the super network, performing top-K selection based on the UCB values of the sampling.
In an embodiment of this step: as shown in fig. 3, before sampling, the search space needs to be evaluated to balance exploration and exploitation of the space. The evaluation is computed based on UCB (Upper Confidence Bound), i.e. a UCB value is calculated for each element of the matrix, where the UCB value comes first from an initial value and thereafter from the result of S14. Then the UCB values of all elements are sorted and the top-K largest are selected, generating a space matrix of 1s and 0s in which the number of 1s is K, corresponding to the top-K positions. This space matrix is referred to here as the subspace matrix; it represents a subgraph and corresponds to a subspace. Note that K in fig. 3 is 12, and the UCB values shown there are for illustration only, not necessarily values the method produces.
S13: perform differentiable search on the subspace, differentiably optimizing the weight, spatial, and structure parameters.
In an embodiment of this step: according to the space matrix of S12, the invention optimizes only the candidate operations in the super network whose space matrix value is 1. The optimization uses a back-propagation algorithm, and the optimized parameters are of three types: the weight parameters (e.g. the parameters of a convolution kernel) of the candidate operations, the parameters of the spatial probability distribution, and the parameters of the structure conditional probability distribution given the space. More precisely, let the set of candidate operations be O, with o a specific operation, i.e. an element of O, and ō the mixing operation; ω, ζ, α are the weight, spatial, and structure parameters, and k indexes the k-th operation. The mixing operation between two nodes is computed as
ō(x) = Σ_{k=1..n} p_k · q_k · o_k(x),
where x denotes a feature map, and p_k and q_k respectively denote the spatial probability of the k-th operation and the structure conditional probability within the space, computed as p_k = sigmoid(ζ_k) and q_k = I(Z_k = 1) · exp(α_k) / Σ_{k'=1..n} I(Z_{k'} = 1) · exp(α_{k'}). Here Z_k is the subspace matrix value of the k-th operation between the nodes, k' is an index subscript, and n is the number of operations. With this, the forward propagation of the network is determined, and the three kinds of parameters are then updated by the back-propagation algorithm.
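The forward computation of the mixing operation described above can be sketched as follows. This is a hedged illustration with scalar features standing in for feature maps and callables standing in for the candidate operations; a real implementation would use tensors and trainable parameters:

```python
import math

def mixed_op(x, ops, zeta, alpha, Z):
    """Sketch of the mixing operation between two nodes:
    o_bar(x) = sum_k p_k * q_k * o_k(x), with p_k = sigmoid(zeta_k) and
    q_k a softmax over alpha restricted (via the indicator on Z) to the
    subspace. `ops` is a list of n callables; x stands in for a feature map."""
    n = len(ops)
    p = [1.0 / (1.0 + math.exp(-z)) for z in zeta]           # spatial probabilities
    masked = [math.exp(alpha[k]) if Z[k] == 1 else 0.0 for k in range(n)]
    denom = sum(masked) or 1.0                               # guard: empty subspace
    q = [m / denom for m in masked]                          # structure conditional probs
    return sum(p[k] * q[k] * ops[k](x) for k in range(n))
```

Operations with Z_k = 0 contribute nothing, so only the sampled subspace participates in forward propagation, as the formula requires.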
S14: evaluate the candidate operation importance degrees and the UCB values of the search space; the importance degrees are calculated from the spatial and structure parameters optimized in S13, and the UCB values are calculated from the spatial parameters.
In an embodiment of this step: the importance degree is the variable used to determine the final network architecture; the importance matrix has the same size as the space matrix, and each of its values is a non-negative real number. The matrix elements are computed as follows: if the value at the corresponding position of the subspace matrix is 1, the importance degree at the i-th iteration equals the spatial probability × the structure conditional probability + the importance degree at the (i-1)-th iteration; otherwise it equals the importance degree at the (i-1)-th iteration. The spatial probability and structure conditional probability are computed from the spatial and structure parameters, respectively, as detailed in S13. Meanwhile, for each candidate operation a UCB value is calculated as UCB = spatial probability + c × sqrt(ln T / t), where c is a confidence coefficient, T is the total number of subspace samplings, and t is the number of times the subspace matrix value of the operation has been 1.
S15: judge whether the algorithm has reached the specified number of iterations; if not, jump to S12, and if so, jump to S16.
S16: select the final network architecture according to the importance degrees of the candidate operations, choosing on each edge the operation with the greatest importance degree.
In an embodiment of this step: the selection strategy is to choose, on each edge, the candidate operation with the greatest importance degree, so that the selected operations form the final network architecture. The implementation of this step is shown in fig. 4; note that the values of the importance matrix shown there are for illustration only, not necessarily values the method produces.
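The per-edge argmax selection of this step can be sketched as (names hypothetical):

```python
def select_architecture(importance):
    """Sketch of the final selection step: on each edge (row), keep the
    candidate operation (column) with the greatest accumulated importance."""
    return [max(range(len(row)), key=lambda o: row[o]) for row in importance]
```

The result is one operation index per edge, which together define the final network architecture.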
Example 2
The embodiment of the invention also provides an apparatus for dynamic differentiable space and architecture search, comprising: a super network generation module 21, a subspace sampling module 22, a differentiable optimization module 23, an evaluation module 24, and a network architecture selection module 25.
The super network generation module 21 integrates all network structures to be searched into a super network through weight sharing; the subspace sampling module 22 selects a subspace according to the top-K UCB values; the differentiable optimization module 23 differentiably optimizes the weight, structure, and spatial parameters in the generated sub-network; the evaluation module 24 evaluates the candidate operation importance degrees and the UCB values, the former used for network architecture selection and the latter for guiding the next subspace sampling; the network architecture selection module 25 selects the final network architecture according to the importance degrees.
The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation. The technical details mentioned in the above embodiment remain valid here and, to reduce repetition, are not described again; conversely, the technical details mentioned in this embodiment also apply to the above embodiment.
The invention also provides a system for dynamic differentiable space and architecture search, comprising:
module 1, for gathering a plurality of neural network architectures to be searched to obtain a search space, sharing weights among all the neural network architectures in the search space to generate a super network, and representing the directed acyclic graph of the super network as a matrix to obtain a space matrix, wherein the rows of the space matrix represent the edges of the graph and the columns represent candidate operations;
module 2, for calculating the upper confidence bound value of each element of the space matrix, sorting the upper confidence bound values of all the elements, and selecting the top-K largest to generate a subspace matrix, wherein the subspace matrix corresponds to a sampled subspace;
module 3, for optimizing the weight parameters, spatial parameters, and structure parameters of the candidate operations in the subspace matrix using a back-propagation algorithm;
module 4, for obtaining the importance degrees of all candidate operations of each edge in the subspace matrix from the spatial parameters and structure parameters to construct an importance matrix, judging whether the current number of iterations has reached the preset number, running module 2 again if not until the preset number is reached, and then selecting for each edge the candidate operation with the greatest importance degree according to the importance matrix to form the final neural network architecture.
In the system for dynamic differentiable space and architecture search, the super network comprises the number of layers, the number of channels, the stride, the topology, and the candidate-operation information between nodes of the neural network.
In the system, the value of an element in the space matrix is 1 or 0, indicating whether the operation of that edge is a candidate operation of the search space.
In the system, the importance degree of each candidate operation of each edge in module 4 is computed as follows:
if the value at the corresponding position of the subspace matrix is 1, the importance degree at the i-th iteration equals the spatial probability × the structure conditional probability + the importance degree at the (i-1)-th iteration; otherwise, the importance degree at the i-th iteration equals the importance degree at the (i-1)-th iteration; the spatial probability is computed from the spatial parameters and the structure conditional probability from the structure parameters;
the specific process of calculating the upper confidence bound value in module 2 is as follows:
an upper confidence bound value is computed for each element of the space matrix as UCB = spatial probability + c × sqrt(ln T / t), where UCB is the upper confidence bound value; the spatial probability comes from an initial value at the first iteration and thereafter from the spatial parameters of module 3; c is a confidence coefficient; T is the total number of subspaces sampled; and t is the number of times the subspace matrix value of the operation has been 1;
the optimization process of module 3 is specifically as follows:
the set of candidate operations in the subspace matrix is O, with o denoting a particular operation and ō denoting the mixing operation; ω, ζ, and α are the weight, spatial, and structure parameters, and k indexes the k-th operation; the mixing operation between two nodes is computed as
ō(x) = Σ_{k=1..n} p_k · q_k · o_k(x),
where p_k and q_k respectively denote the spatial probability of the k-th operation and the structure conditional probability within this space, with p_k = sigmoid(ζ_k) and q_k = I(Z_k = 1) · exp(α_k) / Σ_{k'=1..n} I(Z_{k'} = 1) · exp(α_{k'}), and Z_k is the subspace matrix value of the k-th operation between the nodes; the weight, spatial, and structure parameters are obtained through back-propagation updates.