Disclosure of Invention
Embodiments of the present application provide a method, an apparatus, and a device for self-attention mechanism characterization network search, aiming to accelerate the design process of self-attention mechanism characterization networks and to improve the design efficiency and effectiveness of deep models.
In a first aspect, an embodiment of the present application provides a method for self-attention mechanism characterization network search, where the method includes:
constructing a self-attention mechanism characterization network search space according to self-attention mechanism characterization networks with various architectures, wherein the self-attention mechanism characterization network search space comprises a plurality of sub-networks;
constructing a super network through the plurality of sub-networks;
and inputting data to be analyzed into the super network, and searching the corresponding self-attention mechanism characterization network from the self-attention mechanism characterization network search space through the super network.
Optionally, the method further comprises:
collecting training data for a target task, and putting the training data into a set to obtain a training data set;
and inputting the training data set into the super network, and training the super network to obtain the trained super network.
Optionally, constructing a self-attention mechanism characterization network search space according to self-attention mechanism characterization networks of various architectures includes:
obtaining a plurality of network fusion layers corresponding to each self-attention mechanism characterization network according to the structural features of the self-attention mechanism characterization networks of the plurality of architectures;
connecting the plurality of network fusion layers to obtain the plurality of sub-networks;
constructing the self-attention mechanism characterization network search space through the plurality of sub-networks.
Optionally, the method further comprises:
pruning the self-attention mechanism characterization network search space using a constraint strategy.
Optionally, the constraint strategy includes: a backbone constraint strategy, a key-value binding constraint strategy, and a zero-operation constraint strategy.
Optionally, constructing a super network by the plurality of sub-networks includes:
setting parameters that have the same context in the plurality of sub-networks as parameters at the same location in the super network;
setting parameters that have different contexts in the plurality of sub-networks as parameters at different locations in the super network.
Optionally, the method further comprises:
training the corresponding self-attention mechanism characterization network to obtain the trained self-attention mechanism characterization network.
Optionally, obtaining a plurality of network fusion layers corresponding to each self-attention mechanism characterization network according to the structural features of the self-attention mechanism characterization networks of the plurality of architectures includes:
obtaining a self-attention mechanism fusion layer corresponding to each self-attention mechanism characterization network according to the self-attention layers in the self-attention mechanism characterization networks with the plurality of architectures;
and obtaining an additive fusion layer corresponding to each self-attention mechanism characterization network according to other layers except the self-attention layer in the self-attention mechanism characterization networks of the plurality of architectures.
A second aspect of the embodiments of the present application provides an apparatus for self-attention mechanism characterization network search, where the apparatus includes:
the search space construction module is used for constructing a self-attention mechanism characterization network search space according to self-attention mechanism characterization networks with various architectures, and the self-attention mechanism characterization network search space comprises a plurality of sub-networks;
a super network construction module for constructing a super network through the plurality of sub-networks;
and the network obtaining module is used for inputting the data to be analyzed into the super network and searching the corresponding self-attention mechanism characterization network from the self-attention mechanism characterization network search space through the super network.
Optionally, the apparatus further comprises:
the training data set generating module is used for collecting training data for a target task and putting the training data into a set to obtain a training data set;
and the super network training module is used for inputting the training data set into the super network and training the super network to obtain the trained super network.
Optionally, the search space construction module includes:
the network fusion layer obtaining submodule is used for obtaining a plurality of network fusion layers corresponding to each self-attention mechanism characterization network according to the structural features of the self-attention mechanism characterization networks of the plurality of architectures;
a sub-network generation sub-module, configured to connect the multiple network fusion layers to obtain multiple sub-networks;
a search space construction sub-module for constructing the self-attention mechanism characterization network search space through the plurality of sub-networks.
Optionally, the search space construction module further includes:
and the space pruning submodule is used for pruning the self-attention mechanism characterization network search space by using a constraint strategy.
Optionally, the constraint strategy includes: a backbone constraint strategy, a key-value binding constraint strategy, and a zero-operation constraint strategy.
Optionally, the super network building module includes:
a first parameter setting sub-module for setting parameters that have the same context in the plurality of sub-networks as parameters at the same location in the super network;
a second parameter setting sub-module for setting parameters that have different contexts in the plurality of sub-networks as parameters at different locations in the super network.
Optionally, the apparatus further comprises:
and the network training module is used for training the corresponding self-attention mechanism characterization network to obtain the trained self-attention mechanism characterization network.
Optionally, the network fusion layer obtaining sub-module includes:
the self-attention layer obtaining submodule is used for obtaining a self-attention mechanism fusion layer corresponding to each self-attention mechanism characterization network according to the self-attention layers in the self-attention mechanism characterization networks of the plurality of architectures;
and the additive fusion layer obtaining submodule is used for obtaining the additive fusion layer corresponding to each self-attention mechanism characterization network according to other layers except the self-attention layer in the self-attention mechanism characterization networks of the plurality of architectures.
A third aspect of embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method according to the first aspect of the present application.
By adopting the method for self-attention mechanism characterization network search provided by the present application, a self-attention mechanism characterization network search space is constructed according to self-attention mechanism characterization networks with various architectures, where the self-attention mechanism characterization network search space comprises a plurality of sub-networks; a super network is constructed through the plurality of sub-networks; data to be analyzed is then input into the super network, and the corresponding self-attention mechanism characterization network is searched from the self-attention mechanism characterization network search space through the super network. According to the method, the self-attention mechanism characterization networks with various architectures are used as sub-networks to construct a self-attention mechanism characterization network search space, a super network is constructed through the sub-networks in the search space, and the super network can then be used to search the search space for the self-attention mechanism characterization network corresponding to the data to be analyzed. This achieves automatic search for the optimal self-attention mechanism characterization network architecture, saves the time required to construct a self-attention mechanism characterization network, and improves the efficiency of constructing deep neural networks.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a self-attention mechanism characterization network search method according to an embodiment of the present application. As shown in fig. 1, the method comprises the following steps:
S11: A self-attention mechanism characterization network search space is constructed according to self-attention mechanism characterization networks with various architectures, where the self-attention mechanism characterization network search space comprises a plurality of sub-networks.
In this embodiment, a self-attention mechanism characterization network is a deep neural network constructed based on the self-attention mechanism, and different self-attention mechanism characterization networks have different architectures. Therefore, a self-attention mechanism characterization network search space needs to be constructed with the self-attention mechanism characterization networks of multiple architectures as its sub-networks, where the self-attention mechanism characterization network search space is a set of self-attention mechanism characterization networks with multiple different architectures.
In this embodiment, the specific step of constructing a self-attention mechanism characterization network search space according to a self-attention mechanism characterization network with multiple architectures includes:
s11-1: and obtaining a plurality of network fusion layers corresponding to each self-attention mechanism characterization network according to the structural characteristics of the self-attention mechanism characterization networks of the plurality of architectures.
In this embodiment, the self-attention mechanism characterization networks with multiple architectures are designed self-attention mechanism characterization networks, and by observing the structural features of each self-attention mechanism characterization network, each self-attention mechanism characterization network can be abstracted into a uniform standard form, so that the self-attention mechanism characterization networks with multiple architectures can be conveniently placed into the same space in the following process.
In the embodiment, each self-attention mechanism does not represent the concept of network abstraction as layer and connection, and the construction task of the deep neural network is modeled as a multi-stack fusion layer selection problem. The network fusion layer is a layer structure abstracted from the structure of the self-attention mechanism characterization network. The network fusion layer is divided into two types, one type is an additive fusion layer (additive layer), and the other type is a self-attention mechanism fusion layer (attention layer), and the specific steps of obtaining a plurality of network fusion layers corresponding to each self-attention mechanism characterization network according to the structural characteristics of the self-attention mechanism characterization networks of the plurality of architectures are as follows:
s11-1-1: and obtaining a self-attention mechanism fusion layer corresponding to each self-attention mechanism characterization network according to the self-attention layers in the self-attention mechanism characterization networks of the plurality of architectures.
In this embodiment, the self-attention mechanism characterization network has several self-attention layers, and performs self-attention mechanism processing on the input vector, and each part of the processed vector is assigned with different weights. And (3) representing the self-attention layers in the network according to each self-attention mechanism, processing the self-attention layers, abstracting the self-attention layers into a standard expression, and further obtaining a corresponding self-attention mechanism fusion layer for constructing a self-attention mechanism search space.
In the present embodiment, the attention layer accepts three inputs as a query input, a key input, and a value input, and outputs a vector subjected to the self-attention mechanism processing, and performs an attention information fusion operation. Through the self-attention fusion layer, self-attention mechanism processing can be performed on input data.
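As a minimal illustrative sketch (not part of the disclosed embodiment), such an attention fusion layer could be written in PyTorch as follows, assuming standard multi-head scaled dot-product attention; the class name, dimensions, and hyper-parameters are assumptions:

```python
import torch
import torch.nn as nn

class AttentionFusionLayer(nn.Module):
    """Accepts a query input, a key input, and a value input, and outputs the
    vector produced by the attention information fusion operation."""

    def __init__(self, embed_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, query, key, value):
        # Fuse the three inputs with multi-head scaled dot-product attention.
        out, _ = self.attn(query, key, value)
        return out

# Example: fuse a 10-token query with 12-token keys/values.
q = torch.randn(1, 10, 64)
kv = torch.randn(1, 12, 64)
print(AttentionFusionLayer()(q, kv, kv).shape)  # torch.Size([1, 10, 64])
```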
S11-1-2: and obtaining an additive fusion layer corresponding to each self-attention mechanism characterization network according to other layers except the self-attention layer in the self-attention mechanism characterization networks of the plurality of architectures.
In this embodiment, the additive layer is obtained according to a non-attentive mechanism layer in the self-attentive mechanism characterization network, and mainly includes a feature extraction layer and the like.
In this embodiment, the additive layer receives two inputs and performs addition processing on the two operations, and outputs a vector after the addition processing.
Illustratively, the additive layer accepts input 1 and input 2, corresponding to operation 1 and operation 2, where operation 1 may be a feature extraction operation performed by one feature extraction network, and operation 2 may be a feature extraction operation performed by another feature extraction network, and after receiving operation 1 and operation 2, feature extraction results of operation 1 and operation 2 are merged.
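A corresponding minimal sketch of the additive fusion layer, where operation 1 and operation 2 are taken to be simple linear feature-extraction operations purely for illustration (in the embodiment they may be any feature-extraction operations):

```python
import torch
import torch.nn as nn

class AdditiveFusionLayer(nn.Module):
    """Accepts two inputs, applies operation 1 and operation 2, and merges the
    two results by addition."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.op1 = nn.Linear(embed_dim, embed_dim)  # stands in for one feature extraction operation
        self.op2 = nn.Linear(embed_dim, embed_dim)  # stands in for another feature extraction operation

    def forward(self, x1, x2):
        # Merge the two feature extraction results by addition.
        return self.op1(x1) + self.op2(x2)

x1 = torch.randn(1, 10, 64)
x2 = torch.randn(1, 10, 64)
print(AdditiveFusionLayer()(x1, x2).shape)  # torch.Size([1, 10, 64])
```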
S11-2: and connecting the plurality of network fusion layers to obtain the plurality of sub-networks.
In this embodiment, the sub-network is a network structure formed by connecting a plurality of network fusion layers, and a plurality of sub-networks are located in one search space, and the architecture of each sub-network is different. A plurality of sub-networks can be obtained by connecting a plurality of network fusion layers.
In this embodiment, the connection mode of each structure is abstracted from the structural features of the original self-attention mechanism characterization networks, and the network fusion layers are connected according to these connection modes to obtain the plurality of sub-networks.
For example, the sub-network may have a structure in which layer 1 is an attention layer, layers 2 and 3 are additive layers, and layers 4 and 5 are attention layers.
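Purely for illustration, such a sub-network structure could be encoded as a sequence of fusion-layer types together with their input connections; the encoding scheme below is an assumption and follows the connection pattern of the example in fig. 2:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayerSpec:
    kind: str                # "attention" or "additive"
    inputs: Tuple[int, ...]  # indices of earlier layers; -1 denotes the network input

# Example sub-network: layer 1 is an attention layer, layers 2-3 are additive
# layers, and layers 4-5 are attention layers.
example_sub_network: List[LayerSpec] = [
    LayerSpec("attention", (-1, -1, -1)),  # layer 1: query, key, value from the network input
    LayerSpec("additive",  (0, -1)),       # layer 2: output of layer 1 plus a further input
    LayerSpec("additive",  (1, -1)),       # layer 3: output of layer 2 plus a further input
    LayerSpec("attention", (2, 1, 1)),     # layer 4: query from layer 3, key/value from layer 2
    LayerSpec("attention", (3, 2, 2)),     # layer 5: query from layer 4, key/value from layer 3
]

# The search space is then simply a collection of such sub-network encodings.
search_space = [example_sub_network]
```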
S11-3: constructing the self-attention mechanism through the plurality of subnetworks characterizes a network search space.
In this embodiment, a plurality of subnetworks are added to the same space, so that a self-attention mechanism characterizing the network search space is constructed.
S12: a super network is constructed by the plurality of sub-networks.
In this embodiment, the super network is a network search model: when it receives data, it scores and ranks the sub-networks in the network search space, and the sub-network ranked first with the highest score is the sub-network most suitable for processing the data.
In this embodiment, by constructing all the sub-networks in the search space into a super network, all the sub-networks can be jointly optimized and updated, which greatly improves the efficiency with which the search model searches for an architecture.
In this embodiment, the specific steps of constructing a super network by the plurality of sub-networks are as follows:
s12-1: setting a parameter that is context-identical in the plurality of sub-networks to a parameter at a same location in the super-network.
S12-2: setting contextually different parameters in the plurality of sub-networks to parameters at different locations in the super-network.
In this embodiment, the context of a parameter refers to the type of fusion layer the parameter is connected to. Parameters with the same context in the plurality of sub-networks are set as parameters at the same position; that is, parameters of different sub-networks that have the same fusion layer type and the same connection mode are set at the same position, so that these parameters are shared. If the contexts are different, that is, the fusion layer types or connection modes in the sub-networks differ, the parameters are set as parameters at different positions in the super network.
Illustratively, of three sub-networks in the search space, the first layer of sub-network 1 is an attention layer and the second layer is an additive layer; the first layer of sub-network 2 is an additive layer and the second layer is an attention layer; the first layer of sub-network 3 is an attention layer and the second layer is an additive layer. The parameters of the first and second layers of sub-networks 1 and 3 are therefore set at the same positions and shared.
In the search space of the self-attention mechanism characterization network, there are multiple choices of fusion layers, and each fusion layer differs in function and in the connection types it requires, which makes the space heterogeneous. Directly constructing a super network and optimizing it cannot account for this heterogeneity, which homogenizes the super-network parameters and harms the search effect; the context-based parameter setting described above addresses this problem.
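A minimal sketch of this context-based parameter sharing, assuming that a parameter's context is summarized by its position, its fusion-layer type, and its connection mode (the keying scheme and all names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SharedParameterStore(nn.Module):
    """Hold one tensor per parameter context; sub-networks whose parameters
    share a context reuse the same tensor inside the super network."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.embed_dim = embed_dim
        self._params = nn.ParameterDict()

    def get(self, position: int, layer_kind: str, connection: str) -> nn.Parameter:
        key = f"{position}_{layer_kind}_{connection}"
        if key not in self._params:
            self._params[key] = nn.Parameter(torch.randn(self.embed_dim, self.embed_dim))
        return self._params[key]

store = SharedParameterStore()
# Sub-networks 1 and 3 (first layer: attention) reuse the same tensor ...
w_shared = store.get(0, "attention", "backbone")
assert w_shared is store.get(0, "attention", "backbone")  # same context -> shared parameter
# ... while sub-network 2 (first layer: additive) gets a parameter at a different position.
w_other = store.get(0, "additive", "backbone")
```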
S13: and inputting data to be analyzed into the super network, and searching the corresponding self-attention mechanism characterization network from the self-attention mechanism characterization network search space through the super network.
In this embodiment, the data to be analyzed is data that needs to be processed, the data to be analyzed is the same as a target task for which a data set of a training super network is directed, after the data to be analyzed is input into the super network, the super network calculates the data to be analyzed to obtain top n self-attention mechanism characterization networks with the highest score, and the self-attention mechanism characterization network with the highest score is the self-attention mechanism characterization network most suitable for processing the data to be analyzed, that is, the self-attention mechanism characterization network corresponding to the data to be analyzed.
Illustratively, the picture data set to be classified is input into a super network trained by the data set for picture classification, and the super network searches the optimal self-attention mechanism characterization network for classifying the picture to be classified from the search space.
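A minimal sketch of this search step; `evaluate` below is a placeholder for scoring one candidate architecture with the shared super-network weights (for example, its accuracy on the data to be analyzed), and all names are assumptions rather than part of the embodiment:

```python
def search_top_n(super_net, search_space, data_to_analyze, n=5):
    """Rank the candidate sub-networks with the trained super network and
    return the n highest-scoring architectures, best first."""
    scored = []
    for arch in search_space:
        # Placeholder: score one sub-network using the shared super-network weights.
        score = super_net.evaluate(arch, data_to_analyze)
        scored.append((score, arch))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [arch for _, arch in scored[:n]]
```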
In another embodiment of the present application, after the super network is established, the super network is trained to obtain a trained super network, and the specific steps are as follows:
s21: and collecting training data aiming at the target task, and putting the training data into a set to obtain a training data set.
S22: and inputting the training data set into the super network, and training the super network to obtain the trained super network.
In this embodiment, after the super network is constructed, it needs to be trained to optimize its parameters. Different training data need to be collected for different tasks, and data for different target tasks are put into different data sets, so that a plurality of training data sets for different types of tasks are obtained.
After a training data set is obtained, it is input into the super network and the super network is trained; according to the data in the training data set, the super network obtains the top N self-attention mechanism characterization networks corresponding to the data set. The training process continuously optimizes the parameters of the super network, and when the parameters of the super network are adjusted to the optimum, the effect of searching the search space for a self-attention mechanism characterization network is also optimal. Each data set for a specific task can train a super network for that task, the super network can select the optimal self-attention mechanism characterization network for the target task, and super networks trained on data sets of different tasks have different parameters.
Illustratively, the target task is a semantic recognition task. A training data set corresponding to the semantic recognition task is input into the super network, and the super network is trained to obtain a trained super network; when it receives a data set for the semantic recognition task, the trained super network can quickly obtain, from the search space, the top N self-attention mechanism characterization networks with the best recognition effect on the semantic recognition task. The principle of operation is the same for other tasks.
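As a minimal sketch, weight-sharing super networks of this kind are commonly trained by sampling one sub-network per batch and updating the shared parameters; whether the embodiment uses exactly this schedule is not stated, so the routine below (and the assumption that the super network is a module whose forward pass takes a sampled architecture) is illustrative only:

```python
import random
import torch

def train_super_network(super_net, search_space, train_loader, epochs=10, lr=1e-3):
    """Optimize the shared super-network parameters on a task-specific data set."""
    optimizer = torch.optim.Adam(super_net.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in train_loader:
            arch = random.choice(search_space)  # sample one sub-network for this step
            outputs = super_net(inputs, arch)   # forward pass through its shared weights
            loss = loss_fn(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return super_net
```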
In another embodiment of the present application, after the search space is established, pruning the space through a constraint policy is further required, which specifically includes:
s23: pruning the self-attention mechanism characterization network search space using a constraint strategy.
In this embodiment, the constraint policy is a standardized requirement that is provided for the network in the space, the network in the space can normally operate only if the constraint policy is satisfied, and pruning of the search space is achieved by setting the network in the search space according to the constraint policy.
S24: the constraint policy includes: a backbone constraint policy, a key-value binding constraint policy, and a zero-operation constraint policy.
In this embodiment, the backbone constraint strategy requires that every layer take one input from the output of the previous layer; such an inter-layer connection is called a backbone connection. For an attention layer this input is fixed to the query input, while for an additive layer it may be either of its inputs.
For example, if, in a sub-network in the search space, the first layer is an additive layer, the second layer is an attention layer, and the third layer is an additive layer, then the output of the first layer is the query input of the second layer, and the output of the second layer can be either input of the third layer.
The key-value binding constraint strategy requires that the key input and the value input of an attention layer come from the output of the same layer. Ensuring that the key and the value come from the same layer reduces the optimization difficulty of the model and ensures that the self-attention mechanism plays a meaningful role.
The zero-operation constraint strategy forces the operation on every backbone connection and every connection into an attention layer to be non-zero, because if the operation on such a connection were zero, the output would necessarily be zero and the generated deep neural network would be invalid; this strategy therefore avoids generating invalid deep neural networks.
With these three constraint strategies, the complexity of the search space is reduced from O(n!^4) to O(n!^2), which obviously greatly reduces the complexity of the algorithm and makes the search more tractable.
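A minimal sketch of checking the three constraint strategies against a candidate architecture, using an illustrative dictionary encoding of the layers (the encoding and the `ops` field are assumptions, not part of the embodiment):

```python
def satisfies_constraints(layers):
    """layers: list of dicts such as
    {"kind": "attention", "inputs": (q, k, v), "ops": ["conv", "identity"]},
    where input index -1 denotes the network input."""
    for i, layer in enumerate(layers):
        prev = i - 1  # index of the previous layer (-1 for the first layer means the network input)
        if layer["kind"] == "attention":
            q, k, v = layer["inputs"]
            if q != prev:   # backbone constraint: the query must come from the previous layer
                return False
            if k != v:      # key-value binding constraint: key and value from the same layer
                return False
        elif prev not in layer["inputs"]:  # backbone constraint for additive layers
            return False
        if "zero" in layer.get("ops", []):  # zero-operation constraint on these connections
            return False
    return True

candidate_space = [
    [{"kind": "additive", "inputs": (-1, -1)},
     {"kind": "attention", "inputs": (0, 0, 0)},
     {"kind": "additive", "inputs": (1, -1)}],      # satisfies all three strategies
    [{"kind": "additive", "inputs": (-1, -1)},
     {"kind": "attention", "inputs": (-1, 0, 1)}],  # violates backbone and key-value binding
]
pruned_space = [arch for arch in candidate_space if satisfies_constraints(arch)]
print(len(pruned_space))  # 1
```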
In another embodiment of the present application, after the corresponding self-attention mechanism characterization network is searched, the following steps are further performed:
s31: and training the corresponding self-attention mechanism characterization network to obtain the trained self-attention mechanism characterization network.
In this embodiment, after obtaining the corresponding self-attention mechanism characterization network from the search space, the network is trained using the corresponding data set, and parameters of the self-attention mechanism characterization network are adjusted to obtain the trained self-attention mechanism characterization network.
In the embodiment of the application, the design of the self-attention mechanism characterization network in a deep neural network is automated for the first time, and the optimal self-attention mechanism characterization network can be automatically generated from the network search space, so that the expressive capability of the deep neural network is greatly improved. Meanwhile, the time, resources, and labor costs of designing the neural network are greatly reduced, and the efficiency and effect of the design process are improved.
The embodiments of the present application will be described below with reference to the accompanying drawings:
Fig. 2 is a schematic diagram of a sub-network structure according to an embodiment of the present application. As shown in fig. 2:
The various markings in the figure are distinguished by dark and light shading. As can be seen from the left part of fig. 2, a sub-network is mainly composed of two types of network fusion layers, namely the additive fusion layer and the self-attention mechanism fusion layer. The additive fusion layer receives two inputs and obtains an output after fusing operation 1 and operation 2. The self-attention mechanism fusion layer receives a query input, a key input, and a value input, and produces an output.
The middle part of fig. 2 also illustrates the structure of a sub-network, in which layers 1, 4, and 5 are attention layers and layers 2 and 3 are additive layers. There are three kinds of connections, namely query connections, key connections, and value connections. Layer 1 receives three inputs; layer 2 receives the output of layer 1 and a further input; layer 3 receives the output of layer 2 and a further input; layer 4 receives the output of layer 3 and the key and value passed from layer 2; layer 5 receives the output of layer 4 and the key and value passed from layer 3.
As can be seen from the right part of fig. 2, layers 2, 4, and 6 of sub-network 1 (top) are additive layers, while layers 2 and 6 of sub-network 2 (bottom) are additive layers and layer 4 is an attention layer. Since layers 2 and 6 of sub-networks 1 and 2 are both additive layers, the parameters of layers 2 and 6 in the two sub-networks are set as parameters at the same positions and are shared. Since layer 4 of sub-network 1 is an additive layer while layer 4 of sub-network 2 is an attention layer, the parameters between layers 2 and 4 and between layers 4 and 6 are set as parameters at different positions and are not shared.
Based on the same inventive concept, an embodiment of the present application provides a self-attention mechanism characterization network search apparatus. Referring to fig. 3, fig. 3 is a schematic diagram of a self-attention mechanism characterization network searching apparatus 300 according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
a search space construction module 301, configured to construct a self-attention mechanism characterization network search space according to a self-attention mechanism characterization network with multiple architectures, where the self-attention mechanism characterization network search space includes multiple subnetworks;
a super network construction module 302 for constructing a super network through the plurality of sub-networks;
a network obtaining module 303, configured to input data to be analyzed into the super network, and search a corresponding self-attention mechanism characterization network from the self-attention mechanism characterization network search space through the super network.
Optionally, the apparatus further comprises:
the training data set generating module is used for collecting training data for a target task and putting the training data into a set to obtain a training data set;
and the super network training module is used for inputting the training data set into the super network and training the super network to obtain the trained super network.
Optionally, the search space construction module includes:
the network fusion layer obtaining submodule is used for obtaining a plurality of network fusion layers corresponding to each self-attention mechanism characterization network according to the structural features of the self-attention mechanism characterization networks of the plurality of architectures;
a sub-network generation sub-module, configured to connect the multiple network fusion layers to obtain multiple sub-networks;
a search space construction sub-module for constructing the self-attention mechanism characterization network search space through the plurality of sub-networks.
Optionally, the search space construction module further includes:
and the space pruning submodule is used for pruning the self-attention mechanism characterization network search space by using a constraint strategy.
Optionally, the constraint strategy includes: a backbone constraint strategy, a key-value binding constraint strategy, and a zero-operation constraint strategy.
Optionally, the super network building module includes:
a first parameter setting sub-module for setting parameters that have the same context in the plurality of sub-networks as parameters at the same location in the super network;
a second parameter setting sub-module for setting parameters that have different contexts in the plurality of sub-networks as parameters at different locations in the super network.
Optionally, the apparatus further comprises:
and the network training module is used for training the corresponding self-attention mechanism characterization network to obtain the trained self-attention mechanism characterization network.
Optionally, the network fusion layer obtaining sub-module includes:
the self-attention layer obtaining submodule is used for obtaining a self-attention mechanism fusion layer corresponding to each self-attention mechanism characterization network according to the self-attention layers in the self-attention mechanism characterization networks of the plurality of architectures;
and the additive fusion layer obtaining submodule is used for obtaining the additive fusion layer corresponding to each self-attention mechanism characterization network according to other layers except the self-attention layer in the self-attention mechanism characterization networks of the plurality of architectures.
Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the electronic device implements the steps in the method for characterizing a network search by a self-attention mechanism as described in any of the above embodiments of the present application.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The method, apparatus, and device for self-attention mechanism characterization network search provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the above description of the embodiments is only intended to help understand the method and its core idea; meanwhile, for a person skilled in the art, there may be changes in the specific implementation and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.