CN117035038A - Model pruning method, device, equipment and storage medium - Google Patents
- Publication number
- CN117035038A (application number CN202211176919.7A)
- Authority
- CN
- China
- Prior art keywords
- adjacent layers
- model
- pruning
- target
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application provides a model pruning method, device, equipment and storage medium, and relates to the machine learning field of artificial intelligence. The method comprises the following steps: determining a target loss function of a first model, wherein a corresponding pruning weight layer is arranged between each group of target adjacent layers of the first model, and the pruning weight layer comprises N first pruning weights; training the first model according to the target loss function, so that the N second pruning weights obtained by training the N first pruning weights corresponding to each group of target adjacent layers tend toward the two poles of their value range; and pruning the target input layer dimensions of the trained model, and pruning the connections between the output layer dimensions and the target input layer dimensions in the trained model, wherein a target input layer dimension is an input layer dimension whose corresponding second pruning weight is smaller than a preset threshold value. In this way, the machine cost and time delay of model training are reduced while model compression and model accuracy are guaranteed at the same time.
Description
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence (Artificial Intelligence, AI), in particular to a model pruning method, device, equipment and storage medium.
Background
The goal of model compression is to reduce the size of a model as much as possible while guaranteeing its prediction effect. Pruning techniques are an important means of achieving model compression. For example, for certain types of nodes in a model, such as neurons of a neural network layer or feature embedding dimensions of an embedding layer, if such nodes have little effect on the model output, they may be pruned to reduce the model size while still guaranteeing the model prediction effect.
The current L1-regularization pruning method comprises the following three steps: 1) training an initial model, in which each node or connection corresponds to a weight representing its importance; 2) pruning the nodes or connections whose corresponding weights are lower than a preset threshold value; 3) retraining the pruned model to compensate for the accuracy degradation caused by the pruning in step 2. In order to achieve a satisfactory model compression ratio and model accuracy, steps 2 and 3 typically need to be repeated a number of times.
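For illustration only, the conventional procedure can be sketched roughly as follows; this is a hedged Python/PyTorch sketch in which model.importance, train_fn and the threshold are hypothetical placeholders rather than part of any existing implementation:

```python
import torch

def l1_regular_pruning(model, train_fn, threshold=1e-2, rounds=5):
    """Rough sketch of the conventional iterative prune-retrain scheme.

    Assumes `model.importance` holds one importance weight per node or
    connection and `train_fn(model)` (re)trains the model; both are
    hypothetical placeholders used only to illustrate the three steps.
    """
    train_fn(model)                                  # step 1: train the initial model
    for _ in range(rounds):                          # steps 2 and 3 are repeated many times
        with torch.no_grad():
            mask = (model.importance.abs() >= threshold)
            model.importance.mul_(mask.to(model.importance.dtype))  # step 2: prune low-weight nodes
        train_fn(model)                              # step 3: retrain to recover accuracy
    return model
```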
Performing steps 2 and 3 above multiple times results in excessive machine cost and time for model training. This is especially true for recommendation models. On one hand, even a drop of a few ten-thousandths in the Area Under the Curve (AUC) metric can bring an obvious negative online effect; in other words, the accuracy requirement of a recommendation model is higher, so pruning it with the L1-regularization method requires even more repetitions of steps 2 and 3. On the other hand, a recommendation model is trained incrementally; in other words, its training samples keep growing and the model must be continuously trained on the newly added samples, which again requires more executions of steps 2 and 3. In short, pruning a recommendation model with the L1-regularization method makes the machine cost and time of model training excessive.
Disclosure of Invention
The application provides a model pruning method, device, equipment and storage medium, so as to reduce the machine cost and time delay of model training while guaranteeing model compression and model accuracy at the same time.
In a first aspect, an embodiment of the present application provides a model pruning method, the method comprising: determining a target loss function of a first model, wherein a corresponding pruning weight layer is arranged between each group of target adjacent layers in at least one group of target adjacent layers of the first model, the pruning weight layer comprises N first pruning weights, the N first pruning weights are in one-to-one correspondence with the input layer dimensions of the corresponding target adjacent layers, and N is an integer larger than 1; training the first model according to the target loss function, so that the N second pruning weights obtained by training the N first pruning weights corresponding to each group of target adjacent layers tend toward the two poles of their value range; and pruning the target input layer dimensions of the trained model, and pruning the connections between the output layer dimensions and the target input layer dimensions in the trained model, wherein a target input layer dimension is an input layer dimension whose corresponding second pruning weight is smaller than a preset threshold value.
In a second aspect, an embodiment of the present application provides a model pruning device, comprising a determining module, a training module and a pruning module. The determining module is used for determining a target loss function of a first model, wherein a corresponding pruning weight layer is arranged between each group of target adjacent layers in at least one group of target adjacent layers of the first model, the pruning weight layer comprises N first pruning weights, the N first pruning weights are in one-to-one correspondence with the input layer dimensions of the corresponding target adjacent layers, and N is an integer larger than 1. The training module is used for training the first model according to the target loss function, so that the N second pruning weights obtained by training the N first pruning weights corresponding to each group of target adjacent layers tend toward the two poles of their value range. The pruning module is used for pruning the target input layer dimensions of the trained model and pruning the connections between the output layer dimensions and the target input layer dimensions in the trained model, wherein a target input layer dimension is an input layer dimension whose corresponding second pruning weight is smaller than a preset threshold value.
In a third aspect, there is provided an electronic device comprising: a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory for performing the method as in the first aspect or in various implementations thereof.
In a fourth aspect, a computer-readable storage medium is provided for storing a computer program for causing a computer to perform the method as in the first aspect or in various implementations thereof.
In a fifth aspect, a computer program product is provided comprising computer program instructions for causing a computer to perform the method as in the first aspect or in various implementations thereof.
In a sixth aspect, a computer program is provided, the computer program causing a computer to perform the method as in the first aspect or in various implementations thereof.
According to the technical scheme provided by the embodiment of the application, firstly, the purpose of model compression can be achieved through the model pruning method. Secondly, the model pruning method provided by the embodiment of the application can guarantee the model accuracy while achieving model compression, that is, the model effect is not affected, because the target loss function includes the term -||γ - γ'||_1. The effect of -||γ - γ'||_1 is to push all γ_i toward the two poles as far as possible, which makes this term smaller; at the same time, such a bipolar distribution also fits the adversarial relation between the regularization term and the original loss, so the pruning weights close to 0 can approach 0 arbitrarily closely. The pruning weights therefore have higher precision, and the pruning does not cut off effective input layer dimensions or connections. In a word, because the model pruning method based on the bipolarization constraint can guarantee the model accuracy, the recommendation model does not need to be pruned and retrained multiple times, which in turn reduces the machine cost and time delay of model training while guaranteeing model compression and model accuracy at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 2A is a schematic diagram of a model according to an embodiment of the present application;
- FIG. 2B is a schematic diagram of another model according to an embodiment of the present application;
- FIG. 2C is a schematic diagram of another model according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for pruning a model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a model pruning method according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for pruning a model according to an embodiment of the present application;
fig. 6 is a schematic diagram of a model pruning device 600 according to an embodiment of the present application;
fig. 7 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The application relates to the technical field of artificial intelligence. Artificial intelligence is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning and other directions.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching learning.
Before describing the technical scheme of the application, the following first explains the related knowledge of the application:
recommendation system (Recommender System, RS): is an information filtering system for predicting a user's score or preference for an item.
Recommendation model: a model used in a recommendation system. For example, a Click-Through Rate (CTR) estimation model is a recommendation model; it predicts the click probability of each item according to given information such as the advertisement, the user and the context, and items can then be recommended to users based on the predicted clicks. The CTR estimation model may be a model using a Deep Neural Network (DNN) or a Deep Factorization Machine (DeepFM) algorithm, or the like.
Model compression (Model Compression, MC): the method is an information simplifying method, can compress a large model with more parameters and high complexity into a small model with less parameters and low complexity, and the compressed small model can obtain the prediction performance close to that of the large model.
The technical problems and the inventive concept to be solved by the technical scheme of the present application will be described below:
With the rapid development of deep learning technology in recent years, large-scale sparse models, i.e., recommendation models ranging in size from several gigabytes (GB) to several terabytes (TB), have been widely applied in numerous personalization scenarios, and a larger model usually implies a higher upper limit on the model effect. These models are typically composed of an embedding layer and a neural network layer; the number of features of the embedding layer can be on the order of trillions, and these features usually account for over 99% of the model parameters. The goal of model compression is to reduce the size of the model as much as possible while guaranteeing the model prediction effect. Pruning techniques are an important means of achieving model compression. At present, a model can be pruned by the L1-regularization pruning method, but this approach results in excessive machine cost and time for model training, especially for recommendation models, which have high accuracy requirements and are trained incrementally.
In order to solve the above technical problems, in the embodiments of the present application a pruning weight layer is added between adjacent layers of the model. The pruning weight layer comprises N pruning weights in one-to-one correspondence with the input layer dimensions of the adjacent layers. During model training, the N pruning weights tend toward the two poles of their value range; after training, the input layer dimensions whose pruning weights are smaller than a preset threshold, together with the connections between those input layer dimensions and the output layer dimensions, can be pruned. This approach does not require repeated pruning and retraining steps, so the machine cost and time delay of model training are reduced while model compression and model accuracy are guaranteed at the same time.
In some implementations of the embodiments of the application, the machine cost may include, but is not limited to, the cost of the Central Processing Unit (CPU), memory, and the like.
In some embodiments, a system architecture of an embodiment of the present application is shown in fig. 1.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application, which includes a user device 101, a data acquisition device 102, a training device 103, an execution device 104, a database 105, and a content library 106.
The data acquisition device 102 is configured to read training data from the content library 106, and store the read training data in the database 105. Taking a recommended scenario as an example, training data related to the embodiment of the present application may include: recommendation context related features, user related features, item related features, tags, etc., which may include: time, cell phone status, etc.; the user-related features may include: age, gender, preference, etc. of the user; the item-related features may include: attribute information of the article, etc.; the tag may be whether the user clicked on an item on the web page, etc. Wherein, training data can be divided into sparse features, continuous features and labels according to the input required format.
In some implementations of the embodiments of the present application, the publicly available MovieLens dataset may be used, which is a collection of historical ratings of movies by users and is released in different sizes named 1M, 10M and 20M, containing 1 million, 10 million and 20 million ratings respectively; in the embodiments of the present application the 1M dataset may be used. Further, the dataset may be divided into a training set and a test set at an 8:2 ratio. The training set may be denoted D_train = {(x_i, y_i)}, i = 1, 2, …, M, where x_i and y_i represent the input features and the corresponding label, respectively.
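A minimal sketch of this data preparation is given below; it assumes the ratings have already been converted into NumPy feature and label arrays, and the function name and random seed are illustrative only:

```python
import numpy as np

def split_movielens(features, labels, train_ratio=0.8, seed=0):
    """Split the MovieLens-1M samples into a training set and a test set at an 8:2 ratio."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))            # shuffle sample indices
    cut = int(train_ratio * len(labels))
    train_idx, test_idx = order[:cut], order[cut:]
    return (features[train_idx], labels[train_idx]), (features[test_idx], labels[test_idx])
```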
The training device 103 trains and prunes the recommendation model based on training data maintained in the database 105 so that the pruned recommendation model can output a prediction result, for example, a probability that a user clicks on an item. The pruned recommendation model can be applied to different recommendation systems, for example, recommendation systems such as e-commerce shopping, video or music recommendation, news information stream recommendation and the like.
In addition, referring to fig. 1, the execution device 104 is configured with an I/O interface 107, and performs data interaction with an external device. Such as receiving the recommended scene-related features, user-related features, item-related features sent by the user device 101 via the I/O interface. The calculation module 109 in the execution device 104 processes the input features using the pruned recommendation model, outputs the probability of the user clicking on the item, and sends the result to the user device 101 via the I/O interface.
The user device 101 may include a mobile phone, a tablet computer, a notebook computer, a palm computer, a mobile internet device (mobile internet device, MID), or other terminal devices with a browser installation function.
The execution device 104 may be a server.
By way of example, the server may be a rack server, a blade server, a tower server, or a rack server, among other computing devices. The server can be an independent test server or a test server cluster formed by a plurality of test servers.
In this embodiment, the execution device 104 is connected to the user device 101 through a network. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, a telephony network, etc.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawings does not constitute any limitation. In some embodiments, the data acquisition device 102 may be the same device as the user device 101, the training device 103, and the execution device 104. The database 105 may be distributed over one server or over a plurality of servers, and the content library 106 may be distributed over one server or over a plurality of servers.
The following describes the technical scheme of the embodiments of the present application in detail through some embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 2A is a schematic diagram of a model according to an embodiment of the present application. As shown in fig. 2A, the model includes: an embedding layer, a pruning weight layer and a neural network layer. The neural network layer includes at least one neural network sub-layer; for example, fig. 2A exemplarily shows two neural network sub-layers, namely neural network sub-layer 1 and neural network sub-layer 2. The pruning weight layer includes N first pruning weights, which are in one-to-one correspondence with the N feature embedding dimensions of the embedding layer, where N is an integer greater than 1. The embedding layer is used to reduce the dimension of the input features; for example, if an input feature is 1000-dimensional, multiplying it by a 1000×300 embedding matrix yields a 300-dimensional feature vector, and the feature embedding dimension refers to the dimension of the feature vector after this reduction, e.g., 300 dimensions. After the training device completes the training of the model, the second pruning weight corresponding to each first pruning weight, i.e., the value of that pruning weight after training, determines whether the feature embedding dimension corresponding to the pruning weight, and the connections between any neuron in the neural network layer and that feature embedding dimension, are pruned, thereby compressing the model.
It should be appreciated that in the model shown in fig. 2A, the embedding layer and neural network sub-layer 1 form a set of adjacent layers; the feature embedding dimensions of the embedding layer are referred to as the input layer dimensions of the set of adjacent layers, and the neuron dimensions of neural network sub-layer 1 are referred to as the output layer dimensions of the set of adjacent layers.
Fig. 2B is a schematic diagram of another model provided in an embodiment of the present application. As shown in fig. 2B, the model includes: an embedding layer, a pruning weight layer and a neural network layer, the neural network layer comprising at least one neural network sub-layer; for example, fig. 2B exemplarily shows two neural network sub-layers, namely neural network sub-layer 1 and neural network sub-layer 2. In the model shown in fig. 2B, the pruning weight layer is disposed between the two neural network sub-layers; it includes N first pruning weights, which are in one-to-one correspondence with the N dimensions of neural network sub-layer 1, and N is an integer greater than 1. After the training device completes the training of the model, the second pruning weight corresponding to each first pruning weight, i.e., the value of that pruning weight after training, determines whether the connections between the neuron in neural network sub-layer 1 corresponding to the pruning weight and any neuron in neural network sub-layer 2 are pruned, thereby compressing the model.
It should be understood that in the model shown in fig. 2B, neural network sub-layer 1 and neural network sub-layer 2 are a set of adjacent layers, the neuron dimensions of neural network sub-layer 1 in the adjacent layers being referred to as the input layer dimensions of the set of adjacent layers, and the neuron dimensions of neural network sub-layer 2 in the adjacent layers being referred to as the output layer dimensions of the set of adjacent layers.
Fig. 2C is a schematic diagram of another model according to an embodiment of the present application. As shown in fig. 2C, the model includes: an embedding layer, pruning weight layer 1, pruning weight layer 2 and a neural network layer, the neural network layer comprising at least one neural network sub-layer; for example, fig. 2C exemplarily shows two neural network sub-layers, namely neural network sub-layer 1 and neural network sub-layer 2. In the model shown in fig. 2C, pruning weight layer 1 includes N1 first pruning weights, which are in one-to-one correspondence with the N1 feature embedding dimensions of the embedding layer, and N1 is an integer greater than 1. Pruning weight layer 2 is disposed between the two neural network sub-layers; it includes N2 first pruning weights, which are in one-to-one correspondence with the N2 dimensions of neural network sub-layer 1, and N2 is an integer greater than 1. After the training device completes the training of the model, the second pruning weight corresponding to each first pruning weight, i.e., the value of that pruning weight after training, determines whether the feature embedding dimension corresponding to the pruning weight and its connections to any neuron in the neural network layer are pruned, and whether the connections between the neuron in neural network sub-layer 1 corresponding to the pruning weight and any neuron in neural network sub-layer 2 are pruned, thereby compressing the model.
It should be appreciated that in the model shown in fig. 2C, the embedding layer and neural network sub-layer 1 form one set of adjacent layers; the feature embedding dimensions of the embedding layer are referred to as the input layer dimensions of that set of adjacent layers, and the neuron dimensions of neural network sub-layer 1 are referred to as its output layer dimensions. Neural network sub-layer 1 and neural network sub-layer 2 form another set of adjacent layers; the neuron dimensions of neural network sub-layer 1 are referred to as the input layer dimensions of that set of adjacent layers, and the neuron dimensions of neural network sub-layer 2 are referred to as its output layer dimensions.
It should be appreciated that pruning weights may also be referred to as scale factors in embodiments of the present application.
In some implementations, the neural network layer may be a deep neural network (Deep Neural Network, DNN), which is a type of feed-forward neural network with deep architecture, one of the representative algorithms of the deep learning model.
In some implementations, the training device or the execution device may apply a normalization function to the output result of the neural network layer to obtain a prediction result, where the normalization function may be, but is not limited to, a sigmoid function.
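As a hedged illustration of the structure in fig. 2A, the following PyTorch-style sketch places a pruning weight layer between the embedding layer and neural network sub-layer 1; the layer sizes, the sum-pooling of the embeddings and the sigmoid output head are assumptions made for the example, not limitations of the embodiment:

```python
import torch
import torch.nn as nn

class PrunableModel(nn.Module):
    """Embedding layer -> pruning weight layer -> neural network sub-layers 1 and 2 (cf. fig. 2A)."""

    def __init__(self, num_features=1000, embed_dim=300, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(num_features, embed_dim)      # embedding layer
        # one first pruning weight per input-layer dimension of the target adjacent layers;
        # the initial values are 0, as described for objective function (3)
        self.pruning_logits = nn.Parameter(torch.zeros(embed_dim))
        self.sublayer1 = nn.Linear(embed_dim, hidden_dim)           # neural network sub-layer 1
        self.sublayer2 = nn.Linear(hidden_dim, 1)                   # neural network sub-layer 2

    def pruning_weights(self):
        # constrain each pruning weight to [0, 1] (the case a = 1)
        return torch.sigmoid(self.pruning_logits)

    def forward(self, feature_ids):
        e = self.embedding(feature_ids).sum(dim=1)   # pool the sparse features into one vector
        e = e * self.pruning_weights()               # scale every input-layer dimension
        h = torch.relu(self.sublayer1(e))
        return torch.sigmoid(self.sublayer2(h))      # normalized prediction (e.g., click probability)
```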
Fig. 3 is a flowchart of a model pruning method according to an embodiment of the present application, where the method may be performed by the training device in fig. 1, but is not limited thereto, and as shown in fig. 3, the method may include:
S1: determining a target loss function of the first model;
S2: training the first model according to the target loss function, so that the N second pruning weights obtained by training the N first pruning weights corresponding to each group of target adjacent layers tend toward the two poles of their value range;
S3: pruning the target input layer dimensions of the trained model, and pruning the connections between the output layer dimensions and the target input layer dimensions in the trained model, where a target input layer dimension is an input layer dimension whose corresponding second pruning weight is smaller than a preset threshold value.
The following description is given for S1:
in some implementations, the first model can be a recommendation model.
Corresponding pruning weight layers are arranged between each group of target adjacent layers in at least one group of target adjacent layers of the first model, each pruning weight layer comprises N first pruning weights, the N first pruning weights are in one-to-one correspondence with the input layer dimensions of the corresponding target adjacent layers, and N is an integer greater than 1.
It should be understood that the at least one set of target adjacent layers refers to adjacent layers provided with pruning weight layers. For example, in the model shown in fig. 2A, the first model includes a set of target adjacent layers, which are adjacent layers of the embedded layer and the neural network sublayer 1. For another example, in the model shown in fig. 2B, the first model includes a set of target adjacent layers that are adjacent layers of neural network sublayer 1 and neural network sublayer 2. For another example, in the model shown in fig. 2C, the first model includes two sets of target adjacent layers, respectively: an adjacent layer composed of the embedding layer and the neural network sublayer 1, and an adjacent layer composed of the neural network sublayer 1 and the neural network sublayer 2.
It should be appreciated that if the first model includes multiple sets of target adjacent layers, the number of pruning weights N included by the pruning weight layers between each set of target adjacent layers may be different.
In some implementations, S1 can include:
S11: determining an original loss function of the first model and the average pruning weight of the N first pruning weights corresponding to each group of target adjacent layers;
S12: determining a regularization term corresponding to each group of target adjacent layers according to the N first pruning weights and the average pruning weight corresponding to that group of target adjacent layers;
S13: determining the target loss function of the first model according to the original loss function and the regularization terms corresponding to each group of target adjacent layers.
The following description is made for S11:
It should be understood that the original loss function of the first model refers to a loss function obtained by considering only the prediction results and the labels of the first model; the original loss function may be a cross-entropy loss function or a mean-square-error loss function, etc., but is not limited thereto. For example, equation (1) gives a cross-entropy loss function:
loss_ctr = -(1/M)·Σ_{i=1}^{M} [ ŷ^(i)·log y^(i) + (1 - ŷ^(i))·log(1 - y^(i)) ]   (1)
where y^(i) represents the prediction result corresponding to the i-th training sample, ŷ^(i) represents the label, i.e., the true result, corresponding to the i-th training sample, and M represents the number of training samples.
For each set of adjacent layers in the at least one set of target adjacent layers, assume that the N first pruning weights corresponding to the set of adjacent layers are γ_1, γ_2, …, γ_N, where γ_i is the pruning weight applied to the i-th input dimension, γ_i ∈ [0, a] and a > 0. The average pruning weight of the N first pruning weights may then be calculated by equation (2):
γ' = (1/N)·Σ_{i=1}^{N} γ_i   (2)
in some implementations, the N first pruning weights are obtained by inputting N initial values to the objective function. For example, N first pruning weights may be obtained by the objective function (3):
y=activation(x) (3)
where x represents the initial value corresponding to a certain first pruning weight, and y represents the first pruning weight. Typically the initial value is 0. During model training, a can generally be set to 1; in order to limit the value of γ_i to this interval, two ways can be used: (1) using the output value of a sigmoid function directly; (2) using a ReLU function and truncating any output value greater than 1 to 1. That is, the activation in equation (3) may be a sigmoid function or a ReLU function, but is not limited thereto.
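The two ways of constraining γ_i can be sketched as follows; the parameter a defaults to 1 as in the description, and extending the sigmoid option to a general a by scaling is an assumption of this sketch:

```python
import torch

def pruning_weight(x, mode="sigmoid", a=1.0):
    """Map an unconstrained initial value x to a first pruning weight in [0, a]."""
    if mode == "sigmoid":
        return a * torch.sigmoid(x)               # way (1): use the sigmoid output value
    return torch.clamp(torch.relu(x), max=a)      # way (2): ReLU output truncated at a
```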
The following description is made for S12:
in one implementation, S12 may include:
S121-a: calculating the difference vector between a first vector formed by the N first pruning weights corresponding to each group of target adjacent layers and a second vector formed by the N average pruning weights corresponding to that group of target adjacent layers;
S122-a: calculating the norm of the difference vector corresponding to each group of target adjacent layers to obtain a first norm term corresponding to that group of target adjacent layers;
S123-a: determining the regularization term corresponding to each group of target adjacent layers according to the first norm term corresponding to that group of target adjacent layers.
The explanation is made for S121-a to S123-a:
It should be understood that, for each of the at least one set of target adjacent layers, assuming that the first vector formed by the N first pruning weights corresponding to the set of adjacent layers is γ and the second vector formed by the N average pruning weights is γ', the difference vector between the first vector and the second vector is γ - γ'.
In some implementations, the norms involved in the embodiments of the present application may be the L1 norm, the L2 norm, or other norms, which the embodiments of the present application do not limit. Taking the L1 norm as an example, the above first norm term may be ||γ - γ'||_1.
Wherein S123-a may be implemented by any of the following realizations, but is not limited thereto:
in one implementation manner, S123-a may further include:
S123-a-1: calculating the norm of the first vector corresponding to each group of target adjacent layers to obtain a second norm term corresponding to that group of target adjacent layers;
Accordingly, S123-a may include:
S123-a-2: determining the regularization term corresponding to each group of target adjacent layers according to the first norm term and the second norm term corresponding to that group of target adjacent layers.
The following is a description of S123-a-1 and S123-a-2:
It will be appreciated that, for each of the at least one set of target adjacent layers, assuming that the first vector formed by the N first pruning weights corresponding to the set of adjacent layers is γ, the corresponding second norm term of the set of adjacent layers may be ||γ||_1.
In one implementation manner, the training device may calculate a product of a second norm term corresponding to each set of target adjacent layers and a corresponding first factor to obtain a first product term corresponding to the set of target adjacent layers; and calculating the difference between the first product term corresponding to the group of target adjacent layers and the corresponding first norm term to obtain a regularization term corresponding to the group of target adjacent layers.
Assuming that the first factor corresponding to the set of target adjacent layers is denoted by t, the training device may obtain the regularization term corresponding to the set of target adjacent layers through equation (4), but is not limited thereto:
R_s(γ) = t·||γ||_1 - ||γ - γ'||_1   (4)
The first factor t is used to control the proportion of the N second pruning weights corresponding to the set of target adjacent layers that tend toward each of the two poles, so as to control the compression ratio of the first model, and t ∈ [-2, 2].
In some implementations, the first factor corresponding to the target adjacent layer is in a linear relationship with the compression ratio of the first model, which may specifically be a linear relationship as shown in formula (5):
wherein p represents the compression ratio of the first model, i.e., the proportion of all γ_i that tend toward a.
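Under the definitions above, the regularization term of equation (4) can be sketched as follows (a minimal illustration assuming the L1 norm; the function name is not part of the disclosure):

```python
import torch

def bipolar_regularizer(gamma, t):
    """Sketch of equation (4): R(gamma) = t * ||gamma||_1 - ||gamma - gamma'||_1."""
    gamma_mean = gamma.mean()                       # average pruning weight gamma'
    first_norm = (gamma - gamma_mean).abs().sum()   # first norm term ||gamma - gamma'||_1
    second_norm = gamma.abs().sum()                 # second norm term ||gamma||_1
    return t * second_norm - first_norm             # first product term minus first norm term
```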
In another implementation manner, the training device may also calculate a product of the first norm term corresponding to the set of target adjacent layers and the third factor to obtain a third product term corresponding to the set of target adjacent layers; and calculating the difference between the second norm term and the third product term corresponding to the set of target adjacent layers to obtain a regularization term corresponding to the set of target adjacent layers.
Assuming that the third factor corresponding to the set of target adjacent layers is represented by m, the training device may obtain the regularization term through equation (6), but is not limited thereto:
R_s(γ) = ||γ||_1 - m·||γ - γ'||_1   (6)
second, S123-a may include:
S123-a-1': obtaining the regularization term corresponding to each group of target adjacent layers by taking the negative of the first norm term corresponding to that group of target adjacent layers.
For example, for each of the at least one set of target adjacent layers, the regularization term corresponding to the set of adjacent layers may be calculated by the following formula (7), but is not limited thereto:
R_s(γ) = -||γ - γ'||_1   (7)
Another implementation of S12 is described below:
in some implementations, S12 may include:
S121-b: calculating the product of a first vector formed by the N first pruning weights corresponding to each set of target adjacent layers and a fourth factor corresponding to that set of target adjacent layers to obtain a fourth product term corresponding to that set of target adjacent layers, and calculating the product of a second vector formed by the N average pruning weights corresponding to that set of target adjacent layers and the fourth factor to obtain a fifth product term corresponding to that set of target adjacent layers;
S122-b: calculating the difference vector between the fourth product term and the fifth product term corresponding to the set of target adjacent layers;
S123-b: calculating the norm of the difference vector corresponding to the set of target adjacent layers to obtain a third norm term corresponding to the set of target adjacent layers;
S124-b: obtaining the regularization term corresponding to the set of target adjacent layers by taking the negative of the third norm term corresponding to the set of target adjacent layers.
In some implementations, the range of values of the fourth factor corresponding to the set of target adjacent layers may be [0, 1].
Assuming that the fourth factor corresponding to the set of target adjacent layers is represented by q, the regularization term corresponding to the set of target adjacent layers may be calculated by, but is not limited to, the following equation (8):
R_s(γ) = -||q·γ - q·γ'||_1   (8)
The following description is made for S13:
in one implementation, S13 may include:
S131-a: calculating the product of the regularization term corresponding to each group of target adjacent layers and the corresponding second factor to obtain a second product term corresponding to that group of target adjacent layers;
S132-a: calculating the sum of the original loss function and the second product terms corresponding to all groups of target adjacent layers to obtain the target loss function.
Assuming that the first model includes a set of target adjacent layers, and the second factor corresponding to the set of target adjacent layers is k, the target loss function may be calculated by the following formula (9), but is not limited thereto:
Loss = loss_ctr + k·R_s(γ)   (9)
where loss_ctr represents the original loss function of the first model and R_s(γ) represents the regularization term corresponding to the set of target adjacent layers. The second factor k is used to adjust the relative magnitudes of loss_ctr and R_s(γ). In general, k is adjusted so that the regularization term is about 1% to 5% of loss_ctr, which allows model compression and model accuracy to be guaranteed at the same time.
Assuming that the first model includes multiple sets of target adjacent layers, the second factor corresponding to the i-th set of target adjacent layers is k_i, i = 1, 2, …, P, where P represents the number of sets of target adjacent layers and P is greater than 1, the target loss function can be calculated by the following formula (10), but is not limited thereto:
Loss = loss_ctr + Σ_{i=1}^{P} k_i·R_i(γ)   (10)
where loss_ctr represents the original loss function of the first model and R_i(γ) represents the regularization term corresponding to the i-th set of target adjacent layers. The second factor k_i is used to adjust the relative magnitudes of loss_ctr and R_i(γ). In general, each k_i is adjusted so that the corresponding regularization term is about 1% to 5% of loss_ctr, which allows model compression and model accuracy to be guaranteed at the same time.
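Reusing the regularizer sketched earlier, the target loss of equations (9) and (10) can be assembled as follows (a sketch; the per-layer factors k_i are assumed to be tuned externally as described above):

```python
def target_loss(loss_ctr, gammas, ts, ks):
    """Loss = loss_ctr + sum_i k_i * R_i(gamma_i) over all P sets of target adjacent layers."""
    regularization = sum(k * bipolar_regularizer(g, t) for g, t, k in zip(gammas, ts, ks))
    return loss_ctr + regularization
```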
In another implementation, S13 may include:
S131-b: calculating the product of the original loss function and a fifth factor to obtain a sixth product term;
S132-b: calculating the sum of the sixth product term and the regularization terms corresponding to all groups of target adjacent layers to obtain the target loss function.
Assuming that the first model includes a set of target adjacent layers and s represents the fifth factor, the target loss function can be calculated by the following equation (11):
Loss = s·loss_ctr + R_s(γ)   (11)
where loss_ctr represents the original loss function of the first model and R_s(γ) represents the regularization term corresponding to the set of target adjacent layers. The fifth factor s is used to adjust the relative magnitudes of loss_ctr and R_s(γ). In general, s is adjusted so that the regularization term is about 1% to 5% of the original loss term, which allows model compression and model accuracy to be guaranteed at the same time.
Assuming that the first model includes multiple sets of target adjacent layers and s represents the fifth factor, the target loss function may be calculated by the following equation (12):
Loss = s·loss_ctr + Σ_{i=1}^{P} R_i(γ)   (12)
where loss_ctr represents the original loss function of the first model, R_i(γ) represents the regularization term corresponding to the i-th set of target adjacent layers, i = 1, 2, …, P, P represents the number of sets of target adjacent layers and is greater than 1, and the fifth factor s is used to adjust the relative magnitudes of loss_ctr and the regularization terms.
The following description is given for S2:
In the L1-regularization pruning method, having all γ_i approach 0 minimizes the L1 regularization loss. However, because the L1 regularization loss is in an adversarial relationship with the original loss of the first model (when the L1 regularization loss is small, the original loss is relatively large), not all γ_i approach 0 arbitrarily closely, and the pruning weights therefore have low precision, for example on the order of 10^-1: some pruning weights may be 0.1, some 0.2, some 0.5, and so on. When the training device prunes based on the preset threshold, the input dimensions corresponding to some of these pruning weights, and the connections from the output dimensions to those input dimensions, may be cut off. For example, with a preset threshold of 0.3, the feature embedding dimension corresponding to the weight 0.2, and the connections from the neurons to that dimension, would be pruned; yet that feature embedding dimension and those connections may well be an effective feature embedding dimension and effective connections, so this way of pruning causes some loss of model effect. This is why the L1-regularization pruning method requires pruning and retraining to be performed multiple times.
In the embodiment of the present application, the above target loss function includes the term -||γ - γ'||_1. When all γ_i are equal, this term reaches its maximum; when half of the γ_i are 0 and the other half are a, this term reaches its minimum. That is to say, the effect of -||γ - γ'||_1 is to push all γ_i toward the two poles as far as possible, which makes the term smaller, and at the same time this bipolar distribution also fits the adversarial relation between the regularization term and the original loss. Therefore, the pruning weights close to 0 can approach 0 arbitrarily closely, and the pruning weights have higher precision, for example on the order of 10^-4: some pruning weights may be 0.00002, some 0.00005, some 0.0098, and so on. Pruning based on the preset threshold then generally does not cut off effective input dimensions or effective connections. For example, when the model pruning method provided by the embodiment of the application is used, the preset threshold may be set to 0.0001; the training device prunes the feature embedding dimensions and connections corresponding to 0.00002 and 0.00005, but the feature embedding dimensions and connections corresponding to these two pruning weights have little influence on the model effect. This way of pruning can therefore reach the expected pruning ratio without causing a loss of model effect.
The following description is given for S3:
It should be appreciated that, for the input layer of each set of target adjacent layers, the training device may prune the target input layer dimensions in the trained model, and for the output layer of each set of target adjacent layers, the training device may prune the connections between each output layer dimension and the target input layer dimensions in the trained model. For example, assume that a pruning weight layer is disposed between the embedding layer of the first model and the lowest neural network sub-layer of the neural network layer, that three second pruning weights of 0.00002, 0.00005 and 0.0098 are obtained after training, and that the preset threshold is 0.0001. On this basis, the training device may prune the feature embedding dimensions corresponding to 0.00002 and 0.00005; and, assuming that there are connections between these two feature embedding dimensions and neuron 1 in the neural network layer, the training device may also prune the connections between the two feature embedding dimensions and neuron 1.
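A hedged sketch of this pruning step, applied to the PrunableModel sketched earlier, is shown below; it zeroes out the target input-layer dimensions and their connections instead of physically resizing the tensors, which an actual deployment would do by rebuilding smaller layers:

```python
import torch

@torch.no_grad()
def prune_model(model, threshold=1e-4):
    """Prune input-layer dimensions whose second pruning weight is below the preset threshold."""
    gamma = model.pruning_weights()              # second pruning weights after training
    pruned = gamma < threshold                   # target input-layer dimensions
    model.embedding.weight[:, pruned] = 0.0      # prune the feature embedding dimensions
    model.sublayer1.weight[:, pruned] = 0.0      # prune the connections from the output layer dimensions
    return int(pruned.sum())                     # number of input-layer dimensions pruned
```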
It should be understood that, in the embodiments of the present application, pruning may also be referred to as deleting or cutting off, and the embodiments of the present application are not limited thereto.
In some implementations, during model training the training device may use an optimizer such as an adaptive moment estimation (Adam) optimizer, a Stochastic Gradient Descent (SGD) optimizer, or an Adaptive Gradient (AdaGrad) optimizer to solve for the parameters of the first model, and may use Xavier initialization for the parameters of the first model. In the solving process, the training data are fed into the first model for training, model optimization is completed through error back-propagation, and the score-prediction objective and the weight bipolarization objective are optimized simultaneously.
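One possible way to implement this joint optimization, reusing the model and loss sketches above, is outlined below; the batch loader, epoch count and learning rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train_first_model(model, loader, ts, ks, epochs=3, lr=1e-3):
    """Jointly optimize the score-prediction objective and the weight bipolarization objective."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)        # Xavier initialization of the parameters
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for features, labels in loader:
            prediction = model(features).squeeze(-1)
            loss_ctr = bce(prediction, labels.float())    # original loss (cross entropy)
            gammas = [model.pruning_weights()]            # one set of target adjacent layers here
            loss = target_loss(loss_ctr, gammas, ts, ks)  # target loss of equation (9)/(10)
            optimizer.zero_grad()
            loss.backward()                               # error back-propagation
            optimizer.step()
    return model
```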
As described above, -||γ - γ'||_1 pushes all γ_i toward the two poles as far as possible; on this basis, driving all γ_i toward the two poles as far as possible is referred to as the bipolarization constraint. FIG. 4 is a schematic diagram of a model pruning method according to an embodiment of the present application. As shown in fig. 4, a calculation module in the training device trains the first model on the training data, calculates the target loss function of the first model based on the bipolarization constraint, and performs model optimization by error back-propagation; the trained model is then pruned to obtain the pruned first model, and the pruned model is finally deployed online. In fig. 4, D represents the neural network layer, E represents the embedding layer, D' represents the pruned neural network layer, and E' represents the pruned embedding layer.
According to the technical scheme provided by the embodiment of the application, first, the model pruning method can achieve the purpose of model compression, mainly for the following two reasons. (1) For the input layer of each set of target adjacent layers, the training device can prune the target input layer dimensions in the trained model, which reduces the model volume; in particular, when the input layer is an embedding layer, the number of features of the embedding layer can reach the trillion level and these features usually account for more than 99% of the model parameters, so pruning the embedding layer can greatly compress the model volume. (2) For the neural network layer, the training device can cut off the connections between the output layer dimensions and the target input layer dimensions in the trained model; since the forward computation of the neural network layer obtains the value of a neuron in the current layer from the input data of the previous layer and the weights on the connections, cutting off these connections can greatly reduce the computation amount and computational complexity of the model.
Secondly, the model pruning method provided by the embodiment of the application can guarantee the model accuracy while achieving model compression, that is, the model effect is not affected, because the target loss function includes the term -||γ - γ'||_1. This term pushes all γ_i toward the two poles as far as possible, which makes it smaller, and at the same time the bipolar distribution fits the adversarial relation between the regularization term and the original loss, so the pruning weights close to 0 can approach 0 arbitrarily closely. The pruning weights therefore have high precision, and this pruning method does not cut off effective input layer dimensions or connections.
In a word, because the model pruning method based on the bipolarization constraint can guarantee the model accuracy, the recommendation model does not need to be pruned and retrained multiple times, which in turn reduces the machine cost and time delay of model training while guaranteeing model compression and model accuracy at the same time; the method is therefore particularly suitable for recommendation models.
Thirdly, when the training set changes, the training result of the recommendation model differs; on this basis, the pruning weights obtained from different training sets differ, and the feature embedding dimensions and connections pruned by the training device differ accordingly. In other words, the embodiment of the application provides a dynamic pruning method; in a recommendation scenario, because the recommendation model is trained incrementally, this dynamic pruning method has stronger incremental applicability.
Fourthly, the model pruning method provided in the embodiment of the present application is embedded in a pluggable manner, that is, it can be quickly migrated to and used on an already trained model; for example, for a trained model, the model pruning method may be performed by treating the pruning weight layer of the model as being initialized with all weights equal to 1.
Fifthly, the training device prunes the model after model training and then deploys the pruned model online, so that the objectives of the training stage and the inference stage are consistent.
It should be understood that the at least one set of target adjacent layers may be all of the adjacent-layer sets in the first model, or only some of them; if only some of them are used, the training device needs to first select those adjacent-layer sets from all the adjacent-layer sets of the first model, which will be described in detail below:
in some implementations, as shown in fig. 5, the model pruning method before S1 above may further include:
S501: acquiring a second model;
S502: selecting at least one set of target adjacent layers in the second model;
S503: adding a corresponding pruning weight layer between each set of target adjacent layers to generate the first model.
It should be understood that the second model is the first model before the pruning weight layers are added; in other words, the second model differs from the first model in that the first model has pruning weight layers and the second model does not.
The following description is made for S502:
wherein S502 may be implemented by any one of the following realizations, but is not limited thereto:
in one implementation, the training device randomly selects at least one set of target adjacent layers in the second model.
In a second implementation manner, the training device selects, from the at least one group of adjacent layers in the second model, the adjacent layer formed by the embedding layer and the lowermost neural network sub-layer of the neural network layer as a target adjacent layer.
In a third implementation manner, the training device determines respective complexity of at least one set of adjacent layers in the second model; at least one set of target adjacent layers is selected based on respective complexity of the at least one set of adjacent layers.
The training device may determine the respective complexity of at least one set of adjacent layers in any one of the following ways, but is not limited thereto:
in one implementation, the training device may determine input layer dimensions and output layer dimensions for each set of adjacent layers; the complexity of the set of adjacent layers is determined based on the input layer dimensions and the output layer dimensions of the set of adjacent layers.
For example, the training device may calculate the complexity of each set of adjacent layers using equation (13), but is not limited thereto:
C=(2×I-1)×O (13)
wherein C represents the complexity of a set of adjacent layers, I represents the input layer dimensions of the set of adjacent layers, and O represents the output layer dimensions of the set of adjacent layers.
In another implementation, the training device may determine input layer dimensions for each set of adjacent layers; the input layer dimension of the set of adjacent layers is taken as the complexity of the set of adjacent layers.
In yet another implementation, the training device may determine the number of connections for each set of adjacent layers; the number of connections of the set of adjacent layers is taken as the complexity of the set of adjacent layers.
In some implementations, after the training device obtains the respective complexity of the at least one group of adjacent layers, the groups whose complexity is greater than a preset complexity may be selected as target adjacent layers; that is, for a group of adjacent layers with greater complexity, the training device may set a pruning weight layer between the two layers to implement pruning and thereby achieve model compression.
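To make the complexity-based selection concrete, the sketch below applies equation (13) and a preset complexity threshold; representing each group of adjacent layers by its (input dimension, output dimension) pair is an assumption made for illustration:

```python
def complexity(input_dim: int, output_dim: int) -> int:
    """Equation (13): C = (2 * I - 1) * O."""
    return (2 * input_dim - 1) * output_dim

def select_target_adjacent_layers(adjacent_layers, preset_complexity):
    """adjacent_layers: iterable of (input_dim, output_dim) pairs, one per group.
    Returns the indices of the groups whose complexity exceeds the preset value."""
    return [i for i, (in_dim, out_dim) in enumerate(adjacent_layers)
            if complexity(in_dim, out_dim) > preset_complexity]

# Usage: with groups of dimensions (16, 64) and (4, 8) and a preset complexity of 100,
# only the first group, with complexity (2*16-1)*64 = 1984, is selected.
targets = select_target_adjacent_layers([(16, 64), (4, 8)], 100)  # -> [0]
```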
According to the technical scheme provided by the embodiment of the application, the training device can select the target adjacent layers and set pruning weight layers between them; through the action of the pruning weight layers, model pruning is achieved and the model is thereby compressed.
Fig. 6 is a schematic diagram of a model pruning device 600 provided by an embodiment of the present application. The device 600 includes a determining module 610, a training module 620 and a pruning module 630. The determining module 610 is configured to determine a target loss function of a first model, where a corresponding pruning weight layer is disposed between each of at least one set of target adjacent layers of the first model, the pruning weight layer includes N first pruning weights, the N first pruning weights are in one-to-one correspondence with the input layer dimensions of the corresponding target adjacent layers, and N is an integer greater than 1. The training module 620 is configured to train the first model according to the target loss function, so that the N second pruning weights obtained after training the N first pruning weights corresponding to each set of target adjacent layers tend to be two poles. The pruning module 630 is configured to prune the target input layer dimensions of the trained model and prune the connections between the output layer dimensions and the target input layer dimensions in the trained model; a target input layer dimension is an input layer dimension whose corresponding second pruning weight is smaller than a preset threshold value.
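A sketch of what such a pruning step could look like for one group of target adjacent layers follows; it assumes the layer after the pruned input dimensions is a fully connected layer whose weight columns correspond one-to-one with those dimensions, and the value 0.5 for the preset threshold is purely illustrative:

```python
import torch
import torch.nn as nn

def prune_input_dimensions(second_pruning_weights: torch.Tensor,
                           back: nn.Linear,
                           threshold: float = 0.5) -> nn.Linear:
    """Drop every input layer dimension whose second pruning weight is below
    the preset threshold, together with its connections to the output layer
    dimensions (the corresponding weight columns of `back`)."""
    keep = second_pruning_weights >= threshold
    pruned = nn.Linear(int(keep.sum()), back.out_features, bias=back.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(back.weight[:, keep])   # keep only columns of retained dimensions
        if back.bias is not None:
            pruned.bias.copy_(back.bias)
    return pruned
```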
In some implementations, the determining module 610 is specifically configured to: determine an original loss function of the first model and the average pruning weights of the N first pruning weights corresponding to each group of target adjacent layers; determine a regularization term corresponding to each group of target adjacent layers according to the N first pruning weights and the average pruning weights corresponding to each group of target adjacent layers; and determine the target loss function of the first model according to the original loss function and the regularization terms corresponding to each group of target adjacent layers.
In some implementations, the determining module 610 is specifically configured to: calculate the difference vector between a first vector formed by the N first pruning weights corresponding to each group of target adjacent layers and a second vector formed by the N average pruning weights corresponding to each group of target adjacent layers; calculate the norm of the difference vector corresponding to each group of target adjacent layers to obtain a first norm term corresponding to each group of target adjacent layers; and determine the regularization term corresponding to each group of target adjacent layers according to the first norm term corresponding to each group of target adjacent layers.
In some implementations, the determining module 610 is further configured to: before determining the regularization term corresponding to each group of target adjacent layers according to the first norm term corresponding to each group of target adjacent layers, calculate the norm of the first vector corresponding to each group of target adjacent layers to obtain a second norm term corresponding to each group of target adjacent layers. Accordingly, the determining module 610 is specifically configured to: determine the regularization term corresponding to each group of target adjacent layers according to the first norm term and the second norm term corresponding to each group of target adjacent layers.
In some implementations, the determining module 610 is specifically configured to: calculate the product of the second norm term corresponding to each group of target adjacent layers and the corresponding first factor to obtain a first product term corresponding to each group of target adjacent layers; and calculate the difference between the first product term corresponding to each group of target adjacent layers and the corresponding first norm term to obtain the regularization term corresponding to each group of target adjacent layers. The first factor corresponding to each group of target adjacent layers is used to control the proportion of the corresponding N second pruning weights that tend to either one of the two poles, so as to control the compression proportion of the first model.
In some implementations, the first factor corresponding to each set of target adjacent layers is linearly related to the compression ratio of the first model.
In some implementations, the determining module 610 is specifically configured to: take the negative of the first norm term corresponding to each group of target adjacent layers to obtain the regularization term corresponding to each group of target adjacent layers.
In some implementations, the determining module 610 is specifically configured to: calculate the product of the regularization term corresponding to each group of target adjacent layers and the corresponding second factor to obtain a second product term corresponding to each group of target adjacent layers; and calculate the sum of the original loss function and the second product terms corresponding to each group of target adjacent layers to obtain the target loss function.
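Put together, the target loss described above could be sketched as follows; the symbols alpha (first factor), beta (second factor) and the per-group lists are notational assumptions, and the regularization term uses the first-product-term-minus-first-norm-term form:

```python
import torch

def regularization_term(gamma: torch.Tensor, alpha: float) -> torch.Tensor:
    """alpha * ||gamma||_1 - ||gamma - mean(gamma)||_1
    (first product term minus first norm term for one group of target adjacent layers)."""
    second_norm = torch.sum(torch.abs(gamma))                # second norm term ||gamma||_1
    first_norm = torch.sum(torch.abs(gamma - gamma.mean()))  # first norm term ||gamma - gamma'||_1
    return alpha * second_norm - first_norm

def target_loss(original_loss: torch.Tensor, gammas, alphas, betas) -> torch.Tensor:
    """Original loss plus the sum, over the groups of target adjacent layers,
    of the second product term (second factor * regularization term)."""
    loss = original_loss
    for gamma, alpha, beta in zip(gammas, alphas, betas):
        loss = loss + beta * regularization_term(gamma, alpha)
    return loss
```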
In some implementations, the apparatus 600 further includes: an acquisition module 640, a selection module 650, and a generation module 660. The acquisition module 640 is configured to acquire a second model before the determining module 610 determines the target loss function of the first model; the selection module 650 is configured to select at least one set of target adjacent layers in the second model; and the generation module 660 is configured to add a corresponding pruning weight layer between each set of target adjacent layers to generate the first model.
In some implementations, the selection module 650 is specifically configured to: determine the respective complexity of at least one set of adjacent layers in the second model; and select the at least one set of target adjacent layers according to the respective complexity of the at least one set of adjacent layers.
In some implementations, the selection module 650 is specifically configured to: determine the input layer dimension and the output layer dimension of each group of adjacent layers; and determine the complexity of each group of adjacent layers according to the input layer dimension and the output layer dimension of each group of adjacent layers.
In some implementations, the selection module 650 is specifically configured to: calculate the product of 2 and the input layer dimension of each group of adjacent layers to obtain a product result; calculate the difference between the product result and 1 to obtain a difference result; and calculate the product of the difference result and the output layer dimension of each group of adjacent layers to obtain the complexity of each group of adjacent layers.
In some implementations, the N first pruning weights corresponding to each set of target adjacent layers are obtained by inputting N initial values to the objective function.
In some implementations, the objective function is a sigmoid function or a ReLU function.
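For instance, with a sigmoid chosen as the objective function, the N first pruning weights could be obtained from the N initial values as in the sketch below; the choice of zero initial values and the helper name are assumptions for illustration:

```python
import torch

def first_pruning_weights(initial_values: torch.Tensor, fn: str = "sigmoid") -> torch.Tensor:
    """Map the N initial values through the objective function to obtain the N first pruning weights."""
    if fn == "sigmoid":
        return torch.sigmoid(initial_values)   # weights in (0, 1)
    return torch.relu(initial_values)          # ReLU alternative, weights in [0, +inf)

theta = torch.zeros(8, requires_grad=True)     # N = 8 trainable initial values
weights = first_pruning_weights(theta)         # all 0.5 before training
```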
It should be understood that the apparatus embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments; to avoid repetition, no further description is provided here. Specifically, the apparatus 600 shown in Fig. 6 may perform the method embodiment corresponding to Fig. 3, and the foregoing and other operations and/or functions of each module in the apparatus 600 respectively implement the corresponding flows of the methods in Fig. 3, which are not described here again for brevity.
The apparatus 600 of the embodiment of the present application has been described above in terms of functional modules with reference to the accompanying drawings. It should be understood that the functional modules may be implemented in hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the embodiments of the present application may be completed by integrated logic circuits of hardware in a processor and/or by instructions in software form, and the steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor or performed by a combination of hardware and software modules in a decoding processor. In some implementations, the software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware.
Fig. 7 is a schematic block diagram of an electronic device provided by an embodiment of the present application.
As shown in fig. 7, the electronic device may include:
A memory 710 and a processor 720, the memory 710 being configured to store a computer program and to transfer the program code to the processor 720. In other words, the processor 720 may call and run a computer program from the memory 710 to implement the method in the embodiment of the present application.
For example, the processor 720 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the application, the processor 720 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the application, the memory 710 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the application, the computer program may be partitioned into one or more modules that are stored in the memory 710 and executed by the processor 720 to perform the methods provided by the application. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are used to describe the execution of the computer program in the electronic device.
As shown in fig. 7, the electronic device may further include:
a transceiver 730, the transceiver 730 being connectable to the processor 720 or the memory 710.
The processor 720 may control the transceiver 730 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 730 may include a transmitter and a receiver. Transceiver 730 may further include antennas, the number of which may be one or more.
It will be appreciated that the various components in the electronic device are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a solid state disk (Solid State Disk, SSD)), or the like.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The above is only a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (15)
1. A method of pruning a model, comprising:
determining a target loss function of a first model, wherein a corresponding pruning weight layer is arranged between each group of target adjacent layers in at least one group of target adjacent layers of the first model, the pruning weight layer comprises N first pruning weights, the N first pruning weights are in one-to-one correspondence with the input layer dimensions of the corresponding target adjacent layers, and N is an integer larger than 1;
training the first model according to the target loss function so that N second pruning weights obtained after training N first pruning weights corresponding to each group of target adjacent layers tend to be two poles;
pruning the dimension of a target input layer of the trained model, and pruning the connection between the dimension of an output layer in the trained model and the dimension of the target input layer; the target input layer dimension is an input layer dimension of which the corresponding second pruning weight is smaller than a preset threshold value.
2. The method of claim 1, wherein determining the target loss function for the first model comprises:
determining an original loss function of the first model and average pruning weights of N first pruning weights corresponding to each group of target adjacent layers;
determining regularization terms corresponding to each group of target adjacent layers according to N first pruning weights and average pruning weights corresponding to each group of target adjacent layers;
and determining the target loss function of the first model according to the original loss function and regularization terms corresponding to each group of target adjacent layers.
3. The method of claim 2, wherein the determining the regularization term corresponding to each set of target adjacent layers according to the N first pruning weights and the average pruning weight corresponding to each set of target adjacent layers comprises:
calculating the difference vector of a first vector formed by N first pruning weights corresponding to each group of target adjacent layers and a second vector formed by N average pruning weights corresponding to each group of target adjacent layers;
calculating norms of difference vectors corresponding to each group of target adjacent layers to obtain first norm terms corresponding to each group of target adjacent layers;
and determining regularization terms corresponding to each group of target adjacent layers according to the first norm terms corresponding to each group of target adjacent layers.
4. The method of claim 3, wherein before determining regularization terms corresponding to each set of target adjacent layers from the first norm terms corresponding to each set of target adjacent layers, further comprising:
calculating norms of the first vectors corresponding to each group of target adjacent layers to obtain second norms corresponding to each group of target adjacent layers;
correspondingly, the determining the regularization term corresponding to each group of target adjacent layers according to the first norm term corresponding to each group of target adjacent layers includes:
and determining regularization terms corresponding to each group of target adjacent layers according to the first norm terms and the second norm terms corresponding to each group of target adjacent layers.
5. The method of claim 4, wherein the determining regularization term for each set of target adjacent layers based on the first and second norm terms for each set of target adjacent layers comprises:
calculating the product of a second norm term corresponding to each group of target adjacent layers and a corresponding first factor to obtain a first product term corresponding to each group of target adjacent layers;
calculating the difference between the first product term corresponding to each group of target adjacent layers and the corresponding first norm term to obtain a regularization term corresponding to each group of target adjacent layers;
the first factor corresponding to each group of target adjacent layers is used for controlling the proportion of the corresponding N second pruning weights that tend to either one of the two poles, so as to control the compression proportion of the first model.
6. The method of claim 5, wherein the first factor for each set of target adjacent layers is linearly related to the compression ratio of the first model.
7. A method according to claim 3, wherein said determining regularization term corresponding to each set of target adjacent layers from the first norm term corresponding to each set of target adjacent layers comprises:
and obtaining the regularization term corresponding to each group of target adjacent layers by taking the negative of the first norm term corresponding to each group of target adjacent layers.
8. The method of any of claims 2-7, wherein the determining the target loss function of the first model from the original loss function and regularization term corresponding to each set of target adjacent layers comprises:
calculating the product of regularization terms corresponding to each group of target adjacent layers and the corresponding second factors to obtain second product terms corresponding to each group of target adjacent layers;
and calculating the sum of the original loss function and the second product terms corresponding to each group of target adjacent layers to obtain the target loss function.
9. The method according to any one of claims 1-7, further comprising, prior to said determining the target loss function of the first model:
acquiring a second model;
selecting the at least one set of target adjacent layers in the second model;
and adding a corresponding pruning weight layer between each group of target adjacent layers to generate the first model.
10. The method of claim 9, wherein the selecting the at least one set of target adjacent layers in the second model comprises:
determining respective complexity of at least one set of adjacent layers in the second model;
the at least one set of target adjacent layers is selected based on respective complexity of the at least one set of adjacent layers.
11. The method of claim 10, wherein said determining respective complexity of at least one set of adjacent layers in the second model comprises:
determining the dimension of an input layer and the dimension of an output layer of each group of adjacent layers;
The complexity of each set of adjacent layers is determined based on the input layer dimensions and the output layer dimensions of each set of adjacent layers.
12. The method of claim 11, wherein determining the complexity of each set of adjacent layers based on the input layer dimensions and the output layer dimensions of each set of adjacent layers comprises:
calculating the product of 2 and the input layer dimension of each group of adjacent layers to obtain a product result;
calculating the difference between the product result and 1 to obtain a difference result;
and calculating the product of the difference result and the dimension of the output layer of each group of adjacent layers to obtain the complexity of each group of adjacent layers.
13. A model pruning device, comprising:
the determining module is used for determining a target loss function of a first model, a corresponding pruning weight layer is arranged between each group of target adjacent layers in at least one group of target adjacent layers of the first model, the pruning weight layer comprises N first pruning weights, the N first pruning weights are in one-to-one correspondence with the input layer dimensions of the corresponding target adjacent layers, and N is an integer greater than 1;
the training module is used for training the first model according to the target loss function so as to enable N second pruning weights obtained after training of N first pruning weights corresponding to each group of target adjacent layers to tend to be two poles;
The pruning module is used for pruning the dimension of the target input layer of the trained model and pruning the connection between the dimension of the output layer and the dimension of the target input layer in the trained model; the target input layer dimension is an input layer dimension of which the corresponding second pruning weight is smaller than a preset threshold value.
14. An electronic device, comprising:
a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory to perform the method of any of claims 1 to 12.
15. A computer readable storage medium for storing a computer program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 12.