CN111126599B

CN111126599B - A neural network weight initialization method based on transfer learning

Info

Publication number: CN111126599B
Application number: CN201911321102.2A
Authority: CN
Inventors: 范益波; 刘超
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2023-09-05
Anticipated expiration: 2039-12-20
Also published as: CN111126599A

Abstract

The invention belongs to the technical field of neural network models, and particularly relates to a neural network weight initialization method based on transfer learning. In the method, for a designated target task, a neural network model with higher complexity, namely a teacher model, is designed, the teacher model is trained, and after the training is finished, the weight initialization of the student model is guided by using the generated feature map; calculating the difference between the feature maps or mapping the feature maps into a regenerated kernel Hilbert space, calculating the difference of the feature maps in the regenerated kernel Hilbert space, and simplifying the calculation by adopting a kernel function method; the simple student model achieves a better weight initialization effect, and after the weight initialization is completed, the student model is generally trained, so that the student model achieves a better global convergence point, and the performance of the student model is more excellent. The invention can effectively improve the performance of the student model on the premise of not increasing the complexity of the student model.

Description

A neural network weight initialization method based on transfer learning

技术领域technical field

本发明属于神经网络模型技术领域，具体涉及一种基于知识迁移学习的神经网络权重初始化方法。The invention belongs to the technical field of neural network models, and in particular relates to a neural network weight initialization method based on knowledge transfer learning.

背景技术Background technique

神经网络在近年来取得了长足的发展，尤其是在计算机视觉领域和自然语言处理领域，其很多表现都已经超越人类，然而神经网络过高的计算量和过大的训练要求给神经网络在实际应用中造成了很大的障碍。因此如何使得一个轻量级的模型表现的更好成为了一个需要解决的热点问题。Neural networks have made great progress in recent years, especially in the fields of computer vision and natural language processing, and many of their performances have surpassed those of humans. application has caused great obstacles. Therefore, how to make a lightweight model perform better has become a hot issue that needs to be solved.

过去的几年中，很多研究者提出了各种各样的方案来帮助神经网络可以达到一个更好的收敛效果。其主要包括以下几类，一类是基于知识蒸馏和知识迁移，试图通过对学生模型的训练过程中添加一些额外的损失函数，使用一个训练好的老师模型来帮助学生模型表现地更加出色，从而在不增加学生模型的复杂度基础上达到提升模型的性能。第二类则是基于模型的量化的剪枝，通过对神经网络的权重进行量化，把原本32位的加减法变成8位乃至1位的加减法，从而大大的减少了神经网络的权重复杂度。从而减少了计算量。剪枝则是对神经网络连接的一些连接边进行直接删除，然后评估剪枝对模型的精度带来的损失是否是忽略不计的，从而达到了有效减少模型复杂度的效果。In the past few years, many researchers have proposed various schemes to help neural networks achieve a better convergence effect. It mainly includes the following categories. One is based on knowledge distillation and knowledge transfer. It tries to use a trained teacher model to help the student model perform better by adding some additional loss functions to the training process of the student model, thereby The performance of the model is improved without increasing the complexity of the student model. The second type is model-based quantitative pruning. By quantifying the weights of the neural network, the original 32-bit addition and subtraction is changed to 8-bit or even 1-bit addition and subtraction, thereby greatly reducing the neural network. weight complexity. Thereby reducing the amount of calculation. Pruning is to directly delete some connection edges connected by the neural network, and then evaluate whether the loss of the accuracy of the model caused by pruning is negligible, thereby achieving the effect of effectively reducing the complexity of the model.

发明内容Contents of the invention

本发明的目的在于提出一种在不增加模型复杂度的基础上，有效提升模型性能的神经网络权重初始化方法。The purpose of the present invention is to propose a neural network weight initialization method that can effectively improve the performance of the model without increasing the complexity of the model.

本发明提供的神经网络权重初始化方法，是基于知识迁移学习技术的，即对于一个复杂度较高的神经网络模型，称为老师模型(该复杂度较高的老师模型是难以在实际工程中应用的)；从比较复杂的老师模型学习到先验知识，依此帮助复杂度较低的神经网络模型(称为学生模型，其在实际应用中在复杂度和性能之间有较好的权衡)拥有一个良好的初始化状态，从而走出训练过程中的局部最优点，以达到更好的训练效果。The neural network weight initialization method provided by the present invention is based on knowledge transfer learning technology, that is, for a neural network model with high complexity, it is called a teacher model (the teacher model with high complexity is difficult to apply in actual engineering. ); learn prior knowledge from the more complex teacher model, and thus help the less complex neural network model (called the student model, which has a better trade-off between complexity and performance in practical applications) Have a good initialization state, so as to get out of the local optimal point in the training process, so as to achieve better training effect.

本发明中，首先对于指定任务，设计复杂度较高的神经网络模型即老师模型，并对老师模型进行训练，训练完成后，利用的产生特征图指导学生模型的权重初始化；通过计算特征图之间的差异，或者将特征图映射到再生核希尔伯特空间中，计算其在再生核希尔伯特空间中的差异，采用核函数的方法来简化计算；使简单的学生模型达到更好的权重初始化的效果，在权重初始化完成之后，再对学生模型进行一般的训练，使得学生模型达到更好的全局收敛点，其性能更加优异。In the present invention, firstly, for the specified task, a neural network model with high complexity is designed, namely the teacher model, and the teacher model is trained. After the training is completed, the generated feature map is used to guide the weight initialization of the student model; The difference between them, or map the feature map to the regenerating kernel Hilbert space, calculate the difference in the regenerating kernel Hilbert space, and use the kernel function method to simplify the calculation; make the simple student model better After the weight initialization is completed, the student model is generally trained, so that the student model reaches a better global convergence point, and its performance is more excellent.

本发明可以有效地使学生模型避开在初始训练中参数依赖导致的收敛到局部最优解的问题。The invention can effectively make the student model avoid the problem of converging to a local optimal solution caused by parameter dependence in the initial training.

本发明提出的神经网络权重初始化方法，具体步骤如下：The neural network weight initialization method that the present invention proposes, concrete steps are as follows:

(1)对于特定的学习任务，往往拥有常规的损失函数和模型结构，首先，针对目标任务，设计老师模型，并使用常规损失函数训练老师模型；(1) For a specific learning task, it often has a conventional loss function and model structure. First, design the teacher model for the target task, and use the conventional loss function to train the teacher model;

(2)然后将训练好的老师模型的中间层输出导出，通过映射方式得到特征图；其中，映射方式可以有注意力迁移【Sergey Zagoruyko and Nikos Komodakis,“Paying moreattention to attention:Improving the performance of convolutional neuralnetworks via attention transfer,”arXiv preprint arXiv:1612.03928,2016】，或者是使用核函数映射到再生核希尔伯特空间中【Zehao Huang and Naiyan Wang,“Like whatyou like:Knowledge distill via neuron selectivity transfer,”arXiv preprintarXiv:1707.01219,2017.】；具体为式(2)和式(3)所示；(2) Then export the output of the middle layer of the trained teacher model, and obtain the feature map through the mapping method; among them, the mapping method can have attention transfer [Sergey Zagoruyko and Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neuralnetworks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016], or use kernel functions to map to the regenerated kernel Hilbert space [Zehao Huang and Naiyan Wang, “Like what you like:Knowledge distill via neuron selectivity transfer,” arXiv preprint arXiv:1707.01219,2017.]; specifically shown in formula (2) and formula (3);

(3)设计一个结构较为简单的学生模型，其要求和老师模型具有相同的网络结构；即构成网络的基本网络层应当一致，比如网络结构都是采用了基于卷积层构成的串行连接网络，而老师模型的卷积层层数更多，特征图数量更多，学生模型的卷积层层数更少，特征图数量也更少；(3) Design a student model with a relatively simple structure, which requires the same network structure as the teacher model; that is, the basic network layers that make up the network should be consistent, for example, the network structure is based on a serial connection network composed of convolutional layers , while the teacher model has more convolutional layers and more feature maps, and the student model has fewer convolutional layers and fewer feature maps;

(4)将步骤(2)中计算好的特征图，和学生模型使用相同方式映射得到的特征图之间的均方误差作为损失函数，来训练学生模型；训练结束之后，得到的学生模型的权重并非是传统意义上的正态分布或者均匀分布的初始化，而是一种从老师模型中学习到知识来对学生模型的权重进行调整，从而使学生模型的权重进行特定的初始化，这样就使得学生模型拥有逼近老师模型性能的能力；(4) Use the mean square error between the feature map calculated in step (2) and the feature map obtained by mapping the student model in the same way as the loss function to train the student model; after the training, the obtained student model's The weight is not the initialization of normal distribution or uniform distribution in the traditional sense, but a kind of learning knowledge from the teacher model to adjust the weight of the student model, so that the weight of the student model can be initialized specifically, so that The student model has the ability to approach the performance of the teacher model;

(5)在初始化完成之后，最后再对学生模型使用常规的损失函数进行训练，得到一个可用的学生模型。(5) After the initialization is completed, the student model is finally trained using a conventional loss function to obtain a usable student model.

其中，所述常规的损失函数，是指在当前任务中常常使用到的均方误差。Wherein, the conventional loss function refers to the mean square error often used in the current task.

本发明中，老师模型的复杂度比学生模型要高，以此使得学生模型可以很好地学习到老师模型的特征。In the present invention, the complexity of the teacher model is higher than that of the student model, so that the student model can learn the characteristics of the teacher model well.

附图说明Description of drawings

图1本发明方法示意图。Fig. 1 schematic diagram of the method of the present invention.

图2实验的训练损失和测试损失(QP＝22)。Figure 2. Training loss and test loss (QP=22) for experiments.

图3实验的训练损失和测试损失(QP＝37)。Figure 3. Training loss and test loss (QP=37) for experiments.

具体实施方式Detailed ways

下面以在视频编码中基于神经网络的环路滤波任务为例，对本发明做进一步描述。The present invention will be further described below by taking the neural network-based loop filtering task in video coding as an example.

针对目标任务，需要在传统的视频编码器比如在HEVC中添加一个神经网络模块，该模块的功能是环路滤波。通过基于神经网络的环路滤波方法，来提升视频编码器的性能。可以理解为是降噪滤波问题，去除传统视频编码器带来的人工印记和噪声。我们首先设计一个复杂度较稿的老师模型，其复杂度应当比最终目标实际目标应用中的复杂度显著地高，比如其计算复杂度和消耗的计算资源是预期设计模型的2倍以上；使用常规的损失函数对老师模型进行训练，得到一个训练好的老师模型。针对基于神经网络的环路滤波任务，我们设计了一个卷积神经网络，其结构如图1，上半部分为设计的老师模型结构，下半部分为设计的学生模型结构。常规的损失函数在这里指的就是在当前任务中常常使用到的均方误差。这个损失函数被用来训练老师和学生模型。For the target task, it is necessary to add a neural network module to a traditional video encoder such as HEVC, and the function of this module is loop filtering. The performance of the video encoder is improved by a neural network-based in-loop filtering method. It can be understood as a noise reduction filtering problem, which removes artificial marks and noise brought by traditional video encoders. We first design a teacher model with relatively high complexity. Its complexity should be significantly higher than the complexity of the final target actual target application. For example, its computational complexity and computational resources consumed are more than twice that of the expected design model; use The conventional loss function trains the teacher model to obtain a trained teacher model. For the loop filtering task based on neural network, we designed a convolutional neural network, its structure is shown in Figure 1, the upper part is the designed teacher model structure, and the lower part is the designed student model structure. The conventional loss function here refers to the mean square error that is often used in the current task. This loss function is used to train the teacher and student models.

模型结构方面，我们采用深度可分离卷积和批规范化作为老师模型的主要层，其特征层数为64，卷积核大小为3x3。其中，使用24个深度可分离卷积层作为老师模型的主干，将其分为了三部分，第一部分由10个深度可分离卷积层构成，第二部分由8个深度可分离卷积层构成，第三部分由6个深度可分离卷积层构成。模型的最后一层选用普通卷积层，其特征层数为1，卷积核大小为1x1。所有的深度可分离卷积都有一个ReLU激活函数。模型的输入通过一个直连边连接到最后的输出，使得神经网络处于一个残差学习的状态，使得模型更快的收敛。In terms of model structure, we use depthwise separable convolution and batch normalization as the main layer of the teacher model, with a feature layer of 64 and a convolution kernel size of 3x3. Among them, 24 depth-separable convolution layers are used as the backbone of the teacher model, which is divided into three parts. The first part consists of 10 depth-separable convolution layers, and the second part consists of 8 depth-separable convolution layers. , the third part consists of 6 depthwise separable convolutional layers. The last layer of the model is an ordinary convolutional layer, the number of feature layers is 1, and the convolution kernel size is 1x1. All depthwise separable convolutions have a ReLU activation function. The input of the model is connected to the final output through a direct connection, so that the neural network is in a state of residual learning, which makes the model converge faster.

模型的输入输出分别为视频编码器的重建像素图和滤波后的像素图，损失函数L_T选用的是神经网络的输出和原始像素/>之间的均方误差：The input and output of the model are the reconstructed pixel map and the filtered pixel map of the video encoder respectively, and the loss function L _T is the output of the neural network and raw pixels /> The mean square error between:

在老师模型训练结束之后，从训练的数据集得到老师模型三个子部分的中间层输出，从神经网络注意力图或者是再生核希尔伯特空间的映射结果，计算出老师模型在这些地方的输出结果F_T。使用老师模型的中间层计算结果F_T和老师模型输入的数据集组成一组新的数据集，来对学生模型进行训练。After the training of the teacher model is completed, the output of the middle layer of the three sub-parts of the teacher model is obtained from the training data set, and the output of the teacher model in these places is calculated from the neural network attention map or the mapping result of the regenerated kernel Hilbert space Result F _T . A new data set is formed by using the calculation result F _T of the middle layer of the teacher model and the data set input by the teacher model to train the student model.

学生模型的构建需要类似老师模型，以此来保证知识迁移的成功。所以我们也采用了类似的网络结构，使用了9个深度可分离卷积层作为学生模型的主干，将其分为三部分，每一部分都由3个深度可分离卷积层构成。其特征层数为32，卷积核大小为3x3。同时因为学生模型和老师模型所设计的目标是一致的，所以其输入输出也保持一致，而且也采用了直连边的结构，用于学生模型很好的学习残差。The construction of the student model needs to be similar to the teacher model to ensure the success of knowledge transfer. So we also adopted a similar network structure, using 9 depth-separable convolutional layers as the backbone of the student model, and dividing it into three parts, each of which consists of 3 depth-separable convolutional layers. The number of feature layers is 32, and the convolution kernel size is 3x3. At the same time, because the design goals of the student model and the teacher model are the same, their input and output are also consistent, and the structure of directly connected edges is also used for the good learning residual of the student model.

使老师模型的中间层输出和学生模型的中间层输出经过相同的映射，计算其之间的均方误差等。其损失函数具体表现为公式(2)和(3)。计算其注意力图或者是使用线性核函数k(x,y)＝x^Ty来近似在再生核希尔伯特空间映射结果。其计算公式如下：Make the intermediate layer output of the teacher model and the intermediate layer output of the student model go through the same mapping, and calculate the mean square error between them, etc. Its loss function is specifically expressed as formulas (2) and (3). Computing its attention map or using a linear kernel function k(x, y) = x ^T y to approximate the result of mapping in the regenerating kernel Hilbert space. Its calculation formula is as follows:

这里，F_T,F_S分别表示老师和学生模型的注意力图，和/>则具体表示第i个特征图，C_T和C_S分别表示老师模型和学生模型的特征层数量。对于实际应用而言，p取正整数。于是，得到了一个初始化好的学生模型。Here, F _T , F _S denote the attention maps of the teacher and student models respectively, and /> Then specifically represent the i-th feature map, C _T and C _S respectively represent the number of feature layers of the teacher model and the student model. For practical applications, p takes a positive integer. Thus, an initialized student model is obtained.

然后，再使用如式(4)所示的标准均方误差L_S来对学生模型进行最终的训练，即可得到训练好的学生模型。相应的表示学生模型的输出。Then, use the standard mean square error _LS as shown in formula (4) to carry out final training on the student model, and the trained student model can be obtained. corresponding Represents the output of the student model.

在这样的初始化后，轻量级的模型往往可以比不使用初始化方法表现的更好。在得到了学生模型之后，将学生模型导回到视频编码器中，即可实现在视频编码中，基于神经网络的环路滤波。如图3则表现了实验时的训练和测试损失，可以明显的看出使用了初始化之后，损失函数降低地更快。After such initialization, lightweight models can often perform better than without initialization. After obtaining the student model, the student model can be imported back to the video encoder to realize the loop filtering based on the neural network in the video coding. As shown in Figure 3, the training and test losses during the experiment are shown. It can be clearly seen that after initialization is used, the loss function decreases faster.

Claims

1. A neural network-based loop filter optimization method in video coding, at first adding a neural network module in a traditional video encoder, the function of this module is a loop filter; for loop filter target tasks, the neural network The network module includes a more complex neural network model, the teacher model, and a simpler student model;

Train the teacher model. After the training is completed, use the generated feature map to guide the weight initialization of the student model; by calculating the difference between the feature maps, or mapping the feature map to the regenerating kernel Hilbert space, calculate its regeneration. The difference in the kernel Hilbert space, using the kernel function method to simplify the calculation; to make the simple student model achieve better weight initialization effect, after the weight initialization is completed, the student model is generally trained, so that the student model To achieve a better global convergence point, its performance is more excellent; the specific steps are as follows:

(1) For a specific learning task, it has a conventional loss function and model structure. First, design a teacher model for the target task, and use the conventional loss function to train the teacher model;

(2) Then export the output of the middle layer of the trained teacher model, and obtain the feature map through the mapping method; among them, the mapping method has attention transfer, or uses the kernel function to map to the regenerated kernel Hilbert space;

(3) Design a student model with a relatively simple structure, which requires the same network structure as the teacher model, that is, the basic network layer that constitutes the network is consistent; when the network structure is a serial connection network based on convolutional layers, The teacher model has more convolutional layers and more feature maps, while the student model has fewer convolutional layers and fewer feature maps;

(4) Use the mean square error between the feature map calculated in step (2) and the feature map obtained by mapping the student model in the same way as the loss function to train the student model; after the training, the obtained student model The weight of is to adjust the weight of the student model by learning knowledge from the teacher model, so that the weight of the student model can be initialized specifically, so that the student model has the ability to approach the performance of the teacher model;

(5) After the initialization is completed, the student model is trained using a conventional loss function to obtain a usable student model;

Using depth-separable convolution and batch normalization as the main layer of the teacher model, its feature layer number is 64, and the convolution kernel size is 3x3; among them, 24 depth-separable convolution layers are used as the backbone of the teacher model, and it is divided into For three parts, the first part consists of 10 depthwise separable convolutional layers, the second part consists of 8 depthwise separable convolutional layers, and the third part consists of 6 depthwise separable convolutional layers; the last layer of the model The ordinary convolution layer is selected, the number of feature layers is 1, and the size of the convolution kernel is 1x1; all depth-separable convolutions have a ReLU activation function; the input of the model is connected to the final output through a direct connection, so that the neural The network is in a state of residual learning, resulting in faster convergence;

The input and output of the model are the reconstructed pixel map and the filtered pixel map of the video encoder respectively, and the loss function L _T selects the output of the neural network and raw pixels /> The mean square error between:

After the training of the teacher model is completed, the output of the middle layer of the three sub-parts of the teacher model is obtained from the training data set, and the output of the teacher model in these places is calculated from the neural network attention map or the mapping result of the regenerated kernel Hilbert space Result F _T ; use the middle layer calculation result F _T of the teacher model and the data set input by the teacher model to form a new data set to train the student model;

The student model adopts a network structure similar to the teacher model, using 9 depth-separable convolutional layers as the backbone of the student model, and dividing it into three parts, each of which is composed of 3 depth-separable convolutional layers, and its feature layer The number is 32, and the size of the convolution kernel is 3x3; the goals designed by the student model and the teacher model are consistent, and their input and output are also consistent, and the structure of straight edges is also used for the learning residual of the student model;

Make the intermediate layer output of the teacher model and the intermediate layer output of the student model go through the same mapping, and calculate the mean square error between them; the loss function is specifically expressed as formulas (2) and (3); calculate its attention map or use The linear kernel function k(x, y)=x ^T y is used to approximate the result of mapping in the regenerating kernel Hilbert space; its calculation formula is as follows:

Here, F _T and F _S represent the attention maps of the teacher and student models respectively, f _T ⁱ and f _S ⁱ specifically represent the i-th feature map, C _T and C _S represent the number of feature layers of the teacher model and the student model, respectively, p takes a positive integer; thus, the initialized student model is obtained;

Use the standard mean square error _LS shown in formula (4) to perform final training on the initialized student model, that is, to obtain the trained student model; the corresponding Y _S ⁱ represents the output of the student model;

Import the student model back into the video encoder, that is, implement neural network-based loop filtering in video encoding.