CN105184303B

CN105184303B - An Image Annotation Method Based on Multimodal Deep Learning

Info

Publication number: CN105184303B
Application number: CN201510198325.XA
Authority: CN
Inventors: 朱松豪; 孙成建; 师哲
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Beijing Zhangying Information Technology Co.,Ltd.
Priority date: 2015-04-23
Filing date: 2015-04-23
Publication date: 2019-08-09
Anticipated expiration: 2035-04-23
Also published as: CN105184303A

Abstract

The invention discloses an image annotation method based on multimodal deep learning. The method comprises the following steps: first, a deep neural network is trained on unlabeled images; second, backpropagation is used to optimize each single modality; finally, an online-learning exponentiated gradient algorithm is used to optimize the weights across modalities. By applying convolutional neural network technology, the invention optimizes the parameters of the deep neural network and improves annotation accuracy. Experiments on public datasets show that the invention can effectively improve image annotation performance.

Description

An Image Annotation Method Based on Multimodal Deep Learning

Technical field

The invention relates to an image annotation method, in particular to an image annotation method based on multimodal deep learning, and belongs to the technical field of image processing.

Background

In recent years, as the number of images has grown rapidly, there is an urgent need for efficient annotation of image content so that large-scale image collections can be retrieved and managed effectively.

From the perspective of pattern recognition, image annotation is treated as assigning a set of labels to an image according to its content, and the choice of features used to represent the image content largely determines annotation performance. Because of the well-known semantic gap, existing techniques struggle to achieve satisfactory results in semantic image annotation. In recent years, Hinton et al. proposed using deep neural networks to learn features effectively from a training set. Different types of deep neural networks have been applied successfully to a variety of language and information retrieval tasks. Through deep architectures and deep learning, these methods discover hidden structure and effective feature representations in the training data and thereby improve system performance.

Summary of the invention

The object of the present invention is to provide an image annotation method based on multimodal deep learning. The method applies convolutional neural network technology, optimizes the parameters of the deep neural network, and improves annotation accuracy. Building on single-modality learning, the method realizes multimodal learning, which covers both the low-level features that represent an image, such as color, shape, or texture, and the similarity functions that measure the match between an image and a label, such as linear similarity, cosine similarity, and radial distance.

The technical solution adopted by the present invention is an image annotation method based on multimodal deep learning, comprising the following steps:

Step 1: Pre-train the node weights of the deep neural network on an unlabeled image sample set.

Step 2: Use the backpropagation algorithm to optimize the weights of each single modality.

Step 3: Use an online-learning exponentiated gradient algorithm to optimize the weights across modality combinations.

The deep neural network in step 1 is an eight-layer convolutional neural network, in which the first five layers are convolutional layers and the remaining three are fully connected layers. The output of the fully connected layers is fed to a Softmax classifier, which generates 1000 labeled categories. Both the pre-training and fine-tuning stages use the multinomial logistic regression objective.
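The Softmax classifier and the multinomial logistic regression objective named above can be illustrated with a minimal Python sketch. The function names and the three-class example scores are illustrative assumptions, not values from the patent.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_logistic_loss(scores, target):
    """Negative log-probability of the target class under softmax."""
    return -math.log(softmax(scores)[target])

# Hypothetical scores for three classes; the patent's classifier has 1000.
probs = softmax([2.0, 1.0, 0.1])
loss = multinomial_logistic_loss([2.0, 1.0, 0.1], target=0)
```

Minimizing this loss over the training set raises the probability the network assigns to the correct class.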

Among the convolutional layers described above, the first, second, and fifth layers are all normalization layers, and to maintain invariance all normalization layers use max pooling. In addition, rectified linear units (ReLU) are used as the nonlinear activation function in all convolutional and fully connected layers.

In the convolutional neural network used in the present invention, all input images are resized to 256×256. The first two convolutional filters are set to 7×7 and 5×5 respectively, with a stride of 2; filters of this type are used to capture information from all frequency bands, and the small stride avoids producing "dead features" that would affect the next layer. The last three convolutional layers are then connected in sequence, with a filter size of 3×3 and a stride of 1. Finally, each fully connected layer has an output size of 4096. In the pre-training stage, the dropout rate of the first two fully connected layers is set to 0.6.
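To make the stated filter sizes and strides concrete, the following Python sketch traces the spatial size of the feature map through the five convolutions. It assumes zero padding and ignores the max pooling steps, because the text gives neither the padding nor the pooling window; the layer names are hypothetical.

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1

# Kernel sizes and strides as stated in the text; names are illustrative.
CONV_LAYERS = [
    {"name": "conv1", "kernel": 7, "stride": 2},
    {"name": "conv2", "kernel": 5, "stride": 2},
    {"name": "conv3", "kernel": 3, "stride": 1},
    {"name": "conv4", "kernel": 3, "stride": 1},
    {"name": "conv5", "kernel": 3, "stride": 1},
]

def trace_sizes(input_size=256, padding=0):
    """Feature-map size after each convolution, ignoring pooling (assumption)."""
    sizes = []
    size = input_size
    for layer in CONV_LAYERS:
        size = conv_output_size(size, layer["kernel"], layer["stride"], padding)
        sizes.append((layer["name"], size))
    return sizes

sizes = trace_sizes()
```

Under these assumptions a 256×256 input shrinks to 125, 61, 59, 57, and 55 pixels per side after the five convolutions; any pooling in the actual network would reduce these further.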

The step of optimizing each single modality by backpropagation described in step 2 of the present invention comprises:

① Single-modality pre-training:

The convolutional neural network is pre-trained on the unlabeled training set to obtain an intermediate representation of the image objects and to initialize the network. The procedure is as follows. First, contrastive divergence is used to train the node weights W₁ between the input layer and the first convolutional layer. Then, the conditional probability of the nodes in the first convolutional layer is used as the input to the second convolutional layer:

p(Γ|x^j) = S(W₁, x^j) (1)

where x^j is the j-th feature vector, Γ is the label information, and S() is the similarity function given by the following formula:

Next, the first and second convolutional layers are combined to train the node weights W₂. The node weights of the remaining three convolutional layers and three fully connected layers are trained in the same way.
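The text does not say how contrastive divergence is adapted to convolutional layers. As a rough illustration of the layer-wise idea only, the following Python sketch runs one-step contrastive divergence (CD-1) on a tiny fully connected binary restricted Boltzmann machine without bias terms, then passes the hidden-unit probabilities on as the input for the next layer. The RBM formulation, the sizes, the learning rate, and the toy data are all assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(W, v):
    """p(h_k = 1 | v) for each hidden unit k; W[i][k] links visible i to hidden k."""
    return [sigmoid(sum(v[i] * W[i][k] for i in range(len(v))))
            for k in range(len(W[0]))]

def visible_probs(W, h):
    """p(v_i = 1 | h) for each visible unit i."""
    return [sigmoid(sum(h[k] * W[i][k] for k in range(len(h))))
            for i in range(len(W))]

def sample(probs, rng):
    """Draw a binary state for each unit from its activation probability."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_step(W, v0, lr, rng):
    """One CD-1 update of W in place from a single binary visible vector v0."""
    ph0 = hidden_probs(W, v0)          # positive phase
    h0 = sample(ph0, rng)
    v1 = visible_probs(W, h0)          # reconstruction (probabilities)
    ph1 = hidden_probs(W, v1)          # negative phase
    for i in range(len(W)):
        for k in range(len(W[0])):
            W[i][k] += lr * (v0[i] * ph0[k] - v1[i] * ph1[k])

rng = random.Random(0)
W = [[rng.gauss(0.0, 0.01) for _ in range(3)] for _ in range(4)]  # 4 visible, 3 hidden
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]               # toy inputs
for _ in range(100):
    for v in data:
        cd1_step(W, v, lr=0.1, rng=rng)

# Hidden probabilities of the trained layer become the input of the next layer.
layer1_output = [hidden_probs(W, v) for v in data]
```

Repeating the same procedure with `layer1_output` as training data would give the weights of the second layer, which mirrors the greedy layer-by-layer order described above.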

② Single-modality fine-tuning stage:

In the single-modality fine-tuning stage, the node weights are optimized by backpropagating the annotation error. From the perspective of pattern recognition, multi-label learning can be regarded as multi-task learning, so the overall annotation error of the convolutional neural network can be regarded as the sum of the individual label errors. The node weight optimization is illustrated below using the l-th label error as an example.

First, for an image x under its j-th feature modality x^j, the probability that it carries the l-th label Γ_l can be expressed by the posterior probability in the following formula:

where L denotes the number of labels.

Then, the KL divergence between the predicted probability and the reference probability is minimized. Assume that each image has multiple labels, represented by a vector y ∈ R^(1×c), where y_l = 1 means that the label set of image x contains the l-th label and y_l = 0 means that it does not. With q_il denoting the probability between image x_i and label l, the error of correctly assigning the l-th label to the image is:

The assignment error over all labels is:

Finally, backpropagation is used to update, in turn, the node weights of the other two fully connected layers and the five convolutional layers.
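The error formulas themselves are not reproduced in this text. As a minimal sketch of the quantity being minimized, the Python code below computes the KL divergence between a reference label distribution and a predicted one. Normalizing the binary label vector y so that it sums to 1 is an assumption, as are the example values.

```python
import math

def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

y = [1, 0, 1, 0]                 # hypothetical reference labels for one image
p_ref = normalize(y)             # assumed reference distribution
q_pred = [0.4, 0.1, 0.4, 0.1]    # hypothetical predicted label probabilities
error = kl_divergence(p_ref, q_pred)
```

The divergence is zero only when the predicted distribution matches the reference, so reducing it by backpropagation pushes probability mass toward the labels the image actually carries.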

The step of optimizing the weights across modality combinations described in step 3 of the present invention comprises:

For a multimodal deep network, another important task is to learn the optimal combination weights α = (α₁, α₂, …, α_n, …, α_N) across modalities, where each α_n is initialized to 1/N. The present invention uses an online-learning exponentiated gradient algorithm to optimize the combination weights across modalities:

where KL(·) denotes the KL divergence and h(α) denotes the hinge loss function:

where S_t is:

S_t = (S₁(x,Γ⁺) - S₁(x,Γ⁻), ..., S_N(x,Γ⁺) - S_N(x,Γ⁻))^T (8)

where the label Γ⁺ reflects the image content better than the label Γ⁻.

A first-order Taylor expansion of the function h(α) at α_t is used to simplify the optimization problem, so equation (8) can be written in first-order Taylor form:

If Γ⁺ and Γ⁻ are not ranked in the correct order, the value of the weights α is updated automatically.
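The update formulas are not reproduced in this text. The Python sketch below shows the standard exponentiated gradient step for a pairwise hinge loss, under two assumptions: that the hinge loss has the common form max(0, μ - α·S_t), and that a learning rate η (here `eta`) scales the step. The example scores and parameter values are illustrative.

```python
import math

def hinge_loss(alpha, s_t, margin):
    """max(0, margin - alpha . s_t); zero when the pair is ranked with enough margin."""
    score = sum(a * s for a, s in zip(alpha, s_t))
    return max(0.0, margin - score)

def eg_update(alpha, s_t, margin, eta):
    """One exponentiated gradient step; the weights stay positive and sum to 1."""
    if hinge_loss(alpha, s_t, margin) == 0.0:
        return list(alpha)  # pair already ranked correctly, no update
    # The gradient of the hinge loss with respect to alpha is -s_t, so each
    # weight is multiplied by exp(eta * s_n) and the result is renormalized.
    unnorm = [a * math.exp(eta * s) for a, s in zip(alpha, s_t)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

N = 3
alpha = [1.0 / N] * N            # initial weights 1/N, as stated in the text
s_t = [0.3, -0.2, 0.05]          # hypothetical per-modality score differences
alpha_new = eg_update(alpha, s_t, margin=0.18, eta=0.5)
```

Modalities that rank Γ⁺ above Γ⁻ (positive entries of S_t) gain weight, and modalities that rank the pair the wrong way lose weight. The multiplicative form is what a KL divergence regularizer on α produces, which keeps the weights on the probability simplex without a separate projection step.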

Beneficial effects:

1. The present invention optimizes the parameters of the deep neural network and improves annotation accuracy.

2. The present invention better demonstrates the effectiveness of image annotation based on a deep neural network learning model.

3. The present invention can effectively improve image annotation performance.

Description of drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 shows the deep neural network model of the present invention.

Fig. 3 shows example images from the natural scene image library used in the present invention.

Fig. 4 shows example images from the NUS-WIDE image library used in the present invention.

Fig. 5 shows example images from the IAPRTC-12 image database used in the present invention.

Fig. 6 is a schematic diagram of the results of different modality weight combinations on the three public image libraries.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention provides an image annotation method based on multimodal deep learning. The method comprises: first, training a deep neural network on unlabeled images; second, optimizing each single modality by backpropagation; finally, optimizing the weights across modalities with an online-learning exponentiated gradient algorithm.

The deep neural network in the present invention is a convolutional neural network, whose model structure is shown in Fig. 2. The performance of the proposed image annotation algorithm based on multimodal deep learning is evaluated through a series of experiments.

Step 1: Introduce the datasets used to evaluate the algorithm.

The experiments use three public image datasets: the natural scene image library shown in Fig. 3, the NUS-WIDE image library shown in Fig. 4, and the IAPRTC-12 image library shown in Fig. 5. The three libraries are described in detail below.

The natural scene image library contains 2000 images annotated with five labels: desert, mountain, sea, sunset, and trees. More than 20% of the images carry more than one label, and the average number of labels per image is 1.3. Fig. 3 shows two example images from the natural scene image library: Fig. 3(a) is labeled sunset and sea, and Fig. 3(b) is labeled mountain and trees.

The NUS-WIDE image library contains 30,000 images annotated with 31 labels, including boat, car, flag, horse, sky, sun, tower, airplane, and zebra. Fig. 4 shows two images from the NUS-WIDE image library: the labels of Fig. 4(a) include sky and airplane, and the labels of Fig. 4(b) include sea and sunset.

The IAPRTC-12 image database contains 20,000 images and 291 labels, with an average of 5.7 labels per image. Fig. 5 shows two example images from the IAPRTC-12 image database: the labels of Fig. 5(a) include brown, face, hair, man, and woman, and the labels of Fig. 5(b) include boat, lake, sky, and trees.

Step 2: Specify the visual features used to represent the images and the optimal parameters obtained by learning.

Feature selection has a large impact on system performance. The present invention uses the following global and local features as image descriptors:

Global features: (1) a 128-dimensional HSV color histogram and 225-dimensional LAB color moments, (2) a 37-dimensional edge direction histogram, (3) a 36-dimensional pyramid wavelet texture, (4) a 59-dimensional local binary pattern (LBP) descriptor, and (5) a 960-dimensional GIST descriptor.

Local features: two sampling methods and three local descriptors are used to extract local texture features. First, dense sampling and Harris corner detection are performed. Then, SIFT, CSIFT, and RGBSIFT features are extracted, and a codebook of 1000 visual words is built by k-means clustering. Next, a two-level spatial pyramid is used to build a 5000-dimensional vector for each image. Finally, TF-IDF weighting is used to generate the final bag of visual words. Throughout the experiments, all feature vectors are normalized to the range [0, 1].
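The last three stages of the local feature pipeline can be sketched in Python as follows. The sketch assumes a pyramid with 1×1 and 2×2 grids, which is consistent with the stated 5000 dimensions (1000 words × 5 cells), a TF-IDF variant with term frequency count/total and inverse document frequency ln(N/df), and min-max scaling for the [0, 1] normalization. None of these choices is confirmed by the text, and the three-word histograms are toy data.

```python
import math

def pyramid_cells(levels):
    """Number of grid cells in a spatial pyramid with 1x1, 2x2, 4x4, ... levels."""
    return sum(4 ** level for level in range(levels))

def tfidf(histograms):
    """TF-IDF weight a list of visual-word count histograms, one per image."""
    n_images = len(histograms)
    n_words = len(histograms[0])
    doc_freq = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    weighted = []
    for h in histograms:
        total = sum(h) or 1
        weighted.append([
            (h[w] / total) * math.log(n_images / doc_freq[w]) if doc_freq[w] else 0.0
            for w in range(n_words)
        ])
    return weighted

def min_max(vector):
    """Scale a vector linearly into [0, 1]; a constant vector maps to zeros."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        return [0.0 for _ in vector]
    return [(v - lo) / (hi - lo) for v in vector]

dims = 1000 * pyramid_cells(2)              # 1000 words x (1 + 4) cells
counts = [[3, 0, 1], [0, 2, 1], [1, 1, 1]]  # toy histograms over 3 visual words
features = [min_max(row) for row in tfidf(counts)]
```

In the toy data the third visual word occurs in every image, so its inverse document frequency is zero and it contributes nothing to any feature vector, which is the intended effect of TF-IDF weighting.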

For each query-label pair, formula (4) above gives three similarity measures, and the margin parameter μ is selected by cross-validation. After cross-validation, μ is 0.18 for the cosine similarity measure; μ is 1 for the linear similarity measure; and for the RBF similarity measure, σ is 2 and μ is 0.18.
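Formula (4) is not reproduced in this text. The Python sketch below gives the common textbook forms of the three named measures; in particular, the Gaussian form exp(-‖x - y‖² / (2σ²)) for the RBF measure is an assumption, with σ = 2 taken from the cross-validation result above.

```python
import math

def linear_similarity(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms."""
    norm = math.sqrt(linear_similarity(x, x)) * math.sqrt(linear_similarity(y, y))
    return linear_similarity(x, y) / norm if norm else 0.0

def rbf_similarity(x, y, sigma=2.0):
    """Gaussian (RBF) similarity; the exact form used in the patent is assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The cosine and RBF measures are bounded by 1 while the linear measure is unbounded, which is one plausible reason the selected margin μ differs between them.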

Step 3: Test the performance of the proposed algorithm through comparative experiments.

Algorithm comparison

The comparative experiments cover the following three image classification methods:

Lazy learning algorithm: first, for each test image, the K most similar images are found in the training image library; then, statistics of these K most similar images are collected; finally, labels are assigned to the test image by maximum a posteriori probability.

Deep representation and coding algorithm: a hierarchical model is used to learn pixel-level image representations for image annotation.

Method of the present invention: image annotation is performed by a deep neural network.
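The lazy learning baseline in the list above can be approximated by the following Python sketch. It is a simplification: it keeps the nearest-neighbour search and the label counting, but replaces the maximum a posteriori step with a plain vote threshold. The Euclidean distance, the threshold of 0.5, and the toy data are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_label_vote(test_vec, train_vecs, train_labels, k, threshold=0.5):
    """Assign every label carried by more than `threshold` of the k nearest images."""
    order = sorted(range(len(train_vecs)),
                   key=lambda i: euclidean(test_vec, train_vecs[i]))
    neighbours = order[:k]
    n_labels = len(train_labels[0])
    votes = [sum(train_labels[i][l] for i in neighbours) for l in range(n_labels)]
    return [1 if v / k > threshold else 0 for v in votes]

# Toy training set: two clusters, two possible labels per image.
train_vecs = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
train_labels = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
near_origin = knn_label_vote([0.2, 0.2], train_vecs, train_labels, k=3)
near_far = knn_label_vote([5.0, 5.5], train_vecs, train_labels, k=3)
```

A full maximum a posteriori version would also estimate, from the training set, how likely each neighbour count is for images that do and do not carry a label, rather than using a fixed threshold.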

Modality weights

In the method of the present invention, the combination weights α of the different modalities have a large impact on system performance. Fig. 6 shows the results of different modality weight combinations on the three public image libraries. Fig. 6(a): modality combination weights on the natural scene image library. Fig. 6(b): modality combination weights on the NUS-WIDE image library. Fig. 6(c): modality combination weights on the IAPRTC-12 image library.

The results in Fig. 6 clearly show that the proportions of the different modalities do not differ significantly. This means that every modality is more or less helpful for the different image categories, mainly because the three image libraries contain natural scene images of many different categories. It also further confirms the importance of obtaining the optimal combination of modalities.

Performance comparison

Table 1 gives the experimental comparison of several multi-label image annotation techniques that use different methods.

Table 1: Experimental comparison results.

The results in Table 1 show that the proposed method outperforms the other two existing methods in terms of NDCG@w, which confirms the effectiveness of image annotation based on a deep neural network learning model.
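The text does not define NDCG@w. Assuming w is the cut-off rank and relevance is binary (1 if a predicted label belongs to the image, 0 otherwise), the metric can be computed with the standard normalized discounted cumulative gain, as in this Python sketch.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items in ranked order."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG divided by the DCG of the ideal ordering; 0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places every correct label first scores 1, and the score falls as correct labels move down the list, so the metric rewards both finding the right labels and ranking them early.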

Claims

1. An image annotation method based on multimodal deep learning, characterized in that the method comprises the following steps:

Step 1: Pre-train the node weights of a deep neural network on an unlabeled image sample set. The deep neural network is an eight-layer convolutional neural network, in which the first five layers are convolutional layers and the remaining three are fully connected layers; the output of the fully connected layers is used as the input of a Softmax classifier, and the Softmax classifier generates 1000 labeled categories; both the pre-training and fine-tuning stages use the multinomial logistic regression objective;

The first, second, and fifth convolutional layers are all normalization layers, and to maintain invariance all normalization layers use max pooling; rectified linear units are used as the nonlinear activation function in all convolutional and fully connected layers;

In the convolutional neural network used, all input images are resized to 256×256; the first two convolutional filters are set to 7×7 and 5×5 respectively, with a stride of 2, where filters of this type are used to capture information from all frequency bands and the small stride avoids producing "dead features" that would affect the next layer; the last three convolutional layers are then connected in sequence, with a filter size of 3×3 and a stride of 1; finally, each fully connected layer has an output size of 4096, and in the pre-training stage the dropout rate of the first two fully connected layers is set to 0.6;

Step 2: Use the backpropagation algorithm to optimize the weights of each single modality, comprising:

① Single-modality pre-training:

The convolutional neural network is pre-trained on the unlabeled training set to obtain an intermediate representation of the image objects and to initialize the network, comprising: first, using contrastive divergence to train the node weights W₁ between the input layer and the first convolutional layer; then, using the conditional probability of the nodes in the first convolutional layer as the input to the second convolutional layer:

p(Γ|x^j) = S(W₁, x^j) (1)

where x^j is the j-th feature vector, Γ is the label information, and S() is the similarity function given by the following formula:

Next, the first and second convolutional layers are combined to train the node weights W₂; the node weights of the remaining three convolutional layers and three fully connected layers are trained in the same way;

② Single-modality fine-tuning stage:

In the single-modality fine-tuning stage, the node weights are optimized by backpropagating the annotation error. From the perspective of pattern recognition, multi-label learning is regarded as multi-task learning; the overall annotation error of the convolutional neural network is regarded as the sum of the individual label errors. The node weight optimization is described using the l-th label error, comprising:

First, for an image x under its j-th feature modality x^j, the probability that it carries the l-th label Γ_l is expressed by the posterior probability in the following formula:

where L denotes the number of labels;

Then, the KL divergence between the predicted probability and the reference probability is minimized; assuming that each image has multiple labels, represented by a vector y ∈ R^(1×c), where y_l = 1 means that the label set of image x contains the l-th label and y_l = 0 means that it does not, and with q_il denoting the probability between image x_i and label l, the error of correctly assigning the l-th label to the image is:

The assignment error over all labels is:

Finally, backpropagation is used to update, in turn, the node weights of the other two fully connected layers and the five convolutional layers;

Step 3: Use an online-learning exponentiated gradient algorithm to optimize the weights across modality combinations;

For a multimodal deep network, another important task is to learn the optimal combination weights α = (α₁, α₂, …, α_n, …, α_N) across modalities, where each α_n is initialized to 1/N; the online-learning exponentiated gradient algorithm is used to optimize the combination weights across modalities, comprising:

where KL(·) denotes the KL divergence and h(α) denotes the hinge loss function:

where S_t is:

S_t = (S₁(x,Γ⁺) - S₁(x,Γ⁻), ..., S_N(x,Γ⁺) - S_N(x,Γ⁻))^T (8)

where the label Γ⁺ reflects the image content better than the label Γ⁻;

A first-order Taylor expansion of the function h(α) at α_t is used to simplify the optimization problem, so equation (8) can be written in first-order Taylor form:

If Γ⁺ and Γ⁻ are not ranked in the correct order, the value of the weights α is updated automatically.

2. The image annotation method based on multimodal deep learning according to claim 1, characterized in that the method is applied to a convolutional neural network.