CN105184303B - An Image Annotation Method Based on Multimodal Deep Learning - Google Patents

Info

Publication number
CN105184303B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510198325.XA
Other languages
Chinese (zh)
Other versions
CN105184303A (en)
Inventor
朱松豪
孙成建
师哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhangying Information Technology Co.,Ltd.
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-04-23
Filing date
2015-04-23
Publication date
2019-08-09

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image annotation method based on multimodal deep learning. The method comprises the following steps: first, a deep neural network is trained on unlabeled images; second, each single modality is optimized by backpropagation; finally, the weights across modalities are optimized with an online-learning exponentiated-gradient algorithm. By applying convolutional neural network techniques, the invention optimizes the parameters of the deep neural network and improves annotation accuracy. Experiments on public datasets show that the invention effectively improves image annotation performance.

Description

An Image Annotation Method Based on Multimodal Deep Learning

Technical Field

The invention relates to an image annotation method, in particular to an image annotation method based on multimodal deep learning, and belongs to the technical field of image processing.

Background

In recent years, with the rapid growth in the number of images, there has been an urgent need for efficient annotation of image content so that large image collections can be retrieved and managed effectively.

From the perspective of pattern recognition, image annotation can be viewed as assigning a set of labels to an image according to its content, and the choice of features used to represent the image content largely determines annotation performance. Because of the well-known semantic gap, existing techniques struggle to achieve satisfactory results in semantic image annotation. In recent years, Hinton et al. proposed using deep neural networks to learn features effectively from a training set. Different types of deep neural networks have been applied successfully to a variety of language and information retrieval tasks. Through deep architectures and deep learning, these methods discover hidden structure and effective feature representations in the training data, which improves system performance.

Summary of the Invention

The object of the invention is to provide an image annotation method based on multimodal deep learning that applies convolutional neural network techniques, optimizes the parameters of the deep neural network, and improves annotation accuracy. Building on single-modality learning, the method realizes multimodal learning, which covers both the low-level features that represent an image, such as color, shape, or texture, and the similarity functions that measure the match between an image and a label, such as linear similarity, cosine similarity, and radial distance.

The technical solution adopted by the invention is an image annotation method based on multimodal deep learning that comprises the following steps:

Step 1: Pre-train the node weights of the deep neural network on an unlabeled image sample set.

Step 2: Optimize the weights of each single modality with the backpropagation algorithm.

Step 3: Optimize the weights of the modality combination with the online-learning exponentiated-gradient algorithm.

The deep neural network of Step 1 is an eight-layer convolutional neural network in which the first five layers are convolutional layers and the remaining three are fully connected layers. The output of the fully connected layers is the input to a Softmax classifier, which produces 1000 labeled classes. Both the pre-training and the fine-tuning stages use the multinomial logistic regression objective.

Among the convolutional layers, the first, second, and fifth layers are normalization layers, and, to preserve invariance, all normalization layers use max pooling. In addition, rectified linear units (ReLU) serve as the nonlinear activation function in all convolutional and fully connected layers.

In the convolutional neural network, all input images are resized to 256×256. The first two convolutional filters are 7×7 and 5×5, respectively, each with a stride of 2; filters of this type are used to capture information in all frequency bands, and the small stride avoids "dead features" that would affect the next layer of the network. The last three convolutional layers are connected in sequence, with 3×3 filters and a stride of 1. Finally, the output size of each fully connected layer is 4096. During pre-training, the dropout rate of the first two fully connected layers is set to 0.6.
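The layer geometry described above can be checked with a little arithmetic. The following is a minimal sketch, not the patent's implementation: it assumes zero padding and ignores the pooling and normalization stages (the text gives neither padding nor pooling sizes), and the names `conv_output_size`, `CONV_LAYERS`, and `feature_map_sizes` are hypothetical.

```python
def conv_output_size(input_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial size after one convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1


# (kernel, stride) of the five convolutional layers as stated in the text.
CONV_LAYERS = [(7, 2), (5, 2), (3, 1), (3, 1), (3, 1)]


def feature_map_sizes(input_size: int = 256, padding: int = 0) -> list:
    """Side length of the feature map after each convolutional layer.

    Assumption: no pooling between layers and the same padding everywhere.
    """
    sizes = []
    size = input_size
    for kernel, stride in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(feature_map_sizes())  # [125, 61, 59, 57, 55]
```

Under these assumptions a 256×256 input shrinks to 55×55 after the fifth convolution; any max pooling in the real network would reduce the sizes further.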

The step of optimizing each single modality by backpropagation, described in Step 2, comprises:

① Single-modality pre-training:

The convolutional neural network is pre-trained on the unlabeled training set to obtain an intermediate representation of the image objects and to initialize the network. The procedure is as follows. First, contrastive divergence is used to train the node weights $W_1$ between the input layer and the first convolutional layer. Then the conditional probability of the nodes of the first convolutional layer is used as the input to the second convolutional layer:

$p(\Gamma \mid x_j) = S(W_1, x_j)$ (1)

where $x_j$ is the $j$-th feature vector, $\Gamma$ is the label information, and $S(\cdot)$ is the similarity function given by the following formula:

Then the first and second convolutional layers are combined to train the node weights $W_2$; the node weights of the remaining three convolutional layers and the three fully connected layers are trained in the same way.

② Single-modality fine-tuning:

In the single-modality fine-tuning stage, the node weights are optimized by backpropagating the annotation error. From the perspective of pattern recognition, multi-label learning can be regarded as multi-task learning, so the overall annotation error of the convolutional neural network is the sum of the errors of the individual labels. The weight optimization is described below for the error of the $l$-th label.

First, for an image $x$ with representation $x_j$ under the $j$-th feature modality, the probability that it carries the $l$-th label $\Gamma_l$ is expressed by the posterior probability in the following formula:

where $L$ is the number of labels.

Then the KL divergence between the predicted probability and the reference probability is minimized. Assume that each image has multiple labels, represented by a vector $y \in \mathbb{R}^{1 \times c}$, where $y_l = 1$ means that the label set of image $x$ contains the $l$-th label and $y_l = 0$ means that it does not. With $q_{il}$ denoting the probability between image $x_i$ and label $l$, the error of correctly assigning the $l$-th label to the image is:

The assignment error over all labels is:

Finally, backpropagation is used to update, in turn, the node weights of the other two fully connected layers and the five convolutional layers.
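The fine-tuning objective above compares predicted label probabilities with reference labels through a KL divergence. Because the patent's own formulas are not reproduced in this text, the sketch below is only an illustration under stated assumptions: the posterior over the $L$ labels is taken to be a softmax of per-label scores, and the reference distribution is the binary label vector $y$ normalized to sum to 1. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over per-label scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(reference, predicted, eps=1e-12):
    """KL(reference || predicted); terms with zero reference mass contribute 0."""
    return sum(
        r * math.log(r / max(p, eps))
        for r, p in zip(reference, predicted)
        if r > 0
    )


def annotation_error(label_vector, scores):
    """KL divergence between normalized reference labels and softmax predictions.

    Assumption: label_vector is binary (y_l in {0, 1}) with at least one 1.
    """
    total = sum(label_vector)
    if total == 0:
        raise ValueError("label_vector must contain at least one positive label")
    reference = [y / total for y in label_vector]
    return kl_divergence(reference, softmax(scores))
```

With this definition the error is non-negative and shrinks as the scores of the true labels grow relative to the others, which is the quantity backpropagation would reduce.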

Step 3 of the invention, which optimizes the weights of the modality combination with the online-learning exponentiated-gradient algorithm, comprises:

For a multimodal deep network, another important task is to learn the optimal combination weights $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots, \alpha_N)$ across modalities, where each $\alpha_n$ is initialized to $1/N$. The invention uses the online-learning exponentiated-gradient algorithm to optimize the multimodal weight combination:

where $\mathrm{KL}(\cdot)$ denotes the KL divergence and $h(\alpha)$ denotes the hinge loss:

where $S_t$ is:

$S_t = \big(S_1(x, \Gamma^+) - S_1(x, \Gamma^-), \ldots, S_N(x, \Gamma^+) - S_N(x, \Gamma^-)\big)^T$ (8)

where label $\Gamma^+$ reflects the image content better than label $\Gamma^-$.

To simplify the optimization problem, a first-order Taylor expansion of $h(\alpha)$ is taken at $\alpha_t$, so equation (8) can be written in first-order Taylor form:

If $\Gamma^+$ and $\Gamma^-$ are not ranked in the correct order, the value of the weight $\alpha$ is updated automatically.
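The update rule itself is not reproduced in this text, so the following is a sketch of the textbook exponentiated-gradient step rather than the patent's exact formula. It assumes a hinge loss of the form $h(\alpha) = \max(0, \mu - \alpha^T S_t)$, whose gradient is $-S_t$ when the margin is violated, and a learning rate `eta`; the margin value, the learning rate, and the function names are hypothetical.

```python
import math


def hinge_loss(alpha, s_t, margin):
    """h(alpha) = max(0, margin - alpha . s_t) for one ranked label pair."""
    score = sum(a * s for a, s in zip(alpha, s_t))
    return max(0.0, margin - score)


def eg_update(alpha, s_t, margin=1.0, eta=0.5):
    """One exponentiated-gradient step on the modality weights.

    If the pair (Gamma+, Gamma-) is already ranked with enough margin,
    the weights are left unchanged. Otherwise each weight is multiplied by
    exp(eta * s_t[n]) (the negative hinge gradient) and the result is
    renormalized so the weights stay on the probability simplex.
    """
    if hinge_loss(alpha, s_t, margin) == 0.0:
        return list(alpha)
    unnormalized = [a * math.exp(eta * s) for a, s in zip(alpha, s_t)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    n_modalities = 3
    alpha = [1.0 / n_modalities] * n_modalities  # alpha_n initialized to 1/N
    s_t = [0.5, -0.2, 0.1]  # per-modality S_n(x, Gamma+) - S_n(x, Gamma-)
    print(eg_update(alpha, s_t))
```

The multiplicative form keeps every weight positive and the weights summing to 1, which matches the $1/N$ initialization; modalities that rank $\Gamma^+$ above $\Gamma^-$ gain weight.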

Beneficial effects:

1. The invention optimizes the parameters of the deep neural network and improves annotation accuracy.

2. The invention demonstrates the effectiveness of image annotation based on a deep neural network learning model.

3. The invention effectively improves image annotation performance.

Description of the Drawings

Fig. 1 is a flow chart of the method of the invention.

Fig. 2 shows the deep neural network model of the invention.

Fig. 3 shows example images from the natural scene image library.

Fig. 4 shows example images from the NUS-WIDE image library.

Fig. 5 shows example images from the IAPRTC-12 image database.

Fig. 6 is a schematic diagram of the results of different modality weight combinations on the three public image libraries.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

As shown in Fig. 1, the invention provides an image annotation method based on multimodal deep learning. The method comprises: first, training a deep neural network on unlabeled images; second, optimizing each single modality by backpropagation; finally, optimizing the weights across modalities with the online-learning exponentiated-gradient algorithm.

The deep neural network of the invention is a convolutional neural network whose model structure is shown in Fig. 2. A series of experiments evaluates the performance of the proposed image annotation algorithm based on multimodal deep learning.

Step 1: Introduce the datasets used to evaluate the algorithm.

The experiments use three public image datasets: the natural scene image library shown in Fig. 3, the NUS-WIDE image library shown in Fig. 4, and the IAPRTC-12 image library shown in Fig. 5. The three libraries are described below.

The natural scene image library contains 2000 images annotated with the following five labels: desert, mountain, sea, sunset, and trees. More than 20% of the images carry more than one label, and the average number of labels per image is 1.3. Fig. 3 shows two example images from the natural scene image library: Fig. 3(a) is labeled sunset and sea, and Fig. 3(b) is labeled mountain and trees.

The NUS-WIDE image library contains 30,000 images annotated with 31 labels, including boat, car, flag, horse, sky, sun, tower, airplane, and zebra. Fig. 4 shows two images from the NUS-WIDE image library: the labels of Fig. 4(a) include sky and airplane, and the labels of Fig. 4(b) include sea and sunset.

The IAPRTC-12 image database contains 20,000 images and 291 labels, with an average of 5.7 labels per image. Fig. 5 shows two example images from the IAPRTC-12 image database: the labels of Fig. 5(a) include brown, face, hair, man, and woman, and the labels of Fig. 5(b) include ship, lake, sky, and trees.

Step 2: Give the visual features that represent the images and the optimal parameters obtained by learning.

Feature selection has a strong influence on system performance. The invention uses the following global and local features as image descriptors:

Global features: (1) a 128-dimensional HSV color histogram and 225-dimensional LAB color moments, (2) a 37-dimensional edge direction histogram, (3) a 36-dimensional pyramid wavelet texture, (4) a 59-dimensional local binary pattern (LBP) descriptor, and (5) a 960-dimensional GIST descriptor.

Local features: two sampling methods and three local descriptors are used to extract local texture features. First, dense sampling and Harris corner detection are performed. Then SIFT, CSIFT, and RGBSIFT features are extracted and a 1000-word codebook is built by k-means clustering. Next, a two-level spatial pyramid is used to build a 5000-dimensional vector for each image. Finally, TF-IDF weighting produces the final bag of visual words. Throughout the experiments, all feature vectors are normalized to the range [0, 1].

For each query-label pair, formula (4) above provides three similarity measures, and the margin parameter μ is selected by cross-validation. After cross-validation, μ is 0.18 for the cosine similarity measure, μ is 1 for the linear similarity measure, and for the RBF similarity measure σ is 2 and μ is 0.18.
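The three similarity measures named here (linear, cosine, and RBF) have standard forms, sketched below for reference. Formula (4) is not reproduced in this text, so the exact parameterization is an assumption; in particular, the RBF kernel is written as $\exp(-\lVert u - v \rVert^2 / (2\sigma^2))$, which is one common convention, and the function names are hypothetical.

```python
import math


def linear_similarity(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product normalized by both vector lengths; 0.0 for a zero vector."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return linear_similarity(u, v) / (norm_u * norm_v)


def rbf_similarity(u, v, sigma=2.0):
    """Gaussian (RBF) similarity; sigma=2 is the cross-validated value quoted above."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Cosine and RBF similarities are bounded, whereas the linear similarity is not, which may explain why the cross-validated margin μ differs between them (0.18 versus 1).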

Step 3: Test the performance of the proposed algorithm through comparative experiments.

Algorithm comparison

The comparative experiments cover the following three image classification methods:

Lazy learning algorithm: first, for each test image, the K most similar images are found in the training image library; then statistics of these K images are computed; finally, the labels of the test image are assigned by maximum a posteriori probability.

Deep representation and coding algorithm: a hierarchical model learns a pixel-level representation of the image, which is then used for annotation.

Method of the invention: image annotation through a deep neural network.

Modality weights

In the method of the invention, the combination weights α of the different modalities have a strong influence on system performance. Fig. 6 shows the results of different modality weight combinations on the three public image libraries. Fig. 6(a): combination weights of the modalities on the natural scene image library. Fig. 6(b): combination weights of the modalities on the NUS-WIDE images. Fig. 6(c): combination weights of the modalities on the IAPRTC-12 images.

The results in Fig. 6 show that the proportions of the different modalities do not differ significantly. This means that every modality contributes to some degree across image categories, mainly because the three image libraries contain natural-scene images of many different categories. It also further confirms the importance of finding the optimal combination of modalities.

Performance comparison

Table 1 gives the experimental comparison of several multi-label image annotation techniques that use different methods.

Table 1: Experimental comparison results.

The results in Table 1 show that the NDCG@w performance of the proposed method is better than that of the other two existing methods, which confirms the effectiveness of image annotation based on a deep neural network learning model.
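For readers unfamiliar with the NDCG@w metric used in the comparison, the sketch below computes it for one ranked list of labels. The patent does not state which gain and discount it uses, so the common choice of gain $2^{rel} - 1$ and discount $\log_2(\text{rank} + 1)$ is an assumption, and the function names are hypothetical.

```python
import math


def dcg_at(relevances, w):
    """Discounted cumulative gain over the top-w items of a ranked list.

    relevances[i] is the relevance of the item ranked at position i + 1.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:w])
    )


def ndcg_at(relevances, w):
    """DCG@w divided by the DCG@w of the ideal (sorted) ranking."""
    ideal = dcg_at(sorted(relevances, reverse=True), w)
    return dcg_at(relevances, w) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Predicted label ranking where the 2nd and 3rd labels are correct.
    print(ndcg_at([0, 1, 1], w=3))
```

A value of 1.0 means the correct labels occupy the top of the ranking; lower values mean correct labels were ranked below incorrect ones.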

Claims (2)

1. An image annotation method based on multimodal deep learning, characterized in that the method comprises the following steps:

Step 1: Pre-train the node weights of a deep neural network on an unlabeled image sample set. The deep neural network is an eight-layer convolutional neural network in which the first five layers are convolutional layers and the remaining three are fully connected layers; the output of the fully connected layers is the input to a Softmax classifier, which produces 1000 labeled classes; both the pre-training and the fine-tuning stages use the multinomial logistic regression objective;

the first, second, and fifth of the convolutional layers are normalization layers, and, to preserve invariance, all normalization layers use max pooling; rectified linear units (ReLU) serve as the nonlinear activation function in all convolutional and fully connected layers;

in the convolutional neural network, all input images are resized to 256×256; the first two convolutional filters are 7×7 and 5×5, respectively, each with a stride of 2, filters of this type being used to capture information in all frequency bands and the small stride being used to avoid "dead features" that would affect the next layer of the network; the last three convolutional layers are connected in sequence, with 3×3 filters and a stride of 1; the output size of each fully connected layer is 4096, and during pre-training the dropout rate of the first two fully connected layers is set to 0.6;

Step 2: Optimize the weights of each single modality with the backpropagation algorithm, comprising:

① Single-modality pre-training:

The convolutional neural network is pre-trained on the unlabeled training set to obtain an intermediate representation of the image objects and to initialize the network, comprising: first, using contrastive divergence to train the node weights $W_1$ between the input layer and the first convolutional layer; then using the conditional probability of the nodes of the first convolutional layer as the input to the second convolutional layer:

$p(\Gamma \mid x_j) = S(W_1, x_j)$ (1)

where $x_j$ is the $j$-th feature vector, $\Gamma$ is the label information, and $S(\cdot)$ is the similarity function given by the following formula:

then combining the first and second convolutional layers to train the node weights $W_2$, and training the node weights of the remaining three convolutional layers and the three fully connected layers in the same way;

② Single-modality fine-tuning:

In the single-modality fine-tuning stage, the node weights are optimized by backpropagating the annotation error; from the perspective of pattern recognition, multi-label learning is regarded as multi-task learning, and the overall annotation error of the convolutional neural network is the sum of the errors of the individual labels; the weight optimization is described for the error of the $l$-th label and comprises:

first, for an image $x$ with representation $x_j$ under the $j$-th feature modality, the probability that it carries the $l$-th label $\Gamma_l$ is expressed by the posterior probability in the following formula:

where $L$ is the number of labels;

then minimizing the KL divergence between the predicted probability and the reference probability; each image is assumed to have multiple labels, represented by a vector $y \in \mathbb{R}^{1 \times c}$, where $y_l = 1$ means that the label set of image $x$ contains the $l$-th label and $y_l = 0$ means that it does not; with $q_{il}$ denoting the probability between image $x_i$ and label $l$, the error of correctly assigning the $l$-th label to the image is:

the assignment error over all labels is:

finally, using backpropagation to update, in turn, the node weights of the other two fully connected layers and the five convolutional layers;

Step 3: Optimize the weights of the modality combination with the online-learning exponentiated-gradient algorithm;

for a multimodal deep network, another important task is to learn the optimal combination weights $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots, \alpha_N)$ across modalities, where each $\alpha_n$ is initialized to $1/N$; the online-learning exponentiated-gradient algorithm is used to optimize the multimodal weight combination, comprising:

where $\mathrm{KL}(\cdot)$ denotes the KL divergence and $h(\alpha)$ denotes the hinge loss:

where $S_t$ is:

$S_t = \big(S_1(x, \Gamma^+) - S_1(x, \Gamma^-), \ldots, S_N(x, \Gamma^+) - S_N(x, \Gamma^-)\big)^T$ (8)

where label $\Gamma^+$ reflects the image content better than label $\Gamma^-$;

to simplify the optimization problem, a first-order Taylor expansion of $h(\alpha)$ is taken at $\alpha_t$, so equation (8) can be written in first-order Taylor form:

if $\Gamma^+$ and $\Gamma^-$ are not ranked in the correct order, the value of the weight $\alpha$ is updated automatically.

2. The image annotation method based on multimodal deep learning according to claim 1, characterized in that the method is applied to a convolutional neural network.
CN201510198325.XA 2015-04-23 2015-04-23 An Image Annotation Method Based on Multimodal Deep Learning Active CN105184303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510198325.XA CN105184303B (en) 2015-04-23 2015-04-23 An Image Annotation Method Based on Multimodal Deep Learning

Publications (2)

Publication Number Publication Date
CN105184303A CN105184303A (en) 2015-12-23
CN105184303B true CN105184303B (en) 2019-08-09

Family

ID=54906369

Country Status (1)

Country Link
CN (1) CN105184303B (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654942A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN105678340B (en) * 2016-01-20 2018-12-25 福州大学 An Automatic Image Annotation Method Based on Enhanced Stacked Autoencoder
CN105760859B (en) * 2016-03-22 2018-12-21 中国科学院自动化研究所 Reticulate pattern facial image recognition method and device based on multitask convolutional neural networks
CN105894012B (en) * 2016-03-29 2019-05-14 天津大学 Based on the object identification method for cascading micro- neural network
JP6727543B2 (en) * 2016-04-01 2020-07-22 富士ゼロックス株式会社 Image pattern recognition device and program
CN106056602B (en) * 2016-05-27 2019-06-28 中国人民解放军信息工程大学 FMRI visual performance datum target extracting method based on CNN
CN105930877B (en) * 2016-05-31 2020-07-10 上海海洋大学 Remote sensing image classification method based on multi-mode deep learning
US9971958B2 (en) * 2016-06-01 2018-05-15 Mitsubishi Electric Research Laboratories, Inc. Method and system for generating multimodal digital images
CN106202338B (en) * 2016-06-30 2019-04-05 合肥工业大学 Image search method based on the more relationships of multiple features
CN106682592B (en) * 2016-12-08 2023-10-27 北京泛化智能科技有限公司 Image automatic identification system and method based on neural network method
CN106845427B (en) * 2017-01-25 2019-12-06 北京深图智服技术有限公司 face detection method and device based on deep learning
CN107122800B (en) * 2017-04-27 2020-09-18 南京大学 A Robust Digital Image Annotation Method Based on Prediction Result Screening
CN108960015A (en) * 2017-05-24 2018-12-07 优信拍(北京)信息科技有限公司 A kind of vehicle system automatic identifying method and device based on deep learning
CN109583583B (en) * 2017-09-29 2023-04-07 腾讯科技(深圳)有限公司 Neural network training method and device, computer equipment and readable medium
CN108307205A (en) * 2017-12-06 2018-07-20 中国电子科技集团公司电子科学研究院 Merge the recognition methods of video expressive force, terminal and the storage medium of audio visual feature
CN108388768A (en) * 2018-02-08 2018-08-10 南京恺尔生物科技有限公司 Utilize the biological nature prediction technique for the neural network model that biological knowledge is built
CN109544517A (en) * 2018-11-06 2019-03-29 中山大学附属第医院 Multi-modal ultrasound omics analysis method and system based on deep learning
CN109543835B (en) * 2018-11-30 2021-06-25 上海寒武纪信息科技有限公司 Computing method, device and related products
CN109583580B (en) * 2018-11-30 2021-08-03 上海寒武纪信息科技有限公司 Computing method, device and related products
CN109543833B (en) * 2018-11-30 2021-08-03 上海寒武纪信息科技有限公司 Computing method, device and related products
CN109711464B (en) * 2018-12-25 2022-09-27 中山大学 Image Description Method Based on Hierarchical Feature Relation Graph
CN109886226B (en) * 2019-02-27 2020-12-01 北京达佳互联信息技术有限公司 Method and device for determining characteristic data of image, electronic equipment and storage medium
CN110019652B (en) * 2019-03-14 2022-06-03 九江学院 Cross-modal Hash retrieval method based on deep learning
CN110569967A (en) * 2019-09-11 2019-12-13 山东浪潮人工智能研究院有限公司 A neural network model compression and encryption method and system based on arithmetic coding
CN111127456A (en) * 2019-12-28 2020-05-08 北京无线电计量测试研究所 Image annotation quality evaluation method
CN111383744B (en) * 2020-06-01 2020-10-16 北京协同创新研究院 Medical microscopic image annotation information processing method and system and image analysis equipment
CN112633394B (en) * 2020-12-29 2022-12-16 厦门市美亚柏科信息股份有限公司 Intelligent user label determination method, terminal equipment and storage medium
CN114582011B (en) * 2021-12-27 2025-07-18 广西壮族自治区公众信息产业有限公司 Pedestrian tracking method based on federal learning and edge calculation
CN114170481B (en) * 2022-02-10 2022-06-17 北京字节跳动网络技术有限公司 Method, apparatus, storage medium, and program product for image processing
CN115356363B (en) * 2022-08-01 2023-06-20 河南理工大学 Pore structure characterization method based on wide ion beam polishing-scanning electron microscope
CN116563400B (en) * 2023-07-12 2023-09-05 南通原力云信息技术有限公司 Mini-program image information compression processing method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102902966A (en) * 2012-10-12 2013-01-30 大连理工大学 Super-resolution face recognition method based on deep belief networks
CN103345656A (en) * 2013-07-17 2013-10-09 中国科学院自动化研究所 Method and device for data identification based on multitask deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8489529B2 (en) * 2011-03-31 2013-07-16 Microsoft Corporation Deep convex network with joint use of nonlinear random projection, Restricted Boltzmann Machine and batch-based parallelizable optimization

Also Published As

Publication number Publication date
CN105184303A (en) 2015-12-23

Similar Documents

Publication Publication Date Title
CN105184303B (en) An Image Annotation Method Based on Multimodal Deep Learning
Mascarenhas et al. A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for Image Classification
Wang et al. A review on extreme learning machine
CN108132968B (en) Weak supervision learning method for associated semantic elements in web texts and images
Wu et al. Application of image retrieval based on convolutional neural networks and Hu invariant moment algorithm in computer telecommunications
CN111738303B (en) A Long Tail Distribution Image Recognition Method Based on Hierarchical Learning
Wei et al. Learning to segment with image-level annotations
CN104599275B (en) Non-parametric RGB-D scene understanding method based on a probabilistic graphical model
Ahmad et al. Medical image retrieval with compact binary codes generated in frequency domain using highly reactive convolutional features
EP3029606A2 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
CN110188827A (en) Scene recognition method based on convolutional neural networks and a recursive autoencoder model
CN114780767A (en) A large-scale image retrieval method and system based on deep convolutional neural network
Yang et al. Local label descriptor for example based semantic image labeling
Feng et al. Bag of visual words model with deep spatial features for geographical scene classification
Chung et al. Filter pruning by image channel reduction in pre-trained convolutional neural networks
CN114741507A (en) Establishment and classification of citation network classification model based on Transformer-based graph convolutional network
Xu et al. Weakly supervised facial expression recognition via transferred DAL-CNN and active incremental learning
CN118470714B (en) Camouflage object semantic segmentation method, system, medium and electronic equipment based on decision-level feature fusion modeling
CN104517120A (en) Remote sensing image scene classification method based on multichannel hierarchical orthogonal matching
Zhu et al. Deep neural network based image annotation
CN103440651A (en) Multi-label image annotation result fusion method based on rank minimization
Wu et al. Recognition of pear leaf disease under complex background based on DBPNet and modified mobilenetV2
Limei et al. Landscape image recognition and analysis based on deep learning algorithm
Browne et al. PulseNetOne: fast unsupervised pruning of convolutional neural networks for remote sensing
CN104331717B (en) Image classification method integrating feature dictionary construction and visual feature encoding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20151223

Assignee: Zhangjiagang Institute of Zhangjiagang

Assignor: Nanjing University of Posts and Telecommunications

Contract record no.: X2019980001251

Denomination of invention: Image annotation method based on multimodal deep learning

Granted publication date: 20190809

License type: Common License

Record date: 20191224

TR01 Transfer of patent right

Effective date of registration: 20250414

Address after: Room 288, 1st Floor, Building 14, No. 7 Beixiaoyingfu Front Street, Shunyi District, Beijing 101300

Patentee after: Beijing Zhangying Information Technology Co.,Ltd.

Country or region after: China

Address before: No. 9 Wenyuan Road, Qixia District, Nanjing, Jiangsu 210023

Patentee before: Nanjing University of Posts and Telecommunications

Country or region before: China