
TWI764081B - Framework for combining multiple global descriptors for image retrieval - Google Patents

Framework for combining multiple global descriptors for image retrieval

Info

Publication number
TWI764081B
TWI764081B TW109101190A
Authority
TW
Taiwan
Prior art keywords
global
descriptor
mentioned
learning
framework
Prior art date
Application number
TW109101190A
Other languages
Chinese (zh)
Other versions
TW202036329A (en)
Inventor
高秉秀
全希宰
金鍾澤
金永俊
金仁植
Original Assignee
南韓商納寶股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南韓商納寶股份有限公司
Publication of TW202036329A
Application granted
Publication of TWI764081B

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods


Abstract

The present invention discloses a framework for combining multiple global descriptors for image retrieval. The framework for image retrieval, implemented by a computer system, includes: a main module that learns by concatenating multiple mutually different global descriptors extracted from a convolutional neural network (CNN); and an auxiliary module for further learning one specific global descriptor among the multiple global descriptors.

Description

A framework for combining multiple global descriptors for image retrieval

The following description relates to a framework for a deep-learning model for image retrieval.

Image descriptors based on convolutional neural networks (CNNs) are used as general-purpose descriptors in computer vision tasks including classification, object detection, and semantic segmentation. They are also used in influential research areas such as image captioning and visual question answering.

Recent studies applying CNN-based image descriptors address instance-level image retrieval, a task that has conventionally applied methods relying on local descriptor matching and re-ranking through spatial verification.

In the field of image retrieval, features obtained by pooling the output of a CNN (average pooling, max pooling, generalized mean pooling, etc.) can be used as a global descriptor. Furthermore, fully connected (FC) layers can be added after the convolution layers so that the features produced by the FC layers serve as the global descriptor. In this case, the FC layer is used to reduce dimensionality and can be omitted when no dimensionality reduction is needed.

As an example, Korean Granted Patent No. 10-1917369 (registration date: November 5, 2018) discloses an image retrieval technique using a convolutional neural network.

Typical global descriptors generated by global pooling methods include sum pooling of convolutions (SPoC), maximum activation of convolutions (MAC), and generalized mean pooling (GeM). Because each global descriptor has different properties, its performance varies depending on the dataset. For example, SPoC activates larger regions of the image representation, whereas MAC activates more focused regions. Variants of these typical global descriptors, such as weighted sum pooling, weighted GeM, and regional MAC (R-MAC), exist to enhance performance.

Recent research has focused on ensemble techniques for image retrieval. Whereas conventional ensemble techniques train multiple learners individually and combine the trained models to improve performance, recent work improves retrieval performance by combining individually trained global descriptors. In other words, in the field of image retrieval, different CNN backbone models and global descriptors are currently combined (ensembled) to improve retrieval performance.

However, explicitly training different learners (CNN backbone models or global descriptors) for an ensemble not only lengthens training time and increases memory consumption, but also requires specially designed strategies or losses to control the diversity among learners, making the training process cumbersome and difficult.

[Problems to Be Solved by the Invention] Provided is a deep-learning model framework that can learn and use different global descriptors at once with a single model.

Also provided is a method that obtains an effect equivalent to an ensemble by applying multiple global descriptors, without explicitly training multiple learners or controlling the diversity among the learners. [Technical Means for Solving the Problems]

Provided is a framework for image retrieval, implemented by a computer system, including: a main module that learns by concatenating multiple mutually different global descriptors extracted from a convolutional neural network (CNN); and an auxiliary module for further learning one specific global descriptor among the multiple global descriptors.

According to one embodiment, the main module is a learning module for a ranking loss on the image representation, the auxiliary module is a learning module for a classification loss on the image representation, and the framework for image retrieval is trained in an end-to-end manner using a final loss that is the sum of the ranking loss and the classification loss.

According to another embodiment, the CNN serves as a backbone network that provides a feature map of a given image, and no downsampling is performed before the last stage of the backbone network.

According to yet another embodiment, the main module normalizes the multiple global descriptors and then concatenates them into one final global descriptor, and the final global descriptor can be learned through a ranking loss.

According to yet another embodiment, the main module includes multiple branches that each output an image representation using one of the multiple global descriptors, and the number of branches can vary according to the global descriptors to be used.

According to yet another embodiment, the auxiliary module may use a classification loss to learn the specific global descriptor, which is determined based on learning performance among the multiple global descriptors.

According to yet another embodiment, when learning with the classification loss, the auxiliary module may use at least one of label smoothing and temperature scaling techniques.

Provided is a descriptor learning method executed on a computer system, wherein the computer system includes at least one processor that executes computer-readable instructions contained in a memory, and the descriptor learning method includes: a main learning step of concatenating multiple mutually different global descriptors extracted from a CNN and learning them using a ranking loss; and an auxiliary learning step of further learning one specific global descriptor among the multiple global descriptors using a classification loss.

Also provided is a non-transitory computer-readable recording medium storing a computer program for executing the descriptor learning method on the computer system. [Effects Compared with the Prior Art]

According to embodiments of the present invention, by applying a new framework for combining multiple global descriptors, namely a combination of multiple global descriptors (CGD) that can be trained in an end-to-end manner, an effect equivalent to an ensemble can be achieved without using an explicit ensemble model or diversity control for the global descriptors. The framework is flexible and extensible with respect to global descriptors, CNN backbones, losses, and datasets; since the combined-descriptor method can exploit different types of features, it not only outperforms a single global descriptor but also improves image retrieval performance.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Embodiments of the present invention relate to a framework for a deep-learning model for image retrieval and, in particular, to techniques for combining multiple global descriptors for image retrieval.

The embodiments, including what is specifically disclosed in this specification, propose a framework that obtains an effect equivalent to an ensemble by applying multiple global descriptors that can be trained in an end-to-end manner, thereby achieving significant advantages in flexibility, extensibility, time savings, cost savings, and retrieval performance.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment of the present invention. For example, the descriptor learning system of an embodiment of the present invention may be implemented by the computer system 100 of FIG. 1. As shown in FIG. 1, the computer system 100 may include, as components for executing the descriptor learning method, a processor 110, a memory 120, a persistent storage device 130, a bus 140, an input/output interface 150, and a network interface 160.

The processor 110, as a component for learning descriptors, may include any device capable of processing a sequence of instructions, or a part of such a device. The processor 110 may include, for example, a computer processor, a processor in a mobile device or other electronic device, and/or a digital processor. The processor 110 may be included in, for example, a server computing device, a server computer, a series of server computers, a server farm, a cloud computer, a content platform, or the like. The processor 110 may be coupled to the memory 120 through the bus 140.

The memory 120 may include volatile, persistent, virtual, or other memory for storing information used by or output from the computer system 100. The memory 120 may include, for example, random access memory (RAM) and/or dynamic RAM (DRAM). The memory 120 may be used to store any information, such as state information of the computer system 100. The memory 120 may also be used to store instructions of the computer system 100, including, for example, instructions for learning descriptors. The computer system 100 may include more than one processor 110 as needed or where appropriate.

The bus 140 may include a communication infrastructure that enables interaction among the various components of the computer system 100. The bus 140 may carry data between components of the computer system 100, for example between the processor 110 and the memory 120. The bus 140 may include wireless and/or wired communication media between the components of the computer system 100 and may include parallel, serial, or other topological arrangements.

The persistent storage device 130 may include components such as memory or other persistent storage used by the computer system 100 to store data for extended periods of time (for example, as compared with the memory 120). The persistent storage device 130 may include non-volatile main memory such as that used by the processor 110 in the computer system 100. The persistent storage device 130 may include, for example, flash memory, a hard disk, an optical disc, or other computer-readable media.

The input/output interface 150 may include interfaces for a keyboard, a mouse, voice command input, a display, or other input or output devices. Input for configuration instructions and/or for learning descriptors may be received through the input/output interface 150.

The network interface 160 may include interfaces to networks such as a local area network or the Internet. The network interface 160 may include interfaces for wired or wireless connections. Input for configuration instructions and/or for learning descriptors may be received through the network interface 160.

Furthermore, in other embodiments, the computer system 100 may include more components than those of FIG. 1. However, most prior-art components need not be explicitly illustrated. For example, the computer system 100 may be implemented to include at least some of various input/output devices connected to the input/output interface 150, or may further include other components such as a transceiver, a Global Positioning System (GPS) module, a camera, various sensors, or a database.

Embodiments of the present invention relate to a deep-learning model framework that can learn and use different global descriptors at once with a single model.

In recent image retrieval research, global descriptors based on deep CNNs provide richer features than earlier techniques such as the scale-invariant feature transform (SIFT). SPoC is sum pooling over the last feature map of a CNN. MAC is another powerful descriptor, and R-MAC sums the regional MAC descriptors after performing max pooling within each region. GeM generalizes max pooling and average pooling with a pooling parameter. Other global descriptor methods include weighted sum pooling, weighted GeM, and multiscale R-MAC.

Some studies have attempted to maximize the activation of important features in the feature map using an additional strategy or an attention mechanism, or have proposed a batch feature erasing (BFE) strategy that forces the network to improve the feature representation of other regions. Models that simultaneously optimize the feature representation while handling smooth pixels and hard-to-attend regions have also been applied. The drawback of these techniques is that they not only may increase network size and training time, but also require additional parameters for training.

In other words, recent research on image retrieval has combined different models or ensembled multiple global descriptors; however, training different models for such an ensemble is not only difficult but also inefficient in terms of time and storage.

The present embodiment proposes a new framework that, while remaining trainable in an end-to-end manner, obtains an effect equivalent to an ensemble by applying multiple global descriptors. The framework of the present invention is flexible and extensible with respect to global descriptors, CNN backbones, losses, and datasets. Moreover, the framework requires only a few additional trainable parameters and needs no additional strategies or attention mechanisms.

Ensembling is a well-known technique that improves results by training multiple learners and combining the results obtained from the trained learners, and it has been widely used in image retrieval over the past decades. However, existing ensemble techniques have the drawbacks that computational cost increases with model complexity and that further control is needed to account for the diversity among learners.

The framework of the present invention applies the idea of ensemble techniques while being trainable in an end-to-end manner without controlling diversity.

FIG. 2 shows a framework of a combination of multiple global descriptors (CGD) for image retrieval according to an embodiment of the present invention.

The CGD framework 200 of the present invention may be implemented by the computer system 100 described above and may be included in the processor 110 as a component for learning descriptors.

Referring to FIG. 2, the CGD framework 200 may consist of a CNN backbone network 201 and two modules: a main module 210 and an auxiliary module 220.

In this case, the main module 210 performs the function of learning the image representation and is formed by a combination of multiple global descriptors trained with a ranking loss. The auxiliary module 220 performs the function of fine-tuning the CNN through a classification loss.

The CGD framework 200 can be trained in an end-to-end manner with a final loss that is the sum of the ranking loss from the main module 210 and the classification loss from the auxiliary module 220.

1. CNN backbone network 201

Any CNN model can be used as the CNN backbone network 201. The CGD framework 200 may use CNN backbones such as BN-Inception, ShuffleNet-v2, ResNet, and other variant models; for example, as shown in FIG. 2, ResNet-50 may be used as the CNN backbone network 201.
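As an illustrative sketch of the backbone resolution trade-off described in this section (the per-stage stride values below are assumptions based on the standard ResNet-50 layout, not details given in the patent text), removing the downsampling before the last stage halves the overall output stride, so a 224×224 input yields a 14×14 rather than a 7×7 final feature map:

```python
# Sketch: effect of dropping the downsampling between the last two stages
# of a ResNet-50-like backbone. The stage strides below are the standard
# ones and are assumptions, not taken from the patent text.

def output_feature_size(input_size, stage_strides):
    """Divide the spatial size by each stage's stride in turn."""
    size = input_size
    for stride in stage_strides:
        size //= stride
    return size

# Standard ResNet-50: stem stride 4, then stages with strides 1, 2, 2, 2.
standard = [4, 1, 2, 2, 2]
# Modified backbone: the downsampling before the last stage is removed.
modified = [4, 1, 2, 2, 1]

print(output_feature_size(224, standard))  # 7  -> 7x7 final feature map
print(output_feature_size(224, modified))  # 14 -> 14x14 final feature map
```

The larger 14×14 map preserves more spatial information for the pooling-based global descriptors discussed below.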

As an example, the CNN backbone network 201 may use a network formed of four stages. In this case, in order to preserve more information in the last feature map, the network may be modified by discarding the downsampling operation between stage 3 and stage 4. This yields a feature map of size 14×14 for an input size of 224×224, which improves the performance of the individual global descriptors. In other words, to improve the performance of the global descriptors, no downsampling is performed after stage 3 and before the last stage (stage 4) of ResNet-50, so that more information is retained.

2. Main module 210: multiple global descriptors

The main module 210 extracts global descriptors from the last feature map of the CNN backbone network 201 through multiple feature aggregation methods and normalizes them together with FC layers.

The global descriptors extracted by the main module 210 are concatenated and normalized to form one final global descriptor; in this case, the final global descriptor is learned at the instance level through a ranking loss. The ranking loss may be replaced by any loss for metric learning; representatively, a triplet loss may be used.

In detail, the main module 210 includes multiple branches that output respective image representations using different global descriptors on the last convolutional layer. As an example, the main module 210 uses the three most typical types of global descriptors, namely sum pooling of convolutions (SPoC), maximum activation of convolutions (MAC), and generalized mean pooling (GeM), one in each branch.

The number of branches included in the main module 210 can be increased or decreased, and the global descriptors to be used can be varied and combined according to the user's needs.

Given an image $I$, the output of the last convolutional layer is a 3D tensor $\mathcal{X}$ of dimensions $C \times H \times W$, where $C$ is the number of feature maps. Let $\mathcal{X}_c$ be the set of $H \times W$ activations of feature map $c \in \{1, \ldots, C\}$; the network output then consists of $C$ channels of 2D feature maps. The global descriptor takes $\mathcal{X}$ as input and produces a vector $f$ as the output of a pooling process. Such pooling methods can be generalized as Equation 1.

[Equation 1]
$$f = [f_1 \cdots f_c \cdots f_C]^{\top}, \qquad f_c = \left( \frac{1}{|\mathcal{X}_c|} \sum_{x \in \mathcal{X}_c} x^{p_c} \right)^{1/p_c}$$

SPoC is defined by $p_c = 1$, MAC is defined by $p_c \rightarrow \infty$, and GeM is defined by the remaining cases. In the case of GeM, the experimentally fixed parameter value $p_c = 3$ can be used; depending on the embodiment, the parameter $p_c$ may be set manually by the user, or the parameter $p_c$ itself may be learned.
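The generalized pooling of Equation 1 can be sketched in a few lines of numpy (the tensor shape and the GeM parameter value 3 are illustrative; as noted above, $p_c$ may instead be user-set or learned):

```python
import numpy as np

def generalized_pooling(x, p):
    """Generalized pooling over a C x H x W activation tensor (Equation 1).

    p = 1         -> SPoC (average pooling)
    p -> infinity -> MAC (max pooling)
    otherwise     -> GeM with parameter p (3 is the experimentally fixed
                     value mentioned in the text; it may also be learned)
    x is assumed non-negative (e.g. post-ReLU activations).
    """
    c = x.shape[0]
    flat = x.reshape(c, -1)                       # each row is X_c
    if np.isinf(p):
        return flat.max(axis=1)                   # MAC
    return (flat ** p).mean(axis=1) ** (1.0 / p)  # SPoC (p=1) / GeM

x = np.random.rand(4, 14, 14)      # toy post-ReLU feature map, C=4
spoc = generalized_pooling(x, 1.0)
mac = generalized_pooling(x, np.inf)
gem = generalized_pooling(x, 3.0)
# By the power-mean inequality, GeM lies between SPoC and MAC per channel.
print(bool(np.all(spoc <= gem) and np.all(gem <= mac)))  # True
```

This ordering illustrates why GeM interpolates between the "large region" behavior of SPoC and the "focused region" behavior of MAC.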

The output feature vector $\Phi^{(a_i)}$ of the $i$-th branch is generated by dimensionality reduction through the FC layer and normalization through an $\ell_2$-normalization layer, as in Equation 2.

[Equation 2]
$$\Phi^{(a_i)} = \frac{W^{(i)} \cdot f^{(a_i)}}{\left\| W^{(i)} \cdot f^{(a_i)} \right\|_2}$$

Here, $n$ is the number of branches, $i \in \{1, \ldots, n\}$, and $W^{(i)}$ is the weight of the FC layer; the global descriptor $f^{(a_i)}$ can be SPoC when $a_i = s$, MAC when $a_i = m$, and GeM when $a_i = g$.

The final feature vector of the CGD framework 200, referred to as the combined descriptor $\psi_{CGD}$, is obtained by concatenating the output feature vectors of the multiple branches and then performing $\ell_2$-normalization, as in Equation 3.

[Equation 3]
$$\psi_{CGD} = \frac{\Phi^{(a_1)} \oplus \cdots \oplus \Phi^{(a_n)}}{\left\| \Phi^{(a_1)} \oplus \cdots \oplus \Phi^{(a_n)} \right\|_2}$$

where $\oplus$ denotes concatenation.
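Equations 2 and 3 can be sketched as follows (the descriptor dimension 2048, the reduced dimension 512, the random FC weights, and the use of three branches are illustrative assumptions, not values fixed by the patent):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def branch_output(f, W):
    """Equation 2: FC dimensionality reduction, then l2-normalization."""
    return l2_normalize(W @ f)

def combined_descriptor(fs, Ws):
    """Equation 3: concatenate the branch outputs, then l2-normalize."""
    return l2_normalize(np.concatenate([branch_output(f, W)
                                        for f, W in zip(fs, Ws)]))

rng = np.random.default_rng(0)
C, k = 2048, 512                                   # illustrative dimensions
fs = [rng.standard_normal(C) for _ in range(3)]    # e.g. SPoC, MAC, GeM
Ws = [rng.standard_normal((k, C)) for _ in range(3)]

psi = combined_descriptor(fs, Ws)
print(psi.shape)  # (1536,) -- three 512-d branch outputs concatenated
```

Because each branch output is unit-length before concatenation, every descriptor contributes equally to the combined descriptor regardless of its raw scale.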

This combined descriptor can be trained with any type of ranking loss; as an example, the batch-hard triplet loss is representatively used.
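As a sketch of the batch-hard triplet loss mentioned above (the margin value 0.1 and the Euclidean distance are assumptions; the patent does not fix these details): for each anchor in the batch, the farthest same-label sample (hardest positive) and the closest different-label sample (hardest negative) are selected before the margin is applied.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.1):
    """Batch-hard triplet loss sketch (margin=0.1 is an assumed value).

    For every anchor, pick its farthest same-label sample (hardest
    positive) and its closest different-label sample (hardest negative).
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)    # pairwise distances
    same = labels[:, None] == labels[None, :]

    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()

labels = np.array([0, 0, 1, 1])
tight = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
loss = batch_hard_triplet_loss(tight, labels)
print(loss)  # 0.0: positives are far closer than negatives
```

When classes are well separated in embedding space, the loss is zero; it only penalizes anchors whose hardest negative is not at least `margin` farther than their hardest positive.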

Combining multiple global descriptors in the CGD framework 200 has two advantages. First, an effect equivalent to an ensemble can be obtained with only a few additional parameters: to make this ensemble effect trainable in an end-to-end manner, the CGD framework 200 extracts the multiple global descriptors from a single CNN backbone network 201. Second, the output of each branch automatically acquires different properties without diversity control. Whereas recent studies have proposed losses specifically designed to encourage diversity among learners, the CGD framework 200 needs no loss specifically designed to control the diversity among its branches.

Through experiments, the performance of multiple combinations of global descriptors can be compared to find a suitable descriptor combination. However, depending on the output feature dimension for each dataset, the performance differences may be small. For example, if 1536-dimensional and 768-dimensional SPoC perform similarly, a combination of 768-dimensional SPoC + 768-dimensional GeM (multiple global descriptors) can achieve better performance than 1536-dimensional SPoC (a single global descriptor).

3. Auxiliary module 220: classification loss

The auxiliary module 220 can use a classification loss to learn the image representation output from the first global descriptor of the main module 210, so as to learn at the categorical level of the embedding. In this case, when learning with the classification loss, label smoothing and temperature scaling techniques can be applied to improve performance.

In other words, the auxiliary module 220 uses an auxiliary classification loss to fine-tune the CNN backbone based on the first global descriptor of the main module 210; that is, the auxiliary module 220 may use the classification loss to learn the image representation produced by the first of the global descriptors included in the main module 210. This follows the two-step approach of first fine-tuning the CNN backbone with a classification loss to improve the convolutional filters, and then fine-tuning the network to improve the performance of the global descriptor.

The CGD framework 200 modifies this approach so that there is only a single step, allowing end-to-end training. Training with an auxiliary classification loss yields image representations with separation between classes, and helps train the network faster and more stably than using only a ranking loss.

Temperature scaling and label smoothing in the softmax cross-entropy loss (softmax loss) help train the classification loss. The softmax loss is defined as Equation 4.

[Equation 4]

$$L_{\mathrm{Softmax}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big((W_{y_i}^{T}f + b_{y_i})/\tau\big)}{\sum_{j=1}^{M}\exp\big((W_{j}^{T}f + b_{j})/\tau\big)}$$

where $N$, $M$, and $y_i$ denote the batch size, the number of classes, and the identity label of the i-th input, respectively; $W$ and $b$ are a trainable weight and bias; $f$ is the global descriptor of the first branch; and $\tau$ is a temperature parameter with a default value of 1.

In Equation 4, temperature scaling with the temperature parameter $\tau$ assigns larger gradients to harder examples, which is useful for learning embeddings that are compact within a class and spread out between classes. Label smoothing improves generalization by encouraging the model to estimate the marginal effect of label dropout during training. Therefore, to prevent overfitting and learn a better embedding, label smoothing and temperature scaling are added to the auxiliary classification loss.
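As an illustrative sketch (not code from the patent; a NumPy formulation in which the off-class smoothing mass eps/(M-1) is an assumed, common variant), the temperature-scaled, label-smoothed softmax cross-entropy can be written as:

```python
import numpy as np

def classification_loss(logits, labels, tau=0.5, eps=0.1):
    """Temperature-scaled, label-smoothed softmax cross-entropy.

    logits: (N, M) scores W^T f + b for N inputs and M classes
    labels: (N,) integer identity labels y_i
    tau:    temperature; tau < 1 sharpens the softmax and enlarges
            the gradients assigned to harder examples
    eps:    label-smoothing factor; eps = 0 recovers plain cross-entropy
    """
    n, m = logits.shape
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)               # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    targets = np.full((n, m), eps / (m - 1))           # smoothed targets
    targets[np.arange(n), labels] = 1.0 - eps
    return float(-(targets * log_p).sum(axis=1).mean())
```

With tau = 0.5, as in the experiments reported below, the logits are effectively doubled before the softmax, which increases the penalty on hard, misclassified examples relative to tau = 1.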

The first global descriptor used to compute the classification loss can be determined from the performance of each global descriptor. As an example, after each of the global descriptors to be combined is trained as a single branch, the best-performing one among them can be used as the first global descriptor for computing the classification loss. For instance, if training SPoC, MAC, and GeM separately yields the performance ordering GeM > SPoC > MAC, the combination GeM+MAC tends to perform better than the combination MAC+GeM, so with this in mind GeM can be used as the global descriptor for computing the classification loss.

4. Framework Structure

The CGD framework 200 can be scaled according to the number of global descriptor branches and, depending on the structure of the global descriptors, allows other types of networks. For example, with 3 global descriptors (SPoC, MAC, GeM), and with the first global descriptor used separately for the auxiliary classification loss, 12 possible configurations can be formed.

For convenience, SPoC is abbreviated as S, MAC as M, and GeM as G; the first letter of each symbol denotes the first global descriptor, the one used for the auxiliary classification loss. The CGD framework 200 can extract the three global descriptors S, M, and G from a single CNN backbone 201, in which case the following 12 configurations are possible based on S, M, and G: S, M, G, SM, MS, SG, GS, MG, GM, SMG, MSG, GSM. All global descriptors are combined and trained with the ranking loss, while only the first global descriptor is additionally trained with the classification loss. For example, in the case of SMG, only the global descriptor S is additionally trained with the classification loss, while S, M, and G are all combined and trained with the ranking loss.
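The configuration count can be sketched programmatically (a plain-Python illustration, not code from the patent): the first descriptor is distinguished because it feeds the auxiliary classification loss, while the order of the remaining descriptors does not create a new configuration.

```python
from itertools import combinations

def cgd_configurations(descriptors=("S", "M", "G")):
    """Enumerate CGD configurations: pick an ordered first descriptor
    (used for the auxiliary classification loss), then any subset of
    the remaining ones, giving 3 * 2**2 = 12 for three descriptors."""
    configs = []
    for first in descriptors:
        rest = [d for d in descriptors if d != first]
        for size in range(len(rest) + 1):
            for combo in combinations(rest, size):
                configs.append(first + "".join(combo))
    return configs

print(cgd_configurations())
# 12 configurations: S, SM, SG, SMG, M, MS, MG, MSG, G, GS, GM, GSM
```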

Therefore, unlike existing methods that train multiple models separately in order to combine multiple global descriptors, the present invention can obtain an effect equivalent to such an ensemble by training only a single model in an end-to-end manner. Existing methods control diversity through losses designed specifically for the combination, whereas the present method achieves an equivalent effect without diversity control. According to the present invention, the final global descriptor can be used for image retrieval; if needed, the individual image representations before concatenation can be used in order to work with smaller dimensions. The global descriptors can be chosen according to user requirements, and the model can be scaled up or down by adjusting the number of global descriptors.
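A minimal NumPy sketch of such a combined descriptor follows. The pooling family is standard (GeM pooling with p = 1 reduces to SPoC, and large p approximates MAC), and the branch order pool, then FC, then l2-normalize, then concatenate, then l2-normalize again follows the framework described here; the shapes and FC weights below are random placeholders rather than trained parameters.

```python
import numpy as np

def gem_pool(fmap, p):
    """Generalized-mean pooling over the spatial dims of a (C, H, W) map.
    p = 1 gives average pooling (SPoC); p -> infinity approaches max
    pooling (MAC)."""
    return (fmap ** p).mean(axis=(1, 2)) ** (1.0 / p)

def l2_normalize(v):
    return v / np.linalg.norm(v)

def cgd_descriptor(fmap, branch_ps, fc_weights):
    """One CGD forward pass: per-branch pooling -> FC -> l2-norm,
    then concatenation -> l2-norm into the final descriptor."""
    parts = [l2_normalize(w @ gem_pool(fmap, p))
             for p, w in zip(branch_ps, fc_weights)]
    return l2_normalize(np.concatenate(parts))

rng = np.random.default_rng(0)
fmap = rng.random((4, 2, 2)) + 0.1        # positive activations, as after ReLU
fc = [rng.standard_normal((2, 4)) for _ in range(2)]
final = cgd_descriptor(fmap, branch_ps=[1.0, 3.0], fc_weights=fc)
print(final.shape)                         # (4,) unit-norm final descriptor
```

Either the final concatenated vector or the individual per-branch parts can then be indexed for retrieval, matching the dimension trade-off discussed above.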

An example of the CGD framework 200 described above is as follows.

As datasets for image retrieval, the CGD framework 200 of the present invention is evaluated on the dataset (CUB200) used in "C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. 2011." and the dataset (CARS196) used in "J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554-561, 2013.". For CUB200 and CARS196, cropped images with bounding-box information are used.

All experiments were performed with MXNet on a Tesla P40 GPU with 24 GB of memory. BNInception, ShuffleNet-v2, ResNet-50, and SEResNet-50 are used together with the ImageNet ILSVRC pre-trained weights from MXNet GluonCV. All experiments use a 224×224 input size and a 1536-dimensional embedding. In the training step, the input image is resized to 252×252, randomly cropped to 224×224, and then randomly flipped horizontally. The Adam optimizer with a learning rate of 1e-4 is used, with step decay when adjusting the learning rate. In all experiments, the margin of the triplet loss is 0.1 and the temperature $\tau$ of the softmax loss is 0.5. All datasets use a batch size of 128, CARS196 and CUB200 use 64 instances per class, and images are resized only to the default input size of 224×224.

1. Architecture Design Experiments
1) Training with Ranking and Classification Losses
[Classification Loss]

The CGD framework 200 is trained by using the classification loss on the first global descriptor together with the ranking loss. The table in FIG. 3 compares the results on CARS196 of using only the ranking loss (Ranking) and of using the auxiliary classification loss together with the ranking loss (Both). In this experiment, label smoothing and temperature scaling were not applied to the classification loss in any case. The results demonstrate that using both losses provides higher performance than using the ranking loss alone. The classification loss focuses on clustering each class into a closed embedding space at the categorical level. The ranking loss focuses on gathering samples of the same class and pushing apart samples of different classes at the instance level. Therefore, training the ranking loss together with the auxiliary classification loss improves the optimization of both categorical and fine-grained feature embeddings.

[Label Smoothing and Temperature Scaling]

The table in FIG. 4 compares the results on CARS196 of using neither label smoothing nor temperature scaling (no tricks), using label smoothing only (label smoothing), using temperature scaling only (temperature scaling), and using both (both tricks). This experiment is run on the ResNet-50 backbone with the global descriptor configuration SM, and shows that using label smoothing or temperature scaling improves performance over using neither. In particular, when label smoothing and temperature scaling are applied together, each metric improves and the best performance is obtained.

2) Combining Multiple Global Descriptors
[Position of the Combination]

Since the CGD framework 200 uses multiple global descriptors, experiments were conducted on different positions at which the multiple global descriptors are combined, in order to select the best architecture.

FIG. 5 shows a first type of architecture for training multiple global descriptors, and FIG. 6 shows a second type of architecture for training multiple global descriptors.

As shown in FIG. 5, the first type of architecture trains each global descriptor with a separate ranking loss and then combines them in the inference step; it uses the same global descriptor for each branch and does not use a classification loss.

On the other hand, the second type of architecture shown in FIG. 6 is trained with a single ranking loss by combining the raw outputs of the global descriptors, so that multiple separate global descriptors are not maintained.

In contrast, as shown in FIG. 2, the CGD framework 200 of the present invention combines the multiple global descriptors after the FC layer and $\ell_2$-normalization.

The table in FIG. 7 compares the performance of the CGD framework with the first type of architecture A and the second type of architecture B, using the global descriptor configuration SM on CUB200. The CGD framework shows the highest performance.

The second type of architecture B retains multiple branch characteristics and diversity among the output feature vectors. In contrast to the CGD framework, the final embedding of the first type of architecture A differs between the training step and the inference step, and the final embedding of the second type of architecture B loses the individual properties of each global descriptor because of the FC layer applied after concatenation.

[Combination Method]

From the viewpoint of the combination method, both concatenation and summation of multiple global descriptors improve model results. Therefore, the CGD framework of the present invention compares the two combination methods to select the better one.
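A toy illustration of the difference between the two combination methods (illustrative only, with hand-picked two-dimensional vectors, not data from the patent): summation can make distinct descriptor pairs collapse to the same combined vector, while concatenation keeps each descriptor's contribution separate.

```python
import numpy as np

# two images whose branch descriptors are swapped versions of each other
a1, b1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # image 1: branch A, branch B
a2, b2 = np.array([0.0, 1.0]), np.array([1.0, 0.0])   # image 2: branches swapped

# summation mixes the activations: the two images become indistinguishable
assert np.allclose(a1 + b1, a2 + b2)

# concatenation preserves each branch's identity: the images stay distinct
assert not np.allclose(np.concatenate([a1, b1]), np.concatenate([a2, b2]))
print("sum collapses the pair; concat keeps them apart")
```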

The table in FIG. 8 compares the results of the summation method (Sum) and the concatenation method (Concat) as combination methods, using the global descriptor configuration SM on CUB200. Concatenating multiple global descriptors provides better performance than summing them. The summation method (Sum) may lose the characteristics of each global descriptor because the activations of the global descriptors mix with one another; in contrast, the concatenation method (Concat) preserves the properties of each global descriptor and maintains diversity.

2. Effect of Combining Descriptors
(1) Quantitative Analysis

The core of the CGD framework of the present invention is the use of multiple global descriptors. For each image retrieval dataset, experiments with the 12 possible configurations were conducted, with the CGD framework applying temperature scaling to the auxiliary classification loss.

FIG. 9 compares the performance of various configurations of the CGD framework on CARS196, and FIG. 10 compares the performance of various configurations of the CGD framework on CUB200. These experiments use a test set that samples 100 instances per class. Because of the stochasticity of deep learning models, results over 10 or more runs are shown using box plots.

Referring to FIG. 9 and FIG. 10, the combined descriptors (SG, GSM, SMG, SM, GM, GS, MS, MSG, MG) show better performance than the single global descriptors (S, M, G). In the case of CUB200, the single global descriptors G and M show relatively high performance; nevertheless, the best-performing configuration is still the combined descriptor MG. Performance can vary with the properties of the dataset, the feature used for the classification loss, the input size, and the output dimension. The essential point is that using multiple global descriptors can improve performance over a single global descriptor.

The table in FIG. 11 compares the performance on CARS196 of the combined descriptors (SG, GSM, SMG, SM, GM, GS, MS, MSG, MG) with the single global descriptors (S, M, G). Each individual descriptor denotes the output feature vector of one branch; a combined descriptor is the final feature vector of the CGD framework.

FIG. 11 shows the performance of the individual global descriptors before combination and the performance improvement obtained after combination. All combined descriptors have a 1536-dimensional embedding vector; in contrast, each individual descriptor has a 768-dimensional embedding vector for SM, MS, SG, GS, MG, and GM, and a 512-dimensional embedding vector for SMG, MSG, and GSM. Larger embedding vectors usually provide better performance. However, if the performance difference between a large and a small embedding vector is not significant, using multiple small embedding vectors from different global descriptors may be preferable. For example, the individual descriptor GeM of the 768-dimensional configuration SG performs similarly to the 1536-dimensional single descriptor G, so SG obtains a significant performance improvement by combining the different characteristics of SPoC and GeM.

3. Flexibility of the CGD Framework

FIG. 12 shows that the CGD framework of the present invention can use various ranking losses (batch-hard triplet loss, HAP2S loss, weighted-sampling margin loss, and so on). Comparing the performance of the single global descriptor S with the multiple global descriptors SM, SM outperforms S in all cases; since any of these losses can be applied, the framework is flexible in this respect.
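For instance, one of the listed ranking losses, the batch-hard triplet loss, can be sketched as follows (a NumPy sketch of the standard formulation; the margin value 0.1 matches the experiments above, but the implementation details are assumptions, not code from the patent):

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.1):
    """Batch-hard triplet loss: for every anchor in the batch, use its
    hardest (farthest) positive and hardest (closest) negative.

    embeddings: (N, D) descriptor vectors
    labels:     (N,) class/identity labels
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)   # (N, N) distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean())
```

With well-separated classes the loss is zero; assigning mixed labels to the same embeddings makes it positive.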

Beyond the ranking loss, the CGD framework of the present invention can be applied to various CNN backbones and various image retrieval datasets. The CGD framework with multiple global descriptors provides higher performance than existing models on most backbones and datasets.

As such, according to an embodiment of the present invention, by applying a new framework for combining multiple global descriptors, namely CGD, a combination of multiple global descriptors that can be trained in an end-to-end manner, an effect equivalent to an ensemble can be achieved without using an explicit ensemble model or diversity control over the global descriptors. The CGD framework of the present invention is flexible and extensible with respect to global descriptors, CNN backbones, losses, and datasets, and since the combined-descriptor approach can exploit different types of features, it not only outperforms a single global descriptor but also improves image retrieval performance.

The above-described apparatus may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will appreciate that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or a processor and a controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired, or may command the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device so as to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer units and recorded on a computer-readable medium. In this case, the medium may continuously store a computer-executable program, or may temporarily store it for execution or download. The medium may be any of various recording or storage units in the form of a single piece of hardware or a combination of several, and is not limited to a medium directly connected to a certain computer system but may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROM and DVD; magneto-optical media such as floptical disks; and media that store program instructions, including ROM, RAM, and flash memory. As other examples of the medium, recording media or storage media managed by an app store that distributes applications, or by a site or server that provides or distributes various kinds of software, may also be cited.

As described above, although the present invention has been described with reference to limited embodiments and drawings, a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description. For example, appropriate results can be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other examples, other embodiments, and technical solutions equivalent to the claims also fall within the scope of the claims set forth below.

100: computer system 110: processor 120: memory 130: persistent storage device 140: bus 150: input/output interface 160: network interface 200: CGD framework 201: CNN backbone 210: main module 220: auxiliary module

FIG. 1 is a block diagram illustrating an example of the internal structure of a computer system according to an embodiment of the present invention.

FIG. 2 shows a framework based on a combination of multiple global descriptors (CGD) for image retrieval according to an embodiment of the present invention.

FIG. 3 is a table illustrating the performance of a CGD framework using both a classification loss and a ranking loss according to an embodiment of the present invention.

FIG. 4 is a table illustrating the performance of a CGD framework using label smoothing and temperature scaling according to an embodiment of the present invention.

FIG. 5 and FIG. 6 show examples of other types of architectures for training multiple global descriptors.

FIG. 7 is a table showing comparison results of the performance of the CGD framework of the present invention against other types of architectures.

FIG. 8 is a table illustrating the performance of a CGD framework that combines multiple global descriptors through concatenation according to an embodiment of the present invention.

FIG. 9 to FIG. 12 are graphs and tables illustrating the performance of configurations formed by combining multiple global descriptors according to an embodiment of the present invention.

200: CGD framework

201: CNN backbone

210: main module

220: auxiliary module

Claims (13)

1. A framework for image retrieval, implemented by a computer system, comprising: a main module that learns by concatenating a plurality of mutually different global descriptors extracted from a convolutional neural network (CNN); and an auxiliary module for further learning one specific global descriptor among the plurality of global descriptors; wherein the main module is a learning module for a ranking loss on an image representation, the auxiliary module is a learning module for a classification loss on the image representation, and the framework for image retrieval is trained in an end-to-end manner with a final loss that is the sum of the ranking loss and the classification loss.

2. The framework for image retrieval of claim 1, wherein the CNN serves as a backbone network that provides a feature map of a given image, and no downsampling is performed before the last stage of the backbone network.

3. The framework for image retrieval of claim 1, wherein the main module normalizes the plurality of global descriptors and then concatenates them into one final global descriptor, and learns the final global descriptor through a ranking loss.
4. The framework for image retrieval of claim 1, wherein the main module comprises a plurality of branches that output respective image representations by using the plurality of global descriptors, and the number of branches varies according to the global descriptors to be used.

5. The framework for image retrieval of claim 1, wherein the auxiliary module uses the classification loss to learn the specific global descriptor, determined based on learning performance, among the plurality of global descriptors.

6. The framework for image retrieval of claim 5, wherein the auxiliary module uses at least one of label smoothing and temperature scaling techniques when learning with the classification loss.

7. A descriptor learning method, executed on a computer system, wherein the computer system comprises at least one processor that executes a plurality of computer-readable instructions contained in a memory, the descriptor learning method comprising: a main learning step of concatenating a plurality of mutually different global descriptors extracted from a CNN and learning them with a ranking loss; and an auxiliary learning step of further learning one specific global descriptor among the plurality of global descriptors with a classification loss.
8. The descriptor learning method of claim 7, wherein the plurality of global descriptors are trained in an end-to-end manner with a final loss that is the sum of the ranking loss and the classification loss.

9. The descriptor learning method of claim 7, wherein the CNN serves as a backbone network that provides a feature map of a given image, and no downsampling is performed before the last stage of the backbone network.

10. The descriptor learning method of claim 7, wherein, in the main learning step, the plurality of global descriptors are normalized and then concatenated into one final global descriptor, and the final global descriptor is learned through the ranking loss.

11. The descriptor learning method of claim 7, wherein, in the auxiliary learning step, the classification loss is used to learn the specific global descriptor, determined based on learning performance, among the plurality of global descriptors.

12. The descriptor learning method of claim 11, wherein, in the auxiliary learning step, at least one of label smoothing and temperature scaling techniques is used when learning with the classification loss.
13. A non-transitory computer-readable recording medium storing a computer program for executing the descriptor learning method of any one of claims 7 to 12 on the computer system.
TW109101190A 2019-03-22 2020-01-14 Framework for combining multiple global descriptors for image retrieval TWI764081B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2019-0032743 2019-03-22
KR20190032743 2019-03-22
KR10-2019-0058341 2019-05-17
KR1020190058341A KR102262264B1 (en) 2019-03-22 2019-05-17 Framework for combining multiple global descriptors for image retrieval

Publications (2)

Publication Number Publication Date
TW202036329A TW202036329A (en) 2020-10-01
TWI764081B true TWI764081B (en) 2022-05-11

Family

ID=72809290

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109101190A TWI764081B (en) 2019-03-22 2020-01-14 Framework for combining multiple global descriptors for image retrieval

Country Status (2)

Country Link
KR (1) KR102262264B1 (en)
TW (1) TWI764081B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220138577A (en) 2021-04-06 2022-10-13 포항공과대학교 산학협력단 Image search method and device using gradient-weighted class activation map
CN113688702B (en) * 2021-08-12 2024-04-26 武汉工程大学 Street view image processing method and system based on fusion of multiple features
CN114821298B (en) * 2022-03-22 2024-08-06 大连理工大学 A multi-label remote sensing image classification method with adaptive semantic information
CN116310425B (en) * 2023-05-24 2023-09-26 山东大学 Fine-grained image retrieval method, system, equipment and storage medium
CN117191821B (en) * 2023-11-03 2024-02-06 山东宇影光学仪器有限公司 High-light-transmittance Fresnel lens real-time detection method based on defocable-DAB-DETR
CN120030334B (en) * 2025-04-22 2025-07-01 江南大学 Industrial data diagnosis method based on SEResNet and attention mechanism

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160357748A1 (en) * 2015-06-04 2016-12-08 Yahoo!, Inc. Image searching
WO2017223530A1 (en) * 2016-06-23 2017-12-28 LoomAi, Inc. Systems and methods for generating computer ready animation models of a human head from captured data images
TW201812646A (en) * 2016-07-18 2018-04-01 美商南坦奧美克公司 Decentralized machine learning system, decentralized machine learning method, and method of generating substitute data
US20180137350A1 (en) * 2016-11-14 2018-05-17 Kodak Alaris Inc. System and method of character recognition using fully convolutional neural networks with attention
TW201839626A (en) * 2017-04-24 2018-11-01 美商英特爾股份有限公司 Hybrid reasoning using low and high precision
TW201839713A (en) * 2017-04-24 2018-11-01 美商英特爾股份有限公司 Calculation optimization mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
K. He, Y. Lu and S. Sclaroff, "Local Descriptors Optimized for Average Precision," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 596-605, doi: 10.1109/CVPR.2018.00069. *

Also Published As

Publication number Publication date
KR102262264B1 (en) 2021-06-09
KR20200112574A (en) 2020-10-05
TW202036329A (en) 2020-10-01

Similar Documents

Publication Publication Date Title
TWI764081B (en) Framework for combining multiple global descriptors for image retrieval
CN111291739B (en) Face detection and image detection neural network training method, device and equipment
Gao et al. MSCFNet: A lightweight network with multi-scale context fusion for real-time semantic segmentation
Li et al. Fast video object segmentation using the global context module
CN110852383B (en) Target detection method and device based on attention mechanism deep learning network
WO2021057056A1 (en) Neural architecture search method, image processing method and device, and storage medium
JP7085600B2 (en) Similar area enhancement method and system using similarity between images
CN112215332A (en) Searching method of neural network structure, image processing method and device
JP6938698B2 (en) A framework that combines multi-global descriptors for image search
CN113449776B (en) Deep learning-based Chinese herbal medicine identification method, device and storage medium
CN111950699A (en) A Neural Network Regularization Method Based on Feature Spatial Correlation
KR20210071471A (en) Apparatus and method for performing matrix multiplication operation of neural network
Zhang et al. Deep metric learning with improved triplet loss for face clustering in videos
CN113761934B (en) Word vector representation method based on self-attention mechanism and self-attention model
CN113240039A (en) Small sample target detection method and system based on spatial position characteristic reweighting
CN117315752B (en) Training method, device, equipment and medium for face emotion recognition network model
You et al. Enhancing ensemble diversity based on multiscale dilated convolution in image classification
Cui et al. Hand gesture segmentation against complex background based on improved atrous spatial pyramid pooling
WO2024179485A1 (en) Image processing method and related device thereof
CN115082840B (en) Action video classification method and device based on data combination and channel correlation
WO2022227024A1 (en) Operational method and apparatus for neural network model and training method and apparatus for neural network model
Liu et al. Multi-stream with deep convolutional neural networks for human action recognition in videos
CN114329070A (en) Video feature extraction method and device, computer equipment and storage medium
US20250173825A1 (en) Method and electronic device for performing image processing
CN115546503B (en) Adaptive multi-scale visual feature expression method and system based on deep attention