WO2019230665A1 - Learning device, search device, method, and program - Google Patents
Learning device, search device, method, and program
- Publication number
- WO2019230665A1 (PCT/JP2019/020947)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- converted
- images
- neural network
- belonging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/56—Information retrieval; Database structures therefor; File system structures therefor of still image data having vectorial format
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
Definitions
- The present invention relates to a learning device, a search device, a method, and a program, and more particularly to a learning device, a search device, a method, and a program for searching for an object that appears in an image.
- Patent Document 1 discloses a method for searching for an object using many feature vectors extracted from an image. A large number of small characteristic regions are detected in the image, and a feature vector is extracted from each region. Next, for the small regions contained in two different images, the Euclidean distances between feature vectors are calculated and the number of corresponding sub-regions is counted. Because similarity grows with this count, a large count indicates that the same object appears in the two images.
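- As a rough illustration of this counting scheme (a sketch, not the patent's own implementation: region detection and descriptor extraction are assumed to have happened elsewhere, and the distance threshold is arbitrary):

```python
import numpy as np

def count_matches(desc_a, desc_b, threshold=0.8):
    """Count corresponding micro-regions between two images.

    desc_a, desc_b: arrays of shape (n, d), one descriptor per
    detected region. A pair corresponds when the Euclidean
    distance between the descriptors falls below the threshold.
    """
    # Pairwise Euclidean distances between all descriptor pairs.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    # Each region in image A corresponds to at most its nearest neighbour in B.
    return int(np.sum(dists.min(axis=1) < threshold))

# The larger the count, the more likely the same object appears in both images.
similarity = count_matches(np.random.rand(50, 128), np.random.rand(60, 128))
```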
- Here, resolution refers to the number of pixels.
- A resolution conversion technique that prevents the resolution gap between a query image and a reference image, and thereby enables accurate search, is therefore desired.
- Non-Patent Document 2 discloses a learning-based super-resolution method built on image pairs: a low-resolution image is enlarged by the Bicubic method and then converted with a CNN to obtain a high-resolution image. By training the CNN in advance on pairs of low- and high-resolution images, high-frequency components absent from the low-resolution image can be restored with high accuracy. Extracting feature vectors after converting the low-resolution image to high resolution can therefore improve search accuracy.
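- A minimal sketch of this kind of pair-based super-resolution, assuming a PyTorch setting; the three-layer architecture and filter sizes are illustrative assumptions, not the configuration of Non-Patent Document 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairTrainedSR(nn.Module):
    """CNN that refines a Bicubic-enlarged image; it would be
    trained on (low-resolution, high-resolution) image pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 5, padding=2),
        )

    def forward(self, low_res, scale=4):
        # Enlarge by the Bicubic method first, then let the CNN
        # restore the missing high-frequency components.
        up = F.interpolate(low_res, scale_factor=scale, mode='bicubic',
                           align_corners=False)
        return up + self.net(up)  # residual correction of the enlargement

model = PairTrainedSR()
sr = model(torch.randn(1, 3, 32, 32))  # (1, 3, 32, 32) -> (1, 3, 128, 128)
```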
- Non-Patent Document 3 discloses a learning-based image conversion method built on a pair of image sets; the conversion between the two sets is acquired through learning. More specifically, the system comprises a converter that transforms image set X into image set Y, a discriminator that distinguishes converted images from images belonging to set Y, a converter that transforms set Y into set X, and a discriminator that distinguishes converted images from images belonging to set X. By converting an image of set X into set Y and using the reconstruction error when it is converted back to X, conversion between the two image sets can be realized even without one-to-one corresponding image pairs. For example, by taking one set to be low-resolution images and the other high-resolution images, a resolution converter can be acquired and the discrepancy between query and reference images prevented.
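- The reconstruction error at the heart of this scheme can be sketched as follows (the converters here are stand-ins so the sketch runs; in Non-Patent Document 3 they are learned CNNs accompanied by discriminators):

```python
import torch
import torch.nn as nn

def cycle_reconstruction_loss(G_xy, G_yx, x, y):
    """X -> Y -> X and Y -> X -> Y should reproduce the originals,
    which is what removes the need for one-to-one image pairs."""
    l1 = nn.L1Loss()
    return l1(G_yx(G_xy(x)), x) + l1(G_xy(G_yx(y)), y)

# Stand-in converters; the real ones are much deeper CNNs.
G_xy = nn.Conv2d(3, 3, 3, padding=1)
G_yx = nn.Conv2d(3, 3, 3, padding=1)
loss = cycle_reconstruction_loss(G_xy, G_yx,
                                 torch.randn(2, 3, 64, 64),
                                 torch.randn(2, 3, 64, 64))
```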
- For methods that learn a conversion from image pairs, as in Non-Patent Document 2, preparing the pairs is itself a problem.
- In Non-Patent Document 2, training pairs consist of a high-resolution image and the image obtained by reducing it with the Bicubic method.
- In that case, search accuracy degrades because an image down-converted by the Bicubic method differs from how an object actually appears at low resolution in a captured image.
- Moreover, it is difficult to photograph the same object at both high and low resolution under identical conditions to prepare image pairs of differing resolutions, so collecting a large number of pairs as training data is inefficient.
- A method that learns a conversion from a pair of image sets, as in Non-Patent Document 3, converts an image belonging to one set into an image resembling the other set, and is trained so that converting back again returns the original image. Each individual conversion may therefore produce an image whose appearance deviates substantially, which can seriously degrade search accuracy.
- A further problem, common to both methods above, is that neither considers the feature vector used for object search.
- In Non-Patent Document 2, the converted low-resolution image is trained to be close to the high-resolution image, but their feature vectors do not necessarily match.
- Likewise, in Non-Patent Document 3, the feature vectors before and after conversion do not necessarily match.
- The present invention was made to solve the above problems, and aims to provide a learning device, method, and program capable of learning neural network parameters for accurately searching for an object appearing in an image.
- To achieve the above object, a learning device according to a first aspect comprises: a first conversion unit that converts each first image belonging to a first image set of images having a predetermined resolution, by a first convolutional neural network, to a resolution higher than the first image and corresponding to the resolution of the second images belonging to a second image set; a second conversion unit that converts each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit that extracts a feature vector from each first image, each second image, each converted image produced by the first conversion unit, and each converted image produced by the second conversion unit; and a parameter update unit that updates the parameters of the first and second convolutional neural networks based on the error between the feature vectors of the first images and those of the converted second images, and the error between the feature vectors of the second images and those of the converted first images.
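- The distinguishing ingredient of this aspect is the feature-vector error. A minimal sketch of one reading of it, following the embodiment's later description of a squared error between feature vectors before and after conversion (all function names are placeholders, not the patent's reference signs):

```python
import torch
import torch.nn.functional as F

def feature_vector_error(f_ext, g_up, g_down, x_low, y_high):
    """f_ext : feature extractor, image batch -> (B, D) vectors;
    it must yield fixed-dimensional vectors regardless of image size.
    g_up  : first converter,  low resolution  -> high resolution.
    g_down: second converter, high resolution -> low resolution.
    """
    # Up-converting a low-resolution image should preserve its
    # object-search feature vector...
    err_low = F.mse_loss(f_ext(g_up(x_low)), f_ext(x_low))
    # ...and likewise for down-converting a high-resolution image.
    err_high = F.mse_loss(f_ext(g_down(y_high)), f_ext(y_high))
    return err_low + err_high
```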
- In the learning device according to the first aspect, the first conversion unit may further convert each converted second image into a re-converted image whose resolution corresponds to the second images of the second image set, and the second conversion unit may further convert each converted first image into a re-converted image whose resolution corresponds to the first images of the first image set; the parameter update unit may then additionally use the error between each re-converted first image and the corresponding original first image, and the error between each re-converted second image and the corresponding original second image, when updating the parameters of the two convolutional neural networks.
- In the learning device according to the first aspect, the parameter update unit may further update the parameters of the first convolutional neural network and of an identification neural network that identifies whether each image converted by the first conversion unit is a first image of the first image set or a second image of the second image set, using a loss function expressing that the first convolutional neural network and the identification neural network compete with each other.
- A search device according to a second aspect comprises: a search first conversion unit that, using the first convolutional neural network whose parameters were learned by the learning device of the first aspect, converts an arbitrary image to a resolution corresponding to the second images of the second image set; a search feature extraction unit that extracts feature vectors from the converted arbitrary image and from each third image of a third image set; and a collation unit that collates the arbitrary image against the third images based on the similarity between the feature vector of the converted image and the feature vector of each third image, and outputs the collation result.
- A search device according to a third aspect comprises: a search second conversion unit that, using the second convolutional neural network whose parameters were learned by the learning device of the first aspect, converts each third image of a third image set to a resolution corresponding to the first images of the first image set; a search feature extraction unit that extracts feature vectors from an arbitrary image and from each converted third image; and a collation unit that collates the arbitrary image against the third images based on the similarity between the feature vector of the arbitrary image and the feature vector of each converted third image, and outputs the collation result.
- The search device according to the third aspect may further include a search first conversion unit that, using the first convolutional neural network, converts the arbitrary image to a resolution corresponding to the second images of the second image set; the search feature extraction unit then also extracts feature vectors from the converted arbitrary image and from each third image, and the collation unit collates the arbitrary image against the third images using both the similarity between the feature vector of the arbitrary image and the feature vectors of the converted third images and the similarity between the feature vector of the converted arbitrary image and the feature vectors of the third images, and outputs the collation result.
- A learning method according to a fourth aspect comprises the steps of: a first conversion unit converting each first image belonging to a first image set of images having a predetermined resolution, by a first convolutional neural network, to a resolution higher than the first image and corresponding to the second images of a second image set; a second conversion unit converting each second image of the second image set, by a second convolutional neural network, to a resolution corresponding to the first images of the first image set; a feature extraction unit extracting a feature vector from each first image, each second image, each converted first image, and each converted second image; and a parameter update unit updating the parameters of the first and second convolutional neural networks based on the error between the feature vectors of the first images and those of the converted second images and the error between the feature vectors of the second images and those of the converted first images.
- A program according to a fifth aspect causes a computer to function as each unit of the learning device according to the first aspect or the search device according to the second or third aspect.
- According to the learning device, method, and program described above, each first image of a first image set having a predetermined resolution is converted by a first convolutional neural network to a resolution higher than the first image and corresponding to the second images of a second image set; each second image is converted by a second convolutional neural network to a resolution corresponding to the first images; feature vectors are extracted from each original and converted image; and the parameters of both networks are updated based on the errors between the feature vectors, so that neural network parameters for accurately searching for an object appearing in an image can be learned.
- According to the search device, an image is converted by the first convolutional neural network using the learned parameters, each third image serving as a reference image is converted by the second convolutional neural network using the learned parameters, feature vectors are extracted from the converted images, similarity over the pairs of feature vectors is collated, and the collation result is output, so that an object appearing in an image can be searched with high accuracy.
- The learning device and the search device according to the present embodiment obtain accurate search results even when the resolution of a query image showing an object to be identified (hereinafter, the specified target) deviates greatly from that of the reference images.
- The learning device 10 shown in FIG. 1 is a learning device for obtaining accurate search results even under such a resolution gap.
- The learning device 10 can be configured as a computer including a CPU, a RAM, and a ROM storing a program for executing the learning processing routine described later, together with various data.
- As shown in FIG. 1, the learning device 10 includes a first conversion unit 21, a second conversion unit 22, a feature extraction unit 23, a parameter update unit 24, and a storage unit 29.
- The image 5 corresponds to the query image, and the third image set 6 corresponds to a set containing one or more reference images. Here the query image has low resolution and the reference images have high resolution, where resolution denotes the total number of pixels of an image.
- The learning device 10 receives as input a first image set 3, consisting of first images corresponding to one or more low-resolution images stored in the database 2, and a second image set 4, consisting of second images corresponding to one or more high-resolution images. Because training uses a neural network that judges whether an input image is a converted image or a genuine one, the first image set 3 and the second image set 4 need not correspond to each other.
- Each first image, corresponding to a low-resolution image, is enlarged by the Bicubic method or the like so as to have the same number of pixels as the second images, which correspond to high-resolution images.
- The learning device 10 and the database 2 exchange information via communication means (not shown).
- The database 2 can be implemented, for example, as a file system on a general-purpose computer. Image data of the first image set 3 and the second image set 4 is stored in the database 2 in advance. Each image is given an identifier that uniquely identifies it, such as a serial-number ID (Identification) or a unique image file name, and the database 2 stores, for each image, the identifier and the image data in association with each other.
- The database 2 may likewise be implemented with an RDBMS (Relational Database Management System) or the like. The stored information may also include metadata, for example information representing the content of an image (title, summary text, keywords, and the like) or information about the image format (data size, thumbnail size, and the like), but storing such information is not essential to implementing the present disclosure.
- The database 2 may be provided either inside or outside the learning device 10, and any known communication means can be used. In this embodiment, the database 2 is provided outside the learning device 10 and is connected so that it can communicate with the learning device 10 over a network such as the Internet using TCP/IP (Transmission Control Protocol/Internet Protocol).
- Each unit of the learning device 10 and the database 2 may be configured as a computer or server equipped with arithmetic processing units such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit) and storage devices such as a RAM (Random Access Memory), a ROM (Read Only Memory), and an HDD (Hard Disk Drive), with the processing of each unit executed by a program. The program may be stored in advance in the storage device of the learning device 10, stored on a recording medium such as a magnetic disk, optical disc, or semiconductor memory, or provided through a network. The components need not be realized by a single computer or server; they may be distributed across a plurality of computers connected by a network.
- The first conversion unit 21 converts each first image of the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution higher than the first image and corresponding to that of the second images of the second image set 4.
- The first conversion unit 21 also converts each converted second image produced by the second conversion unit 22, by the first convolutional neural network, into a re-converted image whose resolution corresponds to that of the second images of the second image set 4.
- In other words, the conversion performed by the first conversion unit 21 is a conversion from a low-resolution image to a high-resolution image.
- The first convolutional neural network used for image conversion is not limited, so long as it performs the conversion by convolution with a neural network.
- In the present embodiment, conversion is performed by a 9-layer convolutional neural network (CNN) of the kind described in Non-Patent Document 3, consisting of stride-2 convolution layers that perform downsampling, residual blocks, and stride-1/2 convolution layers that perform upsampling.
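- A sketch of such a converter, assuming PyTorch; channel widths, the residual-block count, and the use of transposed convolutions for the stride-1/2 layers are assumptions rather than the patent's exact specification:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection of the residual block

class Converter(nn.Module):
    """Stride-2 convolutions for downsampling, residual blocks,
    then stride-1/2 (transposed) convolutions for upsampling."""
    def __init__(self, n_res=3):
        super().__init__()
        layers = [nn.Conv2d(3, 32, 7, padding=3), nn.ReLU(inplace=True),
                  nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                  nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        layers += [ResidualBlock(128) for _ in range(n_res)]
        layers += [nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1,
                                      output_padding=1), nn.ReLU(inplace=True),
                   nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1,
                                      output_padding=1), nn.ReLU(inplace=True),
                   nn.Conv2d(32, 3, 7, padding=3)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

out = Converter()(torch.randn(1, 3, 64, 64))  # same spatial size in and out
```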
- The second conversion unit 22 converts each second image of the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images of the first image set 3. The second conversion unit 22 also converts each converted first image produced by the first conversion unit 21, by the second convolutional neural network, into a re-converted image whose resolution corresponds to that of the first images of the first image set 3.
- The feature extraction unit 23 extracts a feature vector from each first image of the first image set 3, each second image of the second image set 4, each converted first image produced by the first conversion unit 21, and each converted second image produced by the second conversion unit 22.
- Any feature expressible as a fixed-dimensional vector obtained with a neural network may be used; for example, it can be extracted by the method described in Non-Patent Document 4.
- Specifically, a feature map (indexed by height, width, and number of channels) is obtained with VGG16 or ResNet101, which are kinds of CNN. Rectangles of various sizes are defined on the map, and a vector of size (number of rectangles × number of channels) is obtained by taking the maximum value inside each rectangle for each channel. For normalization, any known method may be used, but L2 normalization is preferable.
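- A sketch of this extraction in the spirit of Non-Patent Document 4, with an untrained VGG16 backbone and a hand-picked region layout standing in for the real configuration:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def region_max_vector(image, regions):
    """Max-pool a CNN feature map over rectangles of various sizes,
    concatenate the per-rectangle vectors, then L2-normalize."""
    # Untrained weights keep the sketch self-contained; a trained
    # backbone would be used in practice.
    backbone = models.vgg16(weights=None).features
    fmap = backbone(image)  # (1, 512, 7, 7) for a 224x224 input
    pooled = []
    for (y0, y1, x0, x1) in regions:
        # Maximum value inside the rectangle, taken per channel.
        pooled.append(fmap[:, :, y0:y1, x0:x1].amax(dim=(2, 3)))
    vec = torch.cat(pooled, dim=1)       # (1, n_rectangles * n_channels)
    return F.normalize(vec, p=2, dim=1)  # L2 normalization

vec = region_max_vector(torch.randn(1, 3, 224, 224),
                        regions=[(0, 7, 0, 7), (0, 4, 0, 4), (3, 7, 3, 7)])
```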
- Non-Patent Document 4: A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "End-to-end learning of deep visual representations for image retrieval," IJCV, 2017.
- The parameter update unit 24 updates the parameters of the first and second convolutional neural networks based on the error between feature vectors, the error between images before and after conversion, and the loss function, and stores the updated parameters in the storage unit 29.
- The error between feature vectors consists of the error between the feature vectors of the first images of the first image set 3 and those of the converted second images produced by the second conversion unit 22, and the error between the feature vectors of the second images of the second image set 4 and those of the converted first images produced by the first conversion unit 21.
- The error between images before and after conversion consists of the error between each re-converted first image produced by the second conversion unit 22 and the corresponding first image of the first image set 3, and the error between each re-converted second image produced by the first conversion unit 21 and the corresponding second image of the second image set 4.
- The loss function comprises a loss expressing that the first convolutional neural network competes with an identification neural network that identifies whether each image converted by the first conversion unit 21 is a first image of the first image set 3 or a second image of the second image set 4, and a loss expressing that the second convolutional neural network competes with an identification neural network that identifies whether each image converted by the second conversion unit 22 is a first image of the first image set 3 or a second image of the second image set 4.
- Using these errors and loss functions, the parameter update unit 24 updates the parameters of the first and second convolutional neural networks.
- Any known form of loss function may be used. In the present embodiment, the feature-vector error is the squared error between the feature vectors before and after conversion, and the Adversarial Loss described in Non-Patent Document 3 and the image error of the Cycle Consistency Loss may be added to it.
- For the Adversarial Loss, the first convolutional neural network and its identification neural network are trained alternately using the loss function above. The value of the loss function decreases the more the converted first images fail to be identified as conversions, and the parameters of the first convolutional neural network and the identification neural network are updated so that this value decreases; that is, the parameters of the first network are learned so that the identification network cannot distinguish the converted images.
- Similarly, the second convolutional neural network and its identification neural network are trained alternately using the loss function above. The value of the loss function decreases the more the converted second images fail to be identified as conversions, and the parameters of the second convolutional neural network and the identification neural network are updated so that this value decreases.
- The Cycle Consistency Loss consists of the error between each re-converted image, obtained by further converting with the second conversion unit 22 each converted first image produced by the first conversion unit 21, and the corresponding first image before conversion, together with the error between each re-converted image, obtained by further converting with the first conversion unit 21 each converted second image produced by the second conversion unit 22, and the corresponding second image before conversion.
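- Put together, one reading of the converters' combined objective looks as follows; the weights and the exact mixing are assumptions, and the networks are passed in rather than defined here:

```python
import torch
import torch.nn.functional as F

def converter_objective(g_up, g_down, d_high, d_low, f_ext,
                        x_low, y_high, w_cyc=10.0, w_feat=1.0):
    fake_high = g_up(x_low)    # low  -> high conversion
    fake_low = g_down(y_high)  # high -> low  conversion

    # Adversarial Loss: the converters try to make the identification
    # networks label converted images as genuine.
    lh, ll = d_high(fake_high), d_low(fake_low)
    adv = F.binary_cross_entropy_with_logits(lh, torch.ones_like(lh)) \
        + F.binary_cross_entropy_with_logits(ll, torch.ones_like(ll))

    # Cycle Consistency Loss: converting there and back must
    # reproduce the image before conversion.
    cyc = F.l1_loss(g_down(fake_high), x_low) + F.l1_loss(g_up(fake_low), y_high)

    # Squared feature-vector error before and after conversion.
    feat = F.mse_loss(f_ext(fake_high), f_ext(x_low)) \
         + F.mse_loss(f_ext(fake_low), f_ext(y_high))

    return adv + w_cyc * cyc + w_feat * feat
```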
- The method for updating the parameters is not limited; for example, the parameters may be updated using the error backpropagation method, which corrects the parameters of each neuron from the output of the neural network back toward the input so that the local error decreases. Updates using the feature-vector error and the image error before and after conversion, and updates using the loss function, may be performed alternately.
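- A sketch of the alternating updates for one converter/identification pair (the other pair is handled symmetrically; the networks and optimizers are assumed to exist):

```python
import torch
import torch.nn.functional as F

def bce(logits, is_real):
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def alternating_step(g_up, d_high, x_low, y_high, opt_g, opt_d):
    # Identification step: learn to tell converted images from
    # genuine high-resolution ones.
    opt_d.zero_grad()
    fake = g_up(x_low).detach()  # detach: only the identifier's parameters move
    d_loss = bce(d_high(fake), False) + bce(d_high(y_high), True)
    d_loss.backward()            # error backpropagation, output toward input
    opt_d.step()

    # Converter step: the loss decreases as converted images stop
    # being identified as conversions.
    opt_g.zero_grad()
    g_loss = bce(d_high(g_up(x_low)), True)
    g_loss.backward()
    opt_g.step()
```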
- The search device 11 converts a low-resolution query image to high resolution, converts high-resolution reference images to low resolution, or performs both conversions, and then extracts and collates feature vectors; this makes accurate search possible between images whose resolutions differ.
- The search device 11 shown in FIG. 2 can be configured as a computer including a CPU, a RAM, and a ROM storing a program for executing the search processing routine described later, together with various data.
- As shown in FIG. 2, the search device 11 includes a search first conversion unit 31, a search second conversion unit 32, a search feature extraction unit 33, a collation unit 35, and a storage unit 39.
- The storage unit 39 stores the parameters of the first and second convolutional neural networks learned by the learning device 10.
- The search first conversion unit 31 converts the image 5, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution higher than the first images and corresponding to that of the second images of the second image set 4.
- The search second conversion unit 32 converts each third image of the third image set 6, which consists of reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images of the first image set 3. The third image set 6 may be the same as the second image set 4.
- The search feature extraction unit 33 extracts a feature vector from the image 5, from the converted image of the image 5 produced by the search first conversion unit 31, from each third image of the third image set 6, and from each converted third image.
- The collation unit 35 collates similarity using the pair of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each converted third image produced by the search second conversion unit 32, together with the pair of the feature vector of the converted image 5 and the feature vector of each third image of the third image set 6, and outputs the collation result as the search result 7. The similarity check may use at least one of the two pairings.
- The collation may be performed, for example, by calculating the inner product between feature vectors, taking the value as the similarity between images, and outputting the top N images of the third image set 6 with the highest similarity as the search result 7, where N is an integer from 1 to the number of images in the third image set 6.
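- A minimal sketch of this collation with NumPy, assuming L2-normalized feature vectors:

```python
import numpy as np

def top_n_matches(query_vec, ref_vecs, n=5):
    """Inner product as similarity: with L2-normalized vectors,
    a larger inner product means a more similar image."""
    sims = ref_vecs @ query_vec       # one inner product per reference image
    order = np.argsort(-sims)[:n]     # indices of the N most similar images
    return order, sims[order]

refs = np.random.rand(100, 256)
refs /= np.linalg.norm(refs, axis=1, keepdims=True)  # L2 normalization
query = refs[3] / np.linalg.norm(refs[3])
idx, scores = top_n_matches(query, refs)  # top-N images as the search result
```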
- The number of feature vectors for the image 5 input to the collation unit 35 need not be one per image 5; a plurality may be used.
- For example, the feature vector extracted from the image 5 and the feature vector extracted from the image 5 after conversion by the search first conversion unit 31 may both be used as search queries, and the search may be performed against the feature vectors extracted from each third image of the third image set 6 and from each converted third image produced by the search second conversion unit 32.
- For normalization, any known method may be used, but L2 normalization is preferable.
- The learning device 10 executes the learning processing routine shown in FIG.
- In step S201, the first conversion unit 21 converts each first image of the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution higher than the first image and corresponding to that of the second images of the second image set 4.
- In step S202, the second conversion unit 22 converts each second image of the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images of the first image set 3.
- In step S203, the feature extraction unit 23 extracts a feature vector from each first image of the first image set 3 and from each converted second image produced in step S202.
- In step S204, the feature extraction unit 23 extracts a feature vector from each second image of the second image set 4 and from each converted first image produced in step S201.
- In step S205, the first conversion unit 21 converts each converted second image produced in step S202, by the first convolutional neural network, into a re-converted image whose resolution corresponds to that of the second images of the second image set 4.
- In step S206, the second conversion unit 22 converts each converted first image produced in step S201, by the second convolutional neural network, into a re-converted image whose resolution corresponds to that of the first images of the first image set 3.
- The parameter update unit 24 then computes, as the error between images before and after conversion, the error between each re-converted first image produced by the second conversion unit 22 and the corresponding first image of the first image set 3 and the error between each re-converted second image produced by the first conversion unit 21 and the corresponding second image of the second image set 4; it computes, as the error between feature vectors, the error between the feature vectors of the first images of the first image set 3 and those of the converted second images produced by the second conversion unit 22 and the error between the feature vectors of the second images and those of the converted first images; and it updates the parameters of both convolutional neural networks accordingly.
- As described above, according to the learning device of this embodiment, each first image of the first image set 3, consisting of first images of a predetermined resolution, is converted by the first convolutional neural network to a resolution higher than the first image and corresponding to the second images of the second image set 4; each second image of the second image set 4 is converted by the second convolutional neural network to a resolution corresponding to the first images of the first image set 3; feature vectors are extracted from each first image, each second image, each converted first image, and each converted second image; and the parameters are updated using the error between the feature vectors of the images and those of the converted images, the error between the images and the re-converted images, and the loss functions expressing that each convolutional neural network competes with its identification neural network. Neural network parameters for accurately searching for an object appearing in an image can thereby be learned.
- Next, the search device 11 executes the search processing routine shown in FIG.
- In step S301, the search first conversion unit 31 converts the image 5 to be collated, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution higher than the first images and corresponding to that of the second images of the second image set 4.
- In step S302, the search second conversion unit 32 converts each third image of the third image set 6, which consists of reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images of the first image set 3.
- In step S303, the search feature extraction unit 33 extracts a feature vector from the image 5, from the converted image of the image 5 produced by the search first conversion unit 31, from each third image of the third image set 6, and from each converted third image.
- In step S304, the collation unit 35 collates similarity using the pair of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each converted third image produced by the search second conversion unit 32, together with the pair of the feature vector of the converted image 5 and the feature vector of each third image of the third image set 6, and outputs the collation result as the search result 7.
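- The whole routine can be sketched as follows; how the two similarities are combined (here, a simple sum) is an assumption, and f_ext, g_up, and g_down are placeholder names for the feature extractor and the two converters:

```python
import torch

@torch.no_grad()
def search_routine(image5, third_images, g_up, g_down, f_ext, n=10):
    """S301: convert the query up; S302: convert the references down;
    S303: extract feature vectors; S304: collate both pairings."""
    q = f_ext(image5)                        # query feature, as input
    q_up = f_ext(g_up(image5))               # S301: converted query
    refs = f_ext(third_images)               # reference features, as input
    refs_down = f_ext(g_down(third_images))  # S302: converted references

    # S304: inner-product similarity for both set pairings.
    sims = q @ refs_down.t() + q_up @ refs.t()  # shape (1, N)
    top = torch.topk(sims.squeeze(0), n)        # top-N as search result 7
    return top.indices, top.values
```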
- As described above, according to the search device of this embodiment, an image is converted by the first convolutional neural network using the learned parameters, each third image serving as a reference image is converted by the second convolutional neural network using the learned parameters, feature vectors are extracted from the converted images, similarity over the pairs of feature vectors is collated, and the collation result is output; an object appearing in an image can thereby be searched with high accuracy.
- A database of reference images (the third image set 6) may also be constructed in advance, with the image conversion by the second conversion unit 22 and the feature vector extraction by the feature extraction unit 23 performed beforehand and the results stored in the database. The query image (image 5) is then processed as input is received from outside at inquiry time, and the collation unit 35 obtains the reference feature vectors from the database and calculates the similarity. In this case the search device may be configured with a search second conversion unit, a search feature extraction unit, and a collation unit; this configuration shortens the time from receiving the image 5 to obtaining the search result 7.
- Alternatively, matching may be performed between the feature vectors extracted from each third image of the third image set 6 and the feature vector of the converted image obtained by converting the image 5. In that case the search device need only comprise a search first conversion unit, a search feature extraction unit, and a collation unit; this configuration likewise shortens the time from receiving the image 5 to obtaining the search result 7. A sketch of such an offline-built reference index follows.
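```python
import numpy as np

class ReferenceIndex:
    """Reference features are computed once, offline, so that only
    query-side processing and similarity calculation remain at
    inquiry time (a sketch; extract_fn stands in for the
    conversion-plus-feature-extraction pipeline and is assumed
    to return an L2-normalized NumPy vector)."""
    def __init__(self, extract_fn):
        self.extract_fn = extract_fn
        self.ids, self.vecs = [], []

    def add(self, image_id, image):
        # Performed when the reference database is constructed.
        self.ids.append(image_id)
        self.vecs.append(self.extract_fn(image))

    def query(self, query_vec, n=5):
        # Inner-product similarity against the stored vectors.
        sims = np.stack(self.vecs) @ query_vec
        return [self.ids[i] for i in np.argsort(-sims)[:n]]

# Hypothetical usage with a dummy extractor:
index = ReferenceIndex(lambda img: img.mean(axis=(0, 1)))
index.add("ref-001", np.random.rand(32, 32, 8))
nearest = index.query(np.random.rand(8), n=1)
```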
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The present invention relates to a learning device, a search device, a method, and a program, and more particularly to a learning device, a search device, a method, and a program for searching for an object that appears in an image.

With the spread of small imaging devices such as smartphones, there is growing demand for technology that searches for an object appearing in images of arbitrary subjects taken in various places and environments.

Various techniques for searching for an object in an image have been invented and disclosed; a typical procedure is described here following the technique of Patent Document 1. First, a feature vector is extracted from an image using a convolutional neural network (CNN). Next, the inner product of the feature vectors of two different images is calculated; the larger the inner product, the more the two images are regarded as showing the same object. By building a reference image database in advance from images containing the objects to be recognized (reference images) and searching for entries showing the same object as a newly input image (query image), the object present in the query image can be identified.

Patent Document 1 also discloses a method for searching for an object using many feature vectors extracted from an image. A large number of small characteristic regions are detected in the image, and a feature vector is extracted from each region. Next, for the small regions contained in two different images, the Euclidean distances between feature vectors are calculated and the number of corresponding sub-regions is counted. Because similarity grows with this count, a large count indicates that the same object appears in the two images.

However, this object search procedure has a serious problem. When the resolutions of the objects shown in the query image and the reference image diverge, different feature vectors are often obtained even for the same object, and as a result a different object may be retrieved. In the following description, resolution refers to the number of pixels.

For example, when an image in which the object appears small and at low resolution is used as a query against reference images in which the object appears large and at high resolution, high-frequency components are often lost from the object in the query image; this is a typical case in which the above problem arises.

In view of these problems, a resolution conversion technique that prevents the resolution gap between query and reference images and thereby enables search is desired.

Several inventions addressing these problems have been made and disclosed.
Non-Patent Document 2 discloses a learning-based super-resolution method built on image pairs: a low-resolution image is enlarged by the Bicubic method and then converted with a CNN to obtain a high-resolution image. By training the CNN in advance on pairs of low- and high-resolution images, high-frequency components absent from the low-resolution image can be restored with high accuracy. Extracting feature vectors after converting the low-resolution image to high resolution can therefore improve search accuracy.

Non-Patent Document 3 discloses a learning-based image conversion method built on a pair of image sets; the conversion between the two sets is acquired through learning. More specifically, the system comprises a converter that transforms image set X into image set Y, a discriminator that distinguishes converted images from images belonging to set Y, a converter that transforms set Y into set X, and a discriminator that distinguishes converted images from images belonging to set X. By converting an image of set X into set Y and using the reconstruction error when it is converted back to X, conversion between the two image sets can be realized even without one-to-one corresponding image pairs. For example, by taking one set to be low-resolution images and the other high-resolution images, a resolution converter can be acquired and the discrepancy between query and reference images prevented.

For methods that learn a conversion from image pairs, as in Non-Patent Document 2, preparing the pairs is itself a problem. In Non-Patent Document 2, training pairs consist of a high-resolution image and the image obtained by reducing it with the Bicubic method. In that case, search accuracy degrades because an image down-converted by the Bicubic method differs from how an object actually appears at low resolution in a captured image. Moreover, it is difficult to photograph the same object at both high and low resolution under identical conditions to prepare image pairs of differing resolutions, so collecting a large number of pairs as training data is inefficient.

A method that learns a conversion from a pair of image sets, as in Non-Patent Document 3, converts an image belonging to one set into an image resembling the other set, and is trained so that converting back again returns the original image. Each individual conversion may therefore produce an image whose appearance deviates substantially, which can seriously degrade search accuracy.

A further problem, common to both methods, is that neither considers the feature vector used for object search. In Non-Patent Document 2, the converted low-resolution image is trained to be close to the high-resolution image, but their feature vectors do not necessarily match. Likewise, in Non-Patent Document 3, the feature vectors before and after conversion do not necessarily match.

As described above, no search technique had yet been invented that can search for an object with high accuracy when a resolution gap exists between the query image and the reference images.

The present invention was made to solve the above problems, and aims to provide a learning device, method, and program capable of learning neural network parameters for accurately searching for an object appearing in an image.

It is a further object to provide a search device, method, and program capable of accurately searching for an object appearing in an image.
上記目的を達成するために、第1の発明に係る学習装置は、所定の解像度の第一の画像からなる第一画像集合に属する前記第一の画像の各々を、第一の畳み込みニューラルネットワークによって前記第一の画像よりも高解像度であって、第二画像集合に属する第二の画像の解像度に対応する解像度となるように変換する第一変換部と、前記第二画像集合に属する前記第二の画像の各々を、第二の畳み込みニューラルネットワークによって前記第一画像集合に属する第一の画像の解像度に対応する解像度となるように変換する第二変換部と、前記第一画像集合に属する前記第一の画像の各々と、前記第二画像集合に属する前記第二の画像の各々と、前記第一変換部により変換された前記第一の画像の各々の変換画像の各々と、前記第二変換部により変換された前記第二の画像の各々の変換画像の各々とから特徴量ベクトルを抽出する特徴量抽出部と、前記特徴量抽出部により抽出された、前記第一画像集合に属する前記第一の画像の各々の特徴量ベクトルと前記第二変換部により変換された前記第二の画像の各々の変換画像の各々の特徴量ベクトルとの間の誤差、及び前記第二画像集合に属する前記第二の画像の各々の特徴量ベクトルと前記第一変換部により変換された前記第一の画像の各々の変換画像の各々の特徴量ベクトルとの誤差に基づいて、前記第一の畳み込みニューラルネットワーク及び前記第二の畳み込みニューラルネットワークのパラメータを更新するパラメータ更新部と、を含んで構成されている。 In order to achieve the above object, a learning device according to a first invention uses a first convolution neural network to connect each of the first images belonging to a first image set consisting of a first image having a predetermined resolution. A first conversion unit for converting the first image to have a resolution higher than that of the first image and corresponding to the resolution of the second image belonging to the second image set; and the first image belonging to the second image set. A second converter that converts each of the two images to a resolution corresponding to the resolution of the first image belonging to the first image set by a second convolution neural network, and belongs to the first image set Each of the first images, each of the second images belonging to the second image set, each of the converted images of the first image converted by the first conversion unit, Two conversion units A feature quantity extraction unit that extracts a feature quantity vector from each of the converted images of the second image that has been further transformed, and the first image belonging to the first image set extracted by the feature quantity extraction unit. An error between each feature quantity vector of the image and each feature quantity vector of each converted image of the second image transformed by the second transform unit, and the second image set belonging to the second image set. A first convolutional neural network based on an error between each feature vector of the second image and each feature vector of each converted image of the first image converted by the first conversion unit; And a parameter updating unit that updates parameters of the second convolutional neural network.
また、第1の発明に係る学習装置において、前記第一変換部は、前記第二変換部で変換された第二の画像の各々の変換画像の各々を、前記第二画像集合に属する第二の画像の解像度に対応する解像度となるように再変換画像の各々として更に変換し、前記第二変換部は、前記第一変換部で変換された第一の画像の各々の変換画像の各々を、前記第一画像集合に属する第一の画像の解像度に対応する解像度となるように再変換画像の各々として更に変換し、前記パラメータ更新部は、前記第二変換部で更に変換された前記第一の画像の各々の再変換画像の各々と前記第一画像集合に属する前記第一の画像の各々との誤差、及び前記第一変換部で更に変換された前記第二の画像の各々の再変換画像の各々と前記第二画像集合に属する前記第二の画像の各々との誤差を更に用いて、前記第一の畳み込みニューラルネットワーク及び前記第二の畳み込みニューラルネットワークのパラメータを更新するようにしてもよい。 In the learning device according to the first aspect, the first conversion unit may convert each converted image of the second image converted by the second conversion unit to a second image belonging to the second image set. The second conversion unit further converts each of the converted images of the first image converted by the first conversion unit so as to have a resolution corresponding to the resolution of the first image. , Further converting each of the reconverted images so as to have a resolution corresponding to the resolution of the first image belonging to the first image set, and the parameter updating unit is further converted by the second conversion unit An error between each re-converted image of each of the one images and each of the first images belonging to the first image set, and each of the second images further converted by the first converter. Each of the transformed images and the second image belonging to the second image set Further using the error between each image, it may be updated parameters of the first convolution neural network and said second convolution neural network.
また、第1の発明に係る学習装置において、前記パラメータ更新部は、更に、前記第一の畳み込みニューラルネットワークと、前記第一変換部で変換された第一の画像の各々の変換画像の各々が、前記第一画像集合に属する前記第一の画像、及び前記第二画像集合に属する前記第二の画像の何れであるかを識別する識別用ニューラルネットワークとについて、前記第一の畳み込みニューラルネットワークと、前記識別用ニューラルネットワークとが、互いに競合することを表す損失関数を用いて、前記第一の畳み込みニューラルネットワーク及び前記識別用ニューラルネットワークのパラメータを更新するようにしてもよい。 In the learning device according to the first invention, the parameter update unit further includes the first convolutional neural network and the converted images of the first image converted by the first conversion unit. The first convolutional neural network for identifying the first image belonging to the first image set and the identifying neural network identifying the second image belonging to the second image set; The parameters of the first convolutional neural network and the identification neural network may be updated using a loss function indicating that the identification neural network competes with each other.
また、第2の発明に係る検索装置は、上記第1の発明に係る学習装置によってパラメータが学習された第一の畳み込みニューラルネットワークを用いて、任意の画像を前記第二画像集合に属する第二の画像の解像度に対応する解像度となるように変換する検索第一変換部と、前記任意の画像の変換画像と、第三の画像からなる第三画像集合の前記第三の画像の各々とから特徴量ベクトルを抽出する検索特徴量抽出部と、前記検索特徴量抽出部で抽出された前記変換画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の特徴量ベクトルとの組を用いた類似度に基づいて、前記任意の画像と前記第三の画像とを照合し、照合結果を出力する照合部と、を含んで構成されている。 Further, the search device according to the second invention uses the first convolution neural network whose parameters have been learned by the learning device according to the first invention, and assigns an arbitrary image to the second image set belonging to the second image set. A search first conversion unit that converts to a resolution corresponding to the resolution of the image, a converted image of the arbitrary image, and each of the third images of the third image set including the third image A search feature quantity extraction unit that extracts a feature quantity vector; a feature quantity vector of the converted image extracted by the search feature quantity extraction unit; and a feature quantity vector of each of the third images in the third image set A collation unit that collates the arbitrary image with the third image and outputs a collation result based on the similarity using the set.
また、第3の発明に係る検索装置は、上記第1の発明に係る学習装置によってパラメータが学習された前記第二の畳み込みニューラルネットワークを用いて、第三の画像からなる第三画像集合の前記第三の画像の各々を前記第一画像集合に属する第一の画像の解像度に対応する解像度となるように変換する検索第二変換部と、任意の画像と、前記第三画像集合の前記第三の画像の各々の変換画像とから特徴量ベクトルを抽出する検索特徴量抽出部と、前記検索特徴量抽出部で抽出された前記任意の画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の変換画像の特徴量ベクトルとの組を用いた類似度に基づいて、前記任意の画像と前記第三の画像とを照合し、照合結果を出力する照合部と、を含んで構成されている。 Further, a search device according to a third invention uses the second convolution neural network whose parameters have been learned by the learning device according to the first invention, and uses the second convolutional neural network for the third image set of a third image. A search second conversion unit that converts each of the third images to a resolution corresponding to the resolution of the first image belonging to the first image set, an arbitrary image, and the first image of the third image set A search feature quantity extraction unit that extracts a feature quantity vector from each converted image of the third image; a feature quantity vector of the arbitrary image extracted by the search feature quantity extraction unit; and the third image set of the third image set. A collation unit that collates the arbitrary image with the third image and outputs a collation result based on the similarity using a pair of feature vectors of the converted images of the three images. It consists of
また、第3の発明に係る検索装置において、前記第一の畳み込みニューラルネットワークを用いて、前記任意の画像を前記第二画像集合に属する第二の画像の解像度に対応する解像度となるように変換する検索第一変換部を更に含み、前記検索特徴量抽出部は、更に、前記任意の画像の変換画像と、前記第三画像集合の前記第三の画像の各々とから特徴量ベクトルを抽出し、前記照合部は、前記検索特徴量抽出部で抽出された前記任意の画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の変換画像の特徴量ベクトルとの組を用いた類似度、及び前記任意の画像の変換画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の特徴量ベクトルとの組を用いた類似度に基づいて、前記任意の画像と前記第三の画像とを照合し、照合結果を出力するようにしてもよい。 In the search device according to the third invention, the first convolutional neural network is used to convert the arbitrary image to have a resolution corresponding to the resolution of the second image belonging to the second image set. A search first conversion unit that further extracts a feature vector from the converted image of the arbitrary image and each of the third images of the third image set. The collation unit uses a set of a feature amount vector of the arbitrary image extracted by the search feature amount extraction unit and a feature amount vector of each converted image of the third image in the third image set. The arbitrary image based on the similarity using the combination of the feature amount vector of the converted image of the arbitrary image and the feature amount vector of each of the third images of the third image set. And the third image Collating, may output the verification result.
A learning method according to the fourth invention is executed by: a first conversion unit converting each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; a second conversion unit converting each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit extracting feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images produced by the first conversion unit, and each converted image of the second images produced by the second conversion unit; and a parameter update unit updating the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the error between the feature vector of each first image belonging to the first image set and the feature vector of each converted image of the second images produced by the second conversion unit, and the error between the feature vector of each second image belonging to the second image set and the feature vector of each converted image of the first images produced by the first conversion unit, the feature vectors being those extracted by the feature extraction unit.
A program according to the fifth invention causes a computer to function as each unit of the learning device according to the first invention or of the search device according to the second or third invention.
According to the learning device, method, and program of the present invention, the parameters of neural networks for accurately retrieving an object appearing in an image can be learned by: converting each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; converting each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; extracting feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images, and each converted image of the second images; and updating the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the errors between the feature vectors of the images and the feature vectors of the converted images.
According to the search device, method, and program of the present invention, an object appearing in an image can be retrieved accurately by converting the image with the first convolutional neural network using the learned parameters, converting each third image serving as a reference image with the second convolutional neural network using the learned parameters, extracting feature vectors from the converted images, computing similarities from pairs of the feature vectors, and outputting the matching result.
Embodiments of the present invention will now be described in detail with reference to the drawings.
The learning device and search device according to the present embodiment obtain search results with high accuracy even when the resolution of a query image showing a specific target object (hereinafter referred to as the specific target) diverges greatly from that of the reference images.
<Configuration of Learning Device According to Embodiment of the Present Invention>
Next, the configuration of the learning device according to the embodiment of the present invention will be described. The learning device 10 shown in FIG. 1 obtains search results with high accuracy even when the resolution of a query image showing a specific target diverges greatly from that of the reference images. The learning device 10 can be implemented as a computer that includes a CPU, RAM, and ROM storing a program for executing the learning processing routine described later together with various data. Functionally, as shown in FIG. 1, the learning device 10 includes a first conversion unit 21, a second conversion unit 22, a feature extraction unit 23, a parameter update unit 24, and a storage unit 29. In the following description, the image 5 corresponds to the query image, the third image set 6 corresponds to an image set made up of one or more reference images, the query image shows the specific target at low resolution, and the reference images show it at high resolution. Resolution here means the total number of pixels of an image.
The learning device 10 receives, as input, a first image set 3 made up of first images corresponding to one or more low-resolution images and a second image set 4 made up of second images corresponding to one or more high-resolution images, both stored in the database 2. Because the convolutional neural networks are trained against a discriminator that judges whether an input image is a converted image or a reference image, no image-level correspondence between the first image set 3 and the second image set 4 is required. The first images corresponding to low-resolution images are assumed to have been enlarged, for example by the bicubic method, so that their pixel counts match those of the second images corresponding to high-resolution images.
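As a concrete illustration, this bicubic pre-enlargement could be done as in the following sketch; the file path and target size are hypothetical, and only the use of bicubic resampling reflects the text above.

```python
from PIL import Image

def upscale_bicubic(path, target_size):
    """Enlarge a low-resolution image by bicubic interpolation so that its
    pixel count matches that of the high-resolution (second) images."""
    img = Image.open(path).convert("RGB")
    return img.resize(target_size, resample=Image.BICUBIC)

# Hypothetical usage: match a 256x256 pixel count assumed for the second image set.
enlarged = upscale_bicubic("first_image_set/000001.png", (256, 256))
```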
The learning device 10 exchanges information with the database 2 via communication means (not shown).
The database 2 can be implemented, for example, with a file system installed on a general-purpose computer. In the present embodiment, as an example, the database 2 stores the image data of the first image set 3 and the second image set 4 in advance. Each image is assumed to be given an identifier that uniquely identifies it, such as a serial-number ID (Identification) or a unique image file name, and the database 2 stores, for each image, the identifier of the image in association with its image data. Alternatively, the database 2 may be implemented and configured with an RDBMS (Relational Database Management System) or the like. The database 2 may also store other information as metadata, for example information expressing the content of an image (its title, summary text, or keywords) or information about the image format (data size, thumbnail size, and so on), although storing such information is not essential to practicing the present disclosure.
The database 2 may be provided either inside or outside the learning device 10, and any known communication means can be used. In the present embodiment, the database 2 is provided outside the learning device 10 and is communicably connected to the learning device 10 via a network such as the Internet using TCP/IP (Transmission Control Protocol/Internet Protocol) as the communication means.
Each unit of the learning device 10 and the database 2 may be configured on a computer or server equipped with an arithmetic processing unit such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) and storage devices such as RAM (Random Access Memory), ROM (Read Only Memory), and an HDD (Hard Disk Drive), with the processing of each unit executed by a program. The program may be stored in advance in the storage device of the learning device 10, provided on a recording medium such as a magnetic disk, optical disc, or semiconductor memory, or provided over a network. Of course, no component needs to be realized on a single computer or server; any component may be realized in distributed form on a plurality of computers connected by a network.
Next, the function of each unit of the learning device 10 in the present embodiment will be described.
The first conversion unit 21 converts each first image belonging to the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of the second images belonging to the second image set 4. The first conversion unit 21 also converts each converted image of the second images produced by the second conversion unit 22, by the first convolutional neural network, into a reconverted image whose resolution corresponds to that of the second images belonging to the second image set 4.
In the present embodiment, the conversion by the first conversion unit 21 is assumed to be from a low-resolution image to a high-resolution image. The first convolutional neural network used for the conversion is not limited as long as it performs convolution with a neural network. In the present embodiment, the conversion is performed by the nine-layer convolutional neural network (CNN) described in Non-Patent Document 3, which consists of stride-2 convolution layers for downsampling, residual blocks, and stride-1/2 convolution layers for upsampling.
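A sketch of how such a conversion network could be assembled in PyTorch is shown below; the channel widths and the number of residual blocks are assumptions, and the stride-1/2 upsampling layers are realized here with transposed convolutions. This is an illustrative sketch under those assumptions, not the exact network of Non-Patent Document 3.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection around the two conv layers

def make_generator(n_blocks=3):
    """Conversion network: stride-2 convolutions for downsampling,
    residual blocks, and stride-1/2 (transposed) convolutions for upsampling."""
    layers = [nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(inplace=True),
              nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
              nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
    layers += [ResidualBlock(256) for _ in range(n_blocks)]
    layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
               nn.ReLU(inplace=True),
               nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
               nn.ReLU(inplace=True),
               nn.Conv2d(64, 3, 7, padding=3), nn.Tanh()]
    return nn.Sequential(*layers)
```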
The second conversion unit 22 converts each second image belonging to the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images belonging to the first image set 3. The second conversion unit 22 also converts each converted image of the first images produced by the first conversion unit 21, by the second convolutional neural network, into a reconverted image whose resolution corresponds to that of the first images belonging to the first image set 3.
The feature extraction unit 23 extracts feature vectors from each first image belonging to the first image set 3, each second image belonging to the second image set 4, each converted image of the first images produced by the first conversion unit 21, and each converted image of the second images produced by the second conversion unit 22.
Any feature extraction process that can express an image as a vector of fixed dimension using a neural network may be used; for example, features can be extracted with the method described in Non-Patent Document 4. In that method, for the feature map that is fed into the fully connected layers when an image is input to VGG16 (a kind of CNN) or to the neural network ResNet101 (the size of the feature map being determined by its height, width, and number of channels), rectangles of various sizes are first defined, and the maximum of the values inside each rectangle is taken per channel, yielding vectors numbering (number of rectangles) x (number of channels). Normalizing this group of vectors, summing the values of the same channel, and normalizing again expresses one image as a feature vector whose dimension equals the number of channels. Any known normalization may be used; L2 normalization is preferable.
[Non-Patent Document 4] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, End-to-end learning of deep visual representations for image retrieval, IJCV, 2017.
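A minimal sketch of this regional max-pooling descriptor, assuming a VGG16 backbone from torchvision, is given below; the grid of 1x1, 2x2, and 3x3 regions is an assumption, and refinements of the cited method such as learned whitening are omitted.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

backbone = vgg16(weights="IMAGENET1K_V1").features.eval()  # convolutional layers only

def rmac_descriptor(image, scales=(1, 2, 3)):
    """image: float tensor (3, H, W), preprocessed as the backbone expects.
    Max-pool each region per channel, L2-normalize each region vector,
    sum over regions (same channels added), and L2-normalize again."""
    with torch.no_grad():
        fmap = backbone(image.unsqueeze(0))[0]          # (C, H, W) feature map
    C, H, W = fmap.shape
    region_vecs = []
    for s in scales:                                     # s x s grid of rectangles
        h, w = max(H // s, 1), max(W // s, 1)
        for i in range(s):
            for j in range(s):
                region = fmap[:, i * h:(i + 1) * h, j * w:(j + 1) * w]
                v = region.amax(dim=(1, 2))              # channel-wise maximum
                region_vecs.append(F.normalize(v, dim=0))
    desc = torch.stack(region_vecs).sum(dim=0)           # sum the same channels
    return F.normalize(desc, dim=0)                      # final L2 normalization
```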
The parameter update unit 24 updates the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the errors between feature vectors, the errors between the images before conversion and after reconversion, and the loss functions, and stores the parameters in the storage unit 29.
Specifically, the errors between feature vectors are the error between the feature vector of each first image belonging to the first image set 3 and the feature vector of each converted image of the second images produced by the second conversion unit 22, and the error between the feature vector of each second image belonging to the second image set 4 and the feature vector of each converted image of the first images produced by the first conversion unit 21.
The errors between the images before conversion and after reconversion are the error between each reconverted image of the first images further converted by the second conversion unit 22 and the corresponding first image belonging to the first image set 3, and the error between each reconverted image of the second images further converted by the first conversion unit 21 and the corresponding second image belonging to the second image set 4.
The loss functions are: for the first convolutional neural network and a discriminator neural network that identifies whether each converted image of the first images produced by the first conversion unit 21 is a first image belonging to the first image set 3 or a second image belonging to the second image set 4, a loss function expressing that the first convolutional neural network and the discriminator neural network compete with each other; and, for the second convolutional neural network and a discriminator neural network that identifies whether each converted image of the second images produced by the second conversion unit 22 is a first image belonging to the first image set 3 or a second image belonging to the second image set 4, a loss function expressing that the second convolutional neural network and the discriminator neural network compete with each other.
The parameter update unit 24 updates the parameters of the first and second convolutional neural networks on the basis of the above errors between feature vectors, errors between the images before conversion and after reconversion, and loss functions. Any known loss function may be used to compute the errors; for example, the squared error between the feature vectors before and after conversion can be used. In addition to the errors between feature vectors, the Adversarial Loss described in Non-Patent Document 3 and the image-to-image errors of the Cycle Consistency Loss may be added.
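The combined objective could be sketched as follows; G_lh and G_hl stand for the first and second convolutional neural networks, D_h and D_l for the discriminator networks, extract for the feature extractor, and the weights are assumptions. Since no correspondence between the two image sets is assumed, minibatch entries are paired arbitrarily here.

```python
import torch
import torch.nn.functional as F

def generator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l, extract,
                   lambda_adv=1.0, lambda_cyc=10.0):
    """Feature-vector squared error + Adversarial Loss + Cycle Consistency Loss.
    x_low / x_high: equal-sized minibatches from the first / second image sets."""
    fake_high = G_lh(x_low)    # first conversion unit: low -> high
    fake_low = G_hl(x_high)    # second conversion unit: high -> low

    # Squared error between feature vectors of images and converted images
    # in the same resolution domain.
    feat_loss = (F.mse_loss(extract(x_low), extract(fake_low)) +
                 F.mse_loss(extract(x_high), extract(fake_high)))

    # Adversarial Loss: converted images should be judged as genuine members
    # of the target image set by the discriminators.
    adv_loss = (F.binary_cross_entropy_with_logits(
                    D_h(fake_high), torch.ones_like(D_h(fake_high))) +
                F.binary_cross_entropy_with_logits(
                    D_l(fake_low), torch.ones_like(D_l(fake_low))))

    # Cycle Consistency Loss: reconverted images should match the originals.
    cyc_loss = F.l1_loss(G_hl(fake_high), x_low) + F.l1_loss(G_lh(fake_low), x_high)

    return feat_loss + lambda_adv * adv_loss + lambda_cyc * cyc_loss
```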
With the Adversarial Loss, the first convolutional neural network and the discriminator neural network are trained alternately using the loss function described above. The value of the loss function decreases as the converted images of the first images become harder to identify as first images, and the parameters of the first convolutional neural network and the discriminator neural network are updated so that the value of the loss function decreases; in other words, the parameters of the first network are learned so that the discriminator neural network can no longer tell the converted images apart. Similarly, the second convolutional neural network and the discriminator neural network are trained alternately using the loss function described above: the value of the loss function decreases as the converted images of the second images become harder to identify as second images, and the parameters of the second convolutional neural network and the discriminator neural network are updated so that the value decreases.
The Cycle Consistency Loss consists of the error between each reconverted image, obtained by further converting with the second conversion unit 22 each converted image of the first images belonging to the first image set 3 produced by the first conversion unit 21, and the corresponding first image before conversion, together with the error between each reconverted image, obtained by further converting with the first conversion unit 21 each converted image of the second images belonging to the second image set 4 produced by the second conversion unit 22, and the corresponding second image before conversion.
The method of updating the parameters is not limited; for example, the parameters may be updated by error backpropagation, a method that corrects the parameters of each neuron, from the output of the neural network toward its input, so that the local error becomes smaller. In this update, updates using the errors between feature vectors and the errors between the images before conversion and after reconversion may be alternated with updates using the loss functions.
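An alternating update loop along these lines might look like the following sketch; the optimizers, learning rate, strict one-to-one alternation, and the names G_lh, G_hl, D_h, D_l, extract, and loader are all assumptions, and generator_loss is the combined objective sketched above.

```python
import itertools
import torch
import torch.nn.functional as F

def discriminator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l):
    """Discriminators should score real images as real and converted images as fake."""
    fake_high, fake_low = G_lh(x_low).detach(), G_hl(x_high).detach()
    loss = 0.0
    for d, real, fake in ((D_h, x_high, fake_high), (D_l, x_low, fake_low)):
        loss = loss + F.binary_cross_entropy_with_logits(d(real), torch.ones_like(d(real)))
        loss = loss + F.binary_cross_entropy_with_logits(d(fake), torch.zeros_like(d(fake)))
    return loss

# Hypothetical modules: G_lh, G_hl (conversion networks), D_h, D_l (discriminators).
opt_g = torch.optim.Adam(itertools.chain(G_lh.parameters(), G_hl.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(itertools.chain(D_h.parameters(), D_l.parameters()), lr=2e-4)

for x_low, x_high in loader:  # unpaired minibatches from the two image sets
    # Update the conversion networks by backpropagating the combined loss.
    opt_g.zero_grad()
    generator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l, extract).backward()
    opt_g.step()

    # Alternately update the discriminator networks.
    opt_d.zero_grad()
    discriminator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l).backward()
    opt_d.step()
```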
<Configuration of Search Device According to Embodiment of the Present Invention>
Next, the configuration of the search device according to the embodiment of the present invention will be described.
In the present embodiment, the search device 11 converts a low-resolution query image to high resolution, converts the high-resolution reference images to low resolution, or performs both conversions, and then extracts and matches feature vectors, enabling highly accurate retrieval between images whose resolutions diverge.
The search device 11 shown in FIG. 2 can be implemented as a computer that includes a CPU, RAM, and ROM storing a program for executing the search processing routine described later together with various data. Functionally, as shown in FIG. 2, the search device 11 includes a search first conversion unit 31, a search second conversion unit 32, a search feature extraction unit 33, a matching unit 35, and a storage unit 39.
The storage unit 39 stores the parameters of the first convolutional neural network and the second convolutional neural network learned by the learning device 10.
The search first conversion unit 31 converts the image 5 to be matched, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution that is higher than that of the first images and corresponds to the resolution of the second images belonging to the second image set 4.
The search second conversion unit 32 converts each third image belonging to the third image set 6, made up of third images serving as reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images belonging to the first image set 3. The third image set 6 may be the same as the second image set 4.
The search feature extraction unit 33 extracts feature vectors from the image 5, the converted image of the image 5 produced by the search first conversion unit 31, each third image belonging to the third image set 6, and each converted image of the third images belonging to the third image set 6.
The matching unit 35 computes similarities using the pairs of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each third image converted by the search second conversion unit 32, and the pairs of the feature vector of the converted image 5 and the feature vector of each third image belonging to the third image set 6, and outputs the matching result as the search result 7. The matching may use at least one of the two kinds of pairs.
The matching may be performed, for example, by computing the inner product between feature vectors, taking that value as the similarity between the images, and outputting the top N images of the third image set 6 with the highest similarity as the search result 7, where N is an integer between 1 and the number of images in the third image set 6.
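In NumPy terms, this inner-product ranking could look like the following sketch, where the descriptors are assumed to be L2-normalized already and all variable names are illustrative.

```python
import numpy as np

def top_n_matches(query_vec, reference_vecs, n):
    """Inner product as similarity; return indices and scores of the top-N references.
    query_vec: (D,) descriptor; reference_vecs: (M, D) descriptors."""
    sims = reference_vecs @ query_vec           # inner product with every reference
    order = np.argsort(-sims)[:n]               # highest similarity first
    return order, sims[order]

# Hypothetical usage with 512-dimensional descriptors and 1000 references.
rng = np.random.default_rng(0)
refs = rng.normal(size=(1000, 512)).astype(np.float32)
refs /= np.linalg.norm(refs, axis=1, keepdims=True)
idx, scores = top_n_matches(refs[42], refs, n=5)   # the query itself ranks first
```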
The feature vectors of the image 5 input to the matching unit 35 need not be limited to one per image 5; a plurality can be used. For example, the feature vector extracted from the image 5 and the feature vector extracted from the image 5 converted by the search first conversion unit 31 may serve as the search query, and the search may be performed against the feature vectors extracted from each third image belonging to the third image set 6 and from each third image converted by the search second conversion unit 32. In this case, since two feature vectors are input per image, the two feature vectors of each image are summed and normalized before the similarity is computed. Any known normalization may be used; L2 normalization is preferable.
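This combination step could be sketched as follows, again with illustrative names:

```python
import numpy as np

def combine_descriptors(v_original, v_converted):
    """Sum the descriptor of an image and that of its converted version,
    then L2-normalize, yielding one query (or reference) vector per image."""
    v = v_original + v_converted
    return v / np.linalg.norm(v)
```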
<Operation of Learning Device According to Embodiment of the Present Invention>
Next, the operation of the learning device 10 according to the embodiment of the present invention will be described. The learning device 10 executes the learning processing routine shown in FIG. 3.
First, in step S201, the first conversion unit 21 converts each first image belonging to the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of the second images belonging to the second image set 4.
Next, in step S202, the second conversion unit 22 converts each second image belonging to the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images belonging to the first image set 3.
In step S203, the feature extraction unit 23 extracts feature vectors from each first image belonging to the first image set 3 and from each converted image of the second images converted in step S202.
In step S204, the feature extraction unit 23 extracts feature vectors from each second image belonging to the second image set 4 and from each converted image of the first images converted in step S201.
In step S205, the first conversion unit 21 converts each converted image of the second images converted in step S202, by the first convolutional neural network, into a reconverted image whose resolution corresponds to that of the second images belonging to the second image set 4.
In step S206, the second conversion unit 22 converts each converted image of the first images converted in step S201, by the second convolutional neural network, into a reconverted image whose resolution corresponds to that of the first images belonging to the first image set 3.
In step S207, the parameter update unit 24 computes, as the errors between the images before conversion and after reconversion, the error between each reconverted image of the first images further converted by the second conversion unit 22 and the corresponding first image belonging to the first image set 3, and the error between each reconverted image of the second images further converted by the first conversion unit 21 and the corresponding second image belonging to the second image set 4; computes, as the errors between feature vectors, the error between the feature vector of each first image belonging to the first image set 3 and the feature vector of each converted image of the second images produced by the second conversion unit 22, and the error between the feature vector of each second image belonging to the second image set 4 and the feature vector of each converted image of the first images produced by the first conversion unit 21; and then updates the parameters of the first and second convolutional neural networks on the basis of the computed errors and the loss functions expressing that the convolutional neural networks and the discriminator neural networks compete with each other, storing the parameters in the storage unit 29.
As described above, the learning device according to the embodiment of the present invention can learn the parameters of neural networks for accurately retrieving an object appearing in an image by: converting each first image belonging to the first image set 3 of a predetermined resolution, by the first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of the second images belonging to the second image set 4; converting each second image belonging to the second image set 4, by the second convolutional neural network, to a resolution corresponding to that of the first images belonging to the first image set 3; extracting feature vectors from each first image, each second image, each converted image of the first images, and each converted image of the second images; and updating the parameters of the first and second convolutional neural networks on the basis of the errors between the feature vectors of the images and the feature vectors of the converted images, the errors between each image and its reconverted image, and the loss functions expressing that the convolutional neural networks and the discriminator neural networks compete with each other.
<Operation of Search Device According to Embodiment of the Present Invention>
Next, the operation of the search device 11 according to the embodiment of the present invention will be described. The search device 11 executes the search processing routine shown in FIG. 4.
First, in step S301, the search first conversion unit 31 converts the image 5 to be matched, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution that is higher than that of the first images and corresponds to the resolution of the second images belonging to the second image set 4.
Next, in step S302, the search second conversion unit 32 converts each third image belonging to the third image set 6 of reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images belonging to the first image set 3.
In step S303, the search feature extraction unit 33 extracts feature vectors from the image 5, the converted image of the image 5 produced by the search first conversion unit 31, each third image belonging to the third image set 6, and each converted image of the third images belonging to the third image set 6.
In step S304, the matching unit 35 computes similarities using the pairs of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each third image converted by the search second conversion unit 32, and the pairs of the feature vector of the converted image 5 and the feature vector of each third image belonging to the third image set 6, and outputs the matching result as the search result 7.
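Putting steps S301 to S304 together, one search pass could be sketched as below; G_lh and G_hl are the trained conversion networks, and rmac_descriptor, combine_descriptors, and top_n_matches are the helpers sketched earlier. All names are illustrative, not part of the original text.

```python
import numpy as np
import torch

def search(query_img, reference_imgs, G_lh, G_hl, n=10):
    """S301: convert the query to high resolution; S302: convert each reference
    to low resolution; S303: extract descriptors from the originals and the
    converted images; S304: rank references by inner-product similarity."""
    with torch.no_grad():
        q_conv = G_lh(query_img.unsqueeze(0))[0]                     # S301
        q_vec = combine_descriptors(rmac_descriptor(query_img).numpy(),
                                    rmac_descriptor(q_conv).numpy())
        ref_vecs = []
        for r in reference_imgs:                                     # S302, S303
            r_conv = G_hl(r.unsqueeze(0))[0]
            ref_vecs.append(combine_descriptors(rmac_descriptor(r).numpy(),
                                                rmac_descriptor(r_conv).numpy()))
    return top_n_matches(q_vec, np.stack(ref_vecs), n)               # S304
```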
As described above, the search device according to the embodiment of the present invention can retrieve an object appearing in an image with high accuracy by converting the image with the first convolutional neural network using the learned parameters, converting each third image serving as a reference image with the second convolutional neural network using the learned parameters, extracting feature vectors from the converted images, computing similarities from pairs of the feature vectors, and outputting the matching result.
The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.
Although the embodiment of the present invention has been described above with reference to the drawings, the embodiment is merely an illustration of the present invention, and the present invention is not limited to it. Components may therefore be added, omitted, replaced, or otherwise modified without departing from the technical idea and scope of the present invention.
For example, a database of the reference images (the third image set 6) may be built by performing the image conversion by the second conversion unit 22 and the feature vector extraction by the feature extraction unit 23 in advance and storing the results; the query image (the image 5) is then received from outside at query time and processed, and the matching unit 35 obtains the feature vectors of the reference images from the database and computes the similarities.
It is also not always necessary to convert both the image 5 and the third image set 6. For example, matching may be performed between the feature vectors extracted from each converted image of the third images, obtained by converting the third image set 6 with the second conversion unit 22, and the feature vector extracted from the image 5 without conversion. In this case, the search device may be configured to include a search second conversion unit, a search feature extraction unit, and a matching unit. This configuration shortens the time required from receiving the image 5 to obtaining the search result 7.
Alternatively, for example, matching may be performed between the feature vectors extracted from each third image belonging to the third image set 6 and the feature vector of the converted image obtained by converting the image 5. In this case, the search device may be configured to include a search first conversion unit, a search feature extraction unit, and a matching unit. This configuration likewise shortens the time required from receiving the image 5 to obtaining the search result 7.
3 First image set
4 Second image set
5 Image
6 Third image set
7 Search result
10 Learning device
11 Search device
21 First conversion unit
22 Second conversion unit
23 Feature extraction unit
24 Parameter update unit
29, 39 Storage unit
31 Search first conversion unit
32 Search second conversion unit
33 Search feature extraction unit
35 Matching unit
Claims (8)

1. A learning device comprising: a first conversion unit that converts each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; a second conversion unit that converts each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit that extracts feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images produced by the first conversion unit, and each converted image of the second images produced by the second conversion unit; and a parameter update unit that updates the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the error between the feature vector of each first image belonging to the first image set and the feature vector of each converted image of the second images produced by the second conversion unit, and the error between the feature vector of each second image belonging to the second image set and the feature vector of each converted image of the first images produced by the first conversion unit, the feature vectors being those extracted by the feature extraction unit.

2. The learning device according to claim 1, wherein the first conversion unit further converts each converted image of the second images produced by the second conversion unit into a reconverted image whose resolution corresponds to that of the second images belonging to the second image set; the second conversion unit further converts each converted image of the first images produced by the first conversion unit into a reconverted image whose resolution corresponds to that of the first images belonging to the first image set; and the parameter update unit updates the parameters of the first convolutional neural network and the second convolutional neural network by further using the error between each reconverted image of the first images further converted by the second conversion unit and the corresponding first image belonging to the first image set, and the error between each reconverted image of the second images further converted by the first conversion unit and the corresponding second image belonging to the second image set.

3. The learning device according to claim 1 or 2, wherein the parameter update unit further updates the parameters of the first convolutional neural network and a discriminator neural network, which identifies whether each converted image of the first images produced by the first conversion unit is a first image belonging to the first image set or a second image belonging to the second image set, using a loss function expressing that the first convolutional neural network and the discriminator neural network compete with each other, and updates the parameters of the second convolutional neural network and the discriminator neural network, which identifies whether each converted image of the second images produced by the second conversion unit is a first image belonging to the first image set or a second image belonging to the second image set, using a loss function expressing that the second convolutional neural network and the discriminator neural network compete with each other.

4. A search device comprising: a search first conversion unit that converts an arbitrary image, using the first convolutional neural network whose parameters have been learned by the learning device according to any one of claims 1 to 3, to a resolution corresponding to the resolution of the second images belonging to the second image set; a search feature extraction unit that extracts feature vectors from the converted image of the arbitrary image and from each third image of a third image set made up of third images; and a matching unit that matches the arbitrary image against the third images on the basis of similarities computed from pairs of the feature vector of the converted image extracted by the search feature extraction unit and the feature vector of each third image of the third image set, and outputs a matching result.

5. A search device comprising: a search second conversion unit that converts each third image of a third image set made up of third images, using the second convolutional neural network whose parameters have been learned by the learning device according to any one of claims 1 to 3, to a resolution corresponding to the resolution of the first images belonging to the first image set; a search feature extraction unit that extracts feature vectors from an arbitrary image and from each converted image of the third images of the third image set; and a matching unit that matches the arbitrary image against the third images on the basis of similarities computed from pairs of the feature vector of the arbitrary image extracted by the search feature extraction unit and the feature vector of each converted image of the third images, and outputs a matching result.

6. The search device according to claim 5, further comprising a search first conversion unit that converts the arbitrary image, using the first convolutional neural network, to a resolution corresponding to the resolution of the second images belonging to the second image set, wherein the search feature extraction unit further extracts feature vectors from the converted image of the arbitrary image and from each third image of the third image set, and the matching unit matches the arbitrary image against the third images, and outputs a matching result, on the basis of the similarities computed from pairs of the feature vector of the arbitrary image and the feature vector of each converted image of the third images, and the similarities computed from pairs of the feature vector of the converted image of the arbitrary image and the feature vector of each third image of the third image set.

7. A learning method executed by: a first conversion unit converting each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; a second conversion unit converting each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit extracting feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images produced by the first conversion unit, and each converted image of the second images produced by the second conversion unit; and a parameter update unit updating the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the error between the feature vector of each first image belonging to the first image set and the feature vector of each converted image of the second images produced by the second conversion unit, and the error between the feature vector of each second image belonging to the second image set and the feature vector of each converted image of the first images produced by the first conversion unit, the feature vectors being those extracted by the feature extraction unit.

8. A program for causing a computer to function as each unit of the learning device according to any one of claims 1 to 3 or of the search device according to any one of claims 4 to 6.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2018106235A JP2019211912A (en) | 2018-06-01 | 2018-06-01 | Learning device, search device, method, and program |
| JP2018-106235 | 2018-06-01 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019230665A1 true WO2019230665A1 (en) | 2019-12-05 |
Family
ID=68698142
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2019/020947 Ceased WO2019230665A1 (en) | 2018-06-01 | 2019-05-27 | Learning device, search device, method, and program |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP2019211912A (en) |
| WO (1) | WO2019230665A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025041222A1 (en) * | 2023-08-21 | 2025-02-27 | Nec Corporation | Image matching apparatus, image matching method, training apparatus, training method, and non-transitory computer-readable medium |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3901902B1 (en) * | 2020-04-20 | 2025-05-07 | FEI Company | Method implemented by a data processing apparatus, and charged particle beam device for inspecting a specimen using such a method |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11212990A (en) * | 1998-01-26 | 1999-08-06 | Toray Ind Inc | Image retrieving device, image retrieving display method and production of product |
| JP2000155833A (en) * | 1998-11-19 | 2000-06-06 | Matsushita Electric Ind Co Ltd | Image recognition device |
| JP6320649B1 (en) * | 2017-03-31 | 2018-05-09 | 三菱電機株式会社 | Machine learning device and image recognition device |
- 2018-06-01: JP JP2018106235A patent/JP2019211912A/en active Pending
- 2019-05-27: WO PCT/JP2019/020947 patent/WO2019230665A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JP2019211912A (en) | 2019-12-12 |
Similar Documents
| Publication | Title |
|---|---|
| US11928790B2 | Object recognition device, object recognition learning device, method, and program |
| JP5926291B2 | Method and apparatus for identifying similar images |
| US8135239B2 | Display control apparatus, display control method, computer program, and recording medium |
| JP6431302B2 | Image processing apparatus, image processing method, and program |
| US20170060867A1 | Video and image match searching |
| US20070195344A1 | System, apparatus, method, program and recording medium for processing image |
| JP6211407B2 | Image search system, image search device, search server device, image search method, and image search program |
| JP2023520625A | IMAGE FEATURE MATCHING METHOD AND RELATED DEVICE, DEVICE AND STORAGE MEDIUM |
| US11714921B2 | Image processing method with ash code on local feature vectors, image processing device and storage medium |
| JP7192990B2 | Learning device, retrieval device, learning method, retrieval method, learning program, and retrieval program |
| WO2019230666A1 | Feature amount extraction device, method, and program |
| KR101917369B1 | Method and apparatus for retrieving image using convolution neural network |
| US8385656B2 | Image processing apparatus, image processing method and program |
| CN113920415A | Scene recognition method, device, terminal and medium |
| CN111177436B | Face feature retrieval method, device and equipment |
| WO2019230665A1 | Learning device, search device, method, and program |
| CN113128278B | Image recognition method and device |
| CN116467463A | Multimodal Knowledge Graph Representation Learning System and Products Based on Subgraph Learning |
| WO2017010514A1 | Image retrieval device and method, photograph time estimation device and method, iterative structure extraction device and method, and program |
| CN115272768A | Content identification method, device, equipment, storage medium and computer program product |
| CN114297154A | Vehicle data processing method, terminal and computer storage medium |
| JP6482505B2 | Verification apparatus, method, and program |
| CN105488099A | Vehicle retrieval method based on similarity learning |
| CN117496187A | A light field image saliency detection method |
| JP2018194956A | Image recognition apparatus, method, and program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19811560; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19811560; Country of ref document: EP; Kind code of ref document: A1 |