WO2019230665A1 - Learning device, search device, method, and program - Google Patents
Learning device, search device, method, and program
- Publication number
- WO2019230665A1 (PCT/JP2019/020947)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- converted
- images
- neural network
- belonging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/56—Information retrieval; Database structures therefor; File system structures therefor of still image data having vectorial format
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
Definitions
- The present invention relates to a learning device, a search device, a method, and a program, and more particularly to a learning device, a search device, a method, and a program for searching for an object that appears in an image.
- Patent Document 1 discloses a method for searching for an object using many feature vectors extracted from an image. A large number of small characteristic regions are detected in the image, and a feature vector is extracted from each region. Next, for the small regions contained in two different images, the Euclidean distances between feature vectors are calculated and the number of corresponding sub-regions is counted. Because similarity grows with this count, a large count indicates that the same object appears in the two images.
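- As a rough illustration of this counting scheme (a sketch, not the patent's own implementation: region detection and descriptor extraction are assumed to have happened elsewhere, and the distance threshold is arbitrary):

```python
import numpy as np

def count_matches(desc_a, desc_b, threshold=0.8):
    """Count corresponding micro-regions between two images.

    desc_a, desc_b: arrays of shape (n, d), one descriptor per
    detected region. A pair corresponds when the Euclidean
    distance between the descriptors falls below the threshold.
    """
    # Pairwise Euclidean distances between all descriptor pairs.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    # Each region in image A corresponds to at most its nearest neighbour in B.
    return int(np.sum(dists.min(axis=1) < threshold))

# The larger the count, the more likely the same object appears in both images.
similarity = count_matches(np.random.rand(50, 128), np.random.rand(60, 128))
```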
- Here, resolution refers to the number of pixels.
- A resolution conversion technique that prevents the resolution gap between a query image and a reference image, and thereby enables accurate search, is therefore desired.
- Non-Patent Document 2 discloses a learning-based super-resolution method built on image pairs: a low-resolution image is enlarged by the Bicubic method and then converted with a CNN to obtain a high-resolution image. By training the CNN in advance on pairs of low- and high-resolution images, high-frequency components absent from the low-resolution image can be restored with high accuracy. Extracting feature vectors after converting the low-resolution image to high resolution can therefore improve search accuracy.
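- A minimal sketch of this kind of pair-based super-resolution, assuming a PyTorch setting; the three-layer architecture and filter sizes are illustrative assumptions, not the configuration of Non-Patent Document 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairTrainedSR(nn.Module):
    """CNN that refines a Bicubic-enlarged image; it would be
    trained on (low-resolution, high-resolution) image pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 5, padding=2),
        )

    def forward(self, low_res, scale=4):
        # Enlarge by the Bicubic method first, then let the CNN
        # restore the missing high-frequency components.
        up = F.interpolate(low_res, scale_factor=scale, mode='bicubic',
                           align_corners=False)
        return up + self.net(up)  # residual correction of the enlargement

model = PairTrainedSR()
sr = model(torch.randn(1, 3, 32, 32))  # (1, 3, 32, 32) -> (1, 3, 128, 128)
```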
- Non-Patent Document 3 discloses a learning-based image conversion method built on a pair of image sets; the conversion between the two sets is acquired through learning. More specifically, the system comprises a converter that transforms image set X into image set Y, a discriminator that distinguishes converted images from images belonging to set Y, a converter that transforms set Y into set X, and a discriminator that distinguishes converted images from images belonging to set X. By converting an image of set X into set Y and using the reconstruction error when it is converted back to X, conversion between the two image sets can be realized even without one-to-one corresponding image pairs. For example, by taking one set to be low-resolution images and the other high-resolution images, a resolution converter can be acquired and the discrepancy between query and reference images prevented.
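- The reconstruction error at the heart of this scheme can be sketched as follows (the converters here are stand-ins so the sketch runs; in Non-Patent Document 3 they are learned CNNs accompanied by discriminators):

```python
import torch
import torch.nn as nn

def cycle_reconstruction_loss(G_xy, G_yx, x, y):
    """X -> Y -> X and Y -> X -> Y should reproduce the originals,
    which is what removes the need for one-to-one image pairs."""
    l1 = nn.L1Loss()
    return l1(G_yx(G_xy(x)), x) + l1(G_xy(G_yx(y)), y)

# Stand-in converters; the real ones are much deeper CNNs.
G_xy = nn.Conv2d(3, 3, 3, padding=1)
G_yx = nn.Conv2d(3, 3, 3, padding=1)
loss = cycle_reconstruction_loss(G_xy, G_yx,
                                 torch.randn(2, 3, 64, 64),
                                 torch.randn(2, 3, 64, 64))
```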
- For methods that learn a conversion from image pairs, as in Non-Patent Document 2, preparing the pairs is itself a problem.
- In Non-Patent Document 2, training pairs consist of a high-resolution image and the image obtained by reducing it with the Bicubic method.
- In that case, search accuracy degrades because an image down-converted by the Bicubic method differs from how an object actually appears at low resolution in a captured image.
- Moreover, it is difficult to photograph the same object at both high and low resolution under identical conditions to prepare image pairs of differing resolutions, so collecting a large number of pairs as training data is inefficient.
- A method that learns a conversion from a pair of image sets, as in Non-Patent Document 3, converts an image belonging to one set into an image resembling the other set, and is trained so that converting back again returns the original image. Each individual conversion may therefore produce an image whose appearance deviates substantially, which can seriously degrade search accuracy.
- A further problem, common to both methods above, is that neither considers the feature vector used for object search.
- In Non-Patent Document 2, the converted low-resolution image is trained to be close to the high-resolution image, but their feature vectors do not necessarily match.
- Likewise, in Non-Patent Document 3, the feature vectors before and after conversion do not necessarily match.
- The present invention was made to solve the above problems, and aims to provide a learning device, method, and program capable of learning neural network parameters for accurately searching for an object appearing in an image.
- To achieve the above object, a learning device according to a first aspect comprises: a first conversion unit that converts each first image belonging to a first image set of images having a predetermined resolution, by a first convolutional neural network, to a resolution higher than the first image and corresponding to the resolution of the second images belonging to a second image set; a second conversion unit that converts each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit that extracts a feature vector from each first image, each second image, each converted image produced by the first conversion unit, and each converted image produced by the second conversion unit; and a parameter update unit that updates the parameters of the first and second convolutional neural networks based on the error between the feature vectors of the first images and those of the converted second images, and the error between the feature vectors of the second images and those of the converted first images.
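- The distinguishing ingredient of this aspect is the feature-vector error. A minimal sketch of one reading of it, following the embodiment's later description of a squared error between feature vectors before and after conversion (all function names are placeholders, not the patent's reference signs):

```python
import torch
import torch.nn.functional as F

def feature_vector_error(f_ext, g_up, g_down, x_low, y_high):
    """f_ext : feature extractor, image batch -> (B, D) vectors;
    it must yield fixed-dimensional vectors regardless of image size.
    g_up  : first converter,  low resolution  -> high resolution.
    g_down: second converter, high resolution -> low resolution.
    """
    # Up-converting a low-resolution image should preserve its
    # object-search feature vector...
    err_low = F.mse_loss(f_ext(g_up(x_low)), f_ext(x_low))
    # ...and likewise for down-converting a high-resolution image.
    err_high = F.mse_loss(f_ext(g_down(y_high)), f_ext(y_high))
    return err_low + err_high
```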
- In the learning device according to the first aspect, the first conversion unit may further convert each converted second image into a re-converted image whose resolution corresponds to the second images of the second image set, and the second conversion unit may further convert each converted first image into a re-converted image whose resolution corresponds to the first images of the first image set; the parameter update unit may then additionally use the error between each re-converted first image and the corresponding original first image, and the error between each re-converted second image and the corresponding original second image, when updating the parameters of the two convolutional neural networks.
- In the learning device according to the first aspect, the parameter update unit may further update the parameters of the first convolutional neural network and of an identification neural network that identifies whether each image converted by the first conversion unit is a first image of the first image set or a second image of the second image set, using a loss function expressing that the first convolutional neural network and the identification neural network compete with each other.
- A search device according to a second aspect comprises: a search first conversion unit that, using the first convolutional neural network whose parameters were learned by the learning device of the first aspect, converts an arbitrary image to a resolution corresponding to the second images of the second image set; a search feature extraction unit that extracts feature vectors from the converted arbitrary image and from each third image of a third image set; and a collation unit that collates the arbitrary image against the third images based on the similarity between the feature vector of the converted image and the feature vector of each third image, and outputs the collation result.
- A search device according to a third aspect comprises: a search second conversion unit that, using the second convolutional neural network whose parameters were learned by the learning device of the first aspect, converts each third image of a third image set to a resolution corresponding to the first images of the first image set; a search feature extraction unit that extracts feature vectors from an arbitrary image and from each converted third image; and a collation unit that collates the arbitrary image against the third images based on the similarity between the feature vector of the arbitrary image and the feature vector of each converted third image, and outputs the collation result.
- The search device according to the third aspect may further include a search first conversion unit that, using the first convolutional neural network, converts the arbitrary image to a resolution corresponding to the second images of the second image set; the search feature extraction unit then also extracts feature vectors from the converted arbitrary image and from each third image, and the collation unit collates the arbitrary image against the third images using both the similarity between the feature vector of the arbitrary image and the feature vectors of the converted third images and the similarity between the feature vector of the converted arbitrary image and the feature vectors of the third images, and outputs the collation result.
- A learning method according to a fourth aspect comprises the steps of: a first conversion unit converting each first image belonging to a first image set of images having a predetermined resolution, by a first convolutional neural network, to a resolution higher than the first image and corresponding to the second images of a second image set; a second conversion unit converting each second image of the second image set, by a second convolutional neural network, to a resolution corresponding to the first images of the first image set; a feature extraction unit extracting a feature vector from each first image, each second image, each converted first image, and each converted second image; and a parameter update unit updating the parameters of the first and second convolutional neural networks based on the error between the feature vectors of the first images and those of the converted second images and the error between the feature vectors of the second images and those of the converted first images.
- A program according to a fifth aspect causes a computer to function as each unit of the learning device according to the first aspect or the search device according to the second or third aspect.
- According to the learning device, method, and program described above, each first image of a first image set having a predetermined resolution is converted by a first convolutional neural network to a resolution higher than the first image and corresponding to the second images of a second image set; each second image is converted by a second convolutional neural network to a resolution corresponding to the first images; feature vectors are extracted from each original and converted image; and the parameters of both networks are updated based on the errors between the feature vectors, so that neural network parameters for accurately searching for an object appearing in an image can be learned.
- According to the search device, an image is converted by the first convolutional neural network using the learned parameters, each third image serving as a reference image is converted by the second convolutional neural network using the learned parameters, feature vectors are extracted from the converted images, similarity over the pairs of feature vectors is collated, and the collation result is output, so that an object appearing in an image can be searched with high accuracy.
- The learning device and the search device according to the present embodiment obtain accurate search results even when the resolution of a query image showing an object to be identified (hereinafter, the specified target) deviates greatly from that of the reference images.
- The learning device 10 shown in FIG. 1 is a learning device for obtaining accurate search results even under such a resolution gap.
- The learning device 10 can be configured as a computer including a CPU, a RAM, and a ROM storing a program for executing the learning processing routine described later, together with various data.
- As shown in FIG. 1, the learning device 10 includes a first conversion unit 21, a second conversion unit 22, a feature extraction unit 23, a parameter update unit 24, and a storage unit 29.
- The image 5 corresponds to the query image, and the third image set 6 corresponds to a set containing one or more reference images. Here the query image has low resolution and the reference images have high resolution, where resolution denotes the total number of pixels of an image.
- The learning device 10 receives as input a first image set 3, consisting of first images corresponding to one or more low-resolution images stored in the database 2, and a second image set 4, consisting of second images corresponding to one or more high-resolution images. Because training uses a neural network that judges whether an input image is a converted image or a genuine one, the first image set 3 and the second image set 4 need not correspond to each other.
- Each first image, corresponding to a low-resolution image, is enlarged by the Bicubic method or the like so as to have the same number of pixels as the second images, which correspond to high-resolution images.
- The learning device 10 and the database 2 exchange information via communication means (not shown).
- The database 2 can be implemented, for example, as a file system on a general-purpose computer. Image data of the first image set 3 and the second image set 4 is stored in the database 2 in advance. Each image is given an identifier that uniquely identifies it, such as a serial-number ID (Identification) or a unique image file name, and the database 2 stores, for each image, the identifier and the image data in association with each other.
- The database 2 may likewise be implemented with an RDBMS (Relational Database Management System) or the like. The stored information may also include metadata, for example information representing the content of an image (title, summary text, keywords, and the like) or information about the image format (data size, thumbnail size, and the like), but storing such information is not essential to implementing the present disclosure.
- The database 2 may be provided either inside or outside the learning device 10, and any known communication means can be used. In this embodiment, the database 2 is provided outside the learning device 10 and is connected so that it can communicate with the learning device 10 over a network such as the Internet using TCP/IP (Transmission Control Protocol/Internet Protocol).
- Each unit of the learning device 10 and the database 2 may be configured as a computer or server equipped with arithmetic processing units such as a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit) and storage devices such as a RAM (Random Access Memory), a ROM (Read Only Memory), and an HDD (Hard Disk Drive), with the processing of each unit executed by a program. The program may be stored in advance in the storage device of the learning device 10, stored on a recording medium such as a magnetic disk, optical disc, or semiconductor memory, or provided through a network. The components need not be realized by a single computer or server; they may be distributed across a plurality of computers connected by a network.
- The first conversion unit 21 converts each first image of the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution higher than the first image and corresponding to that of the second images of the second image set 4.
- The first conversion unit 21 also converts each converted second image produced by the second conversion unit 22, by the first convolutional neural network, into a re-converted image whose resolution corresponds to that of the second images of the second image set 4.
- In other words, the conversion performed by the first conversion unit 21 is a conversion from a low-resolution image to a high-resolution image.
- The first convolutional neural network used for image conversion is not limited, so long as it performs the conversion by convolution with a neural network.
- In the present embodiment, conversion is performed by a 9-layer convolutional neural network (CNN) of the kind described in Non-Patent Document 3, consisting of stride-2 convolution layers that perform downsampling, residual blocks, and stride-1/2 convolution layers that perform upsampling.
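- A sketch of such a converter, assuming PyTorch; channel widths, the residual-block count, and the use of transposed convolutions for the stride-1/2 layers are assumptions rather than the patent's exact specification:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection of the residual block

class Converter(nn.Module):
    """Stride-2 convolutions for downsampling, residual blocks,
    then stride-1/2 (transposed) convolutions for upsampling."""
    def __init__(self, n_res=3):
        super().__init__()
        layers = [nn.Conv2d(3, 32, 7, padding=3), nn.ReLU(inplace=True),
                  nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                  nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        layers += [ResidualBlock(128) for _ in range(n_res)]
        layers += [nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1,
                                      output_padding=1), nn.ReLU(inplace=True),
                   nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1,
                                      output_padding=1), nn.ReLU(inplace=True),
                   nn.Conv2d(32, 3, 7, padding=3)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

out = Converter()(torch.randn(1, 3, 64, 64))  # same spatial size in and out
```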
- The second conversion unit 22 converts each second image of the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images of the first image set 3. The second conversion unit 22 also converts each converted first image produced by the first conversion unit 21, by the second convolutional neural network, into a re-converted image whose resolution corresponds to that of the first images of the first image set 3.
- The feature extraction unit 23 extracts a feature vector from each first image of the first image set 3, each second image of the second image set 4, each converted first image produced by the first conversion unit 21, and each converted second image produced by the second conversion unit 22.
- Any feature expressible as a fixed-dimensional vector obtained with a neural network may be used; for example, it can be extracted by the method described in Non-Patent Document 4.
- Specifically, a feature map (indexed by height, width, and number of channels) is obtained with VGG16 or ResNet101, which are kinds of CNN. Rectangles of various sizes are defined on the map, and a vector of size (number of rectangles × number of channels) is obtained by taking the maximum value inside each rectangle for each channel. For normalization, any known method may be used, but L2 normalization is preferable.
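- A sketch of this extraction in the spirit of Non-Patent Document 4, with an untrained VGG16 backbone and a hand-picked region layout standing in for the real configuration:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def region_max_vector(image, regions):
    """Max-pool a CNN feature map over rectangles of various sizes,
    concatenate the per-rectangle vectors, then L2-normalize."""
    # Untrained weights keep the sketch self-contained; a trained
    # backbone would be used in practice.
    backbone = models.vgg16(weights=None).features
    fmap = backbone(image)  # (1, 512, 7, 7) for a 224x224 input
    pooled = []
    for (y0, y1, x0, x1) in regions:
        # Maximum value inside the rectangle, taken per channel.
        pooled.append(fmap[:, :, y0:y1, x0:x1].amax(dim=(2, 3)))
    vec = torch.cat(pooled, dim=1)       # (1, n_rectangles * n_channels)
    return F.normalize(vec, p=2, dim=1)  # L2 normalization

vec = region_max_vector(torch.randn(1, 3, 224, 224),
                        regions=[(0, 7, 0, 7), (0, 4, 0, 4), (3, 7, 3, 7)])
```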
- Non-Patent Document 4: A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "End-to-end learning of deep visual representations for image retrieval," IJCV, 2017.
- The parameter update unit 24 updates the parameters of the first and second convolutional neural networks based on the error between feature vectors, the error between images before and after conversion, and the loss function, and stores the updated parameters in the storage unit 29.
- The error between feature vectors consists of the error between the feature vectors of the first images of the first image set 3 and those of the converted second images produced by the second conversion unit 22, and the error between the feature vectors of the second images of the second image set 4 and those of the converted first images produced by the first conversion unit 21.
- The error between images before and after conversion consists of the error between each re-converted first image produced by the second conversion unit 22 and the corresponding first image of the first image set 3, and the error between each re-converted second image produced by the first conversion unit 21 and the corresponding second image of the second image set 4.
- The loss function comprises a loss expressing that the first convolutional neural network competes with an identification neural network that identifies whether each image converted by the first conversion unit 21 is a first image of the first image set 3 or a second image of the second image set 4, and a loss expressing that the second convolutional neural network competes with an identification neural network that identifies whether each image converted by the second conversion unit 22 is a first image of the first image set 3 or a second image of the second image set 4.
- Using these errors and loss functions, the parameter update unit 24 updates the parameters of the first and second convolutional neural networks.
- Any known form of loss function may be used. In the present embodiment, the feature-vector error is the squared error between the feature vectors before and after conversion, and the Adversarial Loss described in Non-Patent Document 3 and the image error of the Cycle Consistency Loss may be added to it.
- For the Adversarial Loss, the first convolutional neural network and its identification neural network are trained alternately using the loss function above. The value of the loss function decreases the more the converted first images fail to be identified as conversions, and the parameters of the first convolutional neural network and the identification neural network are updated so that this value decreases; that is, the parameters of the first network are learned so that the identification network cannot distinguish the converted images.
- Similarly, the second convolutional neural network and its identification neural network are trained alternately using the loss function above. The value of the loss function decreases the more the converted second images fail to be identified as conversions, and the parameters of the second convolutional neural network and the identification neural network are updated so that this value decreases.
- The Cycle Consistency Loss consists of the error between each re-converted image, obtained by further converting with the second conversion unit 22 each converted first image produced by the first conversion unit 21, and the corresponding first image before conversion, together with the error between each re-converted image, obtained by further converting with the first conversion unit 21 each converted second image produced by the second conversion unit 22, and the corresponding second image before conversion.
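- Put together, one reading of the converters' combined objective looks as follows; the weights and the exact mixing are assumptions, and the networks are passed in rather than defined here:

```python
import torch
import torch.nn.functional as F

def converter_objective(g_up, g_down, d_high, d_low, f_ext,
                        x_low, y_high, w_cyc=10.0, w_feat=1.0):
    fake_high = g_up(x_low)    # low  -> high conversion
    fake_low = g_down(y_high)  # high -> low  conversion

    # Adversarial Loss: the converters try to make the identification
    # networks label converted images as genuine.
    lh, ll = d_high(fake_high), d_low(fake_low)
    adv = F.binary_cross_entropy_with_logits(lh, torch.ones_like(lh)) \
        + F.binary_cross_entropy_with_logits(ll, torch.ones_like(ll))

    # Cycle Consistency Loss: converting there and back must
    # reproduce the image before conversion.
    cyc = F.l1_loss(g_down(fake_high), x_low) + F.l1_loss(g_up(fake_low), y_high)

    # Squared feature-vector error before and after conversion.
    feat = F.mse_loss(f_ext(fake_high), f_ext(x_low)) \
         + F.mse_loss(f_ext(fake_low), f_ext(y_high))

    return adv + w_cyc * cyc + w_feat * feat
```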
- The method for updating the parameters is not limited; for example, the parameters may be updated using the error backpropagation method, which corrects the parameters of each neuron from the output of the neural network back toward the input so that the local error decreases. Updates using the feature-vector error and the image error before and after conversion, and updates using the loss function, may be performed alternately.
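- A sketch of the alternating updates for one converter/identification pair (the other pair is handled symmetrically; the networks and optimizers are assumed to exist):

```python
import torch
import torch.nn.functional as F

def bce(logits, is_real):
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def alternating_step(g_up, d_high, x_low, y_high, opt_g, opt_d):
    # Identification step: learn to tell converted images from
    # genuine high-resolution ones.
    opt_d.zero_grad()
    fake = g_up(x_low).detach()  # detach: only the identifier's parameters move
    d_loss = bce(d_high(fake), False) + bce(d_high(y_high), True)
    d_loss.backward()            # error backpropagation, output toward input
    opt_d.step()

    # Converter step: the loss decreases as converted images stop
    # being identified as conversions.
    opt_g.zero_grad()
    g_loss = bce(d_high(g_up(x_low)), True)
    g_loss.backward()
    opt_g.step()
```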
- The search device 11 converts a low-resolution query image to high resolution, converts high-resolution reference images to low resolution, or performs both conversions, and then extracts and collates feature vectors; this makes accurate search possible between images whose resolutions differ.
- The search device 11 shown in FIG. 2 can be configured as a computer including a CPU, a RAM, and a ROM storing a program for executing the search processing routine described later, together with various data.
- As shown in FIG. 2, the search device 11 includes a search first conversion unit 31, a search second conversion unit 32, a search feature extraction unit 33, a collation unit 35, and a storage unit 39.
- The storage unit 39 stores the parameters of the first and second convolutional neural networks learned by the learning device 10.
- The search first conversion unit 31 converts the image 5, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution higher than the first images and corresponding to that of the second images of the second image set 4.
- The search second conversion unit 32 converts each third image of the third image set 6, which consists of reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images of the first image set 3. The third image set 6 may be the same as the second image set 4.
- The search feature extraction unit 33 extracts a feature vector from the image 5, from the converted image of the image 5 produced by the search first conversion unit 31, from each third image of the third image set 6, and from each converted third image.
- The collation unit 35 collates similarity using the pair of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each converted third image produced by the search second conversion unit 32, together with the pair of the feature vector of the converted image 5 and the feature vector of each third image of the third image set 6, and outputs the collation result as the search result 7. The similarity check may use at least one of the two pairings.
- The collation may be performed, for example, by calculating the inner product between feature vectors, taking the value as the similarity between images, and outputting the top N images of the third image set 6 with the highest similarity as the search result 7, where N is an integer from 1 to the number of images in the third image set 6.
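- A minimal sketch of this collation with NumPy, assuming L2-normalized feature vectors:

```python
import numpy as np

def top_n_matches(query_vec, ref_vecs, n=5):
    """Inner product as similarity: with L2-normalized vectors,
    a larger inner product means a more similar image."""
    sims = ref_vecs @ query_vec       # one inner product per reference image
    order = np.argsort(-sims)[:n]     # indices of the N most similar images
    return order, sims[order]

refs = np.random.rand(100, 256)
refs /= np.linalg.norm(refs, axis=1, keepdims=True)  # L2 normalization
query = refs[3] / np.linalg.norm(refs[3])
idx, scores = top_n_matches(query, refs)  # top-N images as the search result
```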
- The number of feature vectors for the image 5 input to the collation unit 35 need not be one per image 5; a plurality may be used.
- For example, the feature vector extracted from the image 5 and the feature vector extracted from the image 5 after conversion by the search first conversion unit 31 may both be used as search queries, and the search may be performed against the feature vectors extracted from each third image of the third image set 6 and from each converted third image produced by the search second conversion unit 32.
- For normalization, any known method may be used, but L2 normalization is preferable.
- The learning device 10 executes the learning processing routine shown in FIG.
- In step S201, the first conversion unit 21 converts each first image of the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution higher than the first image and corresponding to that of the second images of the second image set 4.
- In step S202, the second conversion unit 22 converts each second image of the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images of the first image set 3.
- In step S203, the feature extraction unit 23 extracts a feature vector from each first image of the first image set 3 and from each converted second image produced in step S202.
- In step S204, the feature extraction unit 23 extracts a feature vector from each second image of the second image set 4 and from each converted first image produced in step S201.
- In step S205, the first conversion unit 21 converts each converted second image produced in step S202, by the first convolutional neural network, into a re-converted image whose resolution corresponds to that of the second images of the second image set 4.
- In step S206, the second conversion unit 22 converts each converted first image produced in step S201, by the second convolutional neural network, into a re-converted image whose resolution corresponds to that of the first images of the first image set 3.
- The parameter update unit 24 then computes, as the error between images before and after conversion, the error between each re-converted first image produced by the second conversion unit 22 and the corresponding first image of the first image set 3 and the error between each re-converted second image produced by the first conversion unit 21 and the corresponding second image of the second image set 4; it computes, as the error between feature vectors, the error between the feature vectors of the first images of the first image set 3 and those of the converted second images produced by the second conversion unit 22 and the error between the feature vectors of the second images and those of the converted first images; and it updates the parameters of both convolutional neural networks accordingly.
- As described above, according to the learning device of this embodiment, each first image of the first image set 3, consisting of first images of a predetermined resolution, is converted by the first convolutional neural network to a resolution higher than the first image and corresponding to the second images of the second image set 4; each second image of the second image set 4 is converted by the second convolutional neural network to a resolution corresponding to the first images of the first image set 3; feature vectors are extracted from each first image, each second image, each converted first image, and each converted second image; and the parameters are updated using the error between the feature vectors of the images and those of the converted images, the error between the images and the re-converted images, and the loss functions expressing that each convolutional neural network competes with its identification neural network. Neural network parameters for accurately searching for an object appearing in an image can thereby be learned.
- Next, the search device 11 executes the search processing routine shown in FIG.
- In step S301, the search first conversion unit 31 converts the image 5 to be collated, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution higher than the first images and corresponding to that of the second images of the second image set 4.
- In step S302, the search second conversion unit 32 converts each third image of the third image set 6, which consists of reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images of the first image set 3.
- In step S303, the search feature extraction unit 33 extracts a feature vector from the image 5, from the converted image of the image 5 produced by the search first conversion unit 31, from each third image of the third image set 6, and from each converted third image.
- In step S304, the collation unit 35 collates similarity using the pair of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each converted third image produced by the search second conversion unit 32, together with the pair of the feature vector of the converted image 5 and the feature vector of each third image of the third image set 6, and outputs the collation result as the search result 7.
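- The whole routine can be sketched as follows; how the two similarities are combined (here, a simple sum) is an assumption, and f_ext, g_up, and g_down are placeholder names for the feature extractor and the two converters:

```python
import torch

@torch.no_grad()
def search_routine(image5, third_images, g_up, g_down, f_ext, n=10):
    """S301: convert the query up; S302: convert the references down;
    S303: extract feature vectors; S304: collate both pairings."""
    q = f_ext(image5)                        # query feature, as input
    q_up = f_ext(g_up(image5))               # S301: converted query
    refs = f_ext(third_images)               # reference features, as input
    refs_down = f_ext(g_down(third_images))  # S302: converted references

    # S304: inner-product similarity for both set pairings.
    sims = q @ refs_down.t() + q_up @ refs.t()  # shape (1, N)
    top = torch.topk(sims.squeeze(0), n)        # top-N as search result 7
    return top.indices, top.values
```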
- As described above, according to the search device of this embodiment, an image is converted by the first convolutional neural network using the learned parameters, each third image serving as a reference image is converted by the second convolutional neural network using the learned parameters, feature vectors are extracted from the converted images, similarity over the pairs of feature vectors is collated, and the collation result is output; an object appearing in an image can thereby be searched with high accuracy.
- A database of reference images (the third image set 6) may also be constructed in advance, with the image conversion by the second conversion unit 22 and the feature vector extraction by the feature extraction unit 23 performed beforehand and the results stored in the database. The query image (image 5) is then processed as input is received from outside at inquiry time, and the collation unit 35 obtains the reference feature vectors from the database and calculates the similarity. In this case the search device may be configured with a search second conversion unit, a search feature extraction unit, and a collation unit; this configuration shortens the time from receiving the image 5 to obtaining the search result 7.
- Alternatively, matching may be performed between the feature vectors extracted from each third image of the third image set 6 and the feature vector of the converted image obtained by converting the image 5. In that case the search device need only comprise a search first conversion unit, a search feature extraction unit, and a collation unit; this configuration likewise shortens the time from receiving the image 5 to obtaining the search result 7. A sketch of such an offline-built reference index follows.
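```python
import numpy as np

class ReferenceIndex:
    """Reference features are computed once, offline, so that only
    query-side processing and similarity calculation remain at
    inquiry time (a sketch; extract_fn stands in for the
    conversion-plus-feature-extraction pipeline and is assumed
    to return an L2-normalized NumPy vector)."""
    def __init__(self, extract_fn):
        self.extract_fn = extract_fn
        self.ids, self.vecs = [], []

    def add(self, image_id, image):
        # Performed when the reference database is constructed.
        self.ids.append(image_id)
        self.vecs.append(self.extract_fn(image))

    def query(self, query_vec, n=5):
        # Inner-product similarity against the stored vectors.
        sims = np.stack(self.vecs) @ query_vec
        return [self.ids[i] for i in np.argsort(-sims)[:n]]

# Hypothetical usage with a dummy extractor:
index = ReferenceIndex(lambda img: img.mean(axis=(0, 1)))
index.add("ref-001", np.random.rand(32, 32, 8))
nearest = index.query(np.random.rand(8), n=1)
```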
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The present invention relates to a learning device, a search device, a method, and a program, and more particularly to a learning device, a search device, a method, and a program for searching for an object that appears in an image.

With the spread of small imaging devices such as smartphones, there is growing demand for technology that searches for an object appearing in images of arbitrary subjects taken in various places and environments.

Various techniques for searching for an object in an image have been invented and disclosed; a typical procedure is described here following the technique of Patent Document 1. First, a feature vector is extracted from an image using a convolutional neural network (CNN). Next, the inner product of the feature vectors of two different images is calculated; the larger the inner product, the more the two images are regarded as showing the same object. By building a reference image database in advance from images containing the objects to be recognized (reference images) and searching for entries showing the same object as a newly input image (query image), the object present in the query image can be identified.

Patent Document 1 also discloses a method for searching for an object using many feature vectors extracted from an image. A large number of small characteristic regions are detected in the image, and a feature vector is extracted from each region. Next, for the small regions contained in two different images, the Euclidean distances between feature vectors are calculated and the number of corresponding sub-regions is counted. Because similarity grows with this count, a large count indicates that the same object appears in the two images.

However, this object search procedure has a serious problem. When the resolutions of the objects shown in the query image and the reference image diverge, different feature vectors are often obtained even for the same object, and as a result a different object may be retrieved. In the following description, resolution refers to the number of pixels.

For example, when an image in which the object appears small and at low resolution is used as a query against reference images in which the object appears large and at high resolution, high-frequency components are often lost from the object in the query image; this is a typical case in which the above problem arises.

In view of these problems, a resolution conversion technique that prevents the resolution gap between query and reference images and thereby enables search is desired.

Several inventions addressing these problems have been made and disclosed.
Non-Patent Document 2 discloses a learning-based super-resolution method built on image pairs: a low-resolution image is enlarged by the Bicubic method and then converted with a CNN to obtain a high-resolution image. By training the CNN in advance on pairs of low- and high-resolution images, high-frequency components absent from the low-resolution image can be restored with high accuracy. Extracting feature vectors after converting the low-resolution image to high resolution can therefore improve search accuracy.

Non-Patent Document 3 discloses a learning-based image conversion method built on a pair of image sets; the conversion between the two sets is acquired through learning. More specifically, the system comprises a converter that transforms image set X into image set Y, a discriminator that distinguishes converted images from images belonging to set Y, a converter that transforms set Y into set X, and a discriminator that distinguishes converted images from images belonging to set X. By converting an image of set X into set Y and using the reconstruction error when it is converted back to X, conversion between the two image sets can be realized even without one-to-one corresponding image pairs. For example, by taking one set to be low-resolution images and the other high-resolution images, a resolution converter can be acquired and the discrepancy between query and reference images prevented.

For methods that learn a conversion from image pairs, as in Non-Patent Document 2, preparing the pairs is itself a problem. In Non-Patent Document 2, training pairs consist of a high-resolution image and the image obtained by reducing it with the Bicubic method. In that case, search accuracy degrades because an image down-converted by the Bicubic method differs from how an object actually appears at low resolution in a captured image. Moreover, it is difficult to photograph the same object at both high and low resolution under identical conditions to prepare image pairs of differing resolutions, so collecting a large number of pairs as training data is inefficient.

A method that learns a conversion from a pair of image sets, as in Non-Patent Document 3, converts an image belonging to one set into an image resembling the other set, and is trained so that converting back again returns the original image. Each individual conversion may therefore produce an image whose appearance deviates substantially, which can seriously degrade search accuracy.

A further problem, common to both methods, is that neither considers the feature vector used for object search. In Non-Patent Document 2, the converted low-resolution image is trained to be close to the high-resolution image, but their feature vectors do not necessarily match. Likewise, in Non-Patent Document 3, the feature vectors before and after conversion do not necessarily match.

As described above, no search technique had yet been invented that can search for an object with high accuracy when a resolution gap exists between the query image and the reference images.

The present invention was made to solve the above problems, and aims to provide a learning device, method, and program capable of learning neural network parameters for accurately searching for an object appearing in an image.

It is a further object to provide a search device, method, and program capable of accurately searching for an object appearing in an image.
上記目的を達成するために、第1の発明に係る学習装置は、所定の解像度の第一の画像からなる第一画像集合に属する前記第一の画像の各々を、第一の畳み込みニューラルネットワークによって前記第一の画像よりも高解像度であって、第二画像集合に属する第二の画像の解像度に対応する解像度となるように変換する第一変換部と、前記第二画像集合に属する前記第二の画像の各々を、第二の畳み込みニューラルネットワークによって前記第一画像集合に属する第一の画像の解像度に対応する解像度となるように変換する第二変換部と、前記第一画像集合に属する前記第一の画像の各々と、前記第二画像集合に属する前記第二の画像の各々と、前記第一変換部により変換された前記第一の画像の各々の変換画像の各々と、前記第二変換部により変換された前記第二の画像の各々の変換画像の各々とから特徴量ベクトルを抽出する特徴量抽出部と、前記特徴量抽出部により抽出された、前記第一画像集合に属する前記第一の画像の各々の特徴量ベクトルと前記第二変換部により変換された前記第二の画像の各々の変換画像の各々の特徴量ベクトルとの間の誤差、及び前記第二画像集合に属する前記第二の画像の各々の特徴量ベクトルと前記第一変換部により変換された前記第一の画像の各々の変換画像の各々の特徴量ベクトルとの誤差に基づいて、前記第一の畳み込みニューラルネットワーク及び前記第二の畳み込みニューラルネットワークのパラメータを更新するパラメータ更新部と、を含んで構成されている。 In order to achieve the above object, a learning device according to a first invention uses a first convolution neural network to connect each of the first images belonging to a first image set consisting of a first image having a predetermined resolution. A first conversion unit for converting the first image to have a resolution higher than that of the first image and corresponding to the resolution of the second image belonging to the second image set; and the first image belonging to the second image set. A second converter that converts each of the two images to a resolution corresponding to the resolution of the first image belonging to the first image set by a second convolution neural network, and belongs to the first image set Each of the first images, each of the second images belonging to the second image set, each of the converted images of the first image converted by the first conversion unit, Two conversion units A feature quantity extraction unit that extracts a feature quantity vector from each of the converted images of the second image that has been further transformed, and the first image belonging to the first image set extracted by the feature quantity extraction unit. An error between each feature quantity vector of the image and each feature quantity vector of each converted image of the second image transformed by the second transform unit, and the second image set belonging to the second image set. A first convolutional neural network based on an error between each feature vector of the second image and each feature vector of each converted image of the first image converted by the first conversion unit; And a parameter updating unit that updates parameters of the second convolutional neural network.
また、第1の発明に係る学習装置において、前記第一変換部は、前記第二変換部で変換された第二の画像の各々の変換画像の各々を、前記第二画像集合に属する第二の画像の解像度に対応する解像度となるように再変換画像の各々として更に変換し、前記第二変換部は、前記第一変換部で変換された第一の画像の各々の変換画像の各々を、前記第一画像集合に属する第一の画像の解像度に対応する解像度となるように再変換画像の各々として更に変換し、前記パラメータ更新部は、前記第二変換部で更に変換された前記第一の画像の各々の再変換画像の各々と前記第一画像集合に属する前記第一の画像の各々との誤差、及び前記第一変換部で更に変換された前記第二の画像の各々の再変換画像の各々と前記第二画像集合に属する前記第二の画像の各々との誤差を更に用いて、前記第一の畳み込みニューラルネットワーク及び前記第二の畳み込みニューラルネットワークのパラメータを更新するようにしてもよい。 In the learning device according to the first aspect, the first conversion unit may convert each converted image of the second image converted by the second conversion unit to a second image belonging to the second image set. The second conversion unit further converts each of the converted images of the first image converted by the first conversion unit so as to have a resolution corresponding to the resolution of the first image. , Further converting each of the reconverted images so as to have a resolution corresponding to the resolution of the first image belonging to the first image set, and the parameter updating unit is further converted by the second conversion unit An error between each re-converted image of each of the one images and each of the first images belonging to the first image set, and each of the second images further converted by the first converter. Each of the transformed images and the second image belonging to the second image set Further using the error between each image, it may be updated parameters of the first convolution neural network and said second convolution neural network.
また、第1の発明に係る学習装置において、前記パラメータ更新部は、更に、前記第一の畳み込みニューラルネットワークと、前記第一変換部で変換された第一の画像の各々の変換画像の各々が、前記第一画像集合に属する前記第一の画像、及び前記第二画像集合に属する前記第二の画像の何れであるかを識別する識別用ニューラルネットワークとについて、前記第一の畳み込みニューラルネットワークと、前記識別用ニューラルネットワークとが、互いに競合することを表す損失関数を用いて、前記第一の畳み込みニューラルネットワーク及び前記識別用ニューラルネットワークのパラメータを更新するようにしてもよい。 In the learning device according to the first invention, the parameter update unit further includes the first convolutional neural network and the converted images of the first image converted by the first conversion unit. The first convolutional neural network for identifying the first image belonging to the first image set and the identifying neural network identifying the second image belonging to the second image set; The parameters of the first convolutional neural network and the identification neural network may be updated using a loss function indicating that the identification neural network competes with each other.
また、第2の発明に係る検索装置は、上記第1の発明に係る学習装置によってパラメータが学習された第一の畳み込みニューラルネットワークを用いて、任意の画像を前記第二画像集合に属する第二の画像の解像度に対応する解像度となるように変換する検索第一変換部と、前記任意の画像の変換画像と、第三の画像からなる第三画像集合の前記第三の画像の各々とから特徴量ベクトルを抽出する検索特徴量抽出部と、前記検索特徴量抽出部で抽出された前記変換画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の特徴量ベクトルとの組を用いた類似度に基づいて、前記任意の画像と前記第三の画像とを照合し、照合結果を出力する照合部と、を含んで構成されている。 Further, the search device according to the second invention uses the first convolution neural network whose parameters have been learned by the learning device according to the first invention, and assigns an arbitrary image to the second image set belonging to the second image set. A search first conversion unit that converts to a resolution corresponding to the resolution of the image, a converted image of the arbitrary image, and each of the third images of the third image set including the third image A search feature quantity extraction unit that extracts a feature quantity vector; a feature quantity vector of the converted image extracted by the search feature quantity extraction unit; and a feature quantity vector of each of the third images in the third image set A collation unit that collates the arbitrary image with the third image and outputs a collation result based on the similarity using the set.
また、第3の発明に係る検索装置は、上記第1の発明に係る学習装置によってパラメータが学習された前記第二の畳み込みニューラルネットワークを用いて、第三の画像からなる第三画像集合の前記第三の画像の各々を前記第一画像集合に属する第一の画像の解像度に対応する解像度となるように変換する検索第二変換部と、任意の画像と、前記第三画像集合の前記第三の画像の各々の変換画像とから特徴量ベクトルを抽出する検索特徴量抽出部と、前記検索特徴量抽出部で抽出された前記任意の画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の変換画像の特徴量ベクトルとの組を用いた類似度に基づいて、前記任意の画像と前記第三の画像とを照合し、照合結果を出力する照合部と、を含んで構成されている。 Further, a search device according to a third invention uses the second convolution neural network whose parameters have been learned by the learning device according to the first invention, and uses the second convolutional neural network for the third image set of a third image. A search second conversion unit that converts each of the third images to a resolution corresponding to the resolution of the first image belonging to the first image set, an arbitrary image, and the first image of the third image set A search feature quantity extraction unit that extracts a feature quantity vector from each converted image of the third image; a feature quantity vector of the arbitrary image extracted by the search feature quantity extraction unit; and the third image set of the third image set. A collation unit that collates the arbitrary image with the third image and outputs a collation result based on the similarity using a pair of feature vectors of the converted images of the three images. It consists of
また、第3の発明に係る検索装置において、前記第一の畳み込みニューラルネットワークを用いて、前記任意の画像を前記第二画像集合に属する第二の画像の解像度に対応する解像度となるように変換する検索第一変換部を更に含み、前記検索特徴量抽出部は、更に、前記任意の画像の変換画像と、前記第三画像集合の前記第三の画像の各々とから特徴量ベクトルを抽出し、前記照合部は、前記検索特徴量抽出部で抽出された前記任意の画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の変換画像の特徴量ベクトルとの組を用いた類似度、及び前記任意の画像の変換画像の特徴量ベクトルと前記第三画像集合の前記第三の画像の各々の特徴量ベクトルとの組を用いた類似度に基づいて、前記任意の画像と前記第三の画像とを照合し、照合結果を出力するようにしてもよい。 In the search device according to the third invention, the first convolutional neural network is used to convert the arbitrary image to have a resolution corresponding to the resolution of the second image belonging to the second image set. A search first conversion unit that further extracts a feature vector from the converted image of the arbitrary image and each of the third images of the third image set. The collation unit uses a set of a feature amount vector of the arbitrary image extracted by the search feature amount extraction unit and a feature amount vector of each converted image of the third image in the third image set. The arbitrary image based on the similarity using the combination of the feature amount vector of the converted image of the arbitrary image and the feature amount vector of each of the third images of the third image set. And the third image Collating, may output the verification result.
A learning method according to the fourth invention is executed by: a first conversion unit converting each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; a second conversion unit converting each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit extracting feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images produced by the first conversion unit, and each converted image of the second images produced by the second conversion unit; and a parameter update unit updating the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the error between the feature vector of each first image belonging to the first image set and the feature vector of each converted image of the second images produced by the second conversion unit, and the error between the feature vector of each second image belonging to the second image set and the feature vector of each converted image of the first images produced by the first conversion unit, the feature vectors being those extracted by the feature extraction unit.
A program according to the fifth invention causes a computer to function as each unit of the learning device according to the first invention or of the search device according to the second or third invention.
According to the learning device, method, and program of the present invention, the parameters of neural networks for accurately retrieving an object appearing in an image can be learned by: converting each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; converting each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; extracting feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images, and each converted image of the second images; and updating the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the errors between the feature vectors of the images and the feature vectors of the converted images.
According to the search device, method, and program of the present invention, an object appearing in an image can be retrieved accurately by converting the image with the first convolutional neural network using the learned parameters, converting each third image serving as a reference image with the second convolutional neural network using the learned parameters, extracting feature vectors from the converted images, computing similarities from pairs of the feature vectors, and outputting the matching result.
Embodiments of the present invention will now be described in detail with reference to the drawings.
The learning device and search device according to the present embodiment obtain search results with high accuracy even when the resolution of a query image showing a specific target object (hereinafter referred to as the specific target) diverges greatly from that of the reference images.
<Configuration of Learning Device According to Embodiment of the Present Invention>
Next, the configuration of the learning device according to the embodiment of the present invention will be described. The learning device 10 shown in FIG. 1 obtains search results with high accuracy even when the resolution of a query image showing a specific target diverges greatly from that of the reference images. The learning device 10 can be implemented as a computer that includes a CPU, RAM, and ROM storing a program for executing the learning processing routine described later together with various data. Functionally, as shown in FIG. 1, the learning device 10 includes a first conversion unit 21, a second conversion unit 22, a feature extraction unit 23, a parameter update unit 24, and a storage unit 29. In the following description, the image 5 corresponds to the query image, the third image set 6 corresponds to an image set made up of one or more reference images, the query image shows the specific target at low resolution, and the reference images show it at high resolution. Resolution here means the total number of pixels of an image.
The learning device 10 receives, as input, a first image set 3 made up of first images corresponding to one or more low-resolution images and a second image set 4 made up of second images corresponding to one or more high-resolution images, both stored in the database 2. Because the convolutional neural networks are trained against a discriminator that judges whether an input image is a converted image or a reference image, no image-level correspondence between the first image set 3 and the second image set 4 is required. The first images corresponding to low-resolution images are assumed to have been enlarged, for example by the bicubic method, so that their pixel counts match those of the second images corresponding to high-resolution images.
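As a concrete illustration, this bicubic pre-enlargement could be done as in the following sketch; the file path and target size are hypothetical, and only the use of bicubic resampling reflects the text above.

```python
from PIL import Image

def upscale_bicubic(path, target_size):
    """Enlarge a low-resolution image by bicubic interpolation so that its
    pixel count matches that of the high-resolution (second) images."""
    img = Image.open(path).convert("RGB")
    return img.resize(target_size, resample=Image.BICUBIC)

# Hypothetical usage: match a 256x256 pixel count assumed for the second image set.
enlarged = upscale_bicubic("first_image_set/000001.png", (256, 256))
```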
The learning device 10 exchanges information with the database 2 via communication means (not shown).
The database 2 can be implemented, for example, with a file system installed on a general-purpose computer. In the present embodiment, as an example, the database 2 stores the image data of the first image set 3 and the second image set 4 in advance. Each image is assumed to be given an identifier that uniquely identifies it, such as a serial-number ID (Identification) or a unique image file name, and the database 2 stores, for each image, the identifier of the image in association with its image data. Alternatively, the database 2 may be implemented and configured with an RDBMS (Relational Database Management System) or the like. The database 2 may also store other information as metadata, for example information expressing the content of an image (its title, summary text, or keywords) or information about the image format (data size, thumbnail size, and so on), although storing such information is not essential to practicing the present disclosure.
The database 2 may be provided either inside or outside the learning device 10, and any known communication means can be used. In the present embodiment, the database 2 is provided outside the learning device 10 and is communicably connected to the learning device 10 via a network such as the Internet using TCP/IP (Transmission Control Protocol/Internet Protocol) as the communication means.
Each unit of the learning device 10 and the database 2 may be configured on a computer or server equipped with an arithmetic processing unit such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit) and storage devices such as RAM (Random Access Memory), ROM (Read Only Memory), and an HDD (Hard Disk Drive), with the processing of each unit executed by a program. The program may be stored in advance in the storage device of the learning device 10, provided on a recording medium such as a magnetic disk, optical disc, or semiconductor memory, or provided over a network. Of course, no component needs to be realized on a single computer or server; any component may be realized in distributed form on a plurality of computers connected by a network.
Next, the function of each unit of the learning device 10 in the present embodiment will be described.
The first conversion unit 21 converts each first image belonging to the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of the second images belonging to the second image set 4. The first conversion unit 21 also converts each converted image of the second images produced by the second conversion unit 22, by the first convolutional neural network, into a reconverted image whose resolution corresponds to that of the second images belonging to the second image set 4.
In the present embodiment, the conversion by the first conversion unit 21 is assumed to be from a low-resolution image to a high-resolution image. The first convolutional neural network used for the conversion is not limited as long as it performs convolution with a neural network. In the present embodiment, the conversion is performed by the nine-layer convolutional neural network (CNN) described in Non-Patent Document 3, which consists of stride-2 convolution layers for downsampling, residual blocks, and stride-1/2 convolution layers for upsampling.
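A sketch of how such a conversion network could be assembled in PyTorch is shown below; the channel widths and the number of residual blocks are assumptions, and the stride-1/2 upsampling layers are realized here with transposed convolutions. This is an illustrative sketch under those assumptions, not the exact network of Non-Patent Document 3.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection around the two conv layers

def make_generator(n_blocks=3):
    """Conversion network: stride-2 convolutions for downsampling,
    residual blocks, and stride-1/2 (transposed) convolutions for upsampling."""
    layers = [nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(inplace=True),
              nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
              nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
    layers += [ResidualBlock(256) for _ in range(n_blocks)]
    layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
               nn.ReLU(inplace=True),
               nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
               nn.ReLU(inplace=True),
               nn.Conv2d(64, 3, 7, padding=3), nn.Tanh()]
    return nn.Sequential(*layers)
```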
The second conversion unit 22 converts each second image belonging to the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images belonging to the first image set 3. The second conversion unit 22 also converts each converted image of the first images produced by the first conversion unit 21, by the second convolutional neural network, into a reconverted image whose resolution corresponds to that of the first images belonging to the first image set 3.
The feature extraction unit 23 extracts feature vectors from each first image belonging to the first image set 3, each second image belonging to the second image set 4, each converted image of the first images produced by the first conversion unit 21, and each converted image of the second images produced by the second conversion unit 22.
Any feature extraction process that can express an image as a vector of fixed dimension using a neural network may be used; for example, features can be extracted with the method described in Non-Patent Document 4. In that method, for the feature map that is fed into the fully connected layers when an image is input to VGG16 (a kind of CNN) or to the neural network ResNet101 (the size of the feature map being determined by its height, width, and number of channels), rectangles of various sizes are first defined, and the maximum of the values inside each rectangle is taken per channel, yielding vectors numbering (number of rectangles) x (number of channels). Normalizing this group of vectors, summing the values of the same channel, and normalizing again expresses one image as a feature vector whose dimension equals the number of channels. Any known normalization may be used; L2 normalization is preferable.
[Non-Patent Document 4] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, End-to-end learning of deep visual representations for image retrieval, IJCV, 2017.
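A minimal sketch of this regional max-pooling descriptor, assuming a VGG16 backbone from torchvision, is given below; the grid of 1x1, 2x2, and 3x3 regions is an assumption, and refinements of the cited method such as learned whitening are omitted.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

backbone = vgg16(weights="IMAGENET1K_V1").features.eval()  # convolutional layers only

def rmac_descriptor(image, scales=(1, 2, 3)):
    """image: float tensor (3, H, W), preprocessed as the backbone expects.
    Max-pool each region per channel, L2-normalize each region vector,
    sum over regions (same channels added), and L2-normalize again."""
    with torch.no_grad():
        fmap = backbone(image.unsqueeze(0))[0]          # (C, H, W) feature map
    C, H, W = fmap.shape
    region_vecs = []
    for s in scales:                                     # s x s grid of rectangles
        h, w = max(H // s, 1), max(W // s, 1)
        for i in range(s):
            for j in range(s):
                region = fmap[:, i * h:(i + 1) * h, j * w:(j + 1) * w]
                v = region.amax(dim=(1, 2))              # channel-wise maximum
                region_vecs.append(F.normalize(v, dim=0))
    desc = torch.stack(region_vecs).sum(dim=0)           # sum the same channels
    return F.normalize(desc, dim=0)                      # final L2 normalization
```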
The parameter update unit 24 updates the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the errors between feature vectors, the errors between the images before conversion and after reconversion, and the loss functions, and stores the parameters in the storage unit 29.
Specifically, the errors between feature vectors are the error between the feature vector of each first image belonging to the first image set 3 and the feature vector of each converted image of the second images produced by the second conversion unit 22, and the error between the feature vector of each second image belonging to the second image set 4 and the feature vector of each converted image of the first images produced by the first conversion unit 21.
The errors between the images before conversion and after reconversion are the error between each reconverted image of the first images further converted by the second conversion unit 22 and the corresponding first image belonging to the first image set 3, and the error between each reconverted image of the second images further converted by the first conversion unit 21 and the corresponding second image belonging to the second image set 4.
The loss functions are: for the first convolutional neural network and a discriminator neural network that identifies whether each converted image of the first images produced by the first conversion unit 21 is a first image belonging to the first image set 3 or a second image belonging to the second image set 4, a loss function expressing that the first convolutional neural network and the discriminator neural network compete with each other; and, for the second convolutional neural network and a discriminator neural network that identifies whether each converted image of the second images produced by the second conversion unit 22 is a first image belonging to the first image set 3 or a second image belonging to the second image set 4, a loss function expressing that the second convolutional neural network and the discriminator neural network compete with each other.
The parameter update unit 24 updates the parameters of the first and second convolutional neural networks on the basis of the above errors between feature vectors, errors between the images before conversion and after reconversion, and loss functions. Any known loss function may be used to compute the errors; for example, the squared error between the feature vectors before and after conversion can be used. In addition to the errors between feature vectors, the Adversarial Loss described in Non-Patent Document 3 and the image-to-image errors of the Cycle Consistency Loss may be added.
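The combined objective could be sketched as follows; G_lh and G_hl stand for the first and second convolutional neural networks, D_h and D_l for the discriminator networks, extract for the feature extractor, and the weights are assumptions. Since no correspondence between the two image sets is assumed, minibatch entries are paired arbitrarily here.

```python
import torch
import torch.nn.functional as F

def generator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l, extract,
                   lambda_adv=1.0, lambda_cyc=10.0):
    """Feature-vector squared error + Adversarial Loss + Cycle Consistency Loss.
    x_low / x_high: equal-sized minibatches from the first / second image sets."""
    fake_high = G_lh(x_low)    # first conversion unit: low -> high
    fake_low = G_hl(x_high)    # second conversion unit: high -> low

    # Squared error between feature vectors of images and converted images
    # in the same resolution domain.
    feat_loss = (F.mse_loss(extract(x_low), extract(fake_low)) +
                 F.mse_loss(extract(x_high), extract(fake_high)))

    # Adversarial Loss: converted images should be judged as genuine members
    # of the target image set by the discriminators.
    adv_loss = (F.binary_cross_entropy_with_logits(
                    D_h(fake_high), torch.ones_like(D_h(fake_high))) +
                F.binary_cross_entropy_with_logits(
                    D_l(fake_low), torch.ones_like(D_l(fake_low))))

    # Cycle Consistency Loss: reconverted images should match the originals.
    cyc_loss = F.l1_loss(G_hl(fake_high), x_low) + F.l1_loss(G_lh(fake_low), x_high)

    return feat_loss + lambda_adv * adv_loss + lambda_cyc * cyc_loss
```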
With the Adversarial Loss, the first convolutional neural network and the discriminator neural network are trained alternately using the loss function described above. The value of the loss function decreases as the converted images of the first images become harder to identify as first images, and the parameters of the first convolutional neural network and the discriminator neural network are updated so that the value of the loss function decreases; in other words, the parameters of the first network are learned so that the discriminator neural network can no longer tell the converted images apart. Similarly, the second convolutional neural network and the discriminator neural network are trained alternately using the loss function described above: the value of the loss function decreases as the converted images of the second images become harder to identify as second images, and the parameters of the second convolutional neural network and the discriminator neural network are updated so that the value decreases.
The Cycle Consistency Loss consists of the error between each reconverted image, obtained by further converting with the second conversion unit 22 each converted image of the first images belonging to the first image set 3 produced by the first conversion unit 21, and the corresponding first image before conversion, together with the error between each reconverted image, obtained by further converting with the first conversion unit 21 each converted image of the second images belonging to the second image set 4 produced by the second conversion unit 22, and the corresponding second image before conversion.
The method of updating the parameters is not limited; for example, the parameters may be updated by error backpropagation, a method that corrects the parameters of each neuron, from the output of the neural network toward its input, so that the local error becomes smaller. In this update, updates using the errors between feature vectors and the errors between the images before conversion and after reconversion may be alternated with updates using the loss functions.
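An alternating update loop along these lines might look like the following sketch; the optimizers, learning rate, strict one-to-one alternation, and the names G_lh, G_hl, D_h, D_l, extract, and loader are all assumptions, and generator_loss is the combined objective sketched above.

```python
import itertools
import torch
import torch.nn.functional as F

def discriminator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l):
    """Discriminators should score real images as real and converted images as fake."""
    fake_high, fake_low = G_lh(x_low).detach(), G_hl(x_high).detach()
    loss = 0.0
    for d, real, fake in ((D_h, x_high, fake_high), (D_l, x_low, fake_low)):
        loss = loss + F.binary_cross_entropy_with_logits(d(real), torch.ones_like(d(real)))
        loss = loss + F.binary_cross_entropy_with_logits(d(fake), torch.zeros_like(d(fake)))
    return loss

# Hypothetical modules: G_lh, G_hl (conversion networks), D_h, D_l (discriminators).
opt_g = torch.optim.Adam(itertools.chain(G_lh.parameters(), G_hl.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(itertools.chain(D_h.parameters(), D_l.parameters()), lr=2e-4)

for x_low, x_high in loader:  # unpaired minibatches from the two image sets
    # Update the conversion networks by backpropagating the combined loss.
    opt_g.zero_grad()
    generator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l, extract).backward()
    opt_g.step()

    # Alternately update the discriminator networks.
    opt_d.zero_grad()
    discriminator_loss(x_low, x_high, G_lh, G_hl, D_h, D_l).backward()
    opt_d.step()
```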
<Configuration of Search Device According to Embodiment of the Present Invention>
Next, the configuration of the search device according to the embodiment of the present invention will be described.
In the present embodiment, the search device 11 converts a low-resolution query image to high resolution, converts the high-resolution reference images to low resolution, or performs both conversions, and then extracts and matches feature vectors, enabling highly accurate retrieval between images whose resolutions diverge.
The search device 11 shown in FIG. 2 can be implemented as a computer that includes a CPU, RAM, and ROM storing a program for executing the search processing routine described later together with various data. Functionally, as shown in FIG. 2, the search device 11 includes a search first conversion unit 31, a search second conversion unit 32, a search feature extraction unit 33, a matching unit 35, and a storage unit 39.
The storage unit 39 stores the parameters of the first convolutional neural network and the second convolutional neural network learned by the learning device 10.
The search first conversion unit 31 converts the image 5 to be matched, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution that is higher than that of the first images and corresponds to the resolution of the second images belonging to the second image set 4.
The search second conversion unit 32 converts each third image belonging to the third image set 6, made up of third images serving as reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images belonging to the first image set 3. The third image set 6 may be the same as the second image set 4.
The search feature extraction unit 33 extracts feature vectors from the image 5, the converted image of the image 5 produced by the search first conversion unit 31, each third image belonging to the third image set 6, and each converted image of the third images belonging to the third image set 6.
The matching unit 35 computes similarities using the pairs of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each third image converted by the search second conversion unit 32, and the pairs of the feature vector of the converted image 5 and the feature vector of each third image belonging to the third image set 6, and outputs the matching result as the search result 7. The matching may use at least one of the two kinds of pairs.
The matching may be performed, for example, by computing the inner product between feature vectors, taking that value as the similarity between the images, and outputting the top N images of the third image set 6 with the highest similarity as the search result 7, where N is an integer between 1 and the number of images in the third image set 6.
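In NumPy terms, this inner-product ranking could look like the following sketch, where the descriptors are assumed to be L2-normalized already and all variable names are illustrative.

```python
import numpy as np

def top_n_matches(query_vec, reference_vecs, n):
    """Inner product as similarity; return indices and scores of the top-N references.
    query_vec: (D,) descriptor; reference_vecs: (M, D) descriptors."""
    sims = reference_vecs @ query_vec           # inner product with every reference
    order = np.argsort(-sims)[:n]               # highest similarity first
    return order, sims[order]

# Hypothetical usage with 512-dimensional descriptors and 1000 references.
rng = np.random.default_rng(0)
refs = rng.normal(size=(1000, 512)).astype(np.float32)
refs /= np.linalg.norm(refs, axis=1, keepdims=True)
idx, scores = top_n_matches(refs[42], refs, n=5)   # the query itself ranks first
```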
The feature vectors of the image 5 input to the matching unit 35 need not be limited to one per image 5; a plurality can be used. For example, the feature vector extracted from the image 5 and the feature vector extracted from the image 5 converted by the search first conversion unit 31 may serve as the search query, and the search may be performed against the feature vectors extracted from each third image belonging to the third image set 6 and from each third image converted by the search second conversion unit 32. In this case, since two feature vectors are input per image, the two feature vectors of each image are summed and normalized before the similarity is computed. Any known normalization may be used; L2 normalization is preferable.
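This combination step could be sketched as follows, again with illustrative names:

```python
import numpy as np

def combine_descriptors(v_original, v_converted):
    """Sum the descriptor of an image and that of its converted version,
    then L2-normalize, yielding one query (or reference) vector per image."""
    v = v_original + v_converted
    return v / np.linalg.norm(v)
```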
<Operation of Learning Device According to Embodiment of the Present Invention>
Next, the operation of the learning device 10 according to the embodiment of the present invention will be described. The learning device 10 executes the learning processing routine shown in FIG. 3.
First, in step S201, the first conversion unit 21 converts each first image belonging to the first image set 3 stored in the database 2, by the first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of the second images belonging to the second image set 4.
Next, in step S202, the second conversion unit 22 converts each second image belonging to the second image set 4 stored in the database 2, by the second convolutional neural network, to a resolution corresponding to that of the first images belonging to the first image set 3.
In step S203, the feature extraction unit 23 extracts feature vectors from each first image belonging to the first image set 3 and from each converted image of the second images converted in step S202.
In step S204, the feature extraction unit 23 extracts feature vectors from each second image belonging to the second image set 4 and from each converted image of the first images converted in step S201.
In step S205, the first conversion unit 21 converts each converted image of the second images converted in step S202, by the first convolutional neural network, into a reconverted image whose resolution corresponds to that of the second images belonging to the second image set 4.
In step S206, the second conversion unit 22 converts each converted image of the first images converted in step S201, by the second convolutional neural network, into a reconverted image whose resolution corresponds to that of the first images belonging to the first image set 3.
In step S207, the parameter update unit 24 computes, as the errors between the images before conversion and after reconversion, the error between each reconverted image of the first images further converted by the second conversion unit 22 and the corresponding first image belonging to the first image set 3, and the error between each reconverted image of the second images further converted by the first conversion unit 21 and the corresponding second image belonging to the second image set 4; computes, as the errors between feature vectors, the error between the feature vector of each first image belonging to the first image set 3 and the feature vector of each converted image of the second images produced by the second conversion unit 22, and the error between the feature vector of each second image belonging to the second image set 4 and the feature vector of each converted image of the first images produced by the first conversion unit 21; and then updates the parameters of the first and second convolutional neural networks on the basis of the computed errors and the loss functions expressing that the convolutional neural networks and the discriminator neural networks compete with each other, storing the parameters in the storage unit 29.
As described above, the learning device according to the embodiment of the present invention can learn the parameters of neural networks for accurately retrieving an object appearing in an image by: converting each first image belonging to the first image set 3 of a predetermined resolution, by the first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of the second images belonging to the second image set 4; converting each second image belonging to the second image set 4, by the second convolutional neural network, to a resolution corresponding to that of the first images belonging to the first image set 3; extracting feature vectors from each first image, each second image, each converted image of the first images, and each converted image of the second images; and updating the parameters of the first and second convolutional neural networks on the basis of the errors between the feature vectors of the images and the feature vectors of the converted images, the errors between each image and its reconverted image, and the loss functions expressing that the convolutional neural networks and the discriminator neural networks compete with each other.
<Operation of Search Device According to Embodiment of the Present Invention>
Next, the operation of the search device 11 according to the embodiment of the present invention will be described. The search device 11 executes the search processing routine shown in FIG. 4.
First, in step S301, the search first conversion unit 31 converts the image 5 to be matched, by the first convolutional neural network using the parameters stored in the storage unit 39, to a resolution that is higher than that of the first images and corresponds to the resolution of the second images belonging to the second image set 4.
Next, in step S302, the search second conversion unit 32 converts each third image belonging to the third image set 6 of reference images, by the second convolutional neural network using the parameters stored in the storage unit 39, to a resolution corresponding to that of the first images belonging to the first image set 3.
In step S303, the search feature extraction unit 33 extracts feature vectors from the image 5, the converted image of the image 5 produced by the search first conversion unit 31, each third image belonging to the third image set 6, and each converted image of the third images belonging to the third image set 6.
In step S304, the matching unit 35 computes similarities using the pairs of the feature vector of the image 5 extracted by the search feature extraction unit 33 and the feature vector of each third image converted by the search second conversion unit 32, and the pairs of the feature vector of the converted image 5 and the feature vector of each third image belonging to the third image set 6, and outputs the matching result as the search result 7.
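Putting steps S301 to S304 together, one search pass could be sketched as below; G_lh and G_hl are the trained conversion networks, and rmac_descriptor, combine_descriptors, and top_n_matches are the helpers sketched earlier. All names are illustrative, not part of the original text.

```python
import numpy as np
import torch

def search(query_img, reference_imgs, G_lh, G_hl, n=10):
    """S301: convert the query to high resolution; S302: convert each reference
    to low resolution; S303: extract descriptors from the originals and the
    converted images; S304: rank references by inner-product similarity."""
    with torch.no_grad():
        q_conv = G_lh(query_img.unsqueeze(0))[0]                     # S301
        q_vec = combine_descriptors(rmac_descriptor(query_img).numpy(),
                                    rmac_descriptor(q_conv).numpy())
        ref_vecs = []
        for r in reference_imgs:                                     # S302, S303
            r_conv = G_hl(r.unsqueeze(0))[0]
            ref_vecs.append(combine_descriptors(rmac_descriptor(r).numpy(),
                                                rmac_descriptor(r_conv).numpy()))
    return top_n_matches(q_vec, np.stack(ref_vecs), n)               # S304
```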
As described above, the search device according to the embodiment of the present invention can retrieve an object appearing in an image with high accuracy by converting the image with the first convolutional neural network using the learned parameters, converting each third image serving as a reference image with the second convolutional neural network using the learned parameters, extracting feature vectors from the converted images, computing similarities from pairs of the feature vectors, and outputting the matching result.
The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.
Although the embodiment of the present invention has been described above with reference to the drawings, the embodiment is merely an illustration of the present invention, and the present invention is not limited to it. Components may therefore be added, omitted, replaced, or otherwise modified without departing from the technical idea and scope of the present invention.
For example, a database of the reference images (the third image set 6) may be built by performing the image conversion by the second conversion unit 22 and the feature vector extraction by the feature extraction unit 23 in advance and storing the results; the query image (the image 5) is then received from outside at query time and processed, and the matching unit 35 obtains the feature vectors of the reference images from the database and computes the similarities.
It is also not always necessary to convert both the image 5 and the third image set 6. For example, matching may be performed between the feature vectors extracted from each converted image of the third images, obtained by converting the third image set 6 with the second conversion unit 22, and the feature vector extracted from the image 5 without conversion. In this case, the search device may be configured to include a search second conversion unit, a search feature extraction unit, and a matching unit. This configuration shortens the time required from receiving the image 5 to obtaining the search result 7.
Alternatively, for example, matching may be performed between the feature vectors extracted from each third image belonging to the third image set 6 and the feature vector of the converted image obtained by converting the image 5. In this case, the search device may be configured to include a search first conversion unit, a search feature extraction unit, and a matching unit. This configuration likewise shortens the time required from receiving the image 5 to obtaining the search result 7.
3 First image set
4 Second image set
5 Image
6 Third image set
7 Search result
10 Learning device
11 Search device
21 First conversion unit
22 Second conversion unit
23 Feature extraction unit
24 Parameter update unit
29, 39 Storage unit
31 Search first conversion unit
32 Search second conversion unit
33 Search feature extraction unit
35 Matching unit
Claims (8)

1. A learning device comprising: a first conversion unit that converts each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; a second conversion unit that converts each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit that extracts feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images produced by the first conversion unit, and each converted image of the second images produced by the second conversion unit; and a parameter update unit that updates the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the error between the feature vector of each first image belonging to the first image set and the feature vector of each converted image of the second images produced by the second conversion unit, and the error between the feature vector of each second image belonging to the second image set and the feature vector of each converted image of the first images produced by the first conversion unit, the feature vectors being those extracted by the feature extraction unit.

2. The learning device according to claim 1, wherein the first conversion unit further converts each converted image of the second images produced by the second conversion unit into a reconverted image whose resolution corresponds to that of the second images belonging to the second image set; the second conversion unit further converts each converted image of the first images produced by the first conversion unit into a reconverted image whose resolution corresponds to that of the first images belonging to the first image set; and the parameter update unit updates the parameters of the first convolutional neural network and the second convolutional neural network by further using the error between each reconverted image of the first images further converted by the second conversion unit and the corresponding first image belonging to the first image set, and the error between each reconverted image of the second images further converted by the first conversion unit and the corresponding second image belonging to the second image set.

3. The learning device according to claim 1 or 2, wherein the parameter update unit further updates the parameters of the first convolutional neural network and a discriminator neural network, which identifies whether each converted image of the first images produced by the first conversion unit is a first image belonging to the first image set or a second image belonging to the second image set, using a loss function expressing that the first convolutional neural network and the discriminator neural network compete with each other, and updates the parameters of the second convolutional neural network and the discriminator neural network, which identifies whether each converted image of the second images produced by the second conversion unit is a first image belonging to the first image set or a second image belonging to the second image set, using a loss function expressing that the second convolutional neural network and the discriminator neural network compete with each other.

4. A search device comprising: a search first conversion unit that converts an arbitrary image, using the first convolutional neural network whose parameters have been learned by the learning device according to any one of claims 1 to 3, to a resolution corresponding to the resolution of the second images belonging to the second image set; a search feature extraction unit that extracts feature vectors from the converted image of the arbitrary image and from each third image of a third image set made up of third images; and a matching unit that matches the arbitrary image against the third images on the basis of similarities computed from pairs of the feature vector of the converted image extracted by the search feature extraction unit and the feature vector of each third image of the third image set, and outputs a matching result.

5. A search device comprising: a search second conversion unit that converts each third image of a third image set made up of third images, using the second convolutional neural network whose parameters have been learned by the learning device according to any one of claims 1 to 3, to a resolution corresponding to the resolution of the first images belonging to the first image set; a search feature extraction unit that extracts feature vectors from an arbitrary image and from each converted image of the third images of the third image set; and a matching unit that matches the arbitrary image against the third images on the basis of similarities computed from pairs of the feature vector of the arbitrary image extracted by the search feature extraction unit and the feature vector of each converted image of the third images, and outputs a matching result.

6. The search device according to claim 5, further comprising a search first conversion unit that converts the arbitrary image, using the first convolutional neural network, to a resolution corresponding to the resolution of the second images belonging to the second image set, wherein the search feature extraction unit further extracts feature vectors from the converted image of the arbitrary image and from each third image of the third image set, and the matching unit matches the arbitrary image against the third images, and outputs a matching result, on the basis of the similarities computed from pairs of the feature vector of the arbitrary image and the feature vector of each converted image of the third images, and the similarities computed from pairs of the feature vector of the converted image of the arbitrary image and the feature vector of each third image of the third image set.

7. A learning method executed by: a first conversion unit converting each first image belonging to a first image set made up of first images of a predetermined resolution, by a first convolutional neural network, to a resolution that is higher than that of the first image and corresponds to the resolution of second images belonging to a second image set; a second conversion unit converting each second image belonging to the second image set, by a second convolutional neural network, to a resolution corresponding to the resolution of the first images belonging to the first image set; a feature extraction unit extracting feature vectors from each first image belonging to the first image set, each second image belonging to the second image set, each converted image of the first images produced by the first conversion unit, and each converted image of the second images produced by the second conversion unit; and a parameter update unit updating the parameters of the first convolutional neural network and the second convolutional neural network on the basis of the error between the feature vector of each first image belonging to the first image set and the feature vector of each converted image of the second images produced by the second conversion unit, and the error between the feature vector of each second image belonging to the second image set and the feature vector of each converted image of the first images produced by the first conversion unit, the feature vectors being those extracted by the feature extraction unit.

8. A program for causing a computer to function as each unit of the learning device according to any one of claims 1 to 3 or of the search device according to any one of claims 4 to 6.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2018106235A JP2019211912A (en) | 2018-06-01 | 2018-06-01 | Learning device, search device, method, and program |
| JP2018-106235 | 2018-06-01 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2019230665A1 true WO2019230665A1 (en) | 2019-12-05 |
Family
ID=68698142
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2019/020947 Ceased WO2019230665A1 (en) | 2018-06-01 | 2019-05-27 | Learning device, search device, method, and program |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP2019211912A (en) |
| WO (1) | WO2019230665A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025041222A1 (en) * | 2023-08-21 | 2025-02-27 | Nec Corporation | Image matching apparatus, image matching method, training apparatus, training method, and non-transitory computer-readable medium |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3901902B1 (en) * | 2020-04-20 | 2025-05-07 | FEI Company | Method implemented by a data processing apparatus, and charged particle beam device for inspecting a specimen using such a method |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11212990A (en) * | 1998-01-26 | 1999-08-06 | Toray Ind Inc | Image retrieving device, image retrieving display method and production of product |
| JP2000155833A (en) * | 1998-11-19 | 2000-06-06 | Matsushita Electric Ind Co Ltd | Image recognition device |
| JP6320649B1 (en) * | 2017-03-31 | 2018-05-09 | 三菱電機株式会社 | Machine learning device and image recognition device |
- 2018-06-01: JP JP2018106235A patent/JP2019211912A/en active Pending
- 2019-05-27: WO PCT/JP2019/020947 patent/WO2019230665A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JP2019211912A (en) | 2019-12-12 |
Similar Documents
| Publication | Title |
|---|---|
| US11928790B2 | Object recognition device, object recognition learning device, method, and program |
| JP5926291B2 | Method and apparatus for identifying similar images |
| US8135239B2 | Display control apparatus, display control method, computer program, and recording medium |
| JP6431302B2 | Image processing apparatus, image processing method, and program |
| US20170060867A1 | Video and image match searching |
| US20070195344A1 | System, apparatus, method, program and recording medium for processing image |
| JP6211407B2 | Image search system, image search device, search server device, image search method, and image search program |
| JP2023520625A | IMAGE FEATURE MATCHING METHOD AND RELATED DEVICE, DEVICE AND STORAGE MEDIUM |
| US11714921B2 | Image processing method with ash code on local feature vectors, image processing device and storage medium |
| JP7192990B2 | Learning device, retrieval device, learning method, retrieval method, learning program, and retrieval program |
| WO2019230666A1 | Feature amount extraction device, method, and program |
| KR101917369B1 | Method and apparatus for retrieving image using convolution neural network |
| US8385656B2 | Image processing apparatus, image processing method and program |
| CN113920415A | Scene recognition method, device, terminal and medium |
| CN111177436B | Face feature retrieval method, device and equipment |
| WO2019230665A1 | Learning device, search device, method, and program |
| CN113128278B | Image recognition method and device |
| CN116467463A | Multimodal Knowledge Graph Representation Learning System and Products Based on Subgraph Learning |
| WO2017010514A1 | Image retrieval device and method, photograph time estimation device and method, iterative structure extraction device and method, and program |
| CN115272768A | Content identification method, device, equipment, storage medium and computer program product |
| CN114297154A | Vehicle data processing method, terminal and computer storage medium |
| JP6482505B2 | Verification apparatus, method, and program |
| CN105488099A | Vehicle retrieval method based on similarity learning |
| CN117496187A | A light field image saliency detection method |
| JP2018194956A | Image recognition apparatus, method, and program |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19811560; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 19811560; Country of ref document: EP; Kind code of ref document: A1 |