WO2025033195A1 - Detection device, learning device, detection method, learning method, and recording medium
- Publication number
- WO2025033195A1 (Application PCT/JP2024/026593)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- learning
- image
- detection
- images
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S13/00—Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
- G01S13/02—Systems using reflection of radio waves, e.g. primary radar systems; Analogous systems
- G01S13/04—Systems determining presence of a target
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S13/00—Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
- G01S13/86—Combinations of radar systems with non-radar systems, e.g. sonar, direction finder
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
Definitions
- This disclosure relates to a detection device, a learning device, a detection method, and a program.
- Patent Document 1 discloses technology for detecting objects that strongly reflect millimeter-wave radar, such as metals, based on the reflection intensity of the millimeter-wave radar.
- However, the technology of Patent Document 1 detects all objects that strongly reflect millimeter-wave radar. With this technology, it is difficult to accurately detect only some of those objects.
- One example of the objective of the present disclosure is to provide a detection device, a learning device, a detection method, a learning method, and a program for detecting a desired object with high accuracy.
- A detection device has a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image, which captures a target area, with a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
- A detection method is provided in which one or more computers detect a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
- A program causes a computer to function as a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image, which captures a target area, with a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
- A learning device has a learning means for learning a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
- A learning method is provided for learning a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
- A program causes a computer to function as a learning means for learning a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
- According to the present disclosure, a detection device, a learning device, a detection method, a learning method, and a program for detecting a desired object with high accuracy are realized.
- FIG. 1 is a diagram illustrating an example of a functional block diagram of a learning device according to the present disclosure.
- FIG. 2 is a flowchart illustrating an example of a process flow of a learning device according to the present disclosure.
- FIG. 3 is a diagram showing an example of a functional block diagram of a detection device according to the present disclosure.
- FIG. 4 is a flowchart illustrating an example of a process flow of a detection device according to the present disclosure.
- FIG. 5 is a diagram for explaining a process of a comparative example.
- FIG. 6 is a diagram for explaining processing of a learning device according to the present disclosure.
- FIG. 7 is a diagram illustrating an example of a hardware configuration of a learning device and a detection device according to the present disclosure.
- FIG. 8 is a diagram illustrating another example of a functional block diagram of a learning device according to the present disclosure.
- FIG. 9 is a diagram for explaining an example of processing executed by a learning device and a detection device according to the present disclosure.
- FIG. 10 is a diagram for explaining another example of the processing executed by the learning device and the detection device according to the present disclosure.
- FIG. 11 is a flowchart showing another example of the processing flow of the learning device according to the present disclosure.
- FIG. 12 is a diagram showing another example of a functional block diagram of a detection device according to the present disclosure.
- FIG. 13 is a flowchart showing another example of the processing flow of the detection device according to the present disclosure.
- FIG. 1 is a functional block diagram showing an overview of the learning device 20.
FIG. 2 is a flowchart showing an example of the flow of processing executed by the learning device 20.
- The learning device 20 has a learning unit 21. This functional unit executes the process in FIG. 2.
- The learning unit 21 trains the learning model using multiple learning composite images, each a combination of a "learning captured image" and a "learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves" (S10).
- The learning device 20, which uses such learning composite images to train a learning model, can generate a learning model that has learned both the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves."
- This learning model can perform object detection and object recognition based on both the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves." By learning both of these features, rather than just one, the accuracy of object detection and object recognition by the learning model is improved.
- The learning device 20 also integrates the "features of the captured image" and the "features of the reflected wave information indicating the reflected electromagnetic waves" by a unique method of generating a composite image, and learns them together. Rather than learning these two kinds of features individually, learning them together as a composite image integrated by this method is expected to have a synergistic effect and improve the accuracy of object detection and object recognition.
- FIG. 3 is a functional block diagram showing an overview of the detection device 10.
- FIG. 4 is a flowchart showing an example of the flow of processing executed by the detection device 10.
- The detection device 10 has a detection unit 11. This functional unit executes the process in FIG. 4.
- The detection unit 11 detects the detection target based on a processing target composite image, which is a combination of a "processing target captured image of the target area" and a "processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of the electromagnetic waves irradiated onto the target area" (S20).
- The detection device 10, which detects the detection target using such a composite image, can detect the detection target based on both the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves." By using both of these features, rather than just one, the detection accuracy of the detection target is improved.
- The detection device 10 integrates the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves" by a unique method of generating a composite image, and detects the detection target based on that composite image. Rather than processing the two kinds of features separately, integrating them and processing them together as a composite image is expected to have a synergistic effect and improve detection accuracy.
- The learning device 20 of the third embodiment is a specific embodiment of the configuration of the learning device 20 of the first embodiment.
- Object recognition models that use language models to recognize detection targets, and object detection models that use language models to detect detection targets, have become known.
- These object recognition models and object detection models can represent images and language in a joint embedding space.
- The object recognition model and the object detection model are generated by learning the relationship between the results of object recognition/detection obtained by a technology such as a neural network and language related to the object (a description or expression of the object). Based on these models, the object indicated by the search criteria (text) can be recognized/detected in the captured image.
- This technology is disclosed in, for example, the following documents: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021; Li, Liunian Harold, et al. "Grounded language-image pre-training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- The object recognition model and object detection model disclosed in the above-mentioned documents are generated by learning the correlation between language related to objects (descriptions and expressions of objects) and captured images, as shown in FIG. 5.
- In contrast, the learning device 20 generates a learning model by learning the correlation between language relating to an object (a description or expression of the object) and a composite image, as shown in FIG. 6.
- The composite image is an image generated by synthesizing a "captured image" and a "two-dimensional image based on reflected wave information indicating reflected electromagnetic waves."
- In this respect, the learning device 20 differs from the object recognition model that recognizes a detection target using a language model and the object detection model that detects a detection target using a language model disclosed in the above-mentioned documents.
- The configuration of the learning device 20 is explained in detail below. The hardware configuration of the learning device 20 is realized by any combination of hardware and software.
- The software includes programs stored in the device before it is shipped, as well as programs downloaded from recording media such as CDs (Compact Discs) or from servers on the Internet.
- FIG. 7 is a block diagram illustrating an example of the hardware configuration of the learning device 20.
- The learning device 20 has a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A.
- The peripheral circuit 4A includes various modules.
- The learning device 20 does not have to have the peripheral circuit 4A.
- The learning device 20 may be composed of multiple devices that are physically and/or logically separated. In this case, each of the multiple devices can have the above hardware configuration.
- The bus 5A is a data transmission path through which the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A send and receive data to and from one another.
- The processor 1A is an arithmetic processing device such as a CPU or a GPU (Graphics Processing Unit).
- The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory).
- The input/output interface 3A includes interfaces for acquiring information from an input device, an external device, an external server, an external sensor, a camera, etc., and interfaces for outputting information to an output device, an external device, an external server, etc.
- The input/output interface 3A also includes an interface for connecting to a communication network such as the Internet.
- Examples of input devices include a keyboard, a mouse, a microphone, a physical button, and a touch panel.
- Examples of output devices include a display, a speaker, a printer, and a mailer.
- The processor 1A can issue commands to each module and perform calculations based on the results of their calculations.
FIG. 8 shows an example of a functional block diagram of the learning device 20.
- The learning device 20 has a learning unit 21, a learning image acquisition unit 22, a learning reflected wave processing unit 23, a learning synthesis unit 24, and a language input unit 25.
- The learning image acquisition unit 22 acquires learning captured images.
- Learning captured images are captured images used as learning data when the learning device 20 trains a learning model.
- A "captured image" is an image generated by taking a picture with a camera.
- A camera detects visible light and converts it into an image. Note that a camera may also detect other types of light, such as infrared or ultraviolet light, and convert it into an image.
- A captured image may be a still image or a video.
- A captured image may be provided with information indicating the date and time of capture and the location where the image was taken. If the captured image is a video, information identifying the date and time of capture and the location may be provided for each frame image.
- For example, the camera is mounted on a measuring device.
- The measuring device is a device that can be moved.
- The measuring device may be, for example, an air vehicle such as a drone, or a device that moves on land.
- The measuring device can be moved automatically or by remote control.
- For example, the measuring device can move automatically along a route registered in advance. As the measuring device moves, it photographs a specified data collection area with the camera.
- The data collection area is an area selected for the purpose of collecting learning data, detecting a detection target, etc.
- The data collection area may be a part of the ground, a part of a building, or something else.
- Alternatively, the camera may be carried by a worker, who moves around and photographs the data collection area with the camera.
- The captured images generated by the camera are input to the learning device 20 by any means.
- For example, the camera and the learning device 20 may be configured to be able to communicate with each other.
- The camera may then transmit the captured images it generates to the learning device 20 via this communication means.
- The transmission of the captured images from the camera to the learning device 20 may be performed by real-time processing or batch processing.
- Alternatively, the captured images generated by the camera may be stored in any storage device.
- The storage device may be provided in the camera, or in an external device configured to be able to communicate with the camera.
- The captured images stored in the storage device may then be input to the learning device 20 at any time and by any means.
- The learning reflected wave processing unit 23 generates a learning two-dimensional image based on reflected wave information indicating reflected electromagnetic waves.
- The "electromagnetic wave" is, for example, a millimeter wave; an example of its frequency is 0.3 GHz or more and 300 GHz or less.
- However, the band of the electromagnetic wave is not limited to millimeter waves.
- For example, the electromagnetic wave may be near infrared rays, far infrared rays, etc.
- Reflected waves are electromagnetic waves that are reflected by objects in the irradiated area. If an object that reflects electromagnetic waves is present in that area, the strength of the reflected waves will be high. Objects that reflect electromagnetic waves are often made primarily of metal.
- The irradiation of electromagnetic waves and the reception of reflected waves are performed by an electromagnetic wave transmitting/receiving device.
- The electromagnetic wave transmitting/receiving device includes an electromagnetic wave transmitting unit that transmits electromagnetic waves, and an electromagnetic wave receiving unit that receives reflected waves.
- The electromagnetic wave transmitting/receiving device may include multiple electromagnetic wave receiving units, for example two. These receiving units are spaced apart from each other and receive reflected waves of the electromagnetic waves irradiated by the same electromagnetic wave transmitting unit. This increases the accuracy of detecting the position of an object.
- The transmission method used by the electromagnetic wave transmitting unit is, for example, one of FMCW (Frequency Modulated Continuous Wave), pulse, CW (Continuous Wave) Doppler, two-frequency CW, and pulse compression, but other methods may be used.
- For example, the measuring device equipped with the camera described above further includes an electromagnetic wave transmitting/receiving device. As the measuring device moves, it photographs the data collection area with the camera, and uses the electromagnetic wave transmitting/receiving device to irradiate the data collection area with electromagnetic waves and receive the reflected waves.
- Alternatively, the electromagnetic wave transmitting/receiving device may be carried by a worker. While moving around carrying a camera and the electromagnetic wave transmitting/receiving device, the worker photographs the data collection area with the camera, irradiates the data collection area with electromagnetic waves using the device, and receives the reflected waves.
- Reflected wave information is generated by the electromagnetic wave transmitting/receiving device. More specifically, the reflected wave information is generated based on the results of reception of the reflected waves by the electromagnetic wave receiving unit.
- The reflected wave information includes, for example, time-series information on the strength of the reflected waves. This time-series information includes combinations of the date and time a reflected wave was received and its strength at that time. If multiple electromagnetic wave receiving units are provided, the reflected wave information is generated separately for each receiving unit.
- The electromagnetic wave transmitting/receiving device may also generate location information indicating its own location.
- The location information is indicated by, for example, latitude and longitude.
- The location information may include altitude in addition to latitude and longitude. It may be generated using, for example, GPS, or by other methods such as SLAM (Simultaneous Localization and Mapping).
- The electromagnetic wave transmitting/receiving device may add its location information at the time a reflected wave was received to the above-mentioned reflected wave information.
- Alternatively, the location information may be kept separate from the reflected wave information.
- In this case, the location information is time-series information on the location of the electromagnetic wave transmitting/receiving device.
- This time-series information includes combinations of a date and time and the location of the electromagnetic wave transmitting/receiving device at that date and time.
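- As a concrete illustration, the reflected wave information and location information described above could be organized as time-series records like the following Python sketch. The field names and structure are hypothetical; the disclosure only specifies that reception date/time, intensity, the receiving unit, and (optionally) the transceiver's location are recorded.

```python
# Hypothetical record structure for the reflected wave information described above.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class ReflectedWaveSample:
    received_at: datetime              # date and time the reflected wave was received
    intensity: float                   # strength of the reflected wave at that time
    receiver_id: int = 0               # which electromagnetic wave receiving unit measured it
    latitude: Optional[float] = None   # optional transceiver location at reception time
    longitude: Optional[float] = None
    altitude: Optional[float] = None

@dataclass
class ReflectedWaveInfo:
    # Time-series information: one sample per reception, per receiving unit.
    samples: list = field(default_factory=list)
```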
- The reflected wave information generated by the electromagnetic wave transmitting/receiving device is input to the learning device 20 by any means.
- For example, the electromagnetic wave transmitting/receiving device and the learning device 20 may be configured to be able to communicate with each other.
- The electromagnetic wave transmitting/receiving device may then transmit the generated reflected wave information to the learning device 20 via this communication means.
- The transmission of the reflected wave information from the electromagnetic wave transmitting/receiving device to the learning device 20 may be performed by real-time processing or by batch processing.
- Alternatively, the reflected wave information generated by the electromagnetic wave transmitting/receiving device may be stored in any storage device.
- The storage device may be provided within the electromagnetic wave transmitting/receiving device, or within an external device configured to be able to communicate with it.
- The reflected wave information stored in the storage device may then be input to the learning device 20 at any time and by any means.
- Learning two-dimensional images are two-dimensional images generated for learning purposes.
- A "two-dimensional image" is an image generated based on the reflected wave information described above. The process of generating a two-dimensional image from reflected wave information is explained below.
- First, the learning reflected wave processing unit 23 generates three-dimensional information by processing the reflected wave information.
- The reflected wave information includes a time-series signal of the reflected wave intensity and the position of the electromagnetic wave transmitting/receiving device.
- The learning reflected wave processing unit 23 calculates the distance from the electromagnetic wave receiving unit to the reflection point that is the origin of a reflected wave by performing an FFT (Fast Fourier Transform) multiple times on the reflected waves that constitute this time-series signal. If there are multiple electromagnetic wave receiving units, the learning reflected wave processing unit 23 performs this processing for each receiving unit.
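- As one possible illustration of this distance calculation, the following sketch assumes an FMCW system, where an FFT of the beat signal yields beat frequencies that map linearly to reflection-point distances. The radar parameters are placeholder assumptions, and only a single range FFT is shown, whereas the disclosure describes performing the FFT multiple times.

```python
# A minimal range-FFT sketch for an FMCW beat signal (parameters are assumptions).
import numpy as np

C = 3.0e8            # speed of light [m/s]
BANDWIDTH = 1.0e9    # swept bandwidth B [Hz]
CHIRP_TIME = 1.0e-3  # chirp duration T [s]
FS = 1.0e6           # ADC sampling rate [Hz]

def range_profile(beat_signal: np.ndarray):
    """Return (distances [m], intensities) for one chirp's beat signal."""
    n = len(beat_signal)
    spectrum = np.fft.rfft(beat_signal * np.hanning(n))   # windowed FFT
    beat_freqs = np.fft.rfftfreq(n, d=1.0 / FS)
    # For FMCW, beat frequency maps linearly to distance: R = c * f_beat * T / (2 * B)
    distances = C * beat_freqs * CHIRP_TIME / (2.0 * BANDWIDTH)
    return distances, np.abs(spectrum)
```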
- Next, the learning reflected wave processing unit 23 calculates an estimate of the reflected wave intensity for at least one first point included in the three-dimensional space corresponding to the data collection area, by integrating multiple distances based on reflected waves measured by different electromagnetic wave receiving units at the same timing.
- The learning reflected wave processing unit 23 then performs this processing on the reflected waves measured at multiple timings to calculate an estimate of the reflected wave intensity for each of multiple first points, and treats these estimates as three-dimensional information.
- This estimate can be regarded as a value indicating the possibility that an object that reflects electromagnetic waves is present at the first point.
- Hereinafter, this value is referred to as the first value.
- The method of generating the three-dimensional information, for example the method of generating the first value, is not limited to this example.
- Next, the learning reflected wave processing unit 23 generates two-dimensional information by projecting the three-dimensional information onto a specified plane.
- Hereinafter, this specified plane is referred to as the projection plane.
- It is preferable that the angle the projection plane makes with the surface of the data collection area be 10° or less; in other words, it is preferable that the projection plane be substantially parallel to the surface of the data collection area.
- For each second point on the projection plane, the learning reflected wave processing unit 23 identifies a plurality of first points corresponding to that second point. For example, the learning reflected wave processing unit 23 determines the first points that overlap with the second point when viewed from a direction perpendicular to the projection plane as the first points corresponding to that second point. Next, the learning reflected wave processing unit 23 identifies the first value of each of the identified first points, and uses these first values to generate a second value indicating the possibility that an object that reflects electromagnetic waves is present at the second point.
- For example, the second value can be a statistical value (maximum, minimum, average, mode, median, etc.) of the multiple first values.
- Alternatively, among the first points whose first values exceed a reference value, the first value of the first point closest to the surface of the data collection area may be used as the second value.
- The learning reflected wave processing unit 23 treats the second values determined for each second point as two-dimensional information. From this two-dimensional information, for example, black-and-white image data (a two-dimensional image) can be generated. Note that the two-dimensional image may represent the second value of each second point in black and white, in other colors, or by other means.
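- The projection from three-dimensional information to two-dimensional information can be sketched as follows, assuming the first values are held on a voxel grid whose third axis is perpendicular to the projection plane, and using the maximum as the statistical value for the second values.

```python
# Projecting first values (3D) onto the projection plane to obtain second values (2D).
import numpy as np

def project_to_2d(first_values: np.ndarray) -> np.ndarray:
    """first_values: (H, W, D) grid of reflected-wave intensity estimates,
    with axis 2 perpendicular to the projection plane. Returns (H, W) second values."""
    # Each second point aggregates the first points that overlap it when viewed
    # from the direction perpendicular to the projection plane.
    return first_values.max(axis=2)

def to_grayscale(second_values: np.ndarray) -> np.ndarray:
    """Render the second values as 8-bit black-and-white image data."""
    v = second_values.astype(np.float64)
    v -= v.min()
    if v.max() > 0:
        v /= v.max()
    return (v * 255).astype(np.uint8)
```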
- The learning device 20 may be configured to train the learning model taking into account at least one of the geological information of the data collection area and the weather information at the time the reflected waves were generated.
- For example, the geology of a particular part of the data collection area may be more conducive to generating reflected waves.
- Also, water or snow may accumulate on the ground surface of the data collection area and affect the reflected waves.
- The learning reflected wave processing unit 23 can reflect these effects in the generation of the two-dimensional image.
- For example, the learning reflected wave processing unit 23 may generate the three-dimensional information or the two-dimensional information by multiplying a first value or a second value by a parameter corresponding to the geology of the data collection area. This parameter is set in advance.
- The learning reflected wave processing unit 23 may also generate the three-dimensional information or the two-dimensional information by multiplying a first value or a second value by a parameter corresponding to the weather at the time of measurement. This parameter is also set in advance.
- The geological information and weather information may be input to the learning device 20 by, for example, a user of the learning device 20, or may be obtained by the learning device 20 from a database in which they are stored.
- The learning synthesis unit 24 synthesizes a learning captured image and a learning two-dimensional image to generate a learning composite image. It is preferable that the learning captured image and the learning two-dimensional image be of similar size.
- For example, the learning synthesis unit 24 blends (synthesizes) the learning captured image and the learning two-dimensional image at a predetermined blend ratio to generate a learning composite image.
- The learning synthesis unit 24 may generate one learning composite image from one set of a learning captured image and a learning two-dimensional image.
- The blend ratio may be determined in advance.
- Alternatively, the learning synthesis unit 24 may generate, from one set of a learning captured image and a learning two-dimensional image, multiple learning composite images with different blend ratios. This is preferable because a large number of learning composite images can be generated from one set of images, and because learning composite images with various blend ratios can be learned.
- The learning synthesis unit 24 may also generate a learning composite image in which the blend ratio differs for each partial region in the image.
- For example, the learning synthesis unit 24 may vary the blend ratio for each object region detected using widely known object detection technology, semantic segmentation, or the like. This is also preferable because it makes it possible to generate a large number of learning composite images from one set of a learning captured image and a learning two-dimensional image, and to learn from learning composite images with various blend ratios; a blending sketch follows below.
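- As referenced above, a minimal blending sketch: a whole-image alpha blend at a given blend ratio, plus a per-region variant driven by a ratio map (for example, built from detected object regions). Both inputs are assumed to be aligned arrays of the same shape.

```python
# Alpha-blending a learning captured image with a learning two-dimensional image.
import numpy as np

def blend(captured: np.ndarray, two_dim: np.ndarray, ratio: float) -> np.ndarray:
    """ratio: weight of the two-dimensional image, 0.0-1.0."""
    return ((1.0 - ratio) * captured + ratio * two_dim).astype(captured.dtype)

def blend_per_region(captured: np.ndarray, two_dim: np.ndarray,
                     ratio_map: np.ndarray) -> np.ndarray:
    """ratio_map: per-pixel blend ratio, e.g. derived from detected object regions."""
    if captured.ndim == 3 and ratio_map.ndim == 2:
        ratio_map = ratio_map[..., None]   # broadcast the ratio over color channels
    return ((1.0 - ratio_map) * captured + ratio_map * two_dim).astype(captured.dtype)

# Multiple learning composite images from one image pair, at different blend ratios:
# composites = [blend(captured, two_dim, r) for r in (0.25, 0.5, 0.75)]
```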
- The learning synthesis unit 24 synthesizes a learning captured image and a learning two-dimensional image containing information about the same object to generate a learning composite image. To achieve this, the learning synthesis unit 24 may synthesize a learning captured image and a learning two-dimensional image that were photographed/measured at the same timing, at the same position, or at the same timing and the same position.
- The "same timing" may be a perfect match, or may be a concept that allows for a time difference of up to a few seconds.
- The "same position" may be a perfect match, or may be a concept that allows for a position difference of up to a few centimeters to a few meters; a pairing sketch under these tolerances follows below.
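- As referenced above, a minimal sketch of checking whether a captured image and a two-dimensional image were photographed/measured at the "same timing" and "same position" under such tolerances. The thresholds and the flat-earth distance approximation are illustrative assumptions.

```python
# Pairing a captured image with a two-dimensional image by timestamp and position.
from datetime import datetime

def is_same_pair(img_time: datetime, img_latlon: tuple,
                 wave_time: datetime, wave_latlon: tuple,
                 max_dt_sec: float = 3.0, max_dist_m: float = 5.0) -> bool:
    dt = abs((img_time - wave_time).total_seconds())
    # Rough planar distance from latitude/longitude differences (small-area assumption).
    dlat_m = (img_latlon[0] - wave_latlon[0]) * 111_000.0  # meters per degree latitude
    dlon_m = (img_latlon[1] - wave_latlon[1]) * 91_000.0   # rough mid-latitude value
    dist_m = (dlat_m ** 2 + dlon_m ** 2) ** 0.5
    return dt <= max_dt_sec and dist_m <= max_dist_m
```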
- The learning synthesis unit 24 may synthesize the images after performing a process to approximately match the position of an object in the learning captured image with the position of the same object in the learning two-dimensional image. However, this process is not essential. The learning captured image and the learning two-dimensional image to be synthesized must contain information of the same object, but it is not essential that the positions of the object in the two images match. The inventors have confirmed that sufficient learning effects and detection results can be obtained even if the positions of the object in the images do not match. However, if the positions match, more favorable learning effects and detection results are expected than if they do not. The learning synthesis unit 24 may, for example, perform the following misalignment correction process 1 or 2.
- (Misalignment correction process 1) The misalignment is corrected based on user input.
- For example, the learning synthesis unit 24 simultaneously displays the captured image and the two-dimensional image on the screen, as shown in FIG. 10.
- The user shifts the position of at least one of the captured image and the two-dimensional image so that the positions of an object shown in each image overlap.
- The user can determine which object in the captured image corresponds to a group of second points in the two-dimensional image based on the shape of the object shown in the captured image and the shape of that group of second points.
- (Misalignment correction process 2) The learning synthesis unit 24 may transform the captured image into the same coordinate system as the two-dimensional image based on the positions of the camera and the electromagnetic wave transmitting/receiving device. When the camera and the electromagnetic wave transmitting/receiving device are mounted on the same measuring device, the transformation can be realized based on the distance, orientation, etc. between the camera and the electromagnetic wave transmitting/receiving device. A transformation rule for transforming the captured image into a predetermined coordinate system based on this position information is prepared in advance, and the learning synthesis unit 24 executes the transformation of the captured image using this rule. Examples of the image transformation method include affine transformation and homography transformation, but the method is not limited to these.
- After this transformation, a predetermined position (e.g., the center) in the captured image and the same predetermined position (e.g., the center) in the two-dimensional image indicate information of the same position in the data collection area.
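- A minimal sketch of misalignment correction process 2 using OpenCV, assuming the transformation rule is expressed as four corresponding points derived in advance from the camera/transceiver geometry (shown here as a homography; an affine transform could be used the same way).

```python
# Warping the captured image into the two-dimensional image's coordinate system.
import cv2
import numpy as np

def align_captured_image(captured: np.ndarray,
                         src_pts: np.ndarray,   # (4, 2) float32, points in captured image
                         dst_pts: np.ndarray,   # (4, 2) float32, corresponding 2D-image points
                         out_size: tuple) -> np.ndarray:
    """out_size: (width, height) of the two-dimensional image."""
    H, _ = cv2.findHomography(src_pts, dst_pts)
    return cv2.warpPerspective(captured, H, out_size)
```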
- The language input unit 25 acquires text that describes an object appearing in a learning composite image.
- The text may be a word, a sentence, a prompt, etc.
- The text describing an object may include, for example, the type of the object or its external characteristics (color, size, shape, etc.). An example of such text is "blue can."
- For example, a user may input text describing an object appearing in the learning composite image into the learning device 20.
- The user may visually check the learning captured image on which the learning composite image is based and identify the object appearing in the learning composite image.
- The user may also identify the object appearing in the learning composite image by other means. For example, when learning data is generated by placing a specific object in the data collection area and generating the captured image and reflected wave information, the user can identify the object placed in the data collection area by any other means, without visually checking the learning captured image.
- Alternatively, the recognition result obtained by inputting the learning captured image on which the learning composite image is based into a recognition model (such as a classifier) generated in advance by machine learning or the like may be input to the learning device 20 as the above text.
- The learning unit 21 trains a learning model using multiple learning composite images and the text describing the objects appearing in each learning composite image.
- The learning model is an object recognition model or an object detection model that uses a language model.
- An object recognition model or object detection model that uses a language model is a model capable of representing images and language in a joint embedding space.
- The object recognition model and object detection model are generated by learning the relationship between the results of object recognition/detection obtained using technology such as neural networks and the language related to the objects (descriptions and expressions of the objects). Based on these models, objects indicated by search conditions (text) can be recognized/detected within a captured image.
- Specifically, the learning unit 21 learns the correlation between an object appearing in a learning composite image and text expressing that object.
- The learning unit 21 can generate the learning model of this embodiment by utilizing a widely known technology that generates an object recognition model or an object detection model by learning, based on captured images, the correlation between "an object appearing in a captured image" and "text expressing that object."
- That is, the learning unit 21 can train the learning model by replacing the "captured image" with the "learning composite image" in this widely known technology.
- For example, the learning unit 21 can divide one or more texts input via the language input unit 25 into tokens and extract language features from them.
- The language features can be extracted using any widely known technology.
- For example, the transformer described in the following document may be used to extract the language features.
- Specifically, the learning unit 21 may extract the language features using a 63-million-parameter transformer with 12 layers, a hidden dimension of 512, and 8 attention heads: Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
- The learning unit 21 can also extract image features from the learning composite image.
- The image features can be extracted using any widely known technology. For example, ResNet-50, ResNet-101, ResNet-50×4, ResNet-50×16, ResNet-50×64, ViT-B/32, ViT-B/16, ViT-L/14, etc. may be used to extract the image features.
- The image features may be extracted using, for example, the technologies described in the following documents: He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016; Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
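- As a concrete illustration of extracting language features and image features, the following sketch uses the openai/CLIP package from the Radford et al. reference cited above; that this exact package is used is an assumption, and the file name is a placeholder.

```python
# Extracting language features and image features with a CLIP-style model.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

text_tokens = clip.tokenize(["blue can"]).to(device)                     # tokenized text
image = preprocess(Image.open("composite.png")).unsqueeze(0).to(device)  # composite image

with torch.no_grad():
    language_features = model.encode_text(text_tokens)   # e.g. shape (1, 512)
    image_features = model.encode_image(image)            # e.g. shape (1, 512)
```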
- The learning unit 21 can adjust at least one of the language features, the learning composite image, and the image features to adapt them to a learning model generated/adjusted for captured images.
- For example, the learning unit 21 transforms the language features (feature vectors) extracted from the text input via the language input unit 25 using a nonlinear function or a linear function, for example to generate features of the same dimensions as the input features.
- As the nonlinear function, for example, a fully connected neural network (MLP) with one or more layers can be used.
- As the linear function, a function with a scale (coefficient) and an offset (additive term) as optimization parameters can be used. It is preferable that the function have a smaller number of optimization parameters than the extraction of the language features.
- The learning unit 21 also converts the learning composite image generated by the learning synthesis unit 24 using a nonlinear function or a linear function, for example to generate an image of the same dimensions as the input image.
- As the nonlinear function, for example, an MLP with one or more layers can be used.
- As the linear function, a function with the scale (coefficient) and the offset (additive term) as optimization parameters can be used.
- Optimizing the scale and offset parameters corresponds to a brightness correction. This correction makes it possible to bring statistical indices of the pixel values of the learning composite image (such as the range of pixel values) closer to those of the captured images. It is preferable that the function have a smaller number of optimization parameters than the extraction of the image features.
- In this case, the learning unit 21 extracts the image features from the adjusted learning composite image.
- The learning unit 21 also transforms the image features (feature vectors) extracted from the unadjusted or adjusted learning composite image using a nonlinear function or a linear function, for example to generate features of the same dimension as the input features.
- As the nonlinear function, for example, an MLP with one or more layers can be used.
- As the linear function, a function with a scale (coefficient) and an offset (additive term) as optimization parameters can be used. This correction makes it possible to bring statistical indices of the image features extracted from the learning composite image closer to those of the features extracted from captured images. It is preferable that the function have a smaller number of optimization parameters than the extraction of the image features.
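- The adjustment functions described above can be sketched as follows: a scale-and-offset adapter as the linear function (with few optimization parameters) and a small MLP as the nonlinear alternative, both preserving the input dimensions. The dimensions are illustrative assumptions.

```python
# Linear (scale/offset) and nonlinear (MLP) adapters for features or images.
import torch
import torch.nn as nn

class ScaleOffsetAdapter(nn.Module):
    """Linear adjustment y = scale * x + offset; only 2 * dim optimization parameters."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.offset = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * x + self.offset

def mlp_adapter(dim: int, hidden: int = 512) -> nn.Module:
    """Nonlinear adjustment producing features of the same dimension as the input."""
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
```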
- The learning unit 21 can perform the following processing using the data after the above adjustments have been made to at least one of the language features, the learning composite image, and the image features.
- The learning unit 21 calculates the correlation between the language features and the image features, for example using cosine similarity. For example, the learning unit 21 normalizes each of the language features and the image features, and calculates the inner product of the normalized vectors to obtain the correlation. In more detail, the learning unit 21 receives the image features of each of the multiple learning composite images together with the language features paired with each of them. The learning unit 21 then calculates not only the cosine similarity between each paired image feature and language feature, but also the cosine similarities between unpaired image features and language features. In other words, when N pairs of language features and image features are input, the learning unit 21 calculates N × N cosine similarities.
- The learning unit 21 calculates a loss value based on the calculated correlations, such that correct pairs have small losses and incorrect pairs have large losses.
- The learning unit 21 updates the parameters of at least one of the above-mentioned adjustments (of the language features, of the learning composite image, and of the image features) so as to reduce the loss value. Note that the learning unit 21 may further update other parameters.
- The learning unit 21 can calculate the loss value using any well-known technique. For example, the learning unit 21 arranges the N × N cosine similarities in an N × N matrix, calculates the cross-entropy error in the row direction (horizontal direction), and further calculates the cross-entropy error in the column direction (vertical direction). The learning unit 21 can then use the average of these cross-entropies as the loss value.
- The learning unit 21 can also update the above-mentioned parameters using widely known techniques.
- For example, the learning unit 21 may update the parameters using backpropagation or the like.
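- A minimal sketch of this loss computation (omitting, for simplicity, the learned temperature used in CLIP-style training): normalize the paired features, form the N × N cosine-similarity matrix, take the cross-entropy in the row and column directions with the correct pairs on the diagonal, and average the two.

```python
# Symmetric contrastive loss over N paired image/language features.
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats: torch.Tensor, lang_feats: torch.Tensor) -> torch.Tensor:
    """image_feats, lang_feats: (N, D); row i of each is a correct (paired) match."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(lang_feats, dim=-1)
    logits = img @ txt.t()                              # (N, N) cosine similarities
    targets = torch.arange(len(img), device=logits.device)
    loss_rows = F.cross_entropy(logits, targets)        # row direction (image -> text)
    loss_cols = F.cross_entropy(logits.t(), targets)    # column direction (text -> image)
    return (loss_rows + loss_cols) / 2                  # average as the loss value
```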
- First, the learning device 20 acquires text that describes an object (S30).
- The object is an object that appears in the learning captured image or the learning composite image.
- Next, the learning device 20 acquires a learning captured image (S31).
- The learning device 20 also generates a learning two-dimensional image based on reflected wave information indicating reflected electromagnetic waves (S32).
- The learning captured image and the learning two-dimensional image indicate information of the same object.
- Next, the learning device 20 synthesizes the learning captured image and the learning two-dimensional image to generate a learning composite image (S33).
- The learning device 20 then trains a learning model using the text acquired in S30 and the learning composite image generated in S33 (S34).
- The learning model is an object recognition model or an object detection model that uses a language model.
- For example, the learning device 20 trains the learning model using the language features extracted from the text and the image features extracted from the learning composite image.
- The learning device 20 can adjust at least one of the language features, the learning composite image, and the image features to adapt them to a learning model generated/adjusted for captured images. After making the adjustment, the learning device 20 calculates the correlation between the language features and the image features, and updates the parameters of the adjustment so that the loss value based on the calculated correlation becomes smaller.
- As described above, the learning device 20 trains a learning model using learning composite images obtained by synthesizing a "learning captured image" and a "learning two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves." Such a learning device 20 can generate a learning model that has learned both the "features of the captured image" and the "features of the reflected wave information indicating reflected waves of electromagnetic waves." This learning model can perform object detection and object recognition based on both kinds of features, and by learning both of them rather than just one, the accuracy of object detection and object recognition by the learning model is improved.
- The learning device 20 also integrates the "features of the captured image" and the "features of the reflected wave information indicating the reflected electromagnetic waves" by a unique method of generating a composite image, and learns them together. Rather than learning the two kinds of features individually, learning them together as a composite image integrated by this method is expected to have a synergistic effect and improve the accuracy of object detection and object recognition.
- The learning device 20 can also generate, from one set of a learning captured image and a learning two-dimensional image, multiple learning composite images whose blend ratios differ from one another.
- The learning device 20 can also generate learning composite images with different blend ratios for each partial region within the image. In this way, multiple learning composite images can be generated from one set of a learning captured image and a learning two-dimensional image, which is preferable, and learning composite images with various blend ratios can be learned, which is also preferable.
- The learning device 20 can also adjust at least one of the language features, the learning composite image, and the image features to adapt them to a learning model generated/adjusted for captured images.
- The learning device 20 can then update the parameters of this adjustment so that the loss value based on the correlation between the language features and the image features becomes smaller.
- Such a learning device 20 can generate a learning model that learns composite images combining a captured image and a two-dimensional image, using an existing learning model generated/adjusted for captured images.
- That is, the learning model of this embodiment can be generated by having an existing learning model generated/adjusted for captured images learn composite images that combine a captured image and a two-dimensional image.
- The learning device 20 of the fourth embodiment acquires learning captured images by a means different from that of the learning device 20 of the third embodiment. This is described in detail below.
- The learning image acquisition unit 22 searches for images that match a search query from among multiple images, using the "text describing an object" input to the language input unit 25 as the search query. The learning image acquisition unit 22 then uses the images included in the search results as learning captured images.
- As described above, the learning captured image and the learning two-dimensional image to be combined with each other must contain information about the same object, but the positions of the object within the images do not necessarily have to match.
- Furthermore, "the same object" means a match in type or variety, and does not require a match of the individual object. Therefore, the learning captured image and the learning two-dimensional image to be combined do not have to capture the same individual object at the same time and in the same place. For this reason, the learning image acquisition unit 22 of this embodiment can acquire learning captured images by the means described above.
- For example, the learning image acquisition unit 22 may search for images that match the search query from among the vast number of images available on the Internet. Alternatively, a database storing a large number of images may be generated in advance, and the learning image acquisition unit 22 may search for images that match the search query within this database.
- The learning image acquisition unit 22 may acquire one learning captured image in response to one search query, or multiple learning captured images in response to one search query. For example, the learning image acquisition unit 22 may acquire a predetermined number of images as learning captured images, in descending order of matching rate, from among the multiple images found based on the search query.
- In the latter case, the learning synthesis unit 24 can generate multiple learning composite images by synthesizing one learning two-dimensional image containing information of an object with each of the multiple learning captured images containing information of that object. Increasing the number of learning composite images is preferable because it improves the learning effect.
- The learning image acquisition unit 22 may also acquire learning captured images using both the means described in the third embodiment and the means described in this embodiment. Increasing the number of learning captured images is preferable because it increases the number of learning composite images that can be generated.
- The rest of the configuration of the learning device 20 in this embodiment is the same as that of the learning device 20 in the first and third embodiments.
- The learning device 20 of this embodiment achieves the same effects as the learning device 20 of the first and third embodiments. Furthermore, the learning device 20 of this embodiment can acquire learning captured images by searching multiple images using the "text expressing an object" input to the language input unit 25 as a search query. Such a learning device 20 is preferable because it can acquire a large number of learning captured images and generate a large number of learning composite images. Furthermore, in one example, a camera is not required, which can reduce the cost burden and the burden of equipment maintenance.
- The detection device 10 of the fifth embodiment is a specific embodiment of the configuration of the detection device 10 of the second embodiment.
- The detection device 10 detects a detection target using a learning model generated by the learning device 20 of the first, third, or fourth embodiment.
- The configuration of the detection device 10 of this embodiment is described in detail below.
- The hardware configuration of the detection device 10 is realized by any combination of hardware and software.
- The software includes programs stored in the device before it is shipped, as well as programs downloaded from recording media such as CDs (Compact Discs) or from servers on the Internet.
- FIG. 7 is a block diagram illustrating an example of the hardware configuration of the detection device 10.
- The detection device 10 has a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A.
- The peripheral circuit 4A includes various modules.
- The detection device 10 does not need to have the peripheral circuit 4A.
- The detection device 10 may be composed of multiple devices that are physically and/or logically separated. In this case, each of the multiple devices can have the above hardware configuration.
- The bus 5A is a data transmission path through which the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A send and receive data to and from one another.
- The processor 1A is an arithmetic processing device such as a CPU or a GPU (Graphics Processing Unit).
- The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory).
- The input/output interface 3A includes interfaces for acquiring information from an input device, an external device, an external server, an external sensor, a camera, etc., and interfaces for outputting information to an output device, an external device, an external server, etc.
- The input/output interface 3A also includes an interface for connecting to a communication network such as the Internet.
- Examples of input devices include a keyboard, a mouse, a microphone, a physical button, and a touch panel.
- Examples of output devices include a display, a speaker, a printer, and a mailer.
- The processor 1A can issue commands to each module and perform calculations based on the results of their calculations.
- FIG. 12 shows an example of a functional block diagram of the detection device 10. As shown in the figure, the detection device 10 has a detection unit 11, a detection image acquisition unit 12, a detection reflected wave processing unit 13, and a detection synthesis unit 14.
- The detection image acquisition unit 12 acquires the processing target captured image.
- The "processing target captured image" is the captured image that is subject to the processing for detecting the detection target.
- The processing target captured image is an image of the target area.
- The "detection target" is the object to be detected.
- The learning model generated by the learning device 20 detects the detection target based on the "features of the captured image" and the "features of the reflected wave information indicating the reflected electromagnetic waves." Therefore, an object that reflects electromagnetic waves can be the detection target. Objects that reflect electromagnetic waves are often made mainly of metal, but are not limited to this.
- The "target area" is the area searched to see whether a detection target exists.
- The target area may be a part of the ground, a part of a building, or something else.
- The detection reflected wave processing unit 13 generates the processing target two-dimensional image based on reflected wave information indicating the reflected waves of the electromagnetic waves irradiated onto the target area.
- The detection reflected wave processing unit 13 can generate the two-dimensional image based on the reflected wave information using processing similar to that of the learning reflected wave processing unit 23.
- The "processing target two-dimensional image" is the two-dimensional image that is subject to the processing for detecting the detection target.
- The detection synthesis unit 14 synthesizes the processing target captured image and the processing target two-dimensional image to generate the processing target composite image.
- The detection synthesis unit 14 can generate the processing target composite image by synthesizing the processing target captured image and the processing target two-dimensional image using a process similar to that by which the learning synthesis unit 24 synthesizes the learning captured image and the learning two-dimensional image. It is preferable that the processing target captured image and the processing target two-dimensional image be of similar size.
- The detection unit 11 detects the detection target based on the processing target composite image, which is a composite of the processing target captured image and the processing target two-dimensional image.
- Specifically, the detection unit 11 detects the detection target based on the learning model generated by the learning device 20 of the first, third, or fourth embodiment and the processing target composite image.
- the learning model is an object recognition model or an object detection model that uses a language model.
- the detection unit 11 can adjust at least one of the language features, the synthetic image to be processed, and the image features to adapt them to the learning model generated/adjusted for the captured image.
- the detection unit 11 can then use the adjusted data to detect the detection target.
- the adjustment of at least one of the language features, the synthetic image to be processed, and the image features by the detection unit 11 is achieved by the same means as the adjustment of at least one of the language features, the synthetic image to be processed, and the image features by the learning unit 21 described in the third embodiment.
- the detection unit 11 can transform language features using a nonlinear function or a linear function, and detect the detection target based on the transformed language features.
- the detection unit 11 can also transform the composite image to be processed using a nonlinear function or a linear function, and detect the detection target based on the transformed composite image to be processed.
- the detection unit 11 can also transform image features extracted from the composite image to be processed using a nonlinear function or a linear function, and detect the detection target based on the transformed image features.
- the detection unit 11 acquires search conditions that express the detection target in text.
- the detection target is specified by these search conditions.
- the detection unit 11 then detects the detection target specified by these search conditions.
- the detection target may be specified each time a processing target composite image is input. Alternatively, the detection target may be specified in advance, and the specified detection target may be used in processing multiple processing target composite images.
- Search criteria are text that expresses the object to be detected.
- the text that constitutes the search criteria is a word, a sentence, a prompt, etc.
- the text that constitutes the search criteria can include, for example, the type of object, or the external characteristics of the object (color, size, shape, etc.).
- An example of a search criterion is "blue can.”
- the search criteria may further include text expressing an object to be removed from the search results. That is, the search criteria may include text expressing the detection target and text expressing the target to be removed from the search results.
- hereinafter, the detection target may be referred to as a "positive example," and the target to be removed from the search results may be referred to as a "negative example."
- Negative examples are expressed in the same manner as positive examples.
- An example of a search criterion for negative examples is "white can.”
- the detection unit 11 acquires search conditions input by the user.
- the user can input search conditions to the detection device 10 via any input device, such as a keyboard, touch panel, microphone, mouse, or physical button.
- the detection unit 11 can detect the detection target from within the composite image to be processed, for example, based on the correlation between the composite image to be processed and the search conditions.
- the detection unit 11 can detect objects with a high correlation value (cosine similarity) with the search criteria (positive examples) from within the composite image to be processed, and output the area of the object as the detection result.
- the detection unit 11 may detect objects with a correlation value with the search criteria (positive examples) equal to or greater than a threshold from within the composite image to be processed.
- the detection unit 11 may also detect a predetermined number of objects with a high correlation value with the search criteria (positive examples) from within the composite image to be processed.
- the detection unit 11 may also detect a predetermined number of objects with a high correlation value with the search criteria (positive examples) from within the composite image to be processed, whose correlation value with the search criteria (positive examples) is equal to or greater than a threshold.
- the detection unit 11 can also detect objects having a high correlation value with the search criteria (negative examples) from within the composite image to be processed, and output the detection result without including the area of the object in it. For example, the detection unit 11 can detect objects having a correlation value with the search criteria (negative examples) equal to or greater than a threshold from within the composite image to be processed. The detection unit 11 can also detect a predetermined number of objects having a high correlation value with the search criteria (negative examples) from within the composite image to be processed. The detection unit 11 can also detect a predetermined number of objects having a high correlation value with the search criteria (negative examples) from within the composite image to be processed, whose correlation value with the search criteria (negative examples) is equal to or greater than a threshold.
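As one way to make the correlation-based selection above concrete, the following is a minimal sketch assuming a CLIP-style joint embedding space (as in the documents cited for the learning model): candidate object regions and the positive/negative search texts are assumed to already be encoded as feature vectors by a shared model, and regions are kept or discarded by thresholded cosine similarity. The encoders, threshold values, and names are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def select_regions(region_feats: torch.Tensor, pos_feat: torch.Tensor,
                   neg_feat: torch.Tensor, pos_th: float = 0.3,
                   neg_th: float = 0.3) -> torch.Tensor:
    """Return indices of candidate regions matching the positive example
    (e.g., "blue can") while excluding regions that also match the
    negative example (e.g., "white can")."""
    regions = F.normalize(region_feats, dim=-1)
    pos_sim = regions @ F.normalize(pos_feat, dim=-1)  # cosine similarity to positive text
    neg_sim = regions @ F.normalize(neg_feat, dim=-1)  # cosine similarity to negative text
    keep = (pos_sim >= pos_th) & (neg_sim < neg_th)
    return keep.nonzero(as_tuple=True)[0]

region_feats = torch.randn(10, 512)  # features of 10 candidate regions (assumed precomputed)
pos_feat = torch.randn(512)          # text feature for the positive example
neg_feat = torch.randn(512)          # text feature for the negative example
print(select_regions(region_feats, pos_feat, neg_feat))
```

Instead of fixed thresholds, the top-k variant described above can be obtained by ranking `pos_sim` in descending order and keeping a predetermined number of regions.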
- the detection unit 11 detects an area in which the detection target exists from within the composite image to be processed.
- the detection unit 11 then outputs information indicating the detected area.
- the detection unit 11 may output an image in which information indicating the detected area (frame, mask, etc.) is superimposed on the composite image to be processed or the photographed image to be processed.
- the detection unit 11 may also indicate, for each detected area, a confidence calculated by any means. Note that this output example is merely an example and is not limiting.
- the detection device 10 acquires a captured image of the target area to be processed.
- the detection device 10 generates a two-dimensional image of the target area to be processed based on reflected wave information indicating the reflected waves of the electromagnetic waves irradiated to the target area.
- the detection device 10 synthesizes the captured image to be processed and the two-dimensional image to be processed to generate a composite image to be processed. Then, the detection device 10 detects the detection target based on the composite image to be processed.
- the detection target can be detected based on both the “characteristics of the captured image” and the “characteristics of the reflected wave information indicating the reflected wave of the electromagnetic wave.”
- the detection accuracy of the detection target is improved.
- the detection device 10 integrates the "features of the captured image” and the “features of the reflected wave information indicating the reflected waves of the electromagnetic waves” by a unique method of generating a composite image, and detects the detection target based on the composite image. Rather than processing the "features of the captured image” and the “features of the reflected wave information indicating the reflected waves of the electromagnetic waves” separately, they are integrated by a unique method and processed together as a composite image, which is expected to have a synergistic effect and improve detection accuracy.
- the detection device 10 can also adjust at least one of the language features, the composite image to be processed, and the image features to adapt them to a learning model generated/adjusted for captured images.
- the detection device 10 can then use the adjusted data to detect the detection target.
- such a detection device 10 can detect the detection target using a learning model generated by further training an existing learning model that was generated/adjusted for captured images.
- the detection device 10 of this embodiment generates a plurality of processing target composite images having different blend ratios using a set of a processing target photographic image and a processing target two-dimensional image that are input. Then, the detection device 10 detects the detection target using the plurality of processing target composite images. This will be described in detail below.
- the detection synthesis unit 14 uses an input set of a photographic image to be processed and a two-dimensional image to be processed to generate a plurality of composite images to be processed having different blending ratios.
- the detection synthesis unit 14 may generate composite images to be processed having different blending ratios for each partial area within the image.
- the detection synthesis unit 14 can achieve this synthesis by a means similar to that used by the learning synthesis unit 24 described in the third embodiment.
- the detection unit 11 detects the detection target from each of a plurality of processing target composite images generated using a set of processing target photographic images and processing target two-dimensional images.
- the process of detecting the detection target from each processing target composite image is realized by the same means as in the fifth embodiment.
- the detection unit 11 detects the detection target based on the detection results of each of the multiple processing target composite images. For example, the detection unit 11 can detect, as a region where the detection target exists, an area where the detection target has been detected to exist in at least a predetermined percentage of the processing target composite images (or at least a predetermined number of processing target composite images) among the multiple processing target composite images.
- the predetermined percentage and the predetermined number are values that are determined in advance.
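A minimal sketch of this aggregation step is shown below, assuming each processing target composite image has already yielded a boolean detection mask of the same shape; the percentage threshold and the per-pixel mask representation are illustrative assumptions.

```python
import numpy as np

def aggregate_detections(masks: list[np.ndarray], min_ratio: float = 0.5) -> np.ndarray:
    """Keep pixels detected in at least a predetermined percentage of the
    composite images generated with different blend ratios."""
    votes = np.mean(np.stack(masks, axis=0), axis=0)  # per-pixel detection ratio
    return votes >= min_ratio

# Example: detection masks from three composite images with different blend ratios
masks = [np.random.rand(64, 64) > 0.5 for _ in range(3)]
final_region = aggregate_detections(masks, min_ratio=2 / 3)
print(final_region.shape, final_region.dtype)
```

Using a count threshold instead of a ratio (the "predetermined number" variant above) only changes the comparison to `votes * len(masks) >= min_count`.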
- the detection device 10 of this embodiment provides the same effects as the second and fifth embodiments. Furthermore, the detection device 10 of this embodiment generates a plurality of composite images of the processing target having different blend ratios using an input set of a photographic image of the processing target and a two-dimensional image of the processing target, and can detect the detection target using the plurality of composite images of the processing target. Such a detection device 10 improves the detection accuracy of the detection target.
- the detection synthesis unit 14 may perform a correction process on the photographed image to be processed, and then generate a composite image to be processed using the corrected photographed image to be processed.
- the learning synthesis unit 24 may perform a correction process on the learning captured image, and then generate a learning composite image using the corrected learning captured image.
- the detection unit 11 may perform a correction process on the processing target composite image, and then detect the detection target using the corrected processing target composite image.
- the learning unit 21 may perform a correction process on the training composite image and then use the corrected training composite image to train the learning model.
- An example of the correction process is a correction to adjust the brightness of an image.
- the brightness of an image may be adjusted using the technology disclosed in the following document.
- the correction process may be performed when the brightness of an image satisfies a predetermined condition (for example, when the brightness is equal to or less than a threshold value).
- (Appendix 1) A detection device having a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
- (Appendix 2) The detection device according to Appendix 1, further comprising: a detection image acquisition means for acquiring the processing target captured image; a detection reflected wave processing means for generating the processing target two-dimensional image based on the reflected wave information; and a detection synthesis means for combining the processing target captured image and the processing target two-dimensional image to generate the processing target composite image.
- (Appendix 3) The detection device according to Appendix 1 or 2, wherein the detection means detects the detection target based on the processing target composite image and a learning model trained using a plurality of learning composite images, each obtained by combining a learning captured image with a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
- (Appendix 4) The detection device according to Appendix 3, wherein the learning model is an object recognition model or an object detection model using a language model.
- (Appendix 5) The detection device according to Appendix 3, wherein the detection synthesis means generates a plurality of processing target composite images in which the blend ratios of the processing target captured image and the processing target two-dimensional image differ from one another, and the detection means detects the detection target from each of the plurality of processing target composite images and detects the detection target based on the detection results for each of them.
- (Appendix 6) The detection device according to any one of Appendices 1 to 5, wherein the detection means transforms the processing target composite image by a nonlinear function or a linear function and detects the detection target based on the transformed processing target composite image.
- (Appendix 7) The detection device according to any one of Appendices 1 to 6, wherein the detection means extracts image features from the processing target composite image, transforms the image features by a nonlinear function or a linear function, and detects the detection target based on the transformed image features.
- (Appendix 8) A detection method in which one or more computers detect a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
- (Appendix 9) A program that causes a computer to function as a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
- (Appendix 10) A learning device having a learning means for training a learning model using a plurality of learning composite images, each obtained by combining a learning captured image with a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
- (Appendix 11) The learning device according to Appendix 10, wherein the learning model is an object recognition model or an object detection model using a language model.
- (Appendix 12) The learning device according to Appendix 10 or 11, further comprising: a learning image acquisition means for acquiring the learning captured image; a learning reflected wave processing means for generating the learning two-dimensional image based on the reflected wave information; and a learning synthesis means for combining the learning captured image and the learning two-dimensional image to generate the learning composite image.
- (Appendix 13) The learning device according to Appendix 12, wherein the learning synthesis means generates a plurality of the learning composite images in which the blend ratios of the learning captured image and the learning two-dimensional image differ from one another.
- (Appendix 14) The learning device according to any one of Appendices 10 to 13, wherein the learning means transforms the learning composite image by a nonlinear function or a linear function and trains the learning model using the transformed learning composite image.
- (Appendix 15) The learning device according to any one of Appendices 10 to 14, wherein the learning means extracts image features from the learning composite image and transforms the image features by a nonlinear function or a linear function.
- (Appendix 16) The learning device according to any one of Appendices 10 to 15, which trains the learning model using the transformed image features.
- (Appendix 17) A learning method in which one or more computers train a learning model using a plurality of learning composite images, each obtained by combining a learning captured image with a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
- (Appendix 18) A program that causes a computer to function as a learning means for training a learning model using a plurality of learning composite images, each obtained by combining a learning captured image with a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
- Appendices 2 to 7 that are dependent on the detection device of Appendix 1 described above may also be dependent on the detection method of Appendix 8 and the program of Appendix 9 in the same dependent relationship as Appendix 1 and Appendices 2 to 7.
- Appendices 11 to 16 that are dependent on the learning device of Appendix 10 described above may also be dependent on the learning method of Appendix 17 and the program of Appendix 18 in the same dependent relationship as Appendix 10 and Appendices 11 to 16.
- some or all of the configurations described as appendices can be realized in various hardware, software, various recording means for recording software, or systems.
Description
This disclosure relates to a detection device, a learning device, a detection method, and a program.
Technology related to the present disclosure is disclosed in Patent Document 1. Patent Document 1 discloses technology for detecting objects that reflect a lot of millimeter wave radar, such as metals, based on the reflection intensity of the millimeter wave radar.
The technology disclosed in Patent Document 1 detects all objects that reflect a lot of millimeter wave radar, such as metals. With the technology disclosed in Patent Document 1, it is difficult to accurately detect only some of the objects that reflect a lot of millimeter wave radar.
In view of the above-mentioned problems, one example of the objective of the present disclosure is to provide a detection device, a learning device, a detection method, a learning method, and a program for detecting a desired object with high accuracy.
According to the present disclosure, a detection device is provided that has a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated onto the target area.
Further, according to the present disclosure, a detection method is provided in which one or more computers detect a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated onto the target area.
Further, according to the present disclosure, a program is provided that causes a computer to function as a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area with a processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated onto the target area.
Further, according to the present disclosure, a learning device is provided having a learning means for training a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
Further, according to the present disclosure, a learning method is provided in which one or more computers train a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
Further, according to the present disclosure, a program is provided that causes a computer to function as a learning means for training a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
According to one aspect of the present disclosure, a detection device, a learning device, a detection method, a learning method, and a program for detecting a desired object with high accuracy are realized.
Below, embodiments of the present disclosure will be described with reference to the drawings. In this disclosure, the drawings relate to one or more embodiments. In addition, in all drawings, similar components are given similar reference symbols, and descriptions will be omitted as appropriate.
First Embodiment
FIG. 1 is a functional block diagram showing an overview of the learning device 20. FIG. 2 is a flowchart showing an example of the flow of processing executed by the learning device 20.
As shown in FIG. 1, the learning device 20 has a learning unit 21. This functional unit executes the processing of FIG. 2.
The learning unit 21 trains a learning model using a plurality of learning composite images obtained by combining "learning captured images" with "learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves" (S10).
According to the learning device 20, which trains a learning model using such learning composite images, a learning model that has learned both the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves" can be generated. This learning model can perform object detection and object recognition based on both sets of features. By learning both of these features, rather than only one, the accuracy of object detection and object recognition by the learning model improves.
The learning device 20 also integrates the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves" by a distinctive means, namely the generation of a composite image, and learns them together. Rather than learning each set of features separately, learning them together as a composite image integrated by this distinctive means is expected to produce a synergistic effect that improves the accuracy of object detection and object recognition.
Second Embodiment
FIG. 3 is a functional block diagram showing an overview of the detection device 10. FIG. 4 is a flowchart showing an example of the flow of processing executed by the detection device 10.
As shown in FIG. 3, the detection device 10 has a detection unit 11. This functional unit executes the processing of FIG. 4.
The detection unit 11 detects a detection target based on a processing target composite image obtained by combining a "processing target captured image of a target area" with a "processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated onto the target area" (S20).
According to the detection device 10, which detects a detection target using such a processing target composite image, the detection target can be detected based on both the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves." Using both of these features, rather than only one, improves the detection accuracy of the detection target.
The detection device 10 also integrates the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves" by a distinctive means, namely the generation of a composite image, and detects the detection target based on that composite image. Rather than processing each set of features separately, processing them together as a composite image integrated by this distinctive means is expected to produce a synergistic effect that improves detection accuracy.
Third Embodiment
"Overview"
The learning device 20 of the third embodiment is a concrete version of the configuration of the learning device 20 of the first embodiment.
In recent years, object recognition models that use language models to recognize detection targets and object detection models that use language models to detect detection targets have become known. These object recognition models and object detection models are models that can represent images and language in a joint embedding space.
The object recognition model and the object detection model are generated by learning the relationship between the results of object recognition/detection obtained by a technology such as a neural network and language related to the object (descriptions or expressions of the object). Based on the object recognition model and the object detection model, the object indicated by the search criteria (text) can be recognized/detected in the captured image. The technology is disclosed in, for example, the following documents.
"Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021."
"Li, Liunian Harold, et al. "Grounded language-image pre-training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022."
The object recognition model and object detection model disclosed in the above-mentioned documents are generated by learning the correlation between language related to objects (descriptions and expressions of objects) and captured images, as shown in FIG. 5.
In contrast, as shown in FIG. 6, the learning device 20 generates a learning model by learning the correlation between language related to an object (descriptions or expressions of the object) and a composite image. The composite image is an image generated by combining a "captured image" with a "two-dimensional image based on reflected wave information indicating reflected electromagnetic waves." The learning device 20 differs in this respect from the object recognition models that use a language model to recognize a detection target and the object detection models that use a language model to detect a detection target disclosed in the above documents.
The configuration of the learning device 20 will be described in detail below.
"Hardware Configuration"
First, an example of the hardware configuration of the learning device 20 will be described. Each functional unit of the learning device 20 is realized by any combination of hardware and software. Those skilled in the art will understand that there are various modifications to the realization methods and devices. The software includes programs stored in the device from the shipping stage, and programs downloaded from recording media such as CDs (Compact Discs) or from servers on the Internet.
FIG. 7 is a block diagram illustrating the hardware configuration of the learning device 20. As shown in FIG. 7, the learning device 20 has a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The learning device 20 does not have to have the peripheral circuit 4A. Note that the learning device 20 may be composed of multiple devices that are physically and/or logically separated. In this case, each of the multiple devices can have the above hardware configuration.
The bus 5A is a data transmission path for the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A to send and receive data to and from one another. The processor 1A is an arithmetic processing device such as a CPU or a GPU (Graphics Processing Unit). The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory). The input/output interface 3A includes interfaces for acquiring information from an input device, an external device, an external server, an external sensor, a camera, etc., and interfaces for outputting information to an output device, an external device, an external server, etc. The input/output interface 3A also includes an interface for connecting to a communication network such as the Internet. Examples of input devices include a keyboard, a mouse, a microphone, a physical button, and a touch panel. Examples of output devices include a display, a speaker, a printer, and a mailer. The processor 1A can issue commands to each module and perform calculations based on their calculation results.
"Functional Configuration"
Next, the functional configuration of the learning device 20 will be described in detail. FIG. 8 shows an example of a functional block diagram of the learning device 20. As shown in the figure, the learning device 20 has a learning unit 21, a learning image acquisition unit 22, a learning reflected wave processing unit 23, a learning synthesis unit 24, and a language input unit 25.
The learning image acquisition unit 22 acquires learning captured images.
"Learning captured images" are captured images used as learning data in the training of the learning model by the learning device 20.
A "captured image" is an image generated by taking a picture with a camera. A camera detects visible light and converts it into an image. Note that a camera may also detect other types of light, such as infrared or ultraviolet light, and convert it into an image. A captured image may be a still image or a video. A captured image may be provided with information indicating the date and time of capture and the location where the image was taken. If the captured image is a video, information that identifies the date and time of capture and the location where the image was taken may be provided for each frame image.
In one example, the camera is mounted on the measuring device. The measuring device is a device that can be moved. The measuring device may be, for example, an air vehicle such as a drone, or may be a device that moves on land. The measuring device can be moved automatically or by remote control. The measuring device can move automatically, for example, along a route registered in advance. As the measuring device moves, it captures an image of a specified data collection area with the camera. The data collection area is an area selected for the purpose of collecting learning data, detecting a detection target, etc. The data collection area may be a part of the ground, a part of a building, or something else.
In another example, the camera is carried by a worker who moves around and takes pictures of the data collection area with the camera.
The captured images generated by the camera are input to the learning device 20 by any means. For example, the camera and the learning device 20 may be configured to be able to communicate with each other. The camera may then transmit the generated captured images to the learning device 20 via this communication means. The transmission of captured images from the camera to the learning device 20 may be performed in real time or by batch processing.
Alternatively, the captured images generated by the camera may be stored in any storage device. The storage device may be provided in the camera, or in an external device configured to be able to communicate with the camera. The captured images stored in the storage device may then be input to the learning device 20 at any timing and by any means.
The learning reflected wave processing unit 23 generates a learning two-dimensional image based on reflected wave information indicating reflected electromagnetic waves.
The "electromagnetic wave" is, for example, a millimeter wave; an example of its frequency is 0.3 GHz or more and 300 GHz or less. However, the band of the electromagnetic wave is not limited to millimeter waves. The electromagnetic wave may be near-infrared light, far-infrared light, etc.
"Reflected waves" are electromagnetic waves that are reflected by objects in the irradiated area. If an object that reflects electromagnetic waves is present in that area, the strength of the reflected waves will be high. Objects that reflect electromagnetic waves are often made primarily of metal.
The irradiation of electromagnetic waves and the reception of reflected waves are achieved by an electromagnetic wave transmitting/receiving device. The electromagnetic wave transmitting/receiving device includes an electromagnetic wave transmitting unit that transmits electromagnetic waves, and an electromagnetic wave receiving unit that receives reflected waves. The electromagnetic wave transmitting/receiving device may include multiple electromagnetic wave receiving units, for example, two. These multiple electromagnetic wave receiving units are spaced apart from each other, and receive reflected waves of the electromagnetic waves irradiated by the same electromagnetic wave transmitting unit. In this way, the accuracy of detecting the position of an object is increased.
The transmission method used by the electromagnetic wave transmitter is, for example, one of FMCW (Frequency Modulated Continuous Wave), pulse, CW (Continuous Wave) Doppler, two-frequency CW, and pulse compression, but may be other than these.
In one example, the measuring device equipped with the camera described above further includes an electromagnetic wave transmitting and receiving device. As the measuring device moves, it captures images of the data collection area with the camera, and uses the electromagnetic wave transmitting and receiving device to irradiate electromagnetic waves onto the data collection area and receive the reflected waves.
In another example, the electromagnetic wave transmitting and receiving device is carried by a worker. While moving around carrying a camera and an electromagnetic wave transmitting and receiving device, the worker photographs the data collection area with the camera, and irradiates the data collection area with electromagnetic waves using the electromagnetic wave transmitting and receiving device and receives the reflected waves.
"Reflected wave information" is generated by the electromagnetic wave transmitting/receiving device. More specifically, the reflected wave information is generated based on the results of reception of the reflected wave by the electromagnetic wave receiving unit. The reflected wave information includes, for example, time series information on the strength of the reflected wave. This time series information includes a combination of the date and time the reflected wave was received and the strength of the reflected wave at that time. If multiple electromagnetic wave receiving units are provided, the reflected wave information is generated separately for each of the multiple electromagnetic wave receiving units.
The electromagnetic wave transmitting/receiving device may generate location information indicating the location of the electromagnetic wave transmitting/receiving device. The location information is indicated by, for example, latitude and longitude. The location information may include altitude in addition to latitude and longitude. This location information may be generated using, for example, GPS, or may be generated using other methods, such as SLAM (Simultaneous Localization and Mapping). The electromagnetic wave transmitting/receiving device may add location information of the electromagnetic wave transmitting/receiving device at the time the reflected wave was received to the above-mentioned reflected wave information.
The above-mentioned location information may be information separate from the reflected wave information. In this case, the location information is time-series information on the location of the electromagnetic wave transmitting/receiving device. This time-series information includes a combination of date and time and the location of the electromagnetic wave transmitting/receiving device at that date and time.
The reflected wave information generated by the electromagnetic wave transmitting/receiving device is input to the learning device 20 by any means. For example, the electromagnetic wave transmitting/receiving device and the learning device 20 may be configured to be able to communicate with each other. The electromagnetic wave transmitting/receiving device may then transmit the generated reflected wave information to the learning device 20 via this communication means. The transmission of reflected wave information from the electromagnetic wave transmitting/receiving device to the learning device 20 may be performed in real time or by batch processing.
Alternatively, the reflected wave information generated by the electromagnetic wave transmitting/receiving device may be stored in any storage device. The storage device may be provided within the electromagnetic wave transmitting/receiving device, or within an external device configured to be able to communicate with it. The reflected wave information stored in the storage device may then be input to the learning device 20 at any timing and by any means.
A "learning two-dimensional image" is a two-dimensional image generated for learning.
A "two-dimensional image" is a two-dimensional image generated based on the reflected wave information described above. The process of generating a two-dimensional image from reflected wave information is described below.
First, the learning reflected wave processing unit 23 generates three-dimensional information by processing the reflected wave information. In detail, the reflected wave information includes time-series signals of the intensity of the reflected waves and the position of the electromagnetic wave transmitting/receiving device. The learning reflected wave processing unit 23 calculates the distance from the electromagnetic wave receiving unit to the reflection point that was the origin of the reflected wave, for example by performing FFT (Fast Fourier Transform) multiple times on the reflected waves constituting this time-series signal. When there are multiple electromagnetic wave receiving units, the learning reflected wave processing unit 23 performs this processing for each of them. The learning reflected wave processing unit 23 then integrates the multiple distances based on the reflected waves measured at the same timing by the different electromagnetic wave receiving units, thereby calculating an estimated value of the reflected wave intensity for at least one first point included in the three-dimensional space corresponding to the data collection area. By performing this processing on the reflected waves measured at multiple timings, the learning reflected wave processing unit 23 calculates an estimated value of the reflected wave intensity for each of a plurality of first points, and treats these as the three-dimensional information. This estimated value can be regarded as a value indicating the possibility that an object that reflects electromagnetic waves exists at that first point. Hereinafter, this value is referred to as the first value. However, the method of generating the three-dimensional information, for example the method of generating the first value, is not limited to this example.
Next, the learning reflected wave processing unit 23 generates two-dimensional information by projecting the three-dimensional information onto a predetermined plane. Hereinafter, this predetermined plane is called the projection plane. The angle that the projection plane makes with the surface of the data collection area (the ground, the outer surface of a building, etc.) is preferably 10° or less. That is, the projection plane is preferably horizontal to the surface of the data collection area.
FIG. 9 is a diagram for explaining an example of the processing performed by the learning reflected wave processing unit 23. The learning reflected wave processing unit 23 identifies a plurality of first points corresponding to a second point. For example, the learning reflected wave processing unit 23 treats the plurality of first points that overlap a second point when viewed from the direction perpendicular to the projection plane as the first points corresponding to that second point. Next, the learning reflected wave processing unit 23 identifies the first value corresponding to each of the identified first points, and uses those first values to generate a second value indicating the possibility that an object that reflects electromagnetic waves exists at that second point. The second value can be a statistic of the plurality of first values (maximum, minimum, average, mode, median, etc.). Alternatively, among the first points whose first value exceeds a reference value, the first value corresponding to the first point closest to the data collection area may be used as the second value. The learning reflected wave processing unit 23 then treats the second value of each second point as the two-dimensional information. From this two-dimensional information, for example black-and-white image data (a two-dimensional image) can be generated. The two-dimensional image may express the second value of each second point in black and white, or in other colors or by other means.
As a variation, the learning device 20 may be configured to train the learning model taking into account at least one of the geological information of the data collection area and the weather information at the time the reflected waves were generated. For example, the geology of a specific part of the data collection area may tend to generate reflected waves. Also, depending on the weather, water or snow may accumulate on the surface of the data collection area and affect the reflected waves. The learning reflected wave processing unit 23 can reflect these effects in the generation of the two-dimensional image described above.
For example, the learning reflected wave processing unit 23 may generate the three-dimensional information or the two-dimensional information after multiplying the first value or the second value by a parameter corresponding to the geology of the data collection area. This parameter is set in advance. The learning reflected wave processing unit 23 may also generate the three-dimensional information or the two-dimensional information after multiplying the first value or the second value by a parameter corresponding to the weather at the time of measurement. This parameter is also set in advance.
The geological information and the weather information may, for example, be input to the learning device 20 by its user, or the learning device 20 may acquire them from a database in which they are stored.
The variations described here can also be applied to the detection device 10, which is described in detail in the following embodiments.
Returning to FIG. 8, the learning synthesis unit 24 combines a learning captured image and a learning two-dimensional image to generate a learning composite image. The learning captured image and the learning two-dimensional image are preferably of similar size.
As shown in FIG. 10, the learning synthesis unit 24 blends (combines) a learning captured image and a learning two-dimensional image at a predetermined blend ratio to generate a learning composite image.
The learning synthesis unit 24 may generate one learning composite image from one pair of a learning captured image and a learning two-dimensional image. In this case, the blend ratio may be determined in advance.
Alternatively, the learning synthesis unit 24 may generate, from one pair of a learning captured image and a learning two-dimensional image, a plurality of learning composite images in which the blend ratios of the learning captured image and the learning two-dimensional image differ from one another. This is preferable because many learning composite images can be generated from one pair, and because learning composite images with various blend ratios can be learned.
The learning synthesis unit 24 may also generate learning composite images in which the blend ratio differs for each partial area within the image. For example, the learning synthesis unit 24 may vary the blend ratio for each object area detected using widely known object detection technology, semantic segmentation, or the like. This is preferable because many learning composite images can be generated from one pair of a learning captured image and a learning two-dimensional image, and because learning composite images with various blend ratios can be learned.
The learning synthesis unit 24 combines a learning captured image and a learning two-dimensional image that contain information on the same object to generate a learning composite image. To achieve this, the learning synthesis unit 24 may combine a learning captured image and a learning two-dimensional image captured/measured at the same timing. Alternatively, it may combine a learning captured image and a learning two-dimensional image captured/measured at the same position, or at the same timing and the same position. "The same timing" here may mean an exact match, or a concept that tolerates a time difference of up to several seconds. Similarly, "the same position" may mean an exact match, or a concept that tolerates a positional deviation of up to several centimeters to several meters.
The learning synthesis unit 24 may combine the images after performing processing to roughly match the position of an object in the learning captured image with the position of that object in the learning two-dimensional image. However, this processing is not essential. The learning captured image and the learning two-dimensional image to be combined are required to contain information on the same object, but their positions within the images do not have to match. The inventors have confirmed that a sufficient learning effect and detection results can be obtained even if the positions of the object within the images do not match. However, when the positions do match, more favorable learning effects and detection results are expected than when they do not. The learning synthesis unit 24 may, for example, execute the following misalignment correction process 1 or 2.
"ズレ補正処理1"
一例では、ユーザ入力でズレを補正する。学習用合成部24は、図10に示すような撮影画像及び2次元画像を同時に画面表示する。ユーザは、撮影画像及び2次元画像の少なくとも一方の位置をずらし、各画像で示されるある物体の位置が互いに重なり合う関係とする。ユーザは、撮影画像で示される物体の形状と、2次元画像における第2の点の集合の形状とに基づき、2次元画像における第2の点の集合が撮影画像内のどの物体と対応するのか特定することができる。
"Misalignment correction process 1"
In one example, the misalignment is corrected by user input. The learning
"ズレ補正処理2"
学習用合成部24は、カメラと電磁波送受信装置の位置に基づき、撮影画像を、2次元画像と同じ座標系に変形してもよい。カメラ及び電磁波送受信装置が同一の測定装置に搭載されている場合、そのカメラ及び電磁波送受信装置の距離、向き等に基づき、当該変形を実現することができる。予め、それらの位置情報に基づき、撮影画像を所定の座標系に変形する変形ルールが用意されている。学習用合成部24は、その変形ルールを用いて、撮影画像の上記変形を実行する。画像変形の方法としては、例えばアフィン変換やホモグラフィ変換などの画像変換が例示されるが、これらに限定されない。変形後の撮影画像の視点は、2次元画像と同様の視点となる。結果、撮影画像内の所定位置(例:中心)と、2次元画像内の同所定位置(例:中心)は、データ収集領域内の同じ位置の情報を示すこととなる。
"Misalignment correction process 2"
The learning
Returning to FIG. 8, the language input unit 25 acquires text expressing an object appearing in a learning composite image. The text is a word, a sentence, a prompt, etc. The text expressing an object can include, for example, the type of the object or features of its appearance (color, size, shape, etc.). An example of such text is "blue can."
For example, a user may input text expressing an object appearing in a learning composite image into the learning device 20. The user may visually check the learning captured image from which the learning composite image was generated and identify the object appearing in the learning composite image. The user may also identify the object by other means. For example, when learning data is generated by placing a predetermined object in the data collection area and performing image capture and generation of reflection information, the user can identify the object placed in the data collection area by any other means, without visually checking the learning captured image.
As another example, a recognition result obtained by inputting the learning captured image from which the learning composite image was generated into a recognition model (such as a classifier) generated in advance by machine learning or the like may be input to the learning device 20 as the above text.
The learning unit 21 trains the learning model using a plurality of learning composite images and text expressing the object appearing in each learning composite image. The learning model is an object recognition model or an object detection model using a language model.
As described above, an object recognition model or object detection model using a language model is a model capable of expressing images and language in a joint embedding space. The object recognition model and object detection model are generated by learning the relationship between the results of object recognition/object detection obtained using technology such as neural networks and the language related to objects (descriptions and expressions of objects). Based on the object recognition model and object detection model, objects indicated by the search criteria (text) can be recognized/detected within the captured image.
The learning unit 21 learns the correlation between an object appearing in a learning composite image and text expressing that object. The learning unit 21 can generate the learning model of this embodiment by using widely known technology that, based on captured images, learns the correlation between "an object appearing in a captured image" and "text expressing that object" to generate an object recognition model or an object detection model. That is, the learning unit 21 can train the learning model by replacing the "captured image" in such widely known technology with the "learning composite image."
An example of the processing executed by the learning unit 21 is described below, but the processing of the learning unit 21 is not limited to this.
First, the learning unit 21 can divide one or more texts input via the language input unit 25 into tokens and extract language features from them. The language features can be extracted using any widely known technology. For example, the transformer described in the following document may be used for language feature extraction. For example, the learning unit 21 may extract language features using a 63-million-parameter transformer with 12 layers, a hidden dimension of 512, and 8 attention heads.
"Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021."
The learning unit 21 can also extract image features from the learning composite images. The image features can be extracted using any widely known technology. For image feature extraction, ResNet-50, ResNet-101, ResNet-50x4, ResNet-50x16, ResNet-50x64, ViT-B/32, ViT-B/16, ViT-L/14, etc. may be used. Image features may also be extracted using, for example, the technologies described in the following documents.
"He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016."
"Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020)."
The learning unit 21 can adjust at least one of the language features, the learning composite image, and the image features to adapt them to a learning model generated/adjusted for captured images.
In adjusting the language features, the learning unit 21 transforms the language features (feature vectors) extracted from the text input via the language input unit 25 using a nonlinear function or a linear function, generating, for example, features of the same dimension as the input features. As the nonlinear function, for example, an MLP (fully connected neural network) with one or more layers can be used. As the linear function, a linear function whose scale (coefficient) and offset (additive part) are optimization parameters can be used. A function with fewer optimization parameters than the language feature extraction is preferable.
In adjusting the learning composite image, the learning unit 21 transforms the learning composite image generated by the learning synthesis unit 24 using a nonlinear function or a linear function, generating, for example, an image of the same dimensions as the input image. As the nonlinear function, for example, an MLP with one or more layers can be used. As the linear function, a linear function whose scale (coefficient) and offset (additive part) are optimization parameters can be used. For example, optimizing the scale and offset parameters corresponds to brightness correction. This correction can bring statistical indicators of the pixel values of the learning composite image (such as the range of pixel values) closer to those of captured images. A function with fewer optimization parameters than the image feature extraction is preferable. When adjusting the learning composite image, the learning unit 21 extracts image features from the adjusted learning composite image.
In adjusting the image features, the learning unit 21 transforms the image features (feature vectors) extracted from the learning composite image, with or without the above adjustment, using a nonlinear function or a linear function, generating, for example, features of the same dimension as the input features. As the nonlinear function, for example, an MLP with one or more layers can be used. As the linear function, a linear function whose scale (coefficient) and offset (additive part) are optimization parameters can be used. This correction can bring statistical indicators of the image features extracted from the learning composite images closer to those of features extracted from captured images. A function with fewer optimization parameters than the image feature extraction is preferable.
The learning unit 21 can perform the following processing using the data obtained after performing the above adjustment on at least one of the language features, the learning composite image, and the image features.
The learning unit 21 calculates the correlation between the language features and the image features using, for example, cosine similarity. For example, the learning unit 21 normalizes each of the language features and the image features and calculates the inner product of the normalized vectors to obtain the correlation. More specifically, the learning unit 21 receives the image features of each of a plurality of learning composite images and the language features paired with each of them. The learning unit 21 then calculates not only the cosine similarity between paired image features and language features, but also the cosine similarities between unpaired image features and language features. That is, when N pairs of language features and image features are input to the learning unit 21, the learning unit 21 calculates N x N cosine similarities.
Next, from the calculated correlations, the learning unit 21 calculates a loss value characterized in that correct pairs have a small loss and incorrect pairs have a large loss. The learning unit 21 then updates at least one of the parameters of the language feature adjustment, the learning composite image adjustment, and the image feature adjustment described above so that the loss value becomes smaller. The learning unit 21 may also update other parameters.
The learning unit 21 can calculate the loss value using any widely known technology. For example, the learning unit 21 arranges the N x N cosine similarities in an N x N matrix, calculates the cross-entropy error in the row direction (horizontally), and further calculates the loss in the column direction (vertically). The learning unit 21 can then calculate the average of these cross-entropies as the loss value.
The learning unit 21 can also update the above parameters using widely known technology. For example, the learning unit 21 may update the parameters using backpropagation or the like.
Next, an example of the flow of processing of the learning device 20 is described using the flowchart in FIG. 11. Since the details of each process have been described above, an overview of each process is given here together with an example of the processing flow.
In S30, the learning device 20 acquires text expressing an object. The object is an object appearing in a learning captured image or a learning composite image.
In S31, the learning device 20 acquires a learning captured image. In S32, the learning device 20 generates a learning two-dimensional image based on reflected wave information indicating reflected electromagnetic waves. The learning captured image and the learning two-dimensional image indicate information on the same object.
In S33, the learning device 20 combines the learning captured image and the learning two-dimensional image to generate a learning composite image.
In S34, the learning device 20 trains the learning model using the text acquired in S30 and the learning composite image generated in S33. The learning model is an object recognition model or an object detection model using a language model.
The learning device 20 trains the learning model using the language features extracted from the text and the image features extracted from the learning composite image. The learning device 20 can adjust at least one of the language features, the learning composite image, and the image features to adapt them to a learning model generated/adjusted for captured images. After making this adjustment, the learning device 20 calculates the correlation between the language features and the image features, and updates the parameters of the adjustment and the like so that the loss value based on the calculated correlation becomes smaller.
"Action and effect"
The learning device 20 trains a learning model using learning composite images obtained by combining "learning captured images" with "learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves." According to such a learning device 20, a learning model that has learned both the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves" can be generated. This learning model can perform object detection and object recognition based on both sets of features. By learning both of these features, rather than only one, the accuracy of object detection and object recognition by the learning model improves.
The learning device 20 also integrates the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves" by a distinctive means, namely the generation of a composite image, and learns them together. Rather than learning each set of features separately, learning them together as a composite image integrated by this distinctive means is expected to produce a synergistic effect that improves the accuracy of object detection and object recognition.
The learning device 20 can also generate, from one pair of a learning captured image and a learning two-dimensional image, a plurality of learning composite images in which the blend ratios of the learning captured image and the learning two-dimensional image differ from one another. The learning device 20 can also generate learning composite images in which the blend ratio differs for each partial area within the image. This is preferable because many learning composite images can be generated from one pair, and because learning composite images with various blend ratios can be learned.
The learning device 20 can also adjust at least one of the language features, the learning composite image, and the image features to adapt them to a learning model generated/adjusted for captured images, and can update the parameters of this adjustment so that the loss value based on the correlation between the language features and the image features becomes smaller. According to such a learning device 20, a learning model that has learned composite images of captured images and two-dimensional images can be generated using an existing learning model generated/adjusted for captured images. That is, the learning model of this embodiment can be generated by having an existing learning model generated/adjusted for captured images learn composite images of captured images and two-dimensional images.
Fourth Embodiment
The learning device 20 of the fourth embodiment acquires learning captured images by means different from the learning device 20 of the third embodiment. This is described in detail below.
The learning image acquisition unit 22 uses the "text expressing an object" input to the language input unit 25 as a search query and searches a plurality of images for images matching the search query. The learning image acquisition unit 22 then uses the images included in the search results as learning captured images.
As explained in the third embodiment, the learning photographed image and the learning two-dimensional image to be combined with each other are required to contain information about the same object, but the position of the object within the image does not necessarily have to match. "The same object" means matching in type or variety, and does not require matching of the individual object. Therefore, the learning photographed image and the learning two-dimensional image to be combined with each other do not have to be photographs of the same individual object taken at the same position and in the same place. For this reason, the learning image acquisition unit 22 of this embodiment acquires the learning photographed image by the means described above.
The learning image acquisition unit 22 may search for images that match the search query from among the vast amount of images available on the Internet. Alternatively, a database that stores a large number of images may be generated in advance. The learning image acquisition unit 22 may then search for images that match the search query from within this database.
The learning image acquisition unit 22 may acquire one learning captured image in response to one search query, or may acquire multiple learning captured images in response to one search query. For example, the learning image acquisition unit 22 may acquire, as learning captured images, a predetermined number of images in descending order of matching rate from among the images found based on the search query. When multiple learning captured images are acquired in response to one search query, the learning synthesis unit 24 can combine one learning two-dimensional image containing information on an object with each of the multiple learning captured images containing information on that object to generate multiple learning composite images. A larger number of learning composite images improves the learning effect, which is preferable.
As a modified example, the learning image acquisition unit 22 may acquire learning images using both the means described in the third embodiment and the means described in this embodiment. Increasing the number of learning images is preferable because it increases the number of learning composite images that are generated.
The rest of the configuration of the learning device 20 of this embodiment is the same as that of the learning devices 20 of the first and third embodiments.
According to the learning device 20 of this embodiment, the same effects as the learning devices 20 of the first and third embodiments are realized. In addition, according to the learning device 20 of this embodiment, learning captured images can be acquired from among a plurality of images by searching with the "text expressing an object" input to the language input unit 25 as the search query. Such a learning device 20 is preferable because it can acquire many learning captured images and generate many learning composite images. Also, in one example, a camera becomes unnecessary, which reduces costs and the burden of equipment maintenance.
Fifth Embodiment
"Overview"
The detection device 10 of the fifth embodiment is a concrete version of the configuration of the detection device 10 of the second embodiment. The detection device 10 detects a detection target using the learning model generated by the learning device 20 of the first, third, and fourth embodiments. The configuration of the detection device 10 of this embodiment is described in detail below.
"Hardware Configuration"
First, an example of the hardware configuration of the detection device 10 will be described. Each functional unit of the detection device 10 is realized by any combination of hardware and software. Those skilled in the art will understand that there are various modifications to the realization methods and devices. The software includes programs stored in the device from the shipping stage, and programs downloaded from recording media such as CDs (Compact Discs) or from servers on the Internet.
FIG. 7 is a block diagram illustrating the hardware configuration of the detection device 10. As shown in FIG. 7, the detection device 10 has a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The detection device 10 does not have to have the peripheral circuit 4A. Note that the detection device 10 may be composed of multiple devices that are physically and/or logically separated. In this case, each of the multiple devices can have the above hardware configuration.
The bus 5A is a data transmission path for the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A to send and receive data to and from one another. The processor 1A is an arithmetic processing device such as a CPU or a GPU (Graphics Processing Unit). The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory). The input/output interface 3A includes interfaces for acquiring information from an input device, an external device, an external server, an external sensor, a camera, etc., and interfaces for outputting information to an output device, an external device, an external server, etc. The input/output interface 3A also includes an interface for connecting to a communication network such as the Internet. Examples of input devices include a keyboard, a mouse, a microphone, a physical button, and a touch panel. Examples of output devices include a display, a speaker, a printer, and a mailer. The processor 1A can issue commands to each module and perform calculations based on their calculation results.
"Functional Configuration"
Next, the functional configuration of the detection device 10 is described in detail. FIG. 12 shows an example of a functional block diagram of the detection device 10. As shown in the figure, the detection device 10 has a detection unit 11, a detection image acquisition unit 12, a detection reflected wave processing unit 13, and a detection synthesis unit 14.
The detection image acquisition unit 12 acquires a processing target captured image.
The "image to be processed" is the image that is the subject of processing to detect the detection target. The image to be processed is an image of the target area.
The concept of a "captured image" and an example of a capture method are as explained in the third embodiment.
The "detection target" is the object to be detected. The learning model generated by the learning device 20 detects the detection target based on the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves." Therefore, an object that reflects electromagnetic waves can be the detection target. Objects that reflect electromagnetic waves are often formed mainly of metal, but are not limited to this.
The "target area" is the area being searched to see if a detection target exists. The target area may be a part of the ground, a part of a building, or something else.
The detection reflected wave processing unit 13 generates a processing target two-dimensional image based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated onto the target area. The detection reflected wave processing unit 13 can generate the two-dimensional image based on the reflected wave information using processing similar to that of the learning reflected wave processing unit 23.
The "two-dimensional image to be processed" is the two-dimensional image that is the subject of processing to detect the detection target.
The concepts of "two-dimensional images," "electromagnetic waves," "reflected waves," and "reflected wave information," as well as information related to these, and an example of a method for generating them, are as described in the third embodiment.
The detection synthesis unit 14 synthesizes the processing target captured image and the processing target two-dimensional image to generate a processing target composite image. The detection synthesis unit 14 can synthesize them by the same processing as the synthesis of the learning captured image and the learning two-dimensional image by the learning synthesis unit 24. The processing target captured image and the processing target two-dimensional image are preferably of the same size.
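One simple way such a synthesis could be realized is pixel-wise alpha blending, shown below as a minimal sketch. The actual synthesis follows the learning synthesis unit 24 of the third embodiment, so the blend formula and the `alpha` parameter here are assumptions for illustration.

```python
import numpy as np

def blend_images(captured_rgb, radar_2d, alpha=0.5):
    """Alpha-blend a captured RGB image with a single-channel radar image.

    captured_rgb: (H, W, 3) float array in [0, 1].
    radar_2d:     (H, W) float array in [0, 1]; both images must share H and W.
    alpha:        blend ratio; alpha=1.0 keeps only the captured image.
    """
    # Broadcast the single radar channel to three channels before mixing.
    radar_rgb = np.repeat(radar_2d[..., None], 3, axis=2)
    return alpha * captured_rgb + (1.0 - alpha) * radar_rgb
```

Requiring both inputs to share the same height and width, as the text recommends, is what makes this per-pixel mixing well defined.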
The detection unit 11 detects the detection target based on the processing target composite image obtained by synthesizing the processing target captured image and the processing target two-dimensional image. Specifically, the detection unit 11 detects the detection target based on the processing target composite image and the learning model generated by the learning devices 20 of the first, third, and fourth embodiments. This learning model is an object recognition model or an object detection model using a language model.
The detection unit 11 can adjust at least one of the language features, the processing target composite image, and the image features so as to adapt it to a learning model generated/tuned for captured images, and can then detect the detection target using the adjusted data. This adjustment by the detection unit 11 is realized by the same means as the adjustment by the learning unit 21, described in the third embodiment, of at least one of the language features, the learning composite image, and the image features.
In other words, the detection unit 11 can transform the language features with a nonlinear or linear function and detect the detection target based on the transformed language features. The detection unit 11 can also transform the processing target composite image with a nonlinear or linear function and detect the detection target based on the transformed processing target composite image. The detection unit 11 can also transform image features extracted from the processing target composite image with a nonlinear or linear function and detect the detection target based on the transformed image features.
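As a hedged illustration of these transformations, the sketch below applies a linear map, or a small one-layer MLP for the nonlinear case, to a feature vector. The 512-dimensional size and the adapter shape are assumptions, and in practice the weights `W` and `b` would be learned rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative adapter weights; in a real system these would be learned
# jointly with (or on top of) the pretrained model, not sampled randomly.
W = rng.normal(scale=0.02, size=(512, 512))
b = np.zeros(512)

def linear_adapter(features):
    """Linear transformation of a feature vector (language or image features)."""
    return features @ W + b

def nonlinear_adapter(features):
    """A one-hidden-layer MLP adapter: linear map followed by ReLU."""
    return np.maximum(linear_adapter(features), 0.0)
```

The same adapter pattern can sit in front of any of the three inputs named above, which is why the text treats the three cases uniformly.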
The detection unit 11 acquires a search condition that expresses the detection target in text; this search condition specifies the detection target, and the detection unit 11 detects the detection target specified by it. The detection target may be specified each time a processing target composite image is input. Alternatively, the detection target may be specified in advance, and the specified detection target may be used in processing a plurality of processing target composite images.
The "search condition" is text expressing the detection target. The text constituting the search condition may be a word, a sentence, a prompt, or the like, and can include, for example, the type of object or external features of the object (color, size, shape, etc.). An example of a search condition is "blue can."
The search condition may further include text expressing an object to be removed from the search results. That is, the search condition may include text expressing the detection target and text expressing a target to be removed from the search results. Hereinafter, the detection target may be called a "positive example," and the target to be removed from the search results a "negative example." Negative examples are expressed by the same means as positive examples. An example of a negative-example search condition is "white can."
The detection unit 11 acquires a search condition input by the user. The user can input the search condition into the detection device 10 via any input device, such as a keyboard, touch panel, microphone, mouse, or physical buttons.
The detection unit 11 can detect the detection target in the processing target composite image based on, for example, the correlation between the processing target composite image and the search condition.
In one example, the detection unit 11 detects objects in the processing target composite image that have a high correlation value (cosine similarity) with the search condition (positive example) and outputs the regions of those objects as the detection result. For example, the detection unit 11 may detect objects whose correlation value with the positive example is equal to or greater than a threshold. The detection unit 11 may also detect a predetermined number of objects in descending order of correlation value with the positive example. The detection unit 11 may also detect a predetermined number of objects, in descending order of correlation value, from among the objects whose correlation value with the positive example is equal to or greater than the threshold.
The detection unit 11 can also detect objects in the processing target composite image that have a high correlation value with the search condition (negative example) and exclude the regions of those objects from the output detection result. For example, the detection unit 11 may detect objects whose correlation value with the negative example is equal to or greater than a threshold. The detection unit 11 may also detect a predetermined number of objects in descending order of correlation value with the negative example. The detection unit 11 may also detect a predetermined number of objects, in descending order of correlation value, from among the objects whose correlation value with the negative example is equal to or greater than the threshold.
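Putting the positive- and negative-example rules together, the sketch below filters candidate object regions by cosine similarity, assuming the region features and the text embeddings of the search conditions have already been produced by the model; the `threshold` value and the function names are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_regions(region_feats, pos_feat, neg_feat=None, threshold=0.3):
    """Keep regions correlated with the positive text, drop those matching the negative text.

    region_feats: list of (region_id, feature_vector) for candidate object regions.
    pos_feat / neg_feat: text embeddings of the positive / negative search conditions.
    """
    kept = []
    for region_id, feat in region_feats:
        if cosine_similarity(feat, pos_feat) < threshold:
            continue  # not similar enough to the positive example
        if neg_feat is not None and cosine_similarity(feat, neg_feat) >= threshold:
            continue  # matches the negative example, so exclude it
        kept.append(region_id)
    return kept
```

Swapping the threshold test for a top-k selection (or combining both) yields the other variants described in the two preceding paragraphs.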
In this way, the detection unit 11 detects, in the processing target composite image, the region in which the detection target exists, and outputs information indicating the detected region. For example, the detection unit 11 may output an image in which information indicating the detected region (a frame, a mask, etc.) is superimposed on the processing target composite image or the processing target captured image. The detection unit 11 may also indicate a confidence, computed by any means, for each detected region. This output example is merely one example, and the output is not limited to it.
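As one assumed rendering of such an output, the sketch below draws rectangular frames for the detected regions on a copy of the image and reports a per-region confidence; the box format and the confidence reporting are illustrative choices, not the disclosed output format.

```python
import numpy as np

def overlay_boxes(image, boxes, confidences=None, color=(1.0, 0.0, 0.0)):
    """Draw rectangular frames for detected regions onto a copy of the image.

    image: (H, W, 3) float array in [0, 1].
    boxes: list of (top, left, bottom, right) pixel coordinates.
    """
    out = image.copy()
    for i, (t, l, b, r) in enumerate(boxes):
        out[t, l:r] = color        # top edge
        out[b - 1, l:r] = color    # bottom edge
        out[t:b, l] = color        # left edge
        out[t:b, r - 1] = color    # right edge
        if confidences is not None:
            print(f"region {i}: confidence {confidences[i]:.2f}")
    return out
```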
Next, an example of the process flow of the detection device 10 will be described with reference to the flowchart in FIG. 13. Since the details of each process have been described above, only an overview of each process is given here together with an example of the flow.
In S40, the detection device 10 acquires a processing target captured image of the target area. In S41, the detection device 10 generates a processing target two-dimensional image based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated onto the target area.
Next, in S42, the detection device 10 synthesizes the processing target captured image and the processing target two-dimensional image to generate a processing target composite image. The detection device 10 then detects the detection target based on the processing target composite image.
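A minimal end-to-end sketch of S40 to S42 follows; every argument is a caller-supplied callable standing in for components described elsewhere (the camera, the radar scan, the rasterization and blending steps, and the learned detector), so this shows only the assumed data flow.

```python
def run_detection(capture_image, scan_reflections, to_2d_image, blend, detect):
    """Illustrative S40-S42 pipeline; all five arguments are callables."""
    captured = capture_image()                   # S40: acquire the captured image
    radar_img = to_2d_image(scan_reflections())  # S41: reflected wave info -> 2D image
    composite = blend(captured, radar_img)       # S42: generate the composite image
    return detect(composite)                     # detect the target in the composite
```

With the earlier sketches, one could pass `reflections_to_2d_image` as `to_2d_image` and `blend_images` as `blend`; the `detect` callable wraps the learning-model inference.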
"Action and Effect"
According to the detection device 10, which detects the detection target using the processing target composite image, the detection target can be detected based on both the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves." Using both of these, rather than only one, improves the detection accuracy of the detection target.
In addition, the detection device 10 integrates the "features of the captured image" and the "features of the reflected wave information indicating reflected electromagnetic waves" by the characteristic means of generating a composite image, and detects the detection target based on that composite image. By processing the two kinds of features together in a composite image integrated in this characteristic way, rather than processing each of them individually, an improvement in detection accuracy due to their synergy can be expected.
The detection device 10 can also adjust at least one of the language features, the processing target composite image, and the image features so as to adapt it to a learning model generated/tuned for captured images, and can then detect the detection target using the adjusted data. Such a detection device 10 can perform detection using a learning model obtained by further training an existing learning model generated/tuned for captured images.
<Sixth Embodiment>
The detection device 10 of this embodiment uses one input pair of a processing target captured image and a processing target two-dimensional image to generate a plurality of processing target composite images with mutually different blend ratios. The detection device 10 then detects the detection target using this plurality of processing target composite images. This is described in detail below.
The detection synthesis unit 14 uses one input pair of a processing target captured image and a processing target two-dimensional image to generate a plurality of processing target composite images with mutually different blend ratios. The detection synthesis unit 14 may generate processing target composite images in which the blend ratio differs for each partial region within the image. The detection synthesis unit 14 can realize this synthesis by the same means as the learning synthesis unit 24 described in the third embodiment.
The detection unit 11 detects the detection target in each of the plurality of processing target composite images generated from the one pair of processing target captured image and processing target two-dimensional image. The process of detecting the detection target in each processing target composite image is realized by the same means as in the fifth embodiment.
The detection unit 11 then detects the detection target based on the detection results for each of the plurality of processing target composite images. For example, the detection unit 11 can determine that the detection target exists in a region that is detected as containing the detection target in at least a predetermined proportion of the processing target composite images (or in at least a predetermined number of them). The predetermined proportion and number are values determined in advance.
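A minimal sketch of this voting scheme is shown below, assuming a `blend` function that accepts a blend ratio and a `detect_mask` function that wraps the learning-model inference and returns a boolean per-pixel mask; the blend ratios and the acceptance ratio are illustrative values.

```python
import numpy as np

def ensemble_detect(captured, radar_img, blend, detect_mask,
                    alphas=(0.25, 0.5, 0.75), min_ratio=0.5):
    """Vote over composites generated with different blend ratios.

    blend(captured, radar_img, alpha) -> composite image
    detect_mask(composite)            -> boolean (H, W) detection mask
    A pixel counts as a detection if it is detected in at least
    min_ratio of the composites.
    """
    masks = [detect_mask(blend(captured, radar_img, a)) for a in alphas]
    votes = np.mean(np.stack(masks).astype(np.float32), axis=0)
    return votes >= min_ratio
```

Averaging boolean masks and thresholding the mean implements the "predetermined proportion" rule; replacing the mean with a count implements the "predetermined number" variant.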
The remaining configuration of the detection device 10 of this embodiment is the same as in the second and fifth embodiments.
According to the detection device 10 of this embodiment, the same operation and effects as in the second and fifth embodiments are obtained. In addition, the detection device 10 of this embodiment can generate, from one input pair of a processing target captured image and a processing target two-dimensional image, a plurality of processing target composite images with mutually different blend ratios, and can detect the detection target using these composite images. Such a detection device 10 improves the detection accuracy of the detection target.
<Modifications>
Modifications applicable to the first to sixth embodiments will be described.
The detection synthesis unit 14 may perform correction processing on the processing target captured image and then generate the processing target composite image using the corrected processing target captured image.
Similarly, the learning synthesis unit 24 may perform correction processing on the learning captured image and then generate the learning composite image using the corrected learning captured image.
The detection unit 11 may also perform correction processing on the processing target composite image and then detect the detection target using the corrected processing target composite image.
The learning unit 21 may also perform correction processing on the learning composite image and then train the learning model using the corrected learning composite image.
An example of the correction processing is a correction that adjusts the brightness of the image. The brightness may be adjusted using, for example, the techniques disclosed in the following documents. The correction processing may be performed when the brightness of the image satisfies a predetermined condition (brightness at or below a threshold).
Shibata, Takashi, Masayuki Tanaka, and Masatoshi Okutomi. "Gradient-domain image reconstruction framework with intensity-range and base-structure constraints." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Tanaka, Masayuki, Takashi Shibata, and Masatoshi Okutomi. "Gradient-based low-light image enhancement." 2019 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2019.
The present disclosure has been described above with reference to the embodiments, but the present disclosure is not limited to the above embodiments. Various modifications understandable to those skilled in the art can be made to the configuration and details of the present disclosure within its scope. Each embodiment can also be combined with other embodiments as appropriate.
In the flowcharts used in the above description, a plurality of steps (processes) are listed in order, but the execution order of the steps in each embodiment is not limited to the listed order. In each embodiment, the order of the illustrated steps can be changed to the extent that the content is not affected.
A part or all of the above embodiments can be described as in, but are not limited to, the following supplementary notes.
1. A detection device having a detection means for detecting a detection target based on a processing target composite image obtained by synthesizing a processing target captured image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
2. The detection device according to 1, further comprising: a detection captured image acquisition means for acquiring the processing target captured image; a detection reflected wave processing means for generating the processing target two-dimensional image based on the reflected wave information; and a detection synthesis means for synthesizing the processing target captured image and the processing target two-dimensional image to generate the processing target composite image.
3. The detection device according to 1 or 2, wherein the detection means detects the detection target based on the processing target composite image and a learning model trained on a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
4. The detection device according to 3, wherein the learning model is an object recognition model or an object detection model using a language model.
5. The detection device according to 2, wherein the detection synthesis means generates a plurality of the processing target composite images in which the blend ratios of the processing target captured image and the processing target two-dimensional image differ from one another, and the detection means detects the detection target in each of the plurality of processing target composite images and detects the detection target based on the detection results for each of the plurality of processing target composite images.
6. The detection device according to any one of 1 to 5, wherein the detection means transforms the processing target composite image with a nonlinear or linear function and detects the detection target based on the transformed processing target composite image.
7. The detection device according to any one of 1 to 6, wherein the detection means extracts image features from the processing target composite image, transforms the image features with a nonlinear or linear function, and detects the detection target based on the transformed image features.
8. A detection method in which one or more computers detect a detection target based on a processing target composite image obtained by synthesizing a processing target captured image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
9. A program causing a computer to function as a detection means for detecting a detection target based on a processing target composite image obtained by synthesizing a processing target captured image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
10. A learning device having a learning means for training a learning model using a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
11. The learning device according to 10, wherein the learning model is an object recognition model or an object detection model using a language model.
12. The learning device according to 11, further comprising a language input means for acquiring text expressing an object appearing in the learning composite image, wherein the learning means learns the correlation between the object appearing in the learning composite image and the text.
13. The learning device according to any one of 10 to 12, further comprising: a learning captured image acquisition means for acquiring the learning captured image; a learning reflected wave processing means for generating the learning two-dimensional image based on the reflected wave information; and a learning synthesis means for synthesizing the learning captured image and the learning two-dimensional image to generate the learning composite image.
14. The learning device according to 13, wherein the learning synthesis means generates a plurality of the learning composite images in which the blend ratios of the learning captured image and the learning two-dimensional image differ from one another.
15. The learning device according to any one of 10 to 14, wherein the learning means transforms the learning composite image with a nonlinear or linear function and trains the learning model using the transformed learning composite image.
16. The learning device according to any one of 10 to 15, wherein the learning means extracts image features from the learning composite image, transforms the image features with a nonlinear or linear function, and trains the learning model using the transformed image features.
17. A learning method in which one or more computers train a learning model using a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
18. A program causing a computer to function as a learning means for training a learning model using a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
Some or all of supplementary notes 2 to 7, which depend on the detection device of supplementary note 1, may also depend on the detection method of supplementary note 8 and the program of supplementary note 9 in the same dependency relationships as between supplementary note 1 and supplementary notes 2 to 7. Likewise, some or all of supplementary notes 11 to 16, which depend on the learning device of supplementary note 10, may also depend on the learning method of supplementary note 17 and the program of supplementary note 18 in the same dependency relationships as between supplementary note 10 and supplementary notes 11 to 16. Furthermore, without departing from the above embodiments, some or all of the configurations described in the supplementary notes can be realized in various hardware, software, various recording means for recording software, or systems.
This application claims priority based on Japanese Patent Application No. 2023-130762, filed on August 10, 2023, the entire disclosure of which is incorporated herein by reference.
REFERENCE SIGNS
10 Detection device
11 Detection unit
12 Detection captured image acquisition unit
13 Detection reflected wave processing unit
14 Detection synthesis unit
20 Learning device
21 Learning unit
22 Learning captured image acquisition unit
23 Learning reflected wave processing unit
24 Learning synthesis unit
25 Language input unit
1A Processor
2A Memory
3A Input/output I/F
4A Peripheral circuit
5A Bus
Claims (20)
The detection device according to claim 1, further comprising: a detection captured image acquisition means for acquiring the processing target captured image; a detection reflected wave processing means for generating the processing target two-dimensional image based on the reflected wave information; and a detection synthesis means for synthesizing the processing target captured image and the processing target two-dimensional image to generate the processing target composite image.
The detection device according to claim 1 or 2, wherein the detection means detects the detection target based on the processing target composite image and a learning model trained on a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
The detection device according to claim 2, wherein the detection synthesis means generates a plurality of the processing target composite images in which the blend ratios of the processing target captured image and the processing target two-dimensional image differ from one another, and the detection means detects the detection target in each of the plurality of processing target composite images and detects the detection target based on the detection results for each of the plurality of processing target composite images.
The detection device according to any one of claims 1 to 5, wherein the detection means transforms the processing target composite image with a nonlinear or linear function and detects the detection target based on the transformed processing target composite image.
The detection device according to any one of claims 1 to 6, wherein the detection means extracts image features from the processing target composite image, transforms the image features with a nonlinear or linear function, and detects the detection target based on the transformed image features.
A detection method in which one or more computers detect a detection target based on a processing target composite image obtained by synthesizing a processing target captured image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
A recording medium recording a program that causes a computer to function as a detection means for detecting a detection target based on a processing target composite image obtained by synthesizing a processing target captured image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated onto the target area.
The learning device according to claim 11, further comprising a language input means for acquiring text expressing an object appearing in the learning composite image, wherein the learning means learns the correlation between the object appearing in the learning composite image and the text.
The learning device according to any one of claims 10 to 12, further comprising: a learning captured image acquisition means for acquiring the learning captured image; a learning reflected wave processing means for generating the learning two-dimensional image based on the reflected wave information; and a learning synthesis means for synthesizing the learning captured image and the learning two-dimensional image to generate the learning composite image.
The learning device according to claim 13, wherein the learning synthesis means generates a plurality of the learning composite images in which the blend ratios of the learning captured image and the learning two-dimensional image differ from one another.
The learning device according to any one of claims 10 to 14, wherein the learning means transforms the learning composite image with a nonlinear or linear function and trains the learning model using the transformed learning composite image.
The learning device according to any one of claims 10 to 15, wherein the learning means extracts image features from the learning composite image, transforms the image features with a nonlinear or linear function, and trains the learning model using the transformed image features.
A learning method in which one or more computers train a learning model using a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.
A recording medium recording a program that causes a computer to function as a learning means for training a learning model using a plurality of learning composite images, each obtained by synthesizing a learning captured image and a learning two-dimensional image generated based on reflected wave information indicating reflected electromagnetic waves.