WO2025033195A1

WO2025033195A1 - Detection device, learning device, detection method, learning method, and recording medium

Info

Publication number: WO2025033195A1
Application number: PCT/JP2024/026593
Authority: WO
Inventors: 剛志柴田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2023-08-10
Filing date: 2024-07-25
Publication date: 2025-02-13
Anticipated expiration: 2026-02-10

Abstract

A detection device according to the present disclosure comprises a detection means for detecting an object to be detected on the basis of a composite image of the object to be detected obtained by synthesizing: a captured image to be processed which results from imaging a region of interest; and a two-dimensional image to be processed which was generated on the basis of reflected wave information indicating reflected waves of electromagnetic waves projected on the region of interest. The detection means detects the object to be detected on the basis of a learning model obtained by learning a plurality of composite images for learning obtained by synthesizing captured images for learning and two-dimensional images for learning generated on the basis of reflected wave information describing reflected waves of electromagnetic waves.

Description

Detection device, learning device, detection method, learning method, and recording medium

　本開示は、検出装置、学習装置、検出方法、及びプログラムに関する。 This disclosure relates to a detection device, a learning device, a detection method, and a program.

　本開示に関連する技術が特許文献１に開示されている。特許文献１は、ミリ波レーダの反射強度に基づき、金属等のミリ波レーダを多く反射する物体を検出する技術を開示している。 Technology related to the present disclosure is disclosed in Patent Document 1. Patent Document 1 discloses technology for detecting objects that reflect a lot of millimeter wave radar, such as metals, based on the reflection intensity of the millimeter wave radar.

国際公開第２０１７／０５７０５８号International Publication No. 2017/057058

　特許文献１に開示の技術の場合、金属等のミリ波レーダを多く反射する物体の全てを検出してしまう。特許文献１に開示の技術の場合、ミリ波レーダを多く反射する物体の中の一部のみを高精度に検出することは困難である。 The technology disclosed in Patent Document 1 detects all objects that reflect a lot of millimeter wave radar, such as metals. With the technology disclosed in Patent Document 1, it is difficult to accurately detect only some of the objects that reflect a lot of millimeter wave radar.

　本開示の目的の一例は、上述した問題を鑑み、所望の物体を精度よく検出するための検出装置、学習装置、検出方法、学習方法及びプログラムを提供することにある。 In view of the above-mentioned problems, one example of the objective of the present disclosure is to provide a detection device, a learning device, a detection method, a learning method, and a program for detecting a desired object with high accuracy.

　本開示によれば、
　対象領域を撮影した処理対象撮影画像と、前記対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する検出手段を有する検出装置が提供される。 According to the present disclosure,
A detection device is provided that has a detection means for detecting a detection target based on a composite image of a processing target obtained by combining a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.

　また、本開示によれば、
　１つ以上のコンピュータが、
　対象領域を撮影した処理対象撮影画像と、前記対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する検出方法が提供される。 Further, according to the present disclosure,
One or more computers
A detection method is provided for detecting a detection target based on a processing target composite image obtained by combining a processing target captured image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.

　また、本開示によれば、
　コンピュータを、
　対象領域を撮影した処理対象撮影画像と、前記対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する検出手段として機能させるプログラムが提供される。 Further, according to the present disclosure,
Computer,
A program is provided that functions as a detection means for detecting a detection target based on a composite image of a processing target obtained by combining a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.

　また、本開示によれば、
　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像とを合成した複数の学習用合成画像を用いて学習モデルを学習する学習手段を有する学習装置が提供される。 Further, according to the present disclosure,
A learning device is provided having a learning means for learning a learning model using a plurality of learning composite images obtained by combining a learning captured image with a learning two-dimensional image generated based on reflected wave information indicating a reflected electromagnetic wave.

　また、本開示によれば、
　１つ以上のコンピュータが、
　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像とを合成した複数の学習用合成画像を用いて学習モデルを学習する学習方法が提供される。 Further, according to the present disclosure,
One or more computers
A learning method is provided for learning a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.

　また、本開示によれば、
　コンピュータを、
　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像とを合成した複数の学習用合成画像を用いて学習モデルを学習する学習手段として機能させるプログラムが提供される。 Further, according to the present disclosure,
Computer,
A program is provided that functions as a learning means for learning a learning model using a plurality of learning composite images obtained by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.

　本開示の一態様によれば、所望の物体を精度よく検出するための検出装置、学習装置、検出方法、学習方法及びプログラムが実現される。 According to one aspect of the present disclosure, a detection device, a learning device, a detection method, a learning method, and a program for detecting a desired object with high accuracy are realized.

本開示にかかる学習装置の機能ブロック図の一例を示す図である。FIG. 2 is a diagram illustrating an example of a functional block diagram of a learning device according to the present disclosure. 本開示にかかる学習装置の処理の流れの一例を示すフローチャートである。11 is a flowchart illustrating an example of a process flow of a learning device according to the present disclosure. 本開示にかかる検出装置の機能ブロック図の一例を示す図である。FIG. 2 is a diagram showing an example of a functional block diagram of a detection device according to the present disclosure. 本開示にかかる検出装置の処理の流れの一例を示すフローチャートである。10 is a flowchart illustrating an example of a process flow of a detection device according to the present disclosure. 比較例の処理を説明するための図である。FIG. 11 is a diagram for explaining a process of a comparative example. 本開示にかかる学習装置の処理を説明するための図である。FIG. 2 is a diagram for explaining processing of a learning device according to the present disclosure. 本開示にかかる学習装置及び検出装置のハードウエア構成の一例を示す図である。FIG. 2 is a diagram illustrating an example of a hardware configuration of a learning device and a detection device according to the present disclosure. 本開示にかかる学習装置の機能ブロック図の他の一例を示す図である。FIG. 13 is a diagram illustrating another example of a functional block diagram of a learning device according to the present disclosure. 本開示にかかる学習装置及び検出装置が実行する処理の一例を説明するための図である。1 is a diagram for explaining an example of processing executed by a learning device and a detection device according to the present disclosure. FIG. 本開示にかかる学習装置及び検出装置が実行する処理の他の一例を説明するための図である。11 is a diagram for explaining another example of the processing executed by the learning device and the detection device according to the present disclosure. FIG. 本開示にかかる学習装置の処理の流れの他の一例を示すフローチャートである。13 is a flowchart showing another example of the processing flow of the learning device according to the present disclosure. 本開示にかかる検出装置の機能ブロック図の他の一例を示す図である。FIG. 13 is a diagram showing another example of a functional block diagram of a detection device according to the present disclosure. 本開示にかかる検出装置の処理の流れの他の一例を示すフローチャートである。10 is a flowchart showing another example of the processing flow of the detection device according to the present disclosure.

　以下、本開示の実施形態について、図面を用いて説明する。本開示において図面は、１以上の実施形態に関連付けられる。また、全ての図面において、同様な構成要素には同様の符号を付し、適宜説明を省略する。 Below, an embodiment of the present disclosure will be described with reference to the drawings. In this disclosure, the drawings relate to one or more embodiments. In addition, in all drawings, similar components are given similar reference symbols, and descriptions will be omitted as appropriate.

＜第１の実施形態＞
　図１は、学習装置２０の概要を示す機能ブロック図である。図２は、学習装置２０が実行する処理の流れの一例を示すフローチャートである。 First Embodiment
Fig. 1 is a functional block diagram showing an overview of the learning device 20. Fig. 2 is a flowchart showing an example of the flow of processing executed by the learning device 20.

　図１に示すように、学習装置２０は、学習部２１を有する。この機能部により、図２の処理が実行される。 As shown in FIG. 1, the learning device 20 has a learning unit 21. This functional unit executes the process in FIG. 2.

　学習部２１は、「学習用撮影画像」と「電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像」とを合成した複数の学習用合成画像を用いて学習モデルを学習する（Ｓ１０）。 The learning unit 21 trains the learning model using multiple synthetic learning images that are a combination of "learning captured images" and "learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves" (S10).

　このような学習用合成画像を用いて学習モデルを学習する学習装置２０によれば、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の両方を学習した学習モデルを生成することができる。この学習モデルは、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の両方に基づき、物体検出や物体認識を行うことができる。「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の一方だけでなく、両方を学習することで、学習モデルによる物体検出や物体認識の精度が向上する。 The learning device 20, which uses such synthetic learning images to train a learning model, can generate a learning model that has learned both the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves." This learning model can perform object detection and object recognition based on both the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves." By learning not just one of the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of electromagnetic waves," but both, the accuracy of object detection and object recognition by the learning model is improved.

　また、学習装置２０は、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」を合成画像の生成という特徴的な手段で統合し、それらをまとめて学習する。「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」各々を個別に学習するのでなく、それらを特徴的な手段で統合した合成画像でまとめて学習することで、相乗効果による物体検出や物体認識の精度向上が期待される。 The learning device 20 also integrates the "features of the captured image" and the "features of the reflected wave information indicating the reflected electromagnetic waves" by a unique method of generating a synthetic image, and learns them together. Rather than learning the "features of the captured image" and the "features of the reflected wave information indicating the reflected electromagnetic waves" individually, learning them together as a synthetic image that is integrated by a unique method is expected to have a synergistic effect and improve the accuracy of object detection and object recognition.

＜第２の実施形態＞
　図３は、検出装置１０の概要を示す機能ブロック図である。図４は、検出装置１０が実行する処理の流れの一例を示すフローチャートである。 Second Embodiment
Fig. 3 is a functional block diagram showing an overview of the detection device 10. Fig. 4 is a flowchart showing an example of the flow of processing executed by the detection device 10.

　図３に示すように、検出装置１０は、検出部１１を有する。この機能部により、図４の処理が実行される。 As shown in FIG. 3, the detection device 10 has a detection unit 11. This functional unit executes the process in FIG. 4.

　検出部１１は、「対象領域を撮影した処理対象撮影画像」と「対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像」とを合成した処理対象合成画像に基づき、検出対象を検出する（Ｓ２０）。 The detection unit 11 detects the detection target based on a composite image of the processing target, which is a combination of a "processing target captured image of the target area" and a "processing target two-dimensional image generated based on reflected wave information indicating the reflected waves of the electromagnetic waves irradiated to the target area" (S20).

　このような処理対象合成画像を用いて検出対象を検出する検出装置１０によれば、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の両方に基づき、検出対象を検出することができる。「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の一方だけでなく、両方を用いることで、検出対象の検出精度が向上する。 The detection device 10, which detects the detection target using such a composite image to be processed, can detect the detection target based on both the "characteristics of the captured image" and the "characteristics of the reflected wave information indicating the reflected wave of the electromagnetic wave." By using not just one but both of the "characteristics of the captured image" and the "characteristics of the reflected wave information indicating the reflected wave of the electromagnetic wave," the detection accuracy of the detection target is improved.

　また、検出装置１０は、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」を合成画像の生成という特徴的な手段で統合し、その合成画像に基づき検出対象を検出する。「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」各々を個別に処理するのでなく、それらを特徴的な手段で統合した合成画像でまとめて処理することで、相乗効果による検出精度の向上が期待される。 In addition, the detection device 10 integrates the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of the electromagnetic waves" by a unique method of generating a composite image, and detects the detection target based on the composite image. Rather than processing the "features of the captured image" and the "features of the reflected wave information indicating the reflected waves of the electromagnetic waves" separately, they are integrated by a unique method and processed together as a composite image, which is expected to have a synergistic effect and improve detection accuracy.

＜第３の実施形態＞
「概要」
　第３の実施形態の学習装置２０は、第１の実施形態の学習装置２０の構成を具体化したものである。 Third Embodiment
"overview"
The learning device 20 of the third embodiment is a specific embodiment of the configuration of the learning device 20 of the first embodiment.

　近年、言語モデルを用いて検出対象を認識する物体認識モデルや、言語モデルを用いて検出対象を検出する物体検出モデルが知られている。当該物体認識モデル及び物体検出モデルは、画像と言語を同時埋め込み空間で表現することが可能なモデルである。 In recent years, object recognition models that use language models to recognize detection targets and object detection models that use language models to detect detection targets have become known. These object recognition models and object detection models are models that can represent images and language in a joint embedding space.

　当該物体認識モデル及び物体検出モデルは、例えばニューラルネットワーク等の技術で得られた物体認識／物体検出の結果と物体に関する言語（物体の説明や表現）との関係を学習することで生成される。当該物体認識モデル及び物体検出モデルに基づき、検索条件（テキスト）で示される物体を撮影画像内で認識／検出することができる。当該技術は、例えば以下の文献等に開示されている。
「Radford, Alec, et al. "Learning transferable visual models from natural language supervision."International conference on machine learning. PMLR, 2021.」
「Li, Liunian Harold, et al. "Grounded language-image pre-training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.」 The object recognition model and the object detection model are generated by learning the relationship between the results of object recognition/detection obtained by a technology such as a neural network and language related to the object (description or expression of the object). Based on the object recognition model and the object detection model, the object indicated by the search criteria (text) can be recognized/detected in the captured image. The technology is disclosed in, for example, the following documents.
"Radford, Alec, et al. "Learning transferable visual models from natural language supervision."International conference on machine learning. PMLR, 2021."
"Li, Liunian Harold, et al. "Grounded language-image pre-training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022."

　上記文献等で開示されている物体認識モデル及び物体検出モデルは、図５に示すように、物体に関する言語（物体の説明や表現）と、撮影画像との相関関係を学習することで生成される。 The object recognition model and object detection model disclosed in the above-mentioned documents are generated by learning the correlation between language related to objects (descriptions and expressions of objects) and captured images, as shown in Figure 5.

　これに対し、学習装置２０は、図６に示すように、物体に関する言語（物体の説明や表現）と、合成画像との相関関係を学習することで学習モデルを生成する。合成画像は、「撮影画像」と「電磁波の反射波を示す反射波情報に基づく２次元画像」とを合成することで生成された画像である。学習装置２０は、上記文献に開示されている言語モデルを用いて検出対象を認識する物体認識モデルや、言語モデルを用いて検出対象を検出する物体検出モデルと、当該点で異なる。 In response to this, the learning device 20 generates a learning model by learning the correlation between language relating to an object (description or expression of the object) and a synthetic image, as shown in FIG. 6. The synthetic image is an image generated by synthesizing a "captured image" and a "two-dimensional image based on reflected wave information indicating reflected electromagnetic waves." In this respect, the learning device 20 differs from the object recognition model that recognizes a detection target using the language model disclosed in the above-mentioned document and the object detection model that detects a detection target using a language model.

　以下、学習装置２０の構成を詳細に説明する。 The configuration of the learning device 20 is explained in detail below.

「ハードウエア構成」
　まず、学習装置２０のハードウエア構成の一例を説明する。学習装置２０の各機能部は、ハードウエアとソフトウエアの任意の組み合わせによって実現される。その実現方法、装置にはいろいろな変形例があることは、当業者には理解されるところである。ソフトウエアは、予め装置を出荷する段階から格納されているプログラムや、ＣＤ（Compact Disc）等の記録媒体やインターネット上のサーバ等からダウンロードされたプログラム等を含む。 "Hardware Configuration"
First, an example of the hardware configuration of the learning device 20 will be described. Each functional unit of the learning device 20 is realized by any combination of hardware and software. Those skilled in the art will understand that there are various variations in the realization method and device. The software includes programs that are stored in the device before it is shipped, and programs downloaded from recording media such as CDs (Compact Discs) and servers on the Internet.

　図７は、学習装置２０のハードウエア構成を例示するブロック図である。図７に示すように、学習装置２０は、プロセッサ１Ａ、メモリ２Ａ、入出力インターフェイス３Ａ、周辺回路４Ａ、バス５Ａを有する。周辺回路４Ａには、様々なモジュールが含まれる。学習装置２０は周辺回路４Ａを有さなくてもよい。なお、学習装置２０は物理的及び／又は論理的に分かれた複数の装置で構成されてもよい。この場合、複数の装置各々が上記ハードウエア構成を備えることができる。 FIG. 7 is a block diagram illustrating an example of the hardware configuration of a learning device 20. As shown in FIG. 7, the learning device 20 has a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The learning device 20 does not have to have the peripheral circuit 4A. The learning device 20 may be composed of multiple devices that are physically and/or logically separated. In this case, each of the multiple devices can have the above hardware configuration.

　バス５Ａは、プロセッサ１Ａ、メモリ２Ａ、周辺回路４Ａ及び入出力インターフェイス３Ａが相互にデータを送受信するためのデータ伝送路である。プロセッサ１Ａは、例えばＣＰＵ、ＧＰＵ（Graphics Processing Unit）等の演算処理装置である。メモリ２Ａは、例えばＲＡＭ（Random Access Memory）やＲＯＭ（Read Only Memory）等のメモリである。入出力インターフェイス３Ａは、入力装置、外部装置、外部サーバ、外部センサ、カメラ等から情報を取得するためのインターフェイスや、出力装置、外部装置、外部サーバ等に情報を出力するためのインターフェイス等を含む。また、入出力インターフェイス３Ａはインターネット等の通信ネットワークに接続するためのインターフェイスを含む。入力装置は、例えばキーボード、マウス、マイク、物理ボタン、タッチパネル等である。出力装置は、例えばディスプレイ、スピーカ、プリンタ、メーラ等である。プロセッサ１Ａは、各モジュールに指令を出し、それらの演算結果をもとに演算を行うことができる。 The bus 5A is a data transmission path for the processor 1A, memory 2A, peripheral circuit 4A, and input/output interface 3A to send and receive data to each other. The processor 1A is an arithmetic processing device such as a CPU or a GPU (Graphics Processing Unit). The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory). The input/output interface 3A includes interfaces for acquiring information from an input device, an external device, an external server, an external sensor, a camera, etc., and interfaces for outputting information to an output device, an external device, an external server, etc. The input/output interface 3A also includes an interface for connecting to a communication network such as the Internet. Examples of input devices include a keyboard, a mouse, a microphone, a physical button, a touch panel, etc. Examples of output devices include a display, a speaker, a printer, a mailer, etc. The processor 1A can issue commands to each module and perform calculations based on the results of those calculations.

「機能構成」
　次に、学習装置２０の機能構成を詳細に説明する。図８に、学習装置２０の機能ブロック図の一例を示す。図示するように、学習装置２０は、学習部２１と、学習用撮影画像取得部２２と、学習用反射波処理部２３と、学習用合成部２４と、言語入力部２５とを有する。 "Function Configuration"
Next, a detailed description will be given of the functional configuration of the learning device 20. Fig. 8 shows an example of a functional block diagram of the learning device 20. As shown in the figure, the learning device 20 has a learning unit 21, a learning image acquisition unit 22, a learning reflected wave processing unit 23, a learning synthesis unit 24, and a language input unit 25.

　学習用撮影画像取得部２２は、学習用撮影画像を取得する。 The learning image acquisition unit 22 acquires learning images.

　「学習用撮影画像」は、学習装置２０による学習モデルの学習において学習データとして利用される撮影画像である。 "Learning images" are captured images that are used as learning data when the learning device 20 learns a learning model.

　「撮影画像」は、カメラで撮影することで生成された画像である。カメラは、可視光を検出し、画像化する。なお、カメラは、赤外線や紫外線等のその他の光を検出し、画像化してもよい。撮影画像は、静止画像であってもよいし、動画像であってもよい。撮影画像には、撮影日時及び撮影場所を示す情報が付与されてもよい。撮影画像が動画像である場合、フレーム画像毎に撮影日時及び撮影場所を特定可能な情報が付与されてもよい。 A "captured image" is an image generated by taking a picture with a camera. A camera detects visible light and converts it into an image. Note that a camera may also detect other types of light, such as infrared or ultraviolet light, and convert it into an image. A captured image may be a still image or a video. A captured image may be provided with information indicating the date and time of capture and the location where the image was taken. If the captured image is a video, information that identifies the date and time of capture and the location where the image was taken may be provided for each frame image.

　一例では、カメラは測定装置に搭載される。測定装置は移動が可能な装置である。測定装置は、例えばドローンなどの飛行体でもよいし、陸上を自走する装置でもよい。測定装置は、自動又は遠隔操作により移動可能である。測定装置は、例えば、予め登録されたルートに沿って自動で移動することができる。測定装置は移動しながら、カメラで所定のデータ収集領域を撮影する。データ収集領域は、学習データの収集や検出対象の検出等のために選択された領域である。データ収集領域は、地上の一部領域であってもよいし、建造物の一部領域であってもよいし、その他であってもよい。 In one example, the camera is mounted on the measuring device. The measuring device is a device that can be moved. The measuring device may be, for example, an air vehicle such as a drone, or may be a device that moves on land. The measuring device can be moved automatically or by remote control. The measuring device can move automatically, for example, along a route registered in advance. As the measuring device moves, it captures an image of a specified data collection area with the camera. The data collection area is an area selected for the purpose of collecting learning data, detecting a detection target, etc. The data collection area may be a part of the ground, a part of a building, or something else.

　他の例では、カメラは作業員に携帯される。作業員はカメラを携帯して移動しながら、カメラでデータ収集領域を撮影する。 In another example, the camera is carried by a worker who moves around and takes pictures of the data collection area with the camera.

　カメラが生成した撮影画像は、任意の手段で学習装置２０に入力される。例えば、カメラと学習装置２０とは互いに通信可能に構成されていてもよい。そして、カメラは、当該通信手段を介して、生成した撮影画像を学習装置２０に送信してもよい。カメラから学習装置２０への撮影画像の送信は、リアルタイム処理で行われてもよいし、バッチ処理で行われてもよい。 The captured images generated by the camera are input to the learning device 20 by any means. For example, the camera and the learning device 20 may be configured to be able to communicate with each other. The camera may then transmit the captured images generated to the learning device 20 via the communication means. The transmission of the captured images from the camera to the learning device 20 may be performed by real-time processing or batch processing.

　その他、カメラが生成した撮影画像は、任意の記憶装置内に蓄積されてもよい。当該記憶装置は、カメラ内に設けられてもよいし、カメラと通信可能に構成された外部装置内に設けられてもよい。そして、任意のタイミングかつ任意の手段で、記憶装置内に蓄積された撮影画像が学習装置２０に入力されてもよい。 In addition, the captured images generated by the camera may be stored in any storage device. The storage device may be provided in the camera, or in an external device configured to be able to communicate with the camera. The captured images stored in the storage device may then be input to the learning device 20 at any time and by any means.

　学習用反射波処理部２３は、電磁波の反射波を示す反射波情報に基づき学習用２次元画像を生成する。 The learning reflected wave processing unit 23 generates a learning two-dimensional image based on the reflected wave information indicating the reflected electromagnetic waves.

　「電磁波」は、例えばミリ波であり、その波長の一例は０．３ＧＨｚ以上３００ＧＨｚ以下である。ただし、電磁波の帯域はミリ波に限定されない。電磁波は、近赤外線や遠赤外線等であってもよい。 The "electromagnetic wave" is, for example, a millimeter wave, and an example of its wavelength is 0.3 GHz or more and 300 GHz or less. However, the band of the electromagnetic wave is not limited to millimeter waves. The electromagnetic wave may be near infrared rays, far infrared rays, etc.

　「反射波」は、照射された電磁波が、照射した領域に存在する物体によって反射されたものである。電磁波を反射する物体がその領域に存在すると、反射波の強度は高くなる。電磁波を反射する物体は、主に金属によって形成されている場合が多い。 "Reflected waves" are electromagnetic waves that are reflected by objects in the irradiated area. If an object that reflects electromagnetic waves is present in that area, the strength of the reflected waves will be high. Objects that reflect electromagnetic waves are often made primarily of metal.

　このような電磁波の照射及び反射波の受信は、電磁波送受信装置により実現される。電磁波送受信装置は、電磁波を発信する電磁波送信部と、反射波を受信する電磁波受信部とを備える。電磁波送受信装置は、電磁波受信部を複数、例えば２つ備えてもよい。これら複数の電磁波受信部は互いに離れており、同一の電磁波送信部が照射した電磁波の反射波を受信する。このようにすると、物体の位置の検出精度は高くなる。 The irradiation of electromagnetic waves and the reception of reflected waves are achieved by an electromagnetic wave transmitting/receiving device. The electromagnetic wave transmitting/receiving device includes an electromagnetic wave transmitting unit that transmits electromagnetic waves, and an electromagnetic wave receiving unit that receives reflected waves. The electromagnetic wave transmitting/receiving device may include multiple electromagnetic wave receiving units, for example, two. These multiple electromagnetic wave receiving units are spaced apart from each other, and receive reflected waves of the electromagnetic waves irradiated by the same electromagnetic wave transmitting unit. In this way, the accuracy of detecting the position of an object is increased.

　電磁波送信部が用いる発信方式は、例えば、ＦＭＣＷ（Ｆｒｅｑｕｅｎｃｙ　Ｍｏｄｕｌａｔｅｄ　Ｃｏｎｔｉｎｕｏｕｓ　Ｗａｖｅ）、パルス、ＣＷ（Ｃｏｎｔｉｎｕｏｕｓ　Ｗａｖｅ）ドップラー、２周波ＣＷ、及びパルス圧縮のいずれかであるが、これら以外であってもよい。 The transmission method used by the electromagnetic wave transmitter is, for example, one of FMCW (Frequency Modulated Continuous Wave), pulse, CW (Continuous Wave) Doppler, two-frequency CW, and pulse compression, but may be other than these.

　一例では、上述したカメラを搭載した測定装置が、さらに、電磁波送受信装置を備える。測定装置は移動しながら、カメラでデータ収集領域を撮影するとともに、電磁波送受信装置でデータ収集領域に電磁波を照射し、その反射波を受信する。 In one example, the measuring device equipped with the camera described above further includes an electromagnetic wave transmitting and receiving device. As the measuring device moves, it captures images of the data collection area with the camera, and uses the electromagnetic wave transmitting and receiving device to irradiate electromagnetic waves onto the data collection area and receive the reflected waves.

　他の例では、電磁波送受信装置は作業員に携帯される。作業員はカメラ及び電磁波送受信装置を携帯して移動しながら、カメラでデータ収集領域を撮影するとともに、電磁波送受信装置でデータ収集領域に電磁波を照射し、その反射波を受信する。 In another example, the electromagnetic wave transmitting and receiving device is carried by a worker. While moving around carrying a camera and an electromagnetic wave transmitting and receiving device, the worker photographs the data collection area with the camera, and irradiates the data collection area with electromagnetic waves using the electromagnetic wave transmitting and receiving device and receives the reflected waves.

　「反射波情報」は、電磁波送受信装置により生成される。より詳細には、反射波情報は、電磁波受信部による反射波の受信結果に基づき生成される。反射波情報は、例えば反射波の強度の時系列情報を含んでいる。この時系列情報は、反射波の受信日時及びその時の反射波の強度の組み合わせを含んでいる。電磁波受信部が複数設けられている場合、複数の電磁波受信部別に反射波情報が生成される。 "Reflected wave information" is generated by the electromagnetic wave transmitting/receiving device. More specifically, the reflected wave information is generated based on the results of reception of the reflected wave by the electromagnetic wave receiving unit. The reflected wave information includes, for example, time series information on the strength of the reflected wave. This time series information includes a combination of the date and time the reflected wave was received and the strength of the reflected wave at that time. If multiple electromagnetic wave receiving units are provided, the reflected wave information is generated separately for each of the multiple electromagnetic wave receiving units.

　なお、電磁波送受信装置は、電磁波送受信装置の位置を示す位置情報を生成してもよい。位置情報は、例えば緯度、経度で示される。位置情報は、緯度、経度に加えて高度を含んでもよい。この位置情報は、例えばＧＰＳを用いて生成されてもよいし、他の方法、例えばＳＬＡＭ（Ｓｉｍｕｌｔａｎｅｏｕｓ　Ｌｏｃａｌｉｚａｔｉｏｎ　ａｎｄ　Ｍａｐｐｉｎｇ）を用いて行われてもよい。そして、電磁波送受信装置は、上述した反射波情報に、その反射波を受信した時の電磁波送受信装置の位置情報を加えてもよい。 The electromagnetic wave transmitting/receiving device may generate location information indicating the location of the electromagnetic wave transmitting/receiving device. The location information is indicated by, for example, latitude and longitude. The location information may include altitude in addition to latitude and longitude. This location information may be generated using, for example, GPS, or may be generated using other methods, such as SLAM (Simultaneous Localization and Mapping). The electromagnetic wave transmitting/receiving device may add location information of the electromagnetic wave transmitting/receiving device at the time the reflected wave was received to the above-mentioned reflected wave information.

　なお、上記位置情報は、反射波情報とは別の情報であってもよい。この場合、位置情報は、電磁波送受信装置の位置の時系列情報になる。この時系列情報は、日時とその日時における電磁波送受信装置の位置の組み合わせを含んでいる。 The above-mentioned location information may be information separate from the reflected wave information. In this case, the location information is time-series information on the location of the electromagnetic wave transmitting/receiving device. This time-series information includes a combination of date and time and the location of the electromagnetic wave transmitting/receiving device at that date and time.

　電磁波送受信装置が生成した反射波情報は、任意の手段で学習装置２０に入力される。例えば、電磁波送受信装置と学習装置２０とは互いに通信可能に構成されていてもよい。そして、電磁波送受信装置は、当該通信手段を介して、生成した反射波情報を学習装置２０に送信してもよい。電磁波送受信装置から学習装置２０への反射波情報の送信は、リアルタイム処理で行われてもよいし、バッチ処理で行われてもよい。 The reflected wave information generated by the electromagnetic wave transmitting/receiving device is input to the learning device 20 by any means. For example, the electromagnetic wave transmitting/receiving device and the learning device 20 may be configured to be able to communicate with each other. The electromagnetic wave transmitting/receiving device may then transmit the generated reflected wave information to the learning device 20 via the communication means. The transmission of the reflected wave information from the electromagnetic wave transmitting/receiving device to the learning device 20 may be performed by real-time processing or by batch processing.

　その他、電磁波送受信装置が生成した反射波情報は、任意の記憶装置内に蓄積されてもよい。当該記憶装置は、電磁波送受信装置内に設けられてもよいし、電磁波送受信装置と通信可能に構成された外部装置内に設けられてもよい。そして、任意のタイミング、かつ任意の手段で、記憶装置内に蓄積された反射波情報が学習装置２０に入力されてもよい。 In addition, the reflected wave information generated by the electromagnetic wave transmitting/receiving device may be stored in any storage device. The storage device may be provided within the electromagnetic wave transmitting/receiving device, or may be provided within an external device configured to be able to communicate with the electromagnetic wave transmitting/receiving device. The reflected wave information stored in the storage device may then be input to the learning device 20 at any time and by any means.

　「学習用２次元画像」は、学習用に生成された２次元画像である。 "2D training images" are 2D images generated for training purposes.

　「２次元画像」は、上述した反射波情報に基づき生成された２次元の画像である。以下、反射波情報から２次元画像を生成する処理を説明する。 A "two-dimensional image" is a two-dimensional image generated based on the reflected wave information described above. Below, the process of generating a two-dimensional image from reflected wave information is explained.

　まず、学習用反射波処理部２３は、反射波情報を処理することにより３次元情報を生成する。詳細には、反射波情報には反射波の強度及び電磁波送受信装置の位置の時系列信号が含まれている。学習用反射波処理部２３は、例えば、この時系列信号を構成する反射波に対して複数回ＦＦＴ（Ｆａｓｔ　Ｆｏｕｒｉｅｒ　Ｔｒａｎｓｆｏｒｍ）を行うことにより、電磁波受信部から反射波の起点となった反射点までの距離を算出する。電磁波受信部が複数ある場合、学習用反射波処理部２３は、この処理を電磁波受信部別に行う。そして学習用反射波処理部２３は、同一のタイミングで異なる電磁波受信部で測定された反射波に基づいた複数の距離を統合することにより、データ収集領域に対応する３次元空間に含まれる少なくとも１つの第１の点について、反射波の強度の推定値を算出する。そして学習用反射波処理部２３は、この処理を複数のタイミングで測定された反射波に対して行うことにより、複数の第１の点のそれぞれについて、反射波の強度の推定値を算出し、これらを３次元情報とする。なお、この推定値は、当該第１の点に電磁波を反射する物体が存在する可能性を示す値とみなすことができる。以下、この値を第１の値とする。ただし、３次元情報の生成方法、例えば第１の値の生成方法はこの例に限定されない。 First, the learning reflected wave processing unit 23 generates three-dimensional information by processing the reflected wave information. In detail, the reflected wave information includes a time series signal of the reflected wave intensity and the position of the electromagnetic wave transmitting and receiving device. For example, the learning reflected wave processing unit 23 calculates the distance from the electromagnetic wave receiving unit to the reflection point that is the origin of the reflected wave by performing FFT (Fast Fourier Transform) multiple times on the reflected wave that constitutes this time series signal. If there are multiple electromagnetic wave receiving units, the learning reflected wave processing unit 23 performs this processing for each electromagnetic wave receiving unit. Then, the learning reflected wave processing unit 23 calculates an estimate of the reflected wave intensity for at least one first point included in the three-dimensional space corresponding to the data collection area by integrating multiple distances based on reflected waves measured by different electromagnetic wave receiving units at the same timing. The learning reflected wave processing unit 23 then performs this processing on the reflected waves measured at multiple timings to calculate an estimate of the intensity of the reflected waves for each of the multiple first points, and treats these as three-dimensional information. Note that this estimate can be considered as a value indicating the possibility that an object that reflects electromagnetic waves is present at the first point. Hereinafter, this value will be referred to as the first value. However, the method of generating the three-dimensional information, for example the method of generating the first value, is not limited to this example.

　次いで、学習用反射波処理部２３は、３次元情報を所定の平面に投影することにより２次元情報を生成する。以下、この所定の平面を投影面と呼ぶ。投影面がデータ収集領域の表面（地面、建造物の外面等）に対してなす角度は１０°以下であるのが好ましい。すなわち、投影面は、データ収集領域の表面に水平であるのが好ましい。 Then, the learning reflected wave processing unit 23 generates two-dimensional information by projecting the three-dimensional information onto a specified plane. Hereinafter, this specified plane will be referred to as the projection plane. It is preferable that the angle that the projection plane makes with the surface of the data collection area (the ground, the outer surface of a building, etc.) is 10° or less. In other words, it is preferable that the projection plane is horizontal to the surface of the data collection area.

　図９は、学習用反射波処理部２３が行う処理の一例を説明するための図である。学習用反射波処理部２３は、第２の点に対応する複数の第１の点を特定する。例えば学習用反射波処理部２３は、投影面に垂直な方向から見た場合に、第２の点と重なる複数の第１の点を、当該第２の点に対応する第１の点とする。次いで学習用反射波処理部２３は、特定された複数の第１の点それぞれに対応する第１の値を特定し、当該第１の値を用いて、当該第２の点に電磁波を反射する物体が存在する可能性を示す第２の値を生成する。第２の値は、複数の第１の値の統計値（最大値、最小値、平均値、最頻値、中央値等）とすることができる。その他、第１の値が基準値を超えた第１の点のうち、最もデータ収集領域に近い第１の点に対応する第１の値を、第２の値としてもよい。そして学習用反射波処理部２３は、第２の点毎の第２の値を、２次元情報とする。この２次元情報から、例えば白黒の画像データ（２次元画像）を生成することができる。なお、２次元画像は、第２の点の第２の値を白黒で表現してもよいし、その他の色やその他の手段で表現してもよい。 9 is a diagram for explaining an example of processing performed by the learning reflected wave processing unit 23. The learning reflected wave processing unit 23 identifies a plurality of first points corresponding to the second point. For example, the learning reflected wave processing unit 23 determines a plurality of first points that overlap with the second point when viewed from a direction perpendicular to the projection surface as the first points corresponding to the second point. Next, the learning reflected wave processing unit 23 identifies a first value corresponding to each of the identified plurality of first points, and generates a second value indicating the possibility that an object that reflects electromagnetic waves is present at the second point using the first value. The second value can be a statistical value (maximum value, minimum value, average value, mode, median, etc.) of the plurality of first values. Alternatively, the first value corresponding to the first point closest to the data collection area among the first points whose first values exceed a reference value may be determined as the second value. Then, the learning reflected wave processing unit 23 determines the second value for each second point as two-dimensional information. From this two-dimensional information, for example, black and white image data (two-dimensional image) can be generated. Note that the two-dimensional image may represent the second value of the second point in black and white, or in other colors or other means.

　変形例として、学習装置２０は、データ収集領域の地質情報、及び反射波が生成された時の天候情報の少なくとも一方を考慮して、学習モデルを学習するように構成してもよい。例えば、データ収集領域のうち特定の領域の地質は、反射波を生成しやすいことがあり得る。また、天候によってはデータ収集領域の地表に水や雪がたまり、反射波に影響を与えることがあり得る。学習用反射波処理部２３は、この影響を、上記２次元画像の生成に反映させることができる。 As a variant, the learning device 20 may be configured to learn the learning model by taking into account at least one of the geological information of the data collection area and the weather information at the time the reflected waves were generated. For example, the geology of a particular area of the data collection area may be more conducive to generating reflected waves. In addition, depending on the weather, water or snow may accumulate on the ground surface of the data collection area, affecting the reflected waves. The learning reflected wave processing unit 23 can reflect this effect in the generation of the two-dimensional image.

　例えば、学習用反射波処理部２３は、データ収集領域の地質に応じたパラメータを第１の値又は第２の値に乗じた上で、３次元情報又は２次元情報を生成してもよい。このパラメータは予め設定されている。また、学習用反射波処理部２３は、測定時の天候に応じたパラメータを第１の値又は第２の値に乗じた上で、３次元情報又は２次元情報を生成してもよい。このパラメータも予め設定されている。 For example, the learning reflected wave processing unit 23 may generate three-dimensional information or two-dimensional information by multiplying a parameter corresponding to the geology of the data collection area by a first value or a second value. This parameter is set in advance. The learning reflected wave processing unit 23 may also generate three-dimensional information or two-dimensional information by multiplying a parameter corresponding to the weather at the time of measurement by a first value or a second value. This parameter is also set in advance.

　なお、地質情報及び天候情報は、例えば学習装置２０のユーザによって学習装置２０に入力されてもよいし、学習装置２０がこれらを記憶したデータベースから取得してもよい。 In addition, the geological information and weather information may be input to the learning device 20 by, for example, a user of the learning device 20, or may be obtained by the learning device 20 from a database in which the geological information and weather information are stored.

　ここで説明した変形例は、以下の実施形態で詳細を説明する検出装置１０においても適用可能である。 The modified examples described here can also be applied to the detection device 10, which will be described in detail in the following embodiment.

　図８に戻り、学習用合成部２４は、学習用撮影画像と学習用２次元画像を合成して学習用合成画像を生成する。学習用撮影画像と学習用２次元画像は同様のサイズであることが好ましい。 Returning to FIG. 8, the learning synthesis unit 24 synthesizes the learning photographed image and the learning two-dimensional image to generate a learning synthesized image. It is preferable that the learning photographed image and the learning two-dimensional image are of similar size.

　学習用合成部２４は、図１０に示すように、所定のブレンド比率で学習用撮影画像と学習用２次元画像をブレンド（合成）して、学習用合成画像を生成する。 As shown in FIG. 10, the learning synthesis unit 24 blends (synthesizes) the learning captured image and the learning two-dimensional image at a predetermined blend ratio to generate a learning synthetic image.

　学習用合成部２４は、１組の学習用撮影画像及び学習用２次元画像から、１つの学習用合成画像を生成してもよい。この場合、ブレント比率は予め定められていてもよい。 The training synthesis unit 24 may generate a training synthetic image from a set of training photographic images and a training two-dimensional image. In this case, the Blend ratio may be determined in advance.

　その他、学習用合成部２４は、１組の学習用撮影画像及び学習用２次元画像から、学習用撮影画像及び学習用２次元画像のブレンド比率が互いに異なる複数の学習用合成画像を生成してもよい。このようにすれば、１組の学習用撮影画像及び学習用２次元画像から多数の学習用合成画像を生成することができて好ましい。また、様々なブレンド比率の学習用合成画像を学習できて好ましい。 In addition, the learning synthesis unit 24 may generate multiple learning composite images having different blend ratios of the learning photographed images and learning two-dimensional images from a set of learning photographed images and learning two-dimensional images. In this way, it is preferable that a large number of learning composite images can be generated from a set of learning photographed images and learning two-dimensional images. It is also preferable that learning composite images with various blend ratios can be learned.

　その他、学習用合成部２４は、画像内の一部領域毎にブレンド比率が異なる学習用合成画像を生成してもよい。例えば、学習用合成部２４は、広く知られた物体検出技術やセマンティックセグメンテーション等を用いて検出された物体領域毎に、ブレント比率を異ならせてもよい。このようにすれば、１組の学習用撮影画像及び学習用２次元画像から多数の学習用合成画像を生成することができて好ましい。また、様々なブレンド比率の学習用合成画像を学習できて好ましい。 In addition, the learning synthesis unit 24 may generate a learning synthetic image in which the blending ratio differs for each partial region in the image. For example, the learning synthesis unit 24 may vary the blending ratio for each object region detected using widely known object detection technology, semantic segmentation, or the like. This is preferable because it makes it possible to generate a large number of learning synthetic images from a set of learning photographed images and learning two-dimensional images. It is also preferable because it makes it possible to learn from learning synthetic images with various blending ratios.

　学習用合成部２４は、同じ物体に関する情報が含まれる学習用撮影画像及び学習用２次元画像を合成して、学習用合成画像を生成する。これを実現するため、学習用合成部２４は、同じタイミングで撮影／測定された学習用撮影画像及び学習用２次元画像を合成してもよい。その他、学習用合成部２４は、同じ位置で撮影／測定された学習用撮影画像及び学習用２次元画像を合成してもよい。その他、学習用合成部２４は、同じタイミングかつ同じ位置で撮影／測定された学習用撮影画像及び学習用２次元画像を合成してもよい。ここでの「同じタイミング」は完全一致であってもよいし、数秒程度までの時間のずれを許容する概念であってもよい。また、ここでの「同じ位置」は完全一致であってもよいし、数センチメートルから数メートル程度までの位置のずれを許容する概念であってもよい。 The learning synthesis unit 24 synthesizes a learning photographed image and a learning two-dimensional image containing information about the same object to generate a learning composite image. To achieve this, the learning synthesis unit 24 may synthesize a learning photographed image and a learning two-dimensional image that have been photographed/measured at the same time. Alternatively, the learning synthesis unit 24 may synthesize a learning photographed image and a learning two-dimensional image that have been photographed/measured at the same position. Alternatively, the learning synthesis unit 24 may synthesize a learning photographed image and a learning two-dimensional image that have been photographed/measured at the same time and in the same position. Here, the "same timing" may be a perfect match, or may be a concept that allows for a time difference of up to a few seconds. Also, here, the "same position" may be a perfect match, or may be a concept that allows for a position difference of up to a few centimeters to a few meters.

　なお、学習用合成部２４は、学習用撮影画像内におけるある物体の位置と、学習用２次元画像内におけるその物体の位置とを略一致させる処理を行った後に、画像の合成を行ってもよい。ただし、当該処理は必須ではない。互いに合成する学習用撮影画像及び学習用２次元画像に同じ物体の情報が含まれていることは要求されるが、その物体の画像内の位置の一致は必須でない。本発明者らは、物体の画像内の位置が一致していなくても、十分な学習効果及び検出結果が得られることを確認している。しかし、物体の画像内の位置が一致していると、そうでない場合に比べて、より好ましい学習効果や検出結果が得られることが期待される。学習用合成部２４は、例えば以下のズレ補正処理１又は２を実行してもよい。 The learning synthesis unit 24 may synthesize the images after performing a process to approximately match the position of an object in the learning captured image with the position of the object in the learning two-dimensional image. However, this process is not essential. It is required that the learning captured image and the learning two-dimensional image to be synthesized together contain information of the same object, but it is not essential that the positions of the objects in the images match. The inventors have confirmed that sufficient learning effects and detection results can be obtained even if the positions of the objects in the images do not match. However, if the positions of the objects in the images match, it is expected that more favorable learning effects and detection results can be obtained compared to when they do not. The learning synthesis unit 24 may, for example, perform the following misalignment correction process 1 or 2.

"ズレ補正処理１"
　一例では、ユーザ入力でズレを補正する。学習用合成部２４は、図１０に示すような撮影画像及び２次元画像を同時に画面表示する。ユーザは、撮影画像及び２次元画像の少なくとも一方の位置をずらし、各画像で示されるある物体の位置が互いに重なり合う関係とする。ユーザは、撮影画像で示される物体の形状と、２次元画像における第２の点の集合の形状とに基づき、２次元画像における第２の点の集合が撮影画像内のどの物体と対応するのか特定することができる。 "Misalignment correction process 1"
In one example, the misalignment is corrected by user input. The learning synthesis unit 24 simultaneously displays the captured image and the two-dimensional image as shown in FIG. 10 on the screen. The user shifts the position of at least one of the captured image and the two-dimensional image so that the positions of an object shown in each image overlap each other. The user can specify which object in the captured image corresponds to the second set of points in the two-dimensional image based on the shape of the object shown in the captured image and the shape of the second set of points in the two-dimensional image.

"ズレ補正処理２"
　学習用合成部２４は、カメラと電磁波送受信装置の位置に基づき、撮影画像を、２次元画像と同じ座標系に変形してもよい。カメラ及び電磁波送受信装置が同一の測定装置に搭載されている場合、そのカメラ及び電磁波送受信装置の距離、向き等に基づき、当該変形を実現することができる。予め、それらの位置情報に基づき、撮影画像を所定の座標系に変形する変形ルールが用意されている。学習用合成部２４は、その変形ルールを用いて、撮影画像の上記変形を実行する。画像変形の方法としては、例えばアフィン変換やホモグラフィ変換などの画像変換が例示されるが、これらに限定されない。変形後の撮影画像の視点は、２次元画像と同様の視点となる。結果、撮影画像内の所定位置（例：中心）と、２次元画像内の同所定位置（例：中心）は、データ収集領域内の同じ位置の情報を示すこととなる。 "Misalignment correction process 2"
The learning synthesis unit 24 may transform the captured image into the same coordinate system as the two-dimensional image based on the positions of the camera and the electromagnetic wave transmitting and receiving device. When the camera and the electromagnetic wave transmitting and receiving device are mounted on the same measurement device, the transformation can be realized based on the distance, orientation, etc. of the camera and the electromagnetic wave transmitting and receiving device. A transformation rule for transforming the captured image into a predetermined coordinate system based on the position information is prepared in advance. The learning synthesis unit 24 executes the above-mentioned transformation of the captured image using the transformation rule. Examples of the image transformation method include image transformation such as affine transformation and homography transformation, but are not limited to these. The viewpoint of the captured image after transformation is the same as that of the two-dimensional image. As a result, a predetermined position (e.g., center) in the captured image and the same predetermined position (e.g., center) in the two-dimensional image indicate information of the same position in the data collection area.

　図８に戻り、言語入力部２５は、学習用合成画像に写る物体を表現したテキストを取得する。テキストは、単語、文章、プロンプト等である。物体を表現したテキストは、例えば物体の種類や、物体の外観の特徴（色、大きさ、形状等）等を含むことができる。テキストの一例としては、「青い缶」等が例示される。 Returning to FIG. 8, the language input unit 25 acquires text that describes an object that appears in the synthetic training image. The text may be a word, a sentence, a prompt, etc. The text that describes an object may include, for example, the type of object, or the external characteristics of the object (color, size, shape, etc.). An example of text is "blue can."

　例えば、ユーザが、学習用合成画像に写る物体を表現したテキストを学習装置２０に入力してもよい。ユーザは、学習用合成画像の元となった学習用撮影画像を視認し、学習用合成画像に写る物体を特定してもよい。なお、ユーザは、その他の手段で学習用合成画像に写る物体を特定してもよい。例えば、データ収集領域に所定の物体を配置して撮影／反射情報の生成等を行うことで学習データを生成する場合は、ユーザは、学習用撮影画像を視認しなくても、データ収集領域に配置した物体をその他の任意の手段で特定することができる。 For example, a user may input text describing an object appearing in the training composite image into the training device 20. The user may visually recognize the training photographed image that was the basis for the training composite image, and identify the object appearing in the training composite image. The user may also identify the object appearing in the training composite image by other means. For example, when training data is generated by placing a specific object in the data collection area and generating photographic/reflection information, the user can identify the object placed in the data collection area by any other means, without visually recognizing the training photographed image.

　その他の例として、機械学習等で予め生成された認識モデル（分類器など）に、学習用合成画像の元となった学習用撮影画像を入力することで得られた認識結果が、上記テキストとして学習装置２０に入力されてもよい。 As another example, the recognition result obtained by inputting the learning captured image that is the basis of the learning synthetic image into a recognition model (such as a classifier) that has been generated in advance by machine learning or the like may be input to the learning device 20 as the above text.

　学習部２１は、複数の学習用合成画像と、各学習用合成画像に写る物体を表現したテキストを用いて学習モデルを学習する。学習モデルは、言語モデルを用いた物体認識モデル又は物体検出モデルである。 The learning unit 21 trains a learning model using multiple training synthetic images and text that describes objects that appear in each training synthetic image. The learning model is an object recognition model or object detection model that uses a language model.

　言語モデルを用いた物体認識モデル又は物体検出モデルは、上述の通り、画像と言語を同時埋め込み空間で表現することが可能なモデルである。当該物体認識モデル及び物体検出モデルは、例えばニューラルネットワーク等の技術で得られた物体認識／物体検出の結果と物体に関する言語（物体の説明や表現）との関係を学習することで生成される。当該物体認識モデル及び物体検出モデルに基づき、検索条件（テキスト）で示される物体を撮影画像内で認識／検出することができる。 As described above, an object recognition model or object detection model using a language model is a model capable of expressing images and language in a joint embedding space. The object recognition model and object detection model are generated by learning the relationship between the results of object recognition/object detection obtained using technology such as neural networks and the language related to objects (descriptions and expressions of objects). Based on the object recognition model and object detection model, objects indicated by the search criteria (text) can be recognized/detected within the captured image.

　学習部２１は、学習用合成画像に写る物体と、その物体を表現したテキストとの相関関係を学習する。学習部２１は、広く知られた、撮影画像に基づき「撮影画像に写る物体」と「その物体を表現したテキスト」との相関関係を学習して物体認識モデル又は物体検出モデルを生成する技術を利用して、本実施形態の学習モデルを生成することができる。すなわち、学習部２１は、広く知られた当該技術において、「撮影画像」を「学習用合成画像」に置き換えて、学習モデルを学習することができる。 The learning unit 21 learns the correlation between an object appearing in a synthetic training image and text expressing that object. The learning unit 21 can generate the learning model of this embodiment by utilizing a widely known technology that learns the correlation between "an object appearing in a captured image" and "text expressing that object" based on a captured image to generate an object recognition model or an object detection model. In other words, the learning unit 21 can learn the learning model by replacing the "captured image" with the "synthetic training image" in this widely known technology.

　以下、学習部２１が実行する処理の一例を説明するが、学習部２１の処理はこれに限定されない。 Below, an example of the processing performed by the learning unit 21 is described, but the processing of the learning unit 21 is not limited to this.

　まず、学習部２１は、言語入力部２５にて入力された一つ以上のテキストをトークンに分割し、これらから言語特徴を抽出することができる。言語特徴は、広く知られたあらゆる技術を用いて抽出することができる。言語特徴の抽出においては、例えば以下の文献に記載のトランスフォーマーを用いてもよい。例えば、学習部２１は、レイヤー数１２、隠れ層の次元５１２、アテンション・ヘッドを８とした６３００万パラメータのトランスフォーマーを用いて言語特徴を抽出してもよい。
「Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.」 First, the learning unit 21 can divide one or more texts inputted by the language input unit 25 into tokens and extract language features from these. The language features can be extracted using any widely known technology. For example, the transformer described in the following document may be used to extract the language features. For example, the learning unit 21 may extract the language features using a 63 million parameter transformer with 12 layers, 512 hidden layer dimensions, and 8 attention heads.
"Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021."

　また、学習部２１は、学習用合成画像から画像特徴を抽出することができる。画像特徴は、広く知られたあらゆる技術を用いて抽出することができる。画像特徴の抽出においては、ResNet-50、ResNet-101、ResNet-50×4、ResNet-50×16、ResNet-50×64、ViT-B/32、ViT-B/16、ViT-L/14等を用いてもよい。また、例えば以下の文献に記載の技術を用いて画像特徴を抽出してもよい。
「He, Kaiming, et al. "Deep residual learning for image recognition."Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.」
「Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale."arXiv preprint arXiv:2010.11929(2020).」 Furthermore, the learning unit 21 can extract image features from the synthetic image for learning. The image features can be extracted using any widely known technology. ResNet-50, ResNet-101, ResNet-50×4, ResNet-50×16, ResNet-50×64, ViT-B/32, ViT-B/16, ViT-L/14, etc. may be used to extract the image features. Furthermore, the image features may be extracted using, for example, the technology described in the following document.
"He, Kaiming, et al. "Deep residual learning for image recognition."Proceedings of the IEEE conference on computer vision and pattern recognition. 2016."
"Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale."arXiv preprint arXiv:2010.11929(2020)."

　学習部２１は、言語特徴、学習用合成画像、及び画像特徴の中の少なくとも１つに対し、撮影画像用に生成／調整された学習モデルに適応させるための調整を行うことができる。 The learning unit 21 can adjust at least one of the language features, the synthetic training image, and the image features to adapt them to the learning model generated/adjusted for the captured image.

　言語特徴の調整では、学習部２１は、言語入力部２５にて入力されたテキストから抽出された言語特徴（特徴ベクトル）を、非線形関数あるいは線形関数により変換し、例えば入力された特徴と同じ次元の特徴を生成する。非線形関数としては、例えば一層以上のＭＬＰ（全結合ニューラルネットワーク）を用いることができる。また、線形関数としては、スケール（係数）とオフセット（加算部分）を最適化パラメータとする線形関数を用いることができる。言語特徴の抽出に比べ、最適化パラメータの数が小さくなる関数であることが好ましい。 In adjusting language features, the learning unit 21 transforms the language features (feature vectors) extracted from the text inputted by the language input unit 25 using a nonlinear function or a linear function, for example to generate features of the same dimensions as the input features. As the nonlinear function, for example, a fully connected neural network (MLP) with one or more layers can be used. Also, as the linear function, a linear function with a scale (coefficient) and an offset (addition part) as optimization parameters can be used. It is preferable that the function has a smaller number of optimization parameters than the extraction of language features.

　学習用合成画像の調整では、学習部２１は、学習用合成部２４によって生成された学習用合成画像を、非線形関数あるいは線形関数により変換し、例えば入力された画像と同じ次元の画像を生成する。非線形関数としては、例えば一層以上のＭＬＰを用いることができる。また、線形関数としては、スケール（係数）とオフセット（加算部分）を最適化パラメータとする線形関数を用いることができる。例えば、スケールとオフセットを最適化するパラメータは、明るさの補正に相当する。当該補正により、学習用合成画像の画素値の統計指標（画素値の範囲等）を、撮影画像の画素値の統計指標に近づけることができる。画像特徴の抽出に比べ、最適化パラメータの数が小さくなる関数であることが好ましい。学習用合成画像の調整を行う場合、学習部２１は、調整後の学習用合成画像から画像特徴を抽出する。 In adjusting the training composite image, the learning unit 21 converts the training composite image generated by the training synthesis unit 24 using a nonlinear function or a linear function to generate an image of the same dimensions as the input image, for example. As the nonlinear function, for example, an MLP with one or more layers can be used. As the linear function, a linear function with the optimization parameters being the scale (coefficient) and the offset (addition part) can be used. For example, the parameters for optimizing the scale and offset correspond to the correction of brightness. This correction makes it possible to bring the statistical index of the pixel values of the training composite image (such as the range of pixel values) closer to the statistical index of the pixel values of the captured image. It is preferable that the function has a smaller number of optimization parameters than the extraction of image features. When adjusting the training composite image, the learning unit 21 extracts image features from the adjusted training composite image.

　画像特徴の調整では、学習部２１は、上記調整を行っていない学習用合成画像又は上記調整を行った学習用合成画像から抽出された画像特徴（特徴ベクトル）を、非線形関数あるいは線形関数により変換し、例えば入力された特徴と同じ次元の特徴を生成する。非線形関数としては、例えば一層以上のＭＬＰを用いることができる。また、線形関数としては、スケール（係数）とオフセット（加算部分）を最適化パラメータとする線形関数を用いることができる。当該補正により、学習用合成画像から抽出される画像特徴の統計指標を、撮影画像から抽出される特徴量の統計指標に近づけることができる。画像特徴の抽出に比べ、最適化パラメータの数が小さくなる関数であることが好ましい。 In adjusting the image features, the learning unit 21 transforms the image features (feature vectors) extracted from the training synthetic image that has not been adjusted or the training synthetic image that has been adjusted, using a nonlinear function or a linear function to generate features of the same dimension as the input features, for example. As the nonlinear function, for example, an MLP with one or more layers can be used. Also, as the linear function, a linear function with scale (coefficient) and offset (additive part) as optimization parameters can be used. This correction makes it possible to bring the statistical index of the image features extracted from the training synthetic image closer to the statistical index of the feature amount extracted from the captured image. It is preferable that the function has a smaller number of optimization parameters than the extraction of image features.

　学習部２１は、言語特徴、学習用合成画像、及び画像特徴の中の少なくとも１つに対して上記調整を行った後のデータを用いて、以下の処理を行うことができる。 The learning unit 21 can perform the following processing using the data after the above adjustments have been made to at least one of the language features, the synthetic training image, and the image features.

　学習部２１は、言語特徴と画像特徴の相関関係を例えばコサイン類似度にて算出する。例えば、学習部２１は、言語特徴及び画像特徴のそれぞれに対し正規化処理を行い、これらの正規化後のベクトル間の内積を算出することで、相関関係を算出する。より詳しく説明すると、学習部２１には、複数の学習用合成画像各々に対する画像特徴と、各々とペアとなる言語特徴が入力される。そして、学習部２１は、ペアとなる画像特徴と言語特徴のコサイン類似度だけでなく、ペアとなっていない画像特徴と言語特徴同士のコサイン類似度も算出する。すなわち、学習部２１にＮ組のペアの言語特徴及び画像特徴が入力された場合、学習部２１は、Ｎ×Ｎ個のコサイン類似度を算出する。 The learning unit 21 calculates the correlation between the language feature and the image feature, for example, using cosine similarity. For example, the learning unit 21 performs normalization processing on each of the language feature and the image feature, and calculates the inner product between these normalized vectors to calculate the correlation. To explain in more detail, the learning unit 21 receives image features for each of a plurality of training synthetic images and language features paired with each of them. The learning unit 21 then calculates not only the cosine similarity between the paired image feature and language feature, but also the cosine similarity between unpaired image features and language features. In other words, when N pairs of language features and image features are input to the learning unit 21, the learning unit 21 calculates N x N cosine similarities.

　次いで、学習部２１は、算出した相関関係から、正しいペアは損失が小さく、誤ったペアは損失が大きくなることを特徴とする損失値を算出する。そして、学習部２１は、損失値がより小さくなるように、上述した言語特徴の調整、学習用合成画像の調整、及び画像特徴の調整の中の少なくとも１つのパラメータを更新する。なお、学習部２１は、さらにその他のパラメータを更新してもよい。 Then, the learning unit 21 calculates a loss value based on the calculated correlation, characterized in that correct pairs have small losses and incorrect pairs have large losses. The learning unit 21 then updates at least one parameter among the above-mentioned adjustment of language features, adjustment of synthetic images for training, and adjustment of image features so as to reduce the loss value. Note that the learning unit 21 may further update other parameters.

　学習部２１は、広く知られたあらゆる技術を用いて、損失値を算出することができる。例えば、学習部２１は、Ｎ×Ｎ個のコサイン類似度をＮ×Ｎの行列に並べ、行方向（横方向）にクロスエントロピー誤差を計算し、さらに列方向（縦方向）に損失を計算する。そして、学習部２１はこれらのクロスエントロピーの平均値を損失値として算出することができる。 The learning unit 21 can calculate the loss value using any well-known technique. For example, the learning unit 21 arranges N x N cosine similarities in an N x N matrix, calculates the cross-entropy error in the row direction (horizontal direction), and further calculates the loss in the column direction (vertical direction). The learning unit 21 can then calculate the average value of these cross-entropies as the loss value.

　また、学習部２１は、広く知られた技術を用いた上述したパラメータを更新することができる。例えば、学習部２１は、逆誤差伝搬法等を用いてパラメータを更新してもよい。 The learning unit 21 can also update the above-mentioned parameters using widely known techniques. For example, the learning unit 21 may update the parameters using backpropagation or the like.

　次に、図１１のフローチャートを用いて、学習装置２０の処理の流れの一例を説明する。なお、各処理の詳細は上述したので、ここでは各処理の概要とともに、処理の流れの一例を説明する。 Next, an example of the process flow of the learning device 20 will be described using the flowchart in FIG. 11. Note that the details of each process have been described above, so here, an example of the process flow will be described along with an overview of each process.

　Ｓ３０では、学習装置２０は、物体を表現したテキストを取得する。物体は、学習用撮影画像や学習用合成画像に写る物体である。 In S30, the learning device 20 acquires text that describes an object. The object is an object that appears in the learning captured image or the learning composite image.

　Ｓ３１では、学習装置２０は、学習用撮影画像を取得する。Ｓ３２では、学習装置２０は、電磁波の反射波を示す反射波情報に基づき、学習用２次元画像を生成する。学習用撮影画像及び学習用２次元画像は、同じ物体の情報を示す。 In S31, the learning device 20 acquires a learning image. In S32, the learning device 20 generates a learning two-dimensional image based on reflected wave information indicating reflected electromagnetic waves. The learning image and the learning two-dimensional image indicate information of the same object.

　Ｓ３３では、学習装置２０は、学習用撮影画像と学習用２次元画像を合成して学習用合成画像を生成する。 In S33, the learning device 20 synthesizes the learning captured image and the learning two-dimensional image to generate a learning composite image.

　Ｓ３４では、学習装置２０は、Ｓ３０で取得したテキストと、Ｓ３３で生成した学習用合成画像を用いて学習モデルを学習する。学習モデルは、言語モデルを用いた物体認識モデル又は物体検出モデルである。 In S34, the learning device 20 learns a learning model using the text acquired in S30 and the synthetic learning image generated in S33. The learning model is an object recognition model or an object detection model that uses a language model.

　学習装置２０は、上記テキストから抽出された言語特徴、及び、上記学習用合成画像から抽出された画像特徴を用いて、学習モデルを学習する。学習装置２０は、上記言語特徴、上記学習用合成画像、及び上記画像特徴の中の少なくとも１つに対し、撮影画像用に生成／調整された学習モデルに適応させるための調整を行うことができる。そして、学習装置２０は、当該調整を行った後に、言語特徴と画像特徴の相関関係を算出し、算出した相関関係に基づく損失値がより小さくなるように、当該調整のパラメータ等を更新する。 The learning device 20 learns a learning model using the language features extracted from the text and the image features extracted from the synthetic training image. The learning device 20 can adjust at least one of the language features, the synthetic training image, and the image features to adapt them to the learning model generated/adjusted for the captured image. After making the adjustment, the learning device 20 calculates the correlation between the language features and the image features, and updates the parameters of the adjustment so that the loss value based on the calculated correlation is smaller.

「作用効果」
　学習装置２０は、「学習用撮影画像」と、「電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像」とを合成した学習用合成画像を用いて学習モデルを学習する。このような学習装置２０によれば、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の両方を学習した学習モデルを生成することができる。この学習モデルは、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の両方に基づき、物体検出や物体認識を行うことができる。「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の一方だけでなく、両方を学習することで、学習モデルによる物体検出や物体認識の精度が向上する。 "Action and effect"
The learning device 20 learns a learning model using a synthetic learning image obtained by synthesizing a "learning captured image" and a "learning two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves." Such a learning device 20 can generate a learning model that has learned both the "features of the captured image" and the "features of the reflected wave information indicating reflected waves of electromagnetic waves." This learning model can perform object detection and object recognition based on both the "features of the captured image" and the "features of the reflected wave information indicating reflected waves of electromagnetic waves." By learning not just one of the "features of the captured image" and the "features of the reflected wave information indicating reflected waves of electromagnetic waves," but both, the accuracy of object detection and object recognition by the learning model is improved.

　また、学習装置２０は、１組の学習用撮影画像及び学習用２次元画像から、学習用撮影画像及び学習用２次元画像のブレンド比率が互いに異なる複数の学習用合成画像を生成することができる。また、学習装置２０は、画像内の一部領域毎にブレンド比率が異なる学習用合成画像を生成することができる。このようにすれば、１組の学習用撮影画像及び学習用２次元画像から多数の学習用合成画像を生成することができて好ましい。また、様々なブレンド比率の学習用合成画像を学習できて好ましい。 The learning device 20 can also generate multiple learning composite images from a set of learning photographed images and learning two-dimensional images, with the blending ratios of the learning photographed images and learning two-dimensional images being different from one another. The learning device 20 can also generate learning composite images with different blending ratios for each partial area within the image. In this way, it is preferable that multiple learning composite images can be generated from a set of learning photographed images and learning two-dimensional images. It is also preferable that learning composite images with various blending ratios can be learned.

　また、学習装置２０は、言語特徴、学習用合成画像、及び画像特徴の中の少なくとも１つに対し、撮影画像用に生成／調整された学習モデルに適応させるための調整を行うことができる。そして、学習装置２０は、言語特徴と画像特徴の相関関係に基づく損失値がより小さくなるように、当該調整のパラメータを更新することができる。このような学習装置２０によれば、撮影画像用に生成／調整された既存の学習モデルを用いて、撮影画像及び２次元画像を合成した合成画像を学習した学習モデルを生成することができる。すなわち、撮影画像用に生成／調整された既存の学習モデルに、撮影画像及び２次元画像を合成した合成画像を学習させて、本実施形態の学習モデルを生成することができる。 The learning device 20 can also adjust at least one of the language features, the synthetic image for training, and the image features to adapt them to the learning model generated/adjusted for the captured image. The learning device 20 can then update the parameters for this adjustment so that the loss value based on the correlation between the language features and the image features becomes smaller. Such a learning device 20 can generate a learning model that learns a synthetic image that combines a captured image and a two-dimensional image using an existing learning model generated/adjusted for the captured image. In other words, the learning model of this embodiment can be generated by having an existing learning model generated/adjusted for the captured image learn a synthetic image that combines a captured image and a two-dimensional image.

＜第４の実施形態＞
　第４の実施形態の学習装置２０は、第３の実施形態の学習装置２０と異なる手段で、学習用撮影画像を取得する。以下、詳細に説明する。 Fourth Embodiment
The learning device 20 of the fourth embodiment acquires learning images by a means different from that of the learning device 20 of the third embodiment, which will be described in detail below.

　学習用撮影画像取得部２２は、言語入力部２５に入力された「物体を表現したテキスト」を検索クエリとして、複数の画像の中から検索クエリにマッチングする画像を検索する。そして、学習用撮影画像取得部２２は、検索結果に含まれる画像を学習用撮影画像とする。 The learning image acquisition unit 22 searches for images that match the search query from among multiple images, using the "text describing an object" input to the language input unit 25 as a search query. The learning image acquisition unit 22 then uses the images included in the search results as learning images.

　第３の実施形態で説明した通り、互いに合成する学習用撮影画像及び学習用２次元画像は同じ物体の情報を含んでいることが要求されるが、その物体の画像内の位置の一致は必須でない。「同じ物体」は種類や種別の一致を意味し、個体の一致までは要求されない。このため、互いに合成する学習用撮影画像及び学習用２次元画像は、同じ位置かつ同じ場所で、同じ個体を撮影したものでなくてもよい。このため、本実施形態の学習用撮影画像取得部２２は、上述した手段で学習用撮影画像を取得する。 As explained in the third embodiment, the learning photographed image and the learning two-dimensional image to be combined with each other are required to contain information about the same object, but the position of the object within the image does not necessarily have to match. "The same object" means matching in type or variety, and does not require matching of the individual object. Therefore, the learning photographed image and the learning two-dimensional image to be combined with each other do not have to be photographs of the same individual object taken at the same position and in the same place. For this reason, the learning photographed image acquisition unit 22 of this embodiment acquires the learning photographed image by the means described above.

　学習用撮影画像取得部２２は、インターネット上で公開されている膨大な画像の中から、検索クエリにマッチングする画像を検索してもよい。その他、予め、多数の画像を記憶したデータベースが生成されていてもよい。そして、学習用撮影画像取得部２２は、このデータベースの中から、検索クエリにマッチングする画像を検索してもよい。 The learning image acquisition unit 22 may search for images that match the search query from among the vast amount of images available on the Internet. Alternatively, a database that stores a large number of images may be generated in advance. The learning image acquisition unit 22 may then search for images that match the search query from within this database.

　なお、学習用撮影画像取得部２２は、１つの検索クエリに対応して１つの学習用撮影画像を取得してもよいし、１つの検索クエリに対応して複数の学習用撮影画像を取得してもよい。例えば、学習用撮影画像取得部２２は、検索クエリに基づき検索された複数の画像の中から、マッチング率が高い方から順に所定数の画像を学習用撮影画像として取得してもよい。１つの検索クエリに対応して複数の学習用撮影画像を取得する場合、学習用合成部２４は、ある物体の情報を含む１つの学習用２次元画像と、その物体の情報を含む複数の学習用撮影画像各々とを合成して、複数の学習用合成画像を生成することができる。学習用合成画像の数が増えると学習効果が向上して好ましい。 The learning image acquisition unit 22 may acquire one learning image in response to one search query, or may acquire multiple learning images in response to one search query. For example, the learning image acquisition unit 22 may acquire a predetermined number of images as learning images from among multiple images searched based on the search query, in descending order of matching rate. When multiple learning images are acquired in response to one search query, the learning synthesis unit 24 can generate multiple learning synthetic images by synthesizing one learning two-dimensional image containing information of an object with each of the multiple learning images containing information of the object. Increasing the number of learning synthetic images is preferable because it improves the learning effect.

　変形例として、学習用撮影画像取得部２２は、第３の実施形態で説明した手段と、本実施形態で説明した手段の両方で、学習用撮影画像を取得してもよい。学習用撮影画像の数が増えると、生成される学習用合成画像の数が増えて好ましい。 As a modified example, the learning image acquisition unit 22 may acquire learning images using both the means described in the third embodiment and the means described in this embodiment. Increasing the number of learning images is preferable because it increases the number of learning composite images that are generated.

　本実施形態の学習装置２０のその他の構成は、第１及び第３の実施形態の２０の構成と同様である。 The rest of the configuration of the learning device 20 in this embodiment is the same as the configuration of the learning device 20 in the first and third embodiments.

　本実施形態の学習装置２０によれば、第１及び第３の実施形態の学習装置２０と同様の作用効果が実現される。また、本実施形態の学習装置２０によれば、言語入力部２５に入力された「物体を表現したテキスト」を検索クエリとして検索することで、複数の画像の中から学習用撮影画像を取得することができる。このような学習装置２０によれば、多数の学習用撮影画像を取得し、多数の学習用合成画像を生成できて好ましい。また、一例ではカメラが不要になるので、費用負担や機材のメンテナンスの負担等を軽減できる。 The learning device 20 of this embodiment achieves the same effects as the learning device 20 of the first and third embodiments. Furthermore, the learning device 20 of this embodiment can acquire learning images from among multiple images by searching for "text expressing an object" input to the language input unit 25 as a search query. Such a learning device 20 is preferable because it can acquire a large number of learning images and generate a large number of learning composite images. Furthermore, in one example, a camera is not required, which can reduce the cost burden and the burden of equipment maintenance, etc.

＜第５の実施形態＞
「概要」
　第５の実施形態の検出装置１０は、第２の実施形態の検出装置１０の構成を具体化したものである。検出装置１０は、第１、３及び４の実施形態の学習装置２０が生成した学習モデルを用いて、検出対象を検出する。以下、本実施形態の検出装置１０の構成を詳細に説明する。 Fifth embodiment
"overview"
The detection device 10 of the fifth embodiment is a specific embodiment of the configuration of the detection device 10 of the second embodiment. The detection device 10 detects a detection target using a learning model generated by the learning device 20 of the first, third, and fourth embodiments. The configuration of the detection device 10 of this embodiment will be described in detail below.

「ハードウエア構成」
　まず、検出装置１０のハードウエア構成の一例を説明する。検出装置１０の各機能部は、ハードウエアとソフトウエアの任意の組み合わせによって実現される。その実現方法、装置にはいろいろな変形例があることは、当業者には理解されるところである。ソフトウエアは、予め装置を出荷する段階から格納されているプログラムや、ＣＤ（Compact Disc）等の記録媒体やインターネット上のサーバ等からダウンロードされたプログラム等を含む。 "Hardware Configuration"
First, an example of the hardware configuration of the detection device 10 will be described. Each functional unit of the detection device 10 is realized by any combination of hardware and software. Those skilled in the art will understand that there are various variations in the realization method and device. The software includes programs that are stored in the device before it is shipped, and programs downloaded from recording media such as CDs (Compact Discs) and servers on the Internet.

　図７は、検出装置１０のハードウエア構成を例示するブロック図である。図７に示すように、検出装置１０は、プロセッサ１Ａ、メモリ２Ａ、入出力インターフェイス３Ａ、周辺回路４Ａ、バス５Ａを有する。周辺回路４Ａには、様々なモジュールが含まれる。学習装置２０は周辺回路４Ａを有さなくてもよい。なお、検出装置１０は物理的及び／又は論理的に分かれた複数の装置で構成されてもよい。この場合、複数の装置各々が上記ハードウエア構成を備えることができる。 FIG. 7 is a block diagram illustrating an example of the hardware configuration of the detection device 10. As shown in FIG. 7, the detection device 10 has a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The learning device 20 does not need to have the peripheral circuit 4A. The detection device 10 may be composed of multiple devices that are physically and/or logically separated. In this case, each of the multiple devices can have the above hardware configuration.

「機能構成」
　次に、検出装置１０の機能構成を詳細に説明する。図１２に、検出装置１０の機能ブロック図の一例を示す。図示するように、検出装置１０は、検出部１１と、検出用撮影画像取得部１２と、検出用反射波処理部１３と、検出用合成部１４とを有する。 "Function Configuration"
Next, a detailed description will be given of the functional configuration of the detection device 10. Fig. 12 shows an example of a functional block diagram of the detection device 10. As shown in the figure, the detection device 10 has a detection unit 11, a detection-use photographed image acquisition unit 12, a detection-use reflected wave processing unit 13, and a detection-use synthesis unit 14.

　検出用撮影画像取得部１２は、処理対象撮影画像を取得する。 The detection image acquisition unit 12 acquires the captured image to be processed.

　「処理対象撮影画像」は、検出対象を検出する処理の対象となる撮影画像である。処理対象撮影画像は、対象領域を撮影した画像である。 The "image to be processed" is the image that is the subject of processing to detect the detection target. The image to be processed is an image of the target area.

　「撮影画像」の概念や、撮影方法の一例は、第３の実施形態で説明した通りである。 The concept of a "captured image" and an example of a capture method are as explained in the third embodiment.

　「検出対象」は、検出する対象となっている物体である。学習装置２０が生成する学習モデルは、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」に基づき検出対象を検出する。このため、電磁波を反射する物体が、検出対象となり得る。電磁波を反射する物体は、主に金属によって形成されている場合が多いが、これに限定されない。 The "detection target" is the object that is to be detected. The learning model generated by the learning device 20 detects the detection target based on the "characteristics of the captured image" and the "characteristics of the reflected wave information indicating the reflected electromagnetic wave." Therefore, an object that reflects electromagnetic waves can be the detection target. Objects that reflect electromagnetic waves are often made mainly of metal, but are not limited to this.

　「対象領域」は、検出対象が存在するか検索する対象となっている領域である。対象領域は、地上の一部領域であってもよいし、建造物の一部領域であってもよいし、その他であってもよい。 The "target area" is the area being searched to see if a detection target exists. The target area may be a part of the ground, a part of a building, or something else.

　検出用反射波処理部１３は、対象領域に照射された電磁波の反射波を示す反射波情報に基づき処理対象２次元画像を生成する。検出用反射波処理部１３は、学習用反射波処理部２３と同様の処理で、反射波情報に基づき２次元画像を生成することができる。 The detection reflected wave processing unit 13 generates a two-dimensional image to be processed based on reflected wave information indicating the reflected waves of the electromagnetic waves irradiated to the target area. The detection reflected wave processing unit 13 can generate a two-dimensional image based on the reflected wave information using processing similar to that of the learning reflected wave processing unit 23.

　「処理対象２次元画像」は、検出対象を検出する処理の対象となる２次元画像である。 The "two-dimensional image to be processed" is the two-dimensional image that is the subject of processing to detect the detection target.

　「２次元画像」、「電磁波」、「反射波」、及び「反射波情報」やこれらに関連する情報の概念や、生成方法の一例は、第３の実施形態で説明した通りである。 The concepts of "two-dimensional images," "electromagnetic waves," "reflected waves," and "reflected wave information," as well as information related to these, and an example of a method for generating these, are as described in the third embodiment.

　検出用合成部１４は、処理対象撮影画像と処理対象２次元画像を合成して処理対象合成画像を生成する。検出用合成部１４は、学習用合成部２４による学習用撮影画像と学習用２次元画像の合成と同様の処理で、処理対象撮影画像と処理対象２次元画像を合成し、処理対象合成画像を生成することができる。処理対象撮影画像と処理対象２次元画像は同様のサイズであることが好ましい。 The detection synthesis unit 14 synthesizes the photographic image to be processed and the two-dimensional image to be processed to generate a composite image to be processed. The detection synthesis unit 14 can synthesize the photographic image to be processed and the two-dimensional image to be processed in a similar process to the synthesis of the photographic image to be processed and the two-dimensional image to be processed by the learning synthesis unit 24 to generate a composite image to be processed. It is preferable that the photographic image to be processed and the two-dimensional image to be processed are of similar size.

　検出部１１は、処理対象撮影画像と処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する。検出部１１は、第１、第３及び第４の実施形態の学習装置２０が生成した学習モデルと、処理対象合成画像に基づき、検出対象を検出する。当該学習モデルは、言語モデルを用いた物体認識モデル又は物体検出モデルである。 The detection unit 11 detects the detection target based on a composite image of the processing target, which is a composite of the captured image of the processing target and the two-dimensional image of the processing target. The detection unit 11 detects the detection target based on the learning model generated by the learning device 20 of the first, third, and fourth embodiments, and the composite image of the processing target. The learning model is an object recognition model or an object detection model that uses a language model.

　検出部１１は、言語特徴、処理対象合成画像、及び画像特徴の中の少なくとも１つに対し、撮影画像用に生成／調整された学習モデルに適応させるための調整を行うことができる。そして、検出部１１は、当該調整後のデータを用いて、検出対象の検出を行うことができる。検出部１１による言語特徴、処理対象合成画像、及び画像特徴の中の少なくとも１つに対する調整は、第３の実施形態で説明した学習部２１による言語特徴、学習用合成画像、及び画像特徴の中の少なくとも１つに対する調整と同様の手段で実現される。 The detection unit 11 can adjust at least one of the language features, the synthetic image to be processed, and the image features to adapt them to the learning model generated/adjusted for the captured image. The detection unit 11 can then use the adjusted data to detect the detection target. The adjustment of at least one of the language features, the synthetic image to be processed, and the image features by the detection unit 11 is achieved by the same means as the adjustment of at least one of the language features, the synthetic image to be processed, and the image features by the learning unit 21 described in the third embodiment.

　すなわち、検出部１１は、言語特徴を非線形関数又は線形関数により変換し、変換後の言語特徴に基づき、検出対象を検出することができる。また、検出部１１は、処理対象合成画像を非線形関数又は線形関数により変換し、変換後の処理対象合成画像に基づき、検出対象を検出することができる。また、検出部１１は、処理対象合成画像の中から抽出した画像特徴を非線形関数又は線形関数により変換し、変換後の画像特徴に基づき、検出対象を検出することができる。 In other words, the detection unit 11 can transform language features using a nonlinear function or a linear function, and detect the detection target based on the transformed language features. The detection unit 11 can also transform the composite image to be processed using a nonlinear function or a linear function, and detect the detection target based on the transformed composite image to be processed. The detection unit 11 can also transform image features extracted from the composite image to be processed using a nonlinear function or a linear function, and detect the detection target based on the transformed image features.

　なお、検出部１１は、検出対象をテキストで表現した検索条件を取得する。この検索条件により、検出対象が指定される。そして、検出部１１は、この検索条件で指定される検出対象を検出する。検出対象は処理対象合成画像が入力される毎に指定されてもよい。その他、予め検出対象が指定されており、その指定されている検出対象が複数の処理対象合成画像の処理において利用されてもよい。 The detection unit 11 acquires search conditions that express the detection target in text. The detection target is specified by these search conditions. The detection unit 11 then detects the detection target specified by these search conditions. The detection target may be specified each time a processing target composite image is input. Alternatively, the detection target may be specified in advance, and the specified detection target may be used in processing multiple processing target composite images.

　「検索条件」は、検出対象を表現したテキストである。検索条件を構成するテキストは、単語、文章、プロンプト等である。検索条件を構成するテキストは、例えば物体の種類や、物体の外観の特徴（色、大きさ、形状等）等を含むことができる。検索条件の一例としては、「青い缶」等が例示される。 "Search criteria" is text that expresses the object to be detected. The text that constitutes the search criteria is a word, a sentence, a prompt, etc. The text that constitutes the search criteria can include, for example, the type of object, or the external characteristics of the object (color, size, shape, etc.). An example of a search criterion is "blue can."

　なお、検索条件は、検索結果から除去したい物体を表現したテキストをさらに含んでもよい。すなわち、検索条件は、検出対象を表現したテキストと、検索結果から除去したい対象を表現したテキストを含んでもよい。以下、検出対象を「正例」と呼び、検索結果から除去したい対象を「負例」と呼ぶ場合がある。負例も、正例と同様の手段で表現される。負例の検索条件の一例としては、「白い缶」等が例示される。 The search criteria may further include text expressing an object to be removed from the search results. That is, the search criteria may include text expressing the detection target and text expressing the target to be removed from the search results. Hereinafter, the detection target may be referred to as a "positive example," and the target to be removed from the search results may be referred to as a "negative example." Negative examples are expressed in the same manner as positive examples. An example of a search criterion for negative examples is "white can."

　検出部１１は、ユーザが入力した検索条件を取得する。ユーザは、キーボード、タッチパネル、マイク、マウス、物理ボタン等のあらゆる入力装置を介して、検出装置１０に検索条件を入力することができる。 The detection unit 11 acquires search conditions input by the user. The user can input search conditions to the detection device 10 via any input device, such as a keyboard, touch panel, microphone, mouse, or physical button.

　検出部１１は、例えば処理対象合成画像と検索条件との相関関係に基づき、処理対象合成画像の中から検出対象を検出することができる。 The detection unit 11 can detect the detection target from within the composite image to be processed, for example, based on the correlation between the composite image to be processed and the search conditions.

　一例では、検出部１１は、検索条件（正例）との相関値（コサイン類似度）の高い物体を処理対象合成画像の中から検出し、その物体の領域を検出結果として出力することができる。例えば、検出部１１は、検索条件（正例）との相関値が閾値以上の物体を処理対象合成画像の中から検出してもよい。また、検出部１１は、検索条件（正例）との相関値が高い方から所定数の物体を処理対象合成画像の中から検出してもよい。また、検出部１１は、検索条件（正例）との相関値が閾値以上であって、当該相関値が高い方から所定数の物体を処理対象合成画像の中から検出してもよい。 In one example, the detection unit 11 can detect objects with a high correlation value (cosine similarity) with the search criteria (positive examples) from within the composite image to be processed, and output the area of the object as the detection result. For example, the detection unit 11 may detect objects with a correlation value with the search criteria (positive examples) equal to or greater than a threshold from within the composite image to be processed. The detection unit 11 may also detect a predetermined number of objects with a high correlation value with the search criteria (positive examples) from within the composite image to be processed. The detection unit 11 may also detect a predetermined number of objects with a high correlation value with the search criteria (positive examples) from within the composite image to be processed, whose correlation value with the search criteria (positive examples) is equal to or greater than a threshold.

　また、検出部１１は、検索条件（負例）と相関値の高い物体を処理対象合成画像の中から検出し、その物体の領域を検出結果に含めずに出力することができる。例えば、検出部１１は、検索条件（負例）との相関値が閾値以上の物体を処理対象合成画像の中から検出してもよい。また、検出部１１は、検索条件（負例）との相関値が高い方から所定数の物体を処理対象合成画像の中から検出してもよい。また、検出部１１は、検索条件（負例）との相関値が閾値以上であって、当該相関値が高い方から所定数の物体を処理対象合成画像の中から検出してもよい。 The detection unit 11 can also detect objects having a high correlation value with the search criteria (negative examples) from within the composite image to be processed, and output the detection result without including the area of the object in it. For example, the detection unit 11 can detect objects having a correlation value with the search criteria (negative examples) equal to or greater than a threshold from within the composite image to be processed. The detection unit 11 can also detect a predetermined number of objects having a high correlation value with the search criteria (negative examples) from within the composite image to be processed. The detection unit 11 can also detect a predetermined number of objects having a high correlation value with the search criteria (negative examples) from within the composite image to be processed, whose correlation value with the search criteria (negative examples) is equal to or greater than a threshold.

　このように、検出部１１は、処理対象合成画像の中から、検出対象が存在する領域を検出する。そして、検出部１１は、その検出した領域を示す情報を出力する。例えば、検出部１１は、処理対象合成画像又は処理対象撮影画像上に検出した領域を示す情報（枠、マスク等）を重畳した画像を出力してもよい。また、検出した領域毎に任意の手段で算出した信頼度を示してもよい。なお、当該出力例はあくまで一例であり、これに限定されない。 In this way, the detection unit 11 detects an area in which the detection target exists from within the composite image to be processed. The detection unit 11 then outputs information indicating the detected area. For example, the detection unit 11 may output an image in which information indicating the detected area (frame, mask, etc.) is superimposed on the composite image to be processed or the photographed image to be processed. In addition, the detection unit 11 may indicate a reliability calculated by any means for each detected area. Note that this output example is merely an example and is not limited to this.

　次に、図１３のフローチャートを用いて、検出装置１０の処理の流れの一例を説明する。なお、各処理の詳細は上述したので、ここでは各処理の概要とともに、処理の流れの一例を説明する。 Next, an example of the process flow of the detection device 10 will be described using the flowchart in FIG. 13. Note that the details of each process have been described above, so here, an example of the process flow will be described along with an overview of each process.

　Ｓ４０では、検出装置１０は、対象領域を撮影した処理対象撮影画像を取得する。Ｓ４１では、検出装置１０は、対象領域に照射された電磁波の反射波を示す反射波情報に基づき処理対象２次元画像を生成する。 In S40, the detection device 10 acquires a captured image of the target area to be processed. In S41, the detection device 10 generates a two-dimensional image of the target area to be processed based on reflected wave information indicating the reflected waves of the electromagnetic waves irradiated to the target area.

　次いで、Ｓ４２では、検出装置１０は、処理対象撮影画像と処理対象２次元画像を合成して処理対象合成画像を生成する。そして、検出装置１０は、処理対象合成画像に基づき、検出対象を検出する。 Next, in S42, the detection device 10 synthesizes the captured image to be processed and the two-dimensional image to be processed to generate a composite image to be processed. Then, the detection device 10 detects the detection target based on the composite image to be processed.

「作用効果」
　処理対象合成画像を用いて検出対象を検出する検出装置１０によれば、「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の両方に基づき、検出対象を検出することができる。「撮影画像の特徴」及び「電磁波の反射波を示す反射波情報の特徴」の一方だけでなく、両方を用いることで、検出対象の検出精度が向上する。 "Action and effect"
According to the detection device 10 that detects the detection target using the composite image to be processed, the detection target can be detected based on both the "characteristics of the captured image" and the "characteristics of the reflected wave information indicating the reflected wave of the electromagnetic wave." By using not only one but both of the "characteristics of the captured image" and the "characteristics of the reflected wave information indicating the reflected wave of the electromagnetic wave," the detection accuracy of the detection target is improved.

　また、検出装置１０は、言語特徴、処理対象合成画像、及び画像特徴の中の少なくとも１つに対し、撮影画像用に生成／調整された学習モデルに適応させるための調整を行うことができる。そして、検出装置１０は、調整後のデータを用いて検出対象の検出を行うことができる。このような検出装置１０によれば、撮影画像用に生成／調整された既存の学習モデルを学習して生成された学習モデルを用いて、検出対象の検出を行うことができる。 The detection device 10 can also adjust at least one of the language features, the synthetic image to be processed, and the image features to accommodate the learning model generated/adjusted for the captured image. The detection device 10 can then use the adjusted data to detect the detection target. Such a detection device 10 can detect the detection target using a learning model generated by learning an existing learning model generated/adjusted for the captured image.

＜第６の実施形態＞
　本実施形態の検出装置１０は、入力された１組の処理対象撮影画像及び処理対象２次元画像を用いて、ブレンド比率が互いに異なる複数の処理対象合成画像を生成する。そして、検出装置１０は、この複数の処理対象合成画像を用いて、検出対象を検出する。以下、詳細に説明する。 Sixth Embodiment
The detection device 10 of this embodiment generates a plurality of processing target composite images having different blend ratios using a set of a processing target photographic image and a processing target two-dimensional image that are input. Then, the detection device 10 detects the detection target using the plurality of processing target composite images. This will be described in detail below.

　検出用合成部１４は、入力された１組の処理対象撮影画像及び処理対象２次元画像を用いて、ブレンド比率が互いに異なる複数の処理対象合成画像を生成する。検出用合成部１４は、画像内の一部領域毎にブレンド比率が異なる処理対象合成画像を生成してもよい。検出用合成部１４は、第３の実施形態で説明した学習用合成部２４による手段と同様の手段で、当該合成を実現することができる。 The detection synthesis unit 14 uses an input set of a photographic image to be processed and a two-dimensional image to be processed to generate a plurality of composite images to be processed having different blending ratios. The detection synthesis unit 14 may generate composite images to be processed having different blending ratios for each partial area within the image. The detection synthesis unit 14 can achieve this synthesis by a means similar to that used by the learning synthesis unit 24 described in the third embodiment.

　検出部１１は、１組の処理対象撮影画像及び処理対象２次元画像を用いて生成された複数の処理対象合成画像各々の中から検出対象を検出する。各処理対象合成画像の中から検出対象を検出する処理は、第５の実施形態と同様の手段で実現される。 The detection unit 11 detects the detection target from each of a plurality of processing target composite images generated using a set of processing target photographic images and processing target two-dimensional images. The process of detecting the detection target from each processing target composite image is realized by the same means as in the fifth embodiment.

　そして、検出部１１は、複数の処理対象合成画像各々の検出結果に基づき、検出対象を検出する。例えば、検出部１１は、複数の処理対象合成画像の中の所定割合以上の処理対象合成画像（又は所定数以上の処理対象合成画像）において検出対象が存在すると検出されている領域を、検出対象が存在する領域として検出することができる。所定割合や所定数は、予め定められる値である。 Then, the detection unit 11 detects the detection target based on the detection results of each of the multiple processing target composite images. For example, the detection unit 11 can detect, as a region where the detection target exists, an area where the detection target has been detected to exist in at least a predetermined percentage of the processing target composite images (or at least a predetermined number of processing target composite images) among the multiple processing target composite images. The predetermined percentage and the predetermined number are values that are determined in advance.

　本実施形態の検出装置１０のその他の構成は、第２及び第５の実施形態と同様である。 The rest of the configuration of the detection device 10 of this embodiment is the same as in the second and fifth embodiments.

　本実施形態の検出装置１０によれば、第２及び第５の実施形態と同様の作用効果が得られる。また、本実施形態の検出装置１０は、入力された１組の処理対象撮影画像及び処理対象２次元画像を用いてブレンド比率が互いに異なる複数の処理対象合成画像を生成し、この複数の処理対象合成画像を用いて検出対象を検出することができる。このような検出装置１０によれば、検出対象の検出精度が向上する。 The detection device 10 of this embodiment provides the same effects as the second and fifth embodiments. Furthermore, the detection device 10 of this embodiment generates a plurality of composite images of the processing target having different blend ratios using an input set of a photographic image of the processing target and a two-dimensional image of the processing target, and can detect the detection target using the plurality of composite images of the processing target. Such a detection device 10 improves the detection accuracy of the detection target.

＜変形例＞
　第１乃至第６の実施形態に適用可能な変形例を説明する。 <Modification>
Modifications that can be applied to the first to sixth embodiments will be described.

　検出用合成部１４は、処理対象撮影画像に対する補正処理を行った後に、補正後の処理対象撮影画像を用いて処理対象合成画像を生成してもよい。 The detection synthesis unit 14 may perform a correction process on the photographed image to be processed, and then generate a composite image to be processed using the corrected photographed image to be processed.

　また、学習用合成部２４は、学習用撮影画像に対する補正処理を行った後に、補正後の学習用撮影画像を用いて学習用合成画像を生成してもよい。 In addition, the learning synthesis unit 24 may perform a correction process on the learning captured image, and then generate a learning composite image using the corrected learning captured image.

　また、検出部１１は、処理対象合成画像に対する補正処理を行った後に、補正後の処理対象合成画像を用いて検出対象の検出を行ってもよい。 In addition, the detection unit 11 may perform a correction process on the processing target composite image, and then detect the detection target using the corrected processing target composite image.

　また、学習部２１は、学習用合成画像に対する補正処理を行った後に、補正後の学習用合成画像を用いて学習モデルの学習を行ってもよい。 In addition, the learning unit 21 may perform a correction process on the training composite image and then use the corrected training composite image to train the learning model.

　補正処理の一例は、画像の明度を調整する補正である。例えば以下の文献に開示されている技術を用いて、画像の明度の調整が行われてもよい。当該補正処理は、画像の明度が所定条件（明度が閾値以下）を満たす場合に行われてもよい。
「Shibata, Takashi, Masayuki Tanaka, and Masatoshi Okutomi. "Gradient-domain image reconstruction framework with intensity-range and base-structure constraints." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.」
「Tanaka, Masayuki, Takashi Shibata, and Masatoshi Okutomi. "Gradient-based low-light image enhancement." 2019 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2019.」 An example of the correction process is a correction to adjust the brightness of an image. For example, the brightness of an image may be adjusted using the technology disclosed in the following document. The correction process may be performed when the brightness of an image satisfies a predetermined condition (brightness is equal to or less than a threshold value).
"Shibata, Takashi, Masayuki Tanaka, and Masatoshi Okutomi. "Gradient-domain image reconstruction framework with intensity-range and base-structure constraints." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016."
"Tanaka, Masayuki, Takashi Shibata, and Masatoshi Okutomi. "Gradient-based low-light image enhancement." 2019 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2019."

　以上、実施の形態を参照して本開示を説明したが、本開示は上述の実施の形態に限定されるものではない。本開示の構成や詳細には、本開示のスコープ内で当業者が理解し得る様々な変更をすることができる。そして、各実施の形態は、適宜他の実施の形態と組み合わせることができる。 The present disclosure has been described above with reference to the embodiments, but the present disclosure is not limited to the above-mentioned embodiments. Various modifications that can be understood by those skilled in the art can be made to the configuration and details of the present disclosure within the scope of the present disclosure. Furthermore, each embodiment can be combined with other embodiments as appropriate.

　また、上述の説明で用いたフローチャートでは、複数の工程（処理）が順番に記載されている。しかし、各実施の形態で実行される工程の実行順序は、その記載の順番に制限されない。各実施の形態では、図示される工程の順番を内容的に支障のない範囲で変更することができる。 In addition, in the flowcharts used in the above explanations, multiple steps (processing) are listed in order. However, the order in which the steps are executed in each embodiment is not limited to the order listed. In each embodiment, the order of the steps shown in the figures can be changed to the extent that does not interfere with the content.

　上記の実施の形態の一部又は全部は、以下の付記のようにも記載されうるが、以下に限られない。
１．　対象領域を撮影した処理対象撮影画像と、前記対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する検出手段を有する検出装置。
２．　前記処理対象撮影画像を取得する検出用撮影画像取得手段と、
　前記反射波情報に基づき前記処理対象２次元画像を生成する検出用反射波処理手段と、
　前記処理対象撮影画像と前記処理対象２次元画像を合成して前記処理対象合成画像を生成する検出用合成手段と、
を有する１に記載の検出装置。
３．　前記検出手段は、
　　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成される学習用２次元画像とを合成した複数の学習用合成画像を学習した学習モデルと、前記処理対象合成画像に基づき、前記検出対象を検出する１又は２に記載の検出装置。
４．　前記学習モデルは、言語モデルを用いた物体認識モデル又は物体検出モデルである３に記載の検出装置。
５．　前記検出用合成手段は、
　　前記処理対象撮影画像及び前記処理対象２次元画像のブレンド比率が互いに異なる複数の前記処理対象合成画像を生成し、
　前記検出手段は、
　　複数の前記処理対象合成画像各々の中から前記検出対象を検出し、
　　複数の前記処理対象合成画像各々の検出結果に基づき、前記検出対象を検出する２に記載の検出装置。
６．　前記検出手段は、
　　前記処理対象合成画像を非線形関数又は線形関数により変換し、
　　変換後の前記処理対象合成画像に基づき、前記検出対象を検出する１から５のいずれかに記載の検出装置。
７．　前記検出手段は、
　　前記処理対象合成画像の中から画像特徴を抽出し、
　　前記画像特徴を非線形関数又は線形関数により変換し、
　　変換後の前記画像特徴に基づき、前記検出対象を検出する１から６のいずれかに記載の検出装置。
８．　１つ以上のコンピュータが、
　　対象領域を撮影した処理対象撮影画像と、前記対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する検出方法。
９．　コンピュータを、
　　対象領域を撮影した処理対象撮影画像と、前記対象領域に照射された電磁波の反射波を示す反射波情報に基づき生成された処理対象２次元画像とを合成した処理対象合成画像に基づき、検出対象を検出する検出手段として機能させるプログラム。
１０．　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像とを合成した複数の学習用合成画像を用いて学習モデルを学習する学習手段を有する学習装置。
１１．　前記学習モデルは、言語モデルを用いた物体認識モデル又は物体検出モデルである１０に記載の学習装置。
１２．　前記学習用合成画像に写る物体を表現したテキストを取得する言語入力手段を有し、
　前記学習手段は、前記学習用合成画像に写る前記物体と、前記テキストとの相関関係を学習する１１に記載の学習装置。
１３．　前記学習用撮影画像を取得する学習用撮影画像取得手段と、
　前記反射波情報に基づき前記学習用２次元画像を生成する学習用反射波処理手段と、
　前記学習用撮影画像と前記学習用２次元画像を合成して前記学習用合成画像を生成する学習用合成手段と、
を有する１０から１２のいずれかに記載の学習装置。
１４．　前記学習用合成手段は、
　　前記学習用撮影画像及び前記学習用２次元画像のブレンド比率が互いに異なる複数の前記学習用合成画像を生成する１３に記載の学習装置。
１５．　前記学習手段は、
　　前記学習用合成画像を非線形関数又は線形関数により変換し、
　　変換後の前記学習用合成画像を用いて前記学習モデルを学習する１０から１４のいずれかに記載の学習装置。
１６．　前記学習手段は、
　　前記学習用合成画像の中から画像特徴を抽出し、
　　前記画像特徴を非線形関数又は線形関数により変換し、
　　変換後の前記画像特徴を用いて前記学習モデルを学習する１０から１５のいずれかに記載の学習装置。
１７．　１つ以上のコンピュータが、
　　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像とを合成した複数の学習用合成画像を用いて学習モデルを学習する学習方法。
１８．　コンピュータを、
　　学習用撮影画像と、電磁波の反射波を示す反射波情報に基づき生成された学習用２次元画像とを合成した複数の学習用合成画像を用いて学習モデルを学習する学習手段として機能させるプログラム。
　上述した付記１の検出装置に従属する付記２乃至７の一部又は全ては、付記８の検出方法及び付記９のプログラムに対しても、付記１と付記２乃至７と同様の従属関係により従属し得る。また、上述した付記１０の学習装置に従属する付記１１乃至１６の一部又は全ては、付記１７の学習方法及び付記１８のプログラムに対しても、付記１０と付記１１乃至１６と同様の従属関係により従属し得る。さらに、上述した各実施の形態から逸脱しない範囲において、様々なハードウエア、ソフトウエア、ソフトウエアを記録するための種々の記録手段、又はシステムにおいて、付記として記載した構成の一部又は全てを実現することができる。 A part or all of the above-described embodiments can be described as, but is not limited to, the following supplementary notes.
1. A detection device having a detection means for detecting a detection target based on a processing target composite image obtained by combining a processing target photographed image of a target area and a processing target two-dimensional image generated based on reflected wave information indicating reflected waves of electromagnetic waves irradiated to the target area.
2. A detection image acquisition means for acquiring the processing target photographed image;
a detection reflected wave processing means for generating the two-dimensional image of the processing object based on the reflected wave information;
a detection synthesis means for synthesizing the captured image to be processed and the two-dimensional image to be processed to generate the synthesized image to be processed;
2. The detection device according to claim 1, comprising:
3. The detection means is
A detection device as described in 1 or 2, which detects the detection target based on a learning model that has been trained using a plurality of learning composite images that are obtained by combining learning photographic images with learning two-dimensional images generated based on reflected wave information indicating reflected waves of electromagnetic waves, and the detection target based on the processing target composite image.
4. The detection device according to 3, wherein the learning model is an object recognition model or an object detection model using a language model.
5. The detection synthesis means comprises:
generating a plurality of processing target composite images having different blend ratios between the processing target photographed image and the processing target two-dimensional image;
The detection means includes:
Detecting the detection target from each of the plurality of processing target composite images;
3. The detection device according to claim 2, which detects the detection target based on detection results of each of the multiple processing target composite images.
6. The detection means is
Transforming the composite image to be processed by a nonlinear function or a linear function;
6. The detection device according to any one of claims 1 to 5, which detects the detection target based on the converted composite image of the processing target.
7. The detection means is
Extracting image features from the composite image to be processed;
Transforming the image features by a nonlinear or linear function;
7. A detection device according to any one of 1 to 6, which detects the detection target based on the transformed image features.
8. One or more computers:
A detection method for detecting a detection target based on a composite image of a processing target obtained by combining a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.
9. Computers,
A program that functions as a detection means for detecting a detection target based on a composite image of a processing target obtained by combining a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.
10. A learning device having a learning means for learning a learning model using a plurality of synthetic learning images obtained by synthesizing a captured learning image with a two-dimensional learning image generated based on reflected wave information indicating a reflected electromagnetic wave.
11. The learning device according to 10, wherein the learning model is an object recognition model or an object detection model using a language model.
12. A language input means for acquiring text expressing an object appearing in the learning synthetic image,
12. The learning device according to claim 11, wherein the learning means learns a correlation between the object appearing in the synthetic image for learning and the text.
13. A learning image acquisition means for acquiring the learning image;
a learning reflected wave processing means for generating the learning two-dimensional image based on the reflected wave information;
a learning synthesis means for synthesizing the learning photographed image and the learning two-dimensional image to generate the learning synthesized image;
13. The learning device according to any one of 10 to 12, comprising:
14. The learning synthesis means
The learning device according to claim 13, wherein a plurality of the learning composite images are generated in which the blending ratios of the learning captured images and the learning two-dimensional images are different from each other.
15. The learning means
Transforming the synthetic image for learning by a nonlinear function or a linear function;
15. The learning device according to any one of claims 10 to 14, which learns the learning model using the training synthetic image after conversion.
16. The learning means
Extracting image features from the synthetic image for training;
Transforming the image features by a nonlinear or linear function;
16. A learning device according to any one of claims 10 to 15, which learns the learning model using the transformed image features.
17. One or more computers:
A learning method for learning a learning model using a plurality of synthetic learning images obtained by synthesizing a captured learning image with a two-dimensional learning image generated based on reflected wave information indicating reflected electromagnetic waves.
18. A computer
A program that functions as a learning means for learning a learning model using multiple learning composite images that are synthesized by combining learning captured images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.
Some or all of Appendices 2 to 7 that are dependent on the detection device of Appendix 1 described above may also be dependent on the detection method of Appendix 8 and the program of Appendix 9 in the same dependent relationship as Appendix 1 and Appendices 2 to 7. In addition, some or all of Appendices 11 to 16 that are dependent on the learning device of Appendix 10 described above may also be dependent on the learning method of Appendix 17 and the program of Appendix 18 in the same dependent relationship as Appendix 10 and Appendices 11 to 16. Furthermore, within the scope of each of the above-mentioned embodiments, some or all of the configurations described as appendices can be realized in various hardware, software, various recording means for recording software, or systems.

　この出願は、２０２３年８月１０日に出願された日本出願特願２０２３－１３０７６２号を基礎とする優先権を主張し、その開示の全てをここに取り込む。 This application claims priority based on Japanese Patent Application No. 2023-130762, filed on August 10, 2023, the entire disclosure of which is incorporated herein by reference.

　１０　　検出装置
　１１　　検出部
　１２　　検出用撮影画像取得部
　１３　　検出用反射波処理部
　１４　　検出用合成部
　２０　　学習装置
　２１　　学習部
　２２　　学習用撮影画像取得部
　２３　　学習用反射波処理部
　２４　　学習用合成部
　２５　　言語入力部
　１Ａ　　プロセッサ
　２Ａ　　メモリ
　３Ａ　　入出力Ｉ／Ｆ
　４Ａ　　周辺回路
　５Ａ　　バス REFERENCE SIGNS LIST 10 Detection device 11 Detection section 12 Detection-use captured image acquisition section 13 Detection-use reflected wave processing section 14 Detection-use synthesis section 20 Learning device 21 Learning section 22 Learning-use captured image acquisition section 23 Learning-use reflected wave processing section 24 Learning-use synthesis section 25 Language input section 1A Processor 2A Memory 3A Input/output I/F
4A Peripheral circuit 5A Bus

Claims

A detection device having a detection means for detecting a detection target based on a composite image of a processing target, which is a composite of a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.

A detection image acquisition means for acquiring the processing target photographed image;
a detection reflected wave processing means for generating the two-dimensional image of the processing object based on the reflected wave information;
a detection synthesis means for synthesizing the captured image to be processed and the two-dimensional image to be processed to generate the synthesized image to be processed;
The detection device according to claim 1 , further comprising:

The detection means includes:
The detection device according to claim 1 or 2, which detects the detection target based on a learning model that has been trained using a plurality of learning composite images that are obtained by combining learning photographic images with learning two-dimensional images generated based on reflected wave information indicating reflected waves of electromagnetic waves, and the processing target composite image.

The detection device according to claim 3, wherein the learning model is an object recognition model or an object detection model that uses a language model.

The detection synthesis means comprises:
generating a plurality of processing target composite images having different blend ratios between the processing target photographed image and the processing target two-dimensional image;
The detection means includes:
Detecting the detection target from each of the plurality of processing target composite images;
The detection device according to claim 2 , wherein the detection target is detected based on detection results of each of the plurality of processing target composite images.

The detection means includes:
Transforming the composite image to be processed by a nonlinear function or a linear function;
The detection device according to claim 1 , wherein the detection target is detected based on the converted composite image of the processing target.

The detection means includes:
Extracting image features from the composite image to be processed;
Transforming the image features by a nonlinear or linear function;
The detection device according to claim 1 , wherein the detection target is detected based on the transformed image features.

One or more computers
A detection method for detecting a detection target based on a composite image of a processing target obtained by combining a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.

Computer,
A recording medium having a program recorded thereon that functions as a detection means for detecting a detection target based on a composite image of a processing target obtained by combining a captured image of the processing target obtained by capturing an image of the target area and a two-dimensional image of the processing target generated based on reflected wave information indicating the reflected waves of electromagnetic waves irradiated to the target area.

A learning device having a learning means for learning a learning model using a plurality of synthetic learning images obtained by synthesizing a captured learning image with a two-dimensional learning image generated based on reflected wave information indicating reflected electromagnetic waves.

The learning device according to claim 10, wherein the learning model is an object recognition model or an object detection model using a language model.

a language input means for acquiring text expressing an object appearing in the learning synthetic image,
The learning device according to claim 11 , wherein the learning means learns a correlation between the object appearing in the synthetic image for learning and the text.

A learning image acquisition means for acquiring the learning image;
a learning reflected wave processing means for generating the learning two-dimensional image based on the reflected wave information;
a learning synthesis means for synthesizing the learning photographed image and the learning two-dimensional image to generate the learning synthesized image;
13. A learning device according to claim 10, comprising:

The learning synthesis means includes:
The learning device according to claim 13 , wherein a plurality of the learning composite images are generated in which the blending ratios of the learning captured images and the learning two-dimensional images are different from each other.

The learning means includes:
Transforming the synthetic image for learning by a nonlinear function or a linear function;
The learning device according to claim 10 , wherein the learning model is trained using the training synthetic image after conversion.

The learning means includes:
Extracting image features from the synthetic image for training;
Transforming the image features by a nonlinear or linear function;
The learning device according to claim 10 , wherein the learning model is trained using the transformed image features.

One or more computers
A learning method for learning a learning model using a plurality of synthetic learning images obtained by synthesizing a captured learning image with a two-dimensional learning image generated based on reflected wave information indicating reflected electromagnetic waves.

The learning method according to claim 17, wherein the learning model is an object recognition model or an object detection model using a language model.

Computer,
A recording medium having a program recorded thereon that functions as a learning means for learning a learning model using multiple learning composite images that are obtained by combining learning photographic images with learning two-dimensional images generated based on reflected wave information indicating reflected electromagnetic waves.

The recording medium according to claim 19, wherein the learning model is an object recognition model or an object detection model using a language model.