WO2025146712A1

WO2025146712A1 - Object detection device and method

Info

Publication number: WO2025146712A1
Application number: PCT/JP2024/000022
Authority: WO
Inventors: 宥光飯沼; 彩希八田; 寛之鵜澤; 周平吉田; 晃嗣山崎; 健中村; 大祐小林; 祐輔堀下
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 2024-01-04
Filing date: 2024-01-04
Publication date: 2025-07-10
Anticipated expiration: 2026-07-04

Abstract

In the present invention: an estimation unit (22) estimates a density map of a target image serving as a target of object detection, by using a density estimation model (30) which is generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image; an extraction unit (24) extracts a region serving as a target of object detection, on the basis of the probability density in the estimated density map; a detection unit (26), by applying an object detection model (32) which is generated in advance by machine learning in order to detect an object from an image, detects an object in the region of the target image corresponding to the extracted region; and a processing unit (28) performs processing for setting the size of the region to be extracted according to the density map or the object detection result.

Description

Object detection device and method

The disclosed technology relates to an object detection device and an object detection method.

An object detection device estimates the class of each object (a person, an automobile, and so on) contained in an input image, the coordinates of the rectangular bounding box enclosing the object's region in the image, and the confidence of the detection result. In recent years, a number of object detection models based on deep learning have been proposed. These include YOLO (You Only Look Once) and RetinaNet, which infer the bounding box and the object class in a single pass (Non-Patent Documents 1 and 2), as well as R-CNN, which detects candidate object regions separately from classifying the object class, and its improved version, Faster R-CNN (Non-Patent Documents 3 and 4).

When object detection is performed on high-definition images or video such as 4K (3840 × 2160 pixels) or 8K (7680 × 4320 pixels), the input image size of the object detection model is limited; in YOLO v5, for example, the maximum is 1536 × 1536 pixels. A high-definition image therefore cannot be input to the object detection model at its original size. To address this, methods have been proposed that divide the input high-definition image to match the input image size of the object detection model, perform object detection on each divided image, aggregate the results, and output them as the detection result for the entire image. Several such methods exist, differing in how the image is divided. In one of them, a group of images obtained by dividing the input evenly to match the input image size of the object detection model, together with a reduced version of the whole image, are each input to the object detection model. The coordinates of the resulting bounding boxes are scaled, the detection results of the divided images and the reduced image are combined, and the final result is output (Non-Patent Document 5).
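To make the cost of even division concrete, the following minimal sketch counts how many model-sized tiles are needed to cover a 4K or 8K frame. It assumes non-overlapping tiles and the 1536 × 1536 limit quoted above; the function names `tile_grid` and `tile_count` are illustrative and do not come from the cited method, which additionally processes one reduced image of the whole frame.

```python
import math


def tile_grid(width, height, tile=1536):
    """Return (columns, rows) of tile-sized crops needed to cover the image."""
    return math.ceil(width / tile), math.ceil(height / tile)


def tile_count(width, height, tile=1536):
    """Number of non-overlapping tiles needed to cover a width x height image."""
    cols, rows = tile_grid(width, height, tile)
    return cols * rows


print(tile_count(3840, 2160))  # 4K: 3 columns x 2 rows = 6 tiles
print(tile_count(7680, 4320))  # 8K: 5 columns x 3 rows = 15 tiles
```

Under these assumptions, moving from 4K to 8K raises the number of model invocations per frame from 6 to 15, regardless of how many objects the frame actually contains.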

Methods have also been proposed that use probability density estimation or cluster detection to estimate the distribution of regions in an image where objects exist, and then cut out only the regions predicted to contain objects and apply an object detection model to them (Non-Patent Documents 6 and 7). In the method of Non-Patent Document 6, for example, the high-definition input image is first reduced, and a density map estimating the distribution of regions where objects exist in the image is output. Next, rectangular regions on the image to which object detection is to be applied are extracted based on the density map. Here, a single threshold on the probability density, above which an object is deemed to exist, is set in advance, and the portions where the probability density is equal to or greater than the threshold are extracted as rectangular regions. Finally, the portions corresponding to the extracted rectangular regions are cut out from the input image, and the object detection model is applied to obtain the detection result.

Non-Patent Document 1: J. Redmon et al., "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
Non-Patent Document 2: T.-Y. Lin et al., "Focal Loss for Dense Object Detection," 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2999-3007.
Non-Patent Document 3: R. Girshick et al., "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
Non-Patent Document 4: S. Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 June 2017.
Non-Patent Document 5: H. Uzawa et al., "High-definition object detection technology based on AI inference scheme and its implementation," IEICE Electronics Express, 2021, vol. 18, no. 22, 20210323.
Non-Patent Document 6: C. Li et al., "Density Map Guided Object Detection in Aerial Images," 2020 IEEE/CVF CVPRW, 2020, pp. 737-746.
Non-Patent Document 7: F. Yang et al., "Clustered Object Detection in Aerial Images," 2019 IEEE/CVF ICCV, 2019, pp. 8310-8319.

A method that divides a high-definition image evenly before object detection, such as the one described in Non-Patent Document 5, cannot apply object detection selectively to the divided images. Consequently, as the resolution of the input image rises and the number of divided images grows, the amount of computation required for object detection increases. By contrast, methods based on cluster detection or probability density estimation, such as those described in Non-Patent Documents 6 and 7, narrow down the regions of the image to which object detection is applied, so that, depending on the image, the amount of computation can be reduced substantially.

However, in probability density estimation, a region of the image where small objects are densely packed generally yields a high-density distribution with small variance, whereas a large object yields a low-density distribution with large variance. When rectangular regions are extracted with a single threshold, as in conventional methods, setting the threshold high to suit small objects may mean that no rectangular region is extracted for a large object, or that only a small rectangular region covering part of the large object is extracted. Conversely, setting the threshold low to suit large objects may make a rectangular region larger than the input size of the object detection model, so that the cut-out image must be reduced before being input to the model, and small objects may be crushed and become undetectable. Thus, when the portions above a single threshold in the probability density estimate are extracted as the regions to be input to the object detection model, regions of optimal size cannot be extracted, and the accuracy of object detection may deteriorate.
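The relationship between object size and peak density can be illustrated numerically. The sketch below assumes, purely for illustration, that each object contributes a unit-mass isotropic 2D Gaussian whose standard deviation grows with the object's size; the text does not state how the density map is constructed, and `gaussian_peak` and the sigma values are hypothetical.

```python
import math


def gaussian_peak(sigma):
    """Peak value of a unit-mass isotropic 2D Gaussian with standard deviation sigma."""
    return 1.0 / (2.0 * math.pi * sigma ** 2)


small = gaussian_peak(4.0)   # hypothetical small object
large = gaussian_peak(40.0)  # hypothetical large object with 10x the spread

# The peak scales with 1 / sigma^2, so 10x the spread gives about 1/100 of the peak.
print(small / large)
```

Under this assumption, any single threshold placed between the two peaks selects the small object and misses the large one entirely, which is the failure mode described above.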

The disclosed technology has been made in view of the above points, and aims to cut out from a target image, as the region to which an object detection model is applied, a region of optimal size for the scene in the target image.

A first aspect of the present disclosure is an object detection device including: an estimation unit that estimates a density map of a target image to be subjected to object detection, using a density estimation model generated in advance by machine learning so as to estimate a density map indicating the probability density of an object being present at each position in an image; an extraction unit that extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit; a detection unit that detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect objects from images; and a processing unit that performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.

A second aspect of the present disclosure is an object detection method in which: an estimation unit estimates a density map of a target image to be subjected to object detection, using a density estimation model generated in advance by machine learning so as to estimate a density map indicating the probability density of an object being present at each position in an image; an extraction unit extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit; a detection unit detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect objects from images; and a processing unit performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.

According to the disclosed technology, a region of optimal size for the scene in the target image can be cut out from the target image as the region to which the object detection model is applied.

FIG. 1 is a block diagram showing the hardware configuration of the object detection device.
FIG. 2 is a block diagram showing the functional configuration of the object detection device according to the first embodiment.
FIG. 3 is a diagram showing pairs of thresholds and rectangular region sizes.
FIG. 4 is a diagram for explaining extraction of a rectangular region.
FIG. 5 is a diagram for explaining merging of rectangular regions.
FIG. 6 is a diagram showing an overview of updating a threshold.
FIG. 7 is a diagram showing an overview of updating a threshold.
FIG. 8 is a diagram for explaining adjustment of the size of a rectangular region.
FIG. 9 is a flowchart showing an example of object detection processing according to the first embodiment.
FIG. 10 is a flowchart showing an example of threshold adjustment processing.
FIG. 11 is a flowchart showing an example of size adjustment processing.
FIG. 12 is a block diagram showing the functional configuration of the object detection device according to the second embodiment.
FIG. 13 is a diagram for explaining extraction of a rectangular region based on a local maximum.
FIG. 14 is a diagram for explaining a search for a local maximum.
FIG. 15 is a flowchart showing an example of object detection processing according to the second embodiment.

An example of an embodiment of the disclosed technology is described below with reference to the drawings. Identical or equivalent components and parts are given the same reference numerals throughout the drawings. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

<First Embodiment>
FIG. 1 is a block diagram showing the hardware configuration of an object detection device 10 according to the first embodiment. As shown in FIG. 1, the object detection device 10 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication I/F (Interface) 17. These components are communicably connected to one another via a bus 19.

The CPU 11 is a central processing unit that executes various programs and controls each part. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a work area. The CPU 11 controls the above components and performs various kinds of arithmetic processing in accordance with the programs stored in the ROM 12 or the storage 14. In this embodiment, the ROM 12 or the storage 14 stores an object detection program for executing the object detection processing described later.

The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a work area. The storage 14 is a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs, including an operating system, and various data.

The input unit 15 includes a pointing device such as a mouse, and a keyboard, and is used for various kinds of input. The display unit 16 is, for example, a liquid crystal display, and displays various kinds of information. The display unit 16 may be a touch panel and also function as the input unit 15. The communication I/F 17 is an interface for communicating with other devices. This communication uses, for example, a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark).

Next, the functional configuration of the object detection device 10 according to the first embodiment is described. FIG. 2 is a block diagram showing an example of the functional configuration of the object detection device 10. As shown in FIG. 2, the object detection device 10 includes, as its functional configuration, an estimation unit 22, an extraction unit 24, a detection unit 26, and a processing unit 28. A density estimation model 30 and an object detection model 32 are stored in a predetermined storage area of the object detection device 10. Each functional component is realized by the CPU 11 reading the object detection program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.

The density estimation model 30 is a machine learning model generated in advance by machine learning so as to estimate a density map indicating the probability density of an object being present at each position in an image. Specifically, the density estimation model 30 is generated by using images and the ground-truth density maps for those images as training data, and updating the parameters of the density estimation model 30 so that the density map it outputs approaches the ground truth.

The estimation unit 22 uses the density estimation model 30 to estimate a density map of a target image to be subjected to object detection. Specifically, the estimation unit 22 sequentially acquires each frame of the video input to the object detection device 10 as a target image. The estimation unit 22 inputs the acquired target image to the density estimation model 30 and acquires the density map output by the density estimation model 30. The estimation unit 22 passes the estimated density map to the extraction unit 24.

Based on the probability density in the density map passed from the estimation unit 22, the extraction unit 24 extracts the regions to be subjected to object detection, using pairs of multi-stage thresholds and rectangular region sizes supplied from the processing unit 28 described later. A threshold is a value used to identify the portions of the density map whose probability density is equal to or greater than that value, and a rectangular region size is a size associated one-to-one with the threshold of each stage. In this embodiment, the regions are rectangular, and the length of one side of the rectangular region (in pixels) is taken as the size of the rectangular region. As shown in FIG. 3, let N (an arbitrary natural number) be the number of pairs, T_i the thresholds, and S_i the side length of the rectangular region corresponding to threshold T_i, where i = 1, 2, ..., N, T_1 > T_2 > ... > T_N, and S_1 < S_2 < ... < S_N.

Specifically, as shown in FIG. 4, the extraction unit 24 extracts, as a region to be subjected to object detection, a rectangular region of size S_i centered on a portion of the density map estimated by the estimation unit 22 in which pixels with probability density equal to or greater than the threshold T_i are contiguous. The extraction unit 24 extracts rectangular regions for each threshold T_i (i = 1, 2, ..., N).
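The extraction step can be sketched as follows. This is a minimal illustration, not the device's implementation: it assumes the density map is a 2D list of floats, treats "contiguous" as 4-connectivity, and centers each square on the center of the component's bounding box. All three choices, and the names `components` and `extract_regions`, are assumptions made for the example.

```python
from collections import deque


def components(density, threshold):
    """Yield 4-connected components (lists of (row, col)) with density >= threshold."""
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or density[r][c] < threshold:
                continue
            seen[r][c] = True
            queue, comp = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and density[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield comp


def extract_regions(density, pairs):
    """For each (T_i, S_i) pair, return squares (cx, cy, S_i) centered on each component."""
    regions = []
    for threshold, size in pairs:
        for comp in components(density, threshold):
            ys = [y for y, _ in comp]
            xs = [x for _, x in comp]
            cy = (min(ys) + max(ys)) / 2.0  # center of the component's bounding box
            cx = (min(xs) + max(xs)) / 2.0
            regions.append((cx, cy, size))
    return regions


density = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.3],
    [0.0, 0.0, 0.0, 0.3, 0.3],
]
# Two hypothetical pairs: a high threshold with a small square, a low one with a large square.
print(extract_regions(density, [(0.8, 2), (0.2, 4)]))
```

Note that the high-density blob is picked up by both thresholds, so the result contains overlapping squares around the same location; handling such overlaps is the purpose of the merging step.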

When the extraction unit 24 extracts a plurality of rectangular regions from the target image, it merges rectangular regions whose degree of overlap is equal to or greater than a predetermined value, provided that the minimum size of a detectable object satisfies a predetermined condition relative to the size of the merged rectangular region.

Specifically, as shown in the upper part of FIG. 5, suppose that a rectangular region i extracted based on threshold T_i and a rectangular region j extracted based on threshold T_j have an overlapping area (the hatched portion in FIG. 5). The extraction unit 24 calculates, for example, the IoU (Intersection over Union) as the degree of overlap, where IoU = (area of the overlap between rectangular regions i and j) / (area of the union of rectangular regions i and j). When the degree of overlap is equal to or greater than a predetermined value T_iou, the extraction unit 24 takes the rectangular region enclosing both rectangular region i and rectangular region j as the merged rectangular region, as shown in the lower part of FIG. 5. Merging rectangular regions in this way reduces the area over which object detection is performed redundantly when the object detection model 32 is applied, and thereby reduces the amount of computation for object detection.
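The overlap test and the enclosing rectangle can be sketched as follows. Boxes are represented as (x1, y1, x2, y2) tuples, which is a convention chosen for this example, and the function names are illustrative. The sketch covers only the IoU test against T_iou; any further condition on merging is deliberately left out.

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def enclosing(a, b):
    """Smallest axis-aligned box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_if_overlapping(a, b, t_iou):
    """Return [merged box] when IoU >= t_iou, otherwise the two boxes unchanged."""
    return [enclosing(a, b)] if iou(a, b) >= t_iou else [a, b]


a, b = (0, 0, 4, 4), (2, 0, 6, 4)
print(iou(a, b))                        # overlap 8 / union 24
print(merge_if_overlapping(a, b, 0.3))  # merged into (0, 0, 6, 4)
print(merge_if_overlapping(a, b, 0.5))  # left as two boxes
```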

However, the extraction unit 24 does not merge the rectangular regions when formula (1) is satisfied. Here, a_min^i and a_min^j are the minimum sizes of the objects detected in the immediately preceding frame from the images corresponding to the rectangular regions extracted based on thresholds T_i and T_j, respectively; A_ij is the area of the merged rectangular region; and A_min is the minimum size of an object detectable by the object detection model 32. This allows rectangular regions to be merged while preventing objects from being missed because they fall below the detectable size.

The extraction unit 24 assigns an ID, which is identification information, to each extracted rectangular region, and passes rectangle information including the ID, the position of the rectangular region, and the size of the rectangular region (vertical and horizontal lengths) to the extraction unit 24 and the detection unit 26. The position of the rectangular region may be given as the coordinates of a predetermined point of the rectangular region, such as its center or its upper-left corner.

The object detection model 32 is a machine learning model generated in advance by machine learning to detect objects from images. The object detection model 32 outputs metadata including the class of each object contained in the input image, a bounding box indicating the region of the object, and the confidence of the detection.

The detection unit 26 detects objects by applying the object detection model 32 to the region of the target image corresponding to each rectangular region extracted by the extraction unit 24. Specifically, based on the rectangle information passed from the extraction unit 24, the detection unit 26 identifies the region on the target image corresponding to the rectangular region extracted by the extraction unit 24. The detection unit 26 cuts out the image of the identified region from the target image, enlarges or reduces the cut-out image to match the input size of the object detection model 32, and inputs it to the object detection model 32. The detection unit 26 outputs the metadata output by the object detection model 32 as the detection result, and also passes the detection result to the processing unit 28.

The processing unit 28 performs processing for setting the size of the rectangular regions to be extracted by the extraction unit 24, according to the density map estimated by the estimation unit 22 or the object detection result of the detection unit 26. Specifically, the processing unit 28 acquires initial values of the pairs of multi-stage thresholds T_i and rectangular region sizes S_i described above. The initial values may be set arbitrarily. When the target image is the first frame of the video, the processing unit 28 supplies the initial pairs of T_i and S_i to the extraction unit 24. When the target image is the second or a subsequent frame, the processing unit 28 adjusts the thresholds T_i and the rectangular region sizes S_i based on the density map or the object detection result for a frame preceding the target image in the video, and supplies the adjusted pairs of T_i and S_i to the extraction unit 24.
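The frame-by-frame flow described above can be sketched as a simple loop. The callables `estimate`, `extract`, `detect`, and `adjust` are hypothetical stand-ins for the estimation unit 22, extraction unit 24, detection unit 26, and processing unit 28; their signatures are assumptions made for this illustration, and the sketch uses only the immediately preceding frame for the adjustment.

```python
def run(frames, initial_pairs, estimate, extract, detect, adjust):
    """Process frames in order, adjusting the (T_i, S_i) pairs from the previous frame."""
    pairs = list(initial_pairs)
    results = []
    prev = None  # (density map, detections) of the previous frame, if any
    for frame in frames:
        if prev is not None:
            # Second and later frames: adjust the pairs from the previous frame's outputs.
            pairs = adjust(pairs, *prev)
        density = estimate(frame)
        regions = extract(density, pairs)
        detections = detect(frame, regions)
        results.append(detections)
        prev = (density, detections)
    return results


# Toy stand-ins that expose which pairs were used for each frame.
used = run(
    frames=[0, 1, 2],
    initial_pairs=[(0.8, 64)],
    estimate=lambda frame: frame,
    extract=lambda density, pairs: list(pairs),
    detect=lambda frame, regions: regions,
    adjust=lambda pairs, density, detections: [(t * 0.5, s) for t, s in pairs],
)
print(used)  # the first frame uses the initial pair; later frames use adjusted pairs
```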

The adjustment of the threshold T_i is now described in more detail. When the area of a portion of the density map in which pixels with probability density equal to or greater than the threshold T_i are contiguous is smaller than the area (S_i × S_i) corresponding to the size S_i paired with T_i, the processing unit 28 lowers the threshold T_i by a predetermined value or a predetermined ratio. When the area of such a portion is larger than the area (S_i × S_i) corresponding to the size S_i, the processing unit 28 raises the threshold T_i by a predetermined value or a predetermined ratio.

For example, the processing unit 28 calculates the average area a_i of the portions that the extraction unit 24 identifies, in a given frame, as having probability density equal to or greater than the threshold T_i, and updates the threshold T_i according to equation (2) using the calculated average area a_i and a constant α (0 < α < 1).

FIG. 6 shows an overview of updating the threshold T_i when S_i × S_i > a_i. The lower the threshold, the larger the area of the portion whose probability density is equal to or greater than the threshold; the higher the threshold, the smaller that area. Accordingly, when S_i × S_i > a_i, the processing unit 28 lowers the threshold T_i so as to widen the portion whose probability density is equal to or greater than the threshold.

FIG. 7 shows an overview of updating the threshold T_i when S_i × S_i ≦ a_i. In this case, the processing unit 28 raises the threshold T_i. The threshold is thereby adjusted so that the pair of threshold T_i and size S_i extracts high-density regions in which smaller objects are distributed. The pairs formed by threshold T_{i+1}, or any lower threshold, and its associated size then serve to extract regions in which objects larger than those targeted by threshold T_i are distributed.
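The direction of the update in the two cases can be sketched as below. The multiplicative step by α is a stand-in chosen for this illustration and should not be read as equation (2); only the direction of the change (lower when S_i × S_i > a_i, raise otherwise) is taken from the text. The name `adjust_threshold` and the default α are hypothetical.

```python
def adjust_threshold(t_i, s_i, mean_area, alpha=0.9):
    """Move T_i so the thresholded area tends toward S_i * S_i.

    alpha is a constant with 0 < alpha < 1. The multiplicative step is an
    assumption of this sketch, not the update rule of the source.
    """
    if s_i * s_i > mean_area:
        return t_i * alpha  # thresholded area too small: lower T_i to widen it
    return t_i / alpha      # S_i * S_i <= a_i: raise T_i to narrow it


print(adjust_threshold(0.5, 10, 50.0) < 0.5)   # area 50 < 100, threshold is lowered
print(adjust_threshold(0.5, 10, 200.0) > 0.5)  # area 200 > 100, threshold is raised
```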

When, as a result of the above threshold adjustment, adjacent thresholds T_k and T_{k+1} satisfy |T_k − T_{k+1}| < ε (where ε is an arbitrary positive number), the processing unit 28 merges these thresholds. This occurs, for example, when lowering T_i brings it close to T_{i+1} so that |T_i − T_{i+1}| < ε, or when raising T_i brings it close to T_{i−1} so that |T_{i−1} − T_i| < ε. Specifically, to merge the thresholds, the processing unit 28 forms a new pair from the threshold T_k and the size S_{k+1}, and discards the threshold T_{k+1} and the size S_k.
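The merge rule above can be sketched as a single pass over the list of pairs. The function name `merge_close_thresholds` and the list-of-tuples representation are assumptions made for illustration; the text does not say how a run of three or more mutually close thresholds is handled, and this sketch simply merges them pairwise from the front.

```python
from typing import List, Tuple

def merge_close_thresholds(pairs: List[Tuple[float, float]], eps: float) -> List[Tuple[float, float]]:
    """Merge adjacent (T_k, S_k) pairs whose thresholds differ by less than eps.

    pairs is [(T_1, S_1), ..., (T_N, S_N)] in threshold order.
    """
    merged = []
    k = 0
    while k < len(pairs):
        if k + 1 < len(pairs) and abs(pairs[k][0] - pairs[k + 1][0]) < eps:
            # New pair (T_k, S_{k+1}); T_{k+1} and S_k are discarded.
            merged.append((pairs[k][0], pairs[k + 1][1]))
            k += 2
        else:
            merged.append(pairs[k])
            k += 1
    return merged
```

After a merge the number of pairs N decreases by one, so any loop that iterates over i = 1, ..., N needs to use the updated count.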

Next, the adjustment of the size of the rectangular region will be described in more detail. From the object detection result of the detection unit 26 for the previous frame, the processing unit 28 classifies the sizes of the objects detected in regions of size S_i into multiple graded categories. When the category of larger object sizes contains more objects than the category corresponding to the average object size, the processing unit 28 enlarges the size S_i by a predetermined ratio. When the category of smaller object sizes contains more objects than the category corresponding to the average object size, the processing unit 28 reduces the size S_i by a predetermined ratio.

　例えば、処理部２８は、あるフレームで検出された物体のバウンディングボックスのサイズを、物体検出モデル３２で検出可能な物体サイズの下限を基準として、Ｓ（Ｓｍａｌｌ）、Ｍ（Ｍｅｄｉｕｍ）、及びＬ（Ｌａｒｇｅ）の３つのカテゴリに分類する。処理部２８は、物体のバウンディングボックスの１辺の長さに基づいてサイズを分類してもよいし、バウンディングボックスの面積に基づいてサイズを分類してもよい。 For example, the processing unit 28 classifies the size of the bounding box of an object detected in a certain frame into three categories, S (Small), M (Medium), and L (Large), based on the lower limit of the object size that can be detected by the object detection model 32. The processing unit 28 may classify the size based on the length of one side of the object's bounding box, or may classify the size based on the area of the bounding box.

Furthermore, the processing unit 28 adjusts the size S_i so that, among all objects detected in the images cut out from the target image based on rectangular regions of size S_i, the most numerous category becomes the M category, which is an example of the category corresponding to the average object size. The reason is that, when the S category is the most numerous, the cut-out image may contain even smaller objects that the object detection model 32 has failed to detect. Likewise, when the L category is the most numerous, an object may be larger than the size S_i, so that it cannot be detected as a single whole object in the cut-out image. The processing unit 28 performs this size adjustment for every size S_i (i = 1, 2, ..., N). The processing unit 28 does not adjust the size S_i when no image was cut out at size S_i, or when the M category is the most numerous among the objects detected in the images cut out at size S_i.

The case where the area of the bounding box is used as the classification criterion will now be described in more detail. The processing unit 28 acquires the upper limit A_max and the lower limit A_min of the bounding-box area of objects detectable by the object detection model 32. Based on the values of A_max and A_min, the processing unit 28 also determines in advance the average bounding-box area A_M of all expected objects. For example, A_M = (A_max + A_min)/2 may be used.

Furthermore, based on the detection result of the detection unit 26 in each frame, the processing unit 28 classifies the detected objects into S, M, and L using a predetermined criterion. For example, an object whose bounding-box area a_Bbox satisfies A_min ≤ a_Bbox < X may be assigned to the S category, one satisfying X ≤ a_Bbox < 2X to the M category, and one satisfying 2X ≤ a_Bbox ≤ A_max to the L category. X may be, for example, (A_max − A_min)/3.
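The example criterion above can be written as a small function. This is a sketch that takes the stated boundaries literally, with X = (A_max − A_min)/3 measured from zero rather than from A_min; the function name `classify_bbox_area` and the error handling for out-of-range areas are assumptions.

```python
def classify_bbox_area(area: float, a_min: float, a_max: float) -> str:
    """Classify a bounding-box area into 'S', 'M' or 'L'.

    Boundaries follow the example in the text:
      A_min <= a < X   -> S
      X     <= a < 2X  -> M
      2X    <= a <= A_max -> L
    with X = (A_max - A_min) / 3.
    """
    if not (a_min <= area <= a_max):
        # Assumption: areas outside the detectable range are rejected.
        raise ValueError("area is outside the detectable range")
    x = (a_max - a_min) / 3.0
    if area < x:
        return "S"
    if area < 2.0 * x:
        return "M"
    return "L"
```

Taken literally, the S range is empty whenever A_min ≥ X, so an implementation may prefer boundaries offset by A_min; the text leaves the criterion open ("may be").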

The processing unit 28 denotes the average bounding-box areas of the objects belonging to the respective categories as a_S, a_M, and a_L, and denotes the maximum and minimum areas of the detected objects, regardless of category, as a_max and a_min. Using these values, the size S_i is updated by equation (3) below.

　ただし、上述したように、Ｍカテゴリの物体が最多の場合には、矩形領域のサイズの調整は実施しない。図８に、上記（３）式を用いて矩形領域のサイズの調整を行う例を示す。図８（ａ）に示すように、Ｌカテゴリが最多の場合には、矩形領域のサイズを大きくすることで、物体の全体像を捉え、より検出し易くする。一方、図８（ｂ）に示すように、Ｓカテゴリが最多の場合には、矩形領域のサイズを小さくすることで、小さな物体の検出漏れが生じないようにする。これによって、検出された物体が検出可能な物体のサイズの上限及び下限を超えない範囲で、矩形領域のサイズを適切に調整可能となる。 However, as mentioned above, when the M category contains the most objects, the size of the rectangular region is not adjusted. Figure 8 shows an example of adjusting the size of the rectangular region using the above formula (3). As shown in Figure 8(a), when the L category contains the most objects, the size of the rectangular region is increased to capture the entire image of the object and make it easier to detect. On the other hand, as shown in Figure 8(b), when the S category contains the most objects, the size of the rectangular region is reduced to prevent small objects from being missed. This makes it possible to appropriately adjust the size of the rectangular region within a range in which the detected object does not exceed the upper or lower limits of the size of a detectable object.
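The decision logic for the size adjustment can be sketched as below. Equation (3) is not reproduced in this text, so the sketch does not implement it; it scales S_i by a fixed "predetermined ratio" instead. The function name `adjust_size`, the parameter `ratio`, and the tie-breaking order (ties resolve toward M, that is, toward no change) are assumptions.

```python
from collections import Counter
from typing import List

def adjust_size(s_i: float, categories: List[str], ratio: float = 0.1) -> float:
    """Return the adjusted rectangle size S_i.

    categories holds the 'S'/'M'/'L' label of every object detected in
    images cropped at size S_i. ratio is a hypothetical fixed step;
    equation (3) is NOT implemented here.
    """
    if not categories:
        # No image was cropped at this size (or nothing was detected).
        return s_i
    counts = Counter(categories)
    # Assumption: on ties, prefer M, then L, then S.
    most_common = max(("M", "L", "S"), key=lambda c: counts[c])
    if most_common == "L":
        return s_i * (1.0 + ratio)  # enlarge to capture whole objects
    if most_common == "S":
        return s_i * (1.0 - ratio)  # shrink so small objects appear larger
    return s_i                      # M is the most numerous: no adjustment
```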

　次に、第１実施形態に係る物体検出装置１０の作用について説明する。図９は、物体検出装置１０による物体検出処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から物体検出プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、物体検出処理が行なわれる。なお、物体検出処理は、本開示の物体検出方法の一例である。 Next, the operation of the object detection device 10 according to the first embodiment will be described. FIG. 9 is a flowchart showing the flow of the object detection process performed by the object detection device 10. The object detection process is performed by the CPU 11 reading out an object detection program from the ROM 12 or storage 14, expanding it into the RAM 13, and executing it. Note that the object detection process is an example of the object detection method of the present disclosure.

　ステップＳ１０で、ＣＰＵ１１は、推定部２２として、物体検出装置１０に入力される映像のフレームを対象画像として取得する。次に、ステップＳ１２で、ＣＰＵ１１は、推定部２２として、密度推定モデル３０を用いて、対象画像の密度マップを推定する。 In step S10, the CPU 11, functioning as the estimation unit 22, acquires a frame of the video input to the object detection device 10 as a target image. Next, in step S12, the CPU 11, functioning as the estimation unit 22, estimates a density map of the target image using the density estimation model 30.

Next, in step S14, the CPU 11, functioning as the extraction unit 24, extracts rectangular regions to be subjected to object detection, based on the probability density in the density map estimated in step S12, using the multi-stage pairs of threshold T_i and size S_i supplied from the processing unit 28. When the target image is the first frame of the video, the pairs supplied from the processing unit 28 are the initial values of T_i and S_i. When the target image is the second or a subsequent frame, they are the pairs of T_i and S_i adjusted by the threshold adjustment process and the size adjustment process described later.

　次に、ステップＳ１６で、ＣＰＵ１１は、抽出部２４として、対象画像において、重複度が所定値以上の矩形領域同士であって、マージ後の矩形領域のサイズに対して、検出可能な物体の最小サイズが所定の条件を満たす矩形領域同士をマージする。ＣＰＵ１１は、抽出部２４として、抽出した矩形領域の各々に識別情報であるＩＤを付与し、ＩＤと、矩形領域の位置と、矩形領域のサイズとを含む矩形情報を抽出部２４及び検出部２６へ受け渡す。 Next, in step S16, the CPU 11, as the extraction unit 24, merges rectangular areas in the target image that have an overlapping degree equal to or greater than a predetermined value and where the minimum size of a detectable object satisfies a predetermined condition relative to the size of the rectangular area after merging. The CPU 11, as the extraction unit 24, assigns an ID, which is identification information, to each of the extracted rectangular areas, and passes rectangular information including the ID, the position of the rectangular area, and the size of the rectangular area to the extraction unit 24 and the detection unit 26.
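A possible shape for the merge test in step S16 is sketched below. The text specifies neither the overlap measure nor the condition on the minimum detectable object size, so both are assumptions here: overlap is measured as intersection over union (IoU), and the size condition is modeled as a hypothetical upper bound `max_side` on the merged rectangle, beyond which downscaling to the model input would make small objects undetectable.

```python
from typing import Optional, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)

def iou(a: Rect, b: Rect) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def try_merge(a: Rect, b: Rect, min_iou: float, max_side: int) -> Optional[Rect]:
    """Return the enclosing rectangle if both (assumed) conditions hold, else None."""
    if iou(a, b) < min_iou:
        return None  # not enough overlap
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    if max(x1 - x0, y1 - y0) > max_side:
        return None  # merged rectangle would be too large (hypothetical condition)
    return (x0, y0, x1 - x0, y1 - y0)
```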

　次に、ステップＳ１８で、ＣＰＵ１１は、検出部２６として、抽出部２４から受け渡された矩形情報に基づいて、抽出部２４により抽出された矩形領域に対応する対象画像上の領域を特定する。ＣＰＵ１１は、検出部２６として、対象画像から特定した領域の画像を切り出し、切り出した画像を、物体検出モデル３２の入力サイズに合わせて拡大又は縮小して物体検出モデル３２に入力する。そして、ＣＰＵ１１は、検出部２６として、物体検出モデル３２が出力する物体の検出結果であるメタデータを取得する。 Next, in step S18, the CPU 11, as the detection unit 26, identifies an area on the target image that corresponds to the rectangular area extracted by the extraction unit 24, based on the rectangular information passed from the extraction unit 24. The CPU 11, as the detection unit 26, cuts out an image of the identified area from the target image, enlarges or reduces the cut-out image to match the input size of the object detection model 32, and inputs it to the object detection model 32. The CPU 11, as the detection unit 26, then obtains metadata that is the object detection result output by the object detection model 32.
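The crop-and-rescale operation of step S18 can be illustrated with a dependency-free sketch. The text does not state which interpolation is used; nearest-neighbour sampling is assumed here purely for brevity, and the image is represented as a list of pixel rows. The function name `crop_and_resize` and its parameters are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values

def crop_and_resize(image: Image, x: int, y: int, size: int, model_size: int) -> Image:
    """Cut out a size-by-size region at (x, y) and rescale it to model_size.

    Nearest-neighbour sampling is an assumption; it handles both
    enlargement (size < model_size) and reduction (size > model_size).
    """
    crop = [row[x:x + size] for row in image[y:y + size]]
    h, w = len(crop), len(crop[0])
    return [
        [crop[r * h // model_size][c * w // model_size] for c in range(model_size)]
        for r in range(model_size)
    ]
```

A production system would more likely use an image library's resize routine with bilinear or area interpolation; the point of the sketch is only that the scale factor model_size / size is what ties the rectangle size S_i to the apparent object size seen by the detector.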

　次に、ステップＳ２０で、ＣＰＵ１１は、検出部２６として、物体の検出結果を出力すると共に、検出結果を処理部２８へ受け渡す。次に、ステップＳ２２で、ＣＰＵ１１は、処理部２８として、対象画像が、映像の最後のフレームか否かを判定する。最後のフレームの場合には、物体検出処理は終了し、継続するフレームがある場合には、ステップＳ３０へ移行する。 Next, in step S20, the CPU 11, functioning as the detection unit 26, outputs the object detection result and passes the detection result to the processing unit 28. Next, in step S22, the CPU 11, functioning as the processing unit 28, determines whether the target image is the last frame of the video. If it is the last frame, the object detection process ends, and if there are subsequent frames, the process proceeds to step S30.

　ステップＳ３０では、閾値調整処理が実行される。ここで、図１０を参照して、閾値調整処理について説明する。 In step S30, the threshold adjustment process is executed. Here, the threshold adjustment process is explained with reference to FIG. 10.

In step S32, the CPU 11, functioning as the processing unit 28, sets the variable i to 1. Next, in step S34, the CPU 11, functioning as the processing unit 28, calculates, for the target image, the average area a_i of the portions that the extraction unit 24 identifies in the density map as having a probability density equal to or greater than the threshold T_i. The CPU 11, functioning as the processing unit 28, then determines whether S_i × S_i ≤ a_i. If S_i × S_i ≤ a_i, the process proceeds to step S36; if S_i × S_i > a_i, the process proceeds to step S38.

In step S36, the CPU 11, functioning as the processing unit 28, raises the threshold T_i by a predetermined value or a predetermined ratio. In step S38, by contrast, the CPU 11, functioning as the processing unit 28, lowers the threshold T_i by a predetermined value or a predetermined ratio. Next, in step S40, the CPU 11, functioning as the processing unit 28, determines whether |T_k − T_{k+1}| < ε holds for adjacent thresholds T_k and T_{k+1}. If |T_k − T_{k+1}| < ε, the process proceeds to step S42; if |T_k − T_{k+1}| ≥ ε, the process proceeds to step S44.

In step S42, the CPU 11, functioning as the processing unit 28, merges the threshold T_k and the threshold T_{k+1}, for example by forming a new pair from the threshold T_k and the size S_{k+1} and discarding the threshold T_{k+1} and the size S_k. Next, in step S44, the CPU 11, functioning as the processing unit 28, determines whether the variable i has reached N, the number of pairs of a threshold and a rectangular-region size. If i < N, the process proceeds to step S46, where the CPU 11, functioning as the processing unit 28, increments i by 1, and then returns to step S34. If i = N, the threshold adjustment process ends and control returns to the object detection process (FIG. 9).

　次に、ステップＳ５０で、サイズ調整処理が実行される。ここで、図１１を参照して、サイズ調整処理について説明する。 Next, in step S50, the size adjustment process is executed. Now, the size adjustment process will be described with reference to FIG. 11.

　ステップＳ５２で、ＣＰＵ１１は、処理部２８として、対象画像から検出された物体のバウンディングボックスのサイズを、物体検出モデル３２で検出可能な物体サイズの下限を基準として、Ｓ、Ｍ、及びＬの３つのカテゴリに分類する。次に、ステップＳ５４で、ＣＰＵ１１は、処理部２８として、変数ｉに１を設定する。 In step S52, the CPU 11, as the processing unit 28, classifies the size of the bounding box of the object detected from the target image into three categories, S, M, and L, based on the lower limit of the object size that can be detected by the object detection model 32. Next, in step S54, the CPU 11, as the processing unit 28, sets the variable i to 1.

Next, in step S56, the CPU 11, functioning as the processing unit 28, determines whether any rectangular region of size S_i was extracted from the density map of the target image. If such a region exists, the process proceeds to step S58; otherwise, the process proceeds to step S66.

In step S58, the CPU 11, functioning as the processing unit 28, determines whether the L category is the most numerous among all objects detected in the images cut out from the target image based on rectangular regions of size S_i. If it is, the process proceeds to step S60, where the CPU 11, functioning as the processing unit 28, enlarges the size S_i by a predetermined ratio. If the most numerous category is not the L category, the process proceeds to step S62.

In step S62, the CPU 11, functioning as the processing unit 28, determines whether the S category is the most numerous among all objects detected in the images cut out from the target image based on rectangular regions of size S_i. If it is, the process proceeds to step S64, where the CPU 11, functioning as the processing unit 28, reduces the size S_i by a predetermined ratio. If the most numerous category is not the S category, that is, if it is the M category, the process proceeds to step S66.

　ステップＳ６６では、ＣＰＵ１１は、処理部２８として、変数ｉが閾値と矩形領域のサイズとのペア数であるＮとなったか否かを判定する。ｉ＜Ｎの場合には、ステップＳ６８へ移行し、ＣＰＵ１１は、処理部２８として、ｉを１インクリメントして、ステップＳ５６に戻る。一方、ｉ＝Ｎの場合には、サイズ調整処理は終了し、物体検出処理（図９）にリターンする。そして、ステップＳ１０に戻り、映像の次のフレームが対象画像として取得される。 In step S66, the CPU 11, functioning as the processing unit 28, determines whether the variable i has reached N, which is the number of pairs of a threshold value and the size of a rectangular area. If i<N, the process proceeds to step S68, and the CPU 11, functioning as the processing unit 28, increments i by 1 and returns to step S56. On the other hand, if i=N, the size adjustment process ends and the process returns to the object detection process (Figure 9). Then, the process returns to step S10, and the next frame of the video is acquired as the target image.

In the above object detection process, either the threshold adjustment process in step S30 or the size adjustment process in step S50 may be executed first. Alternatively, only one of the two may be executed.

　以上説明したように、第１実施形態に係る物体検出装置は、画像の各位置における物体が存在する確率密度を示す密度マップを推定するように、予め機械学習により生成された密度推定モデルを用いて、物体検出の対象となる対象画像の密度マップを推定する。また、物体検出装置は、推定された密度マップにおける確率密度に基づいて、物体検出の対象とする領域を抽出する。そして、物体検出装置は、抽出された領域に対応する前記対象画像の領域に、画像から物体を検出するために予め機械学習により生成された物体検出モデルを適用して物体を検出する。さらに、物体検出装置は、領域の抽出結果又は物体検出結果に応じて、抽出する領域のサイズを設定するための処理を行う。具体的には、物体検出装置は、映像において対象画像よりも前のフレームに対する抽出結果又は物体検出結果に基づいて、確率密度が閾値以上の部分を特定するための閾値、及び閾値とペアとなる領域のサイズの少なくとも一方を調整する。 As described above, the object detection device according to the first embodiment estimates a density map of a target image to be subjected to object detection using a density estimation model previously generated by machine learning so as to estimate a density map indicating the probability density of an object being present at each position of the image. The object detection device also extracts a region to be subjected to object detection based on the probability density in the estimated density map. The object detection device then detects an object by applying an object detection model previously generated by machine learning for detecting an object from an image to a region of the target image corresponding to the extracted region. Furthermore, the object detection device performs processing for setting the size of the region to be extracted according to the region extraction result or object detection result. Specifically, the object detection device adjusts at least one of the threshold for identifying a portion with a probability density equal to or greater than a threshold and the size of the region paired with the threshold, based on the extraction result or object detection result for a frame prior to the target image in the video.

　これにより、第１実施形態に係る物体検出装置は、物体検出モデルを適用する領域として、対象画像の場面に応じた最適なサイズの領域を入力画像から切り出すことができる。したがって、抽出する領域のサイズが不適切であることにより生じる物体の切断や画像縮小による物体の検出精度の劣化を抑制しつつ、高精細映像からの物体検出が可能となる。 As a result, the object detection device according to the first embodiment can extract from the input image an area of an optimal size according to the scene of the target image as the area to which the object detection model is applied. This makes it possible to detect objects from high-definition video while suppressing degradation of object detection accuracy due to cutting off of objects or image reduction caused by an inappropriate size of the extracted area.

　また、一般に、閾値とそれに紐付く矩形領域のサイズとのペアが複数ある場合、物体検出精度を高くするために、これらを手動で調整することは困難である。一方、第１実施形態に係る物体検出装置によれば、これらの調整を自動で行うことができ、調整の手間を省く効果がある。 In addition, in general, when there are multiple pairs of thresholds and the sizes of rectangular areas associated with them, it is difficult to manually adjust these in order to improve the accuracy of object detection. On the other hand, with the object detection device according to the first embodiment, these adjustments can be made automatically, which has the effect of reducing the effort required for adjustment.

　なお、第１実施形態では、検出された物体のバウンディングボックスの大きさをＳ、Ｍ、及びＬの３つのカテゴリに分類する場合について説明した。これは、物体のサイズが大き過ぎる場合、適切である場合、及び小さ過ぎる場合のそれぞれに対応できる最小の数が３であるという理由に基づくが、カテゴリの分類数は４以上であってもよい。 In the first embodiment, the size of the bounding box of a detected object is classified into three categories: S, M, and L. This is because three is the minimum number that can accommodate cases where the object size is too large, appropriate, and too small, respectively, but the number of categories may be four or more.

<Second Embodiment>
Next, a second embodiment will be described. In the object detection device according to the second embodiment, components that are the same as those of the object detection device 10 according to the first embodiment are denoted by the same reference numerals, and detailed description thereof is omitted. The hardware configuration of the object detection device according to the second embodiment is the same as that of the object detection device 10 according to the first embodiment shown in FIG. 1, and its description is therefore omitted.

　第２実施形態に係る物体検出装置の機能構成について説明する。図１２に示すように、物体検出装置２１０は、機能構成として、推定部２２と、抽出部２２４と、検出部２６と、処理部２２８とを含む。また、物体検出装置２１０の所定の記憶領域には、密度推定モデル３０と、物体検出モデル３２とが記憶される。各機能構成は、ＣＰＵ１１がＲＯＭ１２又はストレージ１４に記憶された物体検出プログラムを読み出し、ＲＡＭ１３に展開して実行することにより実現される。 The functional configuration of the object detection device according to the second embodiment will be described. As shown in FIG. 12, the object detection device 210 includes, as its functional configuration, an estimation unit 22, an extraction unit 224, a detection unit 26, and a processing unit 228. A density estimation model 30 and an object detection model 32 are stored in a predetermined storage area of the object detection device 210. Each functional configuration is realized by the CPU 11 reading out an object detection program stored in the ROM 12 or storage 14, expanding it in the RAM 13, and executing it.

　処理部２２８は、推定部２２により推定された密度マップにおいて、確率密度が極大値となる画素を探索する。また、処理部２２８は、確率密度が極大値となる画素を中心とする正規分布を仮定すると共に、極大値に基づいて、閾値となる分散を算出する。 The processing unit 228 searches for a pixel where the probability density is a maximum value in the density map estimated by the estimation unit 22. The processing unit 228 also assumes a normal distribution centered on the pixel where the probability density is a maximum value, and calculates a variance that serves as a threshold value based on the maximum value.

　具体的には、処理部２２８は、図１３上段の図に示すように、密度マップ全体から、確率密度が極大値となる画素を探索する。例えば、処理部２２８は、図１４に示すように、スライディングウィンドウを用いて極大値を探索してよい。図１４の例は、３×３画素のスライディングウィンドウを用いる場合である。また、図１４の各マスは、密度マップの一部の各画素を表し、各マス内の数字は、密度マップにおける各画素の確率密度を概略的に表したものである。 Specifically, as shown in the upper diagram of FIG. 13, the processing unit 228 searches the entire density map for pixels where the probability density is a maximum value. For example, the processing unit 228 may search for the maximum value using a sliding window as shown in FIG. 14. The example in FIG. 14 uses a sliding window of 3×3 pixels. Each square in FIG. 14 represents a pixel in a portion of the density map, and the numbers in each square roughly represent the probability density of each pixel in the density map.

The processing unit 228 compares the value of the central pixel of the sliding window with the values of the eight surrounding pixels. When the probability density of the central pixel is greater than that of every other pixel in the window, the processing unit 228 takes that probability density as a maximum value and the coordinates of that pixel as the maximum coordinates. The processing unit 228 denotes the K maximum values found by the search as M_i (i = 1, 2, ..., K) and the corresponding maximum coordinates as (x_i, y_i). The method of searching for maximum values is not limited to this example, and any method may be applied.
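The 3×3 sliding-window search described above can be sketched as follows. The density map is represented as a list of rows; the function name `find_local_maxima` is hypothetical, and skipping the border pixels (which do not have eight neighbours) is an assumption, since the text does not say how the map edges are treated.

```python
from typing import List, Tuple

def find_local_maxima(density: List[List[float]]) -> List[Tuple[int, int, float]]:
    """Return (x_i, y_i, M_i) for every pixel strictly greater than its 8 neighbours."""
    height = len(density)
    width = len(density[0]) if height else 0
    peaks = []
    # Assumption: border pixels are skipped because their window is incomplete.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            centre = density[y][x]
            neighbours = [
                density[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            ]
            if all(centre > n for n in neighbours):
                peaks.append((x, y, centre))
    return peaks
```

Because the comparison is strict, a flat plateau of equal values yields no peak; whether that matters depends on how smooth the estimated density map is.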

Furthermore, as shown in the middle diagram of FIG. 13, the processing unit 228 assumes that objects are distributed according to a normal distribution centered on the maximum coordinates. Specifically, the processing unit 228 denotes the normal distribution centered on a given maximum coordinate (x_i, y_i) as f_i(x, y). In general, assuming that the covariance between the x-axis and y-axis directions is 0 and the variance is σ_i, the normal distribution f_i(x, y) is given by equation (4) below.

Since f_i(x_i, y_i) = M_i, the processing unit 228 calculates the variance σ_i of f_i(x, y) by equation (5) below.
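Equations (4) and (5) are not reproduced in this text. As a hedged reconstruction rather than the published formulas, one form that is consistent with the threshold M_i exp(−T^2) used by the extraction unit 224 is an isotropic two-dimensional normal density in which σ_i plays the role of the standard deviation shared by both axes (an assumption; the text calls σ_i the variance):

```latex
% Candidate form for equation (4): isotropic 2-D normal density centered on (x_i, y_i)
f_i(x, y) = \frac{1}{2\pi\sigma_i^{2}}
            \exp\!\left(-\frac{(x - x_i)^{2} + (y - y_i)^{2}}{2\sigma_i^{2}}\right)

% Candidate form for equation (5): solve f_i(x_i, y_i) = M_i for sigma_i
f_i(x_i, y_i) = \frac{1}{2\pi\sigma_i^{2}} = M_i
\quad\Longrightarrow\quad
\sigma_i = \frac{1}{\sqrt{2\pi M_i}}

% Consistency check against the stated threshold
f_i(x_i \pm T\sigma_i,\; y_i \pm T\sigma_i)
  = M_i \exp\!\left(-\frac{T^{2}\sigma_i^{2} + T^{2}\sigma_i^{2}}{2\sigma_i^{2}}\right)
  = M_i \exp\!\left(-T^{2}\right)
```

Under this reading, a higher peak M_i implies a narrower assumed distribution, so sharply peaked parts of the density map yield smaller extracted regions.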

The processing unit 228 passes the maximum values M_i found by the search, the maximum coordinates (x_i, y_i), and the calculated variances σ_i to the extraction unit 224.

　抽出部２２４は、処理部２２８により探索された極大値座標の周辺で確率密度が閾値以上の画素が連続する部分を抽出する。具体的には、抽出部２２４は、処理部２２８が仮定した正規分布における、閾値として算出された分散の範囲を含む部分を、物体検出の対象とする領域として抽出する。 The extraction unit 224 extracts a continuous portion of pixels with a probability density equal to or greater than a threshold value around the maximum coordinates searched for by the processing unit 228. Specifically, the extraction unit 224 extracts a portion that includes the range of variance calculated as the threshold value in the normal distribution assumed by the processing unit 228 as a region to be subjected to object detection.

Specifically, using the variance σ_i passed from the processing unit 228 and a constant T, the extraction unit 224 searches around the maximum coordinates (x_i, y_i) for a portion of contiguous pixels whose probability density is at least f_i(x_i ± Tσ_i, y_i ± Tσ_i) (any combination of signs), that is, at least M_i exp(−T^2). The constant T is a positive real number that specifies the range of the variance to be extracted, and is set to an arbitrary value from outside. Then, as shown in the lower diagram of FIG. 13, the extraction unit 224 extracts a rectangular region enclosing the portion found by the search.
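The search for contiguous pixels around a peak can be sketched as a breadth-first flood fill followed by a bounding-box computation. The text does not define "contiguous"; 4-connectivity is assumed here. The function name `extract_region` and the list-of-rows representation of the density map are hypothetical.

```python
import math
from collections import deque
from typing import List, Tuple

def extract_region(density: List[List[float]], peak_x: int, peak_y: int,
                   t_const: float) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) of the rectangle enclosing the pixels
    connected to the peak whose density is >= M_i * exp(-T^2)."""
    m_i = density[peak_y][peak_x]
    threshold = m_i * math.exp(-t_const ** 2)
    height, width = len(density), len(density[0])

    seen = {(peak_x, peak_y)}
    queue = deque([(peak_x, peak_y)])
    x0 = x1 = peak_x
    y0 = y1 = peak_y
    while queue:
        x, y = queue.popleft()
        # Grow the bounding box to include this pixel.
        x0, x1 = min(x0, x), max(x1, x)
        y0, y1 = min(y0, y), max(y1, y)
        # Assumption: 4-connected neighbours count as contiguous.
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < width and 0 <= ny < height
                    and (nx, ny) not in seen
                    and density[ny][nx] >= threshold):
                seen.add((nx, ny))
                queue.append((nx, ny))
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1)
```

A larger T lowers the threshold M_i exp(−T^2) and therefore can only enlarge the extracted rectangle, which is how the single constant T controls the region size in this embodiment.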

　また、抽出部２２４は、第１実施形態における抽出部２４と同様の方法により、重複する矩形領域のマージを行う。抽出部２２４は、抽出した矩形領域についての矩形情報を検出部２６へ受け渡す。 The extraction unit 224 also merges overlapping rectangular areas using a method similar to that used by the extraction unit 24 in the first embodiment. The extraction unit 224 passes rectangular information about the extracted rectangular areas to the detection unit 26.

　次に、第２実施形態に係る物体検出装置２１０の作用について説明する。図１５は、物体検出装置２１０による物体検出処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から物体検出プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、物体検出処理が行なわれる。なお、第２実施形態の物体検出処理において、第１実施形態の物体検出処理と同様の処理については、同一のステップ番号を付して詳細な説明を省略する。 Next, the operation of the object detection device 210 according to the second embodiment will be described. FIG. 15 is a flowchart showing the flow of the object detection process by the object detection device 210. The object detection process is performed by the CPU 11 reading out an object detection program from the ROM 12 or storage 14, expanding it in the RAM 13, and executing it. Note that in the object detection process of the second embodiment, the same steps as those in the object detection process of the first embodiment are given the same step numbers and detailed descriptions are omitted.

After steps S10 and S12, in step S200, the CPU 11, functioning as the processing unit 228, searches the entire density map estimated in step S12 for pixels at which the probability density takes a maximum value. The CPU 11, functioning as the processing unit 228, denotes the K maximum values found by the search as M_i (i = 1, 2, ..., K) and the corresponding maximum coordinates as (x_i, y_i).

Next, in step S202, the CPU 11, functioning as the processing unit 228, assumes a normal distribution f_i(x, y) centered on the maximum coordinates and calculates the variance σ_i of f_i(x, y) from the maximum value M_i. The CPU 11, functioning as the processing unit 228, passes the maximum values M_i found by the search, the maximum coordinates (x_i, y_i), and the calculated variances σ_i to the extraction unit 224.

Next, in step S204, the CPU 11, functioning as the extraction unit 224, uses the variance σ_i passed from the processing unit 228 and the constant T to search around the maximum coordinates (x_i, y_i) for a portion of contiguous pixels whose probability density is at least M_i exp(−T^2). The CPU 11, functioning as the extraction unit 224, then extracts a rectangular region enclosing the portion found and passes the rectangle information of the extracted region to the detection unit 26.

　次に、ステップＳ１６～Ｓ２０を経て、次のステップＳ２２２で、ＣＰＵ１１は、処理部２２８として、対象画像が、映像の最後のフレームか否かを判定する。最後のフレームの場合には、物体検出処理は終了し、継続するフレームがある場合には、ステップＳ１０に戻る。 Next, after steps S16 to S20, in the next step S222, the CPU 11, as the processing unit 228, determines whether the target image is the last frame of the video. If it is the last frame, the object detection process ends, and if there are subsequent frames, the process returns to step S10.

　以上説明したように、第２実施形態に係る物体検出装置は、密度マップにおいて、確率密度が極大値となる極大値座標を探索し、この極大値座標を中心とした正規分布を仮定する。また、物体検出装置は、極大値に基づいて、仮定した正規分布の分散を算出し、確率密度が分散に基づいて定まる閾値以上の部分を囲む矩形領域を抽出する。これにより、密度マップに応じたサイズの矩形領域の抽出が可能となり、第１実施形態と同様に、物体検出モデルを適用する領域として、対象画像の場面に応じた最適なサイズの領域を入力画像から切り出すことができる。 As described above, the object detection device according to the second embodiment searches for the maximum coordinate in the density map where the probability density is a maximum value, and assumes a normal distribution centered on this maximum coordinate. The object detection device also calculates the variance of the assumed normal distribution based on the maximum value, and extracts a rectangular area that encloses the part where the probability density is equal to or greater than a threshold determined based on the variance. This makes it possible to extract a rectangular area of a size according to the density map, and as with the first embodiment, it is possible to cut out an area of an optimal size according to the scene in the target image from the input image as an area to which the object detection model is applied.

　また、第２実施形態では、探索された極大値座標を用いることで、第１実施形態のように、閾値及び矩形領域のサイズを用いることなく、分散の範囲を指定する定数のみを用いて物体の大きさに合わせた矩形領域の抽出が可能となる。したがって、定数の設定及びチューニングの負担が大幅に軽減される。 In addition, in the second embodiment, by using the coordinates of the searched maximum values, it is possible to extract a rectangular area that matches the size of the object using only a constant that specifies the range of variance, without using a threshold value and the size of the rectangular area as in the first embodiment. Therefore, the burden of setting and tuning the constants is significantly reduced.

Although each of the above embodiments describes the extraction unit extracting a rectangular region, regions of other shapes may be extracted instead. In that case, when the corresponding region is cut out of the target image, or when the cut-out image is input to the object detection model, the image must be processed into a shape and size that the object detection model can accept.

The object detection process that the CPU executes by reading software (a program) in each of the above embodiments may instead be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed specifically to execute particular processing, such as an ASIC (Application Specific Integrated Circuit). The object detection process may be executed by one of these processors, or by a combination of two or more processors of the same or different types (for example, multiple FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these processors is an electric circuit that combines circuit elements such as semiconductor elements.

Each of the above embodiments describes the object detection processing program as being stored (installed) in advance in the ROM 12 or the storage 14, but this is not limiting. The program may be provided stored on a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The program may also be downloaded from an external device via a network.

The following supplementary notes are further disclosed with respect to the above embodiments.

(Supplementary Note 1)
An object detection device comprising:
an estimation unit that estimates a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
an extraction unit that extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit;
a detection unit that detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect an object from an image; and
a processing unit that performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.

(Supplementary Note 2)
The object detection device according to Supplementary Note 1, wherein, when each frame of an input video is used as the target image, the processing unit adjusts at least one of a threshold for identifying a portion where the probability density is equal to or greater than the threshold and the size of the region paired with the threshold, based on the density map, or the object detection result of the detection unit, for a frame preceding the target image in the video.

(Supplementary Note 3)
The object detection device according to Supplementary Note 2, wherein the processing unit lowers the threshold by a predetermined value or a predetermined ratio when the area of a portion of contiguous pixels whose probability density is equal to or greater than the threshold is smaller than the area corresponding to the size of the region paired with the threshold, and raises the threshold by a predetermined value or a predetermined ratio when the area of the portion is larger than the area corresponding to the size of the region paired with the threshold.
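The threshold adjustment described in the note above can be sketched as follows. This is a minimal illustration under stated assumptions: contiguity is taken to mean 4-connectivity, the adjustment uses a predetermined ratio rather than a predetermined value, and the names `component_areas`, `adjust_threshold`, and `ratio` are hypothetical.

```python
from collections import deque
import numpy as np

def component_areas(mask):
    """Pixel counts of the 4-connected True components of a boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    areas = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or seen[r, c]:
                continue
            # Breadth-first flood fill starting from an unvisited pixel.
            queue, area = deque([(r, c)]), 0
            seen[r, c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas

def adjust_threshold(threshold, part_area, region_area, ratio=0.1):
    """Lower the threshold if the part is smaller than the paired region, raise it if larger."""
    if part_area < region_area:
        return threshold * (1.0 - ratio)
    if part_area > region_area:
        return threshold * (1.0 + ratio)
    return threshold
```

For example, `component_areas(density >= threshold)` gives the areas of the parts at or above the threshold in a previous frame, and each area can then be compared with the area of the region paired with that threshold.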

(Supplementary Note 4)
The object detection device according to Supplementary Note 2, wherein the processing unit classifies, in the object detection result of the detection unit for the preceding frame, the sizes of objects detected from a region of a first size into a plurality of categories that differ by size, and performs at least one of: a process of enlarging the first size by a predetermined ratio when the category of large objects contains more objects than the category corresponding to average-sized objects; and a process of reducing the first size by a predetermined ratio when the category of small objects contains more objects than the category corresponding to average-sized objects.
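The size adjustment described in the note above can be sketched as follows. This is a minimal illustration that assumes three categories (small, medium, large) split by two hypothetical boundaries, `small_max` and `large_min`, and a scalar region size such as a side length in pixels. The name `adjust_region_size` and the rule that enlargement takes precedence when both conditions hold are assumptions made for the example.

```python
def adjust_region_size(object_sizes, region_size, small_max, large_min, ratio=0.1):
    """Enlarge or reduce the region size based on a size histogram of detected objects.

    object_sizes: sizes (e.g. bounding-box side lengths) of the objects detected
                  in regions of the current size in the previous frame.
    small_max, large_min: hypothetical category boundaries; sizes below small_max
                  are "small", sizes at or above large_min are "large", and the
                  rest fall in the "medium" (average) category.
    """
    small = sum(1 for s in object_sizes if s < small_max)
    large = sum(1 for s in object_sizes if s >= large_min)
    medium = len(object_sizes) - small - large
    if large > medium:
        # More large objects than average-sized ones: enlarge the region.
        return region_size * (1.0 + ratio)
    if small > medium:
        # More small objects than average-sized ones: reduce the region.
        return region_size * (1.0 - ratio)
    return region_size
```

The adjusted size would then be used when extracting regions from the next frame.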

(Supplementary Note 5)
The object detection device according to Supplementary Note 1, wherein
the processing unit searches the density map estimated by the estimation unit for a pixel at which the probability density takes a local maximum, assumes a normal distribution centered on that pixel, and calculates the variance of the normal distribution based on the local maximum value, and
the extraction unit extracts, as the region to be subjected to object detection, a portion around the pixel found by the processing unit that includes the range of the variance of the normal distribution.

(Supplementary Note 6)
The object detection device according to any one of Supplementary Notes 1 to 5, wherein, when a plurality of regions are extracted, the extraction unit merges regions whose degree of overlap is equal to or greater than a predetermined value and for which the minimum size of a detectable object satisfies a predetermined condition relative to the size of the merged region.
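The merging rule in the note above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not specify: the degree of overlap is measured as intersection over union (IoU), and the condition on the minimum detectable object size is approximated by an upper bound `max_side` on the longer side of the merged region, since a larger merged region is scaled down more before detection. The names `iou`, `union_box`, `merge_regions`, `min_iou`, and `max_side` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def union_box(a, b):
    """Smallest box enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_regions(boxes, min_iou=0.3, max_side=640):
    """Repeatedly merge pairs that overlap enough and whose merged box stays small enough."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                merged = union_box(boxes[i], boxes[j])
                longer_side = max(merged[2] - merged[0], merged[3] - merged[1])
                if iou(boxes[i], boxes[j]) >= min_iou and longer_side <= max_side:
                    boxes[i] = merged
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes
```

Merging reduces the number of times the object detection model is applied, while the size bound keeps small objects from shrinking below what the model can detect after the merged region is resized.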

(Supplementary Note 7)
An object detection method in which:
an estimation unit estimates a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
an extraction unit extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit;
a detection unit detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect an object from an image; and
a processing unit performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.

(Supplementary Note 8)
An object detection program for causing a computer to function as:
an estimation unit that estimates a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
an extraction unit that extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit;
a detection unit that detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect an object from an image; and
a processing unit that performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.

(Supplementary Note 9)
An object detection device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
estimate a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
extract a region to be subjected to object detection based on the probability density in the estimated density map;
detect an object by applying, to the region of the target image corresponding to the extracted region, an object detection model generated in advance by machine learning to detect an object from an image; and
perform processing for setting the size of the region to be extracted according to the density map or the object detection result.

(Supplementary Note 10)
A non-transitory recording medium storing a program executable by a computer to perform an object detection process, the object detection process including:
estimating a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
extracting a region to be subjected to object detection based on the probability density in the estimated density map;
detecting an object by applying, to the region of the target image corresponding to the extracted region, an object detection model generated in advance by machine learning to detect an object from an image; and
performing processing for setting the size of the region to be extracted according to the density map or the object detection result.

10, 210 Object detection device
11 CPU
12 ROM
13 RAM
14 Storage
15 Input unit
16 Display unit
17 Communication I/F
19 Bus
22 Estimation unit
24, 224 Extraction unit
26 Detection unit
28, 228 Processing unit
30 Density estimation model
32 Object detection model

Claims

1. An object detection device comprising:
an estimation unit that estimates a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
an extraction unit that extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit;
a detection unit that detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect an object from an image; and
a processing unit that performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.

2. The object detection device according to claim 1, wherein, when each frame of an input video is used as the target image, the processing unit adjusts at least one of a threshold for identifying a portion where the probability density is equal to or greater than the threshold and the size of the region paired with the threshold, based on the density map, or the object detection result of the detection unit, for a frame preceding the target image in the video.

3. The object detection device according to claim 1, wherein
the processing unit searches the density map estimated by the estimation unit for a pixel at which the probability density takes a local maximum, assumes a normal distribution centered on that pixel, and calculates the variance of the normal distribution based on the local maximum value, and
the extraction unit extracts, as the region to be subjected to object detection, a portion around the pixel found by the processing unit that includes the range of the variance of the normal distribution.

4. An object detection method in which:
an estimation unit estimates a density map of a target image subject to object detection, using a density estimation model generated in advance by machine learning to estimate a density map indicating the probability density of an object being present at each position in an image;
an extraction unit extracts a region to be subjected to object detection based on the probability density in the density map estimated by the estimation unit;
a detection unit detects an object by applying, to the region of the target image corresponding to the region extracted by the extraction unit, an object detection model generated in advance by machine learning to detect an object from an image; and
a processing unit performs processing for setting the size of the region to be extracted by the extraction unit according to the density map or the object detection result of the detection unit.