CN108399362A - Rapid pedestrian detection method and device - Google Patents
Rapid pedestrian detection method and device
- Publication number
- CN108399362A (application CN201810069322.XA / CN201810069322A)
- Authority
- CN
- China
- Prior art keywords
- layer
- network
- target
- neural network
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Human Computer Interaction (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Biodiversity & Conservation Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of pedestrian detection, and in particular to a deep-learning-based fast pedestrian detection method and device for embedded systems.
Background Art
As a part of object detection in computer vision, pedestrian detection has important real-world applications. With the maturation of image acquisition technology and the falling cost of storage, more and more cameras are being deployed in public places; at the same time, the spread of autonomous driving and intelligent transportation means that vehicle-mounted cameras also generate massive amounts of video. Traditional manual screening and processing is not only inefficient and labor-intensive, it may also introduce human factors and hence bias. In recent years, deep learning has made unprecedented breakthroughs in computer vision: it is far more efficient than manual work, and in many fields its accuracy exceeds that of humans. Making effective use of deep learning for pedestrian detection has therefore attracted much attention.
People are among the most important targets in video surveillance and autonomous driving, and the primary task of pedestrian detection is to recognize the presence of human bodies and provide the corresponding annotations. Because the quality of images captured in the real world varies widely, detecting small and occluded objects has always been a difficult point in pedestrian detection; in addition, vehicle-mounted cameras often capture blurred images containing many pedestrian-like objects that are not actually pedestrians. For embedded systems in particular, large neural network models with strong recognition ability are usually hard to run efficiently on devices with limited computing resources, while embedded applications demand real-time operation, so balancing detection accuracy and efficiency is the top priority for fast pedestrian detection on embedded systems.
Summary of the Invention
To overcome the above deficiencies of the prior art, one object of the present invention is to provide a fast pedestrian detection method and device that exploit the way the receptive field of a neural network changes with depth, using different intermediate layers to detect target objects within specific scale ranges; this better matches the receptive field to object size and effectively improves the detection results.
Another object of the present invention is to provide a fast pedestrian detection method and device in which the VGG-16 network is adjusted and trained to obtain a squeeze VGG-16 network that meets the requirements of embedded systems, effectively reducing the number of model parameters and speeding up computation.
A further object of the present invention is to provide a fast pedestrian detection method and device in which the feature maps of specific network layers are enlarged by deconvolution to strengthen the detection of small objects; compared with traditional image up-scaling, this adds almost no extra memory or computation.
Yet another object of the present invention is to provide a fast pedestrian detection method and device in which a region 1.5 times the size of the target object is added to the network as a background semantic feature, giving excellent performance on blurred objects and small distant objects.
To achieve the above and other objects, the present invention proposes a fast pedestrian detection method comprising the following steps:
Step S1: construct a configurable deep model based on a convolutional neural network, learn the constructed network parameters from training samples, and obtain the model used in the testing stage;
Step S2: input a test sample and, with the trained model, detect target objects in different scale ranges with different intermediate layers according to the way the receptive field of the neural network changes with depth, predicting the bounding boxes of the target objects in the image.
Preferably, step S1 further comprises:
constructing a configurable deep model based on a convolutional neural network;
inputting training samples;
initializing the convolutional neural network and its parameters, including the weights and biases of each layer of the network;
learning the constructed network parameters, i.e. the model used in the testing stage, from the training samples with the forward propagation and backpropagation algorithms.
Preferably, the deep model comprises a multi-scale target candidate network and a target detection network. Exploiting the differences between the features extracted at different layers of the convolutional neural network, the target candidate network generates candidate boxes for target objects of different scales at its intermediate layers; the target detection network performs refined classification and detection on the candidate boxes output by the target candidate network.
Preferably, the convolutional neural network is built by stacking convolutional layers, downsampling layers and upsampling layers. A convolutional layer performs two-dimensional convolution on the input image or feature map to extract hierarchical features; a downsampling layer uses non-overlapping max-pooling to extract shape- and shift-invariant features while reducing the feature map size and improving computational efficiency; an upsampling layer performs two-dimensional deconvolution on the input feature map to increase its resolution.
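As an illustration only (not part of the patent text), the following minimal PyTorch sketch shows the three layer types just described; the channel counts, kernel sizes and input resolution are assumed values.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 448, 448)                               # dummy input image (assumed size)

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)             # convolutional layer: hierarchical features
pool = nn.MaxPool2d(kernel_size=2, stride=2)                  # non-overlapping max-pooling (downsampling)
deconv = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)  # deconvolution (upsampling)

f = conv(x)     # 1 x 64 x 448 x 448
f = pool(f)     # 1 x 64 x 224 x 224, smaller feature map, cheaper to process
f = deconv(f)   # 1 x 64 x 448 x 448, feature map enlarged again
```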
Preferably, the deep model uses a Squeeze VGG-16 convolutional neural network as its backbone, the feature extraction structure of which is a conv1-1 layer followed by 12 Fire module layers.
Preferably, on top of the Squeeze VGG-16 convolutional neural network, the target candidate network adds branches at the Fire9, Fire12 and conv6 layers and at an additional pooling layer, according to the characteristics of the convolutional features, to regress candidate boxes for objects detected at different scales.
Preferably, on the basis of the target candidate regions, the target detection network takes an image region a preset multiple of the size of each candidate region as the background semantic information of the target, upsamples the Fire9 feature map once to strengthen the perception of small objects, pools the background semantic information and the upsampled information over the regions of interest to obtain fixed-size features, and then adds a fully connected layer for class and final candidate-box regression.
Preferably, the training samples comprise RGB image data and annotations of the pedestrian regions in the images; the image data actually used for training are small patches cropped around the regions where pedestrians are located.
Preferably, the backpropagation algorithm first computes the loss between the bounding boxes predicted by forward propagation and the ground-truth boxes of the image, then computes the gradient of the loss with respect to the parameters W and updates W by gradient descent to minimize the loss. Assume the intermediate layers have M branches that can output target candidate regions, let l_m denote the loss function of branch m, α_m the weight of l_m, and S={S_1, S_2, ..., S_M} the target objects of the corresponding scales; the loss function can then be defined as

$$\mathcal{L}(W)=\sum_{m=1}^{M}\alpha_m\sum_{(X,Y)\in S_m} l_m\!\left(X,Y\mid W\right)$$
To achieve the above objects, the present invention further provides a fast pedestrian detection system comprising:
a training unit for constructing a configurable deep model based on a convolutional neural network, learning the constructed network parameters from training samples, and obtaining the model used in the testing stage;
a detection unit for inputting test samples and, with the trained model, detecting target objects in different scale ranges with different intermediate layers according to the way the receptive field of the neural network changes with depth, predicting the bounding boxes of the target objects in the image.
Compared with the prior art, the fast pedestrian detection method and device of the present invention draw on network compression: the VGG-16 network is adjusted and trained to obtain a squeeze VGG-16 network that meets the requirements of embedded systems, effectively reducing the number of model parameters and speeding up computation. On the other hand, to address the mismatch between receptive field and object size in traditional detection methods, the invention exploits the way the receptive field of a neural network changes with depth (the deeper the layer, the larger the receptive field and the better suited it is to detecting larger objects), using different intermediate layers to detect target objects within specific scale ranges; this better matches the receptive field to object size and effectively improves the detection results. In addition, to strengthen the detection of small objects, the invention enlarges the feature maps of specific network layers by deconvolution, which adds almost no extra memory or computation compared with traditional image up-scaling. Finally, to strengthen the detection of blurred objects, a region 1.5 times the size of the target object on that layer's feature map is added to the network as a background semantic feature, giving excellent performance on blurred objects and small distant objects.
Brief Description of the Drawings
FIG. 1 is a flow chart of the steps of the fast pedestrian detection method of the present invention;
FIG. 2 is a schematic diagram of the Squeeze VGG-16 neural network structure in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the Fire module in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the target candidate network in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the target detection network in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the fast pedestrian detection process in an embodiment of the present invention;
FIG. 7 is a system architecture diagram of the fast pedestrian detection device of the present invention;
FIG. 8 is a detailed structural diagram of the training unit in an embodiment of the present invention;
FIG. 9 is a detailed structural diagram of the detection unit in an embodiment of the present invention.
Detailed Description of the Embodiments
The embodiments of the present invention are described below through specific examples in conjunction with the accompanying drawings; those skilled in the art can readily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention may also be implemented or applied through other different specific examples, and various modifications and changes may be made to the details of this specification from different viewpoints and for different applications without departing from the spirit of the present invention.
FIG. 1 is a flow chart of the steps of the fast pedestrian detection method of the present invention. As shown in FIG. 1, the method comprises the following steps:
Step S1: construct a configurable deep model based on a convolutional neural network, learn the constructed network parameters from training samples, and obtain the model used in the testing stage. In this embodiment the deep model consists of two sub-networks. The first, a multi-scale target candidate network, extracts person features and produces candidate regions; specifically, exploiting the differences between the features extracted at different layers of the convolutional neural network, it generates candidate boxes for pedestrians of different scales at its intermediate layers. The second, a target detection network, enhances the detection results; it shares parameters with the target candidate network and performs refined classification and detection on the candidate boxes. Specifically, step S1 further comprises:
Step S100: construct a configurable deep model based on a convolutional neural network.
The convolutional neural network is built by stacking convolutional layers, downsampling layers and upsampling layers. A convolutional layer performs two-dimensional convolution on the input image or feature map to extract hierarchical features; a downsampling layer uses non-overlapping max-pooling, which extracts shape- and shift-invariant features while reducing the feature map size and improving computational efficiency; an upsampling layer performs two-dimensional deconvolution on the input feature map to increase its resolution, and is mainly used in the target detection network to improve detection. In this embodiment a Squeeze VGG-16 convolutional neural network serves as the backbone. As shown in FIG. 2, Squeeze VGG-16 uses a conv1-1 layer followed by 12 Fire modules as convolutional layers to extract features, and pool1-pool5 are downsampling layers. A model pre-trained on the ImageNet dataset is used for initialization; that is, the present invention first pre-trains Squeeze VGG-16 on the ImageNet dataset as network initialization.
FIG. 3 is a schematic structural diagram of the Fire module in an embodiment of the present invention. As shown in FIG. 3, a Fire module consists of two convolutional layers with 1×1 kernels and one convolutional layer with 3×3 kernels. The idea is to replace 3×3 kernels with 1×1 kernels, reducing the number of parameters ninefold; however, so as not to harm the representational capacity of the network, not all kernels are replaced: some are 1×1 and some remain 3×3. A further benefit is that the number of input channels to the 3×3 kernels is reduced, which also cuts the parameter count. Specifically, the Fire module first uses a 1×1 convolutional layer to reduce the dimensionality of the input, then, following the GoogLeNet structure, extracts features with 1×1 and 3×3 convolutional layers and finally concatenates the two sets of features. This greatly reduces both the computation and the number of model parameters.
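A hedged PyTorch sketch of such a Fire module is shown below; the squeeze and expand channel counts are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """1x1 squeeze layer followed by parallel 1x1 and 3x3 expand layers, concatenated."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)  # dimensionality reduction
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        # concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

# e.g. a Fire module producing 512 output channels (256 + 256), matching the
# 512-channel feature maps mentioned later in the description
fire = Fire(in_ch=256, squeeze_ch=64, expand1x1_ch=256, expand3x3_ch=256)
```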
FIG. 4 is a schematic diagram of the architecture of the target candidate network in an embodiment of the present invention. In this embodiment the target candidate network, built on the Squeeze VGG-16 convolutional neural network, adds network branches at four layers in total (Fire9, Fire12, conv6 and an additional pooling layer) according to the characteristics of the convolutional features; each branch regresses candidate boxes for objects detected at its own scale. The Fire9 layer, however, is close to the lower layers of the backbone, so compared with the other layers its branch would affect the gradients strongly and make learning unstable. An extra buffer layer is therefore added, shown as the det-conv layer in FIG. 4; the buffer layer prevents the gradient of the detection branch from being back-propagated directly into the backbone layers.
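The per-scale branches can be pictured with the hedged sketch below. The "class scores + 4 box coordinates per location" head follows the 6×60×80 output described later, while the extra 3×3 buffer convolution on the Fire9 branch is an assumption consistent with the det-conv layer described above.

```python
import torch.nn as nn

def proposal_branch(in_ch=512, num_classes=2, num_coords=4, use_buffer=False):
    layers = []
    if use_buffer:
        # buffer ("det-conv") layer: keeps the branch gradient from flowing
        # straight back into the low-level backbone features
        layers += [nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    # per-location output: class scores + candidate-box coordinates
    layers.append(nn.Conv2d(in_ch, num_classes + num_coords, kernel_size=3, padding=1))
    return nn.Sequential(*layers)

branch_fire9  = proposal_branch(use_buffer=True)   # shallow layer, buffered branch
branch_fire12 = proposal_branch()
branch_conv6  = proposal_branch()
branch_pool   = proposal_branch()
```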
The present invention exploits the way the receptive field of a neural network changes with depth (the deeper the layer, the larger the receptive field and the better suited it is to detecting larger objects), using different intermediate layers to detect target objects within specific scale ranges; this better matches the receptive field to object size and effectively improves the detection results.
FIG. 5 is a schematic diagram of the architecture of the target detection network in an embodiment of the present invention. The target detection network shares parameters with the target candidate network and aggregates the candidate boxes of the target candidate network, strengthening the detection network's ability to distinguish objects from the background. In this embodiment, on the basis of each target candidate region, the target detection network takes an image region 1.5 times the size of the candidate region as the background semantic information of the target, and upsamples the Fire9 feature map once to strengthen the perception of small objects. The background semantic information and the upsampled information are pooled over the regions of interest (ROI pooling) to obtain fixed-size features, after which a fully connected layer is added for class and final candidate-box regression. Specifically, the backbone CNN is connected to a proposals node that aggregates the candidate-box information obtained by the target candidate network. For the Fire9 feature map, W and H are the width and height of the input image; cube 1 in FIG. 5 represents the mapping of the object region onto the feature map, and cube 2 the mapping of the context region, which is about 1.5 times the object region. To further strengthen the detection of small objects, the Fire9 layer is upsampled once more; then, similarly to the Faster R-CNN algorithm, ROI pooling produces fixed-size features. The features from the processed Fire9 layer are concatenated with the features aggregated from the proposals, and a fully connected layer is added for class and final candidate-box regression, which is not described in further detail here.
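A hedged sketch of this detection head is given below: for each proposal, features are ROI-pooled from both the object box and a context box 1.5 times its size on the upsampled Fire9 feature map, then concatenated and passed through a fully connected layer for class and box regression. The pooled size, the fully connected width and the use of torchvision's roi_pool are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class DetectionHead(nn.Module):
    def __init__(self, in_ch=512, pool_size=7, num_classes=2):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(in_ch, in_ch, kernel_size=2, stride=2)  # enlarge Fire9 map
        self.fc = nn.Linear(2 * in_ch * pool_size * pool_size, 4096)
        self.cls = nn.Linear(4096, num_classes)   # class regression
        self.box = nn.Linear(4096, 4)             # final box regression
        self.pool_size = pool_size

    def forward(self, fire9_feat, rois, spatial_scale):
        # rois: (N, 5) tensor of [batch_index, x1, y1, x2, y2] proposals in image coordinates
        feat = self.upsample(fire9_feat)
        ctx = rois.clone()                        # context box: grow the proposal to 1.5x its size
        w, h = ctx[:, 3] - ctx[:, 1], ctx[:, 4] - ctx[:, 2]
        ctx[:, 1] -= 0.25 * w; ctx[:, 3] += 0.25 * w
        ctx[:, 2] -= 0.25 * h; ctx[:, 4] += 0.25 * h
        obj_f = roi_pool(feat, rois, self.pool_size, spatial_scale * 2)  # *2: map was upsampled
        ctx_f = roi_pool(feat, ctx,  self.pool_size, spatial_scale * 2)
        x = torch.cat([obj_f, ctx_f], dim=1).flatten(1)                  # concat object + context
        x = torch.relu(self.fc(x))
        return self.cls(x), self.box(x)
```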
Step S101: input training samples.
The training process must be provided with the boxes of the reference persons in the images. To speed up training, the images containing reference persons are cropped from the original images into patches (image blocks); a patch is smaller than the original image, which effectively accelerates training. Specifically, in the present invention the input training samples comprise RGB image data and annotations of the pedestrian regions in the images, and the image data actually used for training are small patches cropped around the regions where pedestrians are located. In mathematical terms, the training samples are $\{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i$ denotes a patch of a training image. In practical applications there are, besides pedestrians, K other categories such as background, cyclists and seated persons, so the annotation $Y_i=(y_i, b_i)$ consists of a class label $y_i\in\{0,1,2,\ldots,K\}$ and box coordinates $b_i=(b_i^x, b_i^y, b_i^w, b_i^h)$, where $(b_i^x, b_i^y)$ is the starting coordinate of the top-left corner of the box and $(b_i^w, b_i^h)$ are its width and height.
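For illustration, a training sample and its annotation could be represented as in the hedged sketch below; the class ids and the crop margin are assumptions, not values given by the patent.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    patch: np.ndarray   # X_i: an H x W x 3 RGB patch cropped from the original image
    label: int          # y_i in {0, 1, ..., K}, e.g. 0 = background, 1 = pedestrian
    box: tuple          # b_i = (x, y, w, h): top-left corner plus width and height

def crop_patch(image, box, margin=0.2):
    """Crop a patch around an annotated pedestrian box, with a small margin."""
    x, y, w, h = box
    H, W = image.shape[:2]
    x0, y0 = max(0, int(x - margin * w)), max(0, int(y - margin * h))
    x1, y1 = min(W, int(x + (1 + margin) * w)), min(H, int(y + (1 + margin) * h))
    return image[y0:y1, x0:x1]
```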
Step S102: initialize the convolutional neural network and its parameters, including the weights and biases of each layer of the network. Specifically, the present invention pre-trains the Squeeze VGG-16 convolutional neural network on the ImageNet dataset as network initialization.
Step S103: learn the constructed network parameters, i.e. the model used in the testing stage, from the training samples with the forward propagation and backpropagation algorithms.
In the present invention, the forward propagation algorithm first normalizes the input image to 3×480×640 and crops a 3×448×448 patch which, together with the corresponding annotations, is fed into the convolutional neural network. After the convolutional layers, downsampling layers and rectified linear unit (ReLU nonlinearity) layers, the image feature map is 512×60×80 at the Fire9 layer and 512×30×40 at the Fire12 layer; in the two subsequent branches the feature map sizes are 512×15×20 and 512×8×10. On the different feature maps, convolution produces the four coordinates and the class information of the target boxes. Taking the Fire9 layer as an example and assuming that only pedestrians and background are detected, the output feature size is 6×60×80, where the 6 channels comprise the two classes (background and pedestrian) and the four coordinates of the candidate box. In the target detection network, the candidate boxes obtained by the branch layers are aggregated at the proposals node and superimposed with the features obtained by ROI pooling of the Fire9 layer's background semantic information and upsampled information, for the final box regression and class regression.
In the present invention, the backpropagation algorithm first computes the loss between the target boxes predicted by forward propagation and the ground-truth boxes of the image, then computes the gradient of the loss with respect to the parameters W and updates W by gradient descent to minimize the loss. Assume the intermediate layers have M branches that can output target candidate regions (receptive fields at M scales can approximately detect all target objects in the image), let l_m denote the loss function of branch m, α_m the weight of l_m, and S={S_1, S_2, ..., S_M} the target objects of the corresponding scales; the loss function can then be defined as

$$\mathcal{L}(W)=\sum_{m=1}^{M}\alpha_m\sum_{(X,Y)\in S_m} l_m\!\left(X,Y\mid W\right)\qquad(1)$$
For a specific detection layer m, only targets whose scale lies within the range that m can detect contribute to the loss, so the per-layer loss function is defined as

$$l_m(X,Y\mid W)=L_{cls}\!\left(p(X),y\right)+\lambda\,[y\ge 1]\,L_{loc}\!\left(b,\hat{b}\right)\qquad(2)$$
where $p(X)=(p_0(X),\ldots,p_K(X))$ is the probability distribution over the target classes, $\lambda$ is a balancing coefficient, $b$ denotes the four coordinates of the box and $\hat{b}$ the coordinates obtained by forward propagation. In the loss function, class regression is defined with the cross-entropy loss, i.e.
Lcls(p(X),y)=-logy(P(X)) (3)L cls (p(X), y)=-log y (P(X)) (3)
Box regression uses the smoothed Manhattan distance (smooth L1) criterion, defined as

$$L_{loc}(b,\hat{b})=\frac{1}{4}\sum_{j\in\{x,y,w,h\}}\mathrm{smooth}_{L_1}\!\left(b_j,\hat{b}_j\right)\qquad(4)$$
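A hedged PyTorch sketch of this training loss is shown below: a per-branch cross-entropy plus smooth-L1 box term (applied only to non-background samples), combined as a weighted sum over the M branches. The branch weights and the use of F.smooth_l1_loss are assumptions.

```python
import torch.nn.functional as F

def branch_loss(cls_logits, boxes_pred, labels, boxes_gt, lam=1.0):
    """Eq. (2): L_cls + lambda * [y >= 1] * L_loc for one detection branch."""
    cls_loss = F.cross_entropy(cls_logits, labels)                    # Eq. (3)
    fg = labels >= 1                                                  # only real objects get a box loss
    loc_loss = F.smooth_l1_loss(boxes_pred[fg], boxes_gt[fg]) if fg.any() else 0.0  # Eq. (4)
    return cls_loss + lam * loc_loss

def total_loss(branch_outputs, branch_targets, alphas):
    """Eq. (1): weighted sum of the per-branch losses over the M branches."""
    return sum(a * branch_loss(*out, *tgt)
               for a, out, tgt in zip(alphas, branch_outputs, branch_targets))
```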
Step S2: with the trained model, detect target objects in different scale ranges with different intermediate layers according to the way the receptive field of the neural network changes with depth, and predict the bounding boxes of the target objects (e.g. pedestrians) in the image.
Specifically, step S2 further comprises:
Step S200: load the trained model;
Step S201: input a test sample;
Step S202: with the trained model, detect pedestrians in different scale ranges with different intermediate layers according to the way the receptive field of the neural network changes with depth, and predict the bounding boxes of the pedestrians in the image. FIG. 6 is a schematic diagram of the fast pedestrian detection process in an embodiment of the present invention: using the target candidate network of the model, built on the Squeeze VGG-16 convolutional neural network, network branches are generated at four layers in total (fire9, fire12, conv6 and an additional pooling layer) according to the characteristics of the convolutional features, producing target candidate regions for objects detected at different scales (intermediate layers a, b and c). Then, with the target detection network, on the basis of each target candidate region, an image region 1.5 times the size of the candidate region is taken as the background semantic information of the target, the Fire9 feature map is upsampled once to strengthen the perception of small objects, the background semantic information and the upsampled information are pooled over the regions of interest to obtain fixed-size features, and a fully connected layer is added for class and final candidate-box regression. Preferably, in step S202 the feature maps of specific network layers are also enlarged by deconvolution.
The pedestrian detection method proposed by the present invention is evaluated with two kinds of metrics: mean average precision (mAP) and frames per second (FPS). mAP evaluates the intersection-over-union between the final detected regions and the ground-truth person regions, i.e. the average precision over different intersection-over-union thresholds; FPS is mainly an efficiency metric, the number of images that can be processed per second.
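FPS can be estimated as in the minimal sketch below (an illustration only, not part of the patent): average the forward-pass time of the trained model over a set of test images.

```python
import time
import torch

def measure_fps(model, images, device="cuda"):
    model.eval().to(device)
    with torch.no_grad():
        start = time.time()
        for img in images:                      # img: a 3 x 480 x 640 tensor
            model(img.unsqueeze(0).to(device))
        elapsed = time.time() - start
    return len(images) / elapsed                # images processed per second
```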
FIG. 7 is a system architecture diagram of the fast pedestrian detection device of the present invention. As shown in FIG. 7, the fast pedestrian detection device of the present invention comprises:
a training unit 70 for constructing a configurable deep model based on a convolutional neural network, learning the constructed network parameters from training samples, and obtaining the model used in the testing stage. In this embodiment the deep model constructed by the training unit 70 consists of two sub-networks. The first, a multi-scale target candidate network, extracts person features and produces candidate regions; specifically, exploiting the differences between the features extracted at different layers of the convolutional neural network, it generates candidate boxes for pedestrians of different scales at its intermediate layers. The second, a target detection network, enhances the detection results; it shares parameters with the target candidate network and performs refined classification and detection on the candidate boxes. Specifically, as shown in FIG. 8, the training unit 70 further comprises:
a model construction unit 701 for constructing a configurable deep model based on a convolutional neural network.
The convolutional neural network is built by stacking convolutional layers, downsampling layers and upsampling layers. A convolutional layer performs two-dimensional convolution on the input image or feature map to extract hierarchical features; a downsampling layer uses non-overlapping max-pooling to extract shape- and shift-invariant features while reducing the feature map size and improving computational efficiency; an upsampling layer performs two-dimensional deconvolution on the input feature map to increase its resolution. In this embodiment a Squeeze VGG-16 convolutional neural network serves as the backbone.
In this embodiment the target candidate network, built on the Squeeze VGG-16 convolutional neural network, adds network branches at four layers in total (fire9, fire12, conv6 and an additional pooling layer) according to the characteristics of the convolutional features; each branch regresses candidate boxes for objects detected at its own scale. The fire9 layer, however, is close to the lower layers of the backbone, so compared with the other layers its branch would affect the gradients strongly and make learning unstable. An extra buffer layer is therefore added, which prevents the gradient of the detection branch from being back-propagated directly into the backbone layers.
The target detection network shares parameters with the target candidate network and aggregates the candidate boxes of the target candidate network, strengthening the detection network's ability to distinguish objects from the background. In this embodiment, on the basis of each target candidate region, the target detection network takes an image region 1.5 times the size of the candidate region as the background semantic information of the target, upsamples the Fire9 feature map once to strengthen the perception of small objects, pools the background semantic information and the upsampled information over the regions of interest to obtain fixed-size features, and then adds a fully connected layer for class and final candidate-box regression. Specifically, the backbone CNN layer is connected to a proposals sub-network; W and H are the width and height of the input image, cube 1 represents the pooling of the object region and cube 2 the pooling of the context region, which is about 1.5 times the object region. To further strengthen the detection of small objects, the Fire9 layer is upsampled once more; then, similarly to the Faster R-CNN algorithm, ROI pooling produces fixed-size features, after which a fully connected layer is added for class and final candidate-box regression.
a training sample input unit 702 for inputting training samples.
Specifically, the training samples are $\{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i$ denotes a patch of a training image and the annotation $Y_i=(y_i, b_i)$ consists of a class label $y_i$ and box coordinates $b_i=(b_i^x, b_i^y, b_i^w, b_i^h)$.
an initialization unit 703 for initializing the convolutional neural network and its parameters, including the weights and biases of each layer of the network. Specifically, the present invention pre-trains the Squeeze VGG-16 convolutional neural network on the ImageNet dataset as network initialization.
a sample training unit 704 for learning the constructed network parameters, i.e. the model used in the testing stage, from the training samples with the forward propagation and backpropagation algorithms.
In the present invention, the forward propagation algorithm first normalizes the input image to 3×480×640 and crops a 3×448×448 patch which, together with the corresponding annotations, is fed into the convolutional neural network. After the convolutional layers, downsampling layers and rectified linear unit (ReLU nonlinearity) layers, the image feature map is 512×60×80 at the Fire9 layer and 512×30×40 at the Fire12 layer; in the two subsequent branches the feature map sizes are 512×15×20 and 512×8×10. On the different feature maps, convolution produces the four coordinates and the class information of the target boxes. Taking the Fire9 layer as an example and assuming that only pedestrians and background are detected, the output feature size is 6×60×80, where the 6 channels comprise the two classes (background and pedestrian) and the four coordinates of the candidate box. In the target detection network, the candidate boxes obtained by the branch layers are aggregated at the proposals node and superimposed with the features obtained by ROI pooling of the Fire9 layer's background semantic information and upsampled information, for the final box regression and class regression.
The backpropagation algorithm first computes the loss between the target boxes predicted by forward propagation and the ground-truth boxes of the image, then computes the gradient of the loss with respect to the parameters W and updates W by gradient descent to minimize the loss. Assume the intermediate layers have M branches that can output target candidate regions (receptive fields at M scales can approximately detect all target objects in the image), let l_m denote the loss function of branch m, α_m the weight of l_m, and S={S_1, S_2, ..., S_M} the target objects of the corresponding scales; the loss function can then be defined as

$$\mathcal{L}(W)=\sum_{m=1}^{M}\alpha_m\sum_{(X,Y)\in S_m} l_m\!\left(X,Y\mid W\right)$$
For a specific detection layer m, only targets whose scale lies within the range that m can detect contribute to the loss, so the per-layer loss function is defined as

$$l_m(X,Y\mid W)=L_{cls}\!\left(p(X),y\right)+\lambda\,[y\ge 1]\,L_{loc}\!\left(b,\hat{b}\right)$$

where $p(X)=(p_0(X),\ldots,p_K(X))$ is the probability distribution over the target classes. In the loss function, class regression is defined with the cross-entropy loss, namely

$$L_{cls}(p(X),y)=-\log p_y(X)$$

Box regression uses the smooth L1 criterion, defined as

$$L_{loc}(b,\hat{b})=\frac{1}{4}\sum_{j\in\{x,y,w,h\}}\mathrm{smooth}_{L_1}\!\left(b_j,\hat{b}_j\right)$$
a detection unit 71 for inputting test samples and, with the trained model, detecting target objects (e.g. pedestrians) in different scale ranges with different intermediate layers according to the way the receptive field of the neural network changes with depth, predicting the bounding boxes of the target objects (e.g. pedestrians) in the image.
Specifically, as shown in FIG. 9, the detection unit 71 further comprises:
a model loading unit 710 for loading the trained model;
a test sample input unit 711 for inputting test samples;
an image prediction unit 712 for using the trained model to detect pedestrians in different scale ranges with different intermediate layers according to the way the receptive field of the neural network changes with depth, predicting the bounding boxes of the pedestrians in the image. Specifically, the image prediction unit 712 uses the target candidate network of the model, built on the Squeeze VGG-16 convolutional neural network, to generate network branches at four layers in total (Fire9, Fire12, conv6 and an additional pooling layer) according to the characteristics of the convolutional features, producing target candidate regions for objects detected at different scales; then, with the target detection network, on the basis of each target candidate region, an image region 1.5 times the size of the candidate region is taken as the background semantic information of the target, the Fire9 feature map is upsampled once to strengthen the perception of small objects, the background semantic information and the upsampled information are pooled over the regions of interest to obtain fixed-size features, and a fully connected layer is added for class and final candidate-box regression.
In summary, the fast pedestrian detection method and device of the present invention draw on network compression: the VGG-16 network is adjusted and trained to obtain a squeeze VGG-16 network that meets the requirements of embedded systems, effectively reducing the number of model parameters and speeding up computation. On the other hand, to address the mismatch between receptive field and object size in traditional detection methods, the invention exploits the way the receptive field of a neural network changes with depth (the deeper the layer, the larger the receptive field and the better suited it is to detecting larger objects), using different intermediate layers to detect target objects within specific scale ranges; this better matches the receptive field to object size and effectively improves the detection results. In addition, to strengthen the detection of small objects, the invention enlarges the feature maps of specific network layers by deconvolution, which adds almost no extra memory or computation compared with traditional image up-scaling. Finally, to strengthen the detection of blurred objects, a region 1.5 times the size of the target object on that layer's feature map is added to the network as a background semantic feature, giving excellent performance on blurred objects and small distant objects.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall therefore be as set forth in the claims.
Claims (10)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810069322.XA CN108399362B (en) | 2018-01-24 | 2018-01-24 | Rapid pedestrian detection method and device |
| PCT/CN2018/095058 WO2019144575A1 (en) | 2018-01-24 | 2018-07-10 | Fast pedestrian detection method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810069322.XA CN108399362B (en) | 2018-01-24 | 2018-01-24 | Rapid pedestrian detection method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN108399362A true CN108399362A (en) | 2018-08-14 |
| CN108399362B CN108399362B (en) | 2022-01-07 |
Family
ID=63094281
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810069322.XA Active CN108399362B (en) | 2018-01-24 | 2018-01-24 | Rapid pedestrian detection method and device |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN108399362B (en) |
| WO (1) | WO2019144575A1 (en) |
Families Citing this family (324)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018176000A1 (en) | 2017-03-23 | 2018-09-27 | DeepScale, Inc. | Data synthesis for autonomous control systems |
| US11893393B2 (en) | 2017-07-24 | 2024-02-06 | Tesla, Inc. | Computational array microprocessor system with hardware arbiter managing memory requests |
| US11409692B2 (en) | 2017-07-24 | 2022-08-09 | Tesla, Inc. | Vector computational unit |
| US11157441B2 (en) | 2017-07-24 | 2021-10-26 | Tesla, Inc. | Computational array microprocessor system using non-consecutive data formatting |
| US10671349B2 (en) | 2017-07-24 | 2020-06-02 | Tesla, Inc. | Accelerated mathematical engine |
| US12307350B2 (en) | 2018-01-04 | 2025-05-20 | Tesla, Inc. | Systems and methods for hardware-based pooling |
| US11561791B2 (en) | 2018-02-01 | 2023-01-24 | Tesla, Inc. | Vector computational unit receiving data elements in parallel from a last row of a computational array |
| US11215999B2 (en) | 2018-06-20 | 2022-01-04 | Tesla, Inc. | Data pipeline and deep learning system for autonomous driving |
| US11361457B2 (en) | 2018-07-20 | 2022-06-14 | Tesla, Inc. | Annotation cross-labeling for autonomous control systems |
| US11636333B2 (en) | 2018-07-26 | 2023-04-25 | Tesla, Inc. | Optimizing neural network structures for embedded systems |
| US11562231B2 (en) | 2018-09-03 | 2023-01-24 | Tesla, Inc. | Neural networks for embedded devices |
| WO2020077117A1 (en) | 2018-10-11 | 2020-04-16 | Tesla, Inc. | Systems and methods for training machine models with augmented data |
| US11196678B2 (en) | 2018-10-25 | 2021-12-07 | Tesla, Inc. | QOS manager for system on a chip communications |
| US11816585B2 (en) | 2018-12-03 | 2023-11-14 | Tesla, Inc. | Machine learning models operating at different frequencies for autonomous vehicles |
| US11537811B2 (en) | 2018-12-04 | 2022-12-27 | Tesla, Inc. | Enhanced object detection for autonomous vehicles based on field view |
| US11610117B2 (en) | 2018-12-27 | 2023-03-21 | Tesla, Inc. | System and method for adapting a neural network model on a hardware platform |
| US11150664B2 (en) | 2019-02-01 | 2021-10-19 | Tesla, Inc. | Predicting three-dimensional features for autonomous driving |
| US10997461B2 (en) | 2019-02-01 | 2021-05-04 | Tesla, Inc. | Generating ground truth for machine learning from time series elements |
| US11567514B2 (en) | 2019-02-11 | 2023-01-31 | Tesla, Inc. | Autonomous and user controlled vehicle summon to a target |
| US10956755B2 (en) | 2019-02-19 | 2021-03-23 | Tesla, Inc. | Estimating object properties using visual image data |
| CN110659664B (en) * | 2019-08-02 | 2022-12-13 | 杭州电子科技大学 | A method for recognizing small objects with high precision based on SSD |
| CN110633631B (en) * | 2019-08-06 | 2022-02-18 | 厦门大学 | Pedestrian re-identification method based on component power set and multi-scale features |
| CN110619268B (en) * | 2019-08-07 | 2022-11-25 | 北京市新技术应用研究所 | Pedestrian re-identification method and device based on space-time analysis and depth features |
| CN110533084B (en) * | 2019-08-12 | 2022-09-30 | 长安大学 | Multi-scale target detection method based on self-attention mechanism |
| CN110473195B (en) * | 2019-08-13 | 2023-04-18 | 中山大学 | Medical focus detection framework and method capable of being customized automatically |
| CN110427915B (en) * | 2019-08-14 | 2022-09-27 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
| CN110705583B (en) * | 2019-08-15 | 2024-03-15 | 平安科技(深圳)有限公司 | Cell detection model training method, device, computer equipment and storage medium |
| CN110490252B (en) * | 2019-08-19 | 2022-11-15 | 西安工业大学 | Indoor people number detection method and system based on deep learning |
| CN110659576A (en) * | 2019-08-23 | 2020-01-07 | 深圳久凌软件技术有限公司 | Pedestrian searching method and device based on joint judgment and generation learning |
| CN110647816B (en) * | 2019-08-26 | 2022-11-22 | 合肥工业大学 | Target detection method for real-time monitoring of goods shelf medicines |
| CN110580727B (en) * | 2019-08-27 | 2023-04-18 | 天津大学 | Depth V-shaped dense network imaging method with increased information flow and gradient flow |
| CN110675309A (en) * | 2019-08-28 | 2020-01-10 | 江苏大学 | An Image Style Transfer Method Based on Convolutional Neural Network and VGGNet16 Model |
| CN112446376B (en) * | 2019-09-05 | 2023-08-01 | 中国科学院沈阳自动化研究所 | Intelligent segmentation and compression method for industrial image |
| CN110728186B (en) * | 2019-09-11 | 2023-04-07 | 中国科学院声学研究所南海研究站 | Fire detection method based on multi-network fusion |
| CN110619676B (en) * | 2019-09-18 | 2023-04-18 | 东北大学 | End-to-end three-dimensional face reconstruction method based on neural network |
| CN110619365B (en) * | 2019-09-18 | 2023-09-12 | 苏州经贸职业技术学院 | Method for detecting falling water |
| CN110619309B (en) * | 2019-09-19 | 2023-07-18 | 天地伟业技术有限公司 | Embedded platform face detection method based on octave convolution and YOLOv3 |
| CN110659601B (en) * | 2019-09-19 | 2022-12-02 | 西安电子科技大学 | Dense vehicle detection method for remote sensing images based on deep fully convolutional network based on central points |
| CN110706239B (en) * | 2019-09-26 | 2022-11-11 | 哈尔滨工程大学 | Scene segmentation method fusing full convolution neural network and improved ASPP module |
| CN110717903A (en) * | 2019-09-30 | 2020-01-21 | 天津大学 | Method for detecting crop diseases by using computer vision technology |
| CN110674777A (en) * | 2019-09-30 | 2020-01-10 | 电子科技大学 | An Optical Character Recognition Method in Patent Text Scenario |
| CN110751076B (en) * | 2019-10-09 | 2023-03-28 | 上海应用技术大学 | Vehicle detection method |
| CN110781895B (en) * | 2019-10-10 | 2023-06-20 | 湖北工业大学 | Image semantic segmentation method based on convolutional neural network |
| CN110728238A (en) * | 2019-10-12 | 2020-01-24 | 安徽工程大学 | A Person Redetection Method Based on Fusion Neural Network |
| CN110728640B (en) * | 2019-10-12 | 2023-07-18 | 合肥工业大学 | A Double-Channel Single Image Fine Rain Removal Method |
| CN110796232A (en) * | 2019-10-12 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Attribute prediction model training method, attribute prediction method and electronic equipment |
| CN111008554B (en) * | 2019-10-16 | 2024-02-02 | 合肥湛达智能科技有限公司 | Deep learning-based method for identifying pedestrians without giving away in dynamic traffic zebra stripes |
| CN112668374A (en) * | 2019-10-16 | 2021-04-16 | 北京灵汐科技有限公司 | Image processing method and device, re-recognition network training method and electronic equipment |
| CN110852179B (en) * | 2019-10-17 | 2023-08-25 | 天津大学 | Detection method of suspicious personnel intrusion based on video surveillance platform |
| CN111008632B (en) * | 2019-10-17 | 2023-06-09 | 安徽清新互联信息科技有限公司 | License plate character segmentation method based on deep learning |
| CN111046723B (en) * | 2019-10-17 | 2023-06-02 | 安徽清新互联信息科技有限公司 | Lane line detection method based on deep learning |
| CN110751644B (en) * | 2019-10-23 | 2023-05-09 | 上海应用技术大学 | Method for detection of road surface cracks |
| CN111008562B (en) * | 2019-10-31 | 2023-04-18 | 北京城建设计发展集团股份有限公司 | Human-vehicle target detection method with feature map depth fusion |
| CN110826476A (en) * | 2019-11-02 | 2020-02-21 | 国网浙江省电力有限公司杭州供电公司 | Image detection method and device for identifying target object, electronic equipment and storage medium |
| CN110837837B (en) * | 2019-11-05 | 2023-10-17 | 安徽工业大学 | Vehicle violation detection method based on convolutional neural network |
| CN110826552A (en) * | 2019-11-05 | 2020-02-21 | 华中农业大学 | Grape non-destructive automatic detection device and method based on deep learning |
| CN110826485B (en) * | 2019-11-05 | 2023-04-18 | 中国人民解放军战略支援部队信息工程大学 | Target detection method and system for remote sensing image |
| CN111008567B (en) * | 2019-11-07 | 2023-03-24 | 郑州大学 | Driver behavior identification method |
| CN110852272B (en) * | 2019-11-11 | 2023-03-28 | 上海应用技术大学 | Pedestrian detection method |
| CN111461160B (en) * | 2019-11-11 | 2023-07-14 | 天津津航技术物理研究所 | Infrared imaging seeker target tracking method for preventing cloud and fog interference |
| CN111222402A (en) * | 2019-11-14 | 2020-06-02 | 北京理工大学 | Crowd gathering density analysis method oriented to unmanned aerial vehicle image |
| CN111008994A (en) * | 2019-11-14 | 2020-04-14 | 山东万腾电子科技有限公司 | Moving target real-time detection and tracking system and method based on MPSoC |
| CN111126359B (en) * | 2019-11-15 | 2023-03-28 | 西安电子科技大学 | High-definition image small target detection method based on self-encoder and YOLO algorithm |
| CN111222534B (en) * | 2019-11-15 | 2022-10-11 | 重庆邮电大学 | Single-shot multi-frame detector optimization method based on bidirectional feature fusion and more balanced L1 loss |
| CN110942008B (en) * | 2019-11-21 | 2023-05-12 | 圆通速递有限公司 | A method and system for face-to-face information positioning based on deep learning |
| CN110909797B (en) * | 2019-11-22 | 2023-05-05 | 北京深睿博联科技有限责任公司 | Image detection method and device, equipment and storage medium |
| CN110705540B (en) * | 2019-11-25 | 2024-05-31 | 中国农业科学院农业信息研究所 | Image recognition method and device for pointer-type instrument in veterinary drug production based on RFID and deep learning |
| CN111144209B (en) * | 2019-11-25 | 2024-07-02 | 浙江工商大学 | Monitoring video head detection method based on heterogeneous multi-branch deep convolutional neural network |
| CN111105393B (en) * | 2019-11-25 | 2023-04-18 | 长安大学 | Grape disease and pest identification method and device based on deep learning |
| CN110956115B (en) * | 2019-11-26 | 2023-09-29 | 证通股份有限公司 | Scene recognition method and device |
| CN112949814B (en) * | 2019-11-26 | 2024-04-26 | 联合汽车电子有限公司 | Compression and acceleration method and device of convolutional neural network and embedded device |
| CN111046928B (en) * | 2019-11-27 | 2023-05-23 | 上海交通大学 | Single-stage real-time universal target detector and method with accurate positioning |
| CN111145195B (en) * | 2019-12-03 | 2023-02-24 | 上海海事大学 | A Method for Contour Detection of Portraits in Video Based on Lightweight Deep Neural Networks |
| CN111062278B (en) * | 2019-12-03 | 2023-04-07 | 西安工程大学 | Abnormal behavior identification method based on improved residual error network |
| CN110986949B (en) * | 2019-12-04 | 2023-05-09 | 日照职业技术学院 | A Path Recognition Method Based on Artificial Intelligence Platform |
| CN111027449B (en) * | 2019-12-05 | 2023-05-30 | 光典信息发展有限公司 | Positioning and identifying method for paper archive electronic image archive chapter |
| CN110942144B (en) * | 2019-12-05 | 2023-05-02 | 深圳牛图科技有限公司 | Neural network construction method integrating automatic training, checking and reconstruction |
| CN111178148B (en) * | 2019-12-06 | 2023-06-02 | 天津大学 | A ground target geographic coordinate positioning method based on UAV vision system |
| CN110992238B (en) * | 2019-12-06 | 2023-10-17 | 上海电力大学 | Digital image tampering blind detection method based on dual-channel network |
| CN111008603B (en) * | 2019-12-08 | 2023-04-18 | 中南大学 | Multi-class target rapid detection method for large-scale remote sensing image |
| CN111160115B (en) * | 2019-12-10 | 2023-05-02 | 上海工程技术大学 | A Video Pedestrian Re-Identification Method Based on Siamese Two-Stream 3D Convolutional Neural Network |
| CN111161217B (en) * | 2019-12-10 | 2023-04-18 | 中国民航大学 | Conv-LSTM multi-scale feature fusion-based fuzzy detection method |
| CN111179338B (en) * | 2019-12-10 | 2023-08-04 | 同济大学 | Lightweight target positioning method for mobile power supply receiving end |
| CN111062297B (en) * | 2019-12-11 | 2023-05-23 | 青岛科技大学 | Violent abnormal behavior detection method based on EANN deep learning model |
| CN111079642B (en) * | 2019-12-13 | 2023-11-14 | 国网浙江余姚市供电有限公司 | Line removable monitoring method and device and computer-readable medium |
| CN110956157A (en) * | 2019-12-14 | 2020-04-03 | 深圳先进技术研究院 | Deep learning remote sensing image target detection method and device based on candidate frame selection |
| CN111178178B (en) * | 2019-12-16 | 2023-10-10 | 汇纳科技股份有限公司 | Multi-scale pedestrian re-identification method, system, medium and terminal combined with region distribution |
| CN111091101B (en) * | 2019-12-23 | 2023-06-02 | 中国科学院自动化研究所 | High-precision pedestrian detection method, system and device based on one-step method |
| CN111126310B (en) * | 2019-12-26 | 2023-03-24 | 华侨大学 | Pedestrian gender identification method based on scene migration |
| CN111178251B (en) * | 2019-12-27 | 2023-07-28 | 汇纳科技股份有限公司 | Pedestrian attribute identification method and system, storage medium and terminal |
| CN111161295B (en) * | 2019-12-30 | 2023-11-21 | 神思电子技术股份有限公司 | Dish image background stripping method |
| CN111160274B (en) * | 2019-12-31 | 2023-03-24 | 合肥湛达智能科技有限公司 | Pedestrian detection method based on binaryzation fast RCNN (radar cross-correlation neural network) |
| CN111199212B (en) * | 2020-01-02 | 2023-04-07 | 西安工程大学 | Pedestrian attribute identification method based on attention model |
| CN111209952B (en) * | 2020-01-03 | 2023-05-30 | 西安工业大学 | Underwater target detection method based on improved SSD and migration learning |
| CN111209860B (en) * | 2020-01-06 | 2023-04-07 | 上海海事大学 | Video attendance system and method based on deep learning and reinforcement learning |
| CN111259898B (en) * | 2020-01-08 | 2023-03-24 | 西安电子科技大学 | Crop segmentation method based on unmanned aerial vehicle aerial image |
| CN111259736B (en) * | 2020-01-08 | 2023-04-07 | 上海海事大学 | Real-time pedestrian detection method based on deep learning in complex environment |
| CN111275711B (en) * | 2020-01-08 | 2023-04-07 | 西安电子科技大学 | Real-time image semantic segmentation method based on lightweight convolutional neural network model |
| CN111242010A (en) * | 2020-01-10 | 2020-06-05 | 厦门博海中天信息科技有限公司 | Method for judging and identifying identity of litter worker based on edge AI |
| CN111260658B (en) * | 2020-01-10 | 2023-10-17 | 厦门大学 | A deep reinforcement learning method for image segmentation |
| CN111242839B (en) * | 2020-01-13 | 2023-04-21 | 华南理工大学 | Image scaling and clipping method based on scale level |
| CN111209887B (en) * | 2020-01-15 | 2023-04-07 | 西安电子科技大学 | SSD model optimization method for small target detection |
| CN113128316B (en) * | 2020-01-15 | 2024-08-02 | 北京四维图新科技股份有限公司 | Target detection method and device |
| CN111259800A (en) * | 2020-01-16 | 2020-06-09 | 天津大学 | A detection method of unmanned special vehicle based on neural network |
| CN111222519B (en) * | 2020-01-16 | 2023-03-24 | 西北大学 | Construction method, method and device of hierarchical colored drawing manuscript line extraction model |
| CN111291785B (en) * | 2020-01-16 | 2024-11-19 | 中国平安人寿保险股份有限公司 | Target detection method, device, equipment and storage medium |
| CN111275688B (en) * | 2020-01-19 | 2023-12-12 | 合肥工业大学 | Small target detection method based on context feature fusion screening of attention mechanism |
| CN111275171B (en) * | 2020-01-19 | 2023-07-04 | 合肥工业大学 | A small target detection method based on multi-scale super-resolution reconstruction based on parameter sharing |
| CN111199220B (en) * | 2020-01-21 | 2023-04-28 | 北方民族大学 | A Lightweight Deep Neural Network Method for People Detection and People Counting in Elevators |
| CN111292366B (en) * | 2020-02-17 | 2023-03-10 | 华侨大学 | Visual driving ranging algorithm based on deep learning and edge calculation |
| CN111339871B (en) * | 2020-02-18 | 2022-09-16 | 中国电子科技集团公司第二十八研究所 | A method and device for judging the distribution pattern of target groups based on convolutional neural network |
| CN111291820B (en) * | 2020-02-19 | 2023-05-30 | 中国电子科技集团公司第二十八研究所 | A Target Detection Method Combining Location Information and Classification Information |
| CN111428751B (en) * | 2020-02-24 | 2022-12-23 | 清华大学 | Object detection method based on compressed sensing and convolutional network |
| CN111428567B (en) * | 2020-02-26 | 2024-02-02 | 沈阳大学 | Pedestrian tracking system and method based on affine multitask regression |
| CN111368673B (en) * | 2020-02-26 | 2023-04-07 | 华南理工大学 | Method for quickly extracting human body key points based on neural network |
| CN113324864B (en) * | 2020-02-28 | 2022-09-20 | 南京理工大学 | Pantograph carbon slide plate abrasion detection method based on deep learning target detection |
| CN111339967B (en) * | 2020-02-28 | 2023-04-07 | 长安大学 | Pedestrian detection method based on multi-view graph convolution network |
| CN111339975B (en) * | 2020-03-03 | 2023-04-21 | 华东理工大学 | Object Detection, Recognition and Tracking Method Based on Central Scale Prediction and Siamese Neural Network |
| CN111368726B (en) * | 2020-03-04 | 2023-11-10 | 西安咏圣达电子科技有限公司 | Construction site operation face personnel number statistics method, system, storage medium and device |
| CN111428586B (en) * | 2020-03-09 | 2023-05-16 | 同济大学 | 3D Human Pose Estimation Method Based on Feature Fusion and Sample Enhancement |
| CN111461291B (en) * | 2020-03-13 | 2023-05-12 | 西安科技大学 | Long-distance pipeline inspection method based on YOLOv3 pruning network and deep learning dehazing model |
| CN111429410B (en) * | 2020-03-13 | 2023-09-01 | 杭州电子科技大学 | A system and method for object X-ray image material discrimination based on deep learning |
| CN111460924B (en) * | 2020-03-16 | 2023-04-07 | 上海师范大学 | Gate ticket-evading behavior detection method based on target detection |
| CN111414909B (en) * | 2020-03-16 | 2023-05-12 | 上海富瀚微电子股份有限公司 | Target detection method and device |
| CN111368453B (en) * | 2020-03-17 | 2023-07-07 | 创新奇智(合肥)科技有限公司 | Fabric cutting optimization method based on deep reinforcement learning |
| CN111753625B (en) * | 2020-03-18 | 2024-04-09 | 北京沃东天骏信息技术有限公司 | Pedestrian detection method, device, equipment and medium |
| CN111462132A (en) * | 2020-03-20 | 2020-07-28 | 西北大学 | A method and system for video object segmentation based on deep learning |
| CN111488805B (en) * | 2020-03-24 | 2023-04-25 | 广州大学 | A Video Behavior Recognition Method Based on Salient Feature Extraction |
| CN111563525A (en) * | 2020-03-25 | 2020-08-21 | 北京航空航天大学 | Moving target detection method based on YOLOv3-Tiny |
| CN111414997B (en) * | 2020-03-27 | 2023-06-06 | 中国人民解放军空军工程大学 | A Method for Battlefield Target Recognition Based on Artificial Intelligence |
| CN111310773B (en) * | 2020-03-27 | 2023-03-24 | 西安电子科技大学 | Efficient license plate positioning method of convolutional neural network |
| CN111310861B (en) * | 2020-03-27 | 2023-05-23 | 西安电子科技大学 | A license plate recognition and location method based on deep neural network |
| CN111460980B (en) * | 2020-03-30 | 2023-04-07 | 西安工程大学 | Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion |
| CN111462085B (en) * | 2020-03-31 | 2023-09-19 | 上海大学 | Digital image local filtering forensics method based on convolutional neural network |
| CN111553199A (en) * | 2020-04-07 | 2020-08-18 | 厦门大学 | An automatic detection technology for motor vehicle traffic violations based on computer vision |
| CN111462108B (en) * | 2020-04-13 | 2023-05-02 | 山西新华防化装备研究院有限公司 | Machine learning-based head-face product design ergonomics evaluation operation method |
| CN111597897B (en) * | 2020-04-16 | 2023-10-24 | 浙江工业大学 | High-speed service area parking space recognition method |
| CN111597900B (en) * | 2020-04-16 | 2023-10-24 | 浙江工业大学 | Illegal dog walking identification method |
| CN111523645B (en) * | 2020-04-16 | 2023-04-18 | 北京航天自动控制研究所 | Convolutional neural network design method for improving small-scale target detection and identification performance |
| CN111695403B (en) * | 2020-04-19 | 2024-03-22 | 东风汽车股份有限公司 | Depth perception convolutional neural network-based 2D and 3D image synchronous detection method |
| CN111476314B (en) * | 2020-04-27 | 2023-03-07 | 中国科学院合肥物质科学研究院 | Fuzzy video detection method integrating optical flow algorithm and deep learning |
| CN111563440A (en) * | 2020-04-29 | 2020-08-21 | 上海海事大学 | A target detection method based on heterogeneous convolution with multi-kernel iterative RPN |
| CN111652846B (en) * | 2020-04-30 | 2022-08-16 | 成都数之联科技股份有限公司 | Semiconductor defect identification method based on characteristic pyramid convolution neural network |
| CN111597939B (en) * | 2020-05-07 | 2023-04-18 | 西安电子科技大学 | High-speed rail line nest defect detection method based on deep learning |
| CN111783685B (en) * | 2020-05-08 | 2025-01-21 | 西安建筑科技大学 | An improved target detection algorithm based on a single-stage network model |
| CN111582452B (en) * | 2020-05-09 | 2023-10-27 | 北京百度网讯科技有限公司 | Methods and devices for generating neural network models |
| CN111783934B (en) * | 2020-05-15 | 2024-06-21 | 北京迈格威科技有限公司 | Convolutional neural network construction method, device, equipment and medium |
| CN111783523B (en) * | 2020-05-19 | 2022-10-21 | 中国人民解放军93114部队 | A method for detecting rotating objects in remote sensing images |
| CN111709449B (en) * | 2020-05-20 | 2023-08-18 | 西安理工大学 | Multi-layer feature fusion small-scale target detection method based on clustering algorithm |
| CN112001878A (en) * | 2020-05-21 | 2020-11-27 | 合肥合工安驰智能科技有限公司 | Deep learning ore scale measuring method based on binarization neural network and application system |
| CN111881714B (en) * | 2020-05-22 | 2023-11-21 | 北京交通大学 | An unsupervised cross-domain person re-identification method |
| CN113793292B (en) * | 2020-05-25 | 2025-09-16 | 阿里巴巴集团控股有限公司 | Data processing method, device, electronic equipment and storage medium |
| CN111626196B (en) * | 2020-05-27 | 2023-05-16 | 西南石油大学 | Knowledge-graph-based intelligent analysis method for body structure of typical bovine animal |
| CN111709311B (en) * | 2020-05-27 | 2023-11-28 | 西安理工大学 | Pedestrian re-identification method based on multi-scale convolution feature fusion |
| CN111832608B (en) * | 2020-05-29 | 2023-09-12 | 上海海事大学 | A multi-wear particle identification method in ferrogram images based on the single-stage detection model yolov3 |
| CN111652216B (en) * | 2020-06-03 | 2023-04-07 | 北京工商大学 | Multi-scale target detection model method based on metric learning |
| CN111652930B (en) * | 2020-06-04 | 2024-02-27 | 上海媒智科技有限公司 | Image target detection method, system and equipment |
| CN111709336B (en) * | 2020-06-08 | 2024-04-26 | 杭州像素元科技有限公司 | Expressway pedestrian detection method, equipment and readable storage medium |
| CN111881932B (en) * | 2020-06-11 | 2023-09-15 | 中国人民解放军战略支援部队信息工程大学 | A FasterRCNN target detection algorithm for military aircraft |
| CN111860587B (en) * | 2020-06-12 | 2024-02-02 | 长安大学 | Detection method for small targets of pictures |
| CN111738124B (en) * | 2020-06-15 | 2023-08-22 | 西安电子科技大学 | Cloud Detection Method of Remote Sensing Image Based on Gabor Transform and Attention |
| CN111709935B (en) * | 2020-06-17 | 2023-04-07 | 西安科技大学 | Real-time coal gangue positioning and identifying method for ground moving belt |
| CN113807152A (en) * | 2020-06-17 | 2021-12-17 | 阿里巴巴集团控股有限公司 | Image processing and model training method and device |
| CN111797836B (en) * | 2020-06-18 | 2024-04-26 | 中国空间技术研究院 | A deep learning-based obstacle segmentation method for extraterrestrial rover |
| CN111723743A (en) * | 2020-06-19 | 2020-09-29 | 北京邮电大学 | A Fast Pedestrian Detection Method at Small Scale |
| CN111832630A (en) * | 2020-06-23 | 2020-10-27 | 成都恒创新星科技有限公司 | Target detection method based on first-order gradient neural network |
| CN111784652B (en) * | 2020-06-24 | 2024-02-06 | 西安电子科技大学 | MRI (magnetic resonance imaging) segmentation method based on reinforcement learning multi-scale neural network |
| CN111814621B (en) * | 2020-06-29 | 2024-01-23 | 中国科学院合肥物质科学研究院 | Attention mechanism-based multi-scale vehicle pedestrian detection method and device |
| CN111767847B (en) * | 2020-06-29 | 2023-06-09 | 佛山市南海区广工大数控装备协同创新研究院 | A Pedestrian Multi-Target Tracking Method Integrating Object Detection and Association |
| CN111832450B (en) * | 2020-06-30 | 2023-11-28 | 成都睿沿科技有限公司 | Knife holding detection method based on image recognition |
| CN111767878B (en) * | 2020-07-03 | 2022-11-08 | 中国科学院自动化研究所 | Deep learning-based traffic sign detection method and system in embedded device |
| CN112199983B (en) * | 2020-07-08 | 2024-06-18 | 北京航空航天大学 | Long-time large-range pedestrian re-identification method based on multi-level screening |
| CN111986145B (en) * | 2020-07-09 | 2024-06-21 | 浙江工业大学 | A bearing roller defect detection method based on Faster-RCNN |
| CN111860265B (en) * | 2020-07-10 | 2024-01-05 | 武汉理工大学 | Multi-detection-frame loss balanced road scene understanding algorithm based on sample loss |
| CN111667030B (en) * | 2020-07-13 | 2023-04-07 | 华东理工大学 | Method, system and storage medium for realizing remote sensing image target detection based on deep neural network |
| CN111832479B (en) * | 2020-07-14 | 2023-08-01 | 西安电子科技大学 | Video target detection method based on improved self-adaptive anchor point R-CNN |
| CN111986149A (en) * | 2020-07-16 | 2020-11-24 | 江西斯源科技有限公司 | A method for detecting plant diseases and insect pests based on convolutional neural network |
| CN111986126B (en) * | 2020-07-17 | 2022-05-24 | 浙江工业大学 | Multi-target detection method based on improved VGG16 network |
| CN111860637B (en) * | 2020-07-17 | 2023-11-21 | 河南科技大学 | A single-shot multi-frame infrared target detection method |
| CN111832513B (en) * | 2020-07-21 | 2024-02-09 | 西安电子科技大学 | Real-time football target detection method based on neural network |
| CN111881803B (en) * | 2020-07-22 | 2023-10-31 | 安徽农业大学 | An animal face recognition method based on improved YOLOv3 |
| CN112001259A (en) * | 2020-07-28 | 2020-11-27 | 联芯智能(南京)科技有限公司 | Aerial weak human body target intelligent detection method based on visible light image |
| CN112036437B (en) * | 2020-07-28 | 2024-06-07 | 农业农村部南京农业机械化研究所 | Rice seedling detection model based on improved YOLOV network and method thereof |
| CN111915583B (en) * | 2020-07-29 | 2024-02-09 | 西安电子科技大学 | Vehicle and pedestrian detection method based on vehicle-mounted thermal infrared imager in complex scene |
| CN111985365A (en) * | 2020-08-06 | 2020-11-24 | 合肥学院 | A straw burning monitoring method and system based on target detection technology |
| CN112115291B (en) * | 2020-08-12 | 2024-02-27 | 南京止善智能科技研究院有限公司 | Three-dimensional indoor model retrieval method based on deep learning |
| CN111985464B (en) * | 2020-08-13 | 2023-08-22 | 山东大学 | Court judgment document-oriented multi-scale learning text recognition method and system |
| CN111986172B (en) * | 2020-08-18 | 2024-06-04 | 华北电力科学研究院有限责任公司 | Infrared image fault detection method and device for power equipment |
| CN111984879A (en) * | 2020-08-19 | 2020-11-24 | 交控科技股份有限公司 | User guidance method, device, device and storage medium applied to trains |
| CN112001385B (en) * | 2020-08-20 | 2024-02-06 | 长安大学 | A target cross-domain detection and understanding method, system, equipment and storage medium |
| CN111985473A (en) * | 2020-08-20 | 2020-11-24 | 中再云图技术有限公司 | A method of out-of-store business identification |
| CN111986186B (en) * | 2020-08-25 | 2024-03-22 | 华中科技大学 | Quantitative in-furnace PCB patch defect online detection system and method |
| CN112001339B (en) * | 2020-08-27 | 2024-02-23 | 杭州电子科技大学 | Pedestrian social distance real-time monitoring method based on YOLO v4 |
| CN112364974B (en) * | 2020-08-28 | 2024-02-09 | 西安电子科技大学 | YOLOv3 algorithm based on activation function improvement |
| CN112149664B (en) * | 2020-09-04 | 2024-05-07 | 浙江工业大学 | Target detection method for optimizing classification and positioning tasks |
| CN112101434B (en) * | 2020-09-04 | 2022-09-09 | 河南大学 | Infrared image weak and small target detection method based on improved YOLO v3 |
| CN112464765B (en) * | 2020-09-10 | 2022-09-23 | 天津师范大学 | A safety helmet detection method based on single-pixel feature amplification and its application |
| CN112101455B (en) * | 2020-09-15 | 2022-08-09 | 重庆市农业科学院 | Tea lesser leafhopper identification and counting method based on convolutional neural network |
| CN112347843B (en) * | 2020-09-18 | 2024-10-18 | 深圳数联天下智能科技有限公司 | Method and related device for training wrinkle detection model |
| CN112163492B (en) * | 2020-09-21 | 2023-09-08 | 华南理工大学 | A long-term cross-scenario optimized traffic object detection method, system and medium |
| CN112115885B (en) * | 2020-09-22 | 2023-08-11 | 中国农业科学院农业信息研究所 | A method for locating cutting points of fruit tree fruit branches for picking based on deep convolutional neural network |
| CN112215100B (en) * | 2020-09-27 | 2024-02-09 | 浙江工业大学 | Target detection method for degraded image under unbalanced training sample |
| CN112085126B (en) * | 2020-09-30 | 2023-12-12 | 浙江大学 | Single sample target detection method focusing on classification task |
| CN112347851B (en) * | 2020-09-30 | 2023-02-21 | 山东理工大学 | Construction method of multi-target detection network, multi-target detection method and device |
| CN112200045B (en) * | 2020-09-30 | 2024-03-19 | 华中科技大学 | Method and application of remote sensing image target detection model based on context enhancement |
| CN112183430B (en) * | 2020-10-12 | 2024-04-05 | 河北工业大学 | Sign language recognition method and device based on dual neural network |
| CN112232411B (en) * | 2020-10-15 | 2024-05-14 | 苏州凌图科技有限公司 | HarDNet-Lite optimization method in embedded platform |
| CN112257796B (en) * | 2020-10-28 | 2024-06-28 | 辽宁工程技术大学 | An image integration method based on convolutional neural network with selective feature connection |
| CN112419237B (en) * | 2020-11-03 | 2023-06-30 | 中国计量大学 | A method for surface defect detection of automobile clutch master cylinder groove based on deep learning |
| CN112381792B (en) * | 2020-11-13 | 2023-05-23 | 中国人民解放军空军工程大学 | Intelligent imaging on-line detection method for radar wave-absorbing coating/electromagnetic shielding film damage based on deep learning |
| CN112446308B (en) * | 2020-11-16 | 2024-09-13 | 北京科技大学 | Pedestrian detection method based on semantic enhancement multi-scale feature pyramid fusion |
| CN112396000B (en) * | 2020-11-19 | 2023-09-05 | 中山大学 | Method for constructing multi-mode dense prediction depth information transmission model |
| CN112308062B (en) * | 2020-11-23 | 2022-08-23 | 浙江卡易智慧医疗科技有限公司 | Medical image access number identification method in complex background image |
| CN112434828B (en) * | 2020-11-23 | 2023-05-16 | 南京富岛软件有限公司 | Intelligent safety protection identification method in 5T operation and maintenance |
| CN112580778A (en) * | 2020-11-25 | 2021-03-30 | 江苏集萃未来城市应用技术研究所有限公司 | Job worker mobile phone use detection method based on YOLOv5 and Pose-animation |
| CN112348036B (en) * | 2020-11-26 | 2025-01-14 | 北京工业大学 | Adaptive object detection method based on lightweight residual learning and deconvolution cascade |
| CN112487979B (en) * | 2020-11-30 | 2023-08-04 | 北京百度网讯科技有限公司 | Target detection method, model training method, device, electronic equipment and medium |
| CN112528826B (en) * | 2020-12-04 | 2024-02-02 | 江苏省农业科学院 | Control method of picking device based on 3D visual perception |
| CN112560627A (en) * | 2020-12-09 | 2021-03-26 | 江苏集萃未来城市应用技术研究所有限公司 | Real-time detection method for abnormal behaviors of construction site personnel based on neural network |
| CN112633086B (en) * | 2020-12-09 | 2024-01-26 | 西安电子科技大学 | Near-infrared pedestrian monitoring method, system, medium and equipment based on multitasking EfficientDet |
| CN112396036B (en) * | 2020-12-09 | 2023-08-08 | 中山大学 | An Occluded Person Re-Identification Method Combining Spatial Transformation Network and Multi-Scale Feature Extraction |
| CN112770325B (en) * | 2020-12-09 | 2022-12-16 | 华南理工大学 | Cognitive internet of vehicles spectrum sensing method based on deep learning |
| CN112382388A (en) * | 2020-12-14 | 2021-02-19 | 中南大学 | Early warning method for adverse pressure sore event |
| CN112560682A (en) * | 2020-12-16 | 2021-03-26 | 重庆守愚科技有限公司 | Valve automatic detection method based on deep learning |
| CN112465815B (en) * | 2020-12-17 | 2023-09-19 | 杭州电子科技大学 | A remote sensing target saliency detection method based on edge-subject fusion information |
| CN112634367A (en) * | 2020-12-25 | 2021-04-09 | 天津大学 | Anti-occlusion object pose estimation method based on deep neural network |
| CN112651441B (en) * | 2020-12-25 | 2022-08-16 | 深圳市信义科技有限公司 | Fine-grained non-motor vehicle feature detection method, storage medium and computer equipment |
| CN112699808B (en) * | 2020-12-31 | 2024-06-07 | 深圳市华尊科技股份有限公司 | Dense target detection method, electronic equipment and related products |
| CN112733848B (en) * | 2021-01-08 | 2022-11-04 | 中国电子科技集团公司第二十八研究所 | Object detection method based on multi-scale features and dilated inverse residual full connection |
| CN112733714B (en) * | 2021-01-11 | 2024-03-01 | 北京大学 | VGG network-based automatic crowd counting image recognition method |
| CN112784921A (en) * | 2021-02-02 | 2021-05-11 | 西北工业大学 | Task attention guided small sample image complementary learning classification algorithm |
| CN112556682B (en) * | 2021-02-07 | 2023-06-23 | 天津蓝鳍海洋工程有限公司 | Automatic detection algorithm for underwater composite sensor target |
| CN112700444B (en) * | 2021-02-19 | 2023-06-23 | 中国铁道科学研究院集团有限公司铁道建筑研究所 | Bridge bolt detection method based on self-attention and center point regression model |
| CN112862796B (en) * | 2021-02-23 | 2024-11-29 | 中国农业机械化科学研究院 | Multi-category external quality detection method and detection device for kernel and fruit |
| CN112949508B (en) * | 2021-03-08 | 2024-07-19 | 咪咕文化科技有限公司 | Model training method, pedestrian detection method, electronic device and readable storage medium |
| CN112906718B (en) * | 2021-03-09 | 2023-08-22 | 西安电子科技大学 | A multi-target detection method based on convolutional neural network |
| CN113012208B (en) * | 2021-03-22 | 2024-05-17 | 上海应用技术大学 | Multi-view remote sensing image registration method and system |
| CN112906658B (en) * | 2021-03-30 | 2025-01-10 | 航天时代飞鸿技术有限公司 | A lightweight automatic detection method for UAV reconnaissance of ground targets |
| CN113312961A (en) * | 2021-04-03 | 2021-08-27 | 国家计算机网络与信息安全管理中心 | Logo recognition acceleration method |
| CN113221957B (en) * | 2021-04-17 | 2024-04-16 | 南京航空航天大学 | A radar information fusion feature enhancement method based on Centernet |
| CN113112511B (en) * | 2021-04-19 | 2024-01-05 | 新东方教育科技集团有限公司 | Method and device for correcting test paper, storage medium and electronic equipment |
| CN113076957A (en) * | 2021-04-21 | 2021-07-06 | 河南大学 | RGB-D image saliency target detection method based on cross-modal feature fusion |
| CN113011398A (en) * | 2021-04-28 | 2021-06-22 | 北京邮电大学 | Target change detection method and device for multi-temporal remote sensing image |
| CN113177545B (en) * | 2021-04-29 | 2023-08-04 | 北京百度网讯科技有限公司 | Target object detection method, target object detection device, electronic equipment and storage medium |
| CN113158968A (en) * | 2021-05-10 | 2021-07-23 | 苏州大学 | Embedded object cognitive system based on image processing |
| CN113408340B (en) * | 2021-05-12 | 2024-03-29 | 北京化工大学 | Dual polarization SAR small ship detection method based on enhanced feature pyramid |
| CN113221787B (en) * | 2021-05-18 | 2023-09-29 | 西安电子科技大学 | Pedestrian multi-target tracking method based on multi-element difference fusion |
| CN113312995B (en) * | 2021-05-18 | 2023-02-14 | 华南理工大学 | Anchor-free vehicle-mounted pedestrian detection method based on central axis |
| CN113297961B (en) * | 2021-05-24 | 2023-11-17 | 南京邮电大学 | A target tracking method based on boundary feature fusion twin recurrent neural networks |
| CN113222064A (en) * | 2021-05-31 | 2021-08-06 | 苏州晗林信息技术发展有限公司 | Image target object real-time detection method, system, terminal and storage medium |
| CN113343853B (en) * | 2021-06-08 | 2024-06-14 | 深圳格瑞健康科技有限公司 | Intelligent screening method and device for dental caries of children |
| CN113379709B (en) * | 2021-06-16 | 2024-03-08 | 浙江工业大学 | Three-dimensional target detection method based on sparse multi-scale voxel feature fusion |
| CN113449634A (en) * | 2021-06-28 | 2021-09-28 | 上海翰声信息技术有限公司 | Video detection method and device for processing under strong light environment |
| CN113379718B (en) * | 2021-06-28 | 2024-02-02 | 北京百度网讯科技有限公司 | A target detection method, device, electronic equipment and readable storage medium |
| CN113469254B (en) * | 2021-07-02 | 2024-04-16 | 上海应用技术大学 | Target detection method and system based on target detection model |
| CN113627257B (en) * | 2021-07-09 | 2024-09-10 | 上海智臻智能网络科技股份有限公司 | Detection method, detection system, device and storage medium |
| CN113449743B (en) * | 2021-07-12 | 2022-12-09 | 西安科技大学 | Coal dust particle feature extraction method |
| CN113642410B (en) * | 2021-07-15 | 2024-03-29 | 南京航空航天大学 | Method for detecting ampullaria gigas eggs based on multi-scale feature fusion and dynamic convolution |
| CN113361491A (en) * | 2021-07-19 | 2021-09-07 | 厦门大学 | Method for predicting pedestrian crossing intention of unmanned automobile |
| CN113657174A (en) * | 2021-07-21 | 2021-11-16 | 北京中科慧眼科技有限公司 | Vehicle pseudo-3D information detection method and device and automatic driving system |
| CN113487600B (en) * | 2021-07-27 | 2024-05-03 | 大连海事大学 | Feature enhancement scale self-adaptive perception ship detection method |
| CN113592825A (en) * | 2021-08-02 | 2021-11-02 | 安徽理工大学 | YOLO algorithm-based real-time coal gangue detection method |
| CN113591735A (en) * | 2021-08-04 | 2021-11-02 | 上海新纪元机器人有限公司 | Pedestrian detection method and system based on deep learning |
| CN113591854B (en) * | 2021-08-12 | 2023-09-26 | 中国海洋大学 | A low-redundancy and fast reconstruction method for plankton holograms |
| CN113805151B (en) * | 2021-08-17 | 2024-09-10 | 青岛本原微电子有限公司 | Medium-frequency radar target detection method based on attention mechanism |
| US12462575B2 (en) | 2021-08-19 | 2025-11-04 | Tesla, Inc. | Vision-based machine learning model for autonomous driving with adjustable virtual camera |
| CN113869361A (en) * | 2021-08-20 | 2021-12-31 | 深延科技(北京)有限公司 | Model training method, target detection method and related device |
| CN113706491B (en) * | 2021-08-20 | 2024-02-13 | 西安电子科技大学 | Meniscal injury grading method based on hybrid attention weakly supervised transfer learning |
| CN113989630B (en) * | 2021-08-31 | 2024-04-23 | 中通服公众信息产业股份有限公司 | Lens shielding judging method based on semantic analysis |
| CN113822185B (en) * | 2021-09-09 | 2024-10-29 | 安徽农业大学 | Method for detecting daily behaviors of group-raised pigs |
| CN113887330A (en) * | 2021-09-10 | 2022-01-04 | 国网吉林省电力有限公司 | Target detection system based on remote sensing image |
| CN113962933A (en) * | 2021-09-15 | 2022-01-21 | 上海大学 | PCB defect image detection method based on improved YOLOv3 |
| CN113780193B (en) * | 2021-09-15 | 2024-09-24 | 易采天成(郑州)信息技术有限公司 | RCNN-based cattle group target detection method and RCNN-based cattle group target detection equipment |
| CN113887341B (en) * | 2021-09-16 | 2025-04-29 | 同济大学 | A method for human skeleton action recognition based on parallel convolutional neural network |
| CN113807243B (en) * | 2021-09-16 | 2023-12-05 | 上海交通大学 | Water obstacle detection system and method based on attention to unknown target |
| CN113762209B (en) * | 2021-09-22 | 2025-03-18 | 重庆邮电大学 | A multi-scale parallel feature fusion landmark detection method based on YOLO |
| CN114067186B (en) * | 2021-09-26 | 2024-04-16 | 北京建筑大学 | Pedestrian detection method and device, electronic equipment and storage medium |
| CN113902024B (en) * | 2021-10-20 | 2024-06-04 | 浙江大立科技股份有限公司 | Small-volume target detection and identification method based on deep learning and dual-band fusion |
| CN114091518A (en) * | 2021-10-21 | 2022-02-25 | 安徽深核信息技术有限公司 | A Multi-stage Weak Object Image Detection Method Using Combined Features |
| CN113901944B (en) * | 2021-10-25 | 2024-04-09 | 大连理工大学 | Marine organism target detection method based on improved YOLO algorithm |
| CN114067228B (en) * | 2021-10-26 | 2024-11-29 | 神思电子技术股份有限公司 | Target detection method and system for enhancing foreground and background distinction |
| CN115082909B (en) * | 2021-11-03 | 2024-04-12 | 中国人民解放军陆军军医大学第一附属医院 | Method and system for identifying lung lesions |
| CN113989518B (en) * | 2021-11-08 | 2024-11-15 | 中国科学院合肥物质科学研究院 | A lightweight target detection method for intelligent terminals |
| CN114170531B (en) * | 2021-11-23 | 2024-08-09 | 北京航天自动控制研究所 | Infrared image target detection method and device based on difficult sample transfer learning |
| CN114170633B (en) * | 2021-12-06 | 2024-08-02 | 南开大学 | Sea surface small pedestrian detection method based on collaborative supervision |
| CN114241220B (en) * | 2021-12-10 | 2025-08-15 | 北京科技大学 | Unmanned aerial vehicle ground object detection method |
| CN114399644B (en) * | 2021-12-15 | 2025-02-07 | 北京邮电大学 | Small sample target detection method and device |
| CN114419663A (en) * | 2021-12-17 | 2022-04-29 | 江西洪都航空工业集团有限责任公司 | Detection method for mask worn by human body |
| CN114359644B (en) * | 2021-12-22 | 2024-04-16 | 华南农业大学 | Crop pest and disease recognition method based on improved VGG-16 network |
| CN114283320B (en) * | 2021-12-25 | 2024-06-14 | 福州大学 | Branch-free structure target detection method based on full convolution |
| CN114332008B (en) * | 2021-12-28 | 2024-06-28 | 福州大学 | Unsupervised defect detection and positioning method based on multi-level feature reconstruction |
| CN114332029A (en) * | 2021-12-30 | 2022-04-12 | 上海华力微电子有限公司 | Method and device for classifying abnormal crystal back images by using neural network model |
| CN114372971A (en) * | 2022-01-10 | 2022-04-19 | 重庆邮电大学 | Rapid three-dimensional target detection algorithm based on binocular vision |
| CN114359838B (en) * | 2022-01-14 | 2025-03-28 | 北京理工大学重庆创新中心 | A cross-modal pedestrian detection method based on Gaussian cross-attention network |
| CN114495166A (en) * | 2022-01-17 | 2022-05-13 | 北京小龙潜行科技有限公司 | A kind of pasture changing shoe action recognition method applied to edge computing equipment |
| CN114419589B (en) * | 2022-01-17 | 2025-04-25 | 东南大学 | A road object detection method based on attention feature enhancement module |
| CN114565753B (en) * | 2022-02-22 | 2025-04-15 | 电子科技大学长三角研究院(湖州) | A method for identifying small targets of drones based on improved YOLOv4 network |
| CN114612769B (en) * | 2022-03-14 | 2023-05-26 | 电子科技大学 | An Integrated Perception Infrared Imaging Ship Detection Method Incorporating Local Structure Information |
| CN114638971B (en) * | 2022-03-21 | 2025-01-07 | 天津大学 | Object detection method based on adaptive fusion of multi-level local and global features |
| CN114884775A (en) * | 2022-03-31 | 2022-08-09 | 南京邮电大学 | Deep learning-based large-scale MIMO system channel estimation method |
| CN114863097B (en) * | 2022-04-06 | 2024-05-31 | 北京航空航天大学 | A method for infrared dim small target detection based on attention mechanism convolutional neural network |
| CN114821477B (en) * | 2022-05-05 | 2025-06-27 | 南京大学 | A video recognition method for multi-person clothing features suitable for complex scenes |
| CN115019036B (en) * | 2022-05-10 | 2024-02-27 | 西北工业大学 | A small-sample semantic segmentation method for learning non-target knowledge |
| CN114998616A (en) * | 2022-05-13 | 2022-09-02 | 南京林业大学 | Multi-scale target number statistics network based on online mask perception |
| CN114943986B (en) * | 2022-05-31 | 2024-09-27 | 武汉理工大学 | Regional pedestrian detection lighting method and system based on camera image segmentation |
| CN115082386B (en) * | 2022-06-07 | 2024-04-26 | 华南理工大学 | Injection molded parts defect detection method, device and medium based on normal sample auxiliary feature extraction |
| CN115272175B (en) * | 2022-06-16 | 2025-12-02 | 上海互觉科技有限公司 | A Surface Defect Detection Method and System Based on Multi-Light Source Collaboration |
| CN115063701B (en) * | 2022-06-21 | 2025-06-13 | 南京理工大学 | A small target detection method for UAV aerial photography based on improved YOLOv4 |
| CN115147711B (en) * | 2022-07-23 | 2024-07-16 | 河南大学 | Underwater target detection network and method based on improvement RETINANET |
| CN115205686B (en) * | 2022-07-26 | 2025-10-24 | 中山大学·深圳 | A SAR image aircraft detection method and device |
| CN115512326B (en) * | 2022-10-18 | 2025-07-25 | 上海寻序人工智能科技有限公司 | BEV visual perception method based on multiple cameras |
| CN115601790A (en) * | 2022-10-26 | 2023-01-13 | 功夫链(上海)体育文化发展有限公司(Cn) | A method and system for close-range human detection based on super large convolution kernel |
| CN115546784B (en) * | 2022-10-27 | 2025-12-19 | 奥特酷智能科技(南京)有限公司 | 3D target detection method based on deep learning |
| CN115423810B (en) * | 2022-11-04 | 2023-03-14 | 国网江西省电力有限公司电力科学研究院 | A Method for Analyzing the Icing Form of Wind Turbine Blades |
| CN116403151A (en) * | 2022-12-20 | 2023-07-07 | 南京工业大学 | Subway people stream density estimation method based on self-adaptive deep neural network |
| CN116468928B (en) * | 2022-12-29 | 2023-12-19 | 长春理工大学 | Thermal infrared small target detection method based on visual perception correlator |
| CN117079342A (en) * | 2023-02-27 | 2023-11-17 | 盛视科技股份有限公司 | Man-box interaction detection method and system based on two-stage target detection |
| CN117197687B (en) * | 2023-03-13 | 2025-11-18 | 西南科技大学 | A detection method for dense small targets in UAV aerial photography |
| CN116524517B (en) * | 2023-03-31 | 2025-09-30 | 西安电子科技大学 | Electricity meter image target recognition method based on deep learning |
| CN116524293B (en) * | 2023-04-10 | 2024-01-30 | 哈尔滨市科佳通用机电股份有限公司 | Brake adjuster pull rod head loss fault identification method and system based on deep learning |
| CN116433979A (en) * | 2023-04-18 | 2023-07-14 | 安徽理工大学 | Tunnel anomaly detection method based on small feature-aware pyramid network |
| CN116563734B (en) * | 2023-05-12 | 2025-11-21 | 贵州理工学院 | Real-time target detection method and system for unmanned aerial vehicle platform |
| CN117237614B (en) * | 2023-11-10 | 2024-02-06 | 江西啄木蜂科技有限公司 | Deep learning-based lake surface floater small target detection method |
| CN118379848B (en) * | 2024-06-21 | 2024-08-20 | 珠海华熠电子有限公司 | Personnel safety monitoring system and method based on binocular camera |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107341517B (en) * | 2017-07-07 | 2020-08-11 | 哈尔滨工业大学 | Multi-scale small object detection method based on deep learning inter-level feature fusion |
| CN107563349A (en) * | 2017-09-21 | 2018-01-09 | 电子科技大学 | A kind of Population size estimation method based on VGGNet |
2018
- 2018-01-24: CN application CN201810069322.XA granted as CN108399362B (en), status: Active
- 2018-07-10: WO application PCT/CN2018/095058 published as WO2019144575A1 (en), status: Ceased
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105787439A (en) * | 2016-02-04 | 2016-07-20 | 广州新节奏智能科技有限公司 | Depth image human body joint positioning method based on convolution nerve network |
| CN105956608A (en) * | 2016-04-21 | 2016-09-21 | 恩泊泰(天津)科技有限公司 | Objective positioning and classifying algorithm based on deep learning |
| CN106934346A (en) * | 2017-01-24 | 2017-07-07 | 北京大学 | A kind of method of target detection performance optimization |
Cited By (53)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109089040A (en) * | 2018-08-20 | 2018-12-25 | Oppo广东移动通信有限公司 | Image processing method, image processing device and terminal equipment |
| CN109089040B (en) * | 2018-08-20 | 2021-05-14 | Oppo广东移动通信有限公司 | Image processing method, image processing device and terminal device |
| CN109409364A (en) * | 2018-10-16 | 2019-03-01 | 北京百度网讯科技有限公司 | Image annotation method and device |
| CN109508675A (en) * | 2018-11-14 | 2019-03-22 | 广州广电银通金融电子科技有限公司 | A kind of pedestrian detection method for complex scene |
| CN109522855A (en) * | 2018-11-23 | 2019-03-26 | 广州广电银通金融电子科技有限公司 | In conjunction with low resolution pedestrian detection method, system and the storage medium of ResNet and SENet |
| CN109522966A (en) * | 2018-11-28 | 2019-03-26 | 中山大学 | A kind of object detection method based on intensive connection convolutional neural networks |
| CN109522966B (en) * | 2018-11-28 | 2022-09-27 | 中山大学 | A target detection method based on densely connected convolutional neural network |
| CN109670439A (en) * | 2018-12-14 | 2019-04-23 | 中国石油大学(华东) | A kind of pedestrian and its location detection method end to end |
| CN109685718A (en) * | 2018-12-17 | 2019-04-26 | 中国科学院自动化研究所 | Picture quadrate Zoom method, system and device |
| CN109886066B (en) * | 2018-12-17 | 2023-05-09 | 南京理工大学 | Rapid target detection method based on multi-scale and multi-layer feature fusion |
| CN109886066A (en) * | 2018-12-17 | 2019-06-14 | 南京理工大学 | Fast target detection method based on the fusion of multiple dimensioned and multilayer feature |
| CN109902800A (en) * | 2019-01-22 | 2019-06-18 | 北京大学 | A method for detecting general objects based on a multi-level backbone network based on quasi-feedback neural network |
| CN109902800B (en) * | 2019-01-22 | 2020-11-27 | 北京大学 | A method for detecting general objects based on a multi-level backbone network based on quasi-feedback neural network |
| CN111523351A (en) * | 2019-02-02 | 2020-08-11 | 北京地平线机器人技术研发有限公司 | Neural network training method and device and electronic equipment |
| CN111247790A (en) * | 2019-02-21 | 2020-06-05 | 深圳市大疆创新科技有限公司 | Image processing method and device, image shooting and processing system and carrier |
| US11741581B2 (en) | 2019-04-01 | 2023-08-29 | Tencent Technology (Shenzhen) Company Limited | Training method for image processing model, image processing method, network device, and storage medium |
| CN109993712A (en) * | 2019-04-01 | 2019-07-09 | 腾讯科技(深圳)有限公司 | Image processing model training method, image processing method and related equipment |
| CN110110783A (en) * | 2019-04-30 | 2019-08-09 | 天津大学 | A kind of deep learning object detection method based on the connection of multilayer feature figure |
| CN110110793A (en) * | 2019-05-10 | 2019-08-09 | 中山大学 | Binocular image fast target detection method based on double-current convolutional neural networks |
| CN118196828A (en) * | 2019-06-06 | 2024-06-14 | 华为技术有限公司 | Object recognition method and device |
| CN110580726A (en) * | 2019-08-21 | 2019-12-17 | 中山大学 | Face sketch generation model and method in natural scene based on dynamic convolutional network |
| CN110580726B (en) * | 2019-08-21 | 2022-10-04 | 中山大学 | Dynamic convolution network-based face sketch generation model and method in natural scene |
| CN110909615A (en) * | 2019-10-28 | 2020-03-24 | 西安交通大学 | Target detection method based on multi-scale input mixed perception neural network |
| CN110909615B (en) * | 2019-10-28 | 2023-03-28 | 西安交通大学 | Target detection method based on multi-scale input mixed perception neural network |
| CN111144203B (en) * | 2019-11-19 | 2023-06-16 | 浙江工商大学 | Pedestrian shielding detection method based on deep learning |
| WO2021129105A1 (en) * | 2019-12-27 | 2021-07-01 | 歌尔股份有限公司 | Mask rcnn network model-based target identification method and apparatus |
| US11688163B2 (en) | 2019-12-27 | 2023-06-27 | Goertek Inc. | Target recognition method and device based on MASK RCNN network model |
| CN111160527A (en) * | 2019-12-27 | 2020-05-15 | 歌尔股份有限公司 | A target recognition method and device based on MASK RCNN network model |
| CN111176820A (en) * | 2019-12-31 | 2020-05-19 | 中科院计算技术研究所大数据研究院 | Deep neural network-based edge computing task allocation method and device |
| CN111242127B (en) * | 2020-01-15 | 2023-02-24 | 上海应用技术大学 | Vehicle detection method with granularity level multi-scale characteristic based on asymmetric convolution |
| CN111242127A (en) * | 2020-01-15 | 2020-06-05 | 上海应用技术大学 | Vehicle detection method with granularity level multi-scale characteristics based on asymmetric convolution |
| CN111277751A (en) * | 2020-01-22 | 2020-06-12 | Oppo广东移动通信有限公司 | Photographing method, device, storage medium and electronic device |
| CN111277751B (en) * | 2020-01-22 | 2021-06-15 | Oppo广东移动通信有限公司 | Photographing method and device, storage medium and electronic equipment |
| CN111597945A (en) * | 2020-05-11 | 2020-08-28 | 济南博观智能科技有限公司 | Target detection method, device, equipment and medium |
| CN111597945B (en) * | 2020-05-11 | 2023-08-18 | 济南博观智能科技有限公司 | Target detection method, device, equipment and medium |
| CN111598951B (en) * | 2020-05-18 | 2022-09-30 | 清华大学 | Method, device and storage medium for identifying space target |
| CN111598951A (en) * | 2020-05-18 | 2020-08-28 | 清华大学 | A method, device and storage medium for identifying a space target |
| CN111709313B (en) * | 2020-05-27 | 2022-07-29 | 杭州电子科技大学 | Person Re-identification Method Based on Local and Channel Combination Features |
| CN111709313A (en) * | 2020-05-27 | 2020-09-25 | 杭州电子科技大学 | Pedestrian Re-identification Method Based on Local and Channel Combination Features |
| CN111860508A (en) * | 2020-07-28 | 2020-10-30 | 平安科技(深圳)有限公司 | Image sample selection method and related equipment |
| CN111860508B (en) * | 2020-07-28 | 2024-07-02 | 平安科技(深圳)有限公司 | Image sample selection method and related equipment |
| CN112613359A (en) * | 2020-12-09 | 2021-04-06 | 苏州玖合智能科技有限公司 | Method for constructing neural network for detecting abnormal behaviors of people |
| CN112613359B (en) * | 2020-12-09 | 2024-02-02 | 苏州玖合智能科技有限公司 | Construction method of neural network for detecting abnormal behaviors of personnel |
| CN112613472A (en) * | 2020-12-31 | 2021-04-06 | 上海交通大学 | Pedestrian detection method and system based on deep search matching |
| CN112613472B (en) * | 2020-12-31 | 2022-04-26 | 上海交通大学 | A pedestrian detection method and system based on deep search matching |
| CN113379699A (en) * | 2021-06-08 | 2021-09-10 | 上海电机学院 | Transmission line insulator defect detection method based on deep learning |
| CN113486810A (en) * | 2021-07-08 | 2021-10-08 | 国网江苏省电力有限公司徐州供电分公司 | Intelligent identification method for poached birds in parks |
| CN113486810B (en) * | 2021-07-08 | 2024-06-18 | 国网江苏省电力有限公司徐州供电分公司 | An intelligent identification method for poached birds in parks |
| CN119762907A (en) * | 2022-01-20 | 2025-04-04 | 辽宁工程技术大学 | A method for intelligent interpretation of objects in remote sensing images with complex backgrounds |
| CN115471498A (en) * | 2022-10-10 | 2022-12-13 | 温州市华炜鞋材科技有限公司 | Shoemaking machine and method with multi-angle waterproof monitoring for rain boot production |
| CN116052214A (en) * | 2023-01-18 | 2023-05-02 | 天津大学 | A Pedestrian Search Method Based on Dynamic RoI Feature Extraction |
| CN116052214B (en) * | 2023-01-18 | 2025-06-27 | 天津大学 | Pedestrian searching method based on dynamic RoI feature extraction |
| CN116523830A (en) * | 2023-03-10 | 2023-08-01 | 上海交通大学 | 3D target detection method, system, medium and terminal |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2019144575A1 (en) | 2019-08-01 |
| CN108399362B (en) | 2022-01-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108399362A (en) | | A kind of rapid pedestrian detection method and device |
| CN111462126B (en) | | Semantic image segmentation method and system based on edge enhancement |
| CN112418236B (en) | | Automobile drivable area planning method based on multitask neural network |
| CN108830280B (en) | | Small target detection method based on regional nomination |
| CN106845478B (en) | | Secondary license plate recognition method and device based on character confidence |
| CN111461213B (en) | | A training method for a target detection model and a fast target detection method |
| CN114220035A (en) | | Rapid pest detection method based on improved YOLO V4 |
| CN111797712B (en) | | Remote sensing image cloud and cloud shadow detection method based on multi-scale feature fusion network |
| WO2021218786A1 (en) | | Data processing system, object detection method and apparatus thereof |
| CN112861970B (en) | | Fine-grained image classification method based on feature fusion |
| CN110580461A (en) | | A Facial Expression Recognition Algorithm Combining Multi-Level Convolutional Feature Pyramid |
| CN110781744A (en) | | A small-scale pedestrian detection method based on multi-level feature fusion |
| CN114612769A (en) | | Integrated sensing infrared imaging ship detection method incorporating local structure information |
| CN114332780A (en) | | Detection method for small pedestrian, vehicle and non-motorized-vehicle targets in traffic scenes |
| CN112733614A (en) | | Pest image detection method with enhanced identification of similarly sized pests |
| CN115546500A (en) | | A small target detection method in infrared images |
| CN114220126A (en) | | Target detection system and acquisition method |
| CN116403127A (en) | | A method, device, and storage medium for object detection in aerial images taken by drones |
| CN111738114A (en) | | Anchor-free vehicle target detection method based on accurate sampling of remote sensing images |
| CN116342931A (en) | | A fur image classification method, system and storage medium based on multi-scale attention |
| CN110472632B (en) | | Character segmentation method and device based on character features and computer storage medium |
| CN107958219A (en) | | Image scene classification method based on multiple models and multi-scale features |
| Lee et al. | | SAF-Nets: Shape-adaptive filter networks for 3D point cloud processing |
| CN117830626A (en) | | Real-time semantic segmentation method for low-altitude remote sensing image |
| CN116310368A (en) | | A LiDAR 3D target detection method |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |