CN111401143A

CN111401143A - Pedestrian tracking system and method

Info

Publication number: CN111401143A
Application number: CN202010118386.1A
Authority: CN
Inventors: 谢英红; 李路; 韩晓微; 涂斌斌; 李华
Original assignee: Shenyang University
Current assignee: Shenyang University
Priority date: 2020-02-26
Filing date: 2020-02-26
Publication date: 2020-07-10

Abstract

The invention provides a pedestrian tracking system and a pedestrian tracking method, and relates to the technical field of computer vision. Determining that a first frame of a plurality of video frames includes a target frame of a target object; for a subsequent frame except the first frame in the plurality of video frames, determining a current target frame including a target object in the current frame according to the determined target frame; inputting the current target frame into a pre-trained VGG-16 network, and acquiring a candidate feature map of the target frame in the image; inputting the candidate feature map into a pre-trained RPN network to obtain a plurality of target candidate regions; performing pooling operation on the characteristics of the target candidate regions through a plurality of convolution kernels with different sizes to obtain a plurality of interested regions aiming at the target object; performing full-link operation on the characteristics of the multiple interested areas, distinguishing a target from a background, and obtaining multiple tracking affine frames of a target object; and performing non-maximum suppression on the plurality of tracking affine frames to obtain a tracking result of the target object of the current frame.

Description

A pedestrian tracking system and method

技术领域technical field

本发明涉及计算机视觉技术领域，尤其涉及一种行人跟踪系统及方法。The invention relates to the technical field of computer vision, in particular to a pedestrian tracking system and method.

背景技术Background technique

行人跟踪技术即通过计算机视觉技术,对视频和图像中的画面进行行人目标的识别与跟踪。行人识别跟踪项目已经被许多国家列为重点研究项目,该项目被如此重视,是因为其技术超前且涉猎广泛:在国防军事领域，该技术可用于战场侦测、目标跟踪和精确制导等方面;在城市交通领域,该技术可用于智能交通、违章检测和无人驾驶等方面;在社会安全领域,该技术可用于人流量监测等。Pedestrian tracking technology uses computer vision technology to identify and track pedestrian targets in videos and images. The pedestrian identification and tracking project has been listed as a key research project by many countries. This project is so valued because its technology is advanced and covers a wide range: in the field of national defense and military, the technology can be used for battlefield detection, target tracking and precision guidance; In the field of urban transportation, the technology can be used in intelligent transportation, violation detection and driverless driving, etc.; in the field of social security, the technology can be used in the monitoring of human flow, etc.

现有技术中公开了许多行人跟踪方法和装置。这些系统和方法中虽然使用了很多流行的神经网络技术，但是针对形变目标的准确定位没有特殊的解决方案。Numerous pedestrian tracking methods and apparatuses are disclosed in the prior art. Although many popular neural network techniques are used in these systems and methods, there is no special solution for accurate localization of deformable targets.

发明内容SUMMARY OF THE INVENTION

针对现有技术的不足，本发明提供一种行人跟踪系统及方法。Aiming at the deficiencies of the prior art, the present invention provides a pedestrian tracking system and method.

本发明所采取的技术方案是：The technical scheme adopted by the present invention is:

一方面，本发明提供一种行人跟踪系统，包括包括存储器和处理器；In one aspect, the present invention provides a pedestrian tracking system including a memory and a processor;

所述存储器用来存储有计算机可执行的指令；the memory is used to store computer-executable instructions;

所述处理器用来执行所述可执行指令，确定多个视频帧中的第一帧包括目标对象的目标框；对于多个视频帧中的除了第一帧中的后续帧，根据所确定出的目标框确定出当前帧中包括目标对象的当前目标框；将当前目标框输入预训练好的VGG-16网中，获取该图像中目标框的候选特征图；将候选特征图输入到预训练好的RPN网络中获得多个目标候选区域；通过不同大小的多个卷积核将多个目标候选区域的特征进行池化操作，获得针对目标对象的多个感兴趣区域；将多个感兴趣区域的特征进行全链接操作，区分目标和背景，获得的目标对象的多个跟踪仿射框；以及对多个跟踪仿射框进行非极大值抑制，得到当前帧的目标对象的跟踪结果。The processor is configured to execute the executable instructions to determine that the first frame in the multiple video frames includes the target frame of the target object; for subsequent frames in the multiple video frames except the first frame, according to the determined The target frame determines the current target frame including the target object in the current frame; the current target frame is input into the pre-trained VGG-16 network to obtain the candidate feature map of the target frame in the image; the candidate feature map is input into the pre-trained Multiple target candidate regions are obtained in the RPN network of the The features of the full link operation are performed to distinguish the target and the background, and multiple tracking affine frames of the target object are obtained; and the non-maximum value suppression of the multiple tracking affine frames is performed to obtain the tracking result of the target object of the current frame.

另一方面，本发明还提供一种行人跟踪方法，采用上述的一种行人跟踪系统实现，该方法包括以下步骤：On the other hand, the present invention also provides a pedestrian tracking method, which is implemented by the above-mentioned pedestrian tracking system, and the method includes the following steps:

步骤1：确定出多个视频帧中的第一帧包括目标对象的目标框；Step 1: determine that the first frame in the multiple video frames includes the target frame of the target object;

步骤2：对于除第一帧的后续帧，根据所确定出的目标框确定出当前帧中包括所述目标对象的当前目标框；Step 2: For subsequent frames except the first frame, determine the current target frame including the target object in the current frame according to the determined target frame;

步骤3：将确定出的目标框调整成固定大小输入到预训练好的VGG-16网络中，获取所述当前帧中的目标框的候选特征图，设计损失函数。Step 3: Adjust the determined target frame to a fixed size and input it into the pre-trained VGG-16 network, obtain the candidate feature map of the target frame in the current frame, and design a loss function.

步骤4：将所述候选特征图输入到预训练好的RPN网络中获得多个目标候选区域；Step 4: Input the candidate feature map into the pre-trained RPN network to obtain multiple target candidate regions;

所述目标候选区域为当前帧中的目标对象存在的多个形状和位置同时存在的区域。The target candidate region is a region where multiple shapes and positions of the target object in the current frame exist simultaneously.

步骤5：通过不同大小的多个卷积核将所述多个目标候选区域的特征进行池化操作，获得针对所述目标对象的多个感兴趣区域；Step 5: Perform a pooling operation on the features of the multiple target candidate regions through multiple convolution kernels of different sizes to obtain multiple regions of interest for the target object;

所述多个不同大小的卷积核包括三个用于初略描述所述目标对象的不同形变的卷积核。The multiple convolution kernels of different sizes include three convolution kernels for briefly describing different deformations of the target object.

步骤6：将所述多个感兴趣区域的特征进行全链接操作，区分目标和背景，将所述多个跟踪仿射框与参考目标框进行对比，得到交叠面积最大的仿射跟踪框，从而获得的所述目标对象的多个跟踪仿射框；Step 6: Perform a full-link operation on the features of the multiple regions of interest, distinguish the target and the background, compare the multiple tracking affine frames with the reference target frame, and obtain the affine tracking frame with the largest overlapping area, Thereby obtained multiple tracking affine frames of the target object;

步骤7：对所述多个跟踪仿射框进行非极大值抑制，得到所述当前帧的所述目标对象的跟踪结果；Step 7: performing non-maximum suppression on the multiple tracking affine frames to obtain the tracking result of the target object in the current frame;

步骤8：判断当前图像下一帧的个数是否小于视频总帧数，如果否直接结束，如果是回到步骤2,进行下一帧图像的跟踪，直到所有视频的帧跟踪完毕。Step 8: Determine whether the number of the next frame of the current image is less than the total number of video frames, if not, end immediately, if it is, go back to step 2, and track the next frame of image until all video frames are tracked.

采用上述技术方案所产生的有益效果在于：The beneficial effects produced by the above technical solutions are:

本申请利用上一帧图像的仿射变换参数信息，对当前目标图像进行裁剪，缩小搜索范围，提高算法效率。另外，在池化操作时，应用不同大小、不同形状的卷积核，初步模拟目标的变形，有助于目标位置的更加准确提取。The present application uses the affine transformation parameter information of the previous frame of image to crop the current target image, narrow the search range, and improve the efficiency of the algorithm. In addition, during the pooling operation, convolution kernels of different sizes and shapes are applied to initially simulate the deformation of the target, which helps to extract the target position more accurately.

附图说明Description of drawings

图1本发明实施例的使用计算机架构实现框图。FIG. 1 is a block diagram of an embodiment of the present invention implemented using a computer architecture.

图2为本发明实施例的的行人跟踪算法的流程图。FIG. 2 is a flowchart of a pedestrian tracking algorithm according to an embodiment of the present invention.

图3为本发明实施例的流程示意性框图。FIG. 3 is a schematic block diagram of a process flow of an embodiment of the present invention.

图4为本发明实施例的水平NMS和仿射变换NMS效果对比图。FIG. 4 is a comparison diagram of the effects of horizontal NMS and affine transformation NMS according to an embodiment of the present invention.

图5为本发明实施例的跟踪结果图。FIG. 5 is a tracking result diagram of an embodiment of the present invention.

图6为为本发明实施例的的VGG-16的网络结构。FIG. 6 is a network structure of VGG-16 according to an embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图对本发明具体实施方式加以详细的说明。The specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

如图1所示，其示出了适于用来实现本公开的实施例的电子系统600的结构示意图。图1示出的电子系统仅仅是一个示例，不应对本公开的实施例的功能和使用范围带来任何限制。As shown in FIG. 1 , it shows a schematic structural diagram of an electronic system 600 suitable for implementing embodiments of the present disclosure. The electronic system shown in FIG. 1 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

如图1所示，电子系统600可以包括处理装置（例如中央处理器、图形处理器等）601，其可以根据存储在只读存储器（ROM）602中的程序或者从存储装置608加载到随机访问存储器（RAM）603中的程序而执行各种适当的动作和处理。在RAM 603中，还存储有电子设备600操作所需的各种程序和数据。处理装置601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出（I/O）接口605也连接至总线604。As shown in FIG. 1 , electronic system 600 may include a processing device (eg, central processing unit, graphics processor, etc.) 601 that may be loaded into random access according to a program stored in read only memory (ROM) 602 or from storage device 608 Various appropriate actions and processes are executed by the programs in the memory (RAM) 603 . In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing device 601 , the ROM 602 , and the RAM 603 are connected to each other through a bus 604 . An input/output (I/O) interface 605 is also connected to bus 604 .

通常，以下装置可以连接至I/O接口605：包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置606；包括例如液晶显示器（LCD）、扬声器、振动器等的输出装置607；包括例如磁带、硬盘等的存储装置608；以及通信装置609。通信装置609可以允许电子系统600与其他设备进行无线或有线通信以交换数据。虽然图6示出了具有各种装置的电子系统600，但是应理解的是，并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。图6中示出的每个方框可以代表一个装置，也可以根据需要代表多个装置。Typically, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; including, for example, a liquid crystal display (LCD), speakers, vibration An output device 607 of a computer, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. Communication means 609 may allow electronic system 600 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 6 illustrates electronic system 600 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or available. More or fewer devices may alternatively be implemented or provided. Each block shown in FIG. 6 may represent one device, or may represent multiple devices as required.

特别地，根据本公开的实施例，上文参考流程图描述的过程可以被实现为计算机软件程序。例如，本公开的实施例包括一种计算机程序产品，其包括承载在计算机可读介质上的计算机程序，该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中，该计算机程序可以通过通信装置609从网络上被下载和安装，或者从存储装置608被安装，或者从ROM 602被安装。在该计算机程序被处理装置601执行时，执行本公开的实施例的方法中限定的上述功能。In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 609 , or from the storage device 608 , or from the ROM 602 . When the computer program is executed by the processing apparatus 601, the above-described functions defined in the methods of the embodiments of the present disclosure are executed.

需要说明的是，本公开的实施例所述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器（RAM）、只读存储器（ROM）、可擦式可编程只读存储器（EPROM或闪存）、光纤、便携式紧凑磁盘只读存储器（CD-ROM）、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开的实施例中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开的实施例中，计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输，包括但不限于：电线、光缆、RF（射频）等等，或者上述的任意合适的组合。It should be noted that the computer-readable medium described in the embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above. In embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. Rather, in embodiments of the present disclosure, a computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device . Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.

上述计算机可读介质可以是上述电子系统（在本文中还称为“基于仿射多任务回归的行人跟踪系统”）中所包含的；也可以是单独存在，而未装配入该电子系统中。上述计算机可读介质承载有一个或者多个程序，当上述一个或者多个程序被该电子设备执行时，使得该电子系统：1）确定出多个视频帧中上一帧包括目标对象的目标框；2）根据所确定出的目标框确定出当前帧中包括所述目标对象的当前目标框；3）将所述当前目标框输入预训练好的第一神经网络中，获取所述当前帧中的目标框的候选特征图；4）将所述候选特征图输入到预训练好的第二神经网络中获得多个目标候选区域； 5）将所述多个目标候选区域的特征进行池化操作，获得针对所述目标对象的多个感兴趣区域；6）将所述多个感兴趣区域的特征进行全链接操作，区分目标和背景，从而获得所述目标对象的多个跟踪仿射框；以及7）对所述多个跟踪仿射框进行非极大值抑制，得到所述当前帧的所述目标对象的跟踪结果。The above-mentioned computer-readable medium may be contained in the above-mentioned electronic system (also referred to herein as "a pedestrian tracking system based on affine multi-task regression"); or may exist alone without being assembled into the electronic system. The above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic system: 1) determine a target frame in which the previous frame of the plurality of video frames includes the target object 2) Determine the current target frame including the target object in the current frame according to the determined target frame; 3) Input the current target frame into the pre-trained first neural network, obtain the current frame in the 4) Input the candidate feature map into the pre-trained second neural network to obtain multiple target candidate regions; 5) Perform a pooling operation on the features of the multiple target candidate regions , and obtain multiple regions of interest for the target object; 6) perform a full-link operation on the features of the multiple regions of interest to distinguish the target and the background, thereby obtaining multiple tracking affine frames of the target object; and 7) performing non-maximum suppression on the multiple tracking affine frames to obtain the tracking result of the target object in the current frame.

另一方面，本发明还提供一种基于仿射多任务回归的行人跟踪方法，如图2所示，采用上述的一种基于仿射多任务回归的行人跟踪系统实现，该方法包括以下步骤：On the other hand, the present invention also provides a pedestrian tracking method based on affine multi-task regression, as shown in FIG. 2 , which is implemented by the above-mentioned pedestrian tracking system based on affine multi-task regression, and the method includes the following steps:

初始化原始图像的大小。设原始图像大小为m×n（单位：像素）。当t=1时，手动标记该帧的目标框的位置。将目标框的中心位置坐标记作（cx, cy），其中t表示第t帧图像，t为正整数，cx，cy分别为目标框中心位置横纵坐标，目标框包括所要跟踪的对象，例如图3中标号301所示。Initialize the size of the original image. Let the original image size be m×n (unit: pixel). When t=1, manually mark the position of the target box for this frame. Mark the center position coordinates of the target frame as (cx, cy), where t represents the t-th frame image, t is a positive integer, cx, cy are the horizontal and vertical coordinates of the center position of the target frame, and the target frame includes the object to be tracked, for example Reference numeral 301 is shown in FIG. 3 .

初始化仿射变换参数：U ₁=[r1，r2，r3，r4，r5，r6]^T。Initialize affine transformation parameters: U ₁ =[ r 1, r 2, r 3, r 4, r 5, r 6] ^T .

步骤2：根据所确定出的目标框确定出当前帧中包括所述目标对象的当前目标框；Step 2: determine the current target frame including the target object in the current frame according to the determined target frame;

本实施例中根据所确定出的目标框确定出当前帧中包括所述目标对象的当前目标框。具体地，对输入的第t（t>2）帧图片进行裁剪操作，以t-1帧所跟踪或标识到的目标框的中心坐标（cx, cy）为中心，确定t帧的目标框。例如，假设将t-1帧中的目标框的外接矩形的两条边长，记做：a，b，则在第t帧图像上，以第t-1帧的目标中心点（cx,cy）为中心，裁剪出（2a）×（2b）大小的图片，例如图3中标号为302的矩形框。在本申请中，以上一帧目标的中心点为中心的目的是，使得裁剪后的图片包含目标信息，这是因为目标在相邻两帧的中心点坐标变化不大，只要在中心点附近的位置，裁剪出足够大的子图片，都可以包含要跟踪的目标。In this embodiment, the current target frame including the target object in the current frame is determined according to the determined target frame. Specifically, the input frame t (t>2) picture is cropped, and the target frame of frame t is determined with the center coordinates (cx, cy) of the target frame tracked or identified by frame t-1 as the center. For example, assuming that the lengths of the two sides of the circumscribed rectangle of the target frame in frame t-1 are denoted as: a, b, then on the image of frame t, the center point of the target in frame t-1 (cx, cy ) as the center, and crop out a picture of size (2a)×(2b), such as the rectangular box labeled 302 in Figure 3. In this application, the purpose of taking the center point of the target in the previous frame as the center is to make the cropped picture contain the target information. This is because the coordinates of the center point of the target in two adjacent frames do not change much, as long as the center point near the center point. position, and crop out a sub-image large enough to contain the target to be tracked.

将裁剪后的目标框调整成固定大小，送入预训练好的神经网络中，例如送入到上述的VGG-16网络中，获取该图像在网络中第五层卷积之后的特征图，即，获取该图像中目标框的候选特征图。例如图3中标号为303所示。Adjust the cropped target frame to a fixed size and send it to the pre-trained neural network, such as the above-mentioned VGG-16 network, to obtain the feature map of the image after the fifth layer convolution in the network, that is , obtain the candidate feature map of the target frame in the image. For example, the reference numeral 303 in FIG. 3 is shown.

本实施例中综合考虑系统的准确性与运行效率，采用经典的VGG-16网络结构来实现本申请的各个实施方式。如图1所示为示例性的VGG-16网络结构。如图1所示，该网络结构包括 13个卷积层（201）和3个全连接层（203）。具体地，如图1所示，首先用3×3、步幅为1的过滤器构建卷积层，假设网络输入大小为m×n×3（m和n为正整数），为了保证卷积之后的特征矩阵的前两维与输入矩阵的前两维维数相同即：m×n。在输入矩阵外加一圈0。将输入矩阵的维数变为(m+2)×(n+2), 再3×3卷积。这样卷积之后的特征矩阵的前两维仍为：m×n。然后用一个2×2，步幅为2的过滤器构建最大池化层202。接着再用256个相同的过滤器进行三次卷积操作，然后再池化，然后再卷积三次，再池化。以上所用的激活函数为现有的relu函数。如此进行几轮操作后，将最后得到的7×7×512的特征图进行全连接操作（即，全连接层203），得到4096个单元，然后进行softmax函数进行激活（即，如图所示的激活层204），输出从1000个对象中识别的结果。虽然在这里给出一个具体的VGG-16网络结构，但是本领域技术人员应该理解，在未背离本申请教导的情况下还可以采用其它的网络架构。In this embodiment, the accuracy and operation efficiency of the system are comprehensively considered, and the classic VGG-16 network structure is used to implement the various embodiments of the present application. Figure 1 shows an exemplary VGG-16 network structure. As shown in Figure 1, the network structure includes 13 convolutional layers (201) and 3 fully connected layers (203). Specifically, as shown in Figure 1, a convolutional layer is first constructed with 3×3 filters with a stride of 1, assuming that the network input size is m×n×3 (m and n are positive integers), in order to ensure the convolution The first two dimensions of the subsequent feature matrix are the same as the first two dimensions of the input matrix, namely: m×n. Append a circle of 0s to the input matrix. Change the dimension of the input matrix to (m+2)×(n+2), and then 3×3 convolution. In this way, the first two dimensions of the feature matrix after convolution are still: m×n. A max pooling layer 202 is then constructed with a 2×2 filter with a stride of 2. Then do three convolution operations with 256 of the same filters, then pool again, then convolve three more times, then pool again. The activation function used above is the existing relu function. After several rounds of operations in this way, the finally obtained 7×7×512 feature map is fully connected (ie, fully connected layer 203) to obtain 4096 units, and then activated by the softmax function (ie, as shown in the figure). 204), which outputs the results identified from 1000 objects. Although a specific VGG-16 network structure is given here, those skilled in the art should understand that other network structures can also be used without departing from the teachings of the present application.

再构架上述网络后，通过使用ImageNet数据集对其进行训练。该ImageNet数据集分为训练集和测试集。该数据集对应例如1000个类别。每个数据有对应的标签向量，每个标签向量对应一个不同的类别。本申请不关心输入图像的具体分类，只是应用该数据集训练VGG-16网络的权重。具体地，将上述ImageNet训练集调整成224×224×3大小，然后输入VGG-16网络以对该网络进行训练，得到网络各层或各单元的权重参数信息。然后，向训练得到的VGG-16网络结构中输入预先确定的测试数据集以及对应类别的标签向量。测试数据集的大小可例如同样为224×224×3。通过向VGG-16网络输入上述测试数据集以及对应类别的标签向量，可对VGG-16网络的输出结果进行检测，所检测的结果与标准数据进行比对，以根据比对的误差对VGG-16网络的参数（权重）进行调整。重复上面步骤，直到得到测试准确率达到预定的标准，例如准确率为98%以上。After re-architecting the above network, train it by using the ImageNet dataset. The ImageNet dataset is divided into training set and test set. This dataset corresponds to, for example, 1000 categories. Each data has a corresponding label vector, and each label vector corresponds to a different category. This application does not care about the specific classification of the input image, but only applies this dataset to train the weights of the VGG-16 network. Specifically, the above ImageNet training set is adjusted to a size of 224×224×3, and then input to the VGG-16 network to train the network to obtain the weight parameter information of each layer or unit of the network. Then, the pre-determined test data set and the label vector of the corresponding category are input into the VGG-16 network structure obtained by training. The size of the test data set may also be, for example, 224×224×3. By inputting the above test data set and the label vector of the corresponding category into the VGG-16 network, the output results of the VGG-16 network can be detected, and the detected results 16 The parameters (weights) of the network are adjusted. Repeat the above steps until the test accuracy reaches a predetermined standard, for example, the accuracy is more than 98%.

可选地，需要首先计算损失和回归，优化仿射变换参数。对于上述VGG-16整个网络的损失函数设计可例如表示为：Optionally, it is necessary to first calculate the loss and regression, and optimize the affine transformation parameters. The loss function design for the entire network of the above VGG-16 can be expressed as:

所述VGG-16网络的损失函数表示为：The loss function of the VGG-16 network is expressed as:

（1）

(1)

其中，α₁和α₂为学习率。p为类别tc的对数损失，公式如（2）所示。where α ₁ and α ₂ are learning rates. p is the logarithmic loss of category tc , and the formula is shown in (2).

L _c（p,tc）=-logp _tc （2） L _c ( p , tc ) = - logp _tc (2)

i表示正在计算损失的回归框的序号; i represents the sequence number of the regression box where the loss is being calculated;

tc表示是类别标签，例如：tc=1表示目标，tc=0表示背景； tc means category label, for example: tc =1 means target, tc =0 means background;

x，y，w，h和其它变量组合使用，分别表示横坐标/纵坐标/宽/高。 x , y , w , h and other variables are used in combination to represent abscissa/ordinate/width/height respectively.

参数 v _i=（v _x， v _y， v _w， v _h）是真实矩形边界框元组，包括中心点横坐标、纵坐标、宽和高；

是预测到的目标框元组，包括中心点横坐标、纵坐标、宽和高； The parameter v _i = ( v _{x ,} v _{y ,} v _{w ,} v _h ) is a true rectangular bounding box tuple, including the center point abscissa, ordinate, width and height;

is the predicted target frame tuple, including the center point abscissa, ordinate, width and height;

u _i=（r1,r2,r3,r4,r5,r6）为真实目标区域的仿射参数元组； u _i = ( r1 , r2 , r3 , r4 , r5 , r6 ) is the affine parameter tuple of the real target area;

为预测到目标区域的仿射参数元组；

is the affine parameter tuple that predicts the target area;

r1，r2，r3，r4，r5，r6）为真实目标区域的仿射变换固定结构的六个分量的值；

r 1, r 2, r 3, r 4, r 5, r 6) are the values of the six components of the affine transformation fixed structure of the real target area;

r1^*，r2^*，r3^*，r4^*，r5^*，r6^*）为预测到目标区域的仿射变换固定结构的六个分量的值；

r 1 ^* , r 2 ^* , r 3 ^* , r 4 ^* , r 5 ^* , r 6 ^* ) are the six components of the fixed structure of the affine transformation predicted to the target region;

表示仿射边界框参数损失函数；

represents the affine bounding box parameter loss function;

表示矩形边界框参数损失函数；

represents the parameter loss function of the rectangular bounding box;

令（w，w*）表示

或者

,

定义为: Let ( w , w* ) denote

or

,

defined as:

(3)

(4)

其中x为实数。where x is a real number.

将从神经网络中获得的上述特征图输入到RPN（Region Proposal Network）网络中，提取出获得多个目标的候选区域，例如2000个候选区域。例如图3中标号为304所示。RPN是生成多个不同大小的候选区域的网路，与VGG-16网络不同。候选区域是当前帧目标可能存在的多个形状和位置都的区域。本申请预先估计出算法可能存在的多个区域，然后对这些区域进行优化回归，筛选出较精确的跟踪区域。The above feature map obtained from the neural network is input into the RPN (Region Proposal Network) network, and candidate regions for obtaining multiple targets, such as 2000 candidate regions, are extracted. For example, the reference numeral 304 in FIG. 3 is shown. RPN is a network that generates multiple candidate regions of different sizes, which is different from the VGG-16 network. A candidate region is a region with multiple shapes and positions that may exist in the current frame target. The present application pre-estimates multiple regions that may exist in the algorithm, and then performs optimization regression on these regions to screen out more accurate tracking regions.

对这些不同大小的候选区域的特征进行池化操作，获得针对目标对象的多个感兴趣区域（ROI）。在这里，考虑到目标的变形，在池化层中设计多个不同大小的卷积核，例如设计三个卷积核，分别为：7×7，5×9和9×5。例如图3中标号为305所示。多个不同的池化核可以初略描述目标的形变。例如：7×7，5×9可以描述不同摄像头下站立的人，9×5可以描述人的弯腰等动作。当然也可以根据不同应用场景设计不同大小的池化核。The features of these candidate regions of different sizes are pooled to obtain multiple regions of interest (ROI) for the target object. Here, considering the deformation of the target, multiple convolution kernels of different sizes are designed in the pooling layer, for example, three convolution kernels are designed: 7×7, 5×9 and 9×5. For example, the reference numeral 305 in FIG. 3 is shown. Multiple different pooling kernels can roughly describe the deformation of the target. For example: 7×7, 5×9 can describe people standing under different cameras, 9×5 can describe people’s actions such as bending over. Of course, pooling kernels of different sizes can also be designed according to different application scenarios.

将上述池化的结果，即多个感兴趣区域（ROI）的特征进行全连接操作。在这里，全链接操作是将多个ROI特征依次串联起来。例如图3中标号为306所示。然后，使用softmax函数对串联起来的特征进行打分比较，得到所比较目标区域目标/背景结果的分值。例如，分值结果大于一定阈值的判定为目标区域，否则为背景区域。The results of the above pooling, that is, the features of multiple regions of interest (ROI), are fully connected. Here, the full link operation is to concatenate multiple ROI features in sequence. For example, the reference numeral 306 in FIG. 3 is shown. Then, use the softmax function to score and compare the concatenated features to obtain the score of the target/background result of the compared target area. For example, if the score result is greater than a certain threshold, it is determined as the target area, otherwise, it is the background area.

对得到的判定为目标区域的仿射区域，进行非极大值抑制（例如图3中标号为308所示），得到第t帧图像的跟踪结果，即对应的仿射参数和边框。例如图3中标号为309所示。在一个实施方式中，可将所述多个跟踪仿射框与参考目标框（即上一帧跟踪到的目标框）进行对比，得到交叠面积最大的仿射跟踪框，作为最终的跟踪结果。Non-maximum suppression is performed on the obtained affine region that is determined to be the target region (for example, the reference number 308 in Figure 3), and the tracking result of the t-th frame image is obtained, that is, the corresponding affine parameters and frame. For example, the reference numeral 309 in FIG. 3 is shown. In one embodiment, the multiple tracking affine frames can be compared with the reference target frame (that is, the target frame tracked in the previous frame) to obtain the affine tracking frame with the largest overlapping area as the final tracking result .

本文采用仿射变换表示目标几何变形。第t帧的目标区域的跟踪结果的仿射变换参数记作U _t，其结构为：U _t =[r1,r2,r3,r4,r5,r6]^T。对应的仿射变换矩阵

, 具有李群结构，ga（2）是对应于仿射李群GA（2）的李代数，矩阵G _j（

）是GA（2）的生成元以及矩阵ga（2）的基。对于矩阵GA（2）的生成元为： In this paper, affine transformation is used to represent the geometric deformation of the target. The affine transformation parameter of the tracking result of the target area of the t -th frame is denoted as U _t , and its structure is: U _t = [ r 1, r 2, r 3, r 4, r 5, r 6] ^T . Corresponding affine transformation matrix

, has a Lie group structure, ga (2) is a Lie algebra corresponding to the affine Lie group GA (2), the matrix G _j (

) is the generator of GA (2) and the basis of the matrix ga (2). The generator for the matrix GA (2) is:

(5)

对于李群矩阵，黎曼距离定义为矩阵对数运算:For Lie group matrices, the Riemann distance is defined as a matrix logarithm operation:

(6)

其中X和Y是李群矩阵的元素，给出了N的对称正定矩阵

的内均值定义： where X and Y are elements of a Lie group matrix, giving the symmetric positive definite matrix of N

The inner mean definition of :

(7)

其中

，q为常数； in

, q is a constant;

对上述多个跟踪仿射框进行非极大值抑制，得到t帧图像的跟踪结果。通过回归可能得到多个不同的目标区域，为了正确的得到一个精确度最高的检测算法，本申请采用仿射变换非极大值抑制方法来筛选出最后的跟踪结果。另外，上述损失函数的设计，将目标仿射形变考虑进去，提高了预测目标位置的准确性。Perform non-maximum suppression on the above multiple tracking affine frames to obtain the tracking results of t frames of images. Multiple different target regions may be obtained through regression. In order to correctly obtain a detection algorithm with the highest accuracy, the present application adopts an affine transformation non-maximum suppression method to screen out the final tracking result. In addition, the design of the above loss function takes the affine deformation of the target into account, which improves the accuracy of predicting the target position.

当前的对象检测方法，非极大值抑制(NMS)被广泛地用于后处理检测候选。在估计轴对齐边界框和倾斜边界框的同时，可以在轴对齐的边界框上执行正常的NMS，也可以在仿射变换边界框上执行倾斜NMS，本申请成为仿射变换非极大值抑制。在仿射变换非极大值抑制中，传统交点(IoU)的计算被修改为两个仿射边界框之间的IoU。算法效果如图4所示。在图4中，编号为401 的各个边框为非极大值抑制之前的候选跟踪框，编号为402的边框为进行正常的NMS抑制后得到的跟踪框，编号为403的边框为本申请进行仿射变换非极大值抑制得到的跟踪框。可以看出本申请得到的跟踪框更为准确。The current object detection method, Non-Maximum Suppression (NMS), is widely used to post-process detection candidates. While estimating the axis-aligned bounding box and the skewed bounding box, normal NMS can be performed on the axis-aligned bounding box, or skewed NMS can be performed on the affine transformed bounding box, this application is called affine transform non-maximum suppression . In affine transformation non-maximum suppression, the traditional intersection point (IoU) computation is modified as the IoU between two affine bounding boxes. The effect of the algorithm is shown in Figure 4. In Figure 4, each frame numbered 401 is the candidate tracking frame before non-maximum suppression, the frame numbered 402 is the tracking frame obtained after normal NMS suppression, and the frame numbered 403 is a simulation of this application. The tracking box obtained by the non-maximum suppression of the injection transform. It can be seen that the tracking frame obtained in this application is more accurate.

步骤8：确定t+1的个数是否小于视频总帧数，如果是回到步骤2,进行第t+1帧图像的跟踪。直到所有视频的帧跟踪完毕，算法结束。部分跟踪结果边框如图5中501，502，503，504箭头所指示黑色边框所示。Step 8: Determine whether the number of t+1 is less than the total number of video frames, if it is, go back to step 2, and perform the tracking of the t+1th frame image. The algorithm ends when all frames of the video are tracked. Part of the tracking result frame is shown as the black frame indicated by the arrows 501, 502, 503, 504 in Figure 5.

在本申请中，利用上一帧图像的仿射变换参数信息，对当前目标图像进行裁剪，缩小搜索范围，提高算法效率。另外，先将裁剪下来的图像输入到VGG-16网络计算特征，再输入到RPN网络，避免特征提取的重复计算，提高了算法效率。另外，在池化操作时，应用不同大小、不同形状的卷积核，初步模拟目标的变形，有助于目标位置的更加准确提取。在本申请中，上述网络的最高层输出的特征作为语义模型，利用仿射变换结果作为空间模型，两者形成优势互补，这是因为最高层的特征包含较多的语义信息和较少的空间信息。此外，上述包括仿射变换参数回归的多任务损失函数优化了网络性能。In this application, the affine transformation parameter information of the previous frame of image is used to crop the current target image, narrow the search range, and improve the efficiency of the algorithm. In addition, the cropped image is first input to the VGG-16 network to calculate the features, and then input to the RPN network, which avoids the repeated calculation of feature extraction and improves the efficiency of the algorithm. In addition, during the pooling operation, convolution kernels of different sizes and shapes are applied to initially simulate the deformation of the target, which helps to extract the target position more accurately. In this application, the features output by the highest layer of the above network are used as the semantic model, and the affine transformation result is used as the spatial model, and the two complement each other's advantages, because the features of the highest layer contain more semantic information and less space information. Furthermore, the above-mentioned multi-task loss function including affine transformation parameter regression optimizes the network performance.

在上述的行人跟踪系统，从所述RPN网络获得的所述候选区域是所述当前帧中的目标对象存在的多个形状和位置的区域。此外，所述步骤5通过不同大小的多个卷积核将所述多个目标候选区域的特征进行池化操作，获得针对所述目标对象的多个感兴趣区域。例如，多个不同大小的卷积核包括三个卷积核，以初略描述目标的不同形变。特别地，如上所述，考虑到目标的变形，在池化层中设计多个不同大小的卷积核，例如设计三个卷积核，分别为：7×7，5×9和9×5。多个不同的池化核可以初略描述目标的形变。例如：7×7，5×9可以描述不同摄像头下站立的人，9×5可以描述人的弯腰等动作。当然也可以根据不同应用场景设计不同大小的池化核。In the above-mentioned pedestrian tracking system, the candidate regions obtained from the RPN network are regions of multiple shapes and positions where the target object in the current frame exists. In addition, in the step 5, the features of the multiple target candidate regions are pooled through multiple convolution kernels of different sizes to obtain multiple regions of interest for the target object. For example, multiple convolution kernels of different sizes include three convolution kernels to briefly describe different deformations of the target. In particular, as mentioned above, considering the deformation of the target, multiple convolution kernels of different sizes are designed in the pooling layer, for example, three convolution kernels are designed: 7×7, 5×9 and 9×5 . Multiple different pooling kernels can roughly describe the deformation of the target. For example: 7×7, 5×9 can describe people standing under different cameras, 9×5 can describe people’s actions such as bending over. Of course, pooling kernels of different sizes can also be designed according to different application scenarios.

以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解，本公开的实施例中所涉及的发明范围，并不限于上述技术特征的特定组合而成的技术方案，同时也应涵盖在不脱离上述发明构思的情况下，由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开的实施例中公开的（但不限于）具有类似功能的技术特征进行互相替换而形成的技术方案。The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to the technical solution formed by the specific combination of the above-mentioned technical features, and should also cover, without departing from the above-mentioned inventive concept, the above-mentioned Other technical solutions formed by any combination of technical features or their equivalent features. For example, a technical solution is formed by replacing the above features with the technical features disclosed in the embodiments of the present disclosure (but not limited to) with similar functions.

Claims

1. a pedestrian tracking system, is characterized in that: comprise memory and processor;

the memory is used to store computer-executable instructions;

The processor is configured to execute the executable instructions to determine that the first frame in the multiple video frames includes the target frame of the target object; for subsequent frames in the multiple video frames except the first frame, according to the determined The target frame determines the current target frame including the target object in the current frame; the current target frame is input into the pre-trained VGG-16 network to obtain the candidate feature map of the target frame in the image; the candidate feature map is input into the pre-trained Multiple target candidate regions are obtained in the RPN network of the The features of the full link operation are performed to distinguish the target and the background, and multiple tracking affine frames of the target object are obtained; and the non-maximum value suppression of the multiple tracking affine frames is performed to obtain the tracking result of the target object of the current frame.

2. A pedestrian tracking method, realized by a pedestrian tracking system described in claim 1, characterized in that, comprising the following steps:

Step 1: determine that the first frame in the multiple video frames includes the target frame of the target object;

Step 2: For subsequent frames except the first frame, determine the current target frame including the target object in the current frame according to the determined target frame;

Step 3: Adjust the determined target frame to a fixed size and input it into the pre-trained VGG-16 network, obtain the candidate feature map of the target frame in the current frame, and design a loss function;

Step 4: Input the candidate feature map into the pre-trained RPN network to obtain multiple target candidate regions;

Step 5: Perform a pooling operation on the features of the multiple target candidate regions through multiple convolution kernels of different sizes to obtain multiple regions of interest for the target object;

Step 6: Perform a full-link operation on the features of the multiple regions of interest, distinguish the target and the background, compare the multiple tracking affine frames with the reference target frame, and obtain the affine tracking frame with the largest overlapping area, Thereby obtained multiple tracking affine frames of the target object;

Step 7: performing non-maximum suppression on the multiple tracking affine frames to obtain the tracking result of the target object in the current frame;

Step 8: Determine whether the number of the next frame of the current image is less than the total number of video frames, if not, end immediately, if it is, go back to step 2, and track the next frame of image until all video frames are tracked.

3 . The pedestrian tracking method according to claim 2 , wherein the target candidate region in step 4 is a region where multiple shapes and positions of the target object in the current frame exist simultaneously. 4 .

4 . The pedestrian tracking method according to claim 2 , wherein in step 5, the multiple convolution kernels of different sizes include three convolution kernels for roughly describing different deformations of the target object. 5 .