CN110503073B - Dense multi-agent track prediction method for dynamic link at third view angle
- Publication number: CN110503073B (application CN201910807587.XA)
- Authority: CN (China)
- Prior art keywords: sampling, convolution, trajectory, gate, input
- Prior art date: 2019-08-29
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
Abstract
The invention discloses a dynamically linked dense multi-agent trajectory prediction method from a third-person perspective. A variational autoencoder visual component compresses the data; the input trajectory frames X enter a dynamic recurrent unit that implements the encoding network; the encoded data are then decoded. The invention not only simulates the fluid-like spatio-temporal motion of multiple agents through the dynamic variation of the convolution kernel's sampling points, but also extracts spatial features of the agents' locations and learns from the data which pixels on the feature map to sample, reducing spatial feature redundancy. In a data-driven manner, the invention learns weights on the feature map with a fixed convolution kernel and applies a sigmoid function to the learned weights to obtain the sampling amplitude of the spatio-temporal data, which better matches objective sampling behavior and improves the model's generalization ability. The invention requires no agent trajectory points, enables multi-step prediction, improves model generalization, and reduces computational complexity.
Description
Technical Field
The invention relates to multi-agent trajectory prediction technology, and in particular to a dynamically linked dense multi-agent trajectory prediction method from a third-person perspective.
Background
In modern society, dense multi-agent activities such as large concerts, sporting events, religious activities, and mass gatherings are becoming increasingly frequent. Especially in a populous country such as China, predicting the movement trends of dense multi-agent crowds is one of the urgent problems in public safety research. Clearly, dense multi-agent trajectory prediction helps formulate corresponding safety management strategies, design better crowd diversion schemes, count crowd flow in real time, detect abnormal agent behavior, and protect citizens' personal safety.
At present, trajectory prediction for dense multi-agents still relies mainly on data-driven, fixed-connection trajectory-point prediction techniques. Taking the convolutional recurrent network structure as an example, because the convolution kernel size is fixed, the sampled neighbor positions can hardly change. Such techniques not only struggle to capture the fluid-like spatio-temporal motion trends of multiple agents (such as aggregation and diffusion), but are also prone to sampling-data redundancy. For long-horizon multi-step prediction, model generalization degrades sharply, which makes prediction costly and wastes human resources. To date, many problems in dense multi-agent trajectory prediction remain unsolved.
Summary of the Invention
To solve the above problems in the prior art, the invention proposes a dynamically linked dense multi-agent trajectory prediction method from a third-person perspective that can handle dynamic changes in spatio-temporal data and improve model generalization.
To achieve the above object, the technical solution of the invention is as follows. A dynamically linked dense multi-agent trajectory prediction method from a third-person perspective comprises the following steps:
A. Data compression with the variational autoencoder visual component
The variational autoencoder visual component feeds the input sequence of temporally dependent consecutive trajectory frames into an end-to-end neural network for learning, abstracting and compressing the trajectory frame data. The specific steps are as follows:
A1. The input consecutive trajectory frames X1, X2, ..., Xt-1, Xt have different scales and are resized to a common size of 128×128 with the function nn.imsize(X,128,128), where nn denotes the neural network function base-class name.
A2. The resized consecutive trajectory frames are encoded into a vector V by the neural network's fully connected operation, so that the former high-dimensional 128×128 representation becomes 400-dimensional, as shown below:

V = nn.Linear(X, 400)    (1)
A3. A two-dimensional convolution is applied to the vector V for down-sampling, and the neural network fits V into a low-dimensional mean vector μ and variance δ of a Gaussian distribution, with the specific formulas:

μ = nn.Conv2d(V)    (2)
δ = nn.Conv2d(V)    (3)
A4. Using the reparameterization trick, sampling a trajectory frame X from N(μ, δ²) is equivalent to sampling an ε from the standard normal distribution N(0,1) and letting X = μ + ε×δ. Sampling from the original N(μ, δ²) is thus transformed into sampling from the standard Gaussian N(0,1), after which the parameter transformation yields the sample from N(μ, δ²).
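A minimal PyTorch sketch of steps A1-A4 follows. The class name VisualVAE, the single-channel input, and the latent channel count are illustrative assumptions; the patent's nn.imsize is rendered with F.interpolate, and the variance is parameterized as a log-variance for numerical stability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualVAE(nn.Module):
    """Variational-autoencoder visual component (steps A1-A4)."""
    def __init__(self, latent_ch=16):
        super().__init__()
        self.fc = nn.Linear(128 * 128, 400)  # A2: encode a frame into a 400-d vector V
        # A3: two Conv2d heads fit V to a Gaussian mean and (log-)variance while down-sampling
        self.conv_mu = nn.Conv2d(1, latent_ch, 3, stride=2, padding=1)
        self.conv_logvar = nn.Conv2d(1, latent_ch, 3, stride=2, padding=1)

    def forward(self, x):  # x: (B, 1, H, W) trajectory frames of varying size
        # A1: bring every frame to the common 128x128 size (the patent's nn.imsize(X,128,128))
        x = F.interpolate(x, size=(128, 128), mode='bilinear', align_corners=False)
        v = self.fc(x.flatten(1))            # A2: V = nn.Linear(X, 400)
        v = v.view(-1, 1, 20, 20)            # reshape the 400-d vector so Conv2d can act on it
        mu, logvar = self.conv_mu(v), self.conv_logvar(v)
        # A4: reparameterization -- draw eps ~ N(0,1) and return mu + eps * delta, so the
        # random draw stays outside the gradient path (delta taken as exp(0.5 * logvar))
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)
```

Given a batch of frames of shape (B, 1, H, W), the component returns one low-dimensional latent sample per frame.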
B. The input trajectory frame X enters the dynamic recurrent unit to implement the encoding network
After the encoding network structure, the encoded trajectory frame X enters the dynamic recurrent unit, where the feature extraction of the encoded trajectory frame vectors is carried out. The specific steps are given by formula (4):

Zt = σ(Whz * Γ(Ht-1, φ(P)) + Wxz * Xt)
Rt = σ(Whr * Γ(Ht-1, φ(P)) + Wxr * Xt)
H̃t = tanh(Whh * (Rt ⊙ Γ(Ht-1, φ(P))) + Wxh * Xt)
Ht = Zt ⊙ Ht-1 + (1 - Zt) ⊙ Δmk ⊙ H̃t    (4)

where ⊙ denotes the Hadamard product and "*" denotes the convolution operation. The subscripts of the weights W indicate their roles: Whz is the update-gate hidden-state convolution weight, Wxz the update-gate input convolution weight, Whr the reset-gate hidden-state convolution weight, Wxr the reset-gate input convolution weight, and Whh and Wxh the new-hidden-state convolution weights for the previous hidden state and the input, respectively. When the actual network structure runs, all weights are shared across the encoding network. Ht-1 denotes the hidden state at time t-1, and k indexes the linked neighbor spatio-temporal data point being sampled.
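A sketch of the dynamic recurrent unit as a convolutional GRU cell, following formula (4). This is a minimal reading, not the patent's exact implementation: the candidate-input weight Wxh is assumed, and the Γ/φ(P) operator (dyn_sample) and the Δmk gate (dm_gate) are supplied externally — sketches of both follow steps B1 and B3 below.

```python
import torch
import torch.nn as nn

class DynamicGRUCell(nn.Module):
    """Dynamic recurrent unit: a convolutional GRU that reads its previous hidden
    state through the dynamic sampling operator Γ(H_{t-1}, φ(P)) (formula (4))."""
    def __init__(self, ch, k=3):
        super().__init__()
        p = k // 2
        self.w_hz = nn.Conv2d(ch, ch, k, padding=p)  # update-gate hidden weight W_hz
        self.w_xz = nn.Conv2d(ch, ch, k, padding=p)  # update-gate input weight  W_xz
        self.w_hr = nn.Conv2d(ch, ch, k, padding=p)  # reset-gate hidden weight  W_hr
        self.w_xr = nn.Conv2d(ch, ch, k, padding=p)  # reset-gate input weight   W_xr
        self.w_hh = nn.Conv2d(ch, ch, k, padding=p)  # candidate hidden weight   W_hh
        self.w_xh = nn.Conv2d(ch, ch, k, padding=p)  # candidate input weight (assumed)

    def forward(self, x, h, dyn_sample, dm_gate):
        g = dyn_sample(h)                                      # Γ(H_{t-1}, φ(P)), step B1
        z = torch.sigmoid(self.w_hz(g) + self.w_xz(x))         # update gate Z_t
        r = torch.sigmoid(self.w_hr(g) + self.w_xr(x))         # reset gate R_t, step B2
        h_cand = torch.tanh(self.w_hh(r * g) + self.w_xh(x))   # candidate hidden state
        dm = dm_gate(h, x)                                     # sampling amplitude Δm_k, step B3
        return z * h + (1 - z) * dm * h_cand                   # new hidden state H_t
```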
B1. The input consecutive trajectory frames processed by the variational autoencoder visual component first enter the update gate Zt. A dynamic link capability is added on top of the update gate of a conventional gated recurrent unit, implemented by φ(P), where P denotes the sampling positions of the convolution kernel in the feature map. Concretely, φ(P) applies a 3×3 convolutional neural network to the input consecutive trajectory frames X to obtain the position offsets of the spatio-temporal data, adds these offsets to the coordinates of the original input feature map, and obtains the pixel values at the shifted coordinates by bilinear interpolation. Finally, a 3×3 convolution kernel is applied to the values at the shifted positions. The update gate controls how much state information from the previous time step is carried into the current state. The function Γ(Ht-1, φ(P)) dynamically selects the spatio-temporal data sampling points.
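A sketch of the dynamic link operator from step B1. The class name DynSample is an assumption; for brevity the offsets here are predicted from the map being resampled (the patent derives φ(P) from the input trajectory frames), and a single offset per pixel stands in for per-kernel-point offsets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynSample(nn.Module):
    """φ(P) and Γ: learn per-pixel offsets, resample the feature map bilinearly at the
    shifted coordinates, then apply a fixed 3x3 convolution (step B1)."""
    def __init__(self, ch):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2, 3, padding=1)  # 3x3 conv predicts an (x, y) offset per pixel
        self.fixed = nn.Conv2d(ch, ch, 3, padding=1)  # fixed kernel applied at the shifted positions

    def forward(self, h):
        n, _, H, W = h.shape
        off = self.offset(h).permute(0, 2, 3, 1)      # data-driven position offsets, (n, H, W, 2)
        ys, xs = torch.meshgrid(torch.arange(H, device=h.device),
                                torch.arange(W, device=h.device), indexing='ij')
        base = torch.stack((xs, ys), dim=-1).float()  # original feature-map coordinates
        grid = base.unsqueeze(0) + off                # original coordinates plus learned offsets
        # normalize to [-1, 1], then bilinear resampling via F.grid_sample
        gx = 2 * grid[..., 0] / (W - 1) - 1
        gy = 2 * grid[..., 1] / (H - 1) - 1
        grid = torch.stack((gx, gy), dim=-1)
        return self.fixed(F.grid_sample(h, grid, mode='bilinear', align_corners=True))
```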
B2. After the update-gate operation, the input consecutive trajectory frames enter the reset gate, which likewise uses φ(P) to realize the dynamic link function.
B3. After the update and reset gates, the hidden state at the current time step is determined. Its sampling amplitude is given by Δmk, which is obtained by first applying a convolution to the spatio-temporal data (Ht-1, Xt) to produce an intermediate value and then applying a sigmoid function to obtain the sampling probability, whose range is [0,1]:

Δmk = σ(Wm * [Ht-1, Xt])    (5)
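A sketch of the Δmk gate of step B3 under the same assumptions; the weight symbol Wm and class name DmGate are illustrative.

```python
import torch
import torch.nn as nn

class DmGate(nn.Module):
    """Sampling-amplitude gate Δm_k (formula (5)): a convolution over the spatio-temporal
    data (H_{t-1}, X_t) followed by a sigmoid, giving per-pixel probabilities in [0, 1]."""
    def __init__(self, ch, k=3):
        super().__init__()
        self.w_m = nn.Conv2d(2 * ch, ch, k, padding=k // 2)

    def forward(self, h, x):
        return torch.sigmoid(self.w_m(torch.cat([h, x], dim=1)))
```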
C. Decoding the encoded data
C1. Based on the observations of the previous J input time-series trajectory image frames X, predict the most likely sequence of K future time-series trajectory image frames X;
C2. The prediction result is expressed by the following formula:
Xt+1, ..., Xt+K ≈ g_decode(f_encode(Xt-J+1, Xt-J+2, ..., Xt))    (6)
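A schematic of formula (6) under the earlier sketches' assumptions; the zero initial hidden state, the feedback of each prediction into the next step, and the omitted mapping from latent predictions back to image frames are assumptions rather than the patent's specification.

```python
import torch

def predict(frames, vae, enc_cell, dec_cell, dyn_sample, dm_gate, K):
    """Formula (6): encode the J observed frames, then roll out K future steps."""
    h = torch.zeros_like(vae(frames[0]))              # initial hidden state (assumed zeros)
    for x in frames:                                  # f_encode over X_{t-J+1}, ..., X_t
        h = enc_cell(vae(x), h, dyn_sample, dm_gate)
    preds, x = [], vae(frames[-1])
    for _ in range(K):                                # g_decode: K prediction steps
        h = dec_cell(x, h, dyn_sample, dm_gate)
        x = h                                         # feed each prediction back (assumed scheme)
        preds.append(h)
    return preds                                      # latent predictions for X_{t+1}, ..., X_{t+K}
```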
End.
Compared with the prior art, the invention has the following beneficial effects:
1. The dynamic link structure of the invention first learns the coordinate offsets of the feature map in a data-driven manner, then maps the pixel values of the original feature map to the new coordinate positions by bilinear interpolation, and finally down-samples with a fixed-connection convolution kernel to realize dynamic link changes. Compared with the prior art, the invention not only simulates the fluid-like spatio-temporal motion of multiple agents through the dynamic variation of the convolution kernel's sampling points, but also extracts spatial features of the agents' locations. Compared with the fixed convolutional recurrent network structures of previous models, the invention improves greatly in both structure and spatial data extraction, and learns from the data which pixels on the feature map to sample, reducing spatial feature redundancy.
2. The invention learns weights on the feature map with a fixed convolution kernel in a data-driven manner, then applies a sigmoid function to the learned weights to obtain the sampling amplitude of the spatio-temporal data (i.e., the sampling probability of each pixel), which better matches objective sampling behavior and improves model generalization.
3. The spatio-temporal prediction model of the invention treats the motion of dense multi-agents as a spatio-temporal pixel prediction problem. This prediction technique requires no agent trajectory points, enables multi-step prediction, improves model generalization, and reduces computational complexity.
4. The invention uses the reparameterization trick, so that the sampling operation itself does not take part in gradient descent; the sampled result does instead, allowing the model both to reduce its parameters and to remain trainable.
Description of the Drawings
The invention has four accompanying drawings, in which:
Figure 1 shows the dynamic link unit structure.
Figure 2 shows the encoding-decoding structure.
Figure 3 shows the prediction results.
Figure 4 is the flowchart of the invention.
Detailed Description of the Embodiments
The invention is further described below with reference to the drawings. The multi-agent trajectory prediction method with dynamic links from a third-person perspective (Figure 1) is introduced following the flow shown in Figure 4. First, consecutive temporally correlated trajectory frames are input to the visual component of the variational autoencoder in the encoding network for encoding, turning the high-dimensional input trajectory frames into low-dimensional latent variables. Specifically, the input time-series trajectory frames are mapped to high-dimensional vectors by a fully connected operation and then down-sampled by a convolution operation to obtain low-dimensional vector representations. The low-dimensional latent variables are fed to dynamic link unit 2 in the encoding network (Figure 2), which extracts the dynamic spatio-temporal data features of the trajectories. Specifically, position offsets are first learned from the fixed convolution kernel size, the pixel correspondence between the original feature map and the new feature map is then obtained by bilinear interpolation, and finally a fixed convolution kernel of the same size down-samples the new feature map, completing the dynamic link structure's sampling of the spatio-temporal data. The extracted spatio-temporal feature vectors then flow to dynamic link unit 1 for further spatio-temporal feature extraction, and so on, until training of the encoding network on the input consecutive trajectory frames is complete. The trained weights of the encoding network are copied into dynamic link unit 1 and dynamic link unit 2 of the decoding network, which then decodes and outputs the predicted consecutive trajectory frames (Figure 3), where the first column is the input historical motion trajectory sequence, the second column is the ground-truth trajectory sequence, and the third column is the predicted trajectory sequence.
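As a minimal illustration of the weight hand-off described above, assuming the encoder and decoder units are modules like the sketches in the Summary (the names are illustrative, not the patent's code):

```python
# Copy the trained encoding-network weights into the decoding network's
# dynamic link units 1 and 2 via a PyTorch state_dict hand-off
dec_unit1.load_state_dict(enc_unit1.state_dict())
dec_unit2.load_state_dict(enc_unit2.state_dict())
```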
Claims (1)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910807587.XA | 2019-08-29 | 2019-08-29 | Dense multi-agent track prediction method for dynamic link at third view angle |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN110503073A CN110503073A (en) | 2019-11-26 |
| CN110503073B true CN110503073B (en) | 2023-04-18 |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111814915B (en) * | 2020-08-26 | 2020-12-25 | 中国科学院自动化研究所 | Multi-agent space-time feature extraction method and system and behavior decision method and system |
| CN113111581B (en) * | 2021-04-09 | 2022-03-11 | 重庆邮电大学 | Combining spatiotemporal factors and graph neural network-based LSTM trajectory prediction method |
| CN114357232A (en) * | 2021-11-29 | 2022-04-15 | 武汉理工大学 | Processing method, system, device and storage medium for extracting ship track line features |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080152217A1 (en) * | 2006-05-16 | 2008-06-26 | Greer Douglas S | System and method for modeling the neocortex and uses therefor |
| CN108229338A (en) * | 2017-12-14 | 2018-06-29 | 华南理工大学 | A kind of video behavior recognition methods based on depth convolution feature |
| CN108334897A (en) * | 2018-01-22 | 2018-07-27 | 上海海事大学 | A kind of floating marine object trajectory predictions method based on adaptive GMM |
Non-Patent Citations (5)

| Title |
|---|
| 高玄 et al., "基于图像处理的人群行为识别方法综述" (A survey of crowd behavior recognition methods based on image processing), 《计算机与数字工程》, Vol. 44, No. 8, full text * |
| 张德正 et al., "基于深度卷积长短时神经网络的视频帧预测" (Video frame prediction based on deep convolutional long short-term memory neural networks), 《计算机应用》, full text * |
| 楼枫, "红外近距离单目标的检测算法分析" (Analysis of detection algorithms for close-range single infrared targets), 《中国优秀硕士学位论文全文数据库 信息科技辑》, full text * |
| Fengbin Zheng et al., "Target Recognition and Change Detection of SAR Image Based on Deep Learning", 《Proceedings of The 2019 World Congress on Computational Intelligence, Engineering and Information Technology (WCEIT 2019)》, 2019, full text * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN110503073A (en) | 2019-11-26 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |