
CN115917597A - Boosting 2D Representations to 3D Using Attention Models - Google Patents

Info

Publication number: CN115917597A
Application number: CN202080102235.5A
Authority: CN (China)
Prior art keywords: encoder, processing device, dataset, self, data set
Legal status: Granted; active
Other languages: Chinese (zh)
Other versions: CN115917597B
Inventors: 阿德里安·洛帕特, 萨拉·卡劳特
Current assignee: Huawei Technologies Co Ltd
Original assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN115917597A (application); publication of CN115917597B (grant)

Classifications

    • G06T 7/55 Depth or shape recovery from multiple images (under G06T 7/00 Image analysis; G06T Image data processing or generation)
    • G06T 2207/10016 Video; Image sequence (under G06T 2207/10 Image acquisition modality)
    • G06T 2207/20081 Training; Learning (under G06T 2207/20 Special algorithmic details)
    • G06T 2207/20084 Artificial neural networks [ANN] (under G06T 2207/20 Special algorithmic details)
    • G06T 2207/30196 Human being; Person (under G06T 2207/30 Subject of image; Context of image processing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A processing device (200) is described for forming a system for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets (101), each data set being indicative of a projection of the joints of an object onto a two-dimensional space, the processing device comprising one or more processors (201) configured to: receive (301) the data sets; process (302) the data sets using an encoder architecture (104) having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism (107) and a respective feed-forward network (109), each self-attention mechanism implementing a plurality of attention heads; and train (303) the encoder layers on the data sets to improve the accuracy with which the encoder architecture estimates the three-dimensional state (106) of the one or more objects represented by the data sets. This may allow 2D body keypoints to be lifted to 3D relative to a root (e.g. the pelvic joint) by extrapolating the input data to estimate depth at real-time runtimes using a small, efficient model.

Description

Boosting 2D Representations to 3D Using Attention Models

Technical Field

The present invention relates to estimating the three-dimensional position of an object from two-dimensional data sets at a processing device.

Background

In recent years, estimating the 3D positions of human body joints has been the subject of extensive research. Particular attention has been paid to methods that extrapolate two-dimensional data (in the form of x, y keypoints) to 3D in order to predict the root-relative coordinates of the joints of the human skeleton. The human skeleton is typically described by 17 keypoints, including the head, shoulders, elbows, wrists, pelvis, knees and ankles.

Initially, work was based mainly on pre-designed models with a large number of constraints to account for the strong dependencies between different human joints.

With the development of convolutional neural networks (CNNs), pose estimators have been developed that reconstruct 3D poses end-to-end, either directly from RGB images or from intermediate 2D predictions. These approaches quickly surpassed the accuracy of earlier hand-crafted estimators.

Current state-of-the-art methods generally fall into two main categories. Some methods predict 3D keypoints end-to-end directly from monocular images. This usually produces good results but requires very large models. Other methods perform lifting: a 2D predictor first estimates the human pose in an image, and the 2D keypoints (expressed relative to the image) are then lifted to 3D. This two-step approach typically uses temporal convolution layers to aggregate information across poses derived from video.

There is a need for a method of lifting 2D projections of an object's joints to 3D that overcomes the limitations of existing approaches by extrapolating the input data to estimate depth at real-time runtimes using a small, efficient model.

Summary

According to one aspect, there is provided a processing device for forming a system for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets, each data set indicating a projection of the joints of an object onto a two-dimensional space, the processing device comprising one or more processors configured to: receive the data sets; process the data sets using an encoder architecture having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism and a respective feed-forward network, each self-attention mechanism implementing a plurality of attention heads; and train the encoder layers on the data sets to improve the accuracy with which the encoder architecture estimates the three-dimensional state of the one or more objects represented by the data sets.

This may allow the processing device to form a system for lifting 2D data sets comprising, for example, 2D human body keypoints (x, y coordinates) to 3D (x, y, z coordinates) relative to a root (such as the pelvic joint), by extrapolating the input data to estimate depth at real-time runtimes using a small, efficient model.

The or each object may be a person and each data set may be indicative of a human pose. Each data set may comprise a plurality of keypoints of the human body, each with 2D coordinates. Typically, 17 keypoints describe the human skeleton, including the head, shoulders, elbows, wrists, pelvis, knees and ankles. Each data set indicative of a human pose may define 2D coordinates for each of the plurality of keypoints. This may allow the 3D positions of human joints to be estimated, which may be useful in video game or virtual reality applications.

The encoder architecture may implement a Transformer architecture. Using a Transformer encoder model yields good accuracy on the task of lifting 2D keypoints to 3D.

The operation of each self-attention mechanism may be defined by a set of parameters, and the device may be configured to share one or more of such parameters among a plurality of self-attention mechanisms. Weight sharing can make the model more efficient while maintaining (or in some cases improving) the accuracy with which the 3D state of an object is predicted.

The device may be configured to adapt one or more of said parameters in one self-attention mechanism in response to successive data sets and, having adapted a parameter, to propagate that parameter to one or more other self-attention mechanisms. The parameters of one or some attention layers can thus be shared with other attention layers. This can further improve the efficiency of the model.

The operation of each encoder layer may be defined by a set of parameters, the device may be configured to adapt one or more of those parameters in one encoder layer in response to successive data sets, and the device may be configured not to propagate such parameters to any other encoder layer. In some embodiments, the parameters of an encoder layer are therefore not shared with other encoder layers. This can further improve the model.

The one or more processors may be configured to perform a one-dimensional convolution on the data sets to form convolved data and to process the convolved data using the encoder architecture. This allows the dimensions of the data sets to be adjusted to match the input and output dimensions of the model.

The one or more processors may be configured to perform the one-dimensional convolution on a series of successive data sets. This allows the 3D state of an object to be estimated for a sequence of 2D inputs. For example, a 3D motion sequence of a human body can be estimated.

The series of data sets may be a series of an odd number of data sets. This allows data sets on either side of the central data set, for which the three-dimensional state is to be obtained, to be taken into account during training.

The device may be configured to estimate, from the series of data sets, the three-dimensional state of the middle data set in the series. The middle data set in the series corresponds to the centre of the original receptive field (the total number of poses used to predict one 3D pose). During training, half of the receptive field corresponds to past poses and the other half to future poses. The middle pose within the receptive field may be the pose currently being lifted from 2D to 3D.

The one or more processors may be configured to train the encoder architecture to lift the data sets to three dimensions. Training the encoder allows the model to predict the 3D state of an object more accurately.

Each data set may represent the positions of the joints of a human body relative to a predetermined joint or structure of the body. The structure may be the pelvis. The position of the pelvis can serve as the root. A separate model can then be used to determine the distance from the camera to the pelvis. This allows the position of each joint relative to the camera to be determined.

The processing device may be configured to: receive a plurality of images, each image representing an articulated object; and, for each image, detect the joint positions of the object in the image, thereby forming one of said data sets. A 2D pose estimator can be used to obtain accurate 2D poses of the human body in a plurality of images. This allows 2D poses to be predicted from multiple images; those 2D poses can then be used as input data sets to the device described above, which lifts the 2D data sets to 3D.

Once trained, the system can be used in an inference phase to estimate the three-dimensional state of one or more articulated objects represented by a plurality of such data sets.

According to another aspect, there is provided a system for estimating the three-dimensional state of one or more articulated objects, the system being formed by the processing device described above.

According to a further aspect, there is provided a method for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets, each data set indicating a projection of the joints of an object onto a two-dimensional space, the method comprising: receiving the data sets; processing the data sets using an encoder architecture having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism and a respective feed-forward network, each self-attention mechanism implementing a plurality of attention heads; and training the encoder layers on the data sets to improve the accuracy with which the encoder architecture estimates the three-dimensional state of the one or more objects represented by the data sets.

The method may further comprise estimating the three-dimensional state of one or more articulated objects represented by a plurality of such data sets.

Use of this method may allow 2D data sets comprising, for example, 2D human body keypoints (x, y coordinates) to be lifted to 3D (x, y, z) relative to a root (such as the pelvic joint) by extrapolating the input data to estimate depth at real-time runtimes using a small, efficient model.

Brief Description of the Drawings

The present invention will now be described by way of example with reference to the accompanying drawings.

In the drawings:

Figure 1 shows an overview of the model architecture, which takes as input a sequence of data sets comprising 2D keypoints of human poses and generates 3D pose estimates using self-attention over long-range information.

Figure 2 shows a processing device for forming a system for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets.

Figure 3 shows an exemplary flowchart detailing the method steps performed by the processing device.

Figure 4 shows qualitative results for several actions on human poses from the Human3.6M dataset: (a) original RGB images, with 2D keypoints predicted using CPN; (b) 3D reconstructions produced by the method described herein (n = 243, where n is the receptive field of poses in the input sequence); (c) ground-truth 3D keypoints.

Detailed Description

The method described herein is illustrated by processing data sets each of which is indicative of a human pose. It should be understood, however, that the method can also be applied to other data sets and objects for which data needs to be converted from 2D to 3D.

Typically, 17 keypoints describe the human skeleton, including the head, shoulders, elbows, wrists, pelvis, knees and ankles. In the examples described herein, each data set indicative of a human pose defines 2D coordinates for each of these keypoints. Preferably, each data set indicative of a human pose represents the positions of the joints of the body relative to a predetermined joint or structure of the body. In a preferred embodiment, that structure is the pelvis.
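As an illustrative sketch only (not part of the claimed method), root-relative coordinates can be obtained by subtracting the root joint from every keypoint. The pelvis index used below is an assumption; joint orderings differ between datasets.

```python
import numpy as np

PELVIS = 0  # assumed index of the pelvis joint; dataset conventions vary

def to_root_relative(joints: np.ndarray) -> np.ndarray:
    """joints: (17, 3) array of x, y, z coordinates.
    Returns each joint's coordinates relative to the pelvis (root)."""
    return joints - joints[PELVIS]  # broadcasting subtracts the root from all 17 joints
```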

In the examples described herein, the individual data sets input to the model (for example, individual human poses) may be derived from a larger motion-capture dataset comprising images depicting different human poses. Such larger datasets include Human3.6M (see C. Ionescu, D. Papava, V. Olaru and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments", IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 7 (2013), pp. 1325–1339) and HumanEva (see L. Sigal, A. O. Balan and M. J. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion", International Journal of Computer Vision (IJCV) (2010), 87(1–2):4). Human3.6M comprises 3.6 million frames of 11 different subjects, of which only 7 are annotated. The subjects performed up to 15 different types of action, recorded from four different viewpoints. HumanEva, by contrast, is a smaller motion-capture dataset, with only three subjects recorded from three viewpoints.

The method described herein may comprise a two-step pose estimation approach. The data sets indicative of human poses that are input to the model may first be obtained by using a 2D pose estimator to extract accurate 2D poses of the human body from images. This may be done in a top-down manner. The joints can then be lifted by predicting their depth relative to a root (for example, the pelvis).

Accordingly, the processing device may be configured to receive a plurality of images, each image representing an articulated object. For each image, the processing device may then detect the positions of the joints of the human body (or other object) in the image, thereby forming one of the data sets indicative of a single pose.

For example, 2D poses may be obtained by using ground-truth human bounding boxes followed by a 2D pose estimator. Some common 2D estimation models that can be used to obtain 2D pose sequences are: Stacked Hourglass (SH), as described by A. Newell, K. Yang and J. Deng in "Stacked hourglass networks for human pose estimation" (Proceedings of the IEEE European Conference on Computer Vision (ECCV) (2016), pp. 483–499); Mask-RCNN, as described by K. He, G. Gkioxari, P. Dollar and R. Girshick in "Mask-RCNN" (Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017), pp. 2961–2969); and the Cascaded Pyramid Network (CPN), as described by Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu and J. Sun in "Cascaded pyramid network for multi-person pose estimation" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 7103–7112).

Alternatively, 2D detectors that do not rely on ground truth may be used. For example, SH and CPN can serve as detectors for the Human3.6M motion-capture dataset and Mask-RCNN as a detector for the HumanEva motion-capture dataset. Ground-truth 2D poses can, however, also be used for training. In one particular example, following the approach of D. Pavllo, C. Feichtenhofer, D. Grangier and M. Auli in "3D human pose estimation in video with temporal convolutions and semi-supervised training" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 7753–7762), SH can be pre-trained on the MPII dataset (L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele, "DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 4929–4937) and fine-tuned on the Human3.6M motion-capture dataset. Both Mask-RCNN and CPN can be pre-trained on COCO (T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar and C. L. Zitnick, "Microsoft coco: Common objects in context", Proceedings of the IEEE European Conference on Computer Vision (ECCV) (2014), pp. 740–755) and then fine-tuned on the 2D poses of Human3.6M, since the keypoints are defined differently in each motion-capture dataset. More specifically, Mask-RCNN uses a ResNet-101 with an FPN. Since CPN requires bounding boxes, Mask-RCNN can first be used to detect the human body; CPN can then determine the keypoints from the image using a ResNet-50 with an input resolution of 384×384.
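The following hedged sketch illustrates the first (2D detection) step of such a pipeline using torchvision's Keypoint R-CNN as a stand-in detector. This is a substitute chosen for illustration only, not the SH, Mask-RCNN or CPN configurations cited above, and it assumes at least one person is detected.

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Stand-in 2D detector producing 17 COCO-style keypoints per detected person.
detector = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_2d_pose(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) float tensor in [0, 1].
    Returns (17, 2) x, y keypoints for the highest-scoring detection."""
    out = detector([image])[0]
    return out["keypoints"][0, :, :2]  # drop the per-keypoint visibility score
```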

In the example described herein, an open-source Transformer model is used to lift the keypoint data sets. In this example, a sequence of 2D keypoints is passed through several Transformer encoder blocks to generate a single 3D pose prediction (corresponding to the pose at the centre of the input sequence/receptive field). Further 3D estimates can be computed in a sliding-window manner.

Figure 1 shows an overview of the model architecture, which takes as input a sequence of data sets comprising 2D keypoints of human poses and generates 3D pose estimates using self-attention over long-range information.

As shown at 101, the device takes as input a sequence of data sets comprising 2D keypoints (this input sequence is collectively referred to as the receptive field). The method can be used with different numbers of data sets. The series of data sets is preferably a series of an odd number of data sets. For example, receptive fields of 27, 81 or 243 keypoint data sets may be used. Preferably, the middle data set of the sequence is selected for lifting, since it corresponds to the centre of the original receptive field.
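By way of illustration only, such centred windows might be assembled as sketched below; the edge padding is an assumption made so that every frame receives a full window, not a detail specified in the text.

```python
import numpy as np

def make_windows(poses_2d: np.ndarray, receptive_field: int = 27) -> np.ndarray:
    """poses_2d: (T, 17, 2) sequence of 2D poses; receptive_field must be odd.
    Returns (T, receptive_field, 17, 2) windows centred on each frame."""
    assert receptive_field % 2 == 1
    pad = receptive_field // 2
    padded = np.concatenate([np.repeat(poses_2d[:1], pad, axis=0),
                             poses_2d,
                             np.repeat(poses_2d[-1:], pad, axis=0)])
    return np.stack([padded[t:t + receptive_field] for t in range(len(poses_2d))])
```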

As described above, the data sets corresponding to the input sequence may come from a 2D predictor that estimates 2D poses from image frames, or directly from 2D ground truth. These input data sets therefore lie in image space.

As shown in Figure 1, in some embodiments certain modifications may be performed to match the input and output dimensions of the model. In this example, the sequence of input data sets is passed through a convolutional layer 102 to change its dimensions. Specifically, the Transformer's input is reprojected from the input dimensions [B, N, 34] to [B, N, 512], where B is the batch size, N is the receptive field (i.e. the number of human poses input to the model at each processing step), and 34 corresponds to 17 joints multiplied by 2 (the number of coordinates, i.e. x and y).
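A minimal sketch of this reprojection, assuming a kernel size of 1 (the text specifies only the shapes):

```python
import torch
from torch import nn

# 1D convolution mapping each pose's 34 values (17 joints x 2 coordinates)
# to the 512-dimensional model width.
input_proj = nn.Conv1d(in_channels=34, out_channels=512, kernel_size=1)

x = torch.randn(8, 27, 34)                          # [B, N, 34]
x = input_proj(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects [B, C, N]
print(x.shape)                                      # torch.Size([8, 27, 512])
```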

A temporal encoding is then added to embed the order of the input sequence of data sets (poses). As shown at 103, the temporal encoding is aggregated by adding a vector to the input embedding. This allows the model to exploit the order of the pose sequence. This may also be called a positional encoding, which injects information about the relative or absolute position of tokens in the sequence. Sine and cosine functions can be used to create these temporal embeddings, which are then summed with the reprojected input. In this embodiment, the injected temporal embedding has the same dimensions as the input.
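The standard sinusoidal construction can be sketched as follows; this is the common formulation consistent with the description above, not code from the patent:

```python
import math
import torch

def temporal_encoding(n: int, d_model: int = 512) -> torch.Tensor:
    """Sine/cosine temporal (positional) encoding of shape (n, d_model),
    broadcast-added to the (B, N, 512) input embeddings."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)              # (n, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    enc = torch.zeros(n, d_model)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc
```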

The temporally embedded data sets are then fed into the Transformer encoder model, which processes them. The Transformer is shown at 104 in Figure 1.

A self-attention model of the kind used in natural language processing (NLP) embeds the poses over time, rather than using a fully convolutional approach. The encoder architecture has a plurality of encoder layers, each comprising a respective self-attention mechanism 107 and a respective feed-forward network 109. Each self-attention mechanism implements a plurality of attention heads. The outputs from blocks 107 and 109 may be added and normalised, as shown at 108 and 110.

The basic Transformer encoder described by A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin in "Attention is all you need" (Advances in Neural Information Processing Systems (NeurIPS) (2017), pp. 5998–6008) can serve as a baseline. In this example, there are 512-dimensional hidden layers, 8 attention heads and 6 encoder blocks.
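With those hyperparameters, a baseline encoder stack can be sketched in PyTorch as follows; the feed-forward width is left at the library default of 2048, which is an assumption rather than a figure from the text:

```python
from torch import nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)  # 6 encoder blocks
# encoder(x) maps (B, N, 512) -> (B, N, 512), matching the dimensions above
```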

Since the decoder part of the Transformer is not used in this embodiment, and because of the residual connections within the self-attention blocks, the dimensions of the output are the same as those of the input; in this example, [B, N, 512].

A 1D convolutional layer 105 changes the dimensions so that the output of the model (shown at 106) is the x, y and z coordinates of all joints, relative to the pelvis, for the pose currently being lifted.

The 1D convolutional layer 105 is used to reproject the output embedding to the required dimensions, from [B, 1, 512] to [B, 1, 51], where 51 corresponds to 17 joints multiplied by 3 (for the x, y and z coordinates). A loss is then computed, for example the mean per-joint position error (MPJPE) against the 3D ground truth of the dataset. The error is then back-propagated and the next training iteration begins.
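A sketch of the output head and an MPJPE loss consistent with this description; the kernel size of 1 is again an assumption:

```python
import torch
from torch import nn

output_proj = nn.Conv1d(in_channels=512, out_channels=51, kernel_size=1)

def mpjpe(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean per-joint position error for pred, target of shape (B, 17, 3)."""
    return torch.linalg.norm(pred - target, dim=-1).mean()
```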

Preferably, the model's output token (the pose currently being lifted from 2D to 3D within the input pose sequence) corresponds to the middle data set of the receptive field (the middle pose). This is because, during training, half of the receptive field corresponds to past poses and the other half to future poses. The model therefore exploits temporal data from both past and future frames during training, enabling it to create temporally consistent predictions.

During inference, the model architecture remains the same as during training, but the receptive field uses only past pose frames rather than, as during training, frames on either side of the pose currently being lifted. The model can operate in a sliding-window manner so that, ultimately, 3D reconstructions can be obtained for all 2D representations in the input sequence.
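A hedged sketch of such sliding-window inference is shown below. It assumes a `model` that maps a (1, N, 34) window to a (1, 51) pose, and it left-pads early frames with the first pose, a choice the text does not specify.

```python
import torch

@torch.no_grad()
def lift_sequence(model, poses_2d: torch.Tensor, receptive_field: int = 243) -> torch.Tensor:
    """poses_2d: (T, 34) flattened 2D poses; returns (T, 51) root-relative 3D poses.
    Only past frames fill each window, matching the inference setting above."""
    outputs = []
    for t in range(poses_2d.shape[0]):
        window = poses_2d[max(0, t - receptive_field + 1):t + 1]
        if window.shape[0] < receptive_field:                 # left-pad early timesteps
            pad = window[:1].expand(receptive_field - window.shape[0], -1)
            window = torch.cat([pad, window])
        outputs.append(model(window.unsqueeze(0))[0])
    return torch.stack(outputs)
```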

One benefit of this architecture is that the receptive field and the number of attention heads can be modified without affecting the model size. The Transformer's hyperparameters can also be modified.

In some embodiments, weight sharing can be used to maintain or improve the final accuracy while significantly reducing the total number of parameters, thereby building a more efficient model.

Optionally, parameters may be shared within each encoder block. In particular, the attention layer parameters may be shared.

In one embodiment of shared attention layer parameters, the operation of each self-attention mechanism is defined by a set of parameters, and one or more of these parameters are shared among a plurality of self-attention mechanisms. One or more of said parameters in one self-attention mechanism may be adapted in response to successive data sets and, once a parameter has been adapted, it may be propagated to one or more other self-attention mechanisms.

Alternatively or additionally, the operation of each encoder layer may be defined by a set of parameters. The device may be configured to adapt one or more of these parameters in one encoder layer in response to successive data sets. In one embodiment, the device may be configured not to propagate such parameters to any other encoder layer.

In particular, sharing only the attention layer parameters (and not the encoder block parameters) can improve the final accuracy while significantly reducing the total number of parameters.
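One way such attention-only sharing could be realised in PyTorch is sketched below: the encoder layers are cloned so that their feed-forward weights stay independent, but every clone is pointed at a single shared attention module, so any update to it reaches all layers.

```python
import copy
from torch import nn

base = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
layers = nn.ModuleList(copy.deepcopy(base) for _ in range(6))

shared_attn = layers[0].self_attn          # one nn.MultiheadAttention instance
for layer in layers[1:]:
    layer.self_attn = shared_attn          # every block now reuses its weights
```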

In some embodiments, additional data augmentation may be applied to the data sets during training and testing. For example, each pose may be flipped horizontally.
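A sketch of horizontal-flip augmentation follows. The left/right index pairs below follow one common 17-joint convention and are an assumption, since each dataset defines its own joint ordering.

```python
import numpy as np

LEFT = [4, 5, 6, 11, 12, 13]    # assumed indices of left-side joints
RIGHT = [1, 2, 3, 14, 15, 16]   # assumed indices of right-side joints

def flip_pose(pose: np.ndarray) -> np.ndarray:
    """pose: (17, 2) or (17, 3). Mirrors the x axis and swaps left/right joints."""
    flipped = pose.copy()
    flipped[:, 0] *= -1
    flipped[LEFT + RIGHT] = flipped[RIGHT + LEFT]
    return flipped
```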

During training of the attention layers of the Transformer model, an optimizer such as the Adam optimizer may be used (as described by S. J. Reddi, S. Kale and S. Kumar in "On the convergence of Adam and beyond", Proceedings of the International Conference on Learning Representations (ICLR) (2018)). For example, a training run on the Human3.6M motion-capture dataset may last 80 epochs, and one on the HumanEva motion-capture dataset 1000 epochs.

The learning rate may be increased linearly for the first training steps (for example, for 1000 iterations with a learning rate factor of 12) and then decreased in proportion to the inverse square root of the number of steps. This is commonly referred to as NoamOpt.
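A sketch of this schedule as a PyTorch `LambdaLR` multiplier; the base learning rate and the linear layer are placeholders, not figures from the text:

```python
import torch
from torch import nn

model = nn.Linear(34, 51)  # placeholder for the lifting network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def noam_lr(step: int, warmup: int = 1000) -> float:
    step = max(step, 1)
    # linear warm-up, then inverse-square-root decay; peaks at 1.0 at `warmup`
    return min(step ** -0.5, step * warmup ** -1.5) * warmup ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# call scheduler.step() after each optimizer.step()
```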

The batch size may be chosen according to the receptive field value: for example, b = 5120, 3072 and 1536 for n = 27, 81 and 243 respectively.

In terms of hardware, the system may be trained and evaluated using 8 NVIDIA V100 GPUs, with parallel optimisation. Given the corresponding batch sizes, the training times for these receptive fields are typically about 8, 14 and 40 hours respectively.

Figure 2 is a schematic representation of a system 200 configured to perform the methods described herein. The system 200 may be implemented on a device such as a laptop, tablet, smartphone or television (TV).

The system 200 comprises a processor 201 configured to process the data sets in the manner described herein. For example, the processor 201 may be implemented as a computer program running on a programmable device such as a GPU or a central processing unit (CPU). The system 200 comprises a memory 202 arranged to communicate with the processor 201. The memory 202 may be a non-volatile memory. The processor 201 may also comprise a cache (not shown in Figure 2), which may be used to temporarily store data from the memory 202. The system may comprise more than one processor and more than one memory. The memory may store data executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine-readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein.

Figure 3 shows a flowchart summarising an example of a method for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets (for example, a plurality of human poses), each data set indicating a projection of the joints of an object (for example, a human body) onto a two-dimensional space.

At step 301, the method comprises receiving the data sets. At step 302, the method comprises processing the data sets using an encoder architecture having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism and a respective feed-forward network, each self-attention mechanism implementing a plurality of attention heads. At step 303, the method comprises training the encoder layers on the data sets to improve the accuracy with which the encoder architecture estimates the three-dimensional state of one or more objects represented by the data sets.

Described herein is the use of a self-attention Transformer model to estimate the depth of 2D keypoints. The encoder's self-attention architecture allows the model to produce temporally consistent poses by exploiting long-range temporal information across frames/poses.

In some implementations, the method described herein has been found to provide better results and allow smaller model sizes than previous approaches.

For input 2D predictions (Mask-RCNN and CPN), the method described herein was found to outperform previous lifting approaches and to perform comparably to methods that use keypoints and features extracted from the original RGB images. For ground-truth inputs, the model was found to outperform previous models, achieving results comparable to Skinned Multi-Person Linear (SMPL) model approaches that simultaneously predict body shape and pose, and to multi-view approaches. The number of parameters in the model is easy to adjust and can be smaller (for example, 9.5 million) than in current methods (which may have around 11–17 million parameters) while still achieving better performance. Compared with the prior art, the method can therefore achieve better results with a smaller model size.

Figure 4 shows qualitative results for several actions on human poses from the Human3.6M dataset. Column (a) shows the original RGB images, with 2D keypoints predicted using CPN. Column (b) shows 3D reconstructions produced using the method described herein (with a receptive field of n = 243). Column (c) shows the ground-truth 3D keypoints. It can be seen that the 3D reconstructions obtained closely match the ground-truth 3D keypoints.

The method described herein therefore allows 2D human body keypoints (x, y coordinates) to be lifted to 3D (x, y, z) relative to a root (such as the pelvic joint) by extrapolating the input data to estimate depth at real-time runtimes using a small, efficient model.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (16)

1. A processing device (200) for forming a system for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets (101), each data set indicating a projection of the joints of an object onto a two-dimensional space, the processing device comprising one or more processors (201) configured to:
receive (301) the data sets;
process (302) the data sets using an encoder architecture (104) having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism (107) and a respective feed-forward network (109), each self-attention mechanism implementing a plurality of attention heads; and
train (303) the encoder layers on the data sets to improve the accuracy with which the encoder architecture estimates the three-dimensional state (106) of the one or more objects represented by the data sets.

2. The processing device (200) according to claim 1, wherein each object is a person and each data set is indicative of a human pose.

3. The processing device (200) according to claim 1 or 2, wherein the encoder architecture (104) implements a Transformer architecture.

4. The processing device (200) according to any one of the preceding claims, wherein the operation of each self-attention mechanism (107) is defined by a set of parameters, and the device is configured to share one or more of said parameters among a plurality of self-attention mechanisms.

5. The processing device (200) according to claim 4, wherein the device is configured to adapt one or more of said parameters in one self-attention mechanism (107) in response to successive data sets and, having adapted a parameter, to propagate that parameter to one or more other self-attention mechanisms.

6. The processing device (200) according to claim 5, wherein the operation of each encoder layer is defined by a set of parameters, the device is configured to adapt one or more of said parameters in one encoder layer in response to successive data sets, and the device is configured not to propagate those parameters to any other encoder layer.

7. The processing device (200) according to any one of the preceding claims, wherein the one or more processors (201) are configured to: perform a one-dimensional convolution (102) on the data sets to form convolved data and process the convolved data using the encoder architecture (104).

8. The processing device (200) according to claim 7, wherein the one or more processors (201) are configured to perform the one-dimensional convolution (102) on a series of successive data sets (101).

9. The processing device (200) according to claim 8, wherein the series of data sets (101) is a series of an odd number of data sets.

10. The processing device (200) according to claim 9, wherein the device is configured to estimate, from the series of data sets (101), the three-dimensional state of the middle data set in the series.

11. The processing device (200) according to any one of the preceding claims, wherein the one or more processors (201) are configured to train the encoder architecture (104) to lift the data sets to three dimensions.

12. The processing device (200) according to claim 11, wherein each data set represents the positions of the joints of a human body relative to a predetermined joint or structure of the body.

13. The processing device (200) according to claim 12, wherein the structure is the pelvis.

14. The processing device (200) according to any one of the preceding claims, wherein the processing device is configured to:
receive a plurality of images, each image representing an articulated object; and
for each image, detect the joint positions of the object in the image, thereby forming one of the data sets (101).

15. A system for estimating the three-dimensional state of one or more articulated objects, the system being formed by a processing device (200) according to any one of the preceding claims.

16. A method (300) for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets (101), each data set indicating a projection of the joints of an object onto a two-dimensional space, the method comprising:
receiving (301) the data sets;
processing (302) the data sets using an encoder architecture (104) having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism (107) and a respective feed-forward network (109), each self-attention mechanism implementing a plurality of attention heads; and
training (303) the encoder layers on the data sets to improve the accuracy with which the encoder architecture estimates the three-dimensional state (106) of the one or more objects represented by the data sets.
CN202080102235.5A 2020-08-31 2020-08-31 Lifting 2D representations to 3D using attention models Active CN115917597B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/074227 WO2022042865A1 (en) 2020-08-31 2020-08-31 Lifting 2d representations to 3d using attention models

Publications (2)

Publication Number Publication Date
CN115917597A 2023-04-04
CN115917597B (en) 2025-08-05

Family

ID=72292544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080102235.5A Active CN115917597B (en) 2020-08-31 2020-08-31 Lifting 2D representations to 3D using attention models

Country Status (2)

Country Link
CN (1) CN115917597B (en)
WO (1) WO2022042865A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220108162A1 (en) * 2020-10-01 2022-04-07 Microsoft Technology Licensing, Llc Decimating hidden layers for training transformer models
CN115294265A (en) * 2022-06-27 2022-11-04 Peking University Shenzhen Graduate School Method and system for reconstructing 3D human body mesh using 2D human pose based on graph skeleton attention
CN115619950B (en) * 2022-10-13 2024-01-19 China University of Geosciences (Wuhan) Three-dimensional geological modeling method based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103947198A (en) * 2011-01-07 2014-07-23 Sony Computer Entertainment America Dynamic adjustment of predetermined three-dimensional video settings based on scene content
CN108492273A (en) * 2018-03-28 2018-09-04 Shenzhen Weiteshi Technology Co., Ltd. An image generating method based on a self-attention model
CN109359539A (en) * 2018-09-17 2019-02-19 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Attention evaluation method, apparatus, terminal device and computer-readable storage medium
WO2020033966A1 (en) * 2018-08-10 2020-02-13 Buffalo Automation Group Inc. Deep learning and intelligent sensing system integration
CN110797018A (en) * 2019-08-28 2020-02-14 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method, speech recognition device, speech recognition medium, and speech recognition apparatus
CN111476184A (en) * 2020-04-13 2020-07-31 Henan Polytechnic University A human keypoint detection method based on a dual attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU, RUIXU: "Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3 June 2020, page 3 *
WU, Zihan; ZHOU, Dake; YANG, Xin: "RGB-D image semantic segmentation network based on channel attention mechanism", Electronic Design Engineering, no. 13, 5 July 2020 *

Also Published As

Publication number Publication date
CN115917597B (en) 2025-08-05
WO2022042865A1 (en) 2022-03-03


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant