WO2025118238A1 - Motion tracking with multi-task neural network
- Publication number
- WO2025118238A1 (PCT/CN2023/137099)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- motion
- pose
- tensor
- key points
- frame
- Prior art date
- Legal status
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Definitions
- This disclosure relates generally to neural networks (also referred to as “deep neural networks” or “DNNs” ) , and more specifically, to motion tracking with multi-task DNNs.
- DNNs are widely used in the domains of image recognition, video understanding, image or video generation, machine translation, mathematical reasoning, and so on.
- FIG. 1 is a block diagram of a motion tracking system, in accordance with various embodiments.
- FIG. 2 is a block diagram of a multi-task network module, in accordance with various embodiments.
- FIG. 3 is a block diagram of an optimization module, in accordance with various embodiments.
- FIG. 4 illustrates an example motion tracking pipeline, in accordance with various embodiments.
- FIG. 5 illustrates an example multi-task DNN, in accordance with various embodiments.
- FIGS. 6A and 6B illustrate three-dimensional (3D) motion tracking with reconstructed 3D ground plane, in accordance with various embodiments.
- FIG. 7 illustrates an example motion analysis for physical fitness, in accordance with various embodiments.
- FIG. 8 illustrates an example avatar animation generated based on motion tracking, in accordance with various embodiments.
- FIG. 9 illustrates an example CNN, in accordance with various embodiments.
- FIG. 10 illustrates an example convolution, in accordance with various embodiments.
- FIG. 11 is a flowchart showing a method of motion tracking, in accordance with various embodiments.
- FIG. 12 is a block diagram of an example computing device, in accordance with various embodiments.
- a DNN typically includes a sequence of layers.
- a DNN layer may include one or more deep learning operations (also referred to as “neural network operations” ) , such as convolution, pooling, elementwise operation, linear operation, nonlinear operation, and so on.
- a DL operation in a DNN may be performed on one or more internal parameters of the DNNs (e.g., weights) , which are determined during the training phase, and one or more activations.
- An activation may be a data point (also referred to as “data elements” or “elements” ) .
- Activations or weights of a DNN layer may be elements of a tensor of the DNN layer.
- a tensor is a data structure having multiple elements across one or more dimensions.
- Example tensors include a vector, which is a one-dimensional tensor, and a matrix, which is a two-dimensional tensor. There can also be three-dimensional tensors and even higher dimensional tensors.
- a DNN layer may have an input tensor (also referred to as “input feature map (IFM) ” ) including one or more input activations (also referred to as “input elements” ) and a weight tensor including one or more weights.
- a weight is an element in the weight tensor.
- a weight tensor of a convolution may be a kernel, a filter, or a group of filters.
- the output data of the DNN layer may be an output tensor (also referred to as “output feature map (OFM) ” ) that includes one or more output activations (also referred to as “output elements” ) .
- a currently available approach is based on a bundle-adjustment-based algorithm using temporal context for 3D human pose estimation from monocular videos. This approach uses a two-dimensional (2D) pose detector to predict 2D key points for a single person for each video frame, then uses a 3D pose network to produce the initial SMPL (skinned multi-person linear model) parameters, and finally jointly optimizes the SMPL parameters and camera parameters over an entire video sequence by considering the reprojection error, temporal consistency constraint and 3D pose prior.
- this approach usually uses two AI models to respectively detect 2D key points and predict 3D parameters.
- the utilization of multiple AI models can cause high cost and low efficiency. For instance, more time and computational resources are typically required to train and run multiple AI models. Also, the utilization of separate models can result in poor prediction accuracy.
- this approach usually uses the temporal consistency between adjacent frames to achieve visually smooth results but can have strong depth inconsistency across the action sequence. In an example where a person crouches down and up while his or her feet remain motionless, this approach would produce motion tracking results with undesirable instability and depth shifting.
- Embodiments of the present disclosure may improve on at least some of the challenges and issues described above by using a multi-task DNN for tracking 3D motions of objects.
- the objects may include a person, animal, machine, vehicle, robot, or other types of movable objects.
- the multi-task DNN may process a monocular video capturing a moving object and produce both 2D and 3D motion tracking results.
- the multi-task DNN may include a branch, which can extract 2D features for 2D key points detection, and another branch, which can extract 3D features for 3D pose detection.
- the multi-task DNN may include yet another branch that can segment a part of the object (e.g., a foot of a person) from the moving object.
- the outputs of the multi-task DNN may be processed together, e.g., by an optimization module that can exploit the spatial-temporal consistency across the entire video sequence.
- the joint optimization module can eliminate the front-back depth shifting and provide stable 3D motion tracking results.
- frames of a monocular video capturing a movement of the object may be cropped and then input into the multi-task DNN.
- the cropping of the frames may be based on a detection of the object in the first frame of the video.
- the detection of the object may not be needed in the other frames of the video.
- the multi-task DNN may process the frames separately.
- a backbone of the multi-task DNN may extract features of the object from a cropped frame and output a feature map.
- the backbone may be a network that includes one or more convolutional layers.
- the feature map computed by the backbone may be processed in a first branch of the multi-task DNN to compute a heat map that represents 2D key points on the object.
- the feature map and the heat map may be processed in a second branch of the neural network to compute a pose tensor that includes 3D pose parameters of the object.
- the pose tensor may be a one-dimensional tensor, and the pose tensor is also referred to as a pose vector.
- the pose tensor may be a two-dimensional tensor, and the pose tensor is also referred to as a pose matrix.
- the feature map may also be processed in a third branch of the neural network to compute one or more masks that represent a segmented part of the object.
- the heat maps, pose tensors, and masks of the cropped frames may be processed together to determine the 3D motion of the object. For instance, one or more filters may be applied to smooth the 2D key points extracted from the heat maps, pose tensors, and masks. Also, the motion of a root point on the object in a 3D camera space may be determined based on the heat maps and the pose tensors. For instance, a position parameter representing the motion of the root point may be generated. The masks may be used to refine the position parameter. Also, the position parameter and pose tensor may be optimized using an objective function. After the optimization or refinement, the position parameter and pose tensor may be used to determine the 3D motion of the object.
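- The per-frame inference followed by joint post-processing described above can be sketched as follows. This is a minimal illustration only; the callables (detect_bbox, crop, multi_task_dnn, optimize) are hypothetical stand-ins for the modules described in this disclosure, not the actual implementation.

```python
from typing import Callable, Dict, List, Sequence

import numpy as np


def track_3d_motion(
    frames: Sequence[np.ndarray],
    detect_bbox: Callable[[np.ndarray], np.ndarray],          # runs on the first frame only
    crop: Callable[[np.ndarray, np.ndarray], np.ndarray],     # crops a frame with a bounding box
    multi_task_dnn: Callable[[np.ndarray], Dict[str, np.ndarray]],
    optimize: Callable[[List[Dict[str, np.ndarray]]], Dict[str, np.ndarray]],
) -> Dict[str, np.ndarray]:
    """Run the multi-task DNN per frame, then post-process all outputs jointly."""
    bbox = detect_bbox(frames[0])              # object detection on the first frame only
    per_frame_outputs = []
    for frame in frames:
        cropped = crop(frame, bbox)            # crop based on the first-frame detection
        outputs = multi_task_dnn(cropped)      # e.g., {"heat_map", "pose_tensor", "masks"}
        per_frame_outputs.append(outputs)
        # the predicted 2D key points could be used to update the crop for the next frame
    # joint processing over the whole sequence (smoothing, root motion, consistency)
    return optimize(per_frame_outputs)
```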
- the present disclosure provides a 3D motion tracking approach that can improve the motion tracking efficiency by reducing computational costs during training and inference and can also increase 3D motion tracking accuracy by fusing the 2D and 3D predictions of the multi-task DNN.
- the 3D motion tracking approach may require little or even no camera calibration or human priors.
- 3D skeleton motion in 3D space can be automatically computed.
- the 3D motion tracking approach can be practical and robust for many application scenarios, such as animation generation, sports analysis, fitness assistance, film and game production, and so on.
- the phrase “A or B” or the phrase “A and/or B” means (A) , (B) , or (A and B) .
- the phrase “A, B, or C” or the phrase “A, B, and/or C” means (A) , (B) , (C) , (A and B) , (A and C) , (B and C) , or (A, B, and C) .
- the terms “comprise, ” “comprising, ” “include, ” “including, ” “have, ” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion.
- a method, process, device, or DNN accelerator that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such method, process, device, or DNN accelerators.
- the term “or” refers to an inclusive “or” and not to an exclusive “or. ”
- FIG. 1 is a block diagram of a motion tracking system 100, in accordance with various embodiments.
- the motion tracking system 100 estimates and tracks motions of objects using videos that capture the objects.
- the motion tracking system 100 includes an interface module 110, an object detection module 120, a multi-task network module 130, an optimization module 140, an application module 150, and a datastore 160.
- different or additional components may be included in the motion tracking system 100.
- functionality attributed to a component of the motion tracking system 100 may be accomplished by a different component included in the motion tracking system 100 or by a different module.
- the interface module 110 facilitates communications of the motion tracking system 100 with other systems, devices, or modules.
- the interface module 110 may receive videos from other systems or cameras.
- the videos may include monocular videos.
- a video may include a sequence of frames.
- a frame may be an image. Different frames may be associated with different time stamps. Some or all frames in a video may capture one or more objects.
- the interface module 110 may receive one or more training datasets for the multi-task network module 130 to train a multi-task DNN.
- the interface module 110 may transmit data generated by the motion tracking system 100 to other systems, devices, or modules.
- the interface module 110 may transmit estimated 3D motions of objects, animations of objects, motion analysis results, or other types of motion tracking information to other systems, devices, or modules.
- the object detection module 120 detects objects in videos.
- the object detection module 120 may detect an object in a video based on a frame in the video.
- the frame may be the first frame in the video or the first frame in a portion of the video.
- the object detection module 120 may determine or receive a class of the target object. Examples of the class of the target object may include person, car, truck, robot, dog, cat, and so on.
- the object detection module 120 may also classify one or more objects in the frame and determine whether any of the objects fall into the class of the target object.
- the object detection module 120 may use a trained model to classify objects. For instance, the object detection module 120 may input the frame into the trained model, and the trained model may output the classes of the objects in the frame.
- the trained model may be a DNN, such as a CNN.
- the object detection module 120 may generate a bounding box based on the detection of the object.
- the bounding box may be a 2D bounding box that surrounds an image of the object in the frame.
- the object detection module 120 may use the bounding box to extract the image of the object from the frame.
- the image of the object may be at least part of the frame.
- the object detection module 120 may detect the object in every frame of the video and extract images of the objects from the frames based on the object detection.
- the object detection module 120 may use 2D key points of the object that are predicted using one frame (e.g., the first frame) in the video to extract images of the object from other frames in the video.
- the 2D key points of the object may be predicted by a multi-task DNN using the first frame.
- the object detection module 120 may generate bounding boxes for the other frames based on the 2D key points and use the bounding boxes to extract the images of the object from other frames.
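- As an illustration of how 2D key points predicted for one frame could be turned into a bounding box for cropping later frames, the sketch below computes an axis-aligned box with a margin and clamps it to the frame size. The margin value and array layout are assumptions chosen for the example.

```python
import numpy as np


def bbox_from_keypoints(keypoints: np.ndarray, frame_h: int, frame_w: int,
                        margin: float = 0.2) -> tuple[int, int, int, int]:
    """Axis-aligned bounding box (x0, y0, x1, y1) around 2D key points.

    keypoints: array of shape (J, 2) holding (x, y) pixel coordinates.
    margin: fraction of the box size added on every side.
    """
    x0, y0 = keypoints.min(axis=0)
    x1, y1 = keypoints.max(axis=0)
    pad_x = (x1 - x0) * margin
    pad_y = (y1 - y0) * margin
    x0 = int(max(0, x0 - pad_x))
    y0 = int(max(0, y0 - pad_y))
    x1 = int(min(frame_w - 1, x1 + pad_x))
    y1 = int(min(frame_h - 1, y1 + pad_y))
    return x0, y0, x1, y1


def crop_frame(frame: np.ndarray, bbox: tuple[int, int, int, int]) -> np.ndarray:
    """Extract the image of the object from a frame using the bounding box."""
    x0, y0, x1, y1 = bbox
    return frame[y0:y1 + 1, x0:x1 + 1]
```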
- the multi-task network module 130 generates multi-task DNNs.
- the multi-task network module 130 may define the architecture of a multi-task DNN used for 3D motion tracking.
- the multi-task network module 130 may define a backbone network in the multi-task DNN. For instance, the multi-task network module 130 may determine layers in the backbone network, data flow between the layers, and so on.
- the multi-task network module 130 may also define branches in the multi-task DNN based on tasks to be performed by the multi-task DNN.
- the multi-task network module 130 may determine a branch for a particular task and determine different branches for different tasks. For each branch, the multi-task network module 130 may determine one or more layers in the branch and data flow between the one or more layers, and so on.
- An example of multi-task DNNs generated by multi-task network module 130 is the multi-task DNN 500 in FIG. 5.
- the multi-task network module 130 may train the multi-task DNN. Values of internal parameters of the multi-task DNN may be determined through training. The multi-task network module 130 may also validate the accuracy of the multi-task DNN. After the multi-task DNN is trained or validated, the multi-task network module 130 may deploy the multi-task DNN to extract 2D and 3D features from images for predictions of 2D key points, 3D pose parameters, and ground part segmentation. Images input to the multi-task DNN may be images of objects. The images of objects may be images extracted by the object detection module 120 from video frames. In some embodiments, the multi-task network module 130 may input images into the multi-task DNN one by one. For each cropped frame, the multi-task DNN may generate multiple outputs. Certain aspects of the multi-task network module 130 are described below in conjunction with FIG. 2.
- the optimization module 140 optimizes 2D key points and 3D pose parameters computed by the multi-task DNN.
- the optimization module 140 may remove at least some noises in the 2D key points and 3D pose parameters.
- the optimization module 140 may apply one or more filters on the heat map to remove noises in the heat map. For instance, the optimization module 140 may extract 2D key points from the heat map and apply one or more filters on the 2D key points.
- the optimization module 140 may also apply one or more filters on the pose tensor to remove noises in the pose tensor.
- the optimization module 140 may also estimate the motion of a root point of the object. The root point may be one of the key points of the object.
- the optimization module 140 may generate a position parameter that represents the estimated motion of the root point in a 3D camera space.
- the optimization module 140 may optimize the pose tensor and the position parameter using an objective function.
- the optimization module may further refine the estimated motion of the root point using the one or more masks representing the ground part.
- the optimization module 140 may generate the 3D motion tracking result using the estimated root point motion and the optimized 3D pose parameters.
- the optimization module 140 may exploit global spatial-temporal consistency constraints for coherent 3D motion tracking over an entire video sequence.
- the optimization module 140 may eliminate the front-back shifting in depth.
- depth uncertainty may be present in monocular reconstruction and may be counteracted by penalizing large variations in depth between not only the neighboring frames but also all the possible frame pairs.
- the optimization module 140 may use optional part-ground contacting constraints for the optimization to further improve the position-tracking stability. Certain aspects of the optimization module 140 are described below in conjunction with FIG. 3.
- the application module 150 generates files associated with motions for various applications.
- the motions may be the 3D motions determined by the optimization module 140.
- the files may include images, videos, audios, text, symbols, other types of information, or some combination thereof.
- the application module 150 may receive a request for one or more motion files, e.g., from a client device associated with a user, a third-party system, a device, and so on.
- the request may include information indicating the application of the one or more files.
- the application module 150 may generate the one or more motion files based on information in the request.
- the request may be a request for an animation showing motion of an object.
- the application module 150 may generate one or more animation files based on a 3D motion determined by the optimization module 140.
- the request may be a request for descriptions of movements of a person in a video.
- the application module 150 may generate a text file that describes the person’s movements.
- the application module 150 may generate other types of files.
- the application module 150 may generate files that can be input into and processed by graphics software to make animation or other types of files.
- the datastore 160 stores data associated with the motion tracking system 100, such as data received, generated, or used by components of the motion tracking system 100.
- the datastore 160 may store parameters (e.g., internal parameters, hyperparameters, etc. ) of the multi-task DNN.
- the datastore 160 may also store training sets and validation sets used to train and validate the multi-task DNN.
- the datastore 160 may further store videos received by the interface module 110, outputs of the multi-task DNN, 3D motion tracking results output from the optimization module 140, motion files generated by the application module 150, and so on.
- the motion tracking system 100 may include or be associated with more than one datastore.
- the datastore 160 may be implemented as a random-access memory (RAM) , such as a static RAM (SRAM) , disk storage, nearline storage, online storage, offline storage, and so on.
- FIG. 2 is a block diagram of a multi-task network module 200, in accordance with various embodiments.
- the multi-task network module 200 uses a multi-task DNN to process video frames and predict 2D key points and 3D pose parameters.
- the multi-task network module 200 may be an example of the multi-task network module 130 in FIG. 1.
- the multi-task network module 200 includes a training module 210, a validating module 220, a multi-task neural network 230, and a deployment module 240.
- different or additional components may be included in the multi-task network module 200.
- functionality attributed to a component of the multi-task network module 200 may be accomplished by a different component included in the multi-task network module 200 or by a different module.
- the training module 210 trains the multi-task neural network 230 by using one or more training datasets.
- the training module 210 forms the one or more training datasets.
- the training dataset includes training samples and ground-truth labels of the training samples.
- a training sample may be an image, e.g., a cropped frame of a video.
- a ground-truth label may include verified or known 2D key points, 3D pose parameters, or masks of the corresponding training sample.
- a part of the training dataset may be used to initially train the multi-task neural network 230, and the rest of the training dataset may be held back as a validation subset used by the validating module 220 to validate performance of the multi-task neural network 230 after being trained.
- the portion of the training dataset not including the validation subset may be used to train the multi-task neural network 230.
- the training module 210 also determines hyperparameters for training the multi-task neural network 230.
- Hyperparameters may be variables specifying the training process. Hyperparameters may be different from parameters inside the multi-task neural network 230 (e.g., weights) .
- hyperparameters include variables determining the architecture of the multi-task neural network 230, such as number of layers in backbone, types of layers in backbone, number of layers in branches, types of layers in branches, connections between backbone and branches, connections between branches, and so on.
- Hyperparameters also include variables which determine how the multi-task neural network 230 is trained, such as batch size, number of epochs, etc.
- a batch size defines the number of training samples to work through before updating the parameters of the multi-task neural network 230.
- the batch size is the same as or smaller than the number of samples in the training dataset.
- the training dataset can be divided into one or more batches.
- the number of epochs defines how many times the entire training dataset is passed forward and backwards through the entire network.
- the number of epochs defines the number of times that the deep learning algorithm works through the entire training dataset.
- One epoch means that each training sample in the training dataset has had an opportunity to update the parameters inside the DNN.
- An epoch may include one or more batches.
- the number of epochs may be 1, 5, 10, 50, 100, 500, 1000, or even larger.
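- For a concrete sense of these quantities, the number of parameter updates per epoch equals the number of batches, i.e., the dataset size divided by the batch size (rounded up); the figures below are purely illustrative.

```python
import math

num_samples = 10_000   # illustrative training-set size
batch_size = 100
num_epochs = 50

batches_per_epoch = math.ceil(num_samples / batch_size)   # 100 batches per epoch
total_updates = batches_per_epoch * num_epochs             # 5,000 parameter updates in total
```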
- the training module 210 may define the architecture of the DNN, e.g., based on some of the hyperparameters.
- the architecture of the multi-task neural network 230 may include a backbone and a plurality of branches associated with the backbone.
- the backbone may be a network that includes an input layer, an output layer, and a plurality of hidden layers.
- the input layer may include tensors (e.g., a multidimensional array) specifying attributes of the input image, such as the height of the input image, the width of the input image, and the depth of the input image (e.g., the number of bits specifying the color of a pixel in the input image) .
- the output layer includes labels of objects in the input layer.
- the hidden layers are layers between the input layer and output layer.
- the hidden layers include one or more convolutional layers and one or more other types of layers, such as pooling layers, fully-connected layers, normalization layers, SoftMax or logistic layers, and so on.
- the output layer may include an OFM representing features extracted by the backbone.
- the training module 210 may also add an activation function to a hidden layer or the output layer.
- An activation function of a layer transforms the weighted sum of the input of the layer to an output of the layer.
- the activation function may be, for example, a rectified linear unit (ReLU) activation function, a tangent activation function, or other types of activation functions.
- the training module 210 may define one or more attributes of tensors computed in the backbone, such as spatial size, datatype, and so on.
- the training module 210 may input a training dataset into the multi-task neural network 230.
- the training module 210 may modify the parameters inside the multi-task neural network 230 ( “internal parameters of the multi-task neural network 230” ) to minimize the error between labels of the training samples that are generated by the multi-task neural network 230 and the ground-truth labels of the training samples.
- the training module 210 uses a cost function to minimize the error.
- the training module 210 may train the multi-task neural network 230 for a predetermined number of epochs.
- the number of epochs is a hyperparameter that defines the number of times that the deep learning algorithm will work through the entire training dataset.
- One epoch means that each sample in the training dataset has had an opportunity to update internal parameters of the multi-task neural network 230.
- the training module 210 may stop updating the parameters in the multi-task neural network 230.
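- A minimal training-loop sketch in PyTorch is shown below, assuming a model that returns a heat map, a pose tensor, and a mask for each input image. The loss functions and their equal weighting are illustrative assumptions, not the losses prescribed by this disclosure.

```python
import torch
import torch.nn.functional as F


def train(model: torch.nn.Module, loader: torch.utils.data.DataLoader,
          num_epochs: int, lr: float = 1e-4) -> None:
    """Minimize the error between predictions and ground-truth labels over several epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(num_epochs):
        for images, gt_heat_maps, gt_poses, gt_masks in loader:
            pred = model(images)  # dict with "heat_map", "pose", "mask" (assumed interface)
            loss = (
                F.mse_loss(pred["heat_map"], gt_heat_maps)                    # 2D key points
                + F.mse_loss(pred["pose"], gt_poses)                          # 3D pose parameters
                + F.binary_cross_entropy_with_logits(pred["mask"], gt_masks)  # ground part mask
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```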
- the validating module 220 verifies accuracy of the multi-task neural network 230 after it is trained by the training module 210.
- the validating module 220 inputs samples in a validation dataset into the multi-task neural network 230 and uses the outputs of the multi-task neural network 230 to determine the model accuracy.
- a validation dataset may be formed of some or all the samples in the training dataset. Additionally or alternatively, the validation dataset includes additional samples, other than those in the training sets.
- the validating module 220 may determine an accuracy score measuring the precision, recall, or a combination of precision and recall of the DNN.
- the validating module 220 may compare the accuracy score with a threshold score. In an example where the validating module 220 determines that the accuracy score of the multi-task neural network 230 is less than the threshold score, the validating module 220 instructs the training module 210 to re-train the multi-task neural network 230. In one embodiment, the training module 210 may iteratively re-train the multi-task neural network 230 until the occurrence of a stopping condition, such as an accuracy measurement indicating that the multi-task neural network 230 is sufficiently accurate, or a number of training rounds having taken place.
- the multi-task neural network 230 is a DNN that can perform multiple tasks.
- the multi-task neural network 230 is trained to receive an input, process the input using various layers in the multi-task neural network 230, and generate multiple outputs, each of which may constitute the result of performing a predetermined task.
- the multi-task neural network 230 may include a backbone and multiple branches coupled to the backbone.
- the backbone of the multi-task DNN may include one or more convolutional layers that can extract features from an input image and output a feature map.
- the feature map may represent the object.
- the feature map from the backbone may be provided to some or all the branches. Each branch may be used to perform a particular task.
- the multi-task neural network 230 can perform multiple tasks at the same time using the same input. Examples of the tasks may include 2D key points prediction, 3D pose prediction, ground part segmentation, other tasks for 3D motion tracking, or some combination thereof.
- An example of the multi-task neural network 230 is the multi-task DNN 500 in FIG. 5.
- the deployment module 240 deploys the multi-task neural network 230 to perform 3D motion tracking tasks.
- the deployment module 240 may input images into the multi-task neural network 230.
- An image may be a cropped frame from a video.
- the deployment module 240 may input the images into the multi-task neural network 230 one by one. For each input image, the deployment module 240 may obtain multiple outputs of the multi-task neural network 230. In some embodiments, the deployment module 240 may transmit the outputs of the multi-task neural network 230 to optimization module 140 for estimating 3D motions.
- FIG. 3 is a block diagram of an optimization module 300, in accordance with various embodiments.
- the optimization module 300 processes heat maps, pose tensors, and masks generated from a video capturing an object to determine a 3D motion of the object.
- the optimization module 300 may be an example of the optimization module 140 in FIG. 1.
- the optimization module 300 includes a filtering module 310, a root motion module 320, a global consistency module 330, and a refinement module 340.
- in alternative configurations, different or additional components may be included in the optimization module 300.
- functionality attributed to a component of the optimization module 300 may be accomplished by a different component included in the optimization module 300 or by a different module.
- the filtering module 310 applies filters on 2D key points extracted from heat maps, 3D pose parameters in pose tensors, and masks generated from video frames.
- the heat maps, pose tensors, and masks may be generated by a multi-task DNN.
- the filtering module 310 may generate or receive a filter and apply the filter on the 2D key points from the heat maps, pose tensors, or masks to reduce or even eliminate noise in the heat maps, pose tensors, or masks.
- the filtering module 310 may generate smoothed 2D key points, 3D pose parameters, or masks that have less noise than the 2D key points, 3D pose parameters, or masks predicted by the multi-task DNN. Examples of the filters include box filter, gaussian filter, and so on.
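- As one way to realize such smoothing, the sketch below applies a Gaussian filter along the time axis of the per-frame 2D key points; the sigma value and the array layout are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


def smooth_keypoints(keypoints: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Temporally smooth 2D key points.

    keypoints: array of shape (T, J, 2) -- T frames, J key points, (x, y) each.
    Filtering is applied independently per key point and coordinate, along the time axis.
    """
    return gaussian_filter1d(keypoints.astype(np.float64), sigma=sigma, axis=0)
```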
- the filtering module 310 may generate or receive a 6DOF (Quaternion) block.
- the 6DOF block may implement quaternion representation of six-degrees-of-freedom equations of motion with respect to body axes.
- the filtering module 310 may apply the 6DoF continuous representation for 3D rotations to temporally filter 3D pose parameters in a pose tensor to smooth the pose tensor.
- the root motion module 320 determines the motion of a root point on the object in a 3D camera space. In some embodiments, for each frame I_t, the root motion module 320 computes the root point's motion, represented by a position parameter d_t in the 3D camera space, by minimizing an objective function. d_t may also be referred to as the initial position parameter of the root point.
- the motion of the root joint may include a translation. An example of the translation may be a motion of the object in a straight line.
- the root point may be the key point that has a higher hierarchy than the other key points on the object. In an example where the object is a person, the root point may be a key point on the body of the person, e.g., a joint, navel, head, and so on.
- the root motion module 320 may define the root point. For instance, the root motion module 320 may select one of the key points as the root point.
- the root point may be a key point that is connected to some or all the other key points.
- the objective function may be defined as E(d_t) = ‖π(P_t) − K_t‖², where P_t = FK(θ_t, P_T) + d_t represents the 3D joint positions in the camera space, K_t represents the 2D key points from the heat map, and θ_t represents the initial pose tensor, e.g., the pose tensor generated by the multi-task DNN. FK(·, ·) represents a forward kinematics operation to produce root-relative 3D joint positions from joint rotations, and P_T may represent a predefined 3D structural template (e.g., a 3D skeleton template).
- π is the pinhole projection function from the 3D camera space to a 2D image plane using the camera's intrinsic parameters (f_x, f_y, c_x, c_y), where f_x and f_y are the focal lengths in the x and y directions, and c_x and c_y are the x and y coordinates of the optical center in the image plane.
- the video has a resolution represented by (w, h).
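- A sketch of this per-frame initialization, assuming the root-relative 3D joints from forward kinematics are already available, could use a generic least-squares solver to find the root translation d_t that minimizes the reprojection error against the 2D key points. The helper names and the starting point are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares


def project(points_3d: np.ndarray, fx: float, fy: float,
            cx: float, cy: float) -> np.ndarray:
    """Pinhole projection of (N, 3) camera-space points to (N, 2) image points."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)


def init_root_translation(rel_joints: np.ndarray, keypoints_2d: np.ndarray,
                          intrinsics: tuple[float, float, float, float],
                          d0=None) -> np.ndarray:
    """Find d_t minimizing || project(rel_joints + d_t) - keypoints_2d ||^2."""
    fx, fy, cx, cy = intrinsics
    if d0 is None:
        d0 = np.array([0.0, 0.0, 3.0])  # start the search in front of the camera

    def residuals(d: np.ndarray) -> np.ndarray:
        return (project(rel_joints + d, fx, fy, cx, cy) - keypoints_2d).ravel()

    return least_squares(residuals, d0).x
```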
- applying per-frame 3D pose and root translation on a video may not exploit and ensure spatial-temporal consistency of motion, which may lead to temporal jitter and unacceptable artifact for some applications. Further optimization or refinement may be conducted to address the temporal jitter and unacceptable artifact.
- the objective function for the global optimization is defined as:
- E(θ, d) = E_proj(θ, d) + E_spatial-tempo(d) + E_temporal(θ, d) + E_prior(θ) + E_reg(θ, d)
- E_proj(θ, d) = λ_1 Σ_t ‖π(P_t) − K_t‖²
- E_spatial-tempo(d) = λ_2 Σ_{(t1, t2) ∈ S} ‖(P_t1)_foot − (P_t2)_foot‖²
- E_temporal(θ, d) = λ_3 Σ_t ( ‖P_t − P_{t−1}‖² + ‖P_t − P_{t+1}‖² )
- E_spatial-tempo is used for spatial-temporal consistency of the 3D motion tracking result. With E_spatial-tempo, the overall stability and robustness can be improved.
- the 3D foot joint position (P_t1)_foot for frame t1 should be sufficiently close to the 3D foot joint position (P_t2)_foot for frame t2.
- E_temporal encourages smooth motion for adjacent frames in the temporal domain.
- E_prior is a prior term to favor realistic and natural poses over implausible ones.
- E_reg measures the similarity between the optimized parameters and the initial ones.
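- The sketch below illustrates how some of these energy terms could be evaluated for a candidate solution, given per-frame camera-space joints P_t and 2D key points K_t. The weights, the foot-pair set, and the use of a single foot index are placeholders, and the prior and regularization terms are omitted for brevity.

```python
import numpy as np


def e_proj(P: np.ndarray, K: np.ndarray, project, lam1: float = 1.0) -> float:
    """Reprojection term: P has shape (T, J, 3), K has shape (T, J, 2)."""
    return lam1 * sum(np.sum((project(P[t]) - K[t]) ** 2) for t in range(len(P)))


def e_spatial_tempo(P: np.ndarray, foot_idx: int, pairs, lam2: float = 1.0) -> float:
    """Keep the 3D foot joint consistent across (possibly non-neighboring) frame pairs."""
    return lam2 * sum(np.sum((P[t1, foot_idx] - P[t2, foot_idx]) ** 2) for t1, t2 in pairs)


def e_temporal(P: np.ndarray, lam3: float = 1.0) -> float:
    """Penalize large joint motion between adjacent frames (each pair counted once here)."""
    return lam3 * float(np.sum((P[1:] - P[:-1]) ** 2))
```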
- the refinement module 340 refines the root motion determined by the root motion module 320.
- the root motion refinement may be optional.
- the refinement module 340 may estimate a 3D ground plane that represents a ground on which the object carries out the motion.
- when the object moves on the ground there may be at least one part of the object touching the ground. For instance, when the object is a person, at least one foot of the object may touch the ground at some or all moments during the object’s motion.
- the refinement module 340 may fit the 3D ground plane equation to a set of 3D body joints {Q_i}, which are considered as touching points.
- in an example where the object is a person, when one foot of the person is in contact with the ground plane, that foot would have a small velocity at that moment.
- the refinement module 340 may track the foot part temporally and compute the velocity in the 2D image space using the overlap of the segmented masks in adjacent frames. In some embodiments, when the velocity is smaller than the predefined threshold, the corresponding foot can be assigned to be the touching point. After the 3D ground plane is determined, the refinement module 340 may adjust the optimized root motion d t for each frame so that each assigned 3D foot joint is on the 3D ground plane.
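- One possible realization of this refinement is sketched below: fit a plane to the 3D joints labeled as touching points, then shift each frame's root translation along the plane normal so that the assigned foot joint lies on the plane. The SVD-based plane fit and the per-frame correction are assumptions for illustration, not the exact procedure of this disclosure.

```python
import numpy as np


def fit_plane(points: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Least-squares plane through (N, 3) points; returns (unit normal, point on plane)."""
    centroid = points.mean(axis=0)
    _, _, vh = np.linalg.svd(points - centroid)
    normal = vh[-1]                               # direction of smallest variance
    return normal / np.linalg.norm(normal), centroid


def refine_root(d: np.ndarray, foot_joints: np.ndarray,
                normal: np.ndarray, point: np.ndarray) -> np.ndarray:
    """Shift each frame's root translation so the assigned foot joint sits on the plane.

    d: (T, 3) optimized root translations; foot_joints: (T, 3) assigned foot positions.
    """
    offsets = (foot_joints - point) @ normal      # signed distance of each foot to the plane
    return d - offsets[:, None] * normal[None, :]
```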
- FIG. 4 illustrates an example motion tracking pipeline 400, in accordance with various embodiments.
- the motion tracking pipeline 400 may be performed by the motion tracking system 100 in FIG. 1.
- the motion tracking pipeline 400 starts with an input stage 401.
- a plurality of frames is received by the motion tracking system 100.
- the frames may be arranged in a sequence, e.g., based on the timestamps associated with the frames.
- the frames may be from a video.
- the frames are processed separately in an AI prediction stage 402.
- the first frame may be used for object detection, key points detection, pose parameters regression, and ground part segmentation.
- Each of the subsequent frames may be used for key points detection, pose parameters regression, and ground part segmentation.
- Object detection is not performed on the subsequent frames in the AI prediction stage 402 as the key points detection, pose parameters regression, and ground part segmentation for the subsequent frames can be conducted based on object detection using the first frame.
- the outputs of the AI prediction stage 402 are processed in a global processing stage 403, in which the outputs generated from the frames in the AI prediction stage 402 may be processed together.
- the global processing stage 403 includes four steps: smoothing, root translation initiation, global consistency optimization, and root translation refinement.
- the smoothing step may be done by the filtering module 310 using one or more filters.
- the root translation initiation step may be done by the root motion module 320 to identify a root point and estimate motion of the root point in a 3D camera space.
- the global consistency optimization step may be done by the global consistency module 330 for global and spatial-temporal consistency over the entire sequence of frames.
- the root translation refinement step may be done by the refinement module 340 to counteract depth uncertainty in reconstruction of videos, e.g., monocular videos.
- large variations in depth between frame pairs, including neighboring frames pairs and non-neighboring frame pairs, may be penalized.
- a neighboring frame pair may include frames that are arranged right next to each other in the video.
- a non-neighboring frame pair may include frames that have one or more other frames between them in the video.
- the global processing stage 403 is followed by a result stage 404, where the estimated 3D motion result is available.
- the estimated 3D motion result may be output by the optimization module 140.
- the result stage 404 is followed by an application stage 405 which may be done by the application module 150 to use the estimated 3D motion result to generate one or more files for particular applications.
- FIG. 5 illustrates an example multi-task DNN 500, in accordance with various embodiments.
- the multi-task DNN 500 may be an example of the multi-task neural network 230 in FIG. 2.
- the multi-task DNN 500 includes a CNN backbone 510, convolutional layers 520, 530, and 540, a concatenation layer 550, and a regression layer 560.
- the convolutional layer 520 may constitute a branch of the multi-task DNN 500.
- the convolutional layer 530 may constitute another branch of the multi-task DNN 500.
- the convolutional layer 540, the concatenation layer 550, and the regression layer 560 may constitute yet another branch of the multi-task DNN 500.
- the multi-task DNN 500 may include different, fewer, or more layers. Also, one or more layers in the multi-task DNN 500 may be arranged differently. For the purpose of illustration, the multi-task DNN 500 receives an input image 501 and generates three outputs: a ground part mask 502, a heat map 503, and a pose tensor 504.
- the CNN backbone 510 receives the input image 501.
- the input image 501 may be an image of an object that is extracted from a frame in a video capturing the object.
- the input image 501 may be a cropped frame that may be generated by cropping the video frame using a 2D bounding box.
- the CNN backbone 510 may extract features from the input image 501 and generate a feature map.
- the CNN backbone 510 may include a plurality of convolutional layers.
- the CNN backbone 510 may also include one or more activation function layers, pooling layers, fully-connected layers, other types of layers, or some combination thereof.
- An example of the CNN backbone 510 may be the CNN 900 in FIG. 9 or a part of the CNN 900.
- the feature map generated by the CNN backbone 510 may be an OFM of the last layer of the CNN backbone 510.
- the convolutional layer 520 receives the feature map from the CNN backbone 510 and generates a ground part mask 502.
- the convolution layer 520 may segment a ground part of the object from the object based on the feature map from the CNN backbone 510.
- the ground part mask 502 may represent the segmented ground part.
- the ground part may be a part of the object that is considered to be on the ground during the object’s movement. In an example where the object is a person, the ground part may be a foot of the person.
- the convolutional layer 520 may segment multiple ground parts of the object from the object and generate multiple ground part masks. Each of the ground part masks may represent a different one of the segmented ground parts.
- the convolutional layer 530 processes the feature map from the CNN backbone 510 to compute the heat map 503.
- the heat map 503 represents 2D key points on the object.
- the heat map 503 may be an OFM of the convolutional layer 530.
- the heat map 503 is represented by a rectangular prism in FIG. 5.
- the heat map 503 may be a 3D tensor denoted as F ∈ R^(H×W×(J+1)) having a spatial size of H × W × (J+1), where H is the height of the heat map, W is the width of the heat map, and J+1 is the depth of the heat map 503.
- the depth of the heat map 503 may be in the channel dimension of the heat map 503, and the heat map 503 may have J+1 channels.
- the heat map 503 may model J key points and 1 background mask.
- Each key point may be represented by a 2D tensor denoted as F′ ∈ R^(H×W).
- Each data element in the 2D tensor may be a pixel having an index (h, w) that indicates the position of the pixel in the 2D tensor. For instance, h may indicate which column of the 2D tensor the pixel is located at, and w may indicate which row of the 2D tensor the pixel is located at.
- a pixel may encode the likelihood of a portion of the object belonging to a key point j.
- the key points may be body joints.
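- A common way to read 2D key points out of such a heat map is a per-channel argmax, sketched below. Subpixel refinement is omitted, and the channel ordering (J key-point channels followed by one background channel) is an assumption.

```python
import numpy as np


def keypoints_from_heatmap(heat_map: np.ndarray) -> np.ndarray:
    """Extract (J, 2) key-point pixel coordinates from an (H, W, J+1) heat map.

    The last channel is treated as the background mask and ignored.
    """
    h, w, _ = heat_map.shape
    coords = []
    for j in range(heat_map.shape[2] - 1):          # skip the background channel
        flat_idx = np.argmax(heat_map[:, :, j])
        row, col = np.unravel_index(flat_idx, (h, w))
        coords.append((col, row))                   # (x, y) in image convention
    return np.array(coords, dtype=np.int64)
```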
- the convolutional layer 540 receives the feature map from the CNN backbone 510 and generates a new feature map by applying a convolution on the feature map from the CNN backbone 510.
- the feature map generated by the convolutional layer 540 and the heat map 503 generated by the convolutional layer 530 are input into the concatenation layer 550.
- the concatenation layer 550 may also be referred to as a concatenator.
- the concatenation layer 550 concatenates the two feature maps in one of the dimensions. For instance, the concatenation layer 550 may concatenate the two feature maps in the channel dimension.
- the concatenation layer 550 may stack the feature map generated by the convolutional layer 540 and the heat map 503 in the channel dimension to generate a new feature map.
- in an example where the heat map 503 includes J+1 channels and the feature map generated by the convolutional layer 540 has K channels, the feature map generated by the concatenation layer 550 may have J+1+K channels.
- the spatial size of the feature map generated by the convolutional layer 540 in a single channel may equal the spatial size of the heat map 503 in a single channel.
- the feature map generated by the convolutional layer 540 may have the same height H and the same width W.
- the spatial size of the feature map generated by the concatenation layer 550 in a single channel may equal the spatial size of the heat map 503 in a single channel.
- the feature map generated by the concatenation layer 550 may have the same height H and the same width W.
- the regression layer 560 receives the feature map generated by the concatenation layer 550 and generates the pose tensor 504.
- the pose tensor 504 includes 3D pose parameters.
- the regression layer 560 may change values of data elements in the feature map generated by the concatenation layer 550 to compute the 3D pose parameters.
- the regression layer 560 may apply regression on the feature map generated by the concatenation layer 550.
- the regression may be linear regression, nonlinear regression, etc.
- the regression layer 560 may use one or more pose and shape parameters as the regression target for computing the 3D pose parameters.
- the regression layer 560 may include a fully-connected layer.
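- A compact PyTorch sketch of this topology is given below. The backbone depth, channel counts, and the number of pose parameters are placeholders; the point of the sketch is the data flow: one backbone feature map feeding a mask head, a heat-map head, and a regression head that concatenates the heat map with an intermediate feature map.

```python
import torch
import torch.nn as nn


class MultiTaskPoseNet(nn.Module):
    def __init__(self, num_keypoints: int = 16, num_pose_params: int = 72,
                 backbone_channels: int = 64, feat_channels: int = 32):
        super().__init__()
        # CNN backbone (510): a stack of convolutional layers producing a feature map
        self.backbone = nn.Sequential(
            nn.Conv2d(3, backbone_channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(backbone_channels, backbone_channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # branch 1 (520): ground part mask
        self.mask_head = nn.Conv2d(backbone_channels, 1, kernel_size=1)
        # branch 2 (530): heat map with J key-point channels + 1 background channel
        self.heatmap_head = nn.Conv2d(backbone_channels, num_keypoints + 1, kernel_size=1)
        # branch 3 (540 + 550 + 560): feature conv, channel concatenation, pose regression
        self.feat_conv = nn.Conv2d(backbone_channels, feat_channels, kernel_size=1)
        self.regression = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels + num_keypoints + 1, num_pose_params),
        )

    def forward(self, image: torch.Tensor) -> dict:
        features = self.backbone(image)
        heat_map = self.heatmap_head(features)
        mask = self.mask_head(features)
        fused = torch.cat([self.feat_conv(features), heat_map], dim=1)  # concat in channel dim
        pose = self.regression(fused)
        return {"mask": mask, "heat_map": heat_map, "pose": pose}
```

- Feeding a 1 × 3 × 256 × 256 image tensor through this sketch would, for example, yield a heat map with num_keypoints + 1 channels at one quarter of the input resolution per side, a one-channel mask of the same size, and a 72-element pose tensor per image.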
- FIGS. 6A and 6B illustrate 3D motion tracking with reconstructed 3D ground plane 610, in accordance with various embodiments.
- the reconstructed 3D ground plane 610 has a checkerboard pattern in FIGS. 6A and 6B. In other embodiments, the reconstructed 3D ground plane 610 may have different patterns.
- FIGS. 6A and 6B show a result of 3D motion tracking based on two frames 620A and 620B.
- the frames 620A and 620B may be frames in the same video, e.g., a video that captures the movement of a person.
- FIG. 6A shows the estimated 3D motion of the person captured by the frame 620A.
- FIG. 6B shows the estimated 3D motion of the person captured by the frame 620B.
- the 3D motions may be estimated by the motion tracking system 100 in FIG. 1 using the video.
- FIGS. 6A and 6B each show a graphic representation 630 of the person.
- the graphic representation 630 depicts the estimated 3D motion.
- the graphic representation 630 includes a plurality of key points 640, individually referred to as key point 640.
- the key points 640 are circled with dashed circles in FIG. 6A. Even though the graphic representation 630 includes 16 key points 640, the graphic representation 630 may include a different number of key points 640 in other embodiments.
- a key point 640 may represent a body part of the person, e.g., bone joint, head, foot, navel, and so on.
- one of the key points 640 may be selected as a root point. For instance, the key point 640 representing the navel of the person may be selected as the root point.
- the motion of the root point may be estimated, and the estimated motion of the root point may be used to estimate the 3D motion of the person.
- the key points 640 may be determined by a multi-task DNN, e.g., the multi-task neural network 230 or the multi-task DNN 500.
- the two key points 640 touching the reconstructed 3D ground plane 610 in FIG. 6A may correspond to the two feet of the person.
- the multi-task DNN may generate two masks based on foot segmentation.
- FIG. 7 illustrates an example motion analysis for physical fitness, in accordance with various embodiments.
- the motion analysis may be based on a 3D motion of a person 710 that is estimated using a video capturing the person 710 running on a treadmill 720.
- the motion analysis may be done by the motion tracking system 100 in FIG. 1.
- FIG. 7 shows key points 715 of the person 710.
- FIG. 7 shows eight key points 715.
- a different number of key points 715 may be used for estimating the 3D motion.
- the estimated 3D motion may be used to monitor or evaluate the running exercise of the person 710.
- the estimated 3D motion may be used to determine the running speed of the person 710, measuring consumed calories, evaluate pose of the person 710 during the running, monitor health conditions of the person 710, provide recommendations for physical fitness, and so on.
- FIG. 8 illustrates an example avatar animation 810 generated based on motion tracking, in accordance with various embodiments.
- the avatar animation 810 may be generated by the motion tracking system 100 using a video 820.
- FIG. 8 shows a frame of the avatar animation 810 and a frame of the video 820.
- the video 820 captures a person playing basketball.
- the person’s 3D motion is estimated.
- the estimated 3D motion is used to generate the motion of the avatar in the avatar animation 810.
- the avatar may be a graphic representation of the person.
- the avatar in the avatar animation 810 is making the same or substantially similar movement as the person in the video 820.
- FIG. 9 illustrates an example CNN 900, in accordance with various embodiments.
- the CNN 900 (or part of the CNN 900) may be an example of the CNN backbone 510 in FIG. 5.
- the CNN 900 is trained to receive images of objects and output OFMs representing extracted features of the objects.
- the CNN 900 includes a sequence of layers comprising a plurality of convolutional layers 910 (individually referred to as “convolutional layer 910” ) , a plurality of pooling layers 920 (individually referred to as “pooling layer 920” ) , and a plurality of fully-connected layers 930 (individually referred to as “fully-connected layer 930” ) .
- the CNN 900 may include fewer, more, or different layers.
- the layers of the CNN 900 execute tensor computation that includes many tensor operations, such as convolution (e.g., multiply-accumulate (MAC) operations, etc. ) , pooling operations, elementwise operations (e.g., elementwise addition, elementwise multiplication, etc. ) , other types of tensor operations, or some combination thereof.
- the convolutional layers 910 summarize the presence of features in the input to the CNN 900.
- the convolutional layers 910 function as feature extractors.
- the first layer of the CNN 900 is a convolutional layer 910.
- a convolutional layer 910 performs a convolution on an input tensor 940 (also referred to as IFM 940) and a filter 950.
- the IFM 940 is represented by a 7 × 7 × 3 three-dimensional (3D) matrix.
- the IFM 940 includes 3 input channels, each of which is represented by a 7 × 7 two-dimensional (2D) matrix.
- the 7 × 7 2D matrix includes 7 input elements (also referred to as input points) in each row and seven input elements in each column.
- the filter 950 is represented by a 3 × 3 × 3 3D matrix.
- the filter 950 includes 3 kernels, each of which may correspond to a different input channel of the IFM 940.
- a kernel is a 2D matrix of weights, where the weights are arranged in columns and rows.
- a kernel can be smaller than the IFM.
- each kernel is represented by a 3 × 3 2D matrix.
- the 3 × 3 kernel includes 3 weights in each row and three weights in each column. Weights can be initialized and updated by backpropagation using gradient descent. The magnitudes of the weights can indicate importance of the filter 950 in extracting features from the IFM 940.
- the convolution includes MAC operations with the input elements in the IFM 940 and the weights in the filter 950.
- the convolution may be a standard convolution 963 or a depthwise convolution 983. In the standard convolution 963, the whole filter 950 slides across the IFM 940. All the input channels are combined to produce an output tensor 960 (also referred to as OFM 960) .
- the OFM 960 is represented by a 5 × 5 2D matrix.
- the 5 × 5 2D matrix includes 5 output elements (also referred to as output points) in each row and five output elements in each column.
- the standard convolution includes one filter in the embodiments of FIG. 9. In embodiments where there are multiple filters, the standard convolution may produce multiple output channels in the OFM 960.
- the multiplication applied between a kernel-sized patch of the IFM 940 and a kernel may be a dot product.
- a dot product is the elementwise multiplication between the kernel-sized patch of the IFM 940 and the corresponding kernel, which is then summed, always resulting in a single value. Because it results in a single value, the operation is often referred to as the “scalar product. ”
- Using a kernel smaller than the IFM 940 is intentional as it allows the same kernel (set of weights) to be multiplied by the IFM 940 multiple times at different points on the IFM 940. Specifically, the kernel is applied systematically to each overlapping part or kernel-sized patch of the IFM 940, left to right, top to bottom.
- the result from multiplying the kernel with the IFM 940 one time is a single value.
- the multiplication result is a 2D matrix of output elements.
- the 2D output matrix (i.e., the OFM 960) from the standard convolution 963 is referred to as an OFM.
- In the depthwise convolution 983, the input channels are not combined. Rather, MAC operations are performed on an individual input channel and an individual kernel and produce an output channel. As shown in FIG. 9, the depthwise convolution 983 produces a depthwise output tensor 980.
- the depthwise output tensor 980 is represented by a 5 × 5 × 3 3D matrix.
- the depthwise output tensor 980 includes 3 output channels, each of which is represented by a 5 × 5 2D matrix.
- the 5 × 5 2D matrix includes 5 output elements in each row and five output elements in each column.
- Each output channel is a result of MAC operations of an input channel of the IFM 940 and a kernel of the filter 950.
- the first output channel (patterned with dots) is a result of MAC operations of the first input channel (patterned with dots) and the first kernel (patterned with dots)
- the second output channel (patterned with horizontal strips) is a result of MAC operations of the second input channel (patterned with horizontal strips) and the second kernel (patterned with horizontal strips)
- the third output channel (patterned with diagonal stripes) is a result of MAC operations of the third input channel (patterned with diagonal stripes) and the third kernel (patterned with diagonal stripes) .
- the number of input channels equals the number of output channels, and each output channel corresponds to a different input channel.
- the input channels and output channels are referred to collectively as depthwise channels.
- a pointwise convolution 993 is then performed on the depthwise output tensor 980 and a 1 × 1 × 3 tensor 990 to produce the OFM 960.
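- The shape arithmetic of these three convolutions can be reproduced with off-the-shelf layers, as sketched below for the 7 × 7 × 3 input; the depthwise case uses a grouped convolution, and the pointwise case uses a 1 × 1 kernel. The layer choices are illustrative, not the layers of the CNN 900 itself.

```python
import torch
import torch.nn as nn

ifm = torch.randn(1, 3, 7, 7)                                    # IFM 940: 3 channels, 7 x 7

standard = nn.Conv2d(3, 1, kernel_size=3, bias=False)            # one 3 x 3 x 3 filter (950)
print(standard(ifm).shape)                                        # torch.Size([1, 1, 5, 5]) -> OFM 960

depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)  # one 3 x 3 kernel per channel
dw_out = depthwise(ifm)
print(dw_out.shape)                                               # torch.Size([1, 3, 5, 5]) -> tensor 980

pointwise = nn.Conv2d(3, 1, kernel_size=1, bias=False)            # 1 x 1 kernel across 3 channels
print(pointwise(dw_out).shape)                                    # torch.Size([1, 1, 5, 5]) -> OFM 960
```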
- the OFM 960 is then passed to the next layer in the sequence.
- the OFM 960 is passed through an activation function.
- An example activation function is ReLU.
- ReLU is a calculation that returns the value provided as input directly, or the value zero if the input is zero or less.
- the convolutional layer 910 may receive several images as input and calculate the convolution of each of them with each of the kernels. This process can be repeated several times. For instance, the OFM 960 is passed to the subsequent convolutional layer 910 (i.e., the convolutional layer 910 following the convolutional layer 910 generating the OFM 960 in the sequence) .
- the subsequent convolutional layers 910 perform a convolution on the OFM 960 with new kernels and generate a new feature map.
- the new feature map may also be normalized and resized.
- the new feature map can be kernelled again by a further subsequent convolutional layer 910, and so on.
- a convolutional layer 910 has four hyperparameters: the number of kernels, the size F of the kernels (e.g., a kernel is of dimensions F × F × D pixels), the step S with which the window corresponding to the kernel is dragged on the image (e.g., a step of one means moving the window one pixel at a time), and the zero-padding P (e.g., adding a black contour of P pixels thickness to the input image of the convolutional layer 910).
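- As a point of reference (this relation is standard for convolutions and is not specific to this disclosure), the spatial output size O of a convolutional layer 910 satisfies O = (I − F + 2P) / S + 1, where I is the input spatial size. For the example of FIG. 9, I = 7, F = 3, P = 0, and S = 1 give O = (7 − 3 + 0) / 1 + 1 = 5, which matches the 5 × 5 OFM 960.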
- the convolutional layers 910 may perform various types of convolutions, such as 2-dimensional convolution, dilated or atrous convolution, spatial separable convolution, depthwise separable convolution, transposed convolution, and so on.
- the CNN 900 includes 96 convolutional layers 910. In other embodiments, the CNN 900 may include a different number of convolutional layers.
- the pooling layers 920 down-sample feature maps generated by the convolutional layers, e.g., by summarizing the presence of features in the patches of the feature maps.
- a pooling layer 920 is placed between two convolution layers 910: a preceding convolutional layer 910 (the convolution layer 910 preceding the pooling layer 920 in the sequence of layers) and a subsequent convolutional layer 910 (the convolution layer 910 subsequent to the pooling layer 920 in the sequence of layers) .
- a pooling layer 920 is added after a convolutional layer 910, e.g., after an activation function (e.g., ReLU, etc. ) has been applied to the OFM 960.
- a pooling layer 920 receives feature maps generated by the preceding convolution layer 910 and applies a pooling operation to the feature maps.
- the pooling operation reduces the size of the feature maps while preserving their important characteristics. Accordingly, the pooling operation improves the efficiency of the CNN and helps avoid overfitting.
- the pooling layers 920 may perform the pooling operation through average pooling (calculating the average value for each patch on the feature map) , max pooling (calculating the maximum value for each patch of the feature map) , or a combination of both.
- the size of the pooling operation is smaller than the size of the feature maps.
- the pooling operation is 2 × 2 pixels applied with a stride of two pixels, so that the pooling operation reduces each dimension of a feature map by a factor of two, e.g., the number of pixels or values in the feature map is reduced to one quarter of the original size.
- a pooling layer 920 applied to a 6 × 6 feature map results in an output pooled feature map of 3 × 3.
- the output of the pooling layer 920 is inputted into the subsequent convolution layer 910 for further feature extraction.
- the pooling layer 920 operates upon each feature map separately to create a new set of the same number of pooled feature maps.
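- A minimal NumPy sketch of the 2 × 2, stride-two max pooling described above, which quarters the number of values in a feature map; the 6 × 6 input here is only an illustrative example.

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2x2 max pooling with a stride of two on a single 2D feature map."""
    h, w = feature_map.shape
    # Group the map into non-overlapping 2x2 patches and take the maximum of each patch.
    patches = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return patches.max(axis=(1, 3))

fm = np.arange(36, dtype=np.float32).reshape(6, 6)
pooled = max_pool_2x2(fm)
print(pooled.shape)  # (3, 3): 9 values, one quarter of the original 36
```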
- the fully-connected layers 930 are the last layers of the CNN.
- the fully-connected layers 930 may be convolutional or not.
- the fully-connected layers 930 may also be referred to as linear layers.
- a fully-connected layer 930 (e.g., the first fully-connected layer in the CNN 900) may receive an input operand.
- the input operand may define the output of the convolutional layers 910 and pooling layers 920 and includes the values of the last feature map generated by the last pooling layer 920 in the sequence.
- the fully-connected layer 930 may apply a linear transformation to the input operand through a weight matrix.
- the weight matrix may be a kernel of the fully-connected layer 930.
- the linear transformation may include a tensor multiplication between the input operand and the weight matrix.
- the result of the linear transformation may be an output operand.
- the fully-connected layer may further apply a nonlinear transformation (e.g., by using a nonlinear activation function) on the result of the linear transformation to generate an output operand.
- the output operand may contain as many elements as there are classes: element i represents the probability that the image belongs to class i. Each element is therefore between 0 and 1, and all the elements sum to one. These probabilities are calculated by the last fully-connected layer 930 by using a logistic function (binary classification) or a SoftMax function (multi-class classification) as an activation function.
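- A fully-connected classification head of this kind — a linear transformation by a weight matrix followed by a SoftMax that turns the output operand into class probabilities between 0 and 1 that sum to one — can be sketched as follows; the feature size and class count are placeholders.

```python
import numpy as np

def fully_connected_softmax(x: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear transformation followed by SoftMax over the class dimension."""
    logits = x @ weights + bias                 # linear transformation by the weight matrix
    logits -= logits.max()                      # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs                                # element i is the probability of class i

rng = np.random.default_rng(0)
x = rng.standard_normal(128)                    # flattened features from the last pooling layer
w = rng.standard_normal((128, 10))              # weight matrix for 10 classes
b = np.zeros(10)
p = fully_connected_softmax(x, w, b)
print(p.sum())                                  # 1.0 (up to floating-point error)
```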
- FIG. 10 illustrates an example convolution, in accordance with various embodiments.
- the convolution may be a deep learning operation in a convolutional layer, e.g., a convolutional layer in the backbone 510, the convolutional layer 520, the convolutional layer 530, the convolutional layer 540, or one of the convolutional layers 910.
- the convolution can be executed on an input tensor 1010 and filters 1020 (individually referred to as “filter 1020” ) .
- the result of the convolution is an output tensor 1030.
- the convolution is performed by a DNN accelerator.
- An example of the DNN accelerator may be the DNN accelerator 302 in FIG. 3.
- the input tensor 1010 includes activations (also referred to as “input activations, ” “elements, ” or “input elements” ) arranged in a 3D matrix.
- An input element is a data point in the input tensor 1010.
- the input tensor 1010 has a spatial size of 7 × 7 × 3, i.e., the input tensor 1010 includes three input channels and each input channel has a 7 × 7 2D matrix.
- Each input element in the input tensor 1010 may be represented by a (X, Y, Z) coordinate. In other embodiments, the height, width, or depth of the input tensor 1010 may be different.
- Each filter 1020 includes weights arranged in a 3D matrix. The values of the weights may be determined through training the DNN.
- a filter 1020 has a spatial size H_f × W_f × C_f, where H_f is the height of the filter (i.e., the length along the Y axis, which indicates the number of weights in a column in each kernel) , W_f is the width of the filter (i.e., the length along the X axis, which indicates the number of weights in a row in each kernel) , and C_f is the depth of the filter (i.e., the length along the Z axis, which indicates the number of channels) . In some embodiments, C_f equals C_in.
- each filter 1020 in FIG. 10 has a spatial size of 3 × 3 × 3, i.e., the filter 1020 includes three convolutional kernels with a spatial size of 3 × 3.
- the height, width, or depth of the filter 1020 may be different.
- the spatial size of the convolutional kernels is smaller than the spatial size of the 2D matrix of each input channel in the input tensor 1010.
- An activation or weight may take one or more bytes in a memory.
- the number of bytes for an activation or weight may depend on the data format. For example, when the activation or weight has an INT8 format, the activation or weight takes one byte. When the activation or weight has an FP16 format, the activation or weight takes two bytes. Other data formats may be used for activations or weights.
- each filter 1020 slides across the input tensor 1010 and generates a 2D matrix for an output channel in the output tensor 1030.
- the 2D matrix has a spatial size of 5 × 5.
- the output tensor 1030 includes activations (also referred to as “output activations, ” “elements, ” or “output element” ) arranged in a 3D matrix.
- An output activation is a data point in the output tensor 1030.
- the output tensor 1030 has a spatial size H_out × W_out × C_out, where H_out is the height of the 3D matrix (i.e., the length along the Y axis, which indicates the number of output activations in a column in the 2D matrix of each output channel) , W_out is the width of the 3D matrix (i.e., the length along the X axis, which indicates the number of output activations in a row in the 2D matrix of each output channel) , and C_out is the depth of the 3D matrix (i.e., the length along the Z axis, which indicates the number of output channels) .
- C_out may equal the number of filters 1020 in the convolution.
- H_out and W_out may depend on the heights and widths of the input tensor 1010 and each filter 1020.
- MAC operations can be performed on a 3 × 3 × 3 subtensor 1015 (which is highlighted with a dotted pattern in FIG. 10) in the input tensor 1010 and each filter 1020.
- the result of the MAC operations on the subtensor 1015 and one filter 1020 is an output activation.
- an output activation may include eight bits, e.g., one byte.
- an output activation may include more than one byte. For instance, an output element may include two bytes.
- after the MAC operations on the subtensor 1015 and all the filters 1020 are done, a vector 1035 is produced.
- the vector 1035 is highlighted with slashes in FIG. 10.
- the vector 1035 includes a sequence of output activations, which are arranged along the Z axis.
- the output activations in the vector 1035 have the same (X, Y) coordinate, but the output activations correspond to different output channels and have different Z coordinates.
- the dimension of the vector 1035 along the Z axis may equal the total number of output channels in the output tensor 1030.
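- The sliding-window MAC computation just described can be reproduced with a naive NumPy loop. This is a didactic sketch only (a DNN accelerator performs the same MACs in dedicated hardware); the 7 × 7 × 3 input and 3 × 3 × 3 filters match the sizes discussed above, while the choice of four filters is an arbitrary assumption.

```python
import numpy as np

def conv2d_mac(input_tensor: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Direct convolution: input (C_in, H, W), filters (C_out, C_in, Hf, Wf), stride 1, no padding."""
    c_in, h, w = input_tensor.shape
    c_out, _, hf, wf = filters.shape
    h_out, w_out = h - hf + 1, w - wf + 1
    output = np.zeros((c_out, h_out, w_out), dtype=input_tensor.dtype)
    for y in range(h_out):
        for x in range(w_out):
            # Subtensor covered by the sliding window, e.g. a 3x3x3 block of input activations.
            window = input_tensor[:, y:y + hf, x:x + wf]
            for k in range(c_out):
                # One MAC reduction per filter yields one output activation; the c_out
                # activations at this (x, y) position form a vector along the channel axis.
                output[k, y, x] = np.sum(window * filters[k])
    return output

inp = np.random.rand(3, 7, 7).astype(np.float32)     # 3 input channels, 7x7 each
flt = np.random.rand(4, 3, 3, 3).astype(np.float32)  # 4 filters of size 3x3x3 (C_out = 4 is arbitrary)
out = conv2d_mac(inp, flt)
print(out.shape)  # (4, 5, 5)
```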
- the output activations in the output tensor 1030 may be further processed based on one or more activation functions before they are stored or inputted into the next layer of the CNN.
- the processing based on the one or more activation functions may be at least part of the post processing of the convolution.
- the post processing may include one or more other computations, such as offset computation, bias computation, and so on.
- the results of the post processing may be stored in a local memory of the compute block and be used as input to the next layer.
- the input activations in the input tensor 1010 may be results of post processing of the previous layer. Even though the input tensor 1010, filters 1020, and output tensor 1030 are 3D tensors in FIG. 10, the input tensor 1010, a filter 1020, or the output tensor 1030 may be a 2D tensor in other embodiments.
- FIG. 11 is a flowchart showing a method 1100 of motion tracking, in accordance with various embodiments.
- the method 1100 may be performed by the motion tracking system 100 in FIG. 1.
- although the method 1100 is described with reference to the flowchart illustrated in FIG. 11, many other methods for motion tracking may alternatively be used.
- the order of execution of the steps in FIG. 11 may be changed.
- some of the steps may be changed, eliminated, or combined.
- the motion tracking system 100 extracts 1110 an image from a frame of a video based on detection of an object in the frame.
- the image captures the object.
- the video captures a movement of the object.
- the motion tracking system 100 generates 1120, by a backbone in a neural network using the image, a feature map of the object.
- the neural network further comprises a first branch and a second branch that are coupled to the backbone.
- the backbone includes one or more convolutional layers.
- the motion tracking system 100 generates 1130, by the first branch using the feature map, a heat map comprising one or more pixels.
- a pixel encodes a likelihood of a point on the object being a key point on the object.
- the heat map is a 3D tensor comprising a number of 2D matrices.
- a 2D matrix comprises pixels arranged in rows and columns. The number of 2D matrices is determined based on a number of key points on the object.
- the motion tracking system 100 generates 1140, by the second branch using the feature map and the heat map, a pose tensor comprising one or more parameters that represent a 3D pose of the object in the frame.
- the second branch comprises a convolutional layer, a concatenate layer, and a regression layer.
- the motion tracking system 100 determines 1150 a 3D motion of the object based on the heat map and the pose tensor.
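- A highly simplified sketch of a multi-task layout of this general shape — a shared convolutional backbone feeding a heat-map branch and a pose-regression branch that also consumes the heat map — is shown below. The layer sizes, the number of key points and pose parameters, and the concatenation strategy are illustrative assumptions, not the architecture of this disclosure.

```python
import torch
import torch.nn as nn

class MultiTaskPoseNet(nn.Module):
    """Toy multi-task network: shared backbone -> (a) key-point heat maps,
    (b) a pose tensor regressed from backbone features concatenated with the heat maps."""
    def __init__(self, num_keypoints: int = 17, num_pose_params: int = 24, feat_ch: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(                       # shared feature extractor
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.heatmap_branch = nn.Conv2d(feat_ch, num_keypoints, 1)  # one heat map per key point
        self.pose_branch = nn.Sequential(                    # convolution + pooling + regression
            nn.Conv2d(feat_ch + num_keypoints, feat_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_ch, num_pose_params),
        )

    def forward(self, image: torch.Tensor):
        feat = self.backbone(image)
        heatmaps = self.heatmap_branch(feat)                 # per-pixel key-point likelihoods
        pose = self.pose_branch(torch.cat([feat, heatmaps], dim=1))  # pose tensor for the frame
        return heatmaps, pose

crop = torch.randn(1, 3, 256, 256)                           # image cropped around the detected object
hm, pose = MultiTaskPoseNet()(crop)
print(hm.shape, pose.shape)  # torch.Size([1, 17, 64, 64]) torch.Size([1, 24])
```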
- the video includes a sequence of frames.
- the frame is the first frame in the sequence.
- the motion tracking system 100 extracts a second image from a second frame in the sequence.
- the motion tracking system 100 generates, by the neural network using the second image, a second heat map and a second pose tensor.
- the three-dimensional motion of the object is determined further based on the second heat map and the second pose tensor.
- the motion tracking system 100 segments, by a third branch in the neural network using the feature map, a part of the object from the object.
- the motion tracking system 100 generates, by the third branch, one or more masks representing the segmented part.
- the 3D motion of the object is determined further based on the one or more masks.
- the motion tracking system 100 constructs a 3D ground plane based on the one or more masks.
- the 3D ground plane represents a ground on which the 3D motion of the object is carried out.
- the motion tracking system 100 applies a first filter on key points extracted from the heat map to generate smoothed key points.
- the motion tracking system 100 applies a second filter on the one or more 3D pose parameters to generate one or more smoothed 3D pose parameters.
- the motion tracking system 100 determines the 3D motion of the object based on the smoothed key points and the one or more smoothed 3D pose parameters.
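- This excerpt does not fix a particular filter for the smoothing step; as one stand-in, an exponential moving average applied per frame can smooth both the extracted key points and the 3D pose parameters, as sketched below (the key-point and parameter counts are assumed values).

```python
import numpy as np

class ExponentialSmoother:
    """Per-frame exponential moving average; a generic stand-in for the temporal
    filters applied to key points and 3D pose parameters."""
    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha      # weight of the newest frame (0 < alpha <= 1)
        self.state = None

    def __call__(self, value: np.ndarray) -> np.ndarray:
        value = np.asarray(value, dtype=np.float64)
        if self.state is None:
            self.state = value                                  # first frame initializes the state
        else:
            self.state = self.alpha * value + (1.0 - self.alpha) * self.state
        return self.state

keypoint_filter = ExponentialSmoother(alpha=0.6)   # first filter: key points per frame
pose_filter = ExponentialSmoother(alpha=0.4)       # second filter: 3D pose parameters per frame

for frame_keypoints, frame_pose in zip(np.random.rand(5, 17, 2), np.random.rand(5, 24)):
    smoothed_kp = keypoint_filter(frame_keypoints)
    smoothed_pose = pose_filter(frame_pose)
```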
- FIG. 12 is a block diagram of an example computing device 1200, in accordance with various embodiments.
- the computing device 1200 can be used as at least part of the motion tracking system 100.
- a number of components are illustrated in FIG. 12 as included in the computing device 1200, but any one or more of these components may be omitted or duplicated, as suitable for the application.
- some or all of the components included in the computing device 1200 may be attached to one or more motherboards. In some embodiments, some or all of these components are fabricated onto a single system on a chip (SoC) die. Additionally, in various embodiments, the computing device 1200 may not include one or more of the components illustrated in FIG. 12.
- the computing device 1200 may include interface circuitry for coupling to the one or more components.
- the computing device 1200 may not include a display device 1206, but may include display device interface circuitry (e.g., a connector and driver circuitry) to which a display device 1206 may be coupled.
- the computing device 1200 may not include an audio input device 1218 or an audio output device 1208, but may include audio input or output device interface circuitry (e.g., connectors and supporting circuitry) to which an audio input device 1218 or audio output device 1208 may be coupled.
- the computing device 1200 may include a processing device 1202 (e.g., one or more processing devices) .
- the processing device 1202 processes electronic data from registers and/or memory to transform that electronic data into other electronic data that may be stored in registers and/or memory.
- the computing device 1200 may include a memory 1204, which may itself include one or more memory devices such as volatile memory (e.g., DRAM) , nonvolatile memory (e.g., read-only memory (ROM) ) , high bandwidth memory (HBM) , flash memory, solid state memory, and/or a hard drive.
- the memory 1204 may include memory that shares a die with the processing device 1202.
- the memory 1204 includes one or more non-transitory computer-readable media storing instructions executable to perform operations for motion tracking, e.g., the method 1100 described above in conjunction with FIG. 11 or some operations performed by the motion tracking system 100.
- the instructions stored in the one or more non-transitory computer-readable media may be executed by the processing device 1202.
- the computing device 1200 may include a communication chip 1212 (e.g., one or more communication chips) .
- the communication chip 1212 may be configured for managing wireless communications for the transfer of data to and from the computing device 1200.
- the term "wireless" and its derivatives may be used to describe circuits, devices, systems, methods, techniques, communications channels, etc., that may communicate data through the use of modulated electromagnetic radiation through a nonsolid medium. The term does not imply that the associated devices do not contain any wires, although in some embodiments they might not.
- the communication chip 1212 may implement any of a number of wireless standards or protocols, including but not limited to Institute for Electrical and Electronic Engineers (IEEE) standards including Wi-Fi (IEEE 802.11 family) , IEEE 802.16 standards (e.g., IEEE 802.16-2005 Amendment) , Long-Term Evolution (LTE) project along with any amendments, updates, and/or revisions (e.g., advanced LTE project, ultramobile broadband (UMB) project (also referred to as "3GPP2" ) , etc. ) .
- the communication chip 1212 may operate in accordance with a Global System for Mobile Communication (GSM) , General Packet Radio Service (GPRS) , Universal Mobile Telecommunications System (UMTS) , High Speed Packet Access (HSPA) , Evolved HSPA (E-HSPA) , or LTE network.
- the communication chip 1212 may operate in accordance with Enhanced Data for GSM Evolution (EDGE) , GSM EDGE Radio Access Network (GERAN) , Universal Terrestrial Radio Access Network (UTRAN) , or Evolved UTRAN (E-UTRAN) .
- the communication chip 1212 may operate in accordance with Code-division Multiple Access (CDMA) , Time Division Multiple Access (TDMA) , Digital Enhanced Cordless Telecommunications (DECT) , Evolution-Data Optimized (EV-DO) , and derivatives thereof, as well as any other wireless protocols that are designated as 3G, 4G, 5G, and beyond.
- the communication chip 1212 may operate in accordance with other wireless protocols in other embodiments.
- the computing device 1200 may include an antenna 1222 to facilitate wireless communications and/or to receive other wireless communications (such as AM or FM radio transmissions) .
- the communication chip 1212 may manage wired communications, such as electrical, optical, or any other suitable communication protocols (e.g., the Ethernet) .
- the communication chip 1212 may include multiple communication chips. For instance, a first communication chip 1212 may be dedicated to shorter-range wireless communications such as Wi-Fi or Bluetooth, and a second communication chip 1212 may be dedicated to longer-range wireless communications such as global positioning system (GPS) , EDGE, GPRS, CDMA, WiMAX, LTE, EV-DO, or others.
- a first communication chip 1212 may be dedicated to wireless communications
- a second communication chip 1212 may be dedicated to wired communications.
- the computing device 1200 may include battery/power circuitry 1214.
- the battery/power circuitry 1214 may include one or more energy storage devices (e.g., batteries or capacitors) and/or circuitry for coupling components of the computing device 1200 to an energy source separate from the computing device 1200 (e.g., AC line power) .
- the computing device 1200 may include a display device 1206 (or corresponding interface circuitry, as discussed above) .
- the display device 1206 may include any visual indicators, such as a heads-up display, a computer monitor, a projector, a touchscreen display, a liquid crystal display (LCD) , a light-emitting diode display, or a flat panel display, for example.
- the computing device 1200 may include an audio output device 1208 (or corresponding interface circuitry, as discussed above) .
- the audio output device 1208 may include any device that generates an audible indicator, such as speakers, headsets, or earbuds, for example.
- the computing device 1200 may include an audio input device 1218 (or corresponding interface circuitry, as discussed above) .
- the audio input device 1218 may include any device that generates a signal representative of a sound, such as microphones, microphone arrays, or digital instruments (e.g., instruments having a musical instrument digital interface (MIDI) output) .
- the computing device 1200 may include a GPS device 1216 (or corresponding interface circuitry, as discussed above) .
- the GPS device 1216 may be in communication with a satellite-based system and may receive a location of the computing device 1200, as known in the art.
- the computing device 1200 may include another output device 1210 (or corresponding interface circuitry, as discussed above) .
- Examples of the other output device 1210 may include an audio codec, a video codec, a printer, a wired or wireless transmitter for providing information to other devices, or an additional storage device.
- the computing device 1200 may include another input device 1220 (or corresponding interface circuitry, as discussed above) .
- Examples of the other input device 1220 may include an accelerometer, a gyroscope, a compass, an image capture device, a keyboard, a cursor control device such as a mouse, a stylus, a touchpad, a bar code reader, a Quick Response (QR) code reader, any sensor, or a radio frequency identification (RFID) reader.
- the computing device 1200 may have any desired form factor, such as a handheld or mobile computer system (e.g., a cell phone, a smart phone, a mobile internet device, a music player, a tablet computer, a laptop computer, a netbook computer, an ultrabook computer, a personal digital assistant (PDA) , an ultramobile personal computer, etc. ) , a desktop computer system, a server or other networked computing component, a printer, a scanner, a monitor, a set-top box, an entertainment control unit, a vehicle control unit, a digital camera, a digital video recorder, or a wearable computer system.
- the computing device 1200 may be any other electronic device that processes data.
- Example 1 provides a computer-implemented method, including extracting an image from a frame of a video based on detection of an object in the frame, the image capturing the object; generating, by a backbone in a neural network using the image, a feature map of the object, the neural network further including a first branch and a second branch that are coupled to the backbone; generating, by the first branch using the feature map, a heat map including one or more pixels, a pixel encoding a likelihood of a point on the object being a key point on the object; generating, by the second branch using the feature map and the heat map, a pose tensor including one or more parameters that represent a three-dimensional (3D) pose of the object in the frame; and determining a 3D motion of the object based on the heat map and the pose tensor.
- Example 2 provides the computer-implemented method of example 1, in which the video includes a sequence of frames, the frame is a first frame in the sequence, and the computer-implemented method further includes extracting a second image from a second frame in the sequence based on the heat map for the first frame; and generating, by the neural network using the second image, a second heat map and a second pose tensor, in which the three-dimensional motion of the object is determined further based on the second heat map and the second pose tensor.
- Example 3 provides the computer-implemented method of example 1 or 2, further including segmenting, by a third branch in the neural network using the feature map, a part of the object from the object; and generating, by the third branch, one or more masks representing the segmented part, in which the 3D motion of the object is determined further based on the one or more masks.
- Example 4 provides the computer-implemented method of example 3, further including constructing a 3D ground plane based on the one or more masks, the 3D ground plane representing a ground on which the 3D motion of the object is carried out.
- Example 5 provides the computer-implemented method of any one of examples 1-4, in which the heat map is a 3D tensor including a number of 2D matrices, a 2D matrix includes pixels arranged in rows and columns, and the number of 2D matrices is determined based on a number of key points on the object.
- Example 6 provides the computer-implemented method of any one of examples 1-5, in which the second branch includes a convolutional layer, a concatenate layer, and a regression layer.
- Example 7 provides the computer-implemented method of any one of examples 1-6, in which determining the 3D motion of the object includes applying a first filter on key points extracted from the heat map to generate smoothed key points; applying a second filter on the one or more 3D pose parameters to generate one or more smoothed 3D pose parameters; and determining the 3D motion of the object based on the smoothed key points and the one or more smoothed 3D pose parameters.
- Example 8 provides the computer-implemented method of any one of examples 1-7, in which determining the 3D motion of the object includes determining positions of key points on the object in a 3D camera space based on the pose tensor, the key points including a root point and one or more other key points connected to the root point; converting the positions of the key points in the 3D camera space to positions of the key points in a 2D plane of the video by projecting the 3D camera space to the 2D plane based on a resolution of the video; generating a position parameter representing a motion of the root point in the 3D camera space based on the pose tensor and the positions of the key points in the 2D plane; and determining the 3D motion of the object based on the motion of the root point in the 3D camera space.
- Example 9 provides the computer-implemented method of example 8, in which determining the 3D motion of the object further includes optimizing the position parameter and the pose tensor based on an objective function; and determining the 3D motion of the object based on the optimized position parameter and the optimized pose tensor.
- Example 10 provides the computer-implemented method of any one of examples 1-9, further including generating a 3D animation that illustrates the 3D motion of the object, in which the video is a monocular video.
- Example 11 provides one or more non-transitory computer-readable media storing instructions executable to perform operations, the operations including extracting an image from a frame of a video based on detection of an object in the frame, the image capturing the object; generating, by a backbone in a neural network using the image, a feature map of the object, the neural network further including a first branch and a second branch that are coupled to the backbone; generating, by the first branch using the feature map, a heat map including one or more pixels, a pixel encoding a likelihood of a point on the object being a key point on the object; generating, by the second branch using the feature map and the heat map, a pose tensor including one or more parameters that represent a three-dimensional (3D) pose of the object in the frame; and determining a 3D motion of the object based on the heat map and the pose tensor.
- Example 12 provides the one or more non-transitory computer-readable media of example 11, in which the video includes a sequence of frames, the frame is a first frame in the sequence, and the operations further include extracting a second image from a second frame in the sequence based on the heat map for the first frame; and generating, by the neural network using the second image, a second heat map and a second pose tensor, in which the three-dimensional motion of the object is determined further based on the second heat map and the second pose tensor.
- Example 13 provides the one or more non-transitory computer-readable media of example 11 or 12, in which the operations further include segmenting, by a third branch in the neural network using the feature map, a part of the object from the object; and generating, by the third branch, one or more masks representing the segmented part, in which the 3D motion of the object is determined further based on the one or more masks.
- Example 14 provides the one or more non-transitory computer-readable media of example 13, in which the operations further include constructing a 3D ground plane based on the one or more masks, the 3D ground plane representing a ground on which the 3D motion of the object is carried out.
- Example 15 provides the one or more non-transitory computer-readable media of any one of examples 11-14, in which the heat map is a 3D tensor including a number of 2D matrices, a 2D matrix includes pixels arranged in rows and columns, and the number of 2D matrices is determined based on a number of key points on the object.
- Example 16 provides the one or more non-transitory computer-readable media of any one of examples 11-15, in which the second branch includes a convolutional layer, a concatenate layer, and a regression layer.
- Example 17 provides the one or more non-transitory computer-readable media of any one of examples 11-16, in which determining the 3D motion of the object includes determining positions of key points on the object in a 3D camera space based on the pose tensor, the key points including a root point and one or more other key points connected to the root point; converting the positions of the key points in the 3D camera space to positions of the key points in a 2D plane of the video by projecting the 3D camera space to the 2D plane based on a resolution of the video; generating a position parameter representing a motion of the root point in the 3D camera space based on the pose tensor and the positions of the key points in the 2D plane; and determining the 3D motion of the object based on the motion of the root point in the 3D camera space.
- Example 18 provides an apparatus, including a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations including extracting an image from a frame of a video based on detection of an object in the frame, the image capturing the object, generating, by a backbone in a neural network using the image, a feature map of the object, the neural network further including a first branch and a second branch that are coupled to the backbone, generating, by the first branch using the feature map, a heat map including one or more pixels, a pixel encoding a likelihood of a point on the object being a key point on the object, generating, by the second branch using the feature map and the heat map, a pose tensor including one or more parameters that represent a three-dimensional (3D) pose of the object in the frame, and determining a 3D motion of the object based on the heat map and the pose tensor.
- Example 19 provides the apparatus of example 18, in which the video includes a sequence of frames, the frame is a first frame in the sequence, and the operations further include extracting a second image from a second frame in the sequence based on the heat map for the first frame; and generating, by the neural network using the second image, a second heat map and a second pose tensor, in which the three-dimensional motion of the object is determined further based on the second heat map and the second pose tensor.
- Example 20 provides the apparatus of example 19, in which determining the 3D motion of the object includes determining positions of key points on the object in a 3D camera space based on the pose tensor, the key points including a root point and one or more other key points connected to the root point; converting the positions of the key points in the 3D camera space to positions of the key points in a 2D plane of the video by projecting the 3D camera space to the 2D plane based on a resolution of the video; generating a position parameter representing a motion of the root point in the 3D camera space based on the pose tensor and the positions of the key points in the 2D plane; and determining the 3D motion of the object based on the motion of the root point in the 3D camera space.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
A three-dimensional (3D) motion of an object may be determined using a multi-task deep neural network (DNN). Frames in a video capturing the object may be cropped and input into the neural network. The DNN backbone may extract features of the object from a cropped frame and output a feature map. The feature map may be processed in the first DNN branch to compute a heat map representing 2D key points on the object. The feature map and the heat map may be processed in the second DNN branch to compute a pose tensor that includes 3D pose parameters of the object. The feature map may also be processed in the third DNN branch to compute one or more masks representing a segmented part of the object. The heat maps, pose tensors, and masks of the cropped frames may be processed together to determine the 3D motion of the object.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/137099 WO2025118238A1 (fr) | 2023-12-07 | 2023-12-07 | Suivi de mouvement avec réseau neuronal à tâches multiples |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2023/137099 WO2025118238A1 (fr) | 2023-12-07 | 2023-12-07 | Suivi de mouvement avec réseau neuronal à tâches multiples |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025118238A1 true WO2025118238A1 (fr) | 2025-06-12 |
Family
ID=95981534
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2023/137099 Pending WO2025118238A1 (fr) | 2023-12-07 | 2023-12-07 | Suivi de mouvement avec réseau neuronal à tâches multiples |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025118238A1 (fr) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109087329A (zh) * | 2018-07-27 | 2018-12-25 | 中山大学 | 基于深度网络的人体三维关节点估计框架及其定位方法 |
| EP3493104A1 (fr) * | 2017-12-03 | 2019-06-05 | Facebook, Inc. | Optimisations pour la détection, la segmentation et le mappage de structure d'une instance d'objet dynamique |
| CN112766186A (zh) * | 2021-01-22 | 2021-05-07 | 北京工业大学 | 一种基于多任务学习的实时人脸检测及头部姿态估计方法 |
| CN113095262A (zh) * | 2021-04-21 | 2021-07-09 | 大连理工大学 | 一种基于多任务信息互补的三维体素手势姿态估计方法 |
| CN114066932A (zh) * | 2021-09-26 | 2022-02-18 | 浙江工业大学 | 一种实时的基于深度学习的多人人体三维姿态估计和跟踪方法 |
- 2023-12-07 WO PCT/CN2023/137099 patent/WO2025118238A1/fr active Pending
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3493104A1 (fr) * | 2017-12-03 | 2019-06-05 | Facebook, Inc. | Optimisations pour la détection, la segmentation et le mappage de structure d'une instance d'objet dynamique |
| CN109087329A (zh) * | 2018-07-27 | 2018-12-25 | 中山大学 | 基于深度网络的人体三维关节点估计框架及其定位方法 |
| CN112766186A (zh) * | 2021-01-22 | 2021-05-07 | 北京工业大学 | 一种基于多任务学习的实时人脸检测及头部姿态估计方法 |
| CN113095262A (zh) * | 2021-04-21 | 2021-07-09 | 大连理工大学 | 一种基于多任务信息互补的三维体素手势姿态估计方法 |
| CN114066932A (zh) * | 2021-09-26 | 2022-02-18 | 浙江工业大学 | 一种实时的基于深度学习的多人人体三维姿态估计和跟踪方法 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12154188B2 (en) | Training neural networks for vehicle re-identification | |
| US9552510B2 (en) | Facial expression capture for character animation | |
| Urtasun et al. | Sparse probabilistic regression for activity-independent human pose inference | |
| US8917907B2 (en) | Continuous linear dynamic systems | |
| US10943352B2 (en) | Object shape regression using wasserstein distance | |
| US20150347846A1 (en) | Tracking using sensor data | |
| CN111429414B (zh) | 基于人工智能的病灶影像样本确定方法和相关装置 | |
| Khan et al. | SkinViT: A transformer based method for Melanoma and Nonmelanoma classification | |
| WO2023151237A1 (fr) | Procédé et appareil d'estimation de position du visage, dispositif électronique et support de stockage | |
| US20230016455A1 (en) | Decomposing a deconvolution into multiple convolutions | |
| Kalash et al. | Relative saliency and ranking: Models, metrics, data and benchmarks | |
| US20230401427A1 (en) | Training neural network with budding ensemble architecture based on diversity loss | |
| WO2023041181A1 (fr) | Dispositif électronique et procédé de détermination de taille humaine à l'aide de réseaux neuronaux | |
| EP4607477A1 (fr) | Détermination de la position du regard sur de multiples écrans à l'aide d'une caméra monoculaire | |
| WO2025118238A1 (fr) | Suivi de mouvement avec réseau neuronal à tâches multiples | |
| Wang et al. | Swimmer’s posture recognition and correction method based on embedded depth image skeleton tracking | |
| US20240144447A1 (en) | Saliency maps and concept formation intensity for diffusion models | |
| Chun-man et al. | Face expression recognition based on improved MobileNeXt | |
| WO2025123208A1 (fr) | Réseau d'annotation pour estimation de pose tridimensionnelle | |
| WO2024072472A1 (fr) | Génération de carte d'activation de classe efficace sans gradient | |
| WO2025200078A1 (fr) | Suivi de visage basé sur une agrégation spatio-temporelle et antérieur rigide | |
| WO2025200079A1 (fr) | Codeur assimilable convertissant un nuage de points en grille pour reconnaissance visuelle | |
| WO2025097349A1 (fr) | Vision artificielle basée sur un graphe à l'aide d'un apprenant à grille progressive et d'un réseau neuronal convolutif | |
| US20250111205A1 (en) | Multi-scale neural network for anomaly detection | |
| Raza et al. | Lightweight-CancerNet: a deep learning approach for brain tumor detection |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23960550 Country of ref document: EP Kind code of ref document: A1 |