WO2024177851A1 - Dynamic temporal fusion for video recognition - Google Patents
Dynamic temporal fusion for video recognition
- Publication number
- WO2024177851A1 (PCT/US2024/015603)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- temporal
- local
- context features
- features
- generating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
Definitions
- CNNs can be used for various recognition tasks.
- CNNs are a network architecture for deep learning that learns directly from data and are used to find patterns in images to recognize objects, classes or categories.
- a CNN can be trained to identify a type of vehicle or an animal that might be in an image.
- CNNs can also be used for video recognition.
- the techniques described herein relate to an apparatus for performing video action classification, including: at least one memory; and at least one processor coupled to at least one memory and configured to: generate, via a first network, frame-level features obtained from a set of input frames; generate, via a first multi-scale temporal feature fusion engine, first local temporal context features from a first neighboring sub-sequence of the set of input frames; generate, via a second multi-scale temporal feature fusion engine, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and classify the set of input frames based on the first local temporal context features and the second local temporal context features.
- the techniques described herein relate to a method of classifying video, the method including one or more of: generating, via a first network, frame-level features obtained from a set of input frames; generating, via a first multi-scale temporal feature fusion engine, first local temporal context features from a first neighboring sub-sequence of the set of input frames; generating, via a second multi-scale temporal feature fusion engine, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and classifying the set of input frames based on the first local temporal context features and the second local temporal context features.
- the techniques described herein relate to a non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to be configured to: generate, via a first network, frame-level features obtained from a set of input frames; generate, via a first multi-scale temporal feature fusion engine, first local temporal context features from a first neighboring sub-sequence of the set of input frames; generate, via a second multi- scale temporal feature fusion engine, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and classify the set of input frames based on the first local temporal context features and the second local temporal context features.
- the techniques described herein relate to an apparatus for generating virtual content in a distributed system, the apparatus including one or more of: means for generating, via a first network, frame-level features obtained from a set of input frames; means for generating, via a first multi-scale temporal feature fusion engine, first local temporal context features from a first neighboring sub-sequence of the set of input frames; means for generating, via a second multi-scale temporal feature fusion engine, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and means for classifying the set of input frames based on the first local temporal context features and the second local temporal context features.
- the techniques described herein relate to an apparatus for performing video classification, including one or more of: a neural network configured to generate frame-level features in consecutive frames from a set of video frames; a first multi-scale temporal feature fusion engine having a first kernel size configured to generate first local context features based on the frame-level features; a second multi- scale temporal feature fusion engine having a second kernel size configured to generate second local context features based on the frame-level features; a first temporal-relation cross transformer classifier configured to generate a first distance between a query video associated with the set of video frames and sets of support videos based on the first local context features; a second temporal-relation cross transformer classifier configured to generate a second distance between a query video associated with the set of video frames and the sets of support videos based on the second local context features; and a calculating engine configured to calculate a final distance between the query video and the sets of support videos based on the first distance and the second distance
- the techniques described herein relate to a method of performing video classification, the method including one or more of: generating, via a neural network configured to receive a set of video frames, frame-level features in consecutive frames from the set of video frames; generating, via a first multi-scale temporal feature fusion engine having a first kernel size, first local context features based on the frame-level features; generating, via a second multi-scale temporal feature fusion engine having a second kernel size, second local context features based on the frame-level features; generating, via a first temporal-relation cross transformer classifier and based on the first local context features, a first distance between a query video associated with the set of video frames and sets of support videos; generating, via a second temporal-relation cross transformer classifier and based on the second local context features, a second distance between a query video associated with the set of video frames and the sets of support videos; and calculating a final distance between the query video and the sets of support videos based on the first distance and the second distance.
- the techniques described herein relate to a non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to be configured to: generate, via a neural network configured to receive a set of video frames, frame-level features in consecutive frames from the set of video frames; generate, via a first multi-scale temporal feature fusion engine having a first kernel size, first local context features based on the frame-level features; generate, via a second multi-scale temporal feature fusion engine having a second kernel size, second local context features based on the frame-level features; generate, via a first temporal-relation cross transformer classifier and based on the first local context features, a first distance between a query video associated with the set of video frames and sets of support videos; generate, via a second temporal-relation cross transformer classifier and based on the second local context features, a second distance between a query video associated with the set of video frames and the sets of support videos; and calculate a final distance between the query video and the sets of support videos based on the first distance and the second distance.
- the techniques described herein relate to an apparatus for generating virtual content in a distributed system, the apparatus including one or more of: means for generating, via a neural network configured to receive a set of video frames, frame-level features in consecutive frames from the set of video frames; means for generating, via a first multi-scale temporal feature fusion engine having a first kernel size, first local context features based on the frame-level features; means for generating, via a second multi-scale temporal feature fusion engine having a second kernel size, second local context features based on the frame-level features; means for generating, via a first temporal-relation cross transformer classifier and based on the first local context features, a first distance between a query video associated with the set of video frames and sets of support videos; means for generating, via a second temporal-relation cross transformer classifier and based on the second local context features, a second distance between a query video associated with the set of video frames and the sets of support videos; and means for calculating a final distance between the query video and the sets of support videos based on the first distance and the second distance.
- one or more of the apparatuses described herein is, is part of, and/or includes an extended reality (XR) device or system (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a mobile device (e.g., a mobile telephone or other mobile device), a wearable device, a wireless communication device, a camera, a personal computer, a laptop computer, a vehicle or a computing device or component of a vehicle, a server computer or server device (e.g., an edge or cloud-based server, a personal computer acting as a server device, a mobile device such as a mobile phone acting as a server device, an XR device acting as a server device, a vehicle acting as a server device, a network router, or other device acting as a server device), another device, or a combination thereof.
- XR extended reality
- VR virtual reality
- AR augmented reality
- the apparatus includes a camera or multiple cameras for capturing one or more images. In some aspects, the apparatus further includes a display for displaying one or more images, notifications, and/or other displayable data. In some aspects, the apparatuses described above can include one or more sensors (e.g., one or more inertial measurement units (IMUs), such as one or more gyroscopes, one or more gyrometers, one or more accelerometers, any combination thereof, and/or other sensors).
- IMUs inertial measurement units
- aspects may be implemented via integrated chip implementations or other non-module-component based devices (e.g., end-user devices, vehicles, communication devices, computing devices, industrial equipment, retail/purchasing devices, medical devices, and/or artificial intelligence devices).
- aspects may be implemented in chip-level components, modular components, non-modular components, non-chip-level components, device-level components, and/or system-level components.
- Devices incorporating described aspects and features may include additional components and features for implementation and practice of claimed and described aspects. It is intended that aspects described herein may be practiced in a wide variety of devices, components, systems, distributed arrangements, and/or end-user devices of varying size, shape, and constitution.
- FIG.1 illustrates a multi-scale temporal feature fusion engine for a sequence of frame-level features for a video, in accordance with some examples
- FIG.2 illustrates the use of multiple multi-scale temporal feature fusion engines for a sequence of frame-level features for the video in a classifier, in accordance with some examples
- FIG. 3 is a flow diagram illustrating an example of a process for performing multi-scale temporal feature fusion, in accordance with some examples
- FIG. 4 is a flow diagram illustrating an example of a process for classifying a video using multiple multi-scale temporal feature fusion engines using different kernel sizes, in accordance with some examples; and
- FIG. 5 is a block diagram illustrating an example of a computing system, in accordance with some examples.
DETAILED DESCRIPTION
- Certain aspects of this disclosure are provided below for illustration purposes. Alternate aspects may be devised without departing from the scope of the disclosure. Additionally, well-known elements of the disclosure will not be described in detail or will be omitted so as not to obscure the relevant details of the disclosure.
- a video that is fed into a convolutional neural network (CNN) for video recognition can include temporal dynamics as well as spatial appearance.
- a two-dimensional CNN may process a video representation, but such a CNN is usually applied to individual frames and cannot model the temporal information of a video.
- the two-dimensional CNN processes data by sliding a kernel along two dimensions of the data, such as along an image width and an image height.
- the two-dimensional CNN can extract the spatial features (e.g., edges, color distribution, and so forth) from the data using its kernel.
- Three-dimensional CNNs can jointly learn spatial and temporal features, but the computation cost is large, making deployment on edge devices (e.g., edge devices that provide an entry point into a service provider core network, such as a router, switch, multiplexer, or other device) difficult.
- a three-dimensional CNN can be used with three-dimensional image data such as magnetic resonance imaging (MRI) data or video data.
- MRI magnetic resonance imaging
- the kernel moves in three directions, and the input and output data of the three-dimensional CNN are four-dimensional. What is needed in the art is a new approach to designing effective video representations to tackle these challenges.
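- As a rough illustration of the preceding two paragraphs, the sketch below (an assumption for illustration; the patent does not prescribe any particular framework) contrasts how two-dimensional and three-dimensional convolutions treat a short clip, using PyTorch-style tensors with arbitrary channel counts and kernel sizes.

```python
# Illustrative sketch (not taken from the patent): contrasting the tensor shapes handled
# by two-dimensional and three-dimensional convolutions. Channel counts, kernel sizes,
# and the 8x112x112 clip size are arbitrary assumptions.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 8, 112, 112)   # (batch, channels, time, height, width)

# A 2D CNN slides its kernel over height and width only, so it processes one frame at a time.
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
per_frame = torch.stack([conv2d(clip[:, :, t]) for t in range(clip.shape[2])], dim=2)
print(per_frame.shape)                   # torch.Size([1, 16, 8, 112, 112])

# A 3D CNN also slides its kernel along time, jointly mixing spatial and temporal
# information on four-dimensional (channels x time x height x width) data, at a much
# higher computational cost.
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv3d(clip).shape)                # torch.Size([1, 16, 8, 112, 112])
```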
- a spatial average pooling approach is often applied to summarize the width × height sized feature as a one-dimensional feature at the last layer of a neural network.
- the use of spatial average pooling can preserve overall image-level characteristics.
- the use of spatial average pooling can also reduce the complexity of the feature.
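- A minimal sketch of the spatial average pooling described above, assuming a 512-channel, 7×7 feature map at the last layer (the sizes are illustrative assumptions):

```python
# Minimal sketch of spatial average pooling at the last layer of a 2D CNN.
# The 512-channel, 7x7 feature map is an illustrative assumption.
import torch
import torch.nn as nn

feature_map = torch.randn(1, 512, 7, 7)          # (batch, channels, height, width)
pooled = nn.AdaptiveAvgPool2d(1)(feature_map)    # average over width x height
frame_feature = pooled.flatten(1)                # one-dimensional frame-level feature
print(frame_feature.shape)                       # torch.Size([1, 512])
```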
- Systems and techniques are described herein for generating video representations that convey different temporal dynamics at different temporal periods of video recognition. The video processing can make the video representations more robust.
- the systems and techniques can include extracting and merging temporal information at different frame rates.
- the systems and techniques provide effective video representations to enable video recognition by convolutional neural networks or other networks.
- the systems and techniques introduce a temporal fusion (also referred to as a temporal module) engine, which may be on top of certain features (e.g., the average-pooled feature) of two-dimensional CNNs.
- given a video input (e.g., a sequence of image or video frames), different temporal dynamics can be conveyed at different temporal granularities (e.g., different frame-rates), which can make the video representation more robust.
- a fine-grained frame-level feature may show less temporal dynamics than a tuple-level (a set of neighboring frames) feature.
- the systems and techniques described herein can extract and merge temporal information to leverage temporal dynamics from the features in diverse temporal granularity.
- the system and techniques described herein can improve the classification (e.g., recognition, detection, etc.) of a video based on support videos.
- a system can be trained with five classes from five support videos.
- the example approach may be characterized as a five-way, five-shot classification.
- the five support videos may include, for example, one video for vehicles, one video for animals, one video for buildings, one video for plants and one video for tools.
- the input query videos will be processed as described herein to classify the input query videos into one of these classes.
- a distance value can be calculated between the query video and the set of support videos, and a classification probability can be calculated over the classes, which in some aspects can be related to the negative value of the distance.
- These systems and techniques can recognize actions of interest that are identified by the support videos in testing (query) videos.
- the system and techniques can include a multi-scale temporal feature fusion (MSTFF) module (e.g., as described below with respect to FIG. 1), where the features describing local temporal contexts in videos are enhanced by collaboratively merging important information in frame-level features (e.g., with no temporal context).
- MSTFF multi-scale temporal feature fusion
- the systems and techniques can classify input videos by utilizing multiple MSTFF modules varying the scope of local temporal context extraction (e.g., as described below with respect to FIG.2).
- the system can obtain a discriminative video representation which can be useful in a few-shot task where support videos are not sufficient (e.g., a case where there are not enough support videos) to describe an action class.
- the systems and techniques can include learning a local temporal context-level auxiliary classifier in parallel with the main classifier (e.g., as described below with respect to FIG.2).
- FIG.1 illustrates an example of a MSTFF module 100 for processing a sequence of frame-level features for a video.
- Action recognition has been widely studied in deep learning, but most methods require large-scale video datasets as noted above.
- a learned deep network can face videos whose action classes are unseen during training.
- few-shot action recognition aims to recognize fine-grained actions in testing videos based on meager support videos whose action classes are unseen in training.
- few-shot action recognition aims to recognize action classes from input videos with few training samples or few support videos used for training a model.
- Because support videos are not enough to reliably represent an action class in the few-shot setting, it is helpful to extract meaningful temporal information to describe the actions of interest from videos.
- Because the meaningful cues are included in parts of a video rather than over all the frames in the video, it is also helpful to reliably describe sub-actions in the video.
- a challenge exists due to the speed and the start and end time of an action being diverse depending on the videos.
- the tuples of the frames of query and support videos can be matched at multiple cardinalities.
- better sub-sequence representation can be generated considering the spatial context in frames as well as the temporal context.
- a hierarchical matching model can use coarse-to-fine cues in spatial and temporal domains, respectively.
- the frame-level features from a backbone network may already include some information for the spatial context.
- the systems and techniques described herein provide a collaborative fusion of two different types of features in different temporal scales: frame-level features and features for the temporal local context in the sub- sequences of video frames.
- the systems and techniques can utilize the MSTFF module 100 shown in FIG. 1.
- the MSTFF module 100 can generate robust temporal local context representations preserving the important information in the frame-level features for an input video represented by frame-level features X 102.
- the MSTFF module 100 can extract local temporal context information from neighboring frame-level features.
- a cross-attention module 116 can propagate highly relevant frame-level features to the features including the local temporal context. For example, through the cross-attention module 116, two features in different temporal scales are combined with high compatibility. Additionally, a local temporal context-level auxiliary classifier (as shown in FIG. 2 but not in FIG. 1) can be used that induces stable learning of the model and boosts the few-shot action recognition performance, which is discarded at the testing phase.
- the MSTFF module 100 illustrates a combination of a one-dimensional convolution layer 104 (which can in some aspects be temporal in nature) and a cross attention module 116.
- the one-dimensional convolution layer 104 (e.g., with a kernel size k) can summarize the information in k consecutive frames, resulting in outputs U 114, which serve as the query 112 to the cross-attention module 116.
- the cross-attention module 116 receives a query 112 (e.g., a value U can represent coarse-grained features) and a key 110, and a value 108 (e.g., a parameter X can represent fine-grained features) and can process the data by providing a weighted sum of values based on a relationship between the query 112 and the key 110.
- the described process can mean transferring the knowledge of fine-grained features (before the one-dimensional convolution layer 104) to the coarse-grained ones (after the one-dimensional convolution layer 104). There also can be a skip-connection to preserve the information of the query features.
- the cross-attention module 116 can convey information from two different temporal granularities (e.g., for a frame-level and a tuple-level). Due to the challenge of addressing videos in the few-shot regime, much effort has been put into the problem. While memory networks are exploited in some cases to obtain key-frame representations, query and support videos of different lengths can be aligned.
- a video or query can be represented by a sequence of uniformly sampled $T$ frames.
- a backbone network (e.g., a convolutional network or network 204 of any type and that is not shown in FIG. 1 but is shown in FIG. 2) can generate frame-level features $X = \{x_1, \ldots, x_T\}$ 102, where $x_i \in \mathbb{R}^{D}$.
- the MSTFF module 100 can apply a one-dimensional convolution layer 104 along a temporal axis, called the temporal one-dimensional convolution, to obtain local temporal context features from a neighboring sub-sequence of the frames by $U = W_c \ast X$ (1), where $\ast$ denotes the convolution operation along the temporal axis and $W_c$ denotes a weight of the one-dimensional convolutional layer 104.
- $k$ is the kernel size of $W_c$.
- the length of the sequence $U$ is $T'$, where $T' \le T$.
- $U = \{u_1, \ldots, u_{T'}\}$ 114 represents the local temporal context information.
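- A small sketch of the temporal one-dimensional convolution of equation (1); the feature dimension, the number of frames, the kernel size, and the use of no padding with unit stride (so that $T' = T - k + 1$) are assumptions for illustration:

```python
# Sketch of the temporal one-dimensional convolution of equation (1). D = 512, T = 8,
# k = 3, no padding, and unit stride are assumptions for illustration.
import torch
import torch.nn as nn

T, D, k = 8, 512, 3
X = torch.randn(1, D, T)                        # frame-level features along the temporal axis
temporal_conv = nn.Conv1d(D, D, kernel_size=k)  # weights W_c of layer 104
U = temporal_conv(X)                            # local temporal context features
print(U.shape)                                  # torch.Size([1, 512, 6]); here T' = T - k + 1
```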
- the system can apply a cross-attention module 116 in which the local temporal context features U 114 are attended, as queries 112, over the frame-level features X 102.
- the process can be referred to as cross-attention.
- the feature sequence $U$ 114 and the frame-level features $X$ 102 can be projected to queries $Q = \{q_1, \ldots, q_{T'}\}$ 112 and key-value pairs $(K, V)$ (e.g., key 110 and value 108), respectively.
- the cross-attended feature $z_j$ 122 for $u_j$ is computed by $z_j = \mathrm{softmax}\!\left(q_j K^{\top} / \tau\right) V$ (2), where the temperature $\tau$ is used to scale the dot-product of the query 112 and the key 110.
- the computation can be performed by component or module 120 which can perform, for example, a bn-tanh (batch-normalization tanh) operation. Tanh is a hyperbolic tangent function.
- the batch-normalization can be performed before a tanh operation.
- the mean and standard deviation of the batch of activations can be calculated.
- the mean can be subtracted from each activation value, and then each will be divided by the batch’s standard deviation.
- the expected value of any activation is now zero, which is the central value of the input to the tanh function, and the standard deviation of an activation is one, which means that most activation values will be between $[-1, 1]$.
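- A minimal sketch of the bn-tanh operation of module 120 (batch normalization followed by a hyperbolic tangent); the feature and batch sizes are assumptions:

```python
# Sketch of the bn-tanh operation: batch normalization followed by tanh.
# The feature size (512) and batch size (16) are assumptions for illustration.
import torch
import torch.nn as nn

bn_tanh = nn.Sequential(nn.BatchNorm1d(512), nn.Tanh())
activations = torch.randn(16, 512)     # a batch of activation vectors
out = bn_tanh(activations)             # normalization centers the tanh inputs near zero
print(out.abs().max().item() <= 1.0)   # True: tanh outputs lie within [-1, 1]
```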
- the system also combines $z_j$ 122 with neighboring frame-level features 118 by simple average pooling via an average pooling module 106 (e.g., by one-dimensional average pooling) on the frame-level features $X$ 102 along the temporal axis, which results in $T'$ pooled frame-level features $\bar{x}_j$ 118.
- temporal context features $\tilde{U} = \{\tilde{u}_1, \ldots, \tilde{u}_{T'}\}$ 126 are generated, via a summing component 124, by $\tilde{u}_j = z_j + \bar{x}_j$ (3).
- the approach combines the frame-level features and the local temporal context-level features collaboratively.
- the average pooling module 106 can provide information of temporally neighboring frames at a single temporal granularity.
- the cross-attended features (e.g., based on the query 112, the value 108, and the key 110) and the temporally average pooled features 118 can be summed by the summing component 124 to generate the final features $\tilde{U}$ 126.
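- The sketch below puts equations (1)-(3) together into a minimal MSTFF-style module: a temporal one-dimensional convolution, single-head cross-attention with U as the query and X as the key and value, temporal average pooling, and the final sum. The linear projections, the single attention head, the temperature, and the placement of the skip-connection and the bn-tanh operation are assumptions rather than details taken from the patent.

```python
# Minimal sketch of an MSTFF-style module. Single-head attention, the linear projections,
# the temperature, and the placement of the skip-connection and bn-tanh are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSTFF(nn.Module):
    def __init__(self, dim: int, kernel_size: int, temperature: float = 1.0):
        super().__init__()
        self.k = kernel_size
        self.temperature = temperature
        # Equation (1): temporal 1D convolution summarizing k consecutive frames.
        self.temporal_conv = nn.Conv1d(dim, dim, kernel_size=kernel_size)
        # Cross-attention projections: U -> queries, X -> keys and values.
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # bn-tanh block (module 120 in FIG. 1).
        self.bn_tanh = nn.Sequential(nn.BatchNorm1d(dim), nn.Tanh())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: frame-level features X of shape (batch, T, dim).
        u = self.temporal_conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, T', dim)
        q, k, v = self.to_q(u), self.to_k(x), self.to_v(x)
        # Equation (2): attend the coarse-grained U over the fine-grained X.
        attn = F.softmax(q @ k.transpose(1, 2) / self.temperature, dim=-1)
        z = attn @ v                                                # cross-attended features
        z = z + u                                # skip-connection preserving the query (U) information
        z = self.bn_tanh(z.transpose(1, 2)).transpose(1, 2)
        # Average pooling of X over the same temporal window, giving T' pooled features.
        x_bar = F.avg_pool1d(x.transpose(1, 2), kernel_size=self.k, stride=1).transpose(1, 2)
        # Equation (3): sum the cross-attended and average-pooled features.
        return z + x_bar

# Example: T = 8 frames, feature dimension 512, kernel size 3 gives T' = 6 context features.
frame_features = torch.randn(2, 8, 512)
print(MSTFF(dim=512, kernel_size=3)(frame_features).shape)   # torch.Size([2, 6, 512])
```

Instantiating two such modules with different kernel sizes yields the two temporal granularities used by the classifier described next.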
- FIG.2 illustrates the use of multiple multi-scale temporal feature fusion engines for a sequence of frame-level features for a video processed by a classifier 200.
- the system can uniformly sample T frames 202 per video and employ a network 204.
- the network 204 can be any number of different networks such as a backbone network including a two-dimensional convolutional neural network.
- the ResNet backbone is an example network; a ResNet ("Residual Network") is a class of neural network used for computer vision tasks. Other neural networks can be used as well.
- the network 204 can be a pre-trained network on an image dataset such as the ImageNet dataset.
- the network 204 can be a backbone network such as a high-resolution network HRNet-48 as is known in the art.
- the network 204 extracts T frame-level features 206 for each video.
- the respective MSTFF module 208, 210 generates a sequence of $T'$ features describing the local temporal context.
- the resulting temporal context features 214, 216 (combined as a pair of local temporal context features 212) from support and query videos are fed into the final respective classifier 218, 220.
- Example classification modules include temporal relation matching (TRM) classification modules.
- a last convolution module of the network 204 can be replaced with a temporal fusion engine (e.g., the MSTFF module 100 or other baseline components).
- In a testing phase of the few-shot action recognition, an input testing video can be classified into one of $N$ classes, where each class is described by a handful of support videos and the classes are unseen during training.
- a meta-learning strategy can be used. In such a strategy, action classes in the training set and the testing set do not overlap.
- An N-way classification module 224 can produce the output of the classifier 200.
- an auxiliary classifier 222 can be jointly trained to categorize the input query into one of the ground-truth training classes rather than the target N classes of a given episode. The approach is beneficial for the network 204 to prevent overfitting and boost the few-shot N-way classification performance.
- the auxiliary classifier 222 can be applied on top of the MSTFF module, which can include two MSTFF modules 208, 210, each using a respective kernel value $k_1$, $k_2$.
- the design of the auxiliary classifier can include, in some aspects, a two-layer multilayer perceptron (MLP).
- each of the enhanced temporal context features $\tilde{u}_j$ 214, 216 can be fed into the auxiliary classifier 222, and the auxiliary classifier 222 is learned to classify data into one of the ground-truth training classes.
- the ground-truth label can be shared for all $\tilde{u}_j$ in the same video.
- each temporal context feature can better represent the action cues.
- the auxiliary classifier 222 is discarded in the deployed model.
- the auxiliary classifier 222 classifies the input query videos into one of the entire training class pool, which helps stable few-shot learning.
- the output can be a multi-way classification, such as the c-way classification 226 shown in FIG. 2.
- the auxiliary classifier 222 is not deployed in testing.
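- A sketch of the auxiliary classifier 222 as the two-layer MLP described above, applied to each enhanced temporal context feature; the hidden width and the size of the training class pool are illustrative assumptions:

```python
# Sketch of the auxiliary classifier 222: a two-layer MLP over each enhanced temporal
# context feature. The hidden width (1024) and the size of the training class pool
# (C = 64) are assumptions for illustration.
import torch
import torch.nn as nn

dim, hidden, num_train_classes = 512, 1024, 64
aux_classifier = nn.Sequential(
    nn.Linear(dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, num_train_classes),   # c-way classification over the training class pool
)

u_tilde = torch.randn(2, 6, dim)            # enhanced temporal context features from an MSTFF module
logits = aux_classifier(u_tilde)            # one prediction per temporal context feature
print(logits.shape)                         # torch.Size([2, 6, 64])
# The auxiliary classifier is trained jointly but discarded at test time.
```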
- a respective classifier 218, 220 can output distances $\{d_1, \ldots, d_N\}$ between a query video and N sets of support videos.
- the system obtains the final distance d between the query and support videos by accumulating the distances from all of the respective classifiers 218, 220.
- the classification probability over class $c$ is computed by $p(c) = \exp(-d_c) / \sum_{c'=1}^{N} \exp(-d_{c'})$ (4), which is proportional to the negative distance and is used to ultimately classify the video into one of the classes.
- the system can optimize the model by using a cross-entropy loss for N-way classification as the main loss: $\mathcal{L}_{\text{main}} = -\sum_{c=1}^{N} y_c \log p(c)$ (5).
- the system can learn the auxiliary classifier with an auxiliary cross-entropy loss $\mathcal{L}_{\text{aux}}$ computed over the ground-truth training classes (e.g., $\mathcal{L}_{\text{aux}} = -\sum_{c=1}^{C} y_c \log p_{\text{aux}}(c)$, where $C$ is the number of training classes).
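- A short sketch of equations (4)-(6) as reconstructed above: the accumulated distances become class probabilities, and the main and auxiliary cross-entropy losses are combined (the example values, class counts, and equal weighting of the two losses are assumptions):

```python
# Sketch of equations (4)-(6): distances -> probabilities -> cross-entropy losses.
# The example values, class counts, and equal loss weighting are assumptions.
import torch
import torch.nn.functional as F

# Distances from the two classifiers 218, 220 for an N = 5 way episode.
d_k1 = torch.rand(1, 5)
d_k2 = torch.rand(1, 5)
d = d_k1 + d_k2                            # accumulate the per-branch distances

probs = F.softmax(-d, dim=-1)              # equation (4): proportional to exp(-d_c)
target = torch.tensor([2])                 # ground-truth class of the query in this episode
main_loss = F.cross_entropy(-d, target)    # equation (5): N-way cross-entropy on -d

# Equation (6) (assumed form): auxiliary cross-entropy over the training class pool.
aux_logits = torch.randn(1, 64)            # output of the auxiliary classifier 222
aux_loss = F.cross_entropy(aux_logits, torch.tensor([17]))

total_loss = main_loss + aux_loss          # joint training objective (weighting assumed)
```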
- FIG.3 is a flow diagram illustrating an example of a process 300 for performing multi-scale temporal feature fusion, in accordance with some examples disclosed herein.
- the operations of the process 300 may be implemented as software components that are executed and run on one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof).
- the process 300 can be performed by any device or group of devices.
- the process 300 includes the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate (e.g., via a first network or via computing system 500) frame-level features obtained from a set of input frames.
- the first network can be a two-dimensional convolutional neural network.
- the process 300 includes the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate, via a first multi- scale temporal feature fusion engine 208, first local temporal context features from a first neighboring sub-sequence of the set of input frames.
- the first multi-scale temporal feature fusion engine can apply a first kernel value for generating the first local temporal context features and wherein the second multi-scale temporal feature fusion engine 210 applies a second kernel value for generating the second local temporal context features.
- the first neighboring sub-sequence of the set of input frames can equal the second neighboring sub-sequence of the set of input frames.
- the process to generate, via the first multi-scale temporal feature fusion engine, the first local temporal context features from the first neighboring sub-sequence of the set of input frames further can include one or more of generating, via a first convolutional neural network, first local temporal context features from the set of input frames; generating, via a first cross-attention module 116, a first cross attended feature output based on the first local temporal context features; generating, via a first average pooling module 106, a first average pooling dataset from the set of input frames; and generating the first local temporal context features by adding the first cross attended feature output to the first average pooling dataset.
- the first convolutional neural network and the second convolutional neural network each perform a one-dimensional convolution with a respective kernel.
- the first convolutional neural network and the second convolutional neural network each perform the one-dimensional convolution with the respective kernel to summarize information in consecutive k frames of the set of input frames to generate the first local temporal context features and the second local temporal context features.
- the first average pooling module 106 and the second average pooling module 106 each provide information of temporally neighboring frames at a single temporal granularity.
- the first cross-attention module 116 generates the first cross attended feature output based on a relationship between a query 112 and a key 110 associated with the set of input frames and wherein the second cross-attention module 116 generates the second cross attended feature output based on the relationship between the query 112 and the key 110 associated with the set of input frames.
- the process 300 can include the one or more processors being configured to: generate, via a second multi-scale temporal feature fusion engine 210, second local temporal context features from a second neighboring sub-sequence of the set of input frames.
- the step of generating, via the second multi-scale temporal feature fusion engine 210, the second local temporal context features from the second neighboring sub-sequence of the set of input frames further can include generating, via a second convolutional neural network, second local temporal context features from the set of input frames; generating, via a second cross attention module 116, a second cross attended feature output based on the second local temporal context features; generating, via a second average pooling module 106, a second average pooling dataset 118 from the set of input frames; and generating the second local temporal context features by adding the second cross attended feature output to the second average pooling dataset 118.
- the first cross-attention module 116 and the second cross-attention module 116 both transfer data associated with fine-grained features before a one- dimensional convolution on the set of input frames to data associated with coarse-grained features after the one-dimensional convolution on the set of input frames.
- the first cross attention module 116 and the second cross attention module 116 each convey information from two different temporal granularities.
- the two different temporal granularities comprise a frame-level granularity and a tuple-level granularity.
- the process 300 can include the one or more processors being configured to: classify the set of input frames based on the first local temporal context features and the second local temporal context features.
- the step in block 308 can further include classifying, via an auxiliary classifier 222, the first local temporal context features and the second local temporal context features during a training process.
- the auxiliary classifier 222 can include a two-layer multilayer perceptron (MLP).
- An example apparatus for performing video action classification (e.g., recognition, detection, etc.) can include at least one memory and at least one processor coupled to at least one memory.
- the at least one processor can be configured to: generate, via a first network, frame-level features obtained from a set of input frames; generate, via a first multi-scale temporal feature fusion engine 208, first local temporal context features from a first neighboring sub-sequence of the set of input frames; generate, via a second multi-scale temporal feature fusion engine 210, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and classify the set of input frames based on the first local temporal context features and the second local temporal context features.
- the video action classification can relate to recognition, detection, or other operations that relate to video actions.
- Another example apparatus for performing video classification can include a network 204 (e.g., a neural network) configured to receive a set of video frames and generate frame-level features in consecutive frames from the set of video frames; a first multi-scale temporal feature fusion engine 208 having a first kernel size configured to receive the frame-level features and generate first local context features; a second multi- scale temporal feature fusion engine 210 having a second kernel size configured to receive the frame-level features and generate second local context features; a first temporal- relation cross transformer classifier 218 configured to receive the first local context features and generate a first distance between a query video associated with the set of video frames and sets of support videos; a second temporal-relation cross transformer classifier 220 configured to receive the second local context features and generate a second distance between a query video associated with the set of video frames and the sets of support videos; and a calculating engine configured to calculate a final distance between the query video and the sets of support videos based on the first distance and the second distance.
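- To make the data flow between these components concrete, the structural sketch below wires simplified stand-ins together; the stand-in context branch and mean-feature distance head are assumptions and are not the MSTFF module or the temporal-relation cross transformer classifiers themselves:

```python
# Structural sketch of the two-branch classification pipeline. The context branch and
# the distance head below are simplified stand-ins (assumptions), not the MSTFF module
# or temporal-relation cross transformer classifiers of the patent.
import torch
import torch.nn as nn

class ContextBranch(nn.Module):
    """Stand-in for a multi-scale temporal feature fusion engine with its own kernel size."""
    def __init__(self, dim: int, kernel_size: int):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, x):                                       # x: (batch, T, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)     # (batch, T', dim)

def distance_head(query_ctx, support_ctx):
    """Stand-in for a classifier 218/220: distance from the query to each support class."""
    q = query_ctx.mean(dim=1)                                   # (1, dim)
    s = support_ctx.mean(dim=(1, 2))                            # (N, dim), one prototype per class
    return torch.cdist(q, s)[0]                                 # (N,) distances

# Frame-level features assumed to come from a 2D CNN backbone (dim = 512, T = 8 frames).
N, K, T, dim = 5, 5, 8, 512
query_feats = torch.randn(1, T, dim)
support_feats = torch.randn(N, K, T, dim)

branches = [ContextBranch(dim, k) for k in (3, 5)]              # kernel sizes k1, k2 assumed
final_distance = sum(
    distance_head(branch(query_feats),
                  branch(support_feats.flatten(0, 1)).view(N, K, -1, dim))
    for branch in branches
)
prediction = final_distance.argmin()                            # closest support class wins
```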
- FIG. 4 is a flow diagram illustrating an example of a process 400 for classifying a video using multiple multi-scale temporal feature fusion engines using different kernel sizes.
- the process 400 can be performed by any device or group of devices.
- the operations of the process 400 may be implemented as software components that are executed and run on one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof).
- the process 400 of performing video classification can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate, via a neural network configured to receive a set of video frames, frame-level features in consecutive frames from the set of video frames.
- the process 400 can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate, via a first multi-scale temporal feature fusion engine 208 having a first kernel size, first local context features based on the frame-level features.
- the process 400 can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate, via a second multi-scale temporal feature fusion engine 210 having a second kernel size, second local context features based on the frame-level features.
- the process 400 can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate, via a first temporal-relation cross transformer classifier 218 and based on the first local context features, a first distance between a query video associated with the set of video frames and sets of support videos.
- the process 400 can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: generate, via a second temporal-relation cross transformer classifier 220 and based on the second local context features, a second distance between the query video associated with the set of video frames and the sets of support videos.
- the process 400 can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: calculate a final distance between the query video and the sets of support videos based on the first distance and the second distance.
- the process 400 can include the one or more processors (e.g., processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof) being configured to: perform optimization by calculating a main loss based on the first temporal-relation cross transformer classifier 218 and the second temporal-relation cross transformer classifier 220 and an auxiliary loss based on an auxiliary classifier 222 used during a training process.
- a system can include a non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform any one or more operations disclosed herein.
- an apparatus for generating video content classification can include one or more means for performing any one or more operations disclosed herein.
- the processes described herein may be performed by a computing device or apparatus (e.g., a network server, a client device, or any other device, such as a processor 510 of FIG. 5 and/or other processor(s), the MSTFF module 100 and/or the classifier 200, or subcomponents thereof).
- the processes 300, 400 may be performed by a computer system.
- the process 300 and/or the process 400 may be performed by a computing device with the computing system 500 shown in FIG. 5.
- a wireless communication device with the computing architecture shown in FIG.5 may include the components of the computer system and may implement the operations of FIG.3 and/or FIG.4.
- the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of processes described herein.
- the computing device may include a display, one or more network interfaces configured to communicate and/or receive the data, any combination thereof, and/or other component(s).
- the one or more network interfaces may be configured to communicate and/or receive wired and/or wireless data, including data according to the 3G, 4G, 5G, and/or other cellular standard, data according to the WiFi (802.11x) standards, data according to the Bluetooth TM standard, data according to the Internet Protocol (IP) standard, and/or other types of data.
- the components of the computing device may be implemented in circuitry.
- the components may include and/or may be implemented using electronic circuits or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or may include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.
- CPUs central processing units
- the process 300 and the process 400 are illustrated as logical flow diagrams, the operations of which represent a sequence of operations that may be implemented in hardware, computer instructions, or a combination thereof.
- the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types.
- the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement the processes.
- FIG. 5 is a diagram illustrating an example of a system for implementing certain aspects of the present technology. In particular, FIG. 5 illustrates an example of computing system 500, which may be for example any computing device making up an internal computing system, a remote computing system, a camera, or any component thereof in which the components of the system are in communication with each other using connection 505.
- Connection 505 may be a physical connection using a bus, or a direct connection into processor 510, such as in a chipset architecture.
- Connection 505 may also be a virtual connection, networked connection, or logical connection.
- computing system 500 is a distributed system in which the functions described in this disclosure may be distributed within a datacenter, multiple data centers, a peer network, etc.
- one or more of the described system components represents many such components each performing some or all of the function for which the component is described.
- Example system 500 includes at least one processing unit (CPU or processor) 510 and connection 505 that communicatively couples various system components including system memory or cache 515, such as read-only memory (ROM) 520 and random access memory (RAM) 525 to processor 510.
- Computing system 500 may include a cache 515 of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 510.
- Processor 510 may include any general-purpose processor and a hardware service or software service, such as services 532, 534, and 536 stored in storage device 530, configured to control processor 510 as well as a special-purpose processor where software instructions are incorporated into the actual processor design.
- Processor 510 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc.
- a multi-core processor may be symmetric or asymmetric.
- computing system 500 includes an input device 545, which may represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, etc.
- Computing system 500 may also include output device 535, which may be one or more of a number of output mechanisms. In some instances, multimodal systems may enable a user to provide multiple types of input/output to communicate with computing system 500. Computing system 500 may include communications interface 540, which may generally govern and manage the user input and system output.
- the communication interface may perform or facilitate receipt and/or transmission of wired or wireless communications using wired and/or wireless transceivers, including those making use of an audio jack/plug, a microphone jack/plug, a universal serial bus (USB) port/plug, an Apple™ Lightning™ port/plug, an Ethernet port/plug, a fiber optic port/plug, a proprietary wired port/plug, 3G, 4G, 5G and/or other cellular data network wireless signal transfer, a Bluetooth™ wireless signal transfer, a Bluetooth™ low energy (BLE) wireless signal transfer, an IBEACON™ wireless signal transfer, a radio-frequency identification (RFID) wireless signal transfer, near-field communications (NFC) wireless signal transfer, dedicated short range communication (DSRC) wireless signal transfer, 802.11 Wi-Fi wireless signal transfer, wireless local area network (WLAN) signal transfer, Visible Light Communication (VLC), Worldwide Interoperability for Microwave Access (WiMAX), Infrared (IR) communication wireless signal transfer, Public Switched Telephone Network (PSTN) signal transfer, and/or other types of wired and/or wireless signal transfer.
- the communications interface 540 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers that are used to determine a location of the computing system 500 based on receipt of one or more signals from one or more satellites associated with one or more GNSS systems.
- GNSS systems include, but are not limited to, the US-based Global Positioning System (GPS), the Russia-based Global Navigation Satellite System (GLONASS), the China-based BeiDou Navigation Satellite System (BDS), and the Europe-based Galileo GNSS.
- GPS Global Positioning System
- GLONASS Russia-based Global Navigation Satellite System
- BDS BeiDou Navigation Satellite System
- Galileo GNSS Europe-based Galileo GNSS
- Storage device 530 may be a non-volatile and/or non-transitory and/or computer-readable memory device and may be a hard disk or other types of computer readable media which may store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, a floppy disk, a flexible disk, a hard disk, magnetic tape, a magnetic strip/stripe, any other magnetic storage medium, flash memory, memristor memory, any other solid-state memory, a compact disc read only memory (CD-ROM) optical disc, a rewritable compact disc (CD) optical disc, digital video disk (DVD) optical disc, a Blu-ray disc (BDD) optical disc, a holographic optical disk, another optical medium, a secure digital (SD) card, a micro secure digital (microSD) card, a Memory Stick® card, a smartcard chip, an EMV chip, a subscriber identity module (SIM) card, and/or any other suitable storage medium.
- the storage device 530 may include software services, servers, services, etc., that when the code that defines such software is executed by the processor 510, it causes the system to perform a function.
- a hardware service that performs a particular function may include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 510, connection 505, output device 535, etc., to carry out the function.
- computer- readable medium includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data.
- a computer-readable medium may include a non- transitory medium in which data may be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections.
- Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices.
- a computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements.
- a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.
- This disclosure describes the MSTFF module 100 shown in FIG. 1 and the local temporal context feature-level auxiliary classifier 222 shown in FIG. 2 for few-shot action recognition. Use of the MSTFF module 100 is effective to obtain a richer video descriptor by collaboratively combining frame-level and local temporal context-level features.
- circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the aspects in unnecessary detail.
- well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.
- those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.
- a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
- When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or to the main function.
- Processes and methods according to the above-described examples may be implemented using computer-executable instructions that are stored or otherwise available from computer-readable media. Such instructions may include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used may be accessible over a network.
- the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code.
- Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.
- the computer-readable storage devices, mediums, and memories may include a cable or wireless signal containing a bitstream and the like.
- non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
- data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof, in some cases depending in part on the particular application, in part on the desired design, in part on the corresponding technology, etc.
- the program code or code segments to perform the necessary tasks may be stored in a computer-readable or machine-readable medium.
- a processor(s) may perform the necessary tasks. Examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also may be embodied in peripherals or add-in cards. Such functionality may also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
- the instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.
- the techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general purposes computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices.
- the techniques may be realized at least in part by a computer- readable data storage medium comprising program code including instructions that, when executed, performs one or more of the methods, algorithms, and/or operations described above.
- the computer-readable data storage medium may form part of a computer program product, which may include packaging materials.
- the computer-readable medium may comprise memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like.
- the techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that may be accessed, read, and/or executed by a computer, such as propagated signals or waves.
- the program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
- Such a processor may be configured to perform any of the techniques described in this disclosure.
- a general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
- a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.
- “Coupled to” or “communicatively coupled to” refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly.
- Claim language or other language reciting “at least one of” a set and/or “one or more” of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B.
- claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, or A and B and C, or any duplicate information or data (e.g., A and A, B and B, C and C, A and A and B, and so on), or any other ordering, duplication, or combination of A, B, and C.
- the language “at least one of” a set and/or “one or more” of a set does not limit the set to the items listed in the set.
- Aspect 1. An apparatus for performing video action classification comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: generate, via a first network, frame-level features obtained from a set of input frames; generate, via a first multi-scale temporal feature fusion engine, first local temporal context features from a first neighboring sub-sequence of the set of input frames; generate, via a second multi-scale temporal feature fusion engine, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and classify the set of input frames based on the first local temporal context features and the second local temporal context features.
- Aspect 2. The apparatus of Aspect 1, wherein the first multi-scale temporal feature fusion engine applies a first kernel value for generating the first local temporal context features and wherein the second multi-scale temporal feature fusion engine applies a second kernel value for generating the second local temporal context features.
- Aspect 3. The apparatus of Aspect 1, wherein the at least one processor is further configured to: classify, via an auxiliary classifier, the first local temporal context features and the second local temporal context features during a training process.
- Aspect 4. The apparatus of Aspect 3, wherein the auxiliary classifier comprises a two-layer multilayer perceptron (MLP).
- Aspect 5. The apparatus of any of Aspects 1 to 4, wherein the first network comprises a two-dimensional convolutional neural network.
- Aspect 6. The apparatus of any one of Aspects 1 to 5, wherein the at least one processor is further configured to generate, via the first multi-scale temporal feature fusion engine, the first local temporal context features from the first neighboring sub-sequence of the set of input frames by: generating, via a first convolutional neural network, first local temporal context features from the set of input frames; generating, via a first cross attention module, a first cross attended feature output based on the first local temporal context features; generating, via a first average pooling module, a first average pooling dataset from the set of input frames; and generating the first local temporal context features by adding the first cross attended feature output to the first average pooling dataset.
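- The steps recited in Aspect 6 can be pictured with the hedged PyTorch sketch below: a one-dimensional convolution produces coarse (tuple-level) features, a cross attention module carries fine-grained frame-level information into them, and the result is added to an average pooling of the input features. The module and parameter names are assumptions, odd kernel sizes are assumed so the temporal length is preserved, and the sketch is not the disclosed implementation.

```python
import torch
import torch.nn as nn

class MSTFFEngineSketch(nn.Module):
    """Hedged sketch of one multi-scale temporal feature fusion engine (Aspect 6)."""
    def __init__(self, channels: int, kernel_size: int, num_heads: int = 4):
        super().__init__()
        # One-dimensional convolution over time yields coarse, tuple-level features.
        self.conv1d = nn.Conv1d(channels, channels, kernel_size,
                                padding=kernel_size // 2)
        # Cross attention transfers fine-grained (pre-convolution, frame-level)
        # information into the coarse-grained (post-convolution) features.
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Average pooling provides neighboring-frame context at a single granularity.
        self.avg_pool = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, C)
        x = frame_feats.transpose(1, 2)                    # (batch, C, T)
        coarse = self.conv1d(x).transpose(1, 2)            # tuple-level features
        # Query from coarse features, key/value from the frame-level features.
        attended, _ = self.cross_attn(query=coarse, key=frame_feats, value=frame_feats)
        pooled = self.avg_pool(x).transpose(1, 2)          # averaged neighboring frames
        # Local temporal context features = cross-attended output + pooled features.
        return attended + pooled

engine = MSTFFEngineSketch(channels=256, kernel_size=3)
context_feats = engine(torch.randn(2, 8, 256))             # (2, 8, 256)
```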
- Aspect 7. The apparatus of Aspect 6, wherein the at least one processor is further configured to generate, via the second multi-scale temporal feature fusion engine, the second local temporal context features from the second neighboring sub-sequence of the set of input frames by: generating, via a second convolutional neural network, second local temporal context features from the set of input frames; generating, via a second cross attention module, a second cross attended feature output based on the first local temporal context features; generating, via a second average pooling module, a second average pooling dataset from the set of input frames; and generating the second local temporal context features by adding the second cross attended feature output to the second average pooling dataset.
- Aspect 8. The apparatus of any one of Aspects 6 or 7, wherein the first neighboring sub-sequence of the set of input frames equals the second neighboring sub-sequence of the set of input frames.
- Aspect 9 The apparatus of Aspect 7, wherein the first cross attention module generates the first cross attended feature output based on a relationship between a query 112 and a key 110 associated with the set of input frames and wherein the second cross attention module generates the second cross attended feature output based on the relationship between the query 112 and the key 110 associated with the set of input frames.
- Aspect 10. The apparatus of Aspect 9, wherein the first cross attention module and the second cross attention module both transfer data associated with fine-grained features before a one-dimensional convolution on the set of input frames to data associated with coarse-grained features after the one-dimensional convolution on the set of input frames.
- Aspect 11 The apparatus of Aspect 10, wherein the first cross attention module and the second cross attention module each convey information from two different temporal granularities.
- Aspect 12 The apparatus of Aspect 11, wherein the two different temporal granularities comprise a frame-level granularity and a tuple-level granularity.
- Aspect 13. The apparatus of any one of Aspects 7 to 12, wherein the first average pooling module and the second average pooling module each provide information of temporally neighboring frames at a single temporal granularity.
- Aspect 14 The apparatus of any one of Aspects 7 to 12, wherein the first convolutional neural network and the second convolutional neural network each perform a one-dimensional convolution with a respective kernel.
- Aspect 15 The apparatus of Aspect 14, wherein the first convolutional neural network and the second convolutional neural network each perform the one-dimensional convolution with the respective kernel to summarize information in consecutive k frames of the set of input frames to generate the first local temporal context features and the second local temporal context features.
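- As a small, hedged example of the kernel behavior recited in Aspects 14 and 15, the snippet below shows a one-dimensional convolution with kernel size k aggregating each run of k consecutive frame features into one local temporal context feature; the shapes and sizes are illustrative assumptions, not values fixed by this disclosure.

```python
import torch
import torch.nn as nn

T, C, k = 8, 256, 3                       # assumed frame count, channels, kernel size
frame_feats = torch.randn(1, C, T)        # (batch, channels, time)
conv_k = nn.Conv1d(C, C, kernel_size=k)   # no padding: each output sees exactly k frames
local_context = conv_k(frame_feats)       # summary of every k consecutive frames
print(local_context.shape)                # torch.Size([1, 256, 6]) since 8 - 3 + 1 = 6
```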
- Aspect 16. A method of classifying video, comprising: generating, via a first network, frame-level features obtained from a set of input frames; generating, via a first multi-scale temporal feature fusion engine, first local temporal context features from a first neighboring sub-sequence of the set of input frames; generating, via a second multi-scale temporal feature fusion engine, second local temporal context features from a second neighboring sub-sequence of the set of input frames; and classifying the set of input frames based on the first local temporal context features and the second local temporal context features.
- Aspect 17. The method of Aspect 16, wherein the first multi-scale temporal feature fusion engine applies a first kernel value for generating the first local temporal context features and wherein the second multi-scale temporal feature fusion engine applies a second kernel value for generating the second local temporal context features.
- Aspect 18. The method of Aspect 16, wherein the method includes classifying, via an auxiliary classifier, the first local temporal context features and the second local temporal context features during a training process.
- Aspect 19 The method of Aspect 18, wherein the auxiliary classifier comprises a two-layer multilayer perceptron (MLP).
- Aspect 20 The method of any of Aspects 16 to 19, wherein the first network comprises a two-dimensional convolutional neural network.
- Aspect 21 The method of any one of Aspects 16 to 20, wherein generating, via the first multi-scale temporal feature fusion engine, the first local temporal context features from the first neighboring sub-sequence of the set of input frames further comprises: generating, via a first convolutional neural network, first local temporal context features from the set of input frames; generating, via a first cross attention module, a first cross attended feature output based on the first local temporal context features; generating, via a first average pooling module, a first average pooling dataset from the set of input frames; and generating the first local temporal context features by adding the first cross attended feature output to the first average pooling dataset.
- Aspect 22. The method of Aspect 21, wherein generating, via the second multi-scale temporal feature fusion engine, the second local temporal context features from the second neighboring sub-sequence of the set of input frames further comprises: generating, via a second convolutional neural network, second local temporal context features from the set of input frames; generating, via a second cross attention module, a second cross attended feature output based on the first local temporal context features; generating, via a second average pooling module, a second average pooling dataset from the set of input frames; and generating the second local temporal context features by adding the second cross attended feature output to the second average pooling dataset.
- Aspect 24. The method of Aspect 22, wherein the first cross attention module generates the first cross attended feature output based on a relationship between a query and a key associated with the set of input frames and wherein the second cross attention module generates the second cross attended feature output based on the relationship between the query and the key associated with the set of input frames.
- Aspect 23. The method of any one of Aspects 20 to 22, wherein the first neighboring sub-sequence of the set of input frames equals the second neighboring sub-sequence of the set of input frames.
- Aspect 25. The method of Aspect 24, wherein the first cross attention module and the second cross attention module both transfer data associated with fine-grained features before a one-dimensional convolution on the set of input frames to data associated with coarse-grained features after the one-dimensional convolution on the set of input frames.
- Aspect 26 The method of Aspect 25, wherein the first cross attention module and the second cross attention module each convey information from two different temporal granularities.
- Aspect 27 The method of Aspect 26, wherein the two different temporal granularities comprise a frame-level granularity and a tuple-level granularity.
- Aspect 29 The method of any one of Aspects 22 to 27, wherein the first convolutional neural network and the second convolutional neural network each perform a one-dimensional convolution with a respective kernel.
- Aspect 30 The method of Aspect 29, wherein the first convolutional neural network and the second convolutional neural network each perform the one-dimensional convolution with the respective kernel to summarize information in consecutive k frames of the set of input frames to generate the first local temporal context features and the second local temporal context features.
- Aspect 31. An apparatus for performing video classification comprising: a neural network configured to generate frame-level features in consecutive frames from a set of video frames; a first multi-scale temporal feature fusion engine having a first kernel size configured to generate first local context features based on the frame-level features; a second multi-scale temporal feature fusion engine having a second kernel size configured to generate second local context features based on the frame-level features; a first temporal-relation cross transformer classifier configured to generate a first distance between a query video associated with the set of video frames and sets of support videos based on the first local context features; a second temporal-relation cross transformer classifier configured to generate a second distance between the query video associated with the set of video frames and the sets of support videos based on the second local context features; and a calculating engine configured to calculate a final distance between the query video and the sets of support videos based on the first distance and the second distance.
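- A hedged sketch of the final-distance computation in Aspect 31 is shown below; the simple averaging of the two classifier distances and the 5-way episode values are assumptions for illustration, not the disclosed combination rule.

```python
import torch

def final_distance(dist_scale1: torch.Tensor, dist_scale2: torch.Tensor) -> torch.Tensor:
    """Combine per-class query-to-support distances from the two temporal-relation
    cross transformer classifiers (one per kernel size). Averaging is illustrative."""
    return 0.5 * (dist_scale1 + dist_scale2)

# Hypothetical 5-way episode: one distance per support class from each classifier.
d1 = torch.tensor([0.9, 1.4, 0.3, 2.0, 1.1])
d2 = torch.tensor([1.0, 1.2, 0.4, 1.8, 1.3])
fused = final_distance(d1, d2)
prediction = fused.argmin()   # predict the class with the smallest fused distance
```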
- Aspect 32 The apparatus of Aspect 31, wherein the apparatus is optimized by calculating a main loss based on the first temporal-relational cross transformer classifier and the second temporal-relational cross transformer classifier and an auxiliary loss based on an auxiliary classifier used during a training process.
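- The optimization described in Aspect 32 can be pictured with the sketch below, which sums a main loss from the temporal-relational cross transformer classifiers and an auxiliary loss from the auxiliary classifier; the cross-entropy terms and the aux_weight scalar are assumptions, since the disclosure does not fix a particular loss form or weighting here.

```python
import torch
import torch.nn.functional as F

def total_loss(query_logits: torch.Tensor, labels: torch.Tensor,
               aux_logits: torch.Tensor, aux_labels: torch.Tensor,
               aux_weight: float = 1.0) -> torch.Tensor:
    """Main few-shot loss plus auxiliary classification loss (illustrative only)."""
    main_loss = F.cross_entropy(query_logits, labels)    # from the TRX classifiers
    aux_loss = F.cross_entropy(aux_logits, aux_labels)   # from the auxiliary classifier
    return main_loss + aux_weight * aux_loss
```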
- Aspect 33 The apparatus of Aspect 31, further comprising: an auxiliary classifier configured to receive the first local context features and the second local context features and output a multi-way classification.
- Aspect 34 The apparatus of Aspect 33, wherein the auxiliary classifier is not used after training the apparatus.
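- For Aspects 33 and 34 (and the two-layer MLP of Aspect 4), a hedged sketch of an auxiliary classifier that outputs a multi-way classification and is discarded after training is given below; the hidden size and ReLU activation are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AuxClassifierSketch(nn.Module):
    """Two-layer MLP auxiliary classifier used only during training (illustrative)."""
    def __init__(self, feat_dim: int, num_classes: int, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),   # multi-way classification output
        )

    def forward(self, local_context_feats: torch.Tensor) -> torch.Tensor:
        # local_context_feats: (batch, feat_dim), e.g., temporally pooled context features
        return self.mlp(local_context_feats)

aux = AuxClassifierSketch(feat_dim=256, num_classes=64)
logits = aux(torch.randn(4, 256))   # (4, 64); the module is not used at inference time
```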
- Aspect 35. A method of performing video classification comprising: generating, via a neural network configured to receive a set of video frames, frame-level features in consecutive frames from the set of video frames; generating, via a first multi-scale temporal feature fusion engine having a first kernel size, first local context features based on the frame-level features; generating, via a second multi-scale temporal feature fusion engine having a second kernel size, second local context features based on the frame-level features; generating, via a first temporal-relation cross transformer classifier and based on the first local context features, a first distance between a query video associated with the set of video frames and sets of support videos; generating, via a second temporal-relation cross transformer classifier and based on the second local context features, a second distance between the query video associated with the set of video frames and the sets of support videos; and calculating a final distance between the query video and the sets of support videos based on the first distance and the second distance.
- Aspect 36 The method of Aspect 35, further comprising: performing optimization by calculating a main loss based on the first temporal-relational cross transformer classifier and the second temporal-relational cross transformer classifier and an auxiliary loss based on an auxiliary classifier used during a training process.
- Aspect 37 The method of Aspect 35, further comprising: outputting, via an auxiliary classifier configured to receive the first local context features and the second local context features, a multi-way classification.
- Aspect 38 The method of Aspect 37, wherein the auxiliary classifier is not used after training.
- Aspect 39. A non-transitory computer-readable medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations according to any of Aspects 16 to 30 or 35 to 38.
- Aspect 40 An apparatus for generating a classification of video content, the apparatus including one or more means for performing operations according to any of Aspects 16 to 30 or 35 to 38.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Image Analysis (AREA)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202480012743.2A CN120752680A (en) | 2023-02-21 | 2024-02-13 | Dynamic temporal fusion for video recognition |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363486212P | 2023-02-21 | 2023-02-21 | |
| US63/486,212 | 2023-02-21 | | |
| US18/504,968 US20240282081A1 (en) | 2023-02-21 | 2023-11-08 | Dynamic temporal fusion for video recognition |
| US18/504,968 | 2023-11-08 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024177851A1 (en) | 2024-08-29 |
Family
ID=90366716
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/015603 WO2024177851A1 (en), Ceased | Dynamic temporal fusion for video recognition | 2023-02-21 | 2024-02-13 |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN120752680A (en) |
| WO (1) | WO2024177851A1 (en) |
- 2024-02-13: WO PCT/US2024/015603, published as WO2024177851A1 (en), not active (Ceased)
- 2024-02-13: CN CN202480012743.2A, published as CN120752680A (en), active (Pending)
Non-Patent Citations (4)
| Title |
|---|
| BING LI ET AL: "Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 June 2022 (2022-06-16), XP091248127 * |
| PERRETT TOBY ET AL: "Temporal-Relational CrossTransformers for Few-Shot Action Recognition", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 20 June 2021 (2021-06-20), pages 475 - 484, XP034008864, DOI: 10.1109/CVPR46437.2021.00054 * |
| TAILIN CHEN ET AL: "Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 August 2021 (2021-08-10), XP091032360 * |
| THATIPELLI ANIRUDH ET AL: "Spatio-temporal Relation Modeling for Few-shot Action Recognition", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 18 June 2022 (2022-06-18), pages 19926 - 19935, XP034196037, DOI: 10.1109/CVPR52688.2022.01933 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120752680A (en) | 2025-10-03 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24712675; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 202547055909; Country of ref document: IN |
| | WWP | Wipo information: published in national office | Ref document number: 202547055909; Country of ref document: IN |
| | WWE | Wipo information: entry into national phase | Ref document number: 202480012743.2; Country of ref document: CN |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWP | Wipo information: published in national office | Ref document number: 202480012743.2; Country of ref document: CN |