
WO2024091472A1 - Histogram-based action detection - Google Patents

Histogram-based action detection

Info

Publication number
WO2024091472A1
Authority
WO
WIPO (PCT)
Prior art keywords
vectors
keypoints
bins
spatial
temporal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2023/035759
Other languages
French (fr)
Inventor
Marios Savvides
Kai Hu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Carnegie Mellon University
Original Assignee
Carnegie Mellon University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Carnegie Mellon University filed Critical Carnegie Mellon University
Publication of WO2024091472A1 publication Critical patent/WO2024091472A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/23: Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed herein is a method for skeleton-based action recognition using handcrafted features. The input of the model is keypoint data of one or more skeletons from the frames of a video clip. Several histogram features are used to describe the spatial and temporal patterns of the corresponding body. These features are concatenated and sent to a linear classifier to predict the category of the actions.

Description

Attorney Docket: 8350.2023-065WO

PATENT APPLICATION FILED UNDER THE PATENT COOPERATION TREATY AT THE UNITED STATES RECEIVING OFFICE FOR Histogram-Based Action Detection

APPLICANT: Carnegie Mellon University

INVENTORS: Marios Savvides, Kai Hu

PREPARED BY: Dennis M. Carleton, Principal, KDW Firm PLLC, 2601 Weston Pkwy., Suite 103, Cary, NC 27513

Histogram-Based Action Detection

Related Applications

[0001] This application claims the benefit of U.S. Provisional Patent Application No. 63/418,798, filed October 24, 2022, the contents of which are incorporated herein in their entirety.

Background of the Invention

[0002] Skeleton-based action recognition is a computer vision task that involves recognizing human actions from 3D skeletal keypoint data captured, for example, from sequential frames of a video clip. A variety of sensors can be used for the capture of the video sequence, for example, standard video cameras, Microsoft Kinect devices, Intel RealSense devices and other wearable devices.

[0003] The skeletal keypoint data may be extracted from the frames of a video by a trained machine learning model. FIG. 1A shows one possible exemplary scheme for identifying the various body parts by number. For example, one scheme identifies keypoints representing the following body parts: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle. Other, more or less detailed schemes are equally valid and could be used with the disclosed methods. FIG. 1B shows an example of the keypoint data that might be extracted from a frame of a video clip.

[0004] The input to the action classification model will typically be in the form of a file containing the location of each keypoint with respect to a coordinate system defined in the context of a video frame. The input file may also contain other information, for example, the number of recorded frames in the video, the number of observed skeletons appearing in the current frame, the number of identified joints and the orientation of each joint. Other information may also be present in the file. In various embodiments, a separate file may be used for each frame of the video, or the data for all frames of the video may be present in a single file.

[0005] There are many prior art methods and models available for the detection, recognition and classification of actions based on keypoint data. State of the art methods for action recognition take as input the RGB frames of the video and use deep neural networks to classify the action. As a result, these models incur a heavy computational cost and are not suitable for real-time deployment on, for example, edge devices having limited computing capability.

Summary of the Invention

[0006] Disclosed herein is a method and model for skeleton-based action recognition using handcrafted features. The input of the model is keypoint data of one or more skeletal representations of bodies depicted in the frames of a video clip. Several histogram features are used to describe the spatial and temporal patterns of the corresponding body. These features are concatenated and sent to a linear classifier to predict the category of the actions.

[0007] The model disclosed herein provides the advantage of being lightweight and providing a fast inference time. The performance gap between the disclosed model and the state of the art method is acceptable. For example, on the UCF10 dataset, the best performance is 87% for state-of-the-art skeleton methods, while the model disclosed herein can achieve 85% accuracy but much more quickly (i.e., on the order of 100 to 1,000 times quicker than traditional neural networks). This makes the disclosed model ideal for deployment on edge devices having limited computing resources.

Brief Description of the Drawings

[0008] FIG. 1 shows the tracking of the joints of a skeletal representation of a body through multiple frames of a video, providing a temporal component to the analysis of the depicted action.

[0009] FIG. 2 is a flowchart representing the steps of the disclosed method.

Detailed Description

[0010] Disclosed herein is a model that takes as input a set of novel features, namely histograms of keypoints representing the information of skeletal actions. After deriving the histogram-based features, the features are input to a single-layer linear classifier that makes the action predictions. The disclosed method can be up to 1,000 times faster than neural networks. For example, on the NVIDIA Jetson Xavier device, the disclosed model takes no more than 2 ms to make predictions on a 5-second video clip. Most importantly, the performance of the disclosed model in terms of accuracy matches the performance of the state of the art models.

[0011] Consider a video clip of $T$ frames, having at most $M$ different persons depicted. The input of the disclosed model is a 4-D tensor of size $M \times T \times V \times 2$, where $V$ is the number of keypoints in the keypoint scheme being used. For example, for the keypoint annotation scheme shown in FIG. 1A, $V = 17$ (wherein the keypoints are numbered 0-16). The last dimension, "2", represents the $x$ and $y$ coordinates of the keypoints in the frame. Specifically, let $X$ be the 4-dimensional tensor input and $X(m, t, v)$ a 1-dimensional array of two numbers, denoting the $v$-th keypoint of the $m$-th person in the $t$-th frame.

[0012] The relative position of two different keypoints in the same frame is used to represent the spatial information of the action. There are $V$ keypoints, thus we have $V(V-1)/2$ pairings of keypoints representing spatial features. Note that the keypoints representing a single body are only paired with other keypoints from that same body. Comparing the relative keypoints from two different bodies has no meaning as far as determining the actions of a single body. The pairings are derived in step 202 in FIG. 2. For every $i$ ranging from 1 to $V$ and $j$ ranging from $i + 1$ to $V$, the relative position features of keypoint $i$ and keypoint $j$ are:

$$S_{i,j} = \{\, X(m, t, i) - X(m, t, j) \mid \forall\, m \in [1, M],\; t \in [1, T] \,\} \quad (1)$$

[0013] There are $MT$ vectors in the set $S_{i,j}$. When the video clip is long or many persons show up in the video clip, the number of features is large. To reduce the number of features, the features are grouped into $B$ bins and a histogram of these features is derived.
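As an illustration of equation (1), the following is a minimal NumPy sketch of the pairing step, assuming the input tensor has shape (M, T, V, 2) as described in paragraph [0011]; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def spatial_vectors(X):
    """Pairwise relative-position sets S_{i,j} of equation (1).

    X is assumed to be a float array of shape (M, T, V, 2): persons, frames,
    keypoints, and (x, y) coordinates. Keypoints are only paired within the
    same body, which the indexing below does implicitly.
    """
    M, T, V, _ = X.shape
    S = {}
    for i in range(V):
        for j in range(i + 1, V):
            # M * T two-dimensional vectors for this keypoint pair
            S[(i, j)] = (X[:, :, i, :] - X[:, :, j, :]).reshape(-1, 2)
    return S
```

For $V = 17$ keypoints this produces 136 sets, each holding $MT$ two-dimensional vectors.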
[0014] The features are first grouped in terms of the vector length, as shown in step 204a. Suppose the maximum length of the vectors in $S_{i,j}$ is $L$. In this case, $B$ groups are created, where the $k$-th group contains the vectors whose length is between $\frac{k-1}{B} L$ and $\frac{k}{B} L$. Then, a histogram of size $B$ can be derived. The $k$-th number in the histogram is the ratio of vectors falling into the $k$-th bin.

[0015] The features are next grouped in terms of the vector orientation at step 206a. $B$ groups are created, where the $k$-th group contains the vectors whose orientation to the horizontal is between $\frac{k-1}{B} \times 360$ and $\frac{k}{B} \times 360$ degrees. This also results in a histogram of size $B$.
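A hedged sketch of steps 204a and 206a as just described: each set of relative-position vectors is reduced to two size-$B$ histograms, one over length and one over orientation, with entries expressed as ratios. Bin edges follow the description; the guard for an all-zero set is an added assumption.

```python
import numpy as np

def length_orientation_histograms(vectors, B=9):
    """Reduce a set of 2-D vectors to two size-B histograms of ratios:
    one over vector length (bins of width L/B, L being the maximum length
    in the set) and one over orientation to the horizontal (bins of width
    360/B degrees), mirroring steps 204a and 206a.
    """
    dx, dy = vectors[:, 0], vectors[:, 1]
    lengths = np.hypot(dx, dy)
    angles = np.degrees(np.arctan2(dy, dx)) % 360.0  # orientation in [0, 360)

    L = lengths.max()
    length_edges = np.linspace(0.0, L if L > 0 else 1.0, B + 1)  # all-zero guard (assumption)
    angle_edges = np.linspace(0.0, 360.0, B + 1)

    h_len, _ = np.histogram(lengths, bins=length_edges)
    h_ang, _ = np.histogram(angles, bins=angle_edges)

    n = len(vectors)  # the k-th entry is the ratio of vectors in the k-th bin
    return h_len / n, h_ang / n
```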
[0016] For each of the $V(V-1)/2$ pairings (spatial features), two histograms of size $B$ result. Thus, the total size of the spatial feature for each body is $B\,V(V-1)$.

[0017] The movement of one keypoint from one frame to $d$ frames later is used to represent the temporal information. For every $i$ ranging from 1 to $V$ and the hyperparameter $d$, the temporal features of keypoint $i$ are:

$$\mathcal{T}_{i,d} = \{\, X(m, t, i) - X(m, t + d, i) \mid \forall\, m \in [1, M],\; t \in [1, T - d] \,\} \quad (2)$$
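Equation (2) can be sketched in the same style, again assuming an (M, T, V, 2) input tensor; the rounding of fractional offsets such as $T/24$ to whole frames is an assumption, since the text does not specify how non-integer values of $d$ are handled.

```python
import numpy as np

def temporal_vectors(X, d):
    """Per-keypoint displacement sets of equation (2): X(m, t, i) - X(m, t + d, i)
    over all persons m and all frames t with t + d still inside the clip.

    X is assumed to have shape (M, T, V, 2). Fractional offsets such as T/24
    are rounded to at least one frame (an assumption, not stated in the patent).
    """
    M, T, V, _ = X.shape
    d = max(1, int(round(d)))
    out = {}
    for i in range(V):
        # shape (M, T - d, 2): one displacement per person and valid start frame
        out[i] = (X[:, :T - d, i, :] - X[:, d:, i, :]).reshape(-1, 2)
    return out
```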
[0018] Motion information at different speeds is needed. Intuitively, 8 choices of the hyperparameter $d$ have been selected:

$$d \in \{\, 1,\ 2,\ 4,\ 8,\ T/8,\ T/16,\ T/24,\ T/32 \,\}$$

[0019] As $T$ is the number of frames in the video, the first 4 choices for $d$ describe motion information that is invariant to the playing speed, while the last 4 choices for $d$ describe motion information that is invariant to the video length. As would be realized, the hyperparameter $d$ could represent any set of frames within the video clip.

[0020] Similar to the spatial features, the temporal vectors are also grouped by vector length and orientation. The total size of the resulting temporal feature is then $8V \times 2B$.

[0021] The input to the classifier that classifies the action is a vector of a size depending on the chosen $B$ (i.e., the number of bins in the histograms). The feature representing the spatial information is derived in step 204b and the feature representing the temporal information is derived in step 206b. A higher number of bins provides higher accuracy, while a lower number of bins increases speed. $B = 9$ provides a good trade-off between accuracy and speed. At $B = 9$, the feature size is 2448 for both the spatial and temporal information. These two features derived in 204b and 206b are concatenated in step 208, resulting in a 4896-dimensional vector, which is used as the final handcrafted feature (note that the size of the final vector will vary based on the selection of the value for $B$). The final vector is then input to trained model 210.

[0022] The trained model 210 is preferably a linear classifier, but model 210 can be any architecture of trained machine learning model. In preferred embodiments, trained model 210 is a 2-layer MLP (multi-layer perceptron) trained by solving a logistic regression on a training dataset.

[0023] As would be realized by those of skill in the art, the novelty of the invention lies in the preparation and derivation of the histogram-based feature vector. The specific derivation is provided as an exemplary embodiment only and the invention is not meant to be limited thereby. Modifications and variations are intended to be within the scope of the invention, which is given by the following claims:
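As a consistency check on the sizes given in paragraphs [0016], [0020] and [0021], the following sketch assembles the full handcrafted feature from the helper sketches above (it assumes those functions are in scope and that the clip is longer than the largest frame offset). With $B = 9$ and $V = 17$ it yields 2448 spatial plus 2448 temporal values, i.e. the 4896-dimensional vector that is fed to trained model 210.

```python
import numpy as np

def action_feature(X, B=9):
    """Assemble the concatenated spatial + temporal histogram feature for one
    clip of shape (M, T, V, 2), reusing the sketches above. For B = 9 and
    V = 17 this gives 9 * 17 * 16 = 2448 spatial values and
    8 * 17 * 2 * 9 = 2448 temporal values, i.e. a 4896-dimensional vector.
    """
    M, T, V, _ = X.shape
    parts = []

    # Spatial part: two size-B histograms per keypoint pair (steps 202, 204a, 206a)
    for vecs in spatial_vectors(X).values():
        parts.extend(length_orientation_histograms(vecs, B))

    # Temporal part: two size-B histograms per keypoint and per frame offset d
    for d in (1, 2, 4, 8, T / 8, T / 16, T / 24, T / 32):
        for vecs in temporal_vectors(X, d).values():
            parts.extend(length_orientation_histograms(vecs, B))

    return np.concatenate(parts)
```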

Claims

Claims:

1. A method comprising: receiving a set of coordinates representing locations of skeletal keypoints in multiple frames of a video clip showing actions of one or more bodies represented by the skeletal keypoints; extracting, from the set of coordinates, spatial vectors representing relative positions of each pair of keypoints within each frame; extracting, from the set of coordinates, temporal vectors representing relative positions of each keypoint over multiple frames; grouping the spatial vectors into a first set of a predetermined number of bins based on lengths of the spatial vectors and into a second set of the predetermined number of bins based on orientations of the spatial vectors; grouping the temporal vectors into a third set of the predetermined number of bins based on lengths of the temporal vectors and into a fourth set of the predetermined number of bins based on orientations of the temporal vectors; deriving an input vector comprising a spatial feature and a temporal feature representing the number of spatial and temporal vectors falling into each bin in each set of bins, respectively; and inputting the input vector into an action classifier and receiving a classification of an action of the one or more bodies in the video clip.

2. The method of claim 1 further comprising: obtaining a video clip; inputting the video clip to a pose estimation machine learning model; and receiving the set of coordinates representing locations of skeletal keypoints from the pose estimation machine learning model.

3. The method of claim 1 wherein the action classifier is a trained machine learning model trained by solving a logistic regression on a training dataset.

4. The method of claim 3 wherein the action classifier is a 2-layer perceptron.

5. The method of claim 1 wherein the input vector is a 4-dimensional tensor of size $M \times T \times V \times 2$; wherein $M$ is the number of bodies depicted in the video clip; wherein $T$ is the number of frames in the video clip; wherein $V$ is the number of keypoints per body; and wherein 2 represents the number of coordinates describing the location of each keypoint.

6. The method of claim 1 wherein the spatial vectors represent the relative positions of the keypoints with respect only to other keypoints within the body containing the respective keypoints, resulting in a set of spatial vectors for each body shown in the video clip.

7. The method of claim 1 wherein grouping the spatial vectors into the first and second sets of the predetermined number of bins creates histograms of the spatial vectors.

8. The method of claim 1 wherein grouping the temporal vectors into the third and fourth sets of the predetermined number of bins creates histograms of the temporal vectors.

9. The method of claim 1 wherein a size of the spatial feature is $B\,V(V-1)$; wherein $B$ is the predetermined number of bins; and wherein $V$ is the number of keypoints per body.

10. The method of claim 1 wherein the temporal feature captures motion speed information by extracting vectors over a sampling of frames within the video clip.

11. The method of claim 10 wherein the sampling of frames is given by a hyperparameter representing the frames between which the relative positions of each pair of keypoints are extracted.

12. The method of claim 11 wherein the hyperparameter is of the form $d \in \{1, 2, 4, 8, T/8, T/16, T/24, T/32\}$, wherein $T$ is the number of frames in the video clip.

13. The method of claim 12 wherein the sampling of frames represents motion information that is invariant to both the playing speed of the video and the video length.

14. The method of claim 1 wherein a size of the temporal feature is $8V \times 2B$; wherein $B$ is the predetermined number of bins; and wherein $V$ is the number of keypoints per body.

15. The method of claim 1 wherein a higher number of bins provides greater accuracy, while a lower number of bins provides faster speed.
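Paragraph [0022] and claims 3-4 leave the classifier head somewhat open (a linear classifier is preferred; a 2-layer perceptron trained by solving a logistic regression is an alternative). Below is a hedged sketch of both options using scikit-learn; training_clips, training_labels and test_clip are placeholder names, action_feature is the sketch above, and the hidden width is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Hypothetical training data: one 4896-dimensional handcrafted feature per clip
# (see the action_feature sketch above) and one integer action label per clip.
X_train = np.stack([action_feature(clip) for clip in training_clips])
y_train = np.asarray(training_labels)

# Preferred embodiment of paragraph [0022]: a linear classifier.
linear_head = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Claims 3-4 variant: a 2-layer perceptron trained with a logistic-regression
# (cross-entropy) objective; the hidden width of 256 is an arbitrary choice.
mlp_head = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500).fit(X_train, y_train)

predicted_action = linear_head.predict(action_feature(test_clip).reshape(1, -1))
```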
PCT/US2023/035759 2022-10-24 2023-10-24 Histogram-based action detection Ceased WO2024091472A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263418798P 2022-10-24 2022-10-24
US63/418,798 2022-10-24

Publications (1)

Publication Number Publication Date
WO2024091472A1 (en) 2024-05-02

Family

ID=90831722

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/035759 Ceased WO2024091472A1 (en) 2022-10-24 2023-10-24 Histogram-based action detection

Country Status (1)

Country Link
WO (1) WO2024091472A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170316578A1 (en) * 2016-04-29 2017-11-02 Ecole Polytechnique Federale De Lausanne (Epfl) Method, System and Device for Direct Prediction of 3D Body Poses from Motion Compensated Sequence
US20200310549A1 (en) * 2019-03-29 2020-10-01 Tata Consultancy Services Limited Systems and methods for three-dimensional (3d) reconstruction of human gestures from radar based measurements
WO2021189145A1 (en) * 2020-03-27 2021-09-30 Sportlogiq Inc. System and method for group activity recognition in images and videos with self-attention mechanisms
US20220138536A1 (en) * 2020-10-29 2022-05-05 Hong Kong Applied Science And Technology Research Institute Co., Ltd Actional-structural self-attention graph convolutional network for action recognition
US20220138967A1 (en) * 2020-11-01 2022-05-05 Southwest Research Institute Markerless Motion Capture of Animate Subject with Prediction of Future Motion

Similar Documents

Publication Publication Date Title
US20250363350A1 (en) Method and system for activity classification
Yang et al. An emotion recognition model based on facial recognition in virtual learning environment
Ha et al. Multi-modal convolutional neural networks for activity recognition
Asif et al. Privacy preserving human fall detection using video data
Verma et al. Gesture recognition using kinect for sign language translation
Javeed et al. Body-worn hybrid-sensors based motion patterns detection via bag-of-features and Fuzzy logic optimization
Padhi et al. Hand gesture recognition using densenet201-mediapipe hybrid modelling
CN112329513A (en) High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network
Chalasani et al. Egocentric gesture recognition for head-mounted ar devices
Reining et al. Attribute representation for human activity recognition of manual order picking activities
Abdulhamied et al. Real-time recognition of American sign language using long-short term memory neural network and hand detection
Gavrilescu Proposed architecture of a fully integrated modular neural network-based automatic facial emotion recognition system based on Facial Action Coding System
Neyra-Gutiérrez et al. Feature extraction with video summarization of dynamic gestures for peruvian sign language recognition
Almaadeed et al. A novel approach for robust multi human action detection and recognition based on 3-dimentional convolutional neural networks
Tur et al. Isolated sign recognition with a siamese neural network of RGB and depth streams
Agrawal et al. Redundancy removal for isolated gesture in Indian sign language and recognition using multi-class support vector machine
Nikhil et al. Retracted: Finger Recognition and Gesture based Virtual Keyboard
Karthik et al. Survey on Gestures Translation System for Hearing Impaired People in Emergency Situation using Deep Learning Approach
WO2024091472A1 (en) Histogram-based action detection
Baranwal et al. Implementation of MFCC based hand gesture recognition on HOAP-2 using Webots platform
Armandika et al. Dynamic hand gesture recognition using temporal-stream convolutional neural networks
Subramanian et al. Enhancing Object Detection through Auditory-Visual Fusion on Raspberry Pi and FogBus
Monica et al. Recognition of medicine using cnn for visually impaired
Bora et al. ISL gesture recognition using multiple feature fusion
Alba-Flores UAVs control using 3D hand keypoint gestures

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23883367

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 23883367

Country of ref document: EP

Kind code of ref document: A1