CN112101176B - User identity recognition method and system combining user gait information - Google Patents
- Publication number: CN112101176B
- Application number: CN202010943184.0A
- Authority
- CN
- China
- Prior art keywords
- user
- space
- time
- gait
- skeleton
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
- G06V40/25—Recognition of walking or running movements, e.g. gait recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a user identity recognition method and a system combining user gait information, wherein the method comprises the following steps: performing pose detection on the pedestrian object in each frame of the video sequences of the original data set with a two-dimensional pose estimation system, and extracting the pose information; preprocessing the extracted joint coordinate sequences to generate a human skeleton data set; and finally constructing a spatio-temporal graph convolutional network model, dividing the skeleton graph into six subgraphs that share joints, learning the recognition model with a graph convolutional network, training on the constructed data set, adopting a multi-loss strategy combining classification loss and contrastive loss, optimizing the network parameters by stochastic gradient descent, and evaluating the accuracy of the trained model on a validation set. The invention makes full use of the effective information of the joint points, preserves the motion state in the time dimension as far as possible, is highly robust to clothing changes and carrying conditions, and generalizes well in cross-view tasks.
Description
Technical Field
The invention belongs to the field of gait recognition in computer vision, and particularly relates to a user identity recognition method and system that combine user gait information.
Background
In the task of human identification there are many biometric features, such as the iris, fingerprints and the face; gait is a behavioral biometric feature. Compared with other biometrics, gait is hard to steal or imitate owing to its unique non-contact nature, is particularly suitable for long-distance human identification, and has attracted increasing attention in the field of video surveillance. Gait recognition nevertheless remains very challenging: it relies on video sequences captured in controlled or uncontrolled environments, the appearance of a pedestrian changes over time, a change of capture viewpoint can greatly alter a person's appearance while walking, and recognition is further affected by factors such as clothing and footwear changes, the walking surface, walking speed and emotional state.
Existing gait recognition methods fall into two main categories. The first is model-based methods, which study and analyse gait videos or gait silhouettes, manually fit human-body features to each frame, and model the structure of the human body and the local motion patterns of its different parts. Such methods were used in early gait recognition research to extract gait-related dynamic or static information, but they place high demands on the raw data set, have huge numbers of model parameters, are computationally expensive, and their complexity leads to poor results. The second category is appearance-based methods, which extract gait representations directly from video without explicitly considering the body structure. In such methods the Gait Energy Image (GEI) is the most common input, because it achieves a good compromise between recognition rate and computational simplicity. A gait energy image is formed from all silhouettes within one gait cycle according to a fixed rule; it mixes the dynamic and static information of the silhouette sequence, and the energy of each pixel is obtained by averaging that pixel over the silhouettes of one gait cycle. However, a person's silhouette is easily deformed by covariates such as clothing and carried objects, which directly reduces the recognition rate.
In view of these problems, several variants of the GEI have concentrated on improving its dynamic parts to mitigate the appearance changes caused by clothing and carrying conditions. Meanwhile, with the recent explosive development of deep learning in computer vision, where deep convolutional neural networks in particular have achieved excellent performance on a wide range of tasks, deep-learning-based gait recognition has been studied intensively. Appearance-based (model-free) methods have much lower algorithmic complexity and higher computational efficiency than model-based ones, yet they are still severely challenged by the covariates that affect gait recognition performance, including viewing angle, clothing changes, walking speed, carrying condition and resolution. Skeletons have been used successfully in object recognition, human action recognition, video tracking and pedestrian re-identification, with excellent results. Existing methods based on pose keypoints typically model the skeleton data as vector sequences or pseudo-images that are fed to a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN); such methods merely concatenate the keypoints into a feature vector at each time step and fail to exploit fully the effective information of the human joints.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a user identity recognition method and a system combining user gait information, in order to solve two problems: in existing gait tasks, a person's silhouette is easily deformed by covariates such as clothing and carried objects, which directly reduces the recognition rate; and conventional algorithms based on pose keypoints fail to exploit fully the effective information of the human joint points.
In order to achieve the above object, in a first aspect, the present invention provides a user identification method combined with gait information of a user, including the following steps:
determining a gait data set of the user; performing pose estimation on each video frame in the gait data set, and determining the pose keypoint coordinates of each video frame; the gait data set is a plurality of video frames containing the gait information of the user;
determining a user skeleton data set based on the user pose keypoint coordinates; the user skeleton data set comprises the coordinates of each joint point of the user;
connecting the coordinates of the joint points of the user according to the skeleton structure of the user, based on the user skeleton data set, to construct a user skeleton spatio-temporal topological graph;
inputting the user skeleton spatio-temporal topological graph into a spatio-temporal graph convolutional network model, and identifying the identity information of the user by combining the gait information of different users stored in advance in the model.
In an alternative embodiment, the pose keypoints correspond to a plurality of joint points of the user; the user skeleton data set is determined based on the user pose keypoint coordinates, specifically:
normalizing the pose keypoint coordinates of each video frame based on the centre position of the two joint points of the user's neck and hip, to obtain the normalized coordinates of each joint point of the user, and taking the normalized coordinates as the user skeleton data set.
In an optional embodiment, the coordinates of the joint points of the user are connected according to the skeleton structure of the user, based on the user skeleton data set, to construct a user skeleton spatio-temporal topological graph, specifically:
connecting the joint points within a video frame in the spatial domain of that single frame according to the user's skeleton structure; dividing the connected joint points of the user in the spatial graph of the single frame into six parts (head, trunk, left arm, right arm, left leg and right leg), forming six subgraphs with shared vertices and shared edges; connecting the same joint point in adjacent video frames to form the temporal edges of the spatio-temporal graph; and repeating these two steps for all video frames, so that all temporal edges and the joint points of all video frames jointly form the user skeleton spatio-temporal graph.
In an alternative embodiment, the spatio-temporal graph convolutional network model comprises a multi-layer backbone network; each backbone layer comprises a spatial graph convolutional network (SGCN) and a temporal convolutional network (TCN), and adjacent SGCN and TCN blocks are connected in series and pass features to one another;
in the SGCN, after the input features pass through a convolution layer, high-dimensional discriminative spatial features are extracted, with an attention mechanism, from the interaction of the joint points within the first-order neighbourhood of the input features, and are input to the TCN;
in the TCN, the temporal feature distribution of the high-dimensional discriminative spatial features is normalized by a batch normalization (BN) layer, activated by a rectified linear unit, and then used as the input of a further convolution layer, finally achieving an effective expression of the user's joint features over several consecutive time steps and extracting high-dimensional features in both the spatial and temporal dimensions;
the nonlinear mapping from the input feature space to the high-dimensional feature space is completed by stacking and cascading the multi-layer backbone network, yielding high-dimensional discriminative features;
the high-dimensional discriminative features are output through a pooling layer and a fully connected layer; they are used to identify the identity information of the user.
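The SGCN and TCN operations of one backbone layer can be illustrated with a minimal NumPy sketch. The tensor shapes, the degree-normalized aggregation over the first-order neighbourhood, and the temporal kernel size are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sgcn_layer(X, A, W):
    """Spatial graph convolution on a skeleton sequence.
    X: (T, V, C_in) joint features, A: (V, V) adjacency with self-loops,
    W: (C_in, C_out). Aggregates each joint's 1-hop neighbourhood."""
    D_inv = np.diag(1.0 / A.sum(axis=1))          # degree normalization
    return relu(np.einsum('uv,tvc,cd->tud', D_inv @ A, X, W))

def tcn_layer(H, Wt):
    """Temporal convolution: a 1D conv along the frame axis, per joint.
    H: (T, V, C), Wt: (K, C, C) with odd kernel size K."""
    K = Wt.shape[0]
    pad = K // 2
    Hp = np.pad(H, ((pad, pad), (0, 0), (0, 0)))  # zero-pad in time
    out = np.empty_like(H)
    for t in range(H.shape[0]):
        out[t] = relu(np.einsum('kvc,kcd->vd', Hp[t:t + K], Wt))
    return out
```

Stacking several such SGCN-then-TCN pairs gives the cascaded backbone the embodiment describes.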
In an alternative embodiment, the attention mechanism is used to enhance the saliency and distinguishability of the extracted spatio-temporal features;
the attention mechanism assigns a different weight to each joint point of the user, focusing on the joint points that contribute most, ignoring those that contribute little, and thereby selecting the joint points effective for the gait feature; the attention mechanism is implemented by adding a learnable mask to the input of the spatio-temporal graph convolutional network model.
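How a learnable mask re-weights the skeleton adjacency can be sketched as follows. The matrices are purely illustrative; in the actual model the mask M would be a trained parameter updated by gradient descent:

```python
import numpy as np

# A: skeleton adjacency with self-loops (a toy 3-joint graph).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

# M: learnable mask, initialised to all ones so training starts from the
# plain skeleton graph and then learns per-edge importance.
M = np.ones_like(A)
M[1, 2] = M[2, 1] = 1.5   # e.g. an edge training has up-weighted

A_att = A * M             # element-wise: absent edges (A == 0) stay zero
```

Because the product is element-wise, the mask can only re-weight existing skeleton edges, never invent new ones.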
In an alternative embodiment, the spatio-temporal graph convolutional network model combines classification loss and contrastive loss to reduce the feature distance within one user's gait and to increase the difference between the gaits of different users;
the classification loss function uses the Softmax loss as a supervision signal to provide class-centre information to the model, while the contrastive loss is used to constrain the relations between classes.
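A minimal sketch of this multi-loss strategy, assuming the standard Softmax cross-entropy and the standard margin-based contrastive loss; the weighting factor `lam` and the margin value are illustrative assumptions:

```python
import math

def softmax_ce(logits, label):
    """Classification (Softmax cross-entropy) loss for one sample."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[label] / sum(exps))

def contrastive(f1, f2, same_identity, margin=1.0):
    """Pull same-identity gait features together; push different
    identities at least `margin` apart."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    if same_identity:
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2

def multi_loss(logits, label, f1, f2, same_identity, lam=0.5):
    # lam balances the two supervision signals; its value is an assumption.
    return softmax_ce(logits, label) + lam * contrastive(f1, f2, same_identity)
```

The classification term supplies class-centre supervision, while the pairwise contrastive term shapes the distances between classes, matching the intra-gait/inter-gait objective stated above.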
In a second aspect, the invention provides a user identity recognition system combining user gait information, comprising:
a gait data determining unit, for determining a gait data set of the user, performing pose estimation on each video frame in the gait data set, and determining the pose keypoint coordinates of each video frame, the gait data set being a plurality of video frames containing the gait information of the user;
a skeleton data determining unit, for determining a user skeleton data set based on the user pose keypoint coordinates, the user skeleton data set comprising the coordinates of each joint point of the user;
a skeleton topology construction unit, for connecting the coordinates of the joint points of the user according to the skeleton structure of the user, based on the user skeleton data set, to construct a user skeleton spatio-temporal topological graph;
a user identity recognition unit, for inputting the user skeleton spatio-temporal topological graph into a spatio-temporal graph convolutional network model and identifying the identity information of the user by combining the gait information of different users stored in advance in the model.
In an optional embodiment, the pose keypoints determined by the gait data determining unit correspond to a plurality of joint points of the user;
the skeleton data determining unit normalizes the pose keypoint coordinates of each video frame based on the centre position of the two joint points of the user's neck and hip, obtains the normalized coordinates of each joint point of the user, and takes the normalized coordinates as the user skeleton data set.
In an optional embodiment, the skeleton topology construction unit connects the joint points within a video frame in the spatial domain of that single frame according to the user's skeleton structure, divides the connected joint points in the spatial graph of the single frame into six parts (head, trunk, left arm, right arm, left leg and right leg) to form six subgraphs with shared vertices and shared edges, connects the same joint point in adjacent video frames to form the temporal edges of the spatio-temporal graph, and repeats these two steps for all video frames, so that all temporal edges and the joint points of all video frames jointly form the user skeleton spatio-temporal graph.
In an alternative embodiment, the spatio-temporal graph convolutional network model comprises a multi-layer backbone network; each backbone layer comprises a spatial graph convolutional network (SGCN) and a temporal convolutional network (TCN), and adjacent SGCN and TCN blocks are connected in series and pass features to one another. In the SGCN, after the input features pass through a convolution layer, high-dimensional discriminative spatial features are extracted, with an attention mechanism, from the interaction of the joint points within the first-order neighbourhood of the input features and are input to the TCN. In the TCN, the temporal feature distribution of the high-dimensional discriminative spatial features is normalized by a batch normalization layer, activated by a rectified linear unit, and used as the input of a further convolution layer, finally achieving an effective expression of the user's joint features over several consecutive time steps and extracting high-dimensional features in both the spatial and temporal dimensions. The nonlinear mapping from the input feature space to the high-dimensional feature space is completed by stacking and cascading the multi-layer backbone network, yielding high-dimensional discriminative features, which are output through a pooling layer and a fully connected layer and used to identify the identity information of the user.
In general, compared with the prior art, the technical solutions conceived by the invention have the following beneficial effects:
The invention provides a user identity recognition method and a system combining user gait information. By combining graph convolution with pose keypoints, the spatial and temporal information of the joint points can be extracted more effectively, yielding gait features with greater discriminative power. The invention optimizes the network with an attention mechanism and a multi-loss strategy: the attention mechanism strengthens the saliency of the extracted spatio-temporal features, and the combination of classification loss and contrastive loss reduces intra-gait distance and increases inter-gait difference.
The invention studies an effective method that models dynamic skeletons to solve the gait recognition task, capturing information from the graph nodes and their links, which gives the system generalization ability and improves its fault tolerance. A spatio-temporal graph is constructed from the joint-point sequence to model the skeleton sequence dynamically, so that the spatio-temporal graph convolutional network can automatically learn the spatial features and temporal information of the human skeleton during walking, focusing on modelling gait dynamics and eliminating the influence of pedestrian appearance on recognition. The graph is partitioned (i.e., divided into several subgraphs sharing joint points) to learn high-level attributes of the different body parts and the relations among them, realizing the fusion of local and global information. The method combines classification loss and contrastive verification loss organically, effectively exploiting both the identity information of the targets and the relations between different targets, and increasing the separability of the features.
Drawings
FIG. 1 is a flow chart of the user identity recognition method combining user gait information provided by the invention;
FIG. 2 is a flowchart of the partitioned graph convolution operation according to an embodiment of the invention;
FIG. 3 is a block diagram of the spatio-temporal graph convolutional network model provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of the user identity recognition system combining user gait information according to the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a user identity recognition method and a system combining user gait information, wherein the method comprises the following steps: first, an open-source two-dimensional pose estimation system is used to perform pose detection on the pedestrian object in each frame of the video sequences of the original data set and to extract the pose information. Then a series of preprocessing operations is applied to the extracted joint coordinate sequences to generate a human skeleton data set for gait recognition, in preparation for training the model for the subsequent gait recognition task. Finally, a spatio-temporal graph convolutional network model is constructed; to capture high-level semantic information, the skeleton graph is divided into six subgraphs that share joints, and a part-based graph convolutional network is used to learn the recognition model, effectively improving performance. The model is trained on the constructed data set with a multi-loss strategy combining classification loss and contrastive loss as the loss function, the network parameters are optimized by stochastic gradient descent, and the accuracy of the trained model is evaluated on a validation set. The method makes full use of the effective information of the joint points, preserves the motion state in the time dimension to a large extent, is highly robust to clothing changes and carrying conditions, and generalizes well in cross-view tasks.
The method of the invention is grounded in early studies of gait perception, which showed that the motion of the joints over time is sufficient for a human to identify a familiar person. Deep-learning-based pose estimation algorithms are highly robust to self-occlusion, clothing changes and carried objects; and compared with gait information extracted from gait images, the joints of the body do not depend on appearance and do not change with clothing or carrying conditions. The application therefore holds that using pose keypoints for gait recognition helps mitigate the influence of covariate changes on recognition performance.
The invention captures the spatial and temporal information in gait with a method based on human pose joint points, combining the concept of a topological graph with pose keypoints. It studies an effective way of modelling dynamic skeletons to solve the gait recognition task, focusing on modelling gait dynamics and eliminating the influence of pedestrian appearance on recognition. It thereby solves the problems that, in existing gait tasks, a person's silhouette is easily deformed by covariates such as clothing and carried objects, directly reducing the recognition rate, and that conventional algorithms based on pose keypoints fail to exploit fully the effective information of the human joint points.
The technical scheme adopted by the invention is a gait recognition method combining a graph convolutional network with human pose keypoints, implemented in the following steps:
step 1, acquiring the video sequences in the original gait data set and estimating the pose of the pedestrian object in each frame of each sequence;
step 2, applying a series of preprocessing operations to the obtained human pose joint-point data to obtain a human skeleton data set for the recognition task;
step 3, constructing the corresponding human skeleton spatio-temporal topological graph on the obtained human skeleton data set;
step 4, constructing a partitioned graph convolutional network, designing the loss function, training on the training set, and optimizing the network parameters with a stochastic gradient descent algorithm;
step 5, recognizing the unknown samples in the validation set with the trained model to obtain the estimated identity information and evaluate the accuracy.
Preferably, step 1 comprises the following specific steps:
step 11: the pedestrian gait process data is acquired using a video acquisition device as a simultaneous dataset or using a common gait dataset comprising video in the original dataset, such as a CASIA-B dataset, a USF dataset, etc.
Step 12: and carrying out gesture estimation on each frame in the video sequence of the data set by using a mature gesture estimation system to obtain a gesture key point coordinate set of each frame.
Preferably, step 2 comprises the following specific steps:
step 21: normalizing the joint point coordinates in each frame obtained in the step 1 based on the center positions of the two joints, namely the neck and the hip;
step 22: randomly dividing the data set into a training set and a verification set;
step 23: the samples are normalized and serialized to be compatible with the model input format. The purpose of normalization is to make the length of all samples uniform by repeating frames sequentially to completely fill the established fixed number of frames. Serialization involves preloading standardized samples in a subset to convert them into a physical Python file. For each subset, the present application generates two physical files, a sample and a tag.
Preferably, step 3 is specifically: connecting the joint points within a frame in the spatial domain of the single frame according to the natural skeleton structure of the human body; following the method of dividing a silhouette into fixed limbs and body, dividing the spatial graph within the single frame into six parts (head, trunk, left and right arms, and left and right legs), forming six subgraphs with shared vertices and shared edges; connecting the same keypoints of adjacent frames to form the temporal edges of the spatio-temporal graph; and repeating these two steps for all input frames, so that all edges and joint points of all input frames jointly form the human skeleton spatio-temporal graph.
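The graph construction of step 3 can be sketched as follows. The 14-joint layout, the bone list and the exact six-part partition are illustrative assumptions, since the patent does not fix the joint indices:

```python
# Hypothetical 14-joint skeleton (0=head, 1=neck, 2-4=right arm,
# 5-7=left arm, 8=hip centre, 9-11=right leg, 12-13=left leg).
BONES = [(0, 1),                      # head to neck
         (1, 2), (2, 3), (3, 4),     # right arm
         (1, 5), (5, 6), (6, 7),     # left arm
         (1, 8),                      # trunk: neck to hip centre
         (8, 9), (9, 10), (10, 11),  # right leg
         (8, 12), (12, 13)]          # left leg

def temporal_edges(num_joints, num_frames):
    """Each joint connected to the same joint in the next frame."""
    return [((t, v), (t + 1, v))
            for t in range(num_frames - 1) for v in range(num_joints)]

# Six body-part subgraphs with shared vertices (e.g. the neck belongs to
# the head, trunk and both arm subgraphs); the partition is an assumption.
PARTS = {'head':  {0, 1},
         'trunk': {1, 8},
         'r_arm': {1, 2, 3, 4},
         'l_arm': {1, 5, 6, 7},
         'r_leg': {8, 9, 10, 11},
         'l_leg': {8, 12, 13}}
```

The intra-frame bones plus the temporal self-edges over all frames together form the skeleton spatio-temporal graph described above.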
Preferably, step 4 comprises the following specific steps:
step 41: constructing a partitioned space-time graph convolution network model, firstly performing space convolution on each sub-graph in a space domain, then combining convolution sub-graphs by using a weighting and fusion strategy, and finally performing time convolution on the graph after the aggregation operation of the sub-graphs is realized. Wherein a learning mask is added to form an attention mechanism before the graph convolution of the spatial domain, a learning weight matrix is given to each adjacency matrix to learn the importance of the spatial edge, and different importance is given to all the nodes in the adjacency.
Step 42: the training set is used for training, the random gradient descent method is used for optimizing network parameters, a multi-loss strategy is adopted to combine the classifying loss Softmax loss and the contrast loss Contrasitive loss, the characteristic value distance in the gait is reduced, and the difference between the gaits is increased.
Preferably, step 5 is specifically: the evaluation index adopted is the average accuracy commonly used in gait recognition, treating the gait recognition task as a classification problem. Given a gait sequence sample of a pedestrian walking normally, the sample first passes through the deep-neural-network feature learning network, trained for object classification, to obtain a covariate-independent feature representation in a low-dimensional space. The test set is then divided into two parts, a gallery (registration) set and a probe set, and the features produced by the trained model are used to evaluate the similarity between gallery and probe. The accuracy for one covariate at a given angle is the number of correctly predicted video sequences under that condition divided by the number of all video sequences at that angle for that covariate; the average accuracy under the covariate condition is the mean of the accuracies over all angles.
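The accuracy computation described in step 5 can be sketched as:

```python
def rank1_accuracy(predicted_ids, true_ids):
    """Correctly predicted sequences / all sequences for one condition
    (one covariate at one viewing angle)."""
    hits = sum(p == t for p, t in zip(predicted_ids, true_ids))
    return hits / len(true_ids)

def mean_accuracy(per_angle):
    """per_angle: {angle: (predicted_ids, true_ids)} for one covariate.
    The average accuracy is the mean of the per-angle accuracies."""
    accs = [rank1_accuracy(p, t) for p, t in per_angle.values()]
    return sum(accs) / len(accs)
```

For instance, a covariate scored 100% at one angle and 50% at another averages to 75% under that covariate condition.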
Fig. 1 is a flowchart of the user identity recognition method combining user gait information; as shown in Fig. 1, the method comprises the following steps:
S110, determining a gait data set of a user; performing pose estimation on each video frame in the gait data set, and determining the pose keypoint coordinates of each video frame; the gait data set is a plurality of video frames containing the gait information of the user;
S120, determining a user skeleton data set based on the user pose keypoint coordinates; the user skeleton data set comprises the coordinates of each joint point of the user;
S130, connecting the coordinates of the joint points of the user according to the skeleton structure of the user, based on the user skeleton data set, to construct a user skeleton spatio-temporal topological graph;
S140, inputting the user skeleton spatio-temporal topological graph into a spatio-temporal graph convolutional network model, and identifying the identity information of the user by combining the gait information of different users stored in advance in the model.
In a specific embodiment, the method comprises three main steps. The first is to extract the pose information from the video sequences of the original data set: pose detection is performed on the pedestrian object in each frame of the video with an open-source two-dimensional pose estimation system, and the joint coordinate sequence of the pedestrian object is extracted and preprocessed. The second is to construct a spatio-temporal graph from the keypoint sequence to model the spatio-temporal domain of the skeleton sequence dynamically. Finally, graph convolution is extended to the spatio-temporal graph model to extract the spatial and temporal features of the human skeleton. To capture high-level semantic information, the skeleton graph is divided into six subgraphs that share joints, and a part-based graph convolutional network is used to learn the recognition model, effectively improving performance. To extract spatio-temporal features with large inter-gait differences and small intra-gait differences, a multi-loss strategy is adopted to optimize the network.
(A) Construction of human skeleton gait data set
Existing gait data sets all contain silhouette data of the walking subjects' appearance. To make the original data set compatible with the model input, a series of preprocessing operations is required, which generates a new data set containing skeleton estimates of all pedestrian objects in the gait data set.
First, walking videos of individual subjects are acquired from the original data set, and pose estimation is performed on the walking subject in each video to obtain the coordinates of all joint points in each frame for constructing the skeleton topological graph. Each video frame contains a "pose coordinates" section, in which the estimated X-axis and Y-axis coordinates of the human joints are stored, and a "confidence score" section containing the confidence of each joint. The file for each subject contains all frames of the video sequence and the subject label. The input to the model is a feature vector on each node, which consists of a coordinate vector and the estimated confidence.
The second step performs a normalization operation. Because the camera is fixed while the data set is captured, the distance between the person and the camera changes continuously during walking. To eliminate the effect of this distance variation, so that all walking subjects can be viewed and analyzed at a relatively uniform scale, a normalization operation is applied to each joint. The neck and the hip center are two relatively stable positions while a person walks; therefore, based on these two joint positions, the normalization equation is defined as follows:

P′_i = (P_i − P_neck) / H  (1.1)

wherein P_i represents the coordinates of a human joint point, P_neck represents the joint point coordinates of the neck, H represents the distance between the center points of the neck and hip joints, and P′_i represents the normalized result.
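The normalization step above can be sketched as follows, assuming the formula P′_i = (P_i − P_neck)/H and illustrative joint indices for the neck and hip center (the real indices depend on the pose estimator's keypoint layout):

```python
# Sketch of the joint-coordinate normalization: translate so the neck is the
# origin, then scale by H, the neck-to-hip-center distance. Joint indices
# passed in below are illustrative, not the estimator's actual layout.

def normalize_pose(joints, neck_idx, hip_center_idx):
    """joints: list of (x, y) tuples, one per joint, for a single frame."""
    nx, ny = joints[neck_idx]
    hx, hy = joints[hip_center_idx]
    h = ((nx - hx) ** 2 + (ny - hy) ** 2) ** 0.5  # distance H
    if h == 0:
        raise ValueError("degenerate pose: neck and hip center coincide")
    return [((x - nx) / h, (y - ny) / h) for (x, y) in joints]

frame = [(10.0, 20.0), (12.0, 18.0), (12.0, 28.0)]  # toy 3-joint frame
normalized = normalize_pose(frame, neck_idx=1, hip_center_idx=2)
```

After normalization the neck maps to the origin and all coordinates are expressed in units of H, so subjects at different distances from the camera become comparable.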
The third step divides the data set. The steps above generate a new data set containing skeleton key point estimates for all subjects. Finally, the samples are length-normalized and serialized to make them compatible with the model input format. The purpose of length normalization is to make all samples uniform in length by repeating frames sequentially until the established fixed number of frames is completely filled. Serialization preloads the standardized samples in each subset and converts them into a serialized Python file containing their in-memory representation, i.e. the format used by the model. For each subset, two such files are generated: one for the samples and one for the labels.
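The length-normalization described above (repeating frames sequentially until a fixed frame count is filled) can be sketched as follows; the default target length is an illustrative assumption:

```python
# Sketch of length normalization: cycle through the original frames until the
# sequence fills the fixed target length. The default of 300 frames is an
# illustrative assumption, not a value stated in the source.

def pad_sequence(frames, target_len=300):
    if not frames:
        raise ValueError("empty sequence")
    out = list(frames)
    i = 0
    while len(out) < target_len:
        out.append(frames[i % len(frames)])  # repeat frames sequentially
        i += 1
    return out[:target_len]

padded = pad_sequence(["f0", "f1", "f2"], target_len=8)
```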
(B) Construction of pedestrian skeleton topological graph
Any gait video sequence of a pedestrian can be expressed as T groups of skeleton frames with N joints each, where T is the number of frames in the video. A space-time graph is constructed on this skeleton sequence: the graph is formed by the skeletons of the T frames, with the same joint connected across frames. Thus an undirected graph G = (V, E) is obtained, where V is the set of all joint points within the skeleton sequence of the input frames and E is the set of edges. In this graph, identical joints of successive frames are connected to form the temporal edges, and the joints within each frame are connected according to the natural human skeletal structure to form the spatial edges. E may be expressed as:
E_s(t) = {v_ti v_tj | (i, j) ∈ N}  (1.2)

E_t = {v_ti v_(t+1)i}  (1.3)

wherein v_ti and v_tj are different joint points in the same frame, N is the set of joint pairs connected according to the natural body skeleton structure, and v_ti and v_(t+1)i are the same joint in different frames.
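The edge sets E_s and E_t can be built as follows; the bone list here is a small illustrative subset of a full skeleton rather than the actual 18-joint connectivity:

```python
# Sketch of building the spatial edges E_s (within each frame, following the
# natural skeleton) and the temporal edges E_t (same joint across consecutive
# frames). Nodes are (frame t, joint i) pairs; BONES is an illustrative subset.

BONES = [(0, 1), (1, 2), (1, 3)]  # (i, j) joint pairs joined by the skeleton
NUM_JOINTS = 4

def build_edges(num_frames):
    spatial = [((t, i), (t, j)) for t in range(num_frames) for (i, j) in BONES]
    temporal = [((t, i), (t + 1, i))
                for t in range(num_frames - 1) for i in range(NUM_JOINTS)]
    return spatial, temporal

E_s, E_t = build_edges(num_frames=3)
```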
Since the human body is an articulated structure that can be regarded as rigid parts connected to each other, the present application further divides the human skeleton graph into several parts, each subgraph representing a local region of the body. The graph G is divided into skeletal partitions representing the corresponding body parts. G may then be represented as a combination of subgraphs, each with its own motion-trajectory properties:
G = ∪_{i∈{1,...,k}} S_i  (1.4)
where k represents the number of partitions; here k is set to 6, i.e. the human skeleton is divided into 6 parts. S_i = (V_i, E_i) is a subgraph of G that shares vertices or edges with other subgraphs. For the specific division of the space-time graph, following earlier methods that divide the silhouette into fixed limb and body regions, the constructed space-time graph is divided into six parts: the head, the torso, the left and right arms, and the left and right legs.
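A minimal sketch of the six-part partition with shared joints follows; the joint indices assume the 18-joint OpenPose layout, so the exact grouping is an assumption rather than the patent's definitive assignment:

```python
# Sketch of the six body-part subgraphs with shared joints. Indices follow the
# 18-joint OpenPose layout (an assumption); e.g. joint 1 (neck) is shared by
# the head and torso subgraphs.

PARTS = {
    "head":      [0, 1, 14, 15, 16, 17],
    "torso":     [1, 2, 5, 8, 11],
    "left_arm":  [5, 6, 7],
    "right_arm": [2, 3, 4],
    "left_leg":  [11, 12, 13],
    "right_leg": [8, 9, 10],
}

def shared_joints(parts):
    """Return joints that belong to more than one subgraph."""
    seen, shared = set(), set()
    for joints in parts.values():
        for j in joints:
            (shared if j in seen else seen).add(j)
    return shared

overlap = shared_joints(PARTS)
```

The shared joints are exactly where the aggregation step later fuses feature information between subgraphs.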
(C) Space-time diagram convolution network construction and training stage
(C1) Partial-based space-time diagram convolutional network definition
First, the neighborhood of a vertex in the pedestrian skeleton graph is defined, using a sampling function that relates the vertex v_ti to its neighborhood set, as shown in equation 1.5:
N_s(v_ti) = {v_qj | d(v_tj, v_ti) ≤ D, v_ti, v_tj ∈ V_s, |q − t| ≤ ⌊τ/2⌋}  (1.5)
wherein v_ti and v_tj represent two joint points in the same frame, v_ti and v_qi represent the same joint in different frames, V_s is the set of joints in subgraph s, and d(v_tj, v_ti) represents the shortest-path distance from vertex v_ti to v_tj. D is set to 1, representing the set of neighbors at distance 1 in the spatial dimension.
Then, a weight function is defined, which assigns the weights used to compute inner products with the input feature vectors and assigns labels to the vertices in the neighborhood. In this work, the joints in a 1-neighborhood are divided into two subsets in the spatial domain — the root joint itself and its neighbors — to model the relative position change between joints. In the temporal domain, a partitioning rule similar to temporal convolution (TCN) is used directly. Finally, the Cartesian product of the spatial partition subsets and the temporal partition subsets constitutes the result of applying the partitioning rules on the space-time graph. The mapping is then given by equation 1.6:
L_st = d(v_tj, v_ti) + (q − t + ⌊τ/2⌋)  (1.6)
which maps the joints in the neighborhood to corresponding subset labels. In the spatial dimension, K labels (L_s: V → {0, ..., K−1}) are set to allocate all joints within each vertex's 1-neighborhood; in the temporal dimension, τ labels (L_t: V → {0, ..., τ−1}) assign different weights to vertices of different frames within the neighborhood. Thus a space-time neighborhood is generated around v_ti.
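The label mapping of equation 1.6 can be sketched as a small function; the window size τ = 9 is illustrative:

```python
# Sketch of the space-time label mapping of equation 1.6: a neighbor v_qj of
# root v_ti gets label d(v_tj, v_ti) + (q - t + tau//2), combining its spatial
# subset (0 = root itself, 1 = one-hop neighbor) with its frame offset.
# tau = 9 here is an illustrative temporal window size.

def st_label(spatial_dist, q, t, tau=9):
    """spatial_dist: graph distance d in {0, 1}; q, t: neighbor and root frames."""
    assert abs(q - t) <= tau // 2, "neighbor outside the temporal window"
    return spatial_dist + (q - t + tau // 2)

# One-hop neighbor in the same frame as the root (q = t = 5):
label = st_label(spatial_dist=1, q=5, t=5)
```

Each distinct label indexes its own weight vector in the weight function, so joints at different spatial and temporal offsets receive different learned weights.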
Graph convolution can then be performed on the defined partitioned space-time graph using the sampling and weight functions defined above. First, a spatial convolution is performed on each subgraph in the spatial domain, as shown in equation 1.7:
the convolutions subgraphs are then combined using a weighted sum fusion strategy as shown in equation 1.8:
where n is the number of partitions; the aggregation function fuses the feature information of two subgraphs through their shared vertices or connecting edges. Then, after the aggregation of the subgraphs, a temporal convolution is performed on the aggregated graph, as shown in equation 1.9:
That is, a convolution operation is performed independently on each partition of each frame in the spatial domain and aggregated within each single frame; a temporal convolution — the convolution over the time domain — is then performed on the aggregated graph, as shown in fig. 2, where (a) shows the partitioned spatial graph of a single frame, (b) shows the aggregated spatial graph of a single frame, and (c) shows the human skeleton space-time graph.
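Since equations 1.7–1.9 are rendered as images in the source, the sketch below uses the common normalized-adjacency form of spatial graph convolution, X′ = D^(−1/2)(A+I)D^(−1/2) X W, as an assumption about what the per-subgraph spatial step computes:

```python
# Assumed sketch of one spatial graph-convolution step on a single frame, in
# the common normalized-adjacency form (the patent's equations 1.7-1.9 are
# images, so this is an illustrative stand-in, not the exact formula).
import numpy as np

def spatial_gcn(X, A, W):
    """X: (V, C_in) joint features; A: (V, V) adjacency; W: (C_in, C_out)."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W      # propagate, then project

V, C_in, C_out = 4, 3, 8
A = np.zeros((V, V))
for i, j in [(0, 1), (1, 2), (1, 3)]:                   # toy 4-joint skeleton
    A[i, j] = A[j, i] = 1.0
out = spatial_gcn(np.random.randn(V, C_in), A, np.random.randn(C_in, C_out))
```

Applying this step per subgraph, summing shared-vertex features, and then convolving each joint's feature sequence along the frame axis mirrors the spatial-then-temporal order described above.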
(C2) Input to a graph rolling network
The input of the whole network is the preprocessed output of the two-dimensional pose estimation system. A batch of gait videos can be expressed as a five-dimensional tensor (N, C, T, V, M), where N is the number of videos in the batch; C is the number of channels, set to 3 to represent the three features x, y, and confidence of each joint; T is the number of video frames, with frames repeated sequentially to completely fill the established fixed frame count so that all samples have uniform length; and V is the number of extracted joints — the 18 human joints labeled by the two-dimensional pose estimation system OpenPose.
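The input tensor layout can be illustrated with numpy in place of a deep-learning framework; interpreting M as the number of persons per frame (set to 1 for single-subject gait clips) is an assumption, since the text leaves M undefined:

```python
# Sketch of the five-dimensional input tensor (N, C, T, V, M). Treating M as
# the number of persons per frame is an assumption; the other dimensions
# follow the text: batch, (x, y, confidence), frames, 18 OpenPose joints.
import numpy as np

N, C, T, V, M = 2, 3, 300, 18, 1
batch = np.zeros((N, C, T, V, M), dtype=np.float32)

# Fill one joint of one frame: x, y, confidence for joint 0, frame 0, video 0.
batch[0, :, 0, 0, 0] = [0.41, 0.13, 0.98]
```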
(C3) Structure of space-time diagram convolution network
Fig. 3 shows the proposed partitioned space-time graph convolutional network architecture, which takes the human gait space-time graph as input and outputs high-dimensional discriminative features characterizing identity information. The network takes a space-time feature network as its backbone and is composed of the following modules: first, the input space-time graph features are normalized by a BN layer; then the normalized features are fed into a partitioned space-time graph convolutional network composed of SGCN and TCN modules, which extracts high-dimensional features in the spatial dimension (joints) and the temporal dimension (key frames), combined with the joint-interaction weights produced by an attention mechanism; next, six cascaded backbone layers are stacked to complete the nonlinear mapping from the input space to the feature space; finally, a pooling layer and a fully connected layer output the high-dimensional discriminative features for classification.
The partitioned space-time feature network of the present invention includes an SGCN module for the spatial domain and a TCN module for the temporal domain, as shown in fig. 3. In the spatial graph convolutional network (SGCN), the input feature X passes through a 1×1 convolution layer; combined with an attention mechanism, the high-dimensional discriminative spatial feature F under first-order joint interaction is extracted and fed into the temporal convolutional network (TCN). In the TCN, F is normalized by a BN layer to obtain the temporal feature distribution, activated by a rectified linear unit (ReLU), and then used as the input of a convolution layer with a 9×1 kernel, finally achieving an effective expression of the human joint features over 9 consecutive time steps. To keep the feature distribution consistent before and after the space-time feature network, after the TCN completes the temporal feature expression, the features are batch-normalized again by a BN layer; at the same time, a Dropout layer deactivates neurons with probability 0.5 to avoid the overfitting caused by excessive network parameters and insufficient training data.
(C4) Multi-mechanism combination
To obtain a more discriminative gait representation, an attention mechanism is used to enhance the saliency and distinguishability of the extracted spatio-temporal features; at the same time, a joint classification loss and contrastive loss are proposed, reducing the distance between feature values within a gait and increasing the difference between gaits.
(C4-1) attention mechanism Module
In the constructed human skeleton topological graph, the number of joints in the neighborhood of each vertex can differ, so the feature values of vertices with more neighbors tend to be more prominent. During walking, not all joints effectively improve gait recognition performance — some joints can even degrade it. Prior knowledge shows that the motion trajectories of the legs and arms are the most important in gait, and gait recognition, more than other recognition tasks, focuses on subtle body differences in a subject's legs and arms. The idea of introducing an attention mechanism is therefore to assign a different weight to each joint point, focusing on the joints that play a larger role and neglecting those that play a smaller one, so as to select the joints carrying effective gait characteristics. The attention mechanism is realized here by adding a learnable mask before the graph convolution. In the present invention, each model unit has its own trainable weight parameters: each adjacency matrix is given a learnable weight matrix to learn the importance of spatial edges, assigning different importance to the joints within the matrix. This can be expressed by equation 1.10:
x′_i = ∑_{j∈neighbor(i)} a_learn(i, j) W x_j  (1.10)
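Equation 1.10 amounts to multiplying the adjacency matrix element-wise by a learnable mask; the sketch below shows the mechanism with a fixed mask standing in for the trained weights (the projection W is omitted for brevity):

```python
# Sketch of the learnable attention mask: a weight matrix the same shape as
# the adjacency matrix is multiplied element-wise onto it, so each spatial
# edge (i, j) gets its own importance a_learn(i, j). The mask values here are
# fixed stand-ins for what training would learn; W is omitted.
import numpy as np

V = 4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
mask = np.ones_like(A)          # learnable during training; initialized to 1
mask[1, 3] = 2.5                # pretend training up-weighted edge (1, 3)

A_weighted = A * mask           # edges absent from A stay zero
X = np.random.randn(V, 3)
out = A_weighted @ X            # x'_i = sum_j a_learn(i, j) * x_j
```

Because the mask multiplies rather than replaces the adjacency, non-edges remain zero: the attention reweights existing skeletal connections instead of creating new ones.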
(C4-2) multiple loss strategy
The invention combines the advantages of classification loss and contrastive loss and proposes a multi-loss supervision scheme, which improves network classification performance and generalization capability, as shown in equation 1.11:
L = L_s + αL_c + λ‖W‖²  (1.11)

wherein L_s represents the Softmax loss; L_c represents the contrastive loss; α represents a balance weight; and λ‖W‖² represents a regularization term.
the loss function uses Softmax loss as a supervision signal to provide category center information for the network, and meanwhile, uses contrast loss to constrain the relationship between the categories, so that the characteristics of 'compactness in the category and separation between the categories' are shown. This brings two benefits: on one hand, the Softmax loss solves the problem of difficult convergence caused by unbalanced sample pair of the contrast loss, and simplifies the sampling and training process; on the other hand, the comparison loss optimizes the intra-class and inter-class relations, and the problem that the generalization capability of the network model is limited is solved. And finally, the gait recognition performance is improved.
The invention discloses a gait recognition method combining human pose key points and a graph convolutional network. The method is as follows. First, an open-source two-dimensional pose estimation system performs pose detection on the pedestrian object in each frame of the original dataset's video sequences and extracts the pose information. Then, a series of preprocessing operations on the extracted joint coordinate sequences generates a human skeleton data set for gait recognition, in preparation for model training of the subsequent gait recognition task. Finally, a space-time graph convolutional network model is constructed: to capture high-level semantic information, the skeleton graph is divided into six subgraphs that share joints, and a part-based graph convolutional network is used to learn the recognition model, effectively improving performance. The constructed data set is used for training, with a multi-loss strategy combining classification loss and contrastive loss as the loss function; the network parameters are optimized with stochastic gradient descent, and the accuracy of the trained model is evaluated on a validation set. The method makes full use of the effective information of the joint points, largely preserves the motion state in the time dimension, is robust to changes of clothing and carried objects, and generalizes well in cross-view tasks.
Fig. 4 is a schematic diagram of a user identification system combined with user gait information according to the present invention, as shown in fig. 4, including:
a gait data determination unit 410, for determining a gait data set of the user; performing pose estimation on each video frame in the gait data set, and determining the pose key point coordinates of each video frame; the gait data set is a plurality of video frames containing gait information of the user;
a skeleton data determination unit 420, for determining a user skeleton data set based on the user pose key point coordinates; the user skeleton data set comprises the coordinates of each joint point of the user;
a skeleton topology construction unit 430, configured to connect coordinates of each node of the user based on the user skeleton data set according to the skeleton structure of the user, so as to construct a user skeleton space-time topological graph;
the user identity identifying unit 440 is configured to input the user skeleton space-time topological graph into a space-time graph convolutional network model, and identify identity information of a user by combining gait information of different users stored in the space-time graph convolutional network model in advance.
It should be understood that the functions of the units in fig. 4 may be referred to in the foregoing detailed description of the method embodiment, and are not described herein.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (8)
1. A user identity recognition method combining user gait information, characterized by comprising the following steps:
determining a gait data set of the user; performing pose estimation on each video frame in the gait data set, and determining the pose key point coordinates of each video frame; the gait data set is a plurality of video frames containing gait information of the user;
constructing a user skeleton data set based on the user pose key point coordinates; the user skeleton data set comprises the coordinates of each joint point of the user;
connecting coordinates of all nodes of a user as space edges according to a skeleton structure of the user in each frame based on the user skeleton data set, and connecting the same nodes of two adjacent frames as time edges to construct a user skeleton space-time topological graph;
inputting the user skeleton space-time topological graph into a space-time graph convolution network model, and combining gait information of different users stored in the space-time graph convolution network model in advance to identify identity information of the users;
the space-time diagram convolutional network model comprises a plurality of layers of backbone networks, each layer of backbone network comprises a space domain convolutional network SGCN and a time domain convolutional network TCN, and the SGCN and the TCN transmit characteristics in an adjacent serial connection mode; in SGCN, after the input features pass through a convolution layer of a convolution kernel, extracting high-dimensional differentiation space features under the interaction of the joint points in a first-order neighborhood in the input features by combining an attention mechanism, and inputting the high-dimensional differentiation space features into TCN; in TCN, the high-dimensional differentiation space features normalize time domain feature distribution through a batch normalization layer, and are activated by utilizing a linear rectification function, so that the high-dimensional differentiation space features are used as input of a convolution layer of a convolution kernel, effective expression of user joint features in a plurality of continuous time domains is finally realized, and high-dimensional features in space dimension and time sequence dimension are extracted; the nonlinear mapping from the input feature space to the high-dimensional feature space is completed through stacking and cascading multi-layer backbone networks, and the high-dimensional differentiation feature is obtained; outputting the high-dimensional differentiation characteristics by using a pooling layer and a full-connection layer; the high-dimensional differentiation feature is used to identify identity information of the user.
2. The user identity recognition method according to claim 1, wherein the pose key points correspond to the joint points of the user; the user skeleton data set is determined based on the user pose key point coordinates, specifically:
normalizing the pose key point coordinates of each video frame based on the center position of the two joint points of the user's neck and hip to obtain the normalized coordinates of each joint point of the user, and taking the normalized coordinates as the user skeleton data set.
3. The method for identifying user identity according to claim 1, wherein the coordinates of each node of the user are connected according to the skeleton structure of the user based on the user skeleton data set to construct a user skeleton space-time topological graph, specifically:
connecting joint points in a video frame on a spatial domain of the single video frame according to a user skeleton structure, dividing the connected joint points of the user into six parts including a head, a trunk, a left arm, a right arm, a left leg and a right leg by using a spatial diagram in the single video frame, and forming six subgraphs with shared vertexes and shared edges; the same node of adjacent video frames are connected to form a time sequence edge of a time-space diagram; and repeating the two steps for all video frames to obtain all time sequence edges and joint points of all video frames to jointly form a user skeleton space-time diagram.
4. The user identification method of claim 1, wherein the attention mechanism is used to enhance the saliency and distinguishability of the extracted spatiotemporal features;
the attention mechanism distributes different weights to each joint point of the user, focuses on the joint point with relatively large effect, ignores the joint point with relatively small effect and selects the effective joint point of the gait feature; the attention mechanism is developed by adding a learnable mask before input to the space-time diagram convolutional network model.
5. The user identification method according to claim 1, wherein the space-time diagram convolution network model combines classification loss and contrast loss, reduces the distance of characteristic values in user gait, and increases the difference between user gaits;
the classification loss function uses Softmax loss as a supervision signal to provide class center information for the space-time diagram convolution network model, and meanwhile, uses contrast loss to constrain the relations between classes.
6. A user identification system incorporating user gait information, comprising:
a gait data determining unit, for determining a gait data set of the user; performing pose estimation on each video frame in the gait data set, and determining the pose key point coordinates of each video frame; the gait data set is a plurality of video frames containing gait information of the user;
a skeleton data determining unit, for determining a user skeleton data set based on the user pose key point coordinates; the user skeleton data set comprises the coordinates of each joint point of the user;
the skeleton topology construction unit is used for connecting the coordinates of each node of the user based on the user skeleton data set according to the skeleton structure of the user so as to construct a user skeleton space-time topological graph;
the user identity identification unit is used for inputting the user skeleton space-time topological graph into a pre-trained space-time graph convolution network model and identifying the identity information of the user by combining gait information of different users pre-stored in the space-time graph convolution network model; the space-time diagram convolutional network model comprises a plurality of layers of backbone networks, each layer of backbone network comprises a space domain convolutional network SGCN and a time domain convolutional network TCN, and the SGCN and the TCN transmit characteristics in an adjacent serial connection mode; in SGCN, after the input features pass through a convolution layer of a convolution kernel, extracting high-dimensional differentiation space features under the interaction of the joint points in a first-order neighborhood in the input features by combining an attention mechanism, and inputting the high-dimensional differentiation space features into TCN; in TCN, the high-dimensional differentiation space features normalize time domain feature distribution through a batch normalization layer, and are activated by utilizing a linear rectification function, so that the high-dimensional differentiation space features are used as input of a convolution layer of a convolution kernel, effective expression of user joint features in a plurality of continuous time domains is finally realized, and high-dimensional features in space dimension and time sequence dimension are extracted; the nonlinear mapping from the input feature space to the high-dimensional feature space is completed through stacking and cascading multi-layer backbone networks, and the high-dimensional differentiation feature is obtained; outputting the high-dimensional differentiation characteristics by using a pooling layer and a full-connection layer; the high-dimensional differentiation feature is used to identify identity information of the user.
7. The user identity recognition system according to claim 6, wherein the pose key points determined by the gait data determining unit correspond to the joint points of the user;
the skeleton data determining unit normalizes the pose key point coordinates of each video frame based on the center position of the two joint points of the user's neck and hip to obtain the normalized coordinates of each joint point of the user, which are taken as the user skeleton data set.
8. The system of claim 6, wherein the skeleton topology construction unit connects joints in the video frame according to the skeleton structure of the user in the spatial domain of the single video frame, and divides the connected joints in the spatial map in the single video frame into six parts including a head, a trunk, a left arm, a right arm, a left leg and a right leg, to form six subgraphs with shared vertices and shared edges; the same node of adjacent video frames are connected to form a time sequence edge of a time-space diagram; and repeating the two steps for all video frames to obtain all time sequence edges and joint points of all video frames to jointly form a user skeleton space-time diagram.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010943184.0A CN112101176B (en) | 2020-09-09 | 2020-09-09 | User identity recognition method and system combining user gait information |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112101176A CN112101176A (en) | 2020-12-18 |
| CN112101176B true CN112101176B (en) | 2024-04-05 |
Family
ID=73751238
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010943184.0A Active CN112101176B (en) | 2020-09-09 | 2020-09-09 | User identity recognition method and system combining user gait information |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112101176B (en) |
Families Citing this family (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112800836A (en) * | 2020-12-25 | 2021-05-14 | 富盛科技股份有限公司 | Pedestrian re-identification method, system, server and storage medium |
| CN113205060A (en) * | 2020-12-28 | 2021-08-03 | 武汉纺织大学 | Human body action detection method adopting circulatory neural network to judge according to bone morphology |
| CN112967427B (en) * | 2021-02-08 | 2022-12-27 | 深圳市机器时代科技有限公司 | Method and system for unlocking by using wearable device |
| CN112906599B (en) * | 2021-03-04 | 2024-07-09 | 杭州海康威视数字技术股份有限公司 | Gait-based personnel identity recognition method and device and electronic equipment |
| CN112926522B (en) * | 2021-03-30 | 2023-11-24 | 广东省科学院智能制造研究所 | Behavior recognition method based on skeleton gesture and space-time diagram convolution network |
| CN113128424B (en) * | 2021-04-23 | 2024-05-03 | 浙江理工大学 | Method for identifying action of graph convolution neural network based on attention mechanism |
| CN113361334B (en) * | 2021-05-18 | 2022-07-22 | 山东师范大学 | Method and system for person re-identification based on key point optimization and multi-hop attention graph convolution |
| CN113159007B (en) * | 2021-06-24 | 2021-10-29 | 之江实验室 | Gait emotion recognition method based on adaptive graph convolution |
| CN113537121B (en) * | 2021-07-28 | 2024-06-21 | 浙江大华技术股份有限公司 | Identity recognition method and device, storage medium and electronic equipment |
| CN113963201B (en) * | 2021-10-18 | 2022-06-14 | 郑州大学 | Bone action recognition method, device, electronic device and storage medium |
| CN113887516B (en) * | 2021-10-29 | 2024-05-24 | 北京邮电大学 | Feature extraction system and method for human motion recognition |
| CN114052726A (en) * | 2021-11-25 | 2022-02-18 | 湖南中科助英智能科技研究院有限公司 | A thermal infrared human gait recognition method and device in a dark environment |
| CN114373227B (en) * | 2022-01-05 | 2025-10-24 | 北京爱笔科技有限公司 | Skeleton key point encoding method, device, electronic device and storage medium |
| US12236688B2 (en) | 2022-01-27 | 2025-02-25 | Toyota Research Institute, Inc. | Systems and methods for tracking occluded objects |
| CN114267088B (en) * | 2022-03-02 | 2022-06-07 | 北京中科睿医信息科技有限公司 | Gait information processing method and device and electronic equipment |
| CN114613011A (en) * | 2022-03-17 | 2022-06-10 | 东华大学 | Human 3D Skeletal Behavior Recognition Method Based on Graph Attention Convolutional Neural Network |
| CN114783050A (en) * | 2022-03-21 | 2022-07-22 | 上海交通大学 | Gait identity recognition method and system based on dynamic vision sensor |
| CN114638064B (en) * | 2022-03-23 | 2024-12-13 | 昆明理工大学 | A vision-based method for a quadruped bionic robot to simulate animal gaits |
| CN114782992B (en) * | 2022-04-29 | 2025-05-06 | 常州大学 | A super joint and multimodal network and its application in behavior recognition method |
| CN115131876B (en) * | 2022-07-13 | 2024-10-29 | 中国科学技术大学 | Emotion recognition method and system based on human body movement gait and posture |
| CN115050101B (en) * | 2022-07-18 | 2024-03-22 | 四川大学 | A gait recognition method based on fusion of skeleton and contour features |
| CN115424339A (en) * | 2022-08-10 | 2022-12-02 | 一汽奔腾轿车有限公司 | Method, system, storage medium and electronic device for automatically opening rear tailgate |
| CN115830712B (en) * | 2022-12-06 | 2023-12-01 | 凯通科技股份有限公司 | Gait recognition method, device, equipment and storage medium |
| CN115953830B (en) * | 2022-12-06 | 2025-11-04 | 浪潮云信息技术股份公司 | A behavior recognition method, apparatus, device and medium |
| CN118587738B (en) * | 2024-06-03 | 2025-05-02 | 西藏大学 | Method and system for identifying Thangka image character based on human body posture estimation |
| CN120260135A (en) * | 2025-05-30 | 2025-07-04 | 泉州装备制造研究所 | A method, device, equipment and storage medium for counting repeated motion gestures |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110119703A (en) * | 2019-05-07 | 2019-08-13 | Fuzhou University | Human action recognition method fusing an attention mechanism and a spatio-temporal graph convolutional neural network in a security scene |
| CN110688898A (en) * | 2019-08-26 | 2020-01-14 | Donghua University | Cross-view gait recognition method based on a spatio-temporal two-stream convolutional neural network |
| CN110781765A (en) * | 2019-09-30 | 2020-02-11 | Tencent Technology (Shenzhen) Co., Ltd. | Human body posture recognition method, device, equipment and storage medium |
| KR20200014461A (en) * | 2018-07-31 | 2020-02-11 | Dongguk University Industry-Academic Cooperation Foundation | Apparatus and method for gait-based identification using a convolutional neural network |
| WO2020042419A1 (en) * | 2018-08-29 | 2020-03-05 | Hanvon Technology Co., Ltd. | Gait-based identity recognition method and apparatus, and electronic device |
| CN111160085A (en) * | 2019-11-19 | 2020-05-15 | Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co., Ltd. | Human image key point pose estimation method |
| CN111310668A (en) * | 2020-02-18 | 2020-06-19 | Dalian Maritime University | A gait recognition method based on skeleton information |
| CN111382679A (en) * | 2020-02-25 | 2020-07-07 | Shanghai Jiao Tong University | Method, system and equipment for evaluating severity of gait movement disorder in Parkinson's disease |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110054870A1 (en) * | 2009-09-02 | 2011-03-03 | Honda Motor Co., Ltd. | Vision Based Human Activity Recognition and Monitoring System for Guided Virtual Rehabilitation |
| WO2016065534A1 (en) * | 2014-10-28 | 2016-05-06 | 中国科学院自动化研究所 | Deep learning-based gait recognition method |
| US20190188533A1 (en) * | 2017-12-19 | 2019-06-20 | Massachusetts Institute Of Technology | Pose estimation |
- 2020-09-09: Application CN202010943184.0A filed in China (CN); granted as patent CN112101176B, legal status Active
Non-Patent Citations (5)
| Title |
|---|
| "GaitPart: Temporal Part-Based Model for Gait Recognition";C. Fan等;《2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)》;20200805;第14213-14221页 * |
| "Graph Edge Convolutional Neural Networks for Skeleton-Based Action Recognition";X. Zhang等;《IEEE Transactions on Neural Networks and Learning Systems》;20190917;第31卷(第8期);第3047-3060页 * |
| " 基于多规则学习的康复姿势及动作识别算法研究";胡博;《中国优秀硕士学位论文全文数据库 医药卫生科技辑》;20200315(第3期);E060-479 * |
| "一种集成卷积神经网络和深信网的步态识别与模拟方法";何正义等;《山东大学学报:工学版》;20180813;第48卷(第3期);第88-95页 * |
| "基于频域注意力时空卷积网络的步态识别方法";赵国顺等;《信息技术与网络安全》;20200618;第39卷(第6期);第1-6页 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112101176A (en) | 2020-12-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112101176B (en) | User identity recognition method and system combining user gait information | |
| CN111274916B (en) | Face recognition method and face recognition device | |
| CN106682598B (en) | Multi-pose face feature point detection method based on cascade regression | |
| CN111160294B (en) | Gait recognition method based on graph convolutional network | |
| CN113221663A (en) | Real-time sign language intelligent identification method, device and system | |
| CN111310668A (en) | A gait recognition method based on skeleton information | |
| Mici et al. | A self-organizing neural network architecture for learning human-object interactions | |
| Hasan et al. | Multi-level feature fusion for robust pose-based gait recognition using RNN | |
| Aitkenhead et al. | A neural network face recognition system | |
| CN109447175 (en) | Pedestrian re-identification method combining deep learning and metric learning |
| CN114519899B (en) | An identity recognition method and system based on adaptive fusion of multiple biometric features | |
| Kovač et al. | Frame–based classification for cross-speed gait recognition | |
| Chan et al. | A 3-D-point-cloud system for human-pose estimation | |
| Wen et al. | Multi-view gait recognition based on generative adversarial network | |
| CN114782992B (en) | A super joint and multimodal network and its application in behavior recognition method | |
| Yang et al. | Self-supervised video pose representation learning for occlusion-robust action recognition | |
| Li et al. | A multi-modal dataset for gait recognition under occlusion | |
| Du | The computer vision simulation of athlete’s wrong actions recognition model based on artificial intelligence | |
| CN111626152A (en) | Few-shot-based spatio-temporal gaze direction estimation prototype design |
| Liu et al. | A deep learning based framework to detect and recognize humans using contactless palmprints in the wild | |
| Arnaud et al. | Tree-gated deep mixture-of-experts for pose-robust face alignment | |
| Batool et al. | Fundamental recognition of ADL assessments using machine learning engineering | |
| CN112818942A (en) | Pedestrian action recognition method and system in vehicle driving process | |
| Wei | Collar recognition and matching of clothing style drawings based on complex networks | |
| Raskin et al. | 3D Human Body-Part Tracking and Action Classification Using A Hierarchical Body Model. |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |