Disclosure of Invention
In view of the defects of existing motion posture correction methods, the invention aims to provide a single-person motion posture correction method based on a neural network, so as to improve the accuracy and efficiency of motion posture correction.
The idea of the invention is to build a human body joint point detection network based on spatial domain conversion, construct a standard motion data set and a common motion data set, set a scoring threshold of 50, and determine the action points that need to be corrected. The method comprises the following implementation steps:
(1) collecting a training data set:
(1a) downloading an image data set containing human body joint points and storing the image data set into a training image folder A;
(1b) downloading a label file corresponding to the data set, and storing the label file into a training label folder B;
(1c) putting the image folder and the label folder into the same folder to form a training data set;
(2) constructing a human body joint point detection network based on spatial domain conversion, which is formed by cascading an image spatial domain conversion sub-network and a human body joint point detection sub-network, wherein:
the image spatial domain conversion sub-network consists of 3 convolutional layers in sequence;
the human body joint point detection sub-network comprises 9 convolution layers and 4 deconvolution layers, in which the 4 deconvolution layers are connected in sequence between the first 8 sequentially cascaded convolution layers and the last convolution layer;
(3) training a human body joint point detection network based on spatial domain conversion:
(3a) reading a training data set image from training image folder A and inputting it into the human body joint point detection network based on spatial domain conversion constructed in step (2); the image spatial conversion sub-network in this network generates a spatially converted image, which then passes through the human body joint point detection sub-network to output the predicted coordinate values of the human body joint points;
(3b) reading the labeled coordinate values corresponding to the training data set images from training label folder B, calculating the loss value L of the human body joint point network, and training the network constructed in step (2) with this loss value using a stochastic gradient descent algorithm to obtain a trained human body joint point detection network based on spatial domain conversion;
(4) constructing a standard motion data set:
(4a) shooting a standard action video demonstrated by a standard athlete;
(4b) extracting each frame of the shot standard action video as a picture, and storing the pictures into standard picture folder C;
(4c) respectively inputting the collected pictures into a trained human body joint point detection network based on spatial domain conversion to obtain coordinate information of each human body joint point, and storing the obtained coordinate information into a standard labeling folder D;
(5) constructing a common motion data set:
(5a) shooting a non-standard motion video demonstrated by a common athlete;
(5b) extracting each frame of the shot non-standard action video as a picture, and storing the pictures into test picture folder E;
(5c) respectively inputting the collected pictures into a trained human body joint point detection network based on spatial domain conversion to obtain coordinate information of each human body joint point, and storing the obtained coordinate information into a test labeling folder F;
(6) setting a scoring threshold of 50 and determining the action points needing correction:
(6a) reading coordinate information corresponding to the test picture from the test labeling folder F;
(6b) reading coordinate information corresponding to the standard picture from the standard labeling folder D;
(6c) sequentially calculating the Euclidean distance sum of the coordinates of the joint points of the test picture and the standard picture, and taking the standard picture with the minimum Euclidean distance sum as a standard matching picture of the test picture;
(6d) calculating the Euclidean distance between each joint point in the test picture and the corresponding joint point in the standard matching picture, and counting the joint points whose distance is greater than the set scoring threshold, namely the joint points to be corrected.
Compared with the prior art, the invention has the following advantages:
1. The identification accuracy is high
The existing posture correction method depends heavily on the exercise experience and skill level of the teacher; when the teacher's experience deviates or the teacher is not proficient in a certain exercise, the exercise and training of students are often misled. The invention establishes a human body joint point detection network based on spatial domain conversion, collects standard motion videos, and defines standard actions strictly and uniformly, so that the accuracy of guidance is greatly improved.
2. The training efficiency is high
In the existing posture correction method, because the number of teachers is far smaller than the number of students, students often cannot receive effective guidance at any time. By establishing a universal motion posture detection method, the invention enables students to receive training at any time, thereby greatly improving training efficiency.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Referring to fig. 1, the specific implementation steps for this example are as follows.
Step 1, a training data set is collected.
(1.1) downloading an image data set containing human body joint points from an open website and storing the image data set into a training image folder A;
(1.2) downloading a label file corresponding to the data set from the public website, and storing the label file into a training label folder B;
the label file contains coordinate information of 18 joint points in the human body, and the 18 joint points are respectively as follows: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle, right eye, left eye, right ear, and left ear;
(1.3) putting the image folder and the label folder into the same folder to form a training data set.
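By way of illustration only, the training data set of step 1 might be loaded as in the following minimal Python sketch; the file layout, the JSON label format, and the helper name load_training_pairs are illustrative assumptions, not part of the invention.

```python
# Hypothetical loader for the step 1 training data set. Folder names "A" and "B"
# follow the text; the per-image JSON label format is an assumption.
import json
import os
import cv2  # OpenCV, assumed available

JOINT_NAMES = [
    "nose", "neck", "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist", "right_hip", "right_knee",
    "right_ankle", "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]  # the 18 joint points listed in step (1.2)

def load_training_pairs(image_dir="A", label_dir="B"):
    """Yield (image, joints) pairs; joints is a list of 18 (x, y) tuples."""
    for name in sorted(os.listdir(image_dir)):
        stem, _ = os.path.splitext(name)
        image = cv2.imread(os.path.join(image_dir, name))
        with open(os.path.join(label_dir, stem + ".json")) as f:
            joints = [tuple(pt) for pt in json.load(f)["joints"]]
        assert len(joints) == len(JOINT_NAMES)
        yield image, joints
```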
And 2, building a human body joint point detection network based on spatial domain conversion.
(2.1) constructing an image spatial domain conversion sub-network:
the sub-network is composed of 3 convolutional layers in sequence, wherein:
the convolution kernel size of the 1st convolution layer is 1 × 1, the number of convolution kernels is 3, and the step size is 1;
the convolution kernel size of the 2nd convolution layer is 1 × 1, the number of convolution kernels is 64, and the step size is 1;
the convolution kernel size of the 3rd convolution layer is 1 × 1, the number of convolution kernels is 3, and the step size is 1.
(2.2) constructing a human joint point detection sub-network:
the sub-network comprises 9 convolution layers and 4 deconvolution layers, with the following structural relationship: first convolution layer → second convolution layer → third convolution layer → fourth convolution layer → fifth convolution layer → sixth convolution layer → seventh convolution layer → eighth convolution layer → first deconvolution layer → second deconvolution layer → third deconvolution layer → fourth deconvolution layer → ninth convolution layer, wherein:
the convolution kernel size of the first convolution layer is 3 × 3, the number of convolution kernels is 128, and the step size is 1;
the convolution kernel size of the second convolution layer is 1 × 1, the number of convolution kernels is 256, and the step size is 2;
the convolution kernel size of the third convolution layer is 3 × 3, the number of convolution kernels is 256, and the step size is 1;
the convolution kernel size of the fourth convolution layer is 1 × 1, the number of convolution kernels is 256, and the step size is 2;
the convolution kernel size of the fifth convolution layer is 3 × 3, the number of convolution kernels is 256, and the step size is 1;
the convolution kernel size of the sixth convolution layer is 1 × 1, the number of convolution kernels is 256, and the step size is 2;
the convolution kernel size of the seventh convolution layer is 3 × 3, the number of convolution kernels is 256, and the step size is 1;
the convolution kernel size of the eighth convolution layer is 1 × 1, the number of convolution kernels is 256, and the step size is 1;
the convolution kernel size of the first deconvolution layer is 3 × 3, the number of convolution kernels is 256, and the step size is 2;
the convolution kernel size of the second deconvolution layer is 3 × 3, the number of convolution kernels is 128, and the step size is 2;
the convolution kernel size of the third deconvolution layer is 3 × 3, the number of convolution kernels is 128, and the step size is 2;
the convolution kernel size of the fourth deconvolution layer is 3 × 3, the number of convolution kernels is 128, and the step size is 1;
the convolution kernel size of the ninth convolution layer is 1 × 1, the number of convolution kernels is 18, and the step size is 1;
(2.3) cascading the constructed image spatial domain conversion sub-network with the human body joint point detection sub-network to form the human body joint point detection network based on spatial domain conversion.
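By way of illustration only, the architecture specified in step 2 can be sketched in PyTorch as follows; the ReLU activations and the padding and output-padding choices are assumptions, since the text fixes only kernel sizes, kernel counts, and step sizes.

```python
# Hypothetical PyTorch realization of the step 2 architecture. The ReLU
# activations and "same"-style padding are assumptions not given in the text.
import torch.nn as nn

def conv(cin, cout, k, s):
    # convolution layer with kernel size k and step size (stride) s
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2), nn.ReLU())

def deconv(cin, cout, s):
    # 3x3 deconvolution (transposed convolution) layer with step size s
    out_pad = 1 if s == 2 else 0
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 3, stride=s, padding=1, output_padding=out_pad),
        nn.ReLU(),
    )

class JointDetectionNetwork(nn.Module):
    """Image spatial domain conversion sub-network cascaded with the
    human body joint point detection sub-network, as in step (2.3)."""
    def __init__(self):
        super().__init__()
        # (2.1) three 1x1 convolution layers: 3 -> 64 -> 3 kernels
        self.spatial = nn.Sequential(
            conv(3, 3, 1, 1), conv(3, 64, 1, 1), conv(64, 3, 1, 1))
        # (2.2) eight convolution layers, four deconvolution layers, and a
        # final 1x1 convolution producing one output map per joint point
        self.detect = nn.Sequential(
            conv(3, 128, 3, 1), conv(128, 256, 1, 2),
            conv(256, 256, 3, 1), conv(256, 256, 1, 2),
            conv(256, 256, 3, 1), conv(256, 256, 1, 2),
            conv(256, 256, 3, 1), conv(256, 256, 1, 1),
            deconv(256, 256, 2), deconv(256, 128, 2),
            deconv(128, 128, 2), deconv(128, 128, 1),
            nn.Conv2d(128, 18, kernel_size=1, stride=1),
        )

    def forward(self, x):
        return self.detect(self.spatial(x))
```

Under these assumptions the three stride-2 convolutions downsample the input by a factor of 8 and the three stride-2 deconvolutions restore it, so the 18 output maps share the input resolution.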
And 3, training a human body joint point detection network based on spatial domain conversion.
(3.1) reading a training data set image from training image folder A and inputting it into the human body joint point detection network based on spatial domain conversion constructed in step (2); the image spatial conversion sub-network in this network generates a spatially converted image, which then passes through the human body joint point detection sub-network to output the predicted coordinate values of the human body joint points;
(3.2) reading the labeled coordinate values corresponding to the images of the training data set from training label folder B, and calculating the loss value L of the human body joint point detection network based on spatial domain conversion:

L = \sum_{i=1}^{18} \left[ (x'_i - x_i)^2 + (y'_i - y_i)^2 \right]

wherein i represents the serial number of the human body joint point, x'_i and y'_i respectively represent the labeled abscissa and ordinate values of the joint point with the corresponding serial number, and x_i and y_i respectively represent the abscissa and ordinate of the predicted coordinate value output by the human body joint point detection network based on spatial domain conversion;
(3.3) using the loss value L of the human body joint point detection network based on spatial domain conversion, training the network constructed in step (2) by a stochastic gradient descent algorithm:
(3.3.1) taking the derivative of the loss value of the human body joint point detection network based on spatial domain conversion with respect to the network parameters:

F = \frac{\partial L}{\partial \theta}

wherein F represents the derivative of the loss value L of the human body joint point detection network based on spatial domain conversion with respect to its network parameter θ, and θ represents the parameters of the human body joint point detection network based on spatial domain conversion;
(3.3.2) calculating an updated value of the human body joint point detection network parameter based on the spatial domain conversion:
\theta_2 = \theta - \alpha F

wherein θ₂ represents the updated value of the parameters of the human body joint point detection network based on spatial domain conversion, and α is the learning rate of the human body joint point detection network based on spatial domain conversion, with a value of 0.00025;
(3.3.3) replacing the parameter θ of the original network with the updated parameter value θ₂;
(3.4) iterating steps (3.3.1) to (3.3.3) 150000 times to obtain the trained human body joint point detection network based on spatial domain conversion.
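By way of illustration only, the training procedure of step 3 might look like the following sketch, reusing the JointDetectionNetwork class from the step 2 sketch; the soft-argmax decoding of the 18 output maps into coordinates and the batches() data iterator are illustrative assumptions, since the text states only that predicted coordinate values are output.

```python
# Hypothetical training loop for step 3. soft_argmax (decoding output maps into
# coordinates) and batches() (an iterator of image/label tensors) are assumed.
import torch
import torch.nn.functional as F_

def soft_argmax(heatmaps):
    """Differentiably decode (B, 18, H, W) maps into (B, 18, 2) coordinates."""
    b, k, h, w = heatmaps.shape
    probs = F_.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    xs = torch.arange(w, dtype=probs.dtype)
    ys = torch.arange(h, dtype=probs.dtype)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # expected column index per joint
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # expected row index per joint
    return torch.stack([x, y], dim=-1)

model = JointDetectionNetwork()
optimizer = torch.optim.SGD(model.parameters(), lr=0.00025)  # α of step (3.3.2)

for step, (images, labels) in enumerate(batches()):  # labels: (B, 18, 2)
    if step >= 150000:  # iteration count of step (3.4)
        break
    pred = soft_argmax(model(images))                     # predicted (x_i, y_i)
    loss = ((labels - pred) ** 2).sum(dim=(1, 2)).mean()  # loss L of step (3.2)
    optimizer.zero_grad()
    loss.backward()    # F = ∂L/∂θ, step (3.3.1)
    optimizer.step()   # θ₂ = θ - αF, steps (3.3.2)-(3.3.3)
```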
And 4, constructing a standard motion data set:
(4.1) shooting a standard motion video demonstrated by a standard athlete, wherein the shooting equipment is Canon EOS 5D Mark IV, and the video frame rate is 60 frames/second;
(4.2) extracting each frame of the shot standard action video as a picture, as shown in fig. 2, and storing the pictures into standard picture folder C;
(4.3) inputting each collected picture into the trained human body joint point detection network based on spatial domain conversion to obtain the coordinate information of each human body joint point, and storing the obtained coordinate information into standard labeling folder D.
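By way of illustration only, steps (4.2) and (4.3) might be implemented as follows with OpenCV; the same procedure applies to steps (5.2) and (5.3) with folders E and F. The file naming, the JSON output format, and the absence of input normalization are simplifying assumptions.

```python
# Hypothetical sketch of steps (4.2)-(4.3): extract video frames and run the
# trained network on each frame, saving the 18 joint coordinates as JSON.
import json
import os
import cv2
import torch

def video_to_joint_labels(video_path, picture_dir="C", label_dir="D", model=None):
    cap = cv2.VideoCapture(video_path)  # 60 frames/second footage, step (4.1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # store the frame as a picture in the picture folder
        cv2.imwrite(os.path.join(picture_dir, f"{idx:06d}.jpg"), frame)
        # run the trained network to obtain the 18 joint coordinates
        tensor = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0)
        with torch.no_grad():
            coords = soft_argmax(model(tensor))[0].tolist()  # from step 3 sketch
        # store the coordinate information in the labeling folder
        with open(os.path.join(label_dir, f"{idx:06d}.json"), "w") as f:
            json.dump({"joints": coords}, f)
        idx += 1
    cap.release()
```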
And 5, constructing a common motion data set.
(5.1) shooting a non-standard action video demonstrated by a common athlete, wherein the shooting equipment is a Canon EOS 5D Mark IV, and the video frame rate is 60 frames/second;
(5.2) extracting each frame of the shot non-standard action video as a picture, as shown in fig. 3, and storing the pictures into test picture folder E;
(5.3) inputting each acquired picture into the trained human body joint point detection network based on spatial domain conversion to obtain the coordinate information of each human body joint point, and storing the obtained coordinate information into test labeling folder F.
And 6, determining action points needing to be corrected.
(6.1) reading the coordinate information corresponding to the test picture from the test labeling folder F;
(6.2) reading coordinate information corresponding to the standard picture from the standard labeling folder D;
(6.3) sequentially calculating the sum of the Euclidean distances between the joint point coordinates of the test picture and those of each standard picture:

P = \sum_{i=1}^{18} \left[ (a'_i - a_i)^2 + (b'_i - b_i)^2 \right]

wherein P represents the sum of the Euclidean distances between the joint point coordinates of the test picture and the standard picture, i represents the serial number of the human body joint point, a'_i and b'_i respectively represent the abscissa and ordinate values of the joint point with the corresponding serial number in the test picture, and a_i and b_i respectively represent the abscissa and ordinate values of the joint point with the corresponding serial number in the standard picture.
(6.4) taking the standard picture with the minimum sum of Euclidean distances as the standard matching picture of the test picture;
(6.5) calculating the Euclidean distance between each joint point in the test picture and the corresponding joint point in the standard matching picture:

Q_j = (c'_j - c_j)^2 + (d'_j - d_j)^2, \quad j = 1, 2, \ldots, 18

wherein Q_j represents the Euclidean distance between the coordinates of the j-th joint point of the test picture and that of the standard matching picture, j represents the serial number of the human body joint point, c'_j and d'_j respectively represent the abscissa and ordinate values of the joint point with the corresponding serial number in the test picture, and c_j and d_j respectively represent the abscissa and ordinate values of the joint point with the corresponding serial number in the standard matching picture.
(6.6) setting the scoring threshold to 50, and counting the joint points whose Euclidean distance Q_j between the test picture and the standard matching picture is greater than the scoring threshold; these joint points are the joint points to be corrected.
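By way of illustration only, steps (6.1) to (6.6) reduce to the following minimal sketch; the in-memory representation of the coordinate files as lists of 18 (x, y) pairs is an assumption.

```python
# Hypothetical sketch of step 6: match each test frame to its closest standard
# frame by the distance sum P, then flag joints whose distance Q_j exceeds 50.
def distance_sum(test_joints, std_joints):
    """P of step (6.3): summed squared differences over all 18 joints."""
    return sum((a2 - a) ** 2 + (b2 - b) ** 2
               for (a2, b2), (a, b) in zip(test_joints, std_joints))

def joints_to_correct(test_joints, standard_frames, threshold=50):
    # step (6.4): the standard picture minimizing P is the standard matching picture
    match = min(standard_frames, key=lambda s: distance_sum(test_joints, s))
    # steps (6.5)-(6.6): flag joints whose distance Q_j exceeds the threshold
    return [j for j, ((c2, d2), (c, d)) in enumerate(zip(test_joints, match))
            if (c2 - c) ** 2 + (d2 - d) ** 2 > threshold]
```

The returned indices can be mapped back to joint names through the JOINT_NAMES list of the step 1 sketch.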
The foregoing description is only an example of the present invention and is not intended to limit the invention, so that it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the spirit and scope of the invention.