CN111507289A - Video matching method, computer device and storage medium - Google Patents
- Publication number
- CN111507289A CN111507289A CN202010321932.1A CN202010321932A CN111507289A CN 111507289 A CN111507289 A CN 111507289A CN 202010321932 A CN202010321932 A CN 202010321932A CN 111507289 A CN111507289 A CN 111507289A
- Authority
- CN
- China
- Prior art keywords
- video
- video frame
- loss
- video source
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/48—Matching video sequences
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
Abstract
The application relates to a video matching method, a computer device and a storage medium. The method comprises the following steps: acquiring a video source; inputting the video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source; and determining, according to the feature vector corresponding to each video frame and a preset video library, a target video sequence in the video library that matches the video source. The neural network model is trained according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects in the video frames of the positive sample video source belong to the same category as the object to be detected, while the sample objects in the video frames of the negative sample video source belong to a different category from the object to be detected. The method can improve the accuracy of video detection results.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a video matching method, a computer device, and a storage medium.
Background
Pedestrian Re-Identification (Re-ID for short), also known as cross-camera tracking, refers to retrieving the same target person under different cameras by means of computer vision. When computer vision is used to realize pedestrian re-identification, a deep learning network is generally used to extract a target object from a video; the extracted target object is then matched against target objects in other videos to determine whether the target objects in different videos are the same target object.
Because pedestrian re-identification relies on a deep learning network, that network must be trained in advance. In the conventional training process, a triplet loss function (i.e., the loss between positive and negative sample videos) is usually adopted to train the deep learning network and obtain the trained network.
However, a deep learning network trained in this way is not accurate enough, so the detection result finally obtained for the target object is also inaccurate.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a video matching method, apparatus, computer device and storage medium capable of improving accuracy of detection results.
A method of video matching, the method comprising:
acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
inputting a video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source;
determining, according to the feature vector corresponding to each video frame in the video source and a preset video library, a target video sequence in the video library that matches the video source; the target video sequence comprises at least two target video sources;
the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects included in the video frames of the positive sample video source and the object to be detected are objects of the same category, and the sample objects included in the video frames of the negative sample video source and the object to be detected are objects of different categories.
In one embodiment, the training method of the neural network model includes:
inputting each video frame of the positive sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; inputting each video frame of the negative sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the negative sample video source;
calculating first loss between the characteristic data corresponding to each video frame of the positive sample video source and the characteristic data corresponding to each video frame of the negative sample video source;
calculating second loss among the characteristic data corresponding to each video frame of the positive sample video source, and calculating third loss among the characteristic data corresponding to each video frame of the negative sample video source;
and training the initial neural network model based on the first loss, the second loss and the third loss to obtain the neural network model.
In one embodiment, if the positive sample video source includes a plurality of first video sources, the calculating a second loss between feature data corresponding to each video frame of the positive sample video source includes:
calculating the loss among the characteristic data corresponding to the video frames belonging to the same first video source to obtain a first intra-class loss corresponding to each video frame;
calculating the loss among the characteristic data corresponding to the video frames belonging to different first video sources to obtain a second intra-class loss corresponding to each video frame;
and obtaining the second loss according to the first intra-class loss corresponding to each video frame and the second intra-class loss corresponding to each video frame.
In one embodiment, the calculating the loss between the feature data corresponding to the video frames belonging to the same first video source to obtain the first intra-class loss corresponding to each video frame includes:
calculating the relative distance between the feature data corresponding to any one video frame in the same first video source and the feature data corresponding to other video frames to obtain a first relative distance corresponding to each video frame;
comparing the first relative distance corresponding to each video frame with a preset first distance threshold value to obtain a comparison result of each video frame;
and determining the first intra-class loss corresponding to each video frame according to the comparison result of each video frame.
In one embodiment, the determining the first intra-class loss corresponding to each video frame according to the comparison result of each video frame includes:
if the comparison result of a video frame is that the first relative distance corresponding to the video frame is smaller than the first distance threshold, determining 0 as the first intra-class loss corresponding to the video frame;
and if the comparison result of a video frame is that the first relative distance corresponding to the video frame is not smaller than the first distance threshold, determining the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame.
In one embodiment, the calculating the loss between the feature data corresponding to the video frames belonging to different first video sources to obtain the second intra-class loss corresponding to each video frame includes:
calculating the relative distance between the characteristic data corresponding to each video frame in any one first video source and the characteristic data corresponding to each video frame in other first video sources to obtain a second relative distance corresponding to each video frame;
comparing the second relative distance corresponding to each video frame with a preset second distance threshold value to obtain a comparison result of each video frame;
and determining the second intra-class loss corresponding to each video frame according to the comparison result of each video frame.
In one embodiment, the determining the second intra-class loss corresponding to each video frame according to the comparison result of each video frame includes:
if the comparison result of a video frame is that the second relative distance corresponding to the video frame is smaller than the second distance threshold, determining 0 as the second intra-class loss corresponding to the video frame;
and if the comparison result of a video frame is that the second relative distance corresponding to the video frame is not smaller than the second distance threshold, determining the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame.
In one embodiment, the first distance threshold is less than the second distance threshold.
A video matching apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
the extraction module is used for inputting the video source into the neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source; the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects included in the video frames of the positive sample video source and the object to be detected are objects of the same category, and the sample objects included in the video frames of the negative sample video source and the object to be detected are objects of different categories;
the determining module is used for determining a target video sequence matched with the video source in the video library according to the feature vector corresponding to each video frame in the video source and a preset video library; the target video sequence comprises at least two target video sources.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
inputting a video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source;
determining, according to the feature vector corresponding to each video frame in the video source and a preset video library, a target video sequence in the video library that matches the video source; the target video sequence comprises at least two target video sources;
the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects included in the video frames of the positive sample video source and the object to be detected are objects of the same category, and the sample objects included in the video frames of the negative sample video source and the object to be detected are objects of different categories.
A readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
inputting a video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source;
determining, according to the feature vector corresponding to each video frame in the video source and a preset video library, a target video sequence in the video library that matches the video source; the target video sequence comprises at least two target video sources;
the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects included in the video frames of the positive sample video source and the object to be detected are objects of the same category, and the sample objects included in the video frames of the negative sample video source and the object to be detected are objects of different categories.
According to the video matching method, the video matching apparatus, the computer device and the storage medium, each video frame of a video source can be input into a trained neural network model for feature extraction to obtain a feature vector corresponding to each video frame, and a target video sequence matched with the video source is then obtained from a preset video library according to these feature vectors. The neural network model is trained using the loss between the positive and negative sample video sources, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source. Because the loss between the positive and negative sample video sources is considered during training, the distance between video sources of different categories can be kept sufficiently large; and because the respective intra-class losses of the positive and negative sample video sources are also considered, the distance between video sources of the same category can be kept sufficiently small, that is, same-category video sources cluster better. The neural network model trained with these losses is therefore more accurate, and when it is used to detect video sources, the obtained detection result is more accurate, improving the accuracy of the detection result.
Drawings
FIG. 1 is a diagram illustrating an internal structure of a computer device according to an embodiment;
FIG. 2 is a flow diagram illustrating a video matching method in accordance with one embodiment;
FIG. 3 is a flow diagram illustrating a video matching method in accordance with one embodiment;
FIG. 4 is a flowchart illustrating a video matching method according to another embodiment;
FIG. 5 is a block diagram of a video matching device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Video re-identification (re-id) refers to retrieving, by means of computer vision, whether a target appears under different cameras; as a derivative of picture re-id, it has wide application in practical scenes. In real surveillance scenes, video information is ubiquitous, so a large amount of video data is available for a model to use. Video can provide information about the different behaviors of the same person from different angles, so whether two targets belong to the same person or object can be judged more reliably from such rich video information. Meanwhile, deep learning is an important branch of artificial intelligence and has achieved great success in fields such as image and speech recognition. Neural networks, as important tools of deep learning, have been widely used in universities and enterprises. So far, neural networks mainly comprise two types: convolutional neural networks, which are mainly used for image recognition, and recurrent neural networks, which are mainly used in the speech domain. Applying deep learning to the re-id task is currently the mainstream approach. To accomplish this task, during training of the deep learning network, picture sequences belonging to the same target need to be clustered together, while picture sequences belonging to different targets need to be separated by a certain distance. However, the existing triplet loss function only penalizes the relative distance. Specifically, to compute the triplet loss, a triplet is constructed consisting of an anchor (the target picture sequence), a positive sample picture sequence and a negative sample picture sequence. The positive sample picture sequence has the same category (id) as the anchor, and the negative sample picture sequence has a different category (id) from the anchor. Assuming the distance between the positive sample picture sequence and the anchor is d1 and the distance between the negative sample picture sequence and the anchor is d2, training of the deep learning network is considered complete as long as d2 is greater than d1 by a given threshold. This, however, cannot guarantee that the sample picture sequences of the same category are compact enough; the error between same-category sample picture sequences may be too large and the clustering effect cannot be guaranteed, so the detection result obtained when a deep learning network trained by the existing method is used to detect picture sequences is also inaccurate. The present application provides a video matching method, a video matching apparatus, a computer device and a storage medium that can solve these technical problems.
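As background context only, here is a minimal Python/NumPy sketch of the conventional triplet loss discussed above; the function and variable names are illustrative and are not taken from the patent.

```python
import numpy as np

def conventional_triplet_loss(anchor, positive, negative, margin):
    """Conventional triplet loss: only the relative distance is penalized, so
    same-class picture sequences are not forced to cluster tightly."""
    d1 = np.linalg.norm(anchor - positive)   # distance from anchor to the positive sequence feature
    d2 = np.linalg.norm(anchor - negative)   # distance from anchor to the negative sequence feature
    return max(d1 + margin - d2, 0.0)        # zero loss once d2 exceeds d1 by the margin
```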
The video matching method provided by the application can be applied to computer equipment, and the computer equipment can be a terminal or a server. Taking a computer device as an example, the internal structure diagram thereof can be as shown in fig. 1. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video matching method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
The execution subject of the embodiment of the present application may be a computer device, or may be a video matching apparatus, and the following description will be given taking the computer device as the execution subject.
In an embodiment, a video matching method is provided, and the embodiment relates to a specific process of how to detect a video source to be detected by using a trained neural network model to obtain a target video sequence. As shown in fig. 2, the method may include the steps of:
S202, acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected.
The object to be detected may be a human, an animal, a plant, an object, or the like, and the video source is accordingly obtained by capturing images of that human, animal, plant, object, etc. The number of video frames in the video source may be determined according to the actual situation, for example 10 video frames, 20 video frames, and so on. The video frames in the video source are temporally continuous; for example, the second video frame in the video source is captured after the first video frame, and the two are frame images captured at consecutive time instants. The video source may also be referred to as a picture sequence, and each video frame may also be referred to as a picture in the picture sequence. In addition, each video frame includes the object to be detected, although the posture and the like of the object may be the same or different across frames. This embodiment mainly concerns the case where the object to be detected is a pedestrian and the video source is a pedestrian picture sequence.
Specifically, the computer device may acquire images of the object to be detected by means of an image acquisition device connected to it (e.g., a video camera, a snapshot camera, a camera, etc.), so as to obtain a video source of the object to be detected, and the video source is transmitted to the computer device. The computer device may also read the video source from a database in which video sources of objects to be detected are stored in advance, read it from the cloud, or obtain it in other ways; this embodiment does not specifically limit the way the video source is obtained. In this embodiment, the category of the object to be detected in the acquired video source is unknown; taking a pedestrian as an example, the category may be name, gender, age, occupation, and the like.
S204, inputting the video source into the neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source; the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects included in the video frames of the positive sample video source and the object to be detected are objects of the same category, and the sample objects included in the video frames of the negative sample video source and the object to be detected are objects of different categories.
The neural network model may be a ResNet (residual network) model, a SENet (Squeeze-and-Excitation Networks) model, a GoogLeNet (Google Inception Net, a convolutional neural network) model, or another neural network; this embodiment mainly uses a ResNet50 network.
In addition, before the video source is input into the neural network model, each video frame in the video source may also be preprocessed, where the preprocessing may be normalization of the pixel values of each video frame, average pooling, or the like.
In addition, before the neural network model processes data, it can be trained. The training may use the value of a loss function to train the initial neural network model. The loss function here may also be referred to as a triplet loss function, but this triplet loss function includes three parts: the first part is the loss between the positive sample video source and the negative sample video source, the second part is the loss among the video frames of the positive sample video source, and the third part is the loss among the video frames of the negative sample video source. Since the positive sample video source and the negative sample video source include sample objects of different categories, the first part is an inter-class loss between different categories, while the second and third parts are intra-class losses within the same category. Optionally, the intra-class loss may include the intra-class loss between video frames belonging to the same category and the same video source, and may also include the intra-class loss between video frames belonging to the same category but different video sources. Training the neural network model with the combination of intra-class loss and inter-class loss makes the clustering effect of samples of the same category better, that is, samples of the same category are closer and more similar, while the distance between samples of different categories can be kept far enough, so that the categories of detection objects can be distinguished more easily later and a better matching result can be obtained.
Specifically, after each video frame in the video source is preprocessed, each preprocessed video frame may be input into a trained neural network model, feature extraction is performed on each video frame to obtain a feature map corresponding to each video frame, and then the feature map corresponding to each video frame is converted into a vector, which is recorded as a feature vector, so as to facilitate subsequent calculation.
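As a rough illustration of this step, the sketch below assumes a backbone callable `model` that maps a normalized frame to a feature map; the specific normalization, pooling and L2-normalization choices are assumptions for illustration, since the embodiment only mentions pixel-value normalization or average pooling as preprocessing options.

```python
import numpy as np

def extract_frame_features(frames, model):
    """Preprocess each frame, run it through the backbone, and flatten the
    resulting feature map into one feature vector per frame (sketch)."""
    vectors = []
    for frame in frames:
        x = frame.astype(np.float32) / 255.0       # simple pixel-value normalization
        fmap = model(x)                            # assumed to return a (C, H, W) feature map
        vec = fmap.mean(axis=(1, 2))               # global average pooling -> length-C vector
        vec = vec / (np.linalg.norm(vec) + 1e-8)   # optional L2 normalization
        vectors.append(vec)
    return np.stack(vectors)                       # (num_frames, C) array of feature vectors
```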
S206, determining a target video sequence matched with the video source in a video library according to the feature vector corresponding to each video frame in the video source and the preset video library; the target video sequence comprises at least two target video sources.
The preset video library may include a large number of corresponding relations between the video sources and the feature vectors of the video frames, and may be pre-established before video matching, and the establishment method may include: the method comprises the steps of collecting a large number of video sources in advance, extracting and obtaining a feature vector of each video frame in each video source to serve as a group of feature vectors of each video source, and binding the video sources and a group of corresponding feature vectors to obtain a video library. In addition, the target video source refers to a video source with the highest matching degree with the video source in the video library, and the matching can be the matching between an object to be detected in the video source and an object in the video source of the video library, for example, if the object to be detected in the video source is a certain animal, a video source similar to the animal can be matched in the video library.
Specifically, after obtaining the feature vectors corresponding to the video frames in the video source, the computer device may treat the feature vectors of the video frames of the video source as a group of vectors to be matched, match this group against the group of feature vectors corresponding to each video source in the preset video library, obtain several groups of target feature vectors with a higher matching degree, obtain the target video sources corresponding to those groups of target feature vectors, and output a target video sequence composed of these target video sources to the user. When the group of feature vectors to be matched is matched with a certain group of feature vectors in the video library, the feature vectors of the two groups are matched against each other to obtain a matching score for each feature vector; all the matching scores are then averaged, and the average value is used as the matching score of the two groups. The matching score represents the matching degree: the higher the score, the higher the matching degree.
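A minimal sketch of this matching step follows; it scores every query-frame/gallery-frame pair with cosine similarity and averages the scores, which is one possible reading of the per-vector matching described above (the exact per-vector score is not fixed by the embodiment).

```python
import numpy as np

def group_match_score(query_vecs, gallery_vecs):
    """Average frame-level similarity between two groups of feature vectors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    return float((q @ g.T).mean())          # higher score = higher matching degree

def rank_library(query_vecs, library, top_k=5):
    """Return the top_k library video-source ids by descending match score."""
    scores = {vid: group_match_score(query_vecs, vecs) for vid, vecs in library.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, `rank_library(extract_frame_features(frames, model), library)` would return the ids of the target video sources that make up the target video sequence, under the assumptions of the sketches above.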
In the video matching method, each video frame of a video source can be input into a trained neural network model for feature extraction to obtain a feature vector corresponding to each video frame, and a target video sequence matched with the video source is then obtained from a preset video library according to these feature vectors. The neural network model is trained using the loss between the positive and negative sample video sources, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source. Because the loss between the positive and negative sample video sources is considered during training, the distance between video sources of different categories can be kept sufficiently large; and because the respective intra-class losses of the positive and negative sample video sources are also considered, the distance between video sources of the same category can be kept sufficiently small, that is, same-category video sources cluster better. The neural network model trained with these losses is therefore more accurate, and when it is used to detect video sources, the obtained detection result is more accurate, improving the accuracy of the detection result.
In another embodiment, another video matching method is provided, and the embodiment relates to a specific process of training a neural network model. On the basis of the above embodiment, as shown in fig. 3, the training process of the neural network model may include the following steps:
S302, inputting each video frame of the positive sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; and inputting each video frame of the negative sample video source into the initial neural network model to obtain the characteristic data corresponding to each video frame of the negative sample video source.
In this step, the positive sample video source may include one video source or a plurality of video sources, and the video frames of the positive sample video source all include sample objects of the same category. Similarly, the negative sample video source may include one video source or a plurality of video sources whose video frames all include sample objects of one category, but the sample objects in the negative sample video source differ from those in the positive sample video source. For example, taking pedestrian A and pedestrian B as sample objects, the person included in each video frame of the positive sample video source is the same pedestrian A, and the person included in each video frame of the negative sample video source may be the same pedestrian B; the two are different. In addition, for both the positive sample video source and the negative sample video source, the obtained characteristic data can be feature vectors, which is convenient for subsequent calculation.
S304, calculating first loss between the characteristic data corresponding to each video frame of the positive sample video source and the characteristic data corresponding to each video frame of the negative sample video source.
In this step, when the first loss is calculated, any one of the video frames in the positive sample video source or the negative sample video source may be used as an anchor, the distance between each other video frame and the anchor video frame is calculated (the distance may be a Euclidean distance or another distance metric), and the first loss is obtained from the calculated distances and a preset distance interval.
Illustratively, assume that the positive sample video source includes one video source whose video frames are (a1, a2), with extracted features [f(a1), f(a2)], and that the negative sample video source also includes one video source whose video frames are (b1, b2), with extracted features [f(b1), f(b2)]. Assume the anchor video frame is a1 in the positive sample video source and the preset distance interval is α. The distance between f(a1) and f(a2) and the distances between f(a1) and f(b1), f(b2) can then be calculated, the loss between the anchor video frame and each other video frame is obtained from the calculated distances, and two formulas are obtained, namely the following formulas (1) and (2), where d(·, ·) denotes the distance between two features:

loss1 = max(d(f(a1), f(a2)) + α - d(f(a1), f(b1)), 0)    (1)

loss2 = max(d(f(a1), f(a2)) + α - d(f(a1), f(b2)), 0)    (2)

In formula (1), if the distance between f(a1) and f(a2) plus α is greater than the distance between f(a1) and f(b1), i.e., the distance between the video frames of the positive sample video source is too large relative to the distance to the negative sample video source (the inter-class distance is too small), a loss occurs, and the difference between d(f(a1), f(a2)) + α and d(f(a1), f(b1)) is taken as the loss. If the distance between f(a1) and f(a2) plus α is not greater than the distance between f(a1) and f(b1), i.e., the video frames of the positive sample video source are closer to each other than to the negative sample video source (the positive sample is far enough from the negative sample), then there is considered to be no loss, e.g., the loss can be set to 0. The same method is adopted for formula (2) to determine its loss. Of course, the other video frames may also be used as anchors to obtain corresponding losses, and finally the final loss is obtained by averaging or summing all the obtained losses; this is denoted the first loss and is the inter-class loss between the positive sample video source and the negative sample video source.
S306, calculating second loss among the feature data corresponding to each video frame of the positive sample video source, and calculating third loss among the feature data corresponding to each video frame of the negative sample video source.
In this step, when calculating the loss between the video frames of the positive sample video source, any one of the video frames in the positive sample video source may be used as an anchor point, the distance between the feature data of the other video frames and the feature data of the anchor point video frame is calculated, and a second loss is obtained through a comparison result of the distance and a distance threshold; or taking any video frame in the positive sample video source as an anchor point, averaging the feature data of other video frames, calculating the distance between the feature data obtained by averaging and the feature data of the anchor point video frame, and obtaining a second loss through the comparison result of the distance and the distance threshold; the second loss may be obtained in other manners, which are not limited specifically, and the distance threshold may be determined according to actual situations.
Similarly, the negative sample video source may be calculated in the same manner as the positive sample video source to obtain a third loss.
The second loss is an intra-class loss of the positive sample video source, and may include an intra-class loss between video frames belonging to the same video source in the positive sample video source, and may also include an intra-class loss between video frames belonging to different video sources in the positive sample video source; similarly, the third loss is an intra-class loss of the negative sample video source, and may include an intra-class loss between video frames belonging to the same video source in the negative sample video source, and may also include an intra-class loss between video frames belonging to different video sources in the negative sample video source.
And S308, training the initial neural network model based on the first loss, the second loss and the third loss to obtain the neural network model.
Specifically, after the first loss, the second loss and the third loss are obtained, the sum or the average of the three losses may be computed and used as the value of the new triplet loss function. The neural network model is then trained and its parameters adjusted; when the value of the loss function becomes smaller than a preset threshold or is basically stable (i.e., no longer changes), it may be determined that the neural network model has been trained, otherwise training continues. When training is completed, the parameters of the neural network model may be fixed, to be used for feature extraction in the next step.
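A compact sketch of how the three losses might be combined and of the stopping rule described above follows; the window size used to decide that the loss is "basically stable" is an assumption.

```python
def combined_loss(first_loss, second_loss, third_loss, reduce="sum"):
    """Value of the new triplet loss function: sum or average of the three losses."""
    total = first_loss + second_loss + third_loss
    return total / 3.0 if reduce == "mean" else total

def training_finished(loss_history, threshold=1e-3, window=5):
    """Stop when the loss drops below a preset threshold or has stopped changing."""
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    return recent[-1] < threshold or (max(recent) - min(recent)) < 1e-6
```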
The video matching method provided by the embodiment can calculate the inter-class loss between the positive sample video source and the negative sample video source, calculate the intra-class loss of each of the positive sample video source and the negative sample video source, and train the neural network model through the calculated intra-class loss and the inter-class loss together to obtain the trained neural network model. In this embodiment, the neural network model is trained simultaneously by using the intra-class loss and the inter-class loss, the intra-class loss can reflect the clustering degree between similar video sources, and the inter-class loss can reflect the degree that different video sources can be distinguished, so that when the method of this embodiment is used for training the neural network model, the adopted loss is richer, the obtained neural network model is more accurate, and finally, the features extracted by using the neural network model are more accurate.
In another embodiment, another video matching method is provided, and this embodiment relates to a specific process of how to calculate a second loss corresponding to a positive sample video source if the positive sample video source includes a plurality of first video sources. On the basis of the above embodiment, as shown in fig. 4, the above S306 may include the following steps:
S402, calculating the loss among the characteristic data corresponding to the video frames in the same first video source to obtain a first intra-class loss corresponding to each video frame.
In this embodiment, the categories of the plurality of first video sources included in the positive sample video source are all the same, where the category of the video source refers to that the objects included in the video frames of the video source are the same kind of objects, for example, the video sources in different postures of the same person.
In this step, the intra-class loss between video frames of the same category and the same video source is mainly calculated, and optionally, the following steps A1-A3 may be adopted for the calculation, as follows:
step A1, calculating the relative distance between the feature data corresponding to any one video frame in the same first video source and the feature data corresponding to other video frames to obtain the first relative distance corresponding to each video frame.
Step a2, comparing the first relative distance corresponding to each video frame with a preset first distance threshold to obtain a comparison result of each video frame.
Step A3, according to the comparison result of each video frame, determining the first intra-class loss corresponding to each video frame.
In steps A1-A3, the relative distance between two video frames may be a Euclidean distance, an L2 norm, or the like, and the first distance threshold may take a value according to the actual situation. Optionally, if the comparison result of a video frame is that the first relative distance corresponding to the video frame is smaller than the first distance threshold, 0 is determined as the first intra-class loss corresponding to the video frame; and if the comparison result of a video frame is that the first relative distance corresponding to the video frame is not smaller than the first distance threshold, the first relative distance corresponding to the video frame is determined as the first intra-class loss corresponding to the video frame. For example, assuming that there are 3 video frames in a first video source, each video frame obtains 2 first relative distances; the 2 first relative distances of each video frame are compared with the first distance threshold respectively, so each video frame obtains 2 losses, and each video frame then averages or sums its 2 losses to obtain its own final loss, which is recorded as its corresponding first intra-class loss.
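Steps A1-A3 can be sketched as below, where `tau1` stands for the first distance threshold and the per-frame averaging follows the example above.

```python
import numpy as np

def first_intra_class_losses(frame_feats, tau1):
    """First intra-class loss per frame, over the frames of one first video source (sketch)."""
    n = len(frame_feats)
    per_frame = []
    for i in range(n):
        terms = []
        for j in range(n):
            if i == j:
                continue
            d = np.linalg.norm(frame_feats[i] - frame_feats[j])  # first relative distance
            terms.append(0.0 if d < tau1 else d)                 # below the threshold -> no loss
        per_frame.append(float(np.mean(terms)) if terms else 0.0)
    return per_frame
```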
S404, calculating the loss among the characteristic data corresponding to the video frames in different first video sources to obtain a second intra-class loss corresponding to each video frame.
In this step, intra-class loss between video frames of the same class but different video sources is mainly calculated, and optionally, the following steps B1-B3 may be adopted for calculation, as follows:
step B1, calculating a relative distance between the feature data corresponding to each video frame in any one of the first video sources and the feature data corresponding to each video frame in the other first video sources, to obtain a second relative distance corresponding to each video frame.
And step B2, comparing the second relative distance corresponding to each video frame with a preset second distance threshold value to obtain a comparison result of each video frame.
And step B3, determining the second intra-class loss corresponding to each video frame according to the comparison result of each video frame.
In steps B1-B3, the relative distance between two video frames may be a Euclidean distance, an L2 norm, or the like. The second distance threshold may be a value determined according to the actual situation and may be the same as or different from the first distance threshold. In this embodiment the two thresholds are different, and optionally the first distance threshold is smaller than the second distance threshold; setting the first distance threshold smaller than the second distance threshold better ensures the clustering effect among video frames belonging to the same video source, while still ensuring, as much as possible, the clustering effect among different video sources of the same category.
Here, the second relative distance may be calculated for the video frames that belong to the same category but to different video sources, and the obtained second relative distance is compared with the second distance threshold to obtain a comparison result for each video frame. Optionally, if the comparison result of a video frame is that the second relative distance corresponding to the video frame is smaller than the second distance threshold, 0 is determined as the second intra-class loss corresponding to the video frame; and if the comparison result of a video frame is that the second relative distance corresponding to the video frame is not smaller than the second distance threshold, the second relative distance corresponding to the video frame is determined as the second intra-class loss corresponding to the video frame. For example, assume that the positive sample video source has 2 first video sources, each including 3 video frames, denoted (a1, a2, a3) and (b1, b2, b3). The relative distances between a1 and b1, b2, b3 can be calculated, so each video frame obtains 3 losses; each video frame then averages or sums its 3 losses to obtain its final loss, which is recorded as its corresponding second intra-class loss.
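Steps B1-B3 can be sketched in the same style; `sources` holds one feature array per first video source of the same category, and `tau2` stands for the second distance threshold.

```python
import numpy as np

def second_intra_class_losses(sources, tau2):
    """Second intra-class loss per frame, over frames of different first video
    sources that share the same category (sketch)."""
    per_frame = []
    for s, feats in enumerate(sources):
        for f in feats:
            terms = []
            for t, other in enumerate(sources):
                if t == s:
                    continue
                for g in other:
                    d = np.linalg.norm(f - g)              # second relative distance
                    terms.append(0.0 if d < tau2 else d)   # below the threshold -> no loss
            per_frame.append(float(np.mean(terms)) if terms else 0.0)
    return per_frame
```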
S406, obtaining the second loss according to the first intra-class loss corresponding to each video frame and the second intra-class loss corresponding to each video frame.
Specifically, after the first intra-class loss and the second intra-class loss corresponding to each video frame are obtained, the first intra-class losses and second intra-class losses of all the video frames may be summed or averaged to obtain a final loss, which is recorded as the second loss.
It should be noted that the above S402-S406 describe how the second loss is obtained when the positive sample video source includes a plurality of first video sources. Correspondingly, the negative sample video source may also include a plurality of second video sources, where the number of second video sources may be the same as or different from the number of first video sources; the third loss of the negative sample video source is then calculated in the same manner as in S402-S406.
According to the video matching method provided by this embodiment, if the positive sample video source comprises a plurality of first video sources, the intra-class loss between video frames of the same category and the same video source can be calculated, and the intra-class loss between video frames of the same category but different video sources can also be calculated; the two together serve as the intra-class loss of the positive sample video source. In this way the intra-class loss of the positive sample video source is computed completely, that is, it is richer, so that when the neural network model is subsequently trained with these intra-class losses, the clustering effect of the sample video sources can be ensured and the trained neural network model is more accurate.
In another embodiment, in order to facilitate a more detailed description of the technical solution of the present application, the following description is given in conjunction with a more detailed embodiment, and the method may include the following steps S1-S13:
S1, obtaining a sample video source, including a positive sample video source and a negative sample video source, where the positive sample video source includes a plurality of first video sources, the negative sample video source includes a plurality of second video sources, and each video frame of the sample video sources includes a detection object.
S2, inputting each video frame of the positive sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; and inputting each video frame of the negative sample video source into the initial neural network model to obtain the characteristic data corresponding to each video frame of the negative sample video source.
S3, a first loss between feature data corresponding to each video frame of the positive sample video source and feature data corresponding to each video frame of the negative sample video source is calculated.
S4, calculating the relative distance between the feature data corresponding to any one video frame in the same first video source and the feature data corresponding to the other video frames to obtain a first relative distance corresponding to each video frame; comparing the first relative distance corresponding to each video frame with a preset first distance threshold to obtain a comparison result for each video frame; if the comparison result of a video frame is that its first relative distance is smaller than the first distance threshold, determining 0 as the first intra-class loss corresponding to the video frame; and if the comparison result of a video frame is that its first relative distance is not smaller than the first distance threshold, determining the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame.
S5, calculating the relative distance between the feature data corresponding to each video frame in any one first video source and the feature data corresponding to each video frame in the other first video sources to obtain a second relative distance corresponding to each video frame; comparing the second relative distance corresponding to each video frame with a preset second distance threshold to obtain a comparison result for each video frame; if the comparison result of a video frame is that its second relative distance is smaller than the second distance threshold, determining 0 as the second intra-class loss corresponding to the video frame; and if the comparison result of a video frame is that its second relative distance is not smaller than the second distance threshold, determining the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame.
S6, obtaining the second loss according to the first intra-class loss corresponding to each video frame in S4 and the second intra-class loss corresponding to each video frame in S5.
S7, calculating the relative distance between the feature data corresponding to any one video frame in the same second video source and the feature data corresponding to the other video frames to obtain a first relative distance corresponding to each video frame; comparing the first relative distance corresponding to each video frame with the preset first distance threshold to obtain a comparison result for each video frame; if the comparison result of a video frame is that its first relative distance is smaller than the first distance threshold, determining 0 as the first intra-class loss corresponding to the video frame; and if the comparison result of a video frame is that its first relative distance is not smaller than the first distance threshold, determining the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame.
S8, calculating the relative distance between the feature data corresponding to each video frame in any one second video source and the feature data corresponding to each video frame in the other second video sources to obtain a second relative distance corresponding to each video frame; comparing the second relative distance corresponding to each video frame with the preset second distance threshold to obtain a comparison result for each video frame; if the comparison result of a video frame is that its second relative distance is smaller than the second distance threshold, determining 0 as the second intra-class loss corresponding to the video frame; and if the comparison result of a video frame is that its second relative distance is not smaller than the second distance threshold, determining the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame.
S9, obtaining the third loss according to the first intra-class loss corresponding to each video frame in S7 and the second intra-class loss corresponding to each video frame in S8.
And S10, training the initial neural network model based on the first loss, the second loss and the third loss to obtain a trained neural network model.
S11, acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected.
And S12, inputting the video source into the trained neural network model for feature extraction, and obtaining feature vectors corresponding to each video frame in the video source.
And S13, determining a target video sequence matched with the video source in the video library according to the feature vector corresponding to each video frame in the video source and a preset video library.
It should be understood that although the various steps in the flow charts of figs. 2-4 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, there is no strict order limitation on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2-4 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a video matching apparatus including: an obtaining module 10, an extracting module 11 and a determining module 12, wherein:
an obtaining module 10, configured to obtain a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
the extraction module 11 is configured to input the video source into the neural network model to perform feature extraction, so as to obtain a feature vector corresponding to each video frame in the video source; the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss among the video frames of the positive sample video source, and the loss among the video frames of the negative sample video source; the sample objects included in the video frames of the positive sample video source and the object to be detected are objects of the same category, and the sample objects included in the video frames of the negative sample video source and the object to be detected are objects of different categories;
the determining module 12 is configured to determine, according to a feature vector corresponding to each video frame in a video source and a preset video library, a target video sequence matched with the video source in the video library; the target video sequence comprises at least two target video sources.
For the specific definition of the video matching apparatus, reference may be made to the above definition of the video matching method, which is not described herein again.
In another embodiment, another video matching apparatus is provided. On the basis of the above embodiment, the apparatus further includes a model training module, where the model training module includes: an extraction unit, a first calculating unit, a second calculating unit and a training unit, wherein:
the extraction unit is used for inputting each video frame of the positive sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; inputting each video frame of the negative sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the negative sample video source;
the first calculating unit is used for calculating first loss between the characteristic data corresponding to each video frame of the positive sample video source and the characteristic data corresponding to each video frame of the negative sample video source;
the second calculating unit is used for calculating second loss among the characteristic data corresponding to each video frame of the positive sample video source and calculating third loss among the characteristic data corresponding to each video frame of the negative sample video source;
and the training unit is used for training the initial neural network model based on the first loss, the second loss and the third loss to obtain the neural network model.
In another embodiment, another video matching apparatus is provided on the basis of the above embodiment. If the positive sample video source includes a plurality of first video sources, the second calculating unit may further include: a first calculating subunit, a second calculating subunit and a determining subunit, wherein:
the first calculating subunit is used for calculating the loss between the characteristic data corresponding to each video frame in the same first video source to obtain the first-class internal loss corresponding to each video frame;
the second calculating subunit is used for calculating the loss between the characteristic data corresponding to each video frame in different first video sources to obtain a second-class internal loss corresponding to each video frame;
and the determining subunit is used for obtaining a second loss according to the first-class internal loss corresponding to each video frame and the second-class internal loss corresponding to each video frame.
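How the determining subunit aggregates the per-frame first-class and second-class internal losses into the second loss is not specified; simple averaging over all per-frame losses is one possibility and is what the sketch below assumes.

```python
def second_loss_from_intra_class_losses(first_class_losses, second_class_losses):
    """Aggregate per-frame intra-class losses into the second loss (sketch).

    first_class_losses:  per-frame losses computed within the same first video source
    second_class_losses: per-frame losses computed across different first video sources

    Averaging the two groups is an assumption; the description only states that the
    second loss is obtained from both sets of per-frame losses.
    """
    losses = list(first_class_losses) + list(second_class_losses)
    return sum(losses) / len(losses) if losses else 0.0
```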
Optionally, the first calculating subunit is further configured to calculate a relative distance between feature data corresponding to any one video frame belonging to the same first video source and feature data corresponding to other video frames, so as to obtain a first relative distance corresponding to each video frame; comparing the first relative distance corresponding to each video frame with a preset first distance threshold value to obtain a comparison result of each video frame; and determining the first-class internal loss corresponding to each video frame according to the comparison result of each video frame.
Optionally, the first calculating subunit is further configured to determine 0 as the first intra-class loss corresponding to the video frame if the comparison result of one video frame indicates that the first relative distance corresponding to the video frame is smaller than the first distance threshold; and to determine the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame if the comparison result indicates that the first relative distance corresponding to the video frame is greater than or equal to the first distance threshold.
Optionally, the second calculating subunit is further configured to calculate a relative distance between feature data corresponding to each video frame in any one of the first video sources and feature data corresponding to each video frame in other first video sources, so as to obtain a second relative distance corresponding to each video frame; comparing the second relative distance corresponding to each video frame with a preset second distance threshold value to obtain a comparison result of each video frame; and determining the second-class internal loss corresponding to each video frame according to the comparison result of each video frame.
Optionally, the second calculating subunit is further configured to determine 0 as the second intra-class loss corresponding to the video frame if the comparison result of one video frame indicates that the second relative distance corresponding to the video frame is smaller than the second distance threshold; and to determine the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame if the comparison result indicates that the second relative distance corresponding to the video frame is greater than or equal to the second distance threshold.
Optionally, the first distance threshold is smaller than the second distance threshold.
For the specific definition of the video matching apparatus, reference may be made to the above definition of the video matching method, which is not described herein again.
The various modules in the video matching apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
inputting a video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source;
determining, according to the feature vector corresponding to each video frame in the video source and a preset video library, a target video sequence matched with the video source in the video library; the target video sequence comprises at least two target video sources;
the neural network model is obtained by training according to the loss between the positive sample video source and the negative sample video source, the loss between each video frame of the positive sample video source and the loss between each video frame of the negative sample video source, sample objects and objects to be detected which are included in each video frame of the positive sample video source are objects of the same category, and sample objects and objects to be detected which are included in each video frame of the negative sample video source are objects of different categories.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
inputting each video frame of the positive sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; inputting each video frame of the negative sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the negative sample video source; calculating first loss between the characteristic data corresponding to each video frame of the positive sample video source and the characteristic data corresponding to each video frame of the negative sample video source; calculating second loss among the characteristic data corresponding to each video frame of the positive sample video source, and calculating third loss among the characteristic data corresponding to each video frame of the negative sample video source; and training the initial neural network model based on the first loss, the second loss and the third loss to obtain the neural network model.
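The first loss is only characterized as a loss between the positive-sample and negative-sample frame features. A margin-based separation term is a common choice for such an inter-class loss; the margin value and the mean-pairwise-distance formulation in the sketch below are assumptions, not details taken from this description.

```python
import numpy as np

def first_loss_between_samples(pos_feats, neg_feats, margin=1.0):
    """Illustrative inter-class loss between positive and negative frame features.

    pos_feats: feature data of the positive sample frames, shape (m, dim)
    neg_feats: feature data of the negative sample frames, shape (n, dim)
    margin:    assumed separation margin (not specified in this description)
    """
    pos = np.asarray(pos_feats, dtype=float)
    neg = np.asarray(neg_feats, dtype=float)
    # mean pairwise Euclidean distance between the two sets of frame features
    pairwise = np.linalg.norm(pos[:, None, :] - neg[None, :, :], axis=2)
    mean_distance = pairwise.mean()
    # penalize only when the two classes are closer than the margin
    return float(max(0.0, margin - mean_distance))
```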
In one embodiment, the processor, when executing the computer program, further performs the steps of:
calculating loss among characteristic data corresponding to all video frames in the same first video source to obtain first-class internal loss corresponding to each video frame; calculating loss among the characteristic data corresponding to each video frame in different first video sources to obtain second-class internal loss corresponding to each video frame; and obtaining a second loss according to the first-class internal loss corresponding to each video frame and the second-class internal loss corresponding to each video frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
calculating the relative distance between the feature data corresponding to any one video frame in the same first video source and the feature data corresponding to other video frames to obtain a first relative distance corresponding to each video frame; comparing the first relative distance corresponding to each video frame with a preset first distance threshold value to obtain a comparison result of each video frame; and determining the first-class internal loss corresponding to each video frame according to the comparison result of each video frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
if the comparison result of one video frame is that the first relative distance corresponding to the video frame is smaller than the first distance threshold, determining 0 as the first intra-class loss corresponding to the video frame; and if the comparison result of one video frame is that the first relative distance corresponding to the video frame is greater than or equal to the first distance threshold, determining the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
calculating the relative distance between the characteristic data corresponding to each video frame in any one first video source and the characteristic data corresponding to each video frame in other first video sources to obtain a second relative distance corresponding to each video frame; comparing the second relative distance corresponding to each video frame with a preset second distance threshold value to obtain a comparison result of each video frame; and determining the second-class internal loss corresponding to each video frame according to the comparison result of each video frame.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
if the comparison result of one video frame is that the second relative distance corresponding to the video frame is smaller than the second distance threshold, determining 0 as the second intra-class loss corresponding to the video frame; and if the comparison result of one video frame is that the second relative distance corresponding to the video frame is greater than or equal to the second distance threshold, determining the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame.
In one embodiment, the first distance threshold is less than the second distance threshold.
In one embodiment, a readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
inputting a video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source;
determining, according to the feature vector corresponding to each video frame in the video source and a preset video library, a target video sequence matched with the video source in the video library; the target video sequence comprises at least two target video sources;
the neural network model is obtained by training according to the loss between the positive sample video source and the negative sample video source, the loss between each video frame of the positive sample video source and the loss between each video frame of the negative sample video source, sample objects and objects to be detected which are included in each video frame of the positive sample video source are objects of the same category, and sample objects and objects to be detected which are included in each video frame of the negative sample video source are objects of different categories.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting each video frame of the positive sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; inputting each video frame of the negative sample video source into the initial neural network model to obtain characteristic data corresponding to each video frame of the negative sample video source; calculating first loss between the characteristic data corresponding to each video frame of the positive sample video source and the characteristic data corresponding to each video frame of the negative sample video source; calculating second loss among the characteristic data corresponding to each video frame of the positive sample video source, and calculating third loss among the characteristic data corresponding to each video frame of the negative sample video source; and training the initial neural network model based on the first loss, the second loss and the third loss to obtain the neural network model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calculating loss among characteristic data corresponding to all video frames in the same first video source to obtain first-class internal loss corresponding to each video frame; calculating loss among the characteristic data corresponding to each video frame in different first video sources to obtain second-class internal loss corresponding to each video frame; and obtaining a second loss according to the first-class internal loss corresponding to each video frame and the second-class internal loss corresponding to each video frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calculating the relative distance between the feature data corresponding to any one video frame in the same first video source and the feature data corresponding to other video frames to obtain a first relative distance corresponding to each video frame; comparing the first relative distance corresponding to each video frame with a preset first distance threshold value to obtain a comparison result of each video frame; and determining the first-class internal loss corresponding to each video frame according to the comparison result of each video frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if the comparison result of one video frame is that the first relative distance corresponding to the video frame is smaller than the first distance threshold, determining 0 as the first intra-class loss corresponding to the video frame; and if the comparison result of one video frame is that the first relative distance corresponding to the video frame is greater than or equal to the first distance threshold, determining the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
calculating the relative distance between the characteristic data corresponding to each video frame in any one first video source and the characteristic data corresponding to each video frame in other first video sources to obtain a second relative distance corresponding to each video frame; comparing the second relative distance corresponding to each video frame with a preset second distance threshold value to obtain a comparison result of each video frame; and determining the second-class internal loss corresponding to each video frame according to the comparison result of each video frame.
In one embodiment, the computer program when executed by the processor further performs the steps of:
if the comparison result of one video frame is that the second relative distance corresponding to the video frame is smaller than the second distance threshold, determining 0 as the second intra-class loss corresponding to the video frame; and if the comparison result of one video frame is that the second relative distance corresponding to the video frame is greater than or equal to the second distance threshold, determining the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame.
In one embodiment, the first distance threshold is less than the second distance threshold.
It will be understood by those of ordinary skill in the art that all or a portion of the processes of the methods of the embodiments described above may be implemented by a computer program. The computer program may be stored on a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any such combination should be considered to fall within the scope of this specification as long as it contains no contradiction.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A method for video matching, the method comprising:
acquiring a video source; the video source comprises at least one video frame, and each video frame comprises an object to be detected;
inputting the video source into a neural network model for feature extraction to obtain a feature vector corresponding to each video frame in the video source;
determining, according to the feature vector corresponding to each video frame in the video source and a preset video library, a target video sequence matched with the video source in the video library, wherein the target video sequence comprises at least two target video sources;
the neural network model is obtained by training according to the loss between a positive sample video source and a negative sample video source, the loss between each video frame of the positive sample video source and the loss between each video frame of the negative sample video source, sample objects included in each video frame of the positive sample video source and objects to be detected are objects of the same category, and sample objects included in each video frame of the negative sample video source and the objects to be detected are objects of different categories.
2. The method of claim 1, wherein the training method of the neural network model comprises:
inputting each video frame of the positive sample video source into an initial neural network model to obtain characteristic data corresponding to each video frame of the positive sample video source; inputting each video frame of the negative sample video source to an initial neural network model to obtain characteristic data corresponding to each video frame of the negative sample video source;
calculating a first loss between the feature data corresponding to each video frame of the positive sample video source and the feature data corresponding to each video frame of the negative sample video source;
calculating second loss among the characteristic data corresponding to each video frame of the positive sample video source, and calculating third loss among the characteristic data corresponding to each video frame of the negative sample video source;
and training the initial neural network model based on the first loss, the second loss and the third loss to obtain the neural network model.
3. The method of claim 2, wherein if the positive sample video source comprises a plurality of first video sources, the calculating a second loss between feature data corresponding to each video frame of the positive sample video source comprises:
calculating a loss between the feature data corresponding to the video frames belonging to the same first video source to obtain a first intra-class loss corresponding to each video frame;
calculating a loss between the feature data corresponding to the video frames belonging to different first video sources to obtain a second intra-class loss corresponding to each video frame;
and obtaining the second loss according to the first intra-class loss corresponding to each video frame and the second intra-class loss corresponding to each video frame.
4. The method according to claim 3, wherein the calculating the loss between the feature data corresponding to the video frames belonging to the same first video source to obtain the first intra-class loss corresponding to each video frame comprises:
calculating the relative distance between the feature data corresponding to any one video frame in the same first video source and the feature data corresponding to other video frames to obtain a first relative distance corresponding to each video frame;
comparing the first relative distance corresponding to each video frame with a preset first distance threshold value to obtain a comparison result of each video frame;
and determining the first intra-class loss corresponding to each video frame according to the comparison result of each video frame.
5. The method according to claim 4, wherein said determining the first intra-class loss corresponding to each video frame according to the comparison result of each video frame comprises:
if the comparison result of one video frame is that the first relative distance corresponding to the video frame is smaller than the first distance threshold, determining 0 as the first intra-class loss corresponding to the video frame;
and if the comparison result of one video frame is that the first relative distance corresponding to the video frame is greater than or equal to the first distance threshold, determining the first relative distance corresponding to the video frame as the first intra-class loss corresponding to the video frame.
6. The method according to any one of claims 4 to 5, wherein the calculating the loss between the feature data corresponding to the video frames belonging to different first video sources to obtain the second intra-class loss corresponding to each video frame comprises:
calculating the relative distance between the characteristic data corresponding to each video frame in any one first video source and the characteristic data corresponding to each video frame in other first video sources to obtain a second relative distance corresponding to each video frame;
comparing the second relative distance corresponding to each video frame with a preset second distance threshold value to obtain a comparison result of each video frame;
and determining the second intra-class loss corresponding to each video frame according to the comparison result of each video frame.
7. The method according to claim 6, wherein said determining the second intra-class loss corresponding to each video frame according to the comparison result of each video frame comprises:
if the comparison result of one video frame is that the second relative distance corresponding to the video frame is smaller than the second distance threshold, determining 0 as the second intra-class loss corresponding to the video frame;
and if the comparison result of one video frame is that the second relative distance corresponding to the video frame is greater than or equal to the second distance threshold, determining the second relative distance corresponding to the video frame as the second intra-class loss corresponding to the video frame.
8. The method of claim 7, wherein the first distance threshold is less than the second distance threshold.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
10. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010321932.1A CN111507289A (en) | 2020-04-22 | 2020-04-22 | Video matching method, computer device and storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111507289A true CN111507289A (en) | 2020-08-07 |
Family
ID=71876626
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010321932.1A Pending CN111507289A (en) | 2020-04-22 | 2020-04-22 | Video matching method, computer device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111507289A (en) |
Patent Citations (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150248586A1 (en) * | 2014-03-03 | 2015-09-03 | Xerox Corporation | Self-learning object detectors for unlabeled videos using multi-task learning |
| WO2018137358A1 (en) * | 2017-01-24 | 2018-08-02 | 北京大学 | Deep metric learning-based accurate target retrieval method |
| WO2019128367A1 (en) * | 2017-12-26 | 2019-07-04 | 广州广电运通金融电子股份有限公司 | Face verification method and apparatus based on triplet loss, and computer device and storage medium |
| WO2020006961A1 (en) * | 2018-07-03 | 2020-01-09 | 北京字节跳动网络技术有限公司 | Image extraction method and device |
| CN110858276A (en) * | 2018-08-22 | 2020-03-03 | 北京航天长峰科技工业集团有限公司 | Pedestrian re-identification method combining identification model and verification model |
| CN109558821A (en) * | 2018-11-21 | 2019-04-02 | 哈尔滨工业大学(深圳) | The clothes article number calculating method of particular persons in a kind of video |
| CN109961032A (en) * | 2019-03-18 | 2019-07-02 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating disaggregated model |
| CN110163079A (en) * | 2019-03-25 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video detecting method and device, computer-readable medium and electronic equipment |
| CN110070067A (en) * | 2019-04-29 | 2019-07-30 | 北京金山云网络技术有限公司 | The training method of video classification methods and its model, device and electronic equipment |
| CN111008576A (en) * | 2019-11-22 | 2020-04-14 | 高创安邦(北京)技术有限公司 | Pedestrian detection and model training and updating method, device and readable storage medium thereof |
Non-Patent Citations (2)
| Title |
|---|
| 王兴柱; 王儒敬: "Application of Triplets in Unsupervised Person Re-identification" * |
| 闵召阳; 赵文杰: "Single-Camera Multi-Target Tracking Algorithm Based on Convolutional Neural Network Detection" * |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114283350A (en) * | 2021-09-17 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Visual model training and video processing method, device, equipment and storage medium |
| CN114283350B (en) * | 2021-09-17 | 2024-06-07 | 腾讯科技(深圳)有限公司 | Visual model training and video processing method, device, equipment and storage medium |
| CN114037925A (en) * | 2021-09-27 | 2022-02-11 | 北京百度网讯科技有限公司 | Training and detecting method and device of target detection model and electronic equipment |
| CN115880608A (en) * | 2022-12-01 | 2023-03-31 | 阿里巴巴(中国)有限公司 | Video processing, video analysis model training, video analysis method and equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN108009528B (en) | Triple Loss-based face authentication method and device, computer equipment and storage medium | |
| CN114359787B (en) | Target attribute recognition method, device, computer equipment and storage medium | |
| CN110443110B (en) | Face recognition method, device, terminal and storage medium based on multipath camera shooting | |
| CN109344742B (en) | Feature point positioning method and device, storage medium and computer equipment | |
| CN109919977B (en) | Video motion person tracking and identity recognition method based on time characteristics | |
| WO2022111506A1 (en) | Video action recognition method and apparatus, electronic device and storage medium | |
| CN111667001B (en) | Target re-identification method, device, computer equipment and storage medium | |
| CN109101602A (en) | Image encrypting algorithm training method, image search method, equipment and storage medium | |
| CN112434556A (en) | Pet nose print recognition method and device, computer equipment and storage medium | |
| WO2018228218A1 (en) | Identification method, computing device, and storage medium | |
| CN112364827A (en) | Face recognition method and device, computer equipment and storage medium | |
| CN112966574A (en) | Human body three-dimensional key point prediction method and device and electronic equipment | |
| US20210334604A1 (en) | Facial recognition method and apparatus | |
| CN108509994B (en) | Method and device for clustering character images | |
| WO2023173646A1 (en) | Expression recognition method and apparatus | |
| WO2020244174A1 (en) | Face recognition method, apparatus and device, and computer readable storage medium | |
| CN113743160A (en) | Method, apparatus and storage medium for biopsy | |
| CN111666922A (en) | Video matching method and device, computer equipment and storage medium | |
| CN113557546B (en) | Method, device, equipment and storage medium for detecting associated objects in images | |
| CN110866469A (en) | A method, device, equipment and medium for facial feature recognition | |
| CN111177469A (en) | Face retrieval method and face retrieval device | |
| CN111507289A (en) | Video matching method, computer device and storage medium | |
| CN111507138A (en) | Image recognition method, device, computer equipment and storage medium | |
| WO2023109551A1 (en) | Living body detection method and apparatus, and computer device | |
| CN112241705A (en) | Target detection model training method and target detection method based on classification regression |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | AD01 | Patent right deemed abandoned | Effective date of abandoning: 20240927 |