CN116416562A

CN116416562A - Domain adaptive video classification method, device, device, medium and product

Info

Publication number: CN116416562A
Application number: CN202310431413.4A
Authority: CN
Inventors: 饶竹一; 高圣溥
Original assignee: Shenzhen Power Supply Bureau Co Ltd
Current assignee: Shenzhen Power Supply Bureau Co Ltd
Priority date: 2023-04-17
Filing date: 2023-04-17
Publication date: 2023-07-11

Abstract

The application relates to a field-adaptive video classification method, device, equipment, medium and product. The method comprises the following steps: acquiring video input samples of a source domain and at least one target domain, and classifying based on the characteristics of the video input samples of the source domain to obtain an initial video classification model; constructing at least two private networks, wherein the private networks are used for respectively acquiring semantic irrelevant information characteristics of video input samples in each field; acquiring feature data of a video input sample of a source domain and at least one target domain extracted by an initial video classification model, acquiring feature distribution distances of extracted features of the video classification model and each private network, performing maximum processing on the feature distribution distances, and calculating maximum mean value difference to obtain public semantic information features; and (3) iteratively training the initial video classification model and each private network, and obtaining a target video classification model universal to the field when the iteration stop condition is met, and carrying out video classification according to the target video classification model, thereby being beneficial to improving the field adaptability and accuracy.

Description

Domain-adaptive video classification method, device, equipment, medium and product

技术领域Technical Field

本申请涉及计算机技术领域，特别是涉及一种领域自适应的视频分类方法、装置、设备、介质和产品。The present application relates to the field of computer technology, and in particular to a domain-adaptive video classification method, device, equipment, medium and product.

背景技术Background Art

近年来，无监督域自适应引起了大量的研究关注，其目的是通过学习一个领域无关的特征表示，使得在有标注的源域数据集上训练的模型，在无标注且分布不同的目标域依然保持较好的表现。In recent years, unsupervised domain adaptation has attracted a lot of research attention. Its goal is to learn a domain-independent feature representation so that the model trained on a labeled source domain dataset can still maintain good performance in an unlabeled and differently distributed target domain.

在有标注的源域数据集上训练视频分类可以得到视频分类模型，目标域的视频与源域通常具有不同的特征分布且无标注，在训练得到的视频分类模型上无法得到良好的视频分类效果，因此在视频分类时需要对不同领域的视频完成迁移学习。A video classification model can be obtained by training video classification on a labeled source domain dataset. The target domain videos and the source domain videos usually have different feature distributions and are unlabeled. It is impossible to obtain good video classification results on the trained video classification model. Therefore, transfer learning of videos from different fields is required during video classification.

目前针对不同领域的视频分类方法，通常是通过领域自适应方法实现，在一个神经网络中基于对抗学习来学习领域无关的特征表示。对抗学习方法在特征提取网络中设置带有梯度反转层的领域判别器，领域判别器用于判断提取到特征的领域来源，特征提取器用于学习如何提取到更多公共语义信息来混淆领域判别器。At present, video classification methods for different fields are usually implemented through domain adaptation methods, which learn domain-independent feature representations based on adversarial learning in a neural network. The adversarial learning method sets a domain discriminator with a gradient reversal layer in the feature extraction network. The domain discriminator is used to determine the domain source of the extracted features, and the feature extractor is used to learn how to extract more common semantic information to confuse the domain discriminator.

然而，对于包含大部分与语义无关的干扰信息的视频数据而言，学习两个域之间公共的语义信息十分困难，采用传统的对抗学习匹配样本级特征分布的领域自适应效果差。However, for video data that contains most of the interference information that is irrelevant to semantics, it is very difficult to learn the common semantic information between the two domains, and the domain adaptation effect of matching sample-level feature distribution using traditional adversarial learning is poor.

发明内容Summary of the invention

基于此，有必要针对上述技术问题，提供一种能够实现领域自适应的视频分类方法、装置、计算机设备、计算机可读存储介质和计算机程序产品。Based on this, it is necessary to provide a video classification method, device, computer equipment, computer-readable storage medium and computer program product that can achieve domain adaptation in response to the above technical problems.

第一方面，本申请提供了一种领域自适应的视频分类方法。所述方法包括：In a first aspect, the present application provides a domain-adaptive video classification method. The method comprises:

获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型；Obtain video input samples of a source domain and at least one target domain, and classify the video input samples of the source domain based on features to obtain an initial video classification model;

构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征；Construct at least two private networks, where the private networks are used to obtain semantically irrelevant information features of video input samples in various fields respectively;

获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征；Acquire feature data of video input samples of a source domain and at least one target domain extracted by an initial video classification model, obtain feature distribution distances of features extracted by the video classification model and each private network, maximize the feature distribution distances and calculate the maximum mean difference to obtain common semantic information features;

迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型进行视频分类。The initial video classification model and each private network are iteratively trained. When the iteration stopping condition is met, a domain-wide target video classification model is obtained, and video classification is performed according to the target video classification model.

在其中一个实施例中，构建至少两个私有网络，包括：In one embodiment, at least two private networks are constructed, including:

获取视频输入样本的背景数据，背景数据作为监督信号用于私有网络的重构训练；Obtain background data of video input samples, and use the background data as a supervisory signal for reconstruction training of the private network;

通过私有网络进行各领域视频输入样本的重构训练，得到重构背景数据；Reconstruction training of video input samples in various fields is performed through a private network to obtain reconstructed background data;

获取背景数据与重构背景数据之间的重构损失；Obtaining the reconstruction loss between the background data and the reconstructed background data;

最小化重构损失，得到语义无关信息特征。Minimize the reconstruction loss and obtain semantically irrelevant information features.

在其中一个实施例中，私有网络包括视频特征提取器和重构网络，通过私有网络进行各领域视频输入样本的重构训练，得到重构背景数据，包括：In one embodiment, the private network includes a video feature extractor and a reconstruction network, and the private network is used to perform reconstruction training of video input samples in various fields to obtain reconstructed background data, including:

基于视频特征提取器得到各领域视频输入样本的背景特征；Based on the video feature extractor, background features of video input samples in various fields are obtained;

基于重构网络对背景特征重构得到重构背景数据；Reconstructing the background features based on the reconstruction network to obtain reconstructed background data;

获取背景数据与重构背景数据之间的重构损失，包括：Obtain the reconstruction loss between the background data and the reconstructed background data, including:

获取背景数据与重构背景数据之间的距离；Obtaining the distance between background data and reconstructed background data;

通过基于距离度量的损失函数和距离，计算重构损失。The reconstruction loss is calculated using a loss function based on a distance metric and the distance.

在其中一个实施例中，初始视频分类模型包括特征提取器、域判别器和分类器，基于源域视频输入样本的特征进行分类得到初始视频分类模型，包括；In one embodiment, the initial video classification model includes a feature extractor, a domain discriminator and a classifier, and the initial video classification model is obtained by classifying based on the features of the source domain video input sample, including:

通过特征提取器获取源域视频输入样本的特征；Obtain features of source domain video input samples through a feature extractor;

通过分类器对特征进行分类；Classify features through classifiers;

获取分类损失，分类损失用于迭代训练初始视频分类模型和各私有网络；Obtain classification loss, which is used to iteratively train the initial video classification model and each private network;

获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，包括：Acquiring feature data of video input samples of a source domain and at least one target domain extracted by an initial video classification model, including:

通过特征提取器得到源域和至少一个目标域的视频输入样本的初始特征数据；Obtaining initial feature data of video input samples of a source domain and at least one target domain through a feature extractor;

根据域判别器对初始特征数据进行对抗训练，得到对抗训练后的目标特征数据；Perform adversarial training on the initial feature data according to the domain discriminator to obtain the target feature data after adversarial training;

根据分类器得到目标特征数据的视频分类；Video classification of target feature data obtained by the classifier;

获取域判别器的对抗训练损失，对抗训练损失用于迭代训练初始视频分类模型和各私有网络。Obtain the adversarial training loss of the domain discriminator, which is used to iteratively train the initial video classification model and each private network.

在其中一个实施例中，该方法还包括：In one embodiment, the method further comprises:

构建特征来源分类器，根据特征来源分类器确定输入特征的来源标识，其中，来源标识用于确定输入特征的来源是初始视频分类模型或私有网络；Constructing a feature source classifier, and determining a source identifier of an input feature according to the feature source classifier, wherein the source identifier is used to determine whether the source of the input feature is an initial video classification model or a private network;

获取特征来源分类器的来源分类损失，来源分类损失用于迭代训练初始视频分类模型和各私有网络。The source classification loss of the feature source classifier is obtained, and the source classification loss is used to iteratively train the initial video classification model and each private network.

在其中一个实施例中，迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，包括：In one embodiment, the initial video classification model and each private network are iteratively trained to obtain a domain-general target video classification model when an iteration stop condition is met, including:

基于损失函数获取训练损失，根据训练损失得到迭代停止条件；Obtain training loss based on the loss function, and obtain the iteration stop condition based on the training loss;

根据训练损失反向传播计算损失函数的梯度，更新损失函数；Calculate the gradient of the loss function based on the training loss backpropagation and update the loss function;

在训练损失稳定的情况下，满足迭代停止条件，得到领域通用的目标视频分类模型。When the training loss is stable and the iteration stopping condition is met, a domain-wide target video classification model is obtained.

第二方面，本申请还提供了一种能够实现领域自适应的视频分类装置。所述装置包括：In a second aspect, the present application also provides a video classification device capable of achieving domain adaptation. The device comprises:

视频分类模块，用于获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型；A video classification module, used to obtain video input samples of a source domain and at least one target domain, and classify the video input samples of the source domain based on their features to obtain an initial video classification model;

私有网络模块，用于构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征；A private network module, used to construct at least two private networks, where the private networks are used to obtain semantically irrelevant information features of video input samples in various fields respectively;

均值差异模块，用于获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征；A mean difference module is used to obtain feature data of video input samples of a source domain and at least one target domain extracted by an initial video classification model, obtain feature distribution distances of features extracted by the video classification model and each private network, maximize the feature distribution distances and calculate the maximum mean difference to obtain common semantic information features;

迭代训练模块，用于迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型进行视频分类。The iterative training module is used to iteratively train the initial video classification model and each private network. When the iteration stop condition is met, a domain-wide target video classification model is obtained, and video classification is performed according to the target video classification model.

第三方面，本申请还提供了一种计算机设备。计算机设备包括存储器和处理器，存储器存储有计算机程序，处理器执行计算机程序时实现以下步骤：In a third aspect, the present application further provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:

第四方面，本申请还提供了一种计算机可读存储介质。计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现以下步骤：In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the following steps are implemented:

第五方面，本申请还提供了一种计算机程序产品。计算机程序产品，包括计算机程序，该计算机程序被处理器执行时实现以下步骤：In a fifth aspect, the present application further provides a computer program product. The computer program product includes a computer program, and when the computer program is executed by a processor, the following steps are implemented:

上述的领域自适应的视频分类方法、装置、计算机设备、存储介质和计算机程序产品，通过获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型，构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征，获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征，迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型实现了领域自适应的视频分类，该方法通过构建私有网络并提取出语义无关信息特征，并获取语义无关信息特征与初始视频分类模型提取的特征数据之间的最大均值差异，对最大均值差异进行最大化处理后，也就是说，在实现语义无关信息特征与特征数据之间的特征差异最大以后，有利于实现视频分类模型在视频分类过程中获取到公共语义信息特征且忽略语义无关信息特征，有利于降低目标域视频中存在的语义无关信息特征导致目标域视频无法在初始视频分类模型中得到适应性视频分类的影响，提高了领域迁移中视频分类的领域适应性，对于包含大部分与语义无关的干扰信息的视频数据的分类，领域的自适应效果好，领域自适应的视频分类准确度高，具有较高的可靠性。The above-mentioned domain-adaptive video classification method, apparatus, computer equipment, storage medium and computer program product obtain video input samples from a source domain and at least one target domain, classify them based on the features of the source domain video input samples to obtain an initial video classification model, construct at least two private networks, the private networks are used to respectively obtain semantically irrelevant information features of video input samples from each domain, obtain feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model, obtain feature distribution distances between the video classification model and the features extracted by each private network, maximize the feature distribution distances and calculate the maximum mean difference to obtain common semantic information features, iteratively train the initial video classification model and each private network, and when the iteration stop condition is met, obtain a domain-universal target video classification model, and implement domain adaptation based on the target video classification model. Video classification, this method constructs a private network and extracts semantically irrelevant information features, and obtains the maximum mean difference between the semantically irrelevant information features and the feature data extracted by the initial video classification model. After maximizing the maximum mean difference, that is, after achieving the maximum feature difference between the semantically irrelevant information features and the feature data, it is beneficial for the video classification model to obtain public semantic information features and ignore semantically irrelevant information features during the video classification process, which is beneficial to reducing the impact of semantically irrelevant information features in the target domain video that causes the target domain video to be unable to obtain adaptive video classification in the initial video classification model, and improves the domain adaptability of video classification in domain migration. For the classification of video data containing most of the interference information irrelevant to semantics, the domain adaptation effect is good, the domain adaptive video classification accuracy is high, and it has high reliability.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为一个实施例中视频分类方法的应用环境图；FIG1 is a diagram of an application environment of a video classification method according to an embodiment;

图2为一个实施例中视频分类方法的流程示意图；FIG2 is a schematic diagram of a flow chart of a video classification method in one embodiment;

图3为一个实施例中视频分类方法的最大均值差异示意图；FIG3 is a schematic diagram of the maximum mean difference of a video classification method according to an embodiment;

图4为一个实施例中私有网络的结构示意图；FIG4 is a schematic diagram of the structure of a private network in one embodiment;

图5为一个实施例中初始视频分类模型的结构示意图；FIG5 is a schematic diagram of the structure of an initial video classification model in one embodiment;

图6为另一个实施例中视频分类方法的流程示意图；FIG6 is a schematic flow chart of a video classification method in another embodiment;

图7为一个实施例中视频分类装置的结构框图；FIG7 is a structural block diagram of a video classification device in one embodiment;

图8为另一个实施例中视频分类装置的结构框图；FIG8 is a structural block diagram of a video classification device in another embodiment;

图9为一个实施例中计算机设备为服务器的内部结构图；FIG9 is an internal structure diagram of a computer device as a server in one embodiment;

图10为一个实施例中计算机设备为终端的内部结构图。FIG. 10 is a diagram showing the internal structure of a computer device that is a terminal in one embodiment.

具体实施方式DETAILED DESCRIPTION

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处描述的具体实施例仅仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solution and advantages of the present application more clearly understood, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application and are not used to limit the present application.

本申请实施例提供的领域自适应的视频分类方法，可以应用于如图1所示的应用环境中。其中，终端102通过网络与服务器104进行通信。数据存储系统可以存储服务器104需要处理的数据。数据存储系统可以集成在服务器104上，也可以放在云上或其他网络服务器上。终端102获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型，构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征，获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征，迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型进行视频分类。其中，终端102可以但不限于是各种个人计算机、笔记本电脑、智能手机、平板电脑、物联网设备和便携式可穿戴设备，物联网设备可为智能音箱、智能电视、智能空调、智能车载设备等。便携式可穿戴设备可为智能手表、智能手环、头戴设备等。服务器104可以用独立的服务器或者是多个服务器组成的服务器集群来实现。The domain-adaptive video classification method provided in the embodiment of the present application can be applied in the application environment shown in FIG. 1 . Among them, the terminal 102 communicates with the server 104 through the network. The data storage system can store the data that the server 104 needs to process. The data storage system can be integrated on the server 104, or it can be placed on the cloud or other network servers. The terminal 102 obtains video input samples of the source domain and at least one target domain, classifies based on the features of the source domain video input samples to obtain an initial video classification model, constructs at least two private networks, and the private networks are used to obtain semantically irrelevant information features of video input samples in each domain, obtain feature data of video input samples of the source domain and at least one target domain extracted by the initial video classification model, obtain the feature distribution distance of the video classification model and the features extracted by each private network, maximize the feature distribution distance and calculate the maximum mean difference, obtain the common semantic information features, iteratively train the initial video classification model and each private network, and when the iteration stop condition is met, obtain the domain-general target video classification model, and perform video classification according to the target video classification model. The terminal 102 may be, but is not limited to, various personal computers, laptops, smart phones, tablet computers, IoT devices, and portable wearable devices. The IoT devices may be smart speakers, smart TVs, smart air conditioners, smart car-mounted devices, etc. The portable wearable devices may be smart watches, smart bracelets, head-mounted devices, etc. The server 104 may be implemented as an independent server or a server cluster consisting of multiple servers.

在一个实施例中，如图2所示，提供了一种领域自适应的视频分类方法，以该方法应用于图1中的服务器104为例进行说明，包括以下步骤：In one embodiment, as shown in FIG. 2 , a domain-adaptive video classification method is provided, and the method is applied to the server 104 in FIG. 1 as an example for description, including the following steps:

步骤202，获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型。Step 202: Obtain video input samples of a source domain and at least one target domain, and classify the video input samples of the source domain based on features to obtain an initial video classification model.

其中，源域表示与测试样本不同的领域，具有丰富的监督标注信息；目标域表示测试样本所在的领域，无标签或者只有少量标签，示例性的，在迁移学习中，来自源域的原始样本包含视频类别标签，来自目标域的原始样本不包含视频类别标签且与源域样本分布不同。Among them, the source domain represents a domain different from the test sample and has rich supervised annotation information; the target domain represents the domain where the test sample is located, which has no labels or only a small number of labels. For example, in transfer learning, the original samples from the source domain contain video category labels, and the original samples from the target domain do not contain video category labels and have a different distribution from the source domain samples.

示例性的，通过图像帧提取和降采样处理输入的源域和目标域视频数据，降采样的方式为从随机位置开始按固定的采样频率采样16帧，得到大小一致的源域和目标域视频输入样本，对源域视频输入样本进行3000次视频分类任务的迭代训练得到初始视频分类模型。Exemplarily, the input source domain and target domain video data are processed by image frame extraction and downsampling. The downsampling method is to sample 16 frames starting from a random position at a fixed sampling frequency to obtain source domain and target domain video input samples of the same size. The source domain video input samples are iteratively trained for 3000 video classification tasks to obtain an initial video classification model.

步骤204，构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征。Step 204: construct at least two private networks, where the private networks are used to obtain semantically irrelevant information features of video input samples in various fields.

其中，语义无关信息特征是指视频分类中与领域紧密相关而与语义无关的背景信息，表现为视频分类的干扰特征。Among them, semantically irrelevant information features refer to background information in video classification that is closely related to the domain but irrelevant to semantics, and manifests as interference features for video classification.

示例性的，对于动态的视频来说，静态的背景特征是最主要的语义无关信息特征，构建私有网络以后，可以通过背景重构训练不同的私有网络，分别获取到源域和目标域视频输入样本的背景特征。For example, for dynamic videos, static background features are the most important semantically irrelevant information features. After building a private network, different private networks can be trained through background reconstruction to obtain the background features of the source domain and target domain video input samples respectively.

步骤206，获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征。Step 206, obtain feature data of video input samples of the source domain and at least one target domain extracted by the initial video classification model, obtain feature distribution distances of the features extracted by the video classification model and each private network, maximize the feature distribution distance and calculate the maximum mean difference to obtain common semantic information features.

其中，最大均值差异MMD用于衡量不同特征分布之间的距离，如果该距离的均值差异达到最大，就说明采样的样本来自完全不同的分布。公共语义信息特征是指与领域无关而与语义相关的特征，表现为划分视频类别有关的特征。Among them, the maximum mean difference (MMD) is used to measure the distance between different feature distributions. If the mean difference of the distance reaches the maximum, it means that the samples are from completely different distributions. Common semantic information features refer to features that are not related to the domain but are related to semantics, which are manifested as features related to the classification of video categories.

示例性的，初始视频模型提取出源域和目标域的视频输入样本中的特征数据，该特征数据、私有网络提取出源域视频输入样本的背景特征和目标域视频输入样本的背景特征具有不同的特征分布，对三个输入样本的特征分布距离最大化，得到最大均值差异MMD，计算公式表示为：Exemplarily, the initial video model extracts feature data from the video input samples of the source domain and the target domain. The feature data and the background features of the source domain video input samples extracted by the private network and the background features of the target domain video input samples have different feature distributions. The feature distribution distance of the three input samples is maximized to obtain the maximum mean difference MMD, and the calculation formula is expressed as:

其中，X_S,X_T分别为源域和目标域视频输入样本，φ(x_s)和φ(x_t)为对应的核函数。Among them, X _S , _XT are the source domain and target domain video input samples respectively, and φ(x _s ) and φ(x _t ) are the corresponding kernel functions.

示例性的，如图3所示视频分类方法的最大均值差异示意图，在特征分布距离最大化的情况下，特征之间的差异性最大，可以实现在初始视频分类模型获取到的特征数据中忽略源域和目标域背景特征的效果，得到源域和目标域的公共语义信息特征。在最大化特征分布距离的过程中，初始视频分类模型的最大均值差异的损失函数计算公式表示为：For example, as shown in FIG3 , the maximum mean difference diagram of the video classification method, when the feature distribution distance is maximized, the difference between the features is the largest, which can achieve the effect of ignoring the background features of the source domain and the target domain in the feature data obtained by the initial video classification model, and obtain the common semantic information features of the source domain and the target domain. In the process of maximizing the feature distribution distance, the loss function calculation formula of the maximum mean difference of the initial video classification model is expressed as:

其中，

为最大均值差异损失，在源域的背景特征分布为d_Sp、目标域的背景特征分布为d_Tp和初始视频分类模型的特征数据分布为d_main的情况下，d_Sp与d_Tp之间的最大均值差异为MMD(d_Sp,d_Tp)，d_main与d_Sp之间的最大均值差异为MMD(d_main,d_Sp)，d_main与d_Tp之间的最大均值差异为MMD(d_main,d_Tp)。in,

is the maximum mean difference loss. When the background feature distribution of the source domain is d _Sp , the background feature distribution of the target domain is d _Tp and the feature data distribution of the initial video classification model is d _main , the maximum mean difference between d _Sp and d _Tp is MMD(d _Sp ,d _Tp ), the maximum mean difference between d _main and d _Sp is MMD(d _main ,d _Sp ), and the maximum mean difference between d _main and d _Tp is MMD(d _main ,d _Tp ).

步骤208，迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型进行视频分类。Step 208, iteratively train the initial video classification model and each private network, and when the iteration stop condition is met, obtain a domain-wide target video classification model, and perform video classification according to the target video classification model.

其中，迭代训练是根据迭代算法进行视频分类模型训练的过程，迭代是重复反馈过程的活动，迭代算法是从某个值开始不断地根据上一步的结果计算出下一步的结果。Among them, iterative training is the process of training the video classification model according to the iterative algorithm. Iteration is the activity of repeating the feedback process. The iterative algorithm starts from a certain value and continuously calculates the result of the next step based on the result of the previous step.

示例性的，可以根据视频分类模型训练过程中的损失函数进行迭代训练，在损失函数不再变化的情况下，具体的，在损失函数不再下降的情况下或者损失函数趋于稳定的情况下，视频分类模型实现收敛，得到可以实现目标域视频准确分类的目标视频分类模型，可以根据目标视频分类模型进行视频分类。Exemplarily, iterative training can be performed based on the loss function in the video classification model training process. When the loss function no longer changes, specifically, when the loss function no longer decreases or the loss function tends to be stable, the video classification model converges to obtain a target video classification model that can accurately classify target domain videos. Video classification can then be performed based on the target video classification model.

上述能够实现领域自适应的视频分类方法，通过获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型，构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征，获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征，迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型实现了领域自适应的视频分类，该方法通过构建私有网络并提取出语义无关信息特征，并获取语义无关信息特征与初始视频分类模型提取的特征数据之间的最大均值差异，对最大均值差异进行最大化处理后，也就是说，在实现语义无关信息特征与特征数据之间的特征差异最大以后，有利于实现视频分类模型在视频分类过程中获取到公共语义信息特征且忽略语义无关信息特征，有利于降低目标域视频中存在的语义无关信息特征导致目标域视频无法在初始视频分类模型中得到适应性视频分类的影响，提高了领域迁移中视频分类的领域适应性，对于包含大部分与语义无关的干扰信息的视频数据的分类，领域的自适应效果好，领域自适应的视频分类准确度高，具有较高的可靠性。The above-mentioned video classification method capable of achieving domain adaptation obtains video input samples from a source domain and at least one target domain, classifies the video input samples based on the features of the source domain to obtain an initial video classification model, constructs at least two private networks, and the private networks are used to respectively obtain semantically irrelevant information features of video input samples from each domain, obtains feature data of video input samples from the source domain and at least one target domain extracted by the initial video classification model, obtains feature distribution distances between the video classification model and features extracted by each private network, maximizes the feature distribution distance and calculates the maximum mean difference to obtain common semantic information features, iteratively trains the initial video classification model and each private network, and when the iteration stop condition is met, obtains a domain-universal target video classification model, and implements domain-adaptive video classification according to the target video classification model. The method is By building a private network and extracting semantically irrelevant information features, and obtaining the maximum mean difference between the semantically irrelevant information features and the feature data extracted by the initial video classification model, the maximum mean difference is maximized, that is, after achieving the maximum feature difference between the semantically irrelevant information features and the feature data, it is beneficial for the video classification model to obtain public semantic information features and ignore semantically irrelevant information features during the video classification process, which is beneficial to reducing the influence of semantically irrelevant information features in the target domain video that causes the target domain video to be unable to obtain adaptive video classification in the initial video classification model, and improves the domain adaptability of video classification in domain migration. For the classification of video data containing most of the interference information irrelevant to semantics, the domain adaptation effect is good, the domain adaptive video classification accuracy is high, and it has high reliability.

在一个实施例中，构建至少两个私有网络，包括：获取视频输入样本的背景数据，背景数据作为监督信号用于私有网络的重构训练；通过私有网络进行各领域视频输入样本的重构训练，得到重构背景数据；获取背景数据与重构背景数据之间的重构损失；最小化重构损失，得到语义无关信息特征。In one embodiment, at least two private networks are constructed, including: obtaining background data of video input samples, using the background data as a supervisory signal for reconstruction training of the private network; performing reconstruction training of video input samples in various fields through the private network to obtain reconstructed background data; obtaining a reconstruction loss between the background data and the reconstructed background data; and minimizing the reconstruction loss to obtain semantically irrelevant information features.

其中，监督信号是指在监督学习中，视频输入样本期望的输出值。重构是图像重构(IR)，目的在于根据ground truth图像抽取出的的各类信息对图像进行重构。groundtruth是指，在监督学习中，数据是有标注的，以(x,t)的形式出现，其中x是输入数据，t是标注，正确的t标注是ground truth。重构损失是指重构的损失函数，也就是私有网络图像重构的预测值与真实值的差异程度的运算函数。损失函数包括基于距离度量的损失函数和基于概率分布度量的损失函数。Among them, the supervisory signal refers to the expected output value of the video input sample in supervised learning. Reconstruction is image reconstruction (IR), which aims to reconstruct the image based on various information extracted from the ground truth image. Groundtruth means that in supervised learning, the data is labeled and appears in the form of (x, t), where x is the input data, t is the label, and the correct t label is the ground truth. Reconstruction loss refers to the reconstruction loss function, that is, the operation function of the degree of difference between the predicted value and the true value of the private network image reconstruction. The loss function includes the loss function based on the distance metric and the loss function based on the probability distribution metric.

示例性的，如图4所示的私有网络的结构示意图，可以通过时间中值滤波器TMF提取源域和目标域视频输入样本的背景数据，得到源域和目标域背景数据，输入源域背景数据和源域视频输入样本到源域私有网络，输入目标域背景数据和目标域视频输入样本到目标域私有网络，背景数据作为监督信号是私有网络进行重构训练过程中期望的输出值，也是正确的t标注ground truth，私有网络经图像重构的输出值与背景数据作比较可以计算重构的损失函数，在重构训练的过程中，最小化重构损失，可以缩小背景数据与重构背景数据之间的差异，实现私有网络学习到与语义的无关信息即语义无关信息特征。Exemplarily, as shown in the structural diagram of the private network in FIG4 , the background data of the source domain and target domain video input samples can be extracted by the temporal median filter TMF to obtain the source domain and target domain background data, the source domain background data and the source domain video input samples are input into the source domain private network, and the target domain background data and the target domain video input samples are input into the target domain private network. The background data is used as a supervisory signal and is the expected output value of the private network during the reconstruction training process, and is also the correct t-labeled ground truth. The output value of the private network after image reconstruction is compared with the background data to calculate the reconstruction loss function. In the process of reconstruction training, the reconstruction loss is minimized, and the difference between the background data and the reconstructed background data can be reduced, so that the private network can learn semantically irrelevant information, that is, semantically irrelevant information features.

本实施例中，使用时间中值滤波器来获取视频输入样本的背景数据，时间中值滤波器是一种简单、直观且快速的视频背景提取方法，而损失函数用于模型的训练阶段，在得到单次训练得出的预测值和差异值之间的损失值之后，根据最小化损失值的方向更新私有网络的各个参数，达到降低真实值与预测值之间的损失，使得模型生成的预测值往真实值方向靠拢的效果，进而达到私有网络学习到源域和目标域视频输入样本语义无关信息特征的效果。In this embodiment, a temporal median filter is used to obtain background data of a video input sample. The temporal median filter is a simple, intuitive and fast method for extracting video background. The loss function is used in the training phase of the model. After obtaining the loss value between the predicted value and the difference value obtained by a single training, the parameters of the private network are updated in the direction of minimizing the loss value, thereby reducing the loss between the true value and the predicted value, and making the predicted value generated by the model move closer to the true value, thereby achieving the effect of the private network learning the semantically irrelevant information features of the source domain and target domain video input samples.

在一个实施例中，私有网络包括视频特征提取器和重构网络，通过私有网络进行各领域视频输入样本的重构训练，得到重构背景数据，包括：基于视频特征提取器得到各领域视频输入样本的背景特征；基于重构网络对背景特征重构得到重构背景数据；获取背景数据与重构背景数据之间的重构损失，包括：获取背景数据与重构背景数据之间的距离；通过基于距离度量的损失函数和距离，计算重构损失。In one embodiment, a private network includes a video feature extractor and a reconstruction network. Reconstruction training of video input samples in various fields is performed through the private network to obtain reconstructed background data, including: obtaining background features of video input samples in various fields based on the video feature extractor; reconstructing the background features based on the reconstruction network to obtain reconstructed background data; obtaining a reconstruction loss between the background data and the reconstructed background data, including: obtaining the distance between the background data and the reconstructed background data; and calculating the reconstruction loss through a loss function based on a distance metric and the distance.

其中，视频特征提取器用于提取视频输入样本的特征。重构网络用于对提取到的特征进行背景重构。The video feature extractor is used to extract the features of the video input samples, and the reconstruction network is used to reconstruct the background based on the extracted features.

示例性的，如图4所示的私有网络的结构示意图，源域私有网络包括源域视频特征提取器F_Sp和源域重构网络，目标域私有网络包括目标域视频特征提取器F_Tp和目标域重构网络，获取到视频特征提取器提取到的视频输入样本的背景特征，可以根据背景特征在重构网络进行图像重构得到预测的重构背景数据，根据源域和目标域重构背景数据与背景数据之间的距离计算源域和目标域的L₂损失函数，其中，L₂损失函数又被称为欧氏距离，是一种常用的距离度量方函数，通常用于度量数据点之间的相似度，得到源域重构损失的计算公式表示为：Exemplarily, as shown in the structural diagram of the private network in FIG4 , the source domain private network includes a source domain video feature extractor F _Sp and a source domain reconstruction network, and the target domain private network includes a target domain video feature extractor F _Tp and a target domain reconstruction network. The background features of the video input sample extracted by the video feature extractor are obtained, and the image can be reconstructed in the reconstruction network according to the background features to obtain the predicted reconstructed background data. The L ₂ loss function of the source domain and the target domain is calculated according to the distance between the reconstructed background data of the source domain and the target domain and the background data. The L ₂ loss function is also called the Euclidean distance, which is a commonly used distance measurement function, and is usually used to measure the similarity between data points. The calculation formula of the source domain reconstruction loss is expressed as:

目标域重构损失的计算公式表示为：The calculation formula of the target domain reconstruction loss is expressed as:

私有网络重构损失为两项重构损失之和的计算公式表示为：The calculation formula of the private network reconstruction loss is the sum of two reconstruction losses:

其中，b_Sp为源域重构背景数据，b_S为源域背景数据，

为源域的L₂损失函数，b_Tp为目标域重构背景数据，b_T为目标域背景数据，

为目标域的L₂损失函数，

为私有网络重构损失函数。Among them, _bSp is the source domain reconstruction background data, _bS is the source domain background data,

is the _L2 loss function of the source domain, _bTp is the reconstructed background data of the target domain, _bT is the background data of the target domain,

is the _L2 loss function of the target domain,

Reconstruct the loss function for private networks.

本实施例中，通过基于距离度量的损失函数度量特征空间上视频输入样本真实值和私有网络预测值之间的距离，特征空间上两个点的距离越小，可以得到私有网络的预测性能越好，且L₂损失函数的曲线在接近目标时足够平缓，所以可以利用这个特点在接近目标时，逐渐缓慢收敛过去，适合用于图像处理。In this embodiment, the distance between the true value of the video input sample and the predicted value of the private network in the feature space is measured by a loss function based on distance measurement. The smaller the distance between two points in the feature space, the better the prediction performance of the private network can be obtained, and the curve of the _L2 loss function is sufficiently flat when approaching the target, so this feature can be used to gradually and slowly converge when approaching the target, which is suitable for image processing.

在其中一个实施例中，初始视频分类模型包括特征提取器、域判别器和分类器，基于源域视频输入样本的特征进行分类得到初始视频分类模型，包括；通过特征提取器获取源域视频输入样本的特征；通过分类器对特征进行分类；获取分类损失，分类损失用于迭代训练初始视频分类模型和各私有网络；获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，包括：通过特征提取器得到源域和至少一个目标域的视频输入样本的初始特征数据；根据域判别器对初始特征数据进行对抗训练，得到对抗训练后的目标特征数据；根据分类器得到目标特征数据的视频分类；获取域判别器的对抗训练损失，对抗训练损失用于迭代训练初始视频分类模型和各私有网络。In one of the embodiments, the initial video classification model includes a feature extractor, a domain discriminator and a classifier, and the initial video classification model is obtained by classifying based on the features of the source domain video input samples, including: obtaining the features of the source domain video input samples through the feature extractor; classifying the features through the classifier; obtaining the classification loss, and the classification loss is used to iteratively train the initial video classification model and each private network; obtaining the feature data of the video input samples of the source domain and at least one target domain extracted by the initial video classification model, including: obtaining the initial feature data of the video input samples of the source domain and at least one target domain through the feature extractor; performing adversarial training on the initial feature data according to the domain discriminator to obtain target feature data after adversarial training; obtaining the video classification of the target feature data according to the classifier; obtaining the adversarial training loss of the domain discriminator, and the adversarial training loss is used to iteratively train the initial video classification model and each private network.

其中，对抗训练是指在初始视频分类模型的训练过程中，域判别器与图像分类器的输入都来自于特征提取器提取的特征，域判别器用于最大化域判别损失，混淆目标域视频输入数据与源域视频输入数据，图像分类器用于最小化图像分类损失，实现图像的精准分类。域判别器包括一个梯度反转层和两层全连接层，用于判别特征提取器提取到的特征是来自源域还是目标域，其中，域判别损失函数的梯度与图像分类损失函数的梯度方向相反，通过梯度反转层可以实现域判别损失的梯度反向传播到特征提取器的参数之前自动取反，进而实现对抗训练。分类器是指图像分类器，用于对视频输入样本提取到的特征进行视频分类。Among them, adversarial training means that in the training process of the initial video classification model, the inputs of the domain discriminator and the image classifier are all from the features extracted by the feature extractor. The domain discriminator is used to maximize the domain discrimination loss and confuse the target domain video input data with the source domain video input data. The image classifier is used to minimize the image classification loss to achieve accurate image classification. The domain discriminator includes a gradient reversal layer and two fully connected layers, which are used to discriminate whether the features extracted by the feature extractor are from the source domain or the target domain. The gradient of the domain discrimination loss function is opposite to the gradient of the image classification loss function. The gradient reversal layer can realize the automatic inversion of the gradient of the domain discrimination loss before backpropagating to the parameters of the feature extractor, thereby realizing adversarial training. The classifier refers to an image classifier, which is used to classify the video based on the features extracted from the video input samples.

示例性的，如图5所示的初始视频分类模型的结构示意图通过特征提取器获取到源域视频输入样本的特征，源域视频输入样本为包含视频分类标签的输入视频样本，分类器根据提取到的特征进行3000次的视频分类任务训练，得到初始视频分类模型，初始视频分类模型对于有视频分类标签的源域视频能实现准确的视频分类，获取到视频分类任务过程中视频输入样本分类的输出值与真实值之间的差异，得到视频分类损失，计算视频分类损失函数的计算公式表示为：Exemplarily, the structural diagram of the initial video classification model shown in FIG5 obtains the features of the source domain video input sample through the feature extractor, and the source domain video input sample is an input video sample containing a video classification label. The classifier performs 3000 video classification task trainings based on the extracted features to obtain the initial video classification model. The initial video classification model can achieve accurate video classification for source domain videos with video classification labels. The difference between the output value and the true value of the video input sample classification during the video classification task is obtained to obtain the video classification loss. The calculation formula for calculating the video classification loss function is expressed as:

其中，

为视频分类损失，x是输入样本，x∈X_S是输入样本为源域视频输入样本，y为源域的视频类别标签也是真实值，σ为softmax函数，C(F(x))为分类器对源域视频输入样本进行分类得到的输出经过softmax函数计算得到概率值。in,

is the video classification loss, x is the input sample, x∈X _S is the input sample is the source domain video input sample, y is the video category label of the source domain and also the true value, σ is the softmax function, C(F(x)) is the output obtained by the classifier to classify the source domain video input sample and the probability value is calculated by the softmax function.

通过特征提取器提取源域和目标域视频输入样本的初始特征数据，在特征提取器和分类器之间存在域判别器的梯度反转层，最大化域判别的损失值、最小化视频分类任务的损失值可以实现初始特征数据的对抗训练，得到源域和目标域领域判别混淆的目标特征数据，可以根据目标特征数据进行源域和目标域输入视频的分类，获取到对抗训练中的损失值，计算对抗训练损失函数的计算公式表示为：The feature extractor extracts the initial feature data of the source domain and target domain video input samples. There is a gradient reversal layer of the domain discriminator between the feature extractor and the classifier. Maximizing the loss value of domain discrimination and minimizing the loss value of the video classification task can realize adversarial training of the initial feature data, and obtain the target feature data of the source domain and target domain domain discrimination confusion. The source domain and target domain input videos can be classified according to the target feature data, and the loss value in adversarial training can be obtained. The calculation formula for calculating the adversarial training loss function is expressed as:

其中，

为对抗训练损失，y_d是一个代表领域标签的二维向量，即输入视频样本所在领域的真实值，当输入x，x∈X_S是源域原始样本时，y_d＝<1,0>，或当输入x是目标域原始样本x∈X_T时，y_d＝<0,1>，σ为softmax函数，为(D(F(X)))域判别器对源域和目标域视频输入样本进行领域判别得到的输出经过softmax函数计算得到概率值。in,

is the adversarial training loss, _yd is a two-dimensional vector representing the domain label, that is, the true value of the domain where the input video sample is located. When the input x, _x∈XS is the original sample of the source domain, _yd ＝<1,0>, or when the input x is the original sample of the target domain _x∈XT , _yd ＝<0,1>, σ is the softmax function, and the output obtained by the (D(F(X))) domain discriminator for the source domain and target domain video input samples is calculated by the softmax function to obtain the probability value.

本实施例中，实现了初始视频分类模型的视频分类任务训练和对抗训练。In this embodiment, video classification task training and adversarial training of the initial video classification model are implemented.

在其中一个实施例中，该方法还包括：构建特征来源分类器，根据特征来源分类器确定输入特征的来源标识，其中，来源标识用于确定输入特征的来源是初始视频分类模型或私有网络；获取特征来源分类器的来源分类损失，来源分类损失用于迭代训练初始视频分类模型和各私有网络。In one embodiment, the method further includes: constructing a feature source classifier, determining a source identifier of an input feature according to the feature source classifier, wherein the source identifier is used to determine whether the source of the input feature is an initial video classification model or a private network; obtaining a source classification loss of the feature source classifier, and the source classification loss is used to iteratively train the initial video classification model and each private network.

示例性的，获取初始视频分类模型的特征提取器和私有网络的视频特征提取器提取的特征，该特征携带来源标识，来源标识用于确定输入特征的来源是初始视频分类模型的特征提取器F或源域私有网络的源域视频特征提取器F_Sp或目标域私有网络的目标域视频特征提取器F_Tp，输入提取的特征到特征来源分类器，特征来源分类器根据输入的提取特征得到特征来源于初始视频分类模型或私有网络，根据输出值和真实值计算来源分类损失，来源分类损失函数的计算公式表示为：Exemplarily, features extracted by the feature extractor of the initial video classification model and the video feature extractor of the private network are obtained, and the features carry a source identifier, and the source identifier is used to determine whether the source of the input feature is the feature extractor F of the initial video classification model or the source domain video feature extractor F _Sp of the source domain private network or the target domain video feature extractor F _Tp of the target domain private network, and the extracted features are input to the feature source classifier, and the feature source classifier obtains that the features come from the initial video classification model or the private network according to the input extracted features, and calculates the source classification loss according to the output value and the true value. The calculation formula of the source classification loss function is expressed as:

其中，

为来源分类损失，y_N为特征提取器F、源域视频特征提取器F_Sp和目标域视频特征提取器F_Tp的来源标识，f为任一输入的提取特征，C_N(f)为特征来源分类器判断输入特征得到的输出经过softmax函数计算得到的概率值。in,

is the source classification loss, y _N is the source identifier of the feature extractor F, the source domain video feature extractor F _Sp and the target domain video feature extractor F _Tp , f is the extracted feature of any input, and C _N (f) is the probability value calculated by the softmax function after the output obtained by the feature source classifier judging the input feature.

本实施例中，通过增加特征来源分类器，区分视频分类模型和私有网络提取到的特征，实现了增强其经过训练得到的特征内容的差异性。In this embodiment, by adding a feature source classifier to distinguish the features extracted by the video classification model and the private network, the difference in the feature content obtained through training is enhanced.

在其中一个实施例中，该方法还包括：迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，包括：基于损失函数获取训练损失，根据训练损失得到迭代停止条件；根据训练损失反向传播计算损失函数的梯度，更新损失函数；在训练损失稳定的情况下，满足迭代停止条件，得到领域通用的目标视频分类模型。In one embodiment, the method further includes: iteratively training the initial video classification model and each private network, and obtaining a domain-wide target video classification model when an iteration stop condition is met, including: obtaining a training loss based on a loss function, and obtaining an iteration stop condition according to the training loss; calculating the gradient of the loss function by back-propagation according to the training loss, and updating the loss function; when the training loss is stable, the iteration stop condition is met, and a domain-wide target video classification model is obtained.

其中，反向传播是“误差反向传播”的简称，是一种与最优化方法结合使用的，用来训练人工神经网络的常见方法。该方法对网络中所有权重计算损失函数的梯度，这个梯度会反馈给最优化方法，用来更新权值以最小化损失函数。Back propagation is short for "error back propagation", which is a common method used in combination with optimization methods to train artificial neural networks. This method calculates the gradient of the loss function for all weights in the network, and this gradient is fed back to the optimization method to update the weights to minimize the loss function.

示例性的，获取到视频分类损失

对抗训练损失

重构损失

最大均值差异损失

来源分类损失

根据总损失函数对初始视频分类模型进行迭代训练，总损失函数计算公式为：For example, the video classification loss is obtained

Adversarial Training Loss

Reconstruction loss

Maximum mean difference loss

Source classification loss

The initial video classification model is iteratively trained according to the total loss function, and the total loss function calculation formula is:

函数图像的曲面上方向导数的最大值的方向就代表了梯度的方向，在做梯度下降的时候，应该是沿着梯度的反方向更新，反向传播计算损失函数的梯度，结合随机梯度下降(stochastic gradient descent，SGD)优化算法，可以迭代训练初始训练模型，实现最小化损失函数，也就是说，根据得到的总损失函数计算供述，反向传播计算出梯度以后，总损失函数朝最小化损失函数方向靠近，梯度会反馈到随机梯度下降优化算法，优化算法可以根据梯度更新初始视频分类模型的模型参数，迭代训练初始视频分类模型，在总损失函数得出的最小值稳定的情况下，模型逐渐收敛得到目标视频分类模型，目标视频分类模型表现为总损失最小，也就是说对于视频分类的准确性高，实现了目标域的视频分类在视频分类模型中也可以得到准确的分类。The direction of the maximum value of the directional derivative on the surface of the function image represents the direction of the gradient. When doing gradient descent, it should be updated in the opposite direction of the gradient, and the gradient of the loss function is calculated by back propagation. Combined with the stochastic gradient descent (SGD) optimization algorithm, the initial training model can be iteratively trained to minimize the loss function. In other words, according to the total loss function obtained, the statement is calculated. After the gradient is calculated by back propagation, the total loss function approaches the direction of minimizing the loss function, and the gradient will be fed back to the stochastic gradient descent optimization algorithm. The optimization algorithm can update the model parameters of the initial video classification model according to the gradient, and iteratively train the initial video classification model. When the minimum value obtained by the total loss function is stable, the model gradually converges to obtain the target video classification model. The target video classification model shows the minimum total loss, that is, the accuracy of video classification is high, and the video classification of the target domain can also be accurately classified in the video classification model.

如图6所示为另一个实施例中领域自适应的视频分类方法的流程示意图，该领域自适应的视频分类方法包括如下步骤：FIG6 is a flow chart of a domain-adaptive video classification method according to another embodiment. The domain-adaptive video classification method comprises the following steps:

步骤602，获取原始的源域和目标域视频数据，对视频数据进行视频帧提取和降采样处理，得到源域和目标域视频输入样本。Step 602: obtain original source domain and target domain video data, perform video frame extraction and downsampling processing on the video data, and obtain source domain and target domain video input samples.

对于原始的视频数据，进行图像帧提取得到RGB视频帧序列，并进行视频帧的采样。根据训练初始视频分类模型的原始源域和目标域视频，获取采集好的RGB帧序列，从随机位置开始每隔4帧采样一帧作为输入数据，每个样本采样t帧，作为训练初始视频分类模型的视频输入样本。For the original video data, image frame extraction is performed to obtain an RGB video frame sequence, and video frame sampling is performed. According to the original source domain and target domain videos for training the initial video classification model, the collected RGB frame sequence is obtained, and one frame is sampled every 4 frames starting from a random position as input data. Each sample samples t frames as the video input sample for training the initial video classification model.

步骤604，构建初始视频分类模型，包括特征提取器、域判别器和分类器，根据特征提取器得到源域视频输入样本的源域特征，根据分类器和得到源域特征的视频分类，获取视频分类损失函数

Step 604, construct an initial video classification model, including a feature extractor, a domain discriminator and a classifier, obtain source domain features of the source domain video input sample according to the feature extractor, classify the video according to the classifier and the obtained source domain features, and obtain the video classification loss function

使用有视频分类标签的源域视频输入样本对初始视频分类模型进行预训练，初始视频分类模型采用I3D视频分类模型，首先只使用源域数据进行3000次迭代的预训练。预训练使用SGD(标准梯度下降)优化算法，学习率取0.001。The initial video classification model is pre-trained using source domain video input samples with video classification labels. The initial video classification model uses the I3D video classification model. First, only the source domain data is used for pre-training for 3000 iterations. The pre-training uses the SGD (standard gradient descent) optimization algorithm with a learning rate of 0.001.

步骤606，构建源域和目标域私有网络，私有网络包括视频特征提取器和重构网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征。Step 606, constructing source domain and target domain private networks, the private networks including a video feature extractor and a reconstruction network, the private networks are used to obtain semantically irrelevant information features of video input samples in each domain.

步骤608，通过时间中值滤波器获取到源域和目标域视频输入样本的背景图，背景图作为监督信号用于私有网络的重构训练。Step 608: The background image of the source domain and the target domain video input samples is obtained through a temporal median filter, and the background image is used as a supervisory signal for reconstruction training of the private network.

使用固定参数的时间中值滤波器提取视频输入样本的背景图，输入样本RGB帧序列的维度为时间×高度×宽度×通道数(t×h×w×c)，提取到的背景图维度为高度×宽度×通道数(h×w×c)，没有时间维度。A fixed-parameter temporal median filter is used to extract the background image of the video input sample. The dimension of the input sample RGB frame sequence is time × height × width × number of channels (t × h × w × c), and the dimension of the extracted background image is height × width × number of channels (h × w × c), without a time dimension.

步骤610，通过各私有网络的视频特征提取器得到源域和目标域视频输入样本的背景特征，通过各私有网络的重构网络得到源域和目标域视频输入样本的重构背景图。Step 610, obtaining background features of source domain and target domain video input samples through the video feature extractors of each private network, and obtaining reconstructed background images of source domain and target domain video input samples through the reconstruction networks of each private network.

步骤612，分别获取源域和目标域背景图与源域和目标域重构背景图之间的源域和目标域的L₂损失，得到源域和目标域重构损失，获取重构损失的L₂损失函数，最小化L₂损失函数，得到源域和目标域视频输入样本的语义无关信息特征。Step 612, respectively obtain the _L2 losses of the source domain and the target domain between the background images of the source domain and the target domain and the reconstructed background images of the source domain and the target domain, obtain the reconstruction losses of the source domain and the target domain, obtain the _L2 loss function of the reconstruction loss, minimize the _L2 loss function, and obtain the semantically irrelevant information features of the source domain and the target domain video input samples.

将源域视频输入样本输入源域私有网络，进行背景重构训练学习源域的语义无关信息特征，计算

将目标域视频输入样本输入目标域私有网络，计算

求两个私有网络重构训练损失之和

The source domain video input samples are input into the source domain private network, and background reconstruction training is performed to learn the semantically irrelevant information features of the source domain.

Input the target domain video input sample into the target domain private network and calculate

Find the sum of the reconstruction training losses of the two private networks

步骤614，根据特征提取器获取源域和目标域视频输入样本的初始特征，根据域判别器对初始特征进行对抗训练，得到目标特征，根据分类器得到目标特征的分类，获取对抗训练损失函数

Step 614, obtain the initial features of the source domain and target domain video input samples according to the feature extractor, perform adversarial training on the initial features according to the domain discriminator to obtain the target features, obtain the classification of the target features according to the classifier, and obtain the adversarial training loss function

初始视频分类模型的特征维度为1024，域判别器由一个梯度反转层和一个两层的全连接层分类器构成，输入为1024维的特征向量，隐藏层维度为100，输出为2维向量。使用有标签的源域样本和无标签的目标域样本作为输入数据，训练主网络特征提取器提取领域无关特征的能力，计算

The feature dimension of the initial video classification model is 1024. The domain discriminator consists of a gradient reversal layer and a two-layer fully connected layer classifier. The input is a 1024-dimensional feature vector, the hidden layer dimension is 100, and the output is a 2-dimensional vector. Using labeled source domain samples and unlabeled target domain samples as input data, the main network feature extractor is trained to extract domain-independent features and calculate

步骤616，最大化重构背景图与目标特征之间的特征分布距离，得到公共语义信息特征，获取最大化特征差异值的损失函数。Step 616, maximize the feature distribution distance between the reconstructed background image and the target feature, obtain the common semantic information feature, and obtain the loss function that maximizes the feature difference value.

最大化三个特征分布的MMD距离，计算

MMD中核函数采用多个高斯核。Maximize the MMD distance of the three feature distributions and calculate

The kernel function in MMD uses multiple Gaussian kernels.

步骤618，构建特征来源分类器，输入目标特征与私有网络视频特征提取器得到的背景特征到特征来源分类器，根据特征来源分类器确定输入特征来源初始视频分类模型或私有网络，获取来源分类损失函数。Step 618, construct a feature source classifier, input the target feature and the background feature obtained by the private network video feature extractor into the feature source classifier, determine the input feature source initial video classification model or private network according to the feature source classifier, and obtain the source classification loss function.

步骤620，迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型进行视频分类。Step 620, iteratively train the initial video classification model and each private network, and when the iteration stop condition is met, obtain a domain-wide target video classification model, and perform video classification according to the target video classification model.

根据获取到的所有损失函数表达式相加得到总损失函数

表达式，反向传播计算总损失函数的梯度，采用SGD(标准梯度下降)优化算法，更新初始视频分类模型的参数，学习率设置为0.0001，重复重构训练、对抗训练的过程，迭代训练16000次，在总损失函数稳定不变的情况下，得到目标视频分类模型，根据目标视频分类模型可以实现源域和目标域的视频分类。The total loss function is obtained by adding all the loss function expressions obtained.

Expression, back-propagation calculates the gradient of the total loss function, uses the SGD (standard gradient descent) optimization algorithm, updates the parameters of the initial video classification model, sets the learning rate to 0.0001, repeats the reconstruction training and adversarial training processes, and iterates the training for 16,000 times. When the total loss function remains stable, the target video classification model is obtained. According to the target video classification model, video classification in the source domain and the target domain can be realized.

步骤622，获取测试样本对目标视频分类模型进行测试训练。Step 622: Acquire test samples to perform test training on the target video classification model.

获取测试目标视频分类模型的原始视频，从RGB帧序列中5个随机的位置进行采样作为输入数据，将5个样本的预测结果取平均值作为最终预测结果。每个样本RGB帧序列的维度为时间×高度×宽度×通道数(t×h×w×c)，本实施例中，t取16，h和w取224。The original video of the test target video classification model is obtained, and samples are taken from 5 random positions in the RGB frame sequence as input data. The prediction results of the 5 samples are averaged as the final prediction result. The dimension of each sample RGB frame sequence is time × height × width × number of channels (t × h × w × c). In this embodiment, t is 16, and h and w are 224.

应该理解的是，虽然如上的各实施例所涉及的流程图中的各个步骤按照箭头的指示依次显示，但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明，这些步骤的执行并没有严格的顺序限制，这些步骤可以以其它的顺序执行。而且，如上的各实施例所涉及的流程图中的至少一部分步骤可以包括多个步骤或者多个阶段，这些步骤或者阶段并不必然是在同一时刻执行完成，而是可以在不同的时刻执行，这些步骤或者阶段的执行顺序也不必然是依次进行，而是可以与其它步骤或者其它步骤中的步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the steps in the flowcharts involved in the above embodiments are displayed in sequence according to the indication of the arrows, these steps are not necessarily executed in sequence according to the order indicated by the arrows. Unless there is a clear explanation in this article, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least a part of the steps in the flowcharts involved in the above embodiments may include multiple steps or multiple stages, and these steps or stages are not necessarily executed at the same time, but can be executed at different times, and the execution order of these steps or stages is not necessarily carried out in sequence, but can be executed in turn or alternately with other steps or at least a part of the steps or stages in other steps.

基于同样的发明构思，本申请实施例还提供了一种用于实现上述所涉及的领域自适应的视频分类方法的领域自适应的视频分类装置。该装置所提供的解决问题的实现方案与上述方法中所记载的实现方案相似，故下面所提供的领域自适应的视频分类装置实施例中的具体限定可以参见上文中对于领域自适应的视频分类方法的限定，在此不再赘述。Based on the same inventive concept, the embodiment of the present application also provides a domain-adaptive video classification device for implementing the domain-adaptive video classification method involved above. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the above method, so the specific limitations in the domain-adaptive video classification device embodiment provided below can refer to the limitations of the domain-adaptive video classification method above, and will not be repeated here.

在一个实施例中，如图7所示，提供了一种领域自适应的视频分类装置700，包括：视频分类模块702、私有网络模块704、均值差异模块706和迭代训练模块708，其中：In one embodiment, as shown in FIG. 7 , a domain-adaptive video classification device 700 is provided, including: a video classification module 702, a private network module 704, a mean difference module 706, and an iterative training module 708, wherein:

视频分类模块702，用于获取源域和至少一个目标域的视频输入样本，基于源域视频输入样本的特征进行分类得到初始视频分类模型；A video classification module 702 is used to obtain video input samples of a source domain and at least one target domain, and classify the video input samples of the source domain based on features to obtain an initial video classification model;

私有网络模块704，用于构建至少两个私有网络，私有网络用于分别获取各领域视频输入样本的语义无关信息特征；A private network module 704 is used to construct at least two private networks, and the private networks are used to obtain semantically irrelevant information features of video input samples in various fields respectively;

均值差异模块706，用于获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，获取视频分类模型和各私有网络提取特征的特征分布距离，对特征分布距离进行最大化处理并计算最大均值差异，得到公共语义信息特征；The mean difference module 706 is used to obtain feature data of the video input samples of the source domain and at least one target domain extracted by the initial video classification model, obtain the feature distribution distance of the video classification model and the features extracted by each private network, maximize the feature distribution distance and calculate the maximum mean difference to obtain the common semantic information feature;

迭代训练模块708，用于迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，根据目标视频分类模型进行视频分类。The iterative training module 708 is used to iteratively train the initial video classification model and each private network. When the iteration stop condition is met, a domain-wide target video classification model is obtained, and video classification is performed according to the target video classification model.

在一个实施例中，私有网络模块704还用于构建至少两个私有网络，包括：获取视频输入样本的背景数据，背景数据作为监督信号用于私有网络的重构训练；通过私有网络进行各领域视频输入样本的重构训练，得到重构背景数据；获取背景数据与重构背景数据之间的重构损失；最小化重构损失，得到语义无关信息特征。In one embodiment, the private network module 704 is also used to construct at least two private networks, including: obtaining background data of video input samples, using the background data as a supervisory signal for reconstruction training of the private network; performing reconstruction training of video input samples in various fields through the private network to obtain reconstructed background data; obtaining the reconstruction loss between the background data and the reconstructed background data; minimizing the reconstruction loss to obtain semantically irrelevant information features.

在一个实施例中，私有网络模块704还用于私有网络包括视频特征提取器和重构网络，通过私有网络进行各领域视频输入样本的重构训练，得到重构背景数据，包括：基于视频特征提取器得到各领域视频输入样本的背景特征；基于重构网络对背景特征重构得到重构背景数据；获取背景数据与重构背景数据之间的重构损失，包括：获取背景数据与重构背景数据之间的距离；通过基于距离度量的损失函数和距离，计算重构损失。In one embodiment, the private network module 704 is also used for a private network including a video feature extractor and a reconstruction network. Reconstruction training of video input samples in various fields is performed through the private network to obtain reconstructed background data, including: obtaining background features of video input samples in various fields based on the video feature extractor; reconstructing the background features based on the reconstruction network to obtain reconstructed background data; obtaining the reconstruction loss between the background data and the reconstructed background data, including: obtaining the distance between the background data and the reconstructed background data; calculating the reconstruction loss through a loss function based on a distance metric and the distance.

在一个实施例中，均值差异模块706还用于初始视频分类模型包括特征提取器、域判别器和分类器，基于源域视频输入样本的特征进行分类得到初始视频分类模型，包括；通过特征提取器获取源域视频输入样本的特征；通过分类器对特征进行分类；获取分类损失，分类损失用于迭代训练初始视频分类模型和各私有网络；获取初始视频分类模型提取的源域和至少一个目标域的视频输入样本的特征数据，包括：通过特征提取器得到源域和至少一个目标域的视频输入样本的初始特征数据；根据域判别器对初始特征数据进行对抗训练，得到对抗训练后的目标特征数据；根据分类器得到目标特征数据的视频分类；获取域判别器的对抗训练损失，对抗训练损失用于迭代训练初始视频分类模型和各私有网络。In one embodiment, the mean difference module 706 is also used for the initial video classification model including a feature extractor, a domain discriminator and a classifier, and the initial video classification model is obtained by classification based on the features of the source domain video input samples, including: obtaining the features of the source domain video input samples through the feature extractor; classifying the features through the classifier; obtaining the classification loss, and the classification loss is used for iterative training of the initial video classification model and each private network; obtaining the feature data of the video input samples of the source domain and at least one target domain extracted by the initial video classification model, including: obtaining the initial feature data of the video input samples of the source domain and at least one target domain through the feature extractor; performing adversarial training on the initial feature data according to the domain discriminator to obtain the target feature data after adversarial training; obtaining the video classification of the target feature data according to the classifier; obtaining the adversarial training loss of the domain discriminator, and the adversarial training loss is used for iterative training of the initial video classification model and each private network.

在一个实施例中，如图8所示，该装置还包括来源分类模块810，用于构建特征来源分类器，根据特征来源分类器确定输入特征的来源标识，其中，来源标识用于确定输入特征的来源是初始视频分类模型或私有网络；获取特征来源分类器的来源分类损失，来源分类损失用于迭代训练初始视频分类模型和各私有网络。In one embodiment, as shown in FIG8 , the device further includes a source classification module 810 for constructing a feature source classifier, determining a source identifier of an input feature according to the feature source classifier, wherein the source identifier is used to determine whether the source of the input feature is an initial video classification model or a private network; and obtaining a source classification loss of the feature source classifier, wherein the source classification loss is used to iteratively train the initial video classification model and each private network.

在一个实施例中，迭代训练模块708还用于迭代训练初始视频分类模型和各私有网络，在满足迭代停止条件时，得到领域通用的目标视频分类模型，包括：基于损失函数获取训练损失，根据训练损失得到迭代停止条件；根据训练损失反向传播计算损失函数的梯度，更新损失函数；在训练损失稳定的情况下，满足迭代停止条件，得到领域通用的目标视频分类模型。In one embodiment, the iterative training module 708 is also used to iteratively train the initial video classification model and each private network, and when the iteration stop condition is met, a domain-wide target video classification model is obtained, including: obtaining the training loss based on the loss function, and obtaining the iteration stop condition according to the training loss; calculating the gradient of the loss function by backpropagation according to the training loss, and updating the loss function; when the training loss is stable, the iteration stop condition is met, and the domain-wide target video classification model is obtained.

上述领域自适应的视频分类装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中，也可以以软件形式存储于计算机设备中的存储器中，以便于处理器调用执行以上各个模块对应的操作。Each module in the above-mentioned domain-adaptive video classification device can be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above-mentioned modules can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each of the above modules.

在一个实施例中，提供了一种计算机设备，该计算机设备可以是服务器，其内部结构图可以如图9所示。该计算机设备包括处理器、存储器、输入/输出接口(Input/Output，简称I/O)和通信接口。其中，处理器、存储器和输入/输出接口通过系统总线连接，通信接口通过输入/输出接口连接到系统总线。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质和内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的数据库用于存储视频分类数据。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端通过网络连接通信。该计算机程序被处理器执行时以实现一种领域自适应的视频分类方法。In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be shown in FIG9. The computer device includes a processor, a memory, an input/output interface (Input/Output, referred to as I/O) and a communication interface. The processor, the memory and the input/output interface are connected via a system bus, and the communication interface is connected to the system bus via the input/output interface. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store video classification data. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used to communicate with an external terminal via a network connection. When the computer program is executed by the processor, a field-adaptive video classification method is implemented.

在一个实施例中，提供了一种计算机设备，该计算机设备可以是终端，其内部结构图可以如图10所示。该计算机设备包括处理器、存储器、输入/输出接口、通信接口、显示单元和输入装置。其中，处理器、存储器和输入/输出接口通过系统总线连接，通信接口、显示单元和输入装置通过输入/输出接口连接到系统总线。其中，该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的输入/输出接口用于处理器与外部设备之间交换信息。该计算机设备的通信接口用于与外部的终端进行有线或无线方式的通信，无线方式可通过WIFI、移动蜂窝网络、NFC(近场通信)或其他技术实现。该计算机程序被处理器执行时以实现一种领域自适应的视频分类方法。该计算机设备的显示单元用于形成视觉可见的画面，可以是显示屏、投影装置或虚拟现实成像装置。显示屏可以是液晶显示屏或者电子墨水显示屏，该计算机设备的输入装置可以是显示屏上覆盖的触摸层，也可以是计算机设备外壳上设置的按键、轨迹球或触控板，还可以是外接的键盘、触控板或鼠标等。In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be shown in FIG10. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory, and the input/output interface are connected via a system bus, and the communication interface, the display unit, and the input device are connected to the system bus via the input/output interface. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. When the computer program is executed by the processor, a field-adaptive video classification method is implemented. The display unit of the computer device is used to form a visually visible picture, which may be a display screen, a projection device, or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device can be a touch layer covering the display screen, or a button, trackball or touchpad set on the computer device shell, or an external keyboard, touchpad or mouse.

本领域技术人员可以理解，前述结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。Those skilled in the art will appreciate that the aforementioned structure is merely a block diagram of a portion of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. The specific computer device may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.

在一个实施例中，提供了一种计算机设备，包括存储器和处理器，存储器中存储有计算机程序，该处理器执行计算机程序时实现上述各方法实施例的步骤。In one embodiment, a computer device is provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the above-mentioned method embodiments when executing the computer program.

在一个实施例中，提供了一种计算机可读存储介质，其上存储有计算机程序，计算机程序被处理器执行时实现上述各方法实施例的步骤。In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the steps of the above-mentioned method embodiments are implemented.

在一个实施例中，提供了一种计算机程序产品，包括计算机程序，该计算机程序被处理器执行时实现上述各方法实施例的步骤。In one embodiment, a computer program product is provided, including a computer program, which implements the steps of the above-mentioned method embodiments when executed by a processor.

需要说明的是，本申请所涉及的用户信息(包括但不限于用户设备信息、用户个人信息等)和数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)，均为经用户授权或者经过各方充分授权的信息和数据，且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with relevant laws, regulations and standards of relevant countries and regions.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成的计算机程序可存储于一非易失性计算机可读取存储介质中，该计算机程序在执行时，可包括如上述各方法的实施例的流程。其中，本申请所提供的各实施例中所使用的对存储器、数据库或其它介质的任何引用，均可包括非易失性和易失性存储器中的至少一种。非易失性存储器可包括只读存储器(Read-OnlyMemory，ROM)、磁带、软盘、闪存、光存储器、高密度嵌入式非易失性存储器、阻变存储器(ReRAM)、磁变存储器(Magnetoresistive Random Access Memory，MRAM)、铁电存储器(Ferroelectric Random Access Memory，FRAM)、相变存储器(Phase Change Memory，PCM)、石墨烯存储器等。易失性存储器可包括随机存取存储器(Random Access Memory，RAM)或外部高速缓冲存储器等。作为说明而非局限，RAM可以是多种形式，比如静态随机存取存储器(Static Random Access Memory，SRAM)或动态随机存取存储器(Dynamic RandomAccess Memory，DRAM)等。本申请所提供的各实施例中所涉及的数据库可包括关系型数据库和非关系型数据库中至少一种。非关系型数据库可包括基于区块链的分布式数据库等，不限于此。本申请所提供的各实施例中所涉及的处理器可为通用处理器、中央处理器、图形处理器、数字信号处理器、可编程逻辑器、基于量子计算的数据处理逻辑器等，不限于此。Those skilled in the art can understand that all or part of the processes in the above-mentioned embodiment methods can be implemented by instructing the relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and the computer program can include the processes of the embodiments of the above-mentioned methods when executed. Among them, any reference to the memory, database or other medium used in the embodiments provided in the present application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. As an illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The database involved in each embodiment provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include distributed databases based on blockchains, etc., but are not limited to this. The processor involved in each embodiment provided in this application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic device, a data processing logic device based on quantum computing, etc., but are not limited to this.

以上实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above embodiments may be arbitrarily combined. To make the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

以上实施例仅表达了本申请的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对本申请专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请的保护范围应以所附权利要求为准。The above embodiments only express several implementation methods of the present application, and the descriptions thereof are relatively specific and detailed, but they cannot be understood as limiting the scope of the present application. It should be pointed out that, for a person of ordinary skill in the art, several modifications and improvements can be made without departing from the concept of the present application, and these all belong to the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the attached claims.

Claims

1. A method for domain-adaptive video classification, the method comprising:

acquiring video input samples of a source domain and at least one target domain, and classifying based on the characteristics of the video input samples of the source domain to obtain an initial video classification model;

constructing at least two private networks, and acquiring video input samples of the source domain and at least one target domain, wherein the private networks are used for respectively acquiring semantic independent information characteristics of the video input samples of each domain;

Acquiring feature data of video input samples of the source domain and at least one target domain extracted by the initial video classification model, acquiring feature distribution distances of extracted features of the initial video classification model and each private network, performing maximum processing on the feature distribution distances, and calculating maximum mean value difference to obtain public semantic information features;

and iteratively training the initial video classification model and each private network, and obtaining a target video classification model universal to the field when the iteration stop condition is met, and carrying out video classification according to the target video classification model.

2. The method of claim 1, wherein said constructing at least two private networks comprises:

obtaining background data of the video input sample, wherein the background data is used as a supervision signal for reconstruction training of the private network;

performing reconstruction training of the video input samples in each field through the private network to obtain reconstruction background data;

obtaining reconstruction loss between the background data and the reconstructed background data;

and minimizing the reconstruction loss to obtain the semantic independent information features.

3. The method of claim 2, wherein the private network includes a video feature extractor and a reconstruction network, wherein the reconstructing training of each of the domain video input samples through the private network results in reconstructed background data, comprising:

Obtaining background characteristics of each field video input sample based on the video characteristic extractor;

reconstructing the background features based on the reconstruction network to obtain the reconstructed background data;

the obtaining the reconstruction loss between the background data and the reconstructed background data comprises:

acquiring the distance between the background data and the reconstructed background data;

the reconstruction loss is calculated by a loss function based on a distance measure and the distance.

4. The method of claim 1, wherein the initial video classification model comprises a feature extractor, a domain arbiter, and a classifier, wherein the classifying based on the features of the source domain video input samples results in an initial video classification model comprising;

acquiring the characteristics of the source domain video input sample through the characteristic extractor;

classifying the features by the classifier;

obtaining classification loss, wherein the classification loss is used for iteratively training the initial video classification model and each private network;

the obtaining the feature data of the video input samples of the source domain and at least one target domain extracted by the initial video classification model includes:

Obtaining initial feature data of a video input sample of the source domain and at least one target domain by the feature extractor;

performing countermeasure training on the initial feature data according to the domain discriminator to obtain target feature data after the countermeasure training;

obtaining video classification of the target feature data according to the classifier;

and obtaining an countermeasure training penalty of the domain discriminator, wherein the countermeasure training penalty is used for iteratively training the initial video classification model and each private network.

5. The method according to claim 1, wherein the method further comprises:

constructing a feature source classifier, and determining a source identifier of an input feature according to the feature source classifier, wherein the source identifier is used for determining that the source of the input feature is an initial video classification model or a private network;

and obtaining source classification loss of the characteristic source classifier, wherein the source classification loss is used for iteratively training the initial video classification model and each private network.

6. The method of claim 1, wherein the iteratively training the initial video classification model and each of the private networks, when an iteration stop condition is satisfied, results in a domain-generic target video classification model, comprising:

Acquiring training loss based on a loss function, and acquiring the iteration stop condition according to the training loss;

calculating the gradient of the loss function according to the training loss back propagation, and updating the loss function;

and under the condition that the training loss is stable, the iteration stop condition is met, and the universal target video classification model in the field is obtained.

7. A domain adaptive device, the device comprising:

the video classification module is used for acquiring video input samples of a source domain and at least one target domain, and classifying the video input samples based on the characteristics of the video input samples of the source domain to obtain an initial video classification model;

the private network module is used for constructing at least two private networks, and the private networks are used for respectively acquiring semantic independent information characteristics of the video input samples in each field;

the average value difference module is used for acquiring the characteristic data of the video input samples of the source domain and at least one target domain extracted by the initial video classification model, acquiring the maximum average value difference of the characteristic distribution distances extracted by the video classification model and each private network, and carrying out maximum processing on the maximum average value difference to obtain public semantic information characteristics;

And the iteration training module is used for iteratively training the initial video classification model and each private network, obtaining a target video classification model universal to the field when the iteration stopping condition is met, and carrying out video classification according to the target video classification model.

8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.