
CN117336525A - Video processing method, device, computer equipment and storage medium - Google Patents

Video processing method, device, computer equipment and storage medium

Info

Publication number
CN117336525A
CN117336525A (application CN202311257597.3A)
Authority
CN
China
Prior art keywords
video
frames
processed
frame
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311257597.3A
Other languages
Chinese (zh)
Inventor
陈昌儒
李标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311257597.3A
Publication of CN117336525A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a video processing method, device, computer equipment, and storage medium. The video processing method includes: performing key frame identification on a video to be processed to obtain the key frames in the video; dividing the video to be processed into multiple video segments; extracting, based on the key frames, a subset of video frames from each of the multiple video segments; generating, based on target prompt information and the extracted video frames, first input information to be input into a large language model, where the target prompt information is used to prompt the large language model to perform video understanding; and inputting the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed. This method can improve the accuracy of understanding long videos.

Description

Video processing method, device, computer equipment and storage medium

Technical field

The present application relates to the field of video processing technology, and more specifically, to a video processing method, device, computer equipment, and storage medium.

Background

With the rapid development of the Internet and digital technology, large volumes of video data are generated in a variety of application scenarios, such as security monitoring, intelligent transportation, and healthcare. Video understanding technology is an essential foundation for a series of higher-level multimodal artificial intelligence tasks such as question answering, dialogue, and recommendation. However, related technologies achieve poor accuracy when performing video understanding on long videos.

Summary of the invention

This application proposes a video processing method, device, computer equipment, and storage medium that can improve the accuracy of understanding long videos.

In a first aspect, embodiments of the present application provide a video processing method. The method includes: performing key frame identification on a video to be processed to obtain the key frames in the video to be processed; dividing the video to be processed into multiple video segments; extracting, based on the key frames, a subset of video frames from each of the multiple video segments; generating, based on target prompt information and the extracted video frames, first input information to be input into a large language model, where the target prompt information is used to prompt the large language model to perform video understanding; and inputting the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In a second aspect, embodiments of the present application provide a video processing device. The device includes a key frame identification module, a video dividing module, a video frame extraction module, an information generation module, and a video understanding module. The key frame identification module is configured to perform key frame identification on a video to be processed to obtain the key frames in the video; the video dividing module is configured to divide the video into multiple video segments; the video frame extraction module is configured to extract, based on the key frames, a subset of video frames from each of the multiple video segments; the information generation module is configured to generate, based on target prompt information and the extracted video frames, first input information to be input into a large language model, where the target prompt information is used to prompt the large language model to perform video understanding; and the video understanding module is configured to input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In a third aspect, embodiments of the present application provide a computer device, including: one or more processors; a memory; and one or more application programs, where the one or more application programs are stored in the memory, are configured to be executed by the one or more processors, and are configured to execute the video processing method provided in the first aspect above.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium. Program code is stored in the computer-readable storage medium, and the program code can be called by a processor to execute the video processing method provided in the first aspect above.

In the solution provided by this application, key frame identification is performed on a video to be processed to obtain the key frames in the video, the video is divided into multiple video segments, and, based on the key frames, a subset of video frames is extracted from each of the multiple video segments. Based on the extracted video frames and target prompt information used to prompt a large language model to perform video understanding, first input information for the large language model is generated, and the first input information is then input into the large language model to obtain a video understanding result corresponding to the video. Because the input to the large language model is determined from a subset of video frames extracted according to the identified key frames and the divided video segments, a long video does not need to be split into segments and fed into the large language model multiple times. This allows the large language model to better understand long videos and improves the accuracy of long-video understanding.
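To make the flow concrete, the solution described above can be sketched as a minimal pure-Python pipeline. Everything here is illustrative rather than the patented implementation: the function names, the mean-absolute-difference key frame test, the segment count, and the prompt dictionary are all assumptions, and a real system would pass frame features, not frame indices, to the model.

```python
def detect_keyframes(frames, threshold=30.0):
    """Hypothetical stand-in for key frame identification: mark a frame as key
    when its mean absolute pixel difference from the previous frame exceeds
    the threshold. Frames are flat lists of pixel values here."""
    keyframes = []
    for i in range(1, len(frames)):
        diff = sum(abs(a - b) for a, b in zip(frames[i], frames[i - 1])) / len(frames[i])
        if diff > threshold:
            keyframes.append(i)
    return keyframes

def split_into_segments(frames, num_segments):
    """Divide the video into roughly equal-length segments of frame indices."""
    size = max(1, len(frames) // num_segments)
    return [list(range(start, min(start + size, len(frames))))
            for start in range(0, len(frames), size)]

def extract_partial_frames(segments, keyframes):
    """From each segment, keep only the frames identified as key frames;
    fall back to the segment's first frame so every segment is represented."""
    selected = []
    for seg in segments:
        hits = [i for i in seg if i in keyframes]
        selected.extend(hits if hits else seg[:1])
    return selected

def build_first_input(prompt, frame_indices):
    """Combine the target prompt with the extracted frames into a single
    input for the large language model."""
    return {"prompt": prompt, "frames": frame_indices}
```

A single call chain, `build_first_input(prompt, extract_partial_frames(split_into_segments(video, n), detect_keyframes(video)))`, then produces one input for the model instead of one input per segment.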

Brief description of the drawings

In order to explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort.

Figure 1 shows a schematic flowchart of a video processing method according to an embodiment of the present application.

Figure 2 shows a schematic flowchart of a video processing method according to another embodiment of the present application.

Figure 3 shows a schematic flowchart of a video processing method according to yet another embodiment of the present application.

Figure 4 shows a schematic flowchart of a video processing method according to still another embodiment of the present application.

Figure 5 shows a schematic flowchart of a video processing method according to a further embodiment of the present application.

Figure 6 shows a schematic flowchart of a video processing method according to a still further embodiment of the present application.

Figure 7 shows a block diagram of a video processing device according to an embodiment of the present application.

Figure 8 is a block diagram of a computer device for executing the video processing method according to an embodiment of the present application.

Figure 9 is a storage unit for saving or carrying program code that implements the video processing method according to an embodiment of the present application.

Detailed description of embodiments

In order to enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments of this application.

Video understanding is a technology that spans multiple fields, including computer vision, natural language processing, and speech recognition. In video understanding, video data usually needs to be processed to identify and understand the content in the video, for example to recognize and classify the objects, scenes, and events it contains. Early video understanding relied on video data processing techniques, which convert the original video into an analyzable format, for example by extracting the image frames from the video and preprocessing them for subsequent tasks such as image recognition and object detection. In video understanding, such techniques help a computer accurately recognize and understand the various elements in a video, such as people, scenes, and actions.

With the rise of artificial intelligence (AI) technology in recent years, video understanding has also been implemented through deep learning: neural network models can be trained to extract useful features from large amounts of data, helping a computer accurately recognize and understand the content of a video. Moreover, with the development of artificial intelligence, video understanding can be performed with a large language model (LLM). A large language model is a complex artificial neural network trained on large amounts of data and computing resources that can learn rich language patterns and knowledge and thus produce accurate responses to natural-language input.

In the related art, when a large language model is used for video understanding, features can be extracted from the video frames of the original video, the extracted image features can be mapped into the feature space of the large language model, and the result can then be input into the model, which outputs a video understanding result for the input content. However, when the original video is long, it contains a large number of video frames, while the large language model limits how much content can be input at a time. The original video is therefore split into multiple segments; features are extracted from the frames of each segment, mapped into the feature space of the large language model, and input into the model, yielding a video understanding result for each segment. Because the video is divided into multiple segments whose input information is determined separately before being fed to the large language model, the model loses the holistic information of the original video during video understanding. Its perception and understanding of long videos is therefore limited, and its understanding of the video content is coarse-grained, so the content of the video cannot be understood in a fine-grained way.

In response to the above problems, the inventors propose the video processing method, device, computer equipment, and storage medium provided by the embodiments of this application. When a large language model is used to understand a video to be processed, a subset of video frames is extracted according to the identified key frames and the divided video segments, and the input information for the large language model is then determined from those extracted frames. Therefore, when understanding a long video, there is no need to split the video into segments and feed them into the large language model multiple times, which allows the model to better understand long videos and improves the accuracy of long-video understanding. The specific video processing method is described in detail in the following embodiments.

The video processing method provided by the embodiments of this application is introduced in detail below with reference to the accompanying drawings.

Please refer to Figure 1, which shows a schematic flowchart of a video processing method provided by an embodiment of this application. In a specific embodiment, the video processing method is applied to the video processing device 700 shown in Figure 7 and to the computer device 100 (Figure 8) configured with the video processing device 700. The following takes a computer device as an example to describe the specific process of this embodiment. It will be understood that the computer device in this embodiment may be a server, a smartphone, a tablet computer, a smart watch, a notebook computer, and so on, which is not limited here. The process shown in Figure 1 is described in detail below. The video processing method may specifically include the following steps:

Step S110: Perform key frame identification on the video to be processed to obtain the key frames in the video to be processed.

In the embodiments of this application, when a computer device performs video understanding on a video to be processed, a long video would otherwise need to be divided into multiple segments, each converted into input information for the large language model and fed to the model separately to obtain per-segment video understanding results. This causes the large language model to lose the holistic information of the original video, making the video understanding insufficiently accurate. The computer device can therefore identify key frames in the video that requires video understanding, so that the video can be compressed according to the identified key frames. This reduces the number of video frames in the video and removes the need to split the video and feed it into the large language model multiple times. Here, a key frame is a video frame whose degree of change in video content exceeds a target threshold. Understandably, a frame with a large degree of content change is usually the frame of a key action in the motion of a character or object in the video, or the frame at which the scene content switches, so a key frame is well representative of its neighboring frames.

In some implementations, the computer device can identify key frames in the video to be processed through an inter-frame difference algorithm. The inter-frame difference algorithm obtains the outline of a moving object by computing the difference between two consecutive frames of a video image sequence. Specifically, the algorithm first obtains two consecutive frames from the video sequence and then derives the outline of the moving object from the difference between them: if there is an obvious difference between the two frames, that difference is recorded, and it can be obtained as the absolute value of the difference between the pixel values at corresponding positions in the two images.

In the above approach, when identifying key frames with the inter-frame difference algorithm, the degree of change between every pair of adjacent video frames can be computed. If the degree of change of the later frame relative to the earlier frame is greater than the target threshold, the later frame can be determined to be a key frame; if it is not greater than the target threshold, the later frame is not a key frame.
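A minimal sketch of this inter-frame difference test follows. The text only specifies absolute pixel-value differences, so the concrete "degree of change" metric used here (the fraction of pixels whose absolute difference exceeds a per-pixel threshold) and both threshold values are illustrative assumptions:

```python
import numpy as np

def change_degree(prev_frame, next_frame, pixel_threshold=25):
    """Inter-frame difference: per-pixel absolute difference between two
    grayscale frames, reduced to the fraction of pixels that changed
    noticeably. The exact reduction is an assumption."""
    diff = np.abs(next_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float((diff > pixel_threshold).mean())

def find_keyframes(frames, target_threshold=0.5):
    """Mark a frame as a key frame when its degree of change relative to
    the previous frame exceeds the target threshold."""
    return [i for i in range(1, len(frames))
            if change_degree(frames[i - 1], frames[i]) > target_threshold]
```

Casting to `int16` before subtracting avoids the wrap-around that unsigned 8-bit subtraction would produce.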

In some implementations, key frames can also be identified with a pre-trained key frame recognition model, yielding the key frames in the video to be processed. Each video frame of the video can be input into the key frame recognition model to obtain the model's recognition result for that frame, and the key frames in the video can then be determined from the per-frame recognition results. The specific type of the key frame recognition model is not limited; for example, it may be a deep neural network (DNN), a long short-term memory network (LSTM), a Transformer network, and so on.

In one possible implementation, the recognition result of the key frame recognition model for each video frame may indicate whether the frame is a key frame. For example, the recognition result may be 1 or 0: a result of 1 indicates that the video frame is a key frame, and a result of 0 indicates that it is not. In this way, it can be determined from the model's per-frame recognition results which frames in the video to be processed are key frames.

In one possible implementation, the recognition result for each video frame may include a key frame probability, which characterizes the probability that the frame is a key frame; the greater the probability, the more likely the frame is a key frame. In this approach, the key frame probability of each frame can be compared with a key frame threshold: if the probability is greater than the threshold, the frame is determined to be a key frame; otherwise, it is not. In this way, the key frames in the video to be processed can be determined by further judging the recognition result of each frame.
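The probability-threshold comparison above reduces to a few lines; the default threshold value of 0.5 is an illustrative assumption, not one fixed by the text:

```python
def select_keyframes(probabilities, keyframe_threshold=0.5):
    """Compare each frame's key frame probability with the threshold;
    frames whose probability exceeds it are treated as key frames and
    their indices are returned."""
    return [i for i, p in enumerate(probabilities) if p > keyframe_threshold]
```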

In one possible implementation, the key frame recognition model above can be trained as follows. A sample video is obtained, in which each video frame is annotated with a key frame label indicating whether it is a key frame. When training the key frame recognition model on the sample video, each sample video frame can be input into an initial recognition model to obtain the initial model's recognition result for that frame, and a loss value for the initial recognition model can then be determined from the per-frame recognition results and the annotated key frame labels; specifically, the loss value can be determined from the difference between each frame's recognition result and its annotated key frame label. After the loss value is determined, the initial recognition model can be iteratively trained according to the loss value to obtain the final key frame recognition model.

可选地,在根据确定出的损失值对初始识别模型进行迭代训练时,可以根据计算得到的损失值,调整初始识别模型的模型参数;重复将样本视频的样本视频帧输入至初始识别模型,得到初始识别模型对输入的样本视频帧输出的识别结果,基于样本视频帧的识别结果以及标注的关键帧标签,确定初始识别模型对应的损失值,以及根据损失值,调整初始识别模型的模型参数,直至满足训练结束条件,得到训练后的关键帧识别模型。迭代训练的训练结束条件可以包括:迭代训练的次数达到目标次数;或者初始识别模型的总损失值满足设定条件。Optionally, when iteratively training the initial recognition model according to the determined loss value, the model parameters of the initial recognition model can be adjusted according to the calculated loss value; the following steps are then repeated: inputting sample video frames of the sample video into the initial recognition model, obtaining the recognition results output by the initial recognition model for the input sample video frames, determining the loss value corresponding to the initial recognition model based on the recognition results and the annotated key frame labels, and adjusting the model parameters of the initial recognition model according to the loss value, until a training end condition is met, yielding the trained key frame recognition model. The training end condition of the iterative training may include: the number of training iterations reaching a target number; or the total loss value of the initial recognition model satisfying a set condition.
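The iterate-until-end-condition loop above can be sketched with a deliberately minimal stand-in model, where a single scalar feature per frame replaces the real network and binary cross-entropy supplies the loss gradient; the architecture, feature, and learning rate are all illustrative assumptions:

```python
import math

def train_step(w, b, x, y, lr=0.5):
    # One update of a minimal logistic "recognition model": predict the
    # key-frame probability for feature x, then adjust the parameters by
    # the binary-cross-entropy gradient against the annotated label y.
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted key-frame probability
    grad = p - y                              # dLoss/dlogit for BCE
    return w - lr * grad * x, b - lr * grad

def train(samples, epochs=200):
    # Iterate until the target iteration count is reached -- one of the
    # two stated end conditions (the other: total loss meets a set value).
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            w, b = train_step(w, b, x, y)
    return w, b
```

A real implementation would use a deep network over frame pixels, but the loop structure (predict, compare with the label, update, stop on an end condition) is the same.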

当然,对待处理视频进行关键帧识别的具体方式在本申请实施例中可以不做限定。Of course, the specific method for identifying key frames in the video to be processed is not limited in the embodiments of this application.

在一些实施方式中,执行本申请实施例提供的视频处理方法的执行主体为服务器时,该待处理视频可以是由电子设备上传的视频,或者根据电子设备发送的选择指令,从服务器存储的视频中选择的目标视频;执行本申请实施例提供的视频处理方法的执行主体为电子设备时,例如智能手机、平板电脑、笔记本电脑等,该待处理视频可以是电子设备本地存储的视频,也可以是从服务器下载的视频。In some implementations, when the execution subject of the video processing method provided by the embodiments of the present application is a server, the video to be processed may be a video uploaded by an electronic device, or a target video selected from the videos stored on the server according to a selection instruction sent by the electronic device; when the execution subject is an electronic device, such as a smartphone, a tablet computer, or a laptop, the video to be processed may be a video stored locally on the electronic device or a video downloaded from a server.

步骤S120:将所述待处理视频划分为多个视频片段。Step S120: Divide the video to be processed into multiple video segments.

在本申请实施例中,计算机设备针对待处理视频进行视频理解时,还可以将待处理视频划分为多个视频片段,以便根据划分的视频片段,以及以上识别到的关键帧,对待处理视频进行压缩,从而减少待处理视频中的视频帧。In this embodiment of the present application, when performing video understanding on the video to be processed, the computer device may also divide the video to be processed into multiple video segments, so that the video to be processed can be compressed based on the divided video segments and the key frames identified above, thereby reducing the number of video frames in the video to be processed.

在一些实施方式中,计算机设备对待处理视频划分视频片段时,可以根据视频片段中的不同场景、不同人物、不同物体等,对待处理视频进行划分,例如,在待处理视频中的场景发生变化时,则确定出下一个视频片段的起点,在待处理视频的场景再次发生变化时,确定出该下一个视频片段的终点,从而划分出一个视频片段。In some implementations, when dividing the video to be processed into video segments, the computer device may divide it according to different scenes, characters, objects, and so on. For example, when the scene in the video to be processed changes, the starting point of the next video segment is determined; when the scene changes again, the end point of that segment is determined, thereby delimiting one video segment.

在一些实施方式中,计算机设备也可以按照目标时间长度,对待处理视频划分为多个视频片段,例如,可以将待处理视频划分为多个时间长度为10秒的视频片段。In some implementations, the computer device can also divide the video to be processed into multiple video segments according to the target time length. For example, the video to be processed can be divided into multiple video segments with a time length of 10 seconds.
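The fixed-length division just described can be sketched as follows; the 10-second default mirrors the example above, and returning (start, end) second offsets is an illustrative choice:

```python
def split_by_duration(total_seconds, segment_seconds=10):
    # Cut [0, total) into consecutive segments of the target length;
    # the final segment simply keeps whatever time remains.
    segments, start = [], 0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        segments.append((start, end))
        start = end
    return segments
```

For a 25-second video this yields three segments, the last only 5 seconds long.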

当然,对待处理视频进行视频片段的划分的具体方式在本申请实施例中可以不做限定。Of course, the specific method of dividing the video to be processed into video segments is not limited in the embodiments of this application.

步骤S130:基于所述关键帧,从所述多个视频片段的每个视频片段中提取部分视频帧。Step S130: Based on the key frame, extract a partial video frame from each video segment of the plurality of video segments.

在本申请实施例中,在识别得到待处理视频中的关键帧,并将待处理视频划分为多个视频片段之后,则可以根据划分的视频片段以及以上识别到的关键帧,对待处理视频进行压缩,从而减少待处理视频中的视频帧。其中,可以基于关键帧,从划分得到的多个视频片段的每个视频片段中提取部分视频帧,以减少每个视频片段中的视频帧的数量,从而减少用于确定输入大语言模型的输入信息的视频帧的数量。可以理解地,由于关键帧在其相邻的多个视频帧中具有代表性,因此可以以此为依据,在每个视频片段中提取部分视频帧,从而使视频帧的数量减少的同时,能够保证后续视频理解时能够有效提取待处理视频中的信息。In this embodiment of the present application, after the key frames in the video to be processed are identified and the video is divided into multiple video segments, the video to be processed can be compressed based on the divided segments and the identified key frames, thereby reducing its number of video frames. Specifically, partial video frames can be extracted, based on the key frames, from each of the multiple video segments, so as to reduce the number of frames in each segment and thus the number of frames used to determine the input information for the large language model. Understandably, since a key frame is representative of its neighboring video frames, partial video frames can be extracted from each segment on this basis, so that the number of video frames is reduced while the information in the video to be processed can still be effectively extracted during subsequent video understanding.

在一些实施方式中,可以针对以上多个视频片段中的每个视频片段,确定视频片段中是否包括关键帧;若视频片段中包括关键帧,则可以从该视频片段中获取该关键帧以及该关键帧邻近的第一目标数量的视频帧,作为该视频片段中的部分视频帧;若视频片段中不包括关键帧,则可以根据指定提取规则,从该视频片段中提取第二目标数量的视频帧。其中,第一目标数量与第二目标数量的具体数值可以不做限定,例如,可以为5帧,也可以为10帧,20帧等。In some implementations, for each of the above multiple video segments, it may be determined whether the segment includes a key frame. If it does, the key frame and a first target number of video frames adjacent to it can be obtained from the segment as the partial video frames of that segment; if it does not, a second target number of video frames can be extracted from the segment according to a specified extraction rule. The specific values of the first target number and the second target number are not limited; for example, they may be 5 frames, 10 frames, 20 frames, and so on.
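The two branches above can be sketched as follows. The centred window around the key frame and the evenly spaced fallback are illustrative assumptions: the embodiment only requires "adjacent frames" and "a specified extraction rule" without fixing either shape:

```python
def extract_from_segment(frames, keyframe_idx=None, first_n=10, second_n=5):
    # Segment contains a key frame: keep the key frame plus up to first_n
    # of its neighbours (a window centred on the key frame).
    if keyframe_idx is not None:
        half = first_n // 2
        lo = max(0, keyframe_idx - half)
        hi = min(len(frames), keyframe_idx + half + 1)
        return frames[lo:hi]
    # No key frame: fall back to a specified rule -- here, evenly
    # spaced frames -- keeping second_n of them.
    step = max(1, len(frames) // second_n)
    return frames[::step][:second_n]
```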

在一种可能的实施方式中,第二目标数量可以小于第一目标数量,也就是说,对于包括关键帧的视频片段,在提取部分视频帧时,提取的部分视频帧的数量可以多于不包括关键帧的视频片段中提取的部分视频帧的数量。可以理解地,由于不包括关键帧的视频片段是画面变化不大的片段,因此其视频帧包括的信息是较为接近的,故可以提取更少的视频帧,以进一步减少后续视频理解时所使用的视频帧的数量。In a possible implementation, the second target number may be smaller than the first target number; that is, when extracting partial video frames, more frames may be extracted from a video segment that includes a key frame than from one that does not. Understandably, since a segment without key frames is one in which the picture changes little, the information contained in its frames is similar, so fewer frames can be extracted from it to further reduce the number of video frames used in subsequent video understanding.

在一种可能的实施方式中,在任一视频片段中不包括关键帧的情况下,根据指定提取规则,从该视频片段中提取第二目标数量的视频帧时,可以从该视频片段中随机提取第二目标数量的视频帧。In a possible implementation, when a video segment does not include a key frame and a second target number of video frames is extracted from it according to the specified extraction rule, the second target number of video frames may be extracted from the segment at random.

在一种可能的实施方式中,在任一视频片段中不包括关键帧的情况下,根据指定提取规则,从该视频片段中提取第二目标数量的视频帧时,也可以提取满足目标分布条件的第二目标数量的视频帧,作为该视频片段中的部分视频帧。其中,目标分布条件可以是均匀分布于视频片段中,从而保证后续视频理解时对于该视频片段能够有效提取信息。In a possible implementation, when a video segment does not include a key frame, the second target number of video frames extracted according to the specified extraction rule may instead be frames satisfying a target distribution condition, taken as the partial video frames of that segment. The target distribution condition may be that the frames are uniformly distributed across the segment, thereby ensuring that information can be effectively extracted from the segment during subsequent video understanding.

当然,在本申请实施例中,基于关键帧,从多个视频片段的每个视频片段中提取部分视频帧的具体方式可以不做限定,仅需保证基于关键帧从待处理视频中保留必要的视频帧,以便有效提取待处理视频中的信息即可。Of course, in the embodiments of the present application, the specific manner of extracting partial video frames from each of the multiple video segments based on the key frames is not limited; it is only necessary to ensure that, based on the key frames, the video frames necessary for effectively extracting the information in the video to be processed are retained.

步骤S140:基于目标提示信息以及所述部分视频帧,生成用于输入大语言模型的第一输入信息,所述目标提示信息用于提示所述大语言模型进行视频理解。Step S140: Generate first input information for inputting a large language model based on target prompt information and the partial video frames, where the target prompt information is used to prompt the large language model to perform video understanding.

在本申请实施例中,在针对以上每个视频片段提取部分视频帧后,则可以根据提取到的所有视频帧,确定用于输入大语言模型的输入信息。其中,可以根据目标提示信息,以及以上提取到的所有视频帧,生成用于输入大语言模型的第一输入信息。其中,目标提示信息可以是用于提示大语言模型进行视频理解的提示词,即prompt,目标提示信息可以提供用户的意图,大语言模型后续对目标提示信息进行理解后,能够知晓用户意图,并根据用户意图进行相应的处理。可以理解地,由于是针对待处理视频,通过识别的关键帧对划分的每个视频片段提取部分视频帧后,再来生成用于输入大语言模型的输入信息,因此能够减少需要用来生成输入信息的视频帧的数量,且能保证这些剩下的视频帧能够提供视频理解的必要信息,从而对于待处理视频,仅需一次生成用于输入大语言模型的输入信息,进而输入信息中会保留待处理视频的整体性的信息,使大语言模型输出的视频理解结果更加准确。In this embodiment of the present application, after partial video frames are extracted from each of the above video segments, the input information for the large language model can be determined from all the extracted frames. Specifically, the first input information for input into the large language model can be generated based on the target prompt information and all the extracted video frames. The target prompt information may be a prompt used to instruct the large language model to perform video understanding; it conveys the user's intention, so that after interpreting the target prompt information, the large language model can determine the user's intention and process the input accordingly. Understandably, since the input information is generated only after partial video frames have been extracted, via the identified key frames, from each of the divided segments of the video to be processed, the number of frames needed to generate the input information is reduced while the remaining frames still provide the information necessary for video understanding. Therefore, for the video to be processed, the input information for the large language model only needs to be generated once, and this input information retains the holistic information of the video to be processed, making the video understanding result output by the large language model more accurate.

在一些实施方式中,可以针对以上得到的部分视频帧,进行特征提取,然后将提取到的特征转换到大语言模型的特征空间,即将其转换为通常输入大语言模型的文本tokens,从而可以与以上目标提示信息(也是文本tokens)拼接后,得到用于输入大语言模型的第一输入信息。In some implementations, feature extraction can be performed on the partial video frames obtained above, and the extracted features can then be mapped into the feature space of the large language model, i.e., converted into the text tokens normally input to the model. These can then be concatenated with the above target prompt information (which is also text tokens) to obtain the first input information for input into the large language model.
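The project-then-splice step can be sketched as follows; the `project` callable stands in for the learned vision-to-token mapping (an assumption, since the embodiment does not specify the projector), and token ordering (prompt first, visual tokens after) is likewise illustrative:

```python
def build_llm_input(prompt_tokens, frame_features, project):
    # Map each extracted frame feature into the model's token space via
    # the (stand-in) projection, then splice prompt and visual tokens
    # into a single input sequence.
    visual_tokens = [project(f) for f in frame_features]
    return prompt_tokens + visual_tokens
```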

步骤S150:将所述第一输入信息输入至所述大语言模型,得到所述待处理视频对应的视频理解结果。Step S150: Input the first input information to the large language model to obtain a video understanding result corresponding to the video to be processed.

在本申请实施例中,在获得以上第一输入信息后,则可以将第一输入信息输入至大语言模型中,从而得到针对待处理视频对应的视频理解结果。其中,视频理解结果可以包括对视频的描述、针对以上目标提示信息进行视频理解后的回答等,视频理解结果中的具体内容可以不做限定。以上大语言模型可以是预先训练的能够进行视频理解的模型,该大语言模型可以是基于AIGC的生成式大语言模型,且该大语言模型可以根据以上输入信息生成视频理解结果。大语言模型根据以上第一输入信息,对第一输入信息中的以上目标提示信息进行理解,获得用户意图后,可以针对以上第一输入信息中根据以上部分视频帧提取的信息,输出视频理解结果。In the embodiment of the present application, after the above first input information is obtained, it can be input into the large language model to obtain the video understanding result corresponding to the video to be processed. The video understanding result may include a description of the video, an answer produced by understanding the video in light of the target prompt information, and so on; its specific content is not limited. The above large language model may be a pre-trained model capable of video understanding; it may be an AIGC-based generative large language model, and it can generate the video understanding result according to the above input information. Based on the first input information, the large language model interprets the target prompt information contained therein to obtain the user's intention, and then outputs the video understanding result based on the information extracted from the above partial video frames in the first input information.

在一些实施方式中,考虑到大语言模型的能力较为宽泛,且视频理解的场景可能众多,因此还可以针对大语言模型,增加低秩自适应(Low-Rank Adaptation,LoRA)模型,并采用相应场景下的样本视频,对LoRA模型进行训练,使大语言模型在这些场景下能够更好地进行视频理解。其中,LoRA模型是通过微调大语言模型中的UNet模块中的交叉注意力层得到的;LoRA模型应用于大语言模型中的交叉注意力层,通过在冻结大语言模型的参数的情况下,采用相应场景下的样本视频进行训练,从而可以得到能够在该场景下更好地进行视频理解的LoRA模型,使用时,将该LoRA模型的参数注入(inject)大语言模型,从而使大语言模型输出在该场景下的视频理解结果。In some implementations, considering that the capabilities of a large language model are fairly broad and that video understanding may involve many scenarios, a Low-Rank Adaptation (LoRA) model can also be added to the large language model and trained with sample videos of the corresponding scenario, so that the large language model can better perform video understanding in those scenarios. The LoRA model is obtained by fine-tuning the cross-attention layers of the UNet module in the large language model; it is applied to those cross-attention layers and trained with sample videos of the corresponding scenario while the parameters of the large language model are frozen, thereby obtaining a LoRA model that enables better video understanding in that scenario. When used, the parameters of the LoRA model are injected into the large language model, so that it outputs video understanding results for that scenario.
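The core of the LoRA injection is the low-rank weight update y = x(W + AB), where the base weight W stays frozen and only the small factors A and B are trained. A minimal numeric sketch with toy 2×2 matrices (the shapes and values are purely illustrative, not the patent's actual attention weights):

```python
def matmul(X, Y):
    # Naive matrix product, adequate for this small illustration.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B):
    # Output of a layer with the LoRA update injected: x @ (W + A @ B).
    # W is the frozen base weight; A (d x r) and B (r x d) are the
    # low-rank factors trained on the scenario's sample videos.
    base = matmul(x, W)
    update = matmul(matmul(x, A), B)
    return [[p + q for p, q in zip(rb, ru)] for rb, ru in zip(base, update)]
```

In practice a framework such as a PEFT library would register these factors on the cross-attention projections; the arithmetic injected into the forward pass is the same.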

在一些实施方式中,执行本申请实施例提供的视频处理方法的执行主体为电子设备时,则以上步骤S110至步骤S150中,可以部分步骤在电子设备本地执行,部分由电子设备提交相应请求至服务器,由服务器执行后返回结果。例如,对待处理视频进行关键帧识别,可以由电子设备向相应服务器提交关键帧识别请求,并将待处理视频上传至该服务器,以便服务器对待处理视频进行关键帧识别,并将识别结果反馈至电子设备;电子设备本地将待处理视频划分为多个视频片段,然后基于关键帧,从多个视频片段的每个视频片段中提取部分视频帧;再将提取的部分视频帧以及目标提示信息上传至服务器,服务器基于目标提示信息以及部分视频帧,生成用于输入大语言模型的第一输入信息后,将第一输入信息输入至所述大语言模型,得到待处理视频对应的视频理解结果,并将视频理解结果返回至电子设备。In some implementations, when the execution subject of the video processing method provided by the embodiments of the present application is an electronic device, some of the above steps S110 to S150 may be executed locally on the electronic device, while for others the electronic device submits a corresponding request to a server, which executes them and returns the result. For example, for key frame identification, the electronic device may submit a key frame identification request to the corresponding server and upload the video to be processed, so that the server performs key frame identification on the video and feeds the result back to the electronic device; the electronic device then locally divides the video into multiple video segments and, based on the key frames, extracts partial video frames from each segment; the extracted partial video frames and the target prompt information are then uploaded to the server, which generates the first input information for the large language model based on the target prompt information and the partial video frames, inputs the first input information into the large language model to obtain the video understanding result corresponding to the video to be processed, and returns the video understanding result to the electronic device.

本申请实施例提供的视频处理方法,在通过大语言模型对待处理视频进行视频理解时,根据识别的关键帧以及划分的视频片段提取部分视频帧之后,再根据提取的部分视频帧确定用于输入大语言模型的输入信息,因此在对长时视频进行理解时,无需对视频分段后多次输入大语言模型,从而能够使大语言模型更好地对长时视频进行视频理解,提升对长时视频的理解的准确性。In the video processing method provided by the embodiments of the present application, when performing video understanding on the video to be processed through a large language model, partial video frames are first extracted based on the identified key frames and the divided video segments, and the input information for the large language model is then determined from those extracted frames. Therefore, when understanding a long video, there is no need to segment the video and feed it into the large language model multiple times, which enables the model to better understand long videos and improves the accuracy of that understanding.

请参阅图2,图2示出了本申请另一个实施例提供的视频处理方法的流程示意图。该视频处理方法应用于上述计算机设备,下面将针对图2所示的流程进行详细的阐述,所述视频处理方法具体可以包括以下步骤:Please refer to Figure 2. Figure 2 shows a schematic flowchart of a video processing method provided by another embodiment of the present application. This video processing method is applied to the above-mentioned computer equipment. The process shown in Figure 2 will be described in detail below. The video processing method may specifically include the following steps:

步骤S210:对待处理视频进行关键帧识别,得到所述待处理视频中的关键帧。Step S210: Perform key frame identification on the video to be processed to obtain key frames in the video to be processed.

在本申请实施例中,步骤S210可以参阅前述实施例的内容,在此不再赘述。In this embodiment of the present application, reference may be made to the content of the foregoing embodiments for step S210, which will not be described again here.

步骤S220:将所述待处理视频划分为多个视频片段。Step S220: Divide the video to be processed into multiple video segments.

在一些实施方式中,在将待处理视频划分为多个视频片段时,可以基于关键帧,将待处理视频划分为多个视频片段,其中,多个视频片段的每个视频片段中包括至少一帧所述关键帧。也就是说,在针对待处理视频划分视频片段时,是参考了识别得到的关键帧进行的,并且使划分的每个视频片段中都包括有至少一帧关键帧,使每个视频片段都包括了至少一个变化度大于目标阈值的视频帧。In some implementations, when dividing the video to be processed into multiple video segments, the division may be based on the key frames, with each of the multiple video segments including at least one key frame. That is to say, the video segments are delimited with reference to the identified key frames, and each divided segment includes at least one key frame, so that each segment contains at least one video frame whose degree of change is greater than the target threshold.

在一种可能的实施方式中,在基于关键帧,将待处理视频划分为多个视频片段时,可以基于以上关键帧,对待处理视频标注多个目标时间戳,其中,相邻的两个目标时间戳之间的视频构成一个视频片段,目标时间戳位于相邻的两个关键帧之间。也就是说,可以通过标注时间戳,对待处理视频进行分割,使相邻两个目标时间戳分割出一个视频片段,从而划分得到多个视频片段,后续针对待处理视频进一步处理时,可以根据标注的时间戳,从每相邻的两个时间戳构成的视频片段中提取部分视频帧。In a possible implementation, when dividing the video to be processed into multiple video segments based on the key frames, multiple target timestamps can be annotated on the video based on those key frames, where the video between two adjacent target timestamps constitutes one video segment and each target timestamp lies between two adjacent key frames. That is to say, the video to be processed can be partitioned by annotating timestamps, so that every pair of adjacent target timestamps delimits one video segment, yielding multiple segments; in subsequent processing, partial video frames can be extracted, according to the annotated timestamps, from the segment bounded by each pair of adjacent timestamps.
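The timestamp-to-segment mapping above can be sketched as follows; treating the video's start and end as implicit outer boundaries is an assumption the text leaves open:

```python
def segments_from_timestamps(timestamps, duration):
    # Every pair of adjacent target timestamps bounds one segment; the
    # video's start (0) and end (duration) act as outer boundaries.
    bounds = [0.0] + sorted(timestamps) + [duration]
    return list(zip(bounds[:-1], bounds[1:]))
```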

可选地,以上目标时间戳可以位于两个关键帧之间,且视频画面的变化度小于目标阈值的时间点;以上划分的每个视频片段的持续时长大于或等于指定时长,例如10秒,20秒,30秒,60秒等。Optionally, each of the above target timestamps may be located at a time point that lies between two key frames and at which the degree of change of the video picture is less than the target threshold; the duration of each video segment divided above is greater than or equal to a specified duration, such as 10 seconds, 20 seconds, 30 seconds, or 60 seconds.

可选地,以上识别到的关键帧中可以包括主关键帧以及次关键帧,主关键帧是相对前一帧视频帧的变化度大于第一阈值的视频帧,次关键帧是相对前一帧视频帧的变化度大于第二阈值的视频帧,且第一阈值大于第二阈值,也就是说,主关键帧相对前一帧视频帧的变化度与次关键帧相对前一帧视频帧的变化度都大于一定阈值,但是主关键帧相对前一帧的变化度更大。由于主关键帧相对前一帧的变化度,比次关键帧相对前一帧的变化度更大,因此主关键帧更具有代表性,故在基于以上关键帧,对待处理视频标注多个目标时间戳时,可以根据主关键帧,选取位于两个主关键帧之间,且视频画面的变化度小于目标阈值的目标时间点,并在目标时间点处标注目标时间戳。Optionally, the identified key frames may include primary key frames and secondary key frames. A primary key frame is a video frame whose degree of change relative to the previous frame is greater than a first threshold, and a secondary key frame is one whose degree of change relative to the previous frame is greater than a second threshold, where the first threshold is greater than the second. That is, both exceed a certain threshold of change relative to the previous frame, but a primary key frame changes more. Since a primary key frame changes more relative to the previous frame than a secondary key frame does, it is more representative. Therefore, when annotating multiple target timestamps on the video to be processed based on the above key frames, a target time point that lies between two primary key frames and at which the degree of change of the video picture is less than the target threshold can be selected according to the primary key frames, and a target timestamp is annotated at that time point.

步骤S230:基于所述关键帧,针对所述多个视频片段的每个视频片段进行采样,得到所述每个视频片段中的部分视频帧。Step S230: Based on the key frame, sample each video segment of the plurality of video segments to obtain partial video frames in each video segment.

在本申请实施例中,在识别得到待处理视频中的关键帧,并将待处理视频划分为多个视频片段之后,则可以基于以上关键帧,针对多个视频片段的每个视频片段进行采样,从而得到每个视频片段中的部分视频帧。也就是说,从每个视频片段中采样一定数量的视频帧,从而得到每个视频片段中的部分视频帧。In the embodiment of the present application, after the key frames in the video to be processed are identified and the video is divided into multiple video segments, each of those segments can be sampled based on the above key frames, thereby obtaining the partial video frames of each segment. That is, a certain number of video frames are sampled from each video segment to obtain the partial video frames of that segment.

在一些实施方式中,以上视频片段可以是基于关键帧划分出的,且每个视频片段都包括至少一个关键帧。在基于关键帧,针对多个视频片段的每个视频片段进行采样时,可以基于每个视频片段中的关键帧,对每个视频片段进行划分,得到每个视频片段对应的多个视频子片段;对每个视频片段对应的多个视频子片段中的每个视频子片段进行采样,得到每个视频片段中的部分视频帧,其中,关键帧所在视频子片段对应的采样率高于其他视频子片段对应的采样率,其他视频子片段为所述多个视频子片段中除关键帧所在视频子片段以外的视频子片段。In some implementations, the above video segments may have been divided based on key frames, with each segment including at least one key frame. When sampling each of the multiple video segments based on the key frames, each segment can be divided, based on its key frames, into multiple video sub-segments; each sub-segment is then sampled to obtain the partial video frames of the segment, where the sampling rate corresponding to a sub-segment containing a key frame is higher than that of the other sub-segments, the other sub-segments being those among the multiple sub-segments that do not contain a key frame.

在以上实施方式中,可以基于每个视频片段中的关键帧,从每个视频片段中划分出包含关键帧的视频子片段,以及不包含关键帧的视频子片段。也就是说,对于每个视频片段,还可以根据关键帧在视频片段中所处位置,对视频片段再进行划分,并得到多个视频子片段,使部分视频子片段中包括关键帧,部分视频子片段中不包括关键帧,然后按照关键帧所在视频子片段对应的采样率,对关键帧所在视频子片段进行采样,按照不包括关键帧的视频子片段对应的采样率,对不包括关键帧的视频子片段进行采样。并且,由于不包括关键帧的视频片段是画面变化不大的片段,因此其各个视频帧包括的信息是较为接近的,故可以采用相对较低的采样率,从而提取相对较少的视频帧,而对于包括关键帧的视频片段,则可以采用相对更高的采样率,从而能保留必要的视频帧,以便有效提取待处理视频中的信息。In the above implementation, based on the key frames in each video segment, the segment can be divided into sub-segments that contain key frames and sub-segments that do not. That is to say, each segment can be further divided, according to the positions of the key frames within it, into multiple sub-segments, some of which include key frames and some of which do not; the sub-segments containing key frames are then sampled at the rate assigned to key-frame sub-segments, and the sub-segments without key frames at the rate assigned to sub-segments without key frames. Moreover, since a sub-segment without key frames is one in which the picture changes little, the information contained in its frames is similar, so a relatively low sampling rate can be used to extract relatively few frames; for sub-segments that include key frames, a relatively higher rate can be used, so that the necessary video frames are retained and the information in the video to be processed can be effectively extracted.
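The two-rate sampling scheme above can be sketched as follows; the specific rates (0.5 for key-frame sub-segments, 0.1 for the rest) are illustrative assumptions, and the even-stride sampling within a sub-segment is one possible choice:

```python
def sample_subsegments(subsegments, key_rate=0.5, other_rate=0.1):
    # Each entry is (frames, has_keyframe). Sub-segments holding a key
    # frame are sampled densely; the rest sparsely, since their frames
    # carry near-duplicate information.
    sampled = []
    for frames, has_key in subsegments:
        rate = key_rate if has_key else other_rate
        n = max(1, round(len(frames) * rate))
        step = max(1, len(frames) // n)
        sampled.extend(frames[::step][:n])
    return sampled
```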

在一种可能的实施方式中,在基于每个视频片段中的关键帧,从每个视频片段中划分出包含关键帧的视频子片段,以及不包含关键帧的视频子片段之后,若包含关键帧的视频子片段是相邻的两个视频子片段,则可以确定两个视频子片段中的关键帧之间相距的时长,若相距的时长小于时长阈值,则可以将该两个视频子片段合并为一个视频子片段,以便进行采样后,能够更好地保留用于视频理解的有效信息。In a possible implementation, after each video segment has been divided, based on its key frames, into sub-segments that contain key frames and sub-segments that do not, if two adjacent sub-segments both contain key frames, the time between the key frames of the two sub-segments can be determined; if this time is less than a duration threshold, the two sub-segments can be merged into one, so that after sampling, the effective information for video understanding is better retained.

在一种可能的实施方式中,以上识别到的关键帧中可以包括主关键帧以及次关键帧,主关键帧是相对前一帧视频帧的变化度大于第一阈值的视频帧,次关键帧是相对前一帧视频帧的变化度大于第二阈值的视频帧,且第一阈值大于第二阈值,也就是说,主关键帧相对前一帧视频帧的变化度与次关键帧相对前一帧视频帧的变化度都大于一定阈值,但是主关键帧相对前一帧的变化度更大。在以上的实施方式中,包括主关键帧的视频子片段的采样率,可以大于包括次关键帧的视频子片段的采样率,从而能够更好地保留用于视频理解的有效信息。In a possible implementation, the identified key frames may include primary key frames and secondary key frames. A primary key frame is a video frame whose degree of change relative to the previous frame is greater than a first threshold, and a secondary key frame is one whose degree of change relative to the previous frame is greater than a second threshold, where the first threshold is greater than the second; that is, both exceed a certain threshold of change relative to the previous frame, but a primary key frame changes more. In this implementation, the sampling rate of a sub-segment that includes a primary key frame may be greater than that of a sub-segment that includes a secondary key frame, so that the effective information for video understanding is better retained.

在一些实施方式中,以上划分的多个视频片段中,可以部分视频片段中包括关键帧,部分视频片段中不包括关键帧。In some implementations, among the multiple video segments divided above, some video segments may include key frames, and some video segments may not include key frames.

对于包括关键帧的每个视频片段,可以按照前一实施方式,基于视频片段中的关键帧,对视频片段进行划分,得到视频片段对应的多个视频子片段,然后对视频片段对应的多个视频子片段中的每个视频子片段进行采样,得到该视频片段中的部分视频帧,其中,关键帧所在视频子片段对应的采样率高于其他视频子片段对应的采样率,其他视频子片段为多个视频子片段中除关键帧所在视频子片段以外的视频子片段。For each video segment that includes a key frame, as in the previous implementation, the segment can be divided based on its key frames into multiple sub-segments, and each sub-segment is sampled to obtain the partial video frames of the segment, where the sampling rate corresponding to a sub-segment containing a key frame is higher than that of the other sub-segments, the other sub-segments being those among the multiple sub-segments that do not contain a key frame.

对于不包括关键帧的每个视频片段,可以划分为多个时长相同的视频子片段,然后确定每个视频子片段与位置上最近的关键帧之间相距的时长,再根据该时长确定每个视频子片段对应的采样率,该采样率与该时长呈负相关,也就是说,与位置上最近的关键帧相距越远,则采样率越低,如此,可以进一步减少保留的视频帧的数量的同时,更好地保留用于视频理解的有效信息。例如,在任一不包括关键帧的视频片段中,包括视频子片段1、视频子片段2和视频子片段3,视频子片段1与其最近的关键帧之间相距10秒,视频子片段2与其最近的关键帧之间相距8秒,视频子片段3与其最近的关键帧之间相距15秒,则视频子片段2对应的采样率最高,视频子片段1对应的采样率次之,视频子片段3对应的采样率最低。Each video segment that does not include a key frame can be divided into multiple sub-segments of equal duration; the time between each sub-segment and the positionally nearest key frame is then determined, and the sampling rate of each sub-segment is determined from this time, the sampling rate being negatively correlated with it. That is, the farther a sub-segment is from its nearest key frame, the lower its sampling rate; in this way, the number of retained video frames can be further reduced while the effective information for video understanding is better retained. For example, suppose a video segment without key frames contains sub-segment 1, sub-segment 2, and sub-segment 3, where sub-segment 1 is 10 seconds from its nearest key frame, sub-segment 2 is 8 seconds from its nearest key frame, and sub-segment 3 is 15 seconds from its nearest key frame; then sub-segment 2 has the highest sampling rate, sub-segment 1 the next highest, and sub-segment 3 the lowest.
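One simple negatively correlated mapping from distance to sampling rate (the 1/(1+d) form is an illustrative assumption; the embodiment only requires that the rate fall as the distance grows) reproduces the ordering in the worked example:

```python
def sampling_rates(distances, base_rate=1.0):
    # Rate falls as the distance (in seconds) to the nearest key frame
    # grows: rate = base / (1 + distance).
    return [base_rate / (1.0 + d) for d in distances]
```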

在一些实施方式中,考虑到每次需要进行视频理解的待处理视频的视频时长不是固定的,为了保证根据提取的所有部分视频帧,能一次性生成用于输入大语言模型的输入信息,在针对多个视频片段的每个视频片段进行采样后,得到的视频帧的总数可以是固定值,例如400帧、500帧、700帧等。在针对多个视频片段的每个视频片段进行采样时,则可以根据待处理视频的具体时长,动态调整采样率,以保证各个视频片段对应的部分视频帧的总数为以上固定值。In some implementations, considering that the duration of the video to be processed is not fixed from one request to the next, in order to ensure that the input information for the large language model can be generated in a single pass from all the extracted partial video frames, the total number of video frames obtained after sampling all the video segments may be a fixed value, such as 400 frames, 500 frames, or 700 frames. When sampling each of the multiple video segments, the sampling rate can be dynamically adjusted according to the specific duration of the video to be processed, so as to ensure that the total number of partial video frames across all segments equals this fixed value.
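The dynamic adjustment to a fixed total can be sketched as a budget allocation; distributing the budget proportionally to segment length is one simple scheme among many, since the embodiment only requires that the counts sum to the fixed value:

```python
def allocate_frame_budget(segment_lengths, total_frames=400):
    # Split a fixed frame budget across segments in proportion to their
    # length, then hand leftover frames to the longest segments so the
    # counts sum exactly to the budget.
    total_len = sum(segment_lengths)
    counts = [total_frames * n // total_len for n in segment_lengths]
    leftovers = total_frames - sum(counts)
    for i in sorted(range(len(counts)), key=lambda i: -segment_lengths[i])[:leftovers]:
        counts[i] += 1
    return counts
```

Each segment's per-segment sampling rate then follows as its count divided by its frame count, so a longer input video automatically gets a lower rate.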

Step S240: Generate, based on target prompt information and the partial video frames, first input information for input to a large language model, where the target prompt information is used to prompt the large language model to perform video understanding.

Step S250: Input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In this embodiment of the present application, for steps S240 and S250 reference may be made to the foregoing embodiments; details are not repeated here.

In the video processing method provided by this embodiment of the present application, when video understanding is performed on the video to be processed by a large language model, the divided video segments are sampled according to the identified key frames to obtain partial video frames, and the input information for the large language model is then determined from the obtained partial video frames. Therefore, when a long video is to be understood, there is no need to segment the video and feed it into the large language model multiple times, which enables the large language model to better understand the long video and improves the accuracy of long-video understanding. Moreover, since the partial video frames are sampled according to the key frames, the necessary video frames can be retained, so that after the input information determined from the partial video frames is fed into the large language model, the model can effectively extract the information in the video to be processed, thereby ensuring the accuracy of video understanding.

Please refer to FIG. 3, which shows a schematic flowchart of a video processing method provided by yet another embodiment of the present application. The video processing method is applied to the above computer device. The flow shown in FIG. 3 is described in detail below; the video processing method may specifically include the following steps:

Step S310: Perform key frame identification on the video to be processed to obtain the key frames in the video to be processed.

Step S320: Divide the video to be processed into multiple video segments, each video segment including at least one of the key frames.

In this embodiment of the present application, for steps S310 and S320 reference may be made to the other embodiments; details are not repeated here.

Step S330: For each video segment of the multiple video segments, obtain a first number of video frames adjacent to the key frame, and obtain a second number of video frames from the other video frames, to obtain the first number of video frames and the second number of video frames corresponding to each video segment, where the other video frames are the video frames in each video segment other than the key frame and the first number of video frames adjacent to the key frame.

In this embodiment of the present application, after the key frames in the video to be processed have been identified and the video to be processed has been divided into multiple video segments, when partial video frames are extracted from each video segment based on the key frames, a first number of video frames adjacent to the key frame may be obtained for each video segment, and a second number of video frames may be obtained from the other video frames. In other words, a first number of consecutive video frames containing the key frame are extracted, and a second number of video frames are extracted from the video frames other than those already extracted, which not only reduces the number of video frames but also retains the necessary video frames.

In some implementations, when the second number of video frames is obtained from the above other video frames, the other video frames may be sampled to obtain the second number of video frames, where the sampling rate may be determined according to the second number and the number of the other video frames. Extracting the second number of video frames by sampling ensures that the frames taken from the other video frames are evenly distributed, so that information can be effectively extracted from the video segment during subsequent video understanding.
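The two-part extraction above can be sketched as follows; the window shape and the counts `first_n` and `second_n` are hypothetical parameters chosen for illustration, not values fixed by the embodiment:

```python
def sample_segment(frame_ids, key_idx, first_n=4, second_n=6):
    """From one segment: keep the key frame, first_n frames adjacent to
    it, and second_n frames sampled evenly from the remaining frames.
    first_n / second_n are illustrative assumptions."""
    half = first_n // 2
    lo = max(0, key_idx - half)
    hi = min(len(frame_ids), key_idx + half + 1)
    near = frame_ids[lo:hi]                   # key frame plus neighbours
    others = frame_ids[:lo] + frame_ids[hi:]  # everything else
    step = max(1, len(others) // second_n)
    far = others[::step][:second_n]           # evenly spaced sampling
    return sorted(near + far)

# A 100-frame segment whose key frame is frame 50.
kept = sample_segment(list(range(100)), key_idx=50)
assert 50 in kept and len(kept) <= 11
```

The even-step slice over `others` is one way to realise the "evenly distributed" sampling mentioned above.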

Step S340: Determine the key frame, the first number of video frames and the second number of video frames corresponding to each video segment as the partial video frames corresponding to each video segment.

In this embodiment of the present application, after the above first number of video frames and second number of video frames have been extracted for each video segment, the key frame in each video segment, the first number of video frames and the second number of video frames may be determined as the partial video frames corresponding to that video segment.

Step S350: Generate, based on target prompt information and the partial video frames, first input information for input to a large language model, where the target prompt information is used to prompt the large language model to perform video understanding.

Step S360: Input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In this embodiment of the present application, for steps S350 and S360 reference may be made to the other embodiments; details are not repeated here.

In the video processing method provided by this embodiment of the present application, when video understanding is performed on the video to be processed by a large language model, a first number of video frames adjacent to the identified key frames are obtained from each video segment, and a second number of video frames are obtained from the other video frames of each video segment, yielding the partial video frames corresponding to each video segment; the input information for the large language model is then determined from the obtained partial video frames. Therefore, when a long video is to be understood, there is no need to segment the video and feed it into the large language model multiple times, which enables the large language model to better understand the long video and improves the accuracy of long-video understanding. Moreover, since the partial video frames contain at least the key frames and their adjacent video frames, the effective information in each video segment is retained, so that after the input information determined from the partial video frames is fed into the large language model, the model can effectively extract the information in the video to be processed, thereby ensuring the accuracy of video understanding.

Please refer to FIG. 4, which shows a schematic flowchart of a video processing method provided by a further embodiment of the present application. The video processing method is applied to the above computer device. The flow shown in FIG. 4 is described in detail below; the video processing method may specifically include the following steps:

Step S410: Perform key frame identification on the video to be processed to obtain the key frames in the video to be processed.

Step S420: Divide the video to be processed into multiple video segments.

Step S430: Extract, based on the key frames, partial video frames from each video segment of the multiple video segments.

In this embodiment of the present application, for steps S410 to S430 reference may be made to the other embodiments; details are not repeated here.

Step S440: Perform feature extraction on the partial video frames to obtain image features corresponding to the partial video frames.

In this embodiment of the present application, after partial video frames have been extracted for each video segment based on the identified key frames, when the first input information for the large language model is generated based on the target prompt information and the partial video frames corresponding to all the video segments, feature extraction may be performed on all the extracted video frames (i.e., the partial video frames corresponding to all the above video segments) to obtain the image features corresponding to these video frames, so that these image features can subsequently be converted into information for input to the large language model, i.e., mapped into the feature space of the large language model so that they can be fed into it. The image features extracted for each video frame may include a global feature and block features (patch features) of multiple image blocks. Global features usually refer to features extracted from the entire image, such as color, texture and shape; they describe the content and contextual information of the whole image and are commonly used in tasks such as image classification, object detection and scene recognition. Patch features are features of local small blocks or patches extracted from the image, usually used to express the local detail information of the image; these features may likewise be color, texture, shape and so on.

In some implementations, the partial video frames corresponding to all the video segments may be input into a pre-trained feature extraction model to obtain the image feature corresponding to each of these video frames. The specific type of feature extraction model used for image feature extraction is not limited; for example, it may be a visual encoder such as ViT-L (a large variant of the Vision Transformer).
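A toy sketch of splitting an encoder's output into the two kinds of features, assuming a ViT-style token layout where the first token per frame is the global ([CLS]-like) feature and the remaining tokens are patch features; this layout is an assumption modelled on common Vision Transformer implementations, not something the embodiment fixes:

```python
def split_vit_tokens(tokens):
    """Split per-frame token sequences into a global feature (the first
    token) and patch features (the rest).  `tokens` is a list of frames,
    each a list of 1 + P feature vectors; the [CLS]-first layout is an
    illustrative assumption."""
    global_feats = [frame[0] for frame in tokens]
    patch_feats = [frame[1:] for frame in tokens]
    return global_feats, patch_feats

# Toy sizes: 2 frames, 4 patches each, feature dimension 3.
tokens = [[[float(f)] * 3 for _ in range(5)] for f in range(2)]
g, p = split_vit_tokens(tokens)
assert len(g) == 2 and len(p[0]) == 4
```

With a real encoder such as ViT-L the frames would be tensors and the feature dimension on the order of 1024, but the split itself is the same.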

Step S450: Pool the image features corresponding to the partial video frames, and concatenate the pooled image features to obtain the features to be input.

In this embodiment of the present application, after the image features corresponding to the above video frames have been extracted, the image features may be pooled to reduce the amount of data and thus the subsequent computation and memory consumption, and the pooled image features may then be concatenated to obtain the features to be input.

In some implementations, the image features corresponding to the above video frames may include a global feature and block features corresponding to multiple image blocks. When the image features corresponding to the above video frames are pooled and the pooled features are concatenated to obtain the features to be input, the image features may be fed into a spatial pool and a temporal pool, so that spatial pooling is applied to the image features through the spatial pool and temporal pooling through the temporal pool; the pooled image features obtained from the spatial pool and the temporal pool are then concatenated to obtain the above features to be input.

In this approach, in the temporal pool, the global features corresponding to the partial video frames of all the above video segments may be concatenated in the order of the video frames to obtain a first feature; in the spatial pool, the block features corresponding to the partial video frames of all the above video segments may be average-pooled to obtain a second feature; the first feature and the second feature are then concatenated to obtain the features to be input. Since the temporal pool operates in the temporal order of all the retained video frames, i.e., their order within the video to be processed, the temporal information is preserved, ensuring the accuracy of video understanding.

In the above approach, when the block features corresponding to the partial video frames of all the above video segments are average-pooled, the multiple block features of all the video frames within each time period may be average-pooled to obtain the second feature corresponding to all the video frames in that time period. The time period may be determined based on the above video segments; specifically, it may be the period from the timestamp of the earliest video frame to the timestamp of the latest video frame among the partial video frames extracted from each video segment.
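The temporal-concatenation / spatial-averaging split described above can be sketched with plain Python lists; a real pipeline would operate on tensors, and averaging per frame here stands in for the per-time-period averaging described above:

```python
def pool_and_concat(global_feats, patch_feats):
    """Temporal pool: concatenate per-frame global features in time
    order (first feature).  Spatial pool: average each frame's patch
    features elementwise (second feature).  Concatenating both yields
    the feature to be input.  Per-frame averaging is a simplification
    of the per-time-period averaging in the text."""
    first = [x for frame in global_feats for x in frame]  # time-ordered
    second = []
    for patches in patch_feats:  # one frame's patch features
        dim = len(patches[0])
        avg = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
        second.extend(avg)
    return first + second

g = [[1.0, 2.0], [3.0, 4.0]]                                  # 2 frames, dim 2
p = [[[1.0, 1.0], [3.0, 3.0]], [[2.0, 2.0], [4.0, 4.0]]]      # 2 patches each
feat = pool_and_concat(g, p)
assert feat == [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 3.0, 3.0]
```

Because `first` is built by walking the frames in order, the temporal ordering of the retained frames survives into the concatenated feature, which is the property the text highlights.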

Step S460: Concatenate the features to be input with target prompt information to obtain the first input information for input to the large language model, the target prompt information being used to prompt the large language model to perform video understanding.

Step S470: Input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In this embodiment of the present application, for steps S460 and S470 reference may be made to the other embodiments; details are not repeated here.

In the video processing method provided by this embodiment of the present application, when video understanding is performed on the video to be processed by a large language model, after partial video frames have been extracted according to the identified key frames and the divided video segments, image features are extracted from the partial video frames, pooled, and then concatenated with the prompt to obtain the input information for the large language model. Therefore, when a long video is to be understood, the input information for the large language model can be determined for the video in a single pass, which enables the large language model to better understand the long video and improves the accuracy of long-video understanding.

Please refer to FIG. 5, which shows a schematic flowchart of a video processing method provided by yet another embodiment of the present application. The video processing method is applied to the above computer device. The flow shown in FIG. 5 is described in detail below; the video processing method may specifically include the following steps:

Step S510: Obtain the video duration corresponding to the video to be processed.

In this embodiment of the present application, when video understanding is to be performed on the video to be processed, the video duration corresponding to the video to be processed may be determined in order to decide whether the video that currently requires video understanding is a long video.

Step S520: If the video duration is less than or equal to a target duration, generate, based on the video frames in the video to be processed and the target prompt information, second input information for input to the large language model.

In this embodiment of the present application, after the video duration of the video to be processed has been obtained, the video duration may be compared with the target duration. If, according to the comparison, the video duration is less than or equal to the target duration, the video that currently requires video understanding is not a long video. Following the conventional processing approach, the input information determined for the video to be processed can then be fed into the large language model in a single pass, without segmenting the video and feeding it into the model multiple times; therefore the second input information for the large language model can be generated directly based on the video frames in the video to be processed and the target prompt information.

Specifically, feature extraction may be performed on the video frames in the video to be processed to obtain the image features corresponding to these video frames; the image features of the video frames are pooled, and the pooled image features are concatenated to obtain the features to be input; the features to be input are then concatenated with the target prompt information to obtain the second input information for input to the large language model. For the manner of pooling the image features of the video frames, reference may be made to the preceding embodiment; details are not repeated here.

Step S530: Input the second input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In this embodiment of the present application, after the above second input information has been obtained, it may be input into the large language model to obtain the video understanding result corresponding to the video to be processed. In this way, video understanding of the video to be processed can be completed by feeding the input information into the large language model only once.

Step S540: If the video duration is greater than the target duration, perform key frame identification on the video to be processed to obtain the key frames in the video to be processed.

In this embodiment of the present application, after the video duration of the video to be processed has been obtained, the video duration is compared with the target duration. If, according to the comparison, the video duration is greater than the target duration, the video that currently requires video understanding is a long video; key frame identification may therefore be performed on the video to be processed to obtain its key frames, and the subsequent steps may be executed, thereby improving the accuracy of video understanding for long videos.

Step S550: Divide the video to be processed into multiple video segments.

Step S560: Extract, based on the key frames, partial video frames from each video segment of the multiple video segments.

Step S570: Generate, based on the target prompt information and the partial video frames, first input information for input to the large language model, where the target prompt information is used to prompt the large language model to perform video understanding.

Step S580: Input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

In this embodiment of the present application, for steps S550 to S580 reference may be made to the other embodiments; details are not repeated here.

In the video processing method provided by this embodiment of the present application, when video understanding is performed on the video to be processed by a large language model and the video duration of the video to be processed is greater than the target duration, partial video frames are extracted according to the identified key frames and the divided video segments, and the input information for the large language model is determined from the extracted partial video frames. Therefore, when a long video is to be understood, the input information for the large language model can be determined for the video in a single pass, which enables the large language model to better understand the long video and improves the accuracy of long-video understanding. When the video duration is not greater than the target duration, the input information for the large language model is determined directly from the video frames of the video to be processed, so that not only long videos but also short videos can be accurately understood.
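The duration branch of FIG. 5 reduces to a simple dispatch. `TARGET_DURATION_S` and the returned labels are placeholders; the two pipelines themselves (direct input construction vs. key-frame sampling) are only named, not implemented:

```python
from dataclasses import dataclass

@dataclass
class Video:
    duration_s: float

TARGET_DURATION_S = 180  # hypothetical threshold separating short and long videos

def choose_pipeline(video):
    """Short videos (steps S520-S530) build the LLM input from all
    frames directly; long videos (steps S540-S580) first go through
    key-frame identification and segment-wise sampling.  Only the
    dispatch is shown here."""
    if video.duration_s <= TARGET_DURATION_S:
        return "direct"            # steps S520-S530
    return "keyframe_sampling"     # steps S540-S580

assert choose_pipeline(Video(duration_s=60)) == "direct"
assert choose_pipeline(Video(duration_s=600)) == "keyframe_sampling"
```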

The video processing method provided by the foregoing embodiments is further illustrated below with an example.

Please refer to FIG. 6. When video understanding is performed on the video to be processed, key frame identification and video segment division may first be performed, so that the video to be processed is annotated with the timestamps used for dividing the video segments as well as with the key frames. Based on the annotated timestamps and key frames, partial video frames are extracted from each video segment. The extracted video frames are then input into a visual encoder to obtain the image feature corresponding to each video frame. The extracted image features are fed into a temporal pool and a spatial pool: in the temporal pool, the global features among the image features of all the video frames are concatenated in the temporal order of the frames; in the spatial pool, the patch features among the image features of all the video frames are average-pooled. The features obtained from the temporal pool and the spatial pool are then concatenated, mapped into the feature space of the large language model, concatenated with the target prompt information, and input into the large language model, so that the large language model outputs a video understanding result according to the input information, for example: from 3 s to 5.6 s, two people are chatting happily; from 5 min 30 s to 8 min 15 s, four people are engaged in a chaotic street race through scenes of streets, bridges and rivers.

Please refer to FIG. 7, which shows a structural block diagram of a video processing apparatus 700 provided by an embodiment of the present application. The video processing apparatus 700 is applied to the above computer device and includes: a key frame identification module 710, a video division module 720, a video frame extraction module 730, an information generation module 740 and a video understanding module 750. The key frame identification module 710 is configured to perform key frame identification on the video to be processed to obtain the key frames in the video to be processed; the video division module 720 is configured to divide the video to be processed into multiple video segments; the video frame extraction module 730 is configured to extract, based on the key frames, partial video frames from each video segment of the multiple video segments; the information generation module 740 is configured to generate, based on target prompt information and the partial video frames, first input information for input to a large language model, the target prompt information being used to prompt the large language model to perform video understanding; and the video understanding module 750 is configured to input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

在一些实施方式中,视频帧提取模块730可以具体用于基于所述关键帧,针对所述多个视频片段的每个视频片段进行采样,得到所述每个视频片段中的部分视频帧。In some implementations, the video frame extraction module 730 may be specifically configured to sample each video segment of the plurality of video segments based on the key frame to obtain partial video frames in each video segment.

在一种可能的实施方式中,所述每个视频片段中包括至少一帧所述关键帧。视频帧提取模块730还可以具体用于基于所述每个视频片段中的所述关键帧,对所述每个视频片段进行划分,得到所述每个视频片段对应的多个视频子片段;对所述每个视频片段对应的多个视频子片段中的每个视频子片段进行采样,得到所述每个视频片段中的部分视频帧,其中,所述关键帧所在视频子片段对应的采样率高于其他视频子片段对应的采样率,所述其他视频子片段为所述多个视频子片段中除所述关键帧所在视频子片段以外的视频子片段。In a possible implementation, each video segment includes at least one frame of the key frame. The video frame extraction module 730 may also be specifically configured to divide each video segment based on the key frame in each video segment to obtain multiple video sub-segments corresponding to each video segment; Each video sub-segment in the plurality of video sub-segments corresponding to each video segment is sampled to obtain partial video frames in each video segment, where the sampling rate corresponding to the video sub-segment where the key frame is located is Higher than the corresponding sampling rate of other video sub-segments, which are video sub-segments of the plurality of video sub-segments other than the video sub-segment where the key frame is located.

可选地,视频帧提取模块730还可以具体用于基于所述每个视频片段中的所述关键帧,从所述每个视频片段中划分出包含所述关键帧的视频子片段,以及不包含所述关键帧的视频子片段。Optionally, the video frame extraction module 730 may also be specifically configured to divide the video sub-segments containing the key frames from each video segment based on the key frames in each video segment, and not The video subsegment containing the keyframe.

在一些实施方式中,所述每个视频片段中包括至少一帧所述关键帧,视频帧提取模块730可以具体用于针对所述多个视频片段中的每个视频片段,获取与所述关键帧邻近的第一数量的视频帧,并从其他视频帧中获取第二数量的视频帧,得到所述每个视频片段对应的所述第一数量的视频帧以及所述第二数量的视频帧,其中,所述其他视频帧为所述每个视频片段中除所述关键帧以及所述关键帧邻近的第一数量的视频帧以外的视频帧;将所述每个视频片段对应的所述关键帧、所述第一数量的视频帧以及所述第二数量的视频帧,确定为所述每个视频片段对应的部分视频帧。In some embodiments, each video segment includes at least one frame of the key frame, and the video frame extraction module 730 may be specifically configured to obtain, for each video segment in the plurality of video segments, information related to the key frame. frame adjacent first number of video frames, and obtain a second number of video frames from other video frames to obtain the first number of video frames and the second number of video frames corresponding to each video segment. , wherein the other video frames are video frames in each video segment other than the key frame and the first number of video frames adjacent to the key frame; Key frames, the first number of video frames, and the second number of video frames are determined as partial video frames corresponding to each video segment.

在一些实施方式中,视频划分模块720可以具体用于基于所述关键帧,将所述待处理视频划分为多个视频片段,其中,所述多个视频片段的每个视频片段中包括至少一帧所述关键帧。In some embodiments, the video dividing module 720 may be specifically configured to divide the video to be processed into multiple video segments based on the key frame, wherein each video segment of the multiple video segments includes at least one frame the key frame.

在一种可能的实施方式中,视频划分模块720还可以具体用于基于所述关键帧,对所述待处理视频标注多个目标时间戳,其中,相邻的两个所述目标时间戳之间的视频构成一个视频片段,所述目标时间戳位于相邻的两个关键帧之间。In a possible implementation, the video dividing module 720 can also be specifically configured to mark multiple target timestamps on the video to be processed based on the key frame, wherein the two adjacent target timestamps The video between them constitutes a video clip, and the target timestamp is located between two adjacent key frames.

在一些实施方式中,信息生成模块740可以具体用于对所述部分视频帧进行特征提取,得到所述部分视频帧对应的图像特征;对所述部分视频帧对应的图像特征进行池化,并将池化后的图像特征进行拼接,得到待输入特征;将所述待输入特征与所述目标提示信息进行拼接,得到用于输入大语言模型的第一输入信息。In some embodiments, the information generation module 740 may be specifically configured to perform feature extraction on the partial video frames to obtain image features corresponding to the partial video frames; pool the image features corresponding to the partial video frames, and The pooled image features are spliced to obtain features to be input; the features to be input are spliced with the target prompt information to obtain first input information for inputting a large language model.

In a possible implementation, the image features corresponding to each of the partial video frames include a global feature and block features corresponding to a plurality of image blocks, and the information generation module 740 may further be specifically configured to: splice the global features corresponding to the partial video frames according to the order of the video frames in the partial video frames to obtain a first feature; perform average pooling on the block features corresponding to the partial video frames to obtain a second feature; and splice the first feature and the second feature to obtain the features to be input.
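The global-plus-block feature assembly above could be sketched as follows with NumPy; the feature shapes and the flat concatenation layout are assumptions, not the exact model architecture:

```python
import numpy as np

def build_input_features(global_feats, block_feats):
    """global_feats: (T, D) -- one global feature per selected frame, in frame order.
    block_feats:  (T, P, D) -- P image-block (patch) features per frame.
    Returns a 1-D vector: ordered global features followed by the
    average-pooled block features."""
    T, D = global_feats.shape
    # First feature: global features spliced in frame order.
    first = global_feats.reshape(T * D)
    # Second feature: average pooling over all frames and blocks.
    second = block_feats.mean(axis=(0, 1))  # shape (D,)
    return np.concatenate([first, second])
```

In an actual system the resulting vector would then be spliced with the embedded target prompt information to form the first input information.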

In some embodiments, the video processing apparatus may further include a duration acquisition module, which may be configured to acquire the video duration of the video to be processed before key frame identification is performed on the video to be processed to obtain the key frames therein; the key frame identification module may be specifically configured to perform key frame identification on the video to be processed to obtain the key frames in the video to be processed if the video duration is greater than a target duration.

In a possible implementation, the information generation module may further be configured to, after the video duration of the video to be processed is acquired, generate second input information for input into the large language model based on the video frames in the video to be processed and the target prompt information if the video duration is less than or equal to the target duration; the video understanding module may further be configured to input the second input information into the large language model to obtain the video understanding result corresponding to the video to be processed.
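The duration-based branching in these two implementations amounts to the following control flow; `target_duration` and all helper names are placeholders for whatever the surrounding modules provide:

```python
def build_llm_input(video, target_duration, prompt,
                    identify_keyframes, extract_partial_frames, encode):
    """Long videos go through key-frame-guided frame selection; short
    videos are encoded whole. `encode` turns frames + prompt into the
    information fed to the large language model."""
    if video.duration > target_duration:
        keyframes = identify_keyframes(video)
        frames = extract_partial_frames(video, keyframes)
        return encode(frames, prompt)      # first input information
    # Short video: use its frames directly.
    return encode(video.frames, prompt)    # second input information
```

The point of the branch is that short videos never pay the cost of key frame identification and segmentation.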

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the several embodiments provided in this application, the coupling between modules may be electrical, mechanical, or in other forms.

In addition, the functional modules in the embodiments of this application may be integrated into one processing module, each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.

In summary, in the solution provided by this application, key frame identification is performed on a video to be processed to obtain the key frames in the video; the video is divided into a plurality of video segments; based on the key frames, partial video frames are extracted from each of the plurality of video segments; based on the extracted partial video frames and target prompt information used to prompt a large language model to perform video understanding, first input information for input into the large language model is generated; and the first input information is then input into the large language model to obtain the video understanding result corresponding to the video. Because the input information for the large language model is determined from partial video frames extracted according to the identified key frames and the divided video segments, a long video does not need to be segmented and fed into the large language model multiple times. The large language model can therefore better understand long videos, improving the accuracy of long-video understanding.
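One claimed variant of the key-frame-guided extraction summarized above samples the sub-segment containing the key frame at a higher rate than the other sub-segments. A sketch, where the window half-width and the two rates are illustrative values, not specified by the application:

```python
def sample_segment(frames, key_idx, dense_rate=2, sparse_rate=8):
    """Split one segment into the sub-segment around the key frame and the
    rest, then keep every `dense_rate`-th frame near the key frame and every
    `sparse_rate`-th frame elsewhere (rates are illustrative)."""
    window = 4  # assumed half-width of the key-frame sub-segment
    lo, hi = max(0, key_idx - window), min(len(frames), key_idx + window + 1)
    dense = list(range(lo, hi, dense_rate))             # key-frame sub-segment
    sparse = [i for i in range(0, len(frames), sparse_rate) if not lo <= i < hi]
    if key_idx not in dense:
        dense.append(key_idx)                           # always keep the key frame
    return [frames[i] for i in sorted(set(dense + sparse))]
```

The effect is that frames near the key frame are retained densely while the rest of the segment contributes only sparse context, keeping the total frame budget small.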

Please refer to FIG. 8, which shows a structural block diagram of a computer device provided by an embodiment of this application. The computer device 100 may be a smartphone, a tablet computer, a smart watch, an e-book reader, or another device capable of running application programs. The computer device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more application programs are configured to perform the methods described in the foregoing method embodiments.

The processor 110 may include one or more processing cores. The processor 110 connects the various parts of the entire computer device 100 through various interfaces and lines, and performs the various functions of the computer device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (DSP), a field-programmable gate array (FPGA), or a programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, and application programs; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.

The memory 120 may include random access memory (RAM) or read-only memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the following method embodiments, and the like. The data storage area may also store data created during use of the computer device 100 (such as a phone book, audio and video data, or chat record data).

Please refer to FIG. 9, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of this application. The computer-readable medium 800 stores program code, and the program code can be invoked by a processor to perform the methods described in the foregoing method embodiments.

The computer-readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 for performing any of the method steps in the above methods. The program code can be read from or written into one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (14)

1. A video processing method, characterized in that the method comprises:
performing key frame identification on a video to be processed to obtain key frames in the video to be processed;
dividing the video to be processed into a plurality of video segments;
extracting partial video frames from each of the plurality of video segments based on the key frames;
generating first input information for input into a large language model based on target prompt information and the partial video frames, the target prompt information being used to prompt the large language model to perform video understanding; and
inputting the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

2. The method according to claim 1, characterized in that extracting partial video frames from each of the plurality of video segments based on the key frames comprises:
sampling each of the plurality of video segments based on the key frames to obtain the partial video frames in each video segment.

3. The method according to claim 2, characterized in that each video segment includes at least one of the key frames, and sampling each of the plurality of video segments based on the key frames to obtain the partial video frames in each video segment comprises:
dividing each video segment based on the key frame in that video segment to obtain a plurality of video sub-segments corresponding to that video segment; and
sampling each of the plurality of video sub-segments corresponding to each video segment to obtain the partial video frames in that video segment, wherein the sampling rate corresponding to the video sub-segment in which the key frame is located is higher than the sampling rate corresponding to the other video sub-segments, the other video sub-segments being the video sub-segments, among the plurality of video sub-segments, other than the video sub-segment in which the key frame is located.

4. The method according to claim 3, characterized in that dividing each video segment based on the key frame in that video segment to obtain a plurality of video sub-segments corresponding to that video segment comprises:
dividing each video segment, based on the key frame in that video segment, into a video sub-segment containing the key frame and a video sub-segment not containing the key frame.

5. The method according to claim 1, characterized in that each video segment includes at least one of the key frames, and extracting partial video frames from each of the plurality of video segments based on the key frames comprises:
for each of the plurality of video segments, obtaining a first number of video frames adjacent to the key frame, and obtaining a second number of video frames from other video frames, to obtain the first number of video frames and the second number of video frames corresponding to that video segment, wherein the other video frames are the video frames in that video segment other than the key frame and the first number of video frames adjacent to the key frame; and
determining the key frame, the first number of video frames, and the second number of video frames corresponding to each video segment as the partial video frames corresponding to that video segment.

6. The method according to claim 1, characterized in that dividing the video to be processed into a plurality of video segments comprises:
dividing the video to be processed into a plurality of video segments based on the key frames, wherein each of the plurality of video segments includes at least one of the key frames.

7. The method according to claim 6, characterized in that dividing the video to be processed into a plurality of video segments based on the key frames comprises:
marking a plurality of target timestamps on the video to be processed based on the key frames, wherein the video between two adjacent target timestamps constitutes one video segment, and each target timestamp is located between two adjacent key frames.

8. The method according to any one of claims 1 to 7, characterized in that generating first input information for input into a large language model based on target prompt information and the partial video frames comprises:
performing feature extraction on the partial video frames to obtain image features corresponding to the partial video frames;
pooling the image features corresponding to the partial video frames and splicing the pooled image features to obtain features to be input; and
splicing the features to be input with the target prompt information to obtain the first input information for input into the large language model.

9. The method according to claim 8, characterized in that the image features corresponding to each of the partial video frames include a global feature and block features corresponding to a plurality of image blocks, and pooling the image features corresponding to the partial video frames and splicing the pooled image features to obtain features to be input comprises:
splicing the global features corresponding to the partial video frames according to the order of the video frames in the partial video frames to obtain a first feature;
performing average pooling on the block features corresponding to the partial video frames to obtain a second feature; and
splicing the first feature and the second feature to obtain the features to be input.

10. The method according to any one of claims 1 to 7, characterized in that, before performing key frame identification on the video to be processed to obtain the key frames in the video to be processed, the method further comprises:
obtaining a video duration corresponding to the video to be processed; and
if the video duration is greater than a target duration, performing the step of key frame identification on the video to be processed to obtain the key frames in the video to be processed.

11. The method according to claim 10, characterized in that, after obtaining the video duration corresponding to the video to be processed, the method further comprises:
if the video duration is less than or equal to the target duration, generating second input information for input into the large language model based on the video frames in the video to be processed and the target prompt information; and
inputting the second input information into the large language model to obtain the video understanding result corresponding to the video to be processed.

12. A video processing apparatus, characterized in that the apparatus comprises a key frame identification module, a video dividing module, a video frame extraction module, an information generation module, and a video understanding module, wherein:
the key frame identification module is configured to perform key frame identification on a video to be processed to obtain key frames in the video to be processed;
the video dividing module is configured to divide the video to be processed into a plurality of video segments;
the video frame extraction module is configured to extract partial video frames from each of the plurality of video segments based on the key frames;
the information generation module is configured to generate first input information for input into a large language model based on target prompt information and the partial video frames, the target prompt information being used to prompt the large language model to perform video understanding; and
the video understanding module is configured to input the first input information into the large language model to obtain a video understanding result corresponding to the video to be processed.

13. A computer device, characterized by comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1 to 11.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and the program code can be invoked by a processor to perform the method according to any one of claims 1 to 11.
CN202311257597.3A 2023-09-26 2023-09-26 Video processing method, device, computer equipment and storage medium Pending CN117336525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311257597.3A CN117336525A (en) 2023-09-26 2023-09-26 Video processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311257597.3A CN117336525A (en) 2023-09-26 2023-09-26 Video processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117336525A true CN117336525A (en) 2024-01-02

Family

ID=89294619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311257597.3A Pending CN117336525A (en) 2023-09-26 2023-09-26 Video processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117336525A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118096067A (en) * 2024-02-28 2024-05-28 佛山职业技术学院 Interactive learning method and related device for ceramic courses
CN118366075A (en) * 2024-04-02 2024-07-19 北京邮电大学 Video recognition method, device, equipment and storage medium
CN118590714A (en) * 2024-08-02 2024-09-03 荣耀终端有限公司 Visual media data processing method, program product, storage medium and electronic device
CN118840697A (en) * 2024-09-20 2024-10-25 北京卓视智通科技有限责任公司 Long video understanding method, device, equipment and medium based on large model
CN118840697B (en) * 2024-09-20 2024-11-26 北京卓视智通科技有限责任公司 A long video understanding method, device, equipment and medium based on large model
CN119810707A (en) * 2024-12-12 2025-04-11 北京百度网讯科技有限公司 Video processing method, device, electronic device and storage medium
CN120088695A (en) * 2024-12-26 2025-06-03 北京师范大学珠海校区 Weakly supervised temporal action localization method and device with external knowledge features
CN120088695B (en) * 2024-12-26 2025-10-21 北京师范大学珠海校区 Weakly supervised temporal action localization method and device using external knowledge features
CN119835500A (en) * 2025-01-15 2025-04-15 上海哔哩哔哩科技有限公司 Video editing method and device
CN120388323A (en) * 2025-06-27 2025-07-29 复旦大学 A sampling method and system for long video understanding
CN120388323B (en) * 2025-06-27 2025-08-26 复旦大学 Sampling method and system for long video understanding

Similar Documents

Publication Publication Date Title
CN117336525A (en) Video processing method, device, computer equipment and storage medium
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN111294646B (en) Video processing method, device, equipment and storage medium
CN110929622B (en) Video classification method, model training method, device, equipment and storage medium
CN108648746B (en) A method for generating open-domain video natural language description based on multimodal feature fusion
CN110163079B (en) Video detection method and device, computer readable medium and electronic equipment
CN109145152B (en) Method for adaptively and intelligently generating image-text video thumbnail based on query word
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN113435330B (en) Video-based micro-expression recognition method, device, equipment and storage medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
US12111866B2 (en) Term weight generation method, apparatus, device and medium
CN113205047B (en) Medicine name identification method, device, computer equipment and storage medium
CN113392689B (en) Video text tracking method, video processing method, device, equipment and medium
JP2022088304A (en) Methods, devices, electronics, media and computer programs for processing video
US20170116521A1 (en) Tag processing method and device
CN112804558B (en) Video splitting method, device and equipment
CN114282059A (en) Method, device, device and storage medium for video retrieval
CN118467778B (en) Video information summary generation method, device, electronic device and storage medium
CN117079671A (en) Audio processing method, device, computer equipment and storage medium
CN114722893A (en) Model generation method, image annotation method and device and electronic equipment
CN117478886A (en) Multimedia data encoding method, device, electronic equipment and storage medium
CN117523200A (en) Image segmentation method and device for application interface, electronic equipment and storage medium
CN114973115B (en) Method, device and storage medium for estimating the number of image recognition objects
CN118172713B (en) Video tag identification method, device, computer equipment and storage medium
CN119342307A (en) Keyframe-based animation video generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination