
CN118366075A - Video recognition method, device, equipment and storage medium - Google Patents

Video recognition method, device, equipment and storage medium

Info

Publication number
CN118366075A
Authority
CN
China
Prior art keywords
video data
information
key
sample
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410395016.0A
Other languages
Chinese (zh)
Other versions
CN118366075B (en)
Inventor
南国顺
杜航
崔琪楣
张嘉阳
谢滨竹
张驷乘
许峻瑞
陶小峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202410395016.0A priority Critical patent/CN118366075B/en
Publication of CN118366075A publication Critical patent/CN118366075A/en
Application granted granted Critical
Publication of CN118366075B publication Critical patent/CN118366075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present application proposes a video recognition method, device, equipment and storage medium. The method comprises: obtaining video data to be recognized and key information corresponding to the video data; extracting key video data corresponding to the key information from the video data; obtaining prompt information for the recognition result; determining target video data corresponding to the prompt information in the key video data; and sending the target video data and the prompt information to a large language model, and outputting the recognition result for the prompt information. By incorporating the prompt information for the recognition result, the embodiment of the present application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while still covering the key data, so the recognition efficiency is high.

Description

Video identification method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a video identification method, a device, equipment and a storage medium.
Background
Currently, Video Anomaly Understanding (VAU for short) is critical for promptly identifying potential safety hazards, optimizing operational workflows, and improving safety and efficiency; for example, it can monitor abnormal conditions in scenarios such as traffic monitoring, environmental monitoring, and industrial production.
VAU can automatically identify and understand abnormal events in video. Related VAU technology focuses on two aspects: first, video anomaly detection, that is, identifying abnormal events in a video through an algorithmic model; and second, anomaly localization, that is, determining the start and end positions of an abnormal event on the time axis or its location within the spatial range.
However, the output of current VAU technology is limited to a single form and its efficiency is low.
Disclosure of Invention
The application provides a video identification method, a device, equipment and a storage medium, which can address the technical problems that the output of current VAU technology is limited to a single form and its efficiency is low.
An embodiment of a first aspect of the present application provides a video recognition method, including:
Acquiring video data to be identified and key information corresponding to the video data, wherein the key information is descriptive information corresponding to an abnormal event in the video data;
extracting key video data corresponding to the key information from the video data;
Acquiring prompt information aiming at the identification result;
determining target video data corresponding to the prompt information in the key video data;
And inputting the target video data and the prompt information into a large language model, and outputting a recognition result aiming at the prompt information.
An embodiment of a second aspect of the present application provides a video recognition apparatus, including:
The acquisition module is used for acquiring video data to be identified and key information corresponding to the video data, wherein the key information is descriptive information corresponding to an abnormal event in the video data;
the extraction module is used for extracting key video data corresponding to the key information in the video data;
the determining module is used for acquiring prompt information aiming at the identification result;
the determining module is further configured to determine target video data corresponding to the prompt information in the key video data;
and the output module is used for inputting the target video data and the prompt information into a large language model and outputting the recognition result aiming at the prompt information.
An embodiment of a third aspect of the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor running the computer program to implement the method of the first aspect.
An embodiment of the fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program for execution by a processor to perform the method of the first aspect described above.
The technical scheme provided by the embodiment of the application has at least the following technical effects or advantages:
In the embodiment of the application, firstly, the video data to be identified and key information corresponding to the video data are acquired; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiment of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures.
In the drawings:
FIG. 1 is a flow chart of a video recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an evaluation process according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video recognition device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 5 shows a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
VAU technology aims to automatically identify and understand abnormal events in video, and researchers have focused on developing and optimizing video anomaly detection and localization techniques, that is, finding and locating anomaly information in video through models. However, current VAU monitoring methods have significant shortcomings in understanding the causal relationships behind abnormal events and their impact.
In order to solve the above problems, embodiments of the present application provide a video identification method, apparatus, device, and storage medium. In the embodiments of the present application, firstly, the video data to be identified and key information corresponding to the video data are obtained; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiments of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
The video recognition method of the present application may be performed by a computing device. The computing device may be a server, for example a single server, a plurality of servers, a server cluster, or a cloud computing platform; optionally, the computing device may also be a terminal device, for example a mobile phone, a tablet computer, a game console, a portable computer, a desktop computer, an advertising machine, an all-in-one machine, or the like.
In the embodiments of the present application, a video recognition method applied to video data to be processed is taken as an example for illustration, and a computing device is used as the execution subject in the various embodiments.
The following describes a video identification method, apparatus, device and storage medium according to embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, the method specifically includes the steps of:
S101, acquiring video data to be identified and key information corresponding to the video data.
The key information is descriptive information corresponding to an abnormal event in the video data.
The video data to be identified can be video data acquired by acquisition equipment in scenes such as traffic monitoring, environment monitoring and industrial production.
The key information may be generated based on a key field input by a person, or may be identified from the video data; for example, if the key field input by the person is "anomaly detection", the generated key information may be that a traffic accident exists in the video segment.
S102, extracting key video data corresponding to the key information in the video data.
It can be understood that, in order to cover the long-term relevance between video segments, the video data to be identified is generally long, which makes the recognition time long and the efficiency low. By extracting the key video data, the amount of video data to be identified can be reduced without losing the long-term relevance between video segments.
Since recognizing the video data is mainly intended to identify the abnormal events contained in it, the key video data corresponding to the key information can be extracted, so that abnormal events in video data with high long-term relevance can be identified from much shorter key video data.
S103, acquiring prompt information aiming at the identification result.
It can be understood that an abnormal event generally involves the cause of the anomaly, the anomaly type, the start and end times, and a detailed event description, while the user may only care about the cause of the anomaly; therefore, prompt information input by the user can be received, so that a recognition result targeted at the prompt information can be output later.
S104, determining target video data corresponding to the prompt information in the key video data.
S105, inputting the target video data and the prompt information into a large language model, and outputting a recognition result aiming at the prompt information.
In order to further improve the recognition efficiency of the video data, the target video data corresponding to the prompt information can be determined in the key video data, so that the model can be more focused on the target video data corresponding to the prompt information, and the model can rapidly output the recognition result corresponding to the prompt information.
After the recognition result is obtained, semantic analysis can be performed on it; if the semantics of the recognition result do not meet the requirement, the large language model is controlled to output the recognition result again. For example, if the recognition result is that the cause of the traffic accident is that the vehicle in front suddenly stopped and parked, and this result obviously does not make semantic sense, the large language model is controlled to output the recognition result again.
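Purely as an illustration of the re-generation loop described above, the logic could be organized as in the following Python sketch; the llm.generate call and the semantic_check helper are hypothetical placeholders and are not part of the disclosed method.

```python
MAX_RETRIES = 3  # illustrative limit, not specified by the method


def recognize_with_semantic_check(llm, target_video_data, prompt, semantic_check):
    """Query the large language model and re-generate when the answer fails a semantic check."""
    result = None
    for _ in range(MAX_RETRIES):
        result = llm.generate(video=target_video_data, prompt=prompt)  # hypothetical API
        if semantic_check(result, prompt):  # e.g. a rule-based or NLI-style consistency test
            return result
    return result  # fall back to the last answer if no attempt passes the check
```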
The embodiment of the application provides a video identification method. Firstly, the video data to be identified and key information corresponding to the video data are acquired; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiment of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
In some embodiments, extracting key video data corresponding to key information in video data includes:
Extracting a plurality of video frames of the video data according to a preset frame rate;
dividing a plurality of video frames into a plurality of areas respectively;
Extracting a first feature corresponding to each of the plurality of areas and a plurality of second features of the key information by using a preset neural network model;
determining at least one first feature that matches the plurality of second features;
and determining the video data corresponding to the at least one first feature as key video data.
The preset frame rate can be flexibly set based on actual conditions.
After extracting a plurality of video frames, in order to analyze the local features of the video frames, the video frames need to be divided into a plurality of regions, i.e., into a plurality of patch blocks.
In some embodiments, partitioning a video frame into multiple patch blocks may be implemented as:
It is first necessary to determine the size of each patch, typically a fixed square or rectangular area.
Next, the overlap and stride of the patches are determined; these set the overlap or spacing between adjacent patches.
The video frame is then cut: a window is moved in the horizontal and vertical directions according to the set patch size and stride, gradually cutting the whole video frame.
Each time the window is moved, the patch area of the current position is extracted and saved as an independent image block.
In addition, boundary conditions need to be considered during the cutting process to ensure that all image areas are cut correctly and that no information is lost.
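A minimal sketch of the patch-cutting procedure above, assuming each frame is a NumPy array in (H, W, C) layout; the patch size and stride values are illustrative, and zero-padding is used as one possible way to handle the boundary condition.

```python
import numpy as np


def extract_patches(frame: np.ndarray, patch_size: int = 32, stride: int = 32) -> np.ndarray:
    """Cut one video frame (H, W, C) into square patches with a sliding window.

    stride == patch_size gives non-overlapping patches; a smaller stride yields
    overlapping patches. The frame is zero-padded so border regions are not lost.
    """
    h, w, _ = frame.shape
    # Boundary handling: pad so the sliding window covers the full frame.
    pad_h = (stride - (h - patch_size) % stride) % stride
    pad_w = (stride - (w - patch_size) % stride) % stride
    frame = np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)))
    patches = []
    for top in range(0, frame.shape[0] - patch_size + 1, stride):
        for left in range(0, frame.shape[1] - patch_size + 1, stride):
            patches.append(frame[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)  # (num_patches, patch_size, patch_size, C)
```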
The preset neural network model may be an image-text pre-training model (Contrastive Language-Image Pretraining, CLIP for short), a vision self-attention model (Vision Transformer, ViT for short), a detection model (Detection Transformer, DETR for short), or another model used for extracting features.
The extraction of the multiple region features is described below taking CLIP as an example:
For each extracted patch block, a trained CLIP model is used to extract its feature representation. The CLIP model maps an image into a semantic space, so the position of the patch block in that semantic space can be understood from its feature vector; in this way, the first features corresponding to the plurality of regions are obtained.
The key information is generally represented as a text description.
To extract the features of the key information with CLIP, the key information in the form of a text description may be input into the CLIP model; the key information passes through a series of encoding layers, including a Transformer or other structure, and is then converted into a numerical representation that the computer can process.
The encoded text is mapped into a high-dimensional semantic space. In this process, the CLIP model learns how best to represent the semantic information of the text such that texts of similar meaning are closer together in semantic space.
CLIP can extract text feature vectors from the text encoding. These feature vectors capture the position and meaning of the text in the semantic space; that is, the plurality of second features of the key information are extracted.
At least one first feature matching the plurality of second features is determined by a cross-attention operation, i.e. in CLIP, a cross-attention calculation is performed between the first features and the second features. This means that the model attends to the important parts of the image feature vectors from the perspective of the text feature vectors, and to the important parts of the text feature vectors from the perspective of the image feature vectors.
After the cross-attention calculation, a similarity score between the text and the image can be obtained. This similarity score reflects the degree of correlation between the text description and the image content. In CLIP, cosine similarity or other similarity measure methods are typically used to calculate this similarity.
Further, a first feature and a second feature whose similarity is greater than a preset similarity threshold may be determined as two features that match each other, where the similarity threshold can be flexibly set according to the actual situation; in this way, at least one first feature matching the plurality of second features is obtained. The at least one first feature represents the features of the video data corresponding to the key information, and the key video data corresponding to the key information can be obtained by combining the at least one first feature.
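The sketch below illustrates the kind of matching described above using the Hugging Face transformers implementation of CLIP; note that it substitutes a plain cosine-similarity comparison for the cross-attention operation as a simplified stand-in, and the similarity threshold is an assumed value.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def match_patches_to_key_info(patch_images, key_texts, threshold=0.25):
    """Return indices of patches (first features) matching any key-information text (second features).

    patch_images: list of PIL images or arrays, one per patch block.
    key_texts:    list of key-information text descriptions.
    """
    inputs = processor(text=key_texts, images=patch_images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = img_feat @ txt_feat.T                 # cosine similarity, patches x texts
    matched = (sim > threshold).any(dim=1)      # patch matches at least one key text
    return matched.nonzero(as_tuple=True)[0].tolist()
```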
In some embodiments, determining target video data corresponding to the prompt information in the key video data includes:
Extracting a plurality of third features of the prompt information by using a preset neural network model;
determining at least one first feature and determining at least one fourth feature that matches the plurality of third features;
and determining the key video data corresponding to the at least one fourth feature as target video data.
It will be appreciated that a targeted answer to the prompt is required, and in order to meet this requirement and improve the output efficiency, it is necessary to determine the video data corresponding to the prompt from the key video data.
The prompt information can be text information or voice information input by a user or text information after semantic understanding is carried out on the text information or voice information input by the user.
The process of extracting the plurality of third features of the prompt information with the preset neural network model is consistent with the process of extracting the features of the key information, and is not repeated here.
Further, at least one fourth feature that matches the plurality of third features may be determined from the at least one first feature by a cross-attention operation.
And further determining the key video data corresponding to the at least one fourth feature as target video data corresponding to the prompt information.
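As a further illustration, selecting the key video features that best match the prompt features could look like the sketch below; the value of k is an assumption, and cosine similarity again stands in for the cross-attention matching described above.

```python
import torch
import torch.nn.functional as F


def select_target_features(key_video_feats: torch.Tensor,
                           prompt_feats: torch.Tensor,
                           k: int = 8) -> torch.Tensor:
    """Pick the key-video (first) features that best match the prompt (third) features.

    key_video_feats: (N, D) first features retained for the key video data.
    prompt_feats:    (M, D) features extracted from the prompt information.
    Returns the top-k matching features (the fourth features in the text above).
    """
    key = F.normalize(key_video_feats, dim=-1)
    qry = F.normalize(prompt_feats, dim=-1)
    scores = (key @ qry.T).max(dim=1).values          # best match over prompt features
    topk = torch.topk(scores, k=min(k, key.shape[0])).indices
    return key_video_feats[topk]
```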
In the practical application of the large language model, the Hard Prompt design and the Soft Prompt design are combined.
The Hard Prompt design mainly has the following effects: ChatGPT is used to assist in confirming and supplementing the prompt information input by the user, ensuring that the large language model can understand the user's intention more accurately; the large language model is guided, through multi-round dialogue, to identify the specific events in the video related to the occurrence of the anomaly; and after multiple rounds of iteration, the VLM can focus on the time period or video segment most relevant to the question.
Effect of the Soft Prompt design: the Soft Prompt is embedded into the large language model in the form of learnable parameter vectors, so that the model's understanding of different video scenes and questions can be adjusted in a more flexible way, further helping the model mine the inherent logic of an abnormal event in depth.
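A minimal sketch of a Soft Prompt implemented as learnable parameter vectors prepended to the model's input embeddings; the token count and embedding dimension are illustrative, and how the vectors are wired into a specific large language model is not detailed in this description.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the LLM input embeddings (illustrative)."""

    def __init__(self, num_tokens: int = 16, embed_dim: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) embeddings of the hard prompt and video tokens
        batch = input_embeds.shape[0]
        soft = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([soft, input_embeds], dim=1)
```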
The whole model execution flow mainly comprises data preprocessing, prompt design and interaction and answer generation.
In the data preprocessing stage:
Extracting a plurality of video frames of the video data according to a preset frame rate; dividing the plurality of video frames into a plurality of areas respectively; extracting the first feature corresponding to each of the plurality of areas and the plurality of second features of the key information by using the preset neural network model; and determining, through a cross-attention operation, at least one first feature that matches the plurality of second features. That is, the top-k video features related to the anomaly information are extracted through CLIP.
In the Prompt design and interaction stage: a plurality of third features of the prompt information are extracted by using the preset neural network model, and at least one fourth feature that matches the plurality of third features is determined from the at least one first feature by a cross-attention operation. That is, the features corresponding to the prompt information are determined from the top-k video features related to the anomaly information obtained with CLIP.
In the answer generation stage:
The features corresponding to the prompt information, determined from the top-k video features related to the anomaly information, are input into the large language model together with those top-k video features to generate the recognition result; during training, only the top-k selector is fine-tuned, using a cross-entropy loss function.
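The internals of the top-k selector are not detailed in the disclosure; the sketch below shows one plausible form, where a small scoring layer ranks segment features and the selected features are scaled by their sigmoid scores so that gradients from the cross-entropy loss over the generated answer can reach the selector while the language model itself stays frozen. This gradient-scaling trick is an assumption, not part of the patent text.

```python
import torch
import torch.nn as nn


class TopKSelector(nn.Module):
    """Scores per-segment video features and keeps the k most relevant ones (sketch)."""

    def __init__(self, feat_dim: int, k: int = 8):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_segments, feat_dim) video features related to the anomaly information
        scores = self.score(feats).squeeze(-1)
        idx = torch.topk(scores, k=min(self.k, feats.shape[0])).indices
        # Scale by the sigmoid scores so the scoring layer receives gradients.
        return feats[idx] * torch.sigmoid(scores[idx]).unsqueeze(-1)


# During training only selector.parameters() are passed to the optimizer; the
# cross-entropy loss is computed over the tokens of the generated recognition result.
```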
In some embodiments, the large language model is obtained in advance through training, and the training process of the large language model includes:
Acquiring sample data, wherein the sample data comprises a plurality of sample video data and sample labeling information corresponding to each sample video data;
For first sample video data, inputting the first sample video data and first sample labeling information corresponding to the first sample video data into the large language model to obtain a predicted recognition result, wherein the first sample video data is any one of the plurality of sample video data;
generating a tag identification result based on the first sample labeling information;
calculating a loss function value based on the predictive recognition result and the tag recognition result;
and based on the loss function value, adjusting model parameters of the large-scale language model, and continuing training until a preset training completion condition is met, so as to obtain the trained large-scale language model.
For sample data, a series of training sample datasets have been created in the related art, mainly divided into weakly supervised and semi-supervised types. Weakly supervised benchmarks mainly include UCF-Crime and XD-Violence. The UCF-Crime dataset contains about 100 GB of video data from real surveillance, about 1900 videos totalling 128 hours, but its overall quality is low: some videos are played repeatedly, are too short, contain only a single modality, or show abnormal behaviors that are difficult to judge. By comparison, the scenes in the XD-Violence dataset are richer; it is not limited to surveillance video and contains both video and audio modalities. Semi-supervised benchmarks mainly include UBnormal and Street Scene. UBnormal is a dataset of 29 virtual scenes and 236,902 video frames, with annotation information at both the pixel level and the video level; the Street Scene data includes 46 training and 35 test videos captured from a USB camera and contains 205 naturally occurring anomalies (such as jaywalking and illegal U-turns). These works mainly help the model capture the temporal information of abnormal events through frame-level and pixel-level annotation information.
In view of the deficiencies of the training sample datasets in the related art, the sample dataset in the embodiments of the present application focuses on solving a more practical problem: not only identifying abnormal events in video, but also understanding in depth why these anomalies occur, what their specific manifestations are, and what their severity or scope of influence is. In particular, the sample dataset in the embodiment of the present application has the following features. 1. Detailed manual annotation: each sample video is carefully annotated by hand to provide the anomaly type, the start and end times, a detailed event description, and a natural-language explanation of the cause of the anomaly and its effects. 2. Long-term temporal relevance: the annotations emphasize the long-term relevance between video segments, requiring the model to capture relationships between distant frames; for example, in a traffic accident scenario, the model needs to infer the cause of an accident from a key event (e.g., a sudden stop of a vehicle) several seconds earlier. 3. Longer videos of higher quality: the CUVA dataset contains real-world videos with an average duration of 117 seconds and provides high-quality text annotations, which, compared with the shorter video instances in existing datasets, is closer to the complex video anomaly understanding and analysis requirements of real scenarios.
The construction process of the sample dataset in the embodiment of the application mainly comprises data collection and fine-grained annotation. First, 1,000 video samples containing abnormal events are selected from the real world, covering 10 major classes and 42 minor classes of anomaly types; then the collected videos are annotated, with the annotation content including: the anomaly type, the start and end times with a specific event description, a natural-language interpretation of the cause of the anomaly, and a free-text description of the root cause of the anomaly and its effects.
In summary, the sample data includes a plurality of sample video data and sample labeling information corresponding to each sample video data.
Further, for any sample video data of the plurality of sample video data, the sample video data and sample marking information corresponding to the sample video data are input into a large language model, and a prediction recognition result is obtained.
The predicted recognition result is generally a predicted text description of the abnormal event in the sample video data.
Thus, a corresponding tag recognition result, i.e., a correct text description corresponding to the sample video data, can be generated based on the sample annotation information of the sample video data.
Calculating a loss function value based on the predictive recognition result and the tag recognition result; and based on the loss function value, adjusting model parameters of the large-scale language model, and continuing training until a preset training completion condition is met, so as to obtain the trained large-scale language model.
The preset training completion condition may be that the loss function value is smaller than the preset loss function value, or the training frequency reaches the preset training frequency, and the preset loss function value and the preset training frequency may be flexibly set based on actual conditions.
The large language model may include a plurality of multi-modal models, such as BLIP-2, LLaVA, mPLUG-Owl, VideoChat, and the like; that is, the output of the large language model may include a plurality of predicted recognition results, and each predicted recognition result can be compared with the correct text description to compute a corresponding loss function value, so as to adjust the corresponding model parameters.
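A hedged sketch of a single training step matching the loss computation described above; the keyword arguments of the model call and the preparation of token ids are assumptions, since each multi-modal model (BLIP-2, LLaVA, mPLUG-Owl, VideoChat, etc.) exposes its own interface.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, video_feats, annotation_ids, label_ids):
    """One training step: predict a recognition result, compute the loss, update parameters.

    video_feats:    features of the first sample video data.
    annotation_ids: token ids of the first sample labeling information.
    label_ids:      token ids of the label recognition result (padding marked as -100).
    """
    model.train()
    optimizer.zero_grad()
    logits = model(video=video_feats, input_ids=annotation_ids).logits  # hypothetical signature
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           label_ids.view(-1), ignore_index=-100)
    loss.backward()
    optimizer.step()
    return loss.item()  # training stops once the preset completion condition is met
```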
In some embodiments, the method further comprises:
acquiring description information of video data to be identified;
Determining evaluation reference text information of the recognition result based on the description information;
and evaluating the output result of the large language model based on the evaluation reference text information.
It will be appreciated that, upon obtaining the output of a large language model, the output may be evaluated to evaluate the large language model.
Evaluation metrics for models in the related art include ROUGE, BLEURT, BLEU, and the like, all of which are methods for evaluating the output of natural language generation systems. BLEU mainly uses an N-gram method to evaluate the similarity between machine-generated text and the reference text; ROUGE mainly measures how much of the key information in the reference text is contained in the generated text; BLEURT uses BERT embeddings for text representation and scores the relationship between the system output and the reference text, extending BLEU-style rating. In addition, there are GPT-based evaluation methods such as TouchStone and FunQA: TouchStone uses a large language model as the evaluator, constructs a comprehensive visual dialogue dataset, converts multimodal input into a form understandable to LLMs by using detailed image annotations, and directly evaluates the dialogue quality of LVLMs; FunQA specifies three rigorous tasks to measure model understanding of counter-intuitive video, including counter-intuitive timestamp localization, detailed video description, and counter-intuitive reasoning.
In the embodiment of the application, the evaluation reference text information and the output result are input into the preset evaluation model, so that the preset evaluation model evaluates the large language model based on the evaluation reference text information and the output result.
The preset evaluation model can be a multi-modal evaluation model; during its construction, Video-ChatGPT is adopted as the base model, and natural language prompts are used to guide the model in determining the task type to be evaluated.
In some embodiments, evaluating the output results of the large language model based on the evaluation reference text information includes:
Determining an importance curve of the video data based on the video data;
Determining target evaluation reference text information corresponding to prompt information in the evaluation reference text information;
and evaluating the output result by combining the video data, the importance curve and the target evaluation reference text information.
In order to evaluate the large language model comprehensively, in addition to evaluating it based on the evaluation reference text information and the output result, and since the input of the large language model includes the prompt information, target evaluation reference text information corresponding to the prompt information may be generated from the evaluation reference text information, and the large language model may be evaluated based on this target evaluation reference text information.
The importance curve may be an importance weight corresponding to each video frame in the video data, and the importance curve may be input by a user or obtained through a feature importance analysis or model interpretation method.
Furthermore, the output result can be evaluated by combining the video data, the importance curve, and the target evaluation reference text information, so that the evaluation better matches how humans understand the causal relationships of abnormal events; this is particularly suitable for measuring the large language model's ability to recognize and explain the causes and effects of abnormal events during video anomaly analysis.
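The disclosure does not specify how the importance curve enters the evaluation; purely as an assumed illustration, per-frame agreement scores between the output result and the target evaluation reference text could be aggregated with the curve as weights.

```python
import numpy as np


def importance_weighted_score(per_frame_scores: np.ndarray,
                              importance_curve: np.ndarray) -> float:
    """Aggregate hypothetical per-frame agreement scores using the importance curve as weights."""
    weights = importance_curve / (importance_curve.sum() + 1e-8)
    return float((per_frame_scores * weights).sum())
```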
In some embodiments, the large language model includes a plurality of multi-modal models, the recognition result includes a plurality of sub-recognition results, and the evaluating the output result in combination with the video data, the importance curve, and the target evaluation reference text information includes:
Determining evaluation results corresponding to the plurality of sub-recognition results by combining the video data, the importance curve and the target evaluation reference text information, wherein the evaluation results comprise at least one of the following: the optimal result among the plurality of sub-recognition results, the scores corresponding to the plurality of sub-recognition results, the evaluation order of the plurality of sub-recognition results, and the reason for the evaluation order.
Since the recognition result includes a plurality of sub-recognition results, the corresponding model is evaluated based on each sub-recognition result. By scoring and ranking the sub-recognition results, the optimal model among the large language models and the reason for the output evaluation order can be made clear, so that the user can fully understand the basis of the evaluation.
In order to describe the above evaluation process in detail, an embodiment of the present application provides a flowchart of the evaluation process. As shown in fig. 2, video data is obtained, and the video data has corresponding annotation information, which includes time information, description information, the result, the cause, and the type. An importance curve of the video data is further obtained from the video data. The video data and the prompt information are input into the large language model, which includes a VideoChat model, a Video-ChatGPT model, an Otter model, a Video-LLaMA model, and an mPLUG model. Evaluation reference text information is generated based on the annotation information, and the target evaluation reference text information corresponding to the prompt information is determined in the evaluation reference text information. The video data, the importance curve, the target evaluation reference text information, and the plurality of sub-recognition results are then input into the preset evaluation model, so that the preset evaluation model outputs the score corresponding to each sub-recognition result, the evaluation order of the sub-recognition results, and the reason for that evaluation order.
The embodiment of the application also provides a video identification device which is used for executing the video identification method provided by any embodiment. As shown in fig. 3, the apparatus includes: an acquisition module 301, an extraction module 302, a determination module 303 and an output module 304.
The acquiring module 301 is configured to acquire video data to be identified and key information corresponding to the video data, where the key information is description information corresponding to an abnormal event in the video data;
an extracting module 302, configured to extract key video data corresponding to the key information in the video data;
A determining module 303, configured to obtain a prompt message for a recognition result;
The determining module 303 is further configured to determine target video data corresponding to the prompt information from the key video data;
And the output module 304 is configured to input the target video data and the prompt information into a large language model, and output a recognition result for the prompt information.
The embodiment of the application provides a video identification device. Firstly, the video data to be identified and key information corresponding to the video data are acquired; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiment of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
In some embodiments, the extraction module 302 is to:
extracting a plurality of video frames of the video data according to a preset frame rate;
Dividing the plurality of video frames into a plurality of regions, respectively;
Extracting a first feature corresponding to each of the plurality of areas and a plurality of second features of the key information by using a preset neural network model;
Determining at least one first feature matching the plurality of second features by a cross-attention operation;
and determining the video data corresponding to the at least one first feature as key video data.
In some embodiments, the extracting module 302 is specifically configured to:
extracting a plurality of third features of the prompt information by using a preset neural network model;
Determining at least one fourth feature matching the plurality of third features from the at least one first feature by a cross-attention operation;
and determining the key video data corresponding to the at least one fourth feature as target video data.
In some embodiments, the large language model is obtained in advance through training, and the training process of the large language model includes:
Acquiring sample data, wherein the sample data comprises a plurality of sample video data and sample labeling information corresponding to each sample video data;
inputting the first sample video data and first sample marking information corresponding to the first sample video data into a large language model to obtain a prediction recognition result aiming at the first sample video data, wherein the first sample video data is any sample video data in the plurality of sample video data;
generating a tag identification result based on the first sample labeling information;
Calculating a loss function value based on the predictive recognition result and the tag recognition result;
And adjusting model parameters of the large-scale language model based on the loss function value, and continuing training until a preset training completion condition is met, so as to obtain the trained large-scale language model.
In some embodiments, the apparatus further comprises:
the obtaining module 301 is further configured to obtain annotation information of the video data to be identified;
the determining module 303 is further configured to determine evaluation reference text information of the identification result based on the labeling information;
And the evaluation module is used for evaluating the output result of the large language model based on the evaluation reference text information.
In some embodiments, the evaluation module is specifically configured to:
Determining an importance curve of the video data based on the video data;
Determining target evaluation reference text information corresponding to the prompt information in the evaluation reference text information;
and evaluating the output result by combining the video data, the importance curve and the target evaluation reference text information.
In some embodiments, the large language model includes a plurality of multi-modal models, the recognition result includes a plurality of sub-recognition results, and the evaluation module is further specifically configured to:
Determining evaluation results corresponding to the plurality of sub-recognition results by combining the video data, the importance curve and the target evaluation reference text information, wherein the evaluation results comprise at least one of the following: the optimal result of the plurality of sub-recognition results, the scores corresponding to the plurality of sub-recognition results, the evaluation sequence of the plurality of sub-recognition results and the reason of the evaluation sequence.
The video identification device provided by the embodiment of the application and the video identification method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the video identification device and the video identification method provided by the embodiment of the application due to the same inventive concept.
The embodiment of the application also provides electronic equipment for executing the video identification method. Referring to fig. 4, a schematic diagram of an electronic device according to some embodiments of the present application is shown. As shown in fig. 4, the electronic device 7 includes: processor 700, memory 701, bus 702, and communication interface 703, processor 700, communication interface 703, and memory 701 being connected by bus 702; the memory 701 stores a computer program executable on the processor 700, and the processor 700 executes the video recognition method according to any of the foregoing embodiments of the present application when the computer program is executed.
The memory 701 may include a high-speed random access memory (Random Access Memory, RAM), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between this device's network element and at least one other network element is achieved through at least one communication interface 703 (which may be wired or wireless); the Internet, a wide area network, a local area network, a metropolitan area network, or the like may be used.
Bus 702 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. The memory 701 is configured to store a program, and the processor 700 executes the program after receiving an execution instruction, and the video recognition method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 700 or implemented by the processor 700.
The processor 700 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or by software instructions in the processor 700. The processor 700 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on. The steps of the method disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, a register, or another storage medium well known in the art. The storage medium is located in the memory 701; the processor 700 reads the information in the memory 701 and, in combination with its hardware, performs the steps of the above method.
The electronic equipment provided by the embodiment of the application and the video identification method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic equipment and the video identification method provided by the embodiment of the application due to the same inventive concept.
The embodiment of the present application further provides a computer readable storage medium corresponding to the video recognition method provided in the foregoing embodiment, referring to fig. 5, the computer readable storage medium is shown as an optical disc 30, on which a computer program (i.e. a program product) is stored, and the computer program when executed by a processor performs the video recognition method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer readable storage medium provided by the above embodiment of the present application has the same advantageous effects as the method adopted, operated or implemented by the application program stored therein, because of the same inventive concept as the video recognition method provided by the embodiment of the present application.
It should be noted that:
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1.一种视频识别方法,其特征在于,包括:1. A video recognition method, comprising: 获取待识别的视频数据和所述视频数据对应的关键信息,所述关键信息为所述视频数据中异常事件对应的描述信息;Acquire video data to be identified and key information corresponding to the video data, wherein the key information is description information corresponding to an abnormal event in the video data; 提取所述视频数据中与所述关键信息对应的关键视频数据;Extracting key video data corresponding to the key information from the video data; 获取针对识别结果的提示信息;Get prompt information for recognition results; 在所述关键视频数据中确定与所述提示信息对应的目标视频数据;Determining target video data corresponding to the prompt information in the key video data; 将所述目标视频数据以及所述提示信息输入大型语言模型,输出针对所述提示信息的识别结果。The target video data and the prompt information are input into a large language model, and a recognition result for the prompt information is output. 2.根据权利要求1所述的方法,其特征在于,所述提取所述视频数据中与所述关键信息对应的关键视频数据,包括:2. The method according to claim 1, characterized in that the step of extracting key video data corresponding to the key information from the video data comprises: 按照预设帧率提取所述视频数据的多个视频帧;Extracting a plurality of video frames of the video data according to a preset frame rate; 将所述多个视频帧分别分割成多个区域;Dividing the plurality of video frames into a plurality of regions respectively; 利用预设神经网络模型提取所述多个区域各自对应的第一特征以及所述关键信息的多个第二特征;Extracting first features corresponding to each of the plurality of regions and a plurality of second features of the key information using a preset neural network model; 通过交叉注意力操作确定与所述多个第二特征相匹配的至少一个第一特征;determining, by a cross-attention operation, at least one first feature that matches the plurality of second features; 将所述至少一个第一特征对应的视频数据确定为关键视频数据。The video data corresponding to the at least one first feature is determined as key video data. 3.根据权利要求2所述的方法,其特征在于,所述在所述关键视频数据中确定与所述提示信息对应的目标视频数据,包括:3. The method according to claim 2, characterized in that the step of determining the target video data corresponding to the prompt information in the key video data comprises: 利用预设神经网络模型提取所述提示信息的多个第三特征;Extracting a plurality of third features of the prompt information using a preset neural network model; 通过交叉注意力操作在所述至少一个第一特征确定与所述多个第三特征相匹配的至少一个第四特征;Determining at least one fourth feature matching the plurality of third features on the at least one first feature through a cross-attention operation; 将所述至少一个第四特征对应的关键视频数据确定为目标视频数据。The key video data corresponding to the at least one fourth feature is determined as the target video data. 4.根据权利要求1-3任一项所述的方法,其特征在于,所述大型语言模型预先通过训练得到,所述大型语言模型的训练过程包括:4. 
The method according to any one of claims 1 to 3, characterized in that the large language model is obtained in advance through training, and the training process of the large language model comprises: 获取样本数据,所述样本数据包括多个样本视频数据以及各样本视频数据对应的样本标注信息;Acquire sample data, where the sample data includes a plurality of sample video data and sample annotation information corresponding to each sample video data; 针对第一样本视频数据,将所述第一样本视频数据以及所述第一样本视频数据对应的第一样本标注信息输入大型语言模型,得到预测识别结果,所述第一样本视频数据为所述多个样本视频数据中的任一样本视频数据;For first sample video data, inputting the first sample video data and first sample annotation information corresponding to the first sample video data into a large language model to obtain a prediction recognition result, wherein the first sample video data is any sample video data among the multiple sample video data; 基于所述第一样本标注信息生成标签识别结果;Generate a label recognition result based on the first sample annotation information; 基于所述预测识别结果和所述标签识别结果计算损失函数值;Calculate a loss function value based on the predicted recognition result and the label recognition result; 基于所述损失函数值,调整所述大型语言模型的模型参数,继续训练,直至满足预设的训练完成条件,得到训练好的大型语言模型。Based on the loss function value, the model parameters of the large language model are adjusted, and the training is continued until a preset training completion condition is met to obtain a trained large language model. 5.根据权利要求1所述的方法,其特征在于,所述方法还包括:5. The method according to claim 1, characterized in that the method further comprises: 获取所述待识别的视频数据的标注信息;Acquire labeling information of the video data to be identified; 基于所述标注信息确定所述识别结果的评价参考文本信息;Determining evaluation reference text information of the recognition result based on the annotation information; 基于所述评价参考文本信息评价所述大型语言模型的输出结果。The output result of the large language model is evaluated based on the evaluation reference text information. 6.根据权利要求5所述的方法,其特征在于,基于所述评价参考文本信息评价所述大型语言模型的输出结果,包括:6. The method according to claim 5, characterized in that evaluating the output result of the large language model based on the evaluation reference text information comprises: 基于所述视频数据确定所述视频数据的重要性曲线;determining an importance curve of the video data based on the video data; 确定所述评价参考文本信息中对应于所述提示信息的目标评价参考文本信息;Determine target evaluation reference text information corresponding to the prompt information in the evaluation reference text information; 结合所述视频数据、所述重要性曲线和所述目标评价参考文本信息对所述输出结果进行评价。The output result is evaluated in combination with the video data, the importance curve and the target evaluation reference text information. 7.根据权利要求6所述的方法,其特征在于,所述大型语言模型包括多个多模态模型,所述识别结果包括多个子识别结果,所述结合所述视频数据、所述重要性曲线和所述目标评价参考文本信息对所述输出结果进行评价,包括:7. 
7. The method according to claim 6, wherein the large language model comprises a plurality of multimodal models, the recognition result comprises a plurality of sub-recognition results, and evaluating the output result in combination with the video data, the importance curve and the target evaluation reference text information comprises:
determining, in combination with the video data, the importance curve and the target evaluation reference text information, evaluation results corresponding to the plurality of sub-recognition results, the evaluation results comprising at least one of: an optimal result among the plurality of sub-recognition results, a score corresponding to each of the plurality of sub-recognition results, an evaluation order of the plurality of sub-recognition results, and reasons for the evaluation order.

8. A video recognition apparatus, comprising:
an acquisition module configured to acquire video data to be recognized and key information corresponding to the video data, the key information being description information corresponding to an abnormal event in the video data;
an extraction module configured to extract key video data corresponding to the key information from the video data;
a determination module configured to acquire prompt information for a recognition result;
the determination module being further configured to determine, in the key video data, target video data corresponding to the prompt information; and
an output module configured to input the target video data and the prompt information into a large language model and output a recognition result for the prompt information.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor runs the computer program to implement the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 7.
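By way of illustration of claims 2 and 3, the frame selection can be read as a text-guided cross-attention over frame regions. The sketch below is a minimal, assumption-laden example: the encoders that would produce the first, second and third features are replaced by random tensors, and the feature sizes and top-k cut-offs are arbitrary choices for the example, not values taken from the patent.

```python
# Minimal sketch of the cross-attention selection in claims 2-3 (illustrative only).
import torch
import torch.nn.functional as F

def select_key_frames(region_feats, text_feats, top_k=8):
    """region_feats: (num_frames, num_regions, d) region ("first") features per frame.
    text_feats:   (num_tokens, d) features of the guiding text.
    Returns indices of the top_k frames whose regions attend most strongly to the text."""
    q = text_feats                                   # queries: text tokens
    k = region_feats.flatten(0, 1)                   # keys: every region of every frame
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (num_tokens, frames*regions)
    # attention mass received by each frame, summed over text tokens and regions
    per_frame = attn.sum(dim=0).view(region_feats.shape[0], -1).sum(dim=-1)
    return per_frame.topk(min(top_k, per_frame.numel())).indices

# toy usage: random tensors stand in for the outputs of the preset neural network model
frames = torch.randn(32, 16, 256)   # 32 sampled frames, 16 regions each, feature dim 256
key_info = torch.randn(6, 256)      # "second" features of the abnormal-event description
key_idx = select_key_frames(frames, key_info)                              # claim 2
prompt = torch.randn(10, 256)       # "third" features of the prompt information
target_idx = key_idx[select_key_frames(frames[key_idx], prompt, top_k=4)]  # claim 3
print("key frames:", key_idx.tolist(), "target frames:", target_idx.tolist())
```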
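The training process in claim 4 follows a conventional supervised loop over (sample video, annotation) pairs. A rough sketch, assuming the large language model is wrapped as a torch module whose forward pass returns per-token logits and which exposes a hypothetical build_label helper for turning sample annotation information into label token ids:

```python
# Rough sketch of the claim 4 training loop; the model interface is hypothetical.
import torch
import torch.nn.functional as F

def train(model, dataloader, optimizer, max_steps=1000):
    model.train()
    for step, (sample_video, sample_annotation) in enumerate(dataloader):
        # predicted recognition result for this sample video data
        pred_logits = model(sample_video, sample_annotation)   # (seq_len, vocab_size)
        # label recognition result generated from the sample annotation information
        label_ids = model.build_label(sample_annotation)       # (seq_len,) token ids
        loss = F.cross_entropy(pred_logits, label_ids)          # loss function value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                        # adjust model parameters
        if step + 1 >= max_steps:                               # preset completion condition
            break
    return model
```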
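For the evaluation of claims 5 to 7, one simplified reading is to embed each sub-recognition result and the target evaluation reference text in a shared space, weight the comparison by the importance of the video content, and rank the models. The sketch below compresses the importance curve into a single scalar weight and uses cosine similarity as a stand-in for the full evaluation; both are simplifications for illustration.

```python
# Illustrative ranking of sub-recognition results against the reference text (claims 6-7).
import torch
import torch.nn.functional as F

def rank_sub_results(sub_result_embs, reference_emb, importance_weight=1.0):
    """sub_result_embs: (num_models, d) embeddings of each multimodal model's answer.
    reference_emb:   (d,) embedding of the target evaluation reference text."""
    scores = importance_weight * F.cosine_similarity(
        sub_result_embs, reference_emb.unsqueeze(0), dim=-1)   # score per sub-result
    order = torch.argsort(scores, descending=True)             # evaluation order
    return order[0].item(), scores.tolist(), order.tolist()    # optimal result, scores, order

# toy usage with random embeddings standing in for a text encoder
subs = torch.randn(3, 256)   # answers from three multimodal models
ref = torch.randn(256)       # target evaluation reference text
best, scores, order = rank_sub_results(subs, ref)
print("best:", best, "scores:", scores, "order:", order)
```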
CN202410395016.0A 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media Active CN118366075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410395016.0A CN118366075B (en) 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410395016.0A CN118366075B (en) 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media

Publications (2)

Publication Number Publication Date
CN118366075A true CN118366075A (en) 2024-07-19
CN118366075B CN118366075B (en) 2025-12-02

Family

ID=91877379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410395016.0A Active CN118366075B (en) 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media

Country Status (1)

Country Link
CN (1) CN118366075B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120234651A (en) * 2025-03-28 2025-07-01 山东云天安全技术有限公司 A target video-oriented abnormality recognition method, electronic device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269913A (en) * 2022-07-01 2022-11-01 深圳先进技术研究院 Video retrieval method based on attention fragment prompt
CN115802028A (en) * 2022-11-03 2023-03-14 中国农业银行股份有限公司 A video anomaly detection method, device, electronic equipment and storage medium
CN117336525A (en) * 2023-09-26 2024-01-02 Oppo广东移动通信有限公司 Video processing method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DU HANG et al.: "Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly", Computer Vision and Pattern Recognition, 30 April 2024 (2024-04-30), pages 1 - 19 *
ZOU LING et al.: "Video clip extraction method based on user interest" (基于用户兴趣的视频片段提取方法), 《中国科技论文》, vol. 13, no. 2, 31 December 2018 (2018-12-31), pages 202 - 207 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120234651A (en) * 2025-03-28 2025-07-01 山东云天安全技术有限公司 A target video-oriented abnormality recognition method, electronic device and storage medium
CN120234651B (en) * 2025-03-28 2025-10-21 山东云天安全技术有限公司 A target video-oriented anomaly recognition method, electronic device, and storage medium

Also Published As

Publication number Publication date
CN118366075B (en) 2025-12-02

Similar Documents

Publication Publication Date Title
CN111294646B (en) Video processing method, device, equipment and storage medium
CN114996513B (en) Video question answering method and system based on cross-modal prompt learning
CN115828112B (en) Fault event response method and device, electronic equipment and storage medium
CN115994317B (en) Incomplete multi-view multi-label classification method and system based on depth contrast learning
Le et al. An overview of deep learning in industry
CN114898466B (en) A video action recognition method and system for smart factories
CN115471771A (en) Video time sequence action positioning method based on semantic level time sequence correlation modeling
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN117315249A (en) Refers to image segmentation model training and segmentation methods, systems, equipment and media
CN115883878B (en) Video editing method, device, electronic equipment and storage medium
CN110968721A (en) Method and system for searching infringement of mass images and computer readable storage medium thereof
CN115147641A (en) A Video Classification Method Based on Knowledge Distillation and Multimodal Fusion
CN118366075B (en) Video recognition methods, devices, equipment and storage media
CN117115474A (en) An end-to-end single target tracking method based on multi-stage feature extraction
CN114973202A (en) Traffic scene obstacle detection method based on semantic segmentation
WO2024038114A1 (en) Determining failure cases in trained neural networks using generative neural networks
CN119578393B (en) Intelligent alarm processing process report generation method and system based on law enforcement recorder video
CN114626430A (en) Emotion recognition model training method, emotion recognition device and emotion recognition medium
CN120316709A (en) Multimodal understanding optimization method based on fine-grained feature extraction and global information integration
CN120296626A (en) Method and system for analyzing abnormality of measurement point time series based on multi-modal large model
CN119513697A (en) Sow estrus behavior recognition method and system based on multimodal feature fusion
CN119810702A (en) Video data processing method, device, electronic device and readable storage medium
CN119048121A (en) Power grid violation supervision method, terminal equipment and storage medium
CN117453951A (en) Model training method, data retrieval device and electronic equipment
CN116883886A (en) A weakly supervised temporal language localization method and device based on dual-level contrastive learning and noise robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant