
CN118366075A - Video recognition method, device, equipment and storage medium - Google Patents

Video recognition method, device, equipment and storage medium

Info

Publication number
CN118366075A
Authority
CN
China
Prior art keywords
video data
information
key
sample
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410395016.0A
Other languages
Chinese (zh)
Other versions
CN118366075B (en)
Inventor
南国顺
杜航
崔琪楣
张嘉阳
谢滨竹
张驷乘
许峻瑞
陶小峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202410395016.0A priority Critical patent/CN118366075B/en
Publication of CN118366075A publication Critical patent/CN118366075A/en
Application granted granted Critical
Publication of CN118366075B publication Critical patent/CN118366075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present application proposes a video recognition method, device, equipment and storage medium. The method comprises: obtaining video data to be recognized and key information corresponding to the video data; extracting key video data corresponding to the key information from the video data; obtaining prompt information for the recognition result; determining target video data corresponding to the prompt information in the key video data; and sending the target video data and the prompt information to a large language model, and outputting the recognition result for the prompt information. By incorporating the prompt information for the recognition result, the embodiment of the present application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while still covering the key data, so the recognition efficiency is high.

Description

Video identification method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a video identification method, a device, equipment and a storage medium.
Background
Currently, Video Anomaly Understanding (VAU for short) is critical for promptly identifying potential safety hazards, optimizing operational workflows, and improving safety and efficiency; for example, it can monitor abnormal conditions in scenarios such as traffic monitoring, environmental monitoring, and industrial production.
VAU can automatically identify and understand abnormal events in video. Related VAU technology focuses on two aspects: first, video anomaly detection, that is, identifying abnormal events in a video through an algorithmic model; and second, anomaly localization, that is, determining the start and end positions of an abnormal event on the time axis or its location within the spatial range.
However, the output of current VAU technology is limited to a single form and its efficiency is low.
Disclosure of Invention
The application provides a video identification method, a device, equipment and a storage medium, which can address the technical problems that the output of current VAU technology is limited to a single form and its efficiency is low.
An embodiment of a first aspect of the present application provides a video recognition method, including:
Acquiring video data to be identified and key information corresponding to the video data, wherein the key information is descriptive information corresponding to an abnormal event in the video data;
extracting key video data corresponding to the key information from the video data;
Acquiring prompt information aiming at the identification result;
determining target video data corresponding to the prompt information in the key video data;
And inputting the target video data and the prompt information into a large language model, and outputting a recognition result aiming at the prompt information.
An embodiment of a second aspect of the present application provides a video recognition apparatus, including:
The acquisition module is used for acquiring video data to be identified and key information corresponding to the video data, wherein the key information is descriptive information corresponding to an abnormal event in the video data;
the extraction module is used for extracting key video data corresponding to the key information in the video data;
the determining module is used for acquiring prompt information aiming at the identification result;
the determining module is further configured to determine target video data corresponding to the prompt information in the key video data;
and the output module is used for inputting the target video data and the prompt information into a large language model and outputting the recognition result aiming at the prompt information.
An embodiment of a third aspect of the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor running the computer program to implement the method of the first aspect.
An embodiment of the fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program for execution by a processor to perform the method of the first aspect described above.
The technical scheme provided by the embodiment of the application has at least the following technical effects or advantages:
In the embodiment of the application, firstly, the video data to be identified and key information corresponding to the video data are acquired; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiment of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures.
In the drawings:
FIG. 1 is a flow chart of a video recognition method according to an embodiment of the present application;
FIG. 2 is a flow chart of an evaluation process according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a video recognition device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 5 shows a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
It is noted that unless otherwise indicated, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs.
VAU technology aims to automatically identify and understand abnormal events in video, and researchers have focused on developing and optimizing video anomaly detection and localization techniques, that is, finding and locating anomaly information in video through models. However, current VAU monitoring methods have significant shortcomings in understanding the causal relationships behind abnormal events and their impact.
In order to solve the above problems, embodiments of the present application provide a video identification method, apparatus, device, and storage medium. In the embodiments of the present application, firstly, the video data to be identified and key information corresponding to the video data are obtained; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiments of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
The video recognition method of the present application may be performed by a computing device. The computing device may be a server, for example a single server, a plurality of servers, a server cluster, or a cloud computing platform; optionally, the computing device may also be a terminal device, for example a mobile phone, a tablet computer, a game console, a portable computer, a desktop computer, an advertising machine, an all-in-one machine, or the like.
In the embodiments of the present application, a video recognition method applied to video data to be processed is taken as an example for illustration, and a computing device is used as the execution subject in the various embodiments.
The following describes a video identification method, apparatus, device and storage medium according to embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, the method specifically includes the steps of:
S101, acquiring video data to be identified and key information corresponding to the video data.
The key information is descriptive information corresponding to an abnormal event in the video data.
The video data to be identified can be video data acquired by acquisition equipment in scenes such as traffic monitoring, environment monitoring and industrial production.
The key information may be generated based on a key field input by a person, or may be identified from the video data; for example, if the key field input by the person is "anomaly detection", the generated key information may be that a traffic accident exists in the video segment.
S102, extracting key video data corresponding to the key information in the video data.
It can be understood that, in order to cover the long-term relevance between video segments, the video data to be identified is generally long, which makes the recognition time long and the efficiency low. By extracting the key video data, the amount of video data to be identified can be reduced without losing the long-term relevance between video segments.
Since recognizing the video data is mainly intended to identify the abnormal events contained in it, the key video data corresponding to the key information can be extracted, so that abnormal events in video data with high long-term relevance can be identified from much shorter key video data.
S103, acquiring prompt information aiming at the identification result.
It can be understood that an abnormal event generally involves the cause of the anomaly, the anomaly type, the start and end times, and a detailed event description, while the user may only care about the cause of the anomaly; therefore, prompt information input by the user can be received, so that a recognition result targeted at the prompt information can be output later.
S104, determining target video data corresponding to the prompt information in the key video data.
S105, inputting the target video data and the prompt information into a large language model, and outputting a recognition result aiming at the prompt information.
In order to further improve the recognition efficiency of the video data, the target video data corresponding to the prompt information can be determined in the key video data, so that the model can be more focused on the target video data corresponding to the prompt information, and the model can rapidly output the recognition result corresponding to the prompt information.
After the recognition result is obtained, semantic analysis can be performed on it; if the semantics of the recognition result do not meet the requirement, the large language model is controlled to output the recognition result again. For example, if the recognition result is that the cause of the traffic accident is that the vehicle in front suddenly stopped and parked, and this result obviously does not make semantic sense, the large language model is controlled to output the recognition result again.
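Purely as an illustration of the re-generation loop described above, the logic could be organized as in the following Python sketch; the llm.generate call and the semantic_check helper are hypothetical placeholders and are not part of the disclosed method.

```python
MAX_RETRIES = 3  # illustrative limit, not specified by the method


def recognize_with_semantic_check(llm, target_video_data, prompt, semantic_check):
    """Query the large language model and re-generate when the answer fails a semantic check."""
    result = None
    for _ in range(MAX_RETRIES):
        result = llm.generate(video=target_video_data, prompt=prompt)  # hypothetical API
        if semantic_check(result, prompt):  # e.g. a rule-based or NLI-style consistency test
            return result
    return result  # fall back to the last answer if no attempt passes the check
```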
The embodiment of the application provides a video identification method. Firstly, the video data to be identified and key information corresponding to the video data are acquired; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiment of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
In some embodiments, extracting key video data corresponding to key information in video data includes:
Extracting a plurality of video frames of the video data according to a preset frame rate;
dividing a plurality of video frames into a plurality of areas respectively;
Extracting a first feature corresponding to each of the plurality of areas and a plurality of second features of the key information by using a preset neural network model;
determining at least one first feature that matches the plurality of second features;
and determining the video data corresponding to the at least one first feature as key video data.
The preset frame rate can be flexibly set based on actual conditions.
After extracting a plurality of video frames, in order to analyze the local features of the video frames, the video frames need to be divided into a plurality of regions, i.e., into a plurality of patch blocks.
In some embodiments, partitioning a video frame into multiple patch blocks may be implemented as:
It is first necessary to determine the size of each patch, typically a fixed square or rectangular area.
Next, the overlap and stride of the patches are determined; these set the overlap or spacing between adjacent patches.
The video frame is then cut: a window is moved in the horizontal and vertical directions according to the set patch size and stride, gradually cutting the whole video frame.
Each time the window is moved, the patch area of the current position is extracted and saved as an independent image block.
In addition, boundary conditions need to be considered during the cutting process to ensure that all image areas are cut correctly and that no information is lost.
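A minimal sketch of the patch-cutting procedure above, assuming each frame is a NumPy array in (H, W, C) layout; the patch size and stride values are illustrative, and zero-padding is used as one possible way to handle the boundary condition.

```python
import numpy as np


def extract_patches(frame: np.ndarray, patch_size: int = 32, stride: int = 32) -> np.ndarray:
    """Cut one video frame (H, W, C) into square patches with a sliding window.

    stride == patch_size gives non-overlapping patches; a smaller stride yields
    overlapping patches. The frame is zero-padded so border regions are not lost.
    """
    h, w, _ = frame.shape
    # Boundary handling: pad so the sliding window covers the full frame.
    pad_h = (stride - (h - patch_size) % stride) % stride
    pad_w = (stride - (w - patch_size) % stride) % stride
    frame = np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)))
    patches = []
    for top in range(0, frame.shape[0] - patch_size + 1, stride):
        for left in range(0, frame.shape[1] - patch_size + 1, stride):
            patches.append(frame[top:top + patch_size, left:left + patch_size])
    return np.stack(patches)  # (num_patches, patch_size, patch_size, C)
```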
The preset neural network model may be an image-text pre-training model (Contrastive Language-Image Pretraining, CLIP for short), a vision self-attention model (Vision Transformer, ViT for short), a detection model (Detection Transformer, DETR for short), or another model used for extracting features.
The extraction of the multiple region features is described below taking CLIP as an example:
For each extracted patch block, a trained CLIP model is used to extract its feature representation. The CLIP model maps an image into a semantic space, so the position of the patch block in that semantic space can be understood from its feature vector; in this way, the first features corresponding to the plurality of regions are obtained.
The key information is generally represented as a text description.
To extract the features of the key information with CLIP, the key information in the form of a text description may be input into the CLIP model; the key information passes through a series of encoding layers, including a Transformer or other structure, and is then converted into a numerical representation that the computer can process.
The encoded text is mapped into a high-dimensional semantic space. In this process, the CLIP model learns how best to represent the semantic information of the text such that texts of similar meaning are closer together in semantic space.
CLIP can extract text feature vectors from the text encoding. These feature vectors capture the position and meaning of the text in the semantic space; that is, the plurality of second features of the key information are extracted.
At least one first feature matching the plurality of second features is determined by a cross-attention operation, i.e. in CLIP, a cross-attention calculation is performed between the first features and the second features. This means that the model attends to the important parts of the image feature vectors from the perspective of the text feature vectors, and to the important parts of the text feature vectors from the perspective of the image feature vectors.
After the cross-attention calculation, a similarity score between the text and the image can be obtained. This similarity score reflects the degree of correlation between the text description and the image content. In CLIP, cosine similarity or other similarity measure methods are typically used to calculate this similarity.
Further, a first feature and a second feature whose similarity is greater than a preset similarity threshold may be determined as two features that match each other, where the similarity threshold can be flexibly set according to the actual situation; in this way, at least one first feature matching the plurality of second features is obtained. The at least one first feature represents the features of the video data corresponding to the key information, and the key video data corresponding to the key information can be obtained by combining the at least one first feature.
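The sketch below illustrates the kind of matching described above using the Hugging Face transformers implementation of CLIP; note that it substitutes a plain cosine-similarity comparison for the cross-attention operation as a simplified stand-in, and the similarity threshold is an assumed value.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def match_patches_to_key_info(patch_images, key_texts, threshold=0.25):
    """Return indices of patches (first features) matching any key-information text (second features).

    patch_images: list of PIL images or arrays, one per patch block.
    key_texts:    list of key-information text descriptions.
    """
    inputs = processor(text=key_texts, images=patch_images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sim = img_feat @ txt_feat.T                 # cosine similarity, patches x texts
    matched = (sim > threshold).any(dim=1)      # patch matches at least one key text
    return matched.nonzero(as_tuple=True)[0].tolist()
```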
In some embodiments, determining target video data corresponding to the prompt information in the key video data includes:
Extracting a plurality of third features of the prompt information by using a preset neural network model;
determining at least one first feature and determining at least one fourth feature that matches the plurality of third features;
and determining the key video data corresponding to the at least one fourth feature as target video data.
It will be appreciated that a targeted answer to the prompt is required, and in order to meet this requirement and improve the output efficiency, it is necessary to determine the video data corresponding to the prompt from the key video data.
The prompt information can be text information or voice information input by a user or text information after semantic understanding is carried out on the text information or voice information input by the user.
The process of extracting the plurality of third features of the prompt information with the preset neural network model is consistent with the process of extracting the features of the key information, and is not repeated here.
Further, at least one fourth feature that matches the plurality of third features may be determined from the at least one first feature by a cross-attention operation.
And further determining the key video data corresponding to the at least one fourth feature as target video data corresponding to the prompt information.
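As a further illustration, selecting the key video features that best match the prompt features could look like the sketch below; the value of k is an assumption, and cosine similarity again stands in for the cross-attention matching described above.

```python
import torch
import torch.nn.functional as F


def select_target_features(key_video_feats: torch.Tensor,
                           prompt_feats: torch.Tensor,
                           k: int = 8) -> torch.Tensor:
    """Pick the key-video (first) features that best match the prompt (third) features.

    key_video_feats: (N, D) first features retained for the key video data.
    prompt_feats:    (M, D) features extracted from the prompt information.
    Returns the top-k matching features (the fourth features in the text above).
    """
    key = F.normalize(key_video_feats, dim=-1)
    qry = F.normalize(prompt_feats, dim=-1)
    scores = (key @ qry.T).max(dim=1).values          # best match over prompt features
    topk = torch.topk(scores, k=min(k, key.shape[0])).indices
    return key_video_feats[topk]
```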
In the practical application of the large language model, the Hard Prompt design and the Soft Prompt design are combined.
The Hard Prompt design mainly has the following effects: ChatGPT is used to assist in confirming and supplementing the prompt information input by the user, ensuring that the large language model can understand the user's intention more accurately; the large language model is guided, through multi-round dialogue, to identify the specific events in the video related to the occurrence of the anomaly; and after multiple rounds of iteration, the VLM can focus on the time period or video segment most relevant to the question.
Effect of the Soft Prompt design: the Soft Prompt is embedded into the large language model in the form of learnable parameter vectors, so that the model's understanding of different video scenes and questions can be adjusted in a more flexible way, further helping the model mine the inherent logic of an abnormal event in depth.
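A minimal sketch of a Soft Prompt implemented as learnable parameter vectors prepended to the model's input embeddings; the token count and embedding dimension are illustrative, and how the vectors are wired into a specific large language model is not detailed in this description.

```python
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the LLM input embeddings (illustrative)."""

    def __init__(self, num_tokens: int = 16, embed_dim: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) embeddings of the hard prompt and video tokens
        batch = input_embeds.shape[0]
        soft = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([soft, input_embeds], dim=1)
```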
The whole model execution flow mainly comprises data preprocessing, prompt design and interaction and answer generation.
In the data preprocessing stage:
Extracting a plurality of video frames of the video data according to a preset frame rate; dividing the plurality of video frames into a plurality of areas respectively; extracting the first feature corresponding to each of the plurality of areas and the plurality of second features of the key information by using the preset neural network model; and determining, through a cross-attention operation, at least one first feature that matches the plurality of second features. That is, the top-k video features related to the anomaly information are extracted through CLIP.
In the Prompt design and interaction stage: a plurality of third features of the prompt information are extracted by using the preset neural network model, and at least one fourth feature that matches the plurality of third features is determined from the at least one first feature by a cross-attention operation. That is, the features corresponding to the prompt information are determined from the top-k video features related to the anomaly information obtained with CLIP.
In the answer generation stage:
The features corresponding to the prompt information, determined from the top-k video features related to the anomaly information, are input into the large language model together with those top-k video features to generate the recognition result; during training, only the top-k selector is fine-tuned, using a cross-entropy loss function.
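The internals of the top-k selector are not detailed in the disclosure; the sketch below shows one plausible form, where a small scoring layer ranks segment features and the selected features are scaled by their sigmoid scores so that gradients from the cross-entropy loss over the generated answer can reach the selector while the language model itself stays frozen. This gradient-scaling trick is an assumption, not part of the patent text.

```python
import torch
import torch.nn as nn


class TopKSelector(nn.Module):
    """Scores per-segment video features and keeps the k most relevant ones (sketch)."""

    def __init__(self, feat_dim: int, k: int = 8):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)
        self.k = k

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_segments, feat_dim) video features related to the anomaly information
        scores = self.score(feats).squeeze(-1)
        idx = torch.topk(scores, k=min(self.k, feats.shape[0])).indices
        # Scale by the sigmoid scores so the scoring layer receives gradients.
        return feats[idx] * torch.sigmoid(scores[idx]).unsqueeze(-1)


# During training only selector.parameters() are passed to the optimizer; the
# cross-entropy loss is computed over the tokens of the generated recognition result.
```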
In some embodiments, the large language model is obtained in advance through training, and the training process of the large language model includes:
Acquiring sample data, wherein the sample data comprises a plurality of sample video data and sample labeling information corresponding to each sample video data;
For first sample video data, inputting the first sample video data and first sample labeling information corresponding to the first sample video data into the large language model to obtain a predicted recognition result, wherein the first sample video data is any one of the plurality of sample video data;
generating a tag identification result based on the first sample labeling information;
calculating a loss function value based on the predictive recognition result and the tag recognition result;
and based on the loss function value, adjusting model parameters of the large-scale language model, and continuing training until a preset training completion condition is met, so as to obtain the trained large-scale language model.
For sample data, a series of training sample datasets have been created in the related art, mainly divided into weakly supervised and semi-supervised types. Weakly supervised benchmarks mainly include UCF-Crime and XD-Violence. The UCF-Crime dataset contains about 100 GB of video data from real surveillance, about 1900 videos totalling 128 hours, but its overall quality is low: some videos are played repeatedly, are too short, contain only a single modality, or show abnormal behaviors that are difficult to judge. By comparison, the scenes in the XD-Violence dataset are richer; it is not limited to surveillance video and contains both video and audio modalities. Semi-supervised benchmarks mainly include UBnormal and Street Scene. UBnormal is a dataset of 29 virtual scenes and 236,902 video frames, with annotation information at both the pixel level and the video level; the Street Scene data includes 46 training and 35 test videos captured from a USB camera and contains 205 naturally occurring anomalies (such as jaywalking and illegal U-turns). These works mainly help the model capture the temporal information of abnormal events through frame-level and pixel-level annotation information.
In view of the deficiencies of the training sample datasets in the related art, the sample dataset in the embodiments of the present application focuses on solving a more practical problem: not only identifying abnormal events in video, but also understanding in depth why these anomalies occur, what their specific manifestations are, and what their severity or scope of influence is. In particular, the sample dataset in the embodiment of the present application has the following features. 1. Detailed manual annotation: each sample video is carefully annotated by hand to provide the anomaly type, the start and end times, a detailed event description, and a natural-language explanation of the cause of the anomaly and its effects. 2. Long-term temporal relevance: the annotations emphasize the long-term relevance between video segments, requiring the model to capture relationships between distant frames; for example, in a traffic accident scenario, the model needs to infer the cause of an accident from a key event (e.g., a sudden stop of a vehicle) several seconds earlier. 3. Longer videos of higher quality: the CUVA dataset contains real-world videos with an average duration of 117 seconds and provides high-quality text annotations, which, compared with the shorter video instances in existing datasets, is closer to the complex video anomaly understanding and analysis requirements of real scenarios.
The construction process of the sample dataset in the embodiment of the application mainly comprises data collection and fine-grained annotation. First, 1,000 video samples containing abnormal events are selected from the real world, covering 10 major classes and 42 minor classes of anomaly types; then the collected videos are annotated, with the annotation content including: the anomaly type, the start and end times with a specific event description, a natural-language interpretation of the cause of the anomaly, and a free-text description of the root cause of the anomaly and its effects.
In summary, the sample data includes a plurality of sample video data and sample labeling information corresponding to each sample video data.
Further, for any sample video data of the plurality of sample video data, the sample video data and sample marking information corresponding to the sample video data are input into a large language model, and a prediction recognition result is obtained.
The predicted recognition result is generally a predicted text description of the abnormal event in the sample video data.
Thus, a corresponding tag recognition result, i.e., a correct text description corresponding to the sample video data, can be generated based on the sample annotation information of the sample video data.
Calculating a loss function value based on the predictive recognition result and the tag recognition result; and based on the loss function value, adjusting model parameters of the large-scale language model, and continuing training until a preset training completion condition is met, so as to obtain the trained large-scale language model.
The preset training completion condition may be that the loss function value is smaller than the preset loss function value, or the training frequency reaches the preset training frequency, and the preset loss function value and the preset training frequency may be flexibly set based on actual conditions.
The large language model may include a plurality of multi-modal models, such as BLIP-2, LLaVA, mPLUG-Owl, VideoChat, and the like; that is, the output of the large language model may include a plurality of predicted recognition results, and each predicted recognition result can be compared with the correct text description to compute a corresponding loss function value, so as to adjust the corresponding model parameters.
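A hedged sketch of a single training step matching the loss computation described above; the keyword arguments of the model call and the preparation of token ids are assumptions, since each multi-modal model (BLIP-2, LLaVA, mPLUG-Owl, VideoChat, etc.) exposes its own interface.

```python
import torch
import torch.nn.functional as F


def train_step(model, optimizer, video_feats, annotation_ids, label_ids):
    """One training step: predict a recognition result, compute the loss, update parameters.

    video_feats:    features of the first sample video data.
    annotation_ids: token ids of the first sample labeling information.
    label_ids:      token ids of the label recognition result (padding marked as -100).
    """
    model.train()
    optimizer.zero_grad()
    logits = model(video=video_feats, input_ids=annotation_ids).logits  # hypothetical signature
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           label_ids.view(-1), ignore_index=-100)
    loss.backward()
    optimizer.step()
    return loss.item()  # training stops once the preset completion condition is met
```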
In some embodiments, the method further comprises:
acquiring description information of video data to be identified;
Determining evaluation reference text information of the recognition result based on the description information;
and evaluating the output result of the large language model based on the evaluation reference text information.
It will be appreciated that, upon obtaining the output of a large language model, the output may be evaluated to evaluate the large language model.
Evaluation metrics for models in the related art include ROUGE, BLEURT, BLEU, and the like, all of which are methods for evaluating the output of natural language generation systems. BLEU mainly uses an N-gram method to evaluate the similarity between machine-generated text and the reference text; ROUGE mainly measures how much of the key information in the reference text is contained in the generated text; BLEURT uses BERT embeddings for text representation and scores the relationship between the system output and the reference text, extending BLEU-style rating. In addition, there are GPT-based evaluation methods such as TouchStone and FunQA: TouchStone uses a large language model as the evaluator, constructs a comprehensive visual dialogue dataset, converts multimodal input into a form understandable to LLMs by using detailed image annotations, and directly evaluates the dialogue quality of LVLMs; FunQA specifies three rigorous tasks to measure model understanding of counter-intuitive video, including counter-intuitive timestamp localization, detailed video description, and counter-intuitive reasoning.
In the embodiment of the application, the evaluation reference text information and the output result are input into the preset evaluation model, so that the preset evaluation model evaluates the large language model based on the evaluation reference text information and the output result.
The preset evaluation model can be a multi-modal evaluation model; during its construction, Video-ChatGPT is adopted as the base model, and natural language prompts are used to guide the model in determining the task type to be evaluated.
In some embodiments, evaluating the output results of the large language model based on the evaluation reference text information includes:
Determining an importance curve of the video data based on the video data;
Determining target evaluation reference text information corresponding to prompt information in the evaluation reference text information;
and evaluating the output result by combining the video data, the importance curve and the target evaluation reference text information.
In order to evaluate the large language model comprehensively, in addition to evaluating it based on the evaluation reference text information and the output result, and since the input of the large language model includes the prompt information, target evaluation reference text information corresponding to the prompt information may be generated from the evaluation reference text information, and the large language model may be evaluated based on this target evaluation reference text information.
The importance curve may be an importance weight corresponding to each video frame in the video data, and the importance curve may be input by a user or obtained through a feature importance analysis or model interpretation method.
Furthermore, the output result can be evaluated by combining the video data, the importance curve, and the target evaluation reference text information, so that the evaluation better matches how humans understand the causal relationships of abnormal events; this is particularly suitable for measuring the large language model's ability to recognize and explain the causes and effects of abnormal events during video anomaly analysis.
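The disclosure does not specify how the importance curve enters the evaluation; purely as an assumed illustration, per-frame agreement scores between the output result and the target evaluation reference text could be aggregated with the curve as weights.

```python
import numpy as np


def importance_weighted_score(per_frame_scores: np.ndarray,
                              importance_curve: np.ndarray) -> float:
    """Aggregate hypothetical per-frame agreement scores using the importance curve as weights."""
    weights = importance_curve / (importance_curve.sum() + 1e-8)
    return float((per_frame_scores * weights).sum())
```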
In some embodiments, the large language model includes a plurality of multi-modal models, the recognition result includes a plurality of sub-recognition results, and the evaluating the output result in combination with the video data, the importance curve, and the target evaluation reference text information includes:
Determining evaluation results corresponding to the plurality of sub-recognition results by combining the video data, the importance curve and the target evaluation reference text information, wherein the evaluation results comprise at least one of the following: the optimal result among the plurality of sub-recognition results, the scores corresponding to the plurality of sub-recognition results, the evaluation order of the plurality of sub-recognition results, and the reason for the evaluation order.
Since the recognition result includes a plurality of sub-recognition results, the corresponding model is evaluated based on each sub-recognition result. By scoring and ranking the sub-recognition results, the optimal model among the large language models and the reason for the output evaluation order can be made clear, so that the user can fully understand the basis of the evaluation.
In order to describe the above evaluation process in detail, an embodiment of the present application provides a flowchart of the evaluation process. As shown in fig. 2, video data is obtained, and the video data has corresponding annotation information, which includes time information, description information, the result, the cause, and the type. An importance curve of the video data is further obtained from the video data. The video data and the prompt information are input into the large language model, which includes a VideoChat model, a Video-ChatGPT model, an Otter model, a Video-LLaMA model, and an mPLUG model. Evaluation reference text information is generated based on the annotation information, and the target evaluation reference text information corresponding to the prompt information is determined in the evaluation reference text information. The video data, the importance curve, the target evaluation reference text information, and the plurality of sub-recognition results are then input into the preset evaluation model, so that the preset evaluation model outputs the score corresponding to each sub-recognition result, the evaluation order of the sub-recognition results, and the reason for that evaluation order.
The embodiment of the application also provides a video identification device which is used for executing the video identification method provided by any embodiment. As shown in fig. 3, the apparatus includes: an acquisition module 301, an extraction module 302, a determination module 303 and an output module 304.
The acquiring module 301 is configured to acquire video data to be identified and key information corresponding to the video data, where the key information is description information corresponding to an abnormal event in the video data;
an extracting module 302, configured to extract key video data corresponding to the key information in the video data;
A determining module 303, configured to obtain a prompt message for a recognition result;
The determining module 303 is further configured to determine target video data corresponding to the prompt information from the key video data;
And the output module 304 is configured to input the target video data and the prompt information into a large language model, and output a recognition result for the prompt information.
The embodiment of the application provides a video identification device. Firstly, the video data to be identified and key information corresponding to the video data are acquired; key video data corresponding to the key information is extracted from the video data; prompt information for the recognition result is acquired; target video data corresponding to the prompt information is determined in the key video data; and the target video data and the prompt information are sent to the large language model, which outputs the recognition result for the prompt information. By combining the prompt information for the recognition result, the embodiment of the application can output a more accurate recognition result; and because the video data is processed before recognition, the amount of video data actually recognized is small while the key data is still covered, so the recognition efficiency is high.
In some embodiments, the extraction module 302 is to:
extracting a plurality of video frames of the video data according to a preset frame rate;
Dividing the plurality of video frames into a plurality of regions, respectively;
Extracting a first feature corresponding to each of the plurality of areas and a plurality of second features of the key information by using a preset neural network model;
Determining at least one first feature matching the plurality of second features by a cross-attention operation;
and determining the video data corresponding to the at least one first feature as key video data.
In some embodiments, the extracting module 302 is specifically configured to:
extracting a plurality of third features of the prompt information by using a preset neural network model;
Determining at least one fourth feature matching the plurality of third features from the at least one first feature by a cross-attention operation;
and determining the key video data corresponding to the at least one fourth feature as target video data.
In some embodiments, the large language model is obtained in advance through training, and the training process of the large language model includes:
Acquiring sample data, wherein the sample data comprises a plurality of sample video data and sample labeling information corresponding to each sample video data;
inputting the first sample video data and first sample marking information corresponding to the first sample video data into a large language model to obtain a prediction recognition result aiming at the first sample video data, wherein the first sample video data is any sample video data in the plurality of sample video data;
generating a tag identification result based on the first sample labeling information;
Calculating a loss function value based on the predictive recognition result and the tag recognition result;
And adjusting model parameters of the large-scale language model based on the loss function value, and continuing training until a preset training completion condition is met, so as to obtain the trained large-scale language model.
In some embodiments, the apparatus further comprises:
the obtaining module 301 is further configured to obtain annotation information of the video data to be identified;
the determining module 303 is further configured to determine evaluation reference text information of the identification result based on the labeling information;
And the evaluation module is used for evaluating the output result of the large language model based on the evaluation reference text information.
In some embodiments, the evaluation module is specifically configured to:
Determining an importance curve of the video data based on the video data;
Determining target evaluation reference text information corresponding to the prompt information in the evaluation reference text information;
and evaluating the output result by combining the video data, the importance curve and the target evaluation reference text information.
In some embodiments, the large language model includes a plurality of multi-modal models, the recognition result includes a plurality of sub-recognition results, and the evaluation module is further specifically configured to:
Determining evaluation results corresponding to the plurality of sub-recognition results by combining the video data, the importance curve and the target evaluation reference text information, wherein the evaluation results comprise at least one of the following: the optimal result of the plurality of sub-recognition results, the scores corresponding to the plurality of sub-recognition results, the evaluation sequence of the plurality of sub-recognition results and the reason of the evaluation sequence.
The video identification device provided by the embodiment of the application and the video identification method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the video identification device and the video identification method provided by the embodiment of the application due to the same inventive concept.
The embodiment of the application also provides electronic equipment for executing the video identification method. Referring to fig. 4, a schematic diagram of an electronic device according to some embodiments of the present application is shown. As shown in fig. 4, the electronic device 7 includes: processor 700, memory 701, bus 702, and communication interface 703, processor 700, communication interface 703, and memory 701 being connected by bus 702; the memory 701 stores a computer program executable on the processor 700, and the processor 700 executes the video recognition method according to any of the foregoing embodiments of the present application when the computer program is executed.
The memory 701 may include a high-speed random access memory (Random Access Memory, RAM), and may further include a non-volatile memory, such as at least one disk memory. The communication connection between this device's network element and at least one other network element is achieved through at least one communication interface 703 (which may be wired or wireless); the Internet, a wide area network, a local area network, a metropolitan area network, or the like may be used.
Bus 702 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. The memory 701 is configured to store a program, and the processor 700 executes the program after receiving an execution instruction, and the video recognition method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 700 or implemented by the processor 700.
The processor 700 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the methods described above may be performed by integrated logic circuitry in hardware or by software instructions in the processor 700. The processor 700 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on. The steps of the method disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, a register, or another storage medium well known in the art. The storage medium is located in the memory 701; the processor 700 reads the information in the memory 701 and, in combination with its hardware, performs the steps of the above method.
The electronic equipment provided by the embodiment of the application and the video identification method provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic equipment and the video identification method provided by the embodiment of the application due to the same inventive concept.
The embodiment of the present application further provides a computer readable storage medium corresponding to the video recognition method provided in the foregoing embodiment, referring to fig. 5, the computer readable storage medium is shown as an optical disc 30, on which a computer program (i.e. a program product) is stored, and the computer program when executed by a processor performs the video recognition method provided in any of the foregoing embodiments.
It should be noted that examples of the computer readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical or magnetic storage medium, which will not be described in detail herein.
The computer readable storage medium provided by the above embodiment of the present application has the same advantageous effects as the method adopted, operated or implemented by the application program stored therein, because of the same inventive concept as the video recognition method provided by the embodiment of the present application.
It should be noted that:
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1.一种视频识别方法,其特征在于,包括:1. A video recognition method, comprising: 获取待识别的视频数据和所述视频数据对应的关键信息,所述关键信息为所述视频数据中异常事件对应的描述信息;Acquire video data to be identified and key information corresponding to the video data, wherein the key information is description information corresponding to an abnormal event in the video data; 提取所述视频数据中与所述关键信息对应的关键视频数据;Extracting key video data corresponding to the key information from the video data; 获取针对识别结果的提示信息;Get prompt information for recognition results; 在所述关键视频数据中确定与所述提示信息对应的目标视频数据;Determining target video data corresponding to the prompt information in the key video data; 将所述目标视频数据以及所述提示信息输入大型语言模型,输出针对所述提示信息的识别结果。The target video data and the prompt information are input into a large language model, and a recognition result for the prompt information is output. 2.根据权利要求1所述的方法,其特征在于,所述提取所述视频数据中与所述关键信息对应的关键视频数据,包括:2. The method according to claim 1, characterized in that the step of extracting key video data corresponding to the key information from the video data comprises: 按照预设帧率提取所述视频数据的多个视频帧;Extracting a plurality of video frames of the video data according to a preset frame rate; 将所述多个视频帧分别分割成多个区域;Dividing the plurality of video frames into a plurality of regions respectively; 利用预设神经网络模型提取所述多个区域各自对应的第一特征以及所述关键信息的多个第二特征;Extracting first features corresponding to each of the plurality of regions and a plurality of second features of the key information using a preset neural network model; 通过交叉注意力操作确定与所述多个第二特征相匹配的至少一个第一特征;determining, by a cross-attention operation, at least one first feature that matches the plurality of second features; 将所述至少一个第一特征对应的视频数据确定为关键视频数据。The video data corresponding to the at least one first feature is determined as key video data. 3.根据权利要求2所述的方法,其特征在于,所述在所述关键视频数据中确定与所述提示信息对应的目标视频数据,包括:3. The method according to claim 2, characterized in that the step of determining the target video data corresponding to the prompt information in the key video data comprises: 利用预设神经网络模型提取所述提示信息的多个第三特征;Extracting a plurality of third features of the prompt information using a preset neural network model; 通过交叉注意力操作在所述至少一个第一特征确定与所述多个第三特征相匹配的至少一个第四特征;Determining at least one fourth feature matching the plurality of third features on the at least one first feature through a cross-attention operation; 将所述至少一个第四特征对应的关键视频数据确定为目标视频数据。The key video data corresponding to the at least one fourth feature is determined as the target video data. 4.根据权利要求1-3任一项所述的方法,其特征在于,所述大型语言模型预先通过训练得到,所述大型语言模型的训练过程包括:4. 
The method according to any one of claims 1 to 3, characterized in that the large language model is obtained in advance through training, and the training process of the large language model comprises: 获取样本数据,所述样本数据包括多个样本视频数据以及各样本视频数据对应的样本标注信息;Acquire sample data, where the sample data includes a plurality of sample video data and sample annotation information corresponding to each sample video data; 针对第一样本视频数据,将所述第一样本视频数据以及所述第一样本视频数据对应的第一样本标注信息输入大型语言模型,得到预测识别结果,所述第一样本视频数据为所述多个样本视频数据中的任一样本视频数据;For first sample video data, inputting the first sample video data and first sample annotation information corresponding to the first sample video data into a large language model to obtain a prediction recognition result, wherein the first sample video data is any sample video data among the multiple sample video data; 基于所述第一样本标注信息生成标签识别结果;Generate a label recognition result based on the first sample annotation information; 基于所述预测识别结果和所述标签识别结果计算损失函数值;Calculate a loss function value based on the predicted recognition result and the label recognition result; 基于所述损失函数值,调整所述大型语言模型的模型参数,继续训练,直至满足预设的训练完成条件,得到训练好的大型语言模型。Based on the loss function value, the model parameters of the large language model are adjusted, and the training is continued until a preset training completion condition is met to obtain a trained large language model. 5.根据权利要求1所述的方法,其特征在于,所述方法还包括:5. The method according to claim 1, characterized in that the method further comprises: 获取所述待识别的视频数据的标注信息;Acquire labeling information of the video data to be identified; 基于所述标注信息确定所述识别结果的评价参考文本信息;Determining evaluation reference text information of the recognition result based on the annotation information; 基于所述评价参考文本信息评价所述大型语言模型的输出结果。The output result of the large language model is evaluated based on the evaluation reference text information. 6.根据权利要求5所述的方法,其特征在于,基于所述评价参考文本信息评价所述大型语言模型的输出结果,包括:6. The method according to claim 5, characterized in that evaluating the output result of the large language model based on the evaluation reference text information comprises: 基于所述视频数据确定所述视频数据的重要性曲线;determining an importance curve of the video data based on the video data; 确定所述评价参考文本信息中对应于所述提示信息的目标评价参考文本信息;Determine target evaluation reference text information corresponding to the prompt information in the evaluation reference text information; 结合所述视频数据、所述重要性曲线和所述目标评价参考文本信息对所述输出结果进行评价。The output result is evaluated in combination with the video data, the importance curve and the target evaluation reference text information. 7.根据权利要求6所述的方法,其特征在于,所述大型语言模型包括多个多模态模型,所述识别结果包括多个子识别结果,所述结合所述视频数据、所述重要性曲线和所述目标评价参考文本信息对所述输出结果进行评价,包括:7. 
7. The method according to claim 6, wherein the large language model comprises a plurality of multimodal models, the recognition result comprises a plurality of sub-recognition results, and evaluating the output result in combination with the video data, the importance curve and the target evaluation reference text information comprises:
determining, in combination with the video data, the importance curve and the target evaluation reference text information, evaluation results corresponding to the plurality of sub-recognition results, the evaluation results comprising at least one of: an optimal result among the plurality of sub-recognition results, a score corresponding to each of the plurality of sub-recognition results, an evaluation order of the plurality of sub-recognition results, and reasons for the evaluation order.

8. A video recognition apparatus, comprising:
an acquisition module configured to acquire video data to be recognized and key information corresponding to the video data, the key information being description information corresponding to an abnormal event in the video data;
an extraction module configured to extract key video data corresponding to the key information from the video data;
a determination module configured to acquire prompt information for a recognition result;
the determination module being further configured to determine, in the key video data, target video data corresponding to the prompt information; and
an output module configured to input the target video data and the prompt information into a large language model and output a recognition result for the prompt information.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor runs the computer program to implement the method according to any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 7.
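By way of illustration of claims 2 and 3, the frame selection can be read as a text-guided cross-attention over frame regions. The sketch below is a minimal, assumption-laden example: the encoders that would produce the first, second and third features are replaced by random tensors, and the feature sizes and top-k cut-offs are arbitrary choices for the example, not values taken from the patent.

```python
# Minimal sketch of the cross-attention selection in claims 2-3 (illustrative only).
import torch
import torch.nn.functional as F

def select_key_frames(region_feats, text_feats, top_k=8):
    """region_feats: (num_frames, num_regions, d) region ("first") features per frame.
    text_feats:   (num_tokens, d) features of the guiding text.
    Returns indices of the top_k frames whose regions attend most strongly to the text."""
    q = text_feats                                   # queries: text tokens
    k = region_feats.flatten(0, 1)                   # keys: every region of every frame
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (num_tokens, frames*regions)
    # attention mass received by each frame, summed over text tokens and regions
    per_frame = attn.sum(dim=0).view(region_feats.shape[0], -1).sum(dim=-1)
    return per_frame.topk(min(top_k, per_frame.numel())).indices

# toy usage: random tensors stand in for the outputs of the preset neural network model
frames = torch.randn(32, 16, 256)   # 32 sampled frames, 16 regions each, feature dim 256
key_info = torch.randn(6, 256)      # "second" features of the abnormal-event description
key_idx = select_key_frames(frames, key_info)                              # claim 2
prompt = torch.randn(10, 256)       # "third" features of the prompt information
target_idx = key_idx[select_key_frames(frames[key_idx], prompt, top_k=4)]  # claim 3
print("key frames:", key_idx.tolist(), "target frames:", target_idx.tolist())
```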
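The training process in claim 4 follows a conventional supervised loop over (sample video, annotation) pairs. A rough sketch, assuming the large language model is wrapped as a torch module whose forward pass returns per-token logits and which exposes a hypothetical build_label helper for turning sample annotation information into label token ids:

```python
# Rough sketch of the claim 4 training loop; the model interface is hypothetical.
import torch
import torch.nn.functional as F

def train(model, dataloader, optimizer, max_steps=1000):
    model.train()
    for step, (sample_video, sample_annotation) in enumerate(dataloader):
        # predicted recognition result for this sample video data
        pred_logits = model(sample_video, sample_annotation)   # (seq_len, vocab_size)
        # label recognition result generated from the sample annotation information
        label_ids = model.build_label(sample_annotation)       # (seq_len,) token ids
        loss = F.cross_entropy(pred_logits, label_ids)          # loss function value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                        # adjust model parameters
        if step + 1 >= max_steps:                               # preset completion condition
            break
    return model
```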
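For the evaluation of claims 5 to 7, one simplified reading is to embed each sub-recognition result and the target evaluation reference text in a shared space, weight the comparison by the importance of the video content, and rank the models. The sketch below compresses the importance curve into a single scalar weight and uses cosine similarity as a stand-in for the full evaluation; both are simplifications for illustration.

```python
# Illustrative ranking of sub-recognition results against the reference text (claims 6-7).
import torch
import torch.nn.functional as F

def rank_sub_results(sub_result_embs, reference_emb, importance_weight=1.0):
    """sub_result_embs: (num_models, d) embeddings of each multimodal model's answer.
    reference_emb:   (d,) embedding of the target evaluation reference text."""
    scores = importance_weight * F.cosine_similarity(
        sub_result_embs, reference_emb.unsqueeze(0), dim=-1)   # score per sub-result
    order = torch.argsort(scores, descending=True)             # evaluation order
    return order[0].item(), scores.tolist(), order.tolist()    # optimal result, scores, order

# toy usage with random embeddings standing in for a text encoder
subs = torch.randn(3, 256)   # answers from three multimodal models
ref = torch.randn(256)       # target evaluation reference text
best, scores, order = rank_sub_results(subs, ref)
print("best:", best, "scores:", scores, "order:", order)
```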
CN202410395016.0A 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media Active CN118366075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410395016.0A CN118366075B (en) 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410395016.0A CN118366075B (en) 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media

Publications (2)

Publication Number Publication Date
CN118366075A true CN118366075A (en) 2024-07-19
CN118366075B CN118366075B (en) 2025-12-02

Family

ID=91877379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410395016.0A Active CN118366075B (en) 2024-04-02 2024-04-02 Video recognition methods, devices, equipment and storage media

Country Status (1)

Country Link
CN (1) CN118366075B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120234651A (en) * 2025-03-28 2025-07-01 山东云天安全技术有限公司 A target video-oriented abnormality recognition method, electronic device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269913A (en) * 2022-07-01 2022-11-01 深圳先进技术研究院 Video retrieval method based on attention fragment prompt
CN115802028A (en) * 2022-11-03 2023-03-14 中国农业银行股份有限公司 A video anomaly detection method, device, electronic equipment and storage medium
CN117336525A (en) * 2023-09-26 2024-01-02 Oppo广东移动通信有限公司 Video processing method, device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DU HANG et al.: "Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly", Computer Vision and Pattern Recognition, 30 April 2024 (2024-04-30), pages 1 - 19 *
ZOU LING et al.: "Video clip extraction method based on user interest" (基于用户兴趣的视频片段提取方法), 《中国科技论文》, vol. 13, no. 2, 31 December 2018 (2018-12-31), pages 202 - 207 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120234651A (en) * 2025-03-28 2025-07-01 山东云天安全技术有限公司 A target video-oriented abnormality recognition method, electronic device and storage medium
CN120234651B (en) * 2025-03-28 2025-10-21 山东云天安全技术有限公司 A target video-oriented anomaly recognition method, electronic device, and storage medium

Also Published As

Publication number Publication date
CN118366075B (en) 2025-12-02

Similar Documents

Publication Publication Date Title
CN111294646B (en) Video processing method, device, equipment and storage medium
CN114996513B (en) Video question answering method and system based on cross-modal prompt learning
CN115828112B (en) Fault event response method and device, electronic equipment and storage medium
CN115994317B (en) Incomplete multi-view multi-label classification method and system based on depth contrast learning
Le et al. An overview of deep learning in industry
CN114898466B (en) A video action recognition method and system for smart factories
CN115471771A (en) Video time sequence action positioning method based on semantic level time sequence correlation modeling
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN117315249A (en) Refers to image segmentation model training and segmentation methods, systems, equipment and media
CN115883878B (en) Video editing method, device, electronic equipment and storage medium
CN110968721A (en) Method and system for searching infringement of mass images and computer readable storage medium thereof
CN115147641A (en) A Video Classification Method Based on Knowledge Distillation and Multimodal Fusion
CN118366075B (en) Video recognition methods, devices, equipment and storage media
CN117115474A (en) An end-to-end single target tracking method based on multi-stage feature extraction
CN114973202A (en) Traffic scene obstacle detection method based on semantic segmentation
WO2024038114A1 (en) Determining failure cases in trained neural networks using generative neural networks
CN119578393B (en) Intelligent alarm processing process report generation method and system based on law enforcement recorder video
CN114626430A (en) Emotion recognition model training method, emotion recognition device and emotion recognition medium
CN120316709A (en) Multimodal understanding optimization method based on fine-grained feature extraction and global information integration
CN120296626A (en) Method and system for analyzing abnormality of measurement point time series based on multi-modal large model
CN119513697A (en) Sow estrus behavior recognition method and system based on multimodal feature fusion
CN119810702A (en) Video data processing method, device, electronic device and readable storage medium
CN119048121A (en) Power grid violation supervision method, terminal equipment and storage medium
CN117453951A (en) Model training method, data retrieval device and electronic equipment
CN116883886A (en) A weakly supervised temporal language localization method and device based on dual-level contrastive learning and noise robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant