
CN120529036A - Early warning and storage method, system, equipment and medium based on infrared video monitoring - Google Patents

Early warning and storage method, system, equipment and medium based on infrared video monitoring

Info

Publication number
CN120529036A
Authority
CN
China
Prior art keywords
layer
convolution
feature
early warning
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511016292.2A
Other languages
Chinese (zh)
Other versions
CN120529036B (en)
Inventor
张考
刘志彬
潘志庚
胡志华
李明
丁鑫
丁延智
黄睿阳
宋昊阳
周晨
朱崇博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202511016292.2A priority Critical patent/CN120529036B/en
Publication of CN120529036A publication Critical patent/CN120529036A/en
Application granted granted Critical
Publication of CN120529036B publication Critical patent/CN120529036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B31/00Predictive alarm systems characterised by extrapolation or other computation using updated historic data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/20Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from infrared radiation only
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Image Processing (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Burglar Alarm Systems (AREA)

Abstract

The invention discloses an early warning and storage method, system, equipment and medium based on infrared video monitoring, belonging to the technical field of image processing and video monitoring early warning. The method comprises: obtaining video shot by an infrared camera and extracting infrared images from the video; taking a YOLOv network as the feature extraction network and improving it; extracting feature maps of the infrared images through the improved feature extraction network; classifying candidate areas and outputting early warning information of target objects; and storing the infrared images containing early warning information as videos in chronological order. The invention adds a selectable strip convolution attention layer in the feature extraction stage and constructs a difference-guided multi-scale feature fusion pyramid in the feature fusion stage, which benefits infrared video monitoring and early warning; by selectively storing the monitoring video and keeping only the key video segments containing early warning objects according to the detection results, the storage space occupied by invalid video is effectively reduced.

Description

Early warning and storage method, system, equipment and medium based on infrared video monitoring
Technical Field
The invention relates to the technical field of image processing and video monitoring and early warning, in particular to an early warning and storage method, system, equipment and medium based on infrared video monitoring.
Background
Thermal infrared imaging technology has gained wide attention in recent years due to advantages such as high concealment, low power consumption, all-weather operation, and strong anti-interference capability, especially in computer vision fields such as intelligent monitoring and autonomous driving. Using an infrared camera to film a monitored area and detect specific objects such as people, ships and vehicles in real time allows accidents to be discovered promptly and the objects to be located quickly, greatly improving the efficiency of search-and-rescue tasks, and therefore has important application value for safeguarding the security of the area.
The features learned by a convolutional neural network cannot sufficiently capture tiny, complex details, so the network struggles to distinguish small objects from the background or from objects with similar appearance during detection and classification in images. Directly applying conventional convolution-based detection algorithms makes it difficult to separate small objects from background noise, causing large numbers of false detections and missed detections; the resulting detection performance is insufficient to support the practical application scenario of infrared small-object monitoring and early warning.
In addition, an infrared camera continuously films the monitored area 24 hours a day and generates an enormous amount of video data; storing all of it would occupy a large amount of storage space. Yet for most of this footage no object requiring early warning appears, so full storage wastes resources, complicates subsequent playback of early warning videos, and greatly reduces the efficiency of monitoring and early warning work.
Disclosure of Invention
In view of the above problems, the invention aims to provide an early warning and storage method, system, equipment and medium based on infrared video monitoring, which remarkably enhance the capability of extracting image features in an infrared environment, optimize the feature fusion process and improve the accuracy of identifying target objects in infrared images.
The invention provides an early warning and storing method based on infrared video monitoring, which comprises the following steps:
Acquiring a video shot by an infrared camera, extracting an infrared image by utilizing the video, preprocessing the infrared image, constructing a data set from the preprocessed infrared image, and dividing the data set into a training set and a verification set according to a proportion;
Taking a YOLOv network as the feature extraction network, constructing a selectable strip convolution attention layer and a difference feature fusion network to improve the feature extraction network, and extracting a feature map of the infrared image through the improved feature extraction network, wherein the feature map comprises candidate areas of the target object;
Classifying the candidate areas and outputting early warning information of the target object;
And storing the infrared images containing the early warning information into videos according to the time sequence.
Further, the improved feature extraction network comprises a feature extraction backbone network, a difference feature fusion network and a detection head;
the feature extraction backbone network comprises a first convolution layer, a second convolution layer, a first feature extraction layer, a third convolution layer, a second feature extraction layer, a fourth convolution layer, a third feature extraction layer, a fifth convolution layer, a fourth feature extraction layer and a selectable strip-shaped convolution attention layer which are sequentially connected.
Further, the difference feature fusion network comprises a first upsampling layer, a first difference fusion layer, a first feature fusion layer, a second upsampling layer, a second difference fusion layer, a second feature fusion layer, a sixth convolution layer, a first connection layer, a third feature fusion layer, a seventh convolution layer, a second connection layer and a fourth feature fusion layer which are sequentially connected;
The second feature extraction layer is connected with the second difference fusion layer, the third feature extraction layer is connected with the first difference fusion layer, the selectable strip convolution attention layer is respectively connected with the first upsampling layer and the second connection layer, the first feature fusion layer is connected with the first connection layer, and the second feature fusion layer, the third feature fusion layer and the fourth feature fusion layer are all connected with the detection head.
Further, denoting the input feature map of the selectable strip convolution attention layer as X_in and its output feature map as X_out, the process of obtaining X_out includes:
the feature map X_in first generates selectable strip receptive fields through horizontal strip convolution and vertical strip convolution respectively, and feature maps F_H and F_V are then obtained through 1×1 convolution;
the feature maps F_H and F_V are channel-wise concatenated to obtain a feature map F;
average pooling and maximum pooling are applied to F to extract spatial relations, the pooled features are concatenated and convolved, and a Sigmoid activation function is applied to obtain the spatially selected attention map A:
A = σ(Conv([MaxPool(F); AvgPool(F)])),
where MaxPool is maximum pooling, AvgPool is average pooling, Conv is the convolution operation and σ is the Sigmoid activation function;
then the branch feature maps F_H and F_V are weighted by their corresponding spatial-selection maps and fused by convolution to obtain the attention feature map S:
S = Conv(A_H ⊙ F_H + A_V ⊙ F_V),
where A_H and A_V are the spatial-selection weights corresponding to the horizontal and vertical branches;
finally, the output feature map X_out of the selectable strip convolution attention layer is the element-wise product of the input feature map X_in and the attention feature map S:
X_out = X_in ⊙ S.
Further, the step of generating the selectable strip receptive fields by convolving the feature map X_in with horizontal strips and vertical strips respectively includes:
firstly, a horizontal convolution kernel and a vertical convolution kernel are created using asymmetric padding P(L, R, T, B), where L, R, T, B denote the number of padding pixels in the left, right, top and bottom directions respectively;
horizontal asymmetric zero padding P(k, 0, 1, 0), P(0, k, 0, 1) and vertical asymmetric zero padding P(0, 1, k, 0), P(1, 0, 0, k) are applied to the input tensor to obtain padded tensors X_1, X_2, X_3 and X_4;
then the padded tensors are convolved in parallel to obtain tensors Y_1, Y_2, Y_3 and Y_4:
Y_1 = Conv_{1×k}(X_1),
Y_2 = Conv_{1×k}(X_2),
Y_3 = Conv_{k×1}(X_3),
Y_4 = Conv_{k×1}(X_4),
where Conv_{1×k} denotes convolution with a kernel of height 1 and width k, Conv_{k×1} denotes convolution with a kernel of height k and width 1, c_1 is the number of output channels of the convolution kernel, s is the convolution stride, h, w and c are the height, width and channel number of the feature map X_in, and the height h_1 and width w_1 of the output tensors are functions of the input tensor, the padding and the stride;
the tensors Y_1 and Y_2 are concatenated, and the tensors Y_3 and Y_4 are concatenated, to obtain new tensors T_H and T_V:
T_H = Cat(Y_1, Y_2),
T_V = Cat(Y_3, Y_4),
where Cat denotes the concatenation operation;
a convolution with height 2, width 2 and c_2 output channels is applied to T_H and T_V respectively, followed by normalization, to obtain the final horizontal strip convolution output tensor X_H and vertical strip convolution output tensor X_V:
X_H = Norm(Conv_{2×2}(T_H)),
X_V = Norm(Conv_{2×2}(T_V)),
where h_2 and w_2 denote the height and width of the output tensors X_H and X_V respectively.
Further, the working process of the difference feature fusion network comprises the following steps:
adjacent-layer feature maps are denoted F_i and F_{i+1} respectively; pixel-by-pixel subtraction, an absolute-value operation and a 3×3 convolution Conv_{3×3} generate a difference feature map D:
D = Conv_{3×3}(|F_{i+1} - F_i|),
where i ∈ {1, 2} and F_1, F_2, F_3 denote the shallow, middle and deep features respectively;
then the difference feature map D is passed through a Sigmoid activation function and multiplied with the feature maps F_i and F_{i+1} respectively to obtain weighted feature maps F_i^w and F_{i+1}^w; F_i is added to F_i^w and F_{i+1} is added to F_{i+1}^w, generating feature maps M_i and M_{i+1} that contain multi-scale information:
F_i^w = σ(D) ⊙ F_i,
F_{i+1}^w = σ(D) ⊙ F_{i+1},
M_i = F_i + F_i^w,
M_{i+1} = F_{i+1} + F_{i+1}^w,
the feature maps M_i and M_{i+1} are channel-wise concatenated, a 3×3 convolution is applied for fusion, the fused feature is multiplied with the difference feature, and a further 3×3 convolution yields the final fused difference feature map F_fuse:
F_fuse = Conv_{3×3}(Conv_{3×3}(Cat(M_i, M_{i+1})) ⊙ D),
where Cat denotes channel-wise feature concatenation.
Further, the step of preprocessing the infrared image includes:
random horizontal flipping of the image, random scaling of the image size, random cropping of the image area, random changing of brightness, contrast, color saturation of the image, and Mosaic data enhancement of the image.
Further, the early warning information includes object category, confidence and detection frame coordinates.
In a second aspect, the present invention provides an infrared video monitoring based early warning and storage system, comprising:
the image acquisition module is used for acquiring video shot by the infrared camera and extracting an infrared image by utilizing the video;
the image processing module is used for preprocessing the infrared image;
The feature extraction module is used for extracting a feature map from the preprocessed image;
And the early warning and storage module is used for classifying the candidate areas and outputting early warning information of the target object, and storing the infrared images containing the early warning information into videos according to time sequence.
In a third aspect, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method described in the first aspect above when executing the computer program.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described in the first aspect above.
Compared with the prior art, the invention has the remarkable advantages that:
1. The improved YOLOv network architecture adds a selectable strip convolution attention layer in the feature extraction stage and constructs a difference-guided multi-scale feature fusion pyramid in the feature fusion stage, improving the accuracy of infrared image detection and benefiting infrared video monitoring and early warning;
2. The invention selectively stores the monitoring video: only the key video segments containing early warning targets are stored according to the detection results, which effectively reduces the storage space occupied by invalid video, makes it convenient for users to quickly locate and search the objects to be warned about, and improves the efficiency of playing back the monitoring video.
Drawings
FIG. 1 is a flow chart of an infrared video monitoring based early warning and storage method;
FIG. 2 is a schematic diagram of an improved feature extraction network;
FIG. 3 is a schematic diagram of an alternative bar-shaped convolution attention layer;
FIG. 4 is a schematic diagram of a difference feature fusion network;
FIG. 5 is a flow chart of recording an early warning video;
fig. 6 is a schematic diagram of early warning video storage.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Furthermore, in the description of the present specification and the appended claims, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise.
Example 1
Referring to fig. 1, the early warning and storage method based on infrared video monitoring according to the embodiment includes the following steps:
Step 1, acquiring a video shot by an infrared camera, extracting an infrared image by utilizing the video, preprocessing the infrared image, constructing a data set by the preprocessed infrared image, and dividing the data set into a training set and a verification set according to a proportion.
In this example, an infrared camera captures video in a monitoring area and the video stream is transmitted to a computer connected to the camera. An SRS streaming media server is built on the local computer, the infrared camera's video stream is H.264-encoded with the open-source program FFmpeg and pushed to the SRS server over the RTMP protocol, and the SRS server generates a video stream access address. For the infrared video to be detected, OpenCV takes the RTMP stream address as a parameter, establishes a connection with the streaming media server, pulls the video data, selects a suitable decoder according to the stream's encoding type and decodes it into frame-by-frame image data; after screening out the infrared pictures containing target objects such as people, vehicles and ships, 30,000 pictures are obtained. The images are divided into a training set and a verification set, for example in the proportion 8:2; the people, vehicles, ships and other target objects in the images are annotated with an image labeling tool; and data enhancement is applied to the infrared image data, including random horizontal flipping, random scaling of image size, random cropping of image areas, random changes to brightness, contrast and color saturation, and Mosaic data enhancement.
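To make the ingestion step concrete, the following minimal sketch (not part of the claimed method) shows how OpenCV can pull the RTMP stream published by the SRS server and save sampled frames for annotation; the stream address, sampling interval and output directory are illustrative assumptions.

```python
import os
import cv2

# Hypothetical RTMP address published by the SRS streaming server (assumption).
STREAM_URL = "rtmp://127.0.0.1/live/infrared"

def extract_frames(stream_url: str = STREAM_URL, every_n: int = 25, out_dir: str = "frames") -> int:
    """Pull the RTMP stream, decode it frame by frame and save every n-th frame."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(stream_url)        # OpenCV selects a decoder from the stream codec
    if not cap.isOpened():
        raise RuntimeError("failed to connect to the streaming media server")
    idx = saved = 0
    while True:
        ok, frame = cap.read()                # one decoded image per call
        if not ok:
            break                             # stream ended or connection lost
        if idx % every_n == 0:                # sample frames for the data set
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```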
Step 2, taking the YOLOv network as the feature extraction network, constructing a selectable strip convolution attention layer and a difference feature fusion network to improve the feature extraction network, and extracting a feature map of the infrared image through the improved feature extraction network, wherein the feature map comprises candidate areas of the target object.
In this example, a YOLOv network is selected as the feature extraction network and improved: a selectable strip convolution attention layer is added to the feature extraction backbone network, and a difference feature fusion network architecture is adopted to optimize the feature fusion network.
With reference to fig. 2, further, the improved feature extraction network includes a feature extraction backbone network, a differential feature fusion network and a detection head;
the feature extraction backbone network comprises a first convolution layer, a second convolution layer, a first feature extraction layer, a third convolution layer, a second feature extraction layer, a fourth convolution layer, a third feature extraction layer, a fifth convolution layer, a fourth feature extraction layer and a selectable strip-shaped convolution attention layer which are sequentially connected.
Further, the difference feature fusion network comprises a first upsampling layer, a first difference fusion layer, a first feature fusion layer, a second upsampling layer, a second difference fusion layer, a second feature fusion layer, a sixth convolution layer, a first connection layer, a third feature fusion layer, a seventh convolution layer, a second connection layer and a fourth feature fusion layer which are sequentially connected;
The second feature extraction layer is connected with the second difference fusion layer, the third feature extraction layer is connected with the first difference fusion layer, the selectable strip convolution attention layer is respectively connected with the first upsampling layer and the second connection layer, the first feature fusion layer is connected with the first connection layer, and the second feature fusion layer, the third feature fusion layer and the fourth feature fusion layer are all connected with the detection head.
In infrared images, people are generally distributed as vertical strips while ships and vehicles are distributed as horizontal strips. In this example a strip convolution is designed whose receptive field is strip-shaped, matching the gray-level distribution of the target, enlarging the receptive field and enhancing feature extraction. The selectable strip convolution attention layer comprises a horizontal strip convolution module, a vertical strip convolution module and a spatial selection module: the input feature map generates selectable strip receptive fields through the horizontal and vertical strip convolution modules, and the spatial selection module is introduced so that the network attends to the most relevant spatial context area, improving the classification and localization of the target object.
Further, with reference to fig. 3, denoting the input feature map of the selectable strip convolution attention layer as X_in and its output feature map as X_out, the process of obtaining X_out includes:
the feature map X_in first generates selectable strip receptive fields through horizontal strip convolution and vertical strip convolution respectively, and feature maps F_H and F_V are then obtained through 1×1 convolution;
the feature maps F_H and F_V are channel-wise concatenated to obtain a feature map F;
average pooling and maximum pooling are applied to F to extract spatial relations, the pooled features are concatenated and convolved, and a Sigmoid activation function is applied to obtain the spatially selected attention map A:
A = σ(Conv([MaxPool(F); AvgPool(F)])),
where MaxPool is maximum pooling, AvgPool is average pooling, Conv is the convolution operation and σ is the Sigmoid activation function;
then the branch feature maps F_H and F_V are weighted by their corresponding spatial-selection maps and fused by convolution to obtain the attention feature map S:
S = Conv(A_H ⊙ F_H + A_V ⊙ F_V),
where A_H and A_V are the spatial-selection weights corresponding to the horizontal and vertical branches;
finally, the output feature map X_out of the selectable strip convolution attention layer is the element-wise product of the input feature map X_in and the attention feature map S:
X_out = X_in ⊙ S.
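A minimal PyTorch sketch of the selectable strip convolution attention layer described above is given below; it is an illustrative reading of the text, in which the strip branches are stood in for by simple 1×k and k×1 convolutions, and the kernel size k = 7 and convolution sizes are assumed values rather than parameters fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SelectableStripAttention(nn.Module):
    """Sketch of the selectable strip convolution attention layer (k = 7 assumed)."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        # Horizontal / vertical strip branches (simplified stand-ins for the
        # asymmetric-padding strip convolutions described in the text).
        self.h_strip = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.v_strip = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.squeeze_h = nn.Conv2d(channels, channels, 1)    # 1x1 conv -> F_H
        self.squeeze_v = nn.Conv2d(channels, channels, 1)    # 1x1 conv -> F_V
        self.select = nn.Conv2d(2, 2, 7, padding=3)          # spatial selection over pooled maps
        self.fuse = nn.Conv2d(channels, channels, 1)         # fuse weighted branches -> S

    def forward(self, x):
        f_h = self.squeeze_h(self.h_strip(x))                # horizontal branch feature F_H
        f_v = self.squeeze_v(self.v_strip(x))                # vertical branch feature F_V
        f = torch.cat([f_h, f_v], dim=1)                     # channel concatenation -> F
        pooled = torch.cat([f.max(dim=1, keepdim=True).values,
                            f.mean(dim=1, keepdim=True)], dim=1)  # max / average pooling maps
        a = torch.sigmoid(self.select(pooled))               # spatial-selection weights A_H, A_V
        s = self.fuse(a[:, 0:1] * f_h + a[:, 1:2] * f_v)     # weight branches and fuse -> S
        return x * s                                         # X_out = X_in ⊙ S
```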
Further, the step of generating the selectable strip receptive fields by convolving the feature map X_in with horizontal strips and vertical strips respectively includes:
firstly, a horizontal convolution kernel and a vertical convolution kernel are created using asymmetric padding P(L, R, T, B), where L, R, T, B denote the number of padding pixels in the left, right, top and bottom directions respectively;
horizontal asymmetric zero padding P(k, 0, 1, 0), P(0, k, 0, 1) and vertical asymmetric zero padding P(0, 1, k, 0), P(1, 0, 0, k) are applied to the input tensor to obtain padded tensors X_1, X_2, X_3 and X_4;
then the padded tensors are convolved in parallel to obtain tensors Y_1, Y_2, Y_3 and Y_4:
Y_1 = Conv_{1×k}(X_1),
Y_2 = Conv_{1×k}(X_2),
Y_3 = Conv_{k×1}(X_3),
Y_4 = Conv_{k×1}(X_4),
where Conv_{1×k} denotes convolution with a kernel of height 1 and width k, Conv_{k×1} denotes convolution with a kernel of height k and width 1, c_1 is the number of output channels of the convolution kernel, s is the convolution stride, h, w and c are the height, width and channel number of the feature map X_in, and the height h_1 and width w_1 of the output tensors are functions of the input tensor, the padding and the stride;
the tensors Y_1 and Y_2 are concatenated, and the tensors Y_3 and Y_4 are concatenated, to obtain new tensors T_H and T_V:
T_H = Cat(Y_1, Y_2),
T_V = Cat(Y_3, Y_4),
where Cat denotes the concatenation operation;
a convolution with height 2, width 2 and c_2 output channels is applied to T_H and T_V respectively, followed by normalization, to obtain the final horizontal strip convolution output tensor X_H and vertical strip convolution output tensor X_V:
X_H = Norm(Conv_{2×2}(T_H)),
X_V = Norm(Conv_{2×2}(T_V)),
where h_2 and w_2 denote the height and width of the output tensors X_H and X_V respectively.
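The asymmetric-padding strip convolutions can be sketched as follows, assuming stride s = 1 and illustrative channel counts c_1 and c_2; F.pad takes padding in the order (left, right, top, bottom), matching P(L, R, T, B), and the last padding pattern P(1, 0, 0, k) is inferred from the symmetry of the other three.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripConvBranches(nn.Module):
    """Sketch of the asymmetric-padding strip convolutions (k, c1, c2 are assumed values)."""
    def __init__(self, in_ch: int, c1: int = 32, c2: int = 32, k: int = 7, s: int = 1):
        super().__init__()
        self.k = k
        self.conv_1xk = nn.Conv2d(in_ch, c1, (1, k), stride=s)  # kernel height 1, width k
        self.conv_kx1 = nn.Conv2d(in_ch, c1, (k, 1), stride=s)  # kernel height k, width 1
        self.norm_h = nn.Sequential(nn.Conv2d(2 * c1, c2, 2), nn.BatchNorm2d(c2))
        self.norm_v = nn.Sequential(nn.Conv2d(2 * c1, c2, 2), nn.BatchNorm2d(c2))

    def forward(self, x):
        k = self.k
        x1 = F.pad(x, (k, 0, 1, 0))   # P(k, 0, 1, 0)
        x2 = F.pad(x, (0, k, 0, 1))   # P(0, k, 0, 1)
        x3 = F.pad(x, (0, 1, k, 0))   # P(0, 1, k, 0)
        x4 = F.pad(x, (1, 0, 0, k))   # P(1, 0, 0, k), last argument assumed from symmetry
        y1, y2 = self.conv_1xk(x1), self.conv_1xk(x2)   # parallel 1xk convolutions
        y3, y4 = self.conv_kx1(x3), self.conv_kx1(x4)   # parallel kx1 convolutions
        t_h = torch.cat([y1, y2], dim=1)                # T_H = Cat(Y1, Y2)
        t_v = torch.cat([y3, y4], dim=1)                # T_V = Cat(Y3, Y4)
        return self.norm_h(t_h), self.norm_v(t_v)       # X_H, X_V after 2x2 conv + normalization
```

With s = 1 the four padded branches all produce (h + 1) × (w + 1) maps, and the final 2×2 convolution returns the outputs to the original h × w size.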
Shallow features typically contain more detail and edge information, while deep features contain more high-level semantic information. To integrate information at different scales and improve the model's understanding of both details and global context, the deep features are upsampled and then subtracted pixel by pixel from the shallow features to obtain difference features. Through this subtraction the model can learn the difference between deep and shallow features and avoid losing the shallow features, thereby retaining more detail information.
With reference to the structural schematic diagram of the difference feature fusion network shown in fig. 4, the difference feature fusion network fuses features obtained from adjacent-layer feature maps and generates a difference feature map; its working process includes:
adjacent-layer feature maps are denoted F_i and F_{i+1} respectively; pixel-by-pixel subtraction, an absolute-value operation and a 3×3 convolution Conv_{3×3} generate a difference feature map D:
D = Conv_{3×3}(|F_{i+1} - F_i|),
where i ∈ {1, 2}; the second feature extraction layer outputs the shallow feature F_1, the third feature extraction layer outputs the middle feature F_2, and the selectable strip convolution attention layer outputs the deep feature F_3;
then the difference feature map D is passed through a Sigmoid activation function and multiplied with the feature maps F_i and F_{i+1} respectively to obtain weighted feature maps F_i^w and F_{i+1}^w; F_i is added to F_i^w and F_{i+1} is added to F_{i+1}^w, generating feature maps M_i and M_{i+1} that contain multi-scale information:
F_i^w = σ(D) ⊙ F_i,
F_{i+1}^w = σ(D) ⊙ F_{i+1},
M_i = F_i + F_i^w,
M_{i+1} = F_{i+1} + F_{i+1}^w,
the feature maps M_i and M_{i+1} are channel-wise concatenated, a 3×3 convolution is applied for fusion, the fused feature is multiplied with the difference feature, and a further 3×3 convolution yields the final fused difference feature map F_fuse:
F_fuse = Conv_{3×3}(Conv_{3×3}(Cat(M_i, M_{i+1})) ⊙ D),
where Cat denotes channel-wise feature concatenation.
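The following sketch shows one difference fusion step under the assumption that the adjacent feature maps have already been projected to the same channel number; the upsampling mode and channel handling are illustrative choices, not values fixed by this embodiment.

```python
import torch
import torch.nn as nn

class DifferenceFusion(nn.Module):
    """Sketch of one difference fusion step between adjacent feature levels."""
    def __init__(self, channels: int):
        super().__init__()
        self.diff_conv = nn.Conv2d(channels, channels, 3, padding=1)    # 3x3 conv on |F_{i+1}-F_i|
        self.fuse_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, f_shallow, f_deep):
        f_deep = self.up(f_deep)                            # align resolutions before subtraction
        d = self.diff_conv(torch.abs(f_deep - f_shallow))   # difference feature map D
        w = torch.sigmoid(d)                                # Sigmoid gating of the difference
        m1 = f_shallow + w * f_shallow                      # shallow branch enriched by D
        m2 = f_deep + w * f_deep                            # deep branch enriched by D
        fused = self.fuse_conv(torch.cat([m1, m2], dim=1))  # channel concat + 3x3 fusion
        return self.out_conv(fused * d)                     # multiply with D, final 3x3 conv
```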
Step 3, classifying the candidate areas and outputting the early warning information of the target object.
For example, the second feature fusion layer outputs a feature map of size 80×80, the third feature fusion layer a feature map of size 40×40 and the fourth feature fusion layer a feature map of size 20×20. The detection head convolves the feature maps of the three scales to obtain the category, confidence and detection-frame coordinates of the targets; all detection frames from the three scales are combined, a preliminary screening is performed against a confidence threshold, and redundant frames are finally removed through non-maximum suppression. The category, confidence and detection-frame coordinates corresponding to each remaining detection frame constitute the final early warning information.
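The post-processing of the detection-head outputs can be sketched with torchvision's NMS as below; the IoU threshold of 0.5 is an assumed value, while the confidence threshold of 0.2 follows the value used later in this example.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, labels, conf_thr: float = 0.2, iou_thr: float = 0.5):
    """Merge detections from the three scales, filter by confidence and apply NMS.

    boxes: (N, 4) xyxy tensor, scores: (N,), labels: (N,), already concatenated
    across the 80x80, 40x40 and 20x20 heads.
    """
    keep = scores >= conf_thr                       # preliminary confidence screening
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = []
    for cls in labels.unique():                     # class-wise non-maximum suppression
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        idx = idx[nms(boxes[idx], scores[idx], iou_thr)]
        kept.append(idx)
    kept = torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)
    return boxes[kept], scores[kept], labels[kept]
```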
In this example, the improved feature extraction network is first trained using image data and label data in a training set, the training process comprising:
The images of the training set are resized to 640×640 and input into the feature extraction backbone network for feature extraction; the three different-scale features output by the backbone are then fused by the difference feature fusion network, which outputs three feature maps of sizes 80×80, 40×40 and 20×20, and the three detection heads generate the target category, confidence and detection-frame coordinates corresponding to each detection frame. The model's predictions are compared with the label ground truth and a loss function, such as the CIoU loss, is calculated. According to the calculated loss, the gradients of the model parameters are computed with the back-propagation algorithm and the parameters are updated with an SGD optimizer. The hardware environment used for training is an NVIDIA GeForce RTX 4090D GPU and an Intel(R) Xeon(R) CPU E5-2680 v4; the software environment is the PyTorch deep learning framework under Ubuntu; the initial learning rate is set to 0.01, the number of training rounds to 200 and the batch size to 32.
After training, the pictures in the test set are input into the improved feature extraction network, and the feature map of each scale outputs a group of prediction results including target category, confidence and detection-frame coordinates. The outputs of the three detection heads are concatenated, the confidences are screened, and detection frames with confidence below the threshold of 0.2 are filtered out. Finally, non-maximum suppression is performed to eliminate redundant detection frames, keeping only the detection frame with the highest confidence at each target position to obtain the final detection result.
Each frame of image data is read by traversal and passed as input to the improved feature extraction network to detect specific infrared objects and obtain the final detection result; the early warning information, such as object category, confidence and detection-frame coordinates, is sent to the client server in JSON format.
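A possible JSON packaging of the early warning information is sketched below; the field names and message structure are assumptions for illustration, since the embodiment only specifies that object category, confidence and detection-frame coordinates are sent in JSON format.

```python
import json

def detections_to_json(boxes, scores, labels, class_names):
    """Package the final detections as the early warning message sent to the client server."""
    records = []
    for box, score, label in zip(boxes.tolist(), scores.tolist(), labels.tolist()):
        records.append({
            "category": class_names[int(label)],    # object category, e.g. person / vehicle / ship
            "confidence": round(float(score), 3),   # detection confidence
            "bbox": [round(v, 1) for v in box],     # detection-frame coordinates (x1, y1, x2, y2)
        })
    return json.dumps({"alarms": records}, ensure_ascii=False)
```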
Step 4, storing the infrared images containing the early warning information as videos in chronological order.
Referring to fig. 5, when one or more target objects are detected in the current frame, the system automatically starts the recording function if it is not already recording, creates a new video file and begins writing the current and subsequent frames into it; if it is already recording, it continues writing the current frame into the current video file. When no target object is detected in the current frame and the system was previously recording, it stops the writing operation and finishes and saves the current video file, thereby automatically storing the video clips that contain early warning targets.
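The recording logic of fig. 5 can be sketched as a small state machine around OpenCV's VideoWriter; the file-naming pattern and codec are illustrative assumptions.

```python
import cv2

class AlarmRecorder:
    """Sketch of the start/stop recording logic driven by per-frame detections."""
    def __init__(self, fps: float, frame_size, out_pattern: str = "alarm_{:03d}.mp4"):
        self.fps, self.frame_size = fps, frame_size   # frame_size is (width, height)
        self.out_pattern, self.count = out_pattern, 0
        self.writer = None                            # not recording initially

    def update(self, frame, has_target: bool):
        if has_target:
            if self.writer is None:                   # target appeared: open a new video file
                fourcc = cv2.VideoWriter_fourcc(*"mp4v")
                self.writer = cv2.VideoWriter(
                    self.out_pattern.format(self.count), fourcc, self.fps, self.frame_size)
                self.count += 1
            self.writer.write(frame)                  # keep writing while targets are present
        elif self.writer is not None:                 # no target: close and save current clip
            self.writer.release()
            self.writer = None
```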
As shown in fig. 6, in the overall detection video the early warning segments A, B and C contain target objects while the remaining segments are normal video; the system automatically records and stores the early warning segments A, B and C, and the normal segments in which no target object is detected are not stored. This effectively reduces the storage of irrelevant video data and significantly reduces the waste of storage resources. At the same time, keeping only the key video containing early warning targets improves the playback efficiency of early warning video, makes it convenient for users to quickly locate and query the related objects, and improves the responsiveness and usability of the system in practical applications.
In one example, an infrared camera continuously filmed a fixed area for 10 minutes; the video storage results are shown in Table 1. Six video segments were stored, with a total stored duration of 61 seconds. The video segments containing early warning objects are separated from the original video and stored, greatly reducing the space wasted on storing invalid background video and improving the efficiency with which users play back the monitoring video and review early warning targets.
Table 1 video storage results table
Example 2
The early warning and storage system based on infrared video monitoring of this embodiment includes:
the image acquisition module is used for acquiring video shot by the infrared camera and extracting an infrared image by utilizing the video;
the image processing module is used for preprocessing the infrared image;
The feature extraction module is used for extracting a feature map from the preprocessed image;
And the early warning and storage module is used for classifying the candidate areas and outputting early warning information of the target object, and storing the infrared images containing the early warning information into videos according to time sequence.
Example 3
The present embodiment provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the processor executes the computer program.
The present embodiment also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method described in the above method embodiments.

Claims (11)

1. The early warning and storing method based on infrared video monitoring is characterized by comprising the following steps:
Acquiring a video shot by an infrared camera, extracting an infrared image by utilizing the video, preprocessing the infrared image, constructing a data set from the preprocessed infrared image, and dividing the data set into a training set and a verification set according to a proportion;
Taking a YOLOv network as the feature extraction network, constructing a selectable strip convolution attention layer and a difference feature fusion network to improve the feature extraction network, and extracting a feature map of the infrared image through the improved feature extraction network, wherein the feature map comprises candidate areas of the target object;
Classifying the candidate areas and outputting early warning information of the target object;
And storing the infrared images containing the early warning information into videos according to the time sequence.
2. The infrared video monitoring-based early warning and storage method according to claim 1, wherein the improved feature extraction network comprises a feature extraction backbone network, a difference feature fusion network and a detection head;
the feature extraction backbone network comprises a first convolution layer, a second convolution layer, a first feature extraction layer, a third convolution layer, a second feature extraction layer, a fourth convolution layer, a third feature extraction layer, a fifth convolution layer, a fourth feature extraction layer and a selectable strip-shaped convolution attention layer which are sequentially connected.
3. The infrared video monitoring-based early warning and storage method according to claim 2, wherein the difference feature fusion network comprises a first upsampling layer, a first difference fusion layer, a first feature fusion layer, a second upsampling layer, a second difference fusion layer, a second feature fusion layer, a sixth convolution layer, a first connection layer, a third feature fusion layer, a seventh convolution layer, a second connection layer and a fourth feature fusion layer which are sequentially connected;
The second feature extraction layer is connected with the second difference fusion layer, the third feature extraction layer is connected with the first difference fusion layer, the selectable strip convolution attention layer is respectively connected with the first upsampling layer and the second connection layer, the first feature fusion layer is connected with the first connection layer, and the second feature fusion layer, the third feature fusion layer and the fourth feature fusion layer are all connected with the detection head.
4. The early warning and storage method based on infrared video monitoring according to claim 3, wherein, denoting the input feature map of the selectable strip convolution attention layer as X_in and its output feature map as X_out, the process of obtaining X_out includes:
the feature map X_in first generates selectable strip receptive fields through horizontal strip convolution and vertical strip convolution respectively, and feature maps F_H and F_V are then obtained through 1×1 convolution;
the feature maps F_H and F_V are channel-wise concatenated to obtain a feature map F;
average pooling and maximum pooling are applied to F to extract spatial relations, the pooled features are concatenated and convolved, and a Sigmoid activation function is applied to obtain the spatially selected attention map A:
A = σ(Conv([MaxPool(F); AvgPool(F)])),
where MaxPool is maximum pooling, AvgPool is average pooling, Conv is the convolution operation and σ is the Sigmoid activation function;
then the branch feature maps F_H and F_V are weighted by their corresponding spatial-selection maps and fused by convolution to obtain the attention feature map S:
S = Conv(A_H ⊙ F_H + A_V ⊙ F_V),
where A_H and A_V are the spatial-selection weights corresponding to the horizontal and vertical branches;
finally, the output feature map X_out of the selectable strip convolution attention layer is the element-wise product of the input feature map X_in and the attention feature map S:
X_out = X_in ⊙ S.
5. The early warning and storage method based on infrared video monitoring according to claim 4, wherein the step of generating the selectable strip receptive fields by convolving the feature map X_in with horizontal strips and vertical strips respectively includes:
firstly, a horizontal convolution kernel and a vertical convolution kernel are created using asymmetric padding P(L, R, T, B), where L, R, T, B denote the number of padding pixels in the left, right, top and bottom directions respectively;
horizontal asymmetric zero padding P(k, 0, 1, 0), P(0, k, 0, 1) and vertical asymmetric zero padding P(0, 1, k, 0), P(1, 0, 0, k) are applied to the input tensor to obtain padded tensors X_1, X_2, X_3 and X_4;
then the padded tensors are convolved in parallel to obtain tensors Y_1, Y_2, Y_3 and Y_4:
Y_1 = Conv_{1×k}(X_1),
Y_2 = Conv_{1×k}(X_2),
Y_3 = Conv_{k×1}(X_3),
Y_4 = Conv_{k×1}(X_4),
where Conv_{1×k} denotes convolution with a kernel of height 1 and width k, Conv_{k×1} denotes convolution with a kernel of height k and width 1, c_1 is the number of output channels of the convolution kernel, s is the convolution stride, h, w and c are the height, width and channel number of the feature map X_in, and the height h_1 and width w_1 of the output tensors are functions of the input tensor, the padding and the stride;
the tensors Y_1 and Y_2 are concatenated, and the tensors Y_3 and Y_4 are concatenated, to obtain new tensors T_H and T_V:
T_H = Cat(Y_1, Y_2),
T_V = Cat(Y_3, Y_4),
where Cat denotes the concatenation operation;
a convolution with height 2, width 2 and c_2 output channels is applied to T_H and T_V respectively, followed by normalization, to obtain the final horizontal strip convolution output tensor X_H and vertical strip convolution output tensor X_V:
X_H = Norm(Conv_{2×2}(T_H)),
X_V = Norm(Conv_{2×2}(T_V)),
where h_2 and w_2 denote the height and width of the output tensors X_H and X_V respectively.
6. The early warning and storage method based on infrared video monitoring according to claim 5, wherein the working process of the difference feature fusion network comprises the following steps:
adjacent-layer feature maps are denoted F_i and F_{i+1} respectively; pixel-by-pixel subtraction, an absolute-value operation and a 3×3 convolution Conv_{3×3} generate a difference feature map D:
D = Conv_{3×3}(|F_{i+1} - F_i|),
where i ∈ {1, 2} and F_1, F_2, F_3 denote the shallow, middle and deep features respectively;
then the difference feature map D is passed through a Sigmoid activation function and multiplied with the feature maps F_i and F_{i+1} respectively to obtain weighted feature maps F_i^w and F_{i+1}^w; F_i is added to F_i^w and F_{i+1} is added to F_{i+1}^w, generating feature maps M_i and M_{i+1} that contain multi-scale information:
F_i^w = σ(D) ⊙ F_i,
F_{i+1}^w = σ(D) ⊙ F_{i+1},
M_i = F_i + F_i^w,
M_{i+1} = F_{i+1} + F_{i+1}^w,
the feature maps M_i and M_{i+1} are channel-wise concatenated, a 3×3 convolution is applied for fusion, the fused feature is multiplied with the difference feature, and a further 3×3 convolution yields the final fused difference feature map F_fuse:
F_fuse = Conv_{3×3}(Conv_{3×3}(Cat(M_i, M_{i+1})) ⊙ D),
where Cat denotes channel-wise feature concatenation.
7. The method for pre-warning and storing based on infrared video monitoring according to claim 1, wherein the step of preprocessing the infrared image comprises:
random horizontal flipping of the image, random scaling of the image size, random cropping of the image area, random changing of brightness, contrast, color saturation of the image, and Mosaic data enhancement of the image.
8. The method of claim 1, wherein the pre-warning information includes object type, confidence level, and detection frame coordinates.
9. Early warning and memory system based on infrared video monitoring, characterized by comprising:
the image acquisition module is used for acquiring video shot by the infrared camera and extracting an infrared image by utilizing the video;
the image processing module is used for preprocessing the infrared image;
The feature extraction module is used for extracting a feature map from the preprocessed image;
And the early warning and storage module is used for classifying the candidate areas and outputting early warning information of the target object, and storing the infrared images containing the early warning information into videos according to time sequence.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 8 when executing the computer program.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202511016292.2A 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring Active CN120529036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511016292.2A CN120529036B (en) 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511016292.2A CN120529036B (en) 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring

Publications (2)

Publication Number Publication Date
CN120529036A true CN120529036A (en) 2025-08-22
CN120529036B CN120529036B (en) 2025-09-23

Family

ID=96754728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511016292.2A Active CN120529036B (en) 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring

Country Status (1)

Country Link
CN (1) CN120529036B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220051025A1 (en) * 2019-11-15 2022-02-17 Tencent Technology (Shenzhen) Company Limited Video classification method and apparatus, model training method and apparatus, device, and storage medium
US20220164934A1 (en) * 2020-09-30 2022-05-26 Boe Technology Group Co., Ltd. Image processing method and apparatus, device, video processing method and storage medium
CN119763027A (en) * 2024-11-11 2025-04-04 霖久智慧(广东)科技有限公司 Property monitoring method, device, equipment, and storage medium based on improved YOLOV5
CN119919887A (en) * 2025-01-02 2025-05-02 青岛海容商用冷链股份有限公司 A production line safety monitoring method and system based on machine vision
CN120163956A (en) * 2025-02-28 2025-06-17 南京信息工程大学 A method and related device for infrared small target detection based on selective state space model

Also Published As

Publication number Publication date
CN120529036B (en) 2025-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant