
CN120529036A - Early warning and storage method, system, equipment and medium based on infrared video monitoring - Google Patents

Early warning and storage method, system, equipment and medium based on infrared video monitoring

Info

Publication number
CN120529036A
Authority
CN
China
Prior art keywords
layer
convolution
feature
early warning
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511016292.2A
Other languages
Chinese (zh)
Other versions
CN120529036B (en)
Inventor
张考
刘志彬
潘志庚
胡志华
李明
丁鑫
丁延智
黄睿阳
宋昊阳
周晨
朱崇博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202511016292.2A priority Critical patent/CN120529036B/en
Publication of CN120529036A publication Critical patent/CN120529036A/en
Application granted granted Critical
Publication of CN120529036B publication Critical patent/CN120529036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G08SIGNALLING
    • G08BSIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B31/00Predictive alarm systems characterised by extrapolation or other computation using updated historic data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/20Cameras or camera modules comprising electronic image sensors; Control thereof for generating image signals from infrared radiation only
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Business, Economics & Management (AREA)
  • Emergency Management (AREA)
  • Image Processing (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Burglar Alarm Systems (AREA)

Abstract

The invention discloses an early warning and storage method, system, equipment and medium based on infrared video monitoring, belonging to the technical field of image processing and video monitoring early warning. The method comprises: obtaining video shot by an infrared camera and extracting infrared images from the video; taking a YOLOv network as the feature extraction network and improving it; extracting feature maps of the infrared images through the improved feature extraction network; classifying candidate areas and outputting early warning information of target objects; and storing the infrared images containing early warning information as videos in chronological order. The invention adds a selectable strip convolution attention layer in the feature extraction stage and constructs a difference-guided multi-scale feature fusion pyramid in the feature fusion stage, which benefits infrared video monitoring and early warning; by selectively storing the monitoring video and keeping only the key video segments containing early warning objects according to the detection results, the storage space occupied by invalid video is effectively reduced.

Description

Early warning and storage method, system, equipment and medium based on infrared video monitoring
Technical Field
The invention relates to the technical field of image processing and video monitoring and early warning, in particular to an early warning and storage method, system, equipment and medium based on infrared video monitoring.
Background
Thermal infrared imaging technology has gained wide attention in recent years due to advantages such as high concealment, low power consumption, all-weather operation, and strong anti-interference capability, especially in computer vision fields such as intelligent monitoring and autonomous driving. Using an infrared camera to film a monitored area and detect specific objects such as people, ships and vehicles in real time allows accidents to be discovered promptly and the objects to be located quickly, greatly improving the efficiency of search-and-rescue tasks, and therefore has important application value for safeguarding the security of the area.
The features learned by a convolutional neural network cannot sufficiently capture tiny, complex details, so the network struggles to distinguish small objects from the background or from objects with similar appearance during detection and classification in images. Directly applying conventional convolution-based detection algorithms makes it difficult to separate small objects from background noise, causing large numbers of false detections and missed detections; the resulting detection performance is insufficient to support the practical application scenario of infrared small-object monitoring and early warning.
In addition, an infrared camera continuously films the monitored area 24 hours a day and generates an enormous amount of video data; storing all of it would occupy a large amount of storage space. Yet for most of this footage no object requiring early warning appears, so full storage wastes resources, complicates subsequent playback of early warning videos, and greatly reduces the efficiency of monitoring and early warning work.
Disclosure of Invention
In view of the above problems, the invention aims to provide an early warning and storage method, system, equipment and medium based on infrared video monitoring, which remarkably enhance the capability of extracting image features in an infrared environment, optimize the feature fusion process and improve the accuracy of identifying target objects in infrared images.
The invention provides an early warning and storing method based on infrared video monitoring, which comprises the following steps:
Acquiring a video shot by an infrared camera, extracting an infrared image by utilizing the video, preprocessing the infrared image, constructing a data set from the preprocessed infrared image, and dividing the data set into a training set and a verification set according to a proportion;
Taking a YOLOv network as the feature extraction network, constructing a selectable strip convolution attention layer and a difference feature fusion network to improve the feature extraction network, and extracting a feature map of the infrared image through the improved feature extraction network, wherein the feature map comprises candidate areas of the target object;
Classifying the candidate areas and outputting early warning information of the target object;
And storing the infrared images containing the early warning information into videos according to the time sequence.
Further, the improved feature extraction network comprises a feature extraction backbone network, a difference feature fusion network and a detection head;
the feature extraction backbone network comprises a first convolution layer, a second convolution layer, a first feature extraction layer, a third convolution layer, a second feature extraction layer, a fourth convolution layer, a third feature extraction layer, a fifth convolution layer, a fourth feature extraction layer and a selectable strip-shaped convolution attention layer which are sequentially connected.
Further, the difference feature fusion network comprises a first upsampling layer, a first difference fusion layer, a first feature fusion layer, a second upsampling layer, a second difference fusion layer, a second feature fusion layer, a sixth convolution layer, a first connection layer, a third feature fusion layer, a seventh convolution layer, a second connection layer and a fourth feature fusion layer which are sequentially connected;
The second feature extraction layer is connected with the second difference fusion layer, the third feature extraction layer is connected with the first difference fusion layer, the selectable strip convolution attention layer is respectively connected with the first upsampling layer and the second connection layer, the first feature fusion layer is connected with the first connection layer, and the second feature fusion layer, the third feature fusion layer and the fourth feature fusion layer are all connected with the detection head.
Further, denoting the input feature map of the selectable strip convolution attention layer as X_in and its output feature map as X_out, the process of obtaining X_out includes:
the feature map X_in first generates selectable strip receptive fields through horizontal strip convolution and vertical strip convolution respectively, and feature maps F_H and F_V are then obtained through 1×1 convolution;
the feature maps F_H and F_V are channel-wise concatenated to obtain a feature map F;
average pooling and maximum pooling are applied to F to extract spatial relations, the pooled features are concatenated and convolved, and a Sigmoid activation function is applied to obtain the spatially selected attention map A:
A = σ(Conv([MaxPool(F); AvgPool(F)])),
where MaxPool is maximum pooling, AvgPool is average pooling, Conv is the convolution operation and σ is the Sigmoid activation function;
then the branch feature maps F_H and F_V are weighted by their corresponding spatial-selection maps and fused by convolution to obtain the attention feature map S:
S = Conv(A_H ⊙ F_H + A_V ⊙ F_V),
where A_H and A_V are the spatial-selection weights corresponding to the horizontal and vertical branches;
finally, the output feature map X_out of the selectable strip convolution attention layer is the element-wise product of the input feature map X_in and the attention feature map S:
X_out = X_in ⊙ S.
Further, the step of generating the selectable strip receptive fields by convolving the feature map X_in with horizontal strips and vertical strips respectively includes:
firstly, a horizontal convolution kernel and a vertical convolution kernel are created using asymmetric padding P(L, R, T, B), where L, R, T, B denote the number of padding pixels in the left, right, top and bottom directions respectively;
horizontal asymmetric zero padding P(k, 0, 1, 0), P(0, k, 0, 1) and vertical asymmetric zero padding P(0, 1, k, 0), P(1, 0, 0, k) are applied to the input tensor to obtain padded tensors X_1, X_2, X_3 and X_4;
then the padded tensors are convolved in parallel to obtain tensors Y_1, Y_2, Y_3 and Y_4:
Y_1 = Conv_{1×k}(X_1),
Y_2 = Conv_{1×k}(X_2),
Y_3 = Conv_{k×1}(X_3),
Y_4 = Conv_{k×1}(X_4),
where Conv_{1×k} denotes convolution with a kernel of height 1 and width k, Conv_{k×1} denotes convolution with a kernel of height k and width 1, c_1 is the number of output channels of the convolution kernel, s is the convolution stride, h, w and c are the height, width and channel number of the feature map X_in, and the height h_1 and width w_1 of the output tensors are functions of the input tensor, the padding and the stride;
the tensors Y_1 and Y_2 are concatenated, and the tensors Y_3 and Y_4 are concatenated, to obtain new tensors T_H and T_V:
T_H = Cat(Y_1, Y_2),
T_V = Cat(Y_3, Y_4),
where Cat denotes the concatenation operation;
a convolution with height 2, width 2 and c_2 output channels is applied to T_H and T_V respectively, followed by normalization, to obtain the final horizontal strip convolution output tensor X_H and vertical strip convolution output tensor X_V:
X_H = Norm(Conv_{2×2}(T_H)),
X_V = Norm(Conv_{2×2}(T_V)),
where h_2 and w_2 denote the height and width of the output tensors X_H and X_V respectively.
Further, the working process of the difference feature fusion network comprises the following steps:
adjacent-layer feature maps are denoted F_i and F_{i+1} respectively; pixel-by-pixel subtraction, an absolute-value operation and a 3×3 convolution Conv_{3×3} generate a difference feature map D:
D = Conv_{3×3}(|F_{i+1} - F_i|),
where i ∈ {1, 2} and F_1, F_2, F_3 denote the shallow, middle and deep features respectively;
then the difference feature map D is passed through a Sigmoid activation function and multiplied with the feature maps F_i and F_{i+1} respectively to obtain weighted feature maps F_i^w and F_{i+1}^w; F_i is added to F_i^w and F_{i+1} is added to F_{i+1}^w, generating feature maps M_i and M_{i+1} that contain multi-scale information:
F_i^w = σ(D) ⊙ F_i,
F_{i+1}^w = σ(D) ⊙ F_{i+1},
M_i = F_i + F_i^w,
M_{i+1} = F_{i+1} + F_{i+1}^w,
the feature maps M_i and M_{i+1} are channel-wise concatenated, a 3×3 convolution is applied for fusion, the fused feature is multiplied with the difference feature, and a further 3×3 convolution yields the final fused difference feature map F_fuse:
F_fuse = Conv_{3×3}(Conv_{3×3}(Cat(M_i, M_{i+1})) ⊙ D),
where Cat denotes channel-wise feature concatenation.
Further, the step of preprocessing the infrared image includes:
random horizontal flipping of the image, random scaling of the image size, random cropping of the image area, random changing of brightness, contrast, color saturation of the image, and Mosaic data enhancement of the image.
Further, the early warning information includes object category, confidence and detection frame coordinates.
In a second aspect, the present invention provides an infrared video monitoring based early warning and storage system, comprising:
the image acquisition module is used for acquiring video shot by the infrared camera and extracting an infrared image by utilizing the video;
the image processing module is used for preprocessing the infrared image;
The feature extraction module is used for extracting a feature map from the preprocessed image;
And the early warning and storage module is used for classifying the candidate areas and outputting early warning information of the target object, and storing the infrared images containing the early warning information into videos according to time sequence.
In a third aspect, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method described in the first aspect above when executing the computer program.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described in the first aspect above.
Compared with the prior art, the invention has the remarkable advantages that:
1. The improved YOLOv network architecture adds a selectable strip convolution attention layer in the feature extraction stage and constructs a difference-guided multi-scale feature fusion pyramid in the feature fusion stage, improving the accuracy of infrared image detection and benefiting infrared video monitoring and early warning;
2. The invention selectively stores the monitoring video: only the key video segments containing early warning targets are stored according to the detection results, which effectively reduces the storage space occupied by invalid video, makes it convenient for users to quickly locate and search the objects to be warned about, and improves the efficiency of playing back the monitoring video.
Drawings
FIG. 1 is a flow chart of an infrared video monitoring based early warning and storage method;
FIG. 2 is a schematic diagram of an improved feature extraction network;
FIG. 3 is a schematic diagram of an alternative bar-shaped convolution attention layer;
FIG. 4 is a schematic diagram of a difference feature fusion network;
FIG. 5 is a flow chart of recording an early warning video;
fig. 6 is a schematic diagram of early warning video storage.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Furthermore, in the description of the present specification and the appended claims, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise.
Example 1
Referring to fig. 1, the early warning and storage method based on infrared video monitoring according to the embodiment includes the following steps:
Step 1, acquiring a video shot by an infrared camera, extracting an infrared image by utilizing the video, preprocessing the infrared image, constructing a data set by the preprocessed infrared image, and dividing the data set into a training set and a verification set according to a proportion.
In this example, an infrared camera captures video in a monitoring area and the video stream is transmitted to a computer connected to the camera. An SRS streaming media server is built on the local computer, the infrared camera's video stream is H.264-encoded with the open-source program FFmpeg and pushed to the SRS server over the RTMP protocol, and the SRS server generates a video stream access address. For the infrared video to be detected, OpenCV takes the RTMP stream address as a parameter, establishes a connection with the streaming media server, pulls the video data, selects a suitable decoder according to the stream's encoding type and decodes it into frame-by-frame image data; after screening out the infrared pictures containing target objects such as people, vehicles and ships, 30,000 pictures are obtained. The images are divided into a training set and a verification set, for example in the proportion 8:2; the people, vehicles, ships and other target objects in the images are annotated with an image labeling tool; and data enhancement is applied to the infrared image data, including random horizontal flipping, random scaling of image size, random cropping of image areas, random changes to brightness, contrast and color saturation, and Mosaic data enhancement.
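To make the ingestion step concrete, the following minimal sketch (not part of the claimed method) shows how OpenCV can pull the RTMP stream published by the SRS server and save sampled frames for annotation; the stream address, sampling interval and output directory are illustrative assumptions.

```python
import os
import cv2

# Hypothetical RTMP address published by the SRS streaming server (assumption).
STREAM_URL = "rtmp://127.0.0.1/live/infrared"

def extract_frames(stream_url: str = STREAM_URL, every_n: int = 25, out_dir: str = "frames") -> int:
    """Pull the RTMP stream, decode it frame by frame and save every n-th frame."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(stream_url)        # OpenCV selects a decoder from the stream codec
    if not cap.isOpened():
        raise RuntimeError("failed to connect to the streaming media server")
    idx = saved = 0
    while True:
        ok, frame = cap.read()                # one decoded image per call
        if not ok:
            break                             # stream ended or connection lost
        if idx % every_n == 0:                # sample frames for the data set
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```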
Step 2, taking the YOLOv network as the feature extraction network, constructing a selectable strip convolution attention layer and a difference feature fusion network to improve the feature extraction network, and extracting a feature map of the infrared image through the improved feature extraction network, wherein the feature map comprises candidate areas of the target object.
In this example, a YOLOv network is selected as the feature extraction network and improved: a selectable strip convolution attention layer is added to the feature extraction backbone network, and a difference feature fusion network architecture is adopted to optimize the feature fusion network.
With reference to fig. 2, further, the improved feature extraction network includes a feature extraction backbone network, a differential feature fusion network and a detection head;
the feature extraction backbone network comprises a first convolution layer, a second convolution layer, a first feature extraction layer, a third convolution layer, a second feature extraction layer, a fourth convolution layer, a third feature extraction layer, a fifth convolution layer, a fourth feature extraction layer and a selectable strip-shaped convolution attention layer which are sequentially connected.
Further, the difference feature fusion network comprises a first upsampling layer, a first difference fusion layer, a first feature fusion layer, a second upsampling layer, a second difference fusion layer, a second feature fusion layer, a sixth convolution layer, a first connection layer, a third feature fusion layer, a seventh convolution layer, a second connection layer and a fourth feature fusion layer which are sequentially connected;
The second feature extraction layer is connected with the second difference fusion layer, the third feature extraction layer is connected with the first difference fusion layer, the selectable strip convolution attention layer is respectively connected with the first upsampling layer and the second connection layer, the first feature fusion layer is connected with the first connection layer, and the second feature fusion layer, the third feature fusion layer and the fourth feature fusion layer are all connected with the detection head.
In infrared images, people are generally distributed as vertical strips while ships and vehicles are distributed as horizontal strips. In this example a strip convolution is designed whose receptive field is strip-shaped, matching the gray-level distribution of the target, enlarging the receptive field and enhancing feature extraction. The selectable strip convolution attention layer comprises a horizontal strip convolution module, a vertical strip convolution module and a spatial selection module: the input feature map generates selectable strip receptive fields through the horizontal and vertical strip convolution modules, and the spatial selection module is introduced so that the network attends to the most relevant spatial context area, improving the classification and localization of the target object.
Further, with reference to fig. 3, denoting the input feature map of the selectable strip convolution attention layer as X_in and its output feature map as X_out, the process of obtaining X_out includes:
the feature map X_in first generates selectable strip receptive fields through horizontal strip convolution and vertical strip convolution respectively, and feature maps F_H and F_V are then obtained through 1×1 convolution;
the feature maps F_H and F_V are channel-wise concatenated to obtain a feature map F;
average pooling and maximum pooling are applied to F to extract spatial relations, the pooled features are concatenated and convolved, and a Sigmoid activation function is applied to obtain the spatially selected attention map A:
A = σ(Conv([MaxPool(F); AvgPool(F)])),
where MaxPool is maximum pooling, AvgPool is average pooling, Conv is the convolution operation and σ is the Sigmoid activation function;
then the branch feature maps F_H and F_V are weighted by their corresponding spatial-selection maps and fused by convolution to obtain the attention feature map S:
S = Conv(A_H ⊙ F_H + A_V ⊙ F_V),
where A_H and A_V are the spatial-selection weights corresponding to the horizontal and vertical branches;
finally, the output feature map X_out of the selectable strip convolution attention layer is the element-wise product of the input feature map X_in and the attention feature map S:
X_out = X_in ⊙ S.
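A minimal PyTorch sketch of the selectable strip convolution attention layer described above is given below; it is an illustrative reading of the text, in which the strip branches are stood in for by simple 1×k and k×1 convolutions, and the kernel size k = 7 and convolution sizes are assumed values rather than parameters fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SelectableStripAttention(nn.Module):
    """Sketch of the selectable strip convolution attention layer (k = 7 assumed)."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        # Horizontal / vertical strip branches (simplified stand-ins for the
        # asymmetric-padding strip convolutions described in the text).
        self.h_strip = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.v_strip = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.squeeze_h = nn.Conv2d(channels, channels, 1)    # 1x1 conv -> F_H
        self.squeeze_v = nn.Conv2d(channels, channels, 1)    # 1x1 conv -> F_V
        self.select = nn.Conv2d(2, 2, 7, padding=3)          # spatial selection over pooled maps
        self.fuse = nn.Conv2d(channels, channels, 1)         # fuse weighted branches -> S

    def forward(self, x):
        f_h = self.squeeze_h(self.h_strip(x))                # horizontal branch feature F_H
        f_v = self.squeeze_v(self.v_strip(x))                # vertical branch feature F_V
        f = torch.cat([f_h, f_v], dim=1)                     # channel concatenation -> F
        pooled = torch.cat([f.max(dim=1, keepdim=True).values,
                            f.mean(dim=1, keepdim=True)], dim=1)  # max / average pooling maps
        a = torch.sigmoid(self.select(pooled))               # spatial-selection weights A_H, A_V
        s = self.fuse(a[:, 0:1] * f_h + a[:, 1:2] * f_v)     # weight branches and fuse -> S
        return x * s                                         # X_out = X_in ⊙ S
```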
Further, the step of generating the selectable strip receptive fields by convolving the feature map X_in with horizontal strips and vertical strips respectively includes:
firstly, a horizontal convolution kernel and a vertical convolution kernel are created using asymmetric padding P(L, R, T, B), where L, R, T, B denote the number of padding pixels in the left, right, top and bottom directions respectively;
horizontal asymmetric zero padding P(k, 0, 1, 0), P(0, k, 0, 1) and vertical asymmetric zero padding P(0, 1, k, 0), P(1, 0, 0, k) are applied to the input tensor to obtain padded tensors X_1, X_2, X_3 and X_4;
then the padded tensors are convolved in parallel to obtain tensors Y_1, Y_2, Y_3 and Y_4:
Y_1 = Conv_{1×k}(X_1),
Y_2 = Conv_{1×k}(X_2),
Y_3 = Conv_{k×1}(X_3),
Y_4 = Conv_{k×1}(X_4),
where Conv_{1×k} denotes convolution with a kernel of height 1 and width k, Conv_{k×1} denotes convolution with a kernel of height k and width 1, c_1 is the number of output channels of the convolution kernel, s is the convolution stride, h, w and c are the height, width and channel number of the feature map X_in, and the height h_1 and width w_1 of the output tensors are functions of the input tensor, the padding and the stride;
the tensors Y_1 and Y_2 are concatenated, and the tensors Y_3 and Y_4 are concatenated, to obtain new tensors T_H and T_V:
T_H = Cat(Y_1, Y_2),
T_V = Cat(Y_3, Y_4),
where Cat denotes the concatenation operation;
a convolution with height 2, width 2 and c_2 output channels is applied to T_H and T_V respectively, followed by normalization, to obtain the final horizontal strip convolution output tensor X_H and vertical strip convolution output tensor X_V:
X_H = Norm(Conv_{2×2}(T_H)),
X_V = Norm(Conv_{2×2}(T_V)),
where h_2 and w_2 denote the height and width of the output tensors X_H and X_V respectively.
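The asymmetric-padding strip convolutions can be sketched as follows, assuming stride s = 1 and illustrative channel counts c_1 and c_2; F.pad takes padding in the order (left, right, top, bottom), matching P(L, R, T, B), and the last padding pattern P(1, 0, 0, k) is inferred from the symmetry of the other three.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripConvBranches(nn.Module):
    """Sketch of the asymmetric-padding strip convolutions (k, c1, c2 are assumed values)."""
    def __init__(self, in_ch: int, c1: int = 32, c2: int = 32, k: int = 7, s: int = 1):
        super().__init__()
        self.k = k
        self.conv_1xk = nn.Conv2d(in_ch, c1, (1, k), stride=s)  # kernel height 1, width k
        self.conv_kx1 = nn.Conv2d(in_ch, c1, (k, 1), stride=s)  # kernel height k, width 1
        self.norm_h = nn.Sequential(nn.Conv2d(2 * c1, c2, 2), nn.BatchNorm2d(c2))
        self.norm_v = nn.Sequential(nn.Conv2d(2 * c1, c2, 2), nn.BatchNorm2d(c2))

    def forward(self, x):
        k = self.k
        x1 = F.pad(x, (k, 0, 1, 0))   # P(k, 0, 1, 0)
        x2 = F.pad(x, (0, k, 0, 1))   # P(0, k, 0, 1)
        x3 = F.pad(x, (0, 1, k, 0))   # P(0, 1, k, 0)
        x4 = F.pad(x, (1, 0, 0, k))   # P(1, 0, 0, k), last argument assumed from symmetry
        y1, y2 = self.conv_1xk(x1), self.conv_1xk(x2)   # parallel 1xk convolutions
        y3, y4 = self.conv_kx1(x3), self.conv_kx1(x4)   # parallel kx1 convolutions
        t_h = torch.cat([y1, y2], dim=1)                # T_H = Cat(Y1, Y2)
        t_v = torch.cat([y3, y4], dim=1)                # T_V = Cat(Y3, Y4)
        return self.norm_h(t_h), self.norm_v(t_v)       # X_H, X_V after 2x2 conv + normalization
```

With s = 1 the four padded branches all produce (h + 1) × (w + 1) maps, and the final 2×2 convolution returns the outputs to the original h × w size.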
Shallow features typically contain more detail and edge information, while deep features contain more high-level semantic information. To integrate information at different scales and improve the model's understanding of both details and global context, the deep features are upsampled and then subtracted pixel by pixel from the shallow features to obtain difference features. Through this subtraction the model can learn the difference between deep and shallow features and avoid losing the shallow features, thereby retaining more detail information.
With reference to the structural schematic diagram of the difference feature fusion network shown in fig. 4, the difference feature fusion network fuses features obtained from adjacent-layer feature maps and generates a difference feature map; its working process includes:
adjacent-layer feature maps are denoted F_i and F_{i+1} respectively; pixel-by-pixel subtraction, an absolute-value operation and a 3×3 convolution Conv_{3×3} generate a difference feature map D:
D = Conv_{3×3}(|F_{i+1} - F_i|),
where i ∈ {1, 2}; the second feature extraction layer outputs the shallow feature F_1, the third feature extraction layer outputs the middle feature F_2, and the selectable strip convolution attention layer outputs the deep feature F_3;
then the difference feature map D is passed through a Sigmoid activation function and multiplied with the feature maps F_i and F_{i+1} respectively to obtain weighted feature maps F_i^w and F_{i+1}^w; F_i is added to F_i^w and F_{i+1} is added to F_{i+1}^w, generating feature maps M_i and M_{i+1} that contain multi-scale information:
F_i^w = σ(D) ⊙ F_i,
F_{i+1}^w = σ(D) ⊙ F_{i+1},
M_i = F_i + F_i^w,
M_{i+1} = F_{i+1} + F_{i+1}^w,
the feature maps M_i and M_{i+1} are channel-wise concatenated, a 3×3 convolution is applied for fusion, the fused feature is multiplied with the difference feature, and a further 3×3 convolution yields the final fused difference feature map F_fuse:
F_fuse = Conv_{3×3}(Conv_{3×3}(Cat(M_i, M_{i+1})) ⊙ D),
where Cat denotes channel-wise feature concatenation.
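The following sketch shows one difference fusion step under the assumption that the adjacent feature maps have already been projected to the same channel number; the upsampling mode and channel handling are illustrative choices, not values fixed by this embodiment.

```python
import torch
import torch.nn as nn

class DifferenceFusion(nn.Module):
    """Sketch of one difference fusion step between adjacent feature levels."""
    def __init__(self, channels: int):
        super().__init__()
        self.diff_conv = nn.Conv2d(channels, channels, 3, padding=1)    # 3x3 conv on |F_{i+1}-F_i|
        self.fuse_conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, f_shallow, f_deep):
        f_deep = self.up(f_deep)                            # align resolutions before subtraction
        d = self.diff_conv(torch.abs(f_deep - f_shallow))   # difference feature map D
        w = torch.sigmoid(d)                                # Sigmoid gating of the difference
        m1 = f_shallow + w * f_shallow                      # shallow branch enriched by D
        m2 = f_deep + w * f_deep                            # deep branch enriched by D
        fused = self.fuse_conv(torch.cat([m1, m2], dim=1))  # channel concat + 3x3 fusion
        return self.out_conv(fused * d)                     # multiply with D, final 3x3 conv
```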
Step 3, classifying the candidate areas and outputting the early warning information of the target object.
For example, the second feature fusion layer outputs a feature map of size 80×80, the third feature fusion layer a feature map of size 40×40 and the fourth feature fusion layer a feature map of size 20×20. The detection head convolves the feature maps of the three scales to obtain the category, confidence and detection-frame coordinates of the targets; all detection frames from the three scales are combined, a preliminary screening is performed against a confidence threshold, and redundant frames are finally removed through non-maximum suppression. The category, confidence and detection-frame coordinates corresponding to each remaining detection frame constitute the final early warning information.
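The post-processing of the detection-head outputs can be sketched with torchvision's NMS as below; the IoU threshold of 0.5 is an assumed value, while the confidence threshold of 0.2 follows the value used later in this example.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, labels, conf_thr: float = 0.2, iou_thr: float = 0.5):
    """Merge detections from the three scales, filter by confidence and apply NMS.

    boxes: (N, 4) xyxy tensor, scores: (N,), labels: (N,), already concatenated
    across the 80x80, 40x40 and 20x20 heads.
    """
    keep = scores >= conf_thr                       # preliminary confidence screening
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = []
    for cls in labels.unique():                     # class-wise non-maximum suppression
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        idx = idx[nms(boxes[idx], scores[idx], iou_thr)]
        kept.append(idx)
    kept = torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)
    return boxes[kept], scores[kept], labels[kept]
```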
In this example, the improved feature extraction network is first trained using image data and label data in a training set, the training process comprising:
The images of the training set are resized to 640×640 and input into the feature extraction backbone network for feature extraction; the three different-scale features output by the backbone are then fused by the difference feature fusion network, which outputs three feature maps of sizes 80×80, 40×40 and 20×20, and the three detection heads generate the target category, confidence and detection-frame coordinates corresponding to each detection frame. The model's predictions are compared with the label ground truth and a loss function, such as the CIoU loss, is calculated. According to the calculated loss, the gradients of the model parameters are computed with the back-propagation algorithm and the parameters are updated with an SGD optimizer. The hardware environment used for training is an NVIDIA GeForce RTX 4090D GPU and an Intel(R) Xeon(R) CPU E5-2680 v4; the software environment is the PyTorch deep learning framework under Ubuntu; the initial learning rate is set to 0.01, the number of training rounds to 200 and the batch size to 32.
After training, the pictures in the test set are input into the improved feature extraction network, and the feature map of each scale outputs a group of prediction results including target category, confidence and detection-frame coordinates. The outputs of the three detection heads are concatenated, the confidences are screened, and detection frames with confidence below the threshold of 0.2 are filtered out. Finally, non-maximum suppression is performed to eliminate redundant detection frames, keeping only the detection frame with the highest confidence at each target position to obtain the final detection result.
Each frame of image data is read by traversal and passed as input to the improved feature extraction network to detect specific infrared objects and obtain the final detection result; the early warning information, such as object category, confidence and detection-frame coordinates, is sent to the client server in JSON format.
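A possible JSON packaging of the early warning information is sketched below; the field names and message structure are assumptions for illustration, since the embodiment only specifies that object category, confidence and detection-frame coordinates are sent in JSON format.

```python
import json

def detections_to_json(boxes, scores, labels, class_names):
    """Package the final detections as the early warning message sent to the client server."""
    records = []
    for box, score, label in zip(boxes.tolist(), scores.tolist(), labels.tolist()):
        records.append({
            "category": class_names[int(label)],    # object category, e.g. person / vehicle / ship
            "confidence": round(float(score), 3),   # detection confidence
            "bbox": [round(v, 1) for v in box],     # detection-frame coordinates (x1, y1, x2, y2)
        })
    return json.dumps({"alarms": records}, ensure_ascii=False)
```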
Step 4, storing the infrared images containing the early warning information as videos in chronological order.
Referring to fig. 5, when one or more target objects are detected in the current frame, the system automatically starts the recording function if it is not already recording, creates a new video file and begins writing the current and subsequent frames into it; if it is already recording, it continues writing the current frame into the current video file. When no target object is detected in the current frame and the system was previously recording, it stops the writing operation and finishes and saves the current video file, thereby automatically storing the video clips that contain early warning targets.
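The recording logic of fig. 5 can be sketched as a small state machine around OpenCV's VideoWriter; the file-naming pattern and codec are illustrative assumptions.

```python
import cv2

class AlarmRecorder:
    """Sketch of the start/stop recording logic driven by per-frame detections."""
    def __init__(self, fps: float, frame_size, out_pattern: str = "alarm_{:03d}.mp4"):
        self.fps, self.frame_size = fps, frame_size   # frame_size is (width, height)
        self.out_pattern, self.count = out_pattern, 0
        self.writer = None                            # not recording initially

    def update(self, frame, has_target: bool):
        if has_target:
            if self.writer is None:                   # target appeared: open a new video file
                fourcc = cv2.VideoWriter_fourcc(*"mp4v")
                self.writer = cv2.VideoWriter(
                    self.out_pattern.format(self.count), fourcc, self.fps, self.frame_size)
                self.count += 1
            self.writer.write(frame)                  # keep writing while targets are present
        elif self.writer is not None:                 # no target: close and save current clip
            self.writer.release()
            self.writer = None
```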
As shown in fig. 6, in the overall detection video the early warning segments A, B and C contain target objects while the remaining segments are normal video; the system automatically records and stores the early warning segments A, B and C, and the normal segments in which no target object is detected are not stored. This effectively reduces the storage of irrelevant video data and significantly reduces the waste of storage resources. At the same time, keeping only the key video containing early warning targets improves the playback efficiency of early warning video, makes it convenient for users to quickly locate and query the related objects, and improves the responsiveness and usability of the system in practical applications.
In one example, an infrared camera continuously filmed a fixed area for 10 minutes; the video storage results are shown in Table 1. Six video segments were stored, with a total stored duration of 61 seconds. The video segments containing early warning objects are separated from the original video and stored, greatly reducing the space wasted on storing invalid background video and improving the efficiency with which users play back the monitoring video and review early warning targets.
Table 1 video storage results table
Example 2
The early warning and storage system based on infrared video monitoring of this embodiment includes:
the image acquisition module is used for acquiring video shot by the infrared camera and extracting an infrared image by utilizing the video;
the image processing module is used for preprocessing the infrared image;
The feature extraction module is used for extracting a feature map from the preprocessed image;
And the early warning and storage module is used for classifying the candidate areas and outputting early warning information of the target object, and storing the infrared images containing the early warning information into videos according to time sequence.
Example 3
The present embodiment provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method embodiments described above when the processor executes the computer program.
The present embodiment also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method described in the above method embodiments.

Claims (11)

1. The early warning and storing method based on infrared video monitoring is characterized by comprising the following steps:
Acquiring a video shot by an infrared camera, extracting an infrared image by utilizing the video, preprocessing the infrared image, constructing a data set from the preprocessed infrared image, and dividing the data set into a training set and a verification set according to a proportion;
Taking a YOLOv network as the feature extraction network, constructing a selectable strip convolution attention layer and a difference feature fusion network to improve the feature extraction network, and extracting a feature map of the infrared image through the improved feature extraction network, wherein the feature map comprises candidate areas of the target object;
Classifying the candidate areas and outputting early warning information of the target object;
And storing the infrared images containing the early warning information into videos according to the time sequence.
2. The infrared video monitoring-based early warning and storage method according to claim 1, wherein the improved feature extraction network comprises a feature extraction backbone network, a difference feature fusion network and a detection head;
the feature extraction backbone network comprises a first convolution layer, a second convolution layer, a first feature extraction layer, a third convolution layer, a second feature extraction layer, a fourth convolution layer, a third feature extraction layer, a fifth convolution layer, a fourth feature extraction layer and a selectable strip-shaped convolution attention layer which are sequentially connected.
3. The infrared video monitoring-based early warning and storage method according to claim 2, wherein the difference feature fusion network comprises a first upsampling layer, a first difference fusion layer, a first feature fusion layer, a second upsampling layer, a second difference fusion layer, a second feature fusion layer, a sixth convolution layer, a first connection layer, a third feature fusion layer, a seventh convolution layer, a second connection layer and a fourth feature fusion layer which are sequentially connected;
The second feature extraction layer is connected with the second difference fusion layer, the third feature extraction layer is connected with the first difference fusion layer, the selectable strip convolution attention layer is respectively connected with the first upsampling layer and the second connection layer, the first feature fusion layer is connected with the first connection layer, and the second feature fusion layer, the third feature fusion layer and the fourth feature fusion layer are all connected with the detection head.
4. The early warning and storage method based on infrared video monitoring according to claim 3, wherein, denoting the input feature map of the selectable strip convolution attention layer as X_in and its output feature map as X_out, the process of obtaining X_out includes:
the feature map X_in first generates selectable strip receptive fields through horizontal strip convolution and vertical strip convolution respectively, and feature maps F_H and F_V are then obtained through 1×1 convolution;
the feature maps F_H and F_V are channel-wise concatenated to obtain a feature map F;
average pooling and maximum pooling are applied to F to extract spatial relations, the pooled features are concatenated and convolved, and a Sigmoid activation function is applied to obtain the spatially selected attention map A:
A = σ(Conv([MaxPool(F); AvgPool(F)])),
where MaxPool is maximum pooling, AvgPool is average pooling, Conv is the convolution operation and σ is the Sigmoid activation function;
then the branch feature maps F_H and F_V are weighted by their corresponding spatial-selection maps and fused by convolution to obtain the attention feature map S:
S = Conv(A_H ⊙ F_H + A_V ⊙ F_V),
where A_H and A_V are the spatial-selection weights corresponding to the horizontal and vertical branches;
finally, the output feature map X_out of the selectable strip convolution attention layer is the element-wise product of the input feature map X_in and the attention feature map S:
X_out = X_in ⊙ S.
5. The early warning and storage method based on infrared video monitoring according to claim 4, wherein the step of generating the selectable strip receptive fields by convolving the feature map X_in with horizontal strips and vertical strips respectively includes:
firstly, a horizontal convolution kernel and a vertical convolution kernel are created using asymmetric padding P(L, R, T, B), where L, R, T, B denote the number of padding pixels in the left, right, top and bottom directions respectively;
horizontal asymmetric zero padding P(k, 0, 1, 0), P(0, k, 0, 1) and vertical asymmetric zero padding P(0, 1, k, 0), P(1, 0, 0, k) are applied to the input tensor to obtain padded tensors X_1, X_2, X_3 and X_4;
then the padded tensors are convolved in parallel to obtain tensors Y_1, Y_2, Y_3 and Y_4:
Y_1 = Conv_{1×k}(X_1),
Y_2 = Conv_{1×k}(X_2),
Y_3 = Conv_{k×1}(X_3),
Y_4 = Conv_{k×1}(X_4),
where Conv_{1×k} denotes convolution with a kernel of height 1 and width k, Conv_{k×1} denotes convolution with a kernel of height k and width 1, c_1 is the number of output channels of the convolution kernel, s is the convolution stride, h, w and c are the height, width and channel number of the feature map X_in, and the height h_1 and width w_1 of the output tensors are functions of the input tensor, the padding and the stride;
the tensors Y_1 and Y_2 are concatenated, and the tensors Y_3 and Y_4 are concatenated, to obtain new tensors T_H and T_V:
T_H = Cat(Y_1, Y_2),
T_V = Cat(Y_3, Y_4),
where Cat denotes the concatenation operation;
a convolution with height 2, width 2 and c_2 output channels is applied to T_H and T_V respectively, followed by normalization, to obtain the final horizontal strip convolution output tensor X_H and vertical strip convolution output tensor X_V:
X_H = Norm(Conv_{2×2}(T_H)),
X_V = Norm(Conv_{2×2}(T_V)),
where h_2 and w_2 denote the height and width of the output tensors X_H and X_V respectively.
6. The early warning and storage method based on infrared video monitoring according to claim 5, wherein the working process of the difference feature fusion network comprises the following steps:
adjacent-layer feature maps are denoted F_i and F_{i+1} respectively; pixel-by-pixel subtraction, an absolute-value operation and a 3×3 convolution Conv_{3×3} generate a difference feature map D:
D = Conv_{3×3}(|F_{i+1} - F_i|),
where i ∈ {1, 2} and F_1, F_2, F_3 denote the shallow, middle and deep features respectively;
then the difference feature map D is passed through a Sigmoid activation function and multiplied with the feature maps F_i and F_{i+1} respectively to obtain weighted feature maps F_i^w and F_{i+1}^w; F_i is added to F_i^w and F_{i+1} is added to F_{i+1}^w, generating feature maps M_i and M_{i+1} that contain multi-scale information:
F_i^w = σ(D) ⊙ F_i,
F_{i+1}^w = σ(D) ⊙ F_{i+1},
M_i = F_i + F_i^w,
M_{i+1} = F_{i+1} + F_{i+1}^w,
the feature maps M_i and M_{i+1} are channel-wise concatenated, a 3×3 convolution is applied for fusion, the fused feature is multiplied with the difference feature, and a further 3×3 convolution yields the final fused difference feature map F_fuse:
F_fuse = Conv_{3×3}(Conv_{3×3}(Cat(M_i, M_{i+1})) ⊙ D),
where Cat denotes channel-wise feature concatenation.
7. The method for pre-warning and storing based on infrared video monitoring according to claim 1, wherein the step of preprocessing the infrared image comprises:
random horizontal flipping of the image, random scaling of the image size, random cropping of the image area, random changing of brightness, contrast, color saturation of the image, and Mosaic data enhancement of the image.
8. The method of claim 1, wherein the pre-warning information includes object type, confidence level, and detection frame coordinates.
9. Early warning and memory system based on infrared video monitoring, characterized by comprising:
the image acquisition module is used for acquiring video shot by the infrared camera and extracting an infrared image by utilizing the video;
the image processing module is used for preprocessing the infrared image;
The feature extraction module is used for extracting a feature map from the preprocessed image;
And the early warning and storage module is used for classifying the candidate areas and outputting early warning information of the target object, and storing the infrared images containing the early warning information into videos according to time sequence.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 8 when executing the computer program.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 8.
CN202511016292.2A 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring Active CN120529036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511016292.2A CN120529036B (en) 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511016292.2A CN120529036B (en) 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring

Publications (2)

Publication Number Publication Date
CN120529036A true CN120529036A (en) 2025-08-22
CN120529036B CN120529036B (en) 2025-09-23

Family

ID=96754728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511016292.2A Active CN120529036B (en) 2025-07-23 2025-07-23 Early warning and storage method, system, equipment and medium based on infrared video monitoring

Country Status (1)

Country Link
CN (1) CN120529036B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220051025A1 (en) * 2019-11-15 2022-02-17 Tencent Technology (Shenzhen) Company Limited Video classification method and apparatus, model training method and apparatus, device, and storage medium
US20220164934A1 (en) * 2020-09-30 2022-05-26 Boe Technology Group Co., Ltd. Image processing method and apparatus, device, video processing method and storage medium
CN119763027A (en) * 2024-11-11 2025-04-04 霖久智慧(广东)科技有限公司 Property monitoring method, device, equipment, and storage medium based on improved YOLOV5
CN119919887A (en) * 2025-01-02 2025-05-02 青岛海容商用冷链股份有限公司 A production line safety monitoring method and system based on machine vision
CN120163956A (en) * 2025-02-28 2025-06-17 南京信息工程大学 A method and related device for infrared small target detection based on selective state space model

Also Published As

Publication number Publication date
CN120529036B (en) 2025-09-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant