
CN116403135B - Video significance prediction method and system based on audio and video features - Google Patents

Video significance prediction method and system based on audio and video features

Info

Publication number
CN116403135B
CN116403135B (application CN202310247030.1A)
Authority
CN
China
Prior art keywords
features
video
audio
saliency
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310247030.1A
Other languages
Chinese (zh)
Other versions
CN116403135A (en)
Inventor
陈震中
陈钊
张考
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310247030.1A priority Critical patent/CN116403135B/en
Publication of CN116403135A publication Critical patent/CN116403135A/en
Application granted granted Critical
Publication of CN116403135B publication Critical patent/CN116403135B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
      • G06: COMPUTING OR CALCULATING; COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/045: Combinations of networks
                  • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
                • G06N3/0464: Convolutional networks [CNN, ConvNet]
              • G06N3/08: Learning methods
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V10/00: Arrangements for image or video recognition or understanding
            • G06V10/40: Extraction of image or video features
              • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
                • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
                • G06V10/467: Encoded features or binary features, e.g. local binary patterns [LBP]
            • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V10/806: Fusion of extracted features
              • G06V10/82: Arrangements using neural networks
          • G06V20/00: Scenes; Scene-specific elements
            • G06V20/40: Scenes; Scene-specific elements in video content
              • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03: characterised by the type of extracted parameters
              • G10L25/24: the extracted parameters being the cepstrum
            • G10L25/48: specially adapted for particular use
              • G10L25/51: for comparison or discrimination
                • G10L25/57: for processing of video signals
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract


The present invention addresses the field of video saliency prediction and discloses a method and system for video saliency prediction based on audio and video features. First, through data preprocessing, the video to be predicted is processed into video frames and logarithmic Mel spectrograms. Three parallel encoders extract the spatial and temporal features of the video frames, as well as the semantic features of the audio time-synchronized with the video frames. Next, an audio-video feature fusion module based on a channel attention mechanism utilizes the semantic information of the visual and audio features to adaptively learn the inter-channel weights of the visual features, thereby achieving audio-visual feature fusion. Finally, the fused audio-visual features are passed through a decoder to obtain a saliency prediction map for the video frame.

Description

Video significance prediction method and system based on audio and video features
Technical Field
The invention relates to the field of video saliency prediction, in particular to a video saliency prediction method and system based on audio and video characteristics.
Background
When observing a scene, the human visual system can quickly and selectively pick out the most attractive parts from the large amount of received information, rather than processing the information in every region. Even in complex environments, humans can rapidly focus on the important parts of a scene. This ability is called the visual attention mechanism, and the regions of a scene that attract the attention of the human visual system are called salient regions. Understanding and modeling this mechanism in order to predict which regions of a video are more attractive to humans is the subject of video saliency research. With the continuous development and widespread use of capture devices, video data can be acquired and used very conveniently. Compared with images, video has stronger expressive power and richer information, so research on video saliency prediction is of great significance for video analysis and processing. Applying the human visual attention mechanism to video processing can help a computer process the information in a video selectively: it allows computers to allocate limited computing resources preferentially to the more important regions, which greatly improves the efficiency of video processing methods. Research on video saliency prediction can be applied to video processing fields such as video quality assessment, video compression, video surveillance, target recognition and target segmentation.
Video saliency prediction has seen significant improvements in accuracy over the past decades, moving from traditional algorithms based on hand-crafted features to models based on deep learning. Existing methods can be divided into signal-compression-based methods and image-pixel-based methods. Signal-compression-based methods regard saliency as a measure of signal compressibility and extract features from the video bitstream: they first encode and compress the video, then measure the reconstruction error of the decoded data, and regions that produce larger reconstruction errors are considered salient. Image-pixel-based methods, starting from feature integration theory, predict salient regions using spatio-temporal information extracted from video frames. However, mainstream video saliency prediction methods use only the visual information in a scene and ignore the synchronously played audio, which differs from how people actually watch videos. While vision is the primary means by which humans perceive the external environment, hearing provides a large amount of important information at the same time. Humans receive related information about the same object or event through both hearing and vision, for example hearing an engine and seeing a moving vehicle, or hearing speech and seeing a person's lips move. This synchronized information is combined in the human brain and jointly shapes the distribution of visual attention over the scene. Cognitive neuroscience studies indicate that combined audio-visual stimuli enhance the response of the human sensory system, and that integrating audio-visual information allows the regions of a scene that attract visual attention to be located more quickly. Eye-tracking experiments on the same video under different audio conditions demonstrate even more intuitively that audio information influences the regions of interest of the human visual system. A video saliency prediction method based on audio and video features therefore produces predictions that better match how humans actually watch videos in real scenes. Compared with mature visual-only video saliency prediction methods, research on saliency prediction based on audio and video features is still at an early stage: audio information is rarely exploited in video saliency tasks, and most existing attempts simply add or concatenate audio features and visual features. How to reasonably fuse audio features and video features when using audio-visual information for saliency prediction therefore remains a challenge.
Disclosure of Invention
When humans watch a scene, the visual and auditory systems act together to determine the regions that attract attention. Most current video saliency prediction methods process only visual information such as video frames; they can predict the regions of interest of the visual system reasonably well, but they hardly account for the influence of audio information on human attention. Meanwhile, the existing video saliency prediction methods that do consider auditory information realize feature fusion simply by adding or concatenating the audio and video features, which cannot fuse them effectively, so the predicted salient regions are inaccurate. In view of this situation, the invention provides a saliency prediction method based on audio and video features.
Inspired by two-stream network frameworks, a multi-stream network framework is designed that separately extracts the visual spatial saliency features within video frames, the visual temporal saliency features between video frames, and the semantic saliency features of the audio that is time-synchronized with the video frames. The invention provides an audio-video semantic fusion module at the channel level, which weights the visual saliency result using the semantic information of the audio and video features in the scene, so that the visually salient regions are predicted from the audio and video features jointly. The proposed method adopts a multi-stream architecture to improve the accuracy of saliency prediction. To effectively extract the visual temporal features, visual spatial features and audio semantic features related to saliency, a new multi-stream architecture is proposed: a spatial encoder extracts visual spatial features from a video frame, a temporal encoder extracts inter-frame visual temporal features from several consecutive video frames, and an audio encoder extracts audio semantic features from a logarithmic Mel spectrogram time-synchronized with the video frames. To improve the accuracy of visual saliency prediction, a visual saliency base feature extraction module is provided, which is an improvement on the MobileNet-V2 model and is used in the spatial encoder and the temporal encoder to extract multi-level saliency-related base features of each frame. To fuse audio and video features effectively, an audio-video fusion module is provided, which weights the visual saliency prediction result at the channel level using the semantic information in the audio and video features, so that channels that are more salient in the audio-visual information receive larger weights; the audio information thus influences the final saliency prediction result without introducing audio interference.
A video significance prediction method based on audio and video features comprises the following specific steps:
step 1, data preprocessing, namely firstly processing a video to be predicted and time-synchronized audio thereof into a continuous video frame and a logarithmic Mel spectrogram;
step 2, firstly, constructing a visual saliency basic feature extraction module for extracting features of video frames;
step 3, extracting space significance features and time significance features contained in the video frames based on the features of the video frames, fusing the space features and the time features to obtain visual significance features, and extracting audio semantic features contained in the logarithmic mel-frequency spectrogram;
Step 4, constructing an audio-video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio-video saliency features;
step 5, integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
Step 6, training the integral model formed by the steps 2-5;
And 7, utilizing the trained integral model to realize visual saliency prediction.
Further, the specific implementation manner of the step 1 is as follows;
step 1.1, firstly, processing a video to be predicted, separating out audio, converting the audio into a wav format, and cutting the video into video frames according to the original video frame rate;
Step 1.2, processing the video frame comprises the steps of reading a picture and converting the picture into an RGB format, adjusting the resolution of the picture into W multiplied by H, carrying out normalization operation on the picture, and converting the data type into a Tensor type;
Step 1.3, converting an audio wav file into a logarithmic Mel spectrogram which is time-synchronous with video frames, wherein the method comprises the steps of resampling the audio wav file into 16000Hz, dividing the resampled audio file into frames with non-overlapping time length of 960ms, decomposing each frame with 960ms through a window with time length of 25ms and step length of 10ms by short-time Fourier transform, integrating the generated spectrogram into 64 frequency bands with Mel intervals, adding a small offset to the amplitude of each frequency band, carrying out logarithmic conversion to obtain a logarithmic Mel spectrogram with the size of 96 multiplied by 64 and the channel number of 1, acquiring each frame timestamp of the video frame, corresponding the time length covered by the logarithmic Mel spectrogram with the video frame timestamp, and converting the data type into Tensor type.
Further, the visual saliency basic feature extraction module is improved on the basis of a MobileNet-V2 model, and the specific improvement is as follows:
Firstly, changing a final space pooling layer in a MobileNet-V2 model into a pyramid pooling layer with holes, wherein the pyramid pooling layer with holes consists of K=4 parallel convolution layers with holes, using a single convolution layer with the aperture size of 1 to retain original information, extracting different scale features by using three convolution layers with different aperture sizes, and connecting the features obtained by the four convolution layers to be used as the output of the last convolution block of the MobileNet-V2 model;
And secondly, connecting MobileNet-V2 to output of the last three convolution blocks, respectively carrying out channel dimension adjustment on the output characteristics of the three convolution blocks by using three convolution layers, wherein the output spatial resolutions obtained by the three convolution blocks are different, carrying out bilinear interpolation up-sampling on the output of the last two convolution blocks, carrying out scaling on the feature graphs to enable the sizes of the three feature graphs to be the same, connecting the three features according to the channel direction, and then obtaining a visual saliency basic feature F x by using one convolution layer, wherein x is the sequence number of an input video frame.
Further, in the step 3, a spatial encoder is constructed to extract spatial significance features contained in the video frame, and the specific implementation manner is as follows;
after the video frame X_n passes through the visual saliency base feature extraction module to obtain the feature F_n, the features of different abstraction levels in F_n are integrated through a convolution layer; to improve model performance, a convolutional Gaussian prior layer is added to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, which is added to the saliency features; each Gaussian map has the same size as the saliency features and is generated by the following equation:
f(x, y) = exp( -( (x - μ_x)^2 / (2σ_x^2) + (y - μ_y)^2 / (2σ_y^2) ) )
where f(x, y) is the value of the Gaussian map at position (x, y) and μ_x, μ_y, σ_x, σ_y are the means and variances in the horizontal and vertical directions, respectively; the parameters of all the Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The prior feature map is concatenated to the saliency features along the channel direction, and two convolution layers integrate the prior features and saliency features and output the spatial saliency features F_s.
Further, in step 3, a temporal encoder is constructed to extract the temporal saliency features contained in the video frames, implemented as follows:
the n consecutive video frames (X_1, X_2, ..., X_n) are processed by n parallel visual saliency base feature extraction modules, and the resulting features are concatenated to obtain temporal features of size n × 45 × 60 × 256; the temporal features are extracted and integrated through two convolution layers; a convolutional Gaussian prior layer is also added in the temporal encoder to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, and the parameters of all Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The obtained prior feature map is concatenated to the saliency features of each frame along the channel direction, and two three-dimensional convolution layers integrate the prior features and saliency features and output the temporal saliency features F_t.
Further, in step 3, an audio encoder is constructed to extract the audio semantic features contained in the logarithmic Mel spectrogram, implemented as follows: a logarithmic Mel spectrogram A_n that is time-synchronized with the video frame input to the spatial encoder is input to the audio encoder, and a ResNet model is used as the audio encoder to obtain the audio semantic features F_a from A_n.
Further, in step 3, the spatial and temporal features are concatenated along the channel dimension, and two convolution layers automatically fuse them to output the visual saliency features F_v.
Further, the audio-video fusion module automatically learns the weight parameters between channels based on a channel attention mechanism, implemented as follows:
first, the visual saliency features F_v^n are compressed into channel-level statistics by a global average pooling layer P; second, two nonlinear fully connected layers f_V and f_A adjust the dimensions of the pooled visual saliency features and of the audio semantic features F_a^n, and the two are added element-wise to obtain the audio-visual semantic features; the channel-level attention weights W_n are then computed by a multi-layer perceptron U with a sigmoid activation function, as given by the following equation:
W_n = U( f_V( P(F_v^n) ) + f_A( F_a^n ) )
finally, the obtained attention weights W_n are multiplied element-wise with F_v^n to obtain the saliency features based on audio and video features, F_av^n.
Further, in step 6, training is performed in two steps: first, the visual saliency base feature extraction module is trained on the SALICON dataset using a stochastic gradient descent algorithm; then the overall model is trained on the DIEM dataset, also using a stochastic gradient descent algorithm.
Further, the loss function used in training is:
L(S, M, F) = α_1·L_kl(S, M) + α_2·L_cc(S, M) + α_3·L_nss(S, F)
where S is the single-channel saliency prediction map output by the model, M is the saliency density map in the ground truth, F is the fixation-point map in the ground truth, α_1, α_2, α_3 are manually set weights, and L_kl, L_cc, L_nss are the KL divergence, linear correlation coefficient and normalized scanpath saliency, respectively, computed as follows:
L_kl(S, M) = Σ_i M_i · log( ε + M_i / (ε + S_i) )
where i denotes the i-th pixel and ε is a small constant;
L_cc(S, M) = cov(S, M) / ( σ(S) · σ(M) )
where cov(S, M) is the covariance between the model's saliency prediction and the ground-truth saliency density map, and σ(S), σ(M) are their standard deviations;
L_nss(S, F) = (1/N) · Σ_i ( (S_i - μ(S)) / σ(S) ) · F_i
where μ(S) and σ(S) are the mean and standard deviation of the saliency prediction map, and N is the number of eye fixation points.
On the other hand, the invention also provides a visual saliency prediction system based on the audio and video characteristics, which comprises the following modules:
the data preprocessing module is used for preprocessing data, and firstly, the video to be predicted and the time-synchronous audio thereof are processed into continuous video frames and logarithmic Mel spectrograms;
The basic feature extraction module is used for constructing a visual saliency basic feature extraction module and extracting features of the video frames;
The visual saliency feature and audio semantic feature extraction module is used for extracting spatial saliency features and temporal saliency features contained in the video frames based on the features of the video frames, fusing the spatial features and the temporal features to obtain visual saliency features, and extracting audio semantic features contained in the logarithmic Mel spectrogram;
the feature fusion module is used for constructing an audio and video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio and video saliency features;
the single-channel saliency map synthesis module is used for integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
The training module is used for training an integral model formed by the basic feature extraction module, the visual saliency feature and the audio semantic feature extraction module and the feature fusion module;
and the prediction module is used for realizing visual saliency prediction by using the trained integral model.
The invention has the following beneficial effects:
A new multi-stream network framework extracts the spatial, temporal and audio features in the video and audio frames more effectively. An audio-video feature fusion module, which derives channel weights for the visual saliency features by fusing audio and video semantic features, fuses the visual and auditory features more effectively.
Drawings
Fig. 1 is a general frame diagram of the present invention.
Fig. 2 is a frame diagram of a visual saliency basic feature extraction module according to the present invention.
Fig. 3 is a frame diagram of the audio/video feature fusion module according to the present invention.
Fig. 4 is a flow chart of the overall method.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
The overall flow chart of the invention is shown in fig. 4, and the specific implementation steps are as follows:
And 1, preprocessing data.
Step 1.1, firstly, processing the video to be predicted, separating out audio, converting the audio into a wav format, and cutting the video into video frames according to the original video frame rate.
Step 1.2, processing the video frames: each picture is read and converted to RGB format, its resolution is adjusted to 360 × 640 pixels, the picture is normalized, and the data type is converted to Tensor.
Step 1.3, converting the audio wav file into logarithmic Mel spectrograms time-synchronized with the video frames: the audio wav file is resampled to 16000 Hz; the resampled audio is divided into non-overlapping frames of 960 ms; each 960 ms frame is decomposed by a short-time Fourier transform with a 25 ms window and a 10 ms step; the resulting spectrogram is integrated into 64 Mel-spaced frequency bands; a small offset is added to the magnitude of each band before taking the logarithm, yielding a logarithmic Mel spectrogram of size 96 × 64 with 1 channel; the timestamp of each video frame is obtained and the time span covered by each logarithmic Mel spectrogram is matched to the video frame timestamps; finally, the data type is converted to Tensor.
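A minimal sketch of this preprocessing step, assuming torchaudio is available (the patent only states that PyTorch is used); the 960 ms segmentation, 25 ms window, 10 ms hop and 64 Mel bands follow the values above, while the log offset of 1e-6 and the exact framing (roughly 96 frames per segment) are illustrative assumptions:

```python
import torch
import torchaudio

def wav_to_logmel_segments(wav_path: str, sr_target: int = 16000) -> torch.Tensor:
    """Convert a wav file into non-overlapping 960 ms log-Mel segments (~96x64 each)."""
    wav, sr = torchaudio.load(wav_path)                  # (channels, samples)
    wav = wav.mean(dim=0)                                # mix down to mono
    if sr != sr_target:
        wav = torchaudio.functional.resample(wav, sr, sr_target)

    seg_len = int(0.96 * sr_target)                      # 960 ms, non-overlapping
    n_seg = wav.numel() // seg_len
    segments = wav[: n_seg * seg_len].reshape(n_seg, seg_len)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr_target,
        n_fft=int(0.025 * sr_target),                    # 25 ms STFT window
        hop_length=int(0.010 * sr_target),               # 10 ms hop
        n_mels=64,                                       # 64 Mel bands
        center=False,
    )
    logmel = torch.log(mel(segments) + 1e-6)             # small offset before log (assumed 1e-6)
    # (n_seg, 1, time≈96, 64): one single-channel log-Mel image per 960 ms segment
    return logmel.transpose(1, 2).unsqueeze(1)
```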
Step 2, extracting the spatial, temporal and audio features contained in the video frames and the logarithmic Mel spectrogram. The invention constructs a basic framework following a multi-stream network structure, comprising a spatial encoder, a temporal encoder, an audio encoder, an audio-video feature fusion module and a decoder. The model takes n time-synchronized consecutive video frames and audio frames as input: a single video frame X_n is input to the spatial encoder, the n consecutive video frames (X_1, X_2, ..., X_n) are input to the temporal encoder, and the logarithmic Mel spectrogram A_n time-synchronized with the video frame input to the spatial encoder is input to the audio encoder; experiments show that n = 7 gives the best result. The three encoders extract the saliency-related spatial features F_s, temporal features F_t and audio features F_a. For ease of understanding, convolution layer parameters are written below in the order number of channels_kernel size × kernel size_stride, and pooling layer parameters in the order kernel size × kernel size_stride. The feature extraction steps are as follows:
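Before the individual components are described, the overall wiring of this framework can be sketched as follows. This is a structural outline only: the submodules stand for the components detailed in steps 2.1 to 4 below, and the ReLU activations are assumptions (the text specifies layer shapes but not activation functions).

```python
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    """Structural sketch of the multi-stream framework."""

    def __init__(self, spatial_enc, temporal_enc, audio_enc, av_fusion, decoder):
        super().__init__()
        self.spatial_enc = spatial_enc    # single frame X_n        -> F_s (B, 256, 45, 80)
        self.temporal_enc = temporal_enc  # n=7 frames (X_1..X_n)   -> F_t (B, 256, 45, 80)
        self.audio_enc = audio_enc        # log-Mel spectrogram A_n -> F_a (B, 309)
        self.st_fuse = nn.Sequential(     # step 2.5: concat + (1024_3x3_1), (512_3x3_1)
            nn.Conv2d(256 + 256, 1024, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.av_fusion = av_fusion        # step 3: channel-attention audio-visual fusion
        self.decoder = decoder            # step 4: -> single-channel saliency map S_n

    def forward(self, frame, clip, logmel):
        f_s = self.spatial_enc(frame)
        f_t = self.temporal_enc(clip)
        f_a = self.audio_enc(logmel)
        f_v = self.st_fuse(torch.cat([f_s, f_t], dim=1))   # visual saliency feature
        f_av = self.av_fusion(f_v, f_a)                    # channel re-weighted features
        return self.decoder(f_av)                          # (B, 1, 45, 80)
```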
Step 2.1, constructing the visual saliency base feature extraction module. The invention mimics the way the human visual system processes information, namely that the extraction and integration of the most basic visual features is completed on the path from the retina to the primary visual cortex; accordingly, a weight-sharing visual saliency base feature module is adopted in both the visual spatial encoder and the visual temporal encoder. The module is an improvement on the MobileNet-V2 model, as follows:
First, the final spatial pooling layer in the MobileNet-V2 model is replaced by an atrous spatial pyramid pooling layer consisting of K=4 parallel dilated convolution layers. A single 1 × 1 convolution layer with dilation rate 1 (rate=1) and 256 output channels retains the original information; three 3 × 3 convolution layers with different dilation rates (rate = {6, 12, 18}) and 256 output channels each extract features at different scales. The features obtained from the four dilated convolution layers are concatenated and used as the output of the last convolution block of the MobileNet-V2 model.
Second, the outputs of the last three convolution blocks of MobileNet-V2 are used. The output features of the three blocks are channel-adjusted by 1 × 1 convolution layers with 64, 128 and 256 output channels, respectively. Because the spatial resolutions of the three block outputs differ, the outputs of the last two blocks are upsampled by bilinear interpolation so that the three feature maps all have the same size of 45 × 80 pixels. The three feature maps are then concatenated along the channel direction and passed through a (256_3×3_1) convolution layer to obtain the visual saliency base feature F_x (x is the index of the input video frame), of size 45 × 80 × 256.
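A sketch of this module under the configuration stated above (four parallel branches with rates {1, 6, 12, 18}, channel reductions to 64/128/256, bilinear upsampling and a final 3 × 3 convolution). The backbone is assumed to be wrapped so that it returns the outputs of its last three convolution blocks; the default channel counts (32, 96, 320) are assumptions that depend on the MobileNet-V2 variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: one 1x1 branch plus three dilated 3x3 branches."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)   # 4 * out_ch channels

class VisualSaliencyBaseFeature(nn.Module):
    """Multi-level base features F_x (45x80x256), assuming a backbone wrapper that
    returns the outputs of its last three convolution blocks (c3, c4, c5)."""
    def __init__(self, backbone, ch=(32, 96, 320)):
        super().__init__()
        self.backbone = backbone                          # frame -> (c3, c4, c5)
        self.aspp = ASPP(ch[2])                           # applied to the deepest block
        self.reduce = nn.ModuleList([
            nn.Conv2d(ch[0], 64, 1),
            nn.Conv2d(ch[1], 128, 1),
            nn.Conv2d(4 * 256, 256, 1),
        ])
        self.out = nn.Conv2d(64 + 128 + 256, 256, 3, padding=1)

    def forward(self, frame):
        c3, c4, c5 = self.backbone(frame)
        feats = [self.reduce[0](c3),
                 self.reduce[1](c4),
                 self.reduce[2](self.aspp(c5))]
        size = feats[0].shape[-2:]                        # e.g. 45 x 80
        feats = [feats[0]] + [F.interpolate(f, size=size, mode='bilinear',
                                            align_corners=False) for f in feats[1:]]
        return self.out(torch.cat(feats, dim=1))          # F_x: (B, 256, 45, 80)
```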
Step 2.2, constructing the spatial encoder. After the video frame X_n passes through the visual saliency base feature extraction module to obtain the feature F_n, the features of different abstraction levels in F_n are integrated through a convolution layer with a 3 × 3 kernel and 256 output channels. To improve model performance, a convolutional Gaussian prior layer is added to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers with 3 × 3 kernels and 64 output channels to obtain a prior feature map; each Gaussian map has the same size as the saliency features (45 × 80 pixels) and is generated by the following equation:
f(x, y) = exp( -( (x - μ_x)^2 / (2σ_x^2) + (y - μ_y)^2 / (2σ_y^2) ) )
where f(x, y) is the value of the Gaussian map at position (x, y) and μ_x, μ_y, σ_x, σ_y are the means and variances in the horizontal and vertical directions, respectively. The parameters of all the Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The prior feature map is concatenated to the saliency features along the channel direction, and two convolution layers with 3 × 3 kernels and 256 output channels integrate the prior features and saliency features and output the spatial saliency features F_s.
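A sketch of the spatial stream including the convolutional Gaussian prior layer. The Gaussian means are fixed at the image centre and the eight variances are spread over an assumed range, since the patent states only that these parameters are set empirically; the ReLU activations are likewise assumptions.

```python
import torch
import torch.nn as nn

class GaussianPrior(nn.Module):
    """Learned combination of K=8 fixed center-bias Gaussian maps (the convolutional
    Gaussian prior layer); the per-map variances below are illustrative assumptions."""
    def __init__(self, height=45, width=80, k=8, out_ch=64):
        super().__init__()
        ys = torch.linspace(0, 1, height).view(-1, 1).expand(height, width)
        xs = torch.linspace(0, 1, width).view(1, -1).expand(height, width)
        sigmas = torch.linspace(0.1, 0.8, k)               # assumed spread of variances
        maps = [torch.exp(-(((xs - 0.5) ** 2) / (2 * s ** 2)
                            + ((ys - 0.5) ** 2) / (2 * s ** 2))) for s in sigmas]
        self.register_buffer('gauss', torch.stack(maps).unsqueeze(0))  # (1, K, H, W)
        self.combine = nn.Sequential(                      # two conv layers, 64 channels
            nn.Conv2d(k, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, batch_size):
        return self.combine(self.gauss.expand(batch_size, -1, -1, -1))

class SpatialEncoder(nn.Module):
    """Spatial stream: base features -> integration conv -> concat Gaussian prior -> F_s."""
    def __init__(self, base_feature):
        super().__init__()
        self.base = base_feature                           # VisualSaliencyBaseFeature
        self.integrate = nn.Conv2d(256, 256, 3, padding=1)
        self.prior = GaussianPrior()
        self.head = nn.Sequential(
            nn.Conv2d(256 + 64, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, frame):
        f = self.integrate(self.base(frame))
        p = self.prior(f.size(0))
        return self.head(torch.cat([f, p], dim=1))         # F_s: (B, 256, 45, 80)
```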
Step 2.3, constructing the temporal encoder. The 7 consecutive video frames (X_1, X_2, ..., X_n) are passed through 7 parallel visual saliency base feature extraction modules, and the resulting features are concatenated to obtain features of size 7 × 45 × 60 × 256. Two consecutive three-dimensional convolution layers with 3 × 3 kernels and 256 output channels extract and integrate the temporal features. As in the spatial encoder, a convolutional Gaussian prior layer simulates the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two (64_3×3_1) convolution layers to obtain prior feature maps, and the parameters of all Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The obtained prior feature maps are concatenated to the saliency features of each frame along the channel direction, and two three-dimensional convolution layers with 3 × 3 kernels and 256 channels integrate the prior features and saliency features and output the temporal saliency features F_t.
Step 2.4, constructing the audio encoder. The logarithmic Mel spectrogram A_n time-synchronized with the video frame input to the spatial encoder is input to the audio encoder; a ResNet model is used as the audio encoder to obtain the audio semantic features F_a from A_n, of size 1 × 309.
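A sketch of the temporal stream. The base feature module is shared (weight-tied) with the spatial stream. The 3 × 3 × 3 kernels, the omission of the Gaussian prior branch, and the mean over the time axis used to collapse the clip back to a single 45 × 80 map are simplifying assumptions, since the text specifies only the two three-dimensional convolution layers. The audio stream can be any ResNet-style network applied to the single-channel log-Mel image whose penultimate activations serve as F_a; the 1 × 309 size reported above depends on the chosen ResNet variant.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Temporal stream: per-frame base features stacked along time and integrated by
    two 3D convolutions (Gaussian prior branch omitted here for brevity)."""
    def __init__(self, base_feature, n_frames=7):
        super().__init__()
        self.base = base_feature                  # shared with the spatial stream
        self.n_frames = n_frames
        self.conv3d = nn.Sequential(
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.base(clip.flatten(0, 1))         # (B*T, 256, 45, 80)
        f = f.view(b, t, *f.shape[1:]).permute(0, 2, 1, 3, 4)  # (B, 256, T, 45, 80)
        f = self.conv3d(f)
        return f.mean(dim=2)                      # collapse time (assumption) -> F_t
```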
Step 2.5, fusing the spatial and temporal features to obtain the visual saliency features. The spatial and temporal features are concatenated along the channel dimension and automatically fused by two convolution layers with parameters (1024_3×3_1) and (512_3×3_1), which output the visual saliency features F_v.
Step 3, constructing the audio-video feature fusion module. Based on a channel attention mechanism, the audio-video feature fusion module is designed to automatically learn the weight parameters between channels at the channel level and to use the semantic information of the audio-visual features to selectively enhance or suppress the expression of visual features that are related or unrelated to the audio information. Compared with simply adding or concatenating the audio and visual features, the fusion method designed by the invention is gentler: it neither loses visual features nor brings in a large amount of noise from the audio features; in other words, it is an audio-guided visual attention fusion method. First, the visual saliency features F_v^n are compressed into channel-level statistics of size 1 × 512 by a global average pooling layer P. Second, two nonlinear fully connected layers f_V and f_A adjust the dimensions of the pooled visual saliency features and of the audio semantic features F_a^n to 1 × 512, and the two are added element-wise to obtain the audio-visual semantic features. The channel-level attention weights W_n, of size 1 × 512, are then computed by a multi-layer perceptron U with a sigmoid activation function. The above procedure is given by the following equation:
W_n = U( f_V( P(F_v^n) ) + f_A( F_a^n ) )
Finally, the obtained attention weights W_n are multiplied element-wise with F_v^n to obtain the saliency features based on audio and video features, F_av^n, of size 45 × 80 × 512.
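A sketch of this channel-attention fusion. The 512-channel visual feature and 309-dimensional audio feature follow the sizes stated above; the hidden width and depth of the perceptron U are assumptions, with its sigmoid output providing the per-channel weights W_n.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Channel-attention fusion: audio-visual semantics produce per-channel weights W_n
    that re-weight the visual saliency feature (F_av = W_n * F_v, channel-wise)."""
    def __init__(self, vis_ch=512, aud_dim=309, hidden=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # global average pooling P
        self.f_v = nn.Sequential(nn.Linear(vis_ch, hidden), nn.ReLU(inplace=True))
        self.f_a = nn.Sequential(nn.Linear(aud_dim, hidden), nn.ReLU(inplace=True))
        self.mlp = nn.Sequential(                         # multi-layer perceptron U
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, vis_ch), nn.Sigmoid(),
        )

    def forward(self, f_v, f_a):
        # f_v: (B, vis_ch, H, W) visual saliency feature; f_a: (B, aud_dim) audio semantics
        v = self.pool(f_v).flatten(1)                     # (B, vis_ch) channel statistics
        w = self.mlp(self.f_v(v) + self.f_a(f_a))         # (B, vis_ch) attention weights W_n
        return f_v * w.unsqueeze(-1).unsqueeze(-1)        # channel-wise re-weighting
```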
Step 4, constructing the decoder. After the audio-visual saliency features F_av^n are obtained, they need to be integrated into a single-channel saliency map. Because the audio-video feature fusion module has already adjusted the attention weights between the visual feature channels, a simple decoder suffices: two convolution layers with 3 × 3 kernels and 128 and 1 output channels, respectively, integrate the audio-visual saliency features and generate the single-channel saliency map S_n.
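The decoder then reduces to two convolutions, as sketched below; the final sigmoid that squashes the map to [0, 1] and any upsampling back to the 360 × 640 input resolution are assumptions not specified in this paragraph.

```python
import torch.nn as nn

# Minimal decoder sketch: two 3x3 convolutions (128 then 1 output channels) turn the
# 45x80x512 audio-visual feature into a single-channel saliency map.
decoder = nn.Sequential(
    nn.Conv2d(512, 128, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 1, 3, padding=1),
    nn.Sigmoid(),          # assumed: squash the map to [0, 1]
)
```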
Model training details:
The proposed model is implemented on an NVIDIA 1080 GPU using PyTorch. The MobileNet-V2 in the visual saliency base feature extraction module is initialized with its public weights, and the ResNet in the audio encoder is initialized with public weights trained on a large audio-video dataset.
First, the visual saliency base feature extraction module is trained on SALICON. The SALICON dataset is one of the largest image saliency prediction datasets at present and contains 20000 images. Training the visual saliency base feature extraction module on this dataset reduces the need for visual saliency data and improves the performance of the model. During this stage the module is trained with a stochastic gradient descent algorithm, with an initial learning rate of 10^-3, momentum of 0.9, and a weight decay coefficient of 0.0005. The batch size is set to 32, i.e. 32 images are processed per training iteration.
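These first-stage settings map directly onto PyTorch's SGD optimizer, for example (base_module stands for the visual saliency base feature extraction module):

```python
import torch

def make_stage1_optimizer(base_module: torch.nn.Module) -> torch.optim.SGD:
    """SGD settings for the first training stage (SALICON pre-training), per the text."""
    return torch.optim.SGD(
        base_module.parameters(),
        lr=1e-3,            # initial learning rate 10^-3
        momentum=0.9,
        weight_decay=5e-4,  # weight decay 0.0005
    )
```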
In the second step, the whole model is trained on DIEM. The DIEM dataset is one of the datasets commonly used for video saliency prediction; it contains 85 video segments of varying duration (27 to 217 seconds) that are rich in variety and provide the corresponding audio. The proposed model is trained with a stochastic gradient descent algorithm, with an initial learning rate of 10^-3, momentum of 0.9, and weight decay of 0.0005. The batch size is set to 4, i.e. 4 × 7 = 28 frames are processed per training iteration. The same loss function is used in both training steps:
L(S, M, F) = α_1·L_kl(S, M) + α_2·L_cc(S, M) + α_3·L_nss(S, F)
where S is the single-channel saliency prediction map output by the model, M is the saliency density map in the ground truth, and F is the fixation-point map in the ground truth. L_kl, L_cc and L_nss are the KL divergence, linear correlation coefficient and normalized scanpath saliency, respectively, computed as follows:
The KL divergence (Kullback-Leibler divergence, KL) is an asymmetric metric; a smaller value indicates that the saliency prediction is closer to the ground-truth saliency density map:
L_kl(S, M) = Σ_i M_i · log( ε + M_i / (ε + S_i) )
where i denotes the i-th pixel and ε is a small constant.
The linear correlation coefficient (CC) is commonly used to measure the correlation between two variables; a larger value means the saliency prediction is closer to the ground truth:
L_cc(S, M) = cov(S, M) / ( σ(S) · σ(M) )
where cov(S, M) is the covariance between the model's saliency prediction and the ground-truth saliency density map, and σ(S), σ(M) are their standard deviations.
Normalized scanpath saliency (NSS) is designed specifically for evaluating saliency prediction results; it can be regarded as the normalized saliency measured at the fixation positions, computed as the average of the normalized saliency map at the human eye fixation points:
L_nss(S, F) = (1/N) · Σ_i ( (S_i - μ(S)) / σ(S) ) · F_i
where μ(S) and σ(S) are the mean and standard deviation of the saliency prediction map, i denotes the i-th pixel, and N is the number of eye fixation points. A larger NSS value indicates better saliency prediction performance.
α_1, α_2 and α_3 are manually set weights; in this example they are 1, -0.2 and -0.1, respectively.
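A sketch of this combined loss in PyTorch, assuming per-sample maps of equal spatial size. The ε terms and the exact normalisation of the KL term follow common saliency-benchmark definitions rather than the patent's unreproduced formula images, so treat those details as assumptions; the default weights match the values 1, -0.2, -0.1 given above.

```python
import torch

def saliency_loss(S, M, F, a1=1.0, a2=-0.2, a3=-0.1, eps=1e-7):
    """L = a1*KL(S,M) + a2*CC(S,M) + a3*NSS(S,F).
    S: predicted saliency map, M: ground-truth density map, F: binary fixation map."""
    S, M, F = S.flatten(1), M.flatten(1), F.flatten(1)

    # KL divergence between the normalised density maps
    Sn = S / (S.sum(dim=1, keepdim=True) + eps)
    Mn = M / (M.sum(dim=1, keepdim=True) + eps)
    kl = (Mn * torch.log(eps + Mn / (Sn + eps))).sum(dim=1)

    # Linear correlation coefficient (Pearson)
    Sc, Mc = S - S.mean(dim=1, keepdim=True), M - M.mean(dim=1, keepdim=True)
    cc = (Sc * Mc).sum(dim=1) / (Sc.norm(dim=1) * Mc.norm(dim=1) + eps)

    # Normalised scanpath saliency at the fixated locations
    Sz = (S - S.mean(dim=1, keepdim=True)) / (S.std(dim=1, keepdim=True) + eps)
    nss = (Sz * F).sum(dim=1) / (F.sum(dim=1) + eps)

    return (a1 * kl + a2 * cc + a3 * nss).mean()
```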
On the other hand, the invention also provides a visual saliency prediction system based on the audio and video characteristics, which comprises the following modules:
the data preprocessing module is used for preprocessing data, and firstly, the video to be predicted and the time-synchronous audio thereof are processed into continuous video frames and logarithmic Mel spectrograms;
The basic feature extraction module is used for constructing a visual saliency basic feature extraction module and extracting features of the video frames;
The visual saliency feature and audio semantic feature extraction module is used for extracting spatial saliency features and temporal saliency features contained in the video frames based on the features of the video frames, fusing the spatial features and the temporal features to obtain visual saliency features, and extracting audio semantic features contained in the logarithmic Mel spectrogram;
the feature fusion module is used for constructing an audio and video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio and video saliency features;
the single-channel saliency map synthesis module is used for integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
The training module is used for training an integral model formed by the basic feature extraction module, the visual saliency feature and the audio semantic feature extraction module and the feature fusion module;
and the prediction module is used for realizing visual saliency prediction by using the trained integral model.
The specific implementation of each module corresponds to the steps described above and is not repeated here.
The implementations described herein are merely illustrative of the principles of the present invention, and various modifications or additions may be made to the described implementations by those skilled in the art without departing from the spirit or scope of the invention as defined in the accompanying claims.

Claims (8)

1. A video saliency prediction method based on audio and video features, characterized by comprising the following steps:
step 1, data preprocessing, namely firstly processing a video to be predicted and time-synchronized audio thereof into a continuous video frame and a logarithmic Mel spectrogram;
step 2, firstly, constructing a visual saliency basic feature extraction module for extracting features of video frames;
step 3, extracting space significance features and time significance features contained in the video frames based on the features of the video frames, fusing the space features and the time features to obtain visual significance features, and extracting audio semantic features contained in the logarithmic mel-frequency spectrogram;
in step 3, the spatial and temporal features are concatenated along the channel dimension, and two convolution layers automatically fuse them to output the visual saliency features F_v;
in step 3, an audio encoder is constructed to extract the audio semantic features contained in the logarithmic Mel spectrogram, implemented as follows:
the logarithmic Mel spectrogram A_n time-synchronized with the video frame input to the spatial encoder is input to the audio encoder, and a ResNet model is used as the audio encoder to obtain the audio semantic features F_a from A_n;
Step 4, constructing an audio-video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio-video saliency features;
The audio and video fusion module automatically learns weight parameters among channels based on a channel attention mechanism, and the specific implementation mode is as follows;
first, the visual saliency features F_v^n are compressed into channel-level statistics by a global average pooling layer P; second, two nonlinear fully connected layers f_V and f_A adjust the dimensions of the pooled visual saliency features and of the audio semantic features F_a^n, and the two are added element-wise; the channel-level attention weights W_n are then computed by a multi-layer perceptron U with a sigmoid activation function, as given by the following equation:
W_n = U( f_V( P(F_v^n) ) + f_A( F_a^n ) )
finally, the obtained attention weights W_n are multiplied element-wise with F_v^n to obtain the saliency features based on audio and video features, F_av^n;
Step 5, integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
step 6, training the integral model formed by the steps 2-5;
And 7, utilizing the trained integral model to realize visual saliency prediction.
2. The video saliency prediction method based on audio and video features of claim 1, wherein the specific implementation manner of step 1 is as follows;
step 1.1, firstly, processing a video to be predicted, separating out audio, converting the audio into a wav format, and cutting the video into video frames according to the original video frame rate;
Step 1.2, processing the video frame comprises the steps of reading a picture and converting the picture into an RGB format, adjusting the resolution of the picture into W multiplied by H, carrying out normalization operation on the picture, and converting the data type into a Tensor type;
Step 1.3, converting an audio wav file into a logarithmic Mel spectrogram which is time-synchronous with video frames, wherein the method comprises the steps of resampling the audio wav file into 16000Hz, dividing the resampled audio file into frames with non-overlapping time length of 960ms, decomposing each frame with 960ms through a window with time length of 25ms and step length of 10ms by short-time Fourier transform, integrating the generated spectrogram into 64 frequency bands with Mel intervals, adding a small offset to the amplitude of each frequency band, carrying out logarithmic conversion to obtain a logarithmic Mel spectrogram with the size of 96 multiplied by 64 and the channel number of 1, acquiring each frame timestamp of the video frame, corresponding the time length covered by the logarithmic Mel spectrogram with the video frame timestamp, and converting the data type into Tensor type.
3. The video saliency prediction method based on audio and video features of claim 1, wherein the visual saliency basic feature extraction module is improved on the basis of MobileNet-V2 model, and the specific improvement is as follows:
Firstly, changing a final space pooling layer in a MobileNet-V2 model into a pyramid pooling layer with holes, wherein the pyramid pooling layer with holes consists of K=4 parallel convolution layers with holes, using a single convolution layer with the aperture size of 1 to retain original information, extracting different scale features by using three convolution layers with different aperture sizes, and connecting the features obtained by the four convolution layers to be used as the output of the last convolution block of the MobileNet-V2 model;
And secondly, connecting MobileNet-V2 to output of the last three convolution blocks, respectively carrying out channel dimension adjustment on the output characteristics of the three convolution blocks by using three convolution layers, wherein the output spatial resolutions obtained by the three convolution blocks are different, carrying out bilinear interpolation up-sampling on the output of the last two convolution blocks, carrying out scaling on the feature graphs to enable the sizes of the three feature graphs to be the same, connecting the three features according to the channel direction, and then obtaining a visual saliency basic feature F x by using one convolution layer, wherein x is the sequence number of an input video frame.
4. The video saliency prediction method based on audio and video features of claim 1, wherein the spatial saliency feature contained in the video frame is extracted by a spatial encoder constructed in step 3, and the implementation mode is as follows;
after the video frame X_n passes through the visual saliency base feature extraction module to obtain the feature F_n, the features of different abstraction levels in F_n are integrated through a convolution layer; to improve model performance, a convolutional Gaussian prior layer is added to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, which is added to the saliency features; each Gaussian map has the same size as the saliency features and is generated by the following equation:
f(x, y) = exp( -( (x - μ_x)^2 / (2σ_x^2) + (y - μ_y)^2 / (2σ_y^2) ) )
where f(x, y) is the value of the Gaussian map at position (x, y) and μ_x, μ_y, σ_x, σ_y are the means and variances in the horizontal and vertical directions, respectively; the parameters of all the Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}; the prior feature map is concatenated to the saliency features along the channel direction, and two convolution layers integrate the prior features and saliency features and output the spatial saliency features F_s.
5. The video saliency prediction method based on audio and video features of claim 1, wherein the time encoder is constructed in step 3 to extract the time saliency features contained in the video frames, and the implementation is as follows;
the n consecutive video frames (X_1, X_2, ..., X_n) are processed by n parallel visual saliency base feature extraction modules, and the resulting features are concatenated to obtain temporal features of size n × 45 × 60 × 256; the temporal features are extracted and integrated through two convolution layers; a convolutional Gaussian prior layer is added in the temporal encoder to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, and the parameters of all Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}; the obtained prior feature map is concatenated to the saliency features of each frame along the channel direction, and two three-dimensional convolution layers integrate the prior features and saliency features and output the temporal saliency features F_t.
6. The video saliency prediction method based on audio and video features of claim 1, wherein in step 6 training is performed in two steps: first, the visual saliency base feature extraction module is trained on SALICON using a stochastic gradient descent algorithm; then the overall model is trained on DIEM, also using a stochastic gradient descent algorithm.
7. The video saliency prediction method based on audio and video features of claim 1, wherein the loss function used in training is:
L(S, M, F) = α_1·L_kl(S, M) + α_2·L_cc(S, M) + α_3·L_nss(S, F)
where S is the single-channel saliency prediction map output by the model, M is the saliency density map in the ground truth, F is the fixation-point map in the ground truth, α_1, α_2, α_3 are manually set weights, and L_kl, L_cc, L_nss are the KL divergence, linear correlation coefficient and normalized scanpath saliency, respectively, computed as follows:
L_kl(S, M) = Σ_i M_i · log( ε + M_i / (ε + S_i) )
where i denotes the i-th pixel and ε is a small constant;
L_cc(S, M) = cov(S, M) / ( σ(S) · σ(M) )
where cov(S, M) is the covariance between the model's saliency prediction and the ground-truth saliency density map, and σ(S), σ(M) are their standard deviations;
L_nss(S, F) = (1/N) · Σ_i ( (S_i - μ(S)) / σ(S) ) · F_i
where μ(S) and σ(S) are the mean and standard deviation of the saliency prediction map, and N is the number of eye fixation points.
8. A video saliency prediction system based on audio and video features, for implementing a video saliency prediction method based on audio and video features as claimed in any one of claims 1 to 7, comprising the following modules:
the data preprocessing module is used for preprocessing data, and firstly, the video to be predicted and the time-synchronous audio thereof are processed into continuous video frames and logarithmic Mel spectrograms;
The basic feature extraction module is used for constructing a visual saliency basic feature extraction module and extracting features of the video frames;
The visual saliency feature and audio semantic feature extraction module is used for extracting spatial saliency features and temporal saliency features contained in the video frames based on the features of the video frames, fusing the spatial features and the temporal features to obtain visual saliency features, and extracting audio semantic features contained in the logarithmic Mel spectrogram;
the feature fusion module is used for constructing an audio and video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio and video saliency features;
the single-channel saliency map synthesis module is used for integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
The training module is used for training an integral model formed by the basic feature extraction module, the visual saliency feature and the audio semantic feature extraction module and the feature fusion module;
and the prediction module is used for realizing visual saliency prediction by using the trained integral model.
CN202310247030.1A 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features Active CN116403135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310247030.1A CN116403135B (en) 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310247030.1A CN116403135B (en) 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features

Publications (2)

Publication Number Publication Date
CN116403135A CN116403135A (en) 2023-07-07
CN116403135B true CN116403135B (en) 2025-10-17

Family

ID=87009440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310247030.1A Active CN116403135B (en) 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features

Country Status (1)

Country Link
CN (1) CN116403135B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955693A (en) * 2023-08-23 2023-10-27 中国石油大学(华东) Audio-visual significance detection method based on audio-visual consistency sensing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200803236A (en) * 2006-02-14 2008-01-01 Sibeam HD physical layer of a wireless communication device
CN102368819A (en) * 2011-10-24 2012-03-07 南京大学 System for collection, transmission, monitoring and publishment of mobile video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103210651B (en) * 2010-11-15 2016-11-09 华为技术有限公司 Method and system for video summary


Also Published As

Publication number Publication date
CN116403135A (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN112954312B (en) Non-reference video quality assessment method integrating space-time characteristics
CN113516990B (en) A speech enhancement method, a method for training a neural network, and related equipment
Min et al. Study of subjective and objective quality assessment of audio-visual signals
US10701303B2 (en) Generating spatial audio using a predictive model
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN118658128A (en) AI multi-dimensional teaching behavior analysis method and system based on classroom video
Moss et al. On the optimal presentation duration for subjective video quality assessment
CN113473117B (en) Non-reference audio and video quality evaluation method based on gated recurrent neural network
Wu et al. Cross-modal perceptionist: Can face geometry be gleaned from voices?
CN113489971B (en) Full-reference audio and video objective quality evaluation method, system and terminal
CN116580720A (en) Speaker vision activation interpretation method and system based on audio-visual voice separation
CN116403135B (en) Video significance prediction method and system based on audio and video features
Lee et al. Seeing through the conversation: Audio-visual speech separation based on diffusion model
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
WO2023020500A1 (en) Speech separation method and apparatus, and storage medium
CN120470245B (en) Multimodal children's voice data processing method based on deep learning and federated learning
Sanaguano-Moreno et al. Real-time impulse response: a methodology based on machine learning approaches for a rapid impulse response generation for real-time acoustic virtual reality systems
CN115883869A (en) Processing method, device and processing equipment of video frame interpolation model based on Swin Transformer
Kim et al. Modern trends on quality of experience assessment and future work
CN119252275B (en) Mouth shape generating method and device for voice driving
CN118138833B (en) Digital person construction method and device and computer equipment
US20240349001A1 (en) Method and system for determining individualized head related transfer functions
CN118038856A (en) Digital human video generation method, device, terminal equipment and storage medium
CN112437290A (en) Stereoscopic video quality evaluation method based on binocular fusion network and two-step training frame
Maniyar et al. Persons facial image synthesis from audio with Generative Adversarial Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant