
CN116403135B - Video significance prediction method and system based on audio and video features - Google Patents

Video significance prediction method and system based on audio and video features

Info

Publication number
CN116403135B
CN116403135B (application CN202310247030.1A)
Authority
CN
China
Prior art keywords
features
video
audio
saliency
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310247030.1A
Other languages
Chinese (zh)
Other versions
CN116403135A (en)
Inventor
陈震中
陈钊
张考
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310247030.1A priority Critical patent/CN116403135B/en
Publication of CN116403135A publication Critical patent/CN116403135A/en
Application granted granted Critical
Publication of CN116403135B publication Critical patent/CN116403135B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
      • G06: COMPUTING OR CALCULATING; COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/045: Combinations of networks
                  • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
                • G06N3/0464: Convolutional networks [CNN, ConvNet]
              • G06N3/08: Learning methods
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V10/00: Arrangements for image or video recognition or understanding
            • G06V10/40: Extraction of image or video features
              • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
                • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
                • G06V10/467: Encoded features or binary features, e.g. local binary patterns [LBP]
            • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V10/806: Fusion of extracted features
              • G06V10/82: Arrangements using neural networks
          • G06V20/00: Scenes; Scene-specific elements
            • G06V20/40: Scenes; Scene-specific elements in video content
              • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03: characterised by the type of extracted parameters
              • G10L25/24: the extracted parameters being the cepstrum
            • G10L25/48: specially adapted for particular use
              • G10L25/51: for comparison or discrimination
                • G10L25/57: for processing of video signals
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract


The present invention addresses the field of video saliency prediction and discloses a method and system for video saliency prediction based on audio and video features. First, through data preprocessing, the video to be predicted is processed into video frames and logarithmic Mel spectrograms. Three parallel encoders extract the spatial and temporal features of the video frames, as well as the semantic features of the audio time-synchronized with the video frames. Next, an audio-video feature fusion module based on a channel attention mechanism utilizes the semantic information of the visual and audio features to adaptively learn the inter-channel weights of the visual features, thereby achieving audio-visual feature fusion. Finally, the fused audio-visual features are passed through a decoder to obtain a saliency prediction map for the video frame.

Description

Video significance prediction method and system based on audio and video features
Technical Field
The invention relates to the field of video saliency prediction, in particular to a video saliency prediction method and system based on audio and video characteristics.
Background
When observing a scene, the human visual system can quickly and selectively pick out the most attractive parts from the large amount of received information, rather than processing the information in every region. Even in complex environments, humans can rapidly focus on the important parts of a scene. This ability is called the visual attention mechanism, and the regions of a scene that attract the attention of the human visual system are called salient regions. Understanding and modeling this mechanism in order to predict which regions of a video are more attractive to humans is the subject of video saliency research. With the continuous development and widespread use of capture devices, video data can be acquired and used very conveniently. Compared with images, video has stronger expressive power and richer information, so research on video saliency prediction is of great significance for video analysis and processing. Applying the human visual attention mechanism to video processing can help a computer process the information in a video selectively: it allows computers to allocate limited computing resources preferentially to the more important regions, which greatly improves the efficiency of video processing methods. Research on video saliency prediction can be applied to video processing fields such as video quality assessment, video compression, video surveillance, target recognition and target segmentation.
Video saliency prediction has seen significant improvements in accuracy over the past decades, moving from traditional algorithms based on hand-crafted features to models based on deep learning. Existing methods can be divided into signal-compression-based methods and image-pixel-based methods. Signal-compression-based methods regard saliency as a measure of signal compressibility and extract features from the video bitstream: they first encode and compress the video, then measure the reconstruction error of the decoded data, and regions that produce larger reconstruction errors are considered salient. Image-pixel-based methods, starting from feature integration theory, predict salient regions using spatio-temporal information extracted from video frames. However, mainstream video saliency prediction methods use only the visual information in a scene and ignore the synchronously played audio, which differs from how people actually watch videos. While vision is the primary means by which humans perceive the external environment, hearing provides a large amount of important information at the same time. Humans receive related information about the same object or event through both hearing and vision, for example hearing an engine and seeing a moving vehicle, or hearing speech and seeing a person's lips move. This synchronized information is combined in the human brain and jointly shapes the distribution of visual attention over the scene. Cognitive neuroscience studies indicate that combined audio-visual stimuli enhance the response of the human sensory system, and that integrating audio-visual information allows the regions of a scene that attract visual attention to be located more quickly. Eye-tracking experiments on the same video under different audio conditions demonstrate even more intuitively that audio information influences the regions of interest of the human visual system. A video saliency prediction method based on audio and video features therefore produces predictions that better match how humans actually watch videos in real scenes. Compared with mature visual-only video saliency prediction methods, research on saliency prediction based on audio and video features is still at an early stage: audio information is rarely exploited in video saliency tasks, and most existing attempts simply add or concatenate audio features and visual features. How to reasonably fuse audio features and video features when using audio-visual information for saliency prediction therefore remains a challenge.
Disclosure of Invention
When humans watch a scene, the visual and auditory systems act together to determine the regions that attract attention. Most current video saliency prediction methods process only visual information such as video frames; they can predict the regions of interest of the visual system reasonably well, but they hardly account for the influence of audio information on human attention. Meanwhile, the existing video saliency prediction methods that do consider auditory information realize feature fusion simply by adding or concatenating the audio and video features, which cannot fuse them effectively, so the predicted salient regions are inaccurate. In view of this situation, the invention provides a saliency prediction method based on audio and video features.
Inspired by two-stream network frameworks, a multi-stream network framework is designed that separately extracts the visual spatial saliency features within video frames, the visual temporal saliency features between video frames, and the semantic saliency features of the audio that is time-synchronized with the video frames. The invention provides an audio-video semantic fusion module at the channel level, which weights the visual saliency result using the semantic information of the audio and video features in the scene, so that the visually salient regions are predicted from the audio and video features jointly. The proposed method adopts a multi-stream architecture to improve the accuracy of saliency prediction. To effectively extract the visual temporal features, visual spatial features and audio semantic features related to saliency, a new multi-stream architecture is proposed: a spatial encoder extracts visual spatial features from a video frame, a temporal encoder extracts inter-frame visual temporal features from several consecutive video frames, and an audio encoder extracts audio semantic features from a logarithmic Mel spectrogram time-synchronized with the video frames. To improve the accuracy of visual saliency prediction, a visual saliency base feature extraction module is provided, which is an improvement on the MobileNet-V2 model and is used in the spatial encoder and the temporal encoder to extract multi-level saliency-related base features of each frame. To fuse audio and video features effectively, an audio-video fusion module is provided, which weights the visual saliency prediction result at the channel level using the semantic information in the audio and video features, so that channels that are more salient in the audio-visual information receive larger weights; the audio information thus influences the final saliency prediction result without introducing audio interference.
A video significance prediction method based on audio and video features comprises the following specific steps:
step 1, data preprocessing, namely firstly processing a video to be predicted and time-synchronized audio thereof into a continuous video frame and a logarithmic Mel spectrogram;
step 2, firstly, constructing a visual saliency basic feature extraction module for extracting features of video frames;
step 3, extracting space significance features and time significance features contained in the video frames based on the features of the video frames, fusing the space features and the time features to obtain visual significance features, and extracting audio semantic features contained in the logarithmic mel-frequency spectrogram;
Step 4, constructing an audio-video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio-video saliency features;
step 5, integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
Step 6, training the integral model formed by the steps 2-5;
And 7, utilizing the trained integral model to realize visual saliency prediction.
Further, the specific implementation manner of the step 1 is as follows;
step 1.1, firstly, processing a video to be predicted, separating out audio, converting the audio into a wav format, and cutting the video into video frames according to the original video frame rate;
Step 1.2, processing the video frame comprises the steps of reading a picture and converting the picture into an RGB format, adjusting the resolution of the picture into W multiplied by H, carrying out normalization operation on the picture, and converting the data type into a Tensor type;
Step 1.3, converting an audio wav file into a logarithmic Mel spectrogram which is time-synchronous with video frames, wherein the method comprises the steps of resampling the audio wav file into 16000Hz, dividing the resampled audio file into frames with non-overlapping time length of 960ms, decomposing each frame with 960ms through a window with time length of 25ms and step length of 10ms by short-time Fourier transform, integrating the generated spectrogram into 64 frequency bands with Mel intervals, adding a small offset to the amplitude of each frequency band, carrying out logarithmic conversion to obtain a logarithmic Mel spectrogram with the size of 96 multiplied by 64 and the channel number of 1, acquiring each frame timestamp of the video frame, corresponding the time length covered by the logarithmic Mel spectrogram with the video frame timestamp, and converting the data type into Tensor type.
Further, the visual saliency basic feature extraction module is improved on the basis of a MobileNet-V2 model, and the specific improvement is as follows:
Firstly, changing a final space pooling layer in a MobileNet-V2 model into a pyramid pooling layer with holes, wherein the pyramid pooling layer with holes consists of K=4 parallel convolution layers with holes, using a single convolution layer with the aperture size of 1 to retain original information, extracting different scale features by using three convolution layers with different aperture sizes, and connecting the features obtained by the four convolution layers to be used as the output of the last convolution block of the MobileNet-V2 model;
And secondly, connecting MobileNet-V2 to output of the last three convolution blocks, respectively carrying out channel dimension adjustment on the output characteristics of the three convolution blocks by using three convolution layers, wherein the output spatial resolutions obtained by the three convolution blocks are different, carrying out bilinear interpolation up-sampling on the output of the last two convolution blocks, carrying out scaling on the feature graphs to enable the sizes of the three feature graphs to be the same, connecting the three features according to the channel direction, and then obtaining a visual saliency basic feature F x by using one convolution layer, wherein x is the sequence number of an input video frame.
Further, in the step 3, a spatial encoder is constructed to extract spatial significance features contained in the video frame, and the specific implementation manner is as follows;
after the video frame X_n passes through the visual saliency base feature extraction module to obtain the feature F_n, the features of different abstraction levels in F_n are integrated through a convolution layer; to improve model performance, a convolutional Gaussian prior layer is added to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, which is added to the saliency features; each Gaussian map has the same size as the saliency features and is generated by the following equation:
f(x, y) = exp( -( (x - μ_x)^2 / (2σ_x^2) + (y - μ_y)^2 / (2σ_y^2) ) )
where f(x, y) is the value of the Gaussian map at position (x, y) and μ_x, μ_y, σ_x, σ_y are the means and variances in the horizontal and vertical directions, respectively; the parameters of all the Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The prior feature map is concatenated to the saliency features along the channel direction, and two convolution layers integrate the prior features and saliency features and output the spatial saliency features F_s.
Further, in step 3, a temporal encoder is constructed to extract the temporal saliency features contained in the video frames, implemented as follows:
the n consecutive video frames (X_1, X_2, ..., X_n) are processed by n parallel visual saliency base feature extraction modules, and the resulting features are concatenated to obtain temporal features of size n × 45 × 60 × 256; the temporal features are extracted and integrated through two convolution layers; a convolutional Gaussian prior layer is also added in the temporal encoder to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, and the parameters of all Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The obtained prior feature map is concatenated to the saliency features of each frame along the channel direction, and two three-dimensional convolution layers integrate the prior features and saliency features and output the temporal saliency features F_t.
Further, in step 3, an audio encoder is constructed to extract the audio semantic features contained in the logarithmic Mel spectrogram, implemented as follows: a logarithmic Mel spectrogram A_n that is time-synchronized with the video frame input to the spatial encoder is input to the audio encoder, and a ResNet model is used as the audio encoder to obtain the audio semantic features F_a from A_n.
Further, in step 3, the spatial and temporal features are concatenated along the channel dimension, and two convolution layers automatically fuse them to output the visual saliency features F_v.
Further, the audio-video fusion module automatically learns the weight parameters between channels based on a channel attention mechanism, implemented as follows:
first, the visual saliency features F_v^n are compressed into channel-level statistics by a global average pooling layer P; second, two nonlinear fully connected layers f_V and f_A adjust the dimensions of the pooled visual saliency features and of the audio semantic features F_a^n, and the two are added element-wise to obtain the audio-visual semantic features; the channel-level attention weights W_n are then computed by a multi-layer perceptron U with a sigmoid activation function, as given by the following equation:
W_n = U( f_V( P(F_v^n) ) + f_A( F_a^n ) )
finally, the obtained attention weights W_n are multiplied element-wise with F_v^n to obtain the saliency features based on audio and video features, F_av^n.
Further, in step 6, training is performed in two steps: first, the visual saliency base feature extraction module is trained on the SALICON dataset using a stochastic gradient descent algorithm; then the overall model is trained on the DIEM dataset, also using a stochastic gradient descent algorithm.
Further, the loss function used in training is:
L(S, M, F) = α_1·L_kl(S, M) + α_2·L_cc(S, M) + α_3·L_nss(S, F)
where S is the single-channel saliency prediction map output by the model, M is the saliency density map in the ground truth, F is the fixation-point map in the ground truth, α_1, α_2, α_3 are manually set weights, and L_kl, L_cc, L_nss are the KL divergence, linear correlation coefficient and normalized scanpath saliency, respectively, computed as follows:
L_kl(S, M) = Σ_i M_i · log( ε + M_i / (ε + S_i) )
where i denotes the i-th pixel and ε is a small constant;
L_cc(S, M) = cov(S, M) / ( σ(S) · σ(M) )
where cov(S, M) is the covariance between the model's saliency prediction and the ground-truth saliency density map, and σ(S), σ(M) are their standard deviations;
L_nss(S, F) = (1/N) · Σ_i ( (S_i - μ(S)) / σ(S) ) · F_i
where μ(S) and σ(S) are the mean and standard deviation of the saliency prediction map, and N is the number of eye fixation points.
On the other hand, the invention also provides a visual saliency prediction system based on the audio and video characteristics, which comprises the following modules:
the data preprocessing module is used for preprocessing data, and firstly, the video to be predicted and the time-synchronous audio thereof are processed into continuous video frames and logarithmic Mel spectrograms;
The basic feature extraction module is used for constructing a visual saliency basic feature extraction module and extracting features of the video frames;
The visual saliency feature and audio semantic feature extraction module is used for extracting spatial saliency features and temporal saliency features contained in the video frames based on the features of the video frames, fusing the spatial features and the temporal features to obtain visual saliency features, and extracting audio semantic features contained in the logarithmic Mel spectrogram;
the feature fusion module is used for constructing an audio and video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio and video saliency features;
the single-channel saliency map synthesis module is used for integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
The training module is used for training an integral model formed by the basic feature extraction module, the visual saliency feature and the audio semantic feature extraction module and the feature fusion module;
and the prediction module is used for realizing visual saliency prediction by using the trained integral model.
The invention has the following beneficial effects:
A new multi-stream network framework extracts the spatial, temporal and audio features in the video and audio frames more effectively. An audio-video feature fusion module, which derives channel weights for the visual saliency features by fusing audio and video semantic features, fuses the visual and auditory features more effectively.
Drawings
Fig. 1 is a general frame diagram of the present invention.
Fig. 2 is a frame diagram of a visual saliency basic feature extraction module according to the present invention.
Fig. 3 is a frame diagram of the audio/video feature fusion module according to the present invention.
Fig. 4 is a flow chart of the overall method.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
The overall flow chart of the invention is shown in fig. 4, and the specific implementation steps are as follows:
And 1, preprocessing data.
Step 1.1, firstly, processing the video to be predicted, separating out audio, converting the audio into a wav format, and cutting the video into video frames according to the original video frame rate.
Step 1.2, processing the video frames: each picture is read and converted to RGB format, its resolution is adjusted to 360 × 640 pixels, the picture is normalized, and the data type is converted to Tensor.
Step 1.3, converting the audio wav file into logarithmic Mel spectrograms time-synchronized with the video frames: the audio wav file is resampled to 16000 Hz; the resampled audio is divided into non-overlapping frames of 960 ms; each 960 ms frame is decomposed by a short-time Fourier transform with a 25 ms window and a 10 ms step; the resulting spectrogram is integrated into 64 Mel-spaced frequency bands; a small offset is added to the magnitude of each band before taking the logarithm, yielding a logarithmic Mel spectrogram of size 96 × 64 with 1 channel; the timestamp of each video frame is obtained and the time span covered by each logarithmic Mel spectrogram is matched to the video frame timestamps; finally, the data type is converted to Tensor.
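A minimal sketch of this preprocessing step, assuming torchaudio is available (the patent only states that PyTorch is used); the 960 ms segmentation, 25 ms window, 10 ms hop and 64 Mel bands follow the values above, while the log offset of 1e-6 and the exact framing (roughly 96 frames per segment) are illustrative assumptions:

```python
import torch
import torchaudio

def wav_to_logmel_segments(wav_path: str, sr_target: int = 16000) -> torch.Tensor:
    """Convert a wav file into non-overlapping 960 ms log-Mel segments (~96x64 each)."""
    wav, sr = torchaudio.load(wav_path)                  # (channels, samples)
    wav = wav.mean(dim=0)                                # mix down to mono
    if sr != sr_target:
        wav = torchaudio.functional.resample(wav, sr, sr_target)

    seg_len = int(0.96 * sr_target)                      # 960 ms, non-overlapping
    n_seg = wav.numel() // seg_len
    segments = wav[: n_seg * seg_len].reshape(n_seg, seg_len)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr_target,
        n_fft=int(0.025 * sr_target),                    # 25 ms STFT window
        hop_length=int(0.010 * sr_target),               # 10 ms hop
        n_mels=64,                                       # 64 Mel bands
        center=False,
    )
    logmel = torch.log(mel(segments) + 1e-6)             # small offset before log (assumed 1e-6)
    # (n_seg, 1, time≈96, 64): one single-channel log-Mel image per 960 ms segment
    return logmel.transpose(1, 2).unsqueeze(1)
```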
Step 2, extracting the spatial, temporal and audio features contained in the video frames and the logarithmic Mel spectrogram. The invention constructs a basic framework following a multi-stream network structure, comprising a spatial encoder, a temporal encoder, an audio encoder, an audio-video feature fusion module and a decoder. The model takes n time-synchronized consecutive video frames and audio frames as input: a single video frame X_n is input to the spatial encoder, the n consecutive video frames (X_1, X_2, ..., X_n) are input to the temporal encoder, and the logarithmic Mel spectrogram A_n time-synchronized with the video frame input to the spatial encoder is input to the audio encoder; experiments show that n = 7 gives the best result. The three encoders extract the saliency-related spatial features F_s, temporal features F_t and audio features F_a. For ease of understanding, convolution layer parameters are written below in the order number of channels_kernel size × kernel size_stride, and pooling layer parameters in the order kernel size × kernel size_stride. The feature extraction steps are as follows:
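Before the individual components are described, the overall wiring of this framework can be sketched as follows. This is a structural outline only: the submodules stand for the components detailed in steps 2.1 to 4 below, and the ReLU activations are assumptions (the text specifies layer shapes but not activation functions).

```python
import torch
import torch.nn as nn

class AudioVisualSaliencyNet(nn.Module):
    """Structural sketch of the multi-stream framework."""

    def __init__(self, spatial_enc, temporal_enc, audio_enc, av_fusion, decoder):
        super().__init__()
        self.spatial_enc = spatial_enc    # single frame X_n        -> F_s (B, 256, 45, 80)
        self.temporal_enc = temporal_enc  # n=7 frames (X_1..X_n)   -> F_t (B, 256, 45, 80)
        self.audio_enc = audio_enc        # log-Mel spectrogram A_n -> F_a (B, 309)
        self.st_fuse = nn.Sequential(     # step 2.5: concat + (1024_3x3_1), (512_3x3_1)
            nn.Conv2d(256 + 256, 1024, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(1024, 512, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.av_fusion = av_fusion        # step 3: channel-attention audio-visual fusion
        self.decoder = decoder            # step 4: -> single-channel saliency map S_n

    def forward(self, frame, clip, logmel):
        f_s = self.spatial_enc(frame)
        f_t = self.temporal_enc(clip)
        f_a = self.audio_enc(logmel)
        f_v = self.st_fuse(torch.cat([f_s, f_t], dim=1))   # visual saliency feature
        f_av = self.av_fusion(f_v, f_a)                    # channel re-weighted features
        return self.decoder(f_av)                          # (B, 1, 45, 80)
```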
Step 2.1, constructing the visual saliency base feature extraction module. The invention mimics the way the human visual system processes information, namely that the extraction and integration of the most basic visual features is completed on the path from the retina to the primary visual cortex; accordingly, a weight-sharing visual saliency base feature module is adopted in both the visual spatial encoder and the visual temporal encoder. The module is an improvement on the MobileNet-V2 model, as follows:
First, the final spatial pooling layer in the MobileNet-V2 model is replaced by an atrous spatial pyramid pooling layer consisting of K=4 parallel dilated convolution layers. A single 1 × 1 convolution layer with dilation rate 1 (rate=1) and 256 output channels retains the original information; three 3 × 3 convolution layers with different dilation rates (rate = {6, 12, 18}) and 256 output channels each extract features at different scales. The features obtained from the four dilated convolution layers are concatenated and used as the output of the last convolution block of the MobileNet-V2 model.
Second, the outputs of the last three convolution blocks of MobileNet-V2 are used. The output features of the three blocks are channel-adjusted by 1 × 1 convolution layers with 64, 128 and 256 output channels, respectively. Because the spatial resolutions of the three block outputs differ, the outputs of the last two blocks are upsampled by bilinear interpolation so that the three feature maps all have the same size of 45 × 80 pixels. The three feature maps are then concatenated along the channel direction and passed through a (256_3×3_1) convolution layer to obtain the visual saliency base feature F_x (x is the index of the input video frame), of size 45 × 80 × 256.
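A sketch of this module under the configuration stated above (four parallel branches with rates {1, 6, 12, 18}, channel reductions to 64/128/256, bilinear upsampling and a final 3 × 3 convolution). The backbone is assumed to be wrapped so that it returns the outputs of its last three convolution blocks; the default channel counts (32, 96, 320) are assumptions that depend on the MobileNet-V2 variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: one 1x1 branch plus three dilated 3x3 branches."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)   # 4 * out_ch channels

class VisualSaliencyBaseFeature(nn.Module):
    """Multi-level base features F_x (45x80x256), assuming a backbone wrapper that
    returns the outputs of its last three convolution blocks (c3, c4, c5)."""
    def __init__(self, backbone, ch=(32, 96, 320)):
        super().__init__()
        self.backbone = backbone                          # frame -> (c3, c4, c5)
        self.aspp = ASPP(ch[2])                           # applied to the deepest block
        self.reduce = nn.ModuleList([
            nn.Conv2d(ch[0], 64, 1),
            nn.Conv2d(ch[1], 128, 1),
            nn.Conv2d(4 * 256, 256, 1),
        ])
        self.out = nn.Conv2d(64 + 128 + 256, 256, 3, padding=1)

    def forward(self, frame):
        c3, c4, c5 = self.backbone(frame)
        feats = [self.reduce[0](c3),
                 self.reduce[1](c4),
                 self.reduce[2](self.aspp(c5))]
        size = feats[0].shape[-2:]                        # e.g. 45 x 80
        feats = [feats[0]] + [F.interpolate(f, size=size, mode='bilinear',
                                            align_corners=False) for f in feats[1:]]
        return self.out(torch.cat(feats, dim=1))          # F_x: (B, 256, 45, 80)
```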
Step 2.2, constructing the spatial encoder. After the video frame X_n passes through the visual saliency base feature extraction module to obtain the feature F_n, the features of different abstraction levels in F_n are integrated through a convolution layer with a 3 × 3 kernel and 256 output channels. To improve model performance, a convolutional Gaussian prior layer is added to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers with 3 × 3 kernels and 64 output channels to obtain a prior feature map; each Gaussian map has the same size as the saliency features (45 × 80 pixels) and is generated by the following equation:
f(x, y) = exp( -( (x - μ_x)^2 / (2σ_x^2) + (y - μ_y)^2 / (2σ_y^2) ) )
where f(x, y) is the value of the Gaussian map at position (x, y) and μ_x, μ_y, σ_x, σ_y are the means and variances in the horizontal and vertical directions, respectively. The parameters of all the Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The prior feature map is concatenated to the saliency features along the channel direction, and two convolution layers with 3 × 3 kernels and 256 output channels integrate the prior features and saliency features and output the spatial saliency features F_s.
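A sketch of the spatial stream including the convolutional Gaussian prior layer. The Gaussian means are fixed at the image centre and the eight variances are spread over an assumed range, since the patent states only that these parameters are set empirically; the ReLU activations are likewise assumptions.

```python
import torch
import torch.nn as nn

class GaussianPrior(nn.Module):
    """Learned combination of K=8 fixed center-bias Gaussian maps (the convolutional
    Gaussian prior layer); the per-map variances below are illustrative assumptions."""
    def __init__(self, height=45, width=80, k=8, out_ch=64):
        super().__init__()
        ys = torch.linspace(0, 1, height).view(-1, 1).expand(height, width)
        xs = torch.linspace(0, 1, width).view(1, -1).expand(height, width)
        sigmas = torch.linspace(0.1, 0.8, k)               # assumed spread of variances
        maps = [torch.exp(-(((xs - 0.5) ** 2) / (2 * s ** 2)
                            + ((ys - 0.5) ** 2) / (2 * s ** 2))) for s in sigmas]
        self.register_buffer('gauss', torch.stack(maps).unsqueeze(0))  # (1, K, H, W)
        self.combine = nn.Sequential(                      # two conv layers, 64 channels
            nn.Conv2d(k, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, batch_size):
        return self.combine(self.gauss.expand(batch_size, -1, -1, -1))

class SpatialEncoder(nn.Module):
    """Spatial stream: base features -> integration conv -> concat Gaussian prior -> F_s."""
    def __init__(self, base_feature):
        super().__init__()
        self.base = base_feature                           # VisualSaliencyBaseFeature
        self.integrate = nn.Conv2d(256, 256, 3, padding=1)
        self.prior = GaussianPrior()
        self.head = nn.Sequential(
            nn.Conv2d(256 + 64, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, frame):
        f = self.integrate(self.base(frame))
        p = self.prior(f.size(0))
        return self.head(torch.cat([f, p], dim=1))         # F_s: (B, 256, 45, 80)
```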
Step 2.3, constructing the temporal encoder. The 7 consecutive video frames (X_1, X_2, ..., X_n) are passed through 7 parallel visual saliency base feature extraction modules, and the resulting features are concatenated to obtain features of size 7 × 45 × 60 × 256. Two consecutive three-dimensional convolution layers with 3 × 3 kernels and 256 output channels extract and integrate the temporal features. As in the spatial encoder, a convolutional Gaussian prior layer simulates the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two (64_3×3_1) convolution layers to obtain prior feature maps, and the parameters of all Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}. The obtained prior feature maps are concatenated to the saliency features of each frame along the channel direction, and two three-dimensional convolution layers with 3 × 3 kernels and 256 channels integrate the prior features and saliency features and output the temporal saliency features F_t.
Step 2.4, constructing the audio encoder. The logarithmic Mel spectrogram A_n time-synchronized with the video frame input to the spatial encoder is input to the audio encoder; a ResNet model is used as the audio encoder to obtain the audio semantic features F_a from A_n, of size 1 × 309.
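A sketch of the temporal stream. The base feature module is shared (weight-tied) with the spatial stream. The 3 × 3 × 3 kernels, the omission of the Gaussian prior branch, and the mean over the time axis used to collapse the clip back to a single 45 × 80 map are simplifying assumptions, since the text specifies only the two three-dimensional convolution layers. The audio stream can be any ResNet-style network applied to the single-channel log-Mel image whose penultimate activations serve as F_a; the 1 × 309 size reported above depends on the chosen ResNet variant.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Temporal stream: per-frame base features stacked along time and integrated by
    two 3D convolutions (Gaussian prior branch omitted here for brevity)."""
    def __init__(self, base_feature, n_frames=7):
        super().__init__()
        self.base = base_feature                  # shared with the spatial stream
        self.n_frames = n_frames
        self.conv3d = nn.Sequential(
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.base(clip.flatten(0, 1))         # (B*T, 256, 45, 80)
        f = f.view(b, t, *f.shape[1:]).permute(0, 2, 1, 3, 4)  # (B, 256, T, 45, 80)
        f = self.conv3d(f)
        return f.mean(dim=2)                      # collapse time (assumption) -> F_t
```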
Step 2.5, fusing the spatial and temporal features to obtain the visual saliency features. The spatial and temporal features are concatenated along the channel dimension and automatically fused by two convolution layers with parameters (1024_3×3_1) and (512_3×3_1), which output the visual saliency features F_v.
Step 3, constructing the audio-video feature fusion module. Based on a channel attention mechanism, the audio-video feature fusion module is designed to automatically learn the weight parameters between channels at the channel level and to use the semantic information of the audio-visual features to selectively enhance or suppress the expression of visual features that are related or unrelated to the audio information. Compared with simply adding or concatenating the audio and visual features, the fusion method designed by the invention is gentler: it neither loses visual features nor brings in a large amount of noise from the audio features; in other words, it is an audio-guided visual attention fusion method. First, the visual saliency features F_v^n are compressed into channel-level statistics of size 1 × 512 by a global average pooling layer P. Second, two nonlinear fully connected layers f_V and f_A adjust the dimensions of the pooled visual saliency features and of the audio semantic features F_a^n to 1 × 512, and the two are added element-wise to obtain the audio-visual semantic features. The channel-level attention weights W_n, of size 1 × 512, are then computed by a multi-layer perceptron U with a sigmoid activation function. The above procedure is given by the following equation:
W_n = U( f_V( P(F_v^n) ) + f_A( F_a^n ) )
Finally, the obtained attention weights W_n are multiplied element-wise with F_v^n to obtain the saliency features based on audio and video features, F_av^n, of size 45 × 80 × 512.
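A sketch of this channel-attention fusion. The 512-channel visual feature and 309-dimensional audio feature follow the sizes stated above; the hidden width and depth of the perceptron U are assumptions, with its sigmoid output providing the per-channel weights W_n.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Channel-attention fusion: audio-visual semantics produce per-channel weights W_n
    that re-weight the visual saliency feature (F_av = W_n * F_v, channel-wise)."""
    def __init__(self, vis_ch=512, aud_dim=309, hidden=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # global average pooling P
        self.f_v = nn.Sequential(nn.Linear(vis_ch, hidden), nn.ReLU(inplace=True))
        self.f_a = nn.Sequential(nn.Linear(aud_dim, hidden), nn.ReLU(inplace=True))
        self.mlp = nn.Sequential(                         # multi-layer perceptron U
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, vis_ch), nn.Sigmoid(),
        )

    def forward(self, f_v, f_a):
        # f_v: (B, vis_ch, H, W) visual saliency feature; f_a: (B, aud_dim) audio semantics
        v = self.pool(f_v).flatten(1)                     # (B, vis_ch) channel statistics
        w = self.mlp(self.f_v(v) + self.f_a(f_a))         # (B, vis_ch) attention weights W_n
        return f_v * w.unsqueeze(-1).unsqueeze(-1)        # channel-wise re-weighting
```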
Step 4, constructing the decoder. After the audio-visual saliency features F_av^n are obtained, they need to be integrated into a single-channel saliency map. Because the audio-video feature fusion module has already adjusted the attention weights between the visual feature channels, a simple decoder suffices: two convolution layers with 3 × 3 kernels and 128 and 1 output channels, respectively, integrate the audio-visual saliency features and generate the single-channel saliency map S_n.
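The decoder then reduces to two convolutions, as sketched below; the final sigmoid that squashes the map to [0, 1] and any upsampling back to the 360 × 640 input resolution are assumptions not specified in this paragraph.

```python
import torch.nn as nn

# Minimal decoder sketch: two 3x3 convolutions (128 then 1 output channels) turn the
# 45x80x512 audio-visual feature into a single-channel saliency map.
decoder = nn.Sequential(
    nn.Conv2d(512, 128, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 1, 3, padding=1),
    nn.Sigmoid(),          # assumed: squash the map to [0, 1]
)
```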
Model training details:
The proposed model is implemented on an NVIDIA 1080 GPU using PyTorch. The MobileNet-V2 in the visual saliency base feature extraction module is initialized with its public weights, and the ResNet in the audio encoder is initialized with public weights trained on a large audio-video dataset.
First, the visual saliency base feature extraction module is trained on SALICON. The SALICON dataset is one of the largest image saliency prediction datasets at present and contains 20000 images. Training the visual saliency base feature extraction module on this dataset reduces the need for visual saliency data and improves the performance of the model. During this stage the module is trained with a stochastic gradient descent algorithm, with an initial learning rate of 10^-3, momentum of 0.9, and a weight decay coefficient of 0.0005. The batch size is set to 32, i.e. 32 images are processed per training iteration.
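These first-stage settings map directly onto PyTorch's SGD optimizer, for example (base_module stands for the visual saliency base feature extraction module):

```python
import torch

def make_stage1_optimizer(base_module: torch.nn.Module) -> torch.optim.SGD:
    """SGD settings for the first training stage (SALICON pre-training), per the text."""
    return torch.optim.SGD(
        base_module.parameters(),
        lr=1e-3,            # initial learning rate 10^-3
        momentum=0.9,
        weight_decay=5e-4,  # weight decay 0.0005
    )
```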
In the second step, the whole model is trained on DIEM. The DIEM dataset is one of the datasets commonly used for video saliency prediction; it contains 85 video segments of varying duration (27 to 217 seconds) that are rich in variety and provide the corresponding audio. The proposed model is trained with a stochastic gradient descent algorithm, with an initial learning rate of 10^-3, momentum of 0.9, and weight decay of 0.0005. The batch size is set to 4, i.e. 4 × 7 = 28 frames are processed per training iteration. The same loss function is used in both training steps:
L(S, M, F) = α_1·L_kl(S, M) + α_2·L_cc(S, M) + α_3·L_nss(S, F)
where S is the single-channel saliency prediction map output by the model, M is the saliency density map in the ground truth, and F is the fixation-point map in the ground truth. L_kl, L_cc and L_nss are the KL divergence, linear correlation coefficient and normalized scanpath saliency, respectively, computed as follows:
The KL divergence (Kullback-Leibler divergence, KL) is an asymmetric metric; a smaller value indicates that the saliency prediction is closer to the ground-truth saliency density map:
L_kl(S, M) = Σ_i M_i · log( ε + M_i / (ε + S_i) )
where i denotes the i-th pixel and ε is a small constant.
The linear correlation coefficient (CC) is commonly used to measure the correlation between two variables; a larger value means the saliency prediction is closer to the ground truth:
L_cc(S, M) = cov(S, M) / ( σ(S) · σ(M) )
where cov(S, M) is the covariance between the model's saliency prediction and the ground-truth saliency density map, and σ(S), σ(M) are their standard deviations.
Normalized scanpath saliency (NSS) is designed specifically for evaluating saliency prediction results; it can be regarded as the normalized saliency measured at the fixation positions, computed as the average of the normalized saliency map at the human eye fixation points:
L_nss(S, F) = (1/N) · Σ_i ( (S_i - μ(S)) / σ(S) ) · F_i
where μ(S) and σ(S) are the mean and standard deviation of the saliency prediction map, i denotes the i-th pixel, and N is the number of eye fixation points. A larger NSS value indicates better saliency prediction performance.
α_1, α_2 and α_3 are manually set weights; in this example they are 1, -0.2 and -0.1, respectively.
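A sketch of this combined loss in PyTorch, assuming per-sample maps of equal spatial size. The ε terms and the exact normalisation of the KL term follow common saliency-benchmark definitions rather than the patent's unreproduced formula images, so treat those details as assumptions; the default weights match the values 1, -0.2, -0.1 given above.

```python
import torch

def saliency_loss(S, M, F, a1=1.0, a2=-0.2, a3=-0.1, eps=1e-7):
    """L = a1*KL(S,M) + a2*CC(S,M) + a3*NSS(S,F).
    S: predicted saliency map, M: ground-truth density map, F: binary fixation map."""
    S, M, F = S.flatten(1), M.flatten(1), F.flatten(1)

    # KL divergence between the normalised density maps
    Sn = S / (S.sum(dim=1, keepdim=True) + eps)
    Mn = M / (M.sum(dim=1, keepdim=True) + eps)
    kl = (Mn * torch.log(eps + Mn / (Sn + eps))).sum(dim=1)

    # Linear correlation coefficient (Pearson)
    Sc, Mc = S - S.mean(dim=1, keepdim=True), M - M.mean(dim=1, keepdim=True)
    cc = (Sc * Mc).sum(dim=1) / (Sc.norm(dim=1) * Mc.norm(dim=1) + eps)

    # Normalised scanpath saliency at the fixated locations
    Sz = (S - S.mean(dim=1, keepdim=True)) / (S.std(dim=1, keepdim=True) + eps)
    nss = (Sz * F).sum(dim=1) / (F.sum(dim=1) + eps)

    return (a1 * kl + a2 * cc + a3 * nss).mean()
```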
On the other hand, the invention also provides a visual saliency prediction system based on the audio and video characteristics, which comprises the following modules:
the data preprocessing module is used for preprocessing data, and firstly, the video to be predicted and the time-synchronous audio thereof are processed into continuous video frames and logarithmic Mel spectrograms;
The basic feature extraction module is used for constructing a visual saliency basic feature extraction module and extracting features of the video frames;
The visual saliency feature and audio semantic feature extraction module is used for extracting spatial saliency features and temporal saliency features contained in the video frames based on the features of the video frames, fusing the spatial features and the temporal features to obtain visual saliency features, and extracting audio semantic features contained in the logarithmic Mel spectrogram;
the feature fusion module is used for constructing an audio and video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio and video saliency features;
the single-channel saliency map synthesis module is used for integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
The training module is used for training an integral model formed by the basic feature extraction module, the visual saliency feature and the audio semantic feature extraction module and the feature fusion module;
and the prediction module is used for realizing visual saliency prediction by using the trained integral model.
The specific implementation of each module corresponds to the steps described above and is not repeated here.
The implementations described herein are merely illustrative of the principles of the present invention, and various modifications or additions may be made to the described implementations by those skilled in the art without departing from the spirit or scope of the invention as defined in the accompanying claims.

Claims (8)

1. A video saliency prediction method based on audio and video features, characterized by comprising the following steps:
step 1, data preprocessing, namely firstly processing a video to be predicted and time-synchronized audio thereof into a continuous video frame and a logarithmic Mel spectrogram;
step 2, firstly, constructing a visual saliency basic feature extraction module for extracting features of video frames;
step 3, extracting space significance features and time significance features contained in the video frames based on the features of the video frames, fusing the space features and the time features to obtain visual significance features, and extracting audio semantic features contained in the logarithmic mel-frequency spectrogram;
in step 3, the spatial and temporal features are concatenated along the channel dimension, and two convolution layers automatically fuse them to output the visual saliency features F_v;
in step 3, an audio encoder is constructed to extract the audio semantic features contained in the logarithmic Mel spectrogram, implemented as follows:
the logarithmic Mel spectrogram A_n time-synchronized with the video frame input to the spatial encoder is input to the audio encoder, and a ResNet model is used as the audio encoder to obtain the audio semantic features F_a from A_n;
Step 4, constructing an audio-video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio-video saliency features;
The audio and video fusion module automatically learns weight parameters among channels based on a channel attention mechanism, and the specific implementation mode is as follows;
first, the visual saliency features F_v^n are compressed into channel-level statistics by a global average pooling layer P; second, two nonlinear fully connected layers f_V and f_A adjust the dimensions of the pooled visual saliency features and of the audio semantic features F_a^n, and the two are added element-wise; the channel-level attention weights W_n are then computed by a multi-layer perceptron U with a sigmoid activation function, as given by the following equation:
W_n = U( f_V( P(F_v^n) ) + f_A( F_a^n ) )
finally, the obtained attention weights W_n are multiplied element-wise with F_v^n to obtain the saliency features based on audio and video features, F_av^n;
Step 5, integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
step 6, training the integral model formed by the steps 2-5;
And 7, utilizing the trained integral model to realize visual saliency prediction.
2. The video saliency prediction method based on audio and video features of claim 1, wherein the specific implementation manner of step 1 is as follows;
step 1.1, firstly, processing a video to be predicted, separating out audio, converting the audio into a wav format, and cutting the video into video frames according to the original video frame rate;
Step 1.2, processing the video frame comprises the steps of reading a picture and converting the picture into an RGB format, adjusting the resolution of the picture into W multiplied by H, carrying out normalization operation on the picture, and converting the data type into a Tensor type;
Step 1.3, converting an audio wav file into a logarithmic Mel spectrogram which is time-synchronous with video frames, wherein the method comprises the steps of resampling the audio wav file into 16000Hz, dividing the resampled audio file into frames with non-overlapping time length of 960ms, decomposing each frame with 960ms through a window with time length of 25ms and step length of 10ms by short-time Fourier transform, integrating the generated spectrogram into 64 frequency bands with Mel intervals, adding a small offset to the amplitude of each frequency band, carrying out logarithmic conversion to obtain a logarithmic Mel spectrogram with the size of 96 multiplied by 64 and the channel number of 1, acquiring each frame timestamp of the video frame, corresponding the time length covered by the logarithmic Mel spectrogram with the video frame timestamp, and converting the data type into Tensor type.
3. The video saliency prediction method based on audio and video features of claim 1, wherein the visual saliency basic feature extraction module is improved on the basis of MobileNet-V2 model, and the specific improvement is as follows:
Firstly, changing a final space pooling layer in a MobileNet-V2 model into a pyramid pooling layer with holes, wherein the pyramid pooling layer with holes consists of K=4 parallel convolution layers with holes, using a single convolution layer with the aperture size of 1 to retain original information, extracting different scale features by using three convolution layers with different aperture sizes, and connecting the features obtained by the four convolution layers to be used as the output of the last convolution block of the MobileNet-V2 model;
And secondly, connecting MobileNet-V2 to output of the last three convolution blocks, respectively carrying out channel dimension adjustment on the output characteristics of the three convolution blocks by using three convolution layers, wherein the output spatial resolutions obtained by the three convolution blocks are different, carrying out bilinear interpolation up-sampling on the output of the last two convolution blocks, carrying out scaling on the feature graphs to enable the sizes of the three feature graphs to be the same, connecting the three features according to the channel direction, and then obtaining a visual saliency basic feature F x by using one convolution layer, wherein x is the sequence number of an input video frame.
4. The video saliency prediction method based on audio and video features of claim 1, wherein the spatial saliency feature contained in the video frame is extracted by a spatial encoder constructed in step 3, and the implementation mode is as follows;
after the video frame X_n passes through the visual saliency base feature extraction module to obtain the feature F_n, the features of different abstraction levels in F_n are integrated through a convolution layer; to improve model performance, a convolutional Gaussian prior layer is added to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, which is added to the saliency features; each Gaussian map has the same size as the saliency features and is generated by the following equation:
f(x, y) = exp( -( (x - μ_x)^2 / (2σ_x^2) + (y - μ_y)^2 / (2σ_y^2) ) )
where f(x, y) is the value of the Gaussian map at position (x, y) and μ_x, μ_y, σ_x, σ_y are the means and variances in the horizontal and vertical directions, respectively; the parameters of all the Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}; the prior feature map is concatenated to the saliency features along the channel direction, and two convolution layers integrate the prior features and saliency features and output the spatial saliency features F_s.
5. The video saliency prediction method based on audio and video features of claim 1, wherein the time encoder is constructed in step 3 to extract the time saliency features contained in the video frames, and the implementation is as follows;
the n consecutive video frames (X_1, X_2, ..., X_n) are processed by n parallel visual saliency base feature extraction modules, and the resulting features are concatenated to obtain temporal features of size n × 45 × 60 × 256; the temporal features are extracted and integrated through two convolution layers; a convolutional Gaussian prior layer is added in the temporal encoder to simulate the center-bias phenomenon: K=8 Gaussian maps with different horizontal and vertical variances are automatically combined through two convolution layers to obtain a prior feature map, and the parameters of all Gaussian maps are set empirically, n ∈ {1, 2, ..., 8}; the obtained prior feature map is concatenated to the saliency features of each frame along the channel direction, and two three-dimensional convolution layers integrate the prior features and saliency features and output the temporal saliency features F_t.
6. The video saliency prediction method based on audio and video features of claim 1, wherein in step 6 training is performed in two steps: first, the visual saliency base feature extraction module is trained on SALICON using a stochastic gradient descent algorithm; then the overall model is trained on DIEM, also using a stochastic gradient descent algorithm.
7. The video saliency prediction method based on audio and video features of claim 1, wherein the loss function used in training is:
L(S, M, F) = α_1·L_kl(S, M) + α_2·L_cc(S, M) + α_3·L_nss(S, F)
where S is the single-channel saliency prediction map output by the model, M is the saliency density map in the ground truth, F is the fixation-point map in the ground truth, α_1, α_2, α_3 are manually set weights, and L_kl, L_cc, L_nss are the KL divergence, linear correlation coefficient and normalized scanpath saliency, respectively, computed as follows:
L_kl(S, M) = Σ_i M_i · log( ε + M_i / (ε + S_i) )
where i denotes the i-th pixel and ε is a small constant;
L_cc(S, M) = cov(S, M) / ( σ(S) · σ(M) )
where cov(S, M) is the covariance between the model's saliency prediction and the ground-truth saliency density map, and σ(S), σ(M) are their standard deviations;
L_nss(S, F) = (1/N) · Σ_i ( (S_i - μ(S)) / σ(S) ) · F_i
where μ(S) and σ(S) are the mean and standard deviation of the saliency prediction map, and N is the number of eye fixation points.
8. A video saliency prediction system based on audio and video features, for implementing a video saliency prediction method based on audio and video features as claimed in any one of claims 1 to 7, comprising the following modules:
the data preprocessing module is used for preprocessing data, and firstly, the video to be predicted and the time-synchronous audio thereof are processed into continuous video frames and logarithmic Mel spectrograms;
The basic feature extraction module is used for constructing a visual saliency basic feature extraction module and extracting features of the video frames;
The visual saliency feature and audio semantic feature extraction module is used for extracting spatial saliency features and temporal saliency features contained in the video frames based on the features of the video frames, fusing the spatial features and the temporal features to obtain visual saliency features, and extracting audio semantic features contained in the logarithmic Mel spectrogram;
the feature fusion module is used for constructing an audio and video feature fusion module, and carrying out self-adaptive fusion on the visual saliency features and the audio semantic features to obtain audio and video saliency features;
the single-channel saliency map synthesis module is used for integrating the audio and video saliency features into a single-channel saliency map by using a decoder;
The training module is used for training an integral model formed by the basic feature extraction module, the visual saliency feature and the audio semantic feature extraction module and the feature fusion module;
and the prediction module is used for realizing visual saliency prediction by using the trained integral model.
CN202310247030.1A 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features Active CN116403135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310247030.1A CN116403135B (en) 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310247030.1A CN116403135B (en) 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features

Publications (2)

Publication Number Publication Date
CN116403135A CN116403135A (en) 2023-07-07
CN116403135B true CN116403135B (en) 2025-10-17

Family

ID=87009440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310247030.1A Active CN116403135B (en) 2023-03-10 2023-03-10 Video significance prediction method and system based on audio and video features

Country Status (1)

Country Link
CN (1) CN116403135B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955693A (en) * 2023-08-23 2023-10-27 中国石油大学(华东) Audio-visual significance detection method based on audio-visual consistency sensing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200803236A (en) * 2006-02-14 2008-01-01 Sibeam HD physical layer of a wireless communication device
CN102368819A (en) * 2011-10-24 2012-03-07 南京大学 System for collection, transmission, monitoring and publishment of mobile video

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103210651B (en) * 2010-11-15 2016-11-09 华为技术有限公司 Method and system for video summary


Also Published As

Publication number Publication date
CN116403135A (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN112954312B (en) Non-reference video quality assessment method integrating space-time characteristics
CN113516990B (en) A speech enhancement method, a method for training a neural network, and related equipment
Min et al. Study of subjective and objective quality assessment of audio-visual signals
US10701303B2 (en) Generating spatial audio using a predictive model
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN118658128A (en) AI multi-dimensional teaching behavior analysis method and system based on classroom video
Moss et al. On the optimal presentation duration for subjective video quality assessment
CN113473117B (en) Non-reference audio and video quality evaluation method based on gated recurrent neural network
Wu et al. Cross-modal perceptionist: Can face geometry be gleaned from voices?
CN113489971B (en) Full-reference audio and video objective quality evaluation method, system and terminal
CN116580720A (en) Speaker vision activation interpretation method and system based on audio-visual voice separation
CN116403135B (en) Video significance prediction method and system based on audio and video features
Lee et al. Seeing through the conversation: Audio-visual speech separation based on diffusion model
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
WO2023020500A1 (en) Speech separation method and apparatus, and storage medium
CN120470245B (en) Multimodal children's voice data processing method based on deep learning and federated learning
Sanaguano-Moreno et al. Real-time impulse response: a methodology based on machine learning approaches for a rapid impulse response generation for real-time acoustic virtual reality systems
CN115883869A (en) Processing method, device and processing equipment of video frame interpolation model based on Swin Transformer
Kim et al. Modern trends on quality of experience assessment and future work
CN119252275B (en) Mouth shape generating method and device for voice driving
CN118138833B (en) Digital person construction method and device and computer equipment
US20240349001A1 (en) Method and system for determining individualized head related transfer functions
CN118038856A (en) Digital human video generation method, device, terminal equipment and storage medium
CN112437290A (en) Stereoscopic video quality evaluation method based on binocular fusion network and two-step training frame
Maniyar et al. Persons facial image synthesis from audio with Generative Adversarial Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant