
WO2025020080A1 - Spatio-temporal video saliency analysis - Google Patents

Spatio-temporal video saliency analysis

Info

Publication number
WO2025020080A1
Authority
WO
WIPO (PCT)
Prior art keywords
frames
spatio
video
saliency
regions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2023/109116
Other languages
French (fr)
Inventor
Sijie SONG
Li Jia
Yingjun Bai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apple Inc
Original Assignee
Apple Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apple Inc filed Critical Apple Inc
Priority to PCT/CN2023/109116 priority Critical patent/WO2025020080A1/en
Publication of WO2025020080A1 publication Critical patent/WO2025020080A1/en

Classifications

    • H: ELECTRICITY
      • H04: ELECTRIC COMMUNICATION TECHNIQUE
        • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
          • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
            • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
              • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
                • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
                  • H04N 21/23418: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
              • H04N 21/25: Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
                • H04N 21/251: Learning process for intelligent management, e.g. learning user preferences for recommending movies
                • H04N 21/266: Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
                  • H04N 21/2662: Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
            • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
              • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                  • H04N 21/44008: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
              • H04N 21/45: Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
                • H04N 21/462: Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
                  • H04N 21/4621: Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
                • H04N 21/466: Learning process for intelligent management, e.g. learning user preferences for recommending movies
                  • H04N 21/4662: Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
                    • H04N 21/4666: Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
            • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
              • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
                • H04N 21/845: Structuring of content, e.g. decomposing content into time segments
                  • H04N 21/8456: Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
                • H04N 21/84: Generation or processing of descriptive data, e.g. content descriptors
                  • H04N 21/8405: Generation or processing of descriptive data, e.g. content descriptors represented by keywords

Definitions

  • video playback of a media file is improved by adjusting the frame rate during playback of the media file by a viewer’s mobile device. Adjusting the frame rate during playback may reduce the amount of computer resources needed to provide video playback by the viewer’s mobile device.
  • video frames of the media file that include salient portions as predicted and identified by the unified spatio-temporal framework described herein may be played back at a higher frame rate when compared to video frames of the media file that lack salient portions during playback.
  • the burden on system resources may be reduced. For example, during media file playback, less is required of a processor, such as a video processor, when the frame rate is reduced.
  • video frames that contain salient regions may be displayed at a first frame rate and video frames that lack salient regions may be displayed at a second frame rate, where the first frame rate is higher than the second frame rate.
  • the first frame rate may be 60 fps, which is suitable for 4K video resolution
  • the second frame rate may be 24 fps, which is suitable for streaming video content.
  • Other frame rates may be used, such as 30 fps, which is suitable for live TV broadcasts (e.g., sports, news), and 120 fps, which is considered suitable for slow-motion video and video games.
  • video playback of a media file is improved by adjusting a video compression rate of the media file and adjusting the frame rate of the media file during playback. For example, based on results of the unified spatio-temporal framework described herein, video frames of the media file identified to include salient regions may be compressed at a first compression rate and played back at a first frame rate. Additionally, video frames of the media file that lack salient regions may be compressed at a second compression rate and played back at a second frame rate. In this example, the first compression rate is less than the second compression rate. Further, the first frame rate is greater than the second frame rate. By adjusting video compression rates and playback frame rates, computer and network resource usage may be reduced. Additionally, in this example, video frames that include salient regions (e.g., interesting moments in a video) , may be played back to a viewer in an unaltered state.
  • any of the various components depicted in the flow chart may be deleted, or the components may be performed in a different order, or even concurrently.
  • other embodiments may include additional steps not depicted as part of the flow chart.
  • the language used in this disclosure has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the disclosed subject matter.
  • Reference in this disclosure to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment, and multiple references to “one embodiment” or to “an embodiment” should not be understood as necessarily all referring to the same embodiment or to different embodiments.
  • images captured by a camera device are referred to as “media files. ”
  • the captured images referred to as “media files” could be any audiovisual media data, such as video clips, audio snippets, movies, song files, music videos, or the like.
  • Referring to FIG. 1, a simplified block diagram is depicted of a unified spatio-temporal service and framework 100 including a spatio-temporal saliency service 102 connected to a client device 140 over a network, such as network 150.
  • Client device 140 may be a personal computer or multifunctional device, such as a cell phone, tablet computer, personal digital assistant, portable music/video player, wearable device, or any other electronic device that includes a media playback system.
  • Spatio-temporal saliency service 102 may include one or more servers or other computing or storage devices on which the various modules and storage devices may be contained. Although spatio-temporal saliency service 102 is depicted as comprising various components in an exemplary manner, in one or more embodiments, the various components and functionality may be distributed across multiple network devices, such as servers, network storage, and the like. Further, additional components may be used, or the functionality of any of the components may be combined. Generally, spatio-temporal saliency service 102 may include one or more memory devices 112, one or more storage devices 114, and one or more processors 116, such as a central processing unit (CPU) or a graphics processing unit (GPU) .
  • Memory 112 may include one or more different types of memory, which may be used for performing device functions in conjunction with processor 116.
  • memory 112 may include cache, ROM, and/or RAM.
  • Memory 112 may store various programming modules during execution, including training module 104 and integrated spatio-temporal saliency module 106A.
  • Spatio-temporal saliency service 102 may store media files, media file data, saliency prediction data, neural network model training data, or the like. Additional data may be stored by spatio-temporal saliency service 102 including, but not limited to, media classification data, frame annotation data, and action label data. Spatio-temporal saliency service 102 may store this data in media store 118 within storage 114.
  • Storage 114 may include one or more physical storage devices. The physical storage devices may be located within a single location, or may be distributed across multiple locations, such as multiple servers.
  • media store 118 may include model training data for creating a dataset to train a predictive model using training module 104.
  • Model training data may include, for example, labeled training data that is used by training module 104 to train a machine learning (ML) model, such as integrated spatio-temporal saliency module 106A or integrated saliency module 106B.
  • a trained ML model may then be used, as described herein, to predict salient regions, frames, or portions of a media file.
  • training data may consist of data pairs of input and output data.
  • input data may include media items that have been processed along with output data, such as temporal annotations, action labels, or the like.
  • Temporal annotations and/or action labels may be objective or subjective.
  • temporal annotations and/or action labels may be obtained through detailed experimentation with human viewers.
  • output data may be obtained through detailed experimentation based on eye gaze of human viewers of video frames during playback of the media items.
  • An objective label may, for example, indicate a quality measure obtained by computing a video frame quality metric.
  • the video frame quality metric may be based, at least in part or in combination, on frame rate, resolution, or the like.
  • a subjective label may, for example, indicate a spatio-temporal saliency score that is assigned by a human annotator.
  • the memory 112 includes modules that include computer readable code executable by processor 116 to cause the spatio-temporal saliency service 102 to perform various tasks.
  • the memory 112 may include a training module 104 and an integrated spatio-temporal saliency module 106A.
  • the training module 104 generates and maintains a predictive model for integrated spatio-temporal saliency module 106A of spatio-temporal saliency service 102 and/or integrated saliency module 106B of client device 140.
  • the predictive model along with additional data pertaining to the model including, but not limited to, training data (e.g., spatial saliency training data) , classification data, label data, or the like may be stored on a local device, such as storage 114, on a remote storage device, such as storage 124 of client device 140, or a combination or variation thereof.
  • Memory 112 also includes an integrated spatio-temporal saliency module 106A.
  • the integrated spatio-temporal saliency module 106A may access a machine learning model (e.g., a predictive model generated by training module 104) that includes a video saliency tool used to detect interesting, or salient, frames within a media file.
  • the ML model may be accessed from storage 114 or from another storage location (e.g., remote storage location) over a network, such as network 150.
  • the video saliency tool may be used to detect spatio-temporal salient regions of video frames within a media file. Media files having detected spatio-temporal salient regions may be classified and labeled by the integrated spatio-temporal saliency module 106A.
  • Media files processed by the integrated spatio-temporal saliency module 106A may be stored in media store 118 of storage 114. Additionally, or alternatively, processed media files may be sent or streamed to another device, such as client device 140, over a network, such as network 150.
  • the integrated spatio-temporal saliency module may be provided on a viewer’s mobile device, such as integrated saliency module 106B on client device 140.
  • integrated saliency module 106B may include a video saliency tool to detect spatio-temporal saliency regions within video frames of a media file in the same, or a similar, manner as described above with respect to integrated spatio-temporal saliency module 106A.
  • the integrated saliency module 106B may process media files at time of capture, in real-time or near-real-time, or a combination or variation thereof, by client device 140.
  • a media file may be obtained from a media store, such as media store 130, or received from a remote storage location, such as media store 118, and provided as input to integrated saliency module 106B.
  • a media file may be processed, or provided as input, to integrated spatio-temporal saliency module 106A and integrated saliency module 106B.
  • modules 106A and 106B may both process the media file, either partially or in full.
  • Captured and processed content may be stored within media store 130 of storage 124 and played back to a viewer using media player 126.
  • a processed media file may be stored in a remote storage location, such as media store 118 of spatio-temporal saliency service 102, over a network, such as network 150.
  • Multifunction device 200 may include representative components of, for example, the devices of spatio-temporal saliency service 102 and client device 140 of FIG. 1.
  • Multifunction electronic device may include processor 205, display 210, user interface 215, graphics hardware 220, device sensors 225, communications circuitry 245, video codec (s) 255, memory 260, storage device 265, and communications bus 270.
  • the multifunction electronic device may be, for example, a personal computing device such as a personal digital assistant (PDA) , personal music player, mobile telephone, a tablet computer, a laptop, or the like.
  • Processor 205 may execute instructions necessary to control the operation of functions performed by the multifunction device 200 (e.g., such as the spatio-temporal saliency prediction methods as disclosed herein) .
  • Processor 205 may, for instance, drive display 210 and receive user input from user interface 215.
  • User interface 215 may allow a user to interact with device 200.
  • user interface 215 can take a variety of forms, such as a button, keypad, dial, a click wheel, keyboard, display screen and/or a touch screen.
  • Processor 205 may also, for example, be a system-on-chip such as those found in mobile devices and include a dedicated graphics processing unit (GPU) .
  • Processor 205 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores.
  • Graphics hardware 220 may be special purpose computational hardware for processing graphics and/or assisting processor 205 to process graphics information.
  • graphics hardware 220 may include a programmable GPU.
  • Memory 260 may include one or more different types of media used by processor 205 and graphics hardware 220 to perform device functions.
  • memory 260 may include memory cache, read-only memory (ROM) , and/or random-access memory (RAM) .
  • Storage 265 may store media (e.g., audio, image, and video files) , computer program instructions or software, preference information, device profile information, model training data and any other suitable data.
  • Storage 265 may include one or more non-transitory computer-readable storage mediums including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM) and Electrically Erasable Programmable Read-Only Memory (EEPROM).
  • Memory 260 and storage 265 may be used to tangibly retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 205 such computer program code may implement one or more of the methods described herein.
  • Framework 300 begins at Block 302 by an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving input frames, such as image frames of a media file described herein and above.
  • the media file includes three video frames (302 (a), 302 (b), 302 (c)); however, it is understood that the media file may contain more or fewer frames than in the example provided.
  • a media file may contain thousands of video frames (e.g., a ten minute video at 30 fps would contain 18,000 frames) .
  • the first frame may be provided at time t
  • the second frame may be provided at time t+1
  • the third frame may be provided at time t+2, and so on.
  • the media file may be captured by client device 140 or received from another device over a network, such as network 150.
  • each of the input frames of the media file may be processed by the framework.
  • each of the input frames is provided as input to a neural network.
  • a saliency mask at Block 306 is used to mask portions of the input frames that will not be used to classify the spatial saliency of the input frames.
  • the saliency mask may be an attention mask in the form of a binary mask that may be used to guide spatial information extracted from the frame.
  • a feature map of the input frames is generated.
  • a feature map may be generated by applying one or more filters or feature detectors to the input frames. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input.
  • Action labels, such as annotation labels or the like, may be assigned on a frame-by-frame basis using the spatial saliency network.
  • a recurrent neural network, such as a ConvLSTM (Convolutional Long Short-Term Memory) network, is used to predict spatio-temporally salient portions based on each of the input frames and one or more previous frames (a minimal sketch of this pipeline appears after this list).
  • Output from Block 310 is provided to Block 312 and Block 314.
  • output from Block 310 is used to perform an average pooling operation on a feature map produced by Block 310.
  • the temporal saliency of the input frames is predicted (see FIG. 4) and represented as a value ranging from zero (0) to one (1) , wherein, for example, a value of zero represents that frames are highly unlikely to contain salient content, and a value of one represents that particular frames are highly likely to contain salient content.
  • input frames having a predicted saliency value close to 0 may contain irrelevant information and therefore have little influence on the action label applied to the media file.
  • Input frames having a predicted saliency value closer to 1 would have a higher influence on the action label applied to the media file.
  • a feature map is produced by performing a weighted fusion operation on feature maps produced by Block 314 for the input frames.
  • a classifier sub-network is used to predict probability values for one or more action types being present in the media file (Block 320) .
  • an action classification label is determined for the whole media file.
  • the label may be a ground-truth label for the entire media file.
  • training of the unified framework 300 may involve minimizing a cross-entropy loss between the determined label (Block 322) for the media file and a “ground truth” action classification label for the media file.
  • the generated output (e.g., action labels, classification labels) may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
  • The output, in the form of an action label, may be used to train the temporal saliency sub-network to recognize what action is occurring in the media file.
  • Framework 400 includes, for example, a series of modules that may be used to predict the likelihood that video frames contain temporally salient regions.
  • Deep learning functions include, but are not limited to, fully connected (FC) layers at Blocks 402 and 406, a rectified linear unit (ReLU) activation function at Block 404, and a Sigmoid differentiable function at Block 408 (see the sub-network sketch after this list). Additional deep learning functions may be used by framework 400.
  • Framework 500 begins at Block 502 by an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving input frames, such as image frames of a media file described herein and above.
  • the media file includes three video frames (502 (a), 502 (b), 502 (c)); however, it is understood that the media file may contain more or fewer frames than in the example provided.
  • An input frame may include, for example, an image frame of a media or video file described herein and above.
  • the media file may be captured by client device 140 or received from another device over a network, such as network 150.
  • the input frame may be provided as input to neural network 504.
  • a saliency mask is used to mask portions of the input frame that will not be used to classify the spatial saliency of the input frame.
  • the saliency mask may be determined using supervision at Block 508. For example, direct supervision may be provided to the spatial saliency using human eye-fixation.
  • a feature map of the input frame is generated. For example, a feature map may be generated by applying one or more filters or feature detectors to the input frame. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input.
  • a recurrent neural network, such as a ConvLSTM (Convolutional Long Short-Term Memory) network, may be used to predict spatio-temporal saliency based on the input frame and previous frames. Output from Block 512 may be provided to Block 514 and Block 516. At Block 514, output from Block 512 may be used to generate an average pool. At Block 516, the temporal saliency of the input frame may be predicted (see FIG. 4) and represented as a value ranging from zero (0) to one (1).
  • weighted fusion of the input frame may be predicted by multiplying the average pool value (Block 514) by the temporal saliency at Block 516.
  • a classifier may be used to predict action recognition based on probability (Block 524) and a label may be applied (Block 526) to the entire media file.
  • the label may be a ground-truth label for the entire media file.
  • training of the unified framework 500 may involve minimizing a cross-entropy loss.
  • supervision may also be applied to predicting the temporal saliency of the input frame at Block 518. For example, a binary label may be applied to each input frame indicating if the current input frame is, or is not, a key frame.
  • the generated output (e.g., action labels, classification labels) may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
  • The output, in the form of an action label, may be used to train the temporal saliency sub-network to recognize what action is occurring in the media file.
  • the process resumes at Block 502 for the next available input frame of the media file. If the media file does not contain additional frames (NO) , the process ends. Once the process ends, the results of the process may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
  • Framework 600 may begin at Block 602 by an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving an input frame at Block 602.
  • the media file includes three video frames (602 (a), 602 (b), 602 (c)); however, it is understood that the media file may contain more or fewer frames than in the example provided.
  • An input frame may include, for example, an image frame of a media or video file described herein and above.
  • the media file may be captured by client device 140 or received from another device over a network, such as network 150.
  • the input frame may be provided as input to a neural network at Block 604, such as a convolutional neural network (CNN).
  • a saliency mask (Block 606) may be used to mask portions of the input frame that will not be used to classify the spatial saliency of the input frame.
  • the spatial saliency may be trained using supervision from eye fixation and action labels.
  • Kullback-Leibler (KL) -divergence and a linear correlation coefficient may be used to measure the distance between a predicted saliency map and an eye fixation heat map.
  • The KL-divergence and linear correlation coefficient terms may be minimized to make the distributions of the predicted saliency map and the eye fixation heat map consistent (a sketch of these loss terms appears after this list).
  • a feature map of the input frame is generated.
  • a feature map may be generated by applying one or more filters or feature detectors to the input frame. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input.
  • a recurrent neural network, such as a ConvLSTM (Convolutional Long Short-Term Memory) network, may be used to predict spatio-temporal saliency based on the input frame and previous frames. Output from Block 612 may be provided to Block 614. At Block 614, output from Block 612 may be used to generate an average pool.
  • weighted fusion of the input frame may be predicted based on the average pool value (Block 614) .
  • a classifier may be used to predict action recognition based on probability (Block 620) and a label may be applied (Block 622) to the input frame.
  • training of the unified framework 600 may involve minimizing a cross-entropy loss.
  • the process resumes at Block 602 for the next available input frame of the media file. If the media file does not contain additional frames (NO) , the process ends. Once the process ends, the results of the process may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
  • Framework 700 shows at Block 702 an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving an input frame, such as an image frame of a media file described herein and above.
  • the media file includes three video frames (702 (a), 702 (b), 702 (c)); however, it is understood that the media file may contain more or fewer frames than in the example provided.
  • An input frame may include, for example, an image frame of a media or video file described herein and above.
  • the media file may be captured by client device 140 or received from another device over a network, such as network 150.
  • the input frame may be provided as input to a neural network, such as a convolutional neural network (CNN).
  • a saliency mask at Block 706 may be used to mask portions of the input frame that will not be used to classify the spatial saliency of the input frame, as described herein.
  • a feature map of the input frame may be generated.
  • a feature map may be generated by applying one or more filters or feature detectors to the input frame.
  • a feature map may be generated based on prior video frames used as input.
  • the network parameters for predicting spatial saliency may be frozen, as indicated by Block 712, based on a first stage of training, such as described above with reference to FIG. 6.
  • a second stage of training may be performed based on temporal saliency, wherein the network is trained with supervision from frame-level annotation and action labels.
  • a recurrent neural network, such as a ConvLSTM (Convolutional Long Short-Term Memory) network, may be used to predict spatio-temporal saliency based on the input frame and previous frames. Output from Block 714 may be provided to Block 716 and Block 718. At Block 716, output from Block 714 may be used to generate an average pool.
  • the temporal saliency of the input frame may be predicted using a sub-network (see FIG. 4) and represented as a value ranging from zero (0) to one (1) .
  • Temporal saliency of the input frame may be trained with supervision from frame-level annotation and action labels. In some embodiments, L1-loss may be used to regress the temporal saliency to the frame-level label.
  • weighted fusion of the input frame may be predicted by multiplying the average pool value (Block 716) by the temporal saliency (Block 718) .
  • a classifier may be used to predict action recognition based on probability (Block 724) and a label may be applied (Block 726) to the input frame.
  • training of the framework 700 may involve minimizing a cross-entropy loss.
  • the process resumes at Block 702 for the next available input frame of the media file. If the media file does not contain additional frames (NO) , the process ends. Once the process ends, the results of the process may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
  • an example network architecture 800 of a neural network architecture for inference of spatio-temporal saliency at a device is shown, according to one or more embodiments.
  • an input media file at Block 802 that includes one or more image frames may be provided to a neural network at Block 804.
  • the neural network may include a unified framework for providing spatio-temporal saliency by combining the spatial saliency output at Block 820 with the output from the neural network at Block 822 to determine the spatio-temporal saliency of the media file, as described herein and above.
  • the example network architecture 800 may include, for example, a plurality of layers at Blocks 806 (a) -806 (d) for down-sampling and up-sampling, layers for Post NN, US1 and US2 at Blocks 810, 812, and 814, respectively, and skip functions at Block 808 (a) and 808 (b) .
  • the network architecture 800 may also include a Conv2D class at Block 816 as part of the neural network as well as a Sigmoid function at Block 818.
  • the network architecture 800 includes, at Block 824, a recurrent neural network, ConvLSTM, for providing spatio-temporal prediction output.
  • Additional functions for predicting salient regions of a media file may be provided including, but not limited to, Post NN at Block 826, Average Pool at Block 828, FC at Blocks 830 and 834, ReLU at Block 832, Sigmoid at Block 836, and output at Block 838.
  • FIG. 9 shows, in flow chart form, an example method for predicting spatio-temporal saliency of a media item.
  • the method may be implemented by integrated spatio-temporal saliency module 106A, integrated spatio-temporal saliency module 106B, or combinations or variations thereof.
  • the method may be implemented on a server device, such as spatio-temporal saliency service 102, or on a client device, such as client device 140 of FIG. 1.
  • the following steps will be described in the context of FIG. 1.
  • the various actions may be taken by alternate components.
  • the various actions may be performed in a different order. Further, some actions may be performed simultaneously, some may not be required, and others may be added.
  • the flow chart begins at Block 905 with the capturing of image content by a device, such as a device equipped with one or more cameras, image sensors, or the like.
  • Image content may include video data comprised of a plurality of video frames, for example. Additionally, or alternatively, image content may be retrieved from a storage device, such as the device’s memory, or a remote storage device, such as a memory of another device or a cloud computing device over a network.
  • video frames of the image content having spatio-temporal salient regions may be identified.
  • a saliency score may be assigned to the image content indicating a likelihood that the image content includes salient regions.
  • image content may be analyzed in, or near, real-time (e.g., during capture or very soon after time of capture) .
  • the flow chart continues at Block 915 with the prediction of an event occurring within the salient regions of the one or more frames of the video segment.
  • a neural network may be used to predict the event.
  • one or more action labels may be assigned to the video segment. The one or more action labels may be based, at least in part, on the event predicted by the neural network.
  • video frames of the image content that show meaningful action changes may be encoded at a variable frame rate.
  • video frames that include salient regions may be played back at a lower compression rate and/or at an increased frame rate in relation to frame (s) of the image content that include non-salient regions.
  • a processor or a processing element may be trained using supervised machine learning and/or unsupervised machine learning, and the machine learning may employ an artificial neural network, which, for example, may be a convolutional neural network, a recurrent neural network, a deep learning neural network, a reinforcement learning module or program, or a combined learning module or program that learns in two or more fields or areas of interest.
  • Machine learning may involve identifying and recognizing patterns in existing data in order to facilitate making predictions for subsequent data. Models may be created based upon example inputs in order to make valid and reliable predictions for novel inputs.
  • machine learning programs may be trained by inputting sample data sets or certain data into the programs, such as images, object statistics and information, historical estimates, and/or image/video/audio classification data.
  • the machine learning programs may utilize deep learning algorithms that may be primarily focused on pattern recognition and may be trained after processing multiple examples.
  • the machine learning programs may include Bayesian Program Learning (BPL) , voice recognition and synthesis, image or object recognition, optical character recognition, and/or natural language processing.
  • the machine learning programs may also include natural language processing, semantic analysis, automatic reasoning, and/or other types of machine learning.
  • supervised machine learning techniques and/or unsupervised machine learning techniques may be used.
  • a processing element may be provided with example inputs and their associated outputs and may seek to discover a general rule that maps inputs to outputs, so that when subsequent novel inputs are provided the processing element may, based upon the discovered rule, accurately predict the correct output.
  • In unsupervised machine learning, the processing element may need to find its own structure in unlabeled example inputs.
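
To make the temporal saliency sub-network of FIG. 4 (the FC, ReLU, FC, and Sigmoid blocks referenced above) concrete, the following is a minimal sketch in PyTorch. The class name, layer sizes, and input feature dimension are illustrative assumptions and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class TemporalSaliencyHead(nn.Module):
    """FC -> ReLU -> FC -> Sigmoid, mirroring Blocks 402-408 of FIG. 4."""

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(feature_dim, hidden_dim)  # Block 402 (FC)
        self.relu = nn.ReLU()                          # Block 404 (ReLU)
        self.fc2 = nn.Linear(hidden_dim, 1)            # Block 406 (FC)
        self.sigmoid = nn.Sigmoid()                    # Block 408 (Sigmoid)

    def forward(self, pooled_features: torch.Tensor) -> torch.Tensor:
        # pooled_features: (num_frames, feature_dim), e.g. from average pooling.
        x = self.relu(self.fc1(pooled_features))
        return self.sigmoid(self.fc2(x))               # (num_frames, 1), values in [0, 1]

# Example: per-frame saliency scores for a 3-frame clip.
head = TemporalSaliencyHead()
scores = head(torch.randn(3, 512))                     # shape (3, 1)
```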
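
The overall per-frame pipeline sketched for FIGs. 3 and 8 (a convolutional backbone with a spatial saliency mask, ConvLSTM temporal aggregation, average pooling, a temporal saliency score, weighted fusion, and an action classifier) might look roughly as follows. The toy backbone, channel sizes, and number of action classes are assumptions made only for illustration; PyTorch does not provide a built-in ConvLSTM, so a compact cell is defined inline.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A compact ConvLSTM cell; PyTorch does not ship one."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class SpatioTemporalSaliencyNet(nn.Module):
    def __init__(self, feat_ch: int = 32, num_actions: int = 10):
        super().__init__()
        # Toy stand-in for the per-frame CNN backbone (e.g., Block 304/804).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Conv2d(feat_ch, 1, 1)          # spatial saliency mask
        self.convlstm = ConvLSTMCell(feat_ch, feat_ch)     # temporal aggregation
        self.temporal_head = nn.Sequential(                # FC -> ReLU -> FC -> Sigmoid (FIG. 4)
            nn.Linear(feat_ch, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
        self.classifier = nn.Linear(feat_ch, num_actions)  # action recognition head

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, H, W)
        b, t, _, h, w = frames.shape
        hid = frames.new_zeros(b, self.convlstm.hid_ch, h // 4, w // 4)
        cell = torch.zeros_like(hid)
        masks, scores, fused = [], [], 0.0
        for step in range(t):
            feat = self.backbone(frames[:, step])              # per-frame feature map
            mask = torch.sigmoid(self.mask_head(feat))         # spatial saliency in [0, 1]
            hid, cell = self.convlstm(feat * mask, hid, cell)  # mask-guided recurrent state
            pooled = hid.mean(dim=(2, 3))                      # average pooling
            score = self.temporal_head(pooled)                 # temporal saliency in [0, 1]
            fused = fused + score * pooled                     # weighted fusion over frames
            masks.append(mask)
            scores.append(score)
        logits = self.classifier(fused)                        # clip-level action logits
        return torch.stack(masks, dim=1), torch.cat(scores, dim=1), logits

# Example: a 3-frame, 64x64 clip.
net = SpatioTemporalSaliencyNet()
masks, frame_scores, action_logits = net(torch.randn(1, 3, 3, 64, 64))
```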
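
The supervision terms described above (KL-divergence and a linear correlation coefficient against an eye-fixation heat map for spatial saliency, L1 regression of the temporal score to frame-level key-frame labels, and cross-entropy against the clip-level action label) could be combined as in the sketch below. The loss weights and normalization details are assumptions, and the freezing helper for the multi-stage strategy of FIG. 7 refers to the hypothetical network sketched above rather than to specifics from the disclosure.

```python
import torch
import torch.nn.functional as F

def correlation_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8):
    """Linear (Pearson) correlation between flattened saliency maps, per sample."""
    p = pred.flatten(1) - pred.flatten(1).mean(dim=1, keepdim=True)
    t = target.flatten(1) - target.flatten(1).mean(dim=1, keepdim=True)
    return (p * t).sum(dim=1) / (p.norm(dim=1) * t.norm(dim=1) + eps)

def spatial_saliency_loss(pred_map: torch.Tensor, fixation_map: torch.Tensor, eps: float = 1e-8):
    """KL-divergence plus (1 - correlation) between normalized maps."""
    p = pred_map.flatten(1) / (pred_map.flatten(1).sum(dim=1, keepdim=True) + eps)
    q = fixation_map.flatten(1) / (fixation_map.flatten(1).sum(dim=1, keepdim=True) + eps)
    kl = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=1)
    cc = correlation_coefficient(pred_map, fixation_map, eps)
    return (kl + (1.0 - cc)).mean()

def total_loss(masks, frame_scores, action_logits,
               fixation_maps, frame_labels, action_label,
               w_spatial: float = 1.0, w_temporal: float = 1.0):
    # masks, fixation_maps: (batch, time, 1, H, W); frame_scores, frame_labels: (batch, time);
    # action_label: (batch,) class indices.
    spatial = spatial_saliency_loss(masks.flatten(0, 1), fixation_maps.flatten(0, 1))
    temporal = F.l1_loss(frame_scores, frame_labels)       # regress to frame-level key-frame labels
    action = F.cross_entropy(action_logits, action_label)  # clip-level ground-truth action label
    return w_spatial * spatial + w_temporal * temporal + action

def freeze_spatial(net) -> None:
    """Multi-stage strategy (FIG. 7): freeze the spatial branch of the sketch above."""
    for module in (net.backbone, net.mask_head):
        for p in module.parameters():
            p.requires_grad = False
```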

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)

Abstract

A multi-stage training process is used to train a neural network to identify regions of spatio-temporal saliency in video frames using supervised eye fixation, action labels and frame annotation. A first stage is used to train the neural network to detect spatially salient regions in video frames and a second stage is used to train the neural network to detect temporally salient regions in the video frames. Events occurring within the video frames are predicted based on the identified regions. Frames of the video segment that have salient regions may be played back at a lower compression rate and/or at an increased frame rate in relation to frames of the video segment having non-salient regions.

Description

Spatio-Temporal Video Saliency Analysis
FIELD OF THE INVENTION
This disclosure relates to video saliency and more specifically to performing spatio-temporal saliency analysis in videos.
BACKGROUND
Most modern smartphones include integrated digital camera technology. With the advancement of technology, digital cameras integrated with smartphones rival the capabilities provided by stand-alone digital cameras. As more smartphones are used for capturing video, the demand for resources increases, especially as the quality of videos produced by integrated digital cameras improves. Accordingly, it is increasingly desirable to more intelligently analyze and/or categorize video segments of captured videos.
BRIEF SUMMARY
The present embodiments may relate to, inter alia, systems and methods for providing a unified spatio-temporal framework for capturing and predicting salient moments in a media file, such as a video clip. In some embodiments, the unified framework is used to predict spatio-temporal saliency of a media file, such as a video file, clip, segment, or the like, during capture by an image sensor, such as a camera. The captured media file may be processed using file compression techniques based on the predictions made. For example, predictions made by the unified framework may affect the video compression rate of different video frames within the same media file. Additionally, during playback of the media file, the frame rate may be adjusted based on the predictions made by the unified framework.
In some embodiments of the present disclosure, spatial saliency prediction and detection provides identification of regions of interest within each frame of a media file. Regions of interest in a video may include portions of video frames where action is taking place, for example. Temporal saliency detection may provide identification of interesting video frames over time, such as when something in particular is occurring during a media file. Additionally, the unified spatio-temporal framework predicts salient action(s) occurring within the media file and, for example, the types of salient actions and/or where such salient actions are represented within individual frames of the media file. The systems and methods described may cause additional, fewer, or alternate actions, including those discussed elsewhere herein.
Various non-transitory computer readable media embodiments are disclosed herein. Such computer readable media are readable by one or more processors. Instructions may be stored on the computer readable media for causing the one or more processors to perform any of the techniques disclosed herein.
Various programmable electronic devices are also disclosed herein, in accordance with the program storage device embodiments enumerated above. Such electronic devices may include one or more image capture devices, such as optical image sensors/camera units; a display; a user interface; one or more processors; and a memory coupled to the one or more processors. Instructions may be stored in the memory, the instructions causing the one or more processors to execute instructions in accordance with the various techniques disclosed herein.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows, in block diagram form, a simplified network diagram, according to one or more embodiments.
FIG. 2 shows, in block diagram form, a simplified diagram for an electronic device, in accordance with one or more embodiments.
FIG. 3 shows an example framework for determining spatio-temporal saliency for a video segment using a network trained to perform action recognition, according to one or more embodiments.
FIG. 4 shows an example framework including a plurality of components of a sub-network for performing temporal saliency analysis on video data, according to one or more embodiments.
FIG. 5 shows an example framework for training a network to determine spatio-temporal saliency for a video segment using a multi-task learning approach, according to one or more embodiments.
FIG. 6 shows an example framework for training a network to determine spatio-temporal saliency for a video segment using a single-stage training strategy, according to one or more embodiments.
FIG. 7 shows an example framework for training a network to determine spatio-temporal saliency for a video segment using a multi-stage training strategy, according to one or more embodiments.
FIG. 8 shows an example network architecture and framework for determining spatio-temporal saliency for a video segment, according to one or more embodiments.
FIG. 9 shows, in flow chart form, an example method for performing spatio-temporal saliency analysis for a video segment, according to one or more embodiments.
DETAILED DESCRIPTION
The following disclosure relates to technical improvements to detecting salient portions of media files, e.g., video segments, using a unified framework that performs spatio-temporal saliency analysis via the use of machine learning (ML) and/or artificial intelligence (AI) -based approaches. According to aspects of the present disclosure, systems, methods, and computer readable media for performing spatio-temporal saliency analysis of media files are provided. Media files may be captured, for example, by a user’s mobile device. For example, media files may be captured using one or more camera inputs, one or more microphone inputs, or a combination thereof of the user’s mobile device. Additionally, or alternatively, media files may be sent to the user’s mobile device, such as by another mobile device or a media server, over a network, for example. In some embodiments, the mobile device may be a user computing device, a tablet computing device, or the like.
According to one or more embodiments, the disclosed technology may include one or more software modules embodied on a user’s mobile device. The one or more software modules may include instructions to perform a spatio-temporal saliency analysis on media files. For example, a media file captured by the user’s mobile device may be received by the one or more software modules as input. Using a predictive model built using artificial intelligence and/or machine learning-based techniques, the one or more software modules may provide, as output, indications (e.g., tags, labels, etc.) of salient portions (e.g., video frames) of the media file. In some embodiments, the predictive model may also predict particular types of actions that are taking place within particular portions of the media file. For example, while capturing video during a swim/dive meet, the unified spatio-temporal framework may predict not only when a salient action (e.g., a dive) is taking place (e.g., a temporal indication, such as a timestamp, within the captured video when a dive is occurring), but also where in the video segment, or video frame, that action is taking place. In some embodiments, the neural network may be trained to identify specific types of actions taking place within salient portions of the video segment (e.g., a forward dive, a twisting dive, a backward dive, etc.).
According to one or more embodiments, the disclosed technology addresses the need in the art to predict and identify interesting, or salient, portions of a media file, for example, by predicting video frames of a media file that viewers would find interesting. Interesting video frames may contain, for example, action sequences (e.g., fight scenes, chase scenes, battle scenes), human expressions (e.g., a change in facial expression, an act of surprise) and/or interactions (e.g., talking, kissing), animal engagement (e.g., nature scenes, animal chases), or the like. Predicting and identifying salient portions of a media file provides many technical advantages, such as enabling a viewer to play back only the salient portions of the media file, not only saving the viewer’s time but also reducing resource usage for the viewer’s device that is providing playback of the media file. For example, the viewer may choose to view only the salient portions of a media file that is five minutes in length. In this example, the media file may be of a track and field event, and the viewer may only want to see the actual event (e.g., a race) and not sit through other aspects of the media file (e.g., race participant introductions, pre-race commentary, etc.).
In some embodiments, video playback of a media file is improved by compressing the media file and reducing the amount of computer resources needed to provide video playback. In this example, portions of the media file that do not include salient portions as predicted and identified by the unified spatio-temporal framework described herein may undergo image data compression. Image data compression reduces the size of the media file, causing the media file to require less space, such as on a storage device or memory of a viewer’s mobile device or on another  storage option (e.g., cloud storage, remote media server) . Additionally, or alternatively, a compressed media file, when streamed from a remote storage option to a viewer’s mobile device, will require less network resources (e.g., bandwidth) . Continuing with the example, portions of the media file that do include salient portions would not undergo any type of image or video compression. By not compressing video frames that include salient regions, playback of the uncompressed video frames may be provided at the original resolution. For example, if the media file was captured in 1080p, during playback of the media file, video frames that include salient regions may be played back in 1080p and video frames that do not include salient regions may be played back at a lesser resolution, such as 720p or less. Additionally, or alternatively, the entire media file may undergo video compression, however video frames that include salient regions may be compressed at a lower rate when compared to video frames that lack salient regions.
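Purely as an illustrative sketch of this compression behavior, the snippet below encodes each frame at a JPEG quality selected from a per-frame saliency score using OpenCV. The threshold and quality values are assumptions, and JPEG encoding merely stands in for whatever image or video codec an implementation would use.

```python
import cv2
import numpy as np

def compress_frames(frames, saliency_scores, threshold=0.5,
                    high_quality=95, low_quality=50):
    """Encode salient frames at a higher JPEG quality than non-salient frames."""
    encoded = []
    for frame, score in zip(frames, saliency_scores):
        quality = high_quality if score >= threshold else low_quality
        ok, buf = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
        if ok:
            encoded.append(buf.tobytes())
    return encoded

# Three synthetic 1080p frames with assumed saliency scores.
frames = [np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8) for _ in range(3)]
compressed = compress_frames(frames, saliency_scores=[0.9, 0.2, 0.7])
```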
In some embodiments, video playback of a media file is improved by adjusting the frame rate during playback of the media file by a viewer’s mobile device. Adjusting the frame rate during playback may reduce the amount of computer resources needed to provide video playback by the viewer’s mobile device. In this example, video frames of the media file that include salient portions as predicted and identified by the unified spatio-temporal framework described herein may be played back at a higher frame rate when compared to video frames of the media file that lack salient portions during playback. By reducing the frame rate of what are deemed less salient video frames during playback, the burden on system resources may be reduced. For example, during media file playback, less is required of a processor, such as a video processor, when the frame rate is reduced. In this example, video frames that contain salient regions may be displayed at a first frame rate and video frames that lack salient regions may be displayed at a second frame rate, where the first frame rate is higher than the second frame rate. For example, the first frame rate may be 60 fps, which is suitable for 4K video resolution, and the second frame rate may be 24 fps, which is suitable for streaming video content. Other frame rates may be used, such as 30 fps, which is suitable for live TV broadcasts (e.g., sports, news), and 120 fps, which is considered suitable for slow-motion video and video games.
In some embodiments, video playback of a media file is improved by adjusting a video compression rate of the media file and adjusting the frame rate of the media file during playback. For example, based on results of the unified spatio-temporal framework described herein, video frames of the media file identified to include salient regions may be compressed at a first compression rate and played back at a first frame rate. Additionally, video frames of the media file that lack salient regions may be compressed at a second compression rate and played back at a second frame rate. In this example, the first compression rate is less than the second compression rate. Further, the first frame rate is greater than the second frame rate. By adjusting video compression rates and playback frame rates, computer and network resource usage may be reduced. Additionally, in this example, video frames that include salient regions (e.g., interesting moments in a video) , may be played back to a viewer in an unaltered state.
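A compact, assumption-laden sketch of such a playback policy is shown below; the threshold, frame rates, and compression levels are illustrative placeholders only, not values from the disclosure.

```python
def playback_policy(saliency_score, threshold=0.5):
    """Return an assumed (compression level, frame rate) pair for one frame."""
    if saliency_score >= threshold:
        return {"compression": "low", "fps": 60}   # salient: near-original quality
    return {"compression": "high", "fps": 24}      # non-salient: save resources

settings = [playback_policy(s) for s in (0.9, 0.3, 0.7)]
```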
In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the disclosed concepts. As part of this description, some of this disclosure’s drawings represent structures and devices in block diagram form to avoid obscuring the novel aspects of the disclosed embodiments. In this context, references to numbered drawing elements without associated identifiers (e.g., 100) refer to all instances of the drawing element with identifiers (e.g., 100a and 100b) . Further, this disclosure’s drawings may be provided in the form of a flow diagram. The boxes in any of the flow charts may be presented in a particular order. However, the flow of any flow diagram is used only to exemplify one embodiment. In other embodiments, any of the various components depicted in the flow chart may be deleted, or the components may be performed in a different order, or even concurrently. In addition, other embodiments may include additional steps not depicted as part of the flow chart. The language used in this disclosure has been principally selected for readability and instructional purposes and may not have been selected to delineate or circumscribe the disclosed subject matter. Reference in this disclosure to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment, and multiple references to “one embodiment” or to “an embodiment” should not be understood as necessarily all referring to the same embodiment or to different embodiments.
It should be appreciated that in the development of any actual implementation (as in any development project) , decisions must be made to achieve the developers’ specific goals  (e.g., compliance with system and business-related constraints) , and that these goals will vary from one implementation to another. It will also be appreciated that such development efforts might be complex and time consuming but would nevertheless be a routine undertaking for those of ordinary skill in the art of image capture having the benefit of this disclosure.
For purposes of this disclosure, images captured by a camera device are referred to as “media files. ” However, in one or more embodiments, the captured images referred to as “media files” could be any audiovisual media data, such as video clips, audio snippets, movies, song files, music videos, or the like.
Referring now to FIG. 1, a simplified block diagram is depicted of a unified spatio-temporal service and framework 100 including a spatio-temporal saliency service 102 connected to a client device 140 over a network, such as network 150. Client device 140 may be a personal computer or multifunctional device, such as a cell phone, tablet computer, personal digital assistant, portable music/video player, wearable device, or any other electronic device that includes a media playback system.
Spatio-temporal saliency service 102 may include one or more servers or other computing or storage devices on which the various modules and storage devices may be contained. Although spatio-temporal saliency service 102 is depicted as comprising various components in an exemplary manner, in one or more embodiments, the various components and functionality may be distributed across multiple network devices, such as servers, network storage, and the like. Further, additional components may be used, or some of the functionality of any of the components may be combined. Generally, spatio-temporal saliency service 102 may include one or more memory devices 112, one or more storage devices 114, and one or more processors 116, such as a central processing unit (CPU) or a graphics processing unit (GPU). Further, processor 116 may include multiple processors of the same or different type. Memory 112 may include one or more different types of memory, which may be used for performing device functions in conjunction with processor 116. For example, memory 112 may include cache, ROM, and/or RAM. Memory 112 may store various programming modules during execution, including training module 104 and integrated spatio-temporal saliency module 106A.
Spatio-temporal saliency service 102 may store media files, media file data, saliency prediction data, neural network model training data, or the like. Additional data may be stored by spatio-temporal saliency service 102 including, but not limited to, media classification data, frame annotation data, and action label data. Spatio-temporal saliency service 102 may store this data in media store 118 within storage 114. Storage 114 may include one or more physical storage devices. The physical storage devices may be located within a single location, or may be distributed across multiple locations, such as multiple servers.
In another embodiment, media store 118 may include model training data for creating a dataset to train a predictive model using training module 104. Model training data may include, for example, labeled training data that is used by training module 104 to train a machine learning (ML) model, such as integrated spatio-temporal saliency module 106A or integrated saliency module 106B. A trained ML model may then be used, as described herein, to predict salient regions, frames, or portions of a media file. In some embodiments, training data may consist of data pairs of input and output data. For example, input data may include media items that have been processed along with output data, such as temporal annotations, action labels, or the like. Temporal annotations and/or action labels may be objective or subjective. Additionally, temporal annotations and/or action labels may be obtained through detailed experimentation with human viewers. In another embodiment, output data may be obtained through detailed experimentation  based on eye gaze of human viewers of video frames during playback of the media items. An objective label may, for example, indicate a quality measure obtained by computing a video frame quality metric. The video frame quality metric may be based, at least in part or in combination, on frame rate, resolution, or the like. A subjective label may, for example, indicate a spatio-temporal saliency score that is assigned by a human annotator.
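By way of illustration only, the training pairs described above could be organized as in the following PyTorch Dataset sketch; the tensor shapes, key-frame flags, and class indices are assumptions, not the disclosed training data format.

```python
import torch
from torch.utils.data import Dataset

class SaliencyTrainingSet(Dataset):
    """Pairs each clip with per-frame temporal annotations and a clip-level action label."""

    def __init__(self, clips, frame_labels, action_labels):
        # clips: list of float tensors shaped (T, 3, H, W)
        # frame_labels: list of float tensors shaped (T,) holding 0/1 key-frame flags
        # action_labels: list of integer class indices for whole clips
        self.clips = clips
        self.frame_labels = frame_labels
        self.action_labels = action_labels

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        return (self.clips[idx],
                self.frame_labels[idx],
                torch.tensor(self.action_labels[idx]))

# One 8-frame clip with assumed annotations and an assumed action class index.
dataset = SaliencyTrainingSet([torch.randn(8, 3, 224, 224)],
                              [torch.tensor([0., 0., 1., 1., 1., 0., 0., 0.])],
                              [2])
clip, key_frames, action = dataset[0]
```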
Returning to the spatio-temporal saliency service 102, the memory 112 includes modules that include computer readable code executable by processor 116 to cause the spatio-temporal saliency service 102 to perform various tasks. As depicted, the memory 112 may include a training module 104 and an integrated spatio-temporal saliency module 106A. According to one or more embodiments, the training module 104 generates and maintains a predictive model for integrated spatio-temporal saliency module 106A of spatio-temporal saliency service 102 and/or integrated saliency module 106B of client device 140. The predictive model, along with additional data pertaining to the model including, but not limited to, training data (e.g., spatial saliency training data) , classification data, label data, or the like may be stored on a local device, such as storage 114, on a remote storage device, such as storage 124 of client device 140, or a combination or variation thereof.
Memory 112 also includes an integrated spatio-temporal saliency module 106A. In one or more embodiments, the integrated spatio-temporal saliency module 106A may access a machine learning model (e.g., a predictive model generated by training module 104) that includes a video saliency tool used to detect interesting, or salient, frames within a media file. The ML model may be accessed from storage 114 or from another storage location (e.g., remote storage location) over a network, such as network 150. For example, the video saliency tool may be used to detect spatio-temporal salient regions of video frames within a media file. Media files having  detected spatio-temporal salient regions may be classified and labeled by the integrated spatio-temporal saliency module 106A. Media files processed by the integrated spatio-temporal saliency module 106A may be stored in media store 118 of storage 114. Additionally, or alternatively, processed media files may be sent or streamed to another device, such as client device 140, over a network, such as network 150.
In some embodiments, the integrated spatio-temporal saliency module may be provided on a viewer’s mobile device, such as integrated saliency module 106B on client device 140. In this example, integrated saliency module 106B may include a video saliency tool to detect spatio-temporal saliency regions within video frames of a media file in the same, or at least a similar, manner described above with respect to integrated spatio-temporal saliency module 106A. The integrated saliency module 106B may process media files at time of capture, in real-time or near-real-time, or a combination or variation thereof, by client device 140, for example, while the client device 140 is being used to capture and record an event or a scene using sensor inputs (e.g., microphone(s) and/or camera(s)) of the client device 140. Alternatively, a media file may be obtained from a media store, such as media store 130, or received from a remote storage location, such as media store 118, and provided as input to integrated saliency module 106B. In some embodiments, a media file may be processed by, or provided as input to, both integrated spatio-temporal saliency module 106A and integrated saliency module 106B. In this example, modules 106A and 106B may both process the media file, either partially or in full. Captured and processed content may be stored within media store 130 of storage 124 and played back to a viewer using media player 126. Alternatively, a processed media file may be stored in a remote storage location, such as media store 118 of spatio-temporal saliency service 102, over a network, such as network 150.
Referring now to FIG. 2, a simplified functional block diagram of an illustrative multifunction device 200 is shown according to one embodiment. Multifunction device 200 may show representative components, for example, for devices of spatio-temporal saliency service 102 and client device 140 of FIG. 1. Multifunction electronic device 200 may include processor 205, display 210, user interface 215, graphics hardware 220, device sensors 225, communications circuitry 245, video codec(s) 255, memory 260, storage device 265, and communications bus 270. The multifunction electronic device may be, for example, a personal computing device such as a personal digital assistant (PDA), personal music player, mobile telephone, a tablet computer, a laptop, or the like.
Processor 205 may execute instructions necessary to control the operation of functions performed by the multifunction device 200 (e.g., such as the spatio-temporal saliency prediction methods as disclosed herein) . Processor 205 may, for instance, drive display 210 and receive user input from user interface 215. User interface 215 may allow a user to interact with device 200. For example, user interface 215 can take a variety of forms, such as a button, keypad, dial, a click wheel, keyboard, display screen and/or a touch screen. Processor 205 may also, for example, be a system-on-chip such as those found in mobile devices and include a dedicated graphics processing unit (GPU) . Processor 205 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores. Graphics hardware 220 may be special purpose computational hardware for processing graphics and/or assisting processor 205 to process graphics information. In one embodiment, graphics hardware 220 may include a programmable GPU.
Memory 260 may include one or more different types of media used by processor 205 and graphics hardware 220 to perform device functions. For example, memory 260 may include memory cache, read-only memory (ROM), and/or random-access memory (RAM). Storage 265 may store media (e.g., audio, image, and video files), computer program instructions or software, preference information, device profile information, model training data, and any other suitable data. Storage 265 may include one or more non-transitory computer-readable storage mediums including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM) and Electrically Erasable Programmable Read-Only Memory (EEPROM). Memory 260 and storage 265 may be used to tangibly retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 205, such computer program code may implement one or more of the methods described herein.
DETERMINING SPATIO-TEMPORAL SALIENCY FOR A VIDEO SEGMENT USING A NETWORK TRAINED TO PERFORM ACTION RECOGNITION
Turning now to FIG. 3, a framework 300 for determining spatio-temporal saliency for a video segment using a network trained to perform action recognition is shown, according to one or more embodiments. Framework 300 begins at Block 302 by an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving input frames, such as image frames of a media file described herein and above. In this example, the media file includes three video frames (302 (a), 302 (b), 302 (c)), however it is understood that the media file may contain more or fewer frames than in the example provided. For example, a media file may contain thousands of video frames (e.g., a ten-minute video at 30 fps would contain 18,000 frames). In this example, the first frame may be provided at time t, the second frame may be provided at time t+1, the third frame may be provided at time t+2, and so on. As described above, the media file may be captured by client device 140 or received from another device over a network, such as network 150. As shown, each of the input frames of the media file may be processed by the framework. At Block 304, each of the input frames is provided as input to a neural network. Using the neural network, a saliency mask at Block 306 is used to mask portions of the input frames that will not be used to classify the spatial saliency of the input frames. For example, the saliency mask may be an attention mask in the form of a binary mask that may be used to guide spatial information extracted from the frame. At Block 308, a feature map of the input frames is generated. For example, a feature map may be generated by applying one or more filters or feature detectors to the input frames. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input. Using the spatial saliency network, action labels, such as annotation labels or the like, may be assigned on a frame-by-frame basis.
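By way of illustration only, the following minimal PyTorch sketch shows one way an attention-style saliency mask could gate the spatial features extracted from a frame, in the spirit of Blocks 304 through 308. The layer sizes, channel counts, and module names are assumptions for the sketch and do not reproduce the actual network of FIG. 3.

```python
import torch
import torch.nn as nn

class SpatialSaliencyBranch(nn.Module):
    """Toy backbone producing a feature map and a sigmoid saliency mask,
    then gating the features with the mask (elementwise product)."""

    def __init__(self, channels=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, frame):                      # frame: (B, 3, H, W)
        features = self.backbone(frame)            # (B, C, H/4, W/4)
        mask = self.mask_head(features)            # (B, 1, H/4, W/4), values in [0, 1]
        return features * mask, mask               # masked features guide later stages

masked_features, mask = SpatialSaliencyBranch()(torch.randn(1, 3, 224, 224))
```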
At Block 310, a recurrent neural network, such as ConvLSTM (Convolutional Long Short-Term Memory), is used to predict spatio-temporal saliency portions based on each of the input frames and one or more previous frames. Output from Block 310 is provided to Block 312 and Block 314. At Block 312, output from Block 310 is used to perform an average pooling operation on a feature map produced by Block 310. At Block 314, the temporal saliency of the input frames is predicted (see FIG. 4) and represented as a value ranging from zero (0) to one (1), wherein, for example, a value of zero represents that frames are highly unlikely to contain salient content, and a value of one represents that particular frames are highly likely to contain salient content. In some embodiments, input frames having a predicted saliency value close to 0 (e.g., 0.1, 0.2) may contain irrelevant information and therefore have little influence on the action label applied to the media file. Input frames having a predicted saliency value closer to 1 (e.g., 0.8, 0.9) would have a higher influence on the action label applied to the media file.
At Block 316, a fused feature map is produced by performing a weighted fusion operation, in which the pooled feature maps from Block 312 are weighted by the temporal saliency values from Block 314 for the input frames. At Block 318, a classifier sub-network is used to predict probability values for one or more action types being present in the media file (Block 320). At Block 322, an action classification label is determined for the whole media file. For example, the label may be a ground-truth label for the entire media file. In some embodiments, training of the unified framework 300 may involve minimizing a cross-entropy loss between the determined label (Block 322) for the media file and a “ground truth” action classification label for the media file. The generated output (e.g., action labels, classification labels) may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like. In some embodiments, the output, in the form of an action label, may be used to train the temporal saliency sub-network to recognize what action is going on in the media file.
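The fusion and classification steps of Blocks 312 through 322 can be illustrated with the hedged sketch below, in which per-frame features are weighted by temporal saliency values, fused, and passed to a linear classifier trained with a cross-entropy loss. The feature dimensions, class count, and labels are placeholders, not values from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_fusion(frame_features, temporal_saliency):
    # frame_features: (B, T, C); temporal_saliency: (B, T) with values in [0, 1]
    weights = temporal_saliency.unsqueeze(-1)                        # (B, T, 1)
    fused = (frame_features * weights).sum(dim=1)
    return fused / weights.sum(dim=1).clamp(min=1e-6)                # normalized fusion

classifier = nn.Linear(256, 10)                                      # 10 assumed action classes

features = torch.randn(2, 8, 256)                                    # 2 clips, 8 frames, 256-d features
saliency = torch.rand(2, 8)                                          # per-frame temporal saliency
logits = classifier(weighted_fusion(features, saliency))             # clip-level action scores
loss = F.cross_entropy(logits, torch.tensor([3, 7]))                 # assumed ground-truth action labels
```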
TEMPORAL SALIENCY COMPONENTS AND METHODS
Turning now to FIG. 4, a framework 400 including exemplary components of a sub-module for predicting temporal saliency on video data is shown. In some embodiments, the sub-module of framework 400 is part of the temporal saliency prediction at Block 314 of FIG. 3. Framework 400 includes, for example, a series of modules that may be used to predict the likelihood that video frames contain temporally salient regions. Deep learning functions include, but are not limited to, fully connected (FC) layers at Blocks 402 and 406, a rectified linear unit (ReLU) activation function at Block 404, and a Sigmoid differentiable function at Block 408. Additional deep learning functions may be used by framework 400.
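A minimal sketch of this sub-module, assuming a 256-dimensional frame descriptor and a 64-unit hidden layer (both assumptions), might look as follows in PyTorch.

```python
import torch
import torch.nn as nn

# FC -> ReLU -> FC -> Sigmoid, mirroring Blocks 402-408 of FIG. 4.
temporal_saliency_head = nn.Sequential(
    nn.Linear(256, 64),   # FC (Block 402); input width is an assumption
    nn.ReLU(),            # ReLU activation (Block 404)
    nn.Linear(64, 1),     # FC (Block 406)
    nn.Sigmoid(),         # Sigmoid (Block 408): saliency value in [0, 1] per frame
)

scores = temporal_saliency_head(torch.randn(8, 256)).squeeze(-1)   # one score per frame
```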
SPATIO-TEMPORAL SALIENCY FROM MULTI-TASK LEARNING
Turning now to FIG. 5, a framework 500 for training a network to determine spatio-temporal saliency for a video segment using a multi-task learning approach is shown. Framework 500 begins at Block 502 by an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving input frames, such as image frames of a media file described herein and above. In this example, the media file includes three video frames (502 (a), 502 (b), 502 (c)), however it is understood that the media file may contain more or fewer frames than in the example provided. An input frame may include, for example, an image frame of a media or video file described herein and above. As described above, the media file may be captured by client device 140 or received from another device over a network, such as network 150. Next, at Block 504, the input frame may be provided as input to a neural network. Using the neural network, at Block 506, a saliency mask is used to mask portions of the input frame that will not be used to classify the spatial saliency of the input frame. In some embodiments, the saliency mask may be determined using supervision at Block 508. For example, direct supervision may be provided to the spatial saliency prediction using human eye fixation. At Block 510, a feature map of the input frame is generated. For example, a feature map may be generated by applying one or more filters or feature detectors to the input frame. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input.
At Block 512, a recurrent neural network, ConvLSTM (Convolutional Long Short-Term Memory) , may be used to predict spatio-temporal saliency based on the input frame and previous frames. Output from Block 512 may be provided to Block 514 and Block 516. At Block 514, output from Block 512 may be used to generate an average pool. At Block 516, the temporal  saliency of the input frame may be predicted (see FIG. 4) and represented as a value ranging from zero (0) to one (1) .
At Block 518, weighted fusion of the input frame may be predicted by multiplying the average pool value (Block 514) by the temporal saliency at Block 516. At Block 522, a classifier may be used to predict action recognition based on probability (Block 524) and a label may be applied (Block 526) to the entire media file. For example, the label may be a ground-truth label for the entire media file. In some embodiments, training of the unified framework 500 may be used to minimize cross-entropy loss. In some embodiments, supervision may also be applied to predicting the temporal saliency of the input frame at Block 516. For example, a binary label may be applied to each input frame indicating if the current input frame is, or is not, a key frame. The generated output (e.g., action labels, classification labels) may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like. In some embodiments, the output, in the form of an action label, may be used to train the temporal saliency sub-network to recognize what action is going on in the media file.
Next, at Block 528, if the media file contains additional frames (YES) , the process resumes at Block 502 for the next available input frame of the media file. If the media file does not contain additional frames (NO) , the process ends. Once the process ends, the results of the process may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
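As a non-limiting sketch of the multi-task supervision described for FIG. 5, the function below combines a clip-level cross-entropy term with a per-frame binary term on the predicted temporal saliency. The weighting factor, tensor shapes, and label values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(action_logits, action_label, temporal_saliency,
                    key_frame_labels, temporal_weight=1.0):
    """Clip-level cross-entropy plus per-frame binary supervision on temporal saliency."""
    cls_loss = F.cross_entropy(action_logits, action_label)
    temporal_loss = F.binary_cross_entropy(temporal_saliency, key_frame_labels)
    return cls_loss + temporal_weight * temporal_loss

loss = multi_task_loss(torch.randn(2, 5),                     # logits for 5 assumed action classes
                       torch.tensor([1, 4]),                  # clip-level action labels
                       torch.rand(2, 8),                      # predicted temporal saliency per frame
                       torch.randint(0, 2, (2, 8)).float())   # binary key-frame annotations
```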
SPATIO-TEMPORAL SALIENCY FROM MULTI-TASK LEARNING USING MULTI-STAGE TRAINING METHODS
Turning now to FIG. 6, a framework 600 for training a network to determine spatio-temporal saliency for a video segment using a single-stage training strategy is provided. Framework 600 may begin at Block 602 by an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving an input frame. In this example, the media file includes three video frames (602 (a), 602 (b), 602 (c)), however it is understood that the media file may contain more or fewer frames than in the example provided. An input frame may include, for example, an image frame of a media or video file described herein and above. As described above, the media file may be captured by client device 140 or received from another device over a network, such as network 150. Next, at Block 604, the input frame may be provided as input to a neural network, such as a convolutional neural network (CNN). Using the neural network, a saliency mask (Block 606) may be used to mask portions of the input frame that will not be used to classify the spatial saliency of the input frame. In some embodiments, at Block 608, the spatial saliency may be trained using supervision from eye fixation and action labels. In some embodiments, Kullback-Leibler (KL) divergence and a linear correlation coefficient may be used to measure the distance between a predicted saliency map and an eye fixation heat map. The KL-divergence and linear correlation coefficient terms may be minimized to make the distribution of the predicted saliency map and the eye fixation heat map consistent.
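The KL-divergence and linear correlation measurements described above might be computed as in the following sketch, where both maps are normalized to probability distributions before comparison; the normalization and epsilon handling are assumptions rather than the disclosed formulation.

```python
import torch

def saliency_map_losses(pred, fixation, eps=1e-8):
    """KL divergence and Pearson correlation between a predicted saliency map and
    an eye-fixation heat map, after normalizing both to probability distributions."""
    p = pred.flatten(1)
    p = p / (p.sum(dim=1, keepdim=True) + eps)
    q = fixation.flatten(1)
    q = q / (q.sum(dim=1, keepdim=True) + eps)
    kl = (q * torch.log(q / (p + eps) + eps)).sum(dim=1).mean()
    pc = p - p.mean(dim=1, keepdim=True)
    qc = q - q.mean(dim=1, keepdim=True)
    cc = (pc * qc).sum(dim=1) / (pc.norm(dim=1) * qc.norm(dim=1) + eps)
    return kl, cc.mean()

kl, cc = saliency_map_losses(torch.rand(2, 64, 64), torch.rand(2, 64, 64))
```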
At Block 610, a feature map of the input frame is generated. For example, a feature map may be generated by applying one or more filters or feature detectors to the input frame. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input. At Block 612, a recurrent neural network, ConvLSTM (Convolutional Long Short-Term Memory), may be used to predict spatio-temporal saliency based on the input frame and previous frames. Output from Block 612 may be provided to Block 614. At Block 614, output from Block 612 may be used to generate an average pool.
At Block 616, weighted fusion of the input frame may be predicted based on the average pool value (Block 614). At Block 618, a classifier may be used to predict action recognition based on probability (Block 620) and a label may be applied (Block 622) to the input frame. In some embodiments, training of the unified framework 600 may be used to minimize cross-entropy loss.
Next, at Block 624, if the media file contains additional frames (YES) , the process resumes at Block 602 for the next available input frame of the media file. If the media file does not contain additional frames (NO) , the process ends. Once the process ends, the results of the process may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
SPATIO-TEMPORAL SALIENCY FROM MULTI-TASK LEARNING USING MULTI-STAGE TRAINING METHODS
Turning now to FIG. 7, a framework 700 for training a network to determine spatio-temporal saliency for a video segment using a multi-stage training strategy is provided. In this example embodiment, the network may be fixed for spatial saliency by freezing the network parameters for spatial saliency. Additionally, the temporal saliency may be trained with supervision from frame-level annotation and action labels using an L1-loss to regress the temporal saliency to the frame-level label. Framework 700 shows at Block 702 an electronic device (e.g., client device 140 of FIG. 1, in the context of training module 104 and integrated spatio-temporal saliency modules 106A/106B) receiving an input frame, such as an image frame of a media file described herein and above. In this example, the media file includes three video frames (702 (a), 702 (b), 702 (c)), however it is understood that the media file may contain more or fewer frames than in the example provided. An input frame may include, for example, an image frame of a media or video file described herein and above. As described above, the media file may be captured by client device 140 or received from another device over a network, such as network 150. Next, at Block 704, the input frame may be provided as input to a neural network, such as a convolutional neural network (CNN). Using the neural network, a saliency mask at Block 706 may be used to mask portions of the input frame that will not be used to classify the spatial saliency of the input frame, as described herein. At Block 710, a feature map of the input frame may be generated. For example, a feature map may be generated by applying one or more filters or feature detectors to the input frame. Additionally, or alternatively, a feature map may be generated based on prior video frames used as input. In this example, the network parameters for predicting spatial saliency may be frozen, as indicated by Block 712, based on a first stage of training, such as described above with reference to FIG. 6. As shown in FIG. 7, a second stage of training may be performed based on temporal saliency, wherein the network is trained with supervision from frame-level annotation and action labels. At Block 714, a recurrent neural network, such as ConvLSTM (Convolutional Long Short-Term Memory), may be used to predict spatio-temporal saliency based on the input frame and previous frames. Output from Block 714 may be provided to Block 716 and Block 718. At Block 716, output from Block 714 may be used to generate an average pool. At Block 718, the temporal saliency of the input frame may be predicted using a sub-network (see FIG. 4) and represented as a value ranging from zero (0) to one (1). Temporal saliency of the input frame may be trained with supervision from frame-level annotation and action labels. In some embodiments, L1-loss may be used to regress the temporal saliency to the frame-level label.
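A hedged sketch of such a second training stage follows: the parameters standing in for the spatial saliency network are frozen, and only the temporal saliency head is regressed to frame-level labels with an L1 loss. The model layout, dimensions, and labels are stand-ins and do not reproduce the architecture of FIG. 7.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageModel(nn.Module):
    """Minimal stand-in: a 'spatial' feature layer plus a temporal saliency head."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.spatial_branch = nn.Linear(512, feat_dim)                    # placeholder for the CNN
        self.temporal_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, frame_descriptors):                                 # (B, T, 512)
        feats = self.spatial_branch(frame_descriptors)
        return self.temporal_head(feats).squeeze(-1)                      # (B, T), values in [0, 1]

model = TwoStageModel()

# Stage two: freeze the spatial parameters learned in stage one ...
for param in model.spatial_branch.parameters():
    param.requires_grad = False

# ... and regress the temporal saliency to frame-level annotations with an L1 loss.
frame_descriptors = torch.randn(2, 8, 512)
frame_level_labels = torch.randint(0, 2, (2, 8)).float()
loss = F.l1_loss(model(frame_descriptors), frame_level_labels)
loss.backward()   # gradients reach only the unfrozen temporal head
```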
At Block 720, weighted fusion of the input frame may be predicted by multiplying the average pool value (Block 716) by the temporal saliency (Block 718) . At Block 722, a classifier may be used to predict action recognition based on probability (Block 724) and a label may be applied (Block 726) to the input frame. In some embodiments, training of the framework 700 may be used to minimize cross-entropy loss.
Next, at Block 728, if the media file contains additional frames (YES) , the process resumes at Block 702 for the next available input frame of the media file. If the media file does not contain additional frames (NO) , the process ends. Once the process ends, the results of the process may be stored within a storage device as described herein and above for subsequent retrieval, additional post-processing, or the like.
DETAILED NETWORK ARCHITECTURE DURING INFERENCE
Turning now to FIG. 8, an example neural network architecture 800 for inference of spatio-temporal saliency at a device is shown, according to one or more embodiments. As shown in FIG. 8, an input media file at Block 802 that includes one or more image frames may be provided to a neural network at Block 804. The neural network may include a unified framework for providing spatio-temporal saliency by combining the spatial saliency output at Block 820 with the output from the neural network at Block 822 to produce the spatio-temporal saliency of the media file, as described herein and above. The example network architecture 800 may include, for example, a plurality of layers at Blocks 806 (a) -806 (d) for down-sampling and up-sampling, layers for Post NN, US1, and US2 at Blocks 810, 812, and 814, respectively, and skip functions at Blocks 808 (a) and 808 (b). The network architecture 800 may also include a Conv2D class at Block 816 as part of the neural network, as well as a Sigmoid function at Block 818. The network architecture 800 includes, at Block 824, a recurrent neural network, ConvLSTM, for providing spatio-temporal prediction output. As described herein, additional functions for predicting salient regions of a media file may be provided including, but not limited to, Post NN at Block 826, Average Pool at Block 828, FC at Blocks 830 and 834, ReLU at Block 832, Sigmoid at Block 836, and output at Block 838.
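The down-sampling and up-sampling structure with skip connections and a Conv2D-plus-Sigmoid head might be sketched as below; the depth, channel counts, and use of transposed convolutions are assumptions, and the sketch omits the ConvLSTM and classifier branches of FIG. 8.

```python
import torch
import torch.nn as nn

class SaliencyEncoderDecoder(nn.Module):
    """Down-sampling and up-sampling layers with a skip connection, ending in a
    one-channel Conv2d followed by a Sigmoid to produce a spatial saliency map."""

    def __init__(self):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU())
        self.head = nn.Sequential(nn.Conv2d(8, 1, 1), nn.Sigmoid())

    def forward(self, x):                  # x: (B, 3, H, W) with H, W divisible by 4
        d1 = self.down1(x)                 # (B, 16, H/2, W/2)
        d2 = self.down2(d1)                # (B, 32, H/4, W/4)
        u1 = self.up1(d2) + d1             # skip connection from the first down block
        u2 = self.up2(u1)                  # back to (B, 8, H, W)
        return self.head(u2)               # (B, 1, H, W) saliency map in [0, 1]

saliency_map = SaliencyEncoderDecoder()(torch.randn(1, 3, 128, 128))
```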
FIG. 9 shows, in flow chart form, an example method for predicting spatio-temporal saliency of a media item. The method may be implemented by integrated spatio-temporal saliency module 106A, integrated saliency module 106B, or combinations or variations thereof. The method may be implemented on a server device, such as spatio-temporal saliency service 102, or on a client device, such as client device 140 of FIG. 1. For purposes of explanation, the following steps will be described in the context of FIG. 1. However, the various actions may be taken by alternate components. In addition, the various actions may be performed in a different order. Further, some actions may be performed simultaneously, some may not be required, and others may be added.
The flow chart begins at Block 905 with the capturing of image content by a device, such as a device equipped with one or more cameras, image sensors, or the like. Image content may include video data comprised of a plurality of video frames, for example. Additionally, or alternatively, image content may be retrieved from a storage device, such as the device’s memory, or a remote storage device, such as a memory of another device or a cloud computing device over a network. Next, at Block 910, video frames of the image content having spatio-temporal salient regions may be identified. In some embodiments, a saliency score may be assigned to the image content indicating a likelihood that the image content includes salient regions. Additionally, image content may be analyzed in, or near, real-time (e.g., during capture or very soon after time of capture).
The flow chart continues at Block 915 with the prediction of an event occurring within the salient regions of the one or more frames of the video segment. As described herein, a neural network may be used to predict the event. At Block 920, one or more action labels may be assigned to the video segment. The one or more action labels may be based, at least in part, on the event predicted by the neural network.
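One possible, purely illustrative way to map predicted event probabilities to action labels (Blocks 915 and 920) is sketched below; the label set and confidence threshold are assumptions and are not drawn from the disclosure.

```python
import torch

ACTION_LABELS = ["dive", "race", "goal", "other"]     # assumed, illustrative label set

def assign_action_labels(event_logits, min_confidence=0.5):
    """Map per-clip event logits to action labels above a confidence threshold."""
    probs = torch.softmax(event_logits, dim=-1)
    confidence, index = probs.max(dim=-1)
    return [ACTION_LABELS[i] if c >= min_confidence else None
            for c, i in zip(confidence.tolist(), index.tolist())]

labels = assign_action_labels(torch.randn(2, 4))      # two clips, four candidate actions
```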
For playback of image content, at Block 925, video frames of the image content that show meaningful action change may be encoded at a variable frame rate. For example, video frames that include salient regions may be played back at a lower compression rate and/or at an increased frame rate relative to frame(s) of the image content that include non-salient regions.
According to some embodiments, a processor or a processing element may be trained using supervised machine learning and/or unsupervised machine learning, and the machine learning may employ an artificial neural network, which, for example, may be a convolutional neural network, a recurrent neural network, a deep learning neural network, a reinforcement learning module or program, or a combined learning module or program that learns in two or more fields or areas of interest. Machine learning may involve identifying and recognizing patterns in existing data in order to facilitate making predictions for subsequent data. Models may be created based upon example inputs in order to make valid and reliable predictions for novel inputs.
According to certain embodiments, machine learning programs may be trained by inputting sample data sets or certain data into the programs, such as images, object statistics and information, historical estimates, and/or image/video/audio classification data. The machine learning programs may utilize deep learning algorithms that may be primarily focused on pattern recognition and may be trained after processing multiple examples. The machine learning programs may include Bayesian Program Learning (BPL), voice recognition and synthesis, image or object recognition, optical character recognition, and/or natural language processing. The machine learning programs may also include semantic analysis, automatic reasoning, and/or other types of machine learning.
According to some embodiments, supervised machine learning techniques and/or unsupervised machine learning techniques may be used. In supervised machine learning, a processing element may be provided with example inputs and their associated outputs and may seek to discover a general rule that maps inputs to outputs, so that when subsequent novel inputs are provided the processing element may, based upon the discovered rule, accurately predict the correct output. In unsupervised machine learning, the processing element may need to find its own structure in unlabeled example inputs.
The scope of the disclosed subject matter should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein. ”

Claims (20)

  1. A system for processing videos, comprising:
    one or more processors; and
    one or more computer readable media comprising computer readable code executable by the one or more processors to:
    receive a video segment comprising a plurality of frames of video data;
    identify, using a neural network, regions of spatio-temporal saliency in one or more frames of the video segment;
    predict at least one event occurring within the identified regions of spatio-temporal saliency of the one or more frames of the video segment; and
    apply one or more action labels to the video segment based, at least in part, on the at least one predicted event.
  2. The system of claim 1, wherein the computer readable code to predict the at least one event occurring within the identified regions of spatio-temporal saliency in the one or more frames of the video segment comprises computer readable code to:
    calculate a spatio-temporal saliency score for at least one of the identified regions of spatio-temporal saliency in the one or more frames of the video segment; and
    predict the at least one event based, at least in part, on the calculated spatio-temporal saliency score.
  3. The system of claim 2, wherein the spatio-temporal saliency score predicts how interested one or more persons would be in the corresponding regions of the one or more frames of the video segment.
  4. The system of claim 1, wherein the one or more computer readable media further comprises computer readable code to:
    provide, during playback of the video segment, the one or more frames having the identified regions of spatio-temporal saliency at a lower compression rate.
  5. The system of claim 1, wherein the one or more computer readable media further comprises computer readable code to:
    provide, during playback of the video segment, the one or more frames having the identified regions of spatio-temporal saliency at an increased frame rate.
  6. The system of claim 1, wherein the one or more computer readable media further comprises computer readable code to:
    provide, during playback of the video segment, the one or more frames having the identified regions of spatio-temporal saliency at a lower compression rate and an increased frame rate.
  7. The system of claim 1, wherein the one or more computer readable media further comprises computer readable code to:
    categorize the video segment based, at least in part, on the applied one or more action labels.
  8. The system of claim 1, wherein the one or more computer readable media further comprises computer readable code to:
    store the video segment, wherein the one or more frames having the identified regions of spatio-temporal saliency are stored at a first compression rate and the one or more frames lacking the identified regions of spatio-temporal saliency are stored at a second compression rate.
  9. The system of claim 8, wherein the first compression rate is lower than the second compression rate.
  10. A system for processing videos, comprising:
    one or more processors; and
    one or more computer readable media comprising computer readable code executable by the one or more processors to:
    receive a video segment comprising a plurality of frames of video data; and
    train a neural network to identify regions of spatio-temporal saliency in one or more frames of the video segment,
    wherein the neural network is trained using multi-stage training.
  11. The system of claim 10, wherein the multi-stage training includes at least one stage to train the neural network to identify regions of spatial saliency in the one or more frames of the video segment.
  12. The system of claim 11, wherein the regions of spatial saliency in the one or more frames of the video segment are identified using supervision from eye fixation, one or more action labels, or both.
  13. The system of claim 10, wherein the multi-stage training includes at least one stage to train the neural network to identify regions of temporal saliency in the one or more frames of the video segment.
  14. The system of claim 13, wherein the regions of temporal saliency in the one or more frames of the video segment are identified using supervision from frame annotation and action labels.
  15. The system of claim 10, wherein the multi-stage training includes a first stage to train the neural network to identify regions of spatial saliency in the one or more frames of the video segment and a second stage to train the neural network to identify regions of temporal saliency in the one or more frames of the video segment.
  16. A method for processing videos for playback, comprising:
    capturing a video comprising a plurality of frames of image data;
    identifying, using a neural network, regions of spatio-temporal saliency in one or more of the plurality of frames of the video;
    predicting at least one event occurring within the identified regions of spatio-temporal saliency of the one or more of the plurality of frames of the video; and
    applying one or more action labels to the video based on the at least one predicted event.
  17. The method of claim 16, wherein predicting the at least one event occurring within the identified regions of spatio-temporal saliency in the one or more of the plurality of frames of the video comprises:
    calculating a spatio-temporal saliency score for at least one of the identified regions of spatio-temporal saliency in the one or more of the plurality of frames of the video; and
    predicting the at least one event based, at least in part, on the calculated spatio-temporal saliency score.
  18. The method of claim 17, wherein the spatio-temporal saliency score predicts how interested one or more persons would be in the corresponding regions of the one or more of the plurality of frames of the video.
  19. The method of claim 16, wherein the method further comprises:
    providing, during playback of the video, the one or more of the plurality of frames having the identified regions of spatio-temporal saliency at a lower compression rate.
  20. The method of claim 16, wherein the method further comprises:
    providing, during playback of the video, the one or more of the plurality of frames having the identified regions of spatio-temporal saliency at an increased frame rate.
PCT/CN2023/109116 2023-07-25 2023-07-25 Spatio-temporal video saliency analysis Pending WO2025020080A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/109116 WO2025020080A1 (en) 2023-07-25 2023-07-25 Spatio-temporal video saliency analysis

Publications (1)

Publication Number Publication Date
WO2025020080A1 true WO2025020080A1 (en) 2025-01-30

Family

ID=94373958

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/109116 Pending WO2025020080A1 (en) 2023-07-25 2023-07-25 Spatio-temporal video saliency analysis

Country Status (1)

Country Link
WO (1) WO2025020080A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104754267A (en) * 2015-03-18 2015-07-01 小米科技有限责任公司 Video clip marking method, device and terminal
US20160275642A1 (en) * 2015-03-18 2016-09-22 Hitachi, Ltd. Video analysis and post processing of multiple video streams
CN109241829A (en) * 2018-07-25 2019-01-18 中国科学院自动化研究所 The Activity recognition method and device of convolutional neural networks is paid attention to based on space-time
CN110287938A (en) * 2019-07-02 2019-09-27 齐鲁工业大学 Event recognition method, system, device and medium based on key segment detection
CN110705431A (en) * 2019-09-26 2020-01-17 中国人民解放军陆军炮兵防空兵学院 Video saliency region detection method and system based on depth C3D feature
CN110852295A (en) * 2019-10-15 2020-02-28 深圳龙岗智能视听研究院 Video behavior identification method based on multitask supervised learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23946173

Country of ref document: EP

Kind code of ref document: A1