
WO2025104250A1 - Training neural networks on continuous video streams - Google Patents

Training neural networks on continuous video streams

Info

Publication number
WO2025104250A1
Authority
WO
WIPO (PCT)
Prior art keywords
video frames
neural network
training
video frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2024/082496
Other languages
English (en)
Inventor
Joao Carreira
Viorica PATRAUCEAN
Dilara Gokay
Michael John King
Yi Yang
Catalin-Dumitru IONESCU
Dima Jamal Allan ALDAMEN
Daniel Zoran
Yusuf Aytar
Carl DOERSCH
Andrew Zisserman
Joseph Francis HEYWARD
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DeepMind Technologies Ltd
Original Assignee
DeepMind Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepMind Technologies Ltd
Publication of WO2025104250A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • This specification relates to training neural networks on video data.
  • Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input.
  • Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.
  • Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
  • This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains a neural network to perform any of a variety of video processing tasks.
  • a method of training a neural network having a plurality of parameters comprising: obtaining a temporal sequence of video frames, wherein the temporal sequence comprises one or more initial video frames at one or more initial time steps followed by one or more subsequent video frames at one or more subsequent time steps; for each of the one or more subsequent video frames: selecting, from a plurality of patches of the subsequent video frame that each comprise a different subset of pixels of the subsequent video frame, a subset of selected patches; for each of the one or more initial video frames: generating a superimposed initial video frame that comprises (i) a plurality of patches of the initial video frame that each comprise a different subset of pixels of the initial video frame and (ii) one or more selected patches in the subsets of selected patches; processing the one or more superimposed initial video frames using the neural network in accordance with current values of the plurality of parameters to generate, for each of the one or more subsequent video frames, a prediction of the subsequent video frame; and determining an update to the current values of the plurality of parameters based on optimizing an objective function that measures, for each of the one or more subsequent video frames, a difference between the prediction of the subsequent video frame and the subsequent video frame.
  • the prediction of the subsequent video frame may comprise at least a prediction of unselected patches in the plurality of patches of the subsequent video frame.
  • Obtaining the temporal sequence of video frames may comprise obtaining the temporal sequence of video frames from a continuous video stream.
  • Obtaining the temporal sequence of video frames may comprise obtaining the temporal sequence of video frames from a replay buffer that stores a plurality of temporal sequences of video frames.
  • Training the neural network may comprise computing a cumulative metric that represents a training performance of the neural network as of a training step based on evaluating a score function over temporal sequences of video frames that have been processed by the neural network as of the training step.
  • the method may further comprise training the neural network on an image reconstruction task.
  • the method may further comprise training the neural network on one or more of a depth estimation task, a semantic segmentation task, or an action classification task, using annotated video clips.
  • the neural network may have a U-Net with self-attention architecture.
  • the neural network may have a vision Transformer (ViT) architecture.
  • the method may further comprise updating the current values of the plurality of parameters according to the determined update.
  • the method may comprise repeatedly updating the current values of the plurality of parameters at a specified frequency.
  • the specified frequency may be 0.4Hz or less.
  • the specified frequency may be 0.4Hz.
  • the method may comprise obtaining an adaptation parameter value indicative of a desired degree of adaptation; determining the specified frequency based on the adaptation parameter value.
  • the specified frequency may be greater for adaptation parameter values corresponding to a greater desired degree of adaptation.
  • the specified frequency may increase monotonically with the adaptation parameter values.
  • the method may comprise obtaining a generalization parameter value indicative of a desired degree of generalization; and determining the specified frequency based on the generalization parameter.
  • the specified frequency may be less for generalization parameter values corresponding to a greater desired degree of generalization.
  • a method of training a neural network having a plurality of parameters comprising: obtaining a temporal sequence of video frames, wherein the temporal sequence comprises one or more initial video frames at one or more initial time steps followed by one or more subsequent video frames at one or more subsequent time steps; for each of the one or more initial video frames: selecting, from a plurality of patches of the initial video frame that each comprise a different subset of pixels of the initial video frame, a subset of selected patches; and generating a masked initial video frame by replacing each selected patch in the subset with a mask patch; processing the one or more masked initial video frames using the neural network in accordance with the current values of the plurality of parameters to generate a prediction of the one or more subsequent video frames; and determining an update to the current values of the plurality of parameters based on optimizing an objective function that measures a difference between the prediction of the one or more subsequent video frames and the one or more subsequent video frames.
  • Obtaining the temporal sequence of video frames may comprise obtaining the temporal sequence of video frames from a continuous video stream.
  • Obtaining the temporal sequence of video frames may comprise obtaining the temporal sequence of video frames from a replay buffer that stores a plurality of temporal sequences of video frames.
  • Training the neural network may comprise computing a cumulative metric that represents a training performance of the neural network as of a training iteration based on evaluating a score function over temporal sequences of video frames that have been processed by the neural network as of the training iteration.
  • the method may further comprise training the neural network on an image reconstruction task.
  • the method may further comprise training the neural network on one or more of a depth estimation task, a semantic segmentation task, or an action classification task, using annotated video clips.
  • the neural network may have a U-Net with self-attention architecture.
  • the neural network may have a vision Transformer (ViT) architecture.
  • the method may further comprise updating the current values of the plurality of parameters according to the determined update.
  • the method may comprise repeatedly updating the current values of the plurality of parameters at a specified frequency.
  • the specified frequency may be 0.4Hz or less.
  • the specified frequency may be 0.4Hz.
  • the method may comprise obtaining an adaptation parameter value indicative of a desired degree of adaptation; and determining the specified frequency based on the adaptation parameter value.
  • the specified frequency may be greater for adaptation parameter values corresponding to a greater desired degree of adaptation.
  • the specified frequency may increase monotonically with the adaptation parameter values.
  • the method may comprise obtaining a generalization parameter value indicative of a desired degree of generalization; and determining the specified frequency based on the generalization parameter.
  • the specified frequency may be less for generalization parameter values corresponding to a greater desired degree of generalization.
  • one or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the above method aspect.
  • a system comprising one or more computers and one or more storage devices storing instructions that when executed by one or more computers cause the one or more computers to perform the respective operations of the above method aspect.
  • This specification describes a system that consumes training data that includes continuous video streams to train a neural network.
  • Continuous video streams can be obtained relatively easily in a large volume, at a low hardware cost, or both, for example in robotics, camera-based applications, traffic camera applications, and smart glasses applications. Obtaining such continuous video streams also needs considerably less data pre-processing than, e.g., assembling a meticulously curated video dataset for video processing tasks.
  • Some techniques described in this specification can be used by the system to implement a continual training framework suitable for training the neural network on a very long, potentially infinitely long, sequence of video frames to continuously update the parameter values of the neural network as new video frames become available. In general this is a difficult task because of the high correlation between consecutive video frames.
  • the described techniques use a patch-based approach that involves determining a difference between a subsequent video frame and a prediction of the subsequent video frame, and that in some implementations can involve determining a difference between corresponding pixels of the video frames.
  • the continual training framework can train the neural network to achieve a balance between its generalization and adaptation capabilities.
  • a neural network would be able to adapt to a new deployment environment without forgetting what the neural network was trained to do during training and without losing its ability to generalize to new videos.
  • the balance between generalization and adaptation capabilities is needed for embodied intelligence in the field of robotics, as well as for any type of digital assistants that should have the capability to adapt to the specific needs of a user.
  • the continuously pre-trained neural network thus can be more easily adapted, e.g., through fine-tuning, to any of a range of downstream tasks. Once adapted, the continuously pre-trained neural network can achieve or even exceed the performance of a conventionally pre-trained neural network on any of the video processing tasks, despite an adaptation process that consumes fewer computing resources, is faster in terms of wall-clock time, or both.
  • a depth estimation neural network adapted from the continuously pre-trained neural network can generate a more accurate depth estimation for each of a plurality of pixels in a video frame of a video in comparison to existing depth estimation models.
  • a depth estimation neural network can facilitate a higher performance of the control system in various robotics applications and other camera-based agent control tasks.
  • FIG. 1 shows an example training system
  • FIG. 2A shows an example of a superimposed initial video frame.
  • FIG. 2B shows an example of a masked initial video frame.
  • FIG. 3 is a flow diagram of an example process for training a neural network to perform a guided future prediction task.
  • FIG. 4 is a flow diagram of an example process for training a neural network to perform a masked future prediction task.
  • FIG. 5 shows quantitative examples of improvements that can be achieved by using the training techniques described in this specification to train a neural network on a future prediction task compared to conventional training techniques.
  • FIG. 1 shows an example neural network training system 100.
  • the neural network training system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations that trains a neural network 120 to perform one or more video processing tasks.
  • Some examples of video processing tasks that the neural network 120 can be trained to perform follow.
  • the neural network 120 can be configured to perform video generation tasks. That is, the neural network 120 can generate a video, e.g., either unconditioned or conditioned on a context input, e.g., a text description, or another video segment that characterizes a content of the generated video.
  • the generated video includes a respective video frame at each of multiple time steps.
  • Each video frame can be an image, e.g. defined by pixel values of the image.
  • the neural network 120 can be configured to perform video analysis tasks.
  • the neural network 120 can receive multiple video frames of a video, and can process each video frame to generate an output that characterizes the video frames.
  • the neural network 120 can process a video that includes multiple video frames to generate a classification output for each of one or more of the multiple video frames.
  • the classification output can include a respective score corresponding to each of multiple categories.
  • the score for a category indicates a likelihood that the video frame belongs to the category.
  • the categories may be classes of objects (e.g., dog, cat, person, and the like), and the video frame may belong to a category if it depicts an object included in the object class corresponding to the category.
  • the categories may be classes of actions, and the video frame may belong to a category if it depicts an action in the action class corresponding to the category.
  • the categories may represent global video frame properties (e.g., whether the video frame depicts a scene in the day or at night, or whether the video frame depicts a scene in the summer or the winter), and the video frame may belong to the category if it has the global property corresponding to the category.
  • the neural network 120 can process a video that includes multiple video frames to generate a pixel-level classification output for each of one or more of the multiple video frames.
  • the pixel-level classification output can include, for each pixel in the video frame, a respective score corresponding to each of multiple categories.
  • the score for a category indicates a likelihood that the pixel belongs to the category.
  • the categories may be classes of objects, and a pixel may belong to a category if it is part of an object included in the object class corresponding to the category. That is, the pixel-level classification output may comprise a semantic segmentation output. As another example, the pixel-level classification output may comprise an instance segmentation output that labels individual instances of objects in the video. As a further example, the category can indicate a depth in a quantized depth estimation task.
  • the neural network 120 can process a video that includes multiple video frames to generate a regression output for each of one or more of the multiple video frames.
  • the regression output for a video frame estimates one or more continuous variables (i.e., that can assume infinitely many possible numerical values) that characterize the video frame.
  • the regression output may estimate the coordinates of bounding boxes that enclose respective objects depicted in the video frame.
  • the coordinates of a bounding box may be defined by (x, y) coordinates of the vertices of the bounding box.
  • the neural network 120 can process a video that includes multiple video frames to generate a depth estimation output, an action classification output, an action recognition output, and so on.
  • a depth estimation output for a video frame can include a respective depth estimation for each of a plurality of pixels in the video frame.
  • Each depth estimation represents information about the distance of a surface in an environment relative to a camera sensor that captures the video frame.
  • the regression output may define a continuous depth estimate in a depth estimation task.
  • an action recognition output for a video can recognize an action that spans multiple video frames in the video.
  • The action can, for example, be an action, activity, and/or other temporally varying occurrence that involves a human actor and/or a non-human actor, such as an animal, a robot, an inanimate object, or portions thereof.
  • When expressed as text, the action recognition output of the video may include at least one verb that is descriptive of the action, activity, and/or other temporally varying occurrence. Action recognition outputs may be used to facilitate video retrieval, video captioning, and/or visual question-and-answer, among other tasks.
  • the neural network 120 can have any appropriate architecture that allows the neural network to perform the video processing tasks, e.g., an attention-based neural network architecture such as a Transformer architecture, a convolutional architecture, a fully-connected architecture, a recurrent architecture, and so on.
  • the neural network 120 can have a U-Net architecture, e.g., a U-Net with self-attention architecture.
  • the neural network 120 can have a vision Transformer (ViT) architecture (Dosovitskiy et al. 2020).
  • the neural network 120 can have a diffusion model architecture.
  • the neural network 120 can include a neural network backbone that processes an input of the neural network 120 to generate one or more embeddings, and an output head that processes the one or more embeddings to generate the output for the task that the neural network 120 is configured to perform.
  • the neural network backbone can have one of the architectures described in more detail in Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models”, in Advances in Neural Information Processing Systems, pages 6840-6851.
  • the neural network training system 100 trains the neural network 120 to perform the one or more video processing tasks in two stages: a pre-training stage followed by an adaptation stage.
  • the neural network training system 100 implements a continual training framework to train the neural network 120 on one or more temporal sequences of video frames 102 to determine pre-trained values of the parameters of the neural network 120.
  • Each of the video frames 102 is an image that can be represented as a two-dimensional (2D) array of pixels, where each pixel is represented as a vector of one or more values.
  • each pixel can be represented as an integer or floating point number (i.e., a vector with one component) representing the brightness of the pixel.
  • each pixel can be represented as a vector with three integer or floating point components, which respectively represent the intensity of the red, green, and blue color of the pixel.
  • Instead of red-green-blue (RGB) color channels, YUV or HSV color channels may be employed.
  • the pixels of a video frame can be indexed by (x, y) coordinates, where x and y are integer values.
  • When the video frames 102 are pre-stored video frames, the neural network training system 100 can obtain a temporal sequence of video frames 102, e.g., through sampling, from a replay buffer 130 that stores a plurality of temporal sequences of pre-stored video frames 102.
  • When the video frames 102 are live video frames, the neural network training system 100 can obtain the video frames 102 from a live video stream in which new video frames are continuously appended to the stream as they become available.
  • FIG. 1 illustrates that the neural network training system 100 can obtain a temporal sequence of video frames 102 from a sensor system 140, which can include a still or video camera.
  • a sequence of video frames is referred to as a “temporal” sequence when the video frames are arranged according to the time at which the video frames were captured by a sensor.
  • a temporal sequence of video frames can be obtained from sensor measurements of a real-world environment that have been generated as an agent, e.g., a vehicle or a robot, navigates through the real-world environment. Other ways of obtaining the sequence of video frames are possible.
  • the neural network training system 100 uses a video frame augmentation engine 105 and a parameter update engine 110 to train the neural network 120 by alternating between two or more different pre-training tasks.
  • These different pre-training tasks generally involve training the neural network 120 to generate a prediction of future video frames in a temporal sequence of video frames.
  • the inputs processed by the neural network 120 can be different for different ones of the tasks.
  • One example pre-training task is a vanilla future prediction task that trains the neural network 120 to predict the next video frame(s) in a temporal sequence of video frames from the currently seen video frame(s).
  • the neural network 120 receives a sequence of one or more initial video frames in a temporal sequence of video frames and processes the sequence of one or more initial video frames in accordance with current values of the parameters of the neural network 120 to generate one or more predicted video frames.
  • the one or more predicted video frames are a prediction of one or more subsequent video frames in the temporal sequence of video frames, i.e., the one or more video frames that will follow the last video frame in the one or more initial video frames.
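  • As a minimal illustration of this vanilla future prediction setup, the following sketch (PyTorch) maps a stack of initial video frames to a prediction of the subsequent video frames; the tiny stand-in convolutional predictor, its dimensions, and the frame counts are illustrative assumptions rather than the U-Net or ViT backbones discussed elsewhere in this specification.

```python
# Minimal sketch of the vanilla future prediction setup: a stand-in
# convolutional predictor maps T initial frames to K predicted future frames.
import torch
import torch.nn as nn

class TinyFramePredictor(nn.Module):
    def __init__(self, num_initial: int, num_subsequent: int, channels: int = 3):
        super().__init__()
        self.num_subsequent = num_subsequent
        self.channels = channels
        self.net = nn.Sequential(
            nn.Conv2d(num_initial * channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, num_subsequent * channels, kernel_size=3, padding=1),
        )

    def forward(self, initial_frames: torch.Tensor) -> torch.Tensor:
        # initial_frames: (batch, T, C, H, W) -> stack frames along channels.
        b, t, c, h, w = initial_frames.shape
        x = initial_frames.reshape(b, t * c, h, w)
        out = self.net(x)
        # Reshape back to (batch, K, C, H, W): one prediction per subsequent frame.
        return out.reshape(b, self.num_subsequent, self.channels, h, w)

predictor = TinyFramePredictor(num_initial=4, num_subsequent=2)
predicted = predictor(torch.rand(1, 4, 3, 64, 64))  # prediction of the next 2 frames
```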
  • Another example pre-training task is a guided future prediction task that trains the neural network 120 to predict the next video frame(s) in a temporal sequence of video frames from the currently seen video frame(s) and from selected patches of the next video frame(s).
  • To train the neural network 120 on the guided future prediction task, the video frame augmentation engine 105 generates one or more superimposed initial video frames based on (i) one or more initial video frames in the temporal sequence of video frames and (ii) one or more subsequent video frames in the temporal sequence of video frames.
  • the video frame augmentation engine 105 selects one or more patches from each of the one or more subsequent video frames.
  • the video frame augmentation engine 105 can select the one or more patches in any appropriate way from each subsequent video frame.
  • a patch can be selected randomly from a subsequent video frame.
  • a patch can be selected based on a predetermined order from a subsequent video frame, e.g., a top left patch from the first subsequent video frame, a top right patch from the second subsequent video frame, a bottom left patch from the third subsequent video frame, and so on.
  • a patch of a video frame is a strict subset (i.e. less than all) of the pixels of the video frame.
  • each patch includes a different subset of multiple contiguous pixels of the video frame.
  • each pixel in the video frame is included in exactly one of the patches.
  • certain pixels from the video frame can be excluded from all of the patches, such that one or more pixels are not included in any of the patches.
  • the video frame augmentation engine 105 then generates, for each of the one or more initial video frames, a superimposed initial video frame based on (i) a plurality of patches of the initial video frame and (ii) one or more of the patches selected from the one or more subsequent video frames.
  • the superimposed initial video frame 230 includes a plurality of patches, e.g., patch 202, of the initial video frame 210.
  • the superimposed initial video frame 230 includes one or more selected patches, e.g., patch 204, of the subsequent video frame 220.
  • the superimposed initial video frame includes both the patch 204 of the subsequent video frame 220 and a patch 203 of the initial video frame 210 that occupies the same spatial location within the initial video frame as is occupied by the patch 204 within the subsequent video frame 220.
  • For each pixel within this spatial location, the values of the pixel can be a combination, e.g., a sum or an average, of the values of the pixel of the patch 203 of the initial video frame 210 and the values of the pixel of the patch 204 of the subsequent video frame 220 that occupy the same spatial location.
  • the superimposed initial video frame can include the patch 204 of the subsequent video frame 220 in place of the patch 203 of the initial video frame 210 that occupies the same spatial location within the initial video frame as is occupied by the patch 204 within the subsequent video frame 220, such that for each pixel within this spatial location, the values of the pixel can be the same as the values of the pixel of the patch 204 of the subsequent video frame 220.
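  • The following sketch (NumPy) illustrates one way a superimposed initial video frame could be constructed: selected patches of a subsequent frame are either averaged with, or pasted over, the co-located patches of an initial frame. The patch size, the patch selection, and the averaging rule are illustrative assumptions.

```python
# Illustrative sketch of building a superimposed initial video frame.
import numpy as np

def superimpose(initial: np.ndarray,
                subsequent: np.ndarray,
                selected: list[tuple[int, int]],
                patch: int = 16,
                mode: str = "average") -> np.ndarray:
    """initial, subsequent: (H, W, C) frames; selected: (row, col) patch indices."""
    out = initial.copy()
    for r, c in selected:
        ys, xs = slice(r * patch, (r + 1) * patch), slice(c * patch, (c + 1) * patch)
        if mode == "average":
            # Combine co-located pixels of the initial and subsequent patches.
            out[ys, xs] = (initial[ys, xs] + subsequent[ys, xs]) / 2.0
        else:
            # Replace the initial patch with the selected subsequent patch.
            out[ys, xs] = subsequent[ys, xs]
    return out

frame_t0 = np.random.rand(64, 64, 3)
frame_t1 = np.random.rand(64, 64, 3)
superimposed = superimpose(frame_t0, frame_t1, selected=[(0, 0), (2, 3)])
```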
  • the patches of a video frame all have a square shape. However, this is not required.
  • the patches of a video frame can have one or more different shapes. For example, the patches can have a rectangular shape. As another example, some patches can have a rectangular shape while other patches can have a circular shape.
  • the neural network 120 receives a sequence of one or more superimposed initial video frames in a temporal sequence of video frames and processes the sequence of one or more superimposed initial video frames in accordance with current values of the parameters of the neural network 120 to generate one or more predicted video frames.
  • the one or more predicted video frames are a prediction of one or more subsequent video frames in the temporal sequence of video frames, i.e., the one or more video frames that will follow the last video frame in the one or more initial video frames.
  • a further example pre-training task is a masked future prediction task that trains the neural network 120 to predict the next video frame(s) in a temporal sequence of video frames from a masked version of the currently seen video frame(s).
  • To train the neural network 120 on the masked future prediction task, the video frame augmentation engine 105 generates one or more masked initial video frames based on one or more initial video frames in the temporal sequence of video frames. In particular, for each of the one or more initial video frames, the video frame augmentation engine 105 selects one or more patches from the initial video frame and then replaces the one or more patches with mask patches. Thus, each masked initial video frame is a masked version of an initial video frame that replaces some patches originally included in the initial video frame with mask patches.
  • the masked initial video frame 250 includes a plurality of mask patches, e.g., mask patch 208, that replace the patches that originally occupy the same spatial locations within the initial video frame 240.
  • Each mask patch can include predetermined pixel values that are generally different from the pixel values of the patches originally included in the initial video frame 240.
  • each mask patch can include all zeros.
  • each mask patch can include all ones.
  • each mask patch can include all negative infinity values.
  • each mask patch can include all positive infinity values.
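  • A minimal sketch (NumPy) of constructing a masked initial video frame follows: each selected patch is replaced by a mask patch filled with a predetermined value (zeros here; ones or other sentinel values would be handled the same way). The patch size and selection are illustrative assumptions.

```python
# Illustrative sketch of building a masked initial video frame.
import numpy as np

def mask_frame(initial: np.ndarray,
               selected: list[tuple[int, int]],
               patch: int = 16,
               mask_value: float = 0.0) -> np.ndarray:
    """initial: (H, W, C) frame; selected: (row, col) indices of patches to mask."""
    out = initial.copy()
    for r, c in selected:
        # Overwrite the selected patch with the predetermined mask value.
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = mask_value
    return out

masked = mask_frame(np.random.rand(64, 64, 3), selected=[(1, 1), (3, 0)])
```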
  • the neural network 120 receives a sequence of one or more masked initial video frames in a temporal sequence of video frames and processes the sequence of one or more masked initial video frames in accordance with current values of the parameters of the neural network 120 to generate one or more predicted video frames.
  • the one or more predicted video frames are a prediction of one or more subsequent video frames in the temporal sequence of video frames, i.e., the one or more video frames that will follow the last video frame in the one or more initial video frames.
  • the parameter update engine 110 trains the neural network 120 by performing iterations of a neural network training technique to optimize one or more objective functions and to update the current values of the parameters of the neural network 120.
  • the one or more objective functions can include an objective function that measures, for each predicted video frame generated by the neural network 120, a difference between the predicted video frame and a corresponding, actual subsequent video frame in the temporal sequence of video frames. Examples of how the difference can be measured will be described further below.
  • the parameter update engine 110 uses the same objective function across different pre-training tasks, whereas, in other implementations, the parameter update engine 110 uses different objective functions for different pre-training tasks.
  • the neural network training system 100 continuously monitors the training performance of the neural network by way of computing a cumulative metric, and adjusts one or more aspects of the pre-training to ensure the efficiency, effectiveness, or both of the pre-training as a result of the continuous monitoring. Examples of how the cumulative metric can be computed, as well as examples of what aspects of the pre-training can be adjusted, will be explained further below.
  • the neural network training system 100 can adapt the pre-trained neural network 120 to a downstream task, which can be any one of the video processing tasks mentioned above, and possibly other video (or image) processing, video (or image) classification, or video (or image) regression tasks.
  • the neural network 120 can be adapted to perform a next frame prediction task on a different video, which includes video frames that are different from those included in the temporal sequences of video frames used in the pre-training.
  • the neural network 120 can be adapted to perform an image reconstruction task by receiving as input an input image and processing the input to generate a reconstruction of the input image.
  • the neural network training system 100 fine-tunes, i.e., further trains, the neural network 120 to determine fine-tuned values of some or all of the parameters of the neural network 120 beginning from the pre-trained values of the parameters that have been learned as a result of the pre-training.
  • For example, the neural network training system 100 can train the neural network 120 on a different collection of video frames, e.g., video frames included in a collection of annotated video clips that are each associated with a ground truth output defining a target output that should be generated by the neural network 120 for one of the tasks mentioned above. Based on this different collection of video frames, the parameter update engine 110 can update the values of the parameters of the neural network 120 using a different optimizer or a different objective function, e.g., a mean per-frame intersection over union (IoU) and recall loss function for the semantic segmentation task, or a log relative mean square error (logRMSE) for the depth estimation task.
  • FIG. 3 is a flow diagram of an example process 300 for training a neural network to perform a guided future prediction task.
  • the process 300 will be described as being performed by a system of one or more computers located in one or more locations.
  • a neural network training system e.g., the neural network training system 100 of FIG. 1, appropriately programmed, can perform the process 300.
  • the system obtains a temporal sequence of video frames (step 302).
  • the temporal sequence includes one or more initial video frames at one or more initial time steps followed by one or more subsequent video frames at one or more subsequent time steps.
  • Each of the video frames is an image that can be represented as a two-dimensional (2D) array of pixels, where each pixel is represented as a vector of one or more values.
  • the system can obtain the temporal sequence of video frames from a continuous video stream that has new video frames continuously appended to it as they become available.
  • the system can obtain, e.g., through sampling, the temporal sequence of video frames from a replay buffer that stores a plurality of temporal sequences of video frames.
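  • As a rough illustration, a replay buffer holding temporal sequences of video frames could look like the sketch below; the capacity, eviction policy, and uniform sampling are assumptions for illustration rather than details from this specification.

```python
# Minimal sketch of a replay buffer of temporal sequences of video frames.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 1000):
        self.sequences = deque(maxlen=capacity)  # oldest sequences are evicted

    def add(self, sequence):
        """sequence: a temporally ordered list of video frames."""
        self.sequences.append(sequence)

    def sample(self):
        """Return one stored temporal sequence, sampled uniformly at random."""
        return random.choice(self.sequences)
```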
  • the system can apply a sliding window to a longer temporal sequence of video frames that includes many video frames to identify a fixed frame window (which slides frame by frame along the temporal dimension), and partition the video frames included in the fixed frame window into the one or more initial video frames and the one or more subsequent video frames.
  • the system can apply a first sliding window to a longer temporal sequence of video frames that includes many video frames to identify a first fixed frame window that includes the one or more initial video frames, and apply a second sliding window to the longer temporal sequence of video frames to identify a second fixed frame window that includes the one or more subsequent video frames.
  • the first sliding window and the second sliding window both slide frame by frame along the temporal dimension, but are displaced from each other by a predetermined number of time steps.
  • the first sliding window can span time steps t1–t3, and the second sliding window can span time steps t10–t13; then, the first sliding window shifts by one time step, thereby spanning time steps t2–t4, while the second sliding window shifts by one time step, thereby spanning time steps t11–t14; and so on, in that order.
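  • The displaced sliding windows described above could be implemented along the lines of the following sketch; the window lengths and the 9-time-step displacement are illustrative values chosen to match the example.

```python
# Sketch of two displaced sliding windows over a temporal sequence of frames:
# the first window yields the initial frames, the second, offset window yields
# the subsequent frames; both slide frame by frame along the temporal dimension.
def sliding_windows(frames, initial_len=3, subsequent_len=4, offset=9):
    """frames: a temporally ordered list of video frames.

    Yields (initial_frames, subsequent_frames) pairs.
    """
    last_start = len(frames) - (offset + subsequent_len)
    for t in range(last_start + 1):
        initial = frames[t:t + initial_len]
        subsequent = frames[t + offset:t + offset + subsequent_len]
        yield initial, subsequent
```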
  • the way the system obtains the video frames is different from how the video frames would be obtained in some existing neural network training systems, which may select video frames in an independent and identically distributed (IID) manner from one or more videos.
  • For a batch of 6 video frames, for example, the batch obtained by those systems will include 6 individual video frames sampled from one or more videos and arranged in a random order, e.g., such that one video frame included in the batch may be unrelated to another video frame also included in the batch.
  • In contrast, the batch obtained by the described system will include 6 video frames arranged in a temporal order, i.e., will include a first video frame at a first time step, a second video frame at a second time step that is a continuation of the first video frame at the first time step, and so on, in that order, until a sixth video frame at a sixth time step that is a continuation of the fifth video frame at the fifth time step.
  • For each of the one or more subsequent video frames, the system selects a subset of selected patches from a plurality of patches of the subsequent video frame (step 304). Each selected patch includes a different subset of pixels of the subsequent video frame. In some implementations, a fixed number of patches are selected from each subsequent video frame, whereas, in other implementations, the numbers of patches that are selected from the subsequent video frames can vary.
  • For each of the one or more initial video frames, the system generates a superimposed initial video frame (step 306).
  • the superimposed initial video frame includes (i) a plurality of patches of the initial video frame that each include a different subset of pixels of the initial video frame and (ii) one or more selected patches in the subsets of selected patches.
  • the superimposed initial video frame includes one or more selected patches in the subset of patches selected from a single subsequent video frame, whereas, in other implementations, the superimposed initial video frame includes one or more selected patches in the subsets of patches selected from two or more subsequent video frames.
  • the superimposed initial video frame includes all of the patches originally included in the initial video frame, where some of the patches originally included in the initial video frame are combined with the patches selected from the one or more subsequent video frames. In other implementations, the superimposed initial video frame includes some of the patches originally included in the initial video frame, while some others of the patches originally included in the initial video frame are replaced by the patches selected from the one or more subsequent video frames.
  • the system processes the one or more superimposed initial video frames using the neural network in accordance with current values of the plurality of parameters to generate, for each of the one or more subsequent video frames, a predicted video frame that corresponds to the subsequent video frame (step 308).
  • Each predicted video frame is a prediction of the corresponding subsequent video frame that is generated by the neural network based on processing the one or more superimposed initial video frames.
  • the prediction of the corresponding subsequent video frame includes at least a prediction of unselected patches in the plurality of patches of the corresponding subsequent video frame.
  • each predicted video frame can be an image that can be represented as a two-dimensional (2D) array of pixels, where each pixel is represented as a vector of one or more predicted values.
  • the system determines an update to the current values of the plurality of parameters based on optimizing an objective function that is suitable for the guided future prediction task (step 310).
  • the system can do this by computing, for each superimposed initial video frame, respective gradients of the objective function with respect to the parameters of the neural network by backpropagation through the appropriate parameters of the neural network. Then, to update the current values of the plurality of parameters according to the determined update, the system can apply an optimizer to the respective gradients.
  • the objective function can measure, for each of the one or more subsequent video frames, a difference between the predicted video frame and the subsequent video frame. In some implementations, the difference can be a pixel-to-pixel difference between the predicted video frame and the subsequent video frame.
  • the objective function can evaluate an L1 loss, an L2 loss, or a Huber loss between the predicted values of the pixels of the predicted video frame and the actual values of the pixels of the subsequent video frame.
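  • Putting steps 308–310 together, a hedged sketch (PyTorch) of a single training step follows: the neural network processes the superimposed initial frames, a pixel-to-pixel L1, L2, or Huber loss is computed against the actual subsequent frames, and an optimizer applies the backpropagated gradients. The predictor interface (e.g., the TinyFramePredictor sketched earlier) and the loss choice are assumptions for illustration.

```python
# Sketch of one training step for the guided future prediction task.
import torch
import torch.nn.functional as F

def training_step(predictor, optimizer, superimposed_initial, subsequent, loss_name="l2"):
    """superimposed_initial: (B, T, C, H, W); subsequent: (B, K, C, H, W)."""
    predicted = predictor(superimposed_initial)
    if loss_name == "l1":
        loss = F.l1_loss(predicted, subsequent)
    elif loss_name == "huber":
        loss = F.huber_loss(predicted, subsequent)
    else:  # "l2": mean squared error over corresponding pixels
        loss = F.mse_loss(predicted, subsequent)
    optimizer.zero_grad()
    loss.backward()   # gradients of the objective w.r.t. the parameters
    optimizer.step()  # apply the optimizer to the gradients
    return loss.item()
```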
  • FIG. 4 is a flow diagram of an example process 400 for training a neural network to perform a masked future prediction task.
  • the process 400 will be described as being performed by a system of one or more computers located in one or more locations.
  • a neural network training system e.g., the neural network training system 100 of FIG.1, appropriately programmed, can perform the process 400.
  • the system obtains a temporal sequence of video frames (step 402).
  • the temporal sequence includes one or more initial video frames at one or more initial time steps followed by one or more subsequent video frames at one or more subsequent time steps.
  • Each of the video frames is an image that can be represented as a two-dimensional (2D) array of pixels, where each pixel is represented as a vector of one or more values.
  • the system selects a subset of selected patches (step 404) from a plurality of patches of the initial video frame that each include a different subset of pixels of the initial video frame, and then generates a masked initial video frame (step 406) by replacing each selected patch in the subset with a mask patch.
  • each masked initial video frame corresponds to an initial video frame and includes the actual values of some of the pixels originally included in the corresponding initial video frame, but excludes the actual values of others of the pixels originally included in the corresponding initial video frame. Instead, the masked initial video frame includes predetermined values in place of the actual values of the others of the pixels originally included in the corresponding initial video frame.
  • the system processes the one or more masked initial video frames using the neural network in accordance with the current values of the plurality of parameters to generate, for each of the one or more subsequent video frames, a predicted video frame that corresponds to the subsequent video frame (step 408).
  • Each predicted video frame is a prediction of the corresponding subsequent video frame that is generated by the neural network based on processing the one or more masked initial video frames.
  • each predicted video frame can be an image that can be represented as a two-dimensional (2D) array of pixels, where each pixel is represented as a vector of one or more predicted values.
  • the system determines an update to the current values of the plurality of parameters based on optimizing an objective function that measures a difference between the prediction of the one or more subsequent video frames and the one or more subsequent video frames (step 410).
  • the system can do this by computing, for each masked initial video frame, respective gradients of the objective function with respect to the parameters of the neural network by backpropagation through the appropriate parameters of the neural network. Then, to update the current values of the plurality of parameters according to the determined update, the system can apply an optimizer to the respective gradients.
  • the objective function can measure, for each of the one or more subsequent video frames, a difference between the predicted video frame and the subsequent video frame.
  • the difference can be a pixel-to-pixel difference between the predicted video frame and the subsequent video frame.
  • the objective function can evaluate an L1 loss, an L2 loss, or a Huber loss between the predicted values of the pixels of the predicted video frame and the actual values of the pixels of the subsequent video frame.
  • the system can repeatedly perform iterations of the process 300 and process 400 on different batches of initial video frames and subsequent video frames obtained from the same or different video to repeatedly update the parameters of the neural network as part of the pretraining of the neural network.
  • the system can continue performing iterations of the process 300 until termination criteria for the training of the neural network on the guided future prediction task, or alternatively until switching criteria for switching to train the neural network on a different pre-training task, have been satisfied, e.g., until the parameters have converged, until a threshold amount of wall clock time has elapsed, until a threshold number of iterations of the process 300 have been performed, or once an indication to switch to a different pre-training task is received.
  • the system can continue performing iterations of the process 400 until termination criteria for the training of the neural network on the masked future prediction task, or alternatively until switching criteria for switching to train the neural network on a different pre-training task, have been satisfied.
  • the system can incorporate any number of techniques to improve the speed, the effectiveness, or both of the pre-training process.
  • the system can enable or disable the use of certain optimizers to ensure robustness when applying the updates to the values of the parameters of the neural network.
  • the updates are determined based on respective gradients of the objective function that is suitable for the pre-training training task.
  • the system can disable the use of the Adam optimizer or any other optimizer that uses momentum, and instead use an RMSprop optimizer, an AdaGrad optimizer, or another optimizer that does not use momentum.
  • the system can use a constant learning rate that is applied to the respective gradients of the objective function when determining the update.
  • Alternatively, the system can use a dynamic learning rate schedule, e.g., a constant schedule, a linear schedule, a cosine and exponential decay schedule, or a “1 cycle” and cosine decay with restarts schedule.
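  • For illustration, the momentum-free optimizer and learning-rate options mentioned above might be set up as in the following sketch; the specific learning-rate values and the cosine schedule shown are assumptions.

```python
# Sketch of momentum-free optimizers with a constant learning rate, plus an
# optional dynamic (cosine-decay) learning rate schedule.
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in network

# Momentum-free optimizers with a constant learning rate.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4, momentum=0.0)
# optimizer = torch.optim.Adagrad(model.parameters(), lr=1e-2)

# Alternatively, a dynamic learning rate schedule, e.g. cosine decay.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```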
  • the system can adjust the number of video frames obtained at step 302 of each iteration of process 300, e.g., to limit it to a relatively small number, e.g., no more than 4, 8, or 16.
  • the system can adjust the size of the displacement between the two sliding windows, to limit it to a relatively small size, e.g., 1 time step, 2 time steps, or 4 time steps.
  • the system continuously monitors the training performance of the neural network by way of computing a cumulative metric, and adjusts one or more aspects of the pre-training as a result of the continuous monitoring.
  • the cumulative metric can represent a training performance of the neural network as of a training iteration, and the system can compute the cumulative metric based on evaluating a score function over the losses that have been computed using the objective function for the temporal sequences of video frames that have been processed so far by the neural network as of the training iteration.
  • the cumulative metric can be a current adaptation parameter value.
  • the current adaptation parameter value is indicative of a current degree of adaptation performance that the neural network has attained as a result of the ongoing pre-training of the neural network by the system.
  • the adaptation performance is the performance of the neural network on the pre-training task when processing other temporal sequences of video frames that are included in the same video as the temporal sequences of video frames that have been processed by the neural network.
  • the current adaptation parameter value can be computed based on a running average of losses computed using the objective function for the most recent n video frames in the current video that have been processed by the neural network so far, where n is a positive integer.
  • the cumulative metric can be a current generalization parameter value.
  • the current generalization parameter value is indicative of a current degree of generalization performance of the neural network that the neural network has attained as a result of the ongoing pre-training of the neural network by the system.
  • the generalization performance is the performance on the pre-training task when processing other temporal sequences of video frames that are included in a different video than the temporal sequences of video frames that have been processed by the neural network.
  • the current generalization parameter value can be computed based on a running average of losses computed using the objective function for the n video frames in a held-out video that is different from the video which includes the temporal sequences of video frames being processed by the neural network, where n is a positive integer.
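  • The cumulative metrics described above could be tracked with simple running averages of per-frame losses, as in the sketch below: one tracker over the most recent n frames of the current video (adaptation) and one over n frames of a held-out video (generalization). The window size n and the plain-average score function are illustrative assumptions.

```python
# Sketch of running-average trackers for the adaptation and generalization metrics.
from collections import deque

class RunningLossAverage:
    def __init__(self, n: int = 100):
        self.losses = deque(maxlen=n)  # keeps only the most recent n losses

    def update(self, loss: float) -> None:
        self.losses.append(loss)

    def value(self) -> float:
        return sum(self.losses) / len(self.losses) if self.losses else float("nan")

adaptation_metric = RunningLossAverage(n=100)      # losses on the current video
generalization_metric = RunningLossAverage(n=100)  # losses on a held-out video
```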
  • the system can adjust, i.e., increase or lower, the frequency at which the values of the parameters of the neural network are updated throughout the pretraining, such that the values of the parameters of the neural network are repeatedly updated at a specified frequency during at least some time period during the pre-training.
  • the system dynamically adjusts the number of superimposed initial video frames evaluated by the objective function at step 310 (the gradients of which are then used to determine the update) in each iteration of the process 300 or the number of masked initial video frames evaluated by the objective function at step 410 (the gradients of which are then used to determine the update) in each iteration of the process 400.
  • Adjustment of the parameter value update frequency can be used to balance between the generalization performance and the adaptation performance of the neural network.
  • the specified frequency at which the values of the parameters of the neural network are updated can be set to 0.4Hz or less (so as to update the values of the parameters of the neural network no faster than every 2.5 seconds, measured in wall clock time).
  • the system can obtain a target adaptation parameter value indicative of a desired degree of adaptation performance, and determine the specified frequency at which the values of the parameters of the neural network are updated based on comparing the current adaptation parameter value to the target adaptation parameter value.
  • For example, for greater target adaptation parameter values, the system can determine a greater (higher) specified frequency.
  • the specified frequency determined by the system increases monotonically with the target adaptation parameter values.
  • the system can obtain a target generalization parameter value indicative of a desired degree of generalization performance, and determine the specified frequency at which the values of the parameters of the neural network are updated based on comparing the current generalization parameter value to the target generalization parameter value.
  • For example, for greater target generalization parameter values, the system can determine a lower specified frequency.
  • the specified frequency determined by the system decreases monotonically with the target generalization parameter values.
  • the target generalization parameter can measure desired performance after training the neural network, i.e. on a task to be performed by the neural network after training. More particularly it can be determined as a target average of losses computed using the objective function for one or more evaluation videos, e.g. of the type for which the task is to be performed.
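  • One hedged way to turn target adaptation and generalization values into a specified update frequency, respecting the monotonic relationships and the 0.4 Hz cap described above, is sketched below; the linear form and the constants are assumptions, not a formula from this specification.

```python
# Sketch: map target adaptation/generalization values to an update frequency.
# Frequency increases monotonically with the target adaptation value, decreases
# with the target generalization value, and is capped at 0.4 Hz (at most one
# parameter update every 2.5 seconds of wall clock time).
def update_frequency(target_adaptation: float,
                     target_generalization: float,
                     max_hz: float = 0.4) -> float:
    """Both targets are assumed to be normalized to [0, 1]."""
    raw = max_hz * target_adaptation * (1.0 - target_generalization)
    return min(max(raw, 0.0), max_hz)

freq = update_frequency(target_adaptation=0.8, target_generalization=0.3)
update_period_seconds = 1.0 / freq if freq > 0 else float("inf")
```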
  • FIG. 5 shows quantitative examples of improvements that can be achieved by using the training techniques described in this specification (which involve obtaining temporal sequences of video frames from a continuous video stream) to train a neural network on a future prediction task compared to conventional training techniques (which involve selecting video frames in an independent and identically distributed (IID) manner from one or more videos).
  • the diagram on the left shows that, when training on IID video frames, the cosine similarity of consecutive gradients of an objective function for the future prediction task with respect to the parameters of the neural network is close to a normal distribution.
  • In contrast, when training on video frames obtained from a continuous video stream, the cosine similarity of consecutive gradients shows strong correlations.
  • the diagram on the right shows that, when training on video frames obtained from the continuous video stream, the losses (L2 losses) of the predicted video frames generated by the neural network that are computed using the objective function are consistently lower.
  • Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.
  • the computer storage medium can be a machine- readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • data processing apparatus refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
  • Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit.
  • a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
  • the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
  • a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
  • Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
  • Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
  • Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
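For orientation only: the L2 losses referred to in the first bullet above compare each predicted video frame with the corresponding actual subsequent video frame. The sketch below shows one way such a per-pixel L2 (mean squared error) comparison could be computed; the array shapes, the [0, 1] pixel scaling, and the function name l2_frame_loss are assumptions made for illustration rather than details taken from this application.

```python
import numpy as np

def l2_frame_loss(predicted_frames: np.ndarray, target_frames: np.ndarray) -> float:
    """Mean squared (L2) error between predicted and actual subsequent frames.

    Both arrays are assumed to share a shape such as
    [num_frames, height, width, channels], with pixel values in [0, 1].
    """
    if predicted_frames.shape != target_frames.shape:
        raise ValueError("predicted and target frames must have the same shape")
    # Average the squared per-pixel differences over all frames, pixels, and channels.
    return float(np.mean((predicted_frames - target_frames) ** 2))
```

A lower value of this scalar indicates that the predicted video frames are, on average, closer pixel-wise to the actual subsequent video frames.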

Landscapes

  • Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training a neural network. In one aspect, a method includes obtaining a temporal sequence of video frames and training the neural network on one or more superimposed initial video frames or one or more masked initial video frames that are generated based on video frames included in the temporal sequence. The temporal sequence includes one or more initial video frames at one or more initial time steps followed by one or more subsequent video frames at one or more subsequent time steps.
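As a rough, non-authoritative illustration of the abstract above, the sketch below assembles one possible form of a superimposed initial video frame by overlaying a few patches taken from subsequent video frames of the temporal sequence onto a copy of an initial video frame. The patch size, the number of overlaid patches, the random selection, and the helper name superimpose_initial_frame are assumptions made purely for illustration; they are not specifics drawn from this publication.

```python
import numpy as np

def superimpose_initial_frame(initial_frame, subsequent_frames,
                              patch_size=16, num_patches=4, rng=None):
    """Overlay a few patches from subsequent frames onto a copy of an initial frame.

    Frames are assumed to be [height, width, channels] NumPy arrays whose height
    and width are divisible by patch_size.
    """
    rng = np.random.default_rng() if rng is None else rng
    superimposed = initial_frame.copy()
    height, width, _ = initial_frame.shape
    rows, cols = height // patch_size, width // patch_size
    for _ in range(num_patches):
        # Pick one of the subsequent frames and a patch-aligned location.
        source = subsequent_frames[rng.integers(len(subsequent_frames))]
        r = int(rng.integers(rows)) * patch_size
        c = int(rng.integers(cols)) * patch_size
        # Paste the selected patch over the corresponding region of the copy.
        superimposed[r:r + patch_size, c:c + patch_size] = \
            source[r:r + patch_size, c:c + patch_size]
    return superimposed
```

In this sketch the resulting frame could then serve as a training input to the neural network, with the network trained to predict the subsequent video frames; the exact construction used in the claimed method is defined by the description and claims rather than by this example.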
PCT/EP2024/082496 2023-11-17 2024-11-15 Training neural networks on continuous video streams Pending WO2025104250A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363600628P 2023-11-17 2023-11-17
US63/600,628 2023-11-17

Publications (1)

Publication Number Publication Date
WO2025104250A1 (fr) 2025-05-22

Family

ID=93590858

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/082496 Pending WO2025104250A1 (fr) Training neural networks on continuous video streams

Country Status (1)

Country Link
WO (1) WO2025104250A1 (fr)

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ALEXEY DOSOVITSKIY, LUCAS BEYER, ALEXANDER KOLESNIKOV, DIRK WEISSENBORN, XIAOHUA ZHAI, THOMAS UNTERTHINER, MOSTAFA DEHGHANI, MATTHIAS MINDERER ET AL: "An image is worth 16x16 words: Transformers for image recognition at scale", arXiv preprint arXiv:2010.11929, 2020
CARREIRA JOÃO ET AL: "Learning from One Continuous Video Stream", 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 1 December 2023 (2023-12-01), pages 28751 - 28761, XP093245376, ISBN: 979-8-3503-5300-6, Retrieved from the Internet <URL:https://arxiv.org/pdf/2312.00598v1> DOI: 10.1109/CVPR52733.2024.02716 *
CASTELLÓ JAVIER SELVA ET AL: "A Comprehensive Survey on Deep Future Frame Video Prediction", 1 January 2018 (2018-01-01), XP093245433, Retrieved from the Internet <URL:https://upcommons.upc.edu/bitstream/handle/2117/118121/126510.pdf> *
JONATHAN HO, AJAY JAIN, PIETER ABBEEL: "Advances in Neural Information Processing Systems", 2020, CURRAN ASSOCIATES, INC., article "Denoising diffusion probabilistic models", pages: 6840 - 6851
LEE JOOYEON ET AL: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", 2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), IEEE, 21 August 2022 (2022-08-21), pages 1012 - 1018, XP034235894, DOI: 10.1109/ICPR56361.2022.9956507 *
VIKRAM VOLETI ET AL: "MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 October 2022 (2022-10-12), XP091341203 *
ZHOUQIANG JIANG ET AL: "Concatenated Masked Autoencoders as Spatial-Temporal Learner", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 November 2023 (2023-11-02), XP091650524 *
ZIAO YANG ET AL: "PTCT: Patches with 3D-Temporal Convolutional Transformer Network for Precipitation Nowcasting", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 June 2022 (2022-06-03), XP091238315 *

Similar Documents

Publication Publication Date Title
US10853726B2 (en) Neural architecture search for dense image prediction tasks
US11144782B2 (en) Generating video frames using neural networks
US11688077B2 (en) Adaptive object tracking policy
US11488067B2 (en) Training machine learning models using teacher annealing
US11163989B2 (en) Action localization in images and videos using relational features
US11967150B2 (en) Parallel video processing systems
CN111369299 (zh) Recognition method, apparatus, device, and computer-readable storage medium
CN113469204 (zh) Data processing method, apparatus, device, and computer storage medium
US20250191194A1 (en) Tracking query points in videos using point tracking neural networks
CN116452472B (zh) Low-illumination image enhancement method based on semantic knowledge guidance
US20250046071A1 (en) Temporal aggregation for dynamic channel pruning and scaling
WO2025104250A1 (fr) Training neural networks on continuous video streams
US20240403636A1 (en) Self-attention based neural networks for processing network inputs from multiple modalities
Zeng High efficiency pedestrian crossing prediction
US20250356635A1 (en) Performing computer vision tasks using guiding code sequences
US12395685B2 (en) Highly efficient model for video quality assessment
US20250139959A1 (en) Detecting objects in images by generating sequences of tokens
US20240282093A1 (en) Fine-tuning computer vision neural networks using task rewards
KR20220164957A (ko) Image-based place estimation method using artificial intelligence and object recognition, and computing system performing the same
CN119941795 (zh) Target tracking method and apparatus

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24809289

Country of ref document: EP

Kind code of ref document: A1