WO2024238949A1 - Sequence packing for training image processing neural networks - Google Patents
Sequence packing for training image processing neural networks
- Publication number
- WO2024238949A1 (PCT/US2024/030005, US2024030005W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- training
- image
- neural network
- images
- tokens
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
Definitions
- This specification relates to training neural networks.
- Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input.
- Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer.
- Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
- This specification describes a system implemented as computer programs on one or more computers in one or more locations that trains an image processing neural network.
- the image processing neural network is a neural network that receives an input image and processes the input image to generate an output for the image.
- Vision Transformers and other neural networks, e.g., Vision Transformer variants, MLP-Mixer neural networks, and so on, that process sequences of patches from images have shown very strong performance on a variety of computer vision tasks.
- these neural networks generally require that, during training, input images are resized to a fixed resolution and a fixed aspect ratio, e.g., a square aspect ratio, and then split into a fixed number of patches before they are processed by the neural network.
- a single input sequence during training can include patches from multiple different input images and these different input images can have different resolutions, different aspect ratios, or both.
- Training the neural network in this manner can enable variable resolution images to be processed at inference, can improve training efficiency and downstream inference performance, and can allow for incorporating a variety of other techniques that may further improve the efficacy of the training. Examples of such techniques include randomly sampled token dropping and resolution sampling.
- the image processing neural network can be trained to have improved performance at inference time while being able to accurately process images of different resolutions and aspect ratios. Moreover, the neural network can achieve this improved performance while consuming fewer computational resources during training.
- FIG. 1 shows an example neural network system.
- FIG. 2 is a flow diagram of an example process for performing a training step during the training of the image processing neural network.
- FIG. 3 is a flow diagram of an example process for generating an input sequence when token dropping is employed.
- FIG. 4 is a flow diagram of an example process for obtaining a batch when resolution sampling is employed.
- FIG. 5 shows an example of generating an input sequence and processing the input sequence using the image processing neural network.
- FIG. 6 shows an example of the performance of the described techniques.
- FIG. 1 shows an example neural network system 100.
- the neural network system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.
- system 100 is a system that trains an image processing neural network 110 on training data 120.
- the image processing neural network 110 is a neural network that receives an input image and processes the input image to generate an output for the image.
- the output for the image can be an image embedding of the input image in an embedding space.
- An “embedding” as used in this specification is a vector of numeric values, e.g., floating point values or other values, having a pre-determined dimensionality.
- the space of possible vectors having the pre-determined dimensionality is referred to as the “embedding space.”
- an embedding of an image is an encoding of the image.
- the output for the image can be a classification output for a classification task for the input image. That is, the output includes a respective score for each of a set of object categories, with the score for an object category representing the likelihood that the image depicts an object belonging to the object category.
- the image processing neural network 110 can be adapted for one or more downstream tasks.
- the training performed by the system 100 on the training data 120 can be referred to as a “pre-training” stage that is performed in order to improve how well and how easily the neural network 110 can be adapted for one or more downstream tasks after being trained on the training data 120.
- the system 100 can train a downstream neural network 130 that includes at least some of the layers of the image processing neural network 110 on training data for the downstream task.
- the downstream task can be an image classification task, as described above.
- the downstream task can be object detection, e.g., open vocabulary object detection, where the output for a given image identifies locations of one or more bounding boxes in the image and, for each bounding box, a category to which an object depicted in each bounding box belongs.
- object detection task is an open vocabulary object detection task, where the set of possible categories can be different for different inputs and are specified by embeddings of category labels generated by a text encoder neural network.
- the system can select the category label embedding having the highest similarity, e.g., in terms of cosine similarity or dot product, with an embedding of the bounding box generated by the downstream neural network.
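- As an illustration of this label-selection step, the following sketch scores a bounding box embedding against a set of category label embeddings using cosine similarity; the embedding dimensionality, label set, and function name are illustrative assumptions rather than details from the specification.

```python
import numpy as np

def classify_box(box_emb: np.ndarray, label_embs: np.ndarray, labels: list) -> str:
    """Return the label whose embedding has the highest cosine similarity
    with the bounding-box embedding."""
    box_emb = box_emb / np.linalg.norm(box_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=-1, keepdims=True)
    return labels[int(np.argmax(label_embs @ box_emb))]

rng = np.random.default_rng(0)
box_embedding = rng.normal(size=64)            # embedding of one predicted box
label_embeddings = rng.normal(size=(3, 64))    # text-encoder label embeddings
print(classify_box(box_embedding, label_embeddings, ["cat", "dog", "bicycle"]))
```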
- the downstream task can be image segmentation. That is, the neural network can be configured to generate an element-level classification output (e.g., a pixel-level classification output) that includes, for each element in the network input, a respective score corresponding to each of multiple categories. For a given element (e.g., for a given pixel), the score for a category indicates a likelihood that the element belongs to the category.
- the categories may be classes of objects, and an element may belong to a category if it is part of an object included in the object class corresponding to the category.
- the downstream task can be image depth prediction.
- the output generated by the neural network identifies, for each pixel in the image, a predicted depth of the scene at the pixel.
- the downstream task can be a video understanding task. That is, the neural network can be configured to process a sequence of video frames to generate an output that characterizes the video frames, e.g., by characterizing whether the video frames depict a person or other agent performing a particular action, by generating a caption that describes the semantic content of the video, by classifying one or more objects in the video and so on.
- the downstream task can be a multi-modal text and vision task, e.g., image captioning, where the input is an image and the output is a text caption describing the input image, or visual question answering, where the input is an image or a video and a question about the image or video and the output is an answer to the question.
- the image processing neural network 110 can have any appropriate architecture that processes a sequence of tokens representing an image to generate the output for the image.
- the neural network 110 can have an architecture that includes multiple self-attention network blocks that each perform self-attention to update the tokens in the sequence.
- Examples of such architectures include Vision Transformers (ViTs) and other ViT variants.
- the neural network can have the architecture described in Dosovitskiy, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, arXiv:2010.11929.
- the neural network 110 can have an architecture that includes multiple network blocks that perform a different type of operation to update the tokens in the sequence.
- An example of such an architecture is an MLP-mixer architecture.
- the neural network can have the architecture described in Tolstikhin, et al., MLP-Mixer: An all-MLP Architecture for Vision, arXiv:2105.01601.
- the neural network 110 can include a sequence of network blocks that each update the tokens in the input sequence.
- the neural network 110 can also include one or more additional neural network layers that process the outputs of the last network block in the sequence to generate the training output.
- the system 100 or another adaptation system can replace the one or more additional neural network layers with a set of additional neural network layers that are specific to the downstream task.
- the system 100 can then train the resulting downstream neural network on training data for the downstream task, either holding the network blocks fixed and training only the new additional layers or training both the new additional layers and the network blocks.
- the neural network 110 divides the input image into patches.
- the neural network 110 then generates a respective embedding for each patch and processes a sequence that includes the embeddings for the patches to generate the training output for the input image.
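- As an illustration of how an image can be divided into patches, the following is a minimal sketch assuming a NumPy image array and an illustrative patch size of 16; the specification does not fix these values, and images whose sides are not multiples of the patch size would need cropping or padding first.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    Assumes H and W are divisible by patch_size.
    """
    h, w, c = image.shape
    rows, cols = h // patch_size, w // patch_size
    return (
        image.reshape(rows, patch_size, cols, patch_size, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(rows * cols, patch_size * patch_size * c)
    )

# Different resolutions yield different numbers of patches.
print(patchify(np.zeros((224, 224, 3))).shape)  # (196, 768)
print(patchify(np.zeros((128, 256, 3))).shape)  # (128, 768)
```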
- the training data 120 includes multiple different training examples, with each training example including at least a training image and, optionally, other data, e.g., a classification label or a corresponding text sequence.
- the training images in the training examples generally have varying resolutions and aspect ratios.
- conventional training schemes for training these neural networks generally require that the training images are resized to a fixed resolution and a fixed aspect ratio, e.g., a square aspect ratio, and then split into a fixed number of patches before they are processed by the neural network 110.
- the system 100 trains the neural network 110 by “packing” multiple images into any given “packed” input sequence 140 that is processed by the neural network 110 during training.
- Training the neural network 110 in this manner has many implications for the training of the neural network 110 and for subsequently fine-tuning and using (at least a portion) of the neural network 110 to perform inference.
- the described training scheme consistently outperforms conventional approaches.
- the described training scheme can result in a neural network 110 that matches the performance of a Vision Transformer (ViT) trained using previously state-of-the-art techniques with 4x less compute.
- Because sequence packing results in multiple images being combined within the same input sequence, there is a substantial increase in the number of training examples processed within the allocated compute budget, which contributes to the increase in performance.
- sequence packing coupled with variable resolution inputs and variable token dropping (described below) enables the system 100 to process five times more images during training than conventional schemes. Additionally, this improved efficiency extends to the fine-tuning process, where similar schemes can be applied.
- a single model demonstrates excellent performance when evaluated on various resolutions, significantly advantaging the neural network 110 in terms of inference cost relative to neural networks trained using conventional schemes.
- FIG. 2 is a flow diagram of an example process 200 for performing a training step during the training of the image processing neural network.
- the process 200 will be described as being performed by a system of one or more computers located in one or more locations.
- a neural network system, e.g., the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 200.
- the system can perform the training of the image processing neural network across multiple training steps.
- the system obtains a batch of training examples and trains the image processing neural network on the training examples in the batch, i.e., by performing an iteration of the process 200.
- the system obtains a batch of training examples, with each training example including a respective training image (step 202).
- the training example includes only the image.
- the training example includes other information, e.g., a corresponding text sequence that describes the image or data identifying a label for the image.
- the batch of training examples can include images with varying resolutions and aspect ratios.
- the system trains the neural network on the original resolutions of the training images in the batch.
- the system makes use of resolution sampling when generating the batch of training images.
- For each training image in the training examples, the system divides the training image into a respective plurality of patches and generates a respective token representing each of the patches (step 204).
- the system can partition each image into fixed size regions. Even though the regions have a fixed size, because different images have different resolutions, different images can be divided into different numbers of patches.
- To generate the token representing each patch, the system generates a patch embedding of the patch and generates a positional embedding of the position of the patch within the corresponding training image.
- the system then combines the patch embedding of the patch and the positional embedding of the patch to generate the respective token representing the patch.
- the system can sum, average, or concatenate the patch embedding and the positional embedding.
- the system can process the intensity values of the pixels in the patch using a patch embedding subnetwork to generate the patch embedding of the patch.
- the patch embedding subnetwork can be a single linear projection layer, can be a multi-layer subnetwork, e.g., an MLP, or have a different appropriate architecture.
- the patch embedding subnetwork is trained jointly with the remainder of the image processing neural network, e.g., is trained using gradients of the loss function, computed as described below.
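- A minimal sketch of this token-generation step is shown below; the dimensions are illustrative, a single linear projection stands in for the patch embedding subnetwork, and random weights replace the jointly trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: a flattened 16x16 RGB patch and a 256-wide token.
PATCH_DIM = 16 * 16 * 3
MODEL_DIM = 256

# A single linear projection standing in for the patch embedding subnetwork;
# in the described training these weights would be learned jointly with the
# rest of the image processing neural network.
W_patch = rng.normal(scale=0.02, size=(PATCH_DIM, MODEL_DIM))

def embed_patches(patches: np.ndarray, pos_embeddings: np.ndarray) -> np.ndarray:
    """Combine patch embeddings and positional embeddings by summation."""
    patch_emb = patches @ W_patch           # (num_patches, MODEL_DIM)
    return patch_emb + pos_embeddings       # one token per patch

patches = rng.normal(size=(196, PATCH_DIM))         # e.g. a 224x224 image
pos_embeddings = rng.normal(size=(196, MODEL_DIM))  # looked up per patch position
tokens = embed_patches(patches, pos_embeddings)
print(tokens.shape)  # (196, 256)
```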
- the system can generate the positional embedding of the position of a given patch within the corresponding training image in any of a variety of ways.
- the system can use one dimensional (1-D) positional embeddings that map each coordinate to a position in a 1-D, flattened representation of the image and are learned jointly with the remainder of the neural network.
- the system can use two-dimensional (2-D) positional embeddings that are learned jointly with the remainder of the neural network.
- a respective embedding for each index in a [maxLen, maxLen] grid is learned, with the grid being indexed with the (x, y) coordinates of each patch and maxLen being a pre-set maximum length of any given dimension of an image.
- P is the number of patches from any given image.
- every combination of (x, y) coordinates must be seen during training.
- the system uses “factorized” positional embeddings.
- the system maintains a separate set of embeddings for each of a set of multiple x coordinates and a separate set of embeddings for each of a set of multiple y coordinates.
- the embeddings are absolute embeddings, so that the embedding of a given coordinate is a function of the absolute index of the coordinate within the image.
- the embeddings are fractional embeddings, so that the embedding of a given coordinate is a function of the ratio of the absolute index of the coordinate within the image to the side length of the image along the corresponding dimension, i.e., the number of patches along the corresponding dimension for the given image.
- When using factorized positional embeddings, to determine the positional embedding for a given patch, the system generates a first embedding of an x coordinate of the position of the patch within the corresponding training image, e.g., by mapping the absolute index of the coordinate to the first embedding, or by mapping the ratio of the coordinate within the image to the side length of the image along the x dimension to the first embedding.
- the system generates a second embedding of a y coordinate of the position of the patch within the corresponding training image, e.g., by mapping the absolute index of the coordinate to the second embedding, or by mapping the ratio of the coordinate within the image to the side length of the image along the y dimension to the second embedding.
- the system then combines the first and second embeddings to generate the positional embedding of the position of the patch.
- the system can sum the first and second embeddings to generate the positional embedding of the position of the patch.
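- The sketch below illustrates one way factorized positional embeddings could be computed; mapping a fractional coordinate to an index in a fixed-size table is an assumed discretization, and the table size and token width are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

MAX_LEN = 64      # assumed maximum number of patches along either side
MODEL_DIM = 256   # assumed token width

# Separate (in practice, learned) embedding tables for x and y coordinates.
x_table = rng.normal(scale=0.02, size=(MAX_LEN, MODEL_DIM))
y_table = rng.normal(scale=0.02, size=(MAX_LEN, MODEL_DIM))

def factorized_pos_embedding(x: int, y: int, width: int, height: int,
                             fractional: bool = True) -> np.ndarray:
    """Sum an x-coordinate embedding and a y-coordinate embedding.

    With fractional=True, the coordinate is first divided by the image's side
    length in patches and then mapped to a table index; with fractional=False,
    the absolute patch index is used directly.
    """
    if fractional:
        xi = int(round(x / width * (MAX_LEN - 1)))
        yi = int(round(y / height * (MAX_LEN - 1)))
    else:
        xi, yi = x, y
    return x_table[xi] + y_table[yi]

# Positional embedding for patch (3, 5) of an image 8 patches wide, 12 tall.
print(factorized_pos_embedding(3, 5, width=8, height=12).shape)  # (256,)
```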
- the system generates one or more input sequences (step 206). Generally, the one or more input sequences collectively represent all of the images in the batch.
- the system generates the input sequences using sequence packing by “packing” tokens from multiple images into the same input sequence.
- Because the images in the batch can have variable resolutions and aspect ratios, one sequence can include tokens from two different images with two different resolutions, two different aspect ratios, or both.
- the system employs a token dropping scheme when generating each input sequence.
- the system can determine, for each image, whether to remove any of the tokens that represent the image from the input sequence and then generate the input sequence with only the tokens that have not been removed from each of the two or more input images.
- the system can determine which images in a given batch are represented in which input sequence in any of a variety of ways.
- the system can generate input sequences using a greedy approach in which the system adds images to the first sequence with enough remaining space. That is, the system can traverse the images in the batch, e.g., according to a random ordering of the images, and can determine whether the tokens representing the image can “fit” within any already generated, partially complete input sequence. If so, the system adds the image to one of the partially complete input sequences. If not, the system places the image in a new sequence. Once no more images can fit in any of the generated sequences, the sequences are filled with padding (“masking”) tokens, yielding the fixed sequence lengths.
- When generating a given input sequence, the system can dynamically choose the resolution or token dropping rate of the final image in the sequence to exactly fit the remaining tokens in the given input sequence.
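- The greedy packing described above could be sketched as follows, where each image is represented only by its token count; the token counts and the maximum sequence length are illustrative, and padding is reported as the remaining space in each sequence.

```python
import random

def greedy_pack(image_lengths: dict, max_len: int, seed: int = 0):
    """Greedily assign images (given by token count) to fixed-length sequences.

    Returns the list of sequences, each a list of (image_id, num_tokens)
    entries, plus the remaining space in each sequence, i.e. the number of
    padding ("masking") tokens that would be appended.
    """
    ids = list(image_lengths)
    random.Random(seed).shuffle(ids)              # traverse in a random order

    sequences, remaining = [], []
    for image_id in ids:
        n = image_lengths[image_id]
        for i, space in enumerate(remaining):     # first sequence with room
            if n <= space:
                sequences[i].append((image_id, n))
                remaining[i] -= n
                break
        else:                                     # nothing fits: open a new one
            sequences.append([(image_id, n)])
            remaining.append(max_len - n)
    return sequences, remaining

# Token counts for five images of different resolutions (illustrative numbers).
packed, padding = greedy_pack(
    {"img_a": 196, "img_b": 64, "img_c": 256, "img_d": 100, "img_e": 49},
    max_len=320)
for seq, pad in zip(packed, padding):
    print(seq, "padding tokens:", pad)
```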
- the system then processes each input sequence using the image processing neural network to generate a respective training output for each corresponding training image (step 208).
- the system processes the input sequence through the sequence of self-attention layers to update the tokens in the input sequence and then generates the training output for each training image from the updated tokens.
- the system trains the image processing neural network on a loss function for the task using the training outputs for the training images in the batch (step 210).
- the loss function can be a contrastive loss function or a classification loss function, e.g., a loss function with a cross-entropy term.
- the system trains the neural network through contrastive learning.
- the training output generated by the image processing neural network for each image is an image embedding of the image in an embedding space.
- the image processing neural network is trained jointly with a text encoder neural network having text encoder neural network parameters and configured to process a text segment to generate a text embedding of the text segment in the embedding space.
- the text encoder neural network can be a Transformer neural network, a recurrent neural network (RNN) or another appropriate type of text processing neural network.
- the system can, for each training text segment, process the training text segment using the text encoder neural network to generate a training text embedding of the training text segment.
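- For the contrastive case, a common choice is a symmetric cross-entropy over image-text similarities (a CLIP-style loss); the specification does not give the exact form, so the sketch below is an assumption in which matching image and text embeddings share a row index and the temperature is illustrative.

```python
import numpy as np

def contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over image-text similarities.

    Matching image/text pairs are assumed to share the same row index; all
    other pairs in the batch act as negatives.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (batch, batch) similarities

    def cross_entropy(l: np.ndarray) -> float:
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))  # correct pair on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(4, 256)), rng.normal(size=(4, 256))))
```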
- the system trains the image processing neural network on an image classification task.
- the training output generated by the image processing neural network can be a classification output and each training example includes a ground truth classification for the training image in the training example.
- the system trains the image processing neural network on a loss function that includes, for each training example, a classification loss relative to the ground truth classification for the training image, e.g., a cross-entropy or other classification loss.
- the system can then adapt the image processing neural network to perform a downstream computer vision task.
- Examples of such tasks include image classification, object detection, image segmentation, depth prediction, video understanding, and multi-modal text and vision tasks.
- FIG. 3 is a flow diagram of an example process 300 for generating an input sequence when the system performs token dropping.
- the process 300 will be described as being performed by a system of one or more computers located in one or more locations.
- a neural network system, e.g., the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 300.
- Token dropping, i.e., the random omission of input patches during training, has been developed to accelerate training.
- typically the same proportion of tokens are dropped from all examples.
- With sequence packing, the system enables continuous token dropping, whereby the token dropping rate can be varied per-image. This provides the faster throughput enabled by token dropping while still allowing the model to process some complete images, reducing the train/inference discrepancy and improving performance at inference time. Further, with packing, the drop distribution can vary throughout training, further improving training performance.
- the system selects two or more of the training images in the batch of training examples (step 302). For example, the system can select the training images that will be represented in the training sequence greedily as described above.
- the system determines, for each of the two or more training images, a respective token dropping rate (step 304).
- the respective token dropping rate is the same for all of the training images in the batch.
- the system can determine the token dropping rate for the batch of training images in any of a variety of ways.
- the system can randomly sample the respective token dropping rate for the batch from a set of possible token dropping rates.
- the system can sample the token dropping rate from a Beta distribution or other distribution over a range of possible token dropping rates.
- the system can determine the respective token dropping rate for the batch according to an index of the training step among training steps performed during the training. That is, the system can maintain a function that maps the index of the training step to a token dropping rate.
- different images in a batch can have different token dropping rates.
- the system can determine the respective token dropping rate for a given training image in the batch according to an index of the training image among training images processed during the training. That is, the system can maintain a function that maps the index of the training image to a token dropping rate.
- the system can determine the token dropping rate based on the resolution of the training image.
- the system can sample the token dropping rate for a given image from a distribution that depends on the resolution of the training image, e.g., that has a mean that is based on the resolution of the training image.
- the system determines, in accordance with the respective token dropping rate for each training image, whether to select any of the tokens representing patches from the two or more training images for removal (step 306).
- the system generates an input sequence that includes only the tokens for the two or more images that were not selected for removal (step 308).
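- A sketch of per-image token dropping is shown below; the Beta distribution parameters, the drop-rate range, and the token shapes are illustrative assumptions, not values from the specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_tokens(tokens: np.ndarray, drop_rate: float) -> np.ndarray:
    """Randomly remove a drop_rate fraction of one image's tokens."""
    num_keep = max(1, int(round(len(tokens) * (1.0 - drop_rate))))
    keep = rng.choice(len(tokens), size=num_keep, replace=False)
    return tokens[np.sort(keep)]                 # preserve patch order

# A different drop rate per image, sampled from a Beta distribution scaled
# to the range [0, 0.5]; the parameters are illustrative.
image_tokens = [rng.normal(size=(196, 256)), rng.normal(size=(64, 256))]
for toks in image_tokens:
    rate = 0.5 * rng.beta(a=2.0, b=5.0)
    kept = drop_tokens(toks, rate)
    print(f"kept {len(kept)} of {len(toks)} tokens (drop rate {rate:.2f})")
```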
- FIG. 4 is a flow diagram of an example process 400 for obtaining a batch of training examples when the system employs resolution sampling.
- the process 400 will be described as being performed by a system of one or more computers located in one or more locations.
- a neural network system e.g., the neural network system 100 of FIG. 1, appropriately programmed, can perform the process 400.
- the system samples an original batch of original training examples (step 402).
- each original training example includes an original training image, with each original training image having a respective original resolution and a respective original aspect ratio.
- the system can sample the original batch from a larger set of training examples that is available to the system for use in training the neural network.
- the system modifies the original images in the original training examples (step 404).
- the system samples a resolution for the original training image and then modifies the original training image to have the sampled resolution while maintaining the original aspect ratio of the original training image. That is, the total number of pixels in the original image is resampled to match the sampled resolution while the aspect ratio is preserved.
- the system can sample the resolution from a fixed set of resolutions, e.g., by sampling the resolution from a uniform distribution over the resolutions in the fixed set.
- the system generates a batch of training examples that includes the modified images (step 406). That is, the training example corresponding to any given original training example includes the modified training image generated from the original training image in the training example and any other information that was in the original training example, e.g., a training text if the neural network is being trained through contrastive learning or a ground truth label if the neural network is being trained to perform classification.
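- A sketch of resolution sampling is shown below; it treats the sampled resolution as a square side length that fixes a pixel budget and rescales the image to roughly that budget while preserving the aspect ratio. The fixed set of resolutions and this interpretation of “resolution” are illustrative assumptions.

```python
import math
import random

def sample_resolution_dims(orig_h: int, orig_w: int,
                           resolutions=(64, 128, 224, 384),
                           seed: int = 0) -> tuple:
    """Sample a resolution uniformly from a fixed set and return new (h, w).

    The sampled resolution is treated as a square side length, so the target
    pixel budget is resolution**2; the image is rescaled to roughly that many
    pixels while preserving its original aspect ratio.
    """
    res = random.Random(seed).choice(resolutions)
    scale = math.sqrt((res * res) / (orig_h * orig_w))
    return max(1, round(orig_h * scale)), max(1, round(orig_w * scale))

# A 480x640 image rescaled to roughly 224*224 pixels at a 3:4 aspect ratio.
print(sample_resolution_dims(480, 640))
```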
- FIG. 5 shows an example 500 of generating and processing an input sequence during a given training step during the training.
- the image processing neural network is a neural network that includes one or more self-attention layers that each apply self-attention.
- the system selects three input images 502 (image 1, image 2, and image 3).
- the system then “patchifies” 504 each of the input images 502 to divide each of the images 502 into patches.
- the system can perform token dropping 506 on each input image 502 to determine whether any of the tokens representing the patches of any of the images should be removed.
- the system then generates an input sequence 510 (“packed sequence”) that includes only the tokens that were not removed from the three input images.
- the input sequence 510 includes a respective subsequence for each of the three input images. More specifically, the input sequence 510 includes a concatenation of a plurality of subsequences that each include tokens representing patches from a respective one of the three training images.
- the system has appended two masking tokens in the “pad” portion of the input sequence 510 so that the input sequence 510 has the maximum number of tokens.
- the system then processes the input sequence 510 using a sequence of self-attention blocks 520.
- the self-attention network blocks 520 each apply self-attention over the tokens in the input sequence 510 to update at least a subset of the tokens in the input sequence. For example, the masking tokens in the “pad” section of the input sequence 510 can be masked out so they are not updated by the self-attention layer.
- the self-attention is masked so that the tokens for each corresponding image are only updated using the tokens for the corresponding image and not the tokens for any other corresponding images or padding.
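- One way to realize this masking is to build a boolean attention mask from per-token image ids, as sketched below; the use of -1 to mark padding tokens is an illustrative convention.

```python
import numpy as np

def packed_attention_mask(image_ids: np.ndarray) -> np.ndarray:
    """Boolean attention mask for a packed sequence.

    image_ids[i] is the id of the image that token i came from, with -1 used
    here to mark padding tokens. Token i may attend to token j only if both
    tokens belong to the same image; padding attends to nothing.
    """
    same_image = image_ids[:, None] == image_ids[None, :]
    real = image_ids != -1
    return same_image & real[:, None] & real[None, :]

# Packed sequence: 3 tokens from image 0, 2 from image 1, 2 padding tokens.
ids = np.array([0, 0, 0, 1, 1, -1, -1])
print(packed_attention_mask(ids).astype(int))
```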
- the neural network applies a pooling operation 530, e.g., global average pooling or another type of pooling, to the tokens to generate a respective pooling representation for each of the three images corresponding to the input sequence.
- the neural network can generate the respective training output for each image using the pooling representation for the image.
- the system can use the pooling representation as the embedding of the input image or can apply one or more transformations, e.g., learned projections or other learned operations, to the pooling representation to generate the embedding.
- the system can process the pooling representation using an output subnetwork to generate the classification output.
- the output subnetwork can include one or more fully-connected layers followed by a softmax layer.
- the pooling operation 530 is masked so that the pooling representation for each corresponding image is only generated using the tokens for the corresponding image and not the tokens for any other corresponding images, i.e., only the tokens for the corresponding image and not the tokens for any other corresponding images or padding are pooled to generate the pooling representation.
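- The masked pooling could be sketched as follows, again using per-token image ids with -1 marking padding; average pooling and the token dimensions are illustrative.

```python
import numpy as np

def masked_average_pool(tokens: np.ndarray, image_ids: np.ndarray) -> dict:
    """Pool a packed token sequence into one representation per image.

    Only the tokens belonging to an image contribute to that image's pooled
    representation; padding tokens (marked with id -1) are ignored.
    """
    pooled = {}
    for image_id in np.unique(image_ids):
        if image_id == -1:
            continue
        pooled[int(image_id)] = tokens[image_ids == image_id].mean(axis=0)
    return pooled

rng = np.random.default_rng(0)
tokens = rng.normal(size=(7, 256))          # packed sequence of 7 tokens
ids = np.array([0, 0, 0, 1, 1, -1, -1])     # two images plus two padding tokens
reps = masked_average_pool(tokens, ids)
print({k: v.shape for k, v in reps.items()})  # {0: (256,), 1: (256,)}
```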
- FIG. 6 shows an example 600 of the performance of the described techniques.
- the example 600 shows the performance of various architectures trained using the described techniques relative to the same architectures trained using a ViT training technique.
- the described techniques outperform the ViT models along a variety of dimensions: model accuracy given pre-training hours, fine-tuned accuracy at a given pre-training cost, and fine-tuned accuracy at a given inference cost.
- Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
- the term “database” is used broadly to refer to any collection of data; the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations.
- the index database can include multiple collections of data, each of which may be organized and accessed differently.
- the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
- an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
- the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
- Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- the central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser.
- a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
- Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
- Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a Jax framework.
- Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client.
- Data generated at the user device e.g., a result of the user interaction, can be received at the server from the device.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
- Facsimile Image Signal Circuits (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2024273693A AU2024273693A1 (en) | 2023-05-17 | 2024-05-17 | Sequence packing for training image processing neural networks |
| CN202480029247.8A CN121100341A (en) | 2023-05-17 | 2024-05-17 | Sequence Packaging for Training Image Processing Neural Networks |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363467296P | 2023-05-17 | 2023-05-17 | |
| US63/467,296 | 2023-05-17 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024238949A1 true WO2024238949A1 (en) | 2024-11-21 |
Family
ID=91585894
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/030005 Pending WO2024238949A1 (en) | 2023-05-17 | 2024-05-17 | Sequence packing for training image processing neural networks |
Country Status (3)
| Country | Link |
|---|---|
| CN (1) | CN121100341A (en) |
| AU (1) | AU2024273693A1 (en) |
| WO (1) | WO2024238949A1 (en) |
-
2024
- 2024-05-17 AU AU2024273693A patent/AU2024273693A1/en active Pending
- 2024-05-17 WO PCT/US2024/030005 patent/WO2024238949A1/en active Pending
- 2024-05-17 CN CN202480029247.8A patent/CN121100341A/en active Pending
Non-Patent Citations (5)
| Title |
|---|
| CHEN JOYA ET AL: "DROPIT: DROPPING INTERMEDIATE TENSORS FOR MEMORY-EFFICIENT DNN TRAINING", 2 March 2023 (2023-03-02), XP093197629, Retrieved from the Internet <URL:https://arxiv.org/pdf/2202.13808> [retrieved on 20240822] * |
| DOSOVITSKIY ET AL.: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv:2010.11929 |
| GAO DIFEI ET AL: "MIST : Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 19 December 2022 (2022-12-19), pages 14773 - 14783, XP093196938, Retrieved from the Internet <URL:https://arxiv.org/pdf/2212.09522> [retrieved on 20240821], DOI: 10.1109/CVPR52729.2023.01419 * |
| TOLSTIKHIN ET AL.: "MLP-Mixer: An all-MLP Architecture for Vision", arXiv:2105.01601 |
| YUANHAO XIONG ET AL: "Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 March 2023 (2023-03-28), XP091470428 * |
Also Published As
| Publication number | Publication date |
|---|---|
| AU2024273693A1 (en) | 2025-11-20 |
| CN121100341A (en) | 2025-12-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3732631B1 (en) | Neural architecture search for dense image prediction tasks | |
| CN109960734B (en) | Question Answering for Data Visualization | |
| US11869170B2 (en) | Generating super-resolution images using neural networks | |
| US20220188636A1 (en) | Meta pseudo-labels | |
| CN109389027B (en) | List structure extraction network | |
| EP3732627B1 (en) | Fast decoding in sequence models using discrete latent variables | |
| EP4095758A1 (en) | Training large-scale vision transformer neural networks | |
| EP4196917A1 (en) | Processing images using self-attention based neural networks | |
| US11348203B2 (en) | Image generation using subscaling and depth up-scaling | |
| EP3701429A1 (en) | Auto-regressive neural network systems with a soft attention mechanism using support data patches | |
| US11386114B2 (en) | Structure-based transformers with localization and encoding for chart question answering | |
| US12400432B2 (en) | Memory-optimized contrastive learning | |
| CN112037239A (en) | Text guidance image segmentation method based on multi-level explicit relation selection | |
| US12283104B2 (en) | Method, electronic device, and computer program product for extracting target frame | |
| US12380712B2 (en) | Unified scene text detection and layout analysis | |
| AU2024273693A1 (en) | Sequence packing for training image processing neural networks | |
| US20220036172A1 (en) | Olfactory predictions using neural networks | |
| KR20250174955A (en) | Sequence Packing for Image Processing Neural Network Training | |
| US20250356635A1 (en) | Performing computer vision tasks using guiding code sequences | |
| US20240378858A1 (en) | Training generative models for generating stylized content | |
| US20250078350A1 (en) | Reflowing documents to display semantically related content | |
| WO2025251082A1 (en) | Neural networks with nested mixture-of-experts layers | |
| WO2024206507A1 (en) | Region-aware pre-training for computer vision tasks | |
| WO2024138177A1 (en) | Recurrent interface networks | |
| WO2025117964A1 (en) | Visual entity recognition using generative neural networks |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 24734416 Country of ref document: EP Kind code of ref document: A1 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: 2024734416 Country of ref document: EP |
|
| WWE | Wipo information: entry into national phase |
Ref document number: AU2024273693 Country of ref document: AU |
|
| ENP | Entry into the national phase |
Ref document number: 2024734416 Country of ref document: EP Effective date: 20251023 |
|
| WWE | Wipo information: entry into national phase |
Ref document number: KR1020257038392 Country of ref document: KR |
|
| ENP | Entry into the national phase |
Ref document number: 2024273693 Country of ref document: AU Date of ref document: 20240517 Kind code of ref document: A |