WO2024167491A1

WO2024167491A1 - Efficiently performing computations of a multi-output fully convolutional network

Info

Publication number: WO2024167491A1
Application number: PCT/US2023/012634
Authority: WO
Inventors: Tushar Kumar; Soorgoli Ashok HALAMBI; Arun Chauhan; Dong Hyuk Woo
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2023-02-08
Filing date: 2023-02-08
Publication date: 2024-08-15
Anticipated expiration: 2025-08-08
Also published as: KR20250127761A; TW202433337A

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating multiple outputs for inputs of different sizes using a fully convolutional neural network. One of the methods include: receiving a new input, the new input having a first size that is different from a fixed size that the fully convolutional neural network is configured to process; determining multiple fixed-size input tiles from the new input; and generating respective outputs for the multiple fixed-size input tiles using the fully convolutional neural network. For each of multiple output layers, respective fixed-size output tiles are generated and based on the respective fixed-size output tiles, a respective final output for the output layer is generated, which is equivalent to an output that would be generated from the output layer by processing the new input using the fully convolutional neural network configured for processing inputs with the first size.

Description

EFFICIENTLY PERFORMING COMPUTATIONS OF A MULTIOUTPUT FULLY CONVOLUTIONAL NETWORK

BACKGROUND

[0001] This specification generally relates to efficiently performing inference computations of a multi-output fully convolutional network.

[0002] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of network parameters.

[0003] A fully convolutional network is a neural network that includes only convolutional neural network layers and, optionally, other layers that are made up solely of components that only operate on local input regions, e.g., pooling layers and element-wise layers, e.g., those that apply an element-wise non-linear activation function. Specifically, unlike other types of convolutional neural networks, a fully convolutional network does not have any fully connected layers. A fully convolutional network can be configured to make pixel-wise predictions of an input (e.g., an image with multiple pixels). In other words, the fully convolutional network can be used to make a respective prediction for each pixel of the input. An example of a task that requires making pixel-wise prediction is image segmentation, in which a neural network is configured to generate, for each pixel of the input image, a respective score for each of multiple classes.

SUMMARY

[0004] This specification generally describes techniques for performing inference computations of a neural network. The neural network is a fully convolutional network configured to generate multiple outputs for each of inputs with different sizes.

[0005] A fully convolutional network (FCN) can include at least one or more convolutional neural network layers, and optionally, pooling layers, transposed convolution layers, and element-wise layers (e.g., layers applying element-size activation functions). An FCN can be deployed on a hardware accelerator to generate pixel-wise predictions of an input (e.g., an input image with multiple pixels).

[0006] An FCN can be configured to generate one output for processing an input. Such a FCN can be constructed as a sequence of network layers, and can be configured to include an input layer as the first layer to receive an input, and an output layer configured to generate an output for the input.

[0007] In some implementations, an FCN can be configured to generate multiple outputs for processing an input. For simplicity, such a FCN is referred to as multioutput FCN in the specification. A multi-output FCN can include an input layer as the first layer for receiving a common input, and more than one output layer that each generates a respective output as a result of processing the common input.

[0008] A multi-output FCN can include different network structures. For example, a multi-output FCN can be constructed to include tree-like structures with one or more branch points. The term “branch points” used throughout the specification refers to a network layer that is a preceding layer of more than one succeeding layer in the FCN, i.e., an output generated by the network layer is provided as an input for each of two or more succeeding layers. As another example, a multi-output FCN can be constructed to include a directed acyclic graph (DAG) structure. A DAG structure can include a branch point (similar to that of a tree structure), from which two or more branches or paths start. To distinguish from a tree structure, the term “branch point in a DAG” structure is also referred to as an “initial point” in this specification. A DAG structure can further include a “re-convergence point.” The term “re-convergence point” used in this specification can broadly represent a network layer at which the diverted two or more branches merge together. In some implementations, a multi-output FCN is configured to include both trees and DAGs.

[0009] Multi-output FCNs are generally more efficient than single-output FCNs, because multi-output FCNs can generate multiple outputs for processing common inputs, which saves time and computation power compared to single-output FCNs. Multi-output FCNs can be deployed and implemented for different applications. For example, in face recognition related applications, multi-output FCNs can process a single input image frame to generate different outputs. The different outputs can include data/predictions related to face detection and/or recognition, background detection, object classification, and depth image generation. As another example, a multi-output FCN can generate a first output related to a probability map of whether each input pixel is part of a detected object in an input image, and a second output indicating a detection score for each region of the input image. In this example, the size of the first output can have the same size as the input image, and the second output can have a size based on the number of regions in the input image.

[0010] Although an FCN, single-output or multi-output, is in principle able to process inputs with arbitrary sizes, an FCN can be compiled and deployed on hardware accelerators before any inference operations are performed. The compilation normally requires a specified input size for FCNs. To process a new input with a different size, a system might need to re-compile a compiled FCN for the new input size, which reduces efficiency for the inference operations, e.g., increasing overhead time (or idle time) and computation power.

[0011] Alternatively, a system can dynamically deploy an FCN. However, dynamically deploying an FCN to be capable of consecutively processing inputs of different sizes can raise issues in computational cost. First of all, network parameters including data structures (e.g., matrix dimensions for computations, or paddings, strides, filter sizes, and scale factors for network layers) scale with the size of the input and a change in the input size might require a shuffle of the current network parameters, which can lead to an increase of downtime (e.g., overhead) for a system including one or more hardware accelerators. Moreover, to allow for a dynamic input size, a host needs to send instructions for performing inference computations using more general execution mechanisms. For example, the host may allocate a larger memory for data storing (at the cost of being slower, and a portion of the large memory might not be used), perform more checks on vector or tensor sizes, or more frequently and dynamically change numbers of computing units used for performing calculations during parallel computation. Therefore, in practice, an FCN is usually statically deployed (e.g., compiled with fixed network hyper-parameters) on one or more hardware accelerators and configured to receive input of a fixed size, to avoid issues raised by dynamic deployment.

[0012] To perform inference operations in a multi-output FCN for a common input, a system using existing techniques can re-compile the multi-output FCN when the system needs to process a new input with a different size. Alternatively, the system can replace a multi-output FCN with multiple single-output FCNs (e.g., splitting the entire multi-output FCN or splitting only at branch points), and process the common input using the multiple single-output FCNs. The multiple single-output FCNs can be re- compiled to process different input sizes. Alternatively, each of the single-output FCNs can be statically compiled for a particular size, and the system can tile inputs into multiple fixed-size inputs and process the multiple fixed-size inputs to generate respective outputs for the multiple single-output FCNs.

[0013] However, the above-noted existing techniques can suffer from poor efficiency and increasing computation cost. For example, a system might need to recompile a multi-output FCN too frequently, which leads to increasing overhead time. As another example, a system might need additional time and computation to perform operations in multiple single-input FCNs rather than doing so using a multi-output FCN. In addition, sometimes it is difficult to split a multi-output FCN into multiple single-output FCNs with equivalent inference accuracy.

[0014] The techniques described in this specification can solve at least the abovenoted problems.

[0015] According to an aspect, the described techniques relate to a method performed by one or more computers. The method includes receiving a new input to be processed by a fully convolutional neural network deployed on a hardware accelerator. The new input has a size that is different from a fixed size that the fully convolutional neural network is configured to process when deployed on the hardware accelerator. The fully convolutional neural network includes multiple network layers, including an input layer as a first layer for receiving the new input, and multiple output layers, each configured to generate a respective output for processing the new input. The method further includes determining multiple fixed-size input tiles from the new input. Each of the fixed-size input tiles has the same fixed size. The multiple fixed-size input tiles are provided to the hardware accelerator to generate respective outputs using the fully convolutional neural network.

[0016] For each of the multiple output layers, respective fixed-size output tiles are generated for the multiple fixed-size input tiles. The respective fixed-size outputs include one or more inaccurate pixel-wise values. For each output layer of the multiple output layers, a respective final output for the output layer is generated from the respective fixed-size outputs. The respective final outputs are equivalent to respective outputs that would be generated from the output layers by processing the new input using the fully convolutional neural network configured for processing inputs with the first size. [0017] The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. [0018] The described techniques allow a statically-compiled, multi-output FCN deployed on a hardware accelerator to process input data with unknown or varying sizes and generate multiple outputs for each input data. The described techniques are also robust to different network structures of a multi-output FCN. For example, a multioutput FCN can include a tree structure. At least one layer in the tree structure is a branch point such that an output generated by the layer is processed by multiple subsequent layers, producing multiple branches and, ultimately, multiple model outputs. In some situations, a multi-output FCN can have a DAG structure. Unlike a tree structure, a DAG structure can include one or more re-convergence points and branch points. Each re-convergence point is a layer where multiple outputs from previous layers are processed as inputs by the layer. In other words, multiple branches introduced by branch points may re-converge at a re-convergence point.

[0019] In general, a DAG structure is a more general term than a tree structure because the DAG structure includes both branch points and re-convergence points, whereas a tree structure includes only branch points. Furthermore, the described techniques are advantageous because the system does not need to identify whether a multi-output FCN belongs to a tree structure or a DAG structure at the topological level. Nor does the system need information identifying whether a multi-output FCN belongs to a tree structure or a DAG structure. Instead, one or more algorithms performed during the tiling process can determine the existence of branch points and re-convergence points and process inputs and outputs according to the properties of branch points and re-convergence points. This way, the described techniques are more robust and efficient for processing multi-output FCNs with various structures.

[0020] The described techniques can generate accurate multiple outputs for a given input with a different size using a statically-compiled multi-output FCN as if the outputs are generated by the multi-output FCN but compiled for the different input size. More specifically, the system can determine a tiling pattern for the model that includes input tile size, output tile sizes for different outputs, and alignment requirements for each paths that spans from the first layer to each output layer, determine which pixel values in output tiles are accurate or inaccurate, and combine pixels with accurate values together to form respective outputs for the multi-output FCN. [0021] The described techniques can perform inference operations of different tiles (e.g., fixed-size inputs) from a single input in parallel, taking advantage of the data sharing properties between adjacent accelerators to decrease memory usage. For example, the described techniques can optimize data transfer across overlapping regions of adjacent fixed-size inputs according to the input or output data with various sizes. In some implementations, the described techniques can perform, using one or more hardware accelerators, inference operations of different tiles in parallel, taking advantage of the data sharing properties between adjacent accelerators to decrease memory usage.

[0022] Furthermore, the described techniques are robust to different hardware accelerator architectures. The described techniques can automatically identify hardware constraints or requirements, such as system memory bandwidth. The described techniques can efficiently tile arbitrary large-size inputs to fit a multi-output FCN deployed on the hardware accelerator based on the identified hardware constraints or requirements. For example, for accelerators with advanced memory addressing capabilities (e.g., accelerators including direct memory access (DMA) engines), the described techniques can reduce or eliminate overhead time related to data manipulation for tiling inputs and stitching fixed-size outputs. For accelerators with a simpler architecture or less memory bandwidth, the described techniques can determine instructions of preparatory work and assign them to a supporting host processor. Once the instructions are executed, the supporting host processor can perform operations such as generating fixed-size input tiles from an input or packing up fixed-size output tiles to generate a final output.

[0023] Moreover, the techniques described in this specification are distinct and advantageous over conventional data parallelization techniques. In general, data parallelization techniques can divide input data (e.g., an input image) into multiple disjoint portions (e.g., segments of the input image) and assign the multiple portions to multiple hardware components (e.g., hardware accelerators) to process the portions independently and in parallel to generate partial outputs. After all of the portions are processed by the hardware components, a system configured to perform the data parallelization techniques can generate a final output by aggregating the partial outputs. As long as the operations are correctly performed by each hardware component for respectively designated portions, the system does not need to consider whether any parts of the partial outputs are not suitable or inaccurate for generating the final output. [0024] However, in general, an FCN generally does not take advantage of data parallelization techniques because an output generated by an FCN processing a portion of an input image (e.g., a tile of an input image) can include one or more incorrect or inaccurate pixel-wise values. This is because the computation of the system processing a tile of input can involve “neighbor pixels” so that a portion of the output pixels can be inaccurate.

[0025] The term “neighbor pixels” used throughout the specification represents pixels of one-or-more-pixel width, which surround a boundary of an input to the FCN. The neighbor pixels can also include pixels additionally added to the boundary of the input through zero padding. The region including the neighbor pixels is referred to as the “neighbor pixel region” in the specification. The neighbor pixel region can include a width of one or more pixels. In some implementations, the width of the neighbor pixel region can be determined based on the characteristics of the fully convolutional network model. The neighbor pixels of an input tile can have non-zero values in the original input image, but when the input tile is provided to an FCN as an input, the neighbor pixels of the input tile are set or considered as zero values . This inadvertently using zero values for non-zero values in neighbor pixels can cause inaccurate pixel-wise outputs in output tiles.

[0026] Therefore, it is problematic for a system to simply combine fixed-size outputs without determining and discarding inaccurate data when using a FCN. The system needs to determine both accurate data (e.g., valid values) and inaccurate data (e.g., dummy pixel values) based on characteristics of an FCN.

[0027] Furthermore, the described techniques can perform “tiling and stitching” analysis both online and offline. For offline analysis, a host processor can re-use parameters that are saved from a statically-compiled, multi-output FCN, and apply the parameters for processing different input data using the same compiled FCN. The previously saved parameters can include at least a tiling pattern such as alignment information for the compiled FCN, a fixed-size for tiling different input, overlapping regions in both input and output tiles, and dummy and valid regions in fixed-size outputs.

[0028] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims. BRIEF DESCRIPTION OF THE DRAWINGS

[0029] FIG. 1 shows an example inference system for performing inference computations in a statically compiled, multi-output FCN configured to process inputs with different sizes.

[0030] FIG. 2 illustrates an example inference process using an example inference system.

[0031] FIG. 3 illustrates an example multi-output FCN having a tree structure.

[0032] FIGS. 4A and 4B illustrate a respective example DAG structure with a respective re-convergence point.

[0033] FIG. 5 illustrates an example multi-output FCN having a DAG structure.

[0034] FIG. 6 illustrates an example fixed-size input with neighbor pixel region and an example fixed-size output with dummy region.

[0035] FIG. 7 illustrates an example process for performing inference computations in a multi-output FCN for inputs with different sizes.

[0036] FIG. 8 illustrates an example backtracking stack.

DETAILED DESCRIPTION

[0037] The techniques described below can allow a statically-compiled, multioutput FCN deployed (or to be deployed) on a hardware accelerator to effectively process inputs of different sizes and generate multiple outputs for each of the inputs. [0038] FIG. 1 shows an example inference system 100 for performing inference computations in a statically-compiled, multi-output FCN 115 configured to process inputs with different sizes. The input size can represent a total number of elements in an input. For example, the input size for a two-dimension image can have a size of 50 by 50 pixels, 100 by 100 pixels, 100 by 150 pixels, 200 by 100 pixels, 400 by 400 pixels, or other suitable sizes.

[0039] The inference system 100 is an example of a system implemented on one or more computers in one or more locations, in which systems, components, and techniques described below can be implemented. Some of the components of the inference system 100 can be implemented as computer programs configured to run on one or more computers.

[0040] As shown in FIG. 1, the inference system 100 can be configured to process input data 150 to generate output data 170 using a multi-output FCN compiled and deployed for processing fixed-size inputs, where the fixed size is different from sizes of the input data 150.

[0041] A multi-output FCN can include multiple network layers, for example, a first layer as an input layer for receiving an input, and one or more output layers each configured to generate a respective output as a result of processing the input. Therefore, a multi-output FCN, once properly trained and deployed, can generate multiple outputs as a result of processing a single input.

[0042] A multi-output FCN can further include one or more “paths” or “branches.” The term “path” or “branch” used throughout the specification broadly represents a sequence of network layers that span from the first layer and one of the output layers. [0043] A multi-output FCN can include different structures, for example, a multioutput FCN can include a tree structure, with one or more branch points (e.g., particular network layers) at which two or more branches diverge. As another example, a multioutput FCN can include a directed acyclic graph (DAG) structure, which includes an initial point and a re-convergence point, where the initial point is a particular network layer, similar to a branch point in a tree structure, that two or more branches diverge, and the re-convergence point (e.g., another network layer) at which the two more branches re-converge. The details regarding the tree and DAG structures and process of generating a tiling pattern for FCNs with such structures are described in connection with FIG. 3, FIG. 4A, and FIG. 4B.

[0044] A multi-output FCN 115 can be implemented for performing tasks related to object classification and detection (e.g., human face detection), image segmentation, color-image generation, image super-resolution, image completion, or image colorization, to name just a few examples. For example, a multi-output FCN can be used to process a grayscale image that captures a person and a dog, and generate a first output which represents a color image corresponding to the grayscale image, a second output which determines whether there is a cat in the input image (i.e., object detection), and a third output indicating pixel-wise classification for the input image (i.e., semantic segmentation which labels each pixel in the input image to a class, e.g., a person, a dog, or a cat).

[0045] A multi-output FCN 115 is advantageous over a single-output FCN 115 as it can generate multiple outputs for processing a single input using less time, has a smaller size, and requires less computation power and/or memory bandwidth relative to generating multiple outputs using separate single-output FCNs. In some implementations, a multi-output FCN 115 can be implemented for processing facial recognition tasks, such as facial detection, facial recognition, background detection, boundary detection, and depth image generation.

[0046] For image segmentation tasks, the input data 150 can be an image input, the inference system 100 can generate output data 170 having pixel-wise predictions, i.e., a respective prediction for each pixel of the input image or for each pixel of an output image being generated by the FCN 115. For example, a compiled multi-output FCN 115 can process a single input image that includes a few human faces and other objects at a scene, and generates semantic outputs that indicates pixels belonging to a human face, pixels belonging to a background tree, vehicle, or animal, and the depth information between different pixels. In some implementations, the output data 170 can include a respective score distribution for each pixel that assigns a respective score to each of multiple categories. Given that, the system 100 can detect an existence, a shape, and a location of an object from an input image.

[0047] For image super-resolution tasks, the inference system 100 can increase image resolution for an input image by predicting pixels to be added around each input pixel, using the compiled multi-output FCN 115. Given that, the output image 170 can have a higher resolution than the input image 150, and one or more output pixels can be associated with each pixel in the input image.

[0048] The input data 150 can have one or more sizes different from the fixed size that the compiled multi-output FCN 115 is configured to process inputs of. For example, the input data 150 can include multiple image frames, each having a respective size different from the fixed size.

[0049] The generated output data 170 are outputs generated by the system 100 performing inference operations of a trained, statically-deployed, multi-output FCN model 115 for processing input data 150. In particular, the output data 170 can include multiple outputs for processing a single input using the compiled FCN model 115. In some implementations, the multiple outputs from respective output layers can have different sizes for processing a common input using the compiled FCN model 115. For example, for an input size of 100 by 100 pixels, the different outputs can include a size of 20 by 20 pixels, 50 by 50 pixels, 25 by 100 pixels, or other suitable sizes.

[0050] The inference system 100 can compile, using a compiling engine 160, a multi-output FCN for processing fixed-size inputs. To compile the multi-output FCN 115, the host 130 can receive data 155 representing a multi-output FCN, compile the FCN model, and generate instructions (e.g., binary data) to deploy the FCN model on one or more hardware accelerators 110.

[0051] The host 130 can include a compile engine 160 for statically compiling the multiple-output FCN 155. In general, the compiling engine 160 can decode program codes written in any suitable high-order languages and encoded with data representing characteristics of an FCN, into machine-readable binary codes on a hardware accelerator. The data representing characteristics of an FCN can include hyperparameters defining the structure of the FCN (e.g., input size, number of layers, number of nodes in each layer, layer types and positions, and padding, stride, filter size, and scale factor for one or more layers), and layer weights obtained from a training process. [0052] During compiling, the system 100 needs to allocate respective computing resources based on characteristics of an FCN. For example, the system 100 needs to allocate respective data structures to accommodate respective calculations for performing inference computations, e.g., data structures allocated for layer weight matrices, activation inputs, and outputs, that are dependent on the input size. As another example, the system needs to allocate respective memories for storing respective data structures and associated computation results during performing inference operations for the deployed FCN.

[0053] In general, the statically-compiled, multi-output FCN 115 can process any suitable inputs of the same fixed-size (e.g., fixed-size inputs 138). However, input data 150 can include different inputs with sizes different from the fixed size. To efficiently process inputs with different sizes, the inference system 100 can include mechanisms to process different inputs to obtain multiple fixed-size inputs and generate each of the multiple outputs by processing the multiple fixed-size inputs along a respective branch or path. The details of the mechanism are described below.

[0054] The inference system 100 can perform a “tiling and stitching” analysis for an input according to the characteristics of a multi-output FCN 155. The system 100 can further determine a tiling pattern for each input. The tiling pattern can include data representing a tiling size for compiling a multi-output FCN 155 (e.g., the fixed-size). The tiling pattern can further include how different fixed-size tiles are generated from an input, e.g., overlapping between tiles, arrangements of input tiles, and stitching processes for output tiles based on the tiling pattern. The tiling pattern is generated under different constraints or requirements, for example, an output tile generated from an input tile should have at least one valid pixel value. As another example, the coordinate shifts in both input and output tiles should satisfy particular alignment requirements inherent in a multi-output FCN. As another example, the tiling pattern should ensure that each pixel value of each output of multiple outputs (e.g., each output generated using a multi-output FCN compiled particularly for the corresponding input size) should be generated accurately in at least one of multiple output tiles. The details of performing the “tiling and stitching” analysis are described below.

[0055] After performing the “tiling and stitching analysis,” the system 100 can obtain data stored in a data structure (e.g., a data tuple) for performing the inference processes using the statically-compiled, multi-output FCN 115. More specifically, the system 100 (or the host 130) can include a tiling engine 135 and a stitching engine 140 to assist the process. The tiling engine 135 can be configured to tile an input into multiple fixed-size inputs 138 based on a tiling pattern. When an input to the tiling engine 135 has a smaller size than the fixed size, the system 100 can pad zeros around the input to reach the fixed size, and provide it to the hardware accelerator 110 for performing inference computations.

[0056] The system 100 can provide the fixed-size inputs 138 to hardware accelerators 110 to perform inference operations in the multi-output FCN 115. The system 100 can generate multiple fixed-size outputs 133 from different output layers for processing each fixed-size input 138 and provide the fixed-size outputs 133 to the stitching engine 140 to obtain output data 170. To generate the output data, the stitching engine 140 is configured to “stitch” multiple fixed-size outputs to generate different outputs for an input as if the different outputs are generated for the input using a multi-output FCN compiled for the size of the input. The stitching process is quite efficient as the system 100 obtains, after determining a tiling pattern, coordinate information for all valid pixels in the fixed-size output. The system can adopt a particular algorithm for the stitching process, which will be described in more details below.

[0057] In some implementations, the tiling and stitching process do not have to be performed on a host 130. For example, any suitable hardware accelerators including GPUs and CPUs can perform the tiling and stitching process. Moreover, the tiling and stitching process can be performed at different physical locations off the host. For example, the tiling can be performed by a first set of accelerators at a first location, the stitching process can be perform by a second set of accelerators at a second location, and the host 130 can be located at a third location and configured to receive outputs from the second set of accelerators. The accelerators and hosts are communicatively connected, either physically or wireless connected at one or more locations.

[0058] In some implementations, one or more hosts (e.g., offline analysis/compilation hosts) different from the host 130 can perform the tiling-pattern analysis offline or ahead of time off the host 130, and compile and deploy the multioutput FCN 155 on the host 130 (e.g., one or more communicatively coupled computers), or deploy the multi-output FCN 155 as an “application” on one or more edge devices (e.g., cell phones or tablets) for processing inputs of unknown sizes.

[0059] The system 100 can include any suitable type of hardware accelerator 110 for performing inference computations of an FCN model. For example, the hardware accelerator 110 can be a CPU, GPU, or TPU. The hardware accelerator 110 can include components such as memory for storing parameters of the FCN model. Moreover, the hardware accelerator 110 can include one or more computing units for parallel computation.

[0060] The host 130 can communicate with the hardware accelerator 110 by transferring data, or instructions, or both. The host 130 and the hardware accelerator 110 can communicate through wired or wireless connections and, in some cases, can be located remotely from each other. For example, the host 130 can be a server at a different physical location from where the accelerator 110 is located.

[0061] FIG. 2 illustrates an example inference process 200 using an example inference system 245. The example inference system 245 can be equivalent to the inference system 100 of FIG. 1, and when appropriately programmed, the inference system 245 can perform the process 200.

[0062] Distinguishing from existing techniques where a system re-compiles a multi-output FCN for a new input with a size different from the size that the FCN has been compiled for, the inference system 245 can process inputs with different sizes 210 using a statically-compiled, multi-output FCN 235. As shown in FIG. 2, the inference system 245 can process each input of the inputs 210 to generate multiple outputs, e.g., a first output 225a, a second output 225b, ... , and the Nth output 225n. Each of the multiple outputs are generated from a respective output layer in the multi-output FCN 235, and can have a respective output size. It should be noted that the techniques described herein are different from using multiple statically-compiled single-output FCNs where each of the single-output FCNs can process a common input to generate a respective output. The statically-compiled, multi-output FCN is a single FCN model having an input layer as the first layer and multiple output layers for generating multiple outputs. Examples of the structure of a multi-output FCN is described in greater detail in connection with FIG 3, FIG. 4A, and FIG. 4B.

[0063] The general process of performing a “tiling and stitching” analysis includes, by system 100, selecting an input tile size that is compatible with each output size of output layers in the multi-output FCN 235, and generating a tiling pattern based on alignment information and valid region information. The details of the “tiling and stitching” analysis are described in connection with FIG. 3, FIG. 4A, and FIG. 4B. [0064] FIG. 3 illustrates an example multi-output FCN 300 having a tree structure.

[0065] As shown in FIG. 3, the multi-output FCN 300 can include a first layer 301, or equivalently, an input layer, configured to receive an input. Although the FCN 300 is supposed to process inputs with arbitrary size, the FCN 300 could only process inputs of a particular size (e.g., the fixed size) after being statically compiled (if not performing the techniques described herein).

[0066] The multi-output FCN 300 can further include multiple output layers 303, for example, a first output layer 303a, a second output layer 303b, and a third output layer 303c. Each of the output layers 303 can generate a respective output for a given input. As described above, the outputs can include semantic classification of a human face, background objects, and depth images for processing an input image.

[0067] The multi-output FCN 300 can include other network layers between the first layer and the output layers. For example, the multi-output FCN 300 can have a second layer 311 following the first layer, layers 313 and 315 following the second layer 311, layers 317 and 318 following layer 313, and layer 319 following layer 317. [0068] As described earlier, a tree-structure multi-output FCN 300 can include two or more branches or paths. As shown in FIG. 3, a branch or a path represents a sequence of network layers connecting each other in a predetermined order. For example, the multi-output FCN 300 can include a first path with a first sequence of network layers 301, 311, 313, 318, and 303b, a second path with a second sequence of network layers 301, 311, 315, and 303c, and a third path with a third sequence of network layers 301, 311, 313, 317, 319, and 303a. Each path or branch in the multioutput FCN 300 starts from the first layer 301 and ends at one of the output layers 303. [0069] The multi-output FCN 300 further includes one or more branch points. As described above, a branch point in a multi-output FCN 300 represents a network layer at which two branches diverge. Alternatively, a branch point can be considered a network layer such that an output generated by the network layer is provided as an input for each succeeding layer in each of two or more branches. For example, the multi-output FCN 300 shown in FIG. 3 includes a first branch point 313 and a second branch point 311.

[0070] Furthermore, the network layers included in the multi-output FCN 300 can include different types. For example, the network layers can include one or more of a convolution layer, transposed convolution layers, pooling layers, or other suitable layers. Each layer can include a different size and parameters, for example, a different number of nodes, different nodal operations (e.g., different activation functions, normalization functions, or different nodal weight values), and/or a different stride size, filter size, and zero-padding size.

[0071] It should be appreciated that the structure of the multi-output FCN 300 is illustrated for simplicity. A multi-output FCN 300 for implementing the described techniques can include different numbers of output layers, branches, branch points, and different types and arrangements of network layers. For example, the multi-output FCN 300 can include 5, 10, 20, or more branches, 4, 9, 15, or more branch points, or 10, 20, 50, or more layers.

[0072] To perform the “tiling and stitching” analysis of a multi-output FCN 300, the system, e.g., the inference system 100 of FIG. 1, can, for each output layer 303, obtain a first-valid-pixel-offset b_L for the i^th output, the i^th output being generated from the I^th output layer of the FCN 300. The first-valid-pixel-offset is generally related to a dummy region and a valid region of an output from a network layer. The first-valid-pixel-offset can, for example in one-dimensional cases, represent a width of b_t pixels away from an edge in an output tile such that pixels in the output tile that are at least b_L pixels away from the edge are valid and accurate. The detailed explanation of a valid region and a dummy region in an output tile are described in connection with FIG. 5.

[0073] To generate the first-valid-pixel-offset b_L for the I^th output, the system can use the FirstValidPixelOffset() algorithm encoded based on characteristics of the FCN model, e.g., layer types, stride, filter, and padding. For simplicity, the example algorithms presented in the specification are by default compatible with a sequence of layers that do not include branch points and re-convergence points, unless explained otherwise. The system includes new techniques to consider operations for branch points and re-convergence points using the example algorithms described below. An example

FirstValidPixelOffsetQ) is presented as below: function FirstValidPixelOffset(layers) :

// First pixel offset at which the prior layer produced a valid result. first valid offset = 0 for layer = input to output layers: if layer type is “conv”: s = stride of layer; p = padding of layer; f = filter size of layer first valid offset = ceil((first_valid_offset + p) / s) else if layer type is “trans conv”: s = stride of layer; p = padding of layer; f = filter size of layer

// Offset of last invalid pixel on the input activation. Can be -1 or greater. last invalid offset = first valid offset + p - 1

// Offset of last invalid pixel on the output activation. last invalid offset = last invalid offset * s + f - 1 first valid offset = last invalid offset + 1 return first valid offset b=FirstValidPixelOffset(layers)

[0074] In general, the criteria for the first valid pixel calculated from the left edge and the right edge of a fixed-size output is not entirely symmetrical - a few pixels may remain on the right side of the fixed-size input where the filter could not be applied, which keeps one more valid pixel on the right side of the fixed-size output than the left. The output (e.g., the first valid offset b₍) of the FirstValidPixelOffsetQ) algorithm is calculated from the left, and this value should also be correct for the right. Similarly, the above-described analysis should also apply for calculations from the top or the bottom of the fixed-size output.

[0075] In some implementations, the system can first determine two smallest tile sizes for an I^th output layer. The two smallest tile sizes lead to possible tile sizes that work for other output layers and the input layer. Based on the two smallest tile sizes, the system can generate a formula to determine other acceptable tile sizes for the I^th output layer. One example algorithm for different acceptable tile sizes can be t₀ ^l = a * k + b, where k is an non-negative integer, and a and b are two smallest acceptable tile sizes for the I^th output layer.

[0076] The system can validate the candidate output tile size for a driver output layer in the multi-output FCN 300. Generally, a driver output layer can be arbitrarily determined and the described techniques should still hold. In some implementations, a driver output layer can be determined according to different criteria. For example, a system can select an output layer that generates fixed-size outputs with the smallest size among the multiple output layers 303. As another example, a system can select an output layer that the system has the most information about.

[0077] In the example shown in FIG. 3, the system can select the output layer 303b as the driver output layer, and validate the candidate tile from the driver output by backward projection 323 (or projecting forward and backward described below). During backward projection, the system can determine, at least for each convolution layer in a path that spans from the first layer 301 and the driver output layer 303b, all possible/candidate input or output tile sizes for the convolution layer. This is because different input tile sizes can generate the same output tile size due to characteristics of the convolution layers. The characteristics of a convolution layer can include, for example, a stride size and filter size. More specifically, if an FCN model (single-output or multi-output) includes one or more convolution layers that have a stride size greater than one, the input tile sizes and output tile sizes for these convolution layers can be a many-to-one mapping, no longer being a one-to-one mapping. For example, an FCN model can include a convolution layer (e.g., layer 318), with a filter size of 5 by 5, a stride size of 2, and zero paddings of 1. The output size for the layer 318 can be identical for inputs of different sizes. For example, after processing a first input of 50 by 50 pixels and a second input of 49 by 49 pixels, the layer 318 generates corresponding first and second outputs of an identical size, i.e., 24 by 24 pixels. This is because the stride size 2 in the second convolution layer can trigger a rounding process in processing the input tile through each network layer.

[0078] The system can first determine linear segments in the FCN structure. The term “linear segments” generally refers to one or more consecutive network layers in a sequence that do not include branch points and re-convergence points. Each linear segment can include at most one convolution layer with a stride of greater than one. For example, a linear segment in the multi-output FCN 300 can include (i) layers 303b and 318 (where layer 318 is a convolution layer with a stride S>1), (ii) layers 303a, 319, and 317, 313, and 311 (where layer 311 is a convolution layer with a stride S>1), or (iii) layers 303c, 315, and 311 (where layer 311 is a convolution layer with a stride S>1).

[0079] For simplicity, a convolution layer with a stride greater than one is also referred to as a “strided” convolution layer. In some cases, a linear segment along the forward projection direction has a sequence of convolution layers and the single strided convolution layer is the first layer in the sequence. [0080] The system can perform a forward projection, or a backward projection, or both, to determine candidate input tile sizes along a linear path having one or more linear segments. For each linear segment that includes a strided convolution layer, the system can store one or more possible sizes for the strided layer in the linear segment during backward projection. This is because the input tile size to the output tile size in a strided convolution layer is not guaranteed one-to-one size mapping, and one output size of a network layer can be mapped to one or more different input tile sizes.

[0081] To facilitate the process of determining one or more candidate input tile sizes, the system can use linear segments in the backward projection and the forward projection. One example ProjectBackward() algorithm called by the AlignOutputTile () algorithm is presented as below: function ProjectBackward($ag wt, hb, wb), layers): for layer = output to input layers: if ht == hb or wt == wb:

THROW EXCEPTION; // layer is vanished if layer type is “conv”:

// Conv layer: n = floor((m + 2 p - f) / s) + l, n is output size, m = input size s = stride of layer; p = padding of layer; f = filter size of layer n_h = hb - ht; n_w = wb - wt //output tile sizes in the h and w dims // input tile size m_h = (n_h - l) * s + f - 2 p; m_w = (n_w - 1) * s + f- 2p

// Note: m_h x m_w is the smallest input tile size, but any size up to // (m_h + s - 1) x (m_w + s - 1) will work to produce n_h x n_w output. // Selecting a larger size may matter if a trans conv layer // precedes the conv layer in model order and disallows certain sizes. // Sizes can be explored as back-tracking, not shown here for simplicity.

else if layer type is “trans conv”:

// TransConv layer: n = (m + 2 p - l) * s + f s = stride of layer; p = padding of layer; f = filter size of layer n_h = hb - ht; n_w = wb - wt //output tile sizes in the h and w dims // input tile size m_h = Validate( (n_h - f) / s - 2 p + l ) m_w = Validate( (n_w - f) / s - 2 p + l ) ht = Validate( ht / s ); hb = ht + m_h wt = Validate( wt / s ); wb = wt + m_w return (ht, wt, hb, wb) where: function Validate(xa\ucy. if value is integral: return value else: THROW EXCEPTION

// Value cannot be used for coordinates of a fixed-size input in an FCN. [0082] The terms (ht, wt, hb, wb) can represent a location and size of an output tile. For example and for a two-dimension output tile, “ht” represents a vertical coordinate of a top left comer pixel of the output tile, “wf ’ represents a horizontal coordinate of the top left comer pixel of the output tile, “hb” represents a vertical coordinate of a bottom right comer pixel the output tile, and “wb” represents a horizontal coordinate of the bottom right comer pixel the output tile. The returned term (ht, wt, hb, wb) can generally represent a location and size of an input tile projected from the output tile. The projected input tile can be used by the system to determine an output tile alignment value Ai°, and whether the output tile alignment value Aj°is satisfied. For example, if the output size and location (e.g., pixel coordinates) are not properly chosen to satisfy the alignment requirements, the project-backward coordinates for a corresponding input tile can include non-integer coordinate values. The system can therefore determine the output tile alignment value Ai° and the currently chosen output size does not satisfy the output tile alignment value At° .

[0083] The system can project the output tile alignment values along different paths from respective output layers 303 to the first layer 301 to obtain input tile alignment values Ai . Similarly, the input tile alignment values ^/represent corresponding coordinate shifts in corresponding input tiles. In some implementations, the system can use an algorithm ProjectBackwardTL (), which projects backward only the top-left comer coordinate of an output tile — not both the top-left and bottom right comer coordinates in the ProjectBackward() algorithm. An example ProjectBackwardTL () is presented below. function ProjectBackwardsTL(Qat, wt), layers): for layer = output to input layers: if layer type is “element wise op”: // Any element-wise op layer, e.g., add two tensors continue // Nothing to do else if layer type is “conv”: // Conv layer s = stride of layer

else if layer type is “trans conv”: // TransConv layer s = stride of layer ht = Validate( ht / s ) wt = Validate( wt / s ) return (ht, wt) [0084] The input to the ProjectBackwardTL () algorithm is the top left comer coordinate of an output tile alignment value for the I^th output, i.e., (ht, wt), and the output of the algorithm is the top left comer coordinate of the input tile alignment value At' in each corresponding dimension, e.g., for the height dimension in “ht” and the width dimension in “wf ’ for 2-dimension FCN input data.

[0085] The system can compute a common input tile alignment value A^,c for the input layer 301 based on respective input tile alignment values A /corresponding to the respective output tiles. For example, the system can generate the common input tile alignment value by obtaining a least common multiple from all input tile alignment values Ai , as below:

Equation (2)

[0086] The system can project forward the common input tile alignment value A^IC , along respective paths, to each of the respective output layers to determine a respective oc common output tile alignment value A_t for the output layer.

[0087] Similar to the ProjectBackwardTL () algorithm, the system can use linear segments for the forward projection. One example forward projection algorithm can be ProjectForwardTL () algorithm, which projects forward only the top-left comer coordinate of the input tile. The example ProjectForwardTL () algorithm is presented as follows: function ProjectForwardTL{(jit, wt), layers): for layer = input to output layers: if layer type is “element wise op”: // Any element-wise op layer, e.g., add two tensors continue // Nothing to do else if layer type is “conv” : // Conv layer s = stride of layer ht = Validate( ht / s ) wt = Validate( wt / s ) else if layer type is “trans conv”: // TransConv layer s = stride of layer

return (ht, wt)

[0088] The input to the ProjectForwardTL () algorithm is the common input tile alignment value A^IC , and the output of the algorithm is the respective common output tile alignment values A°^c along each path of different paths to different output layers in the multi-output FCN. [0089] Since hoth the ProJectBackwardTLf) and ProjectForwardTL () algorithms do not receive input tile size as an input, the process of determining alignment information is independent of input tile sizes. In other words, the process of validating a candidate output tile size is independent of the process of generating alignment values, so the system can perform the two processes in any suitable order or in parallel. [0090] In addition, the system can also implement an example ProjectForward() algorithm to determine or validate a selection of a fixed-size input by determining that a forward-projected output that corresponds to the fixed-size input has all integer coordinates. The example ProjectForward() algorithm is presented as follows: function ProjectForward( (ht, wt, hb, wb), layers): for layer = input to output layers: if ht == hb or wt == wb:

THROW EXCEPTION; // layer is vanished if layer type is “conv”:

// Conv layer: n = floor((m + 2 p - f) / s) + l, n is output size, m = input size s = stride of layer; p = padding of layer; f = filter size of layer m_h = hb - ht; m_w = wb - wt // input tile sizes in the h and w dims // output tile size n_h = floor((m_h + n_w = floor((m_w ht = Validate( ht / s wt = Validate( wt / else if layer type is “tr

// TransConv layer:

s = stride of layer; p = padding of layer; f = filter size of layer m_h = hb - ht; m_w = wb - wt // input tile sizes in the h and w dims // output tile size

return (ht, wt, hb, wb)

[0091] The system further includes mechanisms to resolve failures or mismatches when exploring possible input or output tile sizes in the backward projection, or the forward projection, or both. The system can store all possible input tile sizes in a backtracking data structure (e.g., a backtracking stack, or for simplicity, a stack, or other suitable data structures). The stack can include a sequence of cells, and each of the cells includes one or more possible input tile sizes for a strided layer. The sequence of cells are stacked on one another according to the order of layers being traversed in the backward projection. When the system determines a first failure or mismatch at a layer, the system tracks back to a particular strided convolution layer associated with the top cell of the backtracking stack. All projected tile sizes along the path from the failure layer to the particular strided convolution layer are removed from the memory (also referred to as undone). The system then selects another input tile size different from the previously-chosen one and resumes the backward projection, or the forward projection, or both from the particular strided convolution layer. More details of a backtracking stack are described below in connection with FIG. 8.

[0092] In addition, the system can determine whether the backward projection has reached a branch point. In response to determining that the search has arrived at a branch point (e.g., network layer 313), the system performs forward projection along another branch that starts from the branch point. As shown in FIG. 3, when the backward projection reaches the network layer 313, the system can use one of the possible/candidate input tile sizes from the the top cell in the backtracking stack to project forward along the other branch (e.g., layers 317, 319, and 303a).

[0093] If the selected input tile size at the branch point 313 is not compatible with the output size of the output layer 303a (e.g., forward projections using the selected input tile do not produce possible tile sizes for layers following the branch point 313), the system can determine that the forward projection fails. The system then performs backward tracking by rewinding to a particular strided convolution layer (e.g., layer 318) associated with the top cell in the backtracking stack and selecting another possible input tile size stored in the top cell for the particular layer.

[0094] Alternatively, the system can determine that the forward projection fails if the output size for the output layer (e.g., layer 303a) does not satisfy the output size constraint in Equation (1). Similarly, the system tries another candidate input tile size from possible input tile sizes stored in the backtracking stack.

[0095] After selecting another input tile size for a particular layer (e.g., layer 318) in a particular cell in the backtracking stack, the system continues to perform backward projection along a path from the branch point 313 to an upper layer in the FCN model. For example, the system can perform backward projection 325 till another branch point 311 along the path. The system then performs the forward projection along another path having layers 315 and 303c based on the other tile size selected for layer 318. The system determines whether the new tile size is “compatible” for the output size from the output layer 303c.

[0096] If all other possible input tiles for the top cell have been tried and failed, the system can pop out the top cell from the backtracking stack, and backtracks to another strided convolution layer associated with the new top cell (the previously second top cell) in the backtracking stack. The system can repeatedly backtracking in the stack until a solution is found or all cells are popped out due to failures.

[0097] The system can repeatedly perform the back-tracking search, together with the forward projection and the backward projection, at different branches along the path until reaching the first layer 301, and determine compatible/valid input tile sizes for all layers in the FCN 300. The valid input tile sizes are stored in a data structure, e.g., a tuple. A tuple can store valid input tile sizes and, optionally, associated network layers. For example, a tuple can include a set of data indicating the output tile size for the driver output layer 303b, the input tile size for the input layer 301, the projected output tile size for the output layer 303a, and the projected output tile size for the output layer 303c. Since the tile sizes for the input layer 301 and the output layers 303 are not unique, the system can collect more than one data tuple. The system can select any tuple of the data tuples as parameters used to tile inputs with various sizes for the FCN 300. The example backward projection and the example forward projection are described above.

[0098] Although the back-tracking search seems to have a multiplicative search space as candidate input tile sizes obtained for each convolution layer might multiply with each other, the search space is in fact a linear space due to the high redundancy in candidate input sizes. For example, assuming that the candidate input tile sizes for a layer (e.g., the second closest backtracking point) is 8 and 9, the system can determine that the candidate input tile sizes for an upper layer (e.g., the closest backtracking point) are 15, 16, and 17 when the candidate input tile size for the second closest backtracking point is selected as 8. The system can determine that the candidate input tile sizes for the upper layer are 16, 17, and 18 when the candidate input tile size for the second closest backtracking point is selected as 9. The total distinct candidate input tile sizes for the closest backtracking point are 15, 16, 17, and 18, which is much less than nine different input tile sizes.

[0099] After determining at least one set of input tile size and output tile sizes for the FCN 300 using alignment information for the FCN 300, the system can generate a tiling pattern to tile an arbitrary input to the FCN 300 such that the output tiles for an output are ensured that at least one of the output tiles has a valid pixel value for a corresponding pixel in the output. [0100] To determine alignment information, the system can determine an alignment value for an input layer 301 such that the alignment value is compatible for each output tile from respective output layers 303.

[0101] Alignment information is important to generate reasonable tiling patterns when an FCN includes one or more transposed convolution layers. A transposed convolution layer, in general, is configured to increase the special resolution of an input to generate an output that has a larger size. The increase of output size is based on the stride size of the transpose convolution layer. For example, the FCN 300 can include a transposed convolution layer appended to a convolution layer with a stride size of 2. The transposed convolution layer can produce an output of 51 by 51 pixels by processing a 24 by 24 pixel output from the convolution layer. Since the stride is 2, adding one pixel to the input adds two pixels to the output of the transposed convolution layer. Due to the stride of 2, an output tile for the transposed convolution layer must be aligned to multiples-of-2 (regarding the top-left coordinate of the output tile), since an in-between location of the output-tile does not correspond to values produced by the transposed convolution layer from the input tile.

[0102] The system can obtain alignment information for fixed-size outputs according to the computation requirements set forth by transposed convolution layers in the FCN 300. For example, the requirement can include that the pixel indices for one or more pixels in the fixed-size inputs and corresponding the fixed-size outputs should be integers.

[0103] One example process performed by the system for the multi-output FCN 300 is described below.

[0104] For each output layer 303, the system can compute an output tile alignment value A_L° for the i^th output from the i^th output layer. The system can use different algorithms to generate respective output tile alignment values. For example, the output tile alignment values can be obtained using the AlignOutputTile() algorithm based on at least one of the local search or analytical methods. The output tile alignment values generally represent coordinate shifts for shifting the “unaligned” fixed-size output tiles leftward and upward in the corresponding outputs. An example AlignOutputTileQ algorithm for a sequence of layers not including branch points and re-convergence points is presented as follows: function Align()utputl'ile( (ht, wt, hb, wb), layers): if approach == “local search”: for (hs, ws) = try all values in some pattern from 0 to max shift: try: return ProjectBackward( (ht - hs, wt-ws, hb-hs, wb-ws), layers ) except:

// Failed to project, keep trying with other shift values.

THROW EXCEPTION // failed to find valid alignment for tile else if approach == “analytical”: hts = int (ht / alignment) * alignment wts = int (wt / alignment) * alignment hbs =hb - (ht - hts) wbs =wb - (wt - wts) return (hts, wts, hbs, wbs) where: alignment = CalculateAnalyticalAlignment(layers)

[0105] The ProjectBackward () function can be equivalent to the example

ProjectBackward() function as described above. One example

CalculateAnalyticalAlignmentQ) algorithm called by the AlignOutputTile () algorithm is presented as follows: function CalculateAnalytical Alignment^ layers ):

// Algo to find the smallest correct alignment: presence of conv layers before trans conv

// layers eases the alignment needed by the trans conv layers. conv_stride_product = 1 // product of strides of back-to-back conv layers trans_conv_stride_product = 1 // product of strides of back-to-back trans conv layers alignment = 1 // required tile alignment at the FCN output layer for layer = output to input layers: if layer type is “conv”: s = stride of layer conv_stride_product *= s else if layer type is “trans conv”: s = stride of layer trans_conv_stride_product *= s prev layer = previous layer //prev layer produces the input for layer

// prev layer == null if layer is the input layer for the whole FCN. if prev layer == null OR prev layer type != “trans conv”:

// utilize the subsequent stack of conv layers to ease the

// alignment requirement imposed by a stack of trans conv layers by using the greatest common divisor gcd = GCD (conv_stride_product, trans_conv_stride_product) alignment for stack = trans_conv_stride_product / gcd alignment *= alignment for stack

// Reset stacks conv_stride_product, trans_conv_stride_product = 1, 1 return alignment

[0106] As shown in the example (kilculateAnalyticalAlignm.ent(y' algorithm, the system 100 determines a constant alignment value based on the characteristics of each layer of the FCN model. For example, the characteristics can be a layer type (e.g., convolutional, transpose convolution layer, or other layer such as pooling layer), or a size for padding, filter, and stride for the layer. In some implementations, other types of layers in the FCN model, e.g., pooling layers, are treated as a convolution layer throughout the specification.

[0107] The system can further coordinate the arrangement of output tiles for generating a tiling pattern. In particular, the system can determine an output tiling pattern for the driver output using the common output tile alignment value for the driver output layer 303b, i.e., the common driver output alignment value A^^c . By doing so, the system can obtain tile coordinates for the output tiles in the driver output and corresponding tile coordinates for the input tiles in the input. The system can then project forward the input tile coordinates along respective paths in the FCN 300 to obtain corresponding coordinates for all other FCN output tiles.

[0108] The system can coordinate the arrangement of output tiles by ensuring valid pixels are generated at each coordinate in each output. For example, for each output in the multiple outputs other than the driver, the system can determine that an overlapping pattern for the input tiles can ensure that at least one of the output tiles for the output has a valid pixel value that corresponds to a pixel value in the output. If the forward- projected output tiles for an output k has a valid region that does not abut or overlap with the valid region of a neighboring output tile to the left or the top, the system can shift the current output tile left or up by a value of At°^c , project backward the output tiling pattern for the output k till the input layer, and forward to all other output layers i + k.

[0109] After determining the tiling pattern, the system can compile the multi-output FCN 300 on different hardware accelerators, and perform inference operations for processing inputs with different sizes.

[0110] FIGS. 4A and 4B illustrate a respective example DAG structure 400, 450 with a respective re-convergence point 403, 453. As described above, an FCN model can include a more complex structure than a tree structure. For example, an FCN model can include a DAG structure. As an example, a tree-like multi-output FCN can include one DAG structure to replace a network layer of the FCN model. Unlike a tree structure, a DAG structure includes at least one re-convergence point and corresponding one or more branch points (also referred to as an “initial point.” Although the specification describes tree structures and DAG structures in respective figures and at the structural level, the system does not have to explicitly determine a tree structure or a DAG structure at a topological level or receive information explicitly indicating a tree structure or a DAG structure. Instead, the system can independently perform operations of a branch point and a re-convergence point. In other words, the algorithms do not need to know whether there is a re-convergence point in an FCN model when processing a branch point and vice versa. The term “initial point” below generally refers to a branch point corresponding to a re-convergence point in a DAG structure. However, the algorithms treat an initial point the same as a regular branch point. This way, the system can process tree-like or D AG-like FCN models more efficiently. Determining operations at the layer level is described in greater detail in connection with FIG. 5.

[oni] As shown in FIG. 4 A, a multi-output FCN, e.g., the multi-output FCN 155 in FIG. 1 or the multi-output FCN 300 in FIG. 3, can have a DAG structure 400 that includes multiple layers. For example, the DAG structure 400 can include an initial layer 401 (also referred to as an initial point or a branch point), which represents a first layer in the DAG structure where two or more branches or paths diverge. The DAG structure 400 can include a re-convergence point 403, which represents a network layer in the DAG 400 at which the two or more branches or paths re-converge. Note the two or more branches that converge at the re-convergence point 403 do not need to come from a common branch point. In some cases, the branches can come from different branch points. The term “branch” or “path” is equivalently defined to that in a tree structure, as described above. For example, the DAG 400 can include two paths, as shown in FIG. 4A. The first path includes a first sequence of network layers from the initial point 401 to the re-convergence point 403, i.e., layers 401, 405, 407, and 403. The second path includes a second sequence of network layers also from the initial point 401 to the re-convergence point 403, i.e., layers 401, 409, 411, and 403.

[0112] The DAG structure 450, as shown in FIG. 4B, has a structure substantially similar to that of the DAG structure 400. In some cases, the DAG structures 400 and 450 are identical but for different projection directions. For example, the DAG structure 450 in FIG. 4A shows a backward projection direction from the driver output layer 403 to an input layer 401, and the DAG structure 450 in FIG. 4B shows a forward projection direction from the input layer 403 to the driver output layer 403. [0113] The DAG structure 450 includes multiple layers, including an initial point 451 and a re-convergence point 453. The DAG structure 450 can include two paths, e.g., a first path including a sequence of layers 451, 455, 457, and 453, and a second path including a sequence of layers 451, 459, 461, and 453. The DAG structure 450 in FIG. 4B can be placed in a different portion of a multi-output FCN than the DAG structure 400 in FIG. 4A.

[0114] The system, e.g., the system 100 of FIG.l, can perform the “tiling and stitching” analysis for a multi-output FCN with a DAG structure, similar to that with a tree structure. In general, the system can determine an input tile size, and determine a tiling pattern for an input based on alignment information and valid region information. [0115] The system can determine a first-valid-pixel offset for each output of the multi-output FCN, along each path in a DAG structure from the first layer of the FCN to the respective output layer of the FCN. A first-valid-pixel offset can also be denoted as b_t for the I^th output layer. The first-valid-pixel offsets can be calculated efficiently using a graph traversal (e.g., a topologically-sorted order traversal of a DAG structure) instead of explicitly enumerating all paths to different outputs. The system needs to determine a maximum offset value among all paths at the re-convergence point, before computing the first-valid-pixel offset for a succeeding layer following the reconvergence point. To determine a maximum offset value, the system can determine a first offset value at the convergence point 403 along a first path, e.g., a path including layers 401, 405, 407, and 403, , and a second offset value at the convergence point 403 along a second path, e.g., a path including layers 401, 409, 411, and 403. The system can take the maximum value from the first and second offset values as the maximum offset for the re-convergence point, and compute the corresponding first-valid-pixel offset in a following layer based on the maximum offset. The system can generally expand the above-noted FirstValidPixelOffset() algorithm, by taking into consideration the maximum offsets at re-convergence points in one or more DAG structures.

[0116] The system can select a candidate output tile size, based on the output size constraint, for a selected driver output layer from outputs of multiple output layers in a multi-output FCN. The system can validate the selected candidate output along all paths from the first layer of the multi-output FCN to respective output layers. The validation process for a multi-output FCN including a DAG structure is generally similar to that for a tree-structured FCN, with the following modifications. [0117] As described above, the system treats the initial points 401, 451 similar to a regular branch point in the backward projection (shown in FIG. 4 A) and the forward projection (shown in FIG. 4B). For example, during the backward projection process at a re-convergence point 403, the system can start along any one of the paths from the reconvergence point 403. For example, the system can select the path including layers 401, 409, 411, and 403 to perform the backward projection. The system projects forward at the initial point 401 along another path to the re-convergence point 403. [0118] The system generally does not need to know in advance which layer is a reconvergence point. Instead, the system perceives a re-convergence point whenever the forward projection or backward projection reaches a layer having an existing input tile size. For example, when the system projects forward at layer 403, the system perceives that layer 403 is a re-convergence layer by determining layer 403 already has a tile size (e.g., an input or output tile size). The system then determines if the input tile size “reconciles” (e.g., matches) with the other input tile size obtained from the other path from the initial point 401. If the two sizes match, the system determines the tile sizes for the initial point 401 is valid for the layer 403. In some implementations, the system can determine whether output tile sizes for layer 403 from different paths “reconcile” at a re-convergence point, which is essentially similar to using input tile sizes.

[0119] Similarly, for FIG. 4B, the system first performs forward projection along the left path from the initial point 451 to layer 453. The system then performs forward projection along the right path from the initial point 451 to layer 453, and by the time reaching the layer 453, the system perceives that layer 453 is a re-convergence point because it already has an input or output tile size. In this case, the system determines whether the input or output tile sizes from both paths match with each other. If the two sizes match, the system determines the tile size for the initial point is valid or suitable. [0120] If the tile sizes do not match at the re-convergence point, the system terminates the forward projection at the re-convergence point, and determines a failure of the forward projection. In addition, the system performs the back-tracking algorithm to determine another possible tile size in the backtracking stack. More details of the backtracking stack are described in connection with FIG. 8.

[0121] The candidate output tile size for a driver output layer is validated only if the backward projection and the forward projection succeed. The system obtains a valid model input tile size and respective output tile sizes for corresponding output layers after validating the candidate output tile size for the driver layer. [0122] After validating a candidate output tile size for a driver output layer, the system can determine a common input tile size for the first layer of a multi-output FCN so that the common input tile size is compatible for respective output tile sizes for different outputs.

[0123] The system can determine alignment information for generating a tiling pattern for an input to a multi-output FCN having a DAG structure. The system can follow a similar process above-described regarding a multi-output FCN with a tree structure, and modifying the above-described process for obtaining respective output tile alignment values Ai° . The modification is needed because a tree structure does not include a re-convergence point where branches converge, and branches can each include a different number and type of layers and therefore have different alignment constraints. For example, the left path in the DAG structure 400 can include only convolution layers, and the right path in the DAG structure 400 can include only transposed convolution layers with strides greater than one. A modification can be made to ensure the alignment information or requirements are compatible for each of two or more branches in a DAG structure. An example modification is described below.

[0124] To obtain an output tile alignment value Ai° for the I^th output layer (assuming a path from the I^th output layer and the initial layer includes a DAG structure), the system can determine an output tile alignment value for the i^th output layer that is compatible with all the branches in the DAG structure along the path. More specifically, the system can generate respective initial output tile alignment values for each branch along the path based on layer characteristics, and generate an output tile alignment value

the i^th output layer based on the respective initial output tile alignment values.

[0125] For example and as shown in FIG. 4A, assuming the re-convergence point 403 is an output layer, or layers between the re-convergence layer 403 and the corresponding output layer do not include any transposed convolution layer, the system can obtain a first initial output tile alignment value (e.g., 5 pixels) for the corresponding output layer along the left path (e.g., a path including layers 401, 405, 407, and 403). The system can obtain a second initial output tile alignment value (e.g., 7 pixels) for the corresponding output layer along the right path (e.g., a path including layers 401, 409, 411, and 403). The system can determine an output tile alignment value for the corresponding output layers based on the first and second initial output tile alignment values. For example, the system can take the least common multiple of the two initial values to obtain the output tile alignment value (e.g., 35 pixels).

[0126] Alternatively, for each output layer, the system can systematically and efficiently determine the output tile alignment value using a depth-first search traversal by (i) set an initial output tile alignment value to be one, and (ii) update the initial output tile alignment value by taking the least common multiple between a previous value along a previous path and a new value along a new path (e.g., Ai° = LCM(i4j°, ^)).

[0127] After obtaining respective output tile alignment values for all outputs in the FCN with a DAG structure, the system can follow the same procedures abovedescribed for FCNs with tree structures using the obtained output tile alignment values A °-

[0128] The process of determining coordinate tiling patterns for an FCN with a DAG structure can be substantially the same as that for an FCN with tree structures, as described above.

[0129] FIG. 5 illustrates an example multi-output FCN 500 having a DAG structure. As described above, the system does not need to identify a tree structure or a DAG structure at a topological level. Instead, the system can implicitly determine a tree structure at the layer level by detecting one or more branch points, implicitly determine a DAG structure at the layer level by detecting one or more re-convergence points, or both. In some implementations, the system does not determine tree structures or DAG structures at all. This way, the system can more efficiently generate parameters for tiling inputs of various sizes for a compiled FCN model. The system can further save the development and research resources for instructions for determining the structures or hardcoding the structure information into data transmitted to the system.

[0130] As shown in FIG. 5, the multi-output FCN 500 includes two branch points, i.e., layers 501 and 503, and two re-convergence points, i.e., layers 509 and 515. On the one hand, to determine a layer to be a branch point, the system can determine whether an output generated by the layer is consumed by two or more layers that succeed the layer. In response to determining that two or more layers consume output generated from a layer, the system can determine the layer as a branch point. The existence of one or more branch points in an FCN implicitly indicates at least a tree structure in the FCN. On the other hand, to determine a layer to be a re-convergence point, the system can determine whether there are two more inputs to the point, e.g., an element- wise addition of two input tensors from two different layers preceding the point along different paths. In some cases, the system can also determine if the point (or a layer) already has an input tile size for an input from a different branch when providing another input to the point from a current branch. In response to determining that the point already has an input from a different branch, the system can determine that the point as a re-convergence point.

[0131] The existence of one or more re-convergence points in an FCN further implicitly indicates that the FCN model includes a DAG structure. One example process, including the forward projection, the backward projection, and the backtracking search, is described in connection with the FCN 500 below.

[0132] First, assuming the system determines tiles sizes by starting with the driver output layer 517 and performs the backward projection from the driver output layer 517 to a model input layer (e.g., layer 501 or another layer preceding layer 501). Due to one or more convolution layers with strides of greater than one along a path (e.g., a sequence of network layers) in the backward projection, the system can determine more than one possible input tile sizes for these strided convolution layers. This is because the input size to output size mapping for a convolution layer with a stride greater than one is not a one-to-one mapping. Instead, the mapping is multiple-to-one. For example, assuming layer 511 has strides greater than one, the system can determine two or more input tiles for layer 511. For example, the layer 515 can also be a strided convolution layer. The system can determine multiple possible input tile sizes for the layer 515 and select one of the multiple possible input tile sizes (e.g., 1*49*49*3) for the layer 515 when the backward propagation first reaches the point 515. The system continues the backward projection along, let’s say, the right path including layer 513, 511, and 501. The system can choose the left path as well.

[0133] After reaching layer 501, the system performs forward projection to determine the input tile size of layer 503 and the output tile size of layer 503. The system can determine that layer 503 is another branch point, and more importantly, one of the next layers that consume the output of layer 503 is layer 515, which has a predetermined requirement for the input tile size (e.g., 1*49*49*3). Therefore, the system determines whether the projected output tile size of layer 503 matches the input tile size requirement for layer 515. [0134] Suppose the two sizes reconcile (e.g., both are of the size 1*49*49*3). In that case, the system determines that the possible input tile size previously determined for layer 515 is valid and continues forward projections from layer 503 to layers 505 and 507. However, suppose the two sizes do not reconcile. In that case, the system performs a particular back-tracking algorithm to select another possible input tile size in the top cell of a backtracking stack. An example backtracking stack is described in connection with FIG. 8.

[0135] Further, suppose the system determines that no candidate output tile size for the driver output can result in a match between the output tile size of layer 503 and the input tile size of layer 515, or one or more output tile sizes for one or more other output layer has been tried but not a set of possible tile sizes are identified, the system determines there is no solution for the tiling task. In some cases, the system terminates the process, including the forward projection, the backward projection, and the backtracking search process. In some other implementations, the system can generate a failure notification for display on a user interface or store information in a log file. [0136] Referring back to layer 503 and assuming that the output size of layer 503 reconciles with the input tile size for layer 515, the system can perform the forward projection along either the left or right path to layer 509 as shown in FIG. 5. As an instance, the system can project forward to layer 509 first through layer 505 along the left path, and then to layer 509 through layer 507 along the right path. The system determines an input tile size for layer 509 after the forward projection reaches layer 509. Later, the system identifies layer 509 as a re-convergence layer by determining an existing input tile size requirement at layer 509 when the forward projection visits layer 509 for the second time. The system determines whether an output tile size from the right path matches the input tile size for the layer selected through the left path. For example, if both paths generate output tiles with a size of 1*52*52*3, the system determines that the two output tile sizes match and proceeds the forward projection for the rest of the layers 519. Otherwise, the system performs a particular backtracking algorithm, which is described in more details in FIG. 8.

[0137] FIG. 6 illustrates an example fixed-size input 638 with a neighbor pixel region 615 and an example fixed-size output 633 with a dummy region 635. The fixed- size input 638 is equivalent to the fixed-size input 138 of FIG. 1, and the fixed-size output 633 is equivalent to the fixed-size output 133 of FIG. 1. [0138] As described above, the system can determine multiple input tiles 638 with the fixed-size after tiling a full input 650. The fixed-size input 638 can be surrounded by pixels in the neighboring pixel region 610. The term “neighbor pixel region 610” represents a region including neighbor pixels generated by replacing non-zero values in the original input with zero values. For example, the neighbor pixel region 610 can include one or more pixels in the full input data 150 surrounding the fixed-size input 638. The width 615 of the neighbor pixel region 610 can represent a number of pixels included in the neighbor pixel region 610. The system can determine the width 615 for the neighbor pixel region 610 based on the characteristics of the multi-output FCN 115. [0139] The system can, as described above, generate multiple fixed-size outputs 633 for corresponding fixed-size inputs 638 using the statically-compiled multi-output FCN model 115. Due to the zero pixel values in the neighboring pixel region 610, the fixed-size outputs 633 can include pixel-wise values that are inaccurate. The system can determine, for each fixed-size input 633, a valid region 630 and a dummy region 620. In general, the valid region 630 can be located at the center of the fixed-size output 633 and the dummy region 620 can surround the periphery of the valid region 630 at a width 635. The width 635 can represent a particular number of pixels that are invalid. The width 635 can be determined using, for example, the example FirstValidPixelOffsetQ) algorithm as described above. The valid region 630 includes pixel-wise values for pixels computed using the valid pixel-wise values in the corresponding fixed-size input 138, and the dummy region 620 includes pixel-wise values computed using at least one or more neighbor pixels. The pixel-wise values in the valid region 630 contribute at least a portion to the final output 170, while the dummy pixels will be eliminated or discarded during the stitching process. The system can ensure each pixel value in one of the outputs (e.g., a full output 670) should be generated from at least one output tile for the output.

[0140] To determine the coordinates of an output tile (e.g., fixed-size output 633) and a valid region 630 of the fixed-size output 633, the system can perform different algorithms. One example algorithm that considers the above-noted alignment requirement is presented as below:

Initialize: 0(0, 0) = (0, 0, T_o, T_o)F(0, 0) = (0, 0, T_o - b, T_o - b~) Left boundary tiles:

U_o = ( F(i - 1, 0). bb — b,0,V(i - 1, 0). bb - b + T_o, T_o)O(i, 0) = AlignOutputTile( U₀ )V(i, 0) = (O(i, O). bt + b, 0, 0(i, 0). hb — b, T₀ — b) Top boundary tiles:

[0141] In this example algorithm, the system can denote the width of dummy regions 620 as b (which can be obtained from, for example, the example FirstValidPixelOffsetQ) algorithm as above-described), and mapping functions I(i,j) = (hti'. wtj . hb'i. wbj), O(i,j) = (ht° . wt® , hb° . wb®), and V(i,j) =

(htt , wtj', hbi , wb'-') for coordinates of a fixed-size input, a corresponding fixed-size output, and a valid region of the fixed-size output, respectively. The fixed-size output has both a dummy region and valid region with the left and top dummy regions being omitted, denoted as Uo. A fixed-size output O(i,j) is calculated using the above-noted AlignOutputTile algorithm for a corresponding Uo. Note the term Uo represents fixed- size output coordinates not subjected to the above-described alignment constraints, and the term O(i,j) represents output tile coordinates after considering the alignment constraints. Each mapping function can return a particular coordinate in a particular direction (e.g., I(i,j). ht = ht- represents a coordinate in a vertical or height direction). For simplicity, the system 100 assumes the fixed-size inputs and fixed-size outputs are squares in two-dimensional space, and denotes the size of the fixed-size inputs as T], and the fixed-size outputs as T_o . The system 100 denotes the size for the initial input data as H] and W_h and without losing generality, it is assumed that H] >= 7} and VF, >= 7).

[0142] After computing all the fixed-size outputs for each output of the multiple outputs through the multi-output FCN model, the system can apply the O(i, ;)and V(l,j) mappings to construct a full output for each output layer as if the input were entirely processed by the multi-output FCN model. An example StitchOutputlmageQ) algorithm for “stitching” fixed-size outputs for an output is presented as below: function StitchOutputImage(y. for tile indices (i,j)in a top -to -bottom, left-to-right scan of the tiles:

(ht_O, wt_O, hb_O, wb_0) = O(i,j) (ht_V, wt_V, hb_V, wb_V) = V(i,j) Output(ht_V:hb_V, wt_V:wb_V) = OutputTile( (ht_V-ht_O):(hb_V-ht_O), (wt_V-wt_O):(wb_V-wt_O) ) [0143] The OutputTile(i,j) represents the fixed-size output of size T_o corresponding to the (i,j)^th fixed-size input. For example, a fixed-size input in the I^th column and j^th row of a tiling grid.

[0144] FIG. 7 illustrates an example process 700 for performing inference computations in a multi-output FCN for inputs with different sizes. For convenience, the process 700 is described as being performed by a system of one or more computers located in one or more locations. For example, an inference system, e.g., the system 100 of FIG. 1, appropriately programmed, can perform the process 700.

[0145] The system receives a new input to be processed by a multi-output FCN deployed on a hardware accelerator (710). The new input has a size different from a fixed size that the multi-output FCN is configured to process when deployed on a hardware accelerator. The multi-output FCN includes multiple network layers. For example, a multi-output FCN includes an input layer and multiple output layers. The multi-output FCN can process a single input to the input layer to generate multiple outputs, where each output is generated for a respective task of multiple different tasks and generated using a different output layer of the multiple output layers that corresponds to the respective task.

[0146] In some implementations, the multi-output FCN can include a tree structure. The tree structure can include multiple paths stemming from a branch point, where each path has a respective sequence of network layers. The respective sequence of network layers spans from the first layer of the multi-output FCN to a corresponding output layer of the multiple output layers. In some implementations, the multi-output FCN can include a directed acyclic graph (DAG) structure. The DAG structure can include an initial point that represents a network layer from which two or more paths or branches in the DAG structure diverge. The DAG structure can further include a reconvergence point that represents a network layer at which the two or more paths or branches re-converge.

[0147] The system determines one or more fixed-size inputs from the new input (720). Each fixed-size input of one or more fixed-size inputs can have the fixed size. More specifically, the system can determine a tiling pattern for tiling the new input based at least on the characteristics of the deployed FCN model, e.g., alignment information, padding sizes, stride sizes, filter sizes, and scale factors. To determine a tiling pattern, the system performs a “tiling and stitching” analysis by determining an input tile size (e.g., the fixed size) for the input and generating a tiling pattern based on alignment information and valid region information.

[0148] To determine the fixed size, the system selects an output layer from the multiple output layers as a driver output layer, and selects an output size for an output from the driver output layer as a candidate output tile size. The system then validates the candidate output tile size to obtain a set of tile sizes (e.g., an input tile size and output tile sizes for different output layers). After validating that the candidate output tile size, the system sets the input tile size from the set as the fixed-size. The validation process is described in greater detail above in connection with FIG. 3, FIG. 4A, and FIG. 4B.

[0149] To determine alignment information for a multi-output FCN with tree structures, the system determines, for each output layer of the multiple output layers, an output tile alignment value for the output layer. The system then projects backward the output tile alignment value along a path that spans from the first layer and the output layer to obtain an input tile alignment value for the first layer. The system computes a common input alignment value for the first layer based on input tile alignment values for the output layers, e.g., taking the least common multiple of all input tile alignment values. The system projects forward, using the common input alignment value for the first layer, along respective paths to obtain respective common output alignment values each for a corresponding output layer.

[0150] For a multi-output FCN with a DAG structure, the system modifies the step for generating the output tile alignment values. More specifically, the system generates, for each path of the two or more paths in the DAG structure, a respective initial output tile alignment value for an output layer, through the path in the DAG structure from the output layer to the first layer; and generates the output tile alignment value for the output layer based on the respective initial output tile alignment values, e.g., taking the least common multiple of the respective initial output tile alignment values.

[0151] The system provides each of the multiple fixed-size input tiles to the hardware accelerator for performing inference computations using the multi-output FCN (730).

[0152] The system generates, for each of the multiple output layers, respective fixed-size output tiles for the multiple fixed-size input tiles (740). The respective fixed- size outputs can include one or more inaccurate pixel-wise values. The inaccurate pixel-wise values are discarded when generating respective outputs. [0153] For each output layer of the multiple output layers, the system generates, from the respective fixed-size outputs, a respective final output for the output layer that is equivalent to an output that would be generated from the output layer by processing the new input using the multi-output FCN configured for processing inputs with the first size (750).

[0154] To determine the multiple fixed-size input tiles using the determined fixed- size for the new input, the system determines a tiling pattern using the fixed-size input tiles for the new input such that, (i) for each output layer of the multiple output layers, each pixel value for an output of the output layer can be extracted from at least one of the respective output tiles for the output layer, and (ii) for each output layer of the multiple output layers, the respective output tiles satisfy corresponding output alignment values determined for the output layer. More specifically, the system projects forward the input tile coordinates to get the corresponding tile coordinates for all the other FCN outputs, and examines whether the output tiles for each output layer of multiple output layers are valid. If a first output tile for a first output layer has a valid region that does not abut or overlap with a neighboring valid region of the first output tile (e.g., a valid region at the top or the left of the first output tile), the system shifts the input tile left or up according to the alignment information (e.g., At°^c, as described above), and re-projects the input tiles to all the output tiles of all output layers. This way, the system can ensure all valid pixels of outputs of all output layers can be extracted from at least one output tile.

[0155] In some implementations, the fixed size for compiling the multi-output FCN does not have to be a scalar. Instead, the fixed size can be a vector representing a rectangle in two-dimensional space, or a block in three-dimensional space. More specifically, the fixed size can include a respective value in a respective dimension. For example, if the input image is two-dimensional, the system can determine a fixed size vector with a first size for the first dimension (e.g., horizontal dimension) and a second size for a second dimension (e.g., vertical dimension) different from the first dimension. The system can generate multiple fixed-size inputs of 30 by 10 pixels from an input image with a size of 300 by 100 pixels.

[0156] In addition, the multi-output FCN can be adapted to process each of the multiple dimensions of an input as long as the dimension is fully convolutional. For example, a multi-output FCN can process an image input with B x H x W x C dimensions to generate multiple outputs. Assuming that the batch dimension and channel dimension are not fully convolutional, the multi-output FCN can process the input only in the height and width dimensions, where the process can be generally considered a two-dimension problem. As another example, a multi-output FCN can process an audio input with multiple dimensions by processing only a single dimension of the audio input if the rest of the dimensions are not fully convolutional.

Alternatively, the multi-output FCN model can process higher dimensions, e.g., higher than two dimensions, if these dimensions are fully convolutional.

[0157] FIG. 8 illustrates an example backtracking stack 805. Backtracking stack 805 is an example data structure for storing all possible input tile sizes for strided convolution layers in an example FCN model 800. The FCN model 800 is similar to the FCN model 500 in FIG. 5.

[0158] As shown in FIG. 8, the FCN model 800 can include multiple convolution layers with strides greater than one. For example, a first convolution layer 810 has a stride of 3, a second convolution layer has a stride of 2, and a third convolution layer has a stride of 2. The FCN model 800 can include additional strided convolution layers, which are omitted from FIG. 8 for ease of illustration.

[0159] In general, the system traverses one or more sequences of network layers in the FCN to determine a tiling pattern for tiling a new input. The sequence of network layers can form a path for validating a candidate output tile size selected for one of multiple output layers. To address potential failures or mismatches in the validation process, the system can perform the a particular backtracking technique to (i) undo failure validation segments and (ii) select a different candidate input tile size stored for a preceding layer for resuming the validation process. The backtracking technique is associated with a particular type of network layer, e.g., a strided convolution layer or a network layer with a stride greater than one.

[0160] As described above, the system can determine all possible input tile sizes for strided convolution layers along a path for the backward projection or the forward projection, or both. As shown in FIG. 8, the system can perform a portion of the backward projection that starts at layer 810 and traverses along a first path including layers 820, 830, 840, 880 and other layers between layers 840 and 880. In addition, the system can perform a portion of the forward projection traversing a second path including layers 840, 850, and 860, and performing a portion of another forward projection traversing from a third layer including layers 880, 890, and 850. It should be appreciated that these paths for the forward and backward projections are for ease of illustration, and the system can include other suitable paths.

[0161] When the system reaches layer 810 from a previous layer (not shown in FIG. 8) during the backward projection, the system can determine a set of possible input tile sizes for layer 810. For example, the set of possible input tile sizes can be 11 and 12. The system can store these tile sizes in a first cell 815 (Cell #0) in the backtracking stack 805.

[0162] Before proceeding to the layer 820, the system selects one of the possible input tile sizes in the first cell 815 as the candidate input tile size for performing the rest of the backward and forward projections. For example, the system can select input tile size 11 for layer 810, as shown in a dashed box 825, for the rest of the backward and forward projections. Referring to layer 820, since layer 820 is a convolution layer with a stride of one, the system can get a unique input tile size for layer 820 based on the candidate size 11 selected for layer 810.

[0163] Along the backward projection path, the system can determine another set of possible input tile sizes for layer 830 based on the input tile size for layer 820 (which is implicitly based on the selected tile size 11 for layer 810). For example, the other set of possible input tile sizes can be 24 and 25. The system can store the other set of input tile sizes in a second cell 835 (Cell #1) stacked on top of the first cell 815. Similarly, the system can select one of these possible input tile sizes (e.g., tile size 24) for layer 830, as represented by the dashed box 845, for downstream operations.

[0164] Similarly, the system determines another set of possible input tile sizes for layer 870 based on the selected tile size 11 for layer 810 and the selected tile size 24 for layer 830. The other set of possible input tile sizes for layer 870 can include 50 and 51. The system can store these tile sizes in a third cell 855 (Cell #2) stacked on top of the second cell 835. The system can select input tile size 50 for layer 870 for the rest of backward and forward projections, as shown in the dashed box 865.

[0165] When a mismatch or failure occurs in a layer with a size determined based at least on one selected tile size for one of strided convolution layers, the system can revert or undo a corresponding projection to the particular layer that is associated with the top cell stored in the backtracking stack. For example, if a failure occurs in layer 890 (e.g., layer 890 cannot obtain an integer input tile size), the system backtracks to a preceding layer associated with cell 855 (Cell #2), e.g., layer 870. The system can select another possible input tile size different from the previously-selected size in the cell 855, e.g., input tile size 51, for layer 870, and use the newly-selected input tile size 51 to resume the validation process and determine a corresponding tile size (e.g., an output tile size) of layer 890 in the forward projection. As another example, if a failure occurs in layer 850 and the backward projection process has not reached layer 870 yet, the backtracking stack only includes cell 835 for layer 830 and cell 815 for layer 810. In this case, the tile size for layer 850 is dependent on the currently-selected input tile size for layer 830. The system thus backtracks to layer 830 to get another possible input tile size stored in cell 835 (Cell #1), e.g., input tile size 25, for layer 830. The system then uses the newly-selected input tile size 25 for layer 830 to resume the validation process and determine a corresponding tile size (e.g., an output tile size) for layer 850 in the forward projection. As another example, if a failure occurs at a particular layer between the layers 840 and 870, the system backtracks to the layer associated with cell 835 (Cell #1), e.g., layer 830, because the tile size for the particular layer in the backward projection is dependent on the selected input tile size for layer 830. Any previously-determined tile sizes between the particular layer to the convolution layer 830 are reverted and removed from the memory (also referred to as “undone”). The system selects another input tile size 25 stored in the cell 835 for layer 830, and resumes the backward projection from layer 830 using the newly-selected input tile size 25.

[0166] If the system determines that all possible input tile sizes for a cell has been tried and failed, the system can pop out the cell from the backtracking stack 805. For example, if tile sizes 50 and 51 of cell 855 have failed, the system can pop out cell 855 from the backtracking stack 805. The previously second top cell (e.g., cell 835) now becomes the top cell in the backtracking stack 805. The system selects another possible value in the current top cell and resumes corresponding backward and forward projections, and re-calculates a new set of possible input tile sizes for layer 870 based on the newly-selected input tile size for layer 830. The re-calculated possible input tile sizes are then stored in a new cell stacked on top of the cell 835. Using the abovedescribed backtracking stack and the corresponding mechanism, the system can efficiently explore the solution space to find a solution (e.g., a set of tile sizes) for tiling the multi-output FCN model according to different requirements.

[0167] Implementations of the subject matter and the actions and operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer programs, e.g., one or more modules of computer program instructions, encoded on a computer program carrier, for execution by, or to control the operation of, data processing apparatus. The carrier may be a tangible non-transitory computer storage medium. Alternatively or in addition, the carrier may be an artificially-generated propagated signal, e.g., a machinegenerated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be or be part of a machine- readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. A computer storage medium is not a propagated signal.

[0168] The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. Data processing apparatus can include special-purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), or a GPU (graphics processing unit). The apparatus can also include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0169] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, an engine, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, engine, subroutine, or other unit suitable for executing in a computing environment, which environment may include one or more computers interconnected by a data communication network in one or more locations.

[0170] A computer program may, but need not, correspond to a file in a file system. A computer program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code.

[0171] The processes and logic flows described in this specification can be performed by one or more computers executing one or more computer programs to perform operations by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, e.g., an FPGA, an ASIC, or a GPU, or by a combination of special-purpose logic circuitry and one or more programmed computers.

[0172] Computers suitable for the execution of a computer program can be based on general or special-purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry. [0173] Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to one or more mass storage devices. The mass storage devices can be, for example, magnetic, magneto-optical, or optical disks, or solid state drives. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

[0174] To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on, or configured to communicate with, a computer having a display device, e.g., a LCD (liquid crystal display) monitor, for displaying information to the user, and an input device by which the user can provide input to the computer, e.g., a keyboard and a pointing device, e.g., a mouse, a trackball or touchpad. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser, or by interacting with an app running on a user device, e.g., a smartphone or electronic tablet. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

[0175] This specification uses the term “configured to” in connection with systems, apparatus, and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions. For specialpurpose logic circuitry to be configured to perform particular operations or actions means that the circuitry has electronic logic that performs the operations or actions. [0176] Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

[0177] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some implementations, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device. [0178] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what is being or may be claimed, but rather as descriptions of features that may be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claim may be directed to a subcombination or variation of a subcombination.

[0179] In addition to the embodiments described above, the following embodiments are also innovative:

[0180] Embodiment 1 is a method performed by one or more computers, the method comprising: receiving a new input to be processed by a fully convolutional neural network deployed on a hardware accelerator, the new input having a first size that is different from a fixed size that the fully convolutional neural network is configured to process when deployed on the hardware accelerator, wherein the fully convolutional neural network comprises a plurality of network layers, wherein the plurality of network layers include an input layer as a first layer and a plurality of output layers each configured to generate a respective output for processing the new input; determining a plurality of fixed-size input tiles from the new input, each fixed- size input tiles having the fixed size; providing each of the plurality of fixed-size input tiles to the hardware accelerator for performing inference computations to generate respective outputs using the fully convolutional neural network; for each of the plurality of output layers, generating respective fixed-size output tiles for the plurality of fixed-size input tiles, wherein the respective fixed-size outputs comprise one or more inaccurate pixel-wise values; and for each output layer of the plurality of output layers, generating, from the respective fixed-size outputs, a respective final output for the output layer that is equivalent to an output that would be generated from the output layer by processing the new input using the fully convolutional neural network configured for processing inputs with the first size.

[0181] Embodiment 2 is the method of Embodiment 1 , wherein the fully convolutional network comprises a tree structure, wherein the tree structure includes a plurality of paths each having a respective sequence of network layers, the respective sequence of network layers spanning from the first layer of the fully convolutional network to a corresponding output layer of the plurality of output layers.

[0182] Embodiment 3 is the method of Embodiment 1 or 2, wherein the fully convolutional network comprises a directed acyclic graph structure, wherein the directed acyclic graph structure includes an initial point representing a network layer from which two or more paths diverge and a re-convergence point representing a network layer at which two or more paths re-converge.

[0183] Embodiment 4 is the method of Embodiment 2 or 3, wherein determining the fixed-size for deploying the fully convolutional network further comprises: selecting an output layer from the plurality of output layers as a driver output layer, determining a first candidate output tile size for the driver output layer, validating whether the first candidate output tile size is compatible with an input tile size and output tile sizes of the plurality of output layers, and in response to validating that the first candidate output tile size is compatible with the input tile size and the output tile sizes of the plurality of output layers, setting the input tile size as the fixed-size.

[0184] Embodiment 5 is the method of Embodiment 4, wherein when the fully convolutional network comprises the tree structure, the validation comprises: projecting backward, using the first candidate output tile size, along a path that spans from the first layer to the driver output layer to determine the input tile size, wherein the backward projection comprises: for each convolution layer in the path, determining respective candidate input tile sizes compatible with the first candidate output tile size; and during the backward projection, determining that a network layer in the path is a branch point of the fully convolutional network, wherein the branch point represents a network layer such that an output generated from the network layer is provided as an input for a succeeding layer in each path of two or more paths, projecting forward, using a first input tile size determined for the branch point, along another path that leads to another output layer, and determining whether the first input tile size for the branch point is compatible with the other output layer. [0185] Embodiment 6 is the method of Embodiment 4, wherein when the fully convolutional network comprises the directed acyclic graph structure, the validation comprises: projecting backward, using the first candidate output tile size, along a path that spans from the first layer to the driver output layer to determine the input tile size; wherein the backward projection comprises: for each convolution layer in the path, determining candidate input tile sizes compatible with the first candidate output tile size; and during the backward projection, selecting a first path of the two or more paths converging at a re-convergence point in the directed acyclic graph structure as the path for the backward projection.

[0186] Embodiment 7 is the method of Embodiment 6, wherein the validation further comprises: during the backward projection, determining that a network layer in the path is the initial point in the directed acyclic graph structure; projecting forward, using one input tile size determined for the initial point, along a second path of the two or more paths converging at the re-convergence point; and determining whether an input tile size, generated based on the one input tile size, along the second path at the re-convergence point matches with a layer input size obtained along the first path at the re-convergence point.

[0187] Embodiment 8 is the method of any one of Embodiments 1-7, wherein determining the plurality of fixed-size inputs comprises determining alignment information for tiling the new input based on the fixed size, wherein determining the alignment information comprises: for each output layer of the plurality of output layers, determining an output tile alignment value for the output layer; projecting backward the output tile alignment value along a path that spans from the first layer and the output layer to obtain an input tile alignment value for the first layer; computing a common input alignment value for the first layer based on input tile alignment values for the plurality of output layers; and projecting forward, using the common input alignment value for the first layer, along respective paths to obtain respective common output alignment values each for a corresponding output layer.

[0188] Embodiment 9 is the method of Embodiment 8, wherein when the fully convolutional network comprises the directed acyclic graph structure, determining an output tile alignment value for an output layer further comprises: for each path of the two or more paths in the directed acyclic graph structure, generating a respective initial output tile alignment value for an output layer, through the path in the directed acyclic graph structure from the output layer to the first layer; and generating the output tile alignment value for the output layer based on the respective initial output tile alignment values.

[0189] Embodiment 10 is the method of Embodiment 8 or 9, wherein generating the respective final outputs for the new input is based on the respective common output alignment values.

[0190] Embodiment 11 is the method of any one of Embodiments 1-10, further comprising: determining that an input tile size of a particular layer is incompatible with one of the plurality of output layers; in response, obtaining another input tile size stored in a backtracking stack, wherein the obtaining process comprises: accessing a set of possible input tile sizes stored in a first cell in the backtracking stack, wherein the backtracking stack is generated during a backward projection process and includes multiple cells stacked on one another according to a sequence defined by a path in the backward projection, wherein each cell is associated with a respective layer and includes a respective set of possible input tile sizes for the respective layer, wherein the first cell is the top cell in the backtracking stack; and selecting, from the first cell, one of the set of possible input tile sizes that is different from the input tile size of the particular layer as the other input tile size; and projecting forward or backward from the corresponding layer according to the path using the other input tile size..

[0191] Embodiment 12 is the method of any one of Embodiments 1-11, wherein each of the respective fixed-size outputs comprises a central valid region, and a peripheral dummy region at a width of a first number of pixels, wherein the central valid region includes at least a portion of the final output, wherein the peripheral dummy region comprises one or more inaccurate pixel-wise values.

[0192] Embodiment 13 is the method of any one of Embodiments 1-12, wherein determining the plurality of fixed-size input tiles from the new input, each fixed-size input tiles having the fixed size, comprising: determining a tiling pattern using the fixed-size input tiles for the new input such that, (i) for each output layer of the plurality of output layers, each value pixel value for an output of the output layer can be extracted from at least one of the respective output tiles for the output layer, and (ii) for each output layer of the plurality of output layers, the respective output tiles satisfy corresponding output alignment values determined for the output layer.

[0193] Embodiment 14 is a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the method of any one of Embodiments 1 to 13.

[0194] Embodiment 15 is one or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the method of any one of Embodiments 1 to 13.

[0195] Embodiment 16 is a method performed by one or more computers, the method comprising: traversing backward a sequence of network layers in a fully convolutional neural network to determine a tiling pattern for a new input to the fully convolutional neural network, wherein the fully convolutional neural network has been compiled and deployed on a hardware for processing input sizes with a fixed size different from the new input, wherein the fully convolutional neural network comprises a plurality of network layers including an input layer as a first layer and one or more output layers each configured to generate a respective output for the input, wherein the sequence of network layers form a path for validating a candidate output tile size selected for one of the one or more output layers, identifying that a current network layer in the sequence of network layers does not satisfy a criterion for validating the candidate output tile size; and in response, backtracking to a preceding network layer according to the sequence in a backward order; selecting another candidate input tile size for the preceding network layer from a set of candidate input tile sizes stored for the preceding network layer, and resuming the traversing from the preceding network layer using the other candidate input tile size.

[0196] Embodiment 17 is the method of Embodiment 16, wherein the traversing comprises: for each network layer in the sequence of network layers according to the backward order, determining a set of candidate input tile sizes for the network layer when the network layer is a particular type.

[0197] Embodiment 18 is the method of Embodiment 17, wherein the traversing further comprises: for each network layer of the particular type, selecting a candidate input tile size from the set of candidate input tile sizes for the network layer of the particular type and using the selected candidate input tile size for traversing backward according to the sequence.

[0198] Embodiment 19 is the method of Embodiment 17 or 18, further comprising: storing the set of candidate input tile sizes for the network layer of the particular type in a cell of a data structure, the cells in the data structure stacking on top of one another according to the sequence in the backward order in which the network layers of the particular type are traversed.

[0199] Embodiment 20 is the method of Embodiment 19, wherein backtracking to the preceding network layer comprises: undoing the validation process from the current network layer to the preceding network layer of the particular type; wherein the preceding network layer of the particular type is associated with the top cell in the data structure; selecting a different candidate input tile size for the preceding network layer from the set of candidate input tile sizes stored in the top cell; and resuming the traversing backward from the preceding layer of the particular type using the different candidate input tile size.

[0200] Embodiment 21 is the method of Embodiment 20, further comprising: determining that the different candidate input tile size causes a succeeding network layer in the sequence not to satisfy the criterion; in response, undoing the validation process from the succeeding network layer to a corresponding network layer; wherein the corresponding network layer is a particular type and associated with a current top cell in the data structure; selecting a different candidate input tile size for the corresponding network layer from the set of candidate input tile sizes stored in the current top cell; and resuming the traversing backward from the corresponding network layer of the particular type using the different candidate input tile size.

[0201] Embodiment 22 is the method of any one of Embodiments 19-21, further comprising: determining the set of candidate tile sizes in the top cell of the data structure has been tried and failed, and in response, removing the top cell from the data structure and setting the second top cell in the data structure as a new top cell.

[0202] Embodiment 22 is the method of any one of Embodiments 17-22, wherein the particular type comprises a convolution network layer with a stride greater than one.

[0203] Embodiment 24 is the method of any one of Embodiments 16-23, wherein the criterion for validating the candidate output tile size comprises a candidate input tile size determined for any network layer being an integer.

[0204] Embodiment 25 is the method of any one of Embodiments 16-24, wherein the fully convolutional neural network comprises a branch point, the branch point being a network layer generating a layer output being received as an input by more than one other network layers. [0205] Embodiment 26 is the method of any one of Embodiments 16-25, wherein the fully convolutional neural network comprises a re-convergence point, the reconvergence point being a network layer receiving and combining outputs from more than one other network layers.

[0206] Embodiment 27 is the method of any one of Embodiments 16-26, wherein determining the tiling pattern for the new input further comprises determining alignment information for tiling the new input using the fixed size.

[0207] Embodiment 28 is the method of Embodiment 27, wherein determining the alignment information comprises: for each output layer of the one or more output layers, determining an output tile alignment value for the output layer; projecting backward the output tile alignment value along a path that spans from the first layer and the output layer to obtain an input tile alignment value for the first layer; computing a common input alignment value for the first layer based on input tile alignment values for the one or more output layers; and projecting forward, using the common input alignment value for the first layer, along respective paths to obtain respective common output alignment values each for a corresponding output layer.

[0208] Embodiment 29 is the method of any one of Embodiments of 16-28, wherein determining the tiling pattern further comprises: using the fixed size for tiling the new input such that (i) for each output layer of the one or more output layers, each value pixel value for a final output generated by the output layer can be extracted from at least one of respective output tiles generated by the output layer, the respective output tiles each being generated by the output layer processing a fixed size tile of the new input, and (ii) for each output layer of the one or more of output layers, the respective output tiles satisfy corresponding output alignment values determined for the output layer.

[0209] Embodiment 30 is a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the method of any one of claims 16 to 29.

[0210] Embodiment 31 is one or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the method of any one of claims 16 to 29.

[0211] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0212] Particular implementations of the subject matter have been described. Other implementations are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

[0213] What is claimed is:

Claims

1. A method performed by one or more computers, the method comprising: receiving a new input to be processed by a fully convolutional neural network deployed on a hardware accelerator, the new input having a first size that is different from a fixed size that the fully convolutional neural network is configured to process when deployed on the hardware accelerator, wherein the fully convolutional neural network comprises a plurality of network layers, wherein the plurality of network layers include an input layer as a first layer and a plurality of output layers each configured to generate a respective output for processing the new input; determining a plurality of fixed-size input tiles from the new input, each fixed- size input tiles having the fixed size; providing each of the plurality of fixed-size input tiles to the hardware accelerator for performing inference computations to generate respective outputs using the fully convolutional neural network; for each of the plurality of output layers, generating respective fixed-size output tiles for the plurality of fixed-size input tiles, wherein the respective fixed-size outputs comprise one or more inaccurate pixel-wise values; and for each output layer of the plurality of output layers, generating, from the respective fixed-size outputs, a respective final output for the output layer that is equivalent to an output that would be generated from the output layer by processing the new input using the fully convolutional neural network configured for processing inputs with the first size.

2. The method of claim 1, wherein the fully convolutional network comprises a tree structure, wherein the tree structure includes a plurality of paths each having a respective sequence of network layers, the respective sequence of network layers spanning from the first layer of the fully convolutional network to a corresponding output layer of the plurality of output layers.

3. The method of claim 1 or 2, wherein the fully convolutional network comprises a directed acyclic graph structure, wherein the directed acyclic graph structure includes an initial point representing a network layer from which two or more paths diverge and a re-convergence point representing a network layer at which two or more paths reconverge.

4. The method of claim 2 or 3, wherein determining the fixed-size for deploying the fully convolutional network further comprises: selecting an output layer from the plurality of output layers as a driver output layer, determining a first candidate output tile size for the driver output layer, validating whether the first candidate output tile size is compatible with an input tile size and output tile sizes of the plurality of output layers, and in response to validating that the first candidate output tile size is compatible with the input tile size and the output tile sizes of the plurality of output layers, setting the input tile size as the fixed-size.

5. The method of claim 4, wherein when the fully convolutional network comprises the tree structure, the validation comprises: projecting backward, using the first candidate output tile size, along a path that spans from the first layer to the driver output layer to determine the input tile size, wherein the backward projection comprises: for each convolution layer in the path, determining respective candidate input tile sizes compatible with the first candidate output tile size; and during the backward projection, determining that a network layer in the path is a branch point of the fully convolutional network, wherein the branch point represents a network layer such that an output generated from the network layer is provided as an input for a succeeding layer in each path of two or more paths, projecting forward, using a first input tile size determined for the branch point, along another path that leads to another output layer, and determining whether the first input tile size for the branch point is compatible with the other output layer.

6. The method of claim 4, wherein when the fully convolutional network comprises the directed acyclic graph structure, the validation comprises: projecting backward, using the first candidate output tile size, along a path that spans from the first layer to the driver output layer to determine the input tile size; wherein the backward projection comprises: for each convolution layer in the path, determining candidate input tile sizes compatible with the first candidate output tile size; and during the backward projection, selecting a first path of the two or more paths converging at a re-convergence point in the directed acyclic graph structure as the path for the backward projection.

7. The method of claim 6, wherein the validation further comprises: during the backward projection, determining that a network layer in the path is the initial point in the directed acyclic graph structure; projecting forward, using one input tile size determined for the initial point, along a second path of the two or more paths converging at the re-convergence point; and determining whether an input tile size, generated based on the one input tile size, along the second path at the re-convergence point matches with a layer input size obtained along the first path at the re-convergence point.

8. The method of any one of claims 1-7, wherein determining the plurality of fixed-size inputs comprises determining alignment information for tiling the new input based on the fixed size, wherein determining the alignment information comprises: for each output layer of the plurality of output layers, determining an output tile alignment value for the output layer; projecting backward the output tile alignment value along a path that spans from the first layer and the output layer to obtain an input tile alignment value for the first layer; computing a common input alignment value for the first layer based on input tile alignment values for the plurality of output layers; and projecting forward, using the common input alignment value for the first layer, along respective paths to obtain respective common output alignment values each for a corresponding output layer.

9. The method of claim 8, wherein when the fully convolutional network comprises the directed acyclic graph structure, determining an output tile alignment value for an output layer further comprises: for each path of the two or more paths in the directed acyclic graph structure, generating a respective initial output tile alignment value for an output layer, through the path in the directed acyclic graph structure from the output layer to the first layer; and generating the output tile alignment value for the output layer based on the respective initial output tile alignment values.

10. The method of claim 8 or 9, wherein generating the respective final outputs for the new input is based on the respective common output alignment values.

11. The method of any one of claims 1-10, further comprising: determining that an input tile size of a particular layer is incompatible with one of the plurality of output layers; in response, obtaining another input tile size stored in a backtracking stack, wherein the obtaining process comprises: accessing a set of possible input tile sizes stored in a first cell in the backtracking stack, wherein the backtracking stack is generated during a backward projection process and includes multiple cells stacked on one another according to a sequence defined by a path in the backward projection, wherein each cell is associated with a respective layer and includes a respective set of possible input tile sizes for the respective layer, wherein the first cell is the top cell in the backtracking stack; and selecting, from the first cell, one of the set of possible input tile sizes that is different from the input tile size of the particular layer as the other input tile size; and projecting forward or backward from the corresponding layer according to the path using the other input tile size.

12. The method of any one of claims 1-11, wherein each of the respective fixed-size outputs comprises a central valid region, and a peripheral dummy region at a width of a first number of pixels, wherein the central valid region includes at least a portion of the final output, wherein the peripheral dummy region comprises one or more inaccurate pixel-wise values.

13. The method of any one of claims 1-12, wherein determining the plurality of fixed-size input tiles from the new input, each fixed-size input tiles having the fixed size, comprising: determining a tiling patern using the fixed-size input tiles for the new input such that, (i) for each output layer of the plurality of output layers, each value pixel value for an output of the output layer can be extracted from at least one of the respective output tiles for the output layer, and (ii) for each output layer of the plurality of output layers, the respective output tiles satisfy corresponding output alignment values determined for the output layer.

14. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the method of any one of claims 1 to 13.

15. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the method of any one of claims 1 to 13.

16. A method performed by one or more computers, the method comprising: traversing backward a sequence of network layers in a fully convolutional neural network to determine a tiling patern for a new input to the fully convolutional neural network, wherein the fully convolutional neural network has been compiled and deployed on a hardware for processing input sizes with a fixed size different from the new input, wherein the fully convolutional neural network comprises a plurality of network layers including an input layer as a first layer and one or more output layers each configured to generate a respective output for the input, wherein the sequence of network layers form a path for validating a candidate output tile size selected for one of the one or more output layers, identifying that a current network layer in the sequence of network layers does not satisfy a criterion for validating the candidate output tile size; and in response, backtracking to a preceding network layer according to the sequence in a backward order; selecting another candidate input tile size for the preceding network layer from a set of candidate input tile sizes stored for the preceding network layer, and resuming the traversing from the preceding network layer using the other candidate input tile size.

17. The method of claim 16, wherein the traversing comprises: for each network layer in the sequence of network layers according to the backward order, determining a set of candidate input tile sizes for the network layer when the network layer is a particular type.

18. The method of claim 17, wherein the traversing further comprises: for each network layer of the particular type, selecting a candidate input tile size from the set of candidate input tile sizes for the network layer of the particular type and using the selected candidate input tile size for traversing backward according to the sequence.

19. The method of claim 17, further comprising: storing the set of candidate input tile sizes for the network layer of the particular type in a cell of a data structure, the cells in the data structure stacking on top of one another according to the sequence in the backward order in which the network layers of the particular type are traversed.

20. The method of claim 19, wherein backtracking to the preceding network layer comprises: undoing the validation process from the current network layer to the preceding network layer of the particular type; wherein the preceding network layer of the particular type is associated with the top cell in the data structure; selecting a different candidate input tile size for the preceding network layer from the set of candidate input tile sizes stored in the top cell; and resuming the traversing backward from the preceding layer of the particular type using the different candidate input tile size.

21. The method of claim 20, further comprising: determining that the different candidate input tile size causes a succeeding network layer in the sequence not to satisfy the criterion; in response, undoing the validation process from the succeeding network layer to a corresponding network layer; wherein the corresponding network layer is a particular type and associated with a current top cell in the data structure; selecting a different candidate input tile size for the corresponding network layer from the set of candidate input tile sizes stored in the current top cell; and resuming the traversing backward from the corresponding network layer of the particular type using the different candidate input tile size.

22. The method of claim 19, further comprising: determining the set of candidate tile sizes in the top cell of the data structure has been tried and failed, and in response, removing the top cell from the data structure and setting the second top cell in the data structure as a new top cell.

23. The method of claim 17, wherein the particular type comprises a convolution network layer with a stride greater than one.

24. The method of claim 16, wherein the criterion for validating the candidate output tile size comprises a candidate input tile size determined for any network layer being an integer.

25. The method of claim 16, wherein the fully convolutional neural network comprises a branch point, the branch point being a network layer generating a layer output being received as an input by more than one other network layers.

26. The method of claim 16, wherein the fully convolutional neural network comprises a re-convergence point, the re-convergence point being a network layer receiving and combining outputs from more than one other network layers.

27. The method of claim 16, wherein determining the tiling pattern for the new input further comprises determining alignment information for tiling the new input using the fixed size.

28. The method of claim 27, wherein determining the alignment information comprises: for each output layer of the one or more output layers, determining an output tile alignment value for the output layer; projecting backward the output tile alignment value along a path that spans from the first layer and the output layer to obtain an input tile alignment value for the first layer; computing a common input alignment value for the first layer based on input tile alignment values for the one or more output layers; and projecting forward, using the common input alignment value for the first layer, along respective paths to obtain respective common output alignment values each for a corresponding output layer.

29. The method of claim 16, wherein determining the tiling pattern further comprises: using the fixed size for tiling the new input such that (i) for each output layer of the one or more output layers, each value pixel value for a final output generated by the output layer can be extracted from at least one of respective output tiles generated by the output layer, the respective output tiles each being generated by the output layer processing a fixed size tile of the new input, and (ii) for each output layer of the one or more of output layers, the respective output tiles satisfy corresponding output alignment values determined for the output layer.

30. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the method of any one of claims 16 to 29.

31. One or more computer storage media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations of the method of any one of claims 16 to 29.