US12217179B2 - Intelligent regularization of neural network architectures - Google Patents
- Publication number
- US12217179B2 (Application No. US 18/372,415)
- Authority
- US
- United States
- Prior art keywords
- weights
- network
- indirect
- direct
- expected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0495—Quantised networks; Sparse networks; Compressed networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/096—Transfer learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
Definitions
- This specification relates generally to machine learning and more specifically to training systems such as neural networks.
- Computer models, such as neural networks, learn mappings from a set of inputs to a set of outputs according to a function.
- each processing element is also called a node or hidden element
- the mapping is considered a “direct” mapping, representing a function that translates the set of inputs to a set of outputs.
- the mapping is represented by a set of weights for the function to translate the inputs to the outputs.
- a mapping is a transformation that, for example, may map from images to images for de-noising, from images to the labels of the objects in the images, from English sentences to French sentences, from states of a game to actions required to win the game, or from vehicle sensors to driving actions.
- both the input to a mapping and the output of the mapping are represented as digitally-encoded arrays.
- a function f maps an input x to an output y.
- mappings may be represented with artificial neural networks which transform the input x to the output y via a sequence of simple mathematical operations involving summing inputs and nonlinear transformations.
- Mappings employed in machine learning, statistics, data science, pattern recognition, and artificial intelligence may be defined in terms of a collection of parameters, also termed weights w for performing the mapping.
- these parameters reflect weights accorded to different inputs x to the function, or parameters of the function itself, used to generate the output of the network.
- individual nodes (or “hidden units”) of the network each individually operate on a set of inputs to generate an output for that node according to weights of that node.
- Neural network architectures commonly have layers, where the overall mapping of the neural network is the composition of the mappings of the individual layers through the nodes of each layer.
- the initial input undergoes successive transformations by each layer into a new array of values.
- the network 100 comprises an input layer 110 , an output layer 150 and one hidden layer 130 .
- the input layer is a 2-dimensional matrix having dimensions P×P
- the output layer 150 is a 2-dimensional matrix having dimensions Q×Q.
- a set of inputs x to the layer are processed by nodes of the layer according to a function f with weights w to outputs y of the layer. The outputs of each layer may then become inputs to a subsequent layer.
- the set of inputs x at each layer may thus be a single value, an array, vector, or matrix of values
- the set of outputs y at each layer may also be a single value, an array, vector, or matrix of values.
- an input node 111 in the input layer 110 represents a value from a data input to the network
- a hidden node 131 in the hidden layer 130 represents a value generated by the weights 121 for node 131 in the hidden layer applied to the input layer 110
- output node 151 represents an output value from the network 100 generated by weights 141 for node 151 applied to the hidden layer 130 .
- each node in a layer may include its own set of weights for processing the values of the previous layer (e.g., the inputs to that node).
- Each node thus represents some function f, usually nonlinear transformations in each layer of the mapping, with associated weights w.
- the parameters w correspond to the collection of weights {w(1), . . . , w(L)} defining the mapping, each being a matrix of weights for each layer.
- the weights may also be defined at a per-node or per-network level (e.g. where each layer has an associated matrix for its nodes).
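- As a concrete illustration of the layer computation described above, a minimal Python sketch follows; the tanh nonlinearity, layer sizes, and random weights are illustrative assumptions rather than anything specified by the patent.

```python
import numpy as np

def layer(x, W, b):
    """One network layer: a weighted sum of the inputs followed by a nonlinear transformation."""
    return np.tanh(W @ x + b)

# A two-layer mapping f: input layer -> hidden layer -> output layer
rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # values of the input layer
W1, b1 = rng.normal(size=(6, 4)), np.zeros(6)     # weights for the hidden layer's nodes
W2, b2 = rng.normal(size=(2, 6)), np.zeros(2)     # weights for the output layer's nodes
y = layer(layer(x, W1, b1), W2, b2)               # output of the network
```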
- the goal of the network is to learn a function f through the layers of the network that approximates the mapping of inputs to outputs in the training set D and also generalizes well to unseen test data D_test.
- E(w, D) evaluates a loss function L which measures the quality or the misfit of the generated outputs ⁇ to the true output values y.
- Such an error function E can be minimized by starting from some initial parameter values (e.g., weights w), and then evaluating partial derivatives of E(w, D) with respect to the weights w and changing w in the direction given by these derivatives, a procedure called the steepest descent optimization algorithm.
- Various optimization algorithms may be used for adjusting the weights w according to the error function E, such as stochastic gradients, variable adaptive step-sizes, second-order derivatives or approximations thereof, etc.
- error function E may also be modified to include various additional terms.
- Regularization refers to a process which introduces additional information to the error function to prevent overfitting the network to data, solve ill-posed or underdetermined problems and guide parametric models to solutions consistent with a priori assumptions about the data. Regularization may be implemented as an additional regularization term of the error function.
- Regularization may be employed in neural networks by assuming an adequate penalty R over the parameters and adding it to the error function weighted by a scalar regularization parameter ⁇ .
- a number of penalty functions may be employed, such as various Lp-norms.
- One exemplary norm is the L2 norm
- the strength of the regularization parameter λ is a global value that is typically chosen manually or via cross-validation or similar procedures.
- the regularization parameter ⁇ thus may control the strength of the effect of the regularization on the error function E.
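- A minimal sketch of the regularized error function and a steepest-descent step described above; the linear direct mapping, random data, L2 penalty, and learning rate are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def regularized_error(w, X, y, lam):
    """E(w, D) + lam * R(w): a squared data error plus an L2 penalty on the weights."""
    data_error = 0.5 * np.sum((X @ w - y) ** 2)
    penalty = 0.5 * np.sum(w ** 2)                 # L2-norm penalty R(w)
    return data_error + lam * penalty

def steepest_descent_step(w, X, y, lam, lr=1e-2):
    """Move w in the direction of the negative gradient of the regularized error."""
    grad = X.T @ (X @ w - y) + lam * w             # dE/dw of the data term plus the penalty
    return w - lr * grad

# Illustrative usage on random data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
for _ in range(200):
    w = steepest_descent_step(w, X, y, lam=0.1)
```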
- Additional regularization may be performed by “dropout,” which switches off some number of hidden units (neural network processing units or atoms) stochastically during training and leads to models which more cleverly exploit the capacity of larger networks to represent data.
- Related approaches include Multi-Task Learning (MTL) and Domain Adaptation.
- a direct mapping for a network is learned in conjunction with an indirect network that designates expected weights for the direct mapping.
- the network generating the “direct mapping” may also be termed a “direct network” or a “direct model.”
- the indirect network learns an expected weight distribution of the weights of the direct network, which may be represented as a set of “expected” weights for the direct mapping.
- the indirect network may also be termed an “indirect model.”
- the direct model may include a portion of a larger modeled network, such as a multi-node, multi-layered neural network, wherein each direct network models transitions for one or more nodes in the network.
- the indirect model generates the expected weight distribution based on a set of indirect parameters that affect how the indirect network models the direct network weights.
- the indirect model may also be generated based on a control input that describes characteristics of the direct model, such as the particular node, layer, type of input, or other aspect that conditions the generated weights. This indirect model may thus predict more general changes in the weights of the model that vary across the characteristics.
- the indirect model thus produces “expected” weights and the distribution thereof, which may be used in various ways to improve the direct model's mapping from an input to an output.
- the indirect model is used to regularize the weights applied to the mapping in the direct network.
- the error term for the direct weights is regularized by the expected weights given by the indirect network.
- the indirect network provides an ‘anchor’ or set point from which the direct network weights may vary when it more accurately reflects the data.
- the regularization term may prefer a low difference (and penalize a high one) between the expected weight and the actual weight of the network.
- the regularization function may take various forms, such as a linear or squared difference from the expected weights. Rather than preferring a “zero” value for weights, this regularization may thus simulate an L1 or L2 norm with respect to the expected weights generated by the indirect network. These forms permit the regularization to provide a “spring” to the “anchored” expectation of the expected weights, encouraging direct weights that are consistent with the expected weights.
- the regularization may also be applied to the direct network weights based on a regularization parameter ( ⁇ ) describing the preference for the expected weight compared to deviation accounting for the input data.
- the error term may thus be used to update the direct network as well as the indirect network.
- the direct network may be updated based on a derivative of the error term with respect to the direct network weights
- the indirect network may be updated based on a derivative of the error term with respect to the indirect network parameters (which generate the expected weights).
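- A minimal sketch of the joint update described above, assuming a linear direct model, a linear indirect model, and a squared-difference regularizer toward the expected weights; the shapes, control input, and learning rates are assumptions rather than the patent's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)

w = np.zeros(5)                       # direct network weights
theta = np.zeros((5, 3))              # indirect parameters
z = np.array([1.0, 0.0, 2.0])         # indirect control input (e.g., a node/layer descriptor)
lam, lr_w, lr_theta = 0.1, 1e-2, 1e-2

for _ in range(200):
    w_hat = theta @ z                              # expected weights from the indirect network
    residual = X @ w - y                           # error of the direct mapping on the data
    # regularized error: E(w, D) + (lam / 2) * ||w - w_hat||^2
    grad_w = X.T @ residual + lam * (w - w_hat)    # derivative with respect to the direct weights
    grad_theta = -lam * np.outer(w - w_hat, z)     # derivative with respect to the indirect parameters
    w -= lr_w * grad_w                             # direct weights pulled toward the data and the anchor
    theta -= lr_theta * grad_theta                 # expected weights pulled toward the direct weights
```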
- the regularization parameter ⁇ itself may be an output of the indirect model.
- the regularization parameter ⁇ reflects the uncertainty or deviation of actual weights in the direct network relative to the indirect network.
- expected weights and the regularization parameter ⁇ output by the indirect network may represent an expected Gaussian distribution of the weights in the direct network.
- the indirect network may output any expected distribution of the network (e.g., values for the weights and associated probability of the weights) and use the expected distribution as a regularization of the direct weights.
- the regularization may not be a linear or non-linear function of the distance from a mean of the expected weight distribution, and may instead penalize weights according to the associated probability of the set of weights based on the distribution.
- the expected distribution of weights of the direct network (as output by the indirect network) is used to generate the outputs of the direct network.
- the outputs of the direct network may be generated based on an integration over the distribution of weights given by the indirect network.
- the output of the indirect network (the expected weight distribution of the direct network) represents a probabilistic prior distribution of the direct network weights.
- the indirect network provides a distribution of these weights which may be used to effectively ‘simulate’ many possible sets of weights according to the possible distribution of these weights or by evaluating as the mean of the sample outputs.
- the integration and ‘simulation’ is performed by sampling from multiple points in the distribution of weights and determining the resulting output for the direct network based on each sampled set of weights. The different samples are then combined according to the probability of the samples to generate the output of the direct network. Because the indirect network may learn the weights of the direct network as a distribution of ‘possible’ weights, the indirect network may more consistently learn the expected weights of the direct network and avoid overreliance on initial training data or bias due to the ordering in which the training data is batched; the different direct network weights as encouraged by different sets of training data may now be effectively captured as different distributions of these weights in the direct weight distribution.
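- A minimal sketch of the sampling-based evaluation described above, assuming a Gaussian expected weight distribution and a linear direct mapping; because the weight sets are drawn from the distribution itself, a plain average of the sampled outputs already reflects their probabilities.

```python
import numpy as np

def predict_by_sampling(x, w_mean, w_std, n_samples=50, rng=None):
    """Approximate the direct-network output by sampling weight sets from the
    expected weight distribution and combining the resulting outputs."""
    rng = rng or np.random.default_rng()
    outputs = []
    for _ in range(n_samples):
        w_sample = rng.normal(w_mean, w_std)       # one possible set of direct weights
        outputs.append(x @ w_sample)               # direct mapping with the sampled weights
    return np.mean(outputs, axis=0)                # combine the samples into one prediction

# Illustrative usage for a 5-input linear direct network
rng = np.random.default_rng(2)
x = rng.normal(size=5)
y_hat = predict_by_sampling(x, w_mean=np.ones(5), w_std=0.1 * np.ones(5), rng=rng)
```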
- an input is evaluated according to the expected prior weight distribution, and a loss function is used to evaluate updates to the distribution based on error to the data term generated from the prior weight distribution and error for an updated weight distribution.
- the loss function is used to update the expected prior distribution of direct weights and accordingly update the indirect parameters.
- the indirect network aids in the generation of transfer learning for different tasks. Since the indirect network predicts general expected characteristics of a network, the parameters for the indirect network may be used as initial expected parameters for training additional direct networks for different tasks. In this example, the indirect network may be used to initialize the weights for another direct network. In addition, when designating a domain as a control parameter, either with or without latent control inputs, the new domain may be readily incorporated via the control parameters for the indirect network because the training for the new domain may only require learning the differences from the prior domain while re-using the previously-learned aspects of the initial domain.
- control inputs may define known properties or parameters of the environment in which the direct network is applied, and changes to those properties may be used to learn other data sets having other properties simply by designating the properties of the other data sets when learning the new data sets.
- the indirect network may be jointly trained with multiple direct networks, permitting the indirect network to learn more general ‘rules’ for the direct networks and reflect an underlying or joint layer for the direct networks.
- the use of the indirect network to generate expected weights or general distributions of the direct networks also permits the indirect network to be trained more accurately with more limited training data.
- the indirect network results in more flexible use cases of the direct network. If portions of a trained direct network are lost or no longer describe current data well, the indirect network can provide a “starting place” for generating weights without complete re-training of the direct network.
- the indirect network may be used to adapt the direct network to account for the missing input data. For example, the portions of the direct network that use that data, have high expected weights for that data, or high weight distributions that weigh that data highly, may be deactivated or adjusted to account for the missing data. For example, the expected weight distribution for the direct network may be evaluated and modified to reduce reliance on (or deactivate) portions of the direct network affected by the missing data. For example, an expected weight distribution for a direct network may be modified to exclude portions of the distribution that highly weigh the missing or affected inputs.
- FIG. 1 illustrates an exemplary neural network
- FIG. 2 illustrates a computer model that includes a direct network and an indirect network according to one embodiment.
- FIG. 3 illustrates an indirect network for a plurality of direct network layers, according to one embodiment.
- FIG. 4 illustrates a process for training weights of a direct network and associated indirect network, according to one embodiment.
- FIG. 5 illustrates an example in which the direct network is evaluated with the expected weight distribution provided by an indirect network, according to one embodiment.
- FIG. 6 illustrates a process for training an indirect network to generate an expected weight distribution for a direct network to be evaluated by the expected weight distribution, according to one embodiment.
- FIG. 7 is a high-level block diagram illustrating physical components of a computer used to train or apply direct and indirect networks, according to one embodiment.
- FIG. 2 illustrates a computer model that includes a direct network and an indirect network, according to one embodiment.
- the computer model refers to the learned networks that learn from a data set D having inputs x and associated outputs y. These inputs and outputs thus represent the data input and related data output of the dataset. In that sense, the modeling learns to generate a predicted output y from an input x.
- this computer model may include a direct network 200 and an indirect network 220 .
- the trained network itself may be applied in one example to unknown data with only the direct network and its weights, while in another example the trained network may be applied to new data with the indirect network and the structure of the direct network using weights predicted by the indirect network.
- the direct network 200 implements a function f for mapping a set of direct inputs x 210 to a set of direct outputs y 250 .
- the mapping of the direct inputs x to direct outputs y may be evaluated by applying direct weights w to the various nodes.
- a single layer is illustrated in which the direct outputs y 250 are generated by applying direct weights w to the direct inputs 210 .
- a direct input 211 may be used to generate one or more direct outputs 250 , such as direct output 251 , according to the direct weights related to the respective direct output 250 .
- the direct network 200 is termed a “direct” network because its weights “directly” generate data outputs from data inputs.
- the data input to the network model is entered as an initial layer of the direct network, and the output of the direct network is the desired output of the network model itself.
- for the training data D, its input x is provided as the direct inputs 210 , and training is expected to result in the values of direct outputs 250 matching the training data's associated output y.
- the indirect network 220 generates an expected weight distribution ⁇ for the direct weights 230 of the direct network.
- the expected weight distribution ⁇ describes possible values of the weights of the direct network and probabilities associated with the possible values. In this way, the expected weight distribution ⁇ may also be considered to model a statistical prior of the direct weights and captures a belief about the distribution of the set of weights and may describe the dependence of each weight on the other weights.
- the expected weight distribution ⁇ may describe the possible values and associated probabilities as a function or as discrete values. As a result, rather than directly describing the function applied to the input x to generate the output y for a given set of weights in the direct network, the indirect network describes the expected weights themselves of the direct network.
- the indirect network 220 is a learned computing network, and typically may be a neural network or other trainable system to output the expected weight distribution ⁇ for the set of weights w of the direct network.
- the indirect network may use a set of indirect parameters ⁇ 280 designating how to apply the functions of the indirect network 220 in generating the expected weight distribution of the direct network.
- the indirect network 220 may also receive a set of indirect control inputs 260 that describe how to apply the indirect network to generate the expected weights. These indirect control inputs z 260 serve as an “input” to the indirect network 220 , and provide an analog in the indirect network for the inputs x of the direct network.
- the indirect network provides a function g that outputs the expected weight distribution as a function of the indirect parameters θ 280 and the indirect control inputs z 260 .
- in some embodiments, the indirect network may be trained to generate the expected weight distribution without an indirect control input z 260 , or with a set of “dummy” or constant indirect control inputs.
- the expected weight distribution ⁇ may take several forms according to the type of indirect network 220 and the resulting parameters generated by the indirect network.
- the expected weight distribution may follow various patterns or types, such as a Gaussian or other probabilistic distribution of the direct weights, and may be represented as a mixture model, multi-modal Gaussian, density function, a function fit from a histogram, any (normalized and unnormalized) implicit distribution resulting from draws of stochastic function and so forth. Accordingly, the expected weight distribution describes various sets of weights for the direct network and the relative likelihood of the different possible sets of weights.
- the expected weight distribution ⁇ may reflect a Gaussian or normal distribution of the direct weights, having a mean, standard deviation, and a variance.
- the expected weight distribution ⁇ may independently describe a distribution of each weight w, or may describe a multi-variate distribution of more than one direct weight w together.
- the indirect network 220 may be structured as various types of networks or models. Though termed a network, the indirect network 220 may include alternate types of trainable models that generate the expected weight distribution ⁇ . Thus, the indirect network 220 may include multivariate or univariate models.
- the indirect network 220 may be a parametric model or neural network, but may also apply to nonparametric models, such as kernel functions or Gaussian Processes, Mixture Density Networks, nearest neighbor techniques, lookup tables, decision trees, regression trees, point processes, and so forth. In general, various types of models may be used as the indirect network 220 that effectively characterize the expected weight distribution and have indirect parameters 280 that may be trained from errors in the output y predicted by the direct network.
- the indirect control inputs z describe characteristics that may condition the generation of the expected weight distribution ⁇ of the direct network 200 . These characteristics may describe various relevant information, for example describing a particular computing element or node of a larger network, a layer of the network, designate a portion of an input operated on by a given direct network, or a domain or function of the data set. As an example of a portion of an input, for an image or video input, different portions of the input may be separately processed, for example when the direct network performs a convolution or applies a kernel to the portion of the input. By varying the indirect control inputs, the indirect network may be used to effectively learn ‘rules’ that more generally describe the direct network data as it varies across the different characteristics described by the indirect control inputs.
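- A minimal sketch of an indirect network conditioned on control inputs, assuming a one-hidden-layer model that maps a control input z (here, a one-hot layer index plus a normalized node index) to a mean and variance for each weight of the corresponding direct node; all sizes and the encoding of z are assumptions.

```python
import numpy as np

class IndirectNetwork:
    """Maps an indirect control input z to the parameters of an expected weight
    distribution (mean and variance) for the weights of one direct-network node."""

    def __init__(self, z_dim, n_weights, hidden=16, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W1 = rng.normal(scale=0.1, size=(hidden, z_dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(2 * n_weights, hidden))
        self.b2 = np.zeros(2 * n_weights)

    def expected_distribution(self, z):
        h = np.tanh(self.W1 @ z + self.b1)         # hidden representation of the control input
        out = self.W2 @ h + self.b2
        mean, log_var = np.split(out, 2)           # per-weight mean and log-variance
        return mean, np.exp(log_var)

# Hypothetical control input: one-hot layer index concatenated with a normalized node index
z = np.concatenate([np.array([0.0, 1.0, 0.0]), [0.25]])
indirect = IndirectNetwork(z_dim=4, n_weights=8)
w_hat, var = indirect.expected_distribution(z)     # expected weights and their variances
```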
- FIG. 3 illustrates an indirect network for a plurality of direct network layers, according to one embodiment.
- the indirect network 220 generates expected weight distributions for nodes of the network model.
- the expected weight distributions may be generated for each separate layer or for each node within a layer.
- the network model includes several layers in which each layer includes one or more nodes.
- the initial data inputs are entered at an initial network input data layer 400 - 403 , and are initially processed by a layer of direct nodes 410 - 413 .
- the output of these direct nodes 410 - 413 are used as inputs to the next layer of direct nodes 420 - 423 , which is used as an input to the direct nodes 430 - 431 , and finally as inputs to a model output data node 440 .
- the “direct network” as shown in FIG. 2 may represent a single layer or node in the larger model, such that the expected weight distribution generated by the indirect network 220 is generated with respect to the inputs and outputs of that particular layer.
- the indirect control inputs 260 specify the layer and an index of the node.
- the error in expected weights may be propagated to the indirect network 220 together with the control inputs 260 (the particular node) with which the error is associated.
- the indirect network 220 may learn, through the indirect control inputs and indirect parameters, how to account for the more general ways in which the weights differ across the larger network of weights being predicted by the indirect network 220 .
- the varying control inputs may be used to learn ways other designated characteristics affect the weights, for example by designating a domain of a data set, a source of the data set, characteristics of a model or environment for the data set, and so forth.
- though shown in FIG. 3 as specifying a particular layer and node, the indirect control inputs may additionally or alternatively reflect additional types of conditions that affect the output expected weight distribution, as also discussed herein.
- the weights of the direct network 200 may be regularized based on the expected weight distribution ⁇ of the direct network.
- the indirect network may be used to set expectations for the direct network weights and encourage the values of the direct network weights towards “more-likely” values.
- the indirect network may also be modified to generate an expected weight distribution that increases the likelihood of the direct network generating the expected result.
- the error is a function of the training data output y with respect to the predicted output ⁇
- the regularization term is a function of the direct network weights and the expected weight distribution ⁇ .
- a data sample is evaluated by the direct network using the direct network weights w.
- the error for the direct network may be evaluated with respect to the direct network.
- the direct network weights w may be updated based on the loss function by the derivative of the loss function with respect to the direct network weights w. Since the loss function L includes the expected weight distribution ⁇ , the update to the direct network weights w is encouraged towards likely values as reflected in the expected network distribution as provided by the indirect network.
- the indirect parameters of the indirect network may be updated by determining the derivative of the loss function L with respect to the estimated weight distributions (and the indirect parameters generating the estimated weight distribution) and propagating the derivative to the indirect parameters. Accordingly, the error function thus permits the direct network to be “encouraged” towards the weights suggested by the indirect network, while also permitting deviation to account for the particular data processed by the direct network. Since the indirect network may be a simpler network or otherwise describe more general trends across various indirect control inputs, the training of the direct network to preference, but still deviate within, the expected weight distribution permits general description of the weights in the indirect network and accounting for particular data sets in the direct network.
- the regularization parameter ⁇ may be an output (direct or derived) from the indirect network 220 .
- the regularization parameter ⁇ may also represent a noise variance or other data uncertainty that may be learned from the data set itself.
- the regularization parameter ⁇ may vary according to the particular weight, layer, or node of the direct network (e.g., as output by the indirect network 220 ).
- the regularization parameter ⁇ may represent a variance of the expected weight distribution ⁇ .
- the regularization function R may comprise various functions to represent different norms, such as an L1 or L2 norm, and may be any distribution or parametric loss function that may represent the direct network. Additional regularization functions include those corresponding to heavy-tailed or robust losses, binary losses for modeling binary weights, rich sparsity patterns for dropout, and other arbitrary explicit and implicit distributions, and so forth. Primarily, the regularization function R used with the expected weight distribution discourages (but permits) the direct weights from being assigned less-likely weight values and encourages more-likely ones. Accordingly, various regularization functions R may be selected that increase the penalty for a direct weight based on how unlikely the direct weight is in the expected weight distribution, such that more-unlikely weights are penalized more than less-unlikely weights. E.g., the lower the probability of the weight, the higher the penalty.
- the regularization function R itself may be a learnable function from the indirect control inputs and the indirect parameters. That is, the regularization function R may itself be output from the indirect network function based on the expected weight distribution. As a result, the regularization function R may learn to evaluate the direct weight w according to the expected weight distribution and according to the likelihood of the direct weight w suggested by the data set training the model.
- the indirect network generates an expected weight distribution represented as a Gaussian model.
- the expected value for a weight ⁇ in the direct network is a mean of the Gaussian distribution, while a variance of the Gaussian distribution is used as the regularization parameter ⁇ . This is illustrated in FIG. 2 as expected weights 222 and the regularization parameter ⁇ 224 .
- the loss function may use the expected weight as an “anchor” or set point for the direct network weights w. As one example loss function for this embodiment:
- Equation 7: E(w, θ, λ, D) = E(w, D) + λ R(w, ŵ)
- Equation 8: E(w, θ, λ, D) = E(w, D) + (λ/2) Σ_{i=1}^{m} (w_i − ŵ_i(θ))² − (m/2) ln λ
- the error function includes an error term, a regularizer term, and in equation 8, a normalizer
- the error term is a function of the direct network weights w and the input data. That is, based on the input data and the weights, a generated output ⁇ is compared with the known outputs y.
- the regularizer term represents the expected weights ⁇ as generated by the indirect network with indirect parameters ⁇ with the expected weights encouraged with the regularization parameter ⁇ , which may represent the variance of the distribution.
- the normalizer term may be used to account for the Gaussian distribution and normalize the Gaussian across the direct network; the number of weights m in the direct network is included to normalize, since each is modeled by a Gaussian.
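- A short sketch evaluating the Gaussian-anchored loss of Equation 8 as reconstructed above; the numeric values and the externally supplied data error are illustrative assumptions.

```python
import numpy as np

def gaussian_anchored_loss(w, w_hat, lam, data_error):
    """Equation 8: the data error, a squared pull toward the expected weights,
    and a normalizer for the m per-weight Gaussians."""
    m = w.size
    regularizer = 0.5 * lam * np.sum((w - w_hat) ** 2)
    normalizer = -0.5 * m * np.log(lam)
    return data_error + regularizer + normalizer

# Illustrative values
w = np.array([0.8, -0.2, 1.1])        # direct weights
w_hat = np.array([1.0, 0.0, 1.0])     # expected weights from the indirect network
loss = gaussian_anchored_loss(w, w_hat, lam=2.0, data_error=0.35)
```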
- the derivative is obtained with respect to the parameters of each network and applied to the parameters.
- various types of parameter optimization and error propagation may be used to adjust the parameters, such as steepest gradient descent.
- the weights may be updated by a linear combination of the unregulated weights w* and the expected weights ⁇ (or the expected weight distribution ⁇ ).
- the unregulated weights w* represent weights of the direct network for an error function in which the error function does not include a regularization term, and in certain cases may be based only on an error term measuring the difference between the training data output y and the generated output ⁇ .
- Other alternative or more complex combinatory functions may also be used to combine the indirect network outputs (e.g., the expected weight distribution) with the unregulated direct network weights W*.
- C is a constant and the weights are regularized with an L p norm, although other regularizers may also be used as discussed above.
- with coefficients α and β between 0 and 1, these may additively determine the direct weights.
- the coefficients may be set so that in combination they equal 1.
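- Written out explicitly, the combination described above might take the following additive form (the exact expression in the patent is not reproduced here, so this is an assumed sketch consistent with the surrounding text):

```latex
w_i = \alpha \, w_i^{*} + \beta \, \hat{w}_i,
\qquad \alpha, \beta \in [0, 1], \quad \alpha + \beta = 1
```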
- FIG. 4 illustrates a process for training weights of a direct network and associated indirect network.
- the direct weights w and the indirect parameters ⁇ are initialized.
- the process shown in FIG. 4 may be performed for one or more inputs x from a training dataset D each having an associated output y, for example as a batch of training data selected from D.
- the direct network weights w are applied to the input x to generate a model output ⁇ .
- the expected weights are identified 410 from the indirect network, for example by processing the indirect parameters θ and, if applicable, the indirect control input z through the indirect network.
- the error function is applied to identify 420 the error.
- the direct network may be updated as discussed above, for example by determining the derivative of the error function with respect to the direct network weights and updating the weights according to a gradient descent or other update algorithm.
- One representation of the update 430 for the direct weights and the update 440 for the indirect parameters is a gradient step on the error function, as sketched below.
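- The specific update equations referenced above are not reproduced in this text; a standard gradient-descent form consistent with the surrounding description (the learning rate η and the symbols θ and λ are assumed notation) would be:

```latex
w \leftarrow w - \eta \,\frac{\partial E(w, \theta, \lambda, D)}{\partial w},
\qquad
\theta \leftarrow \theta - \eta \,\frac{\partial E(w, \theta, \lambda, D)}{\partial \theta}
```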
- the direct network weights may thus be regularized toward learned characteristics of the direct network, rather than regularized toward low weights or toward a “simple” set of weights. This permits the regularization to describe the data set more generally while conforming more naturally to the general contours of the data set itself.
- FIG. 5 illustrates an example in which the direct network 200 is evaluated with the expected weight distribution provided by an indirect network 220 according to one embodiment.
- the indirect network 220 may have a similar structure as discussed above, and may receive indirect control inputs z 260 and indirect parameters ⁇ 280 .
- the expected weight distribution ⁇ 500 is used to model the possible weights for the direct network.
- various possible weights are evaluated and the results combined to make an ultimate prediction by the weight distribution ⁇ 500 as a whole when applied to the direct network, effectively creating an ensemble of networks which form a joint predictive distribution.
- the generated output ⁇ is evaluated as the most-likely value of y given the expected distribution of the weight sets.
- y may be represented as an integral over the likelihood given an input and the expected weight distribution ⁇ .
- the direct network output ⁇ may also be considered as a Bayesian Inference over the expected weight distribution ⁇ , which may be considered a posterior distribution for the weights (since the expected weight distribution is a function of training from an observed dataset).
- the indirect parameters ⁇ may be learned from an error of the expected weight distribution, for example by Type-II maximum likelihood.
- the integration averages over all possible solutions for the output y, weighted by the individual posterior probabilities of the weights, P(w | z, θ).
- this inference may determine a value of output y as a probability function based on the direct network input x, the indirect control inputs z, and the indirect parameters θ, or more formally: P(y | x, z, θ) = ∫ P(y | x, w) P(w | z, θ) dw
- the uncertainty of the direct weights is explicitly accounted for in the expected weight distribution ⁇ , which allows inferring complex models from little data and more formally accounts for model misspecification.
- the direct network output y may be evaluated by sampling a plurality of weight sets from the distribution and applying the direct network to the sampled weight sets.
- This posterior inference for the expected weight distribution and the indirect control parameters may be performed by a variety of techniques, including Markov Chain Monte Carlo (MCMC), Gibbs-Sampling, Hamiltonian Monte-Carlo and variants, Sequential Monte Carlo and Importance Sampling, Variational Inference, Expectation Propagation, Moment Matching, and variants thereof.
- the posterior inference may be a variational inference, which approximates the inference based on a marginal likelihood.
- various such inference techniques may be used, and this is one example embodiment.
- possible improved weight distributions are evaluated that improve the posterior approximation of the expected weight distribution.
- an approximate distribution q(w̃) is evaluated relative to a set of samples drawn from the approximate distribution q(w̃).
- the approximate distribution in this example can be evaluated in one embodiment by a loss function on the approximate distribution q(w̃) given a prior distribution p(w) of the set of weights (the expected weight distribution provided by the indirect network). This loss function is defined for a given approximate distribution q(w̃) and indirect parameters θ, given an input x, an output y, and indirect control inputs z (a standard form is sketched below).
- the initial term is a data term that describes the fit to the data with weights sampled from (represented as an integral over) the variational distribution q(w̃).
- the second term is a regularization term expressed as a KL-divergence between the current approximate distribution q(w̃) and its prior distribution p(w).
- a number L of samples {w̃_1, w̃_2, w̃_3, . . . , w̃_L} drawn from q(w̃) may be used to approximate the data term.
- an updated expected weight distribution of the direct network may be evaluated and used to update the indirect parameters ⁇ for subsequent evaluation.
- the derivative of the loss function may be taken with respect to the expected weight distribution to update the expected weight distribution.
- the derivative of the loss function may also be taken with respect to the indirect parameters (potentially through the updated expected weight distribution) to update the indirect parameters, as sketched below.
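- The loss and update expressions referenced above are not written out in this text; a standard variational form consistent with the description (a sampled negative log-likelihood data term plus a KL regularizer, with gradient updates on the approximate distribution q and the indirect parameters θ; the learning rate η is assumed notation) would be:

```latex
\mathcal{L}(q, \theta)
  = -\int q(\tilde{w}) \,\ln p(y \mid x, \tilde{w}) \, d\tilde{w}
    + \mathrm{KL}\big(q(\tilde{w}) \,\|\, p(w \mid z, \theta)\big)
  \approx -\frac{1}{L} \sum_{l=1}^{L} \ln p(y \mid x, \tilde{w}_l)
    + \mathrm{KL}\big(q(\tilde{w}) \,\|\, p(w \mid z, \theta)\big),
  \qquad \tilde{w}_l \sim q(\tilde{w})
\\[4pt]
q \leftarrow q - \eta \,\frac{\partial \mathcal{L}}{\partial q},
  \qquad
  \theta \leftarrow \theta - \eta \,\frac{\partial \mathcal{L}}{\partial \theta}
```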
- FIG. 6 illustrates a process for training an indirect network to generate an expected weight distribution for a direct network to be evaluated by the expected weight distribution, according to one embodiment.
- the indirect network weights or “priors” may be initialized by the indirect network, for example to zero.
- the direct weight distribution or prior is identified 600 from the indirect network, for example based on the control inputs and indirect parameters ⁇ .
- the prior distribution is sampled 610 to identify a plurality of possible weight sets for the direct network and generate outputs for the weight sets, reflecting potential outputs of the direct network.
- a loss function is identified 620 and applied to evaluate the loss of the expected weight distribution and a potential approximation of an updated weight distribution.
- the expected prior distribution (e.g., the expected weight distribution) is updated 630 based on an evaluation and identification of an approximation of the updated weight distribution that improves the error of the generated output from the sampling of the distribution approximation. For example, a derivative of the distribution may be used with respect to an error function for the approximate weight distribution.
- the indirect parameters used by the indirect network to generate the distribution are also updated 640 .
- the indirect network provides additional flexibility in training the direct network and additional ability of the trained network to avoid biases from initial training sets.
- the indirect network may also be used for transfer learning of other related data sets. Rather than begin anew, the indirect network from the initial data set may be re-used to provide some initial characterization of the new data set.
- the control inputs of the indirect networks may designate the domain of the different data sets, permitting the model to quickly learn characteristics of the new data set because it differs in that control parameter but may not differ in the other control parameters of the network.
- incorporating characteristics of an environment into the control parameters may allow the model to quickly acclimate to new environments. For example, when the control parameters describe physical characteristics, such as an image field of view, viscosity, or the effect of gravity, the indirect network may readily learn how these characteristics affect the expected weight distribution of the direct network, particularly when the change in the characteristic may be known.
- control inputs to the indirect network may be treated as hidden or latent codes which control the generation of the indirect network.
- the input code for a given direct network may be inferred from the training data based on the different indirect parameters suggested by the various training data.
- variations in the input data may be used to learn a most-likely posterior distribution of the indirect control inputs. This may permit additional data sets to effectively leverage the more general structural characteristics of the direct network as reflected in the indirect parameters by identifying the appropriate control input for the additional data sets.
- the error function may incorporate the latent code in the regularizer and perform gradient descent with respect to the latent code to identify likely latent codes for different data.
- the regularizer itself may be weighted by a regularization parameter that is dependent on the latent code z.
- the solutions for z may be integrated to evaluate a joint posterior distribution for the indirect parameters θ as well as the latent control input z.
- Using the latent control inputs may permit general formulations of the expected weight distributions and permit the expected weight distributions to reflect how different data sets generate different weight distributions without a priori knowledge of the latent ‘state’ of the data sets.
- the indirect control inputs z may be only partially latent and may include control inputs that describe a domain or subdomain of a data set as noted above.
- the domain terms may be fixed for structural modifications of the network, but vary across data sets.
- the domain term may be constant, and the subdomain may vary.
- the latent indirect control inputs may be generated and evaluated for the domain or subdomain of the data set, and the parameters and latent indirect control inputs may more effectively capture how the expected value of direct weights are modified across different applications and domains.
- Using the indirect network to generate an expected weight distribution or expected weights for the direct model provides many advantages and applications of computer models. Some examples are discussed below for transfer learning, learning from limited data sets, ‘repairing’ mappings, and adjusting for changes in incoming input data.
- the indirect network aids in the generation of transfer learning for different tasks. Since the indirect network predicts general expected characteristics of a network, the parameters for the indirect network may be used as initial expected parameters for training additional direct networks for different tasks. In this example, the indirect network may be used to initialize the weights for another direct network. As another example, the domain of a task or data set may be specified as a control input z, either with or without latent control inputs. This permits the indirect network to be re-used for similar types of data and tasks in transfer learning by re-using the indirect network trained for an initial task.
- the modified control input may permit effective and rapid learning of additional domains because the training for the new domain may only require learning the differences from the prior domain while re-using the previously-learned aspects of the general data as reflected in the trained indirect control parameters ⁇ .
- the control inputs z may define known properties or parameters of the environment in which the direct network is applied, and changes to those properties may be used to learn other data sets having other properties by designating the properties of the other data sets when learning the new data sets.
- Such a control input z may be a vector describing the relatedness of tasks. For many purposes, this can be an embedding of the task in some space. For example, when trying to classify animals we may have a vector containing a class-label for quadrupeds in general and another entry for the type of quadruped. In this case, dogs may be encoded as [1,0] and cats as [1,1] if both are quadrupeds and differ in their substructure.
- the indirect network can describe shared information through the quadruped label “1” at the beginning of that vector and can model differences in the second part of the vector.
- control input z can be given by time of year (month, day, time, and so forth) and the geographical location at which we care to predict. More generally, z can also be a learned vector without knowing the appropriate control inputs a priori, as long as we can share them between tasks. Explicitly, z can also be predicted from the direct input x. An example of this is images taken from a camera under different weather conditions and a network predicting the appropriate control input z to ensure that the indirect network instantiates a weather-appropriate direct network for the relevant predictive task.
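- A brief sketch of the task-encoding idea from the quadruped example above; the two-entry control vectors and the linear indirect model are illustrative assumptions.

```python
import numpy as np

# Task embeddings used as indirect control inputs: the first entry encodes the shared
# quadruped structure, the second entry the task-specific substructure (dogs vs. cats).
z_dog = np.array([1.0, 0.0])
z_cat = np.array([1.0, 1.0])

def expected_weights(theta, z):
    """Hypothetical linear indirect model: shared rules live in the first column of
    theta, task-specific differences in the second."""
    return theta @ z

theta = np.array([[0.9, 0.1],
                  [0.5, -0.3],
                  [0.2, 0.4]])
w_hat_dog = expected_weights(theta, z_dog)   # expected weights conditioned on the dog task
w_hat_cat = expected_weights(theta, z_cat)   # differ from the dog weights only via column two
```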
- the indirect network is jointly trained with multiple direct networks for different tasks, permitting the indirect network to learn more general ‘rules’ for the direct networks and reflect an underlying or joint layer for the direct networks that may then be specialized into individual direct weights for individual direct networks for individual tasks.
- one of the control inputs z may specify the direct network (e.g., relating to a particular task) for the indirect network applied (known parameters would be classes as above or geographical location or other covariates related to the process at hand). An example of this may be instantiated as a predictive task across cities where a company may operate.
- when the predictive task relates to properties of cities, as a spatiotemporal supply and demand prediction for a ridesharing platform does, one can utilize the indirect network by deploying it across cities jointly and using the different city-specific variables as inputs to improve local instances of the forecasting model.
- City-specific inputs may be related to population density, size, traffic conditions, legal requirements and other variables describing information related to the forecasting task.
- the use of the indirect network to generate expected weights or general distributions of the direct networks also permits the indirect network to be trained more accurately with more limited training data.
- a direct network may be trained (or the expected weight distribution updated for the direct network) for a new data set quickly. Accordingly, even when a small amount of data is known for a particular task, the direct network may be effectively trained, and in some examples by a single data set (“one-shot”), using the indirect network.
- a single data set or batch may more effectively train the model as a whole because the indirect network naturally generalizes (regularizes) from the specific data.
- An example of this is when having a robot or autonomous agent act in unconstrained environments it has not previously been trained on. If, for example, an autonomous agent has access to a strong model previously learned for related tasks and is exposed to a novel environment, for instance a street-network with previously unseen properties of visual nature or previously unseen obstacles or legal requirements, an ideal learner can adapt previously learned rules in the indirect network rapidly to changes in the new environment. For instance, if the speed limit has been changed only one example of the new speed limit could suffice to learn a speed controller if the indirect network contains an input related to speed limits.
- the indirect network may also be used to shrink the total size of a computer model while maintaining high predictive power for the outputs. Because the indirect network generates the expected weight distribution ⁇ , the indirect network parameters may be used as a “compressed” form of describing direct network weights without a system needing to store a complete set of direct network weights. To apply and use the network, the indirect network parameters ⁇ and control inputs z may be applied to generate the expected weight distribution ⁇ for a particular direct network when required for application of the model and thereby avoid pre-storage of a large number of direct network weights w, and in some examples the expected weight distribution ⁇ itself is not stored and may be generated at run-time for a model. The expected weight distribution ⁇ may be probabilistically evaluated, for example through sampling, as discussed above to determine an input as discussed with respect to FIGS. 5 and 6 .
- the direct network(s) have reduced efficacy due to data loss of the weights of the direct network itself, or because a portion of the inputs or outputs of the direct network are missing, become unreliable, or become ineffective.
- the indirect network may be used in these cases to improve the prediction of the direct network and permit the direct network to be modified for these problems. If portions of a trained direct network are lost or no longer describe current data well, the indirect network can provide a “starting place” for the direct network weights without complete re-training of the direct network. In this example, when a system identifies that portions of the direct network are lost or missing, the system may replace those weights based on the expected weight distribution ⁇ from the indirect network.
- the expected weight distribution corresponding to that portion of the direct network may be used in lieu of the direct network weights.
- the average or mean of the expected weight distribution may be identified and replace the missing or lost weights.
- the system can replace the missing direct weights with a learned approximation from the indirect network.
- the direct network weights may then be applied with a combination of the learned weights w that were known (e.g., not missing) and expected weights ⁇ (e.g., when weights w are missing).
- the indirect network may be used to adapt the direct network to account for the missing input data. This may occur when portions of the input x are generated from or derived by sensor data, and those sensors have become unreliable or broken. In many control scenarios, an evaluation of that data as though it were present may have significant implications for control of a device if the output of the network errs when the data is missing.
- the direct network weights may be adjusted or modified based on the lack of that input data to prioritize a weight distribution from the expected weight distribution that more-heavily uses the other portions of the input to generate the output.
- when the expected weight distribution is jointly determined for several weights, certain distributions may represent higher or lower dependence on different inputs for evaluating the output.
- the portions of the direct network that use the missing or erroneous data as inputs, have high expected weights for that data, or have weight distributions that weigh that data highly may be deactivated or adjusted to account for the missing data based on the expected weight distribution.
- the expected weight distribution ⁇ for the direct network may be evaluated and modified to reduce reliance on (or deactivate) portions of the direct network affected by the missing data.
- an expected weight distribution for a direct network may be modified to exclude portions of the distribution that highly weigh the missing or affected inputs. Accordingly, the expected weight distribution from the indirect network may be used to identify and use ‘alternate’ weights for the direct network when inputs to the direct network become unreliable.
- a vehicle such as a fully or partially autonomous car or aircraft, can include a vehicle controller that receives sensor data as input and generates actuator commands as output.
- a vehicle controller for a car may receive as sensor data one or more of position, speed, acceleration, heading, and heading rate, and it may generate as actuator commands one or more of steering angle, throttle setting, and brake settings.
- a vehicle controller for an aircraft may receive as sensor data one or more of positional data, velocity, acceleration, angular orientation, and angular rates, and it may generate as actuator commands one or more of control surface deflections (e.g., elevator, ailerons, elevons, rudder) and throttle settings.
- the vehicle controller can be implemented by a direct network, such as the direct network 200 of FIG. 2 .
- the vehicle system can include a controller reconfiguration module that includes the indirect network for adjusting or adapting the direct network of the vehicle controller.
- the controller reconfiguration module can receive information indicative of the health of the vehicle (e.g., sensors, actuators, vehicle components such as tires). This information can be used to generate the indirect control input z.
- Each component of z can represent a particular sensor, actuator, or other component of the vehicle.
- the components can represent the health in a binary manner (e.g., 1 represents a healthy component and 0 represents a completely failed component) or in a continuous manner.
- z can also include a component that represents the operational mode of each component (e.g., a first state represents nominal operation, a second state can represent a stuck sensor (constant sensor output) or actuator (fixed actuator state), a third state can represent a floating actuator (e.g., there is no control over the actuator and it is free to move based on external forces), a fourth state can represent a noisy sensor, and so on).
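The encoding of component health into the indirect control input z described above can be illustrated with a hedged sketch; the component names, the four operational-mode codes, and the one-hot layout below are assumptions for the example rather than a prescribed format.

```python
import numpy as np

# Hypothetical operational modes for the example (the numbering is illustrative).
NOMINAL, STUCK, FLOATING, NOISY = 0, 1, 2, 3

def build_indirect_input(component_health, component_modes):
    """Assemble the indirect control input z from vehicle health reports.

    component_health : dict of component name -> health in [0, 1]
                       (1.0 healthy, 0.0 completely failed).
    component_modes  : dict of component name -> operational mode code.
    """
    names = sorted(component_health)
    health = np.array([component_health[n] for n in names], dtype=float)
    # One-hot encode the operational mode of each component.
    modes = np.zeros((len(names), 4))
    for i, n in enumerate(names):
        modes[i, component_modes.get(n, NOMINAL)] = 1.0
    return np.concatenate([health, modes.ravel()])

z = build_indirect_input(
    {"steering_actuator": 1.0, "speed_sensor": 0.3, "front_left_tire": 0.9},
    {"speed_sensor": NOISY},
)
```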
- the vehicle system may also include a fault monitoring system that generates an output providing information related to the health or performance of the vehicle's sensors and/or actuators.
- the fault monitoring system can operate in real time based on signals generated by or measurements of the sensors, actuators, or other components of the vehicle.
- This vehicle health information can be provided to the controller reconfiguration module to generate the indirect control input z of the indirect network.
- Some sensors and actuators include systems to report health status. This information can be used as input to the fault monitoring system.
- the above vehicle system, comprising the vehicle controller, controller reconfiguration module, and fault monitoring system, can adjust the controller in real time or online based on conditions of the vehicle. In this way, the controller can be robust to a wide variety of failure modes and vehicle dynamics.
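A minimal sketch of the online reconfiguration loop described above, under the assumption that a fault monitor summarizes vehicle health as z, an indirect network maps z to expected direct-network weights, and the direct network maps sensor data to actuator commands; the function and method names (e.g., set_weights) are placeholders.

```python
def control_step(sensor_data, fault_monitor, indirect_net, direct_net):
    """One illustrative real-time step of the reconfigurable controller."""
    # 1. The fault monitor summarizes vehicle health as the indirect input z.
    z = fault_monitor(sensor_data)
    # 2. The indirect network proposes expected weights for the current condition.
    w_hat = indirect_net(z)
    # 3. The direct network is (re)parameterized with those weights and produces
    #    actuator commands from the sensor data.
    direct_net.set_weights(w_hat)
    return direct_net(sensor_data)
```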
- FIG. 7 is a high-level block diagram illustrating physical components of a computer 700 used to train or apply computer models such as those including a direct and indirect network as discussed herein. Illustrated is at least one processor 702 coupled to a chipset 704. Also coupled to the chipset 704 are a memory 706, a storage device 708, a graphics adapter 712, and a network adapter 716. A display 718 is coupled to the graphics adapter 712. In one embodiment, the functionality of the chipset 704 is provided by a memory controller hub 720 and an I/O controller hub 722. In another embodiment, the memory 706 is coupled directly to the processor 702 instead of the chipset 704.
- the storage device 708 is any non-transitory computer-readable storage medium, such as a hard drive, compact disk read-only memory (CD-ROM), DVD, or a solid-state memory device.
- the memory 706 holds instructions and data used by the processor 702 .
- the graphics adapter 712 displays images and other information on the display 718 .
- the network adapter 716 couples the computer 700 to a local or wide area network.
- a computer 700 can have different and/or other components than those shown in FIG. 7 .
- the computer 700 can lack certain illustrated components.
- a computer 700, such as a host or smartphone, may lack a graphics adapter 712 and/or display 718, as well as a keyboard or external pointing device.
- the storage device 708 can be local and/or remote from the computer 700 (such as embodied within a storage area network (SAN)).
- the computer 700 is adapted to execute computer program modules for providing functionality described herein.
- module refers to computer program logic utilized to provide the specified functionality.
- a module can be implemented in hardware, firmware, and/or software.
- program modules are stored on the storage device 708 , loaded into the memory 706 , and executed by the processor 702 .
- a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
- Embodiments of the invention may also relate to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer.
- a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus.
- any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
- Embodiments of the invention may also relate to a product that is produced by a computing process described herein.
- a product may comprise information resulting from a computing process, where the information is stored on a non-transitory, tangible computer readable storage medium and may include any embodiment of a computer program product or other data combination described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Algebra (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Image Analysis (AREA)
Abstract
Description
$y = f(x)$
$y = f(x) = f_L(f_{L-1}(\cdots(f_1(x))))$ Equation 2
where $f_L$ denotes the mapping computed by the Lth layer. In other words, the initial input undergoes successive transformations by each layer into a new array of values.
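For concreteness, the layer composition of Equation 2 can be written as a simple loop over layers; the affine-plus-tanh form of each layer below is an assumption for the sketch, not a constraint of the approach.

```python
import numpy as np

def forward(x, layers):
    """Apply y = f_L(f_{L-1}(... f_1(x) ...)) for a stack of affine layers."""
    h = x
    for W, b in layers:          # layers are listed from f_1 to f_L
        h = np.tanh(W @ h + b)   # each layer transforms the previous layer's output
    return h

# Two illustrative layers: 3 inputs -> 5 hidden units -> 2 outputs.
layers = [(np.random.randn(5, 3), np.zeros(5)),
          (np.random.randn(2, 5), np.zeros(2))]
y = forward(np.ones(3), layers)
```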
$E(w,D) = \mathcal{L}(y, f(x;w)) = \mathcal{L}(y, \hat{y}) = \lVert y - \hat{y}\rVert^2$. Equation 3
$E(w,D) = \mathcal{L}(y, \hat{y}) + \lambda R(w) = \mathcal{L}(y, \hat{y}) + \lambda\lVert w\rVert_p$, Equation 5
where each weight has an index i. This norm pushes the squared values of individual weights towards zero and thus favors small weights. Another exemplary norm is the L1-norm,
which induces sparsity by penalizing the absolute values of weights. In these examples, the strength of the regularization, λ, is a global value that is typically chosen manually or via cross-validation or similar procedures. The regularization parameter λ thus may control the strength of the effect of the regularization on the error function E.
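A short, hedged example of the regularized error of Equation 5, combining a squared data term with an Lp penalty over the flattened weights; the flattening of the weight list and the use of the p-th power of the norm are assumptions of the sketch.

```python
import numpy as np

def regularized_error(y, y_hat, weights, lam=1e-3, p=2):
    """E(w, D) = ||y - y_hat||^2 + lam * sum_i |w_i|^p  (illustrative)."""
    data_term = np.sum((y - y_hat) ** 2)
    w = np.concatenate([wi.ravel() for wi in weights])
    reg_term = np.sum(np.abs(w) ** p)   # p=2 penalizes squared values, p=1 absolute values
    return data_term + lam * reg_term

# Example with two weight matrices of arbitrary shape.
weights = [np.random.randn(5, 3), np.random.randn(2, 5)]
loss = regularized_error(np.ones(2), np.zeros(2), weights, lam=1e-3, p=2)
```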
$L = E(y, \hat{y}) + \lambda R(w, \phi)$ Equation 6
The error term is a function of the direct network weights w and the input data: based on the input data and the weights, a generated output ŷ is compared with the known outputs y. The regularizer term represents the expected weights ŵ as generated by the indirect network with indirect parameters ϕ, with the strength of the pull toward the expected weights controlled by the regularization parameter λ, which may represent the variance of the distribution. The normalizer term may be used to account for the Gaussian distribution and normalize it across the direct network; the number of weights m in the direct network is included in the normalization, since each weight is modeled by a Gaussian.
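The objective of Equation 6 can be sketched as below, with the regularizer pulling the direct weights toward the expected weights ŵ(ϕ) generated by the indirect network; the squared-difference form and the omission of the Gaussian normalizing constant (which does not depend on w) are assumptions made for this illustration.

```python
import numpy as np

def joint_loss(y, y_hat, w, w_hat, lam):
    """L = E(y, y_hat) + lam * R(w, phi), with R pulling w toward w_hat(phi)."""
    error_term = np.sum((y - y_hat) ** 2)
    # Squared pull toward the expected weights; proportional to the negative
    # log of an isotropic Gaussian over the direct weights, up to constants.
    reg_term = np.sum((w - w_hat) ** 2)
    return error_term + lam * reg_term
```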
$w = \tau\hat{w}(\phi) + \xi w^*$ Equation 9
$\tilde{E}(w, \phi, \lambda, D) = E(w, D) + \lambda\lVert w - \hat{w}(\phi)\rVert_p - C$ Equation 10
In Equation 10, C is a constant and the weights are regularized with an Lp norm, although other regularizers may also be used as discussed above. By setting the coefficients τ and ξ of Equation 9 between 0 and 1, they may additively determine the direct weights. In addition, the coefficients may be set so that in combination they equal 1.
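The additive combination of Equation 9 can be illustrated directly; choosing ξ = 1 − τ so the coefficients sum to 1, as described above, is shown here as one possible setting.

```python
import numpy as np

def combine_weights(w_hat, w_star, tau=0.7):
    """w = tau * w_hat(phi) + xi * w*, choosing xi = 1 - tau so the pair sums to 1."""
    xi = 1.0 - tau
    return tau * w_hat + xi * w_star

w = combine_weights(np.ones(10), np.zeros(10), tau=0.7)   # every entry equals 0.7
```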
By evaluating the direct network weights with respect to an estimated weight distribution as provided by the indirect network, the direct network weights may thus be regularized toward learned characteristics of the direct network, rather than toward low weights or a “simple” set of weights. This permits the regularization of the direct network to describe the data set more generally and to follow the natural contours of the data set itself.
$\tilde{E}(w, z, D) = \tilde{E}(w, D) + \lambda_z R_z(z, \hat{z})$ Equation 17
In this example, the regularizer itself may be weighted by a regularization parameter λ_z that is dependent on the latent code z.
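A hedged sketch of Equation 17, where a latent-code regularizer weighted by its own parameter λ_z is added to the base objective; the squared-difference form of R_z is an assumption for the example.

```python
import numpy as np

def latent_regularized_loss(base_loss, z, z_hat, lam_z):
    """E~(w, z, D) = E~(w, D) + lam_z * R_z(z, z_hat)  (illustrative)."""
    r_z = np.sum((z - z_hat) ** 2)   # assumed squared penalty on the latent code
    return base_loss + lam_z * r_z
```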
Claims (20)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/372,415 US12217179B2 (en) | 2016-10-20 | 2023-09-25 | Intelligent regularization of neural network architectures |
| US19/011,485 US20250139436A1 (en) | 2016-10-20 | 2025-01-06 | Intelligent regularization of neural network architectures |
Applications Claiming Priority (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201662410393P | 2016-10-20 | 2016-10-20 | |
| US201762451818P | 2017-01-30 | 2017-01-30 | |
| US15/789,898 US11164076B2 (en) | 2016-10-20 | 2017-10-20 | Intelligent regularization of neural network architectures |
| US17/513,517 US11829876B2 (en) | 2016-10-20 | 2021-10-28 | Intelligent regularization of neural network architectures |
| US18/372,415 US12217179B2 (en) | 2016-10-20 | 2023-09-25 | Intelligent regularization of neural network architectures |
Related Parent Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US17/513,517 Continuation US11829876B2 (en) | 2016-10-20 | 2021-10-28 | Intelligent regularization of neural network architectures |
Related Child Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/011,485 Continuation US20250139436A1 (en) | 2016-10-20 | 2025-01-06 | Intelligent regularization of neural network architectures |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20240013049A1 (en) | 2024-01-11 |
| US12217179B2 (en) | 2025-02-04 |
Family
ID=61970509
Family Applications (4)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/789,898 Active 2040-06-04 US11164076B2 (en) | 2016-10-20 | 2017-10-20 | Intelligent regularization of neural network architectures |
| US17/513,517 Active 2038-02-14 US11829876B2 (en) | 2016-10-20 | 2021-10-28 | Intelligent regularization of neural network architectures |
| US18/372,415 Active US12217179B2 (en) | 2016-10-20 | 2023-09-25 | Intelligent regularization of neural network architectures |
| US19/011,485 Pending US20250139436A1 (en) | 2016-10-20 | 2025-01-06 | Intelligent regularization of neural network architectures |
Family Applications Before (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/789,898 Active 2040-06-04 US11164076B2 (en) | 2016-10-20 | 2017-10-20 | Intelligent regularization of neural network architectures |
| US17/513,517 Active 2038-02-14 US11829876B2 (en) | 2016-10-20 | 2021-10-28 | Intelligent regularization of neural network architectures |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US19/011,485 Pending US20250139436A1 (en) | 2016-10-20 | 2025-01-06 | Intelligent regularization of neural network architectures |
Country Status (1)
| Country | Link |
|---|---|
| US (4) | US11164076B2 (en) |
Families Citing this family (29)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117709426B (en) * | 2017-02-24 | 2024-11-15 | 渊慧科技有限公司 | Method, system and computer storage medium for training machine learning models |
| US11429861B1 (en) | 2017-05-01 | 2022-08-30 | Perceive Corporation | Device storing multiple sets of parameters for machine-trained network |
| EP3648017A4 (en) * | 2017-06-29 | 2021-08-04 | Preferred Networks, Inc. | METHOD OF TRAINING A DATA DISCRIMINATOR, TRAINING DEVICE FOR DATA DISCRIMINATOR, PROGRAM AND TRAINING METHOD |
| US11537870B1 (en) * | 2018-02-07 | 2022-12-27 | Perceive Corporation | Training sparse networks with discrete weight values |
| US12165066B1 (en) | 2018-03-14 | 2024-12-10 | Amazon Technologies, Inc. | Training network to maximize true positive rate at low false positive rate |
| US11995537B1 (en) * | 2018-03-14 | 2024-05-28 | Perceive Corporation | Training network with batches of input instances |
| US11586902B1 (en) | 2018-03-14 | 2023-02-21 | Perceive Corporation | Training network to minimize worst case surprise |
| US10599769B2 (en) * | 2018-05-01 | 2020-03-24 | Capital One Services, Llc | Text categorization using natural language processing |
| US11887003B1 (en) * | 2018-05-04 | 2024-01-30 | Sunil Keshav Bopardikar | Identifying contributing training datasets for outputs of machine learning models |
| JP6951295B2 (en) * | 2018-07-04 | 2021-10-20 | 株式会社東芝 | Learning method, learning device and image recognition system |
| US11386295B2 (en) | 2018-08-03 | 2022-07-12 | Cerebri AI Inc. | Privacy and proprietary-information preserving collaborative multi-party machine learning |
| US11556846B2 (en) | 2018-10-03 | 2023-01-17 | Cerebri AI Inc. | Collaborative multi-parties/multi-sources machine learning for affinity assessment, performance scoring, and recommendation making |
| CN113272822B (en) * | 2018-11-14 | 2025-10-31 | 直观外科手术操作公司 | Convolutional neural network for efficient tissue segmentation |
| US11693373B2 (en) * | 2018-12-10 | 2023-07-04 | California Institute Of Technology | Systems and methods for robust learning-based control during forward and landing flight under uncertain conditions |
| US20220164652A1 (en) * | 2019-02-15 | 2022-05-26 | Nokia Technologies Oy | Apparatus and a method for neural network compression |
| CN110084271B (en) * | 2019-03-22 | 2021-08-20 | 同盾控股有限公司 | Method and device for identifying picture category |
| US11610154B1 (en) | 2019-04-25 | 2023-03-21 | Perceive Corporation | Preventing overfitting of hyperparameters during training of network |
| US11531879B1 (en) | 2019-04-25 | 2022-12-20 | Perceive Corporation | Iterative transfer of machine-trained network inputs from validation set to training set |
| US12112254B1 (en) | 2019-04-25 | 2024-10-08 | Perceive Corporation | Optimizing loss function during training of network |
| US11900238B1 (en) | 2019-04-25 | 2024-02-13 | Perceive Corporation | Removing nodes from machine-trained network based on introduction of probabilistic noise during training |
| CN110276113A (en) * | 2019-06-11 | 2019-09-24 | 嘉兴深拓科技有限公司 | A kind of network structure prediction technique |
| CN111582299B (en) * | 2020-03-18 | 2022-11-01 | 杭州铭之慧科技有限公司 | Self-adaptive regularization optimization processing method for image deep learning model identification |
| WO2021236551A1 (en) | 2020-05-18 | 2021-11-25 | Intel Corporation | Methods and apparatus for attestation of an artificial intelligence model |
| CN111881439B (en) * | 2020-07-13 | 2022-05-27 | 深圳市捷讯云联科技有限公司 | Recognition model design method based on antagonism regularization |
| US11836600B2 (en) * | 2020-08-20 | 2023-12-05 | D5Ai Llc | Targeted incremental growth with continual learning in deep neural networks |
| CN114531696B (en) * | 2020-11-23 | 2024-11-19 | 维沃移动通信有限公司 | Method and device for processing missing partial input of AI network |
| US12367661B1 (en) | 2021-12-29 | 2025-07-22 | Amazon Technologies, Inc. | Weighted selection of inputs for training machine-trained network |
| CN117314763A (en) * | 2023-08-17 | 2023-12-29 | 贵州医科大学附属口腔医院 | Oral hygiene management method and system based on machine learning |
| CN117131786B (en) * | 2023-10-26 | 2024-01-26 | 华中科技大学 | Voltage transformer insulation fault online identification method |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200184337A1 (en) | 2016-09-28 | 2020-06-11 | D5Ai Llc | Learning coach for machine learning system |
- 2017-10-20 US US15/789,898 patent/US11164076B2/en active Active
- 2021-10-28 US US17/513,517 patent/US11829876B2/en active Active
- 2023-09-25 US US18/372,415 patent/US12217179B2/en active Active
- 2025-01-06 US US19/011,485 patent/US20250139436A1/en active Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200184337A1 (en) | 2016-09-28 | 2020-06-11 | D5Ai Llc | Learning coach for machine learning system |
Non-Patent Citations (1)
| Title |
|---|
| United States Office Action, U.S. Appl. No. 15/789,898, filed Mar. 8, 2021, seven pages. |
Also Published As
| Publication number | Publication date |
|---|---|
| US20250139436A1 (en) | 2025-05-01 |
| US20180114113A1 (en) | 2018-04-26 |
| US11829876B2 (en) | 2023-11-28 |
| US11164076B2 (en) | 2021-11-02 |
| US20240013049A1 (en) | 2024-01-11 |
| US20220051100A1 (en) | 2022-02-17 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US12217179B2 (en) | Intelligent regularization of neural network architectures | |
| US20190286970A1 (en) | Representations of units in neural networks | |
| US11714996B2 (en) | Learning motor primitives and training a machine learning system using a linear-feedback-stabilized policy | |
| CN111727441B (en) | Neural network system for implementing conditional neural process for efficient learning | |
| US11803744B2 (en) | Neural network learning apparatus for deep learning and method thereof | |
| US9524461B1 (en) | Conceptual computation system using a hierarchical network of modules | |
| US11593611B2 (en) | Neural network cooperation | |
| JP4970408B2 (en) | An adaptive driver assistance system using robust estimation of object properties | |
| KR102031982B1 (en) | A posture classifying apparatus for pressure distribution information using determination of re-learning of unlabeled data | |
| US20140081895A1 (en) | Spiking neuron network adaptive control apparatus and methods | |
| US20250051289A1 (en) | Training an unsupervised memory-based prediction system to learn compressed representations of an environment | |
| WO2019018533A1 (en) | Neuro-bayesian architecture for implementing artificial general intelligence | |
| JP2020191088A (en) | Neural network with layers to solve semidefinite programming problems | |
| US20210279955A1 (en) | Gaussian mixture model based approximation of continuous belief distributions | |
| Ryou et al. | Multi-fidelity reinforcement learning for time-optimal quadrotor re-planning | |
| CN118968272A (en) | Method, device, electronic device and storage medium for identifying underwater objects | |
| US12298774B2 (en) | Computer architecture for identification of nonlinear control policies | |
| Mustapha et al. | Introduction to machine learning and artificial intelligence | |
| JP7642308B2 (en) | Learning device, learning method, and trained model | |
| US20250165679A1 (en) | State Estimation using Physics-Constrained Machine Learning | |
| Ansari | Deep learning and artificial neural networks | |
| Bouteiller | Managing the world complexity: From linear regression to deep learning | |
| Dawood et al. | Robot behaviour learning using topological gaussian adaptive resonance hidden markov model | |
| Marochko et al. | Pseudorehearsal in actor-critic agents with neural network function approximation | |
| US20240394506A1 (en) | Uncertainty estimation for neural networks using graphical representation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | AS | Assignment | Owner name: UBER TECHNOLOGIES, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:GHAHRAMANI, ZOUBIN;BEMIS, DOUGLAS;KARALETSOS, THEOFANIS;SIGNING DATES FROM 20171023 TO 20171024;REEL/FRAME:065871/0623 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |