WO2025179276A1

WO2025179276A1 - Regressing experiment outcomes using language model neural networks

Info

Publication number: WO2025179276A1
Application number: PCT/US2025/017045
Authority: WO
Inventors: Xingyou Song; Yutian CHEN; Sagi Perel; Chansoo LEE; Daiyi Peng
Original assignee: DeepMind Technologies Ltd; Gdm Holding LLC
Current assignee: DeepMind Technologies Ltd; Gdm Holding LLC
Priority date: 2024-02-22
Filing date: 2025-02-24
Publication date: 2025-08-28
Anticipated expiration: 2026-08-22

Abstract

Methods, systems, and apparatuses, including computer programs encoded on computer storage media, for predicting metric values of experiment outcomes. That is, by receiving and processing arbitrary settings data and metric data of an experiment and, in response, generating an output sequence of tokens from a vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment, the described techniques can leverage transfer learning across vastly different experiment classes and function as a universal predictor of metric values of experiment outcomes.

Description

REGRESSING EXPERIMENT OUTCOMES USING LANGUAGE MODEL NEURAL NETWORKS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 63/556,842, filed on February 22, 2024. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.

BACKGROUND

This specification relates to processing inputs using neural networks.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

This specification generally describes a system implemented as one or more computer programs on one or more computers in one or more locations that predicts (“regresses”) experiment outcomes using a neural network.

In particular, the system can receive and process arbitrary settings data characterizing particular values for settings for a particular experiment and metric data specifying a metric for evaluating an outcome of the particular experiment to predict the value of the metric. The experiment may be one performed in a real-world environment - such as one involving the chemical, optical or electromagnetic properties and/or dynamics of fluids or of solid objects in the real-world environment, or one in which a computational task is split over multiple (real) data processing devices - or the experiment may be one performed in a computational environment which simulates a real-world environment, such as an experiment based on measured data characterizing the real-world environment (e.g., the properties or dynamics of fluids or of solid objects located in the environment). The results of the evaluation of the outcome of the experiment may be used in the real world, e.g., to control how multiple (real) data processing devices collectively perform a computational task, by using the result of the evaluation to select how the computational task is split between the data processing devices. By processing arbitrary settings data and metric data, the system can function as a universal regressor (i.e., a universal predictor of metric values of experiment outcomes). That is, experiments are generally grouped into experiment classes (where an experiment class characterizes possible inputs for experiments and their possible outcomes), and the system can predict metric values of experiment outcomes for arbitrary experiments belonging to arbitrary experiment classes.

One example of an experiment class is “hyperparameter tuning for a machine learning model”, where an experiment can be, e.g., “training a machine learning model using a particular set of hyperparameters”. The experiment’s settings data can include a hyperparameter value for a machine learning model, and the metric data for the outcome of the experiment can specify the metric to evaluate the outcome of the experiment, e.g., the “test set mean squared error performance” of the machine learning model (i.e., the mean squared error on unseen test data, which is a measure of the quality of the training of the machine learning model). A regressor for such an experiment class could predict “test set mean squared error” for the machine learning model for various hyperparameter values (i.e., experiments).

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

Regression is a fundamental task for experimental design in many domains, such as hyperparameter tuning, computer software, industrial engineering, chemical discovery, and so on. A goal of regression is to predict metric values of experiment outcomes given input data (i.e., settings data and metric data). Such regressors can later be used for various applications, such as offline optimization, online optimization, low-cost benchmarking and simulation.

Traditional regression methods are limited to processing experiment-class-dependent experiment data to generate predictions of metric values of experiment outcomes. That is, traditional regression methods work well only when applied to one experiment class and require that input data from that experiment class be converted to a standardized format in order to predict metric values of experiment outcomes.

As one example of the experiment-class-dependent limitation, consider a first experiment class of “determining water needs for farms” and a second experiment class of “determining water needs for cities”. A standard regression method trained using experiments of only one experiment class but applied to process input data of both experiment classes, e.g., topography information to predict “test set mean squared error” of predicted monthly water needs, will not perform equally well for the two experiment classes. The failure of the standard regression method to perform well for both experiment classes is due to the relationships of input data and experiment outcome changing across the experiment classes and the regression method only learning from one of these experiment classes.

As one example of the limitation of requiring a standardized format, a standard regression method applied to an original experiment class may require that categorical features included in the input data be one-hot embedded against a user-provided set of categories. As a result, any new categorical feature values not belonging to the user-provided set of categories cannot be processed, which means the method cannot be used to make predictions of metric values for experiment outcomes from experiment classes that include any different categories in the input data.
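
The one-hot limitation can be illustrated with a minimal Python sketch. The category list and the `one_hot` helper are hypothetical and are not taken from this specification; they only show how a fixed, user-provided set of categories fails on a value it has never seen.

```python
# Hypothetical user-provided categories, fixed before training (an assumption).
KNOWN_OPTIMIZERS = ["sgd", "adam", "rmsprop"]


def one_hot(value, categories):
    """Return a one-hot list for `value`; fail if the category was never declared."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


# A declared category encodes without trouble.
encoded = one_hot("adam", KNOWN_OPTIMIZERS)

# A category introduced by a different experiment class cannot be encoded.
try:
    one_hot("adafactor", KNOWN_OPTIMIZERS)
    unseen_category_supported = True
except ValueError:
    unseen_category_supported = False
```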

As another example of the limitation of requiring a standardized format, a standard regression method applied to an original experiment class may require that input data be rescaled to lie within a numerical range (e.g., a user-defined bound such as [-1, 1]). As a result, dynamic yet minor input space changes (i.e., changes in the frequency and range of possible values for the input data after training the neural network) are incompatible with this static standardized format, and will produce less reliable predictions of metric values.
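
A short sketch of this failure mode, assuming simple min-max scaling onto [-1, 1]. The scaler and the batch-size values are illustrative assumptions, not part of the described system.

```python
def fit_min_max_scaler(training_values):
    """Build a scaler mapping the observed training range onto [-1, 1]."""
    low, high = min(training_values), max(training_values)

    def scale(x):
        return 2.0 * (x - low) / (high - low) - 1.0

    return scale


# Hypothetical batch sizes seen when the scaler was fitted.
scale = fit_min_max_scaler([16, 32, 64, 128])

seen_value = scale(64)      # inside the fitted range, so inside [-1, 1]
shifted_value = scale(256)  # input space grew after fitting, so outside [-1, 1]
```

A regressor that has only ever seen inputs in [-1, 1] has no basis for interpreting `shifted_value`, which is one way a small change to the input space can degrade predictions.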

This specification describes a system that can address the aforementioned challenges. That is, this specification describes techniques that (i) process arbitrary input data (i.e., arbitrary settings and metric data) to generate predictions using a neural network, and (ii) train the neural network using experiments from any number of experiment classes to achieve the benefits of transfer learning (i.e., improved prediction performance and generalization). As a result, the system performs well in terms of prediction performance on arbitrary experiments on both seen and unseen experiment classes.

More generally, the advantages of the described techniques, as summarized in Table 1 below, are as follows: (i) the ability to process arbitrary input data (referred to as being able to process ‘dynamic input space’), (ii) the ability to train the model using experiments from any number of different experiment classes (referred to as being able to ‘multitask’), (iii) the ability to bypass converting input data to a fixed format (referred to as not requiring ‘tensorizing x’, where x represents input data), and (iv) the ability to bypass scaling a predicted value of a metric to within a set range of values (referred to as not requiring ‘rescale y’, where y represents the predicted value). As can be seen in Table 1 below, the described techniques have advantages over conventional techniques.

Table 1

The described techniques’ ability to bypass ‘tensorizing x’ (i.e., bypassing the need to featurize inputs into numerical tensors) allows for direct processing of input data, which eliminates the need to transform input data entirely and contributes to the ability to process arbitrary input data. Therefore, in addition to being able to process an arbitrary set of input data (i.e., the ability to process dynamic input space), the system can also process input data in any format.

Additionally, the described techniques’ ability to bypass ‘rescaling’ a predicted value of a metric to within a set range avoids the potential problematic issues that arise when determining what the range should be and how to perform the mapping of a predicted value into the range. For example, a range may be determined using previous experiments and, if these previous experiments are not wholly representative of the possible range, the determined range may be distorted, causing misleading predicted values. As another example, if a newly predicted value of a metric lies outside of the previously determined range, scaling the newly predicted value to be the max of a previously determined range removes the information of relative difference this predicted value has with respect to other predicted values.
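
The second example can be sketched as follows. The range bounds and metric values are made up for illustration; the helper simply rescales to [0, 1] and clips, which is one common way such a mapping is performed.

```python
def rescale_and_clip(y, low, high):
    """Map y from [low, high] onto [0, 1], clipping values outside the range."""
    scaled = (y - low) / (high - low)
    return min(1.0, max(0.0, scaled))


# Hypothetical range determined from previous experiments.
low, high = 0.0, 10.0

# Two new metric values that both lie above the previously determined range.
slightly_above = rescale_and_clip(12.0, low, high)
far_above = rescale_and_clip(50.0, low, high)
```

Both values collapse to the maximum of the range, so the information that one outcome was far larger than the other is lost.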

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below.

According to a first aspect there is provided a method performed by one or more computers. The method includes receiving settings data characterizing particular values for settings for a particular experiment. The method then includes receiving metric data specifying a metric for evaluating an outcome of the particular experiment. The method then includes generating, from at least the settings data and the metric data, a particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs that each include a respective key and a respective value each represented as one or more tokens from a vocabulary of tokens. Lastly, the method includes processing the particular input sequence that represents at least the settings data and the metric data as a plurality of key-value pairs using a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment.
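
As a rough illustration of the input-sequence generation step, the sketch below flattens settings data and metric data into a single key-value string that a tokenizer could then split into tokens. The `key:value` layout, the comma separator, the `metric` key, and the function name are all assumptions made for this example; the specification does not prescribe a textual format here.

```python
def to_key_value_text(settings, metric):
    """Flatten settings and the metric name into one key-value string.

    The "key:value" layout and "," separator are illustrative assumptions.
    """
    pairs = [f"{key}:{value}" for key, value in sorted(settings.items())]
    pairs.append(f"metric:{metric}")
    return ",".join(pairs)


# Hypothetical settings for a hyperparameter tuning experiment.
settings = {"learning_rate": 0.001, "optimizer": "adam", "batch_size": 64}
input_text = to_key_value_text(settings, "test_set_mse")
```

Because values are written as text rather than packed into a fixed-size tensor, a new setting or a new category only adds another pair to the string.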

In some implementations, the method includes selecting, using the predicted value of the metric, final values for the settings for the particular experiment for use in performing the particular experiment.
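
One simple way to select final values is to score a set of candidate settings with the predictor and keep the best one. In the sketch below, `toy_predict` is a stand-in for the neural network and the candidate grid is invented for illustration; neither is part of the described system.

```python
def select_final_settings(candidates, predict, minimize=True):
    """Return the candidate settings whose predicted metric value is best."""
    best = min if minimize else max
    return best(candidates, key=predict)


def toy_predict(settings):
    """Stand-in for the trained predictor (an assumption for this sketch)."""
    return (settings["learning_rate"] - 0.01) ** 2


# Hypothetical candidate values for a single setting.
candidates = [{"learning_rate": lr} for lr in (0.001, 0.01, 0.1)]
final_settings = select_final_settings(candidates, toy_predict)
```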

In some implementations, the method further includes performing the particular experiment in accordance with the final values for the settings.

In some implementations, the particular experiment may include training a machine learning model, where the settings for the particular experiment include hyperparameters for the training.

In some implementations, the metric measures a quality of the training of the machine learning model.

In some cases, the particular experiment includes deploying a machine learning model on a set of one or more hardware devices, where the settings for the particular experiment include architecture parameters of the machine learning model, architecture parameters of the one or more hardware accelerators, or both.

In some cases, the metric measures one or more properties of a performance of the machine learning model when deployed on the set of one or more hardware devices.

In some cases, the particular experiment includes designing a hardware accelerator, where the settings for the particular experiment include architecture parameters of the hardware accelerator. In some implementations, the metric measures one or more properties of the hardware accelerator.

In some cases, the key-value pairs are represented as tokens selected from a first set of tokens from the vocabulary and the output sequence may include tokens selected from a second set of tokens from the vocabulary, where the first set of tokens is disjoint from the second set of tokens.
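
The following sketch shows one hypothetical way to keep the two token sets disjoint: inputs use character tokens, while the predicted value is written with dedicated sign, digit, and exponent tokens. The token spellings, the four-digit mantissa, and the exponent range are assumptions chosen for the example and are not specified in this passage.

```python
import string

# Hypothetical first set: character tokens used for keys and values.
INPUT_TOKENS = set(string.ascii_lowercase + string.digits + "_:,.-")

# Hypothetical second set: dedicated tokens for sign, digits, and exponent.
SIGN_TOKENS = ["<+>", "<->"]
DIGIT_TOKENS = [f"<{d}>" for d in range(10)]
EXPONENT_TOKENS = [f"<E{e}>" for e in range(-10, 11)]
OUTPUT_TOKENS = set(SIGN_TOKENS + DIGIT_TOKENS + EXPONENT_TOKENS)


def encode_value(y, digits=4):
    """Encode a float as sign, mantissa digits, and base-10 exponent tokens."""
    sign = "<+>" if y >= 0 else "<->"
    mantissa, exponent = f"{abs(y):.{digits - 1}e}".split("e")
    digit_tokens = [f"<{c}>" for c in mantissa if c.isdigit()]
    return [sign] + digit_tokens + [f"<E{int(exponent)}>"]


def decode_value(tokens):
    """Invert encode_value, up to the precision kept by the mantissa."""
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    digit_chars = [t[1] for t in tokens[1:-1]]
    mantissa = float(digit_chars[0] + "." + "".join(digit_chars[1:]))
    exponent = int(tokens[-1][2:-1])
    return sign * mantissa * 10.0 ** exponent


tokens = encode_value(0.0725)
```

With disjoint sets, a decoder restricted to the second set can only ever emit a well-formed number, whatever text appears in the input.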

In some implementations, the method further includes receiving experiment metadata characterizing the particular experiment, where the particular input sequence represents the experiment metadata as one or more key-value pairs that each include a respective key and a respective value each represented as tokens from the vocabulary of tokens.

In some cases, the experiment metadata identifies a user that is performing the particular experiment.

In some implementations, the neural network has been trained on a set of training data that includes, for each of a plurality of training experiments, a respective training input sequence for each of a different set of values for settings of the training experiment and that represents the set of values for settings of the training experiment and, for each respective training input sequence, a corresponding target output sequence that represents an actual value of a metric for the training experiment when performed in accordance with the set of values for the settings of the training experiment represented by the training input sequence.

In some cases, after the training of the neural network on the set of training data and prior to processing the particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs using the neural network, the method further includes obtaining fine-tuning data for the particular experiment that includes a respective fine-tuning input sequence for each of a different set of values for the settings of the particular experiment and that represents the set of values for settings of the particular experiment. The method then includes, for each respective fine-tuning input sequence, obtaining a corresponding target output sequence that represents an actual value of the metric for the particular experiment when performed in accordance with the set of values for the settings of the particular experiment represented by the fine-tuning input sequence. Lastly, the method includes training the neural network on the fine-tuning data. In some cases, training the neural network on the fine-tuning data includes training the neural network using an objective that measures normalized errors between predicted values of the metric and the actual values of the metric.
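
The passage does not define how the errors are normalized. As one plausible reading, offered only as an assumption, the sketch below computes a mean absolute error divided by the range of the actual metric values, so that experiments with very different metric scales contribute comparably.

```python
def normalized_mean_absolute_error(predicted, actual):
    """Mean absolute error divided by the range of the actual values.

    This normalization is an illustrative assumption, not the objective
    defined by the specification.
    """
    span = max(actual) - min(actual)
    if span == 0:
        raise ValueError("actual values must not all be equal")
    errors = [abs(p - a) / span for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Made-up predicted and actual metric values for three settings.
score = normalized_mean_absolute_error([1.0, 2.0, 3.0], [0.0, 2.0, 4.0])
```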

In some cases, processing the particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs using a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment may include processing the particular input sequence using the neural network to generate a plurality of candidate output sequences of tokens from the vocabulary that each represent a candidate predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment, and then determining a final predicted value of the metric from the candidate predicted values represented by the candidate output sequences.

In some cases, determining a final predicted value of the metric from the candidate predicted values represented by the candidate output sequences includes determining the final predicted value of the metric to be a median of the candidate predicted values represented by the candidate output sequences.
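
A minimal sketch of the median aggregation, assuming the candidate output sequences have already been decoded into floats. The candidate values are invented for illustration.

```python
import statistics


def final_prediction(candidate_values):
    """Aggregate decoded candidate predictions into one value via the median."""
    return statistics.median(candidate_values)


# Hypothetical decoded candidates; one sample is an outlier.
candidates = [0.071, 0.073, 0.072, 0.950, 0.074]
final_value = final_prediction(candidates)
```

The median is less sensitive to an occasional outlying sample than the mean would be, as the outlier in this example shows.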

In some cases, the particular input sequence does not represent any actual values of the metric for any actual trials of the experiment that have already been performed.

In some implementations, each corresponding target output sequence includes only one actual value of the metric for the particular experiment when performed in accordance with a corresponding single set of values for the settings of the particular experiment represented by the fine-tuning input sequence.

In some cases, the neural network is an encoder-decoder neural network.

In some other cases, the neural network is a decoder-only neural network.

In some cases, the neural network includes one or more self-attention layers.

In some cases, the neural network is a self-attention neural network.

According to a second aspect, there is provided the methods of the first aspect performed by one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the operations of the respective method. According to a third aspect, there is provided the methods of the first aspect performed by a computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a computer system.

FIG. 2 is a flow diagram of an example process for predicting metric values of experiments.

FIG. 3 is a flow diagram of an example process for determining a final predicted value of a metric using a plurality of candidate output sequences.

FIG. 4 is a flow diagram of an example process for training a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of a metric.

FIG. 5 is an example of the performance of the described techniques.

FIG. 6 is an example pictorial representation of training the system’s neural network and then using the system to predict metric values for experiments belonging to a variety of experiment classes.

FIG. 7 is an example of a definition of possible inputs for experiments of an experiment class and two possible experiment settings data, each represented as a plurality of key-value pairs.

FIG. 8 is an example of the performance of the described techniques.

DETAILED DESCRIPTION

FIG. 1 shows an example computer system 100. The computer system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

The system 100 predicts metric values for respective experiment outcomes using a neural network 108. That is, for a particular experiment, the system 100 can use the neural network 108 to predict a metric value associated with the particular experiment.

The particular experiment can be any of a variety of experiments, and the metric can be any of a variety of types of metrics.

For example, the particular experiment can include deploying a machine learning model on a set of one or more hardware devices (e.g., real devices, or ones simulated based on measured data from real devices), e.g., on a set of devices that includes one or more hardware accelerators, e.g., GPUs (graphics processing units), TPUs (tensor processing units), VPUs (vision processing units), FPGAs (field-programmable gate arrays) or other application-specific integrated circuits (ASICs). In this example, the settings for the particular experiment can include architecture parameters of the deployment of the machine learning model (e.g., specification of the process group over which the model is sharded, sharding strategy selection, and so on), architecture parameters of the one or more hardware accelerators (e.g., accelerator memory (e.g., ROM [read-only memory], RAM [random access memory], EEPROM [electrically erasable programmable read-only memory], flash memory [electronic non-volatile computer memory storage], for GPUs, TPUs, VPUs, FPGAs or other ASICs), core clock speed, number of cores, per-core cache size, and so on), or both.

In this example, the metric can measure one or more properties of a performance of the machine learning model when deployed on the set of one or more hardware devices, e.g., a latency, a memory footprint, a power consumption, and so on. The machine learning model can be any appropriate model, e.g., a model that performs one of the machine learning tasks described in more detail below.

As another example, the particular experiment can include designing a hardware accelerator (or other computational device), e.g., an ASIC for performing machine learning computations or other computationally intensive workloads. In this example, the settings for the particular experiment can include architecture parameters of the hardware accelerator (e.g., accelerator memory (e.g., ROM, RAM, EEPROM, flash memory, for GPUs, TPUs, VPUs, FPGAs or other ASICs), core clock speed, number of cores, per-core cache size, and so on). The designing may include generation of control data for a real-world hardware accelerator fabrication system. Following the designing process, there may be a step of fabricating the hardware accelerator according to the design (i.e., in the real world). Furthermore, there may be a step of using the fabricated hardware accelerator to process data, such as data obtained by sensors from the real world.

In this example, the metric can measure one or more properties of the hardware accelerator, e.g., an area of the hardware accelerator, a power consumption of the hardware accelerator, a latency of the hardware accelerator when performing a certain workload, and so on.

As another example, the particular experiment can include training a machine learning model, e.g., a neural network, a linear model, a support vector machine, and so on, and the settings for the particular experiment can include hyperparameters for the training.

In particular, prior to training the machine learning model, the system 100 determines optimized settings for a set of hyperparameters for the training.

The set of hyperparameters can include any of a variety of values that are not learned during the training process but that can impact the performance of the trained model, the efficiency of the training process, or both.

For example, hyperparameters can include any of:
an optimizer that is used for the training;
a learning rate that is used for the training;
a weight decay factor that is used for the training;
a batch size that is used for the training;
a depth of the neural network (i.e., the number of layers) that is being trained;
respective weights assigned to one or more of the terms in the loss function that is used for the training;
and so on.

In this example, the metric can measure a quality of the training of the machine learning model, e.g., an accuracy or a loss of the model as measured on a set of data after training. For example, the measure of performance can measure the accuracy of the trained neural network, e.g., on a validation or other held-out set. As another example, the measure of performance can also measure the efficiency of the training process of training the neural network.

In this example, the machine learning model can be trained to perform any kind of machine learning task, i.e., can be configured to receive any kind of digital data input and to generate any kind of score, classification, or regression output based on the input.

In some cases, the machine learning model is a machine learning model that is configured to perform an image processing task, i.e., receive an input image (e.g., an image of the real world captured by a camera device) and to process the intensity values of the pixels of the input image to generate a network output for the input image. For example, the task may be image classification and the output generated by the machine learning model for a given image may be scores for each of a set of object categories, with each score representing an estimated likelihood that the image contains an image of an object belonging to the category. As another example, the task can be image embedding generation and the output generated by the machine learning model can be a numeric embedding of the input image. As yet another example, the task can be object detection and the output generated by the machine learning model can identify locations in the input image at which particular types of objects are depicted. As yet another example, the task can be image segmentation and the output generated by the machine learning model can assign each pixel of the input image to a category from a set of categories.

As another example, if the inputs to the machine learning model are Internet resources (e.g., web pages), documents, or portions of documents or features extracted from Internet resources, documents, or portions of documents, the task can be to classify the resource or document, i.e., the output generated by the machine learning model for a given Internet resource, document, or portion of a document may be a score for each of a set of topics, with each score representing an estimated likelihood that the Internet resource, document, or document portion is about the topic.

As another example, if the inputs to the machine learning model are features of an impression context for a particular advertisement, the output generated by the machine learning model may be a score that represents an estimated likelihood that the particular advertisement will be clicked on. As another example, if the inputs to the machine learning model are features of a personalized recommendation for a user, e.g., features characterizing the context for the recommendation, e.g., features characterizing previous actions taken by the user, the output generated by the machine learning model may be a score for each of a set of content items, with each score representing an estimated likelihood that the user will respond favorably to being recommended the content item.

As another example, if the input to the machine learning model is a sequence of text in one language, the output generated by the machine learning model may be a score for each of a set of pieces of text in another language, with each score representing an estimated likelihood that the piece of text in the other language is a proper translation of the input text into the other language.

As another example, the task may be an audio processing task. For example, if the input to the machine learning model is a sequence representing a spoken utterance (e.g., a sound signal captured from the real world by a microphone), the output generated by the machine learning model may be a score for each of a set of pieces of text, each score representing an estimated likelihood that the piece of text is the correct transcript for the utterance. As another example, the task may be a keyword spotting task where, if the input to the machine learning model is a sequence representing a spoken utterance, the output generated by the machine learning model can indicate whether a particular word or phrase (“hotword”) was spoken in the utterance. As another example, if the input to the machine learning model is a sequence representing a spoken utterance, the output generated by the machine learning model can identify the natural language in which the utterance was spoken.

As another example, the task can be a natural language processing or understanding task, e.g., an entailment task, a paraphrase task, a textual similarity task, a sentiment task, a sentence completion task, a grammaticality task, and so on, that operates on a sequence of text in some natural language.

As another example, the task can be a text to speech task, where the input is text in a natural language or features of text in a natural language and the network output is a spectrogram or other data defining audio of the text being spoken in the natural language.

As another particular example, the task can be a text generation task, where the input is a sequence of text, and the output is another sequence of text, e.g., a completion of the input sequence of text, a response to a question posed in the input sequence, or a sequence of text that is about a topic specified by the first sequence of text. For instance, the machine learning model can be an autoregressive neural network, e.g., a self-attention based autoregressive neural network. As another example, the input to the text generation task can be an input other than text, e.g., an image, and the output sequence can be text that describes the input.

As another example, the task can be a health prediction task, where the input is electronic health record data for a patient and the output is a prediction that is relevant to the future health of the patient, e.g., a predicted treatment that should be prescribed to the patient, the likelihood that an adverse health event will occur to the patient, or a predicted diagnosis for the patient.

As another example, the task can be an agent control task, where the input is an observation characterizing the state of an environment (e.g. a real-world environment) and the output defines an action to be performed by an agent which interacts with the environment in response to the observation. The agent can be, e.g., a real-world or simulated robot, a control system for an industrial facility, or a control system that controls a different kind of agent.

As another example, the particular experiment can be the outcome of a computer simulation, e.g., outcome of a computer simulation of a real-world phenomenon, e.g., outcome of a simulated aircraft flight. For the example of a simulated aircraft flight, the settings for the particular experiment can include weather conditions, temperature, windspeeds, aircraft speed, aircraft weight, aircraft type, and so on for the computer flight simulation experiment. Any one or more of these settings may be based on a measurement of the real world at a certain time (e.g., at the time that the experiment is carried out), for example measured weather conditions (e.g., whether it is raining), measured temperatures or measured windspeeds, or measured characteristics of the aircraft (e.g., its weight including its current loading, or its measured fuel efficiency) so that the simulation is a simulation of aircraft flight in actual (e.g., current) real-world conditions.

In this example, the metric can measure that simulation experiment outcome (e.g., distance the aircraft traveled during flight, fuel efficiency of flight, and so on).

Another example of a particular experiment that is the outcome of a computer simulation of a real-world phenomenon is a computational fluid dynamics simulation (e.g., weather simulation, fluid flow simulation, combustion simulation, and so on). In this example, the settings for the particular experiment can include geometry and dimensions of the environment, material properties (e.g., material equation of state), initial conditions (e.g., initial temperature, pressure, and density distributions), and model choices for radiation transport and hydrodynamics.

In this example, the metric can measure dynamics of fluids or of solid objects in the real-world-environment, e g., the transport of mass, radiation, pressure, and heat of fluids or solids.

As another example, the particular experiment can include designing chemical compounds (e.g., protein variants, synthetic organic compounds, and so on). In this example, the settings for the particular experiment can include chemical design specifiers (e.g., amino acid sequences for proteins, 3D atomic coordinates for synthetic organic compounds, and so on). The designing may include generation of fabrication data (i.e., information of how to fabricate the chemical compound) for a real-world chemical fabrication system. Following the designing process, there may be a step of fabricating the chemical compound according to the design (i.e., in the real world). Furthermore, there may be a step of using the fabricated chemical compound to serve a physical function (e.g., using the protein to catalyze a biochemical reaction, using the synthetic organic compound to catalyze a chemical reaction, and so on).

In this example, the metric can measure chemical properties, e.g., enzymatic activity for proteins or catalytic activity for synthetic organic compounds. Other examples of chemical properties include solubility, boiling point, melting point, and so on.

As another example, the particular experiment can include designing materials (e.g., multilayered materials, e.g., organic light emitting diodes, quantum dots, thin-film transistors, and so on). In this example, the settings for the particular experiment can include material design specifiers (e.g., number of layers, type of layer (e.g., anode, cathode, emissive layer), configuration of layer (e.g., layer thickness, layer dimensions), material type for each layer (e.g., chemical composition of the layer), and so on). The designing may include generation of fabrication data (i.e., information of how to fabricate the material) for a real-world material fabrication system. Following the designing process, there may be a step of fabricating the material according to the design (i.e., in the real world). Furthermore, there may be a step of using the material to serve a physical function (e.g., using the material to perform light absorption, light emission, energy conversion, charge transport, and so on).

In this example, the metric can measure optical or electromagnetic properties, e.g., the minimum optical band gap value and current conduction efficiency. Other examples of optical or electromagnetic properties include external quantum efficiency, power efficiency, color indices, electroluminescence band-structure, and so on.

The neural network 108 can have any of a variety of neural network architectures that enables a prediction of metric values for experiment outcomes.

For example, the neural network 108 can be a language model neural network. A language model neural network is one that predicts an output sequence of tokens that follows an input sequence of tokens that is received as input to the neural network. For example, the neural network 108 can auto-regressively generate the tokens in the output sequence one after the other, with any given token in the output sequence being generated conditioned on (i) the input sequence of tokens and (ii) any tokens that precede the given token in the output sequence. The output sequence of tokens may predict a metric value for the experiment, e.g., in natural language.

For example, the neural network 108 can have an encoder-decoder architecture in which an encoder neural network processes the input sequence to generate an encoded representation of the input sequence and a decoder neural network generates the output sequence conditioned on the encoded representation.

As another example, the neural network 108 can have a decoder-only architecture in which a decoder neural network processes a combined sequence that includes the input sequence and any tokens that have already been generated from the output sequence to generate the next token in the output sequence.

In either of the above examples, the neural network 108 can be a self-attention neural network, e.g., include a self-attention encoder, a self-attention decoder, or both. More generally, the neural network 108 can include one or more self-attention layers. One example of such a neural network is an encoder-decoder Transformer neural network. Another example of such a neural network is a decoder-only Transformer neural network. As a particular example, the neural network 108 can include an encoder-decoder architecture (e.g., the T5 encoder-decoder neural networks as described by arXiv:1910.10683), or a decoder-only architecture (e.g., the Gemini decoder-only neural network as described by arXiv:2312.11805 and arXiv:2403.05530).

More specifically, the system 100 can receive settings data 102 characterizing particular values for the settings for a particular experiment.

Generally, the settings for a particular experiment can be any relevant settings to the particular experiment. In particular, the settings data can include any setting that has causal relevance to the experiment outcome in that the setting has a direct causal effect on the experiment outcome in some way.

As an example causal setting, for the particular experiment of training a machine learning model described above, the causal setting can be the number of layers included in a neural network model, which influences how well the neural network can regress on training data. The more layers present, the better the capability of the neural network to learn patterns from training data will be.

In contrast, if the training objective is fixed to be convex, an example non-causal setting can be the “order” of an iterative optimization method (e.g., using first order gradient updates or using second order Hessian updates to optimize the model parameters) because regardless of the chosen order the model parameters will converge to be the same values (due to the convexity of the objective guaranteeing the global minimum can always be reached during training regardless of the chosen order).
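The convexity argument above can be illustrated with a minimal sketch. The one-dimensional quadratic objective, the step size, and the function names below are illustrative assumptions and are not part of the described system; the sketch only shows a first order method and a second order method arriving at the same minimizer of a convex objective.

```python
# Toy convex objective f(w) = (w - 3)^2 (an assumption for illustration).
def gradient(w: float) -> float:
    return 2.0 * (w - 3.0)


HESSIAN = 2.0  # second derivative of f, constant for a quadratic


def first_order(w: float, learning_rate: float = 0.1, steps: int = 200) -> float:
    """Plain gradient descent: w <- w - learning_rate * f'(w)."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


def second_order(w: float, steps: int = 5) -> float:
    """Newton's method: w <- w - f'(w) / f''(w)."""
    for _ in range(steps):
        w -= gradient(w) / HESSIAN
    return w


w_first = first_order(10.0)
w_second = second_order(10.0)
print(w_first, w_second)  # both are (numerically) the minimizer w = 3
```

In this toy setting the chosen order changes how many updates are needed, but not the parameter value that is reached, which is why the order would not be treated as a causal setting for the final metric.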

The system can also receive metric data 104 specifying a metric for evaluating the outcome of the particular experiment.

The system can also receive experiment metadata 112 as defined below.

Generally, the metric can be any of a variety of metrics for evaluating the particular experiment, i.e., any of a variety of metrics that are a quantifiable measure of how well (or the degree to which) the experiment achieved its objective.

For example, for the particular experiment of training a machine learning model described above, the metric can be the performance of the model as measured on a set of data after training. For example, the metric can be the mean squared error, absolute squared error, accuracy, precision, recall, area under the receiver operating characteristic curve, mean cross entropy loss, and so on as measured on a set of unseen data using the model after training the model.

The system 100 can generate, from at least the settings data 102 and the metric data 104, and optionally also from the experiment metadata 112, an input sequence 106. The input sequence 106 represents the settings data 102 and the metric data 104 as a plurality of key-value pairs.

Each key-value pair includes a respective key and a respective value that are each represented as one or more tokens from a vocabulary of tokens. Thus, the input sequence 106 includes tokens that represent the keys and values in the key-value pairs and optionally additional tokens that represent separations between key-value pairs and between the key and value within a given key-value pair.

Generally, the ‘key’ of the key-value pairs represents names of the settings or metric data, and the ‘value’ of the key-value pairs represents the value corresponding to the respective key name.

For example, for the particular experiment of training a machine learning model described above, the settings data could include a learning rate scalar value of 72.5. In that case, the key can be represented as one or more tokens from a vocabulary that represent the setting name, such as {‘learning’, ‘rate’}, where the curly brackets enclose a sequence of tokens and the opening-closing pairs of single quotation marks enclose tokens. The value can also be represented as one or more tokens from the vocabulary, such as {‘7’, ‘2.5’}. In addition, for this example, the metric data could include the metric of accuracy. In that case, the key can be represented as one or more tokens from a vocabulary, e.g., {‘metric’}, and the value can also be represented as one or more tokens from the vocabulary, e.g., {‘accuracy’} or {‘accuracy’, ‘%’}.

As an example of additional tokens that represent separations between key-value pairs and between the key and value within a given key-value pair, consider the example input sequence {‘batch’, ‘size’, ‘:’, ‘128’, ‘,’, ‘kernel’, ‘:’, ‘rbf’}, where the token ‘,’ represents separation between the key-value pairs (batch size, 128) and (kernel, rbf), and the token ‘:’ represents separation between the key and value within a given key-value pair, e.g., separation of key and value for the key-value pair (batch size, 128).
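As a minimal sketch of how key-value pairs with separator tokens can be flattened into an input sequence, consider the following. The whitespace tokenizer `toy_tokenize` and the function name `build_input_sequence` are hypothetical stand-ins chosen for illustration; an actual implementation would use the tokenizer of the neural network, e.g., a sub-word tokenizer.

```python
from typing import Dict, List

PAIR_SEP = ","  # separates one key-value pair from the next
KV_SEP = ":"    # separates a key from its value


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split text on whitespace."""
    return text.split()


def build_input_sequence(pairs: Dict[str, str]) -> List[str]:
    """Flatten key-value pairs into one token list with separator tokens."""
    tokens: List[str] = []
    for index, (key, value) in enumerate(pairs.items()):
        if index > 0:
            tokens.append(PAIR_SEP)
        tokens.extend(toy_tokenize(key))
        tokens.append(KV_SEP)
        tokens.extend(toy_tokenize(str(value)))
    return tokens


sequence = build_input_sequence({"batch size": "128", "kernel": "rbf"})
print(sequence)
# ['batch', 'size', ':', '128', ',', 'kernel', ':', 'rbf']
```

Because the pairs are serialized one after another, adding or removing a setting only lengthens or shortens the sequence and does not require a fixed-width feature vector.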

The system 100 processes the particular input sequence 106 using the neural network 108 to generate an output sequence 110 of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment.

In some implementations, the key-value pairs are represented as tokens selected from a first set of tokens from the vocabulary and the output sequence 110 includes (only) tokens selected from a second set of tokens from the vocabulary, with the first set of tokens being disjoint from the second set of tokens.

For example, the vocabulary can include tokens used by a tokenizer, e.g., SentencePiece or another appropriate tokenizer, to tokenize text. These tokens can make up the first set of tokens and can be used for the key-value pairs. The vocabulary also includes another set of “custom” or “reserved” tokens (the second set of tokens) that are used to represent numeric metric values. For example, for the key-value pair (learning rate, 72.5) and an output sequence representing the value 1234.5, the first set of tokens can include tokens used by the SentencePiece tokenizer such that the key can be represented as the sequence of first set tokens {‘learning’, ‘rate’}, and the value can be represented as the sequence of first set tokens {‘7’, ‘2.5’}. The second set of tokens, in turn, can be a custom set of tokens using specific tokens to express sign, exponent, and significant digits, and the output sequence that represents the value of 1234.5 can be represented as the sequence of second set tokens {‘+’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘E-1’}, where {‘+’} represents the sign, {‘1’, ‘2’, ‘3’, ‘4’, ‘5’} represents the significant digits, and {‘E-1’} represents the exponent (e.g., the token ‘E-1’ denotes the exponent -1 for base of 10, i.e., 10⁻¹).

In some other implementations, numeric values in the key-value pairs are also represented using the second set of tokens while the keys and non-numeric values are represented using the first set of tokens.

For example, continuing with the previous example, the value 72.5 of a key-value pair (learning rate, 72.5) tokenized using SentencePiece tokens described in the previous example can instead be represented as {‘+’, ‘7’, ‘2’, ‘5’, ‘E-1’} using the previously described custom second set of tokens.

Optionally, the system 100 can then select, using the predicted value of the metric, final values for the settings for the particular experiment for use in performing the particular experiment. For example, the system 100 can search through a set of possible values of the settings for the experiment (where the set includes the particular values) to identify the values that result in the highest-performing predicted value, and then select the values that result in the highest-performing predicted value as the final values.
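The selection step above can be sketched as a search over a grid of candidate settings. In this sketch, `toy_predictor` is a hypothetical stand-in for decoding the output sequence of the neural network 108 into a number, and the grid values and the assumption that a higher predicted value is better are illustrative only.

```python
from itertools import product
from typing import Callable, Dict, List

Settings = Dict[str, float]


def toy_predictor(settings: Settings) -> float:
    """Hypothetical stand-in for the decoded model prediction (e.g., accuracy)."""
    lr_penalty = (settings["learning rate"] - 0.1) ** 2
    bs_penalty = 0.001 * abs(settings["batch size"] - 64) / 64
    return 1.0 - lr_penalty - bs_penalty


def select_final_settings(
    candidates: List[Settings],
    predict: Callable[[Settings], float],
    higher_is_better: bool = True,
) -> Settings:
    """Return the candidate whose predicted metric value performs best."""
    choose = max if higher_is_better else min
    return choose(candidates, key=predict)


grid = [
    {"learning rate": lr, "batch size": bs}
    for lr, bs in product([0.01, 0.1, 1.0], [32, 64, 128])
]
best = select_final_settings(grid, toy_predictor)
print(best)  # {'learning rate': 0.1, 'batch size': 64}
```

For a metric such as a loss or a latency, where a lower value performs better, the same sketch applies with `higher_is_better=False`.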

The system 100 can then perform the particular experiment in accordance with the final values for the settings. The particular experiment can be any of a variety of experiments, e.g., the experiments described above.

In some implementations, the system 100 can bootstrap predictions of metric values into ranking-based metrics, which can, e.g., be used downstream by evolutionary algorithms that are agnostic to the absolute scaling of the metric values. For example, the system 100 can, using the predicted value of the metric, iteratively compare, rank, and generate particular experiments to determine final values for the settings for the particular experiment for use in performing the particular experiment. For example, the system 100 can compare two or more sets of possible values of the settings for the experiment to identify the values that result in the highest-performing predicted value of the sets, then generate new candidate settings from the settings with the highest-performing predicted value, and then repeat the comparing, ranking, and generating of new candidates until a criterion is met (e.g., a maximum number of iterations is reached or a sufficient predicted metric value is found).

FIG. 2 is a flow diagram of an example process 200 for predicting metric values of experiments. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a computer system, e.g., the computer system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 200.
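A minimal sketch of such a compare, rank, and generate loop follows. The single setting "x", the Gaussian mutation, the population size, and `toy_predictor` (a stand-in for the decoded prediction of the neural network) are all assumptions made for illustration. The loop uses the predictions only to order candidates, so rescaling every prediction by a positive constant does not change the result.

```python
import random
from typing import Callable, Dict

Settings = Dict[str, float]


def toy_predictor(settings: Settings) -> float:
    """Hypothetical stand-in for a decoded prediction (higher is better)."""
    return -(settings["x"] - 2.0) ** 2


def evolve(
    predict: Callable[[Settings], float],
    rng: random.Random,
    population_size: int = 8,
    generations: int = 60,
    sigma: float = 0.3,
) -> Settings:
    """Rank candidates by predicted value, keep the top half, mutate, repeat."""
    population = [{"x": rng.uniform(-5.0, 5.0)} for _ in range(population_size)]
    for _ in range(generations):  # stopping criterion: maximum iterations
        ranked = sorted(population, key=predict, reverse=True)
        parents = ranked[: population_size // 2]
        children = [{"x": p["x"] + rng.gauss(0.0, sigma)} for p in parents]
        population = parents + children
    return max(population, key=predict)


best = evolve(toy_predictor, random.Random(0))
# Doubling every prediction leaves the ranking, and so the result, unchanged.
best_rescaled = evolve(lambda s: 2.0 * toy_predictor(s), random.Random(0))
print(best, best == best_rescaled)
```

Keeping the parents in the next population means the best candidate found so far is never discarded between iterations.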

The system receives settings data characterizing particular values for settings for a particular experiment (step 202).

As described above, generally, the settings for a particular experiment can be any relevant settings to the particular experiment.

As a particular example, as described above, if the experiment is training a machine learning model, then the settings can include hyperparameters for the machine learning model (e.g., optimizer used for training, learning rate used for training, and so on).

As another particular example, as described above, if the experiment is deploying a machine learning model on a set of one or more hardware devices or hardware accelerators (e.g., GPU, TPU, ASIC, and so on), then the settings can include architecture parameters of the machine learning model (e.g., specification of process group over which the model is sharded, sharding strategy selection, and so on), architecture parameters of the one or more hardware devices or hardware accelerators (e.g., accelerator memory (e.g., ROM, RAM, EEPROM, flash memory, for GPUs, TPUs, VPUs, FPGAs or other ASICs), core clock speed, number of cores, per core cache size, and so on), or both.

As another particular example, as described above, if the experiment is designing a hardware accelerator, then the settings can include hardware accelerator architecture parameters (e.g., accelerator memory (e.g., ROM, RAM, EEPROM, flash memory, for GPUs, TPUs, VPUs, FPGAs or other ASICs), core clock speed, number of cores, per core cache size, and so on) for the hardware accelerator.

The system receives metric data specifying a metric for evaluating an outcome of the particular experiment (step 204).

As described above, generally, the metric can be any of a variety of metrics for evaluating the particular experiment.

As a particular example, as described above, if the experiment is training a machine learning model, then the metric can include metrics of the quality of the trained machine learning model (e.g., mean squared error, accuracy, precision, recall, cross-entropy loss, cumulative gain, rank precision, and so on).

As another particular example, as described above, if the experiment is deploying a machine learning model on a set of one or more hardware devices or hardware accelerators (e.g., GPU, TPU, ASIC, and so on), then the metric can include one or more properties of a performance of the machine learning model when deployed on the set of one or more hardware devices or hardware accelerators, e.g., a latency, a memory footprint, a power consumption, and so on.

As another particular example, as described above, if the experiment is designing a hardware accelerator, then the metric can include one or more properties of the hardware accelerator, e.g., an area of the hardware accelerator, a power consumption of the hardware accelerator, a latency of the hardware accelerator when performing a certain workload, and so on.

In some implementations, in addition to receiving settings data and metric data, the system can receive experiment metadata characterizing the particular experiment.

Generally, the metadata characterizing the particular experiment can be any data associated with the particular experiment. Additionally, metadata is distinct from settings data because it generally remains fixed when the system searches for the highest performing predicted values, e.g., when the system selects final values for the settings for use in performing the particular experiment by searching through a set of possible values of the settings for the experiment to identify values associated with the highest performing predicted values.

Some examples of metadata include experiment conductor username, experiment title, experiment description, experiment objective name, and optional freeform text.

In some cases, the experiment metadata that characterizes the particular experiment does not include values that directly influence the experiment’s outcome. That is, the metadata can include values that are correlated with particular experiment outcomes, and therefore contain useful information for predicting metric values, but that do not have a causal relationship with the experiment outcome.

For example, the metadata can include values for details such as the date and time of the particular experiment, the user that is performing the particular experiment, the ID for the particular experiment, the location of the particular experiment, and so on.

As a particular example of metadata that may be useful for predicting metric values but does not have a causal relationship with the experiment outcome, consider the metadata of time of day for particular experiments from a certain experiment class that executes machine learning model inference on a set of data with a corresponding metric of total execution time. If the model inference is performed on shared compute resources that experience varying amounts of free compute resources throughout the day, e.g., maximum free resources at night and minimum free resources during the day, it may appear that the time of day influences the total execution time of the experiment outcomes, but instead the total amount of free compute resources influences the outcome. Regardless of the true relationship between time of day and experiment outcome, the information of time of day is useful for predicting the outcome of the experiment. Yet, if the system (upon request by a user) is searching for the particular experiment that results in the highest-performing predicted value, then the user will not search over the optimal time of day to execute machine learning model inference because the time of day is non-causal and is not guaranteed to hold predictive power under different contexts that are of interest to the user (e.g., executing machine learning model inference on exclusive compute resources, executing the machine learning model during holidays, and so on).

The system can receive the settings data, the metric data, the metadata, or any combination of these data from a user or another system through any of a variety of means, e.g., through a network connection, e.g., a cloud-based network, the internet, or a local network.

The system generates, from at least the settings data and the metric data, a particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs that each include a respective key and a respective value each represented as one or more tokens from a vocabulary of tokens (step 206). That is, the system generates the key-value pairs, and then the system tokenizes the key-value pairs by mapping each character, word, or sub-word of the natural language text representation of the key-value pairs to a corresponding token included in the first set of tokens.

In some cases, when the system receives experiment metadata characterizing the particular experiment, the particular input sequence additionally represents the experiment metadata as one or more key-value pairs that each include a respective key and a respective value each represented as tokens from the vocabulary of tokens.

Generally, the ‘key’ of the key-value pairs represents names of the settings data, metric data, or metadata and the ‘value’ of the key-value pairs represents the value corresponding to the respective key name.

For example, for an experiment that is training a machine learning model and includes settings that include hyperparameters for the training such as batch size, learning rate, model, optimizer and includes the metric of accuracy, the input sequence can be {‘batch’, ‘size’, ‘:’, ‘128’, ‘,’, ‘kernel’, ‘:’, ‘rbf’, ‘,’, ‘learning’, ‘rate’, ‘:’, ‘0.5’, ‘,’, ‘model’, ‘:’, ‘svm’, ‘,’, ‘optimizer’, ‘:’, ‘sgd’, ‘,’, ‘metric’, ‘:’, ‘accuracy’}, where the curly brackets enclose a sequence of tokens and the opening-closing pairs of single quotation marks enclose tokens.

Continuing with the previous example key-value pair representation of the particular input sequence, if the machine learning model training experiment is also associated with metadata that includes a title for the experiment, a user ID for who conducts the experiment, and a description of the machine learning task the machine learning model performs, the input sequence can also include the key-value pairs {‘title’, ‘:’, ‘classification’, ‘,’, ‘user’, ‘:’, ‘some-person’, ‘,’, ‘description’, ‘:’, ‘spam detection’} and the input sequence could be {‘batch’, ‘size’, ‘:’, ‘128’, ‘,’, ‘kernel’, ‘:’, ‘rbf’, ‘,’, ‘learning’, ‘rate’, ‘:’, ‘0.5’, ‘,’, ‘model’, ‘:’, ‘svm’, ‘,’, ‘optimizer’, ‘:’, ‘sgd’, ‘,’, ‘metric’, ‘:’, ‘accuracy’, ‘,’, ‘title’, ‘:’, ‘classification’, ‘,’, ‘user’, ‘:’, ‘some-person’, ‘,’, ‘description’, ‘:’, ‘spam detection’}.

Although the previous input sequence example illustrates key-value pair orderings of (1) settings data, (2) metric data, and (3) metadata, in practice the input sequence can have any ordering of key-value pairs.

Because the settings data, metric data, and, when included, the metadata can be represented as key-value pairs and not as a standardized format, the system can handle a dynamic input space (i.e., changes in the frequency and range of possible values for features included in the input data).

The system processes the particular input sequence that represents at least the settings data and the metric data as a plurality of key-value pairs using a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment (step 208).

As described above, in some implementations, the key-value pairs of the input sequence are represented as tokens selected from a first set of tokens from the vocabulary and the output sequence includes (only) tokens selected from a second set of tokens from the vocabulary, with the first set of tokens being disjoint from the second set of tokens.

For example, the first set of tokens can include tokens used by a tokenizer, e.g., WordPiece, SentencePiece, or another appropriate tokenizer to tokenize text. The second set of tokens, by contrast, can be a “custom” set of tokens.

Generally, the second set of tokens differs from the first set in that the second set of tokens defines an output space of output sequences that better aligns with the intended meaning of the output sequence of representing a predicted value of the metric. For example, the second set of tokens can include specific tokens to express sign, exponent, and significant digits. That is, the second set of tokens can include tokens to express the sign of a predicted value, tokens to represent an exponent of a predicted value (e.g., the token ‘E-1’ denotes the exponent -1 for base of 10, i.e., 10⁻¹), and tokens to represent significant digits of a predicted value. Using this example set of tokens more closely aligns the output sequence with representing a predicted value of the metric than including tokens that correspond to other unrelated values, e.g., non-numeric values, e.g., letters or special characters, because the tokens of the unrelated values would render the output sequence incoherent.

As an example of representing the input sequence as tokens selected from a first set of tokens and the output sequence as tokens selected from a second set of tokens, for the key-value pair (learning rate, 72.5) of an input sequence and an output sequence representing the value 1234.5, the first set of tokens can include tokens used by the SentencePiece tokenizer such that the key can be represented as the sequence of first set tokens {‘learning’, ‘rate’}, and the value can be represented as the sequence of first set tokens {‘7’, ‘2.5’}. The second set of tokens, in turn, can be a custom set of tokens using specific tokens to express sign, exponent, and significant digits, and the output sequence that represents the value of 1234.5 can be represented as the sequence of second set tokens {‘+’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’, ‘E-1’}.

As described above, in some implementations, numeric values in the key-value pairs are also represented using the second set of tokens while the keys and non-numeric values are represented using the first set of tokens.

For example, the value 72.5 of a key-value pair learning rate:72.5 described in the previous example can be represented as {‘+’,‘7’,‘2’,‘5’,‘E-1’} using the previously described custom second set of tokens.

The above example custom set of tokens uses the “separate sign and digit-by-digit” style (i.e., {‘+’, ‘1’, ‘2’, ‘3’, ‘4’, ‘E-1’} represents 123.4) where a token is used to express the sign (e.g., ‘+’ denotes a positive sign, ‘-’ denotes a negative sign), one or more tokens represent significant digits (i.e., ‘1’, ‘2’, ‘3’, ‘4’ denotes four significant digits 1234), and a token represents the exponent (e.g., the token ‘E-1’ denotes the exponent -1 for base of 10, i.e., 10⁻¹) in that order.
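As a minimal sketch of the “separate sign and digit-by-digit” style, the following encodes a decimal string into sign, digit, and exponent tokens and decodes the tokens back into a number. The function names and the ‘E’ prefix convention for non-negative exponents (e.g., ‘E0’, ‘E2’) are assumptions made for illustration; the sketch also keeps however many significant digits the input has, whereas an implementation may fix that number.

```python
from decimal import Decimal
from typing import List


def encode_number(value: str) -> List[str]:
    """Encode a decimal string as [sign, digit, ..., digit, exponent] tokens."""
    number = Decimal(value)
    if not number.is_finite():
        raise ValueError("only finite values can be encoded in this sketch")
    sign, digits, exponent = number.as_tuple()
    sign_token = "-" if sign else "+"
    digit_tokens = [str(d) for d in digits]
    return [sign_token, *digit_tokens, f"E{exponent}"]


def decode_tokens(tokens: List[str]) -> Decimal:
    """Invert encode_number: rebuild the value from its tokens."""
    sign_token, *digit_tokens, exponent_token = tokens
    mantissa = Decimal(int("".join(digit_tokens)))
    magnitude = mantissa.scaleb(int(exponent_token[1:]))  # mantissa * 10**exp
    return -magnitude if sign_token == "-" else magnitude


tokens = encode_number("1234.5")
print(tokens)                 # ['+', '1', '2', '3', '4', '5', 'E-1']
print(decode_tokens(tokens))  # 1234.5
```

The same two functions reproduce the other values used above, e.g., 72.5 becomes the tokens ‘+’, ‘7’, ‘2’, ‘5’, ‘E-1’.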

As another example style, the custom set of tokens can use “merged mantissa” style (i.e., {‘+1234’, ‘E-1’} represents 123.4) where a token represents the sign and significant digits (e.g., ‘+1234’ represents both a positive sign and the significant digits 1234) and a token represents the exponent (e.g., the token ‘E-1’ denotes the exponent -1 for base of 10, i.e., 10⁻¹) in that order.

As another example style, the custom set of tokens can use “exponent before mantissa” style (i.e., {‘+’, ‘E-1’, ‘1’, ‘2’, ‘3’, ‘4’} represents 123.4) where a token is used to express the sign (e.g., ‘+’ denotes a positive sign, ‘-’ denotes a negative sign), another token is used to represent the exponent (e.g., the token ‘E-1’ denotes the exponent -1 for base of 10, i.e., 10⁻¹), and one or more tokens are used to represent significant digits (i.e., ‘1’, ‘2’, ‘3’, ‘4’ denotes four significant digits 1234) in that order.
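The three styles described so far differ only in how the same sign, significant digits, and exponent are arranged into tokens, which the following sketch makes explicit. The function name `to_tokens` and the style labels passed as strings are hypothetical and chosen only for illustration.

```python
from typing import List


def to_tokens(sign: str, digits: str, exponent: int, style: str) -> List[str]:
    """Arrange a sign, a digit string, and a base-10 exponent in a given style."""
    exponent_token = f"E{exponent}"
    if style == "separate":          # sign, digit-by-digit, exponent
        return [sign, *digits, exponent_token]
    if style == "merged_mantissa":   # one token for sign and all digits
        return [sign + digits, exponent_token]
    if style == "exponent_first":    # sign, exponent, digit-by-digit
        return [sign, exponent_token, *digits]
    raise ValueError(f"unknown style: {style}")


for style in ("separate", "merged_mantissa", "exponent_first"):
    print(style, to_tokens("+", "1234", -1, style))
# separate ['+', '1', '2', '3', '4', 'E-1']
# merged_mantissa ['+1234', 'E-1']
# exponent_first ['+', 'E-1', '1', '2', '3', '4']
```

The merged style yields shorter sequences but needs one vocabulary entry per signed mantissa, while the digit-by-digit styles need only ten digit tokens plus the sign and exponent tokens.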

As another example style, the custom set of tokens can use a custom style, e.g., a floating point representation for a base B. For example, a custom style can have a token represent the sign, a token represent the sign of the exponent for base B, one or more tokens represent the exponent, and one or more tokens represent the most significant base-B digits of the mantissa. As a particular example, the number 10⁺² × 1.234 has a base of 10, a positive sign represented as the token ‘+’, a positive exponent sign represented as the token ‘+’, an exponent value that can be represented as one token ‘2’, and a mantissa of 1234 that can be represented as the tokens {‘1’, ‘2’, ‘3’, ‘4’}, and so the value 123.4 (i.e., 10⁺² × 1.234) can be represented as the sequence of tokens {‘+’, ‘+’, ‘2’, ‘1’, ‘2’, ‘3’, ‘4’}.
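A minimal sketch of such a base-B floating point tokenization is given below. It makes several assumptions that are not stated above: the mantissa is normalized to lie in [1, B), extra mantissa digits are truncated rather than rounded, the exponent magnitude is emitted as a single token, and zero is not supported. The function name `encode_base_b` is hypothetical.

```python
from fractions import Fraction
from typing import List


def encode_base_b(value: Fraction, base: int = 10, mantissa_digits: int = 4) -> List[str]:
    """Return [sign, exponent sign, exponent, mantissa digit, ...] tokens."""
    if value == 0:
        raise ValueError("zero has no normalized mantissa in this sketch")
    sign = "+" if value > 0 else "-"
    magnitude = abs(value)
    exponent = 0
    while magnitude >= base:  # normalize the mantissa into [1, base)
        magnitude /= base
        exponent += 1
    while magnitude < 1:
        magnitude *= base
        exponent -= 1
    digits = []
    for _ in range(mantissa_digits):  # peel off the leading base-B digits
        digit = int(magnitude)
        digits.append(str(digit))
        magnitude = (magnitude - digit) * base
    exponent_sign = "+" if exponent >= 0 else "-"
    return [sign, exponent_sign, str(abs(exponent)), *digits]


print(encode_base_b(Fraction("123.4")))
# ['+', '+', '2', '1', '2', '3', '4']
```

Exact rational arithmetic is used here so that the extracted digits are not affected by binary floating point rounding of the input value.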

In some implementations, the vocabulary can include “special” tokens for representing values not within a supported numeric range. For example, the system vocabulary can include a token ‘NaN’ to represent non-numbers. Such a special token can be used, e.g., for cases where the input data cannot be processed by the system.

Different custom sets of tokens can each have a different vocabulary size and, as a consequence, a different output sequence space, meaning the possible range of output sequences is constrained by the possible number of tokens in the output sequence and by the number of distinct tokens that each position of the sequence can take. The larger the space of output sequences, the greater the expressibility of the output sequence, but at the cost of increased complexity for training the neural network to generate the correct output sequence. A benefit of having choices of styles for the custom set of tokens and, therefore, of using a custom set of tokens is that the most appropriate style can be chosen based on the range of possible metric values the system needs to predict.

As described above, in some cases, the neural network the system uses to process the particular input sequence generates the output tokens of the output sequence auto-regressively, with each token of the output sequence being generated conditioned on the particular input sequence and previously generated output sequence tokens. In particular, to generate a given token of the output sequence from a vocabulary, the neural network generates a probability distribution over the tokens of the vocabulary conditioned on the particular input sequence and previously generated output sequence tokens and selects the token with the highest probability. Then the system can repeat token selection in this manner until a stopping criterion is reached, e.g., generating a pre-determined number of tokens for the output sequence or producing a token signifying the end of the output sequence.

In some implementations, instead of selecting the tokens of the sequence to be those with the highest probability at each point of the sequence, the system samples the token according to the probabilities of the tokens. That is, the system can determine the next token of the output sequence by sampling the probability distribution over tokens generated by the neural network conditioned on the particular input sequence and previously generated output sequence tokens.
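The two token selection strategies described above, highest-probability selection and sampling, can be sketched with one decoding loop. In this sketch, `toy_next_token_probs` is a hypothetical stand-in for the neural network that places most of its probability mass on a scripted answer, and the ‘<eos>’ token, the small vocabulary, and the maximum length are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence token
VOCAB = ["+", "-", "1", "2", "3", "4", "E-1", EOS]
SCRIPT = ["+", "1", "2", "3", "4", "E-1", EOS]  # tokens for the value 123.4

NextTokenFn = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def toy_next_token_probs(input_tokens: Sequence[str], prefix: Sequence[str]) -> Dict[str, float]:
    """Stand-in for the network: probability 0.9 on the next scripted token."""
    target = SCRIPT[min(len(prefix), len(SCRIPT) - 1)]
    rest = 0.1 / (len(VOCAB) - 1)
    return {token: (0.9 if token == target else rest) for token in VOCAB}


def decode(
    input_tokens: Sequence[str],
    next_token_probs: NextTokenFn,
    rng: Optional[random.Random] = None,
    max_tokens: int = 16,
) -> List[str]:
    """Select the most probable token when rng is None; otherwise sample."""
    output: List[str] = []
    for _ in range(max_tokens):  # stopping criterion: maximum length
        probs = next_token_probs(input_tokens, output)
        if rng is None:
            token = max(probs, key=probs.get)
        else:
            tokens = list(probs)
            token = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        if token == EOS:  # stopping criterion: end-of-sequence token
            break
        output.append(token)
    return output


inputs = ["metric", ":", "accuracy"]
greedy = decode(inputs, toy_next_token_probs)
sampled = decode(inputs, toy_next_token_probs, rng=random.Random(0))
print(greedy)   # ['+', '1', '2', '3', '4', 'E-1']
print(sampled)  # depends on the random draws
```

Highest-probability selection returns the same sequence on every call, whereas repeated sampling yields a spread of sequences that reflects the uncertainty of the model.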

In some cases, the system restricts the selection of tokens from the vocabulary for the output sequence to be tokens belonging to the second set of tokens. That is, the system can select tokens for the output sequence as described above but restricts the possible selections to be those tokens belonging to the second set of tokens. As described above, this second set of tokens can represent a set of custom tokens well suited for creating an output sequence representing the predicted value, e.g., any of the described second set of tokens above. Additionally, the second set of tokens can be disjoint from a first set of tokens and the union of the first and second sets of tokens represent the vocabulary.
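One way to sketch this restriction is to discard the probability assigned to every token outside the second set and renormalize what remains before selecting or sampling a token. The function name `restrict_to_allowed`, the example probabilities, and the contents of the second set below are illustrative assumptions; an implementation may equivalently mask the scores of disallowed tokens before they are normalized.

```python
from typing import Dict, Set


def restrict_to_allowed(probs: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Keep only allowed tokens and renormalize their probabilities to sum to 1."""
    kept = {token: p for token, p in probs.items() if token in allowed}
    total = sum(kept.values())
    if total <= 0.0:
        raise ValueError("no probability mass on the allowed tokens")
    return {token: p / total for token, p in kept.items()}


# Hypothetical second set: sign, digit, and exponent tokens only.
SECOND_SET = {"+", "-", *"0123456789", "E-1", "E0", "E1"}

# Hypothetical model output that leaks mass onto a first-set token.
raw = {"learning": 0.5, "+": 0.3, "1": 0.15, "E-1": 0.05}
restricted = restrict_to_allowed(raw, SECOND_SET)
next_token = max(restricted, key=restricted.get)
print(restricted, next_token)
```

With the restriction in place, every generated sequence consists only of second set tokens, regardless of how much probability the unrestricted distribution placed on first set tokens.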

In some implementations, the system determines a final predicted value by processing the particular input sequence using a neural network to generate a plurality of candidate output sequences that each represent a candidate predicted value. That is, the system can repeatedly generate candidate output sequences that represent a distribution of candidate predicted values, e.g., using the sampling method described above, and use the candidate predicted values to determine a final predicted value of the metric.
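As a minimal sketch of combining candidate predicted values into a final predicted value, the following takes the median of the finite candidates. The use of the median, the dropping of non-finite candidates, and the function name `aggregate_candidates` are assumptions made only for illustration; this passage does not fix a particular aggregation rule.

```python
import math
import statistics
from typing import List


def aggregate_candidates(candidates: List[float]) -> float:
    """Median of the finite candidate values (aggregation rule is an assumption)."""
    finite = [value for value in candidates if math.isfinite(value)]
    if not finite:
        raise ValueError("no finite candidate predicted values")
    return statistics.median(finite)


# Hypothetical decoded values from repeated sampling, with one outlier and one NaN.
samples = [0.81, 0.79, 0.80, 5.0, 0.82, float("nan")]
final_value = aggregate_candidates(samples)
print(final_value)  # 0.81
```

A median is used in this sketch because a single badly decoded sample, such as the outlier 5.0, shifts it far less than it would shift a mean.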

Further details of determining a final predicted value of a metric using a plurality of candidate output sequences are described below with reference to FIG. 3.

In some cases, the particular input sequence that the system processes does not represent any actual values of the metric for any actual trials of the experiment that have already been performed. For example, the particular input sequence is not a “few-shot input” but a “zero-shot input” for the neural network in that the input sequence does not include input data and respective metric values of experiment(s) that have already occurred in order to condition the neural network. In other words, because of the architecture of the neural network, because of the training of the neural network, because of the fine-tuning of the neural network, or some combination of the above, the neural network can accurately generate predictions even when no outcomes of trials of the experiment are available to condition the prediction.

In other cases, the particular input sequence that the system processes does represent some actual values of the metric for actual trials of the experiment that have already been performed. For example, the particular input sequence can be a “few-shot input” for the neural network in that the input sequence does include input data and respective metric values of experiment(s) that have already occurred in order to condition the neural network.

Prior to using the neural network to process the particular input sequence to generate an output sequence, the system or another system trains the neural network.

In some cases, the system trains the neural network from scratch, e.g., starting from randomly initialized values of the parameters of the neural network.

In some other cases, because the neural network 108 processes particular input sequences that include ‘keys’ of the key-value pairs representing names for values of settings data, metric data, and metadata, the system initializes the training of the neural network starting from pre-trained values of the parameters of the neural network, e.g., starting from a pre-trained neural network. That is, the system uses a neural network pre-trained, e.g., on a language modeling task. This can serve as an improved starting point to train the neural network 108 to predict metric values using input sequences that include ‘keys’ that represent names.

Further details of training the neural network are described below with reference to FIG. 4.

FIG. 3 is a flow diagram of an example process 300 for determining a final predicted value of a metric using a plurality of candidate output sequences. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a computer system, e.g., the computer system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.

The system processes the particular input sequence using the neural network to generate a plurality of candidate output sequences of tokens from the vocabulary that each represent a candidate predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment (step 302).

For example, the system can repeatedly sample output sequences by repeatedly processing the particular input sequence with the neural network. That is, when the system repeats processing the particular input sequence to generate the output sequence and the system selects tokens according to the probabilities over the tokens determined by the neural network, e.g., as described above, the system will sample different plausible output sequences because selecting tokens according to the probabilities over the tokens can result in different output sequences.

In some cases, the system can repeatedly sample output sequences by repeatedly processing the particular input sequence with the neural network through the technique of “temperature decoding”. That is, the system can repeatedly sample output sequences by iteratively selecting tokens for the output sequence according to their corresponding probabilities as described above, but the probabilities can be modified using a temperature parameter T. For example, the temperature T can modify the probability of selecting token v_k as

$$p'(v_k) = \frac{e^{p(v_k)/T}}{\sum_i e^{p(v_i)/T}}$$

where p'(v_k) represents the temperature modified probability of selecting token v_k, p(v_k) represents the original probability of selecting v_k, the index i runs over all eligible tokens for selection, and the variable T is the temperature parameter that can be set. The higher the value of T, the more equal the modified probabilities for the tokens become among each other, while the lower the value of T, the more polarized the modified probabilities for the tokens become relative to the original probabilities, with higher original probabilities becoming higher modified probabilities and lower original probabilities becoming lower modified probabilities. Therefore, the various values of T in the context of temperature decoding control the probabilistic variability of sampled output sequences, with a value of T=1.0 not modifying the original token selection probabilities, lower values of T (e.g., 0.1, 0.2, 0.5, and so on) resulting in sampled output sequences that more often closely align with a ‘highest probability selection procedure’ (i.e., the system selects each token of the output sequence according to the highest probability over the tokens of the vocabulary) and higher values of T (e.g., 1.1, 1.2, 1.5, 2.0, and so on) resulting in output sequences that more often closely align with a ‘random selection procedure’ (i.e., the system selects each token of the output sequence randomly from among the tokens of the vocabulary).

In some cases, the system performs temperature decoding as described in the example above with the temperature parameter T set to less than one (e.g., 0.1, 0.3, or 0.5) in order for sampled output sequences to more frequently be similar to an output sequence determined using the ‘highest probability selection procedure’ yet still have variability.

In other cases, the system performs temperature decoding as described in the example above with the temperature parameter T set to greater than 1.0 (e.g., 1.1, 1.5, or 2.0) in order for sampled output sequences to more frequently be similar to an output sequence determined using the ‘random selection procedure’ and have large variability.
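The effect of the temperature parameter T can be illustrated with a short sketch. The sketch assumes that the scores being divided by T are the tokens' log-probabilities (logits), which is the common convention and the one under which T=1.0 leaves the original probabilities unchanged; the function name `apply_temperature` and the three-token distribution are hypothetical.

```python
import math
import random

def apply_temperature(scores, T):
    """Softmax of scores / T.

    Assumption: `scores` are log-probabilities (or logits) of the eligible
    tokens, so that T = 1.0 reproduces the original distribution.
    """
    scaled = [s / T for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]                   # original token probabilities
log_probs = [math.log(p) for p in probs]

same = apply_temperature(log_probs, 1.0)   # unchanged
sharp = apply_temperature(log_probs, 0.5)  # closer to highest-probability selection
flat = apply_temperature(log_probs, 2.0)   # closer to random selection

# Draw five tokens from the flattened distribution with a fixed seed.
samples = random.Random(0).choices(range(len(probs)), weights=flat, k=5)
```

With T=0.5 the most likely token becomes more likely still, and with T=2.0 the three probabilities move toward one another, matching the qualitative behavior described above.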

In other cases, the system performs “regular temperature decoding”, i.e., the system performs temperature sampling as described in the example above with the temperature parameter T set to 1.0.

The system determines a final predicted value of the metric from the candidate predicted values represented by the candidate output sequences (step 304).

For example, the system can aggregate the candidate predicted values to determine the final predicted value. That is, the system can use any of a variety of aggregation methods or procedures to determine the final predicted value.

For example, the system can determine the final predicted value to be the computed mean of the plurality of candidate predicted values represented by the candidate output sequences.

As another example, the system can determine the final predicted value to be the median of the plurality of candidate predicted values represented by the candidate output sequences.

As another example, the system can determine the final predicted value to be the weighted mean of the plurality of candidate predicted values, where each candidate predicted value is weighted by the probability of sampling the respective output sequence of the candidate predicted value.

Other examples of aggregations the system can use are percentile (i.e., the system can compute the final predicted value to be an inferred percentile value, e.g., 25th percentile, 50th percentile, 75th percentile, according to the plurality of candidate predicted values), and mode or max-likelihood (i.e., the system can compute the final predicted value to be the most frequently repeated candidate predicted value or the candidate predicted value with the highest probability density value according to the distribution created by the plurality of candidate predicted values).
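The aggregation options listed above can be sketched as follows. The function name `aggregate`, its `method` strings, the linear-interpolation rule used for percentiles, and the example candidate values and weights are all assumptions chosen for illustration.

```python
import statistics

def aggregate(values, method="median", weights=None, q=0.5):
    """Reduce candidate predicted values to one final predicted value."""
    if method == "mean":
        return statistics.fmean(values)
    if method == "median":
        return statistics.median(values)
    if method == "weighted_mean":
        # weights: e.g., the probability of sampling each output sequence.
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)
    if method == "percentile":
        # Linear interpolation between neighboring order statistics.
        ordered = sorted(values)
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])
    if method == "mode":
        return statistics.mode(values)
    raise ValueError(f"unknown aggregation method: {method}")

candidates = [0.90, 0.91, 0.91, 0.93, 0.95]
sequence_probs = [0.1, 0.3, 0.3, 0.2, 0.1]

mean_value = aggregate(candidates, "mean")
median_value = aggregate(candidates, "median")
weighted_value = aggregate(candidates, "weighted_mean", weights=sequence_probs)
p75_value = aggregate(candidates, "percentile", q=0.75)
mode_value = aggregate(candidates, "mode")
```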

The ability of the system to determine a final predicted value from the candidate predicted values in a variety of ways allows the system to quantify the certainty of the final predicted value. For example, if the interquartile range computed from the 25th percentile and the 75th percentile of the candidate predicted values, using the example methods above, is narrow (i.e., the difference between the 75th percentile and the 25th percentile values is small), then the system can determine to have high confidence that the final predicted value determined using the median or mean is accurate. Conversely, if the interquartile range is large, then the system can determine to have low confidence that the final predicted value determined using the median or mean is accurate.
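A minimal sketch of the interquartile-range check follows. The threshold `max_iqr` that separates "narrow" from "large" is a hypothetical parameter, since no particular threshold is specified, and the two sets of candidate values are invented for illustration.

```python
import statistics

def interquartile_range(values):
    """75th percentile minus 25th percentile of the candidate values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def is_confident(values, max_iqr):
    """Hypothetical rule: high confidence when the IQR is at most `max_iqr`."""
    return interquartile_range(values) <= max_iqr

tight_candidates = [0.90, 0.91, 0.91, 0.93, 0.95]
spread_candidates = [0.2, 0.5, 0.9, 1.4, 2.0]

tight_iqr = interquartile_range(tight_candidates)
spread_iqr = interquartile_range(spread_candidates)
```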

FIG. 4 is a flow diagram of an example process 400 for training a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of a metric. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a computer system, e.g., the computer system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 400.

The system or another training system trains the neural network, i.e., the neural network 108, by repeatedly updating the trainable parameters of the neural network using training data. The system can train the neural network on a set of training data that includes, for each of a plurality of training experiments, a respective training input sequence for each of a different set of values for settings of the training experiment and that represents the set of values for settings of the training experiment and, for each respective training input sequence, a corresponding target output sequence that represents an actual value of a metric for the training experiment when performed in accordance with the set of values for the settings of the training experiment represented by the training input sequence. That is, the system can repeatedly perform the following described example process using training experiments to train the neural network from scratch, i.e., train from randomly initialized parameters, or to fine-tune, i.e., further train from previously determined parameters, the neural network.

In particular, the system obtains training data (step 402). The training data includes a plurality of training experiments, where each training experiment includes at least settings data and metric data of the experiment and an actual value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment.

Generally, the training experiments in the training data include a diverse set of settings and a diverse set of experiment outcome metrics across a diverse set of contexts. That is, the training experiments encompass a variety of “experiment classes”, ensuring that a variety of relationships between settings data and actual metric values are present in the training data.

In some cases, the training experiment also includes metadata characterizing the particular experiment.

For each training experiment, the system processes the training experiment settings data and metric data to generate a respective predicted value of the metric (step 404).

That is, the system generates a particular input sequence for the training experiment using the training experiment’s settings data, metric data, and potentially metadata using any of the variety of methods to generate a particular input sequence from settings data, metric data, and potentially metadata described in step 206 of FIG. 2 above. Then, the system processes the training experiment’s particular input sequence using the neural network to generate an output sequence that represents a predicted value of the metric for the training experiment using any of the variety of methods to generate output sequences from particular input sequences described in step 208 of FIG. 2 above.

The system evaluates an objective using the predictions for each training experiment (step 406). Generally, the objective is one that when optimized results in the neural network generating output sequences for training experiments that more closely represent the metric values of the training experiments.

For example, the system can evaluate the cross-entropy loss between an output sequence that represents a predicted value of the metric for a training experiment and a sequence representation of the training experiment’s actual metric value of the same format (referred to hereafter as the target output sequence) for all training experiments.

In some cases, the output sequence that represents a predicted value of the metric for a training experiment and the sequence representation of the training experiment’s actual metric value each include tokens selected from a second set of tokens that differs from a first set of tokens used to generate the particular input sequence for the training experiment, e.g., using any one of the custom sets of tokens described above as the second set of tokens and using Sentencepiece tokens for the first set of tokens. That is, the system can, e.g., restrict the selection of tokens from the vocabulary for the output sequence that represents a predicted value of the metric for a training experiment to be tokens belonging to the second set of tokens. At the same time, the system generates the sequence representation of the training experiment’s actual metric value by sequentially mapping each character, word, or sub-word of the natural language text representation of the metric value to a corresponding token included in the second set of tokens.
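To make the mapping from a metric value to a target output sequence concrete, the following sketch uses one hypothetical second set of tokens: a sign token, a fixed number of digit tokens, and a base-10 exponent token. This particular token format, the function names `encode_metric` and `decode_metric`, and the choice of three digits are assumptions for illustration and are not asserted to be the custom token sets referred to above.

```python
def encode_metric(value, digits=3):
    """Map a float to [sign, digit tokens..., exponent token].

    Hypothetical format: value ~= sign * (integer formed by the digits)
    * 10 ** exponent.
    """
    sign = "<+>" if value >= 0 else "<->"
    # Scientific notation with `digits` significant figures, e.g. "1.50e-04".
    mantissa, exponent = f"{abs(value):.{digits - 1}e}".split("e")
    digit_tokens = [f"<{d}>" for d in mantissa.replace(".", "")]
    # Shift the exponent so that the digits form an integer mantissa.
    exp_token = f"<E{int(exponent) - (digits - 1)}>"
    return [sign, *digit_tokens, exp_token]

def decode_metric(tokens):
    """Invert encode_metric, up to the rounding applied during encoding."""
    sign = 1.0 if tokens[0] == "<+>" else -1.0
    mantissa = int("".join(t[1:-1] for t in tokens[1:-1]))
    exponent = int(tokens[-1][2:-1])
    return sign * mantissa * 10.0 ** exponent

latency_tokens = encode_metric(0.00015)
accuracy_tokens = encode_metric(90.0)
```

Under this hypothetical format the leading digit and the exponent token carry most of the numerical magnitude, while later digits refine the value.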

Then the system can define the objective to be the sum or average of the cross-entropy loss over all training experiments. The cross-entropy loss for a training experiment refers to the sum of the losses between each predicted token of the training output sequence and the respective token of the target output sequence, where the loss between a predicted token of the training output sequence and the target token of the target output sequence is the negative log probability according to the neural network of selecting the predicted token to be the same as the target token.

In some cases, when the system computes the cross-entropy loss for a training experiment as the sum of the losses between each predicted token of the training output sequence and the respective token of the target output sequence, the system weights specific tokens differently. That is, the system can assign the loss for each token of the target output sequence a respective weight, with different tokens having different weights. By weighting more significant tokens (e.g., a token representing a leading digit or exponent in the target output sequence) more highly, the prediction accuracy can be improved by making the training loss more sensitive to numerical distances.
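The per-token cross-entropy loss and its weighted variant can be sketched as follows. The function name `weighted_cross_entropy`, the four-token toy vocabulary, and the specific weights are assumptions chosen for illustration.

```python
import math

def weighted_cross_entropy(token_probs, target_ids, weights=None):
    """Sum over positions of -w_t * log p_t(target token).

    token_probs: one probability distribution over the vocabulary per
        position of the output sequence.
    target_ids: id of the target token at each position.
    weights: optional per-position weights; defaults to 1.0 everywhere.
    """
    if weights is None:
        weights = [1.0] * len(target_ids)
    return sum(-w * math.log(p[t])
               for p, t, w in zip(token_probs, target_ids, weights))

# Three output positions over a toy vocabulary of four tokens.
token_probs = [
    [0.500, 0.250, 0.125, 0.125],
    [0.250, 0.500, 0.125, 0.125],
    [0.125, 0.125, 0.500, 0.250],
]
target_ids = [0, 1, 3]

plain_loss = weighted_cross_entropy(token_probs, target_ids)
# Hypothetical weighting: the first (most significant) token counts double.
weighted_loss = weighted_cross_entropy(token_probs, target_ids,
                                       weights=[2.0, 1.0, 1.0])
```

Doubling the weight of the first position increases the penalty for assigning low probability to the most significant token, which is the effect the weighting is intended to achieve.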

In some implementations, the system employs “teacher forcing” when generating the training output sequence. That is, the system generates each token of the training output sequence by processing previous tokens that include the corresponding respective tokens of the target output sequence.

The system updates the trainable parameters of the neural network to optimize the objective (step 408). The system can update the trainable parameters to optimize the objective in any of a variety of ways, e.g., using a gradient-based method, an evolutionary algorithm-based method, Bayesian optimization, etc.

For example, the system can optimize the objective using any of a variety of gradient descent techniques (e.g., batch gradient descent, stochastic gradient descent, or mini-batch gradient descent) that include the use of a backpropagation technique to estimate the gradient of the loss with respect to neural network trainable parameters and to update the learnable parameters accordingly.

Generally, the system repeats the above steps (404-408) until one or more criteria are satisfied (e.g., the system performs a pre-determined number of iterations, the updates to the trainable parameters no longer exceed a pre-determined magnitude of change, a metric regarding a validation dataset exceeds a pre-determined value, and so on).

In some cases, the system can fine-tune the neural network, i.e., further train the neural network using, e.g., new or recently obtained training experiments, prior to processing particular input sequences.

That is, after training of the neural network on the set of training data and prior to processing the particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs using a neural network, the system can obtain fine-tuning data for the particular experiment. The fine-tuning data can include a respective fine-tuning input sequence for each of a different set of values for the settings of the particular experiment that represents the set of values for settings of the particular experiment and, for each respective fine-tuning input sequence, a corresponding target output sequence that represents an actual value of the metric for the particular experiment when performed in accordance with the set of values for the settings of the particular experiment represented by the fine-tuning input sequence. Then the system can train the neural network on the fine-tuning data, e.g., by repeatedly performing steps 404-408 as described above.

Generally, each corresponding target output sequence includes only one actual value of the metric for the particular experiment when performed in accordance with a corresponding single set of values for the settings of the particular experiment represented by the fine-tuning input sequence. That is, the system does not use few-shot prompting during training and instead trains the neural network to predict metric values in a “zero-shot” manner. As an example of fine-tuning, the system can perform low-rank adaptation (LoRA), i.e., determine values for additional trainable parameters that are combined with previously trained parameters of the neural network as described in arXiv:2106.09685, to fine-tune the neural network using fine-tuning data.

In some cases, the system fine-tunes the neural network on recently obtained training experiments, e.g., recently obtained training experiments that belong to a previously unseen experiment class. For example, to adapt to an unseen experiment class the system can quickly fine-tune the neural network online over the experiment class’s corresponding training data, optionally using LoRA.
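The way LoRA combines additional trainable parameters with previously trained parameters can be sketched for a single weight matrix. Following arXiv:2106.09685, the frozen weight W0 is augmented by a low-rank product scaled by alpha / r, with the matrix B initialized to zero so that fine-tuning starts from the pre-trained behavior. The helper names, the 4x4 toy weight, and the hand-set "trained" value of B are assumptions for illustration; no gradient computation is shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_weight(w0, a, b, alpha):
    """Effective weight W0 + (alpha / r) * B @ A.

    w0: frozen pre-trained weight, shape (d_out, d_in).
    a:  trainable matrix, shape (r, d_in).
    b:  trainable matrix, shape (d_out, r), initialized to zeros.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

d = 4
w0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
a = [[0.1, 0.2, 0.3, 0.4]]        # rank r = 1
b_init = [[0.0] for _ in range(d)]

# Before fine-tuning, the adapted weight equals the pre-trained weight.
initial = lora_weight(w0, a, b_init, alpha=2.0)

# Stand-in for a fine-tuned B (hand-set here, not learned).
b_tuned = [[1.0], [0.0], [0.0], [0.0]]
adapted = lora_weight(w0, a, b_tuned, alpha=2.0)

trainable = len(a) * d + d * len(a)   # parameters in A and B
frozen = d * d                        # parameters in W0
```

Only A and B are updated during fine-tuning, so the number of trained parameters grows with the rank r rather than with the full size of the weight matrix.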

In some cases, the system trains the neural network by using an objective that measures normalized errors between predicted values of the metric and the actual values of the metric.

Generally, the system normalizes the error between a predicted value of the metric and the actual value of the metric using the corresponding experiments of an experiment class that the predicted value corresponds to. Sometimes the training data constitutes a single experiment class, but other times the training data constitutes multiple experiment classes.

As one example of the normalized error, the normalized error can be the normalized mean absolute error according to

$$\frac{1}{|S|} \sum_{(x, y) \in S} \frac{\left| \mathrm{Aggregate}(s(x)) - y \right|}{y_{\max} - y_{\min}}$$

where y refers to the actual metric value for the experiment, y_max and y_min are the maximum and minimum metric values present in the experiment class S, x represents the settings data, metric data, and sometimes metadata of the experiment, |S| refers to the number of experiments in the experiment class S, and Aggregate(s(x)) refers to a final predicted value determined using a plurality of candidate output sequences. The final predicted value using a plurality of candidate output sequences can be determined using any of a variety of methods, e.g., the methods described above in step 304 of FIG. 3.
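A minimal sketch of the normalized mean absolute error follows, assuming it is the mean absolute error over the experiments of one experiment class divided by y_max minus y_min for that class. The function name `normalized_mae` and the example values are hypothetical, and the final predicted values are taken as given rather than aggregated from sampled sequences.

```python
def normalized_mae(final_predictions, actual_values):
    """Mean absolute error divided by the range of actual metric values.

    Assumes all values belong to a single experiment class and that the
    class contains at least two distinct actual metric values.
    """
    span = max(actual_values) - min(actual_values)
    total = sum(abs(p - y) for p, y in zip(final_predictions, actual_values))
    return total / (len(actual_values) * span)

actual = [10.0, 20.0, 30.0]
predicted = [12.0, 20.0, 27.0]
error_small_scale = normalized_mae(predicted, actual)

# The same experiment class with every metric value multiplied by 1000.
error_large_scale = normalized_mae([p * 1000 for p in predicted],
                                   [y * 1000 for y in actual])
```

Rescaling every metric value in the class leaves the normalized error unchanged, which is why experiment classes with very different metric scales contribute comparably.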

Using an objective that measures normalized errors between predicted values of the metric and the actual values of the metric can prove advantageous when the ranges of actual metric values for experiments across experiment classes have vastly different scales. Using the normalized errors prevents the training of the neural network from being biased toward updating trainable parameters to correct the predictions of large-scale metric values at the expense of small-scale metric values.

In some implementations, the system trains a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of a metric by reinforcing with non-differentiable scores. For example, the system can use reinforcement learning techniques to optimize the model's performance based on feedback that does not have a gradient.

FIG. 5 is an example 500 of the performance of the described techniques.

More specifically, example 500 shows the advantage of the described techniques’ capability to exhibit transfer learning across vastly different experiment classes (i.e., training using experiments from a vast array of experiment classes) to improve the quality of predictions of metric values of experiment outcomes well beyond the quality that conventional methods of regressing can achieve.

In particular, example 500 shows graphs of the mean prediction error of metric values of experiment outcomes when varying the number of different experiment classes used in training (log scale) using various methods of regressing for two data sets (i.e., “AutoML Eval” and “BBOB Eval”). The colored horizontal lines display the mean error of conventional methods (Gaussian Process [denoted as GP], Random Forest, Tree, Multilayer Perceptron [denoted as MLP]), which only observe training data from the experiment class being evaluated, while the black line represents the described methods (denoted as LM), which use training data from multiple different experiment classes.

For the “AutoML Eval” graph, all experiment classes that make up the x-axis were “hyperparameter tuning experiment classes” using data from a collection of proprietary AutoML data for tuning user objectives. The “AutoML Eval” graph shows that the accuracy of the described methods (LM) improves as more experiment classes are used in training and that the described methods outperform all traditional baselines for the entirety of the displayed range.

For the “BBOB Eval” graph, all experiment classes that make up the x-axis were “hyperparameter tuning experiment classes” using data from the BBOB (Black-Box Optimization Benchmarking) data set, which is a suite of benchmark functions used to evaluate and compare the performance of optimization algorithms described in arXiv:1903.06396. The “BBOB Eval” graph shows that the accuracy of the described methods (LM) improves as more experiment classes are used in training and that the described methods outperform all traditional baselines after approximately 20,000 experiment classes are used for training.

FIG. 6 is an example 600 pictorial representation of training the system’s neural network and then using the system to predict metric values for experiments belonging to a variety of experimental classes. More specifically, example 600 shows the system using training data (denoted as “Offline Database” in example 600) collected from a variety of experimental classes to train a neural network (denoted as ‘LM’ in example 600). Example 600 also shows the application of the system to predict metric values for a variety of experimental classes (e.g., predicting metric values for hyperparameter tuning experiments, predicting metric values for protein design experiments, predicting metric values for hardware design experiments).

The training data (“Offline Database”) includes a plurality of training experiments (each experiment including corresponding settings data, metric data, metadata, and actual metric value for the experiment) from a variety of different experimental classes provided by users or other systems.

For example, the training data includes hyperparameter tuning training experiments (e.g., settings data that includes a learning rate value of 0.001 and a stochastic gradient descent optimizer choice, metadata that includes the experiment name “convnet on cifar10”, metric data that includes a metric choice of accuracy, and an actual metric value of 90%).

As another example, the training data includes hardware design experiments (e.g., settings data that includes a number of tiles and windows for matrix multiplication, metadata that includes the experiment name “tpu design”, metric data that includes a metric choice of latency, and an actual metric value of 0.00015 seconds).

After training, example 600 shows the system predicting metric values for experiments belonging to a variety of different experiment classes (e.g., predicting metric values for hyperparameter tuning experiments, predicting metric values for protein design experiments, predicting metric values for hardware design experiments).

Upon request from a user, for each experiment, the system can search through a set of possible values of the settings for the experiment to identify the values that result in the highest-performing predicted value, and then select the values that result in the highest-performing predicted value as the final values to provide to a user.

The above description generally describes that the system generates, from at least the settings data and the metric data, a particular input sequence of key-value pairs and then processes the particular input sequence using a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment.

In some implementations, however, the system instead processes the settings data (and, optionally, the metadata described above) using an encoder neural network to generate an encoded representation and then processes the encoded representation using a decoder neural network (a “decoder head”) to generate an output sequence of tokens that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment. The encoded representation can be, e.g., a set of one or more embedding vectors that represent the information about the experiment.

In particular, the system processes the encoded representation using the decoder neural network to auto-regressively generate an output sequence of tokens from a decoder vocabulary that includes the tokens from the second set described above, i.e., the vocabulary does not need to include any of the tokens from the first set.

The decoder neural network can generally have any appropriate architecture for auto-regressively generating an output sequence, e.g., one of the architectures described above. The decoder neural network can be conditioned on the encoded representation in any appropriate way, e.g., by including one or more cross-attention layers that cross-attend into the encoded representation or by providing the encoded representation as input to the decoder neural network directly.

Similarly, the encoder can generally have any appropriate architecture that can map the settings data and, optionally, the metadata to a set of one or more vectors, e.g., a multi-layer perceptron (MLP), a recurrent neural network (RNN), and so on.

Thus, in some implementations, rather than requiring a transformation of the settings data into a sequence of tokens, the system can instead process the settings data using a domain-appropriate encoder neural network to generate an encoded representation, and then generate the output sequence auto-regressively using a decoder neural network conditioned on the encoded representation. The system can sample the output sequence using any appropriate auto-regressive decoding technique, e.g., using beam search, using the Harrell-Davis estimator, using top-k decoding, or using increased temperature decoding.

FIG. 7 is an example 700 of a definition 702 of possible inputs for experiments of an experiment class and two possible experiment settings data 704 each represented as a plurality of key-value pairs.

In particular, the experiment class is a “hyperparameter tuning experiment class” and the definition 702 of possible inputs for experiments of this class is illustrated as a search space where any experiment settings can be defined as a Cartesian product of settings parameters. That is, any single experiment settings data is defined as a learning rate with possible values ranging as [0,1.0], a batch size with possible integer values ranging as [1,256], a model with possible values including ‘svm’ (support vector machine) or ‘mlp’ (multi-layer perceptron), and an optimizer with possible values including ‘sgd’ (stochastic gradient descent) or ‘adam’ (adaptive moment estimation). Also, every parameter can have child parameters that are only active when the corresponding parent parameter has a specific value (e.g., “beta” is active only if a parent categorical parameter selects “adam”, but not “sgd”).

The two example experiment settings data 704, labeled “Trial 1” and “Trial 2”, are each represented as a plurality of key-value pairs. Trial 1 is represented as the pairs learning rate : 0.5, batch size : 128, model : ‘svm’, kernel : ‘rbf’, and optimizer : ‘sgd’. Trial 2 is represented as the pairs learning rate : 0.2, batch size : 14, model : ‘mlp’, num layers : ‘2’, optimizer : ‘adam’, and beta : 0.6.
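The search space and the two trials of FIG. 7 can be sketched as plain data structures with a validity check for child parameters and a simple key-value serialization. The identifier spellings (e.g., `learning_rate`), the `key:value` text format, and the assumption that `kernel` and `num_layers` are child parameters of the ‘svm’ and ‘mlp’ model choices are illustrative guesses; only the `beta` and ‘adam’ relationship is stated explicitly above.

```python
# Top-level parameters: numeric ranges or categorical choices.
SEARCH_SPACE = {
    "learning_rate": ("float", 0.0, 1.0),
    "batch_size": ("int", 1, 256),
    "model": ("categorical", ["svm", "mlp"]),
    "optimizer": ("categorical", ["sgd", "adam"]),
}

# Child parameter -> (parent parameter, parent value that activates it).
CHILDREN = {
    "kernel": ("model", "svm"),        # assumed relationship
    "num_layers": ("model", "mlp"),    # assumed relationship
    "beta": ("optimizer", "adam"),
}

def is_valid(trial):
    """Check ranges, categorical choices, and child-parameter activation."""
    for name, spec in SEARCH_SPACE.items():
        if name not in trial:
            return False
        value = trial[name]
        if spec[0] == "categorical":
            if value not in spec[1]:
                return False
        elif not spec[1] <= value <= spec[2]:
            return False
    for child, (parent, required) in CHILDREN.items():
        if child in trial and trial.get(parent) != required:
            return False
    return all(key in SEARCH_SPACE or key in CHILDREN for key in trial)

def serialize(trial):
    """Render a trial as comma-separated key:value pairs."""
    return ",".join(f"{key}:{value}" for key, value in trial.items())

trial_1 = {"learning_rate": 0.5, "batch_size": 128, "model": "svm",
           "kernel": "rbf", "optimizer": "sgd"}
trial_2 = {"learning_rate": 0.2, "batch_size": 14, "model": "mlp",
           "num_layers": 2, "optimizer": "adam", "beta": 0.6}
# Invalid: beta is only active when the optimizer is adam.
invalid_trial = dict(trial_1, beta=0.6)

trial_1_text = serialize(trial_1)
```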

FIG. 8 is an example 800 of the performance of the described techniques.

More specifically, example 800 shows graphs of the performance of the described techniques for predicting real world experiment metric values for various experiments and experiment classes. Each graph corresponds to an experiment class, and the experiment classes of these experiments vary drastically with respect to different input spaces, representative of objectives tuned commonly in real world settings. These include standard machine learning (e.g. image classification and language modeling), production systems (e.g. bid simulation, LLM inference latency), and scientific research (e.g. protein and hardware design).

For each graph of example 800, the x-axis represents the ground truth metric value over varying experiments and the y-axis represents the described techniques’ predictions. Each scatter point represents an experiment, and the closer the scatter points lie to the diagonal, the closer the predicted metric values are to the true metric values.

Example 800 shows the described techniques can be used to predict metric values of various experiments for various experiment classes well.

In this specification, the term "configured" is used in relation to computing systems and environments, as well as computer program components. A computing system or environment is considered "configured" to perform specific operations or actions when it possesses the necessary software, firmware, hardware, or a combination thereof, enabling it to carry out those operations or actions during operation. For instance, configuring a system might involve installing a software library with specific algorithms, updating firmware with new instructions for handling data, or adding a hardware component for enhanced processing capabilities. Similarly, one or more computer programs are "configured" to perform particular operations or actions when they contain instructions that, upon execution by a computing device or hardware, cause the device to perform those intended operations or actions.

The embodiments and functional operations described in this specification can be implemented in various forms, including digital electronic circuitry, software, firmware, computer hardware (encompassing the disclosed structures and their structural equivalents), or any combination thereof. The subject matter can be realized as one or more computer programs, essentially modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by or to control the operation of a computing device or hardware. The storage medium can be a storage device such as a hard drive or solid-state drive (SSD), a storage medium, a random or serial access memory device, or a combination of these. Additionally or alternatively, the program instructions can be encoded on a transmitted signal, such as a machine-generated electrical, optical, or electromagnetic signal, designed to carry information for transmission to a receiving device or system for execution by a computing device or hardware. Furthermore, implementations may leverage emerging technologies like quantum computing or neuromorphic computing for specific applications, and may be deployed in distributed or cloud-based environments where components reside on different machines or within a cloud infrastructure.

The term "computing device or hardware" refers to the physical components involved in data processing and encompasses all types of devices and machines used for this purpose. Examples include processors or processing units, computers, multiple processors or computers working together, graphics processing units (GPUs), tensor processing units (TPUs), and specialized processing hardware such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). In addition to hardware, a computing device or hardware may also include code that creates an execution environment for computer programs. This code can take the form of processor firmware, a protocol stack, a database management system, an operating system, or a combination of these elements. Embodiments may particularly benefit from utilizing the parallel processing capabilities of GPUs, in a General-Purpose computing on Graphics Processing Units (GPGPU) context, where code specifically designed for GPU execution, often called kernels or shaders, is employed. Similarly, TPUs excel at running optimized tensor operations crucial for many machine learning algorithms. By leveraging these accelerators and their specialized programming models, the system can achieve significant speedups and efficiency gains for tasks involving artificial intelligence and machine learning, particularly in areas such as computer vision, natural language processing, and robotics.

A computer program, also referred to as software, an application, a module, a script, code, or simply a program, can be written in any programming language, including compiled or interpreted languages, and declarative or procedural languages. It can be deployed in various forms, such as a standalone program, a module, a component, a subroutine, or any other unit suitable for use within a computing environment. A program may or may not correspond to a single file in a file system and can be stored in various ways. This includes being embedded within a file containing other programs or data (e.g., scripts within a markup language document), residing in a dedicated file, or distributed across multiple coordinated files (e.g., files storing modules, subprograms, or code segments). A computer program can be executed on a single computer or across multiple computers, whether located at a single site or distributed across multiple sites and interconnected through a data communication network. The specific implementation of the computer programs may involve a combination of traditional programming languages and specialized languages or libraries designed for GPGPU programming or TPU utilization, depending on the chosen hardware platform and desired performance characteristics.

In this specification, the term "engine" broadly refers to a software-based system, subsystem, or process designed to perform one or more specific functions. An engine is typically implemented as one or more software modules or components installed on one or more computers, which can be located at a single site or distributed across multiple locations. In some instances, one or more dedicated computers may be used for a particular engine, while in other cases, multiple engines may operate concurrently on the same one or more computers. Examples of engine functions within the context of AI and machine learning could include data pre-processing and cleaning, feature engineering and extraction, model training and optimization, inference and prediction generation, and post-processing of results. The specific design and implementation of engines will depend on the overall architecture and the distribution of computational tasks across various hardware components, including CPUs, GPUs, TPUs, and other specialized processors.

The processes and logic flows described in this specification can be executed by one or more programmable computers running one or more computer programs to perform functions by operating on input data and generating output. Additionally, graphics processing units (GPUs) and tensor processing units (TPUs) can be utilized to enable concurrent execution of aspects of these processes and logic flows, significantly accelerating performance. This approach offers significant advantages for computationally intensive tasks often found in AI and machine learning applications, such as matrix multiplications, convolutions, and other operations that exhibit a high degree of parallelism. By leveraging the parallel processing capabilities of GPUs and TPUs, significant speedups and efficiency gains compared to relying solely on CPUs can be achieved. Alternatively or in combination with programmable computers and specialized processors, these processes and logic flows can also be implemented using specialized processing hardware, such as field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs), for even greater performance or energy efficiency in specific use cases.

Computers capable of executing a computer program can be based on general-purpose microprocessors, special-purpose microprocessors, or a combination of both. They can also utilize any other type of central processing unit (CPU). Additionally, graphics processing units (GPUs), tensor processing units (TPUs), and other machine learning accelerators can be employed to enhance performance, particularly for tasks involving artificial intelligence and machine learning. These accelerators often work in conjunction with CPUs, handling specialized computations while the CPU manages overall system operations and other tasks. Typically, a CPU receives instructions and data from read-only memory (ROM), random access memory (RAM), or both. The elements of a computer include a CPU for executing instructions and one or more memory devices for storing instructions and data. The specific configuration of processing units and memory will depend on factors like the complexity of the AI model, the volume of data being processed, and the desired performance and latency requirements. Embodiments can be implemented on a wide range of computing platforms, from small embedded devices with limited resources to large-scale data center systems with high-performance computing capabilities. The system may include storage devices like hard drives, SSDs, or flash memory for persistent data storage.

Computer-readable media suitable for storing computer program instructions and data encompass all forms of non-volatile memory, media, and memory devices. Examples include semiconductor memory devices such as read-only memory (ROM), solid-state drives (SSDs), and flash memory devices; hard disk drives (HDDs); and optical media such as CDs, DVDs, and Blu-ray discs. The specific type of computer-readable media used will depend on factors such as the size of the data, access speed requirements, cost considerations, and the desired level of portability or permanence.

To facilitate user interaction, embodiments of the subject matter described in this specification can be implemented on a computing device equipped with a display device, such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display, for presenting information to the user. Input can be provided by the user through various means, including a keyboard, touchscreens, voice commands, gesture recognition, or other input modalities depending on the specific device and application. Additional input methods can include acoustic, speech, or tactile input, while feedback to the user can take the form of visual, auditory, or tactile feedback. Furthermore, computers can interact with users by exchanging documents with a user's device or application. This can involve sending web content or data in response to requests or sending and receiving text messages or other forms of messages through mobile devices or messaging platforms. The selection of input and output modalities will depend on the specific application and the desired form of user interaction. Machine learning models can be implemented and deployed using machine learning frameworks, such as TensorFlow or JAX. These frameworks offer comprehensive tools and libraries that facilitate the development, training, and deployment of machine learning models.

Embodiments of the subject matter described in this specification can be implemented within a computing system comprising one or more components, depending on the specific application and requirements. These may include a back-end component, such as a back-end server or cloud-based infrastructure; an optional middleware component, such as a middleware server or application programming interface (API), to facilitate communication and data exchange; and a front-end component, such as a client device with a user interface, a web browser, or an app, through which a user can interact with the implemented subject matter. For instance, the described functionality could be implemented solely on a client device (e.g., for on-device machine learning) or deployed as a combination of front-end and back-end components for more complex applications. These components, when present, can be interconnected using any form or medium of digital data communication, such as a communication network like a local area network (LAN) or a wide area network (WAN) including the Internet. The specific system architecture and choice of components will depend on factors such as the scale of the application, the need for real-time processing, data security requirements, and the desired user experience.

The computing system can include clients and servers that may be geographically separated and interact through a communication network. The specific type of network, such as a local area network (LAN), a wide area network (WAN), or the Internet, will depend on the reach and scale of the application. The client-server relationship is established through computer programs running on the respective computers and designed to communicate with each other using appropriate protocols. These protocols may include HTTP, TCP/IP, or other specialized protocols depending on the nature of the data being exchanged and the security requirements of the system. In certain embodiments, a server transmits data or instructions to a user's device, such as a computer, smartphone, or tablet, acting as a client. The client device can then process the received information, display results to the user, and potentially send data or feedback back to the server for further processing or storage. This allows for dynamic interactions between the user and the system, enabling a wide range of applications and functionalities.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims

What is claimed is:

1. A method performed by one or more computers, the method comprising: receiving settings data characterizing particular values for settings for a particular experiment; receiving metric data specifying a metric for evaluating an outcome of the particular experiment; generating, from at least the settings data and the metric data, a particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs that each comprise a respective key and a respective value each represented as one or more tokens from a vocabulary of tokens; and processing the particular input sequence that represents at least the settings data and the metric data as a plurality of key-value pairs using a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment.

2. The method of claim 1, further comprising: selecting, using the predicted value of the metric, final values for the settings for the particular experiment for use in performing the particular experiment.

3. The method of claim 2, further comprising: performing the particular experiment in accordance with the final values for the settings.

4. The method of any one of claims 1-3, wherein the particular experiment comprises training a machine learning model, and wherein the settings for the particular experiment comprise hyperparameters for the training.

5. The method of claim 4, wherein the metric measures a quality of the training of the machine learning model.

6. The method of any one of claims 1-3, wherein the particular experiment comprises deploying a machine learning model on a set of one or more hardware devices, and wherein the settings for the particular experiment comprise architecture parameters of the machine learning model, architecture parameters of the one or more hardware accelerators, or both.

7. The method of claim 6, wherein the metric measures one or more properties of a performance of the machine learning model when deployed on the set of one or more hardware devices.

8. The method of any one of claims 1-3, wherein the particular experiment comprises designing a hardware accelerator, and wherein the settings for the particular experiment comprise architecture parameters of the hardware accelerator.

9. The method of claim 8, wherein the metric measures one or more properties of the hardware accelerator.

10. The method of any preceding claim, wherein the key-value pairs are represented as tokens selected from a first set of tokens from the vocabulary and the output sequence comprises tokens selected from a second set of tokens from the vocabulary, and wherein the first set of tokens is disjoint from the second set of tokens.

11. The method of any preceding claim, further comprising: receiving experiment metadata characterizing the particular experiment, wherein the particular input sequence represents the experiment metadata as one or more key-value pairs that each comprise a respective key and a respective value each represented as tokens from the vocabulary of tokens.

12. The method of claim 11, wherein the experiment metadata identifies a user that is performing the particular experiment.

13. The method of any preceding claim, wherein the neural network has been trained on a set of training data that comprises, for each of a plurality of training experiments, a respective training input sequence for each of a different set of values for settings of the training experiment and that represents the set of values for settings of the training experiment and, for each respective training input sequence, a corresponding target output sequence that represents an actual value of a metric for the training experiment when performed in accordance with the set of values for the settings of the training experiment represented by the training input sequence.

14. The method of claim 13, further comprising: after the training of the neural network on the set of training data and prior to processing the particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs using a neural network: obtaining fine-tuning data for the particular experiment that comprises a respective fine-tuning input sequence for each of a different set of values for the settings of the particular experiment and that represents the set of values for settings of the particular experiment and, for each respective fine-tuning input sequence, a corresponding target output sequence that represents an actual value of the metric for the particular experiment when performed in accordance with the set of values for the settings of the particular experiment represented by the fine-tuning input sequence; and training the neural network on the fine-tuning data.

15. The method of claim 14, wherein training the neural network on the fine-tuning data comprises training the neural network using an objective that measures normalized errors between predicted values of the metric and the actual values of the metric.

16. The method of any preceding claim, wherein processing the particular input sequence that represents the settings data and the metric data as a plurality of key-value pairs using a neural network to generate an output sequence of tokens from the vocabulary that represents a predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment comprises: processing the particular input sequence using the neural network to generate a plurality of candidate output sequences of tokens from the vocabulary that each represent a candidate predicted value of the metric if the particular experiment is performed in accordance with the particular values for the settings for the particular experiment; and determining a final predicted value of the metric from the candidate predicted values represented by the candidate output sequences.

17. The method of claim 16, wherein determining a final predicted value of the metric from the candidate predicted values represented by the candidate output sequences comprises: determining a final predicted value of the metric to be a median of the candidate predicted values represented by the candidate output sequences.

18. The method of any preceding claim, wherein the particular input sequence does not represent any actual values of the metric for any actual trials of the experiment that have already been performed.

19. The method of claim 18 when dependent on claim 14, wherein each corresponding target output sequence includes only one actual value of the metric for the particular experiment when performed in accordance with a corresponding single set of values for the settings of the particular experiment represented by the fine-tuning input sequence.

20. The method of any preceding claim, wherein the neural network is an encoder-decoder neural network.

21. The method of any one of claims 1-19, wherein the neural network is a decoder-only neural network.

22. The method of any preceding claim, wherein the neural network includes one or more self-attention layers.

23. The method of any preceding claim, wherein the neural network is a self-attention neural network.

24. A system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the operations of the respective method of any one of claims 1-23.

25. A computer storage medium encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the respective method of any one of claims 1-23.
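By way of illustration only, and without limiting the claims, the following is a minimal sketch of one way the operations of claims 1, 16, and 17 could be arranged. The textual `key:value` layout, the function names `serialize_experiment` and `predict_metric`, and the `sample_fn` callable (a stand-in for a neural network that samples one candidate output sequence and decodes it to a number) are all assumptions made for this example. Tokenization and the neural network itself are omitted.

```python
from statistics import median
from typing import Callable, Mapping, Optional


def serialize_experiment(
    settings: Mapping[str, object],
    metric: str,
    metadata: Optional[Mapping[str, object]] = None,
) -> str:
    """Represent settings, optional metadata, and the metric as key-value pairs.

    The comma-separated "key:value" layout is a hypothetical format; a real
    system would map this text to tokens from its vocabulary.
    """
    pairs = [f"{key}:{value}" for key, value in settings.items()]
    if metadata:
        pairs.extend(f"{key}:{value}" for key, value in metadata.items())
    pairs.append(f"metric:{metric}")
    return ",".join(pairs)


def predict_metric(
    input_sequence: str,
    sample_fn: Callable[[str], float],
    num_samples: int = 8,
) -> float:
    """Draw several candidate predictions and return their median.

    sample_fn is a hypothetical stand-in for sampling one candidate output
    sequence from the neural network and decoding it into a metric value.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    candidates = [sample_fn(input_sequence) for _ in range(num_samples)]
    return median(candidates)
```

In this sketch, using the median rather than the mean makes the final predicted value less sensitive to an occasional outlying candidate.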