EP4147173A1 - Variational auto encoder for mixed data types - Google Patents
Variational auto encoder for mixed data types
- Publication number
- EP4147173A1 (application EP21721364.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- latent
- data
- feature
- vae
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS / G06—COMPUTING OR CALCULATING; COUNTING / G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS / G06N3/00—Computing arrangements based on biological models / G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology: G06N3/045—Combinations of networks; G06N3/0455—Auto-encoder networks; Encoder-decoder networks; G06N3/047—Probabilistic or stochastic networks; G06N3/0475—Generative networks; G06N3/0499—Feedforward networks
- G06N3/08—Learning methods: G06N3/084—Backpropagation, e.g. using gradient descent; G06N3/088—Non-supervised learning, e.g. competitive learning; G06N3/09—Supervised learning; G06N3/091—Active learning
Definitions
- Neural networks are used in the field of machine learning and artificial intelligence (AI).
- a neural network comprises a plurality of nodes which are interconnected by links, sometimes referred to as edges.
- the input edges of one or more nodes form the input of the network as a whole, and the output edges of one or more other nodes form the output of the network as a whole, whilst the output edges of various nodes within the network form the input edges to other nodes.
- Each node represents a function of its input edge(s) weighted by a respective weight, the result being output on its output edge(s).
- the weights can be gradually tuned based on a set of experience data (training data) so as to tend towards a state where the network will output a desired value for a given input.
- the nodes are arranged into layers with at least an input and an output layer.
- a “deep” neural network comprises one or more intermediate or “hidden” layers in between the input layer and the output layer.
- the neural network can take input data and propagate the input data through the layers of the network to generate output data. Certain nodes within the network perform operations on the data, and the result of those operations is passed to other nodes, and so on.
- FIG. 1A gives a simplified representation of an example neural network 101 by way of illustration.
- the example neural network comprises multiple layers of nodes 104: an input layer 102i, one or more hidden layers 102h and an output layer 102o. In practice, there may be many nodes in each layer, but for simplicity only a few are illustrated.
- Each node 104 is configured to generate an output by carrying out a function on the values input to that node.
- the inputs to one or more nodes form the input of the neural network, the outputs of some nodes form the inputs to other nodes, and the outputs of one or more nodes form the output of the network.
- a weight may define the connectivity between a node in a given layer and the nodes in the next layer of the neural network.
- a weight can take the form of a single scalar value or can be modelled as a probabilistic distribution.
- where the weights are defined by a distribution, as in a Bayesian model, the neural network can be fully probabilistic and capture the concept of uncertainty.
- the values of the connections 106 between nodes may also be modelled as distributions. This is illustrated schematically in Figure 1B.
- the distributions may be represented in the form of a set of samples or a set of parameters parameterizing the distribution (e.g. the mean μ and standard deviation σ or variance σ²).
- the network learns by operating on data input at the input layer, and adjusting the weights applied by some or all of the nodes based on the input data.
- each node takes into account the back propagated error and produces a revised set of weights. In this way, the network can be trained to perform its desired operation.
- the input to the network is typically a vector, each element of the vector representing a different corresponding feature.
- the elements of this feature vector may represent different pixel values, or in a medical application the different features may represent different symptoms or patient questionnaire responses.
- the output of the network may be a scalar or a vector.
- the output may represent a classification, e.g. an indication of whether a certain object such as an elephant is recognized in the image, or a diagnosis of the patient in the medical example.
- Figure 1C shows a simple arrangement in which a neural network is arranged to predict a classification based on an input feature vector.
- experience data comprising a large number of input data points X is supplied to the neural network, each data point comprising an example set of values for the feature vector, labelled with a respective corresponding value of the classification Y.
- the classification Y could be a single scalar value (e.g. representing elephant or not elephant), or a vector (e.g. a one-hot vector whose elements represent different possible classification results such as elephant, hippopotamus, rhinoceros, etc.).
- the possible classification values could be binary or could be soft-values representing a percentage probability.
- the learning algorithm tunes the weights to reduce the overall error between the labelled classification and the classification predicted by the network. Once trained with a suitable number of data points, an unlabelled feature vector can then be input to the neural network, and the network can instead predict the value of the classification based on the input feature values and the tuned weights.
- Training in this manner is sometimes referred to as a supervised approach.
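- By way of a purely illustrative sketch (not part of the disclosure), the following shows how such a supervised classifier might be trained with back-propagation, assuming PyTorch; the layer sizes, feature vector X and labels Y are hypothetical placeholders.

```python
# Purely illustrative sketch of supervised training with back-propagation.
# Sizes and data are hypothetical placeholders, not taken from the disclosure.
import torch
import torch.nn as nn

model = nn.Sequential(              # input layer -> hidden layer -> output layer
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 3),               # three possible classification values
)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(64, 10)             # 64 labelled data points, 10 features each
Y = torch.randint(0, 3, (64,))      # the labelled classifications

for _ in range(100):                # tune the weights over many passes
    optimiser.zero_grad()
    loss = loss_fn(model(X), Y)     # error between predicted and labelled class
    loss.backward()                 # back-propagate the error
    optimiser.step()                # revise the weights
```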
- Other approaches are also possible, such as a reinforcement approach wherein each data point is not initially labelled. The learning algorithm begins by guessing the corresponding output for each point, and is then told whether it was correct, gradually tuning the weights with each such piece of feedback.
- Another example is an unsupervised approach where input data points are not labelled at all and the learning algorithm is instead left to infer its own structure in the experience data.
- the term “training” herein does not necessarily limit to a supervised, reinforcement or unsupervised approach.
- a machine learning model can also be formed from more than one constituent neural network.
- An example of this is an auto encoder, as illustrated by way of example in Figures 4A-D.
- an encoder network is arranged to encode an observed input vector X 0 into a latent vector Z, and a decoder network is arranged to decode the latent vector back into the real-world feature space of the input vector.
- the difference between the actual input vector X 0 and the version of the input vector X predicted by the decoder is used to tune the weights of the encoder and decoder so as to minimize a measure of overall difference, e.g. based on an evidence lower bound (ELBO) function.
- the latent vector Z can be thought of as a compressed form of the information in the input feature space.
- each element of the latent vector Z is modelled as a probabilistic or statistical distribution such as a Gaussian.
- the encoder learns one or more parameters of the distribution, e.g. a measure of centre point and spread of the distribution. For instance the centre point could be the mean and the spread could be the variance or standard deviation.
- the value of the element input to the decoder is then randomly sampled from the learned distribution.
- the encoder is sometimes referred to as an inference network in that it infers the latent vector Z from an input observation X 0 .
- the decoder is sometimes referred to as a generative network in that it generates a version X of the input feature space from the latent vector Z.
- the auto encoder can be used to impute missing values from a subsequently observed feature vector X 0.
- a third network can be trained to predict a classification Y from the latent vector, and then once trained, used to predict the classification of a subsequent, unlabelled observation.
- one or more of the features in the input feature space may be categorical values (e.g. a yes/no answer to a questionnaire, or gender) whilst one or more others may be continuous numerical values (e.g. height, or weight). Contrast for example with the case of image recognition where all the input features may represent pixel values.
- the performance of any imputation or prediction performed based on the latent vector depends on the dimensionality of the latent space. In other words, the more elements (i.e. the greater the number of dimensions) included in the latent vector, the better the performance (where performance may be measured in terms of accuracy of prediction compared to a known ground truth in some test data).
- the limiting factor on a conventional VAE is not the size of latent vector, but rather the mixed nature of the data types. It is identified herein that in such cases, increasing the latent size will not improve the performance significantly.
- however, the computational complexity, in terms of both training and prediction or imputation, will continue to scale with the dimensionality of the latent space (the number of elements in the latent vector Z) even if increasing the dimensionality is no longer increasing performance.
- conventional VAEs are not making efficient use of the computational complexity incurred.
- a method comprising a first and a second stage.
- the method comprises training each of a plurality of individual first variational auto encoders, VAEs, each comprising an individual respective first encoder arranged to encode a respective subset of one or more features of a feature space into an individual respective first latent representation having one or more dimensions, and an individual respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different types of data.
- the method comprises training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into decoded versions of the first latent representations, wherein each respective one of the plurality of inputs comprises a combination of a different respective one of feature subsets with the respective first latent representation.
- a second encoder and decoder can then be trained in a subsequent stage to encode into a second latent space and decode back to the individual first latent values, and thus learn the dependencies between the different data types.
- This two-stage approach, including a stage of separation between the different types of data, provides improved performance when handling mixed data.
- in a conventional VAE, the dimensionality of the latent space is simply the dimensionality of the single latent vector Z between encoder and decoder.
- in the presently disclosed model, the dimensionality is the sum of the dimensionality of the second latent representation (the number of elements in the second latent vector) plus the dimensionalities of each of the first latent representations (in embodiments one element each).
- the dimensionality may be represented as dim(H) + D, where dim(H) is the number of elements in the second latent vector H, and D is the number of features or feature subsets.
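- As a purely illustrative worked example (the numbers are hypothetical, not taken from the disclosure): with D = 20 feature subsets, each having a one-dimensional first latent variable, and a second latent vector H with dim(H) = 10, the overall latent dimensionality is dim(H) + D = 10 + 20 = 30, whereas a conventional VAE whose single latent vector Z has the same dimensionality as H would have a latent dimensionality of only 10.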
- an issue with a vanilla VAE is that under mixed type data, it cannot make use of the latent space very efficiently.
- since the disclosed method has a two-stage structure, it will actually have a larger latent size if H has the same dimensionality as Z.
- the increase of latent size in the disclosed model gives a significant boost compared with a vanilla VAE. So the latent space and training procedure are designed to make use of the latent space much more efficiently.
- Figure 1A is a schematic illustration of a neural network
- Figure IB is a schematic illustration of a node of a Bayesian neural network
- Figure 1C is a schematic illustration of a neural network arranged to predict a classification based on an input feature vector
- Figure 2 is a schematic illustration of a computing apparatus for implementing a neural network
- Figure 3 schematically illustrates a data set comprising a plurality of data points each comprising one or more feature values
- FIG. 4A is a schematic illustration of a variational auto encoder (VAE),
- Figure 4B is another schematic representation of a VAE
- Figure 4C is a high-level schematic representation of a VAE
- Figure 4D is a high-level schematic representation of a VAE
- Figure 5A schematically illustrates a first stage of training a machine learning model in accordance with embodiments disclosed herein
- Figure 5B schematically illustrates a second stage of training a machine learning model in accordance with embodiments disclosed herein,
- Figure 5C is a high-level schematic representation of the knowledge model of figures 5A and 5B,
- Figure 5D illustrates a variant of the decoder in the model of Figures 5A and 5B
- Figure 5E illustrates use of the model to predict a classification
- Figure 6 illustrates a partial inference network for imputing missing values
- Figure 7A shows pair plots of 3-dimensional data for a ground truth
- Figure 7B shows pair plots of 3-dimensional data generated using a model
- Figure 7C shows pair plots of 3-dimensional data generated using another model
- Figure 7D shows pair plots of 3-dimensional data generated using another model
- Figure 7E shows pair plots of 3-dimensional data generated using another model
- Figure 7F shows pair plots of 3-dimensional data generated using another model
- Figure 8 shows (a)-(e) some information curves of sequential active information acquisition for some example scenarios and (f) a corresponding area under information curve (AUIC) comparison, and
- Figure 9 is a flow chart of an overall method in accordance with the presently disclosed techniques.
- Deep generative models often perform poorly in real-world applications due to the heterogeneity of natural data sets. Heterogeneity arises from having different types of features (e.g. categorical, continuous, etc.), each with their own marginal properties which can be drastically different. “Marginal” refers to the distribution of different possible values of the feature versus the number of samples, disregarding co-dependency with other features. In other words, the shape of the distribution for different types of feature can be quite different.
- the types of data may include for example: categorical (the value of the feature takes one of a plurality of non-numerical categories), ordinal (integer numerical values) and/or continuous (continuous numerical values). A VAE will try to optimize all the likelihood functions all at once.
- FIG. 7 (d) shows an example in which a vanilla VAE fits some of the categorical variables but performs poorly on the continuous ones.
- to address this there is disclosed herein a variational auto-encoder for heterogeneous mixed type data, referred to as VAEM.
- VAEM employs a deep generative model for the heterogeneous mixed type data.
- VAEM may be extended to handle missing data, perform conditional data generation, and employ algorithms that enable it to be used for efficient sequential active information acquisition. It will be shown herein that VAEM obtains strong performance for conditional data generation as well as sequential active information acquisition in cases where VAEs perform poorly.
- FIG. 2 illustrates an example computing apparatus 200 for implementing an artificial intelligence (AI) algorithm including a machine-learning (ML) model in accordance with embodiments described herein.
- the computing apparatus 200 may comprise one or more user terminals, such as a desktop computer, laptop computer, tablet, smartphone, wearable smart device such as a smart watch, or an on-board computer of a vehicle such as car, etc. Additionally or alternatively, the computing apparatus 200 may comprise a server.
- a server herein refers to a logical entity which may comprise one or more physical server units located at one or more geographic sites. Where required, distributed or “cloud” computing techniques are in themselves known in the art.
- the one or more user terminals and/or the one or more server units of the server may be connected to one another via a packet-switched network, which may comprise for example a wide-area internetwork such as the Internet, a mobile cellular network such as a 3GPP network, a wired local area network (LAN) such as an Ethernet network, or a wireless LAN such as a Wi-Fi, Thread or 6LoWPAN network.
- the computing apparatus 200 comprises a controller 202, an interface 204, and an artificial intelligence (AI) algorithm 206.
- the controller 202 is operatively coupled to each of the interface 204 and the AI algorithm 206.
- Each of the controller 202, interface 204 and AI algorithm 206 may be implemented in the form of software code embodied on computer readable storage and run on processing apparatus comprising one or more processors such as CPUs, work accelerator co-processors such as GPUs, and/or other application specific processors, implemented on one or more computer terminals or units at one or more geographic sites.
- the storage on which the code is stored may comprise one or more memory devices employing one or more memory media (e.g. electronic or magnetic media), again implemented on one or more computer terminals or units at one or more geographic sites.
- one, some or all the controller 202, interface 204 and AI algorithm 206 may be implemented on the server.
- a respective instance of one, some or all of these components may be implemented in part or even wholly on each of one, some or all of the one or more user terminals.
- the functionality of the above- mentioned components may be split between any combination of the user terminals and the server.
- distributed computing techniques are in themselves known in the art. It is also not excluded that one or more of these components may be implemented in dedicated hardware.
- the controller 202 comprises a control function for coordinating the functionality of the interface 204 and the AI algorithm 206.
- the interface 204 refers to the functionality for receiving and/or outputting data.
- the interface 204 may comprise a user interface (UI) for receiving and/or outputting data to and/or from one or more users, respectively; or it may comprise an interface to one or more other, external devices which may provide an interface to one or more users.
- the interface may be arranged to collect data from and/or output data to an automated function or equipment implemented on the same apparatus and/or one or more external devices, e.g. from sensor devices such as industrial sensor devices or IoT devices.
- the interface 204 may comprise a wired or wireless interface for communicating, via a wired or wireless connection respectively, with the external device.
- the interface 204 may comprise one or more constituent types of interface, such as voice interface, and/or a graphical user interface.
- the interface 204 is thus arranged to gather observations (i.e. observed values) of various features of an input feature space. It may for example be arranged to collect inputs entered by one or more users via a UI front end, e.g. microphone, touch screen, etc.; or to automatically collect data from unmanned devices such as sensor devices.
- the logic of the interface may be implemented on a server, and arranged to collect data from one or more external devices such as user devices or sensor devices. Alternatively some or all of the logic of the interface 204 may be implemented on the user device(s) or sensor device(s) themselves.
- the controller 202 is configured to control the AI algorithm 206 to perform operations in accordance with the embodiments described herein. It will be understood that any of the operations disclosed herein may be performed by the AI algorithm 206, under control of the controller 202 to collect experience data from the user and/or an automated process via the interface 204, pass it to the AI algorithm 206, receive predictions back from the AI algorithm and output the predictions to the user and/or automated process through the interface 204.
- the machine learning (ML) algorithm 206 comprises a machine-learning model 208, comprising one or more constituent neural networks 101.
- a machine-learning model 208 such as this may also be referred to as a knowledge model.
- the machine learning algorithm 206 also comprises a learning function 209 arranged to tune the weights w of the nodes 104 of the neural network(s) 101 of the machine-learning model 208 according to a learning process, e.g. training based on a set of training data.
- FIG. 1A illustrates the principle behind a neural network.
- a neural network 101 comprises a graph of interconnected nodes 104 and edges 106 connecting between nodes, all implemented in software.
- Each node 104 has one or more input edges and one or more output edges, with at least some of the nodes 104 having multiple input edges per node, and at least some of the nodes 104 having multiple output edges per node.
- the input edges of one or more of the nodes 104 form the overall input 108i to the graph (typically an input vector, i.e. there are multiple input edges).
- the output edges of one or more of the nodes 104 form the overall output 108o of the graph (which may be an output vector in the case where there are multiple output edges). Further, the output edges of at least some of the nodes 104 form the input edges of at least some others of the nodes 104.
- Each node 104 represents a function of the input value(s) received on its input edge(s) 106i, the outputs of the function being output on the output edge(s) 106o of the respective node 104, such that the value(s) output on the output edge(s) 106o of the node 104 depend on the respective input value(s) according to the respective function.
- the function of each node 104 is also parametrized by one or more respective parameters w, sometimes also referred to as weights (not necessarily weights in the sense of multiplicative weights, though that is certainly one possibility).
- Each weight could simply be a scalar value.
- the respective weight may be modelled as a probabilistic distribution such as a Gaussian.
- the neural network 101 is sometimes referred to as a Bayesian neural network.
- the value input/output on each of some or all of the edges 106 may each also be modelled as a respective probabilistic distribution.
- the distribution may be modelled in terms of a set of samples of the distribution, or a set of parameters parameterizing the respective distribution, e.g. a pair of parameters specifying its centre point and width (e.g. in terms of its mean μ and standard deviation σ or variance σ²).
- the value of the edge or weight may be a random sample from the distribution.
- the learning of the weights may comprise tuning one or more of the parameters of each distribution.
- the nodes 104 of the neural network 101 may be arranged into a plurality of layers, each layer comprising one or more nodes 104.
- the neural network 101 comprises an input layer 102i comprising one or more input nodes 104i, one or more hidden layers 102h (also referred to as inner layers) each comprising one or more hidden nodes 104h (or inner nodes), and an output layer 102o comprising one or more output nodes 104o.
- the different weights of the various nodes 104 in the neural network 101 can be gradually tuned based on a set of experience data (training data), so as to tend towards a state where the output 108o of the network will produce a desired value for a given input 108i.
- training comprises inputting experience data in the form of training data to the inputs 108i of the graph and then tuning the weights w of the nodes 104 based on feedback from the output(s) 108o of the graph.
- the training data comprises multiple different input data points, each comprising a value or vector of values corresponding to the input edge or edges 108i of the graph 101.
- each element of the feature vector X may represent a respective pixel value. For instance one element represents the red channel for pixel (0,0); another element represents the green channel for pixel (0,0); another element represents the blue channel of pixel (0,0); another element represents the red channel of pixel (0,1); and so forth.
- each of the elements of the feature vector may represent a value of a different symptom of the subject, physical feature of the subject, or other fact about the subject (e.g. body temperature, blood pressure, etc.).
- Each data point i comprises a respective set of values of the feature vector (where x_id is the value of the d-th feature in the i-th data point).
- the input feature vector X represents the input observations for a given data point, where in general any given observation i may or may not comprise a complete set of values for all the elements of the feature vector X.
- the classification Yi represents a corresponding classification of the observation i. In the training data an observed value of classification Yi is specified with each data point along with the observed values of the feature vector elements (the input data points in the training data are said to be “labelled” with the classification Yi).
- the classification Y is predicted by the neural network 101 for a further input observation X.
- the classification Y could be a scalar or a vector.
- Y could be a single binary value representing either elephant or not elephant, or a soft value representing a probability or confidence that the image comprises an image of an elephant.
- Y could be a single binary value representing whether the subject has the condition or not, or a soft value representing a probability or confidence that the subject has the condition in question.
- Y could comprise a “1-hot” vector, where each element represents a different animal or condition.
- the true value of Yi for each data point i is known.
- the AI algorithm 206 measures the resulting output value(s) at the output edge or edges 108o of the graph, and uses this feedback to gradually tune the different weights w of the various nodes 104 so that, over many observed data points, the weights tend towards values which make the output(s) 108o (Y) of the graph 101 as close as possible to the actual observed value(s) in the experience data across the training inputs (for some measure of overall error).
- the predetermined training output is compared with the actual observed output of the graph 108o.
- This comparison provides the feedback which, over many pieces of training data, is used to gradually tune the weights of the various nodes 104 in the graph toward a state whereby the actual output 108o of the graph will closely match the desired or expected output for a given input 108i.
- feedback techniques include for instance stochastic back-propagation.
- the neural network 101 can then be used to infer a value of the output 108o (Y) for a given value of the input vector 108i (X), or vice versa.
- Explicit training based on labelled training data is sometimes referred to as a supervised approach.
- Other approaches to machine learning are also possible.
- another example is the reinforcement approach.
- the neural network 101 begins making predictions of the classification Yi for each data point i, at first with little or no accuracy.
- the AI algorithm 206 receives feedback (e.g. from a human) as to whether the prediction was correct, and uses this to tune the weights so as to perform better next time.
- the AI algorithm receives no labelling or feedback and instead is left to infer its own structure in the experienced input data.
- Figure 1C is a simple example of the use of a neural network 101.
- the machine-learning model 208 may comprise a structure of two or more constituent neural networks 101.
- FIG 4A schematically illustrates one such example, known as a variational auto encoder (VAE).
- the machine learning model 208 comprises an encoder 208q comprising an inference network, and a decoder 208p comprising a generative network.
- Each of the inference networks and the generative networks comprises one or more constituent neural networks 101, such as discussed in relation to Figure 1A.
- An inference network for the present purposes means a neural network arranged to encode an input into a latent representation of that input, and a generative network means a neural network arranged to at least partially decode from a latent representation.
- the encoder 208q is arranged to receive the observed feature vector X 0 as an input and encode it into a latent vector Z (a representation in a latent space).
- the decoder 208p is arranged to receive the latent vector Z and decode back to the original feature space of the feature vector.
- the version of the feature vector output by the decoder 208p may be labelled herein X.
- the latent vector Z is a compressed (i.e. encoded) representation of the information contained in the input observations X 0 .
- No one element of the latent vector Z necessarily represents directly any real world quantity, but the vector Z as a whole represents the information in the input data in compressed form. It could be considered conceptually to represent abstract features abstracted from the input data X 0 , such as “wrinklyness” and “trunk-like-ness” in the example of elephant recognition (though no one element of the latent vector Z can necessarily be mapped onto any one such factor, and rather the latent vector Z as a whole encodes such abstract information).
- the decoder 208p is arranged to decode the latent vector Z back into values in a real-world feature space, i.e. back to an uncompressed form X representing the actual observed properties (e.g. pixel values).
- the decoded feature vector X has the same number of elements representing the same respective features as the input vector X 0 .
- weights w of the inference network (encoder) 208q are labelled herein φ, whilst the weights w of the generative network (decoder) 208p are labelled θ.
- Each node 104 applies its own respective weight as illustrated in Figure 4.
- the learning function 209 tunes the weights φ and θ so that the VAE 208 learns to encode the feature vector X into the latent space Z and back again.
- in embodiments, this may be done by minimizing a measure of divergence between the approximate posterior q_φ(Z_i|X_i) produced by the encoder and the distribution p_θ(X_i|Z_i) produced by the decoder, where “|” means “given”.
- the model is trained to reconstruct X_i and therefore maintains a distribution over X_i. At the “input side”, the value of X_oi is known, and at the “output side”, the likelihood of X_i under the output distribution of the model is evaluated.
- p(z|x) is referred to as the posterior, q(z|x) as the approximate posterior, and p(z) and q(z) are referred to as priors.
- in embodiments this may be done by minimizing the Kullback-Leibler (KL) divergence between q_φ(Z_i|X_i) and p_θ(X_i|Z_i).
- the minimization may be performed using an optimization function such as an ELBO (evidence lower bound) function, which uses cost function minimization based on gradient descent.
- An ELBO function may be referred to herein by way of example, but this is not limiting and other metrics and functions are also known in the art for tuning the encoder and decoder networks of a VAE.
- This is the general principle of an autoencoder.
- the latent vector Z is subject to an additional constraint that it follows a predetermined form of probabilistic distribution such as a multidimensional Gaussian distribution or gamma distribution.
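- As a minimal, non-limiting sketch of this general principle (assuming PyTorch; the network sizes and the Gaussian-style reconstruction term are illustrative choices, not mandated by the disclosure), a VAE trained with an ELBO-style objective might look as follows, with the encoder outputting a mean and log-variance for each element of Z and a KL term regularising Z towards a standard Gaussian prior:

```python
# Minimal illustrative VAE: Gaussian latent Z, trained on a negative-ELBO loss
# (reconstruction error plus KL divergence from a standard Gaussian prior).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=12, z_dim=4, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, z_dim)        # mean of q(Z|X)
        self.enc_logvar = nn.Linear(hidden, z_dim)    # log-variance of q(Z|X)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # random sample of Z
        return self.dec(z), mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    recon = ((x - x_hat) ** 2).sum(dim=1)                          # Gaussian-style reconstruction term
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)  # KL(q(Z|X) || N(0, I))
    return (recon + kl).mean()
```

- In this sketch the reparameterisation mu + exp(0.5*logvar)*noise is what allows the random sampling of Z to be back-propagated through when tuning the weights.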
- Figure 4B shows a more abstracted representation of a VAE such as shown in Figure 4A.
- Figure 4C shows an even higher level representation of a VAE such as that shown in Figures 4A and 4B.
- the solid lines represent a generative network of the decoder 208p,
- whilst the dashed lines represent an inference network of the encoder 208q.
- a vector shown without a circle represents a fixed point. So in the illustrated example, the weights Q of the generative network are modelled as simple values, not distributions (though that is a possibility as well).
- a vector outside the plate is global, i.e. it does not scale with the number of data points i (nor the number of features d in the feature vector).
- the rounded rectangle labelled D represents that the feature vector X comprises multiple elements x_1 ... x_d.
- VAE 208 can be used for a practical purpose.
- One use is, once the VAE has been trained, to generate a new, unobserved instance of the feature vector X by inputting a random or unobserved value of the latent vector Z into the decoder 208p.
- e.g. if the feature space of X represents the pixels of an image, and the VAE has been trained to encode and decode human faces, then by inputting a random value of Z into the decoder 208p it is possible to generate a new face that did not belong to any of the sampled subjects during training. E.g. this could be used to generate a fictional character for a movie or video game.
- Another use is to impute missing values.
- another instance of an input vector X 0 may be input to the encoder 208q with missing values. I.e. no observed value of one or more (but not all) of the elements of the feature vector X 0 .
- the values of these elements may be set to zero, or 50%, or some other predetermined value representing “no observation.”
- the corresponding element(s) in the decoded version of the feature vector X can then be read out from the decoder 208p in order to impute the missing value(s).
- the VAE may also be trained using some data points that have missing values of some features.
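- Purely as an illustration of this imputation use (reusing the hypothetical VAE sketched above; the placeholder value and the choice of missing feature are arbitrary), missing elements can be filled with a placeholder, passed through the encoder and decoder, and the imputed values read back out:

```python
# Illustrative imputation with a trained VAE; 'vae' is assumed to be a trained
# instance of the hypothetical VAE class sketched above (12 features).
import torch

x_obs = torch.randn(1, 12)                    # an observed feature vector
missing = torch.zeros(1, 12, dtype=torch.bool)
missing[0, 3] = True                          # suppose feature 3 was not observed
x_in = x_obs.clone()
x_in[missing] = 0.0                           # placeholder value meaning "no observation"
with torch.no_grad():
    x_hat, _, _ = vae(x_in)                   # encode then decode
imputed = x_hat[missing]                      # read out the imputed element(s)
```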
- a further decoder 208pY is arranged to decode the latent vector Z into a classification Y, which could be a single element or a vector comprising multiple elements (e.g. a one-hot vector).
- each input data point (each observation of X 0 ) is labelled with an observed value of the classification Y, and the further decoder 208pY is thus trained to decode the latent vector Z into the classification Y.
- this can then be used to input an unlabelled feature vector X 0 and have the decoder 208pY generate a prediction of the classification Y for the observed feature vector X 0 .
- An improved method of forming a machine learning model 208’, in accordance with embodiments disclosed herein, is now described with reference to Figures 5A-5E. The method disclosed herein is particularly suited to handling mixed types of data.
- This machine learning (ML) model 208’ can be used in place of a standard VAE in the apparatus 200 of Figure 2, for example, in order to make predictions or perform imputations.
- the model is trained in two stages. In a first stage, an individual VAE is trained for each of the individual features or feature types, without one influencing another. In a second stage, a further VAE is then trained to learn the inter-feature dependencies.
- a vanilla VAE uses multiple likelihood functions.
- an issue with a vanilla VAE is that it tries to optimize all the likelihood functions all at once.
- some likelihood functions may have larger values, hence the VAE will pay attention to a particular likelihood function and ignore others.
- the disclosed method works to optimize all the likelihood functions separately, so that it mitigates this issue.
- each subset comprises a different respective one or more of the features of the feature space. I.e. each subset is a different one or more of the elements of the feature vector X 0 .
- the different subsets of features comprise features of different data types. For instance, the types may comprise two or more of: categorical, ordinal, or continuous.
- Categorical refers to data whose value takes one of a discrete number of categories. An example of this could be gender, or a response to a question with a discrete number of qualitative answers. In some cases categorical data could be divided into two types: binary categorical and non-binary categorical. E.g. an example of binary data would be answers to a yes/no question, or smoker/non-smoker. An example of non-binary data could be gender, e.g. male, female or other; or town or country of residence, etc. An example of ordinal data would be age measured in completed years, or a response to a question giving a ranking on a scale of 1 to 10, or one to five stars, or such like. An example of continuous data would be weight or height. It will be appreciated that these different types of data have very different statistical properties.
- each subset X od is only a single respective feature.
- one feature X 0l could be gender, another feature X o2 could be age, whilst another feature X o3 could be weight (such as in an example for predicting or imputing a medical condition of a user).
- features of the same type could be grouped together into the subset trained by one of the individual VAEs.
- one subset X 0l could consist of categorical variables
- another subset X o2 could consist of ordinal variables
- another subset X o3 could consist of continuous variables.
- X_o1 is encoded into Z_1 and then decoded into X̂_1, whilst X_o2 is encoded into Z_2 and then decoded into X̂_2, and X_o3 is encoded into Z_3 and then decoded into X̂_3 (and so forth if there are more than three feature subsets).
- each of the latent representations Z_d is one-dimensional, i.e. consists of only a single latent variable (element). Note however this does not imply the latent variable Z_d is modelled only as a simple, fixed scalar value. Rather, as the auto encoder is a variational auto-encoder, then for each latent variable Z_d the encoder learns a statistical or probabilistic distribution, and the value input to the decoder is a random sample from the distribution. This means that for each individual element of latent space, the encoder learns one or more parameters of the respective distribution, e.g. a measure of centre point and spread of the distribution.
- each latent variable Z_d (a single dimension) may be modelled in the encoder by a respective mean value μ_d and standard deviation σ_d or variance σ_d².
- the possibility of a multi-dimensional Z d is also not excluded (in which case each dimension is modelled by one or more parameters of a respective distribution), though this would increase computational complexity, and generally the idea of a latent representation is that it compresses the information from the input feature space into a lower dimensionality.
- each individual VAE is trained (i.e. has its weights tuned) by the learning function 209 (e.g. an ELBO function) to minimize a measure of difference between the respective observed feature subset X_od and the respective decoded version of that feature subset X_d.
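- A non-authoritative sketch of this first stage is given below (assuming PyTorch and reusing the hypothetical VAE class and negative_elbo function from the earlier sketch; for brevity every feature uses the same Gaussian-style reconstruction term, whereas in practice each data type would use its own likelihood):

```python
# Illustrative stage one: one small VAE per feature subset, each with a
# one-dimensional latent variable Z_d, trained independently of the others.
# Reuses the hypothetical VAE class and negative_elbo function sketched earlier.
import torch

X = torch.randn(500, 3)                  # hypothetical data: 3 features (e.g. gender, age, weight)
marginal_vaes, optimisers = [], []
for d in range(X.shape[1]):
    vae_d = VAE(x_dim=1, z_dim=1)        # 1-D latent per feature subset
    marginal_vaes.append(vae_d)
    optimisers.append(torch.optim.Adam(vae_d.parameters(), lr=1e-3))

for epoch in range(200):
    for d, (vae_d, opt) in enumerate(zip(marginal_vaes, optimisers)):
        x_d = X[:, d:d + 1]                               # the d-th feature subset X_od
        x_hat, mu, logvar = vae_d(x_d)
        loss = negative_elbo(x_d, x_hat, mu, logvar)      # each marginal VAE has its own objective
        opt.zero_grad(); loss.backward(); opt.step()
```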
- the second stage employs a second VAE comprising a second encoder 208qH and a second decoder 208pH.
- the second stage of the method comprises training this second VAE.
- each of the feature subsets X od is combined with its respective latent vector Z d (using the values of Z d learned using the first VAE in the first stage).
- this combination comprises concatenating each feature subset X od with its respective latent vector Z d .
- any function which combines the information of the two could be used, e.g. a multiplication, or interleaving, etc. Whatever function is used, each such combination forms one of the inputs of the second encoder 208qH.
- the second encoder 208qH is arranged to encode these inputs into a second latent representation in the form of a latent vector H, having multiple dimensions (with each dimension, i.e. each element of the vector, being modelled as a respective distribution, so represented in terms of one or more parameters of the respective distribution, e.g. a respective mean and variance or standard deviation). H is also referred to later as h (in the vector form), not to be confused with h(·) the function.
- the weights of the first encoders 208qd are represented by φ, the weights of the first decoders 208pd are represented by θ, the weights of the second encoder 208qH are represented by λ, and the weights of the second decoder 208pH are represented by ψ.
- the model thus learns first to disentangle the different data types, and then to learn the effect of the dependencies between the data types.
- However, an issue with a vanilla VAE is that under mixed type data it cannot make use of the latent space very efficiently. Hence increasing the size of the latent space will not help. On the contrary, by disentangling the different feature types in the first learning stage, in the presently disclosed approach the increase of latent size improves performance compared with the vanilla VAE. So the latent space and training procedure are designed to make use of the latent space much more efficiently.
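- The second stage might be sketched as follows (again illustrative only: the DependencyVAE class, the sizes, and the use of the posterior means as the Z_d values are assumptions for the sketch, and the stage-one networks are simply held fixed here):

```python
# Illustrative stage two: a second VAE whose encoder takes the concatenation of
# each feature subset with its (fixed) first-stage latent value, and whose
# decoder reconstructs the first-stage latent values Z_d from the latent vector H.
import torch
import torch.nn as nn

class DependencyVAE(nn.Module):
    def __init__(self, in_dim, z_out_dim, h_dim=8, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, h_dim)
        self.enc_logvar = nn.Linear(hidden, h_dim)
        self.dec = nn.Sequential(nn.Linear(h_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, z_out_dim))       # decodes back to the Z_d's

    def forward(self, xz):
        s = self.enc(xz)
        mu, logvar = self.enc_mu(s), self.enc_logvar(s)
        h = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)      # sample of H
        return self.dec(h), mu, logvar

with torch.no_grad():                        # stage-one encoders are held fixed here
    Z = torch.cat([vae_d.enc_mu(vae_d.enc(X[:, d:d + 1]))
                   for d, vae_d in enumerate(marginal_vaes)], dim=1)

xz = torch.cat([X, Z], dim=1)                # combine each X_od with its Z_d
dep_vae = DependencyVAE(in_dim=xz.shape[1], z_out_dim=Z.shape[1])
opt = torch.optim.Adam(dep_vae.parameters(), lr=1e-3)
for epoch in range(200):
    z_hat, mu, logvar = dep_vae(xz)
    loss = negative_elbo(Z, z_hat, mu, logvar)   # compare decoded Z's with the stage-one Z's
    opt.zero_grad(); loss.backward(); opt.step()
```

- Note how, in this sketch, each input row concatenates the feature subsets with their first-stage latent values, and the second decoder is trained to reconstruct the Z_d's, so that H captures the dependencies between them.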
- the model 208’ can be used to make predictions or perform imputations in an analogous manner to that described in relation to Figures 4A-4D.
- a value of the second latent vector H is input to the second decoder which decodes to the first latent vector Z, and then each element Zi, Z2, Z3... of the first latent vector Z is decoded by its respective individual first decoder 208pl, 208p2, 208p3...
- a random or unobserved value of the latent vector H can be input to the second decoder 208pH in order to generate a new instance of the feature vector X that was not observed in the training data.
- Figure 5E illustrates another example, where a third decoder network 208pY is trained during the second training stage to decode the second latent vector H into a classification Y.
- each data point (each instance of the input feature vector Xo) is labelled with a corresponding observed value of the classification Y.
- the learning function 209 trains the third decoder 208pY (i.e. tunes its weights) so as to minimize a measure of difference between the labelled classification and the predicted classification.
- an unlabelled input feature vector Xo can then be input into the second encoder 208qH of the model 208’, in order to generate a corresponding prediction of the classification Y.
- the model 208’ can be used to impute missing values in the input feature vector Xo.
- a subsequent observed instance of the feature vector Xo may be input to the second encoder 208qH, wherein this instance of the feature vector Xo which has some (but not all) of the features (i.e. elements) of the feature vector missing (i.e. unobserved).
- the missing elements may be set to zero, 50% or some other predetermined value representing “no observation”.
- the value(s) of the corresponding features (i.e. same elements) of the feature space can then be read out from the decoded version X of the feature vector, and taken as imputed values of the missing observations.
- the model 208’ may also be trained using some data points that have one or more missing values.
- FIG. 6 shows an example structure of the second encoder 208qH which can be used to improve the imputation of missing features.
- each function h(·) represents an individual constituent neural network, each taking a respective input (N.B. h(·) as a function is not to be confused with h the latent vector discussed later, also called H above).
- Each input is a combination (e.g. multiplication) of a respective embedding e with a respective value v, where v is either X_d or Z_d.
- Preferably as many values of X and Z are input as available, during training and/or imputation.
- X_d and/or Z_d may be missing for some feature d (if X_d is missing then Z_d will also be missing).
- the corresponding inputs are simply omitted (there is no need to replace them with an input of a predetermined value like 0 or 50%). This is possible because of the use of the permutation invariant operator g(·), discussed shortly.
- Each value v is combined with its respective embedding, e.g. by multiplication or concatenation, etc. In embodiments multiplication is used here, but it could be any operator that combines the information from the two.
- the embedding e is the coordinate of the respective input - it tells the encoder which element is being input at that input. E.g. this could be a coordinate of a pixel or an index of the feature d.
- Each individual neural network h(·) outputs a vector. These vectors are combined by a permutation invariant operator g(·), such as a summation.
- a permutation invariant operator is an operator which outputs a value (in this case a vector) which depends on the values of the inputs to the operator but which is independent of the order of those inputs. Furthermore, this output vector is of a fixed size regardless of the number of inputs to the operator. This means that g(·) can supply a vector c of a given format regardless of which inputs are present and which are not, and the order in which they are supplied. This enables the encoder 208qH to handle missing inputs.
- the encoder 208qH comprises a further, common neural network f(·) which is common to all of the inputs v.
- the output c of the permutation invariant operator g(·) is supplied to the input of this further neural network f(·).
- this neural network encodes the output c of g(·) into the second latent vector H (also labelled h, as a vector rather than a function, in the later working).
- the further neural network f(·) is used, rather than just using c directly, because the number of observed features is not fixed. Therefore a common function f is preferably first applied to all observed features.
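- A possible sketch of such a partial, permutation-invariant encoder is shown below (assuming PyTorch; for brevity a single h(·) network is shared across all inputs, whereas the description above allows an individual network per input, and all names and sizes are hypothetical):

```python
# Illustrative permutation-invariant partial encoder: only the observed inputs
# are supplied; missing inputs are simply omitted rather than filled in.
import torch
import torch.nn as nn

class PartialEncoder(nn.Module):
    def __init__(self, num_inputs, embed_dim=16, set_dim=32, h_dim=8):
        super().__init__()
        self.embed = nn.Embedding(num_inputs, embed_dim)        # e: coordinate of each input
        self.h_net = nn.Sequential(nn.Linear(embed_dim, set_dim), nn.ReLU())  # h(.), shared here for brevity
        self.f_net = nn.Sequential(nn.Linear(set_dim, set_dim), nn.ReLU())    # f(.), common network
        self.mu = nn.Linear(set_dim, h_dim)
        self.logvar = nn.Linear(set_dim, h_dim)

    def forward(self, values, indices):
        # values: observed X_d / Z_d values; indices: their feature coordinates.
        e = self.embed(indices)
        per_input = self.h_net(e * values.unsqueeze(-1))   # combine value with embedding, then h(.)
        c = per_input.sum(dim=0)                           # g(.): permutation-invariant summation
        s = self.f_net(c)                                  # f(.): applied to the aggregate c
        return self.mu(s), self.logvar(s)                  # parameters of the second latent vector H

enc = PartialEncoder(num_inputs=6)
vals = torch.tensor([0.3, -1.2, 0.7])       # only three of six inputs happen to be observed
idx = torch.tensor([0, 2, 5])               # the missing inputs are simply not supplied
mu_H, logvar_H = enc(vals, idx)
```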
- a reward function Ri may be used to determine which observation to make next following the first and second stages of training of the model 208’ .
- the reward function is a function of the observations obtained so far, and represents the amount of new information that would be added by observing a given missing input. By determining which currently missing feature maximizes the reward function (or equivalently minimizes a cost function), this determines which of the unobserved inputs would be the most informative input to collect next. It represents the fact that some inputs have a greater dependency on one another than others, so the input that is least correlated with the other, already-observed inputs will provide the most new information.
- the reward function is evaluated for a plurality of different candidate unobserved features, and the feature which maximises the reward (or minimizes the cost) will be the feature that gives the most new information by being observed next.
- the model 208’ may then undergo another cycle of the first and second training stage, now incorporating the new observation.
- the new observation could be used to improve the quality of a prediction, or simply be used by a human analyst such as a doctor in conjunction with the result (e.g. classification Y or an imputed missing feature X_d) of the already-trained model 208’.
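- The disclosure does not tie the reward to a single formula; purely as a hypothetical illustration, an information-style score could be approximated by sampling plausible values of each candidate missing feature from the trained model and measuring how much the prediction of the target would change (impute and predict_target below are placeholder callables, not APIs defined by the disclosure):

```python
# Hypothetical sketch only: scoring candidate missing features for acquisition.
# 'impute' and 'predict_target' are placeholder callables standing in for the
# trained model's imputation and prediction routines; they are not defined by the disclosure.
import torch

def acquisition_scores(x_obs, observed_mask, impute, predict_target, n_samples=20):
    scores = {}
    base = predict_target(x_obs, observed_mask)            # prediction from current observations
    for d in (~observed_mask).nonzero(as_tuple=True)[0].tolist():
        preds = []
        for _ in range(n_samples):                         # sample plausible values of feature d
            x_sample = impute(x_obs, observed_mask, feature=d)
            mask_d = observed_mask.clone()
            mask_d[d] = True
            preds.append(predict_target(x_sample, mask_d))
        spread = (torch.stack(preds) - base).pow(2).mean() # how much the prediction would move
        scores[d] = spread.item()                          # higher spread ~ more new information
    return scores                                          # observe the highest-scoring feature next
```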
- FIG. 9 is a flow chart giving an overview of a method in accordance with the presently disclosed approach.
- the first training stage is performed, in which each of the individual VAEs is trained separately for a respective one of the feature subsets in order to learn to encode that feature subset in a manner that is disentangled from the other feature subsets, i.e. to model the marginal properties of each feature subset.
- the second training stage is performed, in which the second, common VAE is trained to learn or model the dependencies between feature subsets.
- the trained model 208’ may be used for a practical purpose such as to make a prediction or perform an imputation.
- the reward function may be used to determine which missing feature to observe next. In some cases the method may then comprise observing this missing feature and cycling back to the first training stage to retrain the model 208’ including the new observation of the previously missing feature.
- the proposed method fits the data in a two-stage procedure.
- in the first stage, for each variable we fit a low-dimensional VAE independently to each marginal data distribution. We call these marginal VAEs.
- in the second stage, in order to capture the inter-variable dependencies, a new multi-dimensional VAE, called the dependency network, is built on top of the latent representations provided by the encoders in the first stage.
- we use D to denote the dimension of the observations and N the number of data points x; thus x_nd is the d-th dimension of the n-th point.
- Stage one: training individual marginal VAEs for each single variable.
- D individual VAEs p_{θ_d}(x_{nd}), d ∈ {1, 2, ..., D}, are trained independently, each one being trained to fit a single dimension x_d from the dataset:
- $\max_{\theta_d, \phi_d} \sum_n \mathbb{E}_{q_{\phi_d}(z_{nd} \mid x_{nd})}\!\left[\log p_{\theta_d}(x_{nd} \mid z_{nd})\right] - \mathrm{KL}\!\left(q_{\phi_d}(z_{nd} \mid x_{nd}) \,\|\, p(z_{nd})\right)$
- VampPrior uses a mixture of Gaussians (MoGs) as the prior distribution for the high-level latent variable, with the mixture components given by the approximate posterior evaluated at a set of pseudo-inputs, which are a subset of points.
- normalization is considered to be an essential preprocessing step. For example, it is common to first normalize the data to have zero mean and unit standard deviation. However, for mixed-type data, no standard normalization method can be applied. With our VAEM, each marginal VAE is trained independently to model the heterogeneous properties of each data dimension, thus transforming the mixed type data x_d to a continuous representation z_d. The collection of z_d forms the aggregated posterior, which is close to a standard normal distribution thanks to the regularization effect from the prior p(z). In this way, we overcome the heterogeneous mixed-type problem and the dependency VAE can focus on learning the relationships among variables.
- the target of interest x_φ is often the target that we try to predict.
- the discriminator gives a probabilistic prediction of the target x_φ based on both the observed variables x_o, the imputed variables x_u, and the global latent representation h (the last one being optional).
- the discriminator in Eq. 16 offers additional predictive power for the target x_φ of interest.
- a further baseline is the Heterogeneous-Incomplete VAE (HI-VAE).
- VAE: a vanilla VAE equipped with VampPrior.
- the number of latent dimensions is the same as that in the second stage of VAEM. We denote this by VAE.
- VAE with extended latent dimensions: note that the total latent dimension of VAEM is D + L, where D and L are the dimensionalities of the data instance and h respectively. To be fair, in this baseline we extend the latent dimension of the vanilla VAE baseline to D + L. We denote this baseline by VAE-extended.
- VAE with automatically balanced likelihoods.
- This baseline tries to automatically balance the scale of the log-likelihood values of different variable types in the ELBO, by adaptively multiplying a constant before the likelihood terms. We denote this baseline by VAE-balanced.
- Bank marketing dataset (45211 instances, 11 continuous, 8 categorical, 2 discrete); and avocado sales prediction dataset (18249 instances, 9 continuous, 5 categorical)
- a further dataset used is MIMIC III (Medical Information Mart for Intensive Care).
- Figure 7 (a) shows the ground truth for each variable.
- Figure 7 (b) shows the values calculated using VAEM.
- Figure 7 (c) shows the values calculated by vanilla VAE.
- Figure 7 (d) shows the values calculated using VAE-extended.
- Figure 7 (e) shows the values calculated using VAE-balanced.
- Figure 7 (f) shows the values calculated using VAE-HI.
- the vanilla VAE is able to generate the second categorical variable.
- the third variable of the dataset ( Figure 7 (a)), which corresponds to the “duration” feature of the dataset, is a very important variable that has a heavy tail.
- the vanilla VAE ( Figure 7 (c)) fails to mimic this heavy tail behaviour of the variable.
- although the VAE-balanced model and VAE-HI (Figure 7 (e), (f)) can capture part of this heavy-tail behaviour, they fail to model the second categorical variable well.
- Our VAEM model ( Figure 7 (b)) is able to generate accurate marginals and joint distributions for both categorical and heavy-tailed continuous distribution.
- generation quality may also be quantified in terms of marginal negative log-likelihood (NLL).
- An important aspect of generative models is the ability to perform conditional data generation, that is, given a data instance, to infer the posterior distribution over the unobserved variables x_u given the observed variables x_o. For all baselines evaluated in this task, we train the partial version of them (i.e., generative + partial inference net). To train the partial models, we randomly sample 90% of the dataset to be the training set, and remove a random portion (uniformly sampled between 0% and 99%) of observations each epoch during training. Then, we remove 50% of the test set and use the generative models to make inference regarding the unobserved data.
- a further task evaluated is sequential active information acquisition (SAIA).
- VAE-no-disc a VAE without the discriminator structure. This baseline shows the importance of the extension described above in prediction tasks.
- Other settings are the same as for the VAE baseline. All experiments are repeated ten times.
- Figure 8 shows the test RMSE on x_φ for each variable-selection step on all five datasets, where x_φ is the target variable of each dataset.
- the y-axis shows the error of the prediction and the x-axis shows the number of features acquired for making the prediction.
- AUIC area under the information curves
- VAEM performs consistently better than the other baselines. Note that on the Bank marketing and Avocado sales datasets, where many heterogeneous variables are involved, the other baselines almost fail to reduce the test RMSE quickly, whereas VAEM outperforms them by a large margin. These experiments show that VAEM is able to acquire information efficiently on mixed-type datasets.
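- The AUIC summary referred to above can be computed, for one acquisition run, by integrating the per-step test RMSE over acquisition steps; trapezoidal integration is assumed here for illustration (lower is better):

```python
import numpy as np

def auic(rmse_per_step):
    """Area under the information curve for one acquisition run."""
    rmse_per_step = np.asarray(rmse_per_step, dtype=float)
    return np.trapz(rmse_per_step, dx=1.0)

# e.g. auic([0.92, 0.71, 0.58, 0.55]) summarises one curve (made-up RMSE values)
```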
- a method comprising: in a first stage, training each of a plurality of individual first variational auto encoders, VAEs, each comprising an individual respective first encoder arranged to encode a respective subset of one or more features of a feature space into an individual respective first latent representation having one or more dimensions, and an individual respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different types of data; and in a second stage following the first stage, training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into decoded versions of the first latent representations, wherein each respective one of the plurality of inputs comprises a combination of a different respective one of the feature subsets with the respective first latent representation.
- each dimension of their latent representation is modelled as a probabilistic distribution.
- the decoded versions of the features as output by the decoders may also be modelled as distributions, or may be simple scalars.
- the weights of the nodes in the neural networks may also be modelled as distributions or scalars.
- Each of the encoders and decoders may comprise one or more neural networks.
- the training of each VAE may comprise comparing the features as output by the decoder with the features as input to the encoder, and tuning parameters of the nodes of the neural networks in the VAE to reduce the difference therebetween.
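- As an illustrative sketch of the second-stage arrangement and training described above (assuming a PyTorch setting; the class name, layer sizes and the Gaussian likelihood on the first-stage latents are assumptions), the dependency VAE takes each feature subset together with its first-stage latent code, encodes them into the multi-dimensional latent h, and its decoder reconstructs the first-stage latents, with the loss comparing the decoder output to the encoder input:

```python
import torch
import torch.nn as nn

class DependencyVAE(nn.Module):
    """Second stage: models dependencies among the first-stage latents z_1..z_D."""
    def __init__(self, n_dims, latent_dim, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(                 # input: all [x_d, z_d] pairs, concatenated
            nn.Linear(2 * n_dims, hidden), nn.ReLU(), nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(                 # output: reconstructed z_1..z_D
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_dims))

    def forward(self, x, z):                      # x, z: (batch, n_dims)
        mu, log_var = self.enc(torch.cat([x, z], dim=-1)).chunk(2, dim=-1)
        h = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterisation
        z_hat = self.dec(h)
        recon = (z_hat - z).pow(2).sum(-1)        # compare decoded latents with encoder inputs
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)
        return (recon + kl).mean()
```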
- each of said subsets is a single feature.
- each of said subsets may comprise more than one feature.
- the respective features within each subset may be of the same data type as one another, but of a different respective data type relative to the features of the other subsets.
- each of the first latent representations is a single respective one dimensional latent variable.
- each latent variable is nonetheless still modelled as a distribution.
- the different data types may comprise two or more of: categorical, ordinal, and continuous.
- the different data types may comprise: binary categorical, and categorical with more than two categories.
- the features may comprise one or more sensor readings from one or more sensors sensing a material or machine.
- the features may comprise one or more sensor readings and/or questionnaire responses from a user relating to the user's health.
- a third decoder may be trained to generate a categorization from the second latent representation.
- the second encoder may comprise a respective individual second encoder arranged to encode each of a plurality of the feature subsets and/or first latent representations, a permutation invariant operator arranged to combine encoded outputs of the individual second encoders into a fixed size output, and a further encoder arranged to encode the fixed size output into the second latent representation.
- said combination may be a concatenation.
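- A sketch of such a set-style second encoder (a single shared embedding network is used here for brevity, whereas the aspect above allows an individual second encoder per feature subset; summation is used as the permutation-invariant operator, and all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Permutation-invariant second encoder: embed each (feature, first-latent) pair,
    sum the embeddings into a fixed-size vector, then map to the parameters of h."""
    def __init__(self, embed_dim=32, latent_dim=10):
        super().__init__()
        self.element_enc = nn.Sequential(          # shared across feature subsets
            nn.Linear(2, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.post_enc = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 2 * latent_dim))

    def forward(self, x, z, mask):                 # all shapes: (batch, n_dims)
        pairs = torch.stack([x, z], dim=-1)        # (batch, n_dims, 2) per-feature input
        emb = self.element_enc(pairs) * mask.unsqueeze(-1)  # zero out unobserved features
        pooled = emb.sum(dim=1)                    # permutation-invariant aggregation
        mu, log_var = self.post_enc(pooled).chunk(2, dim=-1)
        return mu, log_var
```

- Because the sum is invariant to the ordering of the feature subsets and unobserved entries are masked out, the same encoder also serves as a partial inference network when only some features are observed.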
- aspects disclosed herein also provide a method of using the second VAE, after having been trained as hereinabove mentioned in any of the aspects or embodiments, to perform a prediction or imputation.
- the method may use the second VAE to predict or impute a condition of the material or machine.
- the method may use the second VAE to predict or impute a health condition of the user.
- the method may use the third decoder together with the second encoder, after having been trained, to predict the categorization of a subsequently observed feature vector of said feature space.
- the method may use the second VAE, after having been trained, to impute a value of one or more missing features in a subsequently observed feature vector of said feature space, by:
- the method may use the second encoder after having been trained, to impute one or more unobserved features by:
- Another aspect provides a computer program embodied on computer-readable storage and configured so as when run on one or more processing units to perform the method of any of the aspects or embodiments hereinabove defined.
- Another aspect provides a computer system comprising: memory comprising one or more memory units, and processing apparatus comprising one or more processing units; wherein the memory stores code arranged to run on the processing apparatus, the code being configured so as when run on the processing apparatus to carry out the method of any of the aspects or embodiments hereinabove defined.
- the computer system may be implemented as a server comprising one or more server units at one or more geographic sites, the server arranged to perform one or both of:
- the network for the purpose of one or both of these services may be a wide area internetwork such as the Internet.
- said gathering may comprise gathering some or all of the observations from a plurality of different users through different respective user devices.
- said gathering may comprise gathering some or all of the observations from a plurality of different sensor devices, e.g. IoT devices or industrial measurement devices.
- Another aspect provides use of a variational encoder which has been trained by, in a first stage, training each of a plurality of individual first variational auto encoders,
- VAEs each comprising an individual respective first encoder arranged to encode a respective subset of one or more features of a feature space into an individual respective first latent representation having one or more dimensions, and an individual respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different types of data; and in a second stage following the first stage, training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into decoded versions of the first latent representations, wherein each respective one of the plurality of inputs comprises a combination of a different respective one of the feature subsets with the respective first latent representation, the use being of the second variational encoder.
- the trained model may be employed to predict the state of a condition of a user, such as a disease or other health condition.
- the model may receive, as input data, the answers to questions presented to a user about their health status.
- a user interface may be provided to enable questions to be output to a user and to receive responses from a user for example through a voice or other interface means.
- the user interface may comprise a chatbot.
- the user interface may comprise a graphical user interface (GUI) such as a point and click user interface or a touch screen user interface.
- GUI graphical user interface
- the trained algorithm may be configured to generate an overall score from the user responses, which provide his or her health data, to predict a condition of the user from that data.
- the model can be used to predict the onset of a certain condition of the user, for example, a health condition such as asthma, depression or heart disease.
- a user’s condition may be monitored by asking questions which are repeated instances of the same question (asking the same thing, i.e. the same question content), and/or different questions (asking different things, i.e. different question content).
- the questions may relate to a condition of the user in order to monitor that condition.
- the condition may be a health condition such as asthma, depression, fitness etc.
- the monitoring could be for the purpose of making a prediction on a future state of the user’s condition, e.g. to predict the onset of a problem with the user’s health, or for the purpose of information for the user, a health practitioner or a clinical trial etc.
- User data may also be provided from sensor devices, e.g. a wearable or portable sensor device worn or carried about the user’s person.
- a device could take the form of an inhaler or spirometer with embedded communication interface for connecting to a controller and supplying data to the controller.
- Data from the sensor may be input to the model and form part of the patient data for using the model to make predictions.
- Contextual metadata may also be provided for training and using the algorithm.
- Such metadata could comprise a user’s location.
- a user’s location could be monitored by a portable or wearable device disposed about the user’s person (plus any one or more of a variety of known localisation techniques, such as triangulation, trilateration, multilateration or fingerprinting relative to a network of known nodes, such as WLAN access points, cellular base stations, satellites or anchor nodes of a dedicated positioning network such as an indoor location network).
- Other contextual information such as sleep quality may be inferred from personal device data, for example by using a wearable sleep monitor.
- sensor data from e.g. a camera, localisation system, motion sensor and/or heart rate monitor can be used as metadata.
- the model may be trained to recognise a particular disease or health outcome. For example, a particular health condition such as a certain type of cancer or diabetes may be used to train the model using existing feature sets from patients. Once a model has been trained, it can be utilised to provide a diagnosis of that particular disease when patient data is provided from a new patient.
- the model may make other health related predictions, such as predictions of mortality once it has been trained on a suitable set of patient training data with known mortality outcomes.
- Another example use of the model is to determine geological conditions, for example when drilling, to establish the likelihood of encountering oil or gas.
- Different sensors may be utilised on a tool at a particular geographic location.
- the sensors could comprise for example radar, lidar and location sensors.
- Other sensors such as thermometers or vibration sensors may also be utilised.
- Data from the sensors may be in different data categories and therefore constitute mixed data.
- Once the model has been effectively trained on this mixed data, it may be applied in an unknown context by taking sensor readings from equivalent sensors in that unknown context, and used to generate a prediction of geological conditions.
- a possible further application is to determine the status of a self-driving car.
- data may be generated from sensors such as radar sensors, lidar sensors and location sensors on a car, and used as a feature set to train the model for certain conditions that the car may be in.
- a corresponding mixed data set may be provided to the model to predict certain car conditions.
- a further possible application of the trained model is in machine diagnosis and management in an industrial context. For example, readings from different machine sensors including, without limitation, temperature sensors, vibration sensors, accelerometers and fluid pressure sensors may be used to train the model for certain breakdown conditions of a machine. Once the model has been trained, it can be utilised to predict what may have caused a machine breakdown, given data provided from corresponding sensors on that machine.
- a further application is in the context of predicting heat load and cooling load for different buildings. Attributes of a building may be provided to the model for training purposes, these attributes including for example surface area, wall area, roof area, height, orientation etc. Such attributes may be of a mixed data type. As an example, orientation may be a categorical data type and area may be a continuous data type.
- the model can be used to predict the heating load or cooling load of a particular building once corresponding data has been supplied to it for a new building.
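- Purely for illustration (the field names and values below are invented, not taken from any dataset), such a mixed-type building record might look as follows, with the categorical orientation and the continuous geometric attributes each handled by their own marginal VAE in the first stage:

```python
# One hypothetical building record with mixed data types.
building = {
    "orientation": "north",   # categorical feature
    "surface_area": 515.0,    # continuous feature (m^2)
    "wall_area": 294.0,       # continuous feature (m^2)
    "roof_area": 110.0,       # continuous feature (m^2)
    "height": 7.0,            # continuous feature (m)
}
```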
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2006809.4A GB202006809D0 (en) | 2020-05-07 | 2020-05-07 | Variational auto encoder for mixed data types |
| US16/996,348 US20210358577A1 (en) | 2020-05-07 | 2020-08-18 | Variational auto encoder for mixed data types |
| PCT/US2021/026502 WO2021225741A1 (en) | 2020-05-07 | 2021-04-09 | Variational auto encoder for mixed data types |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| EP4147173A1 (en) | 2023-03-15 |
Family
ID=75660419
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| EP21721364.4A Pending EP4147173A1 (en) | 2020-05-07 | 2021-04-09 | Variational auto encoder for mixed data types |
Country Status (3)
| Country | Link |
|---|---|
| EP (1) | EP4147173A1 (en) |
| CN (1) | CN115516460A (en) |
| WO (1) | WO2021225741A1 (en) |
Families Citing this family (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12272341B2 (en) * | 2021-11-08 | 2025-04-08 | Lemon Inc. | Controllable music generation |
| US20250349219A1 (en) * | 2022-11-16 | 2025-11-13 | Eagle Technology, Llc | Anomaly detection device for an aircraft and related methods |
| CN116434005A (en) * | 2023-03-29 | 2023-07-14 | 深圳智现未来工业软件有限公司 | Wafer defect data enhancement method and device |
| GR1010739B (en) * | 2023-06-09 | 2024-08-27 | KOP ΚΑΙΝΟΤΟΜΙΑ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑ ΟΕ με δ.τ. CORE INNOVATION, | Power transormer's predictive maintenance system using a variational autoencoder |
| CN119442967A (en) * | 2024-10-31 | 2025-02-14 | 硒钼科技(北京)有限公司 | A method, system and device for predicting and reducing noise of fluid simulation data |
2021
- 2021-04-09 WO PCT/US2021/026502 patent/WO2021225741A1/en not_active Ceased
- 2021-04-09 CN CN202180033226.XA patent/CN115516460A/en active Pending
- 2021-04-09 EP EP21721364.4A patent/EP4147173A1/en active Pending
Non-Patent Citations (3)
| Title |
|---|
| JOÃO GAMA ET AL: "Cascade Generalization", MACHINE LEARNING, 1 December 2000 (2000-12-01), Boston, pages 315 - 343, XP055280680, Retrieved from the Internet <URL:http://rd.springer.com/content/pdf/10.1023/A:1007652114878.pdf> DOI: 10.1023/A:1007652114878 * |
| See also references of WO2021225741A1 * |
| SIMIDJIEVSKI NIKOLA ET AL: "Variational Autoencoders for Cancer Data Integration: Design Principles and Computational Practice", FRONTIERS IN GENETICS, vol. 10, 11 December 2019 (2019-12-11), Switzerland, XP093332285, ISSN: 1664-8021, DOI: 10.3389/fgene.2019.01205 * |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2021225741A1 (en) | 2021-11-11 |
| CN115516460A (en) | 2022-12-23 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20210358577A1 (en) | Variational auto encoder for mixed data types | |
| US20230394368A1 (en) | Collecting observations for machine learning | |
| US11257579B2 (en) | Systems and methods for managing autoimmune conditions, disorders and diseases | |
| EP4147173A1 (en) | Variational auto encoder for mixed data types | |
| US12165056B2 (en) | Auxiliary model for predicting new model parameters | |
| US11423538B2 (en) | Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers | |
| US20210406765A1 (en) | Partially-observed sequential variational auto encoder | |
| US12265939B2 (en) | Systems and methods for generation and traversal of a skill representation graph using machine learning | |
| AU2020260078A1 (en) | Computer-implemented machine learning for detection and statistical analysis of errors by healthcare providers | |
| CN117877763B (en) | Nursing communication system and method based on smart wristband | |
| US20200373017A1 (en) | System and method for intelligence crowdsourcing reinforcement learning for clinical pathway optimization | |
| US20250036947A1 (en) | Auxiliary model for predicting new model parameters | |
| CN113570497A (en) | Image processing method, image processing device, computer equipment and storage medium | |
| CN115879564A (en) | Adaptive Aggregation for Federated Learning | |
| CN116825331A (en) | Microwave physiotherapy self-learning self-diagnosis system and device based on outpatient big data | |
| CN113656589A (en) | Object attribute determination method and device, computer equipment and storage medium | |
| US20250149133A1 (en) | Adversarial imitation learning engine for kpi optimization | |
| US20240197287A1 (en) | Artificial Intelligence System for Determining Drug Use through Medical Imaging | |
| Abdullah et al. | Information gain-based enhanced classification techniques | |
| Sravani et al. | An Innovative Deep Learning Framework for Healthcare Cost Prediction | |
| US12369883B2 (en) | Artificial intelligence system for determining clinical values through medical imaging | |
| Lipnick | The Optimal Transport Barycenter Problem for Data Analysis: Prediction Methods and a Framework for Factor Discovery | |
| Sahoo et al. | Network Inference Using Deep Reinforcement Learning for Early Disease Detection | |
| Zhang | Reinforcement Learning and Relational Learning with Applicationsin Mobile-health and Knowledge Graph | |
| Varun et al. | Predictive Analytics for Heart Disease Using Machine Learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20221024 |
| | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |
| | 17Q | First examination report despatched | Effective date: 20251112 |